Multi-View CNNs for Industrial Object Classification: From Synthetic Dataset Design to Transfer Learning and Fusion Strategies
FRIGO, GIANMARIA
2024/2025
Abstract
Industrial assembly lines increasingly rely on automated vision systems to sort thousands of visually similar components. However, collecting large, labeled, multi-view image sets for every part is impractical, since parts often arrive directly from suppliers without prior access for training. This thesis investigates a synthetic-to-real pipeline for multi-view convolutional neural networks (MVCNNs), in which classifiers are trained on CAD models and then applied to real images captured in a five-camera imaging box. We developed a Blender-based renderer that generates a synthetic dataset of 80 parts with randomized poses, materials, lighting, and optics, providing diverse five-view samples for training. Using this dataset, we evaluate transfer learning with ImageNet-pretrained backbones, freezing strategies, fusion mechanisms, weight sharing, and several backbone families. Freezing the first three ResNet-50 stages matches the accuracy of full fine-tuning while improving training stability. Among the fusion mechanisms, score-sum and deep early fusion transfer most reliably to real data. Full weight sharing across view branches improves robustness while reducing the parameter count. A backbone comparison shows that compact modern CNNs, such as ConvNeXt-Small, generalize best. Overall, the results demonstrate that synthetic training combined with judicious transfer learning, deep fusion, and full weight sharing yields near-perfect real-world accuracy with a modest model footprint. However, the study is limited to 80 synthetic classes and a real evaluation set containing a single physical object, whereas the target application involves up to 10,000 categories. The findings should therefore be regarded as preliminary: they establish a baseline and outline a scalable route toward industrial deployment with reduced data-collection overhead.
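To make the rendering pipeline concrete, the following is a minimal sketch of per-sample domain randomization in Blender's `bpy` API, covering the four randomized factors the abstract names (pose, material, lighting, optics). It assumes a scene that already contains five cameras named `cam_0`–`cam_4`, a light named `key_light`, and the imported part with a node-based material; these names, the parameter ranges, and the structure are illustrative assumptions, not the thesis renderer.

```python
# Illustrative sketch of domain randomization for five-view rendering in Blender.
# Assumes: cameras "cam_0".."cam_4", a light "key_light", and a part object
# whose active material uses a Principled BSDF node. All names/ranges are
# hypothetical, not taken from the thesis code.
import math
import random
import bpy

def randomize_and_render(part, out_dir, sample_idx):
    # Random pose: uniform Euler rotation of the part.
    part.rotation_euler = [random.uniform(0.0, math.tau) for _ in range(3)]

    # Random material: vary base color, metallic, and roughness.
    bsdf = part.active_material.node_tree.nodes["Principled BSDF"]
    bsdf.inputs["Base Color"].default_value = (
        random.uniform(0.2, 0.8), random.uniform(0.2, 0.8),
        random.uniform(0.2, 0.8), 1.0)
    bsdf.inputs["Metallic"].default_value = random.uniform(0.0, 1.0)
    bsdf.inputs["Roughness"].default_value = random.uniform(0.1, 0.9)

    # Random lighting: vary the key light's intensity (watts).
    bpy.data.objects["key_light"].data.energy = random.uniform(100, 1000)

    # Render the same randomized state from all five cameras.
    for i in range(5):
        cam = bpy.data.objects[f"cam_{i}"]
        cam.data.lens = random.uniform(35, 55)  # randomized optics: focal length (mm)
        bpy.context.scene.camera = cam
        bpy.context.scene.render.filepath = f"{out_dir}/sample{sample_idx}_view{i}.png"
        bpy.ops.render.render(write_still=True)
```

Rendering all five views of one randomized state, rather than re-randomizing per view, mirrors how a physical five-camera imaging box captures a single part presentation.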
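Likewise, a minimal PyTorch sketch can illustrate the architectural choices the abstract reports as best: an ImageNet-pretrained ResNet-50 with its first three stages frozen, full weight sharing across the five view branches, and score-sum fusion of the per-view class scores. The class name `MultiViewNet` and all details below are assumptions for illustration, not the thesis implementation.

```python
# Minimal sketch of a five-view MVCNN with full weight sharing and score-sum
# fusion. Names and details are illustrative, not the thesis code.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 80  # matches the 80-part synthetic dataset

class MultiViewNet(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        # One ImageNet-pretrained ResNet-50 shared by all five view branches.
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)
        # Freeze the stem and the first three residual stages (layer1-layer3),
        # leaving layer4 and the classifier head trainable.
        for name, param in self.backbone.named_parameters():
            if not (name.startswith("layer4") or name.startswith("fc")):
                param.requires_grad = False

    def forward(self, views):  # views: (batch, 5, 3, H, W)
        b, v, c, h, w = views.shape
        # Full weight sharing: fold the views into the batch dimension and
        # run a single backbone pass over all of them.
        logits = self.backbone(views.reshape(b * v, c, h, w))
        # Score-sum fusion: add the per-view class scores.
        return logits.reshape(b, v, -1).sum(dim=1)

model = MultiViewNet()
out = model(torch.randn(2, 5, 3, 224, 224))  # -> shape (2, 80)
```

Folding views into the batch is what makes full weight sharing add no parameters over a single-view model; a deep early-fusion variant would instead merge view features before the classifier head rather than summing per-view scores.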
| File | Size | Format |
|---|---|---|
| Frigo_Gianmaria.pdf (open access) | 4.81 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/92194