Road Scene Understanding using Depth Data from Stereo Vision

RIGHETTO, LEONARDO
2021/2022

Abstract

Multimodal Semantic Segmentation aims to obtain an accurate pixel-level understanding of a scene by jointly exploiting multiple information sources, e.g., standard images together with depth maps or 3D data. This thesis tackles the problem via a multi-step process: first, we search for the best candidate estimated depths (EDs) through stereo vision and feed them to a pre-trained neural network to compute segmentation maps. Then, a neural network is trained from scratch using EDs instead of ground-truth depths, and the results are further improved by denoising or filtering the images. The first step (stereo vision) is performed with well-established methods implemented using the OpenCV library. The second step (semantic segmentation) is based on a deep learning framework of the encoder-decoder type: a common choice for the encoder is ResNet-101 and for the decoder DeepLabv2, although other architectures can be considered when higher performance or faster computation is required. The overall work has been evaluated in a virtual environment (a synthetic dataset), ensuring the availability of a large number of images for training, validation, and testing. The thesis then analyses Unsupervised Domain Adaptation, with the final purpose of addressing the domain shift between synthetic and real-world data. In conclusion, the combination of visual and 3D information yields promising results, as shown by the analysis performed in this thesis: the neural network achieves satisfying segmentation accuracy even in its simplest implementation (e.g., without Domain Adaptation techniques).
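The stereo step described above relies on well-established matchers from OpenCV. As a minimal illustration of the underlying principle only, the sketch below implements naive SAD block matching on a rectified grayscale pair and the standard pinhole disparity-to-depth conversion; the focal length and baseline values are illustrative assumptions, not parameters from the thesis.

```python
import numpy as np

def block_match_disparity(left, right, max_disp=16, block=5):
    """Naive SAD block matching on a rectified grayscale pair (float arrays).

    For each left-image pixel, the disparity is the horizontal shift d that
    minimises the sum of absolute differences between local patches.
    """
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [
                np.abs(patch - right[y - half:y + half + 1,
                                     x - d - half:x - d + half + 1]).sum()
                for d in range(max_disp)
            ]
            disp[y, x] = float(np.argmin(costs))
    return disp

def disparity_to_depth(disp, focal_px=720.0, baseline_m=0.54):
    """Pinhole-stereo relation: depth = f * B / d (infinite where d == 0)."""
    with np.errstate(divide="ignore"):
        depth = focal_px * baseline_m / disp
    depth[disp == 0] = np.inf
    return depth
```

In practice the thesis pipeline would use OpenCV's optimised matchers (e.g., semi-global matching) rather than this brute-force search, but the disparity-to-depth relation is the same.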
Stereo vision
Depth map
Deep learning
Image understanding
Autonomous driving
Files in this item:
Righetto_Leonardo.pdf (Adobe PDF, 6.59 MB, open access)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/33209