Multi-camera Multiple Human Parsing: Instance-level Human Body Parts Segmentation from Multi-view Images
BRAGAGNOLO, LAURA
2022/2023
Abstract
Multi-human parsing is an important and challenging task in vision-based human understanding, as it combines human body-part segmentation with human instance segmentation. Although recent deep-learning-based techniques achieve notable results on multi-human parsing datasets, several challenges remain unresolved. One of these is accurately segmenting human bodies in images in which people are very close to each other or overlap. In such cases, multi-human parsing techniques struggle to properly segment human instances and to associate detected body parts with the correct person. This is confirmed by an in-depth analysis, provided in this thesis, of current state-of-the-art networks for multi-human parsing, which highlights significant issues in the presence of severe occlusions between people in the image. To address this problem, this thesis proposes to exploit multi-view information, based on the intuition that people occluded in an image taken from a particular point of view can often be easily separated when framed from a different angle. Motivated by the absence of a suitable multi-view dataset in the literature, this work exploits the human instance segmentation task to improve multi-human parsing under strong occlusions. A novel learning framework is introduced that uses human instance segmentation as auxiliary information to guide the multi-human parsing task. Network training is driven by two kinds of loss functions: single-view human segmentation losses, which improve the discrimination of foreground human instances, and a multi-view consistency term, which enforces coherent instance and body-part predictions across multiple views of the same scene. The multi-view loss term exploits 3D knowledge to separate overlapping bodies and to provide sparse supervision to human parsing.
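The multi-view consistency idea described above can be illustrated with a minimal sketch. The thesis does not spell out the exact loss formulation here, so the function below is only a generic illustration under stated assumptions: two views produce per-pixel class-probability maps, and 3D knowledge (e.g. shared 3D points projected into both cameras) yields pixel correspondences between the views; the loss then penalizes disagreement at corresponding pixels. All names (`probs_a`, `probs_b`, `corr`) are hypothetical.

```python
import numpy as np

def cross_view_consistency(probs_a, probs_b, corr):
    """Mean L1 discrepancy between class probabilities at corresponding pixels.

    probs_a, probs_b: (H, W, C) per-pixel class-probability maps from two
                      views of the same scene (e.g. softmax outputs).
    corr: list of ((ya, xa), (yb, xb)) pixel pairs related by 3D geometry,
          e.g. obtained by projecting shared 3D points into both views.
    """
    diffs = [np.abs(probs_a[ya, xa] - probs_b[yb, xb]).sum()
             for (ya, xa), (yb, xb) in corr]
    # Sparse supervision: only the corresponding pixels contribute.
    return float(np.mean(diffs)) if diffs else 0.0
```

Because only a sparse set of corresponding pixels is constrained, such a term provides sparse supervision, matching the description above.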
To validate the approach, a human instance annotation strategy is used to derive human segmentation annotations from multi-view RGB-D data and 3D human skeletons. In the experimental validation, this dataset was used to fine-tune the state-of-the-art AIParsing network by leveraging its instance-level annotations and multi-view data. The resulting model was then evaluated on a subset of images from the CIHP dataset with significant overlaps between people, demonstrating the effectiveness of the proposed approach with an improvement of up to 4.25% in body part-aware mean Intersection-over-Union with respect to the original AIParsing network.
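The evaluation metric mentioned above is a variant of mean Intersection-over-Union. As a point of reference, a minimal sketch of plain per-class mean IoU over label maps is given below; the body part-aware variant used in the thesis additionally accounts for instance assignment, which is not reproduced here.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class Intersection-over-Union, averaged over classes that
    appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        g = gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skip, do not count as 0
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

For example, with a 2x2 label map where one of four pixels is mislabeled, two of the three classes score an IoU of 0.5 and the third scores 1.0, giving a mean IoU of 2/3.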
https://hdl.handle.net/20.500.12608/50722