Human motion prediction aims to forecast future body poses from observed motion sequences and plays a crucial role in fields such as human–robot collaboration, assistive robotics, and autonomous driving. These motion sequences can be obtained in different ways. Most existing models are trained and evaluated on clean, marker-based motion capture data, which offer ideal conditions but require sophisticated, tightly controlled setups to acquire. For this reason, vision-based marker-less pose extraction systems are more common in real-world scenarios. These systems are easier to use but more prone to errors caused by external factors such as occlusions or lighting, which makes them a source of noise in the input motion sequences. The robustness of human motion prediction models in realistic, noisy scenarios remains largely unexplored. This thesis presents a comparative study of state-of-the-art deterministic and probabilistic motion prediction models applied to human pose sequences extracted from videos. The Human3.6M dataset was used as the foundation, as it provides both high-quality marker-based 3D poses and the corresponding video sequences. A dedicated extraction pipeline was designed to obtain marker-less poses from the videos, creating a parallel version of the dataset for evaluation. Each model was then tested under identical conditions with both marker-based and marker-less inputs. The analysis combines a quantitative evaluation based on the standard error metric, Mean Per Joint Position Error (MPJPE), with a qualitative inspection of the predicted motion sequences, allowing a detailed assessment of how input noise affects prediction accuracy and stability across different model architectures and categories. Results show that marker-less data cause a substantial performance drop across all models, with MPJPE values about 2.6 times higher on average, and that the drop is most pronounced for short-term predictions, where extraction noise dominates the error.
These findings highlight the crucial impact of input quality on motion prediction and suggest that future progress should address both sides of the problem: improving the reliability of pose extraction methods and developing prediction models trained and optimized to handle the imperfections of vision-based input data.
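To make the reported metric concrete, the following is a minimal sketch of how MPJPE and the marker-less to marker-based error ratio could be computed. It is not the evaluation code of the thesis: the function names, the nested-list pose layout, and the choice to score both predictions against the same marker-based ground truth are assumptions made for illustration, and the numbers are toy values rather than results.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error of one predicted sequence.

    pred, target: lists of frames; each frame is a list of (x, y, z) joints.
    Returns the Euclidean joint distance averaged over all frames and joints,
    in the same unit as the input coordinates.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    return total / count

def degradation_ratio(pred_markerless, pred_markerbased, target):
    """Hypothetical helper: how many times larger the error becomes when the
    model is fed marker-less input instead of marker-based input.
    Assumes both predictions are scored against the same ground truth."""
    return mpjpe(pred_markerless, target) / mpjpe(pred_markerbased, target)

# Toy data (not thesis results): 2 frames, 2 joints, ground truth at the origin.
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)] for _ in range(2)]
# Prediction from marker-based input: every joint is off by 1 unit along x.
clean_pred = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)] for _ in range(2)]
# Prediction from marker-less input: every joint is off by (0, 3, 4), i.e. 5 units.
noisy_pred = [[(0.0, 3.0, 4.0), (0.0, 3.0, 4.0)] for _ in range(2)]

ratio = degradation_ratio(noisy_pred, clean_pred, target)
print(f"clean MPJPE={mpjpe(clean_pred, target)}, "
      f"noisy MPJPE={mpjpe(noisy_pred, target)}, ratio={ratio}")
```

In this toy case the ratio is 5.0. Averaging such ratios over models and prediction horizons is one plausible way to arrive at a summary figure like the 2.6 reported above, though the abstract does not state how that average was taken.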
A comparative study of human motion prediction models applied to marker-less motion capture data
FELLINE, ANDREA
2024/2025
| File | Access | Size | Format |
|---|---|---|---|
| Felline_Andrea.pdf | Open access | 4.12 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.
https://hdl.handle.net/20.500.12608/95450