Action progress prediction in videos
ZOPPELLARI, ELENA
2024/2025
Abstract
Action progress prediction, the task of estimating how far an ongoing action has advanced, is essential for enabling context-aware interactions in autonomous systems such as human-robot collaboration, autonomous driving, and surgical assistance. This thesis addresses prior criticisms, showing that the performance of the original ProgressNet model can be compromised by unsuitable datasets and inadequate preprocessing, which encourage reliance on frame counting rather than visual understanding. To mitigate this, we introduce a variable frame rate (VFR) preprocessing strategy and a segment-based training regime. These techniques promote reliance on visual semantics and achieve strong performance on procedure-driven datasets. On the Mobile ALOHA dataset, our approach yields a mean absolute error below 3%, further reduced to under 2% on a trimmed variant, the proposed Mobile ALOHA dataset CUT, in which static video segments are removed. We also investigate architectural components, enabling multi-view inputs and comparing sequential models and feature extractors. While LSTM, GRU, and Transformer models perform similarly, the choice of visual feature extractor proves critical. Notably, advanced extractors such as ResNet and ViT appear to learn implicit action phases, diverging from the linear supervision signal.
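The sketch below illustrates the three quantities the abstract refers to: linear progress targets, variable-frame-rate subsampling, and the mean absolute error reported on Mobile ALOHA. It is a minimal illustration only; the function names, the uniform sampling of frame rates, and the stride-based subsampling are assumptions for demonstration, not the thesis implementation.

```python
import numpy as np

# Hypothetical sketch: linear progress labels, VFR subsampling, and MAE.
# None of these names come from the thesis; they only illustrate the idea.

def linear_progress_targets(num_frames: int) -> np.ndarray:
    """Progress label for each frame of an action: values rise linearly
    from 0 (start) to 1 (completion), as in the linear supervision signal."""
    return np.linspace(0.0, 1.0, num_frames)

def vfr_subsample(frame_indices: np.ndarray, rng: np.random.Generator,
                  min_rate: float = 0.25, max_rate: float = 1.0) -> np.ndarray:
    """Resample a clip at a randomly drawn frame rate so that the elapsed
    frame count is no longer a reliable proxy for progress (one possible
    reading of the VFR preprocessing described above)."""
    rate = rng.uniform(min_rate, max_rate)
    step = max(1, int(round(1.0 / rate)))
    return frame_indices[::step]

def mean_absolute_error(pred: np.ndarray, target: np.ndarray) -> float:
    """MAE in progress units; multiplied by 100 it becomes the percentage
    figure quoted in the abstract (e.g. 'below 3%')."""
    return float(np.mean(np.abs(pred - target)))

# Example: a 300-frame clip subsampled at a random rate.
rng = np.random.default_rng(0)
kept = vfr_subsample(np.arange(300), rng)
targets = linear_progress_targets(300)[kept]
print(len(kept), mean_absolute_error(targets, targets))
```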
| File | Size | Format |
|---|---|---|
| Zoppellari_Elena.pdf (restricted access) | 21.21 MB | Adobe PDF |
https://hdl.handle.net/20.500.12608/87179