Action progress prediction in videos

ZOPPELLARI, ELENA
Academic year: 2024/2025

Abstract

Action progress prediction, the task of estimating how far an ongoing action has advanced, is essential for enabling context-aware interactions in autonomous systems such as human-robot collaboration, autonomous driving, and surgical assistance. This thesis addresses prior criticisms by showing that the performance of the original ProgressNet model can be compromised by unsuitable datasets and inadequate preprocessing, which encourage reliance on frame counting rather than visual understanding. To mitigate this, we introduce a variable frame rate (VFR) preprocessing strategy and a segment-based training regime. These techniques promote reliance on visual semantics and achieve strong performance on procedure-driven datasets. On the Mobile ALOHA dataset, our approach yields a mean absolute error below 3%, further reduced to under 2% on a trimmed variant, the proposed Mobile ALOHA dataset CUT, in which static video segments are removed. We also investigate architectural components, enabling multi-view inputs and comparing sequential models and feature extractors. While LSTM, GRU, and Transformer models perform similarly, the choice of visual feature extractor proves critical. Notably, advanced extractors such as ResNet and ViT appear to learn implicit action phases, diverging from the linear supervision signal.
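
To make the frame-counting criticism concrete, the minimal Python sketch below illustrates one plausible reading of the VFR idea and the reported metric: if each video is resampled at a randomly chosen temporal stride, a model that merely counts elapsed frames can no longer recover the linear progress target, and accuracy is measured as mean absolute error on a 0-100% progress scale. The function names, stride range, and counting baseline are illustrative assumptions, not the implementation described in the thesis.

# Illustrative sketch only: assumed VFR resampling and progress MAE (not the thesis code).
import numpy as np

def vfr_sample(num_frames: int, rng: np.random.Generator,
               min_stride: int = 1, max_stride: int = 5) -> np.ndarray:
    """Pick frame indices with a randomly chosen temporal stride, so the
    number of observed frames no longer encodes how far the action has advanced."""
    stride = int(rng.integers(min_stride, max_stride + 1))
    return np.arange(0, num_frames, stride)

def progress_targets(frame_indices: np.ndarray, num_frames: int) -> np.ndarray:
    """Linear supervision signal: progress at frame t is (t + 1) / N, in [0, 1]."""
    return (frame_indices + 1) / num_frames

def mae_percent(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and true progress, reported in percent."""
    return float(np.mean(np.abs(pred - target)) * 100)

# A frame-counting baseline assumes the original frame rate and fails under VFR.
rng = np.random.default_rng(0)
idx = vfr_sample(num_frames=300, rng=rng, min_stride=2)   # stride > 1 guarantees a rate mismatch
target = progress_targets(idx, num_frames=300)
counting_pred = (np.arange(len(idx)) + 1) / 300           # "progress = frames seen / expected length"
print(f"Frame-counting baseline MAE under VFR: {mae_percent(counting_pred, target):.1f}%")
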
Keywords: computer vision, activity prediction, action progress
Files in this item:
Zoppellari_Elena.pdf (restricted access), 21.21 MB, Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/87179