Action progress prediction in videos

ZOPPELLARI, ELENA
Academic year: 2024/2025

Abstract

Action progress prediction, the task of estimating how far an ongoing action has advanced, is essential for enabling context-aware interactions in autonomous systems such as human-robot collaboration, autonomous driving, and surgical assistance. This thesis addresses prior criticisms by showing that the performance of the original ProgressNet model can be compromised by unsuitable datasets and inadequate preprocessing, which encourage reliance on frame counting rather than visual understanding. To mitigate this, we introduce a variable frame rate (VFR) preprocessing strategy and a segment-based training regime. These techniques promote reliance on visual semantics and achieve strong performance on procedure-driven datasets. On the Mobile ALOHA dataset, our approach yields a mean absolute error below 3%, further reduced to under 2% on a trimmed variant, the proposed Mobile ALOHA dataset CUT, in which static video segments are removed. We also investigate architectural components, enabling multi-view inputs and comparing sequential models and feature extractors. While LSTM, GRU, and Transformer models perform similarly, the choice of visual feature extractor proves critical. Notably, advanced extractors such as ResNet and ViT appear to learn implicit action phases, diverging from the linear supervision signal.
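
To make the frame-counting criticism concrete, the minimal Python sketch below illustrates one plausible reading of the VFR idea and the reported metric: if each video is resampled at a randomly chosen temporal stride, a model that merely counts elapsed frames can no longer recover the linear progress target, and accuracy is measured as mean absolute error on a 0-100% progress scale. The function names, stride range, and counting baseline are illustrative assumptions, not the implementation described in the thesis.

# Illustrative sketch only: assumed VFR resampling and progress MAE (not the thesis code).
import numpy as np

def vfr_sample(num_frames: int, rng: np.random.Generator,
               min_stride: int = 1, max_stride: int = 5) -> np.ndarray:
    """Pick frame indices with a randomly chosen temporal stride, so the
    number of observed frames no longer encodes how far the action has advanced."""
    stride = int(rng.integers(min_stride, max_stride + 1))
    return np.arange(0, num_frames, stride)

def progress_targets(frame_indices: np.ndarray, num_frames: int) -> np.ndarray:
    """Linear supervision signal: progress at frame t is (t + 1) / N, in [0, 1]."""
    return (frame_indices + 1) / num_frames

def mae_percent(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and true progress, reported in percent."""
    return float(np.mean(np.abs(pred - target)) * 100)

# A frame-counting baseline assumes the original frame rate and fails under VFR.
rng = np.random.default_rng(0)
idx = vfr_sample(num_frames=300, rng=rng, min_stride=2)   # stride > 1 guarantees a rate mismatch
target = progress_targets(idx, num_frames=300)
counting_pred = (np.arange(len(idx)) + 1) / 300           # "progress = frames seen / expected length"
print(f"Frame-counting baseline MAE under VFR: {mae_percent(counting_pred, target):.1f}%")
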
Keywords: computer vision, activity prediction, action progress
Files in this item:
Zoppellari_Elena.pdf (restricted access), 21.21 MB, Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/87179