Human-Scene Interaction: Principles, Techniques, and Applications in Computer Vision
ALBERTI, GIORDANO
2024/2025
Abstract
This work surveys the state of the art in Human-Scene Interaction (HSI) by analyzing several systems, most of them built on deep learning models, as representative case studies. HSI has recently attracted considerable attention because of its many applications and because it lies at the intersection of deep learning, computer vision, and generative modeling, all of which are highly active fields. The systems analyzed fall into two broad categories: systems that understand and extract human-scene interactions from an image or video, and generative systems that synthesize interactions, generally from a textual prompt. After a brief introduction to human-scene interaction and a review of the key concepts needed to understand the approaches these systems use, the work analyzes both categories and their techniques in depth, highlighting what is distinctive about each approach and what the approaches have in common. HSI synthesis tends to converge on diffusion-based generative models, LLMs, and attention mechanisms, especially in the form of Transformer and Mamba architectures, whereas HSI analysis shows a wider variety of techniques and draws on multiple methods of scene representation. Because the field is relatively young, a major challenge is the lack of datasets large enough to train deep learning models. For this reason, some of the papers analyzed not only propose a methodology but also introduce entirely new datasets, created and curated specifically for training HSI models.
In conclusion, after a detailed analysis of the proposed approaches and their respective strengths, the strong momentum in research and development of increasingly capable and efficient systems for HSI analysis and synthesis suggests excellent prospects for progress in the near future.

| File | Size | Format |
|---|---|---|
| Alberti_Giordano.pdf (open access) | 4.16 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/89649