Visual Language Models: an in-depth exploration of ViLT

GONELLA, GIACOMO
2023/2024

Abstract

Human-AI interaction has grown rapidly in recent years, placing Natural Language Processing (NLP) and Computer Vision (CV) at the center of this shift. In NLP, the advent of Large Language Models (LLMs) has had worldwide impact, while CV remains a key component in the effort to automate many aspects of daily life. This thesis explores the Vision Language (VL) field, tracing its evolution from its inception and then narrowing the focus to specific aspects. The ViLT model, chosen for its simplicity and competitive results, serves as the starting point for tests and experiments. The primary aim is to investigate the relationship between words and images. The model is analyzed in depth and tested on the key downstream tasks characteristic of Visual Language Models (VLMs). The central focus of the research is an examination of the challenges affecting the model, with particular emphasis on the object counting task. The proposed solutions employ several techniques: leveraging datasets, modifying phrases to generate new instances, using a zero-shot segmenter to improve the model's inference, adapting the model's input stage with a Convolutional Neural Network (CNN) for improved feature extraction, and, finally, applying an advanced training technique. Each of these is discussed in turn. The dissertation is intended as a reference for discussion and further exploration within the VL domain, highlighting the challenges and possibilities that emerge from the results obtained.
The findings contribute to the understanding of potential issues in the VL domain and lay the foundation for subsequent investigations and advancements in this evolving field.
Keywords: Multimodal, ViLT, NLP, Visual Language
Files in this item:
Gonella_Giacomo.pdf (open access, 4.01 MB, Adobe PDF)

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/62285