Controlled Comparison of Deep Learning Models for Yacht Image Classification
Biyikli, Doguscan
2025/2026
Abstract
Fine-grained visual recognition involves distinguishing between categories that exhibit highly similar global structures while differing in subtle and localized visual cues. This thesis investigates fine-grained yacht model recognition under limited data conditions using a custom dataset of 199 images from two visually similar classes, “Azimut55Fly” and “SunSeekerPre57”. To enable a controlled and reproducible comparison between fundamentally different learning paradigms, a fixed validation protocol is established by defining a dedicated validation subset of 20 images, while the remaining 179 images are used for training. Two modeling paradigms are compared. The first approach employs a detection-based pipeline using a YOLO architecture trained with explicit bounding-box supervision, providing a strong spatial inductive bias. The second approach uses a vision-language model, Qwen2-VL, fine-tuned on image-level labels and formulated as prompt-based conditional generation without explicit localization signals. To ensure methodological fairness, both models' outputs are reduced to a harmonized image-level classification setting. Experimental evaluation demonstrates that the detection-based approach achieves substantially higher validation accuracy (0.95) than the vision-language model (0.55) under identical evaluation conditions. Qualitative analysis further reveals systematic differences in failure patterns, highlighting the influence of supervision strength and inductive bias in fine-grained discrimination tasks. The findings contribute empirical evidence regarding the role of spatial supervision versus semantic alignment in small-scale fine-grained recognition scenarios.
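The abstract mentions reducing both models' outputs to a harmonized image-level classification setting but does not state the reduction rule. The following is a minimal sketch of one plausible rule for the detection branch, assuming the most confident detection determines the image-level label; the helper name `detections_to_label` and the fallback label are illustrative assumptions, not the thesis's actual implementation.

```python
CLASSES = ("Azimut55Fly", "SunSeekerPre57")

def detections_to_label(detections, default="no_detection"):
    """Reduce detector output to a single image-level label.

    `detections` is a list of (class_name, confidence, bbox_xyxy) tuples,
    e.g. as produced by a YOLO-style detector. Assumption: the class of
    the highest-confidence box decides the image label; an image with no
    detections falls back to a sentinel label.
    """
    if not detections:
        return default
    best = max(detections, key=lambda d: d[1])  # pick by confidence
    return best[0]

# Example with mock detector output (pixel coordinates):
dets = [
    ("Azimut55Fly", 0.91, (34, 50, 610, 400)),
    ("SunSeekerPre57", 0.42, (30, 48, 600, 390)),
]
print(detections_to_label(dets))  # -> Azimut55Fly
print(detections_to_label([]))    # -> no_detection
```

Under a rule like this, both the detector and the vision-language model emit exactly one class label per image, so the two paradigms can be scored with the same image-level accuracy metric on the 20-image validation subset.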
https://hdl.handle.net/20.500.12608/106229