Controlled Comparison of Deep Learning Models for Yacht Image Classification

BIYIKLI, DOGUSCAN
2025/2026

Abstract

Fine-grained visual recognition involves distinguishing between categories that exhibit highly similar global structures while differing in subtle and localized visual cues. This thesis investigates fine-grained yacht model recognition under limited data conditions using a custom dataset of 199 images from two visually similar classes, “Azimut55Fly” and “SunSeekerPre57”. To enable a controlled and reproducible comparison between fundamentally different learning paradigms, a fixed validation protocol is established by defining a dedicated validation subset of 20 images, while the remaining 179 images are used for training. Two modeling paradigms are compared. The first approach employs a detection-based pipeline using a YOLO architecture trained with explicit bounding-box supervision, providing strong spatial inductive bias. The second approach utilizes a vision-language model, Qwen2-VL, fine-tuned using image-level labels and formulated as prompt-based conditional generation without explicit localization signals. To ensure methodological fairness, both model outputs are reduced to a harmonized image-level classification setting. Experimental evaluation demonstrates that the detection-based approach achieves substantially higher validation accuracy (0.95) than the vision-language model (0.55) under identical evaluation conditions. Qualitative analysis further reveals systematic differences in failure patterns, highlighting the influence of supervision strength and inductive bias in fine-grained discrimination tasks. The findings contribute empirical evidence regarding the role of spatial supervision versus semantic alignment in small-scale fine-grained recognition scenarios.
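The harmonized evaluation described above can be sketched as follows. This is a minimal illustration, not the thesis's actual code: it assumes the detector returns per-box (class, confidence) pairs, that the image-level label for the detection pipeline is taken from the highest-confidence box, and that the vision-language model's free-form output is mapped to a class by substring matching. All function names are hypothetical.

```python
# Hypothetical sketch of reducing both paradigms to image-level classification.
CLASSES = ["Azimut55Fly", "SunSeekerPre57"]

def label_from_detections(detections):
    """Detection pipeline: detections is a list of (class_name, confidence)
    boxes; the image-level label is the class of the most confident box."""
    if not detections:
        return None  # no yacht detected -> counted as a miss
    return max(detections, key=lambda d: d[1])[0]

def label_from_generation(text):
    """Vision-language pipeline: map free-form generated text to a class
    label by case-insensitive substring match against the class names."""
    lowered = text.lower()
    for cls in CLASSES:
        if cls.lower() in lowered:
            return cls
    return None  # unparseable answer -> counted as a miss

def accuracy(predictions, ground_truth):
    """Image-level validation accuracy over the fixed 20-image subset."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```

Reducing both outputs through the same label space and scoring rule is what makes the 0.95 vs. 0.55 comparison well-defined despite the two models producing structurally different outputs (boxes vs. text).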
Keywords: Deep Learning, Image Classification, Language Models
File: Biyikli_Doguscan.pdf (Adobe PDF, 16.48 MB, restricted access)
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12608/106229