Controlled Comparison of Deep Learning Models for Yacht Image Classification
Biyikli, Doguscan
2025/2026
Abstract
Fine-grained visual recognition involves distinguishing between categories that exhibit highly similar global structures while differing in subtle and localized visual cues. This thesis investigates fine-grained yacht model recognition under limited data conditions using a custom dataset of 199 images from two visually similar classes, “Azimut55Fly” and “SunSeekerPre57”. To enable a controlled and reproducible comparison between fundamentally different learning paradigms, a fixed validation protocol is established by defining a dedicated validation subset of 20 images, while the remaining 179 images are used for training. Two modeling paradigms are compared. The first approach employs a detection-based pipeline using a YOLO architecture trained with explicit bounding-box supervision, providing a strong spatial inductive bias. The second approach uses a vision-language model, Qwen2-VL, fine-tuned on image-level labels and formulated as prompt-based conditional generation without explicit localization signals. To ensure methodological fairness, both models' outputs are reduced to a harmonized image-level classification setting. Experimental evaluation demonstrates that the detection-based approach achieves substantially higher validation accuracy (0.95) than the vision-language model (0.55) under identical evaluation conditions. Qualitative analysis further reveals systematic differences in failure patterns, highlighting the influence of supervision strength and inductive bias in fine-grained discrimination tasks. The findings contribute empirical evidence regarding the role of spatial supervision versus semantic alignment in small-scale fine-grained recognition scenarios.
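The abstract mentions reducing both models' outputs to a harmonized image-level classification setting but does not state the reduction rule. The following is a minimal sketch of one plausible rule for the detection branch, assuming the most confident detection determines the image-level label; the helper name `detections_to_label` and the fallback label are illustrative assumptions, not the thesis's actual implementation.

```python
CLASSES = ("Azimut55Fly", "SunSeekerPre57")

def detections_to_label(detections, default="no_detection"):
    """Reduce detector output to a single image-level label.

    `detections` is a list of (class_name, confidence, bbox_xyxy) tuples,
    e.g. as produced by a YOLO-style detector. Assumption: the class of
    the highest-confidence box decides the image label; an image with no
    detections falls back to a sentinel label.
    """
    if not detections:
        return default
    best = max(detections, key=lambda d: d[1])  # pick by confidence
    return best[0]

# Example with mock detector output (pixel coordinates):
dets = [
    ("Azimut55Fly", 0.91, (34, 50, 610, 400)),
    ("SunSeekerPre57", 0.42, (30, 48, 600, 390)),
]
print(detections_to_label(dets))  # -> Azimut55Fly
print(detections_to_label([]))    # -> no_detection
```

Under a rule like this, both the detector and the vision-language model emit exactly one class label per image, so the two paradigms can be scored with the same image-level accuracy metric on the 20-image validation subset.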
https://hdl.handle.net/20.500.12608/106229