Rhetorical Analysis of Memes to Detect Persuasion Techniques
TOLEGEN, AKERKE
2024/2025
Abstract
In the digital age, memes have evolved from humorous internet content into powerful instruments of persuasion, capable of shaping public opinion and reinforcing political ideologies. This thesis addresses the computational challenge of automatically detecting persuasion techniques in memes, focusing on the design and optimization of multimodal fusion architectures under realistic computational constraints. Building on the framework established in SemEval-2024 Task 4 by Dimitrov et al., it investigates how different multimodal fusion strategies perform within typical academic resource limitations. The work systematically compares early fusion approaches with late fusion architectures enhanced by attention mechanisms, and draws out what the comparison implies for the practical deployment of multimodal systems. The thesis is structured around two main research directions. The first is a replication and analysis of state-of-the-art methods from recent competitions, including ensemble approaches with data augmentation and hierarchical prompt tuning strategies. The second is the development of novel multimodal architectures that combine Gated Attention Fusion Networks (GAFN) with parameter-efficient Low-Rank Adaptation (LoRA) of large language models, together with Technique-Aware Negative Sampling for contrastive learning during the pretraining of multimodal architectures. A key contribution is the finding that, under realistic computational constraints, a TinyLLaMA-based late fusion approach (1.1B parameters) with LoRA adaptation achieves better performance (F1: 0.704) than larger early fusion models. A second contribution is a technique-aware cross-modal alignment strategy, that is, a sampling strategy for contrastive learning with early-fusion models. It is expected to be most useful when computational constraints are looser, and it improved overall model performance even within the limits of the available environment.
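To make the parameter-efficiency argument concrete, the following is a minimal sketch of the standard LoRA formulation, in which a frozen weight matrix W is augmented with a trainable low-rank update scaled by alpha / r. It is not the thesis's implementation: the class name `LoRALinear`, the helper `matvec`, and the rank and alpha values are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only).

    Computes y = W x + (alpha / r) * B (A x), where W is frozen and only
    A (r x d_in) and B (d_out x r) would be trained.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weights
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B with zeros, so the layer
        # initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


if __name__ == "__main__":
    d_out, d_in, rank = 64, 64, 4
    weight = [[0.0] * d_in for _ in range(d_out)]
    layer = LoRALinear(weight, rank=rank, alpha=8)
    full = d_out * d_in
    print(f"full fine-tuning: {full} parameters")
    print(f"LoRA (r={rank}): {layer.trainable_params()} parameters")
```

With these illustrative sizes the adapter trains r * (d_in + d_out) parameters instead of d_in * d_out, which is the kind of saving that matters when batch size and memory are limited.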
The best-performing architecture incorporates several components: in-domain pretraining on augmented text datasets, a two-stage gated attention mechanism, and a 5-layer classification head. Importantly, this thesis provides empirical evidence that batch size limitations, memory constraints, and training time restrictions create optimization challenges that favor simpler, more modular architectures over theoretically superior but computationally demanding alternatives. This finding has direct implications for practical multimodal system deployment and challenges the trend toward ever-increasing architectural complexity. In sum, this thesis investigates constraint-aware multimodal architecture design for persuasion technique detection, offering both methodological innovations and practical insights for the broader multimodal learning community.

| File | Access | Size | Format |
|---|---|---|---|
| CS_MsC_Thesis___UniPD.pdf | Restricted access | 2.17 MB | Adobe PDF |
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 License.
https://hdl.handle.net/20.500.12608/89973