Enhancing Browser Automation with Multimodal LLMs
KAYA, ZEYNEP SILA
2025/2026
Abstract
Automating complex browser-based tasks through natural language instructions is a longstanding goal in web automation. Early approaches relied on rule-based scripts and supervised models trained on specific websites, offering little generalization to unseen environments. Large multimodal models (LMMs) have recently emerged as a promising foundation for such agents, generating action plans from both a webpage's visual appearance and its textual context. Translating these plans into executable browser actions requires identifying the correct HTML element on the page and the corresponding operation to perform, a step known as action grounding. This grounding step remains the primary bottleneck for LMM-based web agents, with a substantial accuracy gap compared to oracle conditions.

This thesis, carried out in collaboration with MyMeta, whose need for general-purpose web automation motivated the work, builds on the SeeAct framework. SeeAct is a two-stage system: the model first generates a free-text action description from a webpage screenshot, then selects the correct element from a ranked list of candidates. In the grounding stage it receives the screenshot and a textual list of candidates, but no visual indication of where each candidate is located on the page. This work investigates whether overlaying bounding boxes on candidate elements can bridge this gap.

Experiments are conducted with three backbone LMMs – GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro – on the Mind2Web Cross-Task test split. Oracle experiments, in which a single bounding box highlights the ground-truth element, confirm that correctly placed annotations consistently improve accuracy across all models, with GPT-4o's element accuracy rising from 45.1% to 53.4%. Control experiments verify that the gains come from spatial information rather than the mere visual presence of an annotation.

Extending this benefit to non-oracle settings – where the ground-truth location is unknown and boxes must be generated automatically – proves more challenging. The first approach draws boxes using the pixel coordinates recorded for each candidate element in the Mind2Web dataset annotations. However, these coordinates suffer from coordinate drift: recorded positions do not always match the actual rendered positions on screen. This also highlights a deeper issue: any approach relying on pre-existing annotations is restricted to settings where such metadata is available. This motivates the use of OmniParser, a vision-based detector that derives bounding boxes purely from the rendered screenshot, making it applicable to any webpage without dataset-specific annotations. However, OmniParser-generated boxes generally underperform the no-box baseline: OmniParser detects every interactive region on the page, producing far more boxes than the candidates the model must choose from. To reduce this clutter, detections are filtered against Mind2Web candidates using a new overlap metric, the containment ratio, which normalizes by the smaller box's area rather than the union and substantially outperforms Intersection over Union (IoU) under this scale mismatch.

Because the prompting strategy also influences grounding, a series of prompt engineering experiments is conducted. These show that structured prompting, which combines explicit size-invariance instructions, batch awareness (informing the model which batch of candidates it is evaluating and how many remain), and choice-by-choice candidate evaluation, yields consistent improvements, defining a non-oracle configuration that surpasses the baseline without bounding boxes. These findings establish both the promise and the current limits of visual grounding for LMM-based web agents, identifying target specificity and candidate presentation strategy as the primary directions for future work.
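To make the box-overlay idea described above concrete, the following is a minimal sketch of how a candidate bounding box might be drawn onto a webpage screenshot. It assumes Pillow and pixel coordinates in (x1, y1, x2, y2) form; the function name, styling, and coordinate source are illustrative assumptions, not details taken from the thesis.

```python
from PIL import Image, ImageDraw

def draw_candidate_box(screenshot_path, box, out_path, color="red", width=4):
    """Overlay one bounding box on a webpage screenshot.

    `box` is an (x1, y1, x2, y2) tuple in pixel coordinates, e.g. the recorded
    position of a candidate element. Recorded coordinates can drift from the
    actually rendered positions, which is the failure mode noted in the abstract.
    """
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, outline=color, width=width)
    image.save(out_path)

# Example: highlight a hypothetical candidate element on a saved screenshot.
# draw_candidate_box("page.png", (120, 340, 480, 392), "page_annotated.png")
```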
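The containment ratio used to filter OmniParser detections against Mind2Web candidates can likewise be sketched in a few lines. The helper names and example coordinates below are illustrative assumptions; the sketch only shows why normalizing by the smaller box's area behaves better than IoU when a tight detection sits inside a much larger candidate region.

```python
def box_area(box):
    # box = (x1, y1, x2, y2)
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection_area(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

def iou(a, b):
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def containment_ratio(a, b):
    # Normalize by the smaller box's area instead of the union, so a tight
    # detection fully nested inside a large candidate region scores ~1.0.
    inter = intersection_area(a, b)
    smaller = min(box_area(a), box_area(b))
    return inter / smaller if smaller > 0 else 0.0

# A wide candidate region and a tight detection inside it (made-up coordinates):
candidate = (100, 200, 500, 260)
detection = (110, 210, 180, 250)
print(round(iou(candidate, detection), 2))                # 0.12 -> fails a typical IoU cut-off
print(round(containment_ratio(candidate, detection), 2))  # 1.0  -> kept by containment filtering
```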
https://hdl.handle.net/20.500.12608/108229