Assessing LLMs as Network Administrators: Agent-Orchestrated Pipelines vs. Direct Querying
Davide Saladino
Academic year 2024/2025
Abstract
Modern networks are becoming increasingly complex, creating opportunities for Large Language Models (LLMs) to assist network administrators with routine tasks and troubleshooting. However, established methods for evaluating how well these models actually perform in real network environments are still lacking. Without standardized evaluation frameworks, it remains unclear how effectively different LLMs can handle network administration tasks and which interaction strategies yield the best results. This thesis addresses this gap by developing a comprehensive evaluation framework specifically designed to assess LLMs in network administration contexts. The framework features automated ground-truth generation, comparative analysis across diverse network environments, and systematic evaluation of both direct prompting and agent-based approaches using commercial and local LLMs. Through standardized network management scenarios, this work establishes performance baselines across different model types and interaction strategies, while identifying key challenges in applying LLMs to network administration tasks. The research contributes a reproducible evaluation methodology that provides foundational benchmarks for future AI-driven network management research. Our evaluation of eight commercial and local LLMs across standardized network scenarios reveals that GPT models achieve over 90% accuracy on network administration tasks, significantly outperforming local models such as Qwen and Mistral, which averaged below 50% accuracy. The results demonstrate that commercial models with agent-based approaches provide the most reliable performance for complex network troubleshooting, though at the cost of increased processing time.
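To make the contrast between the two interaction strategies concrete, the sketch below shows how a direct-prompting run and an agent-orchestrated run might be driven against the same standardized scenario and scored against generated ground truth. This is a minimal, hypothetical illustration, not the framework implemented in the thesis; all names (Scenario, direct_prompting, agent_based, query_llm, run_tool) are placeholders introduced here, and exact-match scoring is assumed only for simplicity.

```python
# Minimal sketch of the two evaluation modes contrasted in the abstract.
# Hypothetical names throughout; not the thesis's actual code or API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """One standardized network-management task with an auto-generated ground truth."""
    prompt: str          # e.g. "Which interface on R1 is dropping packets?"
    network_state: str   # textual snapshot of the emulated network
    ground_truth: str    # expected answer produced by the ground-truth generator


def direct_prompting(scenario: Scenario, query_llm: Callable[[str], str]) -> str:
    # Single-shot strategy: the whole network state is placed in one prompt.
    prompt = f"{scenario.network_state}\n\nQuestion: {scenario.prompt}\nAnswer:"
    return query_llm(prompt)


def agent_based(scenario: Scenario, query_llm: Callable[[str], str],
                run_tool: Callable[[str], str], max_steps: int = 5) -> str:
    # Agent-orchestrated strategy: the model iteratively requests tool output
    # (e.g. show/diagnostic commands) before committing to a final answer.
    transcript = f"Task: {scenario.prompt}\n"
    for _ in range(max_steps):
        reply = query_llm(transcript + "\nReply with TOOL:<command> or ANSWER:<text>")
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        command = reply.removeprefix("TOOL:").strip()
        transcript += f"\n{reply}\nOutput: {run_tool(command)}"
    return query_llm(transcript + "\nGive your final answer now:")


def accuracy(answers: list[str], scenarios: list[Scenario]) -> float:
    # Exact-match scoring against ground truth; a real framework may use a
    # more tolerant comparison (normalization, semantic matching, etc.).
    hits = sum(a.strip().lower() == s.ground_truth.strip().lower()
               for a, s in zip(answers, scenarios))
    return hits / len(scenarios)
```

Under this framing, the same scenario set and the same accuracy metric are applied to both strategies, so any performance gap reflects the interaction strategy and the underlying model rather than differences in the test material.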
https://hdl.handle.net/20.500.12608/92221