Assessing LLMs as Network Administrators: Agent-Orchestrated Pipelines vs. Direct Querying

Davide Saladino
Academic year 2024/2025

Abstract

Modern networks are becoming increasingly complex, creating opportunities for Large Language Models (LLMs) to assist network administrators with routine tasks and troubleshooting. However, standardized methods for evaluating how well these models actually perform in real network environments are still lacking: it remains unclear how effectively different LLMs handle network administration tasks and which interaction strategies yield the best results. This thesis addresses this gap by developing a comprehensive evaluation framework designed specifically for assessing LLMs in network administration contexts. The framework features automated ground-truth generation, comparative analysis across diverse network environments, and systematic evaluation of both direct prompting and agent-based approaches using commercial and local LLMs. Through standardized network management scenarios, this work establishes performance baselines across different model types and interaction strategies, while identifying key challenges in applying LLMs to network administration tasks. The research contributes a reproducible evaluation methodology that provides foundational benchmarks for future AI-driven network management research. Our evaluation of eight commercial and local LLMs across standardized network scenarios shows that GPT models achieve over 90% accuracy on network administration tasks, significantly outperforming local models such as Qwen and Mistral, which averaged below 50% accuracy. The results demonstrate that commercial models combined with agent-based approaches provide the most reliable performance for complex network troubleshooting, though at the cost of increased processing time.
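
The abstract contrasts two interaction strategies, direct prompting and agent-orchestrated querying, both scored against automatically generated ground truth. As a rough illustration of that setup (not the thesis's actual code; every name here, from Scenario to run_command, is a hypothetical placeholder), a minimal sketch in Python might look like this:

    # Minimal sketch of the two interaction strategies the abstract contrasts.
    # All names and the RUN: tool-call convention are illustrative assumptions,
    # not the framework described in the thesis.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Scenario:
        """A network-management task with automatically derived ground truth."""
        prompt: str        # e.g. "Which interface on r1 is down?"
        ground_truth: str  # extracted from the emulated network's real state

    def direct_query(llm: Callable[[str], str], scenario: Scenario) -> str:
        """Direct prompting: a single shot, no tool use."""
        return llm(scenario.prompt)

    def agent_query(llm: Callable[[str], str],
                    run_command: Callable[[str], str],
                    scenario: Scenario,
                    max_steps: int = 5) -> str:
        """Agent-based querying: the model may request read-only network
        commands (e.g. 'ip route', 'ping') before committing to an answer."""
        context = scenario.prompt
        reply = ""
        for _ in range(max_steps):
            reply = llm(context)
            if reply.startswith("RUN:"):  # model asks to execute a command
                cmd = reply.removeprefix("RUN:").strip()
                context += f"\n$ {cmd}\n{run_command(cmd)}"
            else:
                return reply              # model commits to a final answer
        return reply

    def accuracy(answers: list[str], scenarios: list[Scenario]) -> float:
        """Fraction of answers matching ground truth (exact match for brevity)."""
        hits = sum(a.strip() == s.ground_truth for a, s in zip(answers, scenarios))
        return hits / len(scenarios)

In this sketch the agent loop grants the model a bounded number of read-only command executions before it must answer, which is one plausible source of the accuracy-versus-processing-time trade-off the abstract reports; the exact-match scorer is deliberately naive, and a real harness would likely need more tolerant answer matching.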
Keywords: LLM, Networking, Agent
File: Saladino_Davide.pdf (open access, 3.28 MB, Adobe PDF)
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/92221