The Bradley-Terry model for the analysis of Italian Serie A soccer matches

We live in the era of so-called Big Data, in which interconnectivity makes it possible to obtain a large flow of information from every activity. Soccer is no exception: for the past couple of years, clubs have relied on analysis systems to devise game tactics and to scout emerging players. In modern soccer, numerous statistics, such as ball possession and the number of shots taken by a team, are therefore collected during each game. This raises a question: given the large amount of data on team performance in individual games, can we identify which statistics significantly influence the sporting success or failure of individual teams? This thesis addresses that question. The objective is to provide an analysis that answers it using Data Mining techniques, specifically a paired-comparison model for soccer games that takes the included statistics into account. The model chosen for the analysis is the Bradley-Terry model with its extensions. The Bradley-Terry models will then be used to predict game outcomes, and their predictions will be compared with those of the main bookmakers and of the following Machine Learning algorithms: K-Nearest-Neighbors (K-NN), Support Vector Machine (SVM), Decision Tree, Random Forest, and AdaBoost. Finally, Decision Tree and Random Forest will be examined further to determine which statistics are important. The study considers data from the Italian Serie A games of the 2021/2022 season.

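As an illustration of the paired-comparison idea, the basic Bradley-Terry model assigns each team i a positive strength π_i and sets the probability that i beats j to π_i / (π_i + π_j). The sketch below fits these strengths with the classical iterative (minorization-maximization) update. It is only a minimal sketch under stated assumptions: the win counts are invented for illustration, draws, home advantage, and match statistics are ignored, and the function names `fit_bradley_terry` and `win_probability` are hypothetical. The extensions actually used in the thesis are not reproduced here.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate basic Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is the number of times team i beat team j.
    Assumes every team has at least one win and one loss, so that
    the maximum-likelihood estimate exists and no denominator is zero.
    """
    n = len(wins)
    strength = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Strengths are identified only up to scale; normalize to sum to 1.
        norm = sum(updated)
        updated = [s / norm for s in updated]
        converged = max(abs(a - b) for a, b in zip(updated, strength)) < tol
        strength = updated
        if converged:
            break
    return strength


def win_probability(strength, i, j):
    """Probability that team i beats team j under the basic model."""
    return strength[i] / (strength[i] + strength[j])


# Hypothetical data: three teams, each pair meeting four times (no draws).
wins = [
    [0, 3, 2],
    [1, 0, 3],
    [2, 1, 0],
]
strength = fit_bradley_terry(wins)
print("estimated strengths:", strength)
print("P(team 0 beats team 1):", win_probability(strength, 0, 1))
```

Because every pair meets the same number of times in this toy schedule, the estimated strengths rank the teams in the same order as their total wins. Real soccer data would need an extension that handles draws, which this sketch deliberately leaves out.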