Web Scraping of NBA Statistical Data

The Internet is today the largest collection of information and data in existence, and its use has grown rapidly in recent years, putting this technology and the immense amount of data it offers within everyone's reach. The information available on the web is heterogeneous: each developer decides how to represent it on their own site, and no common data representation standard is used. For this reason, collecting data in large quantities is difficult to automate. Manual copy and paste is not an option when the items to be retrieved number in the hundreds, if not thousands or even millions. For example, saving information to a spreadsheet by hand would mean copying and pasting every single value from its web page into the document, which takes time that is better saved by using a program that performs all of these operations. This is where web scraping comes in: an extremely useful technique carried out by programs called scrapers, which adapt to the format and structure of a page and extract its data automatically. This thesis analyzes the web scraping technique, starting from a general presentation of how it works and of the tools most commonly used to put it into practice, and then applies it concretely to a specific case study. In particular, it studies the application of this technique to the analysis of statistical data from the world of sport, specifically professional basketball.
