How Good Is a Web Page? Data Collection for Experimental Evaluation of Link Analysis Algorithms

Secco, Alessandro
2014/2015

Abstract

This thesis describes the motivations, techniques, and results of a large crawl designed to obtain a suitable snapshot of the web graph. This goal requires a carefully designed crawling system capable of exploring the whole .it domain. The result is a fast and stable crawling system which, in a preliminary test, collected more than 308 million distinct web pages in 28 days at an average rate of 204 pages per second, using a single high-end PC-class machine.
2014-10-14
web, crawl, crawling, heritrix, link analysis, information retrieval, algorithms
Files in this item:
File: Alessandro_Secco_WebQual_Thesis.pdf
Access: restricted
Size: 1.31 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/18707