Slide 1
International Internet Preservation Consortium General Assembly 2014, Paris
Mining a Large Web Corpus
Robert Meusel, Christian Bizer

Slide 2
The Common Crawl

Slide 3
Hyperlink Graphs
Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts, or to understand social phenomena.

Slide 4
HTML-embedded Data on the Web
Several million websites semantically mark up the content of their HTML pages.
Markup syntaxes:
• Microformats
• RDFa
• Microdata
Data snippets within info boxes

Slide 5
Relational HTML Tables
HTML tables contain semi-structured data that can be used to build or extend knowledge bases such as DBpedia.
In a corpus of 14B raw tables, 154M are "good" relations (1.1%).
• Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.

Slide 6
The Web Data Commons Project
Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from those crawls.
• Has developed an Amazon-based framework for extracting data from large web crawls
  • Capable of running on any cloud infrastructure
• Has applied this framework to the Common Crawl data
  • Adaptable to other crawls
• Results and framework are publicly available
  • http://webdatacommons.org

Slide 7
Extraction Framework
Diagram components: Master, AWS SQS, AWS EC2 instances, AWS S3
Steps shown in the diagram:
1: Fill queue
2: Launch instances
3: Request file-reference
4: Download file
5: Extract & Upload
6: Collect results
Legend: automated / manual

Slide 8
Extraction Worker
Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
Worker:
• Written in Java
• Processes one page at a time
• Independent of other files and workers
Diagram (WDC Extractor): a .(w)arc file is downloaded from AWS S3, passed through the Filter and the Worker, and the output file is uploaded to AWS S3.

Slide 9
Web Data Commons: Extraction Framework
• Written in Java
• Mainly tailored for Amazon Web Services
• Fault tolerant and cheap
  • 300 USD to extract 17 billion RDF statements from 44 TB
• Easily customizable
  • Only the worker has to be adapted
  • The worker is a single process method that processes one file at a time
  • Scaling is automated by the framework
• Access to the open source code: https://www.assembla.com/code/commondata/
• Alternative: a Hadoop version, which can run on any Hadoop cluster without Amazon Web Services

Slide 10
Extracted Datasets
• Hyperlink Graph
• HTML-embedded Data
• Relational HTML Tables

Slide 11
Hyperlink Graph
• Extracted from the Common Crawl 2012 dataset
• Over 3.5 billion pages connected by over 128 billion links
• Graph files: 386 GB
http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/

Slide 12
Hyperlink Graph
• Discovery of changes in the global structure of the World Wide Web
  • Degrees do not follow a power law
  • Detection of spam pages
• Further insights:
  • WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
  • WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)

Slide 13
Hyperlink Graph
Discovery of important and interesting sites using different popularity rankings or website categorization libraries
Figure: websites connected by at least half a million links

Slide 14
HTML-embedded Data
More and more websites semantically mark up the content of their HTML pages.
Markup syntaxes:
• RDFa
• Microformats
• Microdata

Slide 15
Websites containing Structured Data (2013)
Web Data Commons - Microformat, Microdata, RDFa Corpus
• 17 billion RDF triples from Common Crawl 2013
• Next release will be in winter 2014
• 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%)
• 1.8 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%)
http://webdatacommons.org/structureddata/

Slide 16
Top Classes Microdata (2013)
• schema = Schema.org
• dv = Google's Rich Snippet Vocabulary

Slide 17
HTML Tables
In a corpus of 14B raw tables, 154M are "good" relations (1.1%). Cafarella (2008)
Classification precision: 70-80%
• Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.

Slide 18
WDC - Web Tables Corpus
• Large corpus of relational Web tables for public download
• Extracted from Common Crawl 2012 (3.3 billion pages)
• 147 million relational tables selected out of 11.2B raw tables (1.3%)
• Download includes the HTML pages of the tables (1 TB zipped)

Table statistics:
              Min    Max       Average    Median
Attributes    2      2,368     3.49       3
Data rows     1      70,068    12.41      6

Heterogeneity: very high.
http://webdatacommons.org/webtables/

Slide 19
WDC - Web Tables Corpus

Attribute statistics            Subject attribute values
Attribute       #Tables         Value               #Rows
name            4,600,000       usa                 135,000
price           3,700,000       germany             91,000
date            2,700,000       greece              42,000
artist          2,100,000       new york            59,000
location        1,200,000       london              37,000
year            1,000,000       athens              11,000
manufacturer    375,000         david beckham       3,000
country         340,000         ronaldinho          1,200
isbn            99,000          oliver kahn         710
area            95,000          twist shout         2,000
population      86,000          yellow submarine    1,400

• 28,000,000 different attribute labels
• 1.74 billion rows
• 253,000,000 different subject labels

Slide 20
Conclusion
Three factors are necessary to work with web-scale data:
• Availability of crawls
  • Thanks to Common Crawl, this data is available
• Availability of cheap, easy-to-use infrastructure
  • Such as Amazon or other on-demand cloud services
• Easy-to-adopt, scalable extraction frameworks
  • The Web Data Commons framework, or standard tools like Pig
  • Costs need to be evaluated per task, but the WDC framework has turned out to be cheaper

Slide 21
Questions
Please visit our website: www.webdatacommons.org
Data and framework are available as free downloads.
Web Data Commons is supported by:
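
Appendix (illustrative sketch, not part of the original slides)
The queue-and-worker workflow described on the Extraction Framework and Extraction Worker slides (fill queue, launch instances, request file reference, filter, extract, collect results) can be illustrated with a small local simulation. This is only a sketch under stated assumptions: it is written in Python rather than the framework's Java, it makes no AWS calls, and every name in it (CRAWL_FILES, passes_filter, process_file, run_extraction) is hypothetical. A thread-safe in-memory queue stands in for AWS SQS, a dictionary stands in for AWS S3, and a simple regex for Microdata itemtype attributes stands in for the real extractor.

```python
import queue
import re
import threading

# Hypothetical in-memory stand-in for crawl files stored in AWS S3:
# file reference -> list of (mime_type, content) pages.
CRAWL_FILES = {
    "file-0.warc": [
        ("text/html", '<div itemscope itemtype="http://schema.org/Product"></div>'),
        ("image/png", "binary"),
    ],
    "file-1.warc": [
        ("text/html", "<p>no markup here</p>"),
        ("text/html", '<span itemscope itemtype="http://schema.org/Person"></span>'),
    ],
}

# Cheap regex pre-check so pages without markup skip the extraction step.
MARKUP_HINT = re.compile(r"itemscope|itemtype")
# Toy extractor pattern (an assumption; the real extractor is far richer).
ITEMTYPE = re.compile(r'itemtype="([^"]+)"')


def passes_filter(mime_type, content):
    """MIME-type filter plus regex detection, as listed on the worker slide."""
    return mime_type == "text/html" and MARKUP_HINT.search(content) is not None


def process_file(file_ref):
    """Worker body: handles one file, independent of other files and workers."""
    found = []
    for mime_type, content in CRAWL_FILES[file_ref]:
        if passes_filter(mime_type, content):
            found.extend(ITEMTYPE.findall(content))
    return found


def run_extraction(num_workers=2):
    tasks = queue.Queue()   # stand-in for the AWS SQS queue
    results = {}            # stand-in for the output location in AWS S3
    lock = threading.Lock()

    for file_ref in CRAWL_FILES:               # step 1: fill queue
        tasks.put(file_ref)

    def worker():
        while True:
            try:
                file_ref = tasks.get_nowait()  # step 3: request file reference
            except queue.Empty:
                return                         # queue drained, worker stops
            output = process_file(file_ref)    # steps 4-5: download and extract
            with lock:
                results[file_ref] = output     # step 5: upload output

    # step 2: launch instances (threads here instead of EC2 machines)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results                             # step 6: collect results


if __name__ == "__main__":
    for ref, types in sorted(run_extraction().items()):
        print(ref, types)
```

Because each worker pulls the next file reference from the shared queue and touches no other worker's state, adding workers only changes num_workers; this mirrors the slides' point that only the per-file worker logic has to be adapted while scaling is handled around it.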