International Internet Preservation Consortium
General Assembly 2014, Paris
Mining a Large Web Corpus
Robert Meusel
Christian Bizer
Slide 1
The Common Crawl
Slide 2
Hyperlink Graphs
Knowledge about the structure of the Web can be used to
improve crawling strategies, support SEO experts, and
understand social phenomena.
Slide 3
HTML-embedded Data on the Web
Several million websites semantically mark up the content of
their HTML pages.
Markup Syntaxes
 Microformats
 RDFa
 Microdata
(Figure: data snippets within info boxes)
Slide 4
Relational HTML Tables
HTML tables contain semi-structured data that can be used to
build up or extend knowledge bases such as DBpedia.
 In a corpus of 14 billion raw tables, 154 million are
"good" relations (1.1%)
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
Slide 5
The Web Data Commons Project
Goal: Offer an easy-to-use, cost-efficient, distributed
extraction framework for large web crawls, as well as
datasets extracted from the crawls.
 Has developed an Amazon-based framework for extracting data
from large web crawls
 Capable of running on any cloud infrastructure
 Has applied this framework to the Common Crawl data
 Adaptable to other crawls
 Results and framework are publicly available
 http://webdatacommons.org
Slide 6
Extraction Framework
Architecture (diagram): a Master machine coordinates AWS SQS, EC2, and S3.
1: Fill queue (Master → AWS SQS)
2: Launch instances (Master → AWS EC2)
3: Request file reference (EC2 instance → AWS SQS)
4: Download file (AWS S3 → EC2 instance)
5: Extract & upload (EC2 instance → AWS S3)
6: Collect results (AWS S3 → Master)
(Diagram legend: automated / manual)
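As an illustration of steps 3-5, a minimal sketch of the queue-driven worker loop using the AWS SDK for Java; the queue URL, bucket names, and the extract() hook are hypothetical placeholders, not the framework's actual code.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

import java.io.File;
import java.util.List;

public class ExtractionWorkerLoop {

    // Hypothetical names; replace with your own queue and buckets.
    static final String QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/wdc-files";
    static final String INPUT_BUCKET = "crawl-input";
    static final String OUTPUT_BUCKET = "wdc-output";

    public static void main(String[] args) throws Exception {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        while (true) {
            // Step 3: request a file reference from the queue.
            List<Message> messages = sqs.receiveMessage(QUEUE_URL).getMessages();
            if (messages.isEmpty()) break; // queue drained, instance can shut down

            for (Message msg : messages) {
                String key = msg.getBody();

                // Step 4: download the referenced crawl file from S3.
                File input = File.createTempFile("crawl", ".warc.gz");
                s3.getObject(new GetObjectRequest(INPUT_BUCKET, key), input);

                // Step 5: extract and upload the result.
                File output = extract(input); // application-specific extraction
                s3.putObject(OUTPUT_BUCKET, key + ".out.gz", output);

                // Delete the message so no other instance re-processes the file.
                sqs.deleteMessage(QUEUE_URL, msg.getReceiptHandle());
            }
        }
    }

    static File extract(File input) {
        // Placeholder for (W)ARC parsing and data extraction.
        return input;
    }
}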
Slide 7
Extraction Worker
Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
Worker:
• Written in Java
• Processes one page at a time
• Independent from other files and workers
Pipeline (diagram): the WDC Extractor downloads a .(w)arc file from
AWS S3, runs it through the Filter, passes surviving records to the
Worker, and uploads the output file back to AWS S3.
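A minimal sketch of such a pre-filter, assuming the MIME type and page content are already available as strings; the regex is an illustrative fingerprint, not the project's actual pattern.

import java.util.regex.Pattern;

/** Cheap pre-filter: skip records that cannot contain structured data. */
public class RecordFilter {

    // Illustrative fingerprints of Microdata, RDFa, and Microformats.
    private static final Pattern STRUCTURED_DATA = Pattern.compile(
            "itemscope|itemprop|typeof=|vocab=|class=\"(vcard|hreview)",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the full worker should be run on this record. */
    public static boolean accept(String mimeType, String content) {
        // MIME-type filter: only HTML pages are of interest.
        if (mimeType == null || !mimeType.startsWith("text/html")) {
            return false;
        }
        // Regex detection: look for markup fingerprints before parsing.
        return STRUCTURED_DATA.matcher(content).find();
    }
}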
Slide 8
Web Data Commons – Extraction Framework
 Written in Java
 Mainly tailored for Amazon Web Services
 Fault-tolerant and cheap
 300 USD to extract 17 billion RDF statements from 44 TB
 Easily customizable
 Only the worker has to be adapted
 The worker is a single method that processes one file at a time
 Scaling is automated by the framework
 Open-source code:
 https://www.assembla.com/code/commondata/
Alternative: a Hadoop version, which can run on any Hadoop
cluster without Amazon Web Services.
Slide 9
Extracted Datasets
 Hyperlink Graph
 HTML-embedded Data
 Relational HTML Tables
Slide 10
Hyperlink Graph
 Extracted from the Common Crawl 2012 Dataset
 Over 3.5 billion pages connected by over 128 billion links
 Graph files: 386 GB
http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/
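For a first look at the graph, a sketch that computes the out-degree distribution; it assumes a plain-text edge list with one "source target" node-ID pair per line (a simplification of the published formats) and only scales to samples, not the full 128-billion-edge graph.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

/** Out-degree distribution of an edge list ("source target" per line). */
public class DegreeDistribution {
    public static void main(String[] args) throws Exception {
        Map<Long, Long> outDegree = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                long source = Long.parseLong(line.split("\\s+")[0]);
                outDegree.merge(source, 1L, Long::sum); // one more outgoing link
            }
        }
        // Histogram: how many nodes have out-degree k?
        Map<Long, Long> histogram = new HashMap<>();
        for (long degree : outDegree.values()) {
            histogram.merge(degree, 1L, Long::sum);
        }
        histogram.forEach((k, n) -> System.out.println(k + "\t" + n));
    }
}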
Slide 11
Hyperlink Graph
Discovering how the global structure of the World Wide Web
evolves.
 Degree distributions do not follow a power law (see the note below)
 Detection of spam pages
 Further insights:
 WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
 WebSci'14: The Graph Structure of the Web Aggregated by Pay-Level Domain (Lehmberg et al.)
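Note: a power-law degree distribution would mean that the fraction P(k) of pages with (in- or out-) degree k decays polynomially,

P(k) = C \, k^{-\gamma}

for some exponent \gamma and normalization constant C; the WWW'14 analysis finds that the measured degree distributions of the 2012 graph deviate from this shape.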
Slide 12
Hyperlink Graph
Discovery of important and interesting sites using different
popularity rankings or website-categorization libraries.
(Figure: websites connected by at least ½ million links)
Slide 13
HTML-embedded Data
More and more websites semantically
mark up the content of their HTML pages.
Markup Syntaxes
RDFa
Microformats
Microdata
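A minimal sketch of detecting these three syntaxes in a page with the jsoup HTML parser; the selectors are simplified fingerprints, not the full extraction logic (the WDC extractors build on dedicated parsers).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/** Simplified detection of the three markup syntaxes in an HTML page. */
public class MarkupDetector {
    public static void main(String[] args) {
        String html = "<div itemscope itemtype=\"http://schema.org/Person\">"
                + "<span itemprop=\"name\">Robert Meusel</span></div>";
        Document doc = Jsoup.parse(html);

        // Microdata: itemscope/itemprop attributes (often schema.org types).
        boolean microdata = !doc.select("[itemscope], [itemprop]").isEmpty();
        // RDFa: property/typeof/vocab attributes on arbitrary elements.
        boolean rdfa = !doc.select("[property], [typeof], [vocab]").isEmpty();
        // Microformats: well-known class names such as hCard's "vcard".
        boolean microformats = !doc.select(".vcard, .hreview, .hrecipe").isEmpty();

        System.out.println("Microdata: " + microdata
                + ", RDFa: " + rdfa + ", Microformats: " + microformats);
    }
}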
Slide 14
Websites containing Structured Data (2013)
Web Data Commons - Microformat, Microdata, RDFa Corpus
 17 billion RDF triples from Common Crawl 2013
 Next release will be in winter 2014
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26.3%).
1.8 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13.9%).
http://webdatacommons.org/structureddata/
Slide 15
Top Classes Microdata (2013)
(Chart of the most frequent Microdata classes)
• schema = Schema.org
• dv = data-vocabulary.org (Google's Rich Snippet vocabulary)
Slide 16
HTML Tables
In a corpus of 14 billion raw tables, 154 million are "good"
relations (1.1%) (Cafarella et al., 2008).
Classification precision: 70-80%
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
Slide 17
WDC - Web Tables Corpus
 Large corpus of relational Web tables for public download
 Extracted from Common Crawl 2012 (3.3 billion pages)
 147 million relational tables
 selected from 11.2 billion raw tables (1.3%)
 the download includes the HTML pages of the tables (1 TB zipped)
 Table Statistics
             Min   Max      Average   Median
Attributes    2    2,368      3.49      3
Data Rows     1    70,068    12.41      6
 Heterogeneity: Very high.
http://webdatacommons.org/webtables/
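A minimal sketch of the kind of heuristic that separates relational tables from layout tables, using jsoup; the thresholds are illustrative, not the project's actual classifier.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/** Crude relational-table heuristic; thresholds are illustrative only. */
public class TableFilter {

    /** Relation candidate: header row, >= 2 columns, >= 2 data rows,
     *  and the same number of cells in every row (rectangular). */
    static boolean looksRelational(Element table) {
        Elements rows = table.select("tr");
        if (rows.size() < 3) return false;                 // header + 2+ data rows
        int cols = rows.first().select("th, td").size();
        if (cols < 2) return false;                        // at least two attributes
        if (rows.first().select("th").isEmpty()) return false; // no header row
        for (Element row : rows) {
            if (row.select("th, td").size() != cols) return false; // not rectangular
        }
        return true;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>name</th><th>price</th></tr>"
                + "<tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr></table>";
        Document doc = Jsoup.parse(html);
        for (Element table : doc.select("table")) {
            System.out.println("relational? " + looksRelational(table));
        }
    }
}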
Slide 18
WDC - Web Tables Corpus
 Attribute Statistics

Attribute       #Tables
name          4,600,000
price         3,700,000
date          2,700,000
artist        2,100,000
location      1,200,000
year          1,000,000
manufacturer    375,000
country         340,000
isbn             99,000
area             95,000
population       86,000

 Subject Attribute Values

Value               #Rows
usa               135,000
germany            91,000
greece             42,000
new york           59,000
london             37,000
athens             11,000
david beckham       3,000
ronaldinho          1,200
oliver kahn           710
twist shout         2,000
yellow submarine    1,400

28,000,000 different attribute labels
1.74 billion rows
253,000,000 different subject labels
Slide 19
Conclusion
Three factors are necessary for working with web-scale data:
Availability of crawls
 Thanks to Common Crawl, this data is available
Availability of cheap, easy-to-use infrastructure
 Such as Amazon or other on-demand cloud services
Easy-to-adopt, scalable extraction frameworks
 The Web Data Commons framework, or standard tools like Pig
 Costs have to be evaluated per task, but the WDC framework has
turned out to be cheaper
Slide 20
Questions
 Please visit our website: www.webdatacommons.org
 Data and framework are available for free download
 Web Data Commons is supported by: (sponsor logos)
Slide 21