
Statistics of the Common Crawl Corpus 2012

Sebastian Spiegler, Data Scientist at SwiftKey
June 2013
The Common Crawl (http://commoncrawl.org) is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analysed by everyone. The foundation crawled the web in 2008, 2009, 2010 and 2012. The focus of this article is an exploratory analysis of the latest crawl.
Figure 1: Data flow of experiment (3.83 billion documents, 65 terabytes of compressed data, 210 terabytes of content, 41.4 million domains).
Introduction
The Common Crawl (CC) corpus allows individuals or businesses to cost-effectively access terabytes of web crawl data using Amazon web services like Elastic MapReduce (aws.amazon.com/elasticmapreduce/).

At SwiftKey (http://www.swiftkey.net/), an innovative London-based start-up, we build world-class language technology, such as our award-winning Android soft keyboard. Amongst other features, the keyboard delivers multilingual error correction, word completion, next-word prediction and space inference for up to three languages concurrently. At present, we support more than 60 languages, with new languages constantly being added to the list. The CC corpus represents an excellent addition to our internal data sources for building and testing language models and for our research.

To better understand the content and the structure of the 2012 corpus, we carried out the exploratory analysis at hand. The remainder of the article is organised as follows. We will start with a short overview of the experimental setup and subsequently examine our results.
Experiment
The 2012 corpus is made up of 857 thousand ARC files which are stored in s3://aws-publicdatasets/common-crawl/parse-output/. Each ARC file is compressed and contains multiple entries of crawled documents. An entry consists of a header and the actual web document. More details on the data format can be found under http://commoncrawl.org/data/accessing-the-data/; note that the crawler truncated the content of fetched web documents at 2 megabytes.
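As a rough illustration of what such an entry looks like, the sketch below iterates over the records of a locally downloaded ARC file. It assumes the classic ARC-style record header of a single line with URL, IP address, fetch date, content type and payload length; the exact field order and gzip framing can differ between crawls, and the file name is only a placeholder.

    # Minimal sketch: iterate over the records of one (locally downloaded) ARC file.
    # Assumes an ARC-style record header of the form
    #   "<url> <ip> <fetch-date> <content-type> <length>"
    # followed by <length> bytes of payload (HTTP headers plus document).
    import gzip
    from collections import Counter

    def iter_arc_records(path):
        with gzip.open(path, "rb") as f:           # gzip handles concatenated members
            while True:
                header = f.readline()
                if not header:
                    break                          # end of file
                parts = header.decode("utf-8", errors="replace").split()
                if len(parts) < 5:
                    continue                       # skip blank separator lines
                url, ip, date, content_type, length = parts[:5]
                payload = f.read(int(length))      # raw response, truncated at 2 MB by the crawler
                yield url, content_type, int(length), payload

    # Example: count documents per content type in a single ARC file.
    counts = Counter(ct for _, ct, _, _ in iter_arc_records("example.arc.gz"))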
Extracted data
For the purpose of this analysis, we have extracted the following fields for each web document in the 2012 corpus:

• The public suffix. The public suffix is the level under which a user can register a private domain. An up-to-date list is maintained by the Mozilla Foundation (see http://publicsuffix.org/). The public suffix of ‘bbc.co.uk’ would be ‘.co.uk’ whereas ‘.uk’ is the top-level domain (TLD). Although we will be using TLDs rather than public suffixes during our investigation, we thought the additional information might still be helpful for later analyses (see the sketch after this list).

• The second-level domain (SLD). This domain is directly below the public suffix and can be registered by an individual or organization. In our previous example ‘bbc.co.uk’ the SLD would be ‘bbc’.

• The internet media type or content type. It is used to specify the content in various internet protocols and consists of a type and sub-type component. Examples are text/html, text/xml, application/pdf or image/jpeg (see the Internet Assigned Numbers Authority for more details: http://www.iana.org/assignments/media-types).

• The character encoding. The encoding describes how one or more bytes are mapped to characters of a character set, a collection of letters and symbols. If the encoding is unknown or an incorrect encoding is applied, a byte sequence cannot be restored to its original text. Examples of character sets are ‘ASCII’ for English text, ‘ISO-8859-6’ for Arabic or ‘UTF-8’ for most world languages.

• The ARC file name. The file name is the last component of the uniform resource identifier s3://aws-publicdatasets/common-crawl/parse-output/segment/[segment]/[ARC_file]. It allows linking a given web document to the ARC file it is stored in. ARC file names are unique, so the segment name is not necessary for identification.

• The byte size. The byte size is the number of raw bytes of the document’s content. We will be summing this value for multiple documents of the same SLD and the entire corpus to make assumptions about the data distribution.
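To make the public suffix, SLD and TLD fields concrete, here is a toy sketch of how a host name can be split against a suffix list. The handful of hard-coded suffixes is an assumption purely for illustration; the real analysis would rely on the full, regularly updated Mozilla list.

    # Toy illustration: split a host name into public suffix, SLD and TLD.
    # The suffix set below is a tiny sample; a real pipeline would load the
    # complete public suffix list from http://publicsuffix.org/.
    PUBLIC_SUFFIXES = {"com", "org", "net", "de", "uk", "co.uk", "org.uk", "jp", "co.jp"}

    def split_domain(host):
        labels = host.lower().strip(".").split(".")
        # Candidates are checked from longest to shortest, so 'co.uk' wins over 'uk'.
        for i in range(len(labels)):
            candidate = ".".join(labels[i:])
            if candidate in PUBLIC_SUFFIXES and i > 0:
                return candidate, labels[i - 1], labels[-1]   # suffix, SLD, TLD
        return None, None, labels[-1]

    print(split_domain("www.bbc.co.uk"))   # ('co.uk', 'bbc', 'uk')
    print(split_domain("example.com"))     # ('com', 'example', 'com')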
Setup
Figure 1 shows the overall data flow of the experiment. The 2012 corpus consists of 210 terabytes of web data which was processed to extract the fields listed above. This resulted in a 241 gigabyte summary of 3.83 billion documents corresponding to 41.4 million distinct second-level domains. The non-aggregated data is accessible at s3://aws-publicdatasets/common-crawl/index2012/.

For the experiment we made two major decisions. Instead of processing all ARC files at once, we split the 2012 corpus into manageable subsets of 25 thousand files, processed them individually and later combined intermediate results. Furthermore, we chose the format of tab-separated values for the non-aggregated data – the long list of entries with public suffix, second-level domain, content type, encoding, file name and byte size – which would allow us to easily run SQL-like queries using Apache Hive (http://hive.apache.org/) later on.

The actual experiment took approximately 1500 instance hours totalling about US$ 200 including the use of EC2 spot instances and data transfer from and to S3. Along with development and testing we spent about three times this figure. This makes the Common Crawl corpus very accessible, especially to start-ups like SwiftKey. A summary of the experimental setup is shown in Figure 2.

Figure 2: Experiment setup. Each subset of 25 thousand ARC files was processed by an EMR Hadoop cluster of one master and five core m1.xlarge spot instances for roughly 6 hours; 35 such cluster runs amounted to about 1,260 instance hours. The 241 gigabytes of extracted information were then queried on an EMR Hive cluster of the same size running for 15 hours (about 90 instance hours).
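The kind of SQL-like query run over this tab-separated index is easy to mimic locally. The sketch below streams a sample of the index and aggregates document counts and bytes per TLD; the column order and the sample file name are assumptions for illustration.

    # Rough, local equivalent of the SQL-like aggregation run in Hive:
    # document counts and raw bytes per TLD from the tab-separated index.
    # The column order (public suffix, SLD, content type, encoding, ARC file,
    # byte size) and the file name are assumptions for illustration.
    import csv
    from collections import defaultdict

    doc_counts = defaultdict(int)
    byte_totals = defaultdict(int)

    with open("index2012_sample.tsv", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            suffix, sld, content_type, encoding, arc_file, size = row
            tld = "." + suffix.rsplit(".", 1)[-1]   # 'co.uk' and 'uk' both aggregate under '.uk'
            doc_counts[tld] += 1
            byte_totals[tld] += int(size)

    total_docs = sum(doc_counts.values())
    for tld, n in sorted(doc_counts.items(), key=lambda kv: kv[1], reverse=True)[:12]:
        print(f"{tld}\t{n:>12,}\t{n / total_docs:.4f}\t{byte_totals[tld] / 2**40:.2f} TB")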
Exploratory analysis

After extracting the public suffix, second-level domain, content type, encoding, ARC file name and byte size of 3.83 billion web documents, we wanted to answer general questions concerning the distribution of domains, media types and encodings, but also to understand more about the structure of the corpus and its representativeness.
TLD       Abs. freq.       Rel. freq.
.com      2,139,229,462    0.5587
.org        230,777,285    0.0603
.net        208,147,478    0.0544
.de         181,658,774    0.0474
.uk         132,414,696    0.0346
.pl          68,528,722    0.0179
.ru          65,147,873    0.0170
.nl          54,871,489    0.0143
.info        50,395,860    0.0132
.it          49,719,965    0.0130
.fr          49,648,844    0.0130
.jp          43,790,880    0.0114
others      554,450,743    0.1448

Figure 3: Top-level domain distribution based on document frequencies.
TLD     Rel. freq. W3 survey   Rel. freq. CC corpus   Ratio
.gov    0.001                  0.0026                 2.6
.nz     0.001                  0.0022                 2.2
.edu    0.003                  0.0061                 2.0
.uk     0.019                  0.0346                 1.8
.nl     0.008                  0.0143                 1.8
.se     0.003                  0.0053                 1.8
.ca     0.004                  0.0069                 1.7
.ch     0.003                  0.0052                 1.7
.cz     0.005                  0.0076                 1.5
.org    0.041                  0.0603                 1.5

(a) Top 10 overrepresented TLDs.

TLD     Rel. freq. W3 survey   Rel. freq. CC corpus   Ratio
.in     0.009                  0.0021                 0.2
.tk     0.001                  0.0002                 0.2
.th     0.001                  0.0002                 0.2
.kz     0.001                  0.0002                 0.2
.co     0.003                  0.0004                 0.1
.az     0.001                  0.0001                 0.1
.asia   0.001                  0.0001                 0.1
.pk     0.001                  0.0001                 0.1
.ve     0.001                  0.0001                 0.1
.ir     0.006                  0.0005                 0.1

(b) Top 10 underrepresented TLDs.

Table 1: Representativeness of TLDs.
Top-level domains

One of the main questions was which TLDs had been crawled and what their percentage was with respect to the total corpus. For this, we aggregated counts of public suffixes like .org.uk and .co.uk under .uk. Figure 3 summarizes these statistics by listing all TLDs above a relative frequency of 0.01, i.e. 1%. For the 2012 corpus, there are 12 TLDs above this threshold.

It becomes immediately apparent that more than half of the documents which have been crawled are registered under the .com domain. This can be explained by the fact that this TLD contains sites from all over the world.

Comparing these figures to the general usage of TLDs for websites provided by the web technology survey (source: http://w3techs.com/technologies/overview/top_level_domain/all, March 2013), it is possible to make assumptions about the representativeness of the CC corpus as a sample of the internet and whether it is biased towards certain TLDs. The Spearman rank correlation coefficient gave a value of 0.84 for ρ, which indicates a good positive correlation for the top 75 TLDs.

To extrapolate the bias, we took the ratio of the relative frequency in the CC corpus and the expected value from the web technology survey. We labelled TLDs with values above 1 as overrepresented and the ones with values below 1 as underrepresented. The top 10 over- and underrepresented domains are listed in Tables 1a and 1b. Most of the overrepresented TLDs are domains of English-speaking or European countries. The underrepresented domains are mostly Asian and some are South American.
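The representativeness check is straightforward to reproduce on the published numbers. The sketch below computes a Spearman rank correlation and the over-/underrepresentation ratio for a handful of TLDs taken from Table 1a; the full analysis used the top 75 TLDs.

    # Sketch of the representativeness check: Spearman rank correlation between
    # TLD shares in the CC corpus and in the W3Techs survey, plus the
    # over-/underrepresentation ratio. Sample values are copied from Table 1a.
    from scipy.stats import spearmanr

    cc_share = {".gov": 0.0026, ".nz": 0.0022, ".edu": 0.0061, ".uk": 0.0346, ".org": 0.0603}
    w3_share = {".gov": 0.001,  ".nz": 0.001,  ".edu": 0.003,  ".uk": 0.019,  ".org": 0.041}

    tlds = sorted(cc_share)
    rho, _ = spearmanr([cc_share[t] for t in tlds], [w3_share[t] for t in tlds])
    print(f"Spearman rho over this small sample: {rho:.2f}")

    for t in tlds:
        ratio = cc_share[t] / w3_share[t]
        label = "overrepresented" if ratio > 1 else "underrepresented"
        print(f"{t}: ratio {ratio:.1f} -> {label}")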
Second-level domains

In Table 2a and 2b the top 10 second-level domains are given by document frequency and by data in terabytes. With 2.5% of all websites and 4.2% of the total data, youtube.com is the highest ranking second-level domain in the CC corpus 2012. Other high-ranking domains are blog publishing services like blogspot.com, wordpress.com or typepad.com, online shopping sites such as amazon.com, shopzilla.com or shopping.com, the online dictionary thefreedictionary.com, the search engine google.com, travel and booking sites like hotels.com and tripadvisor.es, and photo sharing sites such as flickr.com and tumblr.com.

Rank  SLD                     Abs. freq.    Rel. freq.
1     youtube.com             95,866,041    0.0250
2     blogspot.com            45,738,134    0.0119
3     tumblr.com              30,135,714    0.0079
4     flickr.com               9,942,237    0.0026
5     amazon.com               6,470,283    0.0017
6     google.com               2,782,762    0.0007
7     thefreedictionary.com    2,183,753    0.0006
8     tripod.com               1,874,452    0.0005
9     hotels.com               1,733,778    0.0005
10    flightaware.com          1,280,875    0.0003

(a) .com domains by document frequency.

Rank  SLD                  Terabytes   Rel. data
1     youtube.com          8.7560      0.0417
2     wordpress.com        1.0128      0.0048
3     flickr.com           0.7500      0.0036
4     hotels.com           0.2839      0.0014
5     typepad.com          0.1693      0.0008
6     federal-hotel.com    0.1617      0.0008
7     shopzilla.com        0.1230      0.0006
8     shopping.com         0.1210      0.0006
9     yoox.com             0.1081      0.0005
10    tripadvisor.es       0.1074      0.0005

(b) .com domains by data in terabytes.

Rank  SLD                  Abs. freq.   Rel. freq.
1     citysite.net         1,194,938    0.0003
2     yahoo.co.jp          1,022,024    0.0003
3     amazon.de              864,516    0.0002
4     wrzuta.pl              827,315    0.0002
5     dancesportinfo.net     675,029    0.0002
6     atwiki.jp              665,594    0.0002
7     weblio.jp              642,366    0.0002
8     blogg.se               611,502    0.0002
9     kijiji.ca              608,583    0.0002
10    rakuten.co.jp          564,760    0.0001

(c) Non-.com domains by document frequency.

Rank  SLD              Terabytes   Rel. data
1     tripadvisor.es   0.1074      0.0005
2     tripadvisor.in   0.1051      0.0005
3     ca.gov           0.0857      0.0004
4     epa.gov          0.0803      0.0004
5     iha.fr           0.0781      0.0004
6     amazon.fr        0.0768      0.0004
7     who.int          0.0763      0.0004
8     europa.eu        0.0590      0.0003
9     autotrends.be    0.0569      0.0003
10    astrogrid.org    0.0555      0.0003

(d) Non-.com domains by data in terabytes.

Table 2: Top 10 second-level domains (SLD).
Once again, out of all top 10 domains by document frequency and byte size only one is not from the .com TLD; federal-hotel.com, however, is a French hotel booking site.

Tables 2c and 2d list the top 10 non-.com sites by document frequency and data. Although .jp is ranked 12th by document frequency in the corpus, there are four Japanese sites in the top 10 list: yahoo.co.jp, the shopping site rakuten.co.jp, the wiki site atwiki.jp and the Japanese-English online dictionary weblio.jp.

Another interesting fact is that for the video-sharing website youtube.com almost all documents are HTML text, as summarized in Table 3. The same seems to be the case for social websites like facebook.com, twitter.com, myspace.com, pinterest.com and linkedin.com. In contrast to youtube, however, these social websites only account for a negligible portion of the corpus. This might be explained by the fact that activities on these sites are not part of the general web that is accessible by a web crawler.

SLD            Media type   Abs. freq.   Rel. freq. SLD   Rel. freq. corpus
youtube.com    text/html    95,864,655   1.0000           0.0250
twitter.com    text/html       588,472   0.9992           0.0002
myspace.com    text/html       276,498   0.9978           0.0001
pinterest.com  text/html       270,704   0.9762           0.0001
facebook.com   text/html       212,543   0.9992           0.0001
linkedin.com   text/html       160,558   0.9996           0.0000

Table 3: A video-sharing site and social websites by media type.

Character encoding

Explicitly specifying the character encoding for a given document ensures that its text can be properly represented and further processed. Although utf-8, which contains character sets for all scripts and languages, is the dominant encoding on the internet, 43% of the crawled documents did not have the encoding specified. A detailed summary is given in Figure 4.

Table 4 lists a number of top-level domains of countries which use mainly non-latin scripts. For websites under these TLDs the correct encoding information is crucial to avoid encoding errors. Nevertheless, Chinese (.cn), Japanese (.jp) and Urdu (.pk) have a much higher ratio of websites with unknown encoding than the average top-level domain.
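As a small illustration of why the declared encoding matters, the same byte sequence only round-trips with the right codec. A simple fallback chain such as the one below is an assumption for illustration, not the method used to compile the corpus statistics.

    # The same bytes decode correctly only with the right codec.
    raw = "naïve café".encode("utf-8")
    print(raw.decode("utf-8"))        # 'naïve café'   (correct)
    print(raw.decode("iso-8859-1"))   # 'naÃ¯ve cafÃ©' (mojibake from the wrong codec)

    def decode_with_fallback(raw_bytes, declared=None):
        # Try the declared charset first, then utf-8, then a legacy single-byte codec.
        for codec in filter(None, [declared, "utf-8", "iso-8859-1"]):
            try:
                return raw_bytes.decode(codec)
            except (UnicodeDecodeError, LookupError):
                continue
        return raw_bytes.decode("utf-8", errors="replace")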
Character encoding   Abs. freq.       Rel. freq.
utf-8                1,866,333,314    0.4874
unknown              1,647,477,248    0.4303
iso-8859-1             229,671,038    0.0600
windows-1251            26,798,707    0.0070
iso-8859-2              10,088,397    0.0026
iso-8859-15              8,605,343    0.0022
windows-1256             5,454,253    0.0014
shift-jis                5,289,261    0.0014
windows-1252             5,173,227    0.0014
euc-jp                   4,201,400    0.0011
others                  18,074,100    0.0047

Figure 4: Distribution of character encodings.
Media type                 Abs. freq.       Rel. freq.
text/html                  3,532,930,141    0.9227
application/pdf               92,710,175    0.0242
text/xml                      80,184,383    0.0209
text/css                      22,872,511    0.0060
application/x-javascript      21,198,040    0.0055
image/jpeg                    14,116,839    0.0037
application/javascript        11,548,630    0.0030
text/plain                    10,713,438    0.0028
application/msword             6,648,861    0.0017
application/xml                4,999,123    0.0013
application/rss+xml            4,200,583    0.0011

Figure 5: Distribution of media types.
Script by TLD     Encoding   Abs. freq.    Rel. freq.
Arabic (.eg)      utf-8          183,825    0.6006
                  unknown        119,860    0.3916
Chinese (.cn)     utf-8        1,904,719    0.2179
                  unknown      5,636,104    0.6448
Cyrillic (.ru)    utf-8          702,907    0.4127
                  unknown        531,632    0.3122
Greek (.gr)       utf-8          226,261    0.8092
                  unknown         52,246    0.1868
Hebrew (.il)      utf-8        1,627,998    0.5030
                  unknown      1,327,629    0.4102
Japanese (.jp)    utf-8        3,957,950    0.1911
                  unknown     12,995,862    0.6275
Korean (.kr)      utf-8        1,130,168    0.5075
                  unknown        863,269    0.3876
Thai (.th)        utf-8          386,920    0.4760
                  unknown        389,542    0.4793
Urdu (.pk)        utf-8           94,574    0.2679
                  unknown        253,869    0.7191

Table 4: Encodings of a selection of non-latin script top-level domains.
Media types

The media type of documents across the corpus is dominated by HTML with 92.27%, as shown in Figure 5. In the remainder, 2.4% are in portable document format (PDF) and 2.1% in extensible markup language (XML). The remaining media types with occurrences far below 1% are, for instance, cascading style sheets (CSS), JavaScript, plain text, JPEG-compressed images and Microsoft Word documents.
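The media type (and, where present, the charset) of a document is presumably read off the crawled response's Content-Type header; that is an assumption here, but the split itself is mechanical, as the short sketch below shows for an example header value.

    # Split a Content-Type header value into media type and charset.
    from email.message import Message

    def split_content_type(header_value):
        msg = Message()
        msg["Content-Type"] = header_value
        media_type = msg.get_content_type()   # e.g. 'text/html'
        charset = msg.get_param("charset")    # e.g. 'UTF-8', or None if absent
        return media_type, charset

    print(split_content_type("text/html; charset=UTF-8"))   # ('text/html', 'UTF-8')
    print(split_content_type("application/pdf"))            # ('application/pdf', None)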
Conclusions
The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents from 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1%, whereas documents from .com account for more than 55% of the corpus. The corpus contains a large number of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded, whereas the encoding of 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

The non-aggregated data of this exploratory analysis is accessible at s3://aws-publicdatasets/common-crawl/index2012/ and the code used at git@github.com:sebastianspiegler/Teneo.git.
Acknowledgement

Special thanks to Lisa Green, Matthew Kelcey, Jordan Mendelson and Ahad Rana from the Common Crawl Foundation for their support during this project, and to my colleagues at SwiftKey for their comments and suggestions.

Author

Sebastian Spiegler is a big data, natural language processing and machine learning enthusiast who loves the idea of an enormous web corpus open to everyone. He currently leads a team of data and software engineers at SwiftKey, an innovative London-based start-up specialized in predictive text entry. He holds a Ph.D. in machine learning and NLP from the University of Bristol, England.