Statistics of the Common Crawl Corpus 2012

Sebastian Spiegler, Data Scientist at SwiftKey
June 2013

The Common Crawl (http://commoncrawl.org) is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analysed by everyone. The foundation crawled the web in 2008, 2009, 2010 and 2012. The focus of this article is an exploratory analysis of the latest crawl.

[Figure 1: Data flow of the experiment. 857 thousand ARC files (65 terabytes of compressed data, 210 terabytes of content, 3.83 billion documents, 41.4 million domains) are processed in batches of 25 thousand files on an EMR Hadoop cluster, producing 241 gigabytes of extracted information that is subsequently queried on an EMR Hive cluster.]

Introduction

The Common Crawl (CC) corpus allows individuals or businesses to cost-effectively access terabytes of web crawl data using Amazon web services such as Elastic MapReduce (aws.amazon.com/elasticmapreduce/). At SwiftKey (http://www.swiftkey.net/), an innovative London-based startup, we build world-class language technology, such as our award-winning Android soft keyboard. Amongst other features, the keyboard delivers multilingual error correction, word completion, next-word prediction and space inference for up to three languages concurrently. At present, we support more than 60 languages, with new languages constantly being added to the list. The CC corpus represents an excellent addition to our internal data sources for building and testing language models and for our research. To better understand the content and the structure of the 2012 corpus, we carried out the exploratory analysis at hand. The remainder of the article is organised as follows: we start with a short overview of the experimental setup and subsequently examine our results.

Experiment

The 2012 corpus is made up of 857 thousand ARC files which are stored in s3://aws-publicdatasets/common-crawl/parse-output/. Each ARC file is compressed and contains multiple entries of crawled documents. An entry consists of a header and the actual web document; the crawler truncated the content of fetched documents at 2 megabytes. More details on the data format can be found at http://commoncrawl.org/data/accessing-the-data/.

Extracted data

For the purpose of this analysis, we extracted the following fields for each web document in the 2012 corpus (a short parsing sketch follows this list):

* The public suffix. The public suffix is the level under which a user can register a private domain; an up-to-date list is maintained by the Mozilla Foundation at http://publicsuffix.org/. The public suffix of 'bbc.co.uk' is '.co.uk', whereas '.uk' is the top-level domain (TLD). Although we use TLDs rather than public suffixes during our investigation, the additional information might still be helpful for later analyses.

* The second-level domain (SLD). This domain sits directly below the public suffix and can be registered by an individual or organization. In our example 'bbc.co.uk', the SLD is 'bbc'.

* The internet media type, or content type. It is used to specify the content in various internet protocols and consists of a type and a sub-type component. Examples are text/html, text/xml, application/pdf and image/jpeg (see the Internet Assigned Numbers Authority, http://www.iana.org/assignments/media-types).

* The character encoding. The encoding describes how one or more bytes are mapped to characters of a character set, a collection of letters and symbols. If the encoding is unknown or an incorrect encoding is applied, a byte sequence cannot be restored to its original text. Examples of character sets are 'ASCII' for English text, 'ISO-8859-6' for Arabic and 'UTF-8' for most world languages.

* The ARC file name. This is the last component of the uniform resource identifier s3://aws-publicdatasets/common-crawl/parse-output/segment/[segment]/[ARC_file]. It links a given web document to the ARC file it is stored in; ARC file names are unique, so the segment name is not necessary for identification.

* The byte size. The byte size is the number of raw bytes of the document's content. We sum this value over documents of the same SLD and over the entire corpus to draw conclusions about the data distribution.
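To make the field definitions concrete, here is a minimal sketch of how the TLD, public suffix and SLD could be derived from a document's host name. It is illustrative only: the function name is ours, and it uses a tiny hardcoded suffix subset instead of the full Mozilla public suffix list used in the actual analysis.

```python
# Minimal sketch: deriving TLD, public suffix and SLD from a host name.
# A real implementation would load the full list from http://publicsuffix.org/
# instead of this tiny hardcoded subset.
PUBLIC_SUFFIXES = {"co.uk", "org.uk", "com", "de", "jp", "co.jp"}

def split_domain(host):
    """Return (tld, public_suffix, sld) for a host such as 'www.bbc.co.uk'."""
    labels = host.lower().strip(".").split(".")
    tld = labels[-1]
    # Find the longest matching public suffix, falling back to the bare TLD.
    public_suffix = tld
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            public_suffix = candidate
            break
    # The SLD is the label directly below the public suffix.
    remainder = labels[: len(labels) - len(public_suffix.split("."))]
    sld = remainder[-1] if remainder else ""
    return tld, public_suffix, sld

print(split_domain("www.bbc.co.uk"))   # ('uk', 'co.uk', 'bbc')
print(split_domain("blogspot.com"))    # ('com', 'com', 'blogspot')
```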
Setup

Figure 1 shows the overall data flow of the experiment. The 2012 corpus (stored under s3://aws-publicdatasets/common-crawl/parse-output/) consists of 210 terabytes of web data, which was processed to extract the fields listed above. This resulted in a 241 gigabyte summary of 3.83 billion documents corresponding to 41.4 million distinct second-level domains. The non-aggregated data is accessible at s3://aws-publicdatasets/common-crawl/index2012/.

For the experiment we made two major decisions. Instead of processing all ARC files at once, we split the 2012 corpus into manageable subsets of 25 thousand files, processed them individually and later combined the intermediate results. Furthermore, we chose tab-separated values as the format for the non-aggregated data – the long list of entries with public suffix, second-level domain, content type, encoding, file name and byte size – which allowed us to easily run SQL-like queries with Apache Hive (http://hive.apache.org/) later on.

The actual experiment took approximately 1,500 instance hours and cost about US$ 200, including the use of EC2 spot instances and data transfer from and to S3. Along with development and testing, we spent about three times this figure. This makes the Common Crawl corpus very accessible, especially to start-ups like SwiftKey. A summary of the experimental setup is shown in Figure 2.

[Figure 2: Experiment setup. Each EMR Hadoop cluster consisted of one master and 35 core m1.xlarge spot instances and ran for roughly 6 hours per subset of ARC files, 1,260 instance hours in total; the EMR Hive cluster consisted of one master and five core m1.xlarge spot instances and ran for roughly 15 hours, 90 instance hours in total.]

Exploratory analysis

After extracting the public suffix, second-level domain, content type, encoding, ARC file name and byte size of 3.83 billion web documents, we wanted to answer general questions concerning the distribution of domains, media types and encodings, but also to understand more about the structure of the corpus and its representativeness.

Top-level domains

One of the main questions was which TLDs had been crawled and what their share of the total corpus was. For this, we aggregated counts of public suffixes like .org.uk and .co.uk under .uk. Figure 3 summarises these statistics by listing all TLDs with a relative frequency above 0.01, i.e. 1%. For the 2012 corpus, there are 12 TLDs above this threshold.

Figure 3: Top-level domain distribution based on document frequencies.

TLD      Abs. freq.       Rel. freq.
.com     2,139,229,462    0.5587
.org       230,777,285    0.0603
.net       208,147,478    0.0544
.de        181,658,774    0.0474
.uk        132,414,696    0.0346
.pl         68,528,722    0.0179
.ru         65,147,873    0.0170
.nl         54,871,489    0.0143
.info       50,395,860    0.0132
.it         49,719,965    0.0130
.fr         49,648,844    0.0130
.jp         43,790,880    0.0114
others     554,450,743    0.1448

It becomes immediately apparent that more than half of the crawled documents are registered under the .com domain. This can be explained by the fact that this TLD contains sites from all over the world.

Comparing these figures to the general usage of TLDs for websites reported by the W3Techs web technology survey (http://w3techs.com/technologies/overview/top_level_domain/all, March 2013), it is possible to assess the representativeness of the CC corpus as a sample of the internet and whether it is biased towards certain TLDs. The Spearman rank correlation coefficient gave a value of ρ = 0.84, which indicates a good positive correlation for the top 75 TLDs. To quantify the bias, we took the ratio of the relative frequency in the CC corpus to the expected value from the web technology survey. We labelled TLDs with ratios above 1 as overrepresented and those with ratios below 1 as underrepresented. The top 10 over- and underrepresented domains are listed in Tables 1a and 1b. Most of the overrepresented TLDs belong to English-speaking or European countries; the underrepresented domains are mostly Asian, and some are South American.

Table 1: Representativeness of TLDs.

(a) Top 10 overrepresented TLDs.

TLD    Rel. freq. W3 survey   Rel. freq. CC corpus   Ratio
.gov   0.001                  0.0026                 2.6
.nz    0.001                  0.0022                 2.2
.edu   0.003                  0.0061                 2.0
.uk    0.019                  0.0346                 1.8
.nl    0.008                  0.0143                 1.8
.se    0.003                  0.0053                 1.8
.ca    0.004                  0.0069                 1.7
.ch    0.003                  0.0052                 1.7
.cz    0.005                  0.0076                 1.5
.org   0.041                  0.0603                 1.5

(b) Top 10 underrepresented TLDs.

TLD    Rel. freq. W3 survey   Rel. freq. CC corpus   Ratio
.in    0.009                  0.0021                 0.2
.tk    0.001                  0.0002                 0.2
.th    0.001                  0.0002                 0.2
.kz    0.001                  0.0002                 0.2
.co    0.003                  0.0004                 0.1
.az    0.001                  0.0001                 0.1
.asia  0.001                  0.0001                 0.1
.pk    0.001                  0.0001                 0.1
.ve    0.001                  0.0001                 0.1
.ir    0.006                  0.0005                 0.1
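The representativeness check just described is easy to reproduce. The sketch below uses scipy's spearmanr on two small frequency tables; the survey values shown are illustrative placeholders for a handful of TLDs, not the full data sets used in the article.

```python
# Sketch of the representativeness check: rank-correlate relative TLD
# frequencies in the CC corpus with web-survey figures and flag
# over-/underrepresented TLDs by their frequency ratio.
# The values below are illustrative, not the complete data.
from scipy.stats import spearmanr

cc_freq = {".com": 0.5587, ".org": 0.0603, ".uk": 0.0346, ".nl": 0.0143, ".in": 0.0021}
w3_freq = {".com": 0.52,   ".org": 0.041,  ".uk": 0.019,  ".nl": 0.008,  ".in": 0.009}

tlds = sorted(cc_freq)
rho, _ = spearmanr([cc_freq[t] for t in tlds], [w3_freq[t] for t in tlds])
print("Spearman rho: %.2f" % rho)

# Ratio > 1 means the TLD is overrepresented in the corpus, < 1 underrepresented.
for tld in tlds:
    ratio = cc_freq[tld] / w3_freq[tld]
    label = "overrepresented" if ratio > 1 else "underrepresented"
    print("%-6s ratio %.1f (%s)" % (tld, ratio, label))
```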
Second-level domains

Tables 2a and 2b give the top 10 second-level domains by document frequency and by data in terabytes. With 2.5% of all documents and 4.2% of the total data, youtube.com is the highest-ranking second-level domain in the CC corpus 2012. Other high-ranking domains are blog publishing services like blogspot.com, wordpress.com and typepad.com, online shopping sites such as amazon.com, shopzilla.com and shopping.com, the online dictionary thefreedictionary.com, the search engine google.com, travel and booking sites like hotels.com and tripadvisor.es, and photo sharing sites such as flickr.com and tumblr.com. Out of all top 10 domains by document frequency and byte size, only one, tripadvisor.es, is not from the .com TLD; federal-hotel.com, although a .com domain, is a French hotel booking site.

Tables 2c and 2d list the top 10 non-.com sites by document frequency and data. Although .jp only ranks 12th by document frequency in the corpus, there are four Japanese sites in the top 10 list: yahoo.co.jp, the shopping site rakuten.co.jp, the wiki site atwiki.jp and the Japanese-English online dictionary weblio.jp.

Table 2: Top 10 second-level domains (SLD).

(a) .com domains by document frequency.

Rank  SLD                     Abs. freq.    Rel. freq.
1     youtube.com             95,866,041    0.0250
2     blogspot.com            45,738,134    0.0119
3     tumblr.com              30,135,714    0.0079
4     flickr.com               9,942,237    0.0026
5     amazon.com               6,470,283    0.0017
6     google.com               2,782,762    0.0007
7     thefreedictionary.com    2,183,753    0.0006
8     tripod.com               1,874,452    0.0005
9     hotels.com               1,733,778    0.0005
10    flightaware.com          1,280,875    0.0003

(b) .com domains by data in terabytes.

Rank  SLD                 Terabytes   Rel. data
1     youtube.com         8.7560      0.0417
2     wordpress.com       1.0128      0.0048
3     flickr.com          0.7500      0.0036
4     hotels.com          0.2839      0.0014
5     typepad.com         0.1693      0.0008
6     federal-hotel.com   0.1617      0.0008
7     shopzilla.com       0.1230      0.0006
8     shopping.com        0.1210      0.0006
9     yoox.com            0.1081      0.0005
10    tripadvisor.es      0.1074      0.0005

(c) Non-.com domains by document frequency.

Rank  SLD                  Abs. freq.   Rel. freq.
1     citysite.net         1,194,938    0.0003
2     yahoo.co.jp          1,022,024    0.0003
3     amazon.de              864,516    0.0002
4     wrzuta.pl              827,315    0.0002
5     dancesportinfo.net     675,029    0.0002
6     atwiki.jp              665,594    0.0002
7     weblio.jp              642,366    0.0002
8     blogg.se               611,502    0.0002
9     kijiji.ca              608,583    0.0002
10    rakuten.co.jp          564,760    0.0001

(d) Non-.com domains by data in terabytes.

Rank  SLD              Terabytes   Rel. data
1     tripadvisor.es   0.1074      0.0005
2     tripadvisor.in   0.1051      0.0005
3     ca.gov           0.0857      0.0004
4     epa.gov          0.0803      0.0004
5     iha.fr           0.0781      0.0004
6     amazon.fr        0.0768      0.0004
7     who.int          0.0763      0.0004
8     europa.eu        0.0590      0.0003
9     autotrends.be    0.0569      0.0003
10    astrogrid.org    0.0555      0.0003
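Per-SLD statistics like those in Table 2 can be derived directly from the non-aggregated tab-separated output described in the setup. The sketch below assumes the six-column layout named there (public suffix, second-level domain, content type, encoding, ARC file name, byte size) and a local sample file name of our own choosing; the real data lives on S3 and was aggregated with Hive.

```python
# Sketch: aggregate document counts and byte sizes per second-level domain
# from the non-aggregated TSV output. The file name and exact column layout
# are assumptions for illustration.
import csv
from collections import Counter

doc_counts = Counter()
byte_totals = Counter()

with open("cc2012_index_sample.tsv", newline="") as f:
    for suffix, sld, ctype, enc, arc, size in csv.reader(f, delimiter="\t"):
        domain = "%s.%s" % (sld, suffix.lstrip("."))
        doc_counts[domain] += 1
        byte_totals[domain] += int(size)

for domain, n in doc_counts.most_common(10):
    print("%-25s %12d docs %15d bytes" % (domain, n, byte_totals[domain]))
```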
Another interesting observation is that for the video-sharing website youtube.com almost all documents are HTML text, as summarised in Table 3. The same holds for social websites like facebook.com, twitter.com, myspace.com, pinterest.com and linkedin.com. In contrast to YouTube, however, these social websites account for only a negligible portion of the corpus. This might be explained by the fact that activity on these sites is not part of the general web that is accessible to a web crawler.

Table 3: Video-sharing and social websites by media type.

SLD             Media type   Abs. freq.   Rel. freq. SLD   Rel. freq. corpus
youtube.com     text/html    95,864,655   1.0000           0.0250
twitter.com     text/html       588,472   0.9992           0.0002
myspace.com     text/html       276,498   0.9978           0.0001
pinterest.com   text/html       270,704   0.9762           0.0001
facebook.com    text/html       212,543   0.9992           0.0001
linkedin.com    text/html       160,558   0.9996           0.0000

Character encoding

Explicitly specifying the character encoding of a document ensures that its text can be properly represented and further processed. Although utf-8, which covers the characters of virtually all scripts and languages, is the dominant encoding on the internet, 43% of the crawled documents did not have an encoding specified. A detailed summary is given in Figure 4.

Figure 4: Distribution of character encodings.

Character encoding   Abs. freq.       Rel. freq.
utf-8                1,866,333,314    0.4874
unknown              1,647,477,248    0.4303
iso-8859-1             229,671,038    0.0600
windows-1251            26,798,707    0.0070
iso-8859-2              10,088,397    0.0026
iso-8859-15              8,605,343    0.0022
windows-1256             5,454,253    0.0014
shift-jis                5,289,261    0.0014
windows-1252             5,173,227    0.0014
euc-jp                   4,201,400    0.0011
others                  18,074,100    0.0047

Table 4 lists a selection of top-level domains of countries which mainly use non-latin scripts. For websites under these TLDs, correct encoding information is crucial to avoid encoding errors. Nevertheless, Chinese (.cn), Japanese (.jp) and Urdu (.pk) sites have a much higher proportion of documents with unknown encoding than the average top-level domain.

Table 4: Encodings of a selection of non-latin script top-level domains.

Script (TLD)     Encoding   Abs. freq.   Rel. freq.
Arabic (.eg)     utf-8         183,825   0.6006
                 unknown       119,860   0.3916
Chinese (.cn)    utf-8       1,904,719   0.2179
                 unknown     5,636,104   0.6448
Cyrillic (.ru)   utf-8         702,907   0.4127
                 unknown       531,632   0.3122
Greek (.gr)      utf-8         226,261   0.8092
                 unknown        52,246   0.1868
Hebrew (.il)     utf-8       1,627,998   0.5030
                 unknown     1,327,629   0.4102
Japanese (.jp)   utf-8       3,957,950   0.1911
                 unknown    12,995,862   0.6275
Korean (.kr)     utf-8       1,130,168   0.5075
                 unknown       863,269   0.3876
Thai (.th)       utf-8         386,920   0.4760
                 unknown       389,542   0.4793
Urdu (.pk)       utf-8          94,574   0.2679
                 unknown       253,869   0.7191
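The practical consequence of a missing or wrong encoding declaration is easy to demonstrate: the same byte sequence decodes cleanly under the right charset and garbles or fails under the wrong one. A minimal sketch (the sample strings are our own):

```python
# Sketch: why the declared character encoding matters. The same bytes decode
# correctly with the right charset and turn to mojibake or fail with the wrong one.
samples = [
    ("Grüße aus Köln".encode("utf-8"), "utf-8"),
    ("Grüße aus Köln".encode("utf-8"), "iso-8859-1"),   # wrong declaration: mojibake
    ("こんにちは".encode("shift-jis"), "shift-jis"),
    ("こんにちは".encode("shift-jis"), "utf-8"),          # wrong declaration: decode error
]

for raw, declared in samples:
    try:
        text = raw.decode(declared)
        print("%-12s -> %s" % (declared, text))
    except UnicodeDecodeError:
        print("%-12s -> cannot be restored to its original text" % declared)
```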
Media types

The media type of documents across the corpus is dominated by HTML with 92.27%, as shown in Figure 5. Of the remainder, 2.4% are in portable document format (PDF) and 2.1% in extensible markup language (XML). The remaining media types, each with a share far below 1%, include cascading style sheets (CSS), JavaScript, plain text, JPEG-compressed images and Microsoft Word documents.

Figure 5: Distribution of media types.

Media type                 Abs. freq.       Rel. freq.
text/html                  3,532,930,141    0.9227
application/pdf               92,710,175    0.0242
text/xml                      80,184,383    0.0209
text/css                      22,872,511    0.0060
application/x-javascript      21,198,040    0.0055
image/jpeg                    14,116,839    0.0037
application/javascript        11,548,630    0.0030
text/plain                    10,713,438    0.0028
application/msword             6,648,861    0.0017
application/xml                4,999,123    0.0013
application/rss+xml            4,200,583    0.0011

Conclusions

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents from 41.4 million distinct second-level domains. Twelve top-level domains have a representation above 1%, and documents from .com account for more than 55% of the corpus. The corpus contains a large number of pages from youtube.com, from blog publishing services like blogspot.com and wordpress.com, and from online shopping sites such as amazon.com; these sites are good sources of comments and reviews. Almost half of all web documents are utf-8 encoded, whereas for a further 43% the encoding is unknown. The corpus consists of 92% HTML documents and 2.4% PDF files; the remainder are images, XML or code such as JavaScript and cascading style sheets. The non-aggregated data of this exploratory analysis is accessible at s3://aws-publicdatasets/common-crawl/index2012/ and the code used at git@github.com:sebastianspiegler/Teneo.git.

Acknowledgement

Special thanks to Lisa Green, Matthew Kelcey, Jordan Mendelson and Ahad Rana from the Common Crawl Foundation for their support during this project, and to my colleagues at SwiftKey for their comments and suggestions.

Author

Sebastian Spiegler is a big data, natural language processing and machine learning enthusiast who loves the idea of an enormous web corpus open to everyone. He currently leads a team of data and software engineers at SwiftKey, an innovative London-based start-up specialized in predictive text entry. He holds a Ph.D. in machine learning and NLP from the University of Bristol, England.