Cornell Information Science Research Seminar: The Web Lab
http://weblab.infosci.cornell.edu/
William Y. Arms, Manuel Calimlim, Lucy Walle, Felix Weigel
January 23, 2007

The Web Lab: A Joint Project of Cornell University and the Internet Archive
Faculty: William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang,...
Researchers: Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix Weigel,...
Students: Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 50 M.Eng. and undergraduate students from Information Science and Computer Science
Internet Archive: Brewster Kahle, Tracey Jacquith, Michael Stack, Kris Carpenter,...

Introduction to the Web Lab
Mining the History of the Web
The Internet Archive's Web Collection
• Complete crawls of the Web, every two months since 1996
• Total archive is about 110,000,000,000 pages (110 billion)
• Recent crawls are about 60+ TByte (compressed)
• Total archive is about 1,900 TByte (compressed)
• Metadata contains format, links, anchor text

The Library Stacks: the Internet Archive

The Wayback Machine
Demo: http://www.archive.org/

Research using Metadata about Web Pages
Current NSF grant
Research using anchor text
• links to microsoft.com and google.com
Changes to the link structure of the Web
• differences between crawls
• densification (increases in average node degree)
Formation of online groups

Example of Past Work: Social and Information Networks, Joining a Community
Close to one billion (user, community) instances
Work by: Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan

The Never-ending Research Dialog
RESEARCHER
Here's an analysis we would like to do...
Not as you suggest it, but here's another idea...
We don't know how to do that analysis. Would this be any use to you?
INFORMATION SCIENTIST
That might be possible, with the following modification...
Let's try it and see.
The Role of Web Data for Social Science Research
Social networks are an important research topic
– Emergence of global phenomena from local effects
  • Viral spreading of rumors
– Behavior of individuals in a community
  • Roles in discussion threads, herd behavior in opinion polls
– Network structure and dynamics
  • Strength of weak ties, triangle relations, homophily

How to Observe a Social Network?
• Social network research before the web
  – Talk to people, make notes
  – Distribute questionnaires, gather statistics
• Problems with this approach
  – Tedious task
  – Small scale
• The Internet Archive is a great resource for research
  – Contains web pages with social networks
  – Records the history of the pages

Social Networks on the Web
The web contains many social networks
– Sites for social networking, social bookmarking, file sharing
  • MySpace, Facebook, Flickr, Delicious
– Community portals
  • Yahoo Groups, DBLife
– Encyclopedia and folksonomy projects
  • Wikipedia, Wikia
– Review sites and customer comments
  • Amazon, Netflix
– Blogs, web forums, Usenet

The Bliss and Curse of Digital Data
Opportunities
– Collecting network data at an unprecedented scale
– Verifying hypotheses in many different networks
– Monitoring communities at a finer granularity
– Mining and searching social networks
Challenges
– Finding suitable information on the web
– Extracting information from web pages
– Making web data persistent
– Processing very large data sets
– Access rights and privacy

Web Lab and Social Science Research
• Collaboration with Cornell’s Institute for the Social Sciences
• Our goal: Make data available to researchers
  – Large web graph database with multiple crawls
  – Packaged subsets of crawls for analysis
  – Visual extraction tool for creating new data sets (ongoing)
  – Small-scale crawling for adding new web sites (starting)
  – Full-text indexing (planned)
Demo of the extraction tool available at http://www.cs.cornell.edu/~weigel/WrapperDemo/

Web Data Extraction
Researchers often care less about whole web pages than about specific substructures inside them
– Blog postings
– Web forums
– Social tagging
– News headlines
– Tables of contents
– Bibliographies
– Product details
– Customer reviews

Web Data Collaboration Server
Data extraction
• Writing extraction code is a tedious task
• Create tools to make the data easily accessible in a structured format (e.g., tables in a database)
Data sharing
• Extracting the same data repeatedly is a waste of time and storage space
• Let users share their data and extraction rules
Data curation
• Web data is often incomplete and erroneous
• Let users collaborate to correct and complete the data

Demonstration
Demo of the extraction tool available at http://www.cs.cornell.edu/~weigel/WrapperDemo/

The Web Lab System
[Diagram labels: Web Collection; INTERNET ARCHIVE; Text indexes; National supercomputers; File server; Structure database; Wayback Machine; Computer cluster; Page store; Text indexes; CORNELL UNIVERSITY]

Technical Processing: the Web Lab
Networking: Internet 2, National Lambda Rail
Wayback Machine: Commodity computers with local file systems
Structure database: Relational database system on large shared memory computer
Data analysis: Specialized Linux cluster with Hadoop distributed file system and MapReduce programming
Different types of computer for different functions

The Research Process
Select a sub-set for analysis
• Query the relational database directly with SQL
• Use the GetPages tool on the Web site to send an SQL query
Download the sub-set
• To the researcher's computer
• To the Web Lab file server
Clean up the data
• MapReduce tasks on the Hadoop cluster
Data analysis
• MapReduce tasks on the Hadoop cluster

Selection Methods
By known identifier (Wayback Machine)
  web pages with the URL http://www.nsf.gov/
By character string (full text indexing, future)
  all pages containing "Internet is doubling every six months"
  all pages containing the SARS-CoV genetic sequence
By metadata criteria
  all web
pages that link to microsoft.com but not to google.com
  all email addresses that I used to receive mail from but have not had mail from recently*
* Example provided by Marc Smith

Benefits of Using a Relational Database
• Simple query language for retrieving data
• Transaction support
• Concurrency control for parallel queries
• Multiple indices for high performance
• Reliability since databases have built-in recovery functionality

Metadata Loading
• The crawler outputs compressed metadata files (DAT files).
• Each DAT file describes a set of crawled pages with page metadata such as crawl time, IP address, MIME type, language encoding, etc.
• Most importantly, the outgoing links from each page are parsed, including the full URL and associated anchor text.

Database Schema
Crawl – Name of the crawl from which data is loaded
Page – Metadata about each webpage plus fields to help find and extract the full HTML text
Link – The outgoing links from crawled pages
Url – Lookup table for unique URLs
Host – Lookup table for unique hostnames

Crawls Loaded Into SQL DB
Crawl    Period                           Database size  Pages        Links        Urls         Hosts
DJ       Jan-April 2002                   2.5 TB         1.1 billion  26 billion   250 million  16 million
DV       Jan-April 2004                   15 TB          1.3 billion  110 billion  TBD          TBD
EB       Jan-March 2005                   20 TB          3 billion    130 billion  20 billion   380 million
Amazon   Jan-April 2004, Jan-August 2005  570 GB         40 million   3 billion    35 million   356
Cornell  Jan-April 2002, Jan-April 2004   5 GB           800,000      12 million   750,000      40,000

Selection from the Database
• Query the relational database directly with SQL (contact Manuel Calimlim)
• Use the GetPages tool on the Web site to send an SQL query (work in progress)

Demonstration
Demonstration of the Web Lab web site http://weblab.infosci.cornell.edu/ and the GetPages tool

Massive Data Analysis by Non-Specialists
A typical scientist or social scientist:
• Has deep domain knowledge
• Has good algorithmic understanding
• Is often a competent computer user or has a research
assistant who is familiar with languages such as Fortran, Python, and Matlab, or application packages such as SAS and Excel.
But...
• Has limited understanding of large-scale data analysis
• Is not skilled at any form of computing that requires parallel computing or concurrency
Typical problem of scale: Given 100 billion URLs, how do you identify duplicates?

Hadoop and MapReduce Programming
Hadoop
An open source distributed file system similar to the Google File System. It supports MapReduce programming.
http://lucene.apache.org/hadoop/
MapReduce
A functional programming style to support large-scale data analysis without the need for global data structures.
In the 1960s, Fortran gave scientists a simple way to translate mathematical problems into efficient computer code. MapReduce programming gives researchers a simple way to run massive data analysis on large computer clusters.

The MapReduce Paradigm
[Diagram: input data split into files (split 0 to split 4), M map tasks, intermediate files, R reduce tasks, output files (Output 0, Output 1)]
Each intermediate file is divided into R partitions
Each reduce task corresponds to one partition

A Web Graph Example
[Figure: example web graph with nodes numbered 1 to 6]

Building the Web Graph
URLs, pages, and links:
• URLs contained in Web pages may link to pages never crawled
• URLs are not canonicalized: different URLs may refer to the same page
• Links are from a page to a URL
Web graph from crawl data:
• Nodes are the union of pages crawled and URLs seen
• Each node and edge has time interval(s) over which it exists

Web Graph Example
Problem: Given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph:
• Replace each u0 or v0 with its canonicalized form u or v.
• Create a list of all nodes of the graph, i.e., the set of unique u.
• Discard all (u, v) pairs, where u = v, or v is not a node of the graph.
• Discard all duplicate edges.
• For each node v, create a list (v, {u}), where {u} is the set of nodes that have edges to node v.
Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?

MapReduce Example
Map task
Input: (u0, v0)
Output:
  (u, d)   // Indicate that u is a from-URL
  (v, u)   // Indicate that v is a to-URL with link from u
d is a dummy marker. Do not output if u = v.
This is simple application code to write.

A MapReduce Example
Merge
The input to the reduce process merges the output values from the map task that correspond to each URL. For each URL, w, it creates a list:
  w, {d, ... , d, u1, ..., uk}
This merge is performed automatically by the system libraries.

A MapReduce Example
Reduce
Input: w, {d, ... , d, u1, ..., uk}, where w is any URL.
Output:
If there is no marker d in the list, discard and do not output. This corresponds to a URL that never appears as the first element of a (u, v) pair.
Otherwise remove duplicates from u1, ..., uk and output. The output is a to-URL and a list of the nodes that link to it:
  v, {u1, ..., uk}
This is simple application code to write.

For the Future: Examples of Tools and Services
The Web Lab is steadily building a set of tools for researchers
• API and Web services
• GetPages Web forms to select a dataset by querying a relational database with indexes by date, URL, domain name, file type, anchor text, etc.
• Focused Web crawling (modification of Heritrix crawler)
• Extraction of Web graph from subset and calculations, e.g., PageRank, hubs and authorities
• Graph visualization
• Natural language processing of anchor text

The Web Lab is Ready for Use
We are ready to work with a number of researchers:
Systems
  Relational database operational
  Hadoop pilot cluster (large cluster soon)
  File server and web server operational
People
  Manuel Calimlim (database)
  Lucy Walle (Hadoop + MapReduce)
Tools
  A variety of tools in prototype
  Experience with large volumes of anchor text and URLs

Thanks
This work would not be possible without the forethought and long-standing commitment of Brewster Kahle and the Internet Archive to capture and preserve the content of the Web for future generations.
This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, SES-0537606, IIS-0634677, and IIS-0705774.
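
Supplementary sketch: the map, merge, and reduce steps of the web graph example
The three steps of the web graph example can be tried on a single machine with the plain-Python sketch below. It is an illustration only, not Hadoop code and not the Web Lab's implementation. The function names (canonicalize, map_task, merge, reduce_task, build_in_links) are hypothetical, the canonicalization rule is an assumed stand-in, and the in-memory merge stands in for the grouping that the slides say the system libraries perform automatically.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

DUMMY = None  # the marker d: "this URL appeared as a from-URL"


def canonicalize(url):
    """Assumed rule: lowercase scheme and host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def map_task(u0, v0):
    """Emit (u, d) and (v, u) for one raw link; emit nothing when u = v."""
    u, v = canonicalize(u0), canonicalize(v0)
    if u == v:
        return
    yield u, DUMMY
    yield v, u


def merge(pairs):
    """Group values by key, standing in for the framework's merge step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(w, values):
    """Return (w, unique from-URLs), or None if w never appeared as a from-URL."""
    if DUMMY not in values:
        return None
    sources = sorted({value for value in values if value is not DUMMY})
    return w, sources


def build_in_links(raw_pairs):
    """Run map, merge, and reduce over an iterable of raw (u0, v0) pairs."""
    mapped = (kv for u0, v0 in raw_pairs for kv in map_task(u0, v0))
    results = (reduce_task(w, values) for w, values in merge(mapped).items())
    return dict(r for r in results if r is not None)


if __name__ == "__main__":
    links = [
        ("http://A.example/", "http://b.example/x"),
        ("http://a.example", "http://b.example/x"),         # duplicate edge
        ("http://b.example/x", "http://a.example/"),
        ("http://b.example/x", "http://b.example/x#top"),   # self-link
        ("http://a.example/", "http://never-crawled.example/"),
    ]
    for node, sources in build_in_links(links).items():
        print(node, sources)
```

Because each map call sees one link and each reduce call sees one URL, no step needs a global data structure, which is the property the slides rely on for running the same logic on a cluster. Following the slides, a page whose only outgoing links point to itself emits nothing and so is not treated as a node.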