Warcbase: Building a Scalable Web Archiving Platform on Hadoop and HBase
Jimmy Lin, University of Maryland, @lintool
IIPC 2015 General Assembly, Tuesday, April 28, 2015

From the Ivory Tower…
Source: Wikipedia (All Souls College, Oxford)

… to building sh*t that works
Source: Wikipedia (Factory)

data products | data science
I worked on:
– analytics infrastructure to support data science
– data products to surface relevant content to users

Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012.
Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013.
Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011.
Gupta et al. WTF: The Who to Follow Service at Twitter. WWW 2013.
Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012.

Source: https://www.flickr.com/photos/bongtongol/3491316758/

circa ~2010:
~150 people total
~60 Hadoop nodes
~6 people use the analytics stack daily

circa ~2012:
~1400 people total
10s of Ks of Hadoop nodes, multiple DCs
10s of PBs total Hadoop DW capacity
~100 TB ingest daily
dozens of teams use Hadoop daily
10s of Ks of Hadoop jobs daily

And back!
Source: Wikipedia (All Souls College, Oxford)

Web archives are an important part of our cultural heritage… but they're underused.
Source: http://images3.nick.com/nick-assets/shows/images/house-of-anubis/flipbooks/hidden-room/hidden-room-04.jpg

Why? Restrictive use regimes? But I don't think that's all…
Users can't do much with current web archives.
Source: http://www.flickr.com/photos/cheryne/8417457803/

Hard to develop tools for non-existent needs.
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)

Goal: tools to support exploration and discovery in web archives
Beyond browsing… Beyond searching…
Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg

What would a web archiving platform built on modern big data infrastructure look like?
Source: Google

Desiderata:
– Scalable storage of archived data (HBase)
– Efficient random access
– Scalable processing and analytics
– Scalable storage and access of derived data

HDFS
Open source implementation of the Google File System
Stores data blocks across commodity servers
Scales to 100s of PBs of data
(Diagram: HDFS architecture — an application's HDFS client asks the namenode to map a file name, e.g. /foo/bar, to block IDs and locations, then reads block data directly from the datanodes, each storing blocks on its local Linux file system.)

Hadoop MapReduce
Open source implementation of Google's MapReduce framework
Suitable for batch processing on HDFS data
(Diagram: MapReduce data flow — mappers emit intermediate key-value pairs, the shuffle-and-sort phase aggregates values by key, and reducers produce the final output.)

HBase ~ Google's Bigtable
A collection of tables, each of which represents a sparse, distributed, persistent multidimensional sorted map
Source: Bonaldo Big Table by Alain Gilles

In a nutshell:
(row key, column family, column qualifier, timestamp) → value
Row keys are lexicographically sorted
Column families define "locality groups"
Client-side operations: gets, puts, deletes, range scans
Image Source: Chang et al., OSDI 2006

Warcbase
An open-source platform for managing web archives, built on Hadoop and HBase
http://warcbase.org/
Source: Wikipedia (Archive)

(Diagram: Warcbase architecture — WARC data → Ingestion → Processing & Analytics (text analysis, link analysis, …) → Applications and Services)

Scalability? "We got 99 problems but scalability ain't one…" – Jay Z
Scalability of Warcbase is limited by Hadoop/HBase; applications are lightweight clients.

Sample dataset: crawl of the 108th U.S. Congress
Monthly snapshots, January 2003 to January 2005
1.15 TB gzipped ARC files
29 million captures of 7.8 million unique URLs
23.8 million captures are HTML files
Hadoop/HBase cluster: 16 nodes, dual quad-core processors, 3 × 2 TB disks each

OpenWayback + Warcbase integration
Implementation: OpenWayback frontend for rendering
Offloads storage management to HBase
Transparently scales out with HBase

Topic Model Explorer
Implementation: LDA on each temporal slice; adaptation of the Termite visualization
(Screenshot: Termite-style view of 20 topics per slice, with top terms such as forces, development, community, schools, students, college, funding, federal government, education, child, security, …)

Webgraph Explorer
Implementation: link extraction with Hadoop, site-level aggregation (see the MapReduce sketch below); computation of standard graph statistics; d3 interactive visualization

We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Warcbase: here.
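The webgraph extraction mentioned above can be expressed as a standard MapReduce job. Below is a minimal sketch, not Warcbase's actual code: it assumes the captures have already been exported as tab-separated lines of the form "<pageURL> TAB <HTML>", uses a crude regex rather than a real HTML parser, and all class names are illustrative. Each mapper emits one (source site, target site) edge per extracted link; the shuffle groups edges by key and the reducer sums them into a weighted site-level webgraph.

// Minimal sketch of site-level link extraction with Hadoop MapReduce.
// Assumes input lines of the form "<pageURL>\t<HTML>"; class names are illustrative.
import java.io.IOException;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SiteLinkExtraction {
  // Matches absolute href="..." attributes; a real extractor would use an HTML parser.
  private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

  public static class LinkMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length != 2) return;
      String srcSite = host(parts[0]);
      if (srcSite == null) return;
      Matcher m = HREF.matcher(parts[1]);
      while (m.find()) {
        String dstSite = host(m.group(1));
        if (dstSite != null) {
          // Emit one site-to-site edge per extracted link; the shuffle groups them by edge.
          context.write(new Text(srcSite + "\t" + dstSite), ONE);
        }
      }
    }

    private static String host(String url) {
      try { return URI.create(url.trim()).getHost(); } catch (Exception e) { return null; }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text edge, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      // Output: "srcSite \t dstSite \t weight" — a weighted site-level webgraph.
      context.write(edge, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "site-level link extraction");
    job.setJarByClass(SiteLinkExtraction.class);
    job.setMapperClass(LinkMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In the Mac Pro experiment described below, the same extraction is driven by Pig, which expresses this map-shuffle-reduce pipeline declaratively rather than in Java.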
Warcbase in a box
Wide Web scrape (2011) – Internet Archive wide0002 crawl
Sample of 100 WARC files (~100 GB)
Warcbase running on the Mac Pro:
Ingestion in ~1 hour
Extraction of webgraph using Pig in ~55 minutes
Result: 17m links; .ca subset of 1.7m links
Visualization in Gephi…
(Visualization: Gephi rendering of the Internet Archive Wide Web Scrape from 2011)

What's the big deal?
Historians probably can't afford Hadoop clusters…
But they can probably afford a Mac Pro
How will this change historical scholarship?
Visual graph analysis on longitudinal data; select subsets for further textual analysis
Drill down to examine individual pages
… all on your desktop!

Bonus! Raspberry Pi experiments
Columbia University's Human Rights Web Archive
43 GB of crawls from 2008 (1.68 million records)
Warcbase running on the Raspberry Pi (standalone mode)
Ingestion in ~27 hours (17.3 records/second)
Random browsing (average over 100 pages): 2.1 s page load
Same pages in the Internet Archive: 2.9 s page load
Draws 2.4 watts
Jimmy Lin. Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving. Temporal Web Analytics Workshop.

What's the big deal?
Store every page you've ever visited in your pocket!
Throw in search, lightweight analytics, …
When was the last time you searched to re-find?
What will you do with the web in your pocket?
How will this change how you interact with the web?

Goal: tools to support exploration and discovery
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Warcbase is just the first step…

Questions?
Source: Wikipedia (Hammer)

Bigtable use case: storing web crawls!
Row key: domain-reversed URL
Raw data — column family: contents; value: raw HTML
Derived data — column family: anchor; column qualifier: source URL; value: anchor text

Warcbase: HBase data model
Row key: domain-reversed URL
Column family: c
Column qualifier: MIME type
Timestamp: crawl date
Value: raw source
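To make that data model concrete, here is a minimal HBase client sketch. It assumes a recent HBase Java client and an illustrative table name (congress108); it is not Warcbase's ingestion code, just an illustration of storing and randomly accessing one capture under the schema above: row key = domain-reversed URL, column family "c", qualifier = MIME type, timestamp = crawl date, value = raw source.

// Sketch only: stores one capture and reads it back with the HBase client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CaptureStoreExample {
  private static final byte[] CF = Bytes.toBytes("c");                      // column family "c"
  private static final TableName TABLE = TableName.valueOf("congress108");  // illustrative table name

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TABLE)) {

      // Row key: domain-reversed URL, so captures from the same site sort together.
      byte[] rowKey = Bytes.toBytes("gov.house.www/index.html");

      // Store one capture: qualifier = MIME type, timestamp = crawl date, value = raw source.
      long crawlDate = 1041379200000L; // 2003-01-01 UTC, in milliseconds
      Put put = new Put(rowKey);
      put.addColumn(CF, Bytes.toBytes("text/html"), crawlDate,
          Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // Random access: fetch the newest capture at or before a requested date.
      Get get = new Get(rowKey);
      get.addColumn(CF, Bytes.toBytes("text/html"));
      get.setTimeRange(0, crawlDate + 1); // upper bound is exclusive
      Result result = table.get(get);
      byte[] html = result.getValue(CF, Bytes.toBytes("text/html"));
      System.out.println(Bytes.toString(html));
    }
  }
}

Because row keys are domain-reversed and lexicographically sorted, all captures from one site are contiguous, so a range scan over a row-key prefix pulls back a whole site; per-cell timestamps let a wayback-style frontend request the version closest to a given date.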