Warcbase: Building a Scalable Web Archiving Platform on Hadoop and HBase
Jimmy Lin
University of Maryland
@lintool
IIPC 2015 General Assembly
Tuesday, April 28, 2015
From the Ivory Tower…
Source: Wikipedia (All Souls College, Oxford)
… to building sh*t that works
Source: Wikipedia (Factory)
data products
data science
I worked on…
– analytics infrastructure to support data science
– data products to surface relevant content to users
Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012.
Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013.
Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011.
Gupta et al. WTF: The Who to Follow Service at Twitter. WWW 2013.
Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012.
I worked on…
– analytics infrastructure to support data science
– data products to surface relevant content to users
Source: https://www.flickr.com/photos/bongtongol/3491316758/
circa 2010
~150 people total
~60 Hadoop nodes
~6 people use the analytics stack daily
circa 2012
~1,400 people total
tens of thousands of Hadoop nodes across multiple datacenters
tens of PBs of total Hadoop data warehouse capacity
~100 TB ingested daily
dozens of teams use Hadoop daily
tens of thousands of Hadoop jobs daily
And back!
Source: Wikipedia (All Souls College, Oxford)
Web archives are an important part of our cultural heritage…
… but they’re underused
Source: http://images3.nick.com/nick-assets/shows/images/house-of-anubis/flipbooks/hidden-room/hidden-room-04.jpg
Why?
Restrictive use regimes?
But I don’t think that’s all…
Users can’t do much with current web archives
Source: http://www.flickr.com/photos/cheryne/8417457803/
Hard to develop tools for non-existent needs
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Goal: tools to support exploration and discovery in web archives
Beyond browsing…
Beyond searching…
Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg
What would a web archiving platform built on modern big data infrastructure look like?
Source: Google
Desiderata
Scalable storage of archived data
Efficient random access
Scalable processing and analytics
Scalable storage and access of derived data
Desiderata
Scalable storage of archived data → HBase
Efficient random access
Scalable processing and analytics
Scalable storage and access of derived data
HDFS: open-source implementation of the Google File System
Stores data blocks across commodity servers
Scales to 100s of PBs of data
[Diagram: HDFS architecture. The application's HDFS client sends (file name, block id) requests to the namenode, which maps the file namespace (e.g., /foo/bar → block 3df2) to (block id, block location); the client then requests (block id, byte range) directly from HDFS datanodes and receives block data. Datanodes report state to the namenode and store blocks on local Linux file systems.]
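A minimal sketch of what this looks like from the client side, assuming a stock Hadoop installation (the path /foo/bar is just the example from the diagram): the client codes against the FileSystem abstraction, which handles namenode metadata lookups and datanode block reads behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // namenode handles all metadata lookups
    try (FSDataInputStream in = fs.open(new Path("/foo/bar"))) {
      byte[] buf = new byte[4096];
      int n = in.read(buf);                        // actual bytes are streamed from datanodes
      System.out.println("read " + n + " bytes");
    }
  }
}
```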
Hadoop MapReduce: open-source implementation of Google's MapReduce framework
Suitable for batch processing over data in HDFS
[Diagram: MapReduce data flow. Mappers transform input (key, value) pairs into intermediate key-value pairs; the shuffle-and-sort phase aggregates values by key; reducers then combine each key's values to produce the final output.]
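To make that flow concrete, here is a minimal word-count-style job against the standard Hadoop MapReduce API. It is illustrative only; the analyses later in this deck are expressed in Pig on top of the same model, not in hand-written jobs like this one.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {
  // map: emit (word, 1) for every token in the input line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // shuffled and sorted by key (the word)
        }
      }
    }
  }

  // reduce: sum the values aggregated under each key
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```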
HBase: ~ Google’s Bigtable
A collection of tables, each of which represents a sparse, distributed, persistent, multidimensional sorted map
Source: Bonaldo Big Table by Alain Gilles
HBase in a nutshell
(row key, column family, column qualifier, timestamp) → value
row keys are lexicographically sorted
column families define “locality groups”
Client-side operations: gets, puts, deletes, range scans
Image Source: Chang et al., OSDI 2006
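As a concrete illustration of those client-side operations, here is a sketch using the HBase 1.x Java client API. The table name, row keys, and column names are placeholders, not Warcbase's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOpsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("example"))) {

      // put: (row key, column family, qualifier) -> value; timestamp defaults to "now"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("value"));
      table.put(put);

      // get: fetch a single row by key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q1"))));

      // range scan: row keys are lexicographically sorted, so this is a contiguous read
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("row0"));
      scan.setStopRow(Bytes.toBytes("row9"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }

      // delete: remove a row (or specific columns/versions)
      table.delete(new Delete(Bytes.toBytes("row1")));
    }
  }
}
```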
Warcbase
An open-source platform for managing web archives, built on Hadoop and HBase
http://warcbase.org/
Source: Wikipedia (Archive)
[Diagram: Warcbase architecture. WARC data → Ingestion → Processing & Analytics → Applications and Services.]
Scalability?
“We got 99 problems but scalability ain’t one…”
– Jay-Z
Scalability of Warcbase limited by Hadoop/HBase
Applications are lightweight clients
[Diagram: Warcbase architecture, revisited. WARC data → Ingestion → Processing & Analytics (text analysis, link analysis, …) → Applications and Services.]
Sample dataset: crawl of the 108th U.S. Congress
Monthly snapshots, January 2003 to January 2005
1.15 TB of gzipped ARC files
29 million captures of 7.8 million unique URLs
23.8 million captures are HTML files
Hadoop/HBase cluster:
16 nodes, dual quad-core processors, 3 × 2 TB disks each
OpenWayback + Warcbase Integration
Implementation:
OpenWayback frontend for rendering
Offloads storage management to HBase
Transparently scales out with HBase
Topic Model Explorer
Implementation
LDA on each temporal slice
Adaptation of Termite visualization
[Screenshot: Termite-style topic visualization of Topics 0–19, with top terms including forces, development, community, local, state, school, children, international, foreign, nations, grants, training, schools, students, college, services, president, funding, states, year, united, federal, government, grant, program, support, programs, education, child, security, national.]
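A hedged sketch of the per-slice LDA step: the deck does not say which LDA implementation was used, so this uses MALLET's ParallelTopicModel purely for illustration. docsForSlice stands in for the documents of one temporal slice, and 20 topics matches the Topics 0–19 shown above.

```java
import java.util.ArrayList;
import java.util.regex.Pattern;
import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.iterator.StringArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class SliceLda {
  // Train a 20-topic LDA model on the documents of one temporal slice.
  public static ParallelTopicModel train(String[] docsForSlice) throws Exception {
    ArrayList<Pipe> pipes = new ArrayList<>();
    pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));  // tokenize
    pipes.add(new TokenSequenceLowercase());                                // lowercase
    pipes.add(new TokenSequence2FeatureSequence());                         // map tokens to ids

    InstanceList instances = new InstanceList(new SerialPipes(pipes));
    instances.addThruPipe(new StringArrayIterator(docsForSlice));

    ParallelTopicModel lda = new ParallelTopicModel(20);  // one model per slice, 20 topics
    lda.addInstances(instances);
    lda.setNumThreads(4);
    lda.setNumIterations(1000);
    lda.estimate();                                       // Gibbs sampling
    return lda;
  }
}
```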
Webgraph Explorer
Implementation
Link extraction with Hadoop, site-level aggregation
Computation of standard graph statistics
d3 interactive visualization
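The link-extraction and site-level aggregation step might be sketched like this. It is not Warcbase's actual extraction code; jsoup is used here just to illustrate reducing page-to-page links to host-to-host pairs, which a Hadoop job can then aggregate and count.

```java
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiteLinkExtractor {
  // Emit (source host, target host) pairs for every outgoing link in a captured page.
  public static void extract(String pageUrl, String html) throws Exception {
    String sourceHost = new URL(pageUrl).getHost();
    Document doc = Jsoup.parse(html, pageUrl);           // base URI resolves relative links
    for (Element link : doc.select("a[href]")) {
      String target = link.attr("abs:href");             // absolute target URL
      if (target.startsWith("http")) {
        String targetHost = new URL(target).getHost();
        System.out.println(sourceHost + "\t" + targetHost);  // aggregate these pairs downstream
      }
    }
  }
}
```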
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Warcbase: here.
Warcbase in a box
Wide Web scrape (2011) – Internet Archive wide0002 crawl
Sample of 100 WARC files (~100 GB)
Warcbase running on the Mac Pro:
Ingestion in ~1 hour
Extraction of webgraph using Pig in ~55 minutes
Result: 17m links, .ca subset of 1.7m links
Visualization in Gephi…
Internet Archive Wide Web Scrape from 2011
What’s the big deal?
Historians probably can’t afford Hadoop clusters…
But they can probably afford a Mac Pro
How will this change historical scholarship?
Visual graph analysis on longitudinal data, select subsets for further textual analysis
Drill down to examine individual pages
… all on your desktop!
Bonus!
Warcbase: here.
Raspberry Pi Experiments
Columbia University’s Human Rights Web Archive
43 GB of crawls from 2008 (1.68 million records)
Warcbase running on the Raspberry Pi (standalone mode)
Ingestion in ~27 hours (17.3 records/second)
Random browsing (avg over 100 pages): 2.1 s page load
Same pages in the Internet Archive: 2.9 s page load
Draws 2.4 watts
Jimmy Lin. Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving. Temporal Web Analytics Workshop (TempWeb), 2015.
What’s the big deal?
Store every page you’ve ever visited in your pocket!
Throw in search, lightweight analytics, …
When was the last time you searched to re-find something?
What will you do with the web in your pocket?
How will this change how you interact with the web?
Goal: tools to support exploration and discovery
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Warcbase is just the first step…
Questions?
Source: Wikipedia (Hammer)
Bigtable use case: storing web crawls!
Row key: domain-reversed URL
Raw data:
Column family: contents
Column qualifier: (none)
Value: raw HTML
Derived data:
Column family: anchor
Column qualifier: source URL
Value: anchor text
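A sketch of that webtable layout with the HBase Java client, following the Bigtable paper's example rather than any specific system; the table handle and method name are illustrative.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableSketch {
  // Raw content and derived anchor data live in the same row, keyed by the reversed URL.
  public static void storePage(Table webtable, String reversedUrl,
      byte[] rawHtml, String sourceUrl, String anchorText) throws Exception {
    Put put = new Put(Bytes.toBytes(reversedUrl));
    // raw data: contents family, empty qualifier, value = raw HTML
    put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes(""), rawHtml);
    // derived data: anchor family, qualifier = source URL, value = anchor text
    put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes(sourceUrl), Bytes.toBytes(anchorText));
    webtable.put(put);
  }
}
```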
Warcbase: HBase data model
Row key: domain-reversed URL
Column family: c
Column qualifier: MIME type
Timestamp: crawl date
Value: raw source
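Under this schema, storing one capture might look roughly as follows. This is a sketch, not Warcbase's actual ingestion code, and the reverseDomain helper is purely illustrative.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WarcbaseModelSketch {
  // Store one capture under the data model above:
  //   row key = domain-reversed URL, family = "c", qualifier = MIME type,
  //   timestamp = crawl date, value = raw source.
  public static void storeCapture(Table warcbaseTable, String reversedUrl,
      String mimeType, long crawlDate, byte[] rawSource) throws Exception {
    Put put = new Put(Bytes.toBytes(reversedUrl));
    put.addColumn(Bytes.toBytes("c"), Bytes.toBytes(mimeType), crawlDate, rawSource);
    warcbaseTable.put(put);
  }

  // Illustrative helper: ("www.example.org", "/page") -> "org.example.www/page"
  public static String reverseDomain(String host, String path) {
    String[] parts = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) sb.append('.');
    }
    return sb.append(path).toString();
  }
}
```

With the crawl date used as the HBase timestamp, multiple captures of the same URL become versions of the same cell, which is what lets a wayback-style frontend retrieve a page as of a particular date.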