Hadoop/MapReduce as a Platform for
Data-Intensive Computing
Jimmy Lin
University of Maryland
(currently at Twitter)
Friday, December 2, 2011
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Our World: Large Data
Source: Wikipedia (Hard disk drive)
processes 20 PB a day (2008)
150 PB on 50k+ servers
running 15k apps
9 PB of user data +
>50 TB/day (11/2011)
Wayback Machine: 3 PB +
100 TB/month (3/2009)
36 PB of user data +
80-90 TB/day (6/2010)
LHC: ~15 PB a year
(at full capacity)
S3: 449B objects, peak 290k
request/second (7/2011)
640K ought to be
enough for anybody.
LSST: 6-10 PB a year (~2015)
How much data?
Why large data? Science
Engineering
Commerce
Source: Wikipedia (Everest)
Science

Emergence of the 4th Paradigm

Data-intensive e-Science
Maximilien Brice, © CERN
Engineering

The unreasonable effectiveness of data

Count and normalize!
Source: Wikipedia (Three Gorges Dam)
Commerce

Know thy customers

Data  Insights  Competitive advantages
Source: Wikipedia (Shinjuku, Tokyo)
Data nirvana requires the right infrastructure
store, manage, organize, analyze, distribute, visualize, …
Why large data?
How large data?
cheap commodity clusters (or utility computing)
+ simple, distributed programming models
= data-intensive computing for the masses!
Source: flickr (turtlemom_nancy/2046347762)
Divide et impera (divide and conquer)


Chop problem into smaller parts
w1
“Work”
w2
w3
r1
r2
“Result”
r3
Combine partial results
Source: Wikipedia (Forest)
Parallel computing is hard!
Fundamental issues
Different programming models
Message Passing: processes (P1…P5) exchange explicit messages
Shared Memory: processes (P1…P5) read and write a common memory
scheduling, data distribution, synchronization,
inter-process communication, robustness, fault
tolerance, …
Architectural issues
Flynn’s taxonomy (SIMD, MIMD, etc.),
network topology, bisection bandwidth
UMA vs. NUMA, cache coherence
Different programming constructs
mutexes, condition variables, barriers, …
masters/slaves, producers/consumers, work queues, …
Common problems
livelock, deadlock, data starvation, priority inversion…
dining philosophers, sleeping barbers, cigarette smokers, …
The reality: programmer shoulders the burden of managing concurrency…
(I want my students developing new algorithms, not debugging race conditions)
Source: Ricardo Guimarães Herrmann
Source: MIT Open Courseware
The datacenter is the computer!
Source: NY Times (6/14/2006)
MapReduce
MapReduce

Functional programming meets distributed processing



Independent per-record processing in parallel
Aggregation of intermediate results to generate final output
Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
 All values with the same key are sent to the same reducer

The execution framework handles everything else…




Handles scheduling
Handles data management, transport, etc.
Handles synchronization
Handles errors and faults
Recall “count and normalize”? Perfect!
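To make the model concrete, here is a minimal word-count sketch in plain Python that simulates the framework's shuffle-and-sort with an in-memory dictionary (no Hadoop involved; `run_mapreduce` and the function names are illustrative, not a real API):

```python
from collections import defaultdict

def map_fn(key, value):
    # map (k, v) -> <k', v'>*: emit (word, 1) for every word in the record.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce (k', [v']) -> <k', v'>*: sum the partial counts for one word.
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Simulate shuffle-and-sort: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    output = {}
    for k2 in sorted(groups):          # all values for a key go to one reducer
        for k3, v3 in reduce_fn(k2, groups[k2]):
            output[k3] = v3
    return output

counts = run_mapreduce([(0, "a b c"), (1, "a c c")], map_fn, reduce_fn)
# counts == {"a": 2, "b": 1, "c": 3}
```

In a real deployment the framework, not this driver loop, handles the grouping, scheduling, and fault tolerance.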
[Diagram: mappers consume input pairs (k1,v1)…(k6,v6) and emit intermediate pairs a→1, b→2, c→3, c→6, a→5, c→2, b→7, c→8. Shuffle and Sort aggregates values by key: a→[1,5], b→[2,7], c→[2,3,6,8]. Three reducers then produce the final outputs (r1,s1), (r2,s2), (r3,s3).]
[Diagram of the execution framework: (1) the user program submits a job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read input splits 0–4; (4) intermediate results are written to local disk; (5) reduce workers remotely read the intermediate files; (6) and write output files 0 and 1.]
Adapted from (Dean and Ghemawat, OSDI 2004)
MapReduce Implementations

Google has a proprietary implementation in C++


Bindings in Java, Python
Hadoop is an open-source implementation in Java



Development led by Yahoo, used in production
Now an Apache project
Rapidly expanding software ecosystem
Statistical Machine Translation (Chris Dyer)
Source: Wikipedia (Rosetta Stone)
Statistical Machine Translation
Word Alignment
Training Data
Phrase Extraction
(vi, i saw)
(la mesa pequeña, the small table)
…
i saw the small table
vi la mesa pequeña
Parallel Sentences
he sat at the table
the service was good
Language
Model
Translation
Model
Target-Language Text
Decoder
maria no daba una bofetada a la bruja verde
mary did not slap the green witch
Foreign Input Sentence
English Output Sentence
\hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(e_1^I)\, P(f_1^J \mid e_1^I)
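The decoder's objective follows from Bayes' rule: the source sentence f_1^J is fixed during the search, so its marginal P(f_1^J) can be dropped from the maximization:

```latex
\hat{e}_1^I
  = \arg\max_{e_1^I} P(e_1^I \mid f_1^J)
  = \arg\max_{e_1^I} \frac{P(e_1^I)\, P(f_1^J \mid e_1^I)}{P(f_1^J)}
  = \arg\max_{e_1^I} P(e_1^I)\, P(f_1^J \mid e_1^I)
```

The first factor is the language model (trained on target-language text) and the second is the translation model (trained on parallel sentences), matching the pipeline above.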
Translation as a Tiling Problem
[Figure: a tiling of the source sentence "Maria no dio una bofetada a la bruja verde". Word-for-word glosses: Mary / not / give / a / slap / to / the / witch / green. Larger tiles such as "did not", "a slap", "did not give", "to the", "the witch", "slap", and "green witch" offer competing ways to cover the sentence; the decoder searches over tilings.]
\hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(e_1^I)\, P(f_1^J \mid e_1^I)
The Data Bottleneck
“Every time I fire a linguist, the performance of our … system goes up.”
- Fred Jelinek
Statistical Machine Translation
We’ve built MapReduce implementations
of these two components! (2008)
Word Alignment
Training Data
Phrase Extraction
(vi, i saw)
(la mesa pequeña, the small table)
…
i saw the small table
vi la mesa pequeña
Parallel Sentences
he sat at the table
the service was good
Language
Model
Translation
Model
Target-Language Text
Decoder
maria no daba una bofetada a la bruja verde
mary did not slap the green witch
Foreign Input Sentence
English Output Sentence
\hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(e_1^I)\, P(f_1^J \mid e_1^I)
HMM Alignment: Giza
[Chart: alignment runtime on a single-core commodity server]
HMM Alignment: MapReduce
[Chart: runtime on a 38-processor cluster vs. a single-core commodity server]
HMM Alignment: MapReduce
[Chart: runtime on a 38-processor cluster vs. 1/38 of the single-core runtime]
What’s the point?

The optimally-parallelized version doesn’t exist!

MapReduce occupies a sweet spot in the design space for
a large class of problems:



Fast… in terms of running time + scaling characteristics
Easy… in terms of programming effort
Cheap… in terms of hardware costs
Sequence Assembly (Michael Schatz)
Source: Wikipedia (DNA)
Strangely-Formatted Manuscript

Dickens: A Tale of Two Cities

Text written on a long spool
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
… With Duplicates

Dickens: A Tale of Two Cities

“Backup” on four more copies
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruction

Dickens accidentally shreds the manuscript
[Figure: all five copies shredded into short overlapping fragments ("It was the", "best of times,", "it was the worst", "of wisdom,", …), with the fragments from every copy jumbled together.]

How can he reconstruct the text?



5 copies × 138,656 words ÷ 5 words per fragment ≈ 138k fragments
The short fragments from every copy are mixed together
Some fragments are identical
Greedy Assembly
It was the best of
age of wisdom, it was
best of times, it was
it was the age of
it was the age of
it was the worst of
of times, it was the
of times, it was the
of wisdom, it was the
It was the best of
was the best of times,
the best of times, it
best of times, it was
of times, it was the
of times, it was the
times, it was the worst
times, it was the age
the age of wisdom, it
the best of times, it
the worst of times, it
The repeated sequences make the correct
reconstruction ambiguous!
times, it was the age
times, it was the worst
was the age of wisdom,
was the age of foolishness,
was the best of times,
Alternative: model sequence reconstruction
as a graph problem…
de Bruijn Graph Construction

Dk = (V,E)


V = All length-k subfragments (k < l)
E = Directed edges between consecutive subfragments
(Nodes overlap by k-1 words)
Original Fragment
It was the best of

Directed Edge
It was the best
was the best of
Locally constructed graph reveals the global structure

Overlaps between sequences implicitly computed
de Bruijn, 1946
Idury and Waterman, 1995
Pevzner, Tang, Waterman, 2001
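A toy sketch of the construction in Python (illustrative function names; word-level k-mers rather than DNA, and a simple greedy walk that works only when the Eulerian path is unique):

```python
from collections import defaultdict

def build_debruijn(fragments, k):
    # Nodes are length-k word windows; a directed edge links each
    # window to the next one in a fragment (the two overlap by k-1 words).
    edges = defaultdict(list)
    for frag in fragments:
        words = frag.split()
        windows = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
        for a, b in zip(windows, windows[1:]):
            if b not in edges[a]:      # identical fragments collapse
                edges[a].append(b)
    return edges

def greedy_walk(edges, start):
    # Follow edges until stuck; correct only when the Eulerian
    # path is unique (no ambiguous repeats in the text).
    path, node = [start], start
    while edges.get(node):
        node = edges[node].pop(0)
        path.append(node)
    # Stitch overlapping windows back into text.
    return " ".join(list(path[0]) + [w[-1] for w in path[1:]])

edges = build_debruijn(
    ["it was the best", "was the best of", "the best of times"], k=3)
text = greedy_walk(edges, ("it", "was", "the"))
# text == "it was the best of times"
```

Repeated phrases create branching nodes, which is exactly where the greedy walk fails and real assemblers fall back to graph simplification.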
de Bruijn Graph Assembly
It was the best
was the best of
the best of times,
it was the worst
best of times, it
was the worst of
of times, it was
the worst of times,
worst of times, it
times, it was the
A unique Eulerian tour of
the graph reconstructs the
original text
If a unique tour does not
exist, try to simplify the
graph as much as possible
it was the age
the age of foolishness
was the age of
the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
de Bruijn Graph Assembly
It was the best of times, it
it was the worst of times, it
of times, it was the
the age of foolishness
it was the age of
A unique Eulerian tour of
the graph reconstructs the
original text
If a unique tour does not
exist, try to simplify the
graph as much as possible
the age of wisdom, it was the
[Diagram: a sequencer samples the subject genome (human genome: ~3 Gbp) into a few billion short reads, ~100 GB compressed. The reads overlap, e.g. GATGCTTACTATGCGGGCCCC and AATGCTTACTATGCGGGCCCCTT, but some contain sequencing errors, so reconstructing the genome from the pile of reads is nontrivial.]
Short Read Assembly

Genome assembly as finding an Eulerian tour of the de
Bruijn graph


Present short read assemblers require tremendous
computation:




Human genome: >3B nodes, >10B edges
Velvet (serial): > 2TB of RAM
ABySS (MPI): 168 cores × ~96 hours
SOAPdenovo (pthreads): 40 cores × 40 hours, >140 GB RAM
Can we get by with MapReduce on commodity clusters?

Horizontal scaling-out in the cloud!
(Zerbino & Birney, 2008)
(Simpson et al., 2009)
(Li et al., 2010)
Graph Compression
Challenges
– Nodes stored on different machines
– Nodes can only access direct neighbors
Randomized Solution
– Randomly assign H / T to each compressible node
– Compress H → T links
Fast Graph Compression
Initial Graph: 42 nodes
Fast Graph Compression
Round 1: 26 nodes (38% savings)
Fast Graph Compression
Round 2: 15 nodes (64% savings)
Fast Graph Compression
Round 3: 6 nodes (86% savings)
Fast Graph Compression
Round 4: 5 nodes (88% savings)
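A toy sketch of the coin-flipping idea in Python, shown on a simple chain (illustrative names; the real algorithm runs each round as a MapReduce job over nodes spread across machines, whereas this runs in one process):

```python
import random

def compress_round(seq, nxt, rng):
    # One round: every node flips a coin; a Heads node absorbs a Tails
    # successor. Heads nodes are never absorbed and Tails nodes never
    # absorb, so all merges in a round are conflict-free and could
    # proceed in parallel with only neighbor-to-neighbor communication.
    coin = {n: rng.choice("HT") for n in seq}
    for n in list(seq):
        m = nxt.get(n)
        if m is not None and coin[n] == "H" and coin[m] == "T":
            seq[n] += seq[m]        # absorb the successor's sequence
            nxt[n] = nxt.pop(m)     # inherit the successor's successor
            del seq[m]

def compress(seq, nxt, seed=0):
    # Repeat rounds until every chain has collapsed to a single node.
    rng = random.Random(seed)
    while any(v is not None for v in nxt.values()):
        compress_round(seq, nxt, rng)
    return seq

# A six-node chain A -> B -> ... -> F collapses to one node.
nodes = "ABCDEF"
seq = {c: c.lower() for c in nodes}
nxt = {a: b for a, b in zip(nodes, nodes[1:])}
nxt["F"] = None
result = compress(seq, nxt)
```

On average a constant fraction of links is an H→T pair, which is why the rounds above shrink the graph geometrically (42 → 26 → 15 → 6 → 5 nodes).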
Contrail

De Novo Assembly of the Human Genome


African male NA18507 (SRA000271, Bentley et al., 2008)
Input: 3.5B 36bp reads, 210bp insert (~40x coverage)
Graph statistics at each stage of error correction:

Stage          N        Max
Initial        >7 B     27 bp
Compressed     >1 B     303 bp
Clip Tips      5.0 M    14,007 bp
Pop Bubbles    4.2 M    20,594 bp

[Diagram: error correction clips short dead-end tips and pops bubbles (parallel paths B and B′) in the graph.]
Aside: How to do this better…

MapReduce is a poor abstraction for graphs
– No separation of computation from graph structure
– Poor locality: unnecessary data movement
Bulk synchronous parallel (BSP) as a better model:
– Google's Pregel; Giraph is an open-source clone
Interesting (open?) question: how many hammers and
how many nails?
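A caricature of the BSP model in Python (hypothetical code, not Pregel's actual API): each vertex holds a value, exchanges messages with neighbors in synchronized supersteps, and goes inactive when nothing changes; shown here propagating minimum labels to find connected components:

```python
def bsp_min_label(adj):
    # Each vertex keeps a value (the smallest label seen so far). In
    # every superstep, active vertices send their value to all
    # neighbors; after a global barrier, each vertex takes the min of
    # its inbox. The computation ends when no vertex is active.
    value = {v: v for v in adj}        # initial label = own id
    active = set(adj)
    while active:
        inbox = {v: [] for v in adj}
        for v in active:               # "compute" phase: send messages
            for u in adj[v]:
                inbox[u].append(value[v])
        active = set()                 # barrier, then message delivery
        for v, msgs in inbox.items():
            best = min(msgs, default=value[v])
            if best < value[v]:
                value[v] = best
                active.add(v)
    return value

# Two components: {1, 2, 3} and {4, 5}.
labels = bsp_min_label({1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]})
# labels == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

The point of the abstraction: computation stays attached to the graph structure, and data moves only along edges, rather than being reshuffled globally every iteration as in MapReduce.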
Source: flickr (stuckincustoms/4051325193)
Science
Engineering
Commerce
Commoditization of large-data processing capabilities
allows us to ride the rising tide!
Source: Wikipedia (Tide)
Source: flickr (60in3/2338247189)
Best thing since sliced bread?

Distributed programming models:





It’s all about the right level of abstraction


MapReduce is the first
Definitely not the only
And probably not even the best
Alternatives: Pig, Dryad/DryadLINQ, Pregel, etc.
The von Neumann architecture won’t cut it anymore
Separating the what from how



Developer specifies the computation that needs to be performed
Execution framework handles actual execution
Framework hides system-level details from the developers
The datacenter is the computer!
Source: NY Times (6/14/2006)
What exciting applications do
new abstractions enable?
What are the appropriate
abstractions for the
datacenter computer?
What new abstractions do
applications demand?
How do we achieve true impact and change the world?
Education

Teaching students to “think at web scale”

Rise of the data scientist: necessary skill set?
Source: flickr (infidelic/3008675635)
Questions?