COP5725
Advanced Database Systems
Spring 2016
MapReduce
Tallahassee, Florida, 2016
What is MapReduce?
• Programming model
– expressing distributed computations at a massive scale
– “the computation takes a set of input key/value pairs, and produces a set of output
key/value pairs. The user of the MapReduce library expresses the computation as two
functions: map and reduce”
• Execution framework
– organizing and performing data-intensive computations
– processing parallelizable problems across huge datasets using a
large number of computers (nodes)
• Open-source implementation: Hadoop and others
How Much Data?
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s LHC (Large Hadron Collider) will generate 15 PB a
year
“640K ought to be enough for anybody”
Who Cares?
• Ready-made large-data problems
– Lots of user-generated content, even more user behavior data
• Examples: Facebook friend suggestions, Google ad. placement
– Business intelligence: gather everything in a data warehouse and
run analytics to generate insight
• Utility computing
– Provision Hadoop clusters on-demand in the cloud
– Lower barriers to entry for tackling large-data problems
– Commoditization and democratization of large-data capabilities
Spread Work Over Many Machines
• Challenges
– Workload partitioning: how do we assign work units to workers?
– Load balancing: what if we have more work units than workers?
– Synchronization: what if workers need to share partial results?
– Aggregation: how do we aggregate partial results?
– Termination: how do we know all the workers have finished?
– Fault tolerance: what if workers die?
• Common theme
– Communication between workers (e.g., to exchange states)
– Access to shared resources (e.g., data)
• We need a synchronization mechanism
Current Methods
• Programming models
– Shared memory (pthreads)
– Message passing (MPI)
• Design Patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figures: shared memory shared by processes P1–P5 vs. message passing between processes P1–P5; a master dispatching a work queue to slaves; a producer-consumer flow]
Problem with Current Solutions
• Lots of programming work
– communication and coordination
– workload partitioning
– status reporting
– optimization
– locality
• Repeat for every problem you want to solve
• Stuff breaks
– One server may stay up three years (1,000 days)
– If you have 10,000 servers, expect to lose 10 a day
What We Need
• A Distributed System
– Scalable
– Fault-tolerant
– Easy to program
– Applicable to many problems
– …
How Do We Scale Up?
• Divide & Conquer
[Figure: the “Work” is partitioned into units w1, w2, w3; each is processed by a “worker”, producing partial results r1, r2, r3, which are combined into the final “Result”]
General Ideas
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
• Key idea: provide a functional abstraction for these two
operations, map and reduce (see the sketch after this list)
– map (k, v) → [<k’, v’>]
– reduce (k’, [v’]) → [<k’’, v’’>]
• All values with the same key are sent to the same reducer
– The execution framework handles everything else…
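The bullets above describe the whole model. Here is a minimal, single-process sketch of it in Python; this is a toy stand-in, not Hadoop code, and the grouping loop simply plays the role of the framework's shuffle and sort:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-process model of MapReduce: map, shuffle/sort, reduce."""
    # Map: every input (key, value) record yields zero or more intermediate pairs
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))

    # Shuffle and sort: group all intermediate values by their key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce: each key and its list of values produce zero or more output pairs
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output
```

The word-count functions sketched later can be plugged straight into this driver.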
General Ideas
[Figure: four mappers emit intermediate pairs such as (a,1); (b,2), (c,3); (c,6), (a,5); (c,2), (b,7), (c,8). Shuffle and sort aggregates values by key, so the reducers receive a → [1, 5], b → [2, 7], c → [2, 3, 6, 8] and produce results (r1,s1), (r2,s2), (r3,s3)]
Two More Functions
• Apart from Map and Reduce, the execution framework
handles everything else…
• Not quite…usually, programmers can also specify:
– partition (k’, number of partitions) → partition for k’
• Divides up key space for parallel reduce operations
• Often a simple hash of the key, e.g., hash(k’) mod n
– combine (k’, [v’]) → [<k’, v’>]
• Mini-reducers that run in memory after the map phase
• Used as an optimization to reduce network traffic
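A sketch of the default-style partitioner described above (a simple hash of the key modulo the number of reducers); this is an illustration of the idea, not Hadoop's actual HashPartitioner:

```python
def partition(key, num_partitions):
    """Assign key k' to one of n reducers: hash(k') mod n.
    Equal keys always land in the same partition (within one run)."""
    return hash(key) % num_partitions

# All intermediate pairs for a given key go to the same reducer:
assert partition("obama", 4) == partition("obama", 4)
```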
[Figure: the same dataflow with a combiner and a partitioner inserted after each mapper. For example, the intermediate pairs (c,3) and (c,6) are combined into (c,9) before the shuffle, so after shuffle and sort the reducer for key c receives [2, 9, 8] instead of [2, 3, 6, 8]; a → [1, 5] and b → [2, 7] are unchanged]
Motivation for Local Aggregation
• Ideal scaling characteristics:
– Twice the data, twice the running time
– Twice the resources, half the running time
• Why can’t we achieve this?
– Synchronization requires communication
– Communication kills performance
• Thus… avoid communication!
– Reduce intermediate data via local aggregation
– Combiners can help
Word Count v1.0
• Input: {<document-id, document-contents>}
• Output: <word, num-occurrences-in-web>. e.g. <“obama”, 1000>
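The slide's pseudocode did not survive extraction; a minimal sketch of the basic algorithm it describes (emit <word, 1> per token, sum in the reducer), reusing the toy run_mapreduce driver sketched earlier:

```python
def wc_map(doc_id, doc_contents):
    """Word count v1.0 mapper: one <word, 1> pair per token."""
    for word in doc_contents.split():
        yield (word, 1)

def wc_reduce(word, counts):
    """Word count v1.0 reducer: sum the partial counts for each word."""
    yield (word, sum(counts))

docs = [("doc1", "obama is the president"),
        ("doc2", "hennesy is the president of stanford")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('hennesy', 1), ('is', 2), ('obama', 1), ('of', 1),
#  ('president', 2), ('stanford', 1), ('the', 2)]
```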
Example dataflow:
Map:
<doc1, “obama is the president”> → <“obama”, 1>, <“is”, 1>, <“the”, 1>, <“president”, 1>
<doc2, “hennesy is the president of stanford”> → <“hennesy”, 1>, <“is”, 1>, <“the”, 1>, …
…
<docn, “this is an example”> → <“this”, 1>, <“is”, 1>, <“an”, 1>, <“example”, 1>
Group by reduce key:
<“the”, {1, 1}>, <“obama”, {1}>, <“is”, {1, 1, 1}>, …
Reduce:
<“the”, 2>, <“obama”, 1>, <“is”, 3>, …
Word Count v2.0
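The code for this version was an image and is not reproduced here; a plausible sketch, assuming the common first refinement: aggregate counts inside each map call (one associative array per document) so far fewer intermediate pairs are emitted:

```python
from collections import Counter

def wc_map_v2(doc_id, doc_contents):
    """Per-document aggregation: emit <word, count-in-this-document>
    instead of one <word, 1> pair per token."""
    for word, n in Counter(doc_contents.split()).items():
        yield (word, n)

# The reducer is unchanged: it still sums whatever partial counts arrive.
```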
Word Count v3.0
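Again the slide's code is missing; a plausible sketch, assuming the usual second refinement, in-mapper combining: the mapper keeps one counter across all documents it processes and emits only when it has consumed its whole input split. The map/close structure below is hypothetical and simply mirrors Hadoop's setup/map/cleanup lifecycle:

```python
from collections import Counter

class WordCountMapperV3:
    """In-mapper combining: state is preserved across map() calls."""
    def __init__(self):                   # plays the role of setup()
        self.counts = Counter()

    def map(self, doc_id, doc_contents):
        self.counts.update(doc_contents.split())
        return []                         # nothing emitted per document

    def close(self):                      # plays the role of cleanup()
        return list(self.counts.items())  # emit <word, total> once per mapper
```

This trades mapper memory (the counter must fit in RAM) for much less intermediate data, and it is only safe because counting is associative and commutative.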
Combiner Design
• Combiners and reducers share the same method signature
– Sometimes, reducers can serve as combiners
– Often, not…
• Remember: combiners are optional optimizations
– Should not affect algorithm correctness
– May be run 0, 1, or multiple times
• Example: find average of all integers associated with
the same key
Computing the Mean v1.0
Why can’t we use the reducer as the combiner?
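The code was an image; a minimal sketch of the naive version, with the answer to the question above in the comments:

```python
def mean_map(key, value):
    """Naive version: pass each (key, value) straight through."""
    yield (key, value)

def mean_reduce(key, values):
    """Reducer computes the mean of all values for the key."""
    values = list(values)
    yield (key, sum(values) / len(values))

# Why the reducer cannot be reused as the combiner:
#   mean(mean(1, 2), mean(3, 4, 5)) = mean(1.5, 4.0) = 2.75
#   mean(1, 2, 3, 4, 5)             = 3.0
# The mean of partial means is not the overall mean, and a combiner may
# run 0, 1, or many times, so the result would depend on whether it ran.
```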
Computing the Mean v2.0
• Why doesn’t this work?
– Combiners must have the same input and output key-value types, which
must also match the mapper output type and the reducer input type
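A plausible sketch of the broken attempt, assuming the usual presentation: the combiner emits <sum, count> pairs to stay correct, but that changes the value type between the mapper's output and the reducer's input, which violates the constraint above (and the combiner might not run at all):

```python
def mean_map_v2(key, value):
    yield (key, value)                          # value type: a plain number

def mean_combine_v2(key, values):
    values = list(values)
    yield (key, (sum(values), len(values)))     # value type: a (sum, count) pair!

def mean_reduce_v2(key, pairs):
    total = count = 0
    for s, c in pairs:                          # breaks if the combiner never ran
        total += s
        count += c
    yield (key, total / count)
```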
Computing the Mean v3.0
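The slide's code is missing; a sketch assuming the standard fix: make the mapper emit <sum, count> pairs too, so the combiner's input and output types match and it can safely run zero or more times:

```python
def mean_map_v3(key, value):
    yield (key, (value, 1))                     # every stage uses (sum, count) pairs

def mean_combine_v3(key, pairs):
    total = count = 0
    for s, c in pairs:
        total += s
        count += c
    yield (key, (total, count))                 # same type in and out

def mean_reduce_v3(key, pairs):
    total = count = 0
    for s, c in pairs:
        total += s
        count += c
    yield (key, total / count)                  # only the reducer divides
```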
Computing the Mean v4.0
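A sketch assuming the usual last step, in-mapper combining: keep running sums and counts across the mapper's whole input and emit once at the end (the map/close structure is hypothetical, mirroring Hadoop's map/cleanup lifecycle):

```python
class MeanMapperV4:
    """In-mapper combining: per-key running (sum, count) state."""
    def __init__(self):
        self.sums = {}
        self.counts = {}

    def map(self, key, value):
        self.sums[key] = self.sums.get(key, 0) + value
        self.counts[key] = self.counts.get(key, 0) + 1
        return []                               # nothing emitted per record

    def close(self):
        # Emit one (sum, count) pair per key when the input split is exhausted
        return [(k, (self.sums[k], self.counts[k])) for k in self.sums]

# The reducer is the same as in v3.0: sum the pairs and divide.
```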
MapReduce Runtime
• Handles scheduling
– Assigns workers to map and reduce tasks
• Handles “data distribution”
– Moves processes to data
• Handles synchronization
– Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
– Detects worker failures and restarts their tasks
• Everything happens on top of a distributed FS
Execution
[Figure (execution overview): (1) the user program submits the job to the master; (2) the master schedules map tasks and reduce tasks onto workers; (3) map workers read their input splits (split 0 – split 4); (4) map output is written to local disk as intermediate files; (5) reduce workers remote-read the intermediate data; (6) reduce workers write the output files (output file 0, output file 1). Phases: input files → map phase → intermediate files (on local disk) → reduce phase → output files]
Implementation
• Google has a proprietary implementation in C++
– Bindings in Java, Python
• Hadoop is an open-source implementation in Java
– Development led by Yahoo, used in production
– Now an Apache project
– Rapidly expanding software ecosystem
• Lots of custom research implementations
– For GPUs, cell processors, etc.
Distributed File System
• Don’t move data to workers… move workers to the
data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local
• Why?
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput (data transfer rate) is
reasonable
• A distributed file system is the answer
– GFS (Google File System) for Google’s MapReduce
– HDFS (Hadoop Distributed File System) for Hadoop
GFS
• Commodity hardware over “exotic” hardware
– Scale “out”, not “up”
• Scale out (horizontally): add more nodes to a system
• Scale up (vertically): add resources to a single node in a system
• High component failure rates
– Inexpensive commodity components fail all the time
• “Modest” number of huge files
– Multi-gigabyte files are common, if not encouraged
• Files are write-once, mostly appended to
– Perhaps concurrently
• Large streaming reads over random access
– High sustained throughput over low latency
Seeks vs. Scans
• Consider a 1 TB database with 100-byte records
– We want to update 1 percent of the records
• Scenario 1: random access
– Each update takes ~30 ms (seek, read, write)
– 10^8 updates (1% of the 10^10 records) ≈ 35 days
• Scenario 2: rewrite all records
– Assume 100 MB/s throughput
– Read and rewrite the full 1 TB (~2 TB of I/O) ≈ 5.6 hours(!)
• Lesson: avoid random seeks!
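A quick check of that arithmetic (the 30 ms per random update and 100 MB/s sequential throughput are the slide's assumptions):

```python
records = 10**12 // 100          # 1 TB of 100-byte records = 10^10 records
updates = records // 100         # update 1% of them = 10^8 updates

random_secs = updates * 0.030                 # ~30 ms per seek + read + write
print(random_secs / 86400)                    # ≈ 34.7 days

sequential_secs = 2 * 10**12 / (100 * 10**6)  # read + rewrite 1 TB at 100 MB/s
print(sequential_secs / 3600)                 # ≈ 5.6 hours
```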
GFS
• Files stored as chunks
– Fixed size (64MB)
• Reliability through replication
– Each chunk replicated across 3+ chunk servers
• Single master to coordinate access, keep metadata
– Simple centralized management
• No data caching
– Little benefit due to large datasets, streaming reads
• Simplify the API
– Push some of the issues onto the client (e.g., data layout)
Relational Databases vs. MapReduce
• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Workloads
• OLTP (online transaction processing)
– Typical applications: e-commerce, banking, airline reservations
– User facing: real-time, low latency, highly concurrent
– Tasks: relatively small set of “standard” transactional queries
– Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
– Typical applications: business intelligence, data mining
– Back-end processing: batch workloads, less concurrency
– Tasks: complex analytical queries, often ad hoc
– Data access pattern: table scans, large amounts of data involved per query
OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services
Relational Algebra in MapReduce
• Projection
– Map over tuples, emit new tuples with appropriate attributes
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in reducer, after some other processing
• Selection
– Map over tuples, emit only tuples that meet criteria
– No reducers, unless for regrouping or resorting tuples
– Alternatively: perform in reducer, after some other processing
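Minimal sketches of both patterns as map-only jobs; tuples are modeled as Python dicts, and the attribute names and the predicate are invented for illustration:

```python
def project_map(key, tup):
    """Projection: keep only the attributes of interest."""
    yield (key, {attr: tup[attr] for attr in ("url", "time")})

def select_map(key, tup):
    """Selection: emit the tuple only if it meets the criterion."""
    if tup["time"] > 60:                  # hypothetical predicate
        yield (key, tup)
```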
Relational Algebra in MapReduce
• Group by
– Example: What is the average time spent per URL?
– In SQL:
• SELECT url, AVG(time) FROM visits GROUP BY url
– In MapReduce:
• Map over tuples, emit time, keyed by url
• Framework automatically groups values by keys
• Compute average in reducer
• Optimize with combiners
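A sketch of that group-by average (visits are modeled as dicts with "url" and "time" fields); the combiner optimization mentioned above would require emitting <sum, count> pairs, exactly as in the mean example earlier:

```python
def visits_map(key, visit):
    # Emit time keyed by url; the framework groups the times per url
    yield (visit["url"], visit["time"])

def avg_time_reduce(url, times):
    times = list(times)
    yield (url, sum(times) / len(times))
```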
Join in MapReduce
• Reduce-side Join: group by join key
– Map over both sets of tuples
– Emit tuple as value with join key as the intermediate key
– Execution framework brings together tuples sharing the same key
– Perform actual join in reducer
– Similar to a “sort-merge join” in database terminology
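A sketch of the reduce-side join in those bullets: every tuple is tagged with the relation it came from and keyed by its join attribute (assumed here to be stored under "join_key"), and the reducer pairs up the two sides:

```python
def rsjoin_map(relation, tup):
    """relation is "R" or "S"; emit the tuple keyed by its join attribute."""
    yield (tup["join_key"], (relation, tup))

def rsjoin_reduce(join_key, tagged):
    r_side, s_side = [], []
    for relation, tup in tagged:          # no guarantee R arrives before S
        (r_side if relation == "R" else s_side).append(tup)
    for r in r_side:                      # emit every matching R-S pair
        for s in s_side:
            yield (join_key, (r, s))
```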
Reduce-side Join: Example
[Figure: the mappers emit tuples R1, R4, S2, S3 keyed by their join keys; after the shuffle, the reducer for each join key sees all R and S tuples sharing that key (e.g., R1 grouped with S2 and S3, and R4 in its own group). Note: there is no guarantee whether the R or the S tuples arrive first]
Join in MapReduce
• Map-side Join: parallel scans
– Assume two datasets are sorted by the join key
[Figure: datasets R and S, each sorted and partitioned by the join key, with corresponding partitions side by side]
A sequential scan through both datasets performs the join (called a “merge join” in database terminology)
Join in MapReduce
• Map-side Join
– If datasets are sorted by join key, join can be accomplished by a
scan over both datasets
– How can we accomplish this in parallel?
• Partition and sort both datasets in the same manner
– In MapReduce:
• Map over one dataset, read from the corresponding partition of the other
• No reducers necessary (unless to repartition or resort)
Join in MapReduce
• In-memory Join
– Basic idea: load one dataset into memory, stream over other
dataset
• Works if R << S and R fits into memory
• Called a “hash join” in database terminology
– MapReduce implementation
• Distribute R to all nodes
• Map over S, each mapper loads R in memory, hashed by join key
• For every tuple in S, look up join key in R
• No reducers, unless for regrouping or resorting tuples
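A sketch of the in-memory (hash) join: R is assumed to have already been distributed to every node; each mapper builds a hash table over R once and probes it while streaming over S (again keyed by a hypothetical "join_key" attribute):

```python
from collections import defaultdict

def build_hash_table(r_tuples):
    """Load the smaller relation R into memory, hashed by join key."""
    table = defaultdict(list)
    for r in r_tuples:
        table[r["join_key"]].append(r)
    return table

def make_join_map(r_table):
    def join_map(key, s_tuple):
        # Probe R for every tuple of S; no reducer is needed
        for r in r_table.get(s_tuple["join_key"], []):
            yield (s_tuple["join_key"], (r, s_tuple))
    return join_map
```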
Which Join Algorithm to Use?
• In-memory join > map-side join > reduce-side join
– Why?
• Limitations of each?
– In-memory join: memory
– Map-side join: sort order and partitioning
– Reduce-side join: general purpose
Processing Relational Data: Summary
• MapReduce algorithms for processing relational data:
– Group by, sorting, partitioning are handled automatically by
shuffle/sort in MapReduce
– Selection, projection, and other computations (e.g., aggregation),
are performed either in mapper or reducer
– Multiple strategies for relational joins
• Complex operations require multiple MapReduce jobs
– Example: top ten URLs in terms of average time spent
– Opportunities for automatic optimization