Distributed Parallel Processing
The MapReduce Paradigm
Week 3 - Big Data Analysis
Américo Rio
Index
• Parallel processing
• Data vs compute
• MapReduce paradigm
• Examples
BDA3
2
Part I – Distributed Parallel
Processing
Why Parallel Processing?
• Big Data = terabytes/petabytes → too large to process on a single server.
• Sequential execution = too slow.
• Parallelism = split workload into smaller tasks and run simultaneously.
Parallel vs Concurrent Computing
• Parallel: multiple tasks executed at the same time.
• Concurrent: multiple tasks make progress together, but not necessarily
simultaneously.
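The distinction can be sketched in plain Python: processes can run truly in parallel on multiple cores, while CPU-bound threads in CPython interleave (they all make progress, but not simultaneously, because of the GIL). All names here are illustrative, not from any library in the slides:

```python
import threading
from multiprocessing import Pool

def square(n):
    # Stand-in for a CPU-bound task.
    return n * n

def run_parallel(numbers):
    # Parallel: separate processes can execute simultaneously on multiple cores.
    with Pool(2) as pool:
        return pool.map(square, numbers)

def run_concurrent(numbers):
    # Concurrent: all threads make progress together, but for CPU-bound
    # Python code CPython's GIL makes them take turns rather than
    # run at the same instant.
    results = [None] * len(numbers)

    def worker(i, n):
        results[i] = square(n)

    threads = [threading.Thread(target=worker, args=(i, n))
               for i, n in enumerate(numbers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(run_parallel([1, 2, 3]))    # [1, 4, 9]
    print(run_concurrent([1, 2, 3]))  # [1, 4, 9]
```

Both return the same result; the difference is only in how the work is scheduled.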
Example
Traditional vs Distributed
• Traditional: move all data into one machine, then process.
• Distributed: send computation to where the data already is.
• Saves network bandwidth, improves efficiency.
[Figure: traditional computing (incl. machine learning) moves the data to a central compute node; the Big Data paradigm (distributed computing) moves the compute to where the data is.]
Cluster Architecture
• Commodity servers (nodes) organized in racks.
• Nodes within racks have fast local connections.
• Racks linked by a slower backbone network.
• Parallelism achieved by distributing tasks across racks.
Example: Summing Billions of Records
• Input: a billion numbers.
• Strategy:
• Divide into splits.
• Each node computes partial sum.
• Master node aggregates partial sums → final result.
Workflow:
dataset → splits → parallel sums → final aggregation
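The split → partial sums → aggregation workflow can be sketched with Python's multiprocessing standing in for cluster nodes (all function names are illustrative):

```python
from multiprocessing import Pool

def make_splits(data, n_splits):
    # Divide the dataset into roughly equal splits.
    size = (len(data) + n_splits - 1) // n_splits
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(split):
    # Each "node" computes the sum of its own split.
    return sum(split)

def distributed_sum(data, n_splits=4):
    with Pool(n_splits) as pool:
        partials = pool.map(partial_sum, make_splits(data, n_splits))
    # The master aggregates the partial sums into the final result.
    return sum(partials)

if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))  # 5050
```

On a real cluster the splits would live on different machines; here the processes merely simulate that layout.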
Part II – The MapReduce Paradigm
Counting cats
• Children 1 to 7 count the cats.
Counting cats (2)
• Divide the cats into rooms.
• Put children 1-4 into the rooms to count the cats.
Counting cats (3)
• Children 1-4 count their subtotals.
Counting cats (4)
• Children 1-4 split their counts by cat type and hand them to children 5-7.
Counting cats (5)
• Each of children 5-7 counts their assigned type.
MapReduce: counting cats
MAP
SHUFFLE AND SORT
REDUCE
• Where do we have the most children allocated?
• Which children are working the hardest?
• How much information is being sent?
Why MapReduce?
• Distributed programming is complex: failures, scheduling,
communication.
• MapReduce hides this complexity.
• Programmer writes only Map and Reduce logic.
MapReduce Overview
Workflow:
• Map phase → process input split → emit key-value pairs.
• Shuffle/Sort phase → group intermediate pairs by key.
• Reduce phase → aggregate values for each key → final output.
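The three phases above can be sketched as a minimal single-machine driver in plain Python (the names `run_mapreduce`, `map_fn`, `reduce_fn` are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each (key, value) input record emits intermediate pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group the intermediate pairs by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: aggregate the grouped values for each key.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Usage: word count on one line of text.
counts = run_mapreduce(
    [(0, "big data is big")],
    map_fn=lambda key, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
# counts == {'big': 2, 'data': 1, 'is': 1}
```

A real framework runs the same three phases, but with mappers and reducers spread across machines and the shuffle going over the network.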
Map
• We start with lots of key-value pairs in our input
data, potentially billions.
• We cannot go through each individually, so we
typically split them into groups ("splits" in
MapReduce).
• A logical way to do this is simply to treat the data
in each block, or on each server, as a group.
• We MAP some function over this data, which
produces output as key-value pairs.
Shuffle
• We typically group together outputs that are
similar, for example those with the same key. This
is the shuffle phase and happens individually for
each mapper.
Sort
• Each mapper then sends its grouped keys to an
allocated reducer. The reducer receives data that is
sorted from each mapper, but when it merges it all
together the result may not be sorted overall. Each
reducer then needs to sort the data it has received.
Reduce
• The reduce function is then called, for every
unique key, on all the pairs with that key, outputting
zero, one, or more output pairs.
• Just like the map function, the reduce function is
called in batches, on each intermediate partition
(multiple calls, one per unique key).
Combine
• Often a map task will produce many pairs of the
form (k,v1), (k,v2), … for the same key k, e.g.
popular words in the word count example.
• We can save network time by pre-aggregating
values in the mapper: combine(k, list(v1, v2, …)) → (k, v).
• The combiner is usually the same as the reduce
function.
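The network saving can be seen in a small sketch (plain Python; `map_words` and `combine` are illustrative names):

```python
from collections import Counter

def map_words(line):
    # Map: emit (word, 1) for every word in the line.
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate (k, v) pairs on the mapper
    # before they are sent over the network.
    counts = Counter()
    for k, v in pairs:
        counts[k] += v
    return sorted(counts.items())

pairs = map_words("big data is big")   # 4 pairs would cross the network
combined = combine(pairs)              # only 3 pairs after combining
# combined == [('big', 2), ('data', 1), ('is', 1)]
```

On text with many repeated words, the combiner shrinks the shuffle traffic from one pair per word occurrence to one pair per distinct word per mapper.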
Key-Value Model
• All data expressed as (key, value) pairs.
• Input, intermediate, and output all follow this model.
• Universal abstraction for logs, text, structured data.
Example: Word Count
• Input: huge text corpus.
• Task: count occurrences of each word.
• Why MapReduce? → easy to split, aggregate, and parallelize.
Word Count
• Map
• Mapper takes lines of text.
• Emits (word, 1).
• Example: input text line “big data is big” → output: (big,1), (data,1), (is,1), (big,1).
• Shuffle
• Groups by key.
• Example: [(big,1), (big,1), (data,1)] → (big, [1,1]), (data,[1]).
• Reduce
• Aggregates grouped values.
• Example: (big,[1,1]) → (big,2).
Word Count Code
def map_fn(key, value):
    # Emit a (word, 1) pair for every word in the input line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum the counts collected for one word.
    yield (key, sum(values))
Example
Part III – Case Study & Wrap-Up
Case Study: Temperature Records
• Input: (station, temp) records.
• Map: emit (station, temp).
• Reduce: max(temp) per station.
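A single-machine sketch of this job, with made-up readings and illustrative names (the grouping loop plays the role of the shuffle):

```python
from collections import defaultdict

def reduce_max(station, temps):
    # Reduce: keep the highest temperature seen for one station.
    return max(temps)

def max_temperatures(records):
    # Map: each (station, temp) record is emitted as-is,
    # then grouped by station (the shuffle step).
    groups = defaultdict(list)
    for station, temp in records:
        groups[station].append(temp)
    # Reduce per station.
    return {s: reduce_max(s, ts) for s, ts in groups.items()}

readings = [("LIS", 31), ("OPO", 27), ("LIS", 35), ("OPO", 24)]
# max_temperatures(readings) == {'LIS': 35, 'OPO': 27}
```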
Case Study: Retail Analytics
• Input: purchases by category.
• Task: top product per category.
• Map: (category, product).
• Reduce: aggregate counts.
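The same pattern, with counting in the reducer (sketch with made-up purchases; all names illustrative):

```python
from collections import Counter, defaultdict

def top_product_per_category(purchases):
    # Map: emit (category, product) per purchase; group by category (shuffle).
    groups = defaultdict(list)
    for category, product in purchases:
        groups[category].append(product)
    # Reduce: count the products in each category, keep the most frequent.
    return {c: Counter(ps).most_common(1)[0][0] for c, ps in groups.items()}

purchases = [("fruit", "apple"), ("fruit", "banana"),
             ("fruit", "apple"), ("drink", "tea")]
# top_product_per_category(purchases) == {'fruit': 'apple', 'drink': 'tea'}
```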
MapReduce
• Strengths of MapReduce
• Scales to thousands of nodes.
• Fault-tolerant.
• Simplifies distributed programming.
• Limitations
• Batch processing only (high latency).
• Heavy disk I/O between stages.
• Inefficient for iterative ML workloads.
Summary
• Distributed parallelism essential.
• MapReduce = foundation for Hadoop.
• Practical: Python MapReduce demo.