Distributed Parallel Processing
The MapReduce Paradigm
Week 3 - Big Data Analysis
Américo Rio
Index
• Parallel processing
• Data vs compute
• MapReduce paradigm
• Examples
BDA3
2
Part I – Distributed Parallel
Processing
Why Parallel Processing?
• Big Data = terabytes/petabytes → too large to process on a single server.
• Sequential execution = too slow.
• Parallelism = split workload into smaller tasks and run simultaneously.
Parallel vs Concurrent Computing
• Parallel: multiple tasks executed at the same time.
• Concurrent: multiple tasks make progress together, but not necessarily
simultaneously.
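The distinction can be sketched in plain Python: processes can run truly in parallel on multiple cores, while CPU-bound threads in CPython interleave (they all make progress, but not simultaneously, because of the GIL). All names here are illustrative, not from any library in the slides:

```python
import threading
from multiprocessing import Pool

def square(n):
    # Stand-in for a CPU-bound task.
    return n * n

def run_parallel(numbers):
    # Parallel: separate processes can execute simultaneously on multiple cores.
    with Pool(2) as pool:
        return pool.map(square, numbers)

def run_concurrent(numbers):
    # Concurrent: all threads make progress together, but for CPU-bound
    # Python code CPython's GIL makes them take turns rather than
    # run at the same instant.
    results = [None] * len(numbers)

    def worker(i, n):
        results[i] = square(n)

    threads = [threading.Thread(target=worker, args=(i, n))
               for i, n in enumerate(numbers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(run_parallel([1, 2, 3]))    # [1, 4, 9]
    print(run_concurrent([1, 2, 3]))  # [1, 4, 9]
```

Both return the same result; the difference is only in how the work is scheduled.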
Example
Traditional vs Distributed
• Traditional: move all data into one machine, then process.
• Distributed: send computation to where the data already is.
• Saves network bandwidth, improves efficiency.
[Figure: traditional computing (incl. machine learning) moves the data to a central compute node; the Big Data paradigm (distributed computing) moves the compute to where the data is.]
Cluster Architecture
• Commodity servers (nodes) organized in racks.
• Nodes within racks have fast local connections.
• Racks linked by a slower backbone network.
• Parallelism achieved by distributing tasks across racks.
Example: Summing Billions of Records
• Input: a billion numbers.
• Strategy:
• Divide into splits.
• Each node computes partial sum.
• Master node aggregates partial sums → final result.
Workflow:
dataset → splits → parallel sums → final aggregation
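The split → partial sums → aggregation workflow can be sketched with Python's multiprocessing standing in for cluster nodes (all function names are illustrative):

```python
from multiprocessing import Pool

def make_splits(data, n_splits):
    # Divide the dataset into roughly equal splits.
    size = (len(data) + n_splits - 1) // n_splits
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(split):
    # Each "node" computes the sum of its own split.
    return sum(split)

def distributed_sum(data, n_splits=4):
    with Pool(n_splits) as pool:
        partials = pool.map(partial_sum, make_splits(data, n_splits))
    # The master aggregates the partial sums into the final result.
    return sum(partials)

if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))  # 5050
```

On a real cluster the splits would live on different machines; here the processes merely simulate that layout.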
Part II – The MapReduce Paradigm
Counting cats
• Children 1 to 7 count the cats.
Counting cats (2)
• Divide the cats into rooms.
• Put children 1-4 into the rooms to count the cats.
Counting cats (3)
• Children 1-4 count their subtotals.
Counting cats (4)
• Children 1-4 split their counts by cat type and hand them to children 5-7.
Counting cats (5)
• Each of children 5-7 counts their assigned type.
MapReduce: counting cats
MAP
SHUFFLE AND SORT
REDUCE
• Where do we have the most children allocated?
• Which children are working the hardest?
• How much information is being sent?
Why MapReduce?
• Distributed programming is complex: failures, scheduling,
communication.
• MapReduce hides this complexity.
• Programmer writes only Map and Reduce logic.
MapReduce Overview
Workflow:
• Map phase → process input split → emit key-value pairs.
• Shuffle/Sort phase → group intermediate pairs by key.
• Reduce phase → aggregate values for each key → final output.
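The three phases above can be sketched as a minimal single-machine driver in plain Python (the names `run_mapreduce`, `map_fn`, `reduce_fn` are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each (key, value) input record emits intermediate pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group the intermediate pairs by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: aggregate the grouped values for each key.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Usage: word count on one line of text.
counts = run_mapreduce(
    [(0, "big data is big")],
    map_fn=lambda key, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
# counts == {'big': 2, 'data': 1, 'is': 1}
```

A real framework runs the same three phases, but with mappers and reducers spread across machines and the shuffle going over the network.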
Map
• We start with lots of key-value pairs in our input
data, potentially billions.
• We cannot go through each individually, so we
typically split them into groups ("splits" in
MapReduce).
• A logical way to do this is simply to treat the data
in each block, or on each server, as a group.
• We MAP some function over this data, which
produces output as key-value pairs.
Shuffle
• We typically group together outputs that are
similar, for example those with the same key. This
is the shuffle phase and happens individually for
each mapper.
Sort
• Each mapper then sends its grouped keys to an
allocated reducer. The reducer receives data that is
sorted from each mapper, but when it merges it all
together the result may not be sorted overall. Each
reducer then needs to sort the data it has received.
Reduce
• The reduce function is then called, for every
unique key, on all the pairs with that key, outputting
zero, one, or more output pairs.
• Just like the map function, the reduce function is
called in batches, on each intermediate partition
(multiple calls, one per unique key).
Combine
• Often a map task will produce many pairs of the
form (k,v1), (k,v2), … for the same key k, e.g.
popular words in the word count example.
• We can save network time by pre-aggregating
values in the mapper: combine(k, list(v1, v2, …)) → (k, v).
• The combiner is usually the same as the reduce
function.
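The network saving can be seen in a small sketch (plain Python; `map_words` and `combine` are illustrative names):

```python
from collections import Counter

def map_words(line):
    # Map: emit (word, 1) for every word in the line.
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: pre-aggregate (k, v) pairs on the mapper
    # before they are sent over the network.
    counts = Counter()
    for k, v in pairs:
        counts[k] += v
    return sorted(counts.items())

pairs = map_words("big data is big")   # 4 pairs would cross the network
combined = combine(pairs)              # only 3 pairs after combining
# combined == [('big', 2), ('data', 1), ('is', 1)]
```

On text with many repeated words, the combiner shrinks the shuffle traffic from one pair per word occurrence to one pair per distinct word per mapper.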
Key-Value Model
• All data expressed as (key, value) pairs.
• Input, intermediate, and output all follow this model.
• Universal abstraction for logs, text, structured data.
Example: Word Count
• Input: huge text corpus.
• Task: count occurrences of each word.
• Why MapReduce? → easy to split, aggregate, and parallelize.
Word Count
• Map
• Mapper takes lines of text.
• Emits (word, 1).
• Example: input text line “big data is big” → output: (big,1), (data,1), (is,1), (big,1).
• Shuffle
• Groups by key.
• Example: [(big,1), (big,1), (data,1)] → (big, [1,1]), (data,[1]).
• Reduce
• Aggregates grouped values.
• Example: (big,[1,1]) → (big,2).
Word Count Code
def map_fn(key, value):
    # Emit a (word, 1) pair for every word in the input line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum the counts collected for one word.
    yield (key, sum(values))
Example
Part III – Case Study & Wrap-Up
Case Study: Temperature Records
• Input: (station, temp) records.
• Map: emit (station, temp).
• Reduce: max(temp) per station.
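A single-machine sketch of this job, with made-up readings and illustrative names (the grouping loop plays the role of the shuffle):

```python
from collections import defaultdict

def reduce_max(station, temps):
    # Reduce: keep the highest temperature seen for one station.
    return max(temps)

def max_temperatures(records):
    # Map: each (station, temp) record is emitted as-is,
    # then grouped by station (the shuffle step).
    groups = defaultdict(list)
    for station, temp in records:
        groups[station].append(temp)
    # Reduce per station.
    return {s: reduce_max(s, ts) for s, ts in groups.items()}

readings = [("LIS", 31), ("OPO", 27), ("LIS", 35), ("OPO", 24)]
# max_temperatures(readings) == {'LIS': 35, 'OPO': 27}
```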
Case Study: Retail Analytics
• Input: purchases by category.
• Task: top product per category.
• Map: (category, product).
• Reduce: aggregate counts.
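The same pattern, with counting in the reducer (sketch with made-up purchases; all names illustrative):

```python
from collections import Counter, defaultdict

def top_product_per_category(purchases):
    # Map: emit (category, product) per purchase; group by category (shuffle).
    groups = defaultdict(list)
    for category, product in purchases:
        groups[category].append(product)
    # Reduce: count the products in each category, keep the most frequent.
    return {c: Counter(ps).most_common(1)[0][0] for c, ps in groups.items()}

purchases = [("fruit", "apple"), ("fruit", "banana"),
             ("fruit", "apple"), ("drink", "tea")]
# top_product_per_category(purchases) == {'fruit': 'apple', 'drink': 'tea'}
```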
MapReduce
• Strengths of MapReduce
• Scales to thousands of nodes.
• Fault-tolerant.
• Simplifies distributed programming.
• Limitations
• Batch processing only (high latency).
• Heavy disk I/O between stages.
• Inefficient for iterative ML workloads.
Summary
• Distributed parallelism essential.
• MapReduce = foundation for Hadoop.
• Practical: Python MapReduce demo.