Programming at Scale: Concurrency & Consistency
NETS 212: Scalable & Cloud Computing, Fall 2014
Z. Ives, University of Pennsylvania
© 2011-14 A. Haeberlen, Z. Ives

Logistics
• Homework 0 is due tonight @ 10PM
• Over the next 24 hours: Amazon AWS codes for credit
• Homework 1 will be out by Thursday
  • 2 milestones, tentatively the 18th and 25th

Course Road Map
• Overview of scalable & cloud computing
• Basics: scalability, concurrency, consistency, ...
• Cloud basics: EC2, EBS, S3, SimpleDB, ...
• Cloud programming: MapReduce, Hadoop, ...
• Algorithms: PageRank, adsorption, ...
• Web programming: servlets, XML, Ajax, ...
• Beyond MapReduce: Dryad, Hive, PigLatin, ...

Goal: Building Fast, Scalable Software
• How do we speed up software?
• What are the hard problems?
• How do they affect what we can do and how we write the software?

What is scalability?
• A system is scalable if it can easily adapt to increased (or reduced) demand
  • Example: a storage system might start with a capacity of just 10TB but can grow to many PB by adding more nodes
• Scalability is usually limited by some sort of bottleneck
• Often, scalability also means...
  • the ability to operate at a very large scale
  • the ability to grow efficiently
  • Example: 4x as many nodes should give ~4x the capacity (not just 2x!)

Scaling up by parallelizing

An Obvious Way to Scale
• Throw more hardware at the problem!
• In this case, go from one processor to many, or from one computer to many

Scaling up via hardware increases the complexity of algorithms
[Figure: a spectrum of systems, with more challenges at each step]
• Single-core machine: known from CIS120
• Multicore server: true concurrency; microsecond delays
• Cluster: network message passing; more failure modes (faulty nodes, ...); millisecond delays
• Large-scale distributed system (wide-area network): even more failure modes; incentives, laws, ...; 100-millisecond delays

A First Step: Parallelism in One Machine -- Symmetric Multiprocessing (SMP)
[Figure: several cores, each with its own cache, attached to a shared memory via a memory bus]
• For now, assume we have multiple cores that can access the same shared memory
• Any core can access any byte; speed is uniform (no byte takes longer to read or write than any other)
• Not all machines are like that -- other models are discussed later

Parallelizing an Algorithm

    void bubblesort(int nums[]) {
        boolean done = false;
        while (!done) {
            done = true;
            for (int i = 1; i < nums.length; i++) {
                if (nums[i-1] > nums[i]) {
                    swap(nums, i-1, i);   // exchange nums[i-1] and nums[i] (swap helper assumed)
                    done = false;
                }
            }
        }
    }

    int[] mergesort(int nums[]) {
        int numPieces = 10;
        int pieces[][] = split(nums, numPieces);   // split helper assumed
        for (int i = 0; i < numPieces; i++)
            sort(pieces[i]);                       // <-- can be done in parallel!
        return merge(pieces);                      // merge helper assumed
    }

• The first algorithm (bubblesort) works fine on one core
• Can we make it faster on multiple cores?
  • Difficult -- we need to find something for the other cores to do
  • There are other sorting algorithms (such as the mergesort variant above) where this is much easier
• Not all algorithms are equally parallelizable
• Can you have scalability without parallelism?
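To make the parallel opportunity concrete, here is a minimal sketch (an illustration, not the course's reference solution) of the mergesort variant above with one Java thread per piece on an SMP machine. The split-and-merge structure mirrors the pseudocode; Arrays.sort stands in for the per-piece sort, and the piece count and thread-per-piece layout are assumptions of this sketch.

    import java.util.Arrays;

    // Illustrative sketch: sort each piece on its own thread, then merge sequentially.
    public class ParallelMergeSort {

        // Merge two already-sorted arrays into one sorted array.
        static int[] merge(int[] a, int[] b) {
            int[] out = new int[a.length + b.length];
            int i = 0, j = 0, k = 0;
            while (i < a.length && j < b.length)
                out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
            while (i < a.length) out[k++] = a[i++];
            while (j < b.length) out[k++] = b[j++];
            return out;
        }

        static int[] parallelSort(int[] nums, int numPieces) throws InterruptedException {
            // Split the input into roughly equal pieces.
            int[][] pieces = new int[numPieces][];
            int chunk = (nums.length + numPieces - 1) / numPieces;
            for (int p = 0; p < numPieces; p++) {
                int from = Math.min(p * chunk, nums.length);
                int to = Math.min(from + chunk, nums.length);
                pieces[p] = Arrays.copyOfRange(nums, from, to);
            }

            // Sort each piece on its own thread -- the part that can use multiple cores.
            Thread[] workers = new Thread[numPieces];
            for (int p = 0; p < numPieces; p++) {
                final int[] piece = pieces[p];
                workers[p] = new Thread(() -> Arrays.sort(piece));
                workers[p].start();
            }
            for (Thread w : workers) w.join();   // wait for all pieces to finish

            // The merge step runs on a single core.
            int[] result = new int[0];
            for (int[] piece : pieces) result = merge(result, piece);
            return result;
        }

        public static void main(String[] args) throws InterruptedException {
            int[] nums = { 9, 3, 7, 1, 8, 2, 6, 5, 4, 0 };
            System.out.println(Arrays.toString(parallelSort(nums, 4)));
        }
    }

Note that the final merge still runs sequentially; that sequential tail is exactly what limits the speedup, as the next slides on Amdahl's law make precise.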
Scalability
• Speedup with N cores:  S_N = T_1 / T_N
  • T_1 = completion time with one core; T_N = completion time with N cores
[Graph: numbers sorted per second vs. cores used, comparing the ideal (linear) curve with the expected one]
• If we increase the number of processors, will the speed also increase?
  • Yes, but (in almost all cases) only up to a point
  • Why?

Amdahl's Law
[Figure: timelines of the same program run with different numbers of cores -- the parallel part shrinks as cores are added, but the sequential parts take the same amount of time regardless]
• Usually, not all parts of the algorithm can be parallelized
• Let f be the fraction of the algorithm that can be parallelized, and let S_part be the corresponding speedup
• Then:  S_overall = 1 / ((1 - f) + f / S_part)

Is more parallelism always better?
[Graph: numbers sorted per second vs. cores -- the ideal and expected curves keep rising, but reality (often) peaks at a "sweet spot" and then declines]
• Increasing parallelism beyond a certain point can cause performance to decrease! Why?
  • Communication / synchronization between the parallel tasks!
  • Example: we need to send a message to each core to tell it what to do

Granularity
• How big a task should we assign to each core?
  • Coarse-grain vs. fine-grain parallelism
• Frequent coordination creates overhead
  • Need to send messages back and forth, wait for other cores...
  • Result: cores spend most of their time communicating
• Coarse-grain parallelism is usually more efficient
  • Bad: ask each core to sort three numbers
  • Good: ask each core to sort a million numbers

Dependencies
[Figure: two task graphs from START to DONE -- one with independent tasks ("embarrassingly parallel"), one with dependencies between tasks]
• What if tasks depend on other tasks?
  • Example: we need to sort the lists before we can merge them
  • Dependencies limit the degree of parallelism
  • The minimum completion time (and thus the maximum speedup) is determined by the longest path from start to finish
    • This assumes resources are plentiful; the actual speedup may be lower

Heterogeneity
• What if...
  • some tasks are larger than others?
  • some tasks are harder than others?
  • some tasks are more urgent than others?
  • not all cores are equally fast, or they have different resources?
• Result: a scheduling problem
  • Can be very difficult

Recap: Parallelization
• Parallelization is hard
  • Not all algorithms are equally parallelizable -- we need to pick very carefully
• Scalability is limited by many things
  • Amdahl's law
  • Dependencies between tasks
  • Communication overhead
  • Heterogeneity
  • ...

Synchronization

We often need to synchronize among concurrent (parallel) tasks

    void transferMoney(customer A, customer B, int amount) {
        showMessage("Transferring " + amount + " to " + B);
        int balanceA = getBalance(A);
        int balanceB = getBalance(B);
        setBalance(B, balanceB + amount);
        setBalance(A, balanceA - amount);
        showMessage("Your new balance: " + (balanceA - amount));
    }

• Simple example: the accounting system in a bank
  • Maintains the current balance of each customer's account
  • Customers can transfer money to other customers
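Before the interleaving diagram on the next slide, here is a small, self-contained Java sketch (my own illustration, not from the slides) that makes the danger concrete: two threads run unsynchronized transfers against shared balances, structured like transferMoney above. The class name, static balance fields, and concrete amounts are assumptions of this sketch.

    // Illustration of the race: two concurrent, unsynchronized transfers
    // over shared balances can overwrite each other's updates.
    public class RaceConditionDemo {
        static int aliceBalance = 200;
        static int bobBalance   = 800;

        // Same shape as transferMoney: read both balances first, then write both back.
        static void aliceToBob(int amount) {
            int a = aliceBalance;
            int b = bobBalance;
            bobBalance   = b + amount;
            aliceBalance = a - amount;
        }

        static void bobToAlice(int amount) {
            int a = aliceBalance;
            int b = bobBalance;
            aliceBalance = a + amount;
            bobBalance   = b - amount;
        }

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> aliceToBob(100));
            Thread t2 = new Thread(() -> bobToAlice(500));
            t1.start(); t2.start();
            t1.join();  t2.join();

            // The total should always be $1,000, but depending on how the reads
            // and writes interleave, one transfer can overwrite the other.
            System.out.println("Alice: " + aliceBalance + ", Bob: " + bobBalance
                             + ", total: " + (aliceBalance + bobBalance));
        }
    }

With such a short critical section the bad interleaving is rare in practice; running the transfers in a loop, or sleeping briefly between the reads and the writes, makes it easy to observe.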
Why do we need synchronization?
• Setup: Alice's balance is $200 and Bob's balance is $800. Alice transfers $100 to Bob while, at the same time, Bob transfers $500 to Alice:

  Alice's thread:                     Bob's thread:
    1) B = Balance(Bob)                 1) A = Balance(Alice)
    2) A = Balance(Alice)               2) B = Balance(Bob)
    3) SetBalance(Bob, B+100)           3) SetBalance(Alice, A+500)
    4) SetBalance(Alice, A-100)         4) SetBalance(Bob, B-500)

• What can happen if this code runs concurrently?
  [Timeline: in the interleaving shown, both threads read the old balances before either one writes, so each thread's updates overwrite the other's. The run ends with Alice at $100 and Bob at $300 -- $600 has vanished, whereas the correct outcome would be Alice at $600 and Bob at $400.]

Problem: Race Condition
    (the same transferMoney code as before, now executed concurrently by Alice's and Bob's threads of execution)
• What happened?
  • Race condition: the result of the computation depends on the exact timing of the two threads of execution, i.e., on the order in which the instructions are executed
  • Reason: concurrent updates to the same state
• Can you get a race condition when all the threads are reading the data, and none of them are updating it?

Goal: Consistency
• What should have happened?
  • Intuition: it shouldn't make a difference whether the requests are executed concurrently or not
• How can we formalize this?
  • We need a consistency model that specifies how the system should behave in the presence of concurrency

The Strictest Model: Sequential Consistency (Serializability)
[Figure: an actual execution of operations T1-T6 on two cores, next to a hypothetical execution of the same operations on a single core that starts from the same state and produces the same result]
• Sequential consistency: the result of any execution is the same as if the operations of all the cores had been executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program

Other consistency models
• Strong consistency
  • After an update completes, all subsequent accesses will return the updated value
• Weak consistency
  • After an update completes, accesses do not necessarily return the updated value; some condition must be satisfied first
  • Example: the update needs to reach all the replicas of the object
• Eventual consistency
  • Specific form of weak consistency: if no more updates are made to an object, then eventually all reads will return the latest value
  • Variants: causal consistency, read-your-writes, monotonic writes, ...
• More of this later in the course!
  • Why would we want any of these? Who uses them? How do we build systems / algorithms / services that achieve them?

Building Blocks for Consistency: Mutual Exclusion, Locking, Concurrency Control

Synchronization Primitive: Mutual Exclusion
    (transferMoney again; the critical section is the four lines that read and update the two balances)
• How can we achieve better consistency?
• Key insight: the code has a critical section where accesses from other cores to the same resources will cause problems
• Approach: mutual exclusion
  • Enforce the restriction that only one core (or machine) can execute the critical section at any given time
  • What does this mean for scalability? (A sketch of one way to do this in Java follows below.)
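As a concrete, purely illustrative Java sketch of mutual exclusion: wrapping the critical section in a single lock makes the two transfers execute one at a time. The single global lock, the class name, and the signed-amount convention are assumptions of this sketch; the next slides discuss where locks should actually go.

    import java.util.concurrent.locks.ReentrantLock;

    // Mutual exclusion sketch: one lock guards the whole critical section,
    // so only one transfer can read-and-update the balances at a time.
    public class LockedTransfer {
        static int aliceBalance = 200;
        static int bobBalance   = 800;
        static final ReentrantLock accountsLock = new ReentrantLock();

        // amount > 0: Alice pays Bob; amount < 0: Bob pays Alice
        static void transfer(int amount) {
            accountsLock.lock();            // enter the critical section
            try {
                int a = aliceBalance;       // reads and writes happen while
                int b = bobBalance;         // no other thread can interleave
                bobBalance   = b + amount;
                aliceBalance = a - amount;
            } finally {
                accountsLock.unlock();      // always leave the critical section
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> transfer(100));    // Alice sends Bob $100
            Thread t2 = new Thread(() -> transfer(-500));   // Bob sends Alice $500
            t1.start(); t2.start();
            t1.join();  t2.join();
            // With mutual exclusion, the result is always Alice=$600, Bob=$400.
            System.out.println("Alice: " + aliceBalance + ", Bob: " + bobBalance);
        }
    }

This corresponds to "Option #1: one lock around the critical section" on the next slide; per-account locks allow more concurrency but, as the deadlock example shows, they have to be acquired carefully.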
Locking
    (transferMoney again, with the same critical section)
• Idea: implement locks
  • If LOCK(X) is called and X is not locked, lock X and continue
  • If LOCK(X) is called and X is locked, wait until X is unlocked
  • If UNLOCK(X) is called and X is locked, unlock X
• How many locks, and where do we put them?
  • Option #1: one lock around the critical section
  • Option #2: one lock per variable (A's and B's balance)
  • Pros and cons? Other options?

Locking helps!
• Same scenario as before (Alice: $200, Bob: $800; Alice transfers $100 to Bob while Bob transfers $500 to Alice), but now each transfer locks both accounts around its critical section:

  Alice's thread:                     Bob's thread:
    1) LOCK(Bob)                        1) LOCK(Alice)
    2) LOCK(Alice)                      2) LOCK(Bob)
    3) B = Balance(Bob)                 3) A = Balance(Alice)
    4) A = Balance(Alice)               4) B = Balance(Bob)
    5) SetBalance(Bob, B+100)           5) SetBalance(Alice, A+500)
    6) SetBalance(Alice, A-100)         6) SetBalance(Bob, B-500)
    7) UNLOCK(Alice)                    7) UNLOCK(Bob)
    8) UNLOCK(Bob)                      8) UNLOCK(Alice)

  [Timeline: Alice's thread acquires both locks first; Bob's thread blocks on LOCK(Alice) until Alice's thread releases it, then runs to completion. The two transfers are serialized and the run ends with the correct balances: Alice at $600 and Bob at $400.]

The Downside of Locks: Deadlocks
• Same code, different timing:
  [Timeline: Alice's thread executes LOCK(Bob) and Bob's thread executes LOCK(Alice) at roughly the same time. Alice's thread then blocks waiting for the lock on Alice, while Bob's thread blocks waiting for the lock on Bob.]
• Neither processor can make progress!

The dining philosophers problem
[Figure: five philosophers around a table, with one fork between each pair of neighbors]
• Each philosopher repeats forever:
    think
    pick up left fork
    pick up right fork
    eat
    put down forks
• If every philosopher picks up his left fork at the same time, no one can pick up a right fork -- a deadlock

What to do about deadlocks
• Many possible solutions, including:
  • Lock manager: hire a waiter and require that philosophers ask the waiter before picking up any forks
    • Consequences for scalability?
  • Resource hierarchy: number the forks 1-5 and require that each philosopher pick up the fork with the lower number first
    • Problem?
  • Chandy/Misra solution:
    • Forks can either be dirty or clean; initially all forks are dirty
    • After a philosopher has eaten, all of his forks are dirty
    • When a philosopher needs a fork he can't get, he asks his neighbor
    • If a philosopher is asked for a dirty fork, he cleans it and gives it up
    • If a philosopher is asked for a clean fork, he keeps it

Alternatives to Locks
• Locks (and deadlocks) aren't the only solution
• Another approach -- "optimistic" concurrency control (sketched below):
  • Update things concurrently
  • Try to "merge" the effects (reconcile the accounts)
  • If that is not possible, do things in sequence
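A hedged sketch of the optimistic approach (my illustration; the slides do not prescribe an implementation): each update reads a snapshot of the balances, computes new values without holding any lock, and only commits if nothing has changed in the meantime, retrying otherwise. Java's AtomicReference with compareAndSet stands in for the "validate and commit" step; the class and field names are assumptions.

    import java.util.concurrent.atomic.AtomicReference;

    // Optimistic concurrency control sketch: read a snapshot, compute the new
    // balances without locking, then commit only if the snapshot is still current.
    public class OptimisticTransfer {
        // Immutable snapshot of both balances.
        static final class Balances {
            final int alice, bob;
            Balances(int alice, int bob) { this.alice = alice; this.bob = bob; }
        }

        static final AtomicReference<Balances> state =
            new AtomicReference<>(new Balances(200, 800));

        // amount > 0: Alice pays Bob; amount < 0: Bob pays Alice
        static void transfer(int amount) {
            while (true) {
                Balances old = state.get();                       // optimistic read
                Balances updated = new Balances(old.alice - amount,
                                                old.bob + amount);
                if (state.compareAndSet(old, updated))            // commit if unchanged
                    return;
                // Someone else committed first -- retry against the new state.
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> transfer(100));    // Alice -> Bob: $100
            Thread t2 = new Thread(() -> transfer(-500));   // Bob -> Alice: $500
            t1.start(); t2.start();
            t1.join();  t2.join();
            Balances b = state.get();
            System.out.println("Alice: " + b.alice + ", Bob: " + b.bob);  // 600, 400
        }
    }

There is no lock and hence no deadlock, but under heavy contention the retry loop may spin repeatedly; that trade-off resurfaces later when we look at weaker consistency models.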
Plan for the next two lectures
• Parallel programming and its challenges
  • Parallelization and scalability, Amdahl's law
  • Synchronization, consistency
  • Mutual exclusion, locking, issues related to locking
  • Architectures: SMP, NUMA, Shared-nothing   <-- NEXT
• All about the Internet in 30 minutes
  • Structure; packet switching; some important protocols
  • Latency, packet loss, bottlenecks, and why they matter
• Distributed programming and its challenges
  • Network partitions and the CAP theorem
  • Faults, failures, and what we can do about them

Other architectures
• Earlier assumptions:
  • All cores can access the same memory
  • Access latencies are uniform
• Why is this problematic?
  • Processor speeds are in GHz; the speed of light is 299,792,458 m/s
  • A processor can do one addition in the time a signal travels about 30cm
  • So putting memory very far away is not a good idea
• Let's talk about other ways to organize cores and memory: SMP, NUMA, Shared-nothing

Symmetric Multiprocessing (SMP)
[Figure: several cores, each with its own cache, sharing one memory over a memory bus]
• All processor cores share the same memory; think of a single multicore processor
  • Any CPU can access any byte; the latency is always the same
• Pros: simplicity, easy load balancing
• Cons: limited scalability (~16 processors/cores), expensive

Non-Uniform Memory Architecture (NUMA)
[Figure: multiple processors, each with its own cores, caches, and local memory, connected by a consistent-cache (cache-coherent) interconnect]
• Memory is local to a specific processor
  • Each CPU can still access any byte, but accesses to 'local' memory are considerably faster (2-3x)
  • Think of a multiprocessor, multicore system
• Pros: better scalability
• Cons: complicates programming a bit; scalability is still limited

Example: Intel Xeon / Nehalem
• Access to remote memory is slower than access to local memory
  • In this case, 105ns vs. 65ns
(Source: "The Architecture of the Nehalem Microprocessor and Nehalem-EP SMP Platforms", M. Thomadakis, Texas A&M)

Shared-Nothing
[Figure: independent machines, each with its own cores, caches, and memory, connected only by a network]
• Independent machines connected by a network
  • Each CPU can only access its local memory; if it needs data from a remote machine, it must send a message there
• Pros: much better scalability
• Cons: requires a different programming model
• This is the focus of many parallel and "big data" programming models (a minimal message-passing sketch in Java appears at the end of these notes)

Stay tuned
• Next time you will learn about: Internet basics; faults and failures
• Assigned reading: "The Antifragile Organization" by Ariel Tseitlin
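To close, the promised minimal, illustrative sketch of the shared-nothing style (not from the slides): two "machines" are modeled as threads that share no data and communicate only by sending messages over in-memory queues. In a real cluster the queues would be network sockets and the messages serialized requests and replies; the Worker class, the SUM protocol, and the queue capacities are assumptions of this sketch.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Shared-nothing sketch: each worker owns its data privately and
    // communicates only via messages (strings over queues standing in for a network).
    public class SharedNothingSketch {

        // A worker that owns a private partition of the data.
        static class Worker implements Runnable {
            private final int[] localData;                    // private to this worker
            private final BlockingQueue<String> inbox;
            private final BlockingQueue<String> outbox;

            Worker(int[] localData, BlockingQueue<String> inbox, BlockingQueue<String> outbox) {
                this.localData = localData;
                this.inbox = inbox;
                this.outbox = outbox;
            }

            public void run() {
                try {
                    String request = inbox.take();            // wait for a request message
                    if (request.equals("SUM")) {
                        long sum = 0;
                        for (int x : localData) sum += x;     // compute on local data only
                        outbox.put(Long.toString(sum));       // reply with a message
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> toWorker   = new ArrayBlockingQueue<>(1);
            BlockingQueue<String> fromWorker = new ArrayBlockingQueue<>(1);

            // The "remote machine" owns its partition; we never touch its memory directly.
            new Thread(new Worker(new int[] { 1, 2, 3, 4, 5 }, toWorker, fromWorker)).start();

            toWorker.put("SUM");                                          // send a request
            System.out.println("Remote partial sum: " + fromWorker.take());  // receive the reply
        }
    }

This request/reply pattern, scaled up to many independent machines, is the shape of the MapReduce-style programming models covered later in the course.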