Programming at Scale:
Concurrency & Consistency
NETS 212: Scalable & Cloud Computing
Fall 2014
Z. Ives
University of Pennsylvania
© 2011-14 A. Haeberlen, Z. Ives
Logistics
• Homework 0 is due tonight @ 10PM
• Over the next 24 hours:
  • Amazon AWS credit codes will be distributed
• Homework 1 will be out by Thursday
• 2 milestones, tentatively the 18th and 25th
Course Road Map
• Overview of scalable & cloud computing
• Basics: Scalability, concurrency, consistency...
• Cloud basics: EC2, EBS, S3, SimpleDB, ...
• Cloud programming: MapReduce, Hadoop, ...
• Algorithms: PageRank, adsorption, ...
• Web programming, servlets, XML, Ajax...
• Beyond MapReduce: Dryad, Hive, PigLatin, ...
Goal: Building Fast, Scalable Software
How do we speed up software?
What are the hard problems?
How do they affect what we can do, and how we write the software?
What is scalability?
• A system is scalable if it can easily adapt to increased (or
reduced) demand
• Example: A storage system might start with a capacity of just 10TB but
can grow to many PB by adding more nodes
• Scalability is usually limited by some sort of bottleneck
• Often, scalability also means...
• the ability to operate at a very large scale
• the ability to grow efficiently
  • Example: 4x as many nodes → ~4x capacity (not just 2x!)
Scaling up by parallelizing
An Obvious Way to Scale
Throw more hardware at the problem!
… In this case, go from one processor to many, or
one computer to many!
Scaling up via hardware increases complexity of algorithms
[Figure: a progression from a single-core machine to a multicore server, a cluster, and a large-scale distributed system, with more challenges at each step]
• Single-core machine: known from CIS120
• Multicore server: true concurrency; microsecond delays
• Cluster: network; message passing; more failure modes (faulty nodes, ...); millisecond delays
• Large-scale distributed system: wide-area network; even more failure modes; incentives, laws, ...; 100-millisecond delays
A First Step: Parallelism in One Machine,
Symmetric Multiprocessing (SMP)
[Figure: four cores, each with its own cache, connected to shared memory via a common memory bus]
• For now, assume we have multiple cores that can access the same
shared memory
• Any core can access any byte; speed is uniform (no byte takes longer to
read or write than any other)
• Not all machines are like that -- other models discussed later
Parallelizing an Algorithm
void bubblesort(int nums[]) {
  boolean done = false;
  while (!done) {
    done = true;
    for (int i=1; i<nums.length; i++) {
      if (nums[i-1] > nums[i]) {
        swap(nums, i-1, i);   // helper that swaps nums[i-1] and nums[i]
        done = false;
      }
    }
  }
}

int[] mergesort(int nums[]) {
  int numPieces = 10;
  int pieces[][] = split(nums, numPieces);
  for (int i=0; i<numPieces; i++)   // <-- can be done in parallel!
    sort(pieces[i]);
  return merge(pieces);
}
• The left algorithm works fine on one core
• Can we make it faster on multiple cores?
• Difficult - need to find something for the other cores to do
• There are other sorting algorithms where this is much easier
• Not all algorithms are equally parallelizable
• Can you have scalability without parallelism?
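To make the contrast concrete, here is a minimal two-way parallel variant of the merge-sort idea in Java: each half is sorted in its own thread, and the halves are then merged sequentially. This is only an illustrative sketch (names like ParallelSort are placeholders; a real implementation would recurse and reuse threads):

import java.util.Arrays;

public class ParallelSort {
    // Sort each half in its own thread, then merge the two sorted halves.
    static int[] parallelSort(int[] nums) throws InterruptedException {
        int mid = nums.length / 2;
        int[] left = Arrays.copyOfRange(nums, 0, mid);
        int[] right = Arrays.copyOfRange(nums, mid, nums.length);

        Thread t1 = new Thread(() -> Arrays.sort(left));   // one core sorts the left half
        Thread t2 = new Thread(() -> Arrays.sort(right));  // another core sorts the right half
        t1.start(); t2.start();
        t1.join(); t2.join();                              // wait for both halves

        // Sequential merge step (this part is not parallelized).
        int[] result = new int[nums.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            result[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length)  result[k++] = left[i++];
        while (j < right.length) result[k++] = right[j++];
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(parallelSort(new int[]{5, 2, 9, 1, 7, 3})));
    }
}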
Scalability
Speedup: $S_n = T_1 / T_n$, where $T_1$ is the completion time with one core and $T_n$ is the completion time with n cores
[Figure: numbers sorted per second vs. cores used; the "ideal" speedup curve grows linearly, while the "expected" curve grows more slowly]
• If we increase the number of processors, will the speed also
increase?
• Yes, but (in almost all cases) only up to a point
• Why?
Amdahl's law
[Figure: execution timelines with sequential parts and a parallel part; adding more cores shortens only the parallel part, so the sequential parts increasingly dominate the total time]
• Usually, not all parts of the algorithm can be parallelized
• Let f be the fraction of the algorithm that can be parallelized, and let $S_{part}$ be the corresponding speedup
• Then:  $S_{overall} = \frac{1}{(1 - f) + f / S_{part}}$
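As a quick sanity check of the formula (numbers chosen purely for illustration): with f = 0.95 and a 10x speedup of the parallel part, the overall speedup is only about 6.9x, and even an infinitely fast parallel part can never beat 1/(1-f) = 20x. A tiny Java helper makes this easy to play with:

public class Amdahl {
    // Overall speedup when a fraction f of the work is sped up by a factor sPart (Amdahl's law).
    static double overallSpeedup(double f, double sPart) {
        return 1.0 / ((1.0 - f) + f / sPart);
    }

    public static void main(String[] args) {
        System.out.println(overallSpeedup(0.95, 10));                        // ~6.9
        System.out.println(overallSpeedup(0.95, Double.POSITIVE_INFINITY));  // 20.0, the hard limit
    }
}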
Is more parallelism always better?
[Figure: numbers sorted per second vs. cores; the "ideal" curve keeps growing, the "expected" curve flattens out, and the "reality (often)" curve peaks at a sweet spot and then declines]
• Increasing parallelism beyond a certain point can cause
performance to decrease! Why?
• Communication / synchronization between the parallel tasks!
• Example: Need to send a message to each core to tell it what to do
Granularity
• How big a task should we assign to each core?
• Coarse-grain vs. fine-grain parallelism
• Frequent coordination creates overhead
• Need to send messages back and forth, wait for other cores...
• Result: Cores spend most of their time communicating
• Coarse-grain parallelism is usually more efficient
• Bad: Ask each core to sort three numbers
• Good: Ask each core to sort a million numbers
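A small sketch of this trade-off in Java (illustrative only; the array size, chunking, and thread-pool setup are arbitrary choices): submitting one tiny task per element drowns the cores in scheduling and communication overhead, whereas submitting one large chunk per core keeps them busy with useful work.

import java.util.concurrent.*;

public class Granularity {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int[] data = new int[1_000_000];

        // Fine-grained (bad): one task per element -- mostly scheduling overhead.
        // for (int i = 0; i < data.length; i++) {
        //     final int idx = i;
        //     pool.submit(() -> data[idx] = idx * 2);
        // }

        // Coarse-grained (good): one large chunk per core.
        int chunk = data.length / cores;
        CountDownLatch done = new CountDownLatch(cores);
        for (int c = 0; c < cores; c++) {
            final int from = c * chunk;
            final int to = (c == cores - 1) ? data.length : from + chunk;
            pool.submit(() -> {
                for (int i = from; i < to; i++) data[i] = i * 2;  // useful work per task is large
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
    }
}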
Dependencies
[Figure: two task graphs from START to DONE: one with independent individual tasks ("embarrassingly parallel") and one with dependencies between tasks]
• What if tasks depend on other tasks?
  • Example: Need to sort lists before merging them
  • Limits the degree of parallelism
  • Minimum completion time (and thus maximum speedup) is determined by the longest path from start to finish (see the small example below)
    • Assumes resources are plentiful; actual speedup may be lower
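To make the longest-path claim concrete with a made-up example: if every task takes one time unit, there are 10 tasks in total, and the longest dependency chain contains 4 of them, then even with unlimited cores
  $S_{max} = T_1 / T_{\text{longest path}} = 10 / 4 = 2.5$
so no schedule can achieve more than a 2.5x speedup.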
Heterogeneity
• What if...
• some tasks are larger than others?
• some tasks are harder than others?
• some tasks are more urgent than others?
• some cores are faster than others, or have different resources?
• Result: Scheduling problem
• Can be very difficult
Recap: Parallelization
• Parallelization is hard
• Not all algorithms are equally parallelizable -- need to pick very carefully
• Scalability is limited by many things
• Amdahl's law
• Dependencies between tasks
• Communication overhead
• Heterogeneity
• ...
Synchronization
We often need to synchronize
among concurrent (parallel) tasks
void transferMoney(customer A, customer B, int amount)
{
  showMessage("Transferring "+amount+" to "+B);
  int balanceA = getBalance(A);
  int balanceB = getBalance(B);
  setBalance(B, balanceB + amount);
  setBalance(A, balanceA - amount);
  showMessage("Your new balance: "+(balanceA-amount));
}
• Simple example: Accounting system in a bank
• Maintains the current balance of each customer's account
• Customers can transfer money to other customers
Why do we need synchronization?
Alice transfers $100 to Bob while, concurrently, Bob transfers $500 to Alice; initially Alice's balance is $200 and Bob's is $800.
Alice: 1) B=Balance(Bob)  2) A=Balance(Alice)  3) SetBalance(Bob,B+100)  4) SetBalance(Alice,A-100)
Bob: 1) A=Balance(Alice)  2) B=Balance(Bob)  3) SetBalance(Alice,A+500)  4) SetBalance(Bob,B-500)
What can happen if this code runs concurrently?
[Figure: in the interleaving shown, both sides read the old balances before either writes. Alice's balance goes $200, then $700, then $100; Bob's goes $800, then $900, then $300. The final state is Alice $100, Bob $300 instead of the correct Alice $600, Bob $400: updates were lost and $600 disappeared.]
Problem: Race Condition
Alice's and Bob's threads of execution both run:
void transferMoney(customer A, customer B, int amount)
{
  showMessage("Transferring "+amount+" to "+B);
  int balanceA = getBalance(A);
  int balanceB = getBalance(B);
  setBalance(B, balanceB + amount);
  setBalance(A, balanceA - amount);
  showMessage("Your new balance: "+(balanceA-amount));
}
• What happened?
• Race condition: The result of the computation depends on the exact timing of the two threads of execution, i.e., on the order in which the instructions are executed
• Reason: Concurrent updates to the same state
  • Can you get a race condition when all the threads are reading the data, and none of them are updating it?
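For concreteness, a minimal stand-alone Java sketch (illustrative only, with an artificial sleep to widen the race window) that reproduces the same lost-update problem with two threads doing an unsynchronized read-modify-write on a shared balance:

public class RaceDemo {
    static int balance = 200;                 // shared state, no synchronization

    static void deposit(int amount) {
        int b = balance;                      // read
        try { Thread.sleep(1); } catch (InterruptedException e) { }  // widen the race window
        balance = b + amount;                 // write based on a possibly stale read
    }

    public static void main(String[] args) throws InterruptedException {
        Thread alice = new Thread(() -> deposit(-100));  // Alice withdraws $100
        Thread bob   = new Thread(() -> deposit(+500));  // Bob deposits $500
        alice.start(); bob.start();
        alice.join();  bob.join();
        // Correct result: 200 - 100 + 500 = 600. With the race, one update is often lost,
        // so this frequently prints 100 or 700 instead.
        System.out.println("Final balance: " + balance);
    }
}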
Goal: Consistency
• What should have happened?
• Intuition: It shouldn't make a difference whether the requests are
executed concurrently or not
• How can we formalize this?
• Need a consistency model that specifies how the system should behave
in the presence of concurrency
The Strictest Model:
Sequential consistency (Serializability)
[Figure: an actual execution in which core #1 runs T1, T2, T3 and core #2 runs T4, T5, T6 concurrently, next to a hypothetical execution on a single core that runs the same operations in some sequential order; both start from the same state and produce the same result]
• Sequential consistency:
• The result of any execution is the same as if the operations of all the cores
had been executed in some sequential order, and the operations of each
individual processor appear in this sequence in the order specified by the
program
Other consistency models
• Strong consistency
• After update completes, all subsequent accesses will return the updated
value
• Weak consistency
• After update completes, accesses do not necessarily return the updated
value; some condition must be satisfied first
• Example: Update needs to reach all the replicas of the object
• Eventual consistency
• Specific form of weak consistency: If no more updates are made to an
object, then eventually all reads will return the latest value
• Variants: Causal consistency, read-your-writes, monotonic writes, ...
• More of this later in the course!
• Why would we want any of these? Who uses them?
How do we build systems / algorithms / services that achieve them?
Building Blocks for Consistency:
Mutual Exclusion, Locking, Concurrency Control
Synchronization Primitive:
Mutual exclusion
void transferMoney(customer A, customer B, int amount)
{
  showMessage("Transferring "+amount+" to "+B);
  // ---- critical section ----
  int balanceA = getBalance(A);
  int balanceB = getBalance(B);
  setBalance(B, balanceB + amount);
  setBalance(A, balanceA - amount);
  // ---- end critical section ----
  showMessage("Your new balance: "+(balanceA-amount));
}
• How can we achieve better consistency?
• Key insight: Code has a critical section where accesses from other cores to
the same resources will cause problems
• Approach: Mutual exclusion
• Enforce restriction that only one core (or machine) can execute the critical
section at any given time
• What does this mean for scalability?
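On a single machine, Java's built-in monitors are one way to enforce this: a synchronized block guarded by a shared lock object admits at most one thread at a time into the critical section. A minimal sketch, where the Customer type and the getBalance/setBalance/showMessage helpers mirror the pseudocode above and are hypothetical, not a real banking API:

public class Bank {
    private final Object lock = new Object();   // one lock guarding all balances

    void transferMoney(Customer a, Customer b, int amount) {
        showMessage("Transferring " + amount + " to " + b);
        int newBalanceA;
        synchronized (lock) {                   // at most one thread is in here at a time
            int balanceA = getBalance(a);
            int balanceB = getBalance(b);
            setBalance(b, balanceB + amount);
            setBalance(a, balanceA - amount);
            newBalanceA = balanceA - amount;
        }
        showMessage("Your new balance: " + newBalanceA);
    }

    // Hypothetical helpers, as in the slide's pseudocode:
    static class Customer { int balance; }
    int getBalance(Customer c) { return c.balance; }
    void setBalance(Customer c, int v) { c.balance = v; }
    void showMessage(String m) { System.out.println(m); }
}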
Locking
void transferMoney(customer A, customer B, int amount)
{
  showMessage("Transferring "+amount+" to "+B);
  // ---- critical section ----
  int balanceA = getBalance(A);
  int balanceB = getBalance(B);
  setBalance(B, balanceB + amount);
  setBalance(A, balanceA - amount);
  // ---- end critical section ----
  showMessage("Your new balance: "+(balanceA-amount));
}
• Idea: Implement locks
• If LOCK(X) is called and X is not locked, lock X and continue
• If LOCK(X) is called and X is locked, wait until X is unlocked
• If UNLOCK(X) is called and X is locked, unlock X
• How many locks, and where do we put them?
• Option #1: One lock around the critical section
• Option #2: One lock per variable (A's and B's balance)
• Pros and cons? Other options?
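A sketch of option #2 using java.util.concurrent.locks.ReentrantLock, with one lock per account (the Account class is hypothetical and mirrors the pseudocode above):

import java.util.concurrent.locks.ReentrantLock;

public class LockedBank {
    static class Account {
        int balance;
        final ReentrantLock lock = new ReentrantLock();  // one lock per variable (option #2)
        Account(int balance) { this.balance = balance; }
    }

    void transferMoney(Account a, Account b, int amount) {
        b.lock.lock();                    // LOCK(B)
        a.lock.lock();                    // LOCK(A)
        try {
            int balanceA = a.balance;
            int balanceB = b.balance;
            b.balance = balanceB + amount;
            a.balance = balanceA - amount;
        } finally {
            a.lock.unlock();              // UNLOCK(A)
            b.lock.unlock();              // UNLOCK(B)
        }
    }
}

Note that the locks are taken in whatever order the arguments happen to arrive in; as the next slides show, exactly this pattern can deadlock when two transfers run in opposite directions.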
Locking helps!
Same scenario as before: Alice transfers $100 to Bob, Bob transfers $500 to Alice; initially Alice has $200 and Bob has $800.
Alice: 1) LOCK(Bob)  2) LOCK(Alice)  3) B=Balance(Bob)  4) A=Balance(Alice)  5) SetBalance(Bob,B+100)  6) SetBalance(Alice,A-100)  7) UNLOCK(Alice)  8) UNLOCK(Bob)
Bob: 1) LOCK(Alice)  2) LOCK(Bob)  3) A=Balance(Alice)  4) B=Balance(Bob)  5) SetBalance(Alice,A+500)  6) SetBalance(Bob,B-500)  7) UNLOCK(Bob)  8) UNLOCK(Alice)
[Figure: in the execution shown, Alice's thread acquires both locks first and Bob's thread blocks until they are released; Alice's transfer completes ($200/$800 becomes $100/$900), then Bob's completes ($100/$900 becomes $600/$400), so the final balances are the correct $600 and $400]
The Downside of Locks: Deadlocks
Same scenario: Alice transfers $100 to Bob, Bob transfers $500 to Alice.
Alice: 1) LOCK(Bob)  2) LOCK(Alice)  3) B=Balance(Bob)  4) A=Balance(Alice)  5) SetBalance(Bob,B+100)  6) SetBalance(Alice,A-100)  7) UNLOCK(Alice)  8) UNLOCK(Bob)
Bob: 1) LOCK(Alice)  2) LOCK(Bob)  3) A=Balance(Alice)  4) B=Balance(Bob)  5) SetBalance(Alice,A+500)  6) SetBalance(Bob,B-500)  7) UNLOCK(Bob)  8) UNLOCK(Alice)
[Figure: Alice's thread acquires the lock on Bob and then blocks waiting for the lock on Alice, while Bob's thread acquires the lock on Alice and then blocks waiting for the lock on Bob]
• Neither processor can make progress!
The dining philosophers problem
[Figure: philosophers seated around a table, with one fork between each pair of neighbors]
Philosopher:
  repeat
    think
    pick up left fork
    pick up right fork
    eat
    put down forks
  forever
What to do about deadlocks
• Many possible solutions, including:
  • Lock manager: Hire a waiter and require that philosophers must ask the waiter before picking up any forks
    • Consequences for scalability?
  • Resource hierarchy: Number the forks 1-5 and require that each philosopher pick up the fork with the lower number first (see the lock-ordering sketch below)
    • Problem?
  • Chandy/Misra solution:
    • Forks can either be dirty or clean; initially all forks are dirty
    • After the philosopher has eaten, all his forks are dirty
    • When a philosopher needs a fork he can't get, he asks his neighbor
    • If a philosopher is asked for a dirty fork, he cleans it and gives it up
    • If a philosopher is asked for a clean fork, he keeps it
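The resource-hierarchy idea carries over directly to the bank example. A sketch, assuming each account has a unique numeric id that defines the global lock order: always acquire the lock with the smaller id first, so two concurrent transfers can never wait for each other in a cycle.

import java.util.concurrent.locks.ReentrantLock;

public class OrderedLocking {
    static class Account {
        final int id;                                   // unique id defines the global lock order
        int balance;
        final ReentrantLock lock = new ReentrantLock();
        Account(int id, int balance) { this.id = id; this.balance = balance; }
    }

    static void transfer(Account from, Account to, int amount) {
        // Always lock the account with the smaller id first, regardless of transfer direction.
        Account first  = (from.id < to.id) ? from : to;
        Account second = (from.id < to.id) ? to : from;
        first.lock.lock();
        second.lock.lock();
        try {
            from.balance -= amount;
            to.balance   += amount;
        } finally {
            second.lock.unlock();
            first.lock.unlock();
        }
    }
}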
Alternatives to Locks
• Locks (with their risk of deadlock) aren’t the only solution
• Another approach:
• Update things concurrently
• Try to “merge” the effects (reconcile the accounts)
• If not possible, do things in sequence
“Optimistic” concurrency control
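A minimal sketch of the optimistic style in Java, using java.util.concurrent.atomic (illustrative; real systems typically version whole records or use transactions): read the current value, compute the update, and atomically install it only if nobody changed the value in the meantime; otherwise retry.

import java.util.concurrent.atomic.AtomicInteger;

public class OptimisticAccount {
    private final AtomicInteger balance = new AtomicInteger(200);

    // Optimistic update: no lock is held; conflicting writers simply retry.
    void deposit(int amount) {
        while (true) {
            int current = balance.get();                     // read without locking
            int updated = current + amount;
            if (balance.compareAndSet(current, updated))     // succeeds only if unchanged
                return;
            // Someone else updated the balance concurrently; retry with the new value.
        }
    }
}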
Plan for the next two lectures
• Parallel programming and its challenges
• Parallelization and scalability, Amdahl's law
• Synchronization, consistency
• Mutual exclusion, locking, issues related to locking
• Architectures: SMP, NUMA, Shared-nothing  ← NEXT
• All about the Internet in 30 minutes
• Structure; packet switching; some important protocols
• Latency, packet loss, bottlenecks, and why they matter
• Distributed programming and its challenges
• Network partitions and the CAP theorem
• Faults, failures, and what we can do about them
Other architectures
• Earlier assumptions:
  • All cores can access the same memory
  • Access latencies are uniform
• Why is this problematic?
  • Processor speeds are in GHz; the speed of light is 299,792,458 m/s
  ⇒ A processor can do 1 addition while a signal travels 30cm
  ⇒ Putting memory very far away is not a good idea
• Let's talk about other ways to organize cores and memory!
  • SMP, NUMA, Shared-nothing
Symmetric Multiprocessing (SMP)
[Figure: four cores, each with its own cache, all connected to shared memory over a common memory bus]
• All processor cores share the same memory; consider a single multicore processor
• Any CPU can access any byte; latency is always the same
• Pros: Simplicity, easy load balancing
• Cons: Limited scalability (~16 processors/cores), expensive
Non-Uniform Memory Architecture (NUMA)
[Figure: multiple processors, each with its own cache and its own local memory, connected by a cache-coherent interconnect]
• Memory is local to a specific processor; consider a multiprocessor, multicore system
• Each CPU can still access any byte, but accesses to 'local' memory are considerably faster (2-3x)
• Pros: Better scalability
• Cons: Complicates programming a bit, scalability still limited
Example: Intel Xeon / Nehalem
• Access to remote memory is slower
• In this case, 105ns vs. 65ns
Source: The Architecture of the Nehalem Microprocessor and Nehalem-EP SMP Platforms, M. Thomadakis, Texas A&M
Shared-Nothing
[Figure: several independent machines, each with its own cores, caches, and memory, connected only by a network]
• Independent machines connected by network
• Each CPU can only access its local memory; if it needs data from a remote
machine, it must send a message there
• Pros: Much better scalability
• Cons: Requires a different programming model
This is the focus of many parallel and “big data” programming models
Stay tuned
Next time you will learn about:
Internet basics; faults and failures
Assigned reading: "The Antifragile Organization" by Ariel Tseitlin