Chapter 2

“By the end of this chapter, you should have obtained a basic
understanding of how modern processors execute parallel
programs & understand some rules of thumb for scaling
performance of parallel applications.”
1
Serial Model
SISD
Parallel Models
SIMD
MIMD
MISD*
S = Single
M = Multiple
D = Data
I = Instruction
2
Task vs. Data: tasks are instructions that operate on data, modifying it or creating new data
Parallel computation → multiple tasks
  Tasks must be coordinated and managed
Dependencies
  Data: a task requires data produced by another task
  Control: events/steps must be ordered (e.g., I/O)
3
Fork: splits the control flow, creating a new control flow
Join: control flows are synchronized and merged (see the sketch below)
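A minimal sketch of fork and join using C++ std::thread (illustrative, not from the text):

#include <iostream>
#include <thread>

void work() {                      // runs in the new (forked) control flow
    std::cout << "child task\n";
}

int main() {
    std::thread t(work);           // fork: a second control flow begins here
    std::cout << "parent task\n";  // the parent continues concurrently
    t.join();                      // join: wait for the child, then merge control flows
}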
4
Task
Data
Fork
Join
Dependency
5
Data Parallelism
Best strategy for scalable parallelism
Parallelism that grows as the data set / problem size grows
Split the data set over a set of processors, with a task processing each subset
More data → more tasks (see the sketch below)
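A minimal sketch of data parallelism in C++, assuming a simple sum over an array; the worker count and chunking scheme are illustrative:

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000, 1);              // the data set
    const int P = 4;                             // number of workers
    const std::size_t chunk = data.size() / P;
    std::vector<long> partial(P, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < P; ++i) {
        std::size_t lo = i * chunk;
        std::size_t hi = (i == P - 1) ? data.size() : lo + chunk;
        // Each task processes its own subset of the data.
        workers.emplace_back([&partial, &data, i, lo, hi] {
            partial[i] = std::accumulate(data.begin() + lo, data.begin() + hi, 0L);
        });
    }
    for (auto& w : workers) w.join();            // join all control flows
    std::cout << std::accumulate(partial.begin(), partial.end(), 0L) << "\n";
}

A larger data set simply means more (or bigger) chunks, so the available parallelism grows with the problem.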
6
Control Parallelism, or Functional Decomposition
Different program functions run in parallel
Not scalable – the best speedup is a constant factor
As the data grows, the parallelism doesn't
May have less overhead, or none (see the sketch below)
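A minimal sketch of functional decomposition with std::async: two unrelated functions (illustrative names) run in parallel, so the speedup is capped at 2 however large the input grows:

#include <future>
#include <iostream>
#include <numeric>
#include <vector>

long sum_all(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

long count_even(const std::vector<int>& v) {
    long n = 0;
    for (int x : v)
        if (x % 2 == 0) ++n;
    return n;
}

int main() {
    std::vector<int> v(1000, 3);
    // Two different program functions, one task each: at most 2x speedup.
    auto f1 = std::async(std::launch::async, sum_all, std::cref(v));
    auto f2 = std::async(std::launch::async, count_even, std::cref(v));
    std::cout << f1.get() << " " << f2.get() << "\n";
}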
7
Regular: tasks are similar, with predictable dependencies
  Example: matrix multiplication
Irregular: tasks differ in ways that create unpredictable dependencies
  Example: a chess program
Many problems contain combinations of both
8
The two most important mechanisms
Thread Parallelism: implemented in HW using a separate flow of control for each worker – supports regular and irregular parallelism and functional decomposition
Vector Parallelism: implemented in HW with one flow of control operating on multiple data elements – supports regular and some irregular parallelism
9
Detrimental to
Parallelism
• Locality
• Pipelining
• HOW?
10
MASKING
if (a & 1)
    a = 3*a + 1;
else
    a = a / 2;
The if/else contains branch instructions.

Masking: both branches are executed in parallel; only one result is kept.
p = (a & 1);
t = 3*a + 1;
if (p) a = t;
t = a / 2;
if (!p) a = t;
No branches – a single flow of control.
Masking works as if the code were written this way.
11
Core
Functional Units
Registers
Cache memory – multiple levels
12
13
Blocks (cache lines) – the amount fetched at once
Bandwidth – the amount transferred per unit time
Latency – the time to complete a transfer
Cache Coherence – consistency among multiple copies
14
Virtual memory system
Disk storage + chip memory
Allows programs larger than physical memory to run
Allows multiprocessing
Swaps pages between memory and disk
HW maps logical (virtual) addresses to physical addresses
Data locality is important to efficiency
Excessive page faults → thrashing
15
Caches (multiple levels)
NUMA – Non-Uniform Memory Access
PRAM – Parallel Random Access Machine
  A theoretical model
  Assumes uniform memory access times
16
Data Locality
Choose code segments that fit in cache
Design to use data in close proximity (see the sketch below)
Align data with cache lines (blocks)
Dynamic grain size – a good strategy
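A sketch of spatial locality in practice: C/C++ stores matrices row-major, so a row-order traversal uses every element of each fetched cache line, while a column-order traversal wastes most of each line:

#include <cstdio>

int main() {
    const int N = 1024;
    static double m[N][N];   // row-major storage: m[i][0..N-1] is contiguous
    double sum = 0.0;

    // Row-order traversal: consecutive j hits consecutive addresses,
    // so each fetched cache line is fully used (good spatial locality).
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += m[i][j];

    // Column-order traversal: consecutive i jumps N*sizeof(double) bytes,
    // so nearly every access touches a different cache line (poor locality).
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += m[i][j];

    std::printf("%f\n", sum);
}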
17
Arithmetic Intensity
A large number of on-chip compute operations for every off-chip memory access
Otherwise, communication overhead is high
Example: a dot product performs roughly one arithmetic operation per memory access (low intensity), while a blocked matrix multiplication reuses each fetched element many times (high intensity)
Related concept: grain size
18
Serial Model
  SISD
Parallel Models
  SIMD – array processors, vector processors
  MIMD – heterogeneous computers, clusters
  MISD* – not useful in practice
19
Shared Memory – each processor accesses a common memory
  Access issues (contention, coherence)
  No message passing
  Each processor usually has a small local memory (cache)
Distributed Memory – each processor has its own local memory
  Explicit messages are sent between processors
20
GPU – graphics accelerators, now general purpose
Offload – running computations on an accelerator such as a GPU or co-processor (not the regular CPUs)
Heterogeneous – different kinds of hardware working together
Host processor – handles distribution, I/O, etc.
21
Various interpretations of performance
Reducing the total time for a computation
  Latency
Increasing the rate at which a series of results is computed
  Throughput
Reducing power consumption
*Know your performance target
22
Latency: the time to complete a task
Throughput: the rate at which tasks are completed
  Units per time (e.g., jobs per hour)
23
24
Speedup: Sp = T1 / Tp
  T1: time to complete on 1 processor
  Tp: time to complete on P processors
  REMEMBER: "time" here means number of instructions
Efficiency: E = Sp / P = T1 / (P * Tp)
  E = 1 is "perfect"
Linear speedup occurs when an algorithm runs P times faster on P processors
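A worked example with illustrative numbers: if T1 = 120 time units and Tp = 40 on P = 4 processors, then Sp = 120 / 40 = 3 and E = 3 / 4 = 0.75 – good, but short of linear speedup.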
25
Efficiency > 1 (superlinear speedup)
  Very rare
  Often due to HW variations (e.g., more total cache across processors)
  Working in parallel may eliminate some work that is done in the serial version
26
Amdahl: speedup is limited by the amount of serial work required
Gustafson-Barsis: as the problem size grows, the parallel work grows faster than the serial work, so speedup increases
  See the examples below
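The standard forms of both laws, writing s for the serial fraction of the work:
Amdahl: Sp ≤ 1 / (s + (1 - s)/P), so even with unlimited processors Sp ≤ 1/s – a 10% serial fraction caps speedup at 10.
Gustafson-Barsis: Sp = P - s * (P - 1) for a problem scaled up to fill P processors – with s = 0.1 and P = 100, Sp = 100 - 0.1 * 99 ≈ 90.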
27
Work = total operations (time) for the task
T1 = Work
Ideally, P * Tp = Work
So is T1 = P * Tp ??
Rare – due to ???
28
Describes dependencies among tasks and allows times to be estimated
  Represents tasks as a DAG (Figure 2.8)
  Critical Path – the longest path through the DAG
  Span – the minimum possible time: the time of the Critical Path
Assumes greedy task scheduling – no wasted resources or time
Parallel slack – excess parallelism: more tasks than can be scheduled at once
29
Speedup <= Work/Span
Upper Bound: ??
No more than…
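A worked example with illustrative numbers: a task DAG with Work = 100 operations and Span = 10 operations gives Speedup ≤ 100 / 10 = 10; past 10 processors, additional workers can only idle while the critical path completes.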
30
Decomposing a program or data set into more
parallelism than hardware can utilize
WHY?
Advantages?
Disadvantages?
31
ASYMPTOTIC COMPLEXITY (2.5.7)
Comparing Algorithms!!
Time Complexity: defines execution time
growth in terms of input size
Space Complexity: defines growth of memory
requirements in terms of input size
Ignores constants
Machine independent
32
BIG OH NOTATION (P.66)
Big Oh of F(n) – upper bound
O(F(n)) = { G(n) | there exist positive constants c and N0 such that |G(n)| ≤ c·F(n) for all n ≥ N0 }
*Memorize
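A worked example: 3n^2 + 5n is in O(n^2), since for n ≥ 5 we have 5n ≤ n^2 and therefore 3n^2 + 5n ≤ 4n^2; the definition is satisfied with c = 4 and N0 = 5.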
33
BIG OMEGA & BIG THETA
Big Omega – Functions that define Lower
Bound
Big Theta – Functions that define a Tight
Bound – Both Upper & Lower Bounds
34
Parallel → work actually occurring at the same time
  Limited by the number of processors
Concurrent → tasks in progress at the same time, but not necessarily executing
  "Unlimited"
Omit 2.5.8 & most of 2.5.9
35
Pitfalls = issues that can cause problems
  Due to dependencies
Synchronization – often required
  Too little → non-determinism
  Too much → reduces scaling, increases time, and may cause deadlock
36
1. Race Conditions
2. Mutual Exclusion & Locks
3. Deadlock
4. Strangled Scaling
5. Lack of Locality
6. Load Imbalance
7. Overhead
37
A situation in which the final result depends upon the order in which tasks complete their work
Occurs when concurrent tasks share a memory location and at least one of them writes to it
Unpredictable – races don't always cause errors
Interleaving: instructions from 2 or more tasks are executed in an alternating manner
38
Task A
  A = X
  A += 1
  X = A
Task B
  B = X
  B += 2
  X = B
Assume X is initially 0. What are the possible results?
So Tasks A & B are not REALLY independent!
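Enumerating the interleavings: if one task runs entirely before the other, X ends as 3; if both read X = 0 before either writes, one update is lost and X ends as 2 (A's write overwritten) or 1 (B's write overwritten). A minimal C++ sketch of the race (the unsynchronized sharing is deliberate, to illustrate the bug):

#include <iostream>
#include <thread>

int X = 0;  // shared and unsynchronized: a data race (undefined behavior,
            // shown here only to illustrate interleaving)

int main() {
    std::thread taskA([] { int a = X; a += 1; X = a; });
    std::thread taskB([] { int b = X; b += 2; X = b; });
    taskA.join();
    taskB.join();
    std::cout << X << "\n";  // may print 1, 2, or 3 depending on interleaving
}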
39
Task A
  X = 1
  A = Y
Task B
  Y = 1
  B = X
Assume X & Y are initially 0. What are the possible results?
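Under any simple interleaving, at least one write happens before the other task's read, so (A, B) can be (0, 1), (1, 0), or (1, 1) but never (0, 0). On real hardware, however, store/load reordering (e.g., store buffers) can also produce (0, 0) unless explicit memory ordering is used.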
40
Mutual exclusion, locks, semaphores, atomic operations
Mechanisms that prevent simultaneous access to a memory location (or locations) – one task completes its access before another is allowed to start (see the sketch below)
Cause serialization of operations
Do not always solve the problem – the result may still depend upon which task executes first
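A minimal sketch of mutual exclusion with std::mutex, applied to the earlier lost-update example:

#include <iostream>
#include <mutex>
#include <thread>

int X = 0;
std::mutex m;  // guards X

void add(int delta) {
    std::lock_guard<std::mutex> lock(m);  // only one task at a time past here
    int t = X;
    t += delta;
    X = t;  // the read-modify-write is now indivisible with respect to m
}

int main() {
    std::thread a(add, 1);
    std::thread b(add, 2);
    a.join();
    b.join();
    std::cout << X << "\n";  // always 3: the lock serializes the two updates
}

The lock serializes the two critical sections; as the slide notes, it does not repair logic that depends on which task happens to run first.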
41
A situation in which 2 or more processes cannot proceed because each is waiting on another – everything STOPS
Recommendations for avoidance
  Avoid mutual exclusion where possible
  Hold at most 1 lock at a time
  Acquire locks in the same order (see the sketch below)
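A sketch of the lock-ordering rule using C++17 std::scoped_lock, which acquires multiple mutexes with a built-in deadlock-avoidance algorithm, so two tasks naming the same pair of locks in opposite order still cannot deadlock:

#include <mutex>
#include <thread>

std::mutex m1, m2;

void taskA() {
    std::scoped_lock lock(m1, m2);  // both locks acquired deadlock-free
    // ... use both protected resources ...
}

void taskB() {
    std::scoped_lock lock(m2, m1);  // opposite textual order, still safe
    // ... use both protected resources ...
}

int main() {
    std::thread a(taskA);
    std::thread b(taskB);
    a.join();
    b.join();
}

With plain lock()/unlock() calls in those orders, taskA holding m1 while taskB holds m2 would wait on each other forever – the circular-wait pattern described on the next slide.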
42
1. Mutual Exclusion Condition: the resources involved are non-shareable.
Explanation: At least one resource must be held in a non-shareable mode; that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold and Wait Condition: a requesting process holds resources it already has while waiting for the requested resources.
Explanation: There must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently being held by other processes.
3. No-Preemption Condition: resources already allocated to a process cannot be preempted.
Explanation: Resources cannot be forcibly removed from a process; they are released only after the process has used them to completion, or voluntarily by the process holding them.
4. Circular Wait Condition: the processes in the system form a circular list or chain in which each process is waiting for a resource held by the next process in the list.
43
Fine-Grain Locking – use many locks on small sections rather than 1 lock on a large section
Notes
  1 large lock is cheaper to manage but blocks other processes for longer
  Setting and releasing many locks takes time
Example: lock a row of a matrix, not the entire matrix (see the sketch below)
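A sketch of the row-lock example: one mutex per matrix row (sizes and names are illustrative), so tasks working on different rows proceed in parallel:

#include <array>
#include <mutex>
#include <thread>

constexpr int ROWS = 8, COLS = 8;
std::array<std::mutex, ROWS> row_locks;  // fine-grain: one lock per row
double matrix[ROWS][COLS] = {};

void scale_row(int r, double f) {
    std::lock_guard<std::mutex> lock(row_locks[r]);  // blocks only row r
    for (int c = 0; c < COLS; ++c)
        matrix[r][c] *= f;
}

int main() {
    std::thread t1(scale_row, 0, 2.0);  // different rows, so these two
    std::thread t2(scale_row, 5, 0.5);  // tasks never block each other
    t1.join();
    t2.join();
}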
44
Two assumptions behind good locality: a core will…
  Temporal locality – access the same location again soon
  Spatial locality – access a nearby location soon
Reminder: a cache line is the block that is retrieved on a miss
Currently, a cache miss costs ~100 cycles
45
Uneven distribution of work over the processors
Related to the decomposition of the problem
Few vs. many tasks – what are the implications?
46
Overhead is always present in parallel processing
  Launching and synchronizing tasks
Small vs. larger numbers of processors ~ implications???
~ the end of chapter 2 ~
47