lecture notes

Programming for Performance
1
Simulating Ocean Currents
(a) Cross sections
(b) Spatial discretization of a cross section
• Model as two-dimensional grids
• Discretize in space and time
  – finer spatial and temporal resolution => greater accuracy
• Many different computations per time step
  – set up and solve equations
• Concurrency across and within grid computations
• Static and regular
2
Simulating Galaxy Evolution
Simulate the interactions of many stars evolving over time
Computing forces is expensive
• O(n^2) brute force approach
• Hierarchical methods take advantage of force law: $G \frac{m_1 m_2}{r^2}$
[Figure: the star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate to center of mass; a large group far enough away to approximate]
• Many time-steps, plenty of concurrency across stars within one
3
Rendering Scenes by Ray Tracing
Shoot rays into scene through pixels in image plane
Follow their paths
• they bounce around as they strike objects
• they generate new rays: ray tree per input ray
Result is color and opacity for that pixel
Parallelism across rays
How much concurrency in these examples?
4
4 Steps in Creating a Parallel Program
[Figure: sequential computation -> decomposition -> tasks -> assignment -> processes (p0 to p3) -> orchestration -> parallel program -> mapping -> processors (P0 to P3); decomposition and assignment together make up partitioning]
Decomposition of computation in tasks
Assignment of tasks to processes
Orchestration of data access, comm, synch.
Mapping processes to processors
5
Performance Goal => Speedup
Architect Goal
• observe how program uses machine and improve the design to enhance performance
Programmer Goal
• observe how the program uses the machine and improve the implementation to enhance performance
What do you observe?
Who fixes what?
6
Analysis Framework
$\text{Speedup} < \dfrac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost} + \text{Extra Work})}$
Solving communication and load balance NP-hard in general case
• But simple heuristic solutions work well in practice
Fundamental Tension among:
• balanced load
• minimal synchronization
• minimal communication
• minimal extra work
Good machine design mitigates the trade-offs
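As a rough illustration of the speedup bound on this slide, the sketch below evaluates it for made-up per-process times. The function name `speedup_bound` and all the numbers are assumptions chosen only for illustration.

```python
def speedup_bound(sequential_work, per_process):
    """Upper bound on speedup: sequential work over the slowest process.

    per_process holds one (work, synch_wait, comm_cost, extra_work) tuple
    per process, in the same time units as sequential_work.
    """
    slowest = max(sum(components) for components in per_process)
    return sequential_work / slowest


# Hypothetical numbers: 100 units of sequential work on 4 processes.
balanced = [(25, 0, 0, 0)] * 4
imbalanced = [(40, 0, 5, 0), (20, 20, 5, 0), (20, 20, 5, 0), (20, 20, 5, 0)]

print(speedup_bound(100, balanced))    # ideal case
print(speedup_bound(100, imbalanced))  # one overloaded process limits everyone
```

In the second case the three lightly loaded processes spend their slack as synchronization wait time, so the bound is set entirely by the overloaded process.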
7
Decomposition
Identify concurrency and decide level at which to exploit it
Break up computation into tasks to be divided among processes
• Tasks may become available dynamically
• No. of available tasks may vary with time
Goal: Enough tasks to keep processes busy, but not too many
• Number of tasks available at a time is upper bound on achievable speedup
8
Limited Concurrency: Amdahl’s Law
Most fundamental limitation on parallel speedup
If fraction s of seq execution is inherently serial,
speedup <= 1/s
Example: 2-phase calculation
• sweep over n-by-n grid and do some independent computation
• sweep again and add each value to global sum
Time for first phase = $n^2/p$
Second phase serialized at global variable, so time = $n^2$
Speedup <= $\dfrac{2n^2}{\frac{n^2}{p} + n^2}$, or at most 2
Trick: divide second phase into two
• accumulate into private sum during sweep
• add per-process private sum into global sum
Parallel time is $n^2/p + n^2/p + p$, and speedup at best $\dfrac{2n^2}{\frac{2n^2}{p} + p} = \dfrac{2n^2 p}{2n^2 + p^2}$
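The two-phase example can be checked numerically. The sketch below assumes the times stated on the slide: n^2/p for a fully parallel sweep, n^2 for a sweep serialized at the global sum, and p for adding the p private sums. The function names are illustrative only.

```python
def speedup_shared_sum(n, p):
    """Second phase serialized on the global variable."""
    sequential = 2 * n * n
    parallel = n * n / p + n * n
    return sequential / parallel


def speedup_private_sums(n, p):
    """Second phase accumulates privately, then p serialized additions."""
    sequential = 2 * n * n
    parallel = n * n / p + n * n / p + p
    return sequential / parallel


for p in (4, 16, 64):
    print(p, speedup_shared_sum(1000, p), speedup_private_sums(1000, p))
```

With the serialized global sum the speedup never reaches 2, however large p is; with private sums it stays close to p as long as p is much smaller than n.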
9
Understanding Amdahl’s Law
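[Figure: work done concurrently (1 or p) plotted against time for three cases: (a) both phases serial, taking n^2 + n^2; (b) first phase parallel, taking n^2/p, followed by a serial n^2; (c) both phases parallel, taking n^2/p each, followed by a serial accumulation of length p]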
10
Concurrency Profiles
[Figure: concurrency profile; vertical axis: concurrency (0 to 1,400), horizontal axis: clock cycle number]
• Area under curve is total work done, or time with 1 processor
• Horizontal extent is lower bound on time (infinite processors)
• Speedup is the ratio: $\dfrac{\sum_{k=1}^{\infty} f_k \, k}{\sum_{k=1}^{\infty} f_k \left\lceil \frac{k}{p} \right\rceil}$, base case: $\dfrac{1}{s + \frac{1-s}{p}}$
• Amdahl’s law applies to any overhead, not just limited concurrency
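The ratio above can be evaluated directly from a concurrency profile. This is a minimal sketch, assuming f_k is the amount of time the profile spends at concurrency k; the dictionary representation and the name `profile_speedup` are choices made here, not part of the slides.

```python
import math


def profile_speedup(profile, p):
    """profile maps concurrency k to f_k, the time spent at concurrency k."""
    total_work = sum(f * k for k, f in profile.items())                # area under curve
    time_on_p = sum(f * math.ceil(k / p) for k, f in profile.items())  # time with p processors
    return total_work / time_on_p


# Base case: a fraction s of unit sequential time is serial and the rest
# runs with concurrency p, so it occupies (1 - s) / p of the profile.
s, p = 0.1, 10
amdahl_profile = {1: s, p: (1 - s) / p}
print(profile_speedup(amdahl_profile, p))
```

For this two-level profile the result agrees with the base-case formula 1 / (s + (1 - s) / p).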
11
Programming as Successive Refinement
Rich space of techniques and issues
• Trade off and interact with one another
Issues can be addressed/helped by software or hardware
• Algorithmic or programming techniques
• Architectural techniques
Not all issues in programming for performance dealt with up front
• Partitioning often independent of architecture, and done first
• Then interactions with architecture
  – Extra communication due to architectural interactions
  – Cost of communication depends on how it is structured
  – May inspire changes in partitioning
12
Partitioning for Performance
Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work
Even these algorithmic issues trade off:
• Minimize comm. => run on 1 processor => extreme load imbalance
• Maximize load balance => random assignment of tiny tasks => no control over communication
• Good partition may imply extra work to compute or manage it
Goal is to compromise
• Fortunately, often not difficult in practice
13
Load Balance and Synchronization
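$\text{Speedup}_{\text{problem}}(p) < \dfrac{\text{Sequential Work}}{\text{Max Work on any Processor}}$
[Figure: execution timelines for processes P0 to P3]
Instantaneous load imbalance revealed as wait time
• at completion
• at barriers
• at receive
• at flags, even at mutex
$\text{Speedup}_{\text{problem}}(p) < \dfrac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Wait Time})}$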
14
Load Balance and Synch Wait Time
Limit on speedup: $\text{Speedup}_{\text{problem}}(p) < \dfrac{\text{Sequential Work}}{\text{Max Work on any Processor}}$
• Work includes data access and other costs
• Not just equal work, but must be busy at same time
Four parts to load balance and reducing synch wait time:
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
15
Reducing Inherent Communication
Communication is expensive!
Metric: communication to computation ratio
Focus here on inherent communication
• Determined by assignment of tasks to processes
• Later see that actual communication can be greater
Assign tasks that access same data to same process
Solving communication and load balance NP-hard in general case
But simple heuristic solutions work well in practice
• Applications have structure!
16
Domain Decomposition
Works well for scientific, engineering, graphics, ... applications
Exploits local-biased nature of physical problems
• Information requirements often short-range
• Or long-range but fall off with distance
Simple example: nearest-neighbor grid computation
[Figure: n-by-n grid partitioned into 16 blocks, P0 to P15, each n/√p on a side]
Perimeter to Area comm-to-comp ratio (area to volume in 3-d)
• Depends on n, p: decreases with n, increases with p
17
Domain Decomposition (contd)
Best domain decomposition depends on information requirements
Nearest neighbor example: block versus strip decomposition:
[Figure: n-by-n grid partitioned into 16 blocks, P0 to P15, each n/√p on a side, compared with strips of n/p rows]
Comm to comp: $\dfrac{4\sqrt{p}}{n}$ for block, $\dfrac{2p}{n}$ for strip
Application dependent: strip may be better in other cases
• E.g. particle flow in tunnel
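The two ratios follow from perimeter over area for a single partition. The sketch below works them out under the assumption that each interior partition exchanges one grid point per boundary element and computes on every point it owns; the function names are illustrative.

```python
import math


def block_ratio(n, p):
    """Comm-to-comp ratio for square blocks of side n / sqrt(p)."""
    side = n / math.sqrt(p)
    return 4 * side / (side * side)   # four edges over block area


def strip_ratio(n, p):
    """Comm-to-comp ratio for strips of n / p full rows."""
    rows = n / p
    return 2 * n / (rows * n)         # two long edges over strip area


n, p = 1024, 16
print(block_ratio(n, p), strip_ratio(n, p))
```

For these values the block ratio is half the strip ratio, and the gap widens as p grows because the block ratio scales with √p rather than p.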
18
Finding a Domain Decomposition
Static, by inspection
• Must be predictable: grid example above
Static, but not by inspection
• Input-dependent, require analyzing input structure
• E.g. sparse matrix computations
Semi-static (periodic repartitioning)
• Characteristics change but slowly; e.g. N-body
Static or semi-static, with dynamic task stealing
• Initial domain decomposition but then highly unpredictable; e.g. ray tracing
19
N-body: Simulating Galaxy Evolution
• Simulate the interactions of many stars evolving over time
• Computing forces is expensive
• O(n^2) brute force approach
• Hierarchical methods take advantage of force law: $G \frac{m_1 m_2}{r^2}$
[Figure: the star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate to center of mass; a large group far enough away to approximate]
• Many time-steps, plenty of concurrency across stars within one
20
A Hierarchical Method: Barnes-Hut
(a) The spatial domain
(b) Quadtree representation
Locality Goal:
• Particles close together in space should be on same processor
Difficulties: Nonuniform, dynamically changing
21
Application Structure
[Figure: each time-step runs three phases in order: build tree; compute forces (compute moments of cells, then traverse tree to compute forces); update properties]
• Main data structures: array of bodies, of cells, and of pointers to them
  – Each body/cell has several fields: mass, position, pointers to others
  – pointers are assigned to processes
22
Partitioning
Decomposition: bodies in most phases (sometimes cells)
Challenges for assignment:
• Nonuniform body distribution => work and comm. nonuniform
  – Cannot assign by inspection
• Distribution changes dynamically across time-steps
  – Cannot assign statically
• Information needs fall off with distance from body
  – Partitions should be spatially contiguous for locality
• Different phases have different work distributions across bodies
  – No single assignment ideal for all
  – Focus on force calculation phase
• Communication needs naturally fine-grained and irregular
23
Load Balancing
• Equal particles ≠ equal work.
  – Solution: Assign costs to particles based on the work they do
• Work unknown and changes with time-steps
  – Insight: System evolves slowly
  – Solution: Count work per particle, and use as cost for next time-step.
Powerful technique for evolving physical systems
24
A Partitioning Approach: ORB
Orthogonal Recursive Bisection:
• Recursively bisect space into subspaces with equal work
  – Work is associated with bodies, as before
• Continue until one partition per processor
• High overhead for large no. of processors
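The recursion can be sketched in a few lines. This is a simplified illustration, not the implementation used in the case study: it assumes two dimensions, bodies given as hypothetical `(x, y, cost)` tuples, a power-of-two number of partitions, and cuts that alternate between the x and y axes.

```python
def orb(bodies, num_parts, axis=0):
    """Split (x, y, cost) bodies into num_parts groups of roughly equal cost."""
    if num_parts == 1:
        return [bodies]
    ordered = sorted(bodies, key=lambda body: body[axis])
    half = sum(body[2] for body in ordered) / 2
    running, cut = 0.0, len(ordered)
    for index, body in enumerate(ordered):
        running += body[2]
        if running >= half:      # left side now holds about half the work
            cut = index + 1
            break
    next_axis = 1 - axis         # alternate between x and y cuts
    return (orb(ordered[:cut], num_parts // 2, next_axis)
            + orb(ordered[cut:], num_parts // 2, next_axis))


# Sixteen unit-cost bodies on a 4-by-4 lattice, split among four processors.
bodies = [(x, y, 1.0) for x in range(4) for y in range(4)]
for part in orb(bodies, 4):
    print(sum(body[2] for body in part), part)
```

Each level of the recursion sorts its bodies, which hints at why the overhead grows with the number of processors; per-body costs would come from the work counted in the previous time-step.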
25
Another Approach: Costzones
Insight: Tree already contains an encoding of spatial locality.
[Figure: assignment of partitions P1 to P8 under (a) ORB and (b) Costzones]
• Costzones is low-overhead and very easy to program
26
Space Filling Curves
Morton Order
Peano-Hilbert Order
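Morton order can be computed by interleaving the bits of the cell coordinates. The sketch below is one common way to do this for two dimensions; the function name `morton_key` and the 16-bit default are assumptions for illustration, and Peano-Hilbert order (which needs a more involved mapping) is not shown.

```python
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x in the even positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Visiting a 4-by-4 grid of cells in Morton order traces a recursive Z shape.
cells = [(x, y) for y in range(4) for x in range(4)]
print(sorted(cells, key=lambda cell: morton_key(*cell)))
```

Cells that are adjacent in this ordering are usually close in space, so cutting the sorted list into contiguous ranges gives partitions with reasonable locality.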
27
Rendering Scenes by Ray Tracing
• Shoot rays into scene through pixels in image plane
• Follow their paths
  – they bounce around as they strike objects
  – they generate new rays: ray tree per input ray
• Result is color and opacity for that pixel
• Parallelism across rays
All case studies have abundant concurrency
28
Partitioning
Scene-oriented approach
• Partition scene cells, process rays while they are in an assigned cell
Ray-oriented approach
• Partition primary rays (pixels), access scene data as needed
• Simpler; used here
Need dynamic assignment; use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing
[Figure: image plane divided into blocks and tiles; a block is the unit of assignment, a tile is the unit of decomposition and stealing]
Could use 2-D interleaved (scatter) assignment of tiles instead
29
Other Techniques
Scatter Decomposition, e.g. initial partition in Raytrace
[Figure: a grid assigned to four processes (1 to 4) under domain decomposition (contiguous regions) and under scatter decomposition (interleaved cells)]
Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...
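The difference between the two decompositions comes down to how a cell's owner is computed. The sketch below assumes an n-by-n grid of tiles, q*q processes numbered from 0 in row-major order, and n divisible by q; both function names are illustrative.

```python
def domain_owner(row, col, n, q):
    """Domain (block) decomposition: each process owns one contiguous block."""
    block = n // q
    return (row // block) * q + (col // block)


def scatter_owner(row, col, q):
    """Scatter decomposition: ownership repeats every q tiles in each direction."""
    return (row % q) * q + (col % q)


n, q = 8, 2
for row in range(n):
    print([domain_owner(row, col, n, q) for col in range(n)],
          [scatter_owner(row, col, q) for col in range(n)])
```

Both schemes give every process the same number of tiles. Scatter spreads each process across the whole image, which tends to even out load when work is concentrated in one region, at the cost of spatial coherence.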
30
Determining Task Granularity
Task granularity: amount of work associated with a task
General rule:
• Coarse-grained => often less load balance
• Fine-grained => more overhead; often more comm., contention
Comm., contention actually affected by assignment, not size
• Overhead by size itself too, particularly with task queues
31
Dynamic Tasking with Task Queues
Centralized versus distributed queues
Task stealing with distributed queues
• Can compromise comm and locality, and increase synchronization
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task
[Figure: (a) Centralized task queue: all processes insert tasks into one queue Q and all remove tasks from it. (b) Distributed task queues (one per process): P0 to P3 each insert into and remove from their own queue, Q0 to Q3; others may steal]
Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...
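A small single-threaded simulation can show how stealing evens out load. It is a sketch under simplifying assumptions that are not from the slides: stealing is free, the thief always picks the fullest queue, it takes one task from the tail, and simulated time advances by the task's cost. The name `run_with_stealing` is hypothetical.

```python
from collections import deque


def run_with_stealing(queues):
    """queues: one deque of task costs per process. Returns finish times."""
    clocks = [0] * len(queues)
    while any(queues):
        # The process that becomes idle first acts next.
        me = min(range(len(queues)), key=lambda i: clocks[i])
        if queues[me]:
            task = queues[me].popleft()        # take from own queue
        else:
            victim = max(range(len(queues)), key=lambda i: len(queues[i]))
            task = queues[victim].pop()        # steal from the other end
        clocks[me] += task
    return clocks


# All work starts on process 0; stealing spreads it across four processes.
print(run_with_stealing([deque([4] * 8), deque(), deque(), deque()]))
```

Without stealing, process 0 would need 32 time units while the others sat idle. The remaining imbalance after stealing can be as large as one task, which is why the maximum imbalance is tied to task size.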
32
Assignment Summary
Specify mechanism to divide work up among processes
• E.g. which process computes forces on which stars, or which rays
• Balance workload, reduce communication and management cost
Structured approaches usually work well
• Code inspection (parallel loops) or understanding of application
• Well-known heuristics
• Static versus dynamic assignment
As programmers, we worry about partitioning first
• Usually independent of architecture or prog model
• But cost and complexity of using primitives may affect decisions
33
Parallelizing Computation vs. Data
Computation is decomposed and assigned (partitioned)
Partitioning Data is often a natural view too
• Computation follows data: owner computes
• Grid example; data mining;
Distinction between comp. and data stronger in many applications
• Barnes-Hut
• Raytrace
34
Reducing Extra Work
Common sources of extra work:
• Computing a good partition
  – e.g. partitioning in Barnes-Hut or sparse matrix
• Using redundant computation to avoid communication
• Task, data and process management overhead
  – applications, languages, runtime systems, OS
• Imposing structure on communication
  – coalescing messages, allowing effective naming
Architectural Implications:
• Reduce need by making communication and orchestration efficient
$\text{Speedup} < \dfrac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost} + \text{Extra Work})}$
35
It’s Not Just Partitioning
Inherent communication in parallel algorithm is not all
• artifactual communication caused by program implementation and architectural interactions can even dominate
• thus, amount of communication not dealt with adequately
Cost of communication determined not only by amount
• also how communication is structured
• and cost of communication in system
Both architecture-dependent, and addressed in orchestration step
36
Structuring Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost
Cost of communication as seen by process:
$C = f \times \left( o + l + \dfrac{n_c / m}{B} + t_c - \text{overlap} \right)$
  – f = frequency of messages
  – o = overhead per message (at both ends)
  – l = network delay per message
  – n_c = total data sent
  – m = number of messages
  – B = bandwidth along path (determined by network, NI, assist)
  – t_c = cost induced by contention per message
  – overlap = amount of latency hidden by overlap with comp. or comm.
Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message
• Goal: reduce terms in latency and increase overlap
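The cost expression can be evaluated directly to see how message count interacts with per-message overhead. In the sketch below every number is invented for illustration (times in microseconds, bandwidth in bytes per microsecond), and f is taken to equal the number of messages m.

```python
def comm_cost(f, o, l, n_c, m, B, t_c=0.0, overlap=0.0):
    """C = f * (o + l + (n_c / m) / B + t_c - overlap)."""
    return f * (o + l + (n_c / m) / B + t_c - overlap)


# Hypothetical machine: 10 us overhead, 5 us network delay, 100 bytes/us.
# The same 1,000,000 bytes are sent as 1000 small or 10 large messages.
small = comm_cost(f=1000, o=10, l=5, n_c=1e6, m=1000, B=100)
large = comm_cost(f=10, o=10, l=5, n_c=1e6, m=10, B=100)
print(small, large)
```

Under these assumptions the transfer time n_c / B is the same in both cases, so the whole difference comes from paying the overhead and delay 1000 times instead of 10.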
37
Reducing Overhead
Can reduce no. of messages m or overhead per message o
o is usually determined by hardware or system software
• Program should try to reduce m by coalescing messages
• More control when communication is explicit
Coalescing data into larger messages:
• Easy for regular, coarse-grained communication
• Can be difficult for irregular, naturally fine-grained communication
  – may require changes to algorithm and extra work
    • coalescing data and determining what and to whom to send
38
Reducing Network Delay
Network delay component = $f \times h \times t_h$
  – h = number of hops traversed in network
  – t_h = link+switch latency per hop
Reducing f: communicate less, or make messages larger
Reducing h:
• Map communication patterns to network topology
  – e.g. nearest-neighbor on mesh and ring; all-to-all
• How important is this?
  – used to be major focus of parallel algorithms
  – depends on no. of processors, how t_h compares with other components
  – less important on modern machines
    • overheads, processor count, multiprogramming
39
Reducing Contention
All resources have nonzero occupancy
• Memory, communication controller, network link, etc.
• Can only handle so many transactions per unit time
Effects of contention:
• Increased end-to-end cost for messages
• Reduced available bandwidth for other messages
• Causes imbalances across processors
Particularly insidious performance problem
• Easy to ignore when programming
• Slows down messages that don’t even need that resource
  – by causing other dependent resources to also congest
• Effect can be devastating: Don’t flood a resource!
40
Types of Contention
Network contention and end-point contention (hot-spots)
Location and Module Hot-spots
• Location: e.g. accumulating into global variable, barrier
  – solution: tree-structured communication
[Figure: flat communication (contention) versus tree-structured communication (little contention)]
• Module: all-to-all personalized comm. in matrix transpose
  – solution: stagger access by different processors to same node temporally
• In general, reduce burstiness; may conflict with making messages larger
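Tree-structured accumulation can be sketched as pairwise combining in rounds. The code below runs sequentially and only counts rounds; it assumes a non-empty input, and the name `tree_sum` is illustrative.

```python
def tree_sum(values):
    """Combine values pairwise; returns (total, number of rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Each pair combines independently, so a round could run in parallel.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


print(tree_sum(range(1, 9)))   # eight partial sums
```

With a flat scheme all eight contributions would be serialized at one location, whereas here no location receives more than two contributions in any round and the number of rounds grows only logarithmically.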
41
Overlapping Communication
Cannot afford to stall for high latencies
• even on uniprocessors!
Overlap with computation or communication to hide latency
Requires extra concurrency (slackness), higher bandwidth
Techniques:
• Prefetching
• Block data transfer
• Proceeding past communication
• Multithreading
42
Communication Scaling (NPB2)
[Figure: two plots for the FT, IS, LU, MG, SP and BT benchmarks: normalized messages per processor (0 to 8) and average message size (1.00E+02 to 1.00E+07); horizontal axes run from 0 to 40]
43
Communication Scaling: Volume
[Figure: two plots for the FT, IS, LU, MG, SP and BT benchmarks: bytes per processor (0 to 1.2) and total bytes (0.00E+00 to 9.00E+09); horizontal axes run from 0 to 40]
44
Mapping
Two aspects:
• Which process runs on which particular processor?
  – mapping to a network topology
• Will multiple processes run on same processor?
Space-sharing
• Machine divided into subsets, only one app at a time in a subset
• Processes can be pinned to processors, or left to OS
System allocation
Real world
• User specifies desires in some aspects, system handles some
Usually adopt the view: process <-> processor
45
Recap: Performance Trade-offs
Programmer’s View of Performance
$\text{Speedup} < \dfrac{\text{Sequential Work}}{\max(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost} + \text{Extra Work})}$
Different goals often have conflicting demands
• Load Balance
  – fine-grain tasks, random or dynamic assignment
• Communication
  – coarse grain tasks, decompose to obtain locality
• Extra Work
  – coarse grain tasks, simple assignment
• Communication Cost:
  – big transfers: amortize overhead and latency
  – small transfers: reduce contention
46
Recap (cont)
Architecture View
• cannot solve load imbalance or eliminate inherent communication
But can:
• reduce incentive for creating ill-behaved programs
  – efficient naming, communication and synchronization
• reduce artifactual communication
• provide efficient naming for flexible assignment
• allow effective overlapping of communication
47
Uniprocessor View
Performance depends heavily on memory hierarchy
Managed by hardware
Time spent by a program
• $\text{Time}_{\text{prog}}(1) = \text{Busy}(1) + \text{Data Access}(1)$
• Divide by cycles to get CPI equation
Data access time can be reduced by:
• Optimizing machine
  – bigger caches, lower latency...
• Optimizing program
  – temporal and spatial locality
[Figure: execution time in seconds (0 to 100) of one processor P, split into Busy-useful and Data-local]
48
Same Processor-Centric Perspective
[Figure: execution time in seconds (0 to 100) split into Busy-useful, Data-local, Busy-overhead, Data-remote and Synchronization: (a) sequential, on one processor; (b) parallel with four processors, P0 to P3]
49
What is a Multiprocessor?
A collection of communicating processors
• Goals: balance load, reduce inherent communication and extra work
A multi-cache, multi-memory system
• Role of these components essential regardless of programming model
• Prog. model and comm. abstr. affect specific performance tradeoffs
50
Relationship between Perspectives
Parallelization step(s)                   Performance issue                              Processor time component
Decomposition/assignment/orchestration    Load imbalance and synchronization             Synch wait
Decomposition/assignment                  Extra work                                     Busy-overhead
Decomposition/assignment                  Inherent communication volume                  Data-remote
Orchestration                             Artifactual communication and data locality    Data-local
Orchestration/mapping                     Communication structure

$\text{Speedup} < \dfrac{\text{Busy}(1) + \text{Data}(1)}{\text{Busy}_{\text{useful}}(p) + \text{Data}_{\text{local}}(p) + \text{Synch}(p) + \text{Data}_{\text{remote}}(p) + \text{Busy}_{\text{overhead}}(p)}$
51
Artifactual Communication
Accesses not satisfied in local portion of memory hierarchy cause “communication”
• Inherent communication, implicit or explicit, causes transfers
  – determined by program
• Artifactual communication
  – determined by program implementation and arch. interactions
  – poor allocation of data across distributed memories
  – unnecessary data in a transfer
  – unnecessary transfers due to system granularities
  – redundant communication of data
  – finite replication capacity (in cache or main memory)
• Inherent communication is what occurs with unlimited capacity, small transfers, and perfect knowledge of what is needed.
52
Spatial Locality Example
• Repeated sweeps over 2-d grid, each time adding 1 to elements
[Figure: (a) Two-dimensional array partitioned among P0 to P8, with contiguity in memory layout along rows. A page straddles partition boundaries, so it is difficult to distribute memory well; a cache block straddles a partition boundary]
53
Spatial Locality Example
• Repeated sweeps over 2-d grid, each time adding 1 to elements
• Natural 2-d versus higher-dimensional array representation
[Figure: grid partitioned among P0 to P8, showing contiguity in memory layout. (a) Two-dimensional array: a page straddles partition boundaries, so it is difficult to distribute memory well; a cache block straddles a partition boundary. (b) Four-dimensional array: a page does not straddle a partition boundary; a cache block is within a partition]
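The effect of the two representations shows up in the memory offsets of one partition's elements. The sketch below assumes row-major layout, an n-by-n grid, q-by-q square partitions and n divisible by q; the function names are illustrative.

```python
def offset_2d(i, j, n):
    """Row-major offset of element (i, j) in an n-by-n 2-d array."""
    return i * n + j


def offset_4d(i, j, n, q):
    """Offset in a 4-d array indexed [block row][block col][row in block][col in block]."""
    b = n // q                                   # block side length
    block_index = (i // b) * q + (j // b)
    return (block_index * b + (i % b)) * b + (j % b)


# Elements of the top-left partition of a 4-by-4 grid split into 2-by-2 blocks.
block = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([offset_2d(i, j, 4) for i, j in block])
print([offset_4d(i, j, 4, 2) for i, j in block])
```

In the 2-d layout the partition's rows are separated by elements that belong to a neighbor, while in the 4-d layout the partition occupies one contiguous range, so pages and cache blocks can fall entirely inside it.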
54
Tradeoffs with Inherent Communication
Partitioning grid solver: blocks versus rows
• Blocks still have a spatial locality problem on remote data
• Rowwise can perform better despite worse inherent c-to-c ratio
[Figure: good spatial locality on nonlocal accesses at row-oriented boundary; poor spatial locality on nonlocal accesses at column-oriented boundary]
• Result depends on n and p
55
Example Performance Impact
Equation solver on SGI Origin2000
[Figure: speedup versus number of processors (1 to 31) for the Rows, Rows-rr, 2D, 2D-rr, 4D and 4D-rr versions: (a) smaller problem size, (b) larger problem size]
56
Working Sets Change with P
8-fold reduction
in miss rate from
4 to 8 proc
57
Implications for Programming Models
Coherent shared address space and explicit message passing
Assume distributed memory in all cases
Recall any model can be supported on any architecture
• Assume both are supported efficiently
• Assume communication in SAS is only through loads and stores
• Assume communication in SAS is at cache block granularity
59
Issues to Consider
Functional issues:
• Naming: How are logically shared data and/or processes referenced?
• Operations: What operations are provided on these data?
• Ordering: How are accesses to data ordered and coordinated?
Performance issues
• Granularity and endpoint overhead of communication
  – (latency and bandwidth depend on network so considered similar)
• Replication: How are data replicated to reduce communication?
• Ease of performance modeling
Cost Issues
• Hardware cost and design complexity
60
Sequential Programming Model
Contract
• Naming: Can name any variable (in virtual address space)
  – Hardware (and perhaps compilers) does translation to physical addresses
• Operations: Loads, Stores, Arithmetic, Control
• Ordering: Sequential program order
Performance Optimizations
• Compilers and hardware violate program order without getting caught
  – Compiler: reordering and register allocation
  – Hardware: out of order, pipeline bypassing, write buffers
• Retain dependence order on each “location”
• Transparent replication in caches
Ease of Performance Modeling: complicated by caching
61
SAS Programming Model
Naming: Any process can name any variable in shared space
Operations: loads and stores, plus those needed for ordering
Simplest Ordering Model:
• Within a process/thread: sequential program order
• Across threads: some interleaving (as in time-sharing)
• Additional ordering through explicit synchronization
• Can compilers/hardware weaken order without getting caught?
  – Different, more subtle ordering models also possible (more later)
62
Synchronization
Mutual exclusion (locks)
• Ensure certain operations on certain data can be performed by only one process at a time
• Room that only one person can enter at a time
• No ordering guarantees
Event synchronization
• Ordering of events to preserve dependences
  – e.g. producer -> consumer of data
• 3 main types:
  – point-to-point
  – global
  – group
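The two kinds of synchronization can be illustrated with Python's standard `threading` module. This is a sketch, not tied to any system discussed in the slides: a `Lock` provides mutual exclusion on a shared counter, and an `Event` provides point-to-point event synchronization between a producer and a consumer. The function name `run_demo` and the thread counts are illustrative.

```python
import threading


def run_demo(num_workers=4, increments=1000):
    total = 0
    lock = threading.Lock()      # mutual exclusion
    ready = threading.Event()    # point-to-point event synchronization
    box, received = [], []

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:           # only one thread updates total at a time
                total += 1

    def producer():
        box.append("payload")
        ready.set()              # signal that the data is available

    def consumer():
        ready.wait()             # block until the producer has signalled
        received.append(box[0])

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    threads += [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()            # wait for every thread to finish
    return total, received


print(run_demo())
```

The lock says nothing about the order in which workers update the counter, only that updates do not overlap; the event, by contrast, enforces that the consumer reads after the producer writes.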
63
Message Passing Programming Model
Naming: Processes can name private data directly.
• No shared address space
Operations: Explicit communication through send and receive
• Send transfers data from private address space to another process
• Receive copies data from process to private address space
• Must be able to name processes
Ordering:
• Program order within a process
• Send and receive can provide pt to pt synch between processes
  – Complicated by asynchronous message passing
• Mutual exclusion inherent + conventional optimizations legal
Can construct global address space:
• Process number + address within process address space
• But no direct operations on these names
64
Naming
Uniprocessor: Can name any variable (in virtual address space)
• Hardware (and perhaps compiler) does translation to physical addresses
SAS: similar to uniprocessor; system does it all
MP: each process can only directly name the data in its address space
• Need to specify from where to obtain or where to transfer nonlocal data
• Easy for regular applications (e.g. Ocean)
• Difficult for applications with irregular, time-varying data needs
  – Barnes-Hut: where are the parts of the tree that I need? (change with time)
  – Raytrace: where are the parts of the scene that I need? (unpredictable)
• Solution methods exist
  – Barnes-Hut: Extra phase determines needs and transfers data before computation phase
  – Raytrace: scene-oriented rather than ray-oriented approach
  – both: emulate application-specific shared address space using hashing
65
Operations
Sequential: Loads, Stores, Arithmetic, Control
SAS: loads and stores, plus those needed for ordering
MP: Explicit communication through send and receive
• Send transfers data from private address space to another process
• Receive copies data from process to private address space
• Must be able to name processes
66
Replication
Who manages it (i.e. who makes local copies of data)?
• SAS: system, MP: program
Where in local memory hierarchy is replication first done?
• SAS: cache (or memory too), MP: main memory
At what granularity is data allocated in replication store?
• SAS: cache block, MP: program-determined
How are replicated data kept coherent?
• SAS: system, MP: program
How is replacement of replicated data managed?
• SAS: dynamically at fine spatial and temporal grain (every access)
• MP: at phase boundaries, or emulate cache in main memory in software
Of course, SAS affords many more options too (discussed later)
67
Communication Overhead and Granularity
Overhead directly related to hardware support provided
• Lower in SAS (order of magnitude or more)
Major tasks:
• Address translation and protection
  – SAS uses MMU
  – MP requires software protection, usually involving OS in some way
• Buffer management
  – fixed-size small messages in SAS easy to do in hardware
  – flexible-sized messages in MP usually need software involvement
• Type checking and matching
  – MP does it in software: lots of possible message types due to flexibility
A lot of research in reducing these costs in MP, but still much larger
Naming, replication and overhead favor SAS
• Many irregular MP applications now emulate SAS/cache in software
68
Block Data Transfer
Fine-grained communication not most efficient for long messages
• Latency and overhead as well as traffic (headers for each cache line)
SAS: can use block data transfer
• Explicit in system we assume, but can be automated at page or object level in general (more later)
• Especially important to amortize overhead when it is high
  – latency can be hidden by other techniques too
Message passing:
• Overheads are larger, so block transfer more important
• But very natural to use since messages are explicit and flexible
  – Inherent in model
69
Synchronization
SAS: Separate from communication (data transfer)
• Programmer must orchestrate separately
Message passing
• Mutual exclusion by fiat
• Event synchronization already in send-receive match in synchronous
  – need separate orchestration (using probes or flags) in asynchronous
70
Hardware Cost and Design Complexity
Higher in SAS, and especially cache-coherent SAS
But both are more complex issues
• Cost
  – must be compared with cost of replication in memory
  – depends on market factors, sales volume and other nontechnical issues
• Complexity
  – must be compared with complexity of writing high-performance programs
  – Reduced by increasing experience
71
Performance Model
Three components:
• Modeling cost of primitive system events of different types
• Modeling occurrence of these events in workload
• Integrating the two in a model to predict performance
Second and third are most challenging
Second is the case where cache-coherent SAS is more difficult
• replication and communication implicit, so events of interest implicit
  – similar to problems introduced by caching in uniprocessors
• MP has good guideline: messages are expensive, send infrequently
• Difficult for irregular applications in either case (but more so in SAS)
Block transfer, synchronization, cost/complexity, and performance modeling advantageous for MP
72
Summary for Programming Models
Given tradeoffs, architect must address:
• Hardware support for SAS (transparent naming) worthwhile?
• Hardware support for replication and coherence worthwhile?
• Should explicit communication support also be provided in SAS?
Current trend:
• Tightly-coupled multiprocessors support cache-coherent SAS in hw
• Other major platform is clusters of workstations or multiprocessors
  – currently don’t support SAS in hardware, mostly use message passing
• At highest end, clusters of cache-coherent SAS multiprocessors
73
Summary
Crucial to understand characteristics of parallel programs
• Implications for a host of architectural issues at all levels
Architectural convergence has led to:
• Greater portability of programming models and software
  – Many performance issues similar across programming models too
• Clearer articulation of performance issues
  – Used to use PRAM model for algorithm design
  – Now models that incorporate communication cost (BSP, LogP, ...)
  – Emphasis in modeling shifted to end-points, where cost is greatest
  – But need techniques to model application behavior, not just machines
Performance issues trade off with one another; iterative refinement
Ready to understand using workloads to evaluate systems issues
74