Multiprocessors and Thread-Level
Parallelism
Vincent Berk
November 31, 2005
Reading for Wednesday: Sections 6.1–6.3
Reading for Friday: Sections 6.4–6.7
Reading for the following Wednesday: Sections 6.8–6.10, 6.14
Parallel Computers
• Definition: “A parallel computer is a collection of processing
elements that cooperate and communicate to solve large problems
fast.”
Almasi and Gottlieb, Highly Parallel Computing, 1989
• Questions about parallel computers:
– How large a collection?
– How powerful are processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are HW and SW primitives for programmer?
– Does it translate into performance?
Parallel Processors “Religion”
• The dream of computer architects since 1960: replicate processors to
add performance vs. design a faster processor
• Led to innovative organizations tied to particular programming models, since "uniprocessors can't keep going"
– e.g., uniprocessors must stop getting faster due to the limit of the speed of light: a claim made in 1972, … , 1989
– Borders on religious fervor: you must believe!
– Fervor was damped somewhat when companies went out of business in the 1990s: Thinking Machines, Kendall Square, ...
• Argument instead is the “pull” of opportunity of scalable
performance, not the “push” of uniprocessor performance plateau
Opportunities: Scientific Computing
• Nearly Unlimited Demand (Grand Challenge):

  App                      Perf (GFLOPS)   Memory (GB)
  48 hour weather          0.1             0.1
  72 hour weather          3               1
  Pharmaceutical design    100             10
  Global Change, Genome    1000            1000

  (Figure 1-2, page 25, of Culler, Singh, Gupta [CSG97])
• Successes in some real industries:
– Petroleum: reservoir modeling
– Automotive: crash simulation, drag analysis, engine
– Aeronautics: airflow analysis, engine, structural mechanics
– Pharmaceuticals: molecular modeling
– Entertainment: full-length movies ("Toy Story")
Flynn’s Taxonomy
• SISD (Single Instruction Single Data)
– Uniprocessors
• MISD (Multiple Instruction Single Data)
– ???
• SIMD (Single Instruction Multiple Data)
– Examples: Illiac-IV, CM-2
• Simple programming model
• Low overhead
• Flexibility
• All custom integrated circuits
• MIMD (Multiple Instruction Multiple Data)
– Examples: Sun Enterprise 10000, Cray T3D, SGI Origin
• Flexible
• Use off-the-shelf microprocessors
[Figure: Flynn's classification of computer systems. (a) SISD computer: one control unit feeds one processor unit with a single instruction stream and a single data stream. (b) SIMD computer: one control unit broadcasts a single instruction stream to multiple processor units, each operating on its own data stream. (c) MIMD computer: multiple control units and processor units, each pair with its own instruction and data streams, connected to multiple memory modules.]
Parallel Framework
• Programming Model:
– Multiprogramming: lots of jobs, no communication
– Shared address space: communicate via memory
– Message passing: send and receive messages
– Data parallel: several agents operate on several data sets
simultaneously and then exchange information globally and
simultaneously (shared or message passing)
• Communication Abstraction:
– Shared address space: e.g., load, store, atomic swap
– Message passing: e.g., send, receive library calls
– Debate over this topic (ease of programming, scaling)
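A minimal sketch of the shared-address-space primitives named above (load, store, atomic swap), used here to build a test-and-set spin lock out of an atomic exchange; C11 atomics assumed, and the names are illustrative rather than from the lecture:

    #include <stdatomic.h>

    atomic_int lock_word = 0;        /* 0 = free, 1 = held */

    void acquire(atomic_int *lock) {
        /* atomic swap: write 1 and get the old value back in one indivisible step */
        while (atomic_exchange(lock, 1) == 1)
            ;                        /* another processor holds it; spin */
    }

    void release(atomic_int *lock) {
        atomic_store(lock, 0);       /* an ordinary store frees the lock */
    }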
[Figure: A shared-memory system: processors (P) and memory modules (M) all attached to a single network. A distributed-memory system: each processor (P) paired with a local memory (M), with the processor–memory nodes connected by a network.]
Diagrams of Shared and Distributed Memory Architectures
Shared Address/Memory Multiprocessor Model
• Communicate via Load and Store
– Oldest and most popular model
• Based on timesharing: processes now run on multiple processors vs. sharing a single processor
• Process: a virtual address space and ≥ 1 thread of control
– Multiple processes can overlap (share), but ALL threads share
a process address space
• Writes to shared address space by one thread are visible to reads
of other threads
– Usual model: share code, private stack, some shared heap,
some private heap
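A small sketch of this usual model: the threads of one process share code and heap, while each thread's locals live on its own private stack. POSIX threads and C11 atomics assumed; all names are illustrative:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    _Atomic int *shared_counter;                 /* heap: shared by all threads */

    void *worker(void *arg) {
        int private_increment = (int)(long)arg;  /* stack: private per thread */
        atomic_fetch_add(shared_counter, private_increment);  /* write visible to all */
        return NULL;
    }

    int main(void) {
        shared_counter = malloc(sizeof *shared_counter);
        atomic_init(shared_counter, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared counter = %d\n", *shared_counter);  /* prints 3 */
        free(shared_counter);
        return 0;
    }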
Example: Small-Scale MP Designs
• Memory: centralized with uniform memory access (“UMA”) and bus
interconnect, I/O
• Examples: Sun Enterprise, SGI Challenge, Intel SystemPro
[Figure: A bus-based symmetric multiprocessor: four processors, each with one or more levels of cache, sharing a single main memory and I/O system.]
SMP Interconnect
• Processors connected to memory AND to I/O
• Bus-based: all memory locations have equal access time, so SMP = "Symmetric MP"
– The bus's limited shared BW becomes a bottleneck as we add processors, I/O
• Crossbar: expensive to expand
• Multistage network: less expensive to expand than a crossbar, with more BW than a bus
• “Dance Hall” designs: all processors on the left, all memories
on the right
Large-Scale MP Designs
• Memory: distributed with nonuniform memory access (“NUMA”) and
scalable interconnect (distributed memory)
[Figure: A distributed-memory (NUMA) multiprocessor: each node pairs a processor + cache with a local memory and I/O, and the nodes are connected by a low-latency, high-reliability interconnection network. Representative access times: 1 cycle to cache, 40 cycles to local memory, 100 cycles to remote memory across the network.]
Shared Address Model Summary
• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Uses virtual memory to map virtual addresses to local or remote physical addresses
• Memory hierarchy model applies: now communication moves data to
local processor’s cache (as load moves data from memory to cache)
– Latency, BW, scalability of communication?
Message Passing Model
• Whole computers (CPU, memory, I/O devices) communicate as
explicit I/O operations
– Essentially NUMA but integrated at I/O devices vs. memory system
• Send specifies local buffer + receiving process on remote computer
• Receive specifies sending process on remote computer + local buffer
to place data
– Usually send includes process tag and receive has rule on tag:
match one, match any
– Synchronization: when send completes, when buffer free, when request
accepted, receive waits for send
• Send + receive ⇒ memory-to-memory copy, where each side supplies a local
address AND does pairwise synchronization!
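As a concrete sketch of this model, a minimal MPI program (a real send/receive library, though not named on this slide; mpicc/mpirun with two processes assumed): the send names the local buffer, the receiving process, and a tag; the receive names the sending process, a local buffer, and a tag rule.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 42;
            /* send: local buffer + receiving process (1) + tag */
            MPI_Send(&data, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* receive: sending process (0) + local buffer + tag rule;
               tag 7 "matches one", MPI_ANY_TAG would "match any" */
            MPI_Recv(&data, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
            printf("process 1 received %d\n", data);  /* one memory-to-memory copy */
        }
        MPI_Finalize();
        return 0;
    }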
[Figure: Processor 1 and Processor 2 connected by a communications network; processes on each processor send messages through ports (Port 1, Port 2) into the receiver's buffer or queue.]
Logical View of a Message Passing System
Message Passing Model
• Send + receive ⇒ memory-to-memory copy with synchronization by the OS, even on 1 processor
• History of message passing:
– Network topology important because could only send to immediate
neighbor
– Typically synchronous, blocking send & receive
– Later, DMA with non-blocking sends: DMA on the receive side moves data into a buffer until the processor does a receive, and then the data is transferred to local memory
– Later SW libraries to allow arbitrary communication
• Example: IBM SP-2, RS6000 workstations in racks
– Network interface card has an Intel 960
– 8 × 8 crossbar switch as communication building block
– 40 MB/sec per link
Message Passing Model
• Special type of message passing: the pipe
• A pipe is a point-to-point connection
• Data is pushed in one way by the first process and received, in order, by the second process
• Data can stay in the pipe indefinitely
• Asynchronous, simplex communication
• Advantages:
– Simple implementation
– Works with send/receive calls
– Good way of implementing producer-consumer systems (see the sketch below)
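A minimal producer-consumer sketch over a POSIX pipe (POSIX assumed; not from the lecture): the parent pushes data one way, and the child receives it in order.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fd[2];                     /* fd[0] = read end, fd[1] = write end */
        if (pipe(fd) < 0) { perror("pipe"); exit(1); }

        if (fork() == 0) {             /* consumer: receives in order */
            close(fd[1]);
            char buf[64];
            ssize_t n;
            while ((n = read(fd[0], buf, sizeof buf - 1)) > 0) {
                buf[n] = '\0';
                printf("consumed: %s", buf);
            }
            close(fd[0]);
            exit(0);
        }

        close(fd[0]);                  /* producer: pushes data one way */
        for (int i = 0; i < 3; i++) {
            char msg[32];
            int len = snprintf(msg, sizeof msg, "item %d\n", i);
            write(fd[1], msg, (size_t)len);
        }
        close(fd[1]);                  /* EOF lets the consumer finish */
        wait(NULL);
        return 0;
    }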
Communication Models
• Shared Memory
– Processors communicate with shared address space
– Easy on small-scale machines
– Advantages:
• Model of choice for uniprocessors, small-scale MPs
• Ease of programming
• Lower latency
• Easier to use hardware-controlled caching
• Message passing
– Processors have private memories, communicate via messages
– Advantages:
• Less hardware, easier to design
• Focuses attention on costly non-local operations
• Can support either SW model on either HW base
How Do We Develop a Parallel Program?
• Convert a sequential program to a parallel program
– Partition the program into tasks and combine tasks into processes
– Coordinate data accesses, communication, and synchronization
– Map the processes to processors
• Create a parallel program from scratch
Review: Parallel Framework
• Layers (top to bottom): Programming Model → Communication Abstraction → Interconnection SW/OS → Interconnection HW
– Programming Model:
• Multiprogramming: lots of jobs, no communication
• Shared address space: communicate via memory
• Message passing: send and receive messages
• Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
– Communication Abstraction:
• Shared address space: e.g., load, store, atomic swap
• Message passing: e.g., send, receive library calls
• Debate over this topic (ease of programming, scaling)
Fundamental Issues
• Fundamental issues in parallel machines
1) Naming
2) Synchronization
3) Latency and Bandwidth
Fundamental Issue #1: Naming
• Naming:
– what data is shared
– how it is addressed
– what operations can access data
– how processes refer to each other
• Choice of naming affects the code produced by a compiler: with a shared address space a load needs only to remember the address, while message passing must keep track of the processor number and a local virtual address
• Choice of naming affects the replication of data: via loads into the cache in a memory hierarchy, or via SW replication and consistency
Fundamental Issue #1: Naming
• Global physical address space: any processor can generate the address and access the location in a single operation
• Global virtual address space: if the address space of each process
can be configured to contain all shared data of the parallel program
• Segmented shared address space:
locations are named < process number, address >
uniformly for all processes of the parallel program
• Independent local physical addresses
Fundamental Issue #2: Synchronization
• To cooperate, processes must coordinate
• Message passing has implicit coordination: the transmission or arrival of data synchronizes sender and receiver
• A shared address system needs additional operations to coordinate explicitly: e.g., write a flag, awaken a thread, interrupt a processor (see the sketch below)
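A minimal sketch of the write-a-flag idiom in a shared address space (C11 <threads.h> and atomics assumed; the names are illustrative): the producer writes the data and then sets a flag, and the consumer spins on the flag before reading.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    int shared_data;                    /* ordinary shared location */
    atomic_bool ready = false;          /* the coordination flag    */

    int producer(void *arg) {
        (void)arg;
        shared_data = 42;               /* write the shared data first...   */
        atomic_store_explicit(&ready, true, memory_order_release);  /* ...then the flag */
        return 0;
    }

    int consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                           /* spin until the producer signals  */
        printf("saw %d\n", shared_data); /* release/acquire makes this safe */
        return 0;
    }

    int main(void) {
        thrd_t p, c;
        thrd_create(&c, consumer, NULL);
        thrd_create(&p, producer, NULL);
        thrd_join(p, NULL);
        thrd_join(c, NULL);
        return 0;
    }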
Fundamental Issue #3: Latency and Bandwidth
• Bandwidth
– Need high bandwidth in communication
– Bandwidth cannot scale with processor count, but should stay close
– Match the limits of network, memory, and processor
– Communication overhead is a problem in many machines
• Latency
– Affects performance, since processor may have to wait
– Affects ease of programming, since it requires more thought to
overlap communication and computation
• Latency Hiding
– How can a mechanism help hide latency?
– Examples: overlap message send with computation, prefetch
data, switch to other tasks (see the sketch below)
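A hedged sketch of the first example, overlapping a message send with computation, using MPI's non-blocking send (MPI assumed as the message library; do_local_work is a hypothetical stand-in for computation that does not touch the buffer):

    #include <mpi.h>

    extern void do_local_work(void);   /* hypothetical overlapped computation */

    void send_and_compute(double *buf, int n, int dest) {
        MPI_Request req;
        /* start the send; the transfer proceeds in the background */
        MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

        do_local_work();               /* overlap: compute while the message is in flight */

        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete before reusing buf */
    }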
Small-Scale — Shared Memory
• Caches serve to:
– Increase bandwidth versus bus/memory
– Reduce latency of access
– Valuable for both private data and shared data
• What about cache coherence?
[Figure: The same bus-based symmetric multiprocessor as before: four processors, each with one or more levels of cache, sharing main memory and an I/O system.]
The Problem of Cache Coherency
[Figure: Three CPU + cache + memory snapshots showing the cache-coherency problem. (a) Cache and memory coherent: A' = A (100) and B' = B (200). (b) Cache and memory incoherent after the CPU writes A' = 550: A' ≠ A (A stale), so an I/O output of A gives the old value 100. (c) Cache and memory incoherent after I/O inputs 440 to B in memory: B' ≠ B (B' stale at 200).]
What Does Coherency Mean?
• Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
• Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
• Two rules to ensure this:
– “If P writes x and P1 reads it, P’s write will be seen by P1 if
the read and write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order
• Latest write will be seen
• Otherwise could see writes in illogical order
(could see older value after a newer value)
Time   Event                    Cache contents   Cache contents   Memory contents
                                for CPU A        for CPU B        for location X
0                                                                 1
1      CPU A reads X            1                                 1
2      CPU B reads X            1                1                1
3      CPU A stores 0 into X    0                1                0

The cache-coherence problem for a single memory location (X), read and written by two processors (A and B). This example assumes a write-through cache.
Potential HW Coherency Solutions
• Snooping Solution (Snoopy Bus):
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond
accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small-scale machines (most of the market)
• Directory-Based Schemes
– Keep track of what is being shared in one centralized place
– Distributed memory ⇒ distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes
Basic Snoopy Protocols
• Write Invalidate Protocol:
– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches which
snoop and invalidate any copies
– Read Miss:
• Write-through: memory is always up-to-date
• Write-back: snoop in caches to find most recent copy
• Write Broadcast Protocol (typically write through):
– Write to shared data: broadcast on bus, processors snoop, and
update any copies
– Read miss: memory is always up-to-date
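A hedged sketch of the write-invalidate idea as per-block cache states and transitions (a simplified MSI-style protocol; the state names and helper callbacks are illustrative, not from the lecture):

    typedef enum { INVALID, SHARED, MODIFIED } BlockState;

    /* Local processor actions */
    BlockState on_local_write(BlockState s, void (*bus_invalidate)(void)) {
        if (s != MODIFIED)
            bus_invalidate();   /* broadcast: other caches snoop and invalidate copies */
        return MODIFIED;        /* single writer */
    }

    BlockState on_local_read(BlockState s, void (*bus_read_miss)(void)) {
        if (s == INVALID)
            bus_read_miss();    /* memory (write-through) or another cache's
                                   MODIFIED copy (write-back) supplies the data */
        return (s == MODIFIED) ? MODIFIED : SHARED;  /* multiple readers allowed */
    }

    /* Reactions to snooped bus events from other processors */
    BlockState on_snoop_invalidate(BlockState s) {
        (void)s;
        return INVALID;         /* drop our copy; the writer gains exclusivity */
    }

    BlockState on_snoop_read_miss(BlockState s, void (*write_back)(void)) {
        if (s == MODIFIED)
            write_back();       /* we hold the most recent copy; supply it */
        return (s == INVALID) ? INVALID : SHARED;
    }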
Processor activity       Bus activity         Contents of      Contents of      Contents of memory
                                              CPU A's cache    CPU B's cache    location X
                                                                                0
CPU A reads X            Cache miss for X     0                                 0
CPU B reads X            Cache miss for X     0                0                0
CPU A writes a 1 to X    Invalidation for X   1                                 0
CPU B reads X            Cache miss for X     1                1                1

An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches.
Processor activity       Bus activity           Contents of      Contents of      Contents of memory
                                                CPU A's cache    CPU B's cache    location X
                                                                                  0
CPU A reads X            Cache miss for X       0                                 0
CPU B reads X            Cache miss for X       0                0                0
CPU A writes a 1 to X    Write broadcast of X   1                1                1
CPU B reads X                                   1                1                1

An example of a write-update or broadcast protocol working on a snooping bus for a single cache block (X) with write-back caches.
Basic Snoopy Protocols
• Write Invalidate versus Broadcast:
– Invalidate requires one transaction per write-run
– Invalidate uses spatial locality: one transaction per block
– Broadcast has lower latency between write and read