The Stanford Directory
Architecture for Shared Memory
(DASH)*
Presented by: Michael Bauer
ECE 259/CPS 221
Spring Semester 2008
Dr. Lebeck
* Based on “The Stanford Dash Multiprocessor” in IEEE Computer March 1992
Outline
1. Motivation
2. High Level System Overview
3. Cache Coherence Protocol
4. Memory Consistency Model: Release Consistency
5. Overcoming Long Latency Operations
6. Software Support
7. Performance Results
8. Conclusion: Where is it now?
Motivation
Goals:
1. Minimal impact on programming model
2. Cost efficiency
3. Scalability!!!
Design Decisions:
1. Shared Address Space (no MPI)
2. Parallel architecture instead of waiting for the next, faster sequential processor
(no clock-scaling issues yet!)
3. Hardware-controlled, directory-based cache coherence
High Level System Overview
[Diagram: multiple clusters, each containing several processors with private caches, a slice of main memory, and a directory, joined by the interconnect network]
A shared address space without shared memory??*
* See http://www.uschess.org/beginners/read/ for meaning of “??”
Cache Coherence Protocol
DASH’s Big Idea: Hierarchical Directory Protocol
- Locate cache blocks using a hierarchy of directories
- Like NUCA except for directories (NUDA = Non-Uniform Directory Access?)
- Cache blocks in three possible states
  - Dirty (M)
  - Shared (S)
  - Uncached (I)
Processor Level
- The processor's own cache
Local Cluster Level
- Other processor caches within the local cluster
Home Cluster Level
- Directory and main memory associated with a given address
Remote Cluster Level
- Processor caches in remote clusters
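The four-level lookup above can be sketched as a toy model. All names here (`BlockState`, `Level`, `service_level`) are assumptions for illustration, not DASH's actual controller logic: each level is tried in turn, and the first one that can supply a valid copy services the miss.

```cpp
#include <cassert>

// Toy model of DASH's three directory-entry states.
enum class BlockState { Uncached, Shared, Dirty };

// The four levels of the directory hierarchy, in lookup order.
enum class Level { ProcessorCache, LocalCluster, HomeCluster, RemoteCluster };

// Which level services a read miss, given where a valid copy lives.
Level service_level(bool in_own_cache, bool in_local_cluster,
                    BlockState home_state) {
    if (in_own_cache)     return Level::ProcessorCache;  // hit: no traffic
    if (in_local_cluster) return Level::LocalCluster;    // snooped off local bus
    if (home_state != BlockState::Dirty)
        return Level::HomeCluster;    // home memory has a clean copy
    return Level::RemoteCluster;      // dirty copy held in a remote cluster
}
```

Note the design point this captures: only when every closer level fails does a request cross the interconnect, which is why locality matters so much to DASH's performance.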
Cache Coherency Example
[Diagram: requesting processor's local cluster, the home cluster, and a remote cluster holding the dirty block, connected by the interconnect network]
1. Requesting processor makes request on local bus
2. No response within the local cluster; the local directory forwards the request over the network
3. Home directory sees request, sends message to the remote cluster
4. Remote directory puts request on bus
5. Remote processor responds with data
6. Remote directory forwards data, updates home directory
7. Data delivered, home directory updated
Implications of Cache Coherence Protocol
- What do hierarchical directories get us?
- Very fast access on local cluster
- Moderately fast access to home cluster
- Minimized data movement (assumed temporal and spatial locality?)
- What problems still exist?
- Broadcast in some circumstances can be bottleneck to scalability
- Complexity of cache and directory controllers; many outstanding
requests needed to hide latency -> power-hungry CAMs
- Potential for long latency events as shown in example (more on
this later)
Memory Consistency Model:
Release Consistency
Release Consistency Review*:
1. W->R reordering allowed (to different blocks only)
2. W->W reordering allowed (to different blocks only)
3. R->W (to different blocks only) and R->R reordering allowed
Why Release Consistency?
1. Provides acceptable programming model
2. Reordering events is essential for performance
on a variable latency system
3. Relaxed requirements for interconnect network, no need for
in order distribution of messages
* Taken from “Shared Memory Consistency Models: A Tutorial”, we’ll read this later
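The acquire/release pairing at the heart of release consistency can be sketched with modern C++ atomics (all function names here are illustrative). Ordinary writes before the release-store may be freely reordered among themselves, but none may move past the release; the matching acquire-load guarantees the consumer sees them all.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready{false};  // synchronization flag

void producer() {
    payload = 42;  // ordinary write; may reorder with other ordinary writes
    ready.store(true, std::memory_order_release);  // release: all prior writes visible
}

void consumer() {
    // acquire: spin until the flag is published, then payload is guaranteed visible
    while (!ready.load(std::memory_order_acquire)) {}
}

// Run one producer/consumer round and return what the consumer can observe.
int run_once() {
    payload = 0;
    ready.store(false);
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return payload;  // 42, guaranteed by the acquire/release pairing
}
```

This is exactly why point 2 above matters: between synchronization operations, the hardware is free to reorder and overlap accesses to hide variable memory latency.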
Overcoming Long Latency Operations
Prefetching:
- How is this beneficial to execution?
- What can go wrong with prefetching?
- Does this scale?
Update and Deliver Operations:
- What if we know data is going to be needed by many threads?
- Tell system to broadcast data to everyone using Update-Write
operation
- Does this scale well?
- What about embarrassingly parallel applications?
Software Support
- Parallel version of Unix OS
- Handle prefetching in software (will this scale?)
- Parallelizing compiler (how well do you think this works?)
- Parallel language Jade (how easy to rewrite applications?)
Performance Results
[Graphs: speedup vs. processor count for several applications]
- Do these look like they scale well?
- What is going on here?!?
Conclusion: Where is it now?
- Novel architecture and cache coherence protocol
- Some level of scalability for diverse applications
- Why don’t we see DASH everywhere?
- Parallel architectures not cost-effective for general purpose
computing until recently
- Requires adaptation of sequential code to parallel architecture
- Power?
- Any other reasons?
- For anyone interested: DASH -> FLASH -> SGI Origin (Server)
http://www-flash.stanford.edu/architecture/papers/ISCA94/
http://www.futuretech.blinkenlights.nl/origin/isca.pdf