
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm
Presented by: Priya Limaye
Locality
• What is Locality of reference?
sum = 0;
for (int i = 0; i < 10; i++) {
    sum = sum + number[i];
}
Locality
• What is Locality of reference?
Temporal Locality: recently accessed data and instructions are likely to be accessed in the near future (in the loop above, sum and the loop instructions are reused on every iteration).
Locality
• What is Locality of reference?
Spatial Locality: data and instructions close to recently accessed data and instructions are likely to be accessed in the near future (in the loop above, number[i + 1] is accessed right after number[i]).
Locality
• What is Locality of reference?
– Recently accessed data and instructions and
nearby data and instructions are likely to be
accessed in the near future.
– Grab a larger chunk than you immediately need
– Once you’ve grabbed a chunk, keep it
Locality in multiprocessor
• Computation depends on data local to the processor
– Each processor uses data from its own cache
– Once data is brought into the cache, it stays there
Locality in multiprocessor
[Diagram: two CPUs, each with its own cache, connected to a shared memory holding a counter]
Counter: Shared
[Diagram sequence: one CPU increments the shared counter from 0 to 1; the other CPU can read the cached value (Read: OK), but its own increment to 2 must invalidate the first CPU's copy, so every update moves the counter's cache line between CPUs]
Comparing counter
1. Scales well on older architectures
2. Performs worse on a shared-memory multiprocessor
Counter: Array
• Sharing requires moving the counter back and forth between CPU caches
• Split the counter into an array
• Each CPU gets its own counter
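The per-CPU split can be sketched in C. This is a minimal single-threaded illustration, not Tornado's code; NCPU, counter_inc, and counter_read are made-up names, and a real kernel would use atomic or per-CPU primitives.

```c
#include <assert.h>

#define NCPU 4  /* assumed number of processors */

/* One counter slot per CPU: each increment touches only the local slot. */
static long counters[NCPU];

void counter_inc(int cpu) { counters[cpu]++; }

/* The logical counter value is the sum over all per-CPU slots. */
long counter_read(void) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++) sum += counters[i];
    return sum;
}
```

Increments on different CPUs no longer contend for the same memory word; only a full read has to visit every slot.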
Counter: Array
[Diagram sequence: each CPU increments only its own array element, so updates stay in the local cache; to read the counter, fetch the array and add all the counters: 1 + 1 = 2]
Counter: Array
• This solves the problem
• What about performance?
Comparing counter
Does not perform better than the shared counter.
Counter: Array
• This solves the problem
• What about performance?
• What about false sharing?
Counter: False Sharing
[Diagram sequence: both counters (0,0) sit on the same cache line, so both CPUs cache the same line; when one CPU updates its counter to (1,0) it invalidates the other's copy, and the other CPU's update to (1,1) invalidates it back — the line ping-pongs even though each CPU touches only its own counter]
Solution?
• Use a padded array
• Different elements map to different cache lines
Counter: Padded Array
[Diagram: with padding, each counter occupies its own cache line, so the two CPUs update their counters independently of each other]
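A hedged sketch of the padded layout, assuming a 64-byte cache line (real code would query the hardware); padded_inc and padded_read are illustrative names.

```c
#include <assert.h>

#define NCPU 4
#define CACHE_LINE 64  /* assumed line size */

/* Pad each per-CPU counter out to a full cache line, so two CPUs
 * never write to the same line (no false sharing). */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "each counter must fill exactly one cache line");

static struct padded_counter counters[NCPU];

void padded_inc(int cpu) { counters[cpu].value++; }

long padded_read(void) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++) sum += counters[i].value;
    return sum;
}
```

The padding trades memory for independence: each slot wastes most of a cache line, but updates on different CPUs never invalidate each other.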
Comparing counter
Works better
Locality in OS
• Serious performance impact
• Difficult to retrofit
• Tornado
– Ground up design
– Object Oriented approach – Natural locality
Tornado
• Object Oriented Approach
• Clustered Objects
• Protected Procedure Call
• Semi-automatic garbage collection
– Simplified locking protocol
Object Oriented Approach
[Diagram sequence: a process table protected by a single lock serializes all requests, even those targeting different processes; giving each process entry its own lock lets requests for Process 1 and Process 2 proceed concurrently]
Object Oriented Approach
Class ProcessTableEntry {
    data
    lock
    code
}
Object Oriented Approach
• Each resource is represented by a different object
• Requests to virtual resources are handled independently
– No shared data structure accesses
– No shared locks
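The per-entry locking idea behind ProcessTableEntry could look like this in C. This is a sketch, not Tornado's code; the spinlock and the field/function names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Each entry carries its own lock, so updating one process never
 * blocks requests that target a different process. */
struct process_entry {
    atomic_flag lock;   /* per-entry spinlock */
    int pid;
    char name[32];
};

void entry_set_name(struct process_entry *e, const char *name) {
    while (atomic_flag_test_and_set(&e->lock))
        ;                                   /* spin on this entry's lock only */
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    atomic_flag_clear(&e->lock);
}
```

Because locking is encapsulated inside the object, the lock's scope shrinks with the object's: no global process-table lock is ever taken.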
Object Oriented Approach
[Diagram sequence: a page-fault exception is delivered to the faulting Process object; the HAT searches for the responsible Region; the Region forwards the request to its FCM; on a miss the FCM invokes the COR, and the DRAM object supplies physical memory]
HAT: Hardware Address Translation
FCM: File Cache Manager
COR: Cached Object Representative
DRAM: Memory manager
Object Oriented Approach
• Multiple implementations for system objects
• Dynamically change the objects used for a resource
• Provides the foundation for other Tornado features
Clustered Objects
• Improve locality for widely shared objects
• Appears as a single object
– Composed of multiple component objects
• Has a representative 'rep' for each processor
– Defines the degree of clustering
• Clients share a common clustered object reference
Clustered Objects
[Diagram: clients hold a single clustered object reference; behind it, each processor's accesses are served by its own rep]
Clustered Objects : Implementation
• A translation table per processor
– Located at the same virtual address on every processor
– Each entry is a pointer to the local rep
• A clustered object reference is just a pointer into the table
• 'reps' are created on demand, when first accessed
– Handled by a special global miss-handling object
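The translation table plus on-demand rep creation might be sketched as follows. The names are mine, and the real miss handler is a global object invoked on a table miss, not a NULL check.

```c
#include <assert.h>
#include <stdlib.h>

#define NCPU 4

/* A clustered counter: one rep per processor, created lazily. */
struct rep { long value; };

/* Per-processor translation table: each slot points at that CPU's rep. */
static struct rep *table[NCPU];

static struct rep *get_rep(int cpu) {
    if (table[cpu] == NULL)                   /* miss: first access on this CPU */
        table[cpu] = calloc(1, sizeof(struct rep));
    return table[cpu];
}

void clustered_inc(int cpu) { get_rep(cpu)->value++; }

/* Reads must combine the state held in all existing reps. */
long clustered_read(void) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        if (table[i] != NULL) sum += table[i]->value;
    return sum;
}
```

Clients only ever hold the table-based reference; which rep actually services an access is decided per processor, invisibly to the caller.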
Counter: Clustered Object
[Diagram sequence: each CPU's object reference resolves through its translation table to its own rep of the counter, so the two CPUs update their reps independently of each other]
Clustered Objects
• Degree of clustering
• Multiple reps per object
– How to maintain consistency?
• Coordination between reps
– Shared memory
– Remote PPCs
Counter: Clustered Object
[Diagram sequence: to read the counter, the clustered object visits every rep and adds all the counters: 1 + 1 = 2]
Clustered Objects : Benefits
• Facilitates optimizations commonly applied on multiprocessors, e.g., replication and partitioning of data structures
• Preserves object-oriented design
• Enables incremental optimization
• Can have several different implementations
Synchronization
• Two kinds of locking issues
– Locking
– Existence guarantees
Synchronization: Locking
• Encapsulate locking within individual objects
• Uses clustered objects to limit contention
• Uses spin-then-block locks
– Highly efficient
– Reduces cost of lock/unlock pair
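A spin-then-block acquire might be sketched like this. SPIN_LIMIT is an assumed tuning knob, and yielding the CPU stands in for a true blocking sleep; the lock type and names are illustrative.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 100  /* assumed: how long to spin before giving up */

typedef struct { atomic_flag held; } stb_lock;

void stb_acquire(stb_lock *l) {
    /* Phase 1: spin briefly — cheap when the holder releases soon,
     * which is the common case for short critical sections. */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (!atomic_flag_test_and_set(&l->held))
            return;
    /* Phase 2: "block" — modeled here by yielding the CPU each retry. */
    while (atomic_flag_test_and_set(&l->held))
        sched_yield();
}

void stb_release(stb_lock *l) { atomic_flag_clear(&l->held); }
```

The fast path is a single test-and-set, which keeps the cost of an uncontended lock/unlock pair low; only contended acquires pay for blocking.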
Synchronization: Existence guarantees
• All references to an object protected by locks
– Eliminates races where one thread is accessing the object while another is deallocating it
– Leads to a complex global hierarchy of locks
• Tornado: semi-automatic garbage collection
– A clustered object reference can be used at any time
– Eliminates the need for existence locks
Garbage Collection
• Distinguish between temporary references
and persistent references
– Temporary: clustered references held privately
– Persistent: shared memory, can persist beyond
lifetime of a thread
Garbage Collection
• Remove all persistent references
– Normal cleanup
• Remove all temporary references
– Event-driven kernel
– Maintain a count of temporary references per processor
– The object can be reclaimed once every counter is zero
• Destroy the object itself
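The temporary-reference rule above can be sketched as follows. This is a single-threaded illustration with made-up names; the real kernel tracks these counts implicitly per processor as events begin and end.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4

/* Count of in-flight temporary references on each processor. */
static int temp_refs[NCPU];

void ref_begin(int cpu) { temp_refs[cpu]++; }  /* thread starts using an object */
void ref_end(int cpu)   { temp_refs[cpu]--; }  /* thread is finished with it */

/* An object stripped of all persistent references may be destroyed
 * only once no processor still holds a temporary reference. */
bool safe_to_destroy(void) {
    for (int i = 0; i < NCPU; i++)
        if (temp_refs[i] != 0) return false;
    return true;
}
```

Because each processor updates only its own count, tracking references never creates the shared-counter contention the earlier slides warned about.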
Garbage Collection
[Diagram sequence: Process 1 reads an element of the list (2, 5, 9), raising its processor's temporary-reference counter to 1; Process 2 then deletes that element. The GC checks the counter, finds it nonzero, and waits; once Process 1 finishes, the counter drops to 0, the GC frees the element, and the list becomes (2, 9)]
Interprocess communication
• Uses Protected Procedure Calls (PPCs)
• A call from a client object to a server object
– A clustered object call that crosses from the client's protection domain into the server's
• Advantages
– Client requests are serviced on the local processor
– Client and server share the processor, similar to handoff scheduling
– Each client request has exactly one thread in the server
PPC: Implementation
• On-demand creation of server threads
• Maintains a list of worker threads
• Implemented as a trap plus some queue manipulations
– Dequeue a worker thread from the ready workers
– Enqueue the caller thread on the worker
– Return-from-trap into the server
• Registers are used to pass parameters
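The trap-time queue manipulation could be modeled roughly as below. This is a toy single-CPU sketch with invented names; a real PPC switches protection domains, passes arguments in registers, and creates workers on demand when the ready list is empty.

```c
#include <assert.h>
#include <stddef.h>

struct thread { int id; struct thread *waiting_caller; };

#define NWORKERS 2
static struct thread workers[NWORKERS] = { {100, NULL}, {101, NULL} };
static int ready_top = NWORKERS;   /* simple stack of ready server workers */

/* Trap path: pick a ready worker, park the caller on it, and
 * "return from trap" running the worker in the server's domain. */
struct thread *ppc_call(struct thread *caller) {
    if (ready_top == 0) return NULL;          /* real code creates a worker here */
    struct thread *w = &workers[--ready_top]; /* dequeue a ready worker */
    w->waiting_caller = caller;               /* enqueue the caller on it */
    return w;
}

/* When the server finishes, the caller resumes and the worker is recycled. */
void ppc_return(struct thread *w) {
    w->waiting_caller = NULL;
    ready_top++;
}
```

Since both queues are per-processor and touched only inside the trap, the whole call costs a trap plus a few pointer updates, with no cross-processor traffic.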
Performance
Performance: summary
• Strong basic design
• Highly scalable
• Locality and locking overhead are the major sources of slowdown
Conclusion
• The object-oriented approach and clustered objects exploit locality and concurrency
• OO design has some overhead, but it is low compared to the performance advantages
• Tornado scales extremely well and achieves high performance on shared-memory multiprocessors
References
• http://web.cecs.pdx.edu/~walpole/class/cs510/papers/05.pdf
• Presentation by Holly Grimes, CS 533, Winter 2008
• http://en.wikipedia.org/wiki/Locality_of_reference