Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm
Presented by: Priya Limaye

Locality
• What is locality of reference?

    sum = 0;
    for (int i = 0; i < 10; i++) {
        sum = sum + number[i];
    }

• Temporal locality: recently accessed data and instructions are likely to be accessed again in the near future (e.g. sum and i in the loop above).
• Spatial locality: data and instructions close to recently accessed data and instructions are likely to be accessed in the near future (e.g. the consecutive elements of number[]).

Locality
• What is locality of reference?
– Recently accessed data and instructions, and nearby data and instructions, are likely to be accessed in the near future.
– So grab a larger chunk than you immediately need.
– Once you have grabbed a chunk, keep it.

Locality in a Multiprocessor
• Computation should depend on data local to the processor:
– Each processor uses data from its own cache.
– Once data is brought into a cache, it stays there.

Counter: Shared
[Diagram: two CPUs with private caches increment a single shared counter in memory. Concurrent reads are fine, but each increment invalidates the copy in the other CPU's cache, so the counter's cache line ping-pongs between the caches.]

Comparing Counters
1. The shared counter scales well on older architectures.
2. It performs worse on a shared-memory multiprocessor.

Counter: Array
• Sharing requires moving the counter back and forth between CPU caches.
• Split the counter into an array, with one element per CPU.
• Each CPU updates only its own element; reading the counter means adding up all the elements.
[Diagram: each CPU increments its own array element; a read sums the elements (1 + 1 = 2).]

Counter: Array
• This solves the sharing problem, but what about performance?
• Comparing counters: it does not perform better than the shared counter.
• The reason is false sharing.

Counter: False Sharing
• The per-CPU elements are adjacent in memory, so they sit in the same cache line.
[Diagram: both CPUs cache the line holding the pair (0,0); each update by one CPU invalidates the line in the other CPU's cache, even though the two CPUs never touch the same element.]

Solution?
• Use a padded array: different elements map to different cache lines (a rough code sketch follows below).
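The padded-array fix is easy to mis-state, so here is a minimal C++ sketch of the two layouts discussed above. It is illustrative only, not code from the paper: NUM_CPUS, CACHE_LINE, and the my_cpu parameter are assumptions, and 64 bytes is merely a typical cache-line size.

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t CACHE_LINE = 64;   // typical line size; an assumption here
    constexpr int NUM_CPUS = 4;              // hypothetical machine size

    // Naive per-CPU counters: adjacent longs share a cache line, so two CPUs
    // updating "their own" element still invalidate each other (false sharing).
    std::atomic<long> naive_counter[NUM_CPUS];

    // Padded per-CPU counters: alignas forces each element onto its own line,
    // so each CPU writes a line that no other CPU touches.
    struct alignas(CACHE_LINE) PaddedCounter {
        std::atomic<long> value{0};
    };
    PaddedCounter padded_counter[NUM_CPUS];

    // Each CPU increments only its own element (my_cpu is assumed to identify
    // the processor running the calling thread).
    void increment(int my_cpu) {
        padded_counter[my_cpu].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Reading the logical counter value means summing the per-CPU elements.
    long read_counter() {
        long sum = 0;
        for (int cpu = 0; cpu < NUM_CPUS; ++cpu)
            sum += padded_counter[cpu].value.load(std::memory_order_relaxed);
        return sum;
    }

Note the trade-off the slides describe: updates now touch only local state, but reads become more expensive because they must visit every CPU's element.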
Counter: Padded Array
[Diagram: with padding, each CPU updates only its own cache line, so the updates are independent of each other.]

Comparing Counters
• The padded array performs better.

Locality in the OS
• Poor locality has a serious performance impact.
• Locality is difficult to retrofit into an existing system.
• Tornado:
– Designed from the ground up
– Object-oriented approach
– Natural locality

Tornado
• Object-oriented approach
• Clustered objects
• Protected procedure calls
• Semi-automatic garbage collection
– Simplifies the locking protocol

Object-Oriented Approach
[Diagram: with a single shared process table protected by one lock, requests on behalf of Process 1 and Process 2 contend for the same lock and the same data structure. Instead, give each entry its own data, lock, and code:]

    class ProcessTableEntry {
        // per-entry data
        // per-entry lock
        // per-entry code (methods)
    };

Object-Oriented Approach
• Each resource is represented by a different object.
• Requests to different virtual resources are handled independently:
– No shared data structures to access
– No shared locks

Object-Oriented Approach: Page-Fault Example
[Diagram: a page-fault exception is delivered to the faulting Process object, which searches for the responsible Region; the Region forwards the request to its File Cache Manager (FCM); on a miss the FCM obtains a frame from the DRAM memory manager and asks the Cached Object Representative (COR) to fill it; the Hardware Address Translation (HAT) object then maps the page.]
• HAT: Hardware Address Translation
• FCM: File Cache Manager
• COR: Cached Object Representative
• DRAM: memory manager

Object-Oriented Approach
• System objects can have multiple implementations.
• The objects used for a resource can be changed dynamically.
• Provides the foundation for Tornado's other features.

Clustered Objects
• Improve locality for widely shared objects.
• A clustered object appears to clients as a single object, but is composed of multiple component objects.
• Each processor has a representative ('rep'); the assignment of reps to processors defines the degree of clustering.
• Clients always use a common clustered-object reference.

Clustered Objects: Implementation
• There is a translation table per processor, located at the same virtual address on every processor; each entry points to a rep.
• A clustered-object reference is just a pointer into this table.
• Reps are created on demand when first accessed, through a special global miss-handling object.

Counter: Clustered Object
[Diagram: each CPU increments the rep reached through its own translation-table entry, so updates are independent of each other.]

Clustered Objects
• The degree of clustering can vary, and an object may have multiple reps. How is consistency maintained?
• Reps coordinate with each other through shared memory or through remote PPCs.

Counter: Clustered Object
[Diagram: to read the counter, the values of all reps are added together (1 + 1 = 2).]

Clustered Objects: Benefits
• Facilitate optimizations commonly applied on multiprocessors, e.g. replication and partitioning of data structures.
• Preserve the object-oriented design.
• Enable incremental optimization.
• A clustered object can have several different implementations.
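To connect the counter slides to the clustered-object machinery just described, here is a small single-address-space C++ sketch, assuming one rep per processor. The names ClusteredCounter, CounterRep, handle_miss, and the my_cpu parameter are invented for illustration and are not Tornado's actual interfaces; the array of rep pointers merely stands in for Tornado's per-processor translation tables.

    #include <atomic>
    #include <mutex>

    constexpr int NUM_CPUS = 4;  // hypothetical machine size

    // One representative ("rep") per processor, padded onto its own cache line.
    struct alignas(64) CounterRep {
        std::atomic<long> value{0};
    };

    // A stand-in for a clustered counter. The array of rep pointers plays the
    // role of Tornado's per-processor translation tables, and handle_miss()
    // plays the role of the global miss-handling object that creates reps
    // on demand.
    class ClusteredCounter {
    public:
        // The common operation touches only the local rep:
        // good locality, no sharing, no lock on the fast path.
        void increment(int my_cpu) {
            rep_for(my_cpu).value.fetch_add(1, std::memory_order_relaxed);
        }

        // The rare operation coordinates across reps (here via shared memory).
        long total() const {
            long sum = 0;
            for (int cpu = 0; cpu < NUM_CPUS; ++cpu)
                if (CounterRep* rep = reps_[cpu].load(std::memory_order_acquire))
                    sum += rep->value.load(std::memory_order_relaxed);
            return sum;
        }

        ~ClusteredCounter() {
            for (auto& slot : reps_) delete slot.load();
        }

    private:
        CounterRep& rep_for(int cpu) {
            CounterRep* rep = reps_[cpu].load(std::memory_order_acquire);
            return rep ? *rep : handle_miss(cpu);
        }

        // First access on a processor: create that processor's rep.
        CounterRep& handle_miss(int cpu) {
            std::lock_guard<std::mutex> guard(miss_lock_);
            if (!reps_[cpu].load(std::memory_order_relaxed))
                reps_[cpu].store(new CounterRep(), std::memory_order_release);
            return *reps_[cpu].load(std::memory_order_relaxed);
        }

        std::atomic<CounterRep*> reps_[NUM_CPUS] = {};
        std::mutex miss_lock_;
    };

In Tornado itself a clustered-object reference is a pointer into the per-processor translation table, and the first access on a processor traps to the global miss handler; this sketch collapses that machinery into an in-process pointer array purely for illustration.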
Synchronization
• Two kinds of locking issues:
– Locking
– Existence guarantees

Synchronization: Locking
• Locking is encapsulated within individual objects.
• Clustered objects limit lock contention.
• Spin-then-block locks are used:
– Highly efficient
– Reduce the cost of a lock/unlock pair

Synchronization: Existence Guarantees
• Traditionally, all references to an object are protected by locks:
– This eliminates races in which one thread is accessing an object while another is deallocating it,
– but it leads to a complex global hierarchy of locks.
• Tornado uses semi-automatic garbage collection instead:
– A clustered-object reference can be used at any time.
– This eliminates the need for existence locks.

Garbage Collection
• Distinguish between temporary and persistent references:
– Temporary: clustered-object references held privately by a thread.
– Persistent: stored in shared memory; can persist beyond the lifetime of a thread.

Garbage Collection
• To destroy an object:
– First remove all persistent references (normal cleanup).
– Then wait for all temporary references to drain: the kernel is event-driven and maintains a counter of in-progress operations per processor, and the object can be deleted once every counter has reached zero.
– Finally destroy the object itself.

Garbage Collection: Example
[Diagram: Process 1 starts reading the list {2, 5, 9}, so its processor's counter is incremented. Process 2 then asks to delete element 5; because the counter is still 1, the garbage collector waits. When the read finishes, the counter is decremented to 0, the collector runs, and the element is removed, leaving {2, 9}.]

Interprocess Communication
• Uses protected procedure calls (PPCs).
• A PPC is a call from a client object to a server object:
– A clustered-object call that crosses from the client's protection domain into the server's.
• Advantages:
– Client requests are serviced on the local processor.
– Client and server share the processor, similar to handoff scheduling.
– Each client request has exactly one thread in the server.

PPC: Implementation
• Server worker threads are created on demand.
• A list of ready worker threads is maintained.
• A PPC is implemented as a trap plus a few queue manipulations:
– Dequeue a worker thread from the ready workers.
– Enqueue the caller thread on the worker.
– Return from the trap into the server.
• Registers are used to pass parameters.

Performance: Summary
• Strong basic design.
• Highly scalable.
• Locality and locking overhead are the major sources of slowdown.

Conclusion
• The object-oriented approach and clustered objects exploit locality and concurrency.
• The OO design has some overhead, but it is low compared to the performance advantages.
• Tornado scales extremely well and achieves high performance on shared-memory multiprocessors.

References
• Gamsa, Krieger, Appavoo, and Stumm, "Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System": http://web.cecs.pdx.edu/~walpole/class/cs510/papers/05.pdf
• Presentation by Holly Grimes, CS 533, Winter 2008
• http://en.wikipedia.org/wiki/Locality_of_reference