1. The Test-and-Set instruction puts a 1 into a memory location after first reading out its content, while the very similar Exchange instruction puts the value in a register into a memory location after reading out its content, and puts the old content into the register. Both swap something in the CPU with something in memory.

a. List the bus operations needed to complete a TAS/EXG instruction.

   TAS:
      data = read(address)
      wait                      ; Memory latency.
      test data                 ; Test operation.
      data = 1
      write(address, data)
      wait

   EXG:
      data = read(address)
      wait
      write(address, reg)
      wait
      reg = data

b. A split transaction bus may allow another TAS/EXG to use the bus between the read and write cycles of an earlier TAS/EXG and thus lead to a hazard. How can this be prevented? (Remember that memory modules may be pipelined.)

   Hazards occur because the second process may read the data before the first process gets to write it. E.g. two EXGs on the same address:

      Process I  : CC1 data = read(address); CC2 wait; CC3 write(address, reg); CC4 wait; CC5 reg = data
      Process II : CC2 data = read(address); CC3 wait; CC4 write(address, reg); CC5 wait; CC6 reg = data

   At CC2 Process II gets the wrong (old) value from memory. Hazard!

   Solution: in a pipelined memory module the stages used by the EXG/TAS are locked; subsequent accesses are blocked until the EXG/TAS writes back. Correct operation now:

      Process I  : CC1 data = read(address); CC2 wait; CC3 write(address, reg); CC4 wait; CC5 reg = data
      Process II : CC2 data = read(address) (blocked); CC3 (blocked); CC4 (blocked); CC5 data = read(address) (successful); CC6 wait; ...

   At CC5 Process II will now get the correct value from memory.

c. A cache coherence system ensures that any change to the lock value by another processor is delivered to any processor trying to read it. Describe the system activities when a processor does TAS/EXG on

   i) a currently open lock that no one has locked before:
      Never attempted a read on this lock: cache miss.
      The unlocked value comes into the processor's cache.
      The processor closes the lock.
      The locked value is broadcast (or updated in the directory).

   ii) a currently closed lock recently locked by someone else:
      Cache miss (the lock value was updated by another processor).
      The locked value comes into the processor's cache.

   iii) a currently closed lock, after trying unsuccessfully earlier:
      Earlier attempt on this lock: cache hit.
      The locked value is read from the processor's cache.

   iv) a currently open lock recently released by someone else:
      Cache miss (the lock value was updated to open by the other processor).
      The unlocked value comes into the processor's cache.

   (A code sketch of these spin-lock access patterns follows below.)
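   Illustration (not part of the original answer): a minimal C11 sketch of how the TAS/EXG primitive of part a is wrapped into a spin lock. The names lock, acquire_tas, acquire_ttas and release are chosen for illustration only; atomic_exchange stands in for the EXG/TAS instruction, and the atomic guarantees that its read and write cannot be split apart, which is what the bus/pipeline locking of part b provides in hardware. The spin-on-read loop in acquire_ttas keeps an unsuccessful waiter hitting in its own cache, as in case iii of part c, and only retries the exchange once the lock looks open again, as in case iv.

      #include <stdatomic.h>
      #include <stdbool.h>

      /* Shared lock word: false = open, true = closed. */
      static atomic_bool lock = false;

      /* Plain TAS/EXG lock: every attempt is a read-modify-write on the bus. */
      void acquire_tas(void)
      {
          /* atomic_exchange returns the old value: false means the lock was
             open and this call has just closed it (cases i and iv).          */
          while (atomic_exchange_explicit(&lock, true, memory_order_acquire))
              ;   /* old value was true: lock already closed (cases ii, iii) */
      }

      /* Test-and-test-and-set: spin on ordinary reads of the cached copy, so
         repeated unsuccessful attempts stay in the local cache (case iii) and
         the exchange is retried only when the lock looks open again (case iv). */
      void acquire_ttas(void)
      {
          for (;;) {
              if (!atomic_exchange_explicit(&lock, true, memory_order_acquire))
                  return;                               /* closed an open lock */
              while (atomic_load_explicit(&lock, memory_order_relaxed))
                  ;                                     /* local cache hits    */
          }
      }

      void release(void)
      {
          atomic_store_explicit(&lock, false, memory_order_release);
      }

   A caller pairs acquire_tas() or acquire_ttas() with release() around the critical section.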
2. We studied interconnect structures like the crossbar, the Omega network, the hypercube and the token ring. Evaluate each structure for the construction of (a) a uniform shared memory system, (b) a non-uniform shared memory system, and (c) a distributed memory system. For each, describe the envisaged procedure for accessing shared memory. Remember that processors all have caches, so memory accesses are always in blocks of bytes ranging from a hundred or so bytes to a fraction of a page.

   Cross-bar (connect processors and memory to the network):
      (a) Uniform shared: Each processor can access memory by just flipping switches. Uniform access times if each processor has no memory of its own.
      (b) Non-uniform shared: If each processor has memory of its own, then the time to access its own memory is shorter than the time to access shared memory. Non-uniform access times.
      (c) Distributed: Each processor with its own memory (and no shared memory). Processors access other processors' memories through the cross-bar network.

   Omega network (connect processors on one side of the switches and memory on the other side):
      (a) Uniform shared: There can be as few as N/2 parallel paths, depending on the switch settings, so not all processors can access shared memory at the same time. Hence not practical.
      (b) Non-uniform shared: Each processor with its own memory. Access to its own memory is faster than access to shared memory across the network.
      (c) Distributed: Each processor with its own memory (and no shared memory). Processors access each other's memories through the network.

   Hypercube (connect processors and memory to the network):
      (a) Uniform shared: Flip switches from processor to memory. Varying connection lengths. Complex, and not always possible.
      (b) Non-uniform shared: Each processor with its own memory. Access to its own memory is faster than access to shared memory.
      (c) Distributed: Each processor with its own memory and no shared memory. Access by flipping switches until the processor whose memory is to be accessed is reached.

   Token ring:
      (a) Uniform shared: Not possible. Contention for the token results in variable-length latencies.
      (b) Non-uniform shared: Each processor with its own memory. Contention for the token already results in non-uniformity.
      (c) Distributed: Each processor with its own memory. Access by grabbing the token and then sending the request.

3. Outline a hypercube node program to sum 2**n values, with each node contributing one. It is structurally similar to the broadcasting program of the last tutorial: each node i receives from nodes i+2**m for m = n-1 downward, provided 2**m > i, and sends the received values plus its own value to i-2**m for the next m.

   Analysis:
   i)   Set m = n-1 and sum = the node's own value.
   ii)  While m >= 0 and 2**m > i:
        i.   receive x from node i+2**m;
        ii.  sum = sum + x;
        iii. m = m - 1.
        When the loop exits with m >= 0, send sum to node i-2**m (node 0 never sends; it ends up holding the total).
   iii) This algorithm has the effect of flipping bits to read from the nodes around a particular node i. E.g. for i = 3 (0011b), n = 4:

        Before:       k = 2**4 = 16
        Iteration 1:  k = k/2 = 8, 8 > 3.  Read from 3 + 8 = 11 (1011b).
        Iteration 2:  k = k/2 = 4, 4 > 3.  Read from 3 + 4 = 7 (0111b).
        Iteration 3:  k = k/2 = 2, 2 <= 3. Exit.  Send to 3 - 2 = 1 (0001b).

        All the addresses (1011b, 0111b, 0001b) differ from the node address (0011b) by exactly one bit, i.e. the reads and the write go to neighbouring nodes.

   Algorithm:

        sum = own value;                   // each node starts with its own contribution
        k = 2**n;
        for (j = 0; j < n; j++) {          // one step per dimension
            k = k / 2;                     // reduce the exponent by 1
            if (k <= i) break;             // stop summing; pass the partial sum on
            sum = sum + receive(i + k);
        }
        if (i > 0) send(i - k, sum);       // node 0 keeps the final sum
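   Illustration (not part of the original answer): a minimal single-process C sketch that simulates the node program above to check the communication pattern. The mailbox array stands in for send/receive, and the choice that node i contributes the value i+1 is only for illustration; one pass is made per dimension, from m = n-1 down to 0.

      #include <stdio.h>
      #include <assert.h>

      #define N 4                  /* dimension of the cube      */
      #define P (1 << N)           /* number of nodes = 2**n     */

      int main(void)
      {
          int sum[P], mailbox[P];

          /* Each node contributes one value; here node i contributes i+1,
             so the expected total is P*(P+1)/2.                           */
          for (int i = 0; i < P; i++)
              sum[i] = i + 1;

          /* One pass per dimension.  In the pass with step k = 2**m the
             nodes i with k <= i < 2*k "send" their partial sum to node
             i - k, which adds it in: exactly the node program of question
             3, with the messages carried through the mailbox array.       */
          for (int k = P / 2; k >= 1; k /= 2) {
              for (int i = k; i < 2 * k; i++)     /* senders in this pass   */
                  mailbox[i - k] = sum[i];        /* send(i - k, sum)       */
              for (int i = 0; i < k; i++)         /* receivers              */
                  sum[i] += mailbox[i];           /* sum += receive(i + k)  */
          }

          printf("node 0 holds %d (expected %d)\n", sum[0], P * (P + 1) / 2);
          assert(sum[0] == P * (P + 1) / 2);
          return 0;
      }

   For n = 4 this prints "node 0 holds 136 (expected 136)"; node 0 ends up with the sum of all 2**n contributions, and every other node stops after its single send, exactly as in the per-node algorithm.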