1. What limitations can a 64-bit processor have, compared to a 32-bit processor?

Generally 64-bit processors have higher performance, since more bytes are transferred per clock cycle (8 vs. 4 bytes), so memory transfer bandwidth is doubled. The representation range is increased too, allowing us to represent more precise floating-point numbers and larger integers. If the address bus is 64 bits as well, we can address much more memory, allowing more and larger processes to be running.

The main difficulty is cost. 64-bit processors have many more pins connecting them to surrounding circuits than 32-bit processors. This makes routing of wires extremely difficult, requiring very expensive boards with 20 or more wiring layers. Denser wiring also results in more interference (cross-talk), and hence poorer signal integrity. Longer wires (used to relieve crowding) result in more capacitance, greater resistance and signal echoes. So it is extremely hard to design and build boards for 64-bit processors.

2. Explain the main differences between a superscalar pipeline and a vector pipeline.

Superscalar Pipeline

This actually means having multiple pipelines. E.g. the Intel Pentium had 2 integer pipelines to handle the normal add, sub, mul, div, shl, shr, rol, ror etc., and 1 floating-point pipeline. The main issues in a superscalar pipeline are discovering instruction-level parallelism (so that more than one instruction can be issued at a time), preventing multiple-issue hazards (RAW, WAR, WAW) and retiring instructions in the correct order.

Vector Pipeline

These are special arithmetic pipelines that can work on arrays (otherwise called "vectors") rapidly. This is done by pipelining the various stages of an arithmetic operation. E.g. for floating-point add:

Stage 1: Compare exponents (C)
Stage 2: Shift the mantissa of the number with the smaller exponent right until both exponents are equal (S)
Stage 3: Add mantissas (A)
Stage 4: Normalize result (N)

So if we have the following operation:

    for(i=0; i<6; i++)
        c[i] = a[i] + b[i];

we will have (columns are clock cycles):

    Clock       1   2   3   4   5   6   7   8   9
    a[0]+b[0]   C   S   A   N
    a[1]+b[1]       C   S   A   N
    a[2]+b[2]           C   S   A   N
    a[3]+b[3]               C   S   A   N
    a[4]+b[4]                   C   S   A   N
    a[5]+b[5]                       C   S   A   N

Executing 6 iterations of this loop takes 9 clock cycles, as opposed to 6 x 4 = 24 clock cycles without pipelining. Note that this is different from our normal idea of pipelining (fetch, decode, execute, writeback). The vector pipeline corresponds to the execute stage of our normal pipeline.
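As a sanity check on the table above, here is a minimal C sketch (not from the text; the stage letters, the 6 additions and the 4-stage depth match the example) that prints which stage each addition occupies on every clock. The last addition drains at clock 6 + 4 - 1 = 9, matching the count above.

    #include <stdio.h>

    int main(void) {
        const char *stages = "CSAN";   /* Compare, Shift, Add, Normalize   */
        int n = 6, depth = 4;          /* 6 additions, 4-stage FP-add pipe */

        /* element i enters the pipeline at clock i+1, so at clock clk it
           occupies stage clk-i-1 (0-based); the last element leaves the
           pipeline at clock n + depth - 1 = 9 */
        for (int clk = 1; clk <= n + depth - 1; clk++) {
            printf("clock %d:", clk);
            for (int i = 0; i < n; i++) {
                int s = clk - i - 1;
                if (s >= 0 && s < depth)
                    printf("  a[%d]+b[%d]:%c", i, i, stages[s]);
            }
            printf("\n");
        }
        return 0;
    }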
3. What is the difference between a "cluster" and an "MIMD distributed memory multiprocessor"?

They are essentially the same thing: a cluster can be thought of as a special case of an MIMD distributed memory multiprocessor (MIMD-DMM). In general the difference between a cluster and an MIMD-DMM is the degree of coupling.

MIMD-DMMs are generally tightly coupled machines. All processing nodes and memories are connected via custom-built, very high-speed interconnects (e.g. omega networks) that cost a lot to build (challenges include maintaining signal integrity across long bus lines, fast low-latency switches, etc.). The entire arrangement is usually in a single enclosure (this keeps wires short and allows shielding, both of which allow much higher bandwidth). Low latency and high bandwidth allow fine-grained programs to be run (i.e. programs that do a little processing, exchange data, do a little more processing, exchange data, ad infinitum).

Clusters, on the other hand, usually consist of multiple PCs connected via commodity LANs. E.g. the Beowulf cluster consists of many PCs (up to about 128) connected together via two commodity Ethernet connections; special device drivers spread traffic evenly across both connections to get almost double the bandwidth. Another cluster at an Israeli university connects 256 PCs through Myrinet, a fast LAN standard. Clusters generally have high latency and low bandwidth (most LANs are serial devices) and are thus more suited for coarse-grained applications (much processing, exchange a little data, much more processing, exchange a little data).

4. Fig 6.9, Chapter 6.3, CA-AQA 3rd Ed. A write-update protocol working on a snooping bus for a single cache block with a write-back cache: since it is write-back, why, when CPU A writes a "1" to X, is the contents of memory location X also updated to "1"?

    Processor activity       Bus activity            A's cache   B's cache   Memory X
    (initially)                                                               0
    CPU A reads X            Cache miss for X        0                       0
    CPU B reads X            Cache miss for X        0           0           0
    CPU A writes a 1 to X    Write broadcast of X    1           1           1
    CPU B reads X                                    1           1           1

Figure 6.9 is for write-broadcast (write-update). The steps are as follows:

i)   A reads X. Cache miss. The data is brought from memory.
ii)  B reads X. Cache miss. The data is brought from memory.
iii) A writes to X. The write is broadcast to everyone interested in X, including the memory system. CPU B's cache and main memory are both updated with X = 1.

5. Page 434, Chapter 5.6, CA-AQA 3rd Ed: using the "blocking" compiler optimization to reduce miss rate: why, after applying "blocking", does the code have two extra for loops at the start?

Problem: if a piece of code accesses arrays both row-wise and column-wise in each iteration, it is impossible to achieve spatial locality for both. E.g.:

    for(i=0; i<n; i++)
        for(j=0; j<n; j++)
            x[i][j] = y[j][i];

Regardless of whether the arrays are stored row-major or column-major, at least one array will suffer cache misses all the time, regardless of block size. It is useless to swap the indices i and j, since that will only cause the other matrix to miss all the time. The worst-case number of memory accesses occurs when the cache block size is 1 and both matrices are larger than the total cache size: the number of memory accesses (cache misses) is then 2n² (each matrix misses n² times) for an n² operation.

Blocking allows us to work on sub-portions of a matrix at a time, so that the entire sub-portion fits into the cache; this lets us exploit temporal locality (since we could not exploit spatial locality). Let our blocking factor be B (i.e. we work on B x B sub-blocks at a time). To re-write the code, we must:

i)  Insert new outer loops that cycle through all sub-blocks (the number of sub-blocks is of course n/B) for i and j. We need a separate outer loop for each inner loop index, to ensure that we cover the entire range of j for each sub-range of i; these are the two extra for loops at the start of the new code.
ii) Modify the existing loops to work within a sub-block.

Original code:

    for(i=0; i<n; i++)
        for(j=0; j<n; j++)
            x[i][j] = y[j][i];

New code:

    for(ii=0; ii<n; ii+=B)
        for(jj=0; jj<n; jj+=B)
            for(i=ii; i<min(ii+B, n); i++)
                for(j=jj; j<min(jj+B, n); j++)
                    x[i][j] = y[j][i];

Matrices x and y will now miss n²/B times each, so the total number of misses is 2n²/B, an improvement by a factor of B.
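Below is a minimal, self-contained C sketch of the blocked copy, with min() written out and hypothetical sizes N = 8 and B = 4; it only checks that the blocked loop nest produces the same result as the original, and says nothing about cache behaviour.

    #include <stdio.h>
    #include <string.h>

    #define N   8            /* hypothetical matrix size       */
    #define BLK 4            /* hypothetical blocking factor B */

    static int min(int a, int b) { return a < b ? a : b; }

    int main(void) {
        static int x[N][N], xb[N][N], y[N][N];

        /* fill y with recognisable values */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i][j] = i * N + j;

        /* original loop nest */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = y[j][i];

        /* blocked loop nest: the two outer loops walk the sub-blocks */
        for (int ii = 0; ii < N; ii += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                for (int i = ii; i < min(ii + BLK, N); i++)
                    for (int j = jj; j < min(jj + BLK, N); j++)
                        xb[i][j] = y[j][i];

        printf("results %s\n",
               memcmp(x, xb, sizeof x) == 0 ? "match" : "differ");
        return 0;
    }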
6. RAID 4 and RAID 5 work almost the same: the parity is stored as blocks and associated with a set of data blocks, except that RAID 4 stores the parity information on a dedicated parity disk while RAID 5 spreads the parity information across all the disks (as shown in Fig 7.19). Why, in Fig 7.17, is the "corresponding check disks" column "1" for both RAID 4 and RAID 5? Should it not be "0" for RAID 5?

Difference between RAID 4 and RAID 5:

In RAID 4, one disk is dedicated to storing all parity checksums:

    Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
       0        1        2        3       P0
       4        5        6        7       P1
       8        9       10       11       P2
      12       13       14       15       P3
      16       17       18       19       P4

Problem: every write requires an update to the parity disk. E.g. blocks 0 and 6 can be written to simultaneously (since they are on different disks), but the writes to parities P0 and P1 must be serialized because they are on the same disk. Hence the parity disk becomes a bottleneck.

RAID 5 solves this by distributing the parity across all disks:

    Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
       0        1        2        3       P0
       4        5        6       P1        7
       8        9       P2       10       11
      12       P3       13       14       15
      P4       16       17       18       19

A write to blocks 0 and 6 now requires writes to Disk 1 (for block 0) and Disk 5 (for P0), and to Disk 3 and Disk 4 (for block 6 and P1). Since these are all independent, they can be done simultaneously. However, the number of blocks required for parity remains the same, and the parity still effectively occupies the equivalent of an entire disk. Hence the number of check (parity) disks is still 1, not 0.

7. What is bisection bandwidth? How do we compute it?

To calculate bisection bandwidth:
1. Split the network so that both sides have an equal number of nodes.
2. Count the number of links crossing the dividing line. Multiply by the bandwidth of each link to obtain the bisection bandwidth (or simply count the links if the individual bandwidth is not known).

Fully interconnected network: consider all the nodes on side A of the bisection. Each node must connect to the n/2 nodes on the other side of the bisection (side B), since the nodes are equally distributed, and there are n/2 nodes on side A. Therefore the number of connections crossing the bisection is (n/2) x (n/2) = (n/2)².

Bus: this is simpler. Since there is one shared connection for all nodes, if we divide the nodes into two equal groups, the number of links crossing the bisection is still just 1.

Ring: again this is trivial. Splitting a ring into two halves always results in two links crossing the bisection, so the bisection bandwidth is 2.

2D torus: this is non-trivial to do analytically, since the number of crossing links depends on how the nodes are arranged in the matrix. The number of crossing links is always twice the number of columns or rows (depending on how you divide the nodes), because of the links between immediate neighbours on either side of the bisection and the wrap-around links between the extreme nodes of the same row/column. Assume an M x N (M = number of rows, N = number of columns) matrix of nodes. If the nodes can be divided into two equal parts both horizontally and vertically, we divide them so that the number of crossing links is minimized: bisection bandwidth = 2 x min(M, N). If the nodes can be divided equally only vertically, the bisection bandwidth is 2M; if only horizontally, it is 2N. If they cannot be divided equally either horizontally or vertically (e.g. a 7 x 5 network), the bisection bandwidth is undefined, as it is impossible to get an equal number of nodes on each side.

3D hypercube: from the diagrams, we find that the number of links crossing the bisection from each node is 1. Since there are n/2 nodes on one side, the total number of crossings is n/2, and the bisection bandwidth is n/2.
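The rules above can be collected into a small C helper. This is only a sketch: the per-link bandwidth is assumed to be 1, so the functions return crossing-link counts, and the example sizes in main() are made up.

    #include <stdio.h>

    int fully_connected(int n) { return (n / 2) * (n / 2); } /* (n/2)^2 */
    int bus(int n)             { (void)n; return 1; }
    int ring(int n)            { (void)n; return 2; }
    int hypercube(int n)       { return n / 2; }

    /* 2D torus of m rows x n columns; returns -1 when no equal split exists */
    int torus2d(int m, int n) {
        int can_split_horiz = (m % 2 == 0);   /* horizontal cut needs even rows    */
        int can_split_vert  = (n % 2 == 0);   /* vertical cut needs even columns   */
        if (can_split_horiz && can_split_vert) return 2 * (m < n ? m : n);
        if (can_split_vert)  return 2 * m;    /* can only cut vertically           */
        if (can_split_horiz) return 2 * n;    /* can only cut horizontally         */
        return -1;                            /* undefined: no equal halves exist  */
    }

    int main(void) {
        printf("fully connected, 8 nodes: %d\n", fully_connected(8)); /* 16 */
        printf("bus, 8 nodes:             %d\n", bus(8));             /* 1  */
        printf("ring, 8 nodes:            %d\n", ring(8));            /* 2  */
        printf("4x6 torus:                %d\n", torus2d(4, 6));      /* 8  */
        printf("hypercube, 8 nodes:       %d\n", hypercube(8));       /* 4  */
        return 0;
    }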
8. Example on page 325, CA-AQA 3rd Ed: Prof Yuen asked during the lecture whether we can replace the array T[i] with a scalar variable T. Can we?

The code after renaming (CA-AQA 3rd Ed) is:

    for(i=0; i<=100; i++) {
        T[i] = X[i] + c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

Yes: T[i] has no use beyond computing Z[i] and Y[i] in the same iteration, so it can be replaced by a single scalar variable T:

    for(i=0; i<=100; i++) {
        T = X[i] + c;
        X1[i] = X[i] + c;
        Z[i] = T + c;
        Y[i] = c - T;
    }

Note that you cannot do the same for X1[i], which actually holds the correct result for X[i] after the iteration is done. X1[i] is needed to correct the value in X[i] (which still holds the old values because of the renaming) after all 100 iterations are over. Neither can you do this:

    for(i=0; i<=100; i++) {
        T = X[i] + c;
        X1 = X[i] + c;
        Z[i] = T + c;
        Y[i] = c - T;
        X[i] = X1;
    }

This re-introduces the anti-dependency (which makes the entire renaming exercise pointless).

9. Explain software pipelining (symbolic loop unrolling), page 329, Chapter 4.4, CA-AQA 3rd Ed.

Main gist: instead of unrolling the loop, we simply take instructions from different iterations and schedule them together. E.g.:

    for(i=0; i<n; i++)
        a[i] = a[i] + 6;

In assembly (a1 = base address of a, r2 = address of the final element):

    Loop:   lw   r1, 0(a1)
            addi r4, r1, 6
            sw   r4, 0(a1)
            addi a1, a1, 4      ; Increment to next element.
                                ; Assumed 1 element = 4 bytes
            bne  a1, r2, Loop

We can unroll this 3 times:

    Iteration i         Iteration i+1       Iteration i+2
    lw   r1, 0(a1)      lw   r1, 4(a1)      lw   r1, 8(a1)
    addi r4, r1, 6      addi r4, r1, 6      addi r4, r1, 6
    sw   r4, 0(a1)      sw   r4, 4(a1)      sw   r4, 8(a1)

Suppose our CPU has 1 load unit, 1 add unit and 1 store unit. We can pick 1 load, 1 add and 1 store from different iterations to keep all three units busy: the sw from iteration i, the addi from iteration i+1 and the lw from iteration i+2:

    Loop:   sw   r4, 0(a1)      ; Store a[i]
            addi r4, r1, 6      ; Add a[i+1]
            lw   r1, 8(a1)      ; Load a[i+2]
            addi a1, a1, 4      ; Increment index
            bne  a1, r2, Loop

Start-up and shut-down code must be inserted for this to work correctly. The start-up code consists of all instructions from each unrolled iteration that come before the chosen instruction; the shut-down code consists of all instructions from each unrolled iteration that come after it.

    Startup:    lw   r1, 0(a1)      ; Load a[i]
                addi r4, r1, 6      ; Add a[i]+6
                lw   r1, 4(a1)      ; Load a[i+1]
    Loop:       sw   r4, 0(a1)      ; Store a[i]
                addi r4, r1, 6      ; Add a[i+1]
                lw   r1, 8(a1)      ; Load a[i+2]
                addi a1, a1, 4      ; Increment index
                bne  a1, r2, Loop
    Shutdown:   sw   r4, 4(a1)      ; Store a[i+1]
                addi r4, r1, 6      ; Add a[i+2]
                sw   r4, 8(a1)      ; Store a[i+2]
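For completeness, here is a C-level sketch of the same software-pipelined schedule (hypothetical scalars r1 and r4 stand in for the registers; assumes n >= 3): the loop body overlaps the store for iteration i, the add for iteration i+1 and the load for iteration i+2, with the same start-up and shut-down code as above.

    #include <stdio.h>

    int main(void) {
        int a[6] = {0, 1, 2, 3, 4, 5};
        int n = 6, i, r1, r4;

        r1 = a[0];                 /* startup: load a[0]            */
        r4 = r1 + 6;               /* startup: add a[0]+6           */
        r1 = a[1];                 /* startup: load a[1]            */

        for (i = 0; i < n - 2; i++) {
            a[i] = r4;             /* store result of iteration i   */
            r4 = r1 + 6;           /* add for iteration i+1         */
            r1 = a[i + 2];         /* load for iteration i+2        */
        }

        a[n - 2] = r4;             /* shutdown: store a[n-2]        */
        r4 = r1 + 6;               /* shutdown: add a[n-1]+6        */
        a[n - 1] = r4;             /* shutdown: store a[n-1]        */

        for (i = 0; i < n; i++)
            printf("%d ", a[i]);   /* expect: 6 7 8 9 10 11 */
        printf("\n");
        return 0;
    }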