1) (a) In the context of Virtual Memory explain what is meant by:
(i) paging
(ii) segmentation [20%]
(b) Explain the difference between direct-mapped and associatively-mapped paging schemes. [10%]
(c) In an associatively mapped scheme explain what special features are required of the memory system. [10%]
(d) Describe how a virtual memory system might employ both paging and segmentation. [10%]
(e) An 8-bit microprocessor with a 16-bit address bus is attached to a 16kbyte ROM located at address 0000H. Located in this ROM at 0038H is an interrupt vector. It is required that this value be changed to a different value. It has been suggested that this could be done by using a small 2kbyte EPROM so that only the first 40H addresses of the ROM are replaced by the first 40H addresses of the EPROM.
(i) Describe the chip select logic that would be needed to achieve this and what changes would have to be made. [30%]
(ii) Show how this could be extended to allow a switch to be used to select different sections of the EPROM to replace this section of the ROM. [20%]

2) (a) With the aid of a block diagram, describe the operation of a direct-mapped unbuffered write-through cache. Indicate on a diagram how both the address bus and the data bus connect to a 32-bit processor (with a 30-bit word address bus), and a 16Mbyte DRAM main memory with an 8kbyte cache. [50%]
(b) Explain the benefit of adopting a buffered write-through design. Indicate on your diagram where buffering takes place. [20%]
(c) How should the design be changed to incorporate a deferred write (copy back) scheme? How does this work, and what are the added advantages of adopting this scheme? [30%]

3) (a) Explain what is meant if a processor is said to have a Harvard memory architecture. [10%]
(b) Explain the concept of pipelining and how this can lead to close to a single instruction per cycle performance. [15%]
(c) A processor has four pipelined stages: an instruction fetch stage; an instruction decode stage which also loads register operands; an execution stage which performs all calculations (including address calculations) and accesses data in memory; and a register write back stage which updates the registers (including the program counter). The processor uses a Harvard memory architecture, but initially provides no register forwarding or delayed branching options. Consider the following program fragment, which sums the squares of the numbers held in array "a" at a base address "arraya" for i from 0 to 255 inclusive, then divides by the total number of entries in the array and finally stores the result in b.

register R0 = 0
register R1 = 255
loop: load from memory[arraya + R1] to register R2
copy from R2 to R3
multiply R2 by R3, store result in R4
add R4 to R0, store result in R0
decrement R1
conditional jump to loop if R1 >= 0
divide R0 by 256, store result in memory[b]

(i) Show how one complete iteration of the program loop proceeds along the pipeline as described above. [25%]
(ii) Show how loop unrolling (by a factor of two) and consequent instruction optimisation and re-ordering can reduce the number of stall cycles in your solution to part (i). [25%]
(iii) Show how adding register forwarding and delayed branching techniques to your answer for part (ii) can eliminate stall cycles. [25%]

4) (a) Derive Amdahl's Law relating the speed-up of a parallel architecture to the fraction of serial code, f, in an application and show diagrammatically what the predicted speed-up is for between 0 and 20 processors when f is 0%, 5% and 10%. [30%]
(b) A computer simulation, which numerically integrates a complex function by dividing the space variable into a set of 1000 points in each of which the function is considered to be constant, is found to contain 10% sequential code.
(i) From Amdahl's Law estimate the respective potential speed-up when using machines with 1, 5, 10 and 1000 processors. [30%]
(ii) The simulation is to be scaled up, which will increase the set of points from 1000 to 100,000. Originally the simulations were run on 5 processors. From Amdahl's Law, how many processors would we need to produce an answer from the scaled-up version in a similar time? [20%]
(iii) In view of your answers above, explain why the current trend in supercomputing is to construct machines with thousands of processors. [20%]

5) (a) When parallelism is introduced into the processing of computer instructions, dependencies may arise. If the processor has a single instruction processing pipeline, in which instructions are issued in-order and completed in-order, describe the conditions that lead to the following forms of dependencies:
(i) Read-after-Write hazards
(ii) Write-after-Read hazards
(iii) Write-after-Write hazards
(iv) Why are Read-after-Read hazards not a problem? [40%]
(b) Describe what is meant by a superscalar processor. [10%]
(c) For the list of hazards given above, show the additional circumstances in which they can occur in a superscalar processor in which instructions can be issued out of order and might even finish out of order. [30%]
(d) Describe the scoreboarding hazard detection scheme and how it is used to prevent hazards causing a problem. [20%]

6) (a) Quantum computers will potentially be very different from the classical computers that are currently available. Give a description of the three main differences in:
(i) the fundamental unit of information;
(ii) the range of logical operations;
(iii) determining the state of the computer. [50%]
(b) Explain what is meant by de-coherence in the context of quantum computing and why this is a problem. [30%]
(c) What areas of computation are most likely to benefit from quantum computing in the immediate future? [20%]

7) (a) Explain what is meant by the following types of parallelism:
(i) Geometric Parallelism
(ii) Algorithmic Parallelism
(iii) Processor Farming [50%]
(b) We would like to calculate F(x) = exp(sqrt(sin(x*x))) for 10 different values of x using five processors. Describe how this might be implemented using:
(i) algorithmic parallelism;
(ii) geometric parallelism.
(iii) Calculate in each case the speed-up achieved over a single processor.
(iv) How would the speed-up be affected in each case if the number of values calculated is increased to 1000? [50%]

Solutions

1) (a) (i) Paging: the virtual address of a data item contains a page number and a displacement within the page, i.e. a (P, D) pair. The virtual page number P is looked up in the page table to provide the actual frame number; concatenated with D, this gives the real memory address. Pages have a fixed size, normally small, of the order of a few kbytes.
(ii) Segmentation: a segment is a similar concept to a page, but has no fixed size and can in principle be as large as the entire memory system.
(b) Direct-mapped paging: the page table contains entries for all pages of virtual memory, and the lookup uses P directly as the address into the table (see diagram). Associative-mapped paging: the table contains only those virtual memory pages that are currently stored in main memory, and is searched by comparison with the stored page numbers.
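As a minimal illustration of the two lookup styles (the page size, table contents and function names here are assumptions made purely for the sketch, not part of the syllabus answer):

    # Illustrative sketch of direct-mapped vs associative page-table lookup.
    # Page size, table contents and names are assumptions for this example only.

    PAGE_BITS = 12                      # assume 4 kbyte pages
    PAGE_SIZE = 1 << PAGE_BITS

    def split(virtual_address):
        """Split a virtual address into a (page number, displacement) pair."""
        return virtual_address >> PAGE_BITS, virtual_address & (PAGE_SIZE - 1)

    # Direct mapped: one entry per virtual page, indexed by page number P.
    direct_table = {0: 7, 1: 3, 2: None}        # None = page not resident

    # Associative: entries only for resident pages, searched by content.
    assoc_table = {0: 7, 1: 3}                   # virtual page -> frame

    def translate_direct(va):
        p, d = split(va)
        frame = direct_table.get(p)              # simple indexed lookup
        if frame is None:
            raise LookupError("page fault")
        return (frame << PAGE_BITS) | d          # concatenate frame number and D

    def translate_associative(va):
        p, d = split(va)
        if p not in assoc_table:                 # content search (a CAM in hardware)
            raise LookupError("page fault")
        return (assoc_table[p] << PAGE_BITS) | d

The associative version has to search by content, which is what motivates the content-addressable memory discussed in part (c).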
[Diagram: direct-mapped page table indexed directly by P, and associative-mapped page table searched by comparisons against the stored page numbers; each yields a frame number which is concatenated with D to form the real address.]

Advantages/disadvantages:
Direct: simple lookup for all pages, but large tables, hence slow (the table may itself be partly paged out to backing store).
Associative: more complicated lookup, but a smaller table, hence held entirely in main memory and giving a faster lookup for resident pages.

(c) Hardware implications of associative mapping: the look-up is by comparison with the stored page numbers, which requires content-addressable memory.

(d) Often a scheme is used combining a number of fixed-size pages inside a single, larger program segment.

(e) (i) The original address decoder generates an active chip select for the ROM when A15=0 and A14=0. This must be modified so that it goes active in the same circumstances except when addressing the bottom 40H locations in memory, i.e. except when A6=0, A7=0 and so on up to A13=0 (A14 and A15 are already 0 within the ROM region). In this latter case a second enable, to the new EPROM, must be activated instead.
(ii) We can use A6 (and other address lines) of the EPROM, connected to a toggle switch or DIP switch, to select two (or more) different 40H images. Tie the remaining upper EPROM address lines to GND, and its lower six lines to the address bus.

2) (a) The block diagram below shows a direct-mapped write-through cache; the FIFO buffers indicate the positions where buffering is added for part (b).

[Block diagram: 16Mbyte memory using an 8kbyte direct-mapped cache with write-through writes. The microprocessor's 32-bit data bus and 30-bit word address bus connect to the cache memory, the tag storage and comparison logic (producing a Match signal into the control logic) and the main DRAM memory; the address is split into tag, cache index and byte-within-word fields. Optional FIFOs in the address and data paths to the DRAM provide the buffered write-through of part (b).]

A directly mapped cache is the simplest form of memory cache, in which the real memory address is treated in three parts (tag, cache index and byte address). For a cache of 2^c words, the cache index section of the real memory address indicates which cache entry is able to store data from that address. When a word is cached, the tag (the most significant bits of the address) is stored in the cache alongside the data to indicate which page it came from; the cache can therefore hold 2^c words drawn from 2^t pages. In operation the tag is compared in every memory cycle: if the tag matches, a cache hit is achieved and the cache data is passed to the CPU; otherwise a cache miss occurs, the DRAM supplies the word, and the data and its tag are stored in the cache.

[Diagram: real address split into a tag of t bits and an index of c bits; the index selects a cache entry, the stored tag is compared with the address tag, and the comparison selects either the cache or main memory.]

Unbuffered write-through: write the data to the relevant cache entry and update the tag, and also write the data to its location in main memory; the speed is determined by the main memory.

(b) Buffered write-through (see the first diagram for the location of the buffers): the data (and address) is written to a FIFO buffer between the CPU and main memory, the CPU continues with its next access, and the FIFO buffer writes to the DRAM. The CPU can continue to write at cache speed until the FIFO is full, then slows to DRAM speed as the FIFO empties. If the CPU wants to read from DRAM (instead of from the cache), the FIFO must first be emptied to ensure the correct data is read, which can introduce a long delay. This delay can be shortened if the FIFO has only one entry, i.e. a simple latch buffer.

(c) Deferred write (copy back): data is written to the cache only, allowing the cached entry to differ from main memory. If the cache system wants to overwrite a cache index with an entry having a different tag, it first checks whether the current entry has been changed since it was copied in. If so, it writes the modified value back to main memory before reading the new data into that location in the cache.
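The copy-back decision just described can be sketched behaviourally as follows; the cache size, field widths and names here are illustrative assumptions rather than the figures asked for in the question.

    # Behavioural sketch of a direct-mapped cache with deferred write (copy back).
    # INDEX_BITS and the field layout are illustrative assumptions only.

    INDEX_BITS = 11                       # assume a cache of 2**11 words
    NUM_LINES = 1 << INDEX_BITS

    cache = [(None, 0, False)] * NUM_LINES   # each line: (tag, data, dirty)
    main_memory = {}                         # word address -> value

    def fields(word_address):
        index = word_address & (NUM_LINES - 1)
        tag = word_address >> INDEX_BITS
        return tag, index

    def write(word_address, value):
        tag, index = fields(word_address)
        old_tag, old_data, dirty = cache[index]
        if old_tag != tag and dirty:
            # Evicting a modified line: copy the old data back to DRAM first.
            main_memory[(old_tag << INDEX_BITS) | index] = old_data
        # Write to the cache only; the line is now newer than main memory.
        cache[index] = (tag, value, True)

    def read(word_address):
        tag, index = fields(word_address)
        old_tag, old_data, dirty = cache[index]
        if old_tag == tag:
            return old_data                          # cache hit
        if dirty:                                    # copy back before replacing
            main_memory[(old_tag << INDEX_BITS) | index] = old_data
        value = main_memory.get(word_address, 0)     # fetch from DRAM
        cache[index] = (tag, value, False)
        return value

The dirty flag is the extra state (the dirty bit in the diagram below) that the write-through designs do not need.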
More logic is required for this operation, but the performance gain can be considerable, as it allows the CPU to work at cache speed for as long as it stays within the same block of memory; the other techniques eventually slow down to DRAM speed. Adding a buffer to this allows the CPU to write to the cache before the displaced data is actually copied back to the DRAM.

[Block diagram: 16Mbyte memory using an 8kbyte direct-mapped cache with copy-back writes. As for the write-through design, but the tag storage and comparison logic now holds a dirty bit for each entry, and latches on the address and data paths to the DRAM hold the entry being copied back.]

3) (a) Harvard architecture: having a separate internal instruction bus and data bus (and associated caches).

[Diagram: integer CPU and FPU with separate internal instruction and data buses, each passing through its own cache/memory management unit onto the external memory bus.]

(b) General principle of pipelining:

[Diagram: non-pipelined processing, in which each instruction passes through stages 1 to 3 before the next instruction starts, compared with pipelined processing, in which successive instructions occupy successive stages in the same cycle.]

A single instruction still takes just as long, and its stages are still performed in the same order. The speed-up occurs when all stages are kept in operation at the same time; at that point it is possible to sustain a throughput of one instruction per cycle. Pipeline start-up and draining are less efficient.

(c) (i) Ignoring the non-loop code, which is trivial, one iteration proceeds as follows:

Cycle | FETCH      | DECODE                   | EXECUTE    | WRITE
1     | Load R2    | -                        | -          | -
2     | Copy R3    | Load R2                  | -          | -
3     | Mult R2,R3 | Copy R3 (stall on R2)    | Load R2    | -
4     | Mult R2,R3 | Copy R3 (stall on R2)    | -          | Load R2
5     | Mult R2,R3 | Copy R3 (OK)             | -          | -
6     | Add R4,R0  | Mult R2,R3 (stall on R3) | Copy R3    | -
7     | Add R4,R0  | Mult R2,R3 (stall on R3) | -          | Copy R3
8     | Add R4,R0  | Mult R2,R3 (OK)          | -          | -
9     | Dec R1     | Add R4,R0 (stall on R4)  | Mult R2,R3 | -
10    | Dec R1     | Add R4,R0 (stall on R4)  | -          | Mult R2,R3
11    | Dec R1     | Add R4,R0 (OK)           | -          | -
12    | BGEZ       | Dec R1                   | Add R4,R0  | -
13    | Next 1     | BGEZ (stall on R1)       | Dec R1     | Add R4,R0
14    | Next 1     | BGEZ (stall on R1)       | -          | Dec R1
15    | Next 1     | BGEZ (OK)                | -          | -
16    | Next 2     | Next 1                   | BGEZ       | -
17    | Next 3     | Next 2                   | -          | BGEZ (PC)

This gives 17 cycles per iteration; the next iteration's load is fetched in cycle 18, and the instructions Next 1-3 fetched after the branch are discarded when the PC is updated.

(ii) With loop unrolling by a factor of two and the instructions optimised and re-ordered we have:

loop: load from memory[arraya + R1] to register R2
load from memory[arraya + R1 - 1] to register R5
decrement R1 by 2
copy from R2 to R3
copy from R5 to R6
multiply R2 by R3, store result in R4
multiply R5 by R6, store result in R7
add R4 to R0, store result in R0
add R7 to R0, store result in R0
conditional jump to loop if R1 >= 0

Cycle | FETCH      | DECODE                   | EXECUTE    | WRITE
1     | Load R2    | -                        | -          | -
2     | Load R5    | Load R2                  | -          | -
3     | Dec R1     | Load R5                  | Load R2    | -
4     | Copy R2,R3 | Dec R1                   | Load R5    | Load R2
5     | Copy R5,R6 | Copy R2,R3               | Dec R1     | Load R5
6     | Mult R2,R3 | Copy R5,R6               | Copy R2,R3 | Dec R1
7     | Mult R5,R6 | Mult R2,R3               | Copy R5,R6 | Copy R2,R3
8     | Add R4,R0  | Mult R5,R6               | Mult R2,R3 | Copy R5,R6
9     | Add R7,R0  | Add R4,R0 (stall on R4)  | Mult R5,R6 | Mult R2,R3
10    | Add R7,R0  | Add R4,R0 (OK)           | -          | Mult R5,R6
11    | BGEZ       | Add R7,R0 (stall on R0)  | Add R4,R0  | -
12    | BGEZ       | Add R7,R0 (stall on R0)  | -          | Add R4,R0
13    | BGEZ       | Add R7,R0 (OK)           | -          | -
14    | Next 1     | BGEZ                     | Add R7,R0  | -
15    | Next 2     | Next 1                   | BGEZ       | Add R7,R0
16    | Next 3     | Next 2                   | -          | BGEZ (PC)

This gives 16 cycles for two iterations.

(iii) Forwarding can send the result of the multiply of R2 by R3 (destined for R4) directly from the output of the execute stage to the decode input of the add of R4 to R0, preventing that stall. Similarly, the execute-stage result of the add of R4 to R0 can be forwarded to the execute stage of the add of R7 to R0 in the next cycle, again preventing the stall.
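As a rough sketch of the forwarding decision just described (the dictionary encoding of instructions and the function name are illustrative assumptions, not part of the answer):

    # Sketch of the hazard/forwarding decision between adjacent instructions.
    # The instruction format here is an assumption made for illustration.

    def issue_action(older, younger, forwarding_enabled):
        """older is in EXECUTE, younger is in DECODE (reading its operands)."""
        dest = older.get("writes")
        needs = set(younger.get("reads", ()))
        if dest in needs:
            if forwarding_enabled:
                return "forward EXECUTE result to the younger instruction"  # no stall
            return "stall DECODE until write back completes"
        return "issue normally"

    mult = {"writes": "R4", "reads": ("R2", "R3")}
    add  = {"writes": "R0", "reads": ("R4", "R0")}

    print(issue_action(mult, add, forwarding_enabled=False))  # stall, as in parts (i)/(ii)
    print(issue_action(mult, add, forwarding_enabled=True))   # forward, as in part (iii)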
Moving the BGEZ up three instructions in the program and using a three-stage delayed branch eliminates the pipeline flush, and we then get two iterations in 10 cycles.

4) (a) Speed-up S(n) = execution time using one CPU / execution time using n CPUs. If the fraction of code which cannot be parallelised is f, and the time taken for the whole task on one processor is t, then the time taken to perform the computation with n processors is f.t + (1-f).t/n, so the speed-up is

S(n) = t / (f.t + (1-f).t/n) = n / (1 + (n-1)f)

[Graph: Amdahl's Law, speed-up S(n) against number of processors (0 to 20), plotted for f = 0%, 5%, 10% and 20%.]

(b) (i) From Amdahl's Law, with 10% sequential code the speed-up is s(n) = n/(1 + (n-1) x 0.1), so:
for 1 processor, s = 1
for 5 processors, s = 5/1.4 = 3.57
for 10 processors, s = 10/1.9 = 5.3
for 1000 processors, s = 1000/100.9, which is approximately 10 (essentially the limiting value of 1/f).

(ii) If the problem is scaled up to 100,000 points then the simulation will take 100 times longer than before. If we assume that the fraction of sequential code stays the same, then it will not be possible to obtain a speed-up of a factor of 100, because according to Amdahl's Law we can only hope to achieve a maximum speed-up of 1/f, where f is the sequential fraction, even with an infinite number of processors.
Alternatively, assuming Gustafson's Law, the sequential work stays the same, so the serial fraction falls to 0.1% while the total work grows by a factor of 100:
t' = 0.1t + 0.9t/5 for the original program on 5 processors, and
t' = 0.001(100t) + 0.999(100t)/n for the scaled-up program on n processors.
The sequential terms (both 0.1t) cancel, so 0.9t/5 = 99.9t/n, hence 0.9n = 99.9 x 5 and n = 555 processors.

(iii) If this were true there would be little point in building supercomputers with thousands of processors. However, Gustafson's Law suggests that the sequential component of the code remains about the same while the parallel component increases with the size of the problem. The fraction of sequential code should therefore decrease significantly, potentially to only 0.1%, giving a potential speed-up of around 1000 on a large number of processors.

5) (a) Considering instructions i and j, where i leads instruction j along the pipeline:
(i) Read-after-Write: j tries to read an operand before i has written it, so j gets the old value.
(ii) Write-after-Read: j writes a result before it has been read by i, so i incorrectly gets the new value.
(iii) Write-after-Write: j writes its result before i does, so the value left at the destination is that written by i rather than by j.
(iv) Read-after-Read "hazards" are not a problem because reading causes no changes, so reads may occur in any order.
(A small sketch classifying these cases for a pair of instructions is given after part (b) below.)

(b) Superscalar processors use a conventional instruction stream, read at several instructions per cycle. They fetch and decode more instructions than are needed to feed a single pipeline and then launch instructions down a number of parallel pipelines in each cycle, so that two or three pipelines can be kept busy. Compilers often re-order instructions to place suitable instructions in parallel, and the details of the strategy used have a large effect on the degree of parallelism achieved; some superscalars can also re-order at run time to take advantage of free resources. The approach is relatively easy to expand (add another pipelined functional unit) and will run previously compiled code, although it benefits from a new compiler. It can provide exceptional peak performance, but the extra data requirements put heavy demands on the memory system, and sustained performance may not be much more than two instructions per cycle.
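A minimal sketch classifying the dependences of part (a) for a pair of instructions, where i precedes j in program order; the dictionary encoding of instructions is an assumption made purely for illustration:

    # Classify the dependences between instruction i and a later instruction j.
    # Register sets are given explicitly; the encoding is illustrative only.

    def classify(i, j):
        hazards = []
        if i["writes"] & j["reads"]:
            hazards.append("RAW (read after write)")
        if i["reads"] & j["writes"]:
            hazards.append("WAR (write after read)")
        if i["writes"] & j["writes"]:
            hazards.append("WAW (write after write)")
        # Reads of the same register in both instructions need no ordering at all.
        return hazards or ["no hazard"]

    i = {"reads": {"R2", "R3"}, "writes": {"R4"}}   # multiply R2 by R3 -> R4
    j = {"reads": {"R4", "R0"}, "writes": {"R0"}}   # add R4 to R0 -> R0
    print(classify(i, j))                            # ['RAW (read after write)']

Running the example prints the read-after-write dependence that caused the stalls in the question 3 pipeline tables.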
(c) In a superscalar machine (with two or more instruction pipelines):
Read-after-Write hazards are more difficult to avoid cheaply, since forwarding from one pipeline to another is difficult to achieve; this needs scoreboarding or register-renaming techniques.
A Write-after-Read hazard can occur if a short instruction from later in the instruction stream propagates along one pipeline faster than a slower, earlier instruction in another pipeline, and writes a register before the slower instruction has read it, so the slower instruction picks up the new value.
If a slow instruction in one pipeline writes to a register and a faster instruction from later in the instruction stream is scheduled in a different pipeline and also writes to the same register, then the faster instruction's result may be overwritten by the slower one, i.e. a Write-after-Write hazard.

(d) Detecting hazards: several techniques, normally resulting in some stage of the pipeline being stopped for a cycle, can be used to overcome these hazards. They all depend on detecting register-usage dependencies between instructions in the pipeline, so an automated method of managing register accesses is needed. The most common detection scheme is scoreboarding: a 1-bit tag is kept with each register, and all tags are cleared when the machine is booted. A tag is set by the fetch or decode stage when an instruction is going to change that register, and is cleared when the change is complete. If an instruction is decoded which wants a tagged register, it is not allowed to access that register until the tag is cleared.

6) (a) The three main differences between classical and quantum computers are the fundamental unit of information (the qubit), the range of logical operations, and the process of determining the state of the computer.
(i) Fundamental unit of information: classical computers are built from bits, which have two states, 0 or 1. Quantum computers are built from qubits, physical systems which possess states analogous to 0 and 1 but which can also be in states between 0 and 1; these intermediate states are known as superposition states. A qubit can, in a sense, store much more information than a bit.
(ii) Range of logical operations: classical computers operate according to binary logic. Quantum logic gates take one or more qubits as input and produce one or more qubits as output. Because qubits have states corresponding to 0 and 1, quantum logic gates can emulate classical logic gates, but with superposition states between 0 and 1 there is a great expansion in the range of quantum logic gates, e.g. gates that take 0 and 1 as input and produce as output different superposition states between 0 and 1, which have no classical analogue. This expanded range of gates can be exploited to achieve greater information-processing power in quantum computers.
(iii) Determining the state of the computer: in a classical computer we can read out the state of all the bits at any time. In a quantum computer it is in principle impossible to determine the exact state of the machine, i.e. we cannot determine exactly which superposition state is being stored in the qubits; we can only obtain partial information about the state. Designing algorithms is therefore a delicate balance between exploiting the expanded range of states and logical operations and the restricted read-out of information.

(b) Decoherence is the obstacle to quantum computation. For a qubit to work successfully it must remain in an entangled quantum superposition of states; as soon as we measure the state it collapses to a single value.
This happens even if we make the measurement by accident. In a conventional double-slit experiment, the wave amplitudes corresponding to an electron (or photon) travelling along the two possible paths interfere. If another particle with spin is placed close to the left slit, an electron passing by will flip the spin. This "accidentally" records which path the electron took and causes the loss of the interference pattern. In reality it is very difficult to prevent qubits from interacting with the rest of the world. The best solution (so far) is to build quantum computers with fault-tolerant designs using error-correction procedures; the result is that we need more qubits, between 2 and 5 times the number needed in an "ideal world".

(c) Quantum computation is basically a massively parallel computation: it can perform a calculation on 2^L numbers in a single step, which would take 2^L steps or 2^L processors in a conventional architecture. It is only good for certain types of computation, NOT information storage: the machine can hold many states at once, but we can only see one of them. Quantum interference allows us to obtain a single result that depends logically on all 2^L of the intermediate results. Likely early applications include Grover's algorithm, which searches an unsorted list of N items in only sqrt(N) steps, and quantum cryptanalysis (code breaking and encoding).

7) (a) (i) Geometric parallelism is where the data for an application is spread across a number of processors and essentially the same algorithm is run on each processor.
- Load balancing is static: the choice of data distribution is made at the time the program is designed, and it may be difficult to design a scheme which keeps all processors equally busy.
- Most algorithms suitable for geometric parallelism (e.g. image processing and some matrix arithmetic) do not work on each data point in isolation, but on clusters of points around each point in turn. Points on the boundaries of the areas allocated to each processor need the values of data held on other processors when their turn comes to be computed, and this gives rise to communication.
- The initial loading of data onto the array of processors may be a significant overhead, as will be the time to communicate the results of the computation.
(ii) Algorithmic parallelism is when the algorithm is split into several sequential steps; the results of the first are piped to the second and so on, with each stage hosted on a separate processor.
- The time taken for each item of initial data to be completely processed is the same as if it were processed in one stage on a single processor.
- As each stage of computation is completed, the partial result is moved along the pipeline, allowing another data set to be introduced. For an n-stage pipeline (each stage doing 1/n of the whole task) one set of results is generated in each cycle, i.e. n results in the time that a single processor would have taken to generate one set.
- Load balancing is static, since the shape and behaviour of the pipeline are fixed at the design stage. It may be possible to allocate more than one processor to a particular stage, taking alternate data values, in order to improve the load balancing.
(iii) Processor farming subdivides the overall task into many independent sub-tasks and then uses a controller processor to send each sub-task to a worker processor which has just completed its work and has thus recently become idle.
- Load balancing is therefore semi-dynamic; each worker is only fed the sub-tasks which it can handle, and more workers are kept busy than in a poorly scheduled static scheme.
- Processor farming works well when the ratio of computation to communication (both of the initial sub-task specifications and of the results) is high. If the ratio falls, the latency of the controller-worker-controller round trip becomes significant, and it may be valuable to buffer a spare command packet next to each worker, ready for it to accept as soon as it finishes its present sub-task.

(b) We want to calculate F(xi) = exp(sqrt(sin(xi*xi))) for x1, x2, x3, ..., x10 using 5 processors.

(i) Algorithmic or pipelined version: the calculation is split into a four-stage pipeline, with one stage per processor.

[Diagram: pipeline of four processors computing y*y, sin(y), sqrt(y) and exp(y), each stage taking 1 time unit; the fifth processor has nothing to do.]

F1 is produced after 4 time units and F2 at time 5, so the total time = 4 + (10 - 1) = 13 units, giving a speed-up of 40/13 = 3.1.

(ii) Geometric or partitioned version: each processor performs the complete algorithm exp(sqrt(sin(x*x))) on its own share of the data.

[Diagram: five processors, each computing exp(sqrt(sin(x*x))) for two values (x1 and x6, x2 and x7, ..., x5 and x10) to produce F1 and F6, ..., F5 and F10; each value takes 4 time units, so the total time is 8 units.]

Speed-up = 40/8 = 5.

(iii) As above, a single processor would take 4 time units per value; with 10 values that is 40 time units. The geometric version takes only 8 units and the algorithmic version 13 units, so the speed-ups are 5 and 3.1 respectively.

(iv) If we want to calculate 1000 values, a single processor would take 4000 units, the geometric version only 800 units, and the algorithmic version 1003 units (speed-ups of 5 and approximately 4 respectively).
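As a rough numerical check of these timings (assuming, as in the answer above, that each of the four stages costs one time unit and that the values divide evenly among the five processors):

    import math

    # Check of the part (b) speed-ups: 4-stage task on 5 processors.

    STAGES = 4           # y*y, sin(y), sqrt(y), exp(y)
    WORKERS = 5

    def time_single(n):
        return STAGES * n

    def time_geometric(n):
        # Each processor runs the whole algorithm on its own share of the values.
        return STAGES * math.ceil(n / WORKERS)

    def time_pipeline(n):
        # Four-stage pipeline: first result after STAGES units, then one per unit
        # (the fifth processor is idle).
        return STAGES + (n - 1)

    for n in (10, 1000):
        t1 = time_single(n)
        print(f"{n} values: geometric speed-up {t1 / time_geometric(n):.2f}, "
              f"pipeline speed-up {t1 / time_pipeline(n):.2f}")

This prints speed-ups of 5.00 and 3.08 for 10 values, and 5.00 and 3.99 for 1000 values, matching the figures quoted above.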