1) (a) In the context of Virtual Memory explain what is meant by:
(i) paging
(ii) segmentation
(b) Explain the difference between direct-mapped and associatively-mapped paging
schemes.
(c) In an associatively mapped scheme explain what special features are required of the
memory system.
(d) Describe how a virtual memory system might employ both paging and segmentation.
(e) An 8 bit microprocessor with a 16 bit address bus is attached to a 16kbyte ROM
located at address 0000H. Located in this ROM at 0038H is an interrupt vector. It is
required that this value be changed to a different value. It has been suggested that this
could be done by using a small 2kbyte EPROM to replace only the first 40H
addresses with the first 40H of the EPROM.
(i) Describe the chip select logic that would be needed to achieve this and what
changes would have to be made.
(ii) Show how this could be extended to allow a switch to be used to select different
sections of the EPROM to replace this section of the ROM.
2)
(a) With the aid of a block diagram, describe the operation of a direct mapped
unbuffered write through cache. Indicate on a diagram how both the address bus and
data connect to a 32-bit processor (with a 30-bit word address bus), and a 16Mbyte
DRAM main memory with an 8kbyte cache.
[20%] [10%] [10%] [10%] [30%] [20%]
[50%]
(b) Explain the benefit of adopting a buffered write through design. Indicate on your
diagram where buffering takes place.
[20%]
(c) How should the design be changed to incorporate a deferred write (copy back)
scheme? How does this work, and what are the added advantages of adopting this
scheme?
[30%]
3)
(a) Explain what is meant if a processor is said to have a Harvard memory architecture.
(b) Explain the concept of pipelining and how this can lead to close to a single instruction
per cycle performance
[10%]
[15%]
(c) A processor has four pipelined stages – an instruction fetch stage; an instruction
decode stage which also loads register operands; an execution stage which performs all
calculations (including addressing ones) and accesses data in memory; and a register
write-back stage which updates the registers (including the program counter). The processor
uses a Harvard memory architecture, but initially provides no register forwarding or
delayed branching options.
Consider the following program fragment, which sums the squares of the numbers held
in array “a” at a base address “arraya” for i from 0 to 255 inclusive, then divides by the
total number of entries in the array and finally stores the result in b.
register R0 = 0
register R1 = 255
loop: load from memory[arraya + R1] to register R2
Copy from R2 to R3
multiply R2 by R3, store result in R4
add R4 to R0, store result in R0
decrement R1
conditional jump to loop if R1 >= 0
divide R0 by 256, store result in memory[b]
(i) Show how one complete iteration of the program loop proceeds along the
pipeline as described above.
(ii) Show how loop unrolling (by a factor of two) and consequent instruction
optimisation and re-ordering can reduce the number of stall cycles in your
solution to part (i).
(iii) Show how adding register forwarding and delayed branching techniques to
your answer for part (ii) can eliminate stall cycles.
[25%] [25%] [25%] [30%]
4)
(a) Derive Amdahl’s Law relating the speed-up of a parallel architecture to the fraction
of serial code, f, in an application, and show diagrammatically the predicted
speed-up for between 0 and 20 processors when f is 0%, 5% and 10%.
(b) A computer simulation, which numerically integrates a complex function by dividing
the space variable into a set of 1000 points at each of which the function is
considered to be constant, is found to contain 10% sequential code.
(i) From Amdahl’s Law estimate the respective potential speed up when using
machines with 1, 5, 10 and 1000 processors.
(ii) The simulation is to be scaled up which will increase the set of points from 1000
to 100,000. Originally the simulations were run on 5 processors. From
Amdahl’s Law how many processors would we need to produce an answer from
the scaled up version in a similar time?
(iii) In view of your answers above explain why it is that the current trend in
supercomputing is to construct machines with thousands of processors.
[30%] [20%] [20%]
5) (a) When parallelism is introduced into the processing of computer instructions
dependencies may arise. If the processor has a single instruction processing pipeline,
in which instructions are issued in-order and completed in-order, describe the
conditions that lead to the following forms of dependencies:
(i) Read-after-Write hazards
(ii) Write-after-Read hazards
(iii) Write-after-Write hazards
(iv) Why are Read-after-Read hazards not a problem?
[40%]
(b) Describe what is meant by a superscalar processor
[10%]
(c) For the list of Hazards given above show the additional circumstances in which they
can occur in a superscalar processor in which instructions can be issued out of order
and might even finish out of order.
[30%]
(d) Describe the Scoreboarding hazard detection scheme and how it is used to prevent
hazards causing a problem.
[20%]
6) (a) Quantum computers will potentially be very different from the classical computers that
are currently available. Give a description of the three main differences in:
(i) the fundamental unit of information;
(ii) the range of logical operations;
(iii) determining of the state of the computer.
[50%]
(b) Explain what is meant by decoherence in the context of quantum computing and
why this is a problem.
[30%]
(c) What areas of computation are most likely to benefit from quantum computing in the
immediate future?
[20%]
7) (a) Explain what is meant by the following types of parallelism:
(i) Geometric Parallelism
(ii) Algorithmic Parallelism
(iii) Processor Farming
(b) We would like to calculate F(x) = exp(sqrt(sin(x*x))) for 10 different values of x
using five processors. Describe how this might be implemented using:
(i) algorithmic parallelism;
(ii) geometric parallelism.
(iii) Calculate in each case the speed up achieved over a single processor.
(iv) How would the speed up be affected in each case if the number of values
calculated is increased to 1000?
[50%]
[50%]
Solutions
1) (a)(i) Paging (general): the virtual address of data contains a page number and a displacement
on the page – a (P,D) pair. The virtual page number is looked up in a page table to provide the
actual frame number. Concatenated with D this gives the real memory address. Pages
have fixed sizes – normally small, of the order of a few kbytes.
(ii) Segmentation: a segment is a similar concept to a page but has no fixed size and
can in principle be as large as the entire memory system.
(b) Direct Mapped paging: page table contains all pages of virtual memory. Look up by
using P as address in table (see diagram).
Associative Mapped paging: table contains only those virtual memory pages that are
stored in main memory.
[Diagram: a direct-mapped page table is indexed directly by P and yields the frame number,
which is concatenated with D; an associative-mapped page table holds (P, frame number)
entries, and P is located by comparisons against all stored entries.]
Advantages/disadvantages
Direct: simple lookup for all pages, but large tables, hence slow (possibly partly paged to
backing store).
Associative: more complicated lookup, but a smaller table, hence held entirely in main
memory and giving faster lookup for pages.
(c) Hardware implications of associative mapping: Look-up via comparison with stored
page numbers – content-addressable memory.
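For illustration only (not part of the original answer), the following Python sketch contrasts the two lookup styles; the page size, table contents and addresses are invented for the example.

# Sketch of virtual-to-real address translation (hypothetical 4-kbyte pages).
PAGE_BITS = 12                           # displacement D is the low 12 bits

# Direct-mapped page table: one entry per virtual page, indexed by P.
direct_table = {0: 7, 1: None, 2: 3}     # virtual page -> frame (None = not resident)

# Associative table: only resident pages are stored, searched by content.
assoc_table = [(0, 7), (2, 3)]           # (virtual page, frame) pairs

def translate_direct(vaddr):
    p, d = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    frame = direct_table[p]              # simple indexed lookup
    if frame is None:
        raise RuntimeError("page fault")
    return (frame << PAGE_BITS) | d      # concatenate frame number with D

def translate_assoc(vaddr):
    p, d = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    for page, frame in assoc_table:      # hardware does all comparisons at once (CAM)
        if page == p:
            return (frame << PAGE_BITS) | d
    raise RuntimeError("page fault")

print(hex(translate_direct(0x2ABC)), hex(translate_assoc(0x2ABC)))   # both give 0x3abc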
(d) Often a scheme is used that combines a number of fixed-size pages inside a single
program segment.
(e) (i) The original address decoder generates an active chip select for the ROM
when A15=0 and A14=0. This must be modified so that it goes active in the same
circumstances except when addressing the bottom 40hex locations in memory - i.e.
when A6=0, A7=0 and so on up to A13=0. In this latter case, a second enable to the new
EPROM must be activated instead.
(ii) We can use A6 (and other address lines) of the EPROM, connected to a toggle
switch or DIP switch, to select two (or more) different 40hex images. Tie the
remaining upper EPROM address lines to GND, and its lower 6 lines to the address
bus.
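A minimal sketch of the decode logic described above, written in Python for clarity (the function and signal names are invented; a real design would use discrete gates or a PAL):

# Chip-select sketch for the ROM/EPROM overlay (hypothetical signal names).
def selects(addr, switch=0):
    a = [(addr >> i) & 1 for i in range(16)]             # A0..A15
    rom_region = (a[15] == 0) and (a[14] == 0)            # original 16k ROM decode
    bottom_40h = all(a[i] == 0 for i in range(6, 14))      # A6..A13 all zero
    rom_cs   = rom_region and not bottom_40h               # ROM disabled for 0000H-003FH
    eprom_cs = rom_region and bottom_40h                   # EPROM enabled instead
    # Part (ii): a switch drives EPROM A6 (and above) to pick one of several
    # 40H-byte images; EPROM A0-A5 come straight from the address bus.
    eprom_addr = (switch << 6) | (addr & 0x3F)
    return rom_cs, eprom_cs, eprom_addr

print(selects(0x0038))            # interrupt vector -> EPROM selected
print(selects(0x0100))            # -> ROM selected
print(selects(0x0038, switch=1))  # switch picks the second EPROM image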
2) (a) The block diagram below shows a direct-mapped write-through cache – the FIFO
buffers indicate the positions where buffers are added for part (b).
[Block diagram: 16Mbyte Memory using 8kbyte Direct-Mapped cache with Write-Through
writes. The 30-bit word address (A0-29) from the microprocessor splits into a cache index
and a tag; the tag storage-and-comparison block drives a Match signal into the control
logic, which selects between the cache memory and the main DRAM (DRAM select, WR
control). Optional FIFOs on the address and data buses between CPU and DRAM provide
the buffering for part (b).]
Direct-mapped cache - the simplest form of memory cache, in which the real memory
address is treated in three parts.
For a cache of 2^c words, the cache-index section of the real memory address indicates
which cache entry is able to store data from that address. When cached, the tag (the most
significant bits of the address) is stored in the cache with the data to indicate which page it
came from. The cache will store 2^c words from 2^t pages. In operation the tag is compared
in every memory cycle: if the tag matches, a cache hit is achieved and the cache data is
passed; otherwise a cache miss occurs, the DRAM supplies the word, and the data and tag
are stored in the cache.
[Diagram: the address splits into a tag (t bits) and an index (c bits); the index selects an
entry in the cache memory, the stored tag is compared with the address tag, and the result
of the comparison selects between the cache and main memory.]
Unbuffered Write Through
Write data to relevant cache entry, update tag, also write data to location in main
memory - speed determined by main memory
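A rough Python sketch of the behaviour just described (the sizes only loosely match the 8kbyte/16Mbyte example and are assumptions for illustration, not part of the original answer):

# Direct-mapped, unbuffered write-through cache sketch (word-addressed).
C_BITS = 11                        # 2^11 = 2048 32-bit words = 8 kbytes of cache
cache_tag  = [None] * (1 << C_BITS)
cache_data = [0]    * (1 << C_BITS)
dram       = {}                    # stand-in for the 16-Mbyte main memory

def split(addr):
    return addr >> C_BITS, addr & ((1 << C_BITS) - 1)    # (tag, index)

def read(addr):
    tag, idx = split(addr)
    if cache_tag[idx] == tag:                  # cache hit: data supplied at cache speed
        return cache_data[idx]
    data = dram.get(addr, 0)                   # cache miss: DRAM supplies the word...
    cache_tag[idx], cache_data[idx] = tag, data    # ...and it is stored with its tag
    return data

def write(addr, data):
    tag, idx = split(addr)
    cache_tag[idx], cache_data[idx] = tag, data    # update the cache entry and tag
    dram[addr] = data                  # write-through: DRAM is written every time,
                                       # so write speed is set by main memory
write(0x1234, 42); print(read(0x1234))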
b) Buffered write-through (see the first diagram for the location of the buffers).
Data (and address) is written to a FIFO buffer between the CPU and main memory; the CPU
continues with the next access while the FIFO buffer writes to the DRAM. The CPU can
continue to write at cache speed until the FIFO is full, then slows down to DRAM speed as the
FIFO empties. If the CPU wants to read from DRAM (instead of the cache) the FIFO must be
emptied first to ensure we have the correct data - this can introduce a long delay. The delay
can be shortened if the FIFO has only one entry - a simple latch buffer.
(c) Deferred Write (Copy Back)
Data is written out to the cache only, allowing the cached entry to be different from main
memory. If the cache system wants to over-write a cache index with a different tag, it
looks to see if the current entry has been changed since it was copied in. If so, it writes
the modified value back to main memory before reading the new data into that location in
the cache.
More logic is required for this operation, but the performance gain can be considerable
as it allows the CPU to work at cache speed if it stays within the same block of
memory. The other techniques will slow down to DRAM speed eventually.
Adding a buffer to this allows the CPU to write to the cache before the data is actually
copied back to DRAM.
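Continuing the previous sketch (same assumed sizes and state), the copy-back change amounts to adding a dirty bit per cache entry and writing DRAM only on eviction:

# Deferred-write (copy-back) variant: writes stay in the cache until eviction.
dirty = [False] * (1 << C_BITS)

def write_back(addr, data):
    tag, idx = split(addr)
    if cache_tag[idx] not in (None, tag) and dirty[idx]:
        old_addr = (cache_tag[idx] << C_BITS) | idx
        dram[old_addr] = cache_data[idx]       # copy the modified entry back first
    cache_tag[idx], cache_data[idx], dirty[idx] = tag, data, True   # cache-speed write

def read_back(addr):
    tag, idx = split(addr)
    if cache_tag[idx] == tag:
        return cache_data[idx]
    if cache_tag[idx] is not None and dirty[idx]:          # evict a dirty entry
        dram[(cache_tag[idx] << C_BITS) | idx] = cache_data[idx]
    cache_tag[idx], cache_data[idx], dirty[idx] = tag, dram.get(addr, 0), False
    return cache_data[idx]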
[Block diagram: 16Mbyte Memory using 8kbyte Direct-Mapped cache with Copy-Back
writes. The arrangement is as before, but the tag storage-and-comparison block now keeps
a dirty bit per entry alongside the Match output, and latches (rather than FIFOs) hold the
address and data for the copy-back write to the main DRAM.]
3) (a) Harvard Architecture - having a separate internal instruction bus and data bus (and
associated caches):
[Diagram: the integer CPU and FPU are served by a separate internal instruction bus and
internal data bus, each through its own cache/memory-management unit, both joining a
single off-chip memory bus.]
(b) General Principle of Pipelining:
[Diagram: non-pipelined processing - each instruction passes through stage 1, stage 2 and
stage 3 before the next instruction begins - compared with pipelined processing, in which
successive instructions occupy different stages during the same cycle.]
A single instruction still takes as long, and each instruction still has to be performed in the same
order. Speed-up occurs when all stages are kept in operation at the same time; at this point it is
possible to sustain a throughput of one instruction per cycle. Start-up and ending become less
efficient.
c) (i) Ignoring the non-loop code – which is trivial – the loop instructions pass down the
FETCH, DECODE, EXECUTE and WRITE stages as follows. With no forwarding, an
instruction must wait in DECODE until the instruction producing its operand has completed
its WRITE stage:

Load R2           - no stall
Copy R2 to R3     - stalls 2 cycles in decode waiting for R2
Mult R2, R3 (R4)  - stalls 2 cycles waiting for R3
Add R4, R0 (R0)   - stalls 2 cycles waiting for R4
Dec R1            - no stall
BGEZ loop         - stalls 2 cycles waiting for R1; the three instructions fetched after the
                    branch (Next 1-3) are then flushed when the branch is taken

Gives 17 cycles per iteration
(ii) With loop unrolling by a factor of 2 and instruction-optimised code we have:
loop: load from memory[arraya + R1] to register R2
load from memory[arraya + R1 - 1] to register R5
decrement R1 by 2
Copy from R2 to R3
Copy from R5 to R6
multiply R2 by R3, store result in R4
multiply R5 by R6, store result in R7
add R4 to R0, store result in R0
add R7 to R0, store result in R0
conditional jump to loop if R1 >= 0
The re-ordered instructions then flow down the FETCH, DECODE, EXECUTE and WRITE
stages with far fewer stalls: the two loads, Dec R1 and the two copies issue back-to-back
without stalling; Add R4, R0 stalls one cycle in decode waiting for R4; Add R7, R0 stalls
two cycles waiting for R0; the BGEZ then completes and the instructions fetched after it
(Next 1-3) are flushed when the branch is taken.

Gives 16 cycles for two iterations
(iii) Forwarding can send the result of the Mult R2 by R3 (destined for R4) directly from the
output of the execute stage to the decode input of the Add R4 to R0, preventing the stall.
Similarly, the execute-stage result of the Add R4 to R0 can be forwarded to the execute stage
of the Add R7 to R0 in the next cycle, again preventing the stall.
Moving the BGEZ up 3 instructions in the program and using a 3-stage delayed branch
eliminates the pipeline flush, and we then get 2 iterations in 10 cycles.
4) (a) Speedup S(n) = execution time using one CPU / execution time using n CPUs.
If the fraction of code which cannot be parallelised is f, and the time taken for the whole
task on one processor is t, then the time taken to perform the computation with n
processors is
f.t + (1-f).t/n
and the speedup S(n) = t / (f.t + (1-f).t/n) = n/(1+(n-1)f)
[Graph: Amdahl's Law - speedup S(n) (0 to 20) against number of processors (0 to 20) for
f = 0%, 5%, 10% and 20%.]
b) (i) From Amdahl’s Law, if a process takes t seconds on a single processor then with 10%
sequential code the speed-up is s(n) = n/(1+(n-1)*0.1),
so for 1 processor s = 1
for 5 processors s = 5/1.4 = 3.57
for 10 processors s = 10/1.9 = 5.26
for 1000 processors s = 1000/100.9 = 9.91 (approaching the 1/f limit of 10)
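These figures follow directly from the formula; a quick check in Python (assuming f = 0.1):

# Amdahl's Law speed-up for f = 10% sequential code.
f = 0.1
for n in (1, 5, 10, 1000):
    print(n, round(n / (1 + (n - 1) * f), 2))   # 1.0, 3.57, 5.26, 9.91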
(ii) If the problem is scaled up to 100,000 points then the simulation will take 100 times
longer than before. If we assume that the fraction of sequential code is the same, then it
will not be possible to obtain a speed-up of a factor of 100: according to Amdahl’s Law
we can only hope to achieve a maximum speed-up of 1/f, where f is the sequential
fraction, even with an infinite number of processors.
OR
Assuming Gustafson’s Law, the fraction of serial code will reduce to 0.1% while the
single-processor time becomes 100 times longer, so:
t’ = 0.1t + 0.9t/5 for the original program and t’ = 0.001(t*100) + 0.999(t*100)/n for the
scaled-up one,
so that 0.9t/5 = 99.9t/n, hence 0.9n = 99.9*5, so n = 555.
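A quick numeric check of this equality (the symbols follow the working above; t cancels out):

# Solve 0.1t + 0.9t/5 = 0.001*(100t) + 0.999*(100t)/n for n.
t_original = 0.1 + 0.9 / 5            # 0.28, in units of t
n = 99.9 / (t_original - 0.1)         # scaled parallel part must fit in the remaining 0.18t
print(round(n))                       # 555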
(iii)
If this were the whole story there would be little point in creating supercomputers with
thousands of processors. However, Gustafson’s Law suggests that the sequential
component of code remains about the same whilst the parallel component
increases with the size of the problem. Thus the fraction of sequential code should
decrease significantly, dropping potentially to only 0.1%, giving a potential
speed-up of 1000 on a large number of processors.
5(a) Considering instructions i and j, where i leads instruction j along the pipeline:
(i) Read-after-Write - j tries to read an operand before i writes it, so j gets the old value.
(ii) Write-after-Read - j writes a result before it is read by i, so i incorrectly gets the new value.
(iii) Write-after-Write - j writes a result before i, so the value left by i rather than j is left at the
destination.
(iv) Read-after-Read “hazards” aren’t a problem because reading is not an operation that causes
changes, so reads in any order are not a problem.
(b) Superscalar processors use conventional instruction streams, read at several instructions
per cycle. Decoded instructions are issued to a number of pipelines - 2 or 3 pipelines can be kept
busy. They fetch and decode more instructions than are needed to feed a single pipeline, then
launch instructions down a number of parallel pipelines in each cycle. Compilers often reorder
instructions to place suitable instructions in parallel - the details of the strategy used
will have a huge effect on the degree of parallelism achieved. Some superscalars can perform
re-ordering at run time, to take advantage of free resources. They are relatively easy to expand -
add another pipelined functional unit. They will run previously compiled code, but will benefit
from a new compiler. They can provide exceptional peak performance, but the extra data
requirements put heavy demands on the memory system, and sustained performance might not
be much more than 2 instructions per cycle.
(c) In a superscalar machine (with two or more instruction pipelines):
Read-after-Write hazards are more difficult to avoid cheaply, since forwarding from one
pipeline to another is difficult to achieve - this needs scoreboarding or register-renaming
techniques.
Write-after-Read hazards can occur if a short instruction (from later in the instruction
stream) propagates along one pipeline faster than a slower, earlier instruction in another
pipeline; if the faster instruction writes a register that the slower instruction has yet to read,
this hazard occurs.
If a slow instruction in one pipeline writes to a register and a faster instruction (from
later in the instruction stream) is scheduled in a different pipeline and also attempts to write
to the same register, then a Write-after-Write hazard might occur.
(d) Detecting Hazards.
Several techniques - normally resulting in some stage of the pipeline being stopped for a
cycle - can be used to overcome these hazards. They all depend on detecting register-usage
dependencies between instructions in the pipeline. An automated method of managing
register accesses is needed. The most common detection scheme is scoreboarding.
Scoreboarding: keep a 1-bit tag with each register. The tags are cleared when the machine is
booted, and a tag is set by the fetch or decode stage when an instruction is going to change
that register; when the change is complete the tag bit is cleared. If an instruction is decoded
which wants a tagged register, the instruction is not allowed to access it until the tag is cleared.
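A toy Python model of this 1-bit scoreboard (the register names and the way stalls are reported are invented for illustration):

# One pending-write tag bit per register; decode holds an instruction while a
# source or destination register is still tagged by an earlier, incomplete instruction.
tags = {r: False for r in ("R0", "R1", "R2", "R3", "R4")}

def issue(dest, sources):
    if any(tags[r] for r in sources + [dest]):
        return False          # hazard detected: hold the instruction in decode
    tags[dest] = True         # mark the destination as having a write in flight
    return True

def writeback(dest):
    tags[dest] = False        # result written: clear the tag, waiting readers may proceed

print(issue("R2", ["R1"]))    # True  - load R2
print(issue("R3", ["R2"]))    # False - RAW on R2, stalls until writeback("R2")
writeback("R2")
print(issue("R3", ["R2"]))    # True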
6)
(a) Three main differences between classical and Quantum computers are:
Fundamental unit of information is a qubit
Range of logical operations
Process of determining the state of the computer
(i) The fundamental units of information processed by the two types of computer:
Classical computers are built from bits - two states: 0 or 1.
Quantum computers are built from qubits - a physical system which possesses states
analogous to 0 or 1, but which can also be in states between 0 and 1.
The intermediate states are known as superposition states.
A qubit - in a sense - can store much more information than a bit.
(ii) Classical computers operate according to binary logic.
Quantum logic gates take one or more qubits as input and produce one or more qubits
as output.
Qubits have states corresponding to 0 and 1, so quantum logic gates can emulate
classical logic gates.
With superposition states between 0 and 1 there is a great expansion in the range of
quantum logic gates, e.g. quantum logic gates that take 0 and 1 as input and produce as
output different superposition states between 0 and 1 - no classical analogue.
This expanded range of quantum gates can be exploited to achieve greater information-
processing power in quantum computers.
(iii) Determining the State of the Computer
In Classical computers we read out the state of all the bits in the computer at any time
In a Quantum computer it is in principle impossible to determine the exact state of the
computer.
i.e. we can’t determine exactly which superposition state is being stored in the qubits
making up the computer
We can only obtain partial information about the state of the computer
Designing algorithms is a delicate balance between exploiting the expanded range of
states and logical operations and the restricted readout of information.
(b) Decoherence: the obstacle to quantum computation
For a qubit to work successfully it must remain in an entangled quantum superposition
of states.
As soon as we measure the state it collapses to a single value.
This happens even if we make the measurement by accident
[Diagram: double-slit arrangement with a source; a particle with spin sits close to the left slit.]
In a conventional double-slit experiment, the wave amplitudes corresponding to an
electron (or photon) travelling along the two possible paths interfere. If another particle
with spin is placed close to the left slit, an electron passing will flip the spin. This
“accidentally” records which path the electron took and causes the loss of the
interference pattern.
In reality it is very difficult to prevent qubits from interacting with the rest of the world.
The best solution (so far) to this is to build quantum computers with fault tolerant
designs using error correction procedures.
The result of this is that we need more qubits, between 2 and 5 times the number in an
“ideal world”
(c) Basically a massively parallel computation:
A quantum computer can perform a calculation on 2^L numbers in a single step, which
would take 2^L steps or processors in a conventional architecture.
Only good for certain types of computation - NOT information storage: it can hold many
states at once but we only ever see one of them.
Quantum interference allows us to obtain a single result that depends logically on all
2^L of the intermediate results.
Grover’s Algorithm - searches an unsorted list of N items in only sqrt(N) steps.
Quantum cryptanalysis - code breaking and encoding.
7) (a)(i) Geometric parallelism is where the data for an application is spread across a
number of processors and essentially the same algorithm is run on each processor.
- The load balancing is static - the choice of data distribution is made at the time
the program is designed. It may be difficult to design a scheme which keeps all
processors equally busy.
- Most algorithms suitable for geometric parallelism (e.g. image processing and
some matrix arithmetic) do not work on each data point in isolation, but instead on
clusters of points around each point in turn. Points on the boundaries of the areas
allocated to each processor will need to know the values of data held on other
processors when their turn comes to be computed, and this gives rise to
communication overhead.
- Initial loading of data onto the array of processors may be a significant overhead,
as will be the time to communicate the results of the computation.
(ii) Algorithmic parallelism is when the algorithm is split into several sequential
steps; the results of the first are piped to the second and so on, with each stage
hosted on a separate processor.
- The time taken for each item of initial data to be completely processed is the same
as if it were processed in one stage on a single processor.
- As each stage of computation is completed, the partial result is moved along the
pipeline, allowing another set to be introduced. For an n-stage pipeline (each
doing 1/n of the whole task) one set of results is generated in each cycle, or n
results in the time that a single processor would have taken to generate one set.
- Load balancing is static, since the shape and behaviour of the pipeline is fixed at
the design stage. It may be possible to allocate more than one processor to a
particular stage, taking alternate data values, in order to improve the load
balancing.
(iii) Processor farming subdivides the overall task into many independent sub-tasks
and then uses a controller processor to send each sub-task to a worker processor which
has just completed its work and has thus recently become idle.
- Load balancing is therefore semi-dynamic; each worker is only fed the sub-tasks
which it can handle, and more workers are kept busy than in a poorly scheduled
static scheme.
- Processor farming works well when the ratio of computation to communication
(both of the initial sub-task specifications and of the results) is high. If the ratio
falls, the latency of the controller-worker-controller round trip becomes
significant, and it may be valuable to buffer a spare command packet next to the
worker, ready for it to accept as soon as it finishes the present sub-task.
(b) We want to calculate F(xi) = exp(sqrt(sin(xi*xi))) for x1, x2, x3 … x10 using 5 processors.
(i) Geometric or Partitioned Version
Each Processor performs the complete algorithm i.e. exp(sqrt(sin(xi*xi))) on its own data:
[Diagram: each of the five processors evaluates exp(sqrt(sin(x*x))) - 4 time units per value -
on two of the ten values (x1 and x6 giving F1 and F6, x2 and x7, …, x5 and x10), so the
total time is 8 units.]
i.e. time = 8 units
speedup = 40/8 = 5
(ii) Algorithmic or Pipelined Version
[Diagram: four-stage pipeline y*y → sin(y) → sqrt(y) → exp(y), one time unit per stage,
fed with x1, x2, …; there is nothing for the fifth processor to do.]
F1 is produced in 4 time units
F2 is produced at time 5
i.e. time = 4+(10-1) = 13 units
speedup = 40/13 = 3.1
(iii) As above, a single processor would take 4 time units per value; there are 10 values, hence
40 time units. The geometric version takes only 8 units and the algorithmic version 13 units, so
the speed-ups are 5 and 3.1 respectively.
(iv) If we want to calculate 1000 values, then a single processor would take 4000 units, the
geometric version only 800 units and the algorithmic version 1003 units. The geometric
speed-up is unchanged at 5, while the pipelined version now approaches 4 (its number of
useful stages) because the start-up cost is amortised over many more values.
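A short sketch reproducing these timings (four equal 1-time-unit stages assumed, as in the answer above; the function names are invented for the example):

# Time to evaluate F(x) = exp(sqrt(sin(x*x))) on N values with P = 5 processors.
def geometric(n, procs=5, stages=4):
    return -(-n // procs) * stages        # each processor runs all 4 stages on its n/P values

def algorithmic(n, stages=4):
    return stages + (n - 1)               # pipeline fill, then one result per time unit

for n in (10, 1000):
    single = 4 * n
    print(n, single, geometric(n), algorithmic(n),
          round(single / geometric(n), 2), round(single / algorithmic(n), 2))
# 10:   40 units serial,  8 geometric, 13 algorithmic -> speed-ups 5.0 and 3.08 (≈3.1)
# 1000: 4000 units serial, 800 and 1003               -> speed-ups 5.0 and 3.99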