EE3.cma - Computer Architecture

EE3004 (EE3.cma) - Computer Architecture
Roger Webb
R.Webb@surrey.ac.uk
University of Surrey
http://www.ee.surrey.ac.uk/Personal/R.Webb/l3a15
also link from Teaching/Course page
Introduction
Book List
Computer Architecture - Design & Performance
Barry Wilkinson, Prentice-Hall 1996
(nearest to course)
Advanced Computer Architecture
Richard Y. Kain, Prentice-Hall 1996
(good for multiprocessing + chips + memory)
Computer Architecture
Behrooz Parhami, Oxford Univ Press, 2005
(good for advanced architecture and Basics)
Computer Architecture
Dowsing & Woodhouse
(good for putting the bits together..)
Microprocessors & Microcomputers - Hardware & Software
Ambosio & Lastowski
(good for DRAM, SRAM timing diagrams etc.)
Computer Architecture & Design
Van de Goor
(for basic Computer Architecture)
Wikipedia is as good as anything...!
Introduction
Outline Syllabus
Memory Topics
• Memory Devices
• Interfacing/Graphics
• Virtual Memory
• Caches & Hierarchies
Instruction Sets
• Properties & Characteristics
• Examples
• RISC v CISC
• Pipelining & Concurrency
Parallel Architectures
• Performance Characteristics
• SIMD (vector) processors
• MIMD (message-passing)
• Principles & Algorithms
Computer Architectures - an overview
What are computers used for?
3 ranges of product cover the majority of processor
sales:
• Appliances (consumer electronics)
• Communications Equipment
• Utilities (conventional computer systems)
Computer Architectures - an overview
Consumer Electronics
This category covers a huge range of processor performance
• Micro-controlled appliances
– washing machines, time switches, lamp dimmers
– lower end, characterised by:
• low processing requirements
• microprocessor replaces logic in small package
• low power requirements
• Higher Performance Applications
– Mobile phones, printers, fax machines, cameras, games
consoles, GPS, TV set-top boxes, video/DVD/HD
recorders…...
• High bandwidth - 64-bit data bus
• Low power - to avoid cooling
• Low cost - < $20 for the processor
• Small amounts of software - small cache (tight program loops)
Computer Architectures - an overview
Communications Equipment
has become the major market – WWW, mobile comms
• Main products containing powerful processors are:
– LAN products - bridges, routers, controllers in computers
– ATM exchanges
– Satellite & Cable TV routing and switching
– Telephone networks (all-digital)
• The main characteristics of these devices are:
– Standardised application (IEEE, CCITT etc.) - means
competitive markets
– High bandwidth interconnections
– Wide processor buses - 32 or 64 bits
– Multi-processing (either per-box, or in the distributed
computing sense)
Computer Architectures - an overview
Utilities (Conventional Computer Systems)
Large scale computing devices will, to some extent, be replaced by
greater processing power on the desk-top.
• But some centralised facilities are still required, especially
where data storage is concerned
– General-purpose computer servers; supercomputers
– Database servers - often safer to maintain a central corporate
database
– File and printer servers - again simpler to maintain
– Video on demand servers
• These applications are characterised by huge memory
requirements and:
– Large operating systems
– High sustained performance over wide workload variations
– Scalability - as workload increases
– 64 bit (or greater) data paths, multiprocessing, large caches
Computer Architectures - an overview
Computer System Performance
• Most manufacturers quote performance of their processors in terms of
the peak rate - MIPS (MOPS) or MFLOPS.
• Most of the applications above depend on the continuous supply of
data or results - especially for video images
• Thus critical criterion is the sustained throughput of instructions
– (MPEG image decompression algorithm requires 1 billion
operations per second for full-quality widescreen TV)
– Less demanding VHS quality requires 2.7Mb per second of
compressed data
– Interactive simulations (games etc) must respond to a user input
within 100ms - re-computing and displaying the new image
• Important measures are:
– MIPS per dollar
– MIPS per Watt
Computer Architectures - an overview
User Interactions
Consider how we interact with our computers:

[Chart: % of CPU time spent managing interaction, rising from near 0% towards 100% between 1955 and 2005 as interfaces progress from Lights & Switches, through Punched Card & Tape, Timesharing, Menus & Forms, and WYSIWYG/Mice/Windows, towards Virtual Reality and Cyberspace]

What does a typical CPU do?
70% User interface; I/O processing
20% Network interface; protocols
9%  Operating system; system calls
1%  User application
Computer Architectures - an overview
Sequential Processor Efficiency
The current state-of-the-art in large microprocessors includes:
• 64-bit memory words, using interleaved memory
• Pipelined instructions
• Multiple functional units (integer, floating point, memory
fetch/store)
• 5 GHz practical maximum clock speed
• Multiple processors
• Instruction set organised for simple decoding (RISC?)
However as word length increases, efficiency may drop:
• many operands are small (16 bit is enough for many VR tasks)
• many literals are small - loading 00….00101 as 64 bits is a waste
• may be worth operating on several literals per word in parallel
Computer Architectures - an overview
Example - reducing the number of instructions
Perform a 3D transformation of a point (x,y,z) by multiplying the 4-element
row vector (x,y,z,1) by a 4x4 transformation matrix A. All operands are 16 bits
long.

                | a b c d |
    (x y z 1) * | e f g h |  =  (x' y' z' r)
                | i j k l |
                | m n o p |
Conventionally this requires 20 loads, 16 multiplies, 12 adds and 4 stores, using
16-bit operands on a 16-bit CPU.
On a 64-bit CPU with instructions dealing with groups of four parallel 16-bit
operands, as well as a modest amount of pipelining, all this can take just 7
processor cycles.
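As a sketch (not from the course notes), the conventional per-point sequence looks like this in C; the operation counts in the comments correspond to the figures above.

#include <stdint.h>

/* Transform one point: 20 loads (4 vector + 16 matrix elements),
   16 multiplies, 12 adds and 4 stores, all on 16-bit operands.   */
void transform_point(const int16_t p[4], const int16_t A[4][4], int16_t out[4])
{
    for (int j = 0; j < 4; j++) {
        int32_t acc = 0;
        for (int i = 0; i < 4; i++)
            acc += (int32_t)p[i] * A[i][j];   /* 4 multiplies, 3 adds per element */
        out[j] = (int16_t)acc;                /* scaling/truncation ignored in this sketch */
    }
}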
Computer Architectures - an overview
The Effect of Processor Intercommunication Latency
In a multiprocessor, and even in a uniprocessor, the delays associated with
communicating and fetching data (latency) can dominate the processing times.
Consider:
[Diagram: a symmetrical multiprocessor - several CPUs and memories joined by an interconnection network - contrasted with a uniprocessor whose CPU reaches its memory through a cache]
Delays can be minimised by placing components closer together and:
• Add caches to provide local data storage
• Hide latency by multi-tasking - needs fast context switching
• Interleave streams of independent instructions - scheduling
• Run groups of independent instructions together (each ending with long latency
instruction)
Computer Architectures - an overview
Memory Efficiency
Quote from the 1980s: "Memory is free"
By the 2000s the cost per bit is no longer falling so fast and the
consumer electronics market is becoming cost sensitive
Renewed interest in compact instruction sets and data
compactness - both from the 1960s and 1970s:
1977 - £3000/Mb
1994 - £4/Mb
Now - <1p/Mb
Instruction Compactness
RISC CPUs have a simple register-based instruction encoding
• Can lead to code bloat - as can poor coding and compiler design
• Compactness gets worse as the word size increases
e.g. the INMOS (1980s) transputer had a stack-based register scheme
• needed 60% of the code of an equivalent register-based CPU
• led to smaller cache needs for instruction fetches & data
Computer Architectures - an overview
Cache Efficiency
• Designer should aim to optimise the instruction
performance whilst using the smallest cache possible
• Hiding latency (using parallelism & instruction
scheduling) is an effective alternative to minimising it
(by using large caches)
• Instruction scheduling can initiate cache pre-fetches
• Switch to another thread if the cache is not ready to
supply data for the current one
• In video and audio processing, especially, unroll the
inner code loops – loop unrolling (more on that later)
Computer Architectures - an overview
Predictable Codes
In many applications (e.g. video and audio processing) much is
known about the code which will be executed. Techniques
which are suitable for these circumstances include:
• Partition the cache separately for code and different data
structures
• The cache requirements of the inner code loops can be predetermined, so cache usage can be optimised
• Control the amounts of a data structure which are cached
• Prevent interference between threads by careful scheduling
• Notice that a conventional cache’s contents are destroyed by a
single block copy instruction
Computer Architectures - an overview
Processor Engineering Issues
• Power consumption must be minimised (to simplify on-chip and in-box cooling issues)
– Use low-voltage processors (2V instead of 3.3V)
– Don’t over-clock the processor
– Design logic carefully to avoid propagation of redundant signals
– Tolerance of latency allows lower performance (cheaper)
subsystems to be used
– Explicit subsystem control allows subsystems to be powered down
when not in use
– Eliminate redundant actions - eg speculative pre-fetching
– Provide non-busy synchronisation to avoid the need for spin-locks
• Battery design is advancing slowly - power stored per unit weight or
volume will quadruple (over NiCd) within 5-10 years
Computer Architectures - an overview
Processor Engineering Issues
• Speed to market is becoming more important, so processor design time is becoming
critical. Consider the time for several common devices to
become established:
– Telephone (0% to 60% of households): 70 years
– Cable Television: 40 years
– Personal Computer: 20 years
– Video Recorders: 10 years
– Web-based video: <10 years
• Modularity and common processor cores provide design
flexibility
– reusable cache and CPU cores
– product-specific interfaces and co-processors
– common connection schemes
Computer Architectures - an overview
Interconnect Schemes
Wide data buses are a problem:
• They are difficult to route on printed circuit boards
• They require huge numbers of processor and memory pins
(expensive to manufacture on chips and PCBs)
• Clocking must accommodate the slowest bus wire.
• Parallel back-planes add to loading and capacitance,
slowing signals further and increasing power consumption
Serial chip interconnects offer 1Gbit/s performance using just a
few pins and wires. Can we use a packet routing chip as a
back-plane?
• Processors, memories, graphic devices, networks, slow
external interfaces all joined to a central switch
Memory Devices
Regardless of the scale of the computer, the memory is similar.
Two major types:
• Static
• Dynamic
Larger memories get cheaper as production increases and
smaller memories get more expensive - you pay more for
less!
See:
http://www.educypedia.be/computer/memoryram.htm
http://www.kingston.com/tools/umg/default.asp
http://www.ahinc.com/hhmemory.htm
Memory Devices
Static Memories
• made from static logic elements - an array of flip-flops
• don’t lose their stored contents until clocked again
• may be driven as slowly as needed - useful for single
stepping a processor
• Any location may be read or written independently
• Reading does not require a re-write afterwards
• Writing data does not require the row containing it to be
pre-read
• No housekeeping actions are needed
• The address lines are usually all supplied at the same time
• Fast - 15ns was possible in Bipolar and 4-15ns in CMOS
(Bipolar is not used any more – too much power for too little gain in speed)
Memory Devices
[Block diagram: HM6264 8K*8 static RAM organisation - a 256x256 memory matrix with row and column decoders and column I/O, 13 address inputs (A0-A12) applied in parallel, data lines I/O0-I/O7, chip selects CS1 and CS2, write enable WE, output enable OE, and a timing pulse generator / read-write control block]
Memory Devices
HM6264 read cycle

[Timing diagram: HM6264 read cycle - Address, CS1, CS2, OE and Dout waveforms showing the parameters tabulated below; Dout becomes valid tAA after the address changes and is held for tOH after the next address change]

Item                                          Symbol   min   max   Unit
Read Cycle Time                               tRC      100   -     ns
Address Access Time                           tAA      -     100   ns
Chip Selection to Output (CS1)                tCO1     -     100   ns
Chip Selection to Output (CS2)                tCO2     -     100   ns
Output Enable to Output Valid                 tOE      -     50    ns
Chip Selection to Output in Low Z (CS1)       tLZ1     10    -     ns
Chip Selection to Output in Low Z (CS2)       tLZ2     10    -     ns
Output Enable to Output in Low Z              tOLZ     5     -     ns
Chip Deselection to Output in High Z (CS1)    tHZ1     0     35    ns
Chip Deselection to Output in High Z (CS2)    tHZ2     0     35    ns
Output Disable to Output in High Z            tOHZ     0     35    ns
Output Hold from Address Change               tOH      10    -     ns
Memory Devices
HM6264 write cycle

[Timing diagram: HM6264 write cycle - Address, CS1, CS2, OE, WE, Din and Dout waveforms showing the parameters tabulated below, with an annotation marking the point at which the data is sampled by the memory]

Item                                Symbol   min   max   Unit
Write Cycle Time                    tWC      100   -     ns
Chip Selection to End of Write      tCW      80    -     ns
Address Set-up Time                 tAS      0     -     ns
Address Valid to End of Write       tAW      80    -     ns
Write Pulse Width                   tWP      60    -     ns
Write Recovery Time (CS1, WE)       tWR1     5     -     ns
Write Recovery Time (CS2)           tWR2     15    -     ns
Write to Output in High Z           tWHZ     0     35    ns
Data to Write Time Overlap          tDW      40    -     ns
Data Hold from Write Time           tDH      0     -     ns
Output Enable to Output in High Z   tOHZ     0     35    ns
Output Active from End of Write     tOW      5     -     ns
Memory Devices
Dynamic Memories
• information stored on a capacitor - discharges with time
• Only one transistor required per cell - 6 for SRAM
• must be refreshed (0.1-0.01 pF needs refresh every 2-8ms)
• memory cells are organised so that cells can be refreshed a
row at a time to minimise the time taken
• row and column organisation lends itself to multiplexed row
and column addresses - fewer pins on chip
• Use RAS and CAS to latch row and column addresses
sequentially
• DRAM consumes high currents when switching transistors
(1024 columns at a time). Can cause nasty voltage transients
Memory Devices
[Block diagram: HM50464 64K*4 dynamic RAM organisation - multiplexed address inputs Ai are latched by the RAS and CAS clocks into the X (row) and Y (column) decoders, which select within four 256x256 memory arrays; a refresh address counter supplies row addresses for refresh; input and output buffers under WE and OE control connect the arrays to the I/O1-4 pins. Each dynamic memory cell is a single transistor and capacitor on a bit line, selected by a row-select line]
Memory Devices
HM50464 read cycle

[Timing diagram: HM50464 (64K*4 DRAM) read cycle - RAS falls with the row address present, CAS falls with the column address, and valid output appears on IO while OE is active; WRITE stays high]
Read Cycle
Dynamic memory read operation is as follows:
• The memory read cycle starts by setting all bit lines (columns) to a
suitable sense voltage - pre-charging
• The required row address is applied and RAS (row address strobe) is asserted
• The selected row is decoded and opens its transistors (one per column). This
dumps their capacitors' charge into high-gain feedback amplifiers, which
recharge the capacitors - RAS must remain low
• Simultaneously the column address is applied and CAS is set. The decoded and
requested bits are gated to the output - and go to the outside when OE is active
Memory Devices
HM50464 write cycle

[Timing diagram: HM50464 early write cycle - RAS and CAS latch the row and column addresses, WRITE falls before CAS, and valid input data is latched from IO]
Early Write Cycle
Similar to the read cycle, except that the falling edge of WRITE signals the time
to latch the input data.
During the "Early Write" cycle the WRITE line falls before CAS - this ensures
that the memory device keeps its data outputs disabled (otherwise when CAS
goes low they could output data!)
Alternatively, in a "Late Write" cycle the sequence is reversed and the OE
line is kept high - this can be useful in common address/data bus
architectures
Memory Devices
HM50464 - 64K*4 dynamic RAM organisation
Refresh Cycle
For a refresh no output is needed. A read, with a valid RAS and row
address, pulls the data out; all we need to do is put it back again by de-asserting RAS.
This needs to be repeated for all 256 rows (on the HM50464) every 4ms.
There is an on chip counter which can be used to generate refresh
addresses.
Page Mode Access [“Fast Page Mode DRAM”] – standard DRAM
The RAS cycle time is relatively long so optimisations have been made
for common access patterns
The row address is supplied just once and latched with RAS. Then column
addresses are supplied and latched using CAS, and data is read or written using
OE or WRITE. CAS and the column address can then be cycled to access bits in the
same row. The cycle ends when RAS goes high again.
Care must be taken to continue to refresh the other rows of memory at the
specified rate if needed
Memory Devices
HM50464 page mode access

[Timing diagram: page-mode DRAM access - a single RAS and row address followed by several CAS/column-address cycles, each returning data; nibble and static column mode are similar]
Nibble Mode
Rather than supplying the second and subsequent column addresses, they
can be calculated by incrementing the initial address - the first column
address is stored in a register when CAS goes low, then incremented and
used on the next low CAS transition - less common than Page Mode.
Static Column Mode
Column addresses are treated statically and when CAS is low the outputs
are read if OE is low as well. If the column address changes the outputs
change (after a propagation delay). The frequency of address changes
can be higher as there is no need to have an inactive CAS time
Memory Devices
HM50464 extended data out access

[Timing diagram: Extended Data Out DRAM access - a single RAS and row address, several CAS/column-address cycles, with the data held on IO under the control of OE]
Extended Data Out Mode ("EDO DRAM")
EDO DRAM is very similar to page mode access, except that the data bus
outputs are controlled exclusively by the OE line. CAS can therefore be
taken high and low again without data from the previous word being
removed from the data bus - so data can be latched by the processor whilst a
new column address is being latched by the memory. Overall cycle times
can be shortened.
Memory Devices
HM50464 - 64K*4 dynamic RAM organisation
[Timing diagram: simplified SDRAM burst read access - on successive clock edges the commands Activate (with the row and bank address), Read (with the column number, 3-cycle latency) and Precharge are issued, interleaved with NOPs, and a 4-word read burst D0-D3 is returned before the next row is activated]
Synchronous DRAM ("SDRAM")
Instead of asynchronous control signals, SDRAMs accept one command
in each cycle. The different stages of an access are initiated by separate
commands - initial row address, reading etc. - all pipelined, so that a read
might not return a word for 2 or 3 cycles.
Bursts of accesses to sequential words within a row may be requested by
issuing a burst-length command. Then, subsequent read or write
requests operate in units of the burst length.
Memory Devices
Summary DRAMs
• A whole row of the memory array must be read
• After reading the data must be re-written
• Writing requires the data to be read first (whole row has to be
stored if only a few bits are changed)
• Cycle time a lot slower than static RAM
• Address lines are multiplexed - saves package pin count
• Fastest DRAM commonly available has access time of
~60ns but a cycle time of 121ns
• DRAMs consume more current
• SDRAMS replace the asynchronous control mechanisms
Cycles required
Memory Type         Word 1   Word 2   Word 3   Word 4
DRAM                5        5        5        5
Page-Mode DRAM      5        3        3        3
EDO DRAM            5        2        2        2
SDRAM               5        1        1        1
SRAM                2        1        1        1
Memory Interfacing
Interfacing
Most processors rely on external memory
The unit of access is a word carried along the Data Bus
Ignoring caching and virtual memory, all memory belongs to
a single address space.
Addresses are passed on the Address Bus
Hardware devices may respond to particular addresses - memory-mapped devices
External memory is a collection of memory chips.
All memory devices are joined to the same data bus
Main purpose of the addressing logic is to ensure only one
memory device is activated during each cycle
Memory Interfacing
Interfacing
The Data Bus has n lines - n = 8, 16, 32 or 64
The Address Bus has m lines - m = 16, 20, 24, 32 or 64
providing 2^m words of memory
The Address Bus is used at the beginning of a cycle and the
Data Bus at the end
It is therefore possible to multiplex (in time) the two buses
This can create all sorts of timing complications - but the benefit of a
reduced processor pin count makes it relatively common
Processor must tell memory subsystem what to do and when
to do it
Can do this either synchronously or asynchronously
Memory Interfacing
Interfacing
synchronously
• processor defines the duration of a memory cycle
• provides control lines for begin and end of cycle
• most conventional
• the durations and relationships might be determined at boot
time (available in 1980’s in the INMOS transputer)
asynchronously
• processor starts cycle, memory signals end of cycle
• Error recovery is needed - if non-existent memory is
accessed (Bus Error)
Memory Interfacing
Interfacing
synchronous memory scheme control signals
• Memory system active
– goes active when the processor is accessing external
memory.
– Used to enable the address decoding logic
• provides one active chip select to a group of chips
• Read Memory
– says the processor is not driving the data bus
– selected memory can return data to the data bus
– usually connected to the output enable (OE) of memory
Memory Interfacing
Interfacing
synchronous memory scheme control signals (cont’d)
• Memory Write
– indicates data bus contains data which selected memory
device should store
– different processors use leading or trailing edges of signal
to latch data into memory
– Processors with data bus wider than 8 bits have separate
memory write byte signal for each byte of data
– Memory write lines connected to write lines of memories
• Address Latch Enable (in multiplexed address machines)
– tells the addressing logic when to take a copy of the address
from multiplexed bus so processor can use it for data later
• Memory Wait
– causes processor to extend memory cycle
– allows fast and slow memories to be used together without
loss of speed
Memory Interfacing
Address Blocks
How do we place blocks of memory within the address space
of our processor?
Two methods of addressing memory:
• Byte addressing
– each byte has its own address
– good for 8-bit microprocessors and graphics systems
– if memory is 16 or 32 bits wide?
• Word addressing
– only address lines which number individual words
– select multi-byte word
– extra byte address bits retained in processor to
manipulate individual byte
– or use write byte control signals
Memory Interfacing
Address Blocks
How do we place blocks of memory within the address space
of our processor?
Often want different blocks of memory:
• Particular addresses might be special:
– memory mapped I/O ports
– location executed first after a reset
– fast on-chip memory
– diagnostic or test locations
• Also want
– SRAM and/or DRAM in one contiguous block
– memory mapped graphics screen memory
– ROM for booting and low level system operation
– extra locations for peripheral controller registers
Memory Interfacing
Address Blocks
How do we place blocks of memory within the address space
of our processor?
• Each memory block might be built from individual
memory chips
– address and control lines wired in parallel
– data lines brought out separately to provide n bit word
• Fit all the blocks together in overall address map
– easier to place similar-sized blocks next to each other
so that they can be combined to produce a 2^(k+1)-word area
– jumbling blocks of various sizes complicates address
decoding
– if contiguous blocks are not needed, place them at
major power of 2 boundaries - eg put base of SRAM at 0,
ROM half way up, lowest memory mapped peripheral at 7/8ths
Memory Interfacing
Address Decoding
address decoding logic determines which memory device to
enable depending upon address
• if each memory area stores contiguous words of a 2^k-word block
– all memory devices in that area will have k address
lines
– connected (normally) to the k least-significant lines
– the remaining m-k lines are examined to see if they provide the most-significant (remaining) part of the address of each area
3 schemes possible
– Full decoding - unique decoding
• All m-k bits are compared with exact values to make up
full address of that block
• only one block can become active
Memory Interfacing
Address Decoding
3 schemes possible (cont’d)
– Partial decoding
• only decode some of m-k lines so that a number of
blocks of addresses will cause a particular chip select to
become active
• eg ignoring one line will mean the same memory device
will be accessible at two places in the memory map
• makes decoding simpler
– Non-unique decoding
• connect different one of m-k lines directly to active low
chip select of each memory block
• can activate memory block by referencing that line
• No extra logic needed
• BUT can access 2 blocks at once this way…...
Memory Interfacing
Address Decoding - Example
A processor has a 32-bit data bus. It also provides a separate
30-bit word addressed address bus, which is labelled A2 to
A31 since it refers to memory initially using byte addressing,
where it uses A0 and A1 as byte addressing bits. It is desired
to connect 2 banks of SRAM (each built up from 128K*8
devices) and one bank of DRAM, built from 1M*4 devices,
to this processor. The SRAM banks should start at the
bottom of the address map, and the DRAM bank should be
contiguous with the SRAM. Specify the address map and
design the decoding logic.
Memory Interfacing
Address Decoding - Example
Each bank of SRAMs will require 4 devices to make up the 32 bit data bus.
Each Bank of DRAMs will require 8 devices.
Address map (word addresses):
00040000 - 0013FFFF   DRAM bank   - 1M words (20 bits), eight 1M*4 devices across the 32-bit bus
00020000 - 0003FFFF   SRAM bank 1 - 128k words (17 bits), four 128K*8 devices
00000000 - 0001FFFF   SRAM bank 0 - 128k words (17 bits), four 128K*8 devices
Memory Interfacing
Address Decoding - Example
[Diagram: the CPU drives 17 address lines in parallel to all devices in each SRAM bank (four 128k*8 devices, 8 data lines to each) and 20 address lines in parallel to all devices in the DRAM bank (eight 1M*4 devices, 4 data lines to each)]

CS1 connects to chip select on SRAM bank 0
CS2 connects to chip select on SRAM bank 1
CS3 connects to chip select on DRAM bank

CS1 = /A19 * /A20 * /A21 * /A22
CS2 =  A19 * /A20 * /A21 * /A22
CS3 =  A20 + A21 + A22
(where /A denotes the complement of A; address lines A23 and above are omitted to simplify)
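A minimal C sketch of the same decode, assuming the equations above (the function and type names are illustrative, not part of the course material):

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cs1, cs2, cs3; } chip_selects_t;

/* Derive the three chip selects from a 32-bit byte address,
   ignoring A23 and above as the slide does.                  */
chip_selects_t decode(uint32_t byte_addr)
{
    bool a19 = (byte_addr >> 19) & 1;
    bool a20 = (byte_addr >> 20) & 1;
    bool a21 = (byte_addr >> 21) & 1;
    bool a22 = (byte_addr >> 22) & 1;

    chip_selects_t cs;
    cs.cs1 = !a19 && !a20 && !a21 && !a22;  /* SRAM bank 0                 */
    cs.cs2 =  a19 && !a20 && !a21 && !a22;  /* SRAM bank 1                 */
    cs.cs3 =  a20 ||  a21 ||  a22;          /* DRAM bank (partial decode)  */
    return cs;
}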
Memory Interfacing
Connecting Multiplexed Address and Data Buses
There are many multiplexing schemes but let’s choose 3
processor types and 2 memory types and look at the
possible interconnections:
• Processor types all 8-bit data and 16 bit address:
– No multiplexing - (eg Zilog Z80)
– multiplexes the least significant address bits with the data bus
(Intel 8085)
– multiplexes the most significant and least significant
halves of address bus
• Memory types:
– SRAM (8k *8) - no address multiplexing
– DRAM (16k*4) - with multiplexed address inputs
Memory Interfacing
CPU vs Static Memory Configuration
[Diagram: non-multiplexed address bus - the CPU's A0...15 address lines feed the address decode logic (producing CS) and A0...12 plus D0...7 connect directly to an 8k*8 SRAM]
Memory Interfacing
CPU vs Static Memory Configuration
[Diagram: CPU with the LS address byte multiplexed onto the data bus - the multiplexed AD0...7 lines are latched to recover the low address bits, which together with A8...15 drive the address decode (CS) and the 8k*8 SRAM's A0...12; D0...7 carries the data]
Memory Interfacing
CPU vs Static Memory Configuration
[Diagram: CPU with a time-multiplexed address bus - MA0...7 is latched to rebuild the full address for the address decode (CS) and the 8k*8 SRAM's A0...12, while D0...7 is a separate data bus]
Memory Interfacing
CPU vs Dynamic Memory Configuration
[Diagram: non-multiplexed CPU address bus driving dynamic memory - the address decode generates RAS and CAS, an external multiplexer (MPX) presents MA0...6 (row then column) to two 16k*4 DRAMs, which supply D0...3 and D4...7 of the data bus]
Memory Interfacing
CPU vs Dynamic Memory Configuration
[Diagram: CPU with LS addresses multiplexed with the data bus driving dynamic memory - AD0...7 is latched, the address decode generates RAS and CAS, and the MPX supplies MA0...6 to two 16k*4 DRAMs providing D0...3 and D4...7]
Memory Interfacing
CPU vs Dynamic Memory Configuration
[Diagram: CPU with a time-multiplexed address bus driving dynamic memory - MA0...7 maps almost directly onto the DRAMs' multiplexed MA0...6 inputs, with the address decode generating RAS and CAS for two 16k*4 DRAMs providing D0...3 and D4...7]
Displays
Video Display Characteristics
• Consider a video display capable of producing 640*240 pixel
monochrome, non-interlaced images at a frame rate of 50Hz:
[Diagram: displayed image of h x v pixels; add 20% to each dimension for line and frame flyback]

dot rate = (640*1.2) * (240*1.2) * 50 Hz = 11MHz, i.e. 90 ns/pixel

For a 1024*800 non-interlaced display:
dot rate = (1024*1.2) * (800*1.2) * 50 Hz = 65MHz, i.e. 15 ns/pixel

Add colour with 64 levels for each of R, G and B - 18 bits per pixel:
the bandwidth is now about 1180 Mbit/s...
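A few lines of C reproduce the first of these calculations (a sketch, not course material):

#include <stdio.h>

int main(void)
{
    double h = 640, v = 240, frame_rate = 50;            /* visible pixels */
    /* 20% added to each axis for line and frame flyback */
    double dot_rate = (h * 1.2) * (v * 1.2) * frame_rate;
    printf("dot rate = %.1f MHz, pixel time = %.0f ns\n",
           dot_rate / 1e6, 1e9 / dot_rate);              /* ~11 MHz, ~90 ns */
    return 0;
}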
Displays
Video Display Characteristics
• Problems with high bit rates:
– Memory mapping of the screen display within the processor map
couples CPU and display tightly - design together
– In order that screen display may be refreshed at the rates required
by video, the display must have higher priority than the processor for
DMA of memory bus - uses much of bandwidth
– In order to update the image the CPU may require very fast
access to screen memory too
– Megabytes of memory needed for large screen displays are still
relatively expensive - compared with CPU etc.
Displays
Bit-Mapped Displays
• Even 640*240 pixel display cannot be easily maintained
using DMA access to CPU’s RAM - except with multiple
word access
• Increase memory bandwidth for video display with special
video DRAM
– allows whole row of DRAM (256 or 1024 bits) in one
DMA access
• Many video DRAMs may be mapped to provide a single
bit of a multi-bit pixel in parallel - colour displays.
Displays
Character Based Displays
• limited to displaying one of a small number of images in fixed positions
– typically 24 lines of 80 characters
– normally 8-bit ASCII
• Character value is used to determine the image from a look-up table
– table often in ROM (RAM version allows font changes)
• For a character 9 dots wide by 14 high
– 14 rows of pixels are generated for each row of characters
– In order to display a complete frame, pixels are drawn at a suitable dot rate:

dot rate = (80*9*1.2) * (24*14*1.2) * 50 Hz = 17.28 MHz = 58 ns/pixel
Displays
Character Based Displays
• A row of 80 characters must be read for every displayed line
– giving a line rate of 20.16kHz (similar to EGA standard)
– overall memory access rate needed ~1.6Mbytes/second (625ns/byte)
– barely supportable using DMA on small computers
– even at 4 bytes at a time (32-bit machines) still a major use of the data bus
• To avoid re-reading the row of 80 characters for the other 13 lines, the characters
can be stored in a circular shift register on the first access and used instead
of memory accesses.
– only need 80*24*50 accesses/sec - in bursts
– 167ms per byte - easily supported
– the whole 80 bytes can be read during flyback before start of new character
row at full memory speed in one DMA burst - 80 * about 200ns at a rate of
24*50 times a second - less than 2% of bus bandwidth.
Displays
Character Based Displays
• Assuming that rows of 80 characters in the CPU’s memory map are
stored at 128-byte boundaries (simplifies addressing) the CPU memory
addresses are:
CPU memory address:
[ address of screen memory : n-12 bits | row : 5 bits (0...23) | column : 7 bits (0...79) ]
(the upper bits go to the address decode)

Address of character positions on the screen:
[ row : 5 bits (0...23) | line number in row : 4 bits (0...13) | column : 7 bits (0...79) | dot number across char : 4 bits (0...8) ]
The row and column fields address the screen memory, the row and line-number fields address the look-up table memory, and the dot number is the address of the current bit in the shift register; each counter carries into the next.
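A minimal sketch (an illustration, not from the slides) of how the 128-byte row spacing turns a (row, column) position into a CPU byte address; the base address is a hypothetical value:

#include <stdint.h>

#define SCREEN_BASE 0x000A0000u    /* hypothetical base of screen memory */

/* row 0..23, column 0..79: the 5-bit row field sits above the 7-bit
   column field, so each character row starts 128 bytes after the last. */
uint32_t char_address(unsigned row, unsigned column)
{
    return SCREEN_BASE + (row << 7) + column;
}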
Displays
Character Based Displays
• An appropriate block diagram of the display would be:
[Block diagram: a 12-bit (row, column) screen address reads ASCII bytes from the screen memory; each byte, together with the 4-bit line number within the character, addresses a character generator ROM of (16*256)*9 bits; the 9-bit dot pattern is loaded into a 9-to-1-bit shift register clocked at the dot rate, with an 80*8-bit FIFO holding the current character row; the shift register output is the video data out]
Displays
Character Based Displays
• The problem with DMA fetching individual characters from display
memory is its interference with processor.
• Alternative is to use Dual Port Memories
Dual Port SRAMs
• provide 2 (or more) separate data and address pathways to each memory cell
• 100% of the memory bandwidth can be used by the display without affecting the CPU
• Can be expensive - ~£25 for 4kbytes - makes Mb displays impractical, but for a
character-based display it would be OK
[Diagram: dual-port SRAM - a single memory array with two independent ports, each with its own address decode (row, col from Ao...An), data lines Do..Dn, chip enable, write and output enable signals]
Bit-Mapped Graphics & Memory Interleaving
Bit-Mapped Displays
• Instead of using an intermediate character generator can store all pixel
information in screen memory at pixel rates above.
• Even 640*240 pixel display cannot be maintained using DMA access to
CPU’s RAM - except with multiple word access
• Increase memory bandwidth for video display with special video
DRAM
– allows whole row of DRAM (256 or 1024 bits) in one DMA access
• Many video DRAMs may be mapped to provide a single bit of a
multi-bit pixel in parallel - colour displays.
• Use of the video shift register limits the clocking frequency to 25MHz - 40ns/pixel
Graphics Card consists of:
GPU – Graphics Processing Unit
microprocessor optimized for 3D graphics rendering
clock rate 250-850MHz with pipelining – converts 3D
images of vertices and lines into 2D pixel image
Video BIOS – program to operate card and interface timings etc.
Video Memory – can use computer RAM, but more often has its
own VideoRAM (128MB - 2GB) – often multiport VRAM,
now DDR (double data rate – uses both the rising and falling edges
of the clock)
RAMDAC – Random Access Memory Digital-to-Analog Converter – drives the CRT
Bit Mapped Graphics & Memory Interleaving
Using Video DRAMs
• To generate analogue signals for a colour display
– 3 fast DAC devices are needed
– each fed from 6 or 8 bits of data
– one each for red, green and blue video inputs
• To save storing so much data per pixel (24 bits) a Colour Look Up
Table (CLUT) device can be used.
– uses a small RAM as a look-up table
– E.g. a 256 entry table accessed by 8-bit values stored for each pixel - the
table contains 18 or 24 bits used to drive DACs
– Hence “256 colours may be displayed from a palette of 262144”
[Diagram: CLUT - 8-bit pixel data addresses a small RAM look-up table (which can also be updated with new palette data); the RAM output supplies 6 bits to each of three DACs for the red, green and blue outputs]
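The CLUT idea can be sketched in C as a simple table lookup (illustrative only; the real device is the RAM and DACs in the diagram):

#include <stdint.h>

typedef struct { uint8_t r, g, b; } rgb6_t;   /* 6 significant bits per colour */

static rgb6_t clut[256];                      /* 256 colours from a palette of 2^18 */

/* The 8-bit pixel value selects one palette entry; its three 6-bit
   fields would drive the red, green and blue DACs.                 */
rgb6_t lookup_pixel(uint8_t pixel)
{
    return clut[pixel];
}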
Bit Mapped Graphics & Memory Interleaving
Using Video DRAMs
• Addressing Considerations
– if the number of bits in the shift registers is not the same as the
number of displayed pixels, it is easier to ignore the extra ones -
wasting some memory may make the addressing simpler
– making the processor's screen memory bigger than the displayable memory gives a
scrollable virtual window.
[Address layout: screen block select (remaining bits) | row address (log2 v bits, 0...v-1) | column address (log2 h bits, 0...h-1) - not all combinations are used]
– Even though most 32 bit processors can access individual bytes
(used as pixels) this is not as efficient as accessing memory in word
(32bits) units
Bit Mapped Graphics & Memory Interleaving
Addressing Considerations (cont’d)
– Sometimes it might be better NOT to arrange the displayed pixels in
ascending memory address order:
• Each word defines 4 horizontally neighbouring pixels, each with its colour
fully specified - the most simple and common representation
• Each word defines a 2x2 block of pixels (2 horizontally and 2 vertically) with all
colour data. Useful for text or graphics applications where small rectangular
blocks are modified - might access fewer words for changes
• Each word defines one bit of 32 horizontally neighbouring pixels. 8 words (in 8
separate colour planes) need to be changed to completely change any pixel.
Useful for adding or moving blocks of solid colour - CAD
Bit Mapped Graphics & Memory Interleaving
Addressing Considerations (cont’d)
• The video memories must now be arranged so that the bits
within the CPU’s 32-bit words can all be read or written to
their relevant locations in video memory in parallel.
– this is done by making sure that the pixels stored in each
neighbouring 32-bit word are stored in different memory
chips - interleaving
Bit Mapped Graphics & Memory Interleaving
Example
Design a 1024*512 pixel colour display capable of passing 8
bits per pixel to a CLUT. Use a video frame rate of 60Hz
and use video DRAMs with a shift register maximum
clocking frequency of 25MHz. Produce a solution that
supports a processor with an 8-bit data bus.
Bit Mapped Graphics & Memory Interleaving
Example
• 1024 pixels across the screen can be satisfied using one 1024-bit shift
register (or 4 multiplexed 256-bit ones)
• The frame rate is 60Hz
• The number of lines displayed is 512
• The line rate becomes 60*512 = 30.72kHz - or 32.55μs/line
• 1024 pixels gives a dot rate of 30.72*1024 = 31.46MHz
• Dot time is thus 32ns - too fast for one shift register! So we will have
to interleave 2 or more.
• Multiplexing the minimum 3 shift registers will make addressing
complicated, easier to use 4 VRAMs - each with 256 rows of 256
columns, addressed row/column intersection containing 4 bits
interfaced by 4 pins to the processor and to 4 separate shift registers
Bit Mapped Graphics & Memory Interleaving
Example
• Hence for 8 bit CPU:
CPU memory
address
(BYTE address)
Video address
(pixel counters)
n-20 bits
0..512
9 bits
screen block select
512 rows
1 bit
0..512
8 bits
to RAS address i/p
to top/bottom
multiplexers
3/16/2016
0..1023
10 bits
1024 columns
0..1023
8 bits
1 bit
1 bit
implicit address
of bits in cascaded
shift registers
Which VRAM?
A+B, C+D
E+F, G+H
EE3.cma - Computer Architecture
to pixel
multiplexer
(odd/even pixels)
74
Example (cont'd)

[Diagram (labelled with the LHS and RHS of the screen): eight 256*256*4 VRAMs, A-H. A+B and C+D hold the top and bottom 256 lines of the even pixels (8 bits per pixel from each pair), while E+F and G+H hold the top and bottom 256 lines of the odd pixels. Top/bottom select multiplexers and an odd/even pixel multiplexer interleave the 8-bit outputs into a single stream of 8 bits for all pixels, which feeds the CLUT driving the R, G and B outputs]
Mass Memory Concepts
Disk technology
• unchanged for 50 years
• similar for CD, DVD
• 1-12 platters
• 3600-10000 rpm
• double sided
• circular tracks
• subdivided into sectors
• recording density >3Gb/cm2
• innermost tracks not used – cannot be used efficiently
• inner tracks a factor of 2 shorter than outer tracks
• hence more sectors in outer tracks
• cylinder – tracks with the same diameter on all recording surfaces
Mass Memory Concepts
Access Time
• Seek time
– align head with
cylinder containing
track with sector inside
• Rotational Latency
– time for disk to rotate to
beginning of sector
• Data Transfer time
– time for sector to pass under head
Disk Capacity
= surfaces x tracks/surface x sectors/track x bytes/sector
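A small C sketch applying the capacity formula and the average rotational latency used later; the drive parameters below are placeholders, not data for any real disc:

#include <stdio.h>

int main(void)
{
    long long surfaces = 8, tracks = 20000, sectors = 600, bytes = 512;  /* placeholders */
    long long capacity = surfaces * tracks * sectors * bytes;

    double rpm = 7200.0;                                 /* placeholder */
    double avg_rot_latency_ms = 0.5 * 60000.0 / rpm;     /* half a rotation */

    printf("capacity = %lld bytes\n", capacity);
    printf("avg rotational latency = %.2f ms\n", avg_rot_latency_ms);
    return 0;
}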
Key Attributes of Example Discs

                                             Seagate        Hitachi        IBM
Identity     Series                          Barracuda      DK23DA         Microdrive
             Model Number                    ST1181677LW    ATA-5 40       DSCM-11000
             Typical Application             Desktop        Laptop         Pocket device
Storage      Formatted Capacity GB           180            40             1
attributes   Recording surfaces              24             4              2
             Cylinders                       24,247         33,067         7,167
             Sector size B                   512            512            512
             Avg sectors/track               604            591            140
             Max recording density Gb/cm2    2.4            5.1            2.4
Access       Min seek time ms                1              3              1
attributes   Max seek time ms                17             25             19
             External data rate MB/s         160            100            13
Physical     Diameter, inches                3.5            2.5            1
attributes   Platters                        12             2              1
             Rotation speed rpm              7200           4200           3600
             Weight kg                       1.04           0.10           0.04
             Operating power W               14.1           2.3            0.8
             Idle power W                    10.3           0.7            0.5
Key Attributes of Example Discs
Samsung launch a 1TB hard drive:
3 x 3.5" platters
334GB per platter
7200 RPM
32MB cache
3Gb/s SATA interface
(SATA – Serial Advanced Technology Attachment)
Highest density so far....
Mass Memory Concepts
Disk Organization
Data bits are small regions of magnetic coating magnetized in different
directions to give 0 or 1
Special encoding techniques maximize the storage density
e.g. rather than letting the data bit values dictate the direction of magnetization, the
medium can be magnetized according to changes of bit value – non-return-to-zero (NRZ) –
which allows a doubling of recording capacity
Mass Memory Concepts
Disk Organization
• Each sector is preceded by a sector number and followed by a cyclic redundancy check,
which allows some errors and anomalies to be corrected
• Various gaps within and separating sectors allow processing to finish
• Unit of transfer is a sector – typically 512 to 2K bytes
• Sector address consists of 3 components:
– Disk address (17-31 bits) = Cylinder# (10-16 bits), Track# (1-5 bits), Sector# (6-10 bits)
– Cylinder# - actuator arm position
– Track# - selects read/write head or surface
– Sector# - compared with the sector number recorded as it passes
• Sectors are independent and can be arranged in any logical order
• Each sector needs some time to be processed – some sectors may pass before disk
is ready to read again, so logical sectors not stored sequentially as physical sectors
track i
0 16 32 48 1 17 33 49 2 18 34 50 3 19 35 51 4 20 36 52 …..
track i+1 30 46 62 15 31 47 0 16 32 48 1 17 33 49 2 18 34 50 3 19….
track i+2 60 13 29 45 61 14 30 46 62 15 31 47 0 16 32 48 1 17 33 49…..
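The idea can be sketched as below, assuming (for illustration only) 64 sectors per track and a 4-way interleave; the slide's numbering also includes a skew between successive tracks, which this sketch ignores:

#include <stdio.h>

#define SECTORS_PER_TRACK 64   /* assumed for this sketch */
#define INTERLEAVE 4           /* assumed interleave factor */

/* logical sector held by physical slot 'slot' on one track */
static int logical_at(int slot)
{
    int group = slot % INTERLEAVE;   /* which pass around the track */
    int step  = slot / INTERLEAVE;   /* position within the pass    */
    return group * (SECTORS_PER_TRACK / INTERLEAVE) + step;
}

int main(void)
{
    for (int slot = 0; slot < 12; slot++)
        printf("%d ", logical_at(slot));   /* prints 0 16 32 48 1 17 33 49 ... */
    printf("...\n");
    return 0;
}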
Mass Memory Concepts
Disk Performance
Disk Access Latency = Seek Time + Rotational Latency
• Seek Time – how far head travels from current cylinder
– mechanical motion – accelerates and brakes
• Rotational Latency – depends upon position
– Average rotational latency = time for half a rotation
– at 10,000 rpm = 3ms
Mass Memory Concepts
RAID - Redundant Array of Inexpensive (Independent) Disks.
• High capacity and faster response without speciality hardware
Mass Memory Concepts
RAID0 – multiple disks appear as a single disk, with each data item spread
across many disks so that several disks access parts of it in parallel
Mass Memory Concepts
RAID1 – robustness added by mirroring contents on duplicate
disks – 100% redundancy
Mass Memory Concepts
RAID2 – robustness using error correcting codes – reducing
redundancy – Hamming codes – ~50% redundancy
Mass Memory Concepts
RAID3 – robustness using separate parity and spare disks –
reducing redundancy to 25%
Mass Memory Concepts
RAID4 – Parity/Checksum applied to sectors instead of bytes –
places a heavy load on the parity disk
Mass Memory Concepts
RAID5 – Parity/Checksum distributed across disks – but 2 disk
failures can cause data loss
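The parity idea behind RAID 3/4/5 can be sketched as an XOR across the data blocks (illustrative, not any particular controller):

#include <stdint.h>
#include <stddef.h>

#define NDATA 4    /* data disks in the stripe (assumed for the sketch) */

/* Parity is the XOR of the data blocks; any one lost block can be rebuilt
   by XORing the surviving data blocks with the parity block.            */
void compute_parity(const uint8_t *data[NDATA], uint8_t *parity, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;
        for (int d = 0; d < NDATA; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}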
Mass Memory Concepts
RAID6 – Parity/Checksum distributed across disks and a second
checksum scheme (P+Q) distributed across different disks
Virtual Memory
In order to take advantage of the various performance and prices of
different types of memory devices it is normal for a memory
hierarchy to be used:
CPU registers - fastest data storage medium
cache - for increased speed of access to DRAM
main RAM - normally DRAM for cost reasons; SRAM possible
disc - magnetic, random access
magnetic tape - serial access for archiving; cheap
• How and where do we find memory that is not RAM?
• How does a job maintain a consistent user image when there are
many others swapping resources between memory devices?
• How can all users pretend they have access to similar memory
addresses?
Virtual Memory
Paging
In a paged virtual memory system the virtual address is treated
as groups of bits which correspond to the Page number and
offset or displacement within the page
– often denoted as (P,D) pair.
• Page number can be looked up in a page table and
concatenated with the offset to give the real address.
• There is normally a separate page table for each virtual
machine, which points to pages in the same memory.
• There are two methods used for page table lookup
– direct mapping
– associative mapping
Virtual Memory
Direct Mapping
• uses a page table with the same
number of entries as there are
pages of virtual memory.
• thus possible to look up the
entry corresponding to the
virtual page number to find
– the real address of the page
(if the page is currently
resident in real memory)
– or the address of that page
on the backing store if not
• This may not be economic for
large mainframes with many
users
• A large page table is expensive
to keep in RAM and may be
paged...
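A minimal sketch of the direct-mapped lookup, assuming 4K pages and a flat table with one entry per virtual page (the field widths and structure layout are illustrative, not from the slides):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12u                     /* 4K pages -> D is 12 bits       */
#define N_PAGES   (1u << 20)              /* one entry per virtual page (P) -
                                             large, which is the drawback
                                             noted above                    */
typedef struct {
    bool     resident;                    /* page currently in real memory? */
    uint32_t frame;                       /* real page frame if resident    */
    uint32_t disc_block;                  /* backing-store address if not   */
} pte_t;

static pte_t page_table[N_PAGES];

bool translate(uint32_t vaddr, uint32_t *raddr)
{
    uint32_t p = vaddr >> PAGE_BITS;                  /* page number P    */
    uint32_t d = vaddr & ((1u << PAGE_BITS) - 1);     /* displacement D   */

    if (!page_table[p].resident)
        return false;                     /* page fault: fetch from disc_block */

    *raddr = (page_table[p].frame << PAGE_BITS) | d;  /* concatenate frame and D */
    return true;
}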
Virtual Memory
Content Addressable Memories
• when an ordinary
memory is given an
address it returns the
data word stored at that
location.
• A content addressable
memory is supplied data
rather than an address.
• It looks through all its
storage cells to find a
location which matches
the pattern and returns
which cell contained the
data - may be more than
one
Virtual Memory
Content Addressable Memories
• It is possible to perform a translation
operation using a content addressable
memory
• An output value is stored together
with each cell used for matching
• When a match is made the signal
from the match is used to enable the
register containing the output value
• Care needs to be taken so that only
one output becomes active at any
time
Virtual Memory
Associative Mapping
• Associative mapping uses a
content addressable memory to
find if the page number exists in
the page table
• If it does the rest of the entry
contains the real memory address
of the start of the page
• If not then page is currently in
backing store and needs to be
found from a directly mapped
page table on disc
• The associative memory only
needs to contain the same number
of entries as the number of pages
of real memory - much smaller
than the directly mapped table
Virtual Memory
Associative Mapping
• A combination of direct and associative mapping is often used.
Virtual Memory
Paging
• Paging is viable because programs tend to consist of loops and
functions which are called repeatedly from the same area of memory.
Data tends to be stored in sequential areas of memory and is likely to
be used frequently once brought into main memory.
• Some memory access will be unexpected, unrepeated and so wasteful
of page resources.
• It is easy to produce programs which mis-use virtual memory,
provoking frantic paging as they access memory over a wide area.
• When RAM is full, paging can not just read virtual pages from
backing store to RAM, it must first discard old ones to the backing
store.
Virtual Memory
Paging
• There are a number of algorithms that can be used to decide
which ones to move:
– Random replacement - easy to implement, but takes no
account of usage
– FIFO replacement - simple cyclic queue, similar to above
– First-In-Not-Used-First-Out - FIFO queue enhanced with
extra bits which are set when page is accessed and reset
when entry is tested cyclically.
– Least Recently Used - uses set of counters so that access
can be logged
– Working Set - all pages used in last x accesses are flagged
as working set. All other pages are discarded to leave
memory partially empty, ready for further paging
Virtual Memory
Paging - general points
• Every process requires its own page table - so that it can make
independent translation of location of actual page
• Memory fragmentation under paging can be serious.
– as pages are set size, usage will not be for complete page
and last one of a set will not normally be full
– especially if page size is large to optimise disc usage
(reduce the number of head movements)
• Extra bits can be stored in page table with the real address - dirty
bit - to determine if page has been written to since it was copied
and hence if it needs to be copied back
Virtual Memory
Segmentation
• A virtual address in a segmented system is made from 2 parts
– segment number
– displacement within the segment - (S,D) pairs
• unlike pages, segments are not fixed length - they may be variable
• Segments store complete entities - pages allow objects to be split
• Each task has its own
segment table
• segment table contains base
address and length of
segment so that other
segments aren’t corrupted
Virtual Memory
Segmentation
• Segmentation doesn’t give rise to fragmentation in the same way: segments
are of variable size, so there is no wasted space within a segment.
• BUT as they are of variable size it is not so easy to plan how to fit them into
memory
• Keep a sorted table of vacant blocks of memory and combine
neighbouring blocks when possible
• Can keep information on the “type” a segment is - read-only executable
etc. as they correspond to complete entities.
Virtual Memory
Segmentation & Paging
• A combination of segmentation and Paging uses a triplet of virtual address
fields - the segment number, the page number within the segment and the
displacement within the page (S,P,D)
• More efficient than pure paging - use of space more flexible
• More efficient than pure segmentation - allows part of segment to be swapped
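A sketch of the (S,P,D) translation under stated assumptions (a flat per-segment page table and 4K pages; residency and segment-length checks omitted):

#include <stdint.h>

#define PAGE_BITS 12u

typedef struct { const uint32_t *page_table; uint32_t length; } seg_entry_t;

/* Real address = frame (found via the segment's own page table) followed by
   the displacement within the page.                                         */
uint32_t translate_spd(const seg_entry_t *segment_table,
                       uint32_t s, uint32_t p, uint32_t d)
{
    uint32_t frame = segment_table[s].page_table[p];
    return (frame << PAGE_BITS) | d;
}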
Virtual Memory
Segmentation & Paging
• It is easy to mis-use virtual memory by a simple difference in the way that some
routines are coded: the two examples below perform exactly the same task, but
the first generates 1,000 page faults on a machine with 1K word pages,
while the second generates 1,000,000. Most languages (except Fortran)
store arrays in memory with the rows laid out sequentially, the right-hand
subscript varying most rapidly…..

void order_by_rows(void)            /* ~1,000 page faults */
{
    int array[1000][1000], ii, jj;
    for (ii = 0; ii < 1000; ii++) {
        for (jj = 0; jj < 1000; jj++) {
            array[ii][jj] = 0;      /* touch elements along each row */
        }
    }
}

void order_by_columns(void)         /* ~1,000,000 page faults */
{
    int array[1000][1000], ii, jj;
    for (ii = 0; ii < 1000; ii++) {
        for (jj = 0; jj < 1000; jj++) {
            array[jj][ii] = 0;      /* touch elements down each column */
        }
    }
}
Memory Caches
• Most general purpose microprocessor systems use DRAM for their bulk
RAM requirements because it is cheap and more dense than SRAM
• The penalty for this is that it is slower - SRAM has a 3-4 times
shorter cycle time
• To help some SRAM can be added:
– On-chip directly to the CPU for use as desired - use depends on
the compiler, not always easy to use efficiently but fast access
– Caching - between DRAM and CPU. Built using small fast
SRAM, copies of certain parts of the main memory are held
here. The method used to decide where to allocate cache
determines the performance.
– Combination of the two - on chip cache.
Memory Caches
Directly mapped cache - simplest form of memory cache.
• In which the real memory address is treated as:
[Address layout: block select = tag (t bits) | cache index (c bits)]
• For a cache of 2^c words, the cache index section of the real memory
address indicates which cache entry is able to store data from that address
• When cached, the tag (the most significant bits of the address) is stored in the cache with the data to
indicate which page it came from
• The cache will store 2^c words from 2^t pages.
• In operation the tag is compared in every memory cycle
– if the tag matches, a cache hit is achieved and the cached data is passed
– otherwise a cache miss occurs, the DRAM supplies the word, and the data
with its tag are stored in the cache
[Diagram: the tag field of the address is compared with the tag stored alongside the indexed cache entry; on a match the cache memory supplies the data, otherwise the main memory does]
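A minimal sketch of a direct-mapped lookup at word granularity, assuming 2^11 cache entries; the sizes and the DRAM stand-in are illustrative, not from the course notes:

#include <stdint.h>
#include <stdbool.h>

#define C_BITS    11u                           /* 2^11 = 2k cache entries */
#define N_ENTRIES (1u << C_BITS)

typedef struct { bool valid; uint32_t tag; uint32_t data; } cache_entry_t;
static cache_entry_t cache[N_ENTRIES];

/* Stand-in for the main DRAM in this sketch. */
static uint32_t main_memory_read(uint32_t word_addr) { return word_addr * 3u; }

uint32_t cached_read(uint32_t word_addr)
{
    uint32_t index = word_addr & (N_ENTRIES - 1);   /* cache index (c bits) */
    uint32_t tag   = word_addr >> C_BITS;           /* remaining tag bits   */

    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;                   /* cache hit            */

    uint32_t word = main_memory_read(word_addr);    /* cache miss           */
    cache[index] = (cache_entry_t){ true, tag, word };
    return word;
}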
Memory Caches
Set Associative Caches.
• A 2-way cache contains 2 cache blocks, each capable of storing one word
and the appropriate tag.
• For any memory access the two stored tags are checked
• Requires associative memory with 2 entries for each of the 2^c cache lines
• Similarly a 4-way cache stores 4 cache entries for each cache index
[Diagram: a 2-way set associative cache - the index selects one line in each of two tag/data blocks; both stored tags are compared with the address tag and the appropriate cache block (or main memory) supplies the data]
Memory Caches
Fully Associative Caches
• A 2-way cache has two places which it must read and compare to
look for a tag
• This is extended to the size of the cache memory
– so that any main memory word can be cached at any location in
cache
• the cache has no index (c=0) and contains longer tags and data
– notice that as c (the index length) decreases, t (the tag length) must
increase to match
• all tags are compared on each memory access
• to be fast all tags must be compared in parallel
[Address layout: block select = tag (t bits) only; no cache index (c=0)]
INMOS T9000 had such a cache on chip
Memory Caches
Degree of Set Associativity
• for any chosen size of cache, there is a choice between more
associativity or a larger index field width
• optimum can depend on workload and instruction decoding accessible by simulation
In practice:
An 8kbyte (2k entries) cache, addressed directly, will produce a hit
rate of about 73%, a 32kbyte cache achieves 86% and a 128kbyte 2-way cache 89%
(all these figures depend on characteristics of the instruction set and
code executed, data used, etc. - these are for the Intel 80386)
• considering the greater complexity of the 2-way cache there doesn’t
seem to be a great advantage in applying it
Memory Caches
Cache Line Size
• Possible to have cache data entries wider than a single word – i.e. a line size > 1
• Then a real memory access causes 2, 4 etc. words to be read
– reading performed over n-word data bus
– or from page mode DRAM, capable of transferring multiple
words from same row in DRAM, by supplying extra column
addresses
– extra words are stored in the cache in an extended data area
– as most code (and data access) occurs sequentially, it is likely
that next word will come in useful…
– real memory address specifies which word in the line it wants
[Address layout: block select = tag (t bits) | cache index (c bits) | line address (l bits)]
Memory Caches
Writing Cached Memory
So far we have only really been concerned with reading the cache. But there is also
the problem of keeping the cache and main memory consistent:
Unbuffered Write Through
• write data to the relevant cache entry and update the tag, and also write the data to
  its location in main memory - speed is determined by the main memory
Buffered Write Through
• Data (and address) are written to a FIFO buffer between the CPU and main
  memory; the CPU continues with its next access while the FIFO buffer writes to the DRAM
• The CPU can continue to write at cache speed until the FIFO is full, then slows
  down to DRAM speed as the FIFO empties
• If the CPU wants to read from DRAM (instead of the cache), the FIFO must first be
  emptied to ensure we have the correct data - which can introduce a long delay
• This delay can be shortened if the FIFO has only one entry - a simple latch buffer
Memory Caches
[Diagram: 4Mword memory using an 8kword direct-mapped cache with write-through
writes. The CPU's 22-bit word address A0-21 is split into a 13-bit index and a 9-bit
tag (the bottom 2 bits of A0-31 select the byte within a word); the index addresses
both the cache memory and the tag storage/comparison logic, whose Match output
tells the control logic whether the cache or the main DRAM supplies the data on the
32-bit data bus; optional FIFOs on the address and data paths provide buffered
write-through]
Memory Caches
Writing Cached Memory (cont’d)
Deferred Write (Copy Back)
• data is written to the cache only, allowing the cached entry to differ from main
  memory. If the cache system wants to overwrite a cache index with a different tag,
  it first checks whether the current entry has been changed since it was copied in.
  If so, it writes the current value back to main memory before reading the new data
  into that location in the cache.
• More logic is required for this operation, but the performance gain can be
  considerable, as it allows the CPU to work at cache speed while it stays within the
  same block of memory. The other techniques will slow down to DRAM speed
  eventually.
• Adding a buffer to this allows the CPU to write to the cache before the data is
  actually copied back to the DRAM
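A minimal sketch of the dirty-bit test described above, assuming the same
hypothetical 8k-entry direct-mapped cache and a 4Mword DRAM array (a real
controller would also handle reads and the copy-back buffer):

    #include <stdint.h>
    #include <stdbool.h>

    #define C_BITS  13
    #define ENTRIES (1u << C_BITS)

    static uint32_t dram[1u << 22];                      /* 4Mword main memory    */
    struct line { bool valid, dirty; uint32_t tag, data; };
    static struct line cache[ENTRIES];

    /* Copy-back write: the DRAM is only touched when a dirty entry with a
       different tag has to be evicted from the chosen cache index.          */
    static void cpu_write(uint32_t addr, uint32_t data)
    {
        uint32_t index = addr & (ENTRIES - 1), tag = addr >> C_BITS;
        struct line *l = &cache[index];

        if (l->valid && l->dirty && l->tag != tag)        /* changed since loaded? */
            dram[(l->tag << C_BITS) | index] = l->data;   /* copy back first       */

        *l = (struct line){ .valid = true, .dirty = true, .tag = tag, .data = data };
    }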
Memory Caches
[Diagram: 4Mword memory using an 8kword direct-mapped cache with copy-back
writes. As before, the 22-bit word address is split into a 13-bit index and a 9-bit tag,
but the tag storage now also holds a dirty bit for each entry, and address/data latches
between the cache and the main DRAM hold a dirty entry while it is copied back]
Memory Caches
Cache Replacement Policies (for non-direct-mapped caches)
• when the CPU accesses a location which is not already in the cache, we need to
  decide which existing entry to send back to main memory
• this needs to be a quick decision
• Possible schemes are:
  – Random replacement - a very simple scheme where a frequently changing
    binary counter is used to supply a cache set number for rejection
  – First-In-First-Out - a counter is incremented every time a new entry is brought
    into the cache, and is used to point to the next slot to be filled
  – Least Recently Used - a good strategy as it keeps often-used values in cache,
    but difficult to implement with a few gates in a short time
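A sketch of the random-replacement scheme (in hardware the counter would be a
free-running register; the 4-way associativity is an assumed example):

    #include <stdint.h>

    #define WAYS 4                      /* e.g. a 4-way set-associative cache */
    static uint32_t counter;            /* frequently changing binary counter */

    /* Random replacement as described above: the counter simply supplies the
       number of the way (set member) to be rejected on a miss.              */
    static unsigned victim_way(void)
    {
        return counter++ % WAYS;
    }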
Memory Caches
Cache Consistency
A problem occurs when DMA is used by other devices or processors.
• The simple solution is to attach the cache to the memory and make all devices
  operate through it.
• This is not the best idea, as a DMA transfer will cause all cache entries to be
  overwritten, even though the DMA data is unlikely to be needed again soon
• If the cache is placed on the CPU side of the DMA traffic, then the cache might
  not mirror the DRAM contents
Bus Watching - monitor accesses to the DRAM and invalidate the relevant cache
tag entry whenever that DRAM location is updated; the cache can then be kept on
the CPU side of the DMA traffic
Instruction Sets
Introduction
Instruction streams control all activity in the processor. All the
characteristics of the machine depend on the design of the instruction set:
– ease of programming
– code space efficiency
– performance
Look at a few different instruction sets:
– Zilog Z80
– DEC Vax-11
– Intel family
– INMOS Transputer
– Fairchild Clipper
– Berkeley RISC-I
Instruction Sets
General Requirements of an Instruction Set
Number of conflicting requirements of an instruction set:
• Space Efficiency - control information should be compact
  – instructions form the major part of all data moved between memory and the CPU
  – compactness is obtained by careful design of the instruction set
    • variable-length coding can be used so that frequently used
      instructions are encoded into fewer bits
• Code Efficiency - a task can only be translated efficiently if it is easy to pick
  the needed instructions from the set
  – various attempts at optimising instruction sets resulted in:
    • CISC - a rich set of long instructions - results in a small number
      of translated instructions
    • RISC - very short instructions, combined at compile time to
      produce the same result
Instruction Sets
General Requirements of an Instruction Set (cont’d)
• Ease of Compilation - in some environments compilation is a more
frequent activity than on machines where demanding executables
predominate. Both want execution efficiency however.
– more time consuming to produce efficient code for CISC - more
difficult to map program to wide range of complex instructions
– RISC simplifies compilation
– Ease of compilation doesn’t guarantee better code…..
– Orthogonality of the instruction set also affects code generation:
  • regular structure
  • no special cases
  • thus all actions (add, multiply etc.) are able to work with each
    addressing mode (immediate, absolute, indirect, register)
  • if not, the compiler may have to treat different items (constants,
    arrays and variables) differently
Instruction Sets
General Requirements of an Instruction Set (cont’d)
• Ease of Programming
– still times when humans work directly at machine code level;
• compiler code generators
• performance optimisation
  – in these cases there are advantages to regular, fixed-length
    instructions with few side effects and maximum orthogonality
• Backward Compatibility
  – many manufacturers produce upgraded versions which allow
    code written for an earlier CPU to run without change
  – good for public relations - if not compatible, customers could rewrite
    for a competitor's CPU instead!
  – but this can make the instruction set a mess - deficiencies are added to
    rather than replaced - 8086 - 80286 - 80386 - 80486 - Pentium
Instruction Sets
General Requirements of an Instruction Set (cont’d)
• Addressing Modes & Number of Addresses per Instruction
– Huge range of addressing modes can be provided - specifying
operands from 1 bit to several 32bit words.
– These modes may themselves need to include absolute
addresses, index registers, etc. of various lengths.
– Instruction sets can be designed which primarily use 0, 1, 2 or 3
operand addresses just to compound the problem.
Instruction Sets
Important Instruction Set Features:
• Operand Storage in the CPU
– where are operands kept other than in memory?
• Number of operands named per instruction
– How many operands are named explicitly per instruction?
• Operand Location
– can any ALU operand be located in memory or must some or all
of the operands be held in the CPU?
• Operations
– What types of operations are provided in the instruction set?
• Type and size of operands
– What is the size and type of each operand and how is it
specified?
Instruction Sets
Three Classes of Machine:
• Stack based Machines (zero address machine)
  Advantages:    Simple model of expression evaluation
                 Short instructions can give dense code
  Disadvantages: Stack cannot be randomly accessed, making efficient
                 code generation difficult
                 Stack can be a hardware bottleneck
• Accumulator based Machines (one address machine)
  Advantages:    Minimises internal state of the machine
                 Short instructions
  Disadvantages: Since the accumulator provides the only temporary storage,
                 memory traffic is high
• Register based Machines (multi address machine)
  Advantages:    Most general model
  Disadvantages: All operands must be named, leading to long instructions
Instruction Sets
Register Machines
• Register to Register
  Advantages:    Simple, fixed length instruction encoding
                 Simple model for code generation
                 Most compact
                 Instructions access operands in similar time
  Disadvantages: Higher instruction count than in architectures with
                 memory references in instructions
                 Some short instruction codings may waste instruction space
• Register to Memory
  Advantages:    Data can be accessed without loading first
                 Instruction format is easy to encode and dense
  Disadvantages: Operands are not symmetric, since one operand (in the register)
                 is destroyed
                 The number of registers is fixed by the instruction coding
                 Operand fetch speed depends on location (register or memory)
Instruction Sets
Register Machines (cont’d)
• Memory to Memory
  Advantages:    Simple, (fixed length?) instruction encoding
                 Does not waste registers for temporary storage
  Disadvantages: Large variation in instruction size - especially as the
                 number of operands is increased
                 Large variation in operand fetch speed
                 Memory accesses create a memory bottleneck
Instruction Sets
Addressing Modes
Mode             Example               Meaning                        Typical use
Register         Add R4, R3            R4 = R4 + R3                   when a value is in a register
Immediate        Add R4, #3            R4 = R4 + 3                    for constants
Indirect         Add R4, (R1)          R4 = R4 + M[R1]                access via a pointer
Displacement     Add R4, 100(R1)       R4 = R4 + M[100+R1]            access local variables
Indexed          Add R3, (R1+R2)       R3 = R3 + M[R1+R2]             array access (base + index)
Direct           Add R1, (1001)        R1 = R1 + M[1001]              access static data
Memory Indirect  Add R1, @(R3)         R1 = R1 + M[M[R3]]             double indirect - pointers
Auto-increment   Add R1, (R2)+         R1 = R1 + M[R2],               stepping through arrays
                                       then R2 = R2 + d               (d is the word length)
Auto-decrement   Add R1, -(R2)         R2 = R2 - d,                   can also be used for stacks
                                       then R1 = R1 + M[R2]
Scaled           Add R1, 100(R2)[R3]   R1 = R1 + M[100+R2+(R3*d)]     indexing arrays of
                                                                      d-byte elements
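The same modes map naturally onto everyday C constructs. A hedged sketch of what
a compiler might generate is shown below; the assembly comments reuse the table's
notation and the register names are purely illustrative:

    /* Hypothetical C fragments annotated with the addressing mode a compiler
       might use for the memory operand (register and label names invented). */
    struct rec { int x; };
    int g;                                   /* a static/global variable           */

    int f(int *p, int a[], int i, struct rec *s)
    {
        int sum = 3;                         /* Immediate:       MOV Rsum,#3       */
        sum += g;                            /* Direct:          ADD Rsum,(g)      */
        sum += *p;                           /* Indirect:        ADD Rsum,(Rp)     */
        sum += s->x;                         /* Displacement:    ADD Rsum,offs(Rs) */
        sum += a[i];                         /* Indexed/Scaled:  ADD Rsum,(Ra)[Ri] */
        return sum;
    }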
Instruction Sets
Instruction Formats
Number of addresses (operands) per instruction:
  4:  | operation | 1st operand | 2nd operand | result | next address |
  3:  | operation | 1st operand | 2nd operand | result |
  2:  | operation | 1st operand & result | 2nd operand |
  1:  | operation | 2nd operand |           (1st operand & result in an implied register)
  0:  | operation |                         (all operands implied, e.g. on a stack)
Instruction Sets
Example Programs and simulations
(used in the simulations by Hennessy & Patterson)
gcc
the gcc compiler (written in C) compiling a large number of C
source files
TeX
the TeX text formatter (written in C), formatting a set of
computer manuals
SPICE The spice electronic circuit simulator (written in FORTRAN)
simulating a digital shift register
Instruction Sets
Simulations on Instruction Sets from Hennessy & Patterson
The following tables are extracted from 4 graphs in Hennessy & Patterson's
"Computer Architecture: A Quantitative Approach"

Use of Memory Addressing Modes (% of memory accesses)
Addressing Mode    TeX   Spice   gcc
Memory Indirect      1       6     1    (lists)
Scaled               0      16     6    (arrays)
Indirect            24       3    11    (pointers)
Immediate           43      17    39    (constants)
Displacement        32      55    40    (local variables)
Instruction Sets
Simulations on Instruction Sets (cont’d)
Number of bits needed for a Displacement Operand Value
(percentage of displacement operands using this number of bits)
Bits:    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
TeX:    17   1   2   8   5  17  16   9   0   0   0   0   0   5   2  22
Spice:   4   0  13   0   1   3   6  14   5   0   5  14   5   2   0   1
gcc:    27   1   9   5   3   6   5  16  11  12  15   6   1   1   4  12

How local are the local variables?
< 8 bits: 71% for TeX; 37% for Spice; 79% for gcc
Instruction Sets
Simulations on Instruction Sets (cont’d)
Percentage of Operations using Immediate Operands
Operation        TeX   Spice   gcc
Loads             38      26    23
Compares          83      92    84
ALU Operations    52      49    69

The Distribution of Immediate Operand Sizes
(percentage of immediates needing this number of bits)
Bits:     0   4   8  12  16  20  24  28  32
TeX:      3  44   3   2  16  23   2   1   0
Spice:    0   0  12  36  16  14  10  12   0
gcc:      1  50  21   3   2  19   0   0   1
Instruction Sets
The Zilog Z80
• 8 bit microprocessor derived from the Intel 8080
• has a small register set ( 8 bit accumulator + 6 other registers)
• Instructions are either register based or register and one memory
address - single address machine
• Enhanced 8080 with relative jumps and bit manipulation
• 8080 instruction set (8bit opcodes) – unused gaps filled in with extra instructions.
– even more needed so some codes cause next byte to be
interpreted as another set of opcodes….
• Typical of early register-based microprocessor
• Let down by lack of orthogonality - inconsistencies in instructions
eg:
– can load a register from address in single register
– but accumulator can only be loaded by address in register pair
Instruction Sets
The Zilog Z80 (cont’d)
8 improvements over the 8080:
1) Enhanced instruction set – index registers & instructions
2) Two sets of registers for fast context switching
3) Block move
4) Bit manipulation
5) Built-in DRAM refresh address counter
6) Single 5V power supply
7) Fewer extra support chips needed
8) Very good price…
• Separate PC, SP and 2 index registers
• Addressing modes:
  – Immediate (1 or 2 byte operands)
  – Relative (one-byte displacement)
  – Absolute (2-byte address)
  – Indexed (M[index reg + 8 bit disp])
  – Register (specified in opcode itself)
  – Implied (e.g. references accumulator)
  – Indirect (via HL, DE or BC register pairs)
• Instruction Types
  – Load & Exchange - 64 opcodes used just for register-register copying
  – Block Copy
  – Arithmetic, rotate & shift - mainly 8 bit; some simple 16-bit operations
  – Jump, call & return - uses condition code from previous instruction
  – Input & Output - single byte; block I/O
Instruction Sets
Intel 8086 Family
• 8086 announced in 1978 - not used in PC until 1987 (slower 8088 from 1981)
  – 16-bit processor and data paths
  – Concurrent Fetch (Prefetch) and Execute
  – 20-bit base addressing mode
• 80186 upgrade: small extensions
• 80286 - used in the PC/AT in 1984 (6 times faster than the 8088 - 20MHz)
  – Memory mapping & protection added
    (protected mode only switchable by processor reset until the 386!)
    • support for VM through segmentation
    • 4 levels of protection – to keep applications away from the OS
  – 24-bit addressing (16Mb) - segment table has a 24-bit base field & a 16-bit size field
• 80386 - 1986 - 40MHz
  – 32-bit registers and addressing (4Gb)
  – incorporates "virtual" 8086 mode rather than direct hardware support
  – paging (4kbytes) and segmentation (up to 4Gb) – allows UNIX implementation
  – general-purpose register usage
  – incorporates 6 parallel stages:
    • Bus Interface Unit – I/O and memory
    • Code Prefetch Unit
    • Instruction Decode Unit
    • Execution Unit
    • Segment Unit – logical address to linear address translation
    • Paging Unit – linear address to physical address translation
  – includes a cache for the up to 32 most recently used pages
Instruction Sets
Intel 8086 Family
• i486 - 1988 - 100MHz (named i486, not 80486 – a court ruled you "can't trademark a number")
  – more performance
    • added caching (8kb) to the memory system
    • integrated floating point processor on board
    • expanded decode and execute into 5 pipelined stages
• Pentium - 1994 - 150-750MHz (10,000 times the speed of the 8088)
  – added a second execution pipeline to give superscalar performance
  – now separate code (8k) and data (8k) caches
  – added branch prediction, with an on-chip branch table for lookups
  – pages now 4Mb as well as 4kb
  – internal paths 128 and 256 bits, external still 32 bits
  – dual processor support added
• Pentium Pro
  – instruction decode now 3 parallel units
  – breaks up code into "micro-ops"
  – micro-ops can be executed in any order using 5 parallel execution units:
    2 integer, 2 floating point and 1 memory
Instruction Sets
Intel 8086 Registers (initially 16 bit)
Data registers
  AX  used for general arithmetic; AH and AL used in byte arithmetic
  BX  general-purpose register; used as address base register
  CX  general-purpose register; used specifically in string, shift & loop instructions
  DX  general-purpose register; used in multiply, divide and I/O instructions
Address registers
  SP  Stack Pointer
  BP  base register - for base-addressing mode
  SI  index, string source base register
  DI  index, string destination base register
Registers can be used in 32-bit mode when in 80386 mode
Instruction Sets
Intel 8086 Registers (initially 16 bit)
Segment Base Registers - shifted left 4 bits and added to the address specified in the
instruction... causes overlap!!! (this scheme was changed in the 80286)
  CS  start address of code accesses
  SS  start address of Stack Segment
  ES  extra segment (for string destinations)
  DS  data segment - used for all other accesses
Control Registers
  IP     Instruction Pointer (LS 16 bits of PC)
  Flags  6 condition code bits plus 3 processor status control bits
Addressing Modes
A wide range of addressing modes is supported. Many modes can only be accessed
via specific registers, e.g.:
  Register Indirect    BX, SI, DI
  Base + displacement  BP, BX, SI, DI
  Indexed              address is the sum of 2 registers - BX+SI, BX+DI, BP+SI, BP+DI
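A minimal C sketch of the real-mode address calculation just described (the
segment:offset values are hypothetical, chosen to show the overlap):

    #include <stdint.h>
    #include <stdio.h>

    /* 8086 real-mode address formation: segment register shifted left 4 bits
       and added to the 16-bit offset from the instruction, giving 20 bits.   */
    static uint32_t phys_addr(uint16_t seg, uint16_t off)
    {
        return (((uint32_t)seg << 4) + off) & 0xFFFFF;   /* 20-bit address */
    }

    int main(void)
    {
        /* Overlap: two different segment:offset pairs reach the same byte. */
        printf("%05X\n", phys_addr(0x1234, 0x0010));     /* prints 12350 */
        printf("%05X\n", phys_addr(0x1235, 0x0000));     /* prints 12350 */
        return 0;
    }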
Instruction Sets
The DEC Vax-11
Vax-11 family was compatible with the PDP-11 range - had 2 separate
processor modes - “Native” (VAX) and “Compatibility” Modes
• The VAX had sixteen 32-bit general purpose registers, including the PC and SP,
  and a frame pointer.
• All data and address paths were 32bits wide - 4Gb address space.
• Full range of data types directly supported by hardware - 8, 16, 32
and 64 bit integers, 32 and 64 bit floating point and 32 digit BCD
numbers, character strings etc.
• A very full selection of addressing modes was available
• Used instructions made up from 8-bit bytes which specified:
– the operation
– the data type
– the number of operands
Instruction Sets
The DEC Vax-11
• Special opcodes FD and FF introduce even more opcodes in a second byte.
• Only the number of addresses is encoded into the opcode itself - the addresses of
  operands are encoded in one or more succeeding bytes
So the operation:
    ADDL3 #1, R0, @#12345678(R2)
or
"Add 1 to the longword in R0 and store the result in a memory location addressed at
an offset of the number of longwords stored in R2 from the absolute address
12345678 (hex)"
is stored as 9 bytes (each 8 bits wide):
    193       ADDL3 opcode
    0  1      Literal (immediate) constant 1
    5  0      Register mode - register 0
    4  2      Index prefix - register 2
    9  15     Absolute address follows, for indexing
    78
    56        Absolute address #12345678
    34        - the VAX was little endian
    12
Instruction Sets
The INMOS Transputer
• The transputer is a
microprocessor designed to
operate with other
transputers in parallel
embedded systems.
• The T800 was
exceptionally powerful
when introduced in 1986
• The T9000 - more powerful
pipelined version in 1994
•
•
•
•
Government sell INMOS to EMI
EMI decide to concentrate on music
SGS Thompson buy what’s left
Japanese use transputer technology in
printers/scanners
• Then sold to ST microelectronics
• Now abandoned
Instruction Sets
The INMOS Transputer
• Designed for synchronised communications applications
• Suitable for coupling into a multiprocessing configuration allowing
a single program to be spread over all machines to perform task cooperatively.
• Has 4kbytes internal RAM - not cache, but a section of main
memory map for programmer/compiler to utilise.
• Compact instruction set
– most popular instructions in shortest opcodes - to minimise
bandwidth
– operate in conjunction with a 3 word execution stack - a zero
addressing strategy
Instruction Sets
The INMOS Transputer
The processor evaluates the following high-level expression:
x = a+b+c;
where x, a, b, and c represent integer variables.
No need to specify which processor registers receive the variables.
Processor just told to load - pushed on stack - and add them.
When an operation is performed, two values at the top are popped, then
combined, and the result of the operation is pushed back:
                 ;stack contents ( = undefined)
    load a       ;[ a         ]
    load b       ;[ b  a      ]
    load c       ;[ c  b  a   ]
    add          ;[ c+b  a    ]
    add          ;[ c+b+a     ]
    store x      ;[ c+b+a     ]
• this removes the need to add extra bits to the instruction to specify which register
  is accessed, so instructions can be packed into smaller words - 80% of instructions
  are only 1 byte long - resulting in a tighter fit in memory, and less time spent
  fetching the instructions
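A sketch in plain C (not transputer code) of the 3-deep evaluation stack - registers
A, B, C with the push/pop behaviour described above:

    #include <stdio.h>

    static int A, B, C;

    static void push(int v) { C = B; B = A; A = v; }              /* load   */
    static int  pop(void)   { int v = A; A = B; B = C; return v; }

    static void add_top(void) { int t = pop(); push(pop() + t); } /* add    */

    int main(void)
    {
        int a = 1, b = 2, c = 3, x;
        push(a); push(b); push(c);       /* load a; load b; load c          */
        add_top();                       /* stack: [c+b  a]                 */
        add_top();                       /* stack: [c+b+a]                  */
        x = pop();                       /* store x                         */
        printf("x = %d\n", x);           /* prints x = 6                    */
        return 0;
    }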
Instruction Sets
The INMOS Transputer
• Has 6 registers
– 3 make the register stack
– a program counter (called the instruction pointer by Inmos)
– a stack pointer (called a workspace pointer by Inmos)
– and an operand register
• The stack is interfaced by the first of the 3 registers (A, B, C)
– “push”ing a value into A will cause A’s value to be pushed to B and
B’s value to C
– “pop”ping a value from A will cause B’s value to be popped to A and
C’s value to B
• The operand register is the focal point of instruction processing:
  – the 4 upper bits of a transputer instruction contain the operation
  – giving 16 possible operations
  – the 4 lower bits contain the operand - this can be enlarged to 32 bits by
    using "prefix" instructions
Instruction Sets
The INMOS Transputer
• The 16 instructions include jump, call, memory load/store and add. Three of the
  16 elementary instructions are used to enlarge the two 4-bit fields (opcode or
  operand) in conjunction with the Operand Register (OR), as follows:
  – the "prefix" instruction (pfix) adds its 4-bit operand into the OR
  – and then shifts the OR 4 bits to the left
  – allowing numbers (up to 32 bits) to be built up in the OR
  – a negative prefix instruction (nfix) adds its operand into the OR and then
    inverts all the bits in the OR before shifting 4 bits to the left - this allows
    2's complement negative values to be built up - e.g.

  Mnemonic                    Code             Memory
  ldc #3                      ldc     (#4)     #43
  ldc #35     is coded as     pfix #3 (#2)     #23
                              ldc #5  (#4)     #45       i.e. #2345
  ldc #987    is coded as     pfix #9 (#2)     #29
                              pfix #8 (#2)     #28
                              ldc #7  (#4)     #47       i.e. #292847
Instruction Sets
The INMOS Transputer
  Mnemonic                                  Code             Memory
  ldc -31 (ldc #FFFFFFE1)   is coded as     nfix #1 (#6)     #61
                                            ldc #1  (#4)     #41      i.e. #6141

This last example shows the advantage of loading via the 2's complement negative
prefix. Otherwise we would have to load all of the leading Fs, taking 5 additional
operations….
• An additional "operate" (opr) instruction allows the OR to be treated as an
  extended opcode - up to 32 bits. Such instructions cannot have an operand, as the
  OR is used for the instruction itself, so they are all zero-address instructions.
• We therefore have 16 one-address instructions and potentially a large number of
  zero-address instructions.

  Mnemonic                      Code             Memory
  add  (#5)     is coded as     opr #5  (#F)     #F5       i.e. #F5
  ladd (#16)    is coded as     pfix #1 (#2)     #21
                                opr #6  (#F)     #F6       i.e. #21F6
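The prefixing rules above can be captured in a short recursive routine. This is only a
sketch of how an assembler might choose pfix/nfix bytes (the opcode values are
taken from the examples; the function name is invented):

    #include <stdio.h>
    #include <stdint.h>

    enum { PFIX = 0x20, NFIX = 0x60, LDC = 0x40, OPR = 0xF0 };

    /* Emit one instruction, prefixing its operand nibble by nibble. */
    static void emit(int op, int32_t operand)
    {
        if (operand >= 16)                        /* too big for one nibble    */
            emit(PFIX, operand >> 4);
        else if (operand < 0)                     /* negative: use nfix        */
            emit(NFIX, (~operand) >> 4);
        printf("#%02X ", op | (operand & 0xF));   /* byte = opcode | nibble    */
    }

    int main(void)
    {
        emit(LDC, 0x3);   printf("\n");           /* #43                       */
        emit(LDC, 0x35);  printf("\n");           /* #23 #45                   */
        emit(LDC, 0x987); printf("\n");           /* #29 #28 #47               */
        emit(LDC, -31);   printf("\n");           /* #61 #41                   */
        emit(OPR, 0x16);  printf("\n");           /* ladd: #21 #F6             */
        return 0;
    }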
Instruction Sets
The INMOS Transputer
• No dedicated data registers. The transputer does not have dedicated
registers, but a stack of registers, which allows for an implicit
selection of the registers. The net result is a smaller instruction
format.
• Reduced Instruction Set design. The transputer adopts the RISC
philosophy and supports a small set of instructions executed in a
few cycles each.
• Multitasking supported in microcode. The actions necessary for the
transputer to swap from one task to another are executed at the
hardware level, freeing the system programmer of this task, and
resulting in fast swap operations.
Instruction Sets
The Fairchild (now Intergraph) Clipper
• Had sixteen 32-bit general purpose registers for the user and another 16
for operating system functions.
– Separated interrupt activity and eliminated time taken to save register
information during an ISR
• Tightly coupled to a Floating Point Unit
• Had 101 RISC like instructions
– 16 bits long
– made up from an 8-bit opcode and two 4-bit register fields
– some instructions can carry 4 bits of immediate data
– the 16 bit instructions could be executed in extremely fast cycles
– also had 67 macro instructions - made up from multiples of simpler
instructions using a microprogramming technique - these incorporated
many more complex addressing modes as well as operations which
took several clock cycles
A tale of Intel
Intergraph were a leading workstation producer for CAD in transport, building and
local government, with products built using Intel chips.
1987 – Intergraph buys the Advanced Processor Division of Fairchild from National
Semiconductor
1989-92 – Patents for the Clipper transferred to Intergraph
1996 – Intergraph find that Intel are infringing their patents on cache addressing,
memory and consistency between cache and memory, write-through & copy-back
modes for virtual addressing, bus snooping etc.
 - Intergraph ask Intel to pay for patent rights
 - Intel refuse
 - Intel then cut off Intergraph from advanced information about Intel chips
 - without that information Intergraph could not design new products well
 - Intergraph go from #1 to #5
1997 – Intergraph sue Intel – lots of legal activity over the next 3 years – the court
rules that Intel is not licensed to use Clipper technology in the Pentium
2002 – Intel pays Intergraph $300M for a license plus $150M damages for
infringement of PIC (Parallel Instruction Computing) technology – the core of the
Itanium chip for high-end servers
A tale of Intel
The Federal Trade Commission cite Intel in 2 other similar cases:
1997 – Digital sue Intel, saying it copied DEC technology to make the Pentium Pro.
In retaliation Intel cut off DEC from Intel pre-release material.
Shortly after this DEC are bought out by Compaq.
1994 – Compaq sue Packard Bell for violating patents for a Compaq chip set.
Packard Bell say the chip set was made by Intel.
Intel cut off Compaq from advanced information…..
Instruction Sets
The Fairchild (now Intergraph) Clipper
An example of a Harvard Architecture - having a separate internal
instruction bus and data bus (and associated caches)
[Diagram: the integer CPU and FPU are connected by an internal instruction bus and
an internal data bus, each through its own cache/memory management unit, to the
off-carrier memory bus]
The Clipper is made up from 3 chips mounted on a ceramic carrier.
The Harvard Architecture enables the caches to be optimised to the different
characteristics of the instruction and data streams.
Microchip's PIC devices also use a Harvard Architecture.
Instruction Sets
The Berkeley RISC-I Research Processor
A research project at UC Berkeley 1980-83 set out to build
• a “pure” RISC structure
• highly suited to executing compiled high level language programs
– procedural block, local & global variables
The team examined the frequency of execution of different types of
instructions in various C and Pascal programs
The RISC-I had a strong influence on the design of the SUN SPARC architecture
(the Stanford MIPS - Microprocessor without Interlocked Pipeline Stages -
architecture similarly influenced the MIPS R2000)
The RISC-I was a register based machine. The registers, data and addresses
were all 32 bits wide.
Had a total of 138 registers.
All instructions, except memory LOADs and STOREs, operated on 1,2 or 3
registers.
Instruction Sets
The Berkeley RISC-I Research Processor
When running program had available a total of 32 general-purpose registers
• 10 (R0-R9) are global
• the remaining 22 were split into 3 groups:
– low, local and high - 6,10 and 6 registers respectively
• When a program calls a procedure
– the first 6 parameters are stored to the program's low registers
– a new register window is formed
– these 6 low registers relabelled as the high 6 in a new block of 22
– this is the register space for the new procedure while it runs.
– the running procedure can keep 10 of its local variables in registers
– it can call further procedures using its own low registers
– it can nest calls to a depth of 8 calls – (thus using all 138 registers)
– on return from procedures the return results are in the high registers
and appear in the calling procedures low registers.
Instruction Sets
The Berkeley RISC-I Research Processor
Process A calls process B which calls process C:
[Diagram: the register bank numbered 0-137, with registers 0-9 as the globals shared
by all procedures; A's window of high/local/low registers overlaps B's window (A's
low registers are B's high registers), and B's low registers likewise overlap C's high
registers]
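A sketch (my own numbering, not the exact RISC-I register map) of how 138
physical registers can hold 8 overlapping windows: 10 globals plus 8 x 16 registers,
with each window's 6 "low" registers doubling as the next window's 6 "high"
registers, circularly:

    #include <stdio.h>

    #define GLOBALS  10
    #define WINDOWS   8
    #define STEP     16                        /* 10 local + 6 shared per window  */

    /* logical: r0-r9 global, r10-r15 high, r16-r25 local, r26-r31 low            */
    static int phys_reg(int window, int logical)
    {
        if (logical < GLOBALS)
            return logical;                    /* globals seen by every procedure */
        int off = logical - GLOBALS;           /* 0..21 within the 22-reg window  */
        return GLOBALS + (window * STEP + off) % (WINDOWS * STEP);
    }

    int main(void)
    {
        /* A's low registers are the same physical registers as B's high ones */
        printf("%d %d\n", phys_reg(0, 26), phys_reg(1, 10));      /* both 26   */
        printf("total = %d\n", GLOBALS + WINDOWS * STEP);         /* 138       */
        return 0;
    }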
Instruction Sets
The Berkeley RISC-I Research Processor
RISC-I Short Immediate Instruction Format:
  | Op-Code (7) | SCC (1) | DEST (5) | S1 (5) | 1 | S2 (13) |
RISC-I Long Immediate Instruction Format:
  | Op-Code (7) | SCC (1) | DEST (5) | IMM (19) |

DEST is the register number for all operations except conditional branches, when it
specifies the condition
S1 is the number of the first source register; S2 is the second - if bit 13 is high, S2 is
a 2's complement immediate value, otherwise a register number
SCC is a set condition code bit which causes the status word register to be updated
Instruction Sets
The Berkeley RISC-I Research Processor
The Op-Code (7 bits) can be one of 4 types of instruction:
• Arithmetic
– where RDEST = RS1 OP S2 and OP is a math, logical or shift operation
• Memory Access
– where LOADs take the form RDEST = MEM[RS1+S2]
– and STOREs take the form MEM[RS1+S2] = RDEST
• Note that RDEST is really the source register in this case
• Control Transfer
– where various branches may be made relative to the current PC
(PC+IMM) or relative to RS1 using the short form (RS1+S2)
• Miscellaneous
– all the rest. Includes “Load immediate high” - uses the long immediate
format to load 19 bits into the MS part of a register - can be followed
by a short format load immediate to the other 13 bits - 32 in all
Instruction Sets
RISC Principles
Not just a machine with a small set of instructions. Must also have been
optimised and minimised to improve processor performance.
Many processors in the 60s and 70s were developed with a microcode engine at the
heart of the processor - easier to design (CAD and formal proof did not exist) and
easy to add extra instructions, or to change them.
Most CISC programs spend most of their time in a small number of instructions.
If the time taken to decode instructions can be reduced by having fewer of them, the
frequent instructions run faster; the less frequent operations can then be built from
sequences of the simpler ones.
Various other features become necessary to make this work:
• One clock cycle per instruction
CISC machines typically take a variable number of cycles
– reading in variable numbers of instruction bytes
– executing microcode
Time wasted by waiting for these to complete is gained if all operate in the
same period
For this to happen a number of other features are required.
Instruction Sets
RISC Principles
• Hard-wired Controller, Fixed Format Instructions
– Single cycle operation only possible if instructions can be
decoded fast and executed straight away.
– Fast (old-fashioned?) hard-wired instruction sequences are
needed - microcode can be too slow
– As designing these controllers is hard even more important to
have few
– can be simplified by making all instructions share a common
format
• number of bytes, positions of op-code etc.
• smaller the better - provided that each instruction
contains needed information
– Typical for only 10% of the logic of a RISC chip to be used for
controller function, compared with 50-60% of a CISC chip like
the 68020
Instruction Sets
RISC Principles
• Larger Register Set
– It is necessary to minimise data movement to and from the processor
– The larger the number of registers the easier this is to do.
– Enables rapid supply of data to the ALU etc. as needed
– Many RISC machines have upward of 32 registers and over 100 is not
uncommon.
– There are problems with saving state of this many registers
– Some machines have “windows” of sets of registers so that a complete
set can be switched by a single reference change
• Memory Access Instructions
– One type of instruction can not be speeded up as much as others
– Use indexed addressing (via a processor register) to avoid having to
supply (long) absolute addresses in the instruction
– Harvard architecture attempts to keep most program instructions and
data apart by having 2 data and address buses
Instruction Sets
RISC Principles
• Minimal pipelining, wide data bus
– CISC machines use pipelining to improve the delivery of instructions to
the execution unit
– it is possible to read ahead in the instruction stream and so decode one
instruction whilst executing the previous one whilst retrieving another
– Complications in jump or branch instructions can make pipelining
unattractive as they invalidate the backed up instructions and new
instructions have to ripple their way through.
– RISC designers often prefer large memory cache so that data can be
read, decoded and executed in a single cycle independent of main
memory
– Regardless of pipelining, fetching program instructions fast is vital to
RISC and a wide data bus is essential to ensure this - same for CISC
Instruction Sets
RISC Principles
• Compiler Effort
– A CISC machine has to spend a lot of effort matching high-level
language fragments to the many different machine instructions - even
more so when the addressing modes are not orthogonal.
– RISC compilers have a much easier job in that respect - fewer choices
– They do, however, build up longer sequences of their small instructions
to achieve the same effect.
– The main complication of compiling for RISC is that of optimising
register usage.
– Data must be maintained on-chip when possible - difficult to evaluate
an importance to a variable.
• a variable accessed in a loop can be used many times and one
outside may be used only once - but both only appear in the code
once...
Instruction Sets
Convergence of RISC and CISC
Many of the principles developed for RISC machine optimisation have been
fed back into CISC machines (Intergraph and Intel…). This is tending to
bring the two styles of machine back together.
• Large caches on the memory interface - reduce the effects of memory usage
• CISC machines are getting an increasing number of registers
• More orthogonal instruction sets are making compiler implementation
easier
• Many of the techniques described above may be applied to the
microprogram controller inside a conventional CISC machine.
• This suggests that the microprogram will take on a more RISC like form
with fixed formats and fields, applying orthogonally over the registers etc.
Pipelined Parallelism in Instruction Processing
General Principles
Pipelined processing involves splitting a task into several sequential parts and
processing each in parallel with separate execution units.
• for one off tasks little advantage, but
• for repetitive tasks, can make substantial gains
Pipelining can be applied to many fields of computing, such as:
• large scale multi-processor distributed processing
• arithmetic processing using vector hardware to pipe individual vector
elements through a single high-speed arithmetic unit
• multi-stage arithmetic pipelines
• layered protocol processing
• as well as instruction execution within a processor
Overall, the task must be able to be broken into smaller sub-tasks which can be
chained together, ideally with all sub-tasks taking the same time to execute.
Choosing the best sub-division of the task is called load balancing.
Pipelined Parallelism in Instruction Processing
General Principles
[Diagram: non-pipelined processing runs stage 1, stage 2 and stage 3 of one task to
completion before starting the next task; pipelined processing overlaps them, so
stage 1 of the next task starts as soon as stage 1 of the previous task has finished]
A single instruction still takes as long, and each instruction is still performed in the
same order. The speed-up occurs when all stages are kept in operation at the same
time; starting up and draining the pipeline are less efficient.
Pipelined Parallelism in Instruction Processing
General Principles
Two clocking schemes can be incorporated in pipelining:
Synchronous
[Diagram: Task -> latch -> stage 1 -> latch -> stage 2 -> latch -> stage 3 -> Results,
with a global Clock driving all of the staging latches]
Operates using a global clock, which indicates when each stage of the pipeline
should pass its result on to the next stage.
The clock must run at the rate of the slowest element in the pipeline when presented
with its most time-consuming data.
To de-couple the stages, they are separated by staging latches.
Pipelined Parallelism in Instruction Processing
General Principles
Asynchronous
[Diagram: Task -> latch -> stage 1 -> latch -> stage 2 -> latch -> stage 3 -> Results,
with ready/acknowledge handshake signals between adjacent stages]
In this case the stages of the pipeline run independently of each other.
Two stages synchronise only when a result has to pass from one to the other.
A little more complicated to design than the synchronous scheme, but with the
benefit that each stage can run in the time it actually needs rather than the
worst-case time.
Using a FIFO buffer instead of a latch between stages allows results to queue for
each stage.
Pipelined Parallelism in Instruction Processing
Pipelining for Instruction Processing
Processing a stream of instructions can be performed in a pipeline
Individual instructions can be executed in a number of distinct phases:
  Fetch               Read the instruction from memory
  Decode instruction  Inspect the instruction - how many operands, how and where
                      will it be executed
  Address generate    Calculate the addresses of registers and memory locations to
                      be accessed
  Load operand        Read operands stored in memory - might read register operands
                      or set up pathways between registers and functional units
  Execute             Drive the ALU, shifter, FPU and other components
  Store operand       Store the result of the previous stage
  Update PC           The PC must be updated for the next fetch operation
No processor would implement all of these as separate stages. The most common
minimum would be Fetch and Execute.
Pipelined Parallelism in Instruction Processing
Overlapping Fetch & Execute Phases
Fetch - involves memory activity (slow) can be overlapped with Decode and
Execute.
In RISC only 2 instructions access memory - LOAD and STORE - the
remainder operate on registers so for most instructions only Fetch needs
memory bus.
On starting the processor the Fetch unit gets an instruction from memory
At the end of the cycle the instruction just read is passed to the Execute unit
While the Execute unit is performing the operation Fetch is getting next
instruction (provided Execute doesn’t need to use the memory as well)
This and other contention can be resolved by:
• Extending the clashing cycle to give time for both memory accesses to take
place - hesitation - requires synchronous clock to be delayed
• Providing multi-port access to main memory (or cache) so that access can
happen in parallel. Memory interleaving may help.
• Widening data bus so that 2 instructions are fetched with each Fetch
• Use a Harvard memory architecture - separate instruction and data bus
Pipelined Parallelism in Instruction Processing
Overlapping Fetch & Execute Phases
  Fetch #1 | Fetch #2   | Fetch #3   |
           | Execute #1 | Execute #2 | Execute #3                        ---> time

Overlapping Fetch, Decode & Execute Phases
  Fetch #1 | Fetch #2  | Fetch #3   |
           | Decode #1 | Decode #2  | Decode #3  |
           |           | Execute #1 | Execute #2 | Execute #3            ---> time
Pipelined Parallelism in Instruction Processing
Overlapping Fetch, Decode & Execute Phases
There are benefits to extending the pipeline to more than 2 stages even though more hardware is needed
A 3-stage pipeline splits the instruction processing into Fetch, Decode
and Execute.
The Fetch stage operates as before.
The Decode stage decodes the instruction and calculates any memory
addresses used in the Execute
The Execute stage controls the ALU and writes result back to a register
- and can perform LOAD and STORE accesses.
The Decode stage is guaranteed not to need a memory access. Thus
memory contention is no worse than in the 2 stage version.
Longer Pipelines of 5, 7 or more stages are possible and depend on the
complexity of hardware and instruction set.
Pipelined Parallelism in Instruction Processing
The Effect of Branch Instructions
One of the biggest problems with pipelining is the effect of a branch
instruction.
A branch is Fetched as usual and the target address Decoded. The
Execute stage then has the task of deciding whether or not to branch
and so changing the PC.
By this time the PC has already been used at least once by the Fetch
(and with a separate Decode maybe twice).
The effect of changing the PC is that all data in the pipeline following
the branch must be flushed.
Branches are common in some types of program (up to 10% of
instructions). So benefits of pipelining can be lost for 10% of
instructions and incur reloading overhead.
A number of schemes exist to avoid this flushing:
Pipelined Parallelism in Instruction Processing
• Delayed Branching – Sun SPARC
instead of branching as soon as a branch instruction has been decided, the
branch is modified to “Execute n more instructions before jumping to
the instruction specified” - used with n chosen to be 1 smaller than the
number of stages in pipeline. So that in a 2 stage pipeline, instead of the
loop:
a; b; c; a; b; c; ….. (where c is the branch instruction back to a)
in that order, the code could be stored as:
a; c; b; a; c; b; ……
in this case a is executed, then the decision to jump back to a, but before
the jump happens b is executed.
the delayed jump at c enables b - which has already been fetched when
evaluating c to be used rather than thrown away.
must be careful when operating instructions out of sequence and the
machine code becomes difficult to understand.
a good compiler can hide all of this and in about 70% of cases can be
implemented easily.
Pipelined Parallelism in Instruction Processing
• Delayed Branching
Consider the following code fragment running on a 3-stage pipeline:
    loop:   RA = RB ADD RC
            RD = RB SUB RC
            RE = RB MULT RC
            RF = RB DIV RC
            BNZ RA, loop

    Cycle    Fetch    Decode   Execute
    1        ADD      -        -
    2        SUB      ADD      -
    3        MULT     SUB      ADD
    4        DIV      MULT     SUB
    5        BNZ      DIV      MULT
    6        next     BNZ      DIV
    7        next2    next     BNZ (updates PC)
    8(=1)    ADD      -        -

The pipeline has to be flushed to remove the two incorrectly fetched instructions,
and the code repeats every 7 cycles.
Pipelined Parallelism in Instruction Processing
• Delayed Branching
We can invoke the delayed branching behaviour of DBNZ and re-order 2
instructions (if possible) from earlier in the loop:
    loop:   RA = RB ADD RC
            RD = RB SUB RC
            DBNZ RA, loop
            RE = RB MULT RC
            RF = RB DIV RC

    Cycle    Fetch    Decode   Execute
    1        ADD      -        -
    2        SUB      ADD      -
    3        DBNZ     SUB      ADD
    4        MULT     DBNZ     SUB
    5        DIV      MULT     DBNZ (updates PC)
    6(=1)    ADD      DIV      MULT
    7        SUB      ADD      DIV
    8(=3)    DBNZ     SUB      ADD

The loop now executes every 5 processor cycles - no instructions are fetched and
then left unused.
Pipelined Parallelism in Instruction Processing
• Instruction Buffers – IBM PowerPC
When a branch is found in early stage of pipeline, the Fetch unit can be
made to start fetching both future instructions into separate buffers and
start decoding both, before branch is executed. A number of
difficulties with this:
– it imposes an extra load on instruction memory
– requires extra hardware - duplication of decode and fetch
– becomes difficult to exploit fully if several branches follow closely - each fork will require a separate pair of instruction buffers
– early duplicated stages cannot fetch different values to the same
register, so register fetches may have to be delayed - pipeline stalling(?)
– duplicate pipeline stages must not write (memory or registers) unless
mechanism for reversing changes is included (if branch not taken)
Pipelined Parallelism in Instruction Processing
• Branch Prediction – Intel Pentium
When a branch is executed destination address chosen can be kept in
cache. When Fetch stage detects a branch, it can prime itself with a
next-program-counter value looked up in the cached table of previous
destinations for a branch at this instruction.
If the branch is made (at execution stage) in the same direction as before,
then pipeline already contains the correct prefetched instructions and
does not need to be flushed.
More complex schemes could even use a most-frequently-taken strategy to
guess where the next branch from any particular instruction is likely to
go and reduce the pipeline flush still further.
[Diagram: a look-up table of (instruction address, target address) pairs sits alongside
the PC; the PC searches the table and, if a target address is found, it is loaded into
the fetch unit, which otherwise fetches sequentially from memory into the Fetch,
Decode and Execute stages]
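A sketch of the scheme: a small direct-mapped table of previously taken branch
targets, consulted by the fetch stage. The table size, the 4-byte instruction step and
the function names are assumptions, not the Pentium's actual design:

    #include <stdint.h>
    #include <stdbool.h>

    #define BTB_BITS 6
    #define BTB_SIZE (1u << BTB_BITS)

    struct btb_entry { bool valid; uint32_t pc, target; };
    static struct btb_entry btb[BTB_SIZE];

    /* Called by the fetch stage: returns the predicted next PC. */
    uint32_t predict_next_pc(uint32_t pc)
    {
        struct btb_entry *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
        return (e->valid && e->pc == pc) ? e->target : pc + 4;
    }

    /* Called by the execute stage once the branch outcome is known. */
    void record_branch(uint32_t pc, bool taken, uint32_t target)
    {
        struct btb_entry *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
        if (taken)
            *e = (struct btb_entry){ true, pc, target };
        else if (e->valid && e->pc == pc)
            e->valid = false;          /* stop predicting a not-taken branch */
    }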
Pipelined Parallelism in Instruction Processing
• Dependence of Instructions on others which have not completed
Instructions cannot be reliably fetched while any previous branch instruction is
incomplete - the PC is updated too late for the next fetch
– Similar problem occurs with memory and registers.
– memory case can be solved by ensuring that all memory accesses are
atomically performed in a single Execute stage - get data only when
needed.
– but what if the memory just written contains a new instruction which
has already been prefetched? (self modifying code)
In a long pipeline, several stages may read from a particular register and
several may write to the same register.
– Hazards occur when the order of access to operands is changed by the
pipeline
– various methods may be used to prevent data from different stages
getting confused in the pipeline.
Consider 2 sequential instructions i, j, and a 3 stage pipeline.
Possible hazards are:
Pipelined Parallelism in Instruction Processing
• Read-after-write Hazards
When j tries to read a source before i writes it, j incorrectly gets the old value
– a direct consequence of pipelining conventional instructions
– occurs when a register is read very shortly after it has been updated
– value in register is correct
Example:    R1 = R2 ADD R3
            R4 = R1 MULT R5

    Cycle  Fetch   Decode               Execute          Comments
    1      ADD     -                    -
    2      MULT    ADD fetches R2,R3    -
    3      next1   MULT fetches R1,R5   ADD stores R1    register fetch probably gets the wrong value
    4      next2   next1                MULT stores R4   wrong value calculated
Pipelined Parallelism in Instruction Processing
• Write-after-write Hazards
When j tries to write an operand before i writes it, the value left by i rather than the
value written by j is left at the destination
– Occurs if the pipeline permits write from more than one stage.
– value in register is incorrect
Example:    R3 = R1 ADD R2
            R5 = R4 MULT -(R3)

    Cycle  Fetch   Decode                                Execute          Comments
    1      ADD     -                                     -
    2      MULT    ADD fetches R1,R2                     -
    3      next1   MULT fetches (R3-1),R4;               ADD stores R3    which version of R3?
                   saves R3-1 in R3
    4      next2   next1                                 MULT stores R5
Pipelined Parallelism in Instruction Processing
• Write-after-read Hazards
When j tries to write to a register before it is read by i, i incorrectly gets the
new value
– can only happen if the pipeline provides for early (decode-stage)
writing of registers and late reading - auto-increment addressing
– the value in the register is correct
Example
A realistic example is difficult in this case for several reasons.
• Firstly memory accessing introduces dependencies for the
data in the read case, or stalls due to bus activity in the write
case
• A long pipeline with early writing and late reading of
registers is rather untypical……..
• Read-after-read Hazards
These are not a hazard - multiple reads always return the same value…….
Pipelined Parallelism in Instruction Processing
• Detecting Hazards
several techniques - normally resulting in some stage of the pipeline being
stopped for a cycle - can be used to overcome these hazards.
They all depend on detecting register usage dependencies between
instructions in the pipeline.
An automated method of managing register accesses is needed
Most common detection scheme is scoreboarding
Scoreboarding
– keep a 1-bit tag with each register
– clear all tags when the machine is booted
– the tag is set by the Fetch or Decode stage when an instruction is going to change
  a register
– when the change is complete the tag bit is cleared
– if an instruction is decoded which wants a tagged register, the instruction is not
  allowed to access it until the tag is cleared
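A sketch of 1-bit scoreboarding as just described: one "busy" tag per register, set
when an in-flight instruction will write the register and cleared when the write
completes (register count and function names are illustrative):

    #include <stdbool.h>

    #define NREGS 32
    static bool busy[NREGS];                    /* one tag bit per register      */

    /* Decode stage: may this instruction issue yet, or must it stall? */
    bool can_issue(int src1, int src2, int dest)
    {
        return !busy[src1] && !busy[src2] && !busy[dest];
    }

    void issue(int dest)    { busy[dest] = true;  }   /* tag the destination     */
    void complete(int dest) { busy[dest] = false; }   /* writeback clears the tag */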
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding
Hazards will always be a possibility, particularly in long pipelines.
Some can be avoided by providing an alternative pathway for data from a
previous cycle but not written back in time:
[Diagram: each of the two ALU inputs is fed through a multiplexer (Mpx) that selects
either a register value read from the register file or, via the bypass paths, the ALU
result from the previous cycle; the normal register write path still updates the
register file]
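A sketch of the bypass-multiplexer decision: if the register wanted by the instruction
now reading operands is the one the previous instruction is still computing, take the
value from the forwarding path instead of the register file (the structure and names
are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    struct in_flight { int dest; uint32_t value; };   /* result not yet written back */

    uint32_t read_operand(int reg, const uint32_t regfile[],
                          const struct in_flight *fwd)
    {
        if (fwd != NULL && fwd->dest == reg)
            return fwd->value;                        /* bypass path (Mpx selects it) */
        return regfile[reg];                          /* normal register read         */
    }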
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding - Example
    R1 = R2 ADD R3
    R4 = R1 SUB R5
    R6 = R1 AND R7
    R8 = R1 OR R9
    R10 = R1 XOR R11          on a 5-stage pipeline - no forwarding pathways

 Cycle  Fetch   Decode/regs                   ALU               Memory          Writeback
 1      ADD     -                             -                 -               -
 2      SUB     ADD read R2,R3                -                 -               -
 3      AND     SUB read R5, R1 (not ready)   ADD compute R1    -               -
 4      -       SUB read R1 (not ready)       -                 ADD pass R1     -
 5      -       SUB read R1 (not ready)       -                 -               ADD store R1
 6      -       SUB read R1                   -                 -               -
 7      OR      AND read R1, R7               SUB compute R4    -               -
 8      XOR     OR read R1, R9                AND compute R6    SUB pass R4     -
 9      next1   XOR read R1, R11              OR compute R8     AND pass R6     SUB store R4
 10     next2   next1                         XOR compute R10   OR pass R8      AND store R6
 11     next3   next2                         next1             XOR pass R10    OR store R8
 12     next4   next3                         next2             next1           XOR store R10
 13     next5   next4                         next3             next2           next1
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding - Example
    R1 = R2 ADD R3
    R4 = R1 SUB R5
    R6 = R1 AND R7
    R8 = R1 OR R9
    R10 = R1 XOR R11          on a 5-stage pipeline - no forwarding pathways
                              BUT: registers are read in the second half of a cycle
                              and written in the first half of a cycle

 Cycle  Fetch   Decode/regs                   ALU               Memory          Writeback
 1      ADD     -                             -                 -               -
 2      SUB     ADD read R2,R3                -                 -               -
 3      AND     SUB read R5, R1 (not ready)   ADD compute R1    -               -
 4      -       SUB read R1 (not ready)       -                 ADD pass R1     -
 5      -       SUB read R1                   -                 -               ADD store R1
 6      OR      AND read R1, R7               SUB compute R4    -               -
 7      XOR     OR read R1, R9                AND compute R6    SUB pass R4     -
 8      next1   XOR read R1, R11              OR compute R8     AND pass R6     SUB store R4
 9      next2   next1                         XOR compute R10   OR pass R8      AND store R6
 10     next3   next2                         next1             XOR pass R10    OR store R8
 11     next4   next3                         next2             next1           XOR store R10
 12     next5   next4                         next3             next2           next1
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding - Example
    R1 = R2 ADD R3
    R4 = R1 SUB R5
    R6 = R1 AND R7
    R8 = R1 OR R9
    R10 = R1 XOR R11          on a 5-stage pipeline - with full forwarding

 Cycle  Fetch   Decode/regs                   ALU               Memory          Writeback
 1      ADD     -                             -                 -               -
 2      SUB     ADD read R2,R3                -                 -               -
 3      AND     SUB read R5, R1 (from ALU)    ADD compute R1    -               -
 4      OR      AND read R1 (from ALU), R7    SUB compute R4    ADD pass R1     -
 5      XOR     OR read R1 (from ALU), R9     AND compute R6    SUB pass R4     ADD store R1
 6      next1   XOR read R1, R11              OR compute R8     AND pass R6     SUB store R4
 7      next2   next1                         XOR compute R10   OR pass R8      AND store R6
 8      next3   next2                         next1             XOR pass R10    OR store R8
 9      next4   next3                         next2             next1           XOR store R10
 10     next5   next4                         next3             next2           next1

In this case the forwarding prevents any pipeline stalls.
Pipelined Parallelism in Instruction Processing
• Characteristics of Memory Store Operations
Example - using the 5-stage pipeline as before, with a store in the sequence:
    R1 = R2 ADD R3
    25(R1) = R1               (store to main memory)

 Cycle  Fetch   Decode/regs                     ALU                   Memory                Writeback
 1      ADD     -                               -                     -                     -
 2      STORE   ADD read R2,R3                  -                     -                     -
 3      next1   STORE read R1 (not ready -      ADD compute R1        -                     -
                R1 taken from ALU)
 4      next2   next1                           STORE compute R1+25   ADD pass R1           -
 5      stall   next2                           next1                 STORE R1 -> (R1+25)   ADD store R1
                                                                      (wait for memory)
 6      next3   next2                           next1                 -                     STORE (null)

Since STORE is an output operation, it does not create register-based hazards.
It might create memory-based hazards, which may be avoided by instruction
re-ordering or store-fetch avoidance techniques - see the next section.
Pipelined Parallelism in Instruction Processing
• Forwarding during Memory Load Operations
Example - using the 5-stage pipeline as before, with a load in the sequence:
    R1 = 32(R6)               (LOAD)
    R4 = R1 ADD R7
    R5 = R1 SUB R8
    R6 = R1 AND R7

 Cycle  Fetch   Decode/regs                    ALU                        Memory              Writeback
 1      LOAD    -                              -                          -                   -
 2      ADD     LOAD read R6                   -                          -                   -
 3      SUB     ADD read R7, R1 (not ready)    LOAD compute R6+32         -                   -
 4      stall   ADD R1 (not ready)             -                          LOAD read (R6+32)   -
 5      AND     SUB read R8, R1 (from Mem)     ADD compute R4 (R1 from Mem) -                 LOAD store R1
 6      next1   AND read R7, R1                SUB compute R5             ADD pass R4         -
 7      next2   next1                          AND compute R6             SUB pass R5         ADD store R4
 8      next3   next2                          next1                      AND pass R6         SUB store R5
 9      next4   next3                          next2                      next1               AND store R6

In this case the result of the LOAD must be forwarded to the earlier ALU stage, and
to the even earlier DECODE stage.
Pipelined Parallelism in Instruction Processing
• Forwarding (Optimisation) Applied to Memory Operations
– Store-Fetch Forwarding - where a word stored and then loaded by another
  instruction further back in the pipeline can be piped directly, without the need to
  be passed into and out of that memory location: e.g.
      MOV [200],AX   ;copy AX to memory      transforms to:   MOV [200],AX
      ADD BX,[200]   ;add memory to BX                        ADD BX,AX
– Fetch-Fetch Forwarding - where words loaded twice in successive instructions
  may be loaded together - or loaded once from memory and then copied register to
  register:
      MOV AX,[200]   ;copy memory to AX      transforms to:   MOV AX,[200]
      MOV BX,[200]   ;copy memory to BX                       MOV BX,AX
– Store-Store Overwriting - where two stores to the same location with no
  intervening load can be reduced to the final store:
      MOV [200],AX   ;copy AX to memory      transforms to:   MOV [200],BX
      MOV [200],BX   ;copy BX to memory
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Instruction Re-ordering
Because hazards and data dependencies cause pipeline stalls, removing them can
improve performance. Re-ordering instructions is often the simplest technique.
Consider a program to calculate the sum of n*a^n for n = 1 to 100, on a 3-stage pipeline:
    loop:   RT = RA EXP RN
            RT = RT MULT RN
            RS = RS ADD RT
            RN = RN SUB 1
            BNZ RN, loop

 Cycle    Fetch    Decode                          Execute
 1        EXP      -                               -
 2        MULT     EXP read RA,RN                  -
 3        ADD      MULT read RN,RT (not ready)     EXP store RT
 4        -        MULT read RN,RT                 -
 5        SUB      ADD read RS,RT (not ready)      MULT store RT
 6        -        ADD read RS,RT                  -
 7        BNZ      SUB read RN,1                   ADD store RS
 8        -        BNZ read RN (not ready)         SUB store RN
 9        next1    BNZ read RN                     -
 10       next2    next1                           BNZ store PC
 11(=1)   EXP      flushed                         flushed

Needs 10 cycles per loop iteration..
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Instruction Re-ordering
Re-order the sum and decrement instructions (these 2 swapped):
    loop:   RT = RA EXP RN
            RT = RT MULT RN
            RN = RN SUB 1        <- swapped
            RS = RS ADD RT       <- swapped
            BNZ RN, loop

 Cycle    Fetch    Decode                          Execute
 1        EXP      -                               -
 2        MULT     EXP read RA,RN                  -
 3        SUB      MULT read RN,RT (not ready)     EXP store RT
 4        -        MULT read RN,RT                 -
 5        ADD      SUB read RN,1                   MULT store RT
 6        BNZ      ADD read RS,RT                  SUB store RN
 7        next1    BNZ read RN                     ADD store RS
 8        next2    next1                           BNZ store PC
 9(=1)    EXP      flushed                         flushed

Needs 8 cycles per loop iteration.
It can only be made better with forwarding - to remove the remaining RT dependency.
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
The unrolling of loops is a conventional technique for increasing
performance. It works especially well in pipelined systems:
– start with a tight program loop
– re-organise the loop construct so that the loop is traversed half (or a
third, quarter etc.) as many times
– re-write the code body so that it performs two (3, 4) times as much
work in each loop
– Optimise the new code body
In the case of pipeline execution, the code body gains from:
– a greater chance of benefiting from delayed branching
– fewer increments of the loop variable
– instruction re-ordering avoids pipeline stalls
– parallelism is exposed - useful for vector and VLIW architectures
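A minimal C sketch of the idea (the array and function names are made up for
illustration, not from the lecture): the loop is traversed a quarter as many times
and its body does four elements' worth of work, with four independent partial
sums that a pipelined (or VLIW) machine can overlap.

#include <stdio.h>

/* Rolled version: 100 iterations, one element per pass. */
static double sum_rolled(const double a[100]) {
    double s = 0.0;
    for (int i = 0; i < 100; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: 25 iterations, four independent partial sums. */
static double sum_unrolled(const double a[100]) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < 100; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double a[100];
    for (int i = 0; i < 100; i++) a[i] = i + 1;
    printf("%f %f\n", sum_rolled(a), sum_unrolled(a));
    return 0;
}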
3/16/2016
EE3.cma - Computer Architecture
203
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate Σ array(n) for n = 1 to 100 using a Harvard architecture and forwarding

      R2 = 0
      R1 = 100
loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop

Cycle | Fetch | Decode                     | ALU               | Memory      | Writeback
1     | LOAD  |                            |                   |             |
2     | SUB   | LOAD read R1               |                   |             |
3     | ADD   | SUB read R1,1              | LOAD R1+0         |             |
4     | BNZ   | ADD read R2,R3 (not ready) | SUB R1-1          | LOAD (R1+0) |
5     | next1 | BNZ read R1                | ADD R2+R3,R3(Mem) | SUB pass R1 | LOAD store R3
6     | next2 | next1                      | BNZ R1(from ALU)  | ADD pass R2 | SUB store R1
7     | next3 | next2                      | next1             | BNZ pass R1 | ADD store R2
8     | next4 | next3                      | next2             | next1       | BNZ store PC
9     | LOAD  | -                          | -                 | -           | -

800 cycles to complete all loops
Code is difficult to write in optimal form - too short to implement delayed
branching - forwarding prevents stalling and performing decrement early hides
some of the memory latency
3/16/2016
EE3.cma - Computer Architecture
204
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate Σ array(n) for n = 1 to 100 using a Harvard architecture and forwarding
R2 = 0
R1 = 100
loop: R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
BNZ R1, loop
Unrolling the loop body:
loop: R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
BNZ R1, loop
Re-label registers and re-order:
loop: R3 = LOAD array(R1)
      R4 = LOAD array-1(R1)
      R5 = LOAD array-2(R1)
      R6 = LOAD array-3(R1)
      R1 = R1 SUB 4
      DBNZ R1, loop
      R2 = R2 ADD R3
      R2 = R2 ADD R4
      R2 = R2 ADD R5
      R2 = R2 ADD R6
3/16/2016
EE3.cma - Computer Architecture
205
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate Σ array(n) for n = 1 to 100 using a Harvard architecture and forwarding
The branch has been replaced with a delayed branch - it takes effect after 4 more
instructions (5 stage pipeline)
250 cycles to complete all loops

Cycle | Fetch | Decode          | ALU                | Memory       | Writeback
1     | LOAD1 |                 |                    |              |
2     | LOAD2 | LOAD1 read R1   |                    |              |
3     | LOAD3 | LOAD2 read R1,1 | LOAD1 array+R1     |              |
4     | LOAD4 | LOAD3 read R1,2 | LOAD2 array+1+R1   | LOAD1 R3     |
5     | SUB   | LOAD4 read R1,3 | LOAD3 array+2+R1   | LOAD2 R4     | LOAD1 store R3
6     | DBNZ  | SUB read R1,4   | LOAD4 array+3+R1   | LOAD3 R5     | LOAD2 store R4
7     | ADD1  | DBNZ read R1    | SUB R1             | LOAD4 R6     | LOAD3 store R5
8     | ADD2  | ADD1 read R2,R3 | DBNZ R1(from ALU)  | SUB pass R1  | LOAD4 store R6
9     | ADD3  | ADD2 read R2,R4 | ADD1 R2            | DBNZ pass R1 | SUB store R1
10    | ADD4  | ADD3 read R2,R5 | ADD2 R2(from ALU)  | ADD1 pass R2 | DBNZ store PC
11    | LOAD1 | ADD4 read R2,R6 | ADD3 R2(from ALU)  | ADD2 pass R2 | ADD1 store R2
12    | LOAD2 | LOAD1 read R1   | ADD4 R2(from ALU)  | ADD3 pass R2 | ADD2 store R2
13    | LOAD3 | LOAD2 read R1,1 | LOAD1 array+R1     | ADD4 pass R2 | ADD3 store R2
14    | LOAD4 | LOAD3 read R1,2 | LOAD2 array+1+R1   | LOAD1 R3     | ADD4 store R2
15    | SUB   | LOAD4 read R1,3 | LOAD3 array+2+R1   | LOAD2 R4     | LOAD1 store R3

3/16/2016
EE3.cma - Computer Architecture
206
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate Σ array(n) for n = 1 to 100 using a Harvard architecture and forwarding
The original loop took 8 cycles per iteration. The unrolled version allows a
delayed branch to be implemented and performs 4 iterations in 10 cycles.
Gives an improvement of a factor of 3.2
Benefits of Loop Unrolling
– Fewer instructions (multiple decrements can be performed in one operation)
– longer loop allows delayed branch to fit
– better use of pipeline - more independent operations
– disadvantage - more registers required to obtain these results
3/16/2016
EE3.cma - Computer Architecture
207
Pipelined Parallelism in Instruction Processing
• Parallelism at the Instruction Level
Conventional instruction sets rely on encoding of register numbers, instruction type
and addressing modes to reduce volume of instruction stream
CISC processors optimise a lower-level encoding - a longer instruction word requires them to consume more instruction bits per cycle, forcing advances
such as Harvard memory architectures.
CISC architectures are still sequential processing machines - pipelining and
superscalar instruction grouping introduce a limited amount of parallelism
Parallelism can also be introduced explicitly with parallel operations in each
instruction word.
VLIW (Very Long Instruction Word) machines have instruction formats which
contain different fields, each referring to a separate functional unit in the
processor, this requires multi-ported access to registers etc.
Choice of parallel activities in a VLIW machine is made by the compiler, which
must determine when hazards exist and how to resolve them...
3/16/2016
EE3.cma - Computer Architecture
208
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Uniprocessor limits on performance
The speed of a pipelined processor (instructions per second) is limited by:
• clock frequency (e.g. 2.66 GHz for AMD) is unlikely to increase much more
• depth of pipeline. As depth increases, the work in each stage per cycle initially
decreases. But the effects of register hazards, branching etc. limit further
sub-division, and load balancing between stages gets increasingly difficult
So, why only initiate one instruction in each cycle?
Superpipelined processors double the clock frequency by pushing alternate
instructions from a conventional instruction stream to 2 parallel pipelines.
The compiler must separate instructions to run independently in the 2 streams and,
when not possible, must add NULL operations. More than 2 pipelines could be used.
The scheme is not very flexible and has been superseded by:
Superscalar processors use a conventional instruction stream, read several
instructions per cycle, and issue the decoded instructions to a number of pipelines -
2 or 3 pipelines can be kept busy this way
Very Long Instruction Word (VLIW) processors use a modified instruction set -
each instruction contains sub-instructions, each sent to a separate functional unit
3/16/2016
EE3.cma - Computer Architecture
209
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Superscalar Architectures
– fetch and decode more instructions than needed to feed a single pipeline
– launch instructions down a number of parallel pipelines in each cycle
– compilers often re-order instructions to place suitable instructions in
parallel - the details of the strategy used will have a huge effect on the
degree of parallelism achieved
– some superscalars can perform re-ordering at run time - to take advantage
of free resources
– relatively easy to expand - add another pipelined functional unit. Will run
previously compiled code, but will benefit from new compiler
– provide exceptional peak performance, but extra data requirements put
heavy demands on memory system and sustained performance might not
be much more than 2 instructions per cycle.
3/16/2016
EE3.cma - Computer Architecture
210
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Very Long Instruction Word architectures
– VLIW machines provide a number of parallel functional units
• typically 2 integer ALUs, 2 floating point units, 2 memory
access units and a branch control engine
– the units are controlled from bits in a very long instruction word - this can
be 150 bits or more in width
– needs fetching across a wide instruction bus - and hence wide memories
and cache.
– Many functional units require 2 register read ports and a register write port
– Application code must have plenty of instruction level parallelism and few
control hazards - obtainable by loop unrolling
– Compiler responsible for identifying activities to be combined into a
single VLIW.
3/16/2016
EE3.cma - Computer Architecture
211
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Hazards and Instruction Issue Matters with Multiple Pipelines
There are 3 main types of hazard:
– read after write - j tries to read an operand before i writes it, j gets the
old value
– write after write - j writes a result before i, the value left by i rather than
j is left at the destination
– write after read - j writes a result before it is read by i, i incorrectly gets
new value
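A small C fragment (the variable names are illustrative only - they stand in for
machine registers) showing the three orderings between an earlier instruction i
and a later instruction j:

#include <stdio.h>

int main(void) {
    int r1, r2 = 2, r3 = 3, r4, r5 = 5;

    /* Read after write (true dependence): j must see the value i writes. */
    r1 = r2 + r3;          /* i: writes r1 */
    r4 = r1 * 2;           /* j: reads r1  */

    /* Write after read (anti-dependence): j must not overwrite r5
       before i has read it.                                         */
    r4 = r4 + r5;          /* i: reads r5  */
    r5 = r2 - r3;          /* j: writes r5 */

    /* Write after write (output dependence): the value left in r3
       must be the one written by j, the later instruction.          */
    r3 = r2 + 1;           /* i: writes r3      */
    r3 = r4 + 1;           /* j: also writes r3 */

    printf("%d %d %d %d %d\n", r1, r2, r3, r4, r5);
    return 0;
}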
In a single-pipeline machine with in-order execution, read after write is the
only one that cannot be avoided, and it is easily solved using forwarding.
Using extra superscalar pipelines (or altering the order of instruction
completion or issue) brings all three types of hazard further into play:
3/16/2016
EE3.cma - Computer Architecture
213
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Read After Write Hazards
It is difficult to organise forwarding from one pipeline to another.
Better is to allow each pipeline to write its result values directly to any
execution unit that needs them
• Write After Read Hazards
Consider
F0 = F1 DIV F2
F3 = F0 ADD F4
F4 = F2 SUB F6
Assume that DIV takes several cycles to execute in one floating point pipeline.
Its dependency with ADD (F0) stops ADD from being executed until DIV
finishes.
BUT SUB is independent of F0 and F3 and could be executed in parallel with
DIV and could finish 1st.
If it wrote to F4 before the ADD read it then ADD would have the wrong value
3/16/2016
EE3.cma - Computer Architecture
214
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Write After Write Hazards
Consider
F0 = F1 DIV F2
F3 = F0 MUL F4
F0 = F2 SUB F6
F3 = F0 ADD F4
On a superscalar the DIV and SUB have independent operands (F2 is read twice
but not changed)
If there are 2 floating point pipelines, each could be performed at the same time.
DIV would be expected to take longer
So SUB might try and write to F0 before DIV - hence ADD might get wrong
value from F0 (MUL would be made to wait for DIV to finish, however)
We can use Scoreboarding to resolve these issues.
3/16/2016
EE3.cma - Computer Architecture
215
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Limits to Superscalar and VLIW Expansion
Only 5 operations per cycle are typical, why not 50?
– Limited Parallelism available.
• VLIW machines depend on a stream of ready-parallelised instructions.
– Many parallel VLIW instructions can only be found by unrolling loops
– if a VLIW field can not be filled in an instruction, then the functional unit will
remain idle during that cycle
• superscalar machine depends on stream of sequential instructions
– loop unrolling is also beneficial for superscalars
– Limited Hardware resources
• the cost of register read/write ports scales linearly with their number, but the complexity
of access increases as the square of the number
• extra register access complexity may lead to longer cycle times
• more memory ports needed to keep processor supplied with data
– Code Size too high
• wasted fields in VLIW instructions lead to poor code density, need for
increased memory access and overall less benefit from wide instruction bus
3/16/2016
EE3.cma - Computer Architecture
216
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Amdahl’s Law
Gene Amdahl suggested the following law for vector processors - equally appropriate
for VLIW and superscalar machines and all multiprocessor machines.
Any parallel code has sequential elements - at startup and shutdown, at the beginning
and end of each loop etc.
To find the benefit from parallelism need to consider how much is done sequentially
and in parallel.
Speedup factor can be taken as:

    S(n) = (execution time using one processor) /
           (execution time using a multiprocessor with n processors)

If the fraction of code which can not be parallelised is f and the time taken for the
computation on one processor is t, then the time taken to perform the computation
with n processors will be:

    f*t + (1 - f)*t/n

The speed up is therefore:

    S(n) = t / (f*t + (1 - f)*t/n) = n / (1 + (n - 1)*f)
(ignoring any overhead due to parallelism or communication between processors)
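A short C sketch of the formula above (the chosen values of f are illustrative):

#include <stdio.h>

/* Amdahl's law: speedup with n processors when a fraction f of the
   work cannot be parallelised.                                      */
static double speedup(double f, int n) {
    return (double)n / (1.0 + (n - 1) * f);
}

int main(void) {
    double fractions[] = { 0.05, 0.10, 0.20 };
    for (int i = 0; i < 3; i++) {
        double f = fractions[i];
        printf("f = %4.2f: S(20) = %5.2f, limit S(inf) = 1/f = %5.1f\n",
               f, speedup(f, 20), 1.0 / f);
    }
    return 0;
}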
3/16/2016
EE3.cma - Computer Architecture
217
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Amdahl’s Law
S( ) 1/f
Amdahl's Law - Speedup v No. Processors
20
• even for an infinite number of
processors maximum speed up
is given by above
f = 5%
16
f = 10%
f = 20%
14
Speedup S(n)
• Small reduction in sequential
overhead can make huge
difference in throughput
f = 0%
18
12
10
8
6
4
2
0
0
5
10
15
20
Number of Processors
3/16/2016
EE3.cma - Computer Architecture
218
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Gustafson’s Law
– A result of observation and experience.
– If you increase the size of the problem then the size (not the fraction) of
the sequential part remains the same.
– eg if we have a problem that uses a number of grid points to solve a
partial differential equation
• for 1000 grid points 10% of the code is sequential.
• might expect that for 10,000 grid points only 1% of the code will be
sequential.
• If we expand the problem to 100,000 grid points then only 0.1% of the
problem remains sequential.
– So after Amdahl’s law things start to look better again!
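The slide gives no formula here, but the usual statement of Gustafson's law
(scaled speedup, with s the sequential fraction measured on the n-processor run)
can be sketched as follows; the grid-point figures echo the example above:

#include <stdio.h>

/* Gustafson's law (standard form, not from the slides):
   scaled speedup = n - (n - 1) * s, where s is the sequential
   fraction of the time observed on the n-processor system.    */
static double scaled_speedup(double s, int n) {
    return n - (n - 1) * s;
}

int main(void) {
    /* As the problem grows, the sequential fraction shrinks. */
    printf("1000 points,   s = 0.10:  S(64) = %.1f\n", scaled_speedup(0.10, 64));
    printf("10000 points,  s = 0.01:  S(64) = %.1f\n", scaled_speedup(0.01, 64));
    printf("100000 points, s = 0.001: S(64) = %.1f\n", scaled_speedup(0.001, 64));
    return 0;
}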
3/16/2016
EE3.cma - Computer Architecture
219
Running Programs in Parallel
Running Programs in Parallel
Options for running programs in parallel include:
• Timesharing on a Uniprocessor - this is mainly multi-tasking to share a
processor rather than combining resources for a single application. Timesharing is
characterised by:
– Shared memory and semaphores
– high context-switch overheads
– limited parallelism
• Multiprocessors with shared memory - clustered computing combines several
processors communicating via shared memory and semaphores.
– Shared memory limits performance (even with caches) due to the delays when
the operating system or user processes wait for other processes to finish with
shared memory and let them have their turn.
– Four to eight processors actively communicating on a shared bus are about the
limit before access delays become unacceptable
• Multiprocessors with separate communication switching devices - INMOS
transputer and Beowulf clusters.
– each element contains a packet routing controller as well as a processor
(transputer contained both on single chip)
– messages can be sent between any process on any processor in hardware
3/16/2016
EE3.cma - Computer Architecture
221
Running Programs in Parallel
Running Programs in Parallel
Options for running programs in parallel include: (cont’d)
• Vector Processors (native and attached)
– may just be specialised pipeline engines pumping operands through heavily-pipelined, chained, floating point units.
– Or they might have enough parallel floating point units to allow vector
operands to be manipulated element-wise in parallel.
– can be integrated into otherwise fast scalar processors
– or might be co-processors which attach to general purpose processors
• Active Memory (Distributed Array Processor)
– rather than take data to the processors it is possible to take the processors to
the data, by implementing a large number of very simple processors in
association with columns of bits in memory
– thus groups of processors can be programmed to work together, manipulating
all the bits of stored words.
– All processors are fed the same instruction in a cycle by a master controller.
• Dataflow Architectures - an overall task is defined in terms of all operations
which need to be performed and all operands and intermediate results needed to
perform them. Some operations can be started immediately with initial data whilst
others must wait for the results of the first ones and so on to the result.
3/16/2016
EE3.cma - Computer Architecture
222
Running Programs in Parallel
• Semaphores:
– Lock shared resources
– Problems of deadlock and starvation
• Shared memory
– Fastest way to move information between two
processors is not to!
– Rather than sender → receiver (copying the data across), the sender and
receiver access the same memory
– Use semaphore to prevent receiver reading until sender
has finished
– Segment created outside normal process space – system
call maps it into space of requesting process
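A minimal POSIX sketch of the idea (names such as /demo_shm and /demo_sem
are invented for illustration, and error checking is omitted): a segment is created
outside the process, mapped in with mmap, and a semaphore stops the receiver
reading until the sender has finished.

/* Compile with: cc shm_demo.c -lrt -pthread (Linux) */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Create a named shared segment and map it into this process. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    /* Semaphore, initially 0: the receiver blocks until the sender posts. */
    sem_t *ready = sem_open("/demo_sem", O_CREAT, 0600, 0);

    if (fork() == 0) {                  /* receiver process */
        sem_wait(ready);                /* wait for sender  */
        printf("receiver saw: %s\n", shared);
        return 0;
    }
    strcpy(shared, "hello via shared memory");   /* sender writes  */
    sem_post(ready);                              /* then signals   */

    wait(NULL);
    sem_unlink("/demo_sem");
    shm_unlink("/demo_shm");
    return 0;
}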
[Diagram: shared segments (Segment 1, 2, 3) mapped into the address spaces of
Proc 1, Proc 2 and Proc 3]
3/16/2016
EE3.cma - Computer Architecture
223
Running Programs in Parallel
Flynn’s Classification of Computer Architectures
SISD
Single Instruction, Single Data machines are conventional uni-processors.
They read their instructions sequentially and operate on their data operands
individually. Each instruction only accesses a few operand words
SIMD Single Instruction, Multiple Data machines are typified by vector
processors. Instructions are still read sequentially but this time they each
perform work on operands which describe multi-word objects such as arrays
and vectors. These instructions might perform vector element summation,
complete matrix multiplication or the solution of a set of simultaneous
equations.
MIMD Multiple Instruction, Multiple Data machines are capable of fetching
many instructions at once, each of which performs operations on its own
operands. The architecture here is of a multiprocessor - each processor
(probably a SISD or SIMD processor) performs its own computations but
shares the results with the others. The multiprocessor sub-divides a large
task into smaller sections which are suitable for parallel solution and permits
these tasks to share earlier results
(MISD) Multiple Instruction Single Data machines are not really implementable.
One might imagine an image processing engine capable of taking an image
and performing several concurrent operations upon it...
3/16/2016
EE3.cma - Computer Architecture
224
Major Classifications of Parallelism
Introduction
Almost all parallel applications can be classified into one or more of the following:
• Algorithmic Parallelism - the algorithm is split into sections (eg pipelining)
• Geometric Parallelism - static data space is split into sections (eg process an
image on an array of processors)
• Processor Farming - the input data is passed to many processors (eg ray
tracing co-ordinates to several processors one ray at a time)
Load Balancing
There are 3 forms of load balancing
• Static Load Balancing - the choice of which processor to use for each part of
the task is made at compile time
• Semi-dynamic - the choice is made at run-time, but once started, each task
must run to completion on the chosen processor - more efficient
• Fully-dynamic load balancing - tasks can be interrupted and moved between
processors at will. This enables processors with different capabilities to be used
to best advantage. Context switching and communication costs may outweigh
the gains
3/16/2016
EE3.cma - Computer Architecture
225
Major Classifications of Parallelism
Algorithm Parallelism
• Tasks can be split so that a stream of data can be processed in successive stages
on a series of processors
• As the first stage finishes its processing the result is passed to the second stage
and the first stage accepts more input data and processes it, and so on.
• When the pipeline is full one result is produced at every cycle
• At the end of continuous operation the early stages go idle as the last results are
flushed through.
• Load balancing is static - the speed of the pipeline is determined by the speed
of the slowest stage.
[Diagrams: a linear pipeline (chain) taking data in and passing results out; a
pipeline with a parallel section; an irregular network (the general case)]
3/16/2016
EE3.cma - Computer Architecture
227
Major Classifications of Parallelism
Geometric Parallelism
• Some regular-patterned tasks can be processed by spreading their data across
several processors and performing the same task on each section in parallel
• Many examples involve image processing - pixels mapped to an array of
transputers for example
• Many such tasks involve communication of boundary data from one portion to
another - finite element calculations
• Load balancing is static - initial partitioning of data determines the time to
process each area.
• Rectangular blocks may not be the best choice - stripes, concentric squares…
• Initial loading of the data may prove to be a serious overhead
[Diagram: a data array partitioned across a grid of connected transputers]
3/16/2016
EE3.cma - Computer Architecture
228
Major Classifications of Parallelism
Geometric v Algorithmic
F(xi) = cos(sin(exp(xi*xi))) for x1, x2, … x6 using 4 processors
Algorithmic: the calculation is split into a 4-stage pipeline, one stage per processor:
    y = x*x  ->  y = e^y  ->  y = sin(y)  ->  y = cos(y)    (1 time unit per stage)
F1 is produced in 4 time units, F2 at time 5, and so on
i.e. time = 4 + (6-1) = 9 units,  speedup = 24/9 = 2.67

Geometric: each of the 4 processors computes the whole function cos(sin(e^(x*x)))
for its own data, taking 4 time units per value. The six inputs are shared out across
the 4 processors (two of them take two values each), so two passes are needed
i.e. time = 8 units,  speedup = 24/8 = 3
3/16/2016
EE3.cma - Computer Architecture
229
Major Classifications of Parallelism
Processor Farming
• Involves sharing work out from a central controller process to several worker
processes.
• The “workers” just accept packets of command data and return results.
• The “controller” splits up the tasks, sending work packets to free processors
(ones that have returned a result) and collating the results
• Global data is sent to all workers at the outset.
• Processor farming is only appropriate if:
– the task can be split into many independent sections
– the amount of communication (commands + results) is small
• To minimise latency, it might be better to keep 2 (or 3) packets in circulation
for each worker - buffers are needed
• Load balancing is semi-dynamic - the command packets are sent to processors
which have just (or are about to) run out of work. Thus all processors are kept
busy except for the closedown phase, when some finish before others.
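A compressed MPI sketch of a farm (illustrative only - the tags, the work()
function and the packet contents are invented, and it assumes at least one task
per worker): rank 0 acts as the controller, handing a new command packet to
whichever worker returns a result.

/* Minimal processor-farm sketch using MPI. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double work(int task) { return (double)task * task; }  /* stand-in job */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* controller */
        int next = 0, done = 0;
        for (int w = 1; w < size && next < NTASKS; w++, next++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
        while (done < NTASKS) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            done++;                                /* collate result here */
            if (next < NTASKS) {                   /* refill the free worker */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                         TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE,
                         TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {                                       /* worker */
        for (;;) {
            int task;
            MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}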
3/16/2016
EE3.cma - Computer Architecture
230
Major Classifications of Parallelism
Processor Farming (cont’d)
[Diagram: a processor farm - the controller sends work through outgoing routers
and buffers to the workers; return routers and buffers carry results back. Each
section sits on a separate transputer/processor]
[Diagram: a processor farm controller - it generates work packets, sends command
packets (CPU; work) to free CPUs (initially all processor numbers), receives result
packets (result; CPU) and displays the received results]
3/16/2016
EE3.cma - Computer Architecture
231
Vector Processors
Introduction
• Vector processors extend the scalar model by incorporating vector registers
within the CPU
• These registers can be operated on by special vector instructions - each performs
calculations element-wise on the vector
• Vector parallelism could enable a machine to be constructed with a row of FPUs
all driven in parallel. In practice a heavily pipelined single FPU is usually used.
• Both are classified as SIMD
A vector instruction is similar to an unrolled loop, but:
– each computation is guaranteed independent of all others - allows a deep
pipeline (allowing the cycle time to be kept short) and removes the need to
check for data hazards (within the vector)
– Instruction bandwidth is considerably reduced
– There are no control hazards (eg pipeline flushes on branches) since the
looping has been removed
– Memory access pattern is well-known - thus latency of memory access can
be countered by interleaved memory blocks and serial memory techniques
– Overlap of ALU & FPU operations, memory accesses and address
calculations are possible.
3/16/2016
EE3.cma - Computer Architecture
232
Vector Processors
Types of Vector Processors
• Vector-Register machines - vector registers, held in the CPU, are loaded and
stored using pipelined vector versions of the typical memory access instructions
• Memory-to-Memory Vector machines - operate on memory only. Pipelines of
memory accesses and FPU instructions operate together without pre-loading the
data into vector registers. (This style has been overtaken by Vector-Register
machines.)
Vector-Register Machines
Main Sections of a Vector-Register Machine are:
• The Vector Functional units - machine can have several pipelined such units,
usually dedicated to just one purpose so that they can be optimised.
• Vector Load/Store activities are usually carried out by a dedicated pipelined
memory access unit. This unit must deliver one word per cycle (at least) in order
that the FPUs are not held up. If this is the case, vector fetches may be carried out
whilst part of the vector is being fed to the FPU
• Scalar Registers and Processing Engine - conventional machine
• Instruction Scheduler
3/16/2016
EE3.cma - Computer Architecture
233
Vector Processors
The Effects of Start-up Time and Initiation Rate
• Like all pipelined systems, the time taken for a vector operation is determined
from the start-up time, the initiation (result delivery) rate and the number of
calculations performed
• The initiation rate is usually 1 - new vector elements are supplied in every
cycle
• The start-up cost is the time for one element to pass along the vector pipeline -
its depth in stages. This time is increased by the time taken to fetch data
operands from memory if they are not already in the vector registers - this can
dominate
• The number of clock cycles per vector element is then:
      cycles per result = (start-up time + n*initiation rate)/n
• The start-up time is divided amongst all of the elements and dominates for
short vectors.
• The start-up time is more significant (as a fraction of the time per result) when
the initiation rate drops to 1
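The formula above can be checked numerically; a tiny C sketch (the start-up and
initiation figures are assumed values, not from the lecture):

#include <stdio.h>

/* cycles per result = (start-up time + n * initiation rate) / n */
static double cycles_per_result(double startup, double init_rate, int n) {
    return (startup + n * init_rate) / n;
}

int main(void) {
    double startup = 12.0;  /* pipeline depth + memory latency (assumed) */
    double init    = 1.0;   /* one result per cycle once flowing         */
    int lengths[]  = { 4, 16, 64, 256 };
    for (int i = 0; i < 4; i++)
        printf("n = %3d : %.2f cycles per result\n",
               lengths[i], cycles_per_result(startup, init, lengths[i]));
    return 0;
}

For short vectors the start-up term dominates; for long vectors the cost tends to
the initiation rate.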
3/16/2016
EE3.cma - Computer Architecture
234
Vector Processors
Load/Store Behaviour
• The pipelined load/store unit must be able to sustain a memory access rate at
least as good as the initiation rate of the FPUs to avoid data starvation.
• This is especially important when chaining the two units
• Memory has a start-up overhead - access time latency - similar to the pipeline
start-up cost
• Once data starts to flow, how can a rate of one word/cycle be maintained?
– interleaving is usually used
Memory Interleaving
We need to attach multiple memory banks to the processor and operate them all in
parallel so that the overall access rate is sufficient. Two schemes are common:
• Synchronised Banks
• Independent Banks
3/16/2016
EE3.cma - Computer Architecture
235
Vector Processors
Memory Interleaving (cont’d)
Synchronised Banks
• A single memory address is passed to all memory banks, and they all access a
related word in parallel.
• Once stable, all these words are latched and are then read out sequentially
across the data bus - achieving the desired rate.
• Once the latching is complete the memories can be supplied with another
address and may start to access it.
3/16/2016
EE3.cma - Computer Architecture
236
Vector Processors
Memory Interleaving (cont’d)
Independent Banks
• If each bank of memory can be supplied with a separate address, we
obtain more flexibility - BUT must generate and supply much more
information.
• The data latches (as in synchronised case) may not be necessary, since
all data should be available at the memory interface when required.
In both cases, we require more memory banks than the number of clock
cycles taken to get information from a bank of memory
The number of banks chosen is usually a power of 2 - to simplify
addressing (but this can also be a problem - see vector strides)
3/16/2016
EE3.cma - Computer Architecture
237
Vector Processors
Variable Vector Length
In practice the length of user vectors will not match 64, 256, or whatever the machine's vector register length is
A hardware vector length register in the processor is set before each vector
operation - used also in load/store unit.
Programming Variable Length Vector Operations
Since the processor’s vector length is fixed, operations on long user vectors must
be covered by several vector instructions. This is called strip mining
Frequently, the user vector will not be a precise multiple of the machine vector
length and so one vector operation will have to compute results for a short
vector - this incurs greater set-up overheads
Consider the following:
for (j=0; j<n; j++)
x[j] = x[j] + (a * b[j]);
For a vector processor with vectors of length MAX and a vector-length register
called LEN, we need to process a number of MAX-sized chunks of x[j] and then
one section which covers the remainder:
3/16/2016
EE3.cma - Computer Architecture
239
Vector Processors
Variable Vector Length (cont’d)
start = 0;
LEN = MAX;
for (k = 0; k < n/MAX; k++) {
    for (j = start; j < start+MAX; j++) {
        x[j] = x[j] + (a * b[j]);
    }
    start = start + MAX;
}
LEN = n - start;
for (j = start; j < n; j++)
    x[j] = x[j] + (a * b[j]);
The j-loop in each case is implemented as three vector instructions - a Load, a
multiply and an add.
The time to execute the whole program is simply:
Int(n/MAX)*(sum of start-up overheads) + (n*3*1) cycles
This equation exhibits a saw-tooth shape as n increases - the efficiency drops each
time a large vector fills up and an extra 1 element vector must be used,
carrying an extra start-up overhead…
Unrolling the outer loop will be effective too...
3/16/2016
EE3.cma - Computer Architecture
240
Vector Processors
Vector Stride
Multi-dimensional arrays are stored in memory as single-dimensional vectors. In all
common languages (except Fortran) row 1 is stored next to row 0, plane 0 is stored
next to plane 1 etc….
Thus, accessing an individual row of a matrix involves reading contiguous memory
locations; these reads are easily spread across several interleaved memory banks.
Accessing a column of a matrix - the nth element in every row, say - involves
picking individual words from memory. These words are separated from each other
by x words, where x is the number of elements in each row of the matrix. x is the
stride of the matrix in this dimension. Each dimension has its own stride.
Once loaded, vector operations on columns can be carried out with no further reference
3/16/2016
EE3.cma - Computer Architecture
241
Vector Processors
Vector Stride (cont’d)
Consider multiplying 2 rectangular matrices together:
What is the memory reference pattern of a column-wise vector load?
• We step through the memory in units of our stride
What about in a memory system with j interleaved banks?
• If j is co-prime with the stride x then we visit each bank just once before re-visiting
any one again (assuming that we use the LS words address bits as bank selectors)
• If j has any common factors with x (especially if j is a factor of x) then the banks are
visited in a pattern which favours some banks and totally omits others. Since the
number of active banks is reduced, the latency of memory accesses is not hidden
and the one-cycle-per-access goal is lost. This is an example of aliasing.
Does it matter whether the interleaving uses synchronised or independent banks?
• Yes. In the synchronised case, the actual memory accesses must be timed correctly
since all the MS addresses are the same, and if the stride is wider than the
interleaving factor, only some of the word accesses will be used anyway.
• In the independent case, the separate accesses automatically happen at the right time
and to the right addresses. The load/store unit must generate the stream of addresses
in advance of the data being required, and must send each to the correct bank
A critically-banked system - interleaved banks are all used fully in a vector access
Overbanking - supplying more banks than needed, reduces danger of aliasing
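A small C sketch of the bank-selection argument (the bank count, strides and
vector length are chosen purely for illustration): with the low-order word-address
bits selecting the bank, a stride that is co-prime with the number of banks touches
every bank before repeating, while a stride sharing a factor with the bank count
keeps revisiting the same few banks - the aliasing described above.

#include <stdio.h>

/* Count how many distinct banks are touched by one vector access of
   'n' elements with a given stride, for 'banks' interleaved banks.   */
static int banks_touched(int banks, int stride, int n) {
    int hit[64] = { 0 }, distinct = 0;   /* assumes banks <= 64 */
    for (int i = 0; i < n; i++) {
        int b = (i * stride) % banks;    /* LS address bits select the bank */
        if (!hit[b]) { hit[b] = 1; distinct++; }
    }
    return distinct;
}

int main(void) {
    int banks = 8;
    printf("stride 7 (co-prime with 8): %d of %d banks used\n",
           banks_touched(banks, 7, 64), banks);
    printf("stride 4 (factor of 8):     %d of %d banks used\n",
           banks_touched(banks, 4, 64), banks);
    return 0;
}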
3/16/2016
EE3.cma - Computer Architecture
242
Vector Processors
Forwarding and Chaining
If a vector processor is required to perform multiple operations on the same vector, then
it is pointless to save the first result before reading it back to another (or the same)
functional unit
Chaining - the vector equivalent of forwarding - allows the pipelined result output of
one functional unit to be joined to the input of another
The performance of two chained operations is far greater than that of just one, since the
first operation does not have to finish before the next starts. Consider
V1 = V2 MULT V3
V4 = V1 ADD V5
The non-chained solution requires a brief
stall (4 cycles) since V1 must be fully written
back to the registers before it can be re-used.
In the chained case, the dependence between
writes to elements of V1 and their re-reading
in the ADD are compensated by the forwarding
effect of the chaining - no storage is required
prior to use.
3/16/2016
EE3.cma - Computer Architecture
243
Multi-Core Processors
• A multi-core microprocessor is one which combines two or
more independent processors into a single package, often a
single IC. A dual-core device contains only two independent
microprocessors.
• In general, multi-core microprocessors allow a computing device to exhibit
some form of parallelism without including multiple microprocessors in
separate physical packages; this is often known as chip-level multiprocessing
or CMP.
3/16/2016
EE3.cma - Computer Architecture
244
Multi-Core Processors
Commercial examples
• IBM’s POWER4, first Dual-Core module processor released in 2000.
• IBM's POWER5 dual-core chip now in production - in use in the Apple
PowerMac G5.
• Sun Microsystems UltraSPARC IV, UltraSPARC IV+, UltraSPARC T1
• AMD - dual-core Opteron processors on 22 April 2005,
– dual-core Athlon 64 X2 family, on 31 May 2005.
– And the FX-60, FX-62 and FX-64 for high performance desktops,
– and one for laptops.
• Intel's dual-core Xeon processors,
– also developing dual-core versions of its Itanium high-end CPU
– produced Pentium D, the dual core version of Pentium 4.
– A newer chip, the Core Duo, is available in the Apple Computer's iMac
• Motorola/Freescale has dual-core ICs based on the PowerPC e500 core,
and e600 and e700 cores in development.
• Microsoft's Xbox 360 game console uses a triple core PowerPC
microprocessor.
• The Cell processor, in PlayStation 3 is a 9 core design.
3/16/2016
EE3.cma - Computer Architecture
245
Multi-Core Processors
Why?
• CMOS manufacturing technology continues to improve:
– BUT reducing the size of single gates can't continue to increase clock speed
– 5km of internal interconnects in a modern processor…. the speed of light is too slow!
• Also significant heat dissipation and data synchronization problems at high rates.
• Some gain from
– Instruction Level Parallelism (ILP) - superscalar pipelining – can be used for
many applications
– Many applications better suited to Thread level Parallelism (TLP) - multiple
independent CPUs
• A combination of increased available space due to refined manufacturing
processes and the demand for increased TLP has led to multi-core CPUs.
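A minimal thread-level-parallelism sketch in C using POSIX threads (the
partial-sum example is invented for illustration): two independent threads that a
dual-core CPU can run at the same time.

/* Compile with: cc tlp.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double half_sum[2];

/* Each thread sums half of the range independently - thread-level
   parallelism that a dual-core processor can exploit directly.     */
static void *worker(void *arg) {
    int id = *(int *)arg;
    double s = 0.0;
    for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        s += i;
    half_sum[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("sum = %.0f\n", half_sum[0] + half_sum[1]);
    return 0;
}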
3/16/2016
EE3.cma - Computer Architecture
246
Multi-Core Processors
• Advantages
• Proximity of multiple CPU cores on the same die has the advantage that the
cache coherency circuitry can operate at a much higher clock rate than is possible
if the signals have to travel off-chip - combining equivalent CPUs on a single die
significantly improves the cache performance of multiple CPUs.
• Assuming that the die can fit into the package, physically, the multi-core CPU
designs require much less Printed Circuit Board (PCB) space than multi-chip
designs.
• A dual-core processor uses slightly less power than two coupled single-core
processors - fewer off chip signals, shared circuitry, like the L2 cache and the
interface to the main Bus.
• In terms of competing technologies for the available silicon die area, multi-core
design can make use of proven CPU core library designs and produce a product
with lower risk of design error than devising a new wider core design.
• Also, adding more cache suffers from diminishing returns, so better to use space
in other ways
3/16/2016
EE3.cma - Computer Architecture
247
Multi-Core Processors
• Disadvantages
• In addition to operating system (OS) support, adjustments to existing software can
be required to maximize utilization of the computing resources provided by multi-core processors.
• The ability of multi-core processors to increase application performance depends
on the use of multiple threads within applications.
– eg, most current (2006) video games will run faster on a 3 GHz single-core
processor than on a 2GHz dual-core, despite the dual-core theoretically
having more processing power, because they are incapable of efficiently using
more than one core at a time.
• Integration of a multi-core chip drives production yields down and they are more
difficult to manage thermally than lower-density single-chip designs.
• Raw processing power is not the only constraint on system performance. Two
processing cores sharing the same system bus and memory bandwidth limits the
real-world performance advantage. Even in theory, a dual-core system cannot
achieve more than a 70% performance improvement over a single core, and in
practice, will most likely achieve less
3/16/2016
EE3.cma - Computer Architecture
248
The INMOS Transputer
The Transputer
Necessary features for a message-passing microprocessor are:
• A low context-switch time
• A hardware process scheduler
• Support for communicating process model
• Normal microprocessor facilities.
Special Features of Transputers:
• high performance microprocessor
• conceived as building blocks (like transistors or logic gates)
• designed for intercommunication
• CMOS devices - low power, high noise immunity
• integrated with small supporting chip count
• provided with a hardware task scheduler - supports multi-tasking with low
overhead
• capable of sub-microsecond interrupt responses - good for control
applications
3/16/2016
EE3.cma - Computer Architecture
250
The INMOS Transputer
3/16/2016
EE3.cma - Computer Architecture
251
The INMOS Transputer
Transputer Performance
The fastest first-generation transputer (IMS T805-30) is capable of:
• up to 15 MIPS sustained
• up to 3 MFLOPs sustained
• up to 40 Mbytes/sec at the main memory interface
• up to 120 Mbytes/sec to the 4K byte on-chip memory
• up to 2.3 Mbytes/sec on each of 4 bi-directional Links
30MHz clock speed
The fastest projected second generation transputer (IMS T9000-50):
• is 5 times faster in calculation
• and 6 times faster in communication
50MHz clock speed - equivalent performance to the 100MHz Intel 486
3/16/2016
EE3.cma - Computer Architecture
252
The INMOS Transputer
Low Chip Count
To run using internal RAM a T805 transputer only requires:
• a 5MHz clock
• a 5V power supply at about 150mA
• a power-on-reset or external reset input
• an incoming Link to supply boot code and sink results
Expansion possibilities
• 32K*32 SRAM (4 devices) require 3 support chips
• 8 support devices will support 8Mbytes of DRAM with optimal timing
• Extra PALs will directly implement 8-bit memory mapped I/O ports or
timing logic for conventional peripheral devices (Ethernet, SCSI, etc)
• Link adapters can be used for limited expansion to avoid memory mapping
• TRAMs (transputers plus peripherals) can be used as very high-level
building blocks
3/16/2016
EE3.cma - Computer Architecture
253
The INMOS Transputer
Transputer Processes
Software running on a transputer is made up from one or more sequential
processes, which are run in parallel and communicate with each other
periodically
Software running on many interconnected transputers is simply a group of
parallel processes - just the same as if all code were running on a single
processor
Processes can be reasoned about individually; rules exist which allow the
overall effect of parallel processes to be reasoned about too.
The benefits of breaking a task into separate processes include:
• Taking advantage of parallel hardware
• Taking advantage of parallelism on a single processor
• Breaking the task into separately-programmed sections
• Easy implementation of buffers and data management code which runs
asynchronously with the main processing sections
3/16/2016
EE3.cma - Computer Architecture
254
The INMOS Transputer
Transputer Registers
The transputer implements a stack of 3 hardware registers and is able to execute
0-address instructions.
It also has a few one-address instructions which are used for memory access.
All instructions and data operands are built up in 4-bit sections using an
Operand register and two special instructions, Prefix and Negative Prefix.
Extra registers are used to store the head and tail pointers to two linked lists of
process workspace headers - these make up the high and low priority run-time
process queues. The hardware scheduler takes a new process from one of these
queues whenever it suspends the current process (due to time-slicing or
communication)
3/16/2016
EE3.cma - Computer Architecture
255
The INMOS Transputer
Action on Context Switching
Each process runs until it communicates,
is time-sliced or is pre-empted by a higher
priority process. Time-slices occur at the
next descheduling point - approx 2ms.
Pre-emption can occur at any time.
[Diagram: a process workspace holding the PROC's local variables, the saved
program counter and a pointer used to chain workspaces in the queue]
At a context switch the following happens:
• The PC of the stopping process is saved in its workspace at word WSP-1
• The process pointed to by the processor’s BPtr1 is changed to point to the
stopping process’s WSP
• On a pre-emptive context switch (only) the registers in the ALU and FPU
may need saving
• The process pointed to by FPtr1 is unlinked from the process queue, has its
stored PC value loaded into the processor and starts executing
A context switch takes about 1µs. This translates to an interrupt rate of about
1,000,000 per second.
3/16/2016
EE3.cma - Computer Architecture
256
The INMOS Transputer
Joining Transputers Together
Three routing configurations are possible:
• static - nearest neighbour communications
• any-to-any routing across static configurations
• dynamic configuration with specialised routing devices
Static Configurations
Can be connected together in fixed configurations and are characterised by:
• Number of nodes
• Valency - number of interconnecting arcs (per processor)
• Diameter - maximum number of arcs traversed from point to point
• Latency - time for a message to pass across the network
• point-to-point bandwidth - message flow rate along a route
Structures
3/16/2016
EE3.cma - Computer Architecture
257
Beowulf Clusters
Introduction
Mass market competition has driven down prices of subsystems:
processors, motherboards, disks, network cards etc.
Development of publicly available software :
Linux, GNU compilers, PVM and MPI libraries
PVM - Parallel Virtual Machine (allows many inter-linked machines to be
combined as one parallel machine)
MPI - Message Passing Interface (similar to PVM)
High Performance Computing groups have many years of experience working
with parallel algorithms.
History of MIMD computing shows many academic groups and commercial
vendors building machines based on “state-of-the-art” processors, BUT
always needed special “glue” chips or one-of-a-kind interconnection
schemes.
Leads to interesting research and new ideas, but often results in one off
machines with a short life cycle.
Leads to vendor specific code (to use vendor specific connections)
Beowulf uses standard bits and Linux operating system (with MPI - or PVM)
3/16/2016
EE3.cma - Computer Architecture
259
Beowulf Clusters
Introduction
First Beowulf was built in 1994 with 16 DX4 processors and
a 10Mbit/s Ethernet.
Processors were too fast for a single Ethernet
Ethernet switches were still much too expensive to use
more than one.
So they re-wrote the linux ethernet drivers and built a channel bonded Ethernet
– network traffic was striped across 2 or more ethernets
As 100Mb/s ethernet and switches have become cheap less need for channel
bonding. This can support sixteen 200MHz P6 processors…
The best configuration continues to change. But this does not affect the user.
With the robustness of MPI, PVM, Linux (Extreme) and GNU compilers
programmers have the confidence that what they are writing today will still
work on future Beowulf clusters.
In 1997 CalTech’s 140 node cluster ran a problem sustaining a 10.9 Gflop/s rate
3/16/2016
EE3.cma - Computer Architecture
260
Beowulf Clusters
The Future
Beowulf clusters are not quite Massively Parallel Processors
like the Cray T3D as MPPs are typically bigger and have a
lower network latency and a lot of work must be done by the
programmer to balance the system.
But the cost effectiveness is such that many people are
developing do-it-yourself approaches to HPC and building their own
clusters.
A large number of computer companies are taking these machines very
seriously and offering full clusters.
2002 – 2096 processor linux cluster comes in as 5th fastest computer in the
world…
2005 – 4800 2.2GHz powerPC cluster is #5 – 42.14TFlops
40960 1.4GHz itanium is #2 – 114.67 TFlops
65536 0.7GHz powerPC is #1 – 183.5TFlops
5000 Opteron (AMD - Cray) is #10 – 20 TFlops
3/16/2016
EE3.cma - Computer Architecture
261
Fastest Super computers – June 2006
Rank | Site | Computer | Processors | Year | Rmax (GFlop/s) | Rpeak (GFlop/s)
1  | LLNL US | Blue Gene – IBM | 131072 | 2005 | 280600 | 367000
2  | IBM US | Blue Gene – IBM | 40960 | 2005 | 91290 | 114688
3  | LLNL US | ASCI Purple – IBM | 12208 | 2006 | 75760 | 92781
4  | NASA US | Columbia – SGI | 10160 | 2004 | 51870 | 60960
5  | CEA, France | Tera 10, Bull SA | 8704 | 2006 | 42900 | 55705.6
6  | Sandia US | Thunderbird – Dell | 9024 | 2006 | 38270 | 64972.8
7  | GSIC, Japan | TSUBAME - NEC/Sun | 10368 | 2006 | 38180 | 49868.8
8  | Julich, Germany | Blue Gene – IBM | 16384 | 2006 | 37330 | 45875
9  | Sandia, US | Red Storm - Cray Inc. | 10880 | 2005 | 36190 | 43520
10 | Earth Simulator, Japan | Earth-Simulator, NEC | 5120 | 2002 | 35860 | 40960
11 | Barcelona Super Computer Centre, Spain | MareNostrum – IBM | 4800 | 2005 | 27910 | 42144
12 | ASTRON/University Groningen, Netherlands | Stella (Blue Gene) – IBM | 12288 | 2005 | 27450 | 34406.4
13 | Oak Ridge, US | Jaguar - Cray Inc. | 5200 | 2005 | 20527 | 24960
14 | LLNL, US | Thunder - Digital Corporation | 4096 | 2004 | 19940 | 22938
15 | Computational Biology Research Center, Japan | Blue Protein (Blue Gene) – IBM | 8192 | 2005 | 18200 | 22937.6
16 | Ecole Polytechnique, Switzerland | Blue Gene - IBM | 8192 | 2005 | 18200 | 22937.6
17 | High Energy Accelerator Research Organization, Japan | KEK/BG Sakura (Blue Gene) – IBM | 8192 | 2006 | 18200 | 22937.6
18 | High Energy Accelerator Research Organization, Japan | KEK/BG Momo (Blue Gene) – IBM | 8192 | 2006 | 18200 | 22937.6
19 | IBM Rochester, On Demand Deep Computing Center, US | Blue Gene - IBM | 8192 | 2006 | 18200 | 22937.6
20 | ERDC MSRC, United States | Cray XT3 - Cray Inc. | 4096 | 2005 | 16975 | 21299
3/16/2016
EE3.cma - Computer Architecture
262
Shared Memory Systems
Introduction
The earliest form of co-operating processors used shared memory as the
communication medium
Shared memory involves:
• connecting the buses of several processors together so that either:
– all memory accesses for all processors share the bus; or
– just inter-processor communication accesses share the common memory
Clearly the latter involves less contention
Shared memory systems typically operate under control of a single operating
system either:
• with one master processor and several slaves; or
• with all processors running separate copies of the OS, maintaining a common
set of VM and process tables.
3/16/2016
EE3.cma - Computer Architecture
275
Shared Memory Systems
The Shared-Memory Programming Model
Ideally a programmer wants each process to have access to a contiguous area of
memory - how is unimportant
Somewhere in the memory map will be sections of memory which are also
accessible by other processes.
[Diagram: a processor's memory map containing a shared region]
How do we implement this? We certainly need caches (for speed) and VM,
secondary storage etc. (for flexibility)
Notice that cache consistency issues are introduced as soon as multiple caches
are provided.
[Diagram: several processors, each with a local cache, sharing a virtual address
space backed by main memory and secondary memory]
3/16/2016
EE3.cma - Computer Architecture
276
Shared Memory Systems
Common Bus Structures
A timeshared common bus arrangement can provide the interconnection required:
[Diagram: several microprocessors (mP), memory modules (M) and I/O devices
attached to a single timeshared bus]
A common bus provides:
• contention resolution between the processors
• limited bandwidth, shared by all processors
• single-point failure modes
• cheap(ish) hardware - although speed requirements and complex wiring add to
expense
• easy, but non-scalable, expansion
3/16/2016
EE3.cma - Computer Architecture
277
Shared Memory Systems
Common Bus Structures (cont’d)
Adding caches, extra buses (making a crossbar arrangement) and mutiport memory
can help
[Diagram: a crossbar of buses connecting several microprocessors (each with its
own cache) to multiple memory modules and I/O devices]
3/16/2016
EE3.cma - Computer Architecture
278
Shared Memory Systems
Kendall Square Research KSR1
One of the most recent shared memory architectures is the Kendall Square
Research KSR1, which implements the virtual memory model across multiple
memories, using a layered cacheing scheme.
The KSR1 processors are proprietary:
• 64-bit superscalar, issues 1 integer and 2 chained FP instructions per 50ns
cycle, giving a peak integer and FP performance of 20MIPS / 40 MFLOPs
• Each Processor has 256Kbytes of local instruction cache and 256Kbytes of
local data cache
• There is a 40bit global addressing scheme
1088 (32*34) processors can be attached in the current KSR1 architecture
Main memory comprises 32Mbytes DRAM per Processor Environment, connected
in a hierarchical cached scheme.
If a page is not held in one of the 32Mbyte caches it is stored on secondary memory
(disc - as with any other system)
3/16/2016
EE3.cma - Computer Architecture
279
Shared Memory Systems
KSR1 Processor Interconnection
The KSR1 architecture connects the caches on each processor with a special
memory controller called the Allcache Engine. Several such memory
controllers can be connected
[Diagram: a level-0 Allcache ring - the level-0 router directory connects cell
interconnect units, each with a 32 MB main cache, a 256 kB cache and a
microprocessor (mP)]
3/16/2016
EE3.cma - Computer Architecture
280
Shared Memory Systems
KSR1 Processor Interconnection (cont’d)
The Allcache Engine at the lowest level (level-0) provides:
• connections to all the 32Mbyte caches on the processor cells
• Up to 32 processors may be present in each ring
The level-0 Allcache Engine Features:
• a 16-bit wide slotted ring, which synchronously passes packets between the
interconnect cells (ie every path can carry a packet simultaneously)
• Each ring carries 8 million packets per second
• Each packet contains a 16-byte header and 128 bytes of data
• This gives the total throughput of 1Gbyte per second
• Each router directory contains an entry for each sub-page held in the main
cache memory (below)
• Requests for a sub-page are made by the cell interconnect unit, passed around
the ring and satisfied by data if it is found in the other level-0 caches.
3/16/2016
EE3.cma - Computer Architecture
281
Shared Memory Systems
KSR1 Processor Interconnection (cont’d)
KSR1 Higher Level Routers
In order to connect more than 32 processors, a second layer of routing is needed.
This contains up to 34 Allcache router directory cells, plus the main level-1
directory which permits connection to level 2.
[Diagram: the ring hierarchy - level-0 connects 32 processors, level-1 connects
1088 processors, and a level-2 would be unaffordable with minimal bandwidth
per processor]
3/16/2016
EE3.cma - Computer Architecture
282
Shared Memory Systems
KSR1 Processor Interconnection (cont’d)
The Level-1 Allcache Ring
The routing directories in level 1 Allcache engine contain copies of the entries in
the lower level tables, so that requests may be sent downwards for sub-page
information as well as upwards - the Level-1 table is therefore very large
The higher level packet pipelines carry 4 Gbytes per second of inter-cell traffic
[Diagram: the level-1 Allcache ring - the level-1 router directory holds a copy of
each level-0 Allcache Router Directory (ARD 0), linking the level-0 ring
directories together]
3/16/2016
EE3.cma - Computer Architecture
283
Shared Memory Systems
KSR1 Performance
As with all multi-processor machines, maximum performance is obtained when
there is no communication
The layered KSR1 architecture does not scale linearly in bandwidth or latency as
processors are added:
Relative Bandwidths
unit           | bandwidth (MByte/s) | shared amongst | fraction (MByte/s)
256 k subcache | 160                 | 1 PE           | 160
32MB cache     | 90                  | 1 PE           | 90
level-0 ring   | 1000                | 32 PEs         | 31
level-1 ring   | 4000                | 1088 PEs       | 3.7

Relative Latencies
Location          | Latency (cycles)
subcache          | 2
cache             | 18
ring 0            | 150
ring 1            | 500
page fault (disc) | 400,000

Copied (read-only) sub-pages reside in more than one cache and thus provide
low-latency access to constant information.
3/16/2016
EE3.cma - Computer Architecture
284
Shared Memory Systems
KSR1 Performance - how did it succeed?
Like most other parallel architectures, it relies on locality
Locality justifies the workings of:
• Virtual memory systems (working sets)
• Caches (hit rates)
• Interprocess connection networks
Kendall Square Research claim that the locality present in massively-parallel
programs can be exploited by their architecture.
1991 - 2nd commercial machine is installed in Manchester Regional Computer
Centre
1994 - upgraded to 64bit version
1998 - Kendall Square Research went out of business, patents transferred to SUN
microsystems
3/16/2016
EE3.cma - Computer Architecture
285
The Cray T3D
[Photos: Cray SV1, water-cooled T3E and T3D systems]
Introduction
The Cray T3D is the successor to several generations of conventional vector
processors. The T3D has been replaced by the newer T3E, but the two are much the same.
The T3E (with 512 processors) is capable of 0.4 TFlops
The SV1ex (unveiled 7/11/00) is capable of 1.8 TFLOPs with 1000 processors - normally
delivered as 8-32 processor machines
3/16/2016
EE3.cma - Computer Architecture
286
The Cray T3D
Introduction
Like every other manufacturer, Cray would like to deliver:
• 1000+ processors with GFLOPs performance
• 10s of Gbytes/s per processor of communication bandwidth
• 100ns interprocessor latency
……they can’t afford to - just yet…….
They have tried to achieve these goals by:
• MIMD - multiple co-operating processors will beat small numbers of
intercommunicating ones (even vector supercomputers)
• Distributed memory
• Communication at the memory-access level, keeping latency short and packet
size small
• A scalable communications network
• Commercial processors (DEC Alpha)
3/16/2016
EE3.cma - Computer Architecture
287
The Cray T3D
The T3D Network
After simulation the T3D network was chosen to be a 3D torus (as is the T3E)
Note:
config          | max latency | average latency
8-node ring     | 4 hops      | 2 hops
2D, 4*2 torus   | 3 hops      | 1.5 hops
3D, 2*2*2 torus | 2 hops      | 1 hop
[Diagrams: 8-node ring, 4*2 2D torus (= cube), 4*4 2D torus and hyper-cube]
3/16/2016
EE3.cma - Computer Architecture
288
The Cray T3D
T3D Macro-architecture
The T3D designers have decided that the programmer’s view of the architecture
should include:
• globally-addressed physically-distributed memory characteristics
• visible topological relationships between PEs
• synchronisation features visible from a high level
Their goal is led by the need to provide a slowly-changing view (to the
programmer) from one hardware generation to the next.
T3D Micro-architecture
Rather than choosing to develop their own processor, Cray selected the DEC Alpha
processor:
• 0.75 µm CMOS RISC processor core
• 64 bit bus
• 150MHz, 150 MFLOPS, 300MIPS (3 instructions/cycle)
• 32 integer and 32 FP registers
• 8Kbytes instruction and 8Kbytes data caches
• 43 bit virtual address space
3/16/2016
EE3.cma - Computer Architecture
289
The Cray T3D
Latency Hiding
The DEC Alpha has a FETCH instruction which allows memory to be loaded into
the cache before it is required in an algorithm.
This runs asynchronously with the processor
16 FETCHes may be in progress at once - they are FIFO queued
When data is received, it is slotted into the FIFO, ready for access by the processor
The processor stalls if data is not available at the head of the FIFO when needed
Stores do not have a latency - they can proceed independently of the processor
(data dependencies permitting)
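The toy model below (my own sketch, not Cray code; the 100-cycle latency is an assumed figure) shows how issuing FETCHes ahead of use through a 16-deep FIFO hides most of the memory latency:

# A toy model of FETCH-style latency hiding with a 16-deep FIFO of
# outstanding prefetches. All figures here are illustrative assumptions.
from collections import deque

MEMORY_LATENCY = 100   # assumed remote-access latency, in cycles
FIFO_DEPTH = 16        # up to 16 outstanding FETCHes on the T3D

def cycles_to_consume(n_items, distance):
    """Consume n_items values, issuing FETCHes `distance` iterations ahead.
    Counts 1 cycle of work per item plus any stall waiting for the FIFO head."""
    fifo = deque()     # arrival cycle of each outstanding prefetch, in issue order
    clock = 0
    issued = 0
    for i in range(n_items):
        # Keep the FIFO topped up, `distance` items ahead of the consumer.
        while issued < n_items and issued < i + distance and len(fifo) < FIFO_DEPTH:
            fifo.append(clock + MEMORY_LATENCY)
            issued += 1
        arrival = fifo.popleft()
        clock = max(clock, arrival) + 1   # stall if the head has not arrived yet
    return clock

print(cycles_to_consume(1000, distance=1))    # almost no overlap: ~100 cycles per item
print(cycles_to_consume(1000, distance=16))   # deep prefetch hides most of the latency

With a prefetch distance of 1 the consumer pays almost the full latency on every item; prefetching 16 ahead overlaps the transfers and the cost per item drops to a few cycles.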
Synchronisation
Barrier Synchronisation
• no process may advance beyond the barrier until all processes have arrived
• used as a break between 2 blocks of code with data dependencies
• supported in hardware - 16 special registers; bits are set to 1 on barrier creation,
cleared to 0 by each arriving process, with a hardware interrupt on completion (see the
sketch below)
Messaging (a form of synchronisation)
• T3D exchanges 32-byte messages + 32-byte control header
• Messages are queued at target PE, returned to sender PE’s queue if full
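A minimal software sketch of the barrier register described above (an assumed illustration, not Cray's implementation): one bit per process, set on creation, cleared on arrival, with completion signalled when the register reaches zero - the point at which the real hardware raises its interrupt.

# Software model of a hardware barrier register (illustrative only).
class BarrierRegister:
    def __init__(self, num_processes: int):
        # All bits set to 1 on barrier creation.
        self.bits = (1 << num_processes) - 1

    def arrive(self, process_id: int) -> bool:
        """Clear this process's bit; return True when all processes have arrived
        (the point at which the real hardware would raise an interrupt)."""
        self.bits &= ~(1 << process_id)
        return self.bits == 0

barrier = BarrierRegister(num_processes=4)
for pid in range(4):
    done = barrier.arrive(pid)
print("all arrived:", done)   # True once every process has cleared its bit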
3/16/2016
EE3.cma - Computer Architecture
290
The Connection Machine Family
Introduction
The Connection Machine family of supercomputers has developed since the first
descriptions were published in 1981. Today the CM-5 is one of the fastest
available supercomputers.
In 1981 the philosophy of the CM founders was for a machine capable of
sequential program execution, but where each instruction was spread across lots
of processors.
The CM-1 had 65,536 processors organised in a layer between two communicating
planes:
[Diagram: the Host drives a broadcast control network feeding a plane of 65,536 cells
(each cell a P + M pair); the cells are interconnected by a hyper-cube data network.
P = single-bit processor, M = 4kbit memory; Total Memory = 32Mbytes]
3/16/2016
EE3.cma - Computer Architecture
291
The Connection Machine Family
Introduction (cont’d)
Each single-bit processor can:
• perform single-bit calculations
• transfer data to its neighbours or via the data network
• be enabled or disabled (for each operation) by the control network and its own
stored data
The major lessons learnt from this machine were:
• A new programming model was needed - that of virtual processors. One
“processor” could be used per data element and a number of data elements
combined onto actual processors. The massive concurrency makes
programming and compiler design clearer
• 32Mbytes was not enough memory (even in 1985!)
• It was too expensive for AI - but physicists wanted the raw processing power
3/16/2016
EE3.cma - Computer Architecture
292
The Connection Machine Family
The Connection Machine 2
This was an enlarged CM-1 with several enhancements. It had:
• 256kbit DRAM per CPU
• Clusters of 32 bit-serial processors augmented by a floating-point chip (2048 in total)
• Parallel I/O added to the processors - 40 discs (RAID - Redundant Array of
Inexpensive Disks) and a graphics frame buffer
In addition, multiple hosts could be added to support multiple users; the plane of small
processors could be partitioned.
Architectural Lessons:
• Programmers used a high-level language (Fortran 90) rather than a lower-level
parallel language. F90 contains array operators, which provide the parallelism
directly. The term data parallel was coined for this style of computation
• Array operators compiled into instructions sent to separate vector or bit processors
• This SIMD programming model gives synchronisation between data elements in
each instruction, but a MIMD processor engine doesn't need such constraints
• Differences between shared (single address space) and distributed memory blur.
• Data network now carries messages which correspond to memory accesses
• The compiler places memory and computations optimally, but statically
• Multiple hosts are awkward compared with a single timesharing host
3/16/2016
EE3.cma - Computer Architecture
293
The Connection Machine Family
The Connection Machine 5
This architecture is more orthogonal than the earlier ones. It uses larger multi-bit
processors, but a communication architecture similar to the CM-1 and CM-2.
Design Goals were:
• > 1 TFLOPs
• Several Tbytes of memory
• > 1 Tbit/s of I/O bandwidth
[Diagram: host (H) and worker (W) processors, together with I/O nodes, are all attached
to a broadcast control network and a hyper-cube data network.
Hosts (H) and worker (W) processors are identical (hosts have more memory).]
3/16/2016
EE3.cma - Computer Architecture
294
The Connection Machine Family
The CM-5 Processor
To save on development effort, the CM-5 designers used a standard SPARC RISC processor
for all the hosts and workers. RISC CPUs are optimised for workstations, so extra
hardware and fast memory paths were added.
Each Node has:
• 32Mbytes memory
• A Network interface
• Vector processors capable of up to 128 MFLOPS
• Vector-to-Memory bandwidth of 0.5Gbytes/s
Caching doesn’t really work here.
[Diagram: a CM-5 node - the SPARC CPU with its cache and the network/I/O interface
share the main 64-bit bus with four vector processors; the vector processors have
0.5Gbyte/s ports into the node's 32 Mbytes of memory.]
3/16/2016
EE3.cma - Computer Architecture
295
Fastest Supercomputers – June 2006
Rank | Site | Computer | Processors | Year | Rmax (GFlop/s) | Rpeak (GFlop/s)
1  | LLNL, US | Blue Gene - IBM | 131072 | 2005 | 280600 | 367000
2  | IBM, US | Blue Gene - IBM | 40960 | 2005 | 91290 | 114688
3  | LLNL, US | ASCI Purple - IBM | 12208 | 2006 | 75760 | 92781
4  | NASA, US | Columbia - SGI | 10160 | 2004 | 51870 | 60960
5  | CEA, France | Tera 10 - Bull SA | 8704 | 2006 | 42900 | 55705.6
6  | Sandia, US | Thunderbird - Dell | 9024 | 2006 | 38270 | 64972.8
7  | GSIC, Japan | TSUBAME - NEC/Sun | 10368 | 2006 | 38180 | 49868.8
8  | Julich, Germany | Blue Gene - IBM | 16384 | 2006 | 37330 | 45875
9  | Sandia, US | Red Storm - Cray Inc. | 10880 | 2005 | 36190 | 43520
10 | Earth Simulator, Japan | Earth-Simulator - NEC | 5120 | 2002 | 35860 | 40960
11 | Barcelona Supercomputer Centre, Spain | MareNostrum - IBM | 4800 | 2005 | 27910 | 42144
12 | ASTRON/University Groningen, Netherlands | Stella (Blue Gene) - IBM | 12288 | 2005 | 27450 | 34406.4
13 | Oak Ridge, US | Jaguar - Cray Inc. | 5200 | 2005 | 20527 | 24960
14 | LLNL, US | Thunder - Digital Corporation | 4096 | 2004 | 19940 | 22938
15 | Computational Biology Research Center, Japan | Blue Protein (Blue Gene) - IBM | 8192 | 2005 | 18200 | 22937.6
16 | Ecole Polytechnique, Switzerland | Blue Gene - IBM | 8192 | 2005 | 18200 | 22937.6
17 | High Energy Accelerator Research Organization, Japan | KEK/BG Sakura (Blue Gene) - IBM | 8192 | 2006 | 18200 | 22937.6
18 | High Energy Accelerator Research Organization, Japan | KEK/BG Momo (Blue Gene) - IBM | 8192 | 2006 | 18200 | 22937.6
19 | IBM Rochester, On Demand Deep Computing Center, US | Blue Gene - IBM | 8192 | 2006 | 18200 | 22937.6
20 | ERDC MSRC, US | Cray XT3 - Cray Inc. | 4096 | 2005 | 16975 | 21299
3/16/2016
EE3.cma - Computer Architecture
296
History of Supercomputers
1966/7: Michael Flynn’s Taxonomy & Amdahl’s Law
1976: Cray Research delivers 1st Cray-1 to LANL
1982: Fujitsu ships 1st VP200 vector machine ~500MFlops
1985: CM-1 demonstrated to DARPA
1988: Intel delivers iPSC/2 hypercubes
1990: Intel produces iPSC/860 hypercubes
1991: CM5 announced
1992: KSR1 delivered
1992: Maspar delivers its SIMD machine – MP2
1993: Cray delivers Cray T3D
1993: IBM delivers SP1
1994: SGI Power Challenge
1996: Hitachi Parallel System
1997: SGI/Cray Origin 2000 delivered to LANL - 0.7 TFlops
1997: Intel Paragon (ASCI Red) 2.3 TFlops to Sandia Nat Lab
1998: Cray T3E delivered to US military - 0.9 TFlops
2000: IBM (ASCI White) 7.2 Tflops to Lawrence Livermore NL
2002: HP (ASCI Q) 7.8 Tflops to Los Alamos Nat Lab
2002: NEC Earth Simulator Japan 36TFlops
2002: 5th fastest machine in the world is a Linux cluster (2304 processors)
3/16/2016
EE3.cma - Computer Architecture
297
History of Supercomputers
3/16/2016
EE3.cma - Computer Architecture
298
3/16/2016
EE3.cma - Computer Architecture
299
The fundamentals of Computing have remained unchanged for 70
years
• Despite all the rapid development of computers during that time, little has
changed since Turing and von Neumann
Quantum Computers are Potentially different.
• They employ Quantum Mechanical principles that expand the
range of operations possible on a classical computer.
• Three main differences between classical and Quantum
computers are:
• Fundamental unit of information is a qubit
• Range of logical operations
• Process of determining the state of the computer
3/16/2016
EE3.cma - Computer Architecture
300
Qubits
Classical computers are built from bits
two states: 0 or 1
Quantum computers are built from qubits
Physical systems which possess states analogous to 0 and 1,
but which can also be in states between 0 and 1
The intermediate states are known as superposition states
A qubit – in a sense – can store much more information
than a bit
3/16/2016
EE3.cma - Computer Architecture
301
Range of logical operations
Classical computers operate according to binary logic
Quantum logic gates take one or more qubits as input and produce one
or more qubits as output.
Qubits have states corresponding to 0 and 1, so quantum logic gates
can emulate classical logic gates.
With superposition states between 0 and 1 there is a great expansion in
the range of quantum logic gates.
• e.g. quantum logic gates that take 0 and 1 as input and produce as
output different superposition states between 0 and 1 – no
classical analogue
This expanded range of quantum gates can be exploited to achieve
greater information processing power in quantum computers
3/16/2016
EE3.cma - Computer Architecture
302
Determining the State of the Computer
In classical computers we can read out the state of all the bits in the
computer at any time
In a Quantum computer it is in principle impossible to determine
the exact state of the computer.
i.e. we can’t determine exactly which superposition state is being
stored in the qubits making up the computer
We can only obtain partial information about the state of the
computer
Designing algorithms is a delicate balance between exploiting the
expanded range of states and logical operations and the restricted
readout of information.
3/16/2016
EE3.cma - Computer Architecture
303
What actually happens?
[Diagram (a): a single photon meets a beam-splitter with Detector A and Detector B on
the two output paths. There is an equal probability of the photon reaching A or B.
Does the photon travel each path at random?]
What actually happens here?
[Diagram (b): two beam-splitters and two mirrors form an interferometer with Detector A
and Detector B at the output. If the path lengths are the same, photons always hit A:
a single photon travels both routes simultaneously.]
3/16/2016
EE3.cma - Computer Architecture
304
Photons travel both paths simultaneously.
If we block either of the paths then A or B become equally probable
This is quantum interference; it applies not just to photons but to all
particles and physical systems.
Quantum computation is all about making this effect work for us.
In this case the photon is in a coherent superposition of being on
both paths at the same time.
Any qubit can be prepared in a superposition of two logical states – a
qubit can store both 0 and 1 simultaneously, and in arbitrary
proportions.
Any quantum system with at least two discrete states can be used as
a qubit – e.g. energy levels in an atom, photons, trapped ions, spins
of atomic nuclei…..
3/16/2016
EE3.cma - Computer Architecture
305
Once the qubit is measured, however, only one of the two values it
stores can be detected at random – just like the photon is detected on
only one of the two paths.
Not very useful – but….
Consider a traditional 3-bit register: it can represent 8 different
numbers, 000 - 111.
A quantum register of 3 qubits can represent 8 numbers at the same
time in quantum superposition. The bigger the register the more
numbers we can represent at the same time.
A 250 qubit register could hold more numbers than there are atoms in
the known universe – all on 250 atoms…..
But we only see one of these if we measure the register's contents.
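A minimal illustration (my own, not from the notes) of why an L-qubit register spans 2^L values: its state is a vector of 2^L amplitudes, one per classical value, and a measurement returns value k with probability |amplitude[k]|^2.

# L qubits -> a state vector of 2^L amplitudes (illustrative sketch).
L = 3
N = 2 ** L                                   # 8 basis states: 000 ... 111

amplitudes = [1 / N ** 0.5] * N              # equal superposition of all 8 values
probabilities = [abs(a) ** 2 for a in amplitudes]

print(f"{N} values held at once; each seen with probability {probabilities[0]:.3f}")
# Measuring collapses the register: we observe just ONE of the 8 values, at random.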
We can now do some real quantum computation…..
3/16/2016
EE3.cma - Computer Architecture
306
Mathematical Operations can be performed at the same time on all
the numbers held in the register.
If the qubits are atoms then tuned laser pulses can affect their
electronic states so that initial superpositions of numbers evolve
into different superpositions.
Basically a massively parallel computation
Can perform a calculation on 2^L numbers in a single step, which
would take 2^L steps or processors in a conventional architecture
Only good for certain types of computation….
NOT information storage - it can hold many states at once but we can
only see one of them
Quantum interference allows us to obtain a single result that
depends logically on all 2^L of the intermediate results
3/16/2016
EE3.cma - Computer Architecture
307
Grover’s Algorithm
Searches an unsorted list of N items in only √N steps.
Conventionally this scales as N/2 – by brute force searching. The quantum computer
can search them all at the same time.
BUT if the QC is merely programmed to print out the result at that point it will not be
any faster than a conventional system.
Only one of the N paths checks the entry we are looking for, so a measurement of the
computer's state would give the correct answer with probability only 1/N - we would
need about as many attempts as a classical search.
BUT if we leave the information in the computer, unmeasured, a further quantum
operation can cause the information to affect other paths. If we repeat the operation
√N times, a measurement will return information about which entry contains the
desired number with a probability of 0.5. Repeating just a few more times will find the
entry with a probability extremely close to 1.
This can be turned into a very useful tool for searching, minimisation or evaluation
of the mean.
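The comparison below (illustrative only) shows the scaling claimed above: a classical brute-force search examines about N/2 entries on average, while Grover's algorithm needs about √N quantum operations.

# Classical ~N/2 lookups vs Grover ~sqrt(N) operations (illustrative only).
for N in (1_000, 1_000_000, 1_000_000_000):
    print(f"N = {N:>13,}   classical ~ {N // 2:>13,}   Grover ~ {round(N ** 0.5):>7,}")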
3/16/2016
EE3.cma - Computer Architecture
308
Cryptanalysis
The biggest use of quantum computing is in cracking encrypted data.
Cracking DES (the Data Encryption Standard) requires a search among 2^56 keys.
Conventionally, even at 1M keys/s, this takes more than 1000 years.
A QC using Grover's algorithm could do it in about 4 minutes.
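A back-of-envelope check of those DES figures (my own arithmetic, assuming 10^6 key tests per second in both cases) comes out close to the numbers quoted above:

# Rough check of the DES claims, assuming 10^6 key tests per second.
RATE = 1_000_000                       # assumed key tests per second
KEYS = 2 ** 56                         # DES key space

classical_seconds = (KEYS / 2) / RATE  # brute force examines half the keys on average
grover_seconds = KEYS ** 0.5 / RATE    # Grover needs ~sqrt(N) evaluations

print(f"classical: ~{classical_seconds / (3600 * 24 * 365):.0f} years")   # ~1142 years
print(f"Grover:    ~{grover_seconds / 60:.1f} minutes")                   # ~4.5 minutes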
Factorisation is the key to the RSA encryption system.
Conventionally the time taken to factorise a number increases
exponentially with the number of digits.
The largest number ever factorised contained 129 digits.
There is no way to factorise a 1000-digit number - conventionally…..
A QC could do this in a fraction of a second.
Already a big worry for data security - it may be only a matter of a few years
before this becomes available.
3/16/2016
EE3.cma - Computer Architecture
309
Decoherence: the obstacle to quantum computation
For a qubit to work successfully it must remain in an entangled quantum
superposition of states.
As soon as we measure the state it collapses to a single value.
This happens even if we make the measurement by accident.
[Diagrams: a double-slit source with and without a spin placed next to the left slit]
In a conventional double-slit experiment, the wave amplitudes corresponding
to an electron (or photon) travelling along the two possible paths interfere. If
another particle with spin is placed close to the left slit, a passing electron will
flip the spin. This "accidentally" records which path the electron took and
causes the loss of the interference pattern.
3/16/2016
EE3.cma - Computer Architecture
310
Decoherence: the obstacle to quantum computation
In reality it is very difficult to prevent qubits from interacting with the rest of
the world.
The best solution (so far) is to build quantum computers with fault-tolerant
designs using error-correction procedures.
The result is that we need more qubits - between 2 and 5 times the
number needed in an "ideal world".
3/16/2016
EE3.cma - Computer Architecture
311