Comp326 Review
by Mamta Patel, May 2003

3 driving forces behind architecture innovations:
o technology, applications, programming languages/paradigms

von Neumann model
o Non-deterministic
o Has side-effects – due to multiple assignments
o Inherently sequential
o Imperative languages
o Separation btwn data and control – control flow

Dataflow model
o Deterministic (not non-deterministic)
o No side-effects (side-effect-free) – single assignment
o Explicitly parallel/concurrent (concurrency is explicit)
o Functional languages
o Only data space (no control space) – no control flow
o Not general-purpose enough
o A single actor per thread is too fine-grained to manage at run-time – thus,
impractical as an execution architecture (synchronization overhead)

Moore’s Law: #devices (transistors) on a chip doubles every 2 years

Amdahl’s Law: speedup = exec time_old/exec time_new = 1/[(1 – frac_enh) + frac_enh/speedup_enh]
o frac_enh = the fraction of execution time that is enhanceable (always <= 1)
o speedup_enh = the gain/speedup factor of the enhanceable portion (always > 1)
o Amdahl’s Law ignores synchronization overhead
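
To make the formula concrete, here is a minimal C check; the 80%/4x figures are made-up examples, not from the notes:

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction of the work is enhanced. */
double amdahl(double frac_enh, double speedup_enh) {
    return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh);
}

int main(void) {
    /* Hypothetical: 80% of execution is enhanceable, and runs 4x faster. */
    printf("speedup = %.3f\n", amdahl(0.80, 4.0));  /* 1/(0.2 + 0.8/4) = 2.5 */
    return 0;
}
```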

latency and throughput are not inversely related
o if no concurrency, then the 2 are inversely related
o otherwise, the 2 are independent parameters

CPU time = CPI*n/clock rate; n = #instructions in program

MIPS = clock rate/(CPI*10^6) = instruction count/(exec time*10^6)

CPI is instruction-dependent, program-dependent, machine-dependent, benchmark-dependent

problems with MIPS:
o it is instruction-set dependent
o it varies with diff programs on same computer
o it can be inversely proportional to actual performance
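
A small worked example tying the two formulas together; all machine parameters are hypothetical:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical machine: every number below is made up for illustration. */
    double n          = 2e9;    /* instruction count */
    double cpi        = 1.5;    /* average clocks per instruction */
    double clock_rate = 500e6;  /* 500 MHz */

    double cpu_time = cpi * n / clock_rate;      /* seconds */
    double mips     = clock_rate / (cpi * 1e6);  /* first form of the formula */

    printf("CPU time = %.2f s, MIPS = %.1f\n", cpu_time, mips);
    /* second form agrees: instruction count / (exec time * 10^6) */
    printf("check: %.1f\n", n / (cpu_time * 1e6));
    return 0;
}
```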

3 main reasons for emergence of GPR (general-purpose register) architectures:
o registers are faster than memory
o registers are more efficient for a compiler to use than other forms of internal
storage
o registers can be used to hold variables (reduced memory traffic, and hence
latency due to memory accesses)

Instruction-Set Architectures

Stack
- Advantages:
o simple model
o good code density (generates short instructions)
o simple address format
- Disadvantages:
o inherently sequential data structure
o programming may be complex (lots of stack overhead)
o stack bottleneck (organization is too limiting)
o lack of random access (memory can’t be accessed randomly)

Register-Register
- Advantages:
o simple, fixed-length instructions
o simple code generation model (separation of concerns)
o instructions take similar #clocks to execute
o tolerates high degrees of latency
o supports higher degree of concurrency
o makes it possible to do hardware optimizations
- Disadvantages:
o higher instruction count than models with memory references in instructions
o more instructions & lower instruction density => larger programs

Register-Memory
- Advantages:
o data can be accessed w/o a separate load instruction first
o instruction format easy to encode
o good code density
- Disadvantages:
o CPI varies depending on operand location
o operands not equivalent since the source operand in a binary operation is destroyed
o encoding a register number & a memory address in each instruction may restrict #registers
Addressing Modes
- Advantages:
o can reduce instruction count
- Disadvantages:
o complicates CPU design
o incurs runtime overhead
o some are rarely used by compilers
GCD test: for a loop that writes x[a*i + b] and reads x[c*i + d], if a loop-carried dependence exists, then gcd(a,c) | (d – b)
o ie. if gcd(a,c) does not divide (d – b), then no loop-carried dependence exists
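
A sketch of the test in C, using the usual textbook access forms x[a*i + b] (write) and x[c*i + d] (read); the sample coefficients are made up:

```c
#include <stdio.h>

static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* If gcd(a,c) does not divide (d - b), no loop-carried dependence exists.
   Divisibility is only necessary, not sufficient, for a dependence. */
static int may_have_dependence(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

int main(void) {
    /* x[2*i] = ...; ... = x[2*i + 1];  gcd(2,2)=2 does not divide 1 */
    printf("%s\n", may_have_dependence(2, 0, 2, 1)
                       ? "maybe dependent" : "independent");
    return 0;
}
```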
Techniques to Remove Dependences and Increase Concurrency
 rescheduling – shuffle the instructions around; put them in delay slots if possible

 loop unrolling – don’t forget to rename registers when you unroll (rename AS NEEDED); see the sketch after this list
 software pipelining – make pipe segments, reverse their order and rename IF NECESSARY
o rename registers to get rid of WAR/WAW dependencies
o unroll the loop a couple of times
o select instructions for the pipe segments
o reverse the order of the instructions and adjust the displacements in the LD (or SD) instructions
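
A C-level analogue of unrolling with renaming (a sketch, not from the notes): the four accumulators play the role of renamed registers, removing the serial WAR/WAW chain a single accumulator would create.

```c
/* Sum an array, unrolled by 4. Each accumulator is an independent "register",
   so the four adds per iteration can proceed concurrently. */
double sum_unrolled(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```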
 dynamic scheduling – allows out-of-order completion of instructions
o Scoreboard
o Tomasulo’s Algorithm
Scoreboard
- centralized
- stages: issue, read operands, execute, write-back
- table format: FU, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
- delays issue on WAW hazards and structural hazards
- delays RAW hazards until resolved
- delays write-back on WAR hazards
- limited by:
o amt of parallelism available among instructions
o #scoreboard entries
o # and types of FUs
o presence of WAR and WAW hazards

Tomasulo’s Algorithm
- distributed
- stages: issue, execute, write-back
- table format: FU, Busy, Op, Vj, Vk, Qj, Qk, A
- reservation stations (provide register renaming)
- common data bus (which has serialized access)
- handles WAR and WAW hazards by using register renaming
- delays issue on structural hazards
- delays RAW hazards until resolved
Vectors
 execution time of a length-n vector operation:
o Tn = ceil(n/MVL)*(Tloop + Tstartup) + n*Tchime
o n = actual vector length; MVL = max vector length
o Tloop = time to execute the scalar code in the loop; Tstartup = flush time for all convoys
o Tchime = #chimes/convoys (= 1 when we use chaining)
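
A quick calculation with the formula above; the timing parameters are invented for illustration:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Illustrative numbers only. */
    double n       = 200.0;  /* vector length */
    double mvl     = 64.0;   /* max vector length */
    double t_loop  = 15.0;   /* scalar loop overhead, cycles */
    double t_start = 49.0;   /* startup (flush) time for all convoys, cycles */
    double t_chime = 3.0;    /* #chimes (convoys) */

    double t = ceil(n / mvl) * (t_loop + t_start) + n * t_chime;
    printf("T_%d = %.0f cycles\n", (int)n, t);  /* 4*(15+49) + 200*3 = 856 */
    return 0;
}
```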
 don’t forget to consider the chaining overhead in your calculations

 WAR and WAW hazards (false dependencies) drastically hurt vectorization, since they serialize the processing of elements
 stripmining = break down a vector of length n into subvectors of size <= 64 (MVL)
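A stripmined loop written out in C (a sketch, assuming MVL = 64):

```c
#define MVL 64  /* maximum vector length */

/* Stripmining: process a length-n vector in chunks of at most MVL elements. */
void vadd(double *a, const double *b, const double *c, int n) {
    for (int lo = 0; lo < n; lo += MVL) {
        int len = (n - lo < MVL) ? n - lo : MVL;  /* last strip may be short */
        for (int i = lo; i < lo + len; i++)       /* one vector operation */
            a[i] = b[i] + c[i];
    }
}
```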
 vector stride = “distance” between two successive vector elements
 #memory banks (servers) actually activated = n/gcd(n, vector stride)
o n = #banks in an n-way interleaved memory
o the vector stride affects the vector access of load and store operations
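
A small C check of the bank-activation formula, assuming an 8-way interleaved memory:

```c
#include <stdio.h>

static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

int main(void) {
    int banks = 8;  /* 8-way interleaved memory (illustrative) */
    for (int stride = 1; stride <= 8; stride++)
        printf("stride %d: %d of %d banks active\n",
               stride, banks / gcd(banks, stride), banks);
    return 0;   /* eg. stride 2 -> 4 banks; stride 8 -> only 1 bank */
}
```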
Cache
 miss penalty affected by:
o main memory bandwidth/concurrency
o write policy
o cache line size
 hit ratio affected by:
o cache management policy
o cache size
o program behaviour (temporal/spatial locality)
 cache size affected by:
o technology advancement
o program behaviour (temporal/spatial locality)
o hit rate
 4 questions in Memory Hierarchy:
o placement (mapping)
 direct – block maps to only 1 line
 fully-associative – block maps to any line
 set-associative – block maps to lines in selected set

o addressing
 direct – check 1 tag
 fully-associative – check all tags (expensive)
 set-associative – check tags of selected set
o replacement policy – random, LRU, FIFO
o write policy
 write through – easier to implement, cache is always clean (data coherency), high memory bandwidth
 write back – uses less memory bandwidth, cache may have dirty blocks, more difficult to implement
 types of misses:
o compulsory (cold) – inevitable misses that occur when the cache is empty
o conflict – a block maps to a cache line that is occupied by another block
 decreases as the degree of associativity increases
o capacity – misses due to cache size (cache is full and we need to replace a line with the requested block)
 decreases as the size of the cache increases
 how to reduce miss rate:
o change parameters of cache (block size, cache size, degree of associativity)
o use a victim cache (“backup” memory where you dump all blocks that have been
“thrown” out of the cache recently)
o prefetching techniques
o programming techniques (improve spatial locality)
 merging arrays, loop interchange, loop fusion, blocking (eg. tiled matrix multiplication)
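
A loop-interchange sketch in C (illustrative; since C arrays are row-major, the inner loop should walk the last index):

```c
#define N 1024

/* Poor locality: stride of N doubles per access (column-major traversal). */
void scale_bad(double a[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= 2.0;
}

/* After interchange: unit stride, so each cache line is fully used. */
void scale_good(double a[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= 2.0;
}
```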
 how to reduce the miss penalty:
o read through (fetch the chunk I need first; the rest can come in afterwards, but I wait only for what I need)
o sub-block replacement – bring in sub-block, not whole block
o non-blocking caches
o multi-level caches
 CPU time = (CPU clock cycles + Memory stall cycles)*clock cycle time

 Memory stall cycles = #misses*miss penalty

 Memory stall cycles = IC*(memory accesses/instruction)*miss rate*miss penalty

 Memory stall cycles = IC*(misses/instruction)*miss penalty

 misses/instruction = miss rate*(memory accesses/instruction)

 Average memory access time = hit time + miss rate*miss penalty

 CPU time = IC*(CPI + (memory accesses/instruction)*miss rate*miss penalty)*CCT
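
A worked example of the AMAT and effective-CPI formulas; all parameters are hypothetical:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical cache/CPU parameters for illustration. */
    double hit_time     = 1.0;   /* cycles */
    double miss_rate    = 0.05;
    double miss_penalty = 40.0;  /* cycles */
    double mem_per_inst = 1.3;   /* memory accesses per instruction */
    double base_cpi     = 1.2;   /* CPI without memory stalls */

    double amat = hit_time + miss_rate * miss_penalty;
    double cpi  = base_cpi + mem_per_inst * miss_rate * miss_penalty;

    printf("AMAT = %.1f cycles, effective CPI = %.2f\n", amat, cpi);
    return 0;   /* AMAT = 3.0 cycles, CPI = 3.80 */
}
```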

 spatial locality: the next references will likely be to addresses near the current one
 temporal locality: a recently referenced item is likely to be referenced again in the near future
Shared Memory Multiprocessing (SMP)
 UMA = Uniform Memory Access Machine
o centralized memory
o address-independent
o time to access memory is equal among the processors
 NUMA = Non-Uniform Memory Access Machine
o distributed memory
o address-dependent
o time to access memory depends on the distance between the requesting processor
and the data location
 COMA = Cache-Only Memory Access Machine
o all of memory is organized as caches; data migrates toward the processors that use it
 problems with SMP:
o memory latency
 sharing causes memory latency that doesn’t scale well with
multiprocessing (kills concurrency)
 sharing is expensive, but you can’t have concurrency without sharing
o synchronization overhead
 synchronization is required to ensure atomicity semantics in concurrent
systems
 locality obtained through replication incurs synchronization overhead
Memory Consistency Models

Sequential Consistency (SC)
- Description:
o all memory operations are performed sequentially
o the results are as if all memory operations were performed in some sequential order consistent with the individual program orders
- Rules:
o do all memory operations in order, and delay all future memory operations until the previous ones are done (requires serialization and delay of memory operations in a thread)
- Problems:
o kills concurrency because of the serialization of memory operations

Weak Ordering (WO)
- Description:
o classify memory accesses as ordinary data accesses (R, W) and synchronization accesses (S)
o allows concurrent R/W as long as data dependencies do not exist btwn them
- Rules:
o S → R; S → W; S → S
o R → S; W → S
- Problems:
o ordering is still too strict
o not all programs will run correctly under WO

Release Consistency (RC)
- Description:
o refinement of the WO model
o classify synchronization accesses as acquire (SA) and release (SR)
o until an acquire is performed, all later memory operations are stalled
o until past memory operations are performed, a release cannot be performed
- Rules:
o SA → R; SA → W; SA → SA/SR
o SR → SA/SR
o R → SR; W → SR
- Problems:
o a program may still give non-SC results (ie. if it is not data-race-free)
 a data-race exists when 2 conflicting memory operations exist between 2 threads (ie. a conflict-pair occurs at runtime)

 a properly-labelled program is one that cannot have a data-race (ie. we get only SC results on an RC platform)
Cache Coherence
 locality is improved with a cache local to each processor, at the expense of replication mgmt (overhead)
Snooping Bus Protocol
- Features:
o common bus that all caches are connected to
o each cache block maintains status info
o a write is broadcast on the bus and automatically used to invalidate/update local copies elsewhere
o centralized protocol
- Advantages:
o no directory overhead
- Disadvantages:
o because writes are serialized by the common bus, it destroys concurrency
o poor scalability due to serialization (only 1 event can occur on the bus at a time)

Directory-Based Protocol
- Features:
o distributed protocol that is used to route messages and maintain cache coherence
o central directory holds all status info about cache blocks
o all accesses must go through the directory
- Advantages:
o more scalable solution
o enhances concurrency via concurrent messages and processing
- Disadvantages:
o directory overhead (sync cost)
o message routing
 write-invalidate protocol:
o the processor has exclusive access to an item before it writes it; all copies of the item in other caches are invalidated (invalidate all old copies of the data)
 write-update/broadcast protocol:
o all copies of the item in other caches are updated with the new value being written (update all old copies of the data)
 write-invalidate is preferred over write-update because it uses less bus bandwidth
Snooping Bus – Processor Side

Request    | Original state → New state | Actions
Read miss  | Invalid → Shared           | place read miss on bus
Read miss  | Shared → Shared            | place read miss on bus*
Read miss  | Exclusive → Shared         | write old block to memory; place read miss on bus*
Read hit   | Shared → Shared            | read data from cache
Read hit   | Exclusive → Exclusive      | read data from cache
Write miss | Invalid → Exclusive        | place write miss on bus
Write miss | Shared → Exclusive         | place write miss on bus*
Write miss | Exclusive → Exclusive      | write old block to memory; place write miss on bus*
Write hit  | Shared → Exclusive         | place write miss on bus
Write hit  | Exclusive → Exclusive      | write data in cache

* may cause an address conflict (eg. in direct mapping, we will always have a replacement of the cache block – not always true for other addressing schemes)
 each cache block has 3 states:
o invalid
o shared (read-only)
o exclusive (read-write)
 observe that for the processor side, we deal with 4 types of requests (read miss, read hit, write miss, write hit)
 for the bus side, we deal with 2 types of requests (read miss, write miss) – the write-back block is handled internally
o the request on the bus is from another processor (we only do something if the address of the request matches the address of our own cache line)

Snooping Bus – Bus Side

Request    | Original state → New state | Actions
Read miss  | Invalid → Invalid          | no action in our cache
Read miss  | Shared → Shared            | no action in our cache
Read miss  | Exclusive → Shared         | place cache block on bus (share copy with other processor); change state of our block to Shared
Write miss | Invalid → Invalid          | no action in our cache
Write miss | Shared → Invalid           | invalidate our block since another processor wants to write to it
Write miss | Exclusive → Invalid        | write-back our block; invalidate our block since another processor wants to write to it

 in the directory-based protocol, messages are sent btwn the local cache (local processor), the home directory, and the remote cache (remote processor)
o local cache: processor cache generating the request
o home directory: directory containing the status info for each cache block
o remote cache: processor cache containing a cached copy of the requested block

Directory-Based – Local Processor Side

Request    | Original state → New state | Actions
Read miss  | Invalid → Shared           | send read miss msg to home
Read miss  | Shared → Shared            | send read miss msg to home
Read miss  | Exclusive → Shared         | data write-back to memory; send read miss msg to home
Read hit   | Shared → Shared            | read data from cache
Read hit   | Exclusive → Exclusive      | read data from cache
Write miss | Invalid → Exclusive        | send write miss msg to home
Write miss | Shared → Exclusive         | send write miss msg to home
Write miss | Exclusive → Exclusive      | data write-back to memory; send write miss msg to home
Write hit  | Shared → Exclusive         | send write hit msg to home
Write hit  | Exclusive → Exclusive      | write data in cache
Directory-Based – Directory Side

Request    | Original state → New state | Actions
Read miss  | Uncached → Shared          | return value from memory (data value reply); change state to Shared; add local processor to Sharers
Read miss  | Shared → Shared            | return value from memory; add local processor to Sharers
Read miss  | Exclusive → Shared         | send Fetch msg to owner; write returned data to memory (data write-back); send it to local processor (data value reply); change status of block to Shared; add local processor to Sharers; make remote processor a member of Sharers (done by remote)
Read hit   | Shared → Shared            | read data from cache
Read hit   | Exclusive → Exclusive      | read data from cache
Write miss | Uncached → Exclusive       | return value from memory; change state to Exclusive; add local processor to Sharers
Write miss | Shared → Exclusive         | return value from memory; change state to Exclusive; send Invalidate msg to all members of Sharers; add local processor as sole member of Sharers
Write miss | Exclusive → Exclusive      | send Fetch/Invalidate msg to owner; write returned value to memory (data write-back); send it to local processor (data value reply); add local processor as sole member of Sharers; invalidate remote processor copy (done by remote)
Write hit  | Shared → Exclusive         | change state to Exclusive; send Invalidate msg to all members of Sharers; add local processor as sole member of Sharers
Write hit  | Exclusive → Exclusive      | write data in cache
 msgs sent in this scheme:
o read miss, write miss (local → home)
o invalidate, fetch, fetch/invalidate (home → remote)
o data value reply, data write-back (home ↔ local)

 observe the following:
o the actions taken by the local processor are done based on the status of its own
local cache
o the actions taken by the home directory are done based on the status of the cache
block (not on the status of the local cache!)
o the actions taken by the remote processor are done based on the status of its own
“remote” cache
o directory is the new serialization point in this scheme
Interconnection Networks

 transport time = time of flight + msg size/bandwidth
 total latency = sender overhead + time of flight + msg size/bandwidth + receiver overhead
 time of flight = distance btwn machines/speed of signal
o speed of signal is assumed to be 2/3 the speed of light
 transmission time = msg size/(bandwidth of medium)
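
A worked example of the latency formulas; all link parameters are made up:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical link between two machines. */
    double distance  = 100.0;            /* metres */
    double signal    = (2.0 / 3.0) * 3.0e8;  /* 2/3 the speed of light, m/s */
    double msg_size  = 1.0e4;            /* bits */
    double bandwidth = 1.0e9;            /* bits/s */
    double send_ovh  = 2.0e-6;           /* sender overhead, s */
    double recv_ovh  = 3.0e-6;           /* receiver overhead, s */

    double time_of_flight = distance / signal;
    double transport      = time_of_flight + msg_size / bandwidth;
    double total          = send_ovh + transport + recv_ovh;

    printf("transport = %.2f us, total latency = %.2f us\n",
           transport * 1e6, total * 1e6);
    return 0;
}
```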
Topology     | Degree      | Diameter        | Bisection
1D mesh      | 2           | N – 1           | 1
2D mesh      | 4           | 2(N^(1/2) – 1)  | N^(1/2)
3D mesh      | 6           | 3(N^(1/3) – 1)  | N^(2/3)
nD mesh      | 2n          | n(N^(1/n) – 1)  | N^((n–1)/n)
2D torus     | 4           | N^(1/2)         | 2N^(1/2)
nD torus     | 2n          | N^(1/n)         | 2N^((n–1)/n)
Ring         | 2           | N/2             | 2
n-Hypercube  | n = log2(N) | n = log2(N)     | N/2
binary tree  | 2 **        | 2 log2(N)       | 1
k-ary tree   | k + 1 **    | 2 logk(N)       | 1

* N = total #nodes in the above table
** however, leaves (which represent the nodes) have degree 1, which is what is used in the analysis
 broadcast: send a msg to everyone → n messages sent in the n/w
 scatter: distinct msgs sent from 1 source to all destinations (1 for each destination)
 gather: distinct msgs collected from many sources to 1 destination (1 from each source)
o n messages sent in the n/w for scatter or gather operations
 exchange: every pair of nodes exchanges a distinct set of msgs (everybody exchanges distinct msgs w/everybody else) → n(n – 1) messages sent in the n/w
 diameter reflects latency
 bisection reflects reliability and also relates to latency/performance
o the exchange operation is limited by bisection bandwidth
 hypercube is more scalable than mesh
o you gain in bandwidth with increases in N
Topology     | Broadcast       | Scatter    | Gather     | Exchange
nD mesh      | n(N^(1/n) – 1)  | N/n        | N/n        | N^(2–(n–1)/n)/2
nD torus     | N^(1/n)         | N/2n       | N/2n       | N^(2–(n–1)/n)/4
Ring         | N/2             | N/2        | N/2        | N^2/4
n-Hypercube  | log2(N)         | N/log2(N)  | N/log2(N)  | N
binary tree  | 2 log2(N)       | N          | N          | N^2/2
k-ary tree   | 2 logk(N)       | N          | N          | N^2/2
 broadcast: time for a node to send a message to every other node (same as the worst-case distance to the furthest node – based on diameter analysis)
 scatter: time for a node to send a distinct message to every other node
 gather: time for a node to receive a distinct message from every other node
o scatter and gather have similar analysis
o time for scatter/gather = #messages (N)/min degree of a node
 ie. analysis is based on #messages/worst-case connectivity
 exchange: time for each node to send a message to every other node
o time for exchange = 2*(N/2 * N/2)/bisection = N^2/(2*bisection)
o analysis is based on bisection bandwidth
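
A quick check of these analyses for an n-hypercube, using N = 1024 as an illustrative size:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double N = 1024.0;           /* #nodes in an n-hypercube (illustrative) */
    double degree    = log2(N);  /* = n = 10 */
    double bisection = N / 2.0;

    double broadcast = log2(N);                    /* diameter analysis */
    double scatter   = N / degree;                 /* #messages / min degree */
    double exchange  = 2.0 * (N / 2.0) * (N / 2.0) / bisection;  /* = N */

    printf("broadcast %.0f, scatter/gather %.1f, exchange %.0f\n",
           broadcast, scatter, exchange);
    return 0;
}
```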
Routing Algorithms

Store-and-Forward
- uses packet buffers in successive nodes
- assumes a msg is stored before it is forwarded to the next node
- the intermediate node must receive the whole msg before forwarding it to the next node
- latency = S*[L/W + t]

Wormhole
- uses flit buffers in successive routers
o flit = minimum-size unit of a packet
- once a worm has started a path, the path is reserved for that worm
- the only requirement is that once the flit train is started, it must follow until completed (otherwise you lose the msg)
- latency = S*t + L/W

Parameters:
- S = #switches
- L = packet size
- t = switching cost per node (switch delay)
- W = bandwidth of link
- L/W = transfer time
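
A side-by-side calculation of the two latency formulas with invented parameters:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical path: every number is made up for illustration. */
    double S = 5.0;     /* #switches on the path */
    double L = 4096.0;  /* packet size, bits */
    double W = 1.0e9;   /* link bandwidth, bits/s */
    double t = 1.0e-6;  /* switching cost per node, s */

    double store_fwd = S * (L / W + t);  /* whole packet stored at each hop */
    double wormhole  = S * t + L / W;    /* only the header pays the per-hop cost */

    printf("store-and-forward %.2f us, wormhole %.2f us\n",
           store_fwd * 1e6, wormhole * 1e6);
    return 0;
}
```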