Comp326 Review
by: Mamta Patel
May/2003

3 driving forces behind architecture innovations:
o technology
o applications
o programming languages/paradigms

von Neumann model
- non-deterministic
- has side-effects – due to multiple assignments
- inherently sequential
- imperative languages
- separation btwn data and control – control flow

Dataflow model
- deterministic (not non-deterministic)
- side-effect-free (no side-effects) – single assignment
- explicitly parallel/concurrent (concurrency is explicit)
- functional languages
- only a data space (no control space) – no control flow
- not general-purpose enough
- a single actor thread is too much to manage at run-time – thus, impractical as an execution architecture (synchronization overhead)

Moore's Law: #devices (transistors) on a chip doubles every 2 years

Amdahl's Law: speedup = 1/[(1 – frac_enh) + frac_enh/speedup_enh] = exec time_old/exec time_new
o frac_enh = the portion that is enhanceable (always <= 1)
o speedup_enh = the gain/speedup factor of the enhanceable portion (always > 1)
o the synchronization overhead is ignored by Amdahl's Law

latency and throughput are not necessarily inversely related
o if there is no concurrency, then the two are inversely related
o otherwise, the two are independent parameters

CPU time = (CPI * n)/clock rate, where n = #instructions in the program
MIPS = clock rate/(CPI * 10^6) = instruction count/(exec time * 10^6)
CPI is instruction-dependent, program-dependent, machine-dependent, and benchmark-dependent

problems with MIPS:
o it is instruction-set dependent
o it varies with different programs on the same computer
o it can be inversely proportional to actual performance

3 main reasons for the emergence of GPR (general-purpose register) architectures:
o registers are faster than memory
o registers are more efficient for a compiler to use than other forms of internal storage
o registers can be used to hold variables (reduced memory traffic, and hence lower latency due to memory accesses)

Instruction-Set Architectures

Stack
- Advantages: simple model; good code density; generates short instructions; simple address format
- Disadvantages: inherently sequential data structure; programming may be complex (lots of stack overhead); stack bottleneck (the organization is too limiting); lack of random access (memory can't be accessed randomly)

Register-Register
- Advantages: simple, fixed-length instructions; simple code-generation model (separation of concerns); instructions take a similar #clocks to execute; tolerates high degrees of latency; supports a higher degree of concurrency; makes it possible to do hardware optimizations
- Disadvantages: higher instruction count than models with memory references in instructions; more instructions and lower instruction density => larger programs

Register-Memory
- Advantages: data can be accessed w/o a separate load instr first; instr format easy to encode; good code density
- Disadvantages: CPI varies depending on operand location; operands are not equivalent, since the source operand in a binary operation is destroyed; encoding a register number and a memory address in each instr may restrict #registers

Addressing Modes
- Advantages: can reduce instruction count
- Disadvantages: complicates CPU design; incurs runtime overhead; some are rarely used by compilers

GCD test (for array index expressions of the form a*i + b and c*i + d): if a loop-carried dependence exists, then gcd(a,c) | (d – b)
o i.e. if gcd(a,c) does not divide (d – b), then no loop-carried dependence exists
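To make the GCD test concrete, here is a minimal sketch in C; the index expressions x[a*i + b] / x[c*i + d] and the sample coefficients are illustrative, not from the notes:

```c
#include <stdio.h>

/* Greatest common divisor (Euclid's algorithm). */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* GCD test: a loop-carried dependence between x[a*i + b] (write) and
 * x[c*i + d] (read) can exist only if gcd(a, c) divides (d - b).
 * Returns 1 if a dependence is *possible*, 0 if it is ruled out.    */
static int gcd_test(int a, int b, int c, int d) {
    int g = gcd(a, c);
    return (d - b) % g == 0;
}

int main(void) {
    /* x[2*i] = ...; ... = x[2*i + 1];  gcd(2,2) = 2 does not divide 1,
     * so no loop-carried dependence is possible.                      */
    printf("dependence possible? %d\n", gcd_test(2, 0, 2, 1));  /* 0 */
    printf("dependence possible? %d\n", gcd_test(4, 0, 2, 2));  /* 1 */
    return 0;
}
```

Note that the test only rules dependences out: when gcd(a,c) does divide (d – b), a dependence may or may not actually exist.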
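Going back to Amdahl's Law above, a small numeric sketch; the 0.8 fraction and the 10x enhancement are made-up values for illustration:

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - frac_enh) + frac_enh / speedup_enh) */
static double amdahl(double frac_enh, double speedup_enh) {
    return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh);
}

int main(void) {
    /* Illustrative values: 80% of execution time is enhanceable,
     * and the enhanced portion runs 10x faster.                   */
    printf("overall speedup = %.2f\n", amdahl(0.8, 10.0));  /* ~3.57 */
    /* Even with an infinite enhancement, the speedup is bounded by
     * 1 / (1 - frac_enh) = 5 in this example.                     */
    return 0;
}
```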
Techniques to Remove Dependences and Increase Concurrency
- rescheduling – shuffle the instructions around, putting them into delay slots if possible
- loop unrolling – don't forget to rename registers when you unroll (rename AS NEEDED)
- software pipelining – make pipe segments, reverse the order and rename IF NECESSARY
  o rename registers to get rid of WAR/WAW dependences
  o unroll the loop a couple of times
  o select instructions for the pipe segments
  o reverse the order of the instructions and adjust the displacements in the LD (or SD) instructions
- dynamic scheduling – allows out-of-order completion of instructions
  o Scoreboard
  o Tomasulo's Algorithm

Scoreboard
- centralized
- stages: issue, read operands, execute, write-back
- table format: FU, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
- delays issue on WAW hazards and structural hazards
- delays read operands until RAW hazards are resolved
- delays write-back on WAR hazards
- limited by:
  o amount of parallelism available among instructions
  o #scoreboard entries
  o # and types of FUs
  o presence of WAR and WAW hazards

Tomasulo's Algorithm
- distributed
- stages: issue, execute, write-back
- table format: FU, Busy, Op, Vj, Vk, Qj, Qk, A
- reservation stations (provide register renaming)
- common data bus (which has serialized access)
- handles WAR and WAW hazards by using register renaming
- delays issue on structural hazards
- delays execution until RAW hazards are resolved

Vectors
- execution time = n/MVL * (Tloop + Tstartup) + n*Tchime
  o n = actual vector length; MVL = max vector length
  o Tloop = time to execute the scalar code in the loop; Tstartup = flush time for all convoys
  o Tchime = #chimes/convoys (= 1 when we use chaining)
  o don't forget to consider the chaining overhead in your calculations
- WAR and WAW hazards (false dependences) drastically affect vectorization, since they serialize the processing of elements
- stripmining = break a vector of length n into subvectors of size <= 64 (MVL)
- vector stride = "distance" between two successive vector elements
- #memory banks actually activated = n/gcd(n, vector stride)
  o n = n-way interleaved memory
  o the vector stride affects the vector access of load and store operations
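A small numeric sketch of the vector timing model above, together with a stripmining loop in C; MVL = 64 as in the notes, while the DAXPY body and the Tloop/Tstartup/Tchime values are illustrative assumptions (the sketch also uses the ceiling of n/MVL so the last, shorter strip is counted):

```c
#include <stdio.h>

#define MVL 64  /* max vector length, as in the notes */

/* Execution-time estimate from the notes:
 *   ceil(n/MVL) * (Tloop + Tstartup) + n * Tchime                 */
static double vector_time(int n, double Tloop, double Tstartup, double Tchime) {
    int strips = (n + MVL - 1) / MVL;           /* ceiling of n/MVL */
    return strips * (Tloop + Tstartup) + (double)n * Tchime;
}

/* Stripmining: process a length-n vector in chunks of size <= MVL. */
static void daxpy_stripmined(int n, double a, const double *x, double *y) {
    for (int low = 0; low < n; low += MVL) {
        int len = (n - low < MVL) ? n - low : MVL;  /* last strip may be short */
        for (int i = low; i < low + len; i++)       /* one vector operation    */
            y[i] = a * x[i] + y[i];
    }
}

int main(void) {
    /* Illustrative parameters: Tloop = 15, Tstartup = 49, Tchime = 1 (chaining). */
    printf("estimated cycles for n=200: %.0f\n", vector_time(200, 15, 49, 1)); /* 456 */
    (void)daxpy_stripmined;
    return 0;
}
```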
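Going back to the loop-unrolling technique at the top of this section, a minimal C sketch; using separate accumulators (s0..s3) plays the same role as renaming registers when unrolling, and the summation loop itself is just an illustrative example:

```c
/* Original loop: a single accumulator creates a chain of RAW dependences,
 * so every add must wait for the previous one.                            */
double sum_rolled(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by 4 with "renamed" accumulators s0..s3: the four adds in each
 * iteration are independent, so they can be scheduled into delay slots or
 * executed concurrently. (Assumes n is a multiple of 4 to keep the sketch
 * short; a real version needs a cleanup loop.)                            */
double sum_unrolled(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```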
Cache
- miss penalty affected by:
  o main memory bandwidth/concurrency
  o write policy
  o cache line size
- hit ratio affected by:
  o cache management policy
  o cache size
  o program behaviour (temporal/spatial locality)
- cache size affected by:
  o technology advancement
  o program behaviour (temporal/spatial locality)
  o hit rate

4 questions in the Memory Hierarchy:
o placement (mapping)
  - direct – a block maps to only 1 line
  - fully-associative – a block maps to any line
  - set-associative – a block maps to the lines of a selected set
o addressing
  - direct – check 1 tag
  - fully-associative – check all tags (expensive)
  - set-associative – check the tags of the selected set
o replacement policy – random, LRU, FIFO
o write policy
  - write-through – easier to implement, cache is always clean (data coherency), high memory bandwidth usage
  - write-back – uses less memory bandwidth, cache may have dirty blocks, more difficult to implement

types of misses:
o compulsory (cold) – inevitable misses that occur when the cache is empty
o conflict – a block maps to a cache line that is occupied by another block; decreases as the associativity increases
o capacity – misses due to cache size (the cache is full and we need to replace a line with the requested block); decreases as the size of the cache increases

how to reduce the miss rate:
o change the parameters of the cache (block size, cache size, degree of associativity)
o use a victim cache (a "backup" buffer where you dump blocks that have been "thrown" out of the cache recently)
o prefetching techniques
o programming techniques (improve spatial locality): merging arrays, loop interchange, loop fusion, blocking (e.g. matrix multiplication)

how to reduce the miss penalty:
o read through – bring the chunk I need first; the rest can come in, but I only wait for what I need
o sub-block replacement – bring in a sub-block, not the whole block
o non-blocking caches
o multi-level caches

CPU time = (CPU clock cycles + memory stall cycles) * clock cycle time
Memory stall cycles = #misses * miss penalty
Memory stall cycles = IC * (memory accesses/instruction) * miss rate * miss penalty
Memory stall cycles = IC * (misses/instruction) * miss penalty
misses/instruction = miss rate * (memory accesses/instruction)
Average memory access time = hit time + miss rate * miss penalty
CPU time = IC * (CPI + (memory accesses/instruction) * miss rate * miss penalty) * CCT

spatial locality: the next references will likely be to addresses near the current one
temporal locality: a recently referenced item is likely to be referenced again in the near future

Shared Memory Multiprocessing (SMP)
- UMA = Uniform Memory Access machine
  o centralized memory
  o address-independent
  o time to access memory is equal among the processors
- NUMA = Non-Uniform Memory Access machine
  o distributed memory
  o address-dependent
  o time to access memory depends on the distance between the requesting processor and the data location
- COMA = Cache-Only Memory Access machine
- problems with SMP:
  o memory latency
    - sharing causes memory latency that doesn't scale well with multiprocessing (kills concurrency)
    - sharing is expensive, but you can't have concurrency without sharing
  o synchronization overhead
    - synchronization is required to ensure atomicity semantics in concurrent systems
    - locality obtained through replication incurs synchronization overhead

Memory Consistency Models

Sequential Consistency (SC)
- Description: all memory operations are performed sequentially; the results are as if all memory operations were performed in some sequential order consistent with the individual program orders; requires serialization and delay of memory operations within a thread
- Rules: do all memory operations in order, and delay all future memory operations until the previous ones are done
- Problems: kills concurrency because of the serialization of memory operations

Weak Ordering (WO)
- Description: classify memory accesses as ordinary data accesses (R, W) and synchronization accesses (S); allows concurrent R/W as long as data dependences do not exist between them
- Rules: S -> R; S -> W; S -> S; R -> S; W -> S
- Problems: the ordering is still too strict; not all programs will run correctly under WO

Release Consistency (RC)
- Description: refinement of the WO model; classify synchronization accesses as acquire (SA) and release (SR)
- Rules: until an acquire is performed, all later memory operations are stalled; until past memory operations are performed, the release operation cannot be performed; SA -> R; SA -> W; SA -> SA/SR; SA/SR -> SR; R -> SR; W -> SR
- Problems: a program may still give non-SC results (i.e. if it is not data-race-free)

a data race exists when 2 conflicting memory operations exist between 2 threads (i.e. a conflict pair occurs at runtime)
a properly-labelled program is one that cannot have a data race (i.e. we get only SC results on an RC platform)
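The acquire/release labelling above maps naturally onto C11 atomics; here is a minimal sketch of a properly-labelled producer/consumer pair (the data/ready names are illustrative). Without the release/acquire pair on ready, the two accesses to data would be a conflict pair, i.e. a data race:

```c
#include <stdatomic.h>
#include <stdbool.h>

int data;                       /* ordinary data access (R, W) */
atomic_bool ready = false;      /* synchronization variable    */

/* Thread 1: the release (SR) is not performed until the earlier write
 * to data is performed (W -> SR).                                     */
void producer(void) {
    data = 42;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Thread 2: operations after the acquire (SA) are stalled until the
 * acquire is performed (SA -> R), so the read of data sees 42.        */
int consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until the release is visible  */
    return data;
}
```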
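Returning to the memory-hierarchy formulas earlier in this section, a small numeric sketch; all the parameter values are made up for illustration:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative parameters (not from the notes):                   */
    double hit_time  = 1.0;    /* cycles                               */
    double miss_rate = 0.05;   /* 5% of memory accesses miss           */
    double penalty   = 40.0;   /* cycles per miss                      */
    double accesses  = 1.3;    /* memory accesses per instruction      */
    double base_cpi  = 1.2;
    double ic        = 1e9;    /* instruction count                    */
    double cct       = 1e-9;   /* clock cycle time (1 GHz clock)       */

    /* Average memory access time = hit time + miss rate * miss penalty */
    double amat = hit_time + miss_rate * penalty;

    /* CPU time = IC * (CPI + accesses/instr * miss rate * penalty) * CCT */
    double cpu_time = ic * (base_cpi + accesses * miss_rate * penalty) * cct;

    printf("AMAT     = %.2f cycles\n", amat);     /* 3.00  */
    printf("CPU time = %.2f s\n", cpu_time);      /* 3.80  */
    return 0;
}
```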
Cache Coherence
- locality is improved with a cache local to each processor, at the expense of replication management (overhead)

Snooping Bus Protocol
- Features: a common bus that all caches are connected to; each cache block maintains status info; a write is broadcast on the bus and automatically used to invalidate/update local copies elsewhere; centralized protocol
- Advantages: no directory overhead
- Disadvantages: because writes are serialized by the common bus, it destroys concurrency; poor scalability due to serialization (only 1 event can occur on the bus at a time)

Directory-Based Protocol
- Features: distributed protocol that is used to route messages and maintain cache coherence; a central directory holds all status info about the cache blocks; all accesses must go through the directory
- Advantages: the solution is more scalable; enhances concurrency via concurrent messages and processing
- Disadvantages: directory overhead (sync cost); message routing

write-invalidate protocol:
o the processor has exclusive access to an item before it writes it; all copies of the item in other caches are invalidated (invalidate all old copies of the data)
write-update/broadcast protocol:
o all copies of the item in other caches are updated with the new value being written (update all old copies of the data)
write-invalidate is preferred over write-update because it uses less bus bandwidth

each cache block has 3 states:
o invalid
o shared (read-only)
o exclusive (read-write)

Snooping Bus – Processor Side
- Read miss
  o Invalid -> Shared: place read miss on bus
  o Shared -> Shared: place read miss on bus*
  o Exclusive -> Shared: write old block to memory; place read miss on bus*
- Read hit
  o Shared -> Shared: read data from cache
  o Exclusive -> Exclusive: read data from cache
- Write miss
  o Invalid -> Exclusive: place write miss on bus
  o Shared -> Exclusive: place write miss on bus*
  o Exclusive -> Exclusive: write old block to memory; place write miss on bus*
- Write hit
  o Shared -> Exclusive: place write miss on bus
  o Exclusive -> Exclusive: write data in cache
* may cause an address conflict (e.g. in direct mapping we will always have a replacement of the cache block – not always true for other addressing schemes)
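A minimal sketch of the processor-side transitions in the table above as a C state function; the enum and function names are illustrative, the bus actions are reduced to printed messages, and the address-conflict (*) cases are not modelled separately:

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { READ_HIT, READ_MISS, WRITE_HIT, WRITE_MISS } Request;

/* Processor-side state machine for one cache block (write-back,
 * write-invalidate snooping protocol from the table above).      */
BlockState processor_side(BlockState s, Request r) {
    switch (r) {
    case READ_MISS:
        if (s == EXCLUSIVE) printf("write old block back to memory\n");
        printf("place read miss on bus\n");
        return SHARED;
    case READ_HIT:
        printf("read data from cache\n");
        return s;                        /* Shared or Exclusive, unchanged */
    case WRITE_MISS:
        if (s == EXCLUSIVE) printf("write old block back to memory\n");
        printf("place write miss on bus\n");
        return EXCLUSIVE;
    case WRITE_HIT:
        if (s == SHARED) printf("place write miss on bus\n"); /* invalidate other copies */
        else             printf("write data in cache\n");
        return EXCLUSIVE;
    }
    return s;
}
```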
observe that for the processor side, we deal with 4 types of requests (read miss, read hit, write miss, write hit); for the bus side, we deal with 2 types of requests (read miss, write miss) – the write-back of a block is handled internally
o the request on the bus is from another processor (we only do something if the address of the request matches the address of our own cache line)

Snooping Bus – Bus Side
- Read miss
  o Invalid -> Invalid: no action in our cache
  o Shared -> Shared: no action in our cache
  o Exclusive -> Shared: place the cache block on the bus (share our copy with the other processor); change the state of our block to Shared
- Write miss
  o Invalid -> Invalid: no action in our cache
  o Shared -> Invalid: invalidate our block, since another processor wants to write to it
  o Exclusive -> Invalid: write back our block; invalidate our block, since another processor wants to write to it

in the directory-based protocol, messages are sent btwn the local cache (local processor), the home directory, and the remote cache (remote processor)
o local cache: the processor cache generating the request
o home directory: the directory containing the status info for each cache block
o remote cache: a processor cache containing a copy of the requested block

Directory-Based – Local Processor Side
- Read miss
  o Invalid -> Shared: send read miss msg to home
  o Shared -> Shared: send read miss msg to home
  o Exclusive -> Shared: data write-back to memory; send read miss msg to home
- Read hit
  o Shared -> Shared: read data from cache
  o Exclusive -> Exclusive: read data from cache
- Write miss
  o Invalid -> Exclusive: send write miss msg to home
  o Shared -> Exclusive: send write miss msg to home
  o Exclusive -> Exclusive: data write-back to memory; send write miss msg to home
- Write hit
  o Shared -> Exclusive: send write hit msg to home
  o Exclusive -> Exclusive: write data in cache

Directory-Based – Directory Side
- Read miss
  o Uncached -> Shared: return value from memory; change state to Shared; add local processor to Sharers
  o Shared -> Shared: return value from memory; add local processor to Sharers
  o Exclusive -> Shared: send Fetch msg to the owner; write the returned data to memory (data write-back); send it to the local processor (data value reply); change the status of the block to Shared; add local processor to Sharers; make remote processor a member of Sharers (done by remote)
- Read hit
  o Shared -> Shared: read data from cache
  o Exclusive -> Exclusive: read data from cache
- Write miss
  o Uncached -> Exclusive: return value from memory; change state to Exclusive; add local processor to Sharers
  o Shared -> Exclusive: return value from memory; change state to Exclusive; send Invalidate msg to all members of Sharers; add local processor as sole member of Sharers
  o Exclusive -> Exclusive: send Fetch/Invalidate msg to the owner; write the returned value to memory (data write-back); send it to the local processor (data value reply); add local processor as sole member of Sharers; invalidate the remote processor's copy (done by remote)
- Write hit
  o Shared -> Exclusive: change state to Exclusive; send Invalidate msg to all members of Sharers; add local processor as sole member of Sharers
  o Exclusive -> Exclusive: write data in cache

msgs sent in this scheme:
o read miss, write miss (local -> home)
o invalidate, fetch, fetch/invalidate (home -> remote)
o data value reply (home -> local), data write-back (local -> home)
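As a compact summary of the message traffic, a small C sketch listing each message type with its direction as described in these notes (the enum names are illustrative; the write-hit message comes from the local-processor-side table above):

```c
/* Messages in the directory-based protocol and their directions. */
typedef enum {
    MSG_READ_MISS,         /* local cache    -> home directory          */
    MSG_WRITE_MISS,        /* local cache    -> home directory          */
    MSG_WRITE_HIT,         /* local cache    -> home directory          */
    MSG_INVALIDATE,        /* home directory -> remote cache(s)         */
    MSG_FETCH,             /* home directory -> remote cache (owner)    */
    MSG_FETCH_INVALIDATE,  /* home directory -> remote cache (owner)    */
    MSG_DATA_VALUE_REPLY,  /* home directory -> local cache             */
    MSG_DATA_WRITE_BACK    /* local cache    -> home (dirty block back  */
                           /*                   to memory)              */
} DirMessage;
```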
observe the following:
o the actions taken by the local processor are based on the status of its own local cache
o the actions taken by the home directory are based on the status of the cache block (not on the status of the local cache!)
o the actions taken by the remote processor are based on the status of its own "remote" cache
o the directory is the new serialization point in this scheme

Interconnection Networks
- transport time = time of flight + msg size/bandwidth
- total latency = sender overhead + time of flight + msg size/bandwidth + receiver overhead
- time of flight = distance btwn machines/speed of signal
  o the speed of the signal is assumed to be 2/3 the speed of light
- transmission time = msg size/(bandwidth of medium)

Topology       Degree        Diameter           Bisection
1D mesh        2             N - 1              1
2D mesh        4             2(N^(1/2) - 1)     N^(1/2)
3D mesh        6             3(N^(1/3) - 1)     N^(2/3)
nD mesh        2n            n(N^(1/n) - 1)     N^((n-1)/n)
2D torus       4             N^(1/2)            2N^(1/2)
nD torus       2n            n*N^(1/n)/2        2N^((n-1)/n)
Ring           2             N/2                2
n-Hypercube    n = log2 N    n = log2 N         N/2
binary tree    2 **          2 log2 N           1
k-ary tree     k + 1 **      2 logk N           1

* N = total #nodes in the above table
** however, the leaves (which represent the nodes) have degree 1, which is what is used in the analysis

broadcast: send a msg to everyone
o n messages are sent in the n/w
scatter: a distinct msg is sent from 1 source to all destinations (1 for each destination)
gather: distinct msgs are collected from many sources to 1 destination (1 from each source)
o n messages are sent in the n/w for a scatter or gather operation
exchange: every pair of nodes exchanges a distinct set of msgs (everybody exchanges distinct msgs w/everybody else)
o n(n-1) messages are sent in the n/w

diameter reflects latency
bisection reflects reliability and also relates to latency/performance
o the exchange operation is limited by the bisection bandwidth
the hypercube is more scalable than the mesh
o you gain in bandwidth with increases in N

Topology       Broadcast         Scatter      Gather       Exchange
nD mesh        n(N^(1/n) - 1)    N/n          N/n          N^(2 - (n-1)/n)/2
nD torus       n*N^(1/n)/2       N/2n         N/2n         N^(2 - (n-1)/n)/4
Ring           N/2               N/2          N/2          N^2/4
n-Hypercube    log2 N            N/log2 N     N/log2 N     N
binary tree    2 log2 N          N            N            N^2/2
k-ary tree     2 logk N          N            N            N^2/2

broadcast: time for a node to send a message to every other node (same as the worst-case distance to the furthest node – based on a diameter analysis)
scatter: time for a node to send a distinct message to every other node
gather: time for a node to receive a distinct message from every other node
o scatter and gather have a similar analysis
o time for scatter/gather = #messages (N)/min degree of a node, i.e. the analysis is based on #messages/worst-case connectivity
exchange: time for each node to send a message to every other node
o time for exchange = 2*(N/2 * N/2)/bisection
o the analysis is based on the bisection bandwidth

Routing Algorithms

Store-and-Forward
- Description: uses packet buffers in successive nodes; assumes a msg is stored before it is forwarded to the next node; an intermediate node must receive the whole msg before forwarding it to the next node
- Latency: S*[L/W + t]

Wormhole
- Description: uses flit buffers in successive routers (flit = minimum-size unit of a packet); once a worm has started a path, the path is reserved for that worm; the only requirement is that once the flit train is started, it must follow until completed (otherwise you lose the msg)
- Latency: S*t + L/W

Parameters:
- S = #switches
- L = packet size
- t = switching cost per node (switch delay)
- W = bandwidth of a link
- L/W = transfer time
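A small numeric sketch comparing the two routing latency formulas above; the S, L, W, t values are illustrative:

```c
#include <stdio.h>

int main(void) {
    double S = 4.0;      /* #switches on the path                   */
    double L = 1024.0;   /* packet size, bytes                      */
    double W = 100.0;    /* link bandwidth, bytes per microsecond   */
    double t = 2.0;      /* switching cost per node, microseconds   */

    /* Store-and-forward: the whole packet is buffered at every hop. */
    double saf = S * (L / W + t);

    /* Wormhole: only the header pays the per-hop switch delay; the
     * transfer time L/W is paid once as the packet streams through. */
    double worm = S * t + L / W;

    printf("store-and-forward: %.1f us\n", saf);   /* 49.0 us */
    printf("wormhole:          %.1f us\n", worm);  /* 18.2 us */
    return 0;
}
```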
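And for the total latency formula at the start of the Interconnection Networks section, another small sketch; the overheads, distance, and message size are made up for illustration, with the signal speed taken as 2/3 the speed of light as assumed in the notes:

```c
#include <stdio.h>

int main(void) {
    double sender_ovh   = 10e-6;   /* s                           */
    double receiver_ovh = 20e-6;   /* s                           */
    double distance     = 100.0;   /* m between machines          */
    double signal_speed = 2.0e8;   /* ~2/3 the speed of light     */
    double msg_size     = 1e6;     /* bits                        */
    double bandwidth    = 1e9;     /* bits per second             */

    double time_of_flight    = distance / signal_speed;
    double transmission_time = msg_size / bandwidth;

    /* total latency = sender overhead + time of flight
     *               + msg size / bandwidth + receiver overhead    */
    double total = sender_ovh + time_of_flight + transmission_time + receiver_ovh;

    printf("total latency = %.1f us\n", total * 1e6);   /* 1030.5 us */
    return 0;
}
```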