EE3004 (EE3.cma) - Computer Architecture
Roger Webb, R.Webb@surrey.ac.uk, University of Surrey
http://www.ee.surrey.ac.uk/Personal/R.Webb/l3a15 (also linked from the Teaching/Course page)

Introduction - Book List
Computer Architecture - Design & Performance, Barry Wilkinson, Prentice-Hall 1996 (nearest to the course)
Advanced Computer Architecture, Richard Y. Kain, Prentice-Hall 1996 (good for multiprocessing + chips + memory)
Computer Architecture, Behrooz Parhami, Oxford Univ Press 2005 (good for advanced architecture and basics)
Computer Architecture, Dowsing & Woodhouse (good for putting the bits together)
Microprocessors & Microcomputers - Hardware & Software, Ambosio & Lastowski (good for DRAM, SRAM timing diagrams etc.)
Computer Architecture & Design, Van de Goor (for basic computer architecture)
Wikipedia is as good as anything...!

Introduction - Outline Syllabus
Memory Topics
• Memory Devices
• Interfacing/Graphics
• Virtual Memory
• Caches & Hierarchies
Instruction Sets
• Properties & Characteristics
• Examples
• RISC v CISC
• Pipelining & Concurrency
Parallel Architectures
• Performance Characteristics
• SIMD (vector) processors
• MIMD (message-passing)
• Principles & Algorithms

Computer Architectures - an overview
What are computers used for? Three ranges of product cover the majority of processor sales:
• Appliances (consumer electronics)
• Communications equipment
• Utilities (conventional computer systems)

Consumer Electronics
This category covers a huge range of processor performance.
• Micro-controlled appliances - washing machines, time switches, lamp dimmers - the lower end, characterised by:
  – low processing requirements
  – a microprocessor replacing logic in a small package
  – low power requirements
• Higher-performance applications - mobile phones, printers, fax machines, cameras, games consoles, GPS, TV set-top boxes, video/DVD/HD recorders... characterised by:
  – high bandwidth - 64-bit data bus
  – low power - to avoid cooling
  – low cost - < $20 for the processor
  – small amounts of software - small cache (tight program loops)

Communications Equipment
Communications has become the major market - WWW, mobile comms.
• The main products containing powerful processors are:
  – LAN products - bridges, routers, controllers in computers
  – ATM exchanges
  – satellite & cable TV routing and switching
  – telephone networks (all-digital)
• The main characteristics of these devices are:
  – standardised applications (IEEE, CCITT etc.) - which means competitive markets
  – high-bandwidth interconnections
  – wide processor buses - 32 or 64 bits
  – multi-processing (either per-box, or in the distributed computing sense)

Utilities (Conventional Computer Systems)
Large-scale computing devices will, to some extent, be replaced by greater processing power on the desktop.
• But some centralised facilities are still required, especially where data storage is concerned:
  – general-purpose computer servers; supercomputers
  – database servers - often safer to maintain a central corporate database
  – file and printer servers - again simpler to maintain centrally
  – video-on-demand servers
• These applications are characterised by huge memory requirements and:
  – large operating systems
  – high sustained performance over wide workload variations
  – scalability as workload increases
  – 64-bit (or greater) data paths, multiprocessing, large caches

Computer System Performance
• Most manufacturers quote the performance of their processors in terms of the peak rate - MIPS (MOPS) or MFLOPS.
• Most of the applications above depend on a continuous supply of data or results - especially for video images.
• Thus the critical criterion is the sustained throughput of instructions:
  – the MPEG image decompression algorithm requires 1 billion operations per second for full-quality widescreen TV
  – less demanding VHS quality requires 2.7 Mb per second of compressed data
  – interactive simulations (games etc.) must respond to a user input within 100 ms - re-computing and displaying the new image
• Important measures are:
  – MIPS per dollar
  – MIPS per Watt

User Interactions
Consider how we interact with our computers. What does a typical CPU do?
  70%  User interface; I/O processing
  20%  Network interface; protocols
   9%  Operating system; system calls
   1%  User application
[Chart: percentage of CPU time spent managing interaction, 1955-2005, rising steadily through successive interface styles: lights & switches; punched card & tape; timesharing; menus, forms; WYSIWYG, mice, windows; virtual reality, cyberspace.]

Sequential Processor Efficiency
The current state of the art in large microprocessors includes:
• 64-bit memory words, using interleaved memory
• pipelined instructions
• multiple functional units (integer, floating point, memory fetch/store)
• 5 GHz practical maximum clock speed
• multiple processors
• an instruction set organised for simple decoding (RISC?)
However, as word length increases, efficiency may drop:
• many operands are small (16 bits is enough for many VR tasks)
• many literals are small - loading 00...00101 as 64 bits is a waste
• it may be worth operating on several literals per word in parallel

Example - reducing the number of instructions
Perform a 3D transformation of a point (x,y,z) by multiplying the 4-element row vector (x,y,z,1) by a 4x4 transformation matrix A. All operands are 16 bits long.

  (x y z 1)  | a b c d |
             | e f g h |  =  (x' y' z' r)
             | i j k l |
             | m n o p |

Conventionally this requires 20 loads, 16 multiplies, 12 adds and 4 stores, using 16-bit operands on a 16-bit CPU. On a 64-bit CPU with instructions dealing with groups of four parallel 16-bit operands, as well as a modest amount of pipelining, all this can take just 7 processor cycles.
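As a point of reference (my sketch, not from the slides), here is the conventional scalar version in C; the operation counts above fall straight out of the loop structure. A 64-bit SIMD machine would hold all four 16-bit elements of p in one register and perform the four lane multiplies of each matrix column at once.

  /* Scalar sketch of the transform: each of the 4 outputs needs 4 multiplies
     and 3 adds, giving 16 multiplies, 12 adds, 20 loads and 4 stores. */
  #include <stdint.h>

  void transform(const int16_t A[4][4], const int16_t p[4], int16_t out[4])
  {
      for (int j = 0; j < 4; j++) {          /* one pass per output element */
          int32_t acc = 0;
          for (int i = 0; i < 4; i++)
              acc += (int32_t)p[i] * A[i][j];
          out[j] = (int16_t)acc;             /* truncate back to 16 bits    */
      }
  }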
The Effect of Processor Intercommunication Latency
In a multiprocessor, and even in a uniprocessor, the delays associated with communicating and fetching data (latency) can dominate the processing times.
[Diagram: a symmetrical multiprocessor - several CPUs and memories joined by an interconnection network - compared with a uniprocessor whose CPU reaches memory through a cache.]
Delays can be minimised by placing components closer together, and:
• add caches to provide local data storage
• hide latency by multi-tasking - needs fast context switching
• interleave streams of independent instructions - scheduling
• run groups of independent instructions together (each ending with a long-latency instruction)

Memory Efficiency
A quote from the 1980s: "Memory is free". By the 2000s the cost per bit is no longer falling so fast and the consumer electronics market is becoming cost sensitive, so there is renewed interest in compact instruction sets and data compactness - both ideas from the 1960s and 1970s.
  1977 - £3000/Mb      1994 - £4/Mb      Now - <1p/Mb
Instruction Compactness
RISC CPUs have a simple register-based instruction encoding:
• this can lead to code bloat - as can poor coding and compiler design
• compactness gets worse as the word size increases
e.g. the INMOS (1980s) transputer had a stack-based register scheme:
• it needed 60% of the code of an equivalent register-based CPU
• this led to smaller cache needs for instruction fetches & data

Cache Efficiency
• The designer should aim to optimise instruction performance whilst using the smallest cache possible
• Hiding latency (using parallelism & instruction scheduling) is an effective alternative to minimising it (by using large caches)
• Instruction scheduling can initiate cache pre-fetches
• Switch to another thread if the cache is not ready to supply data for the current one
• In video and audio processing especially, unroll the inner code loops - loop unrolling (more on that later)

Predictable Codes
In many applications (e.g. video and audio processing) much is known about the code which will be executed. Techniques suitable for these circumstances include:
• partition the cache separately for code and for different data structures
• the cache requirements of the inner code loops can be predetermined, so cache usage can be optimised
• control the amounts of a data structure which are cached
• prevent interference between threads by careful scheduling
• notice that a conventional cache's contents are destroyed by a single block-copy instruction

Processor Engineering Issues
• Power consumption must be minimised (to simplify on-chip and in-box cooling):
  – use low-voltage processors (2V instead of 3.3V)
  – don't over-clock the processor
  – design logic carefully to avoid propagation of redundant signals
  – tolerance of latency allows lower-performance (cheaper) subsystems to be used
  – explicit subsystem control allows subsystems to be powered down when not in use
  – eliminate redundant actions - e.g. speculative pre-fetching
  – provide non-busy synchronisation to avoid the need for spin-locks
• Battery design is advancing slowly - power stored per unit weight or volume will quadruple (over NiCd) within 5-10 years
Processor Engineering Issues (cont'd)
• Time to market is shrinking, so processor design time is becoming critical. Consider the time taken for several common devices to become established:
  – 70 years: telephone (0% to 60% of households)
  – 40 years: cable television
  – 20 years: personal computer
  – 10 years: video recorders
  – <10 years: web-based video
• Modularity and common processor cores provide design flexibility:
  – reusable cache and CPU cores
  – product-specific interfaces and co-processors
  – common connection schemes

Interconnect Schemes
Wide data buses are a problem:
• they are difficult to route on printed circuit boards
• they require huge numbers of processor and memory pins (expensive to manufacture on chips and PCBs)
• clocking must accommodate the slowest bus wire
• parallel back-planes add to loading and capacitance, slowing signals further and increasing power consumption
Serial chip interconnects offer 1 Gbit/s performance using just a few pins and wires. Can we use a packet-routing chip as a back-plane?
• Processors, memories, graphics devices, networks and slow external interfaces all joined to a central switch

Memory Devices
Regardless of the scale of the computer, the memory is similar. There are two major types:
• static
• dynamic
Larger memories get cheaper as production increases and smaller memories get more expensive - you pay more for less! See:
http://www.educypedia.be/computer/memoryram.htm
http://www.kingston.com/tools/umg/default.asp
http://www.ahinc.com/hhmemory.htm

Static Memories
• made from static logic elements - an array of flip-flops
• don't lose their stored contents until clocked again
• may be driven as slowly as needed - useful for single-stepping a processor
• any location may be read or written independently
• reading does not require a re-write afterwards
• writing data does not require the row containing it to be pre-read
• no housekeeping actions are needed
• the address lines are usually all supplied at the same time
• fast - 15 ns was possible in bipolar and 4-15 ns in CMOS; bipolar is not used any more - too much power for little gain in speed

HM6264 - 8K*8 static RAM organisation
[Diagram: 256x256 memory matrix with row decoder (address inputs A0-A12 split between row and column), column I/O and column decoder, input data control, timing pulse generator, and read/write control driven by CS1, CS2, WE and OE.]

HM6264 Read Cycle
[Waveform: read cycle showing Address, CS1, CS2, OE and Dout, annotated with the timing parameters tabulated below.]

  Item                                        Symbol  min  max  Unit
  Read cycle time                             tRC     100   -   ns
  Address access time                         tAA      -   100  ns
  Chip selection to output (CS1)              tCO1     -   100  ns
  Chip selection to output (CS2)              tCO2     -   100  ns
  Output enable to output valid               tOE      -    50  ns
  Chip selection to output in low Z (CS1)     tLZ1    10    -   ns
  Chip selection to output in low Z (CS2)     tLZ2    10    -   ns
  Output enable to output in low Z            tOLZ     5    -   ns
  Chip deselection to output in high Z (CS1)  tHZ1     0   35   ns
  Chip deselection to output in high Z (CS2)  tHZ2     0   35   ns
  Output disable to output in high Z          tOHZ     0   35   ns
  Output hold from address change             tOH     10    -   ns
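A quick worked check (my illustration, not from the slides): whether a CPU can read this part with no wait states comes down to comparing its memory-cycle length against the address access time tAA. All numbers below are illustrative.

  /* Sketch: zero-wait-state feasibility for an SRAM with a given tAA. */
  int zero_wait_ok(double clk_mhz, int clks_per_cycle, double t_aa_ns)
  {
      double cycle_ns = clks_per_cycle * 1000.0 / clk_mhz;
      return cycle_ns >= t_aa_ns;        /* 1 => no wait states needed */
  }
  /* e.g. zero_wait_ok(25.0, 3, 100.0): 120 ns >= 100 ns, so OK */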
HM6264 Write Cycle
[Waveform: write cycle showing Address, CS1, CS2, OE, WE, Dout and Din; the data is sampled by the memory at the end of the write pulse.]

  Item                                  Symbol  min  max  Unit
  Write cycle time                      tWC     100   -   ns
  Chip selection to end of write        tCW      80   -   ns
  Address set-up time                   tAS       0   -   ns
  Address valid to end of write         tAW      80   -   ns
  Write pulse width                     tWP      60   -   ns
  Write recovery time (CS1, WE)         tWR1      5   -   ns
  Write recovery time (CS2)             tWR2     15   -   ns
  Write to output in high Z             tWHZ      0  35   ns
  Data to write time overlap            tDW      40   -   ns
  Data hold from write time             tDH       0   -   ns
  Output enable to output in high Z     tOHZ      0  35   ns
  Output active from end of write       tOW       5   -   ns

Dynamic Memories
• information is stored on a capacitor - which discharges with time
• only one transistor is required per cell - against 6 for SRAM
• must be refreshed (0.1-0.01 pF needs a refresh every 2-8 ms)
• memory cells are organised so that they can be refreshed a row at a time, to minimise the time taken
• the row-and-column organisation lends itself to multiplexed row and column addresses - fewer pins on the chip
• RAS and CAS are used to latch the row and column addresses sequentially
• DRAM consumes high currents when switching transistors (1024 columns at a time), which can cause nasty voltage transients

HM50464 - 64K*4 dynamic RAM organisation
[Diagram: four 256x256 memory arrays with X and Y decoders, input and output buffers on I/O1-4, WE/OE/CAS/RAS clocks, a refresh address counter, and a one-transistor memory cell on the bit line at the RAS/CAS intersection.]

HM50464 Read Cycle
[Waveform: RAS falls with the row address, CAS falls with the column address, and valid output data appears on IO while OE is active.]
A dynamic memory read operates as follows:
• the read cycle starts by setting all bit lines (columns) to a suitable sense voltage - pre-charging
• the required row address is applied and RAS (row address strobe) is asserted
• the selected row is decoded and opens its transistors (one per column); this dumps the capacitors' charge into high-gain feedback amplifiers which recharge the capacitors - RAS must remain low
• simultaneously the column address is applied and CAS set; the decoded, requested bits are gated to the output - and driven off-chip when OE is active

HM50464 Write Cycle
[Waveform: as the read cycle, but WRITE falls and valid input data is latched on IO - an "early write" cycle.]
Similar to the read cycle, except that the fall of the WRITE signal times the latching of input data. In an "Early Write" cycle, WRITE falls before CAS - this ensures that the memory device keeps its data outputs disabled (otherwise, when CAS goes low, it could output data!). Alternatively, in a "Late Write" cycle the sequence is reversed and the OE line is kept high - this can be useful in common address/data bus architectures.

Refresh Cycle
For a refresh, no output is needed. A read with a valid RAS and row address pulls the data out; all we then need to do is put it back again by deasserting RAS. This must be repeated for all 256 rows (on the HM50464) every 4 ms. An on-chip counter can be used to generate the refresh addresses.

Page Mode Access ["Fast Page Mode DRAM"] - standard DRAM
The RAS cycle time is relatively long, so optimisations have been made for common access patterns. The row address is supplied just once and latched with RAS. Column addresses are then supplied and latched using CAS, and data is read or written using WRITE or OE. CAS and the column address can then be cycled to access bits in the same row. The cycle ends when RAS goes high again. Care must be taken to continue to refresh the other rows of the memory at the specified rate if needed.
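As an illustration (mine, not the slides'): 256 rows every 4 ms works out at one RAS-only refresh roughly every 15.6 µs, which a timer or the memory controller can schedule between normal accesses. ras_only_refresh() is a hypothetical hook for the actual bus operation.

  /* Sketch: distributed refresh - one row every (4 ms / 256) = 15.6 us. */
  #include <stdint.h>

  extern void ras_only_refresh(uint8_t row);  /* hypothetical bus operation  */

  static uint8_t next_row;                    /* mirrors the on-chip counter */

  void refresh_tick(void)                     /* call every 15.6 us          */
  {
      ras_only_refresh(next_row);
      next_row++;                             /* uint8_t wraps 255 -> 0      */
  }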
[Waveform: page-mode DRAM access - one RAS fall with the row address, then repeated CAS cycles, each with a new column address and a data transfer. Nibble and static-column modes look similar.]

Nibble Mode
Rather than supplying the second and subsequent column addresses, they can be calculated by incrementing the initial address - the first column address is stored in a register when CAS goes low, then incremented and used on the next low CAS transition. Less common than page mode.

Static Column Mode
Column addresses are treated statically: while CAS is low, the outputs are read whenever OE is low as well. If the column address changes, the outputs change (after a propagation delay). The frequency of address changes can be higher, as there is no need for an inactive CAS time.

Extended Data Out Mode ("EDO DRAM")
[Waveform: as page mode, but each word's data remains on the bus under OE control while the next CAS cycle begins.]
EDO DRAM is very similar to page-mode access, except that the data bus outputs are controlled exclusively by the OE line. CAS can therefore be taken high and low again without the data from the previous word being removed from the data bus - so data can be latched by the processor whilst the new column address is being latched by the memory. Overall cycle times can be shortened.

Synchronous DRAM ("SDRAM")
[Timing: simplified SDRAM burst read - an Activate command with the row address, a Read command with the column and bank, a 3-cycle latency, then a 4-word burst D0-D3, followed by Precharge and the next Activate.]
Instead of asynchronous control signals, SDRAMs accept one command in each clock cycle. The different stages of an access are initiated by separate commands - initial row address, reading, etc. - all pipelined, so that a read might not return a word for 2 or 3 cycles. Bursts of accesses to sequential words within a row may be requested by issuing a burst-length command; subsequent read or write requests then operate in units of the burst length.

Summary - DRAMs
• a whole row of the memory array must be read
• after reading, the data must be re-written
• writing requires the data to be read first (the whole row has to be stored even if only a few bits are changed)
• cycle time is a lot slower than static RAM
• address lines are multiplexed - saves package pin count
• the fastest DRAM commonly available has an access time of ~60 ns but a cycle time of 121 ns
• DRAMs consume more current
• SDRAMs replace the asynchronous control mechanisms

  Cycles required     Word 1  Word 2  Word 3  Word 4
  DRAM                  5       5       5       5
  Page-mode DRAM        5       3       3       3
  EDO DRAM              5       2       2       2
  SDRAM                 5       1       1       1
  SRAM                  2       1       1       1
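A small sketch (my arithmetic on the table above): for an N-word sequential burst, only the first word pays the full access penalty.

  /* Sketch: cycles for an N-word burst, from the summary table above. */
  unsigned burst_cycles(unsigned first_word, unsigned per_word, unsigned n)
  {
      return first_word + (n - 1) * per_word;
  }
  /* burst_cycles(5, 5, 4) = 20 for plain DRAM
     burst_cycles(5, 1, 4) =  8 for SDRAM - a 2.5x win on 4-word bursts */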
Memory Interfacing
Most processors rely on external memory. The unit of access is a word, carried along the Data Bus. Ignoring caching and virtual memory, all memory belongs to a single address space; addresses are passed on the Address Bus. Hardware devices may respond to particular addresses - memory-mapped devices. External memory is a collection of memory chips:
• all memory devices are joined to the same data bus
• the main purpose of the addressing logic is to ensure that only one memory device is activated during each cycle

Interfacing
The Data Bus has n lines - n = 8, 16, 32 or 64. The Address Bus has m lines - m = 16, 20, 24, 32 or 64 - providing 2^m words of memory. The Address Bus is used at the beginning of a cycle and the Data Bus at the end, so it is possible to multiplex the two buses in time. This can create all sorts of timing complications, but the benefit of a reduced processor pin count makes it relatively common. The processor must tell the memory subsystem what to do and when to do it; it can do this either synchronously or asynchronously.

Interfacing synchronously:
• the processor defines the duration of a memory cycle
• it provides control lines for the beginning and end of the cycle
• most conventional
• the durations and relationships might be determined at boot time (available in the 1980s in the INMOS transputer)
Interfacing asynchronously:
• the processor starts the cycle; the memory signals the end of the cycle
• error recovery is needed if non-existent memory is accessed (Bus Error)

Synchronous memory scheme control signals
• Memory system active
  – goes active when the processor is accessing external memory
  – used to enable the address decoding logic, which provides one active chip select to a group of chips
• Read Memory
  – says the processor is not driving the data bus
  – the selected memory can return data to the data bus
  – usually connected to the output enable (OE) of the memory
• Memory Write
  – indicates the data bus contains data which the selected memory device should store
  – different processors use the leading or trailing edge of the signal to latch data into memory
  – processors with a data bus wider than 8 bits have a separate memory-write-byte signal for each byte of data
  – memory write lines connect to the write lines of the memories
• Address Latch Enable (in multiplexed-address machines)
  – tells the addressing logic when to take a copy of the address from the multiplexed bus, so the processor can use it for data later
• Memory Wait
  – causes the processor to extend the memory cycle
  – allows fast and slow memories to be used together without loss of speed

Address Blocks
How do we place blocks of memory within the address space of our processor? There are two methods of addressing memory:
• Byte addressing
  – each byte has its own address
  – good for 8-bit microprocessors and graphics systems
  – but what if memory is 16 or 32 bits wide?
• Word addressing
  – only the address lines which number individual words are used to select a multi-byte word
  – the extra byte-address bits are retained in the processor to manipulate individual bytes
  – or write-byte control signals are used
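A one-line illustration (mine): on a 32-bit-wide memory the processor splits a byte address into a word index for the address lines and a byte select for the write strobes - exactly the A0/A1 split used in the decoding example below.

  /* Sketch: byte-address split for a 32-bit-wide memory. */
  #include <stdint.h>

  void split(uint32_t byte_addr, uint32_t *word_index, uint32_t *byte_sel)
  {
      *word_index = byte_addr >> 2;    /* drives address lines A2 upward */
      *byte_sel   = byte_addr & 3u;    /* selects one of 4 write strobes */
  }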
Address Blocks (cont'd)
We often want different blocks of memory:
• particular addresses might be special:
  – memory-mapped I/O ports
  – the location executed first after a reset
  – fast on-chip memory
  – diagnostic or test locations
• we also want:
  – SRAM and/or DRAM in one contiguous block
  – memory-mapped graphics screen memory
  – ROM for booting and low-level system operation
  – extra locations for peripheral controller registers
• each memory block might be built from individual memory chips:
  – address and control lines wired in parallel
  – data lines brought out separately to provide an n-bit word
• fit all the blocks together in the overall address map:
  – it is easier to place similar-sized blocks next to each other, so that they can be combined into a 2^(k+1)-word area
  – jumbling blocks of various sizes complicates address decoding
  – if contiguous blocks are not needed, place them at major power-of-2 boundaries - e.g. put the base of SRAM at 0, ROM half way up, and the lowest memory-mapped peripheral at 7/8ths

Address Decoding
The address decoding logic determines which memory device to enable, depending upon the address.
• if a memory area stores contiguous words as a 2^k-word block:
  – all memory devices in that area will have k address lines
  – connected (normally) to the k least-significant lines
  – the remaining m-k lines are examined to see if they match the most-significant part of the address of that area
Three schemes are possible:
• Full (unique) decoding
  – all m-k bits are compared with exact values to make up the full address of that block
  – only one block can become active
• Partial decoding
  – only some of the m-k lines are decoded, so that a number of blocks of addresses will cause a particular chip select to become active
  – e.g. ignoring one line means the same memory device is accessible at two places in the memory map
  – makes decoding simpler
• Non-unique decoding
  – connect a different one of the m-k lines directly to the active-low chip select of each memory block
  – a memory block can then be activated by referencing that line
  – no extra logic is needed
  – BUT two blocks can be accessed at once this way...

Address Decoding - Example
A processor has a 32-bit data bus. It also provides a separate 30-bit word-addressed address bus, labelled A2 to A31, since it refers to memory initially using byte addressing, where A0 and A1 are the byte-addressing bits. It is desired to connect 2 banks of SRAM (each built from 128K*8 devices) and one bank of DRAM, built from 1M*4 devices, to this processor. The SRAM banks should start at the bottom of the address map, and the DRAM bank should be contiguous with the SRAM. Specify the address map and design the decoding logic.

Each bank of SRAM will require 4 devices to make up the 32-bit data bus; the bank of DRAM will require 8 devices.
Address map (word addresses):

  00040000 - 0013FFFF   DRAM bank    (1M words, 20 bits; 8 devices of 1M*4)
  00020000 - 0003FFFF   SRAM bank 1  (128k words, 17 bits; 4 devices of 128K*8)
  00000000 - 0001FFFF   SRAM bank 0  (128k words, 17 bits; 4 devices of 128K*8)

Wiring: 17 address lines (A2-A18) run in parallel to all the SRAM devices, with 8 data lines to each device; 20 address lines (A2-A21) run in parallel to all the DRAM devices, with 4 data lines to each device. CS1 connects to the chip selects of SRAM bank 0, CS2 to SRAM bank 1 and CS3 to the DRAM bank. Omitting all address lines A23 and above to simplify:

  CS1 = /A19 . /A20 . /A21 . /A22
  CS2 =  A19 . /A20 . /A21 . /A22
  CS3 =  A20 + A21 + A22

(/A denotes the inverted line.) Note that CS3 activates the DRAM whenever any of A20-A22 is set; because the DRAM's own 20 address lines simply wrap, the bank appears contiguous from 00040000 to 0013FFFF, with its first 256k words addressed at the top of that range.
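A behavioural model of this decoder (my sketch; in hardware it is just the gates above). Here word_addr bit 0 corresponds to pin A2.

  /* Sketch: behavioural model of the example's address decoder. */
  #include <stdbool.h>
  #include <stdint.h>

  void decode(uint32_t word_addr, bool *cs1, bool *cs2, bool *cs3)
  {
      bool a19 = word_addr & (1u << 17);   /* pin A19 = word-address bit 17 */
      bool a20 = word_addr & (1u << 18);
      bool a21 = word_addr & (1u << 19);
      bool a22 = word_addr & (1u << 20);

      *cs1 = !a19 && !a20 && !a21 && !a22; /* SRAM bank 0 */
      *cs2 =  a19 && !a20 && !a21 && !a22; /* SRAM bank 1 */
      *cs3 =  a20 ||  a21 ||  a22;         /* DRAM bank   */
  }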
Connecting Multiplexed Address and Data Buses
There are many multiplexing schemes, but let's choose 3 processor types and 2 memory types and look at the possible interconnections:
• Processor types, all with 8-bit data and 16-bit address buses:
  – no multiplexing (e.g. Zilog Z80)
  – least-significant address bits multiplexed with the data bus (Intel 8085)
  – most-significant and least-significant halves of the address bus multiplexed together
• Memory types:
  – SRAM (8k*8) - no address multiplexing
  – DRAM (16k*4) - with multiplexed address inputs

CPU vs Static Memory Configurations
[Diagrams: (1) non-multiplexed address bus - A0-A15 to the address decode, A0-A12 straight to the 8k*8 SRAM; (2) LS addresses multiplexed with the data bus - an external latch captures AD0-7 to rebuild the low address byte; (3) time-multiplexed address bus - a latch captures one half of the address MA0-7, then the other.]

CPU vs Dynamic Memory Configurations
[Diagrams: the same three processor types driving 2 x 16k*4 DRAMs. In each case an external multiplexer (MPX) presents the 7-bit row and column addresses MA0-6 in turn, the address decode generates RAS and CAS, and each DRAM supplies 4 of the 8 data lines; with multiplexed processor buses, external latches hold the halves of the address.]

Displays - Video Display Characteristics
• Consider a video display producing 640*240-pixel monochrome, non-interlaced images at a frame rate of 50 Hz. Adding 20% each for line and frame flyback:
    dot rate = (640*1.2)*(240*1.2)*50 Hz ≈ 11 MHz, i.e. ≈ 90 ns/pixel
• For a 1024*800 non-interlaced display:
    dot rate = (1024*1.2)*(800*1.2)*50 Hz ≈ 59 MHz, i.e. ≈ 17 ns/pixel
• Add colour with 64 levels for each of R, G and B - 18 bits per pixel - and the bandwidth is now over 1 Gbit/s...

• Problems with high bit rates:
  – memory-mapping the screen display within the processor's address map couples the CPU and display tightly - they must be designed together
  – so that the screen may be refreshed at the rates video requires, the display must have higher priority than the processor for DMA on the memory bus - it uses much of the bandwidth
  – to update the image, the CPU may require very fast access to the screen memory too
  – the megabytes of memory needed for large screen displays are still relatively expensive - compared with the CPU etc.

Bit-Mapped Displays
• even a 640*240-pixel display cannot easily be maintained using DMA access to the CPU's RAM - except with multiple-word access
• increase the memory bandwidth for the video display with special video DRAM:
  – allows a whole row of DRAM (256 or 1024 bits) to be moved in one DMA access
• many video DRAMs may be mapped so that each provides a single bit of a multi-bit pixel in parallel - colour displays

Character Based Displays
• limited to displaying one of a small number of images in fixed positions:
  – typically 24 lines of 80 characters
  – normally 8-bit ASCII
• the character value is used to determine the image from a look-up table:
  – the table is often in ROM (a RAM version allows font changes)
• for a character 9 dots wide by 14 high:
  – 14 rows of pixels are generated for each row of characters
  – to display a complete frame, pixels are drawn at a suitable dot rate:
      dot rate = (80*9*1.2)*(24*14*1.2)*50 Hz ≈ 17.3 MHz ≈ 58 ns/pixel
• a row of 80 characters must be read for every displayed line:
  – giving a line rate of 20.16 kHz (similar to the EGA standard)
  – the overall memory access rate needed is ~1.6 Mbytes/second (625 ns/byte)
  – barely supportable using DMA on small computers; even at 4 bytes at a time (32-bit machines) it is still a major use of the data bus
• to avoid re-reading each line of 80 characters for the other 13 pixel rows, the characters can be stored in a circular shift register on first access and used from there instead of memory:
  – only 80*24*50 accesses/sec are needed - in bursts - roughly 10 µs per byte on average - easily supported
  – the whole 80 bytes can be read during flyback, before the start of a new character row, at full memory speed in one DMA burst - 80 * about 200 ns, at a rate of 24*50 times a second - less than 2% of the bus bandwidth
• assuming that the rows of 80 characters in the CPU's memory map are stored at 128-byte boundaries (which simplifies addressing), the CPU memory address splits as:
    [address of screen memory: n-12 bits, to address decode] [row: 5 bits, 0..23] [column: 7 bits, 0..79]
• the address of a character position on the screen is formed from counters, with carries rippling upwards:
    row (5 bits, 0..23) | line number within row (4 bits, 0..13) | column (7 bits, 0..79) | dot number across character (4 bits, 0..8)
  The row and column index the screen memory; the character code plus line number index the look-up table; the dot number addresses the current bit in the shift register.
• an appropriate block diagram of the display:
[Diagram: screen address (row, column) -> screen memory -> 8-bit ASCII -> character generator ROM ((16*256)*9 bits), indexed by the 4-bit line number -> 9-bit parallel load into a 9-to-1-bit shift register clocked at the dot rate -> video data out, with an 80*8-bit FIFO recirculating the character row.]
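The dot-rate arithmetic used above, as a helper (my sketch, not from the slides):

  /* Sketch: display dot rate with 20% line and frame flyback overheads. */
  double dot_rate_hz(double h_pixels, double v_lines, double frame_hz)
  {
      return (h_pixels * 1.2) * (v_lines * 1.2) * frame_hz;
  }
  /* dot_rate_hz(640, 240, 50)  ~= 11.1e6  (~90 ns/pixel)
     dot_rate_hz(1024, 800, 50) ~= 59.0e6  (~17 ns/pixel) */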
• The problem with DMA-fetching individual characters from display memory is its interference with the processor. An alternative is to use dual-port memories.

Dual Port SRAMs
• provide 2 (or more) separate data and address pathways to each memory cell
• 100% of the memory bandwidth can be used by the display without affecting the CPU
• can be expensive - ~£25 for 4 kbytes - which makes megabyte displays impractical, although for a character-based display it would be OK
[Diagram: a memory array with two independent ports, each with its own address decode (row, column), chip enable, write enable, output enable and data lines D0..Dn.]

Bit-Mapped Graphics & Memory Interleaving
• instead of using an intermediate character generator, all pixel information can be stored in screen memory, at the pixel rates calculated above
• even a 640*240-pixel display cannot be maintained using DMA access to the CPU's RAM - except with multiple-word access
• increase the memory bandwidth for the video display with special video DRAM:
  – allows a whole row of DRAM (256 or 1024 bits) in one DMA access
• many video DRAMs may be mapped to provide a single bit of a multi-bit pixel in parallel - colour displays
• use of the video shift register limits the clocking frequency to 25 MHz - 40 ns/pixel

Graphics Card
A graphics card consists of:
• GPU (Graphics Processing Unit) - a microprocessor optimised for 3D graphics rendering, clocked at 250-850 MHz with pipelining; converts 3D images of vertices and lines into a 2D pixel image
• Video BIOS - the program to operate the card, interface timings etc.
• Video memory - can use the computer's RAM, but more often has its own video RAM (128 Mb-2 Gb); often multiport VRAM, now DDR (double data rate - uses both the rising and falling edges of the clock)
• RAMDAC - Random Access Memory Digital-to-Analogue Converter - drives the CRT

Using Video DRAMs
• to generate analogue signals for a colour display:
  – 3 fast DAC devices are needed, each fed with 6 or 8 bits of data
  – one each for the red, green and blue video inputs
• to save storing so much data per pixel (24 bits), a Colour Look-Up Table (CLUT) device can be used:
  – uses a small RAM as a look-up table
  – e.g. a 256-entry table accessed by the 8-bit value stored for each pixel; the table contains the 18 or 24 bits used to drive the DACs
  – hence "256 colours may be displayed from a palette of 262144"
[Diagram: 8-bit pixel data addresses a 256-entry, 18-bit-wide RAM (with a second port for updating the palette); the three 6-bit fields drive the red, green and blue DAC outputs.]
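A software model of the CLUT stage (my sketch; in hardware it is just the small RAM in front of the DACs):

  /* Sketch: CLUT expansion - 8-bit pixels to 18-bit RGB (6 bits per gun). */
  #include <stdint.h>

  static uint32_t clut[256];      /* palette RAM: 18 bits used per entry  */

  void scanline_to_rgb(const uint8_t *pix, uint32_t *rgb, int n)
  {
      for (int i = 0; i < n; i++)
          rgb[i] = clut[pix[i]];  /* each 6-bit field feeds one video DAC */
  }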
Addressing Considerations
• if the number of bits in the shift registers is not the same as the number of displayed pixels, it is easier to ignore the extra ones - wasting memory may make addressing simpler
• making the processor's screen memory bigger than the displayable memory gives a scrollable virtual window:
    [screen block select] [row address: log2(v) bits, 0..v-1] [column address: log2(h) bits, 0..h-1]  (not all combinations used)
• even though most 32-bit processors can access individual bytes (used as pixels), this is not as efficient as accessing memory in word (32-bit) units
• sometimes it might be better NOT to arrange the displayed pixels in ascending memory address order:
  – each word defines 4 horizontally neighbouring pixels, each fully specifying its colour - the most simple and common representation
  – each word defines a 2x2 block of pixels (2 horizontally by 2 vertically) with all colour data - useful for text or graphics applications where small rectangular blocks are modified, since fewer words may need to be accessed per change
  – each word defines one bit of 32 horizontally neighbouring pixels; 8 words (in 8 separate colour planes) must be changed to completely change any pixel - useful for adding or moving blocks of solid colour - CAD
• the video memories must now be arranged so that the bits within the CPU's 32-bit words can all be read or written to their relevant locations in video memory in parallel:
  – this is done by making sure that the pixels stored in neighbouring 32-bit words are stored in different memory chips - interleaving

Example
Design a 1024*512-pixel colour display capable of passing 8 bits per pixel to a CLUT. Use a video frame rate of 60 Hz and video DRAMs with a shift-register maximum clocking frequency of 25 MHz. Produce a solution that supports a processor with an 8-bit data bus.

• 1024 pixels across the screen can be satisfied using one 1024-bit shift register (or 4 multiplexed 256-bit ones)
• the frame rate is 60 Hz and the number of lines displayed is 512
• the line rate is therefore 60*512 = 30.72 kHz - or 32.55 µs/line
• 1024 pixels per line gives a dot rate of 30.72k * 1024 = 31.46 MHz
• the dot time is thus 32 ns - too fast for one shift register! So we will have to interleave 2 or more
• multiplexing the minimum 3 shift registers would make addressing complicated; it is easier to use 4 VRAMs - each with 256 rows of 256 columns, each row/column intersection containing 4 bits, interfaced by 4 pins to the processor and to 4 separate shift registers
• hence, for an 8-bit CPU, the addresses split as follows:
  – CPU memory address (byte address): [screen block select: remaining high bits] [row: 9 bits, 0..511] [column: 10 bits, 0..1023]
  – Video address (pixel counters): the 9-bit row counter supplies 8 bits to the RAS address inputs and 1 bit to the top/bottom multiplexers; the 10-bit column counter supplies 8 bits of column address, 1 bit to the odd/even pixel multiplexer (selecting which VRAM group), and 1 bit implicit in the cascaded shift registers
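My reading of that arrangement as code (a sketch, not from the slides): the low bit of the pixel x address picks the odd/even register stream, and the top bit of the line counter picks the top/bottom half of the screen.

  /* Sketch: which VRAM pair holds a pixel in the 1024x512 example. */
  unsigned vram_pair(unsigned x, unsigned y)   /* x: 0..1023, y: 0..511 */
  {
      unsigned odd    = x & 1u;      /* odd/even pixel multiplexer bit   */
      unsigned bottom = y >> 8;      /* 0 = top 256 lines, 1 = bottom    */
      return (odd << 1) | bottom;    /* 0..3 selects one of 4 VRAM pairs */
  }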
[Diagram: the eight 256*256*4 VRAMs A-H. A+B and C+D hold the even pixels (top and bottom 256 lines respectively); E+F and G+H hold the odd pixels. Top/bottom multiplexers combine each pair's 4-bit outputs into 8 bits, and a final odd/even pixel multiplexer feeds the interleaved 8-bit pixel stream to the CLUT and the R, G, B outputs.]

Mass Memory Concepts
Disk technology:
• basically unchanged for 50 years; similar for CD and DVD
• 1-12 platters, double sided, rotating at 3600-10000 rpm
• circular tracks, subdivided into sectors
• recording density >3 Gb/cm2
• the innermost tracks are not used - they cannot be used efficiently
• inner tracks are a factor of 2 shorter than the outer tracks, hence there are more sectors in the outer tracks
• a cylinder is the set of tracks with the same diameter on all recording surfaces

Access Time
• Seek time - align the head with the cylinder containing the track with the sector inside
• Rotational latency - the time for the disk to rotate to the beginning of the sector
• Data transfer time - the time for the sector to pass under the head

  Disk capacity = surfaces x tracks/surface x sectors/track x bytes/sector

Key Attributes of Example Discs

                                      Seagate       Hitachi     IBM
  Identity   Series                   Barracuda     DK23DA      Microdrive
             Model number             ST1181677LW   ATA-5 40    DSCM-11000
             Typical application      Desktop       Laptop      Pocket device
  Storage    Formatted capacity, GB   180           40          1
             Recording surfaces       24            4           2
             Cylinders                24,247        33,067      7,167
             Sector size, B           512           512         512
             Avg sectors/track        604           591         140
             Max density, Gb/cm2      2.4           5.1         2.4
  Access     Min seek time, ms        1             3           1
             Max seek time, ms        17            25          19
             External data rate, MB/s 160           100         13
  Physical   Diameter, inches         3.5           2.5         1
             Platters                 12            2           1
             Rotation speed, rpm      7200          4200        3600
             Weight, kg               1.04          0.10        0.04
             Operating power, W       14.1          2.3         0.8
             Idle power, W            10.3          0.7         0.5

Samsung launched a 1 Tb hard drive: 3 x 3.5" platters, 334 Gb per platter, 7200 rpm, 32 Mb cache, 3 Gb/s SATA interface (SATA - Serial Advanced Technology Attachment). The highest density so far...
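The capacity formula above, plus the rotational-latency rule of thumb used on the next slide, as code (my sketch):

  /* Sketch: disk capacity and average rotational latency. */
  double capacity_bytes(double surfaces, double tracks_per_surface,
                        double sectors_per_track, double bytes_per_sector)
  {
      return surfaces * tracks_per_surface
                      * sectors_per_track * bytes_per_sector;
  }

  double avg_rotational_latency_ms(double rpm)
  {
      return 0.5 * 60000.0 / rpm;    /* half a revolution, in ms */
  }
  /* avg_rotational_latency_ms(10000.0) = 3.0 - as quoted below */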
Disk Organisation
Data bits are small regions of magnetic coating, magnetised in different directions to give 0 or 1. Special encoding techniques maximise the storage density: e.g. rather than letting the data bit values dictate the direction of magnetisation, magnetise based on a change of bit value - non-return-to-zero (NRZ) - which allows a doubling of recording capacity.

• each sector is preceded by a sector number and followed by a cyclic redundancy check, allowing some errors and anomalies to be corrected
• various gaps within and separating sectors allow processing to finish
• the unit of transfer is a sector - typically 512 bytes to 2K bytes
• a sector address consists of 3 components:
    Disk address (17-31 bits) = Cylinder# (10-16 bits) + Track# (1-5 bits) + Sector# (6-10 bits)
  – Cylinder# positions the actuator arm
  – Track# selects the read/write head, i.e. the surface
  – Sector# is compared with the sector numbers recorded on the track as they pass
• sectors are independent and can be arranged in any logical order
• each sector needs some time to be processed - some sectors may pass before the disk is ready to read again, so logical sectors are not stored sequentially as physical sectors:
    track i     0 16 32 48  1 17 33 49  2 18 34 50  3 19 35 51  4 20 36 52 ...
    track i+1   ... 30 46 62 15 31 47  0 16 32 48  1 17 33 49  2 18 34 50  3 19 ...
    track i+2   ... 60 13 29 45 61 14 30 46 62 15 31 47  0 16 32 48  1 17 33 49 ...

Disk Performance
Disk access latency = seek time + rotational latency
• Seek time - depends on how far the head travels from the current cylinder; mechanical motion - the arm accelerates and brakes
• Rotational latency - depends upon position; the average rotational latency is the time for half a rotation - at 10,000 rpm this is 3 ms

RAID - Redundant Array of Inexpensive (Independent) Disks
• high capacity and faster response without speciality hardware
• High capacity faster response without specialty hardware 3/16/2016 EE3.cma - Computer Architecture 84 Mass Memory Concepts RAID0 – multiple disks appear as a single disk each accessing a part of a single item across many disks 3/16/2016 EE3.cma - Computer Architecture 85 Mass Memory Concepts RAID1 – robustness added by mirror contents on duplicate disks – 100% redundancy 3/16/2016 EE3.cma - Computer Architecture 86 Mass Memory Concepts RAID2 – robustness using error correcting codes – reducing redundancy – Hamming codes – ~50% redundancy 3/16/2016 EE3.cma - Computer Architecture 87 Mass Memory Concepts RAID3 – robustness using separate parity and spare disks – reducing redundancy to 25% 3/16/2016 EE3.cma - Computer Architecture 88 Mass Memory Concepts RAID4 – Parity/Checksum applied to sectors instead of bytes – requires large use of parity disk 3/16/2016 EE3.cma - Computer Architecture 89 Mass Memory Concepts RAID5 – Parity/Checksum distributed across disks – but 2 disk failures can cause data loss 3/16/2016 EE3.cma - Computer Architecture 90 Mass Memory Concepts RAID6 – Parity/Checksum distributed across disks and a second checksum scheme (P+Q) distributed across different disks 3/16/2016 EE3.cma - Computer Architecture 91 9 3/16/2016 EE3.cma - Computer Architecture 92 Virtual Memory In order to take advantage of the various performance and prices of different types of memory devices it is normal for a memory hierarchy to be used: CPU register fastest data storage medium cache for increased speed of access to DRAM main RAM normally DRAM for cost reasons; SRAM possible disc magnetic, random access magnetic tape serial access for archiving; cheap • How and where do we find memory that is not RAM? • How does a job maintain a consistent user image when there are many others swapping resources between memory devices? • How can all users pretend they have access to similar memory addresses? 3/16/2016 EE3.cma - Computer Architecture 93 Virtual Memory Paging In a paged virtual memory system the virtual address is treated as groups of bits which correspond to the Page number and offset or displacement within the page – often denoted as (P,D) pair. • Page number can be looked up in a page table and concatenated with the offset to give the real address. • There is normally a separate page table for each virtual machine which point to pages in the same memory. • There are two methods used for page table lookup – direct mapping – associative mapping 3/16/2016 EE3.cma - Computer Architecture 94 Virtual Memory Direct Mapping • uses a page table with the same number of entries as there are pages of virtual memory. • thus possible to look up the entry corresponding to the virtual page number to find – the real address of the page (if the page is currently resident in real memory) – or the address of that page on the backing store if not • This may not be economic for large mainframes with many users • A large page table is expensive to keep in RAM and may be paged... 3/16/2016 EE3.cma - Computer Architecture 95 Virtual Memory Content Addressable Memories • when an ordinary memory is given an address it returns the data word stored at that location. • A content addressable memory is supplied data rather than an address. 
• It looks through all its storage cells to find a location which matches the pattern and returns which cell contained the data - may be more than one 3/16/2016 EE3.cma - Computer Architecture 96 Virtual Memory Content Addressable Memories • It is possible to perform a translation operation using a content addressable memory • An output value is stored together with each cell used for matching • When a match is made the signal from the match is used to enable the register containing the output value • Care needs to be taken so that only one output becomes active at any time 3/16/2016 EE3.cma - Computer Architecture 97 Virtual Memory Associative Mapping • Associative mapping uses a content addressable memory to find if the page number exist in the page table • If it does the rest of the entry contains the real memory address of the start of the page • If not then page is currently in backing store and needs to be found from a directly mapped page table on disc • The associative memory only needs to contain the same number of entries as the number of pages of real memory - much smaller than the directly mapped table 3/16/2016 EE3.cma - Computer Architecture 98 Virtual Memory Associative Mapping • A combination of direct and associative mapping is often used. 3/16/2016 EE3.cma - Computer Architecture 99 Virtual Memory Paging • Paging is viable because programs tend to consist of loops and functions which are called repeatedly from the same area of memory. Data tends to be stored in sequential areas of memory and are likely to be used frequently once brought into main memory. • Some memory access will be unexpected, unrepeated and so wasteful of page resources. • It is easy to produce a program which mis-use virtual memory, provoking frantic paging as they access memory over a wide area. • When RAM is full, paging can not just read virtual pages from backing store to RAM, it must first discard old ones to the backing store. 3/16/2016 EE3.cma - Computer Architecture 100 10 3/16/2016 EE3.cma - Computer Architecture 101 Virtual Memory Paging • There are a number of algorithms that can be used to decide which ones to move: – Random replacement - easy to implement, but takes no account of usage – FIFO replacement - simple cyclic queue, similar to above – First-In-Not-Used-First-Out - FIFO queue enhanced with extra bits which are set when page is accessed and reset when entry is tested cyclically. – Least Recently Used - uses set of counters so that access can be logged – Working Set - all pages used in last x accesses are flagged as working set. All other pages are discarded to leave memory partially empty, ready for further paging 3/16/2016 EE3.cma - Computer Architecture 102 Virtual Memory Paging - general points • Every process requires its own page table - so that it can make independent translation of location of actual page • Memory fragmentation under paging can be serious. 
Paging - general points
• every process requires its own page table, so that it can make an independent translation of the location of an actual page
• memory fragmentation under paging can be serious:
  – as pages are of a set size, usage will not fill a complete page, and the last page of a set will not normally be full
  – especially if the page size is large, to optimise disc usage (reducing the number of head movements)
• extra bits can be stored in the page table with the real address - e.g. a dirty bit - to determine whether the page has been written to since it was copied in, and hence whether it needs to be copied back

Segmentation
• a virtual address in a segmented system is made from 2 parts - a segment number and a displacement within it: (S,D) pairs
• unlike pages, segments are not fixed length; they may be variable
• segments store complete entities - pages allow objects to be split
• each task has its own segment table
• the segment table contains the base address and length of each segment, so that other segments aren't corrupted
• segmentation doesn't give rise to fragmentation in the same way: segments are of variable size, so there is no waste within a segment
• BUT as they are of variable size, it is not very easy to plan how to fit them into memory:
  – keep a sorted table of vacant blocks of memory and combine neighbouring blocks when possible
• information can be kept on the "type" of a segment - read-only, executable etc. - since segments correspond to complete entities

Segmentation & Paging
• a combination of segmentation and paging uses a triplet of virtual address fields - the segment number, the page number within the segment and the displacement within the page: (S,P,D)
• more efficient than pure paging - the use of space is more flexible
• more efficient than pure segmentation - it allows part of a segment to be swapped
• it is easy to misuse virtual memory through a simple difference in the way that routines are coded. The two examples below perform exactly the same task, but the first generates 1,000 page faults on a machine with 1K-word pages, while the second generates 1,000,000. Most languages (except Fortran) store arrays in memory with the rows laid out sequentially, the right-hand subscript varying most rapidly:

  void order(void)
  {
      static int array[1000][1000];
      int ii, jj;
      for (ii = 0; ii < 1000; ii++)
          for (jj = 0; jj < 1000; jj++)
              array[ii][jj] = 0;    /* row order: sequential addresses  */
  }

  void disorder(void)
  {
      static int array[1000][1000];
      int ii, jj;
      for (ii = 0; ii < 1000; ii++)
          for (jj = 0; jj < 1000; jj++)
              array[jj][ii] = 0;    /* column order: a new page each time */
  }

Memory Caches
• most general-purpose microprocessor systems use DRAM for their bulk RAM requirements because it is cheap and more dense than SRAM
• the penalty for this is that it is slower - SRAM has a 3-4 times shorter cycle time
• to help, some SRAM can be added:
  – on-chip, directly to the CPU, for use as desired - its use depends on the compiler; not always easy to use efficiently, but fast access
  – caching - between the DRAM and the CPU; built from small, fast SRAM holding copies of certain parts of main memory. The method used to decide what to keep in the cache determines the performance
  – a combination of the two - on-chip cache
Directly Mapped Caches - the simplest form of memory cache
• the real memory address is treated in parts:
    [block select: tag (t bits)] [cache index (c bits)]
• for a cache of 2^c words, the cache index section of the real memory address indicates which cache entry is able to store data from that address
• when cached, the tag (the most significant bits of the address) is stored in the cache with the data, to indicate which page it came from
• the cache will store 2^c words drawn from 2^t pages
• in operation, the tag is compared in every memory cycle:
  – if the tag matches, a cache hit is achieved and the cached data is passed
  – otherwise a cache miss occurs; the DRAM supplies the word, and the data and its tag are stored in the cache
[Diagram: the tag field is compared with the tag stored at the indexed cache line, selecting between cache and main memory.]

Set Associative Caches
• a 2-way cache contains 2 cache blocks per line, each capable of storing one word and the appropriate tag
• for any memory access, the two stored tags are checked
• this requires an associative memory with 2 entries for each of the 2^c cache lines
• similarly, a 4-way cache stores 4 cache entries for each cache index
[Diagram: one index selects a line in each of two tag/data arrays; both tags are compared and the matching way (or main memory) is used.]

Fully Associative Caches
• a 2-way cache has two places which it must read and compare to look for a tag
• this can be extended to the size of the cache memory, so that any main memory word can be cached at any location in the cache
• the cache then has no index (c = 0) and contains longer tags with the data
  – notice that as c (the index length) decreases, t (the tag length) must increase to match
• all tags are compared on each memory access; to be fast, all tags must be compared in parallel
    [block select: tag (t bits)] [no cache index (c = 0)]
• the INMOS T9000 had such a cache on chip

Degree of Set Associativity
• for any chosen size of cache, there is a choice between more associativity or a larger index field width
• the optimum can depend on the workload and instruction mix, and is assessable by simulation. In practice:
  – an 8 kbyte (2k-entry) directly addressed cache produces a hit rate of about 73%; a 32 kbyte cache achieves 86%, and a 128 kbyte 2-way cache 89% (all these figures depend on the characteristics of the instruction set, the code executed, the data used, etc. - these are for the Intel 80386)
• considering the greater complexity of the 2-way cache, there doesn't seem to be a great advantage in applying it

Cache Line Size
• it is possible to have cache data entries wider than a single word - i.e. a line size > 1
• a real memory access then causes 2, 4 etc. words to be read:
  – reading is performed over an n-word data bus
  – or from page-mode DRAM, capable of transferring multiple words from the same row by supplying extra column addresses
  – the extra words are stored in the cache in an extended data area
  – as most code (and data access) is sequential, it is likely that the next word will come in useful...
  – the real memory address specifies which word in the line it wants:
      [block select: tag (t bits)] [cache index (c bits)] [line address (l bits)]
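The direct-mapped lookup, sketched in C (mine; the field widths are illustrative):

  /* Sketch: direct-mapped cache lookup - the index selects the line,
     the tag confirms it. */
  #include <stdbool.h>
  #include <stdint.h>

  #define C_BITS 9u                       /* 2^9 = 512 cache entries     */

  typedef struct { uint32_t tag, data; bool valid; } cache_line;

  bool cache_read(cache_line *cache, uint32_t addr, uint32_t *out)
  {
      uint32_t index = addr & ((1u << C_BITS) - 1);
      uint32_t tag   = addr >> C_BITS;
      if (cache[index].valid && cache[index].tag == tag) {
          *out = cache[index].data;       /* hit: serve at SRAM speed     */
          return true;
      }
      return false;                       /* miss: fetch from DRAM, refill */
  }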
Writing Cached Memory
So far we have only really been concerned with reading the cache, but a problem also exists in keeping the cache and main memory consistent:

Unbuffered Write Through
• write the data to the relevant cache entry and update the tag, and also write the data to its location in main memory - speed is determined by the main memory

Buffered Write Through
• the data (and address) are written to a FIFO buffer between the CPU and main memory; the CPU continues with its next access while the FIFO buffer writes to the DRAM
• the CPU can continue to write at cache speed until the FIFO is full, then slows to DRAM speed as the FIFO empties
• if the CPU wants to read from DRAM (instead of the cache), the FIFO must first be emptied to ensure we have the correct data - which can introduce a long delay
• this delay can be shortened if the FIFO has only one entry - a simple latch buffer

[Diagram: a 4M-word memory using an 8k-word direct-mapped cache with write-through writes. The 32-bit address splits into a 9-bit tag, 13-bit index and 2-bit byte address; tag storage and comparison produce a Match signal for the control logic, with optional FIFOs on the address (22-bit) and data (32-bit) paths to the main DRAM for buffered write-through.]

Writing Cached Memory (cont'd)
Deferred Write (Copy Back)
• data is written to the cache only, allowing the cached entry to differ from main memory. If the cache system wants to overwrite a cache index with a different tag, it looks to see whether the current entry has been changed since it was copied in; if so, it writes the current value back to main memory before reading the new data into that location in the cache (sketched in code below)
• more logic is required for this operation, but the performance gain can be considerable, as it allows the CPU to work at cache speed while it stays within the same block of memory - the other techniques slow down to DRAM speed eventually
• adding a buffer to this allows the CPU to write to the cache before the old data is actually copied back to the DRAM
[Diagram: the same 4M-word memory with copy-back writes - the tag store gains a dirty bit, and latches hold the address and data of the entry being copied back to the DRAM.]

Cache Replacement Policies (for non-direct-mapped caches)
• when the CPU accesses a location which is not already in the cache, we need to decide which existing entry to send back to main memory
• it needs to be a quick decision. Possible schemes are:
  – Random replacement - a very simple scheme where a frequently changing binary counter supplies a cache set number for rejection
  – First-In-First-Out - a counter is incremented every time a new entry is brought into the cache, and points to the next slot to be filled
  – Least Recently Used - a good strategy, as it keeps often-used values in the cache, but difficult to implement with a few gates in a short time
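Picking up the copy-back scheme above, here is its bookkeeping as a sketch (mine; write_back_to_dram() is a hypothetical hook for the actual bus write):

  /* Sketch: copy-back (deferred write) - the dirty bit defers the DRAM
     write until the line is about to be reused for a different tag. */
  #include <stdbool.h>
  #include <stdint.h>

  typedef struct { uint32_t tag, data; bool valid, dirty; } cb_line;

  extern void write_back_to_dram(const cb_line *ln);  /* hypothetical */

  void cache_write(cb_line *ln, uint32_t tag, uint32_t data)
  {
      if (ln->valid && ln->dirty && ln->tag != tag)
          write_back_to_dram(ln);    /* flush the displaced entry      */
      ln->tag   = tag;
      ln->data  = data;              /* write at cache speed           */
      ln->valid = true;
      ln->dirty = true;              /* main memory is now out of date */
  }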
Memory Caches
Cache Consistency
A problem occurs when DMA is used by other devices or processors.
• A simple solution is to attach the cache to the memory and make all devices operate through it
• This is not the best idea, as a DMA transfer will cause all the cache entries to be overwritten, even though the transferred data is unlikely to be needed again soon
• If the cache is placed on the CPU side of the DMA traffic, then the cache might not mirror the DRAM contents
Bus Watching - monitor accesses to the DRAM and invalidate the relevant cache tag entry whenever that DRAM location has been updated; the cache can then be kept on the CPU side

Instruction Sets
Introduction
Instruction streams control all activity in the processor. All characteristics of the machine depend on the design of the instruction set:
– ease of programming
– code space efficiency
– performance
We will look at a few different instruction sets:
– Zilog Z80
– DEC VAX-11
– Intel family
– INMOS Transputer
– Fairchild Clipper
– Berkeley RISC-I

Instruction Sets
General Requirements of an Instruction Set
There are a number of conflicting requirements on an instruction set:
• Space Efficiency - control information should be compact
– instructions form the major part of all data moved between memory and CPU
– compactness is obtained by careful design of the instruction set
• variable-length coding can be used, so that frequently used instructions are encoded in fewer bits
• Code Efficiency - a task can only be translated efficiently if it is easy to pick the needed instructions from the set
– various attempts at optimising instruction sets have resulted in:
• CISC - a rich set of long instructions - results in a small number of translated instructions
• RISC - very short instructions, combined at compile time to produce the same result

General Requirements of an Instruction Set (cont'd)
• Ease of Compilation - in some environments compilation is a more frequent activity than on machines where demanding executables predominate; both want execution efficiency, however
– it is more time consuming to produce efficient code for CISC - it is more difficult to map a program onto a wide range of complex instructions
– RISC simplifies compilation
– Ease of compilation doesn't guarantee better code...
– Orthogonality of the instruction set also affects code generation:
• regular structure
• no special cases
• thus all actions (add, multiply etc.) are able to work with each addressing mode (immediate, absolute, indirect, register)
• If not, the compiler may have to treat different items - constants, arrays and variables - differently

General Requirements of an Instruction Set (cont'd)
• Ease of Programming
– there are still times when humans work directly at machine-code level:
• compiler code generators
• performance optimisation
– in these cases there are advantages to regular, fixed-length instructions with few side effects and maximum orthogonality
• Backward Compatibility
– many manufacturers produce upgraded versions which allow code written for an earlier CPU to run without change
– Good for public relations - if not compatible, customers could rewrite for a competitor's CPU instead!
– But it can make an instruction set a mess - deficiencies are added to rather than replaced - 8086 - 80286 - 80386 - 80486 - Pentium

General Requirements of an Instruction Set (cont'd)
• Addressing Modes & Number of Addresses per Instruction
– A huge range of addressing modes can be provided - specifying operands from 1 bit to several 32-bit words
– These modes may themselves need to include absolute addresses, index registers, etc., of various lengths
– Instruction sets can be designed which primarily use 0, 1, 2 or 3 operand addresses, just to compound the problem

Instruction Sets
Important Instruction Set Features:
• Operand Storage in the CPU – where are operands kept other than in memory?
• Number of operands named per instruction – how many operands are named explicitly per instruction?
• Operand Location – can any ALU operand be located in memory, or must some or all of the operands be held in the CPU?
• Operations – what types of operations are provided in the instruction set?
• Type and size of operands – what is the size and type of each operand, and how is it specified?

Instruction Sets
Three Classes of Machine:
• Stack based Machines (zero-address machines)
– Advantages: simple model of expression evaluation; short instructions can give dense code
– Disadvantages: the stack cannot be randomly accessed, which makes efficient code generation difficult; the stack can be a hardware bottleneck
• Accumulator based Machines (one-address machines)
– Advantages: minimises the internal state of the machine; short instructions
– Disadvantages: since the accumulator provides the only temporary storage, memory traffic is high
• Register based Machines (multi-address machines)
– Advantages: the most general model
– Disadvantages: all operands must be named, leading to long instructions

Instruction Sets
Register Machines
• Register to Register
– Advantages: simple, fixed-length instruction encoding; simple model for code generation; instructions access operands in similar times
– Disadvantages: higher instruction count than in architectures with memory references in instructions; some short instruction codings may waste instruction space
• Register to Memory
– Advantages: data can be accessed without loading it first; the instruction format is easy to encode, and dense
– Disadvantages: operands are not symmetric, since one operand (in the register) is destroyed; the number of registers is fixed by the instruction coding; operand fetch speed depends on location (register or memory)
Instruction Sets
Register Machines (cont'd)
• Memory to Memory
– Advantages: the most compact code; simple (fixed-length?) instruction encoding; does not waste registers on temporary storage
– Disadvantages: large variation in instruction size - especially as the number of operands is increased; large variation in operand fetch speed; memory accesses create a memory bottleneck

Instruction Sets
Addressing Modes
Register: Add R4, R3 / R4 = R4 + R3 - when a value is in a register
Immediate: Add R4, #3 / R4 = R4 + 3 - for constants
Indirect: Add R4, (R1) / R4 = R4 + M[R1] - access via a pointer
Displacement: Add R4, 100(R1) / R4 = R4 + M[100+R1] - access local variables
Indexed: Add R3, (R1+R2) / R3 = R3 + M[R1+R2] - array access (base + index)
Direct: Add R1, (1001) / R1 = R1 + M[1001] - access static data
Memory Indirect: Add R1, @(R3) / R1 = R1 + M[M[R3]] - double indirection - pointers
Auto-increment: Add R1, (R2)+ / R1 = R1 + M[R2], then R2 = R2 + d - step through arrays (d is the word length)
Auto-decrement: Add R1, -(R2) / R2 = R2 - d, then R1 = R1 + M[R2] - can also be used for stacks
Scaled: Add R1, 100(R2)[R3] / R1 = R1 + M[100 + R2 + (R3*d)]

Instruction Sets
Instruction Formats - Number of Addresses (operands)
4: operation | 1st operand | 2nd operand | result | next address
3: operation | 1st operand | 2nd operand | result
2: operation | 1st operand & result | 2nd operand
1: operation | register | 2nd operand
0: operation

Instruction Sets
Example Programs and simulations (used in simulations by Hennessy & Patterson)
gcc - the gcc compiler (written in C), compiling a large number of C source files
TeX - the TeX text formatter (written in C), formatting a set of computer manuals
SPICE - the spice electronic circuit simulator (written in FORTRAN), simulating a digital shift register

Instruction Sets
Simulations on Instruction Sets from Hennessy & Patterson
The following tables are extracted from 4 graphs in Hennessy & Patterson's "Computer Architecture: A Quantitative Approach"
Use of Memory Addressing Modes (% of accesses)
Addressing Mode | TeX | Spice | gcc | typical use
Memory Indirect | 1 | 6 | 1 | lists
Scaled | 0 | 16 | 6 | arrays
Indirect | 24 | 3 | 11 | pointers
Immediate | 43 | 17 | 39 | constants
Displacement | 32 | 55 | 40 | local variables

Simulations on Instruction Sets (cont'd)
Number of bits needed for a Displacement Operand - the percentage of displacement operands using each number of bits:
Value: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
TeX: 17 1 2 8 5 17 16 9 0 0 0 0 0 5 2 22
Spice: 4 …
gcc: 27 …
(the remaining Spice and gcc values, as printed: 0 1 13 9 0 5 1 3 3 6 6 5 14 16 5 11 0 12 5 15 14 6 5 1 2 1 0 4 1 12)
How local are the local variables? < 8 bits: 71% for TeX; 37% for Spice; 79% for gcc

Simulations on Instruction Sets (cont'd)
Percentage of Operations using Immediate Operands
Operation | TeX | Spice | gcc
Loads | 38 | 26 | 23
Compares | 83 | 92 | 84
ALU Operations | 52 | 49 | 69
The Distribution of Immediate Operand Sizes - number of bits needed for an immediate value:
Bits: 0 | 4 | 8 | 12 | 16 | 20 | 24 | 28 | 32
TeX: 3 | 44 | 3 | 2 | 16 | 23 | 2 | 1 | 0
Spice: 0 | 12 | 36 | 16 | 14 | 10 | 12 | 0 | 0
gcc: 1 | 50 | 21 | 3 | 2 | 19 | 0 | 0 | 1
A hypothetical sketch of the addressing modes above rendered in C follows.
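Most of these modes have direct C analogues, which can help in remembering them. A small hypothetical sketch (all variable names invented for illustration):

/* Rough C analogues of the addressing modes tabulated above. */
int r1, r2 = 3;               /* stand-ins for registers                */
int mem[2048];                /* stand-in for main memory               */
int *p  = &mem[100];
int **pp = &p;

void modes(void)
{
    r1 += r2;                 /* Register:        R1 = R1 + R2          */
    r1 += 3;                  /* Immediate:       constant in the instruction */
    r1 += *p;                 /* Indirect:        access via a pointer  */
    r1 += *(p + 25);          /* Displacement:    constant offset       */
    r1 += p[r2];              /* Indexed:         base + index register */
    r1 += mem[1001];          /* Direct:          known static address  */
    r1 += **pp;               /* Memory indirect: pointer to a pointer  */
    r1 += *p++;               /* Auto-increment:  step through an array */
}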
Instruction Sets
The Zilog Z80
• An 8-bit microprocessor derived from the Intel 8080
• has a small register set (8-bit accumulator + 6 other registers)
• Instructions are either register based, or register and one memory address - a single-address machine
• Enhanced the 8080 with relative jumps and bit manipulation
• 8080 instruction set (8-bit opcodes) - unused gaps filled in with extra instructions
– even more were needed, so some codes cause the next byte to be interpreted as another set of opcodes...
• Typical of early register-based microprocessors
• Let down by a lack of orthogonality - inconsistencies in the instructions, e.g.:
– a register can be loaded from an address held in a single register
– but the accumulator can only be loaded from an address held in a register pair

Instruction Sets
The Zilog Z80 (cont'd)
8 improvements over the 8080:
1) Enhanced instruction set - index registers & instructions
2) Two sets of registers for fast context switching
3) Block move
4) Bit manipulation
5) Built-in DRAM refresh address counter
6) Single 5V power supply
7) Fewer extra support chips needed
8) Very good price...
• Separate PC, SP and 2 index registers
• Addressing modes:
– Immediate (1- or 2-byte operands)
– Relative (one-byte displacement)
– Absolute (2-byte address)
– Indexed (M[index reg + 8-bit displacement])
– Register (specified in the opcode itself)
– Implied (e.g. references the accumulator)
– Indirect (via the HL, DE or BC register pairs)
• Instruction Types:
– Load & Exchange - 64 opcodes used just for register-register copying
– Block Copy
– Arithmetic, rotate & shift - mainly 8-bit; some simple 16-bit operations
– Jump, call & return - use condition codes from previous instructions
– Input & Output - single byte; block I/O

Instruction Sets
Intel 8086 Family
• 8086 announced in 1978 - not used in a PC until 1987 (the slower 8088 from 1981)
– 16-bit processor and data paths; concurrent fetch (prefetch) and execute
– 20-bit base addressing mode
• 80186 upgrade: small extensions
• 80286 - used in the PC/AT in 1984 (6 times faster than the 8088 - 20MHz)
– Memory mapping & protection added; once entered, protected mode could only be left by a processor reset until the 386!
• Support for virtual memory through segmentation
• 4 levels of protection - to keep applications away from the OS
– 24-bit addressing (16Mb) - the segment table has a 24-bit base field & a 16-bit size field
• 80386 - 1986 - 40MHz
– 32-bit registers and addressing (4Gb)
– Incorporates a "virtual 8086" mode rather than direct hardware support
– Paging (4kbyte pages) and segmentation (up to 4Gb) - allows UNIX implementation
– general-purpose register usage
– Incorporates 6 parallel stages:
• Bus Interface Unit - I/O and memory
• Code Prefetch Unit
• Instruction Decode Unit
• Execution Unit
• Segment Unit - logical address to linear address translation
• Paging Unit - linear address to physical address translation
– Includes a cache for up to the 32 most recently used pages

Instruction Sets
Intel 8086 Family (cont'd)
• i486 - 1988 - 100MHz; not "80486" - a court ruling: "can't trademark a number"
– more performance:
• added caching (8kb) to the memory system
• integrated the floating point processor on board
• expanded decode and execute into 5 pipelined stages
• Pentium - 1994 - 150-750MHz (10,000 times the speed of the 8088)
– added a second pipeline to give superscalar performance
– Now separate code (8k) and data (8k) caches
– Added branch prediction, with an on-chip branch table for loops
– Pages now 4Mb as well as 4kb
– Internal paths 128 and 256 bits; external still 32 bits
– Dual-processor support added
• Pentium Pro
– Instruction decode now 3 parallel units
– Breaks up code into "micro-ops"
– Micro-ops can be executed in any order using 5 parallel execution units: 2 integer, 2 floating point and 1 memory

Instruction Sets
Intel 8086 Registers (initially 16 bit)
Data:
AX - used for general arithmetic; AH and AL used in byte arithmetic
BX - general-purpose register, used as an address base register
CX - general-purpose register, used specifically in string, shift & loop instructions
DX - general-purpose register, used in multiply, divide and I/O instructions
Address:
SP - Stack Pointer
BP - base register - for base-addressing mode
SI - index; string source base register
DI - index; string destination base register
Registers can be used in 32-bit mode from the 80386 onwards

Intel 8086 Registers (cont'd)
Segment Base Registers - shifted left 4 bits and added to the address specified in the instruction... causes overlap!!!
CS - start address for code accesses
SS - start address of the Stack Segment
ES - extra segment (for string destinations)
DS - data segment - used for all other accesses
(segment handling changed in the 80286)
Control Registers:
IP - Instruction Pointer (the LS 16 bits of the PC)
Flags - 6 condition code bits plus 3 processor status/control bits
Addressing Modes:
A wide range of addressing modes is supported. Many modes can only be accessed via specific registers, e.g.:
Register Indirect - BX, SI, DI
Base + displacement - BP, BX, SI, DI
Indexed - the address is the sum of 2 registers: BX+SI, BX+DI, BP+SI, BP+DI

Instruction Sets
The DEC VAX-11
The VAX-11 family was compatible with the PDP-11 range - it had 2 separate processor modes: "Native" (VAX) and "Compatibility" modes
• The VAX had 16 32-bit general-purpose registers, including the PC, the SP and a frame pointer
• All data and address paths were 32 bits wide - a 4Gb address space
• A full range of data types was directly supported by hardware - 8, 16, 32 and 64-bit integers, 32 and 64-bit floating point, 32-digit BCD numbers, character strings etc.
• A very full selection of addressing modes was available
• Instructions were made up from 8-bit bytes which specified:
– the operation
– the data type
– the number of operands

Instruction Sets
The DEC VAX-11 (cont'd)
• Special opcodes FD and FF introduce even more opcodes in a second byte
• Only the number of addresses is encoded into the opcode itself - the addresses of the operands are encoded in one or more succeeding bytes
So the operation: ADDL3 #1, R0, @#12345678(R2)
or "Add 1 to the longword in R0 and store the result in a memory location addressed at an offset of the number of longwords stored in R2 from the absolute address 12345678 (hex)"
is stored as 9 bytes (8 bits each):
193 - the ADDL3 opcode
0 | 1 - literal (immediate) constant 1
5 | 0 - register mode, register 0
4 | 2 - index prefix, register 2
9 | 15 - absolute address follows, for indexing
78, 56, 34, 12 - the absolute address #12345678 - the VAX was little-endian

Instruction Sets
The INMOS Transputer
• The transputer is a microprocessor designed to operate with other transputers in parallel embedded systems
• The T800 was exceptionally powerful when introduced in 1986
• The T9000 - a more powerful, pipelined version - followed in 1994
Its history, in brief:
• The Government sold INMOS to EMI
• EMI decided to concentrate on music
• SGS-Thomson bought what was left
• The Japanese used transputer technology in printers/scanners
• It was then sold on to STMicroelectronics
• Now abandoned

The INMOS Transputer (cont'd)
• Designed for synchronised communications applications
• Suitable for coupling into a multiprocessing configuration, allowing a single program to be spread over all the machines to perform a task cooperatively
• Has 4kbytes of internal RAM - not a cache, but a section of the main memory map for the programmer/compiler to utilise
• Compact instruction set:
– the most popular instructions get the shortest opcodes - to minimise bandwidth
– instructions operate in conjunction with a 3-word execution stack - a zero-address strategy

The INMOS Transputer (cont'd)
The processor evaluates the following high-level expression: x = a+b+c; where x, a, b and c represent integer variables. There is no need to specify which processor registers receive the variables: the processor is just told to load them - pushing each onto the stack - and to add. When an operation is performed, the two values at the top of the stack are popped and combined, and the result is pushed back:
load a ;[a]
load b ;[b a]
load c ;[c b a]
add ;[c+b a]
add ;[c+b+a]
store x ;[ ]
• this removes the need for extra bits in the instruction to specify which register is accessed, so instructions can be packed into smaller words - 80% of instructions are only 1 byte long - resulting in a tighter fit in memory and less time spent fetching instructions

The INMOS Transputer (cont'd)
• Has 6 registers:
– 3 make up the register stack
– a program counter (called the instruction pointer by Inmos)
– a stack pointer (called the workspace pointer by Inmos)
– and an operand register
• The stack is interfaced through the first of the 3 registers (A, B, C):
– "push"ing a value into A causes A's value to be pushed to B, and B's value to C
– "pop"ping a value from A causes B's value to be popped into A, and C's value into B
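The zero-address evaluation above is easy to mimic in software. Here is a minimal hypothetical C sketch of a 3-deep register stack (A, B, C) evaluating x = a+b+c - an illustration of the scheme only, not of real transputer microcode:

#include <stdio.h>

static int A, B, C;                      /* the 3-register evaluation stack */

static void push(int v) { C = B; B = A; A = v; }             /* load  */
static int  pop(void)   { int v = A; A = B; B = C; return v; }
static void add_top(void) { int x = pop(); push(x + pop()); } /* add  */

int main(void)
{
    int a = 1, b = 2, c = 3, x;
    push(a);                 /* load a    stack: [a]     */
    push(b);                 /* load b    stack: [b a]   */
    push(c);                 /* load c    stack: [c b a] */
    add_top();               /* add       stack: [c+b a] */
    add_top();               /* add       stack: [c+b+a] */
    x = pop();               /* store x   stack: [ ]     */
    printf("x = %d\n", x);   /* prints x = 6             */
    return 0;
}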
• The operand register is the focal point for instruction processing:
– the 4 upper bits of a transputer instruction contain the operation - 16 possible operations
– the 4 lower bits contain the operand - this can be enlarged to up to 32 bits by using "prefix" instructions

The INMOS Transputer (cont'd)
• The 16 instructions include jump, call, memory load/store and add. Three of the 16 elementary instructions are used to enlarge the two 4-bit fields (opcode or operand), in conjunction with the operand register (OR), as follows:
– the "prefix" instruction ORs its operand data into the OR and then shifts the OR 4 bits to the left - allowing numbers (up to 32 bits) to be built up in the OR
– a "negative prefix" instruction ORs its operand into the OR and then inverts all the bits in the OR before shifting 4 bits to the left - this allows 2's complement negative values to be built up, e.g.:
Mnemonic / Code / Memory
ldc #3 → ldc #3 (opcode #4) → #43
ldc #35 → pfix #3 (#2); ldc #5 (#4) → #2345 (#23 #45)
ldc #987 → pfix #9 (#2); pfix #8 (#2); ldc #7 (#4) → #292847 (#29 #28 #47)

The INMOS Transputer (cont'd)
Mnemonic / Code / Memory
ldc -31 (ldc #FFFFFFE1) → nfix #1 (#6); ldc #1 (#4) → #6141 (#61 #41)
This last example shows the advantage of the 2's complement negative prefix: otherwise we would have to load all of the Fs, taking 5 additional operations...
• An additional "operate" instruction allows the OR to be treated as an extended opcode - up to 32 bits. Such instructions cannot have an operand, since the OR is used for the instruction itself, so they are all zero-address instructions.
• We thus have 16 one-address instructions and potentially lots of zero-address instructions.
Mnemonic / Code / Memory
add (#5) → opr #5 (#F) → #F5
ladd (#16) → pfix #1 (#2); opr #6 (#F) → #21F6 (#21 #F6)
A small decoder sketch for this prefix mechanism follows.
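A minimal C sketch of the prefix mechanism just described, using the opcode values from the tables above (pfix = #2, nfix = #6, ldc = #4). The decode loop itself is a hypothetical illustration, but it reproduces the worked examples: bytes #29 #28 #47 yield #987, and #61 #41 yield -31:

#include <stdint.h>
#include <stdio.h>

static uint32_t oreg = 0;            /* the operand register (OR) */

/* Feed one instruction byte in; returns 1 when an ldc constant completes. */
static int step(uint8_t byte, int32_t *out)
{
    uint8_t op   = byte >> 4;        /* 4 upper bits: operation */
    uint8_t data = byte & 0xF;       /* 4 lower bits: operand   */

    oreg |= data;                    /* operand data ORed into the OR */
    switch (op) {
    case 0x2: oreg <<= 4;        return 0;   /* pfix: shift left 4       */
    case 0x6: oreg = ~oreg << 4; return 0;   /* nfix: invert, then shift */
    case 0x4: *out = (int32_t)oreg; oreg = 0; return 1;  /* ldc          */
    default:  oreg = 0;          return 0;   /* other ops consume the OR */
    }
}

int main(void)
{
    uint8_t prog[] = { 0x29, 0x28, 0x47,     /* ldc #987 */
                       0x61, 0x41 };         /* ldc -31  */
    int32_t v;
    for (unsigned i = 0; i < sizeof prog; i++)
        if (step(prog[i], &v))
            printf("%d (#%X)\n", v, (unsigned)v);
    return 0;
}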
The INMOS Transputer (cont'd)
• No dedicated data registers. The transputer does not have dedicated registers but a stack of registers, which allows implicit selection of the registers. The net result is a smaller instruction format.
• Reduced instruction set design. The transputer adopts the RISC philosophy and supports a small set of instructions, each executed in a few cycles.
• Multitasking supported in microcode. The actions necessary for the transputer to swap from one task to another are executed at the hardware level, freeing the system programmer of this task and resulting in fast swap operations.

Instruction Sets
The Fairchild (now Intergraph) Clipper
• Had sixteen 32-bit general-purpose registers for the user and another 16 for operating system functions
– This separated interrupt activity and eliminated the time taken to save register information during an ISR
• Tightly coupled to a Floating Point Unit
• Had 101 RISC-like instructions:
– 16 bits long, made up from an 8-bit opcode and two 4-bit register fields
– some instructions can carry 4 bits of immediate data
– the 16-bit instructions could be executed in extremely fast cycles
– it also had 67 macro instructions - made up from multiples of the simpler instructions using a microprogramming technique - which incorporated many more complex addressing modes, as well as operations taking several clock cycles

A tale of Intel
Intergraph were a leading workstation producer for CAD in transport, building and local government, with products built using Intel chips.
1987 – Intergraph buys the Advanced Processor Division of Fairchild from National Semiconductor
1989-92 – The patents for the Clipper are transferred to Intergraph
1996 – Intergraph find that Intel are infringing their patents on cache addressing, memory and consistency between cache and memory, write-through & copy-back modes for virtual addressing, bus snooping etc.
- Intergraph ask Intel to pay for patent rights - Intel refuse
- Intel then cut Intergraph off from advance information about Intel chips - without that information Intergraph could not design new products well - Intergraph go from #1 to #5
1997 – Intergraph sue Intel - lots of legal activity over the next 3 years - the court rules that Intel is not licensed to use Clipper technology in the Pentium
2002 – Intel pays Intergraph $300M for a licence, plus $150M damages for infringement of PIC (Parallel Instruction Computing) technology - the core of the Itanium chip for high-end servers

A tale of Intel (cont'd)
The Federal Trade Commission cite Intel in 2 other similar cases:
1997 – Digital sue Intel, saying it copied DEC technology to make the Pentium Pro. In retaliation Intel cut DEC off from Intel pre-release material. Shortly after this, DEC were bought out by Compaq.
1994 – Compaq sue Packard Bell for violating patents on a Compaq chip set. Packard Bell say the chip set was made by Intel. Intel cut Compaq off from advance information...

Instruction Sets
The Fairchild (now Intergraph) Clipper (cont'd)
An example of a Harvard Architecture - having a separate internal instruction bus and data bus (and associated caches).
[Diagram: integer CPU and FPU connected to an internal instruction bus and an internal data bus, each through its own cache/memory-management unit, onto the off-carrier memory bus]
The Clipper is made up from 3 chips mounted on a ceramic carrier. The Harvard architecture enables the caches to be optimised to the different characteristics of the instruction and data streams. Microchip's PIC chips also use a Harvard architecture.

Instruction Sets
The Berkeley RISC-I Research Processor
A research project at UC Berkeley (1980-83) set out to build:
• a "pure" RISC structure
• highly suited to executing compiled high-level language programs - procedural blocks, local & global variables
The team examined the frequency of execution of different types of instructions in various C and Pascal programs.
The RISC-I had a strong influence on the design of the SUN SPARC architecture (while the Stanford MIPS - Microprocessor without Interlocked Pipelined Stages - architecture influenced the MIPS R2000).
The RISC-I was a register-based machine. The registers, data and addresses were all 32 bits wide, with a total of 138 registers. All instructions, except memory LOADs and STOREs, operated on 1, 2 or 3 registers.

The Berkeley RISC-I Research Processor (cont'd)
A running program has available a total of 32 general-purpose registers:
• 10 (R0-R9) are global
• the remaining 22 are split into 3 groups - low, local and high, of 6, 10 and 6 registers respectively
• When a program calls a procedure:
– the first 6 parameters are stored in the program's low registers
– a new register window is formed
– these 6 low registers are relabelled as the high 6 of a new block of 22
– this is the register space for the new procedure while it runs
– the running procedure can keep 10 of its local variables in registers
– it can call further procedures using its own low registers
– it can nest calls to a depth of 8 (thus using all 138 registers)
– on return from a procedure, the results are in the high registers and so appear in the calling procedure's low registers

The Berkeley RISC-I Research Processor (cont'd)
Process A calls process B, which calls process C:
[Diagram: a register bank of 138 registers (numbered 0-137) in which the low registers of each 22-register window (high/local/low for A, B and C) overlap the high registers of the next, with the 10 global registers (R0-R9) shared by all]

The Berkeley RISC-I Research Processor (cont'd)
RISC-I Short Immediate Instruction Format: Op-Code (7 bits) | SCC (1 bit) | DEST (5 bits) | S1 (5 bits) | immediate flag (1 bit) | S2 (13 bits)
RISC-I Long Immediate Instruction Format: Op-Code (7 bits) | SCC (1 bit) | DEST (5 bits) | IMM (19 bits)
DEST is the register number for all operations except conditional branches, where it specifies the condition.
S1 is the number of the first source register, and S2 the second - or, if bit 13 is high, a 2's complement immediate value.
SCC is a "set condition codes" bit which causes the status word register to be updated.

The Berkeley RISC-I Research Processor (cont'd)
The Op-Code (7 bits) can be one of 4 types of instruction:
• Arithmetic - where RDEST = RS1 OP S2, and OP is a math, logical or shift operation
• Memory Access - where LOADs take the form RDEST = MEM[RS1+S2] and STOREs take the form MEM[RS1+S2] = RDEST
• note that RDEST is really the source register in the STORE case
• Control Transfer - where various branches may be made relative to the current PC (PC+IMM), or relative to RS1 using the short form (RS1+S2)
• Miscellaneous - all the rest. Includes "load immediate high", which uses the long immediate format to load 19 bits into the MS part of a register, and can be followed by a short-format load immediate for the other 13 bits - 32 in all

Instruction Sets
RISC Principles
Not just a machine with a small set of instructions: the set must also have been optimised and minimised to improve processor performance.
Many processors in the 60s and 70s were developed with a microcode engine at the heart of the processor - easier to design (CAD and formal proof did not exist) and easy to add extra instructions, or to change them.
Most CISC programs spend most of their time in a small number of instructions. If the time taken to decode all instructions can be reduced by having fewer of them, then more time can be spent on the less frequent instructions.
Various other features become necessary to make this work:
• One clock cycle per instruction
CISC machines typically take a variable number of cycles:
– reading in variable numbers of instruction bytes
– executing microcode
The time wasted waiting for these to complete is recovered if all instructions operate in the same period.
For this to happen, a number of other features are required.

Instruction Sets
RISC Principles
• Hard-wired Controller, Fixed Format Instructions
– Single-cycle operation is only possible if instructions can be decoded fast and executed straight away
– Fast (old-fashioned?) hard-wired instruction sequencing is needed - microcode can be too slow
– As designing these controllers is hard, it is even more important to have few instructions
– this can be simplified by making all instructions share a common format:
• number of bytes, positions of the op-code etc.
• smaller the better - provided that each instruction contains needed information – Typical for only 10% of the logic of a RISC chip to be used for controller function, compared with 50-60% of a CISC chip like the 68020 3/16/2016 EE3.cma - Computer Architecture 166 Instruction Sets RISC Principles • Larger Register Set – It is necessary to minimise data movement to and from the processor – The larger the number of registers the easier this is to do. – Enables rapid supply of data to the ALU etc. as needed – Many RISC machines have upward of 32 registers and over 100 is not uncommon. – There are problems with saving state of this many registers – Some machines have “windows” of sets of registers so that a complete set can be switched by a single reference change • Memory Access Instructions – One type of instruction can not be speeded up as much as others – Use indexed addressing (via a processor register) to avoid having to supply (long) absolute addresses in the instruction – Harvard architecture attempts to keep most program instructions and data apart by having 2 data and address buses 3/16/2016 EE3.cma - Computer Architecture 167 Instruction Sets RISC Principles • Minimal pipelining, wide data bus – CISC machines use pipelining to improve the delivery of instructions to the execution unit – it is possible to read ahead in the instruction stream and so decode one instruction whilst executing the previous one whilst retrieving another – Complications in jump or branch instructions can make pipelining unattractive as they invalidate the backed up instructions and new instructions have to ripple their way through. – RISC designers often prefer large memory cache so that data can be read, decoded and executed in a single cycle independent of main memory – Regardless of pipelining, fetching program instructions fast is vital to RISC and a wide data bus is essential to ensure this - same for CISC 3/16/2016 EE3.cma - Computer Architecture 168 Instruction Sets RISC Principles • Compiler Effort – A CISC machine has to spend a lot of effort matching high-level language fragments to the many different machine instructions - even more so when the addressing modes are not orthogonal. – RISC compilers have a much easier job in that respect - fewer choices – They do, however, build up longer sequences of their small instructions to achieve the same effect. – The main complication of compiling for RISC is that of optimising register usage. – Data must be maintained on-chip when possible - difficult to evaluate an importance to a variable. • a variable accessed in a loop can be used many times and one outside may be used only once - but both only appear in the code once... 3/16/2016 EE3.cma - Computer Architecture 169 Instruction Sets Convergence of RISC and CISC Many of the principles developed for RISC machine optimisation have been fed back into CISC machines (Intergraph and Intel…). This is tending to bring the two styles of machine back together. • Large caches on the memory interface - reduce the effects of memory usage • CISC machines are getting an increasing number of registers • More orthogonal instruction sets are making compiler implementation easier • Many of the techniques described above may be applied to the microprogram controller inside a conventional CISC machine. • This suggests that the microprogram will take on a more RISC like form with fixed formats and fields, applying orthogonally over the registers etc. 
Pipelined Parallelism in Instruction Processing
General Principles
Pipelined processing involves splitting a task into several sequential parts and processing each part in parallel, in separate execution units.
• for one-off tasks there is little advantage, but
• for repetitive tasks substantial gains can be made
Pipelining can be applied to many fields of computing, such as:
• large-scale multi-processor distributed processing
• arithmetic processing, using vector hardware to pipe individual vector elements through a single high-speed arithmetic unit
• multi-stage arithmetic pipelines
• layered protocol processing
• as well as instruction execution within a processor
The overall task must be able to be broken into smaller sub-tasks which can be chained together - all sub-tasks taking the same time to execute.
Choosing the best sub-division of tasks is called load balancing.

General Principles (cont'd)
[Diagram: non-pipelined processing - stages 1, 2 and 3 of each instruction executed end-to-end - versus pipelined processing, where stage 1 of the next instruction overlaps stage 2 of the current one]
A single instruction still takes as long, and each instruction must still be performed in the same order. The speed-up occurs when all stages are kept in operation at the same time; start-up and ending are less efficient.

General Principles (cont'd)
There are two clocking schemes which can be incorporated in pipelining:
Synchronous
[Diagram: the task flows through stages 1-3, separated by staging latches, all driven by a global clock]
Operates using a global clock, which indicates when each stage of the pipeline should pass its result to the next stage. The clock must run at the rate of the slowest possible element in the pipeline when given the most time-consuming data. To de-couple the stages, they are separated by staging latches.

Asynchronous
[Diagram: stages 1-3 separated by latches, with ready/acknowledge handshake signals between neighbouring stages]
In this case the stages of the pipeline run independently of each other; two stages synchronise only when a result has to pass from one to the other. A little more complicated to design than the synchronous scheme, but with the benefit that stages can run in the time actually needed rather than always taking the maximum time. Using a FIFO buffer instead of a latch between stages allows results to queue for each stage.

Pipelining for Instruction Processing
Processing a stream of instructions can be performed in a pipeline. Individual instructions can be executed in a number of distinct phases:
Fetch - read the instruction from memory
Decode - inspect the instruction: how many operands, how and where will it be executed
Address generate - calculate the addresses of registers and memory locations to be accessed
Load operand - read operands stored in memory; might also read register operands or set up pathways between registers and functional units
Execute - drive the ALU, shifter, FPU and other components
Store operand - store the result of the previous stage
Update PC - the PC must be updated for the next fetch operation
No processor would implement all of these.
Most common would be Fetch and Execute 3/16/2016 EE3.cma - Computer Architecture 176 Pipelined Parallelism in Instruction Processing Overlapping Fetch & Execute Phases Fetch - involves memory activity (slow) can be overlapped with Decode and Execute. In RISC only 2 instructions access memory - LOAD and STORE - the remainder operate on registers so for most instructions only Fetch needs memory bus. On starting the processor the Fetch unit gets an instruction from memory At the end of the cycle the instruction just read is passed to the Execute unit While the Execute unit is performing the operation Fetch is getting next instruction (provided Execute doesn’t need to use the memory as well) This and other contention can be resolved by: • Extending the clashing cycle to give time for both memory accesses to take place - hesitation - requires synchronous clock to be delayed • Providing multi-port access to main memory (or cache) so that access can happen in parallel. Memory interleaving may help. • Widening data bus so that 2 instructions are fetched with each Fetch • Use a Harvard memory architecture - separate instruction and data bus 3/16/2016 EE3.cma - Computer Architecture 177 Pipelined Parallelism in Instruction Processing Overlapping Fetch & Execute Phases Fetch #1 Fetch #2 Fetch #3 Execute #1 Execute #2 Execute #3 time Overlapping Fetch, Decode & Execute Phases Fetch #1 Fetch #2 Fetch #3 Decode #1 Decode #2 Decode #3 Execute #1 Execute #2 Execute #3 3/16/2016 EE3.cma - Computer Architecture time 178 Pipelined Parallelism in Instruction Processing Overlapping Fetch, Decode & Execute Phases There are benefits to extending the pipeline to more than 2 stages even though more hardware is needed A 3-stage pipeline splits the instruction processing into Fetch, Decode and Execute. The Fetch stage operates as before. The Decode stage decodes the instruction and calculates any memory addresses used in the Execute The Execute stage controls the ALU and writes result back to a register - and can perform LOAD and STORE accesses. The Decode stage is guaranteed not to need a memory access. Thus memory contention is no worse than in the 2 stage version. Longer Pipelines of 5, 7 or more stages are possible and depend on the complexity of hardware and instruction set. 3/16/2016 EE3.cma - Computer Architecture 179 18 3/16/2016 EE3.cma - Computer Architecture 180 Pipelined Parallelism in Instruction Processing The Effect of Branch Instructions One of the biggest problems with pipelining is the effect of a branch instruction. A branch is Fetched as usual and the target address Decoded. The Execute stage then has the task of deciding whether or not to branch and so changing the PC. By this time the PC has already been used at least once by the Fetch (and with a separate Decode maybe twice). The effect of changing the PC is that all data in the pipeline following the branch must be flushed. Branches are common in some types of program (up to 10% of instructions). So benefits of pipelining can be lost for 10% of instructions and incur reloading overhead. A number of schemes exist to avoid this flushing: 3/16/2016 EE3.cma - Computer Architecture 181 Pipelined Parallelism in Instruction Processing • Delayed Branching – Sun SPARC instead of branching as soon as a branch instruction has been decided, the branch is modified to “Execute n more instructions before jumping to the instruction specified” - used with n chosen to be 1 smaller than the number of stages in pipeline. 
So that in a 2-stage pipeline, instead of the loop: a; b; c; a; b; c; ... (where c is the branch instruction back to a) in that order, the code could be stored as: a; c; b; a; c; b; ...
In this case a is executed, then the decision to jump back to a is taken, but before the jump happens b is executed. The delayed jump at c enables b - which has already been fetched while evaluating c - to be used rather than thrown away.
Care must be taken when operating instructions out of sequence, and the machine code becomes difficult to understand; but a good compiler can hide all of this, and in about 70% of cases it can be implemented easily.

Pipelined Parallelism in Instruction Processing
• Delayed Branching (cont'd)
Consider the following code fragment, running on a 3-stage pipeline:
loop: RA = RB ADD RC
RD = RB SUB RC
RE = RB MUL RC
RF = RB DIV RC
BNZ RA, loop
Fetch: ADD, SUB, MULT, DIV, BNZ, next, next2, ADD (cycle 8 = cycle 1)
Decode: ADD, SUB, MULT, DIV, BNZ, next, -
Execute: ADD, SUB, MULT, DIV, BNZ (updates the PC), -
The pipeline has to be flushed to remove the two incorrectly fetched instructions, and the code repeats every 7 cycles.

• Delayed Branching (cont'd)
We can invoke the delayed-branching behaviour of DBNZ and re-order 2 instructions (where possible) from earlier in the loop:
loop: RA = RB ADD RC
RD = RB SUB RC
DBNZ RA, loop
RE = RB MULT RC
RF = RB DIV RC
Fetch: ADD, SUB, DBNZ, MULT, DIV, ADD (cycle 6 = cycle 1), SUB
Decode: ADD, SUB, DBNZ, MULT, DIV, ADD, SUB
Execute: ADD, SUB, DBNZ (updates the PC), MULT, DIV, ADD
The loop now executes every 5 processor cycles - no instructions are fetched and left unused.

Pipelined Parallelism in Instruction Processing
• Instruction Buffers - IBM PowerPC
When a branch is found in an early stage of the pipeline, the Fetch unit can be made to start fetching both possible future instruction streams into separate buffers, and to start decoding both, before the branch is executed. There are a number of difficulties with this:
– it imposes an extra load on the instruction memory
– it requires extra hardware - duplication of decode and fetch
– it becomes difficult to exploit fully if several branches follow closely - each fork requires a separate pair of instruction buffers
– early duplicated stages cannot fetch different values to the same register, so register fetches may have to be delayed - pipeline stalling(?)
– duplicate pipeline stages must not write (to memory or registers) unless a mechanism for reversing the changes is included (in case the branch is not taken)

Pipelined Parallelism in Instruction Processing
• Branch Prediction - Intel Pentium
When a branch is executed, the destination address chosen can be kept in a cache. When the Fetch stage detects a branch, it can prime itself with a next-program-counter value looked up in the cached table of previous destinations for a branch at this instruction. If the branch is made (at the execution stage) in the same direction as before, then the pipeline already contains the correct prefetched instructions and does not need to be flushed.
More complex schemes could even use a most-frequently-taken strategy to guess where the next branch from any particular instruction is likely to go, and reduce the pipeline flushing still further.
[Diagram: branch-prediction look-up table - the instruction address from the PC is used to search the table; if a target address is found, it is loaded into the Fetch stage ahead of Decode and Execute]

Pipelined Parallelism in Instruction Processing
• Dependence of Instructions on others which have not completed
Instructions cannot be reliably fetched while previous branch instructions are incomplete - the PC is updated too late for the next fetch.
– A similar problem occurs with memory and registers
– the memory case can be solved by ensuring that all memory accesses are performed atomically in a single Execute stage - get the data only when needed
– but what if the memory just written contains a new instruction which has already been prefetched? (self-modifying code)
In a long pipeline, several stages may read from a particular register and several may write to the same register.
– Hazards occur when the order of access to operands is changed by the pipeline
– various methods may be used to prevent data from different stages getting confused in the pipeline
Consider 2 sequential instructions i, j (i first) and a 3-stage pipeline. The possible hazards are:

• Read-after-write Hazards
When j tries to read a source before i writes it, j incorrectly gets the old value
– a direct consequence of pipelining conventional instructions
– occurs when a register is read very shortly after it has been updated
– the value in the register is correct
Example: R1 = R2 ADD R3; R4 = R1 MULT R5
Cycle 1: Fetch ADD
Cycle 2: Fetch MULT; Decode ADD (fetches R2, R3)
Cycle 3: Fetch next1; Decode MULT (fetches R1, R5 - the register fetch probably gets the wrong value); Execute ADD (stores R1)
Cycle 4: Fetch next2; Decode next1; Execute MULT (stores R4 - a wrong value is calculated)

• Write-after-write Hazards
When j tries to write an operand before i writes it, the value left by i rather than the value written by j is left at the destination
– occurs if the pipeline permits writes from more than one stage
– the value left in the register is incorrect
Example: R3 = R1 ADD R2; R5 = R4 MULT -(R3)
Cycle 1: Fetch ADD
Cycle 2: Fetch MULT; Decode ADD (fetches R1, R2)
Cycle 3: Fetch next1; Decode MULT (fetches (R3-1), R4 and saves R3-1 in R3); Execute ADD (stores R3 - which version of R3?)
Cycle 4: Fetch next2; Decode next1; Execute MULT (stores R5)

• Write-after-read Hazards
When j tries to write to a register before it is read by i, i incorrectly gets the new value
– can only happen if the pipeline provides for early (decode-stage) writing of registers and late reading - e.g. auto-increment addressing
– the value in the register is correct
A realistic example is difficult in this case, for several reasons:
• firstly, memory accessing introduces dependencies on the data in the read case, or stalls due to bus activity in the write case
• a long pipeline with early writing and late reading of registers is rather untypical...
• Read-after-read Hazards
These are not a hazard - multiple reads always return the same value...

Pipelined Parallelism in Instruction Processing
• Detecting Hazards
Several techniques - normally resulting in some stage of the pipeline being stopped for a cycle - can be used to overcome these hazards. They all depend on detecting register-usage dependencies between instructions in the pipeline.
An automated method of managing register accesses is needed. The most common detection scheme is scoreboarding:
– keep a 1-bit tag with each register
– clear all tags when the machine is booted
– a tag is set by the Fetch or Decode stage when an instruction is going to change a register
– when the change is complete, the tag bit is cleared
– if an instruction is decoded which wants a tagged register, it is not allowed to access it until the tag is cleared
(a sketch of this scheme is given at the end of this subsection)

Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding
Hazards will always be a possibility, particularly in long pipelines. Some can be avoided by providing an alternative pathway for data computed in a previous cycle but not yet written back in time:
[Diagram: the register file feeds the ALU through two multiplexers; bypass paths route the ALU result and staged register values straight back to the multiplexer inputs, alongside the normal register write path]

• Avoiding Hazards - Forwarding - Example
R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11
on a 5-stage pipeline - no forwarding pathways (13 cycles):
Fetch: ADD, SUB, AND, OR, XOR, next1, next2, next3, next4, next5
Decode/regs: ADD read R2, R3; SUB read R5, R1 (not ready); SUB read R1 (not ready); SUB read R1 (not ready); SUB read R1; AND read R1, R7; OR read R1, R9; XOR read R1, R11; next1; next2; next3; next4
ALU: ADD compute R1; SUB compute R4; AND compute R6; OR compute R8; XOR compute R10; next1; next2; next3
Memory: ADD pass R1; SUB pass R4; AND pass R6; OR pass R8; XOR pass R10; next1; next2
Writeback: ADD store R1; SUB store R4; AND store R6; OR store R8; XOR store R10; next1

• Avoiding Hazards - Forwarding - Example
The same code on a 5-stage pipeline - still no forwarding pathways, BUT registers are read in the second half of each cycle and written in the first half (12 cycles):
Fetch: ADD, SUB, AND, OR, XOR, next1, next2, next3, next4, next5
Decode/regs: ADD read R2, R3; SUB read R5, R1 (not ready); SUB read R1 (not ready); SUB read R1; AND read R1, R7; OR read R1, R9; XOR read R1, R11; next1; next2; next3; next4
ALU: ADD compute R1; SUB compute R4; AND compute R6; OR compute R8; XOR compute R10; next1; next2; next3
Memory: ADD pass R1; SUB pass R4; AND pass R6; OR pass R8; XOR pass R10; next1; next2
Writeback: ADD store R1; SUB store R4; AND store R6; OR store R8; XOR store R10; next1

• Avoiding Hazards - Forwarding - Example
The same code on a 5-stage pipeline - with full forwarding (10 cycles):
Fetch: ADD, SUB, AND, OR, XOR, next1, next2, next3, next4, next5
Decode/regs: ADD read R2, R3; SUB read R5, R1 (from ALU); AND read R1 (from ALU), R7; OR read R1 (from ALU), R9; XOR read R1, R11; next1; next2; next3; next4
ALU: ADD compute R1; SUB compute R4; AND compute R6; OR compute R8; XOR compute R10; next1; next2; next3
Memory: ADD pass R1; SUB pass R4; AND pass R6; OR pass R8; XOR pass R10; next1; next2
Writeback: ADD store R1; SUB store R4; AND store R6; OR store R8; XOR store R10; next1
In this case the forwarding prevents any pipeline stalls.
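Here is the minimal sketch of the scoreboarding scheme promised above, with hypothetical names: one tag bit per register, set when an issued instruction will write that register and cleared at write-back; decode refuses to issue an instruction while any register it needs is still tagged:

#include <stdbool.h>

#define NREGS 32
static bool tag[NREGS];     /* one scoreboard bit per register; cleared at boot */

/* Called at decode: may an instruction reading s1,s2 and writing d issue? */
bool can_issue(int s1, int s2, int d)
{
    if (tag[s1] || tag[s2] || tag[d])
        return false;       /* stall: an in-flight instruction owns a register */
    tag[d] = true;          /* claim the destination until write-back          */
    return true;
}

/* Called at write-back, when the result reaches the register file. */
void write_back(int d)
{
    tag[d] = false;         /* release the register; stalled readers proceed   */
}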
Pipelined Parallelism in Instruction Processing
• Characteristics of Memory Store Operations
Example - using the 5-stage pipeline as before, with a store in the cycle:
R1 = R2 ADD R3
25(R1) = R1 (store to main memory)
Fetch: ADD, STORE, next1, next2, (stall), next3
Decode/regs: ADD read R2, R3; STORE read R1 (not ready - R1 taken from the ALU); next1; next2
ALU: ADD compute R1; STORE compute R1+25; next1
Memory: ADD pass R1; STORE R1 → (R1+25) (wait for the memory indirection); next1
Writeback: ADD store R1; STORE (null)
Since STORE is an output operation, it does not create register-based hazards. It might create memory-based hazards, which may be avoided by instruction re-ordering or store-fetch avoidance techniques - see the next section.

• Forwarding during Memory Load Operations
Example - using the 5-stage pipeline as before, with a load in the cycle:
R1 = 32(R6)
R4 = R1 ADD R7
R5 = R1 SUB R8
R6 = R1 AND R7
Fetch: LOAD, ADD, SUB, (stall), AND, next1, next2, next3, next4
Decode/regs: LOAD read R6; ADD read R7, R1 (not ready); ADD R1 (not ready); SUB read R8, R1 (from Mem); AND read R7, R1; next1; next2; next3
ALU: LOAD R6+32; ADD R4, R1 (from Mem); SUB R5; AND R6; next1; next2
Memory: LOAD (R6+32); ADD pass R4; SUB pass R5; AND pass R6; next1
Writeback: LOAD store R1; ADD store R4; SUB store R5; AND store R6
In this case the result of the LOAD must be forwarded to the earlier ALU stage, and to the even earlier DECODE stage.

• Forwarding (Optimisation) Applied to Memory Operations
– Store-Fetch Forwarding - where a word stored and then loaded by another instruction further back in the pipeline can be piped directly, without needing to pass into and out of the memory location:
MOV [200],AX ;copy AX to memory
ADD BX,[200] ;add memory to BX
transforms to:
MOV [200],AX
ADD BX,AX
– Fetch-Fetch Forwarding - where a word loaded twice in successive stages may be loaded together - or once, from memory to register:
MOV AX,[200] ;copy memory to AX
MOV BX,[200] ;copy memory to BX
transforms to:
MOV AX,[200]
MOV BX,AX
– Store-Store Overwriting:
MOV [200],AX ;copy AX to memory
MOV [200],BX ;copy BX to memory
transforms to:
MOV [200],BX

Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Instruction Re-ordering
Because hazards and data dependencies cause pipeline stalls, removing them can improve performance. Re-ordering instructions is often the simplest technique.
Consider a program to calculate Σ(n=1..100) n·a^n on a 3-stage pipeline:
loop: RT = RA EXP RN
RT = RT MULT RN
RS = RS ADD RT
RN = RN SUB 1
BNZ RN, loop
Needs 10 cycles:
Fetch: EXP, MULT, ADD, SUB, BNZ, next1, next2, EXP
Decode: EXP read RA, RN; MULT read RN, RT (not ready); MULT read RN, RT; ADD read RS, RT (not ready); ADD read RS, RT; SUB read RN, 1; BNZ read RN (not ready); BNZ read RN; next1 (flushed)
Execute: EXP store RT; MULT store RT; ADD store RS; SUB store RN; BNZ store PC; (next2, EXP flushed)

• Code Changes affecting the pipeline - Instruction Re-ordering
Re-order the sum and decrement instructions:
loop: RT = RA EXP RN
RT = RT MULT RN
RN = RN SUB 1 (these 2 swapped)
RS = RS ADD RT (these 2 swapped)
BNZ RN, loop
Needs 8 cycles:
Fetch: EXP, MULT, SUB, ADD, BNZ, next1, next2, EXP
Decode: EXP read RA, RN; MULT read RN, RT (not ready); MULT read RN, RT; SUB read RN, 1; ADD read RS, RT; BNZ read RN; next1 (flushed)
Execute: EXP store RT; MULT store RT; SUB store RN; ADD store RS; BNZ store PC; (flushed)
We can only make this better with forwarding - to remove the final RT dependency.

Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
The unrolling of loops is a conventional technique for increasing performance. It works especially well in pipelined systems:
– start with a tight program loop
– re-organise the loop construct so that the loop is traversed half (or a third, a quarter etc.) as many times
– re-write the code body so that it performs two (3, 4) times as much work in each loop
– optimise the new code body
In the case of pipelined execution, the code body gains from:
– a greater likelihood of benefiting from delayed branching
– less need to increment the loop variable
– instruction re-ordering to avoid pipeline stalls
– exposed parallelism - useful for vector and VLIW architectures

• Code Changes affecting the pipeline - Loop Unrolling
Example - calculate Σ(n=1..100) array(n), using a Harvard architecture and forwarding:
R2 = 0
R1 = 100
loop: R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
BNZ R1, loop
800 cycles to complete all the loops:
Fetch: LOAD, SUB, ADD, BNZ, next1, next2, next3, next4, LOAD
Decode: LOAD read R1; SUB read R1, 1; ADD R2, R3 (not ready); BNZ read R1; next1; next2; next3
ALU: LOAD R1+0; SUB R1-1; ADD R2+R3, R3 (from Mem); BNZ R1 (from ALU); next1; next2
Memory: LOAD (R1+0); SUB pass R1; ADD pass R2; BNZ pass R1; next1
Writeback: LOAD store R3; SUB store R1; ADD store R2; BNZ store PC
- the code is difficult to write in optimal form - it is too short to implement delayed branching
- forwarding prevents stalling, and performing the decrement early hides some of the memory latency

• Code Changes affecting the pipeline - Loop Unrolling (cont'd)
Unrolling the loop body:
loop: R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
R3 = LOAD array(R1)
R1 = R1 SUB 1
R2 = R2 ADD R3
BNZ R1, loop
Re-label registers and re-order:
loop: R3 = LOAD array(R1)
R4 = LOAD array-1(R1)
R5 = LOAD array-2(R1)
R6 = LOAD array-3(R1)
R1 = R1 SUB 4
DBNZ R1, loop
R2 = R2 ADD R3
R2 = R2 ADD R4
R2 = R2 ADD R5
R2 = R2 ADD R6
A C rendering of this transformation is sketched below.
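The same transformation expressed in C, as a hypothetical sketch: sum_rolled corresponds to the original loop and sum_unrolled to the 4-way unrolled, re-ordered version, whose four independent loads give the pipeline more work to overlap:

/* Sum of array[1..100]: rolled and 4-way unrolled versions. */
int sum_rolled(const int array[101])
{
    int r2 = 0;
    for (int r1 = 100; r1 != 0; r1--)    /* load, add, decrement, branch each time */
        r2 += array[r1];
    return r2;
}

int sum_unrolled(const int array[101])   /* assumes the count is a multiple of 4 */
{
    int r2 = 0;
    for (int r1 = 100; r1 != 0; r1 -= 4) {
        int r3 = array[r1];              /* four independent loads -          */
        int r4 = array[r1 - 1];          /* these can overlap in the pipeline */
        int r5 = array[r1 - 2];
        int r6 = array[r1 - 3];
        r2 += r3; r2 += r4; r2 += r5; r2 += r6;
    }
    return r2;
}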
• Code Changes affecting the pipeline - Loop Unrolling (cont'd)
Example - calculate Σ(n=1..100) array(n), using a Harvard architecture and forwarding.
The branch has been replaced with a delayed branch, which takes effect after 4 more instructions (5-stage pipeline). 250 cycles to complete all the loops:
Fetch: LOAD1, LOAD2, LOAD3, LOAD4, SUB, DBNZ, ADD1, ADD2, ADD3, ADD4, LOAD1, LOAD2, LOAD3, LOAD4, SUB
Decode: LOAD1 read R1; LOAD2 read R1, 1; LOAD3 read R1, 2; LOAD4 read R1, 3; SUB read R1, 4; DBNZ read R1; ADD1 read R2, R3; ADD2 read R2, R4; ADD3 read R2, R5; ADD4 read R2, R6; LOAD1 read R1; LOAD2 read R1, 1; LOAD3 read R1, 2; LOAD4 read R1, 3
ALU: LOAD1 array+R1; LOAD2 array-1+R1; LOAD3 array-2+R1; LOAD4 array-3+R1; SUB R1; DBNZ R1 (from ALU); ADD1 R2; ADD2 R2 (from ALU); ADD3 R2 (from ALU); ADD4 R2 (from ALU); LOAD1 array+R1; LOAD2 array-1+R1; LOAD3 array-2+R1
Memory: LOAD1 R3; LOAD2 R4; LOAD3 R5; LOAD4 R6; SUB pass R1; DBNZ pass R1; ADD1 pass R2; ADD2 pass R2; ADD3 pass R2; ADD4 pass R2; LOAD1 R3; LOAD2 R4
Writeback: LOAD1 store R3; LOAD2 store R4; LOAD3 store R5; LOAD4 store R6; SUB store R1; DBNZ store PC; ADD1 store R2; ADD2 store R2; ADD3 store R2; ADD4 store R2; LOAD1 store R3

• Code Changes affecting the pipeline - Loop Unrolling (cont'd)
The original loop took 8 cycles per iteration. The unrolled version allows a delayed branch to be implemented and performs 4 iterations in 10 cycles - an improvement by a factor of 3.2.
Benefits of Loop Unrolling:
– fewer instructions (multiple decrements can be performed in one operation)
– the longer loop allows a delayed branch to fit
– better use of the pipeline - more independent operations
– disadvantage: more registers are required to obtain these results

Pipelined Parallelism in Instruction Processing
• Parallelism at the Instruction Level
Conventional instruction sets rely on the encoding of register numbers, instruction type and addressing modes to reduce the volume of the instruction stream.
CISC processors optimise a lower-level encoding in a longer instruction word, which requires them to consume more instruction bits per cycle, forcing advances like Harvard memory architectures.
CISC architectures are still sequential processing machines - pipelining and superscalar instruction grouping introduce only a limited amount of parallelism.
Parallelism can also be introduced explicitly, with parallel operations in each instruction word. VLIW (Very Long Instruction Word) machines have instruction formats which contain different fields, each referring to a separate functional unit in the processor; this requires multi-ported access to registers etc.
The choice of parallel activities in a VLIW machine is made by the compiler, which must determine when hazards exist and how to resolve them.

Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Uniprocessor limits on performance
The speed of a pipelined processor (instructions per second) is limited by:
• clock frequency (AMD, 2.66 GHz), which is unlikely to increase much more
• depth of pipeline - as the depth increases, the work in each stage per cycle initially decreases, but the effects of register hazards, branching etc. limit further subdivision, and load balancing between the stages gets increasingly difficult
So, why only initiate one instruction in each cycle?
Superpipelined processors double the clock frequency by pushing alternate instructions from a conventional instruction stream into 2 parallel pipelines. The compiler must separate instructions to run independently in the 2 streams, and when this is not possible it must add NULL operations. More than 2 pipelines could be used. The scheme is not very flexible and has been superseded by:
Superscalar processors, which use a conventional instruction stream, read at several instructions per cycle.
Decoded instructions issued to a number of pipelines - 2 or 3 pipelines can be kept busy this way Very Long Instruction Word (VLIW) processors use modified instruction set each containing sub-instructions, each sent to separate functional units 3/16/2016 EE3.cma - Computer Architecture 209 Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors • Superscalar Architectures – fetch and decode more instructions than needed to feed a single pipeline – launch instructions down a number of parallel pipelines in each cycle – compilers often re-order instructions to place suitable instructions in parallel - the details of the strategy used will have a huge effect on the degree of parallelism achieved – some superscalars can perform re-ordering at run time - to take advantage of free resources – relatively easy to expand - add another pipelined functional unit. Will run previously compiled code, but will benefit from new compiler – provide exceptional peak performance, but extra data requirements put heavy demands on memory system and sustained performance might not be much more than 2 instructions per cycle. 3/16/2016 EE3.cma - Computer Architecture 210 Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors • Very Long Instruction Word architectures – VLIW machines provide a number of parallel functional units • typically 2 integer ALUs, 2 floating point units, 2 memory access units and a branch control engine – the units are controlled from bits in a very long instruction word - this can be 150 bits or more in width – needs fetching across a wide instruction bus - and hence wide memories and cache. – Many functional units require 2 register read ports and a register write port – Application code must have plenty of instruction level parallelism and few control hazards - obtainable by loop unrolling – Compiler responsible for identifying activities to be combined into a single VLIW. 3/16/2016 EE3.cma - Computer Architecture 211 21 3/16/2016 EE3.cma - Computer Architecture 212 Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors • Hazards and Instruction Issue Matters with Multiple Pipelines There are 3 main types of hazard: – read after write - j tries to read an operand before i writes it, j gets the old value – write after write - j writes a result before i, the value left by i rather than j is left at the destination – write after read - j writes a result before it is read by i, i incorrectly gets new value In single pipeline machine with in-order execution read after write is the only one that can not be avoided and is easily solved using forwarding. Using extra superscalar pipelines (or altering the order of instruction completion or issue) brings all three types of hazard further into play: 3/16/2016 EE3.cma - Computer Architecture 213 Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors • Read After Write Hazards It is difficult to organise forwarding from one pipeline to another. Better is to allow each pipeline to write its result values directly to any execution unit that needs them • Write After Read Hazards Consider F0 = F1 DIV F2 F3 = F0 ADD F4 F4 = F2 SUB F6 Assume that DIV takes several cycles to execute in one floating point pipeline. Its dependency with ADD (F0) stops ADD from being executed until DIV finishes. BUT SUB is independent of F0 and F3 and could be executed in parallel with DIV and could finish 1st. 
If it wrote to F4 before the ADD read it, then ADD would have the wrong value.
3/16/2016 EE3.cma - Computer Architecture 214
Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors
• Write After Write Hazards
Consider
      F0 = F1 DIV F2
      F3 = F0 MUL F4
      F0 = F2 SUB F6
      F3 = F0 ADD F4
On a superscalar the DIV and SUB have independent operands (F2 is read twice but not changed). If there are 2 floating point pipelines, each could be performed at the same time. DIV would be expected to take longer, so SUB might try to write to F0 before DIV - hence ADD might get the wrong value from F0. (MUL would be made to wait for DIV to finish, however.)
We can use scoreboarding to resolve these issues.
3/16/2016 EE3.cma - Computer Architecture 215
Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors
• Limits to Superscalar and VLIW Expansion
Only 5 operations per cycle are typical - why not 50?
– Limited parallelism available
• VLIW machines depend on a stream of ready-parallelised instructions
– many parallel VLIW instructions can only be found by unrolling loops
– if a VLIW field cannot be filled in an instruction, then the functional unit will remain idle during that cycle
• a superscalar machine depends on a stream of sequential instructions
– loop unrolling is also beneficial for superscalars
– Limited hardware resources
• the cost of register read/write ports scales linearly with their number, but the complexity of access increases as the square of the number
• extra register access complexity may lead to longer cycle times
• more memory ports are needed to keep the processor supplied with data
– Code size too high
• wasted fields in VLIW instructions lead to poor code density, a need for increased memory access and overall less benefit from the wide instruction bus
3/16/2016 EE3.cma - Computer Architecture 216
Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors
• Amdahl's Law
Gene Amdahl suggested the following law for vector processors - it is equally appropriate for VLIW and superscalar machines and all multiprocessor machines. Any parallel code has sequential elements - at startup and shutdown, at the beginning and end of each loop etc. To find the benefit from parallelism we need to consider how much is done sequentially and how much in parallel. The speedup factor can be taken as:
S(n) = (execution time using one processor) / (execution time using a multiprocessor with n processors)
If the fraction of code which cannot be parallelised is f, and the time taken for the computation on one processor is t, then the time taken to perform the computation with n processors will be:
ft + (1 - f)t/n
The speedup is therefore:
S(n) = t / (ft + (1 - f)t/n) = n / (1 + (n - 1)f)
(ignoring any overhead due to parallelism or communication between processors)
3/16/2016 EE3.cma - Computer Architecture 217
Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors
• Amdahl's Law
[Graph: Amdahl's Law - speedup S(n) v number of processors (0-20), with curves for f = 0%, 5%, 10% and 20%; each curve saturates towards S(infinity) = 1/f]
• even for an infinite number of processors the maximum speedup is 1/f
• a small reduction in the sequential overhead can make a huge difference in throughput
3/16/2016 EE3.cma - Computer Architecture 218
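A short C sketch of the formula above - purely illustrative, evaluating S(n) for a 5% sequential fraction and showing the saturation at 1/f:

#include <stdio.h>

/* Speedup predicted by Amdahl's law: S(n) = n / (1 + (n-1)f) */
static double amdahl(double f, int n) { return n / (1.0 + (n - 1) * f); }

int main(void) {
    const double f = 0.05;                  /* 5% of the code is sequential */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  S(n) = %6.2f\n", n, amdahl(f, n));
    printf("limit 1/f = %.1f\n", 1.0 / f);  /* S(n) can never exceed 20 here */
    return 0;
}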
Instruction Level Parallelism Superscalar and Very Long Instruction Word Processors
• Gustafson's Law
– A result of observation and experience.
– If you increase the size of the problem, then the size (not the fraction) of the sequential part remains the same.
– eg if we have a problem that uses a number of grid points to solve a partial differential equation:
• for 1,000 grid points, 10% of the code is sequential
• we might expect that for 10,000 grid points only 1% of the code will be sequential
• if we expand the problem to 100,000 grid points, then only 0.1% of the problem remains sequential
– So after Amdahl's law, things start to look better again!
3/16/2016 EE3.cma - Computer Architecture 219
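A hedged C sketch of the grid-point example: we assume the sequential part is a fixed amount of work, so that its fraction is f = 100/points (chosen to match the slide's 10%, 1% and 0.1%), and evaluate the resulting Amdahl speedup on 100 processors:

#include <stdio.h>

static double amdahl(double f, int n) { return n / (1.0 + (n - 1) * f); }

/* As the problem grows, the fixed sequential part becomes a smaller
   fraction of the whole, so the achievable speedup keeps improving. */
int main(void) {
    const int n = 100;                       /* processors */
    for (double points = 1e3; points <= 1e5; points *= 10) {
        double f = 100.0 / points;           /* 10%, 1%, 0.1% sequential */
        printf("%8.0f points: f = %6.3f%%  S(%d) = %6.2f\n",
               points, 100.0 * f, n, amdahl(f, n));
    }
    return 0;
}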
22
3/16/2016 EE3.cma - Computer Architecture 220
Running Programs in Parallel
Options for running programs in parallel include:
• Timesharing on a Uniprocessor - this is mainly multi-tasking to share a processor rather than combining resources for a single application. Timesharing is characterised by:
– shared memory and semaphores
– high context-switch overheads
– limited parallelism
• Multiprocessors with shared memory - clustered computing combines several processors communicating via shared memory and semaphores.
– Shared memory limits performance (even with caches) because of the delays while the operating system or user processes wait for other processes to finish with shared memory and let them have their turn.
– Four to eight processors actively communicating on a shared bus is about the limit before access delays become unacceptable.
• Multiprocessors with separate communication switching devices - the INMOS transputer and Beowulf clusters.
– each element contains a packet routing controller as well as a processor (the transputer contained both on a single chip)
– messages can be sent between any process on any processor in hardware
3/16/2016 EE3.cma - Computer Architecture 221
Running Programs in Parallel
Options for running programs in parallel include (cont'd):
• Vector Processors (native and attached)
– may just be specialised pipeline engines pumping operands through heavily-pipelined, chained, floating point units
– or they might have enough parallel floating point units to allow vector operands to be manipulated element-wise in parallel
– can be integrated into otherwise fast scalar processors
– or might be co-processors which attach to general purpose processors
• Active Memory (Distributed Array Processor)
– rather than take data to the processors, it is possible to take the processors to the data, by implementing a large number of very simple processors in association with columns of bits in memory
– thus groups of processors can be programmed to work together, manipulating all the bits of stored words
– all processors are fed the same instruction in a cycle by a master controller
• Dataflow Architectures - an overall task is defined in terms of all the operations which need to be performed and all the operands and intermediate results needed to perform them. Some operations can be started immediately with the initial data, whilst others must wait for the results of the first ones, and so on through to the result.
3/16/2016 EE3.cma - Computer Architecture 222
Running Programs in Parallel
• Semaphores:
– lock shared resources
– problems of deadlock and starvation
• Shared memory
– The fastest way to move information between two processors is not to!
– Rather than copying the data (sender → receiver), the sender and receiver share the same memory
– Use a semaphore to prevent the receiver reading until the sender has finished
– The segment is created outside normal process space - a system call maps it into the space of each requesting process
[Figure: shared segments 1-3 mapped into the address spaces of processes Proc 1, Proc 2 and Proc 3]
3/16/2016 EE3.cma - Computer Architecture 223
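A minimal POSIX sketch of this model - the segment and semaphore names "/seg1" and "/ready" are illustrative and error handling is omitted; on Linux, link with -lrt -lpthread:

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* segment created outside normal process space ... */
    int fd = shm_open("/seg1", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    /* ... a system call maps it into the space of the requesting process */
    char *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    sem_t *ready = sem_open("/ready", O_CREAT, 0600, 0);

    if (fork() == 0) {                /* receiver */
        sem_wait(ready);              /* blocked until the sender has finished */
        printf("receiver read: %s\n", seg);
        return 0;
    }
    strcpy(seg, "hello via shared segment");  /* sender writes in place... */
    sem_post(ready);                  /* ...then releases the semaphore */
    wait(NULL);
    sem_unlink("/ready");
    shm_unlink("/seg1");
    return 0;
}

Note that no data is copied between the processes: only the semaphore is used for synchronisation.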
Running Programs in Parallel
Flynn's Classification of Computer Architectures
SISD Single Instruction, Single Data machines are conventional uni-processors. They read their instructions sequentially and operate on their data operands individually. Each instruction only accesses a few operand words.
SIMD Single Instruction, Multiple Data machines are typified by vector processors. Instructions are still read sequentially, but this time each performs work on operands which describe multi-word objects such as arrays and vectors. These instructions might perform vector element summation, complete matrix multiplication or the solution of a set of simultaneous equations.
MIMD Multiple Instruction, Multiple Data machines are capable of fetching many instructions at once, each of which performs operations on its own operands. The architecture here is that of a multiprocessor - each processor (probably a SISD or SIMD processor) performs its own computations but shares the results with the others. The multiprocessor sub-divides a large task into smaller sections which are suitable for parallel solution and permits these tasks to share earlier results.
(MISD) Multiple Instruction, Single Data machines are not really implementable. One might imagine an image processing engine capable of taking an image and performing several concurrent operations upon it...
3/16/2016 EE3.cma - Computer Architecture 224
Major Classifications of Parallelism
Introduction
Almost all parallel applications can be classified into one or more of the following:
• Algorithmic Parallelism - the algorithm is split into sections (eg pipelining)
• Geometric Parallelism - a static data space is split into sections (eg process an image on an array of processors)
• Processor Farming - the input data is passed to many processors (eg ray tracing: co-ordinates to several processors, one ray at a time)
Load Balancing
There are 3 forms of load balancing:
• Static Load Balancing - the choice of which processor to use for each part of the task is made at compile time
• Semi-dynamic - the choice is made at run-time, but once started, each task must run to completion on the chosen processor - more efficient
• Fully-dynamic load balancing - tasks can be interrupted and moved between processors at will. This enables processors with different capabilities to be used to best advantage. Context switching and communication costs may outweigh the gains.
3/16/2016 EE3.cma - Computer Architecture 225
23
3/16/2016 EE3.cma - Computer Architecture 226
Major Classifications of Parallelism
Algorithm Parallelism
• Tasks can be split so that a stream of data can be processed in successive stages on a series of processors.
• As the first stage finishes its processing the result is passed to the second stage, and the first stage accepts more input data, processes it, and so on.
• When the pipeline is full, one result is produced at every cycle.
• At the end of continuous operation the early stages go idle as the last results are flushed through.
• Load balancing is static - the speed of the pipeline is determined by the speed of the slowest stage.
[Figure: pipeline topologies - a linear pipeline (or chain) taking data to results; a pipeline with a parallel section; and an irregular network as the general case]
3/16/2016 EE3.cma - Computer Architecture 227
Major Classifications of Parallelism
Geometric Parallelism
• Some regular-patterned tasks can be processed by spreading their data across several processors and performing the same task on each section in parallel.
• Many examples involve image processing - pixels mapped to an array of transputers, for example.
• Many such tasks involve communication of boundary data from one portion to another - finite element calculations.
• Load balancing is static - the initial partitioning of the data determines the time to process each area. Rectangular blocks may not be the best choice - stripes, concentric squares…
• Initial loading of the data may prove to be a serious overhead.
[Figure: a data array partitioned across an array of connected transputers]
3/16/2016 EE3.cma - Computer Architecture 228
Major Classifications of Parallelism
Geometric v Algorithmic
F(xi) = cos(sin(exp(xi*xi))) for x1, x2, … x6 using 4 processors
Algorithmic: a 4-stage pipeline - y = x*x, exp(y), sin(y), cos(y) - with each stage taking 1 time unit. F1 is produced in 4 time units and F2 at time 5, i.e. time = 4 + (6-1) = 9 units; speedup = 24/9 = 2.6
Geometric: each of the 4 processors computes the whole of cos(sin(exp(x*x))) for its own share of x1…x6, at 4 time units per value (two of the processors take two values each), i.e. time = 8 units; speedup = 24/8 = 3
3/16/2016 EE3.cma - Computer Architecture 229
Major Classifications of Parallelism
Processor Farming
• Involves sharing work out from a central controller process to several worker processes.
• The "workers" just accept packets of command data and return results.
• The "controller" splits up the tasks, sending work packets to free processors (ones that have returned a result) and collating the results.
• Global data is sent to all workers at the outset.
• Processor farming is only appropriate if:
– the task can be split into many independent sections
– the amount of communication (commands + results) is small
• To minimise latency, it might be better to keep 2 (or 3) packets in circulation for each worker - buffers are needed.
• Load balancing is semi-dynamic - the command packets are sent to processors which have just (or are about to) run out of work. Thus all processors are kept busy except for the closedown phase, when some finish before others.
3/16/2016 EE3.cma - Computer Architecture 230
Major Classifications of Parallelism
Processor Farming (cont'd)
[Figure: A Processor Farm Controller - from initial processor numbers it generates work packets and sends command packets (CPU; work) through buffered outgoing routers to the workers, each on a separate transputer/processor; result packets (result; CPU) come back through return routers and buffers, the results received are displayed, and each returned free CPU # triggers the sending of further commands]
3/16/2016 EE3.cma - Computer Architecture 231
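A hedged sketch of a processor farm using MPI (which appears later in these notes alongside Beowulf clusters): rank 0 is the controller, the remaining ranks are workers, and the "work" is reduced to squaring an integer. It assumes there are more tasks than workers:

#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {                       /* controller */
        int next = 0, sum = 0, result;
        MPI_Status st;
        for (int w = 1; w < size && next < NTASKS; w++) {  /* prime workers */
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        for (int done = 0; done < NTASKS; done++) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            sum += result;
            if (next < NTASKS) {           /* semi-dynamic load balancing: */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);  /* refill the worker that replied */
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);  /* closedown phase */
            }
        }
        printf("sum of squares = %d\n", sum);
    } else {                               /* worker */
        int task;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            task *= task;                  /* the "computation" */
            MPI_Send(&task, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}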
Vector Processors
Introduction
• Vector processors extend the scalar model by incorporating vector registers within the CPU.
• These registers can be operated on by special vector instructions - each performs calculations element-wise on the vector.
• Vector parallelism could enable a machine to be constructed with a row of FPUs all driven in parallel.
• In practice a heavily pipelined single FPU is usually used.
Both are classified as SIMD.
A vector instruction is similar to an unrolled loop, but:
– each computation is guaranteed independent of all others - this allows a deep pipeline (allowing the cycle time to be kept short) and removes the need to check for data hazards (within the vector)
– instruction bandwidth is considerably reduced
– there are no control hazards (eg pipeline flushes on branches), since the looping has been removed
– the memory access pattern is well known - thus the latency of memory access can be countered by interleaved memory blocks and serial memory techniques
– overlap of ALU & FPU operations, memory accesses and address calculations is possible.
3/16/2016 EE3.cma - Computer Architecture 232
Vector Processors
Types of Vector Processors
• Vector-Register machines - vector registers, held in the CPU, are loaded and stored using pipelined vector versions of the typical memory access instructions.
• Memory-to-Memory Vector machines - operate on memory only. Pipelines of memory accesses and FPU instructions operate together without pre-loading the data into vector registers. (This style has been overtaken by Vector-Register machines.)
Vector-Register Machines
The main sections of a Vector-Register machine are:
• The Vector Functional Units - a machine can have several pipelined units, usually dedicated to just one purpose so that they can be optimised.
• Vector Load/Store activities are usually carried out by a dedicated pipelined memory access unit. This unit must deliver one word per cycle (at least) in order that the FPUs are not held up. If this is the case, vector fetches may be carried out whilst part of the vector is being fed to the FPU.
• Scalar Registers and Processing Engine - a conventional machine
• Instruction Scheduler
3/16/2016 EE3.cma - Computer Architecture 233
Vector Processors
The Effects of Start-up Time and Initiation Rate
• Like all pipelined systems, the time taken for a vector operation is determined by the start-up time, the initiation (result delivery) rate and the number of calculations performed.
• The initiation rate is usually 1 - new vector elements are supplied in every cycle.
• The start-up cost is the time for one element to pass along the vector pipeline - i.e. its depth in stages. This time is increased by the time taken to fetch data operands from memory if they are not already in the vector registers - and this can dominate.
• The number of clock cycles per vector element is then:
cycles per result = (start-up time + n * initiation rate) / n
• The start-up time is divided amongst all of the elements and dominates for short vectors.
• The start-up time is more significant (as a fraction of the time per result) when the initiation rate drops to 1.
3/16/2016 EE3.cma - Computer Architecture 234
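A small C illustration of the cycles-per-result formula, with invented but plausible numbers (a 12-cycle start-up, initiation rate of 1); it shows how start-up dominates for short vectors:

#include <stdio.h>

int main(void) {
    const double startup = 12.0;   /* pipeline depth + memory latency, cycles */
    const double rate = 1.0;       /* one new element per cycle */
    int lengths[] = {1, 4, 16, 64, 256};
    for (int i = 0; i < 5; i++) {
        int n = lengths[i];
        printf("n = %3d : %.2f cycles per result\n",
               n, (startup + n * rate) / n);
    }
    return 0;
}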
Vector Processors
Load/Store Behaviour
• The pipelined load/store unit must be able to sustain a memory access rate at least as good as the initiation rate of the FPUs to avoid data starvation. This is especially important when chaining the two units.
• Memory has a start-up overhead - the access time latency - similar to the pipeline start-up cost.
• Once data starts to flow, how can a rate of one word/cycle be maintained? Interleaving is usually used.
Memory Interleaving
We need to attach multiple memory banks to the processor and operate them all in parallel so that the overall access rate is sufficient. Two schemes are common:
• Synchronised Banks
• Independent Banks
3/16/2016 EE3.cma - Computer Architecture 235
Vector Processors
Memory Interleaving (cont'd)
Synchronised Banks
• A single memory address is passed to all memory banks, and they all access a related word in parallel.
• Once stable, all these words are latched and are then read out sequentially across the data bus - achieving the desired rate.
• Once the latching is complete the memories can be supplied with another address and may start to access it.
3/16/2016 EE3.cma - Computer Architecture 236
Vector Processors
Memory Interleaving (cont'd)
Independent Banks
• If each bank of memory can be supplied with a separate address, we obtain more flexibility - BUT we must generate and supply much more information.
• The data latches (as in the synchronised case) may not be necessary, since all data should be available at the memory interface when required.
In both cases, we require more memory banks than the number of clock cycles taken to get information from a bank of memory.
The number of banks chosen is usually a power of 2 - to simplify addressing (but this can also be a problem - see vector strides).
3/16/2016 EE3.cma - Computer Architecture 237
24
3/16/2016 EE3.cma - Computer Architecture 238
Vector Processors
Variable Vector Length
In practice the length of user vectors will not match the hardware vector length (64, 256 - whatever the machine provides). A hardware vector-length register in the processor is set before each vector operation - it is also used in the load/store unit.
Programming Variable Length Vector Operations
Since the processor's vector length is fixed, operations on long user vectors must be covered by several vector instructions. This is called strip mining. Frequently, the user vector will not be a precise multiple of the machine vector length, and so one vector operation will have to compute results for a short vector - this incurs greater set-up overheads.
Consider the following:
for (j=0; j<n; j++)
    x[j] = x[j] + (a * b[j]);
For a vector processor with vectors of length MAX and a vector-length register called LEN, we need to process a number of MAX-sized chunks of x[j] and then one section which covers the remainder:
3/16/2016 EE3.cma - Computer Architecture 239
Vector Processors
Variable Vector Length (cont'd)
start = 0;
LEN = MAX;
for (k=0; k<n/MAX; k++) {
    for (j=start; j<start+MAX; j++) {
        x[j] = x[j] + (a*b[j]);
    }
    start = start + MAX;
}
LEN = n - start;
for (j=start; j<n; j++)
    x[j] = x[j] + (a*b[j]);
The j-loop in each case is implemented as three vector instructions - a load, a multiply and an add. The time to execute the whole program is simply:
Int(n/MAX)*(sum of start-up overheads) + (n*3*1) cycles
This equation exhibits a saw-tooth shape as n increases - the efficiency drops each time a large vector fills up and an extra one-element vector must be used, carrying an extra start-up overhead…
Unrolling the outer loop will be effective too...
3/16/2016 EE3.cma - Computer Architecture 240
Vector Processors
Vector Stride
Multi-dimensional arrays are stored in memory as single-dimensional vectors. In all common languages (except Fortran) row 1 is stored next to row 0, plane 0 is stored next to plane 1, etc.
Thus, accessing an individual row of a matrix involves reading contiguous memory locations; these reads are easily spread across several interleaved memory banks.
Accessing a column of a matrix - the nth element in every row, say - involves picking individual words from memory.
These words are separated from each other by x words, where x is the number of elements in each row of the matrix. x is the stride of the matrix in this dimension. Each dimension has its own stride. Once loaded, vector operations on columns can be carried out with no further reference to their original memory layout. 3/16/2016 EE3.cma - Computer Architecture 241 Vector Processors Vector Stride (cont’d) Consider multiplying 2 rectangular matrices together: What is the memory reference pattern of a column-wise vector load? • We step through the memory in units of our stride What about in a memory system with j interleaved banks? • If j is co-prime with the stride x then we visit each bank just once before re-visiting any one again (assuming that we use the LS words address bits as bank selectors) • If j has any common factors with x (especially if j is a factor of x) then the banks are visited in a pattern which favours some banks and totally omits others. Since the number of active banks is reduced, the latency of memory accesses is not hidden and the one-cycle-per-access goal is lost. This is an example of aliasing. Does it matter whether the interleaving uses synchronised or independent banks? • Yes. In the synchronised case, the actual memory accesses must be timed correctly since all the MS addresses are the same, and if the stride is wider than the interleaving factor, only some of the word accesses will be used anyway. • In the independent case, the separate accesses automatically happen at the right time and to the right addresses. The load/store unit must generate the stream of addresses in advance of the data being required, and must send each to the correct bank A critically-banked system - interleaved banks are all used fully in a vector access Overbanking - supplying more banks than needed, reduces danger of aliasing 3/16/2016 EE3.cma - Computer Architecture 242 Vector Processors Forwarding and Chaining If a vector processor is required to perform multiple operations on the same vector, then it is pointless to save the first result before reading it back to another (or the same) functional unit Chaining - the vector equivalent of forwarding - allows the pipelined result output of one functional unit to be joined to the input of another The performance of two chained operations is far greater than that of just one, since the first operation does not have to finish before the next starts. Consider V1 = V2 MULT V3 V4 = V1 ADD V5 The non-chained solution requires a brief stall (4 cycles) since V1 must be fully written back to the registers before it can be re-used. In the chained case, the dependence between writes to elements of V1 and their re-reading in the ADD are compensated by the forwarding effect of the chaining - no storage is required prior to use. 3/16/2016 EE3.cma - Computer Architecture 243 Multi-Core Processors • A multi-core microprocessor is one which combines two or more independent processors into a single package, often a single IC. A dual-core device contains only two independent microprocessors. • In general, multi-core microprocessors allow a computing device to exhibit some form of parallelism without including multiple microprocessors in separate physical packages often known as chip level multiprocessing or CMP. 3/16/2016 EE3.cma - Computer Architecture 244 Multi-Core Processors Commercial examples • IBM’s POWER4, first Dual-Core module processor released in 2000. • IBM's POWER5 dual-core chip now in production - in use in the Apple PowerMac G5. 
• Sun Microsystems UltraSPARC IV, UltraSPARC IV+, UltraSPARC T1 • AMD - dual-core Opteron processors on 22 April 2005, – dual-core Athlon 64 X2 family, on 31 May 2005. – And the FX-60, FX-62 and FX-64 for high performance desktops, – and one for laptops. • Intel's dual-core Xeon processors, – also developing dual-core versions of its Itanium high-end CPU – produced Pentium D, the dual core version of Pentium 4. – A newer chip, the Core Duo, is available in the Apple Computer's iMac • Motorola/Freescale has dual-core ICs based on the PowerPC e500 core, and e600 and e700 cores in development. • Microsoft's Xbox 360 game console uses a triple core PowerPC microprocessor. • The Cell processor, in PlayStation 3 is a 9 core design. 3/16/2016 EE3.cma - Computer Architecture 245 Multi-Core Processors Why? • • • • CMOS manufacturing technology continues to improve: – BUT reducing the size of single gates, can’t continue to increase clock speed – 5km of internal interconnects in modern processor…. Speed of light is too slow! Also significant heat dissipation and data synchronization problems at high rates. Some gain from – Instruction Level Parallelism (ILP) superscalar pipelining – can be used for many applications – Many applications better suited to Thread level Parallelism (TLP)- multiple independent CPUs A combination of increased available space due to refined manufacturing processes and the demand for increased TLP has led to multi-core CPUs. 3/16/2016 EE3.cma - Computer Architecture 246 Multi-Core Processors • Advantages • Proximity of multiple CPU cores on the same die have the advantage that the cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip - combining equivalent CPUs on a single die significantly improves the cache performance of multiple CPUs. • Assuming that the die can fit into the package, physically, the multi-core CPU designs require much less Printed Circuit Board (PCB) space than multi-chip designs. • A dual-core processor uses slightly less power than two coupled single-core processors - fewer off chip signals, shared circuitry, like the L2 cache and the interface to the main Bus. • In terms of competing technologies for the available silicon die area, multi-core design can make use of proven CPU core library designs and produce a product with lower risk of design error than devising a new wider core design. • Also, adding more cache suffers from diminishing returns, so better to use space in other ways 3/16/2016 EE3.cma - Computer Architecture 247 Multi-Core Processors • Disadvantages • In addition to operating system (OS) support, adjustments to existing software can be required to maximize utilization of the computing resources provided by multicore processors. • The ability of multi-core processors to increase application performance depends on the use of multiple threads within applications. – eg, most current (2006) video games will run faster on a 3 GHz single-core processor than on a 2GHz dual-core, despite the dual-core theoretically having more processing power, because they are incapable of efficiently using more than one core at a time. • Integration of a multi-core chip drives production yields down and they are more difficult to manage thermally than lower-density single-chip designs. • Raw processing power is not the only constraint on system performance. Two processing cores sharing the same system bus and memory bandwidth limits the real-world performance advantage. 
Even in theory, a dual-core system cannot achieve more than about a 70% performance improvement over a single core, and in practice it will most likely achieve less.
3/16/2016 EE3.cma - Computer Architecture 248
25
3/16/2016 EE3.cma - Computer Architecture 249
The INMOS Transputer
The Transputer
Necessary features for a message-passing microprocessor are:
• a low context-switch time
• a hardware process scheduler
• support for the communicating process model
• normal microprocessor facilities.
Special Features of Transputers:
• high performance microprocessor
• conceived as building blocks (like transistors or logic gates)
• designed for intercommunication
• CMOS devices - low power, high noise immunity
• integrated with a small supporting chip count
• provided with a hardware task scheduler - supports multi-tasking with low overhead
• capable of sub-microsecond interrupt responses - good for control applications
3/16/2016 EE3.cma - Computer Architecture 250
The INMOS Transputer
[Figure-only slide]
3/16/2016 EE3.cma - Computer Architecture 251
The INMOS Transputer
Transputer Performance
The fastest first-generation transputer (IMS T805-30) is capable of:
• up to 15 MIPS sustained
• up to 3 MFLOPs sustained
• up to 40 Mbytes/sec at the main memory interface
• up to 120 Mbytes/sec to the 4K byte on-chip memory
• up to 2.3 Mbytes/sec on each of 4 bi-directional Links
30MHz clock speed
The fastest projected second-generation transputer (IMS T9000-50):
• is 5 times faster in calculation
• and 6 times faster in communication
50MHz clock speed - equivalent performance to the 100MHz Intel 486
3/16/2016 EE3.cma - Computer Architecture 252
The INMOS Transputer
Low Chip Count
To run using internal RAM a T805 transputer only requires:
• a 5MHz clock
• a 5V power supply at about 150mA
• a power-on-reset or external reset input
• an incoming Link to supply boot code and sink results
Expansion possibilities
• 32K*32 SRAM (4 devices) requires 3 support chips
• 8 support devices will support 8Mbytes of DRAM with optimal timing
• extra PALs will directly implement 8-bit memory-mapped I/O ports or timing logic for conventional peripheral devices (Ethernet, SCSI, etc)
• Link adapters can be used for limited expansion to avoid memory mapping
• TRAMs (transputers plus peripherals) can be used as very high-level building blocks
3/16/2016 EE3.cma - Computer Architecture 253
The INMOS Transputer
Transputer Processes
Software running on a transputer is made up from one or more sequential processes, which run in parallel and communicate with each other periodically.
Software running on many interconnected transputers is simply a group of parallel processes - just the same as if all the code were running on a single processor.
Processes can be reasoned about individually; rules exist which allow the overall effect of parallel processes to be reasoned about too.
The benefits of breaking a task into separate processes include:
• taking advantage of parallel hardware
• taking advantage of parallelism on a single processor
• breaking the task into separately-programmed sections
• easy implementation of buffers and data management code which runs asynchronously with the main processing sections
3/16/2016 EE3.cma - Computer Architecture 254
The INMOS Transputer
Transputer Registers
The transputer implements a stack of 3 hardware registers and is able to execute 0-address instructions. It also has a few one-address instructions which are used for memory access.
All instructions and data operands are built up in 4-bit sections using an Operand register and two special Prefix and Negative Prefix instructions.
Extra registers are used to store the head and tail pointers of two linked lists of process workspace headers - these make up the high- and low-priority run-time process queues. The hardware scheduler takes a new process from one of these queues whenever it suspends the current process (due to time-slicing or communication).
3/16/2016 EE3.cma - Computer Architecture 255
The INMOS Transputer
Action on Context Switching
Each process runs until it communicates, is time-sliced or is pre-empted by a higher priority process. Time-slices occur at the next descheduling point - approx 2ms. Pre-emption can occur at any time.
[Figure: a process workspace, holding the local variables for the PROC, the saved program counter and a pointer to the workspace chain]
At a context switch the following happens:
• The PC of the stopping process is saved in its workspace at word WSP-1
• The process pointed to by the processor's BPtr1 is changed to point to the stopping process's WSP
• On a pre-emptive context switch (only), the registers in the ALU and FPU may need saving
• The process pointed to by FPtr1 is unlinked from the process queue, has its stored PC value loaded into the processor and starts executing
A context switch takes about 1µs. This translates to an interrupt rate of about 1,000,000 per second.
3/16/2016 EE3.cma - Computer Architecture 256
The INMOS Transputer
Joining Transputers Together
Three routing configurations are possible:
• static - nearest neighbour communications
• any-to-any routing across static configurations
• dynamic configuration with specialised routing devices
Static Configurations
Transputers can be connected together in fixed configurations, which are characterised by:
• number of nodes
• valency - the number of interconnecting arcs (per processor)
• diameter - the maximum number of arcs traversed from point to point
• latency - the time for a message to pass across the network
• point-to-point bandwidth - the message flow rate along a route
Structures
3/16/2016 EE3.cma - Computer Architecture 257
Beowulf Clusters
Introduction
Mass-market competition has driven down the prices of subsystems: processors, motherboards, disks, network cards etc.
Development of publicly available software: Linux, GNU compilers, PVM and MPI libraries
PVM - Parallel Virtual Machine (allows many inter-linked machines to be combined as one parallel machine)
MPI - Message Passing Interface (similar to PVM)
High Performance Computing groups have many years of experience working with parallel algorithms.
The history of MIMD computing shows many academic groups and commercial vendors building machines based on "state-of-the-art" processors, BUT they always needed special "glue" chips or one-of-a-kind interconnection schemes. This leads to interesting research and new ideas, but often results in one-off machines with a short life cycle - and to vendor-specific code (to use the vendor-specific connections).
Beowulf uses standard parts and the Linux operating system (with MPI - or PVM).
3/16/2016 EE3.cma - Computer Architecture 259
Beowulf Clusters
Introduction
The first Beowulf was built in 1994 with 16 DX4 processors and a 10Mbit/s Ethernet.
Processors were too fast for a single Ethernet Ethernet switches were still much too expensive to use more than one. So they re-wrote the linux ethernet drivers and built a channel bonded Ethernet – network traffic was striped across 2 or more ethernets As 100Mb/s ethernet and switches have become cheap less need for channel bonding. This can support 16, 200MHz P6 processors….. The best configuration continues to change. But this does not affect the user. With the robustness of MPI, PVM, Linux (Extreme) and GNU compilers programmers have the confidence that what they are writing today will still work on future Beowulf clusters. In 1997 CalTech’s 140 node cluster ran a problem sustaining a 10.9 Gflop/s rate 3/16/2016 EE3.cma - Computer Architecture 260 Beowulf Clusters The Future Beowulf clusters are not quite Massively Parallel Processors like the Cray T3D as MPPs are typically bigger and have a lower network latency and a lot of work must be done by the programmer to balance the system. But the cost effectiveness is such that many people are developing do-it-yourself approaches to HPC and building their own clusters. A large number of computer companies are taking these machines very seriously and offering full clusters. 2002 – 2096 processor linux cluster comes in as 5th fastest computer in the world… 2005 – 4800 2.2GHz powerPC cluster is #5 – 42.14TFlops 40960 1.4GHz itanium is #2 – 114.67 TFlops 65536 0.7GHz powerPC is #1 – 183.5TFlops 5000 Opteron (AMD - Cray) is #10 – 20 TFlops 3/16/2016 EE3.cma - Computer Architecture 261 Fastest Super computers – June 2006 Rank Site Computer Processors Year Rmax Rpeak 1 LLNL US Blue Gene – IBM 131072 2005 280600 367000 2 IBM US Blue Gene –IBM 40960 2005 91290 114688 3 LLNL US ASCI Purple IBM 12208 2006 75760 92781 4 NASA US Columbia – SGI 10160 2004 51870 60960 5 CEA, France Tera 10, Bull SA 8704 2006 42900 55705.6 6 Sandia US Thunderbird – Dell 9024 2006 38270 64972.8 7 GSIC, Japan TSUBAME - NEC/Sun 10368 2006 38180 49868.8 8 Julich, Germany Blue Gene – IBM 16384 2006 37330 45875 9 Sandia, US Red Storm - Cray Inc. 10880 2005 36190 43520 10 Earth Simulator, Japan Earth-Simulator, NEC 5120 2002 35860 40960 11 Barcelona Super Computer Centre, Spain MareNostrum – IBM 4800 2005 27910 42144 12 ASTRON/University Groningen, Netherlands Stella (Blue Gene) – IBM 12288 2005 27450 34406.4 13 Oak Ridge, US Jaguar - Cray Inc. 5200 2005 20527 24960 14 LLNL, US Thunder - Digital Corporation 4096 2004 19940 22938 15 Computational Biology Research Center, Japan Blue Protein (Blue Gene) –IBM 8192 2005 18200 22937.6 16 Ecole Polytechnique, Switzerland Blue Gene - IBM 8192 2005 18200 22937.6 17 High Energy Accelerator Research Organization, Japan KEK/BG Sakura (Blue Gene) – IBM 8192 2006 18200 22937.6 18 High Energy Accelerator Research Organization, Japan KEK/BG Momo (Blue Gene) – IBM 8192 2006 18200 22937.6 19 IBM Rochester, On Demand Deep Computing Center, US Blue Gene - IBM 8192 2006 18200 22937.6 20 ERDC MSRC, United States Cray XT3 - Cray Inc. 
4096 2005 16975 21299
3/16/2016 EE3.cma - Computer Architecture 262
[Slides 263-268: figures]
26
3/16/2016 EE3.cma - Computer Architecture 269
[Slides 270-274: figures]
Shared Memory Systems
Introduction
The earliest form of co-operating processors used shared memory as the communication medium.
Shared memory involves connecting the buses of several processors together so that either:
– all memory accesses for all processors share the bus; or
– just inter-processor communication accesses share the common memory
Clearly the latter involves less contention.
Shared memory systems typically operate under the control of a single operating system, either:
• with one master processor and several slaves; or
• with all processors running separate copies of the OS, maintaining a common set of VM and process tables.
3/16/2016 EE3.cma - Computer Architecture 275
Shared Memory Systems
The Shared-Memory Programming Model
Ideally a programmer wants each process to have access to a contiguous area of memory - how is unimportant. Somewhere in the memory map will be sections of memory which are also accessible by other processes.
How do we implement this? We certainly need caches (for speed) and VM, secondary storage etc. (for flexibility). Notice that cache consistency issues are introduced as soon as multiple caches are provided.
[Figure: several processors, each with a local cache, sharing a virtual address space backed by main memory and secondary memory; part of each process's memory is marked shared]
3/16/2016 EE3.cma - Computer Architecture 276
Shared Memory Systems
Common Bus Structures
A timeshared common bus arrangement can provide the interconnection required:
[Figure: several microprocessor (mP) and memory (M) modules plus I/O devices attached to a single shared bus]
A common bus provides:
• contention resolution between the processors
• limited bandwidth, shared by all processors
• single-point failure modes
• cheap(ish) hardware - although speed requirements and complex wiring add to the expense
• easy, but non-scalable, expansion
3/16/2016 EE3.cma - Computer Architecture 277
Shared Memory Systems
Common Bus Structures (cont'd)
Adding caches, extra buses (making a crossbar arrangement) and multiport memory can help.
[Figure: a crossbar of buses connecting cached processors (mP) and I/O devices to multiple memory modules (M)]
3/16/2016 EE3.cma - Computer Architecture 278
Shared Memory Systems
Kendall Square Research KSR1
One of the most recent shared memory architectures is the Kendall Square Research KSR1, which implements the virtual memory model across multiple memories, using a layered caching scheme.
The KSR1 processors are proprietary:
• 64-bit superscalar, issuing 1 integer and 2 chained FP instructions per 50ns cycle, giving a peak integer and FP performance of 20MIPS / 40 MFLOPs
• each processor has 256Kbytes of local instruction cache and 256Kbytes of local data cache
• there is a 40-bit global addressing scheme
1088 (32*34) processors can be attached in the current KSR1 architecture.
Main memory comprises 32Mbytes of DRAM per Processor Environment, connected in a hierarchical cached scheme.
If a page is not held in one of the 32Mbyte caches it is stored on secondary memory (disc - as with any other system).
3/16/2016 EE3.cma - Computer Architecture 279
Shared Memory Systems
KSR1 Processor Interconnection
The KSR1 architecture connects the caches on each processor with a special memory controller called the Allcache Engine. Several such memory controllers can be connected.
[Figure: a level-0 router directory linking cell interconnect units, one per cell; each cell contains a 32 MB main cache, a 256kB cache and a processor (mP)]
3/16/2016 EE3.cma - Computer Architecture 280
Shared Memory Systems
KSR1 Processor Interconnection (cont'd)
The Allcache Engine at the lowest level (level-0) provides:
• connections to all the 32Mbyte caches on the processor cells
• up to 32 processors may be present in each ring
The level-0 Allcache Engine features:
• a 16-bit wide slotted ring, which synchronously passes packets between the interconnect cells (ie every path can carry a packet simultaneously)
• each ring carries 8 million packets per second
• each packet contains a 16-byte header and 128 bytes of data
• this gives a total throughput of 1Gbyte per second
• each router directory contains an entry for each sub-page held in the main cache memory (below)
• requests for a sub-page are made by the cell interconnect unit, passed around the ring and satisfied by data if it is found in the other level-0 caches.
3/16/2016 EE3.cma - Computer Architecture 281
Shared Memory Systems
KSR1 Processor Interconnection (cont'd)
KSR1 Higher Level Routers
In order to connect more than 32 processors, a second layer of routing is needed. This contains up to 34 Allcache router directory cells, plus the main level-1 directory which permits connection to level 2.
[Figure: the ring hierarchy - level-0 (32 processors), level-1 (1088 processors), level-2 (unaffordable; minimal bandwidth per processor)]
3/16/2016 EE3.cma - Computer Architecture 282
Shared Memory Systems
KSR1 Processor Interconnection (cont'd)
The Level-1 Allcache Ring
The routing directories in the level-1 Allcache engine contain copies of the entries in the lower level tables, so that requests may be sent downwards for sub-page information as well as upwards - the level-1 table is therefore very large.
The higher level packet pipelines carry 4 Gbytes per second of inter-cell traffic.
[Figure: a level-1 router directory holding a copy of each level-0 ARD directory below it]
3/16/2016 EE3.cma - Computer Architecture 283
Shared Memory Systems
KSR1 Performance
As with all multi-processor machines, maximum performance is obtained when there is no communication. The layered KSR1 architecture does not scale linearly in bandwidth or latency as processors are added:
Relative Bandwidths
unit           bandwidth (MByte/s)   shared amongst   fraction (MByte/s)
256k subcache  160                   1 PE             160
32MB cache     90                    1 PE             90
level-0 ring   1000                  32 PEs           31
level-1 ring   4000                  1088 PEs         3.7
Relative Latencies
location            latency (cycles)
subcache            2
cache               18
ring 0              150
ring 1              500
page fault (disc)   400,000
Copied (read-only) sub-pages reside in more than one cache and thus provide low-latency access to constant information.
3/16/2016 EE3.cma - Computer Architecture 284
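To see why locality matters so much here, a C sketch computing the average access time from the latency table above; the latencies are the slide's figures, but the hit fractions are invented for illustration:

#include <stdio.h>

int main(void) {
    double lat[]  = {2, 18, 150, 500, 400000};          /* cycles */
    double frac[] = {0.90, 0.08, 0.015, 0.0049, 0.0001}; /* assumed hit rates */
    const char *where[] = {"subcache", "cache", "ring 0", "ring 1",
                           "page fault"};
    double avg = 0;
    for (int i = 0; i < 5; i++) {
        avg += lat[i] * frac[i];
        printf("%-10s latency %8.0f cycles, fraction %7.2f%%\n",
               where[i], lat[i], 100 * frac[i]);
    }
    /* even a 0.01% miss to disc dominates the average access time */
    printf("average access time = %.1f cycles\n", avg);
    return 0;
}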
Shared Memory Systems
KSR1 Performance - how did it succeed?
Like most other parallel architectures, it relies on locality. Locality justifies the workings of:
• virtual memory systems (working sets)
• caches (hit rates)
• interprocess connection networks
Kendall Square Research claimed that the locality present in massively-parallel programs can be exploited by their architecture.
1991 - the 2nd commercial machine is installed at Manchester Regional Computer Centre
1994 - upgraded to the 64-bit version
1998 - Kendall Square Research went out of business; patents transferred to Sun Microsystems
3/16/2016 EE3.cma - Computer Architecture 285
The Cray T3D
Introduction
[Photos: SV1, water-cooled T3E, T3D]
The Cray T3D is the successor to several generations of conventional vector processors. The T3D has been replaced by the newer T3E, but much of what follows applies to both.
The T3E (with 512 processors) is capable of 0.4 TFLOPs.
The SV1ex (unveiled 7/11/00) is capable of 1.8 TFLOPs with 1000 processors - normally delivered as 8-32 processor machines.
3/16/2016 EE3.cma - Computer Architecture 286
The Cray T3D
Introduction
Like every other manufacturer, Cray would like to deliver:
• 1000+ processors with GFLOPs performance
• 10s of Gbytes/s per processor of communication bandwidth
• 100ns interprocessor latency
……they can't afford to - just yet…….
They have tried to achieve these goals by:
• MIMD - multiple co-operating processors will beat small numbers of intercommunicating ones (even vector supercomputers)
• distributed memory
• communication at the memory-access level, keeping latency short and packet size small
• a scalable communications network
• commercial processors (DEC Alpha)
3/16/2016 EE3.cma - Computer Architecture 287
The Cray T3D
The T3D Network
After simulation the T3D network was chosen to be a 3D torus (as is the T3E's). Note:
config            max latency   average latency
8-node ring       4 hops        2 hops
2D, 4*2 torus     3 hops        1.5 hops
3D, 2*2*2 torus   2 hops        1 hop
[Figure: an 8-node ring, a 4*2 2D torus (a cube), a 4*4 2D torus and a hyper-cube]
3/16/2016 EE3.cma - Computer Architecture 288
The Cray T3D
T3D Macro-architecture
The T3D designers decided that the programmer's view of the architecture should include:
• globally-addressed, physically-distributed memory characteristics
• visible topological relationships between PEs
• synchronisation features visible from a high level
Their goal is led by the need to provide a slowly-changing view (to the programmer) from one hardware generation to the next.
T3D Micro-architecture
Rather than choosing to develop their own processor, Cray selected the DEC Alpha processor:
• 0.75 µm CMOS RISC processor core
• 64-bit bus
• 150MHz, 150 MFLOPS, 300MIPS (3 instructions/cycle)
• 32 integer and 32 FP registers
• 8Kbytes instruction and 8Kbytes data caches
• 43-bit virtual address space
3/16/2016 EE3.cma - Computer Architecture 289
The Cray T3D
Latency Hiding
The DEC Alpha has a FETCH instruction which allows memory to be loaded into the cache before it is required in an algorithm.
This runs asynchronously with the processor:
• 16 FETCHes may be in progress at once - they are FIFO queued
• when data is received, it is slotted into the FIFO, ready for access by the processor
• the processor stalls if data is not available at the head of the FIFO when needed
Stores do not have a latency - they can proceed independently of the processor (data dependencies permitting).
Synchronisation
Barrier Synchronisation
• no process may advance beyond the barrier until all processes have arrived
• used as a break between 2 blocks of code with data dependencies
• supported in hardware - 16 special registers - bits set to 1 on barrier creation, set to 0 by each arriving process, hardware interrupt on completion
Messaging (a form of synchronisation)
• the T3D exchanges 32-byte messages + a 32-byte control header
• messages are queued at the target PE, and returned to the sender PE's queue if it is full
3/16/2016 EE3.cma - Computer Architecture 290
The Connection Machine Family
Introduction
The Connection Machine family of supercomputers has developed since the first descriptions were published in 1981. Today the CM5 is one of the fastest available supercomputers.
In 1981 the philosophy of the CM founders was for a machine capable of sequential program execution, but where each instruction was spread to use lots of processors. The CM-1 had 65,536 processors organised in a layer between two communicating planes:
[Figure: a host driving a broadcast control network above a plane of 65,536 cells, each cell a single-bit processor (P) with 4kbit of memory (M), joined below by a hyper-cube data network; total memory = 32Mbytes]
3/16/2016 EE3.cma - Computer Architecture 291
The Connection Machine Family
Introduction (cont'd)
Each single-bit processor can:
• perform single-bit calculations
• transfer data to its neighbours or via the data network
• be enabled or disabled (for each operation) by the control network and its own stored data
The major lessons learnt from this machine were:
• A new programming model was needed - that of virtual processors. One "processor" could be used per data element, and a number of data elements combined onto actual processors. The massive concurrency makes programming and compiler design clearer.
• 32Mbytes was not enough memory (even in 1985!)
• It was too expensive for AI - but physicists wanted the raw processing power
3/16/2016 EE3.cma - Computer Architecture 292
The Connection Machine Family
The Connection Machine 2
This was an enlarged CM-1 with several enhancements. It had:
• 256kbit DRAM per CPU
• clusters of 32 bit-serial processors augmented by a floating point chip (2048 in total)
• parallel I/O added to the processors - 40 discs (RAID - Redundant Array of Inexpensive Disks), graphics frame buffer
In addition, multiple hosts could be added to support multiple users; the plane of small processors could be partitioned.
Architectural Lessons:
• Programmers used a high-level language (Fortran 90) rather than a lower-level parallel language. F90 contains array operators, which provide the parallelism directly. The term data parallel was coined for this style of computation.
• Array operators compiled into instructions sent to separate vector or bit processors
• This SIMD programming model gives synchronisation between data elements in each instruction, but a MIMD processor engine doesn't need such constraints
• Differences between shared (single address space) and distributed memory blur.
• The data network now carries messages which correspond to memory accesses
• The compiler places memory and computations optimally, but statically
• Multiple hosts are awkward compared with a single timesharing host
3/16/2016 EE3.cma - Computer Architecture 293
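The data-parallel style just described amounts to element-wise operations with one conceptual (virtual) processor per element; in C this is an ordinary loop, which an F90 array expression (or a vectorising compiler) maps onto the parallel hardware. A hedged sketch, with the function name ours:

#include <stdio.h>

/* Every iteration is independent, so the whole loop is a single
   data-parallel operation over the array. */
static void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy(8, 2.0f, x, y);
    for (int i = 0; i < 8; i++) printf("%.0f ", y[i]);
    printf("\n");
    return 0;
}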
The Connection Machine Family
The Connection Machine 5
This architecture is more orthogonal than the earlier ones. It just uses larger multi-bit processors, but a similar communication architecture to the CM-1 and CM-2.
Design goals were:
• > 1 TFLOPs
• several Tbytes of memory
• > 1 Tbit/s of I/O bandwidth
[Figure: host (H) and worker (W) processors - identical, though hosts have more memory - joined by a broadcast control network and a hyper-cube data network, with I/O attached]
3/16/2016 EE3.cma - Computer Architecture 294
The Connection Machine Family
The CM-5 Processor
To save on development effort, CM used a common SPARC RISC processor for all the hosts and workers. RISC CPUs are optimised for workstations, so they added extra hardware and fast memory paths.
Each node has:
• 32Mbytes memory
• a network interface
• vector processors capable of up to 128 MFLOPS
• vector-to-memory bandwidth of 0.5Gbytes/s
Caching doesn't really work here.
[Figure: a SPARC with cache and I/O on the main bus, plus four vector processors with 64-bit, 0.5Gbyte/s ports into the 32 Mbytes of memory]
3/16/2016 EE3.cma - Computer Architecture 295
History of Supercomputers
1966/7: Michael Flynn's Taxonomy & Amdahl's Law
1976: Cray Research delivers 1st Cray-1 to LANL
1982: Fujitsu ships 1st VP200 vector machine ~500MFlops
1985: CM-1 demonstrated to DARPA
1988: Intel delivers iPSC/2 hypercubes
1990: Intel produces iPSC/860 hypercubes
1991: CM5 announced
1992: KSR1 delivered
1992: Maspar delivers its SIMD machine - MP2
1993: Cray delivers Cray T3D
1993: IBM delivers SP1
1994: SGI Power Challenge
1996: Hitachi Parallel System
1997: SGI/Cray Origin 2000 delivered to LANL - 0.7TFlops
1997: Intel Paragon (ASCI Red) 2.3 TFlops to Sandia Nat Lab
1998: Cray T3E delivered to US military - 0.9TFlops
2000: IBM (ASCI White) 7.2 TFlops to Lawrence Livermore NL
2002: HP (ASCI Q) 7.8 TFlops to Los Alamos Nat Lab
2002: NEC Earth Simulator Japan 36TFlops
2002: 5th fastest machine in the world is a Linux cluster (2304 processors)
3/16/2016 EE3.cma - Computer Architecture 297
History of Supercomputers
[Chart: history of supercomputer performance]
3/16/2016 EE3.cma - Computer Architecture 298
27
3/16/2016 EE3.cma - Computer Architecture 299
The fundamentals of computing have remained unchanged for 70 years
• During all of the rapid development of computers in that time, little has changed since Turing and von Neumann.
Quantum computers are potentially different.
• They employ quantum mechanical principles that expand the range of operations possible on a classical computer.
• The three main differences between classical and quantum computers are:
• the fundamental unit of information is a qubit
• the range of logical operations
• the process of determining the state of the computer
3/16/2016 EE3.cma - Computer Architecture 300
Qubits
Classical computers are built from bits, with two states: 0 or 1.
Quantum computers are built from qubits - physical systems which possess states analogous to 0 or 1, but which can also be in states between 0 and 1.
The intermediate states are known as superposition states.
A qubit - in a sense - can store much more information than a bit.
3/16/2016 EE3.cma - Computer Architecture 301
Range of logical operations
Classical computers operate according to binary logic.
Quantum logic gates take one or more qubits as input and produce one or more qubits as output. Qubits have states corresponding to 0 and 1, so quantum logic gates can emulate classical logic gates.
With superposition states between 0 and 1 there is a great expansion in the range of quantum logic gates.
• e.g. quantum logic gates that take 0 and 1 as input and produce as output different superposition states between 0 and 1 - with no classical analogue
This expanded range of quantum gates can be exploited to achieve greater information processing power in quantum computers.
3/16/2016 EE3.cma - Computer Architecture 302
Determining the State of the Computer
In classical computers we can read out the state of all the bits in the computer at any time.
In a quantum computer it is in principle impossible to determine the exact state of the computer, i.e. we can't determine exactly which superposition state is being stored in the qubits making up the computer. We can only obtain partial information about the state of the computer.
Designing algorithms is a delicate balance between exploiting the expanded range of states and logical operations and the restricted readout of information.
3/16/2016 EE3.cma - Computer Architecture 303
What actually happens?
[Figure: a single photon ("single particle") arrives at a beam-splitter with detectors A and B on the two output paths - there is an equal probability of the photon reaching A or B. Does the photon travel each path at random?]
What actually happens here?
[Figure: two beam-splitters and two mirrors form an interferometer ending in detectors A and B - if the path lengths are the same, photons always hit A: a single photon travels both routes simultaneously]
3/16/2016 EE3.cma - Computer Architecture 304
Photons travel both paths simultaneously. If we block either of the paths, then A or B become equally probable.
This is quantum interference, and it applies not just to photons but to all particles and physical systems. Quantum computation is all about making this effect work for us.
In this case the photon is in a coherent superposition of being on both paths at the same time. Any qubit can be prepared in a superposition of two logical states - a qubit can store both 0 and 1 simultaneously, and in arbitrary proportions.
Any quantum system with at least two discrete states can be used as a qubit - e.g. energy levels in an atom, photons, trapped ions, spins of atomic nuclei…
3/16/2016 EE3.cma - Computer Architecture 305
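The interferometer above can be sketched numerically with probability amplitudes; here we assume the usual 50/50 beam-splitter transformation (a, b) → ((a + ib)/√2, (ia + b)/√2), with equal path lengths so the mirrors add no relative phase:

#include <complex.h>
#include <math.h>
#include <stdio.h>

/* One pass through a 50/50 beam-splitter, acting on the two path amplitudes */
static void beamsplitter(double complex *a, double complex *b) {
    double complex a2 = (*a + I * *b) / sqrt(2.0);
    double complex b2 = (I * *a + *b) / sqrt(2.0);
    *a = a2; *b = b2;
}

int main(void) {
    double complex a = 1.0, b = 0.0;   /* photon enters on one input port */
    beamsplitter(&a, &b);
    printf("one splitter:        P = %.2f / %.2f\n",
           creal(a * conj(a)), creal(b * conj(b)));   /* 0.50 / 0.50 */
    beamsplitter(&a, &b);
    printf("full interferometer: P = %.2f / %.2f\n",
           creal(a * conj(a)), creal(b * conj(b)));   /* 0.00 / 1.00 */
    return 0;
}

After the full interferometer all the probability lands on a single detector (labelled A on the slide); blocking one path removes one amplitude and both detectors become equally probable again - the interference disappears.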
What actually happens?
[Figure: a single photon meets a beam-splitter, with detectors A and B on the two output paths. There is an equal probability of the photon reaching A or B. Does the photon choose each path at random?]
[Figure: a beam-splitter, two mirrors and a second beam-splitter form an interferometer, again with detectors A and B. If the path lengths are the same, photons always hit A: a single photon travels both routes simultaneously.]
3/16/2016 EE3.cma - Computer Architecture 304

Photons travel both paths simultaneously: if we block either of the paths, then A and B become equally probable again.
This is quantum interference, and it applies not just to photons but to all particles and physical systems.
Quantum computation is all about making this effect work for us.
In this case the photon is in a coherent superposition of being on both paths at the same time.
Any qubit can be prepared in a superposition of its two logical states: a qubit can store both 0 and 1 simultaneously, and in arbitrary proportions.
Any quantum system with at least two discrete states can be used as a qubit, e.g. energy levels in an atom, photons, trapped ions, the spins of atomic nuclei...
3/16/2016 EE3.cma - Computer Architecture 305

Once the qubit is measured, however, only one of the two values it stores can be detected, at random, just as the photon is detected on only one of the two paths.
Not very useful on its own, but...
Consider a traditional 3-bit register: it can represent one of 8 different numbers, 000 to 111.
A quantum register of 3 qubits can represent all 8 numbers at the same time, in quantum superposition.
The bigger the register, the more numbers we can represent at the same time: a 250-qubit register could hold more numbers than there are atoms in the known universe, all on just 250 atoms...
But we only see one of these numbers if we measure the register's contents.
We can now do some real quantum computation...
3/16/2016 EE3.cma - Computer Architecture 306

Mathematical operations can be performed at the same time on all the numbers held in the register.
If the qubits are atoms, then tuned laser pulses can affect their electronic states so that initial superpositions of numbers evolve into different superpositions.
This is basically a massively parallel computation: a calculation can be performed on 2^L numbers in a single step, where a conventional architecture would need 2^L steps or 2^L processors.
It is only good for certain types of computation, NOT information storage: the register can hold many states at once, but we can only see one of them.
Quantum interference allows us to obtain a single result that depends logically on all 2^L of the intermediate results.
3/16/2016 EE3.cma - Computer Architecture 307

Grover's Algorithm
Searches an unsorted list of N items in only ~√N steps; conventionally this scales as N/2 by brute-force searching.
The quantum computer can examine all the entries at the same time.
BUT if the QC is merely programmed to print out the result at that point, it will be no faster than a conventional system: only one of the N computational paths checks the entry we are looking for, so measuring the computer's state straight away would return the correct answer with probability only 1/N.
BUT if we leave the information in the computer, unmeasured, a further quantum operation can cause it to affect the other paths.
If we repeat the operation ~√N times, a measurement will return information about which entry contains the desired number with a probability of 0.5; repeating just a few more times finds the entry with a probability extremely close to 1.
This can be turned into a very useful tool for searching, minimisation, or evaluation of a mean. (A small simulation of the amplitude mechanics follows below.)
3/16/2016 EE3.cma - Computer Architecture 308
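The amplitude mechanics behind Grover's algorithm can be simulated classically for small N. The sketch below (an illustration assuming numpy, not the lecture's own code; a real implementation uses quantum gates, and the arbitrary list size and target index are chosen here just for the demo) tracks one amplitude per entry: each iteration flips the sign of the target's amplitude (the oracle) and then inverts all amplitudes about their mean, and after the textbook count of roughly (π/4)√N iterations the target dominates the measurement statistics.

```python
# Classical state-vector simulation of Grover's search (numpy assumed).
import numpy as np

L = 8                       # qubits in the register (assumed for the demo)
N = 2 ** L                  # size of the unsorted search space
target = 123                # the entry we are looking for (arbitrary choice)

amp = np.full(N, 1 / np.sqrt(N))      # uniform superposition over all N paths

for _ in range(int(np.pi / 4 * np.sqrt(N))):   # ~(pi/4)*sqrt(N) iterations
    amp[target] *= -1                 # oracle: mark the sought entry
    amp = 2 * amp.mean() - amp        # inversion about the mean (diffusion)

print("P(measure target) =", amp[target] ** 2)  # ~0.9999 for N = 256
```

For N = 256 the loop runs only 12 times, against an average of 128 probes for a classical brute-force search, which is the √N-versus-N/2 scaling claimed on the slide.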
Cryptanalysis
The biggest potential use of quantum computing is in cracking encrypted data.
Cracking DES (the Data Encryption Standard) requires a search among 2^56 keys. Conventionally, even at a million keys per second, this takes more than 1000 years; a QC using Grover's algorithm could do it in less than 4 minutes.
Factorisation is the key to the RSA encryption system. Conventionally, the time taken to factorise a number increases exponentially with the number of digits: the largest number ever factorised contained 129 digits, and there is no way to factorise a 1000-digit number conventionally. A QC could do it in a fraction of a second.
This is already a big worry for data security: it may be only a matter of a few years before this capability becomes available.
3/16/2016 EE3.cma - Computer Architecture 309

Decoherence: the obstacle to quantum computation
For a qubit to work successfully it must remain in an entangled quantum superposition of states. As soon as we measure the state, it collapses to a single value; this happens even if we make the measurement by accident.
[Figure: two double-slit arrangements, each with a source. In a conventional double-slit experiment, the wave amplitudes corresponding to an electron (or photon) travelling along the two possible paths interfere. If another particle with spin is placed close to the left slit, a passing electron will flip that spin. This "accidentally" records which path the electron took, and the interference pattern is lost.]
3/16/2016 EE3.cma - Computer Architecture 310

Decoherence: the obstacle to quantum computation
In reality it is very difficult to prevent qubits from interacting with the rest of the world.
The best solution (so far) is to build quantum computers with fault-tolerant designs using error-correction procedures.
The result is that we need more qubits: between 2 and 5 times the number needed in an "ideal world".
3/16/2016 EE3.cma - Computer Architecture 311
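As a closing sanity check, the DES figures on the cryptanalysis slide can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch, assuming the slide's rate of one million key tests per second for both the conventional and the quantum machine:

```python
# Back-of-envelope check of the DES cracking times (10^6 key tests/s assumed).
import math

keys = 2 ** 56                            # DES key space

brute_force_s = (keys / 2) / 1e6          # brute force tries half the keys on average
print(brute_force_s / (3600 * 24 * 365))  # ~1142 years: "more than 1000 years"

grover_s = (math.pi / 4) * math.sqrt(keys) / 1e6   # ~(pi/4) * 2^28 oracle calls
print(grover_s / 60)                      # ~3.5 minutes: "less than 4 minutes"
```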