Part I Background and Motivation

About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition   Released    Revised
First     June 2003   July 2004, June 2005, Mar. 2006, Jan. 2007, Jan. 2008, Jan. 2009, Jan. 2011
Second    Jan. 2011

I Background and Motivation
Provide motivation, paint the big picture, introduce tools:
• Review components used in building digital circuits
• Present an overview of computer technology
• Understand the meaning of computer performance (or why a 2 GHz processor isn't 2× as fast as a 1 GHz model)

Topics in This Part
Chapter 1 Combinational Digital Circuits
Chapter 2 Digital Circuits with Memory
Chapter 3 Computer System Technology
Chapter 4 Computer Performance

1 Combinational Digital Circuits
First of two chapters containing a review of digital design:
• Combinational, or memoryless, circuits in Chapter 1
• Sequential circuits, with memory, in Chapter 2

Topics in This Chapter
1.1 Signals, Logic Operators, and Gates
1.2 Boolean Functions and Expressions
1.3 Designing Gate Networks
1.4 Useful Combinational Parts
1.5 Programmable Combinational Parts
1.6 Timing and Circuit Considerations

1.1 Signals, Logic Operators, and Gates

Name   Operator sign and alternate(s)   Output is 1 iff:          Arithmetic expression
NOT    x′ (also ¬x)                     Input is 0                1 − x
AND    xy (also x ∧ y)                  Both inputs are 1s        x × y, or xy
OR     x ∨ y (also x + y)               At least one input is 1   x + y − xy
XOR    x ⊕ y (also x ≢ y)               Inputs are not equal      x + y − 2xy

Figure 1.1 Some basic elements of digital logic circuits, with operator signs used in this book highlighted.

The Arithmetic Substitution Method
z′ = 1 − z            NOT converted to arithmetic form
xy                    AND same as multiplication (when doing the algebra, set z^k = z)
x ∨ y = x + y − xy    OR converted to arithmetic form
x ⊕ y = x + y − 2xy   XOR converted to arithmetic form

Example: Prove the identity xyz ∨ x′ ∨ y′ ∨ z′ ≟ 1
LHS = [xyz ∨ x′] ∨ [y′ ∨ z′]
    = [xyz + 1 − x − (1 − x)xyz] ∨ [1 − y + 1 − z − (1 − y)(1 − z)]
    = [xyz + 1 − x] ∨ [1 − yz]
    = (xyz + 1 − x) + (1 − yz) − (xyz + 1 − x)(1 − yz)   (this + is addition, not logical OR)
    = 1 + xy²z² − xyz
    = 1 = RHS            (because y² = y and z² = z for 0-1 variables)

Variations in Gate Symbols
AND, OR, NAND, NOR, XNOR
Figure 1.2 Gates with more than two inputs and/or with inverted signals at input or output.

Gates as Control Elements
(a) An AND gate for controlled transfer: data out is x or 0, depending on the enable/pass signal e.
(b) A tristate buffer: data out is x or "high impedance" (no data).
(c) Model for AND switch; (d) model for tristate buffer.
Figure 1.3 An AND gate and a tristate buffer act as controlled switches or valves.
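To see the arithmetic substitution method in action, the short Python sketch below (our illustration, not part of the book) checks the identity just proved by evaluating the logic form, built from the arithmetic translations of Figure 1.1, over all 0-1 assignments.

  from itertools import product

  # Arithmetic forms of the basic operators (Figure 1.1)
  NOT = lambda z: 1 - z
  AND = lambda a, b: a * b
  OR  = lambda a, b: a + b - a * b
  XOR = lambda a, b: a + b - 2 * a * b

  # Check xyz OR x' OR y' OR z' = 1 for every 0-1 assignment
  for x, y, z in product((0, 1), repeat=3):
      lhs = OR(OR(AND(AND(x, y), z), NOT(x)), OR(NOT(y), NOT(z)))
      assert lhs == 1
  print("identity xyz v x' v y' v z' = 1 verified")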
An inverting buffer is logically the same as a NOT gate.

Wired OR and Bus Connections
(a) Wired OR of product terms: data out is x, y, z, or 0.
(b) Wired OR of tristate outputs: data out is x, y, z, or high impedance.
Figure 1.4 Wired OR allows tying together of several controlled signals.

Control/Data Signals and Signal Bundles
Panels: (a) 8 NOR gates, (b) 32 AND gates, (c) k XOR gates, each drawn as a single gate with slash-annotated bundle widths (8, 32, k) and control signals (Enable, Compl).
Figure 1.5 Arrays of logic gates represented by a single gate symbol.

1.2 Boolean Functions and Expressions
Ways of specifying a logic function:
• Truth table: 2^n rows; "don't-care" entries in input or output
• Logic expression: w′(x ∨ y ∨ z), product-of-sums, sum-of-products, equivalent expressions
• Word statement: Alarm will sound if the door is opened while the security system is engaged, or when the smoke detector is triggered
• Logic circuit diagram: synthesis vs. analysis

Manipulating Logic Expressions
Table 1.2 Laws (basic identities) of Boolean algebra.

Name of law    OR version                    AND version
Identity       x ∨ 0 = x                     x 1 = x
One/Zero       x ∨ 1 = 1                     x 0 = 0
Idempotent     x ∨ x = x                     x x = x
Inverse        x ∨ x′ = 1                    x x′ = 0
Commutative    x ∨ y = y ∨ x                 x y = y x
Associative    (x ∨ y) ∨ z = x ∨ (y ∨ z)     (x y) z = x (y z)
Distributive   x ∨ (y z) = (x ∨ y)(x ∨ z)    x (y ∨ z) = (x y) ∨ (x z)
DeMorgan's     (x ∨ y)′ = x′ y′              (x y)′ = x′ ∨ y′

Proving the Equivalence of Logic Expressions (Example 1.1)
• Truth-table method: exhaustive verification
• Arithmetic substitution: x ∨ y = x + y − xy; x ⊕ y = x + y − 2xy
  Example: x ⊕ y ≟ x′y ∨ xy′
  x + y − 2xy ≟ (1 − x)y + x(1 − y) − (1 − x)y x(1 − y)
• Case analysis: two cases, x = 0 or x = 1
• Logic expression manipulation

1.3 Designing Gate Networks
• AND-OR, NAND-NAND, OR-AND, NOR-NOR
• Logic optimization: cost, speed, power dissipation
Useful identity: (a ∨ b ∨ c)′ = a′b′c′
Figure 1.6 A two-level AND-OR circuit and two equivalent circuits: (a) AND-OR circuit, (b) intermediate circuit, (c) NAND-NAND equivalent.

Seven-Segment Display of Decimal Digits
Figure 1.7 Seven-segment display of decimal digits. The three open segments may be optionally used. The digit 1 can be displayed in two ways, with the more common right-side version shown.

BCD-to-Seven-Segment Decoder (Example 1.2)
A 4-bit input x3 x2 x1 x0 in [0, 9] is converted into signals e0-e6 that enable or turn on the display segments.
Figure 1.8 The logic circuit that generates the enable signal for the lowermost segment (number 3) in a seven-segment display unit.

1.4 Useful Combinational Parts
• High-level building blocks
• Much like prefab parts used in building a house
• Arithmetic components (adders, multipliers, ALUs) will be covered in Part III
• Here we cover three useful parts: multiplexers, decoders/demultiplexers, encoders (sketched in code below)
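As a preview of these three parts, here is a minimal Python sketch (illustrative only; the function names are ours, not the book's) of their input-output behavior.

  def mux(inputs, sel):            # 2**a-to-1 multiplexer (Fig. 1.9)
      return inputs[sel]           # route the selected input to the output

  def decoder(addr, a, enable=1):  # a-to-2**a decoder/demux (Fig. 1.10)
      return [1 if enable and i == addr else 0 for i in range(2 ** a)]

  def encoder(inputs):             # 2**a-to-a encoder (Fig. 1.11)
      return inputs.index(1)       # index of the single 1 among the inputs

  assert mux([0, 1, 1, 0], 2) == 1
  assert decoder(2, 2) == [0, 0, 1, 0]
  assert encoder([0, 0, 1, 0]) == 2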
Multiplexers
Figure 1.9 Multiplexer (mux), or selector, allows one of several inputs to be selected and routed to output depending on the binary value of a set of selection or address signals provided to it. Panels show a 2-to-1 mux, its switch view and symbol, a mux array (32-bit bundles), and a 4-to-1 mux (with enable) built of 2-to-1 muxes.

Decoders/Demultiplexers
Figure 1.10 A decoder allows the selection of one of 2^a options using an a-bit address as input. A demultiplexer (demux) is a decoder that only selects an output if its enable signal is asserted. Panels: (a) 2-to-4 decoder, (b) decoder symbol, (c) demultiplexer, or decoder with "enable."

Encoders
Figure 1.11 A 2^a-to-a encoder outputs an a-bit binary number equal to the index of the single 1 among its 2^a inputs. Panels: (a) 4-to-2 encoder, (b) encoder symbol.

1.5 Programmable Combinational Parts
A programmable combinational part can do the job of many gates or gate networks. It is programmed by cutting existing connections (fuses) or establishing new connections (antifuses).
• Programmable ROM (PROM)
• Programmable array logic (PAL)
• Programmable logic array (PLA)

PROMs
Figure 1.12 Programmable connections and their use in a PROM. Panels: (a) programmable OR gates, (b) logic equivalent of part a, (c) programmable read-only memory (PROM): a decoder on the inputs w, x, y, z feeds programmable output columns.

PALs and PLAs
Figure 1.13 Programmable combinational logic: general structure and two classes known as PAL and PLA devices. Not shown is PROM with fixed AND array (a decoder) and programmable OR array. Panels: (a) general programmable combinational logic, an AND array (AND plane) feeding an OR array (OR plane); (b) PAL: programmable AND array, fixed OR array (8-input ANDs); (c) PLA: programmable AND and OR arrays (6-input ANDs, 4-input ORs).

1.6 Timing and Circuit Considerations
Changes in gate/circuit output, triggered by changes in its inputs, are not instantaneous
• Gate delay δ: a fraction of, to a few, nanoseconds
• Wire delay, previously negligible, is now important (electronic signals travel about 15 cm per ns)
• Circuit simulation to verify function and timing

Glitching
Using the PAL in Fig. 1.13b to implement f = x ∨ y ∨ z as a = x ∨ y, then f = a ∨ z. With x = 0, each AND-OR (PAL) level contributes a 2δ delay, so a change in y or z can produce a momentary incorrect value (glitch) at f.
Figure 1.14 Timing diagram for a circuit that exhibits glitching.

CMOS Transmission Gates
Figure 1.15 A CMOS transmission gate and its use in building a 2-to-1 mux. Panels: (a) CMOS transmission gate (P and N transistors in parallel): circuit and symbol; (b) two-input mux built of two transmission gates.
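The following sketch (our own illustration) mimics how a PROM realizes an arbitrary logic function: the fixed AND plane is a decoder that enumerates all input combinations, and "programming" amounts to storing the function's truth table in the OR plane.

  # "Program" a 3-input, 1-output PROM to realize f = x OR y OR z:
  # entry i of the table is f evaluated on the 3-bit input pattern i.
  prom = [1 if i != 0 else 0 for i in range(8)]

  def prom_read(table, x, y, z):
      return table[(x << 2) | (y << 1) | z]   # decoder selects one entry

  assert prom_read(prom, 0, 0, 0) == 0
  assert prom_read(prom, 1, 0, 1) == 1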
2 Digital Circuits with Memory
Second of two chapters containing a review of digital design:
• Combinational (memoryless) circuits in Chapter 1
• Sequential circuits (with memory) in Chapter 2

Topics in This Chapter
2.1 Latches, Flip-Flops, and Registers
2.2 Finite-State Machines
2.3 Designing Sequential Circuits
2.4 Useful Sequential Parts
2.5 Programmable Sequential Parts
2.6 Clocks and Timing of Events

2.1 Latches, Flip-Flops, and Registers
Figure 2.1 Latches, flip-flops, and registers. Panels: (a) SR latch, (b) D latch, (c) master-slave D flip-flop, (d) D flip-flop symbol, (e) k-bit register.

Latches vs Flip-Flops
A D latch is transparent while its clock input C is asserted; a D flip-flop samples its D input only on a clock edge. Setup and hold times define when D must be stable around the sampling point.
Figure 2.2 Operations of D latch and negative-edge-triggered D flip-flop.

Reading and Modifying FFs in the Same Cycle
Figure 2.3 Register-to-register operation with edge-triggered flip-flops: source register → computation module (combinational logic) → destination register; the clock period must accommodate the flip-flop propagation delay plus the combinational delay.

2.2 Finite-State Machines (Example 2.1)
A vending machine coin reception unit accepts dimes and quarters; S00 is the initial state, S35 (35 cents or more deposited) is the final state, and Reset returns the unit to S00.

Current state   Next state on Dime   Next state on Quarter   Next state on Reset
S00             S10                  S25                     S00
S10             S20                  S35                     S00
S20             S30                  S35                     S00
S25             S35                  S35                     S00
S30             S35                  S35                     S00
S35             S35                  S35                     S00

Figure 2.4 State table and state diagram for a vending machine coin reception unit.

Sequential Machine Implementation
Inputs and the present state (held in the state register) feed the next-state logic, which produces next-state excitation signals; output logic produces the outputs. Only in a Mealy machine do the inputs also feed the output logic.
Figure 2.5 Hardware realization of Moore and Mealy sequential machines.

2.3 Designing Sequential Circuits (Example 2.3)
Inputs: quarter in (q) and dime in (d); three D flip-flops (FF2, FF1, FF0) hold the state; output e is asserted in the final state (1xx).
Figure 2.7 Hardware realization of a coin reception unit (Example 2.3).

2.4 Useful Sequential Parts
• High-level building blocks
• Much like prefab closets used in building a house
• Other memory components will be covered in Chapter 17 (SRAM details, DRAM, Flash)
• Here we cover three useful parts: shift register, register file (SRAM basics), counter

Shift Register
Figure 2.8 Register with single-bit left shift and parallel load capabilities. For logical left shift, serial data in line is connected to 0.
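Returning to Example 2.1: the state table of Figure 2.4 maps directly onto a transition table in code. The sketch below (an illustration; names are ours) replays coin sequences through the machine.

  # Next-state table from Figure 2.4: state -> {input: next state}
  next_state = {
      "S00": {"dime": "S10", "quarter": "S25"},
      "S10": {"dime": "S20", "quarter": "S35"},
      "S20": {"dime": "S30", "quarter": "S35"},
      "S25": {"dime": "S35", "quarter": "S35"},
      "S30": {"dime": "S35", "quarter": "S35"},
      "S35": {"dime": "S35", "quarter": "S35"},
  }

  def run(coins, state="S00"):
      for c in coins:
          state = "S00" if c == "reset" else next_state[state][c]
      return state

  assert run(["dime", "quarter"]) == "S35"   # 35 cents: final state reached
  assert run(["dime", "dime"]) == "S20"      # only 20 cents so far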
Register File and FIFO
Figure 2.9 Register file with random access and FIFO. Panels: (a) register file with random access: 2^h k-bit registers, with write data, an h-bit write address, and a write enable feeding a decoder that selects the register to write, and muxes providing two k-bit read ports (read addresses 0 and 1, read enable); (b) graphic symbol for register file; (c) FIFO symbol (input, push, pop, output, empty and full flags).

SRAM
Figure 2.10 SRAM memory is simply a large, single-port register file. Panels: (a) SRAM block diagram: data in, address, write enable, chip select, output enable, data out; (b) SRAM read mechanism: a row decoder selects one row of the square or almost square memory matrix, the row is captured in a row buffer, and a column mux delivers the g data bits out.

Binary Counter
Figure 2.11 Synchronous binary counter with initialization capability: under control of Incr′/Init, a mux loads the count register with either an initial value or the incremented count x + 1 produced by an incrementer (with carry in/out).

2.5 Programmable Sequential Parts
A programmable sequential part contains gates and memory elements. It is programmed by cutting existing connections (fuses) or establishing new connections (antifuses).
• Programmable array logic (PAL)
• Field-programmable gate array (FPGA)
• Both types contain macrocells and interconnects

PAL and FPGA
Figure 2.12 Examples of programmable sequential logic. Panels: (a) portion of PAL with storable output: 8-input ANDs, a D flip-flop, and output muxes; (b) generic structure of an FPGA: configurable logic blocks (CLBs), I/O blocks, and programmable connections.

2.6 Clocks and Timing of Events
Clock is a periodic signal: clock rate = clock frequency
The inverse of clock rate is the clock period: 1 GHz ↔ 1 ns
Constraint: Clock period ≥ t_prop + t_comb + t_setup + t_skew
Figure 2.13 Determining the required length of the clock period: from the moment FF1 begins to change until the change is observed at FF2, the period must be wide enough to accommodate worst-case delays.

Synchronization
Figure 2.14 Synchronizers are used to prevent timing problems arising from untimely changes in asynchronous signals. Panels: (a) simple synchronizer, (b) two-FF synchronizer, (c) input and output waveforms.

Level-Sensitive Operation
Figure 2.15 Two-phase clocking with nonoverlapping clock signals: latches clocked by φ1 and φ2 (clocks with nonoverlapping highs) alternate with blocks of combinational logic.
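The clock-period constraint above is just arithmetic; the sketch below (delay values are made up for illustration) computes the fastest permissible clock for a register-to-register path.

  # Clock period >= t_prop + t_comb + t_setup + t_skew (Figure 2.13)
  t_prop, t_comb, t_setup, t_skew = 0.5, 2.0, 0.3, 0.2   # ns, hypothetical
  t_clock = t_prop + t_comb + t_setup + t_skew
  print(f"min clock period = {t_clock} ns "
        f"-> max clock rate = {1000 / t_clock:.0f} MHz")  # 3.0 ns -> 333 MHz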
3 Computer System Technology
Interplay between architecture, hardware, and software:
• Architectural innovations influence technology
• Technological advances drive changes in architecture

Topics in This Chapter
3.1 From Components to Applications
3.2 Computer Systems and Their Parts
3.3 Generations of Progress
3.4 Processor and Memory Technologies
3.5 Peripherals, I/O, and Communications
3.6 Software Systems and Applications

3.1 From Components to Applications
From the low-level view to the high-level view: electronic components → hardware (circuit designer, logic designer, computer designer) → computer organization and computer architecture → software (system designer, application designer) → application domains.
Figure 3.1 Subfields or views in computer system engineering.

What Is (Computer) Architecture?
An architect stands at two interfaces: between goals (client's requirements: function, cost, ...; client's taste: mood, style, ...) and means (construction technology: material, codes, ...; the world of arts: aesthetics, trends, ...), and between engineering and arts.
Figure 3.2 Like a building architect, whose place at the engineering/arts and goals/means interfaces is seen in this diagram, a computer architect reconciles many conflicting or competing demands.

3.2 Computer Systems and Their Parts
Computers divide into analog and digital; digital into fixed-function and stored-program; stored-program into electronic and nonelectronic; electronic into general-purpose and special-purpose; general-purpose into number crunchers and data manipulators.
Figure 3.3 The space of computer systems, with what we normally mean by the word "computer" highlighted.

Price/Performance Pyramid
Super ($Millions), mainframe ($100s Ks), server, workstation ($10s Ks), personal ($1000s), embedded ($100s, $10s) — differences in scale, not in substance.
Figure 3.4 Classifying computers by computational power and price range.

Automotive Embedded Computers
Impact sensors, brakes, airbags, engine, central controller, navigation & entertainment.
Figure 3.5 Embedded computers are ubiquitous, yet invisible. They are found in our automobiles, appliances, and many other places.

Personal Computers and Workstations
Figure 3.6 Notebooks, a common class of portable computers, are much smaller than desktops but offer substantially the same capabilities. What are the main reasons for the size difference?

Digital Computer Subsystems
Memory; control and datapath (together, the processor or CPU); input and output (together, I/O); and a link unit to/from the network.
Figure 3.7 The (three, four, five, or) six main units of a digital computer. Usually, the link unit (a simple bus or a more elaborate network) is not explicitly included in such diagrams.

3.3 Generations of Progress
Table 3.2 The 5 generations of digital computers, and their ancestors.
Generation (begun)   Processor technology   Memory innovations   I/O devices introduced          Dominant look & feel
0 (1600s)            (Electro-)mechanical   Wheel, card          Lever, dial, punched card       Factory equipment
1 (1950s)            Vacuum tube            Magnetic drum        Paper tape, magnetic tape       Hall-size cabinet
2 (1960s)            Transistor             Magnetic core        Drum, printer, text terminal    Room-size mainframe
3 (1970s)            SSI/MSI                RAM/ROM chip         Disk, keyboard, video monitor   Desk-size mini
4 (1980s)            LSI/VLSI               SRAM/DRAM            Network, CD, mouse, sound       Desktop/laptop micro
5 (1990s)            ULSI/GSI/WSI, SOC      SDRAM, flash         Sensor/actuator, point/click    Invisible, embedded

IC Production and Yield
A silicon crystal ingot (30-60 cm long, 15-30 cm in diameter) is sliced into blank wafers (0.2 cm thick), which go through 20-30 processing steps to become patterned wafers carrying 100s of simple or scores of complex processors. A dicer cuts each wafer into dies (~1 cm on a side); a die tester separates out the good dies, which are mounted as microchips or other parts; a part tester then selects the usable parts to ship.
Figure 3.8 The manufacturing process for an IC part.

Effect of Die Size on Yield
Example: a wafer holding 120 small dies may yield 109 good ones, while the same wafer cut into 26 larger dies yields only 15 good ones.
Figure 3.9 Visualizing the dramatic decrease in yield with larger dies.
Die yield =def (number of good dies) / (total number of dies)
Die yield = Wafer yield × [1 + (Defect density × Die area) / a]^(−a)
Die cost = (cost of wafer) / (total number of dies × die yield)
         = (cost of wafer) × (die area / wafer area) / (die yield)

3.4 Processor and Memory Technologies
Figure 3.11 Packaging of processor, memory, and other components. Panels: (a) 2D or 2.5D packaging, now common: CPU and memory dies on a PC board, with a bus, connector, and backplane; (b) 3D packaging of the future: stacked layers glued together, with interlayer connections deposited on the outside of the stack.

Moore's Law
Processor performance has grown about ×1.6 per year (×2 per 18 months, ×10 per 5 years), from kIPS through MIPS and GIPS toward TIPS (milestones: 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, R10000). DRAM chip capacity has grown about ×4 per 3 years, from kb through Mb and Gb toward Tb (64 kb, 256 kb, 1 Mb, 4 Mb, 16 Mb, 64 Mb, 256 Mb, 1 Gb).
Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore's law).

Pitfalls of Computer Technology Forecasting
"DOS addresses only 1 MB of RAM because we cannot imagine any applications needing more." Microsoft, 1980
"640K ought to be enough for anybody." Bill Gates, 1981
"Computers in the future may weigh no more than 1.5 tons." Popular Mechanics
"I think there is a world market for maybe five computers." Thomas Watson, IBM Chairman, 1943
"There is no reason anyone would want a computer in their home." Ken Olsen, DEC founder, 1977
"The 32-bit machine would be an overkill for a personal computer." Sol Libes, ByteLines

3.5 Input/Output and Communications
Figure 3.12 Magnetic and optical disk memory units. Panels: (a) cutaway view of a hard disk drive (platters typically 2-9 cm); (b) some removable storage media: floppy disk, CD-ROM, magnetic tape cartridge.
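The yield and cost formulas of Figure 3.9 are easy to experiment with. The sketch below uses hypothetical wafer-cost and defect-density numbers, chosen only to show how die cost grows superlinearly with die area.

  def die_yield(wafer_yield, defect_density, die_area, a=3.0):
      # Die yield = Wafer yield x [1 + (Defect density x Die area)/a]**(-a)
      return wafer_yield * (1 + defect_density * die_area / a) ** (-a)

  def die_cost(wafer_cost, dies_per_wafer, dyield):
      # Die cost = (cost of wafer) / (total number of dies x die yield)
      return wafer_cost / (dies_per_wafer * dyield)

  for dies, area in [(120, 1.0), (26, 5.0)]:   # small vs large dies (cm^2)
      y = die_yield(1.0, 0.8, area)
      print(f"{dies} dies/wafer: yield = {y:.2f}, "
            f"cost = ${die_cost(1000, dies, y):.2f}")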
Communication Technologies
Communication links span bandwidths from about 10^3 to 10^12 b/s and latencies from about 10^−9 s (ns) to 10^3 s (min). Same geographic location: processor bus, system-area network (SAN), local-area network (LAN). Geographically distributed: metro-area network (MAN), wide-area network (WAN).
Figure 3.13 Latency and bandwidth characteristics of different classes of communication links.

3.6 Software Systems and Applications
Software divides into application software (word processor, spreadsheet, circuit simulator, ...) and system software, i.e., the operating system: translators (MIPS assembler, C compiler, ...), managers (virtual memory, security, file system, ...), enablers (disk driver, display driver, printing, ...), and coordinators (scheduling, load balancing, diagnostics, ...).
Figure 3.15 Categorization of software, with examples in each class.

High- vs Low-Level Programming
Very high-level language objectives or tasks (more abstract, machine-independent; easier to write, read, debug, or maintain) are translated step by step into machine-specific forms (more concrete, error-prone; harder to write, read, debug, or maintain):
One task = many statements: the task "Swap v[i] and v[i+1]" becomes, via compiler or interpreter, the high-level statements temp=v[i]; v[i]=v[i+1]; v[i+1]=temp
One statement = several instructions: the corresponding assembly language instructions (mnemonic):
  add $2,$5,$5
  add $2,$2,$2
  add $2,$4,$2
  lw  $15,0($2)
  lw  $16,4($2)
  sw  $16,0($2)
  sw  $15,4($2)
  jr  $31
Mostly one-to-one: the assembler turns these into machine language instructions, binary (hex): 00a51020, 00421020, 00821020, 8c620000, 8cf20004, acf20000, ac620004, 03e00008
Figure 3.14 Models and abstractions in programming.

4 Computer Performance
Performance is key in design decisions; so are cost and power.
• It has been a driving force for innovation
• Isn't quite the same as speed (higher clock rate)

Topics in This Chapter
4.1 Cost, Performance, and Cost/Performance
4.2 Defining Computer Performance
4.3 Performance Enhancement and Amdahl's Law
4.4 Performance Measurement vs Modeling
4.5 Reporting Computer Performance
4.6 The Quest for Higher Performance

4.1 Cost, Performance, and Cost/Performance
(Chart: computer cost, from $1 up to $1G on a log scale, vs. calendar year 1960-2020.)
Figure 4.1 Performance improvement as a function of cost: superlinear (economy of scale), linear (ideal?), and sublinear (diminishing returns).

4.2 Defining Computer Performance
Figure 4.2 Pipeline analogy shows that imbalance between processing power and I/O capabilities leads to a performance bottleneck (CPU-bound task vs. I/O-bound task: input → processing → output).

Six Passenger Aircraft to Be Compared
Airbus A310, Boeing 747, Boeing 767, Boeing 777, Concorde, DC-8-50.

Performance of Aircraft: An Analogy
Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft or are averages of cited range of values.
Aircraft      Passengers   Range (km)   Speed (km/h)   Price ($M)
Airbus A310   250          8,300        895            120
Boeing 747    470          6,700       980            200
Boeing 767    250          12,300       885            120
Boeing 777    375          7,450        980            180
Concorde      130          6,400        2,200          350
DC-8-50       145          14,000       875            80
Speed of sound ≈ 1220 km/h

Different Views of Performance
Performance from the viewpoint of a passenger: Speed
Note, however, that flight time is but one part of total travel time. Also, if the travel distance exceeds the range of a faster plane, a slower plane may be better due to not needing a refueling stop.
Performance from the viewpoint of an airline: Throughput
Measured in passenger-km per hour (relevant if ticket price were proportional to distance traveled, which in reality it is not):
Airbus A310: 250 × 895 = 0.224 M passenger-km/hr
Boeing 747: 470 × 980 = 0.461 M passenger-km/hr
Boeing 767: 250 × 885 = 0.221 M passenger-km/hr
Boeing 777: 375 × 980 = 0.368 M passenger-km/hr
Concorde: 130 × 2200 = 0.286 M passenger-km/hr
DC-8-50: 145 × 875 = 0.127 M passenger-km/hr
Performance from the viewpoint of FAA: Safety

Cost Effectiveness: Cost/Performance
Table 4.1 extended with throughput (larger values better) and cost/performance = price/throughput (smaller values better):

Aircraft    Throughput (M P km/hr)   Cost / Performance
A310        0.224                    536
B 747       0.461                    434
B 767       0.221                    543
B 777       0.368                    489
Concorde    0.286                    1224
DC-8-50     0.127                    630

Concepts of Performance and Speedup
Performance = 1 / Execution time, simplified to Performance = 1 / CPU execution time
(Performance of M1) / (Performance of M2) = Speedup of M1 over M2 = (Execution time of M2) / (Execution time of M1)
Terminology: M1 is x times as fast as M2 (e.g., 1.5 times as fast); M1 is 100(x − 1)% faster than M2 (e.g., 50% faster)
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle) = Instructions × CPI / (Clock rate)
Instruction count, CPI, and clock rate are not completely independent, so improving one by a given factor may not lead to overall execution time improvement by the same factor.

Elaboration on the CPU Time Formula
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle) = Instructions × Average CPI / (Clock rate)
Instructions: number of instructions executed, not number of instructions in our program (dynamic count)
Average CPI: calculated based on the dynamic instruction mix and knowledge of how many clock cycles are needed to execute various instructions (or instruction classes)
Clock rate: 1 GHz = 10^9 cycles/s (clock period 10^−9 s = 1 ns); 200 MHz = 200 × 10^6 cycles/s (clock period = 5 ns)
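The CPU time formula turns into a one-liner; this sketch (machine parameters hypothetical) also exercises the "times as fast" / "percent faster" terminology defined above.

  def cpu_time(instructions, cpi, clock_rate_hz):
      # CPU time = Instructions x CPI / (Clock rate)
      return instructions * cpi / clock_rate_hz

  t1 = cpu_time(500e6, 2.0, 2e9)   # machine M1: 0.5 s
  t2 = cpu_time(500e6, 3.0, 2e9)   # machine M2: 0.75 s
  x = t2 / t1                      # speedup of M1 over M2
  print(f"M1 is {x:.2f} times as fast, i.e., {100 * (x - 1):.0f}% faster")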
Dynamic Instruction Count
How many instructions are executed in this program fragment?

  250 instructions
  for i = 1, 100 do
    20 instructions
    for j = 1, 100 do
      40 instructions
      for k = 1, 100 do
        10 instructions
      endfor
    endfor
  endfor

Each "for" consists of two instructions: increment index, check exit condition.
k loop: 2 + 10 = 12 instructions per iteration; 100 iterations; 1,200 instructions in all
j loop: 2 + 40 + 1,200 = 1,242 instructions per iteration; 100 iterations; 124,200 instructions in all
i loop: 2 + 20 + 124,200 = 124,222 instructions per iteration; 100 iterations; 12,422,200 instructions in all
Total dynamic count: 250 + 12,422,200 = 12,422,450 instructions; static count = 326
(For loops of the forms "for i = 1, n" and "while x > 0", the dynamic count is not a compile-time constant.)

Faster Clock ≠ Shorter Running Time
Suppose addition takes 1 ns. With a 1 GHz clock (period = 1 ns), an addition takes 1 cycle; with a 2 GHz clock (period = ½ ns), it takes 2 cycles. In this example, addition time does not improve in going from a 1 GHz to a 2 GHz clock, just as a route of 4 long steps can match a route of 20 short ones.
Figure 4.3 Faster steps do not necessarily mean shorter travel time.

4.3 Performance Enhancement: Amdahl's Law
If a fraction f of a task is unaffected and the remaining 1 − f part runs p times as fast, the speedup is
  s = 1 / [f + (1 − f)/p] ≤ min(p, 1/f)
Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 − f part runs p times as fast (speedup s plotted against enhancement factor p for f = 0, 0.01, 0.02, 0.05, 0.1).

Amdahl's Law Used in Design (Example 4.1)
A processor spends 30% of its time on flp addition, 25% on flp mult, and 10% on flp division. Evaluate the following enhancements, each costing the same to implement:
a. Redesign of the flp adder to make it twice as fast.
b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign of the flp divider to make it 10 times as fast.
Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3/2] = 1.18
b. Multiplier redesign speedup = 1 / [0.75 + 0.25/3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1/10] = 1.10
What if both the adder and the multiplier are redesigned?

Amdahl's Law Used in Management (Example 4.2)
Members of a university research group frequently visit the library. Each library trip takes 20 minutes. The group decides to subscribe to a handful of publications that account for 90% of the library trips; access time to these publications is reduced to 2 minutes.
a. What is the average speedup in access to publications?
b. If the group has 20 members, each making two weekly trips to the library, what is the justifiable expense for the subscriptions? Assume 50 working weeks/yr and $25/h for a researcher's time.
Solution
a. Speedup in publication access time = 1 / [0.1 + 0.9/10] = 5.26
b. Time saved = 20 × 2 × 50 × 0.9 × (20 − 2) = 32,400 min = 540 h
   Cost recovery = 540 × $25 = $13,500 = max justifiable expense

4.4 Performance Measurement vs Modeling
Figure 4.5 Running times of six programs (A-F) on three machines.

Generalized Amdahl's Law
Original running time of a program = 1 = f1 + f2 + ... + fk
New running time after the fraction fi is sped up by a factor pi:
  f1/p1 + f2/p2 + ... + fk/pk
Speedup formula:
  S = 1 / (f1/p1 + f2/p2 + ... + fk/pk)
If a particular fraction is slowed down rather than sped up, use sj × fj instead of fj/pj, where sj > 1 is the slowdown factor.
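Both forms of Amdahl's law reduce to a couple of lines of code; the sketch below reproduces the speedups of Example 4.1 and answers the combined adder-plus-multiplier question.

  def amdahl(f, p):
      # Speedup when a fraction f is unaffected and the rest runs p times as fast
      return 1 / (f + (1 - f) / p)

  def amdahl_general(fractions, factors):
      # S = 1 / (f1/p1 + f2/p2 + ... + fk/pk), with the fi summing to 1
      return 1 / sum(f / p for f, p in zip(fractions, factors))

  print(round(amdahl(0.70, 2), 2))    # adder redesign: 1.18
  print(round(amdahl(0.75, 3), 2))    # multiplier redesign: 1.2
  print(round(amdahl(0.90, 10), 2))   # divider redesign: 1.1
  # Both adder and multiplier redesigned (45% of time unaffected):
  print(round(amdahl_general([0.45, 0.30, 0.25], [1, 2, 3]), 2))   # 1.46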
Performance Benchmarks (Example 4.3)
You are an engineer at Outtel, a start-up aspiring to compete with Intel via its new processor design that outperforms the latest Intel processor by a factor of 2.5 on floating-point instructions. This level of performance was achieved by design compromises that led to a 20% increase in the execution time of all other instructions. You are in charge of choosing benchmarks that would showcase Outtel's performance edge.
a. What is the minimum required fraction f of time spent on floating-point instructions in a program on the Intel processor to show a speedup of 2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl's formula in which a fraction f is sped up by a given factor (2.5) and the rest is slowed down by another factor (1.2):
   1 / [1.2(1 − f) + f/2.5] ≥ 2  ⇒  f ≥ 0.875

Performance Estimation
Average CPI = ∑ over all instruction classes of (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)

Table 4.3 Usage frequency, in percentage, for various instruction classes in four representative applications.
Instruction class   Data compression   C language compiler   Reactor simulation   Atomic motion modeling
A: Load/Store       25                 37                    32                   37
B: Integer          32                 28                    17                   5
C: Shift/Logic      16                 13                    2                    1
D: Float            0                  0                     34                   42
E: Branch           19                 13                    9                    10
F: All others       8                  9                     6                    4

CPI and IPS Calculations (Example 4.4, 2 of 5 parts)
Consider two implementations M1 (600 MHz) and M2 (500 MHz) of an instruction set containing three classes of instructions:
Class   CPI for M1   CPI for M2   Comments
F       5.0          4.0          Floating-point
I       2.0          3.8          Integer arithmetic
N       2.4          2.0          Nonarithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided equally among F and I, which machine is faster? By what factor?
Solution
a. Peak MIPS for M1 = 600/2.0 = 300; for M2 = 500/2.0 = 250
b. Average CPI for M1 = 5.0/4 + 2.0/4 + 2.4/2 = 2.95; for M2 = 4.0/4 + 3.8/4 + 2.0/2 = 2.95. With equal CPIs, M1 is faster by its clock-rate advantage, a factor of 600/500 = 1.2.

MIPS Rating Can Be Misleading (Example 4.5)
Two compilers produce machine code for a program on a machine with two classes of instructions. Here are the numbers of instructions:
Class   CPI   Compiler 1   Compiler 2
A       1     600M         400M
B       2     400M         400M
a. What are the run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler's output runs at a higher MIPS rate?
Solution
a. Running time for compiler 1 = (600M × 1 + 400M × 2) / 10^9 = 1.4 s; for compiler 2 = (400M × 1 + 400M × 2) / 10^9 = 1.2 s
b. Compiler 2's output runs 1.4 / 1.2 = 1.17 times as fast
c. MIPS rating for compiler 1 (CPI = 1.4) = 1000/1.4 = 714; for compiler 2 (CPI = 1.5) = 1000/1.5 = 667, so the slower code has the higher MIPS rating
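A sketch of the performance-estimation recipe, using the numbers of Example 4.4: effective CPI is a frequency-weighted sum, and MIPS = clock rate (MHz) / CPI.

  def effective_cpi(mix, cpi):
      # Average CPI = sum of (class-i fraction) x (class-i CPI)
      return sum(f * c for f, c in zip(mix, cpi))

  mix = [0.25, 0.25, 0.50]                    # classes F, I, N
  m1 = effective_cpi(mix, [5.0, 2.0, 2.4])    # 2.95
  m2 = effective_cpi(mix, [4.0, 3.8, 2.0])    # 2.95
  mips1, mips2 = 600 / m1, 500 / m2
  print(f"M1: CPI {m1:.2f}, {mips1:.0f} MIPS; M2: CPI {m2:.2f}, {mips2:.0f} MIPS")
  print(f"M1 is faster by a factor of {mips1 / mips2:.1f}")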
4.5 Reporting Computer Performance
Table 4.4 Measured or estimated execution times for three programs.

Program        Time on machine X   Time on machine Y   Speedup of Y over X
Program A      20                  200                 0.1
Program B      1000                100                 10.0
Program C      1500                150                 10.0
All 3 prog's   2520                450                 5.6

Analogy: If a car is driven to a city 100 km away at 100 km/hr and returns at 50 km/hr, the average speed is not (100 + 50)/2 but is obtained from the fact that it travels 200 km in 3 hours.

Comparing the Overall Performance

Program           Speedup of Y over X   Speedup of X over Y
Program A         0.1                   10
Program B         10.0                  0.1
Program C         10.0                  0.1
Arithmetic mean   6.7                   3.4
Geometric mean    2.15                  0.46

Geometric mean does not yield a measure of overall speedup, but provides an indicator that at least moves in the right direction.

Effect of Instruction Mix on Performance (Example 4.6, 1 of 3 parts)
Consider two applications DC and RS and two machines M1 and M2:

Class         Data Comp.   Reactor Sim.   M1's CPI   M2's CPI
A: Ld/Str     25%          32%            4.0        3.8
B: Integer    32%          17%            1.5        2.5
C: Sh/Logic   16%          2%             1.2        1.2
D: Float      0%           34%            6.0        2.6
E: Branch     19%          9%             2.5        2.2
F: Other      8%           6%             2.0        2.3

a. Find the effective CPI for the two applications on both machines.
Solution
a. CPI of DC on M1: 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 + 0.19 × 2.5 + 0.08 × 2.0 = 2.31
   DC on M2: 2.54; RS on M1: 3.94; RS on M2: 2.89

4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s: gigaflops on the desktop, teraflops in the supercomputer center, petaflops on the drawing board.
Note on terminology (see Table 3.1):
Prefixes for large units: Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15
For memory: K = 2^10 = 1024, M = 2^20, G = 2^30, T = 2^40, P = 2^50
Prefixes for small units: micro = 10^−6, nano = 10^−9, pico = 10^−12, femto = 10^−15

Performance Trends and Obsolescence
Figure 3.10 (repeated) Trends in processor performance and DRAM memory chip capacity (Moore's law).
"Can I call you back? We just bought a new computer and we're trying to set it up before it's obsolete."

Supercomputer Performance
Growth from MFLOPS (Cray X-MP) through GFLOPS (Y-MP and other vector supercomputers; CM-2) and TFLOPS (CM-5 and other massively parallel processors, the $30M and $240M MPPs) toward PFLOPS.
Figure 4.7 Exponential growth of supercomputer performance.

The Most Powerful Computers
Milestones (plan, develop, use): ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), ASCI Purple (100+ TFLOPS, 20 TB).
Figure 4.8 Milestones in the DOE's Accelerated Strategic Computing Initiative (ASCI) program with extrapolation up to the PFLOPS level.
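The means in the extended Table 4.4 can be recomputed as below; note that only the ratio of total times, not an averaged per-program speedup, answers "how much faster overall."

  from math import prod

  speedups_y_over_x = [0.1, 10.0, 10.0]        # programs A, B, C
  n = len(speedups_y_over_x)
  arith = sum(speedups_y_over_x) / n           # 6.7
  geo = prod(speedups_y_over_x) ** (1 / n)     # 2.15
  overall = 2520 / 450                         # total-time speedup: 5.6
  print(f"arithmetic mean {arith:.1f}, geometric mean {geo:.2f}, "
        f"overall speedup {overall:.1f}")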
Performance is Important, But It Isn't Everything
Absolute processor performance (kIPS toward TIPS) keeps rising, but performance per watt matters as well; DSPs deliver better performance per watt than general-purpose (GP) processors.
Figure 25.1 Trend in computational performance per watt of power used in general-purpose processors and DSPs.

Roadmap for the Rest of the Book
Fasten your seatbelts as we begin our ride!
Ch. 5-8: A simple ISA, variations in ISA
Ch. 9-12: ALU design
Ch. 13-14: Data path and control unit design
Ch. 15-16: Pipelining and its limits
Ch. 17-20: Memory (main, mass, cache, virtual)
Ch. 21-24: I/O, buses, interrupts, interfacing
Ch. 25-28: Vector and parallel processing

Part II Instruction-Set Architecture

A Few Words About Where We Are Headed
Performance = 1 / Execution time, simplified to 1 / CPU execution time
CPU execution time = Instructions × CPI / (Clock rate)
Performance = Clock rate / (Instructions × CPI)
• Define an instruction set; make it simple enough to require a small number of cycles and allow high clock rate, but not so simple that we need many instructions, even for very simple tasks (Chap 5-8)
• Design ALU for arithmetic & logic ops (Chap 9-12)
• Design hardware for CPI = 1; seek improvements with CPI > 1 (Chap 13-14)
• Try to achieve CPI = 1 with a clock that is as high as that for CPI > 1 designs; is CPI < 1 feasible? (Chap 15-16)
• Design memory & I/O structures to support ultrahigh-speed CPUs

Strategies for Speeding Up Instruction Execution
Assembly line analogy: items that take longest to inspect dictate the speed of the assembly line. Single-cycle (CPI = 1) → faster → multicycle (CPI > 1) → faster → parallel processing or pipelining.

II Instruction Set Architecture
Introduce the machine's "words" and its "vocabulary," learning:
• A simple, yet realistic and useful instruction set
• Machine language programs; how they are executed
• RISC vs CISC instruction-set design philosophy

Topics in This Part
Chapter 5 Instructions and Addressing
Chapter 6 Procedures and Data
Chapter 7 Assembly Language Programs
Chapter 8 Instruction Set Variations
5 Instructions and Addressing
First of two chapters on the instruction set of MiniMIPS:
• Required for hardware concepts in later chapters
• Not aiming for proficiency in assembler programming

Topics in This Chapter
5.1 Abstract View of Hardware
5.2 Instruction Formats
5.3 Simple Arithmetic / Logic Instructions
5.4 Load and Store Instructions
5.5 Jump and Branch Instructions
5.6 Addressing Modes

5.1 Abstract View of Hardware
Memory: up to 2^30 words at locations 0, 4, 8, ..., m − 8, m − 4, with m ≤ 2^32 (4 B/location).
EIU (main processor): execution & integer unit with registers $0-$31, an ALU, and an integer mul/div unit with registers Hi and Lo (Chapters 10-11).
FPU (coprocessor 1): floating-point unit with registers $0-$31 and FP arithmetic (Chapter 12).
TMU (coprocessor 0): trap & memory unit with registers BadVaddr, Status, Cause, EPC.
Figure 5.1 Memory and processing subsystems for MiniMIPS.

Data Types
Byte = 8 bits; halfword = 2 bytes; word = 4 bytes; doubleword = 8 bytes (used only for floating-point data, so safe to ignore in this course). Quadword (16 bytes) also used occasionally. MiniMIPS registers hold 32-bit (4-byte) words; other common data sizes include byte, halfword, and doubleword.

Register Conventions (Figure 5.2)
$0        $zero     Always zero
$1        $at       Reserved for assembler use
$2-$3     $v0-$v1   Procedure results
$4-$7     $a0-$a3   Procedure arguments
$8-$15    $t0-$t7   Temporary values
$16-$23   $s0-$s7   Operands, saved across procedure calls
$24-$25   $t8-$t9   More temporaries
$26-$27   $k0-$k1   Reserved for OS (kernel)
$28       $gp       Global pointer
$29       $sp       Stack pointer
$30       $fp       Frame pointer
$31       $ra       Return address, saved

Byte numbering within a word is 3 2 1 0: a 4-byte word sits in consecutive memory addresses according to the big-endian order (most significant byte has the lowest address); when loading a byte into a register, it goes in the low end. A doubleword sits in consecutive registers or memory locations according to the big-endian order (most significant word comes first).
Figure 5.2 Registers and data sizes in MiniMIPS.

Registers Used in This Chapter
Ten temporary registers ($t0-$t9) and eight operand registers ($s0-$s7). Analogy for register usage conventions: temporary values are like loose change; operands saved across procedure calls are like one's wallet and keys.

5.2 Instruction Formats
High-level language statement: a = b + c
Assembly language instruction: add $t8,$s2,$s1
Machine language instruction: 000000 10010 10001 11000 00000 100000
(ALU-type opcode; register 18; register 17; register 24; unused; addition)
Execution steps: instruction fetch (instruction cache, PC); register readout ($17, $18 from the register file); ALU operation; data read/store (data cache, not used here); register writeback ($24).
Figure 5.3 A typical instruction for MiniMIPS and steps in its execution.
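The encoding in Figure 5.3 can be checked mechanically. This sketch (ours) packs the six R-format fields of add $t8,$s2,$s1 into a 32-bit word and prints the result.

  def encode_r(op, rs, rt, rd, sh, fn):
      # R format: op(6) rs(5) rt(5) rd(5) sh(5) fn(6), from bit 31 down to 0
      return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sh << 6) | fn

  # add $t8,$s2,$s1: op = 0, rs = $s2 = 18, rt = $s1 = 17, rd = $t8 = 24, fn = 32
  word = encode_r(0, 18, 17, 24, 0, 32)
  print(f"0x{word:08x}")   # 0x02114020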
Add, Subtract, and Specification of Constants
MiniMIPS add & subtract instructions; e.g., compute g = (b + c) − (e + f):
  add $t8,$s2,$s3   # put the sum b + c in $t8
  add $t9,$s5,$s6   # put the sum e + f in $t9
  sub $s7,$t8,$t9   # set g to ($t8) − ($t9)
Decimal constants: 25, 123456, −2873
Hexadecimal constants: 0x59, 0x12b4c6, 0xffff0000
A machine instruction typically contains an opcode, one or more source operands, and possibly a destination operand.

MiniMIPS Instruction Formats
R format: op (6 bits, opcode) | rs (5 bits, source register 1) | rt (5 bits, source register 2) | rd (5 bits, destination register) | sh (5 bits, shift amount) | fn (6 bits, opcode extension)
I format: op (6 bits, opcode) | rs (5 bits, source or base) | rt (5 bits, destination or data) | operand/offset (16 bits, immediate operand or address offset)
J format: op (6 bits, opcode) | jump target address (26 bits, memory word address = byte address divided by 4)
Figure 5.4 MiniMIPS instructions come in only three formats: register (R), immediate (I), and jump (J).

5.3 Simple Arithmetic/Logic Instructions
Add and subtract already discussed; logical instructions are similar:
  add $t0,$s0,$s1   # set $t0 to ($s0)+($s1)
  sub $t0,$s0,$s1   # set $t0 to ($s0)-($s1)
  and $t0,$s0,$s1   # set $t0 to ($s0)∧($s1)
  or  $t0,$s0,$s1   # set $t0 to ($s0)∨($s1)
  xor $t0,$s0,$s1   # set $t0 to ($s0)⊕($s1)
  nor $t0,$s0,$s1   # set $t0 to (($s0)∨($s1))′
Figure 5.5 The arithmetic instructions add and sub have a format that is common to all two-operand ALU instructions. For these, the fn field specifies the arithmetic/logic operation to be performed (add = 32, sub = 34).

Arithmetic/Logic with One Immediate Operand
An operand in the range [−32 768, 32 767], or [0x0000, 0xffff], can be specified in the immediate field.
  addi $t0,$s0,61       # set $t0 to ($s0)+61
  andi $t0,$s0,61       # set $t0 to ($s0)∧61
  ori  $t0,$s0,61       # set $t0 to ($s0)∨61
  xori $t0,$s0,0x00ff   # set $t0 to ($s0)⊕0x00ff
For arithmetic instructions, the immediate operand is sign-extended.
Figure 5.6 Instructions such as addi (addi = 8; source rs, destination rt, 16-bit immediate) allow us to perform an arithmetic or logic operation for which one operand is a small constant.

5.4 Load and Store Instructions
  lw $t0,40($s3)   # load mem[40+($s3)] into $t0
  lw $t0,A($s3)    # load mem[A+($s3)] into $t0
With the base address of array A in a register and offset 4i relative to the base, we reach element i of the array (lw = 35, sw = 43).
Note on base and offset: The memory address is the sum of (rs) and an immediate value. Calling one of these the base and the other the offset is quite arbitrary. It would make perfect sense to interpret the address A($s3) as having the base A and the offset ($s3). However, a 16-bit base confines us to a small portion of memory space.
Figure 5.7 MiniMIPS lw and sw instructions and their memory addressing convention that allows for simple access to array elements via a base address and an offset (offset = 4i leads us to the ith word).

lw, sw, and lui Instructions
  lw  $t0,40($s3)   # load mem[40+($s3)] in $t0
  sw  $t0,A($s3)    # store ($t0) in mem[A+($s3)]
                    # "($s3)" means "content of $s3"
  lui $s0,61        # the immediate value 61 is loaded in the upper half
                    # of $s0, with the lower 16 bits set to 0s
Figure 5.8 The lui instruction (lui = 15) allows us to load an arbitrary 16-bit value into the upper half of a register while setting its lower half to 0s.

Initializing a Register (Example 5.2)
Show how each of these bit patterns can be loaded into $s0:
  0010 0001 0001 0000 0000 0000 0011 1101
  1111 1111 1111 1111 1111 1111 1111 1111
Solution
The first bit pattern has the hex representation 0x2110003d:
  lui $s0,0x2110   # put the upper half in $s0
  ori $s0,0x003d   # put the lower half in $s0
The same can be done, with immediate values changed to 0xffff, for the second bit pattern. But the following is simpler and faster:
  nor $s0,$zero,$zero   # because (0 ∨ 0)′ = 1

5.5 Jump and Branch Instructions
Unconditional jump and jump through register instructions:
  j  verify   # go to mem loc named "verify"
  jr $ra      # go to address that is in $ra;
              # $ra may hold a return address
($ra is the symbolic name for reg. $31, the return address.)
Figure 5.9 The jump instruction j of MiniMIPS is a J-type instruction (j = 2), shown along with how its effective 32-bit target address is obtained: the upper bits of the incremented PC followed by the 26-bit jump target, a word address multiplied by 4. The jump register (jr) instruction is R-type (fn = 8), with its specified register often being $ra.

Conditional Branch Instructions
Conditional branches use PC-relative addressing:
  bltz $s1,L       # branch on ($s1)<0
  beq  $s1,$s2,L   # branch on ($s1)=($s2)
  bne  $s1,$s2,L   # branch on ($s1)≠($s2)
The 16-bit operand/offset field holds the relative branch distance in words (bltz = 1, beq = 4, bne = 5).
Figure 5.10 (part 1) Conditional branch instructions of MiniMIPS.
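To make the address arithmetic concrete, here is a sketch (ours; PC and field values are made up) of how a 16-bit immediate is sign-extended and how branch and jump targets are formed from the incremented PC, per Figures 5.9 and 5.10.

  def sign_extend16(imm):
      # Interpret a 16-bit field as a signed value in [-32768, 32767]
      return imm - (1 << 16) if imm & 0x8000 else imm

  def branch_target(pc, offset16):
      # PC-relative: target = (PC + 4) + 4 x signed word offset
      return (pc + 4) + 4 * sign_extend16(offset16)

  def jump_target(pc, target26):
      # Pseudodirect: upper bits of PC + 4, then 26-bit word address x 4
      return ((pc + 4) & 0xF0000000) | (target26 << 2)

  pc = 0x00400100                          # hypothetical PC
  print(hex(branch_target(pc, 0xFFFC)))    # offset -4 words: 0x4000f4
  print(hex(jump_target(pc, 0x0100040)))   # word addr 0x0100040: 0x400100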
Comparison Instructions for Conditional Branching
  slt  $s1,$s2,$s3   # if ($s2)<($s3), set $s1 to 1, else set $s1 to 0;
                     # often followed by beq/bne
  slti $s1,$s2,61    # if ($s2)<61, set $s1 to 1, else set $s1 to 0
Figure 5.10 (part 2) Comparison instructions of MiniMIPS (slt is R-type with fn = 42; slti = 10).

Examples for Conditional Branching
If the branch target is too far to be reachable with a 16-bit offset (a rare occurrence), the assembler automatically replaces the branch instruction beq $s0,$s1,L1 with:
      bne $s1,$s2,L2   # skip jump if (s1)≠(s2)
      j   L1           # goto L1 if (s1)=(s2)
  L2: ...
Forming if-then constructs; e.g., if (i == j) x = x + y:
      bne $s1,$s2,endif   # branch on i≠j
      add $t1,$t1,$t2     # execute the "then" part
  endif: ...
If the condition were (i < j), we would change the first line to:
      slt $t0,$s1,$s2     # set $t0 to 1 if i<j
      beq $t0,$0,endif    # branch if ($t0)=0; i.e., i not< j or i≥j

Compiling if-then-else Statements (Example 5.3)
Show a sequence of MiniMIPS instructions corresponding to:
  if (i<=j) x = x+1; z = 1; else y = y–1; z = 2*z
Solution
Similar to the "if-then" statement, but we need instructions for the "else" part and a way of skipping the "else" part after the "then" part.
        slt  $t0,$s2,$s1      # j<i? (inverse condition)
        bne  $t0,$zero,else   # if j<i goto else part
        addi $t1,$t1,1        # begin then part: x = x+1
        addi $t3,$zero,1      # z = 1
        j    endif            # skip the else part
  else: addi $t2,$t2,-1       # begin else part: y = y–1
        add  $t3,$t3,$t3      # z = z+z
  endif: ...

5.6 Addressing Modes
Six modes, shown schematically in Figure 5.11: implied (the operand comes from some place in the machine), immediate (operand in the instruction, extended if required), register (register-file access), base (register data plus constant offset forms a memory address), PC-relative (incremented PC plus constant offset forms a memory address), and pseudodirect (PC bits concatenated with the instruction's address field form a memory address).
Figure 5.11 Schematic representation of addressing modes in MiniMIPS.

Finding the Maximum Value in a List of Integers (Example 5.5)
List A is stored in memory beginning at the address given in $s1. List length is given in $s2. Find the largest integer in the list and copy it into $t0.
Solution
Scan the list, holding the largest element identified thus far in $t0.
        lw   $t0,0($s1)      # initialize maximum to A[0]
        addi $t1,$zero,0     # initialize index i to 0
  loop: addi $t1,$t1,1       # increment index i by 1
        beq  $t1,$s2,done    # if all elements examined, quit
        add  $t2,$t1,$t1     # compute 2i in $t2
        add  $t2,$t2,$t2     # compute 4i in $t2
        add  $t2,$t2,$s1     # form address of A[i] in $t2
        lw   $t3,0($t2)      # load value of A[i] into $t3
        slt  $t4,$t0,$t3     # maximum < A[i]?
        beq  $t4,$zero,loop  # if not, repeat with no change
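The assembly loop of Example 5.5 maps onto this high-level equivalent (an illustration; the variable names mirror the registers), which makes the role of each register clear.

  def find_max(A):                 # A's base is $s1; len(A) plays $s2's role
      maximum = A[0]               # $t0: initialize maximum to A[0]
      i = 0                        # $t1: initialize index i to 0
      while True:
          i += 1                   # increment index i by 1
          if i == len(A):          # if all elements examined, quit
              return maximum
          if maximum < A[i]:       # the slt/beq pair: maximum < A[i]?
              maximum = A[i]       # if so, A[i] is the new maximum

  assert find_max([3, 9, 2, 7]) == 9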
        addi $t0,$t3,0       # if so, A[i] is the new maximum
        j    loop            # change completed; now repeat
  done: ...                  # continuation of the program

The 20 MiniMIPS Instructions Covered So Far
Table 5.1 (the three instruction formats R, I, J are as in Figure 5.4; R-type instructions have op = 0 and are distinguished by their fn field)

Copy:
  lui  rt,imm        Load upper immediate      op 15
Arithmetic:
  add  rd,rs,rt      Add                       op 0, fn 32
  sub  rd,rs,rt      Subtract                  op 0, fn 34
  slt  rd,rs,rt      Set less than             op 0, fn 42
  addi rt,rs,imm     Add immediate             op 8
  slti rt,rs,imm     Set less than immediate   op 10
Logic:
  and  rd,rs,rt      AND                       op 0, fn 36
  or   rd,rs,rt      OR                        op 0, fn 37
  xor  rd,rs,rt      XOR                       op 0, fn 38
  nor  rd,rs,rt      NOR                       op 0, fn 39
  andi rt,rs,imm     AND immediate             op 12
  ori  rt,rs,imm     OR immediate              op 13
  xori rt,rs,imm     XOR immediate             op 14
Memory access:
  lw   rt,imm(rs)    Load word                 op 35
  sw   rt,imm(rs)    Store word                op 43
Control transfer:
  j    L             Jump                      op 2
  jr   rs            Jump register             op 0, fn 8
  bltz rs,L          Branch less than 0        op 1
  beq  rs,rt,L       Branch equal              op 4
  bne  rs,rt,L       Branch not equal          op 5

6 Procedures and Data
Finish our study of MiniMIPS instructions and its data types:
• Instructions for procedure call/return, misc. instructions
• Procedure parameters and results, utility of stack

Topics in This Chapter
6.1 Simple Procedure Calls
6.2 Using the Stack for Data Storage
6.3 Parameters and Results
6.4 Data Types
6.5 Arrays and Pointers
6.6 Additional Instructions

6.1 Simple Procedure Calls
Using a procedure involves the following sequence of actions:
1. Put arguments in places known to procedure (reg's $a0-$a3)
2. Transfer control to procedure, saving the return address (jal)
3. Acquire storage space, if required, for use by the procedure
4. Perform the desired task
5. Put results in places known to calling program (reg's $v0-$v1)
6. Return control to calling point (jr)
MiniMIPS instructions for procedure call and return from procedure:
  jal proc   # jump to loc "proc" and link; "link" means "save
             # the return address" (PC)+4 in $ra ($31)
  jr  rs     # go to loc addressed by rs

Illustrating a Procedure Call
The main program prepares to call; jal proc saves (PC)+4 in $ra and transfers control to the procedure, which saves registers, does its work, restores registers, and returns with jr $ra; the main program then prepares to continue.
Figure 6.1 Relationship between the main program and a procedure.
Recalling Register Conventions
$0 ($zero): always 0. $1 ($at): reserved for assembler use. $2-$3 ($v0-$v1): procedure results. $4-$7 ($a0-$a3): procedure arguments. $8-$15 ($t0-$t7): temporary values. $16-$23 ($s0-$s7): operands saved across procedure calls. $24-$25 ($t8-$t9): more temporaries. $26-$27 ($k0-$k1): reserved for OS (kernel). $28 ($gp): global pointer. $29 ($sp): stack pointer. $30 ($fp): frame pointer. $31 ($ra): return address.
Operand sizes: byte, word, doubleword. A 4-byte word sits in consecutive memory addresses according to the big-endian order (most significant byte has the lowest address); bytes within a word are numbered 3, 2, 1, 0. When loading a byte into a register, it goes in the low end. A doubleword sits in consecutive registers or memory locations according to the big-endian order (most significant word comes first).
Figure 5.2 Registers and data sizes in MiniMIPS.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 30

A Simple MiniMIPS Procedure
Example 6.1
Procedure to find the absolute value of an integer: $v0 ← |($a0)|
Solution
The absolute value of x is -x if x < 0 and x otherwise.
abs:  sub  $v0,$zero,$a0   # put -($a0) in $v0; in case ($a0) < 0
      bltz $a0,done        # if ($a0) < 0 then done
      add  $v0,$a0,$zero   # else put ($a0) in $v0
done: jr   $ra             # return to calling program
In practice, we seldom use such short procedures because of the overhead that they entail. In this example, we have 3-4 instructions of overhead for 3 instructions of useful computation.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 31

Nested Procedure Calls
main prepares and calls procedure abc (jal abc); abc saves registers, calls procedure xyz (jal xyz), restores, and returns (jr $ra); xyz returns to abc with its own jr $ra. (The slide notes that the corresponding figure in the text version is incorrect.)
Figure 6.2 Example of nested procedure calls.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 32

6.2 Using the Stack for Data Storage
Analogy: Cafeteria stack of plates/trays
Push c: sp = sp - 4; mem[sp] = c.   Pop x: x = mem[sp]; sp = sp + 4.
Figure 6.4 Effects of push and pop operations on a stack.
push: addi $sp,$sp,-4        pop: lw   $t5,0($sp)
      sw   $t4,0($sp)             addi $sp,$sp,4
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 33

Memory Map in MiniMIPS
Hex address 00000000: reserved (1 M words). 00400000: text segment, holding the program (63 M words). 10000000: data segment — static data first (the region 10000000-1000ffff, around $gp = 10008000, is addressable with a 16-bit signed offset), then dynamic data (448 M words in all). The stack segment grows downward from 7ffffffc, delimited by $sp ($29) and $fp ($30). The second half of the address space, from 80000000, is reserved for memory-mapped I/O.
Figure 6.3 Overview of the memory address space in MiniMIPS.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 34

6.3 Parameters and Results
Stack allows us to pass/return an arbitrary number of values
Before calling: the frame for the current procedure, delimited by $fp and $sp, holds values such as c, b, a. After calling: a new frame is pushed for the called procedure, holding the old ($fp), saved registers, and local variables (z, y, ...); $fp and $sp now delimit this new frame, with the previous frame below it.
Figure 6.5 Use of the stack by a procedure.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 35

Example of Using the Stack
Saving $fp, $ra, and $s0 onto the stack and restoring them at the end of the procedure:
proc: sw   $fp,-4($sp)    # save the old frame pointer
      addi $fp,$sp,0      # save ($sp) into $fp
      addi $sp,$sp,-12    # create 3 spaces on top of stack
      sw   $ra,-8($fp)    # save ($ra) in 2nd stack element
      sw   $s0,-12($fp)   # save ($s0) in top stack element
      ...
      lw   $s0,-12($fp)   # put top stack element in $s0
      lw   $ra,-8($fp)    # put 2nd stack element in $ra
      addi $sp,$fp,0      # restore $sp to original state
      lw   $fp,-4($sp)    # restore $fp to original state
      jr   $ra            # return from procedure
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 36

6.4 Data Types
Data size (number of bits), data type (meaning assigned to bits)
Signed integer, unsigned integer, floating-point number, and bit string each come in byte, word, and doubleword sizes.
Converting from one size to another:
Type      8-bit number  Value  32-bit version of the number
Unsigned  0010 1011     43     0000 0000 0000 0000 0000 0000 0010 1011
Unsigned  1010 1011     171    0000 0000 0000 0000 0000 0000 1010 1011
Signed    0010 1011     +43    0000 0000 0000 0000 0000 0000 0010 1011
Signed    1010 1011     -85    1111 1111 1111 1111 1111 1111 1010 1011
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 37

ASCII Characters
Table 6.1 ASCII (American standard code for information interchange). An 8-bit ASCII code is written as (column #, row #)hex; e.g., the code for + is (2b)hex or (0010 1011)two. Columns 0-1 hold control characters (NUL, SOH, STX, ..., DLE, DC1, ...), columns 2-7 hold the printable characters (space, punctuation, digits 0-9, @, A-Z, a-z, and DEL), and columns 8-9 and a-f hold more controls and more symbols.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 38

Loading and Storing Bytes
Bytes can be used to store ASCII characters or small integers. MiniMIPS addresses refer to bytes, but registers hold words.
lb  $t0,8($s3)   # load rt with mem[8+($s3)]; sign-extend to fill reg
lbu $t0,8($s3)   # load rt with mem[8+($s3)]; zero-extend to fill reg
sb  $t0,A($s3)   # LSB of rt to mem[A+($s3)]
These use the I format (op; base register rs; data register rt; address offset), with lb = 32, lbu = 36, sb = 40.
Figure 6.6 Load and store instructions for byte-size data elements.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 39

Meaning of a Word in Memory
The bit pattern (02114020)hex = 0000 0010 0001 0001 0100 0000 0010 0000 can be read as an Add instruction, as a positive integer, or as a four-character string.
Figure 6.7 A 32-bit word has no inherent meaning and can be interpreted in a number of equally valid ways in the absence of other cues (e.g., context) for the intended meaning.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 40

6.5 Arrays and Pointers
Index: Use a register that holds the index i and increment the register in each step to effect moving from element i of the list to element i + 1 (add 1 to i; compute 4i; add 4i to the base address of array A to reach A[i]).
Pointer: Use a register that points to (holds the address of) the list element being examined and update it in each step to point to the next element (add 4 to the pointer to get the address of A[i + 1]).
Figure 6.8 Stepping through the elements of an array using the indexing method and the pointer updating method. Jan.
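As a sketch of the pointer method of Figure 6.8 (ours; assumed register assignments: $s1 points into array A, $s2 holds the address just past the last element), summing a word array takes one update per element:

        add  $t1,$zero,$zero  # sum = 0
loop:   lw   $t0,0($s1)       # load A[i] through the pointer
        add  $t1,$t1,$t0      # sum = sum + A[i]
        addi $s1,$s1,4        # point to A[i+1]
        bne  $s1,$s2,loop     # repeat until end of array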
2011 Computer Architecture, Instruction-Set Architecture Slide 41 Selection Sort Example 6.4 To sort a list of numbers, repeatedly perform the following: Find the max element, swap it with the last item, move up the “last” pointer A A first first max A first x y last last last Start of iteration Figure 6.9 Jan. 2011 y x Maximum identified End of iteration One iteration of selection sort. Computer Architecture, Instruction-Set Architecture Slide 42 Selection Sort Using the Procedure max Example 6.4 (continued) A A first Inputs to proc max first In $a0 max x In $v0 In $v1 In $a1 last Start of iteration Jan. 2011 y Outputs from proc max last last sort: beq jal lw sw sw addi j done: ... A first $a0,$a1,done max $t0,0($a1) $t0,0($v0) $v1,0($a1) $a1,$a1,-4 sort # # # # # # # # y x Maximum identified End of iteration single-element list is sorted call the max procedure load last element into $t0 copy the last element to max loc copy max value to last element decrement pointer to last element repeat sort for smaller list continue with rest of program Computer Architecture, Instruction-Set Architecture Slide 43 6.6 Additional Instructions MiniMIPS instructions for multiplication and division: mult div $s0, $s1 $s0, $s1 mfhi mflo # # # # # $t0 $t0 31 R op 20 rt 15 rd 10 sh 5 fn Reg file Mul/Div unit Hi 0 Source register 1 Source register 2 Unused Unused mult = 24 div = 26 The multiply (mult) and divide (div) instructions of MiniMIPS. 31 R rs Hi,Lo to ($s0)×($s1) Hi to ($s0)mod($s1) Lo to ($s0)/($s1) $t0 to (Hi) $t0 to (Lo) 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 x 0 ALU instruction Figure 6.10 25 set set and set set op 25 rs 20 rt 15 rd 10 sh 5 fn 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 x 0 ALU instruction Unused Unused Destination register Unused mfhi = 16 mflo = 18 Figure 6.11 MiniMIPS instructions for copying the contents of Hi and Lo registers into general registers . Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 44 Lo Logical Shifts MiniMIPS instructions for left and right shifting: sll srl sllv srlv $t0,$s1,2 $t0,$s1,2 $t0,$s1,$s0 $t0,$s1,$s0 31 R op 25 20 $t0=($s1) $t0=($s1) $t0=($s1) $t0=($s1) rt 15 left-shifted by 2 right-shifted by 2 left-shifted by ($s0) right-shifted by ($s0) rd 10 sh fn 5 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 x 0 ALU instruction 31 R rs # # # # op Unused 25 rs Source register 20 rt Destination register 15 rd Shift amount 10 sh sll = 0 srl = 2 fn 5 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 x 0 ALU instruction Figure 6.12 Jan. 2011 Amount register Source register Destination register Unused sllv = 4 srlv = 6 The four logical shift instructions of MiniMIPS. Computer Architecture, Instruction-Set Architecture Slide 45 Unsigned Arithmetic and Miscellaneous Instructions MiniMIPS instructions for unsigned arithmetic (no overflow exception): addu subu multu divu $t0,$s0,$s1 $t0,$s0,$s1 $s0,$s1 $s0,$s1 addiu $t0,$s0,61 # # # # # # # # set $t0 to ($s0)+($s1) set $t0 to ($s0)–($s1) set Hi,Lo to ($s0)×($s1) set Hi to ($s0)mod($s1) and Lo to ($s0)/($s1) set $t0 to ($s0)+61; the immediate operand is sign extended To make MiniMIPS more powerful and complete, we introduce later: sra $t0,$s1,2 srav $t0,$s1,$s0 syscall Jan. 2011 # sh. right arith (Sec. 10.5) # shift right arith variable # system call (Sec. 
7.6) Computer Architecture, Instruction-Set Architecture Slide 46 The 20 MiniMIPS Instructions Copy from Chapter 6 (40 in all so far) Arithmetic Table 6.2 (partial) 31 R 31 I 31 J op 25 rs 20 rt 15 rd 10 sh fn 5 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits Opcode Source register 1 Source register 2 Destination register Shift amount Opcode extension op 25 rs 20 rt 15 operand / offset 6 bits 5 bits 5 bits 16 bits Opcode Source or base Destination or data Immediate operand or address offset op 25 jump target address 0 0 Shift 0 6 bits 1 0 0 0 0 0 0 0 0 0 0 0260 bits 0 0 0 0 0 0 0 1 1 1 1 0 1 Opcode Memory word address (byte address divided by 4) Memory access Control transfer Jan. 2011 Instruction Usage Move from Hi Move from Lo Add unsigned Subtract unsigned Multiply Multiply unsigned Divide Divide unsigned Add immediate unsigned Shift left logical Shift right logical Shift right arithmetic Shift left logical variable Shift right logical variable Shift right arith variable Load byte Load byte unsigned Store byte Jump and link System call mfhi rd mflo rd addu rd,rs,rt subu rd,rs,rt mult rs,rt multu rs,rt div rs,rt divu rs,rt addiu rs,rt,imm sll rd,rt,sh srl rd,rt,sh sra rd,rt,sh sllv rd,rt,rs srlv rt,rd,rs srav rd,rt,rd lb rt,imm(rs) lbu rt,imm(rs) sb rt,imm(rs) jal L syscall Computer Architecture, Instruction-Set Architecture op fn 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 32 36 40 3 0 Slide 47 16 18 33 35 24 25 26 27 0 2 3 4 6 7 12 Table 6.2 The 37 + 3 MiniMIPS Instructions Covered So Far Instruction Usage Instruction Usage Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch less than 0 Branch equal Branch not equal lui add sub slt addi slti and or xor nor andi ori xori lw sw j jr bltz beq bne Move from Hi Move from Lo Add unsigned Subtract unsigned Multiply Multiply unsigned Divide Divide unsigned Add immediate unsigned Shift left logical Shift right logical Shift right arithmetic Shift left logical variable Shift right logical variable Shift right arith variable Load byte Load byte unsigned Store byte Jump and link mfhi mflo addu subu mult multu div divu addiu sll srl sra sllv srlv srav lb lbu sb jal System call syscall Jan. 2011 rt,imm rd,rs,rt rd,rs,rt rd,rs,rt rt,rs,imm rd,rs,imm rd,rs,rt rd,rs,rt rd,rs,rt rd,rs,rt rt,rs,imm rt,rs,imm rt,rs,imm rt,imm(rs) rt,imm(rs) L rs rs,L rs,rt,L rs,rt,L Computer Architecture, Instruction-Set Architecture rd rd rd,rs,rt rd,rs,rt rs,rt rs,rt rs,rt rs,rt rs,rt,imm rd,rt,sh rd,rt,sh rd,rt,sh rd,rt,rs rd,rt,rs rd,rt,rs rt,imm(rs) rt,imm(rs) rt,imm(rs) L Slide 48 7 Assembly Language Programs Everything else needed to build and run assembly programs: • Supply info to assembler about program and its data • Non-hardware-supported instructions for convenience Topics in This Chapter 7.1 Machine and Assembly Languages 7.2 Assembler Directives 7.3 Pseudoinstructions 7.4 Macroinstructions 7.5 Linking and Loading 7.6 Running Assembler Programs Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 49 7.1 Machine and Assembly Languages $2,$5,$5 $2,$2,$2 $2,$4,$2 $15,0($2) $16,4($2) $16,0($2) $15,4($2) $31 00a51020 00421020 00821020 8c620000 8cf20004 acf20000 ac620004 03e00008 Executable machine language program Loader add add add lw lw sw sw jr Machine language program Linker Assembly language program Assembler MIPS, 80x86, PowerPC, etc. 
Library routines (machine language) Memory content Figure 7.1 Steps in transforming an assembly language program to an executable program residing in memory. Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 50 Symbol Table Assembly language program addi sub add test: bne addi add j done: sw Symbol table $s0,$zero,9 $t0,$s0,$s0 $t1,$zero,$zero $t0,$s0,done $t0,$t0,1 $t1,$s0,$zero test $t1,result($gp) done result test 28 248 12 Location 0 4 8 12 16 20 24 28 Machine language program 00100000000100000000000000001001 00000010000100000100000000100010 00000001001000000000000000100000 00010101000100000000000000001100 00100001000010000000000000000001 00000010000000000100100000100000 00001000000000000000000000000011 10101111100010010000000011111000 op rs rt rd sh fn Field boundaries shown to facilitate understanding Determined from assembler directives not shown here Figure 7.2 An assembly-language program, its machine-language version, and the symbol table created during the assembly process. Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 51 7.2 Assembler Directives Assembler directives provide the assembler with info on how to translate the program but do not lead to the generation of machine instructions tiny: max: small: big: array: str1: str2: .macro .end_macro .text ... .data .byte 156,0x7a .word 35000 .float 2E-3 .double 2E-3 .align 2 .space 600 .ascii “a*b” .asciiz “xyz” .global main Jan. 2011 # # # # # # # # # # # # # # start macro (see Section 7.4) end macro (see Section 7.4) start program’s text segment program text goes here start program’s data segment name & initialize data byte(s) name & initialize data word(s) name short float (see Chapter 12) name long float (see Chapter 12) align next item on word boundary reserve 600 bytes = 150 words name & initialize ASCII string null-terminated ASCII string consider “main” a global name Computer Architecture, Instruction-Set Architecture Slide 52 Composing Simple Assembler Directives Example 7.1 Write assembler directive to achieve each of the following objectives: a. Put the error message “Warning: The printer is out of paper!” in memory. b. Set up a constant called “size” with the value 4. c. Set up an integer variable called “width” and initialize it to 4. d. Set up a constant called “mill” with the value 1,000,000 (one million). e. Reserve space for an integer vector “vect” of length 250. Solution: a. noppr: .asciiz “Warning: The printer is out of paper!” b. size: .byte 4 # small constant fits in one byte c. width: .word 4 # byte could be enough, but ... d. mill: .word 1000000 # constant too large for byte e. vect: .space 1000 # 250 words = 1000 bytes Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 53 7.3 Pseudoinstructions Example of one-to-one pseudoinstruction: The following not $s0 # complement ($s0) is converted to the real instruction: nor $s0,$s0,$zero # complement ($s0) Example of one-to-several pseudoinstruction: The following abs $t0,$s0 # put |($s0)| into $t0 is converted to the sequence of real instructions: add slt beq sub Jan. 2011 $t0,$s0,$zero $at,$t0,$zero $at,$zero,+4 $t0,$zero,$s0 # # # # copy x into $t0 is x negative? if not, skip next instr the result is 0 – x Computer Architecture, Instruction-Set Architecture Slide 54 MiniMIPS Pseudoinstructions Copy Arithmetic Table 7.1 Shift Logic Memory access Control transfer Jan. 
2011 Pseudoinstruction Usage Move Load address Load immediate Absolute value Negate Multiply (into register) Divide (into register) Remainder Set greater than Set less or equal Set greater or equal Rotate left Rotate right NOT Load doubleword Store doubleword Branch less than Branch greater than Branch less or equal Branch greater or equal move la li abs neg mul div rem sgt sle sge rol ror not ld sd blt bgt ble bge Computer Architecture, Instruction-Set Architecture regd,regs regd,address regd,anyimm regd,regs regd,regs regd,reg1,reg2 regd,reg1,reg2 regd,reg1,reg2 regd,reg1,reg2 regd,reg1,reg2 regd,reg1,reg2 regd,reg1,reg2 regd,reg1,reg2 reg regd,address regd,address reg1,reg2,L reg1,reg2,L reg1,reg2,L reg1,reg2,L Slide 55 7.4 Macroinstructions A macro is a mechanism to give a name to an often-used sequence of instructions (shorthand notation) .macro name(args) ... .end_macro # macro and arguments named # instr’s defining the macro # macro terminator How is a macro different from a pseudoinstruction? Pseudos are predefined, fixed, and look like machine instructions Macros are user-defined and resemble procedures (have arguments) How is a macro different from a procedure? Control is transferred to and returns from a procedure After a macro has been replaced, no trace of it remains Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 56 Macro to Find the Largest of Three Values Example 7.4 Write a macro to determine the largest of three values in registers and to put the result in a fourth register. Solution: .macro mx3r(m,a1,a2,a3) move m,a1 bge m,a2,+4 move m,a2 bge m,a3,+4 move m,a3 .endmacro # # # # # # # macro and arguments named assume (a1) is largest; m = (a1) if (a2) is not larger, ignore it else set m = (a2) if (a3) is not larger, ignore it else set m = (a3) macro terminator If the macro is used as mx3r($t0,$s0,$s4,$s3), the assembler replaces the arguments m, a1, a2, a3 with $t0, $s0, $s4, $s3, respectively. Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 57 7.5 Linking and Loading The linker has the following responsibilities: Ensuring correct interpretation (resolution) of labels in all modules Determining the placement of text and data segments in memory Evaluating all data addresses and instruction labels Forming an executable program with no unresolved references The loader is in charge of the following: Determining the memory needs of the program from its header Copying text and data from the executable program file into memory Modifying (shifting) addresses, where needed, during copying Placing program parameters onto the stack (as in a procedure call) Initializing all machine registers, including the stack pointer Jumping to a start-up routine that calls the program’s main routine Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 58 7.6 Running Assembler Programs Spim is a simulator that can run MiniMIPS programs The name Spim comes from reversing MIPS Three versions of Spim are available for free downloading: PCSpim for Windows machines xspim for X-windows SPIM A MIPS32 Simulator spim for Unix systems You can download SPIM from: http://www.cs.wisc.edu/~larus/spim.html Jan. 2011 James Larus spim@larusstone.org Microsoft Research Formerly: Professor, CS Dept., Univ. Wisconsin-Madison spim is a self-contained simulator that will run MIPS32 assembly language programs. It reads and executes assembly . . . 
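Putting the pieces together, a complete program that Spim can run might look like the following sketch (ours; it uses the directives of Section 7.2, the li and la pseudoinstructions of Table 7.1, and the syscall conventions tabulated on the next slide; the label names are illustrative):

        .data
msg:    .asciiz "The answer is "
        .text
        .global main
main:   li   $v0,4            # syscall 4: print string
        la   $a0,msg          # address of the string
        syscall
        li   $v0,1            # syscall 1: print integer
        addi $a0,$zero,42     # the integer to print
        syscall
        li   $v0,10           # syscall 10: exit from program
        syscall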
Computer Architecture, Instruction-Set Architecture Slide 59

Input/Output Conventions for MiniMIPS
Table 7.2 Input/output and control functions of syscall in PCSpim.
($v0)  Function              Arguments                       Result
1      Print integer         Integer in $a0                  Integer displayed
2      Print floating-point  Float in $f12                   Float displayed
3      Print double-float    Double-float in $f12,$f13       Double-float displayed
4      Print string          Pointer in $a0                  Null-terminated string displayed
5      Read integer                                          Integer returned in $v0
6      Read floating-point                                   Float returned in $f0
7      Read double-float                                     Double-float returned in $f0,$f1
8      Read string           Pointer in $a0, length in $a1   String returned in buffer at pointer
9      Allocate memory       Number of bytes in $a0          Pointer to memory block in $v0
10     Exit from program                                     Program execution terminated
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 60

PCSpim User Interface
Figure 7.3 shows the PCSpim interface: a menu bar (File, Simulator, Window, Help) with its associated toolbar; windows for the registers (PC, EPC, Cause, Status, HI, LO, and general registers R0-R31), the text segment (disassembled instructions), the data segment, messages, and the console; and a status bar showing simulator settings.
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 61

8 Instruction Set Variations
The MiniMIPS instruction set is only one example
• How instruction sets may differ from that of MiniMIPS
• RISC and CISC instruction set design philosophies
Topics in This Chapter
8.1 Complex Instructions
8.2 Alternative Addressing Modes
8.3 Variations in Instruction Formats
8.4 Instruction Set Design and Evolution
8.5 The RISC/CISC Dichotomy
8.6 Where to Draw the Line
Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 62

Review of Some Key Concepts
Macroinstruction: replaced at assembly time with equivalent instructions (different from a procedure, in that the macro leaves no trace after replacement). Instruction: may in turn be implemented as a sequence of microinstructions.
Instruction formats for a simple RISC design (the R, I, and J formats, as before): fields are used consistently (simple decoding), so reading of registers can be initiated even before the instruction is fully decoded.
R: op (6 bits), rs (5), rt (5), rd (5), sh (5), fn (6)
I: op (6 bits), rs (5), rt (5), operand/offset (16)
J: op (6 bits), jump target address (26) — Opcode, memory word address (byte address divided by 4) Jan.
2011 All of the same length Short, uniform execution Computer Architecture, Instruction-Set Architecture Slide 63 8.1 Complex Instructions Table 8.1 (partial) Examples of complex instructions in two popular modern microprocessors and two computer families of historical significance Machine Instruction Effect Pentium MOVS Move one element in a string of bytes, words, or doublewords using addresses specified in two pointer registers; after the operation, increment or decrement the registers to point to the next element of the string PowerPC cntlzd Count the number of consecutive 0s in a specified source register beginning with bit position 0 and place the count in a destination register IBM 360-370 CS Compare and swap: Compare the content of a register to that of a memory location; if unequal, load the memory word into the register, else store the content of a different register into the same memory location Digital VAX POLYD Polynomial evaluation with double flp arithmetic: Evaluate a polynomial in x, with very high precision in intermediate results, using a coefficient table whose location in memory is given within the instruction Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 64 Some Details of Sample Complex Instructions 0000 0010 1100 0111 Source string Destination string cntlzd (Count leading 0s) 6 leading 0s 0000 0000 0000 0110 POLYD (Polynomial evaluation in double floating-point) Coefficients cn–1xn–1 + . . . + c2x2 + c1x + c0 MOVS x (Move string) Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 65 Benefits and Drawbacks of Complex Instructions Fewer instructions in program (less memory) Fewer memory accesses for instructions Programs may become easier to write/read/understand Potentially faster execution (complex steps are still done sequentially in multiple cycles, but hardware control can be faster than software loops) Jan. 2011 More complex format (slower decoding) Less flexible (one algorithm for polynomial evaluation or sorting may not be the best in all cases) If interrupts are processed at the end of instruction cycle, machine may become less responsive to time-critical events (interrupt handling) Computer Architecture, Instruction-Set Architecture Slide 66 8.2 Alternative Addressing Modes Addressing Instruction Other elements involved Some place in the machine Implied Let’s refresh our memory (from Chap. 5) Extend, if required Immediate Reg spec Register Reg file Constant offset Base Reg base PC-relative Reg file Reg data Constant offset Reg data Mem Add addr Mem Add addr PC Pseudodirect Operand PC Mem Memory data Mem Memory data Mem addr Memory Mem data Figure 5.11 Schematic representation of addressing modes in MiniMIPS. Jan. 
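Returning to Table 8.1: a complex instruction such as cntlzd corresponds to a software loop on a simpler machine. A hedged sketch (ours; register choices and labels are illustrative) of counting the leading 0s of ($a0) in MiniMIPS:

clz:    addi $v0,$zero,0      # count of leading 0s, initially 0
        addi $t1,$zero,32     # bits left to examine
loop:   beq  $t1,$zero,done   # all 32 bits were 0
        srl  $t0,$a0,31       # isolate the current MSB
        bne  $t0,$zero,done   # stop at the first 1 bit
        addi $v0,$v0,1        # one more leading 0
        sll  $a0,$a0,1        # bring the next bit to the MSB
        addi $t1,$t1,-1
        j    loop
done:   jr   $ra              # leading-zero count in $v0

This is the trade-off of the previous slide in miniature: one PowerPC instruction versus a multi-instruction loop, but with far simpler decoding hardware.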
2011 Computer Architecture, Instruction-Set Architecture Slide 67 Table 6.2 Addressing Mode Examples in the MiniMIPS ISA Instruction Usage Instruction Usage Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch less than 0 Branch equal Branch not equal lui add sub slt addi slti and or xor nor andi ori xori lw sw j jr bltz beq bne Move from Hi Move from Lo Add unsigned Subtract unsigned Multiply Multiply unsigned Divide Divide unsigned Add immediate unsigned Shift left logical Shift right logical Shift right arithmetic Shift left logical variable Shift right logical variable Shift right arith variable Load byte Load byte unsigned Store byte Jump and link mfhi mflo addu subu mult multu div divu addiu sll srl sra sllv srlv srav lb lbu sb jal System call syscall Jan. 2011 rt,imm rd,rs,rt rd,rs,rt rd,rs,rt rt,rs,imm rd,rs,imm rd,rs,rt rd,rs,rt rd,rs,rt rd,rs,rt rt,rs,imm rt,rs,imm rt,rs,imm rt,imm(rs) rt,imm(rs) L rs rs,L rs,rt,L rs,rt,L Computer Architecture, Instruction-Set Architecture rd rd rd,rs,rt rd,rs,rt rs,rt rs,rt rs,rt rs,rt rs,rt,imm rd,rt,sh rd,rt,sh rd,rt,sh rd,rt,rs rd,rt,rs rd,rt,rs rt,imm(rs) rt,imm(rs) rt,imm(rs) L Slide 68 More Elaborate Addressing Modes Addressing Instruction Other elements involved Indexed Reg file Index reg Base reg Increment amount Update (with base) Base reg Update (with index ed) Reg file Increment amount Indirect Reg file Base reg Index reg Operand x := B[i] Mem Mem Add addr Memory data x := Mem[p] p := p + 1 Mem Mem Incre- addr Memory data ment Mem Mem Add addr Memory data x := B[i] i := i + 1 Increment Mem data PC Memory Mem addr This part maybe replaced with any Mem addr, other form of address specif ication 2nd access t := Mem[p] x := Mem[t] Memory Mem data, 2nd access x := Mem[Mem[p]] Figure 8.1 Schematic representation of more elaborate addressing modes not supported in MiniMIPS. Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 69 Usefulness of Some Elaborate Addressing Modes Update mode: XORing a string of bytes loop: lb xor addi bne $t0,A($s0) $s1,$s1,$t0 $s0,$s0,-1 $s0,$zero,loop One instruction with update addressing Indirect mode: Case statement case: lw add add la add lw jr Jan. 2011 $t0,0($s0) $t0,$t0,$t0 $t0,$t0,$t0 $t1,T $t1,$t0,$t1 $t2,0($t1) $t2 # # # # get s form 2s form 4s base T # entry Branch to location Li if s = i (switch var.) T T+4 T+8 T + 12 T + 16 T + 20 Computer Architecture, Instruction-Set Architecture L0 L1 L2 L3 L4 L5 Slide 70 8.3 Variations in Instruction Formats 0-, 1-, 2-, and 3-address instructions in MiniMIPS Category Format Opcode Description of operand(s) One implied operand in register $v0 0-address 0 1-address 2 2-address 0 rs rt 24 mult Two source registers addressed, destination implied 3-address 0 rs rt rd 32 add Destination and two source registers addressed 12 syscall Address j Jump target addressed (in pseudodirect form) Figure 8.2 Examples of MiniMIPS instructions with 0 to 3 addresses; shaded fields are unused. Jan. 
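MiniMIPS supports none of the elaborate modes of Figure 8.1, but each is a short instruction sequence; e.g., indirect addressing (x := Mem[Mem[p]], with p assumed to be in $s0) takes two loads — a sketch:

lw $t0,0($s0)   # first access:  $t0 = Mem[p]
lw $t1,0($t0)   # second access: $t1 = Mem[Mem[p]]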
2011 Computer Architecture, Instruction-Set Architecture Slide 71 Zero-Address Architecture: Stack Machine Stack holds all the operands (replaces our register file) Load/Store operations become push/pop Arithmetic/logic operations need only an opcode: they pop operand(s) from the top of the stack and push the result onto the stack Example: Evaluating the expression (a + b) × (c – d) Push a Push b Add Push d Push c Subtract Multiply a b a a+b d a+b c d a+b c–d a+b Result Polish string: a b + d c – × If a variable is used again, you may have to push it multiple times Special instructions such as “Duplicate” and “Swap” are helpful Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 72 One-Address Architecture: Accumulator Machine The accumulator, a special register attached to the ALU, always holds operand 1 and the operation result Only one operand needs to be specified by the instruction Example: Evaluating the expression (a + b) × (c – d) Load add Store load subtract multiply a b t c d t Within branch instructions, the condition or target address must be implied Branch to L if acc negative If register x is negative skip the next instruction May have to store accumulator contents in memory (example above) No store needed for a + b + c + d + . . . (“accumulator”) Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 73 Two-Address Architectures Two addresses may be used in different ways: Operand1/result and operand 2 Condition to be checked and branch target address Example: Evaluating the expression (a + b) × (c – d) load add load subtract multiply $1,a $1,b $2,c $2,d $1,$2 Instructions of a hypothetical two-address machine A variation is to use one of the addresses as in a one-address machine and the second one to specify a branch in every instruction Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 74 Example of a Complex Instruction Format Instruction prefixes (zero to four, 1 B each) Operand/address size overwrites and other modifiers Mod Reg/Op R/M Scale Index Base Opcode (1-2 B) ModR/M SIB Offset or displacement (0, 1, 2, or 4 B) Most memory operands need these 2 bytes Instructions can contain up to 15 bytes Immediate (0, 1, 2, or 4 B) Components that form a variable-length IA-32 (80x86) instruction. Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 75 Some of IA-32’s Variable-Width Instructions Type Format (field widths shown) 1-byte 5 3 2-byte 4 4 3-byte 6 4-byte 8 5-byte 4 3 6-byte 7 8 8 8 8 8 8 32 8 32 Opcode Description of operand(s) PUSH 3-bit register specification JE 4-bit condition, 8-bit jump offset MOV 8-bit register/mode, 8-bit offset XOR ADD 8-bit register/mode, 8-bit base/index, 8-bit offset 3-bit register spec, 32-bit immediate TEST 8-bit register/mode, 32-bit immediate Figure 8.3 Example 80x86 instructions ranging in width from 1 to 6 bytes; much wider instructions (up to 15 bytes) also exist Jan. 
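For comparison with the 0-, 1-, and 2-address sequences above, the same expression (a + b) × (c – d) in a three-address register architecture such as MiniMIPS is a sketch like the following (assuming a, b, c, d already reside in $s0-$s3):

add  $t0,$s0,$s1   # t0 = a + b
sub  $t1,$s2,$s3   # t1 = c - d
mult $t0,$t1       # Hi,Lo = (a+b) x (c-d)
mflo $t2           # low word of the product

No intermediate result need be stored to memory, at the cost of wider instructions.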
2011 Computer Architecture, Instruction-Set Architecture Slide 76 8.4 Instruction Set Design and Evolution Desirable attributes of an instruction set: Consistent, with uniform and generally applicable rules Orthogonal, with independent features noninterfering Transparent, with no visible side effect due to implementation details Easy to learn/use (often a byproduct of the three attributes above) Extensible, so as to allow the addition of future capabilities Efficient, in terms of both memory needs and hardware realization Processor design team New machine project Instruction-set definition Implementation Performance objectives Fabrication & testing Sales & use ? Tuning & bug fixes Feedback Figure 8.4 Jan. 2011 Processor design and implementation process. Computer Architecture, Instruction-Set Architecture Slide 77 8.5 The RISC/CISC Dichotomy The RISC (reduced instruction set computer) philosophy: Complex instruction sets are undesirable because inclusion of mechanisms to interpret all the possible combinations of opcodes and operands might slow down even very simple operations. Ad hoc extension of instruction sets, while maintaining backward compatibility, leads to CISC; imagine modern English containing every English word that has been used through the ages Features of RISC architecture 1. 2. 3. 4. Small set of inst’s, each executable in roughly the same time Load/store architecture (leading to more registers) Limited addressing mode to simplify address calculations Simple, uniform instruction formats (ease of decoding) Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 78 RISC/CISC Comparison via Generalized Amdahl’s Law Example 8.1 An ISA has two classes of simple (S) and complex (C) instructions. On a reference implementation of the ISA, class-S instructions account for 95% of the running time for programs of interest. A RISC version of the machine is being considered that executes only class-S instructions directly in hardware, with class-C instructions treated as pseudoinstructions. It is estimated that in the RISC version, class-S instructions will run 20% faster while class-C instructions will be slowed down by a factor of 3. Does the RISC approach offer better or worse performance compared to the reference implementation? Solution Per assumptions, 0.95 of the work is speeded up by a factor of 1.0 / 0.8 = 1.25, while the remaining 5% is slowed down by a factor of 3. The RISC speedup is 1 / [0.95 / 1.25 + 0.05 × 3] = 1.1. Thus, a 10% improvement in performance can be expected in the RISC version. Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 79 Some Hidden Benefits of RISC In Example 8.1, we established that a speedup factor of 1.1 can be expected from the RISC version of a hypothetical machine This is not the entire story, however! If the speedup of 1.1 came with some additional cost, then one might legitimately wonder whether it is worth the expense and design effort The RISC version of the architecture also: Reduces the effort and team size for design Shortens the testing and debugging phase Cheaper product and shorter time-to-market Simplifies documentation and maintenance Jan. 
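A quick sensitivity check on Example 8.1, under the same assumptions: had class-C instructions been slowed down by a factor of 5 instead of 3, the speedup would be 1 / [0.95 / 1.25 + 0.05 × 5] = 1 / 1.01 ≈ 0.99, i.e., roughly break-even. The 10% gain thus hinges on the pseudoinstruction treatment of class-C instructions not being too slow.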
2011 Computer Architecture, Instruction-Set Architecture Slide 80 MIPS Performance Rating Revisited An m-MIPS processor can execute m million instructions per second Comparing an m-MIPS processor with a 10m-MIPS processor Like comparing two people who read m pages and 10m pages per hour 10 pages / hr 100 pages / hr Reading 100 pages per hour, as opposed to 10 pages per hour, may not allow you to finish the same reading assignment in 1/10 the time Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 81 RISC / CISC Convergence The earliest RISC designs: CDC 6600, highly innovative supercomputer of the mid 1960s IBM 801, influential single-chip processor project of the late 1970s In the early 1980s, two projects brought RISC to the forefront: UC Berkeley’s RISC 1 and 2, forerunners of the Sun SPARC Stanford’s MIPS, later marketed by a company of the same name Throughout the 1980s, there were heated debates about the relative merits of RISC and CISC architectures Since the 1990s, the debate has cooled down! We can now enjoy both sets of benefits by having complex instructions automatically translated to sequences of very simple instructions that are then executed on RISC-based underlying hardware Jan. 2011 Computer Architecture, Instruction-Set Architecture Slide 82 8.6 Where to Draw the Line The ultimate reduced instruction set computer (URISC): How many instructions are absolutely needed for useful computation? Only one! subtract source1 from source2, replace source2 with the result, and jump to target address if result is negative Assembly language form: label: urisc dest,src1,target Pseudoinstructions can be synthesized using the single instruction: stop: .word start: urisc urisc urisc Corrected urisc version ... Jan. 2011 0 dest,dest,+1 temp,temp,+1 temp,src,+1 dest,temp,+1 # # # # # dest temp temp dest rest This is the move pseudoinstruction = 0 = 0 = -(src) = -(temp); i.e. (src) of program Computer Architecture, Instruction-Set Architecture Slide 83 Some Useful Pseudo Instructions for URISC Example 8.2 (2 parts of 5) Write the sequence of instructions that are produced by the URISC assembler for each of the following pseudoinstructions. parta: uadd partc: uj dest,src1,src2 label # dest=(src1)+(src2) # goto label Solution at1 and at2 are temporary memory locations for assembler’s use parta: urisc urisc urisc urisc urisc partc: urisc urisc Jan. 2011 at1,at1,+1 at1,src1,+1 at1,src2,+1 dest,dest,+1 dest,at1,+1 at1,at1,+1 at1,one,label # # # # # # # at1 = 0 at1 = -(src1) at1 = -(src1)–(src2) dest = 0 dest = -(at1) at1 = 0 at1 = -1 to force jump Computer Architecture, Instruction-Set Architecture Slide 84 URISC Hardware URISC instruction: Word 1 Word 2 Word 3 Source 1 Source 2 / Dest Jump target Comp C in 0 PC in MDR in MAR in 0 Read 1 R R’ P C Adder N in R in Figure 8.5 Jan. 2011 Write M D R M A R Z in N Z 1 Mux 0 Memory unit PCout Instruction format and hardware structure for URISC. Computer Architecture, Instruction-Set Architecture Slide 85 Part III The Arithmetic/Logic Unit Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. 
Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First July 2003 July 2004 July 2005 Mar. 2006 Jan. 2007 Jan. 2008 Jan. 2009 Jan. 2011 Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 2 III The Arithmetic/Logic Unit Overview of computer arithmetic and ALU design: • Review representation methods for signed integers • Discuss algorithms & hardware for arithmetic ops • Consider floating-point representation & arithmetic Topics in This Part Chapter 9 Number Representation Chapter 10 Adders and Simple ALUs Chapter 11 Multipliers and Dividers Chapter 12 Floating-Point Arithmetic Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 3 Preview of Arithmetic Unit in the Data Path Incr PC Next addr jta Next PC ALUOvfl (PC) (rs) rs rt PC Instr cache inst 0 1 2 rd 31 imm op Br&Jump Instruction fetch Fig. 13.3 Jan. 2011 Register writeback Ovfl Reg file ALU (rt) / 16 ALU out Data cache Data out Data in Func 0 32 SE / 1 Data addr 0 1 2 Register input fn RegDst RegWrite Reg access / decode ALUSrc ALUFunc ALU operation RegInSrc DataRead DataWrite Data access Key elements of the single-cycle MicroMIPS data path. Computer Architecture, The Arithmetic/Logic Unit Slide 4 Computer Arithmetic as a Topic of Study Brief overview article – Encyclopedia of Info Systems, Academic Press, 2002, Vol. 3, pp. 317-333 Our textbook’s treatment of the topic falls between the extremes (4 chaps.) Graduate course ECE 252B – Text: Computer Arithmetic, Oxford U Press, 2000 (2nd ed., 2010) Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 5 9 Number Representation Arguably the most important topic in computer arithmetic: • Affects system compatibility and ease of arithmetic • Two’s complement, flp, and unconventional methods Topics in This Chapter 9.1 Positional Number Systems 9.2 Digit Sets and Encodings 9.3 Number-Radix Conversion 9.4 Signed Integers 9.5 Fixed-Point Numbers 9.6 Floating-Point Numbers Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 6 9.1 Positional Number Systems Representations of natural numbers {0, 1, 2, 3, …} ||||| ||||| ||||| ||||| ||||| || 27 11011 XXVII sticks or unary code radix-10 or decimal code radix-2 or binary code Roman numerals Fixed-radix positional representation with k digits k–1 Value of a number: x = (xk–1xk–2 . . . x1x0)r = Σ xi r i i=0 For example: 27 = (11011)two = (1×24) + (1×23) + (0×22) + (1×21) + (1×20) Number of digits for [0, P]: k = ⎡logr (P + 1)⎤ = ⎣logr P⎦ + 1 Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 7 Unsigned Binary Integers 0000 1111 15 1110 0 0001 1 0010 2 14 1101 0011 13 12 1100 1011 Turn x notches counterclockwise to add x 3 Inside: Natural number Outside: 4-bit encoding 11 4 5 10 1010 0100 0101 12 11 10 15 1 2 0 3 4 5 9 8 7 6 6 9 1001 14 13 8 1000 7 0110 0111 Turn y notches clockwise to subtract y Figure 9.1 Schematic representation of 4-bit code for integers in [0, 15]. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 8 Representation Range and Overflow − Overflow region max max Numbers smaller than max − + Overflow region Numbers larger than max + Finite set of representable numbers Figure 9.2 Overflow regions in finite number representation systems. For unsigned representations covered in this section, max – = 0. 
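In MiniMIPS terms (a sketch, anticipating the unsigned instructions of Section 6.6): add raises an overflow exception when the true sum falls in an overflow region, while addu produces the same sum bits with no overflow exception:

add  $t0,$s0,$s1   # overflow exception if the signed sum overflows
addu $t1,$s0,$s1   # same sum bits, but no overflow exception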
Example 9.2, Part d Discuss if overflow will occur when computing 317 – 316 in a number system with k = 8 digits in radix r = 10. Solution The result 86 093 442 is representable in the number system which has a range [0, 99 999 999]; however, if 317 is computed en route to the final result, overflow will occur. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 9 9.2 Digit Sets and Encodings Conventional and unconventional digit sets • Decimal digits in [0, 9]; 4-bit BCD, 8-bit ASCII • Hexadecimal, or hex for short: digits 0-9 & a-f • Conventional ternary digit set in [0, 2] Conventional digit set for radix r is [0, r – 1] Symmetric ternary digit set in [–1, 1] • Conventional binary digit set in [0, 1] Redundant digit set [0, 2], encoded in 2 bits ( 0 2 1 1 0 )two and ( 1 0 1 0 2 )two represent 22 Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 10 Carry-Save Numbers Radix-2 numbers using the digits 0, 1, and 2 Example: (1 0 2 1)two = (1×23) + (0×22) + (2×21) + (1×20) = 13 Possible encodings (a) Binary (b) Unary 0 1 2 0 1 1 2 MSB LSB Jan. 2011 00 01 10 11 (Unused) 1 0 2 1 0 0 1 0 = 2 1 0 0 1 = 9 00 01 (First alternate) 10 (Second alternate) 11 First bit Second bit Computer Architecture, The Arithmetic/Logic Unit 1 0 2 1 0 0 1 1 = 3 1 0 1 0 = 10 Slide 11 The Notion of Carry-Save Addition Digit-set combination: {0, 1, 2} + {0, 1} = {0, 1, 2, 3} = {0, 2} + {0, 1} This bit being 1 represents overflow (ignore it) Carry-save input Carry-save addition Two carry-save inputs Binary input Carry-save output 0 0 Carry-save addition 0 a. Carry-save addition. b. Adding two carry-save numbers. Figure 9.3 Adding a binary number or another carry-save number to a carry-save number. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 12 9.3 Number Radix Conversion Two ways to convert numbers from an old radix r to a new radix R • Perform arithmetic in the new radix R Suitable for conversion from radix r to radix 10 Horner’s rule: (xk–1xk–2 . . . x1x0)r = (…((0 + xk–1)r + xk–2)r + . . . + x1)r + x0 (1 0 1 1 0 1 0 1)two = 0 + 1 → 1 × 2 + 0 → 2 × 2 + 1 → 5 × 2 + 1 → 11 × 2 + 0 → 22 × 2 + 1 → 45 × 2 + 0 → 90 × 2 + 1 → 181 • Perform arithmetic in the old radix r Suitable for conversion from radix 10 to radix R Divide the number by R, use the remainder as the LSD and the quotient to repeat the process 19 / 3 → rem 1, quo 6 / 3 → rem 0, quo 2 / 3 → rem 2, quo 0 Thus, 19 = (2 0 1)three Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 13 Justifications for Radix Conversion Rules ( xk −1 xk − 2 L x0 ) r = xk −1r k −1 + xk − 2 r k −2 + L + x1r + x0 = x0 + r ( x1 + r ( x2 + r (L))) Justifying Horner’s rule. x Binary representation of ⎣x/2⎦ Figure 9.4 Jan. 2011 0 x mod 2 Justifying one step of the conversion of x to radix 2. Computer Architecture, The Arithmetic/Logic Unit Slide 14 9.4 Signed Integers • We dealt with representing the natural numbers • Signed or directed whole numbers = integers { . . . , −3, −2, −1, 0, 1, 2, 3, . . . } • Signed-magnitude representation +27 in 8-bit signed-magnitude binary code 0 0011011 –27 in 8-bit signed-magnitude binary code 1 0011011 –27 in 2-digit decimal code with BCD digits 1 0010 0111 • Biased representation Represent the interval of numbers [−N, P] by the unsigned interval [0, P + N]; i.e., by adding N to every number Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 15 Two’s-Complement Representation With k bits, numbers in the range [–2k–1, 2k–1 – 1] represented. 
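The repeated-division rule of Section 9.3 is easy to mechanize. A hedged sketch (ours) that extracts the radix-10 digits of a nonnegative integer in $a0, least significant digit first, using the div/mfhi/mflo instructions of Section 6.6:

next:   addi $t1,$zero,10
        div  $a0,$t1          # Hi = ($a0) mod 10, Lo = ($a0) / 10
        mfhi $t0              # next digit is the remainder
        mflo $a0              # quotient becomes the new number
        ...                   # store or print the digit in $t0
        bne  $a0,$zero,next   # repeat until the quotient is 0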
Negation is performed by inverting all bits and adding 1. 0000 1111 –1 1110 +0 0001 +1 0010 +2 –2 1101 0011 –3 1011 + _ –4 1100 Turn x notches counterclockwise to add x –5 +3 +4 +5 –6 1001 0100 –4 –5 –6 0101 –1 1 2 0 3 4 5 –7 –8 7 6 +6 –7 1010 –2 –3 –8 1000 +7 0110 0111 Turn 16 – y notches counterclockwise to add –y (subtract y) Figure 9.5 Schematic representation of 4-bit 2’s-complement code for integers in [–8, +7]. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 16 Conversion from 2’s-Complement to Decimal Example 9.7 Convert x = (1 0 1 1 0 1 0 1)2’s-compl to decimal. Solution Given that x is negative, one could change its sign and evaluate –x. Shortcut: Use Horner’s rule, but take the MSB as negative –1 × 2 + 0 → –2 × 2 + 1 → –3 × 2 + 1 → –5 × 2 + 0 → –10 × 2 + 1 → –19 × 2 + 0 → –38 × 2 + 1 → –75 Sign Change for a 2’s-Complement Number Example 9.8 Given y = (1 0 1 1 0 1 0 1)2’s-compl, find the representation of –y. Solution –y = (0 1 0 0 1 0 1 0) + 1 = (0 1 0 0 1 0 1 1)2’s-compl Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit (i.e., 75) Slide 17 Two’s-Complement Addition and Subtraction x k / c in Adder y k / k / k / x±y c out y or y′ Add′Sub Figure 9.6 Jan. 2011 Binary adder used as 2’s-complement adder/subtractor. Computer Architecture, The Arithmetic/Logic Unit Slide 18 9.5 Fixed-Point Numbers Positional representation: k whole and l fractional digits Value of a number: x = (xk–1xk–2 . . . x1x0 . x–1x–2 . . . x–l )r = Σ xi r i For example: 2.375 = (10.011)two = (1×21) + (0×20) + (0×2−1) + (1×2−2) + (1×2−3) Numbers in the range [0, rk – ulp] representable, where ulp = r –l Fixed-point arithmetic same as integer arithmetic (radix point implied, not explicit) Two’s complement properties (including sign change) hold here as well: (01.011)2’s-compl = (–0×21) + (1×20) + (0×2–1) + (1×2–2) + (1×2–3) = +1.375 (11.011)2’s-compl = (–1×21) + (1×20) + (0×2–1) + (1×2–2) + (1×2–3) = –0.625 Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 19 Fixed-Point 2’s-Complement Numbers 1.111 1.110 –.125 0.000 +0 0.001 +.125 0.010 +.25 –.25 1.101 0.011 –.375 1.100 1.011 + _ –.5 0.100 +.5 +.625 –.625 0.101 +.75 –.75 1.010 +.375 –.875 1.001 –1 1.000 +.875 0.110 0.111 Figure 9.7 Schematic representation of 4-bit 2’s-complement encoding for (1 + 3)-bit fixed-point numbers in the range [–1, +7/8]. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 20 Radix Conversion for Fixed-Point Numbers Convert the whole and fractional parts separately. To convert the fractional part from an old radix r to a new radix R: • Perform arithmetic in the new radix R Evaluate a polynomial in r –1: (.011)two = 0 × 2–1 + 1 × 2–2 + 1 × 2–3 Simpler: View the fractional part as integer, convert, divide by r l (.011)two = (?)ten Multiply by 8 to make the number an integer: (011)two = (3)ten Thus, (.011)two = (3 / 8)ten = (.375)ten • Perform arithmetic in the old radix r Multiply the given fraction by R, use the whole part as the MSD and the fractional part to repeat the process (.72)ten = (?)two 0.72 × 2 = 1.44, so the answer begins with 0.1 0.44 × 2 = 0.88, so the answer begins with 0.10 Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 21 9.6 Floating-Point Numbers Useful for applications where very large and very small numbers are needed simultaneously • Fixed-point representation must sacrifice precision for small values to represent large values x = (0000 0000 . 0000 1001)two Small number y = (1001 0000 . 
0000 0000)two Large number • Neither y2 nor y / x is representable in the format above • Floating-point representation is like scientific notation: −20 000 000 = −2 × 10 7 +0.000 000 007 = +7 × 10–9 Sign Jan. 2011 Significand Exponent base Exponent Computer Architecture, The Arithmetic/Logic Unit Also, 7E−9 Slide 22 ANSI/IEEE Standard Floating-Point Format (IEEE 754) Revision (IEEE 754R) was completed in 2008: The revised version includes 16-bit and 128-bit binary formats, as well as 64- and 128-bit decimal formats Short (32-bit) format 8 bits, bias = 127, –126 to 127 23 bits for fractional part (plus hidden 1 in integer part) Sign Exponent 11 bits, bias = 1023, –1022 to 1023 Short exponent range is –127 to 128 but the two extreme values are reserved for special operands (similarly for the long format) Significand 52 bits for fractional part (plus hidden 1 in integer part) Long (64-bit) format Figure 9.8 Jan. 2011 The two ANSI/IEEE standard floating-point formats. Computer Architecture, The Arithmetic/Logic Unit Slide 23 Short and Long IEEE 754 Formats: Features Table 9.1 Some features of ANSI/IEEE standard floating-point formats Feature Word width in bits Significand in bits Significand range Exponent bits Exponent bias Zero (±0) Denormal Single/Short 32 23 + 1 hidden [1, 2 – 2–23] 8 127 e + bias = 0, f = 0 e + bias = 0, f ≠ 0 represents ±0.f × 2–126 e + bias = 255, f = 0 e + bias = 255, f ≠ 0 e + bias ∈ [1, 254] e ∈ [–126, 127] represents 1.f × 2e Double/Long 64 52 + 1 hidden [1, 2 – 2–52] 11 1023 e + bias = 0, f = 0 e + bias = 0, f ≠ 0 represents ±0.f × 2–1022 e + bias = 2047, f = 0 e + bias = 2047, f ≠ 0 e + bias ∈ [1, 2046] e ∈ [–1022, 1023] represents 1.f × 2e min 2–126 ≅ 1.2 × 10–38 2–1022 ≅ 2.2 × 10–308 max ≅ 2128 ≅ 3.4 × 1038 ≅ 21024 ≅ 1.8 × 10308 Infinity (±∞) Not-a-number (NaN) Ordinary number Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 24 10 Adders and Simple ALUs Addition is the most important arith operation in computers: • Even the simplest computers must have an adder • An adder, plus a little extra logic, forms a simple ALU Topics in This Chapter 10.1 Simple Adders 10.2 Carry Propagation Networks 10.3 Counting and Incrementation 10.4 Design of Fast Adders 10.5 Logic and Shift Operations 10.6 Multifunction ALUs Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 25 10.1 Simple Adders Inputs x Outputs y c x s c 0 0 1 1 0 1 0 1 0 0 0 1 Inputs Digit-set interpretation: {0, 1} + {0, 1} = {0, 2} + {0, 1} HA 0 1 1 0 s Outputs x y cin cout s 0 0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 1 Figures 10.1/10.2 Jan. 2011 y x cout y FA cin Digit-set interpretation: {0, 1} + {0, 1} + {0, 1} = {0, 2} + {0, 1} s Binary half-adder (HA) and full-adder (FA). Computer Architecture, The Arithmetic/Logic Unit Slide 26 Full-Adder Implementations HA c out x y x y HA c in c out s (a) FA built of two HAs x y c out 0 1 2 3 0 0 1 2 3 1 c in s (b) CMOS mux-based FA c in s (c) Two-level AND-OR FA Figure10.3 Full adder implemented with two half-adders, by means of two 4-input multiplexers, and as two-level gate network. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 27 Ripple-Carry Adder: Slow But Simple x31 y31 c31 c32 cout FA Jan. 2011 . . . y1 c2 x0 y0 c0 c1 FA FA s1 s0 Critical path s31 Figure 10.4 x1 cin Ripple-carry binary adder with 32-bit inputs and output. 
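Carry-save addition (Section 9.2) of three binary words can be phrased with the full-adder equations just tabulated, s = x ⊕ y ⊕ z and c = xy ∨ yz ∨ zx, applied bitwise to whole words. A MiniMIPS sketch (ours) for operands in $s0, $s1, $s2:

xor  $t0,$s0,$s1
xor  $t0,$t0,$s2    # sum bits:   s = x XOR y XOR z
and  $t1,$s0,$s1
and  $t2,$s1,$s2
or   $t1,$t1,$t2
and  $t2,$s0,$s2
or   $t1,$t1,$t2    # carry bits: c = xy OR yz OR zx
sll  $t1,$t1,1      # weight the carries; true sum = s + 2c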
Computer Architecture, The Arithmetic/Logic Unit Slide 28 Carry Chains and Auxiliary Signals Bit positions 15 14 13 12 ----------1 0 1 1 11 10 9 8 ----------0 1 1 0 7 6 5 4 ----------0 1 1 0 1 0 0 1 1 1 cout 0 1 0 1 \__________/\__________________/ 4 6 g = xy Jan. 2011 0 3 2 1 0 ----------1 1 1 0 0 0 0 1 1 cin \________/\____/ 3 2 Carry chains and their lengths Computer Architecture, The Arithmetic/Logic Unit p=x⊕y Slide 29 10.2 Carry Propagation Networks gi pi Carry is: 0 0 1 1 annihilated or killed propagated generated (impossible) 0 1 0 1 g k−1 p k−1 xi g k−2 p k−2 yi gi = xi yi pi = xi ⊕ yi g i+1 p i+1 gi pi ... ... g1 p1 g0 p0 c0 Carry network ck c k−1 ... c k−2 ci c i+1 ... c1 c0 si Figure 10.5 The main part of an adder is the carry network. The rest is just a set of gates to produce the g and p signals and the sum bits. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 30 Ripple-Carry Adder Revisited The carry recurrence: ci+1 = gi ∨ pi ci Latency of k-bit adder is roughly 2k gate delays: 1 gate delay for production of p and g signals, plus 2(k – 1) gate delays for carry propagation, plus 1 XOR gate delay for generation of the sum bits gk−1 pk−1 gk−2 pk−2 g1 p1 ... ck ck−1 Figure 10.6 Jan. 2011 ck−2 c2 c1 g0 p0 c0 The carry propagation network of a ripple-carry adder. Computer Architecture, The Arithmetic/Logic Unit Slide 31 The Complete Design of a Ripple-Carry Adder gi pi Carry is: 0 0 1 1 annihilated or killed propagated generated (impossible) 0 1 0 1 g k−1 p k−1 gk−1 pk−1 ck ck xi g k−2 p k−2 c k−1 gi = xi yi pi = xi ⊕ yi g i+1 p i+1 gi pi ... ... gk−2 pk−2 Carry network ck−2 ... c k−2 g1 p1 g1 . ck−1 yi ci c i+1 p1 . c2 c1 ... c1 g0 p0 g0 c0 p0 c0 c0 si Figure 10.6 (ripple-carry network) superimposed on Figure 10.5 (general structure of an adder). Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 32 First Carry Speed-Up Method: Carry Skip g4j+3 p4j+3 c4j+4 c4j+3 g4j+2 p4j+2 c4j+2 g4j+1 p4j+1 c4j+1 g4j p4j c4j One-way street Freeway Figures 10.7/10.8 A 4-bit section of a ripple-carry network with skip paths and the driving analogy. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 33 Mux-Based Skip Carry Logic g4j+3 p4j+3 g4j+2 p4j+2 g4j+1 p4j+1 g4j p4j Fig. 10.7 c4j+4 c4j+3 g4j+3 p4j+3 c4j+4 0 1c4j+4 p[4j, 4j+3] c4j+3 c4j+2 g4j+2 p4j+2 c4j+1 g4j+1 c4j+2 p4j+1 c4j g4j p4j c4j+1 c4j The carry-skip adder of Fig. 10.7 works fine if we begin with a clean slate, where all signals are 0s; otherwise, it will run into problems, which do not exist in this mux-based implementation Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 34 10.3 Counting and Incrementation k / k / Data in Incr′Init 0 k / Count register D Q 1 k / c in _ C Q Adder / k Update a (Increment amount) Figure 10.9 Jan. 2011 c out Schematic diagram of an initializable synchronous counter. Computer Architecture, The Arithmetic/Logic Unit Slide 35 Circuit for Incrementation by 1 Figure 10.6 Substantially simpler than an adder gk−1 pk−1 ck xk−1 ck 0g1 gk−2 pk−2 sk−1 x1 0g0 p0 ... ck−1 ck−2 c2 xk−2 ck−1 p1 ck−2 sk−2 ... 1 x0 c1 c2 s2 c0 c1 x1 s1 s0 Figure 10.10 Carry propagation network and sum logic for an incrementer. Jan. 
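The carry recurrence c(i+1) = g(i) ∨ p(i) c(i) can even be exercised in software. The following hedged sketch (ours) adds ($s0) and ($s1) using only AND, XOR, and shift, making the carry propagation of Figure 10.6 explicit; it iterates until no carries remain, so at most 32 times:

        add  $t0,$zero,$s0    # working copy of x
        add  $t1,$zero,$s1    # working copy of y
loop:   and  $t2,$t0,$t1      # generate bits  g = x AND y
        xor  $t0,$t0,$t1      # propagate/sum bits p = x XOR y
        sll  $t1,$t2,1        # carries enter one position left
        bne  $t1,$zero,loop   # repeat until no carries remain
                              # result is now in $t0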
2011 Computer Architecture, The Arithmetic/Logic Unit x0 Slide 36 10.4 Design of Fast Adders • Carries can be computed directly without propagation • For example, by unrolling the equation for c3, we get: c3 = g2 ∨ p2 c2 = g2 ∨ p2 g1 ∨ p2 p1 g0 ∨ p2 p1 p0 c0 • We define “generate” and “propagate” signals for a block extending from bit position a to bit position b as follows: g[a,b] = gb ∨ pb gb–1 ∨ pb pb–1 gb–2 ∨ . . . ∨ pb pb–1 … pa+1 ga p[a,b] = pb pb–1 . . . pa+1 pa • Combining g and p signals for adjacent blocks: g[h,j] = g[i+1,j] ∨ p[i+1,j] g[h,i] p[h,j] = p[i+1,j] p[h,i] j i+1 i h [h, j] = [i + 1, j] ¢ [h, i] Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 37 Carries as Generate Signals for Blocks [ 0, i ] gi pi Carry is: 0 0 1 1 annihilated or killed propagated generated (impossible) 0 1 0 1 g k−1 p k−1 xi g k−2 p k−2 yi Assuming c0 = 0, we have ci = g [0,i –1] g i+1 p i+1 gi pi ... ... g1 p1 g0 p0 c0 Carry network ck c k−1 Figure 10.5 Jan. 2011 ... c k−2 ci c i+1 ... c1 c0 si Computer Architecture, The Arithmetic/Logic Unit Slide 38 Second Carry Speed-Up Method: Carry Lookahead [7, 7 ] [6, 6 ] ¢ [5, 5 ] [4, 4 ] [3, 3 ] ¢ [2, 2 ] [1, 1 ] ¢ [6, 7 ] [0, 0 ] g [1, 1] p [1, 1] g [0, 0] p [0, 0] ¢ [2, 3 ] [4, 5 ] [0, 1 ] ¢ ¢ [4, 7 ] [0, 3 ] ¢ ¢ ¢ [0, 7 ] [0, 6 ] ¢ [0, 5 ] [0, 4 ] [0, 3 ] ¢ [0, 2 ] g [0, 1] p [0, 1] [0, 1 ] [0, 0 ] Figure 10.11 Brent-Kung lookahead carry network for an 8-digit adder, along with details of one of the carry operator blocks. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 39 Recursive Structure of Brent-Kung Carry Network [7, 7] [6, 6] [5, 5] [4, 4] [3, 3] [2, 2] [1, 1] [0, 0] [7, 7 ] ¢ ¢ ¢ ¢ [6, 6 ] ¢ [5, 5 ] [4, 4 ] [3, 3 ] ¢ [2, 2 ] ¢ [6, 7 ] [2, 3 ] [0, 1 ] ¢ ¢ [4, 7 ] [0, 3 ] ¢ ¢ ¢ ¢ ¢ [0, 6] [0, 5] [0, 4] ¢ ¢ ¢ [0, 7 ] [0, 7] [0, 0 ] ¢ [4, 5 ] 4-input Brent-Kung carry network [1, 1 ] [0, 3] [0, 2] [0, 1] [0, 6 ] [0, 5 ] [0, 4 ] [0, 3 ] [0, 2 ] [0, 1 ] [0, 0 ] [0, 0] Figure 10.12 Brent-Kung lookahead carry network for an 8-digit adder, with only its top and bottom rows of carry-operators shown. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 40 An Alternate Design: Kogge-Stone Network ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ ¢ c7 = g [0,6] c8 = g [0,7] c5 = g [0,4] c6 = g [0,5] ¢ c1 = g [0,0] c3 = g [0,2] c4 = g [0,3] c2 = g [0,1] Kogge-Stone lookahead carry network for an 8-digit adder. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 41 Brent-Kung vs. Kogge-Stone Carry Network [7, 7 ] [6, 6 ] [5, 5 ] ¢ [4, 4 ] ¢ [3, 3 ] [2, 2 ] [1, 1 ] ¢ [6, 7 ] [0, 0 ] g [1, 1] p [1, 1] g [0, 0] p [0, 0] ¢ [2, 3 ] [4, 5 ] [0, 1 ] ¢ ¢ [4, 7 ] [0, 3 ] ¢ ¢ ¢ [0, 7 ] [0, 6 ] ¢ [0, 5 ] [0, 4 ] ¢ [0, 3 ] [0, 2 ] g [0, 1] p [0, 1] [0, 1 ] 11 carry operators 4 levels Jan. 2011 [0, 0 ] 17 carry operators 3 levels Computer Architecture, The Arithmetic/Logic Unit Slide 42 Carry-Lookahead Logic with 4-Bit Block p i+3 g i+3 p i+2 g i+2 p i+1 g i+1 p i gi g [i, i+3] Intermeidte carries p [i, i+3] Block signal generation ci c i+3 c i+2 c i+1 Figure 10.13 Blocks needed in the design of carry-lookahead adders with four-way grouping of bits. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 43 Third Carry Speed-Up Method: Carry Select Allows doubling of adder width with a single-mux additional delay x c out Adder c in Version 0 of sum bits c out 0 0 1 s Figure 10.14 Jan. 
2011 [a, b] c Adder y [a, b] c in Version 1 of sum bits a [a, b] 1 The lower a positions, (0 to a – 1) are added as usual Carry-select addition principle. Computer Architecture, The Arithmetic/Logic Unit Slide 44 10.5 Logic and Shift Operations Conceptually, shifts can be implemented by multiplexing Right-shifted values Left-shifted values 00...0, x[31] Right’Left Shift amount 5 00, x[30, 2] 0, x[31, 1] x[31, 0] 32 32 32 0 1 6-bit code specifying shift direction & amount Jan. 2011 x[30, 0], 0 x[31, 0] . . . 32 32 31 32 32 33 . . . 32 62 32 63 Multiplexer 6 Figure 10.15 2 x[0], 00...0 x[1, 0], 00...0 32 Multiplexer-based logical shifting unit. Computer Architecture, The Arithmetic/Logic Unit Slide 45 Arithmetic Shifts Purpose: Multiplication and division by powers of 2 sra $t0,$s1,2 srav $t0,$s1,$s0 op 31 R 25 20 rt 15 rd 10 sh 5 fn 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 ALU instruction op 31 R rs # $t0 ← ($s1) right-shifted by 2 # $t0 ← ($s1) right-shifted by ($s0) Unused 25 rs Source register 20 rt Destination register 15 rd Shift amount 10 sh sra = 3 5 fn 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 ALU instruction Figure 10.16 Jan. 2011 Amount register Source register Destination register Unused srav = 7 The two arithmetic shift instructions of MiniMIPS. Computer Architecture, The Arithmetic/Logic Unit Slide 46 Practical Shifting in Multiple Stages 00 01 10 11 No shift Logical left Logical right Arith right 0 x[31], x[31, 1] 0, x[31, 1] x[30, 0], 0 x[31, 0] 32 32 32 32 0 2 1 2 2 2 3 y[31, 0] 0 1 2 (0 or 2)-bit shift 3 2 z[31, 0] 3 Multiplexer 0 32 2 (a) Single-bit shifter Figure 10.17 Jan. 2011 1 (0 or 4)-bit shift 1 2 (0 or 1)-bit shift 3 (b) Shifting by up to 7 bits Multistage shifting in a barrel shifter. Computer Architecture, The Arithmetic/Logic Unit Slide 47 Bit Manipulation via Shifts and Logical Operations Bits 10-15 AND with mask to isolate a field: 0000 0000 0000 0000 1111 1100 0000 0000 Right-shift by 10 positions to move field to the right end of word The result word ranges from 0 to 63, depending on the field pattern 32-pixel (4 × 8) block of black-and-white image: Row 0 Row 1 Row 2 Row 3 Representation as 32-bit word: 1010 0000 0101 1000 0000 0110 0001 0111 Hex equivalent: 0xa0a80617 Figure 10.18 A 4 × 8 block of a black-and-white image represented as a 32-bit word. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 48 10.6 Multifunction ALUs Arith fn (add, sub, . . .) Operand 1 Arith unit 0 Result Operand 2 1 Logic unit Select fn type (logic or arith) Logic fn (AND, OR, . . .) General structure of a simple arithmetic/logic unit. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 49 Const′Var Shift function Constant 5 amount 0 Amount 5 1 5 Variable amount 2 00 01 10 11 No shift Logical left Logical right Arith right Shifter Function class 32 5 LSBs 00 01 10 11 An ALU for MiniMIPS Shift Set less Arithmetic Logic 2 Shifted y 0 x Adder y 0 or 1 c0 32 32 k / c 31 x±y s MSB 32 2 32 Shorthand symbol for ALU 1 Control c 32 3 x Func Add′Sub s ALU Logic unit AND OR XOR NOR 00 01 10 11 Ovfl y 32input NOR Zero 2 Logic function Zero Ovfl Figure 10.19 A multifunction ALU with 8 control signals (2 for function class, 1 arithmetic, 3 shift, 2 logic) specifying the operation. Jan. 
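The multistage shifting of Fig. 10.17 can be modeled by examining one bit of the shift amount per stage. A hedged Python sketch, assuming 32-bit operands and the shift-mode names used above (this is an illustrative model, not the book's circuit):

```python
# Barrel shifter in stages: bit i of the shift amount enables a fixed
# shift by 2**i, so five stages cover every amount from 0 to 31.
MASK32 = 0xFFFFFFFF

def barrel_shift(x, amount, mode):
    """mode: 'll' logical left, 'lr' logical right, 'ar' arithmetic right."""
    x &= MASK32
    for stage in range(5):
        step = 1 << stage
        if (amount >> stage) & 1:                  # is this stage enabled?
            if mode == 'll':
                x = (x << step) & MASK32
            else:
                fill = MASK32 if mode == 'ar' and (x >> 31) else 0
                x = (x >> step) | ((fill << (32 - step)) & MASK32)
    return x

print(hex(barrel_shift(0x80000010, 4, 'ar')))      # 0xf8000001: sign replicated
```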
2011 Computer Architecture, The Arithmetic/Logic Unit Slide 50 11 Multipliers and Dividers Modern processors perform many multiplications & divisions: • Encryption, image compression, graphic rendering • Hardware vs programmed shift-add/sub algorithms Topics in This Chapter 11.1 Shift-Add Multiplication 11.2 Hardware Multipliers 11.3 Programmed Multiplication 11.4 Shift-Subtract Division 11.5 Hardware Dividers 11.6 Programmed Division Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 51 11.1 Shift-Add Multiplication Multiplicand Multiplier x y Partial products bit-matrix y0 y 1 y2 y3 Product z Figure 11.1 x x x x 20 21 22 23 Multiplication of 4-bit numbers in dot notation. z(j+1) = (z(j) + yj x 2k) 2–1 with z(0) = 0 and z(k) = z |––– add –––| |–– shift right ––| Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 52 Binary and Decimal Multiplication Position 7 6 5 4 3 2 1 0 ========================= x24 1 0 1 0 y 0 0 1 1 ========================= z (0) 0 0 0 0 4 +y0x2 1 0 1 0 –––––––––––––––––––––––––– 2z (1) 0 1 0 1 0 (1) z 0 1 0 1 0 4 +y 1 x 2 1 0 1 0 –––––––––––––––––––––––––– 0 1 1 1 1 0 2z (2) (2) z 0 1 1 1 1 0 4 +y2x2 0 0 0 0 –––––––––––––––––––––––––– 2z (3) 0 0 1 1 1 1 0 (3) z 0 0 1 1 1 1 0 4 +y3x2 0 0 0 0 –––––––––––––––––––––––––– 2z (4) 0 0 0 1 1 1 1 0 (4) z 0 0 0 1 1 1 1 0 ========================= Figure 11.2 Jan. 2011 Example 11.1 Position 7 6 5 4 3 2 1 0 ========================= x104 3 5 2 8 y 4 0 6 7 ========================= z (0) 0 0 0 0 4 +y0x10 2 4 6 9 6 –––––––––––––––––––––––––– 10z (1) 2 4 6 9 6 (1) z 0 2 4 6 9 6 4 +y1x10 2 1 1 6 8 –––––––––––––––––––––––––– 10z (2) 2 3 6 3 7 6 (2) z 2 3 6 3 7 6 4 +y2x10 0 0 0 0 0 –––––––––––––––––––––––––– 10z (3) 0 2 3 6 3 7 6 (3) z 0 2 3 6 3 7 6 4 +y3x10 1 4 1 1 2 –––––––––––––––––––––––––– 10z (4) 1 4 3 4 8 3 7 6 (4) z 1 4 3 4 8 3 7 6 ========================= Step-by-step multiplication examples for 4-digit unsigned numbers. Computer Architecture, The Arithmetic/Logic Unit Slide 53 Two’s-Complement Multiplication Position 7 6 5 4 3 2 1 0 ========================= x24 1 0 1 0 y 0 0 1 1 ========================= z (0) 0 0 0 0 0 4 +y 0 x 2 1 1 0 1 0 –––––––––––––––––––––––––– 2z (1) 1 1 0 1 0 (1) z 1 1 1 0 1 0 4 +y 1 x 2 1 1 0 1 0 –––––––––––––––––––––––––– 1 0 1 1 1 0 2z (2) (2) z 1 1 0 1 1 1 0 4 +y 2 x 2 0 0 0 0 0 –––––––––––––––––––––––––– 2z (3) 1 1 0 1 1 1 0 (3) z 1 1 1 0 1 1 1 0 4 +(–y3x2 ) 0 0 0 0 0 –––––––––––––––––––––––––– 2z (4) 1 1 1 0 1 1 1 0 (4) z 1 1 1 0 1 1 1 0 ========================= Figure 11.3 Jan. 2011 Example 11.2 Position 7 6 5 4 3 2 1 0 ========================= x24 1 0 1 0 y 1 0 1 1 ========================= z (0) 0 0 0 0 0 4 +y0x2 1 1 0 1 0 –––––––––––––––––––––––––– 2z (1) 1 1 0 1 0 (1) z 1 1 1 0 1 0 4 +y1x2 1 1 0 1 0 –––––––––––––––––––––––––– 2z (2) 1 0 1 1 1 0 (2) z 1 1 0 1 1 1 0 4 +y2x2 0 0 0 0 0 –––––––––––––––––––––––––– 2z (3) 1 1 0 1 1 1 0 (3) z 1 1 1 0 1 1 1 0 4 +(–y3x2 ) 0 0 1 1 0 –––––––––––––––––––––––––– 2z (4) 0 0 0 1 1 1 1 0 (4) z 0 0 0 1 1 1 1 0 ========================= Step-by-step multiplication examples for 2’s-complement numbers. Computer Architecture, The Arithmetic/Logic Unit Slide 54 11.2 Hardware Multipliers Shift Multiplier y Doublewidth partial product z (j) Hi Shift Lo Multiplicand x 0 Mux 1 yj Enable Select c out Figure 11.4 Jan. 2011 Adder c in Add’Sub Hardware multiplier based on the shift-add algorithm. 
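A minimal Python rendering of the shift-add recurrence z(j+1) = (z(j) + yj x 2^k) 2^-1 (illustrative only; the hardware keeps z in a double-width shifting register rather than a Python integer):

```python
# Shift-add multiplication: z(j+1) = (z(j) + yj * x * 2**k) / 2, z(0) = 0.
# After k steps z holds the double-width product (cf. Figure 11.1).
def shift_add_multiply(x, y, k):
    """x, y: unsigned k-bit integers."""
    z = 0
    for j in range(k):
        yj = (y >> j) & 1              # multiplier bits examined LSB first
        z += yj * (x << k)             # add x, aligned with the upper half, or 0
        z >>= 1                        # one-bit right shift of the partial product
    return z

print(shift_add_multiply(0b1010, 0b0011, 4))   # 30, as in unsigned Example 11.1
```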
Computer Architecture, The Arithmetic/Logic Unit Slide 55 The Shift Part of Shift-Add
Figure 11.5 Shifting incorporated in the connections to the partial product register rather than as a separate phase. (The sum from the adder, along with its carry-out, is written back displaced one position to the right; the multiplier register supplies the next bit yj.)
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 56 High-Radix Multipliers
Radix-4 multiplication in dot notation: two multiplier bits are retired per cycle, with 0, x, 2x, or 3x added to the partial product.
z(j+1) = (z(j) + yj x 2^k) 4^-1 with z(0) = 0 and z(k/2) = z (assume k even)
|––– add –––| |–– shift right ––|
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 57 Tree Multipliers
(a) Full-tree multiplier: all partial products enter a large tree of carry-save adders, followed by a log-depth adder producing the product. (b) Partial-tree multiplier: several partial products at a time enter a small tree of carry-save adders, followed by an adder.
Figure 11.6 Schematic diagram for full/partial-tree multipliers.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 58 Array Multipliers
(Figure 9.3a, recalling carry-save addition, and the original dot notation for multiplication, with the dots straightened to depict the array multiplier: one row of MA cells for each of y0 through y3, finishing with a row of FA/HA cells that produce z7 down to z4.)
Figure 11.7 Array multiplier for 4-bit unsigned operands.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 59 11.3 Programmed Multiplication
MiniMIPS instructions related to multiplication
mult  $s0,$s1  # set Hi,Lo to ($s0)×($s1); signed
multu $s2,$s3  # set Hi,Lo to ($s2)×($s3); unsigned
mfhi  $t0      # set $t0 to (Hi)
mflo  $t1      # set $t1 to (Lo)
Example 11.3 Finding the 32-bit product of 32-bit integers in MiniMIPS: Multiply; result will be obtained in Hi,Lo. For unsigned multiplication: Hi should be all-0s and Lo holds the 32-bit result. For signed multiplication: Hi should be all-0s or all-1s, depending on the sign bit of Lo.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 60 Emulating a Hardware Multiplier in Software
Example 11.4 (MiniMIPS shift-add program for multiplication) Register usage, superimposed on the hardware multiplier of Figure 11.4: $a1 holds the multiplier y; $v0 the Hi part of z and $v1 the Lo part of z; $t0 the carry-out (it also holds the LSB of Hi during shifts); $a0 the multiplicand x; $t1 bit j of y; and $t2 the counter. Part of the control is in hardware.
Figure 11.8 Register usage for programmed multiplication superimposed on the block diagram for a hardware multiplier.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 61 Multiplication When There Is No Multiply Instruction
Example 11.4 (MiniMIPS shift-add program for multiplication)
shamu: move $v0,$zero       # initialize Hi to 0
       move $v1,$zero       # initialize Lo to 0
       addi $t2,$zero,32    # init repetition counter to 32
mloop: move $t0,$zero       # set c-out to 0 in case of no add
       move $t1,$a1         # copy ($a1) into $t1
       srl  $a1,$a1,1       # halve the unsigned value in $a1
       subu $t1,$t1,$a1     # subtract ($a1) from ($t1) twice to
       subu $t1,$t1,$a1     # obtain LSB of ($a1), or y[j], in $t1
       beqz $t1,noadd       # no addition needed if y[j] = 0
       addu $v0,$v0,$a0     # add x to upper part of z
       sltu $t0,$v0,$a0     # form carry-out of addition in $t0
noadd: move $t1,$v0         # copy ($v0) into $t1
       srl  $v0,$v0,1       # halve the unsigned value in $v0
       subu $t1,$t1,$v0     # subtract ($v0) from ($t1) twice to
       subu $t1,$t1,$v0     # obtain LSB of Hi in $t1
       sll  $t0,$t0,31      # carry-out converted to 1 in MSB of $t0
       addu $v0,$v0,$t0     # right-shifted $v0 corrected
       srl  $v1,$v1,1       # halve the unsigned value in $v1
       sll  $t1,$t1,31      # LSB of Hi converted to 1 in MSB of $t1
       addu $v1,$v1,$t1     # right-shifted $v1 corrected
       addi $t2,$t2,-1      # decrement repetition counter by 1
       bne  $t2,$zero,mloop # if counter > 0, repeat multiply loop
       jr   $ra             # return to the calling program
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 62 11.4 Shift-Subtract Division
Division of an 8-bit number by a 4-bit number in dot notation: the subtracted bit-matrix y3 x 2^3, y2 x 2^2, y1 x 2^1, y0 x 2^0 is removed from the dividend z, leaving the remainder s and producing the quotient y.
Figure 11.9 Division of an 8-bit number by a 4-bit number in dot notation.
z(j) = 2z(j-1) - yk-j x 2^k with z(0) = z and z(k) = 2^k s
|shift| |–– subtract ––|
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 63 Integer and Fractional Unsigned Division
Example 11.5 (step-by-step tables condensed): binary integer case z = 0111 0101, x = 1010 yields quotient y = 1011 and remainder s = 0111; decimal fractional case z = .14351502, x = .4067 yields y = .3528 and s = .00003126.
Figure 11.10 Division examples for binary integers and decimal fractions.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 64 Division with Same-Width Operands
Figure 11.11 Jan.
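The shift-subtract recurrence can be sketched the same way in Python; this is plain restoring division, with a trial comparison standing in for the trial subtraction (illustrative only, with no overflow check):

```python
# Restoring shift-subtract division: z(j) = 2 z(j-1) - y[k-j] * x * 2**k,
# one quotient bit per step, MSB first; at the end z(k) = 2**k * s.
def shift_subtract_divide(z, x, k):
    """z: 2k-bit unsigned dividend, x: k-bit divisor; returns (y, s).
    Assumes the quotient fits in k bits (no overflow check, for brevity)."""
    y = 0
    for _ in range(k):
        z <<= 1                           # shift the partial remainder left
        bit = 1 if z >= (x << k) else 0   # trial comparison picks the digit
        y = (y << 1) | bit
        if bit:
            z -= x << k                   # subtract the aligned divisor
    return y, z >> k                      # remainder comes out scaled by 2**k

print(shift_subtract_divide(0b01110101, 0b1010, 4))   # (11, 7): Example 11.5
```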
2011 Example 11.6 (condensed): binary integer case z = 0000 1101, x = 0101 yields y = 0010 and s = 0011; binary fractional case z = .0101, x = .1101 yields y = .0110 and s = .00000010.
Division examples for 4/4-digit binary integers and fractions. Computer Architecture, The Arithmetic/Logic Unit Slide 65 Signed Division
Method 1 (indirect): strip operand signs, divide, set result signs
Dividend z = 5, divisor x = 3 ⇒ quotient y = 1, remainder s = 2
z = 5, x = -3 ⇒ y = -1, s = 2
z = -5, x = 3 ⇒ y = -1, s = -2
z = -5, x = -3 ⇒ y = 1, s = -2
Method 2 (direct 2’s complement): develop quotient with digits -1 and 1, chosen based on signs, convert to digits 0 and 1
Restoring division: perform trial subtraction, choose 0 for q digit if partial remainder negative
Nonrestoring division: if sign of partial remainder is correct, then subtract (choose 1 for q digit) else add (choose -1)
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 66 11.5 Hardware Dividers
(Block diagram: shift registers for the quotient y and the partial remainder z(j), initially z; a quotient digit selector; the divisor x gated into the adder through a mux; the adder always subtracts, and the trial difference is loaded back when the quotient digit is 1.)
Figure 11.12 Hardware divider based on the shift-subtract algorithm.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 67 The Shift Part of Shift-Subtract
Figure 11.13 Shifting incorporated in the connections to the partial remainder register rather than as a separate phase.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 68 High-Radix Dividers
Radix-4 division in dot notation: two quotient digits are retired per cycle, with 0, x, 2x, or 3x subtracted.
z(j) = 4z(j-1) - (yk-2j+1 yk-2j)two x 2^k with z(0) = z and z(k/2) = 2^k s (assume k even)
|shift| |––––––– subtract –––––––|
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 69 Array Dividers
(The original dot notation for division, with the dots straightened to depict an array divider built of MS cells, producing s3 down to s0 from dividend bits z7 down to z0 and divisor bits x3 down to x0.)
Figure 11.14 Array divider for 8/4-bit unsigned integers.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 70 11.6 Programmed Division
MiniMIPS instructions related to division
div  $s0,$s1  # Lo = quotient, Hi = remainder
divu $s2,$s3  # unsigned version of division
mfhi $t0      # set $t0 to (Hi)
mflo $t1      # set $t1 to (Lo)
Example 11.7 Compute z mod x, where z (signed) and x > 0 are integers: Divide; remainder will be obtained in Hi. If remainder is negative, then add |x| to (Hi) to obtain z mod x; else Hi holds z mod x. Jan.
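Method 1's sign rules can be checked with a small Python sketch (illustrative; note that Python's own // and % operators floor toward minus infinity, so the magnitudes are divided explicitly here):

```python
# Method 1 signed division: divide magnitudes, then set result signs so
# that the remainder follows the dividend's sign and z = x*y + s holds.
def signed_divide(z, x):
    y, s = abs(z) // abs(x), abs(z) % abs(x)
    if (z < 0) != (x < 0):
        y = -y                         # quotient sign = XOR of operand signs
    if z < 0:
        s = -s                         # remainder sign follows the dividend
    return y, s

for z, x in [(5, 3), (5, -3), (-5, 3), (-5, -3)]:
    y, s = signed_divide(z, x)
    print(f"z={z:2} x={x:2} -> y={y:2} s={s:2}  check: {x*y + s == z}")
```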
2011 Computer Architecture, The Arithmetic/Logic Unit Slide 71 Emulating a Hardware Divider in Software
Example 11.8 (MiniMIPS shift-add program for division) Register usage, superimposed on the hardware divider of Figure 11.12: $a1 holds the quotient y; $v0 the Hi part of z (the partial remainder) and $v1 the Lo part of z; $t0 the MSB of Hi; $a0 the divisor x; $t1 bit k-j of y; and $t2 the counter. Part of the control is in hardware.
Figure 11.15 Register usage for programmed division superimposed on the block diagram for a hardware divider.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 72 Division When There Is No Divide Instruction
Example 11.8 (MiniMIPS shift-add program for division)
shsdi: move $v0,$a2         # initialize Hi to ($a2)
       move $v1,$a3         # initialize Lo to ($a3)
       addi $t2,$zero,32    # initialize repetition counter to 32
dloop: slt  $t0,$v0,$zero   # copy MSB of Hi into $t0
       sll  $v0,$v0,1       # left-shift the Hi part of z
       slt  $t1,$v1,$zero   # copy MSB of Lo into $t1
       or   $v0,$v0,$t1     # move MSB of Lo into LSB of Hi
       sll  $v1,$v1,1       # left-shift the Lo part of z
       sge  $t1,$v0,$a0     # quotient digit is 1 if (Hi) ≥ x,
       or   $t1,$t1,$t0     # or if MSB of Hi was 1 before shifting
       sll  $a1,$a1,1       # shift y to make room for new digit
       or   $a1,$a1,$t1     # copy y[k-j] into LSB of $a1
       beq  $t1,$zero,nosub # if y[k-j] = 0, do not subtract
       subu $v0,$v0,$a0     # subtract divisor x from Hi part of z
nosub: addi $t2,$t2,-1      # decrement repetition counter by 1
       bne  $t2,$zero,dloop # if counter > 0, repeat divide loop
       move $v1,$a1         # copy the quotient y into $v1
       jr   $ra             # return to the calling program
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 73 Divider vs Multiplier: Hardware Similarities
(Figure 11.12 set beside Figure 11.4: the shift-subtract divider and the shift-add multiplier share the same register/mux/adder structure, the divider always subtracting where the multiplier conditionally adds. Likewise, turning the array divider of Figure 11.14 upside-down reveals essentially the cell arrangement of the array multiplier of Figure 11.7.)
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 74 12 Floating-Point Arithmetic
Floating-point is no longer reserved for high-end machines
• Multimedia and signal processing require flp arithmetic
• Details of standard flp format and arithmetic operations
Topics in This Chapter
12.1 Rounding Modes
12.2 Special Values and Exceptions
12.3 Floating-Point Addition
12.4 Other Floating-Point Operations
12.5 Floating-Point Instructions
12.6 Result Precision and Errors Jan.
2011 Computer Architecture, The Arithmetic/Logic Unit Slide 75 12.1 Rounding Modes
IEEE 754 format: 1.f × 2^e
Short (32-bit) format: sign, 8-bit exponent (bias = 127, range -126 to 127), 23 bits for fractional part (plus hidden 1 in integer part)
Long (64-bit) format: sign, 11-bit exponent (bias = 1023, range -1022 to 1023), 52 bits for fractional part (plus hidden 1 in integer part)
Special codes: ±0, ±∞, NaN. Denormals, 0.f × 2^emin, allow graceful underflow.
(Number line: overflow regions beyond ±max; representable values sparser toward ±max and denser toward ±min; underflow regions between ±min and ±0; annotated with typical, midway, underflow, and overflow examples.)
Figure 12.1 Distribution of floating-point numbers on the real line.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 76 Round-to-Nearest (Even)
(Staircase plots of rtnei(x) and rtni(x) for x in [-4, 4]: (a) round to nearest even integer, (b) round to nearest integer.)
Figure 12.2 Two round-to-nearest-integer functions for x in [-4, 4].
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 77 Directed Rounding
(Staircase plots of ritni(x) and rutni(x) for x in [-4, 4]: (a) round inward to nearest integer, (b) round upward to nearest integer.)
Figure 12.3 Two directed round-to-nearest-integer functions for x in [-4, 4].
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 78 12.2 Special Values and Exceptions
Zeros, infinities, and NaNs (not a number)
±0 Biased exponent = 0, significand = 0 (no hidden 1)
±∞ Biased exponent = 255 (short) or 2047 (long), significand = 0
NaN Biased exponent = 255 (short) or 2047 (long), significand ≠ 0
Arithmetic operations with special operands
(+0) + (+0) = (+0) - (-0) = +0
(+0) × (+5) = +0
(+0) / (-5) = -0
(+∞) + (+∞) = +∞
x - (+∞) = -∞
(+∞) × x = ±∞, depending on the sign of x
x / (+∞) = ±0, depending on the sign of x
√(+∞) = +∞
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 79 Exceptions
Undefined results lead to NaN (not a number)
(±0) / (±0) = NaN   (+∞) + (-∞) = NaN   (±0) × (±∞) = NaN   (±∞) / (±∞) = NaN
Arithmetic operations and comparisons with NaNs
NaN + x = NaN   NaN + NaN = NaN   NaN × 0 = NaN   NaN × NaN = NaN
NaN < 2 → false   NaN = NaN → false   NaN ≠ (+∞) → true   NaN ≠ NaN → true
Examples of invalid-operation exceptions
Addition: (+∞) + (-∞)
Multiplication: 0 × ∞
Division: 0 / 0 or ∞ / ∞
Square-root: Operand < 0
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 80 12.3 Floating-Point Addition
(±2^e1 s1) + (±2^e2 s2) = (±2^e1 s1) + (±2^e1 (s2 / 2^(e1-e2))) = ±2^e1 (s1 ± s2 / 2^(e1-e2))
Numbers to be added: x = 2^5 × 1.00101101, y = 2^1 × 1.11101101 (operand with smaller exponent to be preshifted)
Operands after alignment shift: x = 2^5 × 1.00101101, y = 2^5 × 0.000111101101
Result of addition: s = 2^5 × 1.010010111101 (extra bits to be rounded off); rounded sum s = 2^5 × 1.01001100
Figure 12.4 Alignment shift and rounding in floating-point addition.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 81 Hardware for Floating-Point Addition
(Input 1 and Input 2 are unpacked into signs, exponents, and significands; a mux and subtractor find the exponent difference, the significands are possibly swapped and complemented, aligned, and added, and the result is normalized, rounded, and packed with the sign and exponent into the output. Control and sign logic coordinates the steps.)
Figure 12.5 Simplified schematic of a floating-point adder. Jan.
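The alignment-and-round sequence of Fig. 12.4 can be sketched in Python for positive operands; significands are integers scaled by 2^FRAC, and the simple round-to-nearest used here ignores the tie-to-even rule and the rare post-rounding renormalization (the names and the 8-bit fraction width are illustrative):

```python
# Floating-point addition for positive operands: align, add, normalize,
# round. Significands are integers scaled by 2**FRAC (i.e., 1.f form).
FRAC = 8

def fp_add(e1, s1, e2, s2):
    if e1 < e2:                        # let operand 1 carry the larger exponent
        e1, s1, e2, s2 = e2, s2, e1, s1
    d = e1 - e2                        # alignment (preshift) distance
    s = (s1 << d) + s2                 # exact sum, scaled by 2**(FRAC + d)
    e, extra = e1, d                   # 'extra' = bits beyond FRAC to round off
    if s >= 2 << (FRAC + extra):       # sum landed in [2, 4): normalize
        e, extra = e + 1, extra + 1    # halving == rounding off one more bit
    if extra:
        s = (s + (1 << (extra - 1))) >> extra   # round to nearest
    return e, s

# The Figure 12.4 example: 2^5 x 1.00101101 plus 2^1 x 1.11101101
print(fp_add(5, 0b100101101, 1, 0b111101101))   # (5, 332) = 2^5 x 1.01001100
```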
2011 Computer Architecture, The Arithmetic/Logic Unit Slide 82 12.4 Other Floating-Point Operations
Floating-point multiplication (overflow or underflow possible)
(±2^e1 s1) × (±2^e2 s2) = ±2^(e1+e2) (s1 × s2)
Product of significands in [1, 4); if product is in [2, 4), halve to normalize (increment exponent)
Floating-point division (overflow or underflow possible)
(±2^e1 s1) / (±2^e2 s2) = ±2^(e1-e2) (s1 / s2)
Ratio of significands in (1/2, 2); if ratio is in (1/2, 1), double to normalize (decrement exponent)
Floating-point square-rooting (normalization not needed)
(2^e s)^(1/2) = 2^(e/2) (s)^(1/2) when e is even
             = 2^((e-1)/2) (2s)^(1/2) when e is odd
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 83 Hardware for Floating-Point Multiplication and Division
(Input 1 and Input 2 are unpacked; the exponents are added or subtracted and the significands multiplied or divided under Mul′Div control; the result is then normalized, rounded, and packed, with control and sign logic setting the result sign.)
Figure 12.6 Simplified schematic of a floating-point multiply/divide unit.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 84 12.5 Floating-Point Instructions
Floating-point arithmetic instructions for MiniMIPS:
add.s $f0,$f8,$f10  # set $f0 to ($f8) +fp ($f10)
sub.d $f0,$f8,$f10  # set $f0 to ($f8) -fp ($f10)
mul.d $f0,$f8,$f10  # set $f0 to ($f8) ×fp ($f10)
div.s $f0,$f8,$f10  # set $f0 to ($f8) /fp ($f10)
neg.s $f0,$f8       # set $f0 to -($f8)
(F-format fields op, ex, ft, fs, fd, fn; function codes add.* = 0, sub.* = 1, mul.* = 2, div.* = 3, neg.* = 7.)
Figure 12.7 The common floating-point instruction format for MiniMIPS and components for arithmetic instructions. The extension (ex) field distinguishes single (* = s, ex = 0) from double (* = d, ex = 1) operands.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 85 The Floating-Point Unit in MiniMIPS
(Memory of up to 2^30 words; the execution and integer unit EIU, the main processor, with registers $0-$31, ALU, and integer mul/div with Hi and Lo; the floating-point unit FPU, coprocessor 1, with registers $0-$31, where pairs of registers, beginning with an even-numbered one, are used for double operands; and the trap and memory unit TMU, coprocessor 0, with BadVaddr, Status, Cause, and EPC.)
Figure 5.1 Memory and processing subsystems for MiniMIPS.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 86 Floating-Point Format Conversions
MiniMIPS instructions for number format conversion:
cvt.s.w $f0,$f8  # set $f0 to single(integer $f8)
cvt.d.w $f0,$f8  # set $f0 to double(integer $f8)
cvt.d.s $f0,$f8  # set $f0 to double($f8)
cvt.s.d $f0,$f8  # set $f0 to single($f8,$f9)
cvt.w.s $f0,$f8  # set $f0 to integer($f8)
cvt.w.d $f0,$f8  # set $f0 to integer($f8,$f9)
(To-format codes in fn: s = 32, d = 33, w = 36.)
Figure 12.8 Floating-point instructions for format conversion in MiniMIPS.
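A matching sketch for multiplication, under the same illustrative 8-bit-fraction convention: exponents add, the significand product falls in [1, 4), and one halving renormalizes it when it reaches [2, 4):

```python
# Floating-point multiplication: exponents add; the significand product
# lies in [1, 4) and is halved once if it reaches [2, 4), then rounded.
FRAC = 8

def fp_mul(e1, s1, e2, s2):
    e = e1 + e2
    s = s1 * s2                        # scaled by 2**(2 * FRAC)
    if s >= 2 << (2 * FRAC):           # product in [2, 4): normalize
        s >>= 1
        e += 1
    return e, (s + (1 << (FRAC - 1))) >> FRAC   # drop FRAC bits, rounding

print(fp_mul(5, 0b110000000, 1, 0b110000000))   # 1.5 x 1.5: (7, 0b100100000)
```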
Computer Architecture, The Arithmetic/Logic Unit Slide 87 Floating-Point Data Transfers
MiniMIPS instructions for floating-point load, store, and move:
lwc1  $f8,40($s3)  # load mem[40+($s3)] into $f8
swc1  $f8,A($s3)   # store ($f8) into mem[A+($s3)]
mov.s $f0,$f8      # load $f0 with ($f8)
mov.d $f0,$f8      # load $f0,$f1 with ($f8,$f9)
mfc1  $t0,$f12     # load $t0 with ($f12)
mtc1  $f8,$t4      # load $f8 with ($t4)
(Encodings: mov.* uses the F format with fn = 6; mfc1 and mtc1 use the R format, with codes 0 and 4, respectively.)
Figure 12.9 Instructions for floating-point data movement in MiniMIPS.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 88 Floating-Point Branches and Comparisons
MiniMIPS instructions for floating-point branching and comparison:
bc1t L          # branch on fp flag true
bc1f L          # branch on fp flag false
c.eq.* $f0,$f8  # if ($f0) = ($f8), set flag to “true”
c.lt.* $f0,$f8  # if ($f0) < ($f8), set flag to “true”
c.le.* $f0,$f8  # if ($f0) ≤ ($f8), set flag to “true”
(Encodings: bc1? uses the I format with op = 8, the rt field selecting true = 1 or false = 0; the comparisons use the F format with fn codes c.eq.* = 50, c.lt.* = 60, c.le.* = 62.)
Figure 12.10 Floating-point branch and comparison instructions in MiniMIPS.
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 89 Floating-Point Instructions of MiniMIPS
Table 12.1 (a * in a name stands for s/d, single/double) groups the instructions as: Copy: mov.* fd,fs; mfc1 rt,rd (move from coprocessor 1); mtc1 rd,rt (move to coprocessor 1). Arithmetic: add.*, sub.*, mul.*, div.* fd,fs,ft; neg.* fd,fs; c.eq.*, c.lt.*, c.le.* fs,ft. Conversions: cvt.s.w, cvt.d.w, cvt.d.s, cvt.s.d, cvt.w.s, cvt.w.d fd,fs. Memory access: lwc1 ft,imm(rs); swc1 ft,imm(rs). Control transfer: bc1t L; bc1f L. For each, the table also lists the ex and fn encodings (e.g., add.* = 0, c.eq.* = 50, cvt.d.w = 33, bc1? op = 8).
Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 90 12.6 Result Precision and Errors
Example 12.4 Laws of algebra may not hold in floating-point arithmetic. For example, the following computations show that the associative law of addition, (a + b) + c = a + (b + c), is violated for the three numbers shown.
Numbers to be added first a =-25 × 1.10101011 b = 25 × 1.10101110 Numbers to be added first b = 25 × 1.10101110 c =-2−2 × 1.01100101 Compute a + b 25 × 0.00000011 a+b = 2−2 × 1.10000000 c =-2−2 × 1.01100101 Compute b + c (after preshifting c) 25 × 1.101010110011011 b+c = 25 × 1.10101011 (Round) a =-25 × 1.10101011 Compute (a + b) + c 2−2 × 0.00011011 Sum = 2−6 × 1.10110000 Compute a + (b + c) 25 × 0.00000000 Sum = 0 (Normalize to special code for 0) Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 91 Error Control and Certifiable Arithmetic Catastrophic cancellation in subtracting almost equal numbers: Area of a needlelike triangle A = [s(s – a)(s – b)(s – c)]1/2 c b a Possible remedies Carry extra precision in intermediate results (guard digits): commonly used in calculators Use alternate formula that does not produce cancellation errors Certifiable arithmetic with intervals A number is represented by its lower and upper bounds [xl, xu] Example of arithmetic: [xl, xu] +interval [yl, yu] = [xl +fp∇ yl, xu +fpΔ yu] Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 92 Evaluation of Elementary Functions Approximating polynomials ln x = 2(z + z3/3 + z5/5 + z7/7 + . . . ) where z = (x – 1)/(x + 1) ex = 1 + x/1! + x2/2! + x3/3! + x4/4! + . . . cos x = 1 – x2/2! + x4/4! – x6/6! + x8/8! – . . . tan–1 x = x – x3/3 + x5/5 – x7/7 + x9/9 – . . . Iterative (convergence) schemes For example, beginning with an estimate for x1/2, the following iterative formula provides a more accurate estimate in each step q(i+1) = 0.5(q(i) + x/q(i)) Table lookup (with interpolation) A pure table lookup scheme results in huge tables (impractical); hence, often a hybrid approach, involving interpolation, is used. Jan. 2011 Computer Architecture, The Arithmetic/Logic Unit Slide 93 Function Evaluation by Table Lookup h bits xH Input x k - h bits xL xL Table for a f(x) Table for b Best linear approximation in subinterval Multiply x xH Add Output Figure 12.12 Jan. 2011 f(x) The linear approximation above is characterized by the line equation a + b x L , where a and b are read out from tables based on x H Function evaluation by table lookup and linear interpolation. Computer Architecture, The Arithmetic/Logic Unit Slide 94 Part IV Data Path and Control Feb. 2011 Computer Architecture, Data Path and Control Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First July 2003 July 2004 July 2005 Mar. 2006 Feb. 2007 Feb. 2008 Feb. 2009 Feb. 2011 Feb. 2011 Computer Architecture, Data Path and Control Slide 2 A Few Words About Where We Are Headed Performance = 1 / Execution time simplified to 1 / CPU execution time CPU execution time = Instructions × CPI / (Clock rate) Performance = Clock rate / ( Instructions × CPI ) Try to achieve CPI = 1 with clock that is as high as that for CPI > 1 designs; is CPI < 1 feasible? (Chap 15-16) Design memory & I/O structures to support ultrahigh-speed CPUs (chap 17-24) Feb. 
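The performance relation above is easy to sanity-check numerically; the instruction count below is an assumed figure, and the two hypothetical machines echo the single-cycle and multicycle designs developed later in this part:

```python
# Performance = clock rate / (instructions x CPI). A quick check with
# made-up numbers: the same program on two hypothetical machines.
def cpu_time(instr_count, cpi, clock_hz):
    return instr_count * cpi / clock_hz

prog = 50_000_000                        # 50 M executed instructions (assumed)
t1 = cpu_time(prog, 1.0, 125e6)          # CPI = 1 at 125 MHz (single-cycle-like)
t2 = cpu_time(prog, 4.0, 500e6)          # CPI = 4 at 500 MHz (multicycle-like)
print(t1, t2)                            # 0.4 s each: equal performance
```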
2011 Define an instruction set; make it simple enough to require a small number of cycles and allow high clock rate, but not so simple that we need many instructions, even for very simple tasks (Chap 5-8) Computer Architecture, Data Path and Control Design hardware for CPI = 1; seek improvements with CPI > 1 (Chap 13-14) Design ALU for arithmetic & logic ops (Chap 9-12) Slide 3 IV Data Path and Control Design a simple computer (MicroMIPS) to learn about: • Data path – part of the CPU where data signals flow • Control unit – guides data signals through data path • Pipelining – a way of achieving greater performance Topics in This Part Chapter 13 Instruction Execution Steps Chapter 14 Control Unit Synthesis Chapter 15 Pipelined Data Paths Chapter 16 Pipeline Performance Limits Feb. 2011 Computer Architecture, Data Path and Control Slide 4 13 Instruction Execution Steps A simple computer executes instructions one at a time • Fetches an instruction from the loc pointed to by PC • Interprets and executes the instruction, then repeats Topics in This Chapter 13.1 A Small Set of Instructions 13.2 The Instruction Execution Unit 13.3 A Single-Cycle Data Path 13.4 Branching and Jumping 13.5 Deriving the Control Signals 13.6 Performance of the Single-Cycle Design Feb. 2011 Computer Architecture, Data Path and Control Slide 5 13.1 A Small Set of Instructions 31 R I op 25 rs 20 rt 15 rd 10 sh fn 5 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits Opcode Source 1 or base Source 2 or dest’n Destination Unused Opcode ext imm Operand / Offset, 16 bits jta J Jump target address, 26 bits inst Instruction, 32 bits Fig. 13.1 MicroMIPS instruction formats and naming of the various fields. We will refer to this diagram later Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) Six I-format ALU instructions (lui, addi, slti, andi, ori, xori) Two I-format memory access instructions (lw, sw) Three I-format conditional branch instructions (bltz, beq, bne) Four unconditional jump instructions (j, jr, jal, syscall) Feb. 2011 Computer Architecture, Data Path and Control Slide 6 0 The MicroMIPS Instruction Set Copy Arithmetic Logic Memory access Control transfer Table 13.1 Feb. 2011 Instruction Usage Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch less than 0 Branch equal Branch not equal Jump and link System call lui rt,imm add rd,rs,rt sub rd,rs,rt slt rd,rs,rt addi rt,rs,imm slti rd,rs,imm and rd,rs,rt or rd,rs,rt xor rd,rs,rt nor rd,rs,rt andi rt,rs,imm ori rt,rs,imm xori rt,rs,imm lw rt,imm(rs) sw rt,imm(rs) j L jr rs bltz rs,L beq rs,rt,L bne rs,rt,L jal L syscall Computer Architecture, Data Path and Control op fn 15 0 0 0 8 10 0 0 0 0 12 13 14 35 43 2 0 1 4 5 3 0 Slide 7 32 34 42 36 37 38 39 8 12 13.2 The Instruction Execution Unit 31 beq,bne syscall R I 25 rs 20 rt 15 6 bits 5 bits 5 bits Opcode Source 1 or base Source 2 or dest’n rd 10 5 bits Destination sh fn 5 5 bits 6 bits Unused Opcode ext 0 imm Operand / Offset, 16 bits Next addr bltz,jr jta j,jal rs,rt,rd PC Instr cache op Jump target address, 26 bits inst Instruction, 32 bits 22 instructions (rs) Reg file inst jta J 12 A/L, lui, lw,sw ALU Address Data Data cache (rt) imm op fn Control Harvard architecture Fig. 13.2 Abstract view of the instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1. Feb. 
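Field extraction per Fig. 13.1 is just shifting and masking. A small Python sketch (the example encoding uses standard MIPS register numbers, e.g., $s1 = 17, as an assumption):

```python
# Slicing a 32-bit MicroMIPS instruction word into its fields (Fig. 13.1).
def decode_fields(inst):
    return {
        'op':  (inst >> 26) & 0x3F,    # bits 31-26
        'rs':  (inst >> 21) & 0x1F,    # bits 25-21
        'rt':  (inst >> 16) & 0x1F,    # bits 20-16
        'rd':  (inst >> 11) & 0x1F,    # bits 15-11 (R format)
        'sh':  (inst >>  6) & 0x1F,    # bits 10-6  (R format)
        'fn':  inst & 0x3F,            # bits 5-0   (R format)
        'imm': inst & 0xFFFF,          # bits 15-0  (I format)
        'jta': inst & 0x03FFFFFF,      # bits 25-0  (J format)
    }

# add $t0,$s1,$s2: op=0, rs=17, rt=18, rd=8, sh=0, fn=32
f = decode_fields(0b000000_10001_10010_01000_00000_100000)
print(f['op'], f['rs'], f['rt'], f['rd'], f['fn'])   # 0 17 18 8 32
```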
2011 Computer Architecture, Data Path and Control Slide 8 13.3 A Single-Cycle Data Path Incr PC Next addr jta Next PC ALUOvfl (PC) (rs) rs rt PC Instr cache inst rd 31 imm op Br&Jump Instruction fetch Fig. 13.3 Feb. 2011 0 1 2 Register writeback Ovfl Reg file ALU (rt) / 16 ALU out Data cache Data out Data in Func 0 32 SE / 1 Data addr 0 1 2 Register input fn RegDst RegWrite Reg access / decode ALUSrc ALUFunc ALU operation RegInSrc DataRead DataWrite Data access Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control Slide 9 Const′Var Shift function Constant 5 amount 0 Amount 5 1 5 Variable amount 2 00 01 10 11 No shift Logical left Logical right Arith right Shifter Function class 32 5 LSBs Adder y 0 or 1 c0 32 k / c 31 x±y Shift Set less Arithmetic Logic 0 imm 32 00 01 10 11 2 Shifted y x An ALU for MicroMIPS lui s 32 2 32 Shorthand symbol for ALU 1 MSB We use only 5 control signals (no shifts) Control c 32 3 5 x Func Add′Sub s ALU Logic unit AND OR XOR NOR 00 01 10 11 Ovfl y 32input NOR Zero 2 Logic function Zero Ovfl Fig. 10.19 A multifunction ALU with 8 control signals (2 for function class, 1 arithmetic, 3 shift, 2 logic) specifying the operation. Feb. 2011 Computer Architecture, Data Path and Control Slide 10 13.4 Branching and Jumping Update options for PC (PC)31:2 + 1 (PC)31:2 + 1 + imm (PC)31:28 | jta (rs)31:2 SysCallAddr Default option When instruction is branch and condition is met When instruction is j or jal When the instruction is jr Start address of an operating system routine Lowest 2 bits of PC always 00 BrTrue / 30 IncrPC / 30 Adder c in 0 1 2 3 NextPC / 30 PCSrc Fig. 13.4 Feb. 2011 / 30 / 30 / 30 / 30 4 MSBs 1 / 30 Branch condition checker (rt) / 32 (rs) / 30 (PC)31:2 / 26 jta 30 MSBs SE / 30 / 32 4 16 imm MSBs SysCallAddr BrType Next-address logic for MicroMIPS (see top part of Fig. 13.3). Computer Architecture, Data Path and Control Slide 11 13.5 Deriving the Control Signals Table 13.2 Control signals for the single-cycle MicroMIPS implementation. Control signal Reg file ALU Data cache Next addr Feb. 2011 0 1 2 3 RegWrite Don’t write Write RegDst1, RegDst0 rt rd $31 RegInSrc1, RegInSrc0 Data out ALU out IncrPC ALUSrc (rt ) imm Add′Sub Add Subtract LogicFn1, LogicFn0 AND OR XOR NOR FnClass1, FnClass0 lui Set less Arithmetic Logic DataRead Don’t read Read DataWrite Don’t write Write BrType1, BrType0 No branch beq bne bltz PCSrc1, PCSrc0 IncrPC jta (rs) SysCallAddr Computer Architecture, Data Path and Control Slide 12 Single-Cycle Data Path, Repeated for Reference Incr PC Outcome of an executed instruction: A new value loaded into PC Possible new value in a reg or memory loc Next addr jta Next PC ALUOvfl (PC) (rs) rs rt PC Instr cache inst rd 31 imm op Br&Jump Instruction fetch Fig. 13.3 Feb. 2011 0 1 2 Register writeback Ovfl Reg file ALU (rt) / 16 ALU out Data cache Data out Data in Func 0 32 SE / 1 Data addr 0 1 2 Register input fn RegDst RegWrite Reg access / decode ALUSrc ALUFunc ALU operation RegInSrc DataRead DataWrite Data access Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control Slide 13 Feb. 
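The next-address options can be captured behaviorally; this Python sketch follows the PCSrc encoding of Table 13.2, treats PC as the 30-bit word address (PC)31:2, and uses an assumed constant for SysCallAddr:

```python
# Next-address logic in the style of Fig. 13.4.
SYSCALL_ADDR = 0x0000400               # illustrative OS entry, word address

def next_pc(pc, pc_src, br_true=False, imm=0, jta=0, rs_val=0):
    incr_pc = (pc + 1) & 0x3FFFFFFF    # (PC)31:2 + 1
    if pc_src == 0:                    # IncrPC, or branch target when taken
        return (incr_pc + imm) & 0x3FFFFFFF if br_true else incr_pc
    if pc_src == 1:                    # j / jal: (PC)31:28 | jta
        return (pc & 0x3C000000) | (jta & 0x03FFFFFF)
    if pc_src == 2:                    # jr: (rs)31:2
        return (rs_val >> 2) & 0x3FFFFFFF
    return SYSCALL_ADDR                # syscall

print(hex(next_pc(0x100, 0, br_true=True, imm=-4)))   # 0xfd: back 4 words
```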
2011 Computer Architecture, Data Path and Control
Control Signal Settings
Table 13.3 Control signal settings for the 22 MicroMIPS instructions: for each instruction, the op and fn codes together with the values of RegWrite, RegDst, RegInSrc, ALUSrc, Add′Sub, LogicFn, FnClass, DataRead, DataWrite, BrType, and PCSrc (tabular detail condensed; Table 13.2 gives the meaning of each signal). Slide 14
Control Signals in the Single-Cycle Data Path
(Fig. 13.3 redrawn with the Table 13.3 signal values for lui and slt annotated at the RegDst, RegWrite, ALUSrc, ALUFunc, RegInSrc, DataRead, DataWrite, BrType, and PCSrc control points.)
Fig. 13.3 Key elements of the single-cycle MicroMIPS data path. Computer Architecture, Data Path and Control Slide 15
Instruction Decoding
(Two 6-to-64 decoders: the op decoder produces bltzInst (1), jInst (2), jalInst (3), beqInst (4), bneInst (5), addiInst (8), sltiInst (10), andiInst (12), oriInst (13), xoriInst (14), luiInst (15), lwInst (35), and swInst (43); its output 0, RtypeInst, enables the fn decoder, which produces jrInst (8), syscallInst (12), addInst (32), subInst (34), andInst (36), orInst (37), xorInst (38), norInst (39), and sltInst (42).)
Fig. 13.5 Instruction decoder for MicroMIPS built of two 6-to-64 decoders. Computer Architecture, Data Path and Control Slide 16 Feb.
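The two-decoder scheme of Fig. 13.5 amounts to a lookup on op, deferring to fn when op = 0. A Python sketch using the op/fn codes of Table 13.1 (dictionary-based and purely illustrative):

```python
# One-hot instruction decoding: one decoder on op, a second on fn that is
# consulted only for R-type instructions (op = 0). Codes per Table 13.1.
OP_NAMES = {1: 'bltz', 2: 'j', 3: 'jal', 4: 'beq', 5: 'bne', 8: 'addi',
            10: 'slti', 12: 'andi', 13: 'ori', 14: 'xori', 15: 'lui',
            35: 'lw', 43: 'sw'}
FN_NAMES = {8: 'jr', 12: 'syscall', 32: 'add', 34: 'sub', 36: 'and',
            37: 'or', 38: 'xor', 39: 'nor', 42: 'slt'}

def decode(op, fn):
    """Returns signals like 'addInst', exactly one of them asserted."""
    name = FN_NAMES.get(fn) if op == 0 else OP_NAMES.get(op)
    return {n + 'Inst': (n == name)
            for n in list(OP_NAMES.values()) + list(FN_NAMES.values())}

sig = decode(0, 34)
print([k for k, v in sig.items() if v])   # ['subInst']
```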
2011 Computer Architecture, Data Path and Control
Control Signal Settings: Repeated for Reference
(Table 13.3, repeated from Slide 14 for reference.) Slide 17
Control Signal Generation
Auxiliary signals identifying instruction classes
arithInst = addInst ∨ subInst ∨ sltInst ∨ addiInst ∨ sltiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
immInst = luiInst ∨ addiInst ∨ sltiInst ∨ andiInst ∨ oriInst ∨ xoriInst
Example logic expressions for control signals
RegWrite = luiInst ∨ arithInst ∨ logicInst ∨ lwInst ∨ jalInst
ALUSrc = immInst ∨ lwInst ∨ swInst
Add′Sub = subInst ∨ sltInst ∨ sltiInst
DataRead = lwInst
PCSrc0 = jInst ∨ jalInst ∨ syscallInst
Feb. 2011 Computer Architecture, Data Path and Control Slide 18
Putting It All Together
(The ALU of Fig. 10.19 and the next-address logic of Fig. 13.4 plugged into the single-cycle data path of Fig. 13.3, with the control block deriving all signals from the decoded instruction.)
Fig. 13.3 Feb. 2011 Computer Architecture, Data Path and Control Slide 19
13.6 Performance of the Single-Cycle Design
An example combinational-logic data path to compute z := (u + v)(w - x) / y
Add/Sub latency 2 ns, Multiply latency 6 ns, Divide latency 15 ns; Total latency 23 ns
Note that the divider gets its correct inputs after ≅9 ns, but this won’t cause a problem if we allow enough total time
(Dataflow graph: u + v and w - x feed the multiplier; the product and y feed the divider, which produces z.) Feb.
2011 Beginning with inputs u, v, w, x, and y stored in registers, the entire computation can be completed in ≅25 ns, allowing 1 ns each for register readout and write Computer Architecture, Data Path and Control Slide 20 Performance Estimation for Single-Cycle MicroMIPS Instruction access 2 ns Register read 1 ns ALU operation 2 ns Data cache access 2 ns Register write 1 ns Total 8 ns Single-cycle clock = 125 MHz R-type 44% 6 ns Load 24% 8 ns Store 12% 7 ns Branch 18% 5 ns Jump 2% 3 ns Weighted mean ≅ 6.36 ns ALU-type P C Load P C Store P C Branch P C (and jr) Jump (except jr & jal) P C Not used Not used Not used Not used Not used Not used Not used Not used Not used Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. Feb. 2011 Computer Architecture, Data Path and Control Slide 21 How Good is Our Single-Cycle Design? Clock rate of 125 MHz not impressive How does this compare with current processors on the market? Not bad, where latency is concerned Instruction access 2 ns Register read 1 ns ALU operation 2 ns Data cache access 2 ns Register write 1 ns Total 8 ns Single-cycle clock = 125 MHz A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 ns/cycle × 20 cycles = 8 ns Throughput, however, is much better for the pipelined processor: Up to 20 times better with single issue Perhaps up to 100 times better with multiple issue Feb. 2011 Computer Architecture, Data Path and Control Slide 22 14 Control Unit Synthesis The control unit for the single-cycle design is memoryless • Problematic when instructions vary greatly in complexity • Multiple cycles needed when resources must be reused Topics in This Chapter 14.1 A Multicycle Implementation 14.2 Choosing the Clock Cycle 14.3 The Control State Machine 14.4 Performance of the Multicycle Design 14.5 Microprogramming 14.6 Exception Handling Feb. 2011 Computer Architecture, Data Path and Control Slide 23 14.1 A Multicycle Implementation Appointment book for a dentist Single-cycle Multicycle Assume longest treatment takes one hour Feb. 2011 Computer Architecture, Data Path and Control Slide 24 Single-Cycle vs. Multicycle MicroMIPS Clock Time needed Time allotted Instr 1 Instr 2 Instr 3 Instr 4 Clock Time needed Time allotted Time saved 3 cycles 5 cycles 3 cycles 4 cycles Instr 1 Instr 2 Instr 3 Instr 4 Fig. 14.1 Feb. 2011 Single-cycle versus multicycle instruction execution. Computer Architecture, Data Path and Control Slide 25 A Multicycle Data Path Inst Reg x Reg jta Address rs,rt,rd (rs) PC imm Cache z Reg Reg file ALU (rt) Data y Reg Data Reg op fn Control von Neumann (Princeton) architecture Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1. Feb. 2011 Computer Architecture, Data Path and Control Slide 26 Multicycle Data Path with Control Signals Shown Three major changes relative to the single-cycle data path: 26 / 1. Instruction & data caches combined Corrections are shown in red Inst Reg rs rt 0 rd 1 31 2 Cache PCWrite MemWrite MemRead Fig. 14.3 Feb. 2011 Reg file op IRWrite (rt) imm 16 / Data Reg fn 32 y Reg SE / RegInSrc RegDst ALUZero x Mux ALUOvfl 0 Zero z Reg 1 Ovfl (rs) 0 12 Data 0 1 SysCallAddr 3. Registers added for jta intercycle data x Reg PC Inst′Data 30 / 4 MSBs Address 0 1 2. 
ALU performs double duty for address calculation RegWrite y Mux 4 0 1 2 ×4 3 ALUSrcX ALUSrcY 30 ×4 ALU Func ALU out ALUFunc PCSrc JumpAddr Key elements of the multicycle MicroMIPS data path. Computer Architecture, Data Path and Control 0 1 2 3 Slide 27 14.2 Clock Cycle and Control Signals Table 14.1 Program counter Cache Register file ALU Feb. 2011 Control signal 0 1 2 JumpAddr jta SysCallAddr PCSrc1, PCSrc0 Jump addr x reg PCWrite Don’t write Write Inst′Data PC z reg MemRead Don’t read Read MemWrite Don’t write Write IRWrite Don’t write Write RegWrite Don’t write Write RegDst1, RegDst0 rt rd $31 RegInSrc1, RegInSrc0 Data reg z reg PC ALUSrcX PC x reg ALUSrcY1, ALUSrcY0 4 y reg Add′Sub Add Subtract LogicFn1, LogicFn0 AND FnClass1, FnClass0 lui z reg 3 ALU out imm 4 × imm OR XOR NOR Set less Arithmetic Logic Computer Architecture, Data Path and Control Slide 28 Multicycle Data Path, Repeated for Reference 26 / 30 / 4 MSBs Corrections are shown in red Inst Reg rt 0 rd 1 31 2 Cache PCWrite MemWrite MemRead Fig. 14.3 Feb. 2011 Reg file op IRWrite (rt) imm 16 / Data Reg Inst′Data (rs) 0 12 Data fn ALUZero x Mux ALUOvfl 0 Zero z Reg 1 Ovfl x Reg rs PC 0 1 SysCallAddr jta Address 32 y Reg SE / RegInSrc RegDst 0 1 RegWrite y Mux 4 0 1 2 ×4 3 ALUSrcX ALUSrcY 30 ×4 ALU Func ALU out ALUFunc PCSrc JumpAddr Key elements of the multicycle MicroMIPS data path. Computer Architecture, Data Path and Control 0 1 2 3 Slide 29 Execution Cycles Fetch & PC incr Decode & reg read Table 14.2 Instruction Operations Signal settings Any Read out the instruction and write it into instruction register, increment PC Any Read out rs & rt into x & y registers, compute branch address and save in z register Perform ALU operation and save the result in z register Add base and offset values, save in z register If (x reg) = ≠ < (y reg), set PC to branch target address Inst′Data = 0, MemRead = 1 IRWrite = 1, ALUSrcX = 0 ALUSrcY = 0, ALUFunc = ‘+’ PCSrc = 3, PCWrite = 1 ALUSrcX = 0, ALUSrcY = 3 ALUFunc = ‘+’ 1 2 ALU type Set PC to the target address jta, SysCallAddr, or (rs) Write back z reg into rd Load Read memory into data reg ALUSrcX = 1, ALUSrcY = 1 or 2 ALUFunc: Varies ALUSrcX = 1, ALUSrcY = 2 ALUFunc = ‘+’ ALUSrcX = 1, ALUSrcY = 1 ALUFunc= ‘−’, PCSrc = 2 PCWrite = ALUZero or ALUZero′ or ALUOut31 JumpAddr = 0 or 1, PCSrc = 0 or 1, PCWrite = 1 RegDst = 1, RegInSrc = 1 RegWrite = 1 Inst′Data = 1, MemRead = 1 Store Copy y reg into memory Inst′Data = 1, MemWrite = 1 Load Copy data register into rt RegDst = 0, RegInSrc = 0 RegWrite = 1 ALU type Load/Store ALU oper & PC update 3 Branch Jump Reg write or mem access Reg write for lw Feb. 2011 4 5 Execution cycles for multicycle MicroMIPS Computer Architecture, Data Path and Control Slide 30 14.3 The Control State Machine Cycle 1 Cycle 2 Notes for State 5: % 0 for j or jal, 1 for syscall, don’t-care for other instr’s @ 0 for j, jal, and syscall, 1 for jr, 2 for branches # 1 for j, jr, jal, and syscall, ALUZero (′) for beq (bne), bit 31 of ALUout for bltz For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1 State 0 Inst′Data = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 Jump/ Branch State 1 ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ Note for State 7: ALUFunc is determined based on the op and fn fields Fig. 14.4 Feb. 
2011 Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘−’ JumpAddr = % PCSrc = @ PCWrite = # State 6 lw/ sw ALUtype State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ Cycle 5 Inst′Data = 1 MemWrite = 1 Branches based on instruction sw Speculative calculation of branch address Start Cycle 3 lw State 3 State 4 Inst′Data = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 The control state machine for multicycle MicroMIPS. Computer Architecture, Data Path and Control Slide 31 State and Instruction Decoding op /4 5 6 7 8 9 10 11 12 13 14 15 ControlSt0 ControlSt1 ControlSt2 ControlSt3 ControlSt4 ControlSt5 ControlSt6 ControlSt7 ControlSt8 0 1 2 3 4 1 op Decoder st Decoder 0 1 2 3 4 1 fn /6 5 bltzInst jInst jalInst beqInst bneInst 8 addiInst andiInst 10 sltiInst 12 13 14 15 andiInst oriInst xoriInst luiInst 35 lwInst 43 Feb. 2011 0 jrInst 8 12 syscallInst 32 addInst 34 subInst 36 37 38 39 andInst orInst xorInst norInst 42 sltInst swInst 63 Fig. 14.5 /6 RtypeInst fn Decoder st 63 State and instruction decoders for multicycle MicroMIPS. Computer Architecture, Data Path and Control Slide 32 Control Signal Generation Certain control signals depend only on the control state ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7 RegWrite = ControlSt4 ∨ ControlSt8 Auxiliary signals identifying instruction classes addsubInst = addInst ∨ subInst ∨ addiInst logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst Logic expressions for ALU control signals Add′Sub = ControlSt5 ∨ (ControlSt7 ∧ subInst) FnClass1 = ControlSt7′ ∨ addsubInst ∨ logicInst FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst) LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst) LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst) Feb. 2011 Computer Architecture, Data Path and Control Slide 33 14.4 Performance of the Multicycle Design R-type Load Store Branch Jump 44% 24% 12% 18% 2% 4 cycles 5 cycles 4 cycles 3 cycles 3 cycles Contribution to CPI R-type 0.44×4 = 1.76 Load 0.24×5 = 1.20 Store 0.12×4 = 0.48 Branch 0.18×3 = 0.54 Jump 0.02×3 = 0.06 ALU-type P C Load P C Store P C Branch P C (and jr) Not used Not used Not used Not used Not used Not used Not used Not used _____________________________ Average CPI ≅ 4.04 Jump (except jr & jal) P C Not used Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. Feb. 2011 Computer Architecture, Data Path and Control Slide 34 How Good is Our Multicycle Design? Clock rate of 500 MHz better than 125 MHz of single-cycle design, but still unimpressive Cycle time = 2 ns Clock rate = 500 MHz How does the performance compare with current processors on the market? Not bad, where latency is concerned R-type Load Store Branch Jump A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 × 20 = 8 ns Contribution to CPI R-type 0.44×4 = 1.76 Throughput, however, is much better for the pipelined processor: Up to 20 times better with single issue Load Store Branch Jump 44% 24% 12% 18% 2% 0.24×5 0.12×4 0.18×3 0.02×3 4 cycles 5 cycles 4 cycles 3 cycles 3 cycles = = = = 1.20 0.48 0.54 0.06 _____________________________ Perhaps up to 100× with multiple issue Feb. 
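The CPI figure quoted above follows directly from the instruction mix; a short Python check, with the mix and cycle counts exactly as given in the slides:

```python
# Average CPI for the multicycle design from the instruction frequencies
# (Fig. 13.6 mix) and per-class cycle counts (Chapter 14).
mix = {'R-type': (0.44, 4), 'Load': (0.24, 5), 'Store': (0.12, 4),
       'Branch': (0.18, 3), 'Jump': (0.02, 3)}

cpi = sum(frac * cycles for frac, cycles in mix.values())
print(round(cpi, 2))        # 4.04
print(round(500 / cpi))     # about 124 MIPS with the 500 MHz clock
```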
2011 Computer Architecture, Data Path and Control Average CPI ≅ 4.04 Slide 35 14.5 Microprogramming PC control Cache control Register control ALU inputs ALU Sequence function control 2 bits JumpAddr PCSrc PCWrite FnType LogicFn Add′Sub ALUSrcY ALUSrcX RegInSrc Microinstruction RegDst RegWrite Inst′Data MemRead MemWrite IRWrite Cycle 1 Cycle 2 Notes for State 5: % 0 for j or jal, 1 for syscall, don’t-care for other instr’s @ 0 for j, jal, and syscall, 1 for jr, 2 for branches # 1 for j, jr, jal, and syscall, ALUZero (′) for beq (bne), bit 31 of ALUout for bltz For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1 23 Fig. 14.6 Possible 22-bit microinstruction format for MicroMIPS. State 0 Inst′Data = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 Jump/ Branch State 1 ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ Feb. 2011 Note for State 7: ALUFunc is determined based on the op and fn fields Computer Architecture, Data Path and Control Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘−’ JumpAddr = % PCSrc = @ PCWrite = # State 6 Cycle 5 Inst′Data = 1 MemWrite = 1 sw lw/ sw Start The control state machine resembles a program (microprogram) Cycle 3 ALUtype State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ lw State 3 State 4 Inst′Data = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 Slide 36 The Control State Machine as a Microprogram Cycle 1 Cycle 2 Notes for State 5: % 0 for j or jal, 1 for syscall, don’t-care for other instr’s @ 0 for j, jal, and syscall, 1 for jr, 2 for branches # 1 for j, jr, jal, and syscall, ALUZero (′) for beq (bne), bit 31 of ALUout for bltz For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1 State 0 Inst′Data = 0 MemRead = 1 IRWrite = 1 ALUSrcX = 0 ALUSrcY = 0 ALUFunc = ‘+’ PCSrc = 3 PCWrite = 1 Jump/ Branch Cycle 4 State 5 ALUSrcX = 1 ALUSrcY = 1 ALUFunc = ‘−’ JumpAddr = % PCSrc = @ PCWrite = # State 6 ALUSrcX = 0 ALUSrcY = 3 ALUFunc = ‘+’ lw/ sw Start Inst′Data = 1 MemWrite = 1 ALUtype State 2 ALUSrcX = 1 ALUSrcY = 2 ALUFunc = ‘+’ lw State 3 State 4 Inst′Data = 1 MemRead = 1 RegDst = 0 RegInSrc = 0 RegWrite = 1 State 7 State 8 ALUSrcX = 1 ALUSrcY = 1 or 2 ALUFunc = Varies RegDst = 0 or 1 RegInSrc = 1 RegWrite = 1 Multiple substates Fig. 14.4 Feb. 2011 Cycle 5 sw State 1 Note for State 7: ALUFunc is determined based on the op and fn fields Cycle 3 Multiple substates Decompose into 2 substates The control state machine for multicycle MicroMIPS. Computer Architecture, Data Path and Control Slide 37 Symbolic Names for Microinstruction Field Values Table 14.3 Microinstruction field values and their symbolic names. The default value for each unspecified field is the all 0s bit pattern. Field name PC control Cache control Register control ALU inputs* ALU function* Seq. control Possible field values and their symbolic names 0001 1001 x011 x101 x111 PCjump PCsyscall PCjreg PCbranch PCnext 0101 1010 1100 CacheFetch CacheStore CacheLoad 1000 10000 rt ← Data 1001 10001 rt ← z 1011 10101 rd ← z 1101 11010 $31 ← PC 000 011 101 110 x10 PC ⊗ 4 PC ⊗ 4imm x ⊗ y x ⊗ imm (imm) 0xx10 1xx01 1xx10 x0011 x0111 + < − ∧ ∨ x1011 x1111 xxx00 ⊕ ∼∨ lui 01 10 11 μPCdisp1 μPCdisp2 μPCfetch * The operator symbol ⊗ stands for any of the ALU functions defined above (except for “lui”). Feb. 
2011 Computer Architecture, Data Path and Control Slide 38
Control Unit for Microprogramming
(Figure: the microprogram counter (MicroPC) addresses a microprogram memory or PLA; the fetched microinstruction, held in a microinstruction register, supplies control signals to the data path plus 2 bits of sequence control. The MicroPC's next value comes from an incrementer, from dispatch table 1 or dispatch table 2 (64 entries each, indexed by the op field from the instruction register), or from the fixed address of fetch: a multiway branch.)
Fig. 14.7 Microprogrammed control unit for MicroMIPS.
Feb. 2011 Computer Architecture, Data Path and Control Slide 39
Microprogram for MicroMIPS (37 microinstructions)
fetch:    PCnext, CacheFetch          # State 0 (start)
          PC + 4imm, μPCdisp1         # State 1
lui1:     lui(imm)                    # State 7lui
          rt ← z, μPCfetch            # State 8lui
add1:     x + y                       # State 7add
          rd ← z, μPCfetch            # State 8add
sub1:     x - y                       # State 7sub
          rd ← z, μPCfetch            # State 8sub
slt1:     x - y                       # State 7slt
          rd ← z, μPCfetch            # State 8slt
addi1:    x + imm                     # State 7addi
          rt ← z, μPCfetch            # State 8addi
slti1:    x - imm                     # State 7slti
          rt ← z, μPCfetch            # State 8slti
and1:     x ∧ y                       # State 7and
          rd ← z, μPCfetch            # State 8and
or1:      x ∨ y                       # State 7or
          rd ← z, μPCfetch            # State 8or
xor1:     x ⊕ y                       # State 7xor
          rd ← z, μPCfetch            # State 8xor
nor1:     x ∼∨ y                      # State 7nor
          rd ← z, μPCfetch            # State 8nor
andi1:    x ∧ imm                     # State 7andi
          rt ← z, μPCfetch            # State 8andi
ori1:     x ∨ imm                     # State 7ori
          rt ← z, μPCfetch            # State 8ori
xori:     x ⊕ imm                     # State 7xori
          rt ← z, μPCfetch            # State 8xori
lwsw1:    x + imm, μPCdisp2           # State 2
lw2:      CacheLoad                   # State 3
          rt ← Data, μPCfetch         # State 4
sw2:      CacheStore, μPCfetch        # State 6
j1:       PCjump, μPCfetch            # State 5j
jr1:      PCjreg, μPCfetch            # State 5jr
branch1:  PCbranch, μPCfetch          # State 5branch
jal1:     PCjump, $31 ← PC, μPCfetch  # State 5jal
syscall1: PCsyscall, μPCfetch         # State 5syscall
Fig. 14.8 The complete MicroMIPS microprogram.
Feb. 2011 Computer Architecture, Data Path and Control Slide 40
14.6 Exception Handling
Exceptions and interrupts alter the normal program flow.
Examples of exceptions (things that can go wrong):
• ALU operation leads to overflow (incorrect result is obtained)
• Opcode field holds a pattern not representing a legal operation
• Cache error-code checker deems an accessed word invalid
• Sensor signals a hazardous condition (e.g., overheating)
The exception handler is an OS program that takes care of the problem:
• Derives the correct result of an overflowing computation, if possible
• An invalid operation may be a software-implemented instruction
Interrupts are similar, but usually have external causes (e.g., I/O).
Feb. 2011 Computer Architecture, Data Path and Control Slide 41
Exception Control States
Two states are added to the control state machine of Fig. 14.4:
State 10, entered upon an illegal operation: IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '−', EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
State 9, entered upon overflow: IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '−', EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
Fig. 14.10 Exception states 9 and 10 added to the control state machine. Feb. 2011
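The sequence-control field drives a small loop around the MicroPC. A compact Python sketch of that sequencing, under the Table 14.3 encoding (00 = next word, 01 = dispatch 1, 10 = dispatch 2, 11 = μPCfetch); the microcode excerpt and dispatch-table contents are placeholders covering only the load/store path:

    FETCH = 0
    microcode = [("PCnext, CacheFetch", 0b00),   # 0: State 0
                 ("PC + 4imm",          0b01),   # 1: State 1, dispatch on op
                 ("x + imm",            0b10),   # 2: lwsw1, dispatch lw/sw
                 ("CacheLoad",          0b00),   # 3: lw2
                 ("rt <- Data",         0b11),   # 4: back to fetch
                 ("CacheStore",         0b11)]   # 5: sw2, back to fetch

    dispatch1 = {"lw": 2, "sw": 2}               # placeholder dispatch tables
    dispatch2 = {"lw": 3, "sw": 5}

    def run_one_instruction(op):
        upc = FETCH
        while True:
            controls, seq = microcode[upc]
            print(f"uPC={upc}: {controls}")
            if   seq == 0b00: upc += 1           # fall through to next word
            elif seq == 0b01: upc = dispatch1[op]
            elif seq == 0b10: upc = dispatch2[op]
            else: return                         # uPCfetch: instruction done

    run_one_instruction("lw")   # walks States 0, 1, 2, 3, 4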
Computer Architecture, Data Path and Control Slide 42
15 Pipelined Data Paths
Pipelining is now used in even the simplest of processors:
• Same principles as assembly lines in manufacturing
• Unlike in assembly lines, instructions are not independent
Topics in This Chapter
15.1 Pipelining Concepts
15.2 Pipeline Stalls or Bubbles
15.3 Pipeline Timing and Performance
15.4 Pipelined Data Path Design
15.5 Pipelined Control
15.6 Optimal Pipelining
Feb. 2011 Computer Architecture, Data Path and Control Slide 43
(Pictorial view of the five steps of instruction execution: Fetch, Reg read, ALU, Data memory, Reg write.)
Feb. 2011 Computer Architecture, Data Path and Control Slide 44
Single-Cycle Data Path of Chapter 13
Clock rate = 125 MHz, CPI = 1 (125 MIPS)
(Figure: instruction cache, register file, ALU, and data cache in a single combinational path, with next-address logic and the control signals RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc.)
Fig. 13.3 Key elements of the single-cycle MicroMIPS data path.
Feb. 2011 Computer Architecture, Data Path and Control Slide 45
Multicycle Data Path of Chapter 14
Clock rate = 500 MHz, CPI ≅ 4 (≅ 125 MIPS)
(Figure: a single cache and ALU shared across cycles, with the instruction, data, x, y, and z registers and the control signals of Chapter 14.)
Fig. 14.3 Key elements of the multicycle MicroMIPS data path.
Feb. 2011 Computer Architecture, Data Path and Control Slide 46
Getting the Best of Both Worlds
Single-cycle: clock rate = 125 MHz, CPI = 1
Multicycle: clock rate = 500 MHz, CPI ≅ 4
Pipelined: clock rate = 500 MHz, CPI ≅ 1
Single-cycle analogy: doctor appointments scheduled for 60 min per patient
Multicycle analogy: doctor appointments scheduled in 15-min increments
Feb. 2011 Computer Architecture, Data Path and Control Slide 47
15.1 Pipelining Concepts
Strategies for improving performance:
1 – Use multiple independent data paths accepting several instructions that are read out at once: multiple-instruction-issue or superscalar
2 – Overlap the execution of several instructions, starting the next instruction before the previous one has run to completion: (super)pipelined
(Figure: pipelining in the student registration process, with stations Approval, Cashier, Registrar, ID photo, and Pickup between "Start here" and "Exit".)
Fig. 15.1 Pipelining in the student registration process.
Feb. 2011 Computer Architecture, Data Path and Control Slide 48
Pipelined Instruction Execution
(Figure: five instructions staggered over Cycles 1-9; each instruction flows through Instr cache, Reg file, ALU, Data cache, and Reg file, one stage per cycle, illustrating the task dimension versus the time dimension.)
Fig. 15.2 Pipelining in the MicroMIPS instruction execution process. Feb. 2011
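Putting the three designs side by side, throughput is simply clock rate divided by CPI. A quick Python check (clock rates and CPI values from the slides):

    # MIPS = clock rate (MHz) / CPI for the three MicroMIPS implementations.
    designs = {"single-cycle": (125, 1.0), "multicycle": (500, 4.04),
               "pipelined": (500, 1.1)}
    for name, (mhz, cpi) in designs.items():
        print(f"{name:>12}: {mhz / cpi:6.1f} MIPS")
    # The pipelined design combines the multicycle clock rate with a
    # near-unity CPI, hence its large throughput edge.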
Computer Architecture, Data Path and Control Slide 49
Alternate Representations of a Pipeline
Except for start-up and drainage overheads, a pipeline can execute one instruction per clock tick; IPS is dictated by the clock frequency.
(Figure: (a) task-time diagram and (b) space-time diagram of a 5-stage pipeline executing 7 tasks, with the start-up and drainage regions marked. Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.)
Fig. 15.3 Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions).
Feb. 2011 Computer Architecture, Data Path and Control Slide 50
Pipelining Example in a Photocopier
Example 15.1
A photocopier with an x-sheet document feeder copies the first sheet in 4 s and each subsequent sheet in 1 s. The copier's paper path is a 4-stage pipeline, with each stage having a latency of 1 s. The first sheet goes through all 4 pipeline stages and emerges after 4 s; each subsequent sheet emerges 1 s after the previous sheet. How does the throughput of this photocopier vary with x, assuming that loading the document feeder and removing the copies takes 15 s?
Solution
Each batch of x sheets is copied in 15 + 4 + (x – 1) = 18 + x seconds. A nonpipelined copier would require 4x seconds to copy x sheets. For x > 6, the pipelined version has a performance edge. When x = 50, the pipelining speedup is (4 × 50) / (18 + 50) = 2.94.
Feb. 2011 Computer Architecture, Data Path and Control Slide 51
15.2 Pipeline Stalls or Bubbles
First type of data dependency:
$5 = $6 + $7
$8 = $8 + $6
$9 = $8 + $2
sw $9, 0($3)
(Figure: the second instruction writes $8, which the third instruction reads; forwarding passes the ALU result directly to the next instruction's ALU input.)
Fig. 15.4 Read-after-write data dependency and its possible resolution through data forwarding.
Feb. 2011 Computer Architecture, Data Path and Control Slide 52
Inserting Bubbles in a Pipeline
Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency; two bubbles suffice if we assume that a register can be updated and read from in one cycle.
(Figure: two task-time diagrams showing the instruction that writes into $8 and the one that reads from $8, separated by three and by two bubbles, respectively.)
Feb. 2011 Computer Architecture, Data Path and Control Slide 53
Second Type of Data Dependency
sw $6, . . .
lw $8, . . .
$9 = $8 + $2
Reorder? Insert bubble?
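Example 15.1's batch throughput is easy to tabulate. A short Python check of the formula (all values from the example):

    # Pipelined copier: 15 s load/unload + 4 s for the first sheet + 1 s each
    # thereafter, versus 4 s per sheet with no overlap.
    def speedup(x):
        pipelined = 18 + x          # 15 + 4 + (x - 1) seconds per batch of x
        nonpipelined = 4 * x
        return nonpipelined / pipelined

    for x in (5, 6, 7, 50, 500):
        print(f"x = {x:3d}: speedup = {speedup(x):.2f}")
    # Crossover just above x = 6; the speedup approaches 4 (the stage count)
    # as x grows, since the fixed 18 s overhead is amortized.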
(Figure: the lw instruction's data emerges from the data memory too late for the immediately following instruction's ALU stage.) Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency.
Fig. 15.5 Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.
Feb. 2011 Computer Architecture, Data Path and Control Slide 54
Control Dependency in a Pipeline
$6 = $3 + $5
beq $1, $2, . . .
$9 = $8 + $2
Insert bubble? Reorder? (delayed branch)
(Figure: assume the branch is resolved in the ALU stage; the instruction fetched after the branch would need 1-2 more bubbles.)
Fig. 15.6 Control dependency due to conditional branch.
Feb. 2011 Computer Architecture, Data Path and Control Slide 55
15.3 Pipeline Timing and Performance
(Figure: a function unit with latency t is sliced into q stages, each with latency t/q; latching of results between stages adds an overhead of τ per stage.)
Fig. 15.7 Pipelined form of a function unit with latching overhead.
Feb. 2011 Computer Architecture, Data Path and Control Slide 56
Throughput Increase in a q-Stage Pipeline
Throughput improvement factor = t / (t/q + τ) = q / (1 + qτ/t)
(Chart: the improvement factor plotted against the number q of pipeline stages for τ/t = 0 (ideal), 0.05, and 0.1.)
Fig. 15.8 Throughput improvement due to pipelining as a function of the number of pipeline stages for different pipelining overheads.
Feb. 2011 Computer Architecture, Data Path and Control Slide 57
Pipeline Throughput with Dependencies
Assume that one bubble must be inserted due to a read-after-load dependency and after a branch when its delay slot cannot be filled. Let β be the fraction of all instructions that are followed by a bubble. Then:
Pipeline speedup = q / [(1 + qτ/t)(1 + β)]
Example 15.3
Calculate the effective CPI for MicroMIPS, assuming that a quarter of branch and load instructions are followed by bubbles (instruction mix: R-type 44%, Load 24%, Store 12%, Branch 18%, Jump 2%).
Solution
Fraction of bubbles β = 0.25(0.24 + 0.18) = 0.105
Effective CPI = 1 + β = 1.105 (which is very close to the ideal value of 1)
Feb. 2011 Computer Architecture, Data Path and Control Slide 58
15.4 Pipelined Data Path Design
(Figure: the five stages, instruction fetch, register readout, ALU, data access, and register writeback, separated by inter-stage latches, with next-address logic and the control signals of Chapter 13.)
Fig. 15.9 Key elements of the pipelined MicroMIPS data path.
Feb. 2011 Computer Architecture, Data Path and Control Slide 59
15.5 Pipelined Control
(Figure: the data path of Fig. 15.9 with control signals generated in Stage 2 and carried along the pipeline to the stage where each is consumed.)
Fig. 15.10 Pipelined control signals.
Feb. 2011 Computer Architecture, Data Path and Control Slide 60
15.6 Optimal Pipelining
MicroMIPS pipeline with more than four-fold improvement: the five steps (instruction fetch, register readout, ALU operation, data read/store, register writeback) are sliced further for higher throughput. Fig.
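The two formulas above combine stage overhead and bubbles. A minimal Python evaluation (the τ/t and β values come from Fig. 15.8 and Example 15.3):

    # Pipeline speedup q / [(1 + q*tau/t)(1 + beta)], and Example 15.3's CPI.
    def speedup(q, tau_over_t, beta):
        return q / ((1 + q * tau_over_t) * (1 + beta))

    beta = 0.25 * (0.24 + 0.18)     # bubbles after 1/4 of loads and branches
    print(f"beta = {beta:.3f}, effective CPI = {1 + beta:.3f}")   # 1.105
    for q in (2, 4, 5, 8):
        print(f"q = {q}: speedup = {speedup(q, 0.05, beta):.2f}")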
15.11 Higher-throughput pipelined data path for MicroMIPS and the execution of consecutive instructions in it.
Feb. 2011 Computer Architecture, Data Path and Control Slide 61
Optimal Number of Pipeline Stages
Assumptions: the pipeline is sliced into q stages; stage overhead is τ; a branch costs q/2 bubbles (the decision is made midway through the pipeline); a fraction b of all instructions are taken branches. Derivation of q_opt:
Average CPI = 1 + bq/2
Throughput = Clock rate / CPI = 1 / [(t/q + τ)(1 + bq/2)]
Differentiating the throughput expression with respect to q and equating with 0 yields
q_opt = √[(2t/τ)/b]
so q_opt varies directly with t/τ and inversely with b.
Fig. 15.7 Pipelined form of a function unit with latching overhead.
Feb. 2011 Computer Architecture, Data Path and Control Slide 62
Pipelining Example
An example combinational-logic data path computes z := (u + v)(w – x)/y. Add/subtract latency is 2 ns, multiply latency 6 ns, and divide latency 15 ns; operand readout takes 1 ns and result write 1 ns.
Throughput, original = 1/(25 × 10⁻⁹) = 40 M computations/s
Throughput, Option 1 = 1/(17 × 10⁻⁹) = 58.8 M computations/s
Throughput, Option 2 = 1/(10 × 10⁻⁹) = 100 M computations/s
(The two options place pipeline registers at different points along the data path, with Option 2 cutting it into more, better-balanced stages.)
Feb. 2011 Computer Architecture, Data Path and Control Slide 63
16 Pipeline Performance Limits
Pipeline performance is limited by data and control dependencies:
• Hardware provisions: data forwarding, branch prediction
• Software remedies: delayed branch, instruction reordering
Topics in This Chapter
16.1 Data Dependencies and Hazards
16.2 Data Forwarding
16.3 Pipeline Branch Hazards
16.4 Delayed Branch and Branch Prediction
16.5 Advanced Pipelining
16.6 Dealing with Exceptions
Feb. 2011 Computer Architecture, Data Path and Control Slide 64
16.1 Data Dependencies and Hazards
(Figure: the instruction $2 = $1 - $3 followed by instructions that read register $2 in successive cycles; the early readers would get the stale value.)
Fig. 16.1 Data dependency in a pipeline.
Feb. 2011 Computer Architecture, Data Path and Control Slide 65
Resolving Data Dependencies via Forwarding
(Figure: the ALU result of $2 = $1 - $3 is forwarded to the instructions that read register $2 before the writeback completes.)
Fig. 16.2 When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.
Feb. 2011 Computer Architecture, Data Path and Control Slide 66
Pipelined MicroMIPS – Repeated for Reference
(Figure: the five-stage pipelined data path with control signals, as in Fig. 15.10.)
Fig. 15.10 Pipelined control signals. Feb. 2011
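Both results above are one-liners to check numerically. A small Python sketch (the t, τ, b values fed to q_opt are illustrative; the 25/17/10 ns cycle times come from the slide):

    from math import sqrt

    # Optimal stage count q_opt = sqrt((2t/tau)/b); sample values illustrative.
    def q_opt(t, tau, b):
        return sqrt((2 * t / tau) / b)

    print(f"q_opt = {q_opt(t=20, tau=1, b=0.15):.1f}")   # about 16 stages

    # Throughput of z := (u + v)(w - x)/y for each register placement.
    for label, cycle_ns in [("original", 25), ("Option 1", 17), ("Option 2", 10)]:
        print(f"{label}: {1e3 / cycle_ns:.1f} M computations/s")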
Computer Architecture, Data Path and Control Slide 67
Certain Data Dependencies Lead to Bubbles
lw $2,4($12), followed by instructions that read register $2.
Fig. 16.3 When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
Feb. 2011 Computer Architecture, Data Path and Control Slide 68
16.2 Data Forwarding
(Figure: upper and lower forwarding units sit in front of the ALU's x and y inputs in Stage 3; each chooses among the Stage-2 register readout (x2 or y2), the Stage-3 and Stage-4 results (x3, y3, x4, y4), and the data-cache output, under control of RetAddr3, RegWrite3, RegWrite4, RegInSrc3, and RegInSrc4.)
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path.
Feb. 2011 Computer Architecture, Data Path and Control Slide 69
Design of the Data Forwarding Units
Let's focus on designing the upper data forwarding unit.
Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path.
RegWrite3  RegWrite4  s2matchesd3  s2matchesd4  RetAddr3  RegInSrc3  RegInSrc4  Choose
0          0          x            x            x         x          x          x2
0          1          x            0            x         x          x          x2
0          1          x            1            x         x          0          x4
0          1          x            1            x         x          1          y4
1          0          1            x            0         1          x          x3
1          0          1            x            1         1          x          y3
1          1          1            1            0         1          x          x3
(The slide flags one corresponding entry as incorrect in the textbook; the table above is the corrected version.)
Feb. 2011 Computer Architecture, Data Path and Control Slide 70
Hardware for Inserting Bubbles
(Figure: a data hazard detector in Stage 2 compares the rs and rt fields of the instruction being decoded against the destination of a load in the next stage; on a hazard it freezes the PC, instruction register, and incremented PC via LoadPC, LoadInst, and LoadIncrPC, and multiplexes all-0s in place of the decoded control signals, inserting a bubble. Corrections to the textbook figure are shown in red on the slide.)
Fig. 16.5 Data hazard detector for the pipelined MicroMIPS data path.
Feb. 2011 Computer Architecture, Data Path and Control Slide 71
Augmentations to Pipelined Data Path and Control
(Figure: the pipeline of Fig. 15.10 augmented with a branch predictor, next-address forwarders, a hazard detector, ALU forwarders, and a data cache forwarder.)
Fig. 15.10 Feb.
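Table 16.1 translates directly into a selection function. A Python rendering of those seven rows (signal names follow the table; giving the Stage-3 result priority over the Stage-4 result mirrors the table's rows, and cases outside the partial table default to the register readout):

    # Upper forwarding unit: choose the ALU's x input per Table 16.1.
    def forward_x(regwrite3, regwrite4, s2_matches_d3, s2_matches_d4,
                  retaddr3, reginsrc3, reginsrc4):
        if regwrite3 and s2_matches_d3 and reginsrc3:
            return "y3" if retaddr3 else "x3"   # newest result wins (Stage 3)
        if regwrite4 and s2_matches_d4:
            return "y4" if reginsrc4 else "x4"  # else the Stage-4 result
        return "x2"                             # no hazard: register readout

    # Row 4 of the table: only the Stage-4 instruction writes a matching reg.
    print(forward_x(0, 1, 0, 1, 0, 0, 1))       # -> y4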
2011 Computer Architecture, Data Path and Control Slide 72
16.3 Pipeline Branch Hazards
Software-based solutions:
• Compiler inserts a "no-op" after every branch (simple, but wasteful)
• Branch is redefined to take effect after the instruction that follows it
• Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions:
• A mechanism similar to the data hazard detector flushes the pipeline
• This constitutes a rudimentary form of branch prediction: always predict that the branch is not taken, and flush if mistaken
• More elaborate branch prediction strategies are possible
Feb. 2011 Computer Architecture, Data Path and Control Slide 73
16.4 Branch Prediction
Predicting whether a branch will be taken:
• Always predict that the branch will not be taken
• Use program context to decide (a backward branch is likely taken, a forward branch likely not taken)
• Allow the programmer or compiler to supply clues
• Decide based on past history (maintain a small history table); to be discussed later
• Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
Feb. 2011 Computer Architecture, Data Path and Control Slide 74
Forward and Backward Branches
Example 5.5
List A is stored in memory beginning at the address given in $s1. The list length is given in $s2. Find the largest integer in the list and copy it into $t0.
Solution
Scan the list, holding the largest element identified thus far in $t0.
      lw   $t0,0($s1)       # initialize maximum to A[0]
      addi $t1,$zero,0      # initialize index i to 0
loop: add  $t1,$t1,1        # increment index i by 1
      beq  $t1,$s2,done     # if all elements examined, quit
      add  $t2,$t1,$t1      # compute 2i in $t2
      add  $t2,$t2,$t2      # compute 4i in $t2
      add  $t2,$t2,$s1      # form address of A[i] in $t2
      lw   $t3,0($t2)       # load value of A[i] into $t3
      slt  $t4,$t0,$t3      # maximum < A[i]?
      beq  $t4,$zero,loop   # if not, repeat with no change
      addi $t0,$t3,0        # if so, A[i] is the new maximum
      j    loop             # change completed; now repeat
done: ...                   # continuation of the program
Feb. 2011 Computer Architecture, Data Path and Control Slide 75
Simple Branch Prediction: 1-Bit History
(State diagram: two states, "Predict taken" and "Predict not taken"; a taken branch moves toward the former, a not-taken branch toward the latter.) Two-state branch prediction scheme.
Problem with this approach: each branch in a loop entails two mispredictions, once in the first iteration (the loop is repeated, but the history indicates exit from the loop) and once in the last iteration (the loop is terminated, but the history indicates repetition).
Feb. 2011 Computer Architecture, Data Path and Control Slide 76
Simple Branch Prediction: 2-Bit History
(State diagram: four states, "Predict taken", "Predict taken again", "Predict not taken", and "Predict not taken again", with taken/not-taken transitions; two consecutive wrong outcomes are needed to flip the prediction.)
Fig. 16.6 Four-state branch prediction scheme.
Example 16.1 Impact of different branch prediction schemes on a 20-iteration inner loop (branch on condition c2) nested within a 10-iteration outer loop (branch on condition c1):
L1: ...
    L2: ...
        br <c2> L2
    ...
    br <c1> L1
Solution
Always taken: 11 mispredictions, 94.8% accurate
1-bit history: 20 mispredictions, 90.5% accurate
2-bit history: same as always taken
Feb. 2011 Computer Architecture, Data Path and Control Slide 77
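Example 16.1's counts can be reproduced with a few lines of simulation. A Python sketch using per-branch saturating counters (the initial strongly-taken state is an assumption that matches the slide's numbers):

    def trace():
        """Yield (branch_id, outcome) pairs; True = taken."""
        for outer in range(10):
            for inner in range(20):
                yield "c2", inner < 19     # inner branch: taken 19 of 20 times
            yield "c1", outer < 9          # outer branch: taken 9 of 10 times

    def simulate(n_bits):
        """n_bits-wide saturating counter per branch (0 bits = always taken)."""
        counters, misses = {}, 0
        top = (1 << n_bits) - 1            # strongest "taken" value
        for branch, taken in trace():
            c = counters.get(branch, top)  # assume initial state: strongly taken
            predict_taken = c >= (top + 1) // 2 if n_bits else True
            if predict_taken != taken:
                misses += 1
            if n_bits:                     # saturating update toward the outcome
                counters[branch] = min(c + 1, top) if taken else max(c - 1, 0)
        return misses

    total = 10 * 20 + 10                   # 210 branch executions in all
    for bits, name in [(0, "always taken"), (1, "1-bit"), (2, "2-bit")]:
        m = simulate(bits)
        print(f"{name}: {m} mispredictions, {100*(total-m)/total:.1f}% accurate")

This prints 11 (94.8%), 20 (90.5%), and 11 (94.8%), matching the solution.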
Other Branch Prediction Algorithms (Problem 16.3)
(State diagrams: parts a and b show four-state schemes that are variants of Fig. 16.6, with different taken/not-taken transitions among the states "Predict taken", "Predict taken again", "Predict not taken", and "Predict not taken again".)
Feb. 2011 Computer Architecture, Data Path and Control Slide 78
Hardware Implementation of Branch Prediction
(Figure: a table holds the addresses of recent branch instructions, their target addresses, and history bit(s); the low-order bits of the PC index the table, a comparator checks whether the read-out entry matches the PC, and logic chooses between the incremented PC and the read-out target as the next PC.)
Fig. 16.7 Hardware elements for a branch prediction scheme.
The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches (Chapter 18).
Feb. 2011 Computer Architecture, Data Path and Control Slide 79
Pipeline Augmentations – Repeated for Reference
(Figure: the pipeline of Fig. 15.10 with branch predictor, next-address forwarders, hazard detector, ALU forwarders, and data cache forwarder.)
Fig. 15.10 Pipelined control signals.
Feb. 2011 Computer Architecture, Data Path and Control Slide 80
16.5 Advanced Pipelining
Deep pipeline = superpipeline; also superpipelined, superpipelining.
Parallel instruction issue = superscalar, j-way issue (2-4 is typical).
(Figure: instruction fetch and decode stages feed operand-prep and instruction-issue stages and several function units with variable numbers of stages, followed by retirement and commit stages.)
Fig. 16.8 Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement.
Feb. 2011 Computer Architecture, Data Path and Control Slide 81
Performance Improvement for Deep Pipelines
Hardware-based methods:
• Look ahead past an instruction that will/may stall in the pipeline (out-of-order execution; requires in-order retirement)
• Issue multiple instructions (requires more ports on the register file)
• Eliminate false data dependencies via register renaming
• Predict branch outcomes more accurately, or speculate
Software-based method: pipeline-aware compilation, e.g., loop unrolling to reduce the number of branches (a concrete sketch follows the CPI table below):
Loop: Compute with index i
      Increment i by 1
      Go to Loop if not done
becomes
Loop: Compute with index i
      Compute with index i + 1
      Increment i by 2
      Go to Loop if not done
Feb. 2011 Computer Architecture, Data Path and Control Slide 82
CPI Variations with Architectural Features
Table 16.2 Effect of processor architecture, branch prediction methods, and speculative execution on CPI.
Architecture               Methods used in practice                        CPI
Nonpipelined, multicycle   Strict in-order instruction issue and exec      5-10
Nonpipelined, overlapped   In-order issue, with multiple function units    3-5
Pipelined, static          In-order exec, simple branch prediction         2-3
Superpipelined, dynamic    Out-of-order exec, adv branch prediction        1-2
Superscalar                2- to 4-way issue, interlock & speculation      0.5-1
Advanced superscalar       4- to 8-way issue, aggressive speculation       0.2-0.5
3.3 inst/cycle × 3 Gigacycles/s ≅ 10 GIPS; 100 such processors are needed for TIPS performance, and 100,000 for 1 PIPS. Feb.
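As a concrete, if language-neutral, illustration of the unrolling transformation shown above, here is a minimal Python sketch: the unrolled loop does two iterations' worth of work per pass, halving the number of branch-back decisions (the array contents are arbitrary):

    # Loop unrolling by 2: one loop test per two elements instead of one per
    # element. Assumes an even-length array for brevity; a real compiler
    # would emit a cleanup loop for any leftover iteration.
    data = list(range(1000))

    total, i = 0, 0
    while i < len(data):        # original loop: one test per element
        total += data[i]
        i += 1

    total2, i = 0, 0
    while i < len(data):        # unrolled: one test per two elements
        total2 += data[i]       # compute with index i
        total2 += data[i + 1]   # compute with index i + 1
        i += 2                  # increment i by 2

    assert total == total2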
2011 Computer Architecture, Data Path and Control Slide 83
Development of Intel's Desktop/Laptop Micros
In the beginning, there was the 8080; it led to the 80x86 = IA32 ISA.
Half a dozen or so pipeline stages: 80286, 80386, 80486, Pentium (80586).
With more advanced technology, a dozen or so pipeline stages, with out-of-order instruction execution: Pentium Pro, Pentium II, Pentium III, Celeron.
With still more advanced technology, two dozen or so pipeline stages; instructions are broken into micro-ops, which are executed out of order but retired in order: Pentium 4.
Feb. 2011 Computer Architecture, Data Path and Control Slide 84
Current State of Computer Performance
Multi-GIPS/GFLOPS desktops and laptops:
• Very few users need even greater computing power
• Users are unwilling to upgrade just to get a faster processor
• Current emphasis is on power reduction and ease of use
Multi-TIPS/TFLOPS in large computer centers:
• World's top 500 supercomputers, http://www.top500.org
• Next list due in June 2009; as of Nov. 2008: all 500 >> 10 TFLOPS, ≈30 > 100 TFLOPS, 1 > 1 PFLOPS
Multi-PIPS/PFLOPS supercomputers are on the drawing board. IBM's "smarter planet" TV commercial proclaimed (in early 2009): "We just broke the petaflop [sic] barrier." The technical term "petaflops" is now in the public sphere.
Feb. 2011 Computer Architecture, Data Path and Control Slide 85
The Shrinking Supercomputer
Feb. 2011 Computer Architecture, Data Path and Control Slide 86
16.6 Dealing with Exceptions
Exceptions present the same problems as branches:
• How to handle instructions that are ahead in the pipeline? (Let them run to completion and retirement of their results.)
• What to do with instructions after the exception point? (Flush them out so that they do not affect the state.)
Precise versus imprecise exceptions:
• Precise exceptions hide the effects of pipelining and parallelism by forcing the same state as that of strict sequential execution (desirable, because exception handling is not complicated)
• Imprecise exceptions are messy, but lead to faster hardware (the interrupt handler can clean up to offer a precise exception)
Feb. 2011 Computer Architecture, Data Path and Control Slide 87
The Three Hardware Designs for MicroMIPS
(Figure: the single-cycle, multicycle, and pipelined data paths shown side by side.)
Single-cycle: 125 MHz, CPI = 1. Multicycle: 500 MHz, CPI ≅ 4. Pipelined: 500 MHz, CPI ≅ 1.1.
Feb. 2011 Computer Architecture, Data Path and Control Slide 88
Where Do We Go from Here?
Memory design: how to build a memory unit that responds in 1 clock
Input and output: peripheral devices, I/O programming, interfacing, interrupts
Higher performance: vector/array processing, parallel processing
Feb.
2011 Computer Architecture, Data Path and Control Slide 89
Part V Memory System Design
Feb. 2011 Computer Architecture, Memory System Design Slide 1
About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition: First. Released: July 2003. Revised: July 2004, July 2005, Mar. 2006, Mar. 2007, Mar. 2008, Feb. 2009, Feb. 2011.
Feb. 2011 Computer Architecture, Memory System Design Slide 2
V Memory System Design
Design problem: we want a memory unit that
• Can keep up with the CPU's processing speed
• Has enough capacity for programs and data
• Is inexpensive, reliable, and energy-efficient
Topics in This Part
Chapter 17 Main Memory Concepts
Chapter 18 Cache Memory Organization
Chapter 19 Mass Memory Concepts
Chapter 20 Virtual Memory and Paging
Feb. 2011 Computer Architecture, Memory System Design Slide 3
17 Main Memory Concepts
Technologies and organizations for the computer's main memory:
• SRAM (cache), DRAM (main), and flash (nonvolatile)
• Interleaving and pipelining to get around the "memory wall"
Topics in This Chapter
17.1 Memory Structure and SRAM
17.2 DRAM and Refresh Cycles
17.3 Hitting the Memory Wall
17.4 Interleaved and Pipelined Memory
17.5 Nonvolatile Memory
17.6 The Need for a Memory Hierarchy
Feb. 2011 Computer Architecture, Memory System Design Slide 4
17.1 Memory Structure and SRAM
(Figure: a 2^h × g SRAM chip built from an h-bit address decoder selecting one of 2^h rows of g flip-flops, with Write enable, Chip select, and Output enable controls; the shorthand symbol shows WE, D in, D out, Addr, CS, and OE.)
Fig. 17.1 Conceptual inner structure of a 2^h × g SRAM chip and its shorthand representation.
Feb. 2011 Computer Architecture, Memory System Design Slide 5
Multiple-Chip SRAM
(Figure: eight 128K × 8 chips in two ranks of four; 17 of the 18 address bits go to every chip, the MSB drives the chip selects, and the four chips of the selected rank supply data bytes 3 through 0 of the 32-bit word.)
Fig. 17.2 Eight 128K × 8 SRAM chips forming a 256K × 32 memory unit.
Feb. 2011 Computer Architecture, Memory System Design Slide 6
SRAM with Bidirectional Data Bus
Fig. 17.3 When data input and output of an SRAM chip are shared or connected to a bidirectional data bus, output must be disabled during write operations.
Feb. 2011 Computer Architecture, Memory System Design Slide 7
17.2 DRAM and Refresh Cycles
DRAM vs. SRAM memory cell complexity: a DRAM cell is a pass transistor plus a capacitor on a single bit line; a typical SRAM cell is a cross-coupled pair with complementary bit lines.
Fig. 17.4 Single-transistor DRAM cell, which is considerably simpler than an SRAM cell, leads to dense, high-capacity DRAM memory chips.
Feb. 2011 Computer Architecture, Memory System Design Slide 8
DRAM Refresh Cycles and Refresh Rate
(Figure: the voltage across a DRAM cell capacitor decays after a 1 is written and must be restored by a refresh operation every 10s of ms, before it falls below the threshold voltage separating a stored 1 from a stored 0.) Fig.
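Fig. 17.2's composition can be expressed as address arithmetic: the low 17 address bits go to every chip, and the top bit picks the rank of four chips. A hypothetical Python sketch (the chip numbering is mine, not the figure's):

    # Map a word address in a 256K x 32 unit built from eight 128K x 8 chips.
    def locate(word_addr, byte):
        assert 0 <= word_addr < 256 * 1024 and 0 <= byte < 4
        rank = word_addr >> 17              # MSB drives the chip selects
        chip_addr = word_addr & 0x1FFFF     # low 17 bits go to every chip
        chip = rank * 4 + (3 - byte)        # hypothetical numbering in a rank
        return chip, chip_addr

    print(locate(word_addr=200_000, byte=2))  # a chip in rank 1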
17.5 Variations in the voltage across a DRAM cell capacitor after writing a 1 and subsequent refresh operations.
Feb. 2011 Computer Architecture, Memory System Design Slide 9
Loss of Bandwidth to Refresh Cycles
Example 17.2
A 256 Mb DRAM chip is organized as a 32M × 8 memory externally and as a 16K × 16K array internally. Rows must be refreshed at least once every 50 ms to forestall data loss; refreshing a row takes 100 ns. What fraction of the total memory bandwidth is lost to refresh cycles?
(Figure: a square 16K × 16K memory matrix; a 14-bit row address selects a row into the row buffer, and an 11-bit column address selects the 8-bit data out through the column mux.)
Solution
Refreshing all 16K rows takes 16 × 1024 × 100 ns = 1.64 ms. A loss of 1.64 ms every 50 ms amounts to 1.64/50 = 3.3% of the total bandwidth.
Feb. 2011 Computer Architecture, Memory System Design Slide 10
DRAM Packaging
(Figure: a 24-pin dual in-line package (DIP). Legend: Ai = address bit i, CAS = column address strobe, Dj = data bit j, NC = no connection, OE = output enable, RAS = row address strobe, WE = write enable; plus Vcc and Vss power pins.)
Fig. 17.6 Typical DRAM package housing a 16M × 4 memory.
Feb. 2011 Computer Architecture, Memory System Design Slide 11
DRAM Evolution
(Chart: memory size per computer class, 1980-2010, on a log scale from 1 MB to 1 TB; supercomputers grow toward 1 TB, servers toward 256 GB, workstations toward 64 GB, large PCs toward 16 GB, and small PCs from 1 MB toward 4 GB.)
Fig. 17.7 Trends in DRAM main memory.
Feb. 2011 Computer Architecture, Memory System Design Slide 12
17.3 Hitting the Memory Wall
(Chart: relative performance, 1980-2010, log scale; processor performance climbs toward 10^6 while memory performance climbs far more slowly.)
Fig. 17.8 Memory density and capacity have grown along with the CPU power and complexity, but memory speed has not kept pace.
Feb. 2011 Computer Architecture, Memory System Design Slide 13
Bridging the CPU-Memory Speed Gap
Idea: retrieve more data from memory with each access.
(Figure: a wide-access memory feeding (a) a buffer and multiplexer at the memory side of a narrow bus to the processor, or (b) a wide bus with the buffer and multiplexer at the processor side.)
Fig. 17.9 Two ways of using a wide-access memory to bridge the speed gap between the processor and memory.
Feb. 2011 Computer Architecture, Memory System Design Slide 14
17.4 Pipelined and Interleaved Memory
Memory latency may involve other supporting operations besides the physical access itself: virtual-to-physical address translation (Chapter 20) and tag comparison to determine cache hit/miss (Chapter 18).
(Pipeline stages: address translation, row decoding & readout, column decoding & selection, tag comparison & validation.)
Fig. 17.10 Pipelined cache memory.
Feb. 2011 Computer Architecture, Memory System Design Slide 15
Memory Interleaving
(Figure: a dispatcher sends each access to one of four modules based on the 2 LSBs of the address; module 0 holds addresses 0, 4, 8, ..., module 1 holds 1, 5, 9, ..., module 2 holds 2, 6, 10, ..., and module 3 holds 3, 7, 11, ..., so short bus cycles can overlap the longer memory cycles.)
Fig. 17.11 Interleaved memory is more flexible than wide-access memory in that it can handle multiple independent accesses at once. Feb.
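Example 17.2's arithmetic, checked in Python (all parameters from the example):

    # Fraction of DRAM bandwidth lost to refresh: 16K rows, 100 ns per row,
    # a full refresh sweep required every 50 ms.
    rows, t_row_ns, period_ms = 16 * 1024, 100, 50

    refresh_ms = rows * t_row_ns * 1e-6      # 1.64 ms per full sweep
    print(f"refresh time = {refresh_ms:.2f} ms")
    print(f"bandwidth lost = {100 * refresh_ms / period_ms:.1f}%")   # 3.3%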
2011 Computer Architecture, Memory System Design Slide 16
17.5 Nonvolatile Memory
(Figure: ROM/PROM/EPROM organization; word lines select rows whose connections to the bit lines fix the stored word contents, e.g., 1010, 1001, 0010, 1101.)
Fig. 17.12 Read-only memory organization, with the fixed contents shown on the right.
Feb. 2011 Computer Architecture, Memory System Design Slide 17
Flash Memory
(Figure: an array of floating-gate MOS transistors with control gates on word lines, sources on source lines, and drains on bit lines, built with n+ regions on an n− p substrate.)
Fig. 17.13 EEPROM or Flash memory organization. Each memory cell is built of a floating-gate MOS transistor.
Feb. 2011 Computer Architecture, Memory System Design Slide 18
17.6 The Need for a Memory Hierarchy
The widening speed gap between CPU and main memory:
• Processor operations take on the order of 1 ns
• Memory access requires 10s or even 100s of ns
Memory bandwidth limits the instruction execution rate:
• Each instruction executed involves at least one memory access
• Hence, a few to 100s of MIPS is the best that can be achieved
A fast buffer memory can help bridge the CPU-memory gap. The fastest memories are expensive and thus not very large, so a second (third?) intermediate cache level is often used.
Feb. 2011 Computer Architecture, Memory System Design Slide 19
Typical Levels in a Hierarchical Memory
Level        Capacity   Access latency   Cost per GB
Registers    100s B     ns               $Millions
Cache 1      10s KB     a few ns         $100s Ks
Cache 2      MBs        10s ns           $10s Ks
Main         100s MB    100s ns          $1000s
Secondary    10s GB     10s ms           $10s
Tertiary     TBs        min+             $1s
(A speed gap separates the semiconductor levels from the secondary and tertiary levels.)
Fig. 17.14 Names and key characteristics of levels in a memory hierarchy.
Feb. 2011 Computer Architecture, Memory System Design Slide 20
Memory Price Trends
(Chart: $/GB on a log scale from 100K down below 1, over time, for DRAM, flash, and hard disk drives.)
Source: https://www1.hitachigst.com/hdd/technolo/overview/chart03.html
Feb. 2011 Computer Architecture, Memory System Design Slide 21
18 Cache Memory Organization
Processor speed is improving at a faster rate than memory's:
• The processor-memory speed gap has been widening
• Cache is to main as a desk drawer is to a file cabinet
Topics in This Chapter
18.1 The Need for a Cache
18.2 What Makes a Cache Work?
18.3 Direct-Mapped Cache
18.4 Set-Associative Cache
18.5 Cache and Main Memory
18.6 Improving Cache Performance
Feb. 2011 Computer Architecture, Memory System Design Slide 22
18.1 The Need for a Cache
(Figure: the single-cycle, multicycle, and pipelined MicroMIPS data paths, repeated.) All three of our MicroMIPS designs assumed 2-ns data and instruction memories; however, typical RAMs are 10-50 times slower. Single-cycle: 125 MHz, CPI = 1. Multicycle: 500 MHz, CPI ≅ 4. Pipelined: 500 MHz, CPI ≅ 1.1. Feb.
2011 Computer Architecture, Memory System Design Slide 23
Cache, Hit/Miss Rate, and Effective Access Time
Cache is transparent to the user; transfers occur automatically between the register file, the (fast) cache memory, and the (slow) main memory, in units of words and cache lines.
With one level of cache, data is in the cache a fraction h of the time (say, a hit rate of 98%), and we go to main memory 1 – h of the time (say, a cache miss rate of 2%):
Ceff = hCfast + (1 – h)(Cslow + Cfast) = Cfast + (1 – h)Cslow
Feb. 2011 Computer Architecture, Memory System Design Slide 24
Multiple Cache Levels
(Figure: (a) level-2 cache between the level-1 cache and main memory, which is cleaner and easier to analyze; (b) level-2 cache connected to a "backside" bus.)
Fig. 18.1 Cache memories act as intermediaries between the superfast processor and the much slower main memory.
Feb. 2011 Computer Architecture, Memory System Design Slide 25
Performance of a Two-Level Cache System
Example 18.1
A system with L1 and L2 caches has a CPI of 1.2 with no cache miss. There are 1.1 memory accesses on average per instruction. The L1 local hit rate is 95% with an 8-cycle miss penalty; the L2 local hit rate is 80% with a 60-cycle miss penalty. What is the effective CPI with cache misses factored in? What are the effective hit rate and miss penalty overall if the L1 and L2 caches are modeled as a single cache?
Solution
Ceff = Cfast + (1 – h1)[Cmedium + (1 – h2)Cslow]
Because Cfast is included in the CPI of 1.2, we must account for the rest:
CPI = 1.2 + 1.1(1 – 0.95)[8 + (1 – 0.8)60] = 1.2 + 1.1 × 0.05 × 20 = 2.3
Overall: hit rate 99% (95% + 80% of 5%), miss penalty 60 cycles.
Feb. 2011 Computer Architecture, Memory System Design Slide 26
Cache Memory Design Parameters
Cache size (in bytes or words). A larger cache can hold more of the program's useful data but is more costly and likely to be slower.
Block or cache-line size (unit of data transfer between cache and main). With a larger cache line, more data is brought into the cache with each miss. This can improve the hit rate but may also bring in low-utility data.
Placement policy. Determines where an incoming cache line is stored. More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).
Replacement policy. Determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies: choosing a random or the least recently used block.
Write policy. Determines whether updates to cache words are immediately forwarded to main memory (write-through) or modified blocks are copied back to main if and when they must be replaced (write-back or copy-back).
Feb. 2011 Computer Architecture, Memory System Design Slide 27
18.2 What Makes a Cache Work?
Temporal locality and spatial locality.
(Figure: a many-to-one address mapping from main memory onto the cache; a 9-instruction program loop fits entirely within a few cache lines/blocks, the units of transfer between main and cache memories.)
Fig. 18.2 Assuming no conflict in address mapping, the cache will hold a small program loop in its entirety, leading to fast execution. Feb.
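Example 18.1's model in a few lines of Python (all parameters from the example):

    # Effective CPI with two cache levels: base CPI plus, per memory access,
    # the L1 miss penalty and (for L1+L2 misses) the main-memory penalty.
    base_cpi, accesses = 1.2, 1.1
    h1, h2 = 0.95, 0.80
    c_medium, c_slow = 8, 60          # L1 and L2 miss penalties, in cycles

    cpi = base_cpi + accesses * (1 - h1) * (c_medium + (1 - h2) * c_slow)
    print(f"effective CPI = {cpi:.1f}")                    # 2.3
    print(f"overall hit rate = {h1 + (1 - h1) * h2:.2%}")  # 99%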
2011 Computer Architecture, Memory System Design Slide 28
Desktop, Drawer, and File Cabinet Analogy
Once the "working set" is in the drawer, very few trips to the file cabinet are needed: access the file cabinet in 30 s (main memory), the drawer in 5 s (cache memory), and the desktop in 2 s (register file).
Fig. 18.3 Items on a desktop (register) or in a drawer (cache) are more readily accessible than those in a file cabinet (main memory).
Feb. 2011 Computer Architecture, Memory System Design Slide 29
Temporal and Spatial Localities
From Peter Denning's CACM paper, July 2005 (Vol. 48, No. 7, pp. 19-24):
Temporal: accesses to the same address are typically clustered in time.
Spatial: when a location is accessed, nearby locations tend to be accessed also.
(Figure: addresses plotted against time; the region accessed at any moment, the working set, forms slowly drifting bands.)
Feb. 2011 Computer Architecture, Memory System Design Slide 30
Caching Benefits Related to Amdahl's Law
Example 18.2
In the drawer and file cabinet analogy, assume a hit rate h in the drawer. Formulate the situation shown in Fig. 18.2 in terms of Amdahl's law.
Solution
Without the drawer, a document is accessed in 30 s; so fetching 1000 documents, say, would take 30 000 s. The drawer causes a fraction h of the cases to be done 6 times as fast, with access time unchanged for the remaining 1 – h. The speedup is thus 1/(1 – h + h/6) = 6/(6 – 5h). Improving the drawer access time can increase the speedup factor, but as long as the miss rate remains at 1 – h, the speedup can never exceed 1/(1 – h). Given h = 0.9, for instance, the speedup is 4, with the upper bound being 10 for an extremely short drawer access time.
Note: Some would place everything on their desktop, thinking that this yields even greater speedup. This strategy is not recommended!
Feb. 2011 Computer Architecture, Memory System Design Slide 31
Compulsory, Capacity, and Conflict Misses
Compulsory misses: With on-demand fetching, the first access to any item is a miss. Some "compulsory" misses can be avoided by prefetching.
Capacity misses: We have to oust some items to make room for others. This leads to misses that are not incurred with an infinitely large cache.
Conflict misses: Occasionally, there is free room, or space occupied by useless data, but the mapping/placement scheme forces us to displace useful items to bring in other items. This may lead to misses in the future.
Given a fixed-size cache, dictated, e.g., by cost factors or the availability of space on the processor chip, compulsory and capacity misses are pretty much fixed. Conflict misses, on the other hand, are influenced by the data mapping scheme, which is under our control. We study two popular mapping schemes: direct and set-associative.
Feb. 2011 Computer Architecture, Memory System Design Slide 32
18.3 Direct-Mapped Cache
(Figure: main memory word addresses 0-3, 4-7, 8-11, ..., 32-35, ..., 96-99, 100-103, 104-107 map many-to-one onto eight 4-word cache lines; a 3-bit line index selects the line, a 2-bit word offset selects the word, and the stored tag and valid bit are compared against the address tag to detect a cache miss.)
Fig. 18.4 Direct-mapped cache holding 32 words within eight 4-word lines. Each line is associated with a tag and a valid bit.
Feb. 2011 Computer Architecture, Memory System Design Slide 33
Accessing a Direct-Mapped Cache
Example 18.4
Show cache addressing for a byte-addressable memory with 32-bit addresses. Cache line width W = 16 B. Cache size L = 4096 lines (64 KB).
Solution
The byte offset in a line is log₂16 = 4 b. The cache line index is log₂4096 = 12 b. This leaves 32 – 12 – 4 = 16 b for the tag (32-bit address = 16-bit line tag + 12-bit line index in cache + 4-bit byte offset in line).
Fig. 18.5 Components of the 32-bit address in an example direct-mapped cache with byte addressing.
Feb. 2011 Computer Architecture, Memory System Design Slide 34
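Example 18.4's address split, written as executable bit manipulation (parameters from the example; the sample address is arbitrary):

    # Decompose a 32-bit byte address for a direct-mapped cache with
    # 16 B lines and 4096 lines: 4 offset bits, 12 index bits, 16 tag bits.
    def split_address(addr, offset_bits=4, index_bits=12):
        offset = addr & ((1 << offset_bits) - 1)
        index = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset

    tag, index, offset = split_address(0x1234ABCD)
    print(f"tag={tag:#06x} index={index:#05x} offset={offset:#03x}")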
Direct-Mapped Cache Behavior
Address trace (for the cache of Fig. 18.4): 1, 7, 6, 5, 32, 33, 1, 2, . . .
1: miss; the line holding words 3, 2, 1, 0 is fetched
7: miss; the line holding words 7, 6, 5, 4 is fetched
6: hit
5: hit
32: miss; the line holding words 35, 34, 33, 32 is fetched (replaces 3, 2, 1, 0)
33: hit
1: miss; the line holding words 3, 2, 1, 0 is fetched (replaces 35, 34, 33, 32)
2: hit
... and so on
Feb. 2011 Computer Architecture, Memory System Design Slide 35
18.4 Set-Associative Cache
(Figure: a two-way set-associative cache; a 2-bit set index selects a set, the tag and specified word are read out from each of the set's two line options, and two comparators plus valid bits decide which option, if either, hits.)
Fig. 18.6 Two-way set-associative cache holding 32 words of data within 4-word lines and 2-line sets.
Feb. 2011 Computer Architecture, Memory System Design Slide 36
Accessing a Set-Associative Cache
Example 18.5
Show the cache addressing scheme for a byte-addressable memory with 32-bit addresses. Cache line width = 2^W = 16 B. Set size = 2^S = 2 lines. Cache size = 2^L = 4096 lines (64 KB).
Solution
The byte offset in a line is log₂16 = 4 b. The cache set index is log₂(4096/2) = 11 b. This leaves 32 – 11 – 4 = 17 b for the tag.
Fig. 18.7 Components of the 32-bit address in an example two-way set-associative cache: a 17-bit line tag, an 11-bit set index (used to read out two candidate items and their control info), and a 4-bit byte offset in the line.
Feb. 2011 Computer Architecture, Memory System Design Slide 37
Cache Address Mapping
Example 18.6
A 64 KB four-way set-associative cache is byte-addressable and contains 32 B lines. Memory addresses are 32 b wide.
a. How wide are the tags in this cache?
b. Which main memory addresses are mapped to set number 5?
Solution
a. Set size = 4 × 32 B = 128 B; number of sets = 2^16/2^7 = 2^9. The address (32 b) = 5 b byte offset + 9 b set index + 18 b tag, so the tag width is 32 – 5 – 9 = 18 b.
b. Addresses that have their 9-bit set index equal to 5. These are of the general form 2^14 a + 2^5 × 5 + b; e.g., 160-191, 16 544-16 575, . . .
Feb. 2011 Computer Architecture, Memory System Design Slide 38
18.5 Cache and Main Memory
Split cache: separate instruction and data caches (L1). Unified cache: holds instructions and data (L1, L2, L3). Harvard architecture: separate instruction and data memories. von Neumann architecture: one memory for instructions and data.
The writing problem: write-through slows down the cache to allow main memory to catch up; write-back (copy-back) is less problematic, but still hurts performance due to two main memory accesses in some cases. Solution: provide write buffers for the cache so that it does not have to wait for main memory to catch up.
Feb. 2011 Computer Architecture, Memory System Design Slide 39
Faster Main-Cache Data Transfers
(Figure: a byte address feeds a 14-bit row address to the row decoder of a 16Kb × 16Kb matrix; the selected 16 Kb = 2 KB row is latched in a row buffer, and an 11-bit column address selects the data byte out through the column mux.) Fig.
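The trace behavior above is easy to reproduce with a tiny simulator of Fig. 18.4's cache (eight 4-word lines, direct-mapped):

    # Track which block (tag) currently occupies each of the 8 cache lines.
    lines = [None] * 8                       # stored tag per line (None = invalid)

    for word_addr in [1, 7, 6, 5, 32, 33, 1, 2]:
        block = word_addr // 4               # drop the 2-bit word offset
        index, tag = block % 8, block // 8   # 3-bit line index; rest is the tag
        if lines[index] == tag:
            print(f"{word_addr}: hit")
        else:
            print(f"{word_addr}: miss, fetch words {4*block}..{4*block+3}")
            lines[index] = tag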
18.8 A 256 Mb DRAM chip organized as a 32M × 8 memory module: four such chips could form a 128 MB main memory unit.
Feb. 2011 Computer Architecture, Memory System Design Slide 40
18.6 Improving Cache Performance
For a given cache size, the following design issues and tradeoffs exist:
Line width (2^W). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.
Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses. More on this later.
Line replacement policy. Usually the LRU (least recently used) algorithm or some approximation thereof; not an issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice.
Write policy. Modern caches are very fast, so write-through is seldom a good choice. We usually implement write-back or copy-back, using write buffers to soften the impact of main memory latency.
Feb. 2011 Computer Architecture, Memory System Design Slide 41
Effect of Associativity on Cache Performance
(Chart: miss rate, from about 0.3 downward, plotted against associativity of direct, 2-way, 4-way, 8-way, 16-way, 32-way, and 64-way; the miss rate falls as associativity increases, with diminishing returns.)
Fig. 18.9 Performance improvement of caches with increased associativity.
Feb. 2011 Computer Architecture, Memory System Design Slide 42
19 Mass Memory Concepts
Today's main memory is huge, but still inadequate for all needs:
• Magnetic disks provide extended and back-up storage
• Optical disks and disk arrays are other mass storage options
Topics in This Chapter
19.1 Disk Memory Basics
19.2 Organizing Data on Disk
19.3 Disk Performance
19.4 Disk Caching
19.5 Disk Arrays and RAID
19.6 Other Types of Mass Memory
Feb. 2011 Computer Architecture, Memory System Design Slide 43
19.1 Disk Memory Basics
(Figure: platters on a spindle, each with a recording area of tracks 0 to c – 1 divided into sectors; read/write heads on arms moved by an actuator, with the direction of rotation marked.)
Fig. 19.1 Disk memory elements and key terms.
Feb. 2011 Computer Architecture, Memory System Design Slide 44
Disk Drives
Typically 2-8 cm in diameter.
Feb. 2011 Computer Architecture, Memory System Design Slide 45
Access Time for a Disk
The three components of disk access time:
1. Head movement from the current position to the desired cylinder: seek time (0-10s of ms)
2. Disk rotation until the desired sector arrives under the head: rotational latency (0-10s of ms)
3. Disk rotation until the sector has passed under the head: data transfer time (< 1 ms)
Disks that spin faster have a shorter average and worst-case access time.
Feb. 2011 Computer Architecture, Memory System Design Slide 46
Representative Magnetic Disks
Table 19.1 Key attributes of three representative magnetic disks, from the highest capacity to the smallest physical size (ca. early 2003). [More detail (weight, dimensions, recording density, etc.) in the textbook.]
Attribute                Seagate Barracuda 180   Hitachi DK23DA   IBM Microdrive
Application domain       Server                  Laptop           Pocket device
Capacity                 180 GB                  40 GB            1 GB
Platters / Surfaces      12 / 24                 2 / 4            1 / 2
Cylinders                24 247                  33 067           7 167
Sectors per track, avg   604                     591              140
Buffer size              16 MB                   2 MB             1/8 MB
Seek time, min/avg/max   1, 8, 17 ms             3, 13, 25 ms     1, 12, 19 ms
Diameter                 3.5″                    2.5″             1.0″
Rotation speed, rpm      7 200                   4 200            3 600
Typical power            14.1 W                  2.3 W            0.8 W
Feb.
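With Table 19.1's numbers, the access-time components combine simply: average rotational latency is half a revolution, i.e., (30/rpm) s, and the transfer time is one sector's share of a full revolution. A Python sketch (drive parameters from the table):

    # Average access time = avg seek + half-revolution latency + transfer.
    def avg_access_ms(avg_seek_ms, rpm, sectors_per_track):
        rotational = 30_000 / rpm                      # (30000 / rpm) ms
        transfer = 2 * rotational / sectors_per_track  # one sector's passage
        return avg_seek_ms + rotational + transfer

    for name, seek, rpm, spt in [("Barracuda 180", 8, 7200, 604),
                                 ("DK23DA", 13, 4200, 591),
                                 ("Microdrive", 12, 3600, 140)]:
        print(f"{name}: {avg_access_ms(seek, rpm, spt):.1f} ms")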
2011 Computer Architecture, Memory System Design Slide 47
19.2 Organizing Data on Disk
(Figure: magnetic recording along a track, with sectors 1 through 5 separated by gaps and read/written by a thin-film head over the magnetic medium.)
Fig. 19.2 Magnetic recording along the tracks and the read/write head.
(Figure: logical sector numbers on tracks i through i + 3 are skewed from track to track, e.g., 0, 30, 60, 27, ... on one track and 48, 15, 45, 12, ... on the next, so sequential sectors can be read without waiting a full revolution after a head switch or seek.)
Fig. 19.3 Logical numbering of sectors on several adjacent tracks.
Feb. 2011 Computer Architecture, Memory System Design Slide 48
19.3 Disk Performance
Seek time = a + b(c – 1) + β(c – 1)^(1/2), where c is the number of cylinders traveled
Average rotational latency = (30 / rpm) s = (30 000 / rpm) ms
(Figure: with access requests arriving in the order A, B, C, D, E, F, an out-of-order schedule such as C, F, D, E, B, A can reduce total head movement and rotational waiting.)
Fig. 19.4 Reducing average seek time and rotational latency by performing disk accesses out of order.
Feb. 2011 Computer Architecture, Memory System Design Slide 49
19.4 Disk Caching
Same idea as the processor cache: bridge the main-disk speed gap.
Read/write an entire track with each disk access ("access one sector, get 100s free"); the hit rate is around 90%. The disks listed in Table 19.1 have buffers from 1/8 to 16 MB. Rotational latency is eliminated, since reading can start from any sector. Back-up power is needed so as not to lose changes in the disk cache (it is needed anyway for head retraction upon power loss).
Placement options for the disk cache:
• In the disk controller: suffers from bus and controller latencies even for a cache hit
• Closer to the CPU: avoids latencies and allows for better utilization of space
• Intermediate or multilevel solutions
Feb. 2011 Computer Architecture, Memory System Design Slide 50
19.5 Disk Arrays and RAID
The need for high-capacity, high-throughput secondary (disk) memory, per Amdahl's rules of thumb for system balance (1 RAM byte for each IPS; 1 I/O bit per second for each IPS; 100 disk bytes for each RAM byte):
Processor   RAM size   Disk I/O rate   Number of disks   Disk capacity   Number of disks
1 GIPS      1 GB       100 MB/s        1                 100 GB          1
1 TIPS      1 TB       100 GB/s        1000              100 TB          100
1 PIPS      1 PB       100 TB/s        1 Million         100 PB          100 000
1 EIPS      1 EB       100 PB/s        1 Billion         100 EB          100 Million
Feb. 2011 Computer Architecture, Memory System Design Slide 51
Redundant Array of Independent Disks (RAID)
Data organization on multiple disks:
RAID0: Multiple disks for higher data rate; no redundancy
RAID1: Mirrored disks (each data disk paired with a mirror disk)
RAID2: Error-correcting code
RAID3: Bit- or byte-level striping with a parity/checksum disk (plus a spare)
RAID4: Parity/checksum applied to sectors, not bits or bytes
RAID5: Parity/checksum distributed across several disks
RAID6: Parity and a second check distributed across several disks
A ⊕ B ⊕ C ⊕ D ⊕ P = 0 → B = A ⊕ C ⊕ D ⊕ P
Fig. 19.5 RAID levels 0-6, with a simplified view of data organization.
Feb. 2011 Computer Architecture, Memory System Design Slide 52
RAID Product Examples
IBM ESS Model 750
Feb. 2011 Computer Architecture, Memory System Design Slide 53
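The XOR identity on the RAID slide is the whole reconstruction story: parity is the XOR of the data blocks, so any single lost block is the XOR of the survivors. A minimal Python sketch (the byte values are arbitrary):

    from functools import reduce

    # RAID parity: P = A xor B xor C xor D, so a lost block equals the XOR
    # of the surviving blocks (A xor B xor C xor D xor P = 0).
    def xor_blocks(*blocks):
        return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

    A, B, C, D = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0", b"\x55\xaa"
    P = xor_blocks(A, B, C, D)            # parity written to the parity disk
    assert xor_blocks(A, C, D, P) == B    # disk B failed: rebuild from the rest
    print("reconstructed B:", xor_blocks(A, C, D, P).hex())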
19.6 Other Types of Mass Memory
Fig. 3.12 Magnetic and optical disk memory units: (a) cutaway view of a hard disk drive; (b) some removable storage media, including the floppy disk, CD-ROM, magnetic tape cartridge (typically 2-9 cm), and flash, thumb, or travel drives.

Optical Disks
Optical disks use spiral, rather than concentric, tracks; a laser diode, beam splitter, lenses, and detector sense the pits on each track through the protective coating and substrate.
Fig. 19.6 Simplified view of recording format and access mechanism for data on a CD-ROM or DVD-ROM.

Automated Tape Libraries

20 Virtual Memory and Paging
Managing data transfers between main & mass is cumbersome
• Virtual memory automates this process
• Key to virtual memory's success is the same as for cache
Topics in This Chapter
20.1 The Need for Virtual Memory
20.2 Address Translation in Virtual Memory
20.3 Translation Lookaside Buffer
20.4 Page Placement and Replacement
20.5 Main and Mass Memories
20.6 Improving Virtual Memory Performance

20.1 The Need for Virtual Memory
Fig. 20.1 Program segments in main memory and on disk: only the active pieces of the program and data (including the stack) reside in memory, while the complete program and data occupy several disk tracks, leaving unused space in memory.

Memory Hierarchy: The Big Picture
Words move between registers and cache (transferred explicitly via load/store); lines move between cache and main memory (transferred automatically upon cache miss); pages move between main memory and virtual (disk) memory (transferred automatically upon page fault).
Fig. 20.2 Data movement in a memory hierarchy.

20.2 Address Translation in Virtual Memory
A virtual address consists of a virtual page number (V - P bits) and an offset in page (P bits); address translation yields a physical address with a physical page number (M - P bits) and the same P-bit offset.
Fig. 20.3 Virtual-to-physical address translation parameters.
Example 20.1 Determine the parameters in Fig. 20.3 for 32-bit virtual addresses, 4 KB pages, and 128 MB byte-addressable main memory.
Solution: Physical addresses are 27 b (128 MB = 2^27 B) and the byte offset in a page is 12 b; thus, virtual page numbers are 32 - 12 = 20 b and physical page numbers are 27 - 12 = 15 b.

Page Tables and Address Translation
The page table register points to the page table, which maps each virtual page number to a page of main memory, with valid bits and other flags kept per entry.
Fig. 20.4 The role of the page table in the virtual-to-physical address translation process.

Protection and Sharing in Virtual Memory
Page table entries of two processes may point to the same memory page (sharing), with permission bits distinguishing read & write access from read-only access; entries for pages not in memory point to disk.
Fig. 20.5 Virtual memory as a facilitator of sharing and memory protection.

The Latency Penalty of Virtual Memory
With the page table held in main memory, every virtual memory reference entails two physical accesses: memory access 1 reads the page table entry, and memory access 2 uses the resulting physical address (Fig. 20.4).
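The translation step of Example 20.1, reduced to bit manipulation in C. This is a sketch: the page-table lookup that would supply the physical page number is replaced by an assumed constant.

    #include <stdint.h>
    #include <stdio.h>

    /* Example 20.1 parameters: 32-bit virtual address, 4 KB pages
       (12-bit offset), 128 MB physical memory (27-bit physical address). */
    #define PAGE_BITS 12
    #define PAGE_MASK ((1u << PAGE_BITS) - 1)

    int main(void) {
        uint32_t va  = 0x12345678;
        uint32_t vpn = va >> PAGE_BITS;    /* 20-bit virtual page number */
        uint32_t off = va & PAGE_MASK;     /* 12-bit offset, unchanged   */
        uint32_t ppn = 0x5A3C;             /* 15-bit physical page number,
                                              assumed to come from the
                                              page table                  */
        uint32_t pa  = (ppn << PAGE_BITS) | off;  /* 27-bit physical addr */
        printf("VPN=0x%05x offset=0x%03x PA=0x%07x\n", vpn, off, pa);
        return 0;
    }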
20.3 Translation Lookaside Buffer
A TLB caches recent translations: the virtual page number is compared against the TLB tags, and when the tags match and the entry is valid, the physical page number is read out and combined with the byte offset to form the physical address, parts of which then serve as the cache tag and cache index.
All instructions on a given program page have the same virtual page address and thus entail the same translation, as in this loop from a program page in virtual memory:
      lw   $t0,0($s1)
      addi $t1,$zero,0
L:    add  $t1,$t1,1
      beq  $t1,$s2,D
      add  $t2,$t1,$t1
      add  $t2,$t2,$t2
      add  $t2,$t2,$s1
      lw   $t3,0($t2)
      slt  $t4,$t0,$t3
      beq  $t4,$zero,L
      addi $t0,$t3,0
      j    L
D:    ...
Fig. 20.6 Virtual-to-physical address translation by a TLB and how the resulting physical address is used to access the cache memory.

Address Translation via TLB
Example 20.2 An address translation process converts a 32-bit virtual address to a 32-bit physical address. Memory is byte-addressable with 4 KB pages. A 16-entry, direct-mapped TLB is used. Specify the components of the virtual and physical addresses and the width of the various TLB fields.
Solution: The 20-bit virtual page number splits into a 4-bit TLB index and a 16-bit tag; the 12-bit byte offset passes through unchanged, and the physical page number is 20 bits. TLB word width = 16-bit tag + 20-bit physical page number + 1 valid bit + other flags ≥ 37 bits.

Virtual- or Physical-Address Cache?
Fig. 20.7 Options for where virtual-to-physical address translation occurs: a virtual-address cache is accessed before the TLB, a physical-address cache after it, and a hybrid-address cache may be accessed with the part of the address that is common between virtual and physical addresses. TLB access may form an extra pipeline stage, so the penalty in throughput can be insignificant.

20.4 Page Replacement Policies
Least-recently used (LRU) policy: implemented by maintaining a stack with the MRU page on top and the LRU page at the bottom; each page access moves that page to the top, and the bottom page is the replacement victim. (The slide traces the stack for the access sequence A, B, A, F, B, E, A.)

Approximate LRU Replacement Policy
The least-recently used policy is effective, but hard to implement. Approximate versions of LRU are more easily implemented.
Clock policy: the page slots are visited in circular order (hence the name). A use bit is set to 1 whenever a page is accessed; the clock hand skips slots whose use bit is 1 (clearing the bit as it passes) and replaces the first page whose use bit is 0.
Fig. 20.8 A scheme for the approximate implementation of LRU, showing the use bits of eight page slots (a) before and (b) after a replacement.

LRU Is Not Always the Best Policy
Example 20.3 Computing column averages for a 17 × 1024 table with a 16-page memory:
for j = [0 … 1023] { temp = 0; for i = [0 … 16] temp = temp + T[i][j]; print(temp/17.0); }
Evaluate the page faults for row-major and column-major storage.
Solution: With row-major storage, the column sweep cycles through pages of all 17 rows while only 16 pages fit in memory, the worst case for LRU, which keeps evicting exactly the page that will be needed soonest.
Fig. 20.9 Pagination of a 17 × 1024 table with row- or column-major storage.

20.5 Main and Mass Memories
Working set of a process, W(t, x): the set of pages accessed over the last x instructions at time t
The principle of locality ensures that the working set changes slowly
Fig. 20.10 Variations in the size of a program's working set.
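The definition of W(t, x) translates directly into code. A C sketch that counts the distinct pages touched by the last x accesses of a trace; the trace and window size below are made up for illustration.

    #include <stdio.h>

    #define NPAGES 16

    /* |W(t, x)|: number of distinct pages among accesses t-x .. t-1. */
    int working_set_size(const int *trace, int t, int x) {
        int seen[NPAGES] = {0}, count = 0;
        for (int k = (t >= x ? t - x : 0); k < t; k++)
            if (!seen[trace[k]]) { seen[trace[k]] = 1; count++; }
        return count;
    }

    int main(void) {
        int trace[] = {0, 1, 2, 0, 1, 3, 0, 1, 4, 5, 4, 5, 4, 5};
        int t = (int)(sizeof trace / sizeof trace[0]);
        printf("|W(t, 6)| = %d\n", working_set_size(trace, t, 6)); /* 2 */
        return 0;
    }

Because of locality, the count stays small and changes slowly, which is what makes demand paging workable.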
20.6 Improving Virtual Memory Performance
Table 20.1 Memory hierarchy parameters and their effects on performance
Parameter variation | Potential advantages | Possible disadvantages
Larger main or cache size | Fewer capacity misses | Longer access time
Larger pages or longer lines | Fewer compulsory misses (prefetching effect) | Greater miss penalty
Greater associativity (for cache only) | Fewer conflict misses | Longer access time
More sophisticated replacement policy | Fewer conflict misses | Longer decision time, more hardware
Write-through policy (for cache only) | No write-back time penalty, easier write-miss handling | Wasted memory bandwidth, longer access time

Impact of Technology on Virtual Memory
Fig. 20.11 Trends in disk, main memory, and CPU speeds, 1980-2010: disk seek time remains in the ms range, DRAM access time in the ns range, and CPU cycle time approaches the ps range.

Performance Impact of the Replacement Policy
Fig. 20.12 Dependence of page faults on the number of pages allocated and the page replacement policy: page fault rate (0 to 0.04) versus pages allocated (0 to 15) for first-in first-out, approximate LRU, least recently used, and the ideal (best possible) policy.

Summary of Memory Hierarchy
Cache memory: provides the illusion of very high speed
Main memory: reasonable cost, but slow & small
Virtual memory: provides the illusion of very large size
Locality makes the illusions work
Fig. 20.2 Data movement in a memory hierarchy: words between registers and cache (explicit, via load/store), lines between cache and main memory (automatic, upon cache miss), and pages between main and virtual memory (automatic, upon page fault).

Part VI Input/Output and Interfacing

About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
First edition released July 2003; revised July 2004, July 2005, Mar. 2007, Mar. 2008, Mar. 2009, and Feb. 2011.

VI Input/Output and Interfacing
Effective computer design & use requires awareness of:
• I/O device types, technologies, and performance
• Interaction of I/O with memory and CPU
• Automatic data collection and device actuation
Topics in This Part
Chapter 21 Input/Output Devices
Chapter 22 Input/Output Programming
Chapter 23 Buses, Links, and Interfacing
Chapter 24 Context Switching and Interrupts Feb.
2011 Computer Architecture, Input/Output and Interfacing Slide 3 21 Input/Output Devices Learn about input and output devices as categorized by: • Type of data presentation or recording • Data rate, which influences interaction with system Topics in This Chapter 21.1 Input/Output Devices and Controllers 21.2 Keyboard and Mouse 21.3 Visual Display Units 21.4 Hard-Copy Input/Output Devices 21.5 Other Input/Output Devices 21.6 Networking of Input/Output Devices Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 4 Section 21.1: Introduction Section 21.3 Section 21.4 Section 21.2 Feb. 2011 Section 21.5: Other devices Section 21.6: Networked I/O Computer Architecture, Input/Output and Interfacing Slide 5 21.1 Input/Output Devices and Controllers Table 3.3 Some input, output, and two-way I/O devices. Input type Prime examples Other examples Data rate (b/s) Main uses Symbol Keyboard, keypad Music note, OCR 10s Ubiquitous Position Mouse, touchpad Stick, wheel, glove 100s Ubiquitous Identity Barcode reader Badge, fingerprint 100s Sales, security Sensory Touch, motion, light Scent, brain signal 100s Control, security Audio Microphone Phone, radio, tape 1000s Ubiquitous Image Scanner, camera Graphic tablet 1000s-106s Photos, publishing Video Camcorder, DVD VCR, TV cable 1000s-109s Entertainment Output type Prime examples Other examples Data rate (b/s) Main uses Symbol LCD line segments LED, status light 10s Ubiquitous Position Stepper motor Robotic motion 100s Ubiquitous Warning Buzzer, bell, siren Flashing light A few Safety, security Sensory Braille text Scent, brain stimulus 100s Personal assistance Audio Speaker, audiotape Voice synthesizer 1000s Ubiquitous Image Monitor, printer Plotter, microfilm 1000s Ubiquitous Video Monitor, TV screen Film/video recorder 1000s-109s Entertainment Two-way I/O Prime examples Other examples Data rate (b/s) Main uses Mass storage Hard/floppy disk CD, tape, archive 106s Ubiquitous Network Modem, fax, LAN Cable, DSL, ATM 1000s-109s Ubiquitous Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 6 Simple Organization for Input/Output Interrupts CPU Main memory Cache System bus I/O controller Disk Figure 21.1 Feb. 2011 Disk I/O controller I/O controller Graphics display Network Input/output via a single common bus. Computer Architecture, Input/Output and Interfacing Slide 7 I/O Organization for Greater Performance CPU Interrupts Main memory Cache Memory bus Bus adapter AGP PCI bus Intermediate buses / ports Graphics display Standard Bus adapter I/O bus I/O controller I/O controller Network Proprietary Bus adapter I/O controller Disk Disk I/O controller CD/DVD Figure 21.2 Input/output via intermediate and dedicated I/O buses (to be explained in Chapter 23). Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 8 21.2 Keyboard and Mouse Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 9 Keyboard Switches and Encoding Key cap Spring c d e f 8 9 a b 4 5 6 7 0 1 2 3 (a) Mechanical switch with a plunger Conductor-coated membrane Contacts (b) Membrane switch (c) Logical arrangement of keys Figure 21.3 Two mechanical switch designs and the logical layout of a hex keypad. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 10 Projection Virtual Keyboard Hardware: A tiny laser device projects the image of a full-size keyboard on any surface Software: Emulates a real keyboard, even clicking key sounds Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 11 Pointing Devices Feb. 
2011 Computer Architecture, Input/Output and Interfacing Slide 12 How a Mouse Works y roller x roller Mouse pad y axis x axis Ball touching the rollers rotates them via friction (a) Mechanical mouse Figure 21.4 Feb. 2011 Photosensor detects crossing of grid lines (b) Optical mouse Mechanical and simple optical mice. Computer Architecture, Input/Output and Interfacing Slide 13 21.3 Visual Display Units Deflection coils Electron beam ≅ 1K Pixel info: brightness, color, etc. lines Electron gun y Sensitive screen ≅ 1K pixels Feb. 2011 Frame buffer per line (a) Image formation on a CRT Figure 21.5 x (b) Data defining the image CRT display unit and image storage in frame buffer. Computer Architecture, Input/Output and Interfacing Slide 14 How Color CRT Displays Work RGB RGB RGB RGB RGBRGB Direction of blue beam Direction of green beam Direction of red beam Shadow mask RGB Faceplate (a) The RGB color stripes Figure 21.6 Feb. 2011 (b) Use of shadow mask The RGB color scheme of modern CRT displays. Computer Architecture, Input/Output and Interfacing Slide 15 Encoding Colors in RGB Format Besides hue, saturation is used to affect the color’s appearance (high saturation at the top, low saturation at the bottom) Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 16 Flat-Panel Displays Column pulses Column pulses Row lines Address pulse Column (data) lines (a) Passive display Figure 21.7 Feb. 2011 Column (data) lines (b) Active display Passive and active LCD displays. Computer Architecture, Input/Output and Interfacing Slide 17 Flexible Display Devices Paper-thin tablet-size display unit by E Ink Sony organic light-emitting diode (OLED) flexible display Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 18 Other Display Technologies Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 19 21.4 Hard-Copy Input/Output Devices Document (face down) Filters Lens Detector: charge-coupled device (CCD) Mirror Light beam A/D converter Light source Scanning software Mirror Image file Figure 21.8 Feb. 2011 Scanning mechanism for hard-copy input. Computer Architecture, Input/Output and Interfacing Slide 20 Character Formation by Dot Matrices ooooo o o o o o o o o o o o o o o ooooo oo oo o oo oo o oo oo oo oo o o o o o o o o o o o o o o o o oo ooo oo ooo oo o o oo ooo oo ooo oo ooo o oo o oo o oo o oo oo oo oo oo oo oo oo oo oo oo oo o oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo o oo oo oo oo oo oo oo oo oo oo oo o oo oo o oo o o oo ooo oo ooo oo ooo o o oo ooo oo ooo oo o Figure 21.9 Feb. 2011 ooooooooo o oo o oo o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o oo o oo ooooooooo Same dot matrix size, but with greater resolution oooooooooooooo ooooooooooooooooo oo oooo oo ooo oo oo oo oo oo oo oo oo oo oo oo o oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo oo o oo oo oo oo oo oo oo oo oo oo oo ooo oo oooo ooooooooooooooooo oooooooooooooo Forming the letter “D” via dot matrices of varying sizes. Computer Architecture, Input/Output and Interfacing Slide 21 Simulating Intensity Levels via Dithering Forming five gray levels on a device that supports only black and white (e.g., ink-jet or laser printer) Using the dithering patterns above on each of three colors forms 5 × 5 × 5 = 125 different colors Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 22 Simple Dot-Matrix Printer Mechanism Feb. 
2011 Computer Architecture, Input/Output and Interfacing Slide 23 Common Hard-Copy Output Devices Sheet of paper Paper movement Print head movement (a) Ink jet printing Figure 21.10 Feb. 2011 Rotating drum Light from optical system Rollers Print head Ink droplet Corona wire for charging Fusing of toner Heater Print head assembly Ink supply Cleaning of excess toner Sheet of paper Toner (b) Laser printing Ink-jet and laser printers. Computer Architecture, Input/Output and Interfacing Slide 24 How Color Printers Work Red Green The RGB scheme of color monitors is additive: various amounts of the three primary colors are added to form a desired color Blue Absence of green Cyan Magenta The CMY scheme of color printers is subtractive: various amounts of the three primary colors are removed from white to form a desired color Yellow Feb. 2011 To produce a more satisfactory shade of black, the CMYK scheme is often used (K = black) Computer Architecture, Input/Output and Interfacing Slide 25 The CMYK Printing Process Illusion of full color created with CMYK dots Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 26 Color Wheels Artist’s color wheel, used for mixing paint Subtractive color wheel, used in printing (CMYK) Additive color wheel, used for projection Primary colors appear at center and equally spaced around the perimeter Secondary colors are midway between primary colors Tertiary colors are between primary and secondary colors Source of this and several other slides on color: http://www.devx.com/projectcool/Article/19954/0/ (see also color theory tutorial: http://graphics.kodak.com/documents/Introducing%20Color%20Theory.pdf) Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 27 21.5 Other Input/Output Devices Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 28 Sensors and Actuators Collecting info about the environment and other conditions • Light sensors (photocells) • Temperature sensors (contact and noncontact types) • Pressure sensors S S N S N S S S N N S N N S N N S S N N (a) Initial state (a) After rotation Figure 21.11 Stepper motor principles of operation. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 29 Converting Circular Motion to Linear Motion Locomotive Feb. 2011 Screw Computer Architecture, Input/Output and Interfacing Slide 30 21.6 Networking of Input/Output Devices Computer 1 Printer 2 Camera Ethernet Computer 3 Printer 1 Computer 2 Printer 3 Figure 21.12 With network-enabled peripherals, I/O is done via file transfers. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 31 Input/Output in Control and Embedded Systems CPU and memory Network interface Intelligent devices, other computers, archival storage, ... Digital output interface Signal conversion Digital actuators: stepper motors, relays, alarms, ... D/A output interface Signal conversion Analog actuators: valves, pumps, speed regulators, ... Digital input interface Digital signal conditioning Digital sensors: detectors, counters, on/off switches, ... A/D input interface Analog signal conditioning Analog sensors: thermocouples, pressure sensors, ... Figure 21.13 The structure of a closed-loop computer-based control system. Feb. 
2011 Computer Architecture, Input/Output and Interfacing Slide 32 22 Input/Output Programming Like everything else, I/O is controlled by machine instructions • I/O addressing (memory-mapped) and performance • Scheduled vs demand-based I/O: polling vs interrupts Topics in This Chapter 22.1 I/O Performance and Benchmarks 22.2 Input/Output Addressing 22.3 Scheduled I/O: Polling 22.4 Demand-Based I/O: Interrupts 22.5 I/O Data Transfer and DMA 22.6 Improving I/O Performance Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 33 22.1 I/O Performance and Benchmarks Example 22.1: The I/O wall An industrial control application spent 90% of its time on CPU operations when it was originally developed in the early 1980s. Since then, the CPU component has been upgraded every 5 years, but the I/O components have remained the same. Assuming that CPU performance improved tenfold with each upgrade, derive the fraction of time spent on I/O over the life of the system. Solution Apply Amdahl’s law with 90% of the task speeded up by factors of 10, 100, 1000, and 10000 over a 20-year period. In the course of these upgrades the running time has been reduced from the original 1 to 0.1 + 0.9/10 = 0.19, 0.109, 0.1009, and 0.10009, making the fraction of time spent on input/output 52.6, 91.7, 99.1, and 99.9%, respectively. The last couple of CPU upgrades did not really help. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 34 Types of Input/Output Benchmark Supercomputer I/O benchmarks Reading large volumes of input data Writing many snapshots for checkpointing Saving a relatively small set of results I/O data throughput, in MB/s, is important Transaction processing I/O benchmarks Huge database, but each transaction fairly small A handful (2-10) of disk accesses per transaction I/O rate (disk accesses per second) is important File system I/O benchmarks File creation, directory management, indexing, . . . Benchmarks are usually domain-specific Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 35 22.2 Memory location (hex address) 0xffff0000 31 Input/Output Addressing Interrupt enable I ER 76543210 Data byte 0xffff0004 Device ready Keyboard control Keyboard data 32-bit device registers 0xffff0008 31 0xffff000c I ER 76543210 Data byte Display control Display data Figure 22.1 Control and data registers for keyboard and display unit in MiniMIPS. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 36 Hardware for I/O Addressing Control Address Data Memory bus Device status Device address Device data Compare = Figure 22.2 Feb. 2011 Control logic Device controller Addressing logic for an I/O device controller. Computer Architecture, Input/Output and Interfacing Slide 37 Data Input from Keyboard Example 22.2 Write a sequence of MiniMIPS assembly language instructions to make the program wait until the keyboard has a symbol to transmit Memory location Interrupt enable and then read the symbol into register $v0. (hex address) 0xffff0000 31 Solution 0xffff0004 Device read I ER 76543210 Keyboard Data byte Keyboard 32-bit device registers The program must continually examine the keyboard control register, I 0xffff0008 Display co ER 31 7 6 5 4 3 2 1 0 ending its “busy wait” when the R bit has been asserted. 
      lui  $t0,0xffff          # put 0xffff0000 in $t0
idle: lw   $t1,0($t0)          # get keyboard's control word
      andi $t1,$t1,0x0001      # isolate the LSB (R bit)
      beq  $t1,$zero,idle      # if not ready (R = 0), wait
      lw   $v0,4($t0)          # retrieve data from keyboard
This type of input is appropriate only if the computer is waiting for a critical input and cannot continue in the absence of such input.

Data Output to Display Unit
Example 22.3 Write a sequence of MiniMIPS assembly language instructions to make the program wait until the display unit is ready to accept a new symbol and then write the symbol from $a0 to the display unit (device registers as in Figure 22.1: display control at 0xffff0008, display data at 0xffff000c).
Solution: The program must continually examine the display unit's control register, ending its "busy wait" when the R (ready) bit has been asserted.
      lui  $t0,0xffff          # put 0xffff0000 in $t0
idle: lw   $t1,8($t0)          # get display's control word
      andi $t1,$t1,0x0001      # isolate the LSB (R bit)
      beq  $t1,$zero,idle      # if not ready (R = 0), wait
      sw   $a0,12($t0)         # supply data to display unit
This type of output is appropriate only if we can afford to have the CPU dedicated to data transmission to the display unit.

22.3 Scheduled I/O: Polling
Examples 22.4, 22.5, 22.6 What fraction of a 1 GHz CPU's time is spent polling the following devices if each polling action takes 800 clock cycles?
Keyboard must be interrogated at least 10 times per second
Floppy sends data 4 bytes at a time at a rate of 50 KB/s
Hard drive sends data 4 bytes at a time at a rate of 3 MB/s
Solution: For the keyboard, divide the number of cycles needed for 10 interrogations by the total number of cycles available in 1 second: (10 × 800)/10^9 ≅ 0.001%
The floppy disk must be interrogated 50K/4 = 12.5K times per second: (12.5K × 800)/10^9 ≅ 1%
The hard disk must be interrogated 3M/4 = 750K times per second: (750K × 800)/10^9 ≅ 60%

22.4 Demand-Based I/O: Interrupts
Example 22.7 Consider the disk in Example 22.6 (transferring 4 B chunks of data at 3 MB/s when active). Assume that the disk is active 5% of the time. The overhead of interrupting the CPU and performing the transfer is 1200 clock cycles. What fraction of a 1 GHz CPU's time is spent attending to the hard disk drive?
Solution: When active, the hard disk produces 750K interrupts per second: 0.05 × (750K × 1200)/10^9 ≅ 4.5% (compare with 60% for polling)
Note that even though the overhead of interrupting the CPU is higher than that of polling, because the disk is usually idle, demand-based I/O leads to better performance.
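The arithmetic of Examples 22.4-22.7 in executable form, as a small C sketch (illustrative only):

    #include <stdio.h>

    /* Fraction of CPU time consumed by device service:
       (events per second * cycles per event) / clock rate. */
    double cpu_fraction(double events_per_sec, double cycles_per_event,
                        double clock_hz) {
        return events_per_sec * cycles_per_event / clock_hz;
    }

    int main(void) {
        double f = 1e9;   /* 1 GHz CPU */
        /* Disk, polled: 3 MB/s in 4 B chunks = 750K polls/s, 800 cycles each */
        printf("disk, polling:    %.1f%%\n",
               100 * cpu_fraction(750e3, 800, f));
        /* Same disk with interrupts: active 5% of the time,
           1200 cycles per interrupt */
        printf("disk, interrupts: %.1f%%\n",
               100 * 0.05 * cpu_fraction(750e3, 1200, f));
        return 0;
    }

The program prints 60.0% and 4.5%, matching the examples.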
Interrupt Handling
Upon detecting an interrupt signal, provided the particular interrupt or interrupt class is not masked, the CPU acknowledges the interrupt (so that the device can deassert its request signal) and begins executing an interrupt service routine:
1. Save the CPU state and call the interrupt service routine.
2. Disable all interrupts.
3. Save minimal information about the interrupt on the stack.
4. Enable interrupts (or at least higher-priority ones).
5. Identify the cause of the interrupt and attend to the underlying request.
6. Restore the CPU state to what existed before the last interrupt.
7. Return from the interrupt service routine.
The capability to handle nested interrupts is important in dealing with multiple high-speed I/O devices.

22.5 I/O Data Transfer and DMA
The DMA controller sits on the system bus alongside the CPU and cache, holding status, source, destination, and length registers; using bus request/bus grant signals and control lines such as ReadWrite′ and DataReady′, it moves data between main memory and a typical I/O device without CPU involvement.
Figure 22.3 DMA controller shares the system or memory bus with the CPU.

DMA Operation
(a) DMA transfer in one continuous burst; (b) DMA transfer in several shorter bursts.
Figure 22.4 DMA operation and the associated transfers of bus control (BusRequest/BusGrant handshake between CPU and DMA).

22.6 Improving I/O Performance
Example 22.9: Effective I/O bandwidth from disk
Consider a hard disk drive with 512 B sectors, average access latency of 10 ms, and peak throughput of 10 MB/s. Plot the variation of the effective I/O bandwidth as the unit of data transfer (block) varies in size from 1 sector (0.5 KB) to 1024 sectors (500 KB).
Solution
Figure 22.5 Throughput (MB/s) versus block size (KB): effective bandwidth rises from 0.05 MB/s for 0.5 KB blocks to 5 MB/s at 100 KB, approaching the 10 MB/s peak for very large blocks.

Computing the Effective Throughput
Elaboration on Example 22.9: Effective I/O bandwidth from disk
Total access time for x bytes = 10 ms + transfer time = (0.01 + 10^-7 x) s
Effective access time per byte = (0.01 + 10^-7 x)/x s/B
Effective transfer rate = x/(0.01 + 10^-7 x) B/s
For x = 100 KB: effective transfer rate = 10^5/(0.01 + 10^-2) = 5 × 10^6 B/s
(Average access latency = 10 ms; peak throughput = 10 MB/s; see Figure 22.5.)

Distributed Input/Output
Figure 22.6 Example configuration for the InfiniBand distributed I/O: CPU/memory modules and I/O devices attach through host channel adapters (HCAs) to switches, which connect via routers to other subnets.

23 Buses, Links, and Interfacing
Shared links or buses are common in modern computers:
• Fewer wires and pins, greater flexibility & expandability
• Require dealing with arbitration and synchronization
Topics in This Chapter
23.1 Intra- and Intersystem Links
23.2 Buses and Their Appeal
23.3 Bus Communication Protocols
23.4 Bus Arbitration and Performance
23.5 Basics of Interfacing
23.6 Interfacing Standards

23.1 Intra- and Intersystem Links
Figure 23.1 Multiple metal layers provide intrasystem connectivity on microchips or printed-circuit boards: (a) cross section of metal layers 1-4 connected by vias, each trench being 1. etched and insulated, 2. coated with copper, and 3. cleared of excess copper; (b) 3D view of wires on multiple metal layers.

Multiple Metal Layers on a Chip or PC Board
Modern chips have 8-9 metal layers above the active elements and their connectors. Upper layers carry longer wires as well as those that need more power Feb.
2011 Computer Architecture, Input/Output and Interfacing Slide 50 Intersystem Links Computer (a) RS-232 (b) Ethernet Figure 23.2 Example intersystem connectivity schemes. DTR: data terminal ready Signal ground 5 CTS: clear to send Feb. 2011 Transmit data 4 9 Figure 23.3 (c) ATM 3 8 2 7 Receive data 1 6 RTS: request to send DSR: data set ready RS-232 serial interface 9-pin connector. Computer Architecture, Input/Output and Interfacing Slide 51 Intersystem Communication Media Twisted pair Plastic Insulator Copper core Coaxial cable Outer conductor Reflection Silica Light source Optical fiber Figure 23.4 Commonly used communication media for intersystem connections. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 52 Comparing Intersystem Links Table 23.1 Summary of three interconnection schemes. Interconnection properties RS-232 Ethernet ATM Maximum segment length (m) 10s 100s 1000s Maximum network span (m) 10s 100s Unlimited Up to 0.02 10/100/1000 155-2500 1 100s 53 <1 10s-100s 100s Input/Output LAN Backbone Low Low High Bit rate (Mb/s) Unit of transmission (B) Typical end-to-end latency (ms) Typical application domain Transceiver complexity or cost Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 53 23.2 Buses and Their Appeal 1 0 1 2 0 3 n–1 n–2 2 3 n–1 n–2 Point-to-point connections between n units require n(n – 1) channels, or n(n – 1)/2 bidirectional links; that is, O(n2) links Bus connectivity requires only one input and one output port per unit, or O(n) links in all Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 54 Bus Components and Types . . . Control . . . Address . . . Data Figure 23.5 Handshaking, direction, transfer mode, arbitration, ... one bit (serial) to several bytes; may be shared The three sets of lines found in a bus. A typical computer may use a dozen or so different buses: 1. Legacy Buses: PC bus, ISA, RS-232, parallel port 2. Standard buses: PCI, SCSI, USB, Ethernet 3. Proprietary buses: for specific devices and max performance Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 55 23.3 Bus Communication Protocols c d e f g h Clock Address placed on the bus Address Data Wait Figure 23.6 Wait Data availability ensured Synchronous bus with fixed-latency devices. Request Address or data Wait d c d e h f Ack f g h i Ready Figure 23.7 Handshaking on an asynchronous bus for an input operation (e.g., reading from memory). Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 56 Example Bus Operation c d e f g h i j k CLK FRAME′ C/BE′ AD I/O read Byte enable Address Data 0 Data 1 Data 2 IRDY′ Data 3 Wait TRDY′ Wait DEVSEL′ Transfer Address AD turn- Data initiation transfer around transfer Figure 23.8 Feb. 2011 Wait cycle Data transfer Data transfer Wait cycle Data transfer I/O read operation via PCI bus. Computer Architecture, Input/Output and Interfacing Slide 57 23.4 R0 R1 R2 Bus Arbitration and Performance . . . S y n c . . . Arbiter R n−1 . . . G0 G1 G2 Gn−1 Bus release Figure 23.9 Feb. 2011 General structure of a centralized bus arbiter. 
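Anticipating the rotating-priority scheme described on the next slide, here is the idea as a small C model: an illustrative software sketch of the hardware arbiter, with the request lines R0..R(n-1) packed into a bitmask.

    #include <stdint.h>

    /* Rotating-priority arbiter sketch: the grant goes to the first
       requester at or after the previous winner + 1, so no device can
       starve. Returns the granted index (assert grant line Gk) or -1. */
    int arbitrate(uint32_t req, int n, int *next_highest) {
        for (int i = 0; i < n; i++) {
            int k = (*next_highest + i) % n;   /* circular priority order */
            if (req & (1u << k)) {
                *next_highest = (k + 1) % n;   /* rotate top priority */
                return k;
            }
        }
        return -1;                             /* no requests pending */
    }

With a fixed starting point instead of the rotating one, the same loop models a fixed-priority arbiter, which is simpler but can starve low-priority devices.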
Computer Architecture, Input/Output and Interfacing Slide 58 Some Simple Bus Arbiters Round robin Ring counter Fixed-priority R0 0 0 0 0 1 0 0 0 Ri R0 G0 1 Ri Gi Rn–1 G0 Gi Gn–1 Starvation avoidance Rotating priority With fixed priorities, low-priority units may never get to use the bus (they could “starve”) Idea: Order the units circularly, rather than linearly, and allow the highest-priority status to rotate among the units (combine a ring counter with a priority circuit) Combining priority with service guarantee is desirable Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 59 Daisy Chaining R0 R1 R2 . . . S y n c . . . Bus release Arbiter . . . Bus grant G0 G1 G2 Device A Daisy chain of devices Device B Device C Device D Bus request Figure 23.9 Daisy chaining allows a small centralized arbiter to service a large number of devices that use a shared resource. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 60 23.5 N W Contact point Basics of Interfacing Ground +5 V DC S Microcontroller with internal A/D converter E Pin x of port y Figure 23.11 Wind vane supplying an output voltage in the range 0-5 V depending on wind direction. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 61 23.6 Table 23.2 Attributes ↓ Interfacing Standards Summary of four standard interface buses. Name → PCI SCSI FireWire USB Type of bus Backplane Parallel I/O Serial I/O Serial I/O Standard designation PCI ANSI X3.131 IEEE 1394 USB 2.0 Typical application domain System Fast I/O Fast I/O Low-cost I/O Bus width (data bits) 32-64 8-32 2 1 Peak bandwidth (MB/s) 133-512 5-40 12.5-50 0.2-15 Maximum number of devices 1024* 7-31# 63 127$ Maximum span (m) <1 3-25 4.5-72$ 5-30$ Arbitration method Centralized Self-select Distributed Daisy chain Transceiver complexity or cost High Medium Medium Low Notes: Feb. 2011 * 32 per bus segment; # One less than bus width; $ With hubs (repeaters) Computer Architecture, Input/Output and Interfacing Slide 62 Standard Connectors USB A Host side 4321 1 4 Max cable length: 5m Host (controller & hub) USB B Device side 2 3 Hub Hub Pin 1: +5V DC Pin 4: Ground Figure 23.12 Pin 2: Data − Pin 3: Data + Device Hub Device Device Device Single product with hub & device USB connectors and connectivity structure . Pin 1: 8-40V DC, 1.5 A Pin 2: Ground Pin 3: Twisted pair B − Pin 4: Twisted pair B + Pin 5: Twisted pair A − Pin 6: Twisted pair A + Shell: Outer shield Figure 23.13 IEEE 1394 (FireWire) connector. The same connector is used at both ends. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 63 24 Context Switching and Interrupts OS initiates I/O transfers and awaits notification via interrupts • When an interrupt is detected, the CPU switches context • Context switch can also be used between users/threads Topics in This Chapter 24.1 System Calls for I/O 24.2 Interrupts, Exceptions, and Traps 24.3 Simple Interrupt Handling 24.4 Nested Interrupts 24.5 Types of Context Switching 24.6 Threads and Multithreading Feb. 
2011 Computer Architecture, Input/Output and Interfacing Slide 64 24.1 System Calls for I/O Why the user must be isolated from details of I/O operations Protection: User must be barred from accessing some disk areas Convenience: No need to learn details of each device’s operation Efficiency: Most users incapable of finding the best I/O scheme I/O abstraction: grouping of I/O devices into a small number of generic types so as to make the I/O device-independent Character stream I/O: get(●), put(●) – e.g., keyboard, printer Block I/O: seek(●), read(●), write(●) – e.g., disk Network Sockets: create socket, connect, send/receive packet Clocks or timers: set up timer (get notified via an interrupt) Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 65 24.2 Interrupts, Exceptions, and Traps Interrupt Exception Trap Both general term for any diversion and the I/O type Caused by an illegal operation (often unpredictable) AKA “software interrupt” (preplanned and not rare) Studying Parhami’s book for test 6:55 7:40 8:01 9:46 Stomach sends interrupt signal E- mail arrives Eating dinner Reading/sending e-mail 8:42 Telemarketer calls 8:53 9:20 Best friend calls Talk ing on the phone Figure 24.1 Feb. 2011 The notions of interrupts and nested interrupts. Computer Architecture, Input/Output and Interfacing Slide 66 24.3 Simple Interrupt Handling Acknowledge the interrupt by asserting the IntAck signal Notify the CPU’s next-address logic that an interrupt is pending Set the interrupt mask so that no new interrupt is accepted IntAck IntReq S y n c IntDisable S S FF R Signals from/to devices Q Q FF Q Interrupt acknowledge R Q Interrupt mask IntAlert Signals from/to CPU IntEnable Figure 24.2 Feb. 2011 Simple interrupt logic for the single-cycle MicroMIPS. Computer Architecture, Input/Output and Interfacing Slide 67 Interrupt Timing c d e f g h Clock IntReq Synchronized version IntAck IntMask IntAlert Figure 24.3 Feb. 2011 Timing of interrupt request and acknowledge signals. Computer Architecture, Input/Output and Interfacing Slide 68 Next-Address Logic with Interrupts Added IncrPC Old PC NextPC / 30 0 1 IntAlert / 30 0 1 2 3 / 30 / 30 / 30 / 30 / 30 (PC) 31:28| jta (rs) 31:2 SysCallAddr IntHandlerAddr PCSrc Figure 24.4 Part of the next-address logic for single-cycle MicroMIPS, with an interrupt capability added (compare with the lower left part of Figure 13.4). Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 69 24.4 Nested Interrupts prog PC inst(a) inst(b) Interrupts disabled and (PC) saved Int detected int1 Interrupt handler Save state Save int info Enable int’s PC inst(c) inst(d) In t detected int2 Save state Save int info Interrupts disabled Enable int’s and (PC) saved Restore state Return Figure 24.6 Feb. 2011 Interrupt handler Restore state Return Example of nested interrupts. Computer Architecture, Input/Output and Interfacing Slide 70 24.5 Scanning e-mail messages Types of Context Switching Taking notes Task 1 Task 2 Task 3 Time slice Context switch Talking on telephone (a) Human multitasking Figure 24.7 Feb. 2011 (b) Computer multitasking Multitasking in humans and computers. Computer Architecture, Input/Output and Interfacing Slide 71 24.6 Threads and Multithreading Thread 1 Thread 2 Thread 3 Spawn additional threads Sync Sync (a) Task graph of a program Figure 24.8 Feb. 2011 (b) Thread structure of a task A program divided into tasks (subcomputations) or threads. 
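The spawn-and-sync structure of Figure 24.8 maps directly onto a thread API. A minimal POSIX-threads sketch (illustrative, not from the textbook; compile with -lpthread):

    #include <pthread.h>
    #include <stdio.h>

    /* Each spawned thread runs this function. */
    void *worker(void *arg) {
        printf("thread %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[3];
        for (long i = 0; i < 3; i++)   /* spawn additional threads */
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 3; i++)    /* sync point: wait for all */
            pthread_join(t[i], NULL);
        return 0;
    }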
Computer Architecture, Input/Output and Interfacing Slide 72 Multithreaded Processors Threads in memory Issue pipelines Bubble Retirement and commit pipeline Function units Figure 24.9 Instructions from multiple threads as they make their way through a processor’s execution pipeline. Feb. 2011 Computer Architecture, Input/Output and Interfacing Slide 73 Part VII Advanced Architectures Feb. 2011 Computer Architecture, Advanced Architectures Slide 1 About This Presentation This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First July 2003 July 2004 July 2005 Mar. 2007 Feb. 2011* * Minimal update, due to this part not being used for lectures in ECE 154 at UCSB Feb. 2011 Computer Architecture, Advanced Architectures Slide 2 VII Advanced Architectures Performance enhancement beyond what we have seen: • What else can we do at the instruction execution level? • Data parallelism: vector and array processing • Control parallelism: parallel and distributed processing Topics in This Part Chapter 25 Road to Higher Performance Chapter 26 Vector and Array Processing Chapter 27 Shared-Memory Multiprocessing Chapter 28 Distributed Multicomputing Feb. 2011 Computer Architecture, Advanced Architectures Slide 3 25 Road to Higher Performance Review past, current, and future architectural trends: • General-purpose and special-purpose acceleration • Introduction to data and control parallelism Topics in This Chapter 25.1 Past and Current Performance Trends 25.2 Performance-Driven ISA Extensions 25.3 Instruction-Level Parallelism 25.4 Speculation and Value Prediction 25.5 Special-Purpose Hardware Accelerators 25.6 Vector, Array, and Parallel Processing Feb. 2011 Computer Architecture, Advanced Architectures Slide 4 25.1 Past and Current Performance Trends Intel 4004: The first μp (1971) Intel Pentium 4, circa 2005 0.06 MIPS (4-bit processor) 10,000 MIPS (32-bit processor) 8008 8-bit 80386 8080 80486 Pentium, MMX 8084 32-bit 8086 16-bit 8088 80186 80188 Pentium Pro, II Pentium III, M Celeron 80286 Feb. 2011 Computer Architecture, Advanced Architectures Slide 5 Architectural Innovations for Improved Performance Newer methods Feb. 2011 Improvement factor 1. Pipelining (and superpipelining) 3-8 √ 2. Cache memory, 2-3 levels 2-5 √ 3. RISC and related ideas 2-3 √ 4. Multiple instruction issue (superscalar) 2-3 √ 5. ISA extensions (e.g., for multimedia) 1-3 √ 6. Multithreading (super-, hyper-) 2-5 ? 7. Speculation and value prediction 2-3 ? 8. Hardware acceleration 2-10 ? 9. Vector and array processing 2-10 ? 10. Parallel/distributed computing 2-1000s ? Computer Architecture, Advanced Architectures Previously discussed Established methods Architectural method Available computing power ca. 
2000: GFLOPS on desktop TFLOPS in supercomputer center PFLOPS on drawing board Covered in Part VII Computer performance grew by a factor of about 10000 between 1980 and 2000 100 due to faster technology 100 due to better architecture Slide 6 Peak Performance of Supercomputers PFLOPS Earth Simulator × 10 / 5 years ASCI White Pacific ASCI Red TFLOPS TMC CM-5 Cray X-MP Cray T3D TMC CM-2 Cray 2 GFLOPS 1980 1990 2000 2010 Dongarra, J., “Trends in High Performance Computing,” Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04] Feb. 2011 Computer Architecture, Advanced Architectures Slide 7 Energy Consumption is Getting out of Hand TIPS DSP performance per watt Absolute proce ssor performance Performance GIPS GP processor performance per watt MIPS kIPS 1980 1990 2000 2010 Calendar year Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. Feb. 2011 Computer Architecture, Advanced Architectures Slide 8 25.2 Performance-Driven ISA Extensions Adding instructions that do more work per cycle Shift-add: replace two instructions with one (e.g., multiply by 5) Multiply-add: replace two instructions with one (x := c + a × b) Multiply-accumulate: reduce round-off error (s := s + a × b) Conditional copy: to avoid some branches (e.g., in if-then-else) Subword parallelism (for multimedia applications) Intel MMX: multimedia extension 64-bit registers can hold multiple integer operands Intel SSE: Streaming SIMD extension 128-bit registers can hold several floating-point operands Feb. 2011 Computer Architecture, Advanced Architectures Slide 9 Intel MMX ISA Extension Class Copy Arithmetic Shift Logic Table 25.1 Memory access Control Feb. 2011 Instruction Register copy Parallel pack Parallel unpack low Parallel unpack high Parallel add Parallel subtract Parallel multiply low Parallel multiply high Parallel multiply-add Parallel compare equal Parallel compare greater Parallel left shift logical Parallel right shift logical Parallel right shift arith Parallel AND Parallel ANDNOT Parallel OR Parallel XOR Parallel load MMX reg Parallel store MMX reg Empty FP tag bits Vector Op type 32 bits 4, 2 Saturate 8, 4, 2 8, 4, 2 8, 4, 2 Wrap/Saturate# 8, 4, 2 Wrap/Saturate# 4 4 4 8, 4, 2 8, 4, 2 4, 2, 1 4, 2, 1 4, 2 1 Bitwise 1 Bitwise 1 Bitwise 1 Bitwise 32 or 64 bits 32 or 64 bit Computer Architecture, Advanced Architectures Function or results Integer register ↔ MMX register Convert to narrower elements Merge lower halves of 2 vectors Merge upper halves of 2 vectors Add; inhibit carry at boundaries Subtract with carry inhibition Multiply, keep the 4 low halves Multiply, keep the 4 high halves Multiply, add adjacent products* All 1s where equal, else all 0s All 1s where greater, else all 0s Shift left, respect boundaries Shift right, respect boundaries Arith shift within each (half)word dest ← (src1) ∧ (src2) dest ← (src1) ∧ (src2)′ dest ← (src1) ∨ (src2) dest ← (src1) ⊕ (src2) Address given in integer register Address given in integer register Required for compatibility$ Slide 10 MMX Multiplication and Multiply-Add b a×e a b d e a b d e e f g h e f g h e×h z v d×g y u ×f x t w s s Feb. 2011 d×g b a×e t u (a) Parallel multiply low Figure 25.2 e×h ×f v u t s v add add s +t u+v (b) Parallel multiply-add Parallel multiplication and multiply-add in MMX. Computer Architecture, Advanced Architectures Slide 11 MMX Parallel Comparisons 14 3 58 66 5 3 12 32 5 6 12 9 79 1 58 65 3 12 22 17 5 12 90 8 65 535 (all 1s) 0 0 0 0 0 (a) Parallel compare equal Figure 25.3 Feb. 
2011 255 (all 1s) 0 0 0 (b) Parallel compare greater Parallel comparisons in MMX. Computer Architecture, Advanced Architectures Slide 12 25.3 Instruction-Level Parallelism 3 Speedup attained Fraction of cycles 30% 20% 10% 0% 0 1 2 3 4 5 6 7 8 Issuable instructions per cycle 2 1 0 (a) 2 4 6 8 Instruction issue width (b) Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91]. Feb. 2011 Computer Architecture, Advanced Architectures Slide 13 Instruction-Level Parallelism Figure 25.5 Feb. 2011 A computation with inherent instruction-level parallelism. Computer Architecture, Advanced Architectures Slide 14 VLIW and EPIC Architectures VLIW EPIC Very long instruction word architecture Explicitly parallel instruction computing General registers (128) Memory Execution unit Execution unit ... Execution unit Execution unit Execution unit ... Execution unit Predicates (64) Floating-point registers (128) Figure 25.6 Hardware organization for IA-64. General and floatingpoint registers are 64-bit wide. Predicates are single-bit registers. Feb. 2011 Computer Architecture, Advanced Architectures Slide 15 25.4 Speculation and Value Prediction spec load ------------check load ------- ------------load ------- (a) Control speculation Figure 25.7 Feb. 2011 ------store ---load ------- spec load ------store ---check load ------- (b) Data speculation Examples of software speculation in IA-64. Computer Architecture, Advanced Architectures Slide 16 Value Prediction Memo table Miss Mult/ Div Inputs Mux 0 Output 1 Done Inputs ready Control Output ready Figure 25.8 Value prediction for multiplication or division via a memo table. Feb. 2011 Computer Architecture, Advanced Architectures Slide 17 25.5 Special-Purpose Hardware Accelerators Data and program memory CPU FPGA-like unit on which accelerators can be formed via loading of configuration registers Configuration memory Accel. 1 Accel. 3 Accel. 2 Unused resources Figure 25.9 General structure of a processor with configurable hardware accelerators. Feb. 2011 Computer Architecture, Advanced Architectures Slide 18 Graphic Processors, Network Processors, etc. Input buffer Feedback path PE 0 PE 1 PE 2 PE 3 PE 4 PE PE 5 5 PE 6 PE 7 PE 8 PE 9 PE 10 PE 11 PE 12 PE 13 PE 14 PE 15 Column memory Column memory Column memory Column memory Output buffer Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems’ network processor. Feb. 2011 Computer Architecture, Advanced Architectures Slide 19 SISD SIMD Uniprocessors Array or vector processors MISD Rarely used MIMD Multiproc’s or multicomputers Global memory Multiple data streams Distributed memory Single data stream Johnson’ s expansion Multiple instr streams Single instr stream 25.6 Vector, Array, and Parallel Processing Shared variables Message passing GMSV GMMP Shared-memory multiprocessors Rarely used DMSV DMMP Distributed Distrib-memory shared memory multicomputers Flynn’s categories Figure 25.11 Feb. 2011 The Flynn-Johnson classification of computer systems. 
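Looking back at Figure 25.8, the memo table used for value prediction is essentially a cache keyed by operand values. A C sketch of the idea for multiplication; the table size and hash function are assumptions made for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 256   /* assumed table size */

    typedef struct { bool valid; uint32_t a, b; uint64_t product; } MemoEntry;
    static MemoEntry memo[ENTRIES];

    /* On a hit, the cached product is available at once; on a miss,
       the (slow) multiplier computes it and the result is memoized. */
    uint64_t mul_memo(uint32_t a, uint32_t b) {
        unsigned i = (a ^ (b << 3)) % ENTRIES;   /* assumed hash of inputs */
        if (memo[i].valid && memo[i].a == a && memo[i].b == b)
            return memo[i].product;              /* hit: reuse result */
        uint64_t p = (uint64_t)a * b;            /* miss: use multiplier */
        memo[i] = (MemoEntry){true, a, b, p};    /* remember for next time */
        return p;
    }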
SIMD Architectures
Data parallelism: executing one operation on multiple data streams
Concurrency in time – vector processing
Concurrency in space – array processing
Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing (details in the first half of Chapter 26):
One instruction is fetched and decoded for the entire operation
The multiplications are known to be independent (no checking)
Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar (details in the second half of Chapter 26)

MISD Architecture Example
Figure 25.12 Multiple instruction streams (1-5) operating on a single data stream (MISD).

MIMD Architectures
Control parallelism: executing several instruction streams in parallel
GMSV: shared global memory – symmetric multiprocessors
DMSV: shared distributed memory – asymmetric multiprocessors
DMMP: message passing – multicomputers
Figure 27.1 Centralized shared memory: memory modules reached through a processor-to-memory network. Figure 28.1 Distributed memory: each computing node couples a processor, memory, and router to the interconnection network.

Amdahl's Law Revisited
s = 1 / (f + (1 - f)/p) ≤ min(p, 1/f)
where f is the sequential (unaffected) fraction and p is the speedup of the rest with p processors
Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 - f part runs p times as fast (plotted for f = 0, 0.01, 0.02, 0.05, and 0.1, with speedup s up to 50 versus enhancement factor p up to 50).

26 Vector and Array Processing
Single instruction stream operating on multiple data streams
• Data parallelism in time = vector processing
• Data parallelism in space = array processing
Topics in This Chapter
26.1 Operations on Vectors
26.2 Vector Processor Implementation
26.3 Vector Processor Performance
26.4 Shared-Control Systems
26.5 Array Processor Implementation
26.6 Array Processor Performance

26.1 Operations on Vectors
Sequential processor:
for i = 0 to 63 do P[i] := W[i] × D[i] endfor
Vector processor:
load W; load D; P := W × D; store P
Unparallelizable:
for i = 0 to 63 do X[i+1] := X[i] + Z[i]; Y[i+1] := X[i+1] + Y[i] endfor

26.2 Vector Processor Implementation
Figure 26.1 Simplified generic structure of a vector processor: load units A and B and a store unit connect the vector register file to the memory unit, while function unit pipelines 1-3, with forwarding muxes, draw operands from the vector registers and scalar registers.

Conflict-Free Memory Access
Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory: (a) conventional row-major order places all of column j in bank j, so column accesses conflict; (b) skewed row-major order rotates row i by i positions, spreading every row and every column across all 64 banks. Elements of column 0 are highlighted in both diagrams.
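The skewing rule of Figure 26.2(b) fits in one line of C (a sketch; the modulus matches the figure's 64-bank example):

    /* Skewed row-major placement, per Fig. 26.2(b): element (i, j) of a
       64 x 64 matrix lives in bank (i + j) mod 64. For fixed i (a row)
       or fixed j (a column), the bank index takes all 64 values exactly
       once, so both row and column sweeps are conflict-free. */
    int bank_of(int i, int j) { return (i + j) % 64; }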
Overlapped Memory Access and Computation
Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme: load X, load Y, and store Z units serve six vector registers feeding a pipelined adder; solid (dashed) lines show data flow in the current (next) segment.

26.3 Vector Processor Performance
Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining: without chaining, the addition (with its own start-up time) begins only after the multiplication completes; with chaining, it starts as soon as the first product emerges.

Performance as a Function of Vector Length
Figure 26.5 The per-element execution time in a vector processor as a function of the vector length (clock cycles per element, 0 to 5, versus vector length, 0 to 400).

26.4 Shared-Control Systems
Figure 26.6 From completely shared control to totally separate controls: (a) shared-control array processor, SIMD, in which one control unit drives many processing units; (b) multiple shared controls, MSIMD; (c) separate controls, MIMD.

Example Array Processor
Figure 26.7 Array processor with 2D torus interprocessor communication network: a control unit broadcasts to the processor array, with switches providing parallel I/O.

26.5 Array Processor Implementation
Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding: each processing element couples a register file, data memory, and ALU with a communication buffer reached over north/east/west/south (NEWS) links, governed by CommunEn and CommunDir, with a PE state flip-flop feeding the array state register.

Configuration Switches
Figure 26.9 I/O switch states in the array processor of Figure 26.7: (a) torus operation; (b) clockwise I/O; (c) counterclockwise I/O.

26.6 Array Processor Performance
Array processors perform well for the same class of problems that are suitable for vector processors
For embarrassingly (pleasantly) parallel problems, array processors can be faster and more energy-efficient than vector processors
A criticism of array processing: for conditional computations, a significant part of the array remains idle while the "then" part is performed; subsequently, idle and busy processors reverse roles during the "else" part
However: considering array processors inefficient due to idle processors is like criticizing mass transportation because many seats are unoccupied most of the time
It's the total cost of computation that counts, not hardware utilization!

27 Shared-Memory Multiprocessing
Multiple processors sharing a memory unit seems naïve
• Didn't we conclude that memory is the bottleneck?
27 Shared-Memory Multiprocessing

Multiple processors sharing a memory unit seems naïve
 • Didn't we conclude that memory is the bottleneck?
 • How then does it make sense to share the memory?

Topics in This Chapter
27.1 Centralized Shared Memory
27.2 Multiple Caches and Cache Coherence
27.3 Implementing Symmetric Multiprocessors
27.4 Distributed Shared Memory
27.5 Directories to Guide Data Access
27.6 Implementing Asymmetric Multiprocessors

Parallel Processing as a Topic of Study

An important area of study that allows us to overcome fundamental speed limits. Our treatment of the topic is quite brief (Chapters 26-27); the graduate course ECE 254B, Adv. Computer Architecture – Parallel Processing, covers it in depth.

27.1 Centralized Shared Memory

Figure 27.1 Structure of a multiprocessor with centralized shared memory: p processors reach m memory modules through a processor-to-memory network, with a separate processor-to-processor network and parallel I/O.

Processor-to-Memory Interconnection Network

Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection networks in a multiprocessor: (a) butterfly network, (b) Beneš network.

Figure 27.3 Interconnection of eight processors to 256 memory banks in the Cray Y-MP, a supercomputer with multiple vector processors: each processor fans out through a 1 × 8 switch to eight 8 × 8 sections, whose 4 × 4 subsections connect to banks interleaved four ways (0, 4, 8, 12, ...; 1, 5, 9, 13, ...; and so on).

Shared-Memory Programming: Broadcasting

Copy B[0] into all B[i] so that multiple processors can read its value without memory access conflicts:

 for k = 0 to ⌈log2 p⌉ – 1
  processor j, 0 ≤ j < 2^k, do
   B[j + 2^k] := B[j]
 endfor

Recursive doubling: each step doubles the number of valid copies of B[0].

Shared-Memory Programming: Summation

Sum reduction of vector X, computing all prefix sums:

 processor j, 0 ≤ j < p, do Z[j] := X[j]
 s := 1
 while s < p
  processor j, 0 ≤ j < p – s, do
   Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
 endwhile

Recursive doubling again; writing j:k for the sum X[j] + … + X[k], the contents of Z for p = 10 evolve as follows:

 Initially: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9
 After s = 1: 0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9
 After s = 2: 0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9
 After s = 4: 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9
 After s = 8: 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9
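A minimal sequential C simulation of these two recursive-doubling schemes (my sketch, not from the slides; each inner loop stands for one synchronized parallel step across the processors):

 #include <stdio.h>

 #define P 10   /* number of processors/elements (assumed) */

 /* Broadcast B[0] into B[0..P-1] by doubling the set of valid copies. */
 void broadcast(int B[P]) {
     for (int s = 1; s < P; s *= 2)                 /* s = 2^k */
         for (int j = 0; j < s && j + s < P; j++)   /* one parallel step */
             B[j + s] = B[j];
 }

 /* Prefix sums by recursive doubling: afterward, Z[j] = X[0]+...+X[j].
    The descending j order makes each step read only values from the
    previous step, mimicking the simultaneous parallel reads. */
 void prefix_sums(int Z[P], const int X[P]) {
     for (int j = 0; j < P; j++) Z[j] = X[j];
     for (int s = 1; s < P; s *= 2)
         for (int j = P - 1; j >= s; j--)
             Z[j] = Z[j - s] + Z[j];
 }

 int main(void) {
     int X[P] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, Z[P];
     prefix_sums(Z, X);
     for (int j = 0; j < P; j++) printf("%d ", Z[j]);  /* 1 3 6 ... 55 */
     printf("\n");
     return 0;
 }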
27.2 Multiple Caches and Cache Coherence

Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems. (The structure is that of Figure 27.1, with a cache inserted between each processor and the processor-to-memory network.)

Status of Data Copies

Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches. A cached block may exist as multiple consistent (shared, read-only) copies, as a single consistent copy, as a single inconsistent copy (modified relative to main memory), or as an invalid copy.

A Snoopy Cache Coherence Protocol

Each cache line is in one of three states: Invalid, Shared (read-only), or Exclusive (writable). The transitions are:
 • CPU read miss: signal read miss on bus; line becomes Shared
 • CPU write miss, or CPU write hit to a Shared line: signal write miss on bus; line becomes Exclusive
 • CPU read hit, or read/write hit to an Exclusive line: no bus action
 • Bus write miss observed: write back the cache line if Exclusive; line becomes Invalid
 • Bus read miss observed while Exclusive: write back the cache line; line becomes Shared

Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches.

27.3 Implementing Symmetric Multiprocessors

Figure 27.6 Structure of a generic bus-based symmetric multiprocessor: computing nodes (typically, 1-4 CPUs and caches per node), interleaved memory, and I/O modules attach through standard interfaces and bus adapters to a very wide, high-bandwidth bus.

Bus Bandwidth Limits Performance

Example 27.1: Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Address lines are separate and do not affect the bus data bandwidth. Compute an upper bound on the multiprocessor performance in GIPS.

Solution: Executing an instruction implies an average bus transfer of 1.4 × 0.02 × 4 = 0.112 B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic: buses operate in the range 0.1 to 1 GHz, so a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is beyond reach with this type of architecture.

Implementing Snoopy Caches

Figure 27.7 Main structure for a snoop-based cache coherence algorithm: the cache keeps duplicate tags and state store for the snoop side alongside the main tags and state store for the processor side, so the snoop controller can compare bus addresses and commands against the tags without stealing lookup cycles from the CPU.

27.4 Distributed Shared Memory

Figure 27.8 Structure of a distributed shared-memory multiprocessor: each node pairs processors with part of the globally addressed memory and a router on the interconnection network. In the figure's example, one node executes "y := -1; z := 1" while another spins in "while z = 0 do x := x + y endwhile", with x, y, and z residing in the memories of three different nodes.

27.5 Directories to Guide Data Access

Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor, all connected via communication and memory interfaces to the interconnection network.

Directory-Based Cache Coherence

A directory entry is in one of three states — Uncached, Shared (read-only), or Exclusive (writable) — and records the sharing set of caches holding the block. With c denoting the requesting cache:
 • Uncached, read or write miss: return value; set sharing set to {c} (state becomes Shared or Exclusive, respectively)
 • Shared, read miss: return value; include c in the sharing set
 • Shared, write miss: invalidate all cached copies; set sharing set to {c}; return value (state becomes Exclusive)
 • Exclusive, read miss: fetch data from owner; return value; include c in the sharing set (state becomes Shared)
 • Exclusive, write miss: fetch data from owner; request invalidation; return value; set sharing set to {c}
 • Exclusive, data write-back: set sharing set to { } (state becomes Uncached)

Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache).
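The directory entry itself can be as simple as a state tag plus a bit-vector sharing set. Here is a minimal C sketch of the read-miss transitions listed above (my illustration; the struct layout and function names are invented, not from the textbook):

 #include <stdint.h>

 enum dir_state { UNCACHED, SHARED, EXCLUSIVE };

 /* One directory entry per memory block: protocol state plus a
    bit-vector sharing set (bit c set => cache c holds a copy). */
 struct dir_entry {
     enum dir_state state;
     uint64_t sharers;        /* supports up to 64 caches */
 };

 /* Handle a read miss from cache c, per the transitions above. */
 void dir_read_miss(struct dir_entry *e, int c) {
     switch (e->state) {
     case UNCACHED:               /* return value; sharing set := {c} */
         e->sharers = 1ULL << c;
         e->state = SHARED;
         break;
     case SHARED:                 /* return value; add c to sharing set */
         e->sharers |= 1ULL << c;
         break;
     case EXCLUSIVE:              /* fetch from owner, then share */
         e->sharers |= 1ULL << c; /* owner retains a now read-only copy */
         e->state = SHARED;
         break;
     }
 }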
27.6 Implementing Asymmetric Multiprocessors

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor: computing nodes (typically, 1-4 CPUs and associated memory) attach through links to a ring network, with I/O controllers reachable over the ring. A companion diagram shows four such nodes, each holding processors and caches plus memories, joined to the interconnection network by the Scalable Coherent Interface (SCI).

28 Distributed Multicomputing

Computer architects' dream: connect computers like toy blocks
 • Building multicomputers from loosely connected nodes
 • Internode communication is done via message passing

Topics in This Chapter
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.4 Building and Using Multicomputers
28.5 Network-Based Distributed Computing
28.6 Grid Computing and Beyond

28.1 Communication by Message Passing

Figure 28.1 Structure of a distributed multicomputer: each computing node couples memories and processors with a router, the routers form the interconnection network, and parallel I/O attaches at the nodes.

Router Design

Figure 28.2 The structure of a generic router: input channels pass through link controllers (LC) into input queues (Q); a switch, governed by routing and arbitration logic, connects them to output queues and channels; injection and ejection channels link the router to its local node, including a message queue.

Building Networks from Switches

Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities: straight through, crossed connection, lower broadcast, and upper broadcast. Figure 27.2 (butterfly and Beneš networks) shows how such switches are composed into larger processor-to-memory networks.

Interprocess Communication via Messages

Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes: process B, upon reaching "receive x", is suspended until process A executes "send x" and the message arrives (the communication latency), at which point B is awakened.

28.2 Interconnection Networks

Figure 28.5 Examples of direct and indirect interconnection networks: (a) in a direct network, a router is attached to each node; (b) in an indirect network, nodes communicate through a separate fabric of routers.
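In practice, the send and receive primitives of Figure 28.4 correspond to message-passing library calls. A minimal C sketch using MPI (the de facto standard library, which the slides do not cover; my example):

 #include <mpi.h>
 #include <stdio.h>

 /* Process 0 sends x; process 1 blocks in MPI_Recv until it arrives,
    mirroring the suspension and awakening shown in Figure 28.4. */
 int main(int argc, char *argv[]) {
     int rank, x;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     if (rank == 0) {
         x = 42;
         MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send x */
     } else if (rank == 1) {
         MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE);                      /* receive x */
         printf("received x = %d\n", x);
     }
     MPI_Finalize();
     return 0;
 }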
Direct Interconnection Networks

Figure 28.6 A sampling of common direct interconnection networks: (a) 2D torus, (b) 4D hypercube, (c) chordal ring, (d) ring of rings. Only routers are shown; a computing node is implicit for each router.

Indirect Interconnection Networks

Figure 28.7 Two commonly used indirect interconnection networks: (a) hierarchical buses (level-1, level-2, and level-3 buses), (b) omega network.

28.3 Message Composition and Routing

Figure 28.8 Messages and their parts for message passing: a message is divided into packets (with padding in the last packet if needed); a transmitted packet consists of a header, data or payload, and a trailer, and is further divided into flow-control digits (flits).

Wormhole Switching

Figure 28.9 Concepts of wormhole switching: (a) two worms en route to their respective destinations, with worm 1 moving while worm 2 is blocked; (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of an attempted right turn. Because a blocked worm holds the channels it occupies, such cycles must be prevented; one standard remedy, not detailed on this slide, is to restrict allowable turns, as in dimension-order routing (sketched below).
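A minimal C sketch of dimension-order (XY) routing on a 2D mesh (my illustration; the slides give no such code): every packet moves fully in the X dimension before any Y movement, which rules out the cyclic turn patterns that cause wormhole deadlock:

 #include <stdio.h>

 enum port { EAST, WEST, NORTH, SOUTH, EJECT };

 /* Choose the output port at router (x, y) for a packet headed to
    (dx, dy): exhaust the X dimension first, then Y (XY routing). */
 enum port route_xy(int x, int y, int dx, int dy) {
     if (dx > x) return EAST;
     if (dx < x) return WEST;
     if (dy > y) return NORTH;
     if (dy < y) return SOUTH;
     return EJECT;                /* packet has arrived */
 }

 int main(void) {
     int x = 0, y = 0;            /* trace a packet from (0,0) to (2,1) */
     const int dx = 2, dy = 1;
     for (;;) {
         enum port p = route_xy(x, y, dx, dy);
         if (p == EJECT) break;
         if (p == EAST) x++; else if (p == WEST) x--;
         else if (p == NORTH) y++; else y--;
         printf("hop to (%d, %d)\n", x, y);
     }
     return 0;
 }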
28.4 Building and Using Multicomputers

Figure 28.10 A task system and schedules on 1, 2, and 3 computers: (a) a static task graph with tasks A-H, each taking 1 to 3 time units; (b) schedules showing how total completion time (up to 15 units on a single computer) shrinks as computers are added.

Building Multicomputers from Commodity Nodes

Figure 28.11 Growing clusters using modular nodes: (a) current racks of modules, each module holding CPU(s), memory, disks, and expansion slots; (b) futuristic toy-block construction with wireless connection surfaces.

28.5 Network-Based Distributed Computing

Figure 28.12 Network of workstations: each PC attaches, via its system or I/O bus, to a fast network interface (NIC) with large memory, connected to a network built of high-speed wormhole switches.

28.6 Grid Computing and Beyond

A computational grid is analogous to the power grid: it decouples the "production" and "consumption" of computational power. Homes don't have an electricity generator; why should they have a computer?

Advantages of a computational grid:
 • Near-continuous availability of computational and related resources
 • Resource requirements based on the sum of averages, rather than the sum of peaks
 • Paying for services based on actual usage rather than peak demand
 • Distributed data storage for higher reliability, availability, and security
 • Universal access to specialized and one-of-a-kind computing resources

Still to be worked out as of the late 2000s: how to charge for compute usage.

Computing in the Cloud

Computational resources, both hardware and software, are provided by, and managed within, the cloud; users pay a fee for access. Managing and upgrading are much more efficient in large, centralized facilities (warehouse-sized data centers or server farms). [Image from Wikipedia]

This is a natural continuation of the outsourcing trend for special services, allowing companies to focus their energies on their main business.

The Shrinking Supercomputer

Warehouse-Sized Data Centers

[Image from IEEE Spectrum, June 2009]