CSC 2405 Computer Systems II
Advanced Topics

Instruction Set Architecture

Assembly Language View
– Processor state
    Registers, memory, …
– Instructions
    addl, movl, leal, …
    How instructions are encoded as bytes
Layer of Abstraction
– Above: how to program machine
    Processor executes instructions in a sequence
– Below: what needs to be built
    Use variety of tricks to make it run fast
    E.g., execute multiple instructions simultaneously
[Figure: abstraction layers, top to bottom: Application Program, Compiler, OS, ISA, CPU Design, Circuit Design, Chip Layout]
Chapter 4

Instruction Set Architectures
Basic ISA Classes
The differences among the address classes are easiest to see in the examples here, all of which implement C = A + B.

Stack     Accumulator   Register (register-memory)   Register (load-store)
Push A    Load A        Load R1, A                   Load R1, A
Push B    Add B         Add R1, B                    Load R2, B
Add       Store C       Store C, R1                  Add R3, R1, R2
Pop C                                                Store C, R3

Registers are the class that won out. The more registers on the CPU, the better.
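As a rough illustration of the table above, the four sequences can be run on a toy interpreter. This is only a sketch, not a model of any real machine: the opcode names (push, sadd, aload, addm, ...) and the operand semantics (for example, that "Add R1, B" adds memory word B into R1, and that "Store C, R1" writes R1 to C) are assumptions made for the example.

```python
def run(program, memory):
    """Run a list of (op, *operands) tuples; return the final memory."""
    mem = dict(memory)   # variables A, B, C live in memory
    stack = []           # operand stack (stack machine only)
    acc = 0              # the single accumulator (accumulator machine only)
    regs = {}            # general-purpose registers R1, R2, ...
    for op, *args in program:
        if op == "push":            # stack: push a memory word
            stack.append(mem[args[0]])
        elif op == "pop":           # stack: pop the top into memory
            mem[args[0]] = stack.pop()
        elif op == "sadd":          # stack: replace the top two entries by their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "aload":         # accumulator: acc = mem[x]
            acc = mem[args[0]]
        elif op == "aadd":          # accumulator: acc += mem[x]
            acc += mem[args[0]]
        elif op == "astore":        # accumulator: mem[x] = acc
            mem[args[0]] = acc
        elif op == "load":          # register: reg = mem[x]
            regs[args[0]] = mem[args[1]]
        elif op == "store":         # register: mem[x] = reg
            mem[args[0]] = regs[args[1]]
        elif op == "addm":          # register-memory: reg += mem[x]
            regs[args[0]] += mem[args[1]]
        elif op == "add":           # load-store: rd = rs1 + rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        else:
            raise ValueError(f"unknown op {op!r}")
    return mem

# The four columns of the table, transcribed with the assumed opcode names.
STACK = [("push", "A"), ("push", "B"), ("sadd",), ("pop", "C")]
ACCUMULATOR = [("aload", "A"), ("aadd", "B"), ("astore", "C")]
REGISTER_MEMORY = [("load", "R1", "A"), ("addm", "R1", "B"), ("store", "C", "R1")]
LOAD_STORE = [("load", "R1", "A"), ("load", "R2", "B"),
              ("add", "R3", "R1", "R2"), ("store", "C", "R3")]

if __name__ == "__main__":
    for name, prog in [("stack", STACK), ("accumulator", ACCUMULATOR),
                       ("register-memory", REGISTER_MEMORY), ("load-store", LOAD_STORE)]:
        result = run(prog, {"A": 2, "B": 3, "C": 0})
        print(name, len(prog), "instructions, C =", result["C"])
```

All four programs leave the same value in C; what differs is the instruction count and where the operands live while the add executes.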
80x86 Instruction Frequency

Rank   Instruction      Frequency
1      load             22%
2      branch           20%
3      compare          16%
4      store            12%
5      add              8%
6      and              6%
7      sub              5%
8      register move    4%
9      call             1%
10     return           1%
       Total            96%

Relative Frequency of Control Instructions

Operation      Integer   Floating Pt
Call/Return    19%       8%
Jumps          6%        10%
Branches       75%       82%

Design hardware to handle branches quickly, since these occur most frequently.

CISC Instruction Sets
– Complex Instruction Set Computer
– Dominant style through mid-80's
Stack-oriented instruction set
– Use stack to pass arguments, save program counter
– Explicit push and pop instructions
Arithmetic instructions can access memory
– addl %eax, 12(%ebx,%ecx,4)
    Requires memory read and write
    Complex address calculation
Condition codes
– Set as side effect of arithmetic and logical instructions
Philosophy
– Add instructions to perform "typical" programming tasks

RISC Instruction Sets
– Reduced Instruction Set Computer
– Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
Fewer, simpler instructions
– Might take more to get given task done
– Can execute them with small and fast hardware
Register-oriented instruction set
– Many more (typically 32) registers
– Use for arguments, return pointer, temporaries
Only load and store instructions can access memory
– Similar to Y86 mrmovl and rmmovl
No condition codes
– Test instructions return 0/1 in register

Example RISC Instruction Formats

Register-Register (R-type)    ADD R1, R2, R3
  Bits    31-26   25-21   20-16   15-11   10-6          5-0
  Field   Op      rs1     rs2     rd      (unlabeled)   func
  (ALU reg. operations, read/write special registers and moves)

Register-Immediate (I-type)   SUB R1, R2, #3
  Bits    31-26   25-21   20-16   15-0
  Field   Op      rs1     rd      immediate
  (ALU imm. operations, loads and stores, conditional branch, jump (and link))

Jump / Call (J-type)          JUMP end
  Bits    31-26   25-0
  Field   Op      offset added to PC
  (jump, jump and link, trap and return from exception)

CISC vs. RISC
Original Debate
– Strong opinions!
– CISC proponents: easy for compiler, fewer code bytes
– RISC proponents: better for optimizing compilers, can make run fast with simple chip design
Current Status
– For desktop processors, choice of ISA not a technical issue
    With enough hardware, can make anything run fast
    Code compatibility more important
– For embedded processors, RISC makes sense
    Smaller, cheaper, less power

Logic Design

Overview of Logic Design
Fundamental Hardware Requirements
– Communication
    How to get values from one place to another
– Computation
– Storage
Bits are Our Friends
– Everything expressed in terms of values 0 and 1
– Communication
    Low or high voltage on wire
– Computation
    Compute Boolean functions
– Storage
    Store bits of information

Digital Signals
[Figure: voltage vs. time waveform read as the bit sequence 0, 1, 0]
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
    Either high range (1) or low range (0)
    With guard range between them
– Not strongly affected by noise or low quality circuit elements
    Can make circuits simple, small, and fast

Computing with Logic Gates
    And: out = a && b
    Or:  out = a || b
    Not: out = !a
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
    With some, small delay
[Figure: voltage vs. time for a, b, and a && b, showing rising delay and falling delay]

Combinational Circuits
[Figure: acyclic network connecting primary inputs to primary outputs]
Acyclic Network of Logic Gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) Boolean functions of primary inputs

Bit Equality
[Figure: bit-equal circuit with inputs a and b and output eq]
HCL Expression
    bool eq = (a&&b)||(!a&&!b)
– Generate 1 if a and b are equal
Hardware Control Language (HCL)
– Very simple hardware description language
    Boolean operations have syntax similar to C logical operations
– We'll use it to describe control logic for processors

Word Equality
Word-Level Representation
[Figure: bit-level circuit with one bit-equal unit per bit pair (a31, b31), (a30, b30), ..., whose outputs eq31, eq30, ... combine into Eq; word-level symbol with inputs A and B and output Eq]
HCL Representation
    bool Eq = (A == B)
– 32-bit word size
– HCL representation
    Equality operation
    Generates Boolean value

1-Bit Latch
[Figure: D latch with inputs Data (D) and Clock (C), internal signals R and S, and outputs Q+ and Q–. Latching (C = 1): Q+ = d, Q– = !d. Storing (C = 0): Q+ = q, Q– = !q.]

Registers
Structure
[Figure: eight latches, each with D, C, and Q+, taking inputs i7..i0 (I) to outputs o7..o0 (O) under a shared Clock]
– Stores word of data
    Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock

Random-Access Memory
[Figure: register file with read ports A (srcA in, valA out) and B (srcB in, valB out), write port W (dstW, valW in), and Clock]
– Stores multiple words of memory
    Address input specifies which word to read or write
– Register file
    Holds values of program registers %eax, %esp, etc.
    Register identifier serves as address
    ID 8 implies no read or write performed
– Multiple Ports
    Can read and/or write multiple words in one cycle
    Each has separate address and data input/output

Basic Logic Gates
[Figure: gate symbols]
NOTE: okay to use just a circle for NOT.

More than 2 Inputs?
AND/OR can take any number of inputs.
– AND = 1 if all inputs are 1.
– OR = 1 if any input is 1.
– Similar for NAND/NOR.
Can implement with multiple two-input gates.

Logical Completeness
Can implement ANY truth table with AND, OR, NOT.

A  B  C  |  D
0  0  0  |  0
0  0  1  |  0
0  1  0  |  1
0  1  1  |  0
1  0  0  |  0
1  0  1  |  1
1  1  0  |  0
1  1  1  |  0

1. AND combinations that yield a "1" in the truth table.
2. OR the results of the AND gates.

DeMorgan's Law
Converting AND to OR (with some help from NOT)
Consider the following gate: an AND gate with both inputs and the output inverted.

A  B  |  NOT A  NOT B  |  (NOT A) AND (NOT B)  |  NOT((NOT A) AND (NOT B))
0  0  |  1      1      |  1                    |  0
0  1  |  1      0      |  0                    |  1
1  0  |  0      1      |  0                    |  1
1  1  |  0      0      |  0                    |  1

To convert AND to OR (or vice versa), invert inputs and output.
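The two-step recipe for logical completeness and the DeMorgan table can both be checked by brute force. The sketch below is an illustration only: the function names and the dictionary encoding of the truth table are choices made for this example, not part of the slides.

```python
from itertools import product

def de_morgan_holds():
    """Check NOT(NOT a AND NOT b) == a OR b for all four input combinations."""
    return all((not (not a and not b)) == (a or b)
               for a, b in product([False, True], repeat=2))

def sum_of_products(truth_table):
    """Build a circuit from AND, OR, NOT only.

    truth_table maps each input tuple of 0/1 values to an output of 0 or 1.
    Step 1: one AND term per row whose output is 1.
    Step 2: OR the AND terms together.
    """
    one_rows = [inputs for inputs, out in truth_table.items() if out == 1]

    def circuit(*bits):
        # Each AND term uses an input directly where the row has a 1,
        # and its inversion where the row has a 0.
        return int(any(all((b if want else not b) for b, want in zip(bits, row))
                       for row in one_rows))

    return circuit

# The A, B, C -> D table from the Logical Completeness slide.
TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 0,
}

if __name__ == "__main__":
    d = sum_of_products(TABLE)
    print("DeMorgan holds:", de_morgan_holds())
    print("circuit matches table:", all(d(*k) == v for k, v in TABLE.items()))
```

For this table the construction yields two AND terms, one for the row 0 1 0 and one for the row 1 0 1, joined by a single OR.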
Decoder
n inputs, 2^n outputs
– exactly one output is 1 for each possible input pattern
[Figure: 2-bit decoder]

Sequential Processors

Sequential HW Structure
State
– Program counter register (PC)
– Condition code register (CC)
– Register File
– Memories
    Access same memory space
    Data: for reading/writing program data
    Instruction: for reading instructions
Instruction Flow
– Read instruction at address specified by PC
– Process through stages
– Update program counter
[Figure: sequential hardware, bottom to top. Fetch: PC, instruction memory, PC increment, producing icode, ifun, rA, rB, valC, valP. Decode: register file with ports A, B, E, M, addressed by srcA, srcB, dstA, dstB, producing valA, valB. Execute: ALU with inputs aluA, aluB, plus CC, producing valE and Bch. Memory: data memory with Addr and Data, producing valM. Write back: valE, valM. PC: newPC.]

Sequential Stages
Fetch
– Read instruction from instruction memory
Decode
– Read program registers
Execute
– Compute value or address
Memory
– Read or write data
Write Back
– Write program registers
PC
– Update program counter

Instruction Decoding
[Figure: instruction bytes 5 0 | rA rB | D mapped to the fields icode, ifun, rA, rB, valC; the register byte and the constant word are marked Optional]
Instruction Format
– Instruction byte: icode:ifun
– Optional register byte: rA:rB
– Optional constant word: valC

Sequential Summary
Implementation
– Express every instruction as series of simple steps
– Follow same general flow for each instruction type
– Assemble registers, memories, predesigned combinational blocks
– Connect with control logic
Limitations
– Too slow to be practical
– In one cycle, must propagate through instruction memory, register file, ALU, and data memory
– Would need to run clock very slowly
– Hardware units only active for fraction of clock cycle
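The instruction format from the Instruction Decoding slide (an instruction byte icode:ifun, an optional register byte rA:rB, an optional constant word valC) can be sketched as a small fetch routine. Several details here are assumptions rather than facts from the slides: the function name and its two flag parameters are hypothetical, the constant word is taken to be 4 bytes in little-endian order, and the example byte string is made up to match the "5 0 rA rB D" pattern in the figure.

```python
def fetch(code, pc, need_regids, need_valc):
    """Split the bytes at code[pc:] into instruction fields.

    need_regids / need_valc say whether this instruction type carries the
    optional register byte and the optional constant word. In hardware those
    would be derived from icode; here the caller supplies them.
    """
    icode, ifun = code[pc] >> 4, code[pc] & 0xF   # high and low nibble of byte 0
    next_pc = pc + 1

    ra = rb = 8                                   # 8 = "no register" ID
    if need_regids:
        ra, rb = code[next_pc] >> 4, code[next_pc] & 0xF
        next_pc += 1

    valc = None
    if need_valc:
        # Assumption: 4-byte little-endian signed constant.
        valc = int.from_bytes(code[next_pc:next_pc + 4], "little", signed=True)
        next_pc += 4

    # valP is the address of the next sequential instruction.
    return {"icode": icode, "ifun": ifun, "rA": ra, "rB": rb,
            "valC": valc, "valP": next_pc}

if __name__ == "__main__":
    # Hypothetical encoding: icode 5, ifun 0, rA = 0, rB = 3, D = 12.
    program = bytes.fromhex("50030c000000")
    print(fetch(program, 0, need_regids=True, need_valc=True))
```

Passing the two flags in explicitly keeps the sketch short; a fuller version would look them up from icode, which is the job of the control logic in the fetch stage.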
Pipelined Processors

What is Pipelining
Computers execute billions of instructions, so instruction throughput is what matters.
IDEA: Divide instruction execution up into several pipeline stages. For example:
    IF  ID  EX  MEM  WB
Simultaneously have different instructions in different pipeline stages.
The length of the longest pipeline stage determines the cycle time.
Desirable pipeline features (e.g., RISC):
– all instructions same length
– registers located in same place in instruction format
– memory operands only in loads or stores

What Is Pipelining
Laundry Example
Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
– Washer takes 30 minutes
– Dryer takes 40 minutes
– "Folder" takes 20 minutes

[Figure: timeline from 6 PM to midnight; loads A, B, C, D run one after another, each taking 30 + 40 + 20 minutes]
Sequential laundry takes 6 hours for 4 loads (4 × 90 minutes).
If they learned pipelining, how long would laundry take?

[Figure: timeline starting at 6 PM; loads A, B, C, D overlap, with segments of 30, 40, 40, 40, 40, 20 minutes]
Start work ASAP.
Pipelined laundry takes 3.5 hours for 4 loads (30 + 4 × 40 + 20 = 210 minutes).

Pipelining Lessons
– Pipelining doesn't help latency of single task, it helps throughput of entire workload
– Pipeline rate limited by slowest pipeline stage
– Multiple tasks operating simultaneously
– Potential speedup = Number pipe stages
– Unbalanced lengths of pipe stages reduces speedup
– Time to "fill" pipeline and time to "drain" it reduces speedup

Real-World Pipelines: Car Washes
[Figure: three car-wash layouts labeled Sequential, Pipelined, Parallel]
Idea
– Divide process into independent stages
– Move objects through stages in sequence
– At any given time, multiple objects being processed

Pipeline Diagrams
Unpipelined
[Figure: OP1, OP2, OP3 laid end to end along the Time axis]
– Cannot start new operation until previous one completes
3-Way Pipelined
[Figure: OP1, OP2, OP3 each split into stages A, B, C, with each operation starting one stage after the previous one]
– Up to 3 operations in process
simultaneously

Data Dependencies
[Figure: System: combinational logic feeding a register (Reg) driven by Clock; OP1, OP2, OP3 along the Time axis]
– Each operation depends on result from preceding one

Data Hazards
[Figure: Comb. logic A, Reg, Comb. logic B, Reg, Comb. logic C, Reg, driven by Clock; OP1 to OP4 overlap in stages A, B, C along the Time axis]
– Result does not feed back around in time for next operation
– Pipelining has changed behavior of system

One Memory Port/Structural Hazards
[Figure: pipeline diagram over Cycle 1 to Cycle 7, instruction order Load, Instr 1, Instr 2, Instr 3, Instr 4; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous one]

One Memory Port/Structural Hazards
[Figure: the same diagram with a Stall: a row of Bubbles is inserted after Instr 2, so Instr 3 starts its Ifetch one cycle later]
How do you "bubble" the pipe?

Data Hazard on R1
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for the instruction sequence below]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

Three Generic Data Hazards
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it.
    I: add r1,r2,r3
    J: sub r4,r1,r3
Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
Write After Read (WAR)
InstrJ writes operand before InstrI reads it.
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".

Three Generic Data Hazards
Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".

Data Forwarding
Naïve Pipeline
– Register isn't written until completion of write-back stage
– Source operands read from register file in decode stage
    Needs to be in register file at start of stage
Observation
– Value generated in execute or memory stage
Trick
– Pass value directly from generating instruction to decode stage
– Needs to be available at end of decode stage

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over time (clock cycles) for the instruction sequence below, with forwarding paths carrying r1 from add to the later instructions]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
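The three hazard categories above come down to comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below classifies a pair that way. It assumes the three-operand form "op dest,src1,src2" used on the slides, and the helper names are made up for the example. It reports dependences only; whether one actually stalls a given pipeline depends on stage timing and on forwarding.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")

def parse(text):
    """Parse 'op dest,src1,src2' into an Instr (assumed three-operand form)."""
    op, operands = text.split(None, 1)
    dest, *srcs = [r.strip() for r in operands.split(",")]
    return Instr(op, dest, tuple(srcs))

def hazards(first, second):
    """Return the hazard types that 'second' can have with the earlier 'first'."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found

if __name__ == "__main__":
    sequence = ["add r1,r2,r3", "sub r4,r1,r3", "and r6,r1,r7",
                "or r8,r1,r9", "xor r10,r1,r11"]
    producer = parse(sequence[0])
    for text in sequence[1:]:
        print(f"{text:16s} vs {sequence[0]}: {sorted(hazards(producer, parse(text)))}")
```

Run on the five-instruction example, every later instruction shows a RAW dependence on the add through r1, which is the case forwarding is meant to handle.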