CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson & Hennessy, ©2005 The CPU • Processor (CPU): the active part of the computer, which does all the work (data manipulation and decision-making) – Datapath: portion of the processor which contains hardware necessary to perform operations required by the processor (the brawn) – Control: portion of the processor (also in hardware) which tells the datapath what needs to be done (the brain) The CPU • Our implementation of the MIPS is simplified – memory-reference instructions: lw, sw – arithmetic-logical instructions: add, sub, and, or, slt – control flow instructions: beq, j • Generic implementation – use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC) Fetch PC = PC+4 Exec Decode – decode the instruction (and read registers) – execute the instruction • All instructions (except j) use the ALU after reading the registers A Small Set of Instructions 31 R I op 25 rs 20 rt 15 rd 10 sh fn 5 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits Opcode Source 1 or base Source 2 or dest’n Destination Unused Opcode ext imm Operand / Offset, 16 bits J jta Jump target address, 26 bits inst Instruction, 32 bits MIPS instruction formats and naming of the various fields. We will refer to this diagram later Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) Six I-format ALU instructions (lui, addi, slti, andi, ori, xori) Two I-format memory access instructions (lw, sw) Three I-format conditional branch instructions (bltz, beq, bne) Four unconditional jump instructions (j, jr, jal, syscall) 0 The MIPS Instruction Set Copy Arithmetic Logic Memory access Control transfer Instruction Usage Load upper immediate lui rt,imm Add add rd,rs,rt Subtract sub rd,rs,rt Set less than slt rd,rs,rt Add immediate addi rt,rs,imm Set less than immediate slti rd,rs,imm AND and rd,rs,rt OR or rd,rs,rt XOR xor rd,rs,rt NOR nor rd,rs,rt AND immediate andi rt,rs,imm OR immediate ori rt,rs,imm XOR immediate xori rt,rs,imm Load word lw rt,imm(rs) Store word sw rt,imm(rs) Jump j L Jump register jr rs Branch less than 0 bltz rs,L Branch equal beq rs,rt,L Branch not equal bne rs,rt,L Jump and link jal L System call syscall op fn 15 0 0 0 8 10 0 0 0 0 12 13 14 35 43 2 0 1 4 5 3 0 32 34 42 36 37 38 39 8 12 Fetching Instructions • Fetching instructions involves – reading the instruction from Add 4 the Instruction Memory Instruction Memory – updating the PC to hold the address of the next instruction – PC is updated every cycle, so it does not need an explicit write control signal – Instruction Memory is read every cycle, so it doesn’t need an explicit read control signal PC Read Address Instruction Decoding Instructions • Decoding instructions involves – sending the fetched instruction’s opcode and function field bits to the control unit – reading two values from the Register File • Register File addresses are contained in the instruction Control Unit Read Addr 1 Instruction Read Register Read Addr 2 Data 1 File Write Addr Read Write Data Data 2 Executing R Format Operations • R format operations (add,sub,slt,and,or) 31 R-type: op 25 rs 20 15 rt rd 10 5 0 shamt funct – perform the (op and funct) operation on values in rs and rt – store the result back into the Register File (into location rd) RegWrite Instruction Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data ALU control ALU overflow zero Data 2 – The Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File Executing Load and Store Operations • Load and store operations involve – compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction – store value (read from the Register File during decode) written to the Data Memory – load value, read from the Data Memory, written to the Register File RegWrite Instruction ALU control overflow zero Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data 16 Address ALU Data Memory Read Data Write Data Data 2 Sign Extend MemWrite MemRead 32 Executing Branch Operations Add 4 Add Shift left 2 Branch target address ALU control PC • Branch operations involves Instruction – compare the operands read from the Register File during decode for equality (zero ALU output) – compute the branch target address by adding the updated PC to the 16bit signed-extended offset field in the instr Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data 16 Data 2 Sign Extend 32 zero (to branch control logic) ALU Jump and Branch Instructions Unconditional jump and jump through register instructions j jr $ra is the symbolic name for reg. $31 (return address) verify $ra 31 J # go to mem loc named “verify” # go to address that is in $ra; # $ra may hold a return address op jump target address 25 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 j=2 x x x x 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 From PC (incremented) op 31 R Effective target address (32 bits) 25 rs 20 rt 15 rd 10 sh 5 fn 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ALU instruction Source register Unused Unused Unused jr = 8 The jump instruction j of MiniMIPS is a J-type instruction which is shown along with how its effective target address is obtained. The jump register (jr) instruction is R-type, with its specified register often being $ra. Executing Jump Operations • Jump operation involves – replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits Add 4 4 Instruction Memory PC Read Address Shift left 2 Instruction 26 Jump address 28 Creating a Single Datapath from the Parts • Assemble the datapath segments and add control lines and multiplexors as needed • Single cycle design – fetch, decode and execute each instructions in one clock cycle – no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders) – multiplexors needed at the input of shared elements with control lines to do the selection – write signals to control writing to the Register File and Data Memory • Cycle time is determined by length of the longest path The Main Control Unit Control signals derived from instruction R-type Load/ Store Branch 0 rs rt rd shamt funct 31:26 25:21 20:16 15:11 10:6 5:0 35 or 43 rs rt address 31:26 25:21 20:16 15:0 4 rs rt address 31:26 25:21 20:16 15:0 opcode always read read, except for load write for R-type and load sign-extend and add Single Cycle Datapath with Control Unit 0 Add Add Shift left 2 4 ALUOp 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Address Instr[31-0] ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control R-type Instruction Data/Control Flow 0 Add Add Shift left 2 4 ALUOp 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Instr[31-0] Address ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[5-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control Deriving the Control Signals Control signals for the single-cycle MicroMIPS implementation. Control signal Reg file ALU op Data cache Next addr 0 1 2 3 RegWrite Don’t write Write RegDst rt rd $31 MemtoReg Data out ALU out IncrPC ALUSrc (rt ) imm AddSub Add Subtract LogicFn1, LogicFn0 AND OR XOR NOR FnClass1, FnClass0 lui Set less Arithmetic Logic MemRead Don’t read Read MemWrite Don’t write Write BrType1, BrType0 No branch beq bne bltz PCSrc IncrPC jta (rs) SysCallAddr Load Word Instruction Data/Control Flow 0 Add Shift left 2 4 ALUOp Instr[31-26] Add 1 PCSrc Branch MemRead MemtoReg MemWrite Control Unit ALUSrc RegWrite RegDst ovf Instr[25-21] PC Instruction Memory Read Address Instr[31-0] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Store Word Instruction? Instr[15-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control Deriving the Control Signals Control signals for the single-cycle MicroMIPS implementation. Control signal Reg file ALU op Data cache Next addr 0 1 2 3 RegWrite Don’t write Write RegDst rt rd $31 MemtoReg Data out ALU out IncrPC ALUSrc (rt ) imm AddSub Add Subtract LogicFn1, LogicFn0 AND OR XOR NOR FnClass1, FnClass0 lui Set less Arithmetic Logic MemRead Don’t read Read MemWrite Don’t write Write BrType1, BrType0 No branch beq bne bltz PCSrc IncrPC jta (rs) SysCallAddr Branch Instruction Data/Control Flow 0 Add Add Shift left 2 4 ALUOp 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst ovf Instr[25-21] Instruction Memory PC Read Instr[31-0] Address Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control Deriving the Control Signals Control signals for the single-cycle MicroMIPS implementation. Control signal Reg file ALU op Data cache Next addr 0 1 2 3 RegWrite Don’t write Write RegDst rt rd $31 MemtoReg Data out ALU out IncrPC ALUSrc (rt ) imm AddSub Add Subtract LogicFn1, LogicFn0 AND OR XOR NOR FnClass1, FnClass0 lui Set less Arithmetic Logic MemRead Don’t read Read MemWrite Don’t write Write BrType1, BrType0 No branch beq bne bltz PCSrc IncrPC jta (rs) SysCallAddr Jump Operation? 0 Add Add Shift left 2 4 ALUOp 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Address Instr[31-0] ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control Adding the Jump Operation Instr[25-0] Shift 28 left 2 PC+4[31-28] 26 1 32 0 0 Add Jump ALUOp Add Shift left 2 4 1 PCSrc Branch MemRead MemtoReg MemWrite Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Instr[31-0] Address ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control Can the Control Unit set all the control signals based solely on the opcode field of the instruction? Instr[25-0] Shift left 2 26 1 28 32 0 PC+4[31-28] Add Jump ALUOp Instr[31-26] Add Shift left 2 4 0 1 PCSrc Branch MemRead MemtoReg MemWrite Control Unit ALUSrc RegWrite RegDst Instruction Memory PC Read Instr[31-0] Address ovf Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read 1 Instr[15 -11] Instr[15-0] Write Data zero ALU Data Memory Read Data 1 Write Data 0 0 Data 2 1 Sign 16 Extend Address 32 Instr[5-0] ALU control 0 1 1 0 1 00 01 10 11 00 01 10 0 0 FnClass 00 10 10 01 10 01 11 11 11 11 11 11 11 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 PCSrc 10 10 1 0 0 0 1 1 0 0 0 0 1 1 1 1 1 LogicFn 01 01 01 01 01 01 01 01 01 01 01 01 01 00 Add’Sub 00 01 01 01 00 00 01 01 01 01 00 00 00 00 BrType 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 DataW rite 001111 000000 100000 000000 100010 000000 101010 001000 001010 000000 100100 000000 100101 000000 100110 000000 100111 001100 001101 001110 100011 101011 000010 000000 001000 000001 000100 000101 000011 000000 001100 ALUSrc fn DataRead Load upper immediate Add Subtract Set less than Add immediate Set less than immediate AND OR XOR NOR AND immediate OR immediate XOR immediate Load word Store word Jump Jump register Branch on less than 0 Branch on equal Branch on not equal Jump and link System call op RegInS rc Settings Instruction RegDst Signal RegWrite Control 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 10 00 00 00 01 11 11 01 10 00 Control Signal Generation Auxiliary signals identifying instruction classes arithInst = addInst subInst sltInst addiInst sltiInst logicInst = andInst orInst xorInst norInst andiInst oriInst xoriInst immInst = luiInst addiInst sltiInst andiInst oriInst xoriInst Example logic expressions for control signals RegWrite = luiInst arithInst logicInst lwInst jalInst ALUSrc = immInst lwInst swInst addInst subInst jInst AddSub = subInst sltInst sltiInst DataRead = lwInst PCSrc0 = jInst jalInst syscallInst . . . Control . . . sltInst 13.6 Performance of the Single-Cycle Design An example combinational-logic data path to compute z := (u + v)(w – x) / y Add/Sub latency 2 ns Multiply latency 6 ns Divide latency 15 ns Total latency 23 ns u + v Note that the divider gets its correct inputs after 9 ns, but this won’t cause a problem if we allow enough total time w - / z x y Beginning with inputs u, v, w, x, and y stored in registers, the entire computation can be completed in 25 ns, allowing 1 ns each for register readout and write Performance Estimation for Single-Cycle MicroMIPS Instruction access 2 ns Register read 1 ns ALU operation 2 ns Data cache access 2 ns Register write 1 ns Total 8 ns Single-cycle clock = 125 MHz R-type 44% 6 ns Load 24% 8 ns Store 12% 7 ns Branch 18% 5 ns Jump 2% 3 ns Weighted mean 6.36 ns ALU-type P C Load P C Store P C Branch P C (and jr) Jump (except jr & jal) P C Not used Not used Not used Not used Not used Not used Not used Not used Not used The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. How Good is Our Single-Cycle Design? Clock rate of 125 MHz not impressive How does this compare with current processors on the market? Not bad, where latency is concerned Instruction access 2 ns Register read 1 ns ALU operation 2 ns Data cache access 2 ns Register write 1 ns Total 8 ns Single-cycle clock = 125 MHz A 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 ns/cycle 20 cycles = 8 ns Throughput, however, is much better for the pipelined processor: Up to 20 times better with single issue Perhaps up to 100 times better with multiple issue