2010 R&E Computer System Education & Research
Lecture 9. MIPS Processor Design – Single-Cycle Processor Design
Prof. Taeweon Suh
Computer Science Education
Korea University

Single-Cycle MIPS Processor
• Again, the microarchitecture (CPU implementation) is divided into 2 interacting parts
  - Datapath
  - Control

Single-Cycle Processor Design
• Let’s start with a memory access instruction: lw
  - Example: lw $2, 80($0)
  - I-Type format: op (6 bits) | rs (5 bits) | rt (5 bits) | imm (16 bits)
• STEP 1: Instruction Fetch
  - The PC drives the address input (A) of the instruction memory; the fetched instruction (Instr) appears on its read port (RD)
[Figure: PC register, instruction memory (A, RD), register file (A1, A2, A3, WD3, WE3, RD1, RD2) and data memory (A, WD, WE, RD)]

Single-Cycle Processor Design
• STEP 2: Decoding
  - Read source operands from the register file
  - Example: lw $2, 80($0)
[Figure: Instr[25:21] (rs) connected to register file address port A1]

Single-Cycle Processor Design
• STEP 2: Decoding
  - Sign-extend the immediate
  - Example: lw $2, 80($0)

    module signext(input  [15:0] a,
                   output [31:0] y);
      // replicate the sign bit a[15] into the upper 16 bits
      assign y = {{16{a[15]}}, a};
    endmodule

[Figure: Instr[15:0] passes through the Sign Extend unit to produce SignImm]

Single-Cycle Processor Design
• STEP 3: Execution
  - Compute the memory address: base register value (SrcA = RD1) plus sign-extended offset (SrcB = SignImm)
  - Example: lw $2, 80($0)
[Figure: ALU with inputs SrcA and SrcB and outputs Zero and ALUResult; ALUControl2:0 = 010 (add); ALUResult drives the data memory address A]

Single-Cycle Processor Design
• STEP 4: Read data from memory and write it back to the register file
  - Example: lw $2, 80($0)
[Figure: data memory output ReadData connected to register file port WD3; Instr[20:16] (rt) connected to A3; RegWrite = 1, ALUControl2:0 = 010]
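The register file used in the decode and write-back steps above could be sketched as follows. This is a minimal sketch, not the lecture's own code: the port names simply follow the figure labels (A1, A2, A3, WD3, WE3, RD1, RD2), the array name rf is made up for illustration, and it assumes combinational reads with writes on the rising clock edge.

```verilog
// Hypothetical three-ported register file for the single-cycle datapath.
// Two read ports (a1/rd1, a2/rd2) are combinational;
// one write port (a3/wd3) is written on the rising clock edge when we3 = 1.
module regfile(input         clk,
               input         we3,
               input  [4:0]  a1, a2, a3,
               input  [31:0] wd3,
               output [31:0] rd1, rd2);

  reg [31:0] rf[31:0];   // 32 registers of 32 bits each

  // write-back, e.g. lw writes ReadData into register rt
  always @(posedge clk)
    if (we3) rf[a3] <= wd3;

  // MIPS register $0 always reads as zero
  assign rd1 = (a1 != 5'b0) ? rf[a1] : 32'b0;
  assign rd2 = (a2 != 5'b0) ? rf[a2] : 32'b0;
endmodule
```

For lw $2, 80($0), a1 would carry Instr[25:21] = 0, a3 would carry Instr[20:16] = 2, and we3 would be driven by RegWrite.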
Single-Cycle Processor Design
• We are done with lw
• CPU starts fetching the next instruction from PC+4

    module adder(input  [31:0] a, b,
                 output [31:0] y);
      assign y = a + b;
    endmodule

    // instantiated in the datapath to compute PC+4
    adder pcadd1(pc, 32'b100, pcplus4);

[Figure: lw datapath with an adder computing PCPlus4 = PC + 4, which feeds PC'; data memory ReadData is the Result written to the register file; RegWrite = 1, ALUControl2:0 = 010]

Single-Cycle Processor Design
• Let’s consider another memory access instruction: sw
  - The sw instruction needs to write data to the data memory
  - Example: sw $2, 84($0)
  - I-Type format: op (6 bits) | rs (5 bits) | rt (5 bits) | imm (16 bits)
[Figure: sw datapath; Instr[20:16] (rt) also drives A2, and RD2 is the WriteData sent to the data memory WD port; RegWrite = 0, MemWrite = 1, ALUControl2:0 = 010]

Single-Cycle Processor Design
• Let’s consider arithmetic and logical instructions: add, sub, and, or
  - Write ALUResult to the register file
  - Note that R-type instructions write to the rd field of the instruction (instead of rt)
  - R-Type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
[Figure: R-type datapath with three multiplexers: ALUSrc selects SrcB (RD2 or SignImm), RegDst selects WriteReg4:0 (Instr[20:16] or Instr[15:11]), MemtoReg selects Result (ALUResult or ReadData); RegWrite = 1, RegDst = 1, ALUSrc = 0, ALUControl2:0 varies, MemWrite = 0, MemtoReg = 0]

Single-Cycle Processor Design
• Let’s consider a branch instruction: beq
  - Determine whether register values are equal
  - Calculate the branch target address (BTA) from the sign-extended immediate and PC+4
  - Example: beq $4, $0, around
  - I-Type format: op (6 bits) | rs (5 bits) | rt (5 bits) | imm (16 bits)
[Figure: beq datapath; SignImm is shifted left by 2 and added to PCPlus4 to form PCBranch; a mux controlled by PCSrc selects PCPlus4 or PCBranch as PC'; RegWrite = 0, RegDst = x, ALUSrc = 0, ALUControl2:0 = 110, Branch = 1, MemWrite = 0, MemtoReg = x]
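The two beq steps above (equality check and branch target calculation) could be wired up roughly as follows. This is a sketch under assumptions: the module names sl2, mux2 and branchlogic and their port names are illustrative, and it assumes PCSrc is formed as Branch AND Zero, which is what the figure's Branch, Zero and PCSrc signals suggest.

```verilog
// Shift left by 2: turns the word offset into a byte offset.
module sl2(input  [31:0] a,
           output [31:0] y);
  assign y = {a[29:0], 2'b00};
endmodule

// Generic 2:1 multiplexer.
module mux2 #(parameter WIDTH = 32)
             (input  [WIDTH-1:0] d0, d1,
              input              s,
              output [WIDTH-1:0] y);
  assign y = s ? d1 : d0;
endmodule

// Hypothetical wrapper showing only the next-PC selection for beq.
module branchlogic(input  [31:0] pcplus4,  // PC + 4
                   input  [31:0] signimm,  // sign-extended immediate
                   input         branch,   // from the control unit
                   input         zero,     // ALU flag: 1 when rs - rt == 0
                   output [31:0] pcnext);  // PC'

  wire [31:0] signimmsh, pcbranch;
  wire        pcsrc;

  // the branch is taken only for a branch instruction with equal operands
  assign pcsrc = branch & zero;

  // BTA = (PC + 4) + (SignImm << 2)
  sl2 immsh(signimm, signimmsh);
  assign pcbranch = pcplus4 + signimmsh;

  mux2 #(32) pcbrmux(pcplus4, pcbranch, pcsrc, pcnext);
endmodule
```

The ALU performs the subtraction (ALUControl2:0 = 110) and only its Zero output is used here; ALUResult itself is ignored for beq.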
Single-Cycle Datapath Example
• We are done with the implementation of basic instructions
• Let’s see how the or instruction works in the implementation
  - R-Type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
[Figure: complete single-cycle datapath with control unit (Op = Instr[31:26], Funct = Instr[5:0]) executing or; RegWrite = 1, RegDst = 1, ALUSrc = 0, Branch = 0, MemWrite = 0, MemtoReg = 0, ALUControl2:0 = 001]

Single-Cycle Processor - Control
• As mentioned, the CPU is designed with datapath and control
• Now, let’s delve into the control part design
[Figure: single-cycle datapath with the control unit; inputs Op (Instr[31:26]) and Funct (Instr[5:0]); outputs MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, RegWrite]

Control Unit
• Opcode and funct fields come from the fetched instruction
• Main Decoder: input Opcode5:0; outputs MemtoReg, MemWrite, Branch, ALUSrc, RegDst, RegWrite, ALUOp1:0
• ALU Decoder: inputs ALUOp1:0 and Funct5:0; output ALUControl2:0

ALU Implementation and Control
• N = 32 in a 32-bit processor

    F2:0   Function
    000    A & B
    001    A | B
    010    A + B
    011    not used
    100    A & ~B
    101    A | ~B
    110    A - B
    111    SLT

• slt: set less than
  - Example: slt $t0, $t1, $t2   // $t0 = 1 if $t1 < $t2
[Figure: N-bit ALU symbol (inputs A, B, control F2:0, output Y) and its implementation: F2 selects B or ~B, an adder with carry-in F2 produces the sum S, bit S[N-1] is zero-extended for SLT, and a 4:1 mux controlled by F1:0 selects Y]

Control Unit: ALU Control
• The implementation is entirely up to the hardware designers
• But the designers should make sure the implementation is reasonable
  - Memory access instructions (lw, sw) need to use the ALU to calculate the memory target address (addition)
  - Branch instructions (beq, bne) need to use the ALU for the equality check (subtraction)

    ALUOp1:0   Meaning
    00         Add
    01         Subtract
    10         Look at Funct
    11         Not Used

    ALUOp1:0   Funct          ALUControl2:0
    00         X              010 (Add)
    X1         X              110 (Subtract)
    1X         100000 (add)   010 (Add)
    1X         100010 (sub)   110 (Subtract)
    1X         100100 (and)   000 (And)
    1X         100101 (or)    001 (Or)
    1X         101010 (slt)   111 (SLT)

[Figure: control unit; the Main Decoder passes ALUOp1:0 to the ALU Decoder, which combines it with Funct5:0 to produce ALUControl2:0]

Control Unit: Main Decoder

    Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
    R-type       000000  1         1       0       0       0         0         10
    lw           100011  1         0       1       0       0         1         00
    sw           101011  0         X       1       0       1         X         00
    beq          000100  0         X       0       1       0         X         01

[Figure: control unit with Main Decoder and ALU Decoder]

How about Other Instructions?
• Hmmm.. Now, we are done with the control part design
• Let’s examine if the design is able to execute other instructions
  - addi
  - Example: addi $t0, $t1, -14
[Figure: single-cycle datapath with control unit, same hardware as before]

Control Unit: Main Decoder

    Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
    R-type       000000  1         1       0       0       0         0         10
    lw           100011  1         0       1       0       0         1         00
    sw           101011  0         X       1       0       1         X         00
    beq          000100  0         X       0       1       0         X         01
    addi         001000  1         0       1       0       0         0         00
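As a quick check of the addi example above, the negative immediate -14 has to reach the ALU as a 32-bit value. The sketch below reuses the signext module from the lecture and adds a small self-checking testbench; the testbench (signext_tb and its signal names) is hypothetical and not part of the lecture.

```verilog
// Sign-extension unit, as given in the lecture.
module signext(input  [15:0] a,
               output [31:0] y);
  assign y = {{16{a[15]}}, a};
endmodule

// Hypothetical self-checking testbench (not part of the lecture).
module signext_tb;
  reg  [15:0] imm;
  wire [31:0] signimm;

  signext dut(imm, signimm);

  initial begin
    // -14 as a 16-bit two's-complement immediate (addi $t0, $t1, -14)
    imm = 16'hFFF2;
    #1;
    if (signimm !== 32'hFFFFFFF2)
      $display("FAIL: -14 extended to %h", signimm);

    // +80, the offset in lw $2, 80($0)
    imm = 16'h0050;
    #1;
    if (signimm !== 32'h00000050)
      $display("FAIL: 80 extended to %h", signimm);

    $display("sign-extension checks finished");
    $finish;
  end
endmodule
```

With ALUSrc = 1 and ALUOp1:0 = 00 from the addi row of the main decoder, the ALU then adds this sign-extended value to the contents of $t1.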
• Ok. So far, so good…
• How about jump instructions?
  - j
  - J-Type format: op (6 bits) | addr (26 bits)
[Figure: single-cycle datapath with control unit, before jump support is added]

How about Other Instructions?
• We need to add some hardware to support the j instruction
  - Logic to compute the target address
  - Mux and control signal (Jump)
[Figure: datapath with jump support; Instr[25:0] is shifted left by 2 to form bits 27:0 and combined with PCPlus4[31:28] to form PCJump; a second PC mux controlled by Jump selects PCJump as PC']

Control Unit: Main Decoder
• There is one more output in the main decoder to support the jump instructions
  - Jump

    Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
    R-type       000000  1         1       0       0       0         0         10        0
    lw           100011  1         0       1       0       0         1         00        0
    sw           101011  0         X       1       0       1         X         00        0
    beq          000100  0         X       0       1       0         X         01        0
    addi         001000  1         0       1       0       0         0         00        0
    j            000010  0         X       X       X       0         X         XX        1

Verilog Code - Main Decoder and ALU Control

    module maindec(input  [5:0] op,
                   output       memtoreg, memwrite,
                   output       branch, alusrc,
                   output       regdst, regwrite,
                   output       jump,
                   output [1:0] aluop);

      reg [8:0] controls;

      // unpack the 9-bit control word into the individual control signals
      assign {regwrite, regdst, alusrc, branch, memwrite,
              memtoreg, jump, aluop} = controls;

      always @(*)
        case(op)
          6'b000000: controls <= 9'b110000010; // R-type
          6'b100011: controls <= 9'b101001000; // lw
          6'b101011: controls <= 9'b001010000; // sw
          6'b000100: controls <= 9'b000100001; // beq
          6'b001000: controls <= 9'b101000000; // addi
          6'b000010: controls <= 9'b000000100; // j
          default:   controls <= 9'bxxxxxxxxx; // ???
        endcase
    endmodule

    module aludec(input      [5:0] funct,
                  input      [1:0] aluop,
                  output reg [2:0] alucontrol);

      always @(*)
        case(aluop)
          2'b00: alucontrol <= 3'b010;  // add
          2'b01: alucontrol <= 3'b110;  // sub
          default: case(funct)          // RTYPE
              6'b100000: alucontrol <= 3'b010; // ADD
              6'b100010: alucontrol <= 3'b110; // SUB
              6'b100100: alucontrol <= 3'b000; // AND
              6'b100101: alucontrol <= 3'b001; // OR
              6'b101010: alucontrol <= 3'b111; // SLT
              default:   alucontrol <= 3'bxxx; // ???
            endcase
        endcase
    endmodule

[Figure: control unit with Main Decoder and ALU Decoder]

Verilog Code – ALU

    module alu(input      [31:0] a, b,
               input      [2:0]  alucont,
               output reg [31:0] result,
               output            zero);

      wire [31:0] b2, sum, slt;

      assign b2  = alucont[2] ? ~b : b;    // invert B for subtract and SLT
      assign sum = a + b2 + alucont[2];    // carry-in of 1 completes the two's complement
      assign slt = sum[31];                // sign bit of A - B, zero-extended to 32 bits

      always @(*)
        case(alucont[1:0])
          2'b00: result <= a & b2;
          2'b01: result <= a | b2;
          2'b10: result <= sum;
          2'b11: result <= slt;
        endcase

      assign zero = (result == 32'b0);
    endmodule

[Figure: ALU schematic and function table, as shown in "ALU Implementation and Control"]

Single-Cycle Processor Performance
• How fast is the single-cycle processor?
• Clock cycle time (frequency) is limited by the critical path
  - The critical path is the path that takes the longest time
  - What do you think the critical path is?
• The path that the lw instruction goes through
[Figure: single-cycle datapath with the lw path highlighted; RegWrite = 1, RegDst = 0, ALUSrc = 1, Branch = 0, MemWrite = 0, MemtoReg = 1, ALUControl2:0 = 010]

Single-Cycle Processor Performance
• Single-cycle critical path:
    Tc = tpcq_PC + tmem + max(tRFread, tsext) + tmux + tALU + tmem + tmux + tRFsetup
• In most implementations, the limiting paths are: memory (instruction and data), ALU, register file. Thus,
    Tc = tpcq_PC + 2tmem + tRFread + 2tmux + tALU + tRFsetup

    Element               Parameter
    Register clock-to-Q   tpcq_PC
    Multiplexer           tmux
    ALU                   tALU
    Memory read           tmem
    Register file read    tRFread
    Register file setup   tRFsetup

Single-Cycle Processor Performance Example

    Element               Parameter   Delay (ps)
    Register clock-to-Q   tpcq_PC     30
    Multiplexer           tmux        25
    ALU                   tALU        200
    Memory read           tmem        250
    Register file read    tRFread     150
    Register file setup   tRFsetup    20

    Tc = tpcq_PC + 2tmem + tRFread + 2tmux + tALU + tRFsetup
       = [30 + 2(250) + 150 + 2(25) + 200 + 20] ps
       = 950 ps

• fc = 1/Tc = 1/(950 ps) ≈ 1.05 GHz
• Assuming that the CPU executes 100 billion instructions to run your program, what is the execution time of the program on a single-cycle MIPS processor?

    Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)
                   = (100 × 10^9)(1)(950 × 10^-12 s)
                   = 95 seconds
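To make the critical-path formula easier to follow, each term can be matched to the datapath element that lw passes through. The annotated form below uses the same symbols as the slides and needs the amsmath package. The labels are one reading of the lw path, and the numeric line assumes that register file read dominates sign extension, since no value for tsext is given in the table.

```latex
\begin{align*}
T_c ={}& \underbrace{t_{pcq\_PC}}_{\text{PC clock-to-Q}}
       + \underbrace{t_{mem}}_{\text{instruction fetch}}
       + \underbrace{\max(t_{RFread},\, t_{sext})}_{\text{register read / sign extend}}
       + \underbrace{t_{mux}}_{\text{ALUSrc mux}} \\
      &+ \underbrace{t_{ALU}}_{\text{address add}}
       + \underbrace{t_{mem}}_{\text{data memory read}}
       + \underbrace{t_{mux}}_{\text{MemtoReg mux}}
       + \underbrace{t_{RFsetup}}_{\text{register write setup}} \\
    ={}& (30 + 250 + 150 + 25 + 200 + 250 + 25 + 20)\ \text{ps} = 950\ \text{ps}
\end{align*}
```

The two memory reads account for 500 ps of the 950 ps cycle, which is why memory is listed first among the limiting paths.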