Single-Cycle Processors: Datapath & Control
Arvind
Computer Science & Artificial Intelligence Lab
M.I.T.
Based on the material prepared by Arvind and Krste Asanovic
6.823 L5, September 26, 2005

Instruction Set Architecture (ISA) versus Implementation
• ISA is the hardware/software interface
  – Defines set of programmer visible state
  – Defines instruction format (bit encoding) and instruction semantics
  – Examples: MIPS, x86, IBM 360, JVM
• Many possible implementations of one ISA
  – 360 implementations: model 30 (c. 1964), z900 (c. 2001)
  – x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC
  – MIPS implementations: R2000, R4000, R10000, ...
  – JVM: HotSpot, PicoJava, ARM Jazelle, ...

Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
  – Instructions per program depends on source code, compiler technology, and ISA
  – Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
  – Time per cycle depends upon the microarchitecture and the base technology

  Microarchitecture           CPI   cycle time
  Microcoded                  >1    short
  Single-cycle unpipelined    1     long    (this lecture)
  Pipelined                   1     short

Microarchitecture: Implementation of an ISA
[Figure: a Controller and a Data path, connected by status lines and control points]
Structure: How components are connected (static)
Behavior: How data moves between components (dynamic)

Hardware Elements
• Combinational circuits
  – Mux, Demux, Decoder, ALU, ...
[Figure: Mux (inputs A0, A1, ..., An-1; lg(n)-bit Sel; output O), Demux (input A; lg(n)-bit Sel; outputs O0, O1, ..., On-1), Decoder (lg(n)-bit input A; outputs O0, O1, ..., On-1), ALU (inputs A, B and OpSelect; outputs Result and Comp?)]
  ALU operations selected by OpSelect:
  – Add, Sub, ...
  – And, Or, Xor, Not, ...
  – GT, LT, EQ, Zero, ...
• Synchronous state elements
  – Flipflop, Register, Register file, SRAM, DRAM
[Figure: a flip-flop (ff) with inputs D, En, Clk and output Q; a register built from flip-flops that share En and Clk, with inputs D0, D1, D2, ... and outputs Q0, Q1, Q2, ...]
Edge-triggered: Data is sampled at the rising edge
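The edge-triggered behavior of the register on the Hardware Elements slide can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions, not a hardware description: the class name `Register` and the method name `rising_edge` are made up for illustration, and each call to `rising_edge` stands for one rising clock edge.

```python
class Register:
    """Behavioral model of an n-bit edge-triggered register with enable.

    Hypothetical names for illustration; not taken from the slides.
    """

    def __init__(self, width=32):
        self.mask = (1 << width) - 1  # keeps stored values within the register width
        self.q = 0                    # output Q, held between clock edges

    def rising_edge(self, d, en):
        """Model one rising clock edge: D is sampled only when En is asserted."""
        if en:
            self.q = d & self.mask
        return self.q


reg = Register(width=8)
reg.rising_edge(d=0x12, en=True)    # D is captured, Q becomes 0x12
reg.rising_edge(d=0x34, en=False)   # En is low, Q keeps its old value
reg.rising_edge(d=0x1FF, en=True)   # the value is truncated to 8 bits
```

Between calls the output `q` does not change, which mirrors the point of the slide: state elements only change at the clock edge, while combinational circuits respond to their inputs at any time.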

Register Files
[Figure: a register file of 32 32-bit registers (register 0 ... register 31). The 5-bit read selects rs1 and rs2 choose the 32-bit outputs rd1 and rd2; the 5-bit write select ws, the 32-bit write data wd and the write enable we control writes; clk is the clock.]
Register file 2R+1W
  Inputs: Clock (clk), WE (we), ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd)
  Outputs: ReadData1 (rd1), ReadData2 (rd2)
• No timing issues in reading a selected register
• Register files with a large number of ports are difficult to design
  – Intel's Itanium GPR file has 128 registers with 8 read ports and 4 write ports!!!

A Simple Memory Model
[Figure: MAGIC RAM with inputs WriteEnable, Clock, Address, WriteData and output ReadData]
Reads and writes are always completed in one cycle
• a Read can be done any time (i.e. combinational)
• a Write is performed at the rising clock edge if it is enabled
  ⇒ the write address and data must be stable at the clock edge
Later in the course we will present a more realistic model of memory

Implementing MIPS: Single-cycle per instruction datapath & control logic

The MIPS ISA
Processor State
  32 32-bit GPRs, R0 always contains a 0
  32 single precision FPRs, may also be viewed as 16 double precision FPRs
  FP status register, used for FP compares & exceptions
  PC, the program counter
  some other special registers
Data types
  8-bit byte, 16-bit half word
  32-bit word for integers
  32-bit word for single precision floating point
  64-bit word for double precision floating point
Load/Store style instruction set
  data addressing modes: immediate & indexed
  branch addressing modes: PC relative & register indirect
  Byte addressable memory: big endian mode
All instructions are 32 bits

Instruction Execution
Execution of an instruction involves
  1. instruction fetch
  2. decode and register fetch
  3. ALU operation
  4. memory operation (optional)
  5. write back
and the computation of the address of the next instruction

Datapath: Reg-Reg ALU Instructions
[Figure: PC (clk) drives addr of the Inst. Memory; an Add unit computes PC + 0x4. inst<25:21> and inst<20:16> drive rs1 and rs2 of the GPRs, inst<15:11> drives ws. rd1 and rd2 feed the ALU, whose output z returns to wd. inst<5:0> feeds ALU Control (OpCode). RegWrite drives we.]
  bits:   31:26  25:21  20:16  15:11  10:6  5:0
  field:  0      rs     rt     rd     0     func
  width:  6      5      5      5      5     6
rd ← (rs) func (rt)
RegWrite Timing?

Datapath: Reg-Imm ALU Instructions
[Figure: as above, but inst<20:16> drives ws, inst<15:0> feeds Imm Ext (controlled by ExtSel) whose output is the second ALU operand, and inst<31:26> feeds ALU Control (OpCode).]
  bits:   31:26   25:21  20:16  15:0
  field:  opcode  rs     rt     immediate
  width:  6       5      5      16
rt ← (rs) op immediate

Conflicts in Merging Datapath
[Figure: the two datapaths above overlaid. ws is driven by both inst<20:16> (rt) and inst<15:11> (rd), the second ALU operand by both rd2 and Imm Ext, and ALU Control by both inst<31:26> and inst<5:0>.]
Introduce muxes
  0 | rs | rt | rd | 0 | func               rd ← (rs) func (rt)
  opcode | rs | rt | immediate              rt ← (rs) op immediate

Datapath for ALU Instructions
[Figure: the merged datapath. A mux controlled by RegDst (rt / rd) selects ws; a mux controlled by BSrc (Reg / Imm) selects the second ALU operand; <31:26> and <5:0> feed ALU Control (OpCode), which produces OpSel; ExtSel controls Imm Ext; RegWrite drives we.]
  0 | rs | rt | rd | 0 | func               rd ← (rs) func (rt)
  opcode | rs | rt | immediate              rt ← (rs) op immediate

Datapath for Memory Instructions
Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
  – read-only program memory
  – read/write data memory
  At some level the two memories have to be the same
Princeton style: the same (von Neumann's influence)
  – A Load or Store instruction requires accessing the memory more than once during its execution

Load/Store Instructions: Harvard Datapath
[Figure: the ALU datapath extended with a Data Memory (we, addr, wdata, rdata; clk). The ALU adds the "base" register and the extended displacement (disp) to form addr; wdata comes from the GPRs; a mux controlled by WBSrc (ALU / Mem) selects the register write-back data. Control signals: RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc.]
  bits:   31:26   25:21  20:16  15:0
  field:  opcode  rs     rt     displacement
  width:  6       5      5      16
addressing mode: (rs) + displacement
rs is the base register
rt is the destination of a Load or the source for a Store

MIPS Control Instructions
Conditional (on GPR) PC-relative branch
  opcode (6) | rs (5) | (5) | offset (16)        BEQZ, BNEZ
Unconditional register-indirect jumps
  opcode (6) | rs (5) | (5) | (16)               JR, JALR
Unconditional absolute jumps
  opcode (6) | target (26)                       J, JAL
• PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range
• Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
• jump-&-link stores PC+4 into the link register (R31)
• All Control Transfers are delayed by 1 instruction
  we will worry about the branch delay slot later

Conditional Branches (BEQZ, BNEZ)
[Figure: the load/store datapath with a second Add unit that computes the branch target br, a PC mux controlled by PCSrc (pc+4 / br), and the ALU's zero? output made available to the control logic. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc.]

Register-Indirect Jumps (JR)
[Figure: the conditional-branch datapath with a register-indirect input rind added to the PC mux. PCSrc choices: pc+4, br, rind. Other control signals: RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc; status: zero?.]
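The target-address rules on the MIPS Control Instructions slide can be made concrete with a small sketch. The function names (`sign_extend16`, `branch_target`, `jump_target`, `next_pc`) are hypothetical and chosen only for this illustration; the PCSrc labels pc+4, br, rind and jabs follow the labels used in the datapath figures and the control table. As on the slide, the branch delay slot is ignored here, and the absolute jump takes its upper bits from PC<31:28> as the slide states.

```python
MASK32 = 0xFFFFFFFF  # addresses are 32 bits wide


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """PC-relative branch: add offset*4 to PC+4 (the offset is in words)."""
    return (pc + 4 + 4 * sign_extend16(offset16)) & MASK32


def jump_target(pc, target26):
    """Absolute jump: append target*4 to PC<31:28>."""
    return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


def next_pc(pc_src, pc, offset16=0, target26=0, rs_value=0):
    """Model the PC mux: select the next PC according to PCSrc."""
    if pc_src == "pc+4":
        return (pc + 4) & MASK32
    if pc_src == "br":
        return branch_target(pc, offset16)
    if pc_src == "rind":
        return rs_value & MASK32          # register-indirect: the PC comes from (rs)
    if pc_src == "jabs":
        return jump_target(pc, target26)
    raise ValueError(f"unknown PCSrc: {pc_src}")


forward = next_pc("br", 0x1000, offset16=3)         # 0x1004 + 12
backward = next_pc("br", 0x1000, offset16=0xFFFF)   # offset of -1 word
absolute = next_pc("jabs", 0x10000040, target26=0x10)
indirect = next_pc("rind", 0x1000, rs_value=0x2000)
```

The largest forward offset, 0x7FFF words, moves the PC by 4 × 32767 bytes, which is where the roughly ±128 KB branch range on the slide comes from; the 26-bit word target likewise covers 2^28 bytes, the 256 MB jump range.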

Register-Indirect Jump-&-Link (JALR)
[Figure: the JR datapath with the constant 31 added as a possible write-register select. PCSrc choices: pc+4, br, rind. Other control signals: RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc; status: zero?.]

Absolute Jumps (J, JAL)
[Figure: the JALR datapath with an absolute-jump input jabs added to the PC mux. PCSrc choices: pc+4, br, rind, jabs.]

Harvard-Style Datapath for MIPS
[Figure: the complete datapath. PC mux (PCSrc: pc+4 / br / rind / jabs), Inst. Memory, GPRs, Imm Ext, ALU with zero? output, ALU Control, Data Memory. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc.]

Five-minute break to stretch your legs

Single-Cycle Hardwired Control: Harvard architecture
We will assume
• clock period is sufficiently long for all of the following steps to be "completed":
  1. instruction fetch
  2. decode and register fetch
  3. ALU operation
  4. data fetch if required
  5. register write-back setup time
  ⇒ tC > tIFetch + tRFetch + tALU + tDMem + tRWB
• At the rising edge of the following clock, the PC, the register file and the memory are updated

Hardwired Control is pure Combinational Logic
[Figure: a combinational logic block. Inputs: op code, zero?. Outputs: ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc.]

ALU Control & Immediate Extension
[Figure: Inst<31:26> (Opcode) goes through a Decode Map that produces ALUop and ExtSel; Inst<5:0> (Func), ALUop, + and 0? are the candidates for OpSel.]
  OpSel (Func, Op, +, 0?)
  ExtSel (sExt16, uExt16, High16)

Hardwired Control Table
  Opcode     ExtSel  BSrc  OpSel  MemW  RegW  WBSrc  RegDst  PCSrc
  ALU        *       Reg   Func   no    yes   ALU    rd      pc+4
  ALUi       sExt16  Imm   Op     no    yes   ALU    rt      pc+4
  ALUiu      uExt16  Imm   Op     no    yes   ALU    rt      pc+4
  LW         sExt16  Imm   +      no    yes   Mem    rt      pc+4
  SW         sExt16  Imm   +      yes   no    *      *       pc+4
  BEQZ z=0   sExt16  *     0?     no    no    *      *       br
  BEQZ z=1   sExt16  *     0?     no    no    *      *       pc+4
  J          *       *     *      no    no    *      *       jabs
  JAL        *       *     *      no    yes   PC     R31     jabs
  JR         *       *     *      no    no    *      *       rind
  JALR       *       *     *      no    yes   PC     R31     rind

  BSrc = Reg / Imm
  WBSrc = ALU / Mem / PC
  RegDst = rt / rd / R31
  PCSrc = pc+4 / br / rind / jabs

Pipelined MIPS
To pipeline MIPS:
• First build MIPS without pipelining with CPI=1
• Next, add pipeline registers to reduce cycle time while maintaining CPI=1

Pipelined Datapath
[Figure: the datapath (PC, Inst. Memory, IR, GPRs, Imm Ext, ALU, Data Memory) divided into five phases: fetch phase, decode & Reg-fetch phase, execute phase, memory phase, write-back phase.]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
  tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined

An Ideal Pipeline
[Figure: stage 1 → stage 2 → stage 3 → stage 4]
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?

How to divide the datapath into stages
Suppose memory is significantly slower than other stages.
In particular, suppose
  tIM  = 10 units
  tDM  = 10 units
  tALU = 5 units
  tRF  = 1 unit
  tRW  = 1 unit
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance

Alternative Pipelining
[Figure: the pipelined datapath with its fetch, decode & Reg-fetch, execute, memory and write-back phases.]
  tC > max {tIM, tRF, tALU, tDM, tRW} = tDM
Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase
  tC > max {tIM, tRF, tALU, tDM + tRW} = tDM + tRW
  ⇒ increase the critical path by 10%
With decode & Reg-fetch also combined with execute
  tC > max {tIM, tRF + tALU, tDM + tRW} = tDM + tRW

Maximum Speedup by Pipelining
  Assumptions                                  Unpipelined tC   Pipelined tC   Speedup
  1. tIM = tDM = 10, tALU = 5, tRF = tRW = 1
     4-stage pipeline                          27               10             2.7
  2. tIM = tDM = tALU = tRF = tRW = 5
     4-stage pipeline                          25               10             2.5
  3. tIM = tDM = tALU = tRF = tRW = 5
     5-stage pipeline                          25               5              5.0
It is possible to achieve higher speedup with more stages in the pipeline.

Thank you!
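The numbers in the Maximum Speedup by Pipelining table can be reproduced with a short calculation. This is a sketch under stated assumptions: the unpipelined clock period is taken as the sum of all step delays, the pipelined clock period as the delay of the slowest stage, pipeline-register overhead is ignored, and CPI stays at 1. The function names and the stage groupings are hypothetical. In particular, the slides do not say which steps are merged in the 4-stage rows; merging register fetch with the ALU is one grouping that gives the table's value of 10 for row 1, while merging write-back into the memory stage gives 11 instead.

```python
def cycle_times(delays, stages):
    """Return (unpipelined_tc, pipelined_tc) for a grouping of steps into stages.

    delays: step name -> delay in time units
    stages: list of stages, each a list of step names executed in that stage
    """
    unpipelined = sum(delays.values())                             # every step in one long cycle
    pipelined = max(sum(delays[s] for s in st) for st in stages)   # slowest stage sets the clock
    return unpipelined, pipelined


def speedup(delays, stages):
    """Ideal speedup of the pipelined design, assuming CPI = 1 in both designs."""
    unpipelined, pipelined = cycle_times(delays, stages)
    return unpipelined / pipelined


slow_memory = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}   # assumption 1 in the table
balanced = {"IM": 5, "RF": 5, "ALU": 5, "DM": 5, "RW": 5}        # assumptions 2 and 3

# Hypothetical groupings; the slides do not spell out the 4-stage split.
four_stage_rf_alu = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]
four_stage_dm_rw = [["IM"], ["RF"], ["ALU"], ["DM", "RW"]]
five_stage = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]

row1 = cycle_times(slow_memory, four_stage_rf_alu)   # (27, 10)
row2 = cycle_times(balanced, four_stage_dm_rw)       # (25, 10)
row3 = cycle_times(balanced, five_stage)             # (25, 5)
merged_wb = cycle_times(slow_memory, four_stage_dm_rw)  # (27, 11): the 10% longer critical path
```

Calling `speedup` on the same inputs gives 2.7, 2.5 and 5.0 for the three rows. The comparison between rows 2 and 3 shows the slide's closing point: with balanced step delays, splitting the work into more stages shortens the slowest stage and raises the ideal speedup.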