Part II: MIPS R4000 Pipeline with Branch Prediction

Assigned: June 9th
Due Date: July 19th

Introduction

In this part, we are going to implement the in-order MIPS R4000 pipeline along with a branch predictor (using a Branch Target Buffer and a Branch History Table) for the pipelined processor, which executes MIPS32 instructions as defined in Part I of the project. This is an individual project.

Description of pipeline

The pipeline has eight stages.

(1) Instruction fetch First-half (IF): First half of instruction fetch. IF fetches up to three instructions per cycle. IF fetches the first instruction from address 596. The PC is updated in the following ways:
a) If the PC hits an entry of the Branch Target Buffer (BTB), the PC is updated to the predicted PC.
b) Otherwise, the PC is increased by 4.
c) If a branch/jump is resolved as mispredicted (at its Execute stage), the PC is set to the correct value and the correct instruction is fetched in the next cycle.
NOTE: If the instruction is a branch and misses in the Branch Target Buffer, it is treated as predicted Not-Taken.
NOTE: You can stop fetching new instructions when the PC is greater than 700. (This should never happen if your program is correct.)

(2) Instruction fetch Second-half (IS): Second half of instruction fetch; completes the instruction cache access (assume the instruction cache always hits).

(3) Issue and Instruction Decode (ID): Get up to 3 instructions from the instruction queue per cycle and decode them. Issue is in-order. An instruction is issued if there is no structural hazard. (A structural hazard occurs, for example, when no Integer Unit is available for an ALU or branch instruction, or when no Address Unit is available for a Load/Store instruction.)
Jump/Branch Instructions:
1) If the PC misses the Branch Target Buffer in the IF stage and the low-order 12 bits of the PC address hit an entry of the Branch History Table in the ID stage, the target PC will be known at the end of the ID stage.
If it is predicted as Taken, the instruction at the target PC will be fetched in the next cycle, and the already fetched instructions need to be flushed. If it is predicted as Not-Taken, the instructions after the branch instruction have already been fetched.
2) If the instruction misses in the Branch History Table and is decoded as a branch instruction, put this instruction into the Branch History Table. The branch will be treated as Not-Taken.

(4) Execute (EX): This stage is in-order. It checks for RAW hazards: if one or more of the operands is not yet available, the instruction and its successors are stalled. When both operands are available, the operation is executed. Instructions may take multiple clock cycles in this stage. NOP and BREAK instructions bypass this stage.

Jump/Branch Instructions: A Jump/Branch instruction waits until both operands are ready and uses the ALU to calculate the branch outcome at this stage. If the branch instruction is correctly predicted, it updates the Branch Target Buffer and Branch History Table. If the instruction is mispredicted, the PC is set to the correct value to fetch the next instruction after the branch/jump. The Branch Target Buffer and Branch History Table need to be updated if necessary. You need to take care of which instructions need to be flushed. If the instruction missed in the Branch History Table in the ID stage, the prediction is set to 10 (weak taken) if the outcome is taken, or 01 (weak not taken) if it is not taken.

Load/Store executes as follows:

A Load instruction takes two steps to finish (before it goes to the Write Back stage).
1) The first step is address calculation (AC). When the necessary register is ready and there is a free Address Unit, the address is calculated. Otherwise, the instruction must wait. The AC step takes one cycle.
2) The second step is the actual memory access. A load can access memory if and only if there is no earlier store with the same address or with an unknown (uncalculated) address.
A load starts accessing memory at stage DF, and the data value is available at the end of DS. We assume that there is unlimited bandwidth for instructions to fetch data from memory, so there is no structural hazard for memory access.

A Store instruction also takes two steps to finish.
1) The first step is address calculation (AC). When the necessary register is ready and there is a free Address Unit, the address is calculated. Otherwise, the instruction must wait. This step takes one cycle.
2) The second step is the actual memory access. A store starts accessing memory at stage DF when (a) the address is ready; (b) the data to write is ready; and (c) there is no previous store/load with the same address or with an unknown (uncalculated) address.

Notice: Multiple load/store instructions can finish the Execute stage at the same time. Multiple load/store instructions can access memory at the same time as long as they don’t have memory conflicts (see the next stage).

(5) Data fetch First half (DF): First half of the data cache access. Load/Store instructions start to access memory at this stage. Assume the data cache always hits. Memory conflicts must be detected at this stage.
Memory conflicts:
If two Loads need to access the same memory address, they can access the data at the same time.
If a Store is followed by a Load with the same memory address, the Load must wait until the Store finishes its DS stage.
If a Store is followed by a Store with the same memory address, the second Store must wait until the first Store finishes its DS stage.
If a Load is followed by a Store with the same memory address, the Store must be stalled until the Load finishes its memory access (the end of DS).

(6) Data fetch Second half (DS): Second half of the data cache access. Load/Store instructions finish the memory access at the end of this stage. For a load, the data value is available at the end of this stage. For a store, memory is updated at the end of this stage.

(7) Tag Check (TC): Determine whether the data cache access hit.
Since we assume that cache accesses always hit, this stage takes only 1 cycle and does not change the results.

(8) Write Back (WB): Loads and register-register operations are written back in order. Up to three instructions can be written back at this stage.

Forwarding or bypassing: Assume there is a pipeline register between every two adjacent stages; forwarding can be done from a pipeline register to the function unit that requires the value. If the forwarding hardware detects that a previous instruction has written the register corresponding to a source of the current instruction, the result is forwarded as the source of the current instruction. (See textbook Section A2.) There are enough pipeline registers between every two adjacent stages to forward data. The main forwarding source registers are the EX/DF and DS/TC registers.

Stall: If an instruction is stalled at some stage, the instructions following it are also stalled, because this is an in-order CPU. There is an instruction buffer between every two adjacent stages to hold the instructions while the pipeline is stalled.

Notice: All eight stages are in-order and each stage takes 1 clock cycle to finish.

Pipeline Units

The pipeline has the following function units.

(1) Address Unit: It calculates one effective address per cycle for Load/Store instructions at the Execute stage. There are 2 Address Units.

(2) Integer Unit: There are two Integer Units. All ALU, branch, etc. instructions need one cycle at an Integer Unit.

(3) Register File: There are 32 integer registers. We assume the register file has unlimited read/write ports, so there is no structural hazard for register reads/writes.

(4) Branch Target Buffer (BTB): See the next part.

(5) Branch History Table: See the next part.

(6) Main Memory: We assume there are sufficient read/write ports to Main Memory, so instruction fetch, data read, and data write can happen in the same cycle.
Instruction fetch takes 1 cycle to finish; a data read/write takes 2 cycles to finish (1 cycle at DF and 1 cycle at DS). For a data write, main memory is updated at the end of DS.

(7) Pipeline Register: There are enough pipeline registers between every two adjacent stages to forward data. The main forwarding source registers are the EX/DF and DS/TC registers.

Branch Predictor

We use a Branch Target Buffer (BTB) and a Branch History Table (BHT) in our project. The differences between the BTB and the BHT are:
1) The BTB is much smaller than the BHT.
2) The BTB is indexed by the instruction PC; the BHT is indexed by the low-order 12 bits of the instruction PC.
3) The BTB only records instructions predicted as taken. The BHT records all branch instructions.
4) The BTB records the target PC address; the BHT only records the predicted outcome (taken or not taken).

If the PC hits in the BTB in the IF stage, the target PC is known and the instructions at the target PC can be fetched as soon as possible. If a branch instruction hits only in the BHT, the target PC is not known until the end of the ID stage.

There are 8 entries in the BTB. The BTB is organized as fully-associative with a FIFO replacement policy, based on the PC address. Each BTB entry records the branch's PC address, which serves as the index, and the target PC.

There are 4096 entries in the BHT. The BHT is organized as fully-associative with a FIFO replacement policy. Each entry is indexed by the low-order 12 bits of the instruction PC (assume this index is large enough that no two instructions have the same low-order 12 bits of PC). The value in the entry is the predicted outcome of this instruction: 00 (strong not taken), 01 (weak not taken), 10 (weak taken), or 11 (strong taken). We use a 2-bit predictor in each entry to determine the predicted outcome. (Refer to PPT (Chapter 2 Part 1) #27.)

A Branch/Jump instruction is recorded into both the BTB and the BHT at the end of the EX stage if it is taken. If it is not taken, the Branch/Jump instruction is recorded only into the BHT.
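The two predictor structures described above can be sketched as follows. This is only an illustrative sketch, not the required implementation: the class and method names are hypothetical, and the 2-bit predictor is assumed to be a plain saturating counter (the handout defers to the course slides for the exact state machine, so check it against them). The first-entry states follow the handout's rule: the weak state matching the outcome for branches, and 11 for jumps. Because the handout assumes no two instructions share the same low-order 12 bits, the BHT here is a dictionary keyed by that index and never needs to evict.

```python
from collections import OrderedDict

BTB_ENTRIES = 8          # BTB size given in the handout
BHT_INDEX_MASK = 0xFFF   # low-order 12 bits of the PC

# 2-bit predictor states, encoded as in the handout.
STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = 0b00, 0b01, 0b10, 0b11


class BranchTargetBuffer:
    """Fully associative map from branch PC to target PC, FIFO replacement."""

    def __init__(self, capacity=BTB_ENTRIES):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as FIFO order

    def lookup(self, pc):
        """Return the predicted target PC, or None on a BTB miss."""
        return self.entries.get(pc)

    def insert(self, pc, target):
        if pc in self.entries:
            self.entries[pc] = target  # refresh target; FIFO position unchanged
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[pc] = target

    def remove(self, pc):
        self.entries.pop(pc, None)


class BranchHistoryTable:
    """2-bit predictor per entry, indexed by the low-order 12 bits of the PC."""

    def __init__(self):
        self.counters = {}

    def predict_taken(self, pc):
        """Return True/False on a hit, or None on a BHT miss."""
        state = self.counters.get(pc & BHT_INDEX_MASK)
        if state is None:
            return None
        return state >= WEAK_TAKEN

    def update(self, pc, taken, is_jump=False):
        """Record a resolved outcome (ASSUMPTION: saturating-counter update)."""
        index = pc & BHT_INDEX_MASK
        state = self.counters.get(index)
        if state is None:
            # First entry: jumps start at 11, branches at the matching weak state.
            if is_jump:
                self.counters[index] = STRONG_TAKEN
            else:
                self.counters[index] = WEAK_TAKEN if taken else WEAK_NOT_TAKEN
        elif taken:
            self.counters[index] = min(state + 1, STRONG_TAKEN)
        else:
            self.counters[index] = max(state - 1, STRONG_NOT_TAKEN)
```

In a simulator built along these lines, IF would call `lookup` on the fetch PC, ID would call `predict_taken` on a BTB miss, and EX would call `update` plus `insert` or `remove` once the outcome is resolved.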
The predicted outcome in the BHT and the predicted PC in the BTB are updated each time the instruction is resolved (at the Execute stage). If a branch previously predicted as taken is updated to not taken, the branch needs to be deleted from the BTB. If a branch previously predicted as not taken is updated to taken, the branch needs to be added to the BTB. For branch instructions, the prediction is set to 10 (weak taken) if the outcome is taken, or 01 (weak not taken) if it is not taken, when the branch first enters the BTB, while Jump instructions are set to Taken (11) when they first enter the BTB.

Assumptions

As in Part I, assume the program starts at memory location 596 (decimal). The PC is initialized to this location for fetching the first instruction from memory.
The data section begins at address “700”. Following that is a sequence of 32-bit 2’s complement signed integers for the program data, up to the end of the file. The instruction section won’t exceed “700”.
The whole simulation ends when a “BREAK” instruction is written back.
Assume the effective address is the same as the physical memory address.
All eight stages of the pipeline are in-order.
Proper pipeline registers must be used to latch intermediate results between pipeline stages.
In your simulator, there is no branch delay slot.

Guidelines

Your output should match the following sample output formats. Your simulator should simulate the actual execution and produce the correct results for the given program. A program is considered "complete" once the BREAK instruction leaves the WB stage.

Command Line

Your simulator (MIPSsim) should provide the following options to users; the dis/sim option is omitted for this part.

MIPSsim inputfilename outputfilename [-Tm:n]

Inputfilename - The file name of the binary input file.
Outputfilename - The file name for printing the output.
-Tm:n - Optional argument to specify the start (m) and end (n) cycles of simulation output trace.
-T0:0 indicates that no tracing is to be performed; omitting the argument specifies that every cycle (complete execution) is to be traced.
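The command-line rules above could be handled along the following lines. This is a hedged sketch rather than a reference implementation: the function names `parse_args` and `should_trace` are hypothetical, and the trace range is assumed to be inclusive at both ends, which the handout does not state explicitly.

```python
import re


def parse_args(argv):
    """Parse [inputfilename, outputfilename, optional -Tm:n] (program name excluded).

    Returns (input_name, output_name, trace), where trace is None when the
    option is absent (trace every cycle) or a (start, end) tuple otherwise.
    """
    if len(argv) not in (2, 3):
        raise ValueError("usage: MIPSsim inputfilename outputfilename [-Tm:n]")
    input_name, output_name = argv[0], argv[1]
    trace = None
    if len(argv) == 3:
        match = re.fullmatch(r"-T(\d+):(\d+)", argv[2])
        if match is None:
            raise ValueError("trace option must look like -Tm:n")
        trace = (int(match.group(1)), int(match.group(2)))
    return input_name, output_name, trace


def should_trace(cycle, trace):
    """Decide whether the state of the given cycle should be printed."""
    if trace is None:          # no -T argument: trace the complete execution
        return True
    if trace == (0, 0):        # -T0:0: no tracing at all
        return False
    start, end = trace
    return start <= cycle <= end  # ASSUMPTION: both bounds are inclusive
```

A simulator's main loop would typically call `parse_args(sys.argv[1:])` once at startup and `should_trace(cycle, trace)` at the end of every simulated cycle.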