TDC 311 Microarchitecture
Dr. C.M.White

All CPUs execute the machine code for which the CPU is designed. IBM mainframes execute IBM 370 machine code; Motorola 680x0 microprocessors execute 680x0 machine code; etc. The CPU will fetch a machine code instruction, decode the opcode and operands, and execute the instruction.

CPUs that are not microprogrammed perform the above fetch, decode and execute in hardware (digital logic). CPUs that are microprogrammed fetch, decode and execute the machine code instructions by executing microcode. Each machine code instruction causes a different section of microcode to execute. The execution of the appropriate microcode sends out the control signals to open and close gates, registers, ALU, etc.

For example, the COBOL statement:

    Add 1 to Total.

might be compiled into the following machine language statements:

    Load   Reg1,Total
    Add    Reg1,+1
    Store  Reg1,Total

Then, the machine language statement Load Reg1,Total has to be executed by the microarchitecture.

Example Microarchitecture Data Path

Examine Figure 4-1 Page 205 closely. Note how the C bus flows into the registers. Note how the registers flow into the B bus and the H register flows into the A bus. There is a switch (the little arrows) at each point between bus and register or register and bus.

The following are the internal 32-bit registers:

MAR - memory address register
MDR - memory data register
MBR - memory buffer register
PC - program counter (address of NEXT instruction)

(The following registers might be discussed later: SP, LV, CPP, TOS, OPC, H)

ALU - arithmetic logic unit - can add, AND, and complement. The function the ALU performs is based on 6 inputs: see Figure 4-2 page 206. (Typo in Figure 4-2: the second occurrences of A and B should be ~A and ~B.)

Shifter - can shift left logical one byte, right arithmetic one bit, or no shift.
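The three shifter modes just described can be sketched in Python as a rough illustration. The function name shift and the mode names "sll8", "sra1" and "none" are assumptions made for this sketch, not names given in these notes. Values are treated as 32-bit patterns.

```python
MASK32 = 0xFFFFFFFF      # keep every result within 32 bits
SIGN_BIT = 0x80000000    # bit 31 of a 32-bit word


def shift(value, mode="none"):
    """Apply one of the three shifter modes to a 32-bit pattern.

    The mode names are hypothetical labels for this sketch:
      "sll8" - shift left logical by one byte (8 bits), zero fill on the right
      "sra1" - shift right arithmetic by one bit, the sign bit is kept
      "none" - pass the value through unchanged
    """
    value &= MASK32
    if mode == "sll8":
        # Bits pushed past bit 31 are discarded by the mask.
        return (value << 8) & MASK32
    if mode == "sra1":
        # Copy the old sign bit back into bit 31 after the shift.
        return (value >> 1) | (value & SIGN_BIT)
    if mode == "none":
        return value
    raise ValueError(f"unknown shifter mode: {mode}")
```

For instance, shift(0x80000000, "sra1") keeps the sign bit set, while shift(0xFF000001, "sll8") loses the top byte off the left end of the word.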
All operations require precise timing of the signals on the buses and values entering and leaving the registers. See Figure 4-3 page 208.

Reading and writing main memory can be performed two ways:

1. 32-bit words: MAR has the 32-bit address of a word in storage, and data is read into or written from the MDR. Thus, if the address in MAR is 4, it really means access word number 4.
2. 8-bit byte: the PC has a 32-bit address of a byte in storage, and the byte at that address is read into the low 8 bits of the MBR register. If the address in PC is 4, it means access byte number 4.

When you load the MBR with the byte, you can do a logical load (leading 24 bits are 0), or an arithmetic load (leading 24 bits receive the proper sign bit, i.e. sign extension).

The Microinstructions

The control unit needs to send out a total of 29 signals:

9 for registers to B bus
9 for C bus to registers
6 for ALU control
2 for shifter control
2 for Read/Write via MAR/MDR
1 to indicate memory fetch via PC/MBR

We can create a microinstruction whose fields are shown in Figure 4-5 page 212. The complete block diagram for the example microarchitecture is shown in Figure 4-6 Page 214.

Note how bits from the MIR leave the MIR as on or off signals (1 or 0) which control the operation of gates, shifting, adding, register selection, and address creation.

Control store - contains 512 36-bit microinstructions. Note: When executing a microprogram, you do not simply execute the next instruction. Each microinstruction tells you which microinstruction to execute next.

MIR: the instruction register for the control store
MPC: the program counter for the control store

During subcycle 1, MIR is loaded from the address currently held in MPC.
During subcycle 2, the signals from MIR propagate out and the B bus is loaded from the selected register.
During subcycle 3, the ALU and shifter operate and produce a stable result.
During subcycle 4, the C bus, memory buses, and ALU values become stable. The MBR and MDR get their results from the memory operation started at the end of the previous data path cycle. The MPC is loaded in preparation for the next microinstruction.

A Simple Example

Increment the Program Counter by 1 (PC = PC + 1). What are the events that will cause this to happen?

1. Gate PC onto the B bus.
2. Perform the B+1 function in the ALU.
3. Gate the C bus into PC.

Try another one:

MAR = MBRU + H; rd
SP = MBR = SP + 1
MDR = SP + H

Design of the Microarchitecture Level

1. Speed vs. cost
   Reduce the number of clock cycles needed to execute an instruction.
   Simplify the organization so that the clock cycle can be shorter.
   Overlap the execution of instructions.
2. Reduce the execution path length (the number of clock cycles needed to execute a set of operations). Say a machine code instruction requires 5 microinstructions. Is there any way you can cut that down to 4 microinstructions by performing 2 operations at the same time?
3. Add another internal bus (See Figure 4-29 Page 252). If you extend the A bus such that all registers can lead into either the A bus or the B bus, you can simplify some operations.
4. Create an independent unit that fetches and processes the instructions.
5. Prefetch the instruction.
6. Perform pipelining (See Figure 4-34 page 259). Unfortunately, pipelining is disrupted when the program takes a branch, because the instructions already in the pipeline may be the wrong ones.

Improving Performance

1. Cache memory

Memory references tend to cluster near one another (the locality principle): the program's instructions sit in one region, and its data often in another. Bring the most recently referenced values into a high-speed cache. How does the CPU know whether something is in the cache or not?

Direct-mapped cache

Consider a cache which has 2048 entries, each entry holding 32 bytes (not bits!) of data. 2048 entries times 32 bytes per entry equals 64 KB. The Valid bit tells whether there is valid data in the cache line.
Each cache entry has the following layout (the figure also lists the addresses that use each entry):

V (valid bit) | Tag (16 bits) | Data (32 bytes)

When a program generates a 32-bit address, it has the following form:

Tag - 16 bits
Line - 11 bits
Word - 3 bits
Byte - 2 bits

To see if the data item is in cache, take the 11-bit LINE portion, which points to one of the 2048 cache entries. The 16-bit TAG of the address is compared to the 16-bit Tag value in the cache entry. If there is a match, the data is there. The 3-bit WORD portion of the address tells you which word from the 8 words (32 bytes) in the cache line should be fetched. The 2-bit BYTE address may tell you which one of the four bytes to fetch.

Note: Since the cache holds 64 KB, it holds data for addresses 0 - 65535. But it may also hold data for the addresses 65536 - 131071, and so on. That is why you must compare the TAG fields to see if there is a match. If there is no match, then there is a cache miss and the CPU must go to main memory and fetch the data, then store it in the cache entry, thus wiping out the old value.

For example, the CPU wants to fetch data at loc 36 (decimal), which is 00000024 in hexadecimal:

tag:  0000 0000 0000 0000
line: 0000 0000 001
word: 001
byte: 00

2. Branch prediction and speculative execution

The processor tries to predict which way a branch statement might go and then loads the machine instructions based on that prediction.

Dynamic branch prediction - Create a history table which lists the branches that have been taken and whether they branched back or not. Overhead!

Static branch prediction - You know a loop that loops 1000 times will branch back 999 times, so go ahead and load the instructions as if the branch back was going to be taken. Simpler, but fails 1 out of 1000 times.

3. Out-of-order execution

Sometimes you can rearrange the order of instructions and not make any difference in the final program outcome.
For example, by moving the write operation up one statement, you can start it sooner (because I/O operations always take longer than other instructions).

    Add two register contents and store in a register
    Increment a counter by 1
    Start a write operation

changed to:

    Add two register contents and store in a register
    Start a write operation
    Increment a counter by 1

4. Register renaming

Keep track of when a variable is "alive". When it is no longer alive, reuse the register it was in.

    Counter: integer;
    Counter := 0;               (Put the Counter value into some register)
    Read a value;
    Counter := Counter + 1;
    Sum := Sum + value;
    Loop back
    Print Counter;              (Counter not used after this point, so reuse this register)
    NewValue := Value;
    Read a name;
    :
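The idea of tracking when a variable stops being alive can be sketched in Python as a simple last-use scan over the example above. This is only an illustration under stated assumptions: the encoding of each statement as a set of variable names is hypothetical, "value" and "Value" are treated as the same variable, and the scan ignores branches. It gives a sensible answer here only because the last mention of Counter comes after the loop.

```python
# Hypothetical encoding of the example: (statement text, variables it mentions).
PROGRAM = [
    ("Counter := 0", {"Counter"}),
    ("Read a value", {"Value"}),
    ("Counter := Counter + 1", {"Counter"}),
    ("Sum := Sum + Value", {"Sum", "Value"}),
    ("Loop back", set()),
    ("Print Counter", {"Counter"}),
    ("NewValue := Value", {"NewValue", "Value"}),
    ("Read a name", {"Name"}),
]


def last_uses(statements):
    """Return {variable: index of the last statement that mentions it}."""
    last = {}
    for index, (_, names) in enumerate(statements):
        for name in names:
            last[name] = index  # later mentions overwrite earlier ones
    return last


def freed_after(statements):
    """Return {statement index: sorted variables whose register is free afterwards}."""
    freed = {}
    for name, index in last_uses(statements).items():
        freed.setdefault(index, []).append(name)
    return {index: sorted(names) for index, names in freed.items()}


if __name__ == "__main__":
    for index, names in sorted(freed_after(PROGRAM).items()):
        text = PROGRAM[index][0]
        print(f"after '{text}': reuse register(s) of {', '.join(names)}")
```

In this sketch the register holding Counter becomes reusable after "Print Counter", which matches the annotation in the example. A real processor or compiler would need proper control-flow analysis rather than a straight-line scan.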