Real Processor Architectures
• Now that we’ve seen the basic design elements for modern processors, we will take a look at several specific processors
  – We start with the 486 pipeline to see how NOT to do a pipeline
    • recall that the Intel x86 is a CISC with variable-length instructions, memory-register addressing, some complex addressing modes and some complex instructions
    • we will compare it to the much more efficient MIPS pipeline
  – We then consider dynamic-issue superscalars of varying degrees of sophistication
  – To understand the Pentium architecture, we must look at how Intel avoided the pitfalls of the 486 by issuing microcode rather than machine instructions, which requires that we also look at microprogrammed control units and microcode

486 Processor
• The instruction set was almost identical to the 386 (and thus was still CISC based)
  – Intel added a floating point functional unit to the processor so that it could execute the floating point operations introduced for the x86 math coprocessor
  – This FP functional unit provided a degree of parallel processing: while FP operations were executed, the pipeline would continue to fetch and execute integer operations
  – It contained an 8KB combined instruction/data cache (later expanded to 16KB)
• The big difference between the 386 and 486, though, was the pipeline: the 486 was the first Intel processor with a pipeline
  – However, because of the CISC nature of the instruction set, the pipeline is not particularly efficient

The 486 Pipeline
• The 486 used a 5 stage pipeline
  – Fetch 16 bytes worth of instruction
    • this may be 1 instruction (or even a part of 1 instruction), or multiple instructions
  – Decode stage 1 – was an entire instruction fetched?
    If not, this stage stalls
    • divide up the 16 bytes into instruction(s)
  – Decode stage 2 – decode the next instruction, fetch operands from registers
  – Execution – ALU operations, branch operations, cache (mov) instructions
    • this stage may take multiple cycles if an ALU operation requires 1 or more memory accesses (e.g., add x, 5 takes 2 memory accesses)
  – Write result of load or ALU operation to register

486 Difficulties
• Stalls arise for numerous reasons
  – 17 byte long instructions require 2 instruction fetch stages
  – Any memory-register or memory-immediate ALU operation takes at least 1 additional cycle, possibly 2 if the memory item is both a source and the destination
    • such a situation stalls instructions in the decode 2 stage
    • or in the EX stage if the result is written back to memory
  – Complex addressing modes can cause stalls
    • pointer accessing (indirect addressing) is available, which takes 2 memory accesses
    • scaled addressing mode can involve both an add and a shift
    • again, stalls occur in the decode 2 stage
  – Branch instructions have a 3 cycle penalty because branches are computed in the EX stage (4th stage), and some loop operations take more than 1 cycle to compute, adding a further stall

486 Examples
• The first example has three data movements with no penalties
• The second example has a data hazard requiring 1 cycle of stall
• The third example illustrates a branch penalty

486 Overall Architecture (figure)

ARM Cortex A-8 Processor
• Dual-issue superscalar with static scheduling but dynamic issue detection through a scoreboard
  – Up to 2 instructions per cycle
• 14-stage pipeline (see next slide)
  – Branch prediction is performed in the AGU (address generation unit) using:
    • dynamic branch prediction with a 512-entry two-way set associative branch target buffer
    • a 4K global history buffer – when the branch target buffer misses, a prediction is obtained from the global history buffer
    • an 8-entry return stack
  – An incorrect branch prediction flushes the entire pipeline
  – Instruction decode is 5 stages long and up to 2 instructions are decoded per cycle
    • 8 bytes fetched from cache
    • if neither instruction is a branch, PC is incremented
    • stage 4 in this 5 stage mini-pipeline is the scoreboard and issue logic

A-8 Pipeline (figure)

A-8 Execution Unit
• Either instruction can go to the load/store pipeline, but not both
• ALU pipe 1 is for simple integer operations
• Multiplies use ALU pipe 0 and can accommodate up to 2 in one cycle
• Structural hazards are rare because the compiler attempts to schedule pairs of instructions so that they do not use the same instruction pipe at the same time
• Data hazards are detected during decode by the scoreboard and may stall either both instructions or just the second of the pair; the compiler is responsible for attempting to prevent such stalls (note that forwarding is only available from WB (E5) to E0)

A-8 Performance
• The ideal CPI for the A-8 is 0.5 (2 instructions issued per cycle)
• The measured results show that the ideal is not achieved in practice and that, aside from the mcf and gzip benchmarks, the greatest source of stalls is the pipeline itself stalling (not cache misses)

Pentium Architecture
• Recall our examination of the Intel 486 pipeline
  – variable length of instructions, variable complexity of operations, memory-register ALU operations, etc. led to poor performance
• In order to improve performance using RISC features, the Pentium architects had to rethink things
  – they were stuck with their CISC instruction set (for backward compatibility)
  – in CISC architectures, a machine instruction is first translated into a sequence of microinstructions
  – each microinstruction is a lengthy string of 1s and 0s, each bit of which refers to one control signal in the machine
  – there needs to be a process to translate each machine instruction into microinstructions and execute each microinstruction
  – this is done by collecting machine instructions and their associated microinstructions into microprograms

Why Microinstructions?
• The Pentium architecture uses a microprogrammed control unit
  – there is already a necessary step of decoding a machine instruction into microcode
• Now, consider each microinstruction:
  – equal length
  – executes in the same amount of time (unless hazards arise)
  – branches are at the microinstruction level and are more predictable than machine language level branching
• In a RISC architecture, the simplicity of each instruction allows it to be carried out directly in hardware in 1 cycle (usually)
  – Intel realized that to efficiently pipeline their CISC architecture, they had to pipeline the microinstructions instead of the machine instructions

Control and Micro-Operations
• An example architecture is shown in the accompanying figure (this is not an x86 architecture!)
• Each of the various connections is controlled by a particular control signal
  – MBR to the AC is controlled by signal C11
  – PC to MAR by C2
  – AC to ALU by C7
    • note that this figure is incomplete
• A microprogram is a sequence of micro-operations

Example
• Consider a CISC instruction such as Add R1, X
  – X is copied into the MAR and a memory read is signaled
  – the datum is returned across the data bus to the MBR
  – the adder is sent the values in R1 and the MBR, adds the two, and stores the result back into R1
• This sequence can be written in terms of micro-operations as:
    t1: MAR ← (IR(address))
    t2: MBR ← Memory
    t3: R1 ← (R1) + (MBR)
    t3: Acc ← (R1) + (MBR)
    t4: R1 ← (Acc)
  – t1 – t5 are clock cycles; each microinstruction executes in a separate clock cycle
• Each micro-operation is handled by one or more control signals
  – For instance, MBR ← Memory is C5

Control Memory
• Each microprogram consists of one or more microinstructions, each stored in a separate entry of the control memory
• The control memory itself is firmware, a program stored in ROM, that is placed inside of the control unit
• Layout of the control memory (from the figure):
    Fetch cycle routine:      ... Jump to Indirect or Execute
    Indirect cycle routine:   ... Jump to Execute
    Interrupt cycle routine:  ... Jump to Fetch
    Execute cycle begin:      Jump to Op code routine
    AND routine:              ... Jump to Fetch or Interrupt
    ADD routine:              ... Jump to Fetch or Interrupt
• Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program

Example of Three Micro-Programs
• Fetch:
    t1: MAR ← (PC)                        C2
    t2: MBR ← Memory                      C0, C5, CR
        PC ← (PC) + 1                     C*
    t3: IR ← (MBR)                        C4
• Indirect:
    t1: MAR ← (IR(address))               C8
    t2: MBR ← Memory                      C0, C5, CR
    t3: IR(address) ← (MBR(address))      C4
• Interrupt:
    t1: MBR ← (PC)                        C1
    t2: MAR ← save address                C*
        PC ← routine address              C*
    t3: Memory ← (MBR)                    C12, CW
  – CR – read control to system bus
  – CW – write control to system bus
• C0 – C12 refer to the previous figure
• C* are signals not shown in the figure

Horizontal vs. Vertical Micro-Instructions
• Horizontal micro-instruction fields: Internal CPU Control Signals, System Bus Control Signals, Jump Condition, Micro-instruction Address
  – Horizontal micro-instructions contain 1 bit for every control signal controlled by the control unit
  – Because it requires 1 bit for every control line, a horizontal micro-instruction is longer than a vertical micro-instruction and therefore takes more space to store, but it does not require additional time to decode by the control unit
• Vertical micro-instruction fields: Function Codes, Jump Condition, Micro-instruction Address
  – Vertical micro-instructions use function codes that need additional decoding
• In both formats, the micro-instruction address points to a branch target in the control memory; the branch is taken if the condition bit is true

Micro-programmed Control Unit (Continued)
• Decoder analyzes IR
  – delivers the starting address of the op code’s micro-program in the control store
    • the address is placed into a micro-program counter (here, called a Control Address Register)
• Loop on the following
  – sequencer signals a read of control memory using the address in the microPC
  – item in control memory is moved to the control buffer register
  – contents of the control buffer register generate control signals and next address information
    • if the micro-instructions are vertical, decoding is required here
  – sequencer moves next address to control address
register
    • next instruction (add 1 to current)
    • jump to new part of this microprogram
    • jump to new machine routine

Pentium IV: RISC features
• All RISC features are implemented at the microinstruction level instead of at the machine instruction level as seen in typical RISC processors
  – Microinstruction-level pipeline
  – Dynamically scheduled micro-operations
  – Reservation stations (128) and multiple functional units (7)
  – Branch speculation via branch target buffer
    • speculation is at the micro-instruction level, not the machine level
    • instead of an ROB, decisions are made at the reservation stations: a misspeculation causes reservation stations to flush their contents, while a correct speculation causes reservation stations to forward results to registers/store units
  – Trace cache used (discussed shortly)

Pentium Pipeline
• Fetch machine instruction (3 stages)
• Decode machine instruction into microinstructions (2 stages)
• Superscalar issue of multiple microinstructions (2 stages)
  – register renaming occurs here; up to 3 microinstructions can be issued per cycle
  – 2 integer and 1 FP
• Execution of microinstructions (1 stage)
  – Functional units are pipelined and can take from 1 up to approximately 32 cycles to execute
• Write back (3 stages)
• Commit (3 stages)
  – up to 3 microinstructions can commit in any cycle

Pentium IV Overall Architecture (figure)

Specifications
• 7 functional units:
  – 2 simple ALUs (add, compare, shift) – ½ cycle execution to accommodate up to 2 micro-operations per cycle
  – 1 complex ALU (integer multiplication and division) – multicycle, pipelined
  – 1 load unit and 1 store unit – including address computation
  – 1 FP move (register to register move and convert)
  – 1 FP arithmetic unit (+, -, *, /) – multicycle, pipelined, some SIMD execution permitted on these units
• 128 registers for renaming
  – reservation stations are used rather than a re-order buffer
  – instructions must wait in reservation stations longer than in Tomasulo’s version, waiting for speculation
results

Trace Cache
• The trace cache is an instruction cache
  – It caches not just individual instructions or even memory refill lines; it caches blocks of instructions that have recently been executed together
• In this way, the trace cache encodes branch behavior implicitly
• Additionally, misspeculated instructions would be discarded from a trace cache
• The trace cache was developed for the Pentium IV, so it stores microinstructions (not machine instructions)
• Combining a trace cache and a branch target buffer minimizes microinstruction fetch and decoding
  – As long as the microinstructions remain in the trace cache
  – Mispredictions at the microinstruction level are far rarer than mispredictions at the machine level

Source of Stalls
• This architecture is very complex and relies on being able to fetch and decode instructions quickly
  – The process breaks down when
    • fewer than 3 instructions can be fetched in 1 cycle
    • the trace cache misses, or branches are mispredicted
    • fewer than 3 instructions can be issued, either because they are not 2 int + 1 FP or because of structural hazards
    • reservation stations run out
    • data dependencies between functional units cause stalls because other instructions have to wait at their reservation stations
    • a data cache access results in a miss

Source of Stalls (Continued)
• Stalls manifest themselves in two places
  – The issue stage
    • branch mispredictions
    • cache misses
    • reservation stations full
  – The commit stage
    • branch mispredictions
    • instructions not ready to commit yet
      these are not actually stalls, but because instructions are committed in the order they were issued, a later instruction may wait to commit because earlier instructions are time consuming; if the later instruction is a branch, improperly fetched instructions from the misspeculation may continue to be fetched in the meantime
    • branch computation not yet available

Source of Stalls (Continued)
• Misprediction rates (at the micro-operation level) are very low
  – About 0.8% for
integer benchmarks and 0.1% for floating point benchmarks
    • notice how FP benchmarks continue to have high predictability because they involve a lot of for loops, which are very predictable; integer benchmarks tend to have more conditional statements, which are less predictable
  – At the machine language level, misspeculation is between 0.1% and 1.5%
• Trace cache has nearly a 0% miss rate
  – The L1 and L2 data caches have miss rates of around 6% and 0.5% respectively
  – The machine’s effective CPI ranges from around 1.2 to 5.85 with an average of around 2.2 (machine instructions, not micro-operations)

Earlier Pentiums
• Pipeline changes:
  – Pentium pipeline: 2-issue superscalar, 5 stages
  – Pentium Pro pipeline: 14 stages
  – Pentium III pipeline: 14 stages (shown earlier in these slides)
  – Pentium IV pipeline: 21 stages (minimum), eventually lengthened to 31 (minimum)
• Bus widened to support 64 GB of memory
• Conditional instructions introduced (we will cover this next week)
• Faster clock rates introduced
  – From 1 GHz to 1.5 GHz, eventually up to 3.2 GHz
    • the clock rate is so fast that it takes 2 complete cycles for an instruction or datum to cross the chip
• Increased reservation stations
  – PIII: 40, PIV: 128
    • up to 128 instructions can be in some state of execution simultaneously

Pentium IV versus AMD Opteron
• The Opteron uses dynamic scheduling, speculation, a shallower pipeline, issue and commit of up to 3 instructions per cycle, and a 2-level cache; the chip has a similar transistor count, although it runs at only 2.8 GHz
• Like the P4, the Opteron implements the x86 instruction set (extended to 64 bits) and decodes machine instructions into simpler internal operations
  – P4 has a higher CPI on all benchmarks except mcf
    • AMD’s CPI is more than twice the P4’s on this benchmark
  – So in most cases, instructions take fewer cycles to complete (lower CPI) on the AMD than on the P4, but the P4 has a slightly faster clock to offset this

Intel Core i7
• The i7 extends the Pentium approach
  – Aggressive out-of-order speculation
  – Deep pipeline (14 stages)
    • instruction fetch – retrieves 16 bytes to decode
    • there is a separate IIFU that feeds a queue that can store up to 18 instructions at a time
  – unlike the Pentium, decoding is done using a step called macro-op fusion, which combines instructions that have independent micro-ops that can execute in parallel
    • if a loop is detected that contains fewer than 28 instructions or 256 bytes, these instructions remain in a buffer to be issued repeatedly (rather than being fetched repeatedly)

Intel Core i7 (Continued)
• Instruction fetch also includes
  – The use of a multilevel branch target buffer and a return address stack for speculation
    • mispredictions cause a penalty of about 15 cycles
  – A 32 KB instruction cache
• Decoding first converts machine instructions into microcode and breaks instructions into two types using four decoders
  – Simple micro-operation instructions (2 each)
  – Complex micro-operation instructions (2 each)
• Instruction issue can issue up to 6 micro-operations per cycle to
  – 36 centralized reservation stations
  – 6 functional units, including 1 load and 2 store units that share a memory buffer connected to 3 different data caches

i7 Architecture (figure)

i7 Performance
• CPI for various SPEC06 benchmarks (figure)
• Average CPI is 1.06 for integer programs and 0.89 for FP programs
• This is measured in machine instructions issued (not micro-ops), so obtaining the values is not completely transparent
• The Pentium and i7 are both susceptible to misspeculation, which results in “wasted” work; up to 40% of the total work that goes into SPEC06 benchmarks is wasted
• Waste also arises from cache misses (10 cycles or more lost with an L1 miss, 30-40 for L2 misses and as much as 135 for L3 misses)

Multicore Processors
• With additional space on the chip, the strategy today is to equip the processor with multiple cores
  – Each core is a separate processor with its own local cache, local bus, etc.
  – An additional cache is commonly added to the chip so that there is an L1 (within each core), L2 (on the
chip, shared among cores) and L3 (off chip)
• We will briefly consider multicore processors later when we consider thread-level parallelism and true parallel processing
• We wrap up our examination of processors by looking at multicore performance as the number of cores increases

Multicore Performance
• Three things are apparent when considering the performance of the multicore processors
  – First, obviously, the IBM Power7 outperforms the other two in every case
  – The speedup is close to, but not always, linear in the number of cores
    • doubling the number of cores does not guarantee twice the performance
  – There is a greater potential for speedup on FP benchmarks for the Power7 than on the int benchmarks

A Balancing Act
• Improving one aspect of our processor does not necessarily improve performance
  – in fact, it might harm performance
    • consider lengthening the pipeline depth and increasing clock speed in the P4 without adding reservation stations or using the trace cache
    • stalls will arise at the issue stage, thus defeating any benefit from the longer pipeline
    • cache misses will have a greater impact, not a lesser impact, when the clock speed is increased
• Modern processor design takes a lot of effort to balance out the factors
  – without accurate branch prediction and speculation hardware, stalls from mispredicted branches will drop performance greatly
    • we saw this in both the ARM and i7 processors

A Balancing Act (Continued)
• As clock speeds increase
  – Stalls from cache misses create a bigger impact on CPI, so larger caches and cache optimization techniques are needed
• To support multiple issue of instructions
  – we need a larger cache-to-processor bandwidth, which can take up valuable space
• As we increase the number of instructions that can be issued
  – we need to increase the number of reservation stations and the reorder buffer size
• Some compiler optimizations can also be applied to help support the distributed nature of the hardware (we look at this next week)
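
Worked sketch: clock speed vs. cache-miss stalls
The claim that cache misses hurt more, not less, as the clock speeds up can be checked with a small calculation. The sketch below is illustrative only: it assumes a simple model in which main-memory latency is fixed in nanoseconds, so the miss penalty measured in cycles grows with the clock rate. The function names, the 50 ns latency, the base CPI of 1.0 and the 0.3 memory accesses per instruction are all assumptions made for the example; only the 0.5% miss rate and the 1.5 GHz and 3.2 GHz clock rates echo figures quoted in these slides.

```python
def miss_penalty_cycles(latency_ns, clock_ghz):
    """Cycles lost per miss: a fixed latency in ns costs more cycles at a faster clock."""
    return latency_ns * clock_ghz


def effective_cpi(base_cpi, accesses_per_instr, miss_rate, latency_ns, clock_ghz):
    """Base CPI plus the average memory-stall cycles per instruction (assumed model)."""
    stall = accesses_per_instr * miss_rate * miss_penalty_cycles(latency_ns, clock_ghz)
    return base_cpi + stall


def time_per_instr_ns(cpi, clock_ghz):
    """Average time per instruction in ns: cycles per instruction / cycles per ns."""
    return cpi / clock_ghz


if __name__ == "__main__":
    # All inputs except the miss rate and clock rates are hypothetical.
    for ghz in (1.5, 3.2):
        cpi = effective_cpi(1.0, 0.3, 0.005, 50.0, ghz)
        print(f"{ghz} GHz: CPI = {cpi:.4f}, time/instr = {time_per_instr_ns(cpi, ghz):.4f} ns")
```

Under these assumptions the faster clock still lowers the time per instruction, but by less than the ratio of the clock rates, because the CPI itself rises with the clock. This is the balancing act described above: a faster clock pays off fully only if the cache miss rate or miss latency improves with it.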