GENERAL-PURPOSE PROCESSOR: Introduction

General-Purpose Processor
- A processor designed for a variety of computation tasks
- Low unit cost, in part because the manufacturer spreads NRE over large numbers of units
  - Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
- Carefully designed, since higher NRE is acceptable; can yield good performance, size, and power
- Low NRE cost for the embedded system designer: short time-to-market/prototype, high flexibility
  - The user just writes software; no processor design

Basic Architecture
- Control unit and datapath
- Similar to a single-purpose processor, with two key differences:
  - The datapath is general
  - The control unit does not store the algorithm; the algorithm is "programmed" into memory
[Figure: processor with control unit (controller, control/status) and datapath (ALU, registers, PC, IR), connected to I/O and memory]

Datapath Operations
- Load: read a memory location into a register
- ALU operation: feed certain registers through the ALU, store the result back in a register
- Store: write a register to a memory location

Control Unit
- The control unit configures the datapath operations
- A sequence of desired operations ("instructions") is stored in memory: the "program"
- The instruction cycle is broken into several sub-operations, each taking one clock cycle, e.g.:
  - Fetch: get the next instruction into the IR
  - Decode: determine what the instruction means
  - Fetch operands: move data from memory to a datapath register
  - Execute: move data through the ALU
  - Store results: write data from a register to memory
- Example program:
    100  load R0, M[500]
    101  inc R1, R0
    102  store M[501], R1

Instruction Cycles
- PC=100: load R0, M[500] proceeds through fetch, decode, fetch operands, execute, store results, one sub-operation per clock
[Figure: datapath state after load R0, M[500]; R0 = 10, memory M[500] = 10]
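The fetch-decode-execute loop above can be sketched as a tiny simulator. This is a minimal illustration in Python, not ARM code; the three-instruction program, the addresses, and the memory contents come from the slide's example, while the tuple encoding of instructions is invented for this sketch.

```python
# Minimal fetch-decode-execute simulator for the slide's example program.
# Instructions are pre-decoded tuples; memory and registers are plain dicts.

def run(program, memory):
    regs = {"R0": 0, "R1": 0}
    pc = 100                          # the program starts at address 100
    while pc in program:
        op, *args = program[pc]       # fetch + decode
        if op == "load":              # load Rd, M[addr]: memory -> register
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "inc":             # inc Rd, Rs: Rd = Rs + 1, via the ALU
            rd, rs = args
            regs[rd] = regs[rs] + 1
        elif op == "store":           # store M[addr], Rs: register -> memory
            addr, rs = args
            memory[addr] = regs[rs]
        pc += 1                       # store results, then fetch the next one

program = {
    100: ("load", "R0", 500),
    101: ("inc", "R1", "R0"),
    102: ("store", 501, "R1"),
}
memory = {500: 10}
run(program, memory)
print(memory[501])   # 11, matching the slide's final memory snapshot
```

Running it reproduces the slide's trace: R0 is loaded with 10, R1 becomes 11, and 11 is written to M[501].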
- PC=101: inc R1, R0 runs through the same fetch/decode/execute sequence; R1 becomes 11
- PC=102: store M[501], R1 writes 11 to memory location 501
[Figure: successive instruction cycles for PC=100..102, overlapped against the clock; final memory: M[500] = 10, M[501] = 11]

Architectural Considerations
- N-bit processor: N-bit ALU, registers, buses, and memory data interface
  - Embedded: 8-bit, 16-bit, 32-bit common
  - Desktop/servers: 32-bit, even 64-bit
- PC size determines the address space
- Clock frequency: the inverse of the clock period
  - The clock period must be longer than the longest register-to-register delay in the entire processor
  - Memory access is often the longest such delay

ARM INTRODUCTION

ARM RISC Design Philosophy
- Smaller die size
- Shorter development time
- Higher performance
  - "Insects flap their wings faster than small birds": a complex instruction can make some high-level function more efficient, but it slows down the clock for all instructions

ARM Design Philosophy
- Reduce power consumption and extend battery life
- High code density
  - Embedded systems prefer slow, low-cost memory
- Low price
- Reduce the die area taken by the embedded processor
  - Leaves space for a specialized coprocessor
- Hardware debug capability

ARM is not a pure RISC architecture
- Designed primarily for embedded systems, with an instruction set to match:
- Variable-cycle execution for certain instructions
- Multi-register load-store instructions
  - Faster when memory access is sequential
  - Higher code density, since saving and restoring registers is a common operation at the start and end of a function
- Inline barrel shifting: leads to more complex instructions but improves code density
  - e.g. ADD r0, r1, r1, LSL #1
- Thumb 16-bit instruction set: code can mix 16-bit and 32-bit instructions
- Conditional execution: improves code density by reducing branch instructions, e.g.
    CMP   r1, r2
    SUBGT r1, r1, r2
    SUBLT r2, r2, r1
- Enhanced DSP instructions: use one processor instead of the traditional combination of two

ARM-Based Embedded Devices: Peripherals
- All ARM peripherals are memory-mapped

Interrupt Controllers
- Standard interrupt controller
  - Sends an interrupt signal to the processor core
  - Can be programmed to ignore or mask an individual device or set of devices
  - The interrupt handler reads a device bitmap register to determine which device requires servicing
- VIC (vectored interrupt controller)
  - Assigns a priority and an ISR handler to each device
  - Depending on its type, it either calls the standard interrupt handler or jumps to the specific device handler directly

ARM Datapath: Registers
- R0-R15 general-purpose registers
  - R13: stack pointer; R14: link register; R15: program counter
  - R0-R13 are orthogonal
- Two program status registers: CPSR and SPSR
[Figure: ARM's visible registers — r0-r15 (PC) and CPSR usable in user mode; banked r8_fiq-r14_fiq and SPSR_fiq for fiq mode; banked r13/r14 and SPSR for svc, abort, irq, and undefined modes]

Banked Registers
- 37 registers in total; 20 are hidden from the program at any given time
- These are the "banked registers", available only when the processor is in a certain mode
- The mode can be changed by the program or by an exception: reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort, or undefined instruction
- No SPSR access in user mode

CPSR
- Condition flags: N, Z, C, V
- Interrupt masks: I, F
- Thumb state: T; Jazelle state: J
- Mode bits 4-0: processor mode
- Six privileged modes:
  - Abort: after a failed attempt to access memory
  - FIQ: fast interrupt request
  - IRQ: interrupt request
  - Supervisor: entered after reset; the kernel works in this mode
  - System: a special version of user mode with full read/write access to the CPSR
  - Undefined: entered when an undefined or unsupported instruction is executed
- Plus the unprivileged User mode
[CPSR layout: bits 31-28 NZCV | 27-8 unused | 7-6 IF | 5 T | 4-0 mode]

Instruction Execution: the 3-Stage Pipeline (ARM Organization)
- Fetch: the instruction is fetched from memory and placed in the instruction pipeline
- Decode: the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction "owns" the decode logic but not the datapath
- Execute: the instruction "owns" the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
[Figures: ARM7 core diagram; 3-stage pipeline timing for single-cycle and multi-cycle instructions]
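The CMP/SUBGT/SUBLT sequence shown earlier is the classic conditionally-executed GCD step: one compare sets the flags, and the two subtracts execute or are skipped based on those flags, with no branches inside the loop body. A Python sketch of the same loop (the flags are evaluated once per iteration, as the CMP does; this models the behavior, not the encoding):

```python
# Euclid's GCD by repeated subtraction, structured like the ARM
# conditional-execution version: CMP sets the flags once, then
# SUBGT and SUBLT each fire only if their condition held at the CMP.
def gcd(r1, r2):
    while r1 != r2:          # the loop-closing branch (BNE) in ARM code
        gt = r1 > r2         # CMP r1, r2: evaluate the condition once
        if gt:               # SUBGT r1, r1, r2
            r1 = r1 - r2
        else:                # SUBLT r2, r2, r1 (LT holds, since r1 != r2)
            r2 = r2 - r1
    return r1

print(gcd(12, 18))   # 6
```

In the real machine code the win is that SUBGT/SUBLT are single instructions that never disturb the pipeline, whereas an if/else compiled naively would need two branches per iteration.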
PC Behavior
- R15 is incremented twice before an instruction executes, due to pipeline operation
- R15 = current instruction address + 8 (the offset is +4 for Thumb instructions)

Getting Higher Performance
- Tprog = (Ninst x CPI) / fclk
  - Ninst: the number of instructions executed for the program (constant for a given program)
- Increase the clock rate
  - The clock rate is limited by the slowest pipeline stage
  - So decrease the logic complexity per stage and increase the pipeline depth
- Improve the CPI
  - Instructions that take more than one cycle are reimplemented to occupy fewer cycles
  - Pipeline stalls are reduced

Typical Dynamic Instruction Usage
Statistics for a print-preview program in an ARM instruction emulator:

  Instruction type       Dynamic usage
  Data movement          43%
  Control flow           23%
  Arithmetic operations  15%
  Comparisons            13%
  Logical operations      5%
  Other                   1%

Memory Bottleneck
- Von Neumann bottleneck: a single instruction-and-data memory, limited by the available memory bandwidth
- A 3-stage ARM core accesses memory on (almost) every clock
- Higher-performance ARM cores use a Harvard architecture

The 5-Stage Pipeline
- Fetch: the instruction is fetched and placed in the instruction pipeline
- Decode: the instruction is decoded and the register operands read from the register file
- Execute: an operand is shifted and the ALU result generated; for loads and stores, the memory address is computed
- Buffer/Data: data memory is accessed if required; otherwise the ALU result is simply buffered
- Write-back: the results are written back to the register file

Data Forwarding
- Read-after-write pipeline hazard: an instruction needs the result of one of its predecessors before that result has returned to the register file, e.g.
    ADD r1, r2, r3
    ADD r4, r5, r1
- Data forwarding is used to eliminate the stall
- In the following case, even with forwarding, a pipeline stall cannot be avoided:
    LDR rN, [...]      ; load rN from somewhere
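With forwarding in place, the only remaining data stall in this pipeline is the load-use case above: a load followed immediately by a consumer of the loaded register. That check is mechanical, and is exactly what a compiler scheduler looks for. A sketch (the tuple instruction encoding here is invented for illustration):

```python
# Count one-cycle load-use stalls in a straight-line instruction sequence.
# Each instruction is (opcode, dest, sources). Forwarding covers every
# other read-after-write case, so only LDR -> immediate use stalls.

def count_load_use_stalls(instrs):
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        op, dest, _srcs = prev
        if op == "LDR" and dest in cur[2]:
            stalls += 1          # consumer must wait one cycle for memory data
    return stalls

prog = [
    ("LDR", "r1", ["r0"]),        # r1 = mem[r0]
    ("ADD", "r2", ["r1", "r3"]),  # uses r1 right away -> one stall
    ("SUB", "r4", ["r2", "r1"]),  # r1 is available by now -> no stall
]
print(count_load_use_stalls(prog))   # 1
```

Moving an independent instruction between the LDR and the ADD drops the count to zero, which is the software fix the slides describe.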
Branch Timing
- Branch penalty, and reducing the penalty

Instruction Queue and Prefetching
- The fetch unit reads instructions ahead of time and places them in an instruction queue feeding the dispatch/decode unit
[Figure 8.10: hardware organization with an instruction fetch unit and queue between F (fetch) and D (dispatch/decode), followed by E (execute) and W (write results)]
[Figure 8.11: branch timing in the presence of an instruction queue; the branch target address is computed in the D stage, and the queue (length varying 1-3) absorbs the branch bubble — branch folding]

Branch Folding
- Branch folding: executing the branch instruction concurrently with the execution of other instructions
- It occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue
- It is therefore desirable to keep the queue full most of the time, ensuring an adequate supply of instructions for processing; this can be achieved by increasing the rate at which the fetch unit reads instructions from the cache
- An instruction queue is also beneficial in dealing with cache misses

Conditional Branches
- A conditional branch introduces the added hazard that the branch condition depends on the result of a preceding instruction
- The decision to branch cannot be made until the execution of that instruction has completed
- Branch instructions represent about 20% of the dynamic instruction count of most programs

Delayed Branch
- The instructions in the delay slots are always fetched, so we would like them to be fully executed whether or not the branch is taken
- The objective is to place useful instructions in these slots
- The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions
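The defining property of a delay slot — the instruction after the branch is always executed, taken or not — can be modeled directly. Below is a sketch of a loop body already reordered for the delay slot (decrement, branch, then the shift placed in the slot); the function names are invented for illustration:

```python
# Execute a reordered loop on a machine with one branch delay slot:
# the instruction after the branch is ALWAYS executed, taken or not.
def run_loop(r1, r2):
    # Loop body: Decrement R2; Branch>0 LOOP; Shift_left R1 (delay slot)
    while True:
        r2 -= 1                  # Decrement R2
        taken = r2 > 0           # Branch>0 LOOP: decision is made here...
        r1 <<= 1                 # ...but the delay-slot shift runs regardless
        if not taken:
            break
    return r1                    # the instruction at NEXT would follow

# Shifting left once per iteration, for r2 iterations:
print(run_loop(1, 4))   # 16  (1 << 4)
```

Because the shift executes on every pass, including the final not-taken one, the reordered loop computes exactly what the original shift-decrement-branch ordering did, with no wasted slot.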
Delayed Branch (continued)

(a) Original program loop:
    LOOP:  Shift_left  R1
           Decrement   R2
           Branch>0    LOOP
    NEXT:  Add         R1, R3

(b) Reordered instructions for a delayed branch:
    LOOP:  Decrement   R2
           Branch>0    LOOP
           Shift_left  R1        ; delay slot
    NEXT:  Add         R1, R3

[Figure 8.12: reordering of instructions for a delayed branch]
[Figure 8.13: execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12]

Branch Prediction
- Predict whether or not a particular branch will be taken
- Simplest form: assume the branch will not be taken, and continue fetching instructions in sequential address order
- Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis
  - Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence
  - Care is needed so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed
[Figure 8.14: timing when a branch decision has been incorrectly predicted as not taken — the speculatively fetched instructions are discarded]
- Better performance can be achieved if some branch instructions are predicted as taken and others as not taken
  - Hardware can observe whether the target address is lower or higher than that of the branch instruction
  - Alternatively, the compiler can include a branch prediction bit
- So far, the prediction decision is the same every time a given instruction is executed: static branch prediction
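Combining the figure from the slides — branches are about 20% of the dynamic instruction count — with a misprediction rate and penalty gives a quick estimate of the cost of static prediction. The 40% misprediction rate and 2-cycle penalty below are illustrative assumptions, not figures from the slides:

```python
# Effective CPI with branch prediction:
#   CPI = base + branch_frequency * misprediction_rate * penalty_cycles
# branch_freq = 0.20 comes from the slides; the other numbers are assumed.
def effective_cpi(base=1.0, branch_freq=0.20, mispredict=0.40, penalty=2):
    return base + branch_freq * mispredict * penalty

print(round(effective_cpi(), 3))                  # 1.16
print(round(effective_cpi(mispredict=0.10), 3))   # 1.04 with a better predictor
```

Plugging the result into Tprog = (Ninst x CPI) / fclk from earlier shows why prediction accuracy matters: cutting mispredictions from 40% to 10% shaves about 10% off the whole program's run time here.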
Superscalar Operation
- Maximum throughput of a simple pipeline: one instruction per clock cycle
- Multiple processing units allow more than one instruction per cycle
[Figure 8.19: a processor with two execution units — an instruction queue feeds a dispatch unit, which issues to a floating-point unit and an integer unit, followed by W (write results)]
[Figure 8.20: instruction execution flow in the processor of Figure 8.19, assuming no hazards — Fadd occupies the FP unit for three execute cycles (E1A, E1B, E1C) while Add, Fsub, and Sub proceed in parallel]

ALU
- Logic operations: OR, AND, XOR, NOT, NAND, NOR, etc.
  - No dependencies among bits: every result bit can be calculated in parallel
- Arithmetic operations: ADD, SUB, INC, DEC, MUL, DIV
  - Involve a long carry-propagation chain, the major source of delay, and therefore require optimization
- An algorithm's suitability is judged by resource usage (physical space on the silicon die) and turnaround time

Adders in ARM Cores
- The original ARM1 used a ripple-carry adder
[Figure: ripple-carry cell — inputs A, B, Cin; outputs sum, Cout]
- ARM2 used a 4-bit carry look-ahead scheme: each 4-bit adder block computes generate (G) and propagate (P) signals to produce its carry-out directly
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], Cout[3], G, P]
- The ARM2 ALU produced logic and arithmetic results in one structure, selected per result bit by a 6-bit function code fs[5:0]
[Table: ARM2 ALU function codes — the fs[5:0] encodings select among outputs A and B, A and not B, A xor B, A plus not B plus carry, A plus B plus carry, not A plus B plus carry, A, A or B, B, not B, and zero]
- ARM6 used a carry-select adder: each block computes its sum twice, once assuming carry-in 0 and once assuming carry-in 1, and a multiplexer picks the right one when the actual carry arrives
[Figure: ARM6 carry-select scheme — a 4-bit adder for a,b[3:0], then +/+1 pairs with muxes for sum[7:4], sum[15:8], and sum[31:16]]

Conditional Sum Adder
- An extension of the carry-select adder:
  - One level of selection using k/2-bit adders
  - Two levels using k/4-bit adders
  - Three levels using k/8-bit adders, and so on
- Assuming k is a power of two, the extreme case has log2(k) levels using 1-bit adders: this is the conditional sum adder
[Figures: conditional-sum example; top-level block for one bit position]

The ARM6 ALU Organization
[Figure: A and B operand latches with XOR-gate inverters, function logic, an adder plus logic/arithmetic result mux, C and V flags from the adder, N and Z (zero detect) from the result]

Barrel Shifter
- Cross-bar switch principle: a matrix of switches connects in[3:0] to out[3:0]; turning on one diagonal gives a fixed shift (right 1/2/3, left 1/2/3, or no shift)
- For a left or right shift, one diagonal is turned on
- The shifter operates in negative logic; precharging sets all outputs to '0'
- For rotate-right, the right-shift diagonal is enabled together with the complementary left-shift diagonal
- Arithmetic shift right uses sign extension rather than '0' fill

Multiplier
- ARM includes hardware support for integer multiplication
- Older ARM cores include low-cost multiplication hardware
  - Supports 32-bit result multiply and multiply-accumulate
  - Uses the main datapath iteratively: the barrel shifter and ALU generate 2 bits of product in each cycle
  - Employs a modified Booth's algorithm to produce the 2-bit product
- Background: radix-2 and radix-4 multiplication; radix-2 and radix-4 Booth algorithms

Modified Booth's Recoding (radix-4)
Each radix-4 digit z(i/2) is derived from the multiplier bits (x(i+1), x(i), x(i-1)):

  x(i+1) x(i) x(i-1) | digit | explanation
  0      0    0      |   0   | no string of 1s in sight
  0      0    1      |   1   | end of a string of 1s
  0      1    0      |   1   | isolated 1
  0      1    1      |   2   | end of a string of 1s
  1      0    0      |  -2   | beginning of a string of 1s
  1      0    1      |  -1   | end a string, begin a new one
  1      1    0      |  -1   | beginning of a string of 1s
  1      1    1      |   0   | continuation of a string of 1s

Example: Modified Booth's Recoding
- Multiplicand a = 0110 (6); multiplier x = 1010 (-6 in two's complement)
- Radix-4 recoding of x: z = (-1, -2), since 1010 = (-1)*4 + (-2)
- Accumulation: p starts at 0; add z0*a = -2*6 = -12, then add z1*a at weight 4, i.e. -1*6*4 = -24
- Product p = -36 = 6 * (-6), as expected

High-Speed Multiplier
- Recent cores have high-performance multiplication hardware
- Supports 64-bit result multiply and multiply-accumulate

Multiplier: Carry-Save Addition
- In multiplication, multiple partial products are added; with 2-operand adders the time-consuming carry propagation is repeated several times (k operands need k-1 propagations)
- Carry-save addition lowers this penalty: the carry propagates only in the last step; every other step generates a partial sum and a sequence of carries
- A basic CSA accepts three n-bit operands and generates a 2n-bit result: an n-bit partial sum and an n-bit carry
- A second CSA accepts those two sequences plus another input operand, generating a new partial sum and carry
- A CSA thus reduces the number of operands to be added from 3 to 2 without carry propagation

CSA Basic Unit: the (3,2) Counter
- Simplest implementation: a full adder (FA) with three inputs x, y, z, where x + y + z = 2c + s (s and c are the sum and carry outputs)
- The outputs are a weighted binary representation of the number of 1s among the inputs, so the FA is called a (3,2) counter
- An n-bit CSA is n (3,2) counters in parallel with no carry links
[Figure: (a) carry-propagate vs. (b) carry-save connection of full adders]

Cascaded CSAs for Four 4-Bit Operands
- The upper two levels are 4-bit CSAs; the third level is a 4-bit carry-propagating adder (CPA)
- Wallace tree: a better organization of the CSAs, for faster operation

ARM High-Speed Multiplier Organization
[Figure: the multiplier Rs is consumed 8 bits per cycle while Rm is rotated; carry-save adders produce a partial sum and partial carry; the ALU adds the partials; the registers are initialized for MLA (multiply-accumulate)]
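The (3,2)-counter relation x + y + z = 2c + s extends bitwise to whole words: one carry-save step reduces three operands to a sum word and a carry word with no carry propagation at all. A minimal sketch of that step:

```python
# One carry-save addition step: reduce three operands to a partial-sum
# word and a carry word, with no carry propagation. Each bit position
# is a (3,2) counter (a full adder), so the invariant is:
#   x + y + z == sum_word + 2 * carry_word
def carry_save(x, y, z):
    sum_word = x ^ y ^ z                       # XOR gives the sum bits
    carry_word = (x & y) | (y & z) | (x & z)   # majority gives the carry bits
    return sum_word, carry_word

s, c = carry_save(0b1011, 0b0110, 0b1101)
assert s + 2 * c == 0b1011 + 0b0110 + 0b1101   # one final CPA finishes the add
print(s, c)
```

Chaining this step over all the partial products (as a Wallace tree does) leaves just two words, so the slow carry-propagating addition happens exactly once, at the end.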