Introduction to Processor Architecture

Contents
• Introduction
• Processor architecture overview
• ISA (Instruction Set Architecture)
  – RISC example (SMIPS)
  – CISC example (Y86)
• Processor architecture
  – Single-cycle processor example (SMIPS)
• Pipelining
  – Control hazard
  – Branch predictor
  – Data hazard
• Cache memory

Introduction

Processors
• What is a processor?
• What is the difference among them?

Processor architecture and program
• If you understand the architecture, you have more opportunities to optimize your program.
• Let's look at some examples.

Example 1
• Version 1:
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }
• Version 2:
    for (j = 0; j < size; j++) {
        for (i = 0; i < size; i++) {
            sum += array[i][j];
        }
    }
• Keyword: Cache

Example 2 (1/2)
• Version 1:
    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }

Example 2 (2/2)
• Version 2:
    for (i = 0; i < size; i += 2) {
        action_even();
    }
    for (i = 1; i < size; i += 2) {
        action_odd();
    }
• Keyword: Branch predictor and pipeline
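To make these examples concrete, here is a minimal C sketch (not from the slides) that times the two traversal orders of Example 1. The array size, the timing method, and the absolute numbers are illustrative assumptions; the same harness can be reused to compare the two loop structures of Example 2. On a typical machine the row-major version should run noticeably faster because it walks memory in cache-line order.

    /* Sketch: compare the two traversal orders of Example 1 (assumed SIZE). */
    #include <stdio.h>
    #include <time.h>

    #define SIZE 4096
    static int array[SIZE][SIZE];          /* row-major layout in C */

    static double time_sum(int row_major, long *sum_out)
    {
        long sum = 0;
        clock_t start = clock();
        if (row_major) {                   /* walks memory sequentially */
            for (int i = 0; i < SIZE; i++)
                for (int j = 0; j < SIZE; j++)
                    sum += array[i][j];
        } else {                           /* jumps SIZE*sizeof(int) bytes per access */
            for (int j = 0; j < SIZE; j++)
                for (int i = 0; i < SIZE; i++)
                    sum += array[i][j];
        }
        *sum_out = sum;
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        long s1, s2;
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                array[i][j] = i + j;
        printf("row-major    : %.3f s\n", time_sum(1, &s1));
        printf("column-major : %.3f s\n", time_sum(0, &s2));
        printf("sums equal   : %s\n", s1 == s2 ? "yes" : "no");
        return 0;
    }

Both versions compute the same sum; only the memory access order, and therefore the cache behavior, differs.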
Processor architecture overview

Von Neumann Architecture
• Input -> process -> output model
• Integrated instruction memory and data memory

Basic components of an x86 CPU
• Register file: %eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi
• Program counter: %eip
• Status registers: zero flag, sign flag, overflow flag, carry flag
• CPU pipeline: Fetch, Decode, Execution units, Commit
• Cache memory
• Memory (external)

Register file
• What is a register? A simple memory element (such as edge-triggered flip-flops)

Register file
• A collection of registers
  – 8 registers are visible
• In fact, there are many more registers hidden for other uses.
  Ex) There are 168 physical registers in Intel's Haswell.

Program counter (%eip)
• Points to the address of the instruction the processor should execute in the next cycle.
• %eip is the name of the program counter register in x86.
• The naming convention differs between ISAs (Instruction Set Architectures).

Status registers
• Zero flag, sign flag, overflow flag, carry flag
• Also a collection of registers
• Boolean registers that represent the processor's status
• Used to evaluate conditions

Memory
• Main memory, usually DRAM
• In the Von Neumann architecture, instructions (code) and data are in the same memory module

CPU pipeline
• Fetch -> Decode -> Execution units -> Commit
• Where the actual operation occurs
• Details will be explained later

Instruction Set Architecture

Instruction Set Architecture (ISA)
• How you actually talk to a processor

Instruction Set Architecture (ISA)
• Mapping between assembly code and machine code
  – Which assembly instructions are included?
  – How the assembly instructions are represented in byte code

Instruction
• A command that makes the processor perform a specific task
  – Ex1) movl 4(%eax), %esp (x86) -> move the data at memory address (%eax) + 4 into %esp
  – Ex2) irmovl $256, %eax (Y86) -> store the value 256 into register %eax

Representation of instructions
• Instructions are represented as byte codes
  – pushl rA: A0, then rA and the no-register code F
  – popl rA: B0, then rA and F
  – irmovl V, rB: 30, then F and rB, then the 4-byte immediate value V
• Examples
  – pushl %ebx => 0xa03f
  – irmovl $256, %eax => 0x30f000010000

CISC vs RISC
• CISC example: Y86
• RISC example: SMIPS

CISC
• Basic idea: give programmers powerful instructions; fewer instructions are needed to complete a piece of work
• One instruction does multiple pieces of work
• A lot of instructions (over 300 in x86)
• Many instructions can access memory
• Variable instruction length

RISC
• Basic idea: write a complex program out of simple instructions
• Each instruction does only one task
• Small instruction set (about 100 in MIPS)
• Only load and store instructions can access memory
• Fixed instruction length

RISC example: SMIPS ISA

Instruction formats
• R-type: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | func(6)
• I-type: opcode(6) | rs(5) | rt(5) | immediate(16)
• J-type: opcode(6) | target(26)
• Only three formats, but the fields are used differently by different types of instructions

Instruction formats
• Computational instructions
  – R-type (opcode 0): rd <- (rs) func (rt)
  – I-type: rt <- (rs) op immediate
• Load/Store instructions
  – Format: opcode(6) | rs(5) | rt(5) | displacement(16)
  – Addressing mode: effective address = (rs) + displacement
  – rs is the base register; rt is the destination of a load or the source of a store

Control instructions
• Conditional (on GPR) PC-relative branches: BEQZ, BNEZ
  – Format: opcode(6) | rs(5) | unused(5) | offset(16)
  – target address = (PC + 4) + (offset in words) * 4
  – Range: a 128 KB range around the branch
• Unconditional register-indirect jumps: JR, JALR
  – Format: opcode(6) | rs(5) | ...
• Unconditional absolute jumps: J, JAL
  – Format: opcode(6) | target(26)
  – target address = {PC[31:28], target * 4}
  – Range: a 256 MB region
• Jump-and-link stores PC + 4 into the link register (R31)

CISC example: Y86 ISA

Instruction formats
• 1 byte : iCd iFun
• 2 bytes: iCd iFun | rA rB
• 5 bytes: iCd iFun | Destination (4 bytes)
• 6 bytes: iCd iFun | rA rB | Immediate/Offset (4 bytes)
• iCd: instruction code, iFun: function code, rA, rB: register indices

1-byte instructions – halt, nop
• halt (iCd 0): used as a sign of program termination
  – Changes the processor state to halt (HLT)
• nop (iCd 1): no operation; used as a bubble

2-byte instruction – OPl
• OPl (iCd 6): performs one of the 4 basic ALU operations selected by iFun: add, sub, and, xor
  – R[rB] <- R[rB] Op R[rA]
  – The condition flags are set depending on the result.
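As a small illustration of how these byte codes are pulled apart, the following C sketch splits the first two bytes of a Y86 instruction into iCd, iFun, rA and rB. The example bytes and helper names are my own; only the field layout comes from the format table above.

    /* Sketch: decode the fields of a 2-byte Y86 instruction (assumed example bytes). */
    #include <stdio.h>
    #include <stdint.h>

    static const char *regname[8] = {
        "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi"
    };

    int main(void)
    {
        uint8_t inst[2] = { 0x61, 0x23 };      /* example: OPl with iFun = 1 (sub) */

        int iCd  = inst[0] >> 4;               /* high nibble of byte 0: instruction code */
        int iFun = inst[0] & 0x0F;             /* low nibble of byte 0: function code     */
        int rA   = inst[1] >> 4;               /* high nibble of byte 1                   */
        int rB   = inst[1] & 0x0F;             /* low nibble of byte 1                    */

        printf("iCd=%X iFun=%X rA=%s rB=%s\n", iCd, iFun,
               rA < 8 ? regname[rA] : "none", rB < 8 ? regname[rB] : "none");
        /* For iCd = 6 (OPl), iFun selects add/sub/and/xor and the effect is
           R[rB] <- R[rB] Op R[rA], as described above. */
        return 0;
    }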
5-byte instruction – call
• call Destination (iCd 8)
  – R[%esp] <- R[%esp] - 4 (update the stack pointer; move the stack top)
  – M[R[%esp]] <- PC + 5 (store the return address on the stack top; call is 5 bytes long)
  – PC <- Destination (jump to the destination address)

6-byte instructions – rmmovl, mrmovl
• rmmovl rA, Offset(rB) (iCd 4): store
  – target address = R[rB] + Offset
  – M[target address] <- R[rA]
• mrmovl Offset(rB), rA (iCd 5): load
  – source address = R[rB] + Offset
  – R[rA] <- M[source address]

Processor Architecture

Simple processor architecture

Simplified version (very simplified)
• A clock driving one large block of sequential logic connected to a memory
• The logic loads program code from memory and stores data back to memory; its outputs are the register values

Sequential design
• Register file, %eip, and the stages Fetch -> Decode -> Execution units -> Commit, connected to memory

Fetch unit
1) Get the PC
2) Request the next instruction from memory
3) Receive the next instruction
4) Update the PC
5) Hand the next instruction (byte code) to Decode

Decode unit (1/2)
1) Split the input instruction into fields: iCd, iFun, rA, rB, imm
2) Fill the information structure for execution
   – Inst type, target register A, target register B, immediate value, register value A, register value B, ... (depends on the ISA)

Decode unit (2/2)
3) Read the register values from the register file and fill in register value A and register value B

Execution unit (1/2)
1) Select the inputs for the ALU
2) Perform the appropriate ALU operation
3) Using the ALU result, fill the information structure for the memory and register update
   – Inst type, target register, memory address, register data, memory data

Execution unit (2/2)
4) Perform the memory operation (Ld, St)
5) If a load was executed, update the data field with the value read from memory

Commit unit
• Register update logic: write the result back into the register file

Single-cycle processor example: SMIPS

Single-Cycle SMIPS
• Register file with 2 read ports and 1 write port
• SMIPS instructions are all 4 bytes long (next PC = PC + 4)
• Separate instruction and data memories
• Datapath: PC -> Inst Memory -> Decode -> Execute -> Data Memory / Register File

Single-Cycle SMIPS

    module mkProc(Proc);
      Reg#(Addr) pc   <- mkRegU;
      RFile      rf   <- mkRFile;
      IMemory    iMem <- mkIMemory;
      DMemory    dMem <- mkDMemory;

      rule doProc;
        let inst  = iMem.req(pc);
        let dInst = decode(inst);
        let rVal1 = rf.rd1(validRegValue(dInst.src1));
        let rVal2 = rf.rd2(validRegValue(dInst.src2));
        let eInst = exec(dInst, rVal1, rVal2, pc);

        if (eInst.iType == Ld)
          eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
          let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});

        if (isValid(eInst.dst))
          rf.wr(validRegValue(eInst.dst), eInst.data);

        pc <= eInst.brTaken ? eInst.addr : pc + 4;
      endrule
    endmodule
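The rule above reads naturally as a software loop. The following C sketch (a toy interpreter with made-up opcodes, not real SMIPS encodings) mirrors the same single-cycle structure: each iteration fetches, decodes, executes, accesses memory and writes back exactly one instruction, then updates the PC.

    /* Sketch: software analogue of the single-cycle processor (toy ISA, assumed encodings). */
    #include <stdio.h>
    #include <stdint.h>

    enum { ADDI, ADD, LW, SW, BEQZ, HALT };       /* toy opcodes */

    typedef struct { int op, rd, rs, rt, imm; } Inst;

    int main(void)
    {
        int32_t rf[8]    = {0};                   /* register file                */
        int32_t dmem[16] = {0};                   /* data memory (word addressed) */
        Inst imem[] = {                           /* instruction memory           */
            { ADDI, 1, 0, 0, 5 },                 /* r1 = 5                       */
            { ADDI, 2, 0, 0, 7 },                 /* r2 = 7                       */
            { ADD,  3, 1, 2, 0 },                 /* r3 = r1 + r2                 */
            { SW,   0, 0, 3, 4 },                 /* dmem[4] = r3                 */
            { HALT, 0, 0, 0, 0 },
        };

        int pc = 0;
        for (;;) {
            Inst i = imem[pc];                    /* fetch + decode (fields already split) */
            int next_pc = pc + 1;                 /* default: fall through        */

            switch (i.op) {                       /* execute / memory / write-back */
            case ADDI: rf[i.rd] = rf[i.rs] + i.imm;        break;
            case ADD:  rf[i.rd] = rf[i.rs] + rf[i.rt];     break;
            case LW:   rf[i.rd] = dmem[rf[i.rs] + i.imm];  break;
            case SW:   dmem[rf[i.rs] + i.imm] = rf[i.rt];  break;
            case BEQZ: if (rf[i.rs] == 0) next_pc = i.imm; break;
            case HALT: printf("r3=%d dmem[4]=%d\n", rf[3], dmem[4]); return 0;
            }
            rf[0] = 0;                            /* keep r0 hard-wired to zero   */
            pc = next_pc;                         /* update the PC, like pc <= ... above */
        }
    }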
Single-Cycle SMIPS: code walkthrough
• Declaration of components: the PC register, the register file, and the instruction and data memories
    Reg#(Addr) pc <- mkRegU;  RFile rf <- mkRFile;  IMemory iMem <- mkIMemory;  DMemory dMem <- mkDMemory;
• Fetch: let inst = iMem.req(pc);
• Decode: let dInst = decode(inst);
• Register read: let rVal1 = rf.rd1(validRegValue(dInst.src1));  let rVal2 = rf.rd2(validRegValue(dInst.src2));
• Execute: let eInst = exec(dInst, rVal1, rVal2, pc);
• Memory access: if (eInst.iType == Ld) eInst.data <- dMem.req(...); else if (eInst.iType == St) dMem.req(...);
• Register write-back: if (isValid(eInst.dst)) rf.wr(validRegValue(eInst.dst), eInst.data);
• Next PC: pc <= eInst.brTaken ? eInst.addr : pc + 4;

Improve processor performance – Pipelining

Pipelining
• Introduce the idea of a conveyor-belt process

Pipelining
• Introduce the idea of a conveyor-belt process
• Put a FIFO or register between neighboring stages: Fetch -> Decode -> Execution units -> Commit

Pipelining
• In this design, 4 instructions are executing at the same time

Pipelining
• Is the throughput the same?
  – Sequential: 1 instruction per cycle
  – Pipelined: 4 instructions per 4 cycles
• No: the pipelined design lets us make the clock faster
• Because the amount of work per cycle is smaller, we can use a shorter clock period

Pipeline hazard – control hazard

Control hazard (1/5)
• Which path will this assembly code take?
    mov  %ecx, %ebx
    subl %eax, (%ebx)
    je   B
    A: addl %ecx, %edi
    B: leave
       ret
• We don't know the outcome of the condition before the code runs

Control hazard (2/5)
• Execution flow so far: mov %ecx, %ebx -> subl %eax, (%ebx) -> je B
• What should Fetch bring in next: addl or leave?

Control hazard (3/5)
    mov  %ecx, %ebx
    subl %eax, (%ebx)
    je   B
    A: addl %ecx, %edi
       mov  %edi, %eax
    B: leave
       ret
• Since we cannot know the future, fetch the instruction at the next sequential position

Control hazard (4/5)
• Execution flow: subl %eax, (%ebx) -> je B -> addl %ecx, %edi
• What if the jump is taken?

Control hazard (5/5)
• Execution flow: je B -> addl %ecx, %edi -> mov %edi, %eax
• The wrong instructions were already executing

Control hazard – analysis
• We must discard some instructions when we mispredict the branch direction.
• The longer the pipeline, the more instructions must be discarded on a branch mispredict.
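A rough way to quantify this is the usual CPI model: every mispredicted branch wastes roughly one cycle per pipeline stage between Fetch and the stage that resolves the branch. The numbers in the sketch below (branch frequency, mispredict rate) are assumptions for illustration, not measurements.

    /* Sketch: cost of mispredictions as a function of the flush penalty (assumed rates). */
    #include <stdio.h>

    int main(void)
    {
        double branch_freq = 0.20;   /* fraction of instructions that are branches (assumed) */
        double mispredict  = 0.10;   /* mispredict rate of the predictor (assumed)           */
        double base_cpi    = 1.0;    /* ideal pipelined CPI: one instruction per cycle        */

        for (int flush_penalty = 2; flush_penalty <= 20; flush_penalty += 6) {
            double cpi = base_cpi + branch_freq * mispredict * flush_penalty;
            printf("penalty %2d cycles -> CPI %.2f (%.0f%% slower than ideal)\n",
                   flush_penalty, cpi, (cpi - base_cpi) / base_cpi * 100.0);
        }
        return 0;
    }

The deeper the pipeline (the larger the flush penalty), the more the effective CPI degrades, which is why the next slides add a mechanism to discard wrong-path instructions and a branch predictor to make mispredictions rare.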
Epoch method
• Add an epoch register to the processor state
• The Execute stage changes the epoch whenever the PC prediction is wrong, and sets the PC to the correct value
• The Fetch stage attaches the current epoch to every instruction when it is fetched
• The epoch of an instruction is examined when it is ready to execute; if the processor epoch has changed, the instruction is thrown away

Epoch method – summary
• Attach a tag (the epoch) to each instruction
• Both Fetch and Execute keep track of an epoch
• When Execute recognizes a misprediction, it changes the epoch it tests against, so instructions carrying a stale epoch are discarded

Epoch method example (SMIPS)
• Fetch and Execute are separated by a FIFO (f2d), and each keeps its own epoch (fEpoch, eEpoch)
• Execute sends information about the target PC to Fetch through a redirect-PC FIFO; Fetch updates fEpoch and the PC whenever it looks at that FIFO
• The code below is a simplified version in which the two stages share a single epoch register and Execute updates the PC directly

    rule doFetch;
      let inst = iMem.req(pc);
      let ppc  = nextAddr(pc);
      pc <= ppc;
      f2d.enq(Fetch2Decode{pc: pc, ppc: ppc, epoch: epoch, inst: inst});
    endrule

    rule doExecute;
      let x = f2d.first;
      let inpc = x.pc;  let inEp = x.epoch;
      let ppc  = x.ppc; let inst = x.inst;
      if (inEp == epoch) begin
        let dInst = decode(inst);
        ... register fetch ...
        let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
        ... memory operation ...
        ... rf update ...
        if (eInst.mispredict) begin
          pc <= eInst.addr;
          epoch <= inEp + 1;
        end
      end
      f2d.deq;
    endrule

Branch Prediction

Need to predict the next PC
• We must fetch an instruction every cycle
• But as we saw in the control-hazard part, we cannot know exactly what the next instruction is
• So we must predict the next instruction

How to predict the next PC?
• We can rely on history
  – Record the history and make use of it

2-bit counter branch predictor
• Assume 2 branch-prediction bits per instruction, used as a saturating counter
  – 11: strongly taken
  – 10: weakly taken
  – 01: weakly not-taken
  – 00: strongly not-taken
• On a taken branch the counter is incremented (saturating at 11); on a not-taken branch it is decremented (saturating at 00)
• The direction prediction changes only after two successive bad predictions

Branch History Table (BHT)
• k bits of the fetch PC (above the instruction-offset bits) index a 2^k-entry BHT with 2 bits per entry
• After decoding the instruction, if it turns out to be a branch, we consult the BHT using the PC; if this prediction is different from the incoming predicted PC, we can redirect Fetch
• A 4K-entry BHT with 2 bits per entry gives roughly 80-90% correct direction predictions

Let's see the program example again

    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }

Pipeline hazard – data hazard

Data hazard by flow dependence
• Sometimes an instruction uses the result of a former instruction
    I1: addl %eax, %ebx
    I2: subl %ebx, %ecx
• I2 must wait until I1 updates the register file, so that I2 can see the result

Dealing with data hazards
• We can wait until the desired value is updated: the stall method
• Or we can send the value directly to the consumer: the bypass method

Pipeline stalling example

Data bypass example
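The stall-versus-bypass decision can be sketched as a small piece of control logic. The following C sketch is a simplified model in the spirit of a classic 5-stage pipeline; the register numbering, struct names, and the rule that only a load forces a stall are assumptions for illustration, not the exact logic from the slides.

    /* Sketch: per-operand hazard decision (simplified; register 0 means "no register"). */
    #include <stdio.h>

    typedef enum { READ_REGFILE, BYPASS_FROM_EXECUTE, BYPASS_FROM_MEMORY, STALL } Action;

    typedef struct {
        int dst;       /* destination register of the older instruction (0 = none)      */
        int is_load;   /* 1 if it is a load, whose data is not ready until the Memory stage */
    } OlderInst;

    static Action resolve(int src, OlderInst in_execute, OlderInst in_memory)
    {
        if (src == 0) return READ_REGFILE;              /* no dependence               */
        if (src == in_execute.dst)
            return in_execute.is_load ? STALL           /* value not computed yet      */
                                      : BYPASS_FROM_EXECUTE;
        if (src == in_memory.dst)
            return BYPASS_FROM_MEMORY;                  /* forward the older result    */
        return READ_REGFILE;                            /* value already written back  */
    }

    int main(void)
    {
        /* I1: addl %eax,%ebx is in Execute (dst = ebx);
           I2: subl %ebx,%ecx is in Decode and needs ebx -> bypass from Execute. */
        OlderInst ex = { /*dst=*/3, /*is_load=*/0 }, mem = { 0, 0 };
        const char *name[] = { "read regfile", "bypass from Execute",
                               "bypass from Memory", "stall" };
        printf("source %%ebx: %s\n", name[resolve(3, ex, mem)]);
        return 0;
    }

Bypassing covers most cases; a stall is only unavoidable when the needed value genuinely does not exist yet, as with a load immediately followed by its consumer.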
Improve processor performance – Cache Memory

Memory operations are the bottleneck!
• Memory transfer rate of DDR3 RAM
  – Peak transfer rate: 6400 MB/s
• Assume a single-core processor with a 3.0 GHz clock that processes one word (32 bits) every cycle
  – It consumes roughly 12 GB/s: 3.0 GHz * 32 bits (word size) / 8 bits per byte
  – That is already about twice what the memory can deliver

Memory hierarchy

Locality of reference
• Temporal locality
  – If a value is used, it is likely to be used again soon
    mov  $100, %ecx        // %ecx = 100 (loop counter)
    mov  $Array, %ebx      // %ebx = &Array
    xorl %eax, %eax        // %eax = 0
    Loop:
    mov  (%ebx), %esi      // %esi = *(%ebx)
    addl %esi, %eax        // %eax += %esi
    addl $4, %ebx          // %ebx += 4
    subl $1, %ecx          // %ecx = %ecx - 1
    jne  Loop              // repeat while %ecx != 0
    Fin:
    leave
    ret

Locality of reference
• Spatial locality
  – If a value is used, nearby values are likely to be used
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }

Cache memory
• A buffer between the processor and memory
  – Often several levels of caches
• Small but fast
  – Old values are removed from the cache to make space for new values
• Capitalizes on spatial locality and temporal locality
• Parameters vary by system and are unknown to the programmer

Cache memory
• Cache memories are small, fast SRAM-based memories managed automatically in hardware
  – They hold frequently accessed blocks of main memory
• The CPU looks first for data in L1, then in L2, then in main memory
• Typical bus structure: the register file, ALU, and L1 cache sit on the CPU chip; a cache bus connects to the L2 cache; the system bus, I/O bridge, and memory bus connect to main memory

Structure of cache memory
• Basically, an array of memory elements: each entry holds an address tag and a data block that is a copy of a block of main-memory locations
• How many bits are needed for the tag? Enough to uniquely identify the block

Read behavior
• Search the cache tags to find a match for the processor-generated address
• Found in the cache, a.k.a. a HIT: return a copy of the data from the cache
• Not in the cache, a.k.a. a MISS: read the block of data from main memory (this may require writing back a cache line), return the data to the processor, and update the cache
  – Which line do we replace?

Write behavior
• On a write hit
  – Write-through: write to both the cache and the next-level memory
  – Write-back: write only to the cache, and update the next-level memory when the line is evicted
• On a write miss
  – Allocate: because of multi-word lines, we first fetch the line and then update a word in it
  – No-allocate: the word is modified in memory only

Cache line size
• A cache line usually holds more than one word (32 bits)
  – Reduces the number of tags and the tag size needed to identify memory locations
  – Exploits spatial locality

Types of misses
• Compulsory misses (cold start), a cold fact of life!
  – The first time data is referenced
  – After billions of instructions they become insignificant
• Capacity misses
  – The working set is larger than the cache size
  – Solution: increase the cache size
• Conflict misses
  – Usually multiple memory locations are mapped to the same cache location to simplify implementations
  – Thus it is possible that the designated cache location is full while there are empty locations elsewhere in the cache
  – Solution: set-associative caches
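Before looking at concrete cache organizations, here is a minimal C sketch of the read behavior above for the simplest organization, a direct-mapped cache (covered next). The line count, block size, and addresses are illustrative assumptions.

    /* Sketch: direct-mapped lookup by splitting an address into tag/index/offset. */
    #include <stdio.h>

    #define NUM_LINES  256                      /* 2^k lines, k = 8                   */
    #define BLOCK_SIZE 32                       /* bytes per line, b = 5 offset bits  */

    typedef struct {
        int      valid;
        unsigned tag;
        unsigned char data[BLOCK_SIZE];
    } Line;

    static Line cache[NUM_LINES];

    static int lookup(unsigned addr)
    {
        unsigned offset = addr % BLOCK_SIZE;               /* low b bits          */
        unsigned index  = (addr / BLOCK_SIZE) % NUM_LINES; /* next k bits         */
        unsigned tag    = addr / BLOCK_SIZE / NUM_LINES;   /* remaining high bits */

        Line *line = &cache[index];
        int hit = line->valid && line->tag == tag;
        printf("addr 0x%08x -> tag 0x%x index %u offset %u : %s\n",
               addr, tag, index, offset, hit ? "HIT" : "MISS");
        if (!hit) {                 /* on a miss: fetch the block and replace the line */
            line->valid = 1;
            line->tag   = tag;      /* (block fill from memory omitted in this sketch) */
        }
        return hit;
    }

    int main(void)
    {
        lookup(0x00012340);                          /* miss: cold                           */
        lookup(0x00012344);                          /* hit: same block (spatial locality)   */
        lookup(0x00012340);                          /* hit: same word (temporal locality)   */
        lookup(0x00012340 + NUM_LINES * BLOCK_SIZE); /* miss: same index, different tag (conflict) */
        return 0;
    }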
Direct-mapped cache
• The block number part of the request address is split into a tag (t bits) and an index (k bits); the remaining b bits are the block offset
• The index selects one of the 2^k cache lines; each line holds a valid bit, a tag, and a data block
• The stored tag is compared with the tag bits of the address; on a match (with the valid bit set) we have a HIT and the selected data word or byte is returned
• What is a bad reference pattern? A stride equal to the cache size: every access maps to the same line

Addressing example
• The request address is simply sliced bitwise into its fields, e.g. Tag = 524, Index = 270, Block offset = 1
• Go to cache entry 270 and compare its stored tag with 524

Address selection
• What if we took the index from the high-order address bits instead (Index | Tag | Offset)?
• Why might this be undesirable? Spatially local blocks would conflict: neighboring blocks share their high-order bits, so they would all map to the same cache line

Reduce conflict misses
• Memory time = hit time + Prob(miss) * miss penalty
• Associativity: allow a block to be placed in any of several frames in the cache
  – 2-way set-associative: each block maps to either of 2 frames in its set
  – Fully associative: each block maps to any cache frame

2-way set-associative cache
• The index selects a set; the set contains two lines, each with its own valid bit, tag, and data block
• Both stored tags are compared with the address tag in parallel; a match in either way is a HIT, and that way supplies the data word or byte

Replacement policy
• In order to bring in a new cache line, usually another cache line has to be thrown out. Which one?
  – There is no choice if the cache is direct-mapped
• Replacement policies for set-associative caches
  – One that is not dirty, i.e., has not been modified
    • In the I-cache all lines are clean
    • In the D-cache, if a dirty line has to be thrown out, it must be written back first
  – Least recently used?
  – Most recently used?
  – Random?
• How much is performance affected by the choice? It is difficult to know without quantitative measurements

Implementing LRU
• True LRU needs a timestamp for every line
• It also requires timestamp comparisons on every access, a hardware overhead that grows quickly with the number of ways

Pseudo-LRU example
• So we use pseudo-LRU instead of true LRU
• We'll use an 8-way set-associative cache
• Use three history bits per set
• When a line is referenced, record its code in the history bits
  – Line 0: 000
  – Line 1: 001
  – Line 2: 010
  – Line 3: 011
  – ...

Pseudo-LRU example (worked example shown as figures)
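A possible software model of this bookkeeping for one 8-way set is sketched below. Recording the referenced way in the three history bits follows the slide; the victim rule used here (take the bitwise complement of the history bits, so the most recently used line is never evicted) is an assumption for illustration, since the slides' exact rule appears only in the figures.

    /* Sketch: 3-bit pseudo-LRU state for one 8-way set (victim rule is an assumption). */
    #include <stdio.h>

    typedef struct {
        unsigned history;        /* 3 history bits: way number of the most recently used line */
    } SetState;

    static void on_reference(SetState *s, unsigned way)
    {
        s->history = way & 0x7;              /* Line 0 -> 000, Line 1 -> 001, ... */
    }

    static unsigned choose_victim(const SetState *s)
    {
        return (~s->history) & 0x7;          /* assumed rule: never evict the MRU way */
    }

    int main(void)
    {
        SetState set = { 0 };
        unsigned refs[] = { 3, 5, 1 };       /* ways hit by successive accesses */

        for (int i = 0; i < 3; i++) {
            on_reference(&set, refs[i]);
            printf("hit way %u -> history %u%u%u, victim on next miss: way %u\n",
                   refs[i],
                   (set.history >> 2) & 1, (set.history >> 1) & 1, set.history & 1,
                   choose_victim(&set));
        }
        return 0;
    }

Compared with true LRU, this needs only 3 bits per set instead of per-line timestamps, at the cost of a less precise choice of victim.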