ECE 4100/6100 Advanced Computer Architecture
Lecture 6: Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Instruction Supply Issues
[Figure: Instruction Fetch Unit, Instruction buffer, Execution Core]
• Fetch throughput defines the maximum performance that can be achieved in later stages
• Superscalar processors need to supply more than 1 instruction per cycle
• Instruction supply is limited by
  – Misalignment of multiple instructions in a fetch group
  – Change of flow (interrupting instruction supply)
  – Memory latency and bandwidth

Aligned Instruction Fetching (4 instructions)
[Figure: row decoder selecting among rows ..00, ..01, ..10, ..11 of one 64B Icache line (instruction slots A0 to A15 in a 4x4 array); PC=..xx000000; inst 1 to inst 4 are read in cycle n]
Can pull out one row at a time
Assume one fetch group = 16B

Misaligned Fetch
[Figure: same array of one 64B Icache line with PC=..xx001000; a rotating network aligns the outputs into inst 1 to inst 4 in cycle n]
IBM RS/6000

Split Cache Line Access
[Figure: PC=..xx111000; the fetch group spans cache line A (A0 to A15) and cache line B (B0 to B7 shown); labels: cache line A, cache line B, cycle n, cycle n+1]
Broken down into 2 physical accesses

Split Cache Line Access Miss
[Figure: PC=..xx111000; the fetch group spans cache line A (A0 to A15) and cache line C (C0 to C7 shown); labels: cache line A, cache line C, cycle n, cycle n+X]
Cache line B misses

High Bandwidth Instruction Fetching
• Wider issue → more instruction feed
• Major challenge: to fetch more than one non-contiguous basic block per cycle
• Enabling technique?
  – Predication
  – Branch alignment based on profiling
  – Other hardware solutions (branch prediction is a given)
[Figure: control-flow graph of basic blocks BB1 to BB7]

Predication Example
Source code
    if (a[i+1] > a[i])
        a[i+1] = 0
    else
        a[i] = 0
Typical assembly
        lw   r2, [r1+4]
        lw   r3, [r1]
        blt  r3, r2, L1
        sw   r0, [r1]
        j    L2
    L1: sw   r0, [r1+4]
    L2:
Assembly w/ predication
        lw   r2, [r1+4]
        lw   r3, [r1]
        sgt  pr4, r2, r3
  (p4)  sw   r0, [r1+4]
  (!p4) sw   r0, [r1]
• Convert control dependency into data dependency
• Enlarge basic block size
  – More room for scheduling
  – No fetch disruption

Collapsing Buffer [ISCA 95]
• To fetch multiple (often non-contiguous) instructions
• Use interleaved BTB to enable multiple branch predictions
• Align instructions in the predicted sequential order
• Use banked I-cache for multiple line access

Collapsing Buffer
[Figure: the fetch PC feeds an interleaved BTB and cache banks 1 and 2, followed by an interchange switch and a collapsing circuit]

Collapsing Buffer Mechanism
[Figure: the interleaved BTB supplies A and E for bank routing; the banks deliver lines E F G H and A B C D with valid instruction bits; the interchange switch orders them as A B C D E F G H; the collapsing circuit outputs A B C E G]

High Bandwidth Instruction Fetching
• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
• Multiple branch predictions
[Figure: control-flow graph of basic blocks BB1 to BB7]

Multiple Branch Predictor [YehMarrPatt ICS’93]
• Pattern History Table (PHT) design to support MBP
• Based on global history only
[Figure: Branch History Register (BHR, bits bk …… b1) indexing the Pattern History Table (PHT); labels: primary prediction (p1), secondary prediction (p2), tertiary prediction, update]

Multiple Branch Prediction
[Figure: prediction tree. The fetch address (br0, primary prediction) comes from a BTB entry and leads to BB1. br1 (2nd prediction): T to BB2, F to BB3. 3rd prediction: br2 in BB2, T to BB4, F to BB5; from BB3, T to BB6]
• Fetch address could be retrieved from BTB
• Predicted path: BB1 → BB2 → BB5
• How to fetch BB2 and BB5? BTB?
  – Can’t.
    Branch PCs of br1 and br2 are not available when the MBP is made
  – Use a BAC design
[Figure, continued: F to BB7]

Branch Address Cache
[Figure: BAC entry indexed by the fetch address (from BTB). Fields: Tag (23 bits) with V and br; Taken Target Address and Not-Taken Target Address (30 bits each), each with its own V and br; T-T, T-N, N-T, and N-N addresses (30 bits each). 212 bits per fetch address entry]
• Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate whether the sequence hits a branch)
• To make one more level of prediction
  – Need to cache another 8 addresses (i.e. total = 14 addresses)
  – 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8

Caching Non-Consecutive Basic Blocks
• High Fetch Bandwidth + Low Latency
[Figure: blocks in the order BB3 BB1 BB5 BB2 BB4, captioned "Fetch in Conventional Instruction Cache"; blocks in the order BB1 BB2 BB3 BB4 BB5, captioned "Fetch in Linear Memory Location"]

Trace Cache
• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)
[Figure: comparison of I$ fetch (5 cycles), collapsing buffer fetch (3 cycles), and T$ fetch (1 cycle) for the instruction sequence A B C D E F G H I J]

Trace Cache [Rotenberg Bennett Smith MICRO’96]
[Figure: trace cache line with fields Tag, Br flag, Br mask, Fall-thru Address, Taken Address, holding up to M branches (BB1, BB2, BB3) and N instructions. The fetch address and the MBP outputs (Branch 1, Branch 2, Branch 3) are checked for a T.C. hit; a line fill buffer is used on a T.C. miss. Example values: Br flag 10 (1st Br taken, 2nd Br not taken); Br mask 11, 1 (11: 3 branches; 1: the trace ends w/ a branch)]
• Cache at most (in original paper)
  – M branches OR (M = 3 in all follow-up TC studies due to MBP)
  – N instructions (N = 16 in all follow-up TC studies)
• Fall-thru address if last branch is predicted not taken

Trace Hit Logic
[Figure: fetch address A and the multiple-branch predictor outputs (T, N) are checked against a trace cache line with Tag A, BF 10, Mask 11,1, Fall-thru X, Target Y]
[Figure, continued: "Match 1st Block" AND "Match Remaining Block(s)" gives "Trace hit"; the logic also produces the "Next Fetch Address"]

Trace Cache Example
BB Traversal Path: ABDABDACDABDACDABDAC
[Figure: control-flow graph with blocks A (5 insts), B (6 insts), C (12 insts), D (4 insts), and Exit]
Cond 1: 3 branches
Cond 2: Fill a trace cache line (16 instructions)
Cond 3: Exit
Trace Cache (5 lines)
    A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
    A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
    C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
    B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
    C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4

Trace Cache Example (continued)
Trace Cache is Full

Trace Cache Example (continued)
How many hits?
What is the utilization?

Redundancy
• Duplication
  – Note that instructions only appear once in I-Cache
  – Same instruction appears many times in TC
• Fragmentation
  – If 3 BBs < 16 instructions
  – If multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop “trace construction”.
  – Empty slots are wasted resources
• Example
  – A single BB is broken up into (ABC), (BCD), (CDA), (DAB)
  – Duplicating each instruction 3 times
[Figure: blocks A (6 insts), B (4 insts), C (6 insts), D (3 insts); the trace cache holds (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst]

Indexability
• TC saved traces (EAC) and (BCD)
• Path: (EAC) to (D)
  – Cannot index interior block (D)
• Can cause duplication
• Need partial matching
  – (BCD) is cached, if (BC) is needed
[Figure: control-flow graph over blocks A, B, C, D, E, G; trace cache holding (EAC) and (BCD)]

Pentium 4 (NetBurst) Trace Cache
[Figure: front end with Front-end BTB, iTLB and Prefetcher, L2 Cache, Decoder, Trace $, and Trace $ BTB; decoded instructions flow on to rename, execute, etc.]
No I$ !!
Trace-based prediction (predict next-trace, not next-PC)
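
Worked check of the Redundancy example
The duplication count on the Redundancy slide (blocks A, B, C, D of 6, 4, 6, and 3 instructions) can be reproduced with a short sketch. It is not the trace-fill logic of any real design. It assumes the four blocks execute as a loop A, B, C, D, A, ... (the slide lists only the resulting traces), that every block ends in exactly one branch, and that a trace holds whole blocks only. The function name build_traces and its parameters are hypothetical; the limits of 3 branches and 16 instructions are the M and N values quoted earlier in the lecture.

```python
from collections import Counter


def build_traces(path, block_sizes, max_branches=3, max_insts=16):
    """Greedily cut a basic-block path into traces.

    Assumptions (not from the slides): each block ends in one branch,
    and a trace only ever holds whole blocks.
    """
    traces = []
    current, insts = [], 0
    for block in path:
        size = block_sizes[block]
        # Close the open trace if this block would exceed either limit.
        if current and (len(current) == max_branches or insts + size > max_insts):
            traces.append(tuple(current))
            current, insts = [], 0
        current.append(block)
        insts += size
    if current:
        traces.append(tuple(current))
    return traces


block_sizes = {"A": 6, "B": 4, "C": 6, "D": 3}
path = "ABCD" * 3  # assumed: three iterations of the loop A, B, C, D

traces = build_traces(path, block_sizes)
unique = sorted(set(traces))

# Instructions stored per distinct trace, keyed by the trace's block names.
trace_lengths = {"".join(t): sum(block_sizes[b] for b in t) for t in unique}

# Number of distinct traces in which each block (and so each instruction) appears.
copies = Counter(b for t in unique for b in t)

print(trace_lengths)
print(dict(copies))
```

Under these assumptions the path is cut into (ABC), (DAB), (CDA), (BCD), so every block appears in three distinct traces and the trace cache stores 57 instructions for a 19-instruction loop body.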