It’s all about latency
Henk Neefs
Dept. of Electronics and Information Systems (ELIS)
University of Gent

Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions

Out-of-order processor pipeline
[Figure: I-cache, fetch, decode, rename, instruction window, execution units (LD, ST, INT), ‘future’ register file, in-order retirement, architectural register file]

Branch latency
[Figure: instruction stream (ST, XOR, LD, OR, BR, ADD, ...) passing through the pipeline stages over time, with the latency of the branch (BR) marked]

Eliminate branch latency
• By prediction: predict outcome of branch => eliminate dependency (with a high probability)
• By predication: convert control dependency to data dependency => eliminate control dependency

Load latency
  while (pointer != 0) pointer = pointer.next;

  Loop: LD  R1, R1(32)
        BNE R1, Loop

load latency = 2 cycles
branch latency = 1 cycle
CPI = 2 cycles/2 instructions = 1 cycle/instruction
[Figure: schedule of LD and BNE on the execution units over cycles]

When longer load latency
• When L1-cache misses and L2-cache hits:
  load latency = 2+6 cycles
  branch latency = 1 cycle
  CPI = 8 cycles/2 instructions = 4 cycles/instruction
• When L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  CPI = 34 cycles/instruction
[Figure: schedule of LD and BNE on the execution units over cycles]

Memory hierarchy
[Figure: execution units, register file, L1 cache, L2 cache, main memory, hard drive, ordered along an arrow labeled "storage capacity and latency"]

L1 cache latency
[Figure: IPC (0 to 12) versus instruction window size (0 to 300 instructions); curves for latency = 2, latency = 3 and latency = 4]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs

Main memory latency
[Figure: IPC (3 to 3.6) versus main memory latency (0 to 100 ns)]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs

Performance and latency
  Interconnect type               Sensitivity of performance to
                                  latency decrease (% per ns)
  Processor core/register file    39
  Processor/L1-cache              19
  L1-cache/L2-cache               3.0
  L2-cache/main memory            0.18
performance change = sensitivity * load latency change

Increase performance by
• eliminating/reducing load latency:
  – By prefetching: predict the next miss and fetch the data to e.g. L1-cache
  – By address prediction: address known earlier => load executed earlier => data early in register file
• or reducing sensitivity to load latency:
  – by fine-grain multithreading

Some prefetch techniques
• Stride prefetching: search for pattern with constant stride
  20 31 42 53 64   stride: 11
  e.g. walking through a matrix (row- or column-order)
• Markov prefetching: recurring patterns of misses
  miss history: 10 110 15 12 …
  prediction: 100 ...

Stride prefetching
[Figure: IPC (4.9 to 5.2) versus main memory latency (70 to 90 ns); curves: prefetching, no prefetching]
IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress

Prefetching and sensitivity
Factors of “performance sensitivity to latency” increase with stride-prefetching:
  Interconnect columns: L1-cache/L2-cache, L2-cache/main memory
  to L1-prefetching: 1.6, 4.1
  to L2-prefetching: 2.5

Latency is important: generalization to other processor architectures
Consider schedule of program:
[Figure: program schedule over time]
Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency important whatever processor architecture

Optical interconnects (OI)
• Mature components:
  – Vertical-Cavity Surface Emitting Lasers (VCSELs)
  – Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?

OI in processor context
• At levels close to processor core, latency is very important => latency of OI determines how far OI penetrates in the memory hierarchy
• What is the latency of an optical interconnect?
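
The relation "performance change = sensitivity * load latency change" from the "Performance and latency" slide can be tried out with a minimal sketch such as the one below. The sensitivity values are the ones in the table; the function name, the sign convention (a latency increase gives a negative performance change) and the 1 ns example value are assumptions made for illustration, and the linear estimate is presumably only meaningful for small latency changes.

```python
# Sensitivities from the "Performance and latency" table,
# in % performance per ns of latency decrease.
SENSITIVITY_PCT_PER_NS = {
    "processor core/register file": 39.0,
    "processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def performance_change_pct(interconnect: str, latency_change_ns: float) -> float:
    """Linear estimate: performance change = sensitivity * load latency change.

    Assumed sign convention: a positive latency_change_ns means the load
    latency got longer, so the result is negative (a performance loss).
    """
    return -SENSITIVITY_PCT_PER_NS[interconnect] * latency_change_ns


if __name__ == "__main__":
    extra_ns = 1.0  # hypothetical extra latency added by an interconnect
    for name in SENSITIVITY_PCT_PER_NS:
        change = performance_change_pct(name, extra_ns)
        print(f"{name:30s} {change:+.2f} %")
```

Under this estimate the same added nanosecond costs far more close to the core than between L2-cache and main memory, which is why the latency of an interconnect decides at which level of the hierarchy it can be used.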
An optical link
[Figure: buffer/modulation/bias, LED/VCSEL, fiber or light conductor, receiver diode, transimpedance amplifier]
Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency

VCSEL characteristics
• A small semiconductor laser
• Carrier density should be high enough for lasing action
[Figure: optical output (0 to 2 mW) versus current (0 to 3 mA); curves: optical power, carrier density]

Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of VCSEL and pads
• Threshold carrier density build up
• From low optical output to final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency

Total optical link latency
[Figure: stacked bars of latency (0 to 7 ns) at 1 mW for LED and VCSEL, each with 0.6 µm and 0.25 µm CMOS; components: buffer, parasitics, threshold, intrinsic, receiver, TOF (10 cm)]

Latency as function of power
[Figure: latency (0 to 8 ns) versus optical output power (0 to 6 mW); curves: LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm), VCSEL (0.25 µm)]

Conclusions
• When combining performance sensitivity and optical latency we conclude:
  – optical interconnects are feasible to main memory and for multiprocessors
  – for interconnects close to processor core, optical interconnects have too high latency with present (telecom) devices, drivers and receivers
  => but now evolution to lower latency devices, drivers and receivers is taking place...

For more information on the presented results:
Henk Neefs, Latentiebeheersing in processors, PhD Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs
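
The sum on the "An optical link" slide (total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency) can be sketched as follows. Only the time of flight is computed from physics here; the refractive index of 1.5 is an assumed value for a glass-like light conductor, the function names are hypothetical, and the other component latencies have to be supplied by the reader because the slides give them only as a chart.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def time_of_flight_ns(length_m: float, refractive_index: float = 1.5) -> float:
    """Propagation delay through the fiber or light conductor, in ns.

    refractive_index = 1.5 is an assumption (glass-like medium), not a
    value taken from the slides.
    """
    return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def total_link_latency_ns(buffer_ns: float, emitter_ns: float,
                          tof_ns: float, receiver_ns: float) -> float:
    """Total latency = buffer + VCSEL/LED + time of flight + receiver."""
    return buffer_ns + emitter_ns + tof_ns + receiver_ns


if __name__ == "__main__":
    # 10 cm is the link length used for the TOF component in the slides.
    tof = time_of_flight_ns(0.10)
    print(f"time of flight over 10 cm: {tof:.2f} ns")
```

With the assumed index, 10 cm of light conductor contributes about 0.5 ns, so the remaining latency in a link has to come from the buffer, the emitter and the receiver.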