It’s all about latency Henk Neefs Dept. of Electronics and Information Systems (ELIS)

University of Gent

Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions

Out-of-order processor pipeline
[Figure: pipeline with I-cache, fetch, decode, rename, instruction window, ‘future’ register file, execution units (LD, ST, INT), in-order retirement and architectural register file]

Branch latency
[Figure: instruction stream (ST, XOR, LD, OR, BR, ADD, ...) passing through the pipeline (I-cache, fetch, decode, rename, instruction window, ‘future’ register file, execution units LD, ST, BR) along a time axis, with the branch latency marked]

Eliminate branch latency
• By prediction: predict outcome of branch => eliminate dependency (with a high probability)
• By predication: convert control dependency to data dependency => eliminate control dependency

Load latency
    while (pointer!=0) pointer = pointer.next;

    Loop: LD  R1, R1(32)
          BNE R1, Loop

load latency = 2 cycles
branch latency = 1 cycle
CPI = 2 cycles/2 instructions = 1 cycle/instruction
[Figure: LD and BNE occupancy of the execution units against cycles]

When longer load latency
• When L1-cache misses and L2-cache hits:
  load latency = 2+6 cycles
  branch latency = 1 cycle
  CPI = 8 cycles/2 instructions = 4 cycles/instruction
• When L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  CPI = 34 cycles/instruction
[Figure: LD and BNE occupancy of the execution units against cycles]

Memory hierarchy
[Figure: execution units, register file, L1 cache, L2 cache, main memory, hard drive, ordered by storage capacity and latency]

L1 cache latency
[Figure: IPC versus instruction window size (#instructions, 0 to 300), with curves for latency = 2, latency = 3 and latency = 4. IPC = Instructions Per clock Cycle, 1 GHz processor, spec95 programs]

Main memory latency
[Figure: IPC versus main memory latency (ns, 0 to 100). IPC = Instructions Per clock Cycle, 1 GHz processor, spec95 programs]

Performance and latency

Interconnect type               Sensitivity of performance to latency decrease (% per ns)
Processor core/register file    39
Processor/L1-cache              19
L1-cache/L2-cache               3.0
L2-cache/main memory            0.18

performance change = sensitivity * load latency change

Increase performance by
• eliminating/reducing load latency:
  – By prefetching: predict the next miss and fetch the data to e.g. L1-cache
  – By address prediction: address known earlier => load executed earlier => data early in register file
• or reducing sensitivity to load latency:
  – by fine-grain multithreading

Some prefetch techniques
• Stride prefetching: search for pattern with constant stride
  20 31 42 53 64 (stride: 11)
  e.g. walking through a matrix (row- or column-order)
• Markov prefetching: recurring patterns of misses
  miss history: 10 110 15 12 …
  prediction: 100 ...

Stride prefetching
[Figure: IPC versus main memory latency (ns, 70 to 90), with prefetching and with no prefetching. IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress]

Prefetching and sensitivity
Factors of “performance sensitivity to latency” increase with stride-prefetching:
• to L1-prefetching: 1.6 (L1-cache/L2-cache), 4.1 (L2-cache/main memory)
• to L2-prefetching: 2.5

Latency is important: generalization to other processor architectures
Consider schedule of program:
[Figure: program schedule along a time axis]
Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency important whatever processor architecture

Optical interconnects (OI)
• Mature components:
  – Vertical-Cavity Surface Emitting Lasers (VCSELs)
  – Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?

OI in processor context
• At levels close to processor core, latency is very important => latency of OI determines how far OI penetrates in the memory hierarchy
• What is the latency of an optical interconnect?
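
The arithmetic on the "Load latency", "When longer load latency" and "Performance and latency" slides can be sketched in a few lines of Python. This is only an illustration: the function and dictionary names are made up for this sketch, the 1-cycle branch is assumed to overlap with the load (as the slides' CPI figures imply), and applying the sensitivity figures linearly to an arbitrary latency change is an assumption that probably holds only for small changes.

```python
# Sensitivities copied from the "Performance and latency" slide (% per ns).
SENSITIVITY_PCT_PER_NS = {
    "processor core/register file": 39.0,
    "processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def pointer_chase_cpi(load_latency_cycles, instructions_per_iteration=2):
    """CPI of the LD/BNE pointer-chasing loop.

    Each load needs the result of the previous load, so one iteration
    takes one full load latency. The branch is assumed to overlap with it.
    """
    return load_latency_cycles / instructions_per_iteration


def performance_change_pct(interconnect, latency_decrease_ns):
    """performance change = sensitivity * load latency change.

    A positive argument is a latency decrease (performance gain);
    a negative argument is a latency increase (performance loss).
    Linear use of the sensitivity is an assumption of this sketch.
    """
    return SENSITIVITY_PCT_PER_NS[interconnect] * latency_decrease_ns


if __name__ == "__main__":
    # The three cases from the slides: L1 hit, L2 hit, main memory hit.
    for label, latency in [("L1 hit", 2), ("L2 hit", 2 + 6), ("memory hit", 2 + 6 + 60)]:
        print(f"{label}: load latency {latency} cycles, CPI {pointer_chase_cpi(latency)}")

    # Hypothetical example: 1 ns of extra latency at each level.
    for name in SENSITIVITY_PCT_PER_NS:
        print(f"{name}: {performance_change_pct(name, -1.0):+.2f} % for +1 ns")
```

Under these assumptions the same extra nanosecond costs far more near the core than at main memory, which is why the latency of an optical link decides how deep into the hierarchy it can be used.
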
An optical link
[Figure: buffer/modulation/bias, LED/VCSEL, fiber or light conductor, receiver diode, transimpedance amplifier]
Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency

VCSEL characteristics
• A small semiconductor laser
• Carrier density should be high enough for lasing action
[Figure: optical output (mW) and carrier density versus current (mA)]

Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of VCSEL and pads
• Threshold carrier density build up
• From low optical output to final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency

Total optical link latency
[Figure: latency (ns) @ 1 mW for LED and VCSEL links in 0.6 µm and 0.25 µm CMOS, split into buffer, parasitics, threshold, intrinsic, receiver and TOF (10 cm)]

Latency as function of power
[Figure: latency (ns) versus optical output power (mW) for LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm) and VCSEL (0.25 µm)]

Conclusions
• When combining performance sensitivity and optical latency we conclude:
  – optical interconnects are feasible to main memory and for multiprocessors
  – for interconnects close to processor core, optical interconnects have too high latency with present (telecom) devices, drivers and receivers
    => but now evolution to lower latency devices, drivers and receivers is taking place...

For more information on the presented results:
Henk Neefs, Latentiebeheersing in processors, PhD Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs
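
As a rough companion to the total-latency sum on the "An optical link" slide, the sketch below adds up the four terms and estimates the time of flight for the 10 cm link used in the chart. The function names are invented for this sketch, the refractive index of 1.5 is an assumed typical value for a glass light conductor, and the component latencies in the usage example are placeholders, not the measured values from the slides.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_ns(length_m, refractive_index=1.5):
    """Time for light to cross a conductor of the given length.

    refractive_index=1.5 is an assumption, not a value from the slides.
    """
    return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def total_link_latency_ns(buffer_ns, emitter_ns, tof_ns, receiver_ns):
    """Total latency = buffer + VCSEL/LED + time of flight + receiver."""
    return buffer_ns + emitter_ns + tof_ns + receiver_ns


if __name__ == "__main__":
    tof = time_of_flight_ns(0.10)  # 10 cm link
    print(f"time of flight over 10 cm: {tof:.2f} ns")

    # Placeholder component latencies, chosen only to show the call.
    total = total_link_latency_ns(buffer_ns=0.5, emitter_ns=1.0,
                                  tof_ns=tof, receiver_ns=0.5)
    print(f"total link latency (placeholder inputs): {total:.2f} ns")
```

With the assumed refractive index, the time of flight over 10 cm comes to about half a nanosecond, so the electrical and device terms in the sum can matter as much as the optical path itself.
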
