Anne Bracy
CS 3410
Computer Science
Cornell University

The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer.

See P&H Chapter: Appendix B

Complex question
• How fast is the processor?
• How fast does your application run?
• How quickly does it respond to you?
• How fast can you process a big batch of jobs?
• How much power does your machine use?

Latency (execution time): time to finish a fixed task
Throughput (bandwidth): # of tasks in fixed time
• Different: exploit parallelism for throughput, not latency (e.g., bread)
• Often contradictory (latency vs. throughput)
  – Will see many examples of this
• Use definition of performance that matches your goals
  – Scientific program: latency; web server: throughput?

Car: speed = 60 miles/hour, capacity = 5
Bus: speed = 20 miles/hour, capacity = 60
Task: transport passengers 10 miles

        Latency (min)    Throughput (PPH)
Car
Bus

[Figure: single-cycle datapath: PC, +4 adder, I$, register file (s1, s2, d), D$]

Single-cycle datapath: true “atomic” fetch/execute loop
Fetch, decode, execute one instruction/cycle
+ Low CPI (see later slides): 1 by definition
– Long clock period: to accommodate slowest instruction (PC → I$ → RF → ALU → D$ → RF)

[Figure: multi-cycle datapath: PC, +4 adder, I$, register file (s1, s2, d), A, O, B, D, D$]

Multi-cycle datapath: attacks slow clock
Fetch, decode, execute one insn over multiple cycles
Allows insns to take different numbers of cycles (main point)
± Opposite of single-cycle: short clock period, high CPI

Single-cycle
• Clock period = 50 ns, CPI = 1
• Performance = 50 ns/insn
Multi-cycle: opposite performance split
+ Shorter clock period
– Higher CPI
Example
• branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
• Clock period = 11 ns, CPI = (20% * 3) + (20% * 5) + (60% * 4) = 4
  – Why is clock period 11 ns and not 10 ns?
• Performance = 44 ns/insn
Aside: CISC makes perfect sense in multi-cycle datapath

Program runtime:

$$\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$

Instructions per program: “dynamic instruction count”
• Runtime count of instructions executed by the program
• Determined by program, compiler, ISA
Cycles per instruction: “CPI” (typical range: 2 to 0.5)
• How many cycles does an instruction take to execute?
• Determined by program, compiler, ISA, micro-architecture
Seconds per cycle: clock period, length of each cycle
• Inverse metric: cycles/second (Hertz) or cycles/ns (GHz)
• Determined by micro-architecture, technology parameters
For lower latency (= better performance) minimize all three
• Difficult: often pull against one another

CPI: cycles per instruction, on average
• IPC = 1/CPI
  – Used more frequently than CPI
  – Favored because “bigger is better”, but harder to compute with
• Different instructions have different cycle costs
  – E.g., “add” typically takes 1 cycle, “divide” takes >10 cycles
• Depends on relative instruction frequencies
CPI example
• Program has equal ratio: integer, memory, floating point
• Cycles per insn type: integer = 1, memory = 2, FP = 3
• What is the CPI? (1/3 * 1) + (1/3 * 2) + (1/3 * 3) = 2
• Caveat: this sort of calculation ignores many effects
  – Back-of-the-envelope arguments only

Assume a processor with instruction frequencies and costs
• Integer ALU: 50%, 1 cycle
• Load: 20%, 5 cycles
• Store: 10%, 1 cycle
• Branch: 20%, 2 cycles
Which change would improve performance more?
A: “Branch prediction” to reduce branch cost to 1 cycle?
B: “Cache” to reduce load cost to 3 cycles?
Compute CPI

        INT    LD    ST    BR    CPI
Base
A
B

1 Hertz = 1 cycle/second
1 GHz = 1 cycle/nanosecond, 1 GHz = 1000 MHz
General public (mostly) ignores CPI
• Equates clock frequency with performance!
Which processor would you buy?
• Processor A: CPI = 2, clock = 5 GHz
• Processor B: CPI = 1, clock = 3 GHz
• Probably A, but B is faster (assuming same ISA/compiler)
Classic example
• 800 MHz Pentium III faster than 1 GHz Pentium 4!
• Example: Core i7 faster clock-per-clock than Core 2
• Same ISA and compiler!
Meta-point: danger of partial performance metrics!
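
The CPI questions above (Base vs. A vs. B, and Processor A vs. Processor B) come down to a frequency-weighted average and one division. The following is a minimal sketch, not part of the original slides: the helper names `cpi` and `ns_per_insn` are hypothetical, and it assumes the same back-of-the-envelope model the slides use, which ignores stalls, overlap, and other effects.

```python
def cpi(mix):
    """Frequency-weighted average cycles per instruction.

    mix maps an instruction class to a (frequency, cycles) pair.
    """
    total = sum(freq for freq, _ in mix.values())
    assert abs(total - 1.0) < 1e-9, "frequencies must sum to 1"
    return sum(freq * cycles for freq, cycles in mix.values())


def ns_per_insn(cpi_value, clock_ghz):
    """Average latency per instruction: CPI times clock period.

    A clock of f GHz has a period of 1/f ns.
    """
    return cpi_value / clock_ghz


# Instruction mix from the slide: class -> (frequency, cycles)
base = {"INT": (0.5, 1), "LD": (0.2, 5), "ST": (0.1, 1), "BR": (0.2, 2)}
option_a = {**base, "BR": (0.2, 1)}  # A: branch prediction, branch cost 2 -> 1
option_b = {**base, "LD": (0.2, 3)}  # B: cache, load cost 5 -> 3

cpi_base, cpi_a, cpi_b = cpi(base), cpi(option_a), cpi(option_b)
for name, value in [("Base", cpi_base), ("A", cpi_a), ("B", cpi_b)]:
    print(f"{name}: CPI = {value:.2f}")

# Which processor would you buy? Compare time per instruction, not clock rate.
proc_a = ns_per_insn(2, 5)  # Processor A: CPI = 2, 5 GHz
proc_b = ns_per_insn(1, 3)  # Processor B: CPI = 1, 3 GHz
print(f"Processor A: {proc_a:.3f} ns/insn, Processor B: {proc_b:.3f} ns/insn")
```

Under this model the base CPI is 2.0, option A gives 1.8, and option B gives 1.6, so the cache helps more even though loads are no more frequent than branches: the load cost drops by two cycles rather than one. Processor B likewise finishes an average instruction sooner (about 0.33 ns vs. 0.40 ns) despite its lower clock frequency.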

(Micro)architects often ignore dynamic instruction count
• Typically have one ISA, one compiler → treat it as fixed
CPU performance equation becomes

Latency:
$$\frac{\text{seconds}}{\text{insn}} = \frac{\text{cycles}}{\text{insn}} \times \frac{\text{seconds}}{\text{cycle}}$$

Throughput:
$$\frac{\text{insn}}{\text{second}} = \frac{\text{insn}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$$

MIPS (millions of instructions per second)
• Cycles/second: clock frequency (in MHz)
• Ex: CPI = 2, clock = 500 MHz → 0.5 * 500 MHz = 250 MIPS
Pitfall: may vary inversely with actual performance
  – Compiler removes insns, program faster, but lower MIPS
  – Work per instruction varies (multiply vs. add, FP vs. integer)

Decrease latency
Critical Path

[Figure: combinatorial logic block; t_combinatorial spans from when inputs arrive to when outputs are expected]

• Longest path determining the minimum time needed for an operation
• Determines minimum length of clock cycle, i.e., determines maximum clock frequency

Goal: Make Multi-Cycle @ 30 MHz CPU (15 MIPS) run 2x faster by making arithmetic instructions faster
Instruction mix (for P):
• 25% load/store, CPI = 3
• 60% arithmetic, CPI = 2
• 15% branches, CPI = 1
What is CPI?
Goal: Make processor run 2x faster (15 → 30 MIPS)
Try: Arithmetic 2 → 1?
(2 → X: what would X have to be?)

Amdahl’s Law

$$\text{execution time after improvement} = \frac{\text{execution time affected by improvement}}{\text{amount of improvement}} + \text{execution time unaffected}$$

Or: Speedup is limited by popularity of improved feature
Corollary: build a balanced system
• Don’t optimize 1% to the detriment of other 99%
• Don’t over-engineer capabilities that cannot be utilized
Caveat: Law of diminishing returns
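
Amdahl’s Law can be applied directly to the 30 MHz example. The sketch below is an illustration under stated assumptions rather than the slides’ own worked solution: the function name `time_after_improvement` is hypothetical, “time” is measured in average cycles per instruction at a fixed clock, and “2x faster” is read as halving the overall CPI (the slide rounds the base throughput of 30 / 2.1 ≈ 14.3 MIPS to 15 MIPS).

```python
def time_after_improvement(affected, unaffected, improvement):
    """Amdahl's Law: only the affected share of execution time shrinks."""
    return affected / improvement + unaffected


# Instruction mix from the slide: class -> (frequency, CPI)
mix = {"load/store": (0.25, 3), "arithmetic": (0.60, 2), "branch": (0.15, 1)}
clock_mhz = 30

base_cpi = sum(freq * cycles for freq, cycles in mix.values())
base_mips = clock_mhz / base_cpi

# Split the average cycles per instruction into the part that
# arithmetic accounts for (affected) and everything else (unaffected).
arith_freq, arith_cpi = mix["arithmetic"]
affected = arith_freq * arith_cpi
unaffected = base_cpi - affected

# Try: arithmetic CPI 2 -> 1, i.e. a 2x improvement of the affected part.
cpi_try = time_after_improvement(affected, unaffected, 2)
speedup_try = base_cpi / cpi_try

# Arithmetic CPI X that would be needed to halve the overall CPI.
target_cpi = base_cpi / 2
x = (target_cpi - unaffected) / arith_freq

# Upper bound: speedup if arithmetic instructions cost nothing at all.
max_speedup = base_cpi / unaffected

print(f"base CPI = {base_cpi:.2f} ({base_mips:.1f} MIPS)")
print(f"arithmetic 2 -> 1: CPI = {cpi_try:.2f}, speedup = {speedup_try:.2f}x")
print(f"arithmetic CPI needed for 2x: {x:.2f}")
print(f"best possible speedup from arithmetic alone: {max_speedup:.2f}x")
```

Under these assumptions, halving the arithmetic CPI yields only a 1.4x speedup, and reaching 2x would require an arithmetic CPI of 0.25. Even free arithmetic caps the speedup at about 2.33x, which is the “limited by popularity of improved feature” point in numbers.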