b/w - Department of Computer Science

Anne Bracy
CS 3410
Computer Science
Cornell University
The slides are the product of many rounds of teaching CS 3410 by
Professors Weatherspoon, Bala, Bracy, and Sirer.
See P&H Chapter: Appendix B
Complex question
• How fast is the processor?
• How fast does your application run?
• How quickly does it respond to you?
• How fast can you process a big batch of jobs?
• How much power does your machine use?
Latency (execution time): time to finish a fixed task
Throughput (bandwidth): # of tasks in fixed time
• Different: exploit parallelism for throughput, not latency (e.g., bread)
• Often contradictory (latency vs. throughput)
– Will see many examples of this
• Use definition of performance that matches your goals
– Scientific program: latency; web server: throughput?
Car: speed = 60 miles/hour, capacity = 5
Bus: speed = 20 miles/hour, capacity = 60
Task: transport passengers 10 miles

        Latency (min)    Throughput (PPH, passengers per hour)
Car
Bus
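The car/bus comparison can be worked through in a few lines. This is a minimal sketch, not the slide's official answer: the function names are hypothetical, and the throughput figure assumes each vehicle must drive back empty before carrying its next load (the `round_trip` flag). Dropping that assumption doubles both throughput numbers but does not change which vehicle wins.

```python
def latency_min(distance_miles, speed_mph):
    """One-way travel time in minutes for a single trip."""
    return 60 * distance_miles / speed_mph


def throughput_pph(distance_miles, speed_mph, capacity, round_trip=True):
    """Passengers delivered per hour by one vehicle running continuously.

    round_trip=True is an assumption: the vehicle returns empty
    before it can carry the next load.
    """
    legs = 2 if round_trip else 1
    return capacity * speed_mph / (legs * distance_miles)


# Values taken from the slide: 10-mile task, car vs. bus.
for name, speed, capacity in [("Car", 60, 5), ("Bus", 20, 60)]:
    print(name, latency_min(10, speed), throughput_pph(10, speed, capacity))
```

Under these assumptions the car has the lower latency and the bus has the higher throughput, which is the latency vs. throughput tension described above.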
[Figure: single-cycle datapath with PC, +4 adder, I$, register file (s1, s2, d), and D$]
Single-cycle datapath: true “atomic” fetch/execute loop
Fetch, decode, execute one instruction/cycle
+ Low CPI (see later slides): 1 by definition
– Long clock period: to accommodate slowest instruction
(PC → I$ → RF → ALU → D$ → RF)
[Figure: multi-cycle datapath with PC, +4 adder, I$, register file (s1, s2, d), D$, and elements labeled A, O, B, D]
Multi-cycle datapath: attacks slow clock
Fetch, decode, execute one insn over multiple cycles
Allows insns to take different numbers of cycles (main point)
± Opposite of single-cycle: short clock period, high CPI
Single-cycle
• Clock period = 50 ns, CPI = 1
• Performance = 50 ns/insn
Multi-cycle: opposite performance split
+ Shorter clock period
– Higher CPI
Example
• branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
• Clock period = 11 ns, CPI = (20% * 3) + (20% * 5) + (60% * 4) = 4
– Why is clock period 11 ns and not 10 ns?
• Performance = 44 ns/insn
Aside: CISC makes perfect sense in multi-cycle datapath
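The single-cycle vs. multi-cycle comparison above can be reproduced as a short calculation. This is a back-of-the-envelope sketch using only the slide's numbers; the helper names `average_cpi` and `ns_per_insn` are hypothetical.

```python
def average_cpi(mix):
    """Weighted-average CPI; mix is a list of (fraction, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


def ns_per_insn(clock_period_ns, cpi):
    """Average time spent per instruction, in nanoseconds."""
    return clock_period_ns * cpi


# Single-cycle: 50 ns clock, CPI of 1 by definition.
single_cycle = ns_per_insn(50, 1)

# Multi-cycle: 11 ns clock; mix is branch, load, ALU from the slide.
multi_cycle_mix = [(0.20, 3), (0.20, 5), (0.60, 4)]
multi_cycle = ns_per_insn(11, average_cpi(multi_cycle_mix))

print(single_cycle, multi_cycle, single_cycle / multi_cycle)
```

The last printed value is the speedup of the multi-cycle design over the single-cycle one for this particular instruction mix; a different mix would give a different answer.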
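Program runtime:
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$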
Instructions per program: “dynamic instruction count”
• Runtime count of instructions executed by the program
• Determined by program, compiler, ISA
Cycles per instruction: “CPI” (typical range: 2 to 0.5)
• How many cycles does an instruction take to execute?
• Determined by program, compiler, ISA, micro-architecture
Seconds per cycle: clock period, length of each cycle
• Inverse metric: cycles/second (Hertz) or cycles/ns (GHz)
• Determined by micro-architecture, technology parameters
For lower latency (= better performance) minimize all three
• Difficult: often pull against one another
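To see how the three factors can pull against one another, here is a small sketch of the runtime product. All numbers are hypothetical and chosen only for illustration: a baseline program, and a variant that executes half as many instructions but at a 1.5x higher CPI.

```python
def runtime_seconds(insn_count, cpi, clock_hz):
    """seconds/program = insns/program * cycles/insn * seconds/cycle."""
    return insn_count * cpi / clock_hz


CLOCK_HZ = 1_000_000_000  # hypothetical 1 GHz clock

# Hypothetical baseline: one billion dynamic instructions at CPI 2.
baseline = runtime_seconds(1_000_000_000, 2.0, CLOCK_HZ)

# Hypothetical variant: half the instructions, but each costs more cycles.
variant = runtime_seconds(500_000_000, 3.0, CLOCK_HZ)

print(baseline, variant)
```

In this made-up case the lower instruction count wins despite the worse CPI, but only the full product decides; looking at any single factor would not.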
CPI: cycles per instruction, on average
• IPC = 1/CPI
– Used more frequently than CPI
– Favored because “bigger is better”, but harder to compute with
• Different instructions have different cycle costs
– E.g., “add” typically takes 1 cycle, “divide” takes >10 cycles
• Depends on relative instruction frequencies
CPI example
• Program has equal ratio: integer, memory, floating point
• Cycles per insn type: integer = 1, memory = 2, FP = 3
• What is the CPI? (1/3 * 1) + (1/3 * 2) + (1/3 * 3) = 2
• Caveat: this sort of calculation ignores many effects
– Back-of-the-envelope arguments only
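The CPI example is an instance of a weighted average. Written generally, with $f_i$ as the fraction of dynamic instructions of type $i$ and $c_i$ as the cycles that type takes (both symbols are introduced here for illustration and are not on the slide):

$$\text{CPI} = \sum_i f_i \, c_i = \tfrac{1}{3}\cdot 1 + \tfrac{1}{3}\cdot 2 + \tfrac{1}{3}\cdot 3 = 2, \qquad \text{IPC} = \frac{1}{\text{CPI}} = 0.5$$

The same caveat applies: this treats each instruction's cost as independent of its neighbors, so it is a back-of-the-envelope estimate only.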
Assume a processor with instruction frequencies and costs
• Integer ALU: 50%, 1 cycle
• Load: 20%, 5 cycles
• Store: 10%, 1 cycle
• Branch: 20%, 2 cycles
Which change would improve performance more?
A: “Branch prediction” to reduce branch cost to 1 cycle?
B: “Cache” to reduce load cost to 3 cycles?
Compute CPI

        INT    LD    ST    BR    CPI
Base
A
B
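One way to fill in the CPI table is to compute the weighted sum for each configuration. The sketch below uses the frequencies and cycle costs given above; the dictionary layout and the helpers `cpi` and `with_cost` are hypothetical names chosen for this example.

```python
# (frequency, cycles) per instruction class, from the slide.
BASE = {"INT": (0.50, 1), "LD": (0.20, 5), "ST": (0.10, 1), "BR": (0.20, 2)}


def cpi(mix):
    """Weighted-average cycles per instruction for a mix."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cost(mix, kind, cycles):
    """Copy of mix with one instruction class given a new cycle cost."""
    changed = dict(mix)
    changed[kind] = (mix[kind][0], cycles)
    return changed


option_a = with_cost(BASE, "BR", 1)  # branch prediction: branches 2 -> 1
option_b = with_cost(BASE, "LD", 3)  # cache: loads 5 -> 3

for label, mix in [("Base", BASE), ("A", option_a), ("B", option_b)]:
    print(label, round(cpi(mix), 2))
```

With these numbers the base CPI is 2.0, option A gives 1.8, and option B gives 1.6, so the cache helps more: loads are as frequent as branches here, but the cache saves two cycles per load while branch prediction saves only one per branch.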
1 Hertz = 1 cycle/second
1 GHz = 1 cycle/nanosecond, 1 GHz = 1000 MHz
General public (mostly) ignores CPI
• Equates clock frequency with performance!
Which processor would you buy?
• Processor A: CPI = 2, clock = 5 GHz
• Processor B: CPI = 1, clock = 3 GHz
• Probably A, but B is faster (assuming same ISA/compiler)
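The claim that B is faster follows from dividing CPI by clock frequency. A minimal sketch, assuming the same ISA and compiler so that instruction counts match; `time_per_insn_ns` is a hypothetical helper name.

```python
def time_per_insn_ns(cpi, clock_ghz):
    """Average nanoseconds per instruction: cycles/insn divided by cycles/ns."""
    return cpi / clock_ghz


processor_a = time_per_insn_ns(cpi=2, clock_ghz=5)
processor_b = time_per_insn_ns(cpi=1, clock_ghz=3)

print(processor_a, processor_b)
```

Processor A spends about 0.4 ns per instruction and B about 0.33 ns, so B finishes the same program sooner despite its lower clock frequency.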
Classic example
• 800 MHz Pentium III faster than 1 GHz Pentium 4!
• Example: Core i7 faster clock-per-clock than Core 2
• Same ISA and compiler!
Meta-point: danger of partial performance metrics!
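(Micro)architects often ignore dynamic instruction count
• Typically have one ISA, one compiler → treat it as fixed
CPU performance equation becomes
Latency:
$$\frac{\text{seconds}}{\text{insn}} = \frac{\text{cycles}}{\text{insn}} \times \frac{\text{seconds}}{\text{cycle}}$$
Throughput:
$$\frac{\text{insn}}{\text{second}} = \frac{\text{insn}}{\text{cycle}} \times \frac{\text{cycles}}{\text{second}}$$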
MIPS (millions of instructions per second)
• Cycles/second: clock frequency (in MHz)
• Ex: CPI = 2, clock = 500 MHz → 0.5 * 500 MHz = 250 MIPS
Pitfall: may vary inversely with actual performance
– Compiler removes insns, program faster, but lower MIPS
– Work per instruction varies (multiply vs. add, FP vs. integer)
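The MIPS pitfall can be made concrete with a small calculation. The 500 MHz clock and CPI of 2 come from the example above; the instruction counts and the optimized CPI are hypothetical, chosen to model a compiler that removes mostly cheap instructions so the remaining ones cost more cycles on average.

```python
def mips(cpi, clock_mhz):
    """Millions of instructions per second: clock rate divided by CPI."""
    return clock_mhz / cpi


def runtime_s(insn_count, cpi, clock_mhz):
    """Program runtime in seconds."""
    return insn_count * cpi / (clock_mhz * 1e6)


CLOCK_MHZ = 500

# Hypothetical unoptimized build: 100 million insns at CPI 2.
base_mips = mips(2, CLOCK_MHZ)
base_time = runtime_s(100_000_000, 2, CLOCK_MHZ)

# Hypothetical optimized build: 60 million insns, average CPI rises to 3.
opt_mips = mips(3, CLOCK_MHZ)
opt_time = runtime_s(60_000_000, 3, CLOCK_MHZ)

print(base_mips, base_time)
print(opt_mips, opt_time)
```

In this made-up case the optimized build runs faster while reporting a lower MIPS figure, which is why MIPS alone is a partial metric.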
Decrease latency
Critical Path
[Figure: inputs arrive, pass through combinatorial logic with delay t_combinatorial, and outputs are expected]
• Longest path determining the minimum time needed for an operation
• Determines minimum length of clock cycle, i.e. determines maximum clock frequency
Goal: Make Multi-Cycle @ 30 MHz CPU (15 MIPS) run 2x faster by making arithmetic instructions faster
Instruction mix (for P):
• 25% load/store, CPI = 3
• 60% arithmetic, CPI = 2
• 15% branches, CPI = 1
What is CPI?
Goal: Make processor run 2x faster (15 → 30 MIPS)
Try: Arithmetic 2 → 1?
(2 → X: what would X have to be?)
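The questions on this slide can be checked numerically. The sketch below uses the instruction mix above and interprets "2x faster" as halving the average CPI at a fixed 30 MHz clock, which is an assumption; `cpi` and `required_cycles` are hypothetical helper names.

```python
# (frequency, cycles) per instruction class, from the slide.
MIX = {"load/store": (0.25, 3), "arithmetic": (0.60, 2), "branch": (0.15, 1)}


def cpi(mix):
    """Weighted-average cycles per instruction."""
    return sum(freq * cycles for freq, cycles in mix.values())


def required_cycles(mix, kind, target_cpi):
    """Cycle cost that class `kind` needs for the mix to hit target_cpi."""
    freq, _ = mix[kind]
    others = sum(f * c for name, (f, c) in mix.items() if name != kind)
    return (target_cpi - others) / freq


base_cpi = cpi(MIX)

# Try: arithmetic 2 -> 1 cycle.
faster_arith = dict(MIX, arithmetic=(0.60, 1))
speedup_try = base_cpi / cpi(faster_arith)

# What would X have to be for a full 2x speedup?
x = required_cycles(MIX, "arithmetic", base_cpi / 2)

print(base_cpi, speedup_try, x)
```

The base CPI comes out to about 2.1 (30 MHz / 2.1 is roughly 14.3 MIPS, which the slide rounds to 15). Halving the arithmetic cost only yields about a 1.4x speedup, and a full 2x would need arithmetic instructions to average about 0.25 cycles each, well below one cycle per instruction.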
Amdahl’s Law
$$\text{Execution time after improvement} = \frac{\text{execution time affected by improvement}}{\text{amount of improvement}} + \text{execution time unaffected}$$
Or: Speedup is limited by popularity of improved feature
Corollary: build a balanced system
• Don’t optimize 1% to the detriment of other 99%
• Don’t over-engineer capabilities that cannot be utilized
Caveat: Law of diminishing returns
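Amdahl's Law and its diminishing returns can be sketched directly from the formula above. The function names and the example numbers (60% of execution time affected) are hypothetical, chosen only to illustrate the shape of the curve.

```python
def time_after(affected, unaffected, improvement):
    """Execution time after speeding up the affected part by `improvement`x."""
    return affected / improvement + unaffected


def speedup(fraction_affected, improvement):
    """Overall speedup when a fraction of execution time is improved."""
    return 1 / ((1 - fraction_affected) + fraction_affected / improvement)


# Hypothetical program: 60 time units affected, 40 unaffected.
print(time_after(60, 40, 2))

# Diminishing returns: larger improvements approach, but never reach,
# the bound 1 / (1 - fraction_affected).
for factor in (2, 4, 16, 1024):
    print(factor, round(speedup(0.60, factor), 3))
```

With 60% of the time affected, no improvement factor can push the overall speedup to 2.5x, because the unaffected 40% still has to run at its original speed.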