Dynamically Trading Frequency for
Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi,
Grigorios Magklis, Michael L. Scott
University of Rochester
The gist of the paper…
• Radical idea: Trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
• The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
Application phase behavior
• Varying behavior over time
• Can exploit to save power
[Plot: for gcc with an adaptive issue queue, L2 misses, L1I misses, L1D misses, branch mispredictions, energy per interval, and IPC vary across execution intervals]
[Sherwood, Sair, Calder, ISCA 2003]
[Buyuktosunoglu, et al., GLSVLSI 2001]
What about performance?
RAM relative delay, by entries:
32 → 1.0   24 → 0.77   16 → 0.52   8 → 0.31
CAM relative delay, by entries:
32 → 1.0   24 → 0.77   16 → 0.55   8 → 0.34
Lower power and faster access time!
[Buyuktosunoglu, GLSVLSI 2001]
What about performance?
How do we exploit the faster speed?
• Variable latency
• Increase frequency when downsizing
• Decrease frequency when upsizing
What about performance?
[Block diagram: a fully synchronous microarchitecture: Fetch Unit with L1 I-Cache and Br Pred; Dispatch, Rename, ROB; integer and FP issue queues with ALUs & RF; Ld/St Unit with L1 D-Cache; L2 Cache; Main Memory; one global clock]
[Albonesi, ISCA 1998]
What about performance?
[Bar chart: Avg TPI (ns), 0.0 to 1.2, for the Best Conventional vs. Process-level Adaptive issue queue across m88ksim, gcc, compress, li, ijpeg, perl, vortex, airshed, stereo, radar, appcg, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, and the average]
[Albonesi, ISCA 1998]
Enter GALS…
[Block diagram: the microarchitecture partitioned into clock domains: Front-end (Fetch Unit, L1 I-Cache, Br Pred, Dispatch/Rename/ROB), Integer (Issue Queue, ALUs & RF), FP (Issue Queue, ALUs & RF), Memory (Ld/St Unit, L1 D-Cache, L2 Cache), and External (Main Memory)]
[Semeraro et al., HPCA 2002]
[Iyer and Marculescu, ISCA 2002]
Outline
• Motivation and background
• Adaptive GALS microarchitecture
• Control mechanisms
• Evaluation methodology
• Results
• Conclusions and future work
Adaptive GALS microarchitecture
[Block diagram: the MCD domains as above, with the resizable structures (L1 I-Cache, Br Pred, integer and FP issue queues, L1 D-Cache, L2 Cache) drawn as partitioned A/B arrays]
Adaptive GALS operation
[Block diagram: the same adaptive GALS microarchitecture, shown as the resizable structures and domain frequencies are adjusted during operation]
Resizable cache organization
• Access the A partition first, then the B partition on a miss
• Swap the A and B blocks on an A miss, B hit
• Select the A/B split according to application phase behavior (see the sketch below)
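A minimal sketch of the A/B access policy described above, assuming each set keeps its tags in MRU order; the `CacheSet` class and its method names are illustrative, not from the paper:

```python
# Sketch of the A/B resizable-cache access policy (illustrative, not the
# paper's hardware): each set keeps its tags in MRU order; the first a_ways
# entries form the fast A partition, the remainder form the B partition.

class CacheSet:
    def __init__(self, ways, a_ways):
        self.blocks = [None] * ways   # tags, index 0 = MRU, index ways-1 = LRU
        self.a_ways = a_ways          # current size of the A partition

    def access(self, tag):
        """Return 'hitA', 'hitB', or 'miss', updating MRU order."""
        if tag in self.blocks[:self.a_ways]:       # probe the A partition first
            self._to_mru(tag)
            return "hitA"
        if tag in self.blocks[self.a_ways:]:       # probe B only on an A miss
            self._to_mru(tag)                      # B hit: block swaps into A
            return "hitB"
        self.blocks.pop()                          # miss: evict the LRU block
        self.blocks.insert(0, tag)                 # fill the new block as MRU
        return "miss"

    def _to_mru(self, tag):
        self.blocks.remove(tag)
        self.blocks.insert(0, tag)
```

Moving a B hit to the MRU position performs the A/B block swap, so recently used blocks migrate into the fast A partition.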
Resizable cache control
Example accesses (MRU state, positions 0 1 2 3 = MRU to LRU):
A B C D   access B (hit at position 1): MRU[1]++  →  B A C D
B A C D   access C (hit at position 2): MRU[2]++  →  C B A D
C B A D   access C (hit at position 0): MRU[0]++  →  C B A D
C B A D   access D (hit at position 3): MRU[3]++

• Config A1 B3: hitsA = MRU[0]                             hitsB = MRU[1] + MRU[2] + MRU[3]
• Config A2 B2: hitsA = MRU[0] + MRU[1]                    hitsB = MRU[2] + MRU[3]
• Config A3 B1: hitsA = MRU[0] + MRU[1] + MRU[2]           hitsB = MRU[3]
• Config A4 B0: hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]  hitsB = 0

• Calculate the cost for each possible configuration (see the sketch below):
  A access costs = (hitsA + hitsB + misses) * CostA
  B access costs = (hitsB + misses) * CostB
  Miss access costs = misses * CostMiss
  Total access cost = A + B + Miss (normalized to frequency)
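A minimal sketch of this cost evaluation for a 4-way cache, assuming per-MRU-position hit counters and a miss count have been accumulated over an interval; `cost_a`, `cost_b`, `cost_miss`, and `freq_scale` are illustrative parameters, not values from the paper:

```python
# Sketch: pick the A/B split for a 4-way set-associative cache from the
# MRU-position hit counters gathered during one interval (illustrative).

def best_split(mru, misses, cost_a, cost_b, cost_miss, freq_scale):
    """mru[i]       : hits whose block sat at MRU position i (i = 0..3)
       freq_scale[a]: relative cycle time implied by an A partition of a ways
                      (smaller A -> faster clock -> smaller scale factor)."""
    accesses = sum(mru) + misses
    best_a, best_cost = None, float("inf")
    for a in range(1, 5):                       # candidate configs A1B3 .. A4B0
        hits_b = sum(mru[a:])                   # hits that fall through to B
        cost = (accesses * cost_a               # every access probes A
                + (hits_b + misses) * cost_b    # A misses also probe B
                + misses * cost_miss)           # true misses go to the next level
        cost *= freq_scale[a]                   # normalize to the domain frequency
        if cost < best_cost:
            best_a, best_cost = a, cost
    return best_a
```

`freq_scale` can be any dict or list indexed by the A-partition size; all latencies and scale factors here are placeholders.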
Resizable issue queue control
• Measures the exploitable ILP for each queue size
• A timestamp counter is reset at the start of an interval and incremented each cycle
• During rename, a destination register is given a timestamp based on the timestamp + execution latency of its slowest source operand
• The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64)
• ILP is estimated as N / MAX_N
• The queue size with the highest ILP (normalized to frequency) is selected (see the sketch below)
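A minimal sketch of this estimator under one plausible reading of the mechanism (per-register dataflow timestamps computed at rename, with MAX_N tracked over the first N fetched instructions); the instruction format, latency handling, and `freq` normalization are illustrative assumptions:

```python
# Sketch: estimate the exploitable ILP for each candidate issue-queue size
# from rename-time dependence timestamps, then pick the size whose
# frequency-normalized ILP is highest (illustrative, one sample only).

def pick_queue_size(interval, freq, sizes=(16, 32, 48, 64)):
    """interval : list of (dest_reg, src_regs, exec_latency) in fetch order
       freq[n]  : relative domain frequency permitted by a queue of n entries"""
    ts = {}                            # per-register dataflow timestamp
    max_ts = {n: 1 for n in sizes}     # MAX_N for each candidate size
    for i, (dest, srcs, lat) in enumerate(interval, start=1):
        # the destination is ready once its slowest source has completed
        ts[dest] = max((ts.get(s, 0) for s in srcs), default=0) + lat
        for n in sizes:
            if i <= n:                 # MAX_N covers the first N fetched insts
                max_ts[n] = max(max_ts[n], ts[dest])
    # ILP(N) = N / MAX_N; favor the size with the best frequency-normalized ILP
    return max(sizes, key=lambda n: (min(n, len(interval)) / max_ts[n]) * freq[n])
```

Here a single sample of up to 64 instructions stands in for the repeated sampling the hardware would perform over an interval.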
Resizable hardware – some details
Front-end domain
• Icache “A”: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
• Branch predictor sized with Icache
  – gshare PHT: 16KB–64KB
  – Local BHT: 2KB–8KB
  – Local PHT: 1024 entries
  – Meta: 16KB–64KB
Load/store domain
• Dcache “A”: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
• L2 cache “A” sized with Dcache
  – 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
Integer and floating point domains
• Issue queue: 16, 32, 48, or 64 entries
Evaluation methodology
• SimpleScalar and Cacti
• 40 benchmarks from SPEC, Mediabench, and Olden
• Baseline: the best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
• Adaptive MCD costs imposed:
  – Additional branch penalty of 2 integer domain cycles and 1 front end domain cycle (overpipelined)
  – Frequency penalty of as much as 31%
• Mean PLL locking time of 15 µsec
• Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
• Phase-Adaptive: use the online cache and issue queue control mechanisms
Performance improvement
[Bar chart: performance improvement (y-axis -10% to 50%) of Program Adaptive and Phase Adaptive over the baseline, for SPEC (wupwise, vpr, vortex, twolf, parser, mesa, gzip, gcc, galgel, equake, eon, crafty, bzip2, art, apsi), Olden (tsp, treeadd, power, perimeter, mst, health, em3d, bisort, bh), and Mediabench (mpeg2 decode/encode, mesa texgen/osdemo/mipmap, ghostscript, gsm decode/encode, g721 decode/encode, jpeg decompress/compress, epic decode/encode, adpcm decode/encode)]
Phase behavior – art
[Plot: integer issue queue entries (16, 32, 48, 64) chosen over a 100 million instruction window]
Phase behavior – apsi
[Plot: Dcache “A” size (32KB, 64KB, 128KB, 256KB) chosen over a 100 million instruction window]
Performance summary
• Program Adaptive: 17% performance improvement
• Phase Adaptive: 20% performance improvement
  – Automatic
  – Never degrades performance for the 40 applications
  – Few phases in the chosen application windows, so we could perhaps do better
Distribution of chosen configurations for Program Adaptive:
Integer IQ:  16 entries – 85%,  32 – 5%,  48 – 5%,  64 – 5%
FP IQ:       16 entries – 73%,  32 – 15%,  48 – 8%,  64 – 5%
D/L2 Cache:  32KB/256KB – 50%,  64KB/512KB – 18%,  128KB/1MB – 23%,  256KB/2MB – 10%
Icache:      16KB – 55%,  32KB – 18%,  48KB – 8%,  64KB – 20%
Domain frequency versus IQ size
[Plot: relative domain frequency (0.4 to 1.8) versus issue queue size (16, 32, 48, 64)]
Conclusions
• Application phase behavior can be exploited to improve performance in addition to power savings
• The GALS approach is key to localizing the impact of slowing the clock
• The cache and queue control mechanisms can evaluate all possible configurations within a single interval
• The phase-adaptive approach improves performance by as much as 48% and by an average of 20%
Future work
• Explore multiple adaptive structures in each domain
• Better account for the branch predictor
• Resize the instruction cache by sets rather than ways
• Explore better issue queue design alternatives
• Build circuits
• Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores
Dynamically Trading Frequency for
Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi,
Grigorios Magklis, Michael L. Scott
University of Rochester