PPT - Microarch.org

Systematic Energy Characterization of CMP/SMT
Processor Systems via Automated Micro-Benchmarks
R. Bertran*+, A. Buyuktosunoglu*, M. Gupta*, M. Gonzalez+, P. Bose*
*IBM T.J. Watson Research Center
+Barcelona Supercomputing Center
MICRO 2012
Tuesday, December 4, 2012
© 2012 IBM Corporation
Barcelona Supercomputing Center
Why do we need micro-benchmarks?
What is the maximum power consumption?
Any performance bug?
Any reliability issues?
…
Micro-benchmarks!
▪ Time-consuming and tedious
– Error-prone task
• Trial and error process
– Several micro-benchmarks are required
▪ Deep expertise limited to few designers
– Detailed knowledge of the underlying architecture is required
MicroProbe:
a micro-benchmark generation framework
MicroProbe Workflow
Inputs
▪ User micro-benchmark generation policy, for example:
– Endless loop for each instruction of the ISA
– Endless loop with 50% INT and 50% FP
– Max-power stressmark
▪ Architecture definition files
MicroProbe Framework
Outputs
▪ Micro-benchmarks
External tools
▪ Real platforms
▪ Simulators
▪ Models
MicroProbe: Distinguishing Features
[Table: features compared between previous works and MicroProbe; the previous-works column marks some features "(manual)" or "(no)".]
▪ ISA queries
– Instruction type
– Operand length, binary codification, etc.
▪ Micro-architecture queries
– Functional unit, latency, throughput, energy per instruction, average instruction power, etc.
▪ Micro-architecture models
– Set-associative cache model
▪ Code generation
– Skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass
– Configurable passes
▪ Design space exploration
– Integrated
– GA-based search
– Exhaustive search
– Customizable search
MicroProbe Usage and Design Overview
Research idea
Micro-benchmark generation policies (user-defined scripts), for example:
▪ Loop stressing the floating point unit
▪ Sequence of loads hitting 50% L1 and 50% L2
▪ Generate a stressmark for each functional unit of the architecture
▪ Search for the sequence of 2 loads and 2 integer operations with maximum IPC
MicroProbe Framework (Python API)
▪ Architecture module
– ISA definitions
– Micro-architecture definitions
– Micro-architecture analytical models
▪ Code generation module
– Micro-benchmark synthesizer
– Passes
▪ Design space exploration module
– Search drivers
▪ Automatic bootstrap process
▪ Properties
▪ External tools
Micro-benchmarks
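The policy "sequence of loads hitting 50% L1 and 50% L2" can be illustrated with a small set-associative cache model. The sketch below is not the MicroProbe API: `SetAssocCache`, `half_l1_half_l2_stream` and `classify` are hypothetical names, and the 8-way associativity, 128-byte lines and LRU replacement are assumptions (only the 32 KB L1 and 256 KB L2 sizes come from the slides). It alternates one "hot" line that stays in L1 with a cycle of lines that conflict in a single L1 set but still fit in L2.

```python
from collections import OrderedDict

class SetAssocCache:
    """Minimal LRU set-associative cache model (hypothetical helper)."""
    def __init__(self, size_bytes, ways, line_bytes):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss, insert the line (evicting LRU)."""
        line = addr // self.line_bytes
        lru = self.sets[line % self.num_sets]
        if line in lru:
            lru.move_to_end(line)
            return True
        if len(lru) >= self.ways:
            lru.popitem(last=False)  # evict the least recently used line
        lru[line] = True
        return False

def half_l1_half_l2_stream(l1, n_loads):
    """Alternate an L1-resident load with a load that thrashes one L1 set."""
    hot = l1.line_bytes                        # maps to L1 set 1
    stride = l1.num_sets * l1.line_bytes       # same L1 set every stride
    thrash = [i * stride for i in range(l1.ways + 1)]  # ways+1 lines in set 0
    stream = []
    for i in range(n_loads // 2):
        stream.append(hot)
        stream.append(thrash[i % len(thrash)])
    return stream

def classify(stream, l1, l2):
    """Count where each load is served: L1, L2, or memory."""
    counts = {"L1": 0, "L2": 0, "MEM": 0}
    for addr in stream:
        if l1.access(addr):
            counts["L1"] += 1
        elif l2.access(addr):
            counts["L2"] += 1
        else:
            counts["MEM"] += 1
    return counts

# Assumed geometry: 8 ways and 128-byte lines are illustrative choices.
l1 = SetAssocCache(32 * 1024, ways=8, line_bytes=128)
l2 = SetAssocCache(256 * 1024, ways=8, line_bytes=128)
classify(half_l1_half_l2_stream(l1, 18), l1, l2)        # warm-up pass
counts = classify(half_l1_half_l2_stream(l1, 1000), l1, l2)
print(counts)
```

Cycling through one more line than the associativity defeats LRU in that set, which is why every second load misses L1 while the whole footprint remains in L2.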
Max-power Stressmark Generation
Use MicroProbe to generate a max-power stressmark:
▪ Characterize energy per instruction (EPI) and IPC (Architecture Module)
▪ Select N instructions with max (IPC * EPI)
– Example: mulldo, xvnmsubmdp, lxvw4x
▪ Form a basic endless loop (e.g. 4K) using the selected instructions (Code Generation Module)
▪ Generate micro-benchmarks with different orders of the selected N instructions
– The loop variants contain the same instructions in different orders
▪ Evaluate using the Design Space Exploration Module
▪ Pick the highest power micro-benchmark
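The selection and ordering steps can be sketched in plain Python. This is not MicroProbe code: the function names are hypothetical, `measure_power` stands in for a real measurement on hardware, and choosing the best instruction per functional unit is an assumption made here. The IPC and globally normalized EPI values are the nine listed for FXU, LSU and VSU in Case Study 1; with them, the per-unit choice yields the three instructions named on this slide.

```python
from itertools import permutations

# name -> (functional unit, core IPC, globally normalized EPI),
# copied from the Case Study 1 slide.
PROFILE = {
    "mulldo":     ("FXU", 1.40, 2.60),
    "subf":       ("FXU", 2.00, 1.69),
    "addic":      ("FXU", 2.00, 1.00),
    "lxvw4x":     ("LSU", 1.68, 2.88),
    "lvewx":      ("LSU", 1.68, 2.81),
    "lbz":        ("LSU", 1.68, 2.14),
    "xvnmsubmdp": ("VSU", 2.00, 2.35),
    "xvmaddadp":  ("VSU", 2.00, 2.31),
    "xstsqrtdp":  ("VSU", 2.00, 1.32),
}

def select_per_unit(profile):
    """Pick, for each functional unit, the instruction with max IPC * EPI.

    Selecting per unit is an assumption of this sketch.
    """
    best = {}
    for name, (unit, ipc, epi) in profile.items():
        score = ipc * epi
        if unit not in best or score > best[unit][1]:
            best[unit] = (name, score)
    return {unit: name for unit, (name, _) in best.items()}

def candidate_loops(instructions):
    """Enumerate every ordering of the selected instructions."""
    return [list(order) for order in permutations(instructions)]

def pick_highest_power(loops, measure_power):
    """Return the loop body for which measure_power reports the most power.

    measure_power is a hypothetical callback (e.g. run the loop and read a
    power sensor); nothing here measures real hardware.
    """
    return max(loops, key=measure_power)

selected = select_per_unit(PROFILE)
loops = candidate_loops(sorted(selected.values()))
```

With three selected instructions there are only six orderings, so exhaustive evaluation is cheap; larger N would call for a search driver such as the GA-based one listed among the framework features.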
MicroProbe:
A Micro-benchmark Generation Framework
CASE STUDIES
Experimental Methodology
▪ Platform:
– Processor: POWER7 @ 3 GHz
• 8-core, 4-way SMT
• 32 KB L1, 256 KB L2 and 4 MB L3 per core
– Memory: 32 GB DDR3 SDRAM @ 800 MHz
– OS: RHEL 5.7 + Linux 3.0.1
– EnergyScale architecture
• Power measurements in milliwatts
• Sampling interval as short as 1 ms
▪ In-house software collects power and performance counter traces [C. Lefurgy et al., IBM]
Case Study 1: EPI Characterization
Functional units      Instruction  Core IPC  Normalized EPI  Normalized EPI
                                             (global)        (category)
FXU                   mulldo       1.40      2.60            2.60
FXU                   subf         2.00      1.69            1.69
FXU                   addic        2.00      1.00            1.00
LSU                   lxvw4x       1.68      2.88            1.35
LSU                   lvewx        1.68      2.81            1.31
LSU                   lbz          1.68      2.14            1.00
VSU                   xvnmsubmdp   2.00      2.35            1.78
VSU                   xvmaddadp    2.00      2.31            1.75
VSU                   xstsqrtdp    2.00      1.32            1.00
Simple Integer Operations
FXU or LSU            add          3.50      1.73            1.49
FXU or LSU            nor          3.50      1.58            1.36
FXU or LSU            and          3.50      1.16            1.00
Integer Memory Operations
LSU and FXU           ldux         1.00      5.12            1.21
LSU and FXU           lwax         1.00      5.01            1.18
LSU and FXU           lfsu         1.00      4.24            1.00
LSU and 2FXU          lhaux        1.00      5.51            1.15
LSU and 2FXU          lwax         1.00      5.29            1.10
LSU and 2FXU          lhaux        1.00      4.80            1.00
Vector/Float/Decimal memory operations
LSU and VSU           stxvw4x      0.48      8.36            1.40
LSU and VSU           stxsdx       0.48      7.16            1.20
LSU and VSU           stfd         0.48      5.97            1.00
LSU and VSU and FXU   stfsux       0.48      10.00           1.19
LSU and VSU and FXU   stfdux       0.48      9.49            1.13
LSU and VSU and FXU   stfdu        0.48      8.40            1.00
▪ High differences in EPI across instructions stressing different micro-architecture components
▪ High differences in EPI across instructions stressing the same micro-architecture components and at the same rate (IPC)
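The two normalized-EPI columns appear to be related: each category value matches the global value divided by the smallest global value in the same group. The slides do not state this rule, so treat it as an observation from the listed numbers. A minimal check for the LSU group, using a hypothetical helper name:

```python
def normalize_within_category(global_epi):
    """Divide each global EPI by the group's minimum, rounded to 2 decimals."""
    base = min(global_epi.values())
    return {name: round(value / base, 2) for name, value in global_epi.items()}

# Globally normalized EPI values of the LSU group from the slide.
lsu_global = {"lxvw4x": 2.88, "lvewx": 2.81, "lbz": 2.14}
lsu_category = normalize_within_category(lsu_global)
print(lsu_category)  # {'lxvw4x': 1.35, 'lvewx': 1.31, 'lbz': 1.0}
```

The result reproduces the category-normalized LSU values reported on the slide (1.35, 1.31, 1.00).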
Case Study 2: Max-power Stressmark Generation
Four ways to build a max-power stressmark are compared:
▪ DAXPY: use a computationally intensive kernel (DAXPY loops)
▪ Expert manual: use complex instructions stressing different units
– Loop: mullw, xvmaddadp, lxvd2x, …
▪ Expert DSE: generate all possible combinations of complex instructions accessing different functional units with high IPC
– Selected instructions: mullw, xvmaddadp, lxvd2x
▪ MicroProbe: use MicroProbe
– Heuristic: Max(EPI * IPC)
– Selected instructions: mulldo, xvnmsubmdp, lxvw4x
Max-power Stressmark Generation
[Figure: Max-power results. Normalized power (min, mean, max; y-axis from 0.6 to 1.2) for each method: DAXPY, Expert Manual, Expert DSE, MicroProbe.]
Case Study 3: Counter-based Processor Power Model
Bottom-up power modeling method (linear regression):
1. CMP1–SMT1: functional unit micro-benchmarks model the dynamic power, f(PMCs); random micro-benchmarks give the SMT1 intercept
2. CMP1–SMT2/4: random micro-benchmarks give the SMT effect (SMT2-4 intercept)
3. CMP1/8–SMT2/4: random micro-benchmarks give the CMP effect, f(CMP), and the uncore power
Model:
$$\text{Power} = \sum_{c=1}^{\#\text{cores}} \left( \sum_{t=1}^{\#\text{threads}} \text{Dynamic Power}\, f(\text{PMCs}) + \text{SMT effect} \times \text{SMT enabled} \right) + \text{CMP effect} \times \#\text{cores} + \text{Uncore power}$$
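The model on this slide adds, for every core, the per-thread dynamic power plus an SMT term, then a CMP term proportional to the number of cores, then the uncore power. A minimal sketch of that evaluation follows. All names and coefficient values are hypothetical, the per-thread dynamic power is assumed to be a linear combination of counter rates (the slide only says "Linear Regression"), and "SMT enabled" is interpreted as "more than one thread on the core", which is also an assumption.

```python
def dynamic_power(pmc_rates, weights):
    """Per-thread dynamic power as a weighted sum of counter rates (assumed linear)."""
    return sum(weights[name] * rate for name, rate in pmc_rates.items())

def processor_power(thread_power_per_core, smt_effect, cmp_effect, uncore_power):
    """Evaluate the bottom-up model for one sample.

    thread_power_per_core: one list per active core, holding the dynamic
    power of each thread running on that core.
    """
    total = uncore_power
    for threads in thread_power_per_core:
        smt_enabled = 1 if len(threads) > 1 else 0  # assumed meaning of "SMT enabled"
        total += sum(threads) + smt_effect * smt_enabled
    total += cmp_effect * len(thread_power_per_core)
    return total

# Illustrative numbers only; they are not measurements from the slides.
weights = {"fxu_ops": 2.0, "lsu_ops": 4.0}
thread_a = dynamic_power({"fxu_ops": 0.5, "lsu_ops": 0.0}, weights)
thread_b = dynamic_power({"fxu_ops": 0.0, "lsu_ops": 0.5}, weights)
thread_c = dynamic_power({"fxu_ops": 0.5, "lsu_ops": 0.5}, weights)
power = processor_power(
    [[thread_a, thread_b], [thread_c]],
    smt_effect=0.5, cmp_effect=0.25, uncore_power=10.0,
)
print(power)
```

Keeping the SMT, CMP and uncore terms separate mirrors the bottom-up training order: each term is fitted from micro-benchmarks that vary only the corresponding configuration dimension.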
Counter-based Processor Power Model
Validation
[Figure: Model accuracy results on SPEC CPU2006. % error (y-axis from 0 to 10) per CMP - SMT configuration (1-1, 1-2, 1-4, 2-1, 2-2, 2-4, 4-1, 4-2, 4-4, 6-1, 6-2, 6-4, 8-1, 8-2, 8-4, Mean) for four models: Micro trained, Random trained, SPEC trained, Proposed.]
▪ Within acceptable error margins: < 4% on average
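The slides report model error as a percentage but do not define the metric. A common choice, assumed here, is the mean absolute percentage error between measured and predicted power; the function name and sample values are illustrative only.

```python
def mean_abs_pct_error(measured, predicted):
    """Mean of |predicted - measured| / measured, expressed in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / m * 100.0 for m, p in zip(measured, predicted)]
    return sum(errors) / len(errors)

# Illustrative power samples (not data from the slides).
error = mean_abs_pct_error([100.0, 200.0], [104.0, 190.0])
print(error)  # 4.5
```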
Counter-based Processor Power Model
Validation on Corner Cases
[Figure: Model accuracy results. % error (y-axis from 0 to 20) per validation set (FXU High, FXU Low, L1 Loads, Main Memory, VSU High, VSU Low, Mean) for four models: Micro trained, Random trained, SPEC trained, Proposed. One bar exceeds the axis and is labeled 62%.]
▪ Models trained using non-micro-architecture aware training sets show high errors and variability
▪ Models trained using the micro-architecture aware training set show acceptable error margins: < 5% on average
Conclusions
▪ MicroProbe is a productive micro-benchmark generation framework
– Adaptive and flexible
– Includes micro-architecture semantics
– Integrates design space exploration
▪ Presented three case studies:
– Instruction-based EPI characterization
– Automated max-power stressmark generation
– CMP/SMT-aware bottom-up counter-based processor power model
MicroProbe:
A Micro-benchmark Generation Framework
QUESTIONS?