An Integrated Hardware/Software Approach to On-Line Power-Performance Optimization

Sandhya Dwarkadas
University of Rochester
Collaborators at UR: David Albonesi, Chen Ding, Eby Friedman, Michael L. Scott, UR Systems and Architecture groups
Collaborators at IBM: Pradip Bose, Alper Buyuktosunoglu, Calin Cascaval, Evelyn Duesterwald, Hubertus Franke, Zhigang Hu, Bonnie Ray, Viji Srinivasan

Outline
Framework: Dynamically Tunable Clustered Multithreaded Architecture
Motivation: Workload characterization
Architectural support for adaptation
Role of program analysis
Resource-aware operating system support
Conventional Processor Design

[Figure: conventional monolithic superscalar processor: branch predictor, I-cache, rename & dispatch, issue queue, register file, and functional units. Large structures, slower clock speed.]
Emerging Trends
• Wire delays and faster clocks will necessitate aggressive use of clustering
• Larger transistor budgets and low cluster design costs will enable addition of more clusters incrementally
• There is a trend toward multithreading to exploit the transistor budget for improved throughput by combining ILP and TLP
Combine clustering and multithreading?
A Clustered Processor

[Figure: clustered processor: a shared front end (branch predictor, I-cache, fetch queue, rename map) dispatches to per-cluster issue queues, register files, and functional units (integer, memory/LSQ, floating point), backed by the L1 D-cache and a shared unified L2 cache. Small structures, faster clock speed. But, high latency for some instructions.]

A Clustered Multithreaded (CMT) Architecture

[Figure: CMT architecture: multiple hardware thread contexts (per-thread ROB and rename map) share the clustered back end of integer, memory, and floating-point clusters, each with its own issue queue and register file.]
Components of IPC Degradation

[Chart: components of IPC degradation (including communication) for the ilp4, com4, and mix4 workloads under monolithic, clustered, centralized-FU, no-communication-penalty, and centralized-register-file configurations.]

Overall Energy Impact

[Chart: relative power breakdown (result bus, ALU, D-cache, I-cache, register files, LSQ, branch predictor, ROB, clock, issue queue RAM, rename) during single-thread execution for SMT, fixed thread distributions (TD:8, TD:4, TD:2, TD:1; 2 links), and dynamic configurations.]
Problems
• Tradeoff in communication vs. parallelism for a single thread
• Increased communication delays and contention when employing multiple threads
  – Reduced performance
  – Increased energy consumption

Goal:
Intelligent mapping of applications to resources for improved throughput and resource utilization as well as reduced energy
Communication vs Parallelism

[Figure: instruction window example. With 4 clusters (about 100 active instructions), only nearby ready instructions are in flight; with 8 clusters (about 200 active instructions), distant parallelism (distant instructions that are ready to execute) can also be exploited.]
Single-Thread Adaptation [ISCA'03]
• Dynamic interval-based exploration can adapt to available instruction-level parallelism in a single thread
  – Determine when communication can no longer be tolerated in exploiting additional clusters
• Allow remaining clusters to be turned off to reduce power consumption or to be used by a different thread/application
An Integrated Approach to Dynamic Tuning of the CMT
• Architectural design and dynamic configuration for fine-grain adaptation
• Program analysis to determine application behavior
• Runtime support to match predicted application behavior and resource requirements with available resources
  – Resource-aware thread scheduling for maximum throughput and fairness
  – Runtime support for balancing ILP with TLP in parallel application environments

Results with Interval-Based Scheme (ISCA'03)

[Chart: instructions per cycle (IPC) for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and their harmonic mean under fixed 4-cluster, fixed 16-cluster, and interval-based adaptive configurations.]

Overall improvement: 11%
Multithreaded Adaptation
• Basic scheme
  – Interval-based
  – Fixed 100,000 cycles
  – Exploration-based
  – Hysteresis to avoid spurious changes

Out-of-Order Dispatch & Fetch Gating
• In-Order Dispatch (dispatch stall)
  [Figure: dispatch queue, head to tail: T6 T3 T4 T3 T6 T8 T1 T2 T5 T2 T1 T7 T4 T2 (Tx = thread id); instructions blocked from dispatch stall the ready instructions behind them.]
• Out-of-Order Dispatch (dispatch from T6), sketched below
  [Figure: the same queue, but ready instructions (e.g., from T6) dispatch past instructions that are blocked from dispatch.]
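To make the contrast concrete, a minimal software sketch follows. It models the policy only, not the hardware logic; the queue contents, blocked-thread set, and dispatch width are illustrative.

```python
# In-order dispatch stops at the first instruction whose thread is blocked;
# out-of-order dispatch skips blocked threads and keeps finding ready work.
# `queue` holds (thread_id, instruction) pairs from head to tail.

def dispatch_in_order(queue, blocked, width):
    """In-order dispatch: stall as soon as a blocked thread reaches the head."""
    dispatched = []
    for tid, instr in queue:
        if tid in blocked or len(dispatched) == width:
            break
        dispatched.append((tid, instr))
    return dispatched

def dispatch_out_of_order(queue, blocked, width):
    """Out-of-order dispatch: skip instructions from blocked threads."""
    dispatched = []
    for tid, instr in queue:
        if len(dispatched) == width:
            break
        if tid not in blocked:
            dispatched.append((tid, instr))
    return dispatched

queue = [("T6", 0), ("T3", 1), ("T4", 2), ("T3", 3), ("T6", 4), ("T8", 5)]
print(dispatch_in_order(queue, {"T3"}, 4))      # stalls after T6's first instruction
print(dispatch_out_of_order(queue, {"T3"}, 4))  # dispatches T6, T4, T6, T8
```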
Thread to Cluster Assignment

[Chart: IPC for the ilp4, com4, mix4, ilp8, com8, and mix8 workloads under monolithic, TD:4 + SD + FG, TD:2 + SD + FG, and adaptive thread-to-cluster assignments.]

Thread to Cache Bank Assignment

[Chart: IPC (N_WAY = 8) for the com_4_I, com_4_m, ilp_4_f, ilp_4_m, mix1_4_m, and mix2_4_m workloads under SMT and CMT configurations with 1, 2, 4, or 8 cache banks.]
ILP vs. TLP

[Charts: IPC vs. number of threads (1 to 8) for LU and Jacobi, comparing Ideal WB, WB, WT, Shared, Centralized, and SMT configurations.]
A Dynamically Tunable Clustered Multithreaded (DT-CMT) Architecture

[Figure: DT-CMT architecture: a clustered multithreaded pipeline (shared front end; integer, memory, and floating-point clusters; shared unified L2 cache) whose resources can be repartitioned among threads at run time.]

Current Approaches to Adaptation

Reactive: adaptive change is triggered after observed variation in program behavior.

[Flowchart: inspect counters; if there is a phase change, explore configurations, record CPIs, and pick the best configuration; if not, remain at the present configuration.]

Success depends on ability to repeat behavior across successive intervals (the controller is sketched below).
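A minimal sketch of this reactive, interval-based controller follows. It is an illustration only: `next_interval_cpi` and `set_config` are hypothetical hooks into the performance counters and configuration registers, and the configuration list and threshold are placeholder values rather than the evaluated settings.

```python
# Reactive controller: on a flagged phase change, explore each configuration
# for one interval, record the CPIs, and settle on the best; otherwise remain
# at the present configuration. Hysteresis comes from re-exploring only when
# a phase change is flagged.

CONFIGS = [1, 2, 4, 8]          # e.g., number of active clusters (placeholder)
PHASE_CHANGE_THRESHOLD = 0.10   # fractional CPI change that flags a phase change

def run_controller(next_interval_cpi, set_config, num_intervals):
    """next_interval_cpi(): run one interval and return its CPI (hypothetical hook).
    set_config(cfg): apply a hardware configuration (hypothetical hook)."""
    best_cfg = CONFIGS[-1]
    set_config(best_cfg)
    last_cpi = next_interval_cpi()
    for _ in range(num_intervals):
        cpi = next_interval_cpi()
        if abs(cpi - last_cpi) / last_cpi > PHASE_CHANGE_THRESHOLD:
            cpis = {}
            for cfg in CONFIGS:                  # explore configurations
                set_config(cfg)
                cpis[cfg] = next_interval_cpi()  # record CPIs
            best_cfg = min(cpis, key=cpis.get)   # pick the best configuration
            set_config(best_cfg)
            cpi = cpis[best_cfg]
        last_cpi = cpi                           # else remain at present configuration
```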
Interval Length

Problem:
• Small interval lengths can result in noisy measurements
• Unstable behavior across intervals

Solution:
• Start with minimum allowed interval length
• If phase changes are too frequent, double the interval length – find a coarse enough granularity such that behavior is consistent
• Periodically reset interval length to the minimum (see the sketch below)

Varied Interval Lengths

Benchmark   Instability factor for a 10K interval length   Minimum acceptable interval length and its instability factor
gzip        4%                                             10K / 4%
vpr         14%                                            320K / 5%
crafty      30%                                            320K / 4%
parser      12%                                            40M / 5%
swim        0%                                             10K / 0%
mgrid       0%                                             10K / 0%
galgel      1%                                             10K / 1%
cjpeg       9%                                             40K / 4%
djpeg       31%                                            1280K / 1%

Instability factor: Percentage of intervals that flag a phase change
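A sketch of this interval-length policy is below. The 5% instability threshold, measurement window, and reset period are illustrative assumptions, not the reported settings.

```python
# Double the interval length while behavior is too unstable, and periodically
# reset it to the minimum so finer-grained behavior can be rediscovered.

MIN_INTERVAL = 10_000     # cycles: minimum allowed interval length
MAX_INSTABILITY = 0.05    # tolerate phase-change flags in at most 5% of intervals
WINDOW = 100              # recent intervals over which instability is measured
RESET_PERIOD = 10_000     # intervals between resets to the minimum length

def next_interval_length(length, recent_flags, intervals_elapsed):
    """recent_flags: one boolean per recent interval, True if that interval
    flagged a phase change. Returns the interval length to use next."""
    if intervals_elapsed % RESET_PERIOD == 0:
        return MIN_INTERVAL                        # periodic reset to the minimum
    window = recent_flags[-WINDOW:]
    instability = sum(window) / max(len(window), 1)
    if instability > MAX_INSTABILITY:
        return length * 2                          # too unstable: coarsen the interval
    return length                                  # behavior consistent: keep the length
```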
Characterizing Program Behavior Variability
• Whole program instrumentation (currently SPEC2k)
• Periodic hardware performance counter sampling using Ticker
  – Dynamic Probe Class Library (DPCL) to insert a timer-based interrupt in the program
  – Performance Monitoring API (PMAPI) to read the hardware counters
  – AIX-based
• Sampling interval of 10 msec
• Examination of IPC, L1D cache miss rate, instruction mix, branch mispredict rate
• Statistical analysis – correlation, frequency analysis, behavior variation (frequency analysis sketched below)
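For instance, the frequency-analysis step can be sketched as follows. The IPC trace here is synthetic; in the study the samples come from the hardware counters every 10 ms.

```python
# Find the strongest periodic component in a trace of 10 ms IPC samples.
import numpy as np

def dominant_period(samples, sample_interval_s=0.010):
    """Return the period (seconds) of the strongest non-DC frequency component."""
    centered = samples - np.mean(samples)           # remove the DC component
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(samples), d=sample_interval_s)
    peak = np.argmax(spectrum[1:]) + 1              # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Synthetic example: IPC oscillating with a 0.5 s period plus noise.
t = np.arange(2000) * 0.010
ipc = 1.0 + 0.2 * np.sin(2 * np.pi * t / 0.5) + 0.05 * np.random.randn(t.size)
print(f"dominant period = {dominant_period(ipc):.2f} s")   # about 0.50
```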
Example IPC Plots (SPEC2k: bzip2)
• Existence of macro phase behavior
• Significant behavior variation even at coarse granularities
• Strong frequency components/periodicity across several metrics

Example IPC Plots (SPEC2k: art)
• High rate of behavior variation from one measurement to the next

Similarity Across Metrics (SPEC2k: bzip2)

Comparing Frequency Spectra (bzip2, art)
• Strong low (bzip2) and high (art) frequency components, indicating high rate of repeatability

Program Behavior Variability
• Variation in behavior, while different, persists across different sampling interval sizes
Important Behavior Characteristics
• Programs exhibit high degrees of repeatability across all
metrics
• Rate of behavior repeatability (periodicity) across metrics
is highly similar
• Variation in behavior from one interval to the next can be
high
• Variation in behavior, while different, persists across
different sampling interval sizes
On-Line Program Behavior Prediction
• Linear (statistical) predictors to exploit behavior in the immediate past (sketched below)
  – Last value
  – Average(N)
  – Mode(N)
• Table-based predictors to exploit periodicity (non-linear)
  – Run-length encoded
  – Fixed-size history
• Cross-metric predictors to exploit similarity across metrics
  – Use one metric to predict several potentially different metrics
  – Efficiently combine multiple predictors
On-line power-performance optimization needs to be predictive rather than reactive
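Minimal sketches of the linear predictors, assuming the metric history is available as a list of per-interval values; the names and the quantization step used by Mode(N) are illustrative.

```python
from collections import Counter

def last_value(history):
    return history[-1]

def average(history, n):
    window = history[-n:]
    return sum(window) / len(window)

def mode(history, n, precision=0.1):
    # Quantize to the chosen precision so that nearly equal values vote together.
    window = [round(v / precision) for v in history[-n:]]
    return Counter(window).most_common(1)[0][0] * precision

history = [1.20, 1.22, 0.80, 1.21, 1.19]   # e.g., IPC per interval
print(last_value(history), average(history, 4), mode(history, 4))
```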
Table-Based Predictors
• E.g., a table-based and asymmetric predictor: the history of recent quantized values a_{t-4} a_{t-3} a_{t-2} a_{t-1} indexes a table entry holding the predictions (a_t, b_t) and vote values (a_vote, b_vote) (see the sketch below)
• Default to last value during learning period
• Use a voting mechanism to update table entries
  – Prediction (a_t or b_t) is updated with the mode of: the actual value (vote) the last time this history was encountered, the current prediction, and the measured value at the end of the interval
• Encoding and length of history (index) can be varied
  – Fixed size or run-length encoded
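A simplified, single-metric sketch of such a predictor follows. The asymmetric two-metric case is omitted, and the history length, precision, and exact voting detail are best-effort readings of the description above rather than the implemented design.

```python
from collections import Counter

class TablePredictor:
    def __init__(self, history_len=4, precision=0.1):
        self.history_len = history_len
        self.precision = precision
        self.history = []              # quantized values of the metric so far
        self.table = {}                # history tuple -> {"pred": q, "last_seen": q}

    def _quantize(self, value):
        return round(value / self.precision)

    def predict(self, last_measured):
        key = tuple(self.history[-self.history_len:])
        entry = self.table.get(key)
        # Default to the last measured value while this history is still being learned.
        return entry["pred"] * self.precision if entry else last_measured

    def update(self, measured):
        q = self._quantize(measured)
        key = tuple(self.history[-self.history_len:])
        entry = self.table.setdefault(key, {"pred": q, "last_seen": q})
        # Vote among: the value seen the last time this history occurred, the
        # current prediction, and the value just measured (mode of the three).
        vote = Counter([entry["last_seen"], entry["pred"], q]).most_common(1)[0][0]
        entry["pred"], entry["last_seen"] = vote, q
        self.history.append(q)
```

At each interval boundary a controller would call predict() for the coming interval and update() with the value actually measured.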
Design Trade-offs
• Precision
  – Too coarse a precision implies insensitivity to fine-grained behavior
  – Too fine a precision implies sensitivity to noise
• Size of history
– Too long a history implies a potentially long learning
period
– Too short a history implies inability to distinguish
between common histories of otherwise distinct
regions
• Both precision and history have table size implications
Trade-off between noise tolerance, learning period, and prediction accuracy
Mean IPC Prediction Error (Power3)
Program Behavior Predictability
• Variations in program behavior are predictable
to within a few percent
• Table-based predictors outperform any others
for programs with high variability
• Cross-metric table-based predictors make it
possible to predict multiple metrics using a
single predictor
• Microarchitecture-independent metrics allow
stable prediction even when the predicted metric
changes due to dynamic optimization
Information Space for Workload Analysis

Problems
• High variability in program behavior
• Interval length hard to determine
  – Too small: measurement noise
  – Too large: missed opportunities for adaptation
• Interval and actual phase boundaries do not match
Data Locality Analysis (Shen and Ding)

Program Phase Detection
• Objective: to find basic blocks marking unique phase boundaries
• Phase marker insertion via basic block trace analysis

[Figure: TOMCATV RD signature with phase boundaries]

Similarity of Locality Phases: TOMCATV (5250 Phases)
Similarity of Locality Phases: COMPRESS (52 Phases)
Similarity of Phases with BBV: TOMCATV (2493 Intervals)
Bringing It All Together
• Locality analysis for phase detection and marking of macro phases
• Linear or non-linear (table-based) prediction within each phase for improved learning (sketched below)
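One way to picture the combination is sketched below, with a hypothetical last-value predictor standing in for the per-phase linear or table-based predictor; the marker ids and callback are illustrative.

```python
class LastValuePredictor:
    """Stand-in for a per-phase linear or table-based predictor."""
    def __init__(self):
        self.last = None
    def update(self, measured):
        self.last = measured
    def predict(self):
        return self.last            # None until the phase has been seen once

predictors = {}                      # phase marker id -> predictor for that phase

def on_phase_marker(marker_id, measured_metric):
    """Called when execution reaches a phase marker, with the metric measured
    since the previous marker; returns the prediction for this phase's next run."""
    pred = predictors.setdefault(marker_id, LastValuePredictor())
    pred.update(measured_metric)
    return pred.predict()
```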
Resource-Aware Thread Scheduling
Resource-Aware O/S Scheduler

Multiple levels of thread scheduling: application-level scheduling of application threads, O/S scheduling of O/S threads, and H/W scheduling of H/W threads onto the processor pipeline.

[Figure: processes A and B feed the O/S kernel, where a resource-aware extension to the kernel scheduler(s) uses a counter-based resource model and resource usage prediction, driven by H/W counters and resource counters/sensors (e.g., current temperature), to choose the next task(s).]
Fair Cooperative Scheduling [PPoPP 2001]
• Each process is allocated a piggy-bank of time (set to 1 quantum) from which it can borrow and to which it can add
• The piggy-bank is used to boost a process's priority (with the original purpose of responding to a communication request when notified by a wakeup signal)
• A process can add to the piggy-bank whenever it relinquishes the processor

Adapt the above so the piggy-bank is used to schedule a process earlier than according to priority and replenished, for example, when reconfiguring at a phase marker (sketched below).

Coordinate among schedulers for multiple hardware contexts.
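A sketch of the adapted piggy-bank mechanism follows. The class, boost rule, and selection policy are illustrative; the real mechanism lives inside the kernel scheduler.

```python
QUANTUM = 1.0                       # one scheduling quantum (arbitrary units)

class Process:
    def __init__(self, pid, priority):
        self.pid = pid
        self.priority = priority
        self.bank = QUANTUM         # piggy-bank of time, initially one quantum
        self.boosted = False

    def request_early_schedule(self, cost=QUANTUM):
        """Spend from the bank to run earlier than priority alone would allow,
        e.g., when reconfiguring at a phase marker."""
        if self.bank >= cost:
            self.bank -= cost
            self.boosted = True

    def relinquish(self, unused_time):
        """Deposit unused time back into the bank when yielding the processor early."""
        self.bank += unused_time

def pick_next(ready):
    boosted = [p for p in ready if p.boosted]
    if boosted:                     # a boosted process runs ahead of priority order
        chosen = boosted[0]
        chosen.boosted = False
        return chosen
    return max(ready, key=lambda p: p.priority)
```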
Resource-Aware Thread Scheduling: Other Applications

Power/Thermal management
• Temperature-aware process/thread scheduling to avoid temperature hotspots
  – characterize threads based on expected temperature contribution
  – schedule based on a thread's predicted heat contribution and current temperature

Performance
• Improving L2 bandwidth utilization on a multiprocessor (e.g., the two cores of a Power4)
  – characterize threads based on expected L2 cache accesses
  – avoid scheduling different threads with high L2 access concurrently (see the sketch below)
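Both policies can be folded into one selection step, as in the sketch below; the thresholds and per-thread characterization values are illustrative assumptions.

```python
TEMP_LIMIT = 80.0      # temperature (deg C) above which hot threads are avoided
L2_HIGH = 0.05         # L2 accesses per instruction treated as "high"

def pick_thread(ready, current_temp, corunner_l2_rate):
    """ready: list of (thread_id, expected_heat, expected_l2_rate) tuples,
    from off-line or counter-based characterization."""
    def penalty(t):
        _, heat, l2_rate = t
        score = 0.0
        if current_temp >= TEMP_LIMIT:
            score += heat                 # near the limit, avoid adding heat
        if corunner_l2_rate >= L2_HIGH:
            score += l2_rate              # avoid two L2-heavy threads at once
        return score
    return min(ready, key=penalty)[0]

# Near the thermal limit with an L2-heavy co-runner, prefer a cool,
# compute-bound thread.
ready = [("A", 0.9, 0.08), ("B", 0.3, 0.01), ("C", 0.6, 0.06)]
print(pick_thread(ready, current_temp=82.0, corunner_l2_rate=0.07))   # -> B
```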
Summary: An Integrated Hardware/Software Approach to DT-CMT
• Aggressive clustering and multithreading
requires a whole-system integrated view in order
to maximize resource efficiency
– Architectural configuration support (while
carefully considering circuit-level issues)
– Program analysis
– Runtime/OS support
Application-Level Scheduling
• Provide a framework for
– trading ILP for TLP based on application
characteristics and available resources
– Specifying cache and cluster sharing
configurations at appropriate points
At the JVM level
– Target server workloads
At the level of an API such as OpenMP
– Target scientific/parallel applications
Resource-Aware Thread Scheduling (cont'd)
Performance and Power
• Resource (memory, FU, and temperature)
aware thread scheduling for simultaneous
multithreaded processors (e.g., the Power5, and
the hyper-threads of the Pentium IV) or our
proposed clustered multithreaded architecture
On-Going Projects at UofR
• CAP: Dynamic reconfigurable general-purpose
processor design
• MCD: Multiple Clock Domain Processors
• DT-CMT: Dynamically Tunable Clustered Multithreaded architectures
• InterWeave: 3-level versioned shared state
(predecessors: InterAct and Cashmere)
• ARCH: Architecture, Runtime, and Compiler
Integration for High-Performance Computing
See http://www.cs.rochester.edu/research and
http://www.cs.rochester.edu/~sandhya