Methodologies for Performance Simulation of
Superscalar Out-of-Order (OOO) Processors
Srinivas Neginhal
Anantharaman Kalyanaraman
CprE 585: Survey Project
1
Architectural Simulators



Explore the design space
    Evaluate existing hardware, or predict the performance of proposed hardware
    The designer has control
Functional simulators:
    Model the architecture (programmer's focus)
    E.g., sim-fast, sim-safe
Performance simulators:
    Model the microarchitecture (designer's focus)
    E.g., cycle-by-cycle (sim-outorder)
2
Simulation Issues


Real applications take too long to simulate cycle by cycle.
Vast design space:
    Design parameters: code properties, value prediction, dynamic instruction distance, basic block size, instruction fetch mechanisms, etc.
    Architectural metrics: IPC/ILP, cache miss rate, branch prediction accuracy, etc.
Find design flaws and provide design improvements
Correctness and accuracy of simulation results
A "fast and robust" simulation methodology is needed.
3
Two Simulation Methodologies

HLS
    Hybrid: statistical + symbolic
    REF: HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. M. Oskin, F. T. Chong and M. Farrens. Proc. ISCA, pp. 71-82, 2000.
BBDA
    Basic block distribution analysis
    REF: Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. T. Sherwood, E. Perelman and B. Calder. Proc. PACT, 2001.
4
HLS: An Overview

A hybrid processor simulator
[Diagram: statistical model + symbolic execution -> HLS -> performance contours spanned by design space parameters]
What can be achieved?
    Explore design changes in architectures and compilers that would be impractical to simulate using conventional simulators
5
HLS: Main Idea
[Diagram: HLS flow]
Large application code
    -> Statistical profiling (sim-fast)
        Code characteristics: basic block size, dynamic instruction distance, instruction mix
        Architecture metrics: cache behavior, branch prediction accuracy
    -> Synthetically generated code/data (instruction stream, data stream)
    -> Structural simulation of functional units (FU) and issue pipeline units (sim-outorder)
6
Statistical Code Generation

Each "synthetic instruction" contains the following parameters, based on the statistical profile:
    Functional unit requirements
    Dynamic instruction distances
    Cache behavior
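
As a rough illustration of the idea above, the following sketch draws synthetic instructions from a statistical profile. The profile values, the field names, and the choice of an exponential distribution for the dynamic instruction distance are all assumptions made for this example; they are not taken from the HLS paper.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical profile: every number here is an illustrative assumption.
PROFILE = {
    "instruction_mix": {"int_alu": 0.50, "load": 0.25, "store": 0.10, "branch": 0.15},
    "mean_did": 4.0,      # mean dynamic instruction distance (DID)
    "l1_hit_rate": 0.95,  # probability that a memory access hits in L1
}
MEMORY_UNITS = {"load", "store"}


@dataclass
class SyntheticInstruction:
    functional_unit: str     # functional unit requirement
    did: int                 # distance to the instruction it depends on
    l1_hit: Optional[bool]   # cache behavior; None for non-memory instructions


def generate_stream(profile, count, seed=0):
    """Draw `count` synthetic instructions from the statistical profile."""
    rng = random.Random(seed)
    units = list(profile["instruction_mix"])
    weights = [profile["instruction_mix"][u] for u in units]
    stream = []
    for _ in range(count):
        unit = rng.choices(units, weights=weights)[0]
        # Exponential draw rounded to an integer >= 1; the shape is an assumption.
        did = max(1, round(rng.expovariate(1.0 / profile["mean_did"])))
        hit = rng.random() < profile["l1_hit_rate"] if unit in MEMORY_UNITS else None
        stream.append(SyntheticInstruction(unit, did, hit))
    return stream
```

A structural simulator would then consume such a stream in place of the original benchmark's instructions.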
7
HLS Correctness and Accuracy


Validate HLS against SimpleScalar (using IPC)
For varying combinations of design parameters:
    Run the original benchmark code on SimpleScalar (sim-outorder)
    Run the statistically generated code on HLS
    Compare SimpleScalar IPC vs. HLS IPC
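
The comparison step can be expressed as a relative IPC error per design point. This is a minimal sketch; the function name and the IPC values in the usage note are hypothetical, not results from the paper.

```python
def ipc_relative_error(reference_ipc, hls_ipc):
    """Relative error of the HLS IPC against the cycle-by-cycle reference IPC."""
    return abs(hls_ipc - reference_ipc) / reference_ipc
```

For example, a hypothetical design point with a reference IPC of 2.0 and an HLS IPC of 1.9 gives `ipc_relative_error(2.0, 1.9)`, an error of about 5%.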
8
Validation:
Single- and Multi-value correlations
[Figure: IPC vs. L1-cache hit rate]
For SPECint95: HLS errors are within 5-7% of the cycle-by-cycle results.
9
HLS: Code Properties
[Figure: basic block size vs. L1-cache hit rate]
Inferred correlation: increasing basic block size helps only when the L1 cache hit rate is >96% or <82%.
10
HLS: Value Prediction
Goal: break true dependencies
[Figures: DID vs. value predictability; stall penalty for mispredict vs. value prediction knowledge]
11
HLS: Superscalar Issue Width vs. Dynamic Instruction Distance
Inferred correlation: DID and issue width are highly correlated, especially as both start to increase.
12
HLS: Conclusions




Low error rate only on the SPECint95 benchmark suite; high error rates on the SPECfp95 and STREAM benchmarks.
    Findings by R. H. Bell et al., 2004
Reason:
    Instruction-level granularity for workload
Recommended improvement:
    Basic block-level granularity
13
Basic Block Distribution Analysis
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications.
T. Sherwood, E. Perelman and B. Calder.
Proc. PACT, 2001.
14
Introduction
Goal
    To capture large-scale program behavior in significantly reduced simulation time.
Approach
    Find a representative subset of the full program.
    Find an ideal place to simulate, given a specific number of instructions one has to simulate.
    Provide an accurate confidence estimate for the simulation point.
[Figure: program execution timeline showing initialization, period, and simulation points]

15
Program Behavior


Program behavior has ramifications on architectural techniques.
Program behavior is different in different parts of execution.
    Initialization
    Cyclic behavior (periodic)
Cyclic behavior is not representative of all programs.
    It is the common case for compute-bound applications.
16
BBDA Basics

Fast profiling is used to determine the number of times a basic block executes.
    The behavior of the program is directly related to the code that it is executing.
    Profiling gives a basic block fingerprint for that particular interval of time.
    The interval chosen is ideally representative of the full execution of the program.
Profiling information is collected in intervals of 100 million instructions.
17
Basic Block Vector (BBV)
BBV for interval i:
[Figure: execution frequency of each basic block B1, B2, ..., BD during interval i]
D: total number of basic blocks in the program code
BBV = fingerprint of an interval
Varying size intervals
    A BBV collected over an interval of N times 100 million instructions is a BBV of duration N.
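
A minimal sketch of BBV collection, assuming the profiler yields a trace of (basic block id, instruction count) pairs. That trace format, the function name, and the tiny interval length in the usage note are assumptions for illustration; the slides use intervals of 100 million instructions.

```python
def collect_bbvs(trace, num_blocks, interval_len):
    """Split a basic block trace into intervals and count block executions.

    trace:        iterable of (block_id, instructions_in_block) pairs
    num_blocks:   D, the total number of basic blocks in the program
    interval_len: instructions per interval (100 million in the slides)
    """
    bbvs = []
    current = [0] * num_blocks
    executed = 0
    for block_id, size in trace:
        current[block_id] += 1       # one more execution of this basic block
        executed += size
        if executed >= interval_len:  # interval complete: emit its fingerprint
            bbvs.append(current)
            current = [0] * num_blocks
            executed = 0
    if executed:                      # keep a trailing partial interval
        bbvs.append(current)
    return bbvs
```

With a toy trace `[(0, 4), (1, 6), (0, 4), (2, 6), (1, 5)]`, three basic blocks, and an interval length of 10, this returns `[[1, 1, 0], [1, 0, 1], [0, 1, 0]]`.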
18
Target BBV

BBVs are normalized
    Each element is divided by the sum of all elements.
Target BBV
    The BBV for the entire execution of the program.
Objective
    Find a BBV of the smallest duration that is "similar" to the target BBV.
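
The normalization and the target BBV can be sketched as follows. The function names are hypothetical, and the target is built here by summing the per-interval counts, which is one straightforward way to obtain the BBV of the whole execution.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    return [x / total for x in bbv] if total else list(bbv)


def target_bbv(bbvs):
    """Normalized BBV for the entire execution, from per-interval count vectors."""
    summed = [sum(column) for column in zip(*bbvs)]
    return normalize(summed)
```

After normalization the elements of a BBV sum to 1, so intervals of different durations can be compared directly.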
19
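Basic Block Vector Difference

Difference between two BBVs A and B

Euclidean distance
    $\mathrm{EuclideanDist}(A, B) = \sqrt{\sum_{i=1}^{D} (A_i - B_i)^2}$
Manhattan distance
    Conservative measure
    $\mathrm{ManhattanDist}(A, B) = \sum_{i=1}^{D} |A_i - B_i|$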
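
Both distance measures are short to write down in code. This sketch assumes the two BBVs are equal-length lists of normalized frequencies; the function names are chosen for this example.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For normalized BBVs (non-negative elements summing to 1), the Manhattan distance always lies between 0 (identical vectors) and 2 (no basic block in common).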
20
Basic Block Difference Graph

A plot of how well each individual interval in the program compares to the target BBV.
For each interval of 100 million instructions, we create a BBV and calculate its difference from the target BBV.
Used to:
    Find the end of the initialization phase
    Find the period of the program
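
A minimal sketch of computing the points of a basic block difference graph from per-interval count vectors. The Manhattan distance is used here as the difference measure, and the helper names are assumptions for this example.

```python
def normalize(bbv):
    """Divide each element by the sum of all elements."""
    total = sum(bbv)
    return [x / total for x in bbv] if total else list(bbv)


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def difference_graph(bbvs):
    """One y-value per interval: distance of that interval's BBV from the target BBV."""
    target = normalize([sum(column) for column in zip(*bbvs)])
    return [manhattan_distance(normalize(bbv), target) for bbv in bbvs]
```

Intervals whose code mix matches the whole-program mix plot near 0; intervals that execute very different code plot closer to 2.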
21
Basic Block Difference Graph
[Figure: example basic block difference graph]
22
Initialization




Initialization is not trivial.
    It is important to simulate representative sections of the initialization code.
    Detecting the end of the initialization phase is important.
Initialization Difference Graph
    Initial Representative Signal (IRS): the first quarter of the basic block difference graph (BBDG).
    Slide the IRS across the BBDG.
    The difference is calculated at each point for the first half of the BBDG.
    When the IRS reaches the end of the initialization stage on the BBDG, the difference is maximized.
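
The sliding comparison can be sketched as below. The slides do not say how the difference between the IRS and a window of the graph is measured, so the sum of absolute differences used here is an assumption, as are the function names and the choice to report the first offset at which the maximum occurs.

```python
def initialization_difference_graph(bbdg):
    """Slide the IRS (first quarter of the BBDG) across the first half of the BBDG."""
    window = len(bbdg) // 4
    irs = bbdg[:window]
    half = len(bbdg) // 2
    return [
        # Assumed metric: sum of absolute differences between IRS and the window.
        sum(abs(s - g) for s, g in zip(irs, bbdg[offset:offset + window]))
        for offset in range(half + 1)
    ]


def end_of_initialization(bbdg):
    """Offset (in intervals) at which the IRS difference first reaches its maximum."""
    diffs = initialization_difference_graph(bbdg)
    return max(range(len(diffs)), key=diffs.__getitem__)
```

On an idealized graph such as `[5, 5] + [1] * 14`, where the first two intervals are initialization, `end_of_initialization` returns 2. Real graphs are noisy, so a practical detector would likely need smoothing or a threshold.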
23
Initialization
24
Period

Period Difference Graph
    Period Representative Signal (PRS)
        A part of the BBDG, starting at the end of initialization and extending to one quarter of the length of program execution.
        Slide the PRS across half of the BBDG.
        The distance between the minimum Y-axis points is the period.
    Using BBVs of larger duration creates a BBDG that emphasizes larger periods.
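
A sketch of period detection under stated assumptions: the PRS is taken as a window of one quarter of the graph length starting at the end of initialization, the difference at each offset is a sum of absolute differences, and the period is the spacing between the first two offsets that reach the minimum difference. None of these choices is spelled out in the slides.

```python
def period_difference_graph(bbdg, init_end):
    """Slide the PRS across up to half of the BBDG, starting at the end of initialization."""
    window = len(bbdg) // 4
    prs = bbdg[init_end:init_end + window]
    # Do not slide past the end of the graph.
    max_offset = min(len(bbdg) // 2, len(bbdg) - init_end - window)
    return [
        sum(abs(p - g) for p, g in zip(prs, bbdg[init_end + off:init_end + off + window]))
        for off in range(max_offset + 1)
    ]


def estimate_period(diffs, tolerance=1e-9):
    """Distance between the first two minimum points, or None if there is only one."""
    lowest = min(diffs)
    minima = [i for i, d in enumerate(diffs) if d - lowest <= tolerance]
    return minima[1] - minima[0] if len(minima) > 1 else None
```

For an idealized graph with two initialization intervals followed by a repeating pattern of length 3, `estimate_period(period_difference_graph(graph, 2))` returns 3. On measured data the minima are rarely exact, so the tolerance would need tuning.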
25
Period
26
Characterizing Program Behavior Through Clustering
Automatically Characterizing Large Scale Program Behavior.
T. Sherwood, E. Perelman, G. Hamerly and B. Calder.
ASPLOS 2002
28
Clustering
[Diagram: N BBVs -> clustering approach -> clusters #1, #2, ..., #K -> multiple simulation points P1, P2, ..., Pk, one per cluster]
29
Clustering (k-means)




The goal is to divide a set of points into groups such that points within each group are similar to one another by a desired metric.
Input: N points in D-dimensional space
Output: a partition into k clusters
Algorithm:
    1. Randomly choose k points as centroids (initialization)
    2. Compute the cluster membership of each point based on its distance from each centroid
    3. Compute the new centroid for each cluster
    4. Iterate steps 2 and 3 until convergence
Runtime complexity is affected by the "curse of dimensionality"
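
The algorithm above can be sketched in plain Python as follows. The function signature, the iteration cap, and the use of Euclidean distance are choices made for this example rather than details from the slides.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic k-means: returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    # Step 1: randomly choose k of the points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # Step 4: converged, memberships unchanged
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Each distance computation costs time proportional to the number of dimensions D, which is one reason the full-dimensional BBVs are expensive to cluster directly.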
30
Dimension Reduction Technique

Dimension selection
Dimension reduction
    Random linear projection
Random projection:
    Reduces the dimension of the BBVs to 15
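
A random linear projection multiplies each D-dimensional BBV by a fixed D x 15 random matrix. In this sketch the matrix entries are drawn uniformly from [-1, 1]; that range, the seed, and the function names are assumptions for illustration.

```python
import random


def random_projection_matrix(num_blocks, target_dims=15, seed=0):
    """D x target_dims matrix of random entries (assumed uniform in [-1, 1])."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(target_dims)] for _ in range(num_blocks)]


def project(bbv, matrix):
    """Project a D-dimensional BBV down to len(matrix[0]) dimensions."""
    return [
        sum(x * row[j] for x, row in zip(bbv, matrix))
        for j in range(len(matrix[0]))
    ]
```

The same matrix must be used for every BBV of a program so that distances between projected vectors remain comparable.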
31
BBDA: Conclusions



BBDA provides better sensitivity and lower performance variation in phases.
Other related work, such as the instruction working set technique, provides higher "stability".
For further evaluation of different techniques, refer to:
    Comparing Program Phase Detection Techniques. A. S. Dhodapkar and J. E. Smith.
32
Related Work

Find smaller representative inputs: KleinOsowski et al., 2000.
Fast forwarding and checkpointing: Haskins and Skadron, 2002.
Simulation points based: Lafage et al., 2000.
Statistical simulation: Oskin et al., 2000.
Trace-driven approach for statistical simulation: Carl et al., 1998.
33