IPC Monitoring and Phase Analysis of Programs with SimpleScalar
Final Project for the course VLSI Architecture Design #048853, Spring 2004
Avshalom Elyada, EE Faculty, Technion, Israel Institute of Technology
General
The objective of this project is to observe the performance behavior of various benchmark programs. Results of several programs are analyzed and explained, with insightful observations added where appropriate.
In the first stage the IPC (instructions per
cycle) of various programs was recorded,
together with other performance monitors,
namely branch misprediction rate and cache
miss rates. Simulations were run with a
modified version of SimpleScalar in which
code was added to dynamically record the
monitors. Finally, results were analyzed in the
report which follows. Phase behavior is
clearly observed in all programs. IPC analysis
yields some surprising results.
Introduction
Phase Analysis
Current research shows that program behavior over time has a distinct, recurring structure which is often predictable. Contrary to common perception, Phase Analysis researchers have shown that a program's timeline can be divided into distinct segments, in which the program's performance monitors are relatively constant and dissimilar from other segments.
Figure 1: Plot of monitors over several billion
instructions for gzip (a) and gcc (b) taken from [2]
Repetition is also observed in virtually all programs. A segment in which performance
monitors are stable and distinct typically reoccurs, possibly for the same duration. This
recurring stable state of the processor is called a phase.
Phases are illustrated in Figure 1 taken from [2], Timothy Sherwood et al., "Discovering
and Exploiting Program Phases". On the top a run of gzip is recorded, while on the bottom a
gcc run is shown.
Phase analysis has the potential to revolutionize high-performance processor design, and
most notably power-aware performance optimization. By splitting the time-line into recurring
phases each having a distinct performance and power-consumption nature, performance
optimizations can be designed to react to each phase change, adjusting their optimization in
real time. Thus phase analysis can significantly improve designers' ability to control power
consumption while striving for improved performance.
DVS – an Example of Phase Analysis Potential
Dynamic Voltage Scaling (DVS) is a technique of dynamically varying the processor's frequency-voltage work-point in order to regulate power consumption to match the currently required performance (see [5], [6] and others).
A good DVS algorithm should settle on an efficient and stable frequency-voltage work-point in real time and as quickly as possible. The problem with existing real-time algorithms is that they react only to the current processor state, and do so relatively slowly so as not to destabilize processor behavior or arrive at a wrong decision ([6]).
Since the frequency-voltage work-point is a power-to-performance regulator, it is
reasonable to assume that finding a work-point is strongly correlated to program phases.
Specifically, there may be an optimal work-point per phase. If this is true, then an improved
DVS algorithm may tune in on the optimal work-point over a few initial phase reoccurrences,
and then maintain that decision for following reoccurrences. Once the work-point has been
determined per phase, it changes immediately upon phase change, resulting in much
improved response time. Alternatively, programs may carry a phase information header
which declares their behavior to a specific processor based on initial profile runs. Then the
processor can use this information for fast and simple phase reaction.
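To make this idea concrete, the following sketch shows how a per-phase work-point table might drive such an algorithm. This is a hypothetical illustration only: the names set_workpoint and run_dvs_search stand for platform hooks that are not part of any real DVS interface, and the phase-change event is assumed to come from a phase classifier such as the one discussed later in this report.

    /* hypothetical platform hooks, not part of any real DVS interface */
    typedef struct { int freq_mhz; int mvolt; int learned; } workpoint_t;
    extern void set_workpoint(int freq_mhz, int mvolt);
    extern workpoint_t run_dvs_search(void);

    #define MAX_PHASES 64
    static workpoint_t wp_table[MAX_PHASES]; /* one learned work-point per phase */

    void on_phase_change(int new_phase)
    {
        if (wp_table[new_phase].learned) {
            /* reoccurring phase: switch immediately to the stored work-point */
            set_workpoint(wp_table[new_phase].freq_mhz,
                          wp_table[new_phase].mvolt);
        } else {
            /* first occurrences: let a conventional (slow) DVS search converge,
               then remember its result for the next reoccurrence of this phase */
            workpoint_t wp = run_dvs_search();
            wp.learned = 1;
            wp_table[new_phase] = wp;
            set_workpoint(wp.freq_mhz, wp.mvolt);
        }
    }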
Obtaining Performance-Monitor Data
Simplescalar Configuration
SimpleScalar is a versatile simulator that can run at several levels of detail and has many configurable parameters. I ran all my simulations with sim-outorder (SimpleScalar's most detailed out-of-order simulation program). For fair comparison, all programs ran with the same default SimpleScalar configuration, the main features of which are specified below:
instruction fetch queue size: 4 insts
speed of front-end of machine = speed of execution core
bimodal branch predictor with a 512-set, 4-way BTB
decode, issue and commit width: 4 insts/cycle
register update unit (RUU) size: 16
load/store queue (LSQ) size: 8
L1 data cache: 128 sets, 32B block size, 4-way, LRU
L1 inst cache: 512 sets, 32B block size, 1-way, LRU
L2 unified cache: 1024 sets, 64B block size, 4-way, LRU
The full configuration is listed in Appendix A.
Performance Monitors
Performance monitors are indicators of current program behavior. In this work I measure and analyze the primary ones, although others may also be important (see discussion below).
The analyzed performance monitors are:
- Instructions per cycle (IPC)
- Branch misprediction rate (BP)
- Data level-1 cache miss rate (DL1)
- Instruction level-1 cache miss rate (IL1)
- Unified level-2 cache miss rate (UL2)
After measuring these monitors for each program, a multi-graph is produced showing all
these monitors side-by-side as they progress in time (as in [2]).
SimpleScalar Instrumentation
The first stage in this project was instrumenting the sim-outorder simulator to produce
dynamic monitor readings. Off-the-shelf sim-outorder prints only the average result for
execution of a complete program (for example the average cache miss rate). To display a
time-line, I needed to add code to sim-outorder that would print out the average of each
monitor within a specified (configurable) time-window.
After studying the simulator, I determined which variables to sample, at what point in the
program to sample them, and when to print. The following simplified code explains in
general how it was done:
    //For IPC: store the current instruction count at the end of the
    //dispatch stage
    num_insn_curr = sim_num_insn;

    //Count BP misses at the point where branch resolution is made
    if (!(rs->pred_PC == rs->next_PC))
        num_bp_miss_win++;

    //Once per window, print the window's performance monitors:
    //IPC, BP (misprediction rate), DL1, IL1, UL2 (cache miss rates)
    if (sim_cycle && 0 == sim_cycle % ipc_win_size) {
        //IPC: instructions retired in this window, per cycle
        ipc_win = (double)(num_insn_curr - num_insn_prev) / ipc_win_size;
        num_insn_prev = num_insn_curr; //save for the next window

        //BP: mispredictions per cycle in this window
        bp_miss_rate = (double)num_bp_miss_win / ipc_win_size;
        num_bp_miss_win = 0; //reset the counter

        //DL1: misses per cycle in this window (IL1 and UL2 are analogous)
        num_dl1_curr = cache_dl1->misses;
        dl1_miss_rate = (double)(num_dl1_curr - num_dl1_prev) / ipc_win_size;
        num_dl1_prev = num_dl1_curr; //save for the next window
        ...
        PrintMonitorReadings();
    }
Picking the Window Size
Microsoft Excel was used to plot the multi-graphs. In Excel the number of data points is limited to 65,536 (2^16), which is fine since the human eye usually can't benefit from more information. However, this means that for a window of 20 thousand cycles (as I used in gzip), the data ends after 20*10^3 * 2^16 ≈ 1.3 billion cycles. As can be seen in the multi-graphs that follow, gzip's phase cycles (an order of phases that repeats multiple times throughout the program) are relatively short, so more than enough recurrence is seen in 1.3 billion cycles. However, for gcc and equake it was necessary to see more execution time, so I increased the window size to 1 million cycles (for gcc it would have been better to see even more run time, and hence to increase the window even more, but the benchmark was too short for that). Note that [2] used a window of 10 million cycles.
The above issue also teaches us that the time-scale by which phases are observed may be
different for considerably different programs. This important aspect must not be overlooked
when thinking of the usability of phase analysis in design considerations.
Also note that the time window has a smoothing effect on the data. As shall be seen later on, choosing the right window size is important, since it smoothes out unimportant small-scale noise while allowing focus on the principal details.
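As a small illustration of this smoothing effect (a sketch only, not part of the simulator changes), merging k adjacent windows of an already-recorded trace is equivalent to having chosen a k-times-larger window in the first place:

    /* Re-bin a recorded monitor trace of n windows into windows k times
       larger; each output point is the average of k adjacent inputs. */
    void rebin(const double *trace, int n, int k, double *out)
    {
        for (int i = 0; i + k <= n; i += k) {
            double sum = 0.0;
            for (int j = 0; j < k; j++)
                sum += trace[i + j];
            out[i / k] = sum / k;
        }
    }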
Programs
Three programs from the SPEC2000 benchmark are analyzed:
- gzip (with a graphic benchmark input)
- gcc (with benchmark input "166.i")
- equake (with typical input from the benchmark)
All benchmark programs were compiled using compiler optimizations for maximum
performance (in benchmarking, this is referred to as "peak"). The data itself was taken from
[3].
Computational Limitations
[2] used a window of 10 million cycles and recorded hundreds of billions of cycles. They report that their results "took several machine years" to produce. I knew that I had nowhere near that amount of resources at my disposal. My simulations ran on a single Linux PC, so I was not sure whether they would cover enough run-time to see phase behavior. However, I assumed that I would see some interesting results regardless of that question, and as it turns out this assumption was justified.
Results and Analysis
GZIP
The following figure shows the results for gzip with a 20K-cycle window, spanning 20 million cycles.
Figure 2: gzip multi-graph spanning 20 million cycles
Phase behavior and repetition can clearly be seen. The entire program behaves
repetitively as displayed (as far as the run-time of my simulation could determine).
Comparing this figure to the one from [2], we see that the results are in agreement.
Let's zoom in to 8 million cycles to get a better view:
Figure 3: gzip zoomed in to 8 million cycles
GZIP Phase Analysis
Two distinct phases are seen (marked 1 and 2). Another short recurring phenomenon is
marked (?). The (?) does not exactly qualify as a phase by definition since the readings aren't
at a stable value (at least it appears so at this resolution). However it is clearly not a chaotic
phenomenon. The observed phases might typically correspond to reading a block,
compressing it in various algorithm stages (perhaps that is what the "?" is), and writing it
back.
GZIP IPC Analysis
We would expect IPC to be high when all other metrics are low, and correspondingly low when the other metrics are high, i.e. negative correlation. This is intuitive, since branch mispredictions and cache misses slow down execution. Recall the formula for the normalized (Pearson) correlation between two vectors X and Y of length n:

    Correl(X,Y) = sum_i (x_i - mean(X))(y_i - mean(Y))
                  / sqrt( sum_i (x_i - mean(X))^2 * sum_i (y_i - mean(Y))^2 )

If X is a vector, then Correl(aX, X+c) = 1 (for a > 0), Correl(X+c, -aX) = -1, and two independent vectors give a correlation of zero.
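For reference, a straightforward C routine computing this correlation over two recorded monitor traces might look as follows (a minimal sketch, independent of how the numbers in the tables below were actually produced):

    #include <math.h>

    /* Normalized (Pearson) correlation of two monitor traces of length n */
    double correl(const double *x, const double *y, int n)
    {
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;

        double cov = 0.0, vx = 0.0, vy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            cov += dx * dy; vx += dx * dx; vy += dy * dy;
        }
        return cov / sqrt(vx * vy);
    }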
Calculating the correlation between IPC and the other performance metrics for gzip, we
get:
         BP      DL1     IL1     UL2
IPC      0.12    -0.78   -0.87   0.00

Table 1: gzip, correlation of monitors to IPC
Correlation to both level-one caches is close to (-1) as we would expect. However there is
close to zero correlation to BP and UL2. What might be the cause of this?
We know that the monitors are also correlated with each other; for example, a high DL1 miss rate may stall the pipe, indirectly causing a decrease in BP, not because there are fewer mispredictions per branch but because fewer branches execute per cycle when IPC is lower. Perhaps there is typically one dominant monitor (or a few) that directly affects IPC in a given phase, while the others are indirectly affected by this dominant monitor?
Also, there are probably other effects beyond the recorded monitors which have a
significant influence on IPC at some intervals. For example, a limited number of integer
ALUs can limit IPC in phases characterized by many integer operations. Perhaps such a
correlation opposite from expected can point to a machine bottleneck, at least for the current
program. These questions will be further addressed in the remainder of this work.
Non-correlation of UL2 and BP
The level-two cache is logically distant from the processor; hence many intermediate effects have a chance to spoil its correlation to IPC. This might rationalize the zero correlation of UL2.
However, the same cannot be said of BP. Taking another look at the multi-graph, we see that in most intervals BP appears positively related to IPC, while in a few others it is negatively related. One possible explanation is that there are compound relations between IPC, IL1 and BP. The answer to this question may lie at a totally different level of analysis. Nevertheless, looking at the correlation matrix below, there is clearly negative (and thus seemingly counter-intuitive) correlation between BP and the other metrics, even UL2.
         IPC     BP      DL1     IL1     UL2
IPC      1.00    0.12    -0.78   -0.87   0.00
BP       0.12    1.00    -0.26   -0.33   -0.15
DL1      -0.78   -0.26   1.00    0.70    0.38
IL1      -0.87   -0.33   0.70    1.00    0.16
UL2      0.00    -0.15   0.38    0.16    1.00

Table 2: gzip, cross-correlation matrix
Let's go on to see how results of the two other programs coincide with this initial
analysis.
EQUAKE
It is interesting to observe the equake program at two time segments of about half a billion cycles each. The first figure shows two distinct phases, marked 1 and 2:
Figure 4: equake, 23.4 - 24.1 billion cycles
EQUAKE Phase Analysis
One might argue that 1a and 1b are different phases, since IL1, and subsequently IPC, has a slightly different value (look closely at the figure), which is also a recurring one. However, it is the author's opinion that phase classification should be general enough to include 1a and 1b in the same phase. As we shall see next, there is also a 1c which does not reoccur (or reoccurs rarely), which may indicate that the small difference in IL1 value is input-dependent.
The previous observation leads to a more general insight. In order to practically benefit
from phase analysis information in processor design, phase classification must correctly
balance between accuracy and practicality. A phase profiling algorithm should be rigid
enough to distinguish between time-segments of significant difference in processor behavior,
and at the same time lenient enough to group segments of small difference to the same phase.
Small differences in a small number of monitors (or small differences in phase duration) should be attributed to the same phase, so that programs have on the order of no more than 50 or so phases.
One possible way to do this may be to define a vector composed of performance monitors, which characterizes the current processor state. A vector distance between processor states can then be defined and calculated, so that similar vectors are classified to the same phase according to a determined threshold.
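A minimal sketch of this idea in C (the monitor set, table size and threshold policy are all hypothetical choices, not a prescription) could look as follows:

    #include <math.h>

    #define NUM_MONITORS 5   /* IPC, BP, DL1, IL1, UL2 */
    #define MAX_PHASES   64

    /* table of phase centroids seen so far */
    static double phase_table[MAX_PHASES][NUM_MONITORS];
    static int num_phases = 0;

    /* Euclidean distance between two processor-state vectors */
    static double dist(const double *a, const double *b)
    {
        double d = 0.0;
        for (int i = 0; i < NUM_MONITORS; i++)
            d += (a[i] - b[i]) * (a[i] - b[i]);
        return sqrt(d);
    }

    /* Classify a window's monitor vector: reuse the nearest known phase
       if it is within the threshold, otherwise open a new phase. */
    int classify_phase(const double *v, double threshold)
    {
        int best = -1;
        double best_d = threshold;
        for (int p = 0; p < num_phases; p++) {
            double d = dist(v, phase_table[p]);
            if (d < best_d) { best_d = d; best = p; }
        }
        if (best >= 0)
            return best;                  /* close enough: same phase */
        if (num_phases < MAX_PHASES) {    /* otherwise open a new phase */
            for (int i = 0; i < NUM_MONITORS; i++)
                phase_table[num_phases][i] = v[i];
            return num_phases++;
        }
        return MAX_PHASES - 1;            /* table full: lump together */
    }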
The second figure below shows an interesting transition from phases 1 and 2 to phases 2 and 3. At a glance, 3 might appear similar to 1 ("1d", although with a shorter duration than 1a, b and c). However, looking at DL1 and UL2 we see that this is clearly not the case.
Also, as mentioned above, note 1c, which is grouped together with 1a and 1b. Occurrences of 1a and 1b were commonly seen, whereas 1c was seen only a few times in my simulation.
Figure 5: equake, 26.6 to 27.1 billion cycles
EQUAKE IPC Analysis
The equake program exhibits three phase-cycles (characteristic patterns of phase
alternation) rather than just one as in the gzip example. Therefore correlation is taken in three
different segments corresponding to phase alternations 1a-2, 1b-2, and 2-3 in the previous
multi-graphs.
Phases   BP      DL1     IL1     UL2
1a-2     -0.26   -0.87   0.94    -0.89
1b-2     -0.25   -0.86   0.93    -0.89
2-3      -0.15   -0.52   0.88    -0.85

Table 3: equake, correlation of monitors to IPC
As can be seen, DL1 and UL2 show strong negative correlation to IPC. Even BP is negatively correlated, thus behaving closer to expectation than in the gzip example. However, the strong positive correlation of IL1 to IPC is counter-intuitive.
Taking a second look at the multi-graphs, we see that this is indeed true. Phase 1 exhibits
high IPC together with high IL1, while phase 2 shows relatively low IPC together with zero
IL1. (Phase 3 doesn't follow this trend, but neither does it coincide with other parameters, for
instance positive correlation of IPC to DL1 is seen in phase 3).
         IPC     BP      DL1     IL1     UL2
IPC      1.00    -0.26   -0.87   0.94    -0.89
BP       -0.26   1.00    0.40    -0.44   0.44
DL1      -0.87   0.40    1.00    -0.97   1.00
IL1      0.94    -0.44   -0.97   1.00    -0.99
UL2      -0.89   0.44    1.00    -0.99   1.00

Table 4: equake, cross-correlation matrix
In the cross-correlation matrix above, we see that IL1 shows strong negative correlation
to other monitors. Notably, there is an almost (-1) correlation between IL1 and DL1, which
may explain (at least statistically) why there is high IPC together with high IL1. It appears
that in equake, DL1 and perhaps also UL2 are dominant monitors while IL1 is an affected
monitor. Negative correlation between IL1 and DL1 may occur when a small set of cached
data is being operated on by cache-thrashing code, and could point to a bad cache
configuration, at least with regard to this program.
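As a purely hypothetical illustration of such a situation, consider a large body of code repeatedly operating on a small array. With the configuration used here (16KB 4-way DL1, 16KB direct-mapped IL1), the data fits comfortably in DL1 while the code footprint can exceed IL1, moving the two miss rates in opposite directions:

    #define N 64            /* 512B of data: fits easily in the 16KB DL1 */
    double a[N];

    /* Imagine hundreds of distinct, heavily unrolled code blocks like
       these, whose combined text size exceeds the 16KB IL1: instruction
       fetches thrash IL1 while data accesses almost always hit in DL1. */
    void big_code_small_data(void)
    {
        for (int i = 0; i < N; i++) a[i] = a[i] * 1.5 + 1.0;
        for (int i = 0; i < N; i++) a[i] = a[i] * 0.5 - 1.0;
        /* ... many more such blocks ... */
    }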
GCC
The gcc program ended after about 27 billion cycles, producing the following result:
Figure 6: gcc, 0 to 27 billion cycles
GCC Phase Analysis
Structured behavior is clearly seen in gcc as in the previous programs; however, gcc exhibits significantly more complicated behavior than gzip or equake. In the whole data set of 27 billion cycles, a full phase-cycle is observed only twice, meaning its period is several orders of magnitude longer than in gzip and equake.
Overall there are probably more than 10 phases in gcc, also considerably more than in the previous examples. Some dominant phases are marked in Figure 6 above, while a few others can be seen better in Figure 7 below, which is an enlarged view. Not all phases are marked, due to the complicated nature of the multi-graph. Gcc demonstrates that phase profiling and classification of complex programs should be automated, and done in real-time hardware if possible.
As defined in the introduction, monitors should be stable and distinct throughout a phase occurrence. In some phases of gcc (notably phase 2) not all monitors are stable. However, phase structure is maintained in the sense that monitor instability is usually tightly bound and centers around a distinct average (some of these instabilities are not seen in [2] due to their use of a 10-million-cycle window, which smoothes their graphs). As mentioned before, phase classification should be lenient enough to accommodate a phase with tightly-bound instability in a small number of monitors.
Figure 7: gcc, cycles 4 to 10 billion
GCC IPC Analysis
How should we measure correlation for the gcc program? The gzip run exhibited completely repetitive behavior (2-3 constantly recurring phases), so it was obvious that correlation needed to be measured over at least one phase-cycle. Equake is characterized by three different patterns of phase-cycles, so correlation was measured separately for each phase-cycle.
In gcc, however, a phase-cycle exists only at a much larger scale (so that we see only 2 phase-cycles in the simulation). One possible way to deal with this is to simply measure correlation over the whole run or over one phase-cycle, as was previously done.
However, in the gcc case it might be more interesting to show results for two segments which
look significantly different from each other within the phase-cycle. Looking at Figure 6, two
different segments are defined: one of low IPC (the area around marked phases 1 and 2) and a
second of high IPC (phases 3 and after).
Table 5 shows the cross-correlation matrix for the low-IPC segment, while Table 6 shows the corresponding data for the high-IPC segment. Table 7 shows the absolute difference between the two correlation matrices.
         IPC     BP      DL1     IL1     UL2
IPC      1.00    0.14    -0.52   -0.09   -0.74
BP       0.14    1.00    0.54    0.46    0.54
DL1      -0.52   0.54    1.00    0.50    0.00
IL1      -0.09   0.46    0.50    1.00    0.32
UL2      -0.74   0.54    0.00    0.32    1.00

Table 5: gcc, cross-correlation matrix of low IPC segment
         IPC     BP      DL1     IL1     UL2
IPC      1.00    -0.50   -0.08   -0.72   -0.03
BP       -0.50   1.00    0.37    0.64    0.33
DL1      -0.08   0.37    1.00    0.24    0.97
IL1      -0.72   0.64    0.24    1.00    0.20
UL2      -0.02   0.33    0.97    0.20    1.00

Table 6: gcc, cross-correlation matrix of high IPC segment
         IPC     BP      DL1     IL1     UL2
IPC      0.00    0.64    0.44    0.63    0.72
BP       0.65    0.00    0.16    0.18    0.21
DL1      0.43    0.16    0.00    0.26    0.96
IL1      0.64    0.18    0.26    0.00    0.13
UL2      0.71    0.21    0.96    0.13    0.00

Table 7: gcc, absolute difference between the two cross-correlation matrices above
As can be seen in Table 7, there is substantial difference between the two segments in a
number of monitor correlations. This perhaps demonstrates the crucial importance of
understanding phase analysis. The differences between the two segments show that virtually
any design optimization that is not somehow phase-aware is far from optimal.
Finally, for completeness it should be mentioned that gcc has notably higher cache miss
rates than the two previous examples.
Conclusions
SimpleScalar was instrumented to print out dynamic readings of important performance monitors: IPC, branch misprediction rate and cache miss rates. The gzip, equake and gcc benchmark programs were analyzed, and their IPC and phase behavior were reviewed. Different programs had phases on different time scales.
The importance and potential of phase analysis research was discussed. Phase behavior
was clearly seen in all programs, and insights were explained in detail.
Correlation between IPC and other performance monitors was not always negative as
initially expected. This was explained by the fact that other monitors are not independent
from each other, and such dependence can be dominant enough to produce seemingly
counter-intuitive results.
Future Research
Phase analysis research shows that program phases generally correlate to specific regions of code, typically frequently-executed loops and subroutine calls. This work focused primarily on statistical IPC analysis. However, by tying program phases to specific code regions, the underlying reasons for the observed statistical results might be explored. This could be an interesting continuation of this work. It would require printing out the program-counter value, correlating between phases and actual code segments, and then analyzing detailed per-instruction behavior in those segments.
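As a first step, a small addition at the same instrumentation point as the earlier code could record where the program is executing once per window. This is only a sketch; it assumes sim-outorder's regs.regs_PC as the architectural program counter and a hypothetical output stream outfile:

    /* alongside the monitor printout, record the current program counter */
    if (sim_cycle && 0 == sim_cycle % ipc_win_size) {
        /* regs.regs_PC: the architectural PC in sim-outorder */
        fprintf(outfile, "win %lu pc 0x%08lx\n",
                (unsigned long)(sim_cycle / ipc_win_size),
                (unsigned long)regs.regs_PC);
    }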
This work analyzed a small number of primary performance monitors. However, there are certainly more phenomena affecting IPC than those monitored here, for example the number of available execution resources, or pipe-stages stalled due to some bottleneck in the design. In order to observe such issues, additional events would need to be monitored: internal queue fill levels, the number of instructions of each type ready to execute at any given time, important counters and so on.
It would also be interesting to observe the effect of different machine configurations
(caches, branch-prediction tables etc.) on program behavior using the multi-graph tool.
Finally, it would be beneficial to analyze additional programs in order to generalize
conclusions regarding the analysis that was done.
References
1. SimpleScalar source and documentation from http://www.simplescalar.com
2. Timothy Sherwood et al., "Discovering and Exploiting Program Phases", IEEE Micro, 2003.
3. "/spec_2k" public directory at the Lion computer-farm in EE. These are SPEC2000
programs recompiled as SimpleScalar inputs and made available for research use.
4. SPEC CPU2000 benchmark suite, http://www.spec.org
5. Greg Semeraro, David Albonesi et al., "Dynamic Frequency and Voltage Scaling for a
Multiple-Clock-Domain Microprocessor"
6. Greg Semeraro, David Albonesi et al., "Dynamic Frequency and Voltage Scaling for a
Multiple-Clock-Domain Microarchitecture"
7. Slides and Notes, VLSI Architecture Design Course #048853 Spring 2004.
Appendix A: SimpleScalar Default Configuration
The full configuration is listed below:
# instruction fetch queue size (in insts)
-fetch:ifqsize        4
# extra branch mis-prediction latency
-fetch:mplat          3
# speed of front-end of machine relative to execution core
-fetch:speed          1
# branch predictor type {nottaken|taken|perfect|bimod|2lev|comb}
-bpred                bimod
# bimodal predictor config (<table size>)
-bpred:bimod          2048
# 2-level predictor config (<l1size> <l2size> <hist_size> <xor>)
-bpred:2lev           1 1024 8 0
# combining predictor config (<meta_table_size>)
-bpred:comb           1024
# return address stack size (0 for no return stack)
-bpred:ras            8
# BTB config (<num_sets> <associativity>)
-bpred:btb            512 4
# speculative predictors update in {ID|WB} (default non-spec)
# -bpred:spec_update  <null>
# instruction decode B/W (insts/cycle)
-decode:width         4
# instruction issue B/W (insts/cycle)
-issue:width          4
# run pipeline with in-order issue
-issue:inorder        false
# issue instructions down wrong execution paths
-issue:wrongpath      true
# instruction commit B/W (insts/cycle)
-commit:width         4
# register update unit (RUU) size
-ruu:size             16
# load/store queue (LSQ) size
-lsq:size             8
# l1 data cache config, i.e., {<config>|none}
-cache:dl1            dl1:128:32:4:l
# l1 data cache hit latency (in cycles)
-cache:dl1lat         1
# l2 data cache config, i.e., {<config>|none}
-cache:dl2            ul2:1024:64:4:l
# l2 data cache hit latency (in cycles)
-cache:dl2lat         6
# l1 inst cache config, i.e., {<config>|dl1|dl2|none}
-cache:il1            il1:512:32:1:l
# l1 instruction cache hit latency (in cycles)
-cache:il1lat         1
# l2 instruction cache config, i.e., {<config>|dl2|none}
-cache:il2            dl2
# l2 instruction cache hit latency (in cycles)
-cache:il2lat         6
# flush caches on system calls
-cache:flush          false
# convert 64-bit inst addresses to 32-bit inst equivalents
-cache:icompress      false
# memory access latency (<first_chunk> <inter_chunk>)
-mem:lat              18 2
# memory access bus width (in bytes)
-mem:width            8
# instruction TLB config, i.e., {<config>|none}
-tlb:itlb             itlb:16:4096:4:l
# data TLB config, i.e., {<config>|none}
-tlb:dtlb             dtlb:32:4096:4:l
# inst/data TLB miss latency (in cycles)
-tlb:lat              30
# total number of integer ALU's available
-res:ialu             4
# total number of integer multiplier/dividers available
-res:imult            1
# total number of memory system ports available (to CPU)
-res:memport          2
# total number of floating point ALU's available
-res:fpalu            4
# total number of floating point multiplier/dividers available
-res:fpmult           1
# profile stat(s) against text addr's (mult uses ok)
# -pcstat              <null>
# operate in backward-compatible bugs mode (for testing only)
-bugcompat            false