Correlation between Detailed and Simplified Simulations in Studying Multiprocessor Architecture
Khaled Z. Ibrahim
Department of Electrical Engineering, Suez Canal University, Egypt.
kibrahim@mailer.eun.eg
Abstract
Simulation of multiprocessor systems is widely used to evaluate design alternatives. Using detailed simulation can be prohibitive, especially when tackling a large design space. Simplified simulation, on the other hand, may not provide an accurate picture of the performance trend.

In this work, we studied the correlation between the measurements of synchronization-based execution times for detailed and simplified simulations. We show that these measurements are highly correlated. Using the correlation information, the total detailed execution time can be estimated within 5% error with less than 25% of the detailed simulation cycles.
1 Introduction
Simulation techniques are widely used in performance studies of computer architecture. Tension arises between using detailed simulation to obtain an accurate characterization of workloads and the high cost of performing these simulations. Detailed simulators are expensive to implement and to run.
Figure 1 shows, on the left, the speedups for five
OpenMP-parallelized NAS-NPB benchmarks [1] using
simplified simulation. On the right, the figure shows the
ratio between simplified and detailed simulations. The detailed simulation models out-of-order superscalar processors, while the simplified simulation models in-order single-issue processors. The figure shows that the detailed simulation gives a trend different from that given by the simplified simulation. As expected, simplified simulation overestimates the speedups, because the detailed processors usually stress the memory system more aggressively. Unfortunately, the ratio of simplified to detailed simulations neither follows a linear trend nor provides a single trend across benchmarks. This exposes the risk of trusting simplified simulations alone.
Estimating the behavior of detailed models without detailed simulation has attracted the attention of researchers [3, 5, 6, 10, 9, 4]. An analytical model can be developed to model the system behavior [10, 9], or a hybrid statistical-analytical model can be incorporated with a simplified simulator to estimate the system performance [5, 6, 4]. These techniques can estimate multiprocessor performance within 15% error. They face a conventional challenge in coping with new architectural ideas, especially those giving a small performance advantage.

In this work, we measured the correlation between detailed and simplified simulations for barrier-based execution times. The correlation coefficient is almost one for most applications. We then developed a scheme that estimates a linear relation between detailed and simplified simulations while requiring only a small amount of detailed simulation. The performance can be estimated within 5% error with 25% of the detailed simulation.

The remainder of this paper is organized as follows: we begin by studying the correlation between detailed and simplified simulations in Section 2; Section 3 describes our proposed technique for estimating detailed execution time based on the correlation information; the accuracy of the proposed scheme is evaluated in Section 4; Section 5 presents related work; conclusions and future work are presented in Section 6.
2 Correlation between Detailed and Simplified Simulations
The performance of a workload depends on two factors: the algorithm and the architecture. Algorithms are usually characterized independently of the architecture, based on their time and space complexity. Trusting architecture-independent performance evaluation assumes that running the same algorithm on different architectures will provide algorithm-related results.
A parallel application can be decomposed into phases of execution separated by barrier synchronizations; we call these phases sessions. A session is usually visited multiple times throughout the execution of the application. The execution times for the same session represent a sequence.
Figure 1. (left) Simplified simulation speedups and (right) the ratio of simplified to detailed simulation speedups, versus processor count, for BT, CG, LU, MG, and SP.
Figure 2 shows the sequences for two sessions obtained by detailed and simplified simulations of the processor model. The measurements are taken from the NAS-NPB LU application. The first session exhibits a constant amount of work, and the execution time follows that algorithmic behavior by stabilizing to a constant value; initially, the measurements oscillate due to cache warming. The second session exhibits a varying amount of work. Both the detailed and the simplified models of the processor give execution times that follow the same algorithmic profile. Other sessions may have a monotonically decreasing amount of work each time they are visited.
Each application has a mixture of sessions, each with a specific behavior. Benchmarks are usually accompanied by the amount of simulation (input data, number of iterations, etc.) that leads to representative performance results. Reducing the amount of execution by reducing the input data or the number of iterations can lead to misleading results.
The time cost of detailed simulation for the complete execution is usually very high, especially for a large design space.
In the next section, we introduce dependence analysis
between sequences of barrier-based execution times obtained from detailed simulation D and simplified simulation S.
2.1 Linear Minimum Mean Squared Error Estimation of D based on an observation S

Let $E[X]$ be the expected value of a random variable $X$, defined as $E[X] = \mu_X = \sum_{i=1}^{n} x_i P(X = x_i)$. The variance is defined as:

$$\sigma_{XX} = E[(X - \mu_X)^2] = \sum_{i=1}^{n} (x_i - \mu_X)^2 P(X = x_i) \qquad (1)$$

We also define $\sigma_X = \sqrt{\sigma_{XX}}$. Suppose $D$ (for detailed) and $S$ (for simplified) are two discrete random variables. A good measure of the dependence between the two random variables $D$ and $S$ is the correlation coefficient, defined as:

$$\rho_{DS} = \frac{E[(D - \mu_D)(S - \mu_S)]}{\sigma_D \sigma_S} = \frac{\sigma_{DS}}{\sigma_D \sigma_S} \qquad (2)$$

The numerator of the right-hand side is called the covariance $\sigma_{DS}$ of $D$ and $S$. If $D$ and $S$ are linearly dependent, then $|\rho_{DS}| = 1$.

We need to estimate $D$ based on the observation $S$ using a linear estimator $\hat{D} = a + hS$ such that $E[(D - \hat{D})^2]$ is minimized by the selection of the constants $a$ and $h$. It can be proved [8] that the best estimator has $h = \sigma_{DS}/\sigma_{SS}$ and $a = \mu_D - h\,\mu_S$. The minimum mean squared error (MSE) is given as:

$$E[(D - \hat{D})^2] = \sigma_{DD}\,(1 - \rho_{DS}^2) \qquad (3)$$

If the correlation coefficient $|\rho_{DS}|$ is close to one, then the expected squared error is nearly zero. If $\rho_{DS} = 0$, then observing the value $S$ has no value in estimating $D$.
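To make Equations 1-3 concrete, the following minimal Python sketch computes the correlation coefficient and the best linear predictor from two equal-length sequences of per-session execution times. The function name and the sample data are ours, for illustration only; they are not taken from the paper.

```python
import math

def lmmse_predictor(d, s):
    """Best linear estimator D-hat = a + h*S for D given S (Equations 1-3).

    d: detailed execution times, one entry per visit to a session.
    s: the corresponding simplified execution times.
    Returns (a, h, rho, mse).
    """
    n = len(d)
    assert n == len(s) and n > 1, "need two equal-length sequences"
    mu_d = sum(d) / n
    mu_s = sum(s) / n
    # Variances and covariance, with uniform weights P(X = x_i) = 1/n (Eq. 1).
    var_d = sum((x - mu_d) ** 2 for x in d) / n
    var_s = sum((y - mu_s) ** 2 for y in s) / n
    cov_ds = sum((x - mu_d) * (y - mu_s) for x, y in zip(d, s)) / n
    # Correlation coefficient rho_DS (Eq. 2).
    rho = cov_ds / math.sqrt(var_d * var_s)
    # Optimal constants: h = sigma_DS / sigma_SS and a = mu_D - h * mu_S.
    h = cov_ds / var_s
    a = mu_d - h * mu_s
    # Minimum mean squared error (Eq. 3).
    mse = var_d * (1.0 - rho ** 2)
    return a, h, rho, mse

# Illustrative data: a detailed sequence that is roughly 2*S + 1000 cycles.
s = [5000.0, 5200.0, 5100.0, 5300.0, 5250.0]
d = [11020.0, 11390.0, 11210.0, 11630.0, 11480.0]
a, h, rho, mse = lmmse_predictor(d, s)
print(f"a = {a:.1f}, h = {h:.3f}, rho = {rho:.4f}, rms = {math.sqrt(mse):.1f}")
```

The square root of the MSE from Equation 3 is the expected RMS error of the predictor, which is how the estimated parameters are later used against an error objective in Section 4.1.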
Figure 2. Two sample barrier time measurements (cycles versus iterations) with detailed and simplified simulations, taken from application LU.
3 Estimating Detailed Performance based on Simplified Simulation
Section 2 presents a technique to estimate a linear relation between detailed and simplified measurements, where both measurements cover the complete benchmark run. A complete simplified simulation S is needed only once, to capture the algorithmic behavior of the application as well as its interaction with the architectural parts that are not targeted by the detailed simulation. To achieve the objective of saving time in detailed simulation, we need to estimate the constants a and h without carrying out the complete detailed run, as explored next.
3.1 Estimating a Linear Predictor
Let $\hat{a}$ and $\hat{h}$ be estimates of the constants $a$ and $h$, respectively. Let $d$ be a subset of $D$, and let $s$ be the corresponding subset of $S$. Let the total number of samples in $S$ be $n$. Initially, $d$ is set empty. The following steps are repeated for at most $n$ repetitions: a new measurement $d_i$ is added to the set $d$; the corresponding subset $s$ is obtained from $S$, and the computations $\hat{h}_i = \sigma_{ds}/\sigma_{ss}$ and $\hat{a}_i = \mu_d - \hat{h}_i \mu_s$ are carried out; the estimated constants $\hat{a}_i$ and $\hat{h}_i$ are checked for convergence to stable values, and if they have converged, the procedure is terminated.

The convergence can be tested by ensuring that the percentage of change to the current value stays bounded for a sufficient amount of time. Based on $\hat{a}$, $\hat{h}$, and $S$, we can then compute $\hat{D}$.

The estimation of $\hat{a}_i$ and $\hat{h}_i$ can be performed once every several measurements to reduce the overhead. On the other hand, excessively skipping the computations of $\hat{a}_i$ and $\hat{h}_i$ can increase the number of cycles that must be simulated in detail.
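For illustration only, the following Python sketch implements the procedure above. The convergence parameters `rel_tol` and `stable_steps`, and all names, are assumptions of ours; the paper does not prescribe specific threshold values.

```python
def estimate_until_convergence(detailed_iter, s, rel_tol=0.01, stable_steps=5):
    """Grow the detailed subset d one measurement at a time (Section 3.1).

    detailed_iter: lazy iterator of detailed measurements d_i, so that
                   detailed simulation can stop as soon as we converge.
    s:             the complete simplified sequence for this barrier.
    rel_tol, stable_steps: assumed convergence test -- both estimates must
                   change by less than rel_tol for stable_steps consecutive
                   measurements ("bounded for a sufficient amount of time").
    Returns (a_hat, h_hat, samples_used).
    """
    d = []
    prev = None
    stable = 0
    for d_i in detailed_iter:
        d.append(d_i)
        if len(d) < 2:
            continue  # at least two samples are needed for a variance
        sub = s[:len(d)]  # the corresponding subset of S
        mu_d = sum(d) / len(d)
        mu_s = sum(sub) / len(sub)
        var_s = sum((y - mu_s) ** 2 for y in sub) / len(sub)
        if var_s == 0.0:
            continue
        cov = sum((x - mu_d) * (y - mu_s) for x, y in zip(d, sub)) / len(d)
        h_hat = cov / var_s           # h_i = sigma_ds / sigma_ss
        a_hat = mu_d - h_hat * mu_s   # a_i = mu_d - h_i * mu_s
        if prev is not None:
            change = max(abs(a_hat - prev[0]) / (abs(prev[0]) or 1.0),
                         abs(h_hat - prev[1]) / (abs(prev[1]) or 1.0))
            stable = stable + 1 if change < rel_tol else 0
            if stable >= stable_steps:
                return a_hat, h_hat, len(d)
        prev = (a_hat, h_hat)
    if prev is None:
        raise ValueError("too few detailed measurements to estimate a and h")
    return prev[0], prev[1], len(d)
```

Checking the estimates only every few measurements, as suggested above, amounts to running the convergence test on every k-th iteration of this loop rather than on each one.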
3.2 Synchronization-based Execution Time

In a uniprocessor setting, natural points of measurement are every pre-specified count of graduated instructions. In a multiprocessor setting, the number of graduated instructions for the same task may differ with the architecture, because the count of graduated synchronization instructions depends on the timing. The calling instructions of synchronization primitives are good points for measurements. Two frequently used synchronization primitives are Lock and Barrier.

As discussed earlier, we need execution time measurements for the detailed and the simplified simulations that relate to each other. We have chosen barrier-based execution times because barriers are frequently used with compiler-based parallelization. The benchmarks used in this study are parallelized using OpenMP, as discussed further in Section 4.

Barriers can be identified by their caller program counter (PC). We have multiple alternatives for the barrier time measurements: the first alternative is to consider the measurements for all barriers in the application indiscriminately; we call these all-barrier measurements. The second alternative is to group the measurements sourcing from the same barrier; we call these after-barrier measurements. Finally, the measurements can be grouped based on their destination barrier; we call these before-barrier measurements. Generally, the amount of work before a certain barrier call shapes a profile that is different from the profile associated with the amount of work after that barrier. We seek the profiles that exhibit the higher correlation between detailed and simplified simulations.
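The three groupings can be expressed compactly. In the sketch below, the trace format, a per-thread list of (barrier PC, cycles elapsed since the previous barrier) pairs, is our own assumed representation, not a format defined by the paper:

```python
from collections import defaultdict

def group_barrier_times(trace):
    """Bucket inter-barrier execution times three ways (Section 3.2).

    trace: list of (barrier_pc, cycles) pairs, where cycles is the time
           spent in the session that ends at the barrier at barrier_pc.
    Returns (all_barrier, after_barrier, before_barrier).
    """
    all_barrier = []                    # every measurement, indiscriminately
    after_barrier = defaultdict(list)   # keyed by the barrier the session follows
    before_barrier = defaultdict(list)  # keyed by the destination barrier
    prev_pc = None
    for pc, cycles in trace:
        all_barrier.append(cycles)
        before_barrier[pc].append(cycles)
        if prev_pc is not None:         # the first session has no source barrier
            after_barrier[prev_pc].append(cycles)
        prev_pc = pc
    return all_barrier, dict(after_barrier), dict(before_barrier)
```

Each resulting value list is one of the per-session sequences to which the estimator of Section 3.1 can be applied.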
Table 2. NAS-NPB benchmarks and their problem sizes.

Bench.   Size
BT       16 × 16 × 16
CG       1400 points
LU       16 × 16 × 16
MG       32 × 32 × 32
SP       16 × 16 × 16
4 Simulation Methodology and Performance Evaluation
We used SimOS [7], a complete system simulation environment running the IRIX 5.3 OS. In this simulation environment, we have multiple CPU and memory models. Table 1 shows the simulation parameters for the two processor models used in this study, MIPSY and MXS. MIPSY models an in-order processor with blocking loads. MXS models an R10K-like out-of-order superscalar processor. The table also shows the memory parameters. System-wide coherence of the L2 caches is maintained by an invalidate-based fully-mapped directory protocol. Table 1 also shows the network latency parameters.

Benchmarks used in this study are listed in Table 2. These benchmarks are an OpenMP port of the NAS Parallel Benchmarks 2.3 [1], done by the Omni compiler project [2]. All simulations are done for a machine with 8 processors.
4.1 Measuring Correlation and Estimating a Linear Model
In this section, we evaluate the correlation between detailed and simplified simulations based on the model presented in Section 2. We also estimate a linear relation between detailed and simplified simulations. We set the simulator to run with the simplified processor model in a first run, and then we perform simulation with the detailed processor model in a second run.

For clarity, we have chosen from each application the one barrier with the largest contribution to the simulation cycles (more than 60% for all benchmarks). All the measurements are after-barrier, and all the chosen barriers have a varying amount of work. Both detailed and simplified measurements follow similar profiles.

Figure 3 shows the normalized variation of the estimates of the constants a and h, as well as the normalized error associated with this variation. The figure includes a table showing the actual final values of the estimated variables. The correlation coefficients are almost one for four out of the five benchmarks. This shows that a linear predictor can be highly accurate in predicting the relation between detailed and simplified simulations. Only MG has a correlation coefficient of 0.67, which is still a high correlation. The percentage of error in estimating detailed from simplified simulations is less than 1% for all benchmarks.

Figure 3 also shows that the constants a and h are not the same for all benchmarks, which explains the different simplified-to-detailed ratios shown in Figure 1. Each benchmark interacts with the architecture and stresses the components of the system differently.

It is also shown that the estimation process of a and h converges, based on the technique described in Section 3.1. The convergence ranged from very fast for CG to very slow for MG. The error is computed between the complete detailed simulation and the detailed performance estimated by our technique, discussed in Section 3. For a specific error objective, the estimated parameters can be used to determine the expected RMS error as given by Equation 3.

Sources of uncorrelated behavior between detailed and simplified simulations include timing-related transactions, such as false-sharing transactions and contention on synchronization points and memory lines.

4.2 Estimating Detailed Performance

To estimate the detailed performance, we first perform a simplified simulation, and then we simulate the system with the detailed processor model until we converge to the correlation parameters between the simplified and the detailed simulations.

As discussed earlier, barrier-based measurements of the execution time are all-barrier, after-barrier, or before-barrier. Both after-barrier and before-barrier are measured for individual barriers. For individual barriers, we consider the convergence of each barrier's parameters independently of the others. Barriers that converge do not need further detailed simulation.

The error percentage for all-barrier is 2 to 8 times higher than with individual barriers, because it tries to estimate one linear relation for the different algorithmic behaviors present in the different sessions of execution.

Control flow within the program makes the distribution of measurements with after-barrier different from those with before-barrier. Except for LU and MG, the smaller error percentage is associated with the after-barrier measurements. The after-barrier measurement enables deciding whether we need to simulate the next session in detail or not.
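As a sketch of how the pieces of this subsection could fit together, the following illustrative Python combines per-barrier convergence with prediction; `simplified_runs` and `detailed_sim` are hypothetical interfaces of ours, and `estimate_until_convergence` is the sketch from Section 3.1, not tooling from the paper.

```python
def estimate_total_detailed_time(simplified_runs, detailed_sim):
    """Estimate the total detailed execution time (Section 4.2 flow).

    simplified_runs: dict mapping barrier PC -> full per-session sequence
                     from one complete simplified run (assumed interface).
    detailed_sim:    callable(pc) -> lazy iterator of detailed per-session
                     times for that barrier, produced on demand so that a
                     converged barrier stops consuming detailed simulation.
    """
    total = 0.0
    for pc, s in simplified_runs.items():
        measured = []

        def recording(it=detailed_sim(pc)):
            # Remember each detailed measurement as it is consumed.
            for d_i in it:
                measured.append(d_i)
                yield d_i

        a_hat, h_hat, used = estimate_until_convergence(recording(), s)
        total += sum(measured)              # sessions simulated in detail
        total += sum(a_hat + h_hat * s_i    # predicted remaining sessions
                     for s_i in s[used:])
    return total
```

Because a converged barrier stops consuming the detailed simulator, the detailed cycles actually spent are concentrated in the early, parameter-estimation part of the run, which is what Figure 4 quantifies.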
Table 1. System parameters of the simulated machines.

MIPSY (simple in-order processor): 1.0 GHz; simple in-order execution model.

MXS (superscalar out-of-order processor): 1.0 GHz; R10K-like architecture and latencies; Int Reg 32; FP Reg 32; ROB 64 entries; Window Size 64; load bypass; precise interrupts; Width (Fetch 4, Issue 4, Cache 2, Writeback 4, Graduate 4); return stack 32; Branch History Table 1K entries; load-store buffer 64.

               L1 Caches (I/D)     L2 Cache (Unified)
Size           16 KB, 32 B line    1 MB, 64 B line
Associativity  2                   4
Hit Latency    1 cycle             10 cycles

Memory parameters (ns) (a detailed description is found in the SimOS documentation [7]):
BusTime: 30    PILocalDCTime: 10    NIRemoteDCTime: 10
NILocalDCTime: 60    NetTime: 50    MemTime: 50
Figure 4 shows the percentage of cycles simulated in detail (using after-barrier measurements) needed to achieve a certain error percentage. Unlike Figure 3, where only one barrier is taken into consideration, Figure 4 considers the whole application. Simulating 13.5% of the cycles in detail leads to an error percentage of 10% on average, while with 25% simulated in detail the error percentage is 5% on average. As expected, we observed a slowdown for detailed simulations compared with simplified simulations. For an eight-processor system, the ratios of detailed simulation time to simplified simulation time are 19, 54, 14, 24, and 13 for BT, CG, LU, MG, and SP, respectively. Additionally, the ratio of detailed to simplified simulation time increases with the number of CPUs. For instance, the ratios for MG are 6, 9, 12, and 24 for systems of 1, 2, 4, and 8 processors, respectively.

The applications reach the solution through multiple iterations. The cycles per second (CPS) of detailed simulation decreases over these iterations, because the processor resources are stressed more as the iterations advance. Considering the 5% error target, let cps_est be the CPS during the parameter-estimation phase and let cps_rem be the CPS during the remaining detailed execution time. The ratios cps_est/cps_rem are 1.17, 1.25, 2.26, 1.69, and 3.58 for BT, CG, LU, MG, and SP, respectively. Avoiding detailed simulation toward the end of the run therefore leads to a greater saving in simulation time.

Interestingly, the profiles of inter-barrier measurements are very much the same for systems with different numbers of processors. Thus, profiles taken from a system with a small number of processors can be used for estimating the detailed simulation performance of a system with more processors.

As discussed earlier, a simplified simulation profile carries the algorithmic behavior along with its interaction with the simplified hardware. The same profile can therefore be used for estimating the detailed performance of multiple detailed hardware configurations. For each detailed hardware configuration, we simulate a small fraction of the cycles in detail to estimate the predictor parameters, and then compute the complete detailed simulation behavior based on the correlation parameters and the simplified simulation profile.
5 Related Work
Providing alternatives to detailed simulation has been the target of many researchers [3, 5, 6, 10, 9, 4]. Hybrid statistical-analytical modeling estimates a statistical profile of the system through simulation [5, 6, 4]. The profile is then fed into a simple simulator that estimates the detailed behavior of the system. For shared-memory applications, Nussbaum et al. [6] show that the performance estimate can be achieved within 15% error, on average.
Sorin et al. [10, 9] proposed an analytic model for shared-memory ILP-based multiprocessors. Two sets of parameters are defined: one for the system and the other for the application. These parameters are defined for homogeneous processes (processes that exhibit the same behavior). Model parameters are obtained using simplified simulation. These models estimate ILP-multiprocessor performance with less than 12% average error; the estimation error is larger for heterogeneous processes.
6 Conclusions and Future Work
Exploring a large design space for multiprocessor machines requires reducing the amount of detailed simulation without sacrificing accuracy.

This work studies the correlation between detailed and simplified simulations. We show that barrier-based execution times have a very strong correlation (a correlation coefficient of almost one for most applications).

We also show that a linear relation between detailed and simplified simulations can be estimated. This linear relation allows estimating the detailed simulation performance within 5% error with 25% of the cycles simulated in detail.

Future work includes extending the approach presented in this paper to other synchronization primitives. Additionally, we will study synchronization-independent correlation between simplified and detailed simulations.
Figure 3. Normalized behavior of the parameters h, a, and the error versus the percentage of simulated cycles, for the largest contributing barrier of each benchmark (BT, CG, LU, MG, SP). The accompanying table of actual final values for the estimated variables:

Bench.   Correlation coefficient   a           h
BT       0.9964                    -3.54E+04   0.8623
CG       0.9991                     1.89E+03   0.4296
LU       0.9699                     4.65E+04   0.4900
MG       0.6749                     6.63E+03   0.7919
SP       0.9988                     2.73E+04   0.3768
Figure 4. Percentage of detailed simulation cycles needed to reach the 5% and 10% error targets, for BT, CG, LU, MG, SP, and their average.
References
[1] NAS Parallel Benchmarks. http://www.nas.nasa.gov/NAS/NPB.
[2] Omni OpenMP Compiler Project. http://phase.etl.go.jp/Omni/.
[3] D. H. Albonesi and I. Koren. A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques. Int'l Journal of Parallel Programming, pages 235-263, 1996.
[4] L. Eeckhout, R. H. Bell Jr., B. Stougie, K. De Bosschere, and L. K. John. Control Flow Modeling in Statistical Simulation for Accurate and Efficient Processor Design Studies. 31st Int'l Symp. on Computer Architecture, pages 350-363, June 2004.
[5] S. Nussbaum and J. E. Smith. Modeling Superscalar Processors via Statistical Simulation. Int'l Conf. on Parallel Architectures and Compilation Techniques, pages 15-24, 2001.
[6] S. Nussbaum and J. E. Smith. Statistical Simulation of Symmetric Multiprocessor Systems. Annual Simulation Symposium, pages 89-97, 2002.
[7] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. ACM Trans. on Modeling and Computer Simulation, 7(1):78-103, 1997.
[8] K. S. Shanmugan and A. M. Breipohl. Random Signals: Detection, Estimation and Data Analysis. Wiley, 1988.
[9] D. J. Sorin, J. L. Lemon, D. L. Eager, and M. K. Vernon. Analytic Evaluation of Shared-Memory Architectures. IEEE Trans. on Parallel and Distributed Systems, pages 166-180, February 2003.
[10] D. J. Sorin, V. S. Pai, S. V. Adve, M. K. Vernon, and D. A. Wood. Analytic Evaluation of Shared-Memory Systems with ILP Processors. 25th Int'l Symp. on Computer Architecture, pages 380-392, June 1998.