Correlation between Detailed and Simplified Simulations in Studying Multiprocessor Architecture

Khaled Z. Ibrahim
Department of Electrical Engineering, Suez Canal University, Egypt.
kibrahim@mailer.eun.eg

Abstract

Simulation of multiprocessor systems is widely used to evaluate design alternatives. Using detailed simulation can be prohibitive, especially when tackling a large design space. Simplified simulation, on the other hand, may not provide an accurate picture of the performance trend. In this work, we study the correlation between the measurements of synchronization-based execution times for detailed and simplified simulations. We show that these measurements are highly correlated. Using the correlation information, the total detailed execution time can be estimated within 5% error with less than 25% of the detailed simulation cycles.

1 Introduction

Simulation is widely used in performance studies of computer architecture. A tension arises between using detailed simulation to get an accurate characterization of workloads and the high cost of performing these simulations. Detailed simulators are expensive to implement and to run.

Figure 1 shows, on the left, the speedups for five OpenMP-parallelized NAS-NPB benchmarks [1] using simplified simulation. On the right, the figure shows the ratio between simplified and detailed simulation speedups. The detailed simulation models out-of-order superscalar processors, while the simplified simulation models in-order single-issue processors. The figure shows that detailed simulation gives a different trend from that given by simplified simulation. As expected, simplified simulation overestimates the speedups because the detailed processors stress the memory system more aggressively. Unfortunately, the ratio of simplified to detailed simulation neither follows a linear trend nor provides a single trend across applications. This exposes the problem of trusting simplified simulations.

Figure 1. (left) Simplified simulation speedups. (right) Ratio between simplified and detailed simulation speedups. Both panels are plotted against processor count (1 to 8) for BT, CG, LU, MG, and SP.

Estimating the behavior of detailed models without detailed simulation has attracted the attention of researchers [3, 5, 6, 10, 9, 4]. An analytical model can be developed to model the system behavior [10, 9], or a hybrid statistical-analytical model can be incorporated with a simplified simulator to estimate the system performance [5, 6, 4]. These techniques can estimate multiprocessor performance within 15% error. They face a conventional challenge in coping with new architectural ideas, especially those giving a small performance advantage.

In this work, we measured the correlation between detailed and simplified simulations for barrier-based execution times. The correlation coefficient is almost one for most applications. We then developed a scheme that estimates a linear relation between detailed and simplified simulations while requiring only a small amount of detailed simulation. The performance can be estimated within 5% error with 25% of the detailed simulation.

The remainder of this paper is organized as follows. We begin by studying the correlation between detailed and simplified simulations in Section 2. Section 3 describes our proposed technique to estimate detailed execution time based on the correlation information. The accuracy of our proposed scheme is evaluated in Section 4. Section 5 presents related work. Conclusions and future work are presented in Section 6.
2 Correlation between Detailed and Simplified Simulations

The performance of a workload depends on two factors: the algorithm and the architecture. Algorithms are usually characterized independently of architecture, based on their time and space complexity. Trusting architecture-independent performance evaluation assumes that running the same algorithm on different architectures will provide algorithm-related results.

A parallel application can be decomposed into phases of execution separated by barrier synchronizations; we call them sessions. A session is usually visited multiple times throughout the execution of the application. The execution times for the same session represent a sequence. Figure 2 shows sequences for two sessions obtained by detailed and simplified simulations of the processor model. The measurements are taken from the NAS-NPB LU application. The first session exhibits a constant amount of work, and the execution time follows that algorithmic behavior by stabilizing to a constant value. Initially, the measurements oscillate due to cache warming. The second session exhibits a varying amount of work. Both detailed and simplified models of the processor give execution times following the same algorithmic profile. Other sessions may have a monotonically decreasing amount of work each time they are visited.

Figure 2. Two sample barrier time measurements with detailed and simplified simulations, taken from application LU (y-axes: detailed cycles and simplified cycles; x-axis: iterations).

Each application has a mixture of sessions, each with a specific behavior. Benchmarks are usually accompanied by the amount of simulation (input data, number of iterations, etc.) that leads to representative performance results. Reducing the amount of execution by reducing the input data or the number of iterations can lead to misleading results.

The time cost of detailed simulation for the complete execution is usually very high, especially for a large design space. In the next section, we introduce dependence analysis between sequences of barrier-based execution times obtained from detailed simulation D and simplified simulation S.

2.1 Linear Minimum Mean Squared Error Estimation of D based on an observation S

Let $E[X]$ be the expected value of a random variable $X$, defined as $E[X] = \mu_X = \sum_{i=1}^{n} x_i \, P(X = x_i)$. The variance is defined as:

$$\sigma_{XX} = E[(X - \mu_X)^2] = \sum_{i=1}^{n} (x_i - \mu_X)^2 \, P(X = x_i) \quad (1)$$

We also define $\sigma_X = \sqrt{\sigma_{XX}}$. Suppose $D$ (for detailed) and $S$ (for simplified) are two discrete random variables. A good measure of dependence between the two random variables $D$ and $S$ is the correlation coefficient, defined as:

$$\rho_{DS} = \frac{E[(D - \mu_D)(S - \mu_S)]}{\sigma_D \, \sigma_S} = \frac{\sigma_{DS}}{\sigma_D \, \sigma_S} \quad (2)$$

The numerator of the right-hand side is called the covariance $\sigma_{DS}$ of $D$ and $S$. If $D$ and $S$ are linearly dependent, then $|\rho_{DS}| = 1$.

We need to estimate $D$ based on the observation $S$ using a linear estimator $\hat{D} = a + hS$ such that $E[(D - \hat{D})^2]$ is minimized through the selection of the constants $a$ and $h$. It can be proved [8] that the best estimator has $h = \sigma_{DS}/\sigma_{SS}$ and $a = \mu_D - h\mu_S$. The minimum mean square error (MSE) is given as:

$$E[(D - \hat{D})^2] = \sigma_{DD}\,(1 - \rho_{DS}^2) \quad (3)$$

If the correlation coefficient $|\rho_{DS}| \approx 1$, then the expected squared error is nearly zero. If $\rho_{DS} = 0$, then observing the value $S$ has no value in estimating $D$.
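As a concrete illustration of Equations (1)-(3), the following Python sketch computes the sample estimates of the predictor constants from two equal-length sequences of barrier execution times, assuming uniform weights $P(X = x_i) = 1/n$. The function name and the example data are illustrative assumptions, not taken from the paper.

```python
import math

def lmmse_predictor(detailed, simplified):
    """Estimate the linear predictor D_hat = a + h*S of Section 2.1.

    Sample means and covariances stand in for the expectations of
    Equations (1)-(3), with uniform weights P(X = x_i) = 1/n.
    """
    n = len(detailed)
    assert n == len(simplified) and n > 1
    mu_d = sum(detailed) / n
    mu_s = sum(simplified) / n
    # Covariance sigma_DS and variances sigma_DD, sigma_SS (Equation 1).
    cov_ds = sum((d - mu_d) * (s - mu_s)
                 for d, s in zip(detailed, simplified)) / n
    var_d = sum((d - mu_d) ** 2 for d in detailed) / n
    var_s = sum((s - mu_s) ** 2 for s in simplified) / n
    # Correlation coefficient (Equation 2).
    rho = cov_ds / math.sqrt(var_d * var_s)
    # Optimal linear estimator [8]: h = sigma_DS / sigma_SS, a = mu_D - h*mu_S.
    h = cov_ds / var_s
    a = mu_d - h * mu_s
    # Minimum mean squared error (Equation 3).
    mse = var_d * (1.0 - rho ** 2)
    return a, h, rho, mse

# Hypothetical example: barrier times (cycles) of one session from two runs.
detailed_times = [9.1e6, 8.7e6, 8.9e6, 9.4e6, 8.8e6]
simplified_times = [4.0e6, 3.8e6, 3.9e6, 4.2e6, 3.85e6]
a, h, rho, mse = lmmse_predictor(detailed_times, simplified_times)
print(f"a={a:.3g}, h={h:.3g}, rho={rho:.3f}, mse={mse:.3g}")
```

Given $a$ and $h$, the detailed sequence can then be predicted from the simplified sequence alone as $\hat{d}_i = a + h s_i$.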
3 Estimating Detailed Performance based on Simplified Simulation

Section 2 presents a technique to estimate a linear relation between detailed and simplified measurements, where both measurements cover the complete benchmark run. A complete simplified simulation S is needed only once, to capture the algorithmic behavior of the application as well as its interaction with the architectural part that is not targeted by the detailed simulation. To achieve the objective of saving time in detailed simulation, we need to estimate the constants a and h without carrying out the complete detailed run, as explored in the next section.

3.1 Estimating a Linear Predictor

Let $\hat{a}$ and $\hat{h}$ be estimates of the constants $a$ and $h$, respectively. Let $d$ be a subset of $D$, and $s$ the corresponding subset of $S$. Let the total number of samples in $S$ be $n$. Initially, $d$ is empty. The following steps are repeated for at most $n$ repetitions: a new measurement $d_i$ is added to the set $d$; the corresponding subset $s$ is obtained from $S$, and the computations $\hat{h}_i = \sigma_{ds}/\sigma_{ss}$ and $\hat{a}_i = \mu_d - \hat{h}_i \mu_s$ are carried out; the estimated constants $\hat{a}_i$ and $\hat{h}_i$ are checked for convergence to stable values, and if they have converged the procedure is terminated.

The convergence can be tested by ensuring that the percentage of change in the current value remains bounded for a sufficient amount of time. Based on $\hat{a}$, $\hat{h}$, and $S$, we can compute $\hat{D}$. The estimation of $\hat{a}_i$ and $\hat{h}_i$ can be performed once every several measurements to reduce the overhead. On the other hand, excessively skipping the computations of $\hat{a}_i$ and $\hat{h}_i$ can increase the number of cycles simulated in detail.
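The incremental procedure of Section 3.1 can be sketched as follows, reusing the lmmse_predictor helper from the sketch above. The convergence settings (relative change below 5% for three consecutive checks) are illustrative assumptions; the paper only requires that the change stay bounded for a sufficient amount of time.

```python
def estimate_predictor_online(next_detailed_sample, simplified,
                              rel_tol=0.05, stable_checks=3):
    """Incrementally estimate (a, h), stopping once both stabilize.

    next_detailed_sample() is an assumed hook that simulates one more
    session visit in detail and returns its measured time; `simplified`
    holds the full sequence from the single simplified run.
    """
    d = []  # growing subset of detailed measurements
    prev_a = prev_h = None
    stable = 0
    for _ in range(len(simplified)):
        d.append(next_detailed_sample())
        if len(d) < 2:
            continue
        s = simplified[:len(d)]  # corresponding subset of S
        a_i, h_i, _, _ = lmmse_predictor(d, s)
        if prev_a is not None:
            change = max(abs(a_i - prev_a) / (abs(prev_a) + 1e-12),
                         abs(h_i - prev_h) / (abs(prev_h) + 1e-12))
            stable = stable + 1 if change < rel_tol else 0
            if stable >= stable_checks:
                return a_i, h_i, len(d)  # converged: stop detailed simulation
        prev_a, prev_h = a_i, h_i
    return prev_a, prev_h, len(d)  # used all samples without early stop

# The total detailed time is then estimated from the full simplified
# profile: D_hat_total = sum(a + h * s for s in simplified).
```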
3.2 Synchronization-based Execution Time

In a uniprocessor setting, natural points of measurement are at every pre-specified count of graduated instructions. In a multiprocessor setting, the number of graduated instructions for the same task may differ with the architecture, because the count of graduated synchronization instructions depends on the timing. The calling instructions of synchronization primitives are good points for measurement. Two frequently used synchronization primitives are Lock and Barrier. As discussed earlier, we need execution time measurements for the detailed and the simplified simulations that relate to each other. We have chosen barrier-based execution times because barriers are frequently used with compiler-based parallelization. The benchmarks used in this study are parallelized using OpenMP, as discussed further in Section 4.

Barriers can be identified by their caller program counter (PC). We have multiple alternatives for the barrier time measurements: the first alternative is to consider the measurements for all barriers in the application indiscriminately; we call these all-barrier measurements. The second alternative is to group the measurements sourcing from the same barrier; we call these after-barrier measurements. Finally, the measurements can be grouped based on their destination barrier; we call these before-barrier measurements. Generally, the amount of work before a certain barrier call shapes a profile that is different from that associated with the amount of work after that barrier. We seek the profiles that exhibit the higher correlation between detailed and simplified simulations.
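The three groupings described above can be sketched as follows. Each interval between consecutive barrier calls is attributed either to the barrier that opened it (after-barrier) or to the barrier that closed it (before-barrier); the trace format, a list of (caller PC, time) pairs, is a hypothetical simplification for illustration.

```python
from collections import defaultdict

def group_barrier_intervals(events):
    """Group inter-barrier execution times per Section 3.2.

    `events` is an assumed trace format: a list of (caller_pc, time)
    tuples, one per barrier call, in program order.  Returns the three
    alternative groupings of the intervals between consecutive barriers.
    """
    all_barrier = []                    # indiscriminate sequence
    after_barrier = defaultdict(list)   # keyed by the barrier opening the interval
    before_barrier = defaultdict(list)  # keyed by the barrier closing the interval
    for (src_pc, t0), (dst_pc, t1) in zip(events, events[1:]):
        interval = t1 - t0
        all_barrier.append(interval)
        after_barrier[src_pc].append(interval)
        before_barrier[dst_pc].append(interval)
    return all_barrier, after_barrier, before_barrier
```

Applying this to the traces of the detailed and simplified runs yields, for each barrier PC, the paired sequences d and s consumed by the estimator of Section 3.1.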
4 Simulation Methodology and Performance Evaluation

We used SimOS [7], a complete system simulation environment running the IRIX 5.3 OS. This simulation environment provides multiple CPU and memory models. Table 1 shows the simulation parameters for the two processor models used in this study, MIPSY and MXS. MIPSY models an in-order processor with blocking loads. MXS models an R10K-like out-of-order superscalar processor. The table also shows the memory parameters. System-wide coherence of the L2 caches is maintained by an invalidate-based fully-mapped directory protocol. Table 1 also shows the network latency parameters. The benchmarks used in this study are listed in Table 2. These benchmarks are an OpenMP port of the NAS Parallel Benchmarks 2.3 [1], done by the Omni compiler project [2]. All simulations are done for a machine with 8 processors.

Table 1. System parameters of the simulated machines.

MIPSY (simple in-order processor, 1.0 GHz): simple in-order execution model.
MXS (superscalar out-of-order processor, 1.0 GHz): R10K-like architecture and latencies; Int Reg 32; FP Reg 32; ROB 64 entries; window size 64; load bypass; precise interrupts; width (fetch 4, issue 4, cache 2, writeback 4, graduate 4); return stack 32; branch history table 1K entries; load-store buffer 64.

L1 caches (I/D): size 16 KB, 32 B line; associativity 2; hit latency 1 cycle.
L2 cache (unified): size 1 MB, 64 B line; associativity 4; hit latency 10 cycles.

Memory parameters, in ns (a detailed description is found in the SimOS documentation [7]): BusTime 30; PILocalDCTime 10; NIRemoteDCTime 10; NILocalDCTime 60; NetTime 50; MemTime 50.

Table 2. NAS-NPB benchmarks and their problem sizes.

Bench.   Size
BT       16 x 16 x 16
CG       1400 points
LU       16 x 16 x 16
MG       32 x 32 x 32
SP       16 x 16 x 16

4.1 Measuring Correlation and Estimating Linear Model

In this section, we evaluate the correlation between detailed and simplified simulations based on the model presented in Section 2. We also estimate a linear relation between detailed and simplified simulations. We set the simulator to run with the simplified processor model in a first run, and then we perform simulation with the detailed processor model in another run. For clarity, we have chosen one barrier from each application, the one with the largest contribution to the simulation cycles (more than 60% for all benchmarks). All the measurements are after-barrier. All these barriers have a varying amount of work. Both detailed and simplified measurements follow similar profiles.

Figure 3 shows the normalized variation of the estimates of the constants a and h, as well as the normalized error associated with this variation. The figure includes a table showing the actual final values of the estimated variables. The correlation coefficients are almost one for four out of the five benchmarks. This shows that a linear predictor can be highly accurate in predicting the relation between detailed and simplified simulations. Only MG has a correlation coefficient of 0.67, which is still a high correlation. The percentage of error in estimating detailed from simplified simulations is less than 1% for all benchmarks.

Figure 3 also shows that the constants a and h are not the same for all benchmarks, which explains the different simplified-to-detailed ratios shown in Figure 1. Each benchmark interacts with the architecture and stresses the components of the system differently. It is also shown that the estimation process of a and h converges, based on the technique described in Section 3.1. The convergence ranged from very fast for CG to very slow for MG. The error is computed between the complete detailed simulation and the estimated detailed performance based on our technique, discussed in Section 3. For a specific error objective, the estimated parameters can be used to determine the RMS error as given by Equation 3.

Sources of uncorrelated behavior between detailed and simplified simulations include timing-related transactions, such as false-sharing transactions and contention on synchronization points and memory lines.

Figure 3. Normalized behavior of the parameters h, a, and error vs. the percentage of simulated cycles for the largest contributing barrier. The embedded table gives the final estimates:

bench.   correlation coefficient      a           h
BT       0.9964                  -3.54E+04    0.8623
CG       0.9991                   1.89E+03    0.4296
LU       0.9699                   4.65E+04    0.4900
MG       0.6749                   6.63E+03    0.7919
SP       0.9988                   2.73E+04    0.3768

4.2 Estimating Detailed Performance

To estimate the detailed performance, we first perform a simplified simulation, and then we simulate the system with the detailed processor model until we converge to the correlation parameters between the simplified and the detailed simulations. As discussed earlier, barrier-based measurements of the execution time are all-barrier, after-barrier, or before-barrier. Both after-barrier and before-barrier are measured for individual barriers. For individual barriers, we consider the convergence of each barrier's parameters independently of the others. Barriers that converge do not need further detailed simulation.

The error percentage for all-barrier is 2 to 8 times higher than with individual barriers, because all-barrier tries to estimate a single linear relation for the different algorithmic behaviors presented in the different sessions of execution. Control flow within the program makes the distribution of measurements with after-barrier different from those with before-barrier. Except for LU and MG, a smaller error percentage is associated with the measurements of after-barrier. The after-barrier measurement enables deciding whether we need to simulate the next session in detail or not, as sketched below.
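The overall flow just described can be summarized in the following sketch, which combines the helpers from the earlier sketches: per-barrier predictors are trained until convergence, after which each remaining session visit is estimated from the simplified profile instead of being simulated in detail. The function names and convergence settings are assumptions for illustration, not the paper's implementation.

```python
def estimate_total_detailed_time(simplified_by_barrier, run_detailed_session,
                                 rel_tol=0.05, stable_checks=3):
    """Estimate the whole-application detailed time (Section 4.2 flow).

    simplified_by_barrier: dict mapping barrier PC -> full sequence of
    simplified after-barrier times (from the single simplified run).
    run_detailed_session(pc, visit): assumed hook that simulates one
    visit of a session in detail and returns its measured time.
    """
    total = 0.0
    for pc, s_seq in simplified_by_barrier.items():
        d_seq = []
        a = h = prev_a = prev_h = None
        stable = 0
        for visit, s in enumerate(s_seq):
            if stable >= stable_checks:
                # Converged: estimate this visit instead of simulating it.
                total += a + h * s
                continue
            d_seq.append(run_detailed_session(pc, visit))
            total += d_seq[-1]
            if len(d_seq) < 2:
                continue
            a, h, _, _ = lmmse_predictor(d_seq, s_seq[:len(d_seq)])
            if prev_a is not None:
                change = max(abs(a - prev_a) / (abs(prev_a) + 1e-12),
                             abs(h - prev_h) / (abs(prev_h) + 1e-12))
                stable = stable + 1 if change < rel_tol else 0
            prev_a, prev_h = a, h
    return total
```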
Figure 4 shows the percentage of cycles simulated in detail (using after-barrier) needed to achieve a certain error percentage. Unlike Figure 3, where only one barrier is taken into consideration, Figure 4 considers the whole application. Simulating 13.5% of the cycles in detail leads to an error percentage of 10% on average, while with 25% simulated in detail the error percentage is 5% on average.

Figure 4. Percentage of detailed simulation cycles needed for 5% and 10% error, for BT, CG, LU, MG, SP, and their average.

As expected, we observed a slowdown for detailed simulations compared with simplified simulations. For an eight-processor system, the ratios of detailed simulation time to simplified simulation time are 19, 54, 14, 24, and 13 for BT, CG, LU, MG, and SP, respectively. Additionally, the ratio of detailed to simplified simulation time increases with the number of CPUs. For instance, the ratios for MG are 6, 9, 12, and 24 for systems of 1, 2, 4, and 8 processors, respectively.

The applications reach the solution through multiple iterations. The cycles per second (CPS) of detailed simulation decreases over these iterations because the processor resources are increasingly stressed as the iterations advance. Considering the 5% error target, let cps_est be the CPS during the parameter estimation phase and cps_rem be the CPS during the remaining detailed execution time. The ratios cps_est/cps_rem are 1.17, 1.25, 2.26, 1.69, and 3.58 for BT, CG, LU, MG, and SP, respectively. Avoiding detailed simulation toward the end of the run therefore leads to a greater saving in simulation time.

Interestingly, the profiles of inter-barrier measurements are very much the same for systems with different numbers of processors. Thus, profiles taken from a system with a small number of processors can be used for estimating detailed simulation performance for a system with a larger number of processors. As discussed earlier, a simplified simulation profile carries the algorithmic part along with its interaction with the simplified hardware. The same profile can therefore be used for estimating the detailed performance of multiple detailed hardware configurations: for each detailed hardware configuration, we simulate a small fraction of the cycles in detail to estimate the predictor parameters, and then compute the complete detailed simulation behavior based on the correlation parameters and the simplified simulation profile.

5 Related Work

Providing alternatives to detailed simulation has been the target of many researchers [3, 5, 6, 10, 9, 4]. Hybrid statistical-analytical modeling estimates a statistical profile of the system through simulation [5, 6, 4]. The profile is then fed into a simple simulator that estimates the detailed behavior of the system. For shared-memory applications, Nussbaum et al. [6] show that performance estimates can be achieved within 15% error, on average. Sorin et al. [10, 9] proposed an analytic model for shared-memory ILP-based multiprocessors. Two sets of parameters are defined, one for the system and the other for the application. These parameters are defined for homogeneous processes (processes that exhibit the same behavior). Model parameters are obtained using simplified simulation. These models estimate ILP with less than 12% average error. The estimation error is larger for heterogeneous processes.

6 Conclusions and Future Work

Exploring a large design space in multiprocessor machines requires reducing the amount of detailed simulation without sacrificing accuracy. This work studies the correlation between detailed and simplified simulations. We show that barrier-based execution times have a very strong correlation (a correlation coefficient of almost one for most applications). We show that a linear relation between detailed and simplified simulations can be estimated. This linear relation allows estimating the detailed simulation performance within 5% error with 25% of the cycles simulated in detail.

Future work includes extending the approach presented in this paper to other synchronization primitives. Additionally, we will study synchronization-independent correlation between simplified and detailed simulations.
References

[1] NAS Parallel Benchmarks. http://www.nas.nasa.gov/NAS/NPB.
[2] Omni OpenMP Compiler Project. http://phase.etl.go.jp/Omni/.
[3] D. Albonesi and I. Koren. A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques. Int'l Journal of Parallel Programming, pages 235-263, 1996.
[4] L. Eeckhout, R. H. Bell, Jr., B. Stougie, K. De Bosschere, and L. K. John. Control Flow Modeling in Statistical Simulation for Accurate and Efficient Processor Design Studies. 31st Int'l Symp. on Computer Architecture, pages 350-363, June 2004.
[5] S. Nussbaum and J. E. Smith. Modeling Superscalar Processors via Statistical Simulation. Int'l Conf. on Parallel Architectures and Compilation Techniques, pages 15-24, 2001.
[6] S. Nussbaum and J. E. Smith. Statistical Simulation of Symmetric Multiprocessor Systems. Annual Simulation Symposium, pages 89-97, 2002.
[7] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. Modeling and Computer Simulation, 7(1):78-103, 1997.
[8] K. S. Shanmugan and A. M. Breipohl. Random Signals: Detection, Estimation and Data Analysis. Wiley, 1988.
[9] D. J. Sorin, J. L. Lemon, D. L. Eager, and M. K. Vernon. Analytic Evaluation of Shared-Memory Architectures. IEEE Trans. on Parallel and Distributed Systems, pages 166-180, February 2003.
[10] D. J. Sorin, V. S. Pai, S. V. Adve, M. K. Vernon, and D. A. Wood. Analytic Evaluation of Shared-Memory Systems with ILP Processors. 25th Int'l Symp. on Computer Architecture, pages 380-392, June 1998.