IPC Monitoring and Phase Analysis of Programs with SimpleScalar

Final Project for the course VLSI Architecture Design #048853, Spring 2004
Avshalom Elyada, EE Faculty, Technion, Israel Institute of Technology

General

The objective of this project is to observe the performance behavior of various benchmark programs. Results of several programs are analyzed and explained, and insightful notes are added where appropriate. In the first stage, the IPC (instructions per cycle) of various programs was recorded together with other performance monitors, namely the branch misprediction rate and the cache miss rates. Simulations were run with a modified version of SimpleScalar, to which code was added to record the monitors dynamically. Finally, the results were analyzed in the report that follows. Phase behavior is clearly observed in all programs, and the IPC analysis yields some surprising results.

Introduction

Phase Analysis

Current research shows that program behavior over time has a distinct, recurring structure which is often predictable. Contrary to common perception, phase-analysis researchers have shown that a program's timeline can be divided into distinct segments in which the program's performance monitors are relatively constant and dissimilar from other segments.

Figure 1: Plot of monitors over several billion instructions for gzip (a) and gcc (b), taken from [2]

Repetition is also observed in virtually all programs. A segment in which the performance monitors are stable and distinct typically reoccurs, possibly for the same duration. This recurring stable state of the processor is called a phase. Phases are illustrated in Figure 1, taken from [2], Timothy Sherwood et al., "Discovering and Exploiting Program Phases": on the top a run of gzip is recorded, while on the bottom a gcc run is shown.

Phase analysis has the potential to revolutionize high-performance processor design, and most notably power-aware performance optimization. By splitting the timeline into recurring phases, each having a distinct performance and power-consumption nature, performance optimizations can be designed to react to each phase change, adjusting themselves in real time. Thus phase analysis can significantly improve designers' ability to control power consumption while striving for improved performance.

DVS – an Example of Phase Analysis Potential

Dynamic Voltage Scaling (DVS) is a technique of dynamically varying the processor's frequency-voltage work-point in order to regulate power consumption to match the currently required performance (see [5], [6] and others). A good DVS algorithm should settle on an efficient and stable frequency-voltage work-point in real time and as quickly as possible. The problem with existing real-time algorithms is that they react only to the current processor state, and do so relatively slowly so as not to destabilize processor behavior or arrive at a wrong decision ([6]).

Since the frequency-voltage work-point is a power-to-performance regulator, it is reasonable to assume that finding a work-point is strongly correlated to program phases. Specifically, there may be an optimal work-point per phase. If this is true, then an improved DVS algorithm could tune in on the optimal work-point over a few initial phase reoccurrences and then maintain that decision for following reoccurrences. Once the work-point has been determined per phase, it can be changed immediately upon each phase change, resulting in much improved response time.
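The following is a minimal sketch of this idea, not taken from any published DVS implementation: the phase identifier, the work-point type and the helper routines (tune_workpoint, apply_workpoint) are all hypothetical, and the tuning policy is deliberately simplified.

    /* Hypothetical per-phase DVS controller (illustration only). */
    #define MAX_PHASES   64
    #define TUNING_RUNS   3                 /* phase reoccurrences spent searching */

    typedef struct { int freq_mhz; int vdd_mv; } workpoint_t;

    extern workpoint_t tune_workpoint(int phase_id);     /* assumed search step */
    extern void        apply_workpoint(workpoint_t wp);  /* assumed actuator    */

    static workpoint_t best_wp[MAX_PHASES];  /* best work-point found per phase */
    static int         seen[MAX_PHASES];     /* how many times each phase ran   */

    /* Called whenever the phase detector reports a phase change. */
    void on_phase_change(int phase_id)
    {
        if (seen[phase_id] < TUNING_RUNS) {
            /* first few occurrences: keep searching for a good work-point */
            best_wp[phase_id] = tune_workpoint(phase_id);
            seen[phase_id]++;
        } else {
            /* phase already profiled: apply its work-point immediately,
               with no gradual settling delay */
            apply_workpoint(best_wp[phase_id]);
        }
    }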
Alternatively, programs may carry a phase-information header which declares their behavior on a specific processor, based on initial profiling runs. The processor can then use this information for fast and simple phase reaction.

Obtaining Performance-Monitor Data

SimpleScalar Configuration

SimpleScalar is a versatile simulator that can run at several levels of detail and has many configurable parameters. I ran all my simulations with sim-outorder, SimpleScalar's most detailed out-of-order simulation program. For a fair comparison, all programs ran with the same default SimpleScalar configuration, the main features of which are:

- instruction fetch queue size: 4 insts
- front-end of the machine runs at the same speed as the execution core
- bimodal branch predictor with a 512-set, 4-way BTB
- decode, issue and commit width: 4 insts/cycle
- register update unit (RUU) size: 16
- load/store queue (LSQ) size: 8
- L1 data cache: 128 sets, 32B block size, 4-way, LRU
- L1 instruction cache: 512 sets, 32B block size, 1-way, LRU
- L2 unified cache: 1024 sets, 64B block size, 4-way, LRU

The full configuration is listed in Appendix A.

Performance Monitors

Performance monitors are indicators of current program behavior. In this work I measure and analyze the primary ones, although others may also be important (see the discussion below). The analyzed performance monitors are:

- Instructions per cycle (IPC)
- Branch misprediction rate (BP)
- Data level-1 cache miss rate (DL1)
- Instruction level-1 cache miss rate (IL1)
- Unified level-2 cache miss rate (UL2)

After measuring these monitors for each program, a multi-graph is produced showing all of them side by side as they progress in time (as in [2]).

SimpleScalar Instrumentation

The first stage of this project was instrumenting the sim-outorder simulator to produce dynamic monitor readings. Off-the-shelf sim-outorder prints only the average result for the execution of a complete program (for example, the average cache miss rate). To display a timeline, I needed to add code to sim-outorder that prints the average of each monitor within a specified (configurable) time window. After studying the simulator, I determined which variables to sample, at what point in the program to sample them, and when to print. The following simplified code explains in general how it was done:

    //for IPC, store the current instruction count at the end of the dispatch stage
    num_insn_curr = sim_num_insn;

    //count BP misses at the point where branch resolution is made
    if (!(rs->pred_PC == rs->next_PC))
        num_bp_miss_win++;

    //at the end of each window, print the window performance parameters:
    //IPC, BP (miss rate), DL1, IL1, UL2 (cache miss rates)
    if (0 == sim_cycle % ipc_win_size) {
        //calculate monitor readings

        //IPC
        num_insn_win = (num_insn_curr - num_insn_prev) / ipc_win_size;
        num_insn_prev = num_insn_curr;      //save previous count for the next window

        //BP miss rate (mispredictions per cycle in the window)
        bp_miss_rate = num_bp_miss_win / ipc_win_size;
        num_bp_miss_win = 0;                //reset the counter

        //DL1 miss rate
        num_dl1_curr = cache_dl1->misses;
        dl1_miss_rate = (num_dl1_curr - num_dl1_prev) / ipc_win_size;
        num_dl1_prev = num_dl1_curr;        //save previous count for the next window

        //... IL1 and UL2 are handled in the same way

        PrintMonitorReadings();
    }

Picking the Window Size

Microsoft Excel was used to plot the multi-graphs. In Excel the number of data points is limited to 65,536 (2^16), which is acceptable since the human eye usually cannot benefit from more information anyway. However, this means that for a window of 20 thousand cycles (as I used for gzip), the data ends after 20·10^3 · 2^16 ≈ 1.3 billion cycles.
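The arithmetic behind this limit is simply the window size multiplied by the maximum number of data points (a back-of-the-envelope check, assuming one plotted point per window):

\[ \text{covered cycles} = \text{window size} \times 2^{16} = 20\,000 \times 65\,536 \approx 1.31 \times 10^{9} \]

Covering a longer run with the same number of points therefore requires a proportionally larger window.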
As can be seen in the multi-graphs that follow, gzip's phase-cycles (a sequence of phases that repeats multiple times throughout the program) are relatively short, so more than enough recurrence is seen in 1.3 billion cycles. For gcc and equake, however, it was necessary to see more execution time, so I increased the window size to 1 million cycles (for gcc it would have been better to see even more run time, and hence to increase the window even further, but the benchmark was too short for that). Note that [2] used a window of 10 million cycles.

This issue also teaches us that the time scale on which phases are observed may differ considerably between programs. This important aspect must not be overlooked when considering the usability of phase analysis in design.

Also note that the time window has the effect of smoothing the data. As shall be seen later on, choosing the right window size is important, since it smooths out unimportant small-scale noise while allowing focus on the principal details.

Programs

Three programs from the SPEC2000 benchmark suite are analyzed:

- gzip (with a graphic benchmark input)
- gcc (with benchmark input "166.i")
- equake (with a typical input from the benchmark)

All benchmark programs were compiled with compiler optimizations for maximum performance (in benchmarking, this is referred to as "peak"). The binaries and inputs themselves were taken from [3].

Computational Limitations

[2] used a window of 10 million cycles and recorded hundreds of billions of cycles. They report that their results "took several machine years" to produce. I knew that I did not have anywhere near that amount of resources at my disposal. My simulations ran on a Linux PC, so I was not sure whether they would cover enough run time to expose phase behavior. However, I assumed that some interesting results would appear regardless of that question, and as it turns out this assumption was justified.

Results and Analysis

GZIP

The following figure shows the results for gzip, using a 20K-cycle window and spanning 20 million cycles.

Figure 2: gzip multi-graph spanning 20 million cycles

Phase behavior and repetition can clearly be seen. The entire program behaves repetitively as displayed (as far as the run time of my simulation could determine). Comparing this figure to the one from [2], we see that the results are in agreement. Let's zoom in to 8 million cycles to get a better view:

Figure 3: gzip zoomed in to 8 million cycles

GZIP Phase Analysis

Two distinct phases are seen (marked 1 and 2). Another short recurring phenomenon is marked (?). The (?) does not exactly qualify as a phase by definition, since the readings are not at a stable value (at least it appears so at this resolution); however, it is clearly not a chaotic phenomenon. The observed phases might typically correspond to reading a block, compressing it in various algorithm stages (perhaps that is what the "?" is), and writing it back.

GZIP IPC Analysis

We would expect IPC to be high when all other metrics are low, and correspondingly low when the other metrics are high, i.e. negative correlation. This is intuitive, since branch mispredictions and cache misses slow down execution. Recall the formula for the normalized correlation between two vectors X and Y:

\[ \mathrm{Correl}(X,Y) = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \, \sum_i (y_i-\bar{y})^2}} \]

If X is a vector, then Correl(aX, X+c) = 1 and Correl(X+c, -aX) = -1 (for a > 0), while two independent vectors give a correlation of approximately zero.
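For reference, this is how such a correlation can be computed directly from the per-window monitor readings; the sketch below assumes the readings have been exported as plain arrays, and the function name is mine:

    #include <math.h>

    /* Normalized (Pearson) correlation between two monitor traces of length n.
     * Sketch only: assumes the per-window readings were exported as arrays. */
    double correl(const double *x, const double *y, int n)
    {
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;

        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / sqrt(sxx * syy);    /* +1: move together, -1: move oppositely */
    }

Applying such a function to every pair of monitor traces yields the correlation values and cross-correlation matrices presented in the tables that follow.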
Calculating the correlation between IPC and the other performance metrics for gzip, we get:

            BP      DL1     IL1     UL2
    IPC     0.12   -0.78   -0.87    0.00

Table 1: gzip, correlation of monitors to IPC

The correlation to both level-one caches is strongly negative (approaching -1), as we would expect. However, the correlation to BP and UL2 is close to zero. What might be the cause of this? We know that the monitors are also correlated among themselves; for example, a high DL1 miss rate may stall the pipe, indirectly causing a decrease in the measured BP rate, not because there are fewer mispredictions per branch but simply because the IPC is lower. Perhaps there is typically one dominant monitor (or a few) that directly affects IPC in a given phase, while the others are only indirectly affected through it. In addition, there are probably effects beyond the recorded monitors which significantly influence IPC in some intervals; for example, a limited number of integer ALUs can limit IPC in phases characterized by many integer operations. Perhaps such a correlation opposite to the expected one can point to a machine bottleneck, at least for the current program. These questions will be further addressed in the remainder of this work.

Non-correlation of UL2 and BP

The level-two cache is logically distant from the processor, hence many intermediate effects have a chance to spoil its correlation to IPC. This might rationalize the zero correlation of UL2. However, the same cannot be said of BP. Taking another look at the multi-graph, we see that in most intervals BP appears positively related to IPC, while in a few others it appears negatively related. One possible explanation is that there are compound relations between IPC, IL1 and BP; the answer may lie at an entirely different level of analysis. Nevertheless, looking at the correlation matrix below, there is clearly negative (and thus seemingly counter-intuitive) correlation between BP and the other metrics, even UL2.

            IPC     BP      DL1     IL1     UL2
    IPC     1.00    0.12   -0.78   -0.87    0.00
    BP      0.12    1.00   -0.26   -0.33   -0.15
    DL1    -0.78   -0.26    1.00    0.70    0.38
    IL1    -0.87   -0.33    0.70    1.00    0.16
    UL2     0.00   -0.15    0.38    0.16    1.00

Table 2: gzip, cross-correlation matrix

Let's go on to see how the results of the two other programs coincide with this initial analysis.

EQUAKE

It is interesting to observe the equake program at two time segments of about half a billion cycles each. The first figure shows two distinct phases, marked 1 and 2:

Figure 4: equake, 23.4 to 24.1 billion cycles

EQUAKE Phase Analysis

One might argue that 1a and 1b are different phases, since IL1 and consequently IPC have a slightly different value (look closely at the figure) which is also a recurring one. However, it is the author's opinion that phase classification should be general enough to include 1a and 1b in the same phase. As we shall see next, there is also a 1c which does not reoccur (or reoccurs rarely), which may indicate that the small difference in the IL1 value is input-dependent.

The previous observation leads to a more general insight. In order to practically benefit from phase-analysis information in processor design, phase classification must correctly balance accuracy and practicality.
A phase-profiling algorithm should be strict enough to distinguish between time segments with significantly different processor behavior, and at the same time lenient enough to group segments with small differences into the same phase. Small differences in a small number of monitors (or small differences in phase duration) should be attributed to the same phase, so that a program is described by on the order of no more than 50 or so phases. One possible way to do this is to define a vector of performance monitors which characterizes the current processor state. A vector distance between processor states can then be defined and calculated, so that similar vectors are classified into the same phase according to a chosen threshold.
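As a purely illustrative sketch of such a classifier (not part of the project's instrumentation), the code below assigns each window's monitor vector to the nearest previously seen phase centroid by Euclidean distance, or opens a new phase when no centroid is close enough; the threshold, the phase limit and all names are assumptions.

    #include <math.h>

    #define NUM_MONITORS  5      /* IPC, BP, DL1, IL1, UL2 readings of one window */
    #define MAX_PHASES    64
    #define DIST_THRESH   0.1    /* hypothetical similarity threshold */

    static double centroid[MAX_PHASES][NUM_MONITORS];   /* representative vectors */
    static int    num_phases = 0;

    /* Euclidean distance between two monitor vectors. */
    static double dist(const double *a, const double *b)
    {
        double d = 0.0;
        for (int i = 0; i < NUM_MONITORS; i++)
            d += (a[i] - b[i]) * (a[i] - b[i]);
        return sqrt(d);
    }

    /* Return a phase id for the window described by vector v,
     * creating a new phase if no existing one is close enough. */
    int classify_window(const double *v)
    {
        int best = -1;
        double best_d = DIST_THRESH;

        for (int p = 0; p < num_phases; p++) {
            double d = dist(v, centroid[p]);
            if (d < best_d) { best_d = d; best = p; }
        }
        if (best >= 0)
            return best;                       /* similar to a known phase */

        if (num_phases < MAX_PHASES) {         /* otherwise open a new phase */
            for (int i = 0; i < NUM_MONITORS; i++)
                centroid[num_phases][i] = v[i];
            return num_phases++;
        }
        return MAX_PHASES - 1;                 /* table full: crude fallback */
    }

In practice the monitor readings would have to be normalized to comparable scales before such a distance is meaningful, and the centroids could be updated as more windows are assigned to a phase.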
The second figure below shows an interesting transition from the alternation of phases 1 and 2 to an alternation of phases 2 and 3. At a glance, 3 might appear similar to 1 (a "1d", although with a shorter duration than 1a, 1b and 1c); however, looking at DL1 and UL2 we see that this is clearly not the case. Also, as mentioned above, note 1c, which is grouped together with 1a and 1b: occurrences of 1a and 1b were commonly seen, whereas 1c was seen only a few times in my simulation.

Figure 5: equake, 26.6 to 27.1 billion cycles

EQUAKE IPC Analysis

The equake program exhibits three phase-cycles (characteristic patterns of phase alternation) rather than just one as in the gzip example. Therefore, correlation is taken over three different segments, corresponding to the phase alternations 1a-2, 1b-2 and 2-3 in the previous multi-graphs.

    Phases   1a-2    1b-2    2-3
    BP      -0.26   -0.25   -0.15
    DL1     -0.87   -0.86   -0.52
    IL1      0.94    0.93    0.88
    UL2     -0.89   -0.89   -0.85

Table 3: equake, correlation of monitors to IPC

As can be seen, DL1 and UL2 show strong negative correlation to IPC. Even BP is negatively correlated, thus behaving closer to expectation than in the gzip example. However, the strong positive correlation of IL1 to IPC is counter-intuitive. Taking a second look at the multi-graphs, we see that this is indeed true: phase 1 exhibits high IPC together with high IL1, while phase 2 shows relatively low IPC together with zero IL1. (Phase 3 does not follow this trend, but neither does it coincide with the other parameters; for instance, positive correlation of IPC to DL1 is seen in phase 3.)

            IPC     BP      DL1     IL1     UL2
    IPC     1.00   -0.26   -0.87    0.94   -0.89
    BP     -0.26    1.00    0.40   -0.44    0.44
    DL1    -0.87    0.40    1.00   -0.97    1.00
    IL1     0.94   -0.44   -0.97    1.00   -0.99
    UL2    -0.89    0.44    1.00   -0.99    1.00

Table 4: equake, cross-correlation matrix

In the cross-correlation matrix above, we see that IL1 shows strong negative correlation to the other monitors. Notably, there is an almost (-1) correlation between IL1 and DL1, which may explain (at least statistically) why high IPC appears together with high IL1. It appears that in equake, DL1 and perhaps also UL2 are dominant monitors while IL1 is an affected monitor. Negative correlation between IL1 and DL1 may occur when a small set of cached data is being operated on by cache-thrashing code, and could point to a bad cache configuration, at least with regard to this program.

GCC

The gcc program ended after about 27 billion cycles, producing the following result:

Figure 6: gcc, 0 to 27 billion cycles

GCC Phase Analysis

Structured behavior is clearly seen in gcc as well; however, gcc seems to exhibit significantly more complicated behavior than gzip or equake. In the whole data set of 27 billion cycles, a full phase-cycle is observed only twice, meaning that gcc's phase-cycle is several orders of magnitude longer than those of gzip and equake. Overall there are probably more than 10 phases in gcc, which is also considerably more than in the previous examples. Some dominant phases are marked in Figure 6 above, while a few others can be seen better in Figure 7 below, which is an enlarged view. Not all phases are marked, due to the complicated nature of the multi-graph. Gcc demonstrates that phase profiling and classification of complex programs should be automated, and done in real-time hardware if possible.

As defined in the introduction, monitors should be stable and distinct throughout a phase occurrence. In some phases of gcc (notably phase 2), not all monitors are stable. However, phase structure is maintained in the sense that the monitor instability is usually tightly bounded and centers around a distinct average (some of these instabilities are not seen in [2] due to their use of a 10-million-cycle window, which smooths their graphs). As mentioned before, phase classification should be lenient enough to accommodate a phase with tightly bounded instability in a small number of monitors.

Figure 7: gcc, cycles 4 to 10 billion

GCC IPC Analysis

How should correlation be measured for the gcc program? The gzip run exhibited completely repetitive behavior (2-3 constantly alternating phases), so it was obvious that correlation needed to be measured over at least one phase-cycle. Equake is characterized by three different patterns of phase-cycles, so correlation was measured separately for each phase-cycle. In gcc, however, a phase-cycle exists only on a much larger time scale (so that only two phase-cycles are seen in the simulation). One possible way to deal with this is simply to measure correlation over the whole run or over one phase-cycle, as was previously done. However, in the gcc case it is more interesting to show results for two segments which look significantly different from each other within the phase-cycle. Looking at Figure 6, two segments are defined: one of low IPC (the area around the marked phases 1 and 2) and one of high IPC (phase 3 and after). Table 5 shows the cross-correlation matrix for the low-IPC segment, Table 6 shows the corresponding data for the high-IPC segment, and Table 7 shows the difference (subtraction) between the two correlation matrices.

            IPC     BP      DL1     IL1     UL2
    IPC     1.00    0.14    0.52    0.09    0.74
    BP      0.14    1.00    0.54    0.46    0.54
    DL1     0.52    0.54    1.00    0.50    0.00
    IL1     0.09    0.46    0.50    1.00    0.32
    UL2     0.74    0.54    0.00    0.32    1.00

Table 5: gcc, cross-correlation matrix of the low-IPC segment

            IPC     BP      DL1     IL1     UL2
    IPC     1.00    0.50    0.08    0.72    0.02
    BP      0.50    1.00    0.37    0.64    0.33
    DL1     0.08    0.37    1.00    0.24    0.97
    IL1     0.72    0.64    0.24    1.00    0.20
    UL2     0.02    0.33    0.97    0.20    1.00

Table 6: gcc, cross-correlation matrix of the high-IPC segment

            IPC     BP      DL1     IL1     UL2
    IPC     0.00    0.64    0.44    0.63    0.72
    BP      0.64    0.00    0.16    0.18    0.21
    DL1     0.44    0.16    0.00    0.26    0.96
    IL1     0.63    0.18    0.26    0.00    0.13
    UL2     0.72    0.21    0.96    0.13    0.00

Table 7: gcc, difference between the two cross-correlation matrices above

As can be seen in Table 7, there is a substantial difference between the two segments in a number of the monitor correlations.
This perhaps demonstrates the crucial importance of understanding phase analysis: the differences between the two segments show that virtually any design optimization that is not somehow phase-aware is far from optimal. Finally, for completeness it should be mentioned that gcc has notably higher cache miss rates than the two previous examples.

Conclusions

- SimpleScalar was instrumented to print dynamic readings of the important performance monitors: IPC, branch misprediction rate and cache miss rates.
- The gzip, equake and gcc benchmark programs were analyzed, and their IPC and phase behavior was reviewed. Different programs had phases on different time scales.
- The importance and potential of phase-analysis research was discussed. Phase behavior was clearly seen in all programs, and insights were explained in detail.
- Correlation between IPC and the other performance monitors was not always negative, as initially expected. This was explained by the fact that the other monitors are not independent of each other, and such dependence can be dominant enough to produce seemingly counter-intuitive results.

Future Research

Phase-analysis research shows that program phases generally correlate to specific regions of code, typically frequently executed loops and subroutine calls. This work focused primarily on statistical IPC analysis; however, by tying program phases to specific code regions, the underlying reasons for the observed statistical results might be explored. This could be an interesting continuation of this work. It would require printing out the program-counter value, correlating phases with actual code segments, and then analyzing detailed per-instruction behavior in those segments.

This work analyzed a small number of primary performance monitors. However, there are certainly many more phenomena which affect the IPC than were monitored here, for example the number of available execution resources, or pipe stages halted due to some bottleneck in the design. In order to observe such issues, additional events need to be monitored, such as internal queue fill levels, the number of instructions of each type ready to execute at any given time, important counters and so on.

It would also be interesting to observe the effect of different machine configurations (caches, branch-prediction tables, etc.) on program behavior using the multi-graph tool. Finally, it would be beneficial to analyze additional programs in order to generalize the conclusions of this analysis.

References

1. SimpleScalar source and documentation, http://www.simplescalar.com
2. Timothy Sherwood et al., "Discovering and Exploiting Program Phases".
3. The "/spec_2k" public directory at the Lion computer farm in EE: SPEC2000 programs recompiled as SimpleScalar inputs and made available for research use.
4. The SPEC2000 benchmark suite.
5. Greg Semeraro, David Albonesi et al., "Dynamic Frequency and Voltage Scaling for a Multiple-Clock-Domain Microprocessor".
6. Greg Semeraro, David Albonesi et al., "Dynamic Frequency and Voltage Scaling for a Multiple-Clock-Domain Microarchitecture".
7. Slides and notes, VLSI Architecture Design course #048853, Spring 2004.
Appendix A: SimpleScalar Default Configuration

The full configuration is listed below:

    # instruction fetch queue size (in insts)
    -fetch:ifqsize 4
    # extra branch mis-prediction latency
    -fetch:mplat 3
    # speed of front-end of machine relative to execution core
    -fetch:speed 1
    # branch predictor type {nottaken|taken|perfect|bimod|2lev|comb}
    -bpred bimod
    # bimodal predictor config (<table size>)
    -bpred:bimod 2048
    # 2-level predictor config (<l1size> <l2size> <hist_size> <xor>)
    -bpred:2lev 1 1024 8 0
    # combining predictor config (<meta_table_size>)
    -bpred:comb 1024
    # return address stack size (0 for no return stack)
    -bpred:ras 8
    # BTB config (<num_sets> <associativity>)
    -bpred:btb 512 4
    # speculative predictors update in {ID|WB} (default non-spec)
    # -bpred:spec_update <null>
    # instruction decode B/W (insts/cycle)
    -decode:width 4
    # instruction issue B/W (insts/cycle)
    -issue:width 4
    # run pipeline with in-order issue
    -issue:inorder false
    # issue instructions down wrong execution paths
    -issue:wrongpath true
    # instruction commit B/W (insts/cycle)
    -commit:width 4
    # register update unit (RUU) size
    -ruu:size 16
    # load/store queue (LSQ) size
    -lsq:size 8
    # l1 data cache config, i.e., {<config>|none}
    -cache:dl1 dl1:128:32:4:l
    # l1 data cache hit latency (in cycles)
    -cache:dl1lat 1
    # l2 data cache config, i.e., {<config>|none}
    -cache:dl2 ul2:1024:64:4:l
    # l2 data cache hit latency (in cycles)
    -cache:dl2lat 6
    # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
    -cache:il1 il1:512:32:1:l
    # l1 instruction cache hit latency (in cycles)
    -cache:il1lat 1
    # l2 instruction cache config, i.e., {<config>|dl2|none}
    -cache:il2 dl2
    # l2 instruction cache hit latency (in cycles)
    -cache:il2lat 6
    # flush caches on system calls
    -cache:flush false
    # convert 64-bit inst addresses to 32-bit inst equivalents
    -cache:icompress false
    # memory access latency (<first_chunk> <inter_chunk>)
    -mem:lat 18 2
    # memory access bus width (in bytes)
    -mem:width 8
    # instruction TLB config, i.e., {<config>|none}
    -tlb:itlb itlb:16:4096:4:l
    # data TLB config, i.e., {<config>|none}
    -tlb:dtlb dtlb:32:4096:4:l
    # inst/data TLB miss latency (in cycles)
    -tlb:lat 30
    # total number of integer ALU's available
    -res:ialu 4
    # total number of integer multiplier/dividers available
    -res:imult 1
    # total number of memory system ports available (to CPU)
    -res:memport 2
    # total number of floating point ALU's available
    -res:fpalu 4
    # total number of floating point multiplier/dividers available
    -res:fpmult 1
    # profile stat(s) against text addr's (mult uses ok)
    # -pcstat <null>
    # operate in backward-compatible bugs mode (for testing only)
    -bugcompat false