1 inch (“) This Sample shows the 1st two & Last two pages of a paper No page numbers A Comparison of Path Profiling and Edge Profiling In C++ Applications Brian A. Malloya Clemson University,U.S.A 0.75’’ Abstract Font 12 pt & bold Body Text font size 10 single spacing 0.25’’ 0.75’’ We investigate the notion of phases as they occur in object-oriented programs. The focus of our investigation is on C++ programs and our test suite includes various kinds of object-oriented programs including scientific and general-purpose applications. We focus on individual phased branch behavior and attempt to capture information and generalize about the frequency of phased behavior. We provide profiling in determining phase change in various program executions. Keywords: Profiling, edge profiling, path profiling, performance, control flow graph. 1. Introduction The tremendous interest and growth in the internet has brought with it a demand for mobile code that is compact and efficient. The idea behind mobile code is that applications can be distributed quickly across computer networks and automatically executed upon arrival. For these applications to run efficiently, they must adapt to the runtime environment of the target architecture. In view of the demand for speed, there is evidence that traditional static optimizations may not provide the efficiency required for modern mobile applications. Moreover, there is increased use of object-technology in these mobile applications with a concurrent belief that object-oriented code, with its frequent use of dynamic binding, is less amenable to traditional, static optimizations. One optimization in common use today is feedbackdirected optimization, FDO, which uses program characteristics, obtained at runtime, to attempt to improve the performance of the application. Most FDO approaches use profile guided compilation, which facilitates determination of those parts of a program to modify in order to maximize performance [26]. Profiling entails instrumenting an application to obtain a cost metric such as time spent in a method or the number of times a method executes. This metric can indicate those parts of a program where optimization is likely to achieve a parts of a program where optimization is likely to achieve a a Computer Science Department Clemson, SC 29634, USA malloy@cs.clemson.edu If the input to a program changes, the optimizations performed on a previous execution may no longer be beneficial [27]. In fact, the performance of a statically optimized program may degrade appreciably, so that performance is worse than the un-optimized version of the program [26]. A second drawback with FDO relates to the determination of the instrumentation strategy that is likely to provide the most information at a tolerable cost. Traditional approaches to profile guided optimization construct a graphical representation of the program that describes the flow of control, and then place counters at vertices or edges in the graph. Edge profiles record an aggregate count of the frequency of edge executions that provide general information about the behavior of a program. However, reference [26] argues for the collection of more detailed information than aggregate edge information to describe a program's run-time characteristics, introducing the notion that the run-time behavior of a program cycles through a series of phases. For example the branch prediction and cache miss rate might be more accurately described over the execution of a program using phases [23]. Phase profiling might address the shortcomings of edge profiling if one could detect changes in phase and apply an optimization more suited for a particular phase and later apply a different one for a different phase. Phase profiling might address the shortcomings of edge profiling if one could detect changes in phase and apply an optimization more suited for a particular phase and later apply a different one for a different phase. Phase profiling might address the shortcomings of edge profiling if one could detect changes in phase and apply an optimization more suited for a particular phase and later apply a different one for a different phase. profiling if one could detect changes in phase and apply . apply a different one for a different phase. profiling if one could detect changes in phase and apply . apply a different one for a different phase. optimization more suited for a particular phase and later apply a different one for a different phase. This metric can indicate those parts of a program where optimization is likely to achieve Profiling entails instrumenting an application to obtain a cost metric such Should be an exact 0.75’’ from the above paragraph line and the end of the page No page numbers Please put title at the bottom of the figure area of program optimizations, phases and the use of profiling. Finally, in Section 8 we summarize and conclude. 2. Background In this section, we provide definitions of terms and background about profiling. We discuss both edge and path profiling including a greedy algorithm for computing edge profile information. We conclude this section with a discussion of phases and phased behavior, including the impact of phased behavior on superblock scheduling. Should have figure no. like say figure 1 Phased Behavior. This is a branch taken from the lcom benchmark which exhibits phased behavior. The pink line at y = 55% represents the average bias for the branch. The blue line represents the sampled taken bias for the branch. The sampled bias seems to interleave between periods of being strongly taken and being weakly taken. An optimization based purely on the aggregate bias would adversely affect performance during the period of very weak taken bias. To illustrate the impact that phased behavior might have on optimization, consider the graph in Figure [1], which models the run time branching behavior of an application. The branch shown in the figure is biased towards “taken” 55% of the time; that is, aggregate information would lead the profiler to conclude that the branch is likely to be “taken” more often than not. However, the sampled taken bias interleaves between periods of strong bias towards “taken” and periods of “not taken”. An optimization based strictly on the 55.5% average bias would perform poorly during the phases with significant bias towards ``not taken''. In this paper, we investigate the notion of phases as they might occur in object-oriented programs. The focus of our investigation is on C++ programs and our test suite includes various kinds of object-oriented programs including scientific and general-purpose applications. We focus on individual phased branch behavior and attempt to capture information and generalize about the frequency of phased behavior. We provide guidance about the advantages of using path profiling over edge profiling for improving the performance of programs that exhibit phased branch behavior. The remainder of this paper is organized as follows. In the next section we provide background about various approaches to program profiling, including the notion of superblock scheduling. In Section 3 we describe the simulation tool that we exploit to investigate phased behavior and in Section 4 we describe our case study that forms the basis of our conclusions about phased behavior. In Section 5 we present the results obtained from analyzing phased behavior for individual branches and discuss these results in Section 6. Section 7 overviews related work in the 2.1 Profiling One of the common uses of profiles is to derive paths taken during program execution for use in path based optimizations [7]. Edge profiling records the execution count of transitions between basic blocks [5] or edges. Since edge profiles summarize path execution using aggregate edge counts [26, 6], programs paths must be derived from a profile. Generally path based optimizations are performed on hot paths, heavily executed paths, in order to gain the best performance gain with limited resources. One technique for deriving heavily executed paths from an edge profile is to use a greedy algorithm that follows the edge with the maximum execution frequency out of a basic block[6]. This greedy algorithm does not always produce accurate results; more specifically, the greedy algorithm may misidentify which path significantly contributes to overall control flow of a program. For example, given an edge profile for a CFG such as Figure 2a, a greedy algorithm would incorrectly identify ACDEF instead of ABDEF as the most frequently executing path. Path profiling addresses the shortcomings of edge profiles by recording the actual paths taken within an application[6]. Looking at Figure 2a, path profiling can disambiguate between the two paths and determine which path actually contributed significantly to the control flow of the program. Since the frequency counts for actual paths are recorded instead of aggregate edge counts, the correct hot path ABDEF was found for the CFG in Figure 2a. The additional accuracy provided by path profiling can be beneficial for certain optimizations based on run-time profile information. Young and Smith demonstrated improvements that can be made to superblock schedulers using path profiling [29, 26]. A superblock scheduler attempts to improve program performance by increasing the amount of instruction-level parallelism (ILP) extractable 7. Related Work 0.75’’ No page numbers In this section, we overview the research that relates to program optimizations, phases and the use of profiling. 7.1 The Deco Project at Harvard and Hewlett Packard’s Dynamo The Deco project [14] at Harvard University and Hewlett Packard's Dynamo[4] are feedback-direct optimization systems which attempt to create executables which can adapt to their run-time environment. Each system uses a different mechanism for detecting changes in phase. The current prototype of Deco records Ball and Larus acyclic paths[6] and stores them in a hash table(profile table). An execution count is maintained for each entry in the hash table. Deco uses an interrupt mechanism to examine the hash table and determine which path to optimize based on a threshold execution count. Each optimized piece of code is put into a fixed sized optimization cache. After the hash table is examined, each entry's execution count is zeroed. If the cache is full, a random entry is evicted in order to deal with program phases. Dynamo uses a heuristic called MRET (Most Recently Executed Tail)[4] for identifying hot traces. Within Dynamo backward taken branches represent the end of a trace. The target addresses of these branches identify the start of new traces. An execution count is associated with each of these target addresses. If this execution count exceeds some threshold value, dynamo begins to record the instruction stream until another backward taken branch is encountered. The trace is optimized and placed into a cache indexed by the target address that starts the trace. Subsequent encounters of the target address result in hits to the cache. The target address is replaced by the address of the optimized trace. Once the trace ends control is returned to the Dynamo interpreter which starts the tracing process again. To adapt to changes in phase, Dynamo flushes the trace cache whenever the rate of trace creation surpasses some threshold value. This flushing strategy attempts to readjust optimizations to changes in branch phase. 7.3 Adaptive Parallelism Even though adaptive loop transformations for parallel computations can provide significant performance speedups, existing adaptive techniques waste processor resources. For a particular parallel computation, speedups gained by adding more processors could level off or possibly decrease. Hall and Martonosi showed that the behavior of some loops from the Specfp95 and NAS benchmark suites may go through phases where there may be insufficient levels of parallelism so additional processors may not help improve performance[15]. In some cases a serialized version of a loop might perform better. 8. Conclusions We investigated the impact of phased behavior on various C++ benchmarks. We showed that the conditional branches within the C++ benchmarks exhibit a significant amount of phased behavior. We observed a mean percentage of phased behavior of 13% at a granularity of 1000, 13.7% at a granularity of 500, and 17% at a granularity of 100. We believe this indicates a great deal of opportunity for an optimizer that can exploit path profiling. We have described techniques to improve the accuracy of our work, as well as approaches to extend the work for analyzing phased behavior in conditional branches. References [1] [2] [3] 7.2 Time Varying Behavior [4] Sherwood and Calder looked at phases of program behavior that vary over time[23]. They looked at how the instructions executed per cycle, branch prediction rates, address prediction rates, cache miss rates, and value prediction rates influence each other and change over the lifetime of the SPEC95 benchmark suite. They used the SimpleScalar [9] simulator to record statistics for every 100 million committed instructions. Each point (for each 100 million instructions committed) obtained was graphed to view trends over time and to look for cyclic behavior. The cyclic behavior of the SPEC95 benchmarks was used to reduce the amount of time needed to get an accurate picture of program behavior. The graph was used to determine the length of a program cycle. [5] [6] [7] 0.75’’ G. Aigner and U Holzle, “Eliminating Virtual Function Calls in C++ Programs”, in Proc. of the European Conference on Object Oriented Programming, 1996. G. Albert, “A Transparent Method for Correlating Profiles with Source Programs”, in Proc. of the Second Workshop on FeedbackDirected Optimization(FDO), Nov. 1999. J. Anderson, “Continuous Profiling: Where Have All the Cycles Gone”, in Proc. of the Sixteenth ACM Symposium on Operating System Principles, Oct. 1997, pp. 1- 14. V. Bala, E. Duesterwald and S. Banerjia, “Dynamo: A Transparent Dynamic Optimization System”, in Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI), June, 2000. T. Ball and J. R. Larus, “Optimally Profiling and Tracing Programs”, ACM Trans. on Programming Languages and Systems, July 1994. T. Ball and J. R. Larus, “Efficient Path Profiling”, in Proc. of MICRO-29, Dec. 1996. T. Ball, P. Mataga and M. Sagiv, “Edge Profiling versus Path Profiling: The Showdown”, In Symposium on Principles of Programming Languages, Jan. 1998. No page numbers [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] F.-B. Bjorn and J. Maloney, “The deltablue algorithm: An incremental constraint hierarchy solver”, in Proc. of the Eighth Annual IEEE Phoenix Conference on Computers and Communications, 1989. D. Burger and T. M. Austin, “The SimpleScalar Toolset, Version 2.0”, in University of Wisconsin-Madison Technical Report, CS-TR1997-1342, June 1997, pp. 128-137. B. Calder, D. Grunwald and B. Zorn, “Quantifying Behavioral Differences Between C and C++ Programs”, in Journal of Programming Languages, 1994. R. F. Cmelik and D. Keppel, “Shade: A Fast Instruction-Set Simulator for Execution Profiling”, in Proc. ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems, May 1994, pp. 128-137 E. Cox, “Fuzzy Fundamentals”, IEEE Spectrum, Oct. 1992, pp. 58-61. E. Duesterwald and V. Bala, “Software Profiling for Hot Path Prediction: Less is More”, in Proc. Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, Nov. 2000. E. Feigin, “A Case for Automatic Run-time Code Optimization”, A Case for Automatic Runtime Code Optimization. Bachelor of Arts Thesis, Harvard College, April 1999, 1999. M. W. Hall and M. Martonosi, “Adaptive Parallelism in Compiler Parallelized Code”, in Concurrency: Practice and Experience, vol. 10, no. 14, pp. 1235-1250, 1998. W. Hwu, “The Superblock: An Effective Technique for VLIW and Superscalar Compilation”, in Journal of Supercomputing, vol. 7, pp. 229-248, Jan. 1993. G. J. Klir and B. Yuan, “Fuzzy Sets and Fuzzy Logic - Theory and Applications”, Upper Saddle River, NJ: Prentice Hall PTR, 1995. J. R. Larus and E. Schnarr, “EEL: MachineIndependent Executable Editing”, in Proc. of the SIGPLAN Conference on PLDI, vol. 30, no. 6, pp. 291-300, 1995. D. C. Lee, P. J. Crowley, J.-L. Baer, T. E. Anderson and B. N. Bershad, “Execution Characteristics of Desktop Applications on Windows NT”, in ISCA, 1998 pp. 27—38. M. A. Linton, J. M. Vlissides and P. R. Calder}, “Composing User Interfaces with InterViews”, IEEE Computer, vol. 22, no. 2, pp. 8-22, 1989. T. Romer, G. Voelkner, D. Lee, A. Wolman, W. Wong, H. Levy and B. Bershard, “Instrumentation and Optimization of Win32/Intel Executables Using Etch.”, USENIX Windows NT Workshop, Seatle, WA, August 1997. S. Savari and C. Young, “Comparing and [23] [24] [25] [26] [27] [28] [29] Combining Profiles”, in Proc. Second Workshop on Feedback-Directed Optimization (FDO), Nov. 1999. T. Sherwood and B. Calder, “Time Varying Behavior of Programs”, in UC San Diego Technical Report UCSD-CS99-630, Aug. 1999. L. Smith and C. Laird, “Android, Open Source Scripting for Testing & Automation”, in Dr. Dobbs Journal, no. 326, pp. 58-61, July 2001. M. D. Smith, “Extending SUIF for Machinedependent Optimizations”, in Proc. of the First SUIF Compiler Workshop, 1996, pp. 14-25. M. D. Smith, “Overcoming the Challenges to Feedback-Directed Optimization”, in Proc. of the ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization (Dynamo'00), Jan. 2000. D. W. Wall, “Predicting Program Behavior Using Real or Estimated Profiles”, in Proc. of the ACM SIGPLAN '91 Conference on PLDI, vol. 26, pp. 59-70, June 1991. Z. Wang, K. Pierce and S. McFarling, “BMAT - A Binary Matching Tool”, in Proc. of the Seond Workshop on Feedback-Directed Optimization, 1999. C. Young and M. D. Smith, “Better Global Scheduling Using Path Profiles”, in Proc. 30th Annual IEEE/ACM Intl. Symp. on Microarchitecture, Nov. 1998. Brian A. Malloy is an associate professor in the department of Computer Science at Clemson University. Dr. Malloy's research focus is software engineering and compiler technology. He has investigated issues in software validation, testing and program representations to facilitate validation and testing. He has applied software engineering to parser development, especially applied to the construction of a parser front-end for C++. Dr. Malloy has given presentations at national and international conferences and workshops. He is an active member of the Association for Computing Machinery (ACM) and the IEEE Computer Society. Dr. Malloy has reviewed papers for IEEE Transactions for Parallel and Distributed Systems, Journal of Software, Practice and Experience, Journal of Parallelism and IEEE Transactions on Software Engineering. Black & White picture