Approximate Computing: Application Analysis and Microarchitecture Design

15-740: Computer Architecture Project Proposal

Gennady Pekhimenko (gpekhime@cs.cmu.edu)
Danai Koutra (dkoutra@cs.cmu.edu)
Kun Qian (kqian@andrew.cmu.edu)

September 27, 2010

1 Introduction and Motivation

People typically expect a processor to produce a correct result, but a software algorithm may not need "strict" correctness; limited correctness may be enough. Whether a fault is acceptable therefore depends on how the quality of service of the application is defined. Potential examples are AI and biological applications, as well as some multimedia programs [4], which can be very tolerant of "defects" in execution [5]. It is therefore possible to exploit these characteristics and improve performance by simplifying the hardware or by changing the algorithm itself with aggressive compiler optimizations. But what should replace the notion of strict correctness? And which changes and aggressive optimizations become possible? That is what this project is about.

Our plan is to search for a good definition of "non-strict" correctness for a certain class of applications, and then to investigate the hardware changes that such a definition enables. For example, we would like to study the possibility of ignoring some cache misses: the load immediately returns a predicted value, and the true value is fetched later only if it is needed (see the first sketch at the end of this section). Similar ideas apply to branch mispredictions and load-store dependencies.

Such speculation is possible not only in hardware but also in software, through compiler optimizations. Suppose a loop accounts for 90% of the whole program's execution time and we want to parallelize it to benefit from a multicore system, but it contains a "maybe" dependency: the compiler cannot tell whether the dependency is real, for example because two pointers may refer to unknown memory regions that in fact never intersect. We could then ignore this dependency and achieve better performance with an acceptable quality of service (see the second sketch at the end of this section). The goal of our project is to find where and when such aggressive optimizations are possible, and what their positive and negative effects on the execution of an application are.
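To make the cache-miss idea concrete, below is a minimal software sketch of the mechanism we have in mind: on a miss, the consumer immediately receives a last-value prediction, and validation against the true value happens in the background. All names here (CacheModel, load) are our own illustration, and the actual experiments would run inside a simulator such as Simics or Scarab; this is a sketch of the idea, not a design.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal model of "ignoring" a cache miss: instead of stalling, the load
// returns a predicted value and the fill happens in the background. A
// last-value table serves as the (deliberately simple) predictor.
struct CacheModel {
    std::unordered_map<uint64_t, uint32_t> cache;      // lines currently cached
    std::unordered_map<uint64_t, uint32_t> last_value; // last-value predictor
    uint64_t hits = 0, misses = 0, mispredictions = 0;

    // 'memory' stands in for DRAM. On a miss the caller gets the guess;
    // 'mispredictions' counts how often the guess was wrong, i.e. how much
    // error this policy would inject if nothing recovers.
    uint32_t load(uint64_t addr,
                  const std::unordered_map<uint64_t, uint32_t>& memory) {
        auto hit = cache.find(addr);
        if (hit != cache.end()) { ++hits; return hit->second; }

        ++misses;
        uint32_t predicted = last_value.count(addr) ? last_value[addr] : 0;
        uint32_t actual    = memory.count(addr) ? memory.at(addr) : 0;
        if (predicted != actual) ++mispredictions; // strict mode: recover here;
                                                   // non-strict mode: keep going
        cache[addr] = actual;       // background fill
        last_value[addr] = actual;  // train the predictor
        return predicted;           // consumer proceeds with the guess
    }
};
```

Under strict correctness every misprediction forces recovery; the interesting question for us is how often a non-strict definition would let that recovery be skipped entirely.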
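The "maybe" dependency can likewise be shown in a few lines of C++. The function names are ours, and we assume OpenMP plus the non-standard but widely supported __restrict qualifier; the point is only to illustrate the trade-off, not to reproduce code from any benchmark.

```cpp
// Conservative version: if a and b may alias (for example b == a - 1),
// iteration i reads what iteration i-1 wrote, so the compiler must keep
// the loop serial.
void smooth_conservative(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = 0.5f * a[i] + 0.5f * b[i];
}

// Relaxed version: we promise the compiler that the arrays do not overlap,
// which makes every iteration independent and lets the loop run in
// parallel. If the promise is wrong, the output is only approximately
// correct, which is exactly the trade-off this project wants to quantify.
void smooth_relaxed(float* __restrict a, const float* __restrict b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = 0.5f * a[i] + 0.5f * b[i];
}
```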
2 Related Work

Wang, Fertig, and Patel [5] used fault injection in randomly selected conditional branches and showed that 40% of all conditional branches and 50% of dynamically mispredicted branches result in correct execution, meaning that the architectural state is correct by the time a system call is reached. This shows that, in some cases, it is possible not to recover from a branch misprediction, as the branch might be outcome-tolerant; exploiting this can simplify the design and also improve the performance of some programs. Moreover, Li and Yeung [2] showed that in some multimedia and AI benchmarks, 45.8% of the executions that are incorrect from the architecture's standpoint are still acceptable to the user. They define application-level correctness and, using fault injection (single bit flips), determine the fault susceptibility of applications under both the application-level and the architecture-level definitions of correctness. They also present a fault recovery system for the cases where a program produces results that are unacceptable even to the user.

The main difference between our work and these studies is that we intend to take advantage of the fact that some applications allow approximate results, and we want to find which constraints can be relaxed in order to improve the performance of such applications. We do not want to inject random errors into applications; rather, we are interested in finding specific parts of programs that can be treated in a simplified way while still yielding correct execution under application-level correctness. More papers related to our project can be found in the references.

3 Research Plan

1. Week September 27 - October 3. Define the benchmark list for the experiments. Choose the experimental setup methodology:
   (a) a simulator (Simics [14], Scarab),
   (b) the Pin [11] and Valgrind [17] dynamic instrumentation tools,
   (c) real hardware,
   (d) profiling tools.

2. Week October 4 - October 10. Benchmark profiling: collect hot routines/functions and cache misses (L1, L2).

3. Week October 11 - October 17. Cache-miss classification (a sketch appears after Section 4) and initial experiments. Milestone Report 1.

4. Week October 18 - October 24. Search for a "non-strict" correctness definition (see the verification sketch after Section 4). Modify the benchmarks' verification code if needed.

5. Week October 25 - October 31. Cache-miss experiments, Phase 1 (no value prediction). Reevaluate our model.

6. Week November 1 - November 7. Cache-miss experiments, Phase 2 (simple value prediction). Reevaluate our model if needed. Milestone Report 2.

7. Week November 8 - November 14. Set up the LLVM [16] compiler framework.

8. Week November 15 - November 21. Analysis of compiler opportunities: dependency breaking, parallelization, compiler directives, cache misses. Manual source changes to check for opportunities. Milestone Report 3.

9. Week November 22 - November 28. Compiler experiments for the opportunities found.

10. Week November 29 - December 5. More compiler experiments. Poster preparation.

11. Week December 6 - December 12. Poster presentation. Final report.

4 Methodology

For the hardware part of the project we plan to use simulators (Simics [14], Scarab) and binary instrumentation tools such as Pin [11] and Valgrind [17]. For the software part we will use profiling tools and the LLVM [16] compiler framework. Our target applications are a selection of benchmarks from SPEC CPU2000/CPU2006 [15] and MediaBench [4].
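As a starting point for the cache-miss classification in week 3, a minimal model of a direct-mapped cache that separates compulsory (cold) misses from the remaining capacity/conflict misses could look as follows. The geometry (64-byte lines, 64 KB capacity) is an arbitrary placeholder of ours, not a parameter from the proposal.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Classifies each access to a direct-mapped cache as a hit, a compulsory
// (cold) miss, or some other (capacity/conflict) miss.
struct MissClassifier {
    static constexpr uint64_t kLineBytes = 64;
    static constexpr uint64_t kNumSets   = 1024;   // 64 KB direct-mapped

    std::unordered_map<uint64_t, uint64_t> sets;   // set index -> cached line
    std::unordered_set<uint64_t> touched;          // lines ever referenced
    uint64_t hits = 0, cold_misses = 0, other_misses = 0;

    void access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        uint64_t set  = line % kNumSets;
        auto it = sets.find(set);
        if (it != sets.end() && it->second == line) { ++hits; return; }
        if (touched.insert(line).second) ++cold_misses;  // first touch ever
        else ++other_misses;                             // capacity/conflict
        sets[set] = line;                                // fill on miss
    }
};
```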
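For week 4, one plausible shape for a relaxed verification routine is sketched below: instead of requiring bitwise equality with a reference output, accept the run if the signal-to-noise ratio of the produced output stays above a threshold. The 30 dB default is a placeholder of ours, and the right metric would have to be chosen per benchmark.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A "non-strict" correctness check: the run passes if the produced output
// is within a quality budget of the reference output, rather than equal to
// it. Assumes ref and out have the same length.
bool acceptable(const std::vector<double>& ref,
                const std::vector<double>& out,
                double min_snr_db = 30.0) {  // placeholder threshold
    double signal = 0.0, noise = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        signal += ref[i] * ref[i];
        double d = ref[i] - out[i];
        noise += d * d;
    }
    if (noise == 0.0) return true;  // bit-exact output always passes
    return 10.0 * std::log10(signal / noise) >= min_snr_db;
}
```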
References

[1] Balakrishnan, S., and Sohi, G. S. Program demultiplexing: Data-flow based speculative parallelization of methods in sequential programs. In ISCA '06: Proceedings of the 33rd International Symposium on Computer Architecture (2006), pp. 302–313.

[2] Li, X., and Yeung, D. Application-level correctness and its impact on fault tolerance. In Proceedings of the 13th International Symposium on High Performance Computer Architecture (2007), pp. 181–192.

[3] Lipasti, M. H., Wilkerson, C. B., and Shen, J. P. Value locality and load value prediction. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (1996), pp. 138–147.

[4] MediaBench. http://euler.slu.edu/~fritts/mediabench

[5] Wang, N., Fertig, M., and Patel, S. J. Y-branches: When you come to a fork in the road, take it. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (2003), pp. 56–67.

[6] Mutlu, O., Kim, H., and Patt, Y. N. Address-value delta (AVD) prediction: A hardware technique for efficiently parallelizing dependent cache misses. IEEE Transactions on Computers 55, 12 (2006).

[7] Neelakantam, N., Rajwar, R., Srinivas, S., Srinivasan, U., and Zilles, C. Hardware atomicity for reliable software speculation. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture (New York, NY, USA, 2007), ACM, pp. 174–185.

[8] Patel, S. J., and Lumetta, S. S. rePLay: A hardware framework for dynamic program optimization. Tech. rep., University of Illinois at Urbana-Champaign, 1999.

[9] Patel, S. J., Tung, T., Bose, S., and Crum, M. M. Increasing the size of atomic instruction blocks using control flow assertions. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (2000), pp. 303–313.

[10] Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J., Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G., Wong, W.-F., Zibin, Y., Ernst, M. D., and Rinard, M. Automatically patching errors in deployed software. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), ACM, pp. 87–102.

[11] Pin. http://www.pintool.org/

[12] Salamat, B., Jackson, T., Gal, A., and Franz, M. Orchestra: Intrusion detection using parallel execution and monitoring of program variants in user-space. In EuroSys '09: Proceedings of the 4th ACM European Conference on Computer Systems (New York, NY, USA, 2009), ACM, pp. 33–46.

[13] Sazeides, Y., and Smith, J. E. The predictability of data values. In Proceedings of the 30th International Symposium on Microarchitecture (1997), pp. 248–258.

[14] Simics. http://www.virtutech.com

[15] SPEC CPU2000 and CPU2006 Benchmarks. http://www.spec.org/

[16] The LLVM Compiler Infrastructure. http://llvm.org

[17] Valgrind. http://valgrind.org/

[18] Zilles, C., and Sohi, G. Master/slave speculative parallelization. In MICRO 35: Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (Los Alamitos, CA, USA, 2002), IEEE Computer Society Press, pp. 85–96.