Approximate computing: application analysis and microarchitecture design

15-740: Computer Architecture
Project Proposal
Gennady Pekhimenko (gpekhime@cs.cmu.edu)
Danai Koutra (dkoutra@cs.cmu.edu)
Kun Qian (kqian@andrew.cmu.edu)
September 27, 2010
1 Introduction and Motivation
Users typically expect the processor to produce a correct result, but the software algorithm may not require "strict" correctness; limited correctness may suffice. Whether a fault is acceptable therefore depends on the definition of quality of service for the application. Potential examples are AI and biological applications, as well as some multimedia programs [4], which can be very tolerant of "defects" in execution [5]. It is thus possible to exploit these characteristics and improve performance by simplifying the hardware or by changing the algorithm itself through aggressive compiler optimization.
But how should the notion of strict correctness be replaced? And what changes and aggressive optimizations become possible? That is what this project is about. Our plan is to search for a good definition of "non-strict" correctness for a certain class of applications. We then want to investigate the hardware changes that this new definition enables. For example, we would like to investigate the possibility of ignoring some cache misses by returning a predicted value and fetching the true value later if needed. Similar techniques can be applied to branch mispredictions and load-store dependencies.
Such speculation is possible not only in hardware but also in software, through compiler optimizations. Suppose, for example, that a loop accounts for 90% of the program's execution time and we want to parallelize it (to exploit our multicore system), but it contains a "maybe" dependency: the compiler cannot prove whether the dependency is real, for instance because two pointers to unknown memory might, but do not actually, overlap. We could then ignore this dependence to achieve better performance with an acceptable quality of service.
The goal of our project is to determine where and when these aggressive optimizations are possible, and what their positive and negative effects on application execution are.
2 Related Work
Wang, Fertig, and Patel [5] used fault injection in randomly selected conditional branches and showed that 40% of all conditional branches and 50% of dynamically mispredicted branches still result in correct execution, meaning that the architectural state is correct by the time a system call is reached. This shows that, in some cases, it is possible not to recover from a branch misprediction, as the branch may be outcome-tolerant; exploiting this can both simplify the design and improve the performance of some programs. Moreover, Li and Yeung [2] showed that in some multimedia and AI benchmarks, 45.8% of the executions that are incorrect from the architecture's standpoint are acceptable to the user. The authors define application-level correctness and, using fault injection (single bit flips), determine the fault susceptibility of applications under both the application-level and architecture-level definitions of correctness. They also present a fault recovery system that is used when programs produce results that are unacceptable even to the user.
The main difference between our work and these studies is that we intend to take advantage of the fact that some applications allow approximate results: we want to find which constraints can be relaxed in order to improve the performance of such applications. Rather than injecting random errors into applications, we are interested in finding specific parts of programs that can be handled in a simplified way while still leading to correct execution under application-level correctness. More papers related to our project can be found in the references.
3 Research Plan
1. Week September 27 - October 3 Define the benchmark list for experiments and choose the experimental setup methodology:
(a) simulators (Simics [14], Scarab),
(b) Pin [11] and Valgrind [17] dynamic instrumentation tools,
(c) real hardware,
(d) profiling tools.
2. Week October 4 - October 10
Benchmark profiling: collecting hot routines/functions and cache misses (L1, L2).
3. Week October 11 - October 17
Cache miss classification and initial experiments. Milestone Report 1.
4. Week October 18 - October 24
Search for a "non-strict" correctness definition. Modify benchmark verification code if needed.
5. Week October 25 - October 31
Cache miss experiments, Phase 1 (no value prediction). Reevaluate our model.
6. Week November 1 - November 7
Cache miss experiments, Phase 2 (simple value prediction). Reevaluate our model if needed. Milestone Report 2.
7. Week November 8 - November 14
LLVM [16] compiler framework setup.
8. Week November 15 - November 21
Compiler opportunities analysis: dependency breaking, parallelization, compiler directives, cache misses. Manual source changes to check for opportunities. Milestone
Report 3.
9. Week November 22 - November 28
Compiler experiments for opportunities found.
10. Week November 29 - December 5
More compiler experiments. Poster preparation.
11. Week December 6 - December 12
Poster Presentation. Final Report.
4 Methodology
For the part of the project that involves hardware experiments we plan to use simulators (Simics [14], Scarab) and binary instrumentation tools such as Pin [11] and Valgrind [17]. For the software part we will use profiling tools and the LLVM compiler framework [16]. Our target applications are a selection of benchmarks from SPEC CPU2000/CPU2006 [15] and MediaBench [4].
References
[1] Balakrishnan, S., and Sohi, G. S. Program demultiplexing: Data-flow based speculative parallelization of methods in sequential programs. In ISCA '06: Proceedings of the 33rd International Symposium on Computer Architecture (2006), pp. 302–313.
[2] Li, X., and Yeung, D. Application-level correctness and its impact on fault tolerance. In Proceedings of the 13th International Symposium on High Performance Computer Architecture (2007), pp. 181–192.
[3] Lipasti, M. H., Wilkerson, C. B., and Shen, J. P. Value locality and load value prediction. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (1996), pp. 138–147.
[4] http://euler.slu.edu/~fritts/mediabench, Media Benchmarks.
[5] Wang, N., Fertig, M., and Patel, S. J. Y-branches: When you come to a fork in the road, take it. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (2003), pp. 56–67.
[6] Mutlu, O., Kim, H., and Patt, Y. N. Address-value delta (AVD) prediction: A hardware technique for efficiently parallelizing dependent cache misses.
[7] Neelakantam, N., Rajwar, R., Srinivas, S., Srinivasan, U., and Zilles,
C. Hardware atomicity for reliable software speculation. In ISCA ’07: Proceedings
of the 34th annual international symposium on Computer architecture (New York,
NY, USA, 2007), ACM, pp. 174–185.
[8] Patel, S. J., and Lumetta, S. S. rePLay: A hardware framework for dynamic program optimization. Tech. rep., University of Illinois at Urbana-Champaign, 1999.
[9] Patel, S. J., Tung, T., Bose, S., and Crum, M. M. Increasing the size of atomic instruction blocks using control flow assertions. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (2000), pp. 303–313.
[10] Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J.,
Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G.,
Wong, W.-F., Zibin, Y., Ernst, M. D., and Rinard, M. Automatically
patching errors in deployed software. In SOSP ’09: Proceedings of the ACM SIGOPS
22nd symposium on Operating systems principles (New York, NY, USA, 2009), ACM,
pp. 87–102.
[11] http://www.pintool.org/, Pin.
[12] Salamat, B., Jackson, T., Gal, A., and Franz, M. Orchestra: intrusion
detection using parallel execution and monitoring of program variants in user-space.
In EuroSys ’09: Proceedings of the 4th ACM European conference on Computer
systems (New York, NY, USA, 2009), ACM, pp. 33–46.
[13] Sazeides, Y., and Smith, J. E. The predictability of data values. In Proceedings of the 30th Annual International Symposium on Microarchitecture (1997), pp. 248–258.
[14] http://www.virtutech.com, Simics.
[15] http://www.spec.org/, SPEC CPU2000 and CPU2006 Benchmarks.
[16] http://llvm.org, The LLVM Compiler Infrastructure.
[17] http://valgrind.org/, Valgrind.
[18] Zilles, C., and Sohi, G. Master/slave speculative parallelization. In MICRO 35:
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture (Los Alamitos, CA, USA, 2002), IEEE Computer Society Press, pp. 85–96.