Addressing Dark Silicon Challenges with Disciplined Approximate Computing

Hadi Esmaeilzadeh  Adrian Sampson  Michael Ringenburg  Luis Ceze  Dan Grossman
University of Washington, Department of Computer Science & Engineering
http://sampa.cs.washington.edu/

Doug Burger
Microsoft Research
dburger@microsoft.com

Abstract

With the power constraints of the dark silicon era looming, architectural changes to improve energy efficiency are critical. Disciplined approximate computing, in which unreliable hardware components are safely used by error-tolerant software, is a promising approach. We discuss programming models, architectures, and accelerators for approximate computing.

1. Introduction

Moore's Law [9], coupled with Dennard scaling [3] and with device, circuit, microarchitecture, architecture, and compiler advances, has delivered commensurate exponential performance increases for the past three decades. With the end of Dennard scaling, and thus slowed supply voltage scaling, future technology generations can sustain the doubling of devices every generation (Moore's Law) but with significantly less improvement in energy efficiency at the device level. This device scaling trend heralds a divergence between energy efficiency gains and transistor integration capacity. Based on the projection results in [1, 4, 8, 12], this divergence will result in underutilization of transistor integration capacity, or dark silicon. Even optimistic upper-bound projections [4] show that conventional approaches, including multicore scaling, fall significantly short of the historical performance scaling to which the microprocessor research community and industry are accustomed. We refer to this shortfall as the dark silicon performance gap.

Esmaeilzadeh et al. show that the fundamental performance limitations stem from the processor core [4]. Architectures that move well past the power/performance Pareto-optimal frontier of today's designs are necessary to bridge the dark silicon performance gap and utilize transistor integration capacity. Improvements to the core's efficiency will have a significant impact on long-term performance and will enable continued technology scaling, even though the core consumes only 20% of the power budget of an entire laptop, smartphone, or tablet. The model introduced in [4] identifies where innovation should be focused and shows that radical departures from conventional designs are needed.

To bridge the dark silicon performance gap, designers must develop systems that use significantly more energy-efficient techniques than conventional approaches. In this paper, we make a case for general-purpose approximate computing, which allows abstractions beyond precise digital logic (i.e., error-prone circuitry) and relies on program semantics that permit probabilistic and approximate results. Large and important classes of applications, e.g., vision, machine learning, image processing, and search, tolerate inaccuracies in their computation [2, 10, 11]. While past work has exploited this property for performance and energy efficiency, it focused on either hardware (e.g., [2]) or software (e.g., [11]) in isolation and missed tremendous opportunities in co-design. We believe that hardware/software co-design for approximate computing can lead to orders-of-magnitude improvements in energy efficiency. In a nutshell, we advocate the following approach: (1) provide the programmer with safe ways to write programs that can tolerate small inaccuracies while guaranteeing strong program semantics; (2) build architectures that can save energy by running these disciplined approximate programs; and (3) build accelerators that can leverage approximation to execute hot code in general-purpose programs more efficiently.

2. Disciplined Approximate Programming

To make approximate computing viable, we need an easy and safe way to express approximate programs. Namely, it must be straightforward to write reliable software that uses unreliable, error-prone hardware. To accomplish this, the programmer must be able to identify the important pieces of data and computation that must retain precise semantics in the presence of approximation. This is extremely important for programmability and argues for programming model involvement.

We have been exploring the use of type systems to declare data that may be subject to approximate computation in a language extension called EnerJ [10]. Using programmer-provided type annotations, the system automatically maps approximate variables to low-power storage, uses low-power operations, and even applies more energy-efficient algorithms provided by the programmer. In addition, EnerJ can statically guarantee isolation of error-sensitive program components from approximated components. This allows a programmer to control explicitly how errors from approximate computation can affect the fundamental operation of the application. Crucially, employing static analysis eliminates the need for dynamic checks, further improving energy savings. We call this language-based approach "disciplined" approximate programming. In this model, there is no formal guarantee on the quality of the output; however, the runtime system or the architecture can be configured so that the approximation does not produce catastrophically altered results.

An approximation-aware programming model also helps programmers write generalizable approximate programs. Using an abstract semantics for approximation, programs written in approximation-aware languages like EnerJ can take advantage of many different hardware mechanisms, including both traditional architectures and accelerators, as described below.
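To make the model concrete, the following sketch mimics EnerJ's annotations in plain Java. The @Approx qualifier and the endorse() operation follow the EnerJ paper [10], but here they are self-contained stand-ins with no checker behind them; treat this as a minimal sketch of the programming model, not EnerJ itself.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Target;

    public class ApproxDemo {
        // Stand-in for EnerJ's @Approx type qualifier (no checker attached).
        @Target(ElementType.TYPE_USE)
        @interface Approx {}

        // Stand-in for EnerJ's endorse(): an explicit, programmer-visible
        // cast from approximate to precise data.
        static int endorse(@Approx int v) { return v; }

        // Error-tolerant computation: under EnerJ, this arithmetic and its
        // operands may be mapped to low-power operations and storage.
        static @Approx int darken(@Approx int pixel) {
            return pixel / 2;
        }

        public static void main(String[] args) {
            @Approx int pixel = 200;        // may live in approximate storage
            @Approx int out = darken(pixel);
            // Control flow stays precise: the programmer must explicitly
            // endorse an approximate value before branching on it.
            if (endorse(out) > 100) {
                System.out.println("bright");
            } else {
                System.out.println("dark");
            }
        }
    }

Under the real EnerJ checker, any flow from approximate to precise data that is not explicitly endorsed is a static type error, which is how error-sensitive components are isolated from approximation without dynamic checks.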
3. Architecture Design for Approximation

In a recent paper [5], we proposed an efficient mapping of disciplined approximate programming onto hardware. We introduced an ISA extension that provides approximate operations and storage, which give the hardware freedom to save energy at the cost of accuracy. We then proposed Truffle, a microarchitecture design that efficiently supports these ISA extensions. The basis of our design is dual-voltage operation, with a high voltage for precise operations and a low voltage for approximate operations. The key aspect of the microarchitecture is its dependence on the instruction stream to determine when to use the low voltage. We introduced a transistor-level design of a dual-voltage SRAM array and discussed the implementation of the core structures using dual-voltage primitives as well as dynamic, instruction-based control of the voltage supply. We evaluated the power savings potential of in-order and out-of-order Truffle configurations, explored the resulting quality-of-service degradation, and demonstrated energy savings of up to 43% across several applications.
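The following is a behavioral sketch, in software, of the semantics the Truffle ISA exposes: a precise operation is always exact, while an approximate operation may occasionally return a slightly wrong result. The error rate and single-bit-flip error model are illustrative assumptions, not measured characteristics of the dual-voltage hardware, and in the real design the precision level is carried by each instruction's encoding rather than by the choice of method.

    import java.util.Random;

    public class TruffleModel {
        private static final Random rng = new Random(42);
        // Illustrative assumption: probability that a low-voltage
        // operation produces a single-bit error.
        private static final double LOW_VOLTAGE_ERROR_RATE = 1e-4;

        // Precise add: runs at the high voltage and is always exact.
        static int addPrecise(int a, int b) {
            return a + b;
        }

        // Approximate add: the instruction's encoding tells the hardware
        // to run at the low voltage; occasionally a low-order bit flips.
        static int addApprox(int a, int b) {
            int sum = a + b;
            if (rng.nextDouble() < LOW_VOLTAGE_ERROR_RATE) {
                sum ^= 1 << rng.nextInt(8); // flip one of the low 8 bits
            }
            return sum;
        }

        public static void main(String[] args) {
            int loopCounter = addPrecise(3, 1);   // error-sensitive: precise
            int pixelValue  = addApprox(120, 7);  // error-tolerant: approximate
            System.out.println(loopCounter + " " + pixelValue);
        }
    }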
Even though our microarchitectural implementation of Truffle is based on dual-voltage functional units and storage, the ISA extensions and the architecture design can leverage other approximate technologies, such as custom-designed approximate functional units or analog operation and storage units. Utilizing these technologies can further improve the efficiency of our design, whose gains come from breaking the accuracy abstraction while preserving safety guarantees.

The Truffle architecture enables fine-grained approximate computing, where the granularity of approximation is a single instruction. The approach also supports fine-grained approximation at the storage level, where a single register or a single data cache line can hold approximate or precise state. Thus, the Truffle microarchitecture can run a mixture of approximate and precise instructions, dynamically selecting the precision level at which each executes. To execute this fine-grained mixture safely, Truffle relies on the compiler to provide static guarantees on the safety of the execution: our ISA extensions let the compiler statically encode the precision level of all operations and storage. This compiler–architecture co-design simplifies the microarchitecture and alleviates the overheads of fine-grained approximate execution. Since the lower-power units in Truffle are restricted to execution and data storage, the overhead of the processor frontend is a limiting factor in the efficiency of this class of microarchitectures. Hence, we are now investigating approximate accelerators.

4. Approximate Accelerators

Recent research on addressing the dark silicon problem has proposed configurable accelerators that adapt to efficiently execute hot code in general-purpose programs without incurring the instruction control overhead of traditional execution. Recent configurable accelerators include BERET [7], Conservation Cores [12], DySER [6], and QsCores [13]. Approximate accelerators have the potential to go beyond the savings available to correctness-preserving accelerators when imprecision is acceptable.

We are exploring approximate accelerators based on a novel data-driven approach to mimicking code written in a traditional, imperative programming language. Our approach transforms error-tolerant code to use a learning mechanism, namely a neural network, to mimic the behavior of hot code in an approximation-aware application. This program transformation transparently trains a neural network on empirical inputs and outputs of the "hot" target code; a hardware neural network implementation can then replace the original imperative code at run time. In this sense, the transformation uses hardware neural network implementations as a new class of approximate accelerators for general-purpose code. We call these accelerators Neural Processing Units (NPUs).

Hardware neural network implementations, in both analog and digital logic, have been thoroughly explored and can be extremely efficient, so there is great efficiency potential in replacing expensive computations with neural network recalls. Using them effectively as accelerators, however, requires low latency and thus tight coupling with the main core. We are investigating digital and analog NPU implementations that integrate with the processor pipeline.
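The sketch below illustrates the heart of this transformation in pure software: sample a hot, error-tolerant function on empirical inputs, fit a model to its observed outputs, and substitute the learned model at the original call site. For brevity the "network" is a single linear neuron trained by gradient descent, and the target function, learning rate, and iteration count are all illustrative assumptions; an actual NPU would use a multilayer network with a hardware implementation tightly coupled to the pipeline.

    import java.util.Random;

    public class NpuTrainingDemo {
        // Hot, error-tolerant target code (illustrative): a standard RGB
        // luminance computation that the transformation will mimic.
        static double luminance(double r, double g, double b) {
            return 0.299 * r + 0.587 * g + 0.114 * b;
        }

        public static void main(String[] args) {
            Random rng = new Random(1);
            double[] w = new double[3]; // model weights, initially zero
            double lr = 0.05;           // learning rate (assumed)

            // Training phase: observe the target code on empirical inputs
            // and fit the model to its outputs by gradient descent.
            for (int step = 0; step < 20000; step++) {
                double r = rng.nextDouble();
                double g = rng.nextDouble();
                double b = rng.nextDouble();
                double target = luminance(r, g, b);
                double predicted = w[0] * r + w[1] * g + w[2] * b;
                double err = predicted - target;
                w[0] -= lr * err * r;
                w[1] -= lr * err * g;
                w[2] -= lr * err * b;
            }

            // "Recall" phase: the original call site is replaced by an
            // invocation of the learned model (in hardware, an NPU recall).
            double exact  = luminance(0.5, 0.5, 0.5);
            double approx = w[0] * 0.5 + w[1] * 0.5 + w[2] * 0.5;
            System.out.printf("exact=%.4f approx=%.4f%n", exact, approx);
        }
    }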
This learning-based approach to approximate acceleration also opens the door to acceleration using existing hardware substrates: GPUs, FPGAs, or FPAAs. By mapping the trained neural network to high-performance implementations on these substrates, the program transformation has the potential to provide speedups for some applications even without special hardware.

To realize learning-based approximate acceleration with NPUs, we must address a number of important research questions:

• What should the programming model look like?
• How can a neural network be automatically configured and trained with limited intervention from the programmer?
• How should the compilation workflow transform programs to invoke NPUs?
• What ISA extensions are necessary to support NPUs?
• How can a high-performance, low-power, configurable neural network implementation tightly integrate with the processor pipeline?

Space prohibits us from treating each of these issues in detail here, but we have explored solutions to each of these questions and will discuss them in the presentation.

5. Conclusions

We strongly believe that approximate computing offers many opportunities to improve computer systems and to bridge the dark silicon performance gap imposed by the energy inefficiency of device scaling. While we are actively developing new technologies at the programming, architecture, and circuit levels, seizing this opportunity also requires language and runtime techniques to express and enforce reliability constraints.

References

[1] S. Borkar and A. A. Chien. The future of microprocessors. Communications of the ACM, 54(5), 2011.
[2] M. de Kruijf, S. Nomura, and K. Sankaralingam. Relax: An architectural framework for software recovery of hardware faults. In ISCA, 2010.
[3] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), October 1974.
[4] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In ISCA, 2011.
[5] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Architecture support for disciplined approximate programming. In ASPLOS, 2012.
[6] V. Govindaraju, C.-H. Ho, and K. Sankaralingam. Dynamically specialized datapaths for energy efficient computing. In HPCA, 2011.
[7] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO, 2011.
[8] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, July–Aug. 2011.
[9] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.
[10] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. EnerJ: Approximate data types for safe and general low-power computation. In PLDI, 2011.
[11] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard. Managing performance vs. accuracy trade-offs with loop perforation. In FSE, 2011.
[12] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010.
[13] G. Venkatesh, J. Sampson, N. Goulding, S. K. Venkata, S. Swanson, and M. B. Taylor. QsCores: Trading dark silicon for scalable energy efficiency with quasi-specific cores. In MICRO, 2011.