Arithmetic Data Value Speculation

Daniel R. Kelly and Braden J. Phillips

Centre for High Performance Integrated Technologies and Systems (CHiPTec),
School of Electrical and Electronic Engineering,
The University of Adelaide, Adelaide SA 5005, Australia
{dankelly,phillips}@eleceng.adelaide.edu.au

Abstract. Value speculation is widely used in processor designs to increase the overall number of instructions executed per cycle (IPC). Current methods use sophisticated prediction techniques to speculate on the outcome of branches and execute code accordingly. Speculation can be extended to the approximation of arithmetic values. Because arithmetic operations are slow to complete in pipelined execution, an increase in overall IPC is possible through accurate arithmetic data value speculation. This paper focuses on integer adder units to demonstrate arithmetic data value speculation.

1 Introduction

Modern processors commonly use branch prediction to speculatively execute code. This allows the overall number of instructions per cycle (IPC) to be increased if the time saved by correct predictions outweighs the penalties for mis-predictions. Various schemes are used for branch prediction; however, few are used for the prediction, or approximation, of arithmetic values.

The adder is the basic functional unit in computer arithmetic. Adder structures are used in signed addition and subtraction, as well as in floating-point multiplication and division. Hence, improved adder design can be applied to improve all basic forms of computer arithmetic. Pipeline latency is often limited by the relatively large delay of arithmetic units, so a decrease in the delay associated with arithmetic operations will promote an increase in IPC.

Data value speculation schemes, as opposed to value prediction schemes, have been proposed in the past for superscalar processors, such as the use of stride-based predictors for the calculation of memory locations [1]. Stride-based predictors assume a constant offset from a base memory location and use a constant stride length to iteratively access elements of a data array.

In this paper we discuss two basic designs for arithmetic approximation units, using a ripple carry adder as an example. We then investigate the theoretical performance of such an adder using SPEC benchmarks. Finally, we briefly discuss modifications to a basic MIPS architecture model to employ arithmetic data value speculation.

2 Approximate arithmetic units

2.1 Overview

Arithmetic units can provide an approximate result if the design of a basic arithmetic unit is modified so that an incomplete result is available earlier than the normal delay. This can be done with specially designed approximate hardware, or by modifying a regular arithmetic unit to provide a partial result earlier than the worst-case delay. The intermediate (approximate) result can then be forwarded to other units for speculative execution before the exact result is known. The speculative result can be checked against the exact result, provided either by a worst-case arithmetic unit or by waiting for the full result after the worst-case delay. A comparison operation simultaneously provides the outcome of the speculation and the exact result in the case of spurious speculation.

An approximate arithmetic unit is incomplete in some way. An approximate unit can be logically changed to provide results earlier than the worst-case completion time.
In the common case the result will match the exact calculation, and in the uncommon case it will be erroneous. Such a unit will be called a "logically incomplete unit". It is also possible to provide an approximate result by overclocking the arithmetic hardware to take advantage of short propagation times in the average case. These units are called "temporally incomplete units".

It is important to test new designs under normal operating conditions to investigate actual performance. In the execution of typical programs, the assumption of uniform random input does not hold. Li has empirically demonstrated that adder inputs are highly data dependent and exhibit different carry-length distributions for different data [2]. For instance, operands for loop counters in programs are usually small and positive, producing small carry lengths in adders. On the other hand, integer addressing calculations can produce quite long carry lengths. This observation led Koes et al. to design an application-specific adder based upon an asynchronous dynamic Brent-Kung adder [3, 4]. The new adder adds early termination logic to the original design.

2.2 Logically Incomplete Arithmetic Units

Liu and Lu propose a simple ripple carry adder (we will call it a "Liu-Lu adder") in which the logic is structured to restrict the length that a carry can propagate [5]. This is, therefore, an example of a logically incomplete adder. Figure 1 shows an example of an 8-bit adder with 3-bit carry segments.

Fig. 1. Structurally incomplete 8-bit adder with 3-bit carry segments

The Liu-Lu adder exploits Burks et al.'s famous upper bound on the expected worst-case carry length [6]. Assuming the addition of uniform random binary data, the expected worst-case carry length satisfies (1), where C_N is the expected length of the longest carry chain in an N-bit addition.

    C_N ≤ log2(N) .    (1)

This upper bound was reduced by Briley to (2) in 1973 [7].

    C_N ≤ log2(N) − 1/2 .    (2)

For the purposes of adder design, we wish to model the probability that the worst-case carry length in an N-bit addition is less than k bits, i.e.,

    P(N, k) = 1 − Pr[C_N ≥ k] = Pr[C_N < k] .    (3)

Lu provides an analysis of the probability of an incorrect result in the Liu-Lu adder for uniform random inputs [5]. The Liu-Lu adder forms the sum for each bit i by considering any carry generated in the previous k-bit segment (inclusive). Each k-bit segment is a ripple carry adder. Equations (1) and (2) show that the expected worst-case carry length is short with respect to the length of the addition. Hence we can trade a small accuracy penalty for a significant time saving compared with the worst-case propagation delay of a ripple carry adder.

In order to predict the average-case accuracy of the Liu-Lu adder, a result is derived for the probability of a correct result, P, in an N-bit addition with a k-bit segment. Liu and Lu's equation is given below [8].

    P_LiuLu(N, k) = (1 − 1/2^(k+2))^(N−k−1) .    (4)

Pippenger's Equation

Pippenger also provides an approximate distribution for the expected worst-case carry length, shown in (5) below [10]. Pippenger proves this to be a lower bound on the probability distribution, and it is used here as a pessimistic approximation to the performance of the Liu-Lu adder.

    P_Pippenger(N, k) = e^(−N / 2^(k+1)) .    (5)
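As a concrete illustration, the following Python sketch models a Liu-Lu style segmented adder at the functional level and compares a Monte Carlo estimate of its accuracy against (4) and (5). It assumes one particular reading of the scheme, in which the carry into each bit is rippled over only the k bits immediately below it; the gate-level segmentation in [5] may place segment boundaries slightly differently, so the indexing of k here may be off by one with respect to the figures that follow.

import math
import random

def approx_add(a, b, n, k):
    """Liu-Lu style approximate sum: the carry into bit i is obtained by
    rippling over only the k bits below i, with no carry entering that window."""
    s = 0
    for i in range(n):
        carry = 0
        for j in range(max(0, i - k), i):        # ripple within the k-bit window
            g = (a >> j) & (b >> j) & 1          # generate
            p = ((a >> j) ^ (b >> j)) & 1        # propagate
            carry = g | (p & carry)
        s |= (((a >> i) ^ (b >> i) ^ carry) & 1) << i
    return s

def p_liu_lu(n, k):        # equation (4)
    return (1.0 - 1.0 / 2 ** (k + 2)) ** (n - k - 1)

def p_pippenger(n, k):     # equation (5)
    return math.exp(-n / 2 ** (k + 1))

if __name__ == "__main__":
    N, K, TRIALS = 32, 8, 100_000
    mask = (1 << N) - 1
    hits = sum(approx_add(a, b, N, K) == ((a + b) & mask)
               for a, b in ((random.getrandbits(N), random.getrandbits(N))
                            for _ in range(TRIALS)))
    print(f"Monte Carlo estimate : {hits / TRIALS:.4f}")
    print(f"Liu-Lu, eq. (4)      : {p_liu_lu(N, K):.4f}")
    print(f"Pippenger, eq. (5)   : {p_pippenger(N, K):.4f}")

With k set to the full word length the sketch degenerates to an exact ripple carry adder, which is a convenient sanity check of the model.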
Analysis of the Liu-Lu Equation

In the derivation of (4), Lu states that "if we only consider k previous bits to generate the carry, the result will be wrong if the carry propagation chain is greater than (k + 1)" and ". . . moreover, the previous bit must be in the carry generate condition" [5]. Both statements are incorrect.

In the analysis of arithmetic circuits, it is useful to describe an N-bit addition in terms of the generate g_i, propagate p_i, and annihilate a_i signals for each digit i = 0 ... (N−1) of the addition (where i = 0 is the least significant digit) [9]. If we consider any k-bit segment in an N-bit addition, in the Liu-Lu adder a carry will not be propagated from the k-th bit of the segment to any other bit. Hence the result in the (k+1)-th bit will be wrong. Therefore the Liu-Lu adder can only provide correct answers when every carry length is less than k bits; the approximate result will be wrong if any carry propagation chain is greater than or equal to k bits.

Now consider a very long carry string of length 2k, (g_i p_{i+1} ... p_{i+2k−1}). As demonstrated above, the most significant k bits in this 2k-bit segment will be incorrect, as they will not have the carry propagated to them. Thus an incorrect result can be produced without requiring that the bit preceding the most significant k-bit segment is in the carry-generate condition (although it may propagate a carry generated earlier and still cause failure of the adder). Also, two or more disjoint carry chains in the N bits can produce a spurious result if k ≤ N/2.

The probability of any input bit being a generate signal is P(g_i) = 1/4 = 1/2^2, and for a propagate signal P(p_i) = 1/2, because a propagate can occur in two distinct ways. Hence the probability of each k-bit segment producing an erroneous result is actually 1/2^(k+1). There are also (N−k+1) overlapping k-bit segments required to construct the N-bit logically incomplete adder. The result in equation (4) is arrived at by Liu and Lu's assumption that "the probability of (each k-bit segment) being correct is one minus the probability of being wrong ... we multiply all the probabilities to produce the final product". As discussed above, the Liu-Lu adder will produce spurious results if there exists any carry length ≥ k bits in the N-bit addition. Hence there are many probability cross terms not represented in equation (4). The probability of each k-bit segment producing a carry out is fiendishly difficult to calculate, as each k-bit segment overlaps with (k−1) to 2(k−1) other such segments.

Backcount Algorithm

In order to analyse the performance of the Liu-Lu adder it is necessary to know the actual probability of success P(N, k) for each word length N and carry segment k. Exhaustive calculation is not feasible for large word lengths, as there are 4^N distinct input combinations for a two-operand N-bit adder.
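For a small word length the exhaustive count is still tractable and is useful for validating other methods. The Python sketch below enumerates all 4^8 input pairs of an 8-bit addition and counts those containing a generate followed by at least k propagate positions, which is the failure condition as we read it from the "Exact" column of Table 1 below; the function name and the precise convention for k are our own.

def fails(a, b, n, k):
    """True if the generate/propagate pattern of a + b contains a generate
    followed by at least k propagate positions (assumed failure condition)."""
    props = None                       # None: no live carry chain
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                    # generate starts a new chain
            props = 0
        elif (ai ^ bi) and props is not None:
            props += 1                 # propagate extends the live chain
        else:
            props = None               # annihilate (or idle propagate) kills it
        if props is not None and props >= k:
            return True
    return False

N = 8
for k in range(N + 1):
    errors = sum(fails(a, b, N, k)
                 for a in range(1 << N) for b in range(1 << N))
    print(f"k = {k}: {errors} failing input pairs out of {4 ** N}")

Under this reading of the failure condition, the counts agree with the "Exact" column of Table 1.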
Set theory and direct probabilistic calculation are difficult due to the overlapping nature of the k-bit segments. For these reasons, a counting algorithm was devised to quickly count all the patterns of input signals that cause the Liu-Lu adder to fail. For any N and k, the number of carry strings which cause the failure of the Liu-Lu adder, out of 4^N possible inputs, is counted. A carry string of length k bits or more will cause the adder to fail. The algorithm works efficiently if the result for P(N, k+1) is already known, as these combinations can be discounted when counting the violations for P(N, k). For this reason we refer to the algorithm as the backcount algorithm.

Consider a carry string of exactly length k. The string consists of a generate signal g_i and (k−1) propagate signals p_{i+1} ... p_{i+k−1}. Furthermore, the next bit, if it exists, must be an annihilate signal a_{i+k} or the start of another carry string g_{i+k}. There are two possible ways in which a propagate signal can occur, but only one way in which a generate or annihilate signal can occur in position i. For an arbitrary k-bit segment there are r input signals (bits) to the left and s input signals to the right, so 2^(k−1) · 4^(r+s) possible offending combinations are counted. However, from this we must subtract all combinations containing carry lengths longer than k bits to avoid double counting. This is achieved by recursively calling the backcount algorithm on the s and r bits on either side of the k-bit segment being considered, until r and s are too small.

To avoid double counting all the combinations involving multiple carry strings of length k in the N-bit addition, we must consider each case individually. This is the most time-consuming part of the algorithm, as it is computationally equivalent to generating a subset of the partitions of N. The number of partitions of N increases exponentially with N, and so the process of counting many small carry chains for k ≪ N is very time consuming. However inefficient this may be, the algorithm has been verified to produce correct results for 8-bit addition against exhaustive calculation (see Table 1 below). The region of interest is generally k > log2(N), the expected worst-case carry length (1). The algorithm is inefficient for k ≤ log2(N), but the calculation is still reduced greatly compared with considering all 4^N input combinations.

A simple case exists when k = 0, and is included to highlight the difference in prediction against the other methods. A zero-length maximum carry cannot exist if there are any generate signals. There are four distinct input combinations per bit, and three of them are not a generate signal. Hence the proportion of zero-length maximum carries is

    P(N, 0) = 1 − (4^N − 3^N)/4^N = 3^N/4^N .    (6)

Fig. 2. 4-bit approximate adder performance: probability of a correct result vs. worst-case carry length for Lu's equation, the backcount algorithm and Pippenger's approximation.
Fig. 3. 8-bit approximate adder performance.
Fig. 4. 16-bit approximate adder performance.
Fig. 5. 24-bit approximate adder performance.

Comparison of Models

Although the distributions given by equations (4) and (5) are not exact, they provide a sufficiently close approximation for word lengths of 32, 64 and 128 bits. The distribution given by equation (4) approaches the exact distribution for large N because the proportion of long carry chains among all possible input combinations is smaller for long word lengths. However, Lu's distribution is optimistic because it does not consider all the ways in which the adder can fail. For instance, the predicted accuracy of a 64-bit adder with an 8-bit carry segment is P_LiuLu(64, 8) = 0.9477, whereas our results indicate that the correct value is P(64, 8) = 0.9465, to 4 decimal places.
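As an independent cross-check of the backcount algorithm, the failure counts can also be obtained with a simple dynamic program over the generate/propagate/annihilate pattern. The sketch below is such a cross-check, not the backcount algorithm itself, and it assumes the same failure condition as the exhaustive enumeration above (a generate followed by at least k propagates).

def error_count(n, k):
    """Weighted count of n-bit input pairs whose generate/propagate pattern
    contains a generate followed by at least k propagates.
    Per bit position: generate = 1 combination, annihilate = 1, propagate = 2."""
    if k == 0:
        return 4 ** n - 3 ** n           # any generate at all causes failure
    # live[0] = no live chain; live[j] (1 <= j <= k) = a generate followed by j-1 propagates
    live = [1] + [0] * k
    dead = 0                             # prefixes that have already failed
    for _ in range(n):
        total = sum(live)
        nxt = [0] * (k + 1)
        nxt[0] = total + 2 * live[0]     # annihilate from any state; idle propagate
        nxt[1] = total                   # generate starts a fresh chain
        for j in range(2, k + 1):
            nxt[j] = 2 * live[j - 1]     # propagate extends the live chain
        dead = 4 * dead + 2 * live[k]    # k-th propagate after a generate: failure
        live = nxt
    return dead

def p_correct(n, k):
    return 1 - error_count(n, k) / 4 ** n

# Example: the 8-bit error counts, and a longer word length that the
# exhaustive check cannot reach.
print([error_count(8, k) for k in range(9)])
print(f"P(64, 8) = {p_correct(64, 8):.4f}  # under this convention for k")

For N = 8 the dynamic program returns the same counts as the exhaustive enumeration; for longer word lengths it runs in time linear in N rather than exponential.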
Figures 2, 3, 4 and 5 show the predicted accuracy of the various methods for calculating the proportion of correctly speculated results versus the longest carry chain in the addition. Note that to achieve the accuracies shown with a worst-case carry length of x bits, the adder must be designed with (x+1)-bit segments. It is not a trivial task to calculate, or otherwise count, all possible inputs which produce carries of length k or greater. An algorithm was therefore devised to count the number of incorrect additions of the Liu-Lu adder by considering all input combinations containing patterns of g_i, p_i and a_i signals which cause the N-bit adder to fail.

Table 1. Predicted number of errors for an N = 8-bit logically incomplete approximate adder for various models

    k-bit carry   Exact   Backcount      Lu   Pippenger
    0             58975       58975   65600       64519
    1             43248       43248   65536       63520
    2             23040       23040   65280       61565
    3             10176       10176   64516       57835
    4              4096        4096   62511       51040
    5              1536        1536   57720       39750
    6               512         512   47460       24109
    7               128         128   29412        8869
    8                 0           0    8748        1200

Further Extensions to Logically Incomplete Arithmetic Units

It is well known that in two's complement arithmetic a subtraction can be performed by an adder unit by setting the carry-in bit high and inverting the subtrahend. For a subtraction involving a small positive operand (or the addition of a small negative operand), the operation is much more likely to produce a long carry chain. In this case it may be possible to perform the subtraction up to the expected worst-case carry length and then sign-extend to the full N bits. The full advantage of this approach is to be determined by further investigation of the distribution of worst-case carry lengths for subtraction operations in benchmarked programs.

2.3 Temporally Incomplete Arithmetic Units

A temporally incomplete adder is a proposed adder design that is clocked at a rate that violates worst-case timing. This adder is also based upon Burks et al.'s result (1). As the time for a result to be correctly produced at the output depends on the carry propagation delay, a short expected worst-case carry length can yield high accuracy by the same principle as the Liu-Lu adder. The advantage of a temporally incomplete ripple carry adder is that no modifications need to be made to the worst-case design in order to produce speculative results. To produce an exact result, the adder requires the full critical-path delay, or otherwise worst-case units are required to produce the exact result.

To evaluate the performance of the temporally incomplete ripple carry adder, a VHDL implementation of the circuit was constructed using the Artisan library of components. A simulation with uniform random inputs was performed using timing information from the Artisan library datasheet. The propagation delay for a full-adder cell was taken to be the average input-to-output propagation delay due to a change in either the inputs or the carry-in. The minimum propagation time is determined by the maximum carry-chain length in the addition. The adder was presented with 500,000 uniform random inputs, and the proportion of correct additions was counted for clock periods in 0.05 ns increments. Results are shown in Figure 6; the worst-case propagation delay is shown as a dashed line.

Fig. 6. Theoretical performance for a 32-bit temporally incomplete adder.

Assuming uniform random input, the temporally incomplete adder can also yield high accuracy in much less than the worst-case propagation delay, due to the expected short worst-case carry length.
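The behaviour in Figure 6 can be approximated with a very crude software model: assume each full-adder stage contributes a fixed delay, so that an addition has settled once the longest carry chain (plus the final sum stage) has rippled through. The per-stage delay below is a placeholder, not the Artisan library figure, and the model ignores wire delays and unequal rise and fall times.

import random

T_FA_NS = 0.2   # assumed per-stage full-adder delay in ns (placeholder value)

def longest_carry_chain(a, b, n):
    """Longest generate-then-propagate run, in bit positions."""
    longest = run = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                   # generate starts a chain
            run = 1
        elif (ai ^ bi) and run:       # propagate extends a live chain
            run += 1
        else:                         # annihilate, or propagate with no live carry
            run = 0
        longest = max(longest, run)
    return longest

def accuracy_vs_clock(n=32, trials=100_000, periods=(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)):
    """Fraction of random additions whose carry has settled within each clock period."""
    chains = [longest_carry_chain(random.getrandbits(n), random.getrandbits(n), n)
              for _ in range(trials)]
    return {t: sum((c + 1) * T_FA_NS <= t for c in chains) / trials for t in periods}

if __name__ == "__main__":
    for period, acc in sorted(accuracy_vs_clock().items()):
        print(f"clock period {period:.2f} ns: estimated accuracy {acc:.3f}")

The sketch reproduces the shape of the curve in Figure 6 (a steep rise well before the worst-case delay), not its absolute timing.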
Metastability

Sequential logic components such as latches have setup and hold times during which their inputs must be stable. If these timing restrictions are not met, the circuit may produce erroneous results. The worst-case timing results above indicate the probability of an erroneous result when the addition result arrives too late; in this case the approximation is wrong and the pipeline will be flushed. There is, however, a small chance that the circuit will become metastable. If the timing elements (latches or flip-flops) sample the adder output while it is changing, the timing element may become stuck in an undetermined state. Metastable behaviour includes holding a voltage between circuit thresholds, or toggling outputs. Systems employing temporal incompleteness will therefore need to be designed to be robust in the presence of metastability.

3 Benchmark investigation

3.1 Investigating Theoretical Performance

We have used the SimpleScalar toolset configured with the PISA architecture to simulate SPEC CINT '95 benchmarks. PISA is a MIPS-like superscalar 64-bit architecture and supports out-of-order (OOO) execution. The pipeline has 5 stages and supports integer and floating-point arithmetic.

To evaluate the performance of an approximate adder, we have simulated each SPEC benchmark and recorded the length of the longest carry chain in each addition. Figure 7 shows the performance of the Liu-Lu adder with a worst-case carry length of k bits when used for add instructions. This figure was derived by examining all add instructions executed, irrespective of addressing mode. Note that only unsigned data was present, as there were no examples of signed add instructions in the SPEC binaries provided with SimpleScalar. Each benchmark in the suite is plotted on the same graph, and the theoretical performance of the adder with uniform random inputs is shown as a dashed line.

We can observe that the performance of the Liu-Lu adder for add instructions is higher than expected for a small carry length, with a high proportion of correct additions achieved. However, the benchmarks repeat many identical calculations and may be unrepresentative. It can also be observed that in some cases the k-bit segment must be extended to a very wide structure in order to capture most of the additions correctly. This indicates that additions involving long carry chains are repeated many times. Otherwise, the performance of the Liu-Lu adder running the SPEC CINT '95 benchmarks is close to theoretical.

The Liu-Lu adder performance when we consider only subtraction operations is shown in Figure 8. No instances of signed arithmetic were observed; all results are derived from unsigned subtractions.

Fig. 7. Liu-Lu adder performance for ADD instructions (theoretical curve plus the vortex, compress, gcc, m88k, lisp, perl and ijpeg benchmarks).

Fig. 8. Liu-Lu adder performance for SUB instructions.
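Figures 7 and 8 were produced by instrumenting the simulator. A sketch of the same per-k accuracy profile is shown below; the operand source is hypothetical (any iterable of (a, b) pairs logged from an instrumented simulator will do), the chain-length helper is repeated so the sketch is self-contained, and the chain-length convention is the same crude one used in the earlier sketches.

def longest_carry_chain(a, b, n):
    """Longest generate-then-propagate run, in bit positions."""
    longest = run = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                   # generate
            run = 1
        elif (ai ^ bi) and run:       # propagate with a live carry
            run += 1
        else:                         # annihilate, or propagate with no live carry
            run = 0
        longest = max(longest, run)
    return longest

def accuracy_profile(operand_pairs, n=64):
    """For each k, the fraction of additions whose longest carry chain is at most
    k bits (the quantity plotted against worst-case carry length in Figs. 7 and 8)."""
    counts = [0] * (n + 1)
    total = 0
    for a, b in operand_pairs:
        counts[longest_carry_chain(a, b, n)] += 1
        total += 1
    running, profile = 0, []
    for k in range(n + 1):
        running += counts[k]
        profile.append(running / total)
    return profile

# Example with a toy operand stream standing in for a benchmark trace:
trace = [(i, i + 3) for i in range(1000)]   # loop-counter-like operands
print(accuracy_profile(trace, n=64)[:8])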
When an adder is used to perform an unsigned subtraction on two binary inputs, the subtrahend is inverted and the carry-in bit is set to one. If, for example, we subtract 0 from 0, then one adder input consists of N ones, and due to the carry-in bit the carry chain for the addition (subtraction) propagates the entire length of the input. Likewise, subtraction operations involving small subtrahends produce long carry chains. In subtraction, therefore, the Liu-Lu adder performs much worse than theory predicts.

It is possible to approximate subtraction results by performing the full operation on k bits out of N and then sign-extending the result. This has the effect of reducing the calculation time for additions which result in very long carry chains. However, by sign-extending past k bits, the range of results that can be represented accurately is reduced, as the bits of greater significance are ignored.

Subtraction operations formed less than 2.5% of all integer arithmetic operations in the SPEC integer benchmarks. This is an important design consideration for the implementation of approximate arithmetic units: if, in practice, subtraction operations are not common and are not easily approximated, it is best not to use them for speculative execution.

4 Architecture design

4.1 Introduction

There are many challenges associated with pipelining instruction execution. Processors are becoming increasingly complex in order to exploit instruction level parallelism (ILP). Wall [11] demonstrates that in typical programs, even assuming perfect exploitation of instruction parallelism (such as a perfect branch predictor), the available parallelism rarely exceeds 5 instructions. Hence, in order to execute instructions at a higher rate when the maximum exploitable parallelism is fixed, instruction latency (after issue) needs to be reduced. In this section we briefly discuss the implementation of approximate arithmetic to facilitate speculative execution.

4.2 Pipeline Considerations

Data speculation is supported by the use of a reorder buffer (ROB). The ROB helps maintain precise exceptions by retiring instructions in order and by postponing exception handling until an instruction is ready to commit [12]. The use of a ROB also supports dynamic scheduling and variable completion times for the various functional units. In a branch prediction scheme, the ROB maintains instruction order; if the branch outcome differs from that predicted, it is necessary to flush all instructions after the mispredicted branch. This is straightforward and does not affect future execution.

In order to detect speculation dependencies, extra information is needed in the ROB. Liu and Lu [8] have accomplished this by including a value prediction field (VPF) in a MIPS architecture. Any instruction may depend upon the results of previous instructions. If there is a dependency between instructions and the earlier instruction(s) are speculative, then a store cannot be allowed to commit; otherwise a spurious result could be written to registers or memory. When a speculative arithmetic result is available, it is likely that a dependent instruction will start execution based upon the speculative result. If a spurious arithmetic result is detected, all dependent instructions must be flushed from the pipeline immediately and re-issued.
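To make the dependency-tracking idea concrete, the following toy Python model shows one way a speculative flag (playing the role of a VPF) can be carried through a reorder buffer and used to squash dependent instructions when a speculative value turns out to be wrong. It is an illustrative sketch only, with our own simplifications (register names as strings, conservative dependence tracking); it is not the MIPS extension described by Liu and Lu [8].

from dataclasses import dataclass

@dataclass
class ROBEntry:
    dest: str              # destination register
    value: int             # speculative or exact result
    speculative: bool      # value-prediction flag: result not yet verified
    srcs: tuple = ()       # source registers read by this instruction

class ReorderBuffer:
    def __init__(self):
        self.entries = []  # program order, oldest first

    def issue(self, dest, value, speculative=False, srcs=()):
        # An instruction inherits speculation from any speculative producer it reads.
        spec = speculative or any(e.speculative and e.dest in srcs
                                  for e in self.entries)
        self.entries.append(ROBEntry(dest, value, spec, tuple(srcs)))

    def resolve(self, index, exact_value):
        """Called when the exact result for entry `index` becomes available."""
        entry = self.entries[index]
        if entry.value == exact_value:
            entry.speculative = False              # speculation confirmed
            return []
        # Mis-speculation: correct this entry and (conservatively) squash every
        # later entry that transitively consumed the wrong value, so it can be
        # re-issued with the corrected operand.
        entry.value, entry.speculative = exact_value, False
        tainted = {entry.dest}
        kept, squashed = self.entries[:index + 1], []
        for later in self.entries[index + 1:]:
            if tainted & set(later.srcs):
                tainted.add(later.dest)
                squashed.append(later)
            else:
                kept.append(later)
        self.entries = kept
        return squashed

    def commit(self):
        """Retire in order, stopping at the first still-speculative entry."""
        retired = []
        while self.entries and not self.entries[0].speculative:
            retired.append(self.entries.pop(0))
        return retired

# Example: an approximate add writes r1 speculatively; a dependent op reads it.
rob = ReorderBuffer()
rob.issue("r1", 100, speculative=True)       # approximate result
rob.issue("r2", 105, srcs=("r1",))           # consumer of the speculative value
print(rob.resolve(0, 101))                   # exact result differs: consumer squashed

A real design would track renamed physical registers and stop tainting a register once a later, non-speculative instruction overwrites it; the sketch deliberately over-squashes to stay short.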
Liu and Lu [8] observe that another pipeline stage needs to be added to facilitate the checking of speculative values. As demonstrated above, different arithmetic operations have different carry characteristics, and hence suggest different approaches to data value speculation. As modern processors typically contain more than one arithmetic unit, it is easy to imagine different operations having their own dedicated hardware.

4.3 System Performance

Arithmetic data value speculation aims to improve system performance by increasing IPC. The total performance impact on a system will depend on the programs being run, the pipeline depth, the time saved by speculation, the accuracy of the approximate arithmetic units and the penalty for a spurious speculation. It is not possible to quote an absolute benefit (or detriment) to a system without a full implementation of an approximate arithmetic unit in a specific architecture running specific programs.

In order to evaluate the performance of the approximate arithmetic units independently of the architecture, the designs have been analysed as accuracy versus time for uniform random input data, and as accuracy versus complexity for the SPEC benchmarks (simulating real programs). System performance as raw IPC will be evaluated after a full implementation. Before this occurs, however, a number of architectural and engineering design problems need to be addressed, including choosing k to maximise IPC, selecting components to meet architecture-specific circuit timing requirements, and accounting for increased power and area.

5 Conclusion

We have demonstrated that, with careful design and analysis, arithmetic approximation can quickly yield accurate results for data value speculation. Furthermore, different arithmetic operations require separate analysis in order to achieve high performance. With continued investigation into arithmetic approximation, and further research into the newly proposed concept of temporally incomplete approximate arithmetic units, data value speculation can be better implemented in specific architectures. By performing realistic performance profiling, arithmetic approximation can be better tuned to maximise the expected benefit of speculation in general computing.

References

1. J. Gonzalez and A. Gonzalez. Data value speculation in superscalar processors. Microprocessors and Microsystems, 22(6):293–301, November 1998.
2. A. Li. An empirical study of the longest carry length in real programs. Master's thesis, Department of Computer Science, Princeton University, May 2002.
3. D. Koes, T. Chelcea, C. Oneyama, and S. C. Goldstein. Adding faster with application specific early termination. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA, January 2005.
4. S. M. Nowick, K. Y. Yun, P. A. Beerel, and A. E. Dooply. Speculative completion for the design of high-performance asynchronous dynamic adders. In International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 210–223, Eindhoven, The Netherlands, 1997. IEEE Computer Society Press.
5. S-L. Lu. Speeding up processing with approximation circuits. Computer, 37(3):67–73, March 2004.
6. A. W. Burks, H. H. Goldstine, and J. von Neumann. Preliminary discussion of the logical design of an electronic computing instrument. Institute for Advanced Study, Princeton, N.J., June 1946.
7. B. E. Briley. Some new results on average worst case carry. IEEE Transactions on Computers, C-22(5):459–463, 1973.
8. T. Liu and S-L. Lu. Performance improvement with circuit-level speculation. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 348–355. IEEE, 2000.
9. B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, New York, 2000.
10. N. Pippenger. Analysis of carry propagation in addition: an elementary approach. Journal of Algorithms, 42(2):317–333, 2002.
11. D. W. Wall. Limits of instruction-level parallelism. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 176–188, 1991.
12. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, USA, 2003.