Arithmetic Data Value Speculation
Daniel R. Kelly and Braden J. Phillips
Centre for High Performance Integrated Technologies and Systems (CHiPTec), School
of Electrical and Electronic Engineering, The University of Adelaide, Adelaide SA
5005, Australia
{dankelly,phillips}@eleceng.adelaide.edu.au
Abstract. Value speculation is currently widely used in processor designs to increase the overall number of instructions executed per cycle
(IPC). Current methods use sophisticated prediction techniques to speculate on the outcome of branches and execute code accordingly. Speculation can be extended to the approximation of arithmetic values. As
arithmetic operations are slow to complete in pipelined execution, an increase in overall IPC is possible through accurate arithmetic data value
speculation. This paper focuses on integer adder units for the purposes
of demonstrating arithmetic data value speculation.
1 Introduction
Modern processors commonly use branch prediction to speculatively execute
code. This allows the overall number of instructions per cycle (IPC) to be increased if the time saved for correct predictions outweighs the penalties for a
misprediction. Various schemes are used for branch prediction; however, few
are used for the prediction, or approximation, of arithmetic values.
The adder is the basic functional unit in computer arithmetic. Adder structures are used in signed addition and subtraction, as well as floating point multiplication and division operations. Hence, improved adder design can be applied
to improving all basic forms of computer arithmetic. Pipeline latency is often
restricted by the relatively large delay of arithmetic units. Therefore, a decrease
in the delay associated with arithmetic operations will promote an increase in
IPC.
Data value speculation schemes, as opposed to value prediction schemes,
have been proposed in the past for superscalar processors, such as the use of
stride-based predictors for the calculation of memory locations [1]. Stride-based
predictors assume a constant offset from a base memory location, and use a
constant stride length to iteratively access elements in a data array.
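To make the idea concrete, a minimal last-value-plus-stride predictor can be sketched as follows (the table layout and keying by instruction address are illustrative assumptions, not the design of [1]):

```python
class StridePredictor:
    """Minimal stride-based value predictor: for each instruction (keyed by
    its PC) it predicts last_value + stride, where the stride is the
    difference between the two most recently observed values."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None  # no history for this instruction yet
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, actual):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (actual, actual - last)
        else:
            self.table[pc] = (actual, 0)
```

A load walking an array with a fixed element size trains such a predictor after only two observations of its address.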
In this paper we will discuss two basic designs for arithmetic approximation
units, and use a ripple carry adder as an example. We will then investigate
the theoretical performance of such an adder by the use of SPEC benchmarks.
Finally we will briefly discuss modifications to a basic MIPS architecture model
to employ arithmetic data value speculation.
2
2.1
Approximate arithmetic units
Overview
Arithmetic units can provide an approximate result if the design of a basic
arithmetic unit is modified so that an incomplete result can be provided earlier
than the normal delay. This can be done by specially designed approximate
hardware, or by modifying regular arithmetic units to provide a partial result
earlier than the worst–case delay. The intermediate (approximate) result can
then be forwarded to other units for speculative execution before the exact result
is known. The speculative result can be checked against the exact result, provided
either by a separate worst–case arithmetic unit, or by waiting for the full worst–case
delay of the same unit. A comparison operation simultaneously provides the outcome of the
speculation, and the exact result in the case of spurious speculation.
An approximate arithmetic unit is incomplete in some way. An approximate
unit can be logically changed to provide results earlier than the worst–case completion time. In the common case, the result will match the exact calculation,
and will be erroneous in the uncommon case. Such a unit will be called a “logically incomplete unit”. It is also possible to provide an approximate result by
overclocking the arithmetic hardware to take advantage of short propagation
times in the average case. These units are called “temporally incomplete units”.
It is important to test new designs under normal operating conditions to
investigate actual performance. In the execution of typical programs, the assumption of uniform random input is not true. Li has empirically demonstrated
that adder inputs are highly data dependent, and exhibit different carry length
distributions depending on the data [2]. For instance, operands for loop counters in programs are usually small and positive, producing small carry lengths
in adders. On the other hand, integer addressing calculations can produce quite
long carry lengths. This observation led Koes et al. to design an application
specific adder based upon an asynchronous dynamic Brent-Kung adder [3, 4].
The new adder adds early termination logic to the original design.
2.2 Logically Incomplete Arithmetic Units
Liu and Lu propose a simple ripple carry adder (we will call it a “Liu-Lu
adder”), in which logic is structured such that it restricts the length that a carry
can propagate [5]. This is, therefore, an example of a logically incomplete adder.
Figure 1 shows an example of an 8-bit adder with 3-bit carry segments. The
Liu-Lu adder exploits Burks et al.'s famous upper bound on the expected-worst-case-carry length [6]. Assuming the addition of uniform random binary data,
the expected-worst-case-carry length is shown in (1), where CN is the expected
length of the longest carry length in the N -bit addition.
CN ≤ log2 (N ) .
(1)
Fig. 1. Structurally incomplete 8-bit adder with 3-bit carry segments
This upper bound was reduced by Briley to (2) in 1973 [7].
CN ≤ log2 (N ) − 1/2 .
(2)
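These bounds are easy to check empirically. The sketch below (word length, trial count and seed are arbitrary choices of ours) measures the longest generate-then-propagate carry chain over random 64-bit operand pairs; the sample mean should fall below the Burks et al. bound log2(64) = 6:

```python
import random

def longest_carry_chain(x, y, n):
    # Longest carry chain in the n-bit addition x + y: a chain starts at a
    # generate position (both operand bits 1) and is extended by propagate
    # positions (exactly one operand bit 1).
    run, longest = 0, 0
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        if xi & yi:               # generate: a new chain starts here
            run = 1
        elif (xi ^ yi) and run:   # propagate: the live chain grows
            run += 1
        else:                     # annihilate, or no live chain
            run = 0
        longest = max(longest, run)
    return longest

random.seed(0)
N, TRIALS = 64, 5000
mean = sum(longest_carry_chain(random.getrandbits(N), random.getrandbits(N), N)
           for _ in range(TRIALS)) / TRIALS
```

With these parameters the sample mean sits comfortably below 6, consistent with both (1) and (2).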
For the purposes of adder design, we wish to find a model of the probability
distribution for the expected-worst-case-carry length k-bits, in an N -bit addition,
i.e.,
P(N, k) = 1 − Pr[CN ≥ k] = Pr[CN < k] .
(3)
Lu provides an analysis of the probability of an incorrect result in the Liu-Lu adder for uniform random inputs [5]. The Liu-Lu adder forms the sum for
each bit i by considering any carry generated in the previous k-bit segment
(inclusive). Each k-bit segment is a ripple carry adder. (1) and (2) above show
that the expected-worst-case-carry length is short with respect to the length of
the addition. Hence we can tradeoff a small accuracy penalty for a significant
time saving compared with the worst–case propagation delay for a ripple carry
adder.
In order to predict the average–case accuracy of the Liu-Lu adder, a result
is derived for the probability of a correct result, P , in an N -bit addition with a
k-bit segment. Liu and Lu’s equation is given below [8].
PLiu-Lu(N, k) = (1 − 1/2^(k+2))^(N−k−1) .
(4)
Pippenger’s Equation Pippenger also provides an approximate distribution
for the expected-worst-case-carry length, shown in (5) below [10]. Pippenger
proves this to be a lower bound on the probability distribution, and it is used
here as a pessimistic approximation to the performance of the Liu-Lu adder.
PPippenger(N, k) = e^(−N/2^(k+1)) .
(5)
Analysis of the Liu-Lu Equation In the derivation of (4) Lu states that “if
we only consider k previous bits to generate the carry, the result will be wrong
if the carry propagation chain is greater than (k + 1)” and “. . . moreover, the
previous bit must be in the carry generate condition” [5]. Both statements are
incorrect.
In analysis of arithmetic circuits, it is useful to define the result of an N -bit
addition as the product of generate gi , propagate pi , and annihilate ai signals
for each digit i = 0 … (N − 1) in the addition (where i = 0 for the least significant
digit) [9].
If we consider any k-bit segment in an N -bit addition, in the Liu-Lu adder a
carry will not be propagated from the k-th bit to any other bit. Hence the result
in the (k + 1)-th bit will be wrong. Therefore the Liu-Lu adder can only provide
correct answers for a carry length less than k-bits. The approximate result will
be wrong if any carry propagation chain is greater than or equal to k-bits.
Now, consider a very long carry string of length 2k, (gi pi+1 . . . pi+2k−1 ). As
demonstrated above, the most–significant k-bits in the 2k-bit segment will be
incorrect, as they will not have a carry propagated to them. Thus it is possible
that an incorrect result can be produced without requiring that the previous bit
to the most significant k-bit segment is in the carry generate condition (but it
may propagate a carry generated earlier and still cause failure of the adder).
Also, two or more disjoint carry lengths in N -bits can produce a spurious result
if k ≤ N/2.
The probability of any input being a generate signal is P(gi) = 1/4 = 1/2^2,
and for a propagate signal P(pi) = 1/2, because it can occur in two distinct
ways. Hence the probability of each k-bit segment producing an erroneous result
is actually 1/2^(k+1). There are also (N − k + 1) overlapping k-bit segments required
to construct the N-bit logically incomplete adder.
The result in equation 4 is arrived at by Liu and Lu’s assumption “the probability of (each k-bit segment) being correct is one minus the probability of being
wrong . . . we multiply all the probabilities to produce the final product”. As discussed above, the Liu-Lu adder will produce spurious results if there exists any
carry lengths ≥ k-bits in the N-bit addition. Hence, there are many probability cross terms not represented in equation 4. The probability of each k-bit
segment producing a carry out is fiendishly difficult to calculate, as each k-bit
segment overlaps with (k − 1) to 2(k − 1) other such segments.
Backcount Algorithm In order to analyse the performance of the Liu-Lu
adder it is necessary to know the actual probabilities of success P(N, k) for
each word length N and carry segment k. Exhaustive calculation is not feasible
for large word lengths, as there are 4^N distinct input combinations for a
two-operand N-bit adder. Set theory and probabilistic calculation are difficult
due to the overlapping nature of the k-bit segments. For these reasons, a
counting algorithm was devised to quickly count all the patterns of carry
segments that would violate the Liu-Lu adder.

Fig. 2. 4-bit approximate adder performance (probability of correct result
vs. worst-case-carry-length for Lu's equation, the backcount algorithm and
Pippenger's approximation)

Fig. 3. 8-bit approximate adder performance (same quantities as Fig. 2)

Fig. 4. 16-bit approximate adder performance (same quantities as Fig. 2)

Fig. 5. 24-bit approximate adder performance (same quantities as Fig. 2)
For any N and k the number of carry strings which cause the failure of the
Liu-Lu adder out of 4N possible inputs is counted. A carry string of length k-bits
or more will cause the adder to fail. The algorithm works efficiently if the result
for P (N, k + 1) is already known, as these combinations can be discounted when
counting the violations in P (N, k). For this reason we refer to the algorithm as
the backcount algorithm.
Consider a carry string of exactly length k. The string consists of a generate
signal gi and (k − 1) propagate signals pi+1 . . . pi+k−1 . Furthermore, the next
bit, if it exists, must be an annihilate signal ai+k , or the start of another carry
string gi+k . There are two possible ways in which a propagate signal can occur,
but only one way in which a generate or annihilate signal can occur in position
i. For an arbitrary k-bit segment there are r input signals (bits) to the left
and s input signals to the right. So, 2^(k−1) · 4^(r+s) possible offending combinations
are counted. However, from this we must subtract all combinations containing
carry lengths longer than k-bits to avoid double counting. This is achieved by
recursively calling the backcount algorithm on the s- and r-bits on either side of
the k-bit segment being considered, until r and s are too small.
To avoid double counting all the combinations involving multiple carry strings
of length k in the N -bit addition, we must consider each case individually. This is
the most time consuming part of the algorithm, as it is computationally equivalent to generating a subset of the partitions of N . The number of partitions of N
increases exponentially with N , and so the process of counting many small carry
chains for k ≪ N is very time consuming. However inefficient this may be, it has
been verified to produce correct results for 8-bit addition against exhaustive calculation (see Table 1 below). The region of interest is generally k > log2 (N ), the
expected-worst-case-carry length (Figure 1). It is inefficient for k ≤ log2 (N ), but
the calculation is reduced greatly from considering all 4^N input combinations.
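The same counts can also be obtained, as a cross-check, from a small dynamic program over the per-digit signals (g occurs 1 way, p 2 ways, a 1 way per digit) that tracks the length of the live carry chain and discards any prefix whose chain reaches k + 1 bits. This is our own alternative sketch, not the backcount algorithm itself:

```python
def liu_lu_errors(n, k):
    # Number of two-operand n-bit input pairs on which an adder that can
    # absorb carry chains of up to k bits fails, i.e. pairs whose longest
    # generate-then-propagate chain exceeds k bits.
    threshold = k + 1
    # dp[c] = weighted count of digit-string prefixes whose live carry
    # chain currently has length c (< threshold)
    dp = [0] * threshold
    dp[0] = 1
    for _ in range(n):
        nxt = [0] * threshold
        for c, w in enumerate(dp):
            if not w:
                continue
            nxt[0] += w                   # annihilate (1 way): chain dies
            if threshold > 1:
                nxt[1] += w               # generate (1 way): fresh chain
            if c == 0:
                nxt[0] += 2 * w           # propagate (2 ways): no live chain
            elif c + 1 < threshold:
                nxt[c + 1] += 2 * w       # propagate (2 ways): chain grows
            # prefixes whose chain reaches the threshold fail and are dropped
        dp = nxt
    return 4 ** n - sum(dp)
```

For N = 8 this reproduces the exact error counts of Table 1 for every k, and, unlike exhaustive enumeration over 4^N inputs, it runs in O(Nk) time for any word length.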
A simple case exists when k = 0, and is included to highlight the difference
in prediction against other methods. A zero-length-maximum-carry cannot exist
if there are any generate signals. There are four distinct input combinations per
bit and three that are not a generate signal. Hence the proportion of maximum-zero-length-carries is given below as
P(N, 0) = 3^N / 4^N .
(6)
Comparison of Models Although the distributions given by equations 4 and
5 are not exact, they provide sufficiently close approximations for word lengths
of 32-, 64-, and 128-bits. The distribution given by equation 4 approaches the
exact distribution for large N because the proportion of long carry chains to all
the possible input combinations is smaller for long word lengths. However, Lu's
distribution is optimistic because it does not consider all the ways in which the
adder can fail. For instance, the predicted accuracy of a 64-bit adder with an
8-bit carry segment is calculated as PLiu-Lu(64, 8) = 0.9477. Results indicate
that the correct value is P(64, 8) = 0.9465, to 4 decimal places.
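Both figures are easy to reproduce directly from (4) and (5):

```python
import math

def p_liu_lu(n, k):
    # Equation (4): Liu and Lu's estimate of the probability of a correct result
    return (1 - 1 / 2 ** (k + 2)) ** (n - k - 1)

def p_pippenger(n, k):
    # Equation (5): Pippenger's pessimistic (lower-bound) approximation
    return math.exp(-n / 2 ** (k + 1))

print(round(p_liu_lu(64, 8), 4))     # → 0.9477, the optimistic figure quoted above
print(round(p_pippenger(64, 8), 4))  # → 0.8825
```

The Pippenger value of 0.8825 brackets the exact 0.9465 from below, as expected of a lower bound.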
Figures 2, 3, 4 and 5 show the predicted accuracy of the various methods for
calculating the proportion of correctly speculated results vs. the longest carry
chain in the addition. Note that to achieve the accuracies shown with a worst-case-carry length of x-bits will require the designed adder to use (x + 1)-bit
segments.
It is not a trivial task to calculate or otherwise count all possible inputs
which produce carries of length k or greater. An algorithm was devised to count
the number of incorrect additions of the Liu-Lu adder by considering all input
combinations containing patterns of input signals consisting of gi , pi and ai ,
which cause the N -bit adder to fail.
Table 1. Predicted number of errors for an N = 8-bit logically incomplete approximate
adder for various models

k-bit carry   Exact   Backcount      Lu   Pippenger
          0   58975       58975   65600       64519
          1   43248       43248   65536       63520
          2   23040       23040   65280       61565
          3   10176       10176   64516       57835
          4    4096        4096   62511       51040
          5    1536        1536   57720       39750
          6     512         512   47460       24109
          7     128         128   29412        8869
          8       0           0    8748        1200
Further Extensions to Logically Incomplete Arithmetic Units It is well
known that in two's complement arithmetic, a subtraction operation can be performed by an adder unit by setting the carry-in bit high and inverting the subtrahend. For the case of a subtraction involving a small positive operand (or
addition of a small negative operand), the operation is much more likely to
produce a long carry chain. In this case it may be possible to perform the subtraction up to the expected-worst-case-carry-length, and then sign extend to the
full N bits. The full advantage of this is to be determined by further investigation of distributions of worst-case-carry-lengths in benchmarked programs for
subtraction operations.
2.3 Temporally Incomplete Arithmetic Units
A temporally incomplete adder is a proposed adder design that is clocked at a
rate that will violate worst–case design. This adder is also based upon Burks et
al.’s result (1). As the time for a result to be correctly produced at the output
is dependent on the carry propagation delay, a short expected-worst-case-carry
length can yield high accuracy by the same principle as the Liu-Lu adder.
The advantage of a temporally incomplete ripple carry adder is that no modifications need to be made to the worst–case design in order to produce speculative
results. In order to produce exact results, the adder requires the full critical path
delay, or otherwise worst–case units are required to produce the exact result.
To evaluate the performance of the temporally incomplete ripple carry adder,
a VHDL implementation of the circuit was constructed using the Artisan library
of components. A simulation of uniform random inputs was performed using
timing information from the Artisan library datasheet. The propagation delay
for a full-adder cell was considered to be the average input-to-output propagation
delay, due to a change in either inputs or carry–in.
Fig. 6. Theoretical performance for a 32-bit temporally incomplete adder
(proportion of correct results vs. clock period in ns; the worst-case propagation
delay is marked with a dashed line)
The correct proportion of additions was counted when the adder was presented with 500,000 uniform random inputs. The minimum propagation time is
determined by the maximum carry chain length in the addition. The number of
correct additions was counted for 0.05 ns increments.
Results are shown in Figure 6. The worst case propagation delay is shown
as a dashed line. Assuming uniform random input, the temporally incomplete
adder can also yield high accuracy in much less than the worst–case propagation
delay due to the expected short worst-case-carry-length.
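The experiment reduces to a few lines if the full-adder cell delay is folded into a single per-cell figure and the clock period is expressed in cell delays (the budget of 10 cells below is an arbitrary illustration of ours, not the Artisan timing):

```python
import random

def longest_carry_chain(x, y, n):
    # Longest generate-then-propagate carry chain in the n-bit sum x + y.
    run, longest = 0, 0
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        if xi & yi:               # generate: a new chain starts
            run = 1
        elif (xi ^ yi) and run:   # propagate: the chain grows
            run += 1
        else:                     # annihilate, or no live chain
            run = 0
        longest = max(longest, run)
    return longest

random.seed(1)
N, TRIALS = 32, 2000
BUDGET_CELLS = 10   # clock period in full-adder cell delays; the worst case
                    # needs N + 1 = 33 cell delays to settle
ok = sum(longest_carry_chain(random.getrandbits(N), random.getrandbits(N), N) + 1
         <= BUDGET_CELLS
         for _ in range(TRIALS)) / TRIALS
# ok is the proportion of additions that settle inside the short clock
```

Even with a clock under a third of the worst-case settling time, the vast majority of random additions complete correctly, mirroring the shape of Figure 6.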
Metastability Sequential logic components like latches have setup and hold
times during which the inputs must be stable. If these time restrictions are not
met, the circuit may produce erroneous results. We can see from worst–case
timing results above the probability of an erroneous result if the addition result
arrives too late. In this case, the approximation is wrong and the pipeline will
be flushed.
There is a small chance however that the circuit will become metastable. If the
timing elements (latches or flip-flops) sample the adder output when the adder
output is changing, then the timing element may become stuck in an undetermined state. Metastable behaviour includes holding a voltage between circuit
thresholds, or toggling outputs. Systems employing temporal incorrectness will,
therefore, need to be designed to be robust in the presence of metastability.
3 Benchmark investigation

3.1 Investigating Theoretical Performance
We have used the SimpleScalar toolset configured with the PISA architecture
to simulate SPEC CINT ’95 benchmarks. PISA is a MIPS-like superscalar 64-bit
architecture, and supports out-of-order (OOO) execution. The pipeline has
5 stages, and supports integer and floating-point arithmetic.
To evaluate the performance of an approximate adder, we have simulated each
SPEC benchmark and recorded the length of the longest carry chain. Figure 7
shows the performance of the Liu-Lu adder with a worst-case-carry-length of
k-bits when used for add instructions. This figure was derived by examining
all add instructions executed, irrespective of addressing mode. Note that only
unsigned data was present as there were no examples of signed add instructions
in the SPEC binaries provided with SimpleScalar. Each benchmark in the suite
is plotted on the same graph. The theoretical performance of the adder with
uniform random inputs is shown as a dashed line.
We can observe that the performance of the Liu-Lu adder for add instructions
is higher than expected for a small carry length, with a high proportion of correct
additions achieved. However, the benchmarks repeat many identical calculations,
and may therefore be unrepresentative.
It can be observed that in some cases the length of the k-bit segment must
be extended to a very wide structure in order to capture most of the additions
correctly. This indicates that many additions involving long carry lengths are
repeated many times. Otherwise, the performance of the Liu-Lu adder running SPEC
CINT ’95 benchmarks is close to theoretical. The Liu-Lu adder performance
is also shown in figure 8 when we consider only subtraction operations. No instances of signed arithmetic were observed. All results are derived from unsigned
subtractions.
Fig. 7. Liu-Lu adder performance for ADD instructions (proportion of correct
results vs. worst-case-carry-length; curves for the theoretical model and the
vortex, compress, gcc, m88k, lisp, perl and ijpeg benchmarks)
Fig. 8. Liu-Lu adder performance for SUB instructions (proportion of correct
results vs. worst-case-carry-length)
When an adder is used to perform an unsigned subtraction on two binary
inputs, the subtrahend is inverted, and the carry in bit is set to one. If for
example we subtract 0 from 0 then one input would consist of N ones. Due to
the carry in bit, the carry chain for the addition (subtraction) would propagate
the entire length of the input. Likewise, subtraction operations involving small
subtrahends produce large carry chains.
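This behaviour is easy to demonstrate with a carry-chain measure that treats the carry-in as a chain already alive at bit 0 (an illustrative sketch of ours at 8 bits):

```python
def carry_chain(x, y, n, cin=0):
    # Longest carry chain in the n-bit sum x + y + cin; a set carry-in is
    # counted as a chain of length 1 entering bit 0.
    run, longest = cin, cin
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        if xi & yi:               # generate
            run = 1
        elif (xi ^ yi) and run:   # propagate
            run += 1
        else:                     # annihilate
            run = 0
        longest = max(longest, run)
    return longest

def sub_chain(x, y, n=8):
    # Two's complement subtraction x - y: add the inverted subtrahend with
    # the carry-in set high.
    return carry_chain(x, (~y) & ((1 << n) - 1), n, cin=1)
```

sub_chain(0, 0) yields a chain spanning the full width, and even sub_chain(5, 3) produces a 6-bit chain: the inverted small subtrahend is almost all ones, so the carry propagates until a generate stops it.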
For subtraction the Liu-Lu adder performs much worse than theory predicts. It is
possible to approximate subtraction results by performing the full operation
on k-bits out of N , and then sign–extending the result. This has the effect of
reducing the calculation time for additions which result in very long carry chains.
However, by sign extending past k-bits, the range of possible accurate results is
reduced, as the bits of greater significance will be ignored.
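The idea can be sketched as follows (word and segment widths are arbitrary choices for the illustration):

```python
def approx_sub(x, y, n=32, k=8):
    # Approximate x - y: subtract in the low k bits only, then copy the
    # k-bit result's sign bit across the upper (n - k) bits. The result is
    # exact whenever the true difference fits in k signed bits, and wrong
    # otherwise, because the more significant bits are ignored.
    low = (x - y) & ((1 << k) - 1)
    if low >> (k - 1):                        # sign-extend bit k-1 upward
        low |= ((1 << (n - k)) - 1) << k
    return low
```

approx_sub(5, 3) and approx_sub(3, 5) are exact, but approx_sub(1000, 1) is not, since 999 does not fit in 8 signed bits.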
Subtraction operations formed less than 2.5% of all integer arithmetic operations in the SPEC integer benchmarks. This is an important design consideration
for the implementation of approximate arithmetic units. If, in practice, subtraction operations are not common and are not easily approximated, it is best not
to use them for speculative execution.
4 Architecture design

4.1 Introduction
There are many challenges associated with pipelining instruction execution. Processors are becoming increasingly complex in order to exploit instruction level
parallelism (ILP). Wall [11] demonstrates that in typical programs, even assuming perfect exploitation of instruction parallelism, such as perfect branch
prediction, the parallelism of most instructions will rarely exceed 5. Hence, in
order to execute instructions at a higher rate assuming the maximum exploitable
parallelism is fixed, instruction latency needs to be reduced (after issue).
In this section we briefly discuss the implementation of approximate arithmetic in order to facilitate speculative execution.
4.2 Pipeline Considerations
Data speculation is supported by the use of a reorder buffer (ROB). The ROB
helps maintain precise exceptions by retiring instructions in order, and postponing the handling of exceptions until an instruction is ready to commit [12]. The use of
the ROB also supports dynamic scheduling and variable completion times for the
various functional units.
In a branch prediction scheme, the ROB maintains instruction order. If the
branch outcome differs from that predicted, it is necessary
to flush all instructions after the spurious branch. This is easy to do and will
not affect future execution.
In order to detect speculation dependency, extra information is needed in
the ROB. Liu and Lu [8] have accomplished this by the inclusion of a value
prediction field (VPF) in a MIPS architecture.
Any instruction may depend upon the results of previous instructions. If
there is a dependency between instructions, and the former instruction(s) are
speculative, then a store cannot be allowed to commit. If it were committed,
a spurious result could be written to registers or memory.
When a speculative arithmetic result is available, it is likely that a dependent
instruction will start execution based upon the speculative result. In the case of
a spurious arithmetic result being detected, it is necessary that all dependent
instructions be flushed immediately from the pipeline and re-issued. Liu and
Lu [8] make the observation that another pipeline stage needs to be added to
facilitate the checking of speculative values.
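The mechanism can be sketched with a toy reorder buffer whose entries carry a speculative flag standing in for the VPF (a deliberate simplification of [8], not its actual structure):

```python
class ROBEntry:
    """One reorder-buffer entry: an operation, its (possibly speculative)
    result value, and a flag in the spirit of the value prediction field."""
    def __init__(self, op, value, speculative=False):
        self.op = op
        self.value = value
        self.speculative = speculative

def check_speculation(rob, index, exact_value):
    # Compare a speculative entry against the exact result produced by the
    # extra checking stage. On a match the entry may commit; on a mismatch
    # it is corrected and every younger entry is flagged for flush and
    # re-issue (conservatively all of them, dependent or not).
    entry = rob[index]
    entry.speculative = False
    if entry.value == exact_value:
        return []                          # speculation correct: flush nothing
    entry.value = exact_value
    return list(range(index + 1, len(rob)))
```

A real design would flush only true dependents; the conservative flush-all-younger policy here keeps the sketch short.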
As demonstrated above, different arithmetic operations have different carry
characteristics, and hence suggest different approaches to data value speculation.
As modern processors typically contain more than one arithmetic unit, it is easy
to imagine different operations having their own dedicated hardware.
4.3 System Performance
Arithmetic data value speculation aims to improve system performance by increasing IPC. The total performance impact on a system will depend on the
programs being run, the pipeline depth, the time saved by speculation, the accuracy rate of the approximate arithmetic units and the penalty for a spurious
speculation.
It is not possible to quote an absolute benefit (or detriment) to a system
without a full implementation of an approximate arithmetic unit in a specific
architecture, running specific programs.
In order to evaluate the performance of the approximate arithmetic units
independent of the architecture, the performance of these designs has been analysed as accuracy versus time for uniform input data, and accuracy versus complexity when we consider SPEC benchmarks (simulating real programs).
System performance as raw IPC will be evaluated after a full implementation. However, before this occurs a number of architectural and engineering
design problems need to be addressed, including choosing k to maximise IPC,
selecting components for the designs to meet architecture specific circuit timing
requirements, and increased power and area considerations.
5 Conclusion
We have demonstrated that with careful design and analysis, arithmetic approximation can quickly yield accurate results for data value speculation. Furthermore, different arithmetic operations require separate analysis in order to achieve
high performance.
With continued investigation into the field of arithmetic approximation, and
further research into the newly proposed concept of temporally incomplete approximate arithmetic units, data value speculation can be better implemented
in specific architectures. By performing realistic performance profiling, arithmetic approximation can be better tuned to maximise the expected benefit of
speculation in general computing.
References
1. Jose Gonzalez and Antonio Gonzalez. Data value speculation in superscalar processors. Microprocessors and Microsystems, 22(6):293–301, November 1998.
2. A. Li. An empirical study of the longest carry length in real programs. Master’s
thesis, Department of Computer Science, Princeton University, May 2002.
3. D. Koes, T. Chelcea, C. Oneyama, and S. C. Goldstein. Adding faster with application specific early termination. School of Computer Science, Carnegie Mellon
University, Pittsburgh, USA, January 2005.
4. S.M. Nowick, K. Y. Yun, P. A. Beerel, and A. E. Dooply. Speculative completion
for the design of high performance asynchronous dynamic adders. In International
Symposium on Advance Research in Asynchronous Circuits and Systems, pages
210–223, Eindhoven, The Netherlands, 1997. IEEE Comput. Soc. Press.
5. S-L. Lu. Speeding up processing with approximation circuits. Computer, 37(3):67–
73, March 2004.
6. A. W. Burks, H. H. Goldstine, and J. von Neumann. Preliminary discussion of the
design of an electronic computing instrument. Inst. Advanced Study, Princeton,
N.J., June 1946.
7. B. E. Briley. Some new results on average worst case carry. IEEE Transactions
On Computers, C–22(5):459–463, 1973.
8. T. Liu and S-L. Lu. Performance improvement with circuit level speculation.
In Proceedings of the 33rd Annual International Symposium on Microarchitecture,
pages 348–355. IEEE, 2000.
9. B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford
University Press, New York, 2000.
10. N. Pippenger. Analysis of carry propagation in addition: an elementary approach.
Journal of Algorithms, 42(2):317–333, 2002.
11. David W. Wall. Limits of instruction-level parallelism. International Conference
on Architectural Support for Programming Languages and Operating Systems ASPLOS, 26(4):176 – 188, 1991.
12. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, USA, 2003.