>> Peter Montgomery: Welcome to the MSR talk series. Today's speaker is Paul Zimmermann,
on modern computer arithmetic. Paul studied in France and in 1991 got his Ph.D. on average-case
analysis of algorithms under the supervision of Philippe Flajolet. He has since worked on random
generation of combinatorial structures and computer algebra, and started switching to floating-point
computation in the Y2K days. He is now also working in computational number theory, is the principal
author of the GMP-ECM package for factoring with the elliptic curve method, and was session chairman
for interval arithmetic at a recent conference. So welcome, Paul Zimmermann.
>> Paul Zimmermann: Thank you very much, Peter, for the introduction. So today I would
like to present a book I am currently writing with Richard Brent, who is in Australia, at ANU. So I am
the itinerarian [phonetic] Aussie here. Maybe a few words about my group: you probably
know Pierrick Gaudry and Emmanuel Thomé, who work mostly on cryptographic applications, mostly
public-key cryptography.
So today I present to you a book which we have been writing for a long time with Richard
Brent. The current version is free and available on the web. We started this book
in 2003, a long time ago. As I said, it is freely available there; the current version 0.3 has
more than 200 pages, and the paper version should be out next year. We want the online
version to remain available to people.
Okay, so why write another book? The first motivation is that in our field, computer arithmetic,
the main reference is Volume 2 of Knuth, "The Art of Computer Programming," but despite
several revisions, this book is out of date in some domains. For example, it does not cover what is
known as divide and conquer division, and it does not explain the Schönhage [inaudible]. It does not
cover floating-point algorithms much, so we wanted to address those issues.
Another issue is that most books dealing with computer arithmetic either cover only
schoolbook-range algorithms (most books dealing with cryptographic applications are interested
in that range), or they only cover the FFT-range algorithms, so [inaudible]. And we want to address
also what we call middle-range algorithms, for, say, 100 to 10,000 digits, so in the range of
Karatsuba-like algorithms. Also, most books only cover fast multiplication, but there are other
arithmetic operations, division, square root, radix conversion, and we want to address fast
algorithms for those other operations as well.
Another point is that most books let the reader deal with problems like carries and rounding errors.
Usually when you read such a book, they say, oh, there is an algorithm working for
polynomials and the integer case is very similar except for carries; but we want to deal
with precisely that. The idea is that the algorithms we present in the book can be
implemented out of the box directly, whether you prefer a computer algebra system or a computer
arithmetic library.
So that was the main motivation. Before I start describing the book, I want to say a few words
about related books and the comparison with our book. The presentation is in order of appearance,
except that one. So the first reference is the book of Hopcroft and Ullman, which describes a
fast GCD algorithm for polynomials. "The Art of Computer Programming," as we saw, contains
many algorithms. The next book on the slide is quite difficult to find, but it contains some nice
algorithms which are not well known. There is of course chapter 14 of the "Handbook of Applied
Cryptography," which covers many algorithms in the schoolbook range.
We were very much inspired by the book called "Modern Computer Algebra" by von zur
Gathen and Gerhard, which is in the same spirit as our book, but they deal more with polynomials and
power series. There is of course the "Handbook of Elliptic and Hyperelliptic Curve
Cryptography," which once again stays in the schoolbook range. And there is a very new book
which we saw last week, or at the beginning of this week, with Peter; we saw a version, and it will
cover floating-point algorithms. And another little-known reference which I recommend is the
algorithms chapter from the GNU MP reference manual, which describes the implementation; it is
very relevant if you are interested in computer arithmetic.
Okay, so why did we start writing this book? First, Richard has a very long experience in
designing fast arithmetic algorithms. There are several papers Richard wrote, and
he also implemented them in his Fortran MP package. And I also have a long, but much shorter,
experience with several implementations: I made some contributions to the GNU MP library,
for example I implemented the first [inaudible] transform, and I also contribute to the GNU MPFR
library for floating-point numbers.
So our hope is that this work will be useful to researchers and engineers in the field, and also to
students. I take the opportunity of being here at Microsoft Research to do some
advertisement for the GNU MPFR library, which we are developing with a new development team.
It is an arbitrary-precision floating-point library with what we call correct rounding. This means that
for every operation, when you do an addition or multiplication or whatever, you have to specify
the rounding mode you want for the result, and we guarantee that the result we give is always
the best possible result for the given rounding direction.
And so it extends [inaudible] and also to mathematical functions. We guarantee correct
rounding not only for the basic operations but also, for example, for the sine here; you have to
specify to the library which rounding mode you want. One application might be to extend
Microsoft Excel to arbitrary precision, and the key feature is that since we have correct rounding,
you get exactly the same results on every platform, whatever compiler you use, whatever the
operating system, and so on.
>>: How does it compare to MPIR?
>> Paul Zimmermann: MPIR is a fork of the GNU MP library. So it mainly deals with integers,
and there is also in MPIR a class called MPF for floating point, but it does not provide correct rounding.
So the semantics might differ on a 32-bit and a 64-bit machine.
>>: So one of the reasons I ask is that there are some people involved with the Sage project that
are now working on porting MPIR to Windows, and I suppose the same kind of work would have
to be done for this to be --
>> Paul Zimmermann: Yes. So that's the end of the advertisement slide.
Okay, so now we go back to the book. So the book contains four chapters. The first one is about
integer arithmetic. The second one is more about modular arithmetic, the third one about
floating-point arithmetic, and the last one about floating-point evaluation of special
functions and elementary functions.
And so today I will focus on the first two chapters, which are more relevant to the cryptography
group here. We try to cover the main algorithms, and the more technical algorithms
we try to put in the exercises.
A nice thing is that this book was already useful for us, because while writing it we had
several questions; some questions led to new algorithms and some are still unanswered. For
example, the kind of question we ask is: if we have, say, operation one with two cases and two
corresponding algorithms, and for operation two we have an algorithm for case A
but no known algorithm for case B, the question we ask ourselves is whether we can merge
the ideas of this algorithm and that algorithm to obtain an algorithm for the missing case. This was
quite useful for finding new algorithms.
So I will not give you the table of contents of the book, this would be nonsense for the talk, but I
prefer to give you a few excerpts from the book showing you a few algorithms, and if I have time,
there will be a bonus track especially for Peter.
Yes, of course, this is my view of the book. If Richard were giving this talk, he would probably
give a very different one.
Okay, so let's start with the first example, the binary GCD. That's a new algorithm we
discovered with my former Ph.D. student Damien Stehlé. The origin of that work is that there is an
asymptotically fast GCD algorithm which is due to Knuth and Schönhage. First Knuth
described an algorithm with that complexity, and it was later improved by Schönhage to the
complexity O(M(n) log n).
One problem with that algorithm is that you have to deal with carries. We will see an example
later; because of the carries, this algorithm requires a fix-up procedure, so at some point you
have to go back, and this makes the algorithm very tricky to implement. Most descriptions of
this algorithm in textbooks are wrong and cannot be implemented out of the box.
On the other hand, if you consider what I call least significant bit algorithms, so algorithms that
work from the least significant bit like modular reduction, we know such an algorithm for modular
reduction and we know how to divide integers in that way.
So since the GCD is some kind of iterated division, it was natural to ask the question
whether we could design a fast GCD algorithm that would work from the least significant bit.
This is exactly an example of the figure we had before: for division we have both classical
division and binary division, both with the same complexity M(n), and for the GCD we have
the classical Schönhage algorithm with complexity O(M(n) log n), and we wanted to fill that gap here.
Yes, of course, if you have any question, please raise your hand and ask me during my talk.
Don't wait till the end of my talk because surely somebody else has the same question.
Okay, so let's take an example with the classical fast GCD. Consider these two numbers, a and
b, and we want to compute their GCD.
The classical way is to compute what we call the remainder sequence. For each entry you
take, you divide it by the next one, and this gives the next remainder. So this sequence starts with
the two original numbers, then we get 221, 51 and 17, and at the end you get zero. That's the
classical Euclidean algorithm, so the GCD is 17. And you can write this in matrix form, with 2x2
matrices, and when you write this iteration, what you see is that the first two terms,
a_0 and a_1, can be written as a product of 2x2 matrices times the final result
here.
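To make this concrete, here is a small Python sketch (my own code, not from the book); the input
values 935 and 714 are reconstructed from the remainders 221, 51, 17 and quotients 1, 3, 4, 3
mentioned above.

```python
def remainder_sequence(a, b):
    """Classical Euclidean remainder sequence: returns the partial quotients
    and the gcd.  Example (values reconstructed from the talk's slide):
    remainder_sequence(935, 714) == ([1, 3, 4, 3], 17)."""
    quotients = []
    while b:
        q, r = divmod(a, b)
        quotients.append(q)
        a, b = b, r
    return quotients, a

# Each step satisfies (a_{i-1}, a_i) = [[q_i, 1], [1, 0]] * (a_i, a_{i+1}), so the
# product of those 2x2 matrices over the whole sequence maps (gcd, 0) back to
# (a_0, a_1); the fast GCD multiplies them with a balanced product tree.
```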
And so the idea of the fast GCD algorithm is that you can multiply all those 2x2 matrices
together using what we call a product tree: you multiply these two here, those two there,
and then you multiply the product of those two by the product of those two. By doing
that, you can use a fast multiplication algorithm, because the entries will grow, and if
your product tree is balanced, then you always multiply two numbers of about the same
size as they grow.
This is what gives you the O(M(n) log n) complexity. The trick here is that we do not
compute the whole remainder sequence; we only need the partial quotients, 1, 3, 4 and 3. The
trick of the algorithm is to compute the partial quotients while avoiding computing the full
remainder values: you only compute the most significant bits of those remainder terms, and they
are enough to compute the partial quotients.
So this is exactly the idea of the algorithm. And the problem is that since you are only computing
the upper part, the most significant bits, of those remainder terms, and you can have carries
coming from the lower part, the quotient q_i you compute may be incorrect in some cases.
That is the reason why you need this fix-up procedure.
Let's take the binary view. If I write the same numbers in binary, this is my remainder
sequence in decimal and this is the remainder sequence in binary, and you see that the Euclidean
algorithm, at each step, puts zeros here in the most significant bits, until we get a full
set of zeros here at the end. And so the GCD would be that number here.
And so the idea now is to do exactly the same, but working from the least significant bits. For
that we had to introduce what we call the binary division. Consider two integers a, b, which
may be negative, and consider nu_2, the 2-adic valuation, so nu_2(b) is the exponent of the largest
power of two that divides b. We assume that the 2-valuation of b is larger than the 2-valuation of a,
and let j be the difference, j = nu_2(b) - nu_2(a). Then we can show that there is a unique integer q,
with absolute value less than 2^j, such that the remainder r = a + q * 2^(-j) * b has 2-valuation
nu_2(r) greater than nu_2(b). This is what we call the binary division.
And q is called the binary quotient of a by b, and r is the binary remainder. We can compute q
and r quite efficiently as follows; I will not give the details, but basically you have this equation
at the end.
Okay, so let's take an example. Again, take my two favorite numbers, a and b. The initial
condition is fulfilled because the 2-valuation of a is zero, there is no zero bit at the end, and here
b has one zero at the end, so its 2-valuation is one, which is greater than the 2-valuation of a. So
now we use the equation we had before.
We compute the binary quotient, which in that case is 1 modulo 4, so q = 1, and this
is the equation for the binary remainder, which is 1,292, and it can be written in
that form. You can see that r ends with two zeros, one more than b, so the
2-valuation increases each time. You might think it is quite similar to Hensel division
or Montgomery reduction, but the difference here is that we keep the low zeros at each step. And
another remark is that, contrary to the Euclidean algorithm, the values can increase: you see
that the remainder is greater than the two input values.
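Here is a small Python sketch of this binary division (my own code, following the definition above;
pow(x, -1, m) needs Python 3.8 or later):

```python
def nu2(x):
    """2-adic valuation of a nonzero integer (works for negative x too)."""
    return (x & -x).bit_length() - 1

def binary_divide(a, b):
    """Binary division: assumes nu2(a) < nu2(b); returns (q, r) with
    r = a + q * (b >> j), |q| < 2**j, and nu2(r) > nu2(b),
    where j = nu2(b) - nu2(a)."""
    j = nu2(b) - nu2(a)
    m = 1 << (j + 1)
    a_odd, b_odd = a >> nu2(a), b >> nu2(b)     # odd parts of a and b
    q = (-a_odd * pow(b_odd, -1, m)) % m        # q = -a'/b' mod 2^(j+1)
    if q >= 1 << j:
        q -= m                                  # center q in (-2^j, 2^j)
    return q, a + q * (b >> j)

# Example matching the talk: binary_divide(935, 714) gives q = 1 and
# r = 935 + 357 = 1292, whose 2-valuation (2) exceeds that of 714 (1).
```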
So this is the binary view. Now the binary GCD is simply an iteration of the binary division until
you get zero at the end. We start with the numbers in binary; i is the step number and q_i is
the corresponding binary quotient. You can see that the quotient can be negative: it lies
strictly between -2^j and 2^j. And so you see that
here the zeros are increasing from the right, from the least significant bits, and this is exactly the
counterpart of the classical algorithm; the GCD is the odd part of the last non-zero term. And
of course we need that at the start at least one term is odd; this is required by the algorithm.
Usually you take out the power-of-two part of the GCD beforehand.
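And iterating the binary division gives the binary GCD; here is a sketch building on the
binary_divide and nu2 helpers above (the handling of the initial power of two and of the case
where both inputs are odd is my own, following the remark that at least one term must be odd):

```python
def binary_gcd(a, b):
    """Binary GCD sketch: iterate binary_divide until the remainder is zero,
    then take the odd part of the last nonzero term, times the common power
    of two removed at the start.  Assumes a and b are nonzero."""
    t = min(nu2(a), nu2(b))          # common power of two in the gcd
    a, b = a >> t, b >> t
    if nu2(a) > nu2(b):
        a, b = b, a                  # ensure nu2(a) < nu2(b)
    if nu2(a) == nu2(b):             # both odd: replace b by a + b (even, same gcd)
        b = a + b
    while b:
        _, r = binary_divide(a, b)   # nu2 of the remainder strictly increases
        a, b = b, r
    return (abs(a) >> nu2(a)) << t   # odd part of the last nonzero term

# binary_gcd(935, 714) == 17, as on the slide.
```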
Okay, so this is the fast binary GCD. You see on that slide that the carries go from the least
significant part to the most significant part, so when you compute the binary quotients, the
carries are not a problem and we do not need a fix-up procedure in that case. This is what I
said: we need no fix-up procedure.
Another advantage is that the binary division is usually simpler to compute on a classical machine
than the classical division. So we have the same complexity, O(M(n) log n), as the classical fast
GCD. A small disadvantage of that algorithm is due to the fact that the numbers can increase: on
average they increase by about ten percent, so you have a small overhead with that algorithm, and
this was precisely analyzed by Brigitte Vallée using dynamical analysis. And from
that binary GCD you can also compute a modular inverse. Assume you want to invert b
modulo a: you first compute the binary GCD, which gives you this relation at the last iteration
before you get zero, with g the GCD; assume the GCD is one. Then, modulo a, you can write the
inverse of b as this value divided by 2^k, and this division by 2^k can be computed efficiently.
>>: Is it easy to see that the binary GCD terminates?
>> Paul Zimmermann: Because at each step you add at least one zero on the right, and you can
bound the increase on the left.
Okay, and this division by 2^k can be done for a cost below that of the GCD itself, which is
O(M(n) log n), so when n is large, this extra division is a small overhead.
>>: [Inaudible]
>> Paul Zimmermann: So Lehmer's algorithm is more in the schoolbook range. Here we are more
interested in the asymptotically fast range.
>>: When you say "schoolbook," what are you referring to, how large?
>> Paul Zimmermann: I mean the schoolbook range is where you use the quadratic algorithms, up to,
say, one hundred digits. So it's just before Karatsuba multiplication.
So this was the first example. A second example is what I call divide and conquer division. You
don't need to read that slide, it is just an excerpt from the book. The purpose of the slide is
to show you that the carries and corrections and so on are all spelled out, and you can directly take
this algorithm and write it in your preferred system. This algorithm is not described in many
textbooks. It was described in a paper by Burnikel and Ziegler, and there were earlier, but less
precise, proposals by Moenck and Borodin and by Jebelean. Okay, so you see that this algorithm
has two recursive calls; in the next slide I will describe it in more detail.
Okay, so what this algorithm does is perform the division of a number of, say, 2n words by a
number of n words, and this is a division with remainder: you compute both the quotient and
the remainder at the same time. The idea is the following.
a_h is the most significant part, so you first divide the most significant part of a by the most
significant part of b, using the same algorithm recursively. This gives you the quotient q_h and
the remainder r_h, so then you can replace this by this expression, and while doing this,
you see that you have to apply a correction, because we neglected the least significant part of the
divisor. This correction is exactly that term here. So this input is replaced by this, minus this
correction, and you get this number here, which has three quarters of the total size of the input
number.
Now we do the same once again: we divide the most significant part here, a'_h, by the
most significant part of the divisor, giving the low part of the quotient and another
remainder. And now we replace this expression by this, giving this remainder and this
correction, and you get the final result. And of course you might have some carries or
borrows here in this operation too, and so you might have to do some fix-up here, which is
described in the previous slide: if you get a negative remainder, you have to do some
corrections.
And we can prove that the number of corrections is bounded. Okay, so this is the divide and conquer
division. And this is another view of that division, a complexity view. So I have
my quotient here, with the most significant part at the bottom, and I have my divisor here, with the
most significant part here. So I will first divide the most significant part of the quotient by the most
significant part of the divisor, which will yield a -- so, I will first compute the most significant part of
the quotient, and then I will multiply it by the most significant part of the divisor, so this is that
computation.
So first don't consider that computation. Now I have computed the most significant part of the
quotient. I will multiply it by the low significant part of the divisor. So this is that fix-up, that
correction that I did here. And since all the most significant part of the quotient is known, and of
course the divisor is known, I can use a fast multiplication algorithm to do that computation. This
yields a complexity of M(n/2). So then I have computed all those products
here, and then I do the same for the least significant part of the quotient: I compute it
recursively using my algorithm, and I do the second correction for the least significant part of the
divisor.
So this gives you a complexity view of the operations performed in the divide and conquer
division. And you can analyze it precisely: the division of size n yields two recursive calls
of size n/2 plus two multiplications of size n/2, so D(n) = 2 D(n/2) + 2 M(n/2).
In the Karatsuba range, this gives a complexity of about two multiplications for the
division with remainder, where we compute both the quotient and the remainder. In principle you
can get D(n) = M(n) with von zur Gathen's algorithm, but this algorithm is very tricky to
implement; I don't know any implementation of it, I mean, any efficient implementation of it. In
the Toom-Cook range, where M(n) is about n to the power 1.47, you get that complexity, and in the
FFT range you get an extra log n factor due to the recursive calls.
This algorithm is implemented in GNU MP, as its subquadratic division, so in the FFT
range you would get this extra log n factor.
It is quite efficient: on the Pentium M, the threshold with respect to the quadratic
schoolbook division is only about 27 limbs, where a limb is a word, so this is less than
300 digits, beyond which this algorithm is faster than the classical division.
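Here is a Python sketch of such a divide-and-conquer division with remainder (my own code,
modelled on the description above rather than the exact algorithm of the book; it works on Python
integers with an explicit word base beta):

```python
def recursive_divrem(A, B, n, beta=2**64):
    """Divide-and-conquer division sketch: divide A by B, where B has n words
    in base beta, returning (Q, R) with A = Q*B + R and 0 <= R < B.
    Assumes A >= 0 and beta**(n-1) <= B < beta**n.  The book's version also
    normalizes B (top word >= beta/2) so the fix-up loops run O(1) times."""
    if n <= 2:                          # base case: schoolbook (here: built-in divmod)
        return divmod(A, B)
    k = n // 2
    B1, B0 = divmod(B, beta**k)         # split divisor: B = B1*beta^k + B0
    # 1) divide the high part of A by the high part of B, recursively
    Q1, R1 = recursive_divrem(A // beta**(2 * k), B1, n - k, beta)
    # 2) correction: we neglected B0, so subtract Q1*B0*beta^k
    A1 = R1 * beta**(2 * k) + A % beta**(2 * k) - Q1 * B0 * beta**k
    while A1 < 0:                       # fix-up (rare when B is normalized)
        Q1 -= 1
        A1 += beta**k * B
    # 3) same again for the low half of the quotient
    Q0, R0 = recursive_divrem(A1 // beta**k, B1, n - k, beta)
    A2 = R0 * beta**k + A1 % beta**k - Q0 * B0
    while A2 < 0:
        Q0 -= 1
        A2 += B
    return Q1 * beta**k + Q0, A2

# Usage: for B with n words and A < beta**n * B,
#   Q, R = recursive_divrem(A, B, n);  then A == Q*B + R and 0 <= R < B.
```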
Okay, so let's now have a look at classical versus Hensel division; this is a third example
from the book. Here I put the most significant bits on the left. The classical division divides a by b:
we try to find a quotient q such that q times b matches the most significant part of a, and when
you subtract a minus q*b you get a remainder r whose most significant part has been cancelled.
Hensel division is exactly symmetric. Here I have my input a and my divisor b, and I want to
find an integer q' such that q' times b cancels the least significant part of a, which I
can write in that form here. This is what I call Hensel division.
Okay, so this is related to the work of Peter, of course, because you can write Hensel division in
that form, so it's like a classical division except you have a factor of 2^n here. If you only
want the quotient, this means you compute the quotient of a divided by b modulo 2^n, and if
you are interested in the remainder, that means you cancel that term here, you work modulo b,
and you can write r as a divided by 2^n modulo b, and this is exactly Peter's REDC operation. So
the general scheme of Hensel division gives either the 2-adic quotient or the REDC reduction.
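To make the link with REDC explicit, here is a small Python sketch of Hensel division (my own
code; beta and n are parameters, and b must be coprime to beta, i.e. odd when beta is a power
of two):

```python
def hensel_divrem(a, b, n, beta=2**64):
    """Hensel (least-significant-first) division: returns (q, r) with
    a = q*b + r*beta**n and 0 <= q < beta**n, i.e. q*b cancels the n low
    words of a.  Requires gcd(b, beta) = 1."""
    R = beta**n
    q = a * pow(b, -1, R) % R      # Hensel quotient: q = a / b mod beta^n
    r = (a - q * b) // R           # exact division: the n low words are zero
    return q, r                    # r is congruent to a * beta^(-n) mod b
```

Note that r may be negative here; REDC as usually stated uses the negated quotient and an
addition instead, which keeps the result nonnegative, but the congruence r = a * beta^(-n) mod b
is the same.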
Okay. Another excerpt from the book is what we call the odd-even Karatsuba scheme, developed
with my colleague. The classical Karatsuba scheme, which I guess everybody knows, works as
follows. Consider a number a written in an arithmetic base beta, which would be either 2^32 or
2^64, with digits a_i in that base. The classical Karatsuba scheme splits a into an
upper part a_h and a lower part a_l, and the same for b. You evaluate three
products and you recombine them.
What we call the odd-even Karatsuba scheme works as follows. Instead of cutting into an
upper part and a lower part, you cut into an odd part and an even part. The odd part is built from the
terms a_1, a_3, a_5 and so on, and the even part from a_0, a_2 and so on, and you can write exactly
the same computation as in the classical Karatsuba scheme.
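Here is a Python sketch of the odd-even scheme for polynomials given as coefficient lists (my own
code, without the size-balancing refinements discussed next):

```python
def padd(A, B):   # coefficient-wise sum of two coefficient lists
    n = max(len(A), len(B))
    return [(A[i] if i < len(A) else 0) + (B[i] if i < len(B) else 0) for i in range(n)]

def psub(A, B):   # coefficient-wise difference
    n = max(len(A), len(B))
    return [(A[i] if i < len(A) else 0) - (B[i] if i < len(B) else 0) for i in range(n)]

def oddeven_karatsuba(A, B):
    """Odd-even Karatsuba: write A(x) = Ae(x^2) + x*Ao(x^2), B likewise, and
    use the three half-size products Ae*Be, Ao*Bo and (Ae+Ao)*(Be+Bo)."""
    if len(A) <= 1 or len(B) <= 1:                 # schoolbook base case
        if not A or not B:
            return []
        res = [0] * (len(A) + len(B) - 1)
        for i, x in enumerate(A):
            for j, y in enumerate(B):
                res[i + j] += x * y
        return res
    Ae, Ao, Be, Bo = A[0::2], A[1::2], B[0::2], B[1::2]
    P0 = oddeven_karatsuba(Ae, Be)
    P2 = oddeven_karatsuba(Ao, Bo)
    P1 = oddeven_karatsuba(padd(Ae, Ao), padd(Be, Bo))
    M = psub(psub(P1, P0), P2)                     # = Ae*Bo + Ao*Be
    while M and M[-1] == 0:                        # drop a possible zero top term
        M.pop()
    res = [0] * (len(A) + len(B) - 1)
    for i, c in enumerate(P0): res[2 * i] += c
    for i, c in enumerate(M):  res[2 * i + 1] += c
    for i, c in enumerate(P2): res[2 * i + 2] += c
    return res

# oddeven_karatsuba([1, 2, 3], [4, 5]) == [4, 13, 22, 15], i.e.
# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3.
```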
Okay, so I will come back to the odd-even scheme later on, but I want to stress the fact that
in most textbooks, people only consider multiplication between operands of the same size, and the
case of unbalanced operands is not well covered, so we also tried to address this issue. Assume
you want to multiply two operands of different sizes, where m is larger than n, with the
Karatsuba algorithm; there are different possibilities to achieve this.
The first one is to align the least significant bits here: you cut the larger operand into exactly
two halves, and the second operand is cut into one big part of size m/2 and one
smaller part.
When you apply the Karatsuba algorithm to that product, you get the complexity K(m, n)
of multiplying those two numbers: you first have to multiply a_h times b_h, which gives
that term; then you have to multiply this by this, which gives K(m/2); and you
have to multiply the sum of both here with the sum of both here, which also gives that
complexity. So that's the recursive equation you get when you align the operands to the right.
It might be better to center the second operand b, like here: I center it so that I get
two parts of exactly the same size. When I apply the Karatsuba algorithm to that case, I have to
multiply those two operands, which gives that complexity, and those two with exactly the same
complexity, and I have to add both here and add both here.
But when I add those two operands, since I work modulo 2^m or beta^m, they don't
exactly match, and the complexity I get is K(m/2): the sum has exactly the same
size as those operands here.
Okay, so let's take an example. I have my scheme here where I align to the right. Assume
this number has five terms and this number three terms. If I align to the right, I first have to
multiply this by this, but since this operand has size five, I have to cut it, so I take this part of
size three and this part of size two, and then this part would be of size zero. So I have this first
term, for the right part I have this recursive term, and for the sum of both I have this last
term. And K(3,3) can be written as 2k+1, which is seven, so I get 14
operations for that case.
If I center the operand b, I will not give you the details, but we save one operation by
centering b. But this is not always best: if I have one operand of size six and one of size four,
then this scheme needs 17 recursive operations, whereas the centered scheme needs 19.
So it's not always the best.
Okay, so now let's go back to the odd-even Karatsuba scheme. Assume again we want to
multiply a polynomial of five terms by a polynomial of three terms using the odd-even scheme.
We first multiply the even terms together, which leads to a multiplication of size
three times two; then we multiply the odd terms, which leads to a multiplication of size two
times one; and we multiply together the sums of the odd terms and the even terms. This means we
have to add a_0 plus a_1, a_2 plus a_3 here and a_4, and b_0 plus b_1. This is the general
case. For the example with five and three, we get a better complexity: we get only 12 operations
instead of the 14 or 13 before with that scheme.
And the nice thing is that we can do even better: instead of multiplying like we did before, in the
case where both operands have an odd size, we can pad one of them with a zero on the right, on
the least significant side; we add a zero here, we shift it. If you consider polynomials, we
multiply by x. And then we apply the classical odd-even scheme, so this means we
multiply the even part here by the even part here, but here we know there is a zero. We multiply
the odd part by the odd part and we multiply the sums of both parts here.
So in that case, in the five-three case, we have one multiplication of size three times one, and
since we have a zero here, we do not multiply by zero, so it is this complexity. We have a two
by two product here and we have a three times two product here.
And in that case we save one extra multiplication with respect to the previous scheme, so we can
show that with the odd-even Karatsuba scheme it is always better to use that trick when you
have two operands of odd size, and this leads to the following recursive equation.
And the nice feature of that equation is that on the right-hand side, all of the terms depend only
on n, the size of the smaller operand, whereas in the classical scheme the
recursive cost depends on the size of the larger operand, so you lose some
efficiency there.
Okay, so let's go back to Montgomery's multiplication. The main idea is that when you want to
compute a product modulo d, you use that operation, which is a*b divided by a power of the
internal base. Instead of working with an operand a, you use a shifted operand ã.
So you [inaudible], and then you use a modified division to compute the shifted value of that
product, okay.
Okay, so how does it work in the schoolbook range? Here I consider only the reduction, the
last operation, which goes from that product to that value here. So I assume I have an input value
c which is less than d squared, where d is my divisor, and I want to compute c divided by beta to
the n modulo d.
The classical algorithm, the Montgomery reduction, is as follows: you have a loop with a
quotient selection where you take the least significant word of the current remainder,
you multiply it by the precomputed value mu, and then you combine the result with the divisor; and
then there is one last possible fix-up here. At each iteration, after this, the least significant word
of the current remainder vanishes, so you can divide by beta.
So here is an example. I want to divide this number by this number, so I precompute mu, the
value such that mu times d is congruent to -1 modulo 1,000; I consider words of three digits in that
case. What I need to do is multiply this by this precomputed value, modulo 1,000,
and add the product of the result by d. So the product of 23 times this, modulo 1,000, gives this
part of the quotient, and if I multiply this by the divisor I get this value, which I have to add.
This makes the least significant word vanish here, and then I do the
same thing with the next word here. The product of this times mu gives this part of the
quotient; I multiply this by this, I get this value, and this makes the next word vanish.
So that is the next part of the quotient and result, and finally I get this extra word which vanishes,
and I have one extra carry which I have to consider, which is why I simply subtract the divisor,
and I get this final result. And this corresponds to this operation: I have shown that my input c,
plus this quotient times d, is equal to this product, which is exactly this.
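A small Python sketch of this word-by-word reduction (my own code; the talk's example uses
beta = 1000, i.e. words of three decimal digits, and mu*d = -1 mod beta):

```python
def montgomery_reduce(c, d, n, beta=1000):
    """Word-by-word Montgomery reduction (REDC): returns c * beta^(-n) mod d.
    Assumes 0 <= c < d * beta**n and gcd(d, beta) = 1."""
    mu = (-pow(d, -1, beta)) % beta      # precomputed once: mu * d = -1 mod beta
    for _ in range(n):
        q = (c % beta) * mu % beta       # quotient word, from the low word only
        c = (c + q * d) // beta          # c + q*d ends in a zero word; shift it out
    if c >= d:                           # single possible fix-up at the very end
        c -= d
    return c
```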
Okay, so now why is Montgomery reduction better than the classical division? The first thing is
that the quotient selection is very cheap, because you only have to multiply two numbers modulo
the word base, which is a very cheap operation in hardware. Then in the classical division you
have a repair step at each iteration, whereas here you have only one repair step at the very end.
Branch prediction may be difficult in those repair steps, but here you have no branch inside the
loop. In the classical division you also have a dependency between the repair step and the next
loop, and that dependency still holds here; let's see it on the example. When you compute that
part of the quotient, which is this multiplied by 23, this value 640 was not available before, so
you really have to wait for this addition to have this value available. This is what I call the
dependency between the different values. And yes, it is still quadratic; this is just the way I
describe the algorithm.
Now there is another algorithm due to Svoboda, which is quite unknown. The idea is that when you
do the classical division, you have this quotient selection step, which divides the two most
significant words of the current remainder by the most significant word of the divisor, and this
yields the next quotient word. Assume now that the divisor has this special form, which is a one
followed by a zero, so I still consider a base of 1,000 here. Then the quotient selection is easy,
because when I divide this by this, it will always yield the most significant word of the current
remainder. So this is Svoboda's idea: we want to force the divisor to start with that special
value here.
Okay, so let's take an example. To force that, we simply multiply the divisor by a constant,
to get a most significant part starting with 1, 000 here. When I want to divide this number by this
d', this multiplied divisor, it is very easy, because for the quotient selection I simply have to
read the first word here and to write it here as part of the quotient. Then I simply multiply
this by this value and subtract, and so on; then I simply read the first word here, this
is the next word of the quotient, and so on and so forth. I also have a possible fix-up at the
end, and then I have an extra step, because since I have multiplied my divisor by basically one
word here, my new divisor d' is one word longer than the original divisor.
So I have to do an extra step where I do a classical division step with the original divisor d; this
is my extra step here. And I get this final result here. And the point here is that you can
write Svoboda division like this: k*d is the modified divisor, so you first compute a quotient with
the modified divisor, and then you do a last step, where the small q is only one word, and you
divide by the original divisor.
So the main idea is that the quotient selection is now trivial inside the loop, except for the last
step. Another thing is that you have a smaller repair probability, because the most significant
word of the modified divisor is larger. And it is very interesting when you do not need the
quotient, when you are only interested in the remainder, since then you do not have to compute
the product of k times the quotient here, which would be needed if you want the full quotient.
So how can you use Svoboda division? If you are able to choose d, and you choose a d of this
special form, then it's the best situation: you do not need this extra last step. It's the best choice.
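Here is a high-level Python sketch of this structure (my own code; the divmod by the scaled
divisor stands in for the easy word-by-word loop whose quotient words are just read off, and the
constant k is roughly one word when d is normalized):

```python
def svoboda_divide(a, d, n, beta=1000):
    """Svoboda division, structural sketch: scale d by a constant k so that
    k*d = beta**(n+1) + (something smaller than beta**n); then the main loop
    dividing by k*d has trivial quotient selection, and one ordinary division
    step by the original d accounts for the scaling.
    Assumes beta**(n-1) <= d < beta**n and a >= 0."""
    k = -(-beta**(n + 1) // d)      # ceil(beta^(n+1) / d)
    dp = k * d                      # modified divisor: leading words are 1, 0, ...
    Q, r1 = divmod(a, dp)           # stands in for the easy-quotient-selection loop
    q_last, r = divmod(r1, d)       # last step with the original divisor d
    return Q * k + q_last, r        # a = (Q*k + q_last) * d + r, with 0 <= r < d
```

If only the remainder is needed, the product Q*k in the last line can be skipped, which is the
point made above.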
Another possibility is that if you want to do modular arithmetic, instead of working modulo d
you can work modulo k*d. And now we can apply the same technique to the modular reduction
from the least significant bits. This is the classical Montgomery reduction: these are the n loops,
this is the quotient selection, and we multiply by the divisor here. In red I show the difference
with what I call Montgomery-Svoboda reduction; you see the difference with the classical
reduction is very small. We have n minus two loops here, and the main difference is the mu term:
it has moved from that step to this step here, you see the mu is now here.
So mu*d is the modified divisor, where mu is such that mu*d is congruent to minus one modulo
the base, as shown by this equation. And you see that now the quotient selection is trivial, as in
the classical Svoboda case, and we still have this, okay. So let's look at an example here. If I
want to divide this by this, from the least significant parts, this is my modified divisor,
which is congruent to minus one modulo the base, so it ends in 999.
Now to make this term cancel, it is trivial: I simply have to add this many times d', so the
quotient word can simply be read here. I multiply this by d', and this gives this value, which I add
to c, and this cancels the least significant part. And again I simply have to read the least
significant part of the current remainder. In the last step, I use a classical Montgomery reduction
step to finally obtain this result.
So you see we have just applied the idea of Svoboda to the least significant bit case. Okay,
I don't want to take too much time, so that we have time for questions,
but here is another view of Svoboda division. In Svoboda division, you multiply d by a word k
such that you get k*d = beta^(n+1) plus a remainder. You can also choose a multiplier k
such that you have beta^(n+1) minus r, with a negative remainder. And this equation can
be written in that form. And you can generalize it to the case where you precompute beta^(n+i)
modulo d, with i larger than one. You can take for example i equal to n/2: if you want to
reduce this number modulo d, you can simply precompute this power of beta modulo d and
multiply it by c_3, so this is the precomputed value, and c_3 times this
precomputed value you add here, or you subtract it, and you have reduced your number by n/2
words in one operation. This is basically a paper published at the last ARITH
conference. And of course you can do that repeatedly: once you get the remainder here, you
can again replace the upper part of the excess by another precomputed value, and so on.
Okay, so now I would like to present an algorithm that we almost discovered while writing our
book, which is what I call half-Montgomery reduction. In fact, it was already published by Kaihara
and Takagi last year. The idea is that to divide a number c by d, we perform half a
classical division and half a Montgomery reduction.
So if c has 2n words, we make n/2 words vanish here at the top and n/2
words vanish here at the bottom. So we want to find a quotient q such that when we subtract q*d
from c, we get a remainder shifted by beta to the power n/2, and this is the
remainder here, which has the same size as the divisor.
So let's take an example. I want to divide that number by that number, okay. I start with my
input number. I first compute a classical division step: I divide those two words by the most
significant word here, and this gives me that first quotient, so I have made this upper word
vanish here, okay. Now I make that lower word vanish using Montgomery reduction, so I
need this product here, and now this lower word has vanished. Now I do the same again:
I make this upper word vanish here, so now I have two zero words here, and then I
make this word vanish here. And I might have one extra fix-up step here, and you see
at the end I have a divisor of four words, and I have a remainder, a half-Montgomery remainder,
of four words also.
And this operation can be written in that form. The nice thing about this half-Montgomery
reduction is that you can perform both reductions completely independently. This means that if
you have two threads on your processor, you can do both reductions completely
independently. The first step is to compute the classical quotient from the upper part; for
that we need the upper n words and we need the divisor, and this gives us this quotient.
Then for the lower part, we want to compute the Montgomery reduction, and this depends on the
lower words here, so this is the Montgomery quotient.
Then we need to apply the classical quotient, subtracting it after multiplication by the divisor,
and this gives us this part, so the values have changed with respect to the original values. And
then we apply the Montgomery quotient to the least significant part, and this gives us this
value here, where the zeros should be in red, okay. And then there is a possible fix-up if
needed. But you see that steps 1a and 1b can be performed completely
independently, since they only depend on the input number.
And you can even apply the classical quotient to the original input, because in step two you
do not modify the low part here. So while computing the classical
quotient you can already apply it here, and this will not interfere with step 1b. So the idea is: you
compute the classical quotient and the Montgomery quotient, and you apply them both to the
input number.
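Here is a Python sketch of this half-Montgomery (bipartite) reduction (my own code, following
the description above, not the exact algorithm from the paper):

```python
def half_montgomery_reduce(c, d, n, beta=2**64):
    """Half-Montgomery reduction sketch: returns r = c * beta^(-n/2) mod d,
    cancelling the top n/2 words with a classical quotient and the bottom n/2
    words with a Montgomery quotient.  Both quotients depend only on c and d,
    so they can be computed in parallel (e.g. in two threads).
    Assumes n is even, gcd(d, beta) = 1 and 0 <= c < beta**n * d."""
    R = beta**(n // 2)
    q_hi = c // (d * R)                     # (1a) classical quotient: clears the top n/2 words
    q_lo = (-c * pow(d, -1, R)) % R         # (1b) Montgomery quotient: clears the bottom n/2 words
    t = (c - q_hi * d * R + q_lo * d) // R  # exact division by beta^(n/2)
    return t - d if t >= d else t           # at most one fix-up at the end
```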
So the last bonus track, for Peter, is what I call batch multiplication, with my current Ph.D.
student Alexander Kruppa, and this might also be of interest for your group. The context is that in
some applications we want to combine pairs of points on a curve, and when your points are
represented in homogeneous form, there are ways to do that computation, but this would require
three halves of k squared modular products to compute. So what we do is normalize the
z-coordinates to one by doing that operation, so we only have to compute that product, with only
half of k squared modular products. Of course, we could do the normalization with modular
inversions, but it is better to use Montgomery's batch inversion, which only requires one modular
inversion; okay, this is the algorithm, I will not describe it in detail.
You need only one modular inversion and 3k minus 3 multiplications, which is usually
much cheaper. So the idea of batch multiplication is that instead of normalizing the z-coordinate
to one, we normalize it to the product of all the z-coordinates of the input points. For
example, if you have three points, instead of normalizing z_1 to one, we normalize
to z_1 times z_2 times z_3, and this means that for the x-coordinates we multiply x_1 by z_2 and
z_3, we multiply x_2 by z_1 and z_3, and we multiply x_3 by z_1 and z_2, and we can do that in a
total of exactly 4k minus 6 multiplications. So compared to using the batch inversion, we have
saved one modular inversion and exactly three multiplications, and so this is better in the case
where you can compute all of those values here.
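A generic Python sketch of this batch-normalization idea (my own code, using prefix and suffix
products; the exact operation count on the slide comes from a more careful scheduling):

```python
def batch_normalize(xs, zs, p):
    """Batch multiplication sketch: instead of normalizing every z_i to 1
    (which needs a batch inversion), bring all points to the common
    'denominator' z_1*...*z_k, returning x_i * prod_{j != i} z_j mod p.
    No modular inversion is needed at all."""
    k = len(zs)
    prefix = [1] * (k + 1)                # prefix[i] = z_0 * ... * z_{i-1} mod p
    for i in range(k):
        prefix[i + 1] = prefix[i] * zs[i] % p
    suffix = [1] * (k + 1)                # suffix[i] = z_i * ... * z_{k-1} mod p
    for i in range(k - 1, -1, -1):
        suffix[i] = zs[i] * suffix[i + 1] % p
    return [xs[i] * prefix[i] % p * suffix[i + 1] % p for i in range(k)]

# For k = 3 this returns x_1*z_2*z_3, x_2*z_1*z_3, x_3*z_1*z_2 (mod p), as in
# the example above, without any modular inversion.
```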
Thank you very much for your attention.
>> Peter Montgomery: Thank you, any questions.
>>: A very practical question. So take something kind of at the borderline of what
you're calling schoolbook; for example, it's of great interest to us, the 256-bit prime fields for
elliptic curve and pairing operations. What we have noticed over time is that it matters a lot
what the ratio of multiplication cost to inversion cost is in the underlying field. So let's take a
prime field. I mean, some of these improvements on the reduction side could possibly improve
the multiplication, and some of the improvements on the division could possibly improve the
inversion. So having in mind elliptic curves, scalar multiplication and possibly pairing operations,
out of all of these algorithms, what would be your best suggestion for a combination of the choice
of reduction algorithm and division algorithm, and whether that would mean you would suggest
affine coordinates or projective coordinates?
>> Paul Zimmermann: There is no definite answer. It really depends on your application.
Usually for a given application, the best is to try the different algorithms and different possible
representations, and I cannot give you a definite answer. But usually, between the classical
algorithms and the least significant bit algorithms, it's best to use the least significant bit
algorithms, because they are more suited to current processors.
>>: What about compared to, like, say, the divide and conquer? You were saying that was for --
>> Paul Zimmermann: Yes, but you are considering numbers of, say, 300 bits, so this is 30
digits, so this is really at the borderline for Karatsuba or for divide and conquer. I'm not sure this
would be really efficient; I'm not sure you would get a speedup from the divide and
conquer division. What would be quite interesting is really to try to implement von zur
Gathen's algorithm, which has the same complexity as Karatsuba multiplication, but it is tricky to
implement.
>>: And what about the reduction algorithms that you are suggesting, do you definitely think that
you can still get an improvement, say in that range, from switching from your current reduction
to --
>> Paul Zimmermann: Yes, for the half-Montgomery reduction, I definitely think this algorithm is
worthwhile to try. Unfortunately I do not have the skills to implement it, because you really have
to implement it in assembler for that kind of size. I do not have the skills, but I definitely think it
should be faster.
>>: Peter does.
>>: Does the book discuss [inaudible] between schoolbook and the more advanced?
>> Paul Zimmermann: Yes, we give a few figures, but we are not too precise; it depends
on your processor.
>>: Expected range?
>> Paul Zimmermann: Exactly.
>>: Where it crosses?
>> Peter Montgomery: Thank our speaker again.