>> Kristin Lauter: Okay. So today we're very pleased to have Rich Schroeppel visiting
us from Sandia Labs. Rich is an early inventor of a factoring method and the Hasty
Pudding cipher and also a coinventor of elliptic curve point-halving.
So today he's going to speak to us about the SANDstorm hash function. Thank you.
>> Rich Schroeppel: Okay. Can you hear me? All right.
This is actually a three-part talk: the first part's the hash function, the second part's about
elliptic curves, and the third part is about fun stuff. So I'll be a little zippy, but I hope I
can at least keep you interested.
Let's see. Does everyone know what a hash function is? Okay. We can skip the first
slide. We can skip the second slide.
Okay. We're sort of getting to where we need to be now. There are these hash functions.
MD5 has been around forever. I guess everybody here was probably born before MD5
was invented, but would have been in school.
Anyway, MD5 is broken. SHA-1 -- this slide is actually several months old, and in the
meantime SHA-1 has gotten closer to being actually broken.
The main reason for worrying about SHA-2 is it's the same design on steroids. What
happened is MD5 got beefed up and became SHA-1. And that got beefed up further and
became SHA-2. And so there's some question about is this a good way to design a hash
function. It seems to produce random [inaudible] numbers, and then about 10 years later
somebody comes along and finds a way to wiggle in and break the thing.
So NIST set up a contest to produce an alternative or a backup for SHA-2.
Okay. We had 64 submissions by Halloween. They took 51 first-round candidates, and
the requirements were basically that -- you look startled.
>> Kristin Lauter: I was wondering about the drop from 64 to 51.
>> Rich Schroeppel: Oh, well, okay. The spec had to be at least readable and it had to
compile and run in the NIST harness and produce some sort of answer. And that got
them down to 51.
We have about 22 unblemished survivors at this point. There's a lot of argument about
what's a blemish. So the 22 is a very fuzzy number.
NIST has said at Crypto, maybe after Crypto they'll cut down to 15 submissions.
This is the list. And this may be out of date since the slide's a month old. And you
can see there's quite a bit of innovation in the names. And a number of good ideas have
shown up in the hash functions. And the best attack column sort of indicates the situation
with attacking the hash function.
Green means that it's an attack against some much weaker version of the hash function,
so the fact that they found something doesn't actually mean much one way or the other.
>>: [inaudible].
>> Rich Schroeppel: Oh, let's see. Backspace. Red is dead, where there's actually been
a collision produced.
Orange is some slightly defective collision in some sense. Like maybe they weren't
starting from the right IV or it's a theory collision, you know, with --
>>: Just any attack which hasn't been implemented.
>> Rich Schroeppel: Ah. Okay. All right. So a theory collision.
Okay. SANDstorm is required to have four different length outputs. And they're not
supposed to be truncations of each other or something simple like that. They actually
have to be different.
It basically has to plug and play with the SHA-2 spec in terms of same inputs and outputs
and so on. So it could be a drop-in replacement.
We used a truncated tree for reasons that will be clear later. The tree is sort of based on
Merkle-Damgard chaining. Most of the hash functions use the same ideas. They don't
use the tree, though. We have very simple padding. We stick a 1-bit on the end of your
message. The length goes in later to make sure that you can't mess around with it. Then
we have a finishing step to make sure that, well, you can't mess around with it.
Okay. We use what's called a very wide pipe. The internal state of the hashing is much
wider than the final output. That makes it harder to get internal collisions in the hash. It
also gives us resistance to all the various attacks that people have invented.
We have what we regard as a very serious commitment to parallelism. Most of the
submissions instead of having a parallel mode or anything, they just have a little sentence
saying you can use the tree mode if you want.
But that means you'll get a different hash value if you use tree mode, which means if you
and I are both hashing the same thing, we both have to match on the mode we use. So we
have to have made some sort of agreement that we are going to use tree mode and
perhaps we're going to use tree mode with particular parameters and so on.
SANDstorm is a limited-size tree. Period. We don't have to agree on these mode
parameters or anything.
So in a sense, it's a much more limited fixed design. Most of the designs have implicit
parameters or they haven't done anything about tree mode other than add a sentence at the
end of their write-up.
We're set up for parallel. We can be parallel in several different ways: at the hardware
level and various software levels, everything from gates up to you can do -- farm the
hashing out to a hundred different processors and bring the results back and bring them
together and get the same hash answer.
One of the things we put in is if you have a big message that's the same most of the time,
like a movie, and you want to put a wrapper around it identifying who got this copy, then
you're not going to have to rehash your movie each time with our hash function. You can
arrange your wrapper so that everybody you send it to will get a different thing that will
have a different hash value. But you don't have to take the time to rehash the movie.
One of the points of the finishing step is to make sure you can't just stick extra message
on the end. That's a known weakness of a lot of things.
>>: So that wasn't part of the challenge, those features?
>> Rich Schroeppel: Well, the NIST -- you know, they have a list of requirements at the
beginning, and they said we would really like it if it were not susceptible to the following
list of attacks, and they included basically all known attacks on serious hash functions,
and that includes length extension as one of the things you're not supposed to do.
This is our mode. We have a compression function that operates on blocks. And the first
block goes in this box here. And it produces this red arrow, which is -- I guess it's four
hash function lengths. It's the output after rounds 1, 2, 3 and 4.
And that is input to every other hash operation. So if you're dividing the work over
multiple computers, you have to send out a copy of that to the other computers to work
with.
Beyond that, we take the message in in units of 10 blocks -- yes.
>>: You can [inaudible] tree structure?
>> Rich Schroeppel: Yes.
>>: Okay.
>> Rich Schroeppel: Yes. Well, there's not time. I could go into the choice of
parameters. It doesn't hurt anything at all, and it turns out to be a good choice.
Within this block, the next 10 blocks are fed through in effect a Merkle-Damgard
chaining, although it's a very wide pipe sort of arrangement. And the result of that goes
into level 2. And similarly you basically -- it's processed 10 blocks at a time, and those --
these can all be on different computers, if you've got that many computers.
Then level 2 is doing the same deal only it's a hundred blocks at a time.
Then if you actually have a huge message, you get into level 3 and the results from the
level 2 just get passed in there.
The final output goes into the finishing step, and then there's the answer here at the
bottom.
Now, you don't have to use any of this if you don't have -- if your message isn't big
enough to need it. So if you have just a one-block message, you'll get this box and it will
be fed down immediately into the finishing step. So you run compress twice.
If you have a 10-block message -- well, an 11-block message, or up to 11 blocks, you'll
get this, you'll get this. These two get dropped out. And it will feed into the finishing
step and so on.
So the result is if you have a small message or various kinds of small, you don't use this
extra overhead here. And you don't need the extra storage associated with it and so on.
Okay. A summary. There are initial and final compression runs. You have between zero
and three intermediate levels, depending on what you need. Early out means if you don't
need an intermediate level you don't use it. This gives a possibility of a factor of about a
thousand speedup. You could assign all this to different machines and get up to a
thousand speedup if you had a thousand computers to do the work on.
Of course there's always communications overhead and stuff. So actually you'd need to
be hashing a movie or something for it to even be worthwhile.
The first block appears in all subsequent compressions one way or another. So you can't
lose the information in the first block by just putting in a huge message.
Between the individual blocks within the box, the pipe is 4 times the output size. When
you pass down to the next level, you only have double size, which corresponds to the
size of one input block.
I need to move a little faster here. This is inside the compression function. There are five
rounds numbered 0 to 4. Basically the block is passed down. So the initial block for
SANDstorm-256 is going to use 512-bit blocks. The message input comes in through the
Ms. The internal state is half a block size. So you have 256 bits passed along these down
arrows. You pass four of them to the next compression function. So this is in total 1024
bits.
The data from the message comes in as more 256-bit values. These are all [inaudible]
wherever the arrows meet. And it gets passed down and processed.
The arrows are arranged so that this is compatible with hardware pipelining. You could
have a pipeline that was operating, part of the pipeline would be working on round 4 of
this message block, the next part would be working on round 3 and the next one on round
2 and so on. So each round could be in a separate pipeline stage if you wanted.
We reuse the SHA-2 constants on the assumption that SHA-2 is always going to be
around, and this is, quote, a backup system. So by reusing the constants, we avoid
needing some extra memory.
This is one little step of that previous slide. Gives you a close-up on how the arrows are
arranged. In particular it's important that this arrow happens after this one arrives so
that the pipelining can operate properly.
Let's see. The arithmetic -- the main thing about the arithmetic is it uses a multiply
instruction. The rest of it is pretty much standard stuff.
Multiply is wonderful for two reasons: it's a superb mixing operator, and it just kicks the
hell out of differentials.
In addition, if you are so inclined, you can actually design a fast parallel multiplier. It's
another place you can put in parallelism if you really want to throw gates at it.
We have a minor use of the AES sbox. And we interleave arithmetic and logic
operations. Yes.
>>: [inaudible] 64 by 64. How do they [inaudible] react to that idea?
>> Rich Schroeppel: The which?
>>: The guys with the tiny 8-bit processors.
>> Rich Schroeppel: They can live with it. Because our total number of rounds isn't
very high.
Let's see. We also put in a tunable security parameter because of the NIST requirements.
They didn't say you had to have one, but they thought it was a good idea. And what we
do is we just rerun round 4 more times if needed. My guess is in practice it will never be
used, but it's there and tested and so on.
Oh. One other thing to mention. Defense in depth. Our feeling is that all the hash
functions that we've seen that have been used a lot have turned out not to be strong
enough, and the appropriate thing to do is to have more stuff in at the beginning, even
though it's going to slow you down.
In most cases the thing that's slowing down your application is not the hashing; there'll be
something else that's your real slow step. It's very unusual for the hashing to be the slow
step. So we can afford to put more work into the hashing. The point of that is to avoid
having to replace it 10 years from now.
I guess SHA-1 has lasted, what, 17 years now. People are still using it. It's going to be a
while before it's phased out. And it hasn't technically been broken yet.
Okay. The internal arithmetic. The 64-bit version, the hash which produces a 256-bit
output, takes block size 512 and mostly operates on 64-bit words. And the internal
multiplication stuff is 32 bits in and 64 bits out.
So we break up a 64-bit word into two halves. We have an F function that just adds the
squares of the two halves. And mod 2 to the 64th means it's just barely possible that this
could overflow. And the G function does that. It also takes the two halves, adds some
constants, does the multiply and then does the swap of the two halves. That's what the
rotate 32 is. And, again, it's all mod 2 to the 64th.
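(For concreteness, a minimal C sketch of functions with that shape. The constants below are made up and the exact SANDstorm definitions may differ; this is an illustration, not the spec:)

    #include <stdint.h>

    /* Hedged sketch: F sums the squares of the 32-bit halves mod 2^64;
       G adds assumed constants to the halves, multiplies them
       (32 bits in, 64 bits out), then swaps the halves (rotate by 32). */
    static uint64_t F(uint64_t z) {
        uint64_t lo = z & 0xFFFFFFFFull, hi = z >> 32;
        return lo * lo + hi * hi;            /* implicitly mod 2^64 */
    }

    static uint64_t G(uint64_t z) {
        uint64_t lo = ((z & 0xFFFFFFFFull) + 0x12345678ull) & 0xFFFFFFFFull;
        uint64_t hi = ((z >> 32)           + 0x9ABCDEF0ull) & 0xFFFFFFFFull;
        uint64_t p  = lo * hi;               /* 32x32 -> 64-bit product */
        return (p << 32) | (p >> 32);        /* swap halves = rotate 32 */
    }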
We have a choose function. Sbox of Z takes the low order bit -- low order byte only of
the 64-bit word and replaces it with the sbox value on that byte.
And people used to complain, well, that's a table reference, but of course Intel was about
to give us an sbox lookup instruction with constant time. So that's less of a problem.
Then we also have an operation we call BitMix. That takes four 64-bit words and just in
each column rotates by none or one or two or three places. So you have four words in
and four words out, but the bits have been picked from each of the four words to make
one word.
This is BitMix. Nothing to say about it.
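(A hedged sketch of a BitMix-style operation: treat bit position i of the four words as a 4-bit column and rotate it. The real per-column rotation schedule isn't given in the talk; rotating column i by i mod 4 is purely an assumption for illustration:)

    #include <stdint.h>

    /* Four 64-bit words in, four out; each bit is picked from one of the
       four input words, so every output word mixes all four inputs. */
    static void bitmix(uint64_t w[4]) {
        uint64_t out[4] = {0, 0, 0, 0};
        for (int i = 0; i < 64; i++) {
            int r = i & 3;                    /* assumed rotation amount */
            for (int j = 0; j < 4; j++) {
                uint64_t bit = (w[j] >> i) & 1;
                out[(j + r) & 3] |= bit << i; /* rotate the 4-bit column */
            }
        }
        for (int j = 0; j < 4; j++) w[j] = out[j];
    }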
The round function, there are four 64-bit words. And we apply this function to each of
the words. And it basically does a bunch of arithmetic. We then follow it with a BitMix
to guarantee that everything influences everything. Each byte of output is a function of
every bit of input.
Roughly 40 percent of the work of a compression is in the round function. The other 60
percent is in the message schedule. That's a little bit unusual, to put so much work in the
message schedule. But if you look at all the breaks, the place where they get in is the
message schedule because the message schedule's too easy. And when the message
schedule's been harder, that's been a better defense.
This is our message schedule. It's sampled five times from the message. The initial
message, we simply do a bunch of mixing up the bits to make sure every bit gets into the
first -- well, the 0 round run of the function, mixing function.
Then we have this iteration where we take stuff from the previous eight words and apply
a magic function to it and that becomes the new word in the message schedule. And then
the actual length is 32, 33 steps. We pick subblocks of four out of that. So we'll take
four at a time and feed it into the round function. But we just don't take four, four, four,
four. There's four, and then there's a gap of one, four more, gap of one. And then there's
a big gap at the beginning between the round 0 setup and the round 1 inputs.
And we've done some checking to make sure that all the bits here influence all the
down-line bits after a couple of steps and so on.
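(To make the shape concrete, here is a heavily hedged C sketch: the 33-word length and the recurrence over the previous eight words come from the talk, but the mixing function body and tap choice below are pure assumptions:)

    #include <stdint.h>

    #define SCHED_WORDS 33

    /* Placeholder for the "magic function" of the previous eight words;
       the real SANDstorm mixer is not specified here. */
    static uint64_t magic(const uint64_t *prev8) {
        uint64_t acc = 0;
        for (int i = 0; i < 8; i++)
            acc = acc * 0x9E3779B97F4A7C15ull + prev8[i];  /* made up */
        return acc;
    }

    static void expand_schedule(const uint64_t msg[8],
                                uint64_t w[SCHED_WORDS]) {
        for (int i = 0; i < 8; i++)
            w[i] = msg[i];              /* round-0 setup from the message */
        for (int i = 8; i < SCHED_WORDS; i++)
            w[i] = magic(&w[i - 8]);    /* each word from the previous 8;
                                           rounds then sample groups of
                                           four with gaps, per the talk */
    }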
About 60 percent of the work of hashing is in the message schedule, which, incidentally,
should you be so inclined, can be parallelized as a separate thread.
And actually when we recoded this for the x86 64 architecture, I took advantage of that
and I moved the message schedule into one set of registers and that's fed to one of the
core, subcore processors. And I had the round function at a different set of registers and
so on. And it all runs fast.
This is a summary of the God-awful 64-bit x86 architecture. I'll get to the next slide so
nobody chokes.
This is our current performance situation. The original submission was in nice portable
ANSI C and was God-awful slow. This has been improved quite a bit. Single core
assembly language is about 15 clocks per byte. The compression function runs at 13, and
we have 10 percent tree overhead. On long messages you'll pay 10 percent extra
compression over what just chaining through the message blocks would be.
In C, dual core is 10 clocks a byte. I think on the Linux I can actually get it down to 7.5,
but I don't have the dual core code running yet. This is on the NIST test machine. Well,
of course, Linux isn't a part of that configuration, but it's equivalent. And some day I
hope to have it under Windows Vista dual core. But that's not right at the top of my list
right now.
For 512-bit in C, we have something that's sort of like assembly language called Intrinsics,
which is basically C-language calls to the assembly instructions. On a single core,
that's 24 clocks a byte.
Then we have a machine down in the lab which is a dual quad core that's got eight
processors in it, tightly connected. And that does 2.1 clocks per byte, which is six or
seven -- it's basically linear speedup with some overhead.
We have something called a Sun Niagara. I'd never heard of them. But apparently the
gadget has 16 threads and gets a linear speedup. You can get more threads even though
there are no more processors. So you have the processor -- the thing that actually does
the arithmetic, and then it connects to one or another set of registers for each individual
thread.
So the processors can have up to eight threads per, and the speedup on doubling the
number of threads there is basically times 1.5 each time, which is about what you expect
for a parallel processor that doesn't really have that many processors in the bottom.
The point of all this is this is actually the same sort of thing you'd expect for a typical talk
about parallel programming. Yeah.
>>: Rich, am I reading that correct, you're going from two cores to eight cores? You
seem to have gotten more than a factor of 4 speedup.
>>: It's dual quad core.
>> Rich Schroeppel: This is a --
>>: Quad -- dual quad core.
>> Rich Schroeppel: Dual quad core. There are eight processors there --
>>: And the dual core has two.
>> Rich Schroeppel: Right. This is in C, though. This is the assembly code.
>>: Okay. Got it.
>> Rich Schroeppel: Sorry. I threw this together this morning based on some other
notes.
Okay. Scratch memory. This is here because of Neals [phonetic]. Neals produced this
list of these functions are good, these are not so good. These are just barely good. These
are awful. And we came in at the bottom of just barely good.
And one of his criticisms was we were using too much memory. So I actually sat down
and worked out how much memory was required. For SANDstorm-256, a one-block
message, it's actually one block with padding. So you have 511 bits of message plus a
padding bit.
You'll need 128 bytes of scratch space. And that will give you -- as part of that you'll
have a hash value in, well, I guess 32 of those bytes at the end.
And I'm assuming this includes the message being copied into the first 64 bytes and I'm
free to stomp on it and so on.
Now, if we go up to a larger message, a message of three blocks needs 200 scratch bytes;
a message of 11 blocks needs a little bit more; 1000 blocks needs a little bit more. You
go over a thousand blocks, you'll need a little bit more.
So what we're coming down to here is that you can hash small stuff in a couple of
hundred bytes of scratch.
For SANDstorm-512 and 384, everything is doubled. It's the same number of blocks
because the blocks are double size, the scratch space in bytes is doubled.
In particular, if you're doing any kind of public key operation, these numbers are less than
you'll be using in your public key operation anyway, so you can share the scratch
memory.
Okay. Feature summary. 64-bit design. And for the long hash we've used 128 bits.
Block size 512; that's standard. A brick construction by which we mean that if you take
one of the parts and kick it and take it out of the thing, you still expect to have a secure
hash function.
Now, that was a design goal and we mostly satisfy that. But I can't say for sure that we
got every part covered.
Multiplication is the best mixer. It's annoying that you cannot write C code that
efficiently uses multiply. You have to throw away the bottom half or the top half or
something, which means that a portable implementation is going to suffer in
performance. That means there will always be assembly language sitting there at the bottom
doing that multiply for you if you need the performance.
Let's see. What do we do here. Oh. One other design decision, most of the hash
functions bring in the message a little bit at a time. Like one word at a time is very
typical. This we think has turned out to be a design weakness in practice. It means that
somebody who's sitting there trying to mess with you gets to play with the individual
parts of the message schedule and tweak here and tweak here and tweak here and watch
how the bits propagate and fall out.
If instead you bring in enough message to cover the whole state at a time, he can get you
on one round, but then he's stuck on the round either before or after. That's the deal here.
60 percent of the work -- I mentioned this. Serious parallelism. There's no separate
mode for SANDstorm. It has a maximum possible speedup of 1100 or something in
terms of divvying it up among multiple computers.
Well, actually even more if you did -- you could do the message schedule separately.
Anyway, but that's it. We're not going to have a tree that could be infinitely tall and that
you have to allocate extra memory for or something. We just fixed it.
And the parallel mode gets the same answer, the same hash as somebody who's sitting
there doing it serially. So we have a standard. That's the point. We have the tunable
security parameter, because it's kind of required. Wide-pipe design. Block numbers get
put in everywhere, which prevents people from doing various length games and from
moving the blocks around, which, you know, you don't want to happen.
Parallelism. I guess I've said all of these things individually before. If you want to do
movies where you don't recompute the hash, then you have to arrange your format so that
the first block is fixed, because the first block is put into everything else. So your movie
is going to have to start out with Copyright Panavision [phonetic] or something. That
will be block 1 of the message. Block 2 through whatever is part of your wrapper. You
can put more wrapper in wherever. You probably want wrapper at the end, but maybe in
the middle and various places.
If you fix the first block, then all the other parts of the hash can be precomputed where
you have fixed stuff. So this is where you're hashing something and you only want to
recompute your changes in essence.
This is the pitch for multiply. Far and away the best mixer. If you have an XOR
operation, every bit in your result depends linearly on 2 bits of the input. My definition
of depends is that if you flip a bit it's got a 20 percent or higher chance of changing the
answer. For add it's a little bit more. Each bit depends linearly on 2 bits in the same
position and nonlinearly on the next 2 bits over, and a little bit on the bits beyond that.
So the realistic answer here is two linear, two nonlinear.
Multiply. Every bit on average depends nonlinearly on 32 bits of the input.
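(A quick, illustrative C experiment along these lines -- my own counting methodology, not from the talk: flip one low-order input bit and count how many output bits change for XOR, add, and a 32-by-32 multiply. Uses the GCC/Clang popcount builtin:)

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int trials = 100000;
        double fx = 0, fa = 0, fm = 0;
        srand(1);
        for (int t = 0; t < trials; t++) {
            uint64_t a  = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
            uint64_t b  = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
            uint64_t a2 = a ^ (1ull << (rand() % 32));  /* one-bit flip */
            fx += __builtin_popcountll((a ^ b) ^ (a2 ^ b));
            fa += __builtin_popcountll((a + b) ^ (a2 + b));
            fm += __builtin_popcountll(((a  & 0xFFFFFFFFull) * (b & 0xFFFFFFFFull))
                                     ^ ((a2 & 0xFFFFFFFFull) * (b & 0xFFFFFFFFull)));
        }
        /* xor flips exactly 1 bit, add a couple, multiply dozens */
        printf("avg bits flipped: xor %.2f  add %.2f  mul %.2f\n",
               fx / trials, fa / trials, fm / trials);
        return 0;
    }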
The cost of a typical multiply is actually only about three clocks. If you look at what the
computer is doing, you know, you've got these 500 million gates on the chip, and all of
that work goes into making a few arithmetic operations happen at each turn, at each cycle.
The part that's doing the arithmetic is maybe a thousand, couple thousand gates.
Everything else is devoted to getting operands down to this little part that's doing the
arithmetic. So it makes sense in terms of balancing your overheads to get as much work
as you can out of that arithmetic.
Multiply is a little slower than add. Not a lot slower. And it does a whole bunch more
work. That's the deal.
Okay. And it has this drawback: it's hard to write in C.
Comment. You know, we're going to have parallel computers. We're stuck. We seem to
have finally hit the end on Moore's law. If you look at what the clock speeds are and the
chips coming out, they're no longer getting faster clocks. They're giving you more cores
instead. The transistors are still getting smaller, but they're not wiggling any faster.
That's the pitch.
Comment on simplicity. Everyone likes simplicity. It's pretty. It's beautiful. It's risky.
And I'm going to skip on the questions, because I want to get to part two.
And now for something completely different.
>>: Can I interrupt with a question before you go into --
>> Rich Schroeppel: Yes.
>>: -- something completely different? Using multiplication as a core mixer -- it's a great
mixer on devices that have a nice 32-by-32 multiplier.
>> Rich Schroeppel: Right.
>>: This is -- you came up with the AES composition --
>> Rich Schroeppel: Yeah.
>>: -- and the problem is it's really a bear if you -- when you try to have implementations
on small devices.
>> Rich Schroeppel: Yeah. So if you don't have a multiply instruction at all, you've got
to find some way to do it.
>>: Or even just, you know, if you've got an 8-bit multiply, it's going to [inaudible].
>> Rich Schroeppel: If you've got an 8-bit multiply, then you'll be doing 4 by 4 -- 16 of
them and the adds and carries and so on.
The draw -- well, the plus side of that. We've only got five rounds in the compression
function. The total number of multiplies to do a compress is going to be -- let's see -- 3
times 30 -- I think the aggregate is going to be under 200 multiply instructions. And
that's actually comparable to the arithmetic you're doing anyway.
So there's probably a penalty, but it's not a kicker, terrible penalty.
We have less of the other mixing, is what it comes down to, because the multiply
substitutes for a lot of that. You can't just have pure arithmetic, though, or you'll get
killed in a different way.
Okay. And now for something completely different. Those of you who have never heard
of elliptic curves can leave.
Okay. This is a talk about scalar multiplication of points on elliptic curves. P is usually a
point on an elliptic curve. K is usually an integer. And we want to compute K times P,
another elliptic curve point. Mostly we're working over the Galois field GF[2^N]. Mostly.
Mostly we'll use affine coordinates, which means just X and Y. And our elliptic curve
equation is usually the standard one. In fact, usually A is either 1 or 0. Yeah.
>>: On a GF[2^N], don't you have a different formula for the elliptic curve because --
>> Rich Schroeppel: This is [inaudible].
>>: -- squaring is a linear operator and --
>> Rich Schroeppel: Yes, it is. This is that different formula. In the mod P systems --
well, everything except mod 2 and 3, you would instead have AX here instead of AX
squared.
>>: That's why [inaudible].
>> Rich Schroeppel: Right. Oh. I left out the XY. Oh, oh. Bug. Well, I can't edit it.
Okay. There should be an XY term here, as Peter points out. God. I've shown this slide
I don't know how many times and didn't catch that. And nobody in the audience caught it
either.
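(For the record, the corrected equations -- the binary-field curve with its XY term, next to the odd-characteristic form being contrasted:)

    \[
    y^2 + xy = x^3 + ax^2 + b \quad \text{over } GF(2^n), \qquad
    y^2 = x^3 + ax + b \quad \text{over } GF(p),\ p > 3.
    \]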
Okay. The usual way of computing a scalar multiple is called double and add. But when
you look at it closely, it turns out there are two ways to do double and add. There's top
down, left to right, and bottom up, right to left. And this is a -- I lost it. Pardon me. I
have the wrong talk. Huh. Oh, no, it's here. Okay. Sorry.
Okay. The key idea. We use the bottom-up scheme and we postpone adding up the
points. So the bottom-up scheme calculates P and then 2P and 4P and 8P and so on. And
it saves the ones it needs, which is usually about half of them. And then we do all the
adds at the end.
The point of doing that is that you can share all the reciprocals using Peter's trick for
sharing reciprocals.
And this actually works quite well when the cost of reciprocal is more than 3 times the
cost of a multiply.
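(A hedged C sketch of that bottom-up pass with postponed additions. The point type and the helpers ec_double and ec_add_tree are placeholders of my own, not a real API; the batched reciprocal sharing is assumed to live inside ec_add_tree:)

    #include <stdint.h>

    typedef struct { uint64_t x[5], y[5]; } point;  /* affine GF(2^N) point */

    extern point ec_double(point p);                /* assumed helper */
    extern point ec_add_tree(point *pts, int n);    /* pairwise adds with
                                                       shared reciprocals */

    static point scalar_mul(point p, uint64_t k) {  /* toy 64-bit scalar */
        point saved[64];
        int n = 0;
        while (k) {
            if (k & 1) saved[n++] = p;  /* keep 2^i * P when bit i is set */
            p = ec_double(p);
            k >>= 1;
        }
        return ec_add_tree(saved, n);   /* ~log2(n) rounds, one shared
                                           reciprocal per round */
    }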
Okay. Here's the double and add slide again. This is a photograph, and I got the lighting
bad, but it illustrates how you process the bits in the multiplier. You can either go left to
right, this is top down, or bottom up, right to left. And the scheme I'm suggesting here
uses bottom up, right to left.
And they went ahead and did a patent application without asking me, although I had to
provide input. But then it was dropped, so it's unpatented at the moment.
This is the regular double and add scheme. You have a doubler here, and when
appropriate with the control, you add into the running sum.
This is the same thing except the add step has been revised. You save the multiples you
want. You use Peter's reciprocal sharing trick. You pair all the points. And each of
these octagons is going to compute a reciprocal.
And you arrange to have all these reciprocals to be shared using Peter's trick, and then
you have your points. And you now have half as many points, and you do it again, and
you do it a few more times and you're done.
So the total number of reciprocals you need to add these affine values is like six. That's
the deal. All the others are replaced by three multiplies instead.
And these are the formulas for conventional point addition. The denominator of the slope
is the sum of the two X coordinates. You do a reciprocal, and you multiply by the
numerator to get the slope. And there's the final slope joining the two points. The X
formula is just this. The Y formula is this. That's what happens in the regular way of
doing an affine addition in GF[2^N].
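(Written out, for the binary curve above, the standard affine addition of distinct points being described is:)

    \[
    \lambda = \frac{y_1 + y_2}{x_1 + x_2}, \qquad
    x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad
    y_3 = \lambda(x_1 + x_3) + x_3 + y_1 .
    \]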
The rearrangement you make for my scheme, you compute the denominator as standard.
You then have all these octagons that are sharing the reciprocal operation. And then you
do the rest of the processing as standard.
I prefer the word reciprocal to inverse because reciprocal is much more specific than
inverse.
Okay. This is Peter's trick. It's actually probably Peter's 17th trick, because Peter has
done a number of tricks. It sort of could be called one more reciprocal in that if you've
already got one, you can get another one sort of.
And I was talking to Peter. We're not sure if he originated it or not. It's certainly been
associated with him. But it's possible it was something that was known to people in the
'50s from way before I started using computers. It's the sort of trick that would be in the
lore but never get published.
Suppose I need two reciprocals of A and B. I do the following thing. I compute A times
B, I take that reciprocal and then I can separately multiply that by A and by B to get the
two I need.
Now, in -- what I've paid here to do this is I've got a multiply here, here, and here. So
I've paid three multiplies and I've gotten rid of one reciprocal.
And there's a very easy way of extending this so that you can do K reciprocals, bring
them down to 1, and the K minus 1 things that you replaced become 3K minus three
multiplies.
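(Here is a hedged sketch of that K-reciprocal extension, often called batch or simultaneous inversion; fe, fe_mul, and fe_inv are placeholders for whatever field arithmetic you have. It replaces K reciprocals by one reciprocal plus 3(K-1) multiplies, as described:)

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint64_t fe;               /* stand-in for a field element */
    extern fe fe_mul(fe a, fe b);      /* assumed field multiply */
    extern fe fe_inv(fe a);            /* the one expensive reciprocal */

    /* On return, a[i] holds 1/a[i] for every i (k >= 1, all a[i] != 0). */
    static void batch_invert(fe *a, int k) {
        fe *prefix = malloc(k * sizeof *prefix);
        prefix[0] = a[0];
        for (int i = 1; i < k; i++)             /* k-1 multiplies */
            prefix[i] = fe_mul(prefix[i - 1], a[i]);
        fe inv = fe_inv(prefix[k - 1]);         /* the single reciprocal */
        for (int i = k - 1; i > 0; i--) {       /* 2(k-1) more multiplies */
            fe t = fe_mul(inv, prefix[i - 1]);  /* = 1/a[i] */
            inv  = fe_mul(inv, a[i]);           /* now = 1/prefix[i-1] */
            a[i] = t;
        }
        a[0] = inv;
        free(prefix);
    }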
>> Kristin Lauter: So in your chart before, when you had like -- let's say there were N --
I think it was N [inaudible] number [inaudible] and so you have like N over 2 octagons at
the first step. So your little K here is N over 2, so 3 times N over 2 minus 3, whatever,
3N over 2 minus 3 or something like that, then how come you said you get -- pay only three
multiplies each of those steps? You pay like 3N multiplies at that stage and then --
>> Rich Schroeppel: Each octagon costs you three multiplies. So the work per bit is three
multiplies.
>> Kristin Lauter: Oh. You don't put them all together, then. You don't do like K
reciprocals all at the same time; you do [inaudible].
>> Rich Schroeppel: Well, you're -- actually, you're making that -- you're getting that
result. You're getting -- let's see. Let's suppose you're adding in things, which is -- you
probably go to 2N bit number, but you're adding N things, so you've got to add up.
So you pair them. You have N over 2 pairs. So the first layer of octagons you have N
over 2 octagons.
>> Kristin Lauter: Yeah. But so you want to do it 2 by 2; you don't want to do all of
the -- you don't want to share one reciprocal for all of them?
>> Rich Schroeppel: No. You share one reciprocal for all of them.
>> Kristin Lauter: Oh, okay.
>> Rich Schroeppel: So you precompute each sum of X coordinate, each pair. So that's
going to be -- let's see. If you have N over 2 octagons, you're going to be doing N over 2
additions at that point. You need N over 2 reciprocals. That's going to cost you 3 times
N over 2 minus three multiplies.
>> Kristin Lauter: Right.
>> Rich Schroeppel: One reciprocal. So the number of multiplies that it cost you in each
octagon, it's just a little bit less than three.
>> Kristin Lauter: I see.
>> Rich Schroeppel: So each octagon achieves one addition. So the total number of
additions you need with all the levels of the tree is going to be one less than the total
number of things you add.
>> Kristin Lauter: Yeah.
>> Rich Schroeppel: Okay. Now, things -- whether this is worthwhile depends on the
cost of a reciprocal versus a multiply. This can be quite low if you're using field towers.
But we have it on good advice that we shouldn't use field towers anymore.
The typical reported value for this ratio is 10. My personal experience is actually a much
smaller number but still bigger than 3. And if you're doing it mod P, the scheme works
mod P. It'll operate. The typical cost of the reciprocal is 60.
>> Kristin Lauter: So two questions. First one, why do you say Sun IPC [inaudible]?
>> Rich Schroeppel: That's where I wrote code to do field tower. I was comparing 155
using conventional and 156 using a three-level tower. And that led -- you know, for that
particular machine -- this is 1997 or something -- I got a ratio of 1.5. And the reason it's
so low is that the actual number of reciprocals you do is one in the base field. And my
base field was GF[2^13], and I just did a table lookup for that final step.
Actually, that was sort of cheating. I had a bunch of logarithm tables and exponential
tables for doing GF[2^13] arithmetic because it's like four instructions.
>> Kristin Lauter: And then the other question is so for mod P, I mean, this depends a lot
on the prime, the size of the prime, the [inaudible] --
>> Rich Schroeppel: Absolutely. Yeah.
>> Kristin Lauter: -- all of that. I mean, so for many years Peter had a ratio of about 5.
>> Rich Schroeppel: Oh. Okay. All right.
>> Kristin Lauter: And then with special instructions to speed up the multiply, it's up to
10 now. But that's on cryptographic sizes. So we've always maintained this 60, what you
see in the literature, sometimes you see 80.
>> Rich Schroeppel: Yeah.
>> Kristin Lauter: This is very close when you're using special primes, like Mersenne
primes. So where's your 60 coming from? Is that special primes?
>> Rich Schroeppel: That's just pulled out of the literature.
>> Kristin Lauter: Yeah. Okay. So that's more typical if you have a very special prime.
>> Rich Schroeppel: Yeah. Okay. I'm not going to quarrel. I'm just saying the numbers
are obviously highly variable. And whether this scheme actually buys you anything will
depend on your circumstances.
I should mention we had some students code up the mod P one. They were looking for a
high school science fair project. And they found that the overall scheme gave only a
slight improvement, like 5 or 10 percent over a projective implementation.
Now, I thought they could have put more work into speeding up the reciprocal, but it was
time to turn in the project and they'd already written a lot of code. So, anyway, I think in
the mod P case it would actually perform well also.
One of the -- well, there's a complication in mod P and I hope we'll get to it.
Okay. This is a summary of elliptic curve addition for GF[2^N].
It's very similar for mod P, but you have to count the squarings in mod P. In mod 2 you
can pretty much ignore the squarings. They're a minor cost.
Okay. This is just a repeat. Okay. In general for point doubling, if you're doubling
points in projective, you can do this interesting thing. You can do a series of doubles in
projective form, mod P, and then -- well, one of the benefits you get just incidentally in
passing is typically the Z coordinate from one point is a divisor of the Z coordinate for
the next point. And that allows you to slightly simplify Peter's trick. You save one
multiply because of that.
Anyway, you collect all the points that you need to add in projective form. And then you
can convert them all to affine at a cost of only one reciprocal. And I think it's two
multiplies per point that actually needs to be added.
And then the net result on that, it may be a benefit. It depends on the cost of your
reciprocal or not.
However, if you can do point-halving, then you don't need reciprocals for -- you don't
need a doubling chain. You can have a halving chain and there are no reciprocals to
compute at all for that part.
So the combination of point-halving together with the reciprocal sharing trick to do the
addition tree is the most efficient way of doing GF[2^N].
It's even better if you have Koblitz curves. Because then even your point-halving gets
simplified. You can replace that with a doubling substitute, which is multiplying by this
gadget. And some of you must know what this is already, and the other people I don't
want to take time to explain it. So it's basically you can use that as a multiplier and your
main cost at that point is the additions. And there's no nice shortcut for doing the
additions there.
You can even cause yourself to have fewer additions if you precompute a bunch of
multiples. So, for example, tau plus 1 and tau minus 1 can be used as -- when you group
together -- you do your base conversion to base tau, and then you group some of the bits
together, some of the coefficients, and you can use tau minus 1 and tau plus 1.
The doubles and halves of these -- well, the tau powers of these are just rotations. So you
can compute them on the fly. And that's what makes it nice to have several available
digits in your representation.
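(The gadget is the Frobenius map tau; as a reminder of why its powers are nearly free on a Koblitz curve -- this is standard background, not from the slides:)

    \[
    \tau(x, y) = (x^2,\ y^2), \qquad \tau^2 - \mu\tau + 2 = 0, \quad \mu = \pm 1,
    \]

and in a normal basis squaring is a cyclic shift of the coordinate bits, which is why the tau powers of the precomputed points are just rotations.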
Okay. These are just various circumstances that win and don't win. There are a bunch of
different things you can do. You know, you can double in groups.
If you don't want to build the reciprocal circuit for some reason, you can take a penalty
and compute it this way. That's old news.
And we did an implementation with a generic affine curve and got a 30 percent speedup
over the simple projective.
And if you don't have the memory to store the points, if you can store a few points you
can get a large part of the benefit. So this is going back to your smart card situation
where you're only storing -- have room to store four points or something.
>> Kristin Lauter: So when you say 30 percent over the projective [inaudible] more
detail about that comparison? Like so projective using the same double and add strategy,
like the same windowing and all of that?
>> Rich Schroeppel: I don't remember. I didn't do the implementation.
>> Kristin Lauter: Well, because one -- it seems like it would be good to compare it to
just the regular affine without your trick rather than --
>> Rich Schroeppel: Oh, it will kill affine.
>> Kristin Lauter: Yeah. Obviously.
>> Rich Schroeppel: You want to compare it to the best other choice, which is going to
be probably some variation on projective.
>> Kristin Lauter: Yeah. But I would say that that comparison is going to have to vary
widely.
>> Rich Schroeppel: Absolutely. Yeah. So what I think this comes down to is it's one
more variable in the use this, this and this, but not this, this and this when you're actually
making your engineering choice.
I guess from my personal viewpoint I like to be at the high -- the fast, high end, but that's
probably unrealistic in terms of when people are actually building a system. But it does
provide another option in building the thing.
Does speed ever actually matter? You know, occasionally it does. Most of the time,
well, no, actually it doesn't. You need it to be fast enough, but not as fast as possible.
Okay. Well, also this is just other things you can do with it. One interesting addition:
Because all the doublings happen before all the additions, if somebody's listening to the
ka-chunk-ata noises, they won't get the information that they would usually get, where
you have one kind of ka-chunk for the add and a different kind of ka-chunk for the
double and so on.
Okay. And we have a few minutes for part three. Where is part three. Yeah.
Okay. This is fun and games. And I don't have time to do everything here, so I'll pick a
couple.
I did this and I gave a talk about it and sent it off to people, and then I got this letter back
from Don Zagier saying somebody named Dennis had something very similar many years
ago, although it was not published and nobody knew about it.
Okay. There's a function called the dilogarithm. It's an analytic function. It's way back
in the special functions section of the big blue book. It's basically what you want to study
if you're trying to evaluate zeta of 2. So [inaudible] was interested in it.
It also shows up if you've just invented integration and you're running around trying to
figure out what is the integral of tangent and so on, one of your very beginning steps is
going to be to differentiate a bunch of things and see what the results look like and then
try to work backwards.
And then you're going to come across a bunch of things that seem like they ought to be
simple enough to integrate but you can't quite do it, like the error function.
Another one is this very simple log expression here. And you can write the power series
for that easily and you can integrate the power series as long as you're in the unit circle
and so on.
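(The special function in question is the dilogarithm: integrating the power series of -log(1-x)/x term by term inside the unit circle gives)

    \[
    \mathrm{Li}_2(x) = -\int_0^x \frac{\log(1-t)}{t}\,dt
                     = \sum_{n=1}^{\infty} \frac{x^n}{n^2},
    \qquad \mathrm{Li}_2(1) = \zeta(2) = \frac{\pi^2}{6}.
    \]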
And you wound up -- you wind up discovering this special function. And beyond that
you don't think you have much need for it. The following functional equations are in a
sense trivial. They either appear directly from the power series or it's what happens when
you try integrating -- or differentiating different combinations.
You can check these trivially. And you probably don't remember from high school, but
when you tried to integrate sine times cosine, you got back cosine times sine or
something. And then you could just do this little tricky manipulation and all of a sudden
you had your integral. And people were doing similar things with that integral I showed
you earlier.
So you discover these functional equations easily. Then if you're really brilliant, a
hundred years later you find this. This isn't what Spence actually found; he found
something that's equivalent to this. Because I don't think he was actually working with
Li2 but 1 minus it or something.
This is really weird. This first part of it is actually like the regular functional equation for
logarithm. But then there are these extra terms.
This extra term explains my interest. What I've done here is I've converted this function
into a modular function. If there were magically some way to compute these guys
modular -- I could do this one. This one I have a lot of interest in. Because it would also
be modular.
Let's see. I worked with what I call the cose [phonetic] dilog, which is this very similar
thing. But it has an easier functional equation. So it's one you can remember and write
down and work with. This one you need to have sitting in front of you all the time to do
anything with it.
This one is actually true for regular logarithms, interestingly enough. Yeah.
>>: [inaudible] symmetric in X and Y, is it?
>> Rich Schroeppel: Well, it's the log squared. So if you swap X and Y, the logarithm's
negative.
>>: Oh, the logarithm is squared. Okay. Sorry.
>> Rich Schroeppel: Yeah. Yeah, it's -- okay. Here's what's going on. My question was
is there a modular version of the dilog that satisfies the same identities.
Now, there's an analogy with discrete logs. If you look, there are actually a whole
collection of functional equations for this thing and they have lots and lots of log terms in
them. So you could imagine that part might work.
The drawback is there aren't very many rational values to work with. If you're trying to
extend ordinary logs or exponentials from integers or rationals over to mod P, you have a
bunch of rational values you can munch on. You can say I'm going to say log 2 is this one
value, and then I've automatically got log 4, log 8 and so on. It's not so easy with dilog.
There's some question as to what would this even mean. The domain presumably is
going to be modulo some prime. The range, well, we're not real sure about the range. It's
probably got a P minus 1 in it because we had those log terms. And logarithms mod P
come out mod P minus 1.
Another problem with dilog is that there are Riemann sheets and there are infinite values
at a few places and things like that.
And I simply ignored that problem. If I had an equation that involved the zero or an
infinity at a bad place, I'd just say all right, I'll let that one go.
Here's what we're aiming for in the answers. Okay. Logarithm: plain old log of 2 is .30103.
We remember those. We even do a discrete log in which the log part is mod 19 and the
answer comes out mod 18.
Now, for dilog, the dilog of 1/2 is .58 mumble, and there's actually an expression for this
involving pi squared and the square of log 2 or something.
And the modular dilog of 1/2, mod 19, is the same as the modular dilog of 10, and that
turns out to be 265, interpreted modulo 360. 360, of course, is 1 less than the square of
19.
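(Spelling out the arithmetic behind that example: the classical value is Li_2(1/2) = pi^2/12 - (log 2)^2/2, approximately .5822, and)

    \[
    \tfrac{1}{2} \equiv 10 \pmod{19}\ \text{since}\ 2 \cdot 10 = 20 \equiv 1,
    \qquad 19^2 - 1 = 361 - 1 = 360 .
    \]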
So this is the table, the dilog table for mod 19. If the residue N is one of 19 values, then
the dilog is one of these. And I wrote a program, of course, and I have solutions modulo
P for primes ranging from 5 up to 23. The answers always turned out to be modulo P
squared minus 1.
Now, my code was actually agnostic on the modulus. It allowed for the possibility that
the modulus might have come out to be anything. And that involved keeping a couple of
extra degrees of freedom in the intermediate stuff. And it always turned out to be P
squared minus 1.
And I asked what happens on nonprime moduli, and the answer there is you can get the
same table, mod 25, and the answers come out mod 600. And then I said what about
GF[5] squared, and then the answers come out mod 624, which of course is 5 to the 4th
minus 1.
These are the functional equations satisfied by the discrete dilog. And I should have put
the real complex valued ones up here to match. But they basically match.
The one minor gimmick is there's a constant involved, and that depends on your choice of
base for the dilog and your choice of base for the log. But it satisfies all the dilog
functional equations.
It turns out, magically, that every time a log of zero appears in one of these, like here, it's
always multiplied by the log of 1. So you just say [inaudible] that log 0 times log 1 is 0,
just like you do in information theory. And everything plays.
It turns out you can do trilogarithms. That's the same power series only you have cubes
in the denominators. And that's what you study if you're interested in zeta of 3.
There are many functional equations. I was only able to get one of them to work. I tried
two or three. It has 20-odd terms in it. It's a mess. The solutions come out to be either
mod P cubed minus 1 or 7 times that. And I suspect if I introduced an additional
functional equation, the 7 option would disappear and it would just be P cubed minus 1.
But I never checked it.
Whether you can do higher polylogs, I don't know. Maybe.
Okay. Now, does this represent anything real? Is there anything at the bottom here?
This is stuff that I learned from other people; the letters that Zagier sent to me mentioned
this stuff. There's a puzzle as to why this exists.
Every simpler function -- you know, add, subtract, multiply, exponential log, trigs,
elliptic functions and so on -- can be brought down to a base that reaches either the
integers or the rationals. And then you can just move forward modulo P in that basis.
And if you want to move over to a finite field, you actually use algebraic numbers as
some of your ingredients.
This does not touch down -- this has no base in that sense. A particular value is neither
right nor wrong; it's only a collection of them that allows you to decide if the functional
equations are satisfied or not. But you can't just say an individual dilog value is
so-and-so. It only makes sense in the group context.
It would be interesting to have a better way of calculating them. If you had a really good
way, then you could use it for discrete logs. The ways I have are not really good.
And I'm sort of wondering to what extent can this be extended. I know I can extend -- I
have similar examples with theta functions, which will have the similar property of no
rational basis. Being free floating. But I was able to come up with a -- I think it's mod 41
or mod 43 with the theta 0, 1, 2, 3, 4 stuff that satisfies all the functional equations.
So what I'm thinking is a large part of our special functions have mod P analogs. And it
would be worthwhile and interesting to investigate them.
Okay. And I should let you go. Anyone who wants to hang around, we can talk about
the other stuff or answer questions or anything. But I've gabbled on for an hour.
[applause]
>>: [inaudible] your data memory uses [inaudible]?
>> Rich Schroeppel: Of course. Sure. Yeah.
>> Kristin Lauter: More questions?
>>: How much information does a multiplier leak via power analysis and timing
analysis? [inaudible] optimized.
>> Rich Schroeppel: Yeah. A typical processor today, you know, if you reach into a
barrel and pull out a processor, the multiply time is independent of the operands. There
are a number of exceptions.
>>: No early out? I've been trying --
>> Rich Schroeppel: No, no. That's the length.
>>: I'm sorry? I've been trying to get Intel to promise us no early outs. But it is true
today [inaudible].
>>: [inaudible] early out? Wow. Okay. Fine.
>> Rich Schroeppel: So little of the -- you know, the arithmetic is such a tiny part of the
total -- and, you know, to screw up your timing change just because of the data change,
you know, you've got all this --
>>: [inaudible].
>> Rich Schroeppel: It used to be. But, you know, you can't change the timing on a
cycle now. You don't want to. Because there's this march going on of, you know, data
coming in and going out into the cache and the -- you know, every time you introduce an
uncertainty into that, you've got to pay 10 engineers to figure out what it means.
Now, there certainly are machines that like check for zero and things, that it depends on
how many one-zero transitions you have and things. I worked on those machines a long
time ago.
Let's see. What was your other?
>>: The other is power.
>> Rich Schroeppel: Yeah, there's power. I don't know on that. Very possibly it's a
problem. But, on the other hand, if you look into hash functions, a lot of them, the power
depends on the data. These guys have adds -- how many carries.
>>: If you do any [inaudible], the power depends on what -- the data you're processing.
There's no defense against that stuff.
>> Rich Schroeppel: I mean, the two sides of XOR, whether it was X is 1 and Y is 0 or
X is 0 and Y is 1, they might draw different power, one answer might arrive a
picosecond before the other, it's -- yeah.
Anyone else? Thank you.
[applause]