>> Kristin Lauter: Okay. So today we're very pleased to have Rich Schroeppel visiting us from Sandia Labs. Rich is an early inventor of a factoring method and the Hasty Pudding cipher, and also a coinventor of elliptic curve point-halving. So today he's going to speak to us about the SANDstorm hash function. Thank you.

>> Rich Schroeppel: Okay. Can you hear me? All right. This is actually a three-part talk: the first part's the hash function, the second part's about elliptic curves, and the third part is about fun stuff. So I'll be a little zippy, but I hope I can at least keep you interested. Let's see. Does everyone know what a hash function is? Okay. We can skip the first slide. We can skip the second slide. Okay. We're sort of getting to where we need to be now.

There are these hash functions. MD5 has been around forever. I guess everybody here was probably born before MD5 was invented, but would have been in school. Anyway, MD5 is broken. SHA-1 -- this slide is actually several months old, and in the meantime SHA-1 has gotten closer to being actually broken. The main reason for worrying about SHA-2 is that it's the same design on steroids. What happened is MD5 got beefed up and became SHA-1. And that got beefed up further and became SHA-2. And so there's some question about whether this is a good way to design a hash function. It seems to produce random [inaudible] numbers, and then about 10 years later somebody comes along and finds a way to wiggle in and break the thing. So NIST set up a contest to produce an alternative or a backup for SHA-2.

Okay. We had 64 submissions by Halloween. They took 51 first-round candidates, and the requirements were basically that -- you look startled.

>> Kristin Lauter: I was wondering about the drop from 64 to 51.

>> Rich Schroeppel: Oh, well, okay. The spec had to be at least readable, and it had to compile and run in the NIST harness and produce some sort of answer. And that got them down to 51. We have about 22 unblemished survivors at this point. There's a lot of argument about what's a blemish, so the 22 is a very fuzzy number. NIST has said at Crypto, maybe after Crypto, they'll cut down to 15 submissions. This is the list. And this may be out of date, since the slide's a month old. And you can see there's quite a bit of innovation in the names. And a number of good ideas have shown up in the hash functions. And the best-attack column sort of indicates the situation with attacking the hash function. Green means that it's an attack against some much weaker version of the hash function, so the fact that they found something doesn't actually mean much one way or the other.

>>: [inaudible].

>> Rich Schroeppel: Oh, let's see. Backspace. Red is dead, where there's actually been a collision produced. Orange is some slightly defective collision in some sense. Like maybe they weren't starting from the right IV, or it's a theory collision, you know, with --

>>: Just any attack which hasn't been implemented.

>> Rich Schroeppel: Ah. Okay. All right. So a theory collision. Okay. SANDstorm is required to have four different length outputs. And they're not supposed to be truncations of each other or something simple like that; they actually have to be different. It basically has to plug and play with the SHA-2 spec in terms of same inputs and outputs and so on, so it could be a drop-in replacement. We used a truncated tree for reasons that will be clear later. The tree is sort of based on Merkle-Damgård chaining. Most of the hash functions use the same ideas.
They don't use the tree, though. We have very simple padding: we stick a 1 bit on the end of your message. The length goes in later, to make sure that you can't mess around with it. Then we have a finishing step to make sure that, well, you can't mess around with it.

Okay. We use what's called a very wide pipe. The internal state of the hashing is much wider than the final output. That makes it harder to get internal collisions in the hash. It also gives us resistance to all the various attacks that people have invented. We have what we regard as a very serious commitment to parallelism. Most of the submissions, instead of having a parallel mode or anything, just have a little sentence saying you can use tree mode if you want. But that means you'll get a different hash value if you use tree mode, which means if you and I are both hashing the same thing, we both have to match on the mode we use. So we have to have made some sort of agreement that we are going to use tree mode, and perhaps that we're going to use tree mode with particular parameters and so on. SANDstorm is a limited-size tree. Period. We don't have to agree on these mode parameters or anything. So in a sense, it's a much more limited, fixed design. Most of the designs have implicit parameters, or they haven't done anything about tree mode other than add a sentence at the end of their write-up. We're set up for parallel. We can be parallel in several different ways: at the hardware level and at various software levels, everything from gates up to farming the hashing out to a hundred different processors, bringing the results back together, and getting the same hash answer.

One of the things we put in is that if you have a big message that's the same most of the time, like a movie, and you want to put a wrapper around it identifying who got this copy, then you're not going to have to rehash your movie each time with our hash function. You can arrange your wrapper so that everybody you send it to will get a different thing that will have a different hash value, but you don't have to take the time to rehash the movie. One of the points of the finishing step is to make sure you can't just stick extra message on the end. That's a known weakness of a lot of things.

>>: So that wasn't part of the challenge, those features?

>> Rich Schroeppel: Well, the NIST -- you know, they have a list of requirements at the beginning, and they said we would really like it if it were not susceptible to the following list of attacks, and they included basically all known attacks on serious hash functions, and that includes length extension as one of the things you're not supposed to do.

This is our mode. We have a compression function that operates on blocks. And the first block goes in this box here. And it produces this red arrow, which is -- I guess it's four hash function lengths. It's the output after rounds 1, 2, 3 and 4. And that is input to every other hash operation. So if you're dividing the work over multiple computers, you have to send out a copy of that to the other computers to work with. Beyond that, we take in the message in units of 10 blocks -- yes.

>>: You can [inaudible] tree structure?

>> Rich Schroeppel: Yes.

>>: Okay.

>> Rich Schroeppel: Yes. Well, there's not time. I could go into the choice of parameters. It doesn't hurt anything at all, and it turns out to be a good choice.
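As an aside, the padding described a moment ago is about as simple as padding gets. A minimal C sketch, assuming a byte-aligned message and zero-fill after the 1 bit (the zero-fill and the function name are illustrative, not from the spec):

    #include <stdint.h>
    #include <string.h>

    /* Append a single 1 bit, then fill to a block boundary.  The message
       length is bound in later -- via the block numbers and the finishing
       step -- not in the padding itself.  blockbytes is 64 for
       SANDstorm-256. */
    size_t pad_message(uint8_t *buf, size_t msglen, size_t blockbytes) {
        buf[msglen] = 0x80;   /* the appended 1 bit, high bit of a new byte */
        size_t padded = (msglen + blockbytes) / blockbytes * blockbytes;
        memset(buf + msglen + 1, 0, padded - (msglen + 1));
        return padded;        /* total padded length in bytes */
    }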
Within this box, the next 10 blocks are fed through, in effect, Merkle-Damgård chaining, although it's a very wide pipe sort of arrangement. And the result of that goes into level 2. And similarly it's processed 10 blocks at a time, and these can all be on different computers, if you've got that many computers. Then level 2 is doing the same deal, only it's a hundred blocks at a time. Then if you actually have a huge message, you get into level 3, and the results from level 2 just get passed in there. The final output goes into the finishing step, and then there's the answer here at the bottom.

Now, you don't have to use any of this if your message isn't big enough to need it. So if you have just a one-block message, you'll get this box and it will be fed down immediately into the finishing step. So you run compress twice. If you have a 10-block message -- well, an 11-block message, or up to 11 blocks -- you'll get this, you'll get this. These two get dropped out. And it will feed into the finishing step and so on. So the result is that if you have a small message, or various kinds of small, you don't use this extra overhead here, and you don't need the extra storage associated with it and so on.

Okay. A summary. There are initial and final compression runs. You have between zero and three intermediate levels, depending on what you need. Early out means if you don't need an intermediate level you don't use it. This gives a possibility of a factor of about a thousand speedup. You could assign all this to different machines and get up to a thousand-fold speedup, if you had a thousand computers to do the work on. Of course there's always communications overhead and such, so actually you'd need to be hashing a movie or something for it to even be worthwhile. The first block appears in all subsequent compressions one way or another, so you can't lose the information in the first block just by putting in a huge message. Between the individual blocks within a box, the pipe is 4 times the output size. When you pass down to the next level, you only have double size, which corresponds to the size of one input block.

I need to move a little faster here. This is inside the compression function. There are five rounds, numbered 0 to 4. Basically the block is passed down. The initial block for SANDstorm-256 is going to use 512-bit blocks. The message input comes in through the Ms. The internal state is half a block size, so you have 256 bits passed along these down arrows. You pass four of them to the next compression function, so that's 1024 bits in total. The data from the message comes in as more 256-bit values. These are all [inaudible] wherever the arrows meet. And it gets passed down and processed. The arrows are arranged so that this is compatible with hardware pipelining. You could have a pipeline where one part would be working on round 4 of this message block, the next part would be working on round 3, and the next one on round 2 and so on. So each round could be in a separate pipeline stage if you wanted. We reuse the SHA-2 constants on the assumption that SHA-2 is always going to be around, and this is, quote, a backup system. So by reusing the constants, we avoid needing some extra memory. This is one little step of that previous slide. It gives you a close-up on how the arrows are arranged.
In particular it's important that this arrow happens after this one arrives, so that the pipelining can operate properly. Let's see. The arithmetic -- the main thing about the arithmetic is that it uses a multiply instruction. The rest of it is pretty much standard stuff. Multiply is wonderful for two reasons: it's a superb mixing operator, and it just kicks the hell out of differentials. In addition, if you are so inclined, you can actually design a fast parallel multiplier. It's another place you can put in parallelism if you really want to throw gates at it. We have a minor use of the AES sbox. And we interleave arithmetic and logic operations. Yes.

>>: [inaudible] 64 by 64. How do they [inaudible] react to that idea?

>> Rich Schroeppel: The which?

>>: The guys with the tiny 8-bit processors.

>> Rich Schroeppel: They can live with it, because our total number of rounds isn't very high. Let's see. We also put in a tunable security parameter, because of the NIST requirements. They didn't say you had to have one, but they thought it was a good idea. And what we do is we just rerun round 4 more times if needed. My guess is that in practice it will never be used, but it's there and tested and so on.

Oh. One other thing to mention: defense in depth. Our feeling is that all the hash functions we've seen that have been used a lot have turned out not to be strong enough, and the appropriate thing to do is to put more stuff in at the beginning, even though it's going to slow you down. In most cases the thing that's slowing down your application is not the hashing; there'll be something else that's your real slow step. It's very unusual for the hashing to be the slow step. So we can afford to put more work into the hashing. The point of that is to avoid having to replace it 10 years from now. I guess SHA-1 has lasted, what, 17 years now. People are still using it. It's going to be a while before it's phased out. And it hasn't technically been broken yet.

Okay. The internal arithmetic. For the 64-bit design -- the hash with 512-bit blocks, which produces a 256-bit output -- we mostly operate on 64-bit words. And the internal multiplication is 32 bits in and 64 bits out. So we break up a 64-bit word into two halves. We have an F function that just adds the squares of the two halves, mod 2 to the 64th -- it's just barely possible that this could overflow and wrap. And the G function is like that: it also takes the two halves, adds some constants, does the multiply, and then swaps the two halves. That's what the rotate 32 is. And, again, it's all mod 2 to the 64th. We have a choose function. Sbox of Z takes the low-order byte only of the 64-bit word and replaces it with the sbox value on that byte. And people used to complain, well, that's a table reference, but of course Intel was about to give us an sbox lookup instruction with constant time, so that's less of a problem. Then we also have an operation we call BitMix. That takes four 64-bit words, and each bit column gets rotated by zero, one, two or three places. So you have four words in and four words out, but the bits have been picked from each of the four words to make one word. This is BitMix. Nothing to say about it. The round function: there are four 64-bit words, and we apply this function to each of the words. It basically does a bunch of arithmetic. We then follow it with a BitMix to guarantee that everything influences everything. Each byte of output is a function of every bit of input.
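To make those word-level operations concrete, here is a rough C sketch of F, G, and BitMix as just described. The constants in G and the assignment of rotation amounts to bit columns in BitMix are placeholders, not the real SANDstorm values:

    #include <stdint.h>

    /* F: split a 64-bit word into 32-bit halves and add their squares;
       everything wraps mod 2^64. */
    static uint64_t F(uint64_t z) {
        uint64_t hi = z >> 32, lo = z & 0xffffffffULL;
        return hi * hi + lo * lo;
    }

    /* G: add constants to the halves, do the multiply, then swap the
       halves of the result -- the "rotate 32". */
    static uint64_t G(uint64_t z) {
        uint64_t hi = (z >> 32) + 0x9e3779b9ULL;        /* placeholder constants */
        uint64_t lo = (z & 0xffffffffULL) + 0x7f4a7c15ULL;
        uint64_t p = hi * lo;                           /* ~32 bits in, 64 bits out */
        return (p << 32) | (p >> 32);                   /* rotate by 32 = half swap */
    }

    /* BitMix: across four 64-bit words, rotate each bit column by zero,
       one, two, or three word positions.  Which columns rotate by how
       much is assumed here to cycle with the column index. */
    static void bitmix(uint64_t w[4]) {
        const uint64_t m0 = 0x1111111111111111ULL;      /* columns rotated by 0 */
        const uint64_t m1 = m0 << 1, m2 = m0 << 2, m3 = m0 << 3;
        uint64_t r[4];
        for (int i = 0; i < 4; i++)
            r[i] = (w[i] & m0) | (w[(i + 1) & 3] & m1)
                 | (w[(i + 2) & 3] & m2) | (w[(i + 3) & 3] & m3);
        for (int i = 0; i < 4; i++) w[i] = r[i];
    }

Note that the 32-in, 64-out multiply in G is expressible in portable C; it's the wider product in the 512-bit variant that isn't, which comes up again below.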
Roughly 40 percent of the work of a compression is in the round function. The other 60 percent is in the message schedule. That's a little bit unusual, to put so much work in the message schedule. But if you look at all the breaks, the place where they get in is the message schedule, because the message schedule's too easy. And when the message schedule's been harder, that's been a better defense.

This is our message schedule. It's sampled five times from the message. For the initial message, we simply do a bunch of mixing up of the bits to make sure every bit gets into the first -- well, the round 0 run of the mixing function. Then we have this iteration where we take stuff from the previous eight words and apply a magic function to it, and that becomes the new word in the message schedule. And the actual length is 32 or 33 steps. We pick subblocks of four out of that. So we'll take four at a time and feed them into the round function. But we don't just take four, four, four, four. There's four, and then there's a gap of one, four more, a gap of one. And then there's a big gap at the beginning, between the round 0 setup and the round 1 inputs. And we've done some checking to make sure that all the bits here influence all the down-line bits after a couple of steps and so on. About 60 percent of the work of hashing is in the message schedule, which, incidentally, should you be so inclined, can be parallelized as a separate thread. And actually when we recoded this for the x86-64 architecture, I took advantage of that: I moved the message schedule into one set of registers, feeding one of the core's execution units, and I had the round function in a different set of registers and so on. And it all runs fast.
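The shape of that schedule in C, roughly -- the step function, the expansion length, and the sampling offsets below are stand-ins for the real ones:

    #include <stdint.h>
    #include <string.h>

    #define SCHED_LEN 33   /* "32, 33 steps" */

    /* Stand-in for the real SANDstorm schedule step: a new word built
       from the previous eight.  The real function is much stronger. */
    static uint64_t mix(const uint64_t p[8]) {
        uint64_t x = p[0] + (p[3] ^ p[5]) + (p[1] << 13) + (p[7] >> 7);
        return x ^ (p[2] + p[4] + p[6]);
    }

    /* Expand one 512-bit block (8 x 64-bit words) into the schedule. */
    static void message_schedule(uint64_t sched[SCHED_LEN],
                                 const uint64_t block[8]) {
        memcpy(sched, block, 8 * sizeof(uint64_t));
        for (int i = 8; i < SCHED_LEN; i++)
            sched[i] = mix(&sched[i - 8]);
        /* The rounds then sample groups of four words with one-word gaps
           between groups -- e.g. words 8..11, 13..16, 18..21, ... -- and
           a bigger gap after the round-0 setup (offsets illustrative). */
    }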
This is a summary of the God-awful 64-bit x86 architecture. I'll get to the next slide so nobody chokes. This is our current performance situation. The original submission was in nice portable C and was God-awful slow. This has been improved quite a bit. Single-core assembly language is about 15 clocks per byte. The compression function runs at 13, and we have 10 percent tree overhead: on long messages you'll pay about 10 percent more compression work than a flat chain would cost. In C, dual core is 10 clocks a byte. I think on Linux I can actually get it down to 7.5, but I don't have the dual-core code running yet. This is on the NIST test machine. Well, of course, Linux isn't part of that configuration, but it's equivalent. And some day I hope to have it under Windows Vista dual core, but that's not right at the top of my list right now. For the 512-bit version in C we have something that's sort of like assembly language, called intrinsics, which is basically C-language calls to the assembly instructions. On a single core, that's 24 clocks per byte. Then we have a machine down in the lab which is a dual quad core -- it's got eight processors in it, tightly connected. And that does 2.1 clocks per byte, which is six or seven times -- it's basically linear speedup with some overhead. We have something called a Sun Niagara. I'd never heard of them. But apparently the gadget has 16 threads and gets a linear speedup. You can get more threads even though there are no more processors. So you have the processor -- the thing that actually does the arithmetic -- and then it connects to one or another set of registers for each individual thread. So the processors can have up to eight threads each, and the speedup on doubling the number of threads there is basically times 1.5 each time, which is about what you'd expect for a parallel processor that doesn't really have that many processors at the bottom. The point of all this is that it's actually the same sort of thing you'd expect from a typical talk about parallel programming. Yeah.

>>: Rich, am I reading that correctly? You're going from two cores to eight cores, but you seem to have gotten more than a factor of 4 speedup.

>>: It's dual quad core.

>> Rich Schroeppel: This is a --

>>: Quad -- dual quad core.

>> Rich Schroeppel: Dual quad core. There are eight processors there --

>>: And the dual core has two.

>> Rich Schroeppel: Right. This is in C, though. This is the assembly code.

>>: Okay. Got it.

>> Rich Schroeppel: Sorry. I threw this together this morning based on some other notes.

Okay. Scratch memory. This is here because of Niels [phonetic]. Niels produced this list of: these functions are good, these are not so good, these are just barely good, these are awful. And we came in at the bottom of just barely good. And one of his criticisms was that we were using too much memory. So I actually sat down and worked out how much memory was required. For SANDstorm-256, a one-block message -- it's actually one block with padding, so you have 511 bits of message plus a padding bit -- you'll need 128 bytes of scratch space. And as part of that you'll have a hash value in, well, I guess 32 of those bytes at the end. And I'm assuming this includes the message being copied into the first 64 bytes, and that I'm free to stomp on it and so on. Now, if we go up to a larger message: a message of three blocks needs 200 scratch bytes; a message of 11 blocks needs a little bit more; 1000 blocks needs a little bit more; you go over a thousand blocks, you'll need a little bit more. So what we're coming down to here is that you can hash small stuff in a couple of hundred bytes of scratch. For SANDstorm-512 and 384, everything is doubled. It's the same number of blocks, because the blocks are double size; the scratch space in bytes is doubled. In particular, if you're doing any kind of public-key operation, these numbers are less than you'll be using in your public-key operation anyway, so you can share the scratch memory.

Okay. Feature summary. 64-bit design. And for the long hash we've used 128 bits. Block size 512; that's standard. A brick construction, by which we mean that if you take one of the parts and kick it and take it out of the thing, you still expect to have a secure hash function. Now, that was a design goal, and we mostly satisfy that, but I can't say for sure that we got every part covered. Multiplication is the best mixer. It's annoying that you cannot write C code that efficiently uses multiply. You have to throw away the bottom half or the top half or something, which means that a portable implementation is going to suffer in performance. It means there will always be assembly language sitting there at the bottom doing that multiply for you if you need the performance. Let's see. What do we do here. Oh. One other design decision: most of the hash functions bring in the message a little bit at a time. Like one word at a time is very typical. This, we think, has turned out to be a design weakness in practice.
It means that somebody who's sitting there trying to mess with you gets to play with the individual parts of the message schedule, and tweak here and tweak here and tweak here, and watch how the bits propagate and fall down. If instead you bring in enough message to cover the whole state at a time, he can get you on one round, but then he's stuck on the round either before or after. That's the deal here. 60 percent of the work -- I mentioned this. Serious parallelism. There's no separate mode for SANDstorm. It has a maximum possible speedup of 1100 or something in terms of divvying it up among multiple computers. Well, actually even more, if you did the message schedule separately. Anyway, that's it. We're not going to have a tree that could be infinitely tall and that you have to allocate extra memory for or something. We just fixed it. And the parallel mode gets the same answer, the same hash, as somebody who's sitting there doing it serially. So we have a standard. That's the point. We have the tunable security parameter, because it's kind of required. Wide-pipe design. Block numbers get put in everywhere, which prevents various length games, and keeps people from moving the blocks around, which, you know, you don't want to happen. Parallelism -- I guess I've said all of these things individually before.

If you want to do movies where you don't recompute the hash, then you have to arrange your format so that the first block is fixed, because the first block is put into everything else. So your movie is going to have to start out with "copyright Panavision" or something. That will be block 1 of the message. Block 2 through whatever is part of your wrapper. You can put more wrapper in wherever; you probably want wrapper at the end, but maybe in the middle and various places. If you fix the first block, then all the other parts of the hash can be precomputed where you have fixed stuff. So this is for where you're hashing something and you only want to recompute your changes, in essence.

This is the pitch for multiply. Far and away the best mixer. If you have an XOR operation, every bit in your result depends linearly on 2 bits of the input. My definition of depends is that if you flip a bit, it's got a 20 percent or higher chance of changing the answer. For add it's a little bit more: each bit depends linearly on 2 bits in the same position, nonlinearly on the next 2 bits over, and a little bit on the bits beyond that. So the realistic answer here is two linear, two nonlinear. Multiply: every bit on average depends nonlinearly on 32 bits of the input. The cost of a typical multiply is actually only about three clocks. If you look at what the computer is doing -- you know, you've got 500 million gates' worth of chip, and all of that work goes to making a few arithmetic operations happen each cycle. The part that's doing the arithmetic is maybe a thousand, couple thousand gates. Everything else is devoted to getting operands down to this little part that's doing the arithmetic. So it makes sense, in terms of balancing your overheads, to get as much work as you can out of that arithmetic. Multiply is a little slower than add -- not a lot slower -- and it does a whole bunch more work. That's the deal. Okay. And it has this drawback: it's hard to write in C.
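Concretely, the C problem is the width of the product. A sketch: the 32-by-32 case is fine in portable C, but the 64-by-64 case has no portable spelling, because a plain C multiply of two 64-bit values keeps only the bottom half -- the wide version below leans on the GCC/Clang __int128 extension:

    #include <stdint.h>

    /* 32x32 -> 64: portable C gives you the full product. */
    uint64_t mul32_wide(uint32_t a, uint32_t b) {
        return (uint64_t)a * b;
    }

    /* 64x64 -> 128: plain C throws away the top half, so getting both
       halves takes a compiler extension -- or the assembly language
       sitting at the bottom, as he says. */
    void mul64_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;   /* GCC/Clang only */
        *lo = (uint64_t)p;
        *hi = (uint64_t)(p >> 64);
    }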
Comment: you know, we're going to have parallel computers. We're stuck. We seem to have finally hit the end of Moore's law. If you look at the clock speeds on the chips coming out, they're no longer getting faster clocks. They're giving you more cores instead. The transistors are still getting smaller, but they're not wiggling any faster. That's the pitch. Comment on simplicity: everyone likes simplicity. It's pretty. It's beautiful. It's risky. And I'm going to skip the questions, because I want to get to part two. And now for something completely different.

>>: Can I interrupt with a question before you go into --

>> Rich Schroeppel: Yes.

>>: -- something completely different? Using multiplication as a core mixer -- it's a great mixer on devices that have a nice 32-by-32 multiplier.

>> Rich Schroeppel: Right.

>>: This is what came up in the AES competition --

>> Rich Schroeppel: Yeah.

>>: -- and the problem is it's really a bear when you try to do implementations on small devices.

>> Rich Schroeppel: Yeah. So if you don't have a multiply instruction at all, you've got to find some way to do it.

>>: Or even just, you know, if you've got an 8-bit multiply, it's going to [inaudible].

>> Rich Schroeppel: If you've got an 8-bit multiply, then you'll be doing 4 by 4 -- 16 of them -- and the adds and carries and so on. The drawback -- well, the plus side of that: we've only got five rounds in the compression function. The total number of multiplies to do a compress is going to be -- let's see -- 3 times 30 -- I think the aggregate is going to be under 200 multiply instructions. And that's actually comparable to the arithmetic you're doing anyway. So there's probably a penalty, but it's not a killer, terrible penalty. We have less of the other mixing, is what it comes down to, because the multiply substitutes for a lot of that. You can't just have pure arithmetic, though, or you'll get killed in a different way.

Okay. And now for something completely different. Those of you who have never heard of elliptic curves can leave. Okay. This is a talk about scalar multiplication of points on elliptic curves. P is a point on an elliptic curve, K is an integer, and we want to compute K times P, another elliptic curve point. Mostly we're working over the Galois field GF[2^N]. Mostly. Mostly we'll use affine coordinates, which means just X and Y. And our elliptic curve equation is usually the standard one. In fact, usually A is either 1 or 0. Yeah.

>>: On GF[2^N], don't you have a different formula for the elliptic curve, because --

>> Rich Schroeppel: This is [inaudible].

>>: -- squaring is a linear operator and --

>> Rich Schroeppel: Yes, it is. This is that different formula. In the mod P systems -- well, everything except mod 2 and 3 -- you would instead have AX here instead of AX squared.

>>: That's why [inaudible].

>> Rich Schroeppel: Right. Oh. I left out the XY. Oh, oh. Bug. Well, I can't edit it. Okay. There should be an XY term here, as Peter points out -- the curve is Y squared plus XY equals X cubed plus AX squared plus B. God. I've shown this slide I don't know how many times and didn't catch that, and nobody in the audience caught it either.

Okay. The usual way of computing a scalar multiple is called double and add. But when you look at it closely, it turns out there are two ways to do double and add: there's top down, left to right, and there's bottom up, right to left. And this is a -- I lost it. Pardon me. I have the wrong talk. Huh. Oh, no, it's here. Okay. Sorry. Okay. The key idea: we use the bottom-up scheme and we postpone adding up the points. So the bottom-up scheme calculates P and then 2P and 4P and 8P and so on, and it saves the ones it needs, which is usually about half of them. And then we do all the adds at the end.
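A structural sketch of that bottom-up walk with the additions postponed; Point and the two helpers are placeholders, since the actual GF[2^N] field and curve arithmetic is elided:

    #include <stdint.h>

    typedef struct { uint64_t x[5], y[5]; } Point;  /* affine point, coords elided */

    Point point_double(Point p);        /* one affine doubling */
    Point tree_add(Point *pts, int n);  /* pairwise addition tree; each level
                                           shares all its reciprocals */

    /* Bottom-up scalar multiply: scan the scalar from the low bit,
       doubling as you go, saving the 2^i * P that are actually needed,
       and doing all the additions at the end. */
    Point scalar_multiply(Point p, uint64_t k) {
        Point saved[64];
        int n = 0;
        for (; k; k >>= 1) {
            if (k & 1) saved[n++] = p;  /* this power-of-two multiple is wanted */
            p = point_double(p);
        }
        return tree_add(saved, n);      /* about half the bits end up here */
    }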
The point of doing that is that you can share all the reciprocals, using Peter's trick for sharing reciprocals. And this actually works quite well when the cost of a reciprocal is more than 3 times the cost of a multiply.

Okay. Here's the double-and-add slide again. This is a photograph, and I got the lighting bad, but it illustrates how you process the bits in the multiplier. You can either go left to right -- that's top down -- or bottom up, right to left. And the scheme I'm suggesting here uses bottom up, right to left. And they went ahead and did a patent application without asking me, although I had to provide input. But then it was dropped, so it's unpatented at the moment. This is the regular double-and-add scheme. You have a doubler here, and when appropriate, with the control, you add into the running sum. This is the same thing except the add step has been revised. You save the multiples you want. You use Peter's reciprocal-sharing trick. You pair up all the points. And each of these octagons is going to compute a reciprocal. And you arrange for all these reciprocals to be shared, using Peter's trick, and then you have your points. And you now have half as many points, and you do it again, and you do it a few more times, and you're done. So the total number of reciprocals you need to add these affine values is like six. That's the deal. All the others are replaced by three multiplies each instead.

And these are the formulas for conventional point addition. The denominator of the slope is the sum of the two X coordinates. You do a reciprocal, you multiply by the numerator, and there's the final slope joining the two points. The X formula is just this. The Y formula is this. That's what happens in the regular way of doing an affine addition in GF[2^N]. The rearrangement you make for my scheme: you compute the denominator as standard, you then have all these octagons that are sharing the reciprocal operation, and then you do the rest of the processing as standard. I prefer the word reciprocal to inverse, because reciprocal is much more specific than inverse.

Okay. This is Peter's trick. It's actually probably Peter's 17th trick, because Peter has done a number of tricks. It sort of could be called "one more reciprocal," in that if you've already got one, you can get another one, sort of. And I was talking to Peter; we're not sure if he originated it or not. It's certainly associated with him. But it's possible it was something that was known to people in the '50s, from way before I started using computers. It's the sort of trick that would be in the lore but never get published. Suppose I need two reciprocals, of A and B. I do the following thing: I compute A times B, I take the reciprocal of that, and then I can separately multiply it by A and by B to get the two I need. Now, what I've paid here to do this is I've got a multiply here, here, and here. So I've paid three multiplies and I've gotten rid of one reciprocal. And there's a very easy way of extending this so that you can do K reciprocals, bring them down to 1, and the K minus 1 reciprocals that you replaced become 3K minus 3 multiplies.
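A runnable sketch of the trick, extended to K values. For illustration the field is the integers mod a small prime (the prime and the array bound are arbitrary); in the talk it would be GF[2^N] or a big prime field:

    #include <stdint.h>

    #define P 1000003ULL   /* placeholder prime, small enough that a*b fits in 64 bits */

    static uint64_t mulmod(uint64_t a, uint64_t b) { return a * b % P; }

    static uint64_t powmod(uint64_t b, uint64_t e) {
        uint64_t r = 1;
        for (; e; e >>= 1, b = mulmod(b, b))
            if (e & 1) r = mulmod(r, b);
        return r;
    }

    static uint64_t invmod(uint64_t a) { return powmod(a, P - 2); }  /* Fermat */

    /* Replace a[0..k-1] (all nonzero) by their reciprocals:
       one inversion plus 3(k-1) multiplies. */
    void batch_invert(uint64_t *a, int k) {
        uint64_t prefix[64];                   /* prefix[i] = a[0]*...*a[i]; k <= 64 */
        prefix[0] = a[0];
        for (int i = 1; i < k; i++)            /* k-1 multiplies */
            prefix[i] = mulmod(prefix[i - 1], a[i]);

        uint64_t inv = invmod(prefix[k - 1]);  /* the single reciprocal */

        for (int i = k - 1; i > 0; i--) {      /* 2(k-1) multiplies */
            uint64_t ai = a[i];
            a[i] = mulmod(inv, prefix[i - 1]); /* = 1/a[i] */
            inv  = mulmod(inv, ai);            /* = 1/(a[0]*...*a[i-1]) */
        }
        a[0] = inv;
    }

The forward pass is K minus 1 multiplies and the backward pass is 2(K minus 1), which is where the 3K minus 3 comes from.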
>> Kristin Lauter: So in your chart before, when you had like -- let's say there were N -- I think it was N [inaudible] number [inaudible] -- and so you have like N over 2 octagons at the first step. So your little K here is N over 2, so 3 times N over 2 minus 3, whatever, 3N minus 6 over 2 or something like that. Then how come you said you pay only three multiplies at each of those steps? You pay like 3N multiplies at that stage and then --

>> Rich Schroeppel: Each octagon costs you three multiplies. So the work per bit is three multiplies.

>> Kristin Lauter: Oh. You don't put them all together, then. You don't do like K reciprocals all at the same time; you do [inaudible].

>> Rich Schroeppel: Well, you're -- actually, you're getting that result. Let's see. Let's suppose you're adding N things -- which is, you've probably got a 2N-bit number, but you're adding N things -- so you've got to add them up. So you pair them. You have N over 2 pairs. So in the first layer of octagons you have N over 2 octagons.

>> Kristin Lauter: Yeah. But so you want to do it 2 by 2; you don't want to do all of the -- you don't want to share one reciprocal for all of them?

>> Rich Schroeppel: No. You share one reciprocal for all of them.

>> Kristin Lauter: Oh, okay.

>> Rich Schroeppel: So you precompute each sum of X coordinates, each pair. So that's going to be -- let's see. If you have N over 2 octagons, you're going to be doing N over 2 additions at that point. You need N over 2 reciprocals. That's going to cost you 3 times N over 2, minus 3, multiplies --

>> Kristin Lauter: Right.

>> Rich Schroeppel: -- and one reciprocal. So the number of multiplies that it costs you in each octagon is just a little bit less than three.

>> Kristin Lauter: I see.

>> Rich Schroeppel: So each octagon achieves one addition. So the total number of additions you need, with all the levels of the tree, is going to be one less than the total number of things you add.

>> Kristin Lauter: Yeah.

>> Rich Schroeppel: Okay. Now, whether this is worthwhile depends on the cost of a reciprocal versus a multiply. This can be quite low if you're using field towers. But we have it on good advice that we shouldn't use field towers anymore. The typical reported value for this ratio is 10. My personal experience is actually a much smaller number, but still bigger than 3. And if you're doing it mod P, the scheme works mod P. It'll operate. The typical cost of the reciprocal there is 60.

>> Kristin Lauter: So two questions. First one, why do you say Sun IPC [inaudible]?

>> Rich Schroeppel: That's where I wrote code to do field towers. I was comparing 155 bits using conventional arithmetic and 156 bits using a three-level tower. And for that particular machine -- this is 1997 or something -- I got a ratio of 1.5. And the reason it's so low is that the actual number of reciprocals you do is one, in the base field. And my base field was GF[2^13], and I just did a table lookup for that final step. Actually, that was sort of cheating: I had a bunch of logarithm tables and exponential tables for doing GF[2^13] arithmetic, because it's like four instructions.

>> Kristin Lauter: And then the other question is, for mod P, I mean, this depends a lot on the prime, the size of the prime, the [inaudible] --

>> Rich Schroeppel: Absolutely. Yeah.

>> Kristin Lauter: -- all of that. I mean, for many years Peter had a ratio of about 5.

>> Rich Schroeppel: Oh. Okay. All right.

>> Kristin Lauter: And then with special instructions to speed up the multiply, it's up to 10 now. But that's on cryptographic sizes. So we've always maintained this 60 -- what you see in the literature, sometimes you see 80.

>> Rich Schroeppel: Yeah.
>> Kristin Lauter: This is very close to what you get when you're using special primes, like Mersenne primes. So where's your 60 coming from? Is that special primes?

>> Rich Schroeppel: That's just pulled out of the literature.

>> Kristin Lauter: Yeah. Okay. So that's more typical if you have a very special prime.

>> Rich Schroeppel: Yeah. Okay. I'm not going to quarrel. I'm just saying the numbers are obviously highly variable, and whether this scheme actually buys you anything will depend on your circumstances. I should mention we had some students code up the mod P version. They were looking for a high school science fair project. And they found that the overall scheme gave only a slight improvement, like 5 or 10 percent, over a projective implementation. Now, I thought they could have put more work into speeding up the reciprocal, but it was time to turn in the project and they'd already written a lot of code. So, anyway, I think in the mod P case it would actually perform as well. Well -- there's a complication in mod P, and I hope we'll get to it.

Okay. This is a summary of elliptic curve addition for GF[2^N]. It's very similar for mod P, but you have to count the squarings in mod P. In mod 2 you can pretty much ignore the squarings; they're a minor cost. Okay. This is just a repeat. Okay. In general for point doubling: if you're doubling points in projective form, mod P, you can do this interesting thing. You can do a series of doubles in projective form, and then -- well, one of the benefits you get just incidentally in passing is that typically the Z coordinate from one point is a divisor of the Z coordinate for the next point. And that allows you to slightly simplify Peter's trick; you save one multiply because of that. Anyway, you collect all the points that you need to add in projective form, and then you can convert them all to affine at a cost of only one reciprocal, and I think it's two multiplies per point that actually needs to be added. And the net result of that may be a benefit; it depends on the cost of your reciprocal. However, if you can do point-halving, then you don't need reciprocals for -- you don't need a doubling chain. You can have a halving chain, and there are no reciprocals to compute at all for that part. So the combination of point-halving together with the reciprocal-sharing trick to do the addition tree is the most efficient way of doing GF[2^N].

It's even better if you have Koblitz curves, because then even your point-halving gets simplified. You can replace it with a doubling substitute, which is multiplying by this gadget. And some of you must know what this is already, and for the other people I don't want to take time to explain it. So basically you can use that as a multiplier, and your main cost at that point is the additions -- and the reciprocal sharing is a nice help on exactly that. You can even arrange to have fewer additions if you precompute a bunch of multiples. So, for example, tau plus 1 and tau minus 1 can be used: you do your base conversion to base tau, and then you group some of the bits together, some of the coefficients, and you can use tau minus 1 and tau plus 1. The doubles and halves of these -- well, the tau powers of these -- are just rotations, so you can compute them on the fly. And that's what makes it nice to have several available digits in your representation.
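For reference -- standard Koblitz-curve facts, not from the slides -- the "gadget" is the Frobenius map tau, which just squares both coordinates and satisfies a quadratic that lets it stand in for doubling:

    \tau(x, y) = (x^2, \; y^2), \qquad (\tau^2 + 2)\,P = \mu\,\tau P \quad \text{with } \mu = \pm 1

Squaring in GF[2^N] is cheap -- in a normal basis it's a cyclic shift -- which is why the tau powers of the precomputed multiples are "just rotations."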
Okay. These are just various circumstances that win and don't win. There are a bunch of different things you can do. You know, you can double in groups. If you don't want to build the reciprocal circuit for some reason, you can take a penalty and compute it this way. That's old news. And we did an implementation with a generic affine curve and got a 30 percent speedup over the simple projective version. And if you don't have the memory to store the points -- if you can store a few points, you can get a large part of the benefit. So this is going back to your smart card situation, where you only have room to store four points or something.

>> Kristin Lauter: So when you say 30 percent over the projective, [inaudible] more detail about that comparison? Like, projective using the same double-and-add strategy, the same windowing and all of that?

>> Rich Schroeppel: I don't remember. I didn't do the implementation.

>> Kristin Lauter: Well, because it seems like it would be good to compare it to just the regular affine without your trick rather than --

>> Rich Schroeppel: Oh, it will kill affine.

>> Kristin Lauter: Yeah. Obviously.

>> Rich Schroeppel: You want to compare it to the best other choice, which is going to be probably some variation on projective.

>> Kristin Lauter: Yeah. But I would say that that comparison is going to vary widely.

>> Rich Schroeppel: Absolutely. Yeah. So what I think this comes down to is that it's one more variable in the "use this, this and this, but not this, this and this" when you're actually making your engineering choice. I guess from my personal viewpoint I like to be at the fast, high end, but that's probably unrealistic in terms of when people are actually building a system. But it does provide another option in building the thing. Does speed ever actually matter? You know, occasionally it does. Most of the time, well, no, actually it doesn't. You need it to be fast enough, but not necessarily as fast as possible.

Okay. These are just other things you can do with it. One interesting addition: because all the doublings happen before all the additions, if somebody's listening to the ka-chunk-ata noises, they won't get the information that they would usually get, where you have one kind of ka-chunk for the add and a different kind of ka-chunk for the double and so on.

Okay. And we have a few minutes for part three. Where is part three? Yeah. Okay. This is fun and games. And I don't have time to do everything here, so I'll pick a couple. I did this and I gave a talk about it and sent it off to people, and then I got this letter back from Don Zagier saying somebody named Dennis had done something very similar many years ago, although it was not published and nobody knew about it.

Okay. There's a function called the dilogarithm. It's an analytic function. It's way back in the special functions section of the big blue book. It's basically what you want to study if you're trying to evaluate zeta of 2. So [inaudible] was interested in it. It also shows up if you've just invented integration and you're running around trying to figure out what is the integral of tangent and so on: one of your very beginning steps is going to be to differentiate a bunch of things, see what the results look like, and then try to work backwards. And then you're going to come across a bunch of things that seem like they ought to be simple enough to integrate, but you can't quite do it -- like the error function. Another one is this very simple log expression here.
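For reference, the "very simple log expression" and the special function you discover by integrating it are, in the standard notation:

    \operatorname{Li}_2(x) \;=\; -\int_0^x \frac{\log(1 - t)}{t}\,dt \;=\; \sum_{n=1}^{\infty} \frac{x^n}{n^2}, \qquad \operatorname{Li}_2(1) = \zeta(2) = \frac{\pi^2}{6}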
And you can write the power series for that easily, and you can integrate the power series as long as you're in the unit circle and so on. And you wind up discovering this special function. And beyond that you don't think you have much need for it. The following functional equations are in a sense trivial. They either follow directly from the power series, or they're what happens when you try integrating -- or differentiating -- different combinations. You can check these trivially. And you probably don't remember from high school, but when you tried to integrate sine times cosine, you got back cosine times sine or something, and then you could just do this little tricky manipulation and all of a sudden you had your integral. And people were doing similar things with that integral I showed you earlier. So you discover these functional equations easily.

Then if you're really brilliant, a hundred years later you find this. This isn't what Spence actually found; he found something that's equivalent to this, because I don't think he was actually working with Li2 but with 1 minus it or something. This is really weird. The first part of it is actually like the regular functional equation for the logarithm, but then there are these extra terms. This extra term explains my interest. What I've done here is I've converted this function into a modular function. If there were magically some way to compute these guys modularly -- I could do this one. This one I have a lot of interest in, because it would also be modular.

Let's see. I worked with what I call the cose [phonetic] dilog, which is this very similar thing, but it has an easier functional equation. So it's one you can remember and write down and work with. This one you need to have sitting in front of you all the time to do anything with it. This one is actually true for regular logarithms, interestingly enough. Yeah.

>>: [inaudible] symmetric in X and Y, is it?

>> Rich Schroeppel: Well, the log is squared. So if you swap X and Y, the logarithm's negated.

>>: Oh, the logarithm is squared. Okay. Sorry.

>> Rich Schroeppel: Yeah. Yeah. Okay. Here's what's going on. My question was: is there a modular version of the dilog that satisfies the same identities? Now, there's an analogy with discrete logs. If you look, there are actually a whole collection of functional equations for this thing, and they have lots and lots of log terms in them. So you could imagine that part might work. The drawback is there aren't very many rational values to work with. If you're trying to extend ordinary logs or exponentials from the integers or rationals over to mod P, you have a bunch of rational values you can munch on. You can say, I'm going to say log 2 is this one value, and then I've automatically got log 4, log 8 and so on. It's not so easy with dilog. There's some question as to what this would even mean. The domain presumably is going to be modulo some prime. The range -- well, we're not real sure about the range. It's probably got a P minus 1 in it, because we had those log terms, and logarithms mod P come out mod P minus 1. Another problem with dilog is that there are Riemann sheets, and there are infinite values at a few places, and things like that. And I simply ignored that problem. If I had an equation that involved a zero or an infinity at a bad place, I'd just say, all right, I'll let that one go.

Here's what we're aiming for in the answers. Okay. Logarithm: plain old log of 2 is .30103. We remember those.
We even do a discrete log, in which the log part is mod 19 and the answer comes out mod 18. Now, for dilog: the dilog of 1/2 is .58-mumble, and there's actually an expression for this involving pi squared and the square of log 2 or something. And the modular dilog of 1/2, mod 19, is the same as the modular dilog of 10, and that turns out to be 265, interpreted modulo 360. 360, of course, is 1 less than the square of 19. So this is the table, the dilog table for mod 19. If the residue N is one of the 19 values, then the dilog is one of these. And I wrote a program, of course, and I have solutions modulo P for primes ranging from 5 up to 23. The answers always turned out to be modulo P squared minus 1. Now, my code was actually agnostic on the modulus. It allowed for the possibility that the modulus might have come out to be anything. And that involved keeping a couple of extra degrees of freedom in the intermediate stuff. And it always turned out to be P squared minus 1. And I asked what happens with nonprime moduli, and the answer there is you can get the same kind of table mod 25, and the answers come out mod 600. And then I said, what about GF[5^2]? And there the answers come out mod 624, which of course is 5 to the 4th minus 1.

These are the functional equations satisfied by the discrete dilog. And I should have put the real complex-valued ones up here to match, but they basically match. The one minor gimmick is that there's a constant involved, and that depends on your choice of base for the dilog and your choice of base for the log. But it satisfies all the dilog functional equations. It turns out, magically, that every time a log of zero appears in one of these, like here, it's always multiplied by the log of 1. So you just say [inaudible] that log 0 times log 1 is 0, just like you do in information theory. And everything plays.

It turns out you can do trilogarithms. That's the same power series, only you have cubes in the denominators. And that's what you study if you're interested in zeta of 3. There are many functional equations. I was only able to get one of them to work -- I tried two or three. It has 20-odd terms in it. It's a mess. The solutions come out to be either mod P cubed minus 1, or 7 times that. And I suspect if I introduced an additional functional equation, the 7 option would disappear and it would just be P cubed minus 1. But I never checked it. Whether you can do higher polylogs, I don't know. Maybe.

Okay. Now, does this represent anything real? Is there anything at the bottom here? This is stuff that I learned from other people; the letters that Zagier sent me mentioned this stuff. There's a puzzle of why this exists at all. Every simpler function -- you know, add, subtract, multiply, exponential, log, trigs, elliptic functions and so on -- can be brought down to a base that reaches either the integers or the rationals. And then you can just move forward modulo P in that basis. And if you want to move over to a finite field, you actually use algebraic numbers as some of your ingredients. This does not touch down -- this has no base in that sense. A particular value is neither right nor wrong; it's only a collection of them that allows you to decide whether the functional equations are satisfied or not. But you can't just say an individual dilog value is so-and-so. It only makes sense in the group context. It would be interesting to have a better way of calculating them. If you had a really good way, then you could use it for discrete logs. The ways I have are not really good.
And I'm sort of wondering to what extent this can be extended. I know I can extend -- I have similar examples with theta functions, which have the similar property of no rational basis -- being free-floating. But I was able to come up with a -- I think it's mod 41 or mod 43 -- with the theta 0, 1, 2, 3, 4 stuff, that satisfies all the functional equations. So what I'm thinking is that a large part of our special functions have mod P analogs, and it would be worthwhile and interesting to investigate them.

Okay. And I should let you go. Anyone who wants to hang around, we can talk about the other stuff or answer questions or anything. But I've gabbled on for an hour.

[applause]

>>: [inaudible] your data memory uses [inaudible]?

>> Rich Schroeppel: Of course. Sure. Yeah.

>> Kristin Lauter: More questions?

>>: How much information does a multiplier leak via power analysis and timing analysis? [inaudible] optimized.

>> Rich Schroeppel: Yeah. A typical processor today -- you know, if you reach into a barrel and pull out a processor, the multiply time is independent of the operands. There are a number of exceptions.

>>: No early out? I've been trying --

>> Rich Schroeppel: No, no. That's the length.

>>: I'm sorry? I've been trying to get Intel to promise us no early outs. But it is true today [inaudible].

>>: [inaudible] early out? Wow. Okay. Fine.

>> Rich Schroeppel: So little of the -- you know, the arithmetic is such a tiny part of the total -- and, you know, to screw up your timing just because the data changed, you know, you've got all this --

>>: [inaudible].

>> Rich Schroeppel: It used to be. But, you know, you can't change the timing on a cycle now. You don't want to. Because there's this march going on of, you know, data coming in and going out into the cache and the -- you know, every time you introduce an uncertainty into that, you've got to pay 10 engineers to figure out what it means. Now, there certainly are machines that, like, check for zero and things, where it depends on how many one-zero transitions you have and things. I worked on those machines a long time ago. Let's see. What was your other?

>>: The other is power.

>> Rich Schroeppel: Yeah, there's power. I don't know on that. Very possibly it's a problem. But, on the other hand, if you look into hash functions, a lot of them, the power depends on the data. These guys have adds -- how many carries?

>>: If you do any [inaudible], the power depends on the data you're processing. There's no defense against that stuff.

>> Rich Schroeppel: I mean, the two sides of an XOR -- whether it was X is 1 and Y is 0, or X is 0 and Y is 1 -- they might have different power; one answer might arrive a picosecond before the other. It's -- yeah. Anyone else? Thank you.

[applause]