

>> Michael Naehrig: All right. Good morning, everybody. It's my pleasure to introduce Peter Schwabe from Radboud University Nijmegen in the Netherlands. And yeah, he's known for working on high-speed crypto implementations, and today he's going to talk to us about vector instructions.

>> Peter Schwabe: Thank you very much, Michael, for the introduction, and well, thank you very much for inviting me. Yeah, as Michael said, I'm from Radboud University Nijmegen. That's, well, already led to some brief discussion just before about what other things are going on at that university, but I am going to talk about optimizing crypto with vector instructions. And maybe I should at first say something about the title of this talk. As I already pointed out, this is not a very relevant question, but I was at a small workshop in Paris together with Dan Bernstein, my Ph.D. supervisor, and we were chatting a bit about high-speed implementations of crypto and about papers being published, and that pretty few papers actually use vector instructions, and we were thinking about why this is the case. Okay, there are some cases where vector instructions just won't speed up any computations, but there are some computations where, very obviously, it's faster, and still people don't use them. We said somehow people must be afraid of using vector instructions. They must be afraid of vectors, maybe. So afterwards we were just walking around the conference asking everybody whether they were afraid of vectors.

>>: That's a good point. Well, crypto on GPUs is very high [indiscernible].

>> Peter Schwabe: But it's not really vector instructions. It's slightly different. Well, yeah. So at the beginning of my talk, let me start with a quote, and that's a quote from a classical paper. It's from Michael Flynn from 1966, where, I think, it's the first paper where the term SIMD appeared. The context where it appears is that he's looking at a certain reference organization, and he calls this organization the prototype of a class of machines which we label Single Instruction Stream-Single Data Stream, SISD. And later, he's looking at variations of that, and the first one is Single Instruction Stream-Multiple Data Streams, and this is what I am going to look at in this talk. And actually, I am going to look at a very specific implementation of SIMD; namely, vector instructions.

Before doing that, let me look at SISD, so the very classical thing of how most computer programs are implemented. And this is just an example, well, in sort of high-level assembly code -- it's qhasm -- where what happens is we just do a 32-bit addition. So we declare two variables, a and b, as register variables, in this case 64-bit registers, and then we load two values from memory into a and b. We do a 32-bit addition, and then we store the result. Now, when we do something similar with SIMD instructions, with vector instructions, it looks like that. It looks almost the same.

In this case, we just declare two variables, which are now 128-bit register variables, and we load 128 bits from memory into these two, and then we do a four-way parallel addition, which means that we basically add a0 plus b0 and a1 plus b1 and a2 plus b2 and so on, and then at the end we store the result.

What this small example shows is what you can really do with vector instructions. It's not only that you have independent data streams that are handled with the same instructions, but these data streams have to be arranged in memory in a certain way. The individual chunks have to sit next to each other in memory; otherwise, vector instructions are very, very unhappy. If you want to load something from here and from here and from here and here and combine that in a vector register, that's pretty inefficient.
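To make the two snippets concrete, here is a rough C translation using SSE2 intrinsics (the slides use qhasm; the function names here are mine):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* SISD: one 32-bit addition -- load, add, store. */
void add_scalar(uint32_t *r, const uint32_t *a, const uint32_t *b) {
    r[0] = a[0] + b[0];
}

/* SIMD: four 32-bit additions with one instruction.
   a, b, r each point at four contiguous, 16-byte-aligned words. */
void add_vec4(uint32_t *r, const uint32_t *a, const uint32_t *b) {
    __m128i va = _mm_load_si128((const __m128i *)a);
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)r, _mm_add_epi32(va, vb));
}
```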

Why would you care about that? I mean, that looks like, if you want to rewrite your algorithms in a way to support that, then that sounds like a not-very-convenient way of programming. So just consider the Intel Nehalem processor. If you look at the throughput of instructions, then for this 32-bit addition we can do one load per cycle, we can do three additions per cycle, and we can do one store per cycle. And if we look at the 128-bit vectorized version, then we can do one load per cycle into a 128-bit register, we can do two of these vectorized additions per cycle, and one store per cycle. In other words, the vector instructions are just about as expensive as the scalar versions of the instructions, but they do four times the work. And well, actually it's very, very similar for other instructions. It's very similar for other architectures, very similar for other microarchitectures. If you replace the addition just with an xor, you get exactly the same numbers, so it's one, three, one for both scalar and vectorized.

There is a less obvious reason why you would care, and this is that data-dependent branches in SIMD generally -- so also in other implementations of SIMD -- are typically expensive. So if you want to carry out the same instructions on independent data streams, and then at some point you look at the data and say, well, if this data has a certain property, then do this and otherwise do that -- then typically you have to do both parts of the branch and then mask out the individual results. That's very expensive. What's also very expensive, as I said before, if you use vector instructions, are variably indexed loads, so exactly this: pick up some data from here, pick up some data from there, where the position is only known at runtime, maybe because you looked up the offsets from another vector register. So you need to rewrite all your algorithms in a way that eliminates these data-dependent branches, and you need to eliminate lookups. This doesn't sound like a good motivation to use vector instructions. I mean, this sounds like an extremely annoying thing to do, and this is not how we want to write programs. But this talk is about crypto, and for cryptography, these two things -- data-dependent branches and lookups -- are the two major sources of vulnerability against timing attacks. So when you protect your cryptographic algorithm against timing attacks and at the same time optimize by using vector instructions, there are very, very strong synergies. It's not like it's automatically protected against timing attacks when you use vector instructions, but when you do both at the same time, that's typically a very good thing that works very well together. What I want to do in this talk is to look at five examples of using vector instructions. So it's based on five papers, but I'm also only going to zoom in on the spots that are really interesting for how to use vector instructions in an efficient way.
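As a small illustration of the branch-elimination idea: a data-dependent choice can be computed by evaluating both sides and masking, with no branch at all. A minimal C sketch (names are mine):

```c
#include <stdint.h>

/* Branch-free selection: returns (x < y) ? a : b.
   Both "sides" are available; a mask picks the right one. */
uint32_t select_lt(uint32_t x, uint32_t y, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)(-(int32_t)(x < y));  /* all ones if x < y, else all zeros */
    return (a & mask) | (b & ~mask);
}
```

The same pattern carries over to vector registers, where a compare instruction produces the mask lane by lane.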

>>: Usually when you do [indiscernible], all kinds of flags are being set, so like overflow, so you get an array of flags which are referring to the individual?

>> Peter Schwabe: Thank you very much for this question. I am going to come back to this in a few slides [laughter]. So actually, if we want to do this, if we want to use vectorized instructions, we need multiple data streams. That's exactly what Flynn said in his 1966 paper, and so we need to think about where these individual independent data streams come from, and there's a very easy case. We just say, well, instead of doing one encryption, we just do n encryptions at the same time. Or instead of computing one signature, we compute n signatures at the same time. Or we just do cryptanalysis, where typically you have so much parallelism and so many independent computations that it's just very, very nice to parallelize and to vectorize.

This, however, requires a little bit of rewriting on the low level, because you need to interleave your data structures. So I gave an example here of large integers, 256-bit integers represented as four 64-bit chunks. And if we have four of those, so a, b, c, d, then we cannot just use the original data structure and then magically the vector instructions work on those; we have to interleave the chunks. So we have to put in memory a0, then b0, c0, d0 and so on. So if you want to use that, it's worth thinking about it right from the beginning, because otherwise you end up rewriting the whole code.
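A minimal C sketch of that interleaving for four 256-bit integers of four 64-bit chunks each (the layout follows the a0, b0, c0, d0 description above; the function name is mine):

```c
#include <stdint.h>

/* Interleave four 4-limb integers so that chunk i of a, b, c, d
   sits in four adjacent 64-bit words -- the layout vectorized code wants. */
void interleave4(uint64_t out[16],
                 const uint64_t a[4], const uint64_t b[4],
                 const uint64_t c[4], const uint64_t d[4]) {
    for (int i = 0; i < 4; i++) {
        out[4 * i + 0] = a[i];
        out[4 * i + 1] = b[i];
        out[4 * i + 2] = c[i];
        out[4 * i + 3] = d[i];
    }
}
```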

But what I want to do in this talk is actually something else. I want to compute just one signature. I want to encrypt just one message. I want to take a look at how we can still use vector instructions -- where do we find enough data-level parallelism to make this work? And this I want to show in five examples. The first example is the Salsa20 stream cipher in NEON. This is joint work with Dan Bernstein, which we published at CHES last year. And first I should say something about NEON. So NEON you will find, for example, on this phone; it's something like a vector instruction extension to the ARMv7 instruction set. It's not on all ARMv7 processors, but on most. It has 16 128-bit vector registers, and on the Cortex-A8, which we considered in this paper, you can do one arithmetic instruction per cycle, and at the same time you can do one load, store, or shuffle instruction per cycle.

Now, Salsa20, the target of this optimization, is a stream cipher designed by Bernstein in 2005, and it's in the eSTREAM software portfolio. It generates the stream in 64-byte blocks, and it works on 32-bit integers internally. And what it does per block is 20 rounds -- that's where the 20 in Salsa20 comes from -- and each of these rounds does 16 add-rotate-xor sequences.

Well, I just wrote down such a sequence. So we add two values, then we take the result and rotate it by a constant -- these constants are different throughout these 16 add-rotate-xor sequences -- and then we xor it into another value. We have 16 of those, and it's actually four blocks of four where, within one block, these sequences are completely independent, so we have four-way data-level parallelism at a very low level. Okay, let's do that in vector instructions. That's roughly what it looks like. So we do an addition here, and that now does four additions. Then we shift to the left by 7 and shift to the right by 25, because we don't have rotate instructions, and then we do these two xors here. And then, okay, we need some shuffles, but shuffles can just be interleaved with arithmetic, so they are essentially free.

Now let's think about how fast that could be. Well, the intuitive cycle lower bound is: okay, we can do one arithmetic instruction per cycle, and that's five instructions per sequence. We need to do four blocks of this, we need to do that for 20 rounds, and then it generates 64 bytes of output stream. So that results in 6.25 cycles per byte. But actually what happens is that this sequence up here has a latency of nine cycles, because, for example, the second instruction uses the result of the first instruction, so it needs to wait for a while. And then this second-but-last instruction uses the result of the second instruction and so on, so there's a lot, lot of dependency in there. That's why it results in a nine-cycle latency, and we end up with something like a lower bound of 11.25 cycles per byte. Yeah.
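A sketch of one such four-way add-rotate-xor step with NEON intrinsics, following the five-instruction sequence just described (add, shift left, shift right, two xors); the paper's code is hand-written assembly, and the rotation constant 7 is just one of the constants Salsa20 uses:

```c
#include <arm_neon.h>

/* One vectorized Salsa20 add-rotate-xor step:
   target ^= (a + b) <<< 7, done as two shifts and two xors
   because NEON has no vector rotate. */
static inline uint32x4_t arx7(uint32x4_t target, uint32x4_t a, uint32x4_t b) {
    uint32x4_t s = vaddq_u32(a, b);       /* a + b                */
    uint32x4_t l = vshlq_n_u32(s, 7);     /* (a + b) << 7         */
    uint32x4_t r = vshrq_n_u32(s, 25);    /* (a + b) >> 25        */
    target = veorq_u32(target, l);        /* xor in the low part  */
    target = veorq_u32(target, r);        /* xor in the high part */
    return target;
}
```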

>>: You earlier said that you can do one arithmetic instruction per cycle?

>> Peter Schwabe: That's throughput.

>>: Did it depend on the previous?

>> Peter Schwabe: Yeah. So that's purely throughput, ignoring latencies, yes. This is a general problem that we have with SIMD. So let's look at what just happened. We have this four-way data-level parallelism. If we write that in scalar instructions, it turns into four-way instruction-level parallelism: we basically always have four instructions doing the same thing on different data items, and these four instructions are completely independent. That is extremely good for pipelined execution, and that's also very, very good for superscalar execution, so all modern processors are very, very happy with that. Now, we took this data-level parallelism and turned it into vector instructions, so we removed all this instruction-level parallelism, and well, that is very, very bad for pipelined and superscalar execution. So how do we fix this? Well, there's an idea; namely, that if we have two subsequent blocks of stream output, then the computations are completely independent. So we can reintroduce instruction-level parallelism by just interleaving two blocks. If we do that with two blocks, we cannot hide all latencies, but almost all of them, so we get a lower bound of 6.875 cycles per byte. If we do it for three blocks, then we end up with 6.25 cycles per byte. Now, what you would expect at this point is that I just say that's what we did, we got extremely close to these 6.25 cycles per byte, and then we're happy. But actually, we can go further than that.

So NEON is a coprocessor to the ARM, so you can think of it as follows: you have this ARM processor, and it looks at all the instructions and decodes them, and whenever it finds a NEON instruction, it just forwards it to the NEON unit and says, hey, this is your work. And so the ARM unit is basically idle all the time, just looking at the NEON instructions and forwarding them, so the whole arithmetic units of the ARM don't do anything so far. Clearly, we also want to keep the ARM core busy with Salsa20 computations, but that introduces a new bottleneck, because this ARM needs to decode all instructions, the NEON and the ARM instructions, and it can decode at most two instructions per cycle. But then what is very nice is that this add-rotate-xor sequence, which turned into five instructions in NEON, turns into only two instructions on ARM, because it has a rotate -- and not only can it rotate, it can rotate for free if it's rotating one of the inputs of an arithmetic instruction, so it just rotates and xors immediately in one instruction. So this leads to a tradeoff where it's best to do one block on ARM and two blocks on NEON, and that's roughly what the code looks like. You can see up here that we do the first part in NEON, the second part in NEON -- that's the second block in NEON -- then there are two ARM instructions following, then we continue with the first block in NEON, and so on. This was the most readable part of the code that I found to show on this slide. You do that throughout the whole inner loop and just interleave everything in a way that hides all latencies. And that's the final result: we get down to 5.47 cycles per byte for Salsa20 encryption. And this is not a lower bound; this is measured.

If you're interested in the code, it's online, and all the code that I'm mentioning here is in the public domain, so just do with it whatever you like. You'll find it at cryptojedi.org/crypto/#neoncrypto.

The second example -- and in the second example I am going to come back to your question -- is elliptic-curve Diffie–Hellman on the Cell Broadband Engine. This is a project that I did together with Neil Costigan, and we published it at Africacrypt 2009. This Cell Broadband Engine is a very interesting machine, and I still very much like it. You'll find it in the PlayStation 3 -- that's not why I like it -- and you also find it in IBM Cell blades, but I just like the architecture. So it has one central core; it's a Power G5. And then around that it has eight so-called synergistic processor units. The idea is that this Power core is running, say, an operating system, and the synergistic processor units do the real work. So whenever you really have computations to do, you just outsource them to these surrounding cores, and they are really optimized for doing high-throughput SIMD computations. They have an instruction set which is completely vectorized, so there are no nonvector instructions. And similar to NEON, each does one arithmetic instruction per cycle and one load, store, or shuffle per cycle. What is interesting for us for doing elliptic curve cryptography is the size of the multiplier: the largest multiplier does 16 by 16 bits and returns a 32-bit result, but that's four-way vectorized in these vector registers.

The target of this optimization was curve25519, which is an elliptic-curve Diffie–Hellman key exchange proposed by Bernstein in 2006, and it uses a Montgomery curve over this field. So what we will have to do is optimize arithmetic in this field. And the main computation is, well, 255 of these ladder steps of the Montgomery ladder, and each needs five multiplications, four squarings, eight additions, and one multiplication by a constant. Okay. Let's think about how to represent elements of this finite field.

The intuitive approach is -- well, the largest thing that we can multiply is 16 by 16 bits, so we just use 16 chunks of 16-bit integers, and we use radix 2 to the 16 to write down these numbers. And then we just do schoolbook multiplication -- well, maybe not, but if we do schoolbook multiplication, then we get 256 16-by-16-bit multiplications and 224 32-bit additions. And that was exactly what [indiscernible] mentioned before: what happens to carries if we do that? Well, the answer for the Cell is somewhat interesting. The Cell can generate these carries in a separate register in one extra instruction, which is maybe not too expensive -- I mean, you do one extra instruction for the carries -- but it gets worse, because it can not only multiply, it can multiply-add, which is extremely nice to use, and the multiply-add cannot handle the carries. So basically, if you could ignore carries, you could just multiply-add, multiply-add, multiply-add all the time. But in this way, we have to do multiply, generate carries -- or, well, add, generate carries -- and it turns into at least three instructions there.

>>: Just for the multiply gives you a double precision.

>> Peter Schwabe: It gives me 32 bit, yes.

>>: And you're adding single precision [cross-talk] sorry?

>> Peter Schwabe: I can do 32-bit additions, so I'm adding double precision, yeah.

>>: If you were adding single precision --

>> Peter Schwabe: No. That would be way worse. No, no. It can do 16 by 16 gives 32 bit and then it can, even in one instruction, add up these 32-bit results, these double-precision results. Yeah thanks.

Actually, the answer for other vector instruction sets is even worse. When you add these things, the carries are gone. There is no flag register for SIMD instructions, and if you want to recompute these carries, that's just a whole lot of work, and you just don't want to do that. So you have to think about some way to avoid carries, and that's what in hardware is known as a carry-save adder or carry-save representation. The idea is very simple: you just don't use all 32 bits of a result. You use a little bit less, and then you can just accumulate carries in there for free. So what we do here is we use a representation with radix 2 to the 12.75, which looks a bit odd, but it's very, very nice later for the reduction, for the modular reduction modulo this prime with a special shape. Because here we have 20 chunks, and if you have 20 times 12.75, that's 255, and then the lowest limb that you get additionally immediately corresponds to the low limb for reduction, so you just multiply by 19 and add. We'll see that --

>>: There will be no carry, right? You could also play a probabilistic trick, hoping that there is no carry and checking at the end or something like this, so you could still --

>> Peter Schwabe: So if you use like radix 2 to the 16 and add all these up, I mean, you're accumulating then, in the middle part, something like 16 values of 32 bits.

>>: It would probably be too small, but if you had a larger size, you could.

>> Peter Schwabe: But doing that in constant time, doing that in SIMD -- probably not. Probably it's not going to be efficient. So when we use this representation, we start multiplying reduced elements. So we assume that these elements really only use 13 bits in each of the chunks, and then when we multiply two, we get 26 bits, and we can add up quite a few of those before we reach 32. So when we do that, then we can just use 100 multiply-add instructions. The hundred just comes from: that's 20 chunks times 20 chunks, schoolbook multiplication gives 400 multiplications, and we can do it four-way parallel, so that's 100 instructions. That produces a result r0 to r32 -- 38, sorry. But these chunks in r are larger; they are maybe not exactly 32 bits, maybe somewhat smaller, but they're certainly much, much larger than 13. Okay. We also need a lot of shuffles to do this within one multiplication, to combine all the things properly, but shuffles are free. We get a little bit of overhead from this noninteger radix, from this exponent 12.75, so we need to multiply some of the inputs by two. And then in the end, after we do all the shuffling and multiply-adding up of everything, we need to recombine some intermediate results to get the final result. So in total, we actually don't get 100 instructions, but 145 instructions.
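A plain-C sketch of the carry-free schoolbook multiplication this describes, with 20 limbs of at most 13 bits, so every partial product fits in 26 bits and all products of one column can be accumulated in a 32-bit word without carry handling (the radix-2^12.75 scaling of some inputs by two and the four-way vectorization are left out; names are mine):

```c
#include <stdint.h>

/* r gets the 39-limb unreduced product of two 20-limb operands whose
   limbs each hold at most 13 bits; no carries are needed during the loop. */
void mul_schoolbook20(uint32_t r[39], const uint32_t a[20], const uint32_t b[20]) {
    for (int k = 0; k < 39; k++) r[k] = 0;
    for (int i = 0; i < 20; i++)
        for (int j = 0; j < 20; j++)
            r[i + j] += a[i] * b[j];   /* 26-bit product accumulated carry-free */
}
```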

>>: [indiscernible]

>> Peter Schwabe: It's just that you can interleave shuffles with arithmetic: each cycle, the Cell can do one arithmetic instruction and one load, store, or shuffle. And then if you carefully interleave that -- the shuffle has to be the second instruction, properly aligned in memory, and it has to be, of course, independent of everything -- you can interleave it in a way that you get the shuffles for free. It just goes to a different unit, and then, well, with a lot of care, you can get them for free.

>>: If you round down from 12.75 to 12, you'll lose a little bit, but you gain by avoiding the overhead of the noninteger radix. Did you check whether you eventually lose --

>> Peter Schwabe: So the thing is, the 12.75 we need for the convenient reduction. If we go down to 12, we would need a few more limbs. I thought about it -- I thought about using 24 limbs instead of 20 -- and I must admit it's still unclear to me whether that's better. Currently I'm working together with a student from [indiscernible] who is doing 24 limbs, but on a different architecture with the same size multiplier. We're playing with it. It's an interesting tradeoff, and it's unclear to me. Okay. Now we get this result from r0 to r38, and now we need to put that back into a reduced result. That's pretty tricky, because the standard way of doing this is that we carry from r20 to r21, which means we just look at the upper bits of r20, remove them from r20, and add them into r21. Then we do that from r21 to r22 and so on, until we carry from r38 into r39; r39 was, well, zero before, so we didn't have that one. And then we do this reduction modulo the prime, which just means we add 19 times r20 to r0, 19 times r21 to r1, and so on. This is exactly where this noninteger radix comes in, which makes this part extremely efficient. In the end, we just carry from r0 to r1 and so on until we're through. This is pretty, pretty bad.

This has two problems. One is that it has no data-level parallelism at all. For vector instructions, this is really bad: we're only working with one of the limbs within the 128-bit register, so we cannot actually make any use of vector instructions. We are losing a factor of four there. And the second thing is that it has almost no instruction-level parallelism, because every step depends on the previous one.

We get a huge chain of dependent instructions, and the Cell has pretty long latencies, so we're doing arithmetic only about every fourth cycle, which means that we're losing a factor of 16 there compared to what the architecture can really do. So let's think about how we make that better. Let's fix the first problem first. To do this, we just do four independent carry chains. So we carry from r20 to r21, and at the same time we carry from r24 to r25, from r28 to r29, and from r32 to r33. It's not at the same time in a SIMD way, but interleaved so that the latencies are hidden. We continue like that from r21 to r22, r25 to r26, and so on. And at the end we need to carry more, because in the last step of each chain -- this is here with the three dots -- we carried into r24, so that one becomes too large again, and we need to do more rounds of carrying here. It looks pretty stupid -- we increase the number of these carry steps from 20 to 32 -- but on the other hand, we can now really do arithmetic every cycle, so we increase the speed by a factor of 2.5, which is not so bad.

Now let's fix the second problem. The second problem was that we didn't have any data-level parallelism to use SIMD instructions efficiently. And for that, we have to look at a higher level. When we look at the Montgomery ladder, there were these five multiplications, four squarings, some additions, and one multiplication by a constant. Now let's just treat all the squarings as multiplications. Then we have nine multiplications, and the nice thing is that you can group them into two groups of four where, within these groups, everything is independent. So what we do is we just group four multiplications together -- with squarings handled as multiplications -- and then we just do them in a completely streamlined way throughout the whole multiplication, including the reduction at the end. We always process four operations at a time, and that just leaves one single multiplication at the end which we need to handle separately, which is a bit annoying because of this reduction chain. This clearly has a huge advantage in the reduction chain, but it also reduces the number of arithmetic instructions for the pure multiplication, because we don't have to do this recombination of intermediate results any more, so we get down from 580 to 420 instructions, and well, for the reduction, we immediately get a speedup by a factor of four.
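To make the carry handling concrete, here is a plain-C sketch of one round of the four interleaved carry chains just described (a uniform 13-bit limb size is assumed here for simplicity; the real code uses the noninteger radix and hand-scheduled assembly):

```c
#include <stdint.h>

/* Move the bits above the limb size from r[from] into r[to]. */
static inline void carry_step(uint32_t *r, int from, int to) {
    uint32_t c = r[from] >> 13;       /* overflow above 13 bits     */
    r[from]   &= (1u << 13) - 1;      /* keep the low 13 bits       */
    r[to]     += c;                   /* propagate to the next limb */
}

/* One round of four independent carry chains; because the four steps
   touch disjoint limbs, their latencies can be hidden by interleaving. */
void carry_round(uint32_t r[40]) {
    carry_step(r, 20, 21);
    carry_step(r, 24, 25);
    carry_step(r, 28, 29);
    carry_step(r, 32, 33);
}
```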

Now, when we put all that together -- and well, on top of that, everything else that's required for curve25519 -- we in the end got something below 700,000 cycles for curve25519 on the Cell Broadband Engine. I think the speed record still stands, maybe partially because it's not the most common architecture and not so many people are looking into it. I could imagine that if you spend a lot of time you can beat this record, but still, I think it's pretty, pretty good. At that time I think we even had a price-performance record, saying that we can do the most curve25519 operations for a given amount of dollars on any architecture. Yeah, so if you're interested in that, then here's the code again. It's completely public domain.

Let me come to a third example. The third example is joint work with Güneysu, Oder, and Pöppelmann, which we published this year at PQCrypto, and that's lattice-based signatures using AVX. AVX is a pretty recent instruction set introduced with, well, the Intel Core i7 Sandy Bridge and Ivy Bridge processors, and AMD also has it. The nice thing is that they have really huge vector registers -- they went up from 128 to 256 bits -- and the somewhat of a downside is that we can only do floating-point arithmetic in those. That changed with AVX2, but on the processors we had when writing this paper, we didn't have AVX2, because it just came out a few weeks ago. So we somehow had to deal with it by using floating point, and it's pretty powerful for floating point. For example, it can do, in SIMD, four double-precision floating-point multiplications and four additions in one cycle. That's pretty strong on Ivy Bridge -- also on Sandy Bridge, actually. And yeah, so we thought about --

>>: AVX2 will do exactly the same but with integers instead of floating point or?

>> Peter Schwabe: It can do both. Basically, it extends AVX also with integer support. I haven't looked at the throughputs yet because I don't have a Haswell processor yet. I'm hoping to acquire one when I'm back from this trip and when the administration in the Netherlands is also back from holidays.

I think it's basically the same throughput for integers, yeah. So the lattice-based signatures that we want to look at were introduced at Eurocrypt 2012 by Lyubashevsky, and we look at a parameter set that aims at 100 bits of security. Well, actually maybe more like 80 bits of security: a more recent paper at Crypto 2013 by Lyubashevsky said, well, you know, I've looked at this scheme again, and maybe actually it's not really like 100 bits; it's more like 80 bits. And then I don't know who's looking at it next, so let me just put a question mark behind the 80 here. So it's sort of a cutting-edge thing. I could imagine that if some number theorists look at it, or some people really into the math of lattices, maybe it has no security at all. We don't know, but for the moment it has 80 bits of security.

>>: It'll just go down to 60 at Eurocrypt 2014 and 40 the year after that, so you've got a couple --

>> Peter Schwabe: Oh, okay. That's very nice of you, thanks [laughter]. I'll have to generate some more papers. What the scheme does is arithmetic in a ring which is, well, F_p[x] modulo this polynomial of degree 512, and p is a 23-bit prime which is congruent to one mod 1024. As you can imagine, everything that we do will be arithmetic in this ring. So how do we represent elements of this ring? Well, we just declare them as an array of double-precision floating-point values, because we want to use the AVX instructions, and we align that on a 32-byte boundary -- that makes loads into these registers more efficient later. And we have 512 of those, because we have 512 coefficients. Then, as I said, we want to use AVX double precision, and that works very well for addition and multiplication of coefficients. Now, occasionally we need to do modular reduction, which we do by precomputing the double-precision approximation of p to the minus one. Then, whenever we want to reduce a coefficient a, we just multiply by this precomputed value, then we round c, this carry -- and this is very nice; there's a very high-throughput rounding instruction, and again you can do four per cycle in a SIMD way -- and then we just multiply again by p and subtract. So that performs a modular reduction, and the very nice thing is that you can specify the rounding mode here. So depending on the rounding mode -- whether you round to nearest or truncate -- you get a result either between -(p-1)/2 and (p-1)/2, or between zero and p-1. For this scheme we need this one here, but it would be very easy to also use the other, depending on what you prefer. Then, of course, we use lazy reduction a lot. We have these 22-bit numbers, and when we multiply two of these, we get 44 bits. And we have a 53-bit mantissa in double-precision floats, so we can add up quite a few of the results of a multiplication before we need to do modular reduction.

Okay. How do we do multiplication in this ring? We use the number theoretic transform, NTT, which uses a 512th root of unity omega and [indiscernible] psi, where psi squared is omega.

And that exists because p is congruent to one mod 1024. Yeah, then we use the number theoretic transform defined like that, and then we consider a multiplication a times b giving the result d. So we first precompute a-bar and b-bar like that, and then we obtain d-bar by, well, computing the NTT of a and b, doing a componentwise multiplication here, and doing the inverse NTT, which is just the same as the NTT with omega to the minus one instead of omega. So very clearly, this componentwise multiplication is just perfect for vectorization; it's very, very efficient, and we're very happy with that. The remaining part to make this multiplication run fast is to make this NTT run fast.

Okay, so what we do in this NTT is basically nine levels of butterfly transformations, where each butterfly just picks up two values, spaced out by a certain constant depending on the level, multiplies one of the two values that were picked up by a power of omega -- a fixed power of omega, depending on the level and position -- to obtain some intermediate value t, and then subtracts t here and adds t there. That's one butterfly, and each level does 256 of those with different values. Now, when you look at that -- starting counting levels at zero, as a computer scientist would do -- then on levels two to eight that's great. That's very easy, because the values that interact there sit naturally in vectors: you just pick up four, and you pick up another four, and you do the whole transformation on vectors, and that's very, very convenient, very nice. On levels zero and one, the problem is that when you pick up a value, the values within a register, within a vector register, interact, so that requires a little bit of shuffling of these values and something called horizontal addition, which luckily AVX supports, so you can just add values within a register, and even within half a register. So it's a pretty, pretty powerful instruction set there.
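A scalar C sketch of one level of such butterflies (the AVX code processes four coefficients per instruction on levels two to eight and handles the shuffling on levels zero and one separately; array and variable names are mine, and the lazy reductions are omitted):

```c
/* One forward-NTT level on 512 coefficients: values a[j] and a[j+m]
   interact, with a[j+m] first multiplied by a precomputed power of omega. */
static void ntt_level(double a[512], int m, const double *omega_pow) {
    for (int start = 0; start < 512; start += 2 * m) {
        for (int j = start; j < start + m; j++) {
            double t = omega_pow[j - start] * a[j + m]; /* multiply by power of omega */
            a[j + m] = a[j] - t;                        /* subtract t here            */
            a[j]     = a[j] + t;                        /* add t there                */
        }
    }
}
```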

Details of how to do it are in the paper. Yeah, that leads to a pretty fast transform within the vector registers. But there's one remaining problem: memory access. Basically, one of these butterfly things is extremely fast, right? I mean, it's just multiply, add, add, done. But at the same time you need to pick up the two values, you need to pick up the constant, and then you need to store, so in the end it's a lot of memory access. Now there's a very standard FFT optimization technique, which is that you just notice that, okay, if you go through one level it's always two values interacting, through two levels it's four values interacting, through three levels, eight values, and so on. So we merge three levels, which means that we just pick up four times eight values, do all the computations on those, and afterwards store again. The eight just comes from the number of registers we have -- we also need a few for the constants -- and you can merge more levels if you have more registers. And then in the end, we get a final performance of a little bit less than 4,500 cycles for one NTT transformation, which is pretty fast. I mean, we beat all previous speed records. Unfortunately, I cannot explain all of those cycles. If you look at the lower bounds you get from arithmetic instructions, I can explain something like 2,600, 2,700 cycles, and I'm still investigating where the rest is going. So yeah, I don't know. There are some bottlenecks on Ivy Bridge that are, as far as I know, not documented, and well, we need to work on that. And the overall --
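Going back to the coefficient representation for a moment: a rough sketch of the floating-point modular reduction described earlier (multiply by a precomputed 1/p, round, multiply back by p, subtract), written with AVX intrinsics; the names and the choice of rounding flag here are illustrative, not the paper's exact code:

```c
#include <immintrin.h>

/* Reduce four coefficients at once modulo p using double-precision AVX.
   With round-to-nearest the result lands in [-(p-1)/2, (p-1)/2];
   rounding toward minus infinity gives a result in [0, p-1] instead. */
static inline __m256d reduce4(__m256d a, __m256d p, __m256d pinv) {
    __m256d c = _mm256_mul_pd(a, pinv);                  /* a * (1/p), approximately */
    c = _mm256_round_pd(c, _MM_FROUND_TO_NEAREST_INT |
                           _MM_FROUND_NO_EXC);           /* round the quotient       */
    return _mm256_sub_pd(a, _mm256_mul_pd(c, p));        /* a - round(a/p) * p       */
}
```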

>>: Is that the time to verify a signature? Or just the NTT cost?

>> Peter Schwabe: No, that comes here. So basically, within signature generation and verification there are various computations of the NTT -- various multiplications, and each multiplication needs, well, three NTTs basically, except multiplications by a constant; those only need two. So in the end we get something like 635,000 cycles to sign a message, and that's on average, so you have something like: you try to sign a few times until it works. The expected number is seven tries or so.

>>: [indiscernible]

>> Peter Schwabe: Say again?

>>: Do you know [indiscernible]?

>> Peter Schwabe: This is just what SUPERCOP uses, so if you use very, very long messages, you hash them first, and then at some point the speed is determined by the speed of hashing. Well, it's just a convenient thing to test and work with. Yeah. Then when you verify a signature, again for a 59-byte message, that's 45,000 cycles. Now, I wouldn't actually recommend anyone to use this scheme because, as you saw, the security drops dramatically over the years. But I would very much encourage people to look into its security, because these signatures are claimed to be secure against attacks by quantum computers. They scale very, very nicely, so if you scale up the security, it will become slower, but not dramatically slower. And if they actually turn out to be secure -- like maybe this one is 80-bit secure in the end, or maybe just 64-bit secure -- then, well, we know how to scale up, and this would be a pretty promising candidate for postquantum secure signatures. Again, the code is online, public domain.

>>: [indiscernible]

>> Peter Schwabe: Well, I'd have to look that up. So we had implementations of that; the first implementation was using integers, so 32-bit integers, multiplying to get a 64-bit integer. I'd have to benchmark that. Admittedly, at some point we stopped optimizing that part and just went for the more promising AVX part. So even if I give you the number for how long the implementation we have takes, I can't tell you how much faster the AVX one is. My guess is that if you do it with integer arithmetic, it's more than four times slower.

>>: When you load these from memory, what's in memory? Are the integers in memory? Are you loading integers from memory into your floating-point registers?

>> Peter Schwabe: No, they're just -- well, they're floating points. So if I go back -- let me see -- a few more. That's the data type that we have here, which is just a double. Yeah, they're all doubles in memory. Otherwise things would go pretty badly, and then we'd have to do conversions. So we're working on doubles throughout the whole computation.

>>: Are you planning to do this in AVX2 for comparison?

>> Peter Schwabe: Could be actually. I'm currently more -- I don't even think that AVX2 would give a big improvement. I mean, the main improvement that you would get is that you have a 64-bit mantissa instead of 53, so you can delay the reductions further.

>>: [indiscernible]

>> Peter Schwabe: You mean just use AVX2 with the same code?

>>: No. With, you know, making full use of AVX2, the new instructions, the integer instructions in AVX2.

>> Peter Schwabe: I don't -- so we could do that, but I don't think it gives a lot of improvement. I think it starts giving a huge improvement if your prime goes beyond something like 25 bits, because then when you multiply two -- well, or maybe 26 bits -- then you get something like a 50-bit result, and then you need to reduce quite often. Then having a larger radix, so having the space up to 64 bits, really gains you something, but as long as the prime is 23 bits, it's maybe not that good. What I'm more interested in is that the paper that reduced the security did not only reduce the security; it also proposed a new signature scheme. I am more interested in looking into this one, which uses a smaller field, and it may be possible to use single-precision floating points to represent that, and then you can do eight operations per cycle instead of four. But it's still unclear to me what the tradeoffs are there, because then you have to do the modular reduction every time you do a multiplication. Well, there are still quite a few things to do there.

Now, I'm not continuing with the next example quite yet; this is sort of in between. I would like to introduce a technique which is known as bitslicing. So far when we did computations, we always did computations on basically integers -- I mean, the fact that we had to use double-precision floating points was just that; we were using the integer part of them. How about arithmetic in, say, binary fields, where we don't have any hardware support, at least on most processors, and in particular not in vector registers, again on most processors? And --

>>: -- some support of binary.

>> Peter Schwabe: True. It has PCLMULQDQ, which is, well, pretty powerful. It doesn't have it, I think, vectorized with small chunks, so you can only do 64 by 64 gives 128. There is also NEON, which does 8 by 8 gives 16 as a binary multiplier, but still there are too many architectures not supporting this, and also I will have an example in GF(2^20), which is just, well, much smaller than 64 by 64. And then, well, as I will show, it turns out that there are better ways to do it. I think even with PCLMULQDQ we can beat that. Hopefully. So you've seen a lot of vector registers now, for example with four double-precision floats or with eight 16-bit integers or something like that in them.

So now think of an n-bit register just as a vector containing n entries, and they all have just one bit. That means that you do arithmetic on these one-bit values, which means you do bit-logical operations, so xor and and and or and these kinds of things. This is a technique known as bitslicing, and it was introduced by Biham in 1997 for DES. This is the way that I like to think about bitslicing: it's just vector instructions, nothing else. But there are alternative ways to think about bitslicing, and one way that's sometimes very helpful when you look for literature is that it's just simulating a hardware implementation in software. So just look at the gates: every and gate becomes an and instruction, every xor gate becomes an xor instruction. And then you just execute the circuit that you're transforming on lots and lots of parallel independent computations. Another way to look at it, which is sometimes helpful, is that it's just a transposition of data. So imagine that you look at 32 values in the field GF(2^32). Then you would usually just put them in a 32-bit register, one value each. What you do now instead is that from this first value, you put the first bit in one register, the second bit into another register, the third bit into yet another register. And then with the second value you do the same, just that you use not the first bit in each register but the second bit. That's exactly a transposition of the data if you think of it as a binary matrix.

Okay. There are some issues if you do that. I mean, in principle, you just transform a hardware implementation into software, and you can express any circuit with and gates and xor gates, so you can transform any computation into a bitsliced one. Whether that's a good idea or not is a different question. Now, these xor, and, and or instructions are typically extremely fast because they're very easy to do in hardware. For example, you can do three 128-bit operations, xors or ands, per cycle on an Intel Core 2, so that's 384 bit operations per cycle. That's actually quite powerful, and it can be extremely fast if you do operations that are not natively supported -- for example, binary multiplication on most processors is not natively supported. But there is a problem with it. Namely, you increase your active data set massively -- by a factor of 128 if you use 128-bit registers. So you always have to work on 128 computations in parallel, so all the data that you work with is basically 128 times as much. Similar comments apply to other vector implementations where you, for example, multiply the active data set by a factor of four, but there this is compensated by larger vector registers. For example, with AVX, on these machines you have 16 256-bit vector registers; compare that to 16 64-bit integer registers -- you just have four times as much space. But with bitslicing you use the normal registers, so basically you increase your data set, but the register space doesn't increase. And the typical consequence is that you need many, many more loads and stores just because things don't fit in registers. That easily becomes the bottleneck, and you really want good register allocation and spilling strategies there. Let me look at an example of where bitslicing is actually pretty nice.
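Before that concrete example, a tiny illustration of the "n one-bit lanes" view: a bitsliced full adder, where each 64-bit word carries one bit from each of 64 independent additions, so five bit-logical operations add 64 pairs of bits at once (a sketch; names are mine):

```c
#include <stdint.h>

/* Bitsliced one-bit full adder across 64 independent computations. */
void fulladd64(uint64_t a, uint64_t b, uint64_t cin,
               uint64_t *sum, uint64_t *cout) {
    uint64_t t = a ^ b;
    *sum  = t ^ cin;                 /* sum bit of each lane   */
    *cout = (a & b) | (cin & t);     /* carry bit of each lane */
}
```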

This is joint work with Ben Bernstein or Tony Chou, he usually calls himself Tony, which we just presented last week at CHES 2013. And I should say that the main part of the paper is not about CFS signatures. The main part is about [indiscernible] encryption, but the [indiscernible] part is basically doing always 256 messages encrypted in parallel, whereas the CFS part doesn't need that, so I'm

focusing on CFS here. And we're using AVX, and it does one bit-logical operation on a 256-bit vector per cycle. I said before you can also do three bit-logical operations per cycle on 128-bit registers. That sounds much better. Currently this one is slightly faster because the bottlenecks are loads and stores and we get a better tradeoff there. CFS is a code-based signature system introduced by Courtois,

Finiasz, and Sendrier 2001. Last year, Landais and Sendrier had a paper at Indocrypt, and they implemented it and they used some parameters that are aiming at 80-bit security. I'm not going to change this number again. I mean, if you can break it, it's great, but I think it still stands with 80-bit security. So the basic idea of -- and if you're not familiar with code-based script, it doesn't matter on the next slide. We don't really use that any more. It uses a hidden binary Goppa code over F two to the 20 that can correct eight errors. Now, what you do is to send a message, you hash that message, to a syndrome, and then you compute the word corresponding to this syndrome, and you hope that it has distance at most eight from a codeword. And if it I does, then you just compute the error positions with your secret decoding algorithm and send your oppositions to the receiver of the message. And then verification is very easy. You just take these eight columns of the public matrix, you xor them, and you're done. You compare whether they're the same as hash and you're done.

>>: What's the density of the code because you're hoping that it is within the distance?

>> Peter Schwabe: So the problem is that this usually doesn't work. Basically this fails almost all the time, as you're pointing out. So what we do is we just guess a few error positions, and we do that until we find something that works. Now, the question is how often we have to guess. Well, we have to guess about 40,000 times, which -- from a practical perspective, I would not recommend anyone to use this scheme. It's still fun to optimize, and it's embarrassingly parallel, which is just great, fortunately, for bitslicing. I mean, you just parallelize all these guesses and you have a source of almost infinite parallelism. It's great. And if you scale up the security parameters -- it's getting worse and worse.

>>: I'm thinking about the timing side channel here. Assuming that you are looking at the guesses in a fixed sequential order, timing will tell you at which point it succeeded.

>> Peter Schwabe: Yes, that's why we don't do that. Yeah. The implementation is not constant time in the sense that it takes the same amount of time for all messages -- that wouldn't work -- but we don't leak any information about the previous guesses that failed. Which is, yeah, a very good point.

Okay, so how do we represent these elements of the finite field now? Basically everything breaks down in the end into highly, highly parallel operations in GF(2^20) -- well, polynomials over GF(2^20), but then even further down to operations in GF(2^20). So we just say we have a data type called bit, which is a 256-bit vector register. And of course, those are, in fact, 256 bits belonging to 256 independent elements. And then we just say, okay, we have a data type which is a batch GF(2^20) element, which is, well, 20 of those, so 20 coefficients. And again, we align that on a 32-byte boundary. If we do that, then an addition in GF(2^20) just means that we do 40 loads, 20 xors, and 20 stores, so that takes 56 cycles. Now, keep in mind that this is actually doing 256 parallel computations, so one addition in this field takes about a fifth of a cycle, which is amazingly fast. Same for squaring: squaring is just modular reduction, because we know that in a binary field we just insert zeros between the coefficients and then reduce, so it's only the reduction, and that takes 64 cycles. If you don't bitslice and you put a field element in one register, you will never get this kind of performance, like one-fourth of a cycle.
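A sketch of this data layout and the bitsliced addition in C with AVX intrinsics (the AVX bit-logical operations live in the floating-point domain, hence the __m256 type; the type and function names are mine, and the paper's assembly counts its loads and stores explicitly):

```c
#include <immintrin.h>

typedef __m256 bit256;                        /* 256 one-bit lanes                  */
typedef struct { bit256 c[20]; } gf2_20x256;  /* 20 coefficient bits x 256 elements */

/* Add (xor) 256 GF(2^20) elements in parallel: 20 vector xors. */
void gf2_20_add(gf2_20x256 *r, const gf2_20x256 *a, const gf2_20x256 *b) {
    for (int i = 0; i < 20; i++)
        r->c[i] = _mm256_xor_ps(a->c[i], b->c[i]);  /* addition in F_2 is xor */
}
```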

But how about multiplications? If we do a multiplication, then basically we look at hardware algorithms for multiplication, and what we're doing is we first do a binary polynomial multiplication and then the reduction. We don't claim that this is optimal -- it may be that tower fields are better, and we're currently looking into it; let's see what the outcome is -- but if we do that, then we could just use schoolbook multiplication, which for 20 by 20 means 400 multiplications and 361 additions, which in F_2 are just and operations and xor operations. And then at the end, we need to do the polynomial reduction.

What's much better is Karatsuba, of course. I think you've all seen this Karatsuba inequality -- Karatsuba equation, sorry -- which breaks a multiplication of 2n-bit polynomials or numbers into three half-size, so n-bit, multiplications and a few additions. What is even better than Karatsuba is refined Karatsuba. This has maybe been well known since 2009; I know it from a paper that Dan had at Crypto 2009, but he doesn't claim novelty, so he says, well, some people know it, but apparently not that many. Refined Karatsuba also gets these three half-size multiplications, but it gets fewer additions: more specifically, instead of 8n additions as in plain Karatsuba, you get 7n additions. So it's somewhat more efficient, and if you do that through two levels, then you get 225 ands and 303 xors, which is a much better count than the schoolbook one. Also, this refined Karatsuba, when you implement it in assembly, is pretty load-store friendly, so you can do it very, very nicely even through two levels. And then in the end, you get 744 cycles for 256 multiplications, so roughly three cycles for one multiplication, which again, even when you use lookup tables, like log tables, you won't get this kind of performance. So it's actually pretty, pretty fast.

Then you put all of this together into CFS signatures. And okay, I said verification was fast, but the signing is, well, not that fast: it's something like 425 million cycles for signing on Ivy Bridge. The code is not yet online. It will be online, and it will also be public domain. And I find it kind of hard to advertise this result. I mean, it's 80-bit security, it's really amazingly slow, and if you scale that up to higher security you will probably choose a slightly larger field, which makes your public keys much, much, much larger, and you will probably also increase t. And you saw that the number of guesses is t factorial, so if you just go, say, to t equals 10, it will already become, well, 90 times -- actually a bit more because of the larger field, maybe 100 times -- slower. And then, yeah, I wouldn't call this a really practical scheme. It's still fun to optimize and to advertise that it's ten times faster than previous results. Which, well, at least it's something.
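For reference, a plain one-level Karatsuba for the bitsliced 20-coefficient polynomials, written with AVX intrinsics (the paper uses the refined variant through two levels with a better addition count; this sketch only shows the standard three-half-size-multiplications structure, and all names are mine):

```c
#include <immintrin.h>

typedef __m256 bit256;   /* 256 one-bit lanes, as in the addition sketch */

/* Schoolbook product of two 10-coefficient bitsliced polynomials (19 outputs). */
static void mul10(bit256 r[19], const bit256 a[10], const bit256 b[10]) {
    for (int k = 0; k < 19; k++) r[k] = _mm256_setzero_ps();
    for (int i = 0; i < 10; i++)
        for (int j = 0; j < 10; j++)
            r[i + j] = _mm256_xor_ps(r[i + j], _mm256_and_ps(a[i], b[j]));
}

/* One Karatsuba level for 20-coefficient polynomials (39 output coefficients):
   A*B = A0*B0 + x^10 ((A0+A1)(B0+B1) + A0*B0 + A1*B1) + x^20 A1*B1 over F_2. */
void mul20_karatsuba(bit256 r[39], const bit256 a[20], const bit256 b[20]) {
    bit256 asum[10], bsum[10], lo[19], hi[19], mid[19];
    for (int i = 0; i < 10; i++) {
        asum[i] = _mm256_xor_ps(a[i], a[i + 10]);
        bsum[i] = _mm256_xor_ps(b[i], b[i + 10]);
    }
    mul10(lo, a, b);            /* A0*B0            */
    mul10(hi, a + 10, b + 10);  /* A1*B1            */
    mul10(mid, asum, bsum);     /* (A0+A1)*(B0+B1)  */
    for (int k = 0; k < 39; k++) r[k] = _mm256_setzero_ps();
    for (int k = 0; k < 19; k++) {
        bit256 m  = _mm256_xor_ps(_mm256_xor_ps(mid[k], lo[k]), hi[k]);
        r[k]      = _mm256_xor_ps(r[k], lo[k]);
        r[k + 10] = _mm256_xor_ps(r[k + 10], m);
        r[k + 20] = _mm256_xor_ps(r[k + 20], hi[k]);
    }
}
```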

>>: It's a million times faster than [indiscernible] encryption.

>> Peter Schwabe: Thanks for this great advertisement, yes [laughter]. Okay. Let me at the end of my talk go back to something slightly more practical. And let me also go back to, actually, the beginning of my talk, which was about symmetric cryptography. This is really practical; I mean, this is being used, for example, in OpenSSL, and it's also being used in NaCl. So this is joint work with Emilia Käsper, published at CHES 2009, and it implements AES using the SSE instruction set.

Now, probably everybody here is familiar with AES. I still put it on the slides just as a short recap. We have a block cipher which, for a 128-bit key, transforms a 16-byte state through 10 rounds. Each round consists of four operations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The last round doesn't have MixColumns. What do these operations look like?

With SubBytes, we just look at the state -- the state is always written as this 4-by-4 byte matrix -- and we just take one byte and substitute it by another byte; this is based on inversion in GF(2^8), with zero going to zero. ShiftRows, well, rotates the rows, but by different distances. MixColumns does a linear transformation on the columns. And for AddRoundKey, well, you take the 128-bit AES key at the beginning, you derive 11 round keys -- one is used at the beginning for whitening -- and afterwards you just add, just xor, the corresponding round key.

Now, I said that we want to use vector instructions, and I've introduced bitslicing before, so how do you bitslice this? Well, let's just remember what I said at the beginning of the talk, where for Salsa20 subsequent blocks are independent. So if we just use a mode of operation which is parallelizable, which computes on consecutive blocks independently, then there's a very straightforward way to bitslice it: let's just use counter mode, which, well, is parallelizable, then just take 128 blocks, put them into a bitsliced state, and then run a hardware implementation of AES, simulated in software, on that. This is an approach taken, for example, by Matsui and Nakajima at CHES 2007, and it's great. I mean, it gives really good performance of 9.2 cycles per byte, which, well, set a speed record at that time.

There are two problems with this. One problem is that it's for bitsliced input, so if you want to be compatible with nonbitsliced implementations, you have to transform first and transform back at the end, which adds a little bit of overhead, but not too much. The other is that it's only good for long messages, and this is a problem. If you do disk encryption with several gigabytes to encrypt, great, perfect. If you encrypt an Internet packet of, say, 500 bytes, then you have a four-times overhead, because you basically need to pad it with garbage and then remove the garbage at the end. So the question is: can we do bitsliced AES for small packets?

And now the idea that comes in is that you look at SubBytes, and SubBytes is sort of the heart of AES. It's the nonlinear part; it's the most expensive operation to do. And this already has 16-way parallelism: you're substituting all these 16 bytes of the state independently. Now, if we want to use 128-bit registers, then we need 128-way parallelism, so we consider eight consecutive blocks and use this internal parallelism in SubBytes. If we do that, then we have to be very careful how to pack this state into a register, and the important thing here is that at the bit level we have corresponding bits from independent blocks next to each other. So at that level everything is completely independent computations.

This means that the operations that do have interaction between the bytes, MixColumns and ShiftRows, can work on bytes; they don't have to twiddle with bits inside the bytes. Then, yeah, on a higher level you pack the columns, and on the last level the different rows. Whenever you have bytes interacting within one register, like in ShiftRows and MixColumns, you can use this pshufb byte-shuffle instruction, which was introduced in the SSSE3 instruction set, which is only implemented on Intel processors. So AMD users are kind of -- well, it's not supported by this implementation. But for Intel it's pretty good. So when you do that, ShiftRows and MixColumns are a little bit of effort, but you can do them. So the remaining part is to make the S-box fast.

Okay, so for that we start with a good hardware implementation -- that's exactly this idea of simulating hardware in software -- and there is a paper by Canright from 2005 which at that time had the most compact S-box in hardware, based on this inversion in GF(2^8). It was slightly improved by Boyar and Peralta in 2009 and then goes down to 117 gates. I think the original was 122 or something like that, so I think they removed five gates from it. Okay, so let's simulate that. We have 117 gates, so let's write down 117 instructions, right? That's how it should work.

Well, no. One problem is we have only 16 registers, so whenever hardware people want to store a value, they just store it; whenever we want to store a value -- well, if it fits in a register, good; otherwise, no. The second problem is, and that's more serious, that we have two-operand instructions, so we can do something like a becomes a xor b, but not c becomes a xor b. We always overwrite one of the inputs with the output. This is also not the case for typical gates in hardware, so we need more instructions, and this is the count that we need in the end: the 117 gates turn into 163 instructions, but out of those, 35 are move instructions, so register-to-register copies, because we otherwise would overwrite inputs or something like that. Okay. So if we do that and put everything together, then we get 9.32 cycles per byte for AES in counter mode, and this is for non-bitsliced input and output, so this already includes all the transformations at the beginning and at the end. And on a somewhat newer generation Core 2, we even get down to 7.58 cycles per byte. Again, these results are online on my Web site. As I said, they are also included in recent versions of OpenSSL, they are included in the NaCl library, and yeah, feel free to use it. It's all public domain.
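As a small illustration of what working on this packed state looks like, here is the easy part -- AddRoundKey -- on a bitsliced state of eight 128-bit registers, one per bit position, covering eight counter-mode blocks (a sketch with SSE2 intrinsics; the round key is assumed to be already in the same bitsliced format, and names are mine):

```c
#include <emmintrin.h>

/* Bitsliced AddRoundKey: the state is 8 xmm values (one per bit position
   of the state bytes, across 8 blocks), so adding the round key is 8 xors. */
void addroundkey(__m128i state[8], const __m128i rk[8]) {
    for (int i = 0; i < 8; i++)
        state[i] = _mm_xor_si128(state[i], rk[i]);
}
```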

>>: Why can you only use two operand instructions?

>> Peter Schwabe: Instruction set limitations. They don't support three operands, yeah. Let me conclude with the references. These are the five papers that this talk was roughly based on, so if you're interested, all the papers are online. And thank you for your attention. [applause]

>>: One comment then a question. It occurred to me some time ago, I never published it, that if you have parallelism, there is a situation where you can use the parallelism in a nonconventional way.

>>: Suppose that you are looking at the [indiscernible] such as, say -- it's applicable to any kind of block cipher which is [indiscernible] -- and you're trying to do cryptanalysis. So you are given a plaintext, you are given a ciphertext, and you are trying to find the key. If you run in parallel, encrypting the plaintext through half the rounds and decrypting the ciphertext through half the rounds, and then compare, it gives you an extra factor of two in the available parallelism. Of course, you have a huge amount of parallelism, because you have a very large number of keys, but suppose that the keys are coming from a somewhat constrained domain which is a bit hard to enumerate -- say they're coming from passwords being hashed or whatever. So you're given the keys at a slow rate, but you can double that, if you have the hardware available, by encrypting and decrypting and meeting in the middle. So try to think if this could be of any help in doing cryptanalysis. So this was my comment. Now the question: are any of your ideas obtainable with standard optimizing compilers, or is it totally hopeless? Did you try to run the problem through available compilers and see what they did with it?

>> Peter Schwabe: Well, I'm usually using GCC. I think that the Intel --

>>: I was asking about all these extensions and --

>> Peter Schwabe: Well, yeah. Yeah, of course, most recent versions of everything. The problem is that, for example, bitslicing is something that a compiler won't do automatically for you; I mean, it's a completely different way of expressing an algorithm. There are some simple kinds of vectorizing that I think compilers do -- the Intel compiler is somewhat better at it than GCC; I don't know about the Microsoft compiler. So if you do something like a loop which just adds up numbers, then the compiler may see, okay, these come from consecutive memory addresses, this can be vectorized, and it will do that. But those are very, very simple, very trivial things. The problem is what I said at the very beginning: even if you have very high-level parallelism, you first need to arrange your data structures in a way that vectorization can use them. And if you do that, then usually it's even easier to also write the instructions by hand, in assembly or in intrinsics. So compilers are pretty bad at it.

>>: And they keep changing all the time.

>> Peter Schwabe: It's true. So automatic vectorization, when it gets to somewhat advanced vectorization -- really figuring out, for example, inside a multiplication that you can, of course, do it in vector registers, but you have to permute all the time in between, and these shuffles on various architectures are free because they go to a different unit, so you can speed that up -- I haven't seen compilers doing that automatically.

>>: How much work has it been to keep up with these different versions? If you have a version for SSE2 and then a new instruction set comes out, can you use similar code, tweaked, or is it just a rewrite?

>> Peter Schwabe: There are a few things where, for different versions, you basically only reimplement the low-level things. For example, for the CFS code we also have a version that uses SSE, and all the high-level things are the same. And on the low level -- I mean, okay, I wrote everything in assembly, and then these are different instructions, but to a large extent it's just removing the v at the beginning of the instructions and renaming all ymm to xmm, and then you're done, so that's fairly easy. But in many cases you have to be more careful, because even if you can make the code work that way, there are different performance tradeoffs for different microarchitectures, and then you actually end up rewriting it for a different one.

>>: Did you optimize the search for the number of gates in the last paper? Because you say, oh, we are going from 117 to 163, but I guess maybe it could be less. Did you search automatically?

>> Peter Schwabe: No, I didn't. That's, to a large extent, hand-optimized, starting with this hardware implementation. And I'd be very curious to see better counts there. It's very easy to get to better counts if you just use the AVX 128-bit operations, because those are three-operand; that makes it much, much easier. But really doing it with these two-operand instructions -- I'd be very curious to see better solutions for it. We've played with it for a long time, and particularly Emilia has spent a lot of time on it.

[applause]
