>> Patrick Longa Pierola: Okay. Hello, everyone. It is my pleasure to introduce today Diego Aranha. He is -- Well, let's give a little background about him. He holds a Ph.D. degree from the University of Campinas in Brazil. He is currently a professor at the Computer Science Department of the University of Brasilia, and his expertise is in cryptography, in particular the efficient implementation of cryptographic primitives, most notably pairings and binary elliptic curves. Well, as a side note, we know that he became very famous, probably more famous than Neymar and the Brazilian soccer team in Brazil, because he led a team that found vulnerabilities in the Brazilian voting machine. So please let's welcome Diego Aranha. He will be presenting a talk on software implementation of binary field arithmetic using vector instructions. >> Diego Aranha: Thank you. Thank you, Patrick. Good morning. So it's good to talk about arithmetic again because this voting machine stuff is too easy. So this is joint work with Armando, who is here, Darrel Hankerson, my ex-Ph.D. advisor Julio Lopez, Francisco Rodriguez and Jonathan Taverne. So I'll talk about how to use vector instructions to implement binary fields. In my opinion this is one of the most fascinating aspects of doing arithmetic of cryptographic interest: how to manipulate the math to use some interesting computing resources. So I really like to do this type of work. I think it's right on the border between the math and the technology. So let's begin. Binary fields are basically everywhere in cryptography. We can find them when we deploy ECC or PBC systems or any curve-based cryptosystem. Also some multivariate systems use binary fields to provide security which is resistant to quantum computers. And even some block ciphers have building blocks built on top of binary fields; for example, the AES S-box is built on binary field arithmetic. And we know that since binary fields started to become relevant for cryptography, many algorithms and optimizations have already been proposed in the literature. So one of the questions this talk wants to answer is: can we try to unify at least the best approaches we have in the literature under the same formulation or framework? And if so, can this formulation provide new ideas, not only represent what's already there but also provide some new ideas? So that's what I want to answer with this talk. So just to summarize: I will present a formulation of binary field arithmetic which captures the state of the art for these fields, using vector instructions of course. This formulation provides a new algorithm for implementing multiplication on binary fields, which is an operation that is not commonly supported natively. So it's good to have other approaches to implement binary field arithmetic. And also this formulation gives some other approaches for implementing operations which, let's say, restore the ratios we had in the literature before support for native multiplication was introduced. So if the multiplier becomes significantly faster, we have impacts on several algorithms, for example, choosing between point halving or point doubling in binary curves. And then we have some approaches to restore the classical ratios, let's say, the conventional ratios, which make the previous analysis still valid. So it has this side effect. And I'll present some experimental results for [inaudible] multiplication in two different scenarios. So let's just examine the arsenal we have.
So the Intel Core architecture has a vector instruction set called SSE, or Streaming SIMD Extensions. It started as a 128-bit instruction set, and it was introduced actually on the Pentium 3, Pentium 4 family of processors. But it has been improved a lot with every new processor family, and it gained several instructions on the Intel Core architecture, especially the 45-nanometer series of processors. In this same series they introduced what they called a super shuffle engine, which is a modification in the processor to accelerate some specific instructions which will be very useful here. Also in the Nehalem family, which is the second iteration of the 45-nanometer series, they introduced native support for binary field multiplication as this carry-less multiplier instruction, which multiplies two 64-bit integers without taking carries into account, which is basically binary field multiplication. And in the more recent 32-nanometer series, they also expanded the size of the vector registers from 128 to 256 bits. But not all instructions are readily available on this new instruction set. The floating-point instructions are, but the integer instructions are mostly still not available. They will become available on the second iteration of this new instruction set, which is coming at the beginning of next year if I'm not mistaken. So let's see which instructions these vector instruction sets provide to us so we can implement fast binary field arithmetic. So of course the first thing we need is an instruction to copy things from and to memory. So this is the name of the instruction. If I present an algorithm, this will be the mnemonic that we use in the algorithm. And this is the cost of the instruction before and after in different processor series. So we'll make this more explicit depending on the instruction. So, for example, we have memory load and store, of course. This cost depends on whether the operands are aligned on 128-bit [inaudible] in memory. So if they are aligned, usually this instruction was faster. In the latest processors it doesn't make a difference, but in the Core2 processors it was faster if operands were aligned. So overall it's a good idea to put everything in aligned addresses. We have shifting instructions; these are vector instructions so they operate on multiple objects in the same register, [inaudible] in the same way for every operand. So we have instructions to shift simultaneously two 64-bit integers inside a vector register but without propagating bits from one 64-bit integer to the other. So that's what I'm calling 64-bit bitwise shifts. And you have left and right shifts, of course. And representing this: these are used -- it will become clear later, too, but it's better to use this when the shifting amount is not a multiple of 8, because we have a better instruction to shift things when the shift amount is a multiple of 8. So we have XOR, AND and OR of course. These work across the whole 128-bit vector register. We have two byte interleaving instructions which take alternately bytes from two registers, from the lower or higher part of two registers, and produce a new register as output. So that's what I'm calling a byte interleaving instruction. And we have these faster shifts when the shifting amount is a multiple of 8. So it's a bytewise shift, so the amount must be a multiple of 8. And then we have -- now we have propagation from the lower part to the higher part of the register of bytes, not of bits. We also have this byte shuffling instruction which permutes the order of the bytes in a register.
A memory alignment instruction to make operands aligned if they are not, so it's just for convenience. And in the latest processors we have this carry-less multiplication instruction. So after the super shuffle engine was introduced, these three instructions became single-cycle instructions. Yes? >> : What's the number in parentheses? You have 8, 10... >> Diego Aranha: Yeah, that's what I'm saying right now. So this is the cost of the instruction before the super shuffle engine was introduced. So they cost 2 or 3 cycles, and now they cost a single cycle for these 3 instructions. So, please? >> : Yeah, so all these costs are in cycles I presume. But also... >> Diego Aranha: Yes. >> : ...to measure a cost you also need to know how often you can dispatch this instruction... >> Diego Aranha: [Inaudible]. >> : ...[inaudible] instruction or once every five instructions [inaudible]... >> Diego Aranha: Yes. >> : So do you take that into account when --? >> Diego Aranha: Yeah, in this table, no. But at this time I was only considering the [inaudible] across the source, and the throughput was just one instruction per cycle. I know that in more recent processors this throughput got higher, but I'm not entering into this detail. Well, it will become clear in the later slides, but at first I'll just show how using these instructions is better compared to the state of the art. And then any improvement to these instructions will just reduce [inaudible] by a constant. So it won't be as important. At first I just want to prove that these instructions are good for binary field arithmetic. That's my first point. And also we have the carry-less multiplier; it can cost, on the first iteration, the Nehalem architecture, between 10 and 16 cycles if you have two multiplications with independent operands. It causes [inaudible]. So throughput plays a role here. But usually you can organize code so you can exploit this throughput. So in this line here I'm already considering that throughput can be exploited. And in the [inaudible] processors the instruction cost is between 8 and 14 cycles, which is very interesting because this operation is much simpler than an integer multiplication. It's just integer multiplication without the carries. But an integer multiplication costs 3 cycles. So we have a much simpler operation which costs 3 times more. Yeah? Probably for market reasons. I don't know exactly why, but probably it's something related to -- or maybe an area-performance tradeoff in the processor. I don't know the reason. Please? >> : If you ever talk to Intel engineers, they want to look good on standard benchmarks. There is no carry-less multiplier in the standard benchmarks. There is [inaudible] on the standard benchmarks. >> Diego Aranha: Yeah, we should introduce [inaudible], a binary field multiplier, in the benchmarks. >> : Yeah. So what you're saying is multiplication takes three cycles, regular multiplication. So you... >> Diego Aranha: Yes. >> : ...mean a SIMD multiplication or regular multiplication? >> Diego Aranha: Regular integer multiplication using 64-bit integers, like on the standard x86-64 registers, takes three cycles. >> : [Inaudible] SIMD? >> Diego Aranha: Not in SIMD, just using the... >> : But this is SIMD so... >> Diego Aranha: This is SIMD. >> : ...[inaudible] makes more sense to compare SIMD to SIMD, right? >> Diego Aranha: But we don't have a 64-bit times 64-bit SIMD integer multiplier. Actually we have -- on the Pentium, I remember, the SSE2 had one.
But still, if I'm not mistaken, because I've never used this instruction, we have a 64-bit times 64-bit SIMD instruction but it only does one multiplication at a time. It is not parallel like a SIMD-style instruction. Exactly as this one, we have two 64-bit integers per vector register but we can only multiply one pair of integers at once. We cannot do things in parallel like additions or other instructions. And I think that the integer counterpart is faster than the SIMD counterpart if you are only considering SIMD integer instructions. So I think still using the standard registers is faster in the integer case as well. But what I was -- I think an important point is this operation is much simpler than the other. So naively you could think that we should take the integer multiplier, take the carry logic out, and we should have a binary field multiplier in just two cycles, for example, or even one cycle, because the carries are the big problem. But of course not everything works like [inaudible] say, so it's not that simple. But this... >> : Just to go a little further, we were experimenting with Armando on AVX to do prime field arithmetic, and so far AVX is -- so using basically the multiplier SIMD-style for prime field arithmetic, it's slower than regular integer multiplication. That's for sure. What will happen with AVX2 is still unknown. But AVX2 is still coming next [inaudible]. So, so far, yeah. It's... >> Diego Aranha: So it's still... >> : ...[inaudible] experiment... >> Diego Aranha: I never benchmarked [inaudible]... >> : ...integer multiplier is still more efficient for [inaudible]. >> Diego Aranha: Because it's on the standard benchmark. So they want to make this thing as fast as possible because it looks good. >> : [Inaudible] just a little comment that as you said [inaudible] costly integer [inaudible] that this could be the effect of the area tradeoff in the chip. So carry-less [inaudible] instructions that are used for all the SSE instructions. So it's in another unit core [inaudible]... >> Diego Aranha: Yeah, I always thought that was the reason, but it's just that -- Please, Peter? >> : There may be a big lead for the standard multiplier and subscript complications. >> Diego Aranha: Sorry, can you -- I didn't [inaudible]. >> : [Inaudible] in the calculations subscripts. >> : Yeah, so you need fast multiplication. So it might -- [Inaudible] because I was really surprised to [inaudible] that regular multiplication takes three cycles. On the fastest machines, if I do 64 times 64 to 128 bits, it takes always at least 10 to 15 cycles. >> : That's what I'm talking about. >> : If you only want the lower 64 bits, that will take [inaudible]... >> Diego Aranha: Oh, okay. Okay. >> : And I would make one point. >> Diego Aranha: Okay. >> : If you write C code, you cannot get to those 64 bits. So, again, it doesn't appear [inaudible]... [ Multiple inaudible voices continue the conversation ] >> : We're discussing assembly instructions. >> Diego Aranha: Yeah, [inaudible]. >> : I don't know about that because I have been -- I read this, it's in the Intel manual... >> : So only if you put... >> : ...[inaudible]... >> : ...[inaudible] if you get -- if you think that the lower part... >> : The lower part becomes available after three or four cycles. In the upper part on [inaudible] where eight... [ Multiple inaudible voices continue the conversation ] >> : Oh, I really don't know about that. I'm not sure because -- Yeah, actually my [inaudible] makes more sense with the three...
>> : This is according... [ Multiple inaudible voices continue the conversation ] >> : Run a little loop and measure it. >> : Yeah, yeah. That's --. >> Diego Aranha: What I am -- Just to be clear here, what I'm [inaudible] is not from official Intel documents. It's from the [inaudible] instruction tables. And he's been benchmarking instructions for a long time. He is really complete, with different processor families and everything. He is really complete. So we can even try these later, but I would love to see this cost ten cycles and lose my point here, because I have been insisting for a long time that these instructions should be made faster. Maybe it's also the way we use the integer multiplier in a [inaudible] organization that you kind of hide these ten cycles. You first need the lower part to accumulate, maybe with an addition right afterwards. But we can do some tests later. So let's proceed. So let's discuss first -- most of these instructions are really simple to understand. I'll just speak a little bit more about these two instructions because at first they look really useless. But if you take a more careful look they are actually very, very useful. So the first one is the byte shuffling instruction. As I said, it just permutes the order of bytes inside the register. So this is the name of the instruction. This is the intrinsic. I'm not writing assembly directly; I'm programming with intrinsics. You use this -- They are function calls at [inaudible] but they are translated to a single or a very small number of instructions. So it makes your life easier because many times the compiler -- at least [inaudible] is becoming really good at scheduling registers and doing these things. So it helps [inaudible] between the autocompilers with the Intel compiler too. So it was better for us, especially at this point. Nowadays I'm not completely sure. You lose some control by using intrinsics but still --. So this instruction works like this. We have a source register, which is a sequence of bytes, and a mask. And you copy bytes from the source to the result register according to the mask. So, for example, we have zero in the first three bytes, so we copy the first byte to the first three bytes of the output. So it's just according to the mask up here [inaudible] bytes. But the interesting part here is if you transform this into a table of pre-computed values, 16 pre-computed bytes, and you consider this mask as a sequence of 4-bit indexes, you do 16 simultaneous table lookups. So the real power of this instruction, which looks very useless at first, is that you can implement in parallel any function which takes 4 bits as input and produces 8 bits as output. Okay? Is it clear? So just to present an example, let's take a bit manipulation example. I'm interested in taking 4 bits and expanding them by inserting zero bits between these 4 bits. And this is very common in binary field arithmetic. You have to manipulate things at the bit level, which is hard in software because you have to do all this shifting and masking. So with this instruction you can pre-compute -- evaluate this simple function for the 16 possible inputs, store this into a register. And the other register, the mask register, stores the indexes to look up in this table. So at the end you evaluate this function 16 times in parallel. >> : Is this similar to the ancient IBM PR instruction? >> Diego Aranha: I don't know because I don't know this instruction.
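To make the 16 simultaneous table lookups concrete, here is a minimal C sketch with SSSE3 intrinsics; the function name expand_nibbles is illustrative (not from the talk's code), and it assumes every byte of the index register already holds a value between 0 and 15.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Expand each 4-bit value 0babcd held in a byte of idx into the byte
 * 0b0a0b0c0d, i.e. insert zero bits between the 4 bits, by using PSHUFB
 * as 16 parallel lookups into a 16-entry table of pre-computed bytes. */
static __m128i expand_nibbles(__m128i idx) {
    const __m128i table = _mm_set_epi8(
        0x55, 0x54, 0x51, 0x50, 0x45, 0x44, 0x41, 0x40,
        0x15, 0x14, 0x11, 0x10, 0x05, 0x04, 0x01, 0x00);
    return _mm_shuffle_epi8(table, idx);  /* result[i] = table[idx[i]] */
}
```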
Although I'm talking about Intel instructions here, I've seen other examples of this instruction in other instruction sets. So it's perfectly possible. I know the AltiVec unit [inaudible] had a very similar instruction with a different name. Like, well, it's [inaudible] bytes or something, not byte shuffling. So you can see that I'm operating here -- if you consider the indexes only, I'm doing 4-bit granular arithmetic in a sense. So my whole formulation of binary field arithmetic will try to represent things in terms of sets of 4 bits so I can use this instruction as much as possible, especially because I know this instruction is becoming faster with time. So I want to do things at 4-bit level or 4-bit granularity. The other instruction is the memory alignment instruction, which looks even more useless because you can at first make everything aligned and not even need this alignment instruction. What this instruction does is it receives two registers and an offset as inputs, and it extracts a 128-bit or 16-byte section of the concatenation of these two registers and produces this as output. So what this really implements is a multiprecision bytewise shift. So you can use the instruction to do left or right multi-precision shifts because it's implicitly propagating bytes from one register to the other. >> : What machine supports misaligned? Is this faster than writing misaligned and reading misaligned? >> Diego Aranha: On the Core2 processors it was faster to deal with aligned operands. On the Sandy Bridge processors it doesn't make a difference. So it's the same cost reading or writing aligned or misaligned operands. >> : So a misaligned read following an aligned write would give you the same semantics? >> Diego Aranha: Yes. >> : Is it faster or slower? >> Diego Aranha: Probably it's the same, because implicitly I assume that the processor calls this instruction and then does the operation with aligned operands. I assume so, but I never tried. >> : But sometimes using misaligned data produces crashes in the... >> : Misaligned what? >> : Misaligned data [inaudible]. >> : Yes. >> : Produces crashes in the program sometimes. >> Diego Aranha: It produces, if you -- For example, if you call an instruction with a memory operand and it's not aligned. So this also on the first -- I didn't try -- Because I used everything aligned, I didn't try it on the recent processors to see if this still gives crashes. But on the first ones you had to have operands aligned or align them somehow, ask the compiler to align them or use this instruction. So, no, this I didn't try anymore because, well, it --. >> : I will just add that reading and writing misaligned can be written in C. This... >> Diego Aranha: Yes. >> : ...[inaudible] can be written in pure C. >> Diego Aranha: Yes. Yes. There are compiler extensions for that. But in any case, the instruction was introduced to take something unaligned and make it aligned so you wouldn't have crashes or performance penalties. But nowadays it's not really useful anymore because it doesn't make a difference if the crashes are not a problem. But we can still use it for fast multi-precision byte shifting. So now let's just fix some notation. So I will present my binary fields with the irreducible polynomial usually chosen as a trinomial or pentanomial, of course. I will use polynomial basis because we have a polynomial binary field multiplier, so there is no sense in using normal basis on this processor. In some cases, yes, but it's out of scope here.
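A minimal sketch of the multi-precision byte shift built from the memory alignment instruction (PALIGNR), assuming a 256-bit value stored as two 128-bit registers hi:lo; the helper name is illustrative.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

/* Shift a 256-bit value (hi:lo) left by 8 bits. PALIGNR propagates the
 * top byte of lo into hi, so no explicit shift-and-OR carry handling
 * between the two registers is needed. */
static void shl8_256(__m128i *hi, __m128i *lo) {
    *hi = _mm_alignr_epi8(*hi, *lo, 15); /* bytes 15..30 of hi:lo */
    *lo = _mm_slli_si128(*lo, 1);        /* lo shifted left by one byte */
}
```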
I'm only dealing with polynomial basis. In software we just represent a binary field element as a sequence of words. So I'm calling this number n. It's just the [inaudible] of the degree of the binary field divided by 64, so a contiguous set of 64-bit words. I'm considering n to be even just to make things simpler, but of course n is not always even. If it's even, it is easy to determine the number of vector registers: just n over 2, of course. And graphically in this talk I will present a binary field element in this way, so it's just a sequence of n 64-bit integers. So first of all let's talk about the proposed representation. So as I said, I want to use the byte shuffling instruction as much as possible. So I really need 4-bit granular arithmetic. And this representation is just splitting the binary field element into two other binary field elements represented with 4-bit groups or 4-bit sets. So suppose this is a vector register holding part of a binary field element; you just split it into higher and lower 4-bit groups by applying a simple mask and a shift. So you do this. And then you can see that by looking at this and this, I have the 4-bit sets exactly in the positions I need to use the byte shuffling instruction. So that's why we do it this way. Of course it's very easy to convert to this split form, just applying a mask and a shift, and very easy to convert back: you just shift the higher part and add the two. I will not always explicitly convert things to this representation because sometimes it doesn't make sense. Sometimes I'll just implicitly process these 4 bits in the algorithms. And it doesn't make sense to keep things represented in this split form all the time because it wastes more memory: for every binary field element you have, you use twice as much memory. So in the algorithms, when I need it, I convert to split form, do the arithmetic I need and then convert back at the end. So let's start with addition and subtraction, which are the same thing in a binary field. This is of course the easiest operation in a binary field. It's just XOR. But the interesting part is, as we were talking about throughput, you usually implement addition with the largest operand size available. In these processors, in the Sandy Bridge processor for example, we have three options. We can either do exclusive OR at 64-bit granularity, use the SSE operation or the AVX operation. We have these three options. The thing is the AVX looks faster because it deals with a bigger number of bits, so you need fewer additions to add two binary field elements, but the throughput is lower. So you can do three 128-bit additions per cycle if everything is not dependent on each other, the operands and results, and the operands are not stored in memory. But you can only do one 256-bit operation per cycle. So according to our tests, it's better to use the SSE instructions instead of the AVX instructions. I don't know about AVX2, if they will change the throughput ratios or not. So it's not always -- But in some cases it's better -- There will be cases where it's better to use the larger or the wider instruction. So let's talk about squaring, which is much more interesting than addition. So if a binary field element is represented this way in polynomial basis, we know that squaring is just inserting zeros between each consecutive pair of coefficients.
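Going back to the split form for a moment, a minimal sketch of the conversion for one 128-bit register of an element; the helper names are illustrative.

```c
#include <emmintrin.h>  /* SSE2 */

/* Split each byte of a field-element register into its low and high
 * nibble, so the 4-bit sets land in the byte positions PSHUFB expects. */
static void to_split_form(__m128i a, __m128i *lo, __m128i *hi) {
    const __m128i nibble_mask = _mm_set1_epi8(0x0F);
    *lo = _mm_and_si128(a, nibble_mask);                     /* low 4-bit sets  */
    *hi = _mm_and_si128(_mm_srli_epi64(a, 4), nibble_mask);  /* high 4-bit sets */
}

/* Converting back is just a shift and an XOR (addition in binary fields). */
static __m128i from_split_form(__m128i lo, __m128i hi) {
    return _mm_xor_si128(lo, _mm_slli_epi64(hi, 4));
}
```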
So if we have a binary field element represented by the sequence of coefficients here, the square of A is just this with zeros inserted between each pair of contiguous coefficients. And of course this has to be reduced with the irreducible polynomial to bring things back to the standard size. This operation, if you implement it naively, is going to cost a lot of bit manipulation. You have to isolate these bits, apply masks and shift them. So the way to do this usually is to pre-compute a small table of results and use this table to expand the coefficients in this way. Squaring is also a linear operation. So let's see how squaring behaves with what we call the split form, just distributing the square operation over that formula. So what this means is you can square a binary field element by just squaring the lower 4-bit sets and the higher 4-bit sets and accumulating them with an 8-bit shift. And you can do this of course with a lookup table. So for all the 16 possible combinations of 4 bits, you can pre-compute this table of bytes with the zeros already inserted. It's just a 16-byte table. And it does this much more efficiently than isolating and masking the bits individually. Of course this looks perfect for applying that byte shuffling instruction, so that's what we're going to do now. So let's take one register of the binary field element as input. Let's convert it to split form, applying the mask and then shifting to the right by 4 bits to position these bits in the place we want them. And then we can apply the byte shuffling instruction, which is really working as a simultaneous lookup instruction here, and expand these 4 bits into 8 bits here. Then we need to accumulate these two expanded results here with an 8-bit offset. So what we do is use these byte interleaving instructions to take alternately the bytes from one and the other. So we are simulating an 8-bit shift and accumulation. So basically this is the result, and we are implementing this explicitly. So let's now talk about square-root extraction, which uses the same ideas but is slightly trickier. So this is a classical algorithm by Fong, Lopez and others. So you can take a square root of a binary field element by isolating the bits with even and odd indexes and accumulating them with the odd part multiplied by the square root of z. The square root of z is a [inaudible] constant, so depending on the choice of F this can be a dense or a sparse element. We prefer it to be a sparse element. When it's sparse we call it a square-root friendly field, because this multiplication here only costs some shifted additions, so it's faster. >> : Is there -- Sorry. Is there a certain [inaudible] or something that will make these things fast? >> Diego Aranha: Yes, if the degrees of the nonzero middle terms of F of z are all odd, you have a square-root friendly polynomial. And there are several papers in the literature detailing the structure of these F's, so a few papers. There is even a hierarchy-like [inaudible] divided [inaudible] I think by Scott if I'm not mistaken. But some of these [inaudible] polynomials are square-root friendly, some of them are not. But I will also present numbers with and without square-root friendly polynomials. So even when you don't have a square-root friendly polynomial and the square root of z is dense, depending on the multiplication algorithm you can still pre-compute part of this multiplication. So for one of them you can still do it.
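Going back to squaring for a moment, a minimal sketch combining the split form, the PSHUFB table with the zeros already inserted, and the byte interleaving instructions; it expands one 128-bit register into the 256 bits of its square before modular reduction (names are illustrative).

```c
#include <tmmintrin.h>  /* SSSE3 */

/* Square one 128-bit chunk of a field element: split into nibbles, look
 * up the table of squared 4-bit sets, then interleave the bytes of the
 * two results to simulate the 8-bit shift-and-accumulate. */
static void sqr_expand(__m128i a, __m128i *lo, __m128i *hi) {
    const __m128i mask    = _mm_set1_epi8(0x0F);
    const __m128i sqr_tab = _mm_set_epi8(
        0x55, 0x54, 0x51, 0x50, 0x45, 0x44, 0x41, 0x40,
        0x15, 0x14, 0x11, 0x10, 0x05, 0x04, 0x01, 0x00);
    __m128i al = _mm_and_si128(a, mask);
    __m128i ah = _mm_and_si128(_mm_srli_epi64(a, 4), mask);
    al = _mm_shuffle_epi8(sqr_tab, al);   /* squares of the low nibbles   */
    ah = _mm_shuffle_epi8(sqr_tab, ah);   /* squares of the high nibbles  */
    *lo = _mm_unpacklo_epi8(al, ah);      /* lower 128 bits of the square */
    *hi = _mm_unpackhi_epi8(al, ah);      /* upper 128 bits of the square */
}
```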
And this pre-computation usually costs -- Well, this is [inaudible] precision too. Actually this is [inaudible] precision, so this doesn't usually cost a full multiplication even in the worst case. But we know that square root is also linear because squaring is linear, and one is the inverse of the other. So we can apply this square-root operation to the split form. Of course we get this expression here. So as in squaring, we can compute a square root by converting to split form, applying the square-root operator to the two groups of 4-bit sets, and accumulating them with a 2-bit shift. If we use the same formula to expand these two formulas, we arrive at this, which looks complicated but is really simple. So what we need to do is, for each of the two operands or results in the split form representation, we have again to isolate the bits with even and odd indexes and combine all of them in this way by also applying these 2-bit shifts and then the multiplication by the square root of z. And as I said, it's better to choose F so that the square root of z is really sparse. So this is just a bunch of shifted additions. So this is how the square root looks. So as before, we receive as input a part of a binary field element. First of all I shuffle the bytes in that register so that the odd and even parts are easier to collect at the end. So there is a small trick here which kind of violates this split form, but it's a trick. So with this byte-permuted operand, we convert this to split form. So actually we don't have exactly this split form but a permutation of this split form, if you are pedantic. And so we can, again, use a simultaneous table lookup instruction to isolate the bits with even and odd indexes. And that 2-bit shift can already be embedded in one of the tables. So on one side we do this and this because we need them multiplied by z squared, or shifted to the left by two. And on the other side we compute this and this, the even and odd parts of AL, or the even and odd parts of AH. So at the end we have all the 2-bit sets already isolated with the nice shifting applied to this side. We can add them and recover the 4-bit sets corresponding to A-even and A-odd. Again, we convert the registers back from the split form at the end because this is also out of order, as you can see. So at the end we have the bits with even indexes, the 64 bits, on one side and the bits with odd indexes on the other side. So basically we are using the same lookup instruction but now to isolate bits at even or odd indexes. Multiplication now has three different strategies. The first one is the Lopez-Dahab comb method. It was introduced in 2000, I think, and it was the fastest approach to multiply binary fields on many platforms, especially when platforms didn't have any support for native binary field multiplication. Then I will talk about the new approach to multiply binary fields using shuffle instructions, the byte shuffle instruction, and some considerations about how to better use the carry-less multiplier. So Lopez-Dahab multiplication works like this: suppose that it's easy to compute a small polynomial times a field element by using just shifted additions, suppose that you have this as a building block. You can compute it explicitly by shifted additions.
You can perform multiplication by processing one of the operands, considering it as a sequence of small polynomials, and then accumulating the results of these products in a schoolbook way, let's say, just with the corresponding shifts applied. So in this case here, since we already have one trick, yes, I'm jumping -- One thing you can do in this algorithm is you can jump the offsets you consider in one of the operands, for example at every 8 bits, so you don't have shifts by 4 bits. You can consider only the lower parts of each byte, accumulate all of them, do a 4-bit left shift and then do the other part. So you can do it to remove 4-bit shifts, or at least most of the 4-bit shifts. An interesting part of this algorithm is of course that many of the small polynomials, when you are processing A, are repeated. So you can pre-compute several of them. So the usual choice is to consider u to have degree less than 4. So it's just a set of 4 bits, and then you build a table of 16 pre-computed values, and you can just accumulate these values and apply the corresponding shifts. At the end you have a double-precision polynomial and you can reduce it to get a proper field element. So this is the Lopez-Dahab comb method. So if you consider that one of the operands is represented in that split form -- in this case it's A, so B is still the same and A is expanded into this split form -- and if you distribute this operation, you get this. So what it basically says is if you build two tables for two different multiplications, you remove the 4-bit shifts completely. So if you build a precomputation table for B and for B shifted to the left by 4, you can reduce this multiplication to two simpler Lopez-Dahab multiplications. So it's basically another way of doing the same trick I presented here, but just using the split form. And this is well known. It was already suggested in the original paper; they already had this because not many processors have good 4-bit shifts, but many processors have good 8-bit shifts. In some cases it's even for free. If you are on an 8-bit microcontroller, of course, shifting by 8 bits is for free. So the core operation of this algorithm is accumulating these types of products, the results of these types of products, multiplying small polynomials by dense polynomials. This is the core, and we repeat this operation many, many times. So now this is how the algorithm looks. So we pre-compute online two tables. The first table, as I said, is based on B. The other table is based on B shifted to the left by 4 bits, for all possible choices of u where u has only 4 bits, or degree less than 4. So we accumulate all the intermediate results in a register window so we don't do operations in memory, so we exploit the --. Actually we are still using a table stored in memory to [inaudible] the values that we accumulate, so we don't enjoy the benefits of the higher throughput. >> : Is it similar to [inaudible] algorithm for binary matrices? >> Diego Aranha: I think so, yes. I remember reading a paper dealing with this similarity. I think so. And it has been proved too. I think Bernstein has done some work on this algorithm, the [inaudible]. So let's continue. And then I split it up into two different loops. The reason for that is in the first loop I process things with 128-bit offsets so I don't have to do 64-bit shifts. And also, when I process -- in the first lines in the two loops I process, I think, AL and in the second line AH, also with this offset.
And by using these two tables I can eliminate 4-bit shifts. I only need to do 8-bit shifts. So I can use the memory alignment instruction here to do the 8-bit shifts with byte propagation in a faster way. That's basically how the algorithm works. And since I'm splitting it into two loops, by just reordering the registers I can do 128-bit shifts, so I don't have to do 64-bit shifts explicitly. And at the end, of course, I have to reduce mod F(z). So, okay, this is all also well known. The only difference is that in the context of vector instructions that's new. But now let's see what happens if both the multiplicand and the multiplier are represented in this split form. So the expression looks like this. Using Karatsuba we can reduce it, of course, to three multiplications. So we can do a multiplication by multiplying AL and BL, AH and BH, computing these additions -- actually these additions, this multiplication and the additions with the intermediate results -- and also accumulating all these intermediate values with an 8-bit shift and a 4-bit shift. Karatsuba is usually much better in binary fields than in prime fields because we don't have carry propagation. Addition is really fast, so it makes much more sense to use Karatsuba here, even for small field sizes. Karatsuba is only useful in prime fields, I think, over like 2000-bit fields or something like this currently, because of the ratio between addition and multiplication. And addition can cost, I think, up to two cycles on Sandy Bridge. So it's almost the cost of an integer multiplication, modulo the differences we've been seeing in the instruction latencies. >> : Also in [inaudible] field. >> : In [inaudible] it makes sense to use Karatsuba? >> Diego Aranha: It makes sense. Yes, yes. So, yeah, let's only think about prime -- large characteristic fields. So now we are multiplying a pair of sparse polynomials. So if you think of the Lopez-Dahab method, or if you look at this operation explicitly, what you are doing is multiplying a small polynomial by a sparse polynomial, so a 4-bit set by a sequence of 4-bit coefficients. So this is how the algorithm looks. It basically uses the same tricks to remove the 4-bit shifts. The only difference is that I don't have online pre-computed tables anymore. I have a table of constants which pre-computes these values for all possible choices of these small polynomials. So I have a table of sixteen 128-bit values. So it's a table of constant size. And the core operation of the algorithm is just using the shuffling instruction to do the multiplication of this 4-bit set by each of these 4-bit sets in just one operation, because it's a table lookup again. So basically that's how the algorithm works. So we'll analyze trade-offs and when this could be better or worse later. To use the carry-less multiplier we have two different options. We can either consider operands at 128-bit granularity. This is good if you want to reduce the number of required registers, because then you can store all the intermediate values in a smaller number of registers than when considering operands as 64-bit values, because the instruction is a SIMD instruction. So you still have to use a vector register to store the operands even if they are only 64 bits wide. At this granularity you multiply digits of 128-bit size, and then you can do each of these 128-bit multiplications with just three carry-less multiplications. >> : You mentioned you have to [inaudible] put the full 128-bit register in it doing 64-bit arithmetic.
Now can it at least attack or do two such operations in parallel using the top and bottom [inaudible] advantage? >> Diego Aranha: The problem is you have to -- You can do it and you enjoy some throughput gain, but the problem is the operation won't be executed in parallel like a SIMD instruction where you add the two 64 bits in parallel. So you have to call the instruction once, call the [inaudible], and some work will be shared because, as I was saying, the latency goes down four cycles. So I speculate that this is just the time for decoding the instruction and some pre-fetching of the operands; it's not really dealing with [inaudible]. I speculate, but I don't have concrete evidence of that because -- well, in the [inaudible] we lost the same amount of cycles in the two situations between the two processors. I just think it's the optimizer, the decoding part, or the prefetching of operands; that's why we have a constant difference. So if you consider this granularity you really should use the maximum number of Karatsuba levels for n over 2 digits. So for example this is a good organization if your binary field takes four vector registers, or usually a product of two and three, like 2 to the k times 3 to the l, for example, because then you have nice Karatsuba formulas for each of these steps. But sometimes we have a better formula if you deal with 64-bit granularity. But then you have to handle a higher number of registers, because then you need to store a 64-bit value in your vector register. What you do is occupy a 128-bit wide register with two useful, let's say, useful 64-bit integers which will be needed. So this has been proposed in the gf2x library by Emmanuel Thome and others. And you can do this to do 64-bit additions in parallel and to reduce the number of registers to store intermediate values, because the faster formulas at this granularity usually have a higher number of intermediate values. Some of them were proposed by Peter Montgomery, for example. And then of course this works if you have a formula with a lower number of multiplications, so then this makes sense, if you can reduce the number of carry-less multiplications according to this. In some situations we have seen that this organization provides a higher throughput, probably because we have this nice structure here where all the three multiplications are together. So it's easier for the compiler and the processor to schedule these operations together. But this is more like a set of guidelines; we don't have very concrete statements on this. It's just experimentally, for some field sizes, one is better than the other, and probably the throughput is the difference, or the number of multiplications in each organization. So let's compare these three different strategies. Please? >> : Yes. So one question: if you use the 128-bit granularity, if I understood correctly, in the split form, if you have a 128-bit polynomial you store it in two 128-bit registers. So you're not really reducing the number of registers but [inaudible]... >> Diego Aranha: Yeah, but in this case I'm not using split form. I'm cheating somehow, because I wouldn't gain anything by using the split form here because I already have a dense multiplier for 64 times 64-bit polynomials. So here I'm not using the split form, only in the shuffle [inaudible] and the Lopez-Dahab without the 4-bit shifts. So here it's just native. I don't have to do anything fancy to use the instruction. So let's compare these strategies.
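Before that comparison, a minimal sketch of the 128-bit-granularity option just mentioned: one 128 x 128-bit carry-less multiplication done with three PCLMULQDQ calls, i.e. one Karatsuba level (the function name is illustrative; compile with -mpclmul).

```c
#include <wmmintrin.h>  /* PCLMULQDQ: _mm_clmulepi64_si128 */

/* Carry-less multiply two 128-bit polynomials into a 256-bit product
 * (hi:lo) using three 64x64-bit carry-less multiplications (Karatsuba). */
static void clmul_128x128(__m128i a, __m128i b, __m128i *lo, __m128i *hi) {
    __m128i t0 = _mm_clmulepi64_si128(a, b, 0x00);        /* a0 * b0        */
    __m128i t1 = _mm_clmulepi64_si128(a, b, 0x11);        /* a1 * b1        */
    __m128i sa = _mm_xor_si128(a, _mm_srli_si128(a, 8));  /* a0 ^ a1        */
    __m128i sb = _mm_xor_si128(b, _mm_srli_si128(b, 8));  /* b0 ^ b1        */
    __m128i tm = _mm_clmulepi64_si128(sa, sb, 0x00);      /* (a0^a1)(b0^b1) */
    tm  = _mm_xor_si128(tm, _mm_xor_si128(t0, t1));       /* middle term    */
    *lo = _mm_xor_si128(t0, _mm_slli_si128(tm, 8));
    *hi = _mm_xor_si128(t1, _mm_srli_si128(tm, 8));
}
```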
So a good thing about the Lopez-Dahab multiplication is that it can use the highest-granularity exclusive OR operation. You should also consider throughput effects, as we saw in the case comparing SSE and AVX. The memory space, however, is proportional to the field size, because you need to compute this table of 16 or even more field elements, depending on the window size. So if the field is bigger you spend more memory pre-computing this table online. For the shuffle-based multiplication, the core operation is sparser, so you have to do more, three times more, executions of this core operation. So it would only really work in practice if you had a really high throughput for the byte-shuffling instruction. But the good thing is, a good factor is, it consumes constant memory space; it's just a table of constants, a 16 times 16-byte table, so a 256-byte table, apart from the state to maintain the Karatsuba values. And sometimes, if the field is small enough, you can keep this state all in registers, so it's not a problem. But of course for larger field sizes the Karatsuba state will also be proportional to the field size. It will increase with bigger fields. Another problem is that here you have to deal with constants stored in memory, so you are forced to use instructions which take values from memory, which lowers your throughput. So this is a disadvantage. For the native multiplication, of course, it's a native operation. Even if the latency is not stellar, it's still faster than both of the algorithms we saw, the Lopez-Dahab and the shuffle-based multiplication. And it has constant memory consumption, of course, because it only deals with registers. It doesn't need to compute anything or use constants stored in memory. The only disadvantage I would point out is that there is no widespread support for these instructions among the architectures out there. Only the Sandy Bridge -- I think the [inaudible] processor has an 8-bit version of it which does operations in parallel, which is interesting. But it's also tricky to use it efficiently in a multiplier, especially for bigger fields, like [inaudible] field sizes. So now let's talk a little bit about modular reduction. Modular reduction, again, is very heuristic because it requires heavy shifting, and we know that vector instruction sets are not very good at shifting values. So this split representation here does not really help, especially because after the multiplication we have a double-precision polynomial. So if we needed to use the split representation to do modular reduction we would have to split this again, and now we have a double-precision operand. So here are some guidelines we collected when implementing different field sizes. If F is a trinomial, you should implement this at 128-bit granularity. The number of [inaudible] is small and usually you save on the other part, so this is faster. If it's a pentanomial, you should either process pairs of digits in parallel or do the operation in 64-bit mode where all the shifts are already provided. And a technique we discovered recently is if you have a field, a pentanomial for example, where the addition of this value is equal to this value here, you can simply rewrite the irreducible polynomial this way and save some shifts when multiplying by these values, which is required in the modular reduction part. So we have done this -- this is the standard polynomial at the 128-bit [inaudible] level for elliptic curve cryptography.
And this saved some shifting instructions in the modular reduction. So actually you have three -- so we do it at 128-bit granularity for these few. So you actually have three different options for the pentanomial case. So of course you should always accumulate writes into registers before sending them to memory, just to enjoy the higher throughput. And reduction should be done while the results of squaring or multiplication operations are still stored in registers. There is no point in saving them into memory, reading them from memory again and then reducing them to send to memory again. So you should keep the contents in your registers to save memory operations. Now half-trace. Half-trace is useful for computing point halving as the point multiplication algorithm. So the half-trace is computed in this way; we just accumulate lots of [inaudible] squarings of Z. What the half-trace does is provide a solution of this quadratic equation when the trace of Z is zero. And then an interesting property of the half-trace is if i is even, the half-trace of Z to the i can be computed in terms of the half-trace of Z to the i over 2. So the algorithms for half-trace use this fact either to get speedups or to reduce memory consumption, because you don't have to deal with the bits at even positions. You can simply convert the problem of dealing with them to bits at odd positions. So this algorithm basically works at 8-bit granularity; it processes one byte of Z per iteration. Actually this is already the last part. So what we do is we pre-compute a very big table of half-traces considering only the bits at odd positions, with odd indexes. And then we preprocess Z to eliminate all the bits at even positions using this property. So the problem is restricted to the bits at odd positions. And then we accumulate values from this table by just doing a sequence of additions. Here, since these values are always stored in memory, the AVX implementation is usually faster than the SSE implementation, because if the operands are in memory both of them have the same throughput. So using larger operands is better. But still this part dominates the cost and, I think, we only saw a 25% speedup with the larger operand sizes. Eliminating the bits at even positions can be done using the same byte shuffling instructions because it's just bit manipulation. So it's using the same techniques we used for square root, for example. So you can remove them and continue. The final operation I'll present in the [inaudible] is the m-squaring or multi-squaring. It's a time-memory tradeoff proposed by Bos. So the really interesting feature of this time-memory tradeoff is you can compute any number of consecutive squarings of a [inaudible] element in constant time. The time doesn't depend on K. So it doesn't matter what the value of K is; it's just a sequence of additions of values stored in also big pre-computed tables. So it's really a time-memory tradeoff. So you basically pre-compute, for each 4-bit set and the 16 different values of this 4-bit set, a table of these polynomials squared consecutively K times, and then you can just add, processing 4 bits at a time, to compute this consecutive squaring of A. Oh, multi-squaring was not the last. We still have inversion. Inversion is, again, very heuristic. When you don't have memory available, in a small processor for example, we usually implement the Extended Euclidean algorithm in 64-bit mode because this algorithm, again, requires heavy shifting.
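Going back to the multi-squaring trade-off for a moment, a minimal word-level C sketch of the idea: with tables T[j][u] holding (u * z^(4j))^(2^K) mod f pre-computed for every nibble position j and nibble value u, K consecutive squarings become one lookup and one XOR per nibble, independently of K. The sizes and names are illustrative, here for a five-word field such as GF(2^283).

```c
#include <stdint.h>
#include <string.h>

#define WORDS   5                 /* e.g. GF(2^283): five 64-bit words */
#define NIBBLES (WORDS * 16)      /* 4-bit sets per element            */

/* r = a^(2^K) mod f, using the pre-computed table T for this fixed K.
 * Since squaring is linear, a = sum_j u_j * z^(4j) implies
 * a^(2^K) = sum_j (u_j * z^(4j))^(2^K), which is what is accumulated. */
static void multi_sqr(uint64_t r[WORDS], const uint64_t a[WORDS],
                      const uint64_t T[NIBBLES][16][WORDS]) {
    memset(r, 0, WORDS * sizeof(uint64_t));
    for (int j = 0; j < NIBBLES; j++) {
        unsigned u = (unsigned)((a[j / 16] >> (4 * (j % 16))) & 0xF);
        for (int w = 0; w < WORDS; w++)
            r[w] ^= T[j][u][w];   /* accumulate (u * z^(4j))^(2^K) mod f */
    }
}
```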
So you shouldn't -- at least all our attempts at doing the Euclidean algorithm with vector instructions were not very efficient. But if memory is available you can use the Itoh-Tsujii algorithm, which basically rewrites inversion in this way. And if you have a short addition chain to compute that exponentiation, it's just a product of several 2-to-the-i powers of A, which can be computed using the time-memory tradeoff. So you can, for each of these powers, pre-compute a table. It's a table of constants too, so we'll have a table pre-computed already, and compute this exponentiation fairly efficiently. So now let's go to implementation details. First I'll present the implementation of 16 different binary fields ranging from 113 to 1200 bits. All of these fields are useful for curve-based cryptography, either ECC or PBC. At first we used GCC 4.1.2 because it provided the fastest SSE intrinsics. Then the performance of these intrinsics became horrible in the next iterations of the compiler, and they became fast again in GCC 4.5. In GCC 4.7 we are seeing the best times we ever saw, so somehow it became horrible and then it became good again; I don't know why. This first set of results I will present -- the experiments were done with the RELIC cryptographic library. On top of these implementations we built implementations of scalar multiplication, but this field part is done with the RELIC library. And I will restrict my comparison to other results, only comparing implementations with vector instructions. It doesn't make sense to compare a vector implementation with a C implementation. Please. >> : Well, it sounds a bit strange to me that if an intrinsic maps to one assembly instruction, how can it be faster in one compiler than another? >> Diego Aranha: It's very strange to me too. Looking at the assembly produced by GCC, it just inserts lots of penalties and overheads because... >> : So this would suggest that writing your own assembly [inaudible] are much... >> Diego Aranha: Yes. >> : ...[inaudible]. >> Diego Aranha: Yes. I agree. But then a good thing is they solved the problem somehow, and the more recent versions of the compiler are better than my assembly code, especially for the multiplier. I tried many different organizations of the multiplier in assembly and the compiler still beats me by one or two cycles. >> : The 113 bits should've been retired a long time ago. And [inaudible] are now retiring 80-bit security, so 60-bit security is ridiculous. >> Diego Aranha: Yes. Here I'm not dealing with 60-bit security. Just do... >> : [Inaudible] 113 bits... >> Diego Aranha: Yeah, but this can be useful for a curve defined over a quadratic extension, so this is actually not necessarily --. I just need these to build something, a higher extension for example. So I'm not necessarily doing -- I remember another application of this field but now --. We cite it in the paper and now I don't remember. It's useful for something else, not necessarily -- I think it's the [inaudible] curve used in [inaudible] which is this field size. Anyway, so we compare only to other implementations with vector instructions, so the mpFq implementation. Also this is a paper by Beuchat and Francisco dealing with pairings. And I will only compare numbers on the Core2 65-nanometer because the shuffle instruction is really expensive in this processor. I'm comparing my worst case to the other cases, so I'm not picking an architecture where the shuffling instruction is really fast. I'm picking the worst possible for this.
So let's see what the numbers look like. First of all, I'm not trying to be misleading here. We don't have many points in the literature to compare to, so blue is our sequence of -- the blue lines are a sequence of 16 different fields. The red is the sequence of benchmarking data I could find in the literature, so of course it's not completely filled out. But I still compare some points. For example, for squaring, an interesting thing we see is sometimes we increase the field size and the time goes down. The reason for this is we are oscillating here between good and bad irreducible polynomials in terms of shifting, because the coefficient expansion is really negligible, so it doesn't make a difference. For example, we are doubling the field size here and the latency barely increases. The bit manipulation part is really efficient; the rest is just modular reduction. So we can see, for example, this is the [inaudible] 123 field. We have a huge speedup over Francisco's implementation, which uses pre-computed tables in memory, not vector instructions, for the squaring part. The square root is the same. Now [inaudible] by square-root friendly polynomial. Now we have an even stranger effect: we double the field size and square root gets faster, because now we are oscillating between different choices of the square root of Z. So this is a very, very good choice proposed, actually, by Mike Scott. And it's a very [inaudible] polynomial because all these shifts are aligned in a vector register. And so, again, we are oscillating between better or worse choices of the square root of Z. Again, coefficient manipulation is more expensive here but still negligible. >> : So in some cases square root is faster than squaring? >> Diego Aranha: In most cases we saw it was -- you can see that the squaring is dominated here; the square root is slightly higher. So square root is more expensive because we need more -- You remember the figures. It's more complicated. We need more shuffle instructions, but it's fairly close on the most recent platforms and code. So it's fairly close. >> : [Inaudible] faster. >> Diego Aranha: Yes, this point is faster because for square root -- multiplying by the square root of Z here is faster than, in this case, doing the modular reduction [inaudible] F of Z. So it's a very interesting choice of F. For pairing-based cryptography it's really useful. So now let's compare with the standard polynomials, which sometimes are not square-root friendly. In this case these are the not-square-root-friendly polynomials. And you can see that this is the 571 field size, which has a horrible, horrible polynomial to work with, because the square root of z is almost dense and we still have to do a modular reduction afterwards. So that's why the time goes really high. But in these more controlled cases here, we are still oscillating between better or worse choices of the square root of Z. And we still get a good speedup compared to the single point of the mpFq library. It's a 251-bit field size. So now the Lopez-Dahab multiplication. So this is the Lopez-Dahab implemented in mpFq and by Beuchat and others. And you can see also the quadratic cost explicitly in this graph. And basically what we save compared to these other approaches are the 4-bit shifts. So we only do 8-bit shifts. And the difference is mainly this. And also the way we organize the multiplier so we don't pay for 64-bit shifts either. So basically this is the savings in the shifting part by using the split form.
And now an internal comparison between the Lopez-Dahab and the shuffle-based multiplier: Lopez-Dahab is, for bigger field sizes, still much faster than the shuffle-based multiplier because the core operation is sparser, so we need more of this core operation to perform multiplication. But for really small field sizes, it's competitive. So if you need a compact implementation with constant memory consumption for a small field size, this algorithm could be useful. Which makes me think that [inaudible] maybe could be useful if you don't want to support a full native multiplier. But I never tried, and I am not sure if someone tried. So for 113-bit sizes it could be interesting for a compact implementation, a low-area implementation. I didn't put the timings for the native multiplier in this graph because this is a Core2 65-nanometer processor. It doesn't have the instruction, so it would be misleading. But you can just imagine a line here which follows this line and is twice as fast. For all the field sizes we tried, the native multiplier is twice as fast as the Lopez-Dahab multiplier. So it's [inaudible] faster but still not as fast as we wanted. So some observations. We efficiently formulate squaring and square root. The multiplication over squaring ratio is up to 34, which is very big. The classical ratio found in the literature is around 12 or 16. So of course this formulation is faster when the shuffling throughput is higher; so if the shuffling throughput is higher, as on Sandy Bridge or Nehalem processors, we gain improvements in squaring and square root as well. But still, the performance is heavily dependent on the choice of F of Z, because you either have to do modular reductions or multiply by the square root of Z. The shuffle-based multiplication has a problem with all the constants stored in memory, because the throughput is lower. Memory operands are most of the time slower. And to make it faster we would need a bunch of registers to store the table of constants and then do the operations, the lookups, on this table of registers. We don't have anything close to this in the current processors. But, I don't know, maybe in the future, or a specialized hardware design, could support these operations in this way. This is, actually, I consider it a victory. It looks bad, but I consider it a victory. It's only between 50 and 90% slower than Lopez-Dahab when it should be three times slower, because it requires three times as many core operations. For the other operations we didn't find times in the literature employing vector instructions, so I'm only saying that we restored the [inaudible] ratios we've seen in the literature. So when point halving was proposed, for example, half-trace was considered to be one multiplication. We got the new instruction to make the multiplication much faster and nothing explicitly for the half-trace operation. But with that organization of the half-trace we make it comparable to multiplication again. So in a sense it's twice as fast as the half-trace approaches proposed in the literature, but using more memory. Inversion by using the pre-computed tables is around 25 times the cost of a multiplication. So it's also close to the ratios we see even in textbooks; the Guide to ECC cites something close. So now let's illustrate these implementations with timings for elliptic curve arithmetic, for scalar multiplication. So now first I am comparing different implementations of side-channel resistant scalar multiplication on two different processors.
So let me give a lot of context here. I'm at the 128-bit security level here. This is a bit-sliced implementation by Bernstein. It computes several scalar multiplications in parallel, I think 1,024 multiplications in parallel, to reach this number. It's on the Core2 platform. And this is the time for a single scalar multiplication with a random point, of course, with the mpFq implementation, also on the Core2. So this was the speed record for either batched or non-batched scalar multiplication for some time. And by employing this field arithmetic on the Core2 processor we arrived at this number, which is much faster than the mpFq implementation because squaring and multiplication are faster, but still slower than the bit-sliced implementation by Bernstein. In the bit-slicing paper he tried to estimate whether a carry-less-multiplier-based implementation would be faster or not. So we did this for the Nehalem and Sandy Bridge architectures, and we got some considerable speedups. I'm not comparing to the execution of his code on Nehalem or Sandy Bridge because when I ran his implementation on the machines I have access to, I got numbers worse than this. So I think his implementation relies on some very specific features of the Core2 platform which do not exist on the newer ones -- not in the sense of instructions, but maybe cache alignment or code organization. So I couldn't replicate these numbers on the latest processors, and I couldn't do a proper comparison. But still -- and I don't want to implement everything just to see how it compares -- we got a considerable speedup, considering cycles only, which is not dependent on frequency, compared to the batched implementation. These are, of course, non-batched implementations of the same curve at --. Please? >> : Does side-channel resistant mean constant time? >> Diego Aranha: Yeah, that's a good question. So actually another thing I didn't say: that one has higher side-channel resistance than our implementations because it takes constant time. In our case we are using the Lopez-Dahab point multiplication algorithm, which is a variant of the Montgomery [inaudible]. And it has some resistance to simple side-channel attacks like simple [inaudible] analysis, and to timing analysis -- I think cache-timing analysis too. But it doesn't cover many side-channel attacks, so that one is in a sense more secure than this implementation. We are more interested, here, in performance than necessarily protecting against side channels. But, yeah, good question. Thanks. So now let's go to -- I'll just provide a few notes on more recent work. So actually the new formulation of squaring provides an interesting property, or application, for Koblitz curves. So recall that Koblitz curves have the Frobenius automorphism, tau, which is computed this way by just inexpensive squarings, which are made even faster by using the split form. And we can do scalar multiplication on Koblitz curves replacing all point doublings with applications of the Frobenius automorphism. An interesting thing is that if you can compute powers 2 to the k in constant time, independent of the value of k, this actually provides an endomorphism on the Koblitz curve. So we can go further or not in an iteration of the scalar multiplication algorithm. So let's take, for example, k to be m over 2, so we can represent our scalar multiplication in this way. And this looks just like the application of an endomorphism in the GLV context.
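The building block behind that remark is multi-squaring: since a maps to a^(2^k) linearly over GF(2), one can precompute, for each nibble position j and each nibble value v, the element (v(z) * z^(4j))^(2^k) reduced modulo the field polynomial, and then evaluate a^(2^k) as a sum of table lookups whose cost does not depend on k. A sketch under those assumptions, using hypothetical fb_* names and 64-bit digits, might look like this:

    /* Multi-squaring a -> a^(2^k) via a precomputed table T, where
     * T[j][v] = (v(z) * z^(4j))^(2^k) mod f(z) for every 4-bit value v.
     * The running cost depends only on the field size, not on k, because k is
     * baked into the table.  fb_t, fb_zero, fb_add and FB_DIGS (number of
     * 64-bit digits per element) are assumed, illustrative names. */
    void fb_multi_sqr(fb_t r, const fb_t a, const fb_t T[][16]) {
        fb_zero(r);
        for (int j = 0; j < FB_DIGS * 16; j++) {          /* one nibble at a time */
            int v = (int)((a[j / 16] >> (4 * (j % 16))) & 0xF);
            fb_add(r, r, T[j][v]);                        /* accumulate T[j][v]   */
        }
    }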
So we can do these two scalar multiplications in an interleaved way and remove further applications of the Frobenius automorphism. This is faster whenever this operation is faster than squaring consecutively k times, so you need k to be, for example, higher than 10 for this to happen. But of course we are speaking about m between 251 -- actually 283, to have proper Koblitz curves -- and 571. So in any real scenario this would be true. So, speaking in general terms, we can consider multiples of m over s as the power of the tau automorphism, and this is an analogue, for Koblitz curves, of an s-dimensional GLV decomposition. So we can fold this [inaudible] multiplication loop s minus 1 times and remove the corresponding number of applications of the Frobenius. You might say, "But squaring is really efficient on a binary curve. Why remove Frobenius applications?" The reason is that for some of these standard fields squaring is not that efficient, because the choice of polynomial is really bad. So actually we are saving modular reductions, not squarings. We are interested in saving modular reductions because the squaring part -- the coefficient expansion part -- is really negligible, as you saw. This of course has an impact on memory because we have to precompute the tables to do these [inaudible] squarings. But if you choose a nice addition chain for the inversion part, you can reuse the same table here. So that's exactly what we've done for the 283-bit curve. So this is how the scalar multiplication algorithm looks. It's a standard width-w tau-NAF algorithm. I would just point out: first we precompute online several small multiples of the point being multiplied. We apply this map to all of these points to get several different tables with the [inaudible] applied. And then we just deal with the high bits and do the interleaved loop here, by folding the loop s minus 1 times and using all of these tables. We could also apply this map during the addition part, but it turned out it was faster doing this beforehand than inside the loop. So, some numbers: this is unprotected scalar multiplication, in many different senses. It's really tricky to protect scalar multiplication on Koblitz curves against side channels because the recoding part is very branch-friendly, let's say; it requires many branches in several parts to be efficient. So we decided to only do unprotected scalar multiplication and leave a possible side-channel-resistant scalar multiplication as future work. So these are numbers for a standard curve, the NIST K-283, on the Nehalem and Sandy Bridge architectures. I'm choosing the window size to be 5 and I'm folding the loop just one time. If I fold more I have a trade-off between what I spend here and what I save here, so s equals 2 was the best scenario for a random point. If the point is fixed, of course, you can use a higher s and spend more memory to make this faster. So I'm comparing to the best implementations by Patrick on the same platforms; it's also a random point and unprotected scalar multiplication. And I get considerably higher speedups by using a Koblitz curve. Remember that my field multiplication here is much more expensive than it should be, considering the relative complexity of carry-less and ordinary integer multiplication. So if I had a carry-less multiplication instruction with the same cost in cycles, this would actually be much faster. And I have to say that Patrick has improved his implementations, so now this cost is 92 K cycles.
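Putting the pieces above together, the folded loop for s equal to 2 could be sketched roughly as below. Here eb_t, eb_set_infty, eb_frb, eb_add, the digit arrays and the two tables (small multiples of P and of tau^(m/2) applied to P) are all illustrative names, and recoding details such as digit signs and the final corrections are left out; this is a sketch of the idea, not the actual implementation from the talk.

    /* Schematic folded (s = 2) width-w tau-NAF loop: the low half of the
     * recoded digits is evaluated against multiples of P, the high half
     * against multiples of tau^(m/2)(P), so only about half of the Frobenius
     * applications remain in the main loop. */
    void koblitz_mul_folded(eb_t q,
                            const int *digit_lo, const eb_t *table_lo,
                            const int *digit_hi, const eb_t *table_hi,
                            int half_len) {
        eb_set_infty(q);
        for (int i = half_len - 1; i >= 0; i--) {
            eb_frb(q, q);                              /* q = tau(q): cheap coordinate squarings */
            if (digit_lo[i] != 0)
                eb_add(q, q, table_lo[digit_lo[i]]);   /* contribution of digit * P              */
            if (digit_hi[i] != 0)
                eb_add(q, q, table_hi[digit_hi[i]]);   /* contribution of digit * tau^(m/2)(P)   */
        }
    }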
So the Koblitz curve implementation is not the state of the art any more, but it was for a few minutes. But, anyhow, we have a considerably efficient implementation considering that we don't have support for binary field arithmetic as good as what we have for integer arithmetic. >> : Is that on the same core? >> Diego Aranha: Single core, yeah. >> : [Inaudible]... >> Diego Aranha: [Inaudible] just single core. So this was -- at least at the time -- it should appear at the next Latincrypt, and it was the first implementation to ever cross the hundred-thousand-cycle barrier. So now it's still the first but it's not the fastest. So, okay, let's summarize the results. So I presented a new formulation and implementation of binary field arithmetic. It follows this trend of getting faster shuffling instructions with higher throughput and improves results on the field arithmetic side by between 8 and 84%. This 84% speedup is squaring on the really big field; as you saw, there was a huge difference in latency. This formulation induces a new implementation strategy for multiplication which is not very useful at this point. It could still be useful for compact implementations of small fields, and if we had custom table-addressing features on the processor this could be interesting, but we don't have that. So, comparing our times with non-batched arithmetic on binary elliptic curves, we obtained the speed record for side-channel-resistant scalar multiplication on generic binary curves -- not considering the same side channels, but at least resistance to some side channels. In this case I'm referring to the bit-sliced implementation by Bernstein. And we believed this to be, to the best of our knowledge, a new speed record for scalar multiplication across all curves. So now -- how did you compute how much faster your results are, 7% or 8%? >> : [Inaudible]. >> Diego Aranha: Yeah. So I just have to rephrase this as scalar multiplication across all binary curves; it's still true if I only restrict to the binary curves. >> : If you run our computers at 84% will become 500%. And in this case... >> : [Inaudible]. >> : ...[inaudible] no ambiguity, right? You know that'd get negative time. >> Diego Aranha: Yes. Yes. In a sense, yes. But I prefer 84 so I don't have people raising their hands and saying, "Well, it's not actually 500, it's 84." But, yes, depending on the ratio you take it could be 500%. It's much faster; it's five times faster. So thank you for your attention. Please, Peter? >> : [Inaudible] work on binary Edwards curves? >> Diego Aranha: Yes, actually when I'm presenting this, that work is on binary Edwards curves. Our work is on the same curve but on the standard [inaudible] representation of the same curve, because then I have the algorithm I need for some side-channel resistance, which is the Lopez-Dahab point multiplication algorithm. But it's the same curve with just different representations. >> : [Inaudible] attack, side-channel attack on Edwards curves? >> Diego Aranha: Yeah, I've seen the [inaudible]... >> : You had AB plus BA, not AB plus AB. >> Diego Aranha: Yes. Yes, it was a very, very interesting attack because it looks so simple but it's not that simple in the details. Yes. So thank you for your attention. I can answer questions that you may have now. [ Audience applause ] >> : And does this all work on the AMD [inaudible] or do you have to [inaudible] to use a different [inaudible]? >> Diego Aranha: Very, very interesting question. The Bulldozer is a very weird processor, as you've probably heard.
They have a faster binary field multiplier, for example. It oscillates between 7 and [inaudible] cycles, so it is slightly faster than the Intel one. But everything else is slower, and it's the same thing with integer arithmetic too. So it's very strange, because we have a faster multiplication but slower everything else, so the timings are slower than this. But the [inaudible] initiative is benchmarking our code on the AMD. I think it compiled and ran successfully on the AMD, and the timings are worse than what I'm presenting here. But the [inaudible] are there; the same implementation techniques can be used. You only get slower performance because AMD targeted different applications, I think. >> : [Inaudible]. >> Diego Aranha: Thank you. [ Audience applause ]