>> Patrick Longa Pierola: Okay. Hello, everyone. It is my pleasure to
introduce today Diego Aranha. He is -- Well, let's give a little
background about him. He holds a Ph.D. degree from the University of
Campinas in Brazil. He is currently a professor at the Computer Science
Department at the University of Brasilia, and his expertise is in
cryptography, in particular the efficient implementation of cryptographic
primitives, most notably pairings and binary elliptic curves. Well, as a side note we
know that he became very famous, probably more famous than Neymar and
the Brazilian Soccer Team in Brazil because he led a team that found
vulnerabilities in the Brazilian voting machine. So please let's
welcome Diego Aranha. He will be presenting a talk on software
implementation of binary field arithmetic using vector instructions.
>> Diego Aranha: Thank you. Thank you, Patrick. Good morning. So it's
good to talk about arithmetic again because this voting machine stuff
is too easy. So this is joint work with Armando, who is here, Darrel
Hankerson, my ex-Ph.D. advisor Julio Lopez, Francisco Rodriguez and
Jonathon Taverne.
So I'll talk about how to use vector instructions to implement binary
fields. In my opinion this is one of the most fascinating aspects of
doing arithmetic of cryptographic interest, how to manipulate the math
to use some interesting computing resources. So I really like to do
this type of work. I think it's right on the border between the math
and the technology. So let's begin.
Binary fields are basically everywhere in cryptography. We can find
them when we deploy ECC or PBC systems or any curve-based cryptosystem.
Also some multivariate systems use binary fields to provide security
which is resistant to quantum computers. And even some block ciphers
have building blocks built on top of binary fields; for example, the
AES S-box is built on binary field arithmetic.
And we know that since binary fields started to become relevant for
cryptography, we have many algorithms and optimizations already
proposed in the literature. So one of the questions this talk wants to
answer is can we try to unify at least the best approaches we have in
the literature under the same formulation or framework? And if so, can
this formulation provide new ideas not only represent what's already
there but also provide some new ideas? So that's what I want to answer
with this talk.
So just to summarize: I will present a formulation of binary field
arithmetic which captures the state of the art in this field
using vector instructions of course. This formulation provides a new
algorithm for implementing multiplication on binary fields which is an
operation which is not commonly supported. So it's good to have other
approaches to implement binary field arithmetic. And also how this
formulation gives some other approaches for implementing operations
which, let's say, restore the ratios we had in the literature before
native multiplication support was introduced. So if the multiplier
becomes significantly faster, we have impacts on several algorithms --
for example, choosing between point halving or point doubling in binary
curves. And then we have some approaches to restore the classical
ratios, let's say, the conventional ratios, which make the previous
analysis still valid. So it has this side effect. And I'll present
some experimental results for scalar multiplication in two
different scenarios.
So let's just examine the arsenal we have. So the Intel Core
architecture has a vector instruction set called SSE or Streaming SIMD
Extensions. It started as a 128-bit instruction set, and it was
introduced actually on the Pentium 3, Pentium 4 family of processors.
But it has been improved a lot with every new processor family, and it
gained several instructions on the Intel Core architecture, especially
the 45-nanometer series of processors. In this same series they
introduced what they called a super shuffle engine, which is a unit in
the processor to accelerate some specific instructions which will be
very useful here. Also, in the Nehalem family, which is the second
iteration of the 45-nanometer series, they introduced native support
for binary field multiplication as this carry-less multiplier
instruction, which multiplies two 64-bit integers without taking
carries into account, which is basically binary field multiplication.
And in the more recent 32-nanometer series, they also expanded the size
of the vector registers from 128 to 256 bits.
But not all instructions are readily available on this new instruction
set. The floating-point instructions are, but the integer instructions
are mostly still not available. They will become available on the
second iteration of this new instruction set which is coming beginning
next year if I'm not mistaken. So let's see which instructions these
vector instruction sets provide to us so we can implement fast binary
field arithmetic.
So of course the first thing we need is an instruction to copy things
from and to memory. So this is the name of the instruction. If I
present an algorithm, this will be the mnemonic that we use in the
algorithm. And this is the cost of the instruction before and after in
different processor series. I'll make this more explicit depending on
the instruction.
So, for example, we have memory load and store, of course. This cost
depends on whether the operands are aligned at 128-bit boundaries in
memory. So if they are aligned, usually this instruction was faster. In
the latest processors it doesn't make a difference, but in the Core2
processors it was faster if operands were aligned. So overall it's a
good idea to put everything at aligned addresses. We have shifting
instructions; these are vector instructions, so they operate on
multiple objects in the same register, in the same way for every
operand. So we have instructions to shift simultaneously two 64-bit
integers inside a vector register but without propagating bits from one
64-bit integer to the other. So that's what I'm calling 64-bit bitwise
shifts. And you have left and right shifts, of course.
It will become clear later, but it's better to use these when the
shifting amount is not a multiple of 8, because we have a better
instruction to shift things when the shift amount is a multiple of 8.
So we have XOR, AND and OR of course. These work across the whole
128-bit vector register. We have two byte interleaving instructions
which take bytes alternately from two registers, from the lower or
higher parts of the two registers, and produce a new register as
output. So that's what I'm calling a byte interleaving instruction.
And we have these faster shifts when the shifting amount is a multiple
of 8. So it's a bytewise shift, so the amount must be a multiple of 8.
And now we have propagation of bytes from the lower part to the higher
part of the register -- of bytes, not of bits.
We also have this byte shuffling instruction, which permutes the order
of the bytes in a register. A memory alignment instruction to make
operands aligned if they are not, so it's just for convenience. And in
the latest processors we have this carry-less multiplication
instruction.
So after the super shuffle engine was introduced, these three
instructions became a single cycle instruction. Yes?
>> : What's the number in parenthesis? You have 8, 10...
>> Diego Aranha: Yeah, that's what I'm saying right now. So this is the
cost of the instruction before the super shuffle engine was introduced.
So they used to cost 2 or 3 cycles, and now they cost a single cycle for
these 3 instructions. So, please?
>> : Yeah, so all these costs are in cycles I presume. But also...
>> Diego Aranha: Yes.
>> : ...to measure a cost you also need to know how often you can
dispatch this instruction...
>> Diego Aranha: [Inaudible].
>> : ...[inaudible] instruction or once every five instructions
[inaudible]....
>> Diego Aranha: Yes.
>> : So do you take that into account when --?
>> Diego Aranha: Yeah, in this table, no. But at this time I was only
considering the latency, and the throughput was just one instruction
per cycle. I know that in more recent processors this throughput got
higher, but I'm not entering into this detail. Well, it will become
clear in the later slides, but at first I'll just show how using these
instructions is better compared to the state of the art. And then any
improvement given to these instructions will just reduce the cost by a
constant. So it won't be as important. At first I just want to prove
that these instructions are good for binary field arithmetic. That's my
first point.
And also we have the carry-less multiplier. It can cost, on the first
iteration, the Nehalem architecture, between 10 and 16 cycles if you
have two multiplications with independent operands. It causes
[inaudible]. So throughput plays a role here. But usually you can
organize code so you can exploit this throughput. So in this line here
I'm already considering that throughput can be exploited. And in the
[inaudible] processors the instruction cost is between 8 and 14 cycles,
which is very interesting because this operation is much simpler than
an integer multiplier. It's just integer multiplication without the
carries. But an integer multiplication costs 3 cycles. So we have a
much simpler operation which costs 3 times more. Yeah? Probably for
market reasons. I don't know exactly why, but probably it's something
related to -- maybe an area/performance tradeoff in the processor. I
don't know the reason. Please?
>> : If you ever talk to Intel engineers, they want to look good on
standard benchmarks. There is no carry-less multiplier in the standard
benchmarks. There is [inaudible] on the standard benchmarks.
>> Diego Aranha: Yeah, we should introduce [inaudible]. A binary field
multiplier in the benchmarks.
>> : Yeah. So what you're saying is multiplication takes three cycles,
regular multiplication. So you...
>> Diego Aranha: Yes.
>> : ...mean a SIMD multiplication or regular multiplication.
>> Diego Aranha: Regular integer multiplication using 64-bit integers
like on these standard x86-64 registers, takes three cycles.
>> : [Inaudible] SIMD?
>> Diego Aranha: Not in SIMD just using the...
>> : But this is SIMD so...
>> Diego Aranha: This is SIMD.
>> : ...[inaudible] makes more sense to compare SIMD to SIMD, right?
>> Diego Aranha: But we don't have a 64-bit by 64-bit integer
multiplier. Actually we have -- on the Pentium, I remember, SSE2 had
one. But still, if I'm not mistaken -- because I've never used this
instruction -- we have a 64-bit times 64-bit SIMD instruction but it
only does one multiplication at a time. It is not parallel like a
SIMD-style instruction. Exactly as with this one, we have two 64-bit
integers per vector register but we can only multiply a pair of
integers at once. We cannot do things in parallel as with additions or
other instructions.
And I think that the integer counterpart is faster than the SIMD
counterpart if you are only considering SIMD integer instructions. So I
think still using these standard registers is faster in the integer
case as well. But what I was -- I think an important point is this
operation is much simpler than the other. So naively you could think
that we should pick the integer multiplier, take the carry logic out
and we should have a binary field multiplier in just two cycles, for
example, or even one cycle because the carries are the big problems.
But of course not everything works like [inaudible] say, so it's not
that simple. But this...
>> : Just to go a little farther, we were experimenting with Armando on
AVX to do prime field arithmetic and so far AVX is -- so using
basically the multiplier SIMD-style for prime field arithmetic it's
slower than regular integer multiplication. That's for sure. What will
happen with AVX2, still unknown. But AVX2 is still coming next
[inaudible]. So, so far, yeah. It's...
>> Diego Aranha: So it's still...
>> : ...[inaudible] experiment...
>> Diego Aranha: I never benchmark [inaudible]...
>> : ...integer multiplier is still more efficient for [inaudible].
>> Diego Aranha: Because it's on the standard benchmark. So they want
to make this thing as fast as possible because it looks good.
>> : [Inaudible] just a little comment that as you said [inaudible]
costly integer [inaudible] that this could be the effect of the area
tradeoff in the chip. So carry-less [inaudible] instructions that are
used for all the SSE instructions. So it's in another unit core
[inaudible]...
>> Diego Aranha: Yeah, I always thought that was the reason but it's
just that -- Please, Peter?
>> : There may be a big lead for the standard multiplier and subscript
complications.
>> Diego Aranha: Sorry, can you -- I didn't [inaudible].
>> : [Inaudible] in the calculations subscripts.
>> : Yeah, so you need fast multiplication. So it might -- [Inaudible]
because I was really surprised to [inaudible] that regular
multiplication takes three cycles. On the fastest machines, if I do 64
times 64 to 128 bits, it always takes at least 10 to 15 cycles.
>> : That's what I’m talking about.
>> : If you only want the lower 64 bits, that will take [inaudible]...
>> Diego Aranha: Oh, okay. Okay.
>> : And I would make one point.
>> Diego Aranha: Okay.
>> : If you write C code, you cannot get to those 64 bits. So, again,
it doesn't appear [inaudible]...
[ Multiple inaudible voices continue the conversation ]
>> : We're discussing assembly instructions.
>> Diego Aranha: Yeah, [inaudible].
>> : I don't know about that because I have been -- I read this, it's
in the Intel manual...
>> : So only if you put...
>> : ...[inaudible]...
>> : ...[inaudible] if you get -- if you think that the lower part...
>> : The lower part becomes available after three or four cycles. In
the upper part on [inaudible] where eight...
[ Multiple inaudible voices continue the conversation ]
>> : Oh, I really don't know about that. I'm not sure because -- Yeah,
actually my [inaudible] makes sense more with the three...
>> : This is according...
[ Multiple inaudible voices continue the conversation ]
>> : Run a little loop and measure it.
>> : Yeah, yeah. That's --.
>> Diego Aranha: What I am -- Just to be clear here, what I'm
[inaudible] is not from official Intel documents. It's from the
[inaudible] instruction tables. And he's been benchmarking instructions
for a long time. He is really complete, with different processor
families and everything. He is really complete. So we can even try
these later, but I would love to see this cost ten cycles and lose my
point here, because I have been insisting for a long time that these
instructions should be made faster.
Maybe it's the way also we use the integer multiplier in a [inaudible]
organization that you kind of hide these ten cycles. You first need the
lower part to accumulate maybe with an addition right afterwards. But
we can do some tests later. So let's proceed.
So let's discuss these. First, most of these instructions are really
simple to understand. I'll just speak a little bit more about these two
instructions because at first they look really useless. But if you take
a more careful look, they are actually very, very useful.
So the first one is the byte shuffling instruction. As I said, it just
permutes the order of bytes inside the register. So this is the name
of the instruction. This is the intrinsic. I'm not writing assembly
directly; I'm doing intrinsics-based programming. They are function
calls at [inaudible] but they are translated to a single or a very
small number of instructions. So it makes your life easier because many
times the compiler -- at least [inaudible] -- is becoming really good
at scheduling registers and doing these things.
And it helps [inaudible] between the compilers, with the Intel compiler
too. So it was better for us -- especially at that point. Nowadays I'm
not completely sure. You lose some control by using intrinsics but
still --. So this instruction works like this. We have a
source register, which is a sequence of bytes, and a mask. And you copy
bytes from the source to the result register according to the mask. So,
for example, we have zero in the first three bytes, so we copy the
first byte to the first three bytes of the output. So it's just copying
bytes according to the mask up here. But the interesting part here is
if you transform this into a table of pre-computed values, 16
pre-computed bytes, and you consider this mask as a sequence of 4-bit
indexes, you do 16 simultaneous table lookups. So the real power of
this instruction, which looks very useless at first, is that you can
implement in parallel any function which takes 4 bits as input and
produces 8 bits as output. Okay? Is it clear?
So just to present an example, let's take a bit manipulation example.
I'm interested in picking 4 bits and expanding them by inserting zero
bits between these 4 bits. And this is very common in binary field
arithmetic: you have to manipulate things at the bit level, which is
hard in software because you have to do all this shifting and masking.
So with this instruction you can pre-compute -- evaluate this simple
function for the 16 possible inputs and store this in a register. And
the other register, the mask register, stores the indexes to look up in
this table. So at the end you evaluate this function 16 times in
parallel.
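For illustration, here is a minimal C sketch of this trick using the SSSE3 intrinsic for the byte shuffling instruction. It is not code from the talk; the function name and table layout are hypothetical, and the table evaluates exactly the example function just described (insert a zero bit after each of the 4 input bits).

```c
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* 16 simultaneous 4-bit -> 8-bit table lookups with one PSHUFB.
   Each byte of idx holds a value 0..15; each output byte is that
   nibble with zero bits interleaved (e.g. 0b1011 -> 0b01000101). */
static inline __m128i expand_nibbles(__m128i idx) {
    const __m128i table = _mm_set_epi8(
        0x55, 0x54, 0x51, 0x50, 0x45, 0x44, 0x41, 0x40,
        0x15, 0x14, 0x11, 0x10, 0x05, 0x04, 0x01, 0x00);
    return _mm_shuffle_epi8(table, idx);   /* byte i := table[idx[i] & 0x0F] */
}
```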
>> : Is this similar to the ancient IBM PR instruction?
>> Diego Aranha: I don't know because I don't know this instruction.
Although, I'm talking about Intel instructions here, but I've seen
other examples of this instruction in other instruction sets. So it's
perfectly possible. I know the AltiVec unit [inaudible] had a very
similar instruction with a different name. Like, well, it's [inaudible]
bytes or something not byte shuffling.
So you can see that I'm here operating -- If you consider the indexes
only, I'm doing 4-bit granular arithmetic in a sense. So my whole
formulation of binary field arithmetic will try to represent things in
terms of sets of 4 bits so I can use this instruction as much as
possible, especially because I know this instruction is becoming faster
with time. So I want to do things at 4-bit level or 4-bit granularity.
The other instruction is the memory alignment instruction, which looks
even more useless because you can at first make everything aligned and
not even need this alignment instruction. What this instruction does is
it receives two registers and an offset as inputs, and it extracts a
128-bit, or 16-byte, section of the concatenation of these two
registers and produces this as output. So what this really implements
is a multi-precision bytewise shift. So you can use the instruction for
left or right multi-precision shifts because it's implicitly
propagating bytes from one register to the other.
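As a small sketch of this use of the instruction (not from the talk; names are hypothetical), shifting a 256-bit value held in two registers left by one byte, with the byte propagated across the register boundary:

```c
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (PALIGNR), SSE2 shifts */

/* Multi-precision bytewise shift: shift the 256-bit value (hi:lo) left by
   one byte (8 bits).  The PALIGNR immediate must be a compile-time constant. */
static inline void shl256_one_byte(__m128i *hi, __m128i *lo) {
    *hi = _mm_alignr_epi8(*hi, *lo, 15);   /* top byte of lo flows into hi */
    *lo = _mm_slli_si128(*lo, 1);          /* lo shifted left by one byte  */
}
```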
>> : What machine supports misaligned? Is this faster than writing
misaligned and reading misaligned?
>> Diego Aranha: On the Core2 processors it was faster to deal with
aligned operands. On the Sandy Bridge processors it doesn't make a
difference. So it's the same cost reading or writing aligned or
misaligned operands.
>> : So a misaligned read following an aligned write would give you
the same semantics?
>> Diego Aranha: Yes.
>> : Is it faster or slower?
>> Diego Aranha: Probably it's the same because implicitly I assume
that the processor handles this instruction and then does the operation
with aligned operands. I assume so, but I never tried.
>> : But sometimes using misaligned data produces crashes in the...
>> : Misaligned what?
>> : Misaligned data [inaudible].
>> : Yes.
>> : Produces crashes in the program sometimes.
>> Diego Aranha: It produces crashes if you, for example, call an
instruction with a memory operand and it's not aligned. So this also on
the first ones -- I didn't try -- Because I used everything aligned, I
didn't try it on the recent processors to see if this [inaudible] still
gives crashes. But on the first ones you had to have operands aligned
or align them somehow, ask the compiler to align them or use this
instruction. So, no, this I didn't try anymore because, well, it --.
>> : I will just add that reading and writing misaligned can be written
in C. This...
>> Diego Aranha: Yes.
>> : ...[inaudible] can be written in pure C.
>> Diego Aranha: Yes. Yes. There are compiler extensions for that. But
in any case, the instruction was introduced to pick something
unaligned, make it aligned so you wouldn't have crashes or performance
penalties. But nowadays it's not really useful anymore because it
doesn't make a difference if the crashes are not a problem. But we can
still use it for fast multi-precision byte shifting.
So now let's just fix some notation. So I will present my binary fields
with the irreducible polynomial usually chosen as a trinomial or
pentanomial, of course. I will use polynomial basis because we have a
polynomial binary field multiplier, so there is no sense in using
normal basis on this processor. In some cases, yes, but not -- it's out
of scope here. I'm only dealing with polynomial basis.
In software we just represent a binary field element as a sequence of
words. I'm calling this number n. It's just the ceiling of the degree
of the binary field divided by 64, so a contiguous set of n 64-bit
words. I'm considering n to be even just to make things simpler, but of
course n is not always even. If it's even, it's easy to determine the
number of vector registers: just n over 2, of course. And graphically
in this talk I will present a binary field element in this way, so it's
just a sequence of n 64-bit integers.
So first of all let's talk about the proposed representation. So as I
said I want to use the byte shuffling instruction as much as possible.
So I really need 4-bit granular arithmetic. And this representation is
just splitting the binary field into two other binary field elements
represented with 4-bit groups or 4-bit sets. So suppose this is a
vector register composing a binary field element, you just split them
into higher and lower 4-bit groups by applying a simple mask and a
shift.
So you do this. And then you can see that by looking at this and this,
I have the 4-bit sets exactly on the positions I need to use the byte
shuffling instruction. So that's why we do it this way. Of course it's
very easy to convert to this split form just applying masks and a shift
and very easy to convert back: you just shift the higher part and add
the two. I will not always explicitly convert things to this
representation because sometimes it doesn't make sense. Sometimes I'll
just implicitly process these 4 bits in the algorithms. And it doesn't
make sense to keep things stored in this split form because it wastes
more memory: for every binary field element you have, you use twice the
memory. So in the algorithms, when I need it, I convert to split form,
do the arithmetic I need and then convert back at the end.
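A minimal sketch of this conversion for one 128-bit chunk, assuming SSE2 (not from the talk; the function names are hypothetical):

```c
#include <emmintrin.h>   /* SSE2 */

/* Convert one 128-bit chunk to the split form and back. */
static inline void to_split(__m128i a, __m128i *lo, __m128i *hi) {
    const __m128i nib = _mm_set1_epi8(0x0F);
    *lo = _mm_and_si128(a, nib);                       /* low 4-bit sets  */
    *hi = _mm_and_si128(_mm_srli_epi64(a, 4), nib);    /* high 4-bit sets */
}

static inline __m128i from_split(__m128i lo, __m128i hi) {
    return _mm_xor_si128(lo, _mm_slli_epi64(hi, 4));   /* lo + (hi << 4)  */
}
```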
So let's start with addition and subtraction, which are the same thing
in a binary field. This is of course the easiest operation on a binary
field: it's just XOR. But the interesting part is, as we were talking
about throughput, you usually implement addition with the largest
operand size. In these processors, in the Sandy Bridge processor for
example, we have three options. We can either do the exclusive OR at
64-bit granularity, use the SSE operation or the AVX operation. The
thing is, the AVX looks faster because it deals with a bigger number of
bits, so you need fewer additions to add two binary field elements, but
the throughput is lower. You can do three 128-bit additions per cycle
if everything is not dependent on each other, the operands and results,
and the operands are not in memory. But you can only do one 256-bit
operation per cycle. So according to our tests, it's better to use the
SSE instructions instead of the AVX instructions.
I don't know about AVX2, whether they will change the throughput ratios
or not. So it's not always -- there will be cases where it's better to
use the larger or wider instruction.
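For reference, a minimal sketch of the addition itself with the SSE option (not from the talk; the function name is hypothetical):

```c
#include <emmintrin.h>   /* SSE2 */

/* c = a + b in a binary field: one XOR per 128-bit register
   (n/2 registers per field element). */
static inline void fb_add(__m128i *c, const __m128i *a, const __m128i *b,
                          int nregs) {
    for (int i = 0; i < nregs; i++)
        c[i] = _mm_xor_si128(a[i], b[i]);
}
```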
So let's talk about squaring which is much more interesting than
addition. So if a binary field is representing this way in polynomial
basis, we know that squaring is just doubling the -- inserting zeros
between each consecutive pair of coefficients. So if we have a binary
field represented by the sequence of coefficients here, the square of A
is just this with zeros inserted between each -- inside each pair of
coefficients or contiguous coefficients. And of course this has to be
[inaudible] reduced with the irreducible polynomial to reduce things to
the standard size.
This operation, if you implement it naively, it's going to cost a lot
of bit manipulation. You have to isolate these bits and apply masking
and shift them. So the way to do this usually is to pre-compute a small
table of results and use this table to expand the coefficients in this
way.
So squaring is also a linear operation. So let's see how squaring
behaves with what we call the split form, so just distributing the
square operation in that formula. So what this means is you can square
a binary field element by just squaring the lower 4-bit sets and the
higher 4-bit sets and accumulating them with an 8-bit shift. And you can do
this of course with a lookup table. So for all the 16 possible
combinations of 4 bits, you can pre-compute this table of bytes with
the zeros already inserted. It's just a 16-byte table. And it does this
much more efficiently than isolating and masking the bits individually.
Of course this looks perfect to apply that byte shuffling instruction
already, so that's what we're going to do now.
So let's take one register of the binary field as input. Let's convert
it to a split form, so applying the mask and then shifting to the right
4 bits to position these bits in the place we want them. And then we
can apply the byte shuffling instruction which is really working as a
simultaneous lookup instruction here, and expand these 4 bits into 8
bits here. Also we need to accumulate these two expanded results here
with an 8-bit offset. So what we do is we use the byte interleaving
instructions to take the bytes alternately from one and the other. So
we are simulating an 8-bit shift and accumulation. So basically this is
the result, and we are implementing this explicitly.
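A minimal C sketch of this squaring step for one 128-bit chunk, combining the split-form masking, the simultaneous table lookup and the byte interleaving just described (not from the talk; names are hypothetical, and the 256-bit output still has to be reduced modulo f(z)):

```c
#include <tmmintrin.h>   /* SSSE3 */

static inline void fb_sqr_chunk(__m128i a, __m128i *out_lo, __m128i *out_hi) {
    const __m128i table = _mm_set_epi8(   /* nibble -> nibble with zeros interleaved */
        0x55, 0x54, 0x51, 0x50, 0x45, 0x44, 0x41, 0x40,
        0x15, 0x14, 0x11, 0x10, 0x05, 0x04, 0x01, 0x00);
    const __m128i nib = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_shuffle_epi8(table, _mm_and_si128(a, nib));
    __m128i hi = _mm_shuffle_epi8(table, _mm_and_si128(_mm_srli_epi64(a, 4), nib));
    *out_lo = _mm_unpacklo_epi8(lo, hi);   /* bytes 0..7  of a, expanded */
    *out_hi = _mm_unpackhi_epi8(lo, hi);   /* bytes 8..15 of a, expanded */
}
```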
So let's now talk about square-root extraction, which uses the same
ideas but is slightly trickier. So this is a classical algorithm by
Fong, Lopez and others. So you can take a square-root of a
binary field element by isolating the bits with even and odd indexes
and accumulating them with the odd part multiplied by the square-root
of z.
The square-root of z is a fixed constant, so depending on the choice of
F this can be a dense or a sparse element. We prefer it to be a sparse
element. When it's sparse we call it a square-root friendly field
because this multiplication here only costs us some shifted additions,
so it's faster.
>> : Is there -- Sorry. Is there a certain [inaudible] or something
that will make these things fast?
>> Diego Aranha: Yes, if the nonzero coefficients of F of z all have
odd degrees, you have a square-root friendly polynomial. And there are
several papers in the literature detailing the structure of these F's,
so a few papers. There is even a hierarchy-like [inaudible] divided
[inaudible] I think by Scott if I'm not mistaken. But some of these
[inaudible] polynomials are square-root friendly, some of them are not.
But I will also present numbers with and without square-root friendly
polynomials.
So even when you don't have a square-root friendly polynomial and the
square-root of z is dense, depending on the multiplication algorithm
you can still pre-compute part of this multiplication. So for one of
them you can still do it. And this
usually costs -- Well, this is [inaudible] precision too. Actually this
is [inaudible] precision, so this doesn't usually cost a full
multiplication even in the worst case.
But we know that square-root is also linear because squaring is linear,
one being the inverse of the other. So we can apply this square-root
operation to the split form. Of course we get this expression here. So
as in squaring, we can compute a square-root by converting to split
form, applying the square-root operator to the two groups of 4-bit
sets, and accumulating them with a 2-bit shift.
If we use the same formula to expand these two formulas, we arrive at
this which looks complicated but it's really simple. So what we need to
do is for each of the two operants or results in the split form
representation, we have again to isolate the bits with even and odd
indexes and combine all of them in this way, also applying these 2-bit
shifts and then the multiplication by the square-root of z.
And as I said we should choose F -- It's better to choose F -- so the
square-root of z is really sparse. So this is just a bunch of shifted
additions. So this is what the square-root looks like. So as before, we
receive as input a part of a binary field element. First of all I
shuffle the bytes in that register so that the odd and even parts are
easier to collect at the end. So there is a small trick here which kind
of violates this split form, but it's a trick. So with this permuted --
byte-permuted -- operand, we convert this to split form. So actually we
don't have properly this split form but a permutation of this split
form, if you are pedantic. And so we can, again, use the simultaneous
table lookup instruction to isolate the even and odd -- the bits with
even and odd indexes.
And that 2-bit shift can be already embedded in one of the tables. So
in one side we do this and this because we need them multiplied by z
squared or shifted to the left by two. And on the other side we compute
this and this, the even and odd parts of AL or the even and odd part of
AH. So at the end we have all the 2-bit sets already isolated with the
nice shifting applied to this side. We can add them and recover the
4-bit sets corresponding to A-even and A-odd. Again, we convert back
from the split form for the registers at the end because this is also
out of order, as you can see.
So at the end we have the bits with even indexes, the 64 bits, at one
side and the bits with odd indexes at the other side. So basically we
are using the same lookup instruction but now to isolate bits at even
or odd indexes.
Multiplication now has three different
strategies. The first one is the Lopez-Dahab comb method. It was
introduced in 2000 I think, and it was the fastest approach to multiply
binary field elements on many platforms, especially when platforms
didn't have any support for native binary field multiplication. I will
also talk about a new approach to multiply binary fields using shuffle
instructions, the byte shuffle instruction, and some considerations
about how to better use the carry-less multiplier.
So Lopez-Dahab multiplication works like this: suppose that it's easy
to compute a small polynomial times a field element by using just
shifted additions, that you have this as a building block -- you can
compute it explicitly by shifted additions. You can perform
multiplication by processing one of the operands, considering it as a
sequence of small polynomials, and then accumulating the results of
these products in a schoolbook way, let's say, just with the
corresponding shifts applied.
So in this case here we already have one trick. One thing you can do in
this algorithm is jump the offsets you consider in one of the operands,
for example to every 8 bits, so you don't have shifts by 4 bits. You
can consider only the lower parts of each byte, accumulate all of them,
do a 4-bit left shift and then do the other part. So you can do this to
remove the 4-bit shifts, at least most of the 4-bit shifts.
And the interesting part of this algorithm is of course that many of
the small polynomials you encounter when you are processing A are
repeated. So you can pre-compute several of them. The usual choice is
to consider u to have degree less than 4. So it's just a set of 4 bits,
and then you build a table of 16 pre-computed values and you can just
accumulate these values and apply the corresponding shifts.
At the end you have a double-precision polynomial and you can reduce it
to get a proper field element. So this is the Lopez-Dahab comb method. So
if you consider that one of the operands is represented in that split
form, so in this case it's A; B is still the same and A is expanded in
this split form. And if you distribute this operation, you get this.
So what it basically says is if you build two tables for two different
multiplications, you remove the shifts completely. So if you build a
precomputation table for B and for B shifted to the left by 4, you can
reduce this multiplication to two simple Lopez-Dahab multiplications.
So it's basically another way of doing the same trick I presented here,
but just using the split form.
And this is well-known. It was already suggested in the original paper;
they already have this because not many processors have good 4-bit
shifts but many processors have good 8-bit shifts. In some cases it's
even for free. If you are on an 8-bit microcontroller, of course
shifting by 8 bits is for free. So the core operation of this algorithm
is accumulating these types of products, the results of these types of
products, multiplying small polynomials by dense polynomials. This is
the core, and we repeat this operation many, many times.
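As a plain scalar sketch of that core building block (not the vectorized version from the talk; names are hypothetical), accumulating u times a dense element b into r, where u has degree less than 4 -- Lopez-Dahab precomputes u*b for all 16 values of u instead of recomputing it like this:

```c
#include <stdint.h>

/* r (n+1 words) ^= u * b, with u a polynomial of degree < 4 and b n words. */
static void addmul_small(uint64_t *r, unsigned u, const uint64_t *b, int n) {
    for (int k = 0; k < 4; k++) {
        if ((u >> k) & 1) {
            uint64_t carry = 0;
            for (int i = 0; i < n; i++) {
                r[i] ^= (b[i] << k) ^ carry;               /* shifted addition */
                carry = (k == 0) ? 0 : (b[i] >> (64 - k)); /* bits crossing words */
            }
            r[n] ^= carry;
        }
    }
}
```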
So now this is what the algorithm looks like. So we pre-compute two
tables online. The first table, as I said, is based on B. The other
table is based on B shifted to the left by 4 bits, for all possible
choices of u where u has only 4 bits, or degree less than 4. So we
accumulate all the intermediate results in a register window so we
don't do operations into memory, so we exploit the --. Actually we are
still using a table we stored into memory to [inaudible] the values
that we accumulate, so we don't enjoy the benefits of the higher
throughput.
>> : Is it similar to the [inaudible] algorithm for binary matrices?
>> Diego Aranha: I think so, yes. I remember reading a paper dealing
with this similarity. I think so. And it has been proved too. I think
Bernstein has done some work on this algorithm, the [inaudible].
So let's continue. And then I split it up into two different loops. The
reason for that is in the first loop I process things with 128-bit
offsets, so I don't have to do 64-bit shifts. And also, in the first
lines of the two loops I process, I think, AL, and in the second lines
AH, also with this offset. And by using these two tables I can
eliminate the 4-bit shifts; I only need to do 8-bit shifts. So I can
use the memory alignment instruction here to do the 8-bit shifts with
byte propagation in a faster way. That's basically how the algorithm
works. And since I'm splitting it into two loops, by just reordering
the registers I can do 128-bit shifts, so I don't have to do 64-bit
shifts explicitly. And at the end, of course, I have to reduce mod F of z.
So, okay, this is all also well-known; the only thing that's new is
doing it in the context of vector instructions. But now let's see what
happens if both the multiplicand and the multiplier are represented in
this split form. So the expression looks like this. Using Karatsuba we
can reduce it, of course, to three multiplications. So we can do a
multiplication by multiplying AL and BL, AH and BH, computing these
additions -- actually these additions. This multiplication and
additions with the intermediate results.
And also accumulate all these intermediate values with an 8-bit shift
and a 4-bit shift. Karatsuba is usually much better in binary fields
than in prime fields because we don't have carry propagation. Addition
is really fast, so it makes much more sense to use Karatsuba here, even
for small field sizes. Karatsuba is only useful in prime fields, I
think, over like 2000-bit fields or something like this currently,
because of the ratio between addition and multiplication. And addition
can cost, I think, up to two cycles in Sandy Bridge. So it's almost the
cost of an integer multiplication, modulo the differences we've been
seeing in the instruction latencies.
>> : Also in [inaudible] field.
>> : In [inaudible] it makes sense to use Karatsuba?
>> Diego Aranha: It makes sense. Yes, yes. So, yeah, let's only think
about prime -- large characteristic fields. So now we are multiplying a
pair of sparse polynomials. So if you think of the Lopez-Dahab method
or if you look at this operation explicitly, what you are doing is
multiplying a small polynomial by a sparse polynomial so a 4-bit set by
a sequence of 4-bit coefficients. So this is what the algorithm looks
like. It basically uses the same tricks to remove the 4-bit shifts. The
only difference is that I no longer have online pre-computed tables. I
have a table of constants which pre-computes these values for
all possible choices of these small polynomials. So I have a table of
sixteen 128-bit values. So it's a table of constant size. And the core
operation of the algorithm is just using the shuffling instruction to
do the multiplication of this 4-bit set by each of these 4-bit sets in
just one operation because it's a table lookup again. So basically
that's how the algorithm works. So we'll analyze trade-offs and when
this could be better or worse later.
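A minimal sketch of that core step (not from the talk; names are hypothetical): one shuffle multiplies a fixed 4-bit polynomial u by sixteen 4-bit coefficients of the other operand at once. The per-u rows of products form the 256-byte constant table mentioned above; realigning and accumulating the partial products is not shown.

```c
#include <tmmintrin.h>   /* SSSE3 */

/* tab_u[v] = carry-less product u*v for v in 0..15 (deg(u*v) <= 6, so it
   fits in a byte); nibbles holds sixteen 4-bit coefficients, one per byte. */
static inline __m128i mul_nibbles_by_u(__m128i tab_u, __m128i nibbles) {
    return _mm_shuffle_epi8(tab_u, nibbles);   /* byte i := u * nibbles[i] */
}
```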
To use the carry-less multiplier we have two different options. We can
either consider operands at a 128-bit granularity. This is good if you
want to reduce the number of required registers, because then you can
store all the intermediate values in a smaller number of registers than
when considering operands as 64-bit values, because the instruction is
a SIMD instruction. So you still have to use a vector register to store
the operands even if they are only 64 bits wide.
At this granularity you multiply digits of 128-bit size, and then you
can do each of these 128-bit multiplications with just three carry-less
multiplications.
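A sketch of that 128-bit digit multiplication with one Karatsuba level, assuming the PCLMULQDQ intrinsic is available (not from the talk; the function name is hypothetical):

```c
#include <emmintrin.h>
#include <wmmintrin.h>   /* PCLMULQDQ */

/* 128 x 128 -> 256-bit carry-less product with three PCLMULQDQ invocations. */
static inline void clmul_128(__m128i a, __m128i b, __m128i *lo, __m128i *hi) {
    __m128i t0 = _mm_clmulepi64_si128(a, b, 0x00);          /* a0 * b0 */
    __m128i t2 = _mm_clmulepi64_si128(a, b, 0x11);          /* a1 * b1 */
    __m128i am = _mm_xor_si128(a, _mm_srli_si128(a, 8));    /* a0 ^ a1 */
    __m128i bm = _mm_xor_si128(b, _mm_srli_si128(b, 8));    /* b0 ^ b1 */
    __m128i t1 = _mm_clmulepi64_si128(am, bm, 0x00);
    t1 = _mm_xor_si128(t1, _mm_xor_si128(t0, t2));          /* middle term */
    *lo = _mm_xor_si128(t0, _mm_slli_si128(t1, 8));         /* low 128 bits  */
    *hi = _mm_xor_si128(t2, _mm_srli_si128(t1, 8));         /* high 128 bits */
}
```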
>> : You mentioned you have to [inaudible] put the full 128-bit
register in it doing 64-bit arithmetic. Now can it at least attack or
do two such operations in parallel using the top and bottom [inaudible]
advantage?
>> Diego Aranha: The problem is you have to -- You can do it and you
enjoy some throughput gain, but the problem is the operation won't be
executed in parallel like a SIMD instruction where you add the two 64
bits in parallel. So you have to call the instruction once, call the
[inaudible], and some work will be shared because, as I was saying, the
latency goes down by four cycles. So I speculate that this is just the
time for decoding the instruction and some pre-fetching of the
operands, not really doing the [inaudible]. I speculate, but I don't
have concrete evidence of that because -- Well, in the [inaudible] we
lost the same amount of cycles in the two situations, between the two
processors. I just think it's the decoding part or the pre-fetching of
operands; that's why we have a constant difference.
So if you consider this granularity you really should use the maximum
number of Karatsuba levels for the N over 2 digits. So for example this
is a good organization if your binary field element takes four vector
registers, or usually a product of twos and threes, like 2 to the K
times 3 to the L, for example, because then you have nice Karatsuba
formulas for each of these steps. But sometimes we have a better
formula if you deal with 64-bit granularity. But then you have to deal
with a higher number of registers, because you need to store a 64-bit
value in your vector register. What we do is occupy a 128-bit-wide
register with two useful, let's say, useful 64-bit integers which will
be needed. This has been proposed in the gf2x library by Emmanuel Thome
and others. And you can do this to do 64-bit additions in parallel and
to reduce the number of registers to store intermediate values, because
the faster formulas at this granularity usually have a higher number of
intermediate values. Some of them were proposed by Peter Montgomery,
for example.
And then of course this works if you have a formula with a lower number
of multiplications, so then this makes sense. If you can reduce the
number of carry-less multiplications according to this. In some
situations we have seen that this organization provides a higher
throughput probably because we have this nice structure here where all
the three multiplications are together. So it's easier for the compiler
and the processors to schedule these operations together. But this is
more like a set of guidelines; we don't have very concrete statements
on this. It's just experimentally for some field sizes, one is better
than the other and probably the throughput is the difference or the
number of multiplications in each organization. So let's compare these
three different strategies. Please?
>> : Yes. So one question: if you use the 128-bit granularity, if I
understood correctly, in the split form, if you have a 128-bit
polynomial you store it in two 128-bit registers. So you're not
really reducing the number of registers but [inaudible]...
>> Diego Aranha: Yeah, but in this case I'm not using split form. I'm
cheating somehow because I wouldn't gain anything by using the split
form here because I have a dense multiplier already for 64 times 64-bit
polynomial. So here I'm not using the split form. Only in the shuffle
[inaudible] and the Lopez-Dahab without the 4-bit shifts. So here is
just -- it's native. I don't have to do anything fancy to use the
instruction.
So let's compare these strategies. So a good thing about the
Lopez-Dahab multiplication is that it can use the highest-granularity
exclusive OR operation. You should also consider throughput effects, as
we saw in the case comparing SSE and AVX. The memory space, however, is
proportional to the field size because you need to compute this table
of 16, or even more depending on the window size, field elements. So if
the field is bigger you spend more memory pre-computing this table
online. For the shuffle-based multiplication, the core operation is
sparser, so you have to do more, three times more, executions of this
core operation. So it would only really work in practice if you had a
really high throughput of the byte-shuffling instruction. But the good
thing, a good factor, is it consumes constant memory space; it's just a
table of constants: a 16 times 16-byte table, so a 256-byte table,
apart from the state to maintain the Karatsuba values. And sometimes
you can -- if the field is small enough you can keep this state all in
registers, so it's not a problem. But of course for bigger field sizes
the Karatsuba state will also be proportional to the field size; it
will increase with bigger fields.
Another problem is here you have to deal with constants stored into
memory, so you are forced to use instructions which use values from
memory, which lowers your throughput. So this is a disadvantage. For
the native multiplication, of course, it's a native operation. Even if
the latency's not stellar, it's still faster than both of the
algorithms we saw, the Lopez-Dahab and the shuffle-based
multiplication. And it has constant memory consumption of course,
because it only deals with registers. It doesn't need to compute
anything or use constants stored into memory. The only disadvantage I
would point out is there is no widespread support of these instructions
among architectures out there. Only the Sandy Bridge -- I think that
the [inaudible] processor has an 8-bit version of it which does
operations in parallel, which is interesting. But it's also tricky to
use it efficiently in a multiplier, especially for bigger fields, for
like [inaudible] field sizes.
So now let's talk a little bit about modular reduction. Modular
reduction, again, is very heuristic because it requires heavy shifting,
and we know that vector instruction sets are not very good for shifting
values. So this split representation here does not really help,
especially because after the multiplication we have a double-precision
polynomial. So if we needed to use the split representation to do
modular reduction we would have to split this again, and now we have a
double-precision operand.
So some guidelines we collected when implementing different field
sizes. If F is a trinomial, you should implement this at 128-bit
granularity. The number of [inaudible] is small and usually you save on
the other part, so this is faster. If it's a pentanomial, you should
either process pairs of digits in parallel or do the operation in
64-bit mode, where all the shifts are already provided. And a technique
we've discovered recently is if you have a field, a pentanomial for
example, where the addition of this value is equal to this value here,
you can simply write the irreducible polynomial this way and save some
shifts when multiplying by these values, which is required in the
modular reduction part. So we have done this -- this is the standard
polynomial at the 128-bit security level for elliptic curve
cryptography. And this saved some shifting instructions in the modular
reduction. So actually you have three -- So we do it at 128-bit
granularity for this field. So you have actually three different
options for the pentanomial. So of course you should always accumulate
writes into registers before sending them to memory, just to enjoy the
higher throughput. And reduction should be done while the results of
squaring or multiplication operations are still stored in registers.
There is no point in saving them into memory, reading them from memory
again, and then reducing them to send to memory again. So you should
keep the contents in your registers to save memory operations.
Now half-trace. Half-trace is useful for computing point halving when
that is used as the point multiplication algorithm. So the half-trace
is computed in this way: we just accumulate lots of [inaudible]
squarings of Z. What half-trace is: it provides a solution of this
quadratic equation when the trace of Z is zero. And then an interesting
property of the half-trace is if i is even, the half-trace of Z to the
i can be computed in terms of the half-trace of Z to the i over 2. So
the algorithms for half-trace use this fact either to get speedups or
to reduce memory consumption, because you don't have to deal with bits
i at even positions. You can simply convert the problem of dealing with
them to bits at odd positions. So this algorithm basically works at
8-bit granularity; it processes one byte of Z per iteration.
Actually this is already the last part. So what we do is we pre-compute
a very big table of half-traces considering only the bits at odd
positions, of odd indexes. And then we preprocess Z, using this
property, to eliminate all the bits at even positions. So the problem
is restricted to the bits at odd positions. And then we accumulate
values from this table by just doing a sequence of additions.
Here, since these values are always stored in memory, the AVX
implementation is usually faster than the SSE implementation, because
if operands are in memory both of them have the same throughput. So
using larger operands is better. But still this part dominates the cost
and, I think, we only saw a 25% speedup with the larger operand sizes.
Eliminating the bits at even positions can be done using the same byte
shuffling instructions because it's just bit manipulation. So it's
using the same techniques we used for square-root, for example. So you
can remove them and continue.
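A scalar sketch of the accumulation step only (not from the talk; the table layout and names are hypothetical): once the even-position bits are eliminated, the half-trace is just a XOR of one precomputed entry per byte of Z.

```c
#include <stdint.h>
#include <stddef.h>

/* HT is a hypothetical flat table: entry (i, b) holds the half-trace
   contribution of byte value b at byte position i, as n 64-bit words. */
static void half_trace_accum(uint64_t *r, const uint8_t *z_bytes, int nbytes,
                             int n, const uint64_t *HT) {
    for (int j = 0; j < n; j++) r[j] = 0;
    for (int i = 0; i < nbytes; i++) {
        const uint64_t *e = HT + ((size_t)i * 256 + z_bytes[i]) * n;
        for (int j = 0; j < n; j++)
            r[j] ^= e[j];          /* sequence of additions (XORs) */
    }
}
```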
The final operation I'll present in the [inaudible] is the m-squaring
or multi-squaring. It's a time-memory tradeoff proposed by Bos. So the
really interesting feature of this time-memory tradeoff is you can
compute any number of consecutive squarings of a field element in
constant time. The time doesn't depend on K. So it doesn't matter what
the value of K is, it's just a sequence of additions of values stored
in big pre-computed tables. So it's really a time-memory tradeoff. So
basically, for each 4-bit set and the 16 different values of this 4-bit
set, you pre-compute a table of these polynomials squared consecutively
K times, and then you can accumulate, processing 4 bits at a time, to
compute this consecutive squaring of A.
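A scalar sketch of the lookup phase for a fixed k (not from the talk; the table layout and names are hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* T is a hypothetical flat table where entry (i, u) holds
   (u * z^(4*i))^(2^k) mod f(z) as n 64-bit words, so a^(2^k) is just a
   XOR of one entry per 4-bit set of a. */
static void multi_sqr(uint64_t *r, const uint64_t *a, int n, const uint64_t *T) {
    for (int j = 0; j < n; j++) r[j] = 0;
    for (int i = 0; i < 16 * n; i++) {                       /* one 4-bit set */
        unsigned u = (unsigned)((a[i / 16] >> (4 * (i % 16))) & 0xF);
        const uint64_t *e = T + ((size_t)i * 16 + u) * n;
        for (int j = 0; j < n; j++)
            r[j] ^= e[j];
    }
}
```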
Oh, multi-squaring was not the last operation. We still have inversion.
Inversion is, again, very heuristic. When you don't have memory
available, on a small processor for example, we usually implement the
Extended Euclidean algorithm in 64-bit mode, because this algorithm,
again, requires heavy shifting. So you shouldn't -- at least all our
tries at doing it with vector instructions were not very efficient. But
if memory is available you can use the Itoh-Tsujii algorithm, which
basically recodes inversion in this way. And if you have a short
addition chain to compute that exponentiation, it's just a product of
several 2-to-the-i powers of A, which can be computed using the
time-memory tradeoff. So you can, for each of these powers, pre-compute
a table. It's a table of constants too, so we'll have a table
pre-computed already, and compute this exponentiation in fairly
efficient time.
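A schematic sketch of the Itoh-Tsujii idea (not from the talk; the helpers and types are hypothetical), using the binary expansion of m-1 rather than a tuned addition chain, with the repeated squarings delegated to the multi-squaring tables:

```c
#include <stdint.h>

#define MAX_WORDS 16                 /* hypothetical element size in words */
typedef uint64_t elem_t[MAX_WORDS];

/* hypothetical helpers: copy, field multiplication, k consecutive squarings */
void fb_copy(elem_t r, const elem_t a);
void fb_mul(elem_t r, const elem_t a, const elem_t b);
void fb_msqr(elem_t r, const elem_t a, int k);

/* a^-1 = (a^(2^(m-1)-1))^2 in a binary field of degree m. */
static void fb_inv(elem_t r, const elem_t a, int m) {
    elem_t beta, t;
    int k = 1, i = 31;
    fb_copy(beta, a);                         /* beta = a^(2^1 - 1) */
    while (!(((m - 1) >> i) & 1)) i--;        /* top bit of m-1 */
    for (i = i - 1; i >= 0; i--) {
        fb_msqr(t, beta, k);                  /* t = beta^(2^k)          */
        fb_mul(beta, t, beta);                /* beta = a^(2^(2k) - 1)   */
        k *= 2;
        if (((m - 1) >> i) & 1) {
            fb_msqr(t, beta, 1);
            fb_mul(beta, t, a);               /* beta = a^(2^(k+1) - 1)  */
            k += 1;
        }
    }
    fb_msqr(r, beta, 1);                      /* r = beta^2 = a^(2^m - 2) */
}
```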
So now let's go to implementation details.
First I'll present the implementation of 16 different binary fields
ranging from 113 to 1200 bits. All of these fields are useful for
curve-based cryptography either ECC or PBC. At first we used the GCC
4.1.2 because it provided the fastest SSE intrinsics. Then performance
of these intrinsics became horrible in the next iterations of the
compiler and they became fast again in GCC 4.5. In GCC 4.7 we are
seeing the best times we ever saw so somehow it became horrible and
then it became good again; I don't know why. This first of the results
I will present using -- the experiments were done with the RELIC
cryptographic library. On top of these implementations we built
implementations of scalar multiplication, but this field part is done
with the RELIC library. And I will restrict my comparison to other
results only comparing implementations with vector instructions. It
doesn't make sense to compare a vector implementation with a C
implementation. Please.
>> : Well, it sounds a bit strange to me that if an intrinsic maps to
one assembly instruction, how it can be faster in one compiler than
another.
>> Diego Aranha: It's very strange to me too. By looking at the
assembly produced by GCC, it just inserts lots of penalties and
overheads because...
>> : So this would suggest that writing your own assembly [inaudible]
are much...
>> Diego Aranha: Yes.
>> : ...[inaudible].
>> Diego Aranha: Yes. I agree. But then a good thing is they solved the
problem somehow, and the more recent versions of the compiler are
better than my assembly code -- especially for the multiplier. I tried
many different organizations of the multiplier in assembly and the
compiler still beats me by one or two cycles.
>> : The 113 bits should've been retired a long time ago. And
[inaudible] are now retiring 80-bit security, so 60-bit security is
ridiculous.
>> Diego Aranha: Yes. Here I'm not dealing with 60-bit security. Just
do...
>> : [Inaudible] 113 bits...
>> Diego Aranha: Yeah, but this can be useful for a curve defined over
a quadratic extension, so this is actually not necessarily --. I just
need these to build something, a higher extension for example. So I'm
not necessarily doing -- I remember another application of this field
but now --. We cite it in the paper and now I don't remember. It's
useful for something else, not necessarily -- I think it's the
[inaudible] curve used in [inaudible] which is this field size. Anyway,
so we compare
only to other implementations with vector instructions, so the mpFq
implementation. Also this is a paper by Beuchat and Francisco dealing
with pairings. And I will only compare numbers on the Core2
65-nanometer, because the shuffle instruction is really expensive in this
processor. I'm comparing my worst case to the other cases, so I'm not
picking an architecture where the shuffling instruction is really fast.
I'm picking the worst possible for this.
So let's see how the numbers look like. First of all I'm not trying to
be misleading here. We don't have many points in the literature to
compare to, so blue is our sequence of -- the blue lines are a sequence
of 16 different fields. The red is the sequence of benchmarking data I
could find in literature, so of course it's not completely filled out.
But I still compare some points.
For example, for squaring an interesting thing we see is sometimes we
increase the field size and the time goes down. The reason for this is
we are oscillating here between good and bad irreducible polynomials in
terms of shifting, because the coefficient expansion is really
negligible, so it doesn't make a difference. For example, we are
doubling the field size here and the latency barely increases. The
bit manipulation part is really efficient; the rest is just modular
reduction. So we can see, for example, this is the [inaudible] 123
field. We have a huge speedup over Francisco's implementation, which
uses pre-computed tables in memory, not vector instructions, for the
squaring part.
The square-root is the same. Now [inaudible] by square-root friendly
polynomial. Now we have an even stranger effect with double the field
size and square-root gets faster because now we are oscillating between
different choices of the square-root of Z. So this is a very, very good
choice proposed, actually, by Mike Scott. And it's very [inaudible]
polynomial because all these shiftings are aligned in a vector
register. And so, again, we are oscillating between better or worse
choices of the square-root of Z. Again, coefficient manipulation is
more expensive here but still negligible.
>> : So in some cases square-root is faster than squaring?
>> Diego Aranha: In most cases we saw it was -- you can see that the
squaring is dominated here; the square-root is slightly higher. So
square-root is more expensive because we need more -- you remember the
figures, it's more complicated, we need more shuffle instructions. But
it's fairly close on the most recent platforms and code. So it's fairly
close.
>> : [Inaudible] faster.
>> Diego Aranha: Yes, this point is faster because for square-root,
multiplying by the square-root of Z here is faster than doing the
modular reduction [inaudible] F of Z in this case. So it's a very
interesting choice of F. For pairing-based cryptography it's really
useful.
So now let's compare with the standard polynomials which sometimes are
not square-root friendly. In this case these are the not-square-root
friendly polynomials. And you can see that this is the 571 field size
which is a horrible, horrible polynomial to multiply by, because the
square-root of z is almost dense and we still have to do a modular
reduction afterwards. So that's why the time goes really high.
But in these more controlled cases here, we still are oscillating
between better or worse choices of square-root of Z. And we still get a
good speed up between comparing to the single point of the mpFq
library. It's a 251-bit size.
So now the Lopez-Dahab multiplication. So this is the Lopez-Dahab
implemented in mpFq and Beuchat and others. And you can see also the
quadratic costs explicitly in this graph. And basically what we save
compared to these other approaches are the 4-bit shifts. So we only do
8-bit shifts. And the difference is mainly this. And also the way we
organize the multiplier so we don't pay 64-bit shifts too. So basically
this is the savings in the shifting part by using the split form.
And now on internal comparison between the Lopez-Dahab and the
shuffling-based multiplier, Lopez-Dahab is for bigger field sizes still
much faster than the shuffle-based multiplier because the core
operation is sparser. So we need more of this core operation to perform
multiplication. But for really small field sizes, it's competitive. So
if you need a compact implementation with constant memory consumption
for a small field size, this algorithm could be useful. Which makes me
think that [inaudible] maybe could be useful if you don't want to
support a full native multiplier. But I never tried or I am not sure if
someone tried. So for like 113-bit sizes it could be interesting for a
compact implementation, a low area implementation. I didn't put the
timings for the native multiplier in this graph because this is a Core2
65-nanometer processor. It doesn't have the instruction, so you would
be misleading.
But you can just imagine a line here which follows this line and is
twice as fast. For all the field sizes we tried, the native multiplier
is twice as fast as the Lopez-Dahab multiplier. So it's [inaudible]
faster, but still not as fast as we wanted.
So some observations. We efficiently formulate squaring and
square-root. The multiplication-to-squaring ratio is up to 34, which is
very big; the classical ratio found in the literature is around 12 or
16. Of course this formulation is faster when the shuffling throughput
is higher, as on Sandy Bridge and Nehalem processors; there we gain
improvements in squaring and square-root as well. But still, the
performance is heavily dependent on the choice of f(z), because you
either have to do modular reductions or multiply by the square root of
z.
The shuffle-based multiplication has a problem with all the constants
stored in memory, because the throughput is lower -- memory operands
are slower most of the time. To make it faster we would need a bunch of
registers to store the table of constants and then do the lookups on
this table of registers. We don't have anything close to this in
current processors. But, I don't know, maybe in the future, or in a
specialized hardware design, these operations could be supported in
this way. This is, actually -- I consider it a victory. It looks bad,
but I consider it a victory. It's only between 50 and 90% slower than
Lopez-Dahab when it should be three times as slow, because it requires
three times as many core operations.
For the other operations we didn't find timings in the literature
employing vector instructions, so I'm only saying that we restored the
[inaudible] ratios we've seen in the literature. So when point halving
was proposed, for example, half-trace was considered to cost about one
multiplication. We got the new instruction to make the multiplication
much faster, and nothing explicitly for the half-trace operation. But
with that reorganization of the half-trace we make it comparable to
multiplication again. So in a sense it's twice as fast as the
half-trace approaches proposed in the literature, but using more
memory. Any version using the pre-computed tables is around 25 times
the cost of the multiplication. So it's also close to the ratios we see
even in textbooks; the Guide to ECC cites something close.
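For context, the half-trace for odd m is the standard map
H(c) = sum_{i=0}^{(m-1)/2} c^(2^(2i)); when Tr(c) = 0 it gives a
solution of lambda^2 + lambda = c, which is the equation you solve in
point halving. Because H is linear over F_2, it can be evaluated from
pre-computed tables of its value on small chunks of c instead of doing
(m-1)/2 explicit squarings; the exact table organization is an
implementation detail I'm not reproducing here.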
So now let's illustrate these implementations with timings for elliptic
curve arithmetic, for scalar multiplication. First I am comparing
different implementations of side-channel resistant scalar
multiplication on two different processors. So let me give a lot of
context here. I'm at the 128-bit security level. This is a bit-sliced
implementation by Bernstein. It computes several scalar multiplications
in parallel, I think 1024 multiplications in parallel, to reach this
number. It's on the Core2 platform. And this is the time for a single
scalar multiplication with a random point, of course, with the mpFq
implementation, also on the Core2. So this was the speed record for
either batched or non-batched scalar multiplication for some time. And
by employing this field arithmetic on the Core2 processor we arrived at
this number, which is much faster than the mpFq implementation because
squaring and multiplication are faster, but still slower than the
bit-sliced implementation by Bernstein.
In the bit-slicing paper, he tried to estimate whether a
carry-less-multiplier-based implementation would be faster or not. So
we did this for the Nehalem and the Sandy Bridge architectures, and we
got some considerable speedups. I'm not comparing to the execution of
his code on Nehalem or Sandy Bridge because when I ran his
implementation on the machines I have access to, I got numbers worse
than these. So I think his implementation relies on some very specific
features of the Core2 platform which do not exist there -- not in the
sense of instructions, but maybe cache alignment or code organization.
So I couldn't replicate these numbers on the latest processors, and I
couldn't do a proper comparison. And I don't want to implement
everything just to see how it compares. But still we got a considerable
speedup -- considering cycles only; this is not dependent on frequency
-- compared to the batched implementation. These are, of course,
non-batched implementations of the same curve at --. Please?
>> : Does side-channel resistant mean constant time?
>> Diego Aranha: Yeah. That's a good question. So actually, another
thing I didn't say: this has higher side-channel resistance than our
implementations because it takes constant time. In our case we are
using the Lopez-Dahab point multiplication algorithm, which is a
variant of the Montgomery [inaudible]. And it has some resistance to
simple side-channel attacks like simple [inaudible] analysis and timing
analysis -- I think cache-timing analysis too. But it doesn't cover
many side-channel attacks, so that one is in a sense more secure than
our implementation. We were more interested here in performance than
necessarily in protecting against side channels. But, yeah, good
question. Thanks.
So now let's go to -- I'll just provide a few notes on more recent
work. So actually the multi-squaring provides an interesting property,
or application, for Koblitz curves. Recall that Koblitz curves have the
Frobenius automorphism, tau, which is computed this way by just
inexpensive squarings, which are made even faster by using the split
form. And we can do scalar multiplication on Koblitz curves replacing
all point doublings with applications of the Frobenius automorphism.
An interesting thing is that if you can compute powers 2 to the k in
constant time, independent of the value of k, this actually provides an
endomorphism on the Koblitz curve. So we can go further or not in an
iteration of the scalar multiplication algorithm.
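Concretely -- and this is just the standard definition, nothing
specific to our implementation -- on a Koblitz curve
E_a: y^2 + xy = x^3 + a*x^2 + 1 over F_{2^m}, the Frobenius map is
tau(x, y) = (x^2, y^2), and applying it k times gives
(x, y) -> (x^(2^k), y^(2^k)). So a constant-time way of computing
x -> x^(2^k), with cost independent of k, immediately gives a
constant-time evaluation of the endomorphism tau^k.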
So let's take, for example, k to be m over 2; then we can represent our
scalar multiplication in this way. And this looks just like the
application of an endomorphism in the GLV context. So we can do these
two scalar multiplications in an interleaved way and remove further
applications of the Frobenius automorphism. This is faster whenever
this operation is faster than squaring consecutively k times. So you
need k to be, for example, higher than 10 for this to happen. But of
course we are speaking about m between 251 and -- actually 283, to have
proper Koblitz curves -- and 571. So in any real scenario this would be
true. Speaking in general terms, we can consider multiples of m over s
as the powers of the tau automorphism, and this is an analogue for
Koblitz curves of an s-dimensional GLV decomposition. So we can fold
the [inaudible] multiplication loop s minus 1 times and remove the
corresponding number of applications of the Frobenius.
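Just to spell out the s = 2 case in my own notation (the indexing here
is mine, only a restatement of the idea): if the tauNAF expansion of
the scalar is k = sum_{i=0}^{l-1} u_i * tau^i and h = ceil(l/2), then

    k*P = sum_{i=0}^{h-1} u_i * tau^i(P) + tau^h( sum_{i=0}^{l-h-1} u_{h+i} * tau^i(P) ),

so with Q = tau^h(P) computed once by the constant-time multi-squaring,
the two halves can be processed in one interleaved loop of length h,
exactly like a two-dimensional GLV decomposition.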
You might say, "But squaring is really efficient on a binary curve. Why
remove Frobenius applications?" The reason is that for some of these
standard fields squaring is not that efficient, because the choice of
polynomial is really bad. So actually we are saving modular reductions,
not squarings. We are interested in saving modular reductions because
the squaring part -- the coefficient expansion part -- is really
negligible, as you saw. This of course has an impact on memory because
we have to pre-compute the tables to do these [inaudible] squarings.
But if you choose a nice addition chain for the inversion part, you can
reuse the same table there. So that's exactly what we've done for the
283-bit curve.
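A minimal sketch of this table-based multi-squaring, with made-up names
and sizes -- the real code obviously differs in layout. It relies only
on the F_2-linearity of x -> x^(2^k): a table
T[j][v] = (v(z) * z^(4j))^(2^k) mod f(z) is built once per exponent k
with ordinary squarings, and then any element is raised to 2^k by
accumulating one entry per 4-bit chunk, at a cost independent of k and
of the input.

    #include <stdint.h>
    #include <string.h>

    #define NW 5                     /* 64-bit words per element, e.g. around F_2^283 */
    #define NIBBLES (64 * NW / 4)    /* 4-bit digits per element */

    /* Hypothetical table for one fixed exponent k:
     * T[j][v] = (v(z) * z^(4j))^(2^k) mod f(z). */
    extern uint64_t T[NIBBLES][16][NW];

    /* x -> x^(2^k) mod f(z) as a XOR of table entries, one per nibble of x */
    void multi_sqr(uint64_t r[NW], const uint64_t x[NW])
    {
        memset(r, 0, NW * sizeof(uint64_t));
        for (int j = 0; j < NIBBLES; j++) {
            uint64_t v = (x[j / 16] >> (4 * (j % 16))) & 0xF;  /* j-th 4-bit digit */
            for (int i = 0; i < NW; i++)
                r[i] ^= T[j][v][i];
        }
    }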
So this is what the scalar multiplication algorithm looks like. It's a
standard width-w tauNAF algorithm. I would just point out: first we
pre-compute online several small multiples of the point being
multiplied. We apply this map to all of these points to get several
different tables with the [inaudible] applied. And then we just deal
with the high bits and do the interleaved loop here, by folding the
loop s minus 1 times and using all of these tables.
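Very roughly, the folded loop for s = 2 has the shape below; the types
and helpers (point_t, frobenius, point_add, signed_lookup) are
placeholders I'm inventing for the sketch, not our actual API.

    #include <stdint.h>

    /* Placeholder curve type and helpers, declared only so the sketch is
     * well-typed; these names are invented for illustration. */
    typedef struct { uint64_t x[9], y[9], z[9]; } point_t;
    extern point_t point_identity(void);
    extern point_t frobenius(point_t p);               /* tau: (x, y) -> (x^2, y^2) */
    extern point_t point_add(point_t p, point_t q);
    extern point_t signed_lookup(const point_t tab[], int digit);

    /* Folded interleaved loop for s = 2:
     * u[0..l-1] are width-w tauNAF digits of the scalar, h = ceil(l/2),
     * P_tab holds small odd multiples of P, and Q_tab the same multiples
     * with tau^h already applied via the multi-squaring tables. */
    point_t scalar_mul_folded(const int *u, int l,
                              const point_t P_tab[], const point_t Q_tab[])
    {
        int h = (l + 1) / 2;
        point_t R = point_identity();
        for (int i = h - 1; i >= 0; i--) {
            R = frobenius(R);            /* one tau per iteration instead of two */
            if (u[i] != 0)
                R = point_add(R, signed_lookup(P_tab, u[i]));
            if (i + h < l && u[i + h] != 0)
                R = point_add(R, signed_lookup(Q_tab, u[i + h]));
        }
        return R;   /* k*P, with about half the Frobenius applications */
    }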
We could also apply this map during the addition part, but it turned
out it was faster to do this beforehand than inside the loop. So, some
numbers: this is unprotected scalar multiplication, in many different
senses. It's really tricky to protect scalar multiplication on Koblitz
curves against side channels because the recoding part is very
branch-friendly, let's say -- it requires many branches in several
parts to be efficient. So we decided to only do unprotected scalar
multiplication and leave a possible side-channel protected scalar
multiplication as future work.
So these are numbers for a standard curve, the NIST K-283, on the
Nehalem and Sandy Bridge architectures. I'm choosing the window size to
be 5 and I'm folding the loop just one time. If I fold more, I have
problems balancing what I spend here against what I save here. So s
equals 2 was the best scenario for a random point. If the point is
fixed, of course, you can use a higher s and spend more memory to make
this faster.
So I'm comparing to the best implementations by Patrick on the same
platforms -- also a random point and unprotected scalar multiplication.
And I get considerably higher speed by using a Koblitz curve. Remember
that my field multiplication here is much more expensive than it should
be, considering the relative complexity of carry-less and ordinary
integer multiplication. So if I had a carry-less multiplication
instruction with the same cost in cycles, this would actually be much
faster.
And I have to say that Patrick has since improved his implementations,
so now this cost is 92 K cycles. So the Koblitz curve implementation is
not the state of the art any more, but it was for a few minutes.
But, anyhow, we have a considerably efficient implementation,
considering that we don't have support for binary fields as good as the
support for integer arithmetic.
>> : Is that on the same core?
>> Diego Aranha: Single core, yeah.
>> : [Inaudible]…
>> Diego Aranha: [Inaudible] just single core. So this was -- at least
when -- it should appear at the next Latincrypt, and it was the first
implementation to ever cross the hundred-thousand-cycle barrier. So now
it's still the first but it's not the fastest.
So, okay, let's summarize the results. I presented a new formulation
and implementation of binary field arithmetic. It follows this trend of
getting faster shuffling instructions with higher throughput and
improves results on the field arithmetic side by between 8 and 84%. The
84% speedup is squaring on the really big field; as you saw, there was
a huge difference in latency. This formulation induces a new
implementation strategy for multiplication which is not very useful at
this point. It could still be useful for compact implementations of
small fields, and if we had custom table-addressing features on the
processor this could be interesting. But we don't have this.
Comparing our times with non-batched arithmetic on binary elliptic
curves, we obtained the speed record for side-channel resistant scalar
multiplication -- not considering the same side channels, but at least
with resistance to some side channels. For scalar multiplication on
generic binary curves I'm, in this case, referring to the bit-sliced
implementation by Bernstein. And we claimed that this should be a new
speed record for scalar multiplication across all curves. So now -- how
did you compute how much faster your result is, 7% or 8%?
>> : [Inaudible].
>> Diego Aranha: Yeah. But still, I just have to rephrase this as
scalar multiplication across all binary curves. So it's still true if I
restrict only to binary curves.
>> : If you run our numbers, the 84% will become 500%. And in this
case…
>> : [Inaudible].
>> : …[inaudible] no ambiguity, right? You know, then you'd get
negative time.
>> Diego Aranha: Yes. Yes. In a sense, yes. But I prefer 84 so I don’t
have people raising their hand and saying, “Well, it’s not actually
500, it’s 84.” But, yes, depending on the ratio you take it could be
500%. It’s much faster. It’s five times faster. So thank you for your
attention. Please, Peter?
>> : [Inaudible] work on binary Edwards curves?
>> Diego Aranha: Yes, actually -- when I'm presenting this, this work
is on binary Edwards curves. This work is on the same curve but in the
standard [inaudible] representation of the same curve, because then I
have the algorithm I need for some side-channel resistance, which is
the Lopez-Dahab point multiplication algorithm. But it's the same
curve, just with different representations.
>> : [Inaudible] attack, side-channel attack on Edwards curves?
>> Diego Aranha: Yeah, I’ve seen on the [inaudible]…
>> : You had AB plus BA, not AB plus AB.
>> Diego Aranha: Yes. Yes, it was a very, very interesting attack
because it looks so simple but it’s not that simple in the details.
Yes. So thank you for your attention. I can answer questions that you
may have now.
[ Audience applause ]
>> : And does this all work on the AMD [inaudible] or do you have to
[inaudible] to use a different [inaudible]?
>> Diego Aranha: Very, very interesting question. The Bulldozer is a
very weird processor, as you've probably heard. They have a faster
binary field multiplier, for example. It oscillates between 7 and
[inaudible] cycles, so it is likely faster than the Intel one. But
everything else is slower. And it's the same thing with integer
arithmetic, too. So it's very strange, because we have a faster
multiplication but slower everything else. So the timings are slower
than this. But the [inaudible] initiative is benchmarking our code on
the AMD. I think it compiled and ran successfully on the AMD. And the
timings are worse than I'm presenting here. But the [inaudible] are
there. The same implementation techniques can be used. You only get
slower performance because AMD aimed at different applications, I
think.
>> : [Inaudible].
>> Diego Aranha: Thank you.
[ Audience applause ]