>> Peter Montgomery: Today Joppe Bos speaks on how to solve the 112-bit ECDLP
using game consoles. In this presentation he will outline two projects which he has been
working on during his PhD. Both projects are related to the elliptic curve discrete
logarithm problem, ECDLP, the theoretical foundation of many modern cryptosystems.
First he will outline how they set a new record by solving the ECDLP over a 112-bit
prime field using a cluster of PlayStation 3 game consoles in 2009. Next the negation map
optimization is discussed. This is a technique to speed up the Pollard rho method when
solving the ECDLP. It is well known that the random walks used by Pollard rho, when
combined with the negation map, get trapped in fruitless cycles. He will present that
previously published approaches to deal with this problem are plagued by recurring
cycles; effective alternative countermeasures are proposed. So, speaker.
>> Joppe Bos: Thank you Peter. Thank you for the introduction. Indeed, so Peter's
[inaudible] have dried up. So to quickly recap, I am doing my PhD at the laboratory for
cryptologic algorithms at EPFL in Switzerland, and currently I am doing my internship
here under the supervision of Peter. I want to talk about two projects that have to do with
my PhD. The first one is related to the title of the presentation, how to solve the elliptic
curve discrete logarithm problem with game consoles. The second one is related to the same
discrete log problem and will discuss the negation map optimization.
So when you talk about game consoles it gives you a good opportunity to show fancy
pictures. The picture here unfortunately is not our cluster; it is part of what is currently
the biggest PlayStation cluster in the world, and it is located here in the US; it's from the Air
Force. They have an even much bigger cluster; I think it is one and a half thousand
PlayStations that they have. They simply use them because they are cheap and because they
are computationally really efficient. So first I will give, in a few slides, a quick summary
of the Cell Broadband Engine, yes?
>>: [inaudible]?
>> Joppe Bos: I am not sure. I think they use it for rendering, oh yeah, processing their
data. I am not 100% sure what they actually use it for. So first I will give a summary of
the Cell Broadband Engine architecture, which is the chip that is inside the PlayStation.
This picture actually sums it up really nicely. I will go into the details which we actually
need later a little bit more. So on top you see eight SPEs, the synergistic processing
elements. These are like compute cores and these are the cores that we are interested in,
and I will go into more detail of what these things are. They are all connected together
using a circular bus so they can communicate with each other, and this bus is also connected
to the PPE, which is an additional processor, a derivative of the PowerPC architecture, and
we won't be using it. Our main focus will be on these eight tiny cores up here.
So why are we interested in the PlayStation? It's all about the price/performance ratio. An
old PlayStation, that is this little bit fatter guy here; it is now discontinued. There are six
of these SPEs available, so the chip actually has eight of them. One is disabled on the
chip because, while manufacturing them, if one of them fails they don't have to throw
away the chip and they can still sell it, and one of them is reserved by the hypervisor
hardware layer from Sony. So there are six available in the PlayStation, and you used to
be able to, or Sony let you, install your own operating system on it, so you could simply
install your operating system, write your code and run your own code. Note that they did
not allow you to have access to the graphics card, because otherwise you could eventually
install your games under Linux and you wouldn't buy their games anymore. So you only get
access to this chip, not the graphics card. On the newer model, that is this slim model
here, they simply don't allow you to anymore. There is a bit of history; I won't go into
details. This whole architecture got hacked and now they don't allow you to install Linux
anymore. Yes?
>>: Is the chip 280 [inaudible] or is that for the entire board [inaudible] memory? It's an
awful lot for one chip.
>> Joppe Bos: It's not the chip; it's the whole PlayStation, including the graphics card,
which consumes most of the power. But this chip is put in other devices as well. For
instance it is put on a PCI Express board or, for compute clusters, in a blade server.
Now look at the price difference. So here there are two Cell chips in one blade server.
Note that there is much more memory; it's just like a regular PC, so you can plug in up to 32
gigs of memory, while a PlayStation is restricted to roughly 256 MB. So if you have a high
memory demanding application, you simply cannot use the PlayStation. But look at
the price. I checked this morning; actually the price dropped a little bit more. It's
not $300; it's now around $250. Then you have a PlayStation, while if you want to buy the
blade server you are looking at a 10 to 14 K investment. So that is the motivation why
we originally purchased these PlayStations.
>>: Why is there such a big difference?
>> Joppe Bos: There are multiple reasons why there is this big difference. First of all
they sell the PlayStation at a loss; that is how it goes in the whole game console business:
you sell your console at a loss and you make all of your profits on games. If you make your
console too expensive, no one will buy it and no one will buy your games. That is one of the
main reasons why these things are so cheap now. Yes?
>>: So you said that they used to allow you to [inaudible] people like you wanted to use
it [inaudible]?
>> Joppe Bos: Yes, I'm not 100% sure why they did it, but I know indeed a lot of
universities from all different fields were really happy to use this.
>>: So they want people to buy it so that they will buy games and [inaudible].
>> Joppe Bos: Yes, that is true. So I don't know what their initial incentive was to allow
you to install your own operating system. We were at least very happy that we were able
to. So, to zoom in a little bit more on the cores which are our main targets, the
synergistic processing elements. They consist of three parts. The synergistic processing
unit is the compute core. It has access to a large register file: it has 128
registers. Note that this is much, much bigger than for instance your typical x86 or x86-64
architecture, and they are much wider. They are 128 bits, so they are comparable to
the 128-bit registers that you have in the SSE or MMX extension units of the regular
architecture. It is a single instruction multiple data architecture and it has a dual pipeline.
Now this is important. What does it mean, a dual pipeline? It means that every clock
cycle you can dispatch two instructions, one even instruction and one odd instruction, and
each goes into its own pipeline. So it's a challenge for the programmer to redesign your
algorithm in such a way that you always fill both pipelines, to really dispatch
two different instructions every clock cycle.
It has a local store which--it's not really a cache, but you can compare it with some sort of
cache. It's 256 kB, and your executable and all the data required by your program need to
fit in here. So it is almost nothing. If it doesn't--it doesn't necessarily need to fit
in here, but otherwise you have to go through the memory flow controller and you have to
manually fetch all of your data from main memory. So it's not like on a regular PC where
it will all be done automatically for you; you really have to handle all of the pointer
magic yourself and you have to make sure that the memory flow controller really fetches
the memory for you. In this project we are always in the easy case: everything
fits in our 256 kB.
So the registers: it is single instruction multiple data on multiple levels, so you can
think of it on the byte, [inaudible] or word level. You can even think of it at the bit level;
that it's 128-way SIMD on all of the bits. The preferred slot on the Cell
architecture is on the word level. So we will do four-way SIMD, so we will process
four 32-bit words in parallel. Yes?
>>: Just a side note, in the local jargon 16 bits is a word and 32 bits is a double word
[inaudible].
>> Joppe Bos: It depends entirely on how you [inaudible].
>>: [inaudible].
>> Joppe Bos: In this presentation, and this is also how it is defined on this architecture, a
word is 32 bits. Another feature of this architecture is that it has a really rich
instruction set. For instance, all of the two-input one-output binary functions are
available. So this is already a feature which is not available in the x86 architecture.
Furthermore, it has this whole range of different instructions, of which the shuffle bytes
is a really interesting instruction which is also available in the SSE extensions, but since
we will be mostly interested in doing fast arithmetic, let's have a look at the
multiplication instructions.
Unfortunately it only has a 16-bit multiplier, so it can multiply 16-bit values into a 32-bit
result. The upside is that at the same cost it has a multiply-and-add instruction: at the same
latency you can add a 32-bit word to this result and it won't cost you anything extra. The
other upside is we don't do one but we do four of them; because it's single instruction
multiple data, we do four of these operations every clock cycle. So this is roughly the
setting, and taking all of this into consideration, the programmer had better take these
things into account. One of the things I didn't mention is branching. As on all parallel
architectures, and especially on single instruction multiple data architectures, branching is
best avoided if possible, and on this architecture there is no hardware branch
prediction, so everything is done in software, so there is nothing smart out there. So if
you mispredict the branch, you get a huge penalty. Instead, there is a prepare-to-branch
instruction to redirect the instruction prefetch to the branch target.
Furthermore you should really take into consideration your memory limitations. As I
said, the executable and all the data should fit in your local store, otherwise you have to
fetch everything yourself from the main memory, which of course comes at a
performance cost. For us the main instruction set limitation is the 16-bit multiplier, and
you have to redesign and rethink a bit how you fill both of these pipelines. So what is our
setting? What hardware do we have to run everything on?
So at EPFL, our lab in Switzerland, we physically have 190 PlayStation 3s sitting in our
cluster room. Then we have a room; this is a picture of the room, called PlayLab,
where we invite little children to come to the university so we can bore them first with a
presentation about crypto, and then they can play games, and you can see that they can sit
around the table. So here they have monitors; in the cluster room of course they are not
connected to any monitors and there are no game controllers. There are four PlayStations
per table and we have six tables, and then we have five PlayStations scattered around our
offices to debug our code. So in total we have 219 PlayStation 3s.
And to show you another nice picture, this is what our PlayStation cluster looks like. You
see there are a whole bunch of shelves, and on each of these shelves there are four
PlayStations, placed not directly behind each other but in this zigzag pattern, because the
heat is expelled from this side, so that the heat doesn't blow directly into the face of the one
behind it; otherwise that one would immediately overheat. So now we know a little bit more
about the chip inside a PlayStation. Many people found other uses for it than gaming.
Japanese people even rebuilt a PlayStation such that they could barbecue on it [laughter],
and I don't know what kind of sausages these are. They don't look very tasty. That is
probably some kind of Japanese sausage.
But we are actually going to do something else. We are going to try to see how we can
solve the 112-bit elliptic curve discrete logarithm problem. This was joint work with
Marcelo Kaihara, Thorsten Kleinjung, Arjen Lenstra and Peter. So what is our setting?
We are given an elliptic curve E over a finite field F_p, where p is an odd prime. We are
given a point P on this elliptic curve, and for now we will assume that it has prime order n,
because we are doing crypto, and we are given another point Q and we are told that it
is a multiple of this point P, say Q = kP. Given all of this, the question is: what is k? Find
the integer k.
So there is a list of challenges published by Certicom, similar to the RSA challenge list,
and it challenges you to solve instances over prime fields and over binary extension
fields. The previously largest solved challenge was the 109-bit prime field challenge, solved
by Chris Monico using a [inaudible] project in November 2002, and in total he estimated
that it would have taken 4,000 to 5,000 PCs working day and night for one year. The next
challenge on the list is a 131-bit prime field challenge. So we were looking into this and
we said okay, this takes at least 2,000 times more effort, so maybe this is a bit too hard, at
least to do on our PlayStation cluster.
So then we looked at the current standards. We simply looked at the three big elliptic
curve standards: the standards for efficient cryptography, the wireless transport layer
security specification and the digital signature standard. What immediately jumps out are
these really low p sizes, 112 bits, which are used in these two standards. They are used for
smart card security and these things are for a low level of security, so for us it was not
the question, can we solve this, but the question was how fast can we solve this 112-bit
elliptic curve discrete log problem. So how are we going to approach it? The typical
approach is to use Pollard rho. If you have generic curves, the fastest algorithm out there to
solve this is the Pollard rho algorithm, and the underlying idea is that you are going to look
for pairs (c, d) such that this equation holds. This is called a collision. Because if you
find a pair like this, where c_i and c_j are different and d_i and d_j are
different, then, because you know Q is equal to k times P, you can find your
k by simply computing this.
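Written out, the standard Pollard rho relation behind this is:

    c_i P + d_i Q = c_j P + d_j Q, \qquad d_i \not\equiv d_j \pmod{n}
    \;\Longrightarrow\; (c_i - c_j) P = (d_j - d_i) Q = (d_j - d_i) k P
    \;\Longrightarrow\; k \equiv (c_i - c_j)(d_j - d_i)^{-1} \pmod{n}.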
So how are we going to look for these collisions? The idea is to form a random walk
through the set of points generated by P. How are we going to do this? We have our points
P and Q, and our walk is on points X_i which are known combinations of P and Q, produced
with the help of some iteration function, and we know, since we do a random walk
in a finite set, that this will eventually collide. And based on the birthday paradox you can
show that the expected number of steps needed before it collides is roughly the square root of
the order of the group that we are looking at.
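The standard birthday-paradox estimate behind this statement is:

    E[\text{iterations until a collision}] \approx \sqrt{\pi n / 2} \approx 1.25 \sqrt{n},

where n is the prime order of the group generated by P.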
So how are we going to achieve this? First of all, for the representation we are going to
really exploit the SIMD characteristics of this architecture. So here we have a 128-bit
vector, so these are 32-bit words, four of them. To represent numbers, we have an array, a
whole bunch of these 128-bit vectors, and each vector contains the corresponding 32-bit
parts of four different numbers. So we are going to represent the numbers like this: a
number X1 goes up like this in the first slot, X2 goes up like this in the next slot, and so
on. So we represent our numbers column-wise. In this way, if we add vectors, we actually
add the corresponding limbs of four different numbers. Some more implementation details,
and I won't go into the details of these, but they are described in the paper: we are going to
optimize for high throughput. We don't care if we have to wait a little bit longer if we get
higher throughput. And, to hide the instruction latencies, we are going to interleave two of
these four-way single instruction multiple data streams.
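As an illustration of this column-wise layout, here is a minimal sketch in plain portable C
(the SPE code uses 128-bit vector registers and carry-generate instructions instead; the
type and limb count here are illustrative assumptions, not the actual code):

    #include <stdint.h>

    #define NLIMBS 4   /* 4 x 32-bit limbs = 128 bits per number (illustrative) */

    /* One batch holds four numbers side by side: limbs[j][k] is the j-th
       32-bit limb (least significant first) of the k-th number. On the SPE,
       limbs[j][0..3] is exactly one 128-bit vector register. */
    typedef struct { uint32_t limbs[NLIMBS][4]; } batch4_t;

    /* Add two batches: every pass over j corresponds to a single SIMD add
       that processes the same limb of all four numbers at once. */
    static void batch4_add(batch4_t *r, const batch4_t *a, const batch4_t *b) {
        uint32_t carry[4] = {0, 0, 0, 0};
        for (int j = 0; j < NLIMBS; j++) {
            for (int k = 0; k < 4; k++) {   /* the 4 lanes of one vector op */
                uint64_t t = (uint64_t)a->limbs[j][k] + b->limbs[j][k] + carry[k];
                r->limbs[j][k] = (uint32_t)t;
                carry[k]       = (uint32_t)(t >> 32);
            }
        }
    }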
Furthermore, because occasionally we need to compute an inversion and inversions are,
compared to multiplications, quite expensive, we propose an efficient four-way SIMD
modular inversion algorithm, and using all of this we are going to compute on 400 curves
on one of these SPEs in parallel. So why don't we simply compute on one curve, why do
we choose 400 curves? It is because of the method by Peter Montgomery called the
simultaneous inversion method: if you have independent inversions, you can trade each
inversion for roughly 3 multiplications. In practice that is a really, really good trade-off,
and the number 400 is because this is exactly the number for which we completely fill the
whole local store of the SPE. 400 curves is the maximum we found which fit into 256 kB,
and we do not use the negation map optimization. I will not go into details; that will be the
second part of the presentation.
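Montgomery's simultaneous inversion trick that he refers to can be sketched like this
(fe_t, fe_mul and fe_inv are assumed field-element helpers, not the actual PlayStation
code):

    #include <stdint.h>

    #define N_MAX 400                        /* number of parallel curves (from the talk) */

    typedef struct { uint32_t w[4]; } fe_t;  /* field element mod p~ (illustrative) */
    void fe_mul(fe_t *r, const fe_t *a, const fe_t *b);   /* r = a*b mod p~ */
    void fe_inv(fe_t *r, const fe_t *a);                   /* the one expensive inversion */

    /* Invert a[0..n-1] in place using a single fe_inv and about 3(n-1) fe_mul calls. */
    void simultaneous_inv(fe_t *a, int n) {
        fe_t prefix[N_MAX];                  /* prefix[i] = a[0]*a[1]*...*a[i] */
        prefix[0] = a[0];
        for (int i = 1; i < n; i++) fe_mul(&prefix[i], &prefix[i-1], &a[i]);

        fe_t acc;                            /* acc = (a[0]*...*a[n-1])^-1 */
        fe_inv(&acc, &prefix[n-1]);

        for (int i = n - 1; i > 0; i--) {
            fe_t t;
            fe_mul(&t, &acc, &prefix[i-1]);  /* t   = a[i]^-1 */
            fe_mul(&acc, &acc, &a[i]);       /* acc = (a[0]*...*a[i-1])^-1 */
            a[i] = t;
        }
        a[0] = acc;
    }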
I will go into a little bit of detail here. To achieve more speed, we are going to trade
correctness for efficiency. As you might know, for adding points on an elliptic curve you
have point addition and you have point duplication. Point duplication is used if your
inputs are the same, so you do X plus X. So there are different formulas, and if you feed X
and X into your addition formula you get a wrong result, and we are not going to check
for this. For two reasons: it saves code space, which we can use to fit more curves in the
local store, and checking would cost us a branch, and branching is expensive. So it might
happen, but it is with an extremely low probability that something goes wrong, and if it
goes wrong, one of our 400 walks is wrong and we simply restart it. So we truly don't care.
Another trick, which I will explain on the next slide, is that we designed a fast modular
reduction, but it might compute the wrong result. Yes?
>>: Were you able to check quickly if you made a mistake [inaudible]?
>> Joppe Bos: Yes. The way the algorithm works, it will walk through the set and
occasionally, when it finds a point with special properties, it will output it to a central
server, and there we can do some post-processing and sanity checking, and there we can see
if something went wrong or not. So before describing how we do this faster multiplication,
let's have a look at the prime which is proposed in the standard. It's a 112-bit prime and
it's this number. We immediately see that it is not a randomly picked prime; it has some
very special shape. So what we are going to do is calculate modulo a multiple of p, which
we call p tilde, and we're going to use a redundant representation. So let's take R as 2 to
the 128, and then we are going to do our arithmetic modulo p tilde, where p tilde is simply
R - 3, which is a multiple of the prime. Note immediately that we have x times R is
congruent to 3x mod p tilde. So how are we going to exploit this special shape?
So let's define a reduction map, which is defined like this. Then we can immediately
see that if we write our number x in a radix 2 to the 128 system and we take the
lower part and we add three times the higher part, then the result is congruent to x modulo
p tilde. This is a basic trick also used for the NIST primes, but how are we going to use
this, and how do we get wrong results? This is because we only want to apply it twice. If
we apply it three times, we always get correct results: they are always fully reduced and
everything is nice. We are going to apply it twice, so how often will it happen that we get
an incorrect result? So let me first give a sketch, just a back-of-the-envelope
calculation, of how one can do this. As input we get a number which is between zero
and R squared, because it comes out of our multiplication step, and we want to reduce this
number. It is written with x0 as the low part and x1 as the high part in the radix 2 to the
128 system. We apply our map once and we get a number y. We can immediately see that
the high part of y is less than or equal to three. So this number is indeed still too big and
we cannot use it for the rest of our computations.
So now let's split this into two cases: y1 is 3, and y1 is less than or equal to 2. Then we
can see, if we apply it twice, with a straightforward computation, that the next value we
obtain can be as large as R + 5, which is indeed too big, so we would need to apply it a
third time. But how often does this go wrong? The values from R up to R + 5 are the wrong
ones, so 6 out of the R + 6 possible values, so a very rough heuristic approximation is this.
And remember R is 2 to the 128, so it doesn't happen very often. If you do this a little bit
more carefully, and a little bit more sophisticated, you can actually show using some
heuristics that this is the probability that things go wrong, and this is actually less than 1
over R. So less than 1 in 2 to the 128 times it will go wrong, and since we expect to do
fewer than 2 to the 128 reductions, we expect that it will never go wrong. We actually used
this and we didn't see a single time where it went wrong in practice.
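As a toy analogue of this sloppy reduction, here it is in C scaled down from R = 2^128 to
R = 2^32 so it fits in standard integer types (the real thing works on 128-bit values split
into limbs; this only illustrates the idea):

    #include <stdint.h>

    /* One application of the map for p~ = R - 3 with R = 2^32 (toy version of 2^128 - 3):
       split x = x1*R + x0 and return x0 + 3*x1, congruent to x mod p~ since R = 3 (mod p~). */
    static uint64_t red_step(uint64_t x) {
        uint64_t x0 = (uint32_t)x;      /* low  part */
        uint64_t x1 = x >> 32;          /* high part */
        return x0 + 3 * x1;
    }

    /* "Sloppy" reduction: apply the map only twice. The result is almost always
       below R; with probability on the order of 1/R it is still slightly too
       large (up to about R + 5), and the affected walk is simply restarted. */
    static uint64_t sloppy_reduce(uint64_t x) {   /* x < R^2, i.e. any 64-bit input */
        return red_step(red_step(x));
    }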
So, to give you some performance results, we implemented all of this. You can see here
that the modular multiplication using sloppy reduction takes 430 cycles, and most of the
time is indeed spent in the multiplication and not in the reduction. You see some other
numbers; the inversion, for instance, indeed takes a really long time, but we only
need to do one inversion, shared among all of the 400 curves in parallel, so in total we
only pay roughly 12 cycles for it. So in total, to take one step in this huge group, to do one
iteration, takes roughly 456 cycles. We have our PlayStation cluster; that means that per
second we can actually compute 2 to the 33 iterations on our cluster, and we work with
more than half a million elliptic curves in parallel. So it's always really nice to tell
visitors when they walk through the cluster room: did you know that you are now in a room
where more than half a million curves are being processed at this very moment?
As you asked, we occasionally need to output a distinguished point, a point with a special
property. We used a very [inaudible] to store these. We didn't use any fancy techniques,
because it was not necessary. We stored them simply using 4 x 16 bytes each; every 2
seconds a distinguished point would come in, and we expected, based on the birthday
paradox, that we needed 300 gigs of storage, which is nothing. So I have now presented you
all of these numbers and I claimed that it is fast, but how do you know that this is actually
a good result? Let's try to compare it with some other results out there. So there is this
FPGA machine called the COPACOBANA, which is work by Güneysu, Paar and Pelzl, and
there is a paper describing the same algorithm, the parallelized Pollard rho algorithm, for
different bit sizes, and we will of course be interested in the 96 and 128 bit sizes. Note
that the main difference is that they target generic primes, so they cannot use this fast
arithmetic, while we use the 112-bit prime, although we do everything modulo p tilde, which
is 128 bits, but we have this fast reduction. To buy this COPACOBANA you need around US
$10,000, and assuming that the PlayStations are not at a discount, so we really need to pay
$300, we can buy roughly 33 PlayStations for this price.
As you can see here, the paper proposing this Pollard rho on the COPACOBANA is from
2008, so let's give them a doubling factor by Moore's law, that everything got faster. These
are their performance numbers; they can do this many iterations per second. Let's
give them a factor of 2 speedup due to Moore's law, and let's also assume the negation map
optimization, which gives them roughly a 1.4 times speedup. We didn't use it, but let's
assume they could get it to work and everything worked fine, so we give them that as
well. Then these are the final performance numbers, and we can see, for 33 PlayStations,
that we are at least an order of magnitude faster than this FPGA hardware machine. But
there are actually two 'buts'. We didn't use--we still have this PowerPC core on the
PlayStation, which actually didn't compute anything, so we could have sped up our
calculations using this PowerPC core; and on the other hand there is now a new
COPACOBANA out there which has much, much faster FPGAs, but there are no performance
results known yet for how fast these algorithms would actually run on this newer
COPACOBANA, because it is out only this year. It's really new.
So what is our challenge? Because there was no challenge specified. This is actually a
bit of the boring part, the solution. No one really cares what the solution is, but I will
present it pretty quickly. The point P is the generator proposed in the standard, and to
come up with a challenge we had to make sure people believed us that we made a
challenge that was not pre-cooked, of course. So we simply chose the decimal expansion
of pi as the x-coordinate of the point Q. The expected number of iterations is around 10
to the 16, 10 to the 17, and we ran this between January and July in 2009, but not
continuously, because once in a while these kids came and then we had to run demos on our
cluster, and we had other projects. But in total, using the latest version of our code, this
could have been done in 3 1/2 months. So these are the points P and Q, and this is n, the
prime order of the group, and then we found in July 2009 the solution, which again, by
itself, is not interesting at all. So now let's go to the second part of this--yes?
>>: You said [inaudible] how did it compare with the lambda method?
>> Joppe Bos: So there is a bit of confusion in the literature about what the lambda
method actually is, because there are two algorithms which are known as the Pollard
lambda method. Some people say the Pollard lambda method is the parallelized version
of Pollard rho; if you visualize it as a [inaudible] shape, then the answer is we used the
Pollard lambda method. And then there is this other Pollard lambda method with the
kangaroos, which asymptotically has the same runtime as this, but in practice it is a little
bit slower, because you need to know some other things about your group as well. If you
know that your solution lies in a certain interval, then the kangaroo method is faster in
practice.
Now let's move to our second project, which is about the negation map; this is joint
work with Thorsten Kleinjung and Arjen Lenstra and was published at ANTS last year.
This actually was a result of the first part that I presented: we wanted to study the
negation map in practice. So first of all, why? We focused on prime fields, so why did we
focus on prime fields? If you, for instance, look at the Suite B Cryptography set by the
NSA, they only allow you to do elliptic curves over prime fields. They don't allow you to
do elliptic curves over binary extension fields. And of course the whole reason why we
started this: if you can solve the discrete log problem fast, you can break elliptic-curve-based
schemes.
So now look at all the previous prime field records, over 79, 89, 97 and 109 bits, and the
one I just described over 112 bits. They were all solved, but there is this textbook
optimization. It is in every crypto book. They say there is this negation map and it will
lead, no matter what curve you have, to a square root of two speedup, so a 1.4 times
speedup. That is really nice. But actually it has not been used in any of these records.
So why do people not use this optimization? While trying to use this for our setting,
we discovered a lot of side issues with it, and this paper is the result of that.
So let's first zoom in a little bit more on Pollard rho. I said we were going to do a
random walk in the group generated by our given point P. But we cannot do a truly
random walk, because that would be tremendously slow and we need to keep track of the
multiples of the points P and Q we are carrying. So in practice what people do is
approximate a random walk. This is done with some index function which, in the rest of
this presentation, I will call L. Your group generated by the point P
is partitioned into t sets, t partitions, and this function L will simply map a point to an
integer between 0 and t.
We assume, and in practice this is more or less always the case, that the sizes of these
partitions are roughly the same; they all have the same cardinality. So what people do
in practice is compute t precomputed constants, and you can either use an r-adding
walk, where your next point is simply your current point plus one of your
precomputed points, based on which partition it falls in, or you use a mixed walk, where the
next point is an addition in this case, or, if the partition index is bigger, we simply double
the point. That is how you simulate a random walk; that is what people do to simulate a
random walk, and Teske actually showed in 2001 that if you take r around 20, so you have
around 20 partitions or more, then the performance of this not-so-random walk is actually
close to a real random walk. Yes?
>>: If you apply r plus s, then it looks like the probability of doubling is much
bigger than [inaudible] doubling, correct, or am I missing something?
>> Joppe Bos: Sorry?
>>: You said before that you [inaudible] in the case of doubling because it never
happens. But here doubling does happen with finite probability.
>> Joppe Bos: Here [inaudible] so, this is a completely different project. There we only do
adding, and here we do doubling as well; that was just an implementation issue to speed
things up, and here we are looking at things from a theoretical point of view. So we assume
that we can add, we can double, and things will go correctly.
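A sketch of such an approximated random walk, with assumed helpers (point_add,
point_double, the index function L, and the partition counts are all illustrative, not the
actual implementation; the bookkeeping of the c and d coefficients is omitted):

    #include <stdint.h>

    #define R_PARTS 16   /* r adding partitions (illustrative)                     */
    #define S_PARTS 4    /* s doubling partitions for a mixed walk (illustrative)  */

    typedef struct { uint32_t x[4], y[4]; } point_t;   /* affine point, details omitted */

    point_t F[R_PARTS];                   /* precomputed constants f_j = u_j*P + v_j*Q */
    unsigned L(const point_t *p);         /* index function: point -> 0 .. R_PARTS+S_PARTS-1 */
    void point_add(point_t *r, const point_t *a, const point_t *b);
    void point_double(point_t *r, const point_t *a);

    /* One step of an (r+s) mixed walk: add a precomputed constant, or double. */
    void walk_step(point_t *x) {
        unsigned j = L(x);
        if (j < R_PARTS) point_add(x, x, &F[j]);   /* r-adding step */
        else             point_double(x, x);       /* doubling step */
    }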
So what is the negation map? It was introduced in 1998 by Wiener and Zuccherato; I
probably don't pronounce it correctly. The idea is to define an equivalence relation on
your group by associating P and minus P with each other, and then, instead of solving your
discrete log on your whole group, you do it modulo your equivalence relation, and this has
size n and this has size n over 2, so in practice this has the huge advantage that you get a
speedup of a factor square root of 2. The only thing you need is that, given a point P,
you are able to compute the point minus P really fast, and vice versa, and for elliptic
curves this is the case: if you have a point, then the negative of this point is obtained by
simply taking the negative of your y-coordinate, if you are using [inaudible] coordinates.
So in practice, how would your iteration function look? We are at some point P_i; we
compute to which partition it belongs, we select the correct partition constant, we add the
two, and we also compute the negative of this point. Then we have to pick, by some
property, the representative of this equivalence class; you can for instance, if you
normalize the y-coordinate, take the one with the lowest or the highest value. It doesn't
matter. You need to have a deterministic way of selecting a representative of this
equivalence class. Let's assume that in this case it is the minus point, and then we take
this as our next point in our walk. So that is the idea of how you would alter your
iteration function for the negation map.
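Building on the declarations in the previous sketch, the negation-map variant of the
iteration function he just described could look like this (is_canonical stands for whatever
deterministic rule picks the representative of {P, -P}, e.g. the smaller normalized
y-coordinate; again an assumption, not the actual code):

    void point_negate(point_t *r, const point_t *a);   /* (x, y) -> (x, -y) */
    int  is_canonical(const point_t *a);               /* deterministic choice within {P, -P} */

    /* One step of the walk on equivalence classes {P, -P}. */
    void walk_step_negmap(point_t *x) {
        unsigned j = L(x);
        point_add(x, x, &F[j]);         /* x <- x + f_j                  */
        point_t neg;
        point_negate(&neg, x);          /* also form -(x + f_j)          */
        if (!is_canonical(x)) *x = neg; /* keep the class representative */
    }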
Something that is already well known: if you simply apply it as described here, you will
not solve any discrete logs at all. This is due to something which is called fruitless
cycles. The most simple fruitless cycle is one of length 2, discussed also by Duursma,
Gaudry and Morain in 1999. You have your point P, and with probability 1 it belongs to
some partition, let's say i; so you compute the next point, and let's say the representative
here is the minus point, which happens with probability one half. So then we are at this
point, and let's say this point also belongs to partition i; since there are r partitions,
assuming that we are using an r-adding walk, this happens with probability 1 over r. And
then with probability 1 the representative will be the minus point again, and then we are
back at our point P. And when we implement this, when we run this, it will simply go and
go and go; we are stuck in this fruitless cycle and we don't get anywhere, and it happens
with probability 1 over 2r.
So one way of reducing this event, introduced in 1998, is by more or less detecting this
situation, and that is what they are doing here. They are going to look for a partition
number which is minimal such that these two are not the same; that is what this formula is
saying. So if they are not the same, you simply take this partition; if the partition number
of these two points is the same, then you increase the partition number and hope that the
resulting point does not fall in the same partition. Of course you pay a price for this,
because it might happen that you compute the next point and, like here, it belongs to i and
you have to redo it. So the cost increases by at least a factor of 1 plus 1 over r, and in
some cases it might happen that all of your partition constants lead to something bad and
then you restart the walk, but that is very, very unlikely, with probability 1 over r to the r,
so once every r to the r steps.
So this was only for two cycles. Of course cycles of any length can occur, so there is a
general technique described by Gallant, Lambert and Vanstone in 2000; they have an elegant
method. They simply say: we run for some alpha steps, we simply iterate and iterate and
iterate, where alpha is usually much bigger than beta, and then we record the next beta
steps. We store them in memory, then we take the next point P and we compare it to the
previous beta points. If it is the same as any of the beta points, we know that we are stuck
in a cycle; if it's not, we know for sure that no cycle of length up to beta is present. So this
gives you a technique to detect cycles, but how do you get out of the cycle? Cycle escaping
can be done in numerous ways, of course. What was proposed is: you deterministically
select one of your points which is the representative of the cycle and then you add
something else to it. You can add one of your other partition constants to it, you can simply
add a precomputed value f' to it, or one from a list of precomputed values. And then you
are not stuck in the cycle anymore, because you added a different constant to this current
point.
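Putting the alpha/beta detection and a simple escape together, again building on the
earlier declarations (ALPHA, BETA, points_equal and the escape rule are illustrative
assumptions; distinguished-point output and termination are omitted):

    #define ALPHA 1024   /* iterate this many steps without checking (illustrative) */
    #define BETA  16     /* then record this many steps (illustrative)              */

    int points_equal(const point_t *a, const point_t *b);

    void walk_with_cycle_detection(point_t *x) {
        for (;;) {
            for (int i = 0; i < ALPHA; i++) walk_step_negmap(x);  /* just iterate */

            point_t recent[BETA];
            for (int i = 0; i < BETA; i++) {                      /* record beta points */
                walk_step_negmap(x);
                recent[i] = *x;
            }

            walk_step_negmap(x);                                  /* the next point */
            int stuck = 0;
            for (int i = 0; i < BETA; i++)
                if (points_equal(x, &recent[i])) { stuck = 1; break; }

            if (stuck) {
                /* Escape: deterministically pick a representative of the cycle
                   (e.g. the recorded point with the smallest x-coordinate) and add
                   a constant that was not used inside the cycle, or a dedicated f'.
                   Here, simplistically, just add a different partition constant. */
                point_add(x, x, &F[(L(x) + 1) % R_PARTS]);
            }
        }
    }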
So this was all known; this was all already known in the literature, and it was, as far as I
know, pretty much assumed that this would solve all issues. But in practice, when we
implemented it, I saw for instance that this happened pretty frequently. This is when
using this technique to reduce the occurrence of two cycles, but still two cycles can occur,
and this is an example. Let's say we are at point P; we walk to the next point, and this point
belongs to partition i - 1, so everything is fine; they are not the same, so we go there.
When we are here, we go to the next point, which belongs to partition i - 1; that is not good,
so we add one to it. But hey, then we cancel the previous step. Of course the probability
that this happens is not 1 over 2r anymore; it is a bit lower, a lower bound
is 1 over 2r cubed, but still it will happen in practice and you need to have some way of
detecting this, otherwise all of your walks eventually get stuck in some infinite loop.
An obvious extension of this cycle reduction technique is to reduce the occurrence of four
cycles. A four cycle, similar to the two cycle case, looks like this, and entering one
happens with this probability. It is a little bit more tricky to define your iteration function,
because your iteration function needs to be deterministic for every point where you start,
because you collapse points with this negation map; you need to make sure that you are
always on the same walk, otherwise you won't find anything. But the idea is exactly the
same, except now you don't look one point ahead, but two points ahead, and you hope to
reduce the occurrence of four cycles. The disadvantage here is that you might have to redo
two steps, so your iteration function becomes a little bit more expensive, but the funny
thing is that there is a positive effect as well, because now your walk gets a somewhat
different shape: you don't allow certain patterns in your iteration function, so for our
function g, the image of g over all these points is a subset of our whole group, with the
size of this image roughly (r - 1) over r times all of the points in our group, so we get a
positive effect of the square root of (r - 1) over r.
But exactly the same situation as before can occur: we can have two cycles even with two
cycle reduction. A nice thing to know is that with four cycle reduction this one actually is
prevented, because it looks two steps ahead and it will notice: hey, we are doing i here and
two steps ahead we will do an i again, so it's prevented. But with four cycle reduction, of
course, we can still have two cycles as well; the dotted lines, the steps which are taken and
then canceled, are just slightly different, and it will happen, the probability is only slightly
higher. And of course we can have four cycles with two cycle reduction, and four cycles
with four cycle reduction as well, and you might think I am completely crazy, but we
observed these things many, many times in practice. These things really, really happen. So
here you can see all the steps that are taken and then canceled; all the conditions are
exactly right so that all these points cancel each other out.
So we saw that all of these probabilities strongly depend on r, so you could say: why not
simply increase this r to some high value, then these things become really, really unlikely
and the problem is solved. In theory, yes, that is completely true; in practice it's not that
simple. Here, for instance, is a plot on a regular PC. Here we have the size of r, so the
log of r for the r-adding walk, the number of partitions we need to store, and here we have
the performance of this iteration function. Yes?
>>: So this step is about R equals [inaudible] probability is something like one in 1
billion?
>> Joppe Bos: Yes?
>>: That's too high for you?
>> Joppe Bos: Yes, because we want to calculate a lot of steps. As we will see, we
calculate many millions of steps per second, even on one processor core, and you will need
to run this for years, so even if the probability that this happens is one in a few billion, you
run into it within a minute. So here you see that in practice, when you increase the size of
r, it stays more or less flat for a while, it doesn't decrease too much, but here you see a
significant drop, which can easily be explained: the whole partition table doesn't fit into
your cache anymore. So you have constant cache misses; it needs to be retrieved from
memory, and your iteration function in practice will drop from 4 1/2 million to roughly
half a million steps per second. So there is this, plus the fact that you have not eliminated
the occurrence of the cycles; they are much more rare, but they will still happen.
So I discussed all of this, and now that you have seen this you are being smart. You say:
okay, let's use all of this. We are going to use an r-adding walk with an r that is not too
small and not too big, we are going to use two or four cycle reduction techniques, and we
use this method for [inaudible] cycles where we go for alpha steps and then we record beta
steps. I claim many walks will never find a single distinguished point. That depends on
your distinguished point property, but your discrete log will not be solved. This is due to
something called recurring cycles. Let's have a look at this funny picture. Let's assume
that we are stuck in this four cycle. After a while this is detected; this can be any bigger
cycle as well, but for the sake of simplicity it is a four cycle. And let's say that P is our
designated point to escape this cycle. So we add a different partition constant to it, P plus
f_k, and we escape out of it, but it is very likely that we recur to the same cycle again, and
then again we are stuck in an infinite loop, because after a while the cycle is detected and
we say, yay, we are really happy, we are going to escape the cycle, but then within a few
steps we are back in the cycle. So again, we haven't solved anything.
So, to recap, an overview of all of the different probabilities. We can see here the
probability to enter a two or a four cycle in the regular way, or when running the two cycle
or four cycle reduction techniques, and you can indeed see that the probabilities are
significantly reduced, so it becomes less likely to enter such cycles. Here are the
probabilities to recur to your same cycle after escaping, and here the price you pay for it,
so how many times more expensive your iteration function becomes. So how are we going to
fix this last issue? Heuristically you can assume that a cycle with at least one duplication
in it is not fruitless. What do we mean by this? Let's go back to the beginning of
the presentation. We are actually looking for a cycle: the Pollard rho algorithm wants
you to have a cycle, because then you get this collision. And you can show that if you have
a cycle with at least one duplication in it, it is not fruitless. What does that mean, it's not
fruitless? It means you can solve your discrete log problem.
So the idea is to reduce the number of fruitless cycles by either using a mixed walk or, if
you are stuck in a cycle, by simply going to a designated point and doubling that point.
Then the probability to go back to the same cycle is extremely low. The advantage is that
you more or less avoid all recurring cycles; the disadvantage is that duplications, when
used with [inaudible] coordinates, which is what people would use with Pollard rho in
practice, are slightly more expensive: using this inversion trick, a duplication will cost
roughly 7 multiplications, while an addition costs 6 multiplications. So you pay, and since
multiplication is the main time-consuming part of Pollard rho, this is a disadvantage, but
you avoid a lot of trouble.
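The doubling-based escape is then a small variation of the escape step in the earlier
sketch (cycle_representative is an assumed deterministic choice, e.g. the recorded cycle
point with the smallest x-coordinate):

    const point_t *cycle_representative(const point_t *pts, int n);  /* assumed helper */

    /* Escape a detected fruitless cycle by doubling a deterministically chosen
       representative of the cycle instead of adding another constant. A cycle
       containing a duplication is (heuristically) not fruitless, so recurring
       to the same cycle becomes extremely unlikely. */
    void escape_by_doubling(point_t *x, const point_t recent[], int n) {
        point_double(x, cycle_representative(recent, n));
    }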
So we ran a lot of tests, and here I will explain in more detail some numbers to illustrate
this. What do these numbers actually mean? So f means two cycle reduction and g means
four cycle reduction; e means that we are using the regular cycle escaping and e bar means
that we escape by using duplications. On top here are the performance numbers when not
using the negation map. Let's first describe what setting we are using. We are going to run
for 2 times 10 to the 9 iterations; on a regular PC, an AMD Phenom, on which we tested, 10
to the 9 iterations took roughly half an hour to run. So we run a considerable number of
iterations: we run twice half an hour, and we are going to ignore the yield from the first
half hour, because we might find distinguished points there before the walks enter all of
these infinite loops, and that is not interesting because we care about the long-term yield.
So our long-term yield is this second 10 to the 9 iterations. The yield in millions are these
numbers, the first numbers here, and the speedup, where we take this as the base, is the
speedup compared to not using the negation map, including some theoretical measures. For
instance, you can see here that actually, when using a 32-adding walk without the negation
map, the performance is a little bit higher, 7.28 million, but we use the r equals 64 line as
the base, because a 32-adding walk is slightly less random than a 64-adding walk, and if
you take this theoretical measure into account, you expect to find your discrete log a little
bit faster with this r equals 64. So this will be our base which we compare to, and now
let's look at some of the other cases.
So, for instance, if we just use the four cycle reduction technique without any cycle
escaping, it's not a surprise that when your r is really low, in the second half hour they are
all stuck and we won't find anything. So you cannot expect to solve anything, a large
ECDLP, with it. Of course when your r grows, it will still find something in the second
half hour, but then you would expect that a few hours later these are all stuck as well.
So for our best combinations we display some more numbers. These top numbers here are
the number of additional additions: these are the additions spent, for instance, in a
fruitless cycle, or spent when computing the next point when we had to redo it. So they are
elliptic curve point additions which did not bring you any further; they were pointless, so
to speak. And the numbers below are the number of duplications used to escape. So, for
instance, here we didn't use any duplications; it was a regular e, not an e bar, so these are
always zero, but here we used an e bar, so we used some duplications and these were a
little bit more expensive.
So our best combination was using the two cycle reduction technique, detecting longer
cycles and then escaping them by using a duplication, for an r of 128, because if we went
bigger we ran into all of these cache problems, and there we see that we get a speedup of a
factor of 1.29. So first look at this 6.57: our performance is actually a bit lower than
when not using the negation map, which was 7.28, but since we get this square root of 2
speedup, the net effect is 1.29. The number below here takes only these additional
additions and the extra cost of duplications into account: it is the maximum speedup we
would theoretically expect. So without considering side effects from the cache or from
branching [inaudible], not taking into account any practical considerations, just from
these numbers, what is the maximum achievable speedup we can get.
So there is still a gap between these two, but I will say something about that on the next
slide. So, using the negation map optimization for solving ECDLPs in practice: it actually
is useful and you should use it. We didn't use it for our PlayStation calculation, and
there are some things to say there as well, because it is a SIMD architecture, which makes
things even more complicated, but you should use it; you should make sure that you
use two and four cycle reduction techniques and you should make sure that recurring
cycles are avoided, for instance by escaping the cycle using duplications and by using a
medium-sized r-adding walk. So we managed to get a speedup of at most 1.29, which is
unfortunately a bit less than our square root of 2, but we stated in the paper that it might
simply be that we weren't capable of coming up with a better implementation
or better techniques to do this, so we left it open to find maybe better cycle reduction or
escaping techniques and we challenged people to do better. And actually, this year, as you
can see, a paper by Dan Bernstein, Tanja Lange and Peter Schwabe presented follow-up
work which indeed addressed two of these issues. They targeted the PlayStation
architecture as well, and first of all they presented a way to implement the negation map
using straight-line code, so without using any branches; that was really nice, and that
already sped things up. And then they presented some techniques to use a huge
adding walk, a 2048-adding walk, and that already reduces the probability to enter cycles a
lot. On a regular PC it might be a little bit different, because the Cell, as we showed, is
cacheless: if it fits in your local store, everything is fast; there is no cache.
Unfortunately, they didn't measure a direct comparison of how fast their implementation
was compared to a no-negation-map setting. They only measured some functions in
their code, so there was no direct real comparison, but they said: we estimate that,
compared to a no-negation-map setting, we achieve a 1.37 times speedup, and this is
exactly in agreement with the theoretical numbers that we showed before. Yes?
>>: [inaudible] you make 2048 adding you make 2000 steps [inaudible]…
>> Joppe Bos: No. A 2048-adding walk means that you partition your whole group
into 2048 partitions. So these are the numbers here; here we use 16 partitions to 512
partitions, and they used a technique to store these partition constants using less space so
they could fit more partitions in memory, and that reduces the likelihood of running
into the fruitless cycles. So that concludes my presentation, thank you.
[applause].
>> Joppe Bos: Yes?
>>: What [inaudible] the numbers you gave, I estimate that going up to about a 128
[inaudible] prime is a factor of 1000 more cost, but there would be a lot more cost going
from 128 to 129, I imagine. And I am wondering if you [inaudible] how much pain, how
much cost…
>> Joppe Bos: That cost estimate was very, very rough and imprecise, and is meant only to
take into account the bit complexity and indeed not the additional arithmetic complexity
when you have to compute with slightly larger numbers. No, that is true. So that is an
additional challenge even.
>>: Yeah, so maybe the other way of asking the question was, what if you were doing
this on machines with 110-bit registers as opposed to 128-bit registers?
>> Joppe Bos: Yes. Then we would have to come up with different arithmetic tricks and
nice methods to run all of this. No, it's true, so we…
>>: A factor of two or three, an order of magnitude, any idea?
>> Joppe Bos: Ah, yeah, so it depends. It also depends on which algorithms you are using;
because of our relatively small mid-size numbers, we were using
schoolbook multiplication, for instance, for everything. And since multiplication is your
main runtime, it would maybe depend on this algorithm, but if you go for much larger
numbers you could use [inaudible] or even FFTs, but these things are not faster for these
small elliptic curve sizes as targets. Yes, so I think the 131-bit challenge is--so indeed take
the 131, it's really an unfortunate size to implement, because even on your PC you have
these SSE extensions, which are nicely 128 bits. That is also why people like these binary
extension fields, because then you can bitslice everything and that tends to be much
nicer; then you don't have this problem.
>>: Unrelated, but another question also: have you looked at Xboxes [inaudible]?
>> Joppe Bos: So, that is a good question. No, because as far as I know you are not
allowed to install your own operating system on an Xbox, and we did not want to break
or hack open anything, so we never looked in detail at the architecture of the Xbox or
how to get these things running on an Xbox.
>>: Would you be willing to if you got permission?
>> Joppe Bos: Yes. It would be a really fun project.
>>: Okay.
>> Peter Montgomery: Any more? If there are no more, then let's give another round of
applause.
>> Joppe Bos: Thank you.