>> Francesco Logozzo: Good afternoon. For me it's a pleasure to introduce our speaker
today, Roberto Giacobazzi, a professor at the University of Verona. As you know,
Roberto is one of the most important researchers in static analysis and in abstract
interpretation. He's famous for his work with Francesco. He'll give a talk on Thursday on
domain theory and abstract interpretation, particularly on the concept of completeness.
And he's also famous because he was my very first professor at the university, the one
who taught me about program verification and weakest preconditions.
>> Roberto Giacobazzi: Yeah.
>> Francesco Logozzo: He's young because it was not so long ago.
>> Roberto Giacobazzi: I know. [Inaudible]
>> Francesco Logozzo: He's relatively young, yes. And, yeah, that was back when I was at the
University of Pisa, before he moved to the University of Verona. I had him for one
year and he never taught me about abstract interpretation. And then we found each other
again later. So, thank you very much.
>> Roberto Giacobazzi: Okay, thank you, Francesco. Thank you to all of you. So today I
will try to introduce the notions of completeness and incompleteness in abstract
interpretation, but I will do it in such a way that the interpretation of these two notions
is more in language-based security than in program analysis. But I think that
the notions are basically the same, because it is about the precision of an abstraction,
the precision of a procedure that tries to learn what the program does. And what we will
see is that changing the program in order to make this analysis imprecise is like
obfuscating, hiding information. And refining the analysis in order to get the information
from the program is like attacking the code.
So these two, this cat-and-mouse battle, is exactly the battle that happens
in security, from the language-based approach of course. The scenario, quickly: showing this
slide here is like saying something obvious. I mean, there is a line that goes from
mainframe to ubiquitous computing. And this puts things in a context where typically you cannot
always trust the environment where your program runs. So the standard crypto
assumption is that the perimeter of defense is around the software, at least around the
software. Alice and Bob try to communicate and the attacker tries to listen in the middle. So I
want to hide the information, but I cannot hide the fact that the message exists. Indeed,
crypto doesn't hide the fact that the message exists; it hides the content of the message. I
will try to interpret completeness and incompleteness, namely the precision of an analysis, in
the context of white-box attacks, or white-box cryptography, which is more related to the
ubiquitous nature of software nowadays.
The fact is that Alice produces her information, but she cannot completely trust, first,
that it is Bob who will run it, and secondly, that the environment that Bob uses can be
trusted. So basically I will be in the context of a man-at-the-end attack. When
Alice delivers the software, at the end there can be somebody that tries to do
complete reverse engineering and crack the information that the program contains. So
this is the context we try to approach. And this is basically how these things are
handled in reality: namely, there is an adversary. This is the asset that I want to protect.
There is a sensor that tries to see whether this asset has been attacked. There is a
control system that activates the defense. This is typical in tamper-proofing, which is a
kind of software that reacts to [inaudible], or in code [inaudible], marking, fingerprinting
and so on. Well, this has quite a value in the market. And the interesting thing, I think,
and this is the line of my most recent research, is trying to see whether behind these
different bubbles there is a common path or ground, which can be linked to the precision
of the analysis by viewing the analysis as the process of attacking the code.
And this is basically the picture, because typically in black-box cryptography we have an
input-output view, but we cannot see much about the inside of the running of the code. We
can weaken this, and the more the attack on the code tells about the internal
runs of the program, the more we move from the black box to the
white box. And this is something like making the analysis more and more precise about
the behavior of the program. Going along these lines corresponds exactly to
refining the abstraction. So basically, if I want to interpret gray-box crypto, white-box
crypto and black-box crypto, I can say that, "Well, this is a standard input-output
abstraction," and that's the identity [inaudible]. So in the middle there are levels of obscurity
that I can have, and for each of them there will probably be a reaction or a protection
system that my code has to deliver in order to defeat that attack. I want to link these two.
So I want to link the precision of the attacker with the fact that the program
can be transformed in order to defeat that abstraction.
Okay, what is this? It looks like a picture of the beginning of the universe, the very early
seconds of the universe. If you look, the picture looks like this. But it's not; it's a chess
board. So what's the difference between these two? This one is absolutely obscure. Here we
have information. What kind of information? Well, we know the pieces on the chess
board, so we know how many of them, what type and so on. The relation between these
two has to be understood with respect to the eyes, to the perception we have. So the
analysis, our view over this, is able to extract something: colors, shadows. And here, it is
able to extract more. I want to use this analogy to do the same on the software,
on the code, with respect to an analysis which will be an abstract interpretation.
So we need a model and, well, of course, it's the standard model that we all know. It's too
complicated, too complex, undecidable. [Inaudible] showed us that this is not recursive
in general, so it's absolutely complicated. So this is a complete mess and, well, we need
abstraction. Abstraction means that we don't have a precise
definition of each single transition of the traces, but we have an approximation of these. And this
should be computable. In this step we have a loss of precision. We all know that, for
instance, if we take the interval of the maximum and the minimum of the traces
computed, we get an interval that contains many [inaudible] traces that don't exist in the real
execution. And we can set up a logic around this and have a logic over abstract traces.
There was an interesting old paper in the nineties that links model checking and abstract
interpretation: analyzing [inaudible] is model checking of an abstract interpretation.
And then it may well happen that we deal with imprecision:
we think that this is the interval computed, but
in reality the true interval at the end is much smaller than the interval
computed by the analysis, which is bigger. So we have a loss of precision.
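In the standard notation, this loss of precision can be written as follows (a sketch; F is the concrete transformer, F-sharp its abstract counterpart, alpha the abstraction):

    \[ \alpha(\mathrm{lfp}\,F) \;\sqsubseteq\; \mathrm{lfp}\,F^\sharp \qquad \text{(soundness: the analysis over-approximates)} \]
    \[ \alpha(\mathrm{lfp}\,F) \;=\; \mathrm{lfp}\,F^\sharp \qquad \text{(completeness: no precision is lost)} \]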
And incompleteness means that the analysis loses precision, with an error. Okay, from the
early definitions of abstract interpretation in Cousot and Cousot '77 and '79, there has been a
flourishing of works that deal with precision: Steffen, Mycroft, and then myself,
Francesco and Francesca tried to solve the problem once and for all. And we proved
that indeed it is possible to refine an abstraction with respect to any [inaudible]
continuous function, namely any computable function, in the least possible way so
as to make it complete. And then we tried to apply this little result to many
aspects, and language-based security is the one that we will try here. The scenario that I've
shown you is the area where I've tried to show this application. What are the ingredients
of [inaudible]?
So the ingredients are the standard ones: abstraction. Abstraction I think most of you
know very well. I use the standard formalization from abstract interpretation: a
pair of functions that take a concrete object, abstract it into a property, and then
concretize it back to something which is above, which is the error made in the
abstraction. And this corresponds exactly to seeing an abstract domain, or to seeing a subset
of the concrete that contains only the points that represent the abstract objects perfectly
[inaudible]. And this means that basically an abstract domain is nothing else than an
operation that takes a concrete object, maps it somewhere above,
which is its approximation, and then is stuck there, because once you lose information you
cannot recover it any more. This is an [inaudible] closure operator. So the lattice of all
[inaudible] closure operators is the lattice of all possible abstractions.
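In the usual formalization (a sketch of the standard definitions), the pair of functions is a Galois connection, and its composition is the closure operator just described:

    \[ \alpha : C \to A, \quad \gamma : A \to C, \qquad \alpha(c) \sqsubseteq a \iff c \leq \gamma(a) \]
    \[ \rho = \gamma \circ \alpha \ \text{ is monotone, idempotent } (\rho \circ \rho = \rho) \text{ and extensive } (c \leq \rho(c)) \]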
And this is pretty nice because there you can play the game of transforming closures,
which means transforming domains. So when you do this standard approximation you
typically inject an error, because you compute in the abstract instead of computing in the
concrete. And the error you make corresponds basically to being sound but not complete.
And the error can be propagated in a fixpoint, and this is what happens typically. This
would be the true abstraction of the true computation; if you compute in the abstract,
you get an object which is an over-approximation.
What does soundness mean? The standard soundness that we know is the following: well, you
typically have a function that computes from x to f of x. But in the abstract domain you
don't have x; you have a property of x, so you have the approximation of x. Then you
compute the function. And then you need to go into the domain of abstract
objects, so you approximate the result. So in the abstract domain you compute this. In
the concrete domain you compute this. So in this
case you are sound, because you are above the approximation of the true result,
which is this. If these two collapse, you are complete. This is called backward
completeness, namely by approximating the input you don't lose precision in the
computation. Typical example: the rule of signs. The rule of signs is complete with
respect to multiplication but is incomplete -- it's sound but not complete --
with respect to addition, because you lose the magnitude of the numbers. So once you have
a positive and a negative and you want to multiply, you get exactly a negative.
But if you make the addition of the two -- once you've lost the magnitude of the numbers,
you don't know any more which of the two was prevailing. So you can only say, "I don't
know" [inaudible].
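A minimal executable sketch of this example (the names alpha, abs_mul, abs_add are illustrative; the domain is the plain four-value sign lattice, bottom omitted for brevity):

    # Rule-of-signs domain over sets of integers; a sketch, not the talk's slides.
    NEG, ZERO, POS, TOP = "-", "0", "+", "T"

    def alpha(s):
        """Abstract a non-empty set of integers to its sign."""
        if all(n > 0 for n in s): return POS
        if all(n < 0 for n in s): return NEG
        if all(n == 0 for n in s): return ZERO
        return TOP

    def abs_mul(a, b):
        if ZERO in (a, b): return ZERO
        if TOP in (a, b): return TOP
        return POS if a == b else NEG

    def abs_add(a, b):
        if TOP in (a, b): return TOP
        if a == ZERO: return b
        if b == ZERO: return a
        return a if a == b else TOP  # pos + neg: the magnitude is lost

    xs, ys = {1, 2}, {-5, -4}
    # Multiplication: abstracting the inputs first loses nothing (backward complete).
    assert abs_mul(alpha(xs), alpha(ys)) == alpha({x * y for x in xs for y in ys})  # both "-"
    # Addition: every concrete sum is negative, but the abstract sum is "don't know".
    print(alpha({x + y for x in xs for y in ys}))   # "-"
    print(abs_add(alpha(xs), alpha(ys)))            # "T": incomplete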
Forward completeness is perfectly the dual. In this case, instead of looking at whether you lose
precision by approximating the object in the input with respect to what is computed, you
see whether you lose precision by approximating the output. So you assume that the input
is abstract. And then what happens is that you are incomplete when you have an
error between abstracting the output and having the concrete output. It's perfectly dual.
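With rho the closure and f the concrete operation, the two notions are usually written as follows (a sketch in the standard notation):

    \[ \text{backward completeness:} \quad \rho \circ f \;=\; \rho \circ f \circ \rho \]
    \[ \text{forward completeness:} \quad f \circ \rho \;=\; \rho \circ f \circ \rho \]

Backward says that abstracting the input first changes nothing once the output is abstracted; forward says that f maps abstract elements to elements already in the domain.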
Look at this example; this is a classical example to show these two notions. Being
abstract and concrete is a relative notion, so you can be an abstraction of something which is
more concrete than another and so on. So consider that this is your concrete domain; it's
a simple lattice of intervals. And take this abstract domain, the one with the
red bullets. Take the square operation. The square operation is computed with the blue
arrows here. This domain, which says, "I don't know the number. It's positive. It's between
0 and 10," is forward complete but not backward complete.
Why? Being backward complete means that you don't lose
precision by approximating the input of the function. So if you approximate [0, 2], you get
here. Then you do the square and you get here. While if you don't approximate the
input and you do the square, you get here. This is the error made here; it is not
backward complete, but it is forward complete: all these points that are the output of
the function square are already inside the abstract domain. So basically,
being backward complete means the domain contains the inverse image of the function with
respect to which I want to be complete. This is linked with the counterexample-guided
refinement algorithm that tries to refine the partitions going backward by the precondition.
The only difference is that we proved this in the year 2000 and Clarke did it in 2002.
Sorry?
>>: And [inaudible] backward complete, what does that give you with respect to the
concrete semantics?
>> Roberto Giacobazzi: The fact that, with respect to the approximation of the whole
computation -- if you compute in the concrete and then you approximate the output, or
you compute in the abstract, you get the same. This is backward completeness.
>>: But how does that help static analysis?
>> Roberto Giacobazzi: Well, it's the best you can do; you cannot get better. You
don't have false alarms.
>>: With respect to the abstract domain?
>> Roberto Giacobazzi: Yeah. Conversely, there are domains that are backward
complete and not forward complete, and this is all dual stuff. So what we proved is that
we can modify domains. Any...
>>: Maybe it's interesting to say that the trivial abstraction is trivially complete, both
backward and forward.
>> Roberto Giacobazzi: And forward. Yeah, of course. The concrete semantics is
perfectly complete. We can modify domains, namely -- This is a case of completeness.
You see that x is approximated here, then computed and approximated there, so the two
elements collapse exactly to the same point. This is incompleteness. When this
happens, there is an error here due to the approximation. Well, in this case, if
you have an incomplete abstraction, you can make it complete by adding points
-- you refine your abstraction -- or by eliminating points, and then you simplify your abstraction.
Typically in static analysis we refine, because we look for a more precise domain that is
able to avoid false alarms. But you can also avoid false alarms by removing information,
which is simplification.
>>: But it might have more false alarms with respect to the concrete semantics.
>> Roberto Giacobazzi: You don't have more false alarms, because you are complete.
You are less precise with respect to the property: you don't have the same
property any more; you lose the property you want to look for. But you remove the presence of
false alarms.
>>: [Inaudible] abstract domain [inaudible]...
>> Roberto Giacobazzi: Yeah, the property is the abstract domain. And this was proved
-- well, actually it was from '98 -- but basically a backward problem can always be
transformed into a forward problem by considering the inverse function with respect to
which you want to become complete. Amazingly, we can also modify programs, not only domains.
So until now we have a domain and we have a program, and we want to refine the domain, or
simplify the domain, to avoid false alarms for that program. But we can also keep the domain
fixed and change the code, the program, in order to be complete for that domain.
Well, it's possible theoretically. Basically, this is a case of incompleteness, and in order to
become complete you simply have to transform the function into the closest one
from above, or the closest one from below, that is complete for that abstraction.
And this is very easy, because you can compose the function with the abstraction itself or
with the adjoint of the abstraction.
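A sketch of why composing with the abstraction works, using only the idempotence of the closure rho:

    \[ \rho \circ (f \circ \rho) \;=\; \rho \circ f \circ \rho \;=\; \rho \circ (f \circ \rho) \circ \rho, \quad\text{so } f \circ \rho \text{ is backward complete for } \rho; \]
    \[ (\rho \circ f) \circ \rho \;=\; \rho \circ f \circ \rho \;=\; \rho \circ (\rho \circ f) \circ \rho, \quad\text{so } \rho \circ f \text{ is forward complete for } \rho. \]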
Okay, so how does all this fit into security? Let's say static analysis is a way of
attacking code, and code transformation towards obfuscation is a way of protecting code.
We go back to the picture. So basically, what is an obfuscator? An obfuscator is a
compiler, or a bad student writing code. Typically you have your input-output and you
want to keep your input-output, and you want a transformation from this code, where
everybody can understand that what is inside here goes there, to code where nobody can
understand what's happening inside. This has to be a compiler. But true hackers actually
do not perform compilation; they really add junk, reorder code. They do very weird stuff
at the machine level. So the idea is that I want to see how this transformation, tau, can
be systematically derived from the precision, in terms of completeness, of the attacker,
and how this can be done.
So typically attackers use [inaudible] many tools, like GDB and so on, colluding
attacks, differential attacks. There are many ways of attacking code, of doing reverse
engineering and understanding how it works. And most of them use tools that are based
on analysis. So the objection that, well, your way of viewing the relation between
attack and defense is strictly related to analysis and doesn't consider the human
capability of understanding code in the attack, is only partially true, because in reality, for
industrial-size code, reverse engineering cannot be done without a tool based on
analysis, which can be a slicer, a debugger or whatever.
So if you are able to defeat an analysis, you automatically delay a lot the power of an
attacker in understanding the behavior of the code. So this is the idea, basically. The
malicious user has a lens, so he cannot really see everything; he can only see a portion,
an abstraction, of the execution. And the obfuscation wants to make this malicious user
blind. So basically the defense has to turn this into this, and the attacker has to do the
reverse. And I will use some ideas from many years ago by Neil Jones; indeed, this is a
paper that we did together last year. And it's interesting because we said obscuring code
is compiling. Well, you can specify a compiler, at least at the level of specification, as
the specialization of an interpreter.
Because if this is your source code that you want to make obscure, well, we all know
that the source code is equivalent to the specialization of an interpreter with the source
code. And if you want to keep the input-output of the program, it's enough to find any
interpreter for your language and a specializer, and make this combination. But in
most cases you inherit almost completely the structure. So basically, if this is clear, then
this is clear too. The challenge is to make this obscure, namely, to twist something inside
here in order to make it obscure, and to link the twisting of the object inside here to the
power of the attacker. So, look, this is a little program, and this is another program that
computes exactly the same thing. What's the difference between these two? Well, it's obvious.
This is the true code. This is the flattening of the code. If you take the [inaudible] of this
program, it's completely flat.
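As an illustrative reconstruction (not the actual slide), here is what such a pair of programs looks like: the same loop, first with its ordinary control flow and then flattened, with a dispatcher on a program-counter variable:

    def original(n):
        s, i = 0, 0
        while i < n:
            s += i
            i += 1
        return s

    def flattened(n):
        # All control flow is routed through pc; the graph of blocks is flat.
        s, i, pc = 0, 0, 1
        while pc != 0:
            if pc == 1:            # loop test
                pc = 2 if i < n else 0
            elif pc == 2:          # body: s += i
                s, pc = s + i, 3
            elif pc == 3:          # body: i += 1
                i, pc = i + 1, 1
        return s

    assert original(10) == flattened(10)   # same input-output behavior

Here the dispatcher is trivial, and since pc is assigned only constants, a good specializer can recover the original structure, which is exactly the point that follows.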
And everything is handled by the program counter, which is static here, statically
written inside the program itself. So if you have a good specializer, typically the
specializer doesn't return this to you. It's able to understand that the program counter
can be statically derived. So if you apply this equation, you get back to the original. So how can
I make this equation generate this instead of that? This is related to completeness, and I will
show you how. So the attacker -- in order to understand this, we have to understand
what the attacker is. The attacker is an abstract interpretation. So imagine that you have
the previous approximation; you have a function WhichChess that tells which chess pieces are
on the board. You have an abstraction that takes an image and returns another image; for example,
the strange image where we cannot recognize whether it's the origin of the universe or a chess
board. This is an abstraction, because this contains this and many other images of
course.
Then you have a function that counts an upper bound on the number of different types of
chess pieces on the board. Here you have a case of incompleteness, because if you
approximate the image with this -- so if you approximate the input -- WhichChess is only
able to say, "Well, there could be all the kinds of chess pieces on the
board," black and white. So I can produce 12. While instead, if I have the true image, then
I get 7. So moving from this picture to that picture is an incompleteness. And from the
perspective of our eyes, it's an obfuscation. So does it work the same on programs?
Yes.
From my point of view, obfuscating is making an abstract interpreter incomplete. So the
attacker is an abstract interpreter, whatever abstraction it considers, and losing precision
is like returning a maximal amount of false positives, namely failing in the capability of
extracting the true information. And this can be proved by simple reasoning. Well,
basically, if you want to keep the input-output, the transformed code has to have the
same input-output as the original one. Assume that an abstraction is complete. Then if
you compute the abstraction of the semantics, this is equivalent to computing the
abstract interpretation of the program, so you don't lose precision by analyzing. Well,
you obfuscate when you transform the program in order for the analysis to lose some information. This
happens if and only if the transformed code is incomplete for that abstraction; if and only
if.
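Schematically (a sketch, with tau the transformation, rho the attacker's abstraction, the double brackets the semantics, and the io subscript the input-output behavior):

    \[ \tau \text{ obfuscates } P \text{ for } \rho \iff \llbracket \tau(P) \rrbracket_{io} = \llbracket P \rrbracket_{io} \ \text{ and } \ \rho(\llbracket \tau(P) \rrbracket) \;\sqsubset\; \llbracket \tau(P) \rrbracket^{\rho} \]

that is, input-output behavior is preserved while tau(P) is incomplete for rho.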
So losing precision in transforming code is precisely the same as saying that the
transformed program is incomplete for that abstraction. Well, this happens also in static
analysis, because if you compile your code, transform your code, it may well happen that
the same analysis doesn't work any more in the same way; what's happening
there is that the transformation obfuscated the analysis. Let's go back to the example of the
rule of signs. The rule of signs is, we said, complete for multiplication; we all know it. So if
you approximate the input with the sign, you get precisely the sign of the output with no
loss of precision. But it is incomplete with respect to addition.
So if you have a little program, one line of code that makes a multiplication, how
can you obfuscate it with respect to the rule of signs? It's very simple: you transform the
multiplication into an iteration of additions. You keep the same input-output, but the static
analysis -- which is of course very poor, the abstract interpretation of the rule of signs --
fails in extracting the sign of the code. So this is a
transformation that keeps the input-output but obfuscates the analysis. What we will try
to see now is how to derive this transformation systematically from the property that I
want to make obscure, blind.
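A minimal sketch of the transformation and of what the rule of signs sees (the abstract operations are illustrative; note that the plain sign lattice has no element for "zero or positive"):

    def abs_add(a, b):          # sign addition
        if "T" in (a, b): return "T"
        if a == "0": return b
        if b == "0": return a
        return a if a == b else "T"

    def abs_mul(a, b):          # sign multiplication: exact
        if "0" in (a, b): return "0"
        if "T" in (a, b): return "T"
        return "+" if a == b else "-"

    def join(a, b):
        return a if a == b else "T"

    # Original program r := x * y, analyzed with sign(x) = sign(y) = "+":
    print(abs_mul("+", "+"))    # "+": the sign is extracted exactly

    # Obfuscated program: r := 0; repeat y times: r := r + x.
    # Abstract fixpoint of the loop, joining r over all iterations:
    r = "0"
    while True:
        nxt = join(r, abs_add(r, "+"))
        if nxt == r:
            break
        r = nxt
    print(r)                    # "T": the join of "0" and "+" climbs to top

The concrete input-output is unchanged, but the analysis can no longer say anything about the sign of the result.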
Well, we tried this with my little group, and we observed that most tools used by
attackers correspond to abstract interpretations. Profiling: abstract the memory over
particular variables. Tracing, slicing, monitoring, decompilation, disassembly can all be
formalized as abstract interpretations. So if each of these is an attack strategy against
the code, then I can derive from each of them a transformation of the code that makes
that attack blind. Okay, how?
We all know that good programs are well structured and have concise invariants.
Obfuscated programs should be very badly structured and have very ugly invariants,
incomprehensible, or at best ones where you basically say, "I don't know what's happening in that
program." So there is a conflict between being well written and being obfuscated, of course. There is
interesting stuff around the idea of deriving a compiler by specializing an interpreter. The
following two aspects hold: the first is that the program you obtain in this way inherits
the algorithm of the source code, so the algorithm remains basically the same. What
changes is the programming style, which is inherited from the interpreter. So when you
have some code and you specialize an interpreter with that code, you inherit the algorithm of
your source, but the programming style is taken from the interpreter. So if I want to
obscure my code, I have to twist the interpreter in order to change the programming
style in such a way that the analysis becomes blind.
So I have to derive a distorted interpreter. Well, from the interpreter I have to move to an
interpreter which is distorted but is still an interpreter for my language. An example:
let's see this with two examples. The first is flattening. Flattening is a pretty well
established technology -- actually, I think the very first patent around this was by
Microsoft in 1992, so we go back a while. And they were hiding in this flattening the key for the
use of the program, as a way of activating the code, because the order of the
blocks becomes relevant in order to activate the code. It's an interesting patent to learn from.
Well, actually, the technology of flattening is much more developed now, and there is a company,
Cloakware -- which is now completely [inaudible] by Irdeto in Canada, a big multinational
company doing security -- that basically made flattening their core
business.
The flattening idea, simplified, is the following: you have your control flow graph, you
flatten it, and you have a dispatcher that decides which block goes into execution. Of
course, all the complexity is moved from the control flow graph to the dispatcher. The
dispatcher can be very complicated and become flow sensitive, so if you input some data,
the sequence of blocks changes. For the same data you may have changes of the control
flow, because basically blocks are redundant and so on. But it is
flattening. So it works very well with this example, because if you take
the program that I showed you before: this is the original code, this is the flattened code.
The dispatcher here is very basic; it's basically the program counter.
These two are exactly in correspondence with what? With the source program and the
specialization of an interpreter with this code. Look at the interpreter. The interpreter is
by itself flattened code, because you have [inaudible], the code [inaudible] goes back to the
same loop. How? Well, if I take this program and I specialize a little interpreter for C, I
don't get that, because the control flow here is static. So I can predict the next program
counter perfectly, and once I predict it, the specialization does a little pass-through
evaluation and generates the true code.
This should not happen because otherwise I get back the original code, and I want an
obfuscated one. So how can I make it? Well, you take the interpreter, and if you force the
program counter to be dynamic, so that the specializer cannot understand it -- it's forbidden for
the specializer to understand and analyze the program counter -- then automatically the
specialization generates for you a flattened program. So by specializing this interpreter
with the original code, forcing the program counter to become dynamic, you get an
automatically flattened program.
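In partial-evaluation notation this reads (a sketch; int is an interpreter, spec a specializer, and the first line is the standard interpreter and mix equations):

    \[ \llbracket P \rrbracket(d) \;=\; \llbracket \mathit{int} \rrbracket(P, d), \qquad \llbracket\, \llbracket \mathit{spec} \rrbracket(\mathit{int}, P) \,\rrbracket \;=\; \llbracket P \rrbracket \]
    \[ \llbracket \mathit{spec} \rrbracket(\mathit{int}_{\text{pc dynamic}}, P) \;=\; \text{a flattened program equivalent to } P \]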
Then, if you twist the interpreter further, you add a very complicated homomorphically encrypted
function around the program counter, and you get a much more
complicated way of flattening the program, making it more and more secure. But why
is this true? Namely, why is making this dynamic related to the
attack, and how? Because this looks like a trick: I have an interpreter. I force the program counter
to become dynamic. Automatically, it returns me the flattened code. Where is the attack
there? We proved a theorem that says you are forced to be dynamic if and only if you
want to make incomplete a very simple abstract interpretation: the one that
constructs the control flow graph.

So if you take a very simple abstract interpretation that forgets completely about the
memory of your computation and simply extracts the control flow graph, you make that
incomplete if and only if the program counter is dynamic. So...
>>: And then I move to tracing the control flow.
>> Roberto Giacobazzi: Of course. And then you swap to another attack, and you try to
make the data incomplete. Why? This is the theorem. Extracting the control
flow graph from the execution is equivalent to extracting the control flow graph statically --
so your algorithm for extracting the control flow graph is complete, you don't lose
precision -- if and only if the program counter is not a program
variable; it's not a variable, so it's static. The attacker here is the algorithm that extracts
the control flow graph, which is purely static; it's an inspection of the code. It can be easily
expressed as an iteration over the code, by a simple abstract interpretation that forgets the
computation memory.
You don't lose precision if and only if that is fully static. Namely, if you want to make it
incomplete, obscure, you have to make it dynamic. This is exactly what you do in order
to generate the transformed code. So basically, flattening is nothing else than distorting
an interpreter by forcing the program counter to become dynamic, which makes the
abstract interpretation that extracts the control flow graph imprecise. Is there a theory
behind this? Yes. It's exactly the theory of transforming domains, making them complete,
incomplete and so on.
I'll go quickly over this. Typically you have a domain and you have another domain. And
if you refine to become complete, you add points and you become more complete. For
instance, [inaudible] refines a domain to become more complete. So you add points and
the domain becomes more and more precise. Here you have many domains that may
reach the same point, so there are many domains that, once refined, provide you the same
result. Among all of them, take the most abstract, if it exists. Were it to exist, it would
correspond to a kind of compression of your domain: the most abstract domain that, once
refined, gives you the target domain, which is this. Yeah?
>>: Just thinking, shouldn't it exist [inaudible] because it's a complete lattice, the
lattice of upper closure operators, so...
>> Roberto Giacobazzi: No, that...
>>: ...[inaudible].
>> Roberto Giacobazzi: There are cases where it doesn't exist. For instance, if the
operation with respect to which you refine is negation. You have a square; you
have one [inaudible] abstraction; you add one point; the other one adds the other
point; but the most abstract one doesn't contain any of them. It is complete. It is the two-point
lattice, top and bottom. It's a property of r. We studied with Francesco, many years ago,
the property of compressible domains, compressible abstractions. For instance, if you
take the disjunctive completion, the compression is the join-irreducible
elements. So those are the kinds of flat graphs, flat lattices, that contain all the
basic points from which you can generate all the disjunctions. It's a property of r.
Okay, so basically you have a function that refines, and you have an inverse function that
squeezes the domain, when it exists. It doesn't always exist, but in most cases it does. For
instance, this is the lattice of intervals. This is the square operation. Then we can build this little
function by considering this formula. So basically, if we remove [inaudible] with respect to
the function square, this is the squeeze of the original domain. Okay, so what we tried to
prove is that, with respect to the function that is inside r -- r being a way of completing with
respect to the function f -- this inverse is the one that induces the maximal amount of
incompleteness, namely it removes all the relevant points that are useful for removing
false alarms. So it's exactly the contrary of what we do in static analysis, but it's exactly
what we look for if we want to make the analysis blind. Okay?
Let's see this with another example, then I'll finish. Slicing. Program slicing obfuscation
is more tricky. So program slicing: basically you generate the
program dependency graph. You have this little program, then you slice this
program with respect to the variables x and y and so on. And all this is statically derived
from the program dependency graph. Take for instance this little word-count program.
Okay, you have number of lines, number of words, number of characters. The slicing
criterion is the variable with respect to which you want to slice; here it is the number of
lines, and you get out this slice. And if you take number of words, you have this slice.
Okay, if you want to obfuscate program slicing, what you want it to do is to return a slice
which is the whole code. The slicing algorithm is more precise -- it is able to have a sharp view of
the execution around the criterion -- if the slice is small in size.
If you want to obfuscate the program slicer, you have to blind the slicer's
capability of selecting instructions. Basically, it has to return the whole code as the only possible
slice. That means that it fails. Of course, if I try to attack a program and I use
a program slicer to reduce the size of the code I want to attack, and it returns me the code
I started with, it's a completely useless tool for my attack. Okay, so how do hackers --
and this is simple hacking -- do this? They add fake dependencies. Because
program slicing is based on the control dependencies, the program dependency
graph. You add dependencies which are fake -- for instance, in this case you see that
this is always true and this is always false, so there are instructions that relate the
variables, that make the variables depend on each other, but they will never be
executed. Because the program dependency graph is extracted statically -- it is, I would say,
an abstract interpretation of the program.
Then the program slicer is unable to return a good slice. Indeed, it gets a much bigger
slice for number of lines and for number of words: two big slices. Is this related to the
algorithm that attacks the code, which is the algorithm that extracts the
dependency graph? Yes: exactly as before, the transformation that adds the fake
dependencies is precisely induced by the algorithm that extracts the program dependency
graph. Look, the algorithm for the program dependency graph is an abstract interpretation
where you forget completely about the state, once again, and generate the graph. Okay,
so what's happening here? If I formalize this as an abstraction, it's
very easy to prove, once again, an "if and only if" that says that the program dependency
graph algorithm -- an abstract interpretation defined by an abstraction rho -- is
incomplete if and only if the code contains static -- so not dynamic --
dependencies, fake dependencies, namely dependencies that are not true in the true
traces of execution. So dependencies that are not generated [inaudible].
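A minimal sketch of such a fake dependency in a word-count-like program (the opaque predicate and the names are illustrative, not the slide's code):

    def wc(text):
        nl = nw = nc = 0           # lines, words, characters
        in_word = False
        for ch in text:
            nc += 1
            if ch == "\n":
                nl += 1
            if ch.isspace():
                in_word = False
            elif not in_word:
                in_word = True
                nw += 1
            # Opaque predicate: (nc * nc) % 4 == 3 is always false (a square is
            # 0 or 1 mod 4), but a static dependency graph cannot know that, so
            # nl and nw now appear to depend on each other and on nc.
            if (nc * nc) % 4 == 3:
                nl, nw = nw, nc
        return nl, nw, nc

    assert wc("one two\nthree\n") == (2, 3, 14)   # input-output is unchanged

A slice on the number of lines must now conservatively drag in the word and character logic, so it degenerates towards the whole program.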
Okay, so with these two examples -- the theory is more general, of course, though it
doesn't work for all examples -- what we tried to do is the following: we want to obfuscate
the program, which means we want to make an attacker blind. The attacker, for me, is an
abstract interpretation. Warning: an abstract interpretation doesn't need to be static.
Monitoring and tracing can also be formalized with an abstraction. So dynamic attacks
can also be formalized by an abstraction. Also tracing: when you have huge amounts of
traces and you do mining on them, the mining is related to some abstraction, because
you lose some information in order to extract some of the information.
Once you know this abstraction, no matter what it is -- for instance, take decompilation.
In decompilation you look for reducible graphs in the code -- you know, the structure of the
loops. So how can you make the algorithm that extracts the reducible
graph incomplete? You jump inside the code with fake jumps. In this way the code appears
completely irreducible and the decompiler is unable to reconstruct the original structure.
Once again, this is making incomplete an abstraction: the one that looks for the
graphs that are reducible. Disassembling: if you look at the standard disassemblers, they
work in exactly the same way. So once you are able to extract the abstraction, you can
always build the twisted interpreter, which is always a modification of the standard
interpreter that depends on this abstraction and makes, by this equation, the transformed
code blind for the abstraction.
The point is the following: you can always find a better abstraction than the obfuscated
one. Of course. But look, Barak and others proved in 2001 that obfuscation is
impossible. So you cannot universally obfuscate your programs. Rice in 1953 proved
that analysis is impossible. Well, we have all done program analysis for at least 40 years,
so it makes sense to do obfuscation even though it's impossible. That's it. Thank you.
[Applause]
>> Francesco Logozzo: Time for some questions.
>>: So you might also -- you can even increase the power of the abstract interpreter, not
just by changing the domain but by, for example, unrolling the loop to begin with. It
might get rid of the irreducible part. Or the multiplication example that you gave: you can
just do trace partitioning. You can actually get that one, right?
>> Roberto Giacobazzi: Yeah.
>>: So I mean even if you stayed in the domain...
>> Roberto Giacobazzi: In that case, what I would do -- I would say this is a line of
research; we don't have the ending point of this, of course. But I would try to
specify trace partitioning as a refinement of the domain. And then I would use that domain for
deriving the obfuscated code that defeats your trace partitioning. I agree with you that there
is a rigidity inside this stuff, in that we always pass through the abstraction in order to
construct the interpreter. But I believe that most refinements you can do of the
interpreter can be seen as an abstraction of the domain over a more standard
interpreter, a standard interpreter as simple as you want. Of course, if you look
at refinements like refining the widening, or waiting some iterations before a threshold,
that cannot be specified as [inaudible]. But it's nice, challenging stuff, because I think that
also, for instance, for the delay of the widening, it's very easy to find a transformation of
the code that simply delays more the change of the variable, in such a way that it breaks
your refinement.
So probably there is something even more general than what we are looking at
at the moment. But we are pretty happy that, if you take the book by Christian Collberg,
a kind of bible of all these tricky transformations, most of them we were able to
specify as an abstract interpretation. And for each of them, the twisted interpreter was
derived almost naturally.
>>: Are you able to define new obfuscation techniques using [inaudible]?
>> Roberto Giacobazzi: Well, for the moment...
>>: For the moment you...
>> Roberto Giacobazzi: For the moment...
>>: ...view the...
>> Roberto Giacobazzi: ...we tried to understand...
>>: ...existing of [inaudible]...
>> Roberto Giacobazzi: Yeah, it was a kind of understanding: instead of viewing
obfuscation as a trick where each time I think of new stuff, I generate it, and then I think I have a
billion-dollar company of mine that, of course, doesn't work, we tried to derive the principle
behind this. The idea now is the following: is it possible to compose, in a kind of crypto
way, very simple transformations in order to make more complicated ones, by composing
them in such a way that the order becomes relevant? So if you know the order of the
very tiny little transformations that you did, you are able to
reconstruct the original code. And the orders can be exponential, because you have
exponentially many different orders among them. That would be interesting stuff to do.
At the moment we are trying to understand existing ones. But I think, yes, in principle.
>> Francesco Logozzo: Questions?
>>: Quick question. So my understanding is that all this works because you have yet to
[inaudible], so you are just considering the static approximation; you are considering the
best transformer.
>> Roberto Giacobazzi: Best transformer.
>>: Okay, then. That's not reality.
>> Roberto Giacobazzi: Yeah.
>>: You don't always have the best transformer. You have widening. You sometimes
have separation, which -- yeah, you are considering...
>> Roberto Giacobazzi: If you defeat the...
>>: ...the worst case -- You are considering the worst case but...
>> Roberto Giacobazzi: If you defeat the best transformer, you will defeat any other
one.
>>: Yeah, of course. You are considering the worst case. But as I said, how far is the
worst case from the real case?
>> Roberto Giacobazzi: Yeah, but from my point of view, when I want to protect --
from my perspective, I want to protect against somebody that wants to enter my
house. So if I'm able to protect against the best...
>>: Yeah, but it can be too much.
>> Roberto Giacobazzi: ...guy that can...
>>: I'm saying, you can protect it by just putting [inaudible]...
>> Roberto Giacobazzi: Yeah, it's probably too much.
>>: ...[inaudible] or whatever. The door is fine. The lock [inaudible]...
>> Roberto Giacobazzi: I agree with you. I agree.
>>: So that's what I'm wondering. What's [inaudible]...?
>> Roberto Giacobazzi: You can probably have a lower level of obfuscation to defeat
the true tools. But from my point of view -- this is why I look for simple
transformations -- if I'm able to defeat the basic attacks and compose them with
respect to the strongest possible attacker, which is the best [inaudible], then I'm pretty
sure that other attackers will in any case have trouble getting in. Of course you pay -- there
is...
>>: [Inaudible] performance [inaudible] too complicated or [inaudible] does not kick in
and...
>> Roberto Giacobazzi: I agree. But the...
>>: So it can be too much. So I'm wondering, if you know your attacker, you know for
instance what the [inaudible] is, if you know that widening is used...
>> Roberto Giacobazzi: If you know the widening, you can probably simplify this, yes.
Consider that anyway most of these technologies are used not for protecting the
algorithm -- nobody wants to protect quicksort, because everybody knows it -- but
for protecting keys inside the program. And these are related to a very small portion of
the code. So you don't really need to obfuscate the whole code. You really need to target
a specific area of the code in order to make it, for instance, very hard to extract by
slicing, very hard to understand in the control flow and so on. So you probably pay a
runtime slowdown of ten times over that little piece of code. Computed overall -- a student
of mine made [inaudible], a dynamic obfuscator that was encrypting code in
Java bytecode, bypassing the type system, so it was very complicated. The
slowdown was ten thousand times. But he applied it to such a small area of the code that
the eventual slowdown was less than 0.7.

So it depends on where you apply it. Of course, if you apply it to the whole thing, it can be too
much.
>> Francesco Logozzo: Okay. Thank you.
[Applause]
>> Roberto Giacobazzi: Okay, thank you.