>> Ben Livshits: So let's get started. Thank you for coming today and thanks for listening online. We have Ben Hardekopf today visiting us from UC Santa Barbara. He'll be talking about sound analysis of JavaScript browser add-ons.
Thank you.
>> Ben Hardekopf: Thank you, and thanks for coming. I appreciate the opportunity to talk about our work. I wanted to start right off by acknowledging my students in my lab, Vineeth, Madhukar, Kyle, Ethan, Jared, John and Kevin, who have been the ones who have actually done most of this work. And in particular, Vineeth Kashyap is the lead on the project that I'm going to be talking about.
I wanted to give you just a really quick overview of the kind of work that we do in our lab. So one main focus is abstract interpretation and program analysis, and the main focus there has been this static analysis of JavaScript I'll be talking about, but there have been other projects as well related to just pure abstract interpretation.
We also collaborate with the UC Santa Barbara architecture lab. That's Tim Sherwood and Fred Chong and their students, on applying programming language techniques for hardware design. So basically using type systems and program analysis to verify certain properties of hardware.
And then we also are collaborating with Qualcomm research on a novel JavaScript engine that is just about to be open sourced. But the project that I want to focus on today is our JavaScript analysis project. And sort of the underlying thesis of this project is that we both can and should create a sound, precise and tractable static analysis for JavaScript.
And it's not clear from the outset that this is actually possible. JavaScript is a very dynamic, very difficult language to analyze, and if you're really concerned with soundness, then you often lose out on precision to the point where maybe it's not very worthwhile.
But we believe that we can, in fact, get both sound and precise analysis and something that's still tractable, and this project is sort of our proof of concept that, yes, we can do this.
And our main motivating example, the main thing that we're applying this analysis for, at least at first, is this idea of security auditing for browser add-ons, which I'll be talking about in a moment. Now, we are really interested in soundness, that's one of our primary concerns, and so our strategy that we're applying for this is to take a very formal approach.
So we are specifying formally a concrete semantics for JavaScript and then abstracting that semantics to derive our actual analysis and have proofs connecting the two. And we kind of think really that's the only way we can gain a high level of confidence that the analysis is actually sound.
So what I'm going to do is start off with a little bit of motivation about why sound JavaScript analysis is important. Of course, there are many reasons why JavaScript analysis can be useful. So for optimization, type inference, error checking, refactoring, things of that sort. But our main focus is going to be this idea of security for browser add-ons.
I'm then going to discuss our particular design for our static analysis of JavaScript that allows us to achieve a sound, precise analysis. So that's going to be just basically how do we analyze JavaScript soundly. And then I'm going to focus specifically on how we apply that to security auditing for browser add-ons and end up with some preliminary results.
So we actually had a first version of this analysis, and we learned some lessons during the process of making it and having used it, and we actually used that to inform a second design where we used -- had a number of new insights and used that to go back and come back with a second version of our analysis, which we're currently, in fact, my students are right now working on it at the moment.
So I'm going to present some results from our first version and sort of describe how those results modify what we did in our second version.
All right. So browser add-ons, they're written in JavaScript by third party developers, and their entire purpose is to extend the functionality of the browser. So when the browser starts executing, it loads up these add-ons, they register some event handlers, and they sit in this big event loop waiting for events like key presses and page loads and so forth and responding to them, perhaps by sending network messages or modifying the web page being displayed or so forth.
And the key thing about these add-ons is that they have extremely high privileges. Pretty much anything that the browser can do, they can do. And there's no sandboxing or other security restrictions, like the same origin policy or anything like that.
So browser add-ons have pretty much complete access for anything that they want. They can send any messages to any domains that they want. And clearly, this raises some security hazards.
So there, in fact, have been add-ons that have taken advantage of these elevated privileges, including malicious add-ons, like firestarterfox, which hijacks search requests through a Russian website, so essentially they can see what you're searching for and modify the results and tailor what you see.
Of course, add-ons don't have to be malicious to be taken advantage of. So, for example, in their DefCon 17 talk, Liverani and Freeman went through a range of 11 add-ons and showed exploits ranging from cross-site scripting to local file access to password stealing.
So this is a real problem that people are taking advantage of right now. Now, naturally, the official add-on repositories -- you have a question?
>>: Yeah, just a question about the ecosystem. So in terms of these add-ons, you know, obviously there's a place where you get add-ons. They're sort of -- is there a feedback path that says if a malicious one shows up in the market or whatever the equivalent is in Firefox, people sort of post and say, oh, don't use this? How does that work?
>> Ben Hardekopf: So there is no official feedback loop. There are, I mean, obviously -- well, so let's take the Mozilla repository. So when you submit an -- a developer submits an add-on, it goes through a vetting process. We'll talk about it in a moment. And they'll only post things to the official repository that go through this vetting process.
This vetting process is not very good in ways that I'll describe. And so clearly, malicious add-ons slip through or vulnerable add-ons slip through. When people notice that these add-ons are malicious or vulnerable, and so forth, obviously, there's a lot of chatter and people talking about it. The Mozilla repository monitors this sort of thing and they'll see that and they'll pull add-ons from the repository if they feel it's warranted and so forth. That's basically the mechanism.
So clearly, what we would like to have happen is we don't allow those through in the first place, rather than them going through, doing some sort of damage, and eventually being noticed and then being pulled after the fact.
>>: So like with the Android market, there's sort of a -- it's known to have tons of malware, and it's got a reputation for this. Has that happened with Firefox or Mozilla repository?
>> Ben Hardekopf: The general public perception of that?
>>: Yeah, exactly.
>> Ben Hardekopf: My impression is no, they're pretty trusted. Like if I'm going to the official Mozilla repository and downloading an add-on from Mozilla, then I should assume that this is a trusted thing. And if it is not trustworthy, if it does malicious things, someone's as likely to blame Mozilla and the repository as they are the person who created the add-on in the first place.
So there's actually an issue here not just with the users but with the repositories themselves and how people view them. Yes?
>>: So this issue is a bit more complicated in the case of Mozilla by the fact that they have the official [indiscernible] from anyplace on the map and I believe there are tons of cases of those being quite malicious.
>> Ben Hardekopf: Right. So that's kind of --
>>: Either users are somewhat discouraged from [indiscernible] about the browser gets. So there it's kind of, you know, questionable, you know, when you can blame the browser versus the people doing that.
>> Ben Hardekopf: Certainly. I think if the user just goes out and downloads add-ons from some random website, then it's kind of on their head what happens.
So mainly what we've been focusing on are the sort of, quote, official add-ons in the repositories. Those are the ones that we're looking at.
So again, these repositories that you submit these third party add-ons to are aware of these problems and they go through a vetting process before they're actually posted. And this vetting process is essentially a manual inspection of the add-ons. They don't have really any formal security guidelines or policies. Essentially, just institutionalized wisdom as to what's acceptable and what's not.
So essentially, they just sniff the code to see if it passes the smell test. If it does, it gets allowed to be submitted. Otherwise, it gets rejected.
So obviously, based on the examples that I showed before, this process is not really adequate. And it seemed to us that static analysis is very applicable to this problem, that this is a very good way to try to vet add-ons in a much more useful way.
So if we want to statically analyze these add-ons and try to vet them automatically, then, of course, since they're written in JavaScript, we need to be able to analyze JavaScript itself.
So there are a number of people here at MSR who actually are intimately familiar with JavaScript, even more so than I am. But in case there are people here who are not entirely familiar with it, the JavaScript basics are it is imperative, dynamically typed with objects and prototype-based inheritance, higher order functions and exceptions.
Objects are really the fundamental data structure, and these object properties, which are the JavaScript names for fields or object members, they can be dynamically inserted and deleted. The properties themselves are just strings and these strings can be computed at run time. So you can't tell statically what property of an object is being accessed.
Objects allow introspection so you can see what the fields of an object are.
And one of the things that makes JavaScript so difficult to statically analyze is that it's designed to be very resilient. So even nonsensical actions, like accessing the property of a non-object or adding two functions together don't actually raise any exceptions. They're handled using implicit conversions and default behaviors, which can be really confusing as to what the proper behavior should be.
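As a small illustration of that resilience, the following sketch shows a few operations that look like errors but complete without throwing. The names `f`, `g`, `sum`, `missing` and `notANumber` are made up for this example and do not come from the talk.

```javascript
function f() {}
function g() {}

// Adding two functions: both are implicitly converted to strings
// (their source text) and concatenated. No exception is raised.
const sum = f + g;

// Accessing a property of a primitive: the boolean is implicitly wrapped
// in a temporary object, the lookup misses, and the result is undefined.
const missing = (true).nope;

// Arithmetic on undefined: it is implicitly converted to NaN.
const notANumber = undefined * 2;

// One caveat: null and undefined themselves are not wrapped, so an
// access such as null.nope does throw a TypeError.
console.log(typeof sum, missing, notANumber);
```

An analysis that wants to be sound has to model each of these default behaviors, because any of them can sit on a path that carries sensitive data.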
So I just wanted to give you kind of a flavor of some of the things we might have to deal with when we're analyzing JavaScript for security flaws. So suppose that you had some secret information that you were trying to send over the network in some obfuscated fashion. So it's not immediately obvious that that's what you're doing.
In this example, we're declaring a function called foo that returns a string bar, and then remember that functions are just objects in JavaScript. So I can treat it as an object. I can do object accesses, and I can in this case look up foo.prototype and then dynamically insert a new property called bar that contains the secret information.
And then later on, I can create a new object, obj, using foo as the constructor. Essentially what that does is set obj's prototype field to foo's prototype, and then I can send obj[foo()]. So this is going to call foo and compute a string, in this case, bar, and look it up in obj. Obj doesn't have it, so it goes up the prototype chain to foo.prototype, which does have bar, and it's the secret information, so this is going to send the secret information.
So this uses a combination of language features that include the function being treated as an object, dynamically inserted properties, computed property accesses and prototype-based inheritance that all combine together to allow you to send this secret information.
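The slide code is not part of the transcript. A minimal sketch of the example as described, assuming a hypothetical `send` function that stands in for the real network call and a placeholder `secretInfo` value, might look like this:

```javascript
// Hypothetical stand-in for a network call: it only records what would be sent.
const sent = [];
function send(data) { sent.push(data); }

const secretInfo = "hunter2"; // placeholder for the sensitive value

// foo is a function, and therefore also an object with a `prototype` property.
function foo() { return "bar"; }

// Dynamically insert a property called "bar" on foo.prototype.
foo.prototype.bar = secretInfo;

// Using foo as a constructor links obj's prototype to foo.prototype.
// (The primitive string that foo returns is ignored by `new`.)
const obj = new foo();

// foo() computes the property name "bar"; obj has no own "bar", so the
// lookup walks up the prototype chain and finds the secret.
send(obj[foo()]);
```

Nothing in the last line mentions `secretInfo` or even the name `bar` directly, which is what makes this pattern hard to spot by inspection.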
A couple of other just quick easy examples. So again, I can use these computed properties and dynamically inserted properties and object introspection to do something similar. So here, I say obj[secretInfo] gets undefined. It doesn't really matter what secret info is: it could be a string, could be a number, could be anything that's convertible into a string. And then this for-in loop, which is basically object introspection, is going to iterate through the properties of the object, one of which is the secret information, and it's going to send it out over the network.
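A sketch of this second example, again with a hypothetical `send` function and a placeholder secret (here a number, to show the conversion to a string key):

```javascript
// Hypothetical stand-in for a network call: it only records what would be sent.
const sent = [];
function send(data) { sent.push(data); }

const secretInfo = 12345; // placeholder; any value convertible to a string works

const obj = {};

// Computed property name: the secret itself becomes the key ("12345").
// The stored value is irrelevant, so undefined is enough.
obj[secretInfo] = undefined;

// Object introspection: for-in enumerates the property names,
// one of which is the secret, and each name is sent out.
for (const name in obj) {
  send(name);
}
```

Here the secret never appears as a value that is read from the object; it leaks through the set of property names.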
And the final example, and one of the things that makes static analysis really difficult, is this idea of implicit conversions. So in this case, JavaScript has a default Number object, and I can assign again to [indiscernible] property Number.prototype.foo gets secret information, and then I can send 5.foo. And, of course, 5 is not an object. It's a primitive. But it's going to automatically or implicitly be converted into an object, and then we're going to do the property access on that object.
It's not going to be in the newly created object, of course. But it's going to look up the prototype chain to find Number.prototype.foo, and the end result is we are going to again send the secret information.
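A sketch of this third example, with the same hypothetical `send` and placeholder secret. One detail differs from the spoken form: in source code the access has to be written `(5).foo`, because a bare `5.foo` is parsed as a malformed number literal.

```javascript
// Hypothetical stand-in for a network call: it only records what would be sent.
const sent = [];
function send(data) { sent.push(data); }

const secretInfo = "hunter2"; // placeholder for the sensitive value

// Dynamically add a property to the prototype shared by all numbers.
Number.prototype.foo = secretInfo;

// 5 is a primitive, so it is implicitly wrapped in a Number object.
// The wrapper has no own "foo", so the lookup reaches Number.prototype.
send((5).foo);

// Clean up the global prototype so the sketch has no lasting effect.
delete Number.prototype.foo;
```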
And it's actually worse than that, because these implicit conversions can actually end up running arbitrary code. So if I were to take an object and use it in a place where I need either a string or a number, the way that it does that is not by some default object conversion. It actually calls either the toString or valueOf methods on that object, with the idea that it's going to look up the prototype chain and eventually probably find the default valueOf and toString. And you can also, of course, in an object, override exactly these methods, so if anywhere along the prototype chain you've overridden toString or valueOf, it's going to run whatever code you've put in there, rather than the default string conversion.
So you could, in theory, launch your entire program in this way, just by implicitly -- so there's no actual call to the code anywhere present. It's just through this implicit conversion, because of the way it happens to do things, it's going to find your code and run it.
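To illustrate, here is a small sketch in which overridden `toString` and `valueOf` methods run without any visible call site. The `log` array and the object names are made up for this example; the methods only record that they ran, but they could contain any code.

```javascript
const log = [];

const sneaky = {
  toString() { log.push("toString ran"); return "x"; },
  valueOf() { log.push("valueOf ran"); return 42; },
};

// Numeric context: the multiplication implicitly calls valueOf.
const asNumber = sneaky * 2;

// Property-name context: the lookup implicitly calls toString.
const asKey = ({ x: 1 })[sneaky];

// The override is also found through the prototype chain:
// `child` has no methods of its own and inherits them from `sneaky`.
const child = Object.create(sneaky);
const inherited = String(child);

console.log(asNumber, asKey, inherited, log);
```

A sound analysis therefore has to treat every point where a conversion may happen as a possible call to user-defined code.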
So there is no current JavaScript analysis that soundly handles all of these tricky behaviors. Of course, these examples aren't even comprehensive. There are a lot of other examples I could give you of really surprising behaviors that JavaScript has.
So what I want to do now is describe how we try to handle all of these tricky things. How do we go about creating a static analysis for JavaScript that we can, at least in some sense, claim to be sound.
And when we design this analysis, we really had three criteria. The first one, of course, was soundness. We wanted it to be a sound analysis, and we wanted to be able to make formal claims about that soundness.
We also wanted a pretty straightforward implementation, meaning that it's nice to have a formal specification of your analysis, but there's always a gap between the specification on paper and the actual implementation of that analysis. And we wanted that gap to be as small as possible so it was fairly straightforward that, yes, this implementation actually implements this analysis that we have formally described.
And another thing that we wanted was for the precision of that analysis to be very tunable in terms of not just the abstract domains that you use, but also the control flow sensitivity. So path sensitivity, context sensitivity, flow sensitivity. We wanted to be able to tune these things arbitrarily.
And that's because JavaScript analysis is fairly new, and we as a community don't really know what the best precision would be. So, for example, JavaScript has some elements of functional languages with higher-order functions and so forth. Maybe k-CFA would be a good option for context sensitivity. But it also has some aspects of objects. So maybe object sensitivity would be a good idea or maybe some combination, or maybe something else entirely. We don't really know.
One of the things we wanted to do -- yes, you had a question?
>>: We sort of do know something, right? I mean, if you [indiscernible] you can find all sorts of things in open source projects. Add-ons, you can simply do statistical analysis to figure out how the equation [indiscernible] and how restrictive you want to be.
>> Ben Hardekopf: Are you talking in terms of the precision that we want for the analysis, what kind of precision that we have?
>>: What kind of things you need to support, right? I mean, there is a movement in the community around these ideas like just the good parts, which is the [indiscernible]. And if you're restricting your targets to, say, browser add-ons, that meshes pretty well with that approach. If you're not restricting your targets at all, then yes, you have to support them.
>> Ben Hardekopf: Right, so the goal for this JSAI static analysis of JavaScript is to ultimately support anything. We don't currently. And, in fact, as I say here, our current target applications for this analysis at the moment are browser add-ons, machine-generated JavaScript through something like Emscripten, and something like open source projects from GitHub. So we're not handling, for example, any random JavaScript you pull off of a website, currently. That's ultimately the goal, we'd like to do that. But at the moment, this is what we are specifically targeting.
But even for these, so there's a wide range of JavaScript features, and we just want to be able to handle as many of them as possible. Does that answer your question?
>>: Yes.
>> Ben Hardekopf: Okay. So the idea again, for tunability, was that in terms of precision, what kind of precision we should bring to the table, we don't really know, and we would like to be able, with our static analysis, to experiment with those and try a bunch of different ones and see what works well and what doesn't work.
And we've actually implemented all of this in Scala. So I'll be talking about it with respect to our Scala implementation. So the first thing that we do is we translate JavaScript, the generic JavaScript to a simpler, intermediate language that we call Not JS. And this is basically just to simplify a lot of aspects and make things explicit that used to be implicit and make things more regular.
So this is a fairly standard strategy when you're analyzing something. And, in fact, it's a similar philosophy as lambda JS. But because lambda JS was doing something different, they ended up with a different IR design and semantics.
So the actual design of the IR that we have and the semantics that we give to it is different than lambda JS. They were going for this idea of a minimal core calculus that you can do type system soundness proofs on, and we were going for precise and efficient abstract interpretation, so we just ended up in different places.
So this design that I'm showing is actually our second pass at it. We had a first pass, as I said before, our first version, and we learned some lessons from it. And one of the lessons we learned, for example, is that it's really helpful if you actually separate out and have pure expressions that are guaranteed to terminate without throwing an exception, and impure statements that don't have those guarantees. And that makes things actually a lot simpler and more efficient to do that, and it's not as easy as you might think when you have implicit conversions to consider. These implicit conversions, again, can run arbitrary code so you have to be very careful about how you do them in order to make these guarantees.
And there are other lessons learned as well. For example, we kept in the for-in loop, which is the object introspection one, because there are opportunities to handle it more precisely than if we desugared it into, for example, a while loop. And there are a few other design decisions that we made, based on that.
But the basic idea is just we wanted a fairly small, simple language that we can then formalize and apply our proofs to.
Now, the actual translation from JavaScript to Not JS is itself formalized, and then we give the Not JS language a formal semantics.
>>: How much do you handle?
>> Ben Hardekopf: How much of JavaScript? All of it. Ecma 3.
>>: Ecma 3?
>> Ben Hardekopf: Yeah.
>>: [indiscernible].
>> Ben Hardekopf: There is. I plead resource exhaustion. We're a small lab with a few students.
>>: My question is [indiscernible] things you can handle?
>> Ben Hardekopf: In ecma 3, we handle everything.
>>: So where does eval show up?
>> Ben Hardekopf: So eval is a method on the global object. So we have it in the concrete semantics, and I'll talk about how we handle it in the abstract version in a slide later on. The basic answer is going to be that, specifically for the things we're targeting, which are browser add-ons and machine-generated JavaScript, they don't use eval. And so as long as you're careful about it, you can ignore eval.
I mean, you can't really ignore it because syntactically, you can't just say, well, eval never happens, but you can, as I'll explain, there are ways we can get around that. There are also some strategies we can use to actually support eval, and I'll talk about those a bit later on. We don't implement those strategies as of yet. That's future work.
>>: Okay. So in other words, is there a distinction between functions and objects?
>> Ben Hardekopf: Well, actually, functions are objects.
>>: Right, but you have this new --
>> Ben Hardekopf: That's going to create specifically a function object. So a function object is an object that has an internal field a closure. So you can treat it as a regular object, but also has some additional stuff like the closure that you're going to call.
>>: So there was a reason for keeping -- why didn't you just, since everything's an object, why didn't you go lower? I'm just curious about the new --
>> Ben Hardekopf: So it just turned out, as we were doing the semantics and the implementation, that they were different enough that it was more useful to keep them separate. Essentially, when we put them into the same one, we just had a bunch of cases saying if we're doing a function, do this, otherwise, do this. So we just split it out into two different things and that just made it easier.
>>: Okay. I'm sorry. Another question. So does JavaScript have [indiscernible].
>> Ben Hardekopf: Yes.
>>: Oh, shoot.
>>: What's the strategy for [indiscernible] compliance? Are you basically [indiscernible] compliance with the test?
>> Ben Hardekopf: Right. So I'll answer exactly that question. So, of course, once we have this intermediate language, we need to give it a concrete semantics, and there are a number of different ways you can give a language a concrete semantics. And our criteria were that first we wanted, again, this near one-to-one correspondence with the implementation, and we wanted, in fact, to be able to create an executable semantics that we could test against a reference implementation, exactly like you're saying.
So what we wanted was something that, again, when we implement it is obviously exactly the specification that we had. But we could also execute it and then compare: the translation plus the semantics that we're executing together should give us exactly the same result as our reference implementation, which currently is Node.js.
So that's the validation. Essentially, because ecma 3 is not formally specified, there's no way that I can prove that we conform to the JavaScript standard. The best that we can do at this point is just say we can do at least as well as someone who's actually trying to implement a JavaScript engine, right, because that's basically the same thing they're going to be doing.
And then our second criterion was that, looking forward to the analysis itself, we wanted to allow for this tunable control flow sensitivity after we abstract the semantics.
So the approach that we settled on was a small step abstract machine-based semantics. For those not familiar with it, you can think of it as just a state transition system where we define a state, what a state of a machine is, and then a set of transition rules between states.
And essentially, this gives us a concrete interpreter for JavaScript. So we have some initial state, and then we consult our state transition rules to see what the next state would be, and then we can keep just calling the next state, next state, next state, until we end the program if it terminates.
So just to give you an idea of what that looks like for JavaScript, here is an excerpt of our concrete semantic domain. Our definition of a concrete state is that it has a term, which is either a statement or a value, an environment that maps variables to addresses, and a store that maps addresses to either values or objects.
And then a continuation stack that tells you the rest of the computation. So when you have a state, you can think of it as the term tells you here's the thing you're working on now. The environment and store tell you here are the values of the variables in the scope, and the continuation stack tells you when you're done doing what you're doing, here's the stuff to do next.
So this is a fairly standard formulation of a state for a small step semantics.
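The four components of a state can be pictured with a short sketch. It is written in JavaScript for consistency with the other examples, not in the Scala of the actual implementation, and every name in it (`makeState`, `lookup`, the `kind` tags) is an assumption made for illustration.

```javascript
// Illustrative names only; these are not the speaker's Scala definitions.

// A state bundles the four components: term, environment, store, continuations.
function makeState(term, env, store, kont) {
  return { term, env, store, kont };
}

// Variable lookup goes through two maps: variable -> address -> value.
function lookup(state, name) {
  const address = state.env.get(name);
  return state.store.get(address);
}

const state = makeState(
  { kind: "assign", name: "y", expr: { kind: "var", name: "x" } }, // term: y = x
  new Map([["x", 0], ["y", 1]]),      // environment: variables to addresses
  new Map([[0, 42], [1, undefined]]), // store: addresses to values or objects
  [{ kind: "halt" }]                  // continuation stack: what to do next
);

console.log(lookup(state, "x"));
```

Keeping addresses between variables and values is what lets two variables or two object properties refer to the same object in the store.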
You can see that the values are the standard numbers, Booleans, strings, addresses of objects, null or undef. I do want to particularly point out the concrete object domain.
So this, as far as we can tell, is novel in terms of formalizing JavaScript representing objects in this way. So the way we represent them is with two maps. The first map is a programmer visible map. This is what they actually interact with that maps properties to values, and you can do the property insertion and deletion and enumeration and so forth.
And then we have a second implementation map that is not visible to the programmer and that they can't directly interact with. It holds information like: if this is a function object, here's where the closure is; if this is, say, a number, Boolean or string, here's where the value is. And also, every object contains its class, where a class is one of these things that I've listed below here.
And these classes actually determine some of the behavior of objects. So depending on what class you are, you're going to react to certain things differently. So we explicitly keep track of what class each object belongs to, so it turns out to be fairly easy and efficient to handle all of these things.
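A rough sketch of the two-map representation follows. The key names `"class"` and `"code"`, the helper names, and the choice of the array length rule as the class-dependent behavior are all assumptions for illustration; the slide's actual class list is not in the transcript.

```javascript
// Illustrative sketch; key names such as "class" and "code" are assumptions.
function makeObject(cls, internalFields = {}) {
  return {
    // Programmer-visible map: property names to values.
    external: new Map(),
    // Internal map: the class tag plus any hidden fields (closure, value, ...).
    internal: new Map(Object.entries({ class: cls, ...internalFields })),
  };
}

// Property update whose behavior depends on the object's class tag.
function setProp(obj, key, value) {
  obj.external.set(key, value);
  const isIndex = /^(0|[1-9]\d*)$/.test(key);
  if (obj.internal.get("class") === "Array" && isIndex) {
    // Arrays keep `length` one past the largest index written.
    const index = Number(key);
    if (index >= obj.external.get("length")) {
      obj.external.set("length", index + 1);
    }
  }
}

const arr = makeObject("Array");
arr.external.set("length", 0);
setProp(arr, "2", "c"); // length becomes 3

const plain = makeObject("Object");
setProp(plain, "2", "c"); // no length bookkeeping for a plain object

const fn = makeObject("Function", { code: () => "closure body" });
```

Because the class tag lives in the internal map, it cannot be overwritten or enumerated by program code, and the semantics can branch on it directly.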
We originally did not have objects this way. And, in fact, we had objects the same way everybody else that we've seen doing JavaScript analysis has done, where you just have this one map from properties to values. You don't keep track of classes or have the internal map or anything like that.
You can do it that way, but it turns out to be very complicated and messy and hard to get right. So this is just a nice -- yes?
>>: What is that metadata stored in the map that's on the values? It's really metadata associated with the object, not associated with -- like the class is a tag on the state, not on the mapping of the object.
>> Ben Hardekopf: I'm not sure I understand, like why is this a map?
>>: The class here is not associated with what string it came from. It's a piece of state associated with value, right?
>> Ben Hardekopf: Yes. Well, no, so the class is specific to this object, right. The entire wrapper for all this mapping of properties to values. So this second map is going to say, like, the code maps to here's your closure. The class maps to here's your class. The value maps to here's the value if it's a number, string or Boolean.
So the class of the object is going to be, for example, is this an array. If it's an array, that means certain things are enumerable, certain things are not. There are special rules for updating the length field in the external map, things like that, right. And they're different rules than if this is an object object or a function object.
>>: I'm sorry. What's a B value again?
>> Ben Hardekopf: It's a base value. So there are actually other kinds of values for exceptions and go-tos as well, and because I was trying to fit it all on one slide, I just didn't show them all. So the value of a state can either be a base value or an exception value or a jump value, and the exception values are going to be caught by these try catch finally blocks and the jump values are going to be caught by labeled statements. So you can label a statement and then jump to that label.
>>: It seems like any base value has a class.
>> Ben Hardekopf: No, no. These are not base values. These are objects. The objects have class tags, not values.
>>: I see, but there's one class tag per object?
>> Ben Hardekopf: Yes.
>>: I see.
>>: So what's the string that maps to the class?
>> Ben Hardekopf: It's just going -- so the code maps to closure. We have a number of things we just map a string to what that is.
>>: So it's not a property?
>> Ben Hardekopf: No.
>>: See, that's the confusion.
>> Ben Hardekopf: I'm sorry. I should have made that clear. This is the external map that maps properties to values. This is an internal thing that's just our implementation inside the interpreter.
>>: [indiscernible].
>> Ben Hardekopf: Exactly, I'm sorry. I should have made that clear.
>>: Okay. That makes a lot more sense.
>> Ben Hardekopf: Any other questions? Okay. So again, I want to point out that one of our goals was this straightforward implementation. So in Scala, each of these is just a class. So there's a pretty much one-to-one correspondence between when you can look at the formalism and see the definition of what a state is, and you can look in the implementation and there's a class that has exactly those fields and so forth. So it's a very straightforward idea.
Now, given the definition of a concrete state, we then define the state transition rules, and so here are just some very simple rules, handling conditionals. Essentially, all we do is say: if we're given a state with a term, environment, store and continuation, what should the next term, environment, store and continuation be?
For example, if the term is an if statement, we evaluate the guard and see if it's true or false, and then the next term, depending if it's true, then we use the true branch. If it's false, we use the false branch. And there end up being about 40 of these rules to cover the entire language.
And this is, again, where I wanted to point out the difference from the first version that we had, where we didn't separate out these pure expressions. Here, I take the expressions and I just evaluate them in place. So this is essentially a big step evaluation of these expressions, and we can do that because we're guaranteed they're pure. They don't have any side effects, they don't raise any exceptions.
And we did not do that the first time around, and we ended up having about three times as many semantic continuations and about an extra 30 rules or so, semantic rules for handling them. And the rules themselves are a bit trickier, because we had to worry about exceptions and side effects happening at any point in time.
So separating things out the way that we did here made the semantics itself a lot simpler and more efficient. Yeah?
>>: As you pointed out earlier, even just a plus operation can have an implicit conversion that causes a value exception. So no -- there's very few expressions which guarantee --
>> Ben Hardekopf: Right. So these are our expressions in the Not JS IR. So specifically, these are these expressions, and some of these -- for example, you'll see that the toObj conversion is actually a statement, and we actually spell out in the translation to the Not JS IR these things like, you know, if this is a primitive, then go ahead and use the expression. But if it's an object, then do this first and get a result and so forth.
So all of these conversions are spelled out explicitly.
And again, in Scala, there's actually a very close mapping between these rules as I show here and the actual implementation. So every state class -- the state class has a next method and it's essentially a big pattern matching case that looks through and says which one of these rules applies to me. In that case, here's the next state. So the next method just looks at the current state, pattern matches against these rules, and says -- gives you the appropriate next state back.
So again, I can take an initial state and then just keep calling the next method on it, and that's essentially an interpreter for JavaScript, and I can compare it against, for example, the result of Node.js or something like that.
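The stepping loop described here can be sketched in a few lines. This is a minimal illustration in Python, not the Scala implementation from the talk: the tuple-based mini-language, the `State` fields, and the helper names `eval_pure` and `run` are all assumptions made for the example, and the environment and store are collapsed into one mapping to keep it short.

```python
from dataclasses import dataclass

# Hypothetical mini-language (not the notJS IR). Terms are nested tuples:
#   ("assign", name, expr), ("if", guard, then_term, else_term),
#   ("seq", first, second), ("skip",)
# Pure expressions: int/bool literals, variable names, ("+", e1, e2), ("<", e1, e2)

def eval_pure(expr, env):
    """Big-step evaluation of a pure expression: it never mutates env."""
    if isinstance(expr, (bool, int)):
        return expr
    if isinstance(expr, str):
        return env[expr]  # assumes the variable is defined
    op, left, right = expr
    lhs, rhs = eval_pure(left, env), eval_pure(right, env)
    return lhs + rhs if op == "+" else lhs < rhs

@dataclass(frozen=True)
class State:
    term: tuple
    env: tuple   # sorted (name, value) pairs, kept immutable
    kont: tuple  # continuation: terms still waiting to run

    def next(self):
        """Pattern match on the current term and return the next state."""
        env = dict(self.env)
        tag = self.term[0]
        if tag == "if":
            _, guard, then_term, else_term = self.term
            branch = then_term if eval_pure(guard, env) else else_term
            return State(branch, self.env, self.kont)
        if tag == "seq":
            _, first, second = self.term
            return State(first, self.env, (second,) + self.kont)
        if tag == "assign":
            _, name, expr = self.term
            env[name] = eval_pure(expr, env)
            return State(("skip",), tuple(sorted(env.items())), self.kont)
        if self.kont:  # "skip": resume the continuation
            return State(self.kont[0], self.env, self.kont[1:])
        return None    # "skip" with nothing left: the program has halted

def run(term):
    """The interpreter: keep calling next until there is no next state."""
    state = State(term, (), ())
    while True:
        successor = state.next()
        if successor is None:
            return dict(state.env)
        state = successor
```

Note how the `if` rule evaluates its guard in place with `eval_pure`; that is only safe because guards in this toy language, like the pure expressions described above, cannot have side effects.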
All right. So this gives us our concrete semantics. The next step, of course, is we actually want an analysis for JavaScript so we need to abstract that semantics in order to create our analysis. So the idea of abstract interpretation is that we're going to take this concrete state space, which is potentially infinite, and over approximate it with an abstract state space, where that state space is bounded to be finite, which we need to make the analysis computable.
And then the analysis itself is simply compute all reachable abstract states from some initial state. That's what gives us our analysis. And since we are abstracting a small step abstract machine, the abstract semantics we get is, itself, a small step abstract machine.
So the idea, again, the analysis itself is going to be we have some initial abstract state, which is an over approximation of the initial concrete state.
And then we successively explore the reachable states from that and it's just a state exploration. And once we discover all reachable states that's our analysis. We can then derive invariants from that result.
Now, I do want to point out that this is a different approach than all previous JavaScript analyses, which are sort of based on the traditional control flow graph. So they take the JavaScript program, they derive a control flow graph and then they apply the standard traditional data flow analysis on that, sort of the monotone framework so you have monotone transfer functions, theory and [indiscernible] lattices and things like that.
>>: The analysis we built back in 2008 [indiscernible] control flow sensitivity doesn't seem to make much of a difference. You can play these things -- within procedures, I should say [indiscernible] procedure.
>> Ben Hardekopf: Right. So first of all, I wasn't aware that [indiscernible] actually didn't use a control flow graph. I forgot that that was true. But we have found that the sensitivity actually makes a big difference, and it may depend on exactly what you're trying to do with it. So maybe we can talk about that more offline to see exactly where the difference lay.
But we have found that, specifically, flow sensitivity is very important, and then context sensitivity can make a big difference as well, although actually different context sensitivities vary wildly in how well they perform.
So that's the results we've found for, specifically, the analyses that we're running on the target applications that we have. Of course, you were targeting completely different applications for completely different things, and maybe that made a big difference.
But most of the JavaScript analyses that we have seen, like the type analysis for JavaScript and the pointer analysis for JavaScript and other analyses for
JavaScript have used a control flow graph, and the main difference between the
two approaches is that, as we've seen, the state-based approach, the abstract machine based approach, actually pairs the information about where you are in the store, in the environment and so forth with the semantic continuations all together.
And a control flow graph actually externalizes those continuations into a separate data structure, which is the control flow graph. So essentially, there is no relation anymore between the state of the analysis or the program point, current values, all the variables and so forth, and the continuations themselves.
And it turns out that if you put those together in the abstract machine way, that makes dealing with indirect and complicated control flow easier. Of course, you can do it with control flow graphs, but the end result ends up being kind of messy and ad hoc. So, for example, we were talking about try catch blocks. In JavaScript, the ultimate result of a try catch finally is usually whatever you computed in the finally block, unless you enter the finally block through a jump or exception, unless the finally block itself throws a jump or exception -- so there's a fairly complicated set of rules about where exactly the return value should come from. It depends on how you actually enter the finally block. And if you have the continuations in the states together, then this turns out to be fairly easy to deal with. And if you don't, you can deal with it, and things like TAJS do deal with it, but it gets messy and ad hoc and hard to get right.
And, in fact, TAJS had a bug in their implementation of it that we discovered when we were formalizing our semantics.
And things like indirect control flow as well. So if you have the CFG-based approach, then when you are -- of course, you don't have the entire control flow graph there present, because of all these indirect calls, so you're adding edges dynamically to the control flow graph as you run the analysis.
And the problem is, especially if you're doing a context sensitive analysis, certain edges are relevant to certain states depending on the context that you're in, and you have to have this mapping between the states and which edges are relevant to that state. Again you can do it, but it's kind of messy and ad hoc and easy to get wrong. And if you pair everything together, and you have the semantic continuations coupled with the rest of the state, then this turns out to be pretty easy and trivial to do.
And it's also going to turn out that this design is going to give us this easily tunable control flow sensitivity that I talked about, which I'm going to discuss in a few slides.
So the main point of this slide is just to show you that the abstract semantics is, in fact, very close to the concrete semantics that I showed you before. So here was the concrete semantics. That was the concrete semantics. And essentially, most of what we did was take that and put a bunch of hats on things and then make things sets instead of singletons because we're doing overapproximations. So there's a very close relation between the two.
I again want to point out this object domain. It's based directly on the concrete object domain that I described. And again, this is a novel way of abstracting objects. And it turns out to help a lot in terms of precision by separating these out and keeping track of the classes.
And then, of course, given the definition of an abstract state, we need abstract transition rules, and this again is going to be very similar to the concrete one. The main difference here is that because we're over approximating, the value of a guard can actually be both true and false, and so these rules are going to end up being nondeterministic, and that's easily handled just by, for example, in Scala, we have the next function and instead of returning a single state, it's going to return a set of states.
But again, it's the same idea, we're going to start with an initial state. We just keep calling next over and over again until we've seen all of the reachable states, and that's the analysis.
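The reachable-states computation with a nondeterministic next function can be sketched as a worklist loop. This is an illustrative Python sketch under stated assumptions, not the talk's Scala code: states are plain `(term, env, kont)` tuples, abstract values are frozensets of possible constants, and the names `abstract_next` and `reachable` are made up for the example.

```python
# Hypothetical abstract machine over a toy language. Terms are tuples:
#   ("if", guard_var, then_term, else_term), ("assign", name, constant),
#   ("seq", first, second), ("skip",)
# An abstract environment maps each variable to a frozenset of possible values.

def abstract_next(state):
    """Return the SET of successor states; an unknown guard yields two."""
    term, env, kont = state
    store = dict(env)
    tag = term[0]
    if tag == "if":
        _, guard_var, then_term, else_term = term
        successors = set()
        if True in store[guard_var]:   # the guard may be true
            successors.add((then_term, env, kont))
        if False in store[guard_var]:  # the guard may be false
            successors.add((else_term, env, kont))
        return successors
    if tag == "seq":
        return {(term[1], env, (term[2],) + kont)}
    if tag == "assign":
        store[term[1]] = frozenset([term[2]])
        return {(("skip",), tuple(sorted(store.items())), kont)}
    if kont:                           # "skip": resume the continuation
        return {(kont[0], env, kont[1:])}
    return set()                       # halted: no successors

def reachable(initial):
    """Keep calling abstract_next until no new state shows up."""
    seen, worklist = {initial}, [initial]
    while worklist:
        for successor in abstract_next(worklist.pop()):
            if successor not in seen:
                seen.add(successor)
                worklist.append(successor)
    return seen
```

For example, starting from an environment where `b` is `frozenset({True, False})`, the program `("if", "b", ("assign", "x", 1), ("assign", "x", 2))` reaches halted states for both branches, so the analysis concludes that `x` may be 1 or 2.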
So I told you about this idea about tunable control flow sensitivity that we wanted to have in order to be able to investigate a lot of different possible sensitivities, and I just want to quickly describe how we achieve that. So this is actually implementing an idea from another project we currently have under submission to ICFP that sort of discusses a theory behind it, and this is where we're applying it in practice. Yes?
>>: Just to understand what you were talking about before, if you were going to evaluate an eval, right, how does it affect this? How does it affect this set of states that you potentially are in?
>> Ben Hardekopf: So let me go ahead and skip a slide ahead.
>>: You can go back, do your flow.
>> Ben Hardekopf: All right. So the way that I describe the analysis is basically just this reachable states computation. The problem with that, of course, is that there are an exponential number of possible states based on the number of nondeterministic choices you make. And so that's not a very tractable analysis.
So the way that we achieve control flow sensitivity, like flow sensitivity, context sensitivity and so forth, is we add a component to a state called a trace. And that's going to abstract the execution path that was taken to reach the current state. So that abstraction might be, for example, the last K call sites or the top K elements of the call stack or the allocation sites of the last K receiver objects, or the branch condition predicates that we've taken.
And so it can be any abstraction that we want, really, and we're just going to add that abstraction to the state. And then the idea is during the analysis, all states with the same trace component are going to be joined together. So the idea is we're looking at this set of reachable states, so we're dealing with sets of states. What we do at every step of the analysis is we partition these states so that all states with the same trace component are in the same partition, join all of those states together so that the number of states is bounded by the number of partitions, which we control by whatever trace abstraction we've chosen.
And since the states include the semantic continuations, when we join the two states together, we over approximate the continuations of each individual state. And essentially, what this means is so, for example, let's say that I wanted a flow insensitive analysis. I would make the trace abstraction just a single unit value that's always the same in all states. So at every step of the analysis, I'm going to take all of the states, I'm going to put them in the same partition, join them together into one state that over approximates the control flow of all the constituent states, and so I'm only going to have one solution over the entire program that's going to -- basically says any statement can execute after any other statement.
So if I have a flow sensitive analysis, I can make the trace abstraction the current program point. What that means is as I compute the reachable states, every state at the same program point is going to be in the same partition, joined together, and I have one solution for every program point. If I make the trace abstraction the last K call sites, then again, every state with the same last K call sites is going to be joined together and so forth. So just by plugging in different trace abstractions, I again get flow insensitive, flow sensitive, KCFA, stack-based KCFA, object sensitive and a bunch of different control flow sensitivities, including ones that people have never looked at before.
And the really nice thing is that the analysis and the implementation don't actually have to know ahead of time what this trace abstraction is, so we can parameterize the implementation over the possible trace abstractions. So for example, in our Scala implementation, we have an abstract class called trace, and you just inherit different classes off of this trace abstract class and plug it into a state, and that controls the context sensitivity of the analysis. And you can also make it path sensitive or flow sensitive or whatever you want after the fact.
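The partition-and-join step can be sketched with the trace abstraction passed in as a parameter. This is a simplified Python illustration, not the Scala trace class from the talk: `AbsState`, the three trace functions, and `partition_and_join` are hypothetical names, and only environments are joined here, whereas the analysis described above also joins the continuations.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AbsState:
    point: int         # current program point
    call_stack: tuple  # call sites on the way here, oldest first
    env: tuple         # (variable, frozenset of possible values) pairs

# Hypothetical trace abstractions: each maps a state to its partition key.
def flow_insensitive(state):
    return ()  # one unit value, so every state lands in one partition

def flow_sensitive(state):
    return state.point

def last_k_call_sites(k):
    def trace(state):
        start = max(0, len(state.call_stack) - k)
        return (state.point, state.call_stack[start:])
    return trace

def partition_and_join(states, trace_of):
    """Join the environments of all states that share the same trace."""
    joined = defaultdict(dict)
    for state in states:
        solution = joined[trace_of(state)]
        for var, values in state.env:
            solution[var] = solution.get(var, frozenset()) | values
    return dict(joined)
```

Swapping `flow_insensitive` for `flow_sensitive` or `last_k_call_sites(1)` changes how many partitions survive each step without touching `partition_and_join`, which is the decoupling the talk describes.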
Yes?
>>: So I didn't mean to interrupt.
>> Ben Hardekopf: No, go ahead.
>>: I see what you're saying about this decoupling, and yes, it's nice in principle. It's a way to think about it. Oftentimes, if you want to get efficiency, you really want to bring these two close together. And by efficiency, I mean even something like two level object sensitivity. So 2V, 2H type of thing. So [indiscernible] is difficult. Now, JavaScript, if you
[indiscernible] similarly difficult.
>> Ben Hardekopf: So the whole point is we don't know the answer to that. And that by allowing this sort of tunability, we can answer those kinds of questions. So, in fact, we have a paper that we're working on where what we do is we have this wide range of context sensitivities like object sensitivity and
KCFA and stack-based KCFA and object sensitivity, and vary both the Ks and across all these things, the heap sensitivity, as well, it's easy to incorporate heap sensitivity into this.
And we explore exactly this question. So we've taken, for example, we've taken a stack-based KCFA and explored from K equals one all the way to K equals five,
and the corresponding heap sensitivity. What we've found, interestingly enough -- I did not expect this -- was that if you take a particular add-on and you take, say, stack-based KCFA and you range the Ks from zero to five and the heap sensitivity from zero to four or whatever the valid thing is, the performance actually goes something like that.
So basically, it means that as the sensitivity increases [indiscernible] in some cases time gets a lot longer and then it will get a lot smaller and then it will get a lot longer again, and there's very little correlation between the sensitivity that you give it and the actual performance that you see.
And what's happening is that, well, actually -- okay. I'm starting on my other talk, which is the talk about the tunable control flow sensitivity. So I'll be happy to talk about it offline. But there's some interesting stuff there.
So the idea is that this tunability allows us to explore those kinds of questions. Of course, once you've decided that this is the right sensitivity, then probably it would be more efficient to implement that specific one built into the analysis. But this allows you to figure out what that sensitivity should be.
All right. So to go to whoever's question it was, your question about eval, so obviously, dynamically injected code can't be statically analyzed. So there are a variety of strategies. One is you can just say, if I see an eval, I have no idea what could happen, I'm going to set everything to unknown.
The problem is that since evals are not sandboxed and they can just modify anything in the entire state, you basically, it's not useful at all at this point. If there's any possibility there's an eval in your program, then essentially you're going to say I have no idea what's going on. So that's not very useful.
In certain circumstances, for certain targets, including browser add-ons, it is reasonable to disallow eval, and that's because, for example, the current vetting process for browser add-ons strongly discourages eval, because the people vetting them don't want them in there anyway. So this is sort of an additional imposed burden. This is what the developers currently have to deal with. You're not supposed to use eval in browser add-ons.
Now, of course, you can't just check for eval syntactically, because, again,
eval is just a method. It's just a property of the global object, and so if somebody could somehow cleverly compute the string eval, then they can call it.
But you can easily handle that, because it's just a property, by appending a little snippet of code to the top of every add-on that says global object dot eval gets function no op. And that means that even if somehow they manage to cleverly call eval, if they do, it's a no op. It's not going to do anything.
So we can guarantee, while we're analyzing it, that even if we don't know what this string is that's being applied to the global object, even if it's eval, it's a no op, so we don't need to worry about that.
Now, of course, this strategy, while it's reasonable for the targets we're looking at right now, the machine generated JavaScript and the browser add-ons, is not going to allow you to analyze arbitrary JavaScript, which, in fact, heavily uses eval.
So there are various strategies we can take for that, and we have some thoughts. We have not implemented these yet, but that's future work to be able to handle these things.
All right. So I've talked about how we actually can soundly analyze JavaScript itself as sort of a base level, a fundamental analysis that gives you sort of control dependencies data dependencies and so forth. Now I want to talk specifically about how we can apply this kind of analysis to security auditing for browser add-ons.
The general concept, the kinds of things we want to enforce are, for example, that the add-on doesn't inject scripts into browser web pages or that the add-on doesn't leak private information to the network. But, of course, some add-ons necessarily inject scripts and/or leak private information over the network to fulfill their intended function.
If I have an URL shortener add-on, then the entire purpose of that add-on is to take the current URL and send it over the network and get a shortened version of it. So you can't just apply this blanket security policy to all add-ons, right? You can't say that no add-on should leak the current URL or something like that.
There's also some questions in the security policy about whether implicit leaks are important to track. So I'll talk about implicit leaks in a minute and what they are and why they may or may not be important, and there's some controversy over that.
So for script injection, there are a variety of ways of achieving this, that have to deal with assigning to certain fields, certain properties, or calling certain functions. And the solution really is that our analysis can detect any object access or update that matches any of the above patterns. And, of course, these are properties -- when I say document.getElementById.innerHTML, those are all properties, and they can be arbitrary strings that are computed.
So you just can't do a syntactic check. You just can't grep for does anybody assign innerHTML or something like that.
So you actually need our analysis to actually trace through the control and data flow and see if you can actually have any object access or update that might match any of these patterns.
Several of these are general and some are specific to Mozilla. If you were to, for example, do this for Chrome, there would be a different interface and you'd have to look for different things. So this list is specific to Mozilla. Yes?
>>: So if you [indiscernible] you would assume that this is an injection possibility? Like, for instance, in the first case, right-hand side is instead of being constant, is something that's updated --
>> Ben Hardekopf: If it's something that could contain a script text. So, in particular, your analysis can say, well, is this going to be a string. And if it's a string, is it a string that can contain script in it. If so, we would say this is a possible script injection. So it doesn't have to be just any assignment to innerHTML, you can be more precise than that.
>>: So you can reason about the content of a string?
>> Ben Hardekopf: Yes. So I'll be talking about that. Okay. I mean, we're not doing regular expressions. So I'm not saying we're being extremely precise, but we can be somewhat precise, and I'll talk about that in a minute.
All right. So that's script injection. It's pretty easy to grasp. If we want to deal with leaking information, there are two basic kinds of leaks we want to look at. The first is explicit leaks, and these are caused by data dependencies between the secret information and the information that's sent over the network. So the examples I gave of tricky behavior from earlier were all examples of explicit leaks. So just to give you a reminder, we have our example where we are setting a property via secret info and then iterating over the properties and sending them out and that's a direct data dependence.
If we look for implicit leaks, these are control dependencies between the secret information and the information sent over the network, and here's the reason why implicit leaks are kind of controversial. So many times, implicit leaks are, in fact, very innocent. So here's an example of an implicit leak. If the current URL is YouTube, then send a message to YouTube.
So if I have an add-on whose specific intended purpose is to deal with YouTube, then it's going to contain code like this, and essentially I'm leaking information to YouTube that I'm visiting YouTube. So not terribly interesting.
But I can't just ignore implicit leaks altogether, because I can have code like this. And I won't go through the details. But essentially, what this is going to do is I'm going to have some secret info. I'm going to copy it over into another place and then send that copy, and there's never going to be any data dependence between the copy and the original version. It's going to be exactly the same, but there's no data dependence.
And so this is probably malicious. And one of the key differences between the two is this idea of unamplified versus amplified. So you can think of an implicit leak as essentially leaking one bit of information. If I only have that one bit of information, probably not terribly useful. But if I wrap that inside of a loop, and then I can leak an arbitrary number of bits, right? So in this case, I can leak an entire string, if I wanted to.
So an amplified implicit leak is one that's wrapped inside of a loop that allows me to leak an arbitrary number of bits.
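The slide code is not reproduced in the transcript, so the following is only a hedged illustration of the general shape of an amplified implicit leak, written in Python rather than JavaScript. The function name `copy_via_control_flow` is made up, and the sketch assumes every character of the secret has a code point below 256.

```python
def copy_via_control_flow(secret: str) -> str:
    """Rebuild `secret` without ever assigning data derived from it.

    Every value stored in `copy` comes from the loop counter `code`.
    The secret is only ever read inside branch conditions, so there is a
    control dependence on the secret but no data dependence.
    """
    copy = []
    for ch in secret:                # the loop amplifies the leak
        for code in range(256):      # candidate character codes (assumed < 256)
            if ord(ch) == code:      # branch on the secret
                copy.append(chr(code))  # appended value comes from the counter
                break
    return "".join(copy)
```

An analysis that tracks only data dependencies would see `copy` as built from loop counters and report nothing, which is why the amplified case cannot simply be ignored.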
So the way that we present our analysis and again, this is a revised version from our first version. Our first version didn't do this. And now we looked at what we had and figured out this is a better way of applying our analysis.
What we want to do is instead of enforcing a specific policy, we're going to generate a signature that shows possible script injections and leaks. And this gives the add-on vetter discretion to decide whether a signature is acceptable or not.
So they can essentially say, oh, are you trying to send the current URL over the network? Well, if this is an URL shortener, yes, that's expected behavior, that's fine.
Now, there are some questions. Script injection is simple to represent. We can just say here's a line where it looks like you're going to try to inject a script.
But generating useful information about leaks is more difficult. So, for example, you want to know what kind of network domains are involved or what kind of leaks are present.
Specifically for network domains, standard constant string analysis is insufficient. There's a very common pattern for communicating with websites where you come up with some base URL, and then you append strings to it based on arguments you want to give it or sub-pages you want to visit and so forth.
If we were doing a constant string propagation, essentially at the send message point, the base URL would be some unknown string. You'd have no idea. So our solution was to implement a string prefix analysis. So this is what I was talking about. We were being somewhat more precise with our strings. So we were actually tracking prefixes of strings so that, after sending a message to the base URL, we'd know that that's example.com plus something that we don't know.
So this is a fairly good trade-off between precision and performance. It gives you in many circumstances the domain that you're communicating with, without imposing a heavy performance penalty.
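A string prefix domain of this kind can be sketched as follows. This is a minimal Python illustration under assumptions, not the domain actually implemented in the analysis: the class name `PrefixStr`, its two fields, and the exact `concat` and `join` rules are choices made for the example.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class PrefixStr:
    text: str
    exact: bool  # True: exactly `text`; False: `text` followed by anything

    def concat(self, other):
        """Abstract string concatenation."""
        if self.exact:
            # A known string plus something: the prefix grows by other's prefix.
            return PrefixStr(self.text + other.text, other.exact)
        return self  # our own tail is already unknown, so nothing more is learned

    def join(self, other):
        """Over-approximate both strings by their longest common prefix."""
        if self == other:
            return self
        common = os.path.commonprefix([self.text, other.text])
        return PrefixStr(common, False)

UNKNOWN = PrefixStr("", False)  # any string at all
```

With `base = PrefixStr("http://example.com/", True)`, the value `base.concat(UNKNOWN)` is `PrefixStr("http://example.com/", False)`: the argument is lost but the domain is kept, which a constant-string analysis would have reduced to a fully unknown string.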
The kinds of leaks that are being sent are also important to the add-on vetter.
And so our solution was this idea of qualified security labels. The idea is we're going to have labels for every source of secret information so network domains, key presses, browser history, local file system and so forth. And then we're going to qualify those labels by how that information is being leaked, whether it's an unamplified, implicit leak, an implicit leak that's amplified just by the event handler, the event loop itself, an implicit leak that's amplified through some loop, or an explicit leak.
And so every program value is going to be tagged with a set of these qualified security labels that are going to be propagated according to the data and control dependencies. And so the signature that we give the vetter reveals not just what sources are being leaked, but exactly how they're being leaked.
So I can say, okay, you are leaking the current URL, but it's only an unamplified implicit leak, but you're also leaking information from this
network domain, and it's an explicit leak. And you can use that to judge how important you think a leak might be.
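One way to picture qualified security labels is as sets of (source, qualifier) pairs attached to values. The Python sketch below is an assumption-heavy illustration: the propagation rules, the qualifier strings, and the function names `through_data`, `through_control`, and `signature` are invented for the example, and the event-loop qualifier mentioned above is left out for brevity.

```python
# A label is a (source, qualifier) pair, e.g. ("current-url", "explicit").
# Hypothetical qualifiers used here: "explicit", "implicit", "implicit-amplified".

def through_data(*operand_labels):
    """A value computed from operands carries every operand's labels unchanged."""
    return frozenset().union(*operand_labels)

def through_control(guard_labels, in_loop):
    """A value assigned under a guard picks up the guard's sources as implicit.

    Assumed rule: the leak counts as amplified if the branch sits in a loop
    or the guard was already tainted by an amplified implicit flow.
    """
    out = set()
    for source, qualifier in guard_labels:
        amplified = in_loop or qualifier == "implicit-amplified"
        out.add((source, "implicit-amplified" if amplified else "implicit"))
    return frozenset(out)

def signature(sent_labels):
    """Summarize the labels on a value that reaches a network send."""
    return sorted(f"{source}: {qualifier}" for source, qualifier in sent_labels)
```

For a value computed from network data inside a branch on the current URL, `signature(through_data(net) | through_control(url, in_loop=False))` lists the URL as an unamplified implicit leak and the network domain as an explicit one, mirroring the example in the talk.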
All right. So really quickly, I just want to give some preliminary results on the first version of our analysis we had. So again, this is not exactly the version that I've been talking about. That version I've been talking about is sort of the lessons we learned from this version. We're about two or three weeks away from actually being able to give all of these solid results for the new version that we're currently implementing.
So the lessons learned, first of all, again, this new IR design. So our original IR design did not separate pure expressions from impure statements and had several other differences that ended up making the semantics fairly complicated and large.
These new, concrete and abstract object domains which add a lot more precision than what we had before. A new string abstract domain. So, in fact, there's interesting -- so most objects have object.prototype as their prototype. And that has a lot of built-in properties in it. Whenever you look up an unknown string to pretty much any object, because an unknown string, you don't know if it's there or not, you'll have to go all the way up the prototype chain. So you're going to grab all of these built-in properties from the object prototype whenever you look up an unknown string, and that can actually lose a lot of precision. So we're working on a new string abstract domain that actually tries to be able to say something like I don't know what this string is, but I know it's not any of these special properties, so don't bother grabbing these.
And we think that will actually help precision a lot in certain cases.
We have some new precision and performance optimizations, and this whole idea of signatures, which is a new approach to security auditing. What we had before was essentially your standard secure information flow analysis. So we had this lattice of labels, and we would say you have a leak based on that lattice, rather than giving you more information about what exactly the sources were and what kind of leak it was and so forth.
So just a couple of more slides. What we did was for our first version of the analysis, we grabbed 838 add-ons from the official Mozilla add-on repository.
And we looked at the security policy, don't leak the browsing history over the network, and we did it under two modes. One where we only looked at explicit leaks and one where we looked at both explicit and implicit leaks.
If we look at explicit leaks only, what our analysis showed was that we could take about 79 percent of the add-ons and just say they're safe, about 20 percent of them, we said they might be unsafe, and then about one percent of them we timed out. And the median time was 16 seconds. Average time, couple of minutes.
If we add in implicit leaks, then the story gets a bit worse. So we could only verify as safe 65 percent of the add-ons. We said 32 percent of them might be unsafe. And if we look -- so, of course, this is just results of the analysis.
We want to know how precise was our analysis. If we reported a potential leak, was it really there. We didn't actually look through all 838 add-ons. We picked 70 of them, and we manually inspected those 70, and classified the reported leaks as either true leaks or false positives.
And what we found was that if we were only looking at explicit leaks, we did, in fact, a really good job. So 95 percent -- in this sample, not over the whole 838. In this sample, 95 percent of the reported leaks were true leaks, and only 5 percent false positives. If we added in -- by the way, the true leaks were mainly sending the current URL to ad servers. There were a couple that we found, there was one called Surf Canyon, for example, that sent the current URL to their own web server for some reason. We don't know what they did with it once they got it. But most of the explicit leaks were to ad servers.
If we do implicit leaks, we find that we do obviously much worse. So a little less than 75 percent of them were true leaks, and the rest were false positives. And of the true leaks, most of them were completely uninteresting, like that first benign example that I showed that I'm at YouTube and I'm sending a message to YouTube. Like nobody cares.
So sort of the lesson that we had from this was, first of all, nobody right now is trying to be really tricky about how they leak information and how they do these kinds of security vulnerabilities. And that's mainly because they don't have to be. Because the current state of add-on vetting is so poor, you don't have to be really tricky in order to make it happen.
So we really, at this point, only need this sort of explicit leak detection.
But we believe that as you start using tools like this, that will detect those kinds of leaks, obviously the attackers are going to turn to more sophisticated techniques and so it's quite possible that they would start using these kinds
of implicit amplified leaks and so forth.
So the idea of the add-on signatures, again, is that you can choose how much of these you want to pay attention to and whether you want to explore the implicit leaks and give you an idea of the level of the implicit leak. And we believe that could be important in the future.
All right. Thank you.
>>: Are you going to give us any information or do you have information about just the precision of the analysis, not with respect to this one application?
>> Ben Hardekopf: So measuring the precision of the analysis is actually very difficult, especially if you're analyzing it across different control flow sensitivities, which we did. So we looked at things like closure set size and address set size and tried to get various statistics like that. And it turned out that there was very little correlation between those numbers and the -- so we were looking at the moment, because we're only doing a preliminary study, at the performance. And we're trying to see, we thought, well, okay, if we're seeing a much better performance, this means we should have, like, much fewer addresses or many fewer closures or something like that. That wasn't true.
So pretty much the only metric that we could come up with that correlated with the performance across the different sensitivities was the metric for a particular function context. How many times you saw that context during the fixed point computation.
Turned out that the sensitivities, rather than doing things like having fewer closures that you might be calling or fewer objects that you might be accessing, most of the difference was in because I'm splitting up information in different ways, I'm visiting a particular context many fewer times.
But basically, I mean, we looked at things like -- we tried to gather some statistical metrics, numbers, like closure set size, address set size, other similar types of things. And we just found very little correlation with anything that we were seeing, so we ended up saying this is not a useful metric. It's not telling us anything.
>>: Just to follow up on that, do you still think in this basis for doing data analysis in large scale, you certainly need some level of [indiscernible] but
object sensitivity do well precision wise. That's been the result of
[indiscernible].
>> Ben Hardekopf: And mainly in Java, right?
>>: In Java, right absolutely. There's no such thing as [indiscernible].
>> Ben Hardekopf: Right.
>>: The problem with that sensitivity is difficult to scale [indiscernible]. So is that sort of --
>> Ben Hardekopf: So we have looked at it. So object sensitivity turns out not to be that great for JavaScript. So in our preliminary studies, we looked at k-CFA, stack-based k-CFA, object sensitivity, and acyclic CFA.
And the one actually that we found works best is acyclic CFA for -- and I forget what our heap sensitivity setting was, one or two or something like that. So object sensitivity, yeah, in Java, it works great. Actually, people have done it in JavaScript and said hey, it worked for Java, it must work for JavaScript. Nobody's really done a study until we started looking at this that actually says okay, let me compare object sensitivity for JavaScript versus these other things and see which one works best.
What we're finding is it may be useful to mix in some amount of object sensitivity, like sort of a hybrid idea so I can at least say am I inside the global object. So one of the problems with object sensitivity is very often, I'm in the global object. That's my allocation site. And so a lot of things get mushed together that otherwise wouldn't be.
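The merging effect described here can be sketched with a toy comparison of two context choices. This is only an illustration under stated assumptions: the call records, the `GLOBAL` allocation-site label, and the functions `callSiteSensitive`, `objectSensitive`, and `analyze` are hypothetical and are not taken from the speaker's analysis.

```javascript
// Two top-level calls to the same function. Because both are made from
// top-level code, the receiver (`this`) is the global object each time.
const calls = [
  { callee: "id", callSite: "line10", receiverAlloc: "GLOBAL", arg: "number" },
  { callee: "id", callSite: "line11", receiverAlloc: "GLOBAL", arg: "string" },
];

// Call-site style context: distinguish calls by where they occur.
const callSiteSensitive = (c) => `${c.callee}@${c.callSite}`;
// Object-sensitive style context: distinguish calls by the allocation
// site of the receiver object.
const objectSensitive = (c) => `${c.callee}@${c.receiverAlloc}`;

// Collect, per context, the set of argument types that flow in.
function analyze(callList, contextOf) {
  const argTypes = new Map();
  for (const c of callList) {
    const ctx = contextOf(c);
    if (!argTypes.has(ctx)) argTypes.set(ctx, new Set());
    argTypes.get(ctx).add(c.arg);
  }
  return argTypes;
}

console.log(analyze(calls, callSiteSensitive).size); // prints 2
console.log(analyze(calls, objectSensitive).size);   // prints 1
```

With the receiver-based context, both calls land in the single context `id@GLOBAL`, so the argument is seen as either a number or a string; the call-site context keeps the two apart.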
So just object sensitivity by itself doesn't work very well. But if you add it with some other things, like acyclic CFA or something like that, then maybe it would be more useful. So we have not completed all of our studies, but that's exactly the kind of question that we want to answer with this idea of the tunable sensitivity. So we have a number of studies laid out where we're going to be exploring those, plus some new sensitivities that nobody's looked at before.
It turns out with this method for tunability, it's really easy to come up with new kinds of sensitivities. For example, based on type signatures or other things. Another question?
>>: I guess it's related to the vetting process. You pointed out that -- Ben and I have seen similar things. Most people aren't trying hard, it's pretty easy to catch them, right? You can basically -- so it's good if people are not trying to be malicious and they're just accidentally doing something. But when you get to the point in JavaScript where people are actually trying to be clever, right, then there's like a sharp cliff, and these kind of analyses just, it seems like it's almost impossible.
Now, maybe I'm wrong, I'd love to hear you tell me that I'm wrong, but it seems like for a determined attacker, there are easy ways to get around your analysis. Is that --
>> Ben Hardekopf: Get around it meaning that it would -- it would miss the leak or get around it meaning that you could just get it to the point where you can't get --
>>: Sufficiently, the approximations you're doing, the analysis --
>> Ben Hardekopf: Just imprecise enough that I can say there might be a leak here, but I can't give you any information about it?
>>: Right, right.
>>: Two points, two ways to [indiscernible]. Precision and performance. They can design obfuscations that would be deliberate on precision --
>> Ben Hardekopf: Right. So okay. A couple of answers. One is so far, what we looked at is just the actual add-ons from the official repository, plus a few injections that we've done where we've taken an add-on and sort of injected stuff in it and verified, yes, we find it. So we've not done a -- I guess right now we don't have any add-ons for people that tried to be really, really tricky about how they do things to test this on.
So I can't give you a firm answer, yes or no. Because it is a sound analysis, if they are having a security leak, we are going to mark it as being possible. But, of course, if we mark a whole bunch of things as being possible and it turns out that many of them are false positives, people aren't going to pay much attention to that.
So there is this question of how precise can we be, and that's going to, I guess, we're going to find that out as we do more experiments, and we can actually try to build add-ons that try to be really tricky and say how much information can I give you and so forth.
Right now, it's hard enough just to be sound in the first place with JavaScript. So that's what we've been concentrating on, and we wanted to get as much precision as we could while guaranteeing that. The next step is to maintain that soundness but be more and more precise and see how precise do we have to get in order to detect these kinds of things.
The other answer might be more social, rather than technical. Because the add-on developers have incentive to actually get their add-ons added to these repositories, because that's where most of the people get them, if they submit add-ons that end up being really difficult for the analysis, and if the add-on vetters are actually using this kind of analysis, then the add-on vetters can just say, well, this was too tough. We're not going to accept your add-on. So it's up to the add-on developers to write add-ons that our analysis, in fact, will work well on and give the vetters answers. So that's more a social answer than a technical one.
>>: You said you're able to effectively guarantee that you'll report that there was a leak.
>> Ben Hardekopf: That there could be a leak, right.
>>: There could be a leak. But then you have the explicit leaks and implicit leaks. And so if you're -- basically, it seems like there's a set of leaks that even your implicit leak detection wouldn't catch.
>> Ben Hardekopf: Oh, certainly, yes. There are many other kinds of leaks besides -- I should have been clearer. Our security policy is only with regard to the explicit and implicit leaks. So there are timing leaks, there are termination leaks, there are EM radiation leaks. There are all sorts of leaks that you could have that we are not catching.
>>: For the implicit leaks, there's a lot of communication channels through [indiscernible] and things like that where you could indirectly communicate from two seemingly unrelated APIs. And unless you're basically poisoning access to most of the DOM, it seems like there would be --
>> Ben Hardekopf: So this is why the implicit leaks are kind of controversial, because they do, in fact, sort of proliferate a lot and give a lot of false positives, because essentially, if there's any control dependence at all, then it's going to mark it as being tainted.
And so --
>>: Dependence on anything which accesses the DOM?
>> Ben Hardekopf: The idea is if I'm control dependent on some secret information, so, for example, if I say, if the current URL is this, then either do something or don't do something. Then on either branch of that conditional, whether something happens or not depends on the current URL. And so anything that happens in either of those is going to say this could be a leak, if it's either itself publicly observable or leads to something that will be publicly observable.
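The difference between an explicit leak and a control-dependent (implicit) one can be sketched with a toy taint tracker. This is a minimal illustration under stated assumptions: the statement format, the function `findLeaks`, and the example program are hypothetical, and taint here only ever grows (there are no strong updates), which is far cruder than a real analysis.

```javascript
// Statements are plain records: "assign" copies taint from the variables
// it reads, "send" is a publicly observable action, and "if" branches on
// the variables it reads. Both branches are always explored.
function findLeaks(program, secrets, trackImplicit) {
  const tainted = new Set(secrets);
  const leaks = [];
  const isTainted = (vars) => vars.some((v) => tainted.has(v));

  function run(stmts, pcTainted) {
    for (const s of stmts) {
      if (s.kind === "assign") {
        // Data dependence, or a write under a secret-dependent branch.
        if (isTainted(s.reads) || pcTainted) tainted.add(s.target);
      } else if (s.kind === "send") {
        if (isTainted(s.reads)) leaks.push({ line: s.line, via: "explicit" });
        else if (pcTainted) leaks.push({ line: s.line, via: "implicit" });
      } else if (s.kind === "if") {
        // Everything inside the branches is control dependent on the guard.
        const guard = trackImplicit && isTainted(s.reads);
        run(s.then, pcTainted || guard);
        run(s.else, pcTainted || guard);
      }
    }
  }

  run(program, false);
  return leaks;
}

const program = [
  { kind: "assign", target: "copy", reads: ["url"], line: 1 },
  { kind: "send", reads: ["copy"], line: 2 },          // sends the URL itself
  { kind: "if", reads: ["url"], line: 3,
    then: [{ kind: "send", reads: [], line: 4 }],      // sends a constant
    else: [] },
];

console.log(findLeaks(program, ["url"], false).length); // prints 1
console.log(findLeaks(program, ["url"], true).length);  // prints 2
```

The send on line 4 transmits no secret data, yet whether it happens at all reveals something about the URL, which is why tracking control dependence reports it and why such reports multiply quickly.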
>>: You've annotated the entire DOM surface area with --
>> Ben Hardekopf: Currently our modeling of the DOM is, in fact, very coarse. And that's one of the things that we can do to make our analysis more precise and better is to do a better modeling of the DOM. So again, we're focusing on the JavaScript part, rather than the DOM part. So our DOM model is very coarse grained. So definitely one of our bits of future work is to do a better modeling of the DOM.
But yes, I mean, it is a concern, especially for implicit leaks. That's why we got so many more false positives for the implicit leaks than we did for the explicit ones and that's something, again, we need to work on. So that's, in fact, an advantage to the attacker, because if you get used to saying, well, all of these implicit leaks reported are false positives, they can actually start exploiting them and people are just going to disregard that warning and say that's not relevant.
So it is important to be able to do a better job of that, which we're trying to get towards by at least saying in our signatures what kind of implicit leak it was. Was it amplified? Was it unamplified and so forth. So that's not all the way there, but it's at least a step in the right direction.
>> Ben Livshits: So let's thank the speaker. Ben will be here for the next three days.