>> Madan Musuvathi: Hi, everyone. I am Madan Musuvathi from the Research in Software
Engineering Group, and it's my pleasure to invite Iulian here today. Iulian, welcome back. He
was an intern a very, very long time ago with Charles and myself, working on the Chess Project,
when the project was just beginning. So he was one of those people who had to put up with a lot
of -- what do you say? -- newly forming research, not knowing where we were going, and dealing with
lots of Microsoft infrastructure. It was great. It was a great experience. So since then, he's been
very successful. He's an assistant professor at UC Riverside, and his main focus is on
programming languages and software engineering, and he's been focusing a lot on smartphones
these days, so that's what he's going to be talking about today. Iulian?
>> Iulian Neamtiu: Thanks, Madan. So I'm going to present some work on dynamic analysis --
how to enable dynamic analyses and what to apply them to -- and this is joint work with my
students and my collaborators. So smartphones and smartphone applications have been
exploding in popularity recently. More and more tasks that used to be performed on servers or
desktop computers now run on smartphones, so that means we need to shift our focus whenever
we do, say, security or performance or verification for applications -- to shift our focus to the
mobile platform. And here are some of the popular apps that I found at the Windows
Smartphone App Store, and if we look at the set we have right there, it's quite a broad
range, and they all come with some particular concerns. And, what I'm going to be talking about
is how to develop and the kind of infrastructure that we've developed to allow us to study some
of these concerns. So one of your concerns might be, let's see, does this financial application,
Chase Mobile, actually implement the transaction scenarios when I do a financial transaction? It
better do so. Another app that you might look at is say, if you have babies, there is a baby
monitor app, and now the concern is, okay, the parents are monitoring the baby, but who else is
monitoring the baby, and what kind of information this app collects and where does it send it to?
Another particular and pressing concern, actually, for mobile is energy consumption. So, for
instance, if you run a game and it has heavy usage of graphics and 3D and things like that, you
would like to be able to know which parts of your application are eating up a lot of power and
what you can do to optimize that. And it turns out that there is a very powerful tool for enabling
this kind of analysis. It's called dynamic analysis, and we can run a whole bunch of analysis
applications on it from debugging to verification to profiling. So how does dynamic analysis
work? We run the app on a phone, and then we have some collection infrastructure, and what's
going to happen, we have some inputs to the app. So this is the app running straight on the
phone, as it's supposed to. This is the thread of execution, and then we have some inputs to the
app, and then we take a magnifying glass, and this magnifying glass can be a whole bunch of
analysis, and we try to check whether the app actually conforms to the property that we're
interested in. So, for instance, for this mobile banking app, we might want to define some app
invariants, like this transaction semantics and then use dynamic analysis to see if those invariants
are being respected. What about the baby monitor? For this, we might run a dynamic analysis
that's called information flow tracking, so we collect the kind of data that the application reads,
and then we also look at the data that the application writes or sends over the network, and we
tried to see if this conforms to a certain security policy. For instance, the network flow shouldn't
escape the local network, or something of that sort. And for an app such as a game, we can run
an energy profiler, which is another kind of dynamic analysis, to measure energy usage in
various parts of the app. Okay, so this all sounds nice, but building infrastructure for dynamic
analysis, especially for smartphone apps, is not so simple. And what I'm going to do, I'm going
to talk about a set of challenges that basically make our task a little bit more complicated. So
anyone who has worked with dynamic analysis before knows that there's two main components
for a successful analysis. First of all, you need high-quality input, and then you need good
coverage. So you need a good set of inputs to drive the application through the set of relevant
states, so that you do this dynamic analysis on relevant states. So good inputs and
comprehensive or systematic coverage are the common denominator across all dynamic analyses.
However, on smartphones, things are a little bit more complicated, because as we're going to see
in a second, on smartphones, the input can come from various sources, can be very complex, and
it can take many forms. Another challenge is source code availability. I think it's unrealistic to
expect that the apps that are out there are going to come with source code attached, so we need
to come up with an analysis that allows us to run without having access to the source code.
Another challenge is running on real phones, and you're going to see in a minute why that's the
case, because whenever we run on emulators, the full gamut of input is not available, so we have
to mock, say, GPS or the microphone or the camera, and that severely restricts the set of
observable states that you can drive the application to. And, finally, there's this -- because of the
wide array of sensors on mobile, it's really hard to reproduce execution. So I'd like to be able to
construct a system that allows me to even reproduce an execution. I'm not asking for more, and
even that is challenging. Okay, so what I'm going to do, I'm going to focus on three main lines
of work. The foundation is a record and replay system for smartphone apps. That appeared in
ICSE this year, and then building on that system, so that's the foundation, if you want, of
dynamic analysis, because it helps us deal with many of these challenges, and then building on
that foundation, we constructed a way to do systematic exploration of apps. In the left-hand
corner there, that's a paper that was published in OOPSLA this year, and then to show the
applicability of this dynamic analysis, I'm going to talk about some profiling, mostly focused on
network profiling, and that has appeared in MobiCom last year. So let's focus on the record and
replay first. So the kinds of challenges that we addressed in that work were dealing with
complex sensor input, dealing with the lack of source code, running on real phones and
reproducibility. So if you look at the problem formulation, what we would like to be able to do
is to record an input and be able to replay later. That's all we're asking for, and there's been work
in this area on smartphones, and there's been a lot of work on other platforms, as Madan has
mentioned, the kind of record and replay that we did for Win 32. So past approaches used
something called a keyword-action paradigm. So how do we do record and replay there? We
look at the application as having a set of fields, and then we do keyword matching on these
fields, and whenever we encounter a certain keyword, we perform an action. So suppose we
want to use record and replay for a very simple application that just has some textual input here.
This is an example for setting up your e-mail account. So what a record and replay scheme
would entail would be, first, you need to find the field that's called "Give this account a name."
Once you've found this field, execute an action, and the action is to type in jane@coolexample.com.
Then, the next keyword-action pair should be: find the next field, that's called "Your name," and type
Jane Doe. And then, we do a button kind of action, so we find the button that's called "Done," and
the action is click. And these keyword actions are usually saved in a script. So, yes, please,
Charles. Yes.
>>: Does a user do it? I mean, you can simulate this clicking and the typing business using a
computer script? How does the user actually click it?
>> Iulian Neamtiu: So there's a capture tool. There's a capture tool that will record a script, so
whenever -- yes. Whenever the user actually moves the focus around different fields, then the
script will retrieve the name of the field, and the input that the user has typed, and then they can
replay the script later on. So this is saving to a script.
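To make the keyword-action idea concrete, here is a minimal sketch of what such a script might encode, expressed as Java purely for illustration; the Step type and the field labels from the e-mail example above are hypothetical, not any particular capture tool's format.

    import java.util.List;

    // Illustrative only: a keyword-action script represented as (keyword, action, value) steps.
    // Real capture/replay tools use their own script formats; this just shows the structure.
    final class Step {
        final String keyword;  // label of the GUI field or button to find
        final String action;   // "type" or "click"
        final String value;    // text to type; empty for clicks

        Step(String keyword, String action, String value) {
            this.keyword = keyword;
            this.action = action;
            this.value = value;
        }
    }

    class KeywordActionScript {
        static final List<Step> EMAIL_SETUP = List.of(
            new Step("Give this account a name", "type", "jane@coolexample.com"),
            new Step("Your name", "type", "Jane Doe"),
            new Step("Done", "click", ""));
    }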
>>: [Indiscernible]. I'm sorry, fill-ins of people or --
>> Iulian Neamtiu: Right, so that's a good question. So a field is something that the approach
should hang onto. So it's something that is available via kind of a structured query language, so I
can ask the state -- I can ask the app, give me all the components of the GUI right now, and it
gives me a list of components, and one of the components is this field. So field simply stands for
something that accepts input, like a text box where you can input text. So field is accepting
inputs from the user via this textual mechanism. This can also be buttons or maybe selecting an
item in a menu.
>>: [Indiscernible].
>> Iulian Neamtiu: Yes, I'll get to that in a minute. That's where the scheme falls apart,
actually. So this is prior work. This is what's been tried before, and I'll show you why it doesn't
work. So without further ado, let's see why it doesn't work. Let's look at this app. It's called
Angry Birds. Probably most of you are familiar with it, and this is, in a way, the kryptonite of
record and replay, because there is no keyword-action scheme that can help you deal with Angry Birds
and be able to record and replay it. So why is that approach inappropriate? Well, consider
this action. We are pretty much in the physical world here, so we have a very wide range of
motions and gestures, so you can drag the slingshot, you can spin the bird, you can throw the
bird into the ground. There's a whole bunch of actions here. So the question is, if I were to
design a general-purpose record and replay tool, how would I design that? And the problem is
that if you use this keyword-action pattern, then keyword action actually relies on a certain
semantics, so as you've asked, what's a field? Well, that's a very good question. What's a field
here? I don't know. Because if you look at the actions that you need to record and the actions
that the user actually can perform, and you try to abstract from that and construct the language,
the language is going to be very app specific, very app specific. So, for instance, in this case,
maybe you want to introduce constructs for finding the bird in the slingshot. Maybe that's one of
the keywords. What about this? Once you find the bird, drag back the bird one inch, 210
degrees. And even worse, you can have this kind of very complex gesture. So, for instance, you
can spin the bird four times and then release it, and I don't know how to capture that in a
language. That would be circular bird swipe with increasing slingshot tension. It's very
complicated to come up with a general language for this. So not only are we going to have
trouble capturing and expressing the gestures, also this is going to be very app specific, because
this is only going to work for Angry Birds. Let me give you another example. There's an app
called Magic Eight Ball. So Magic Eight Ball, extremely useful app. It can answer a whole
range of difficult questions, so, for instance, if you want to know just before you leave home if
there's going to be traffic on the 520, you can ask, is there traffic --
>>: To come up with a question that has more than one answer.
>> Iulian Neamtiu: So the way this Magic Eight Ball works is you write down the question, and
then you shake the device, so you shake the ball, and voila, here comes your answer in terms of
that. In this case, the answer was yes, because I [indiscernible]. I don't know, we haven't looked
at the full range of behavior for this. But the point is, again, this keyword action falls apart,
because we need a language that should be able to express something like, how do you say in the
keyword action shake the phone? So basically, you want to get into modeling the physics, and
so you quickly move the device around the X, Y and Z axis. Now, the problem here, just like for
the previous app, we had gesture input from the touchscreen. Here, we have sensor input from
something called the accelerometer, and there's no end to this. If you look at all the popular
apps, they have a very wide gamut of sensors, so lots of apps depend on the geographical
location, so they have a sensor for GPS. There are apps that take pictures or that recognize
pictures, like a barcode scanner; the barcode scanner reads from the camera's frame buffer, and you
need to be able to record and replay that. The same goes for the microphone, since there are
apps like Shazam, where you record audio input, and then the app does some recognition on it.
And it turns out that, if you want your approach to work for popular apps, you have to deal with
all this input. So we did a study on popular apps, like the top 120 or so, and we measured the
number of gestures that are encountered in a five-minute run of the app, and I'm plotting that
number of gestures here. On the X axis, we have the number of gestures during the run, and on
the Y axis, I have the frequency. So, as we can see, pretty much all the apps use gestures, and
some of the apps use a high number of gestures, more than 100 gestures in a five-minute run, so
our approach had better be able to deal with this. And this is just gestures. I'm not talking about
GPS. This is just touchscreen gestures. So what we do, we construct an approach that's called
Reran that will actually capture, will tap into the physical input sensors, into all the sources, will
tap into the kernel, and we save that in an event trace, and that's going to be replayed. And we
deal with the entire gamut of physical sensors on the phone, from GPS to camera to microphone,
compass, accelerometer, keyboard, touchscreen and even the light sensor. So the key insight of
our work is to say keyword-action is too high level for us, because it assumes too much structure, and
we're dealing with very low-level input here. So what we're doing, actually, we're lowering the
level of abstraction quite a bit, and instead of using the high-level structure based on keyword
action, we capture and replay low-level sensor input. Yes, please.
>>: [Indiscernible] the kernel and the physical device.
>> Iulian Neamtiu: Yes, that's right. So, actually, the kernel -- so the kernel, we tap into the
kernel's input and output streams.
>>: [Indiscernible] the apps.
>> Iulian Neamtiu: We also do that. So, for instance, for GPS, you have to do that, because
GPS is not exposed in the kernel for various concerns, and for GPS, we have to intercept the
API. Yes, thank you.
>>: [Indiscernible] the API for every sensor.
>> Iulian Neamtiu: That's a good point. So some of the sensors, the touchscreen, light sensor,
compass and accelerometer do not have a very clean way of intercepting the API. Rather,
they post a reading into a buffer, and then the application has to read from that buffer. So
intercepting at that level is going to be very costly. GPS is better, because we don't have a
million -- we don't have, say, 1,000 readings a second, whereas for the touchscreen -- so the
touchscreen for this gesture that I have just shown with the -- where you do a circular swipe,
that's like 400 events in one second. So if you were to intercept the API, the approach is not
going to be timely, to allow you to do replay. So that's point number one. Point number two,
Angry Birds uses a lot of native code, so the API is completely shunted there -- they access the
sensors in a raw manner -- and then if you do interception at the API level, you've lost, because
you're shut out of the equation.
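To give a flavor of the level this works at, the records coming out of the kernel's input devices are fixed-size Linux input_event structs. A minimal sketch of reading them, written in Java for illustration, might look like the following; the 16-byte record layout assumes a 32-bit kernel, and the device path and root access are assumptions rather than anything guaranteed by Reran itself.

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Sketch only: reading raw Linux input events, the level at which Reran-style recording works.
    // Assumes a rooted device, a 32-bit kernel (16-byte input_event records), and that
    // /dev/input/event2 happens to be the touchscreen; none of that holds in general.
    public class RawEventReader {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("/dev/input/event2")) {
                byte[] rec = new byte[16];
                while (in.read(rec) == 16) {
                    ByteBuffer b = ByteBuffer.wrap(rec).order(ByteOrder.LITTLE_ENDIAN);
                    long sec = b.getInt() & 0xFFFFFFFFL;   // timeval.tv_sec
                    long usec = b.getInt() & 0xFFFFFFFFL;  // timeval.tv_usec
                    int type = b.getShort() & 0xFFFF;      // e.g. EV_ABS for touch coordinates
                    int code = b.getShort() & 0xFFFF;      // e.g. ABS_MT_POSITION_X
                    int value = b.getInt();                // coordinate, pressure, key code, ...
                    System.out.printf("%d.%06d type=%d code=%d value=%d%n", sec, usec, type, code, value);
                }
            }
        }
    }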
>>: [Indiscernible].
>> Iulian Neamtiu: They go to the kernel, but they don't go to the API. So when you say API,
you mean the virtual machine or something that's even lower than that. Yes, we tap into the
kernel, but the -- yes.
>>: [Indiscernible] is there an API called that's someone shook the phone.
>> Iulian Neamtiu: Oh, no, not at all.
>>: Or a whole bunch of acceleration readings and then I have to figure that out.
>> Iulian Neamtiu: Exactly, exactly, yes. Yes, please.
>>: With this, the previous paradigm that you had about the --
>> Iulian Neamtiu: Keyword action.
>>: Keyword action thing, right. That's at a sufficiently high-level abstraction that it's perhaps
robust to small changes in the app. If you're capturing touchscreen input and then keyboard
input, and the browser renders my page slightly differently, that touchscreen input may no longer
actually select a field.
>> Iulian Neamtiu: Sure, sure. So I'm going to come to that in a moment. That's one of the
limitations of our approach. That is a problem if your phone is not stationary when you apply
this. So if you -- why don't I just go to the next slide or two, and then that's going to come up.
So at a high level, how Reran works is very natural. So this works on the actual phone. We tap
into the kernel in a nonintrusive manner. We record the input events into an event trace, and then
we do some processing on the event trace to separate the wheat from the chaff. Some of the
events are simply not worth replaying, and also the user might not want to replay everything.
And then this gets sent to a replay agent that works on a PC, like a laptop, and the replay agent is
going to process the trace, copy the trace onto the phone, and then everything is done. And then
on the phone we do this event injection. The only reason why we have this replay agent running
offline is because we want to run some processing on the trace, but there is no problem in
capturing the trace on the phone and replaying it on the phone. Unless you want to do some
complicated processing, like we want, there's no reason to have a separate system like a PC in
the equation. Everything can run on the phone. Yes, please.
>>: Will energy be a piece of that?
>> Iulian Neamtiu: Sorry.
>>: Will energy consumption of the analysis of the trace be a concern there?
>> Iulian Neamtiu: That's right. So not only that, but also perhaps storage, because you don't
have that much storage. Yes. That's a good point. Thank you. So just to give you a flavor as to
the challenges we experience when we have to build such a system, let's look at this app called
Gas Buddy. It helps you find cheap gas in your neighborhood, and what I'm plotting here, on the
X axis, I'm plotting time, so this is an 80-second run of Gas Buddy, and on the Y axis, I'm
plotting the number of events that come from the sensors. And for simplicity, I'm not going to
plot all the sensor input, only two, so input from the compass and input from the touchscreen.
So, for instance, what the user might decide to do is to type in the location, say, Los Angeles, or
if you use GPS, simply say locate me. And then, we see that -- so these are the touchscreen
events, so the user had to use the touchscreen, and then there's quiet on the touchscreen front
while the app loads the station map, but once the station map is loaded, the app needs to find the
user's position, orientation, so there is a stream of compass events. Then the user -- after the map
is loaded, the user might want to navigate with pinch and zoom. That's why we have a lot of
touchscreen events so that we can see that in 20 seconds we have more than 1,000 events there.
And then the app is exercised as usual. So there are several points that come across here. First
of all is how to deal with concurrent events, so the approach has to be able to record and deliver
concurrent events. In this case, we see that we have both touchscreen and the compass, and
there's also the GPS in there, or perhaps the accelerometer, and so that's one challenge that we
had to overcome. The other challenge is mixing the incoming events with replayed events. So
this becomes apparent in the context, when you're doing something like locate me, and then you
recorded a trace saying Los Angeles and you're replaying the trace in Seattle, because then you
have to reconcile what the app is reading with what's saved in the trace, and we have certain
policies for dealing with that. For instance, in the case of GPS, we simply discard the current
readings and we only feed the GPS readings from the original trace, not from the incoming
stream. Also, timing accuracy is critical, and just to give you an idea as to why it's critical, let's
look at these two streams. So this is the same number of events, roughly, but you have to be
very careful with the timing. So if -- suppose I want to replay 400 events, and a swipe -- if it's a
complex swipe, it's about that many. If I have a very small interruption in the middle, so I
deliver, say, 100 events and then I pause very little, and then I deliver one event, I pause a little
and I deliver the rest of the events, this is not going to appear as a continuous smooth swipe
anymore. It's going to appear as one swipe, one press, another swipe, and that will kill replay.
Timing accuracy is critical, and it turns out that we have to replay with sub-millisecond
accuracy, something on the order of 10 to 100 microseconds, to be able to replay gestures
smoothly. And another concern is, for instance, when you have very high throughput input from
the frame buffer, when you're replaying, say, the barcode scanner. The barcode scanner does
analysis on the live image from the camera and tries to find the barcode in there. So there, it's
also critical to record and replay with a high accuracy. Yes.
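To illustrate why timing matters, here is a sketch of a replayer that preserves the original inter-event gaps instead of sleeping coarsely; the Event record and the 16-byte encoding mirror the reader sketch earlier and are assumptions, and Reran itself does the injection natively precisely for this timing reason.

    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.List;

    // Sketch only: injecting recorded events back into an input device while preserving the
    // original inter-event spacing. Thread.sleep() is too coarse for the 10-100 microsecond
    // accuracy mentioned above, so we spin; the Event layout and device path are assumptions.
    public class TimedReplayer {
        public record Event(long usec, int type, int code, int value) {}

        public static void replay(List<Event> trace, String devicePath) throws Exception {
            try (FileOutputStream out = new FileOutputStream(devicePath)) {
                long prev = trace.isEmpty() ? 0 : trace.get(0).usec();
                for (Event e : trace) {
                    spinMicros(e.usec() - prev);   // reproduce the original gap
                    out.write(encode(e));
                    out.flush();
                    prev = e.usec();
                }
            }
        }

        static void spinMicros(long us) {
            long end = System.nanoTime() + us * 1_000;
            while (System.nanoTime() < end) { /* busy-wait for sub-millisecond accuracy */ }
        }

        static byte[] encode(Event e) {
            ByteBuffer b = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
            b.putInt((int) (e.usec() / 1_000_000));  // tv_sec
            b.putInt((int) (e.usec() % 1_000_000));  // tv_usec
            b.putShort((short) e.type());
            b.putShort((short) e.code());
            b.putInt(e.value());
            return b.array();
        }
    }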
>>: [Indiscernible].
>> Iulian Neamtiu: Oh, yes. That's a good point. So the apps are not ready -- however, there is
a lot of determinism. It turns out that thread nondeterminism is not really an issue for
smartphone apps, which in a sense saves the day here. And I have some schedules that I can
show to you offline. But basically, what happens, there's a main application thread, and the job
of that thread is to process whatever GUI update changes other threads have posted, and lots
of this processing is done via something called an async task. Depending on the
platform, you have different names for it, but it's pretty much the same thing. So, for instance, if
you have network input that you know is going to take a long time, like downloading a file or
something like that, you put it in a background task. However, that background task is not going to
bother your app until it's completed. So it turns out that thread nondeterminism is not an issue
for the vast majority of apps. There are apps -- for instance, SoundHound, where that's been a
problem. Or, basically, you start it 40 different times and you see 40 different schedules for the
threads.
>>: [Indiscernible] this is Android, correct?
>> Iulian Neamtiu: This is Android, yes. So the results of running our approach were we were
able to demonstrate on 110 popular apps. So what we did, there's about 20 categories -- fitness,
lifestyle, health, games -- so we took about 20 apps in each category, and we selected apps from
there. In total, there's about 110 apps that we were able to replay, and the trace is phone specific,
because we depend on the physical coordinates, obviously, but the approach is not. You can run
Reran on another phone, and you're going to get the trace for that phone. And the runtime
overhead is less than 2%. There are limitations to this approach, as has been alluded to earlier.
So, for instance, if we have network nondeterminism -- there are apps that load content
dynamically, say the layout of a webpage, and that changes from run to run -- then we're going to have
a problem. So if there's a dynamic layout or network nondeterminism, then this approach might
not work. And, in a sense, we could solve that by simply recording the network input, but that's
overkill in my view. And then there's nothing really live in this whole record and replay scheme.
Sometimes, live content can mess up the approach, but we made a decision not to record and
replay the network input. And the apps are spread across different categories, and we have
demos on how we replay Angry Birds and other apps, and you can find that on YouTube.
So let me present three applications of this technique. So first, I'm going to talk about how we
can use record and replay to reproduce bugs in popular applications. Then, I'm going to discuss
something called time warping and then semantic input alteration. So, yes, please. Go ahead.
>>: This was in the beginning of your talk, so this replay is for developers or app store owners?
Or who is the target?
>> Iulian Neamtiu: So it's actually all of the above, because the developers can record an
interaction. A user, for instance, can record a crashing interaction and send a trace back to the
developer.
>>: I asked because you're making a very important assumption that you don't have the source
code.
>> Iulian Neamtiu: Yes.
>>: So that forces you to do things at a much lower level than you would have had to, and also
things like shared memory concurrency within the app. Because you don't have the source code,
it's difficult for you to capture that at a high level. If this were targeted at the developers, you
could say, well, you have the source code.
>> Iulian Neamtiu: Sure.
>>: So it's because you want to address everybody that you're doing that at such a low level.
>> Iulian Neamtiu: I think, in a sense, we want to be able to record and replay a wide range of
apps, and we don't want to -- we want to make as few assumptions as possible. As you said, if
you have the source code, then other possibilities become realities here. But in terms of this
case, if you use a little bit more abstraction, that's not necessarily a good thing, because then you
have to cater the record and replay scheme to specific apps, and I don't necessarily want that. So
let's see how we apply this to the task of reproducing bugs. So what we do here, we look at bug
repositories for popular apps, especially bug reports that have something called steps to
reproduce, so then we take those steps to reproduce, we install the app, we reproduce the bug,
and all this time the app has been running with record enabled, and then we replay the interaction
up until the point where it's just about to crash, and there we break into the debugger, and that's
useful, for instance, for developers, to look at the root cause of the crash. And it turns out that this
works well for many apps that perhaps you've heard about. I'm presenting the bug categories
there, so many apps crash on incorrectly formatted input -- for instance,
SoundCloud -- and we're able to replay that. K-9 Mail crashes on invalid input. NPR News
crashes because too many events are being delivered per second or something like that, and then
we're able to reproduce that. We're able to reproduce the bug in Firefox and Facebook, as well.
So this is a tool that can help both the user and the developer: the user
in capturing a trace and the developer in breaking into a debugger just before the app is about to crash.
Another application is something called time warping. So if you have a long-running app, an
app that, say, takes 30 minutes or so before it crashes, that might be a little bit too much and we'd
like to compress that. So we use a technique called time warping. So what we do, we take the
trace and we look at what elements of the trace can actually be compressed and delivered in
rapid sequence. And it turns out that whenever you have, for instance, data entry, if you have a
texting app, the phone is perfectly capable of accepting your 140-character message in something
like two milliseconds, but you can't type that fast. So what we do, we take the trace, we look at
this kind of situation, and then we compress the trace.
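As a rough illustration of the idea, a single pass over the trace can shrink long idle gaps (typing pauses, reading time) while leaving short within-gesture gaps untouched; the Ev record and the thresholds below are invented for this sketch, and deciding which gaps are actually safe to compress is the hard, app-semantic part discussed next.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: a naive time-warping pass that shrinks long idle gaps in a trace while
    // leaving short within-gesture gaps alone. The 200 ms / 50 ms thresholds are invented;
    // choosing which gaps are actually safe to compress requires looking at the semantics.
    public class TimeWarp {
        public record Ev(long usec, int type, int code, int value) {}

        public static List<Ev> compress(List<Ev> trace) {
            final long IDLE_THRESHOLD_US = 200_000;  // gaps longer than this look like idle time
            final long COMPRESSED_GAP_US = 50_000;   // what each idle gap is shrunk down to
            List<Ev> out = new ArrayList<>();
            long removed = 0;  // total time squeezed out so far
            for (int i = 0; i < trace.size(); i++) {
                Ev e = trace.get(i);
                if (i > 0) {
                    long gap = e.usec() - trace.get(i - 1).usec();
                    if (gap > IDLE_THRESHOLD_US) {
                        removed += gap - COMPRESSED_GAP_US;
                    }
                }
                out.add(new Ev(e.usec() - removed, e.type(), e.code(), e.value()));
            }
            return out;
        }
    }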
>>: [Indiscernible].
>> Iulian Neamtiu: So we have something called clustering. So we look at the trace, and we
have certain candidates of events. This is not fully automated. It's automated in the sense that I
can apply it automatically on a trace, but when we constructed the approach, we had to look at
what kind of events are amenable to compression and whatnot. So, for instance, data entry is
amenable to compression, or whenever the app is paused because the user reads something on
the screen, then that's also amenable to compression. So, for instance, you load a page in the
browser and then you take 20 seconds to go from the top to bottom because you have to do some
comprehension. That we can just fast-forward through, so we can deliver the event so that you
go through the page in, say, two seconds. Yes, please.
>>: So another way of thinking of that is, when the user is using the application, there are some
things that are up to the user -- like if you load a page, you might want to expand it so you
can read it, you might want to scroll it -- but when you're doing the replay, you don't need to
recreate all of that, because no one sits there reading the page.
>> Iulian Neamtiu: That's right, that's right. You have to be careful. We do selective replay, but
you have to be careful with choosing what not to replay, right? Because, for instance, if you
have a web app and you skip loading the page or so, then that's catastrophic. You won't be able
to get to the next page.
>>: [Indiscernible]. That example suggests that you're going to excise parts of the trace, like get
rid of the zoom events and so on.
>> Iulian Neamtiu: Oh, no, no.
>>: But what you're really doing is just compressing them.
>> Iulian Neamtiu: Compressing them. We still deliver the events, yes.
>>: Why is it semantic at all? You know the minimum delta that the app can accept input on, so
why can't you just squeeze out all events to get down to that one tenth of a millisecond?
>> Iulian Neamtiu: Right, so that's a good question. For instance, for swipes, swipes cannot be
compressed at all, because if you try to compress them -- so a swipe is going from point X to
point Y at a certain speed. So depending on the platform, if you swipe faster or slower, that
might mean different things, so for instance, if you swipe faster, that means you're going to land
further on the screen.
>>: There was a touch of a -- it's actually recording the time that it happened.
>> Iulian Neamtiu: That's right. That's right. You have things like momentum. So swipes, for
instance, we cannot compress at all. And if you try to compress a swipe too much, it's going to
be kind of jagged. The user is not capable of reaching from this point to this point in a very short
time, and that might look like two presses or so, so it's not going to look like a swipe anymore.
Using this technique, we're able to fast-forward the executions quite substantially, and it really
depends on the type of app. Apps that don't require that much user input, for instance, Pandora,
we're able to compress an original execution from 300 seconds to 24 seconds. Yes, please.
>>: Just to make sure I understand, you're not -- there was a process of manual discovery in
which you, through your expertise, figured out that certain events were amenable to compression
and others not.
>> Iulian Neamtiu: Right, right.
>>: You're not trying to discover this by looking at traces and seeing what you could compress.
>> Iulian Neamtiu: No, no. Now -- so what we had to do here, or what my then-student
Lorenzo had to do is look at the trace manually and try to find out -- so, first of all, you have to
cluster events to be able to say this is a swipe, this is a press. And then, once he was able to
cluster the events, we looked in a sense at the semantics of these clusters, and we tried to see
which events and which intervals between events can be
compressed. Yes, that was a manual process. However, now, you just give me a trace and we
automatically compress it for you. Yes, please.
>>: What happens to timers, so my keep-alives and various timers that are in the applications
sort of natively, quote-unquote. It's designed to wait for five seconds. Do you compress those
durations?
>> Iulian Neamtiu: Oh, no. Not at all. No.
>>: How does it not? Timers start going off, for instance. So when we do not compression but
the other way, expansion -- time expansion -- what we see is that various timeouts start to kick in, and
you basically are tracing an error path rather than a regular path in the app, which is not always
what you want.
>> Iulian Neamtiu: So we haven't had that issue with the apps that we've looked at. I'm not
saying that might not be an issue for the other 800,000 apps out there, but the apps that we
looked at, the timers were not an issue, and the time warping that we did was only in the
compress direction. It was not in the expand. I can imagine trying to expand this, and maybe
you have some issues, but in the apps that we looked at, that wasn't a problem, no.
The last application of our technique is something that we call semantic sensor data alteration.
So what we do here is we take an input, a sequence of readings from the sensors, and then we
alter some of the readings, and we use this to drive the execution through a different path, and
sometimes this exposes a crash. Now, this is akin to fuzz testing, but it's not really fuzz
testing, because we're not really trying to create malformed input or nonsensical input. So let me
give you an example of how we do this sensor alteration. For instance, for the GPS location, we
do something called a map shift. So we add a delta to the coordinates in the
initial input, or we say, instead of starting a route at one location, start at another location, or at a
location such that you'd have to ask for driving directions to Hawaii, something of that sort. Or we might
want to change the speed. And so we alter the readings, and this is sensor dependent, because we have to
apply these transformations in a way that preserves the semantics of the reading there. For
instance, for camera readings, we might choose to blur or darken the image, or we -- for the
microphone, we might want to add noise or change the sample rate, things like that. And it turns
out that this technique is quite effective in exposing some app crashes. So when we alter the
GPS location, we're able to crash these four apps, Yelp, GPS Navigation&Maps, Route 66 and
Navfree USA. And then we're able to drive the execution through different paths for other apps.
For instance, for the camera, doing those kinds of transformations gives us an altered execution or
leads the app to display an error, which is good for developers, actually. It's a way
for them, given one execution, given one set of inputs, to test the app in a
more robust way by altering the input and seeing what happens.
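As an illustration of the map-shift idea, the sketch below applies a fixed offset to a recorded sequence of GPS fixes; the Fix type and the half-degree offset are hypothetical. The point is that the altered readings remain well-formed GPS data rather than random fuzz.

    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch only: the "map shift" flavor of semantic sensor-data alteration, applied to a
    // recorded sequence of GPS fixes. The Fix type and the offset are illustrative; the
    // altered readings are still well-formed GPS data, not malformed fuzz input.
    public class MapShift {
        public record Fix(long timestampMs, double lat, double lon, float speedMps) {}

        public static List<Fix> shift(List<Fix> trace, double dLat, double dLon) {
            return trace.stream()
                    .map(f -> new Fix(f.timestampMs(), f.lat() + dLat, f.lon() + dLon, f.speedMps()))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Fix> recorded = List.of(new Fix(0L, 33.97, -117.33, 1.5f));
            System.out.println(shift(recorded, 0.5, 0.5));  // same trace, moved half a degree
        }
    }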
So now I'm going to talk about the other challenge, how to provide comprehensive coverage for
apps. So the idea here is that a dynamic analysis really is critically dependent upon having a
good set of inputs and good coverage in our testing suite. And we tried to see if we can come up
with a way of automatically exploring apps and automatically generating inputs to achieve coverage.
So what we did, we tried to see, can we use actual executions from the field to drive coverage?
And to do so, we did a user study. So we recruited seven users, and we asked the users to play
with apps for five minutes, with a set of 28 apps. We recorded the interaction, and then we
measured the coverage, what percentage of the app was covered during this five-minute
interaction. And it turns out that there is not that much coverage, so only 30% of the app screens
were covered, and only 6% of the methods were covered. And one of the users here was a
so-called super user who explicitly tried to achieve high coverage, but they didn't go very far. So
why do we see such little coverage?
Well, it turns out that users might not choose to exercise all the features, or they don't necessarily
feel compelled to change the settings or to buy stuff or to share the results of their game on a
social network, or to click on ads, or to, say, change their user profiles. Hence, we are seeing
reduced coverage. I think this is important for the developer, because they can see what's left out
of the app and why is that not exercised. Yes, please.
>>: How are you estimating coverage and so on? Is this a different tool that's instrumenting
source code?
>> Iulian Neamtiu: Yes, yes. This is a different tool. So here, we're instrumenting -- actually,
we're instrumenting the Bytecode, and we look at how many screens or what percentage of the
screens and what percentage of the methods are actually exercised. So it's a different tool, yes.
The tool is called A3E, and here are the results in a nutshell. So we were able to do this for 25
popular apps, all of them with more than a million downloads, and you see that I am saying 25 apps.
In the initial user study, we had 28 apps, but three of those apps used native
code, and that creates an issue for us with measuring coverage or doing any kind of analysis.
The tool runs on actual phones, and after we apply our techniques, we increase the screen
coverage to about 62%, the method coverage to about 33%. We don't need the source code, but
we need access to the bytecode. We are able to do complex gesture reproduction. I'm going
to talk about that in a second, because the input to these apps is quite rich, and we have two
approaches for the exploration, a fast approach and one that's a little bit more thorough. And the
tool is open source. You can download it at that address, and it's been tried on apps such as
these. So let's see how we constructed that -- A3E stands for Automatic Android App Explorer. So if we look at a mobile version of Amazon, an interaction
might look like this. It just starts the app, then it clicks the search box to look for a product,
types in some characters, and in the end hits enter, and
Amazon displays a list of products. We see that all this interaction takes place
across different screens, called activities. So if we can find a way to extract these
screens and drive the execution automatically -- drive the
execution among screens and within screens automatically -- then we're golden, and we do just that. So we construct
something called an activity transition graph: using bytecode analysis -- taint analysis,
actually -- we extract a graph of the legal transitions among the app screens, and then, within each
of the app screens, we do a little bit of local exploration, and I'll tell you in a moment why we
have to do that dynamically. So we reconstruct this graph using static taint analysis, and then,
using dynamic exploration, we're going to explore the local elements in each screen. So this
might be like an unusual choice of an analysis, because we're trying to extract an app model here,
and we use taint tracking. But it turns out that taint tracking is very useful in this context. So
suppose I want to see whether it's possible to transition from screen A to screen B, and the way
we set up that analysis is to say, try to look at screen A as a source, screen B as a sink, and using
taint tracking we try to see if there is a path that goes from screen
A to screen B. And this path, it turns out, is made possible via something called an intent. So we
create an intent, and the intent sets up the destination screen, and then when we actually invoke
the intent, this transition is realized. So just to give you an idea as to how we mark the code
with sinks and sources, these are three examples. We can create
an intent in a straightforward way, and we mark it as a source -- this
has the source and the destination right there; this is the easy case. Or we can create the intent
and later on set B as the destination of this intent. And, finally, we can create the intent and then
set up the destination in a little bit more complicated way. So all of those are marked
as sources. Now, we mark the sink.
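In Android terms, the patterns look roughly like the sketch below: each way of creating an intent that targets screen B is treated as a source, and the startActivity call that realizes the transition is the sink. Activities A and B here are hypothetical placeholders, not code from any of the apps analyzed.

    import android.app.Activity;
    import android.content.Intent;
    import android.os.Bundle;

    // Sketch only: the intent-creation patterns treated as taint sources, with the
    // startActivity() call as the sink. Activities A and B are hypothetical placeholders.
    public class A extends Activity {
        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);

            // Source, easy case: the destination screen B is named at construction time.
            Intent direct = new Intent(this, B.class);

            // Source: the intent is created first, and B is set as the destination later.
            Intent later = new Intent();
            later.setClass(this, B.class);

            // Source: the destination is set up in a more roundabout way, here by class name.
            Intent byName = new Intent();
            byName.setClassName(getPackageName(), "com.example.B");

            // Sink: invoking the intent realizes the A -> B screen transition.
            startActivity(direct);
        }
    }

    class B extends Activity {}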
>>: [Indiscernible] special class --
>> Iulian Neamtiu: Yes, so actually, an intent is a way for apps to do intra- and inter-app
communication -- a structured way of doing IPC.
>>: These results are currently [indiscernible].
>> Iulian Neamtiu: Yes, because in the -- let's see. Can you be a little bit more specific? So
there's unsound, there's incomplete.
>>: [Indiscernible]. And things are tied together with XML and intents are resolved through
text streams that open up constants and so on and so forth. And so soundly approximating that
leads to a gigantic blow-up in terms of static analysis. So generally, when people try to do
that, they make some simplifying assumptions.
>> Iulian Neamtiu: That's right.
>>: Something specific.
>> Iulian Neamtiu: So what's the worst that can happen if we miss one such
transition? The automatic coverage that our tool provides
is not going to be 100%, which is a decision that I'm comfortable with. So once we set up the
sinks and the sources, we do this taint analysis -- we use a static bytecode analysis tool -- and
then we reconstruct a graph that looks like this. So that allows us to construct a high-level model
of the app screens. Why is dynamic exploration needed? Well, it turns out that if you look at the
screen of an app, depending on the device's form factor and depending on the device's
orientation, the layout is going to be different. So we do dynamic exploration to be able to find,
if you will -- to find what are all of the elements on the screen so I can exercise them. There's
also issues like sometimes there's ads that pop up, and good luck finding this with static analysis.
So the layout is highly dynamic, so we need a technique that's able to cope with that, and we use
this local dynamic exploration for that purpose. Now, let's see what's so unique about
>>: [Indiscernible].
>> Iulian Neamtiu: Let me roll to this slide. So it turns out that there are screens that are not -- if you just fire up the app, if you start the app, there are screens that you cannot get to. Why?
Because, for instance, this is one password for -- this is drop out. So it turns out that the only
way to invoke this screen is from another app to send an intent from another app to this and that
will open up the screen, just like, say, share on Facebook or so. That's a screen that's not
naturally accessible from the original app. So we need to discover all of these possible entry
points, and we use static analysis for that. So that's one problem, the screens that are involved
externally that we have to use static analysis for. Another problem is, if you look at the
interaction, just like we've seen with, say, Angry Birds or Yelp or so, the interaction is quite
complex, so we had to build a gesture library to be able to drive the execution in those situations
where simply retrieving the GUI elements and saying, hey, button, press yourself, was not
sufficient. And, finally, we had to deal with real-time input from the sensors. So, for instance, in
Shazam, you can't get very far in the exploration if you don't play a song in the background, so
our library of sensor input was quite diverse. And we introduced two techniques to do the
exploration. The first technique is called forced depth-first exploration, which performs
exploration in the manner that users would naturally interact with the app. So, for instance, from
the home screen I go to the search screen, product, cart, checkout, and then we press the back
button to come back to the other states. So this is just -- it's trying to emulate a user, but we do a
much more thorough job. So that's called a depth-first exploration path. And another way is to
do something called targeted exploration, where we go from one screen to another screen directly,
without having to go through GUI interaction to trigger these screen transitions. And here we have to be
careful and make sure that these screen transitions are actually legal, and that's the reason why
we don't go from home to checkout. That path is not possible.
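A very rough sketch of the two strategies over the activity transition graph follows; the screen names are made up and the print statements stand in for whatever actually drives the app, so this is an illustration of the idea rather than A3E's implementation.

    import java.util.*;

    // Sketch only: the activity transition graph plus the two exploration strategies over it.
    // Screen names are invented, and the println calls stand in for actually driving the app;
    // the real exploration runs against the live app on the phone.
    public class ExplorerSketch {
        final Map<String, List<String>> edges = new HashMap<>();  // legal screen transitions
        final Set<String> screens = new LinkedHashSet<>();

        void addEdge(String from, String to) {
            edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
            screens.add(from);
            screens.add(to);
        }

        // Forced depth-first exploration: follow transitions the way a user would,
        // pressing "back" to return to earlier screens.
        void depthFirst(String screen, Set<String> visited) {
            if (!visited.add(screen)) return;
            System.out.println("explore elements on " + screen);
            for (String next : edges.getOrDefault(screen, List.of())) {
                System.out.println("navigate " + screen + " -> " + next);
                depthFirst(next, visited);
                System.out.println("back to " + screen);
            }
        }

        // Targeted exploration: jump straight to each screen the static analysis says is
        // reachable, without walking there through the GUI.
        void targeted() {
            for (String screen : screens) {
                System.out.println("launch " + screen + " directly, then explore its elements");
            }
        }

        public static void main(String[] args) {
            ExplorerSketch e = new ExplorerSketch();
            e.addEdge("Home", "Search");
            e.addEdge("Search", "Product");
            e.addEdge("Product", "Cart");
            e.addEdge("Cart", "Checkout");
            e.depthFirst("Home", new HashSet<>());
            e.targeted();
        }
    }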
So what are the tradeoffs here? Of course, targeted exploration is going to be fast, but it's not
going to be as thorough. Depth-first exploration is going to take more time. And how does this
work in practice? So we installed the app on the phone. We have an automatic explorer that
runs on a PC, and then we perform the actual exploration while the app runs on the phone, and
the result obviously is going to be a trace that's replayable with Reran later on. So that's depth-first exploration, very simple. For targeted exploration, where we use static analysis, there's an
intermediate step here where we run the analysis, we construct a model of the application, and
then we take this model into account before exploring the app, so that we know
exactly how to transition among the screens. Yes.
>>: So I understand how you're using taint analysis to construct this graph, but how does taint
analysis reveal to you what the actions are that let you move around the nodes?
>> Iulian Neamtiu: That's a good point. There's a way to actually fire up a screen without going
through an action, so once you're in a screen, you can say I want to go to the next screen, and then you
can do so.
>>: You start up that screen in --
>> Iulian Neamtiu: Like in a default state, yes. So this is a problem, and that's why we have to
be careful with targeted exploration -- for instance, when there's state being passed around, say
a credit card number or something like that, then we cannot just transition among
screens wildly. So just to give you an idea of how effective this is, I contend that the approach is
quite effective, given that we've run this on wildly popular apps, so we measure the number of
screens that are covered using our approach, and it turns out that we cover more than 59%. This
is the average number of screens that we now cover, and we also have the method coverage, and
that's 29% to 36%. These are average numbers for the 25 real-world apps that we used. Yes,
please.
>>: Is there a time bound on how long this algorithm was run.
>> Iulian Neamtiu: Yes, good point. And, no, we didn't have any time bound. So you're
anticipating my next slide, so my next slide talks about how efficient this approach is. It turns
out that, depending on the app exploration, it might take more or less -- we had no bound, so
these are the minimum runtimes. The fastest static graph construction was for CNN, and in terms
of runtime exploration, it took us 18 minutes to explore the BBC News app. That's for targeted
exploration. This is just shallow exploration, just switching among the screens for apps that are
amenable to that. And for depth exploration, the fastest we could explore one app was Tiny
Flashlight. It's not very feature rich, and that took 39 minutes. And the most, the longest, was
Shazam, for targeted exploration, 236 minutes. That's close to four hours. And using depth-first
exploration, that took pretty much the same time. But on average, targeted takes about an hour
and 20 minutes. Depth-first takes about an hour and 40 minutes, which we believe is reasonably
efficient, because, say, if the explorers want to run this, they can just run it and in a couple of
hours, or at most four hours, you have your systematic exploration. Yes, please.
>>: [Indiscernible].
>> Iulian Neamtiu: Next slide. So we used this -- yes, so what can you use this systematic
exploration for? So we used systematic exploration, coupled with something called model-based
runtime verification. So we have a dynamic analysis, but the dynamic analysis is just an
infrastructure. This is just the infrastructure. Now, let's run some applications on top of it. So in
this case, we constructed manually some models of what an app should behave like, and we tried
to check at runtime whether the models are obeyed or if the app goes through the legal set of
transitions. So let's assume that this is the app state machine, and this transition is not legal. If
we see that transition at runtime, we found a bug. So we had a paper in ASD in 2011, and we
found 21 bugs in open-source apps using this verification technique. So that was one application of
dynamic analysis. Now, I'm going to talk about another application, and that's on profiling. So
what we do here, we take the app. The app's running on the phone with Reran capability, and
then we instrument the platform to collect data at three different layers: the sensors and
user interaction, the system calls, and the network traffic. And this will have all sorts of benefits,
and I'm going to summarize those benefits in a second. We ran this on 27 apps across different
categories, and what we focused on here was, first of all, trying to understand, trying to get some
behavioral snapshot of apps and also how free and paid apps differ. So what we did, for
instance, for Shazam and Angry Birds, we looked at both the free version and the paid version
and tried to see what are the differences in the capsule profile of the app. For each app, we did
30 runs: we had three users, and for each user one recording run and nine replay runs. We recorded the one run, and then
we replayed the nine runs at different times of day to see if there was any variation, and we
didn't find much variation. So the first findings were obtained using tcpdump, so we looked at
the network traffic, we measured the TCP -- we inspected the traffic headers and contents, and
we tried to see how much of the traffic is with the app's website, how much is with the cloud,
with the content distribution network, and how much is ads and tracking. So without further
ado, these are the results for some of the apps. So what we have here, we have six apps, and
what I'm going to show, I'm going to present the download traffic, the upload traffic, and then
what percentage of the traffic was with the origin, so the origin is the app's site. So Tiny
Flashlight, if it has a developer named [foo.com], how much of the traffic was [foo.com], how
much was CDN and cloud -- for instance, Azure or Amazon AWS, things like that? How much
is third party, ads tracking, Google? And within the traffic, what's the split between HTTP and
HTTPS? So let's take them in turn. Tiny Flashlight is, as the name implies, a tiny app to turn on
the flashlight. It turns out that you need Internet access to be able to use the flashlight, and if
we look at the traffic, 99% of the traffic was with third parties, and all of it was unencrypted. It
was over HTTP. Now, let's look at the differences between a free and a paid version of an app.
So Advanced Task Killer, the free version has some traffic. It's all third-party traffic, mostly
unencrypted, whereas the paid version has no traffic whatsoever. So that's what shelling out
$0.99 or so buys you. Facebook and Amazon, so they have heavy traffic, as expected. What
we've discovered is that not all the traffic is encrypted for Facebook, so about a quarter of the
traffic was encrypted -- sorry, a quarter of the traffic was in the clear HTTP, and the rest is
HTTPS. And, finally, an app that measures your heart rate, it measures your pulse, it turns out
that if you get the free version of the app, it has pretty serious download and upload --
>>: [Indiscernible].
>> Iulian Neamtiu: You press your finger against the flash and the camera, and then
it -- I guess it sees how the tip of your finger expands and contracts, and that's your pulse. So it
turns out that, again, marked differences between the free version and the paid version of the app
-- 91% to 96% of the traffic goes to third parties, and this is something I'm uncomfortable with,
so for the free app, lots of traffic actually goes unencrypted, so who knows who's reading your
heart rate, whereas if you pay for the app, then the split is kind of reversed -- 80% of the traffic is
over HTTPS and the rest is over HTTP.
>>: Browsing or is there a purchase actually here? I'm surprised at the 1% unencrypted.
>> Iulian Neamtiu: There was no purchase. These were just browsing.
>>: If it's also already going to destinations unknown, why do you care if it's HTTPS at that
point? It seems like men in the middle and the destination are roughly in the same camp.
>> Iulian Neamtiu: Sure. I agree. Actually, I'm going to talk about that in a second, that lots of
this traffic goes to ill-reputed websites. And whether it goes encrypted or not, who cares? The
only problem is -- I don't feel comfortable if someone is tampering with the flow here. So HTTP
is subject to tampering -- it can be read, written, man-in-the-middled -- so that makes me
uncomfortable. So I have discussed that. And what we did with these apps, actually, we used
clustering to bin each app's usage into one of three clusters. So we have a
high-usage cluster, medium usage and low usage. So if an app has a lot of network traffic, it's
going to end up in the high cluster. If an app has very little traffic, it's going to end up in the low
cluster. And we did this analysis at three different layers, as we said before: at the OS layer we
measure the intensity in syscalls per second, at the network layer in bytes per
second, and at the user-interaction layer in events per second. And this kind of application thumbnail can be used to
give you an idea as to behavior differences between apps. So, for instance, this is
Dictionary.com, and we see that if you are purchasing the app, you're not going to see that much
physical activity, which is good news, I guess, for your battery. This is Dolphin, so Dolphin has
high network activity in the free version and medium network activity in the paid version, and so
this can be used to compare apps. So I discuss that. In terms of reading between the lines, what
have we found? Well, free apps are not as free as we think. We measured the number of
system calls per second. It turns out that the free apps have 50% to 100% higher system call
intensity than the paid apps, and they have dramatically higher network traffic due to ads and
tracking, and this is bad for your battery and bad for your data plan and bad for your privacy.
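Coming back to the clustering mentioned above, a toy version of binning an app's per-layer intensity into a low/medium/high thumbnail might look like the sketch below; the cutoff values are invented for illustration, whereas the actual clusters were derived from the measured apps.

    import java.util.Map;

    // Sketch only: binning an app's measured intensity at each layer into a low/medium/high
    // "thumbnail". The cutoff values are invented; the real clusters were derived from the
    // measurements across apps, not from fixed thresholds.
    public class UsageThumbnail {
        enum Level { LOW, MEDIUM, HIGH }

        static Level bin(double value, double lowCut, double highCut) {
            if (value < lowCut) return Level.LOW;
            if (value < highCut) return Level.MEDIUM;
            return Level.HIGH;
        }

        static Map<String, Level> thumbnail(double syscallsPerSec, double netBytesPerSec, double sensorEventsPerSec) {
            return Map.of(
                "OS", bin(syscallsPerSec, 50, 500),
                "Network", bin(netBytesPerSec, 1_000, 100_000),
                "Sensors", bin(sensorEventsPerSec, 5, 50));
        }

        public static void main(String[] args) {
            // e.g. a free app with heavy ad traffic but little sensor use
            System.out.println(thumbnail(120, 250_000, 3));
        }
    }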
And also, coming back to this idea, Ben asked where does the traffic go? It turns out that the
free apps send traffic to more domains, so Dolphin Free sends its traffic to 22 domains, the paid
version to just nine domains, the same thing for Angry Birds. Oh, yes, please.
>>: I guess a lot depends on whether this is specific to Android, or whether iOS and Windows are like this as well?
>> Iulian Neamtiu: So this is all Android. We don't have -- we did work on -- well, we did
another measurement study on other platforms -- Windows, iOS, BlackBerry -- and it turns out
that there's not that much difference between Windows, iOS and BlackBerry,
but there is a difference between those operating systems and Android, in the sense that we see
higher traffic in Android. But we didn't control for apps, so that was geared -- we only looked at
the platform and the traffic. It could be that Android was used by some users who are already,
say, prone to high traffic. Maybe they were watching video or something like that. I can't really
tell whether it's the platform or whether it's the user, because that's a confounding factor.
>>: To comment on that, the way I think about it is that basically Android is the Windows 95 of
smartphone operating systems, so it's tremendously popular and also tremendously prone to
various problems, and these are just some of them.
>>: And has looser policies.
>>: Yes, and that's more [indiscernible].
>> Iulian Neamtiu: And just to wrap up this study, another piece of work, we only looked at the
URLs that are accessed by the app, so we have a system called Aura, and the idea here is that if
you have an app that collects all this kind of sensitive information, do you know what kind of
websites is the app sending the information to? And I don't have time to go over the details of
the study, but suffice it to say that we looked at the URLs used by 13,000 apps, a quarter million
URLs, and it turns out that half of these -- sorry. Of the top 20 most popular URLs, 10, so half
of the top 20, are ad servers, and we also found that good apps, so apps that are not malicious,
per se, talked to so-called bad websites as marked by sources such as Web of Trust or
VirusTotal, so it turns out that 66% of the apps talk to domains with poor reputation. Seventy-four percent of the apps talk to at least one domain
that's unsuitable for children, and 50% of the apps, and these are the supposedly good apps, talk
to blacklisted domains, as identified by Web of Trust and VirusTotal. So those are domains
whose only raison d'être is to spread malware, to do phishing scams. So in the time remaining,
I'm just going to talk about -- I guess I'm going to skip this part. I'm going to discuss it
individually when we meet one on one in the afternoon. So let me just conclude. I presented an
approach that facilitates a whole host of dynamic analyses on smartphones. At the core of our
approach is a record and replay scheme, and then I've demonstrated how to do profiling and how
to do automatic exploration, and do so while having reproducible results and apps that run on
real phones. And you can go to this website to download the tools or to read the papers, and I'd
be happy to discuss details with you. This concludes my talk. I'd be happy to take any
questions.
>>: [Indiscernible]. So runtime analysis certainly is a [indiscernible] when it comes to
analyzing benign applications, but a malicious application can use a variety of techniques to
avoid being analyzed, you see. So I assume you have not actually observed that in the wild, but
have you tried analyzing at least obfuscated, highly obfuscated apps to see how that goes? Have
you tried to play with that?
>> Iulian Neamtiu: No, so in this study, actually, we also looked at malicious apps, about 1,000
malicious apps in this study. And it turns out that there is no sophistication whatsoever. So, for
instance, if the app collects your GPS reading, it sends it in the clear, and in the HTTP request,
you see longitude and latitude. And the --
>>: Might they actually do this without your observing?
>> Iulian Neamtiu: Oh, absolutely. What I'm saying is the level of sophistication we observed
was not very high. I'm not saying that there are not sophisticated apps that we have not
observed, but even for the ones that we were able to see, we were surprised at how just basic the
technique was. So no obfuscation whatsoever, just leak whatever personal information in the
field, yes.
>>: These were tagged as being malicious?
>> Iulian Neamtiu: Oh, yes, we knew they were malicious.
>>: North Carolina subset?
>> Iulian Neamtiu: No, this is from Purdue. We got this set from Purdue, Christina and her
students.
>>: Thank you.
>> Iulian Neamtiu: All right, thanks.