>> Madan Musuvathi: Hi, everyone. I am Madan Musuvathi from the Research and Software Engineering Group, and it's my pleasure to invite Iulian here today. Iulian, welcome back. He was an intern a very, very long time ago with Charles and myself, working on the Chess Project, when the project was just beginning. So he was one of those people who had to put up a lot with -- what do you say? Newly forming research, not knowing where we are going and dealing with lots of Microsoft infrastructure. It was great. It was a great experience. So since then, he's been very successful. He's an assistant professor at UC Riverside, and his main focus is on programming languages and software engineering, and he's been focusing a lot on smartphones these days, so that's what he's going to be talking about today. Iulian? >> Iulian Neamtiu: Thanks, Madan. So I'm going to present some work on dynamic analysis, how to enable dynamic analysis and what to apply them to, and this is joint work with my students and my collaborators. So smartphones and smartphone applications have been exploding in popularity recently. More and more tasks that used to be performed on servers or desktop computers now run on smartphones, so that means we need to shift our focus whenever we do, say, security or performance or verification for applications -- to shift our focus to the mobile platform. And here are some of the popular apps that I found at the Windows Smartphone App Store, and if we look at -- at the outset, we have right there, it's quite a broad range, and they all come with some particular concerns. And, what I'm going to be talking about is how to develop and the kind of infrastructure that we've developed to allow us to study some of these concerns. So one of your concerns might be, let's see, does this financial application, Chase Mobile, actually implement the transaction scenarios when I do a financial transaction? It better do so. Another app that you might look at is say, if you have babies, there is a baby monitor app, and now the concern is, okay, the parents are monitoring the baby, but who else is monitoring the baby, and what kind of information this app collects and where does it send it to? Another particular and pressing concern, actually, for mobile is energy consumption. So, for instance, if you run a game and it has heavy usage of graphics and 3D and things like that, you would like to be able to know which parts of our application are eating up a lot of power and what can you do to optimize that? And it turns out that there is a very powerful tool for enabling this kind of analysis. It's called dynamic analysis, and we can run a whole bunch of analysis applications on it from debugging to verification to profiling. So how does dynamic analysis work? We run the app on a phone, and then we have some collection infrastructure, and what's going to happen, we have some inputs to the app. So this is the app running straight on the phone, as it's supposed to. This is the thread of execution, and then we have some inputs to the app, and then we take a magnifying glass, and this magnifying glass can be a whole bunch of analysis, and we try to check whether the app actually conforms to the property that we're interested in. So, for instance, for this mobile banking app, we might want to define some app invariants, like this transaction semantics and then use dynamic analysis to see if those invariants are being respected. What about the baby monitor? 
For this, we might run a dynamic analysis that's called information flow tracking, so we collect the kind of data that the application reads, and then we also look at the data that the application writes or sends over the network, and we tried to see if this conforms to a certain security policy. For instance, the network flow shouldn't escape the local network, or something of that sort. And for an app such as a game, we can run an energy profiler, which is another kind of dynamic analysis, to measure energy usage in various parts of the app. Okay, so this all sounds nice, but building infrastructure for dynamic analysis, especially for smartphone apps, is not so simple. And what I'm going to do, I'm going to talk about a set of challenges that basically make our task a little bit more complicated. So anyone who has worked with dynamic analysis before knows that there's two main components for a successful analysis. First of all, you need high-quality input, and then you need good coverage. So you need a good set of inputs to drive the application through the set of relevant states, so that you do this dynamic analysis on relevant states. So this concerns good inputs and then comprehensive or systematic coverage or common denominator across all dynamic analysis. However, on smartphones, things are a little bit more complicated, because as we're going to see in a second, on smartphones, the input can come from various sources, can be very complex, and it can take many forms. Another challenge is source code availability. I think it's unrealistic to expect that the apps that I put out there are going to come with source code attached, so we need to come up with an analysis that allows us to run without having access to the source code. Another challenge is running on real phones, and you're going to see in a minute why that's the case, because whenever we run on emulators, the full gamut of input is not available, so we have to mock, say, GPS or the microphone or the camera, and that severely restricts the set of observable states that you can drive the application to. And, finally, there's this -- because of the wide array of sensors on mobile, it's really hard to reproduce execution. So I'd like to be able to construct a system that allows me to even reproduce an execution. I'm not asking for more, and even that is challenging. Okay, so what I'm going to do, I'm going to focus on three main lines of work. The foundation is a record and replay system for smartphone apps. That appeared in ICSE this year, and then building on that system, so that's the foundation, if you want, of dynamic analysis, because it helps us deal with many of these challenges, and then building on that foundation, we constructed a way to do systematic exploration of apps. In the left-hand corner there, that's a paper that was published in OOPSLA this year, and then to show the applicability of this dynamic analysis, I'm going to talk about some profiling, mostly focused on network profiling, and that has appeared in MobiCom last year. So let's focus on the record and replay first. So the kinds of challenges that we addressed in that work were dealing with complex sensor input, dealing with the lack of source code, running on real phones and reproducibility. So if you look at the problem formulation, what we would like to be able to do is to record an input and be able to replay later. 
That's all we're asking for, and there's been work in this area on smartphones, and there's been a lot of work on other platforms, as Madan has mentioned, the kind of record and replay that we did for Win 32. So past approaches used something called a keyword-action paradigm. So how do we do record and replay there? We look at the application as having a set of fields, and then we do keyword matching on these fields, and whenever we encounter a certain keyword, we perform an action. So suppose we want to use record and replay for a very simple application that just has some textual input here. This is an example for setting up your e-mail account. So what a record and replay scheme would entail would be, first, you need to find the field that's called give this account a name. Once you found this field, execute an action, and the action is to type in jane@coolexample.com. Then, the next keyword-action pair should be find the next field that's called your name and type Jane Doe. And then, we do a button kind of action, so we find the button that's called Done, and the action is click. And these keyword actions are usually saved in a script. So, yes, please, Charles. Yes. >>: Does a user do it? I mean, you can simulate this clicking and the typing business using a computer script? How does the user actually click it? >> Iulian Neamtiu: So there's a capture tool. There's a capture tool that will record a script, so whenever -- yes. Whenever the user actually moves the focus around different fields, then the script will retrieve the name of the field, and the input that the user has typed, and then they can replay the script later on. So this is saving to a script. >>: [Indiscernible]. I'm sorry, fill-ins of people or ->> Iulian Neamtiu: Right, so that's a good question. So a field is something that the approach should hang onto. So it's something that is available via kind of a structured query language, so I can ask the state -- I can ask the app, give me all the components of the GUI right now, and it gives me a list of components, and one of the components is this field. So field simply stands for something that accepts input, like a text box where you can input text. So field is accepting inputs from the user via this textual mechanism. This can also be buttons or maybe selecting an item in a menu. >>: [Indiscernible]. >> Iulian Neamtiu: Yes, I'll get to that in a minute. That's where the scheme falls apart, actually. So this is prior work. This is what's been tried before, and I'll show you why it doesn't work. So without further ado, let's see why it doesn't work. Let's look at this app. It's called Angry Birds. Probably most of you are familiar with it, and this is a view on the kryptonite of record and replay, because there is no keyword action that can help you deal with Angry Birds and being able to record and replay Angry Birds. So why is that inappropriate? Well, consider this action. We are pretty much in the physical world here, so we have a very wide range of motions and gestures, so you can drag the slingshot, you can spin the bird, you can throw the bird into the ground. There's a whole bunch of actions here. So the question is, if I were to design a general-purpose record and replay tool, how would I design that? And the problem is that if you use this keyword-action pattern, then keyword action actually relies on a certain semantics, so as you've asked, what's a field? Well, that's a very good question. What's a field here? I don't know. 
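Backing up for a moment to the e-mail setup example: here is a minimal sketch of what such a recorded keyword-action script and its replay loop might look like. This is purely illustrative; the field labels come from the slide, but the `ui.find`, `set_text` and `click` helpers are hypothetical stand-ins for whatever structured GUI-query interface a capture tool would expose, not the API of any particular tool.

```python
# Hypothetical keyword-action script for the e-mail setup example above.
# Each entry pairs a keyword (a labeled GUI field) with an action to perform.
SCRIPT = [
    ("field",  "Give this account a name", "type",  "jane@coolexample.com"),
    ("field",  "Your name",                "type",  "Jane Doe"),
    ("button", "Done",                     "click", None),
]

def replay(ui, script=SCRIPT):
    """Replay a keyword-action script against a GUI abstraction `ui` (assumed)."""
    for kind, keyword, action, value in script:
        widget = ui.find(kind, keyword)        # keyword step: locate the widget by label
        if widget is None:
            raise RuntimeError(f"no {kind} labeled {keyword!r} on this screen")
        if action == "type":                   # action step: act on the widget
            widget.set_text(value)
        elif action == "click":
            widget.click()
        else:
            raise ValueError(f"unknown action {action!r}")
```

The scheme presupposes that every interaction can be phrased as "find a labeled widget, then act on it." So why exactly does it fall apart for an app like Angry Birds?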
Because if you look at the actions that you need to record and the actions that the user actually can perform, and you try to abstract from that and construct the language, the language is going to be very app specific, very app specific. So, for instance, in this case, maybe you want to introduce constructs for finding the bird in the slingshot. Maybe that's one of the keywords. What about this? Once you find the bird, drag back the bird one inch, 210 degrees. And even worse, you can have this kind of very complex gesture. So, for instance, you can spin the bird four times and then release it, and I don't know how to capture that in a language. That would be circular bird swipe with increasing slingshot tension. It's very complicated to come up with a general language for this. So not only are we going to have trouble capturing and expressing the gestures, also this is going to be very app specific, because this is only going to work for Angry Birds. Let me give you another example. There's an app called Magic Eight Ball. So Magic Eight Ball, extremely useful app. It can answer a whole range of difficult questions, so, for instance, if you want to know just before you leave home if there's going to be traffic on the 520, you can ask, is there traffic ->>: To come up with a question that there's more than one answer. >> Iulian Neamtiu: So the way this Magic Eight Ball works is you write down the question, and then you shake the device, so you shake the ball, and voila, here comes your answer in terms of that. In this case, the answer was yes, because I [indiscernible]. I don't know, we haven't looked at the full range of behavior for this. But the point is, again, this keyword action falls apart, because we need a language that should be able to express something like, how do you say in the keyword action shake the phone? So basically, you want to get into modeling the physics, and so you quickly move the device around the X, Y and Z axis. Now, the problem here, just like for the previous app, we had gesture input from the touchscreen. Here, we have sensor input from something called the accelerometer, and there's no end to this. If you look at all the popular apps, they have a very wide gamut of sensors, so lots of apps depend on the geographical location, so they have a sensor for GPS. There are apps that take pictures or that recognize pictures. When it's a barcode scanner, the barcode scanner has a frame buffer, and then you need to express, record and replay that, the same thing with the microphone, where since there are apps like Shazam, where you can record input, and then the app does some recognition there. And it turns out that, if you want your approach to work for popular apps, you have to deal with all this input. So we did a study on popular apps, like the top 120 or so, and we measured the number of gestures that are encountered in a five-minute run of the app, and I'm plotting that number of gestures here. On the X axis, we have the number of gestures during the run, and on the Y axis, I have the frequency. So, as we can see, pretty much all the apps use gestures, and some of the apps use a high number of gestures, more than 100 gestures in a five-minute run, so our approach had better be able to deal with this. And this is just gestures. I'm not talking about GPS. This is just touchscreen gestures. 
So what we do, we construct an approach that's called Reran that will actually tap into the physical input sensors -- into all the sources -- at the kernel level, capture the events, and save them in an event trace, and that's going to be replayed later. And we deal with the entire gamut of physical sensors on the phone, from GPS to camera to microphone, compass, accelerometer, keyboard, touchscreen and even the light sensor. So the key insight of our work is to say keyword-action is too high level for us, because it assumes too much structure, and we're dealing with very low-level input here. So what we're doing, actually, we're lowering the level of abstraction quite a bit, and instead of using the high-level structure based on keyword action, we capture and replay low-level sensor input. Yes, please. >>: [Indiscernible] the kernel and the physical device. >> Iulian Neamtiu: Yes, that's right. So, actually, the kernel -- so the kernel, we tap into the kernel's input and output streams. >>: [Indiscernible] the apps. >> Iulian Neamtiu: We also do that. So, for instance, for GPS, you have to do that, because GPS is not exposed in the kernel for various reasons, and for GPS, we have to intercept the API. Yes, thank you. >>: [Indiscernible] the API for every sensor. >> Iulian Neamtiu: That's a good point. So some of the sensors -- the touchscreen, light sensor, compass and accelerometer -- do not have a very clean way of intercepting the API. Rather, they post a reading in something like a buffer, and then the application has to read from that buffer. So intercepting at that level is going to be very costly. GPS is better, because we don't have a million -- we don't have, say, 1,000 readings a second, whereas for the touchscreen -- so the touchscreen, for this gesture that I have just shown, where you do a circular swipe, that's like 400 events in one second. So if you were to intercept the API, the approach is not going to be timely enough to allow you to do replay. So that's point number one. Point number two, Angry Birds uses a lot of native code, so the API is completely shunted there, so they access the sensors in a raw manner, and then if you do interception at the API level, you've lost, because you're shut out of the equation. >>: [Indiscernible]. >> Iulian Neamtiu: They go to the kernel, but they don't go to the API. So when you say API, you mean the virtual machine or something that's even lower than that. Yes, we tap into the kernel, but the -- yes. >>: [Indiscernible] is there an API call that says someone shook the phone? >> Iulian Neamtiu: Oh, no, not at all. >>: Or a whole bunch of acceleration readings and then I have to figure that out. >> Iulian Neamtiu: Exactly, exactly, yes. Yes, please. >>: With this, the previous paradigm that you had about the -- >> Iulian Neamtiu: Keyword action. >>: Keyword action thing, right. That's at a sufficiently high-level abstraction that it's perhaps robust to small changes in the app. If you're capturing touchscreen input and then keyboard input, and the browser renders my page slightly differently, that touchscreen input may no longer actually select a field. >> Iulian Neamtiu: Sure, sure. So I'm going to come to that in a moment. That's one of the limitations of our approach. That is a problem if your phone is not stationary when you apply this. So if you -- why don't I just go to the next slide or two, and then that's going to come up. So at a high level, how Reran works is very natural. So this works on the actual phone. 
We tap into the kernel in a nonintrusive manner. We record the input events into an event trace, and then we do some processing on the event trace to separate the wheat from the chaff. Some of the events are simply not worth replaying, and also the user might not want to replay everything. And then this gets sent to a replay agent that works on a PC, like a laptop, and the replay agent is going to process the trace, copy the trace onto the phone, and then everything is done. And then on the phone we do this event injection. The only reason why we have this replay agent running offline is because we want to run some processing on the trace, but there is no problem in capturing the trace on the phone and replaying it on the phone. Unless you want to do some complicated processing, like we want, there's no reason to have a separate system like a PC in the equation. Everything can run on the phone. Yes, please. >>: Will energy be a piece of that? >> Iulian Neamtiu: Sorry. >>: Will energy consumption of the analysis of the trace be a concern there? >> Iulian Neamtiu: That's right. So not only that, but also perhaps storage, because you don't have that much storage. Yes. That's a good point. Thank you. So just to give you a flavor as to the challenges we experience when we have to build such a system, let's look at this app called Gas Buddy. It helps you find cheap gas in your neighborhood, and what I'm plotting here, on the X axis, I'm plotting time, so this is an 80-second run of Gas Buddy, and on the Y axis, I'm plotting the number of events that come from the sensors. And for simplicity, I'm not going to plot all the sensor input, only two: input from the compass and input from the touchscreen. So, for instance, what the user might decide to do is to type in the location, say, Los Angeles, or if you use GPS, simply say locate me. And then, we see that -- so these are the touchscreen events, so the user had to use the touchscreen, and then there's quiet on the touchscreen front while the app loads the station map, but once the station map is loaded, the app needs to find the user's position and orientation, so there is a stream of compass events. Then the user -- after the map is loaded, the user might want to navigate with pinch and zoom. That's why we have a lot of touchscreen events -- we can see that in 20 seconds we have more than 1,000 events there. And then the app is exercised as usual. So there are several points that come across here. First of all is how to deal with concurrent events, so the approach has to be able to record and deliver concurrent events. In this case, we see that we have both the touchscreen and the compass, and there's also the GPS in there, or perhaps the accelerometer, and so that's one challenge that we had to overcome. The other challenge is mixing the incoming events with replayed events. So this becomes apparent in the context when you're doing something like locate me, and then you recorded a trace saying Los Angeles and you're replaying the trace in Seattle, because then you have to reconcile what the app is reading with what's saved in the trace, and we have certain policies for dealing with that. For instance, in the case of GPS, we simply discard the current readings and we only feed the GPS readings from the original trace, not from the incoming stream. Also, timing accuracy is critical, and just to give you an idea as to why it's critical, let's look at these two streams. 
So this is the same number of events, roughly, but you have to be very careful with the timing. So if -- suppose I want to replay 400 events, and a swipe -- if it's a complex swipe, it's about that many. If I have a very small interruption in the middle, so I deliver, say, 100 events and then I pause very little, and then I deliver one event, I pause a little and I deliver the rest of the events, this is not going to appear as a continuous smooth swipe anymore. It's going to appear as one swipe, one press, another swipe, and that will kill replay. Timing accuracy is critical, and it turns out that we have to replay with sub-millisecond accuracy, something on the order of 10 to 100 microseconds, to be able to replay gestures smoothly. And another concern is, for instance, when you have very high throughput input from the frame buffer, when you're replaying, say, the barcode scanner. The barcode scanner does analysis on the live image from the camera and tries to find the barcode in there. So there, it's also critical to record and replay with high accuracy. Yes. >>: [Indiscernible]. >> Iulian Neamtiu: Oh, yes. That's a good point. So the apps are not deterministic by design -- however, there is a lot of determinism in practice. It turns out that thread nondeterminism is not really an issue for smartphone apps, which in a sense saves the day here. And I have some schedules that I can show to you offline. But basically, what happens is there's a main application thread, and the job of that thread is to process whatever GUI updates the other threads have posted, and lots of this processing is done via something called an async task. Depending on the platform, you have different names for it, but it's pretty much the same thing. So, for instance, if you have network input that you know is going to take a long time, like downloading a file or something like that, you put it in a background task. However, that background task is not going to bother your app until it's completed. So it turns out that thread nondeterminism is not an issue for the vast majority of apps. There are apps -- for instance, SoundHound -- where that's been a problem: basically, you start it 40 different times and you see 40 different schedules for the threads. >>: [Indiscernible] this is Android, correct? >> Iulian Neamtiu: This is Android, yes. So as for the results of running our approach: we were able to demonstrate it on 110 popular apps. So what we did, there's about 20 categories -- fitness, lifestyle, health, games -- so we took about 20 apps in each category, and we selected apps from there. In total, there's about 110 apps that we were able to replay, and the trace is phone specific, because we depend on the physical coordinates, obviously, but the approach is not. You can run Reran on another phone, and you're going to get the trace for that phone. And the runtime overhead is less than 2%. There are limitations to this approach, as has been alluded to earlier. So, for instance, if we have network nondeterminism -- there are apps that load content dynamically, say the layout of a webpage, and that changes from run to run -- then we're going to have a problem. So if there's a dynamic layout or network nondeterminism, then this approach might not work. And, in a sense, we could solve that by simply recording the network input, but that's overkill in my view. So there's one more thing about live content in this whole record and replay scheme: sometimes, live content can mess up the approach, but we made a decision not to record and replay the network input. 
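Before moving on to the results and applications, here is a minimal sketch, in Python, of the kind of low-level record and replay loop described above. It is not the actual Reran implementation: it assumes a rooted device where a /dev/input/event* node is readable and writable, a 64-bit struct input_event layout, and it relies on time.sleep, which by itself cannot hit the 10-to-100-microsecond injection accuracy mentioned earlier (Reran injects events natively on the phone).

```python
# Sketch only: record raw kernel input events and re-inject them with the
# original inter-event timing. Assumes a rooted Linux/Android device, a 64-bit
# struct input_event layout (two longs, two shorts, one int), and permission
# to read/write the /dev/input/event* node -- none of this is RERAN itself.
import struct
import time

EVENT_FMT = "llHHi"                      # sec, usec, type, code, value
EVENT_SIZE = struct.calcsize(EVENT_FMT)

def record(dev_path, trace_path, duration_s=10.0):
    """Copy raw input events from the device node into a trace file."""
    deadline = time.time() + duration_s
    with open(dev_path, "rb") as dev, open(trace_path, "wb") as trace:
        while time.time() < deadline:
            trace.write(dev.read(EVENT_SIZE))     # one input_event at a time

def replay(dev_path, trace_path):
    """Re-inject recorded events, preserving the gaps between them."""
    with open(trace_path, "rb") as trace:
        raw = trace.read()
    prev_ts = None
    with open(dev_path, "wb", buffering=0) as dev:
        for off in range(0, len(raw) - EVENT_SIZE + 1, EVENT_SIZE):
            ev = raw[off:off + EVENT_SIZE]
            sec, usec, _etype, _code, _value = struct.unpack(EVENT_FMT, ev)
            ts = sec + usec / 1e6
            if prev_ts is not None:
                # Timing matters: pausing in the middle of the ~400 events of a
                # circular swipe turns one smooth gesture into separate touches.
                # A real injector needs sub-millisecond accuracy here.
                time.sleep(max(0.0, ts - prev_ts))
            prev_ts = ts
            dev.write(ev)
```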
The apps we evaluated are spread across different categories, and we have demos of how we replay Angry Birds and other apps; you can find those on YouTube. So let me present three applications of this technique. First, I'm going to talk about how we can use record and replay to reproduce bugs in popular applications. Then, I'm going to discuss something called time warping, and then semantic input alteration. So, yes, please. Go ahead. >>: This was in the beginning of your talk, so this replay is for developers or app store owners? Or who is the target? >> Iulian Neamtiu: So it's actually all of the above, because developers can record an interaction, and a user, for instance, can record a crashing interaction and send a trace back to the developer. >>: I asked because you're making a very important assumption that you don't have the source code. >> Iulian Neamtiu: Yes. >>: So that forces you to do things at a much lower level than you would have had to, and also things like shared memory concurrency within the app. Because you don't have the source code, it's difficult for you to capture that at a high level. If this were targeted at the developers, you could say, well, you have the source code. >> Iulian Neamtiu: Sure. >>: So it's because you want to address everybody that you're doing that at such a low level. >> Iulian Neamtiu: I think, in a sense, we want to be able to record and replay a wide range of apps, and we don't want to -- we want to make as few assumptions as possible. As you said, if you have the source code, then other possibilities become realities here. But in this case, using a little bit more abstraction is not necessarily a good thing, because then you have to tailor the record and replay scheme to specific apps, and I don't necessarily want that. So let's see how we apply this to the task of reproducing bugs. What we do here, we look at bug repositories for popular apps, especially bug reports that have something called steps to reproduce. We take those steps to reproduce, we install the app, we reproduce the bug, and all this time the app has been running with record enabled, and then we replay the interaction up until the point where it's just about to crash, and there we break into the debugger, and that's useful, for instance, for developers, to look at the root cause of the crash. And it turns out that this works well for many apps that perhaps you've heard about. I'm presenting the bug categories there: many apps crash on incorrectly formatted input -- SoundCloud, for example -- and we're able to replay that. K-9 Mail crashes on invalid input. NPR News crashes because too many events are being delivered per second or something like that, and we're able to reproduce that. We're able to reproduce the bugs in Firefox and Facebook, as well. So this is a tool that can help both the user and the developer: the user in capturing a trace, and the developer in breaking into a debugger just before the crash. Another application is something called time warping. So if you have a long-running app, an app that, say, takes 30 minutes or so before it crashes, that might be a little bit too much and we'd like to compress that. So we use a technique called time warping. 
So what we do, we take the trace and we look at what elements of the trace can actually be compressed and delivered in rapid sequence. And it turns out that whenever you have, for instance, data entry, if you have a texting app, the phone is perfectly capable of accepting your 140-character message in something like two milliseconds, but you can't type that fast. So what we do, we take the trace, we look at this kind of situation, and then we compress the trace. >>: [Indiscernible]. >> Iulian Neamtiu: So we have something called clustering. So we look at the trace, and we have certain candidates of events. This is not fully automated. It's automated in the sense that I can apply it automatically on a trace, but when we constructed the approach, we had to look at what kind of events are amenable to compression and whatnot. So, for instance, data entry is amenable to compression, or whenever the app is paused because the user reads something on the screen, then that's also amenable to compression. So, for instance, you load a page in the browser and then you take 20 seconds to go from the top to bottom because you have to do some comprehension. That we can just fast-forward through, so we can deliver the event so that you go through the page in, say, two seconds. Yes, please. >>: So another way of thinking of that is when the user is using the application, there are some things that split to the users, so is it like if you load a page, you might want to expand it so you can read it, you might want to scroll it, but when you're doing the replay, you don't need to recreate all of that, because no one sits there reading a page. >> Iulian Neamtiu: That's right, that's right. You have to be careful. We do selective replay, but you have to be careful with choosing what not to replay, right? Because, for instance, if you have a web app and you skip loading the page or so, then that's catastrophic. You won't be able to get to the next page. >>: [Indiscernible]. That example suggests that you're going to excise parts of the trace, like get rid of the zoom events and so on. >> Iulian Neamtiu: Oh, no, no. >>: But what you're really doing is just compressing them. >> Iulian Neamtiu: Compressing them. We still deliver the events, yes. >>: Why is it semantic at all? You know the minimum delta that the app can accept input on, so why can't you just squeeze out all events to get down to that one tenth of a millisecond? >> Iulian Neamtiu: Right, so that's a good question. For instance, for swipes, swipes cannot be compressed at all, because if you try to compress them -- so a swipe is going from point X to point Y at a certain speed. So depending on the platform, if you swipe faster or slower, that might mean different things, so for instance, if you swipe faster, that means you're going to land further on the screen. >>: There was a touch of a -- it's actually recording the time that it happened. >> Iulian Neamtiu: That's right. That's right. You have things like momentum. So swipes, for instance, we cannot compress at all. And if you try to compress a swipe too much, it's going to be kind of jagged. The user is not capable of reaching from this point to this point in a very short time, and that might look like two presses or so, so it's not going to look like a swipe anymore. Using this technique, we're able to fast-forward the executions quite substantially, and it really depends on the type of app. 
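Concretely, the kind of compression just described might be sketched as follows. This is only an illustration of the idea: the event representation and the thresholds are assumptions, and, as discussed above, events inside a gesture are left exactly as recorded because swipe speed carries meaning.

```python
# Sketch of time warping: shrink long "think time" pauses and typing delays in a
# recorded trace, but never re-time the events inside a touch gesture.
IDLE_GAP_S = 2.0        # a pause longer than this is treated as user think time
COMPRESSED_GAP_S = 0.05 # what such a pause is shrunk to on replay

def time_warp(trace):
    """trace: list of (timestamp_s, kind, payload) with kind in {'touch', 'key', ...};
    returns a new trace whose timestamps have the compressible gaps squeezed out."""
    warped, removed = [], 0.0
    for i, (ts, kind, payload) in enumerate(trace):
        if i > 0:
            prev_ts, prev_kind, _ = trace[i - 1]
            gap = ts - prev_ts
            # Compress idle pauses and key-to-key typing delays; leave
            # touch-event streams (swipes, pinches) untouched.
            if gap > IDLE_GAP_S or (kind == "key" and prev_kind == "key"):
                removed += gap - min(gap, COMPRESSED_GAP_S)
        warped.append((ts - removed, kind, payload))
    return warped
```

The warped timestamps are then fed to the same kind of injection loop sketched earlier.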
For apps that don't require that much user input -- Pandora, for instance -- we're able to compress an original execution from 300 seconds to 24 seconds. Yes, please. >>: Just to make sure I understand, you're not -- there was a process of manual discovery in which you, through your expertise, figured out that certain events were amenable to compression and others not. >> Iulian Neamtiu: Right, right. >>: You're not trying to discover this by looking at traces and seeing what you could compress. >> Iulian Neamtiu: No, no. So what we had to do here, or what my then-student Lorenzo had to do, is look at the trace manually and try to find out -- so, first of all, you have to cluster events to be able to say this is a swipe, this is a press. And then, once he was able to cluster the events, we looked in a sense at the semantics of this, and we tried to see which events and which intervals between events can be compressed. Yes, that was a manual process. However, now, you just give me a trace and we automatically compress it for you. Yes, please. >>: What happens to timers, so my keep-alives and various timers that are in the applications sort of natively, quote-unquote. It's designed to wait for five seconds. Do you compress those durations? >> Iulian Neamtiu: Oh, no. Not at all. No. >>: How does it not? Timers start going off, for instance. When we do not compression but the other way -- expansion, time expansion -- what we see is that various timeouts start to kick in and you basically are tracing an error path rather than a regular path in the app, which is not always what you want. >> Iulian Neamtiu: So we haven't had that issue with the apps that we've looked at. I'm not saying that might not be an issue for the other 800,000 apps out there, but for the apps that we looked at, the timers were not an issue, and the time warping that we did was only in the compress direction. It was not in the expand direction. I can imagine that trying to expand this might give you some issues, but in the apps that we looked at, that wasn't a problem, no. The last application of our technique is something that we call semantic sensor data alteration. So what we do here is we take an input, a sequence of readings from the sensors, and then we alter some of the readings, and we use this to drive the execution through a different path, and sometimes this exposes a crash. Now, this is akin to fuzz testing, but it's not really fuzz testing, because we're not trying to create malformed or nonsensical input. So let me give you an example of how we do this sensor alteration. For instance, for the GPS location, we do something called a map shift. We add some delta to the coordinates in the initial input, or we say instead of starting a route at one location, start at another location, or at a location such that you're asking for driving directions to Hawaii, something of that sort. Or we might want to change the speed. And so we alter the readings, and this is sensor dependent, because we have to apply these transformations in a way that preserves the semantics of the reading. For instance, for camera readings, we might choose to blur or darken the image, or, for the microphone, we might want to add noise or change the sample rate, things like that. And it turns out that this technique is quite effective in exposing some app crashes. So when we alter the GPS location, we're able to crash these four apps: Yelp, GPS Navigation&Maps, Route 66 and Navfree USA. 
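As a concrete illustration of the map shift and speed alterations mentioned above, here is a minimal sketch. The trace format and the offset values are assumptions; the point is only that the transformation preserves the shape and timing of the original route while relocating it.

```python
# Sketch of semantic GPS alteration: shift every recorded coordinate by a fixed
# offset (a "map shift"), or scale the reported speed, before replaying.
def map_shift(gps_trace, dlat, dlon):
    """gps_trace: list of (timestamp_s, lat, lon, speed_mps); returns a shifted copy."""
    return [(ts, lat + dlat, lon + dlon, speed) for ts, lat, lon, speed in gps_trace]

def scale_speed(gps_trace, factor):
    """Pretend the user moves faster or slower along the same route."""
    return [(ts, lat, lon, speed * factor) for ts, lat, lon, speed in gps_trace]

# Example (offsets made up for illustration): replay a route recorded in Riverside
# as if it had been driven in Seattle, then feed the altered readings to the replayer.
# shifted = map_shift(original_trace, dlat=13.6, dlon=-4.9)
```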
And then we're able to drive the execution through different paths for other apps. For instance, for the camera, doing those kinds of transformations gives us an altered execution or leads the app to displaying an error, which is good for developers, actually. Right, it's a way for them to test -- given one execution, given one set of inputs, they can test the app in a more robust way by altering the input and seeing what happens. So now I'm going to talk about the other challenge, how to provide comprehensive coverage for apps. The idea here is that a dynamic analysis really is critically dependent upon having a good set of inputs and good coverage in our testing suite. And we tried to see if we can come up with a way of automatically exploring apps and automatically generating inputs to improve coverage. So what we did, we tried to see, can we use actual executions from the field to drive coverage? And to do so, we did a user study. We recruited seven users, and we asked the users to play with apps for five minutes, with a set of 28 apps. We recorded the interaction, and then we measured the coverage, what percentage of the app was covered during this five-minute interaction. And it turns out that there is not that much coverage: only 30% of the app screens were covered, and only 6% of the methods were covered. And one of the users here was a so-called super user who explicitly tried to achieve high coverage, but they didn't get very far. So why do we see such little coverage? Well, it turns out that users might not choose to exercise all the features, or they don't necessarily feel compelled to change the settings or to buy stuff or to share the results of their game on a social network, or to click on ads, or to, say, change their user profiles. Hence, we are seeing reduced coverage. I think this is important for the developer, because they can see what's left out of the app and why it's not exercised. Yes, please. >>: How are you estimating coverage and so on? Is this a different tool that's instrumenting source code? >> Iulian Neamtiu: Yes, yes. This is a different tool. So here, we're instrumenting -- actually, we're instrumenting the bytecode, and we look at how many screens, or what percentage of the screens and what percentage of the methods, are actually exercised. So it's a different tool, yes. The tool is called A3E, and here are the results in a nutshell. We were able to do this for 25 popular apps, all with more than a million downloads. And you see that I am saying 25 apps -- in the initial user study we had 28 apps, but three of those apps used native code, and that creates an issue for us with measuring coverage or doing any kind of analysis. The tool runs on actual phones, and after we apply our techniques, we increase the screen coverage to about 62% and the method coverage to about 33%. We don't need the source code, but we need access to the bytecode. We are able to do complex gesture reproduction -- I'm going to talk about that in a second, because the input to these apps is quite rich -- and we have two approaches for the exploration, a fast approach and one that's a little bit more thorough. And the tool is open source. You can download it at that address, and it's been tried on apps such as these. So let's see how we constructed that. Well, this is an execution of A3E, the Automatic Android App Explorer. So if we look at the mobile version of Amazon, an interaction might look like this. 
It just starts the app, then it clicks the search box to look for a product, types in some characters, and in the end hits enter, and Amazon displays a list of products. We see that all this interaction takes place across different screens called activities. So if we can find a way to extract these screens and drive the execution automatically -- among screens and within screens -- then we're golden, and we do just that. So we construct something called an activity transition graph: using bytecode analysis -- using taint analysis, actually -- we extract a graph of the legal transitions among the app screens, and then, within each of the app screens, we do a little bit of local exploration, and I'll tell you in a moment why we have to do that dynamically. So we reconstruct this graph using static taint analysis, and then, using dynamic exploration, we're going to explore the local elements in each screen. This might seem like an unusual choice of analysis, because we're trying to extract an app model here, and we use taint tracking. But it turns out that taint tracking is very useful in this context. So suppose I want to see whether it's possible to transition from screen A to screen B; the way we set up that analysis is to look at screen A as a source and screen B as a sink, and using taint tracking we try to see if there is a path that goes from screen A to screen B. And this path, it turns out, is made possible via something called an intent. So we create an intent, the intent sets up the destination screen, and then when we actually invoke the intent, this transition is realized. So just to give you an idea as to how we mark the code with sources and sinks, these are three examples. We can create an intent in a straightforward way, with the source and the destination right there -- this is the easy case. Or we can create the intent and later on set B as the destination of this intent. And, finally, we can create the intent and then set up the destination in a slightly more complicated way. All of those intent-creation sites are marked as sources. Then we mark the sink. >>: [Indiscernible] special class -- >> Iulian Neamtiu: Yes, so actually, an intent is a way for intra- and inter-app communication; it's a structured way of doing IPC. >>: These results are currently [indiscernible]. >> Iulian Neamtiu: Yes, because in the -- let's see. Can you be a little bit more specific? So there's unsound, there's incomplete. >>: [Indiscernible]. And things are tied together with XML, and intents are resolved through text streams that open up constants and so on and so forth. And so soundly approximating that leads to a gigantic blow-up in terms of static analysis. So generally, when people try to do that, they make some defining assumptions. >> Iulian Neamtiu: That's right. >>: Something specific. >> Iulian Neamtiu: So -- what's the worst that can happen if we miss one such transition? The automatic coverage that our tool provides is not going to be 100%, which is a decision that I'm comfortable with. So once we set up the sinks and the sources, we do this taint analysis -- we use a static bytecode analysis tool -- and then we reconstruct a graph that looks like this. 
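In code form, the setup of this taint problem might look roughly like the sketch below. The Android methods named are real intent-related API entry points, but the fact format and the helper callbacks are hypothetical stand-ins for whatever the static bytecode analysis actually computes; this is an illustration, not the tool's implementation.

```python
# Sketch: activity-transition extraction framed as a source/sink reachability
# problem. Intent-creation sites are the sources (the three patterns from the
# slide); firing the intent is the sink. SOURCES/SINKS are what a taint engine
# would be configured with; the function consumes the facts such an engine emits.
SOURCES = [
    "android.content.Intent.<init>",        # new Intent(...), possibly with the target class
    "android.content.Intent.setClass",      # destination set after construction
    "android.content.Intent.setComponent",  # destination set via a ComponentName
]
SINKS = [
    "android.content.Context.startActivity",
    "android.app.Activity.startActivityForResult",
]

def build_transition_graph(creation_sites, reaches_sink, resolve_target):
    """creation_sites: iterable of sites, each with an .enclosing_activity attribute;
    reaches_sink(site) -> bool; resolve_target(site) -> target activity name or None."""
    edges = set()
    for site in creation_sites:
        if not reaches_sink(site):
            continue              # intent is created but never fired
        target = resolve_target(site)
        if target is None:
            continue              # unresolved destination: accept lower coverage
        edges.add((site.enclosing_activity, target))
    return edges
```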
So that allows us to construct a high-level model of the app screens. Why is dynamic exploration needed? Well, it turns out that if you look at the screen of an app, depending on the device's form factor and depending on the device's orientation, the layout is going to be different. So we do dynamic exploration to be able to find, if you will, what all of the elements on the screen are, so we can exercise them. There are also issues like ads that sometimes pop up, and good luck finding those with static analysis. So the layout is highly dynamic, so we need a technique that's able to cope with that, and we use this local dynamic exploration for that purpose. Now, let's see what's so unique about -- >>: [Indiscernible]. >> Iulian Neamtiu: Let me roll to this slide. So it turns out that there are screens that are not -- if you just fire up the app, if you start the app, there are screens that you cannot get to. Why? Because, for instance, this is a password screen for -- this is Dropbox. It turns out that the only way to invoke this screen is from another app: another app sends an intent to this one, and that will open up the screen, just like, say, share on Facebook. That's a screen that's not naturally accessible from the original app. So we need to discover all of these possible entry points, and we use static analysis for that. So that's one problem, the screens that are invoked externally, which we have to use static analysis for. Another problem is, if you look at the interaction, just like we've seen with, say, Angry Birds or Yelp, the interaction is quite complex, so we had to build a gesture library to be able to drive the execution in those situations where simply retrieving the GUI elements and saying, hey, button, press yourself, was not sufficient. And, finally, we had to deal with real-time input from the sensors. So, for instance, in Shazam, you can't get very far in the exploration if you don't play a song in the background, so our library of sensor inputs is quite diverse. And we introduced two techniques to do the exploration. The first technique is called forced depth-first exploration, which performs exploration in the manner that users would naturally interact with the app. So, for instance, from the home screen I go to the search screen, product, cart, checkout, and then we press the back button to come back to the other states. So this is just -- it's trying to emulate a user, but we do a much more thorough job. That's called depth-first exploration. And another way is to do something called targeted exploration, where we go from one screen to another screen directly, without requiring GUI interaction to trigger these screen transitions. And here we have to be careful and make sure that these screen transitions are actually legal, and that's the reason why we don't go from home to checkout. That path is not possible. So what are the tradeoffs here? Of course, targeted exploration is going to be fast, but it's not going to be as thorough. Depth-first exploration is going to take more time. And how does this work in practice? We install the app on the phone. We have an automatic explorer that runs on a PC, and then we perform the actual exploration while the app runs on the phone, and the result obviously is going to be a trace that's replayable with Reran later on. So that's depth-first exploration, very simple. 
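Before getting to targeted exploration, here is a minimal sketch of what the forced depth-first driver might look like. The `device` abstraction (current screen, clickable widgets, back button) is assumed, not a real automation API; a real explorer also has to fire the gesture library and the sensor inputs mentioned above.

```python
# Sketch of forced depth-first exploration: on each screen, fire every clickable
# element, recurse into any new screen it opens, and press back to return.
def explore_dfs(device, visited=None):
    visited = set() if visited is None else visited
    screen = device.current_screen()            # e.g. the current activity name
    if screen in visited:
        return visited
    visited.add(screen)
    for widget in device.clickable_widgets():   # buttons, list items, menu entries...
        widget.click()
        if device.current_screen() != screen:
            explore_dfs(device, visited)        # explore the new screen...
            device.press_back()                 # ...then come back, like a user would
            # (a real explorer would re-query the widget list after navigating back)
    return visited
```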
For targeted exploration, where we use static analysis, there's an intermediate step here: we run the analysis, we construct a model of the application, and then we take this model into account before exploring the app, so that we know exactly how to transition among the screens. Yes. >>: So I understand how you're using taint analysis to construct this graph, but how does taint analysis reveal to you what the actions are that let you move around the graph? >> Iulian Neamtiu: That's a good point. There's a way to actually fire up a screen without going through an action, so once you're in a screen, you can say I want to go to the next screen, and then you can do so. >>: You start up that screen in -- >> Iulian Neamtiu: Like in a default state, yes. So this is a problem, and that's why we have to be careful with targeted exploration: when there's state being passed around -- for instance, a credit card number or something like that -- then we cannot just transition among screens wildly. So just to give you an idea of how effective this is, I contend that the approach is quite effective, given that we've run this on wildly popular apps. We measure the number of screens that are covered using our approach, and it turns out that we cover more than 59% -- this is the average screen coverage -- and we also have method coverage, and that's 29% to 36%. These are average numbers for the 25 real-world apps that we use. Yes, please. >>: Is there a time bound on how long this algorithm was run? >> Iulian Neamtiu: Yes, good point. And, no, we didn't have any time bound. So you're anticipating my next slide, which talks about how efficient this approach is. It turns out that, depending on the app, exploration might take more or less time -- we had no bound, so these are the actual runtimes. The fastest static graph construction was for CNN, and in terms of runtime exploration, it took us 18 minutes to explore the BBC News app. That's for targeted exploration -- this is just shallow exploration, just switching among the screens, for apps that are amenable to that. And for depth-first exploration, the fastest we could explore one app was Tiny Flashlight -- it's not very feature rich -- and that took 39 minutes. And the longest was Shazam: for targeted exploration, 236 minutes, which is close to four hours, and using depth-first exploration it took pretty much the same time. But on average, targeted takes about an hour and 20 minutes and depth-first takes about an hour and 40 minutes, which we believe is reasonably efficient, because if you want to run this exploration, you can just run it and in a couple of hours, or at most four hours, you have your systematic exploration. Yes, please. >>: [Indiscernible]. >> Iulian Neamtiu: Next slide. So what can you use this systematic exploration for? We used systematic exploration coupled with something called model-based runtime verification. So we have a dynamic analysis, but the dynamic analysis is just an infrastructure. This is just the infrastructure; now let's run some applications on top of it. So in this case, we manually constructed some models of how an app should behave, and we check at runtime whether the models are obeyed -- whether the app goes through only the legal set of transitions. So let's assume that this is the app state machine, and this transition is not legal. If we see that transition at runtime, we've found a bug. 
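To make the verification step concrete, here is a minimal sketch of the check, using the screen names from the Amazon example earlier; the model contents are made up for illustration, and in the actual work the models were constructed manually.

```python
# Sketch of model-based runtime verification: compare the screen transitions
# observed during exploration against a manually written model of legal ones.
LEGAL = {
    ("Home", "Search"),
    ("Search", "Product"),
    ("Product", "Cart"),
    ("Cart", "Checkout"),
}

def check_run(observed_transitions, legal=LEGAL):
    """observed_transitions: iterable of (from_screen, to_screen) seen at runtime.
    Returns the transitions that violate the model -- each one is a candidate bug."""
    return [t for t in observed_transitions if t not in legal]

# A run that jumps straight from Home to Checkout would be flagged:
# check_run([("Home", "Search"), ("Home", "Checkout")]) -> [("Home", "Checkout")]
```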
So we had a paper in ASD in 2011, and we found 21 bugs in open-source apps using this verification technique. So that was one application of dynamic analysis. Now, I'm going to talk about another application, and that's profiling. So what we do here, we take the app -- the app is running on the phone with Reran capability -- and then we instrument the platform to collect data at three different layers: user interaction and sensors, system calls, and network traffic. And this has all sorts of benefits, and I'm going to summarize those benefits in a second. We ran this on 27 apps across different categories, and what we focused on here was, first of all, trying to get some behavioral snapshot of apps and also to see how free and paid apps differ. So what we did, for instance, for Shazam and Angry Birds, we looked at both the free version and the paid version and tried to see what the differences are in the capsule profile of the app. For each app, we did 30 runs: we had three users, and for each user we recorded one run and then replayed it nine times, at different times of day, to see if there was any variation, and we didn't find much variation. So the first findings were obtained using tcpdump. We looked at the network traffic, we inspected the traffic headers and contents, and we tried to see how much of the traffic is with the app's website, how much is with the cloud or with the content distribution network, and how much is ads and tracking. So without further ado, these are the results for some of the apps. What we have here, we have six apps, and what I'm going to show, I'm going to present the download traffic, the upload traffic, and then what percentage of the traffic was with the origin, so the origin is the app's site. So Tiny Flashlight, if it has a developer named [foo.com], how much of the traffic was with [foo.com], how much was CDN and cloud -- for instance, Azure or Amazon AWS, things like that -- how much is third party, ads, tracking, Google? And within the traffic, what's the split between HTTP and HTTPS? So let's take them in turn. Tiny Flashlight is, as the name implies, a tiny app to turn on the flashlight. It turns out that you need Internet access to be able to use the flashlight, and if we look at the traffic, 99% of the traffic was with third parties, and all of it was unencrypted. It was over HTTP. Now, let's look at the differences between a free and a paid version of an app. So Advanced Task Killer: the free version has some traffic -- it's all third-party traffic, mostly unencrypted -- whereas the paid version has no traffic whatsoever. So that's what shelling out $0.99 or so buys you. Facebook and Amazon have heavy traffic, as expected. What we've discovered is that not all the traffic is encrypted for Facebook, so about a quarter of the traffic was encrypted -- sorry, a quarter of the traffic was in the clear over HTTP, and the rest is HTTPS. And, finally, an app that measures your heart rate, it measures your pulse: it turns out that if you get the free version of the app, it has pretty serious download and upload -- >>: [Indiscernible]. >> Iulian Neamtiu: You press your finger against the flash and the lens of the camera, and then it -- I guess it sees how the tip of your finger expands and contracts, and that's your pulse. 
So it turns out that, again, there are marked differences between the free version and the paid version of the app -- 91% to 96% of the traffic goes to third parties, and this is something I'm uncomfortable with: for the free app, lots of traffic actually goes unencrypted, so who knows who's reading your heart rate, whereas if you pay for the app, then the split is kind of reversed -- 80% of the traffic is over HTTPS and the rest is over HTTP. >>: Browsing, or is there a purchase actually here? I'm surprised at the 1% unencrypted. >> Iulian Neamtiu: There was no purchase. This was just browsing. >>: If it's also already going to destinations unknown, why do you care if it's HTTPS at that point? It seems like the man in the middle and the destination are roughly in the same camp. >> Iulian Neamtiu: Sure. I agree. Actually, I'm going to talk about that in a second, that lots of this traffic goes to ill-reputed websites. And whether it goes encrypted or not, who cares? The only problem is -- I don't feel comfortable if someone is tampering with the flow here. So HTTP is subject to tampering -- it can be read, written, man-in-the-middled -- so that makes me uncomfortable. So I have discussed that. And what we did with these apps, actually, is we used clustering to bin each app's usage into one of three clusters. So we have a high-usage cluster, medium usage and low usage. If an app has a lot of network traffic, it's going to end up in the high cluster. If an app has very little traffic, it's going to end up in the low cluster. And we did this analysis at three different layers, as we said before: at the OS layer we measure the intensity in syscalls per second, at the network layer in bytes per second, and at the user layer in events per second. And this kind of application thumbnail can be used to give you an idea as to behavior differences between apps. So, for instance, this is Dictionary.com, and we see that if you are purchasing the app, you're not going to see that much physical activity, which is good news, I guess, for your battery. This is Dolphin, and Dolphin has high network activity in the free version and medium network activity in the paid version, and so this can be used to compare apps. So I discussed that. In terms of reading between the lines, what have we found? Well, free apps are not as free as we think. We measured the number of system calls per second, and it turns out that the free apps have 50% to 100% higher system call intensity than the paid apps, and they have dramatically higher network traffic due to ads and tracking, and this is bad for your battery and bad for your data plan and bad for your privacy. And also, coming back to this idea, Ben asked where does the traffic go? It turns out that the free apps send traffic to more domains, so Dolphin Free sends its traffic to 22 domains, the paid version to just nine domains, and the same thing for Angry Birds. Oh, yes, please. >>: I guess a lot depends on whether this is specific to Android, or whether iOS and Windows are like this as well? >> Iulian Neamtiu: So this is all Android. We don't have -- well, we did another measurement study on other platforms, including Windows and iOS, and it turns out that there's not that much difference among Windows, iOS and BlackBerry, but there is a difference between those operating systems and Android, in the sense that we see higher traffic on Android. But we didn't control for apps, so that was skewed -- we only looked at the platform and the traffic. 
It could be that Android was used by some users who are already, say, prone to high traffic. Maybe they were watching video or something like that. I can't really tell whether it's the platform or whether it's the user, because that's a confounding factor. >>: To comment on that, the way I think about it is that basically Android is the Windows 95 of smartphone operating systems, so it's tremendously popular and also tremendously prone to various problems, and these are just some of them. >>: And has looser policies. >>: Yes, and that's more [indiscernible]. >> Iulian Neamtiu: And just to wrap up this part: in another piece of work, we looked at the URLs that are accessed by apps, so we have a system called Aura, and the idea here is that if you have an app that collects all this kind of sensitive information, do you know what kind of websites the app is sending the information to? And I don't have time to go over the details of the study, but suffice it to say that we looked at the URLs used by 13,000 apps, a quarter million URLs, and it turns out that half of these -- sorry, of the top 20 most popular URLs, 10, so half of the top 20, are ad servers. And we also found that good apps, so apps that are not malicious per se, talk to so-called bad websites, as marked by sources such as Web of Trust or VirusTotal. It turns out that 66% of the apps talk to domains with poor reputation, 74% of the apps talk to at least one domain that's unsuitable for children, and 50% of the apps -- and these are the supposedly good apps -- talk to blacklisted domains, as identified by Web of Trust and VirusTotal. Those are domains whose only raison d'être is to spread malware and to do phishing scams. So in the time remaining, I'm just going to talk about -- I guess I'm going to skip this; I'm going to discuss it individually when we meet one on one in the afternoon. So let me just conclude. I presented an approach that facilitates a whole host of dynamic analyses on smartphones. At the core of our approach is a record and replay scheme, and then I've demonstrated how to do profiling and automatic exploration, and do so with reproducible results on apps that run on real phones. And you can go to this website to download the tools or to read the papers, and I'd be happy to discuss details with you. This concludes my talk. I'd be happy to take any questions. >>: [Indiscernible]. So runtime analysis certainly is a [indiscernible] when it comes to analyzing benign applications, but a malicious application can use a variety of techniques to avoid being analyzed, you see. So I assume you have not actually observed that in the wild, but have you tried analyzing at least obfuscated, highly obfuscated apps to see how that goes? Have you tried to play with that? >> Iulian Neamtiu: No, so in this study, actually, we also looked at malicious apps, about 1,000 malicious apps in this study. And it turns out that there is no sophistication whatsoever. So, for instance, if the app collects your GPS reading, it sends it in the clear, and in the HTTP request, you see longitude and latitude. And the -- >>: Might they actually do this without your observing? >> Iulian Neamtiu: Oh, absolutely. What I'm saying is the level of sophistication we observed was not very high. 
I'm not saying that there are not sophisticated apps that we have not observed, but even for the ones that we were able to see, we were surprised at how just basic the technique was. So no obfuscation whatsoever, just leak whatever personal information in the field, yes. >>: These were tagged as being malicious? >> Iulian Neamtiu: Oh, yes, we knew they were malicious. >>: North Carolina subset? >> Iulian Neamtiu: No, this is from Purdue. We got this set from Purdue, Christina and her students. >>: Thank you. >> Iulian Neamtiu: All right, thanks.