>> Dennis Gamon: Well, actually, we have a very exciting session for you. Not surprising, of course. The general topic is streaming environmental and urban data, and it's actually a topic which is near and dear to my heart. We have three very good presentations on the topic. In the very first one, Jie Liu is going to speak to us on something whose implications we may not fully understand, but I'm sure he'll tell us: cloud-offloaded GPS for energy-efficient location sensing. >> Jie Liu: Thank you. So yeah, I'm going to talk about how the cloud can actually help us not just store and process data but also collect more data. In particular, in this case it is location data, which is essential for many scientific discoveries. It's an effort that a bunch of us in MSR worked on, as well as our interns from Purdue and Brazil and our collaborators in Australia. As I said, location is one of the fundamental properties scientific data relies on, from animal tracking, tracking migrating birds, environmental monitoring, and water quality monitoring to human well-being and health applications. More and more we see these devices -- wearables, clip-on collars, things we attach to animals -- that try to collect data about their locations, interactions, and other properties. For lots of the data we collect, we want a location property attached to it. I'll show you a very concrete collaboration we're having with [indiscernible] in Australia. They are interested in this fairly big bat called the flying fox. They travel on the order of up to a hundred miles in a day, and they potentially spread both disease and seeds, among other things. So scientists are really interested in trying to understand their behavior. I'll show you one video, not using our technologies, that they made by monitoring the location of a bat through a day -- how far it travels.
You'll see these diamond shapes; those are actually GPS data points, and once in a while you'll see a gap where they don't have data, and they interpolate and assume the bat flies a straight line from one location to another. Through the day, it travels pretty far, over water and over the forest, and at the end of the day it comes back to the bank of a river and rests there. This application is challenging in several senses. One is that although these bats are pretty big, they have a payload limitation. They can carry about 30 grams; anything bigger than that, they don't like, and they probably won't fly as far. They have an energy constraint, because these bats go out at night and rest during the day, so even with energy scavenging, when they're flying -- which is when you're interested in the location data -- there's no solar energy available to power the devices. And the third one is the communication challenge. They get the data when the bats come back to their resting trees: they set up a radio station and try to read the data out of these collar tags, and at that point, if you're monitoring lots of these bats, the location data, as I'll show, can be fairly large. Even just downloading it using a low-power radio can be challenging in terms of network throughput. So this is an example motivating many applications we've seen in scientific data collection, where you have lots of limitations in terms of form factor, energy, and the communication technologies you have. For this particular talk, we want to drill into the GPS part of the power consumption. People all know this is a fairly energy-consuming device, and, in fact, that bat tag right now, with the battery they have, lasts for a day. The device lasts for a day and then it just dies. You can collect data for a day and that's it. This data is collected on a mobile phone.
That's another reference point; people generally have personal experience with the energy consumption when you turn on the GPS for tracking and so on. It consumes up to, like, a watt of power if you have the GPS on continuously. In a sense, we all have this experience: with location continuously turned on, your phone will probably die in a few hours -- maybe six or eight hours if you do navigation continuously. So what can we do? We really want location sensing, and we don't want to pay the energy. For this particular work, let's drill down a little deeper into how GPS works. People probably have this rough notion that there are satellites in the sky that cover the earth, they send signals to the earth's surface, there are receivers that receive these signals, and through triangulation you find out where the receiver is. There is a lot of technology behind this rough story. For example, the GPS satellites don't actually stay fixed in the sky. They move, and their trajectories are sent out through this so-called ephemeris data, and that's a very slow radio. The signal coming down from the satellite transmits these satellite trajectory parameters at 50 bits per second. In other words, in order to receive the whole packet of where the satellites are, with the time stamps in it, the radio has to be turned on for 30 seconds. We've all had this experience: when car GPS units first came onto the market, you would turn one on and it would say "acquiring satellites," and it generally took about a minute to find out where you are. That's the time spent decoding all this information, and during that period, the radio has to be on and all the signal processing has to be working. So that's one of the key reasons for the power consumption in GPS: it needs processing, and it also needs time. Now let's look at the GPS signals; this is getting a little involved.
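The 30-second figure above is simple arithmetic over the navigation message. As a hedged sketch -- the 1,500-bit frame length (five 300-bit subframes) is standard GPS background, not a number stated in the talk:

```python
# Why a standalone receiver listens ~30 s before its first fix: a full GPS
# navigation-message frame is 1500 bits (5 subframes of 300 bits), and it
# is broadcast at only 50 bits per second. The frame size is standard GPS
# background, not a figure from the talk itself.
FRAME_BITS = 5 * 300
RATE_BPS = 50

time_on_s = FRAME_BITS / RATE_BPS
print(time_on_s)  # -> 30.0 seconds of radio-on time just to decode ephemeris
```

During all 30 of those seconds the RF front end and signal-processing chain must stay powered, which is the cost the cloud-offloaded design avoids.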
I won't get into the equations and the math, but there is a so-called Gold code for each satellite -- a specific code they send out. It gets modulated, and further modulated onto the carrier, and then the propagation introduces Doppler and other frequency effects onto these signals. The receiver's role is to find the so-called code phase, which is the propagation delay from the time the signal comes out of the satellite to the time you receive it. So this is getting a little involved, but at a high level, you get these signals coming down, and there is a correlation happening in both the code-phase dimension and the Doppler dimension. You try to detect a spike. That spike tells you which satellite you're seeing, what the code-phase delay is, and what Doppler was introduced -- roughly how fast the satellite moves. And then you get to the tracking mode. Both acquisition and tracking can be done every millisecond, so that actually doesn't take a lot of time. But the decoding, as I said, is the main part. It takes 30 seconds to do the decoding, where you get the time stamp from the satellite signal, you get the ephemeris from the packets, and you have the code phase, which tells you the fine-grained propagation delay. You feed all of that into a least-squares computation, and at the end you get the location. That whole process takes about a joule of energy on your mobile phone. But if we think about what's essential for the device to provide, it's only the code phase, because the ephemeris and time stamp we can estimate. The ephemeris we can get from the cloud: NASA publishes satellite ephemeris parameters frequently, and even projects them into the future. That we can get in the cloud. For the time stamp, there are ways of estimating it. You can do device-level time stamping, but this won't be very accurate.
But there are ways of using that as an estimated parameter in your least squares and still be able to find the location. So essentially, we only need the code phase, which we can sense every millisecond. That's the minimum amount of time we need to receive the GPS signal to derive your location. There we're using a trick called coarse-time navigation, which basically says: if I know roughly where you are, to within a few hundred kilometers, I can estimate the millisecond part of the propagation delay, because light travels so fast. If I know my location and I know a nearby location, the millisecond part of the propagation delay is going to be the same, and only the code phase -- which you can estimate every millisecond -- determines your fine-grained location. So you have a rough notion of where you are, and then you correct it with the code phase. That's the basic insight. I won't go into too much detail. But if you apply that blindly, without knowing your reference location, you'll get lots of possible solutions -- these are all possible solutions given a particular set of code phases. Obviously, we can't accept all those locations. There are interesting ways of estimating a rough location using Doppler intersection: the Doppler gives you a sense of how fast the satellite is traveling, and by looking at it from the ground, if you know the satellite's raw speed, you know what angle you're looking at it from. If you have multiple of these angles, you can intersect them, and that gives you a rough estimate of the location. A third trick we can use to disambiguate the reference location is the third dimension: the earth's elevation.
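The coarse-time idea above can be sketched numerically. This is a toy illustration under my own assumptions -- all numbers are invented, and the real system solves a joint least-squares problem rather than one satellite at a time. The device measures only the sub-millisecond code phase; the integer-millisecond part of the delay is recovered from the rough reference position:

```python
C_KM_PER_MS = 299.792458  # light travels ~300 km per millisecond

def full_delay_ms(code_phase_ms, predicted_range_km):
    """Combine the measured sub-millisecond code phase with the integer
    millisecond count implied by a rough reference position (illustrative
    helper, not the paper's actual algorithm)."""
    predicted_ms = predicted_range_km / C_KM_PER_MS
    n = round(predicted_ms - code_phase_ms)  # resolve integer-ms ambiguity
    return n + code_phase_ms

# Made-up example: true delay 71.37 ms; a reference position tens of km off
# predicts 71.60 ms. The sub-ms code phase still pins down the exact delay.
print(full_delay_ms(0.37, 71.60 * C_KM_PER_MS))  # -> 71.37
```

The rounding step is exactly why the reference only needs to be good to a few hundred kilometers: 150 km of position error is only half a millisecond of delay, so the integer count still rounds correctly.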
If you're sure you're close to the earth's surface, then every millisecond of error in the distance measurement actually gives you 300 kilometers of error in the position, so even if the least squares converges, it will probably converge to somewhere higher than Mt. Everest or deeper than the deepest ocean. So those are some of the tricks that rely on data that's already on the web. The elevation data is already on the web, the ephemeris is on the web, and so on. That really helps us minimize what information we need to collect on the device itself. This is the basis of what we call the cloud-offloaded GPS design, where the device only collects, in our case, a minimum of two milliseconds of this raw signal chunk and stores it, with a reasonable time stamp from the device clock. Then we send it to the cloud. The cloud does the acquisition in software and the Doppler intersection, looks into the NASA database to find the ephemeris, uses [indiscernible] to get the reference location, and combines it with the code phase and the device time stamp, plugging everything into coarse-time navigation. Most of the time it gives us only one result. If it gives more than one result, we use the elevation to do further filtering, and through that we always get at most one location coming out of it. So we built a prototype with essentially a GPS receiver front end, a microcontroller, an SD card, and a battery -- that's the antenna -- and we did some evaluation on it. This shows the power trace of collecting two milliseconds of GPS signal and writing it to the flash card. We only store the raw data; we don't process anything. Then we accumulate that and send it to the cloud afterwards. So that energy number is just for storing the data; it doesn't include sending the data or processing it in the cloud.
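The 300-kilometer figure and the elevation filter are easy to sanity-check. The candidate altitudes below are invented for illustration, not real solver outputs:

```python
C_KM_PER_MS = 299.792458

# One millisecond of delay error corresponds to ~300 km of position error,
# since the signal travels at the speed of light.
print(round(C_KM_PER_MS))  # -> 300

# Filtering candidate solutions by plausible altitude: anything far above
# Mt. Everest (~8.8 km) or below the deepest ocean (~ -11 km) is rejected.
candidates_km = [0.2, 305.1, -298.7]  # hypothetical solver outputs
plausible = [h for h in candidates_km if -11.0 <= h <= 9.0]
print(plausible)  # -> [0.2], only the near-surface solution survives
```

Because a wrong integer-millisecond guess shifts the solution by hundreds of kilometers in altitude, a freely available elevation model is enough to throw out the impostors.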
But the take-away message is that we're about a thousand times cheaper than a GPS receiver on the market that resolves the location on the device itself. To convert that into a more intuitive number: with a pair of AA batteries and one GPS sample per second, this device will last for more than a year. That's a great capability to have for collecting this data. So I was going to show you a demo of sending this data. Okay, let me just talk without the slide. What we did was build a web service deployed on Azure. It has a web role that receives the data -- okay, I won't show you the demo. Anyway, it receives the raw GPS signals that you send to the cloud service; in the background, it mines the NASA assistance database and pulls that data into the web service, and after receiving these raw signals, it does the cloud-offloaded GPS process I described and returns you the set of locations. Let's see how well it works. We used about 1,500 data traces collected on our devices to evaluate this. The data come from several continents. In order to reduce power consumption, the way we do it is to turn on the device and sample what we call a chunk -- a few milliseconds of GPS signal. Then we turn it off, leave it idle for a while, and sample another chunk. The reason is to get more opportunities to average out the temporary errors in the signal. In this particular case, we used two-millisecond chunks: this bar is two milliseconds, then we wait for 50 milliseconds, and then we turn on for another two milliseconds. We take three to five of these chunks, with 16-times oversampling of the GPS signal itself, so two milliseconds is eight kilobytes of data. With that, we can achieve a median error of about 11 or 12 meters. The average error is about 20 meters.
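The pair-of-AA-batteries claim can be sanity-checked with a rough sketch. The per-sample energy follows from the talk ("about a joule" on a phone, "a thousand times cheaper" here); the AA cell capacity is a typical value I'm assuming, not a number from the talk:

```python
# Rough lifetime estimate for the logging-only device.
# ~1 J per on-device fix, a thousand times cheaper when only logging:
energy_per_sample_j = 1.0e-3            # ~1 mJ per stored 2 ms sample
samples_per_day = 24 * 3600             # one GPS sample per second

# Two AA cells at an assumed ~2500 mAh, 1.5 V each (typical alkaline):
aa_pair_j = 2 * 2.5 * 3600 * 1.5        # amp-hours -> coulombs -> joules

days = aa_pair_j / (energy_per_sample_j * samples_per_day)
print(days)  # -> 312.5 days, i.e. on the order of a year
```

Given conversion losses and radio duty cycles this is only an order-of-magnitude check, but it is consistent with the "more than a year" figure.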
And if we look deeper into the error distribution, when we have seven or eight satellites we can detect in the signal, we actually get pretty reasonable results. When we have low satellite visibility, the error is bigger, and that's mainly because the receiver we have is a [indiscernible] receiver that hasn't really optimized the [indiscernible] part very much. We also evaluated how much error we can tolerate in time stamping, because we're now relying on a device time stamp rather than the time stamp your GPS receiver would decode from the satellite. You can see from that plot that we can tolerate about 60 seconds of timing error, meaning that even if my local clock on the device is about a minute off from GPS time, we can still get a very good location. And a byproduct of that location is that it actually resolves the GPS time. So we can use the previous time stamp resolved from the data to correct future time stamps. In a sense, we can recursively do time stamp correction, so this particular plot only applies to the first time stamp you get from the stream. I don't know how much time I have. I'm going to talk briefly about how to reduce the data size. We are collecting the raw GPS signals, and at minimum we're looking at, for ten milliseconds, 40 kilobytes of raw RF data, and that can accumulate pretty fast. If you do the back-of-envelope calculation, one location per minute sampled over a day is 57 megabytes of data. If I use a low-power radio to read that data off the device at a typical throughput, we're talking about ten minutes just to download it from the device. Put aside the energy required to run the radio for ten minutes: if I have a hundred or a thousand bats to monitor, that's a lot of time just to read them out, so it doesn't scale very well. So how can we reduce that?
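The back-of-envelope numbers above can be reproduced directly. The radio throughput used for the download estimate is my assumption, not a figure from the talk:

```python
# 10 ms of raw RF at 16x oversampling ~= 40 KB per location fix.
chunk_bytes = 40 * 1024
fixes_per_day = 24 * 60                  # one location per minute

total_mb = chunk_bytes * fixes_per_day / (1024 * 1024)
print(total_mb)  # -> 56.25 MB/day, matching the talk's "57 megabytes"

# Download time over a low-power radio at an assumed ~100 KB/s effective
# throughput (my assumption, roughly what makes the talk's figure work):
minutes = chunk_bytes * fixes_per_day / (100 * 1024) / 60
print(minutes)  # -> 9.6, i.e. about ten minutes per device per day
```

Multiply by a hundred or a thousand tagged animals sharing one base-station radio and the readout bottleneck is clear, which motivates the compression discussion that follows.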
These signals turn out to be fairly challenging to compress, because they are random signals by design -- these are CDMA signals, which should look like random signals. If you just apply typical data compression programs to them -- we tried it -- we got about a five percent compression rate, so you don't save a lot of space on this kind of data. So what can we do? There's an interesting notion called sparse representation. If you look at what we care about in this data, it's just the spike we talked about: it represents the code phase and the Doppler frequency. If we can define a dictionary that represents the raw signal as a sparse single spike in the transform domain, then there are techniques, called compressive sensing, by which we can take the raw signal, compress it to a much shorter signal, and use L1 minimization to recover and find that same spike as if it were in the raw signal. Just to give you a sense: this is the spike we get from the raw signal using cross-correlation, the typical way of doing GPS acquisition, versus using this sparse approximation or compressive sensing technique to estimate those spikes. The tallest spike is actually the same. The result is much flatter, because we're trying to minimize the number of spikes in the representation, so it can introduce some error, but the main spikes should be there. Our evaluation shows that with 30 percent of the amount of data, we can achieve more than a 95 percent success rate. So we can reduce the data to one-third, and more than 90 percent of the time we still get the same spikes in the acquisition results and the same location. That's an interesting way to reduce how much data you need to store on the device. And it turns out this projection, this way of compressing data, is also very cheap -- it's actually computationally cheaper than running typical text compression.
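To make the spike-detection story concrete, here is a toy sketch: classical acquisition by circular cross-correlation, then the same detection from ~30 percent random projections. A greedy compressed-domain matched filter stands in for the full L1 minimization described in the talk, Doppler is ignored, and all signals are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy acquisition: find the cyclic delay (code phase) of a known
# pseudorandom +/-1 code, noise added. A real C/A code has 1023 chips.
N = 1023
code = rng.choice([-1.0, 1.0], size=N)
true_delay = 417
received = np.roll(code, true_delay) + 0.5 * rng.standard_normal(N)

# Classical acquisition: circular cross-correlation via FFT, pick the spike.
corr = np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(code))).real
print(int(np.argmax(corr)))  # -> 417

# Compressive variant: keep only ~30% random +/-1 projections of the raw
# samples and detect the same spike in the compressed domain. This one-step
# matched filter stands in for the paper's L1 minimization.
M = int(0.3 * N)
Phi = rng.choice([-1.0, 1.0], size=(M, N))
y = Phi @ received                        # what the device would store/ship
D = np.stack([np.roll(code, d) for d in range(N)], axis=1)  # shift dictionary
scores = (Phi @ D).T @ y                  # correlate against compressed shifts
print(int(np.argmax(scores)))             # -> 417 again, from 30% of the data
```

The random +/-1 projection is just a matrix of sign flips and additions, which is why, as the talk notes, this "compression" is computationally cheaper than a conventional text compressor.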
So just to conclude, we split GPS sensing between the devices and the cloud. We leverage the information in the cloud -- things like the satellite trajectories and the earth elevation -- and we leverage the computational power in the cloud to reduce the energy consumption of GPS sensing on the device side, as well as the amount of data we need to store, by moving most of the computation into the cloud. It turns out this way of doing things is quite powerful. We have a different project that shows this system can even work indoors in many places -- we can't handle high-rise buildings, but in shopping malls and single-floor buildings, we can design antennas for really strong signals, and by leveraging the computational power in the cloud, get locations in many indoor places. Just to do a little bit of advertisement: we are going to release this. This is the new sensor board design, which has the same GPS front end but an FPGA to do processing and a microcontroller, and you can integrate it into your own sensors. We have the web service running in Azure; probably in a few days we're going to turn it on and have people try it. Thank you.
>> Dennis Gamon: So maybe we can set up the next talk while you take a question.
>>: Can I put that device in a phone?
>> Jie Liu: It's possible. We actually had this conversation with multiple groups. The challenge is that your phone has a very integrated GPS chip, to the extent that the cell radio, the WiFi, Bluetooth, and GPS all use the same chip. And that chip doesn't give us low-level enough access to the raw data, which is what we want. We only want to log the raw data, not do any processing in the phone, and they don't have an interface right now to give us the raw data. So that's a deeper conversation. That's why we use our own GPS front-end chip in this sensor device.
>>: But there could be a model of the Windows Phone.
>> Jie Liu: If we have enough market share.
It's in the conversation, but we just don't have the power to determine that.
>>: [Indiscernible] have a phone kit now; you can sample your own phones.
>>: A question over here. I was just wondering [indiscernible].
>> Jie Liu: Indoor? When we can get a location, we see the same accuracy, about ten-meter median accuracy, and we get a location in about 60 percent of the places. Really, the insight is that we take advantage of skylights -- glass is really good, the signal can penetrate -- so we have this antenna, it's directional, and we point it in different directions to try to find an opportunity to see a satellite.
>>: Okay.
>> Jie Liu: And then we have tricks to combine different directions and run the same process to get the location. So when we get the location, it's okay, but sometimes we just don't get it.
>> Dennis Gamon: As you may remember from just a half hour ago, our topic is streaming environmental and urban data, and in this particular presentation, Professor Yung-Hsiang Lu from Purdue is going to discuss another aspect of this, which is when cloud computing meets data streams. His title is cloud computing for analyzing many data streams.
>> Yung-Hsiang Lu: Okay. Thank you. So actually, I changed the title a bit. This has a better acronym; we call it CAM2. It's a work in progress. I have to make two slight advertisements. ACM pays my airplane ticket, so I have to show what ACM is -- I assume everybody knows ACM. And also the ACM Distinguished Speaker Program; I'm one of the speakers in their program. So I have fulfilled my duty. We currently have 35,000 cameras in our system. We don't own the cameras; we find them online. I'll tell you what we can do with them. This is one of the pictures of the cameras in Europe. We have approximately the same number of cameras in the U.S. and outside the U.S. So what's the motivation? Well, why do I come here?
Because a couple of months ago, we did a study using the cloud of a company across town, and then we received an email asking, why do you pour so much data into the system? Do you have a problem? Is your cloud hacked or something? We said no, that's our intention; we're just doing an experiment. Then, before we did it with Azure, I asked Dennis, is there any problem if we try that? He said, come over and talk to us about what you're doing. So that's why I'm here. We currently have 35,000 cameras. If you spend ten seconds per camera, you'll spend more than 90 hours just looking at the cameras -- I hope my math is correct. We actually want to do continuous analysis of data to understand our world, and I'll show you a few slides of performance using the cloud and talk about our API. So what's our motivation? We want to understand the world through many, many cameras. These are a few pictures from the National Park Service. Actually, the National Park Service is very interested in working with us. We got a physical hard drive a few days ago, because there's too much data. They don't want to push it over the network, so a hard drive was actually shipped. We use publicly available cameras connected to the internet to understand the world. This is their website. There are many, many people that put cameras online for various reasons: departments of transportation, city governments. You can see traffic, snow, and so on. This camera is one of my favorites. I don't know if you can see what that is. It's from a university, and you can look at ants running around. It's kind of funny to look at. And there's a chicken farm here, and some other places. So why do people put cameras online? Many, many reasons: to inform people, to collect scientific data, to study things such as animal behavior, to attract customers, and so on. Why are images and video particularly interesting? Because looking at one single image, you can have many different interpretations.
I'll show you a few examples, and I hope you can tell us what you can find. Many people use cameras to do various kinds of studies, and many of them use public cameras. Most studies use only a few cameras. There are a few examples I can tell you of that use a thousand cameras, but at very low frame rates -- maybe one frame a day or something like that. And those systems, as far as I know, are closed systems, meaning they may give you data, but that's all you can do. You don't get the source of the data. If you want to increase the frame rate, there's nothing you can do, because the data is archived. What I want to do is give you choices, because different types of studies want different options. For example, if you want to detect wildlife, you are going to want one frame per second; otherwise, you cannot detect it. If you want to look at the weather, one frame every hour may be enough. So you want to have the option, and we want to give you the option. So I hope there's some interaction here. Imagine you have an archive or a stream of this data. What can you find? This is the dangerous part -- a few of you, I know your names.
>>: Plant growth.
>> Yung-Hsiang Lu: Say that again?
>>: Plant growth.
>>: They're sitting on the glaciers.
>> Yung-Hsiang Lu: Plant growth. Glaciers, water, what else?
>>: Animal migration.
>> Yung-Hsiang Lu: Animal migration, right? Anything else? Maybe the water level. Maybe when the snow melts. Maybe, if you are really an expert -- I am not -- you can assess the plant species. That's an indication of how the climate changes, right?
>>: Clouds.
>> Yung-Hsiang Lu: Clouds, weather.
>>: Humidity?
>> Yung-Hsiang Lu: Humidity? I'm not sure.
>>: You can.
>> Yung-Hsiang Lu: Okay, thank you. How about this one? What can you find if you look at this one single image, or a series of images from this camera? This is a shopping mall in Italy.
>>: Easy hours.
>> Yung-Hsiang Lu: Easy hours.
>>: Amount of people.
>> Yung-Hsiang Lu: Amount of people.
>>: The state of the economy.
>> Yung-Hsiang Lu: The state of the economy, very good.
>>: Fashion trends.
>> Yung-Hsiang Lu: Fashion trends, okay. Actually, a few weeks ago, a professor told me somebody predicted that in late May, light green will be the dominant color for fashion. We have about a month to figure out whether that's true. We'll see. Yes, you want to say something?
>>: I'm ahead of the game.
>> Yung-Hsiang Lu: Great, thank you. How about this one? This is a computer lab in our school. What can you find? What information can you tell?
>>: Graduation rates.
>> Yung-Hsiang Lu: Okay. So as you can see, there are many different types of information you can extract from the same set of data. How about this one? I think this is the last example I have.
>>: Velocity of water.
>> Yung-Hsiang Lu: Velocity of water.
>>: Existence of pollution.
>> Yung-Hsiang Lu: Pollution. I guess many of you probably know there was a chemical leak in West Virginia a couple of weeks ago, and the water color changed, right? So there is a lot of useful information you can get from the cameras. Every time I talk about this, people ask me this question, so before you ask, I'm going to answer: we don't do surveillance. We only deal with publicly available data, and we have been looked at by Purdue's institutional review board. They said we don't have any problem with human subject study. I want to tell you a story that is part of the motivation for this study. Some time ago -- a few weeks ago -- one of my friends called me and said, I'm stuck in traffic; can you tell me what's going on? So I went online and looked at a camera. I said, about five miles in front of you, I see flashing lights. There are emergency vehicles there, so I guess there's a traffic accident. If you can just wait and drive slowly a few more miles, you'll be fine.
And interestingly enough, about 20 minutes later, my friend told me, you were right. Now, the question is, can we automate this process? Because I had to go online myself and watch with my own eyes. I don't want to do that for everybody, right? I think we already talked about this: we can do environmental studies, city planning, traffic. And this is an example from one of my collaborators. They want to use ground images, on the left side, to calibrate aerial images. I guess the lighting doesn't make it very easy to see, but there's more of a yellow color here indicating the crops lack water, and they want to use the color to calibrate the aerial images to predict crop yield. So this is an example where you can use ground images to help you do a larger study using aerial images. Now, let's talk about how we can analyze the data, and then I'll give you a little bit of data on what we have done so far. First, you have to find the data sources. That's why we find cameras all over the world. You retrieve data and understand what's inside the data. If you want to study natural resources, looking at the highway probably doesn't help you much, but if you want to understand traffic, a camera in a national park is not going to help you much either, so you have to understand what's in the data. The value of the data determines whether a particular camera is going to be helpful to you. Then you analyze the data and obtain useful information. This slide shows three examples of different kinds of studies. If I analyze traffic, the first camera will give a better result than the second; if you want to analyze ant behavior, the second one is going to be useful. So let me try to give some idea how much data we are talking about. I hope my mathematics is correct, but if I'm wrong, you can tell me. One high-definition video camera can generate approximately one megabit per second, plus or minus.
And if you multiply that by the number of seconds in a day, that's how much data you get per day, and multiply by the number of days in a year, that's how much data you have. That's about one hard drive. So what? One hard drive is cheap. But if you look at how many cameras are sold each year, one market study suggests that about 20 million network cameras -- this does not include mobile phones -- are sold a year. Let me do a conservative estimate. Say five percent of the cameras are online, meaning you can access them publicly; that will generate about 8.6 petabytes a day. If you get one frame per minute, and suppose your resolution is not super high, you get 100 kilobytes per image. Of course, you can do all kinds of compression; 100 kilobytes, assuming you use JPEG, is a reasonable approximation. We find lots of cameras that give you more than ten frames per second, and we also find lots of cameras that give you very high resolution, but this is a good starting point as an approximation. Because there is so much data, usually the data is nonpersistent, meaning nobody stores it. It just comes, and if you don't do anything, it's gone forever. So how can cloud computing help? Well, the problem is people want different things, and that's where cloud computing can be very useful. Some people want high frame rates, some people want low frame rates. Some people want to store data forever -- I talked to some people who want to study the environment, and one said, talk to me after you have 20 years of data; I'm not interested in anything less than 20 years. Okay -- wait for 20 years after the project starts. Some needs depend on the time of day or the season. If you want to study snow, there's not much to study in late April, right? I mean here, okay. Or if you want to study traffic, there's generally not much to study at midnight. So the variation makes cloud computing great, because you pay only for what you have to.
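The camera-scale arithmetic above works out as follows. One consistent reading of the 8.6-petabyte figure is 100 KB frames at one frame per second per camera; that frame rate (and the bits-to-bytes convention) is my assumption, not a quote from the talk:

```python
SECONDS_PER_DAY = 24 * 3600

# One HD camera at ~1 megabit/second for a year is roughly one hard drive:
tb_per_camera_year = 1e6 / 8 * SECONDS_PER_DAY * 365 / 1e12
print(tb_per_camera_year)  # -> ~3.9 TB

# 20 million network cameras sold per year, assume 5% publicly reachable,
# each serving 100 KB JPEG frames at one frame per second (my assumption;
# at one frame per minute this would drop to ~0.14 PB/day):
cameras = 20_000_000 * 0.05
pb_per_day = cameras * 100e3 * SECONDS_PER_DAY / 1e15
print(pb_per_day)  # -> 8.64 PB/day, close to the talk's 8.6 petabytes
```

Either way the conclusion stands: at this scale nobody stores the streams by default, which is the opening for on-demand cloud analysis.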
So cloud computing is a great way to handle these variations in the data. Now I'm going to start talking about our work in progress. This is our system. It's a general-purpose cloud system for analyzing many video streams, and the idea is that you will be able to configure the system for your study. You can say, I want to analyze data from the following 500 cameras. You identify the cameras based on various conditions, such as geographical location, time of day, and so on. You say, I want so many frames per second or per minute, and we'll tell you if a camera cannot do that, because the camera simply doesn't refresh that fast. For example, some traffic cameras refresh once a minute, let's say, on the website, and we can detect how often they refresh. You can also say whether you want to store data or not. If you want to store data, we'll give you some information about how much data will be generated. And we'll have an event-based API for programming; I'll show you an example. So this is our system. I'll only talk about a few parts of it. We have 35,000 cameras so far, and we are still adding more. We have a database describing the cameras, and through various conditions, you can select the cameras you want to use. The data from the cameras you select is heavy traffic -- there's a big arrow here to indicate heavy traffic -- and it goes to the cloud; it doesn't go through our system. I mean, the [indiscernible]. We have a resource manager to give you a suggestion -- actually, it doesn't just give you a suggestion; it decides where to send the data, so you don't have to worry about that. Then we have a programming interface for you to write your program. This dashed line marks the places where the domain [indiscernible] will come in; they have their [indiscernible] to do. All the other parts are common across scenarios, so we handle all of them.
You don't want to repeat that effort for different kinds of studies. I'm going to talk a little bit about the resource manager part and then about the cloud part. So first, this is a camera looking at a street near my office; construction is being done on this street right now. The purpose of this example is to show that different cameras give you different types of data and different protocols. That's one of the challenges: people don't want to deal with protocols, right? So our system handles that. The example I'm going to show you is pulling data as Motion JPEG. What is Motion JPEG? You retrieve data from the cameras, and it has intra-frame compression only, so each image is JPEG compressed, but there's no compression between images. Why do cameras do that? Because it reduces the computation on the camera. Most cameras are very small; they have a tiny processor inside. It's bad for the network, because there's no inter-frame compression, but it's a good starting point as an example. Once you issue an HTTP request, the camera gives you frames as they become available, so one request will get quite a few frames. Honestly, we don't know the exact locations of the cameras, because currently we have an IP address, and an IP address is not a precise location. So we use this plot to indicate the approximate locations of the cameras, and we only show the cameras that we have measured at more than ten frames per second for the following studies. The first question we asked is, should we use a thread or a process? The answer is pretty obvious: we should use threads, because process overhead is too high. This is using Purdue computers. Starting from here, I have a few slides showing Azure data. This is an example where we tried to pull data from multiple cameras -- up to almost 500 cameras -- using an Azure Large instance: eight cores, 14 gigabytes of memory. The three curves show the data rate in megabits per second.
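As an illustration of the Motion JPEG point above -- each frame is a complete JPEG, so one HTTP response carries many frames back to back -- a minimal frame splitter might look like the following. The camera URL in the comment is hypothetical, and real MJPEG servers usually add multipart boundaries and headers that this sketch ignores:

```python
# Split a raw Motion JPEG byte buffer into individual JPEG frames.
# MJPEG has only intra-frame compression: the stream is a sequence of
# complete JPEG images, delimited by the standard JPEG markers.
SOI = b"\xff\xd8"   # JPEG start-of-image marker
EOI = b"\xff\xd9"   # JPEG end-of-image marker

def extract_jpeg_frames(buffer: bytes):
    """Return the list of complete JPEG frames found in the buffer."""
    frames = []
    start = buffer.find(SOI)
    while start != -1:
        end = buffer.find(EOI, start + 2)
        if end == -1:
            break                      # incomplete trailing frame
        frames.append(buffer[start:end + 2])
        start = buffer.find(SOI, end + 2)
    return frames

# Against a live camera this would be used roughly like (not run here;
# the URL is made up):
#   import urllib.request
#   with urllib.request.urlopen("http://camera.example/mjpeg") as resp:
#       frames = extract_jpeg_frames(resp.read(1_000_000))
```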
I guess that's why we [indiscernible] the traffic. As you can see, it's not surprising that if you put a virtual machine in the United States, you get the highest data rate. We also tried North Europe and East Asia. And going back to the earlier slide, we used cameras in the United States. A natural question to ask is, why do we even bother? The cameras are in the United States. Why would we even consider sending the data to Asia? Can anybody suggest an answer? >>: It's purely [indiscernible] with no practical information. Price? >> Yung-Hsiang Lu: Price. Because we're talking about a lot of data. If you can save a little money per hour, you can save a lot of money. If you don't have to use the high-performance option -- it depends on the application -- then you can save money. So the purpose of this study is to understand the implications of the various configurations when you configure the system. In another study, we wanted to compare the number of cameras per thread. If you have one camera per thread, one thread just grabs data from one camera. If you have multiple cameras per thread, then that thread grabs data from multiple cameras sequentially. We wanted to see which one is better, and it turns out you shouldn't use one thread per camera, at least in our measurements. The earlier data showed what happens when we retrieve the data and do absolutely nothing with it -- we just throw it away. Suppose you want to archive the data. Then your data rate drops dramatically compared to before; in this case, it drops by quite a bit, and here we only store the data, without any processing. So this suggests that for different types of studies, you have to select your virtual machines and their locations more carefully. The next example does what we call background subtraction.
This is using OpenCV, doing background subtraction. It shows an example of a street, and if a car comes in, you can separate it from the background. So you start with the background and detect a car. A few more examples here: you can find a bus, find a car, and there's actually a person here. We used the OpenCV background subtraction, and as you can see, in this case, the virtual machines in the United States and North Europe don't show much difference. So now you can see why, in the resource manager, you may want to place the machine farther away if that saves you cost. This is an example where we compare different types of virtual machines, all from North America. Background subtraction is very computation intensive and memory intensive, so in this case, you really want to use eight cores, as you can see. The A7 and A4 obviously win over the others. And once you use 14 gigabytes of memory, you stop here because you're out of memory, but if you have more memory, you can continue. So for this application, more cores are obviously better. If you want to save images, that's I/O intensive, and more cores do not help -- we haven't figured out why that's the case. If you use very few cores -- we measured this a couple of times; it's pretty consistent -- and you just retrieve without doing any processing, A5 is actually better. We haven't found a good explanation. But the point is you want to select different types of virtual machines to make things more efficient and save cost. We find frame rate is very closely related to round-trip time. Not surprising, because of TCP's built-in properties. So what do I mean? Depending on the application, you allocate virtual machines differently. A closer virtual machine will reduce round-trip time and improve data rate, but depending on your application, you may not want to do that, because it may be too expensive.
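The background subtraction the talk runs is OpenCV's built-in mixture-of-Gaussians subtractor (MOG2). As a rough illustration of the underlying idea, here is a toy version that keeps a single running-average background per pixel and thresholds the difference; the learning rate and threshold are arbitrary illustration values, and MOG2 models each pixel with a mixture of Gaussians instead of a single mean:

```python
import numpy as np

class SimpleBackgroundSubtractor:
    """Toy running-average background model: the core idea behind
    subtractors like OpenCV's MOG2, greatly simplified."""

    def __init__(self, learning_rate=0.05, threshold=25):
        self.bg = None                 # running-average background image
        self.lr = learning_rate
        self.threshold = threshold

    def apply(self, frame):
        frame = frame.astype(np.float32)
        if self.bg is None:
            self.bg = frame.copy()
        # foreground = pixels that differ enough from the background model
        mask = (np.abs(frame - self.bg) > self.threshold).astype(np.uint8) * 255
        # slowly fold the new frame into the background estimate
        self.bg = (1 - self.lr) * self.bg + self.lr * frame
        return mask
```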
There's no clear solution yet about how to allocate virtual machines. For the next few minutes, I will talk about the analysis part, because so far, I've only brought in the data; we have not done any analysis. Our analysis uses an event-driven API. I'll use a very simple OpenCV example to show what a typical program analyzing video looks like. We basically have three steps: initialization, processing, and finalization. In the initialization, you say, I'll grab a camera, or I'll grab a file from my disk. At the end, when you finish, you release your resources. In the middle, you do some processing. In this case, I get one frame, I subtract the background, and I show the frame. If you use our event-driven API, what you do is replace the while loop with an event called onnewframe. That's a name we created. So basically, when a new frame is ready, your function will be called. Essentially, you only need to write one callback function called onnewframe. You don't need to know where the data comes from. You don't need to know the frame rate. That's all in your configuration. Your program just needs to handle this event, and you do whatever you want for your analysis. You can save the result; you can do other things. The idea is we want to be able to process a very large amount of data from many cameras using a very simple program. So with a few standard changes -- that's what I did here, changing what our program looked like originally, turning the while loop into an event -- we were able to run this program across more than 1,700 cameras. We can select a frame rate; in this case, we used one frame every five seconds, using five virtual machines. We demonstrated how we can use this simple method to analyze data from many, many cameras. And 1,700 is the number we selected as a starting point.
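The event-driven style described above -- replace the while loop with an onnewframe callback -- can be sketched like this. The FrameDispatcher class is a hypothetical stand-in for the real system's runtime; only the callback body is what a user of the API would actually write:

```python
# Sketch of the event-driven analysis style: instead of a while-loop
# that polls a camera, you register a callback the system invokes
# whenever a frame arrives. FrameDispatcher is a made-up stand-in for
# the real runtime, which would be driven by the camera streams.

class FrameDispatcher:
    def __init__(self):
        self._handlers = []

    def on_new_frame(self, handler):
        """Register a callback; returning it lets this act as a decorator."""
        self._handlers.append(handler)
        return handler

    def feed(self, camera_id, frame):
        # Simulates a frame arriving from a camera.
        for handler in self._handlers:
            handler(camera_id, frame)

dispatcher = FrameDispatcher()
results = []

@dispatcher.on_new_frame
def analyze(camera_id, frame):
    # User code: no knowledge of protocols, frame rates, or camera counts.
    results.append((camera_id, len(frame)))

dispatcher.feed("cam-042", b"...jpeg bytes...")
```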
We don't want to push that to 30,000 cameras yet, because maybe we'll create too much traffic. This is the largest experiment we have done simultaneously so far. I want to acknowledge my collaborators at Microsoft and at a company across town. This one started as a [indiscernible] student project -- you can see the award on my desk. I also want to acknowledge the sources of data. Many, many sources. Some of them allow us to use the data without any conditions; with some, we actually signed data-sharing agreements. To summarize, we are building a cloud-based system to analyze data from many cameras. I've talked a little bit about our resource manager, to give you an idea that it's not so easy to determine what kind of cloud virtual machine you need. It really depends on the type of application: whether it is computation intensive or I/O intensive, and what kind of computation you want to do. We have designed an event-based API for simple programming, and I think, as Steve mentioned, there are still a lot of research opportunities. We hope to have a [indiscernible] for people to play with our system, so you can write a simple program and analyze data from many cameras, sometime later this year. Thank you. >> Dennis Gamon: So we have time for a couple of questions while John comes and sets up. >>: So there's an effect when you're using, say, a virtual machine with one core versus eight cores. The number of cores also affects access to the number of cameras, if other virtual machines on the same server share the network interface. So having many virtual machines on an eight-core machine makes me a little nervous. >> Yung-Hsiang Lu: Okay. So the question is, if we use different virtual machines, they may share the same physical resources. We have not done that measurement yet. Also, usually we don't have direct control over which physical machine it will be on.
So if you allocate two virtual machines, they may or may not be on the same physical machine. We don't have that control. Maybe -- >>: I can't give it to you either. >> Yung-Hsiang Lu: I understand you cannot give it to me. So we have not done the measurement, simply because we are not sure we would get meaningful data. >>: [Indiscernible]. >> Yung-Hsiang Lu: Maybe we can try to see how we can -- >>: [Inaudible]. >> Yung-Hsiang Lu: We can talk about that, okay. Yes, please. >>: Just a quick one for you. You said you're not using it for surveillance, but it seems like you're providing technology that could be abused. Any thoughts on -- >> Yung-Hsiang Lu: Okay. So the question is, will our system be abused by people who want to do surveillance? The data is publicly available, so if somebody wants to do that, we have no control over it. The only thing we can say is that we don't do it. >>: Do you have any [indiscernible] experiments -- are you [indiscernible] the data [indiscernible]? >> Yung-Hsiang Lu: Okay. So the question is, do we partition the cameras in any particular way over the measurements we have? No, we only choose cameras that we have measured to successfully deliver more than ten frames per second. Within that, we randomly assign cameras to virtual machines. >>: Would it not be [indiscernible] the same virtual machine? >> Yung-Hsiang Lu: It's possible, if we did a finer-grained measurement, but we have not done that. >> Dennis Gamon: Okay. Thank you once again. So now we come to the third and final talk of this particular session. You know, last night, just before I was falling asleep, I had a question I was really concerned about. In fact, I wanted to know what John could do with my location history, so I think you're going to tell us, right? >> John Krumm: That's right, yeah. Thank you. Thanks for the introduction, Harold, and thanks to Dennis and Christin for inviting me to talk.
I'm from MSR and I work here in this building, and I live not too far away, so Professor Lu's talk just now kind of caught me off guard. I almost had time to run home and put on my light green pants, but I didn't think I could make it back in time. Now to my talk. I like to start every talk I give by saying big data and machine learning, just so I get those in somewhere. I'm going to talk about location history, which is something I've been working on for a while. Location is really an intrinsic part of our minds, and this is a quote here that kind of explains why. Knowing what direction you are facing, where you are, and how to navigate are really fundamental to your survival. For any animal that is preyed upon, you'd better know where your hole in the ground is and how you're going to get there quickly. And you also need to know direction and location to find food resources, water resources, and the like. So location is a fundamental thing that's been drilled into us as important by evolution. I'm going to talk about what we can do with our location history, and this talk is divided into two parts. The first part is things that we've already done; I'll talk about some of the research we've done before. Then I'll cover some ideas that we haven't really tried yet but that I think would be fun to do with a sequence of location data. First of all, I'll talk about some of the location data we've been gathering. We've been doing this now -- this is a little bit of an old slide -- for about eight years. We've been giving GPS loggers to people to keep in their pocket or their car as they move around, and this is just a map of some of the data, made by a summer intern of mine. Most of the data we've got is from the Seattle area. In fact, there's downtown Seattle over there, and we're kind of right around there now.
So what we did is we loaned people GPS loggers -- like this one first, and then subsequently this one, getting smaller and cheaper -- and we put them in a bunch of regular vehicles that people volunteered to carry them in. We put them in paratransit vans and the Microsoft shuttles that run around on the campus here and got data that way. Okay. So I'll start talking about some of the things that we've done with this data, and the first thing I want to talk about is making maps. Here is a plot of our GPS data. So, again, that's downtown Seattle. Here's us, right by Microsoft. That's Lake Washington, and that's the 520 bridge over Lake Washington. That's the I-90 bridge right there. When you look at this, it looks sort of like a map, and it seems intuitive that you should be able to go from this plot of the data over here to a map like that over there, because it's almost a map already. So we looked at doing things like that. Why would you ever want to do that? Well, it's expensive to get that data. Navteq and Tele Atlas use specially equipped vans with trained drivers that drive around and collect the data, and then it's kind of expensive to buy it. Also, roads are changing. There was a bridge collapse not too long ago around here on Interstate 5, north of here, and we would like to get that updated quickly in the map. So here's one of the things we did. Here are GPS traces that we took from this intersection here, and our goal in this project was to try to count and locate the lanes of the roads leading into that intersection. The way we did that was we would draw a virtual line across the road like this and look at where it was intersected by the GPS traces going through it.
And if you make a histogram of that, you get something like this, and we fit a mixture of Gaussians to it. Then you can count the number of lanes by the number of Gaussians; the means of the Gaussians are the centers of the lanes, and the widths of the Gaussians are a function of the widths of the lanes. So it actually does a pretty good job of interpreting what the lanes are along the road just from the GPS data. Another thing we did was finding the intersections. Here the GPS data is drawn in white, and what we did is we made this detector that passed over the data, and it built counts in all these annular sections of the detector -- how many points are in this section, how many are in that section. And we trained up a simple machine learning binary classifier to say, based on those counts, is this an intersection or not. And here are the intersections that we found. So it actually did a pretty good job of automatically finding where the intersections are on the map. And then finally, we actually tried to build a routable road map from the GPS data we had gathered. This is raw GPS data from a kind of complicated intersection -- that's a highway there -- and the first thing we did was we clarified those traces. The way we did that was we tried to move them around: we tried to coalesce traces for lanes going the same direction and separate traces for lanes going in different directions. We pretended that each trace was a little electrostatic wire, and if the directions were the same, those wires would attract each other; if the directions were different, the wires would repel each other. You can see an example of that here. Here traffic is moving in two different directions, but you can't really tell from that picture. After we do this clarification step, it separates the lanes, and it looks pretty good. And then from that, you can just make a routable road map.
You could click on the start point and the end point, and it would compute a pretty reasonable route between the two points. There's still more to do on this project: figuring out road directionality, turn restrictions, road names, finding the stop lights and stop signs, public versus private roads, interpreting parking lots -- that would be a really challenging, fun problem -- figuring out which lanes are the carpool lanes, and express lane scheduling. So Microsoft used to have this tag line, where do you want to go today? We took it seriously, and we actually figured out where you want to go today, so we're not asking people that anymore, because we already know: based on your location history, we can predict where you're going. So wouldn't it be nice if you had this constant companion with you as you're moving around that could tell you things about gas prices, traffic, points of interest, available parking, and advertising? It wouldn't be that hard to build something like that, but it would be a lot better if it knew where you were going and could tell you about these things before you got there, so you actually had a chance to decide whether or how to take advantage of the information. You could ask, well, could you just use route planning for that? Well, if you look at the numbers, it turns out that people enter their destination when they're driving for only about one percent of their trips. So you need something automatic. Here's a little video I'll show -- I'm hoping the audio will work -- why men don't ask for directions. >>: Hey, can you tell me -- [screaming.] >> John Krumm: Okay. So we're trying to avoid that situation. Here's how our destination prediction works, and I'll stop this so I can explain what's in the image before it gets going. This is a route I'm going to drive. I'm right here.
Now I'm going to drive along this black route, and these red points are candidate destinations; they happen to be where the road intersections are. What I want to do is compute a probability for each one of those red dots as to whether I think it is going to be your destination as you drive. The way this works is we look at each candidate destination, each red dot, and we look at the partial route that you've taken so far. If that route was an efficient way to get to the red dot, then that red dot's probability is higher. If your route was an inefficient way to get there, then that red dot's probability is lower. And so you'll see pretty soon, when I start this up, that all these red dots are going to go away. Their probability is low, because to get over there, the most efficient way would be to go across this bridge down here rather than up across the top. So we'll see the red dots just kind of start to disappear as the drive goes along. Okay. >>: So then at the end, they're kind of clustering around a destination. How do you know that that's the end when you're starting out? >> John Krumm: Well, we don't know that's the end. This was kind of a lucky situation, because you're blocked by the water, right? Although there are some dots that didn't show up -- there's a ferry route; that's actually a ferry dock there. So if I showed more of this, you'd see some dots over here. But we do know how long most trips are. We have a distribution, so it's not going to predict that you're going to be driving for six hours. Probably 20 or 30 minutes, usually. >>: But this is forecasting, not prediction, because you're doing it just a few minutes ahead? >> John Krumm: Okay, sure, yeah. We also do longer term prediction, though. One application of this is picking the nearest gas station, or any kind of search you want to do -- I'm looking for pizza along the way, or something like that.
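The destination-scoring idea above can be sketched as follows: each candidate gets a weight based on how efficient the partial route has been as a way of reaching it. The Manhattan-distance grid and the exact scoring formula here are simplifications for illustration; the real system evaluates efficiency on the actual road network:

```python
# Toy destination prediction: score each candidate destination by how
# efficiently the route driven so far approaches it, then normalize.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def destination_probs(path, candidates):
    """path: list of (x, y) points driven so far; candidates: (x, y) list.
    Returns a dict mapping candidate -> normalized probability."""
    start, current = path[0], path[-1]
    driven = sum(manhattan(path[i], path[i + 1]) for i in range(len(path) - 1))
    scores = {}
    for dest in candidates:
        # progress made toward dest, relative to distance actually driven:
        # near 1 for an efficient partial route, near 0 for an inefficient one
        progress = manhattan(start, dest) - manhattan(current, dest)
        efficiency = max(progress, 0) / driven if driven else 1.0
        scores[dest] = efficiency + 1e-6    # small floor so nothing is zero
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

# A driver heading straight east makes the eastern candidate dominate:
probs = destination_probs([(0, 0), (1, 0), (2, 0), (3, 0)],
                          [(10, 0), (0, 10)])
```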
Normally when you do a search like that, the local search will just use a radius, so you might get a gas station that's behind you that you really don't want to turn around and drive to. So what we did is we used our prediction algorithm to try to find gas stations that would give you the minimum expected cost of diversion along the route that we think you're going to take. So it doesn't have you turn around. I'll show you that video, which is the same route. Just want to pause it here. All the white circles are actual gas stations, and the one it's chosen for you to stop at is the solid white circle. It doesn't do a very good job at the beginning, because it hasn't seen much of your route so far, but it gets better. The goal is to have the gas station somewhere ahead of you along your route, so you wouldn't have to drive very far off your route to get to it. And you'll see when it gets to that top peak, it's a little bit confused -- it doesn't know you're going to turn -- but then it recovers. We actually built a web service to do this, so you could do a local search on your phone as you're moving along, and it will give you search results that are ahead of you rather than behind you. Now, talking about longer term prediction, we did a project called Far Out which tried to predict where you're going to be a day from now, a week from now, a month from now, even two years into the future, with an average error of less than a kilometer, because people are actually pretty repeatable in how they behave. It was just based on time of day, day of week, and whether or not it's a holiday. It worked surprisingly well. And we're continuing this work. Here is someone's GPS data as they're driving around -- these are just little triangles that they went through on the ground -- and we're predicting where they're going to be in the next five minutes.
The average error was somewhere below 200 meters, like that, and here's the average error when predicting ten minutes ahead, 20 minutes ahead, 30 minutes ahead, an hour ahead, and so on -- the error gets worse and worse as you go up. Here is 12 hours ahead; the error is still below a kilometer. And then an interesting thing happens. When you're predicting one day ahead, the error drops back down again, because people are pretty repeatable in what they do. You're probably going to be doing -- well, not you, because a lot of you are visiting, but most people, if you're at home, are going to be doing the same thing 24 hours from now that they're doing right now. And this goes on; this is just more and more days out into the future. So it's really surprising, I think, that with a pretty simple machine learning routine, just looking at time of day and things like that, you can predict where people are going to be pretty far into the future. Okay. >>: Question? When you're doing things like weather forecasts, one of the measures of how well you're doing is comparing against a simplistic forecast. In other words, if you're going to guess tomorrow's weather, you'd say it's like today's, and you don't need a [indiscernible] to make that sort of forecast. Does the same sort of thing apply here? In other words, is there something where you can estimate how a persistence-based prediction would do? >> John Krumm: I guess the way I'd respond to that is you can think of a whole continuum of sophistication of the models, right? One might be the very simple thing, that you're going to be in the same place 24 hours from now that you are now. And maybe I can add something about, well, whether I know it's a weekend -- then I've got this kind of two-dimensional space of prediction inputs, and you can make it gradually and gradually more complicated.
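To make the baseline discussion concrete, here is a minimal version of the kind of simple predictor being described: key each observation on (day of week, hour of day) and predict the most frequent place for that slot. Far Out's actual models are considerably richer; this is only the simplest end of the continuum:

```python
# Toy time-slot location predictor: exploits the repeatability the talk
# describes by remembering, for each (weekday, hour) slot, the place the
# person was seen most often.

from collections import Counter, defaultdict

class SlotPredictor:
    def __init__(self):
        self.slots = defaultdict(Counter)   # (weekday, hour) -> place counts

    def train(self, records):
        """records: iterable of (weekday, hour, place) observations."""
        for weekday, hour, place in records:
            self.slots[(weekday, hour)][place] += 1

    def predict(self, weekday, hour):
        counts = self.slots.get((weekday, hour))
        return counts.most_common(1)[0][0] if counts else None

pred = SlotPredictor()
pred.train([(0, 9, "office"), (0, 9, "office"), (0, 9, "cafe"),
            (5, 9, "home")])
```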
So we've looked at things like, you know, you tend to go back to the same places that you've been to before, like the same grocery store. It might not be predictable in time, but it is predictable in space. You tend to go back to places that are near places you've been before, so maybe you want to go to the Starbucks that's close to the grocery store. So to answer your question, I interpret it as: you're asking what's a good baseline, a very simple baseline, and how can you measure the boost you get by adding more and more features, right? And there's almost a continuous space of different algorithms you can use for this, so it's hard to nail down one as a baseline. Next, predicting collective travel behavior. What we're looking at here is relatively long trips from your home. Can we predict places that you'll visit, given where you live, the demographics around there, and the distance to a candidate place you might want to visit -- what are the chances you're actually going to go there? So, for instance, for Redondo Beach it turns out that our results say these are the places that you're likely to visit. The darker the dot, the higher the probability. So you're going to go up to the northeast here and around in California if you live there. The way we do this is we've got a bunch of data from geotagged tweets -- about one percent of tweets or so are geotagged -- and we have access to those, and they have persistent user IDs on them, so we can actually look at where people travel. Our goal was, given where you are now and given a candidate destination, can we predict the probability that you would go there? And these are the features we use: the distance to the new place, and then a whole bunch of demographic features that we got from census data on the demographics of where you live and the demographics of the candidate destination. And then we just did a machine learning classifier on that.
So here are some results. If you live in Redondo Beach, these are the places that you'd go. There's a little close-up around California, so you go up to the San Francisco area. And down here, if you live in Manhattan, these are the places you go. You tend to stick around here but also go out to the west coast. These are flyover states -- they're really flyover states -- and that's pretty concentrated in New York. I don't know if you've seen this old cover of the New Yorker, but it's really kind of true, right? If you live in New York, you don't pay as much attention to the middle, but you do know about the west coast over here. Let's say you live in Milburn, Nebraska, population 66. We only had four total tweets from Milburn, Nebraska, so you can't really look at the data from Milburn and predict where people are going to go, because you only have four samples. But since we did machine learning on this, we look at other places that are like Milburn, Nebraska -- that have similar demographics -- and we can make a decent prediction about where people from there would go. And so they kind of go all over the country. The other thing you can do is, given a place, where do people come from to visit it? So everyone wants to go to Redondo Beach, according to our analysis. Milburn, Nebraska -- not too popular, except for people around in Nebraska. And Minneapolis, Minnesota, as an example, kind of draws people from the five-state area. Yes? >>: Does the predictor do that, or is that real data? >> John Krumm: This is predicted data trained on real data. >>: How do you verify? >> John Krumm: We can't verify for Milburn, since we don't have that much data from it. But for these, we do have a lot of travel data, so we verify with that. I didn't put in my accuracy graphs, just to save time. Yes? >>: [Indiscernible] this is a hub.
I'm wondering whether that's -- >> John Krumm: Minneapolis is a hub for Delta Air Lines, Professor Liu says, so that could affect people visiting there. It could, yeah -- I didn't know that. >>: Not in your data? >> John Krumm: It's not in our data. It's reflected in the Twitter data, right, but it's not reflected in the features we're using to make the inferences. >>: [Indiscernible] tweet I'm at the airport. >> John Krumm: We didn't look at the tweet texts. That would be another level of sophistication. And then we can look at all the -- oh, you really can't read these too well, but these are the relative importances of the features for making those predictions. The top one is distance; that's really what mostly affects how far you travel. The other ones are all demographic features. It turns out mostly race and age are predictors of the places that you'll visit. Okay. Labeling your places. The idea here is, when you tell people about a place -- let's say you've got an automatic notification for when your child arrives at home -- you don't want that notification to just give the latitude and longitude of your home, or even the street address. You'd like the message to say they've arrived at home, or they've arrived at school, or your spouse has arrived at work. So the idea is, can we look at your GPS data and automatically attach semantic labels to the places you tend to go? We start off with just GPS data, people running around -- this happens to be from a student-aged girl. Then we do clustering on the data to find out where people spend their time, and that happens to be her home, and that's her school, I happen to know. So what we want to do is automatically classify those places, and our ground truth data is kind of interesting. There's this thing called the American Time Use Survey. It's free data from the U.S.
government, and what they had people do is fill out a survey, a diary where they kept track of everywhere they went for a day and what times they went there. So: I was at home this length of time, and I was at work this length of time. You can see across the bottom here -- this is zero to 24 hours -- the times when people were at home. Most people are at home around midnight. Then at noon, you see kind of a peak of people being at work, and the next highest one is someone else's home. These are all the different kinds of places that people kept track of in that data. So that's a great source of ground truth data for machine learning, and when we do that, here are our accuracies. Here are the actual places people were -- this is a confusion matrix -- and that's where we inferred they were. So for home, we're 91 percent accurate. Work is 83 percent, school is 88 percent. And then not so great for the places that you go less often. Location privacy. It turns out, when you look at the literature, people have done surveys, and people don't care that much about location privacy. They're willing to give up their data. That's a pretty consistent finding in most of the surveys that I've seen. One thing we were interested in is, what if people had actual location data at stake? If you were doing this survey and I held some of your location data, would you be so willing to give it away? So we collected GPS data from 32 adults in 12 households -- non-Microsoft people who were drawn here -- with two months of data for each of them. After we got the data from them, we showed them a map of their data and had them fill out a survey asking about their privacy preferences. One of the things we asked was: can we take your data, anonymize it -- but with a persistent ID for all your GPS points and a time stamp -- and put it on a publicly accessible website?
Two-thirds of our participants said yes, that's okay. So now that data is actually on a public website for anyone to download and use for their research. So I was kind of astounded by that, that big a number of people were willing to share their data. We said, would you trade your data back to Microsoft if we gave you a location-based service in return? Every one of the 32 people said yes, they would. And these are the location-based services that we picked -- we gave them just a list. Help determine where the bus routes should be: so that's an altruistic service where we'll look at traffic data and figure out, well, we should put a new bus route there. Tell you about traffic jams before you get there. Tell drivers where traffic is slow. Control your home thermostat to save energy. But everyone was willing to trade their data away for a location-based service. So pretty cheap. Another little study we did on that data was, can we take your anonymous data and figure out who you are? And it turns out that you can. If you can figure out where the person lives, which is where they spend most of their time, then you've got a latitude, longitude of that, and you can reverse geocode it, find their street address, and then put that into a reverse white pages lookup and figure out their identity. So we can do that, and then what we did is we started to obfuscate the data. So a lot of privacy advocates will say, well, you can add noise to the data to preserve privacy or discretize it somehow. So we did those things and we ran our privacy attack again to see if we could identify who the people were. And these are the different obfuscation techniques that we looked at. But in general, it turned out that you had to obfuscate the data so much before the attack wouldn't work that the data was really useless for a location-based service.
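The first step of that re-identification attack -- guessing the home location from the raw points -- can be sketched roughly like this. The grid size, nighttime window, and function name are my own illustrative choices, not the actual attack code:

```python
from collections import Counter

def estimate_home(points, cell=0.001):
    """Guess a subject's home from anonymized GPS fixes: snap each
    nighttime fix (10pm-6am) to a roughly 100 m grid cell and take the
    most common cell.  The attack described in the talk would then
    reverse-geocode this lat/lon to a street address and run a reverse
    white-pages lookup (not shown here)."""
    night = [(lat, lon) for lat, lon, hour in points if hour >= 22 or hour < 6]
    cells = Counter((round(lat / cell), round(lon / cell)) for lat, lon in night)
    (la, lo), _count = cells.most_common(1)[0]
    return la * cell, lo * cell

# Two overnight fixes near one spot outweigh a single daytime fix elsewhere.
fixes = [(47.6420, -122.1370, 23), (47.6421, -122.1369, 2), (47.7000, -122.0000, 13)]
home = estimate_home(fixes)
```

This is also why the obfuscation result follows: adding enough noise to push the nighttime mode out of the true home cell degrades every query that needs street-level accuracy.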
It would work maybe for giving you the weather or something like that, but not for something like, say, tell me if one of my friends is close by or tell me when the next bus is going to get here. So obfuscation of the data is not a very good privacy technique. Okay. So I'm done talking about what we have done with data, and now I just want to speculate a little bit about what we could do with some of this data, things that we haven't fully tried yet. One is personalized routing. We've done a little bit of this. We've looked at the routes that people take. Here's someone who drove from Point A to Point B here along the green route, and then we planned routes between those two points. One was MapPoint's route, and we looked at the shortest route and the fastest route, and you notice they didn't take any of the routes that we planned. And it turns out that when we looked at the data, 60 percent of the people were taking neither the shortest nor the fastest route. So it made us think that people have different criteria for picking the route they want to drive. And so we made a new router that was sensitive to traffic speeds but also tended to favor roads that you've driven on before, thinking that you're in the habit of driving them -- you like these particular roads, so you want to go back to them. So after making that simple change to the router, the new routes that we planned matched the routes that people took a lot more often. But going forward, you can imagine that people have a lot of different criteria. You know, when you go to Bing Maps or Google Maps, it optimizes for time or distance, right? But maybe there are other things that come into your calculation, and basically, these routers are minimizing cost.
So maybe that cost is a function of driving time, number of left turns, complexity, scenery, traffic lights, and you have a different weighting for all these different factors. So by looking at your driving data, maybe we could figure out what weighting to apply to all these different factors and then make a route that works best for you. One of the factors could be safety. So here are traffic fatalities from 2001 to 2006. They imply which areas might be more dangerous to drive in, and so you might want to avoid those. So let's say you're planning a trip from Pittsburgh to Detroit, and now we add this slider, which is the probability of death along your route, okay? And you're free to choose any setting you want on that, you know. We haven't done the research, but it may be that the more risk you're willing to take, the faster your drive will be, right? But down here, if you want to be safer, maybe it will take longer. But it would be interesting to do something like that. Collaborative filtering. This is one of the first things we thought of when we were starting to take data. So you know, when you go to Amazon to buy a book, they'll suggest other books for you to buy. But you can also imagine doing the same thing with places that you go. Let's say you're new to a city and you visited this independent book store and this independent coffee shop. Well, based on where other people have gone, if they've gone to these two places, the system could maybe automatically recommend that people who like these places also like this independent theater. Someone has done a little bit of research on that over at the University of Washington -- Jon Froehlich, when he was a Ph.D. student there, found this actually might not work all that well, because it turns out that people, when they're picking places to go, often go there not because they like it but because their friends are going there or because it's close by. So it's a little bit different from buying a book on Amazon.
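The personalized router described above boils down to scoring each candidate route with a weighted cost and picking the cheapest. A toy sketch, where the factor names and weight values are purely illustrative -- the real system would fit the weights from a driver's observed route choices:

```python
def route_cost(route, weights):
    """Personalized route cost: a weighted sum of per-route factors."""
    return sum(weights[k] * route[k] for k in weights)

# Two hypothetical candidate routes between the same endpoints.
routes = {
    "fastest":  {"minutes": 22, "left_turns": 6, "traffic_lights": 11, "unfamiliar_miles": 5.0},
    "familiar": {"minutes": 26, "left_turns": 3, "traffic_lights": 7,  "unfamiliar_miles": 0.0},
}
# A driver who strongly dislikes unfamiliar roads: high weight on that factor.
weights = {"minutes": 1.0, "left_turns": 0.5, "traffic_lights": 0.2, "unfamiliar_miles": 2.0}
best = min(routes, key=lambda name: route_cost(routes[name], weights))
# With these weights, the slightly slower but familiar route wins.
```

The safety slider fits the same frame: risk along the route becomes one more weighted factor, with the slider setting the weight.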
I think this is the last one I want to talk about. Crowd sourcing. So looking at data from other drivers, what can you do with it? Well, you know, this company Waze that was bought by Google recently, they used the data from the Waze navigation apps running on people's smartphones to find traffic speeds and to report speed traps and gas prices. But you can maybe even do more with that, by just instrumenting the vehicle. So what if you could detect places where there's sudden braking? That might mean there's an accident there. Or the anti-lock brakes have been activated, indicating a slippery road. Windshield wipers being on means it's raining, so you could track precipitation as it moves through the area, or look at sudden suspension jolts to figure out where the potholes and rough roads are. So I think that would be an interesting thing to try. Okay. And in the interest of time, I'm going to skip over this last one talking about activity inferences. And just to conclude, this is what I've talked about: some things that we've already worked on, and some ideas for things we could work on in the future. And that's the end of my talk, so thanks for listening. >> Dennis Gamon: Questions? >>: So how specific are the features and the sort of machine learning machinery behind it to humans, versus if I apply that to GPS, the baboon data? >> John Krumm: I'll repeat that question. How specific is the machine learning that we've done to humans, and, you know, maybe how well would it work on baboon data or any kind of -- >>: Or zebras or animals in general. >> John Krumm: Right. That's a good question. The first bit of location prediction that I did, the short-term prediction, assumed you were on the road network and that you had an efficient route planned to where you were going. And I suspect that -- well, animals don't follow the road network, obviously, and you probably don't know the paths that they follow, necessarily.
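The sudden-braking idea is essentially event detection on speed traces. As a rough sketch -- the trace format, threshold, and function name are all my own assumptions, not anything from the talk:

```python
def hard_braking_spots(trace, threshold=-3.0):
    """Flag positions where deceleration exceeds a threshold (m/s^2).
    `trace` is a list of (time_s, speed_mps, position) samples from one
    vehicle; aggregated across many instrumented vehicles, clusters of
    such events might indicate an accident or a slippery stretch."""
    spots = []
    for (t0, v0, _p0), (t1, v1, p1) in zip(trace, trace[1:]):
        accel = (v1 - v0) / (t1 - t0)
        if accel < threshold:
            spots.append(p1)
    return spots

# One-second samples: steady driving, then a sharp drop from 14 to 6 m/s.
trace = [(0, 15, 0), (1, 15, 15), (2, 14, 29), (3, 6, 35), (4, 6, 41)]
```

The wiper and suspension signals would follow the same pattern: a per-vehicle detector plus spatial aggregation across the fleet.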
So I don't think that technique would work very well, but the longer term prediction that we're doing pays no attention to what's on the ground. It doesn't pay attention to businesses, which we've done in other contexts before. It just looks at your specific behavior, places that you've gone to before as a function of time of day. So it seems like that would probably apply better to wildlife. Would you agree? >>: Although animals do follow roads. >> John Krumm: Yeah. They do? >>: We can talk. >> John Krumm: Okay. Interesting. Yes? >>: The last point you made about instrumenting vehicles to detect conditions, environmental conditions on the roads -- this is an area of inquiry that's in an advanced state of development, and perhaps we can have a talk about that. So they've done some experiments in the field with looking at windshield wiper settings to get realtime information back. You have to have a car which will indicate the windshield wiper setting, but we have systems in development like that. So perhaps it's a [indiscernible]. >> John Krumm: Good. Sounds like a lot of fun. >>: Is there bias in diary data? I remember when I was in school a long time ago, people were keeping track of where they went. They reported going to the dentist more frequently than going to a bar. >> John Krumm: Oh, really? >>: I guess [indiscernible] or survey data. It's always the problem with any survey: is the survey biased in some way? >> John Krumm: Yes, so the question is, are these diary studies biased? And I don't know, honestly, and I'm not sure how we could tell. Maybe you could instrument some of the people, right, and figure out whether they were doing that. But even if you had GPS coordinates in the car, which we do a lot of times, you don't park at the exact place you're going. So you're going to the strip mall, right, and you go to the coffee shop, but right next door is the diet center. We don't know which one you really went to. So it's even hard to tell from GPS data.
But yeah, that's interesting about the dentist versus the bar. >>: Do you keep track -- this is two questions. Do you keep track of how far people drive between gas stations? And the second part of that is, do you predict -- you look at a gas station in front of you, you can predict you're going to run out of gas within 50 miles, so you'd better pull over now, even if the next gas station after this one is very far away. >> John Krumm: Right, that's a good idea. So yeah, we haven't looked at distance between fill-ups, but that would be maybe a good proxy for how empty your tank is, right? And then you could recommend a gas station at the right distance to fill up your tank. And that leads us to -- we thought about a thing called satiation modeling. Every once in a while, you need gas, right? You need an oil change. You need a haircut every once in a while, right? So if you've just gotten gas or just gotten a haircut, then we probably don't want to tell you about places where you can do those again very soon, because you're just not going to need that. So it would be fun to look at the long-term intervals between different things that you need. >> Dennis Gamon: Last question? >>: So here in Seattle, you guys have the Microsoft Connector sort of all over the city with this realtime GPS data. And I'm trying to avoid traffic a lot. Bing, for example, if I route from here to my home, will tell me, like, some estimation of different routes based upon, I'm assuming, past data collected for a similar day and a similar time. Are there any efforts to integrate in realtime information, doing that analysis, like, immediately when I'm looking at it -- or maybe that's already being done. So there's an accident on 405 and the Connector is sort of sitting here in traffic, and so I know I should take I-90. >> John Krumm: Right. So looking at realtime traffic data to help you route -- my boss, Eric Horvitz, actually has some work on that.
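The fill-up-interval idea from that exchange could look something like this satiation-style heuristic. The function, data format, and 80 percent threshold are my own illustrative assumptions, not anything the speakers built:

```python
from statistics import median

def should_suggest_gas(fillup_odometer, current_odometer, fraction=0.8):
    """Use the median odometer gap between past fill-ups as a rough
    proxy for tank range, and only suggest a gas station once the
    driver has covered most of that typical distance since the last
    fill-up -- a simple 'satiation' model: don't recommend gas the
    driver just bought."""
    gaps = [b - a for a, b in zip(fillup_odometer, fillup_odometer[1:])]
    typical = median(gaps)
    driven = current_odometer - fillup_odometer[-1]
    return driven >= fraction * typical

history = [0, 400, 790, 1200]      # odometer readings at past fill-ups (miles)
should_suggest_gas(history, 1250)  # just filled up: too early to suggest
should_suggest_gas(history, 1600)  # ~400 miles since last fill-up: suggest
```

The same interval logic would transfer to oil changes or haircuts, just with different typical gaps per need.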
What he was doing was predicting when traffic jams will form and when they'll dissipate based on past data, and he was also looking at, can we infer what the traffic on the side streets is, where you don't have the traffic loops that are measuring the traffic, based on the traffic that's on the highway, where you are measuring. So if there's some correlation between those two things, then you can infer the traffic even where you don't have sensors. So in that sense, yes, we've got some work around realtime traffic estimation. >> Dennis Gamon: Thank you, John. >> John Krumm: Thank you.