The CarTel Automotive Mobile Sensor Networking System
Samuel Madden

>> Michel Goraczko: Yes, so we have Sam here today from MIT, he joined MIT in 2004,
he'll be talking about the CarTel project. Sam.
>> Samuel Madden: Thanks, Michel. Thanks, guys, for having me. So I'm going to talk today
about the work we've been doing for the last I guess three years now on this CarTel project.
This is a joint project with my colleague Hari Balakrishnan, who I'm sure many of you know, and a
giant pile of students and staff people and other folks. Every time I give this talk I sort of have to
think about all the people and I have to add new names to this list.
Anyway, so the motivation for the CarTel project sort of started -- we spent a lot of time doing sensor
networking research, both at MIT and, before I was at MIT, at Berkeley, and a lot of the focus
there was on these sort of small scale deployments where you put a few sensors in a building or
a space. And one of the things we started thinking about is how
you can do deployments that really sense a wide area, like a whole city.
And so why would you want to do that? Well, there's lots of things you might want to do. Civil
infrastructure monitoring, like measuring what roadways look like or what water pipelines look like
all throughout a city, so on and so forth. Road surface conditions is sort of a related thing,
monitoring black ice, probably not generally a problem you guys have here. Some of these
visual mapping applications, those sorts of things that Microsoft and Google are deploying, where
they take pictures of roads everywhere. Or monitoring traffic throughout a whole city.
One way you might think about doing this is some sort of a wide area static sensing deployment,
like putting inductive loop sensors in roadways in order to measure traffic everywhere. Of
course, that's costly both to deploy and maintain, right? It takes a huge amount of effort to do
that.
So the observation is -- and again, you know, Google and Microsoft taking pictures of, you
know, streets is an instantiation of this -- a lot of applications don't need data at some incredibly
high temporal fidelity. So for example, for measuring traffic you might be okay with knowing once
per hour what every road looks like, or once every 15 minutes what every road looks like; you
don't need continuous sensing all the time.
That's sort of our insight, or what we're doing in this project: looking at the use of mobile devices
for sensing. One challenge with mobile sensing is that one way you might
do this is to go, for example, buy 100 cars and put them out on the roads and have them drive all
around all the time. But that itself is going to be pretty costly. So the particular thing
we're going after in the CarTel project is what we call opportunistic mobility, and that's the slogan
on the front slide: making sense of your drive to the store. Can we use the mobility that people
already have in their everyday lives in order to sense things about the world in some way?
So that's what CarTel is about.
There's two sort of obvious ways you might want to do this. One is by using cell phones, and the
other one is by using cars, because these are things that people naturally have, that they move
around with all the time.
In this project we really have focused on cars, partly just because cars are this very attractive
platform for deploying stuff, right? It doesn't rain inside of cars, you have power inside of cars,
and there's this onboard sensor network inside of cars that lets you
measure all kinds of things about what the car is doing and what it's experiencing. Okay?
So the sort of first question that I want to lead off with is just talking a little bit about
what's the system architecture, like what's the thing that we built that allows us to get data from
cars that are moving around. And then I'll talk, probably more of the talk will actually be about
what we've actually done with this platform. So as I said, we built this thing, we've had it running
for about three years on a number of cars, and have used it to do some things that I think are kind
of neat. So I'll just tell you about that.
Any questions about anything at this point? No, okay.
All right, so CarTel is a mobile sensor computing system; basically think of it as a tool to answer
questions about spatially diverse data sets, if you like, that are collected from these mobile
devices. It both focuses on the collection of the data -- so,
collecting traffic flow information from roads -- as well as allowing you to do some processing
and ask questions about the data that you have collected.
And so with that sort of goal in mind, you can at a very high level decompose CarTel into
three core tasks: some piece of software that does the collection and the processing of data;
something that does the delivery of data, that is, how you get the data off of these mobile
devices into some centralized infrastructure where you can look at it; and then some visualization
or analysis tools to allow you to sort of process the data.
Okay, so just to sort of give you specifically how these three layers are instantiated in the CarTel
project: at the lowest layer, the stuff that's actually running on the cars, there's a collection of
embedded hardware, and this just gives you a picture of some of the earlier generation devices
that we've been deploying. This is just a little embedded, basically access point device from
Soekris; it just has a wi-fi radio and some flash storage, so on and so forth, a GPS, and an
interface to the onboard diagnostics network inside of the cars. In some cases webcams.
We have a second generation deployment that we're doing that's based on a much smaller and
cheaper access point device. Has no GPS, does all of its localization via wi-fi, so it uses wi-fi
both for localization as well as for data uplink.
We also have another box that we deployed in some of the cabs that -- a cab test bed that we've
been running that I'll tell you about in a little bit.
So this is sort of running on the cars on the roads, data is being transferred up over the internet.
In the CarTel project, we tried to be sort of agnostic about the actual data uplink, rather than insisting on one
particular way of getting data off the cars. So in particular, we didn't want to insist that every car
that's out there on the road have a cellular data plan available to it to get data off, which would be
sort of the obvious way to get data off of things that are moving around on roads.
One of the things we spent quite a bit of time doing, and I'll talk about this, is investigating the
question as to whether the currently deployed wi-fi networks in the world are sufficient as a
sort of data uplink for a lot of these mobile data applications. I'll talk about some of the
technology that we've developed to make it so that cars can rapidly associate with
wireless networks as they drive around.
Then the third piece of this is some sort of stuff that runs on the web that users interact with. We
call this the portal. At the middle layer we have this thing we call Cabernet -- so, cab,
cab network -- which is the carry-and-forward networking system. If you look at our papers,
there's something we used to call CafNet; Cabernet is sort of the second generation of this
networking layer.
Then there's ICEDB, which is intermittently connected embedded database, which is really the
data collection abstraction that runs down on the cars.
So I'll talk about the three pieces of the system quickly. All right, so just to sort of give you a little
bit more road map: I'll talk about the three pieces of the system, I'll talk about deployments and
the case studies that we've done, and then I'll talk a little bit, if I have time, about
some of the things we've done to help users actually manage the data that has been collected
from these things.
So part of my own, sort of, research agenda with this -- I'm coming from the
database community, and one of my agendas is to build tools that make it easier for people to
actually manage all this data that's coming from things like cars out on the road. So I'll talk a
little bit about that at the end.
All right, so the portal really is a pretty generic architecture.
There's some web server, there are some collection of applications that run inside of the web
server that provide these applications to users, like a traffic application or a wi-fi
application that allows you to visualize wi-fi. These applications can retrieve data from this thing
that we call the ICEDB server. This is really just basically a database, except that it
allows applications, if they want to, not only to query historical data, but to pose queries that
request that cars on the road deliver data continuously. So the traffic application, for example,
can register itself with the ICEDB server and say deliver me GPS information once every three
seconds.
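As a rough sketch of that interaction -- with an invented server API and query syntax, since the talk doesn't give the exact CarTel interface -- registering such a continuous query might look like this:

```python
# Hypothetical sketch: how an application might register a continuous query
# with the ICEDB server. Class, method, and query syntax are illustrative,
# not the actual CarTel API.

class IcedbServer:
    def __init__(self):
        self.continuous_queries = []   # queries waiting to be pushed to cars

    def register_continuous_query(self, app_name, query_text):
        # Cabernet delivers the query the next time each car gets connectivity.
        self.continuous_queries.append((app_name, query_text))

server = IcedbServer()
# The traffic application asks every car for a GPS fix every three seconds.
server.register_continuous_query(
    "traffic",
    "SELECT lat, lon, speed FROM gps SAMPLE EVERY 3 SECONDS DELIVERY ORDER fifo",
)
```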
And then there's some collection of data visualization tools that we provided that allow users and
applications to sort of overlay data on maps and other things like that.
Then the way the communication works is the ICEDB server uses the Cabernet system to receive
data from cars, as well as to send data to cars that are out there on the road. And the real thing
that Cabernet is doing for us here is dealing with disconnectivity. We're not making an
assumption that the cars are always connected, so therefore you need to have some way
to allow cars to pop up and register themselves and get data, and to buffer data on cars so
that when they connect they can deliver it to you, and so on, so forth.
Of course, inside ICEDB there's also just a relational database that stores all the historical
data. And most of the applications that we
build are actually just querying historical data most of the time. The continuous queries are used
to sort of configure things and set up the data that streams in, but for the most part applications
don't really want to change the data that's being collected in realtime. At least in the
deployments we've done.
I'll just give you a little demo of sort of this portal software. I'm going to -- this is some sort of
old data that we collected, but this is data from Hari Balakrishnan, so we're going to compromise
Hari's privacy all day today. He's generously allowed me to do this in this talk. I didn't have a
car for much of the time that we were doing this, so --
>>: (Inaudible)
>> Samuel Madden: I'm not going to -- I would love to talk to you guys about privacy, actually. I
mean, I think it's a real concern here. I'm not going to go into much detail about it. The
standard story that I tell people is, well, Hari logs into the system, and this is what he sees. So if
all we're doing is just showing Hari his data, I'm not sure that the privacy question is huge. When
it starts to get really problematic is when you talk about -- which we'll talk about later -- when I
take Hari's data and synthesize it together with everybody else's data to do interesting things,
how do we make sure that we're not compromising Hari's privacy when we do that.
Like if I give traffic information about roads that Hari drives on, you can probably learn quite a bit
about what Hari does. If I know -- if there's only a small number of cars inside of the system and
I know that Hari lives up here in Winchester, I can figure out things like when does Hari leave the
house every day, from that data. So that's definitely a problem.
Okay, so we've got -- this is just a very simple visualization that allows you to do things like -- it's
just running on top of Google maps. It allows you basically -- these two blue boxes define
regions that we want to find roads within. So I said I want roads that go from here to here, drives
that go from here to here, and this is just showing me all those drives. I can, you know, click on
one of these and see where that drive specifically goes.
There's some sort of interesting things you can observe about this. This is the duration. You
see most of the time this is a 20 minute commute. Some days, like here, it turns out to be a
40 minute commute.
>>: So you have some definition of trip segments, which is start and end points. Say you go
somewhere, you stop for five minutes or one minute, and then you don't --
>> Samuel Madden: In the case of an individual user driving around, this turns out to be pretty
straightforward. Every time the user turns the car on and off, we use that to delineate drives; we
do record that information.
It gets a lot harder -- we've been doing this now lately with cell phones, and we also have this taxicab
deployment. Both have these issues, that cab drivers will, for example, sit somewhere -- this is
extra bad for our cabs, because the cabs are all Toyota Priuses, so the cabdrivers will just
sit there, and they won't actually turn the engine off, they'll just sit someplace for a long time.
And sort of how do you decide, is this a new drive or is this just the cabdriver sitting somewhere
for a long time. So figuring out how you sort of delimit drives is hard.
Cell phones have the same problem, because the phone doesn't turn off. So you end up doing
things like, well, we could use the accelerometer to determine if the user has been stationary for a
long time. And then we'll sort of segment drives manually, in some sort of heuristic way.
Fundamentally this is like -- or you could give the user a button that says I started or stopped
doing something new. But it's hard to figure out how to actually split these up. For now these
are just split up from car turning on and off.
The sort of thing you would want to do is zoom in on this and see a little bit more information. So
this is now the same drive -- Microsoft has really good network connectivity, by the way. This
thing is really snappy here to MIT. You can zoom in on this and see a little bit of detailed
information about the drive.
So you see here, you know, so you see he leaves MIT and he goes on this sort of broad way up
here, and you just see that -- this is one of these drives where it took him 43 minutes. And you
just see he gets stuck at these lights, basically, he just backs up and waits for a long time to get
through an intersection. So and then this is a plot over here of his speed versus time.
So one of the sort of cute things about this, Hari had sort of convinced himself that this back route
was like the best way to drive from his home to work. And we actually have a little analysis that
we did where we made him drive on the freeway like 20 days and we made him drive on this road
like 20 days, and basically it turns out that the freeway is just always a better option. Even
though you feel like the freeway is just awful, it turns out that on average at 5:00 in the afternoon
it's like 10 minutes faster to take the freeway than it is to take back roads. Even though it takes
you -- you wait for five minutes to get on the freeway, but then when you finally do get on it you
actually move pretty quickly. Whereas here you're sort of moving, but you're just moving very
slowly for a lot of time.
>> : -- how much gas it consumes.
>> Samuel Madden: Absolutely. So actually I'll show you -- we have a little traffic portal. One
of the things we're doing that this is using is to do traffic planning. We actually have a little thing
that will say here's how much fuel we think that it will consume and how much carbon dioxide we
think you'll emit when you -- you know, depending on this route.
The other thing I said you'll want to do is this is supposed to be designed to be kind of a generic
visualization interface, so we can do things like overlay engine RPMs, so this will color code his
trace by engine RPMs instead of color coding it by speed. We can also see all the wi-fi networks
that he observed along this drive.
Again, as I mentioned, one of the things we're interested in is investigating the use of wireless
from moving cars. So this is just a log of all the wireless that he could see. In this case red
means we saw it but we didn't associate with it, yellow means we were able to associate with it.
And then we in this drive had the feature that actually caused the car to try and transfer data over
the networks turned off, but in some of these you'll see that things will be green, which means we
actually sent data back from it. Yes.
>> : -- RPMs, can you actually tap into the car's network?
>> Samuel Madden: The car has this onboard diagnostics network. So every car sold in the
United States since 1996 has basically been required to have an onboard diagnostics interface,
and this is primarily for emissions testing, so most of the data that comes over it is things like the
status of the -- you know, oxygen sensor or something. But every manufacturer has its own set
of basically bits of information that it exposes over this thing.
So if you look in your car, typically under the driver's side, below
the steering wheel, there's a plug there that looks kind of like a parallel port.
It's, you know, about that big, and you just plug into this -- you can buy these
solutions off the shelf. Michel knows quite a bit about it, if you want to talk to him.
>>: Here in Washington state, the (inaudible) car for a second, this is exactly the port.
>> Samuel Madden: This is just an example from Seattle, so Michel had one of these boxes for
awhile driving around. He in this case put a camera on it, and you can actually see -- you can
get Google maps to -- Google does a very poor job of rendering imagery, but you can see what
he was seeing out the front of his car as he was driving around. So that's the visualization
interface.
So sort of the next two pieces I want to talk about now are actually how we do the data collection,
and then how we do the networking. So ICEDB is this intermittently connected embedded
database. The idea, or our thought, was, well, a relational model is sort of a convenient way to
express what data you would like to collect from these cars. I'll give you a couple examples of
that.
The idea is users write queries in a simple extended version of SQL. This is a continuous query
processor, which means these queries specify not what you want to do with data stored in the
database, but what data you would like to collect. And it's distributed in a really very simplistic
way, which just means users write queries at the server, queries get sent to all the cars, and then all the
cars send their data back, whatever data they want. So we're not somehow
partitioning the query up into a bunch of different little pieces, where different pieces are running
on each car.
So one of the challenges we wanted to deal with in the ICEDB world is the fact that bandwidth is
variable. So you've got this data that's streaming back continuously, but the connectivity that the
car experiences varies over time. Because you go into tunnels, or you're in a region where
there's not much wireless connectivity, or your cell phone drops out. So you can't assume that
you have this sort of regular, very continuous connectivity that you can sort of use to deliver data
at regular rates.
So that means, first of all, you need to buffer the sort of query commands. If I issue a query, I
may not be able to actually get that query to a car for several seconds. I also need to buffer
query results, so a car collects information over time, it doesn't necessarily have the connectivity
to deliver that information right away.
Also, one of the things we wanted to support was for the car to store data locally that users
could then come back and drill down into and query, rather than necessarily sending all of the raw data
off at a high rate from the cars.
And then finally the thing that we really focused on in the ICEDB paper and evaluation is we
wanted some way to be able to prioritize the results that cars collect. So the idea was that not
every piece of data that a car collects is created equal, and so within the query language we
provide support for users to attach simple user-defined priorities to the data items, which the
system takes into account when it's deciding what to transfer at a given point in time. And I'll talk
in more detail about that.
Just to give you a high level picture of how this works: there's an ICEDB server, users issue queries
to cars, and cars stream results back. Zooming in on what happens on the cars, there's some
collection of sort of sensor hardware here. The sensors are talking through adapters; these
would be like drivers that let the car sample GPS, for example. And then the data is sort of going
to be stored in these output buffers, and the Cabernet system is going to be taking data from the
output buffers and delivering it when connectivity becomes available.
In the middle of this there's sort of two data paths. One is the continuous query data path, where
data just streams from the sensors into the buffers and then out through Cabernet, and the other
piece is what we call the ad hoc query processing piece. So data comes from the sensors, gets
stored in a local database on the cars, and then the user can issue queries externally over that
stored data.
These two data paths help cope with limited bandwidth. Again, the idea is you can store
more data locally than maybe you can transmit continuously off the car.
Okay, now I'm going to talk about the prioritization issue, which I think is sort of the most
interesting one. So the motivation here -- again, there's a couple different motivations. The
first one is what we call intra-query prioritization, which is, within a query, imagine I'm collecting
some data, say for example about a car's position over time, right?
If you imagine you're in a bandwidth constrained environment, so imagine at each one of these
points you have not just position, but say a photograph or something, right? So if you think
about a bandwidth constrained environment, if you just take that data and you stick it into a FIFO
buffer, the problem is that you're going to get a whole bunch of data about the very first part of the
drive, but you're not going to get any information about what's happened during later parts of the
drive, if there's not enough bandwidth to transmit all the information.
So we wanted to provide a way for users to specify simple policies about what data is more
valuable than other data within a query. You can specify these things like this: we
basically added, at the end of every SQL query, a clause that says something like delivery
order by first in first out, so FIFO; that would be the default ordering that all these things
would get.
But something you might rather do is what we call bisect here, which is you take all the data that's
in a buffer, in the buffer at a given point in time, you send the first point and the last point, and
then for each of those segments you recursively bisect them. So you can specify delivery order
bisect here.
It turns out that there's no way to get the standard SQL order by clause to actually do this in the
right way, because what you want to do here is to reprioritize all the data every time a new data
entry comes in. The SQL order by clause just orders the data by some static value of every
record.
Now, our idea here is that this bisect is not special -- we provide a library of these kinds of
prioritization functions, but really the user is free to write their own simple
prioritization function that takes the records in a buffer and reorders them however they would
like. So we provide bisect as a default, as well as some other things like random sampling or
most recent value that you can use if you want, but you can also specify
your own functions.
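As a rough illustration (plain Python, not the actual ICEDB code), a bisect-style reordering function over a time-ordered buffer might look like this:

```python
# Illustrative bisect-style prioritization (not the actual ICEDB code):
# reorder a time-ordered buffer so its endpoints go out first and the rest
# follows in recursive-bisection order, giving the server coarse coverage
# of the whole drive even if the connection dies partway through.

def bisect_order(buffer):
    if not buffer:
        return []
    order, seen = [], set()
    ranges = [(0, len(buffer) - 1)]        # index ranges still to split
    while ranges:
        lo, hi = ranges.pop(0)
        for idx in (lo, hi):
            if idx not in seen:
                seen.add(idx)
                order.append(buffer[idx])
        if hi - lo > 1:
            mid = (lo + hi) // 2
            ranges.append((lo, mid))
            ranges.append((mid, hi))
    return order

records = [f"t{t}" for t in range(10)]     # ten timestamped readings
print(bisect_order(records))
# -> ['t0', 't9', 't4', 't2', 't6', 't1', 't3', 't5', 't7', 't8']
```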
So this intra-query prioritization is great -- it lets us deal with the case where, on a particular
car, some data may be more valuable than other data. But
just within a single car you may not have sufficient information to know how valuable your data is,
right? Because the data that I collect on a car may be less valuable if other
cars have already collected it.
So for example if there's lots of cars that drive on the freeway every day, right? It's not
necessarily valuable for every one of those cars -- you want some information about the freeway,
but it's not necessarily valuable for every car to transmit information about the freeway.
So the idea in the global prioritization scheme again is very simple. This is just sort of a
schematic of how it works, but the idea is that rather than transmitting all of
their raw data as soon as they get connectivity, the cars apply a user-specified summarization
function that summarizes the raw data that the car has. And then the
server takes all these summaries of the data the cars have,
and basically orders them and tells each car in what order the summarized values -- the data --
should be transmitted back. So the server can take into account the sort of global information
that it has from all the cars.
So I'll just give you a simple example. Imagine the cars have a bunch of imagery about the
world that they've been driving around in. What the cars can do is summarize
this data in a very coarse-grained way, by gridding the whole world up into a bunch of
grid cells and saying these are the grid cells that I have information about. So it's a very small
amount of data; the cars just say here are the coordinates of the grid cells that I have information
about, and they send that to the server.
The way that we let users specify this is by optionally attaching a
summarize-as clause to queries. And what happens here is this is just a SQL query that
specifies taking all the raw data and gridding it -- so you get to attach an arbitrary SQL query
that summarizes the data however you would like, using SQL aggregates.
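As an illustration of the kind of summary this produces -- the grid size and record layout here are invented, and this is Python standing in for the actual summarize-as SQL -- the car-side gridding might look like:

```python
# Illustrative car-side summarization: grid GPS-tagged image records into
# coarse cells and report only which cells the car has data for.

GRID = 0.01   # degrees; roughly 1 km cells at Boston's latitude (assumed value)

def summarize(records):
    """records: list of (lat, lon, image_id). Returns {grid_cell: count},
    a few bytes per cell instead of the raw imagery."""
    cells = {}
    for lat, lon, _image in records:
        cell = (round(lat / GRID), round(lon / GRID))
        cells[cell] = cells.get(cell, 0) + 1
    return cells

print(summarize([(42.3601, -71.0942, "img1"), (42.3605, -71.0939, "img2")]))
# -> {(4236, -7109): 2}
```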
Now, what happens on the server is the server receives all of this
information from all the cars, and it looks at this information and assigns priorities to it. The
way it assigns priorities is the user just writes an arbitrary function that can look at these
summaries as they come in and assigns a numeric
score to each one of the entries in the summary.
>>: -- you have a temporal (inaudible)? The data that's more recent is more useful than the data
that's, say, three hours old?
>> Samuel Madden: Again, it's application specific. So the user can, if you want, include
temporal information in this query, in this summarize-as clause, and it will be there. And then
the server can write whatever
function it wants to assign priorities. So it can say oh, well, this is a new bit of information about
this grid cell I haven't seen before, so it can assign a higher priority to that.
So the server, again -- the user writes an arbitrary prioritization function that can assign a score to
each one of these grid cells. And then the server sends those scores back to the cars, and the
cars use these priorities in order to compute the ordering in
which data should be sent back. And the database system takes care of
doing the reordering for you based on whatever priorities are sent back.
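Continuing the same illustrative example, the user-written server-side scoring function might be as simple as this sketch (again, invented names, not the CarTel code):

```python
# Illustrative server-side scoring for global prioritization: entries about
# grid cells the server has never seen score high, familiar cells score low.
# The resulting scores are shipped back so the car can reorder its buffer.

def score_summary(summary, cells_already_seen):
    """summary: {grid_cell: count} reported by one car.
    Returns {grid_cell: priority}."""
    return {cell: (1.0 if cell not in cells_already_seen else 0.1)
            for cell in summary}

already_seen = {(4236, -7109)}
print(score_summary({(4236, -7109): 2, (4240, -7112): 5}, already_seen))
# -> {(4236, -7109): 0.1, (4240, -7112): 1.0}
```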
So again our sort of theme with ICEDB is we wanted to allow applications to have control over the
value of information; we think applications know that better than the network or the database does. We
don't think the database can do a good job of doing this automatically, but we wanted to provide
some mechanism that made it easy to do this.
Okay, so again, we provide three prioritization clauses that allow you to do this. I talked
about two of these, the delivery order by clause and the summarize-as clause. Priority is just a
way to say, you know, one class of query is more important than another query -- so GPS data is
more important than camera imagery, for example. And then we have the ICEDB 2006 paper
that talks about this. Yes.
>> : How would things change if, like, every car on the road was embedded with such a sensor --
would you still be driven by this server-driven prioritization?
>> Samuel Madden: If every car -- obviously if every car is running -- if every car on the road
is -- I mean, I think we're sort of thinking that this is working in the model where you have not, you
know, 10 million cars that are all being queried simultaneously. We were thinking this is our sort
of test bed of some small collection of cars that we want to individually collect data from.
I think this makes sense if you're an organization, you have a fleet of cars, or you want to get
information from those cars that you want to use specifically for your application.
If this is personal -- you know, if this is your personal data goes to your personal server or
something, then obviously you would probably architect in some slightly different way. Does that
kind of answer your question?
>> : Yeah, I was wondering, I mean, about things like in-network aggregation or something, so
that all the cars are not talking to the server directly, because --
>> Samuel Madden: Sure, yeah, so again, in the ICEDB system we don't support aggregation of
data across cars. That's a good question, I should clarify that. In the
CarTel project thus far we have not really focused on the problem of car-to-car anything, right?
We've been focusing mostly on car to infrastructure as the domain problem.
And the reason we've done that is because from the outset we
wanted to be able to do research that we could actually deploy in the real world.
If you imagine 10,000 cars, you know -- it's great, it's fun to imagine what would happen if you
had, you know, 10,000 cars on the road. But because we can't actually deploy that then, you
know, we haven't -- so we haven't really looked at car-to-car stuff. Although I think there's a lot
of very interesting things that happen with car-to-car, including in-network aggregation as well as
the sort of peer-to-peer communication, a lot of the delay tolerant networking kind of things that
come up.
It's something we're sort of -- a direction we're considering heading, I'll talk a little bit more about
how we're thinking about scaling up the deployment we have when I get to that part of the talk.
>> : Just one other question. Was that unique to this project work, or is this something that
comes up with other kinds of (inaudible).
>> Samuel Madden: I think it could come up with any -- we want this prioritization in settings
where bandwidth is variable. So I think you could argue for this kind of a thing in any kind of a
setting like that. And I think that, you know, I'm not sure that SQL is the right language for all
users, but I think some sort of a high level language that allows users to say what data they would
like to collect -- to specify a very simple configuration of the data collection
system in a declarative way -- is a genuinely useful idea. And it's broader than just
cars. I mean, we don't intend it for just being cars. So.
>> : Since your -- not every car has instant connectivity on the data they have, in your model
what information that each car has is going to be -- you know, out of date a little bit, right? And
you have some uncertainty in terms of whether you have access to it. So do you take that
uncertainty into consideration when you optimize the queries across multiple users, in a global
sense?
>> Samuel Madden: So again, we've sort of -- I mean, I think that's a very interesting thing to
think about. We've sort of evaded it in the -- what I presented here, right? Because I've just
said, well, the server had to just run some arbitrary function that does whatever prioritization it
wants. I mean, obviously you could embed some -- sort of think of this as what we've provided is
the infrastructure upon which you could build something that could take that kind of uncertainty
into account.
In the evaluations that we did of the system, we didn't spend a lot of time focusing on uncertainty,
specifically with regard to the ICEDB system. Some of our other work where we've looked at
modeling sensor data has looked a little bit more at uncertainty explicitly, and I'm not going to
really talk about that in the talk, but maybe at lunch or something we can talk about that.
Okay. So now what I'm going to do is move on and talk to you a little bit about the Cabernet
system. So again, the Cabernet system is the part of the system that's responsible for the
delivery of the data from the cars up into the server, okay? And so what's going on here, we've
got cars driving around on roads, they drive by wi-fi access points, they attempt to associate with
them and deliver data off of them. That's primarily what the Cabernet system is working on,
although it also supports plugging a cell phone into this thing, and all the technologies will
work.
Although I'm going to spend a fair amount of time here talking about issues that specifically arise
in the context of trying to do this with 802.11 with wi-fi, okay? So although the system supports
both cell phones and 802.11, we haven't done anything particularly clever for the case of cell
phones.
Okay. So again, the basic idea with Cabernet is it's a disconnection
tolerant transport layer. The idea is data gets buffered on the cars until connectivity becomes
available, at which point it gets sent to whatever its destination is. So this means fundamentally
this isn't a connection oriented protocol. You're sending messages: you say I have this
message that I want to deliver to this application on the other end, please queue it and
deliver it when connectivity arrives.
And there are two pieces of this system that I really want to talk to you about. The first one is this
thing called quick wi-fi, which is thinking about how do you establish connections very quickly
from moving vehicles.
And then the other piece is this thing I call CTP. And I won't talk very much about CTP, but I'll
just give you a quick sort of hint and talk about it very briefly. There's an obviously
well-known problem within wireless networks, which is that TCP
equates all losses with congestion, right? And that's obviously not true in the
wireless environment, where there may be losses due to the wireless channel.
And so the wrong thing to do in the case -- TCP does sort of the wrong thing in this case. It
backs off, it sends at a slower rate when it observes losses on the wireless network. Whereas in
fact, in our world you might want to send at a slightly higher rate, because these connections are
transient and you want to take advantage of whatever connectivity is there.
Those of you who know Hari Balakrishnan, this was his Ph.D. thesis, and this was sort of his -- a
piece of the system that he's really worked on more than I have, but I just want to mention it
quickly.
All right, this is work with post-doc Jakob Eriksson. So again, the first challenge I want to
mention is that connection establishment here is very slow. So this is the issue, again: if you just
use wi-fi -- in this case our cars are running Linux -- you just take the default Linux
wireless stack and you sort of say okay, you know, just try and associate with whatever wireless
access points you see, it turns out that this just takes forever. Our measurements showed, in
our initial deployments, it took on average about 13 seconds to establish an end-to-end connection.
You might say, man, 13 seconds, that's really bad. Why is that?
Well, first of all, there's just a lot of stuff that goes on when you have to get a connection,
right? So first you have to scan for a network, then you find a
network, then you have to authenticate with that network, you have to associate with that
network, you have to discover the DHCP server, then
you have to request a DHCP address, and then you have to do an ARP to announce your
presence to the rest of the network, okay?
The problem -- there's a bunch of problems with this. First of all, there's a lot of messages and a
lot of round trips, and they are sequentialized by default in the stack. So you do one phase, then
the next phase, then the next phase.
Second of all, the loss rates here turn out to be very high, okay? So if you look at this as a plot
of loss rate through the lifetime of a connection -- this is 20 percent of the time into the
connection, 40 percent of the time into the connection, and so on, so forth -- you see that
especially at the beginning of the connection the loss rates are very high, and at the end
of the connection the loss rates go up again. And this is just because you're driving into range of the
access point, and then driving out of the range of the access point.
The problem is this little spike at the beginning of the connectivity, right? Because what that
means is when you first discover this access point, that's when you start doing this initiation
protocol, just as you're driving into range of it. You have a very high probability of loss in any
one of these phases.
You see that by default the Linux stack uses these 3 second time-outs, right? And why they use
3 second time-outs I don't really know, but 3 second time-outs just kill you, because there's a
very high probability, like a 70 percent probability, that your request to
authenticate or associate is going to be lost. And then you're going to wait 3 seconds before you
even retry again. Right? So that just introduces this huge latency into the stack.
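As a back-of-the-envelope illustration of why those time-outs matter (the 70 percent figure is from the talk; the rest is just arithmetic): with per-attempt loss probability p, you expect to sit through roughly p/(1-p) time-outs per phase.

```python
# Back-of-the-envelope only: expected number of time-outs before a phase
# succeeds is p / (1 - p) when each attempt is lost with probability p.

def expected_wasted_seconds(loss_prob, timeout_s):
    return (loss_prob / (1.0 - loss_prob)) * timeout_s

for timeout in (3.0, 0.1):
    wasted = expected_wasted_seconds(0.7, timeout)
    print(f"timeout={timeout:.1f}s -> ~{wasted:.1f}s wasted per phase")
# timeout=3.0s -> ~7.0s wasted per phase
# timeout=0.1s -> ~0.2s wasted per phase
```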
So as I said, the default Linux stack takes 13 seconds to associate. To give you a little bit of
perspective, the average entire connection on average only lasts 19 seconds. Okay, so 13
seconds is really squandering bandwidth in this case.
So what did we do? Well, some of what we did is probably obvious: we turned down
time-outs, and that makes a difference. Another thing we did was to scan the most popular
channels first. It turns out that almost all wireless access points are on
either channel 1, 6, or 11, because these are the channels that are maximally spread out --
the intermediate channels actually overlap. So even though by default the Linux stack
scans channel 1, channel 2, channel 3, channel 4, channel 5, it's obviously better to
scan channels in the sort of frequency with which they occur. So we observe
the frequency with which access points appear on channels over time, and then we adjust our
scanning policy to take that into account.
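A minimal sketch of that popularity-ordered scanning idea (illustrative Python; the real quick wi-fi changes live inside the driver):

```python
# Illustrative popularity-ordered channel scanning: count which channels
# access points have historically shown up on, and scan the busiest first.

from collections import Counter

class ScanPlanner:
    def __init__(self):
        self.seen_on_channel = Counter()

    def record_access_point(self, channel):
        self.seen_on_channel[channel] += 1

    def scan_order(self, channels=range(1, 12)):
        # Most frequently observed channels first; unseen channels last.
        return sorted(channels, key=lambda ch: -self.seen_on_channel[ch])

planner = ScanPlanner()
for ch in [1, 6, 6, 11, 6, 1]:            # channels from past scans
    planner.record_access_point(ch)
print(planner.scan_order())               # [6, 1, 11, 2, 3, 4, 5, 7, 8, 9, 10]
```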
You can also -- because we're only trying to associate with open wireless access
points, we're not connecting to authenticated networks -- the protocol standard requires
that you actually do the authentication phase, but you know that if this is an open wireless network,
this phase is going to succeed, so you can actually do it in parallel with the next phase. There's no
requirement that you wait until authentication has happened before you move on to the next phase.
And because the authentication phase is basically a no-op in an open network, you can just go
ahead and start doing the other phases in the case when the network is open.
We set all the time-outs to 100 milliseconds. We play a bunch of other games where we can do
some of these parallelization things for various phases of this thing. And the bottom line is we
have this quick wi-fi stack that can associate with a wireless network in 370 milliseconds by
default.
This is just a plot showing the time that each of the different phases takes.
All right, so the other problem that I mentioned that we deal with in Cabernet is this problem of
TCP treating wireless losses like congestion. So this problem is -- I mean, as I said, this was
Hari's Ph.D. thesis. You might wonder what's different in this particular setting. The real
problem here is that we -- we can't control what software is running on the access points. Okay,
so we get to control what's running on the two end points, but we don't know what's running on
the access point.
So if you think about what you would like to do here is to measure the loss rates that are due to
wireless losses, and the loss rates that are due to congestion, and then you'd like to factor out the
losses that are due to the wireless channel. Because you don't want to do congestion backoff as
a result of those losses.
So the question is, how do you understand what the loss rates look like on the wireless side,
right? If I'm sending from the car to the access point, this problem is easy, because I can
modify the software that's running on the car -- I can basically modify the wi-fi driver so it
tells me how many times it had to retry every transmission, or something. And I understand
something about what loss rates look like.
The problem is that if I'm sending from the server to the car, I don't
have access to that information about how many of my losses were due to the
wireless link. So what I can do instead is to try and estimate the losses between
the server and the access point -- the wired side losses, right? -- and subtract those out from the
overall loss rate, and then I'll get some information about what the wireless losses were.
Okay, so how do I do this? Well, this is the problem of measuring from the
internet to the car. It's just one of these sort of networking tricks that, if you think about it long
enough, is kind of obvious. Most of the time you know the IP
address of the access point, right? Because the access point is serving as a
NAT that's sitting in front of the car. So I know what the IP address of the access point is.
So what we do -- and this works in about 90 percent of the cases -- is you send periodic probe
packets to the access point, addressed to a random port, okay? And most access
points, what they do in response to a packet sent to a random port is send you a TCP
reset back. And that gives you a way to estimate the loss rate along the wired side of the link,
and then you can subtract that out to get the wireless losses, okay?
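The arithmetic behind the subtraction, assuming the wired and wireless links drop packets independently (an assumption of this sketch, not a claim from the talk), looks like this:

```python
# Illustrative arithmetic only (not the actual CTP code). If the end-to-end
# loss rate is p_total and the probed wired-side (server-to-AP) loss rate is
# p_wired, and the two links drop packets independently, then
# (1 - p_total) = (1 - p_wired) * (1 - p_wireless).

def wireless_loss(p_total, p_wired):
    if p_wired >= 1.0:
        return 1.0
    return max(0.0, 1.0 - (1.0 - p_total) / (1.0 - p_wired))

# e.g. 40% loss end to end, 10% measured on the wired side via RST probes:
print(round(wireless_loss(0.40, 0.10), 3))   # 0.333 -> don't back off for this part
```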
So the bottom line is that doing the estimation this way gets us something like a 30 to 40 percent
improvement in overall throughput when we're running this protocol, in addition to the dramatic
reduction in connection establishment time that we get out of using quick
wi-fi.
>> : -- open APs all the time, or does it measure when a car passes?
>> Samuel Madden: Only when a car -- a car -- basically what happens is the car connects, the
car says, oh, I'm here, I'm alive, I have some data to send, or I would like to download this file,
and then we do the -- run the protocol.
So now I've given you an overview of the software, okay, so now what I want to do is sort of
switch to telling you a little about what we've done with the system.
So the deployment that we've done to date is -- I actually should add a third deployment now to
this. We did initial deployment on about nine individual users' cars, and then we have a
partnership with a taxi company to deploy 27 -- we have this software running on 27 taxicabs.
And then since we've developed a next generation hardware box that we're deploying again on
some individual users' cars with a goal of rolling this out to a much larger fleet of taxis, we have a
partnership with Boston Cab that we hope will amount to a couple hundred devices that are out
shortly.
The taxicab deployment is actually kind of an interesting story. Anytime you're doing these kinds
of deployments, the question is, what's in it for the taxi company -- why are these
guys willing to allow us to put random hardware on their cars? So what we've done -- it's kind of a
good solution, Jakob came up with this -- he actually puts two computers in every cab. One of
these computers is hosting services that are beneficial to the cab company. So in particular, we
provide the cab company with a little portal interface where they can go and see where all their
cabs are.
We also provide a gateway. The cab company was willing in this case to pay for EVDO
modems for all the cars, so actually in the cab case we have both wi-fi as well as EVDO
modems. And one of the things the cab company advertises now to their customers is
the gateway that goes from wi-fi to EVDO, so customers can get in their cabs, open their laptops,
and surf the web. So that's what the cab company got out of it.
Then we have this second computer that's running there that's just running our own services. So
we have a separate wi-fi radio on this thing. And the second computer actually loads its boot
image from the master box over ethernet, so we can actually upload over EVDO a new
disk image that the secondary computer can then boot from, and with little
hardware relays we can actually power toggle the secondary box from the master box, as well.
So that gives us a way to deploy new software out in the field that doesn't interfere with what's
going on in the cars.
This is just a map showing I think a week's worth of data of all the roads that we had coverage
about. This is sort of the Boston metropolitan area, and then this is the center of Boston.
Here's MIT and Cambridge here, here's Boston. So you can see that we get lots of data about
all of the -- basically all the major roads in Boston.
>> : How many (inaudible)
>> Samuel Madden: This is 27 taxis. Okay. So that's the sort of where all this data is coming
from, now I'm going to talk about a couple of the applications that we built on top of this. So the
first one is a route planning interface. So the way that the route planning interface works, the
idea is let's use this data from cars to estimate what traffic looks like, okay? So we can observe
how long it takes every car to drive on every road segment, and then we can build up a
distribution -- we can basically, for every road segment, build up a distribution of the travel times
over that road segment.
We're making the very simple assumption that travel times are Gaussian distributed, so we just
compute a mean and a variance for travel times on every road segment. We assume that
consecutive segments here are independent, which is obviously not a valid assumption; it's
something we're coming back to, looking at how non-independence affects things. A road
segment here is defined by the underlying maps that we're using, but typically it's from one
intersection to another intersection. And clearly if you have two intersections that are next to
each other, with a light in between them, the delays on those two segments are correlated in
some way. But we're assuming that there are no such correlations.
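A minimal sketch of that per-segment model (illustrative Python with made-up numbers; the real pipeline also has to map-match GPS traces onto road segments first):

```python
# Illustrative per-segment model: fit a Gaussian (mean, variance) to the
# observed travel times of each road segment.

from statistics import mean, pvariance

def fit_segments(observations):
    """observations: {segment_id: [travel times in seconds]}.
    Returns {segment_id: (mean_s, variance_s2)}."""
    return {seg: (mean(times), pvariance(times))
            for seg, times in observations.items() if times}

obs = {"segment_a": [45, 60, 52, 90, 48],
       "segment_b": [30, 33, 29, 31]}
print(fit_segments(obs))
```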
Then we're looking at simple route planning metrics that run on top of this. One of them is
distance; obviously you can do distance-based routing. Google doesn't do distance-based
routing, they do something very simple -- you know, each road is weighted by the sort of
size of the road, basically. They know this is a road you can drive 25 on, this is a road you can
drive 35 on, so on, so forth.
But we can now do things like expected delay routing: what's the route that's going to have
the fastest expected travel time. And then we've looked at kind of an interesting variant, which
turns out to be a little bit different -- you can't just throw Dijkstra's or A-Star or whatever at it --
which is the problem of finding a maximum probability route. So I can say I want to find the route
that has the highest probability of getting me to my destination within 15 minutes. And I'll talk a
little bit about why you can't just throw Dijkstra's at that. Yeah.
>> : Can you look at time of day or day of week?
>> Samuel Madden: Yeah, I'll give you a little demo. So -- so this is just the interface, I don't
know if you can read it, I'm sorry the fonts are so small.
But this just says you can say at this time of day, and this day of week, I would like to know
what's the minimum distance route, or minimum expected time route, or the route that has the
highest probability of getting me there within this deadline. Okay.
So this is the -- the O here is origin, D here is destination. What this -- this route that I'm asking
for, for anybody who has lived in Boston, is the route from MIT, out to the entrance to Route 2.
So most -- a lot of people who live in the sort of suburbs around Boston commute out from Route
2. You either -- people -- there are suburbs to the north, to the south, and then out. Concord
and Lexington are out to the west, and a lot of the faculty and people at MIT commute in from the
west.
If you know Boston at all you can sort of -- you probably have an opinion about the best way to do
this drive from Route 2 to here. Everybody, you know, sort of says, oh, you know, well, you
should take Mem Drive, right? Or you should take some little complicated route through here, or
whatever it is.
So if you ask Google what it thinks you should do -- Google actually will give you different routes; it's
very sensitive to the starting point of this route -- what Google says you should do is you
should come up here and then you should get on this road here, which is Massachusetts Avenue.
Okay, so this is almost certainly not the right way to go, right? Because Massachusetts
Avenue is the road that connects Harvard and MIT, and at 3:00 to 4:00 p.m.
it's going to be packed with cars, okay?
So if you ask our system what it thinks you should do, you say what's the minimum expected time
route. This is the route that it thinks overall will take the shortest time at this time of day. What
it finds is some -- some route which avoids Mass Ave, basically. It goes and it takes a little side
street, and then takes this Prospect Street here, and takes another side street. And it does get
on Massachusetts Avenue a little bit here, but this is after it's already past Harvard. So it's sort
of a -- it seems like it's a better part of Mass Avenue. And then it again takes some little back
road to get you out to where you need to go.
What this visualization over here is showing is the amount of time -- this number can't be right,
but the amount of time that it takes to drive on each one of these segments, so each one of these
lines is a road segment. And we have information about the number of samples that we have
about these road segments, as well.
So we have anywhere from 800 samples at this time of day to, you know, other segments where
you have 10 or 15 samples about how fast cars could drive, okay?
So you see that this information here is also computing things like the expected fuel
consumption of the drive, as well as the expected CO2 -- there's a bug here where it's somehow
not computing CO2 right, so there's some normalization factor the student who is doing this has
wrong. And it tells you what the expected average velocity is, as well as the probability of
making this thing within 15 minutes.
So now I can also ask for the maximum probability route. It's going to turn out that when we
have 15 minutes, the maximum probability route is the same as, I think -- this demo is not
canned, so the routes do change sometimes. But yes, the maximum probability route is the
same as the minimum expected time route.
But if I have more time, if I say I have 19 minutes, then what it will actually see is that the
maximum probability route is slightly different than the minimum expected time route.
So it says if you have 19 minutes you should take this other road, this other way, which it thinks
has a 99.7 percent chance of getting you there.
And the reason it picks this road is because even though this road has a higher expected travel
time, it thinks it has a lower variance. So it says okay, it's better to
take a road with a lower variance even if it has a higher average expected travel time.
So this maximum probability planning algorithm, as I said, turns out to be kind of interesting.
And sort of the insight, or the observation here, is that the way to think
about this problem is: if you're trying to do maximum probability planning, the travel time of each
edge is Gaussian, and if each segment is independent of every other
segment, then the travel time of an entire path is also Gaussian distributed.
So our goal was to find the path with the maximum probability of reaching a destination by some
deadline. You can't just throw Dijkstra's or dynamic programming at this, because these
problems don't have optimal substructure -- subpaths of optimal paths aren't necessarily optimal.
And what do I mean by that?
Well, if I'm trying to go from A to B, right, and there's some intermediate node C,
just because some route, say route 1, is the optimal way to get from A to C, that doesn't
necessarily mean that route 1 would be on the
optimal path from A to B via C, okay? And Dijkstra's algorithm, dynamic programming, exploits
that fact in order to be able to do an efficient search, right?
So I just have some examples that show that. I'm going to skip through this visualization, but
basically you can end up with the alternative route being the optimal one: the optimal path
from A to B via C may not contain the optimal route
between A and C.
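To make the trade-off concrete, here is an illustrative calculation (made-up numbers, assuming independent Gaussian segments as above): a path's travel time is Gaussian with the summed means and variances, and which path maximizes P(arrive by the deadline) can flip between a fast, high-variance route and a slower, low-variance one as the deadline changes.

```python
# Made-up numbers; independent-Gaussian assumption as above. A path's travel
# time is Gaussian with the summed means and variances; rank paths by
# P(travel time <= deadline).

from math import erf, sqrt

def prob_on_time(segments, deadline):
    """segments: list of (mean_s, variance_s2) per road segment."""
    mu = sum(m for m, _ in segments)
    var = sum(v for _, v in segments)
    z = (deadline - mu) / sqrt(var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF

fast_but_risky  = [(300, 90000), (200, 40000)]   # mean 500 s, huge variance
slow_but_steady = [(350, 2500), (250, 2500)]     # mean 600 s, small variance

for deadline in (550, 660):
    print(deadline,
          round(prob_on_time(fast_but_risky, deadline), 3),
          round(prob_on_time(slow_but_steady, deadline), 3))
# With the tight 550 s deadline the faster route wins; with 660 s the
# low-variance route is more likely to be on time despite its worse mean --
# so the best choice flips with the deadline, which is why a single
# shortest-path run on expected times isn't enough.
```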
So, as I said, you can't just use A-Star or Dijkstra. We have a student
who's been working on this problem; he has a sort of heuristic, exponential time
algorithm that works pretty well in practice, and we have it running on real data. And there have
been some other people who have studied this problem as well.
Okay, so that was traffic, now I want to talk about potholes. So potholes -- what's going on here
is, we went out and put accelerometers in a bunch of these cabs, okay, and these
are sampling at 600 hertz, I believe. So we've got these three-axis accelerometers that are
measuring, you know, both how the car is moving this way, as well as up and down on the roads,
and then we just let the cars drive around wherever they go. And we're looking for potholes. So
a pothole looks like a spike, like this, in the road. And we have some sort of relatively
straightforward machine learning classifier that runs on this and tries to distinguish a pothole
from a not-pothole. So we went out and we drove over a bunch of potholes and that was our
training data, and then we input that into our system in order to find potholes.
So we actually have a map that you can go to and you can see these are the 10 biggest potholes
in Cambridge. You can click on one of them and you see a plot of the thing, we went out and
took pictures of the potholes as well.
So what's actually going on here is we have a simple classifier that can determine several different
types of road anomaly. So we looked not just at potholes, but actually trying to distinguish
potholes from other kinds of bumps that you experience in roads, like manhole covers as well as
railroad crossings or expansion joints in freeways that cross the freeway entirely. It turns out that
we can do a pretty good job of differentiating between manhole covers and potholes and other
sorts of protrusions in the road from things that cross the entire road. And the reason for that is
if you imagine something that goes across the entire road, it goes like this with both
wheels, whereas a pothole does something like this, so you can see that in the accelerometer.
So basically our classifier works by extracting a bunch of different features from the signal, and
then we run a sort of a standard classification algorithm on top of this thing. And then once
we've found things that look like candidate potholes, we do some clustering in order to basically
group together detections that were near each other, as well as to throw out detections that
haven't been seen very many times.
So we want to throw out things that haven't been seen very many times because one, they're
anomalies, like people slamming doors or, you know, people knocking the accelerometer inside
of the car, so we want to filter those things out. We want things that only occur at the same
place multiple times. And also if there's something that occurs once, it's probably not a pothole
that we need to worry that much about, right? Because if drivers can avoid the pothole most of
the time, then it's probably less severe than something that gets hit all the time. Okay?
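To make those two stages concrete, here is a rough sketch of a per-window detector on the accelerometer signal followed by the clustering and filtering step just described. The specific features, thresholds, and grid-cell clustering are hypothetical placeholders, not the trained classifier or tuned parameters from the actual deployment.

```python
# Sketch of the pothole-detection pipeline described above: flag candidate
# potholes from windowed accelerometer data, then cluster nearby detections
# and drop clusters seen only once. All thresholds here are placeholders.
from collections import defaultdict

def is_candidate(window_x, window_z, speed_mps,
                 z_thresh=3.0, xz_ratio_thresh=0.7):
    """Flag a window as a candidate pothole: a large vertical (z) spike without
    a comparable side-to-side (x) response, which would instead suggest
    something spanning the whole road, like a railroad crossing."""
    peak_z = max(abs(a) for a in window_z)
    peak_x = max(abs(a) for a in window_x)
    return peak_z > z_thresh and (peak_x / peak_z) < xz_ratio_thresh and speed_mps > 2.0

def cluster_detections(detections, cell_size=0.0005, min_count=2):
    """Group detections (lat, lon) into coarse grid cells and keep only cells
    hit at least min_count times, filtering out one-off anomalies such as
    slammed doors or a knocked accelerometer."""
    cells = defaultdict(list)
    for lat, lon in detections:
        cells[(round(lat / cell_size), round(lon / cell_size))].append((lat, lon))
    return [pts for pts in cells.values() if len(pts) >= min_count]
```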
Again, we have a MobiSys 2008 paper that talks about this particular deployment, and talks
about some of the data we've collected, but at this point that's sort of all I'm going to say about
the pothole thing.
Okay, so the next and last part of the talk, in terms of what we've done with the system, is some
experiments with wi-fi.
So I mentioned that one of the things we're trying to do is look at wi-fi as an uplink for these
devices. So I'm going to talk about what we did to measure whether that was feasible or not.
And I want to talk a little bit about some more recent work we've been doing on using wi-fi for
doing mapping.
So the idea here is everybody knows there's a lot of wireless out there, right? You go out
anywhere in the city, you open your laptop, and you observe a huge number of access points.
So on a typical drive, about an hour long drive around Boston, we'll see something like a
thousand access points. On average, about 5 percent of those access points are open, okay?
So we're interested in the question as to whether there's enough connectivity from these open
wireless access points to get data off of the cars.
A common question is whether this is legal. It turns out that at least in Massachusetts, it
apparently is not illegal; there are no laws that say you can't just associate with these access
points. In some states it appears to be a little more dubious. From our perspective we're
interested in it because, if you could demonstrate that these networks were a feasible way to get
data off of cars, then you could imagine partnerships with hot spot providers, partnerships with
urban municipal area networks, or even incentivizing people to open up their networks in order to
allow you access. There are a lot of these companies, like Fon and Meraki, who are now doing
these large scale deployments where they're basically getting people to participate in an open
network built from the ground up.
Okay, so again, there's another example, this is wigle.net, I don't know if you guys have seen
wigle.net, but these are people who do war driving, so they just build these maps of where all the
wireless access points are. Every city is just covered with wireless access. Here the green dots
are open, the red dots are closed.
All right, so the question is, given that there's all this connectivity out there, can we actually use
wi-fi from cars as we're driving around? The experimental method we used is really very
straightforward. Cars just sit in a loop, scanning. When they see an open wireless access point,
they attempt to associate with it. When that's successful, they acquire an IP address using
DHCP. And then they begin an end-to-end ping with a server at MIT. And this is to demonstrate
that this network is actually open, because there are a lot of these for-pay networks that from the
point of view of the wireless stack appear to be open, but actually require you to do HTTP
authentication before they'll deliver data for you, right? You have to pay, you have to put your
credit card in.
So once we've established that the network is in fact open in this way, we do two things. We
initiate a small TCP test upload to our server to measure the bandwidth of the connection, and
then we begin a local AP ping: the car starts pinging the local access point. And this is again in
order to determine the duration of the connection.
So when we can no longer ping the access point, that must mean we're out of range of the
access point. And then once we get three seconds of lost pings, we go back to scanning.
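For concreteness, here is a sketch of that measurement loop. The `radio` helper methods (scan, associate, DHCP, ping, TCP upload, logging) are hypothetical placeholders for the real in-car software; only the control flow follows the procedure described above.

```python
# Sketch of the measurement loop: scan, associate, DHCP, end-to-end ping to
# verify the AP is really open, a small bandwidth test, then local-AP pings
# until three seconds of pings are lost. The radio object is a placeholder.
import time

LOST_PING_LIMIT = 3.0  # seconds of lost local pings before returning to scanning

def measure_forever(radio, server_addr):
    while True:
        ap = radio.scan_for_open_ap()
        if ap is None:
            continue
        if not radio.associate(ap):
            continue
        if not radio.get_dhcp_lease():
            continue
        if not radio.ping(server_addr):      # end-to-end check filters out for-pay "open" APs
            continue
        radio.tcp_test_upload(server_addr)   # small upload to estimate bandwidth
        first_ok = last_ok = time.time()
        while time.time() - last_ok < LOST_PING_LIMIT:
            if radio.ping(ap.address):        # local AP ping measures connection duration
                last_ok = time.time()
        radio.log_association(ap, duration=last_ok - first_ok)
```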
Okay, so this data was collected with an early deployment: all the data I'm going to show you
was collected with just individual cars, not with the taxi test bed; this is from a MobiCom 2006
paper. It's about 32,000 distinct access points that we observed, 290 distinct kilometers of road,
and about 300 hours of driving.
So association duration is the first metric. The first question you might have is how long a car is
associated with a network on average, when it gets connectivity. So this is: we're scanning, we
associate, we get an IP address, we get a ping, we lose the connection. We're interested in the
time from the first AP ping to the last AP ping. That's the association duration, and this is just a
CDF of association duration.
So I already told you that I think the mean is something around 19 seconds. You can see the
median is about 13 seconds, so the mean is slightly longer than the median because this is a
long-tailed distribution, where there are some connections that last a really long time, like two or
three minutes, because the car is sitting in traffic, or sitting at a stoplight, or stopped.
Okay, so the thing to get from this is that we get quite a bit of connectivity, right? 15, 20 seconds
is a little bit surprising. The other question, though, is how does this vary with the speed of the
car? Are we only getting these connections when the car is stopped?
This is just the fraction of associations we had versus the speed of the car. And you can see
that we have a pretty linear distribution of associations by speed, up to about 60 kilometers an
hour. So this is, whatever, 35 miles an hour or something. Partly that's because our data mostly
consists of people driving about 35, not driving on the freeway; these are mostly people who live
around MIT who take city streets. But it also probably is the case that you're just not going to do
as well on freeways, where there are fewer wireless networks and less opportunity to connect.
This is just showing the distribution of connection duration by speed. Right here there just aren't
very many connections happening, but the thing to observe is that, obviously, duration of
connection goes down as speed goes up. Not surprising, right?
Okay, so the kind of take-away here is we get connections at a lot of speeds, we don't have a lot
of data at higher speeds but, you know, for at least city driving there's quite a bit of opportunity
here, right? And then when we get connections, they're pretty long, like 15, 20 seconds. So
there's the ability to upload quite a bit of data when that happens.
So the next question, though, is okay, that tells me how long connections last on average, but
how frequently do you actually experience connections? We looked at two metrics, which we call
disconnection duration metrics; these are the delays between attempts to join. So how often do
we see any access point at all, not necessarily successfully associate, and then how frequently
do we successfully associate and obtain connectivity?
So this is just a CDF showing those two things. The dark black line is the distribution of
attempts. If you look at the median here, you can see that we see an access point about every
two seconds. Of course, this metric is a little misleading: it's not that you steadily see an access
point every two seconds; sometimes you'll see 20 access points in one second, and then you'll
have 20 seconds with no connectivity, right? So it's a pretty variable measure, and the shape of
the CDF gives you a sense of what's going on here. And you can see that the median of
associations is that we get an association about every 13 seconds.
So we get an association about every 13 seconds at the median, and it lasts about 19 seconds
on average. Sorry, the mean is different: the median here is about every 13 seconds, but the
mean is much, much higher than that, because there are long periods where there's no
connectivity at all as you're driving along.
It actually turns out that the mean disconnection time is about 260 seconds, so about four
minutes: roughly once every four minutes we get this 19 seconds' worth of connectivity, okay?
And we see an access point about every 20 to 23 seconds on average, as opposed to at the
median.
And then the final question is, how much total data do you transfer when you get a connection?
This is just showing a CDF of that. If you go across the 50 percent line, you can see, and I'm
sorry this graph is really hard to read, that it's about 200k. So you get about 200k of data at the
median association.
The mean is actually about 600k, so the mean is considerably higher, because the mean
connection time is longer than the median connection time, and longer connections push more
data through. So on average you get more data through than at the median.
So the take-away from this is that there's a lot of connectivity in the wild, right? We get on
average 600 kilobytes every 200 seconds. So if you like, you can think of this as roughly a
3 kilobyte per second network, viewed at a very high level. And 3 kilobytes a second doesn't
sound like a huge amount of data, but if you're thinking about getting traffic off of cars, or traffic
information onto cars, or news updates and other textual information onto cars, it's definitely
sufficient bandwidth to get a fair amount of data onto or off of the cars in real time that might
potentially be very useful. You know, giving you your e-mail in your car. All those kinds of
applications could be supported by this kind of thing.
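As a quick sanity check on that figure, the effective bandwidth is just the per-connection transfer divided by the time between useful connections, using the round numbers from the talk:

```python
# Back-of-the-envelope check of the effective bandwidth quoted above, using
# the talk's round numbers (~600 KB per useful connection, roughly one such
# connection every 200 seconds of driving).
mean_kb_per_connection = 600
mean_seconds_between_connections = 200
effective_kb_per_second = mean_kb_per_connection / mean_seconds_between_connections
print(f"~{effective_kb_per_second:.0f} KB/s averaged over the drive")  # ~3 KB/s
```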
So we think that's pretty encouraging. And this is true even at sort of normal driving speeds, just
driving around. Yeah.
>> : Have you looked over time at whether this ratio of open access points is getting larger or smaller?
>> Samuel Madden: Yeah, that's on our stack of things to do, now that we have this running for a
couple of years: to actually look and see whether there's any appreciable change over the last
couple of years. It seems probable that openness has gone down somewhat. Although, I mean,
we've been running these experiments, and the next-generation network we're looking at is
really wi-fi only, and it works surprisingly well. So.
All right. So the last thing I want to talk about with respect to wi-fi, wi-fi as a sensor, is this thing
we call V-Track. And again, there are two things going on here. One motivation of V-Track is to
look at whether it would be possible to build the kind of system we had before in a framework
that doesn't have any GPS.
Okay, so why would you want to get rid of GPS? Well, GPS makes the device a little bit more
expensive. For us as academics, if we're trying to deploy a thousand of these things, that cost
maybe matters. In some cases it may not matter.
It also complicates deployment a little bit because you need an antenna that has a clear view of
the sky. If you imagine a GPS buried in your pocket all the time, in a cell phone for example,
you're likely to have significant signal loss and obstruction issues as a result. GPS is also quite
power hungry.
So our approach is to use wi-fi to do positioning. Using wi-fi to do positioning isn't an entirely
new idea, right? People have been doing this for quite a while, and there are now these
companies, Skyhook, for example, or Navizon, that have these things that run on phones where
you can say, tell me where I am, and it uses wi-fi triangulation to estimate your position.
The difference in V-Track is that we're not estimating your position at a single point in time; we're
trying to estimate the trajectory that you took over time, particularly the historical trajectory. So
we get an observation of all of your wi-fi positions over time, and then we're trying to find a path
that traverses a series of roads that explains those positions, right? And we try to match those
things to roads.
And this is just a plot showing that if you were to do this with raw wi-fi alone, the mean error of
that raw estimate, the estimate you get from using a Navizon-like or Skyhook-like approach, is, in
our experiments, about 73 meters. So that's not good enough to really know what road you're
on. That's kind of the problem we've been trying to solve.
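For reference, the raw per-fix estimate being described is, roughly, a weighted average of the previously surveyed locations of the access points the car can hear. The sketch below illustrates that general idea only; it is not Skyhook's or Navizon's actual algorithm, and the RSSI weighting is just one plausible choice.

```python
# Rough sketch of a raw wi-fi point estimate: average the surveyed locations
# of the audible access points, weighted by received signal strength. This is
# the kind of per-observation estimate whose ~73 m mean error motivates
# V-Track's trajectory matching; it is not any vendor's actual algorithm.

def wifi_point_estimate(observations, ap_locations):
    """observations: list of (bssid, rssi_dbm); ap_locations: bssid -> (lat, lon)."""
    total_w = 0.0
    lat = lon = 0.0
    for bssid, rssi_dbm in observations:
        if bssid not in ap_locations:
            continue
        w = 10 ** (rssi_dbm / 10.0)        # convert dBm to a linear weight
        ap_lat, ap_lon = ap_locations[bssid]
        lat += w * ap_lat
        lon += w * ap_lon
        total_w += w
    if total_w == 0.0:
        return None
    return lat / total_w, lon / total_w
```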
And this is just a picture showing what happens. Every time you see an access point or a series
of access points, you use a wi-fi localization algorithm like the ones these companies provide to
estimate where you are, and then you connect those estimates together with a bunch of lines.
Blue is where you actually traveled according to GPS, and green is what you see.
And this is fairly typical of what we see; this is actually a good case, based on our observations.
You see these things where the car seems to jump from one location to another, because we
don't exactly know where the centroid is, and you see these funky segments where it looks like
the car went on a path it just didn't take. It's kind of ugly looking.
So that's the problem we're trying to solve. I'm not going to go into too much detail about how
we solved it, but there are a lot of tricky issues that come up. I already mentioned that accuracy
is just poor. The other problem is that people move their access points around over time. In
Cambridge you get students who move from one place to another, and so you see an access
point that you thought was in one location that is actually in some other location, and that really
screws up your algorithms.
You also see crazy roads, these intersections where it's very, very hard to figure out what road
you're on; freeway overpasses are typical. Then you also see things like cab drivers who drive
around and around in circles, which makes this problem really, really hard.
But we've looked at two approaches. One of them is an iterative shortest path approach where
we take two end points of a route. We find locations that we're pretty sure about, where we
have lots of wi-fi observations that seem really good. We connect them with the shortest path,
and then we see if that shortest path explains all the other access point observations that we
saw, or explains enough of them. If it does, then we say okay, we're done. If it doesn't, then we
recurse: we find some other intermediate points, connect those together, and we do that until we
get a route that seems to explain most of the observations we've seen.
We've also looked at a particle filter based approach, where you constrain the path of the
particles to travel on roads and to travel at car speed.
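Here is a minimal sketch of that road-constrained particle filter idea: particles advance only along the road network at plausible car speeds and are reweighted by how well they explain each wi-fi position fix. The road-graph helpers (`advance_along_road`, `distance_m`) and the Gaussian likelihood are illustrative assumptions, not the V-Track implementation.

```python
# Sketch of a road-constrained particle filter: move each particle along the
# road network at a plausible car speed, weight it by how well it explains the
# latest wi-fi position estimate, then resample. The road_graph helpers and
# the likelihood model are placeholders, not the actual V-Track code.
import math
import random

def step_particles(particles, road_graph, wifi_estimate, dt, sigma=70.0):
    moved = []
    for p in particles:
        speed = max(0.0, random.gauss(p["speed"], 2.0))           # jitter speed (m/s)
        pos = road_graph.advance_along_road(p["pos"], speed * dt)  # stay on the road network
        dist = road_graph.distance_m(pos, wifi_estimate)
        weight = math.exp(-0.5 * (dist / sigma) ** 2)              # likelihood of the wi-fi fix
        moved.append({"pos": pos, "speed": speed, "weight": weight})
    total = sum(p["weight"] for p in moved) or 1.0
    # Resample in proportion to weight so particles concentrate on plausible roads.
    return random.choices(moved, weights=[p["weight"] / total for p in moved], k=len(moved))
```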
So the very preliminary results we see are that we get about 80 percent road segment accuracy.
Whether that's enough or not, we're not sure. One thing we'd like to do is to be able to throw out
the 20 percent of bad estimates. If we could get 80 percent that we're sure about, and 20
percent that we knew were bad, then we think this would actually be pretty good. That's what
we're working on. We also get speed errors in the range of about 2 meters per second or 10
kilometers per hour; we're estimating speed on road segments as well.
I'll just give you one last little visualization. This is just an example of a drive. Here we don't
actually have raw GPS information. The red line shows the centroids of the wi-fi localizations
that we saw over time. As you can see, in many places it doesn't match roads very well, and it
appears to do funky things like jump around; in this case one of these access points seems to
have moved.
Anyway, the system is able to match this particular drive onto a set of roads, and it appears to be
doing a pretty good job. And this is just showing all the data that we've collected from this
particular car over the last few days; this is current up to now.
>> : -- because you don't have --
>> Samuel Madden: It doesn't have any information. There were no observations from here,
basically, until somewhere down here. And the system just said, oh, well, use Trapelo Road.
Which is almost certainly the right road in this particular case; I mean, Trapelo Road is the only
major road that goes through this area. So.
Okay, so in the interest of time I'm going to wrap up. I did want to talk a little bit about managing
missing and uncertain data and some of the database work we've been doing, but I don't know
whether I should keep going; you guys tell me. I mean, it's been an hour, so -- it will take two
minutes. Okay, if you guys want to run, feel free to run. So one of the problems we've been
looking at, and this is what the FunctionDB system is about, is: we've got all this data, how do we
allow users to make sensible queries over it? One problem is that if you just run raw queries on
discrete data, it turns out not to work very well. So here the little red points are the data coming
from the GPS sensor over time. And this is real data that's stored in our system, and you see
that there are just these periods where we don't have any position information, because either
some packets were lost or because the GPS went out on us for some period of time. We have
missing data, right?
And so if the user asks a question like what was my average speed in this rectangle, you know,
who knows what kind of answer you're going to get, right? Or when did the car pass -- tell me all
the cars that passed through this rectangle, even, right? You're going to get some sort of
nonsense answer.
So what you would rather do, instead of querying the raw data, probably, is fit some collection of
line segments to this data, right, that tell you trajectories, basically, that estimate where the car
was going and its speed over time.
So this is the idea in FunctionDB: rather than storing only discrete data points, we're going to
allow the system to store trajectories, and in particular we're going to provide support inside the
database system for automatically fitting functions to raw data as it streams in. These fit
functions get maintained over time by the system as data streams into the database.
And then we allow users to pose queries over these fit functions. Users can always pose
queries over the raw data if they want, but they can also define what we call model-based views,
which are fit views over the raw data, and pose queries over those when they prefer to be
querying the fit data.
So we support these kinds of models as first class objects inside the database, and we allow
users to query them.
So just to give you a simple example, suppose I have a collection of raw data, like temperature
readings over time, okay? And suppose I want to know, tell me the time when the temperature
crossed this threshold, right? The simple way to write this in SQL would be to say select time
where temperature equals threshold.
Clearly, if you have interpolation you could guess the point in time when the temperature
probably crossed that threshold. But if you ask this query of a SQL database, you're going to get
an empty response back. This is a trivial example, and yes, you could instead ask where
temperature is greater than the threshold. But the idea in FunctionDB is that we can fit a
function. You say, fit this data using this regression function, and again the user gets to specify
what regression function they would like to use to fit the data, and then they can ask this
question, select time where temperature equals thresh. What the system will actually do when it
evaluates this query is solve this equation in order to figure out the times when the fitted function
crossed that threshold.
So the kind of cute thing about this is that query processing, instead of looking at a whole bunch
of discrete points and seeing whether they satisfy the condition in the query, actually becomes
doing a little bit of function solving, okay?
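To give a flavor of what "query processing becomes function solving" means, here is a small sketch using numpy: fit a line to raw temperature samples, then answer the threshold query by solving the fitted polynomial rather than scanning discrete points. This is an illustration of the idea only, not FunctionDB's actual interface or syntax.

```python
# Sketch of the idea above: fit a polynomial to raw (time, temperature)
# samples, then answer SELECT time WHERE temp = threshold by solving the
# fitted function instead of scanning discrete points.
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
temps = np.array([10.0, 12.1, 13.9, 16.2, 17.8])   # raw samples never hit 15.0 exactly

coeffs = np.polyfit(times, temps, deg=1)            # the "model-based view": temp(t) ~ a*t + b

def times_where_temp_equals(threshold, t_min, t_max):
    """Solve temp(t) = threshold on the fitted curve, keeping roots in range."""
    shifted = coeffs.copy()
    shifted[-1] -= threshold
    roots = np.roots(shifted)
    return [float(r.real) for r in roots
            if abs(r.imag) < 1e-9 and t_min <= r.real <= t_max]

print(times_where_temp_equals(15.0, times.min(), times.max()))  # roughly [2.5]
```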
The interesting part about the system is this: what we've built works for any polynomial function
fit with regression. It maintains these regression functions over time as data streams in. We
support aggregates and joins over the data, so aggregates become integrals: trying to find the
average temperature over a range becomes an integral under the curve, right? It also supports
joins of two data sets together, so you can attach new data to a particular curve. For example, if
I had light data in addition to temperature data, I could join those two data sets together and get
a new joined data set that represented light and temperature over time.
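Continuing the numpy sketch above, an aggregate such as an average over a time range becomes an integral of the fitted polynomial divided by the length of the range, again just as an illustration of the idea rather than FunctionDB's implementation:

```python
# Continuing the sketch: "average temperature between t_lo and t_hi" over the
# fitted curve is the integral under the polynomial divided by the interval
# length, instead of an average of whatever discrete samples fall in range.
import numpy as np

def average_over_fit(coeffs, t_lo, t_hi):
    antideriv = np.polyint(coeffs)                    # antiderivative of the fitted polynomial
    area = np.polyval(antideriv, t_hi) - np.polyval(antideriv, t_lo)
    return area / (t_hi - t_lo)
```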
And then the part of the guts of the query processor that turns out to be interesting is how you
deal with queries for which you can't easily find a closed-form solution. The system falls back on
a form of approximation that it uses to estimate the answers to those queries.
But we find that you get about a five to six times performance gain versus running the queries on
raw data when you're doing this inside of CarTel. Because there's just so much less data, these
fitted functions are so much more compact than the actual raw data, the performance actually
turns out to be better. And on top of that, you get the advantage that you're querying these fit
functions, this cleaned up data, instead of the raw uncleaned data. So that's --
>> : Your example of temperature (inaudible) fit data even though that data is quite sparse -- how
much of your experience is application-specific function fitting, and how much is quite general,
applicable to a very large number of (inaudible) data sets?
>> Samuel Madden: So, again, obviously figuring out what the right way to fit your data is and
what set of basis functions to use and all that stuff is a very application-specific thing, and it's
something that everybody wants to do slightly differently. Our observation, though, is that a lot
of science types, anyway, are using pretty simple tools, right? They're happy with simple
interpolation, or some fairly simple linear regression things, and we think you can support those
in the database system pretty well.
We don't say you must use a particular model; we support any class of polynomial basis function
in the system. It's very hard for me to say whether this is general for all users in the world; of
course maybe it's not. But it does seem useful for at least the kinds of data that we've worked
with, like trajectory data, where what you clearly want is some form of interpolation, and we can
play the same tricks with interpolated data. It also seems that some of these simple sensor field
data, like temperature, are fit pretty well by relatively simple polynomial basis functions.
>> : -- if you can give the user, in addition to the answer, some estimate --
>> Samuel Madden: We absolutely can. When you do fitting you can absolutely report the
quality of the regression that you ran. And one idea, if this idea were to take off, would clearly
be to add support for a broader range of models inside the database system.
So anyway, just to quickly wrap up: mobile sensor networks I think have really great potential to
sense the world at a much larger scale than static networks, and they're much cheaper. In the
car space, anyway, there are lots of applications: traffic, fleet management, automotive
diagnostics, this kind of wireless network monitoring and mapping, environmental stuff, like
putting pollution sensors on cars, and traffic planning, like where should I build roads.
I talked to you about the three high level components of the system, the portal, ICEDB and
Cabernet, and then I went into some details about some of the applications and research results
that we've come up with. We have a website, go check it out, and I'd be happy to take more
questions.
>> : (Inaudible)
>> Samuel Madden: Some of it can. If you want, like if you want a specific demo, we can, you
know, find a way to get you a specific demo.
>> : -- would Hari have (inaudible).
>> Samuel Madden: Yeah, so some of that data isn't there. But if you guys want to show
somebody in your research group a demo, I'd be happy to; Michel knows how to get to all of it,
and we can show you web pages of it, so.
>> : You mentioned using wi-fi as a way to estimate your location based on the car location,
car (inaudible) data, but also the history.
>> Samuel Madden: Yeah.
>> : So you said explicitly that you are not going to (inaudible) talk about the privacy (inaudible),
but that does have an implication on the architecture and design, right? You have to keep the
segments from the past history, and someone who has access to the database, with (inaudible)
ways of querying the system, can tell, you know, Sam, what you have done.
>> Samuel Madden: Sure, you have this thing that says, oh, well, Sam -- this car -- visited all
these access points at this point in time. And clearly we have a database right now that has all
that information in it. And if you look at it, you can learn pretty interesting things. I mean, the
first thing you learn is that most people don't do anything interesting; they drive to work every day
and do exactly the same thing. But you can see a lot about what people have been doing, and I
agree that this is problematic, so --
>> : What are the implications for your future system design, where you take those into
consideration?
>> Samuel Madden: Yeah, so I think absolutely, you need to be taking this stuff into
consideration. And there are some things that you can do, like for example, not reporting -- you
know, if a user didn't report his information when he was within some radius of his house, or
when he was on some road that he knew he was the only person who ever drove on, or
something like that, then that could potentially be one way to mitigate privacy concerns.
But I think that what needs to happen, as these systems get more and more popular, involves
two things. One is a technological thing, where people come up with formal models that allow
us to reason about what it means for data in this kind of spatial world to be private. That's
happening; people are working on it, and we have some students working on it.
The other thing that needs to happen is that a set of policies gets established here. In this
setting, for example, there needs to be some set of regulations that say what data is okay to
export in this way and what data is not okay. So that's kind of the way I view it. Question.
>>: Have you tried doing AP viewing correlation to try to notice when these APs move?
>> Samuel Madden: I'm sorry, you said what do you mean by AP viewing correlation?
>>: Say you're in one spot, and you see access points A, B and C, and later -- at the same time,
temporal time, somebody else might be in a different spot, in C, D and E. And then later
someday -- D and A is (inaudible), and A, C, D, E and B --
>> Samuel Madden: Sure, we could do that; it clearly happens. The question is whether there's
some -- I mean, you could learn interesting things about how frequently things move around. I
don't know if there's a more interesting application than that, beyond people who, like us, are
using wi-fi for localization and trying to filter this out.
>> : -- in terms of just keeping your database accurate.
>> Samuel Madden: Yeah, so clearly we can filter -- one thing that gives us is a good time when
you might throw out some old observations of an access point. There's always this question
about how long you should keep these things around in your database. If you suddenly see that
something appears to have moved location, you probably want to throw the old observations of
that thing out of your database, right.
>> : What's the (inaudible). I was thinking if you build the net, instead of using the wi-fi.
>> Samuel Madden: Yeah, so -- I'm trying to think; I don't know the numbers offhand. I know
that this company Boston Taxi has 300 cabs, I think. And there are something like five or six
major taxi providers in Boston. I would guess the number is a small number of thousands of
cabs in a city like Boston.
My guess is that, like many cities, the urban core of Boston is pretty densely trafficked by cabs,
right? If you go out in the suburbs, you'd never see a cab if you stood on the street. But if you
stand on Mass Ave at 7:00 at night, probably 30 percent of the cars you see are taxi cabs,
because they're taking people to and from restaurants and bars and stuff.
So in that sense, I think there are certain times of day where you could do peer to peer
networking effectively with taxis. And I think it's very interesting to actually understand the
question of what density you would need. Even just answering the question of what density of
these kinds of cars you need in order to be able to do peer to peer networking is interesting.
It's actually interesting to ask the question of what density of cars do I need to be able to observe
traffic, get a traffic update about every road once every 15 minutes, right? Like how many cars
do I need on the road in order to be able to do that.
And the answer is nontrivial, because of course cars don't move according to a uniform
distribution at all, right? So you can't just assume that because a car travels 100 kilometers of
road a day, you can use that to easily extrapolate how much road gets covered, or how many
cars you would need.
Other questions? Okay, thanks, guys.
>> : Thank you. (Applause.)