The CarTel Automotive Mobile Sensor Networking System
Samuel Madden

>> Michel Goraczko: Yes, so we have Sam here today from MIT, he joined MIT in 2004,
he'll be talking about the CarTel project. Sam.
>> Samuel Madden: Thanks, Michel. Thanks, guys, for having me. So I'm going to talk today
about the work we've been doing for the last I guess three years now on this CarTel project.
This is a joint project with my colleague Hari Balakrishnan, who I'm sure many of you know, and a
giant pile of students and staff people and other folks. Every time I give this talk I sort of have to
think about all the people and I have to add new names to this list.
Anyway, so the motivation for the CarTel project sort of started -- we spent a lot of time doing sensor
networking research, both at MIT and, before I was at MIT, at Berkeley, and a lot of the focus
there was on these sort of small scale deployments where you put a few sensors in a building or
a space. And one of the things we started thinking about is how
you can do deployments that really sense a wide area, like a whole city.
And so why would you want to do that? Well, there's lots of things you might want to do. Civil
infrastructure monitoring, like measuring what roadways look like or what water pipelines look like
all throughout a city, so on and so forth. Road surface conditions is sort of a related thing,
monitoring black ice, probably not generally a problem you guys have here. Some of these
visual mapping applications, those sorts of things that Microsoft and Google are deploying, where
they take pictures of roads everywhere. Or monitoring traffic throughout a whole city.
One way you might think about doing this is some sort of a wide area static sensing deployment,
like putting inductive loop sensors in roadways in order to measure traffic everywhere. Of
course, that's costly both to deploy and maintain, right? It takes a huge amount of effort to do
that.
So the observation is -- and again, you know, Google and Microsoft taking pictures of, you
know, streets is an instantiation of this -- a lot of applications don't need data at some incredibly
high temporal fidelity. So for example, for measuring traffic you might be okay with knowing once
per hour what every road looks like, or once every 15 minutes what every road looks like; you
don't need continuous sensing all the time.
That's sort of our insight, or what we're doing in this project: looking at the use of mobile devices
for sensing. One challenge with mobile sensing is that one way you might
do this is to go, for example, buy 100 cars and put them out on the roads and have them drive all
around all the time. But that itself is going to be pretty costly. So the particular thing
we're going after in the CarTel project is what we call opportunistic mobility, and that's the slogan
on the front slide: making sense of your drive to the store. Can we use the mobility that people
already have in their everyday lives in order to sense things about the world in some way?
So that's what CarTel is about.
There's two sort of obvious ways you might want to do this. One is by using cell phones, and the
other one is by using cars, because these are things that people naturally have, that they move
around with all the time.
In this project we really have focused on cars, partly just because cars are this very attractive
platform for deploying stuff, right? It doesn't rain inside of cars, you have power inside of cars,
and there's this onboard sensor network inside of cars that lets you
measure all kinds of things about what the car is doing and what it's experiencing. Okay?
So the sort of first question that I want to lead off with is just talking a little bit about
what's the system architecture, like what's the thing that we built that allows us to get data from
cars that are moving around. And then I'll talk, probably more of the talk will actually be about
what we've actually done with this platform. So as I said, we built this thing, we've had it running
for about three years on a number of cars, and have used it to do some things that I think are kind
of neat. So I'll just tell you about that.
Any questions about anything at this point? No, okay.
All right, so CarTel is a mobile sensor computing system; basically think of it as a tool to answer
questions about spatially diverse data sets, if you like, that are collected from these mobile
devices. It both focuses on the collection of the data -- so,
collecting traffic flow information from roads -- as well as allowing you to do some processing
and ask questions about the data that you have collected.
And so with that sort of goal in mind, you can at a very high level decompose CarTel into
three core tasks: some piece of software that does the collection and the processing of data;
something that does the delivery of data, that is, how you get the data off of these mobile
devices into some centralized infrastructure where you can look at it; and then some visualization
or analysis tools to allow you to sort of process the data.
Okay, so just to sort of give you specifically how these three layers are instantiated in the CarTel
project: at the lowest layer, the stuff that's actually running on the cars, there's a collection of
embedded hardware, and this just gives you a picture of some of the earlier generation devices
that we've been deploying. This is just a little embedded, basically access point device from
Soekris; it just has a wi-fi radio and some flash storage, so on and so forth, a GPS, and an
interface to the onboard diagnostics network inside of the cars. In some cases webcams.
We have a second generation deployment that we're doing that's based on a much smaller and
cheaper access point device. Has no GPS, does all of its localization via wi-fi, so it uses wi-fi
both for localization as well as for data uplink.
We also have another box that we deployed in some of the cabs that -- a cab test bed that we've
been running that I'll tell you about in a little bit.
So this is sort of running on the cars on the roads, data is being transferred up over the internet.
In the CarTel project, we tried to be sort of agnostic about the actual data uplink, rather than insisting on one
particular way of getting data off the cars. So in particular, we didn't want to insist that every car
that's out there on the road have a cellular data plan available to it to get data off, which would be
sort of the obvious way to get data off of things that are moving around on roads.
One of the things we spent quite a bit of time doing, and I'll talk about this, is investigating the
question as to whether the currently deployed wi-fi networks in the world are sufficient as a
sort of data uplink for a lot of these mobile data applications. I'll talk about some of the
technology that we've developed to make it so that cars can rapidly associate with
wireless networks as they drive around.
Then the third piece of this is some sort of stuff that runs on the web that users interact with. We
call this the portal. At the middle layer we have this thing we call Cabernet -- so, cab,
cab network -- which is the carry-and-forward networking system. If you look at our papers,
there's something we used to call CafNet; Cabernet is sort of the second generation of this
networking layer.
Then there's ICEDB, which is intermittently connected embedded database, which is really the
data collection abstraction that runs down on the cars.
So I'll talk about the three pieces of the system quickly. All right, so just to sort of give you a little
bit more road map: I'll talk about the three pieces of the system, I'll talk about deployments and
the case studies that we've done, and then I'll talk a little bit, if I have time, about
some of the things we've done to help users actually manage the data that has been collected
from these things.
So part of my own, sort of, research agenda with this -- I'm coming from the
database community, and one of my agendas is to build tools that make it easier for people to
actually manage all this data that's coming from things like cars out on the road. So I'll talk a
little bit about that at the end.
All right, so the portal really is a pretty generic architecture.
There's some web server, there are some collection of applications that run inside of the web
server that provide these applications to users, like a traffic application or a wi-fi
application that allows you to visualize wi-fi. These applications can retrieve data from this thing
that we call the ICEDB server. This is really just basically a database, except that it
allows applications, if they want to, not only to query historical data, but to pose queries that
request that cars on the road deliver data continuously. So the traffic application, for example,
can register itself with the ICEDB server and say deliver me GPS information once every three
seconds.
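As a rough sketch of that interaction -- with an invented server API and query syntax, since the talk doesn't give the exact CarTel interface -- registering such a continuous query might look like this:

```python
# Hypothetical sketch: how an application might register a continuous query
# with the ICEDB server. Class, method, and query syntax are illustrative,
# not the actual CarTel API.

class IcedbServer:
    def __init__(self):
        self.continuous_queries = []   # queries waiting to be pushed to cars

    def register_continuous_query(self, app_name, query_text):
        # Cabernet delivers the query the next time each car gets connectivity.
        self.continuous_queries.append((app_name, query_text))

server = IcedbServer()
# The traffic application asks every car for a GPS fix every three seconds.
server.register_continuous_query(
    "traffic",
    "SELECT lat, lon, speed FROM gps SAMPLE EVERY 3 SECONDS DELIVERY ORDER fifo",
)
```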
And then there's some collection of data visualization tools that we provided that allow users and
applications to sort of overlay data on maps and other things like that.
Then the way the communication works is the ICEDB server uses the Cabernet system to receive
data from cars, as well as to send data to cars that are out there on the road. And the real thing
that Cabernet is doing for us here is dealing with disconnectivity. We're not making an
assumption that the cars are always connected, so therefore you need to have some way
to allow cars to pop up and register themselves and get data, and to buffer data on cars so
that when they connect they can deliver it to you, and so on, so forth.
Of course, inside ICEDB there's also just a relational database that stores all the historical
data. And most of the applications that we
build are actually just querying historical data most of the time. The continuous queries are used
to sort of configure things and set up the data that streams in, but for the most part applications
don't really want to change the data that's being collected in realtime. At least in the
deployments we've done.
I'll just give you a little demo of sort of this portal software. I'm going to -- this is some sort of
old data that we collected, but this is data from Hari Balakrishnan, so we're going to compromise
Hari's privacy all day today. He's generously allowed me to do this in this talk. I didn't have a
car for much of the time that we were doing this, so --
>>: (Inaudible)
>> Samuel Madden: I'm not going to -- I would love to talk to you guys about privacy, actually. I
mean, I think it's a real concern here. I'm not going to go into much detail about it. The
standard story that I tell people is, well, Hari logs into the system, and this is what he sees. So if
all we're doing is just showing Hari his data, I'm not sure that the privacy question is huge. When
it starts to get really problematic is when you talk about -- which we'll talk about later -- when I
take Hari's data and synthesize it together with everybody else's data to do interesting things,
how do we make sure that we're not compromising Hari's privacy when we do that.
Like if I give traffic information about roads that Hari drives on, you can probably learn quite a bit
about what Hari does. If I know -- if there's only a small number of cars inside of the system and
I know that Hari lives up here in Winchester, I can figure out things like when does Hari leave the
house every day, from that data. So that's definitely a problem.
Okay, so we've got -- this is just a very simple visualization that allows you to do things like -- it's
just running on top of Google maps. It allows you basically -- these two blue boxes define
regions that we want to find roads within. So I said I want roads that go from here to here, drives
that go from here to here, and this is just showing me all those drives. I can, you know, click on
one of these and see where that drive specifically goes.
There's some sort of interesting things you can observe about this. This is the duration. You
see most of the time this is a 20 minute commute. Some days, like here, it turns out to be a
40 minute commute.
>>: So you have some definition of trip segments, which is start and end points. Say you go
somewhere, you stop for five minutes or one minute, and then you don't --
>> Samuel Madden: In the case of an individual user driving around, this turns out to be pretty
straightforward. Every time the user turns the car on and off, we use that to delineate drives; we
do record that information.
It gets a lot harder -- we've been doing this now lately with cell phones, and we also have this taxicab
deployment. Both have these issues, that cab drivers will, for example, sit somewhere -- this is
extra bad for our cabs, because the cabs are all Toyota Priuses, so the cabdrivers will just
sit there, and they won't actually turn the engine off, they'll just sit someplace for a long time.
And sort of how do you decide, is this a new drive or is this just the cabdriver sitting somewhere
for a long time. So figuring out how you sort of delimit drives is hard.
Cell phones have the same problem, because the phone doesn't turn off. So you end up doing
things like, well, we could use the accelerometer to determine if the user has been stationary for a
long time. And then we'll sort of segment drives manually, in some sort of heuristic way.
Fundamentally this is like -- or you could give the user a button that says I started or stopped
doing something new. But it's hard to figure out how to actually split these up. For now these
are just split up from car turning on and off.
The sort of thing you would want to do is zoom in on this and see a little bit more information. So
this is now the same drive -- Microsoft has really good network connectivity, by the way. This
thing is really snappy here to MIT. You can zoom in on this and see a little bit of detailed
information about the drive.
So you see here, you know, so you see he leaves MIT and he goes on this sort of broad way up
here, and you just see that -- this is one of these drives where it took him 43 minutes. And you
just see he gets stuck at these lights, basically, he just backs up and waits for a long time to get
through an intersection. So and then this is a plot over here of his speed versus time.
So one of the sort of cute things about this, Hari had sort of convinced himself that this back route
was like the best way to drive from his home to work. And we actually have a little analysis that
we did where we made him drive on the freeway like 20 days and we made him drive on this road
like 20 days, and basically it turns out that the freeway is just always a better option. Even
though you feel like the freeway is just awful, it turns out that on average at 5:00 in the afternoon
it's like 10 minutes faster to take the freeway than it is to take back roads. Even though it takes
you -- you wait for five minutes to get on the freeway, but then when you finally do get on it you
actually move pretty quickly. Whereas here you're sort of moving, but you're just moving very
slowly for a lot of time.
>> : -- how much gas it consumes.
>> Samuel Madden: Absolutely. So actually I'll show you -- we have a little traffic portal. One
of the things we're doing that this is using is to do traffic planning. We actually have a little thing
that will say here's how much fuel we think that it will consume and how much carbon dioxide we
think you'll emit when you -- you know, depending on this route.
The other thing I said you'll want to do is this is supposed to be designed to be kind of a generic
visualization interface, so we can do things like overlay engine RPMs, so this will color code his
trace by engine RPMs instead of color coding it by speed. We can also see all the wi-fi networks
that he observed along this drive.
Again, as I mentioned, one of the things we're interested in is investigating the use of wireless
from moving cars. So this is just a log of all the wireless that he could see. In this case red
means we saw it but we didn't associate with it, yellow means we were able to associate with it.
And then we in this drive had the feature that actually caused the car to try and transfer data over
the networks turned off, but in some of these you'll see that things will be green, which means we
actually sent data back from it. Yes.
>> : -- RPMs, can you actually tap into the car's network?
>> Samuel Madden: The car has this onboard diagnostics network. So every car sold in the
United States since 1996 has basically been required to have an onboard diagnostics interface,
and this is primarily for emissions testing, so most of the data that comes over it is things like the
status of the -- you know, oxygen sensor or something. But every manufacturer has its own set
of basically bits of information that it exposes over this thing.
So if you look in your car, typically under the driver's side, below
the steering wheel, there's a plug there that looks kind of like a parallel port.
It's, you know, about that big, and you just plug into this -- you can buy these
solutions off the shelf. Michel knows quite a bit about it, if you want to talk to him.
>>: Here in Washington state, the (inaudible) car for a second, this is exactly the port.
>> Samuel Madden: This is just an example from Seattle, so Michel had one of these boxes for
awhile driving around. He in this case put a camera on it, and you can actually see -- you can
get Google maps to -- Google does a very poor job of rendering imagery, but you can see what
he was seeing out the front of his car as he was driving around. So that's the visualization
interface.
So sort of the next two pieces I want to talk about now are actually how we do the data collection,
and then how we do the networking. So ICEDB is this intermittently connected embedded
database. The idea, or our thought, was, well, a relational model is sort of a convenient way to
express what data you would like to collect from these cars. I'll give you a couple examples of
that.
The idea is users write queries in a simple extended version of SQL. This is a continuous query
processor, which means these queries specify not what you want to do with data stored in the
database, but what data you would like to collect. And it's distributed in a really very simplistic
way, which just means users write queries at the server, queries get sent to all the cars, and then all the
cars send their data back, whatever data they want. So we're not somehow
partitioning the query up into a bunch of different little pieces, where different pieces are running
on each car.
So one of the challenges we wanted to deal with in the ICEDB world is the fact that bandwidth is
variable. So you've got this data that's streaming back continuously, but the connectivity that the
car experiences varies over time. Because you go into tunnels, or you're in a region where
there's not much wireless connectivity, or your cell phone drops out. So you can't assume that
you have this sort of regular, very continuous connectivity that you can sort of use to deliver data
at regular rates.
So that means, first of all, you need to buffer the sort of query commands. If I issue a query, I
may not be able to actually get that query to a car for several seconds. I also need to buffer
query results, so a car collects information over time, it doesn't necessarily have the connectivity
to deliver that information right away.
Also, one of the things we wanted to support was for the car to store data locally that users
could then come back and drill down into and query, rather than necessarily sending all of the raw data
off at a high rate from the cars.
And then finally the thing that we really focused on in the ICEDB paper and evaluation is we
wanted some way to be able to prioritize the results that cars collect. So the idea was that not
every piece of data that a car collects is created equal, and so within the query language we
provide support for users to attach simple user-defined priorities to the data items, which the
system takes into account when it's deciding what to transfer at a given point in time. And I'll talk
in more detail about that.
Just to give you a high level picture of how this works: there's an ICEDB server, users issue queries
to cars, and cars stream results back. Zooming in on what happens on the cars, there's some
collection of sort of sensor hardware here. The sensors are talking through adapters; these
would be like drivers that let the car sample GPS, for example. And then the data is sort of going
to be stored in these output buffers, and the Cabernet system is going to be taking data from the
output buffers and delivering it when connectivity becomes available.
In the middle of this there's sort of two data paths. One is the continuous query data path, where
data just streams from the sensors into the buffers and then out through Cabernet, and the other
piece is what we call the ad hoc query processing piece. So data comes from the sensors, gets
stored in a local database on the cars, and then the user can issue queries externally over that
stored data.
These two data paths help cope with limited bandwidth. Again, the idea is you can store
more data locally than maybe you can transmit continuously off the car.
Okay, now I'm going to talk about the prioritization issue, which I think is sort of the most
interesting one. So the motivation here -- again, there's a couple different motivations. The
first one is what we call intra-query prioritization, which is, within a query, imagine I'm collecting
some data, say for example about a car's position over time, right?
If you imagine you're in a bandwidth constrained environment, so imagine at each one of these
points you have not just position, but say a photograph or something, right? So if you think
about a bandwidth constrained environment, if you just take that data and you stick it into a FIFO
buffer, the problem is that you're going to get a whole bunch of data about the very first part of the
drive, but you're not going to get any information about what's happened during later parts of the
drive, if there's not enough bandwidth to transmit all the information.
So we wanted to provide a way for users to specify simple policies about what data is more
valuable than other data within a query. You can specify these things like this: we
basically added, at the end of every SQL query, a clause that says something like delivery
order by first in first out, so FIFO; that would be the default ordering that all these things
would get.
But something you might rather do is what we call bisect here, which is you take all the data that's
in a buffer, in the buffer at a given point in time, you send the first point and the last point, and
then for each of those segments you recursively bisect them. So you can specify delivery order
bisect here.
It turns out that there's no way to get the standard SQL order by clause to actually do this in the
right way, because what you want to do here is to reprioritize all the data every time a new data
entry comes in. The SQL order by clause just orders the data by some static value of every
record.
Now, our idea here is that this bisect is not special -- we provide a library of these kinds of
prioritization functions, but really the user is free to write their own simple
prioritization function that takes the records in a buffer and reorders them however they would
like. So we provide bisect as a default, as well as some other things like random sampling or
most recent value that you can use if you want, but you can also specify
your own functions.
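As a rough illustration (plain Python, not the actual ICEDB code), a bisect-style reordering function over a time-ordered buffer might look like this:

```python
# Illustrative bisect-style prioritization (not the actual ICEDB code):
# reorder a time-ordered buffer so its endpoints go out first and the rest
# follows in recursive-bisection order, giving the server coarse coverage
# of the whole drive even if the connection dies partway through.

def bisect_order(buffer):
    if not buffer:
        return []
    order, seen = [], set()
    ranges = [(0, len(buffer) - 1)]        # index ranges still to split
    while ranges:
        lo, hi = ranges.pop(0)
        for idx in (lo, hi):
            if idx not in seen:
                seen.add(idx)
                order.append(buffer[idx])
        if hi - lo > 1:
            mid = (lo + hi) // 2
            ranges.append((lo, mid))
            ranges.append((mid, hi))
    return order

records = [f"t{t}" for t in range(10)]     # ten timestamped readings
print(bisect_order(records))
# -> ['t0', 't9', 't4', 't2', 't6', 't1', 't3', 't5', 't7', 't8']
```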
So this intra-query prioritization is great -- it lets us deal with the case where, on a particular
car, some data may be more valuable than other data. But
just within a single car you may not have sufficient information to know how valuable your data is,
right? Because the data that I collect on a car may be less valuable if other
cars have already collected it.
So for example if there's lots of cars that drive on the freeway every day, right? It's not
necessarily valuable for every one of those cars -- you want some information about the freeway,
but it's not necessarily valuable for every car to transmit information about the freeway.
So the idea in the global prioritization scheme again is very simple. This is just sort of a
schematic of how it works, but the idea is that rather than transmitting all of
their raw data as soon as they get connectivity, the cars apply a user-specified summarization
function that summarizes the raw data that the car has. And then the
server takes all these summaries of the data the cars have,
and basically orders them and tells each car in what order the summarized values -- the data --
should be transmitted back. So the server can take into account the sort of global information
that it has from all the cars.
So I'll just give you a simple example. Imagine the cars have a bunch of imagery about the
world that they've been driving around in. What the cars can do is summarize
this data in a very coarse-grained way, by gridding the whole world up into a bunch of
grid cells and saying these are the grid cells that I have information about. So it's a very small
amount of data; the cars just say here are the coordinates of the grid cells that I have information
about, and they send that to the server.
The way that we let users specify this is by optionally attaching a
summarize-as clause to queries. And what happens here is this is just a SQL query that
specifies taking all the raw data and gridding it -- so you get to attach an arbitrary SQL query
that summarizes the data however you would like, using SQL aggregates.
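As an illustration of the kind of summary this produces -- the grid size and record layout here are invented, and this is Python standing in for the actual summarize-as SQL -- the car-side gridding might look like:

```python
# Illustrative car-side summarization: grid GPS-tagged image records into
# coarse cells and report only which cells the car has data for.

GRID = 0.01   # degrees; roughly 1 km cells at Boston's latitude (assumed value)

def summarize(records):
    """records: list of (lat, lon, image_id). Returns {grid_cell: count},
    a few bytes per cell instead of the raw imagery."""
    cells = {}
    for lat, lon, _image in records:
        cell = (round(lat / GRID), round(lon / GRID))
        cells[cell] = cells.get(cell, 0) + 1
    return cells

print(summarize([(42.3601, -71.0942, "img1"), (42.3605, -71.0939, "img2")]))
# -> {(4236, -7109): 2}
```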
Now, what happens on the server is the server receives all of this
information from all the cars, and it looks at this information and assigns priorities to it. The
way it assigns priorities is the user just writes an arbitrary function that can look at these
summaries as they come in and assigns a numeric
score to each one of the entries in the summary.
>>: -- you have a temporal (inaudible)? The data that's more recent is more useful than the data
that's, say, three hours old?
>> Samuel Madden: Again, it's application specific. So the user can, if you want, include
temporal information in this query, in this summarize-as clause, and it will be there. And then
the server can write whatever
function it wants to assign priorities. So it can say oh, well, this is a new bit of information about
this grid cell I haven't seen before, so it can assign a higher priority to that.
So the server, again -- the user writes an arbitrary prioritization function that can assign a score to
each one of these grid cells. And then the server sends those scores back to the cars, and the
cars use these priorities in order to compute the ordering in
which data should be sent back. And the database system takes care of
doing the reordering for you based on whatever priorities are sent back.
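Continuing the same illustrative example, the user-written server-side scoring function might be as simple as this sketch (again, invented names, not the CarTel code):

```python
# Illustrative server-side scoring for global prioritization: entries about
# grid cells the server has never seen score high, familiar cells score low.
# The resulting scores are shipped back so the car can reorder its buffer.

def score_summary(summary, cells_already_seen):
    """summary: {grid_cell: count} reported by one car.
    Returns {grid_cell: priority}."""
    return {cell: (1.0 if cell not in cells_already_seen else 0.1)
            for cell in summary}

already_seen = {(4236, -7109)}
print(score_summary({(4236, -7109): 2, (4240, -7112): 5}, already_seen))
# -> {(4236, -7109): 0.1, (4240, -7112): 1.0}
```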
So again our sort of theme with ICEDB is we wanted to allow applications to have control over the
value of information; we think applications know that better than the network or the database does. We
don't think the database can do a good job of doing this automatically, but we wanted to provide
some mechanism that made it easy to do this.
Okay, so again, we provide three prioritization clauses that allow you to do this. I talked
about two of these, the delivery order by clause and the summarize-as clause. Priority is just a
way to say, you know, one class of query is more important than another query -- so GPS data is
more important than camera imagery, for example. And then we have the ICEDB 2006 paper
that talks about this. Yes.
>> : How would things change if, like, every car on the road was embedded with such a sensor --
would you still be driven by this server-driven prioritization?
>> Samuel Madden: If every car -- obviously if every car is running -- if every car on the road
is -- I mean, I think we're sort of thinking that this is working in the model where you have not, you
know, 10 million cars that are all being queried simultaneously. We were thinking this is our sort
of test bed of some small collection of cars that we want to individually collect data from.
I think this makes sense if you're an organization, you have a fleet of cars, or you want to get
information from those cars that you want to use specifically for your application.
If this is personal -- you know, if this is your personal data goes to your personal server or
something, then obviously you would probably architect in some slightly different way. Does that
kind of answer your question?
>> : Yeah, I was wondering, I mean, about things like in-network aggregation or something, so
that all the cars are not talking to the server directly, because --
>> Samuel Madden: Sure, yeah, so again, in the ICEDB system we don't support aggregation of
data across cars. That's a good question, I should clarify that. In the
CarTel project thus far we have not really focused on the problem of car-to-car anything, right?
We've been focusing mostly on car to infrastructure as the domain problem.
And the reason we've done that is because from the outset we
wanted to be able to do research that we could actually deploy in the real world.
If you imagine 10,000 cars, you know -- it's great, it's fun to imagine what would happen if you
had, you know, 10,000 cars on the road. But because we can't actually deploy that then, you
know, we haven't -- so we haven't really looked at car-to-car stuff. Although I think there's a lot
of very interesting things that happen with car-to-car, including in-network aggregation as well as
the sort of peer-to-peer communication, a lot of the delay tolerant networking kind of things that
come up.
It's something we're sort of -- a direction we're considering heading, I'll talk a little bit more about
how we're thinking about scaling up the deployment we have when I get to that part of the talk.
>> : Just one other question. Was that unique to this project work, or is this something that
comes up with other kinds of (inaudible).
>> Samuel Madden: I think it could come up with any -- we want this prioritization in settings
where bandwidth is variable. So I think you could argue for this kind of a thing in any kind of a
setting like that. And I think that, you know, I'm not sure that SQL is the right language for all
users, but I think some sort of a high level language that allows users to say what data they would
like to collect -- to specify a very simple configuration of the data collection
system in a declarative way -- is a genuinely useful idea. And it's broader than just
cars. I mean, we don't intend it for just being cars. So.
>> : Since your -- not every car has instant connectivity on the data they have, in your model
what information that each car has is going to be -- you know, out of date a little bit, right? And
you have some uncertainty in terms of whether you have access to it. So do you take that
uncertainty into consideration when you optimize the queries across multiple users, in a global
sense?
>> Samuel Madden: So again, we've sort of -- I mean, I think that's a very interesting thing to
think about. We've sort of evaded it in the -- what I presented here, right? Because I've just
said, well, the server had to just run some arbitrary function that does whatever prioritization it
wants. I mean, obviously you could embed some -- sort of think of this as what we've provided is
the infrastructure upon which you could build something that could take that kind of uncertainty
into account.
In the evaluations that we did of the system, we didn't spend a lot of time focusing on uncertainty,
specifically with regard to the ICEDB system. Some of our other work where we've looked at
modeling sensor data has looked a little bit more at uncertainty explicitly, and I'm not going to
really talk about that in the talk, but maybe at lunch or something we can talk about that.
Okay. So now what I'm going to do is move on and talk to you a little bit about the Cabernet
system. So again, the Cabernet system is the part of the system that's responsible for the
delivery of the data from the cars up into the server, okay? And so what's going on here, we've
got cars driving around on roads, they drive by wi-fi access points, they attempt to associate with
them and deliver data off of them. That's primarily what the Cabernet system is working on,
although it also supports plugging a cell phone into this thing, and all the technologies will
work.
Although I'm going to spend a fair amount of time here talking about issues that specifically arise
in the context of trying to do this with 802.11 with wi-fi, okay? So although the system supports
both cell phones and 802.11, we haven't done anything particularly clever for the case of cell
phones.
Okay. So again, the basic idea with Cabernet is it's a disconnection
tolerant transport layer. The idea is data gets buffered on the cars until connectivity becomes
available, at which point it gets sent to whatever its destination is. So this means fundamentally
this isn't a connection oriented protocol. You're sending messages: you say I have this
message that I want to deliver to this application on the other end, please queue it and
deliver it when connectivity arrives.
And there are two pieces of this system that I really want to talk to you about. The first one is this
thing called quick wi-fi, which is thinking about how do you establish connections very quickly
from moving vehicles.
And then the other piece is this thing I call CTP. And I won't talk very much about CTP, but I'll
just give you a quick sort of hint and talk about it very briefly. There's an obviously
well-known problem within wireless networks, which is that TCP
equates all losses with congestion, right? And that's obviously not true in the
wireless environment, where there may be losses due to the wireless channel.
And so the wrong thing to do in the case -- TCP does sort of the wrong thing in this case. It
backs off, it sends at a slower rate when it observes losses on the wireless network. Whereas in
fact, in our world you might want to send at a slightly higher rate, because these connections are
transient and you want to take advantage of whatever connectivity is there.
Those of you who know Hari Balakrishnan, this was his Ph.D. thesis, and this was sort of his -- a
piece of the system that he's really worked on more than I have, but I just want to mention it
quickly.
All right, this is work with post-doc Jakob Eriksson. So again, the first challenge I want to
mention is that connection establishment here is very slow. So this is the issue, again: if you just
use wi-fi -- in this case our cars are running Linux -- you just take the default Linux
wireless stack and you sort of say okay, you know, just try and associate with whatever wireless
access points you see, it turns out that this just takes forever. Our measurements showed, in
our initial deployments, it took on average about 13 seconds to establish an end-to-end connection.
You might say, man, 13 seconds, that's really bad. Why is that?
Well, first of all, there's just a lot of stuff that goes on when you have to get a connection,
right? So first you have to scan for a network, then you find a
network, then you have to authenticate with that network, you have to associate with that
network, you have to discover the DHCP server, then
you have to request a DHCP address, and then you have to do an ARP to announce your
presence to the rest of the network, okay?
The problem -- there's a bunch of problems with this. First of all, there's a lot of messages and a
lot of round trips, and they are sequentialized by default in the stack. So you do one phase, then
the next phase, then the next phase.
Second of all, the loss rates here turn out to be very high, okay? So if you look at this as a plot
of loss rate through the lifetime of a connection -- this is 20 percent of the time into the
connection, 40 percent of the time into the connection, and so on, so forth -- you see that
especially at the beginning of the connection the loss rates are very high, and at the end
of the connection the loss rates go up again. And this is just because you're driving into range of the
access point, and then driving out of the range of the access point.
The problem is this little spike at the beginning of the connectivity, right? Because what that
means is when you first discover this access point, that's when you start doing this initiation
protocol, just as you're driving into range of it. You have a very high probability of loss in any
one of these phases.
You see that by default the Linux stack uses these 3 second time-outs, right? And why they use
3 second time-outs I don't really know, but 3 second time-outs just kill you, because there's a
very high probability, like a 70 percent probability, that your request to
authenticate or associate is going to be lost. And then you're going to wait 3 seconds before you
even retry again. Right? So that just introduces this huge latency into the stack.
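As a back-of-the-envelope illustration of why those time-outs matter (the 70 percent figure is from the talk; the rest is just arithmetic): with per-attempt loss probability p, you expect to sit through roughly p/(1-p) time-outs per phase.

```python
# Back-of-the-envelope only: expected number of time-outs before a phase
# succeeds is p / (1 - p) when each attempt is lost with probability p.

def expected_wasted_seconds(loss_prob, timeout_s):
    return (loss_prob / (1.0 - loss_prob)) * timeout_s

for timeout in (3.0, 0.1):
    wasted = expected_wasted_seconds(0.7, timeout)
    print(f"timeout={timeout:.1f}s -> ~{wasted:.1f}s wasted per phase")
# timeout=3.0s -> ~7.0s wasted per phase
# timeout=0.1s -> ~0.2s wasted per phase
```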
So as I said, the default Linux stack takes 13 seconds to associate. To give you a little bit of
perspective, the average entire connection on average only lasts 19 seconds. Okay, so 13
seconds is really squandering bandwidth in this case.
So what did we do? Well, some of what we did is probably obvious: we turned down
time-outs, and that makes a difference. Another thing we did was to scan the most popular
channels first. It turns out that almost all wireless access points are on
either channel 1, 6, or 11, because these are the channels that are maximally spread out --
the intermediate channels actually overlap. So even though by default the Linux stack
scans channel 1, channel 2, channel 3, channel 4, channel 5, it's obviously better to
scan channels in the sort of frequency with which they occur. So we observe
the frequency with which access points appear on channels over time, and then we adjust our
scanning policy to take that into account.
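A minimal sketch of that popularity-ordered scanning idea (illustrative Python; the real quick wi-fi changes live inside the driver):

```python
# Illustrative popularity-ordered channel scanning: count which channels
# access points have historically shown up on, and scan the busiest first.

from collections import Counter

class ScanPlanner:
    def __init__(self):
        self.seen_on_channel = Counter()

    def record_access_point(self, channel):
        self.seen_on_channel[channel] += 1

    def scan_order(self, channels=range(1, 12)):
        # Most frequently observed channels first; unseen channels last.
        return sorted(channels, key=lambda ch: -self.seen_on_channel[ch])

planner = ScanPlanner()
for ch in [1, 6, 6, 11, 6, 1]:            # channels from past scans
    planner.record_access_point(ch)
print(planner.scan_order())               # [6, 1, 11, 2, 3, 4, 5, 7, 8, 9, 10]
```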
You can also -- because we're only trying to associate with open wireless access
points, we're not connecting to authenticated networks -- the protocol standard requires
that you actually do the authentication phase, but you know that if this is an open wireless network,
this phase is going to succeed, so you can actually do it in parallel with the next phase. There's no
requirement that you wait until authentication has happened before you move on to the next phase.
And because the authentication phase is basically a no-op in an open network, you can just go
ahead and start doing the other phases in the case when the network is open.
We set all the time-outs to 100 milliseconds. We play a bunch of other games where we can do
some of these parallelization things for various phases of this thing. And the bottom line is we
have this quick wi-fi stack that can associate with a wireless network in 370 milliseconds by
default.
This is just a plot showing the time that each of the different phases takes.
All right, so the other problem that I mentioned that we deal with in Cabernet is this problem of
TCP treating wireless losses like congestion. So this problem is -- I mean, as I said, this was
Hari's Ph.D. thesis. You might wonder what's different in this particular setting. The real
problem here is that we -- we can't control what software is running on the access points. Okay,
so we get to control what's running on the two end points, but we don't know what's running on
the access point.
So if you think about what you would like to do here is to measure the loss rates that are due to
wireless losses, and the loss rates that are due to congestion, and then you'd like to factor out the
losses that are due to the wireless channel. Because you don't want to do congestion backoff as
a result of those losses.
So the question is, how do you understand what the loss rates look like on the wireless side,
right? If I'm sending from the car to the access point, this problem is easy, because I can
modify the software that's running on the car -- I can basically modify the wi-fi driver so it
tells me how many times it had to retry every transmission, or something. And I understand
something about what loss rates look like.
The problem is that if I'm sending from the server to the car, I don't
have access to that information about how many of my losses were due to the
wireless link. So what I can do instead is to try and estimate the losses between
the server and the access point -- the wired side losses, right? -- and subtract those out from the
overall loss rate, and then I'll get some information about what the wireless losses were.
Okay, so how do I do this? Well, this is the problem of measuring from the
internet to the car. It's just one of these sort of networking tricks that, if you think about it long
enough, is kind of obvious. Most of the time you know the IP
address of the access point, right? Because the access point is serving as a
NAT that's sitting in front of the car. So I know what the IP address of the access point is.
So what we do -- and this works in about 90 percent of the cases -- is you send periodic probe
packets to the access point, addressed to a random port, okay? And most access
points, what they do in response to a packet sent to a random port is send you a TCP
reset back. And that gives you a way to estimate the loss rate along the wired side of the link,
and then you can subtract that out to get the wireless losses, okay?
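The arithmetic behind the subtraction, assuming the wired and wireless links drop packets independently (an assumption of this sketch, not a claim from the talk), looks like this:

```python
# Illustrative arithmetic only (not the actual CTP code). If the end-to-end
# loss rate is p_total and the probed wired-side (server-to-AP) loss rate is
# p_wired, and the two links drop packets independently, then
# (1 - p_total) = (1 - p_wired) * (1 - p_wireless).

def wireless_loss(p_total, p_wired):
    if p_wired >= 1.0:
        return 1.0
    return max(0.0, 1.0 - (1.0 - p_total) / (1.0 - p_wired))

# e.g. 40% loss end to end, 10% measured on the wired side via RST probes:
print(round(wireless_loss(0.40, 0.10), 3))   # 0.333 -> don't back off for this part
```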
So the bottom line is that doing the estimation this way gets us something like a 30 to 40 percent
improvement in overall throughput when we're running this protocol, in addition to the dramatic
reduction in connection establishment time that we get out of using quick
wi-fi.
>> : -- open APs all the time, or does it measure when a car passes?
>> Samuel Madden: Only when a car -- a car -- basically what happens is the car connects, the
car says, oh, I'm here, I'm alive, I have some data to send, or I would like to download this file,
and then we do the -- run the protocol.
So now I've given you an overview of the software, okay, so now what I want to do is sort of
switch to telling you a little about what we've done with the system.
So the deployment that we've done to date is -- I actually should add a third deployment now to
this. We did initial deployment on about nine individual users' cars, and then we have a
partnership with a taxi company to deploy 27 -- we have this software running on 27 taxicabs.
And then since we've developed a next generation hardware box that we're deploying again on
some individual users' cars with a goal of rolling this out to a much larger fleet of taxis, we have a
partnership with Boston Cab that we hope will amount to a couple hundred devices that are out
shortly.
The taxicab deployment is actually kind of an interesting story. Anytime you're doing these kinds
of deployments, the question is, what's in it for the taxi company -- why are these
guys willing to allow us to put random hardware on their cars? So what we've done -- it's kind of a
good solution, Jakob came up with this -- he actually puts two computers in every cab. One of
these computers is hosting services that are beneficial to the cab company. So in particular, we
provide the cab company with a little portal interface where they can go and see where all their
cabs are.
We also provide a gateway. The cab company was willing in this case to pay for EVDO
modems for all the cars, so actually in the cab case we have both wi-fi as well as EVDO
modems. And one of the things the cab company advertises now to their customers is
the gateway that goes from wi-fi to EVDO, so customers can get in their cabs, open their laptops,
and surf the web. So that's what the cab company got out of it.
Then we have this second computer that's running there that's just running our own services. So
we have a separate wi-fi radio on this thing. And the second computer actually loads its boot
image from the master box over ethernet, so we can actually upload over EVDO a new
disk image that the secondary computer can then boot from, and with little
hardware relays we can actually power toggle the secondary box from the master box, as well.
So that gives us a way to deploy new software out in the field that doesn't interfere with what's
going on in the cars.
This is just a map showing I think a week's worth of data of all the roads that we had coverage
about. This is sort of the Boston metropolitan area, and then this is the center of Boston.
Here's MIT and Cambridge here, here's Boston. So you can see that we get lots of data about
all of the -- basically all the major roads in Boston.
>> : How many (inaudible)
>> Samuel Madden: This is 27 taxis. Okay. So that's the sort of where all this data is coming
from, now I'm going to talk about a couple of the applications that we built on top of this. So the
first one is a route planning interface. So the way that the route planning interface works, the
idea is let's use this data from cars to estimate what traffic looks like, okay? So we can observe
how long it takes every car to drive on every road segment, and then we can build up a
distribution -- we can basically, for every road segment, build up a distribution of the travel times
over that road segment.
We're making the very simple assumption that travel times are Gaussian distributed, so we just
compute a mean and a variance for travel times on every road segment. We assume that
consecutive segments here are independent, which is obviously not a valid assumption; it's
something we're coming back to, looking at how non-independence affects things. A road
segment here is defined by the underlying maps that we're using, but typically it's from one
intersection to another intersection. And clearly if you have two intersections that are next to
each other, with a light in between them, the delays on those two segments are correlated in
some way. But we're assuming that there are no such correlations.
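A minimal sketch of that per-segment model (illustrative Python with made-up numbers; the real pipeline also has to map-match GPS traces onto road segments first):

```python
# Illustrative per-segment model: fit a Gaussian (mean, variance) to the
# observed travel times of each road segment.

from statistics import mean, pvariance

def fit_segments(observations):
    """observations: {segment_id: [travel times in seconds]}.
    Returns {segment_id: (mean_s, variance_s2)}."""
    return {seg: (mean(times), pvariance(times))
            for seg, times in observations.items() if times}

obs = {"segment_a": [45, 60, 52, 90, 48],
       "segment_b": [30, 33, 29, 31]}
print(fit_segments(obs))
```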
Then we're looking at simple route planning metrics that run on top of this. One of them is
distance; obviously you can do distance-based routing. Google doesn't do distance-based
routing, they do something very simple -- you know, each road is weighted by the sort of
size of the road, basically. They know this is a road you can drive 25 on, this is a road you can
drive 35 on, so on, so forth.
But we can now do things like expected delay routing: what's the route that's going to have
the fastest expected travel time. And then we've looked at kind of an interesting variant, which
turns out to be a little bit different -- you can't just throw Dijkstra's or A-Star or whatever at it --
which is the problem of finding a maximum probability route. So I can say I want to find the route
that has the highest probability of getting me to my destination within 15 minutes. And I'll talk a
little bit about why you can't just throw Dijkstra's at that. Yeah.
>> : Can you look at time of day or day of week?
>> Samuel Madden: Yeah, I'll give you a little demo. So -- so this is just the interface, I don't
know if you can read it, I'm sorry the fonts are so small.
But this just says you can say at this time of day, and this day of week, I would like to know
what's the minimum distance route, or minimum expected time route, or the route that has the
highest probability of getting me there within this deadline. Okay.
So this is the -- the O here is origin, D here is destination. What this -- this route that I'm asking
for, for anybody who has lived in Boston, is the route from MIT, out to the entrance to Route 2.
So most -- a lot of people who live in the sort of suburbs around Boston commute out from Route
2. You either -- people -- there are suburbs to the north, to the south, and then out. Concord
and Lexington are out to the west, and a lot of the faculty and people at MIT commute in from the
west.
If you know Boston at all you can sort of -- you probably have an opinion about the best way to do
this drive from Route 2 to here. Everybody, you know, sort of says, oh, you know, well, you
should take Mem Drive, right? Or you should take some little complicated route through here, or
whatever it is.
So if you ask Google what it thinks you should do -- Google actually will give you different routes; it's
very sensitive to the starting point of this route -- what Google says you should do is you
should come up here and then you should get on this road here, which is Massachusetts Avenue.
Okay, so this is almost certainly not the right way to go, right? Because Massachusetts
Avenue is the road that connects Harvard and MIT, and at 3:00 to 4:00 p.m.
it's going to be packed with cars, okay?
So if you ask our system what it thinks you should do, you say what's the minimum expected time
route. This is the route that it thinks overall will take the shortest time at this time of day. What
it finds is some -- some route which avoids Mass Ave, basically. It goes and it takes a little side
street, and then takes this Prospect Street here, and takes another side street. And it does get
on Massachusetts Avenue a little bit here, but this is after it's already past Harvard. So it's sort
of a -- it seems like it's a better part of Mass Avenue. And then it again takes some little back
road to get you out to where you need to go.
What this visualization over here is showing is the amount of time -- this number can't be right,
but the amount of time that it takes to drive on each one of these segments, so each one of these
lines is a road segment. And we have information about the number of samples that we have
about these road segments, as well.
So we have anywhere from 800 samples at this time of day to, you know, other segments where
you have 10 or 15 samples about how fast cars could drive, okay?
So you see that this information here is also computing things like the expected fuel
consumption of the drive, as well as the expected CO2 -- there's a bug here where it's somehow
not computing CO2 right, so there's some normalization factor the student who is doing this has
wrong. And it tells you what the expected average velocity is, as well as the probability of
making this thing within 15 minutes.
So now I can also ask for the maximum probability route. It's going to turn out that when we
have 15 minutes, the maximum probability route is the same as, I think -- this demo is not
canned, so the routes do change sometimes. But yes, the maximum probability route is the
same as the minimum expected time route.
But if I have more time, if I say I have 19 minutes, then what it will actually see is that the
maximum probability route is slightly different than the minimum expected time route.
So it says if you have 19 minutes you should take this other road, this other way, which it thinks
has a 99.7 percent chance of getting you there.
And the reason it picks this road is because even though this road has a higher expected travel
time, it thinks it has a lower variance. So it says okay, it's better to
take a road with a lower variance even if it has a higher average expected travel time.
So this maximum probability planning algorithm, as I said, turns out to be kind of interesting.
And sort of the insight, or the observation here, is that the way to think
about this problem is: if you're trying to do maximum probability planning, the travel time of each
edge is Gaussian, and if each segment is independent of every other
segment, then the travel time of an entire path is also Gaussian distributed.
So our goal was to find the path with the maximum probability of reaching a destination by some
deadline. You can't just throw Dijkstra's or dynamic programming at this, because these
problems don't have optimal substructure -- subpaths of optimal paths aren't necessarily optimal.
And what do I mean by that?
Well, if I'm trying to go from A to B, right, and there's some intermediate node C,
just because some route, say route 1, is the optimal way to get from A to C, that doesn't
necessarily mean that route 1 would be on the
optimal path from A to B via C, okay? And Dijkstra's algorithm, dynamic programming, exploits
that fact in order to be able to do an efficient search, right?
So I just have some examples that show that. I'm going to skip through this visualization, but
basically you can end up with the alternative route being the optimal one: the optimal path
from A to B via C may not contain the optimal route
between A and C.
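To make the trade-off concrete, here is an illustrative calculation (made-up numbers, assuming independent Gaussian segments as above): a path's travel time is Gaussian with the summed means and variances, and which path maximizes P(arrive by the deadline) can flip between a fast, high-variance route and a slower, low-variance one as the deadline changes.

```python
# Made-up numbers; independent-Gaussian assumption as above. A path's travel
# time is Gaussian with the summed means and variances; rank paths by
# P(travel time <= deadline).

from math import erf, sqrt

def prob_on_time(segments, deadline):
    """segments: list of (mean_s, variance_s2) per road segment."""
    mu = sum(m for m, _ in segments)
    var = sum(v for _, v in segments)
    z = (deadline - mu) / sqrt(var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))      # standard normal CDF

fast_but_risky  = [(300, 90000), (200, 40000)]   # mean 500 s, huge variance
slow_but_steady = [(350, 2500), (250, 2500)]     # mean 600 s, small variance

for deadline in (550, 660):
    print(deadline,
          round(prob_on_time(fast_but_risky, deadline), 3),
          round(prob_on_time(slow_but_steady, deadline), 3))
# With the tight 550 s deadline the faster route wins; with 660 s the
# low-variance route is more likely to be on time despite its worse mean --
# so the best choice flips with the deadline, which is why a single
# shortest-path run on expected times isn't enough.
```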
So, as I said, you can't just use A-Star or Dijkstra. We have a student
who's been working on this problem; he has a sort of heuristic, exponential time
algorithm that works pretty well in practice, and we have it running on real data. And there have
been some other people who have studied this problem as well.
Okay, so that was traffic, now I want to talk about potholes. So potholes -- what's going on here
is, we went out and put accelerometers in a bunch of these cabs, okay, and these
are sampling at 600 hertz, I believe. So we've got these three-axis accelerometers that are
measuring, you know, both how the car is moving this way, as well as up and down on the roads,
and then we just let the cars drive around wherever they go. And we're looking for potholes. So
a pothole looks like a spike, like this, in the road. And we have some sort of relatively
straightforward machine learning classifier that runs on this and tries to distinguish a pothole
from a not-pothole. So we went out and we drove over a bunch of potholes and that was our
training data, and then we input that into our system in order to find potholes.
So we actually have a map that you can go to and you can see these are the 10 biggest potholes
in Cambridge. You can click on one of them and you see a plot of the thing, we went out and
took pictures of the potholes as well.
So what's actually going on here is we have a simple classifier that can determine several different
types of road anomaly. So we looked not just at potholes, but actually trying to distinguish
potholes from other kinds of bumps that you experience in roads, like manhole covers as well as
railroad crossings or expansion joints in freeways that cross the freeway entirely. It turns out that
we can do a pretty good job of differentiating between manhole covers and potholes and other
sorts of protrusions in the road from things that cross the entire road. And the reason for that is
if you imagine something that goes across the entire road, it goes like this with both
wheels, whereas a pothole does something like this, so you can see that in the accelerometer.
So basically our classifier works by extracting a bunch of different features from the signal, and
then we run a sort of a standard classification algorithm on top of this thing. And then once
we've found things that look like candidate potholes, we do some clustering in order to basically
group together detections that were near each other, as well as to throw out detections that
haven't been seen very many times.
So we want to throw out things that haven't been seen very many times because one, they're
anomalies, like people slamming doors or, you know, people knocking the accelerometer inside
of the car, so we want to filter those things out. We want things that only occur at the same
place multiple times. And also if there's something that occurs once, it's probably not a pothole
that we need to worry that much about, right? Because if drivers can avoid the pothole most of
the time, then it's probably less severe than something that gets hit all the time. Okay?
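To make those two stages concrete, here is a rough sketch of a per-window detector on the accelerometer signal followed by the clustering and filtering step just described. The specific features, thresholds, and grid-cell clustering are hypothetical placeholders, not the trained classifier or tuned parameters from the actual deployment.

```python
# Sketch of the pothole-detection pipeline described above: flag candidate
# potholes from windowed accelerometer data, then cluster nearby detections
# and drop clusters seen only once. All thresholds here are placeholders.
from collections import defaultdict

def is_candidate(window_x, window_z, speed_mps,
                 z_thresh=3.0, xz_ratio_thresh=0.7):
    """Flag a window as a candidate pothole: a large vertical (z) spike without
    a comparable side-to-side (x) response, which would instead suggest
    something spanning the whole road, like a railroad crossing."""
    peak_z = max(abs(a) for a in window_z)
    peak_x = max(abs(a) for a in window_x)
    return peak_z > z_thresh and (peak_x / peak_z) < xz_ratio_thresh and speed_mps > 2.0

def cluster_detections(detections, cell_size=0.0005, min_count=2):
    """Group detections (lat, lon) into coarse grid cells and keep only cells
    hit at least min_count times, filtering out one-off anomalies such as
    slammed doors or a knocked accelerometer."""
    cells = defaultdict(list)
    for lat, lon in detections:
        cells[(round(lat / cell_size), round(lon / cell_size))].append((lat, lon))
    return [pts for pts in cells.values() if len(pts) >= min_count]
```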
Again, we have a MobiSys 2008 paper that talks about this particular deployment, and talks
about some of the data we've collected, but at this point that's sort of all I'm going to say about
the pothole thing.
Okay, so the next and last part of the talk, in terms of what we've done with the system, is some
experiments with wi-fi.
So I mentioned that one of the things we're trying to do is look at wi-fi as an uplink for these
devices. So I'm going to talk about what we did to measure whether that was feasible or not.
And I want to talk a little bit about some more recent work we've been doing on using wi-fi for
doing mapping.
So the idea here is everybody knows there's a lot of wireless out there, right? You go out
anywhere in the city, you open your laptop, and you observe a huge number of access points.
So on a typical drive, about an hour long drive around Boston, we'll see something like a
thousand access points. On average, about 5 percent of those access points are open, okay?
So we're interested in the question as to whether there's enough connectivity from these open
wireless access points to get data off of the cars.
A common question is whether this is legal. It turns out that at least in Massachusetts, it
apparently is not illegal; there are no laws that say you can't just associate with these access
points. In some states it appears to be a little more dubious. From our perspective we're
interested in it because, if you could demonstrate that these networks were a feasible way to get
data off of cars, then you could imagine partnerships with hot spot providers, partnerships with
urban municipal area networks, or even incentivizing people to open up their networks in order to
allow you access. There are a lot of these companies, like Fon and Meraki, who are now doing
these large scale deployments where they're basically getting people to participate in an open
network built from the ground up.
Okay, so again, there's another example, this is wigle.net, I don't know if you guys have seen
wigle.net, but these are people who do war driving, so they just build these maps of where all the
wireless access points are. Every city is just covered with wireless access. Here the green dots
are open, the red dots are closed.
All right, so the question is, given that there's all this connectivity out there, can we actually use
wi-fi from cars as we're driving around? The experimental method we used is really very
straightforward. Cars just sit in a loop, scanning. When they see an open wireless access point,
they attempt to associate with it. When that's successful, they acquire an IP address using
DHCP. And then they begin an end-to-end ping with a server at MIT. And this is to demonstrate
that this network is actually open, because there are a lot of these for-pay networks that from the
point of view of the wireless stack appear to be open, but actually require you to do HTTP
authentication before they'll deliver data for you, right? You have to pay, you have to put your
credit card in.
So once we've established that the network is in fact open in this way, we do two things. We
initiate a small TCP test upload to our server to measure the bandwidth of the connection, and
then we begin a local AP ping: the car starts pinging the local access point. And this is again in
order to determine the duration of the connection.
So when we can no longer ping the access point, that must mean we're out of range of the
access point. And then once we get three seconds of lost pings, we go back to scanning.
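For concreteness, here is a sketch of that measurement loop. The `radio` helper methods (scan, associate, DHCP, ping, TCP upload, logging) are hypothetical placeholders for the real in-car software; only the control flow follows the procedure described above.

```python
# Sketch of the measurement loop: scan, associate, DHCP, end-to-end ping to
# verify the AP is really open, a small bandwidth test, then local-AP pings
# until three seconds of pings are lost. The radio object is a placeholder.
import time

LOST_PING_LIMIT = 3.0  # seconds of lost local pings before returning to scanning

def measure_forever(radio, server_addr):
    while True:
        ap = radio.scan_for_open_ap()
        if ap is None:
            continue
        if not radio.associate(ap):
            continue
        if not radio.get_dhcp_lease():
            continue
        if not radio.ping(server_addr):      # end-to-end check filters out for-pay "open" APs
            continue
        radio.tcp_test_upload(server_addr)   # small upload to estimate bandwidth
        first_ok = last_ok = time.time()
        while time.time() - last_ok < LOST_PING_LIMIT:
            if radio.ping(ap.address):        # local AP ping measures connection duration
                last_ok = time.time()
        radio.log_association(ap, duration=last_ok - first_ok)
```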
Okay, so this data was collected with an early deployment: all the data I'm going to show you
was collected with just individual cars, not with the taxi test bed; this is from a MobiCom 2006
paper. It's about 32,000 distinct access points that we observed, 290 distinct kilometers of road,
and about 300 hours of driving.
So association duration is the first metric. The first question you might have is how long a car is
associated with a network on average, when it gets connectivity. So this is: we're scanning, we
associate, we get an IP address, we get a ping, we lose the connection. We're interested in the
time from the first AP ping to the last AP ping. That's the association duration, and this is just a
CDF of association duration.
So I already told you that I think the mean is something around 19 seconds. You can see the
median is about 13 seconds, so the mean is slightly longer than the median because this is a
long-tailed distribution, where there are some connections that last a really long time, like two or
three minutes, because the car is sitting in traffic, or sitting at a stoplight, or stopped.
Okay, so the thing to get from this is that we get quite a bit of connectivity, right? 15, 20 seconds
is a little bit surprising. The other question, though, is how does this vary with the speed of the
car? Are we only getting these connections when the car is stopped?
This is just the fraction of associations we had versus the speed of the car. And you can see
that we have a pretty linear distribution of associations by speed, up to about 60 kilometers an
hour. So this is, whatever, 35 miles an hour or something. Partly that's because our data mostly
consists of people driving about 35, not driving on the freeway; these are mostly people who live
around MIT who take city streets. But it also probably is the case that you're just not going to do
as well on freeways, where there are fewer wireless networks and less opportunity to connect.
This is just showing the distribution of connection duration by speed. Right here there just aren't
very many connections happening, but the thing to observe is that, obviously, duration of
connection goes down as speed goes up. Not surprising, right?
Okay, so the kind of take-away here is we get connections at a lot of speeds, we don't have a lot
of data at higher speeds but, you know, for at least city driving there's quite a bit of opportunity
here, right? And then when we get connections, they're pretty long, like 15, 20 seconds. So
there's the ability to upload quite a bit of data when that happens.
So the next question, though, is okay, that tells me how long connections last on average, but
how frequently do you actually experience connections? We looked at two metrics, which we call
disconnection duration metrics; these are the delays between attempts to join. So how often do
we see any access point at all, not necessarily successfully associate, and then how frequently
do we successfully associate and obtain connectivity?
So this is just a CDF showing those two things. The dark black line is the distribution of
attempts. If you look at the median here, you can see that we see an access point about every
two seconds. Of course, this metric is a little misleading: it's not that you steadily see an access
point every two seconds; sometimes you'll see 20 access points in one second, and then you'll
have 20 seconds with no connectivity, right? So it's a pretty variable measure, and the shape of
the CDF gives you a sense of what's going on here. And you can see that the median of
associations is that we get an association about every 13 seconds.
So we get an association about every 13 seconds at the median, and it lasts about 19 seconds
on average. Sorry, the mean is different: the median here is about every 13 seconds, but the
mean is much, much higher than that, because there are long periods where there's no
connectivity at all as you're driving along.
It actually turns out that the mean disconnection time is about 260 seconds, so about four
minutes: roughly once every four minutes we get this 19 seconds' worth of connectivity, okay?
And we see an access point about every 20 to 23 seconds on average, as opposed to at the
median.
And then the final question is, how much total data do you transfer when you get a connection?
This is just showing a CDF of that. If you go across the 50 percent line, you can see, and I'm
sorry this graph is really hard to read, that it's about 200k. So you get about 200k of data at the
median association.
The mean is actually about 600k, so the mean is considerably higher, because the mean
connection time is longer than the median connection time, and longer connections push more
data through. So on average you get more data through than at the median.
So the take-away from this is that there's a lot of connectivity in the wild, right? We get on
average 600 kilobytes every 200 seconds. So if you like, you can think of this as roughly a
3 kilobyte per second network, viewed at a very high level. And 3 kilobytes a second doesn't
sound like a huge amount of data, but if you're thinking about getting traffic off of cars, or traffic
information onto cars, or news updates and other textual information onto cars, it's definitely
sufficient bandwidth to get a fair amount of data onto or off of the cars in real time that might
potentially be very useful. You know, giving you your e-mail in your car. All those kinds of
applications could be supported by this kind of thing.
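As a quick sanity check on that figure, the effective bandwidth is just the per-connection transfer divided by the time between useful connections, using the round numbers from the talk:

```python
# Back-of-the-envelope check of the effective bandwidth quoted above, using
# the talk's round numbers (~600 KB per useful connection, roughly one such
# connection every 200 seconds of driving).
mean_kb_per_connection = 600
mean_seconds_between_connections = 200
effective_kb_per_second = mean_kb_per_connection / mean_seconds_between_connections
print(f"~{effective_kb_per_second:.0f} KB/s averaged over the drive")  # ~3 KB/s
```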
So we think that's pretty encouraging. And this is true even at sort of normal driving speeds, just
driving around. Yeah.
>> : Have you looked over time at whether this ratio of open access points is getting larger or smaller?
>> Samuel Madden: Yeah, that's on our stack of things to do, now that we have this running for a
couple of years: to actually look and see whether there's any appreciable change over the last
couple of years. It seems probable that openness has gone down somewhat. Although, I mean,
we've been running these experiments, and the next-generation network we're looking at is
really wi-fi only, and it works surprisingly well. So.
All right. So the last thing I want to talk about with respect to wi-fi, wi-fi as a sensor, is this thing
we call V-Track. And again, there are two things going on here. One motivation of V-Track is to
look at whether it would be possible to build the kind of system we had before in a framework
that doesn't have any GPS.
Okay, so why would you want to get rid of GPS? Well, GPS makes the device a little bit more
expensive. For us as academics, if we're trying to deploy a thousand of these things, that cost
maybe matters. In some cases it may not matter.
It also complicates deployment a little bit because you need an antenna that has a clear view of
the sky. If you imagine a GPS buried in your pocket all the time, in a cell phone for example,
you're likely to have significant signal loss and obstruction issues as a result. GPS is also quite
power hungry.
So our approach is to use wi-fi to do positioning. Using wi-fi to do positioning isn't an entirely
new idea, right? People have been doing this for quite a while, and there are now these
companies, Skyhook, for example, or Navizon, that have these things that run on phones where
you can say, tell me where I am, and it uses wi-fi triangulation to estimate your position.
The difference in V-Track is that we're not estimating your position at a single point in time; we're
trying to estimate the trajectory that you took over time, particularly the historical trajectory. So
we get an observation of all of your wi-fi positions over time, and then we're trying to find a path
that traverses a series of roads that explains those positions, right? And we try to match those
things to roads.
And this is just a plot showing that if you were to do this with raw wi-fi alone, the mean error of
that raw estimate, the estimate you get from using a Navizon-like or Skyhook-like approach, is, in
our experiments, about 73 meters. So that's not good enough to really know what road you're
on. That's kind of the problem we've been trying to solve.
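For reference, the raw per-fix estimate being described is, roughly, a weighted average of the previously surveyed locations of the access points the car can hear. The sketch below illustrates that general idea only; it is not Skyhook's or Navizon's actual algorithm, and the RSSI weighting is just one plausible choice.

```python
# Rough sketch of a raw wi-fi point estimate: average the surveyed locations
# of the audible access points, weighted by received signal strength. This is
# the kind of per-observation estimate whose ~73 m mean error motivates
# V-Track's trajectory matching; it is not any vendor's actual algorithm.

def wifi_point_estimate(observations, ap_locations):
    """observations: list of (bssid, rssi_dbm); ap_locations: bssid -> (lat, lon)."""
    total_w = 0.0
    lat = lon = 0.0
    for bssid, rssi_dbm in observations:
        if bssid not in ap_locations:
            continue
        w = 10 ** (rssi_dbm / 10.0)        # convert dBm to a linear weight
        ap_lat, ap_lon = ap_locations[bssid]
        lat += w * ap_lat
        lon += w * ap_lon
        total_w += w
    if total_w == 0.0:
        return None
    return lat / total_w, lon / total_w
```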
And this is just a picture showing what happens. Every time you see an access point or a series
of access points, you use a wi-fi localization algorithm like the ones these companies provide to
estimate where you are, and then you connect those estimates together with a bunch of lines.
Blue is where you actually traveled according to GPS, and green is what you see.
And this is fairly typical of what we see; this is actually a good case, based on our observations.
You see these things where the car seems to jump from one location to another, because we
don't exactly know where the centroid is, and you see these funky segments where it looks like
the car went on a path it just didn't take. It's kind of ugly looking.
So that's the problem we're trying to solve. I'm not going to go into too much detail about how
we solved it, but there are a lot of tricky issues that come up. I already mentioned that accuracy
is just poor. The other problem is that people move their access points around over time. In
Cambridge you get students who move from one place to another, and so you see an access
point that you thought was in one location that is actually in some other location, and that really
screws up your algorithms.
You also see crazy roads, these intersections where it's very, very hard to figure out what road
you're on; freeway overpasses are typical. Then you also see things like cab drivers who drive
around and around in circles, which makes this problem really, really hard.
But we've looked at two approaches. One of them is an iterative shortest path approach where
we take two end points of a route. We find locations that we're pretty sure about, where we
have lots of wi-fi observations that seem really good. We connect them with the shortest path,
and then we see if that shortest path explains all the other access point observations that we
saw, or explains enough of them. If it does, then we say okay, we're done. If it doesn't, then we
recurse: we find some other intermediate points, connect those together, and we do that until we
get a route that seems to explain most of the observations we've seen.
We've also looked at a particle filter based approach, where you constrain the path of the
particles to travel on roads and to travel at car speed.
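Here is a minimal sketch of that road-constrained particle filter idea: particles advance only along the road network at plausible car speeds and are reweighted by how well they explain each wi-fi position fix. The road-graph helpers (`advance_along_road`, `distance_m`) and the Gaussian likelihood are illustrative assumptions, not the V-Track implementation.

```python
# Sketch of a road-constrained particle filter: move each particle along the
# road network at a plausible car speed, weight it by how well it explains the
# latest wi-fi position estimate, then resample. The road_graph helpers and
# the likelihood model are placeholders, not the actual V-Track code.
import math
import random

def step_particles(particles, road_graph, wifi_estimate, dt, sigma=70.0):
    moved = []
    for p in particles:
        speed = max(0.0, random.gauss(p["speed"], 2.0))           # jitter speed (m/s)
        pos = road_graph.advance_along_road(p["pos"], speed * dt)  # stay on the road network
        dist = road_graph.distance_m(pos, wifi_estimate)
        weight = math.exp(-0.5 * (dist / sigma) ** 2)              # likelihood of the wi-fi fix
        moved.append({"pos": pos, "speed": speed, "weight": weight})
    total = sum(p["weight"] for p in moved) or 1.0
    # Resample in proportion to weight so particles concentrate on plausible roads.
    return random.choices(moved, weights=[p["weight"] / total for p in moved], k=len(moved))
```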
So the very preliminary results we see are that we get about 80 percent road segment accuracy.
Whether that's enough or not, we're not sure. One thing we'd like to do is to be able to throw out
the 20 percent of bad estimates. If we could get 80 percent that we're sure about, and 20
percent that we knew were bad, then we think this would actually be pretty good. That's what
we're working on. We also get speed errors in the range of about 2 meters per second or 10
kilometers per hour; we're estimating speed on road segments as well.
I'll just give you one last little visualization. This is just an example of a drive. Here we don't
actually have raw GPS information. The red line shows the centroids of the wi-fi localizations
that we saw over time. As you can see, in many places it doesn't match roads very well, and it
appears to do funky things like jump around; in this case one of these access points seems to
have moved.
Anyway, the system is able to match this particular drive onto a set of roads, and it appears to be
doing a pretty good job. And this is just showing all the data that we've collected from this
particular car over the last few days; this is current up to now.
>> : -- because you don't have --
>> Samuel Madden: It doesn't have any information. There were no observations from here,
basically, until somewhere down here. And the system just said, oh, well, use Trapelo Road.
Which is almost certainly the right road in this particular case; I mean, Trapelo Road is the only
major road that goes through this area. So.
Okay, so in the interest of time I'm going to wrap up. I did want to talk a little bit about managing
missing and uncertain data and some of the database work we've been doing, but I don't know
whether I should keep going; you guys tell me. I mean, it's been an hour, so -- it will take two
minutes. Okay, if you guys want to run, feel free to run. So one of the problems we've been
looking at, and this is what the FunctionDB system is about, is: we've got all this data, how do we
allow users to make sensible queries over it? One problem is that if you just run raw queries on
discrete data, it turns out not to work very well. So here the little red points are the data coming
from the GPS sensor over time. And this is real data that's stored in our system, and you see
that there are just these periods where we don't have any position information, because either
some packets were lost or because the GPS went out on us for some period of time. We have
missing data, right?
And so if the user asks a question like what was my average speed in this rectangle, you know,
who knows what kind of answer you're going to get, right? Or when did the car pass -- tell me all
the cars that passed through this rectangle, even, right? You're going to get some sort of
nonsense answer.
So what you would rather do, instead of querying the raw data, probably, is fit some collection of
line segments to this data, right, that tell you trajectories, basically, that estimate where the car
was going and its speed over time.
So this is the idea in FunctionDB: rather than storing only discrete data points, we're going to
allow the system to store trajectories, and in particular we're going to provide support inside the
database system for automatically fitting functions to raw data as it streams in. These fit
functions get maintained over time by the system as data streams into the database.
And then we allow users to pose queries over these fit functions. Users can always pose
queries over the raw data if they want, but they can also define what we call model-based views,
which are fit views over the raw data, and pose queries over those when they prefer to be
querying the fit data.
So we support these kinds of models as first class objects inside the database, and we allow
users to query them.
So just to give you a simple example, suppose I have a collection of raw data, like temperature
readings over time, okay? And suppose I want to know, tell me the time when the temperature
crossed this threshold, right? The simple way to write this in SQL would be to say select time
where temperature equals threshold.
Clearly, if you have interpolation you could guess the point in time when the temperature
probably crossed that threshold. But if you ask this query of a SQL database, you're going to get
an empty response back. This is a trivial example, and yes, you could instead ask where
temperature is greater than the threshold. But the idea in FunctionDB is that we can fit a
function. You say, fit this data using this regression function, and again the user gets to specify
what regression function they would like to use to fit the data, and then they can ask this
question, select time where temperature equals thresh. What the system will actually do when it
evaluates this query is solve this equation in order to figure out the times when the fitted function
crossed that threshold.
So the kind of cute thing about this is that query processing, instead of looking at a whole bunch
of discrete points and seeing whether they satisfy the condition in the query, actually becomes
doing a little bit of function solving, okay?
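To give a flavor of what "query processing becomes function solving" means, here is a small sketch using numpy: fit a line to raw temperature samples, then answer the threshold query by solving the fitted polynomial rather than scanning discrete points. This is an illustration of the idea only, not FunctionDB's actual interface or syntax.

```python
# Sketch of the idea above: fit a polynomial to raw (time, temperature)
# samples, then answer SELECT time WHERE temp = threshold by solving the
# fitted function instead of scanning discrete points.
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
temps = np.array([10.0, 12.1, 13.9, 16.2, 17.8])   # raw samples never hit 15.0 exactly

coeffs = np.polyfit(times, temps, deg=1)            # the "model-based view": temp(t) ~ a*t + b

def times_where_temp_equals(threshold, t_min, t_max):
    """Solve temp(t) = threshold on the fitted curve, keeping roots in range."""
    shifted = coeffs.copy()
    shifted[-1] -= threshold
    roots = np.roots(shifted)
    return [float(r.real) for r in roots
            if abs(r.imag) < 1e-9 and t_min <= r.real <= t_max]

print(times_where_temp_equals(15.0, times.min(), times.max()))  # roughly [2.5]
```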
The interesting part about the system is this: what we've built works for any polynomial function
fit with regression. It maintains these regression functions over time as data streams in. We
support aggregates and joins over the data, so aggregates become integrals: trying to find the
average temperature over a range becomes an integral under the curve, right? It also supports
joins of two data sets together, so you can attach new data to a particular curve. For example, if
I had light data in addition to temperature data, I could join those two data sets together and get
a new joined data set that represented light and temperature over time.
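Continuing the numpy sketch above, an aggregate such as an average over a time range becomes an integral of the fitted polynomial divided by the length of the range, again just as an illustration of the idea rather than FunctionDB's implementation:

```python
# Continuing the sketch: "average temperature between t_lo and t_hi" over the
# fitted curve is the integral under the polynomial divided by the interval
# length, instead of an average of whatever discrete samples fall in range.
import numpy as np

def average_over_fit(coeffs, t_lo, t_hi):
    antideriv = np.polyint(coeffs)                    # antiderivative of the fitted polynomial
    area = np.polyval(antideriv, t_hi) - np.polyval(antideriv, t_lo)
    return area / (t_hi - t_lo)
```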
And then the part of the guts of the query processor that turns out to be interesting is how you
deal with queries for which you can't easily find a closed-form solution. The system falls back on
a form of approximation that it uses to estimate the answers to those queries.
But we find that you get about a five to six times performance gain versus running the queries on
raw data when you're doing this inside of CarTel. Because there's just so much less data, these
fitted functions are so much more compact than the actual raw data, the performance actually
turns out to be better. And on top of that, you get the advantage that you're querying these fit
functions, this cleaned up data, instead of the raw uncleaned data. So that's --
>> : Your example of temperature (inaudible) fit data even though that data is quite sparse -- how
much of your experience is application-specific function fitting, and how much is quite general,
applicable to a very large number of (inaudible) data sets?
>> Samuel Madden: So, again, obviously figuring out what the right way to fit your data is and
what set of basis functions to use and all that stuff is a very application-specific thing, and it's
something that everybody wants to do slightly differently. Our observation, though, is that a lot
of science types, anyway, are using pretty simple tools, right? They're happy with simple
interpolation, or some fairly simple linear regression things, and we think you can support those
in the database system pretty well.
We don't say you must use a particular model; we support any class of polynomial basis function
in the system. It's very hard for me to say whether this is general for all users in the world; of
course maybe it's not. But it does seem useful for at least the kinds of data that we've worked
with, like trajectory data, where what you clearly want is some form of interpolation, and we can
play the same tricks with interpolated data. It also seems that some of these simple sensor field
data, like temperature, are fit pretty well by relatively simple polynomial basis functions.
>> : -- if you can give the user, in addition to the answer, some estimate --
>> Samuel Madden: We absolutely can. When you do fitting you can absolutely report the
quality of the regression that you ran. And one idea, if this idea were to take off, would clearly
be to add support for a broader range of models inside the database system.
So anyway, just to quickly wrap up: mobile sensor networks I think have really great potential to
sense the world at a much larger scale than static networks, and they're much cheaper. In the
car space, anyway, there are lots of applications: traffic, fleet management, automotive
diagnostics, this kind of wireless network monitoring and mapping, environmental stuff, like
putting pollution sensors on cars, and traffic planning, like where should I build roads.
I talked to you about the three high level components of the system, the portal, ICEDB and
Cabernet, and then I went into some details about some of the applications and research results
that we've come up with. We have a website, go check it out, and I'd be happy to take more
questions.
>> : (Inaudible)
>> Samuel Madden: Some of it can. If you want, like if you want a specific demo, we can, you
know, find a way to get you a specific demo.
>> : -- would Hari have (inaudible).
>> Samuel Madden: Yeah, so some of that data isn't there. But if you guys want to show
somebody in your research group a demo, I'd be happy to; Michel knows how to get to all of it,
and we can show you web pages of it, so.
>> : You mentioned using wi-fi as a way to estimate your location based on the car location,
car (inaudible) data, but also the history.
>> Samuel Madden: Yeah.
>> : So you said explicitly that you are not going to (inaudible) talk about the privacy (inaudible),
but that does have an implication on the architecture and design, right? You have to keep the
segments from the past history, and someone who has access to the database, with (inaudible)
ways of querying the system, can tell, you know, Sam, what you have done.
>> Samuel Madden: Sure, you have this thing that says, oh, well, Sam -- this car -- visited all
these access points at this point in time. And clearly we have a database right now that has all
that information in it. And if you look at it, you can learn pretty interesting things. I mean, the
first thing you learn is that most people don't do anything interesting; they drive to work every day
and do exactly the same thing. But you can see a lot about what people have been doing, and I
agree that this is problematic, so --
>> : What are the implications for your future system design, where you take those into
consideration?
>> Samuel Madden: Yeah, so I think absolutely, you need to be taking this stuff into
consideration. And there are some things that you can do, like for example, not reporting -- you
know, if a user didn't report his information when he was within some radius of his house, or
when he was on some road that he knew he was the only person who ever drove on, or
something like that, then that could potentially be one way to mitigate privacy concerns.
But I think that what needs to happen, as these systems get more and more popular, involves
two things. One is a technological thing, where people come up with formal models that allow
us to reason about what it means for data in this kind of spatial world to be private. That's
happening; people are working on it, and we have some students working on it.
The other thing that needs to happen is that a set of policies gets established here. In this
setting, for example, there needs to be some set of regulations that say what data is okay to
export in this way and what data is not okay. So that's kind of the way I view it. Question.
>>: Have you tried doing AP viewing correlation to try to notice when these APs move?
>> Samuel Madden: I'm sorry, you said what do you mean by AP viewing correlation?
>>: Say you're in one spot, and you see access points A, B and C, and later -- at the same time,
temporal time, somebody else might be in a different spot, in C, D and E. And then later
someday -- D and A is (inaudible), and A, C, D, E and B --
>> Samuel Madden: Sure, we could do that; it clearly happens. The question is whether there's
some -- I mean, you could learn interesting things about how frequently things move around. I
don't know if there's a more interesting application than that, beyond people who, like us, are
using wi-fi for localization and trying to filter this out.
>> : -- in terms of just keeping your database accurate.
>> Samuel Madden: Yeah, so clearly we can filter -- one thing that gives us is a good time when
you might throw out some old observations of an access point. There's always this question
about how long you should keep these things around in your database. If you suddenly see that
something appears to have moved location, you probably want to throw the old observations of
that thing out of your database, right.
>> : What's the (inaudible). I was thinking if you build the net, instead of using the wi-fi.
>> Samuel Madden: Yeah, so -- I'm trying to think; I don't know the numbers offhand. I know
that this company Boston Taxi has 300 cabs, I think. And there are something like five or six
major taxi providers in Boston. I would guess the number is a small number of thousands of
cabs in a city like Boston.
My guess is that, like many cities, the urban core of Boston is pretty densely trafficked by cabs,
right? If you go out in the suburbs, you'd never see a cab if you stood on the street. But if you
stand on Mass Ave at 7:00 at night, probably 30 percent of the cars you see are taxi cabs,
because they're taking people to and from restaurants and bars and stuff.
So in that sense, I think there are certain times of day where you could do peer to peer
networking effectively with taxis. And I think it's very interesting to actually understand the
question of what density you would need. Even just answering the question of what density of
these kinds of cars you need in order to be able to do peer to peer networking is interesting.
It's actually interesting to ask the question of what density of cars do I need to be able to observe
traffic, get a traffic update about every road once every 15 minutes, right? Like how many cars
do I need on the road in order to be able to do that.
And the answer is nontrivial, because of course cars don't move according to a uniform
distribution at all, right? So you can't just assume that because a car travels 100 kilometers of
road a day, you can use that to easily extrapolate how much road gets covered, or how many
cars you would need.
Other questions? Okay, thanks, guys.
>> : Thank you. (Applause.)