>> Dennis Gamon: Well, actually, we have a very exciting session for you. Not surprising, of course. The general topic is streaming environmental and urban data, and it's actually a topic which is near and dear to my heart. We have three very good presentations on the topic. For the very first one, Jie Liu is going to speak to us on something whose implications we may not fully understand, but I'm sure he'll tell us: cloud-offloaded GPS for energy-efficient location sensing.
>> Jie Liu: Thank you. So yeah, I'm going to talk about how the cloud can actually help us not just store and process data but also collect more data. In particular, in this case, it's location data, which is essential for many scientific discoveries. It's an effort that a bunch of us in MSR worked on, together with our interns from Purdue and Brazil and our collaborators in Australia.
As I said, location is one of the fundamental properties scientific data relies on, from animal tracking -- tracking migrating birds -- and environmental monitoring, like water quality monitoring, to human well-being and health applications. More and more we see these devices -- wearables, clip-on collars, things we attach to animals -- that try to collect data about their locations, interactions, and other properties. For lots of the data we collect, we want a location property attached to it.
I'll show you a very concrete collaboration we're having with [indiscernible] in Australia. They are interested in this fairly big bat called the flying fox. They travel on the order of up to a hundred miles in a day, and they have the potential of both spreading disease and spreading seeds and other things. So scientists are really interested in trying to understand their behavior.
So I'll show you one video, not using our technologies, but one they made by monitoring the location of a bat through a day -- how far it travels. You'll see these diamond shapes; those are actually GPS data points. Once in a while you'll see a gap where they don't have the data, and they sort of interpolate it and assume the bat actually flies in a straight line from one location to another.
And through the day, it travels pretty far, over water and over the forest, and at the end of the day it comes back to the bank of a river and rests there.
This application is fairly challenging in several senses. One is, although these bats are pretty big, they have a payload limitation. They can carry about 30 grams of things; anything heavier than that, they don't like it. They probably won't fly as far. They have energy constraints, because effectively these bats go out at night and rest during the day, so even with energy scavenging, when they're flying and you're interested in the location data, there's no solar energy available to power them up.
And the third one is the communication challenge. They get the data when the bats come back to their resting trees: they set up a radio station and try to read the data out of these collar tags. At those times, if you're monitoring lots of these bats -- and I'll show that the location data can be fairly large -- even just downloading the data using a low-power radio could be fairly challenging in terms of network throughput. So this really serves as an example motivating many applications we've seen in these scientific data collections, where you have lots of limitations in terms of form factor, energy, and the communication technologies you have.
So for this particular talk, we want to drill into the GPS part of the power consumption. People know this is a fairly energy-consuming device, and, in fact, that bat tag right now, with the battery it has, lasts for a day. The device lasts for a day and then it just dies, right? You can collect data for a day and that's it.
This data was collected on a mobile phone. That's another reference point: people generally have personal experience of the energy consumption when you turn on the GPS for tracking and so on. It consumes up to, like, a watt of power if you have the GPS on continuously. In a sense, we all have this experience: with location continuously turned on, your phone will probably die in a few hours, maybe six or eight hours, if you do navigation continuously.
So what can we do? We really want location sensing, but we don't want to pay the energy cost. So for this particular work, let's drill down a little bit deeper
into how GPS works. People probably have this rough notion that there are
satellites in the sky that cover the earth and they send some signals to the
earth surface and there are these receivers that receive these signals. And
through triangulation, you'll find out where the receiver is.
And there's a lot of technology underneath this rough story. For example, the GPS satellites don't actually stay fixed in the sky. They move, and their trajectories are sent out through this so-called ephemeris data, and that's a very slow radio. The signal coming down from the satellite -- they transmit these parameters of the satellite trajectories at 50 bits per second. In other words, in order to receive the whole packet of where the satellites are, with the time stamps in it, the radio has to be turned on for 30 seconds. We've all had this experience: when car GPS units first came onto the market, you'd turn one on and it would say "acquiring satellites," and generally it took about a minute to find out where you are. That's the time spent decoding all this information. And during that period of time, the radio has to be on and all the signal processing has to be working for it to decode this. So that's one of the key reasons for the power consumption in GPS. It needs processing. It also needs time.
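The 30-second figure follows directly from the structure of the GPS L1 C/A navigation message, which is broadcast at 50 bits per second in 300-bit subframes, five subframes to a 1,500-bit frame, with the ephemeris occupying subframes 1 through 3. A quick check of the arithmetic:

```python
# Back-of-the-envelope timing for the GPS navigation message.
# Known frame structure of the L1 C/A signal: 50 bits per second,
# 300-bit subframes, 5 subframes per 1500-bit frame.

BIT_RATE_BPS = 50
SUBFRAME_BITS = 300
SUBFRAMES_PER_FRAME = 5

subframe_seconds = SUBFRAME_BITS / BIT_RATE_BPS          # 6 s per subframe
frame_seconds = subframe_seconds * SUBFRAMES_PER_FRAME   # 30 s per full frame

print(f"one subframe: {subframe_seconds:.0f} s")
print(f"one full frame: {frame_seconds:.0f} s")
```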
To look at the GPS signals -- this is getting a little involved; I won't get into the equations and the math -- there is a so-called Gold code for each satellite, a specific code they send out. It gets modulated, further modulated onto the carrier, and the propagation actually introduces Doppler and other frequency effects onto these signals. The receiver's role is to find the so-called code phase, which is the propagation delay from the time the signal comes out of the satellite to the time you receive it. So this is getting a little involved, but at a high level, you get these signals coming down, and there is a correlation happening both in the code-phase dimension and the Doppler dimension. You try to detect a spike. That spike tells you which satellite you're seeing, what the code-phase delay is, and what Doppler was introduced -- sort of how fast the satellite moves.
And then you get into the tracking mode. Both acquisition and tracking can be done every millisecond, so that actually doesn't take a lot of time. But the decoding, as I said, is the main part. It takes 30 seconds to do the decoding, where you get the time stamp from the satellite signal, you get the ephemeris from the packets, and you've got the code phase, which tells you the fine-grained propagation delay. You feed all of that into a [indiscernible] computation, and at the end you get the location. That whole process takes about a joule of energy on your mobile phone.
But if we think about what's essential for the device to provide, it's only the code phase, because the ephemeris and the time stamp we can estimate. The ephemeris we can get from the cloud. NASA publishes satellite ephemeris parameters frequently, and even projects them into the future. That we can get in the cloud.
For the time stamp, there are ways of estimating it. You can do device-level time stamping, but this won't be very accurate. Still, there are ways of using it as an estimated parameter in your least squares and still being able to find the location. So essentially, we only need the code phase, which we can sense every millisecond. That's the minimum amount of time we need to receive the GPS signal to derive your location.
So there we're using a trick called coarse-time navigation, which basically says: if I know roughly where you are, to within a few hundred kilometers, I can estimate the whole-millisecond part of the propagation delay, because light travels so fast. If I know my location, and I know a nearby location, the whole-millisecond part of the propagation delay is going to be the same, and only the code phase -- which you can measure every millisecond -- determines your fine-grained location. So you have a rough notion of where you are, then you correct it with the code phase.
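The coarse-time trick can be sketched in a few lines of code. The numbers below are made up for illustration; only the idea -- the integer milliseconds come from a rough reference position, the sub-millisecond part from the measured code phase -- comes from the talk:

```python
# Sketch of coarse-time navigation: the measured code phase gives only
# the sub-millisecond part of the propagation delay; the whole-millisecond
# part is predicted from a rough reference position (good to a few
# hundred km, i.e. well under half a millisecond of delay error).

MS = 1e-3  # one millisecond, in seconds

def full_delay(predicted_delay_s, code_phase_s):
    """Combine the integer-millisecond delay predicted from a rough
    position with the sub-ms code phase measured on the device."""
    n_ms = round((predicted_delay_s - code_phase_s) / MS)
    return n_ms * MS + code_phase_s

# Illustrative numbers: true delay 72.1837 ms. The device measures only
# the fractional part (0.1837 ms); the rough position predicts 72.4 ms,
# i.e. it is off by ~65 km, well inside the tolerance.
true_delay = 72.1837 * MS
measured_code_phase = 0.1837 * MS
predicted = 72.4 * MS

recovered = full_delay(predicted, measured_code_phase)
print(f"{recovered * 1000:.4f} ms")
```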
That's the basic insight. I won't go into too much of the detail. But if you apply it blindly, without knowing your reference location, you'll get lots of possible solutions -- these are all possible solutions given a particular set of code phases. So obviously, we can't accept all those locations. There are interesting ways of estimating a rough location using Doppler intersection: the Doppler gives you a sense of how fast the satellite is traveling, and by looking at it from the ground, if you know the satellite's speed, you know the angle you're looking at it from. If you have multiple of these angles, you can intersect them, and that gives you a rough estimate of a location.
A third trick we can do to disambiguate the reference location is to use the third dimension, the earth's elevation. If you're sure you're close to the earth's surface, that helps, because every millisecond of error in the delay measurement gives you a 300-kilometer error in the distance -- so even if the [indiscernible] converges, it will probably converge to somewhere higher than Mt. Everest or deeper than the deepest ocean. So those are some of the tricks that rely on data that's already on the web. The elevation data is already on the web, the [indiscernible] is on the web, and so on. That really helps us minimize what information we need to collect on the device itself. This is the basis of what we call the cloud-offloaded
GPS design, where the device only collects, in our case, a minimum of two milliseconds of this raw signal chunk and stores it, with a rough time stamp from the device clock. Then we send it to the cloud. The cloud will do the acquisition in software, the intersection of [indiscernible], look into the NASA database to find the ephemeris, use [indiscernible] to get the reference location, combine that with the code phase and the device time stamp, and plug it all into coarse-time navigation. Most of the time it gives us only one result. If it gives more than one result, we use the elevation to do further filtering. And through that we always get at most one location coming out of it.
So we built a prototype with essentially a GPS receiving front end, a microcontroller, an SD card, and a battery, and that's the antenna, and we did some evaluation on it. This gives you the power traces of collecting two milliseconds of GPS signal and writing it to the flash card. We only store the raw data. We don't process anything. Then we accumulate that and send it to the cloud afterwards. So that energy number is just for storing the data; it doesn't include sending the data or processing it in the cloud. But the take-away message is that we're about a thousand times cheaper than a GPS receiver on the market that resolves the location on the device itself. Converted into a more intuitive number: with a pair of AA batteries and one GPS sample per second, this device will last for more than a year. So that's a great capability to have in collecting this data. So I'm going to show you a demo of sending this data.
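The year-long lifetime claim can be sanity-checked with a back-of-the-envelope calculation. The ~1 mJ per sample follows from "a thousand times cheaper" than the ~1 joule phone fix mentioned earlier; the AA battery capacity is an assumed typical value:

```python
# Rough battery-life check for the "about a year on a pair of AA
# batteries" claim. Per-sample energy (~1 mJ, a thousand times less
# than the ~1 J a phone GPS fix costs) follows the talk; the AA cell
# capacity (2500 mAh at 1.5 V) is an assumed typical value.

AA_PAIR_ENERGY_J = 2 * 1.5 * 2.5 * 3600   # two 1.5 V, 2500 mAh cells = 27 kJ
ENERGY_PER_SAMPLE_J = 1e-3                # ~1 mJ to sample+store 2 ms of signal
SAMPLES_PER_DAY = 24 * 3600               # one GPS sample per second

joules_per_day = ENERGY_PER_SAMPLE_J * SAMPLES_PER_DAY   # 86.4 J/day
lifetime_days = AA_PAIR_ENERGY_J / joules_per_day

print(f"{lifetime_days:.0f} days")  # on the order of a year, before overheads
```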
Okay. Let me just talk without the slide. What we did was build a web service deployed on Azure. It has a web role that receives the data -- okay, I'll try to see -- I won't show you the demo. Anyway, it receives the raw GPS signals that you send to the cloud service. In the background, it mines the NASA ephemeris database and pulls the data into the web service, and after receiving these raw signals, it does the cloud-offloaded GPS process I described and returns you the set of locations.
Let's see how well it works. We used about 1,500 data traces collected on our devices to evaluate it. The data come from several continents. In order to reduce the power consumption, the way we do it is to turn on the device and sample what we call a chunk -- a few milliseconds of the GPS signal. Then we turn it off, put it in idle for a while, and then we sample another chunk. The reason is to get more opportunities to average out the temporary errors in the signal.
And in this particular case, we used two-millisecond chunks. This bar is two milliseconds, then we wait for 50 milliseconds, and then we sample another two milliseconds. We take three to five of these chunks with 16-times oversampling of the GPS signal itself, so two milliseconds is eight kilobytes of data. With that, we can achieve a median error of about 11 or 12 meters. The average error is about 20 meters. And if we look deeper into the error distribution, when we can detect seven or eight satellites in the signal, we actually get pretty reasonable results. When we have low satellite visibility, the error is bigger, and that's mainly because the receiver we have is a [indiscernible] receiver that hasn't really been optimized in the [indiscernible] part very much.
We also evaluated how much error we can tolerate in time stamping, because we're now relying on a device time stamp rather than the time stamp a GPS receiver would decode from the satellite. You see in that plot that we can tolerate about 60 seconds of timing error, meaning that even if my local clock on the device is about a minute off from GPS time, we can still get a very good location.
And a byproduct of that location is that it actually resolves the GPS time. So we can use the previous time stamp resolved from the data to correct future time stamps. In a sense, we can do time stamp correction recursively, so this particular plot only applies to the first time stamp you get from the stream.
I don't know how much time I have. I'm going to talk briefly about how to reduce the data size. We are collecting the raw GPS signals, and at minimum we are looking at, for ten milliseconds, 40 kilobytes of raw RF data, and that can accumulate pretty fast. If you just do the back-of-the-envelope calculation, if I sample one location per minute over a day, that's 57 megabytes of data. If I use a low-power radio to just read that data off the device with a typical throughput, we're talking about ten minutes just to download that from the device.
Put aside the energy requirement for downloading -- for running the radio for ten minutes. If I have a thousand bats, or a hundred bats, to monitor, that's a lot of time just to read them out, so it doesn't scale very well. So how can we reduce that?
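These numbers are easy to verify. The 40 KB per 10 ms follows from the 8 KB per 2 ms figure given earlier; the radio throughput below is an assumed 802.15.4-class value, so the download time is illustrative (a faster effective link brings it toward the ten minutes mentioned):

```python
# Back-of-the-envelope check of the storage and download numbers.
# 2 ms of 16x-oversampled signal is 8 KB (from the talk), so 10 ms is
# 40 KB. The radio throughput below is an assumed value.

chunk_bytes = 40 * 1000                    # 10 ms of raw RF data, ~40 KB
fixes_per_day = 24 * 60                    # one location per minute
bytes_per_day = chunk_bytes * fixes_per_day

print(f"{bytes_per_day / 1e6:.1f} MB per day")  # ~57.6 MB, matching the talk

radio_bps = 250_000                        # assumed 802.15.4-class radio, bits/s
download_minutes = bytes_per_day * 8 / radio_bps / 60
print(f"~{download_minutes:.0f} minutes to download at that rate")
```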
These signals turn out to be fairly hard to compress, because they are random signals by design -- these are CDMA signals, which should look like random signals. So if you just apply a typical data compression program to them -- we tried it -- you get about five percent compression, so you don't save a lot of space on this kind of data.
So what can we do?
There's an interesting notion called sparse representation. If you look at what we care about in this data, it's just the spike we talked about. It represents the code phase and the Doppler frequency. If we can define a dictionary that represents the raw signal as a sparse, single spike in the transform domain, then there are techniques called compressive sensing: basically, we can take the raw signal, compress it to a much shorter signal, and use L1 minimization to recover and find that same spike as if it were in the raw signal.
Just to give you a sense: this is the spike we get from the raw signal using cross-correlation, the typical way of doing GPS acquisition, versus using this sparse approximation, or compressive sensing, technique to estimate those spikes. The tallest spike is actually the same. The rest is much flatter, because we're trying to minimize the number of spikes in the representation.
So it could introduce some error, but the main spike should be there. Our evaluation shows that with 30 percent of the amount of data, we can achieve more than a 95 percent success rate. So we can reduce the data to about a third, and more than 90 percent of the time we still get the same spikes in the acquisition results and the same location.
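As a toy illustration of the idea (not the actual GPS pipeline), here is a 1-sparse recovery from a 30 percent random projection. A random ±1 sequence stands in for the Gold code, and a single greedy matching-pursuit step -- the 1-sparse special case of the L1 recovery described above -- finds the code-phase spike directly in the compressed data:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 512                       # samples in the raw (noise-free, toy) chunk
M = int(0.3 * N)              # keep 30% of the data, as in the talk
true_delay = 97               # code phase to recover, in samples

code = rng.choice([-1.0, 1.0], size=N)     # stand-in for a Gold code
signal = np.roll(code, true_delay)         # received signal = shifted code

# Dictionary: every circular shift of the code. The signal is 1-sparse
# in this dictionary (a single spike at index true_delay).
D = np.stack([np.roll(code, k) for k in range(N)], axis=1)

Phi = rng.normal(size=(M, N)) / np.sqrt(M)  # random compression matrix
y = Phi @ signal                            # what the device would store
A = Phi @ D                                 # compressed dictionary

# One greedy step (matching pursuit with sparsity 1): the dictionary
# column most correlated with y gives the code-phase spike.
recovered_delay = int(np.argmax(np.abs(A.T @ y)))
print(recovered_delay)
```

Random projections approximately preserve inner products, so the correlation peak survives the compression even though the stored data is a third of the original size.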
So that's an interesting way to reduce how much data you need to store on the device. And it turns out this projection, this way of compressing data, is also very cheap. It's actually computationally cheaper than running a typical text-compression algorithm.
So just to conclude, we split the GPS sensing between the devices and the
cloud, and we leverage the information in the cloud, things like the satellite
trajectories, things like the earth elevation, and we leveraged the
computational power in the cloud to reduce the energy consumption of GPS
sensing on the device side and the data amount we need to store and move most
of the computation into the cloud.
It turns out this way of doing things is quite powerful. We have a different project that shows this system can even work indoors in many places. We can't handle high-rise buildings, but in shopping malls and single-floor buildings, with carefully designed antennas and by leveraging the computational power in the cloud, we can get locations in many indoor places.
Just to do a little bit of advertisement: we are going to release this. This is the new sensor board design, which has the same GPS front end but an FPGA to do the processing, plus a microcontroller, and you can integrate it into your own sensors. And we have the web service running in Azure. Probably in a few days we're going to turn it on and have people try it. Thank you.
>> Dennis Gamon: So maybe we can set up the next talk and you can take a question.
>>: Can I put that device in a phone?
>> Jie Liu: It's possible. We actually had this conversation with multiple groups. The challenge there is that your phone has a very integrated GPS chip, to the extent that the cell radio, the WiFi, Bluetooth, and GPS all use the same chip. And that chip doesn't give us low-level enough access to the raw data, which is what we want. We only want to log the raw data, not do any processing on the phone. And they don't have that interface right now to give us the raw data. So that's a deeper conversation. That's why we use our own GPS front-end chip in this sensor device.
>>: But there could be a model of the Windows Phone.
>> Jie Liu: If we have enough market share. It's in the conversation, but
just we don't have the power to determine that.
>>: [Indiscernible] have a phone kit now, you can sample your own phones.
>> Jie Liu: A question over here.
>>: I was just wondering [indiscernible].
>> Jie Liu: Indoors, if we can get a location, we see the same accuracy, about ten-meter median accuracy. We get a location in about 60 percent of the places. Really, the insight is that we take advantage of skylights -- glass is really good, the signal can penetrate -- so we have this antenna, it's directional, and we point it in different directions to try to find an opportunity to see a satellite.
>>: Okay.
>> Jie Liu: And then we have tricks to combine different directions and form
a -- the same process to get the location. So if we get the location, it's
okay. But sometimes we just don't get it.
>> Dennis Gamon: As you may remember from just a half hour ago, our topic is streaming environmental and urban data, and in this particular presentation by Professor Yung-Hsiang Lu from Purdue, he's going to discuss another aspect of this, which is when cloud computing meets data streams. His title is cloud computing for analyzing many data streams.
>> Yung-Hsiang Lu: Okay. Thank you. So actually, I changed the title a bit. This has a better acronym. We call it CAM2. It's a work in progress. I have to make two slight advertisements. ACM pays for my airplane ticket, so I have to show what ACM is -- I assume everybody knows ACM. And the ACM distinguished speaker program: I'm one of the speakers in their program. So I fulfilled my duty.
We currently have 35,000 cameras in our system. We don't own the cameras. We find them online. I'll tell you what we can do with them. This is one of the pictures of the cameras in Europe. We have approximately the same number of cameras in the U.S. as outside the U.S. So what's the motivation? Why did I come here? Because a couple months ago, we did a study using the cloud of a company across town, and then we received an email saying, why do you push so much data -- pour so much data -- into the system? Do you have a problem? Is your cloud hacked or something? We said no, that's our intention. We're just doing an experiment. Then before we did it with Azure, I asked Dennis, is there any problem if we try that? He said come over and talk to us about what you're doing. So that's why I'm here.
We currently have 35,000 cameras. If you spend ten seconds per camera, you'll spend 90 hours just looking at the cameras. I hope my math is correct. We actually want to do continuous analysis of the data to understand our world, and I'll show you a few slides of performance using the cloud and talk about our API.
So what's our motivation? We want to understand the world through many, many
cameras. These are a few pictures from the National Park Service. Actually, the National Park Service is very interested in working with us. We got a physical hard drive a few days ago, because there's too much data. They don't want to push it over the network. So actually a hard drive was sent to them.
We use publicly available cameras connected to the internet to understand the world. This is their website. And there are many, many people that put cameras online for various reasons: departments of transportation, city governments. You can see traffic, snow, and so on. This camera is one of my favorites. I don't know if you can see what that is. It's from a university, and you can look at ants running around. It's kind of funny to watch. And there's a chicken farm here and some other places.
So why do people put cameras online? Many, many reasons. They want to inform people, to collect scientific data, to study things such as animal behavior, to attract customers, and so on. Why are images and video particularly interesting? Because looking at one single image, you can have many different interpretations.
I'll show you a few examples, and I hope you can tell us what you can find. Many people use cameras to do various kinds of studies. Many of them use public cameras. Most studies use only a few cameras; there are a few examples I can tell you of that use a thousand cameras, but at a very low frame rate, maybe one frame a day or something like that. And those systems, as far as I know, are closed systems, meaning they may give you data, but that's all you can do. You don't get the source of the data. If you want to increase the frame rate, there's nothing you can do, because the data is archived. What I want to do is give you the choices, because for different types of studies you want different options. For example, if you want to detect wildlife -- animals -- you are going to want one frame per second. Otherwise, you cannot detect them. If you want to look at the weather, one frame every hour may be enough. So you want to have the options, and we want to give you the options.
So I hope there's some interaction here. Imagine you have an archive of this data or a stream of this data. What can you find? This is the dangerous part -- a few of you, I know your names.
>>: Plant growth.
>> Yung-Hsiang Lu: Say that again?
>>: Plant growth.
>> Yung-Hsiang Lu: Plant growth.
>>: They're sitting on the glaciers.
>> Yung-Hsiang Lu: Glaciers, water, what else?
>>: Animal migration.
>> Yung-Hsiang Lu: Animal migration -- there you need a high frame rate, right? Anything else? Maybe the water level. Maybe when the snow melts. Maybe if you are really an expert -- I am not -- you can assess the plant species. That's an indication of climate change, right?
>>: Clouds.
>> Yung-Hsiang Lu: Clouds, weather.
>>: Humidity?
>> Yung-Hsiang Lu: Humidity? I'm not sure.
>>: You can.
>> Yung-Hsiang Lu: Okay, thank you. How about this one? What can you find if you look at this single image, or if you have a series of images from this camera? This is a shopping mall in Italy.
>>: Busy hours.
>> Yung-Hsiang Lu: Busy hours.
>>: Amount of people.
>> Yung-Hsiang Lu: Amount of people.
>>: The state of the economy.
>> Yung-Hsiang Lu: The state of the economy, very good.
>>: Fashion trends.
>> Yung-Hsiang Lu: Fashion trends, okay. Actually, a few weeks ago, a professor told me somebody predicted that in late May, light green will be the dominant color for fashion. We have about a month to figure out whether that's true. We'll see. Yes, you want to say something?
>>: I'm ahead of the game.
>> Yung-Hsiang Lu: Great, thank you. How about this one? This is a computer lab in our school. What can you find? What information can you tell?
>>: Graduation rates.
>> Yung-Hsiang Lu: Okay. So as you can see, there are many different types of information you can extract from the same set of data.
How about this one?
>>: Velocity of water.
>> Yung-Hsiang Lu: Velocity of water. I think this is the last example I have.
>>: Existence of pollution.
>> Yung-Hsiang Lu: Pollution. I guess many of you probably know there was a chemical leak in West Virginia a couple weeks ago and the water color changed, right? So there is a lot of useful information you can gather from the cameras.
Every time I talk about this, people ask me this question. So before you ask, I'm going to answer. We don't do surveillance. Okay. We only deal with publicly available data, and we have been looked at by Purdue's internal review board. They said we don't have any problem with human subject study.
I want to tell you a story that is part of the motivation for this study. Some time ago -- a few weeks ago -- one of my friends called me and said, I'm stuck in traffic. Can you tell me what's going on? So I went online and looked at a camera. I said, about five miles in front of you, I see a flashing light. There's an emergency vehicle there. So I guess there's a traffic accident. If you can just wait -- drive slowly a few more miles -- you'll be fine.
And interestingly enough, about 20 minutes later, my friend told me, you were right. Now, the question is, can we automate this process? Because I had to go online myself and watch with my own eyes. I don't want to do that for everybody, right? I think we already talked about this. We can do environmental studies. We can do city planning, traffic. And this is an example from one of my collaborators. They want to use the ground image on the left side to calibrate the aerial image. I guess the light doesn't make it very easy to see, but there's a more yellow kind of color here to indicate the crops lack water, and they want to use the color to calibrate the aerial image to predict the yield of the crops. So this is an example of where you can use a ground image to help you do a larger study using aerial images.
Now, let's talk about how we can analyze the data, and then I'll give you a little bit of data on what we have done so far. First, you have to find the data source. That's why we find the cameras all over the world. Then retrieve the data and understand what's inside it. If you want to study natural resources, looking at the highway probably doesn't help you much. But if you want to understand traffic, a camera in a national park is not going to help you much. So you have to understand what's in the data. The data, and its value, will determine whether a particular camera is going to be helpful to you.
Then you are going to analyze the data and obtain useful information. Here, it shows three examples of different kinds of studies. If I analyze traffic, this camera will give you a better result than this camera. If you want to analyze ant behavior, the second one is going to be useful.
So let me try to give you some idea of how much data we are talking about. I hope my mathematics is correct, but if I'm wrong, you can tell me. One high-definition video camera can generate approximately one megabit per second, plus or minus. If you multiply that by the number of seconds in a day, that's how much data you get per day, and if you multiply that by the number of days in a year, that's how much data you have. That's about one hard drive. So what? One hard drive is cheap.
But if you look at how many cameras are sold each year, one market study suggests that about 20 million network cameras -- this does not include mobile phones -- are sold a year. Let me do maybe a conservative estimate. Say five percent of the cameras are online, meaning you can access them, publicly available. That will generate about 8.6 petabytes a day if you get one frame per second and your resolution is not super high -- 100 kilobytes per image. Of course, you can do all kinds of compression; it's 100 kilobytes assuming you use JPEG. That's a reasonable approximation. We find lots of cameras that give you more than ten frames per second, and we also find lots of cameras that give you very high resolution. But this is a good starting point as an approximation.
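The arithmetic behind these estimates, assuming each online camera delivers one 100-kilobyte JPEG frame per second (the combination that reproduces the 8.6-petabyte figure):

```python
# Reproducing the back-of-the-envelope numbers from the talk.

# One HD network camera at ~1 megabit per second, for a year:
bits_per_year = 1e6 * 3600 * 24 * 365
terabytes = bits_per_year / 8 / 1e12
print(f"{terabytes:.1f} TB per camera per year")  # ~3.9 TB: about one hard drive

# 20 million network cameras sold a year, assume 5% publicly online,
# each sending one 100-KB JPEG frame per second:
cameras = 20e6 * 0.05
bytes_per_day = cameras * 100e3 * 3600 * 24
petabytes = bytes_per_day / 1e15
print(f"{petabytes:.2f} PB per day")              # ~8.64 PB, the figure in the talk
```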
Because there is so much data, usually the data is nonpersistent, meaning nobody stores it. It just comes, and if you don't do anything, it's gone forever. So how can cloud computing help? Well, the thing is, people want different things, and that's where cloud computing can be very useful. Some people want a high frame rate, some people want a low frame rate. Some people want to store data forever -- I talked to some people who want to study the environment, and one said, talk to me after you have 20 years of data. I'm not interested in anything less than 20 years.
Okay. Wait for 20 years after the project starts. Some needs depend on the time of day or the season. If you want to study snow, there's not much to study in late April, right? I mean here, okay. Or if you want to study traffic, there's probably not much to study at midnight. So the variation makes cloud computing great, because you pay only for what you have to. Cloud computing will be a great way to handle these differences in data.
Now, I'm going to start talking about our work in progress. This is our system. It's a general-purpose cloud system for analyzing many video streams, and the idea is you will be able to configure the system for your study. You can say, I want to analyze data for the following 500 cameras. You'll identify the cameras based on various conditions, such as geographical location, time of day, and so on. You will say, I want so many frames per second or per minute. We'll tell you if a camera cannot do that, because the camera simply doesn't refresh that fast.
For example, some traffic cameras will refresh once a minute, let's say, on the website, and we can detect how often they refresh. You can also tell us whether you want to store the data or not. If you want to store data, we'll give you some information about how much data will be generated. And we'll have an event-based API for programming. I'll show you an example.
So this is our system. I'll only talk about a few parts of the system. We
have, first, 35,000 cameras, and we are still adding more. We have a database
describing the cameras, and through various conditions, you can select the
cameras you want to use. The data from the cameras you select is heavy
traffic; this big arrow here indicates the heavy traffic. The heavy traffic
goes directly to the cloud. It doesn't go through our system, I mean the
[indiscernible]. We have a resource manager to give you a suggestion --
actually, it doesn't just give you a suggestion, it decides where to send the
data, so you don't have to worry about that. And then we have a programming
interface for you to write your program. This dashed line is the place where
the domain [indiscernible] will come in. They have their own [indiscernible]
to do. All the others are common to other scenarios, so we handle all of them
so that you don't have to repeat the same effort for different kinds of
studies.
I'm going to talk a little bit about the resource manager part and then talk
about the cloud part. So first, this is a camera looking at a street
[indiscernible] from my office; construction is being done right now on this
street. The purpose of this example is to show that different cameras can
give you different types of data over different protocols. That's one of the
challenges for people. They don't want to deal with protocols, right? So our
system will handle that.
The example I'm going to show you pulls data using Motion JPEG. What is
Motion JPEG? You retrieve data from the cameras, and it has intra-frame
compression only, so each image is JPEG compressed but there's no compression
between images. Why do cameras do that? Because it reduces the computation on
the camera. Most cameras are very small; they have a tiny processor inside.
It's bad for the network, because there's no inter-frame compression, but it's
an example as a starting point. Once you issue an HTTP request, it will give
you the frames as they become available, so one request will get quite a few
frames.
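The frame-extraction step just described can be sketched in a few lines. This
is a hypothetical illustration, not the project's actual code: because Motion
JPEG has no inter-frame compression, each frame in the raw byte stream (as
returned by a long-lived HTTP request) is a standalone JPEG delimited by the
SOI and EOI markers, so frames can be sliced out by scanning for those
markers.

```python
def extract_jpeg_frames(buffer: bytes):
    """Extract complete JPEG images from a raw Motion JPEG byte stream.

    Each frame is a standalone JPEG delimited by the SOI (FF D8) and
    EOI (FF D9) markers, so no inter-frame state is needed.
    """
    frames = []
    start = 0
    while True:
        soi = buffer.find(b"\xff\xd8", start)    # start-of-image marker
        if soi < 0:
            break
        eoi = buffer.find(b"\xff\xd9", soi + 2)  # end-of-image marker
        if eoi < 0:
            break                                # incomplete frame; wait for more bytes
        frames.append(buffer[soi:eoi + 2])
        start = eoi + 2
    return frames
```

A real client would call this repeatedly as chunks arrive on the HTTP
connection, keeping any trailing partial frame in the buffer.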
Honestly, we don't know the exact locations of the cameras, because currently
we have an IP address, and an IP address is not a precise location. So we use
this plot to indicate the approximate locations of the cameras, and we only
consider the cameras that we have measured at more than ten frames per second
for the following studies.
So the first question we ask is: should we use threads or processes? The
answer is pretty obvious: we should use threads, because process overhead is
too high. This was measured on Purdue computers. So starting from here, I
have a few slides
showing the Azure dataset.
So this is an example where we tried to pull data from multiple cameras, up to
almost 500 cameras, using an Azure Large instance: eight cores, 14 gigabytes
of memory. The three curves show the data rate in megabits per second for
retrieving the traffic.
As you can see, it's not surprising that if you put a virtual machine in the
United States, you get the highest data rate. We also tried north Europe and
east Asia. And going back to the earlier slide, we used the cameras in the
United States.
A natural question to ask is why do we even bother. The camera's in the United
States. Why do we even consider to [indiscernible] to Asia? Can anybody
suggest an answer?
>>: [Indiscernible].
>> Yung-Hsiang Lu: It's purely [indiscernible] with no practical information.
>>: Price?
>> Yung-Hsiang Lu: Price. Because we're talking about a lot of data. If you
can save a little money per hour, you can save a lot of money. If you don't
have to use the high-performance machines, depending on the application, then
you can save money. So the purpose of this study is to understand the
implications of the various kinds of configurations, as I mentioned earlier,
when you configure the system.
There's another study where we wanted to compare the number of cameras per
thread. If you have one camera per thread, one thread just grabs data from
one camera. If you have multiple cameras per thread, that thread will grab
data from multiple cameras sequentially. We wanted to see which one is
better, and it turns out you shouldn't use one thread per camera, at least in
our measurements.
The earlier data showed what happens when we grab the data and do absolutely
nothing; we just throw it away. Suppose you want to archive the data.
Obviously, your data rate drops dramatically compared to this one. In this
case, it drops by quite a bit when we store it, and we only store the data, we
don't do any processing. So this plot suggests that if you want to do
different types of studies, you have to select your virtual machines and
their locations more carefully.
The next example does what we call background subtraction, using OpenCV's
[indiscernible] background subtractor. It shows an example of a street, and
if a car comes in, you can separate it from the background. So we start with
the background and detect a car. A few more examples here: you can find a
bus, find a car, and there's actually a person here. So we ran an OpenCV
background subtraction, and as you can see, in this case the virtual machines
in the United States and north Europe actually don't have much difference.
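The background subtraction idea can be illustrated with a toy running-average
model. The talk uses OpenCV's built-in subtractor; this pure-Python sketch is
our own simplification for grayscale frames stored as lists of pixel rows,
and the `alpha` and `threshold` values are arbitrary assumptions.

```python
class BackgroundSubtractor:
    """Toy background subtractor: a slowly updated running-average model.

    apply() returns a binary mask where 1 marks pixels that differ from
    the learned background by more than `threshold`.
    """
    def __init__(self, alpha=0.05, threshold=30):
        self.alpha = alpha          # rate at which the background adapts
        self.threshold = threshold  # intensity difference counted as foreground
        self.background = None

    def apply(self, frame):
        if self.background is None:
            # First frame initializes the background model.
            self.background = [row[:] for row in frame]
            return [[0] * len(row) for row in frame]
        mask = []
        for r, row in enumerate(frame):
            mask_row = []
            for c, pixel in enumerate(row):
                bg = self.background[r][c]
                mask_row.append(1 if abs(pixel - bg) > self.threshold else 0)
                # Blend the current frame into the background model.
                self.background[r][c] = (1 - self.alpha) * bg + self.alpha * pixel
            mask.append(mask_row)
        return mask
```

A real deployment would use something like OpenCV's mixture-of-Gaussians
subtractor, which handles lighting changes and shadows far better than this
single running average.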
So now you can see why we want the resource manager: you may want a virtual
machine [indiscernible] far away if that saves you cost. This is an example
where we compare different types of virtual machines, all from North America.
Background subtraction is very computation intensive and memory intensive, so
in this case you really want to use eight cores, as you can see. These A7 and
A4 machines obviously win over the others. And once you use 14 gigabytes of
memory, you stop here because you're out of memory, but with more memory you
can continue. So for this application, more cores are obviously better.
And if you want to save images, which is I/O intensive, more cores do not
help. We haven't figured out why that's the case. We measured this a couple
of times, and it's pretty consistent: if you just retrieve and don't do any
processing, A5 is actually better. We haven't found a good explanation. But
the point here is that you want to select different types of virtual machines
to make it more efficient and to save cost.
Frame rate, we find, is very closely related to round-trip time. Not
surprising, because of the TCP protocol's properties. So what do I mean?
Depending on the application, you allocate virtual machines differently. A
close virtual machine will reduce round-trip time and improve data rate, but
depending on your application, you may not want to do that, because it may be
too expensive. There's no clear solution, not yet, for how to allocate
virtual machines.
For the next few minutes, I will talk about the analysis part, because so far
we have only brought in the data; we have not done any analysis. Our analysis
uses an event-driven API. I'll use a very simple OpenCV [indiscernible] to
show what a typical program analyzing video looks like. We basically have
three steps: initialization, processing, and finalization.
So this is initialization. You say, I'll grab a camera, or I'll grab a file
from my disk. At the end, when you finish, you release your resources. In
the middle you do some processing. In this case, I get one frame, I
[indiscernible] the background, I show the frame. This is an example. If you
use our event-driven API, what you do is replace the while loop with an event
called onnewframe. That's a name we created. Basically, when a new frame is
ready, your function will be called. So essentially, you only need to write
one callback function called onnewframe. You don't need to know where the
data comes from. You don't need to know the frame rate. That's all in your
configuration. Your program just needs to handle this event, and you do
whatever you want for your analysis. You can save the result; you can do
other things.
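The callback replacement just described can be sketched like this. The names
(`CameraEventLoop`, `on_new_frame`, `dispatch`) are our guesses at the shape
of such an API, not the project's real interface:

```python
class CameraEventLoop:
    """Minimal sketch of an event-driven video API: users register a
    per-frame callback instead of writing their own while-loop."""
    def __init__(self):
        self.handlers = []

    def on_new_frame(self, handler):
        # Usable as a decorator; handler(camera_id, frame) is called per frame.
        self.handlers.append(handler)
        return handler

    def dispatch(self, camera_id, frame):
        # Called by the frame-fetching machinery; the user never deals
        # with protocols, frame rates, or camera URLs.
        for handler in self.handlers:
            handler(camera_id, frame)


loop = CameraEventLoop()
counts = {}

@loop.on_new_frame
def count_frames(camera_id, frame):
    # User analysis code goes here; this one just counts frames per camera.
    counts[camera_id] = counts.get(camera_id, 0) + 1

loop.dispatch("cam42", b"frame-1")
loop.dispatch("cam42", b"frame-2")
```

The point of the inversion of control is that the same user callback can be
fanned out across thousands of cameras by the system, with frame rate and
camera selection handled in the configuration.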
The idea here is that we want to be able to process a very large amount of
data from many cameras using a very simple program. So with a few standard
changes -- that's what I did here, changing what our program looked like
originally by turning the while loop into an event and processing in that
event -- we were able to run this program across more than 1,700 cameras. We
can select a frame rate; in this case, we used one frame every five seconds
and five virtual machines. We demonstrated how we can use this simple method
to analyze data from many, many cameras. And 1,700 is the number we selected
as a starting point. We don't want to push that to 30,000 cameras yet,
because maybe we'll create too much traffic. This is the largest simultaneous
experiment we have done so far.
I want to acknowledge my collaborators at Microsoft and a company across
town. This started as an [indiscernible] student project; you can see the
award on my desk. I also want to acknowledge the sources of data. Many, many
sources. Some of them allow us to use the data without any conditions; with
some we actually signed data sharing agreements.
To summarize, we are building a cloud-based system to analyze data from many
cameras. I have talked a little bit about our resource manager, to give you
an idea that it's not so easy to determine what kind of cloud virtual machine
you need. It really depends on whether the application is computation
intensive or [indiscernible] intensive, and on what kind of computation you
want to do. We have designed an event-based API for simple programming, and I
think, as Steve mentioned, there are still a lot of research opportunities.
We hope to have a [indiscernible] for people to play with our system, so they
can write a simple program and analyze data from the cameras, sometime later
this year. Thank you.
>> Dennis Gamon: So we have time for a couple questions while John comes and
sets up.
>>: So there's an effect when you're using, say, a virtual machine with one
core versus eight cores. The number of cores also affects the access to the
number of cameras, if other VMs on the same server [indiscernible] interface.
So having many VMs on an eight-core machine makes me a little nervous.
>> Yung-Hsiang Lu: Okay. So the question is, if we use different virtual
machines, they may share the same physical resources. We have not done that
measurement yet. Also, usually we don't have direct control over which
physical machine a VM will be on. So if you allocate two virtual machines,
they may or may not be on the same physical machine, so we don't have that
control.
Maybe --
>>: I can't give it to you either.
>> Yung-Hsiang Lu: I understand you cannot give it to me. So we have not done
the measurement, simply because we are not sure we will get meaningful data.
>>: [Indiscernible].
>> Yung-Hsiang Lu: Maybe we can try to see how we can --
>>: [inaudible].
>> Yung-Hsiang Lu: We can talk about that, okay. Yes, please.
>>: Just a quick one for you.
>> Yung-Hsiang Lu: Okay. Yes.
>>: You said you're not using it for surveillance, but it seems like you're
providing technology that could be abused. Any thoughts on --
>> Yung-Hsiang Lu: Okay. So the question is, will our [indiscernible] be
abused by people who want to do surveillance. But the data is publicly
available, so if somebody wants to do it, we have no control over it. The
only thing we can say is we don't do it.
>>: Do you have any [indiscernible] experiments that you had, are you
[indiscernible] the data [indiscernible].
>> Yung-Hsiang Lu: Okay. So the question is, over the measurements we have,
do we partition the cameras in any particular way? No, we only choose cameras
that we have measured successfully to get more than ten frames per second
before we do the measurement. Within that, we randomly assign them to
[indiscernible] virtual machines.
>>: Would it not be [indiscernible] same virtual machine?
>> Yung-Hsiang Lu: It's possible that if we do a finer grain measurement, but
we have not done that.
>> Dennis Gamon: Okay. Thank you once again. So now we come to the third and
final talk of this particular session. You know, last night just before I was
falling asleep, I had a question I was really concerned about. In fact, I
wanted to know what John could do with my location history so I think you're
going to tell us, right?
>> John Krumm: That's right, yeah. Thank you. Thanks for the introduction,
Harold, and thanks to Dennis and Christin for inviting me to talk. I'm from
MSR, I work here in this building, and I live not too far away, so Professor
Lu's talk just now kind of caught me off guard. I almost had time to run home
and put on my light green pants, but I didn't think I could make it back in
time.
Now to my talk. I like to start every talk I give by saying big data and
machine learning just so I get those somewhere in my talk. And I'm going to
talk about location history, which is something I've been working on for a
while about location. Location is really an intrinsic part of our minds, and
this is a quote here that kind of explains why. Knowing what direction you are
facing, where you are, and how to navigate are really fundamental to your
survival. For any animal that is preyed upon, you'd better know where your
hole in the ground is and how you're going to get there quickly. And you also
need to know direction and location to find food resources, water resources and
the like.
So location is a fundamental thing that's kind of been drilled into us as
important based on just evolution.
And so I'm going to talk about what can we do with our location history, and
this talk is divided into two parts. The first part is things that we've
already done. I'll talk about some of the research we've done before, and then
some ideas that we haven't really tried yet but I think that would be fun to do
with a sequence of location data.
First of all, I'll talk just about some of the location data we've been
gathering. We've been doing this now -- this is a little bit of an old
slide -- for about eight years. We've been giving GPS loggers to people to
keep in their pocket or their car as they move around, and this is just a map
of some of the data, made by a summer intern of mine. Most of the data we've
got is from the Seattle area. In fact, there's downtown Seattle over there,
and we're kind of right around there now. So what we did is we loaned people
GPS loggers like this one first and then subsequently that one, going to
smaller and cheaper, and we put them in a bunch of regular vehicles that
people volunteered to carry them in. We put them in paratransit vans and the
Microsoft shuttles that run around on the campus here and got data that way.
Okay. So I'll start talking about some of the things that we've done with
this data, and the first thing I want to talk about is making maps. Here is a
plot of our GPS data. So, again, that's downtown Seattle. Here's us, right by
Microsoft. That's Lake Washington, and that's the 520 bridge over Lake
Washington. That's the I-90 bridge right there. And when you look at this, it
looks sort of like a map, and it seems intuitively that you should be able to
go from this plot of the data over here to a map like that over there,
because it's almost a map like that.
So we looked at doing things like that. Why would you ever want to do that?
Well, it's expensive to get that data. Navteq and Tele Atlas have these
specially equipped vans with trained drivers that drive around and get the
data, and then it's kind of expensive to buy it. And also, roads are
changing. There was a bridge collapse not too long ago around here on
Interstate 5, north of here, and we would like to get that updated quickly in
the map.
So here's one of the things we did. Here are GPS traces that we took from
this intersection here. Our goal in this project was to try to count and
locate the lanes of the roads leading into that intersection. The way we did
that was to draw a virtual line across the road like this and look at where
it was intersected by the GPS traces going through it.
And if you make a histogram of that, you get something like this, and you fit
a mixture of Gaussians to it. Then you can count the number of lanes by the
number of Gaussians; the means of the Gaussians are the centers of the lanes,
and the widths of the Gaussians are a function of the widths of the lanes. So
it actually does a pretty good job of interpreting what the lanes are along
the road just from the GPS data.
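The lane-counting step can be sketched as follows. The talk fits a mixture of
Gaussians; as a simpler stand-in, this hypothetical sketch histograms the
lateral crossing positions along the virtual line, smooths the histogram, and
counts local maxima (the `bin_width` of half a meter is an assumption):

```python
def count_lanes(crossings, bin_width=0.5):
    """Estimate the lane count from the lateral positions (in meters) at
    which GPS traces cross a virtual line drawn across the road."""
    lo, hi = min(crossings), max(crossings)
    nbins = max(1, int((hi - lo) / bin_width) + 1)
    hist = [0] * nbins
    for x in crossings:
        hist[int((x - lo) / bin_width)] += 1
    # 3-tap smoothing to suppress noise before counting peaks.
    smooth = [(hist[max(i - 1, 0)] + hist[i] + hist[min(i + 1, nbins - 1)]) / 3
              for i in range(nbins)]
    peaks = [i for i in range(nbins)
             if smooth[i] > 0
             and (i == 0 or smooth[i] > smooth[i - 1])
             and (i == nbins - 1 or smooth[i] >= smooth[i + 1])]
    return len(peaks)
```

A mixture-of-Gaussians fit, as in the talk, additionally gives lane centers
(the means) and lane widths (the standard deviations), not just the count.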
Another thing we did was finding the intersections. Here the GPS data is
drawn in white, and what we did is we made this detector that passed over the
data and built counts in all these annular sections of the detector: how many
points are in this section, how many are in that section. And we trained up a
simple machine learning binary classifier to say, based on those counts, is
this an intersection or not. And here are the intersections that we found. It
actually did a pretty good job of automatically finding where the
intersections are on the map.
And then finally, we actually tried to build a routable road map from the GPS
data we had gathered. This is raw GPS data from kind of a complicated
intersection -- that's a highway there -- and the first thing we did was
clarify those traces. The way we did that was we tried to move them around:
we tried to coalesce traces for lanes going the same direction and separate
traces for lanes going in different directions. We pretended that each trace
was a little electrostatic wire, and if the directions were the same, those
wires would get attracted to each other; if the directions were different,
the wires would repel each other. You can see an example of that here. Here
traffic is moving in two different directions, but you can't really tell from
that picture. After we do this clarification step, it separates the lanes, so
it looks pretty good.
And then from that, you could just make a routable roadmap.
You could click on
the start point and the end point and it would compute a pretty reasonable
route between the two points.
So there's still more to do on this project: figuring out road
directionality, turn restrictions, road names, finding the stop lights and
stop signs, public versus private roads, interpreting parking lots -- that
would be a really challenging, fun problem -- figuring out which lanes are
the carpool lanes, and express lane scheduling.
So Microsoft used to have this tag line: where do you want to go today? We
took it seriously, and we actually figured out where you want to go today. So
we're not asking people that anymore, because based on your location history,
we can predict where you're going.
So wouldn't it be nice if you had a constant companion with you as you're
moving around that could tell you things about gas prices, traffic, points of
interest, available parking, and advertising. It wouldn't be that hard to
build something like that, but it would be a lot better if it knew where you
were going and could tell you about these things before you got there, so you
actually had a chance to decide whether or how to take advantage of the
information it's giving you.
And so you could ask, well, could you just use route planning for that? Well,
if you look at the numbers, it turns out that people only enter their
destination when they're driving for about one percent of their trips. So you
need something automatic. Here's a little video I'll show -- I'm hoping the
audio will work -- why men don't ask for directions.
>>:
Hey, can you tell me --
[screaming.]
>> John Krumm: Okay. So we're trying to avoid that situation. And so here's
how our destination prediction works, and I'll stop this just so I can explain
what's on the image when it gets going here. This is a route I'm going to
drive. I'm right here. Now I'm going to drive along this black route, and
these red points are candidate destinations, and they happen to just be where
the road intersections are. And what I want to do is compute a probability for
each one of those red dots as to whether or not I think that is going to be
your destination as you drive.
And the way this works is we look at each candidate destination, each red dot,
and we look at the partial route that you've taken so far and if that route was
an efficient way to get to the red dot, then that red dot's probability is
higher. If your route was an inefficient way to get there, then that red dot's
probability is lower.
And so you'll see pretty soon, when I start this up, that all these red dots
are going to go away. Their probability is low because to get over there, the
most efficient way to do it would be to go across this bridge down here rather
than up across the top. And so we'll see the red dots just kind of start to
disappear as the drive goes along.
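The scoring idea just described -- a candidate destination is likely if the
partial route driven so far is an efficient way of getting there -- can be
sketched as follows. This is our own illustration, not the paper's algorithm:
straight-line distance stands in for road distance, and the exponential
weighting with `sigma` is an assumed choice.

```python
import math

def destination_probabilities(start, current, candidates, sigma=0.3):
    """Score each candidate destination by how efficient the partial route
    (start -> current) is as a way of reaching it."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    weights = []
    for cand in candidates:
        direct = dist(start, cand)
        via_current = dist(start, current) + dist(current, cand)
        # Inefficiency: extra distance relative to the direct path.
        ineff = 0.0 if direct == 0 else (via_current - direct) / direct
        weights.append(math.exp(-ineff / sigma))
    total = sum(weights)
    return [w / total for w in weights]
```

Candidates "behind" the driver accumulate inefficiency as the trip proceeds,
so their probability mass drains toward the candidates the route is still
efficient for, matching the disappearing red dots in the demo.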
Okay. So then at the end, they're kind of clustering around a destination.
>>: How do you know that that's the end when you're starting out?
>> John Krumm: Well, we don't know that's the end. This was kind of a lucky
situation because you're blocked by the water, right. Although there's some
dots that didn't show up that are on this -- there's a ferry route. That's
actually a ferry dock there. So if I showed more of this, you'd see some dots
over here. But we do know how long most trips are. We have a distribution so
it's not going to predict that you're going to be driving for six hours.
Probably 20 or 30 minutes, usually.
>>: But this is prediction -- this is forecasting, not prediction, because
you're doing it just like a few minutes ahead?
>> John Krumm: Okay, sure, yeah. We also do longer-term prediction, though.
One application of this is picking the nearest gas station, or doing any kind
of search you want -- I'm looking for pizza along the way, or something like
that. Normally when you do a search like that, the local search will just use
a radius, so you might get a gas station that's behind you that you really
don't want to turn around and drive to.
So what we did is we used our prediction algorithm to try to find gas stations
that would give you the minimum expected cost of diversion along the route that
we think you're going to take. So it doesn't have you turn around. And so
I'll show you that video, which is the same route. Just want to pause it here.
So all the white circles are actual gas stations and the one that it's chosen
for you to stop at is the solid white circle. And so it doesn't do a very good
job at the beginning because it hasn't seen much of your route so far. But it
gets better.
So the goal, then, is to have the gas station somewhere ahead of you along your
route so you wouldn't have to drive very far off your route to get to it. And
you'll see when it gets to that top peak, it's a little bit confused. It
doesn't know you're going to turn. But then it recovers.
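The gas-station selection can be sketched by combining the destination
distribution with a detour cost. Again this is a hypothetical illustration
with straight-line distances standing in for road travel:

```python
import math

def pick_gas_station(current, stations, dest_probs):
    """Pick the station with minimum expected diversion cost, where the
    expectation is taken over the predicted destination distribution
    (dest_probs maps destination -> probability)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    best, best_cost = None, float("inf")
    for s in stations:
        # Extra distance incurred by detouring through s, averaged over
        # the probability-weighted candidate destinations.
        cost = sum(p * (dist(current, s) + dist(s, d) - dist(current, d))
                   for d, p in dest_probs.items())
        if cost < best_cost:
            best, best_cost = s, cost
    return best
```

Early in a trip the destination distribution is diffuse, so the choice can be
poor (as in the demo); as the distribution sharpens, the expected-cost
minimizer settles on a station ahead of the driver.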
So we actually built a web service to do this, so you could do a local search
on your phone as you're moving along, and it will give you search results
that are ahead of you rather than behind you. Now, talking about longer-term
prediction, we did a project called Far Out which tried to predict where
you're going to be a day from now, a week from now, a month, and even two
years into the future, where you could get an average error of less than a
kilometer on where you're going to be, because people are actually pretty
repeatable in how they behave. It was just based on time of day, day of week,
and whether or not it's a holiday. It worked surprisingly well.
continuing this work. Here is someone's GPS data as they're driving around.
These are just little triangles that they went through on the ground, and so
we're predicting where they're going to be in the next five minutes. So an
average error was somewhere below 200 meters like that, and here's average
error after ten minutes predicting 20 minutes ahead, 30 minutes ahead, an hour
ahead and so the error's getting worse and worse as you go up. Here is 12
hours ahead. So the error is still below a kilometer. And then, interesting
thing happens. When you're predicting one day ahead, the error drops back down
again, because people are pretty repeatable in what they do. So you're
probably going to be doing -- not you, because a lot of you are visiting, but
most people, if you're at home, are going to be doing probably the same thing
24 hours from now that they're doing right now. And then this goes on. This
is just more and more days out into the future.
So it's really surprising, I think, that with a pretty simple machine
learning routine, just looking at time of day and things like that, you can
predict where people are going to be pretty far into the future.
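The repeatability argument suggests a very simple baseline predictor in the
spirit of Far Out (this toy is our own, and omits the holiday feature and the
real algorithm entirely): predict a future location as the most common past
location seen in the same day-of-week and hour slot.

```python
from collections import Counter, defaultdict

class FarOutSketch:
    """Toy long-term location predictor: people are repeatable, so look up
    past locations in the same (day-of-week, hour) slot."""
    def __init__(self):
        self.history = defaultdict(Counter)   # (weekday, hour) -> place counts

    def observe(self, weekday, hour, place):
        self.history[(weekday, hour)][place] += 1

    def predict(self, weekday, hour):
        counts = self.history.get((weekday, hour))
        # Most frequent past place for this slot, or None if never seen.
        return counts.most_common(1)[0][0] if counts else None
```

Because the lookup key repeats every week, a 24-hour-ahead or 7-day-ahead
prediction is exactly as easy as a 20-minute-ahead one, which mirrors why the
error in the talk drops back down at one-day multiples.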
Okay. Looking at -- question?
>>: So when you're doing things like weather forecasts, one of the measures
of how well you're doing is comparing it to a simplistic forecast. In other
words, if you're going to guess tomorrow's weather, you'd say it's like
today's, and you don't need a [indiscernible] to make that sort of forecast.
Does the same sort of thing apply here? In other words, is there something
where you can estimate that the persistence based prediction might be
uncertain?
>> John Krumm: I guess the way I'd respond to that is you can think of a
whole continuum of sophistication of the models, right? One might be the very
simple thing, that you're going to be in the same place 24 hours from now
that you are now. And maybe I can add something about, well, if I know it's a
weekend, then I've got this kind of two-dimensional space of input features,
and you can just make it gradually and gradually more complicated.
So we've looked at things, you know, you tend to go back to the same places
that you've been to before, like the same grocery store. It might not be
predictable in time, but it is predictable in space. You tend to go back to
places that are near places you've been before so maybe you want to go to the
Starbucks that's close to the grocery store.
So to answer your question, I kind of interpret it as you're asking what's a
good baseline, a very simple baseline, and how can you measure the boost you
get by adding more and more features, right? And there's almost a continuous
space of different algorithms you can use for this, so it's hard to nail down
one as a baseline.
Predicting collective travel behavior. So what we're looking at here is taking
relatively long trips from your home. Can we predict places that you'll visit,
given where you live and the demographics around there and the distance to a
candidate place you might want to visit. What are the chances you're actually
going to go there. So, for instance, in Redondo Beach it turns out that our
results say that these are the places that you're likely to visit. The darker
the dot, the higher the probability. So you're going to go up to the northeast
here and around in California if you live there. The way we do this is we've
got a bunch of data from geo tagged tweets so about one percent of tweets or so
are geo tagged, and we have access to those and they have persistent user IDs
on them so we can actually look at where people travel.
And so our goal was, given where you are now and given a candidate
destination, can we predict the probability that you would go there. And
these are
the features we use. We use the distance to the new place and then a whole
bunch of demographic features that we got from census data on the
demographics of where you live and the demographics of the candidate
destination. And so we just did a machine learning classifier on that. So
here are some results from that.
If you live in Redondo Beach, these are the places that you'd go. There's a
little close-up around California, so you go up to the San Francisco area.
And down here, if you live in Manhattan, these are the places you go. You
tend to stick around here but also go out to the west coast. These are
flyover states; they're really flyover states, and it's pretty concentrated
in New York.
And I don't know if you've seen this old cover of the New Yorker, but it's
really kind of true, right? That if you live in New York, you kind of don't
pay as much attention to the middle, but you do know about the west coast over
here.
Let's say you live in Milburn, Nebraska, population 66. So we only had four
total tweets from Milburn, Nebraska, so you can't really look at the data from
Milburn and predict where people are going to go, because you only have four
samples. But since we did machine learning on this, we look at other places
that are like Milburn, Nebraska, that have similar demographics and we can make
a decent prediction on where people from there would go. And so they kind of
go all over the country.
The other thing you can do is, given a place, ask where people come from to
visit that place. So everyone wants to go to Redondo Beach, according to our
analysis. Milburn, Nebraska: not too popular, except for people around
Nebraska. And Minneapolis, Minnesota, as an example, kind of draws people
from the five-state area. Yes?
>>: Does the predictor do that, or is that real data?
>> John Krumm: This is predicted data trained on real data.
>>: How do you verify?
>> John Krumm: We can't verify for Milburn, since we don't have that much data
from it. But for these, we do have a lot of travel data so we verify with
this. I didn't put in my accuracy graphs just to save time. Yes?
>>: [Indiscernible] this is a hub. I'm wondering whether that's --
>> John Krumm: Minneapolis is a hub for Delta Air Lines, Professor Liu says,
so that could affect people visiting there. So it could. Yeah, I didn't know
that.
>>:
Not in your data?
>> John Krumm: It's not in our data. It's reflected in the Twitter data,
right, but it's not reflected in the features we're using to make the
inferences.
>>:
[Indiscernible] tweet I'm at the airport.
>> John Krumm: We didn't look at the tweet texts. That would be another level
of sophistication. And then we can look at all the -- oh, you really can't
read these too well, but these are the relative importances of the features
for making those predictions. The top one is distance; that's really what
most affects how far you travel. The other ones are all just demographic
features. It turns out mostly race and age are predictors of the places that
you'll visit.
Okay. Labeling your places. The idea here is when you tell people about a
place, like let's say you've got an automatic notification of when your child
arrives at home. You don't want that notification to come in and just give the
latitude, longitude of your home or even the street address. You'd like the
message to say they've arrived at home or they've arrived at school or your
spouse has arrived at work. And the idea here is can we look at your GPS data
and automatically attach semantic labels to the places you tend to go.
So we start off with just GPS data, people running around. This happens to be
of a student-aged girl. And then we do clustering on the data to find out
where the people spend their time, and that happens to be her home and that's
her school, I happen to know.
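The clustering step just mentioned, finding where a person spends their time,
can be sketched crudely by snapping GPS fixes to a grid and counting. This is
our own stand-in for the real clustering algorithm; the `cell` size of 0.001
degrees (roughly 100 m of latitude) is an assumption.

```python
from collections import Counter

def top_places(points, cell=0.001, k=2):
    """Find someone's most-visited places by snapping (lat, lon) fixes to a
    grid of `cell`-degree cells and returning the k densest cells."""
    counts = Counter((round(lat / cell) * cell, round(lon / cell) * cell)
                     for lat, lon in points)
    return [cell_center for cell_center, _ in counts.most_common(k)]
```

The resulting dense cells (home, school, work) are then what a classifier
labels, using features such as the times of day the person was there.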
And so what we want to do is automatically classify those places, and our
ground truth data is kind of interesting. There's this thing called the
American Time Use Survey. It's free data from the U.S. government, and what
they had people do is fill out a diary where they kept track of everywhere
they went for a day and what times they went there. So I was at home this
length of time and I was at work this length of time and you can see across the
bottom here, this is zero to 24 hours, and that's the time when people were at
home. So most people are at home around midnight. And then at noon, you see
kind of a peak and people being at work there and then the next highest one is
someone else's home. But these are all the different kind of places that
people kept track of in that data.
So that's a great source of ground truth data for machine learning, and so when
we do that, here are our accuracies. Here are the actual places people were.
This is a confusion matrix, and that's where we inferred that they were. So for
home, you know, we're 91 percent accurate. Work is 83 percent, school is 88
percent.
And then not so great for the places that you go less often.
Location privacy. Turns out when you look at the literature, people have done
surveys and people don't care that much about location privacy. They're
willing to give up their data. They don't really care that much, and that's a
pretty consistent finding in most of the surveys that I've seen.
One thing we were interested in is what if people had actual location data at
stake. If you were doing this survey and I held some of your location data,
then are you going to be so willing to give it away. So we collected GPS data
from 32 adults in 12 households, non-Microsoft people who were drawn here.
We had two months of data for each one of them.
After we got the data from them, we showed a map of their data and had them
fill out a survey, asking about their privacy preferences.
And turns out one of the things we asked about, can we take your data,
anonymize it, but with a persistent ID for all your GPS points and a time
stamp. Can we take that data and put it on a publicly accessible website.
Two-thirds of our participants said yes, that's okay. So now that data is
actually on a public website for anyone to download and use for their research.
So I was kind of astounded by that, that big number of people that were willing
to share their data.
We said would you trade your data back to Microsoft if we gave you a
location-based service in return? Every one of the 32 people said yes, they
would. And these are the location-based services that we picked. We gave them
a list. Help determine where the bus routes should be. So that's just an
altruistic service where we'll look at traffic data and figure out, well, we
should put a new bus route there.
Tell you about traffic jams before you get there. Tell drivers where traffic
is slow. Control our home thermostat to save energy. But everyone was willing
to trade their data away for a location-based service. So pretty cheap.
Another little study we did on that data was can we take your anonymous data
and figure out who you are. And it turns out that you can. If you can figure
out where the person lives, which is where they spend most of their time, then
you got a latitude, longitude of that and you can reverse geo code, find their
street address and then put that into a reverse white pages lookup and figure
out their identity.
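The first step of that attack, finding the home, can be sketched as below. The "home is where you are overnight" heuristic is a common stand-in, not necessarily the exact method used; the reverse geocoding and white-pages lookup rely on external services and are only noted in comments.

```python
import statistics

def guess_home(fixes):
    """Estimate home as the median location of fixes recorded
    overnight (midnight to 6 a.m.), when most people are at home.
    fixes: list of (lat, lon, hour_of_day)."""
    night = [(lat, lon) for lat, lon, hour in fixes if 0 <= hour < 6]
    if not night:
        return None
    lat = statistics.median(p[0] for p in night)
    lon = statistics.median(p[1] for p in night)
    # The attack then reverse-geocodes (lat, lon) to a street address
    # and runs a reverse white-pages lookup -- external services,
    # not shown here.
    return lat, lon
```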
So we can do that and then what we did is we started to obfuscate the data. So
a lot of privacy advocates will say, well, you can add noise for the data to
preserve privacy or discretize it somehow. So we did those things and we ran
our privacy attack again to see if we could identify who the people were. And
these are the different obfuscation techniques that we looked at.
But in general, it turned out that you had to obfuscate the data so much before
the attack would fail that it was really useless for a
location-based service. It would work maybe for giving you the weather or
something like that, but not something for, say, tell me if one of my friends
is close by or tell me when the next bus is going to get here. So obfuscation
of the data is not a very good privacy technique.
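For concreteness, the two obfuscation styles mentioned, adding noise and discretizing, look roughly like this. The noise scale and grid size here are arbitrary illustrations, not the settings evaluated in the study.

```python
import random

def add_noise(lat, lon, sigma_deg=0.01):
    """Perturb a fix with Gaussian noise (~1 km at sigma_deg=0.01)."""
    return lat + random.gauss(0, sigma_deg), lon + random.gauss(0, sigma_deg)

def discretize(lat, lon, cell_deg=0.05):
    """Snap a fix to a coarse grid point (~5 km at cell_deg=0.05),
    so many nearby fixes collapse to the same reported location."""
    return (round(lat / cell_deg) * cell_deg,
            round(lon / cell_deg) * cell_deg)
```

The finding in the talk is that the parameters have to be pushed so far (very large sigma or very coarse cells) before re-identification fails that the data stops being useful for most location-based services.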
Okay. So I'm done talking about what we have done with data and now I just
want to speculate a little bit about what we could do with some of this data,
things that we haven't fully tried with. One is personalized routing. We've
done a little bit of this. We've looked at the routes that people take.
Here's someone who drove from Point A to Point B here along the green route and
then we planned routes between those two points. One was MapPoint. We looked
at the shortest route, the fastest route, and you notice they didn't take any
of the routes that we planned. And it turns out that when we looked at the
data that 60 percent of the people were taking neither the shortest nor the
fastest route. So it made us think that people have different criteria for
picking the route they want to drive.
And so we made a new router that was sensitive to traffic speeds, but also
tended to favor roads that you've driven on before, thinking that you're in the
habit of -- if you like these particular roads so you want to go back to them.
So after making that simple change to the router, our new routes that we
planned match the routes that people took a lot more often.
But going forward, you can imagine that people have a lot of different
criteria. You know, when you go to Bing maps or Google maps, it optimizes for
time or distance, right. But maybe there are other things that come into your
calculation and basically, these routers are minimizing cost. So maybe that
cost is a function of driving time, number of left turns, complexity, scenery,
traffic lights and you have a different weighting for all these different
factors, and so by looking at your driving data, maybe we could figure out
what weighting to apply to all these different factors and then make a
route that works best for you.
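That weighted-cost idea reduces to plugging a personalized edge-cost function into an ordinary shortest-path search. A minimal sketch, with illustrative factor names and a toy graph format rather than a real road network:

```python
import heapq

def route(graph, start, goal, weights):
    """Dijkstra's algorithm where each edge carries a dict of factor
    values (e.g. time, left_turns, lights) and the cost of an edge is
    a personalized weighted sum of those factors.
    graph: {node: [(neighbor, {factor: value}), ...]}"""
    def cost(factors):
        return sum(weights.get(f, 0.0) * v for f, v in factors.items())

    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, factors in graph.get(node, []):
            nd = d + cost(factors)
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if goal not in dist:
        return None
    # Walk back from goal to start to recover the path.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

Learning the weights from observed driving data is the interesting part; changing the weight on, say, traffic lights changes which route the same search returns.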
One of the factors could be safety. So here are traffic fatalities from 2001
to 2006. These imply which areas might be more dangerous to drive through, and
so you might want to avoid them. So let's say you're planning a trip from
Pittsburgh to Detroit and now we add this slider, which is the probability of
death along your route, okay? And you're free to choose any setting you want
on that, you know. We haven't done the research, but it may be the more risk
you're willing to take, the faster your drive will be, right? But down here,
if you want to be safer, maybe it will take longer. But it would be
interesting to do something like that.
Collaborative filtering. This is one of the first things we thought of when we
were starting to take data. So you know, when you go to Amazon to buy a book,
they'll suggest other books for you to buy. But you can also imagine doing the
same thing with places that you go. Let's say you're new to a city and you
visited this independent book store and this independent coffee shop. Well,
based on where other people have gone, if they've gone to these two places, the
system could maybe automatically recommend that people who like these places
also like this independent theater.
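A toy item-based version of that recommendation idea, assuming all we have is each person's set of visited places:

```python
from collections import Counter

def recommend(visits, you, top_n=3):
    """Recommend places by counting how often people who share at
    least one place with you visited places you haven't been.
    visits: {person: set_of_places}; you: your set of places."""
    scores = Counter()
    for places in visits.values():
        if places & you:            # this person overlaps with you
            for p in places - you:  # count their other places
                scores[p] += 1
    return [p for p, _ in scores.most_common(top_n)]
```

As the next paragraph notes, this transfers imperfectly from books to places, since people often visit a place for proximity or social reasons rather than preference.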
Someone else has done a little bit of research on that over at the University
of Washington. Jon Froehlich, when he was a Ph.D. student there, said this
actually might not work all that well because it turns out that people, when
they're
picking places to go, they often go there not because they like it but because
their friends are going there or because it's close by. So it's a little bit
different from buying a book on Amazon.
I think this is the last one I want to talk about. Crowd sourcing. So looking
at data from other drivers, what can you do with it? Well, you know, this
company Waze that was bought by Google recently, they used the data from these
Waze NAV apps running on people's smartphones to find traffic speeds, to report
speed traps, and gas prices. But you can maybe even do more with that by just
instrumenting the vehicle. So what if you could detect places where there's
sudden braking? That might mean there's an accident there. Or the anti-lock
brakes have been activated, indicating a
slippery road. Windshield wipers are on means it's raining so you could track
precipitation as it moves through the area or look at sudden suspension jolts
to figure out where the pot holes and rough roads are.
So I think that would be an interesting thing to try. Okay. And in the
interest of time I'm going to skip over this last one talking about activity
inferences. And just conclude this is what I've talked about, some things that
we've already worked on, and some ideas for things we could work on in the
future. And that's the end of my talk, so thanks for listening.
>> Dennis Gamon: Questions?
>>: So how specific are the features and the sort of machine learning
machinery behind it to humans, versus if I apply that to GPS, to the baboon
data?
>> John Krumm: I'll repeat that question. How specific is the machine
learning that we've done to humans and, you know, maybe how well would it work
on baboon data or any kind of --
>>: Or zebras or animals in general.
>> John Krumm: Right. That's a good question. The first bit of location
prediction that I did, the short-term prediction, assumed you were on the road
network and that you had an efficient route planned to where you were going.
And I suspect that -- well, animals don't follow the road network, obviously,
and you probably don't know the paths that they follow, necessarily. So I
don't think that technique would work very well, but the longer term prediction
that we're doing pays no attention to what's on the ground. It doesn't pay
attention to businesses, which we've done in other contexts before. It just
looks at your specific behavior, places that you've gone to before as a
function of time of day.
So it seems like that would probably apply better to wildlife. Would you
agree?
>>: Although animals do follow roads.
>> John Krumm: They do?
>>: Yeah. We can talk.
>> John Krumm: Okay. Interesting. Yes?
>>: The last point you made about instrumenting vehicles to detect conditions,
environmental conditions on the roads, this is an area of inquiry that's in an
advanced state of development and perhaps we have a talk about that. So
they've done some experiments with -- in the field with looking at windshield
wiper settings to get realtime information back. You have to have a car which
will indicate the windshield wiper setting, but we have systems that are in
development like that. So perhaps it's a [indiscernible].
>> John Krumm: Good. Sounds like a lot of fun.
>>: Is there bias in diary data? I remember when I was in school a long time
ago, people were keeping track of where they went. They went to the dentist
more frequently than they went to a bar.
>> John Krumm: Oh, really?
>>: I guess [indiscernible] or survey data. It's always the problem with any
survey, is the survey biased in some way.
>> John Krumm: Yes, so the question is are these diary studies biased. And I
don't know, honestly, and I'm not sure how we could tell. Maybe you could
instrument some of the people, right, and figure out whether they were doing
that. But even if you had GPS coordinates in the car, which we do a lot of
times, you don't park at the exact place you're going. So you're going to the
strip mall, right, and you go to the coffee shop but right next door is the
diet center. We don't know which one you really went to. So it's even hard to
tell from GPS data. But yeah, that's interesting about the dentist versus the
bar.
>>: Do you keep track -- this is two questions. Do you keep track of how far
people drive between gas stations? And the second part of that is, can you
predict you're going to run out of gas within 50 miles, so you'd better pull
over at the gas station in front of you rather than wait for one that's very
far away?
>> John Krumm: Right, that's a good idea. So yeah, we haven't looked at
distance between fill-ups, but that would be maybe a good proxy for how empty
your tank is, right? And then you could recommend the right distance gas
station to fill up your tank. And that leads us to -- we thought about a thing
called satiation modeling. Every once in a while, you need gas, right? You
need an oil change. You need a hair cut every once in a while, right? So if
you've just gotten gas, just gotten a hair cut, then we probably don't want to
tell you about places where you can do those again very soon, because you're
just not going to need that. So it would be fun to look at the long-term
intervals between different things that you need.
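The satiation idea above reduces to a simple threshold on elapsed time since the last event; the 0.8 fraction here is an arbitrary illustration of "don't recommend again until most of the typical interval has passed."

```python
def due_for(last_done_days_ago, typical_interval_days):
    """Satiation sketch: only surface a recommendation (gas, oil
    change, haircut) once most of the person's typical interval
    between those events has elapsed."""
    return last_done_days_ago >= 0.8 * typical_interval_days
```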
>> Dennis Gamon: Last question?
>>: So here in Seattle, you guys have the Microsoft connector sort of all over
the city with this realtime GPS data. And I'm trying to avoid traffic a lot.
Bing, for example, if I route from here to my home will tell me, like, some
estimation of different routes based upon, I'm assuming, past data collected
for a similar day and a similar time. Are there any efforts to integrate in
realtime information doing that analysis, like, immediately when I'm looking at
it or maybe that's already being done. So there's an accident on 405 and the
connector is sort of sitting here in traffic. And so I know I should take
I-90.
>> John Krumm: Right. So looking at realtime traffic data, to help you route,
my boss has actually some work on that, Eric Horvitz. What he was doing was
predicting when traffic jams will form and when they'll dissipate based on past
data, and he was also looking at can we infer what traffic on the side streets
is, where you don't have the traffic loops that are measuring the traffic based
on the traffic that's on the highway, where you are measuring. So if there's
some correlation between those two things, then you can infer the traffic even
where you don't have sensors.
So in that sense, yes, we've got some work around realtime traffic estimation.
>> Dennis Gamon: Thank you, John.
>> John Krumm: Thank you.