>> Jin Li: Hello, everyone. Thanks very much for coming to this talk. It's our great pleasure to have Aleksandar Kuzmanovic come here and give us a talk on ISP-enabled behavioral ad targeting without user consent. Alex got his PhD from Rice University in 2005 and then joined Northwestern University, where he is now an assistant professor. His interests are in computer networking, with emphasis on measurement, analysis, and protocol design for all kinds of Internet algorithms and protocols.
Without further ado, let's hear what Alex has to say about this interesting topic of ad targeting.
>> Aleksandar Kuzmanovic: Thanks, Jin, so much. So I'm really happy to be here at Microsoft. I
always liked Microsoft better than any other operating system. I'm a Microsoft kind of guy. I hate
Macs. This can go online. This is my official statement.
And so basically today I'm going to give you a brief outline of four projects that my group is working on, so this is going to be a breadth kind of talk. I'm not going to go too deep into the technical details; I'll just put four different ideas to you, so if you're interested in learning more about any of them, you can read the paper, you can ask for the paper, you can talk to me, and I'll be happy to answer any questions you might have.
This is also good for me, because you'll hardly have a chance to ask any questions: I'm going to be talking very fast and moving from one topic to another, so by the time you ask a question I'll already be on the next topic and you won't have an opportunity to ask anything. That was a joke, of course.
So let's go straight to the first problem. The first problem is: can you use search engines to do networking research? The scenario I have in mind is a student who goes to a search engine, puts in the title of his project or PhD thesis, gets back ten different versions of that thesis, reads through them, says this is the one I like, this is my PhD thesis, and he gets his PhD.
This is, of course, not possible. That was another joke. That is still not possible, but other things are possible. If you put a random IP address into a search engine such as Google, you can end up with various information coming out of that search engine. For example, this particular IP address is associated with the Gnutella network, right? And you might be surprised that if you start putting different IP addresses into different search engines, a huge amount of endpoint information turns out to be available on the Web. So the problem we try to answer is: can we systematically exploit search engines to harvest endpoint information available on the Internet?
Before I try to answer this question, let me first explain how all this information comes to search engines in the first place. There are a number of reasons why information about endpoints can end up on the Web. One example is popular servers, for example gaming servers: their IP addresses are listed and available on the Web. Once they are available on the Web, they get crawled, they get indexed by search engines, and they end up being there.
Then, when you go to a Web site, you might think that it's your own business which sites you visit, but these Web sites can run logging software and publicly display statistics, among them the IP addresses from which they are accessed, right?
If this is not enough, you might go through a proxy network, where a number of proxy systems display their logs, and those become publicly available on the Web. And even with peer-to-peer systems, information is available on the Internet for a number of different reasons. For example, whenever you access a peer-to-peer system you first have to go to a public domain to get the information about the files that you want to access, and this again becomes publicly available.
And of course there exist a number of blacklists, banlists, spam lists and so on; this is yet another source of information that becomes publicly available.
So what we did is design a methodology — I'm not going to spend a lot of time here, I'm just going to give you a brief outline — for how we go from a given IP address to tagging that IP address with its appropriate properties, for example, what kind of activities are associated with this particular IP address.
Again, Google is the search engine that we used. When you enter an IP address you end up with a number of URLs and the hit text, and then we look at the domain name itself to try to understand what this IP address is about. The simplest case is when you can tell directly from the domain name what the IP address is associated with; for example, if the domain name contains "dns", this implies the address is a DNS server.
Of course, this is not always possible, and then we have to do something else. Again, from the artificial intelligence point of view this is a toy problem; this is not our main contribution. So if you are an AI person looking for deep things here, this is not the place to look; this is just a small hack by a couple of networking people trying to get to something else that we know better.
And then, once you end up with this, you go on and tag this IP address with the appropriate features. So a single IP address can be tagged with a number of different features.
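[Editor's illustration: a minimal sketch of this tagging step, assuming a hypothetical search(query) helper that returns (url, snippet) pairs and a hand-built keyword-to-tag table; this is an illustration of the idea, not the actual system described in the talk.]

# Illustrative sketch (hypothetical search() helper, not the actual system).
# Tag an IP address by scanning search-hit URLs/snippets for known keywords.

KEYWORD_TAGS = {            # hand-built keyword -> feature table (assumed)
    "dns": "dns-server",
    "gnutella": "p2p-gnutella",
    "torrent": "p2p-bittorrent",
    "proxy": "proxy",
    "mail": "mail-server",
    "blacklist": "blacklisted",
    "counter-strike": "game-server",
}

def tag_endpoint(ip, search):
    """Return the set of feature tags found in search hits for this IP."""
    tags = set()
    for url, snippet in search(ip):          # search() is assumed to exist
        text = (url + " " + snippet).lower()
        for keyword, tag in KEYWORD_TAGS.items():
            if keyword in text:
                tags.add(tag)
    return tags

# Example: tag_endpoint("192.0.2.7", my_search_backend) -> {"p2p-gnutella"}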
Okay. What is this good for? Well, this is good, for example, for understanding what applications people are using across the world without having access to their network traces. It is hard to obtain network traces: you go to ISPs and ask, can we have some network traces, and they will tell you no, you can't get that, there are security issues, there are privacy issues, you can't get that.
With this approach you can go on and analyze different networks and try to understand what kind of traffic is present in each network. I'm skipping a huge part of this problem here, but basically, what we did is this: we managed to obtain, through our collaborators, packet-level traces from a number of different places in the world, and then we compared the information gathered from these packet-level traces with our approach, where we simply input IP addresses one by one and try to understand what is going on.
And the key insight here is that if an application is very popular in a particular region, we are capable of capturing that both in the packet traces and by using our approach.
Of course, our approach is incapable of telling you exactly how the different applications are used and so on, but if something shows up strongly in a given region, we are capable of capturing that.
>>: Now are you talking about Web-based applications? Because how would you know if they're
using Word or Office or Excel or running some PC-based game on their computer ….
>> Aleksandar Kuzmanovic: So the question is what kind of applications I am talking about — how do we know if they're using this or that type of application. When I talk about applications here, I mean network applications: one is streaming media, two is Web applications, three is peer-to-peer, and so on. I am not talking about client-level applications.
>>: [inaudible] you described you did some validation of the quality of the inferences that come out. It seems that for someone to actually use this in practice they would have to keep validating the system, because it seems that it works in the present, but that might not have any bearing on whether the technique will continue to work in the future.
>> Aleksandar Kuzmanovic: Because?
>>: Because all these harvesting methods out there really depend on [inaudible] whether or not a client gets logged by using some manner of application. It's almost accidental that that information gets stored, and it gets stored for a lot of reasons that may or may not exist for certain applications.
>> Aleksandar Kuzmanovic: I can agree with that to some extent. But I have to repeat my claim. My claim was that if something is heavily present in a given region, we're capable of capturing it. I agree that there is the question of how you know whether a given application is actually reflected on the Web; that is a huge problem, of course. But the claim here is that if something has a strong presence in a given region, it does leave a sufficient trace — that's what we have seen.
>>: Today.
>> Aleksandar Kuzmanovic: Today.
>>: Okay. Sure.
>> Aleksandar Kuzmanovic: Yeah.
>>: I mean, it's one of those unknown unknown things. Right?
>>: Oh yeah, [inaudible], I'm not saying this yet. But [inaudible].
>> Aleksandar Kuzmanovic: Sure, sure, sure. So again, I agree, this is the shakiest application that we came up with for this approach, and it is just one of the applications we're using. I'm next going to explain the second one, which I hope will convince you it's not as shaky. But I do agree that there is a lot of uncertainty here.
The way we treat this uncertainty is that we simply compare against the ground truth from four different places where we were capable of gathering packet-level traces. Is it going to work in the future? I don't know. We'll see.
So the next application is the traffic classification problem, right? Assume you do have access to packet-level traces; how can you figure out what is going on, what kind of applications are used there?
Current approaches are port-based, payload-based signatures, numerical and statistical analysis, and so on; you are trying to understand what is going on. What we are doing here is saying, okay, use the information about the destination IP address that is available on the Internet, right? And this you can get using our approach. Once you use this approach, you can get much better granularity in your inferences and you can classify a much larger percentage of the traffic. Here I am comparing against another traffic classification approach; I'm not going to talk about it in detail, because that's not the point here. The point is that we are capable of performing much better traffic classification than they are. On the X axis you have the sampling rate — how many packets you take randomly out of a packet stream — and on the Y axis you have the percentage of traffic classified, right?
These lines are for the two different networks for which we were capable of gathering packet-level traces. What it shows is that when there is no sampling, when you get all the packets, we are already much better than they are; however, when the traffic is sampled, their approach degrades badly while we retain high classification capability.
So the bottom line is that even under aggressive sampling we are still capable of classifying large amounts of traffic, simply because we work on a per-packet basis: given the destination address of a single packet, we look it up via a search engine, we know what that endpoint is about, and hence we can tell you what the traffic is.
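[Editor's illustration: a rough, hedged sketch of why this survives sampling — each sampled packet is classified independently from a precomputed destination-IP tag cache. Names, tags, and record layout are assumptions, not the paper's code.]

# Illustrative sketch: per-packet classification from endpoint tags.
# endpoint_tags is assumed to be built offline, e.g. with tag_endpoint() above.

def classify_packet(dst_ip, endpoint_tags):
    """Map a single sampled packet to an application class via its destination."""
    tags = endpoint_tags.get(dst_ip, set())
    if any(t.startswith("p2p") for t in tags):
        return "p2p"
    if "streaming" in tags:
        return "streaming"
    if "web-server" in tags or "cdn" in tags:
        return "web"
    return "unknown"

def classify_sampled_trace(dst_ips, endpoint_tags):
    """Each packet is classified on its own, so random sampling does not break flows."""
    return [classify_packet(ip, endpoint_tags) for ip in dst_ips]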
So I will leave it there for the first topic. If you are interested, I'd be happy to talk more about this project, and there is some information available online as well.
So let me jump to the second problem that I am going to talk about today. This is ISP-enabled ad targeting. With the first project, people started telling us, oh, you guys are promoting Google, why did you put that in the title, that's not fair, and so on. We are not promoting Google; we just used that as a way to promote our paper, right?
The second project actually goes, in a way, against Google's approach, as I'm going to explain next. So the second problem I'm going to talk about is behavioral targeting. Behavioral ad targeting itself is a huge business; it's a $20 billion industry.
All the banner ads, rich media, e-mail marketing, all that stuff that comes to you when you're looking at a given Web site — that falls under online ad targeting. Behavioral advertising is a part of this ad targeting business, and to explain simply how it works: user A accesses site X, site X belongs to category P, for example sports. So the next time user A accesses site Y, an advertisement of type P is going to show up on that site, because the user has already shown interest in that particular activity.
If you look at the market share in the ad targeting business, Google owns 35 percent of the business, Doubleclick owns 34 percent of the business, and Google bought Doubleclick; hence 35 plus 34 equals 69, and 69 is greater than 50, so where I come from this is called a monopoly in a given business, right?
So Internet service providers, on the other hand. Yes?
>>: [inaudible] is that revenue based or is it based on [inaudible] ….
>> Aleksandar Kuzmanovic: It is based on the number of clicks through ads or something like
that. I can give you a more precise ….
>>: [inaudible].
>> Aleksandar Kuzmanovic: Yeah, yeah, yeah. So Internet service providers have for years looked at these things and seen that the Web-based companies are taking all the money, while they, the ISPs, are putting in all this infrastructure, shipping packets from point A to point B, yet they are not gaining as much money as they would like to.
So the question they asked is, hey, how about we do behavioral ad targeting too, right? All the packets from a user are going this way, so if the ISPs do packet inspection — look at the packets and try to understand what information the user is interested in — they can gain the same information, right? Then they can sell this to the ad targeting companies and make some money as well. Which sounds like a reasonable approach.
However, the problem here is a legal one. Right? For Internet service providers, as broadband providers, this law, the Federal Wiretap Act, applies to them, and it does not apply to the Web-based companies. It says that “thou shalt not intercept the contents of communications. Violations can result in civil and criminal penalties.” Right? So it is illegal for ISPs to look at the packet payload to understand what is going on. And there was huge pressure on ISPs not to do this, both from consumers and from the federal government and so on. Yes.
>>: Is it okay to look at addresses? Station addresses? Or ….
>> Aleksandar Kuzmanovic: That's what I'm going to talk about next. It is okay.
>>: This is black and white [inaudible].
>> Aleksandar Kuzmanovic: So here's the thing. This law came out in 1969 and it was designed to prevent phone wiretapping. It was amended in 1986, before the Internet as we know it was around, to say that it holds for computer communications as well.
And now, in 2009, ISPs are unable to do this because of rules that came out in 1986, which kind of doesn't make sense, right? And there is no fairness in this market, because Web-based companies are capable of doing this without any problems.
>>: I disagree with your stance on that. Basically there is a conversation between the user and the Web site you're going to, and that is the conversation that is actually sanctioned. Looking into that conversation by inspecting the packets going back and forth is, to me, a third party who is ….
>> Aleksandar Kuzmanovic: Sure, sure, sure. That is what the whole argument is all about.
>>: And the second part is, that Web site has a privacy policy so that the user understands it.
The user can't get access to what the ISP is supposedly doing because [inaudible].
>> Aleksandar Kuzmanovic: Sure, sure, sure. So my point here is that it's not the same game
for ISPs and Web-based advertisers currently. And I think it should be the same game. Right?
>>: No.
>> Aleksandar Kuzmanovic: Why not.
>>: [inaudible] I called Amazon or I called eBay or I called somebody else ….
>> Aleksandar Kuzmanovic: Sure, sure, sure.
>>: If I go to my ISP's Web site, sure, I agree. But if I don't, then why is the ISP jumping in the
middle. I don't want my ISP ….
>> Aleksandar Kuzmanovic: Are you aware that your communication is being monitored by
Google and by Doubleclick? Are you aware of that or not?
>>: Yes.
>> Aleksandar Kuzmanovic: Did you sign up for that or no? Did you sign up or no?
>>: When I go to a Web site but you don't sign up for any of those things.
>> Aleksandar Kuzmanovic: That's my point. There should be some fair game here. Right?
And it's currently not there.
>>: No ….
>> Aleksandar Kuzmanovic: So let me ….
>>: The privacy policy exists on Web sites you go to.
>> Aleksandar Kuzmanovic: Your Web browser is full of Google and Doubleclick cookies, and users don't know about that. It should be the same principle. So what I'm saying here is that this law is currently used to prevent ISPs from doing behavioral ad targeting. What we're saying is that you can turn the law around: if that 1986 law forbids one thing, well, we can use another law to do it.
So basically — and I will tell you, I have consulted with Paul Ohm from the University of Colorado, who is an expert in these issues — another piece of law, the Electronic Communications Privacy Act, states that “any provider can hand over non-content records to anyone except the government.”
What this means is that TCP headers could be legally shared. They are non-content records of communication — patterns of communication between endpoints.
And then the research question we wanted to ask is: once you have TCP headers, which you can legally share with anyone except the government, can you do behavioral ad targeting based on them? And we claim it's possible. Right? I will come back to this legal issue at the end of this part of the presentation, because I do agree with you that users should be given the right to say, I don't want this to be monitored. But it should be equal for everybody. It cannot be that ISPs are cast as the bad guys while these other Web-based companies are the good guys who can do it without any questions.
So the question is, can you do this? And our answer is that you can. Basically, you can go and collect statistics about the Web pages of a given Web site, and then you can compare that to the Web-level information available from TCP packets. Once you compare these two sources of information, it becomes possible, as I will show, to infer content-level information based only on non-content records of communication.
Okay, Web profiling and collecting statistics about Web pages: I'm not going to go too deep into this, but the bottom line is that the Web nowadays is a fairly complex beast. Each Web page has a root file, the index file, and then a number of different objects that come with it, so the first distinction is the root file versus the object files. Object files can be transferred in various transfer modes — they can be compressed or non-compressed — and then you have the use of content distribution methods, so these objects can be stored internally on the Web site or externally on a content distribution network. They can be cacheable or non-cacheable. So the bottom line is that there is significant statistical diversity among the pages of a given Web site, right?
This diversity can be used to understand differences among pages even if you don't have direct access to them. That is one piece of the puzzle. The second piece of the puzzle is that once you look at TCP packet-level traces, you can see that there is a reflection of the Web layer down at the TCP layer. Right? So if you observe packets generated by a single user, you can say, aha, these packets belong to one page, because there is a natural delay between the streams of packets requested by a single user for different pages. At the same time, by looking at TCP headers — I'm not going to go into details — you can actually distinguish the different elements at the TCP layer: you can say this is one object, this is another object, this is a Web page, and so on. I'm not going to go into details here; I can send you the paper. It's a fairly straightforward thing; there is no huge science here.
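[Editor's illustration: to make the TCP-level grouping concrete, here is a minimal sketch that splits one user's request packets into page-load candidates using an assumed idle-gap threshold; the record format and threshold are assumptions, and the paper's actual detection is more involved.]

# Illustrative sketch: group one user's request packets into page-load candidates
# using a simple idle-gap heuristic.

THINK_TIME_GAP = 2.0   # seconds of silence assumed to separate two page loads

def split_into_pages(requests):
    """requests: list of (timestamp, dst_ip, payload_len) tuples, time-sorted."""
    pages, current = [], []
    last_ts = None
    for ts, dst_ip, length in requests:
        if last_ts is not None and ts - last_ts > THINK_TIME_GAP:
            pages.append(current)          # user 'think time' ends a page load
            current = []
        current.append((ts, dst_ip, length))
        last_ts = ts
    if current:
        pages.append(current)
    return pages

# Each page-load candidate can then be summarized (number of objects, internal
# vs. external destinations, transfer sizes) and matched against the per-page
# statistics collected by crawling the site.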
Okay, so once you have that, here is how our system did — these are the results. We had six different Web sites that we explored: the “New York Times”, Football Club Barcelona — we did this before they became European champions — Ikea, Toyota, and then two universities, one is Northwestern and the other is the University of Granada. What you can see here is that the success rate, which is the probability of successfully detecting which page was accessed, is around 85 percent in all cases, and the false positives are below five percent in all cases.
So ….
>>: [inaudible].
>> Aleksandar Kuzmanovic: Yes?
>>: [inaudible].
>> Aleksandar Kuzmanovic: So basically you have a user who accesses, let's say, 100 Web pages. Out of these 100, we can successfully detect 85, and for about five pages we make false positives: we say those pages were accessed when they actually were not.
>>: Without looking at the HTTP header ….
>> Aleksandar Kuzmanovic: Without looking at the HTTP header. Only looking at the TCP.
>>: [inaudible].
>> Aleksandar Kuzmanovic: Okay, so the next thing we looked at is: this is fine, but to do something like that an ISP would have to crawl all the Web sites, right, because you have to build profiles for a large number of Web sites. That is one thing. The second thing is that if you are an ISP and you collect all the TCP headers and want to ship them to a third party, an advertising company, it takes time, right? You put this on a device, then you ship it there, and some time can pass.
So the question is: if you have imperfect crawling — for example, you can crawl a Web site only once per week — or if time passes while you ship data from the ISP to the advertising company, can that hurt the performance that we see in this basic case? So the answer is — yes?
>>: You seem to rely on — I'm guessing — why is the advertising company getting this information from the ISP and not from the site that the user is going to?
>>: [inaudible] ISPs basically want to get in the business of advertising. They want to ….
>>: [inaudible] they are not intercepting traffic, they are not actually changing content being
served to [inaudible], so the advertising company is better off ….
>> Aleksandar Kuzmanovic: Then you would have to communicate with the tens of thousands of Web sites that users are accessing, while here you collaborate with a single ISP who has all the information about the user. Whatever goes through that ISP, the ISP sees; whereas if a site the user accesses is not cooperating with the advertising company, you don't even know that the visit happened.
>>: But the site must be collaborating with the advertising company because the site itself is
placing ….
>> Aleksandar Kuzmanovic: Not necessarily. I mean, there are different advertising companies.
>>: It's better to create user profiles ….
>> Aleksandar Kuzmanovic: At the source.
>>: On the client side than on the server side.
>> Aleksandar Kuzmanovic: Yeah.
>>: It's simpler.
>>: Either way, I mean, it's an extra source of information, right? If it could be produced, it would
be valuable. I guess my question is how much granularity is necessary? I mean, just knowing IP
addresses and server, “New York Times” versus Ikea, is that enough to go on?
>> Aleksandar Kuzmanovic: So, that's a very good question. In some cases you would for sure like to have more granularity, right? If I'm going to Ikea, I want to buy some furniture — but what exactly do I want to buy? That can fine-tune the advertisement that eventually comes to me, and it increases the probability that I might click on that advertisement later. Or if I look at the “New York Times”, there are fairly diverse things I could be looking at; if I'm looking at car prices, that implies I might be looking for a car, so a car advertisement may come to me.
So more information is always good, I think. Although in some cases it can be fairly straightforward: if I'm looking at Football Club Barcelona's Web site, I'm interested in football and in sports, right? Maybe that granularity is not necessary in that particular case, but there are other cases where it is important. And we should write papers, right, so we have to do something here.
Yes?
>>: So the server IP address is used to [inaudible] site and [inaudible] too?
>> Aleksandar Kuzmanovic: Yes.
>>: So it does like [inaudible]?
>> Aleksandar Kuzmanovic: No, no, no. We only look at whether we classified the destination as going to the origin server or to a CDN. So Akamai or any other CDN is just treated as external. We use that to understand what kind of Web page the user accessed, but that's the only thing; we are not looking at the IP addresses of particular Akamai servers or anything like that. We're just saying external versus internal.
Internal means you get it from the origin server; external means you get it from CDNs, irrespective of which CDNs.
>>: [inaudible].
>> Aleksandar Kuzmanovic: So typically you would get some content from the origin server and
the rest from the CDNs. Yeah. Okay.
So basically the question here becomes: if the traces are old — that is, you ship traces to a given advertising company and some time passes, like a week — can you still use these traces? Can they still be useful? So we compared old traces with current Web profiles, and the bottom line is that basically nothing changes over time, in the sense that you can still get high success rates and low false positives. And why does this happen? Well, because for these three particular sites that we are looking at here — Toyota, University One, University Two — the change rate of the site is very small. That means both the root pages and the objects on these sites do not change dramatically over time, hence we are still capable of reusing these traces.
>>: [inaudible].
>> Aleksandar Kuzmanovic: I am going to show that next. So the next question is what happens with rapidly changing Web sites. Here, what we are looking at is a TCP trace captured on the last day while we have the profile of the “New York Times” from seven days earlier. So the question is, can we still do a good job?
This shows that we can: the “New York Times” here is blue, and the accuracy doesn't degrade dramatically over time. Why does that happen? Well, if you look at the change rate, for the “New York Times” it's quite high: sixty percent of pages change over time, right? But these are mainly the root pages, the index pages. If you have a given page, people can leave comments and so on, and hence that part does change. However, the objects on that Web site, which is what this particular graph shows, don't change dramatically; it's a very small change, below five or so percent.
So the template of a page stays the same: the content can change, but the template stays the same. So you can still do a fairly good job of distinguishing different pages despite the fact that the content in this environment does change, often because people leave comments and do other things there.
Okay. So to summarize, the key motivation for this work is that I believe, and my students believe, that the online advertising business needs more fairness. We also strongly believe that users should know whether their traffic is monitored or not. And we have shown that the law itself is quite outdated, because you can take one piece of law and say, ha, you can't do this, but you can take another piece of law and say, well, you can do that — and you don't even have to get consent from users, right? You don't even need that.
So the bottom line, what we are arguing for, is that we need a comprehensive legislative reform that would put all these things together and make a fair game out of this whole area.
And we of course believe that such a reform should not be used to kill academic research. That means we researchers should still be given a lot of data to look at, because once we have access to data we can do very nice things.
So this is basically the introduction to the third project that I am going to talk about: measuring serendipity in a mobile 3G network. We were given a very large trace from a mobile operator — yes?
>>: Before you move on ….
>> Aleksandar Kuzmanovic: Yes, you want to talk more.
>>: Deep packet inspection [inaudible] has been challenged by lots of industry experts, privacy advocates, and governments, because it seems to be inappropriate and of very low value to consumers. So as an ISP, you ask me if I want you looking at my packets and whether I want to opt in or opt out — it's like, what do I get from it? Well, you get to sell my data to someone else and make money. Do I get anything? No, you just get to look at what I am doing. So why is that unfair, and why is that necessary to help research ….
>> Aleksandar Kuzmanovic: Yeah. But, I mean, Google does the same thing. Exactly the same thing.
>>: Tell me how they do it.
>> Aleksandar Kuzmanovic: Because that's how the online advertising business works.
>>: Google is not an ISP.
>> Aleksandar Kuzmanovic: That is exactly true. But Google is doing exactly the same things
that I am saying the ISP should be given an opportunity to work with. Why not?
>>: The scenario is not the same.
>>: I get to choose Google. I mean, it's a free service. I don't have to use Google, I don't have to enable the cookies, right? But, I mean ….
>> Aleksandar Kuzmanovic: But you have it enabled right now and you don't know about it.
>>: That's a different issue.
>>: It's not ….
>> Aleksandar Kuzmanovic: It's not a different issue ….
>>: [inaudible].
>> Aleksandar Kuzmanovic: What I'm saying — and I can send you references — is that there are other opinions that it should be fair, that it should be the same.
>>: I agree it should be fair, but this is not fair because it's not the same. It's apples and oranges. I go to a Web site, I can actually choose to say I'm going to opt out of advertising for Google, for Microsoft, [inaudible].
>> Aleksandar Kuzmanovic: If it is opt out, it should be opt out both for these guys and for the Web-based advertisers. Because now Google and others are saying, oh, you guys are the bad guys and the law is not on your side, so you should apply an opt-in approach. Right? So why doesn't Google apply an opt-in approach?
>>: I get free content from a Web site. In return they show me ads. The ISP is not giving me
any content.
>> Aleksandar Kuzmanovic: We fundamentally disagree on this. I would like to move on. I will
be happy to discuss more.
>>: You're right.
>> Aleksandar Kuzmanovic: Totally. Okay. Yes?
>>: I don't want to discuss this more, but I just want to point out that we computer scientists are a little uncomfortable trying to interpret what the law says, especially when there are no cases that establish exactly how the courts interpret the law. And I find motivating problems based on my or your interpretation of the law sort of shaky, because I'm not a lawyer, I don't understand what's going on. I can read it, I have an opinion — sure, but it's just my opinion.
>> Aleksandar Kuzmanovic: So basically I mean, I agree with that to some extent. I'm just
saying that there is a lot of opportunity to come up with interesting research problems from that
perspective.
>>: So I think the motivation is look you know, by doing the packet inspection you open up a
huge can of worms, and you can do a nice job looking at the TCP headers [inaudible], that's
great. But it's sort of like debating what the law says, I ….
>> Aleksandar Kuzmanovic: We had a [inaudible] who had exactly that point. Yeah, that's a very
good point.
>>: I wasn't [inaudible].
>> Aleksandar Kuzmanovic: Yeah, I didn't imply that. Let me move on quickly, because I have other things to talk about, and I will be happy to discuss this issue further later, because I see it's kind of important.
So the motivation here is that once researchers are given a lot of data, it's great: they can do very interesting things. In this particular case we have a lot of fine-grained data from a mobile operator, and at this point we were not looking at the law or anything like that — once you get the data, you forget about the law, you just do your research.
Okay. So the bottom line, to motivate this quickly, is that social networks are becoming very popular these days, but some believe that the future of social networking will take place on mobile devices, as opposed to being limited to desktops.
Some of the applications that are being enabled by this technology are the so-called serendipitous discovery of people, businesses, or locations. What does that mean? For example, if you go to a neighborhood and your friend is in the same neighborhood — if you're in the same geographic vicinity — you can go on and hook up together. This is an application offered by the company loopt. Another example is behavioral ad targeting for businesses: if you go to a given area, and a business there knows what your interests are, it would be happy to send you advertisements about them. Now, users are not always happy to receive these advertisements, but that's a different question.
And then locations: for example, you can do tagging. You go to a restaurant, you like the restaurant, you tag it and send this to your friends; your friends are in the neighborhood, they see, ah-hah, that was a great restaurant suggested by a friend of mine, I'm going to go there.
What I am arguing next is that coarse-grained location information is very useful for these services. Now, why am I saying this? Because we got hammered by some reviewers for having fairly coarse-grained location information in our dataset, which consists basically of base stations, with the argument that this is not sufficient for these kinds of services. I argue against that. The point is that even if you have coarse-grained information — at the level of a block, a base station, or a city — it is still useful, because it can help you hook up with friends.
And at the same time, users are more comfortable sharing coarse-grained information than GPS-level information. This is another argument: people say that revealing exactly where I am on the map is maybe not that secure or private.
So the question that we try to answer here is: what is the relationship between mobility properties and application affiliation in the cyber domain? If you are interested in a given application, how does that relate to your mobility properties, and how does it increase or decrease the probability of meeting others who share the same interests?
Let me just give you a brief description of what we were doing. Here are the basic statistics of our dataset: we had more than 3 million packet data sessions from more than 280,000 clients, and close to 1,200 locations in the Greater Chicago Area. And then we were trying to understand what is happening in all this data.
As for the way we extracted human movement: when a user connects to a base station, the connection goes through a RADIUS server, which stands for Remote Authentication Dial-In User Service. We had access to this, so we were able to look at the data and information about users, and if a user moves from point A to point B within the same session, we are capable of seeing this hand-off, and we can basically determine the location of the user.
And then we also had inter-session movement. That means the user accesses a given base station, then switches off, but then moves to another location and connects again, so we are still capable of capturing this.
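[Editor's illustration: a hedged sketch of the movement extraction, assuming per-user, per-cell accounting segments of the form (start_time, end_time, cell_id); this only illustrates the distinction between intra-session hand-offs and inter-session movement, not the exact pipeline.]

# Illustrative sketch: derive movement events from one user's time-ordered
# (start_time, end_time, cell_id) segments. The record format is an assumption.

def movement_events(segments):
    """Yield (time, from_cell, to_cell, kind) for one user's sorted segments."""
    prev_end, prev_cell = None, None
    for start, end, cell in segments:
        if prev_cell is not None and cell != prev_cell:
            # contiguous segments = hand-off within an ongoing data session;
            # a gap between segments = movement observed across sessions
            kind = "intra-session" if start <= prev_end else "inter-session"
            yield (start, prev_cell, cell, kind)
        prev_end, prev_cell = end, cell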
And what we argue in the paper is that despite this sampled kind of behavior — even if users are not online all the time, even if they switch on and off — we are fully capable of capturing their basic mobility properties. I'm going to leave the arguments along these lines for one-on-ones if you're interested.
So then we use rule mining techniques to understand basic movement properties. Here we simply define a group of users who sit at point A, access a given service within a time window W, and are then seen at another location within an interval delta. For these users, we define a movement rule like this. Then we have a stationary rule — I'm going to talk about that — which just means you stay at the same location, and we have a disappear rule, which means users simply switch off and are no longer in the picture.
This is a better way to show it. Here we have position A. The rule support is the number of people present at A at a given point in time. Then these users move to location B. The rule confidence is the number of people that move from A to B, and finally the confidence probability is the ratio of the rule confidence to the rule support. The larger the confidence probability, the more confident we are that this rule has some meaning in this particular case.
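[Editor's illustration: a toy sketch of these rule statistics as just defined — support counts users matching the antecedent (at A, accessing the service in window W), confidence counts those also matching the consequent (seen at B within delta), and the confidence probability is their ratio. The data layout is an assumption.]

# Toy sketch of the rule statistics described above.

def rule_stats(users_at_A, users_seen_at_B_within_delta):
    support = len(users_at_A)                                     # antecedent matches
    confidence = len(users_at_A & users_seen_at_B_within_delta)   # also match consequent
    confidence_probability = confidence / support if support else 0.0
    return support, confidence, confidence_probability

# Example: 1,000 users at A in the window, 300 of them later seen at B
# -> support 1000, confidence 300, confidence probability 0.3.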
So let me show you just some of the results that we had. I'm not going to dive into the details or tell you the whole story; if you're interested, I can send you the paper. What we see here is that there are two Y axes, Y1 and Y2. Y2 is the total confidence of the rules, so here you can see, for example, 37,000 people, here 31,000 people, and so on.
This particular graph is associated with this particular rule. What we see is that on a weekday there are peaks at 8:00 a.m. and at 5:00 p.m., and of course users are more likely to move at those particular times, while on a weekend, on Sunday, you can see that people most likely sleep longer and are less likely to move, because this peak is much smaller than what happens during the workday, which is an expected result.
So then we tried to look at what the correlation is between the applications that people access and their mobility properties. One strong rule that we found is that weather is a good indicator of stationarity: close to 70 percent of accesses remain at the same location for the next six hours. So once you access weather, you are most likely to stay at the same location for the next six hours. We were trying to understand why this happens, and we realized that one possible reason is that it's Chicago, right? Once you see what the weather is like outside, you don't want to go anywhere, you just want to stay home for the next six hours.
So we don't know if this result has broader scope; we would need another location, something like San Diego or somewhere nicer, to understand whether this has a strong local component.
Another finding is that maps are a good indicator of movement: sixty-seven percent of users accessing that service can be seen moving in the next three hours. And then we found that there is a strong correlation, or anti-correlation, between given applications and mobility properties.
Stationary users tend to access music downloads a lot, while mobile users tend to access e-mail a lot. We tried to look into this in somewhat more detail, and here is the result. On the X axis we have the number of base stations seen by a user — this is one, and this is 50 — over a one-week period. And on the Y axis you have the application access probability. What you can see is that those who access music are mostly stationary users.
As mobility increases, users are less likely to engage in these kinds of downloads. Now, we were unable to determine exactly why this happens, but there could be two reasons. One is that people are unhappy with the bandwidth they get when they are moving a lot; if that is the case, then peer-to-peer kinds of mobile applications might have a lot of success, because they can improve throughput in these scenarios. But if power, the battery, is the concern, then that is not the case. Yes?
>>: Within applications or across time? What does 50 percent mean on the Y axis?
>> Aleksandar Kuzmanovic: For those who see only a single base station in the seven-day period, relative to the cumulative number of accesses to all applications, this is the percentage that are music downloads.
>>: So this doesn't say that you tend to check more e-mail when you move, it just says that the
fraction of times you access e-mail as a fraction ….
>> Aleksandar Kuzmanovic: No, no, no. This is the fraction of accesses going to that application.
>>: The denominator is all accesses to applications, right?
>> Aleksandar Kuzmanovic: Yes.
>>: So the absolute quantity could be the same? Absolute number of accesses could be the
same in the numerator?
>> Aleksandar Kuzmanovic: So all these things sum up to one at a given ….
>>: So this is relative use. It says nothing about the absolute use of applications.
>> Aleksandar Kuzmanovic: Yes.
>>: It may then be that even when I am stationary or mobile I check e-mail exactly the same
frequency.
>> Aleksandar Kuzmanovic: Maybe. But this is relative to all the users at a given ….
>>: Sure.
>> Aleksandar Kuzmanovic: Yeah. That's right.
>>: Yeah.
>> Aleksandar Kuzmanovic: Yeah, yeah, yeah. I mean, we are on the same page.
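[Editor's illustration: a hedged sketch of the relative-usage metric just discussed — for each mobility bucket (number of distinct cells a user was seen at during the week), the fraction of all application accesses going to each application. The (user, cell, app) record format is an assumption.]

from collections import Counter, defaultdict

def app_fractions(access_records):
    """access_records: iterable of (user, cell_id, app_name) tuples."""
    cells_per_user = defaultdict(set)
    apps_per_user = defaultdict(list)
    for user, cell, app in access_records:
        cells_per_user[user].add(cell)
        apps_per_user[user].append(app)

    per_bucket = defaultdict(Counter)
    for user, apps in apps_per_user.items():
        bucket = len(cells_per_user[user])          # mobility = distinct cells seen
        per_bucket[bucket].update(apps)

    # Normalize so that, within each bucket, the fractions over all apps sum to one;
    # this says nothing about absolute usage, as the questioner pointed out.
    return {b: {a: n / sum(c.values()) for a, n in c.items()}
            for b, c in per_bucket.items()}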
So the second insight is that the more you move, the more dominant an application e-mail becomes. What we have said is, okay, if you are moving a lot — if you see as many as 50 base stations during the week — that means you are really moving a lot, and that this mobile device has become your computer, so that's why you are probably checking mail more than others, right?
For stationary users, we believe they may have some other stationary desktop or something else that they use to access e-mail.
We found another interesting point for social networking, which is kind of in between the two: it peaks here, but we're not sure how significant this is, and we were unable to come up with anything more insightful than that. Okay.
We had other insights here as well; if you're interested, I can send you the paper or we can talk more.
Okay. So the final piece of my presentation is about infrastructure-free indoor positioning. This is work in progress, so it is still not finished; it's a project I am working on with my students.
To motivate this problem: we have seen that you can have good positioning when you are outside — GPS works, everything is fine — but once you go inside, it doesn't work well, or it doesn't work at all. My personal problem here is that I have a four-year-old son, and I often have to buy chocolate milk for him. I go to the grocery store, and they have chocolate milk in two different places: one is in the fridge, and the other is outside the fridge, where you have those 12-pack kinds of things. And I go to the fridge, I always buy from there, and then my wife kills me at home, asking why didn't you buy the 12-packs, they have them somewhere else. I have no idea where they are. And after I spent like 20 minutes walking around the grocery store trying to find the 12-pack chocolate milk — I finally found it — I told myself, okay, this is a problem that people are having. I am a human. I have a problem. I want to solve the problem.
So the question is how you do indoor positioning. We are not the first, of course, to look at this problem; there are a number of different approaches. Triangulation is one: of course, if GPS doesn't work, then you have to have some other infrastructure. For example, you can use cell towers, as in the CDMA approach — I believe you had a paper along these lines. Or you can put infrastructure inside the building, and once you have infrastructure inside, you can again do triangulation and try to figure out what your location is.
Then you have RF signatures, for example beacons: when you see a particular WiFi network, you know you are in its range, so you can say, aha, this is approximately where I am now. Or, if you don't like infrastructure, you can send a radio signal within the room and pick up the response; that response is a kind of fingerprint of a given location, and then you say, aha, this is where I am.
Now, the problem with these approaches is that you need infrastructure, which is not a small thing, and you need manual labor, right? Somebody has to go and fingerprint the entire space, which is not easy. It is not easy because, for example, if you have two points that are very close to each other in space, the responses in the delay and frequency domains can be very different, right?
So it's not easy to do this kind of thing, and if you go to the grocery store — my favorite foodmart where I want to buy chocolate milk — and tell them, how about you put some infrastructure in here so we can figure out where the chocolate milk is, they will tell you: that's fine, if you want to implement something and give us the money to deploy it, fine, but we are not going to do that on our own, we don't care about that.
So what we are trying to do is come up with a practical solution to this problem. The first thing is to let the building owner off the hook: we don't want to talk to the building owner, we don't want to install infrastructure, and we don't want to ask for any detailed schematics, right? We don't want to do any fingerprinting — no site surveys, no building detailed RF maps. We just want to rely on what can actually work in reality, which is letting the users report what they see, right?
In this particular case you have two examples: one is Whole Foods in Evanston and this is the Hyatt Regency Hotel. Many of these facilities publish free floor plans, right? So what we are saying is that with pre-computation you can extract information from these floor plans, and you can then use people's own reports to help them figure out where they are.
People are great noise filters. For example, if you send a radio signal and there are people standing around you, it's going to get distorted and so on. However, if you have a human here, he will tell you, aha, I see seafood here, and this can be used to do indoor navigation.
So here is an example — my favorite grocery shop. There you see olives, there you see a self-serve counter. Based on this information you already have some idea of where you can be in that space. And here is another example: wine and olives. The bottom line is that you can take advantage of the relationships among identifiable features in the room.
What that means is that we are not trying to land a spaceship on the moon at a particular point, and we don't want to target somebody with exact millimeter-level precision. What we need is something useful enough that people can be navigated: aha, go here, then go left, then go right, and then you find whatever you're looking for.
Let me give you a brief overview — I am going to be done in less than five minutes. What we are looking at here are some important definitions. An isovist is the area visible from a given location; here we have three isovists, green, blue, and red. So here you have points — I'm not sure if you can see them — and the green area is all the locations from which you can see this point, right? Then you have features, landmarks such as a cash register, a bathroom, an elevator, whatever else. And we define a region as a subset of coordinates sharing an identical feature vector.
Here is what it looks like. The bottom line is that if you see a given feature, then that feature sees you, right?
So, for example, if you have three features in this particular room, it can be shown that there are up to two to the H minus one (2^H − 1) locatable regions for H features. For example, assume a user says: I see A and B but not C. You end up with this particular region, right? So you have already narrowed down the potential locations of the user with a fairly small number of features in the room.
Now, if you increase the number of features, it becomes even better. For example, a user can say: I can see A and B but not C and D. You end up in these smaller regions, and the bottom line is that the more features you have, the more regions there are and the smaller the area in which the user can be.
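[Editor's illustration: a minimal sketch of the region idea under perfect reporting, assuming each feature comes with a precomputed isovist — the set of floor-plan cells it is visible from. Names and data layout are assumptions, not the actual system.]

# Illustrative sketch: narrow down a user's location from feature reports,
# assuming a precomputed isovist (set of grid cells) per feature.

def candidate_region(isovists, seen, unseen, all_cells):
    """Intersect isovists of seen features, subtract those reported unseen."""
    region = set(all_cells)
    for f in seen:
        region &= isovists[f]          # a feature sees you iff you see it
    for f in unseen:
        region -= isovists[f]
    return region

# With H features and perfect reporting there are at most 2**H - 1 non-empty
# regions in which at least one feature is visible. With partial reports (the
# realistic case discussed next), only the 'seen' intersections are applied.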
Now, the problem with this approach is that it assumes perfect reporting, right? That is, a user tells you: I see A and B, and I don't see C and D. What can happen in reality is that the user really does see A and B, and might also see C but simply not report it, right? There can be many features in the room, and you just don't report all of them.
So the question is: if you don't have this perfect reporting, how does this work? I'm going to show you a brief example. Here is my favorite room. Assume a user says: I see specialty seafood, and I see produce, right? Here you don't assume perfect reporting — you can't get the user to tell you exactly, I see A, B, C, D and I don't see the rest. But even with this fairly coarse information, you can still end up with a fairly small region using only a couple of reported features: you can be here, here, here, or here. Okay?
And then, based on this information, you can either ask the user for more information or use dead reckoning or other techniques to further reduce this space, and so on. Briefly, this is what the system looks like: you download a map from the company's Web site — and again, this doesn't have to be a very nice map, just a coarse-grained map can do the job — and we preprocess it at our server. You download this, and then, based on some user feedback, we can help you figure out where you are. This is work in progress, and what we are trying to do is add some additional user input or use dead reckoning to break the ties. We also have a very small sensor that we are going to put on a user in an attempt to try to understand how it ….
>>: I don't see how this helps you locate your chocolate milk. I mean, if I have a map that always says here's the chocolate milk and here are all the other important features of that sort, then I can just look at it myself. I mean, it's really not difficult and I can figure [inaudible].
>> Aleksandar Kuzmanovic: Sure.
>>: So that the kind of utility that helps me to locate myself in a store based on what I see right
now, it doesn't help me to find that chocolate milk.
>> Aleksandar Kuzmanovic: Sure, sure, sure. But you see, once you find the chocolate milk you
can tag that map or that information. And push it back to our Web site.
>>: Yeah, I know. I understand that there is a process of what I see and how I can add
information to [inaudible] map is useful, but I don't see how the location tool is useful.
>> Aleksandar Kuzmanovic: I'm sorry, what?
>>: The tool that you showed us to locate a human being in that supermarket based on what
[inaudible].
>> Aleksandar Kuzmanovic: So once you know where you are and you know where the given
target is, then you can go on and guide the user, the client and say, go left, go right.
>>: But I'm saying this is something the user trivially can do him or herself because like if I get to
see a map, I can locate myself, I don't need the computer to do that for me.
>> Aleksandar Kuzmanovic: Where does the map come from?
>>: By mapping the features. I'm saying I don't need a computer to do that. I can find them on the map and locate myself from that.
>> Aleksandar Kuzmanovic: Not necessarily. You might be able to do that, that's nice ….
>>: In a supermarket.
>> Aleksandar Kuzmanovic: In a supermarket, yes, but you can have a huge mall, you can have
a number of different places where it may not be as trivial to do that.
>>: [inaudible].
>> Aleksandar Kuzmanovic: Right? So, yeah. Let me conclude the talk. I gave you four different stories; I hope you survived them. The first topic I talked about is unconstrained endpoint profiling; the key point there is trying to harness information that is already available on the Web. As Ratoul [phonetic] said, it's not perfect; for some applications it works better than for others. But it can definitely be very helpful in traffic classification, as a source of external information, and a lot of that information is available on the Web.
Behavioral ad targeting: again, it was a legal issue that we used to come up with a research problem. The bottom line, and my argument, is that we need more fairness in this area.
Location-based services: we had a measurement study along these lines, and I think we have shown that there is strong potential there. One insight that was not emphasized in this particular talk is that there is a strong correlation between the types of applications you access and the locations from which you access them; it's not uniform, there is a strong correlation among these things.
And for indoor positioning, the bottom line is that we need a practical system — that was really the constraint here. How can we get there? We have built a system that I hope we can use in the near future to actually help people do this.
I have other projects; I'm just going to mention some of them. Of course, there is the great collaboration with Microsoft on glitch-free Internet audio-video conferencing. We have found that if you want to talk to somebody at Microsoft, you'd better go and talk to that person in his office, right? That's the best way — you get the best quality. Otherwise, using voice over IP is not the best idea. No, I'm joking.
I also have some collaboration with NSF and Google on net neutrality and with Cisco so I'm not
going to talk about that. I'd be happy to talk more about that later so thanks for being here and
listening to the whole presentation.
>> Jin Li: Thanks Alex.
>> Aleksandar Kuzmanovic: They're about to leave.
>> Jin Li: Alex is going to be here for the next two days, till Friday afternoon. If you're interested, feel free to meet him.
>>: Where is he staying?
>> Jin Li: 2975.
>> Aleksandar Kuzmanovic: Spend the summer there.
>>: [inaudible].
>>: [inaudible].
>>: [inaudible].
>> Aleksandar Kuzmanovic: How's it going? Thanks. Thanks for coming.