>> Jin Li: Hello, everyone. Thanks very much for coming to the talk. It's our great pleasure to have Aleksandar Kuzmanovic here to give us a talk on ISP-enabled behavioral ad targeting without user consent. Alex got his PhD from Rice University in 2005 and then joined Northwestern University, where he is now an assistant professor. His interest is in computer networking, with emphasis on measurement, analysis and protocol design for all kinds of Internet algorithms and protocols. Without further ado, let's hear what Alex has to say about this interesting topic of ad targeting.
>> Aleksandar Kuzmanovic: Thanks, Jin, so much. So I'm really happy to be here at Microsoft. I always liked Microsoft better than any other operating system. I'm a Microsoft kind of guy. I hate Macs. This can go online; this is my official statement. So today I'm going to give you a brief outline of four projects that my group is working on, so this is going to be a breadth kind of talk. I'm not going to go too deep into the technical details; I'll just put four different ideas to you, so if you're interested in learning more about any of them, you can read the paper, ask for the paper, or talk to me, and I'll be happy to answer any questions you might have. This is also good for me, because you'll barely have a chance to ask any question: I'm going to be talking very fast and moving from one topic to another, so by the time you ask a question I'll have moved on to the next topic. That was a joke, of course. So let's go straight to the first problem. The first problem is: can you use search engines to do networking research? The scenario I have in mind is a student who goes to a search engine, puts in the title of his project or PhD thesis, and out come ten different versions of this thesis; the student reads all these different theses, says "this is the one I like, this is my PhD thesis," and gets his PhD. This is, of course, not possible. That was another joke. So that is still not possible, but other things are possible. If you put a random IP address into a search engine such as Google, you can end up with various kinds of information coming out of that search engine. For example, this particular IP address is associated with the Gnutella network. And you might be surprised that if you start putting different IP addresses into different search engines, a huge amount of endpoint information is available on the Web. So the problem we try to answer is: can we systematically exploit search engines to harvest endpoint information available on the Internet? Before I try to answer this question, let me first explain how all this information comes to search engines in the first place. There are a number of reasons why information about endpoints can end up on the Web. One example is popular servers, for example gaming servers, whose IP addresses are listed and available on the Web. Once they are available on the Web, they get crawled and indexed by search engines, and they end up being there.
Then, when you go to a Web site, you might think that visiting that Web site is your own business; well, these Web sites can run logging software and display statistics, among other things showing which IP addresses they are accessed from. If this is not enough, you might go to a proxy network, where a number of proxy systems display their logs, and these become publicly available on the Web. And even with peer-to-peer systems, information is available on the Internet for a number of reasons: for example, whenever you access a peer-to-peer system you first have to go to a public domain to get the information about the files that you want to access, and this again becomes publicly available. And of course there exist a number of blacklists, banlists, spam lists and so on, which are another source of information that becomes publicly available. So what we did is design a methodology - I'm not going to spend a lot of time here, just give you a brief outline - for going from a given IP address to tagging that IP address with its appropriate properties, for example, what kind of activities are associated with this particular IP address. Google is the search engine we used: when you enter an IP address you end up with a number of URLs and the hit text, and then we look at the domain name itself to try to understand what this IP address is about. The simple case is when you can tell directly from the domain name what the IP address is associated with; for example, if "dns" appears in the domain name, this implies the address is a DNS server. Of course this is not always possible, and then we have to do something else. From the artificial intelligence point of view this is a toy problem; it is not our main contribution. So if you are an AI person looking for deep things here, this is not the place to look; this is just a small hack by a couple of networking people on their way to something we know better. Once you have this, you go on and tag the IP address with the appropriate features; a single IP address can be tagged with a number of different features. Okay. What is this good for? Well, for example, for understanding what applications people are using across the world without having access to their network traces. It is hard to obtain network traces: you go to ISPs and ask for traces, and they tell you no, you can't get that, there are security issues, there are privacy issues. With this approach you can go on and analyze different networks and try to understand what kind of activity is present in each one. I'm skipping a huge part of this problem here; basically, through our collaborators we managed to obtain packet-level traces from a number of different places in the world, and then we compared the information gathered from those packet-level traces with our approach, where we simply feed in IP addresses one by one and try to understand what is going on.
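To make the tagging step concrete, here is a minimal sketch of the idea, assuming a generic web_search call and a hand-built keyword-to-tag table; both are hypothetical stand-ins rather than the rules the actual system uses:

```python
# Minimal sketch of the endpoint-tagging idea (not the actual UEP system).
# `web_search` is a hypothetical stand-in for whatever search API is used;
# the keyword-to-tag table is illustrative, not the paper's rule set.
import re

KEYWORD_TAGS = {
    "dns":       "dns server",
    "mail":      "mail server",
    "proxy":     "proxy/log site",
    "gnutella":  "p2p (gnutella)",
    "torrent":   "p2p (bittorrent)",
    "blacklist": "blacklisted",
    "spam":      "spam list",
    "game":      "gaming server",
}

def web_search(query):
    """Hypothetical search call returning a list of (url, hit_text) pairs."""
    raise NotImplementedError("plug in a real search API here")

def tag_endpoint(ip):
    """Query a search engine for an IP address and tag it with features
    found in the result URLs and hit text."""
    tags = set()
    for url, hit_text in web_search(f'"{ip}"'):
        haystack = (url + " " + hit_text).lower()
        for keyword, tag in KEYWORD_TAGS.items():
            if re.search(keyword, haystack):
                tags.add(tag)
    return tags

# A single IP can end up with several tags, e.g. {"p2p (gnutella)", "blacklisted"}.
# The same table of tagged endpoints can later drive per-packet traffic
# classification: classify a flow by the tags of its destination IP.
```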
And the key insight here is that if an application is very popular in a particular region, we are capable of capturing that both in the packet traces and with our approach. Of course, our approach is incapable of telling you exactly how the different applications are used and so on, but if something shows up strongly in a given region, we are capable of capturing it.
>>: Now are you talking about Web-based applications? Because how would you know if they're using Word or Office or Excel or running some PC-based game on their computer ....
>> Aleksandar Kuzmanovic: So the question is what kind of applications am I talking about, and how do we know whether they're using this or that type of application. When I talk about applications here, I mean things like streaming media, Web applications, peer-to-peer and so on. I am not talking about client-level applications.
>>: [inaudible] you described you did some validation of the quality of the inferences that are coming out. It seems that for someone to actually use this in practice they would have to keep validating the system, because even if it works in the present, that might have no bearing on whether the technique will continue to work in the future.
>> Aleksandar Kuzmanovic: Because?
>>: Because all these harvesting methods exist really by accident - whether or not a client gets logged by using some manner of application [inaudible]. It's almost accidental that that information gets stored, and it gets stored for a lot of reasons that may or may not exist for certain applications.
>> Aleksandar Kuzmanovic: I can agree with that to some extent, but I have to repeat my claim. My claim was that if something is heavily present in a given region, we are capable of getting that. I agree there is a real question of how you know whether a given application is actually reflected on the Web; that is a huge problem, of course. But the claim here is that if something has a strong presence in a given region, it does leave a sufficient trace. That's what we have seen.
>>: Today.
>> Aleksandar Kuzmanovic: Today.
>>: Okay. Sure.
>> Aleksandar Kuzmanovic: Yeah.
>>: I mean, it's one of those unknown unknown things. Right?
>>: Oh yeah, [inaudible], I'm not saying this yet. But [inaudible].
>> Aleksandar Kuzmanovic: Sure, sure, sure. So again, I agree, this is the shakiest application of the approach that we came up with; it is just one of the applications. I'm next going to explain the second one, which I hope will convince you it's not as shaky. But I do agree that there is a lot of uncertainty here. The way we treat this uncertainty is that we simply compare against the ground truth from four different places where we were capable of gathering packet-level traces, and that is what it showed. Is it going to work in the future? I don't know. We'll see. So the next application is the traffic classification problem. Assume you do have access to packet-level traces; how can you figure out what is going on, what kinds of applications are being used? Current approaches are port-based, payload-based signatures, numerical and statistical analysis and so on; you are trying to understand what is going on.
Now what we are doing here is saying: okay, use the information about the destination IP address that is available on the Internet, which you can get using our approach. Once you use this approach, you can get much better granularity in your inferences and you can classify a much larger percentage of the traffic. Here I am comparing against another traffic classification approach; I'm not going to talk about it, because that's not the point here. The point is that we are capable of performing much better traffic classification than they are. On the X axis you have the sampling rate - how many packets you randomly take out of the packet stream - and on the Y axis you have the percentage of traffic classified. These lines are for two different networks where we were capable of gathering packet-level traces. What it shows is that when there is no sampling, so you get all the packets, we are better than they are; however, when the traffic is sampled, their approach degrades badly, while we retain high classification capability. So the bottom line is that despite huge sampling rates we are still capable of classifying large amounts of traffic, simply because we work on a single packet: given the destination address, we look it up on a search engine, we know what this endpoint is about, hence we can tell you what the traffic is. So I will leave it here for the first topic. If you are interested, I'd be happy to talk more about this project, and there is information online as well. So let me jump to the second problem I am going to talk about today: ISP-enabled ad targeting. With the first project, people started telling us, oh, you guys are promoting Google, why did you put that in the title, that's not fair, blah, blah, blah. We are not promoting Google; we just used that as a way to promote our paper, right? The second project actually goes, in a way, against Google's approach to doing things, so I'm going to explain that next. The second problem I'm going to talk about is behavioral targeting. Online ad targeting is a huge business - a $20 billion industry - covering all the banner ads, rich media, e-mail marketing, all that stuff that comes to you when you're looking at a given Web site. Behavioral advertising is a part of this ad targeting business, and to explain simply how it works: user A accesses site X, site X belongs to category P, for example sports; the next time user A accesses site Y, an advertisement of type P is going to show up on that site, because the user has already shown interest in that particular activity. Looking at the market share in the ad targeting business, Google owns 35 percent of the business and DoubleClick owns 34 percent; Google bought DoubleClick, hence 35 plus 34 equals 69, and 69 is greater than 50, so where I come from that is called a monopoly in a given business. So Internet service providers, on the other hand - yes?
>>: [inaudible] is that revenue based or is it based on [inaudible] ....
>> Aleksandar Kuzmanovic: It is based on the number of clicks through ads or something like that. I can give you a more precise ....
>>: [inaudible].
>> Aleksandar Kuzmanovic: Yeah, yeah, yeah. So Internet service providers have for years looked at these things and seen that the Web-based companies are taking all the money, while they are putting in all this infrastructure, shipping packets from point A to point B, and not gaining as much money as they would like. So the question they asked is: hey, how about we do behavioral ad targeting? All the packets from a user are going this way, so if they do packet inspection - look at the packets and try to understand what information the user is interested in - they can gain the same information. Then they can sell this to the ad targeting companies and gain some money as well. Which sounds like a reasonable approach. However, the problem here is a legal one. For Internet service providers as broadband providers, this law, the Federal Wiretap Act, applies to them, and it does not apply to the Web-based companies. It says that "thou shalt not intercept the contents of communications. Violations can result in civil and criminal penalties." So it is illegal for ISPs to look at the packet payload to understand what is going on. And there has been huge pressure on ISPs not to do this, both from consumers and from the federal government and so on. Yes?
>>: Is it okay to look at addresses? Destination addresses? Or ....
>> Aleksandar Kuzmanovic: That's what I'm going to talk about next. It is okay.
>>: This is black and white [inaudible].
>> Aleksandar Kuzmanovic: So here's the thing. This law came out in 1969 and it was designed to prevent phone wiretapping. It was extended in 1986, before the Internet as we know it was around, to say that this holds for computer communications as well. And now, in 2009, ISPs are unable to do this stuff because of what came out in 1986, which kind of doesn't make sense, right? And there is no fairness in this market, because Web-based companies are capable of doing this without any problems.
>>: I disagree with your stance on that. Basically there is a conversation between the user and the Web site you're going to. That is the conversation that is actually sanctioned. To look into that conversation by looking at the packets going back and forth is, to me, a third party who is ....
>> Aleksandar Kuzmanovic: Sure, sure, sure. That is what the whole argument is all about.
>>: And the second part is, that Web site has a privacy policy so that the user understands it. The user can't get access to what the ISP is supposedly doing because [inaudible].
>> Aleksandar Kuzmanovic: Sure, sure, sure. So my point here is that currently it's not the same game for ISPs and Web-based advertisers. And I think it should be the same game. Right?
>>: No.
>> Aleksandar Kuzmanovic: Why not?
>>: [inaudible] I called Amazon or I called eBay or I called somebody else ....
>> Aleksandar Kuzmanovic: Sure, sure, sure.
>>: If I go to my ISP's Web site, sure, I agree. But if I don't, then why is the ISP jumping in the middle? I don't want my ISP ....
>> Aleksandar Kuzmanovic: Are you aware that your communication is being monitored by Google and by DoubleClick? Are you aware of that or not?
>>: Yes.
>> Aleksandar Kuzmanovic: Did you sign up for that or not?
>>: When I go to a Web site - but you don't sign up for any of those things.
>> Aleksandar Kuzmanovic: That's my point. There should be a fair game here. Right? And it's currently not there.
>>: No ....
>> Aleksandar Kuzmanovic: So let me ....
>>: The privacy policy exists on Web sites you go to.
>> Aleksandar Kuzmanovic: Your Web browser is full of Google and DoubleClick cookies and users don't know about that. It should be the same principle. So what I'm saying here is that this law is currently used to keep ISPs from doing behavioral ad targeting. What we're saying is that you can turn the law around: if that 1986 law forbids one thing, well, we can use another law to do something else. And I will tell you, I have consulted with Paul Ohm from the University of Colorado, who is an expert in these issues. Another piece of law, the Electronic Communications Privacy Act, states that "any provider can hand over non-content records to anyone except the government." What this means is that TCP headers could be legally shared: they are non-content records, patterns of communication between endpoints. So the research question we wanted to ask is: once you have TCP headers, which you can legally share with anyone except the government, can you do behavioral ad targeting based on them? And we claim it's possible. I will come back to this legal issue at the end of this part of the presentation, because I do agree with you that users should be given the right to say "I don't want this to be monitored." But it should be equal for everybody. It cannot be that ISPs are somehow the bad guys while these Web-based companies are the good guys who can do it without any questions. So the question is: can you do this? Our answer is that you can. Basically, you can collect statistics about the Web pages of a given Web site, and then you can compare that to the Web-level information available from TCP packets. Once you compare these two sources of information, it becomes possible, as I will show, to understand content-level information based only on non-content communication. Okay, Web profiling - collecting statistics about Web pages. I'm not going to go too deep into this, but the Web nowadays is a fairly complex beast. Each Web page has a root file, the index file, and then a number of different objects that come with it, so the first distinction is root files versus object files. Object files can be transferred in various transfer modes - compressed or non-compressed - and there is a variety of content distribution methods, so objects can be stored internally on the Web site or externally on a content distribution network. They can be cacheable or non-cacheable. The bottom line is that there is significant statistical diversity among the pages of a given Web site, and this diversity can be used to distinguish pages even if you don't have direct access to the content. So this is one piece of the puzzle. The second piece of the puzzle is that once you look at TCP packet-level traces, you find that all of this is reflected from the Web layer down to the TCP layer.
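To make the two pieces of the puzzle concrete, here is a minimal sketch under strong simplifying assumptions: a crawled profile keeps, per page, only the approximate object sizes split into internal (origin server) and external (CDN) objects, and a page view reconstructed from TCP headers is matched to the closest profile. The feature set, distance function, and threshold are illustrative stand-ins, not the system's actual statistics:

```python
# Minimal sketch: match a TCP-level page view against crawled page profiles.
# The profile features and the matching rule are simplified assumptions.
from dataclasses import dataclass

@dataclass
class PageProfile:
    url: str
    internal_sizes: list   # approx. object sizes served by the origin server
    external_sizes: list   # approx. object sizes served by CDNs

def _distance(observed, profile_sizes):
    """Crude distance between two lists of object sizes."""
    a, b = sorted(observed), sorted(profile_sizes)
    if len(a) != len(b):
        return abs(len(a) - len(b)) * 10_000   # penalize object-count mismatch
    return sum(abs(x - y) for x, y in zip(a, b))

def match_page(observed_internal, observed_external, profiles, threshold=20_000):
    """Given object sizes reconstructed from TCP headers for one page view,
    return the best-matching crawled page, or None if nothing is close."""
    best, best_score = None, float("inf")
    for p in profiles:
        score = (_distance(observed_internal, p.internal_sizes)
                 + _distance(observed_external, p.external_sizes))
        if score < best_score:
            best, best_score = p, score
    return best.url if best is not None and best_score < threshold else None
```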
So the bottom line is that if you observe the packets generated by a single user, you can say, aha, these packets belong to one page, because there is a natural delay between the streams of packets requested by a single user for different pages. At the same time, by looking at TCP headers - I'm not going to go into details - you can distinguish the different elements at the TCP layer: you can say this is one object, this is another object, this is a Web page, and so on. I can send you the paper; it's a fairly straightforward thing, there is no huge science here. Okay, so once you have that, here are the results. We had six different Web sites that we were exploring: the New York Times, FC Barcelona - we did this before they became European champions - Ikea, Toyota, and two universities, Northwestern and the University of Granada. What you can see here is that the success rate, which is the probability of successfully detecting which page was accessed, is around 85 percent, and the false positives are below five percent, in all cases. So ....
>>: [inaudible].
>> Aleksandar Kuzmanovic: Yes?
>>: [inaudible].
>> Aleksandar Kuzmanovic: So basically you have a user who accesses, let's say, 100 Web pages. Out of these 100 we can successfully detect 85, and for five pages we make false positives: we say these pages were accessed when they actually were not.
>>: Without looking at the HTTP header ....
>> Aleksandar Kuzmanovic: Without looking at the HTTP header. Only looking at TCP.
>>: [inaudible].
>> Aleksandar Kuzmanovic: Okay, so the next thing we looked at is this: to do something like that, an ISP would have to crawl all the Web sites, because you have to build profiles for a large number of sites. That is one thing. The second thing is that if you are an ISP and you collect all the TCP headers and want to ship them to a third party, to an advertising company, it takes time: you put this on a device, you ship it there, some time passes. So the question is, if you have imperfect crawling - for example you can crawl a Web site only once per week - or delays in shipping data from the ISP to the advertising company, can that hurt the performance we see in this basic case? So the answer is yes.
>>: You seem to rely on - I'm guessing - why is the advertising company getting this information from the ISP and not from the site the user is going to?
>>: [inaudible] ISPs basically want to get into the business of advertising. They want to ....
>>: [inaudible] they are not intercepting traffic, they are not actually changing content being served to [inaudible], so the advertising company is better off ....
>> Aleksandar Kuzmanovic: Then you would have to communicate with tens of thousands of Web sites that users are accessing, while here you collaborate with a single ISP who has all the information about the user. Whatever goes through that ISP, the ISP sees; whereas if a site the user accesses is not cooperating with the advertising company, you don't even know that the access happened.
>>: But the site must be collaborating with the advertising company, because the site itself is placing ....
>> Aleksandar Kuzmanovic: Not necessarily. I mean, there are different advertising companies.
>>: It's better to create user profiles ....
>> Aleksandar Kuzmanovic: At the source.
>>: On the client side than on the server side.
>> Aleksandar Kuzmanovic: Yeah.
>>: It's simpler.
>>: Either way, I mean, it's an extra source of information, right? If it could be produced, it would be valuable. I guess my question is how much granularity is necessary. I mean, just knowing IP addresses and the server - New York Times versus Ikea - is that enough to go on?
>> Aleksandar Kuzmanovic: That's a very good question. In some cases you would for sure like to have more granularity. If I'm going to Ikea, I want to buy some furniture - what exactly do I want to buy? That can fine-tune the advertisement that eventually comes to me, and it increases the probability that I might click on that advertisement later. Or if I look at the New York Times, there are fairly diverse things I could be looking at; if I'm looking at car prices, that implies I might be looking for a car, so a car advertisement may come to me. So more information is always good, I think. In some cases it can be fairly straightforward - if I'm looking at Football Club Barcelona's Web site, I'm interested in football and in sports, and maybe that extra granularity is not necessary in that particular case - but there are other cases where it is important. And we should write papers, right, so we have to do something here. Yes?
>>: So the server IP address is used to [inaudible] site and [inaudible] too?
>> Aleksandar Kuzmanovic: Yes.
>>: So it does like [inaudible]?
>> Aleksandar Kuzmanovic: No, no, no. We only classify destinations as either the origin server or a CDN. Any Akamai or other CDN traffic is just external information. We use that to understand what kind of Web page the user accessed, but that's the only thing; we are not looking at the IP addresses of particular Akamai servers or anything like that. We're just saying external versus internal: internal means you go to the origin server, external means you go to CDNs, irrespective of which CDNs.
>>: [inaudible].
>> Aleksandar Kuzmanovic: Typically you would get some content from the origin server and the rest from the CDNs. Yeah. Okay. So the question here becomes: if the traces are old - you ship traces to a given advertising company and some time passes, say a week goes by - can you still use these traces? Can they still be useful? So we compared old traces with current Web profiles, and the bottom line is that essentially nothing changes over time: you still get high success rates and low false positives. Why does this happen? Because for these three particular sites - Toyota, University One, University Two - the change rate of the site is very small. Both the root pages and the objects on these sites do not change dramatically over time, hence we are still capable of reusing the traces.
>>: [inaudible].
>> Aleksandar Kuzmanovic: I am going to show that next.
So next the question is what happens with Web sites that do change a lot. Here we capture the TCP trace on the last day and we have the profile of the New York Times from seven days earlier. Can we still do a good job? This shows that we can: the New York Times here is the blue curve, and it doesn't degrade dramatically over time. Why does that happen? Well, if you look at the change rate, for the New York Times it's quite high: sixty percent of pages change over time. But these are mainly the root pages, the index pages - on a given page people leave comments and other things, so the root does change. However, the objects on that Web site, which is what this particular graph shows, don't change much; it's a very small change, below five or so percent. So the template of a page stays the same while the content changes, and you can still do a fairly good job of distinguishing different pages despite the fact that the content in this environment changes often, largely because people leave comments and do other things there. Okay. So to summarize: the key motivation for this work is that I believe, and my students believe, that the online advertising business needs more fairness. We also strongly believe that users should know whether their traffic is monitored or not. And we have shown that the law itself is quite outdated, because you can take one piece of law that says you can't do this, and then take another piece of law and find that you can do it after all - you don't even have to get consent from users. So the bottom line, what we are arguing for, is that we need a comprehensive legislative reform that would make a fair game out of this whole business. And we of course believe that such a reform should not be used to kill academic research: we researchers should be given a lot of data to look at, because once we have access to data we can do very nice things. So this is basically the introduction to the third project I am going to talk about: measuring serendipity in a mobile 3G network. We were given a very large trace from a mobile operator - yes?
>>: Before you move on ....
>> Aleksandar Kuzmanovic: Yes, you want to talk more.
>>: Deep packet inspection [inaudible] has been challenged by lots of industry experts, privacy advocates, governments, because it seems to be inappropriate and of very low value to consumers. So as an ISP, you ask me if I want you looking at my packets and whether I want to opt in or opt out; it's like, what do I get from it? Well, you get to sell my data to someone else and make money. Do I get anything? No, you get to look around at what I am doing. So why is that unfair, and why is that necessary to help research ....
>> Aleksandar Kuzmanovic: Yeah. But Google does the same thing. Exactly the same thing.
>>: Tell me how they do it.
>> Aleksandar Kuzmanovic: Because that's how the online advertising business works.
>>: Google is not an ISP.
>> Aleksandar Kuzmanovic: That is exactly true. But Google is doing exactly the same things that I am saying the ISP should be given an opportunity to do. Why not?
>>: The scenario is not the same.
>>: I have to choose Google. I mean, it's a free service.
I don't have to use Google, I don't have to enable the cookies, right? But, I mean ....
>> Aleksandar Kuzmanovic: But you have it enabled right now and you don't know about it.
>>: That's a different issue.
>>: It's not ....
>> Aleksandar Kuzmanovic: It's not a different issue ....
>>: [inaudible].
>> Aleksandar Kuzmanovic: What I'm saying - and I can send you, I mean, there are other opinions on this - is that it should be fair, it should be the same.
>>: I agree it should be fair, but this is not fair because it's not the same. It's apples and oranges. I go to a Web site, and I can actually choose to say I'm going to opt out of advertising for Google, for Microsoft, [inaudible].
>> Aleksandar Kuzmanovic: If it is opt out, it should be opt out both for these guys and for the Web-based advertisers. Because now Google and others are saying, oh, you guys are the bad guys and the law is not on your side, so you should apply an opt-in approach. So why doesn't Google apply an opt-in approach?
>>: I get free content from a Web site. In return they show me ads. The ISP is not giving me any content.
>> Aleksandar Kuzmanovic: We fundamentally disagree on this. I would like to move on. I will be happy to discuss more.
>>: You're right.
>> Aleksandar Kuzmanovic: Totally. Okay. Yes?
>>: I don't want to discuss this more, but I just want to point out that we computer scientists are a little uncomfortable trying to interpret what the law says, especially when there are no cases that establish exactly how the courts interpret the law. And I find motivating problems based on my or your interpretation of the law sort of shaky, because I'm not a lawyer, I don't understand what's going on. I can read it, I have an opinion - sure, but it's just my opinion.
>> Aleksandar Kuzmanovic: I agree with that to some extent. I'm just saying that there is a lot of opportunity to come up with interesting research problems from that perspective.
>>: So I think the motivation is, look, by doing packet inspection you open up a huge can of worms, and you can do a nice job looking at the TCP headers [inaudible], that's great. But it's sort of like debating what the law says, I ....
>> Aleksandar Kuzmanovic: We had a [inaudible] who had exactly that point. Yeah, that's a very good point.
>>: I wasn't [inaudible].
>> Aleksandar Kuzmanovic: Yeah, I didn't imply that. Let me move on very quickly because I have other things to talk about, and I'll be happy to discuss this issue further because I see it's kind of important. So the motivation here is that once researchers are given a lot of data, it's great - they can do very interesting things. In this particular case we had a lot of fine-grained data from a mobile operator, and at this point we were not looking at law or anything like that; once you get the data you forget about the law, you just do your research. Okay. So, to motivate this quickly: social networks are becoming very popular these days, and some believe that the future of social networking will take place on mobile devices, as opposed to being limited to desktops. Some of the applications being enabled by this technology are the so-called serendipitous discovery of people, businesses or locations.
So what does that mean? For example, if you go to a neighborhood and your friend is in the same neighborhood - if you're in the same vicinity geographically - you can hook up together. This is an application offered by loopt, a company from New York. Another example is behavioral ad targeting for businesses: if you go to a given area and a business knows what your interests are, they would be happy to send you advertisements based on that. Now, users are not necessarily happy to receive these advertisements, but that's a different question. And then locations: for example, you can do tagging. You go to a restaurant, you like the restaurant, you tag it and send it to your friends; your friends are in the neighborhood, they see, ah-hah, that was a great restaurant suggested by a friend of mine, I'm going to go there. So what I am arguing next is that coarse-grained location information is very useful for these services. Why am I saying this? Because we got hammered by some reviewers for having fairly coarse-grained information in our dataset, which consists basically of base stations, with the claim that this is not sufficient to support these kinds of services. I argue against that. The point is that even if you have coarse-grained information at the level of a block, a base station, or a city, that is still useful, because it can help you hook up with friends. And at the same time, users are more comfortable sharing coarse-grained information than GPS-level information - this is another argument people make: telling exactly where I am on the map is maybe not that secure or private. So the question we try to answer here is: what is the relationship between mobility properties and application affiliation in the cyber domain? If you are interested in a given application, how does that relate to your mobility properties, and how does it increase or decrease the probability of meeting others who share the same interests? Let me just give you a brief description of what we were doing. Here are the [inaudible] about our dataset: we had more than 3 million packet data sessions from more than 280,000 clients, and close to 1,200 locations in the Greater Chicago Area, and we were trying to understand what is happening in all this data. The way we extracted human movement is as follows: a user connects to a base station, and the connection goes through a RADIUS server - Remote Authentication Dial-In User Service - to which we had access. So we were able to look at all the information about users, and if a user moves from point A to point B within the same session, we are capable of seeing this hand-off, so we can track the location of the user. We also had inter-session movement: the user accesses a given base station, then switches off, then moves to another location and reappears, and we are still capable of capturing this. What we argue in the paper is that despite this sampled kind of behavior - even if users are not online all the time and they switch on and off - we are fully capable of capturing their basic mobility properties. I'm going to leave the arguments along these lines for one-on-ones if you're interested. So then we use rule mining techniques to understand the basic movement properties.
Here we simply define a group of users who sit at point A and access a given service within a time window W, and who are then seen in another location within an interval delta. For these users we define a movement rule like this. Then we have a stationary rule, which just means you stay at the same location, and a disappear rule, which means users simply switch off and are no longer in the picture. This is a better way to show it: here we have position A; the rule support is the number of people present at A at a given point in time; these users then move to location B; the rule confidence is the number of people that move from A to B; and finally, the confidence probability is the confidence divided by the rule support. The larger the confidence probability, the more confident we are that the rule has some meaning. So let me show you just some of the results that we had; I'm not going to dive into the details and give you the whole story - if you're interested, I can send you the paper. What we see here: there are two Y axes, Y1 and Y2. Y2 is the total confidence of rules, so here you can see, for example, 37,000 people, here 31,000 people, and so on, and this particular curve is associated with this particular rule. What we see is that on a weekday there are peaks at 8:00 a.m. and at 5:00 p.m. - of course users are more likely to move at those particular times - while on a weekend, on Sunday, people most likely sleep longer and are less likely to move, because this curve is much smaller than what happens during the workday, which is an expected result. Then we tried to look at the correlation between the applications that people access and their mobility properties. One strong rule that we found is that weather is a good indicator of stationarity: close to 70 percent of those who access weather remain at the same location for the next six hours. We tried to understand why this happens, and we realized that one possible reason is that it's Chicago, right? Once you see what the weather is like outside, you don't want to go anywhere; you just want to stay home for the next six hours. So we don't know if this result has broader scope; we would need another location, something like San Diego or somewhere nicer, to understand whether it has a strong local component. Another finding is that maps are a good indicator of movement: sixty-seven percent of users accessing that service can be seen moving in the next three hours. And we found that there are strong correlations, and anti-correlations, between given applications and mobility properties: stationary users tend to access music downloads a lot, while mobile users tend to access e-mail a lot.
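As a concrete illustration of the rule statistics defined a moment ago, here is a minimal sketch of computing support, confidence, and confidence probability for one move rule from a toy session log; the log format and numbers are made up for illustration, and the actual study mines many such rules, conditioned on which application was accessed:

```python
# Minimal sketch of move-rule statistics: support, confidence, and
# confidence probability (confidence / support). Each log entry is
# (user, timestamp, location); times are in hours.
def move_rule_stats(log, loc_a, loc_b, t0, window, delta):
    """Support    = users seen at loc_a during [t0, t0+window).
       Confidence = those users later seen at loc_b within delta.
       Confidence probability = confidence / support."""
    at_a = {u for (u, t, loc) in log
            if loc == loc_a and t0 <= t < t0 + window}
    moved = {u for (u, t, loc) in log
             if u in at_a and loc == loc_b
             and t0 + window <= t < t0 + window + delta}
    support, confidence = len(at_a), len(moved)
    prob = confidence / support if support else 0.0
    return support, confidence, prob

# Toy example: three users at A around 8 a.m.; two of them appear at B by noon.
log = [("u1", 8, "A"), ("u2", 8, "A"), ("u3", 8, "A"),
       ("u1", 10, "B"), ("u2", 10, "B"), ("u3", 10, "A")]
print(move_rule_stats(log, "A", "B", t0=8, window=1, delta=3))  # (3, 2, 0.666...)
```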
We then looked into this in somewhat more detail, and here is the result. On the X axis we have the number of base stations seen by a user - this is one, and this is 50 - over a one-week period, and on the Y axis you have the fraction of accesses going to each application. What you can see is that those who access music are mostly stationary users; as mobility increases, they are less likely to engage in these kinds of downloads. Now, we were unable to determine why exactly this happens, but there could be two reasons. One is that people are unhappy with the bandwidth they get when they are moving a lot; if that is the case, then peer-to-peer kinds of mobile applications might have a lot of success, because they can improve throughput in these scenarios. But if power, the battery, is the concern, then that is not the case. Yes?
>>: Within applications or across time? What does 50 percent mean on the Y axis?
>> Aleksandar Kuzmanovic: Of those who see only a single base station in the seven-day period, out of the cumulative number of accesses to all applications, this is the percentage that are music downloads.
>>: So this doesn't say that you tend to check more e-mail when you move, it just says that the fraction of times you access e-mail, as a fraction ....
>> Aleksandar Kuzmanovic: No, no, no. This is the fraction of accesses to each application.
>>: The denominator is all accesses to applications, right?
>> Aleksandar Kuzmanovic: Yes.
>>: So the absolute quantity could be the same? The absolute number of accesses could be the same in the numerator?
>> Aleksandar Kuzmanovic: So all these things sum up to one at a given ....
>>: So this is relative use. It says nothing about the absolute use of applications.
>> Aleksandar Kuzmanovic: Yes.
>>: It may then be that whether I am stationary or mobile, I check e-mail at exactly the same frequency.
>> Aleksandar Kuzmanovic: Maybe. But this is relative to all the users at a given ....
>>: Sure.
>> Aleksandar Kuzmanovic: Yeah. That's right.
>>: Yeah.
>> Aleksandar Kuzmanovic: Yeah, yeah, yeah. I mean, we are on the same page. So the second insight is that the more you move, the more dominant an application e-mail becomes. What we said is, okay, if you are moving a lot - you see as many as 50 base stations during the week - that means you are really moving a lot, and this mobile device has become your computer, so that's probably why you are checking mail more than others. For stationary users, we believe they may have some other stationary desktop or something else that they use for e-mail. We found another interesting point for social networking, which is kind of in between the two: it just becomes big here, but we're not sure how significant this is, and we were unable to come up with anything more insightful than that. Okay. We had other insights here; if you're interested, I can send you the paper or we can talk more. Okay. So the final piece of my presentation is about infrastructure-free indoor positioning. This is work in progress, so it is still not finished; it is a project I am working on with my students. To motivate the problem: we have seen that you can have good positioning when you are outside - GPS works, everything is fine - but once you go inside it works poorly or not at all. My personal problem here is that I have a four-year-old son, and I often have to buy chocolate milk for him. I go to a grocery store, and they have chocolate milk in two different places: one is in the fridge, and the other is totally outside the fridge, where you have these 12-pack kinds of things.
And then I go to the fridge, I always buy from there, and then my wife kills me at home, asking why didn't you buy the 12-packs, they have them somewhere else. I have no idea where they are. And after I spent about 20 minutes walking around the grocery store trying to find the 12-pack chocolate milk - I finally found it - I told myself, okay, this is a problem that people are having. I am a human. I have a problem. I want to solve the problem. So the question is how you do indoor positioning. We are not the first, of course, to look at this problem; there are a number of different approaches. Triangulation is one: if GPS doesn't work, then you have to have some other infrastructure. For example, you can use cell towers, like the CDMA approach - I believe you had a paper along these lines - or you can put some infrastructure inside the building, and once you have infrastructure inside you can again do triangulation and try to figure out what your location is. Then you have RF signatures, for example beacons: when you see a WiFi network, you are in its range, so you can say, aha, this is approximately where I am now. Or, if you don't like infrastructure, you can send a radio signal within the room and pick up the response; this is a kind of fingerprint of a given location, and then you can say, aha, this is where I am. Now, the problem with these approaches is that you need infrastructure, which is not a small thing, and you need manual labor: somebody has to go and fingerprint the entire space, which is not easy. It is not easy because, for example, if you have two points that are very close to each other in space, the responses in the delay and frequency domains can be very different. So it's not easy to do this kind of thing, and if you go to a grocery store - my favorite food mart where I want to buy chocolate milk - and you tell them, how about you install some infrastructure here so we can understand where the chocolate milk is, they will tell you: that's fine, if you want to implement something and give us the money to deploy it, fine, but we are not going to do it on our own, we don't care about that. So what we are trying to do is come up with a practical kind of solution to this problem. The first thing is to let the building owner off the hook: we don't want to talk to the building owner, we don't want to install infrastructure, we don't want to ask for any detailed schematics. We don't want to do any fingerprinting - no site surveys, no building detailed RF maps. We just want to rely on what can actually work in reality, which is to let the users report what they see. In this particular case you have two examples: one is Whole Foods in Evanston, and this is the Hyatt Regency Hotel. Many of these facilities publish floor plans, so what we are saying is that with pre-computation you can extract information from these plans and use people to help navigate, to understand where they are. People are great noise filters. If you send a radio signal and there are people around you, it's going to get distorted and everything else.
However, a human standing here will tell you, aha, I see seafood here, and this can be used to do indoor navigation. So here is an example - my favorite grocery shop. Here you see olives, here a self-serve counter; based on this information you already have some understanding of where you can be in that space. And here is another example, wine and olives. The bottom line is that you can take advantage of the relationships among identifiable features in the room. What that means is that we are not trying to land a spaceship on the moon at a particular point, and we don't want to kill somebody with exact millimeter-level precision. What we need is something useful enough that people can be navigated: go here, then go left, then go right, and you find whatever you're looking for. Let me just give you a brief - I am going to be done in less than five minutes - some important definitions. An isovist is the area visible from a given location; here we have three isovists, green, blue and red. Here you have points - I'm not sure if you can see them - and the green area is all the locations that you can see this point from. And you can have features, such as landmarks: for example, a cash register, a bathroom, an elevator, whatever else. We then define a region as a subset of coordinates sharing an identical feature vector. Here is what it looks like. The bottom line is that if you see a given feature, then that feature sees you. So, for example, if you have three features in a particular room, it can be shown that there are 2^H - 1 locatable regions for H features, and if a user says "I see A and B but not C," you end up with this particular region. So you have already limited the potential locations where a user can be, with a fairly small number of features in the room. If you increase the number of features, it becomes even better: for example, a user can say "I can see A and B but not C and D," and you end up in these smaller regions. The bottom line is that the more features you have, the more regions there are, and the smaller the area in which the user can be. Now, the problem is that this assumes perfect reporting: a user tells you "I see A and B and I don't see C and D." What can happen in reality is that the user really sees A and B, and might also see C but not report it - there can be many features in the room and the user simply doesn't report some of them. So the question is, if you don't have this perfect reporting, how does the thing work? I'm going to show you a brief example. Here is my favorite room. Assume a user says "I see specialty seafood" and "I see produce." Here you don't assume perfect reporting - you can't get the user to tell you exactly "A, B, C, D, and nothing else." But even with this not-so-fine-grained information, you can still end up with a fairly small region from only these few features: you can be here, here, here, or here. Okay?
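A minimal sketch of the region idea on a toy grid, assuming made-up isovists and coordinates that are not from the actual system: a user report of the form "I see A and B," with no claims about unseen features, narrows the possible location to the intersection of the reported features' isovists.

```python
# Minimal sketch: narrow down the user's location from reported features.
# Each feature's isovist is the set of grid cells it is visible from;
# a report "I see X and Y" restricts the user to the intersection of
# those isovists. With perfect reporting, unseen features' isovists
# could additionally be subtracted, shrinking the region further.
def candidate_region(isovists, seen):
    """isovists: feature name -> set of (x, y) cells it is visible from.
       seen: features the user reports seeing."""
    cells = None
    for feature in seen:
        cells = isovists[feature] if cells is None else cells & isovists[feature]
    return cells or set()

# Made-up isovists for a toy 4x4 floor plan.
isovists = {
    "seafood": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "produce": {(1, 1), (1, 2), (2, 1), (2, 2)},
    "olives":  {(2, 2), (2, 3), (3, 3)},
}
print(candidate_region(isovists, ["seafood", "produce"]))  # {(1, 1)}
```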
Based on this information you can then either ask the user for more information, or use dead reckoning or other techniques to reduce this space further, and so on. So, briefly, this is what the system looks like. You download a map from a given company's Web site - and this doesn't have to be a very nice map, just a coarse-grained map that can do the job - and we preprocess it at our server. You download this, and then based on some user feedback we can help you understand where you are. This is a work in progress, and what we are trying to do is add some additional user input, or use dead reckoning, to break ties. Basically, we have a very small sensor that we are going to put on a user in an attempt to understand how it ....
>>: I don't see how this helps you locate your chocolate milk. I mean, if I have a map that always says here's the chocolate milk and here are all the other important features of that sort, then I can just look at it myself. It's really not difficult and I can figure [inaudible].
>> Aleksandar Kuzmanovic: Sure.
>>: So the kind of utility that helps me locate myself in a store based on what I see right now - it doesn't help me find that chocolate milk.
>> Aleksandar Kuzmanovic: Sure, sure, sure. But you see, once you find the chocolate milk you can tag that map with that information and push it back to our Web site.
>>: Yeah, I know. I understand that the process of reporting what I see and adding information to [inaudible] map is useful, but I don't see how the location tool is useful.
>> Aleksandar Kuzmanovic: I'm sorry, what?
>>: The tool that you showed us to locate a human being in that supermarket based on what [inaudible].
>> Aleksandar Kuzmanovic: Once you know where you are and you know where the given target is, then you can guide the user, the client, and say, go left, go right.
>>: But I'm saying this is something the user can trivially do him or herself, because if I get to see a map, I can locate myself; I don't need the computer to do that for me.
>> Aleksandar Kuzmanovic: Where does the map come from?
>>: By mapping the features. I'm saying I don't need a computer to do that. Find it on a map and I can locate myself on it.
>> Aleksandar Kuzmanovic: Not necessarily - you might be able to do that, that's nice ....
>>: In a supermarket.
>> Aleksandar Kuzmanovic: In a supermarket, yes, but you can have a huge mall, you can have a number of different places where it may not be as trivial to do that.
>>: [inaudible].
>> Aleksandar Kuzmanovic: Right? So yeah. So let me conclude the talk. I gave you four different stories; I hope you survived them. The first topic I talked about is unconstrained endpoint profiling; the key point there is trying to harness information that is already available over the Web. As Ratul [phonetic] said, it's not perfect - for some applications it works better than for others - but it can be very helpful for traffic classification and for getting external information, and a lot of that information is available on the Web. Behavioral ad targeting: again, it was a legal issue that we used to come up with a research problem, and the bottom line, my argument, is that we need more fairness in this area.
Location-based services: we did a measurement study along these lines, and I think we have shown that there is strong potential to find insights that are not emphasized in this particular talk - namely, that there is a huge correlation between the types of applications that you access and the locations from which you access them. It is not uniform; there is a strong correlation among these things. And for indoor positioning, the bottom line is that we need a practical system - that was really the constraint here. How can we get there? We have built a system that I hope we can use in the near future to actually help people do this. I have other projects; I'm just going to mention some of them. Of course, there is a great collaboration with Microsoft on glitch-free Internet audio-video conferencing. We have found that if you want to talk to somebody at Microsoft, you'd better go and talk to that person in his office - that's the best way, you get the best quality. Otherwise, using voice over IP is not the best idea. No, I'm joking. I also have some collaboration with NSF and Google on net neutrality, and with Cisco, so I'm not going to talk about that. I'd be happy to talk more about it later, so thanks for being here and listening to the whole presentation.
>> Jin Li: Thanks, Alex.
>> Aleksandar Kuzmanovic: They're about to leave.
>> Jin Li: Alex is going to be here for the next two days, till Friday afternoon. If you're interested, feel free to meet him.
>>: Where is he staying?
>> Jin Li: 2975.
>> Aleksandar Kuzmanovic: Spend the summer there.
>>: [inaudible].
>>: [inaudible].
>>: [inaudible].
>> Aleksandar Kuzmanovic: How's it going? Thanks. Thanks for coming.