>> Scott Counts: All right. Let's go ahead and get started. I'm delighted to introduce today's speaker, Munmun de Choudhury, from Arizona State University. Many of you remember Munmun from the internship she did here last summer. Munmun is defending in just over a month, so she's just about at Ph.D. status. Munmun is, I guess you might say, in the vanguard of researchers working in an area called computational social science, which is basically a very interesting, contemporarily relevant blend of computer science, various statistical techniques, and today's big, big social media datasets. So we are just delighted to have her here this morning. So, Munmun. >> Munmun de Choudhury: Thank you, Scott. And good morning, everyone, and thank you for finding the time this morning and being here. The title of my talk today is going to be Information That Matters: Investigating Relevance of Entities in Social Media Networks. So before we get to this topic, let me start somewhere else, with what I do. It goes without saying, especially to the audience we have here, that social media and social networks are causing significant changes in our daily lives. With the advent of Web 2.0 technology we all know that we have multiple different ways to communicate and interact with each other over different modalities. For example, we can write on somebody's Facebook wall, we can vote on items on [inaudible], we can write comments and blog posts on LiveJournal, or even post and share content on Twitter. And this domain is definitely the area that I'm working in. The goal of my research so far has been to understand the dynamics and impact of these kinds of online social interactions that are present in social networks and social media. And this problem is interesting because, as I'm sure all of us would agree and as we are all aware, 140 characters can actually cause revolutions. We saw that during the elections in Iran back in 2009, and during the earthquake in Haiti last year. Therefore, with these kinds of observations that we gather from the current happenings around social media, I believe research in this direction can help us understand the sustainability of the culture that has begun to emerge in our digital society out of this kind of interaction process. It can also help us design, implement and apply the next generation of interactive social information systems we hope to see in the near future. And why should you care about it? Well, if we consider two of the major products of Microsoft, like Bing and Windows Live Mail, we often see the question: what is the set of people that we need to target in order to advertise a particular product, or what kinds of ads do we want to show to what kinds of people? Which means that we need to understand the interactions between the individuals on these two products that we are talking about. So I believe that my research can help drive viral marketing, advertising campaigns and so forth. The second example is if we consider a different set of Microsoft products, like Xbox Live or the Office suite or Visual Studio, where we are essentially interested in understanding how groups collaborate, how people come together for gaming, or how we can have efficient groups working on different complex projects. So we need to understand who talks to whom and how they talk to each other, which means that this research can help us understand collaborations.
The third example is if you consider Bing or Windows Phone: obviously the question that we often have these days is that, with so much data available, what to show and what not to show. So this research, by analyzing interactions between individuals, can help with better interface design. And finally, if we consider again Bing, Live Mail, Windows Phone and so forth, we'll see that with the scale of data available, we often ask the question: what is the right set of people, and what information could match with that set of people? Basically we're looking for information and people who could be relevant on a given topic or a given set of events. Basically we are looking at distributed social search, and therefore understanding the interactions between people can help us in solving these kinds of applications. But obviously, while all these applications are very interesting, the question is how do we analyze and model our interactions to actually address them? And this has been the focus of my research and my dissertation so far. My key idea is that our interaction manifests itself via three key aspects. To take an example, let Alice and Bob be two users of the Microsoft Xbox Live gaming software, and let them be exchanging messages over a certain communication modality. As a result of this interaction process, information is exchanged between them, and oftentimes this kind of interaction is also visible to their own set of friends, which is the notion of the network they're associated with. So the three key aspects of our interaction are essentially the information that is exchanged during the interaction process, the media or the channel via which the interaction takes place, and finally the social engagement or network that embodies this kind of interaction. So obviously we want to be able to understand these three key aspects if we want to make sense of our interaction process. But these aspects are often enveloped in different large-scale social processes, and therefore during my research and my dissertation I've begun to analyze these different social processes. For example, corresponding to the first aspect, information, we are interested to see how information diffuses between sets of individuals in a network. Corresponding to the aspect of social engagement, I've begun to understand how communities or groups evolve due to this interaction process. And finally, I have begun to understand how the characteristics of the media via which communication takes place evolve over time; for example, how do the themes evolve, how does the interestingness evolve, and so forth. However, we have a very serious problem before we can actually address these questions in an adequate and efficient manner. And that observation is that the social Web is really changing at a very fast rate. And what exactly is changing? Well, new people appear, new ties are formed, and new interaction data appears as well. To support this observation with some statistics, from what we have seen on blogs and the Huffington Post, last year Twitter was receiving as many as 600 million search queries a day, and that's a huge number, and we know that somewhere around the middle of last year Facebook reached half a billion users, which again is a huge number. So basically we have a lot of data. And this data is definitely interesting, because it helps us study all these different social processes, generalize them, and so forth. However, is there something more fundamental happening here than just the scale?
I believe so, and that is exactly going to be the focus of this talk today, where we are going to talk about sampling for information that matters to us. There are two simple questions that we are going to look at here. The first one is: how do we infer meaningful human networks from our interactions online? And secondly, how do we identify valuable social media content that is generated as a result of the interaction process? First we'll look at the first question. To give you some background, this is work that I did with Winter Mason, Jake Hofmann and Duncan Watts, and it started during my internship at Yahoo! Research in the summer of 2009. So the basic question that we address in this work is: how do we choose a relevant tie? Because we are interested in understanding how we can infer meaningful networks from our interactions, it is important to first understand which ties are relevant, where ties are relationships between people that have some social or psychological meaning associated with them. So a seemingly nice solution might be to say, hey, why not just go to the people and ask them which ties are relevant to you? Of course we can do that. But when we are talking of individuals or populations of users at the scale of millions, say, on Twitter or Facebook, can we actually do that? No, right? So obviously that is why we are solving this problem, and the question we ask is: is there a principled way that helps us find those relevant ties from a set of observed interactions over a period of time? So let us do a small exercise over here, and let me ask you how frequently you talk to your best friends. Anyone? >>: Few times a week. >>: Daily. >>: Couple times. >> Munmun de Choudhury: Couple times a week. Okay. I'm sure there will be some individuals who are like -- like I personally talk to my best friends -- even though I consider them as friends -- I talk to them like twice a month. So definitely, assuming that the relationship with a best friend is actually a relevant tie, there could be very different frequencies of communication or interaction associated with that relevant tie. But what is a good measure of communication that can help us infer that tie to be relevant? Well, that's actually the problem that we are tackling in this paper. To put it in the context of previous work, similar problems have been faced in the past in typical social network research, too, and the way of inferring social ties from interaction data has actually been investigated. And there are actually many reasonable definitions. We could say, hey, you know, we think there's going to be a tie between two individuals when there's at least one communication between them in the past year. We could be a little stricter and say, you know, an average of one communication per week. Or we could go stricter still and say that we are going to think there's a tie between two individuals when there is one reciprocated communication in the last month. Of course, all of these sound reasonable, right? They can be of interest for different kinds of research questions; for example, the first one could be of interest when we are looking for information or people in a network.
The second one, where we are trying to focus on relevant ties at short time scales, could be interesting for understanding problems which are highly temporally volatile in nature, such as diffusion of information; and, thirdly, the reciprocated communication definition could be interesting when we are trying to understand how people come together in groups or other kinds of homophily-based hidden properties. But this doesn't solve our problem. What we're saying is that these are more or less ad hoc, and the question that we're asking is whether there is a more principled way, given observed communication and given a network, to say that these are the relevant ties. So the method that we propose in this work says that to find relevant ties we'll define a minimum threshold on the frequency of communication between individuals. What that means is, let us take an example. So these are, say, four Windows Phone users, and the edges that you see represent the direction of communication, and their weights represent how frequently they communicate. Now we want to infer a network, in this diagram that we see, where we are going to eliminate certain ties which we think are not relevant. We're going to do so by choosing a minimum threshold on the frequency of communication. So let our threshold be this red line of a certain thickness. What we're going to do is eliminate all the edges in this unthresholded network that have weights less than this threshold. We are saying, okay, there has to be a minimum of this much communication for a tie to be relevant. So this leaves us with this set of three relevant edges in the network. We can, of course, use a different threshold with a slightly higher value, and we are left with only two relevant ties in the network. So the observation is that the network itself is undergoing a lot of change structurally because we are defining different values of the threshold. To support this observation empirically, what we did was we focused on two different e-mail datasets and we constructed networks based on who talks to whom. And from the left side of the screen to the right side, we kept increasing the threshold, which gave us fewer and fewer relevant ties between the people in the network. And the observation that we have is that as we go from left to right, the network becomes more and more sparse; in a sense, the network undergoes a lot of structural change. So it means that it's important to define a relevant tie in an optimal manner, so that we pick among this ensemble the network which is the most meaningful. Our goal is therefore to infer networks for various definitions of this threshold and to study the impact of these thresholded networks on different structural properties of the overall networks, such as descriptive statistics like clustering coefficient or [inaudible], and also the ability of these networks to predict certain node characteristics, which justifies the utility of these networks in different tasks. And what we're interested in is to gather some insights on what kinds of thresholds can be considered optimal in defining ties as relevant.
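To make this thresholding step concrete, here is a minimal sketch, not from the talk itself, of eliminating edges below a minimum communication rate and watching the network structure change as the threshold is swept; the networkx dependency, the toy graph, and the attribute name "rate" are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the talk's code): keep only edges whose
# communication rate meets a minimum threshold, then sweep the threshold and watch the
# structure of the network change.
import networkx as nx

def threshold_network(g: nx.Graph, min_rate: float) -> nx.Graph:
    """Keep only edges whose 'rate' (e.g., e-mails per year) is at least min_rate."""
    kept = [(u, v) for u, v, d in g.edges(data=True) if d["rate"] >= min_rate]
    return g.edge_subgraph(kept).copy() if kept else nx.Graph()

# Toy communication graph; edge 'rate' stands in for annual e-mail exchange frequency.
g = nx.Graph()
g.add_edge("alice", "bob", rate=12.0)
g.add_edge("bob", "carol", rate=2.0)
g.add_edge("carol", "dave", rate=7.0)
g.add_edge("dave", "alice", rate=20.0)

for tau in (1.0, 5.0, 10.0, 15.0):
    sub = threshold_network(g, tau)
    n_comp = nx.number_connected_components(sub) if sub.number_of_nodes() else 0
    print(f"threshold={tau:>4}: relevant ties={sub.number_of_edges()}  components={n_comp}")
```

Sweeping the threshold in this way produces the "family of networks" idea: the same raw interaction data yields very different structures depending on how strict the relevance criterion is.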
So just to give a brief idea of the datasets that we used in this work: the first one is a university e-mail dataset, which is a registry of e-mails between students, faculty and staff at a large university in the United States. It has about a million e-mails over a two-year period. And Enron, of course, is publicly available; it's a repository of e-mails, again between the employees at the former Enron Corporation, with about the same number of e-mails and about 5,000 individuals. So first let us discuss how we constructed these thresholded networks. Before we go there, what we did was to start with an unthresholded network, which is similar to our Windows Phone example: we have users, and who mails whom. This kind of network had symmetric edges, meaning there should be at least one communication back and forth, and the weight of such an edge was the geometric mean of the rates of interaction over a period of time between the two individuals in the network. Next, we wanted to eliminate certain edges in that network to infer which ties are relevant, and therefore we chose a threshold. The definition of the threshold here was a positive real number, and it was equal to the annual rate of exchange of e-mails between individuals over a period of time. And we linearly chose different values of this threshold over a scale to construct a family of networks. So to take examples, for a threshold of five e-mails a year we have the graph that we see on the top, and then for a slightly higher threshold we have a sparser network. We're going to see how, for this set of networks, the descriptive statistics of the network change. First of all, the global network features. What we see is that the number of connected components for these networks, defined over the different thresholds, undergoes a significant change, where the X axis shows the different values of the threshold and the Y axis shows the number of components. A similar observation can be made about the relative sizes of these connected components, where we see that there's a sharp decrease, because the graph gets sparser and we are getting a lot of small connected components. Similar observations can be made about the local network features, where, when we consider reach, closure and bridging measures for each node, we see that all of these undergo significant changes, almost in an exponential manner, as we increase the threshold. What it implies is that when we are choosing different definitions to pick that relevant set of ties, the networks that we are getting are totally different. Of course, it doesn't answer the question: then what is the right threshold? And this is what we're going to see right now. So what we do here is that our hypothesis is that a network is going to be relevant when it has relevant ties, of course, and when those ties can actually help us solve a certain task. So we considered a series of prediction tasks, and our goal was to see which of these different networks gives us the best prediction accuracy or is most useful in solving a certain research question. We considered three different prediction tasks, and we'll go over them briefly. The first prediction task was to predict the node status or gender, where status means, in the case of the university e-mail, whether a person is faculty, staff or a student. What we did was we developed feature representations of each of these nodes in each of these thresholded networks, where the different features were structural features such as clustering coefficient, embeddedness, node degree, two-hop neighborhood and so forth. Given this feature representation, we trained a support vector machine classifier with a Gaussian kernel.
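A minimal sketch of this kind of prediction setup, structural node features fed to an RBF-kernel SVM, follows; the synthetic graph, the particular feature set, and the networkx/scikit-learn calls are my own illustrative assumptions rather than the paper's code.

```python
# Sketch (illustrative assumptions): represent each node by simple structural features
# and train an RBF (Gaussian) kernel SVM to predict a node attribute such as status.
import networkx as nx
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def node_features(g: nx.Graph) -> np.ndarray:
    """Per-node structural features: degree, clustering coefficient, 2-hop reach."""
    clust = nx.clustering(g)
    rows = []
    for n in g.nodes():
        two_hop = set(nx.single_source_shortest_path_length(g, n, cutoff=2)) - {n}
        rows.append([g.degree(n), clust[n], len(two_hop)])
    return np.array(rows, dtype=float)

# Synthetic two-group graph with different communication densities, so structural
# features are informative; the group index plays the role of a "status" label.
sizes = [50, 50]
p = [[0.25, 0.02], [0.02, 0.06]]
g = nx.stochastic_block_model(sizes, p, seed=0)

y = np.empty(len(g), dtype=int)
for group, nodes in enumerate(g.graph["partition"]):
    for n in nodes:
        y[n] = group

X = node_features(g)
clf = SVC(kernel="rbf", gamma="scale")          # RBF (Gaussian) kernel SVM
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```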
We performed testing where we learned the support vectors and the kernel width to be able to predict the node status and gender for these datasets. The next prediction task was to predict future communication, where we again used the same feature representation of each node, and we trained a linear model using the past communication activity of each node, which is the number of e-mails that every person sends, and we learned the best-fit coefficients to predict the future communication at the next time slice. And the next prediction task was detecting communities. What we did was we fit a stochastic block model, where we wanted to find soft assignments of the different individuals in the network to schools -- in the university dataset each person belongs to a certain school -- and then we compared these assignments, through a mutual information metric, with the actual school assignments. So now let us take a look at the results of these prediction tasks. So this is with the university e-mail, and these are the four different prediction tasks. The first thing that comes to mind from these results is that they kind of have the same pattern, which is that the accuracy, shown on the Y axis, actually peaks at a nonobvious value and not on the unthresholded graph that we have on the extreme left side. You would usually expect that it's the unthresholded graph that performs the best, because it has the most instances of communication available. However, that doesn't seem to be the case. In fact, accuracy peaks around the same range, which is five to 15 e-mails over a year. The same observation can be made for the Enron e-mail, where we have the same pattern; we see as much as a 20 percent boost over the unthresholded network on the extreme left. So what are the observations? Clearly the accuracy seems to peak at a nonobvious point, and the increase that we see at that point, or at that threshold, is as much as 30 percent over the unthresholded graph. Our conjecture for this observation is that when we are looking at the unthresholded graph we have those ties, or those instances of communication, which are actually noisy and therefore probably not reflective of actual ties between people. So as we tune the threshold upward we're actually getting rid of that noise and increasing the signal, which leads to the increase in accuracy. So this is kind of explanatory, but another very interesting artifact we observed is that the optimal threshold seemed to be limited to a certain range, which is five to ten e-mails a year, and it was kind of consistent across the different prediction tasks that we saw and the different datasets that we looked at. So we tried to think about why that happened. And it seems that if you think about the different prediction tasks that we considered, they kind of belong to the same equivalence class -- like detecting communities or predicting other node attributes, they're essentially reflective of homophily or similarity of users to one another. Maybe for this equivalence class of problems there's actually this optimal range of thresholds in which we can find the relevant ties.
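Returning briefly to the community-detection task mentioned above, here is a small sketch of comparing an inferred partition against known ground-truth groups (the "schools") with normalized mutual information. For brevity it uses modularity-based community detection in place of the stochastic block model fit described in the talk, and the synthetic graph and library calls (networkx, scikit-learn) are illustrative assumptions.

```python
# Sketch (illustrative, not the talk's code): score inferred communities against
# ground-truth groups using normalized mutual information (NMI).
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.metrics import normalized_mutual_info_score

# Synthetic graph with three planted groups standing in for "schools".
sizes = [40, 40, 40]
p = [[0.20, 0.02, 0.02],
     [0.02, 0.20, 0.02],
     [0.02, 0.02, 0.20]]
g = nx.stochastic_block_model(sizes, p, seed=7)

# Ground-truth group label per node, read off the generator's partition.
true = np.empty(len(g), dtype=int)
for group, nodes in enumerate(g.graph["partition"]):
    for n in nodes:
        true[n] = group

# Infer communities and turn the partition into one label per node.
communities = greedy_modularity_communities(g)
pred = np.empty(len(g), dtype=int)
for label, comm in enumerate(communities):
    for n in comm:
        pred[n] = label

print("NMI vs. ground truth:", normalized_mutual_info_score(true, pred))
```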
And for the datasets we looked at, both were e-mail datasets and both were from the early 2000s. So maybe it was the habitual nature of how people sent e-mail, and both of them were organizational e-mail corpora, so maybe that's the similarity, and that's why we saw the consistency in the optimal threshold. In the future, of course, it will be interesting to see how this generalizes. So the conclusion is that social network research so far has more or less always looked at some instance of interaction data, and what people do is construct edges based on these instances of interaction. However, this method seems to be ad hoc, because we don't really know which ties are actually the relevant ones and which ties are not. So we tried to address a narrow version of this problem in this work, where we have data-mined an optimal threshold condition, at least for a range of e-mail networks, that tells us how we can find the relevant ties in the network. Surprisingly enough, this optimal range seems to be consistent across the different datasets and also across the tasks that we considered. There are some open questions that we would like to address in this research in the future. What we have done here is to have a separate model for finding the relevant ties and then performing the prediction tasks. So is there a way that we can actually learn this threshold condition as a model parameter inside the prediction task itself? The second question is: we know that there's a way to data-mine this set of relevant ties for a set of known features. Does that set of relevant ties, does that network representation, hold equally well when we test it on a set of unknown features? That will be something very interesting to investigate in the future. Now, question two. This is research that I did when I was over here last summer with Scott Counts and Mary Czerwinski. Let us begin with our motivation. I'm sure all of us more or less know about Twitter, or we are on Twitter and we write our updates and so forth. And not only does Twitter allow us to keep others posted about what we are doing, but it has also started to emerge as a news medium, disseminating information about news and timely happenings. However, there is so much data being generated that the end user often asks the question: how do I find the right content? I just have an information overload problem. So this is our motivation, and we are trying to ask the question: how do we infer the most relevant or best set of items on a certain topic from these millions or billions of pieces of information content? If you think about this problem of finding relevant content, it can actually be mapped to the problem of sampling information given a certain signal. So let us contrast our question with a familiar example. I'm sure all of us are familiar with the famous Nyquist sampling theorem, which says that if you have a signal with no frequencies above a certain maximum frequency, then by sampling at a rate of at least twice that frequency we can actually reconstruct the original signal. Essentially it's lossless sampling. So it seems that for this genre of signals, which could be images, videos, the data we get from webcams and surveillance cameras, they have a discrete, regular and fixed sampling lattice, and the time to sample each pixel, if it's an image, is almost constant. However, in our case, when we are talking of social media content, that might not be the case.
Because Web activity doesn't have a notion of bandwidth. Also, what we did was to look at the characteristics of this space, the social media space, and what we found was that it is characterized by a wide range of attributes or dimensions, such as geography, who writes the content, what time they are writing, and so forth. So essentially the social media content space is diverse. Then we looked at the state of the art: does it actually address this diversity issue for the user? We did a small survey and we asked users what tools they use to find relevant content, and the results are up on the slide. It seems that most of these -- actually, all of these -- are tools that help you browse information based on a fixed attribute or a single dimension. The Twitter website lets you browse in reverse chronological order; Bing Social gives you the URLs which are highly shared in the network. But then, if I'm interested in focusing on other attributes, how do I do that? So that's the challenge with these tools, and also the motivation for this work. So one of the characteristics of social media is its high dimensionality, and the different dimensions we considered are geography, the social graph, whether there's a URL, the theme distribution and so forth. We call this property information diversity. And how do we quantify this diversity? For that we came up with a conceptual measure which is called the diversity spectrum. What it does is represent the social media content space with an information theoretic measure called entropy. On this spectrum, on the extreme left side, you'll have content that is highly homogeneous, where the diversity is nearly zero, and on the right side you'll have content that is extremely heterogeneous, where the diversity is close to one. Obviously this kind of formulation also makes sense from an information theoretic perspective, where the entropy measure has often been used to represent the mixing, or the relative representation, of different attributes in a given sample of a population. The other observation in this work, when we are searching for relevant content, is that when we generate these samples of relevant content they are being consumed by the end user, and therefore our sampling process needs to benefit from mechanisms of human cognition, because it is the user who is going to judge an item to be relevant or not relevant. So what we did was estimate the goodness of such a sample based on measures of human information processing. We considered a set of cognitive metrics that would help us judge whether a sample that we generate is good or not, and some of them were engagement, encoding into long-term memory, interestingness and informativeness. We'll come back to these in a little bit. In order to find a sample from this vast space of social media content, the two important steps in our research have been: given this large set of dimensions, how do we find the relative significance of each of them; and then, given this kind of dimensional representation, how do we sample content that could match a certain desired level of diversity in the information space?
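As a concrete illustration of the diversity spectrum idea, here is a minimal sketch of a normalized entropy measure over one attribute of a sample: near zero for homogeneous content and near one for highly heterogeneous content. The attribute values and samples are made up for the example, and this single-attribute form is a simplification of the weighted, multi-dimensional measure used in the work.

```python
# Sketch (toy illustration): normalized Shannon entropy of one attribute's distribution
# within a sample of items, as a stand-in for the "diversity spectrum".
import math
from collections import Counter

def normalized_entropy(values) -> float:
    counts = Counter(values)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))          # normalize to [0, 1]

homogeneous = ["US"] * 10                                   # everyone from one place
heterogeneous = ["US", "IN", "BR", "UK", "JP"] * 2          # evenly mixed geographies
print(normalized_entropy(homogeneous))    # -> 0.0
print(normalized_entropy(heterogeneous))  # -> 1.0
```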
First of all, how do we find the significance of the dimensions? What we did was we sought feedback through a survey of 11 active Twitter users, and we asked them which dimensions they found to be important while browsing content, and asked them to rate their importance on a scale of 1 through 7. And this is what we get. Not so surprisingly, certain dimensions seem to be more important than others. For example, posting time was very important, whereas the number of friends of the user who wrote the content, which could be a tweet, was rated as much less important. Then we move on to: given this kind of weighted representation of the dimensions of the social media space, how do we find a sample? Let me provide a little bit of motivation here. Our motivation comes from the signal processing literature, from an area which is called compressive sensing, or compressive sampling. What it says is that if we have a signal that is high dimensional in nature and has very few nonzero entries, that is, it is sparse, then we can essentially represent it as a linear combination of a very small number of basis components. Borrowing that idea in our context, what we observe is that the social media content space is also very high dimensional, and it is sparse, too; because if you consider the Twitter information space, out of all the tweets probably very few of them are retweets, or probably very few have URLs in them. So it is sparse. So maybe we can represent the social media information space through this linear combination of a small number of basis components, or measurements, which would greatly help us prune down the information space. To understand it in a more formal and visual manner: let our signal, which could be the social media information space like Twitter, be given by X, which is of size N with real-valued entries. We are interested in finding a Y of size M, where M is much smaller than N, through some linear measurements that are performed over X. To represent it visually, this is our actual data; it has only K nonzero entries, that is, it is sparse. And we are performing a transform Phi, which is a two-dimensional matrix of size M by N, to get a pruned information space Y which has M measurements, where M is much smaller than N. What this gives us is that it helps us prune down the information space to a great extent. So essentially what we need to find here is: how do we find this transformation Phi? For that we used the popular wavelet transform called the Haar wavelet. To represent it in a diagram, we have the N-dimensional data, we perform a compression using the wavelet transform, and we get a pruned information space which has K nonzero entries. So now we have a pruned information space; it's computationally more efficient. So how do we construct the actual sample, which is of a certain size and has a certain level of diversity? We proposed an iterative clustering framework for that purpose, which tries to minimize the distortion of entropy between the sample that we generate and the desired entropy level. What we do is we start with an empty sample and we pick any information unit, or tweet, from the space at random. We keep iteratively adding tweets to the sample, making sure at every point that the normalized entropy of that sample is as close as possible to the entropy level that we want to have.
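Here is a minimal sketch of that kind of greedy, entropy-matching selection, assuming a single categorical attribute in place of the full weighted, multi-dimensional representation; the toy corpus, the parameter name omega for the desired diversity level, and the helper functions are illustrative assumptions rather than the paper's implementation.

```python
# Sketch (assumed simplification): starting from a random seed item, repeatedly add the
# item whose inclusion keeps the sample's normalized entropy closest to a desired
# diversity level omega, until the sample reaches the requested size.
import math
import random
from collections import Counter

def normalized_entropy(values):
    """Same normalized-entropy helper as in the earlier sketch."""
    counts = Counter(values)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))

def greedy_diverse_sample(items, attr, omega, size, seed=None):
    rng = random.Random(seed)
    pool = list(items)
    sample = [pool.pop(rng.randrange(len(pool)))]          # random seed item
    while len(sample) < size and pool:
        # Pick the item minimizing |H(sample + item) - omega| (L1 distortion).
        best = min(pool, key=lambda it: abs(
            normalized_entropy([attr(x) for x in sample + [it]]) - omega))
        pool.remove(best)
        sample.append(best)
    return sample

tweets = [{"id": i, "topic": t} for i, t in enumerate("aaaaabbbccd")]   # toy corpus
picked = greedy_diverse_sample(tweets, attr=lambda t: t["topic"], omega=0.9, size=5, seed=42)
print([t["topic"] for t in picked],
      normalized_entropy([t["topic"] for t in picked]))
```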
>>: What feature space are you working in here -- what feature space are you applying the Haar wavelets to? >> Munmun de Choudhury: As we saw earlier, the social media space has high-dimensional attributes: dimensions like geography, presence of a URL, whether it is a retweet or not, the thematic distribution, and so on. >>: So you came up with a set -- >> Munmun de Choudhury: We came up with our features. >>: How big was the feature space? >> Munmun de Choudhury: It was around 20. >>: 20. Okay. >> Munmun de Choudhury: Yeah. So to put this iterative framework in a more formal manner, we essentially have the following objective function, where our goal is to minimize the distortion, in terms of the L1 norm, between the normalized entropy H after adding the i-th tweet and the desired entropy level omega. And we continue this greedy approach until we reach the desired sample size. So now we have a method which sounds promising. Obviously the question you might be asking is: how does it compare to other possible sampling techniques, or to methods that are actually used in practice? For that, we performed some experiments using the full firehose data from Twitter over the month of June last year, a little bit over a billion tweets. So, first of all, let us see what the different other sampling techniques are that we compared our method to. We constructed a series of different sampling techniques based on variations of our sampling algorithm, which comprises three key components: whether or not it uses the transform, whether or not it uses entropy minimization to fit a desired level of entropy or diversity, and also the weighting of the different dimensions. So, to take an example of the different baseline techniques: our proposed method is the one in the bold line; we have used the wavelet transform, we've used the entropy minimization, and also the weighting. And, for example, baseline 3 does use the entropy minimization but doesn't use the wavelet transform or the weighting. We also had two other methods, which we called most recent tweets, which is kind of how Twitter shows you results on a topic, and also most tweeted URL, the idea being that we're going to show the tweets that have the most shared URLs in them. So, some quantitative evaluation comparing our method with these baselines. What these plots show is, given a certain diversity level -- which is 0.1, 0.3 and 0.6 over here -- how well does the entropy of the samples generated by these different methods match the desired level of diversity? The desired level of diversity is represented by the dotted red line in all of these, and over different sample sizes, shown on the X axis, we are trying to see what the entropy of the corresponding sample is for each of these six methods. The prime observation, or the take-away here, is that our method, which again is the bold line, is the closest, having the minimum distortion with respect to the desired entropy level; and, to take an example, baseline three, which doesn't use the wavelet transform or the weighting, seems to be much farther from the desired entropy level. So it seems that compared to all these other variations of the sampling technique, our method performs the best in generating samples of different sizes that are very close to the diversity that we are wanting.
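One of the components the baselines above toggle is the wavelet transform. As a toy illustration of that compression step, here is a one-level Haar wavelet sketch followed by keeping only the K largest-magnitude coefficients, which is the "compress, then work in a pruned, sparse space" idea; this is my own illustration rather than the paper's implementation, and the synthetic signal and the value of K are assumptions.

```python
# Sketch (toy illustration): one level of the Haar wavelet transform on a feature
# vector, then prune to the K largest-magnitude coefficients.
import numpy as np

def haar_step(x: np.ndarray) -> np.ndarray:
    """One Haar level: pairwise averages followed by pairwise differences."""
    x = x.reshape(-1, 2)
    avg = (x[:, 0] + x[:, 1]) / np.sqrt(2)
    diff = (x[:, 0] - x[:, 1]) / np.sqrt(2)
    return np.concatenate([avg, diff])

def keep_top_k(coeffs: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything except the k largest-magnitude coefficients."""
    out = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-k:]
    out[idx] = coeffs[idx]
    return out

rng = np.random.default_rng(0)
signal = np.repeat(rng.normal(size=8), 2)      # length-16 signal with pairwise redundancy
coeffs = haar_step(signal)
pruned = keep_top_k(coeffs, k=4)
print("non-zero coefficients after pruning:", np.count_nonzero(pruned))
```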
So we also tested the robustness of our proposed algorithm. Note that one of the characteristics of our algorithm is that it's a greedy approach, and it starts with a random seed. Of course, there could be other ways of picking that seed; for example, if we know what kind of tweets we definitely want to include, we can start from a tweet having that sort of attributes. But we wanted to see what happens when we choose the seed at random. What we did was check the overlap in the content of the samples generated across multiple iterations when we choose different seeds. We notice that for two different topics, one of them iPhone, and across different levels of diversity and sample size, we really have high overlap across samples. Now this is really interesting, because it shows that no matter where in the information space we seed our sampling from, we essentially reach a suboptimal sample which seems to be consistent across multiple iterations. So we conjecture that this probably reflects that the social media information space has a certain regularity to it, because of which, when we are trying to find tweets matching a certain diversity, we essentially reach out to those structural regularities in the space, and because of that we have high overlap. However, quantifying that regularity is definitely of interest for our future work. Okay. So going back to one of the observations that -- >>: I have a question. How deep are the samples in sets like that? Like, how many tweets were in each starting set before you did the sampling, for those two? >> Munmun de Choudhury: There was a 50 to 60 percent reduction after the pruning, because we had the -- >>: How big was the total number? >> Munmun de Choudhury: It was a billion tweets in all, but it was filtered by topic, so it was several thousands. >>: Several thousands. >> Munmun de Choudhury: Okay. So going back to the observation we had at the start, which is that when we are generating these samples of relevant content, it is the end user who is consuming them. So obviously we need to find out how they think about these samples that we are generating using our method and also the baseline techniques. What we did was come up with four cognitive metrics, two of which were explicit measures -- they estimated the sample quality based on interestingness and informativeness -- and two implicit measures. The first one is subjective duration assessment, which is the idea that when we show a sample to a user, if the user underestimates the time taken to go through it, it means that probably he or she found it really engaging, and therefore they underestimate the time they thought they took to go through it. We call this measure the normalized perceived duration. And the second one is recognition memory for the tweets, which is the idea of how well the information that a user sees in the sample gets encoded in long-term memory -- do the users remember the information they saw in the sample? To evaluate our proposed method based on these four measures we performed a user study. Our user study sought feedback from 67 participants, of which around 65 to 70 percent were men and the rest were women, and the age range was 19 to 47 years, so that's a good mix. It was a Web-based application, and what we did was show 12 different tweet samples that were generated by these different methods, over different diversity parameter values. And corresponding to each sample, we asked them to estimate the length of time they thought they took to go through it, the interestingness on a Likert scale, and the diversity and informativeness.
At the end of these 12 samples, we also showed them a set of 72 tweets, of which 36 came from the samples they had already seen and the rest were on the same topic but had not actually been shown in any of the samples. And we asked them whether or not they had seen a particular tweet in any of the samples, through a yes/no question. The idea here was that if the user found the tweets in these samples worthy of being remembered, they are going to reply correctly here. The experimental design was between subjects for the topics and within subjects for the different methods and also the diversity levels. So now let us take a look at some of the results of this user study based on the four cognitive measures. This is what we get. The Y axis shows the normalized or raw user ratings for each of these measures, whereas the X axis shows the different methods. What we see is that our method performs the best among all of them, and this was found to be statistically significant at least with respect to baseline one, baseline two, and the most recent method. So as you'll see here, for informativeness, recognition memory and interestingness we actually have the highest rating. And also for the subjective duration assessment, which is the normalized perceived duration measure, we see that it was the least negative, which shows that the users found the content, or the samples generated by our method, to be relatively more engaging compared to the other ones. So this definitely shows that even on a qualitative scale users found the samples generated by our method to be the best. >>: Were there significant differences? >> Munmun de Choudhury: The differences were significant with respect to baseline one, baseline two, and most recent. We didn't see any significance for baseline three and MTU, which is the most tweeted URL method. I have the numbers; I can show you. Cool. Okay. So again, one of the observations when we started this problem was that diversity makes a difference to the sampling process. So now we ask the question: are users actually able to discern those diversity levels or not? We actually found support in favor of that. What the slide reports is one explicit measure, interestingness, and one implicit measure, recognition memory, for the two topics, and we show the participant ratings separately for the three diversity levels that were considered in the user study. And interestingly, if you see, for most if not all of the cases the ratings were actually better for diversity 0.1 or 0.9, which means that users found information to be more relevant when it was highly homogeneous or when it was highly heterogeneous. Now that's something that's really interesting. On one hand it means that users were actually able to discern the differences in diversity. But it also shows that the change in diversity is not monotonically related to the perception of relevance by the users. In fact, it shows this kind of nonmonotonic structure, where information at the ends of the diversity spectrum is found to be more relevant compared to information in the middle. So, the conclusions. Our primary motivation for this work was that sampling is a very important problem, because we are essentially faced with gigantic amounts of data on social media today. And because this information space is very diverse, it is important that the diversity measure is incorporated inside the sampling framework itself.
And we validated the goodness of our samples through a user study, where we found that when we evaluate those samples based on cognitive measures of human information consumption, our proposed method actually generates the most relevant content compared to some baselines as well as to stronger versions of state-of-the-art techniques. However, there are some open questions which we would like to address in future work. The first one: in the last experimental slide, what we observed was that users found content to be more relevant when it was highly homogeneous or when it was highly heterogeneous. So this leads us to the question: are there empirical bounds on what kinds of diversity are more suited for content consumption than others? The second question is related to the regularity of the space that we observed, when we saw information to be highly overlapping despite random starts of the algorithm. Does that mean that the social media information space has some sort of signature in an information theoretic sense? And if there are such signatures indeed, then how can the sampling method itself benefit from these signatures? It would be interesting to see that in future work. Okay. So we come to the end. We all agree here, or at least observe, that social networks are causing significant changes in our daily lives. And the inferences that we make about different social phenomena -- like how groups form, how information flows in the network, or how the characteristics of the media change -- are affected by the quality of the data, or the relevance of the data, that we are looking at. And also, because we are interested in streamlining the user experience in order to present them with content, that experience is affected by the relevance of the data that we present. Both of these matter critically. Let us take a look at some of the future research directions I would be interested to work on. In the short term, I'm interested in the evolving area of what is known as social media marketing. Some example questions in this domain, which could also be of interest to, say, Bing or Windows Live Mail, would be to understand which entities, which could be information or people, we would actually like to tap in a network so as to optimize an economic, technological or cultural goal; and also, using the structure of these social networks and media, how do we drive computational advertising -- which ads to show, where to show the ad, and so forth on the Web. My second direction is the study of macroscopic dynamics based on microscopic interactions. So a question that has always struck me is: how does pop culture evolve on these social systems? For example, do we actually know how RT became the accepted notation for a retweet -- for copying someone's information and retransmitting it on Twitter? I would like to investigate how these kinds of accepted norms emerge from our interactions. The second question is: we always note, at least at a macroscopic level, that there is tremendous order in all of these networks -- this also relates to the work on fractal theory and so on. But we all know that at the microscopic level it's extremely noisy [inaudible] and there's a lot of spam. How do we go from noise to order?
In the long term, I'm interested in understanding, in a more systematic manner, the properties of these attribute-rich, petabyte-scale information spaces on the Web -- topics such as how we perform social computing when we move to the cloud and have all our data on the cloud, and so forth. I'll be interested in developing [inaudible] for these large information spaces that help us compare and generalize across different kinds of data. And finally, I would be interested in developing a comprehensive theory of online communication, something that exists for face-to-face communication but doesn't quite exist for the communication that typically unfolds online or over the cloud. So with that, I come to an end. I would like to acknowledge my advisor Hari Sundaram, and also my collaborators at Avaya Research, Yahoo! Research and Microsoft Research -- Duncan, Winter, Jake, Scott and Mary -- and also all my lab mates and colleagues at ASU and during my internships. Thank you for being here until the end, and I think I'm ready for questions. [applause] >>: I have one easy one, I think. I'm not a social networks guy, but my understanding, or at least what I know, is that a lot of these kinds of networks follow a power law, to some degree. So how does the power law come into play here? For example, when you were looking for a threshold, what if I had taken the 20 percent heaviest edges and just used those? How would the accuracy then compare to the accuracy at your optimal threshold? And then the second part: when you were looking at the diversity of the data, what if I took the 20 percent most focused items and compared them to the 20 percent most diverse items, and compared both to the middle -- what is your feeling, how would that look? >> Munmun de Choudhury: This actually relates to the slide where we had the overlap, so I will answer the second question first. Let me see if I can go back. So what we noted was that it wasn't flat; as in, you'll see that the overlap was higher over here and over here compared to somewhere in the middle. So what I think you're asking is what happens when we consider only data which is like -- >>: Point 1 and point 8. >> Munmun de Choudhury: It's not monotonically increasing or decreasing. Like I said, this is something we'd like to do in future research, but it seems that the number of solutions that might fit the criteria for relevance could be much smaller when the content is either not very diverse or highly diverse, compared to cases in the middle. So because we have lots of solutions in the middle, we have low overlaps there. So I think that's how -- did I answer that question? >>: What I'm just trying to figure out is, power laws are like the 80/20 rule, sort of. How much more intuition do we get from this than if I had just taken the top 20 percent -- would I get a sweet spot there too? >> Munmun de Choudhury: I think we would see the same set of patterns, although the results might vary, and the reason is that this empirical observation is more a property of the dimensions, or of the sparsity of the space, than of the actual values along those dimensions, which is what you're describing: when we look at the head of the distribution, the values of the features would probably be very high -- for example, the frequency of communication would be very high and so forth.
So I think it's more about the structure of the sparsity pattern than about the actual values of the features. So I think you'd still see the same pattern even when we look at the head. >>: I think another way potentially to look at it is that as you squeeze the entropy to a really small value, or a really large value, you're effectively putting a lot of constraint on the data. You're saying, I want to find the stuff that's really far apart from each other, or find stuff that's really consistent, and there are probably not that many ways to do that in a given set. So the point 3 case may be so low -- the kind of U shape you might see is an interesting phenomenon, where you try to squeeze the entropy very low or very high. >> Munmun de Choudhury: That's what I was going to say with the solutions part. And actually during the summer we discussed this extensively. There seem to be fewer solutions at the ends of the spectrum, which is what you said. >>: Actually, I had one comment. When you look at multi-document summarization -- excuse my voice -- in a sense, when you create summaries from, let's say, 50 different stories about a topic, you split them all into sentences, so you have a gigantic bag of sentences, you have a limit on the summary, and you have to choose your sentences. The best summaries are really the ones where you have diversity in the limited space you have, so high entropy, which gives you as much information about the story and the space as possible, or you get something that's -- >>: Very thin points. >>: You get like the main facts. The worst summaries are the ones where you have a mix of some new information and repeated information. I think that's very analogous to the sampling here. >> Munmun de Choudhury: Yes, totally. >>: The behavior where the perceived relevance was higher when it was very diverse or very nondiverse -- was it consistent across people or queries, or were some topics more consistent than others? >> Munmun de Choudhury: It's limited to the two topics we looked at, Windows Live and iPhone, because we can only get so much feedback from the users in the user study, so we focused on two topics. It's also limited in the sense that it's limited by the 67 participants we looked at. But because it has a good mix of genders and it's over a wide range of different age groups, I guess it's pretty generic. >>: So it's not that some people prefer very diverse and some people prefer it very -- >> Munmun de Choudhury: Yeah, it seems like that, but we, of course, don't have the numbers on who prefers diverse content and who prefers homogeneous content; it just seems that people like one of the two. >>: So when I think about this down-sampling technique, when I think about applying it to sort of news-like tweets -- so I'm reading about the Haitian earthquake or something -- it makes a lot of sense to me. But when I think about applying it to tweets that are closer to me in my social network, then I get concerned that anomalies might be the more interesting thing. For example, maybe my friend consistently tweets about the weather or something stupid, but then suddenly tweets "I'm pregnant." I don't want that one filtered out. So it's unclear to me how this algorithm is going to avoid filtering that one out, because it's an anomaly; it's just a complete one-off. >> Munmun de Choudhury: Well, in some sense it's taken care of if you are ready to have diverse information, because this aspect will probably be covered by the thematic distribution of the tweets.
So you might end up having tweets from different angles, not just all the other tweets. But as such, our down-sampling technique doesn't explicitly look at these kinds of anomalies. >>: I'm also wondering -- I realize this is sort of speculative -- whether the set of features that I'm interested in may in fact vary with the distance in the network from the original tweet. >> Munmun de Choudhury: Sure. Sure. We can always add that on top of what we have already built, and it could be part of how we learn the weights, the significance, of the dimensions. We might actually want to weight the dimensions differently for users who are far away compared to my immediate friends. And that is exactly a very classic example of personalizing this. Right now, irrespective of who among us looks for information on a topic, we're always going to get more or less the same set of tweets. But in practice we might want to personalize this, and that's exactly where weighting based on my distance from the activity comes in. >>: In the previous study about the link filtering, is there an implicit assumption that there is an optimal link-weight filter that is relevant for every person? It seems like people have such individualistic patterns of communication. >> Munmun de Choudhury: Right. That's a very good question, actually, and that was also one of the comments when we wrote the paper. We could always do that, but one of the problems is that learning those individualistic thresholds would be computationally expensive for very large datasets. >>: Or you could compute some features which are normalized by a person's total e-mail volume, and that doesn't explode the computation. >> Munmun de Choudhury: Right. Right. But if you think about the different studies, the prediction tasks that we did, they're more at the macro network level; they don't exactly look at, say, how groups form, so they don't exactly look at the individual aspect. And I think the individualistic aspect is taken care of to some extent by the weighting of the features that we saw there. Basically, what we had was that we weighted the features based on the communication properties of that particular user when we were doing the prediction, so that personalizes it to some extent. We compared these weighted features and unweighted features. But I -- >>: At least recognize our -- >> Munmun de Choudhury: So the unweighted features represent every node in the network based on a certain set of structural attributes, like clustering coefficient, and we can also weight them based on how frequently that person talks to a given friend of theirs. So that could be the weight. That kind of takes care of the personalization threshold in a certain way, although it's not very direct. But I agree that in practice different people usually have different thresholds. And in fact there's some work by Robin Dunbar showing there's a neocortex limit on the number of friends you can have, which is 150, and usually the number of people we actually talk to is much less than 50 -- for some it's 20, for some it's 30, and so forth. It will be interesting to learn those thresholds at the individual level, if the nature of the research question asks for it. >>: Since you're talking about pairwise behavior, that changes shape over time. >> Munmun de Choudhury: Sure. So this is a single snapshot study.
So one of the datasets was over two years, so it was kind of normalized over the two-year period. We did not look at finer granularities. It would be good to probably look at small time scales and see if that threshold changes for those time scales as well. >>: I think there's an interesting thing here. First, it was neat that you had several different measures that you were looking at, and you found them all pointing to the same kind of thing that's inherent in the network aspect -- you saw this effect, this peak, that showed up at some point across these different functional tasks, which is very nice. So I think what's interesting, when you look at the second study you were doing, is -- I agree with you that this notion of sampling is really important, with firehose-type data coming in -- but it seems like there's potentially a much more diverse set of possible applications that a user might want. So in this case, I think it's laudable that you looked at what people do with their streams, and you found consistency in terms of [inaudible] and diversity. You can imagine there might be a wide range of different things, like people predicting marketing trends or people trying to predict stock prices, various other kinds of things -- trying to take those, a macro set of tasks, not just user preference, and seeing if you again see some consistencies in terms of the sampling that works best. >> Munmun de Choudhury: Right. On those lines, with the sampling technique, at least once we have pruned the information space we can still utilize that irrespective of what our final task is. But, yeah, I agree that at least from the study perspective it's very pinpointed, in the sense that we're only trying to understand it from an HCI or interface design perspective, but, sure, yeah. >>: The real question I have there is really about diversity. I think it's powerful in terms of perception, and for the kinds of things you're looking at I think diversity makes a lot of sense. I wonder if you looked at something that's more of a classification task or something like that, maybe diversity wouldn't be the axis that would get the most benefit; maybe it would be something else. Maybe the difference would help. It's hard to -- >> Munmun de Choudhury: Yeah. Well, my conjecture is that for a classification task diversity is probably not that useful. >>: Might be. Some active learning stuff. It's hard to say. But it would be interesting. >> Munmun de Choudhury: Definitely. Absolutely. Yeah. Yeah, I mean, one thing I've always wanted to do in my dissertation -- unfortunately I don't have the time -- is that I studied all these social processes that we saw in the beginning, but that was on all the data that was available; there was no sense of sampling in them. So it actually relates to your question: now that we have a sampling technique, what if we study all these different problems with the sampled data -- do we observe the same dynamics? >>: On all your papers. >> Munmun de Choudhury: It's like going back. >>: If you have time for that. >> Munmun de Choudhury: I know. >>: That would be interesting. >> Scott Counts: Anything else? We'll do this one more time. [applause]