>> Scott Counts: All right. Let's go ahead and get started. I'm delighted to introduce today's
speaker Munmun de Choudhury from Arizona State University. Many of you remember Munmun
from her internship that she did here last summer. Munmun is defending in just over a month.
So she's just about at Ph.D. status. Munmun is, I guess you might say, in the vanguard of
researchers working in an area called computational social science, which is basically a very
interesting and contemporarily relevant blend of computer science, various statistical techniques, and
today's big social media datasets. So we are just delighted to have her here this
morning. So Munmun.
>> Munmun de Choudhury: Thank you, Scott. And good morning everyone. And thank you for
finding the time this morning and being here.
The title of my talk today is Information That Matters: Investigating Relevance of
Entities in Social Media Networks. But before we get to this topic, let me start with some
background on what I do.
So it goes without saying, especially to the audience that we have here, that social media and
social networks are causing significant changes in our daily lives.
With the advent of the Web 2.0 technology we all know that we have multiple different ways to
communicate and interact with each other on different modalities. For example, we can write on
somebody's Facebook wall. We can vote on items on [inaudible]. We can write comments and
blog posts on LiveJournal, or even post and share content on Twitter.
And this domain is definitely the area that I'm working in. And the goal of my research so far has
been to understand the dynamics and impact of these kinds of online social interactions that take
place on social networks and social media.
And this problem is interesting, because I'm sure all of us would agree here, as well as we are
aware, that 140 characters can actually cause revolutions.
It did so during the elections in Iran back in 2009, and during the earthquake in Haiti last year.
Therefore, with these kinds of observations that we gather from the current happenings around
social media, I believe that research in this direction can help us understand the culture
that has begun to emerge in our digital society out of this kind of interaction process.
And it can also help us design, implement and apply the next generation of interactive social
information systems we hope to see in the near future.
And why should you care about it? Well, if we consider two of the major products of Microsoft like
Bing and Windows Live Mail we often see the question what are the set of people that we need to
target in order to advertise a particular product, or what kind of ads do we want to show to what
kind of people.
Which means that we need to understand the interactions between these individuals on these two
products that we are talking about. So I believe that my research can help us drive viral
marketing, advertising campaigns and so forth.
The second example is if we consider a different set of Microsoft products like the XBox Live or
the Office Suite or Visual Studio where we are essentially interested in understanding how groups
collaborate, how people come together for gaming or to understand how we can have efficient
groups working on different complex projects.
So we need to understand who talks to whom and how do they talk to each other, which means
that this research can help us understand collaborations.
The third example is if you consider Bing or like Windows Phone, obviously the question that we
often have these days is that with so much of data available, what to show and what not to show.
So this research, by analyzing interactions between individuals, can help drive better interface
design.
And finally if we consider again like Bing, Live Mail, Windows Phone and so forth we'll see that
with the scale of data available, we often ask the question what is the right set of people and
information that could match with those set of people. Basically we're looking for information and
people who could be relevant on a given topic or a given set of events.
Basically we are looking for distributed social search, and therefore understanding the interactions
between people can help us address these kinds of applications.
But obviously all these applications are very interesting. The question is how do we analyze and
model our interactions to actually address them? And this has been the focus of my research
and my dissertation so far. My key idea is that our interaction can manifest itself via three key
aspects. To take an example, let Alice and Bob be two users of the Microsoft Xbox Live gaming
service, and let them be exchanging messages via some communication modality.
As a result of this interaction process, information is exchanged between them and also
oftentimes this kind of interaction is visible to their own set of friends.
And which is the notion of the network they're associated with. So the three key aspects of our
interaction are essentially the information that is exchanged during the interaction process, the
media or the channel via which the interaction takes place and finally the social engagement or
network that embodies this kind of interaction.
So obviously we want to be able to understand these three key aspects if we want to make
sense of our interaction process. But these aspects often are enveloped in different large scale
social processes and therefore during my research and my dissertation I've begun to analyze
these different social processes.
For example, corresponding to the first aspect, information, we are interested to see how
information diffuses between sets of individuals in a network. Corresponding to the second aspect,
which is social engagement, I've begun to understand how communities or groups evolve due to
this interaction process.
And finally, I have begun to understand how the characteristics of the media via which
communication takes place evolve over time; for example, how do the themes evolve, how
does the interestingness evolve, and so forth.
However, we have a very serious problem before we can actually address these questions in an
adequate and efficient manner. And that observation is the social Web is really changing at a
very fast rate.
And what exactly is changing? Well, new people appear. New ties are formed, and new
interaction data appears as well.
So, to support this observation with some statistics: what we saw from blog posts and the
Huffington Post last year is that Twitter was receiving as many as 600 million search queries a day,
and that's a huge number; and we know that by the middle of last year Facebook reached
half a billion users, which again is a huge number.
So basically we have a lot of data. And this data is definitely interesting because it helps us study
all these different social processes, generalize them and so forth.
However, is there something more fundamental that is happening here than just the scale? I
believe so, and that is exactly going to be the focus of this talk today, where we are going to talk
about sampling for information that matters to us.
There are two simple questions that we are going to look at here. The first one is: how do we
infer meaningful human networks from our interactions online? And secondly, how do we
identify valuable social media content that is generated as a result of the interaction process?
First of all, we'll look at the first question. To give you some background, this is work that I did with
Winter Mason, Jake Hofman and Duncan Watts, and it started during my internship at Yahoo! Research
in the summer of 2009.
So the first -- so the basic question that we address in this work is how do we choose a relevant
tie? Because we are interested in understanding how we can infer meaningful networks from
our interactions. It is important to first understand what ties are relevant, where ties could be
relationships between people having some social or psychological meaning associated with it.
So a very nice solution we might say that, hey, why not just go to the people and ask them which
ties are relevant to you. Of course we can do that. But when we are talking of individuals or
populations of users in the scale of millions, say, on Twitter, on Facebook, can we actually do
that? No, right? So obviously that is why we are solving this problem, and the question we ask
is: Is there a principled way that helps us find those relevant ties from a set of observed interactions
over a period of time?
So let us do a small exercise over here and let me ask you guys how frequently you guys talk to
your best friends. Anyone?
>>: Few times a week.
>>: Daily.
>>: Couple times.
>> Munmun de Choudhury: Couple times a week. Okay. I'm sure there will be some individuals
who would like -- like I personally talk to my best friends like -- even though I consider them as
friends -- I talk to them like twice a month. So definitely assuming that the relationship with the
best friend is actually a relevant tie, there could actually be different frequencies of communication
or interaction associated with that relevant tie.
But what is a good measure of communication that can help us infer that tie to be relevant? Well,
that's actually the problem that we are tackling in this paper.
So, to put this in the context of previous work: similar problems have been faced in the past in
typical social network research too, and ways of inferring social ties from interaction data have
actually been investigated. And there are many reasonable definitions we could use. For example,
we could say:
Hey, you know we think there's going to be a tie between two individuals when there's at least
one communication between them in the past one year. We could go a little stricter and say,
you know, an average of one communication per week. Or we could go even stricter and say
that we are going to think there's a tie between two individuals when there is one
reciprocated communication in the last month. Of course, they all sound reasonable, right?
They can be of interest for different kinds of research questions such as the first one could be of
interest when we are looking for information or people in a network.
The second one, which is when we are trying to focus on relevant ties in short scales of time, it
could be interesting to understand problems which are highly temporally volatile in nature such as
diffusion of information; and, thirdly, the reciprocated communication definition could be
interesting when we are trying to understand how people come together in groups or other kinds
of homophily-based hidden properties.
But this doesn't solve our problem. What we're saying is that these are more or less ad hoc,
and the question that we're asking is: Is there a more principled way, given an observed
set of communications and given a network, to say that these are the relevant ties?
So the method that we propose in this work says that to find relevant ties, we'll define a minimum
threshold on the frequency of communication between individuals.
What that means is, let us take an example. So these are, say, four Windows Phone users, and
these edges that you see represent the direction of communication and their weights represent
how frequently they communicate. So now we want to infer a network in this diagram that we see
where we are going to eliminate certain ties which we think are not relevant.
We're going to do so by choosing a minimum threshold on the frequency of communication. So
let our threshold be this red line of a certain thickness. What we're going to do is we're going to
eliminate all the edges in this unthresholded network that have weights less than this threshold.
We are saying, okay, there has to be minimum of this much communication for a tie to be
relevant.
So this leaves us with this set of three relevant edges in the network. We can, of course, use a
different threshold with a slightly higher value, and we are left with only two relevant ties in the
network.
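To make this concrete, here is a minimal sketch in Python of the thresholding idea just described. It is an illustration only, not the code used in the study; the user names and rates are made up.
```python
# Keep only the ties whose communication frequency meets a minimum threshold.
def threshold_network(weighted_edges, min_rate):
    """weighted_edges: dict mapping (user_a, user_b) -> communication rate.
    Returns only the edges whose rate is at least min_rate."""
    return {pair: rate for pair, rate in weighted_edges.items() if rate >= min_rate}

# Toy example with four hypothetical users.
edges = {("alice", "bob"): 12.0, ("bob", "carol"): 1.5, ("carol", "dave"): 6.0}
print(threshold_network(edges, min_rate=5.0))  # two "relevant" ties survive
```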
So the observation is that the network itself is undergoing a lot of change structurally because we
are defining different values of the threshold. So to support this observation empirically what we
did was we focused on two different e-mail datasets and we constructed networks based on who
talks to whom.
And from the left side of the screen to the right side, we applied progressively higher thresholds,
which gave us fewer and fewer relevant ties between the people in the network.
And the observation that we have is that as we go from left to right, the network becomes more
and more sparse; in a sense, the network undergoes a lot of structural change. So it means that
it's important to define a relevant tie in an optimal manner, so that we pick, among this ensemble,
the network which is the most meaningful.
Our goal is therefore to infer networks for various definitions of this threshold, and to study the
impact of the thresholding on different structural properties of the overall networks, such as
descriptive statistics like the clustering coefficient or [inaudible], and also the ability of these
networks to predict certain node characteristics, which justifies the utility of these networks in
different tasks.
And what we're interested in is to gather some insights on what kind of thresholds can be
considered to be optimal in defining ties to be relevant.
So just a brief idea of the datasets that we used in this work. The first one is a university e-mail
dataset, which is a registry of e-mails between students, faculty and staff at a large university in
the United States; it has about a million e-mails over a two-year period. And Enron, of course,
is publicly available; it's a repository of e-mails between the employees of the former Enron
Corporation, with about the same number of e-mails and about 5,000 individuals.
So first let us discuss how we construct these thresholded networks. So before we go there,
what we did was to start with an unthresholded network, which is similar to our Windows Phone
example: we have users and who mails to whom. This kind of network had symmetric edges,
meaning there should be at least one communication back and forth, and the weight of such an
edge was the geometric mean of the rates of interaction in each direction, over a period of time,
between the individuals in the network.
Next what we did was we wanted to eliminate certain edges in that network to infer what ties are
relevant. And therefore we chose a threshold. So the threshold here was defined as a real,
positive number.
And it was equal to the annual rate of exchange of e-mails between individuals over a period of
time. And we linearly chose different values of this threshold over a scale to construct a family of
networks.
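As a rough sketch of this construction, and under the assumption that we start from simple directed e-mail counts, the weighting and the family of thresholds might look like the following in Python (again an illustration, not the study's actual code):
```python
import math

def build_weighted_network(email_counts, years):
    """email_counts: dict mapping a directed pair (sender, recipient) -> number of e-mails.
    Keep a pair only if there was at least one message in each direction, and weight the
    symmetric edge by the geometric mean of the two annual rates."""
    edges = {}
    for (a, b), n_ab in email_counts.items():
        n_ba = email_counts.get((b, a), 0)
        if n_ab > 0 and n_ba > 0 and (b, a) not in edges:
            edges[(a, b)] = math.sqrt((n_ab / years) * (n_ba / years))
    return edges

def threshold_family(edges, thresholds):
    """One thresholded network per threshold value (annual rate of e-mail exchange)."""
    return {t: {pair: w for pair, w in edges.items() if w >= t} for t in thresholds}

counts = {("a", "b"): 40, ("b", "a"): 10, ("a", "c"): 3, ("c", "a"): 1}
family = threshold_family(build_weighted_network(counts, years=2.0), thresholds=[1, 5, 10])
```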
So to take examples: for a threshold of five e-mails, here we have the graph that we see on the
top, and then for a slightly higher threshold we have a sparser network.
We're going to see how, for this set of networks, the descriptive statistics of the network
change. First of all, the global network features. What we see is that the number of connected
components for these networks, defined over the different thresholds, undergoes a significant
change, where the X axis shows the different values of the threshold and the number of
components is on the Y axis.
A similar observation can be made about the relative sizes of these connected components,
where we see that there's a sharp decrease, because the graph gets sparser and we are getting
a lot of small connected components.
Similar observations can be made about the local network features, where, when we consider
reach, closure and bridging measures for each node, we see that all of these undergo significant
changes, almost in an exponential manner, when we increase the threshold.
What it implies is that when we are choosing different definitions to choose that relevant set of
ties, the networks that we are getting are totally different. Of course, it doesn't answer the
question: Then what is the right threshold? And this is what we're going to see right now.
So what we do here is that our hypothesis is that network is going to be relevant when it has
relevant ties, of course. And when those ties can actually help us solve a certain task.
So we considered a series of prediction tasks, and our goal was to see which of these
different networks gives us the best prediction accuracy, or is most useful in solving a
certain research question.
What we did was we considered three different prediction tasks, and we'll go over them briefly.
The first prediction task was to predict the node status or gender where the status means in the
case of the university e-mail whether a node is, whether a person is a faculty, staff or a student.
What we did was we developed feature representations of each of the nodes in each of these
thresholded networks, where the different features were structural features such as the
clustering coefficient, embeddedness, node degree, size of the two-hop neighborhood, and so forth.
Given this feature representation, we trained a support vector machine classifier with a
Gaussian kernel, and we performed testing where we learned the support vectors and kernel
width to be able to predict the node status or gender for these datasets.
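A hedged sketch of what this prediction setup might look like, using scikit-learn (not necessarily the tooling used in the original work); the feature matrix and labels here are toy stand-ins for the structural node features and status labels just described:
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 4))            # e.g. clustering coefficient, embeddedness,
                                    # degree, two-hop neighborhood size (toy data)
y = rng.integers(0, 3, 200)         # status labels: faculty / staff / student (toy)

# Learn the regularization strength and the Gaussian kernel width by grid search.
model = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]})
print(cross_val_score(model, X, y, cv=5).mean())
```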
The next prediction task was to predict future communication, where we again used the same
feature representation of each node, and what we did was we trained a linear model using the
past communication activity of each node, which is the number of e-mails that every person sends.
And we learned the best-fit coefficients to predict the future communication at the next time slice.
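As a minimal illustration of fitting such a linear model (with made-up features and volumes, not the actual data or exact model from the talk):
```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least-squares fit of y ~ X, with an intercept column added."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coeffs

rng = np.random.default_rng(1)
X_past = rng.random((100, 5))                    # node features plus past e-mail volume (toy)
y_next = X_past @ rng.random(5) + 0.1 * rng.standard_normal(100)  # next-slice volume (toy)
print(fit_linear(X_past, y_next))                # learned best-fit coefficients
```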
And the next prediction task was detecting communities.
So what we did was we fit a stochastic block model where we wanted to find soft assignments of
the different individuals in the network to schools. So, like, in the university dataset each person
belongs to a certain school.
And then we compared these assignments, through a mutual information metric, with the actual
school assignments.
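For the comparison step only (the block-model fitting itself is omitted here), a hedged sketch using scikit-learn's normalized mutual information, with toy labels:
```python
from sklearn.metrics import normalized_mutual_info_score

inferred_blocks = [0, 0, 1, 1, 2, 2, 1]   # toy assignments from the fitted model
actual_schools  = [0, 0, 1, 1, 2, 1, 1]   # toy ground-truth school memberships
print(normalized_mutual_info_score(actual_schools, inferred_blocks))
```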
So now let us take a look at the results of these prediction tasks. So this is with the university
e-mail. And these are the four different prediction tasks. So what we -- the first thing that comes
to our mind from these results is that they kind of have the same pattern, which is the accuracy
which is shown on the Y axis, it actually peaks at a nonobvious value and not on the
unthresholded graph that we have on the extreme left side.
So we would usually expect that it's the unthresholded graph that performs the best because it
has the most instances of communication available. However, that doesn't seem to be the case.
In fact, it kind of peaks around the same range, which is five to 15 e-mails over a year period.
The same observation can be made for the Enron e-mail, where we have the same pattern. We
see that as much as like 20 percent boost over the unthresholded network on the extreme left.
So what are the observations? So clearly the accuracy seems to peak at a nonobvious point.
And the increase that we see at that point or at that threshold is as much as 30 percent over the
unthresholded graph. Our conjecture for this observation is that when we are looking at the
unthresholded graph we have those ties or those instances of communication which are actually
noisy, and therefore they are probably not reflective of actual ties between people.
So as we turn up the threshold, we're actually getting rid of that noise and increasing the signal,
which leads to the increase in accuracy.
So this is kind of explanatory, but another very interesting artifact that we observed is that the optimal
threshold seemed to be limited to a certain range, which is five to ten e-mails a year, and it
was kind of consistent with the different prediction tasks that we saw and the different datasets
that we looked at.
So we tried to think about why this happened. And it seems that if you think about the different
prediction tasks that we considered, they kind of belong to the same equivalence class, like
detecting communities or predicting node attributes; they're essentially reflective of homophily,
or of the similarity of users to one another.
Maybe for this equivalence class of problems there's actually this optimal range of thresholds in
which we can find the relevant ties. And as for the datasets we looked at, both of them were
e-mail datasets and they were from the early 2000s, so maybe there was a habitual nature to
how people sent e-mails; and both of them were organizational e-mails, so maybe that's a
similarity, and that's why we saw the consistency in the optimal threshold.
And in the future, of course, it will be interesting to see how this generalizes. So the conclusion is
that social network research so far kind of always looks at some instance of interaction data. And
what they do is they construct edges based on these instances of interaction.
However, this method seems to be ad hoc, because we don't really know which ties are actually
the relevant ones and which ties are not. So we tried to address a narrow version of this problem in
this work, where we have data-mined an optimal threshold condition, at least for this range of
e-mail networks, to say how we can find the relevant ties in the network.
Surprisingly enough, this optimal range seems to be consistent across different datasets and
also across the tasks that we considered.
Some open questions that we would like to address in this research in the future. So what we
have done here is to have a separate model of finding the relevant ties and then performing the
prediction tasks. So is there a way that we can actually learn this threshold condition as a model
parameter inside the prediction task itself?
The second question is: so we know that there's a way to data-mine this set of relevant ties for
a set of known features. Does that set of relevant ties, does that network representation,
hold equally when we test it on a set of unknown features? That will be something very
interesting to investigate in the future.
Now question two. This is a research that I did when I was over here last summer with Scott
Counts and Mary Czerwinski. And let us begin with our motivation. So we are -- I'm sure all of us
are more or less familiar with Twitter, or we are on Twitter and we write our updates and so
forth. And not only does Twitter allow us to keep others posted about what we are doing; it has
also started to emerge as a news medium, often disseminating information about news and
timely happenings.
However, there is so much of data that has been generated that we often, the end user often
asks the question that: How do I find the right content? I just have an information overload
problem. So this is our motivation. And we are trying to ask the question: How do we infer the
most relevant or best set of items on a certain topic from these millions or billions of pieces of
information content?
So if you think about the problem of finding relevant ties, it can actually be mapped to the problem
of sampling information given a certain signal. And so let us contrast our question with a familiar
example. So I'm sure all of us are familiar with the famous Nyquist sampling theorem, which
says that if you have a signal with no frequencies above a certain frequency, the Nyquist
frequency, then if we pick samples at a rate of twice that frequency we can actually
reconstruct the original signal. Essentially it's lossless sampling.
So it seems that for this genre of signals, which could be images, videos, the data we get from
webcams or surveillance cameras, there is a discrete, regular and fixed sampling
lattice.
And the time to sample each pixel, if it's an image, it's almost constant. However, in our case,
when we are talking of social media content, that might not be the case. Because Web activity
doesn't have a notion of bandwidth.
Also, what we did was we tried to look at so what are the characteristics of this space? Like the
social media space. And what we found was that it is characterized by a wide range of attributes
or dimensions such as geography, who writes the content, then what time they are writing and so
forth. So essentially social media content space is diverse.
So then we looked at the state of the art: do existing tools actually address, or give the user a
way to address, this diversity issue? So we did a small survey and we asked users what
tools do you use to find relevant content? And the results are up on the slide.
So it seems that most of these are -- actually, all of these are tools that help you browse
information based on a fixed attribute or a single dimension.
So the Twitter website lets you browse in reverse chronological order; Bing Social gives you
the URLs which are highly shared in the network.
But then if I'm interested to focus on other attributes, how do I do that? So that's the challenge
with these tools. And also motivation for this work.
So one of the characteristics of social media is its high dimensionality, and the different dimensions
we considered are geography, the social graph, whether there's a URL, the theme distribution and so
forth.
So we call this property information diversity. And how do we quantify this diversity? So for
that we came up with a conceptual measure which is called the diversity spectrum. So what it
does is that it represents the social media content space with an information theoretic measure
called the entropy. And on this spectrum, on the extreme left side, you'll have content that is
highly homogenous where the diversity is nearly zero and on the right side you'll have content
that is extremely heterogenous, where the diversity is almost close to one.
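As a minimal sketch of how such a diversity score could be computed for one attribute, here is a normalized Shannon entropy in Python; the talk's exact multi-attribute formulation may differ, and the example values are made up:
```python
import math
from collections import Counter

def normalized_entropy(values):
    """0 for a perfectly homogeneous sample, 1 for a maximally heterogeneous one."""
    counts = Counter(values)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0

# Toy example: the geography attribute of tweets in a sample.
print(normalized_entropy(["US", "US", "US", "US"]))   # 0.0, homogeneous
print(normalized_entropy(["US", "UK", "IN", "BR"]))   # 1.0, heterogeneous
```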
So obviously this kind of formulation also makes sense from an information theoretic perspective,
where the entropy measure has often been used to represent the mixing, or the relative
representation, of different attributes in a given sample of a population. The other observation in this
work, when we are searching for relevant content, is that the samples of relevant content
we generate are being consumed by the end user, and
therefore our sampling process needs to benefit from mechanisms of human cognition, because it
is the user who is going to judge an item to be relevant or not relevant.
So what we did was that we estimated the goodness of such a sample based on measures of
human information processing. We considered a set of cognitive metrics that would help us
judge whether a sample that we generate is good or not. Some of them were engagement,
encoding into long-term memory, interestingness, informativeness and so forth. We'll
come back to these in a little bit.
In order to find a sample from this vast space of social media content, the two important steps
in our research have been: given this large set of dimensions, how do we find the relative
significance of each of them; and then, given this kind of dimensional representation, how do we
sample content that could match a certain desired level of diversity in the information space?
First of all, how do we find the significance of dimensions? So what we did was we sought
feedback through a survey from 11 active Twitter users, and we asked them which dimensions
they found to be important while browsing content and asked them to rate their importance on a
scale of 1 through 7. And this is what we get.
Not so surprisingly, certain dimensions seem to be more important than others. For
example, posting time was very important, whereas the number of friends of the user who wrote
the content, which could be a tweet, was of middling importance. Then we move on to:
given this weighted representation of the dimensions of the social media space, how
do we find a sample?
So let us provide a little bit of motivation here. So there is -- our motivation comes from the signal
processing literature in an area which is called compressive sensing. Compressive sampling.
What it says is that if we have a signal that is high dimensional in nature and has very few nonzero
entries, that is, it is sparse, then we can essentially represent it as a linear combination of a very
small number of basis components.
So borrowing that idea in our context, what we observe is that the social media content space is
also very high dimensional and it is sparse, too.
Because if you consider the Twitter information space out of all the tweets, probably very few of
them are retweets or probably very few have URLs in them. So it is sparse.
So maybe we can represent the social media information space through this linear
combination of a small number of basis components or measurements, which would greatly help us
prune down the information space.
So, to understand it in a more formal and visual manner, let our signal, which could be the
social media information space like Twitter, be given by X, which is of size N and real-valued.
We are interested in finding a Y of size M, where M is much smaller than N, through some linear
measurements that are performed over X.
To represent it visually: this is our actual data; it has only K nonzero entries, that is, it is
sparse. And we are performing a transform Phi, which is a two-dimensional matrix of size M by N,
to get a pruned information space Y which has M measurements, where M is much smaller than N.
So what this would give us is that it would help us prune down the information space to a great
extent. So essentially what we need to find here is: how do we find this transformation Phi?
For that we used the popular wavelet transform called the Haar wavelet. To represent it in a
diagram: we have the original space of size N, we perform a compression using the wavelet
transform, and we get a pruned information space which has K nonzero entries.
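A rough sketch of this pruning step, written as a hand-rolled Haar transform in Python for illustration (the actual transform over the roughly 20 tweet dimensions may be set up differently):
```python
import numpy as np

def haar_1d(x):
    """Full Haar decomposition of a length-2^n vector."""
    x = np.asarray(x, dtype=float)
    levels = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2)
        diff = (x[0::2] - x[1::2]) / np.sqrt(2)
        levels.append(diff)
        x = avg
    levels.append(x)
    return np.concatenate(levels[::-1])

def keep_top_k(coeffs, k):
    """Zero out all but the k largest-magnitude coefficients."""
    pruned = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-k:]
    pruned[idx] = coeffs[idx]
    return pruned

features = np.array([3.0, 0, 0, 1, 0, 0, 2, 0])   # toy sparse feature vector
print(keep_top_k(haar_1d(features), k=3))         # K nonzero wavelet coefficients
```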
So now we have a pruned information space. It's computationally more efficient. So how do we
construct the actual sample which is of a certain size and has a certain level of diversity?
So we proposed an iterative clustering framework for that purpose that tries to
minimize the distortion in entropy between the sample that we generate and the desired entropy
level.
So what we do is we start with an empty sample and we pick any information unit or tweet from
the space at random. We keep on iteratively adding these tweets to the sample making sure at
every point the entropy, the normalized entropy of that sample is as close as possible to the
entropy level that we want to have.
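A self-contained sketch of this greedy idea follows; it is an illustration with a single categorical attribute and made-up tweets, not the exact algorithm from the talk:
```python
import math
import random
from collections import Counter

def norm_entropy(values):
    counts = Counter(values).values()
    n = sum(counts)
    h = -sum((c / n) * math.log(c / n) for c in counts)
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0

def diversity_sample(tweets, attribute, omega, sample_size):
    pool = list(tweets)
    sample = [pool.pop(random.randrange(len(pool)))]   # start from a random seed tweet
    while pool and len(sample) < sample_size:
        # greedily add the tweet that keeps the sample's entropy closest to omega
        best = min(pool, key=lambda t: abs(
            norm_entropy([attribute(s) for s in sample] + [attribute(t)]) - omega))
        sample.append(best)
        pool.remove(best)
    return sample

tweets = [{"geo": g} for g in ["US"] * 6 + ["UK", "IN", "BR", "JP"]]
print(diversity_sample(tweets, attribute=lambda t: t["geo"], omega=0.3, sample_size=5))
```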
>>: What feature space are you working in here, what feature space are you applying the Haar
wavelets to and --
>> Munmun de Choudhury: As we saw at one point, the social media space has high dimensionality;
the dimensions are geography, presence of a URL, whether it's a retweet or not, the thematic
distribution, and so forth.
>>: So you came up with a set --
>> Munmun de Choudhury: We came up with our features.
>>: How big was the feature space?
>> Munmun de Choudhury: It was around 20.
>>: 20. Okay.
>> Munmun de Choudhury: Yeah. So to put this iterative framework in a more formal manner,
we essentially have the following objective function, where our goal is to minimize the distortion,
in terms of the L1 norm, between the normalized entropy H after adding the i-th tweet and the
desired entropy level omega.
And we continue this greedy approach until we reach the desired sample size. So now we have
a method which sounds promising. Obviously the question you might be asking is: How does it
compare to other possible sampling techniques or some methods that are actually used in
practice?
So for that, we performed some experiments using the full firehose data on Twitter over the month
of June last year, comprising a little over a billion tweets.
So, first of all, let us see what are the different other sampling techniques that we compared our
method to? So we constructed a series of different sampling techniques based on utilizing
variations of our sampling algorithm which comprises three key components which is whether or
not it uses a transform, whether or not it uses entropy minimization to fit a desired level of entropy
or diversity, and also weighting of the different dimensions.
So, to take an example of the different baseline techniques: our proposed method is the
one in the bolded line. We have used the wavelet transform, we've used the entropy minimization,
and also the weighting.
And, for example, baseline 3 does use the entropy minimization but doesn't use wavelet
transform or weighting. And we also had two other methods which we called the most recent
tweets which is kind of how Twitter shows you results on a topic and also the most tweeted URL,
the idea that we're going to show the tweets that have the most shared URLs in them.
So, some quantitative evaluation comparing our method with these baselines. What these
plots show is, given a certain diversity level, which is point one, point three or point six
over here, how well does the entropy of the samples that we generate by these different methods
match the desired level of diversity?
So the desired level of diversity is represented in the dotted red line in all of these, and over
different sample sizes shown on the X axis we are trying to see what is the entropy of the
corresponding sample by each of these six methods.
So the prime observation, or the take-away here, is that our method, which is again the bolded line,
is the closest, having the minimum distortion with respect to the desired entropy level; and to
take an example, baseline three, which doesn't use the wavelet transform or the
weighting, is much farther off from the desired entropy level.
So it seems that compared to all these other variations of the sampling technique, our method
performs the best in generating samples of different sizes that are very close to the diversity that
we are wanting.
So we also tested some robustness of our proposed algorithm. So note that one of the
characteristics of our algorithm is that it's a greedy approach. And it starts with a random seed.
Of course, there could be other ways of picking that seed, for example, if we know what kind of
tweets we want to definitely include. We can start from that tweet having that sort of attributes.
But here we just wanted to see what happens when we choose the seed at random. What we did
was we checked for the overlap in the content of the samples that are generated across multiple
iterations when we choose different seeds.
We noticed for two different topics, spill and iPhone, that across different levels of diversity and
sample size we really have high overlap across samples. So now this is really interesting,
because it shows no matter where in the information space we seed our sampling from, we
essentially reach a suboptimal sample which seems to be consistent across multiple iterations.
So now we conjecture that this probably reflects that the social media information space has
certain regularity to it, because of which, when we are trying to find tweets matching a certain
diversity, we essentially reach out to those structural regularities in the space, and because of that
we have high overlap.
However, quantifying that regularity is definitely of interest in our future work. Okay. So going
back to one of the observations that --
>>: I have a question. How deep are the samples in the sets like that, like how many tweets were
in each, the starting set before you did the sampling, for those two?
>> Munmun de Choudhury: There was a 50 to 60 percent reduction after the pruning, because we had
the --
>>: How big was the total number?
>> Munmun de Choudhury: It was a billion tweets in all but it was filtered by topic. So it was
several thousands.
>>: Several thousands.
>> Munmun de Choudhury: Okay. So going back to the observation we had at the start,
which is that when we are generating these samples of relevant content, it is the end user who is
consuming them. So obviously we need to find out how they think about these samples that we
are generating using our method and also the baseline techniques.
So what we did was we came up with four cognitive metrics, two of which were explicit
measures: they estimated the sample quality based on interestingness and informativeness. And
there were two implicit measures, the first one being subjective duration assessment, which is the idea that
when we show a sample to a user, if the user underestimates the time taken to go through it, it
means that probably he or she found it really engaging and therefore they underestimate the time
they thought they took to go through it. We call this measure the normalized perceived duration.
And the second one is recognition memory for the tweets, which is the idea of how well
the information that a user sees in the sample gets encoded in long-term memory, or whether
the users remember the information they saw in the sample.
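A hedged sketch of how these two implicit measures might be computed; the exact formulas used in the study may differ from this illustration:
```python
def normalized_perceived_duration(perceived_seconds, actual_seconds):
    """Negative values mean the user under-estimated the time spent,
    which is read here as a sign of engagement."""
    return (perceived_seconds - actual_seconds) / actual_seconds

def recognition_accuracy(answers, was_in_sample):
    """Fraction of yes/no recognition questions answered correctly."""
    correct = sum(a == truth for a, truth in zip(answers, was_in_sample))
    return correct / len(answers)

print(normalized_perceived_duration(50, 80))                      # -0.375: under-estimate
print(recognition_accuracy([True, False, True], [True, True, True]))
```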
To evaluate our proposed method based on these four measures, we performed a user
study. Our user study drew on feedback from 67 participants, of which around 65 to
70 percent were men and the rest were women. And the age range was 19 to 47 years. So
that's a good mix. And what we did, it was a Web-based application, and what we did was we
showed 12 different tweet samples that were generated by these different methods. And over
different diversity parameter values.
And corresponding to each sample, we asked them the questions to estimate the length of time
they thought they took to go through it, the interestingness on a Likert scale, and the diversity and
informativeness.
At the end of these 12 samples, we also showed them a set of 72 tweets of which 36 came from
the samples they already saw and the rest were on the same topic but actually were not
shown in any of the samples.
And we asked them whether or not they had seen the particular tweet in any of the samples
through a yes/no question.
So the idea here was that if the users found the tweets in these samples worth remembering,
they are going to reply correctly here. As for the experimental design, topics were
between subjects, and the different methods and also the diversity
levels were within subjects.
So now let us take a look at some of the results of this user study based on the four cognitive
measures. So this is what we get. And the Y axis shows the normalized or the raw user ratings
for each of these measures, whereas the X axis shows the different methods.
What we see is that our method performs the best among all of them, which was
found to be statistically significant, at least with respect to baselines one and two and the most recent method.
So as you'll see here is that for informativeness, the recognition memory and interestingness, we
actually have the highest rating. And also for the subjective duration assessment, which is the
normalized perceived duration measure, we see that it was the least negative, which shows that
the users were able to find the content or the samples generated by our method to be relatively
more engaging compared to the other ones. So this definitely shows that, even on a
qualitative scale, users found the samples generated by our method, PM, to be the best.
>>: Were there significant differences?
>> Munmun de Choudhury: The differences were significant with respect to baseline one, two,
and most recent. We didn't see any significance for baseline three and MTU, which is the most -- I have
numbers online. I can show you. Cool.
Okay. So again one of the observations when we started this problem was that diversity makes a
difference to the sampling process. So now we ask the question: But are users actually able to
discern those diversity levels or not? So we actually found support in favor of that. And what the
slide reports is one explicit measure interestingness and one implicit measure, recognition
memory, for two topics. And we showed the measures, the participant ratings separately for the
three diversity levels that were considered in the user study.
And interestingly, if you see, for most if not all of the cases the ratings were actually better for
diversity point one or point nine, which means that users found information to be more relevant when
it was highly homogeneous or when it was highly heterogeneous. So now that's
something that's really interesting.
On one hand it means that users were actually able to discern the differences in diversity. But it
also shows that the change in diversity is not monotonically related to the perception of relevance
to the users.
In fact, it shows this kind of nonmonotonic structure where information at the ends of the diversity
spectrum are found to be more relevant compared to those in the middle.
So, the conclusions. Our primary motivation in this work was that sampling is a very important
problem, because we are essentially faced with gigantic amounts of data on social media today.
And because this information space is very diverse, it is important that the diversity measure is
incorporated inside the sampling framework itself. And we validated the goodness of our samples
through a user study, where we found that when we evaluate those samples
based on cognitive measures of human information consumption, our
proposed method actually generates the most relevant content compared to some baselines as
well as to stronger variants of state-of-the-art techniques.
However, there are some open questions which we would like to address in future work. The first
one: just in the last experimental slide, what we observed was that users were able to find content to be
more relevant when it was highly homogeneous or when it was highly heterogeneous.
So this leads us to the question that are there like empirical bounds on what kinds of diversity are
more suited for content consumption than others?
The second question is related to the regularity of the space that we observed when we saw
information to be highly overlapping despite random starts of the algorithm.
Does that mean that the social media information space has some sort of signature in an
information theoretic sense? And if there are such signatures indeed, then how can the sampling
method itself benefit based on these signatures? And it would be interesting to see that in future
work.
Okay. So we come to the end. So we all agree here or at least observe that social networks are
causing significant changes in our daily lives. And the inferences that we make about different
social phenomena, like how groups form, how information flows in the network, or how the
characteristics of the media change, this is affected by the quality of the data or the relevance of
the data that we are looking at. And also because we are interested in streamlining the user
experience in order to present them with content, it is affected by the relevance of the data that
we present.
And both of these matter greatly. Let us take a look at some of the future research directions
I would be interested to work on. In the short term, I'm interested in the evolving area of what is
known as social media marketing.
So some example questions in this domain, which could also be of interest to, say, Bing or
Windows Live Mail, would be to understand what, which could be information or people, we
would actually like to tap in a network so as to optimize on an economic, technological or
cultural goal.
And also using the structure of these social networks in media, how do we drive computational
advertising of which ads to show, where to show the ad and so forth on the Web.
My second direction is the study of macroscopic dynamics based on microscopic interactions. So
a question that has always struck me is: how does pop culture evolve on these social
systems? So, for example, do we actually know how RT became the accepted notation for retweeting, or
or kind of copying one's information and transmitting it on Twitter? I would like to investigate on
how these kind of accepted norms emerge based on our interactions.
The second question is: we always note, at least at a macroscopic level, that there is
tremendous order in all of these networks. This also relates to the work on fractal theory and all. But
we all know that at the microscopic level it's extremely noisy [inaudible] and there's a lot of spam.
How do we go from noise to order?
In the long term, I'm interested in understanding, in a more elaborate manner, the properties of
these attribute-rich, petabyte-scale information spaces on the Web, with topics such as: how
do we perform social computing when we move to the cloud? We have all our data on the cloud
and so forth.
I'll be interested in developing [inaudible] for these large information spaces that helps us
compare and generalize across different kinds of data. And finally also I would be interested in
developing a comprehensive theory of online communication, something that has happened for
like face-to-face communication but doesn't quite exist for our communication that typically
unfolds online or over the cloud.
So with that, I come to an end. I would like to acknowledge my advisor Hari Sundaram, and also
my collaborators at Avaya Research, Yahoo! Research and Microsoft Research, Duncan, Winter,
Jake, Scott and Mary, and also all my lab mates and colleagues at ASU during my internships.
Thank you for being here until the end and I think I'm ready for questions.
[applause]
>>: I have one easy one, I think. I'm not a social networks guy, but my understanding or at least
what I know is that a lot of these kinds of networks follow a power law, to some degree. So
how does the power law come into play here? For example, when you were looking for a
threshold, what if I had taken 20 percent of the heaviest ones and just used that? How would that
compare to your -- how would the accuracy then compare to the optimal threshold? Accuracy of
your optimal threshold?
And then, in the second part, when you were looking at the diversity of data, what if I took
the 20 percent most focused ones and compared it to the 20 percent most diverse ones, and
those compared to the middle, what is your feeling, how would that look?
>> Munmun de Choudhury: This actually relates to the slide -- I would answer the second
question first -- where we had the overlap. So let me see if I can go back. So what we noted was
that it wasn't flat as in like you'll see that the overlap was higher over here and over here
compared to somewhere in the middle.
So what I think you're asking is what happens when we consider only data which is like --
>>: Point 1 and point 8.
>> Munmun de Choudhury: It's not monotonically increasing or decreasing. Like I said this is
something we'd like to do in future research. But it seems that the number of solutions that we
might have fitting the criteria for relevance, it might be much lesser when it's less diverse or highly
diverse compared to ones in the middle. So that's why because we have lots of solutions over
here, we have low overlaps. So I think that's how I think -- did I answer that question?
>>: So what I'm just trying to figure out is, power laws are like the 80/20 rule, sort of. If I just took
20 -- how much more intuition do we get from this than if I had just taken, say, the top 20 percent?
At point 2 I would get a sweet spot and at point 20 I would get a sweet spot?
>> Munmun de Choudhury: I think we would see the same set of patterns although the results
might vary, the reason being that this empirical observation is more a property of the
dimensions, or the sparsity of the space, than of the actual values of those
dimensions. Which is to say, when we look at the head of the distribution, probably
the values of the features would be very high; for example, the frequency of
communication would be very high and so forth.
So I think it's more about the structure of the sparsity pattern than the actual values of the feature.
So I think you'll still see the same pattern even when we look at the head.
>>: I think another way potentially to look at it is that as you squeeze the entropy to a really small
value, really large value, you're effectively putting a lot of constraint in the data. You're saying I
want to find the stuff that's really far apart from each other and find stuff that's really consistent
and there's probably not that many ways to do that in a given set. So seeing the -- the point 3 being so
low, the kind of U-shape you might see, is an interesting phenomenon when you try to squeeze the
entropy very low or very high.
>> Munmun de Choudhury: That's what I was going to say with the solutions part. And actually
during summer we discussed extensively on it. So there seems like there's a fewer solutions at
the end of the spectrum, which is what you said.
>>: Actually, I had one comment. When you look at multi-document summarization, excuse my
voice, that's in a sense when you create summaries from, let's say, 50 different stories about a
topic. You split them all into sentences, so you have a gigantic bag of sentences and a limit on the
summary, and you have to choose your sentences. The best summaries are really the
ones where you have diversity in the limited space you have, so high entropy, which gives you sort of as
much information about the whole story space as possible, or you get something that's --
>>: Very thin points.
>>: You get like the main facts and that's the -- the worst summaries are the ones where you have
a mix, some new information and some repeated information, and I think that's very analogous
to your sampling.
>> Munmun de Choudhury: Yes, totally.
>>: This behavior where the perceived relevance was higher when it was very diverse or very
nondiverse, was it consistent across people or queries, or were some people or some topics more
consistent than others?
>> Munmun de Choudhury: It's limited to the two topics we looked at, Windows Live and
iPhone, because we can only get so much feedback from the users in the user study. We focused
on two topics. It's also limited in the sense that it's limited by the 67 participants we looked at.
But because it has a good mix of gender and it's over a wide range of different age
groups, I guess it's pretty generic.
>>: So it's not that some people prefer very diverse and some people prefer it very --
>> Munmun de Choudhury: Yeah, it seems like that, but we, of course, don't have the numbers on
who prefers diverse or who prefers homogenous, but it seems that people like one of the two.
>>: So when I think about this down-sampling technique, when I think about applying it to sort
of news-like tweets, so I'm reading about the Haitian earthquake or something, it makes a lot of
sense to me. But when I think about applying it to tweets that are closer to me in my social
network then I get concerned that anomalies might be a more interesting thing. For example,
maybe my friend consistently tweets about the weather or something stupid but then suddenly
tweets I'm pregnant. I don't want that one filtered out. So it's unclear to me like how this
algorithm is going to not filter that one out, because it's an anomaly. It's just a complete one-off.
>> Munmun de Choudhury: Well, in some sense it's taken care of if you want -- if you are ready
to have diverse information, because this aspect will probably be covered by the thematic
distribution of the tweets. So you might end up having different angles, not just all weather tweets.
But as such, our down-sampling technique doesn't exactly look at these kinds of anomalies.
>>: I'm also wondering if the set of features, I realize this is sort of speculative, but I'm wondering
if the set of features that I'm interested in may in fact vary with the distance in the network from
the original tweet.
>> Munmun de Choudhury: Sure. Sure. We can always add that on top of what we have
already built. And it could be a part of how we learn the weights of the significance of the
dimensions.
We actually might want to weight the dimensions differently for users who are far away compared to my
immediate friends. And that is exactly a very classic example of actually personalizing this thing.
Like right now it's like irrespective of who among us actually looks for information on -- we're
always going to get more or less the same set of topics. But in practice we might want to
personalize this, and actually that's where weighting in this manner, based on my distance to the
user or the activity, comes in.
>>: In the previous study about the link filtering, is there an implicit assumption there that there is
an optimal link weight filtering that you can apply that's relevant for every person? It seems like
people have such individualistic patterns of communication.
>> Munmun de Choudhury: Right. That's a very good question, actually. And that was also one
of the comments when we wrote the paper. We could always do that, but one of the problems
is that learning those individualistic thresholds would
be computationally expensive for very large datasets.
>>: Or you compute some features which are normalized for a person's total e-mail volume or
e-mail rate, and that doesn't explode the computation.
>> Munmun de Choudhury: Right. Right. But if you think about the different studies, the
prediction tasks that we did, they're more at the macro network level; they look at things like how
groups form and stuff. So they don't exactly look at the individual aspect.
And I think the individualistic aspect is taken care of to some extent based on the weighting of the
features that we saw there. So basically what we had was we weighted the features based on
the communication properties of that particular user when we were doing the prediction. So that
kind of personalizes it to some extent. We had both these weighted features and unweighted features.
But I --
>>: At least recognize our --
>> Munmun de Choudhury: So the unweighted features represent every node in the
network based on a certain set of structural attributes, like the clustering coefficient. And we can
also weight it based on how frequently that person talks to that friend of his.
So that could be the weight. That kind of takes care of that personalization threshold in certain
way. Although it's not very intuitive directly. But I agree that I mean in practice different people
usually have different thresholds. And in fact there's some work by Robin Dunbar showing that there's a
neocortex limit on the number of friends you can have, which is 150. And usually the
number of people we actually talk to is much less than that; for some it's 20, for some it's
30 and so forth. It will be interesting to learn those thresholds at the individual level, if the nature
of the research question asks for it.
>>: Since you're talking about pairwise behavior, that's got to change over time.
>> Munmun de Choudhury: Sure. So this is a single snapshot study. So one of the datasets
was over two years. So it was kind of normalized over the two-year period. We did not look at
individual granularities. It will be good to probably look at small time scales and see if that
threshold changes for those time scales also.
>>: I think there's an interesting thing here. First, it was neat that you had several different measures that
you're looking at, and you find them all pointing to the same kind of thing inherent in
the network aspect -- that you saw this effect, a peak that kind of showed up at some point, over
these different functional tasks, which is very nice.
So I think what's interesting is, if you look at the second study you were doing, I agree with you that
this notion of sampling is really important, with firehose-type data that's coming in. But it seems like
it's potentially -- there might be a much more diverse set of possible applications that a user might
want to do.
So in this case you've looked at, I think it's laudable that you looked at this to say what do people
do like with their streams and you found consistency in terms of [inaudible] and diversity. You
can imagine maybe if you took, there might be a wide range of different things like people
predicting marketing trends or people trying to predict stock prices, various other kinds of things,
and try to take those, a macro set of tasks not just user preference and see if you, again, see
some consistencies in terms of the sampling that worked the best.
>> Munmun de Choudhury: Right. On those lines, the sampling technique would, at least when
we pruned the information space we can still utilize that irrespective of what is our final task. But,
yeah, I agree, at least from the study perspective it's very pinpointed in the sense that we're trying
to understand it only from an HCI or interface design perspective, but, sure, yeah.
>>: The real question I have there really is about diversity. I think it's powerful in terms of
perception and the kind of things you're looking at I think diversity makes a lot of sense. I wonder
if you looked at something that's more of a classification task or something like that, maybe
diversity wouldn't be the axis that would get the most benefit maybe it would be something else.
Maybe the difference would help. It's hard to --
>> Munmun de Choudhury: Yeah. Well, my conjecture is that probably for a classification task
diversity is not that useful, probably.
>>: Might be. Some active learning stuff. It's hard to say. But it would be interesting.
>> Munmun de Choudhury: Definitely. Absolutely. Yeah. Yeah, I mean, one thing I've always
wanted to do in my dissertation, unfortunately I don't have the time: I studied all these social
processes that we saw in the beginning, but that was on all the
data that was available. There was no sense of sampling in them.
So actually it relates to your question now that we have a sampling technique, what if now we
study all these different problems with the sample data, do we observe the same dynamics?
>>: On all your papers.
>> Munmun de Choudhury: It's like going back.
>>: If you have time for that.
>> Munmun de Choudhury: I know.
>>: That would be interesting.
>> Scott Counts: Anything else? We'll do this one more time. [applause]