
>> Emre Kiciman: Hello, everyone. It's my pleasure to introduce Meena Nagarajan. She is coming to us from the Kno.e.sis Center at Wright State University, and previously the University of Georgia. She'll be talking to us about the work she's done on analyzing user-generated content in social media. This is work that she's done both during an internship here at Live Labs as well as internships at IBM and UC Berkeley before that.

Thank you very much.

>> Meena Nagarajan: Thanks, Emre. Good morning, everybody. I think it's going to be fairly informal. Thank you for stopping by and thank you for people joining us on the webcast. I don't know where to say hello to them.

My name is Meena Nagarajan and I am from the Kno.e.sis Center at Wright State University.

Previously I was at the University of Georgia; my advisor moved to Wright State University and the entire lab followed him there.

The Kno.e.sis Center, if you don't know it, is one of the largest groups in the area of the Semantic Web. My research itself has focused on understanding user-generated content on social media.

So let's get started.

Social media needs no introduction to this audience, right? Over the last few years we have all been part of this change in the online media landscape, from publish-oriented media to more conversational, user-oriented media, where users share, participate, collaborate, and create a variety of content via a variety of platforms.

What this has changed is not only how and what information we share but also how we wish to seek information. Users today are no longer satisfied with just factual information; they want to know what opinions are generated around a topic and what conversations are going on around it.

With the click of a button they want to know what kind of music people like them listen to, and they want to use their social network just the way they use their Web search engine.

And what all of this means is that we need a very strong understanding of the several participating variables of this social system.

Not coincidentally, our community's efforts have focused on one or more of these areas: understanding the network of ties that people form and the social structure and behavior that emerge out of it; understanding who the people are who are part of this network and generating this content; and, thirdly, what content people are sharing and what effects it is creating within the network.

My dissertation work has been devoted to the third of these topics in understanding user-generated content through the lens of text analysis.

There's been a rich body of work in processing [inaudible], but certain characteristics of this medium make this problem a little challenging.

As you all know, communication on social media is fairly interpersonal. It is oftentimes unmediated. And what that means is we see a lot of informal English content. We see abbreviations and slang, context-dependent terms, all of them delivered with an indifferent approach to grammar and spelling.

Oftentimes context is implicit. People who are talking to each other don't always bring the shared context that they have to the table.

And oftentimes the variations and creativity that we see in user expression can be attributed to the medium itself. Twitter is a great example of this: it limits expression, and therefore context, to some extent.

And if you go on the other spectrum, you see [inaudible] and question-answering platforms that encourage conversations, and so you start seeing a lot of noise that is clearly off topic.

What those properties mean is that the traditional content analysis techniques that were built for more formal and more organized forms of text, like news or scientific articles, don't translate very well to social media platforms.

So in my work I address some of those challenges in processing user-generated content towards three different goals, all aimed at adding some sort of structure to unstructured content.

I ask three questions in my work. What are people talking about: what are the named entities and topics that they're making references to in their text? Why are they writing: can we understand user goals and intentions by looking at the content that they put out? And how do people write: what can we say about an active population just by looking at their word usage? Can we say something about individual allegiances, or conformance and nonconformance to group practices, by looking at how they write?

And in all of these cases, whenever there is poor context, what role does external domain knowledge play, be it from ontologies, taxonomies, dictionaries, or even from the social medium itself?

And finally, what are the applications and consequences of understanding such user-generated content? Can we learn something new about the data and dynamics of online social media?

So in the first half of my talk I will focus on those three micro variables that I've had a chance to investigate, the what, the why, and the how. I'll spend most of my time on the first topic and briefly introduce the other two, so if it generates any interest, we can come back to them in the Q and A session.

In the second part of the talk I'd like to describe two of the Web applications that I've had a chance to work on; both of them tap into different forms of crowdsourced intelligence. The first one, called the BBC SoundIndex, was built in collaboration with IBM. And the second, called Twitris, was built in collaboration with my colleagues at Kno.e.sis; it is a platform for browsing realtime data.

If I have time I will also share with you a study that uses properties of content to explain properties of information diffusion on Twitter.

Let's get started.

So one of the most exciting topics that I've had a chance to work on has been named entity recognition in informal text. Here you're seeing two examples, excerpts from blogs that mention the names of two movies amid very strong movie context, but both are false positives. In the first case, The Hangover, also the name of a movie, is clearly not referring to the movie; it's referring to a very fun previous night. And in the second case there's a reference made to Wanted in its video game sense, not so much in the movie sense.

So in my work I focused on a particular class of named entities called cultural named entities. These are named entities that are also artifacts of culture: movie names, titles of books, songs, what have you.

Several cultural entities are especially hard to recognize in text because they're also common words in English. So movies like Up, Twilight, and Crash, books like It and Push, and so on.

Moreover, several cultural entities participate in varied senses, and several of those senses are not even documented well. Star Trek is a good example of this: it is popular as a movie, a TV series, and even a media franchise, but there's also a Star Trek cuisine that appeared in blogs.

How many of you know of that?

And another thing that happens with cultural entities is that their meaning often tends to change with time. So The Dark Knight, the movie, was recently used in reference to President Barack Obama and his health care reform.

Now, what all this means is that when a traditional named entity recognition system is trying to spot these entities, it is impractical to assume the system has a comprehensive knowledge base of senses. We are forced to relax this closed-world assumption and open the challenge up to an open-world sense assumption, where we don't know everything about such entities.

In the work that I'm about to describe to you, we look to improve named entity recognition using a novel complexity-of-extraction prior. It's a feature-based approach, and this was conducted in collaboration with Amir Padovitz here at Microsoft.

So we hypothesized that knowing how hard or easy it is to recognize an entity might teach a classifier to recognize these entities better.

So imagine this [inaudible] scenario where a classifier is faced with these two cases, the movie The Curious Case of Benjamin Button and the movie Wanted. Now, a traditional classifier, when faced with these two cases and comparable signals around them, is going to use the same inference rules for both these entities. But if the classifier knew that The Curious Case of Benjamin Button was much easier to extract compared to Wanted, which occurs in many varied senses and was therefore much harder to extract, then it has the opportunity of inferring the signals around them differently.

So it's really an entity-specific feature, this complexity of extraction that we are proposing.

And so in this work we decided to go and validate this hypothesis. So how does one characterize the complexity of extracting an entity in a target sense, say movies? Many of you are probably thinking entropy: if there is varied context around the entity in a corpus, then it's harder to extract and its complexity of extraction should be much higher.

It's a similar measure, but it's not quite what we're going for, because you could be faced with a scenario where there is high entropy but all of it is related to the sense that you're interested in extracting, in which case the complexity of extraction should be low.

Imagine another scenario where there is relatively low entropy but the sense distribution in that corpus is completely skewed away from the sense that you're interested in. So you are interested in movies, the corpus distribution is about video games, and you don't know that.

So clearly entropy is not the most accurate way to characterize the complexity of extracting an entity in a particular sense.
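To make that contrast concrete, here is a small illustrative sketch in Python; the corpora, context words, and sense labels are invented for illustration and are not from the study. It shows one case where context entropy is high yet every mention is in the movie sense, and one where entropy is low yet the corpus is skewed toward the video game sense.

    import math
    from collections import Counter

    def entropy(counts):
        """Shannon entropy (in bits) of a distribution given as raw counts."""
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

    # Case 1: lots of contextual variety, but every mention is still in the movie
    # sense, so the entity should actually be easy to extract in that sense.
    case1_contexts = ["trailer", "sequel", "cast", "director", "box office", "review"]
    case1_senses = ["movie"] * 6

    # Case 2: very little contextual variety, but the corpus is skewed toward the
    # video game sense, so extracting the movie sense is hard.
    case2_contexts = ["gameplay"] * 9 + ["trailer"]
    case2_senses = ["videogame"] * 9 + ["movie"]

    for name, contexts, senses in [("case 1", case1_contexts, case1_senses),
                                   ("case 2", case2_contexts, case2_senses)]:
        h = entropy(Counter(contexts))
        support = senses.count("movie") / len(senses)
        print(f"{name}: context entropy = {h:.2f} bits, "
              f"fraction of mentions in the movie sense = {support:.2f}")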

Here is our proposition for characterizing this feature. So imagine you're interested in extracting Star Trek in the movie sense from a general blog corpus. You know nothing about the corpus, and the entity appears in all these blogs; of the hundred documents that it appears in, only ten of them mention it in the movie sense. So the support for extracting Star Trek is intuitively something like 10 over 100.

Now imagine the movie Up, which would presumably appear in more contexts and more documents; say it appears in 500 documents but only 20 of them mention it in the movie sense. So the relative support that Star Trek has compared to Up is higher, and therefore its complexity of extraction is lower.

So that's the intuition behind it.
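A quick worked version of that intuition, using the illustrative counts from the example above (a toy sketch, not the actual scoring code):

    def sense_support(docs_in_target_sense, docs_mentioning_entity):
        """Proportion of an entity's mentions that are in the target sense.
        Higher support means lower complexity of extraction."""
        return docs_in_target_sense / docs_mentioning_entity

    star_trek = sense_support(10, 100)   # 10 of 100 blog posts in the movie sense
    up = sense_support(20, 500)          # 20 of 500 blog posts in the movie sense

    # Star Trek has higher relative support than Up (0.10 vs 0.04), so its
    # complexity of extraction in the movie sense is lower.
    print(star_trek, up)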

But at the core of this measure is knowing how many documents mention our entity of interest in the sense that we are interested in. So that is what we really need to capture.

And measuring that boils down to determining which documents mention the entity in the target sense. Now, how do you go about doing that? Given the impracticality of enumerating sense definitions for cultural entities, all you know is some information about the sense you're interested in, say that I'm interested in the movie sense, and you know nothing else about the distribution.

So in the framework that we proposed that I'm going to explain, we work under the open-world sense assumption, that all we know is the sense definition of that entity in a particular sense and we assume nothing else about the other senses.

So here's the basic idea. You start with a sense definition for the entity that you're interested in extracting, say Star Trek, where a sense definition could be just a group of words that define the sense this entity occurs in. You then propagate this sense evidence through the contexts in which the entity occurs in the corpus that you're interested in extracting this entity from.

Now, what this does is extract a very strong sense-biased language model. And then you cluster the documents using dimensions of this sense-biased language model to identify documents that are related to your sense. That is the basic idea. And the higher the proportion of documents that mention your entity in the target sense, the easier its extraction will be.

Now, this is a framework, and it can be used with several underlying algorithmic implementations. You could use, say, random walks for propagating sense evidence. In our work we used spreading activation networks and the Chinese Whispers clustering algorithm, but you can use any clustering algorithm for that step.

The key modification to both of these well-known algorithms comes from our sense specification. So we assume a sense definition from Wikipedia infoboxes; you could grab this from anywhere else. What you're seeing there is an example of a sense definition for Star Trek: it is just other entities that are related to the entity of interest. And assume that we're in the movie sense right now.

The spreading activation network itself is a word co-occurrence network generated from words that occur around the entity.

However, because what we are interested in is a sense-biased problem, the nodes and edges are weighted appropriately, so that nodes strongly related to the sense carry more weight.

So if the words in the corpus also appear in the sense definition, then they have a very high weight of 1. If they don't appear in the sense definition, we assume we know nothing about them and give them a low weight of .1.

We intentionally stepped away from this traditional statistical importance of the entity in a corpus because we did not want that to bias our sense definitions or our sense-biased approach.

And of course you all know how spreading activation works: in our work, if we have ten sense nodes in the sense definition, there are ten pulses that propagate the evidence in those sense nodes throughout the network and therefore activate different parts of the graph.

And at the end of it, the final activated portions that you see are words that will be strongly biased or related to the initial sense definition, because those were the nodes that we pulsed.
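Here is a minimal sketch of that sense-biased spreading activation step, under some stated assumptions: the co-occurrence graph, the three-word sense definition, and the decay and pruning threshold are all toy values chosen for illustration, not the ones used in the actual system.

    from collections import defaultdict

    def spread_activation(edges, sense_seeds, decay=0.8, threshold=0.02, max_hops=3):
        """Toy sense-biased spreading activation over a word co-occurrence graph.

        edges: word -> {neighbor: co-occurrence strength in [0, 1]}
        sense_seeds: words from the sense definition; they get node weight 1.0,
                     every other node gets the low default weight of 0.1.
        Returns the final activation of every word reached from the seed pulses;
        highly activated words form the sense-biased language model.
        """
        node_weight = defaultdict(lambda: 0.1, {w: 1.0 for w in sense_seeds})
        activation = defaultdict(float)

        # One pulse per sense-definition node, as described in the talk.
        for seed in sense_seeds:
            frontier = {seed: node_weight[seed]}
            for _ in range(max_hops):
                next_frontier = defaultdict(float)
                for word, energy in frontier.items():
                    activation[word] = max(activation[word], energy)
                    for neighbor, strength in edges.get(word, {}).items():
                        out = energy * strength * node_weight[neighbor] * decay
                        if out > threshold:
                            next_frontier[neighbor] = max(next_frontier[neighbor], out)
                frontier = next_frontier
        return dict(activation)

    # A tiny co-occurrence graph built around "star trek" mentions in blogs.
    edges = {
        "star trek": {"movie": 0.9, "trailer": 0.8, "cuisine": 0.2, "recipe": 0.2},
        "movie":     {"trailer": 0.9, "theater": 0.7},
        "trailer":   {"transformers": 0.6},
        "cuisine":   {"recipe": 0.9},
    }
    sense_definition = ["star trek", "movie", "trailer"]   # stand-in for infobox entries

    language_model = spread_activation(edges, sense_definition)
    print(sorted(language_model.items(), key=lambda kv: -kv[1]))

Running this, the movie-domain words (theater, transformers) end up activated while the cuisine words stay below the pruning threshold, which is the kind of sense-biased language model the pulses are meant to produce.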

Here's an example of a sense-biased language model that was extracted for Star Trek from a general blog corpus in the movie sense.

As you can see, this process of pulsing the sense definition tends to extract words that are related not just to the entity Star Trek but to the movie sense. So you're seeing words like movies, trailers, transformers, and things that really generalize more to the domain.

Our goal, however, is to identify documents that mention entities in this target sense. So we take this language model which clearly has entities that are biased to a particular sense, represent documents using this language model and cluster them.

You can probably imagine now that documents that don't have any membership in this language model get thrown away because they're not related to that sense, and documents that have only a little evidence end up with a very sparse sense term vector.

We then cluster the documents, and because the documents are related to the sense, the weight of a cluster will also reflect how related it is to the target sense.

So there could be low-scoring clusters and high-scoring clusters. When the score of a cluster is low, it has less evidence of being related to the sense; when it's high, the documents are presenting a lot of evidence that they are in that particular sense.

After a threshold-based elimination, you can count the number of documents that are related to the sense, and that gives us our complexity-of-extraction score, the proportion of documents.
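And here is a minimal sketch of the clustering step, again under simplifying assumptions: the talk used the Chinese Whispers algorithm, but since any clusterer works for this step, this toy version uses k-means, together with a hand-made sense-biased language model and five invented blog snippets.

    import numpy as np
    from sklearn.cluster import KMeans  # the talk used Chinese Whispers; any clusterer works

    def complexity_of_extraction(docs, sense_lm, n_clusters=2, score_threshold=0.5):
        """Toy estimate: represent each document over the sense-biased language
        model, cluster, keep high-scoring (sense-related) clusters, and return
        1 - (proportion of documents judged to be in the target sense)."""
        vocab = sorted(sense_lm)
        weights = np.array([sense_lm[w] for w in vocab])

        # Term vectors restricted to the sense-biased language model.
        X = np.array([[1.0 if w in doc.lower().split() else 0.0 for w in vocab]
                      for doc in docs])

        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

        in_sense = 0
        for k in range(n_clusters):
            members = X[labels == k]
            # Score a cluster by how much sense evidence its documents carry.
            score = float((members * weights).sum(axis=1).mean()) if len(members) else 0.0
            if score >= score_threshold:
                in_sense += len(members)

        support = in_sense / len(docs)
        return 1.0 - support   # harder to extract when fewer docs are in the target sense

    sense_lm = {"movie": 1.0, "trailer": 0.9, "theater": 0.6, "director": 0.6}
    docs = [
        "loved the new star trek movie trailer at the theater",
        "star trek movie director interview at the theater",
        "tried a star trek themed recipe last night",
        "star trek cuisine party with klingon dishes",
        "star trek video game walkthrough part three",
    ]
    print(complexity_of_extraction(docs, sense_lm))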

Here's an example that validates the framework. We took a list of movies and tried to compute the complexity of extraction in a general blog corpus. As you can imagine, movies like Twilight and Up should be harder to extract compared to Angels and Demons or The Dark Knight. And that's what we found with our algorithm: relatively speaking, those entities were much harder to extract than The Dark Knight, The Hangover, or Angels and Demons.

Mind you, we did not know anything about the distribution. It was a general blog distribution and therefore hopefully characterizes these entities in several varied senses.

But the goal, of course, was to see if this feature improves named entity recognition, correct? So we used three state-of-the-art classifiers. The goal here was not to identify the most suitable classifier but to show that this measure is useful with any underlying prediction approach.

So we used three classifiers: decision tree, bagging, and boosting. We hand labeled 1,500 movie spots in this general blog corpus and evaluated these classifiers with and without a prior.

The basic features that we used were things that a traditional extractor already has at its disposal: is the spot capitalized, is it all capitalized, is it in quotes, what is the context surrounding the spot, and knowledge words, something that classifiers have -- yes?

>> I may have missed it, what did you mean by a spot?

>> Meena Nagarajan: Sure. So we're assuming we're in the spot-and-disambiguate setting. You take an entity, Twilight, and you see if it is spotted in text, and the goal of the classifier is only to identify whether this spot is a valid mention or not.

>> [inaudible]

>> Meena Nagarajan: Correct. So coming back to our features. The knowledge features, something that traditional extractors have available to them -- they could go to the Wikipedia infoboxes and get it -- are simply sense definition matches, either in the blog, in the same paragraph, in the title of the post, in the URL of the post, and so on.

The second knowledge feature that we use is the extracted language model, something that only we're aware of. And we also evaluated this against two priors: the baseline is the contextual entropy prior that we visited a few minutes ago, and then our proposed complexity of extraction.
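For a sense of how this prior enters the classifier, here is a toy sketch with hand-made feature vectors (capitalization, number of sense-definition words nearby, and the complexity prior); the vectors, labels, and values are invented purely for illustration, and the decision tree stands in for any of the three classifiers.

    from sklearn.tree import DecisionTreeClassifier  # one of the three classifiers used

    # Each row is a toy candidate spot:
    # [is_capitalized, sense_definition_words_nearby, complexity_of_extraction]
    # The third column is the proposed entity-specific prior.
    X = [
        [1, 1, 0.2],  # easy entity, one sense word nearby      -> valid mention
        [1, 0, 0.2],  # easy entity, no supporting context      -> false positive
        [0, 0, 0.2],
        [1, 1, 0.8],  # hard entity, one sense word not enough  -> false positive
        [1, 3, 0.8],  # hard entity, strong sense evidence      -> valid mention
        [0, 0, 0.8],
    ]
    y = [1, 0, 0, 0, 1, 0]

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    # The two candidates below differ only in the complexity prior, so the tree
    # can demand more evidence before accepting a spot of the harder entity.
    print(clf.predict([[1, 1, 0.2], [1, 1, 0.8]]))   # -> [1 0]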

So we throw this into the three classifiers. And as you can see, the baseline does about as well as our proposed prior, very close, but there is an overwhelming improvement in entity extraction over the traditional extractor settings. The same holds for the other measures and the accuracy that we're seeing.

But we need to think about this contextual entropy a little more closely. It seems like a very strong baseline, but if you go back and think about contextual entropy, it is not going to accurately capture the varied senses in the high-entropy and low-entropy cases that we discussed previously.

And because we were focusing only on a few movie named entities, seven or eight entities and 1,500 spots from one corpus distribution, we think that as we expand this to more types of entities and different corpus distributions, we'll start seeing the distinction between the baseline and our measure.

But this already gives you an indication of the usefulness of this feature that we propose.

>> [inaudible] significant?

>> Meena Nagarajan: Yes. I'm going to talk about the other two topics now, switch gears a little bit. I wanted to briefly describe my work in the next two topics, and of course we can come back to it if you have more questions.

The second problem that I've worked on is identifying user goals and intentions behind what they write. So what can we understand from the text about why users put information out there? Is there any transactional intent in the post that we can exploit?

Is there an opinion or information-sharing intent that question-answering systems can exploit beyond topic relevance?

We've made great strides on this problem for understanding Web search intent. However, there is an important difference when it comes to user-generated content.

So what we notice is that just the presence of an entity or its type does not always accurately tell you what kind of intent might be there in the content. You're likely to find one or more of these broad intentions associated with the same entity.

"I'm thinking of getting an iPhone" is a transactional intent. "I like my new iPhone" is an information-sharing intent. And "what do you think about my iPhone" is an information-seeking intent.

So just spotting the entity doesn't take us very far in free text on social media.

I approached this problem of identifying intents as one of understanding action patterns. These are nothing but words and patterns that surround the named entity, essentially steering away from the entity-centric approach and toward classifying the words that occur around it.

In my work I focused on a subset of this problem: identifying just transactional and information-seeking intents, typically information that online advertisers can exploit.

So the intuition behind my approach was this: to identify these intents, when people are seeking information or expressing some transactional need, I found that they use combinations of one or more of these word categories. So when they say "I'm wondering if I should buy this" or "can someone tell me where to go find this," such expressions of intent are composed of words from one or more of these categories.

Wh-question words, you know, what, why, when, and who; cognitive process words, which indicate the user is thinking about doing something; adverbs; impersonal pronouns, when they're referring to their social network and saying "someone help me"; and transaction-oriented words, like buy, get, trade, et cetera.

So building on this intuition of word memberships, I developed a weakly supervised bootstrapping algorithm that starts with a handful of patterns that clearly indicate monetization intents and then learns new patterns with similar meaning.

So the end goal, the way you should think about it, is to build a lexicon of patterns that are likely to occur in user-generated content that presents these intents.

The new patterns that I extract are scored against two criteria: empirical support in the corpus and semantic compatibility. I define semantic compatibility as communicative-function compatibility: if two words like whether and how are used communicatively for the same task, say a cognitive process, then they are semantically compatible in a pattern.

And I get this information from a dictionary called Linguistic Inquiry and Word Count (LIWC), which is basically a dictionary of English words categorized along several dimensions: linguistic categories, personal and psychometric properties, affect words, and so on.

We took this [inaudible] of patterns that we learned and plugged it into a targeted content delivery system. The goal was to understand, first of all, if it is even possible to identify monetizable posts, and, second, assuming that you're able to identify monetizable posts, whether user attention is amenable to ads that are displayed against these posts.

So that's exactly what we studied. We crawled a bunch of Facebook profiles from users who had given us permission, crawled their wall posts as well, identified those that had monetizable intents, and showed them ads: one set based on their profile information and a second set based on their monetizable posts.

And we found that users were eight times more likely to deliver attention to ads that were displayed for their monetizable posts, which is a very intuitive conclusion because profile information rarely has purchase intents or is even current, for that matter.

And the third project that I'd like to explain to you, this -- yes.

>> Emre Kiciman: So are you going to talk about how you identified monetizable --

>> Meena Nagarajan: I can do that. You want to do that now? I was worried about time. I could come back to it or do it right now.

>> Emre Kiciman: Maybe [inaudible] finish and then come back to it later.

>> Meena Nagarajan: Okay. I do have slides for that, so we can come back in the Q and A session.

The third thing that I wanted to share with you today is studying language usage and self-expression: what kinds of words do people use, and what can that tell us about an active population?

In my work I focused on one such language usage study where I looked at self-presentation strategies in online dating profiles. The larger goal here was to understand correlations between textual expression and the perceived attractiveness of a dating profile. And this was work conducted in collaboration with Professor Marti Hearst at UC Berkeley.

So we took a bunch of profiles. Yahoo! Personals, which you all might know, is a paid dating site, which means the quality of data is very good. We took 500 male and 500 female profiles and analyzed the expression component, the free-text component, which was the "Me and My Partner" section, where they talk about themselves and also about what they're looking for.

And we ran this through the LIWC text-analysis program, which basically breaks down a user-written profile into these many word categories. It says this profile had X percent affect words, X percent psychological words, and so on.

And then we conducted an exploratory factor analysis over these profiles to identify systematic co-occurrence patterns among the words, because the goal here is to understand what words people are using and how they are using them in combination.

At the outset we found similar factor structures for both men and women, and this is reflective of similar underlying communicative functions. I do have a slide for the factor structures, so if there's interest we can come back and look at it. We also found a lot of similarities in the actual words that men and women used, especially in the open-class category words, like affect words and verb groups.

What was more surprising was that men were using a higher proportion of tentative words, like maybe, perhaps, sort of, these words that were typically attributed only to feminine discourse in past studies. And so it could be that men are temporizing their language to seem more attractive to women.

So we took these factor scores for every profile and clustered them, basically to reveal profile types. And what that showed us was that a majority of people, and this is more than 75 percent of the profiles, used all of these LIWC variables in moderation.
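To make that pipeline concrete, here is a toy sketch of the two analysis steps, LIWC category percentages fed into an exploratory factor analysis and then clustering of the per-profile factor scores; the category list and all the numbers are synthetic stand-ins, not the study's data.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Synthetic stand-in for LIWC output: one row per profile, one column per
    # category, values are the percentage of the profile's words in that category.
    categories = ["affect", "cognitive", "tentative", "social", "biological", "i_pronouns"]
    profiles = rng.uniform(0, 15, size=(200, len(categories)))

    # Step 1: exploratory factor analysis to find systematic co-occurrence
    # patterns among the word categories.
    fa = FactorAnalysis(n_components=3, random_state=0)
    factor_scores = fa.fit_transform(profiles)

    # Step 2: cluster the per-profile factor scores to reveal profile "types".
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(factor_scores)

    print(np.round(fa.components_, 2))   # factor loadings over the categories
    print(np.bincount(labels))           # size of each profile type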

But there was a group of profiles that exhibited clear contrast. So the cluster that you're seeing was high on a factor that showed immediate interaction and activities. They said "I am this, I like to do this." These are the profiles that were focused a lot on talking about themselves and the activities they liked.

And when men did that, they used a lot of affect and positive words and cognitive emotion words, biological words. And when women did the same thing, they used a lower frequency of the exact same words. So men were making more references to other people when they were talking about themselves. They were making use of more happy words and biological and sexual words, and women were doing the exact opposite.

And it's not very clear what this bundling of traits means really, but what is obvious is that there are more similarities than differences in how men and women self-present in online dating profiles.

Now, this could well be a property of the context that we are in, because it's a very agenda-driven context; there is a goal behind the self-expression. But, again, we don't know that for a fact, because, as far as I know, nobody has studied self-expression on MySpace or Facebook yet to know if these gender differences carry over there.

But then we put this result out and we spoke to a few social psychologists, and many of them said, you know, it could be that men and women are trying to imitate each other in courtship, which you often see offline, and maybe you're seeing that online too, where men are imitating women and women are imitating men in writing styles and therefore you're seeing fewer differences and more similarities in the words that they use.

So that was an interesting conclusion.

So these were some of my research projects. They all seem a little isolated. But before I finish I wanted to share with you some of my applied research experience where you'll see some of these separate variables coming together.

And this also indicates the potential there is in a deep analysis of user-generated content instead of just a shallow counting of things.

The first project I wanted to share with you was conducted in collaboration with IBM. It's called the BBC SoundIndex. Here's the motivation for this work. The BBC was seeing an increasing discordance between the songs and artists that were being reported in popular music charts and the songs that were being judged as popular by the disc jockeys. So these two lists were not matching.

And if you think about it, the music charts, like the Billboard charts, were generated using audio plays on the radio and CD sales, metrics that are not sufficient proxies these days for counting what is popular. These days a lot of us don't go and buy CDs or listen to the radio a lot; we listen to music online and we exchange comments about music artists online.

So the BBC said, let's go and tap this online populace and see what turns up there.

And the idea was also to generate lists so that people could ask questions like "what are people like me listening to?" So being able to tap into not just what people are saying but also who those people are.

So we tapped into a whole bunch of online sources. And over the last few years I've had a chance to fit into several pieces of this puzzle as far as my contribution went in the analysis of textual content.

Starting from named entity recognition of music entities in MySpace comments -- this is the work I did not have a chance to talk about over here -- but in that we used external domain knowledge from MusicBrainz [inaudible]. And sentiment analysis, identifying if the user is saying something good or bad about an artist and identifying spam and off-topic discussions.

When you're interested in rating Madonna for her music popularity, you don't want to hear about her boyfriend. So identifying such off-topic comments. And finally, aggregating user preferences across several lists.

This system was deployed, users rated the efficacy of the charts that were generated, and they found that online sources were a good proxy for challenging the traditional wisdom and polling that was being used.

The second project that I'd like to tell you about was Twitris. This was built in collaboration with colleagues at Kno.e.sis. Twitris, the word itself, is a combination of Twitter and Tetris.

And what it means is arranging Twitter data in dimensions of space, time, and theme. It's basically a platform for browsing realtime data.

The motivation behind this work was the unfortunate Mumbai terror attack in 2008, when people were posting tweets and Flickr images online, which, granted, was a great backdrop against traditional media reports. But for those of us who were sitting here consuming it, we were thinking: there's this recent development and we cannot tell what people in Pakistan are saying about it, because those tweets are just lost in the scores and scores of tweets that we were seeing.

So that was the basic motivation behind this work, to break things down along the dimensions of space, time, and theme, so that we can appreciate that during the healthcare reform debate, people in Florida were saying things that were very different from what people in Washington were saying.

And these are really indicative of the fact that we're able to preserve these social perceptions that actually generated this data. And that is a very valuable thing when it comes to newsworthy current events.

Another thing that Twitris does is overlay these social observations against traditional media reports, Wikipedia pages, news articles, and the tweets themselves, because sometimes users say things that are not obvious to anyone reading them.

So "soylent green" was a great example. This phrase was very common in the healthcare reform debate, and we were not sure what was going on. Apparently people were making a connection between the movie Soylent Green, in which the government was mandating a lifestyle, and Obama's healthcare reform, which they saw as mandating a certain kind of lifestyle. So that is what this tool was meant to do.

If you have questions about the architecture, we can of course go back and talk about it.

So that was a summary of some of the work that I did for my thesis. And before I end this talk I'd like to say more about some of the topics I'd like to work on in the next few years.

So today we understand large networks very well, we understand how they form and how they break, but we don't know a lot about how the semantics or style of content fits into these network observations.

And we can say the same thing about our understanding of a corpus of content. We understand that very well, but we don't understand what effects a network has on the kind of content that is generated. And of course there is a very clear participant dimension in this.

So if I were to summarize my future interests under one umbrella, it would have to be studying combined interactions of the people, content, and network properties.

At a theoretical level I'd like to build and observe models that incorporate these interactions.

And of course build applications that exploit and promote such dynamics.

One of the topics that I'm interested in is studying information-seeking behavior on the social Web, and I see some of my intent work as building blocks towards this.

So what are the browsing and asking patterns on the social Web, and what implications does this have for integrating it with our other online activities? Can we seamlessly go from Facebook to the Web and back is a question that one would ask.

Is it even possible to do this on the social Web was the first thing that crossed our minds, and that is the measuring-the-intent-landscape work that I'm currently doing. From what I see now, it's looking like when people write on each other's walls, it's mostly interpersonal communication, mostly "how are you," "happy birthday" kinds of things. And it's followed by a lot of question asking, asking what did you do yesterday or where can I do this.

Facebook status messages, however, are more for information sharing, could be opinions or links followed by a lot of asking questions again.

So that is one space that I'm looking to investigate more.

I also see a lot of value in exploiting this people-content-network dimension in high-recall scenarios. Take realtime search, for example. If there is a local fire in Brooklyn, where I live, I don't want to be seeing tweets from everyone; maybe I want to see tweets by my neighbor who is also in my Facebook and Twitter network. And so there's a lot of value in personalization when you think about this three-dimensional dynamic.

Another topic where I'd love to see these interactions play a prominent role is explaining collective online behavior. I think there is a huge opportunity in explaining finer nuances of online behavior when you take all three of these dimensions into account.

Here is a sample study that we've been conducting with cognitive psychologists and a professor at Ohio State.

So we started looking at Twitter to see why some tweets went viral and not others. Before we could answer this question, we got sidetracked by another fascinating observation. It turns out that tweets that call for action, in which people are asking the community to do something, leave no trace of the retweet patterns; author attribution dies away very quickly. I have a picture of this if you want to see it.

On the other hand, information-sharing tweets, where people are posting links to videos, retained author attribution, and you could retrace how the tweet made its way through the network. It is intuitive, but what it means is that link-based diffusion models are going to have to take this into account, because it was not one or two tweets; it was a majority of tweets where you could not trace the retweet path through the network.

I am convinced of the opportunity that is in front of us in this space, and what excites me the most is the interdisciplinary effort and collaboration that it's going to require.

I'm going to stop now and take any questions that you may have or anything else you want me to explain in detail. Thank you.

[applause]

>> Can you talk about that?

>> Meena Nagarajan: Yes. Intent. I'm going to go to the next slide. I hid it. It's not playing it.

Okay. So I'm going to reiterate a little bit. The bootstrapping algorithm starts with a handful of seed patterns that humans pick out, so, for example, things like this. In this work I used four-gram patterns; you could use three or five. Things like "does anyone know how," "where do I find," "someone tell me where." And the algorithm goes on to find new candidate patterns in the corpus.

The basic idea is to generate filler patterns: substitute every word in the patterns that you know for a fact indicate intent with a wildcard, and go look in the corpus for other patterns that match.

For example, you would find "does anyone know you" and "does anyone know whether." And the goodness of a candidate pattern is evaluated against the known patterns. The semantic compatibility, which is the most important, is the functional compatibility between the word that was replaced and the word that was substituted for it.

So the compatibility between these two candidates and the seed is evaluated based on the functional compatibility between "you" and "how," which is zero, and between "whether" and "how," where there is an indication of functional compatibility, because both of them are reflective of a cognitive mechanism, something we know because they are both categorized as cognitive mechanism words in the LIWC dictionary.

And so "does anyone know whether" passes the functional compatibility test and goes on to the empirical support test. It's a very conservative way of expanding this seed pool of patterns.
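Here is a toy, single-round sketch of that expansion, assuming a tiny hand-made stand-in for the LIWC lookup and an invented mini-corpus; the real dictionary, seed set, and support thresholds are of course much larger and may be scored differently.

    import re
    from collections import Counter

    # Tiny stand-in for the LIWC category lookup; the real dictionary is a
    # licensed resource with far more words and categories.
    LIWC = {
        "how": "cogmech", "whether": "cogmech", "know": "cogmech",
        "where": "wh", "what": "wh",
        "anyone": "impersonal", "someone": "impersonal",
        "buy": "transaction", "find": "transaction",
    }

    def compatible(original_word, substituted_word):
        """Functional (communicative) compatibility: same LIWC category."""
        cat = LIWC.get(original_word)
        return cat is not None and cat == LIWC.get(substituted_word)

    def bootstrap(seed_patterns, corpus, min_support=2):
        """One conservative expansion round: wildcard each word of a seed 4-gram,
        collect matching candidates from the corpus, and keep those that pass
        both the compatibility test and the empirical support test."""
        counts = Counter()
        for seed in seed_patterns:
            words = seed.split()
            for i, replaced in enumerate(words):
                regex = re.compile(r"\b" + r"\s+".join(
                    [re.escape(w) for w in words[:i]] + [r"(\w+)"] +
                    [re.escape(w) for w in words[i + 1:]]) + r"\b")
                for post in corpus:
                    for match in regex.finditer(post.lower()):
                        candidate_word = match.group(1)
                        candidate = " ".join(words[:i] + [candidate_word] + words[i + 1:])
                        if candidate != seed and compatible(replaced, candidate_word):
                            counts[candidate] += 1
        return {pattern for pattern, c in counts.items() if c >= min_support}

    seeds = ["does anyone know how"]
    corpus = [
        "does anyone know whether the new phone is worth it",
        "does anyone know whether I should wait for the next model",
        "does anyone know you from that party",   # "you" is not compatible with "how"
    ]
    print(bootstrap(seeds, corpus))   # -> {'does anyone know whether'}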

Let me also show you some examples that I have. So here are some examples; some of them, you can see, are not very accurate, like "I do not want." Everything else is fairly clean. So these are some examples of the patterns that were learned using five seed patterns, and the corpus that we were using was an unannotated corpus of MySpace user-generated comments.

And then we evaluated these learned patterns on Facebook, so we took Facebook "to buy" posts, which means the posts are already categorized for intent: people are already asking for information and expressing a transactional need. We applied these patterns, and there was about 83 percent coverage.

What that means is that the five word categories that we used from LIWC were fairly general across these two platforms. We learned the patterns from MySpace and tested them on Facebook.

>> This seems plenty different [inaudible] you had sort of a commercial intent [inaudible].

>> Meena Nagarajan: Um-hmm. No, this is a commercial. This is the --

>> This is more an information [inaudible].

>> Meena Nagarajan: Information seeking and transactional. So they're both in some ways related to monetizable intents.

>> Some of these questions cannot be monetizable.

>> Meena Nagarajan: Correct.

>> So I could be just saying does anybody know how to [inaudible] in some program.

>> Meena Nagarajan: Um-hmm. So LIWC has this transactional word category. Let me show you the architecture. In that block that you see there, identifying monetizable posts, I go through two stages. First I identify whether there is any information-seeking intent, and then I filter for those that also have explicit transactional intent, that have these words.

>> Oh, so you get these phrases, then you look for the transactional intent.

>> Meena Nagarajan: Correct. So they're scored against both.

>> So you're really using the LIWC for the [inaudible].

>> Meena Nagarajan: Yes. Yes. In a lot of my work I want to emphasize that I rely a lot on domain knowledge that's already existing out there in addition to what I see from the content.

And this is one example of that.

And we use only those posts for delivering any targeted content. In this case I used Google's ad program. AdSense, sorry.

>> So the only information that you used to identify another phrase is potentially [inaudible] the LIWC dictionary?

>> Meena Nagarajan: No. So we learned these phrases, right, this lexicon of phrases that are indicative of information-seeking intent. We take these phrases and check for their membership in anything that you put out on Facebook -- direct membership, but we also see if maybe three of the four words of the gram occurred there.

And those are the wall posts, for example, that indicate some sort of information-seeking intent.

And then we see if there are any that express transactional intent, where they say things like buy, sell, eBay, such things.
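A minimal sketch of that two-stage filter, assuming an already-learned pattern lexicon and a small hand-made transactional word list standing in for the LIWC category; the loose three-of-four-words matching mirrors what was just described, though the actual matching and scoring in the deployed system may differ.

    TRANSACTION_WORDS = {"buy", "sell", "get", "trade", "ebay", "purchase"}

    def matches_pattern(post_words, pattern, min_overlap=3):
        """Loose membership test: at least three of the four pattern words
        appear, in order, in the post."""
        pattern_words = pattern.split()
        idx = 0
        for w in post_words:
            if idx < len(pattern_words) and w == pattern_words[idx]:
                idx += 1
        return idx >= min_overlap

    def is_monetizable(post, learned_patterns):
        """Stage 1: information-seeking intent via the learned pattern lexicon.
           Stage 2: explicit transactional intent via a transaction word list."""
        words = [w.strip(".,!?") for w in post.lower().split()]
        seeking = any(matches_pattern(words, p) for p in learned_patterns)
        transactional = any(w in TRANSACTION_WORDS for w in words)
        return seeking and transactional

    patterns = ["does anyone know where", "can someone tell me"]
    posts = [
        "Does anyone know where I can buy a cheap camera?",   # monetizable
        "Can someone tell me what time the game starts?",     # seeking, not transactional
        "Had a great weekend at the beach!",                  # neither
    ]
    for p in posts:
        print(is_monetizable(p, patterns), "-", p)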

And we use only those posts. Then we of course go and find keywords for advertising, and that module had to do something else: we had to throw away off-topic words. So here's an example. Someone said, "I want to know where I can buy this, I have this project due at Merrill Lynch tomorrow but I'm sick today with food poisoning." And the ads that were generated picked up Merrill Lynch and food poisoning, which were completely irrelevant to the topic.

And so in the last module that you see there, where I find keywords for advertising, we filtered out these off-topic terms using an information-theoretic algorithm. Yes?

>> So I'm curious on this topic about the evaluation, the eight times more attention, how did you measure that?

>> Meena Nagarajan: Sure. So we took a group of users, we asked permission to use their Facebook profiles, and we took the interests and hobbies that they expressed on their profiles and delivered Google AdSense ads on top of them.

Now, we cannot control demographic parameters, so you should remember that you need to interpret this with caution. The ads that they were seeing were not localized, which they would have been if they were seeing them on Facebook. We're really sitting outside of Facebook here, right?

So we showed them ads generated from their profile, and we showed them some ads based on posts that they had written, posts that had also been identified as information seeking and intent indicating.

And we showed them these two sets of ads; we did not tell them which ads were from where, just showed them a pool of ads and asked which ones they would be more interested in clicking on and knowing more about. And that's what we found out. And that was statistically [inaudible].

>> Was that separated from the Facebook posts that generated that?

>> Meena Nagarajan: What do you mean separated from?

>> Were these ads embedded on the Facebook pages [inaudible]?

>> Meena Nagarajan: No, no. We were sitting outside of Facebook. Back in the day -- this was in 2007, and it was not easy to write Facebook applications. Today it is and we're doing that.

But at that time it was outside. It was a little Web page thing outside Facebook.

>> So there's like a survey or something.

>> Meena Nagarajan: I should say "correct" for those who are listening. Any other questions?

>> I'm curious about the cultural named entities. What changes if it's not a cultural entity? What if you have [inaudible] something else [inaudible] similarly ambiguous meaning, but [inaudible] strictly a --

>> Meena Nagarajan: A cultural entity. So we haven't evaluated this feature -- actually, I should show you something else in a minute. We haven't evaluated this feature for different types of entities, but that is where we are going, because we want to show that this feature can be useful for generic named entity recognition. And we don't see why it would not be, because all we're doing here is trying to identify something in a particular sense and helping a classifier do that.

>> This is a feature that's going to change depending on where your corpus is from.

>> Meena Nagarajan: Correct.

>> Typically when we have humans perform [inaudible] and then it flattens out and then it kind of, depending on how [inaudible] or disappears. So depending on where you chose your corpus from --

>> Meena Nagarajan: That's absolutely right.

>> This can be a help or it could be completely --

>> Meena Nagarajan: Not necessary.

>> -- that could be negatively impacting your --

>> Meena Nagarajan: I don't know why it would negatively impact it. It might be completely unnecessary, because during the release of a movie people will most likely be referring to it in the movie sense, and so you probably don't have to bother with this; the signals that you see might not be polluting the movie sense that you're interested in.

But a few months later -- The Dark Knight, for example: the movie was released, and then there was this Barack Obama healthcare reform thing, and we were seeing a lot of blogs making that reference. And in that case it was useful.

Let me show you. We were seeing per-entity improvements as well. Maybe that will help you appreciate it.

So this is the basic traditional extractor versus the one with the complexity-of-extraction measure, and what you're seeing is the difference in improvement. So you can see something as hard as Twilight and Up versus something that was relatively easier to extract, The Dark Knight or The Hangover. You're seeing improvements.

Twilight, which you see here had a pretty bad performance when we threw in our feature, was surprising, and here is why.

So Twilight is a book, it's a movie, and it's also a common word in English -- you know, taking pictures at twilight, the time of day. And the language model that we extracted was biased to the sense, but it also had terms from these other senses. So the language model was essentially a little polluted; in spite of those two steps of the sense-biased language model and clustering, we could not remove all the noise.

So we were seeing examples like "I spent a romantic evening watching the twilight." The movie itself, and the book too, is in the romantic genre, and "watching" is a word you associate with a movie, but this post was clearly not about the movie, so you can see the confusion that the classifier was having.

Or "photos of the twilight from the bay," which was not entirely clear until I went and found a bunch of blogs talking about photos of the Twilight crew on the red carpet; at that time there was an event going on.

And there was this sense pollution between the book and the movie sense, because somebody would mention Edward Cullen, who is a character in both the movie and the book.

So in those cases when it's very hard to separate these senses, our feature did not seem to do very well.

>> So you mentioned that this was used as a prior. I guess, from a classifier's perspective, I kind of understand a prior as what's the likelihood of a particular sense. But in this case, the prior seems to be more saying how confusing this particular entity is.

>> Meena Nagarajan: In that sense that you're interested in.

>> [inaudible] as feature or are you actually referring to the more -- kind of the traditional use of prior?

>> Meena Nagarajan: I would say a feature.

>> So you're actually using it more as a feature [inaudible] decision tree where you're using this numerical feature to just see -- do you have an idea what the resulting model looks like and how does it actually use this feature?

>> Meena Nagarajan: So I don't have a slide on it, but I can tell you more. We plug this into the decision tree classifiers, and when we did not use the prior, the tree looked a certain way. When we used it, the tree started to use this feature as one of the first few decision points in saying ah-hah, yes entity or no entity.

The intuition is that if an entity is harder to extract, the classifier is going to ask more questions and require more signals before it can make a decision, as opposed to not knowing this one feature, seeing comparable signals, and treating the two cases the same.

It is a very exciting idea, and I think once we explore different types of entities and different corpus distributions -- for example, extracting movie entities from a corpus biased to the video game sense -- then we're going to start to see clear improvements. This is of course ongoing work. That's why I'm --

>> [inaudible]

>> Meena Nagarajan: Well, entity, right? This is about 16.5 percent improvement in --

>> No, you had a [inaudible].

>> Meena Nagarajan: Yes, that was generic across all of the things.

>> Yeah. I mean, [inaudible].

>> Meena Nagarajan: Against the baseline we're not seeing a huge improvement, because we're working with only one corpus distribution here. And like I said, if you think about the intuition of this contextual entropy baseline, I think we will start seeing improvements when we expand to different corpus distributions.

Because, you know, contextual entropy is only going to tell you that there was variation around the entity. It's not going to tell you whether that variation was in the sense that you're interested in, that all this extraction difficulty is because there is less support for your sense.

It's not going to be evident from the contextual entropy baseline.

>> Entity extracted, so then I'm looking for accuracy on that.

>> Meena Nagarajan: Another thing I wanted to mention and I skipped was -- and I don't mean to go off topic here, but right at this stage where we say here are the documents that are relevant to my sense of interest, I was able to tell that from this two-step process, going from a corpus where I knew nothing about the distribution to identifying this set of documents that are biased to my sense, because I'm clustering along the sense-biased language model.

Right here we could see applications to topic classification, and that is something Amir worked on; he took this two-step framework and applied it to topic classification. You can see that in a corpus, if you're interested in extracting your sense-biased models, here you have it: you know that these documents are related to the movie sense using these two unsupervised steps.

And of course there are applications in contextual search and browsing, where you're consuming a document and you see an entity, and there is a fairly strong indication that it is in the movie sense, and therefore you can assume the entity is in that sense for contextual browsing.

But your point is well taken. We are still conducting experiments on the NER front, with biology domain entities and more movie named entities.

The challenge here really is hand labeling. Because what we want is a corpus that shows us varied senses so that this complexity of extraction is going to make sense. And that's something we're exploring with Mechanical Turk now.

>> Emre Kiciman: Thank you very much.

>> Meena Nagarajan: Thank you. Thank you for coming.

[applause]
