>> Jaime Teevan: So hello. Thank you for coming. And today I'm pleased to have Alice Oh here to talk with us. She's a faculty member at the Korea Advanced Institute of Science and Technology, where she does a lot of interesting work at the intersection of machine learning and NLP and HCI. Really focused on how people get at information, find what they're looking for. And I first met Alice at MIT where she was doing fun work looking at the different perspectives people have for document summarization and also some fun stuff in group awareness. And today she's going to talk about more recent work on sentiment analysis and how sentiment changes for different aspects. Thanks Alice. >> Alice Oh: Thanks, Jaime. It's great to be here. I'm going to talk about mostly aspect and sentiments in online reviews, but at the end of the talk I'll show you a little bit of the latest results from Twitter data. So that's something that's interesting and very fun. Okay. So let me give you a little bit of an introduction to where I come from. So KAIST is one of the major research and education institutions in Korea. So we have a lot of students, undergraduate and master's and Ph.D. students, and many of them do want to come to the U.S. to do research or to do graduate work. So if you're interested in collaborating with us in some way, please do contact me. In our research lab, we do -- recently we've been kind of focusing on topic modeling research. I'm not going to really talk about the first bullet there, topic modeling itself, LDA, HDP and other nonparametric models. I'm going to talk today mostly about the sentiment analysis work. Okay. So the problem that I'm going to be talking about today is the problem of unstructured reviews. So if you look at Amazon.com, this review comes from that. There's a lot of information out there. So users write tons and tons of reviews about tons and tons of products. 
But it's really hard to get this sort of structured information out of the reviews. So this one particular product has these, what they call attributes of the camera and each of the attributes has its own star rating. But you will see that many of their products do not have these attributes defined. They seem to be kind of manually defined or kind of suggested manually by the users and a lot of the other websites or blogs and other places that users write their reviews of course do not have any structure to them. So the problem that we wanted to address is this: Can we find these attributes and analyze the relevant sentiments automatically? From the corpus. So, for example, this review, this is the one -- this is the review you saw earlier -- has these specific sentences that talk about the size or the screen and the overall performance in that and all those. So we'd like to automatically find those, right? So to talk about our solution, I'm going to talk about first the topic models themselves. That will be a very brief introduction to what they are. And then I'll talk about the LDA, which is the sort of the basis -- basic model that we kind of do a variant of it for our model. And then I'll talk about the aspect and sentiments and review data and then finally the model itself. And I'll talk about the results. Okay. So these two slides from David Blei, who is sort of -- he's done the LDA work, kind of talks about what topic models are and what the motivation behind them is. So you all know -- we all hear about this information overload problem, right? So there's a lot of information out there and the problem is that we need tools to help us organize, search, understand a lot of information out there. So topic modeling provides one method, one tool for doing that, for doing automatically organizing, understanding, searching, summarizing electronic archives. 
So LDA is one specific type of topic model that has been widely used since it was created in the early 2000s, and there are many, many variants of LDA applied to all kinds of data, not just text data, but also images, all kind of stuff. And the basic assumption of LDA is kind of illustrated here. This is a New York Times article. And the title is economic slow down catches up with NASCAR, and as you can imagine, this article talks about major three topics among other level topics as well. So the first topic that it talks about is the NASCAR race topic. Another topic that it talks about is the economic slow down topic. And it also has some words related to sort of the general sports topic. So the LDA assumption is that every document in your corpus is going to be made up of one or more topics. Multiple topics with their sort of probability distributions over those topics. Okay. So the generative process. So LDA is a generative model which means that it tries to kind of mimic the generative process. If you can imagine that the writer of this article, if you have some journalists of the New York Times who are kind of thinking about writing these articles, they have these three topics in their minds. So with the associated words that have high probabilities for those topics. So once you have those topics, the writers can think, well, first I'm going to write this article, which is mostly about the NASCAR races and the economic slow down, and maybe a little bit about the general sports topic and some other topics. So the result would look something like this. And similarly, if you have some other articles that you want to write with different topic distributions, you may end up with an article that is mostly talking about the general sports topic and then another one that talks mostly about the economic stuff. Okay. So what happens -- so this is the graphical representation of LDA. 
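The generative story described above can be written down as a short sampling routine. This is a minimal sketch, not code from the talk: the six-word vocabulary, the three hand-made topics, the `alpha` value, and the function name `generate_corpus` are all assumptions chosen for illustration.

```python
import numpy as np


def generate_corpus(num_docs, doc_len, topics, alpha, seed=0):
    """Sample documents from the LDA generative story.

    topics: array of shape (K, V); each row is a multinomial over the vocabulary.
    alpha:  symmetric Dirichlet parameter for the per-document topic distribution.
    """
    rng = np.random.default_rng(seed)
    num_topics, vocab_size = topics.shape
    docs, thetas = [], []
    for _ in range(num_docs):
        # Each document gets its own distribution over topics.
        theta = rng.dirichlet(np.full(num_topics, alpha))
        # Basic LDA: every word position picks its own topic ...
        z = rng.choice(num_topics, size=doc_len, p=theta)
        # ... and then a word from that topic's multinomial.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas)


# Toy topics loosely mirroring the NASCAR / economy / general sports example.
vocab = ["race", "driver", "economy", "sponsor", "team", "season"]
topics = np.array([
    [0.45, 0.45, 0.02, 0.04, 0.02, 0.02],  # "NASCAR race" topic
    [0.02, 0.02, 0.50, 0.42, 0.02, 0.02],  # "economic slowdown" topic
    [0.02, 0.02, 0.02, 0.04, 0.45, 0.45],  # "general sports" topic
])
docs, thetas = generate_corpus(num_docs=3, doc_len=20, topics=topics, alpha=0.5)
for theta, doc in zip(thetas, docs):
    print(np.round(theta, 2), " ".join(vocab[w] for w in doc[:8]))
```

Fitting LDA runs this story in reverse: only the words are observed, and the topics and per-document topic distributions are inferred.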
Whenever you see any sort of papers about LDA, you would see a figure that looks like this. And I won't go into too much detail about what the circles mean or these different letters. But basically what it's saying is your corpus is this variable. So is represented by this circle, W. So those are the words in your corpus. And then the topics that we saw look like these and they're the five in this case. And they're multinomials over your vocabulary. And then you would have your topics and the topic distributions. So basically, based on your topic distributions and the topics themselves, you would generate the topics and you would generate the words in your corpus. Okay. So the process of fitting an LDA to your corpus looks like this. So when you start out you're kind of starting out with your corpus, unannotated, just plain text in your corpus, and you can ignore the different colors. They're just plain text files or text in your corpus. That goes into your LDA. And then your model will find these topics, which are multinomials over your entire vocabulary. What this means is that your NASCAR topic is going to assign high probabilities for these words and sort of low probabilities for other words that are not really related to that topic. Okay. So another output of LDA are these bars which are the topic distributions for each of your documents in your corpus. So this is the graphical view of the same thing that I just talked about. So your observations are the words in your corpus. That's sort of the input to your model, and then you're learning, you're discovering these topics and the topic distributions. So then we can talk now about our model, the aspect sentiment unification model. And we built this model to sort of uncover the relationship between the sentiments and aspects. And we'll see why that is necessary. So again we have this problem of unstructured reviews. And from that review we want to extract things like these. 
So the aspects for this particular camera, the size of it, the start-up and turn-off time, low light performance, video mode, so on. So for those aspects you can find that the sentiments are expressed using those words. So our goal is to discover those aspects and also at the same time discover those words that express the sentiments. Okay. So let's think a little bit about the words that express sentiment. So in general, you have these words like love, satisfy, best, excellent, and anyone can tell that they're very general sentiment words that can apply to pretty much any domain. So when you say excellent, it doesn't really matter which domain you're talking about. But there are these words that can express sentiment but they're very context-dependent. So let's see an example here, when you say this camera is small versus the LCD is small, they're both in the camera domain, but they're actually expressing sort of two different types of sentiments. Beer was cold. Pizza was cold. That's a more even a better example. The wine list is long, the wait is long. So the point here is that even within the same domain -- so if the domain of restaurant reviews, the domain of electronic reviews, your sentiment words are going to depend on the specific aspects that you're talking about. Okay. So to capture that, we made this model. And it's kind of a variant of the sentence LDA. And I'll show you our experiments for these two corpuses. Corpora. The Amazon reviews and the Yelp restaurant reviews. And the observation that we made is this: Again, these are the same sentences. The observation in our assumption for the model is that one sentence describes one aspect. And we made this assumption which is pretty different from the basic LDA assumption which says that each word represents one aspect. So the basic LDA, if you see a sentence like this, or any of these sentences, these words can come from any topics. 
But we're kind of restricting it such that all of the words in one sentence are coming from one topic. And the reason we did that is because we wanted to capture a little bit of the locality of the words, because if you are talking about the movie 640/480 mode because they're in the same sentence they kind of tend to talk about the same thing. Of course, there are sentences where this is wrong. But for the most part we think it's valid. >>: I'm sorry. When you are building your models for features of the sentence, are you using unigrams only or are you -- because I'm looking at "it's light, too." And thinking if "too" counts for anything, like normally in the context of "too light," for example, it would flip that sentence entirely. >> Alice Oh: That's a good observation. No, we work with just unigrams. So we don't have that problem. But we could work with n-grams, in which case that would be a problem. So the only difference here between the two models is this little box. So we're restricting M as the number of sentences. We are restricting each sentence to have only one topic, and every word in that sentence is generated from that same topic. Okay. So here are the results of SLDA, these are the aspects found. You can see that these two are sort of coming from the camera. So the electronics data are coming from seven different product categories. Laptops, MP3 players, vacuum cleaners and so on. And they're all kind of mixed into one corpus. You can see from the laptop product you would see software topic, keyboard and input device topic. Laptop hardware topic. So this is all unsupervised. So there's no labeling of any kind. The restaurant topics, here we have the parking topic and then we have the liquor topic or aspect as we call them. So on top of that model, we built the aspect sentiment unification model, in which we just add this little bit right here. So in addition to the topic, the words are being generated by a pair of topic and sentiment. 
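The two restrictions described above, one aspect per sentence and words drawn from a (sentiment, aspect) pair, can be sketched as a sampling routine. This is only an illustration under stated assumptions: the talk does not spell out the order of the draws, so drawing the sentiment first and then the aspect given that sentiment is a guess, and `generate_review`, `pi_alpha`, `theta_alpha`, and the random `phi` are hypothetical names and values.

```python
import numpy as np


def generate_review(sentence_lengths, phi, pi_alpha=1.0, theta_alpha=1.0, seed=0):
    """Sample one review where each sentence has a single senti-aspect.

    phi: array of shape (S, K, V); phi[s, k] is the word distribution for
         sentiment s paired with aspect k.
    """
    rng = np.random.default_rng(seed)
    num_sentiments, num_aspects, vocab_size = phi.shape
    # Document-level distribution over sentiments (assumed structure).
    pi = rng.dirichlet(np.full(num_sentiments, pi_alpha))
    # One distribution over aspects for each sentiment (assumed structure).
    theta = rng.dirichlet(np.full(num_aspects, theta_alpha), size=num_sentiments)
    review = []
    for length in sentence_lengths:
        s = rng.choice(num_sentiments, p=pi)
        k = rng.choice(num_aspects, p=theta[s])
        # Every word in the sentence comes from the same (sentiment, aspect) pair.
        words = rng.choice(vocab_size, size=length, p=phi[s, k])
        review.append((int(s), int(k), words))
    return review


# Random word distributions for 2 sentiments, 3 aspects, and an 8-word vocabulary.
phi = np.random.default_rng(1).dirichlet(np.ones(8), size=(2, 3))
review = generate_review([5, 7, 4], phi)
for s, k, words in review:
    print(f"sentiment={s} aspect={k} words={words.tolist()}")
```

Dropping the sentiment index from `phi` gives the sentence-level model without sentiment; letting each word draw its own aspect gives back basic LDA.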
And here if you can read the graphical notation, this means that the words are conditioned on both the topic and the sentiment. So here we're only conditioning the words on the topic itself. Here we have the pair of topic and sentiment. Okay. So that's what the model looks like. And what we do to actually get the sentiments, we do a little bit of a trick where we build into the model the seed words. It turns out if you don't use any seed words or any labeling of any kind, what you find are not really sentiments but they kind of turn out to be sub-topic-like things. So all of the joint models -- actually there are a couple of other joint aspect sentiment models -- they use either the seed words or the labeled data. Okay. So we started out with the top words. These are paradigm words from Peter Turney's work I think in 2002 ACL or so. And we started out with those and then we added some more. It turns out if you add some more general sentiment words, the performance of the model gets better but these words that we added are pretty general, too, and you can see not recommend, not worth, not good. So we did a little bit of negation processing. Just really simple pattern matching like if you said not good, that would become a single token, not-good. If you said not very good, that would also become not-good. So we actually build these seed words into the model by sort of playing with the betas, which are the Dirichlet priors in LDA. But basically what it is, we kind of prevent the positive seed words from being assigned any negative sentiment and vice versa. So it's a combination of setting the asymmetric priors and, I didn't talk about sampling, but we use Gibbs sampling to do inference. If you play with those a little bit then you can get the seed words to do what they should do. So these are the results. Electronics dataset. So you can see the positive senti-aspects. So it's money well spent type of an aspect there. This is the negative counterpart. 
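The negation processing and the seed-word priors described above can be sketched as follows. This is a rough illustration, not the speaker's implementation: the intensifier list, the underscore token format, the base prior value of 0.001, the tiny seed lists, and the function names are all assumptions.

```python
import re

import numpy as np

# "not good" and "not very good" both collapse to the single token "not_good".
# The optional intensifiers (very, really, so) are an illustrative choice.
NEGATION = re.compile(r"\bnot\s+(?:(?:very|really|so)\s+)?(\w+)")


def merge_negations(text):
    """Rewrite simple negation patterns as single tokens."""
    return NEGATION.sub(r"not_\1", text.lower())


def build_beta(vocab, pos_seeds, neg_seeds, base=0.001):
    """Return an asymmetric word prior of shape (2, V).

    Row 0 is the positive sentiment, row 1 the negative sentiment.
    A zero entry keeps a seed word out of the opposite sentiment.
    """
    beta = np.full((2, len(vocab)), base)
    for i, word in enumerate(vocab):
        if word in neg_seeds:
            beta[0, i] = 0.0  # negative seed word blocked from positive sentiment
        if word in pos_seeds:
            beta[1, i] = 0.0  # positive seed word blocked from negative sentiment
    return beta


print(merge_negations("The zoom is not very good"))
vocab = ["excellent", "not_good", "screen"]
print(build_beta(vocab, pos_seeds={"excellent"}, neg_seeds={"not_good"}))
```

Non-seed words such as "screen" keep the same prior under both sentiments, so the data decides where they end up.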
It's a waste of money type of aspect that -- senti-aspect that you're seeing. A positive senti-aspect about the screen. It's crisp and clear and bright and then something negative about the screen. This one aspect senti-aspect is about the vacuum cleaner. So another set of results. This is from the restaurant data. So you can see the meat is juicy and tender and crispy. The meat is dry and bland and salty. Music is loud. Fun atmosphere. And then here you see the same word loud, but this becomes a negative senti-aspect with the music. Cash only. Doesn't accept credit card kind of negative senti-aspect. So it's kind of fun. So we found these results to be pretty fun. So another thing you can do with these senti-aspects, then, is to talk about what words express the aspect itself and what words express the aspects that are specific to or what words express the sentiments that are specific to those aspects. So, for example, here the common words -- so a senti-aspect goes like this across the row. And these are the common words for the negative and positive senti-aspect related to the service aspect of the restaurant. So waiter table, waitress, ask. And so on. And then the positive things are like they refill the water glass, wine, attentive, friendly. The negative things like rude, bad like that. Right? So you might wonder why is me or want or probably not want, but why is this word in there. It's just something, if you have a very statistical corpus-based method, something like that could happen. Okay. So we did this without any labels, without any sentiment labels. Kind of figured out the aspect specific sentiment words. Although we did have to use some sentiment seed words. So here's another thing we can do with the results of the model. We can classify each sentence as either positive or negative. So these are two reviews. The first one is about an electronic product, and the second one is from the restaurant corpus. 
And you can see -- and of course I am showing you the good results. But the results turn out pretty well, and you'll see the numbers of sentiment classification. But the food is really great. I recommend and so on. Another set of results to show you we can identify a parking aspect, which is identified by the words like park, street, valet and so on. And these four sentences from the reviews are all classified or kind of tagged as parking aspect sentences. Here I wanted to show you something where it doesn't always work. So, for example, the second sentence, it took us several uses to figure out what was used. Probably not a positive senti-aspect, but it was identified as that. The last sentence shows you that our assumption that every sentence contains a single aspect is probably not true all the time. So it talks about how nice it looks and how easy it is to use. So we're going to try to see if there's a way to get around it without having to resort back to a topic per word. >>: What about, do you find much sarcasm in reviews and do they confound the system? >> Alice Oh: Not in the product reviews so much. Although I didn't look at every review, to be frank. But we did try -- so we are trying with other types of data like we tried with like political discussions. We tried with like photo review type of data where people are more like conversing rather than really explicitly rating things. And in those data, it doesn't work quite as well. But I don't know what the answer is to extract information out of sarcastic comments and so on. >>: Do you get cases like the last sentence there, where guests always comment on how nice it looks but how hard it is to use, do you get ones where there are mixed sentiments? >> Alice Oh: Yeah. >>: The assumption that the sentence talks about the same aspect may not be correct. But do you get -- >> Alice Oh: I'm sure the corpus has sentences like that. And my answer would be the system would be all confused about that. 
It wouldn't be able to tell. >>: You're showing sentences for which you were able to assign some sentiment. And just to clarify, those are sentences that had some aspect word and then an aspect specific sentiment word or sort of a broad general sentiment word? Make sense, some aspect word to appear on the list you're showing us, to get a positive or negative assignment? >> Alice Oh: Well, yeah. >>: Had a very convenient -- are we mixing aspect specific sentiment and general like if there's not -- does that appear as a sentence? There's really not an aspect tied to it. Is that correct? >> Alice Oh: Actually, there would be an aspect tied to it. Like it is associated with this particular aspect, which is represented by those words on the top. So every sentence is assigned a senti-aspect. >>: Those were my questions. So you're targeting 100 percent assignment of some senti-aspect and not punting on a specific number of sentences, something like total parts value. >> Alice Oh: That's right. Every sentence gets a senti-aspect. And you may -- so the basic thing about topic models is that for every topic there is a probability associated with every single word in your vocabulary. So basically you're then sort of adding up probabilities for each of the words in your sentence. Here's a quantitative evaluation. So topic models are inherently difficult to evaluate quantitatively, because the way -- the reason you would use them is to discover these unknown topics. So if you have 10,000 New York Times articles and you're finding 100 topics within them, nobody really knows the correct set of answers. Anyway, so one thing that we did do to quantitatively evaluate our model is to do just sentiment classification. And these are document level classification. Not sentence by sentence, because we don't have the label data to do sentence by sentence classification. So we compared our model. We have two different versions of the model depending on the set of seed words that we used. 
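The point that every sentence gets a senti-aspect by adding up per-word probabilities can be sketched like this. It is a simplified reading, not the model's actual inference: the sketch sums log-probabilities (an assumption, to avoid numerical underflow), ignores the document-level distributions, and uses a hand-made `phi`; `assign_senti_aspect` is a hypothetical name.

```python
import numpy as np


def assign_senti_aspect(word_ids, phi):
    """Pick the (sentiment, aspect) pair that best explains one sentence.

    phi: array of shape (S, K, V) of word probabilities per senti-aspect.
    Returns the winning sentiment index, aspect index, and the full score table.
    """
    with np.errstate(divide="ignore"):
        log_phi = np.log(phi)  # zero-probability words score -inf
    # Sum the log-probability of every word in the sentence under each pair.
    scores = log_phi[:, :, word_ids].sum(axis=2)
    s, k = np.unravel_index(np.argmax(scores), scores.shape)
    return int(s), int(k), scores


# Toy model: 2 sentiments x 2 aspects over a 4-word vocabulary.
phi = np.array([
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]],
    [[0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]],
])
sentiment, aspect, scores = assign_senti_aspect([2, 2, 0], phi)
print(sentiment, aspect)
print(np.round(scores, 2))
```

Because the argmax always returns some pair, a sentence with no real sentiment still receives one, which is the 100 percent assignment behavior raised in the questions.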
And then we compare them with these two models, which are also joint models of sentiments and aspects. So -- >>: The purple one is doing Dirichlet random? >> Alice Oh: Yeah. Well, according to -- yeah. So but I have to say that this particular model, it's not really designed to do sentiment classification. Or actually none of these models are designed specifically to do sentiment classification. And this model particularly is focused more on finding the topics, finding the specific aspects rather than doing sentiment. So although it's a joint model of topics and sentiments. So they don't do quite well. But our model does better. That's the point of this slide. >>: What would your corpus be tested on, your ground truth? >> Alice Oh: The Amazon reviews and the Yelp.com reviews. >>: And you handle the sentiment aspects? >> Alice Oh: No, so the aspect part we're not doing quantitative evaluation. These are just based on the star ratings for each of the reviews. So I think four and five stars positive, one and two stars negative. And three I think we just discarded. These are just different models. If anyone wants me to explain now. So this is our model. And I should point out from this slide that there are limitations in our model. If the sentence is too short, if it's just one or two words, because we have this assumption that one sentence gets one aspect. If you have a very short sentence it's not going to work so well. If you have sentences that have multiple aspects, it's not going to work so well. So this is just to show you what we can do with Twitter data. So this was just our question out of curiosity. What would happen if we apply this model to Twitter data. Because we noticed that a lot of the sentiment work that's been done on Twitter data isn't very good. They just use a list of words and kind of look to see if a tweet contains that word or not. So we wanted to see if we can get sort of topic-specific sentiments out of the tweets. 
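The word-list approach mentioned above can be sketched in a few lines, which also shows why it cannot capture aspect-specific sentiment. The word lists and the function name `lexicon_sentiment` are made up for illustration and do not correspond to any particular published lexicon.

```python
def lexicon_sentiment(text, positive_words, negative_words):
    """Label a text by checking whether it contains any listed word."""
    tokens = {token.strip(".,!?") for token in text.lower().split()}
    has_positive = bool(tokens & positive_words)
    has_negative = bool(tokens & negative_words)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return "neutral"  # no match, or matches on both lists


# Hypothetical lists; "cold" has to be fixed as one polarity for every aspect.
positive_words = {"excellent", "love", "best"}
negative_words = {"cold", "rude", "bad"}

for text in ["The beer was cold.", "The pizza was cold.", "Happy birthday!"]:
    print(text, "->", lexicon_sentiment(text, positive_words, negative_words))
```

Both "cold" sentences receive the same label here, whereas the beer and pizza example earlier in the talk calls for opposite ones.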
So we tested on 1.3 million tweets, 50,000 words in our vocabulary. A thing to notice about Twitter data is that unlike the Amazon or the Yelp.com reviews where there is pretty explicit polarity being expressed in those reviews, a lot of the tweets don't really have any sentiment. If they do it's closer to feelings, how people are feeling, whether they're being happy or sad, rather than I really like this or this is good or this is bad. So I think that makes Twitter data a little bit difficult to do sentiment analysis on. But we'll see how the sentiments turn out. This is the fun part. Right? So for the seed words we don't have to really think too much with the Twitter data. These are the topics that we found, the positive senti-aspects. So there's some pretty obvious ones like about the singers and stuff. Ice cream yummy stuff, good night, good morning. Happy birthday topic here. American Idol stuff. And so pretty obvious stuff like I'm feeling happy at home type of stuff going on, right? And then there's some other topics, more of the obvious topics. God bless you type of stuff. So we do see some political stuff going on. If you notice, if you look at the words, there aren't too many sentiment words or there are actually no sentiment words that you can really pick out and say why this aspect turned out to be positive. And we'll see a negative counterpart of this and the negative senti-aspect looks pretty much the same, actually. The negative senti-aspects, again, the same type of stuff. But interesting things going on, right? I'm hurt, I'm feeling bad. The stock market, I guess, is not so good. Tired. Kind of a spam topic. Michael Jackson's death. >>: Is there any ordering in this or is this -- >> Alice Oh: No, there's no ordering. And these are from data, I guess, from 2007 to 2009. And I can't figure out how to do stuff on Twitter. So more senti-aspects to show you, flights being delayed, there's a lot of traffic. I don't want to take the test. 
So here's another political topic, something about Obama, and so this happens because, as in the answer to the previous question, our model assigns a senti-aspect to every single tweet. And that kind of works well for product reviews because sentences in the reviews do have sentiments or most of them do. But in the Twitter data, a lot of the sentences or tweets don't have a lot of sentiments. If you're linking to a New York Times article about the Obama healthcare issue, a lot of the users don't explicitly say I like it or not, they just write something about it. And it just turns out that our model just kind of randomly probably assigns sentiments to those tweets. So that's sort of the downside of the model doing it that way. So that's pretty much it. I didn't even notice that last slide. So that's our model, the aspect sentiment unification model. We're going to -- this is going to be part of WSDM, which is in February next year. So we actually have the camera ready on our website if you want to go and fetch the paper and read it. Okay. Questions? [applause]. >>: Let me clarify what we were just looking at when you showed senti-aspects. Those are single words, am I correct, that could appear on both lists? I don't know whether any did. >>: Obama did. >>: I want to make sure that my mental model is right, we're just looking at one slice of the senti model and words can appear on both lists? >> Alice Oh: Yeah. Yeah. >>: So sorry for another question about sarcasm but it's a deep personal interest of mine. >> Alice Oh: Okay [laughter]. >>: If I were worried about sarcasm interfering with data, not just can you derive good data from sarcasm, but can you at least factor it out, I would be looking for three negative comments followed by something very positive that doesn't have some kind of start to the phrase like however or on the other hand, to indicate change of sentiment. Is there any analysis like that being done? 
Because it seems like you go from one sentence to another. If you see a sudden change of sentiment, that should be suspicious. >> Alice Oh: Yeah, there's nothing like that I know of. So there's an independence assumption here, right? So we're assuming that every sentence in the document is just independent of -- well, not quite independent of one another, because there's the distribution of the topics within the document itself. But we didn't specifically look at or nobody has really specifically looked at how the sentiment changes through the document. When you see some sudden change that just signals something like that. I mean, that's a good suggestion. This is just general. >>: You sort of talked -- I guess you'll see a lot of here's the positive -- I see a lot of reviews. Here's the positive then the next paragraph is here's the negative. Or it seems like we're taught to write in a way that should allow you to gather more information from the structure. >> Alice Oh: Yeah, yeah. So we didn't look at any of the structure within sort of the higher level structure of review or document. But that is certainly something interesting to look at. I'm not sure how you would do that and kind of incorporate it into the model. I don't know, we can try to figure something out. >>: Can you talk a little bit about sort of how the corpus changes over time? We were talking about that a little earlier. >> Alice Oh: So this is something totally different from this work. We have this model that looks at how documents change over time in respect to the topics that they talk about. So it applies pretty well. Very well, actually, to the Twitter data. If something happens today, it's going to appear a lot on the Twitter. Twitter sphere, whatever you call it. But then the next day it's just going to be a whole new set of topics and so on, right? 
So there are variants of the topic model, variants of LDA like the dynamic topic models or topics over time which try to capture those sort of dynamic changes to the topics. But the downside of those, sort of the limitation of those models is that they don't really capture how new topics emerge and topics kind of disappear through time. So we built this new model, which I hope to publish soon. It's called Distance Dependent CRF, Chinese Restaurant Franchise. It's a hierarchical model of LDA. And we built into it the notion of distance, how when you have distances between different documents the topic probabilities are going to change. So when we have a new tweet, say we have a bunch of tweets from today, each of them is going to have some topic probability distributions but they're going to look a lot more like yesterday's than they do like tweets from a year ago or six months ago. So things like that, or you can do with locations, too. So where the tweets are coming from, if they're coming from Seattle, they're going to be talking about different things than if they're coming from Korea or something like that. So there's a lot of stuff you can do with topic models. And if you -- we can talk more about it. Did you have another question? >>: However you defined accuracy, seems like your accuracy will go up if you are able to just drop some of the -- just not assign sentiment to some of the more ambiguous sentences. How robust is -- for applications that allow that, if your goal is to like put an icon next to every single tweet, gotta do it, if your goal is generalization in general aspects, finding, topic finding, seems like you could throw out half your sentences and improve accuracy and still -- how robust is your confidence signal to let you do that? And have you tried -- whatever your accuracy metric is, have you tried tossing out your bottom K percent and seeing how your accuracy goes up? 
>> Alice Oh: So the Twitter stuff, this is brand new and we haven't done any analysis of our results. So I don't know. It's true that if you, a lot of the tweets are not going to have any sentiment. And maybe if we throw out all the political stuff or something, then the topics will look more like they have some sentiment built into them. With the product review data, I don't think it's going to change too much because a lot of the sentences do have some kind of sentiment in them. But that's just -- but we haven't done any testing with that either. >>: Interesting just like the course one-time human experiments, read two random readers, Amazon readers, curious if I had to pick what percentage of sentences contain meaningful sentiment information, I don't know. This is basically a question how terse are the reviews written on Amazon and do people blabber a lot, I have no idea. >> Alice Oh: So traditional sentiment analysis people do do that. The first thing that they do is take out the sentences that have subjective content and the objective sentences, they just throw out and kind of work only with the subjective sentences. >>: I was thinking more as more a post-processing step, if you have senti-aspect assigned defined, and you have presumably each of those effectively has, each sentence probability aspect assigned, take off the bottom probabilities. Without making any new assumptions about what are objective or subjective work, it seems for many applications your accuracy would go up. >> Alice Oh: Right. There are a lot of -- a few things that we're trying to do post all of this processing. And something that you could do with the senti-aspects themselves are how many of them are really about the -- how many of them really contain something about sentiment and not just nonsentiment sort of objective words. A very, very simple thing that you can try is the ratio of like nouns versus adjectives, if you do part of speech tagging. 
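The post-processing idea raised in the questions, dropping the least confident assignments and checking how accuracy changes, can be sketched as below. The speaker says this has not been tested, so the confidences, predictions, and labels here are made-up placeholders, and `accuracy_after_dropping` is a hypothetical helper rather than part of the model.

```python
import numpy as np


def accuracy_after_dropping(confidences, predictions, labels, drop_fraction):
    """Accuracy on what remains after discarding the least confident items.

    confidences: one score per sentence, e.g. the probability of its
                 assigned senti-aspect (higher means more confident).
    drop_fraction: share of sentences to discard, between 0 and 1.
    Returns (accuracy, number_of_sentences_kept).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(predictions) == np.asarray(labels)
    order = np.argsort(confidences)  # least confident first
    num_drop = int(len(order) * drop_fraction)
    kept = order[num_drop:]
    if len(kept) == 0:
        raise ValueError("drop_fraction removes every sentence")
    return float(correct[kept].mean()), len(kept)


# Placeholder data: four sentences, two of them low confidence and wrong.
confidences = [0.9, 0.2, 0.8, 0.3]
predictions = [1, 0, 1, 1]
labels = [1, 1, 1, 0]
for fraction in (0.0, 0.25, 0.5):
    print(fraction, accuracy_after_dropping(confidences, predictions, labels, fraction))
```

Whether accuracy really rises on review or Twitter data depends on how well the confidence score tracks correctness, which is exactly the open question in this exchange.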
And we can look for the probabilities -- because every senti-aspect has every single word in it we can look for where the sentiment seed words are and try to see if they're higher up in the list, then that topic is probably a sentiment topic, whereas if they're really low then it probably doesn't have that much sentiment in it. >>: You might also want to look over time because some things that start out as non-sentiment topics move into expressions involving a great deal of sentiment. I recall when the crackdown in Bangkok occurred, initially it was very detailed descriptions by local people who were describing exactly what was going on and the intersection. And over time it was flooded with oh my God what's happening in Thailand. >> Alice Oh: That's right. >>: And basically as the amount of sentiment increased, the noise-to-signal ratio also increased. >> Alice Oh: Right. So there's a lot of work on Twitter and emergency response stuff like if there's an earthquake or some bombing somewhere people will start out describing the event and then afterwards they're going to say as you said all those sad or happy things that are going on. So we haven't looked at it. But that's a very interesting direction. >>: To quantify that. >> Alice Oh: So the two different research projects that we're working on, one is the sentiment stuff, which is not dynamic at all at this point. And then we have the topic stuff that's dynamic. So we want to at some point kind of merge the two and we kind of half jokingly say we should do distance-dependent hierarchical aspect sentiment, all of that, unification model. But that's certainly an interesting direction to go. Yeah. Okay. Thank you. [applause]