>> Jaime Teevan: So hello. Thank you for coming. And today I'm pleased to have Alice Oh
here to talk with us. She's a faculty member at the Korea Advanced Institute of Science and
Technology (KAIST), where she does a lot of interesting work at the intersection of machine learning,
NLP, and HCI, really focused on how people get at information and find what they're looking for.
And I first met Alice at MIT where she was doing fun work looking at the different perspectives
people have for document summarization and also some fun stuff in group awareness.
And today she's going to talk about more recent work on sentiment analysis and how sentiment
changes for different aspects. Thanks Alice.
>> Alice Oh: Thanks, Jaime. It's great to be here. I'm going to talk mostly about aspects and
sentiments in online reviews, but at the end of the talk I'll talk about and show you a little bit of the
latest results from Twitter data. So that's something that's interesting and very fun.
Okay. So let me give you a little bit of an introduction to where I come from. So KAIST is one of
the major research and education institutions in Korea. So we have a lot of students, both
undergraduate and master's and Ph.D. students, and many of them do want to come to the U.S. to do
research or to do graduate work.
So if you're interested in collaborating with us in some way, please do contact me. In our
research lab, we do -- recently we've been kind of focusing on topic modeling research. I'm not
going to really talk about the first bullet there, topic modeling itself -- LDA, HDP and other
nonparametric models. I'm going to talk today mostly about the sentiment analysis work.
Okay. So the problem that I'm going to be talking about today is the problem of unstructured
reviews. So if you look at Amazon.com, this review comes from that. There's a lot of information
out there. So users write tons and tons of reviews about tons and tons of products. But it's really
hard to get this sort of structured information out of the reviews.
So this one particular product has these, what they call attributes of the camera and each of the
attributes has its own star rating. But you will see that many of their products do not have these
attributes defined.
They seem to be kind of manually defined or kind of suggested manually by the users and a lot of
the other websites or blogs and other places that users write their reviews of course do not have
any structure to them. So the problem that we wanted to address is this: Can we find these
attributes and analyze the relevant sentiments automatically, from the corpus?
So, for example, this review -- this is the review you saw earlier -- has these specific sentences
that talk about the size or the screen and the overall performance and so on.
So we'd like to automatically find those, right? So to talk about our solution, I'm going to talk
about first the topic models themselves. That will be a very brief introduction to what they are.
And then I'll talk about LDA, which is the basic model that we build a variant of for our model.
And then I'll talk about the aspects and sentiments in review data and
then finally the model itself. And I'll talk about the results.
Okay. So these two slides are from David Blei, who did the original LDA work, and they talk
about what topic models are and what the motivation behind them is.
So you all know -- we all hear about this information overload problem, right? So there's a lot of
information out there and the problem is that we need tools to help us organize, search,
understand a lot of information out there. So topic modeling provides one method, one tool for
doing that, for doing automatically organizing, understanding, searching, summarizing electronic
archives.
So LDA is one specific type of topic model that has been widely used since it was created in the
early 2000s, and there are many, many variants of LDA applied to all kinds of data, not just text
data, but also images, all kind of stuff.
And the basic assumption of LDA is kind of illustrated here. This is a New York Times article.
And the title is "Economic Slowdown Catches Up With NASCAR," and as you can imagine, this
article talks about three major topics, among other lower-level topics as well. So the first topic that it
talks about is the NASCAR race topic.
Another topic that it talks about is the economic slowdown topic. And it also has some words
related to sort of the general sports topic.
So the LDA assumption is that every document in your corpus is going to be made up of one or
more topics -- multiple topics, with a probability distribution over those topics.
Okay. So the generative process. So LDA is a generative model which means that it tries to kind
of mimic the generative process. If you can imagine the writer of this article -- if you have
some journalists at the New York Times who are kind of thinking about writing these articles -- they
have these three topics in their minds, with the associated words that have high probabilities
for those topics.
So once you have those topics, the writers can think, well, first I'm going to write this article,
which is mostly about the NASCAR races and the economic slowdown, and maybe a little bit
about the general sports topic and some other topics.
So the result would look something like this. And similarly, if you have some other articles that
you want to write with different topic distributions, you may end up with an article that is mostly
talking about the general sports topic and then another one that talks mostly about the economic
stuff. Okay. So what happens -- so this is the graphical representation of LDA. Whenever you
see any sort of papers about LDA, you would see a figure that looks like this.
And I won't go into too much detail about what the circles mean or these different letters. But
basically what it's saying is your corpus is this variable, represented by this circle, W. So
those are the words in your corpus.
And then the topics that we saw look like these, and there are five in this case. They're
multinomials over your vocabulary. And then you would have your topics and the topic
distributions.
So basically, based on your topic distributions and the topics themselves, you would generate the
topic assignments and then the words in your corpus.
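To make that generative story concrete, here is a minimal sketch in Python; the vocabulary, number of topics, and Dirichlet hyperparameters are made-up illustrative values, not anything from the actual work.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["nascar", "race", "driver", "economy", "recession", "sales", "team", "season"]
    V = len(vocab)   # vocabulary size
    K = 3            # number of topics
    alpha = 0.5      # Dirichlet prior on per-document topic proportions
    beta = 0.1       # Dirichlet prior on per-topic word distributions

    # Each topic is a multinomial (categorical) distribution over the whole vocabulary.
    phi = rng.dirichlet([beta] * V, size=K)

    def generate_document(num_words):
        theta = rng.dirichlet([alpha] * K)        # this document's topic proportions
        words = []
        for _ in range(num_words):
            z = rng.choice(K, p=theta)            # pick a topic for this word position
            w = rng.choice(V, p=phi[z])           # pick a word from that topic
            words.append(vocab[w])
        return words

    print(generate_document(20))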
Okay. So the process of fitting an LDA to your corpus looks like this. So when you start out
you're kind of starting out with your corpus, unannotated, just plain text in your corpus, and you
can ignore the different colors. They're just plain text files or text in your corpus.
That goes into your LDA. And then your model will find these topics, which are multinomials over
your entire vocabulary. What this means is that your NASCAR topic is going to assign high
probabilities for these words and sort of low probabilities for other words that are not really related
to that topic.
Okay. So another output of LDA is these bars, which are the topic distributions for each of your
documents in your corpus.
So this is the graphical view of the same thing that I just talked about. So your observations are
the words in your corpus. That's sort of the input to your model, and then you're learning, you're
discovering these topics and the topic distributions.
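One common way to fit LDA to a plain-text corpus like this is with an off-the-shelf toolkit; the snippet below uses gensim purely as an illustration, with a tiny toy corpus standing in for the real documents.

    from gensim import corpora, models

    # A toy corpus standing in for the real documents.
    docs = [
        "nascar race driver sponsor season".split(),
        "economy recession sales slowdown".split(),
        "race team sponsor economy sales".split(),
    ]

    dictionary = corpora.Dictionary(docs)            # word <-> id mapping
    bow = [dictionary.doc2bow(d) for d in docs]      # bag-of-words vectors

    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

    # The learned topics: multinomials over the vocabulary (top words shown).
    for topic in lda.print_topics(num_words=5):
        print(topic)

    # The per-document topic distributions ("the bars" for each document).
    for doc_bow in bow:
        print(lda.get_document_topics(doc_bow))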
So then we can talk now about our model, the aspect sentiment unification model. And we built
this model to sort of uncover the relationship between the sentiments and aspects.
And we'll see why that is necessary. So again we have this problem of unstructured reviews.
And from that review we want to extract things like these. So the aspects for this particular
camera, the size of it, the start-up and turn-off time, low light performance, video mode, so on.
So for those aspects you can find that the sentiments are expressed using those words. So our
goal is to discover those aspects and also at the same time discover those words that express
the sentiments.
Okay. So let's think a little bit about the words that express sentiment. So in general, you have
these words like love, satisfy, best, excellent, and anyone can tell that they're very general
sentiment words that can apply to pretty much any domain.
So when you say excellent, it doesn't really matter which domain you're talking about. But there
are these words that can express sentiment but they're very context-dependent. So let's see an
example here, when you say this camera is small versus the LCD is small, they're both in the
camera domain, but they're actually expressing sort of two different types of sentiments. Beer
was cold. Pizza was cold. That's an even better example. The wine list is long, the wait is
long.
So the point here is that even within the same domain -- so if the domain of restaurant reviews,
the domain of electronic reviews, your sentiment words are going to depend on the specific
aspects that you're talking about.
Okay. So to capture that, we made this model. And it's kind of a variant of the sentence LDA.
And I'll show you our experiments for these two corpora: the Amazon reviews and
the Yelp restaurant reviews.
And the observation that we made is this: Again, these are the same sentences. The
observation, and our assumption for the model, is that one sentence describes one aspect. And we
made this assumption, which is pretty different from the basic LDA assumption, which says that
each word represents one aspect. So in basic LDA, if you see a sentence like this, or any of
these sentences, the words can come from any topics.
But we're kind of restricting it such that all of the words in one sentence are coming from one
topic. And the reason we did that is because we wanted to capture a little bit of the locality of the
words, because if you are talking about the video mode, 640/480 -- because they're in the same
sentence, they kind of tend to talk about the same thing.
Of course, there are sentences where this is wrong. But for the most part we think it's valid.
>>: I'm sorry. When you are building your models for features of the sentence, are you using
unigrams only, or are you -- because I'm looking at "it's light, too," and thinking, if "too" counts for
anything, like normally in the context of "too light," for example, that would flip that sentence
entirely.
>> Alice Oh: That's a good observation. No, we work with just unigrams. So we don't have that
problem. But if we worked with n-grams, that would be a problem.
So the only difference here between the two models is this little box. Here M is
the number of sentences, and we are restricting each sentence to have only one topic, and every
word in that sentence is generated from that same topic.
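As a rough sketch, the change amounts to drawing the topic once per sentence instead of once per word; the toy distributions below are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["screen", "bright", "crisp", "battery", "lasts", "long"]
    V, K = len(vocab), 2
    phi = rng.dirichlet([0.1] * V, size=K)   # per-topic word distributions (toy)
    theta = rng.dirichlet([0.5] * K)         # one document's topic proportions (toy)

    def generate_sentence(num_words):
        z = rng.choice(K, p=theta)           # one topic drawn per sentence, not per word
        words = [vocab[rng.choice(V, p=phi[z])] for _ in range(num_words)]
        return z, words

    for _ in range(3):
        z, words = generate_sentence(4)
        print(f"topic {z}: {' '.join(words)}")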
Okay. So here are the results of SLDA; these are the aspects found. You can see that these two are
sort of coming from the camera. So the electronics data are coming from seven different product
categories. Laptops, MP3 players, vacuum cleaners and so on. And they're all kind of mixed into
one corpus. You can see from the laptop products you would see a software topic, a keyboard and
input device topic, a laptop hardware topic. So this is all unsupervised. So there's no labeling of
any kind.
The restaurant topics, here we have the parking topic and then we have the liquor topic or aspect
as we call them. So on top of that model, we built the aspect sentiment unification model, in which
we just add this little bit right here.
So in addition to the topic, the words are being generated by a pair of topic and sentiment. And
here, if you can read the graphical notation, this means that the word distributions are conditioned on both the
topic and the sentiment.
So before we were only conditioning the word distributions on the topic itself; here we have
the pair of topic and sentiment.
Okay. So that's what the model looks like. And what we do to actually get the sentiments, we do
a little bit of a trick where we build into the model the seed words. It turns out if you don't use any
seed words or any labeling of any kind, what you find are not really sentiments but they kind of
turn out to be subtopic-like things.
So all of the joint models -- there are actually a couple of other joint aspect-sentiment models --
use either seed words or labeled data.
Okay. So we started out with the top words. These are paradigm words from Peter Turney's
work, I think in ACL 2002 or so.
And we started out with those and then we added some more. It turns out if you add some more
general sentiment words, the performance of the model gets better but these words that we
added are pretty general, too, and you can see "not recommend," "not worth," "not good." So we did a
little bit of negation processing -- just really simple pattern matching, like if you said "not good," that
would become "not good" as one token, and if you said "not very good," that would also become "not good."
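A sketch of that kind of simple negation pattern matching; the merged token form ("not_good") and the small adverb list are illustrative choices, not necessarily exactly what was used.

    # Merge "not X" (optionally with an intervening adverb) into a single token
    # like "not_good", so "not very good" also becomes "not_good".
    ADVERBS = {"very", "really", "so", "too"}   # illustrative list

    def mark_negation(tokens):
        out, i = [], 0
        while i < len(tokens):
            if tokens[i] == "not" and i + 1 < len(tokens):
                j = i + 1
                if tokens[j] in ADVERBS and j + 1 < len(tokens):
                    j += 1                      # skip the adverb: "not very good" -> "not_good"
                out.append("not_" + tokens[j])
                i = j + 1
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(mark_negation("i would not recommend this".split()))
    print(mark_negation("the screen is not very good".split()))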
So we actually build these seed words into the model by sort of playing with the betas, which are
the Dirichlet priors in LDA. But basically what it is, we kind of prevent the positive seed words
from being assigned any negative sentiment, and vice versa. So it's a combination of setting the
asymmetric priors and -- I didn't talk about sampling, but we use Gibbs sampling to do inference. If
you play with those a little bit then you can get the seed words to do what they should do.
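A rough sketch of what such asymmetric priors could look like: for a positive senti-aspect, the Dirichlet prior gives the negative seed words essentially zero mass, and vice versa. The specific hyperparameter values here are made up and may differ from the paper's settings.

    import numpy as np

    vocab = ["good", "great", "love", "bad", "terrible", "hate", "screen", "battery"]
    pos_seeds = {"good", "great", "love"}
    neg_seeds = {"bad", "terrible", "hate"}

    def make_beta(seed_words, opposite_seed_words, base=0.001, boost=0.1):
        beta = np.full(len(vocab), base)
        for i, w in enumerate(vocab):
            if w in seed_words:
                beta[i] = boost       # encourage this sentiment's seed words
            elif w in opposite_seed_words:
                beta[i] = 0.0         # forbid the opposite sentiment's seed words
        return beta

    beta_pos = make_beta(pos_seeds, neg_seeds)   # prior for positive senti-aspects
    beta_neg = make_beta(neg_seeds, pos_seeds)   # prior for negative senti-aspects
    print(beta_pos, beta_neg, sep="\n")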
So these are the results, on the electronics dataset. So you can see the positive senti-aspects. So it's
money well spent type of an aspect there. This is the negative counterpart. It's a waste of money
type of aspect that -- senti-aspect that you're seeing. A positive senti-aspect about the screen.
It's crisp and clear and bright and then something negative about the screen.
This senti-aspect is about the vacuum cleaner. So another set of results. This is from
the restaurant data. So you can see the meat is juicy and tender and crispy. The meat is dry and
bland and salty.
Music is loud. Fun atmosphere. And then here you see the same word loud, but this becomes a
negative senti-aspect with the music.
Cash only. Doesn't accept credit card kind of negative senti-aspect. So it's kind of fun. So we
found these results to be pretty fun.
So another thing you can do with these senti-aspects, then, is to talk about what words express
the aspect itself and what words express
the sentiments that are specific to that aspect. So, for example, here are the common words -- so
a senti-aspect goes like this across the row. And these are the common words for the negative
and positive senti-aspect related to the service aspect of the restaurant.
So waiter, table, waitress, ask, and so on. And then the positive things are like they refill the
water glass, wine, attentive, friendly. The negative things are like rude, bad, like that. Right? So you
might wonder why a word like "me" or "want" is in there. It's just
something that, if you have a very statistical corpus-based method, something like that can happen.
Okay. So we did this without any labels, without any sentiment labels. Kind of figured out the
aspect-specific sentiment words, although we did have to use some sentiment seed words.
So here's another thing we can do with the results of the model. We can classify each sentence
as either positive or negative. So these are two reviews. The first one is about an electronic
product, and the second one is from the restaurant corpus.
And you can see -- and of course I am showing you the good results. But the results turn out
pretty well, and you'll see the numbers of sentiment classification. But the food is really great. I
recommend and so on.
Another set of results to show you: we can identify a parking aspect, which is identified by
words like park, street, valet and so on. And these four sentences from the reviews are all
classified or kind of tagged as parking aspect sentences.
Here I wanted to show you something where it doesn't always work. So, for example, the second
sentence -- it took us several uses to figure out how it was used -- is probably not a positive
senti-aspect, but it was identified as that.
The last sentence shows you that our assumption that every sentence contains a single aspect is
probably not true all the time. It talks about how nice it looks and how easy it is to use.
So we're going to try to see if there's a way to get around that without having to resort back to one
topic per word.
>>: What about, do you find much sarcasm in reviews and do they confound the system?
>> Alice Oh: Not in the product reviews so much. Although I didn't look at every review, to be
frank. But we did try -- so we are trying with other types of data like we tried with like political
discussions. We tried with like photo review type of data where people are more like conversing
rather than really explicitly rating things.
And on those data, it doesn't work quite as well. But I don't know what the answer is for how to extract
information out of sarcastic comments and so on.
>>: Do you get cases like the last sentence there, where guests always comment on how nice it
looks but how hard it is to use, do you get ones where there are mixed sentiments?
>> Alice Oh: Yeah.
>>: The assumption that the sentence talks about the same aspect may not be correct. But do
you get --
>> Alice Oh: I'm sure the corpus has sentences like that. And my answer would be the system
would be all confused about that. It wouldn't be able to tell.
>>: You're showing sentences for which you were able to assign some sentiment. And just to
clarify, those are sentences that had some aspect word and then an aspect-specific sentiment
word or sort of a broad general sentiment word? Does a sentence need some aspect word that appears on
the list you're showing us to get a positive or negative assignment?
>> Alice Oh: Well, yeah.
>>: Are we mixing aspect-specific sentiment and general sentiment? Like, if there's
not an aspect word, does that appear as a sentence with really no aspect tied to it? Is that correct?
>> Alice Oh: Actually, there would be an aspect tied to it. Like it is associated with this particular
aspect, which is represented by those words on the top.
So every sentence is assigned a senti-aspect.
>>: Those were my questions. So you're targeting 100 percent assignment of some senti-aspect
and not punting on some fraction of the sentences?
>> Alice Oh: That's right. Every sentence gets a senti-aspect. And you may -- so the basic thing
about topic models is that for every topic there is a probability associated with every single word
in your vocabulary. So basically you're then sort of adding up probabilities for each of the words
in your sentence.
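A sketch of that idea: treating the learned word distributions as fixed, each sentence can be scored against every (sentiment, aspect) pair by summing log word probabilities and assigned the best-scoring one. This is a simplification of the model's actual Gibbs-sampling inference, with toy distributions.

    import math

    # phi[(sentiment, aspect)][word] = probability of the word under that senti-aspect (toy values).
    phi = {
        ("pos", "screen"): {"screen": 0.3, "bright": 0.25, "crisp": 0.2, "dim": 0.01},
        ("neg", "screen"): {"screen": 0.3, "dim": 0.25, "dark": 0.2, "crisp": 0.01},
    }
    UNSEEN = 1e-6   # floor for words the toy table doesn't list

    def assign_senti_aspect(sentence_tokens):
        def score(dist):
            return sum(math.log(dist.get(w, UNSEEN)) for w in sentence_tokens)
        return max(phi, key=lambda key: score(phi[key]))

    print(assign_senti_aspect("the screen is crisp and bright".split()))
    print(assign_senti_aspect("the screen is dim and dark".split()))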
Here's a quantitative evaluation. So topic models are inherently difficult to evaluate quantitatively,
because the way -- the reason you would use them is to discover these unknown topics.
So if you have 10,000 New York Times articles and you're finding 100 topics within them, nobody
really knows the correct set of answers.
Anyway, so one thing that we did do to quantitatively evaluate our model is to do just sentiment
classification. And this is document-level classification, not sentence by sentence, because
we don't have the labeled data to do sentence-by-sentence classification.
So we compared our model. We have two different versions of the model depending on the set
of seed words that we used. And then we compare them with these two models, which are also
joint models of sentiments and aspects.
So --
>>: The purple one is doing Dirichlet random?
>> Alice Oh: Yeah. Well, according to -- yeah. So but I have to say that this particular model, it's
not really designed to do sentiment classification. Or actually none of these models are designed
specifically to do sentiment classification.
And this model particularly is focused more on finding the topics, finding the specific aspects
rather than doing sentiment.
So although it's a joint model of topics and sentiments, they don't do quite as well. But our
model does better. That's the point of this slide.
>>: What was your corpus tested on? What's your ground truth?
>> Alice Oh: The Amazon reviews and the Yelp.com reviews.
>>: And did you hand-label the sentiments and aspects?
>> Alice Oh: No, so the aspect part we're not doing quantitative evaluation. These are just
based on the star ratings for each of the reviews.
So I think four and five stars are positive, one and two stars are negative, and three stars I think we just
discarded. These are just different models, which I can explain now if anyone wants. So this is our
model.
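As a rough sketch of the evaluation setup just described, the star ratings become binary labels and a document-level prediction is compared against them; the majority-vote aggregation over sentences is an illustrative guess, not necessarily what the paper does.

    # Ground-truth labels from star ratings, plus a simple document-level prediction.
    def star_label(stars):
        if stars >= 4:
            return "pos"
        if stars <= 2:
            return "neg"
        return None                  # 3-star reviews are discarded

    def predict_document(sentence_sentiments):
        pos = sum(s == "pos" for s in sentence_sentiments)
        neg = sum(s == "neg" for s in sentence_sentiments)
        return "pos" if pos >= neg else "neg"

    reviews = [  # (star rating, per-sentence sentiments from the model) -- toy data
        (5, ["pos", "pos", "neg"]),
        (1, ["neg", "neg", "pos"]),
        (3, ["pos", "neg"]),
    ]

    pairs = [(star_label(s), predict_document(p)) for s, p in reviews if star_label(s)]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    print(f"accuracy = {accuracy:.2f}")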
And I should point out from this slide that there are limitations in our model. If the sentence is too
short, if it's just one or two words, because we have this assumption that one sentence gets one
aspect. If you have a very short sentence it's not going to work so well.
If you have sentences that have multiple aspects, it's not going to work so well. So this is just to
show you what we can do with Twitter data. So this was just a question out of curiosity: what
would happen if we applied this model to Twitter data? Because we noticed that a lot of the
sentiment work that's been done on Twitter data isn't very good. They just use a list of words and
kind of look to see if a tweet contains that word or not.
So we wanted to see if we can get sort of topic-specific sentiments out of the tweets. So we
tested on 1.3 million tweets, 50,000 words in our vocabulary.
A thing to notice about Twitter data is that unlike the Amazon or the Yelp.com reviews where
there is pretty explicit polarity being expressed in those reviews, a lot of the tweets don't really
have any sentiment. If they do it's closer to feelings, how people are feeling, whether they're
being happy or sad, rather than I really like this or this is good or this is bad.
So I think that makes Twitter data a little bit difficult to do sentiment analysis on. But we'll see
how the sentiments turn out. This is the fun part. Right? So for the seed words we don't have to
really think too much with the Twitter data.
These are the topics that we found, the positive senti-aspects. So there's some pretty obvious
ones like about the singers and stuff. Ice cream yummy stuff, good night, good morning. Happy
birthday topic here. American Idol stuff. And so pretty obvious stuff like I'm feeling happy at
home type of stuff going on, right?
And then there's some other topics, more of the obvious topics. God bless you type of stuff. So
we do see some political stuff going on. If you notice, if you look at the words, there aren't too
many sentiment words or there are actually no sentiment words that you can really pick out and
say why this aspect turned out to be positive.
And we'll see a negative counterpart of this and the negative senti-aspect looks pretty much the
same, actually. The negative senti-aspects, again, the same type of stuff. But interesting things
going on, right? I'm hurt, I'm feeling bad. The stock market, I guess, is not so good. Tired. Kind
of a spam topic. Michael Jackson's death.
>>: Is there any ordering in this, or is this --
>> Alice Oh: No, there's no ordering. And these are from data, I guess, from 2007 to 2009. And
I can't figure out how to do stuff on Twitter. Some more senti-aspects to show you: flights
being delayed, there's a lot of traffic. I don't want to take the test.
So here's another political topic, something about Obama. And this happens because,
as in the answer to the previous question, our
model assigns a senti-aspect to every single tweet. And that kind of works well for product
reviews because sentences in the reviews do have sentiments or most of them do. But in the
Twitter data, a lot of the sentences or tweets don't have a lot of sentiments.
If you're linking to a New York Times article about the Obama healthcare issue, a lot of the users
don't explicitly say whether they like it or not; they just write something about it. And it just turns out that our
model probably just kind of randomly assigns sentiments to those tweets. So that's sort of the
downside of the model doing it that way.
So that's pretty much it. I didn't even notice that last slide. So that's our model, the aspect
sentiment unification model.
We're going to -- this is going to be part of WSDM, which is in February next year. So we
actually have the camera-ready on our website if you want to go and fetch the paper and read it.
Okay. Questions? [applause].
>>: Let me clarify what we were just looking at when you showed the senti-aspects. Those are single words,
am I correct, that could appear on both lists? I don't know whether any did.
>>: Obama did.
>>: I want to make sure that my mental model is right: we're just looking at one slice of the senti-aspect
model, and words can appear on both lists?
>> Alice Oh: Yeah. Yeah.
>>: So sorry for another question about sarcasm but it's a deep personal interest of mine.
>> Alice Oh: Okay [laughter].
>>: If I were worried about sarcasm interfering with data, not just can you derive good data from
sarcasm, but can you at least factor it out, I would be looking for three negative comments
followed by something very positive that doesn't have some kind of start to the phrase like
however or on the other hand, to indicate change of sentiment.
Is there any analysis like that being done? Because it seems like you go from one sentence to
another. If you see a sudden change of sentiment, that should be suspicious.
>> Alice Oh: Yeah, there's nothing like that I know of. So there's an independence assumption here,
right? So we're assuming that every sentence in the document is just independent of -- well, not
quite independent of one another, because there's the distribution of the topics within the
document itself.
But we didn't specifically look at -- nobody has really specifically looked at -- how the sentiment
changes through the document, where some sudden change just signals something
like that. I mean, that's a good suggestion. This is just general.
>>: You sort of talked -- I guess you'll see a lot of "here's the positive" -- I see a lot of reviews where
here's the positive, then the next paragraph is here's the negative. It seems like we're taught
to write in a way that should allow you to gather more information from the structure.
>> Alice Oh: Yeah, yeah. So we didn't look at any of the structure within sort of the higher level
structure of review or document. But that is certainly something interesting to look at. I'm not
sure how you would do that and kind of incorporate it into the model. I don't know, we can try to
figure something out.
>>: Can you talk a little bit about sort of how the corpus changes over time? We were talking
about that a little earlier.
>> Alice Oh: So this is something totally different from this work. We have this model that looks at how
documents change over time in respect to the topics that they talk about. So it applies pretty well.
Very well, actually, to the Twitter data. If something happens today, it's going to appear a lot on
Twitter -- the Twittersphere, whatever you call it. But then the next day it's just going to be a
whole new set of topics and so on, right?
So there are variants of the topic model, variants of LDA, like the dynamic topic models or topics
over time which try to capture those sort of dynamic changes to the topics.
But the downside of those, sort of the limitation of those models is that they don't really capture
how new topics emerge and topics kind of disappear through time.
So we built this new model, which I hope to publish soon. It's called the distance-dependent CRF,
Chinese Restaurant Franchise. It's a hierarchical counterpart of LDA. And we built into it the notion of
distance: when you have distances between different documents, the topic probabilities are
going to change.
So when we have a new tweet, say we have a bunch of tweets from today, each of them is going
to have some topic probability distributions but they're going to look a lot more like yesterday's
than they do of tweets a year ago or six months ago.
So things like that. Or you can do it with locations, too -- where the tweets are coming from. If
they're coming from Seattle, they're going to be talking about different things than if they're
coming from Korea or something like that.
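The distance-dependent model itself isn't spelled out here, but the intuition that today's tweets should look more like yesterday's than like last year's can be illustrated with a simple exponential decay over time distance; this kernel is only a stand-in, not the actual model.

    import math

    def time_weight(days_apart, timescale=7.0):
        """Weight of a past document when estimating today's topic proportions."""
        return math.exp(-days_apart / timescale)

    for days in [0, 1, 30, 180, 365]:
        print(f"{days:>4} days apart -> weight {time_weight(days):.4f}")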
So there's a lot of stuff you can do with topic models. And if you -- we can talk more about it. Did
you have another question?
>>: However you defined accuracy, seems like your accuracy will go up if you are able to just
drop some of the -- just not assign sentiment to some of the more ambiguous sentences. How
robust is -- for applications that allow that, if your goal is to like put an icon next to every single
tweet, gotta do it, if your goal is generalization in general aspects, finding, topic finding, seems
like you could throw out half your sentences and improve accuracy and still -- how robust is your
confidence signal to let you do that? And have you tried -- whatever your accuracy metric is,
have you tried tossing out your bottom K percent and seeing how your accuracy goes up?
>> Alice Oh: So the Twitter stuff, this is brand new and we haven't done any analysis of our
results. So I don't know. It's true that if you, a lot of the tweets are not going to have any
sentiment. And maybe if we throw out all the political stuff or something, then the topics will look
more like they have some sentiment built into them.
With the product review data, I don't think it's going to change too much because a lot of the
sentences do have some kind of sentiment in them. But that's just -- but we haven't done any
testing with that either.
>>: It would be interesting, just as a coarse one-time human experiment, to read a couple of random
Amazon reviews. I'm curious, if I had to pick, what percentage of sentences contain meaningful
sentiment information -- I don't know. This is basically a question of how terse the reviews written
on Amazon are and whether people blabber a lot; I have no idea.
>> Alice Oh: So traditional sentiment analysis people do do that. The first thing that they do is
pick out the sentences that have subjective content; the objective sentences they just throw
out, and they kind of work only with the subjective sentences.
>>: I was thinking of it more as a post-processing step: if you have senti-aspects assigned,
and presumably each sentence effectively has a probability for its assigned
aspect, take off the bottom probabilities. Without making any new assumptions about
what's objective or subjective, it seems for many applications your accuracy would go up.
>> Alice Oh: Right. There are a lot of -- a few things that we're trying to do post all of this
processing. And something that you could do with the senti-aspects themselves is to look at how many of
them are really about the -- how many of them really contain something about sentiment and not
just non-sentiment, sort of objective words.
A very, very simple thing that you can try is the ratio of like nouns versus adjectives, if you do part
of speech tagging. And we can look at the probabilities -- because every senti-aspect has every
single word in it, we can look for where the sentiment seed words are. If they're
higher up in the list, then that topic is probably a sentiment topic, whereas if they're really low, then
it probably doesn't have that much sentiment in it.
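A sketch of those two heuristics: the adjective-to-noun ratio among a senti-aspect's top words (using an off-the-shelf part-of-speech tagger, here NLTK), and the rank of the sentiment seed words within the topic's word list. The word lists and values are illustrative only.

    import nltk   # nltk.download("averaged_perceptron_tagger") may be needed first

    def adjective_noun_ratio(top_words):
        tags = [tag for _, tag in nltk.pos_tag(top_words)]
        adjectives = sum(t.startswith("JJ") for t in tags)
        nouns = sum(t.startswith("NN") for t in tags)
        return adjectives / max(nouns, 1)

    def best_seed_rank(ranked_words, seed_words):
        ranks = [i for i, w in enumerate(ranked_words) if w in seed_words]
        return min(ranks) if ranks else None

    ranked = ["waiter", "friendly", "attentive", "table", "service", "great"]
    seeds = {"great", "excellent", "love"}

    print("adj/noun ratio:", adjective_noun_ratio(ranked))
    print("best seed rank:", best_seed_rank(ranked, seeds))   # low rank -> likely a sentiment topic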
>>: You might also want to go over time, because some things that start out as non-sentiment
topics move into expressions involving a great deal of sentiment. I recall when the crackdown in
Bangkok occurred, initially it was very detailed descriptions by local people who were describing
exactly what was going on and where. And over time it was flooded with oh my God,
what's happening in Thailand.
>> Alice Oh: That's right.
>>: And basically as the amount of sentiment increased, the noise-to-signal ratio also increased.
>> Alice Oh: Right. So there's a lot of work on Twitter and emergency response -- stuff like, if
there's an earthquake or some bombing somewhere, people will start out describing the event, and
then afterwards they're going to say, as you said, all those sad or happy things that are going on.
So we haven't looked at it. But that's a very interesting direction.
>>: To quantify that.
>> Alice Oh: So the two different research projects that we're working on, one is the sentiment
stuff, which is not dynamic at all at this point. And then we have the topic stuff that's dynamic.
So we want to at some point kind of merge the two, and we kind of half-jokingly say we should do
a distance-dependent hierarchical aspect sentiment
unification model -- all of that in one model. But that's certainly an interesting direction to go. Yeah. Okay.
Thank you.
[applause]