>>: Okay. I think we'll go ahead and get started with the next session. I hope everybody had a
good lunch. There's probably still a few people finding their way back. So I think we'll get started.
So we have a few talks in this session that are on a variety of different NLP topics. So our first
speaker is Meider Lehr, who is going to be talking about duration features and speech
recognition.
>> Meider Lehr: Okay. Hi. So I'm going to talk about discriminatively estimated acoustic, duration and language models for speech recognition. This is work I am doing with Izhak Shafran.
Currently, in a speech recognizer, the parameters of the components are estimated independently. What I mean is that the parameters of the acoustic models and the language models are estimated independently.
Consider the finite-state transducer representation of the language model, where the paths represent word sequences and the weights represent probabilities. These probabilities are estimated with maximum likelihood estimation or with discriminative modeling.
Similarly, the lexicon is a deterministic mapping from words to phones. It may have weights representing different pronunciations of a word, but these weights are also estimated independently from the language model and the acoustic model.
The acoustic representation of the [indiscernible] sounds is with hidden Markov models. Hidden Markov models represent the temporal variability of speech with the transition probabilities, in a fixed linear left-to-right topology, and the spectral variability of speech with the observation models.
The observation vectors have, as I mentioned, between 39 and 50 dimensions, and the state transitions are two-dimensional. The high dimensionality of the observations makes the state transition probabilities almost irrelevant. The transition probabilities are what encode duration, so duration is effectively not modeled in most speech recognizers.
As you will see shortly, we estimate the weights of the transition probabilities, but we don't change the observation model.
Duration information is useful, for example, to improve the performance of speech recognizers in noisy environments, since prosodic features are more robust to noise and channel variations. It can also be useful for languages where duration plays a crucial role in discriminating between words. Phone duration is important in the comprehension of speech.
For example, in [indiscernible] it discriminates minimal pairs, for example in this case [indiscernible]; and also in English it is a very useful cue to discriminate between words like 'set' and 'sit', or 'sip' and 'sift'.
There are many studies in the literature that have added duration information. Some approaches are extensions of the HMM technique. In hidden semi-Markov models they replace the self-transition probability a_ii with a duration distribution. In inhomogeneous hidden Markov models, not only the self-transition probability a_ii but also the transition probability a_ij depends on the duration distribution. And in expanded-state hidden Markov models, they represent each of the hidden Markov model states with another sub-HMM.
In that way they increase the number of parameters. These approaches haven't been widely adopted because they are computationally expensive, because we can no longer apply the standard training, and because the amount of data needed grows as we add more parameters to estimate. And furthermore, the improvements they reported were very limited.
Another approach is a post-processing technique: people have used this approach to model duration at the word level and then used this information to rescore the output of the speech recognizer.
But this approach is not efficient for languages with a large vocabulary like Arabic. To set the context of the talk, consider the finite-state transducer representation of the speech recognizer, which can be seen as a composition of different finite-state transducers. The decoding graph can be made compact with standard FST operations like determinization and minimization.
So G is the language model that defines the context dependency between words. L is the lexicon that maps words into phones. Then the context dependency transducer C maps the phones into context-dependent phones, and finally H, the HMMs, are the acoustic representation.
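In symbols, this is the standard decoding-graph cascade (a common textbook formulation, given here for reference rather than taken from the talk's slides):

    D = \min\big(\det\big(H \circ C \circ L \circ G\big)\big)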
There are several recent papers that have addressed the issue of jointly estimating the parameters of the acoustic and the language models. In the first one they adjusted the weights of the decoding graph after composing the acoustic and the language models, but that approach is computationally expensive. In the second and third works, they map multi-phone units to utterances, in one case using segmental conditional random fields and in the other using a maximum entropy approach.
These approaches have been tested on constrained tasks. So, to motivate our approach, consider the word lattices that the speech recognizer outputs. The task of the current system is to pick the best hypothesis, the one with the lowest cost. But this hypothesis doesn't always match the oracle hypothesis, the hypothesis with the minimum word error rate. The speech recognizer makes systematic errors that the language model doesn't consider, and the discriminative language model compensates for these word-level errors.
In the finite-state transducer representation, this can be seen as composing the output lattice with the discriminative language model, projecting to the output side, collapsing identical paths, and finally picking the best hypothesis.
However, the finite-state transducers that the speech recognizer outputs don't only contain word sequences; they also contain acoustic state sequences. So what we propose is, instead of ignoring that information, to take it as extra information to build the discriminative models.
In the finite-state transducer representation, this can be seen as composing the lattice with two discriminative models, one on the output side and the other one on the input side. The resulting transducer is projected to the output side, we collapse identical paths, and then we pick the hypothesis with the lowest cost.
So in this case we are applying the correction both on the input side and on the output side. From the word lattices, from the finite-state transducers, we can extract word sequences and also state sequences at each time stamp. These sequences are usually represented as n-gram subsequences, for example bigrams on the word sequences and unigrams on the state sequences. And what we do is estimate the weights that represent the context dependency between words and the weights that represent the context dependency between the transition probabilities.
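As a rough illustration of the two kinds of n-gram features just described (this is not the authors' code; the feature naming is mine), a hypothesis can be turned into a sparse feature vector like this:

    from collections import Counter

    def word_bigram_features(words):
        """Count word-bigram features, with sentence-boundary padding."""
        padded = ["<s>"] + list(words) + ["</s>"]
        return Counter(zip(padded, padded[1:]))

    def state_unigram_features(states):
        """Count unigram features over the acoustic (clustered-state) sequence."""
        return Counter(states)

    def hypothesis_features(words, states):
        """Merge both feature sets into one sparse vector (a dict)."""
        feats = Counter()
        for bg, c in word_bigram_features(words).items():
            feats[("w", bg)] += c
        for st, c in state_unigram_features(states).items():
            feats[("s", st)] += c
        return feats

    feats = hypothesis_features("the cat sat".split(), [1000, 1000, 2431])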
Another feature that we can easily add is the duration. The speech recognizer provides duration at the word and [indiscernible] level, but for a large-vocabulary task, modeling duration at the word level is too sparse, because we don't have enough occurrences of each word to model representative distributions. At the phone level it is too coarse, because we don't have enough symbols to model many variations.
So modeling duration at the clustered-state level is a good trade-off. We encode duration in two ways: as counts and as indicators. As counts, we take into account how many times a feature appeared within the utterance, and as indicators, we take into account only the presence or absence of the feature. So consider the following clustered state sequence, where state 1,000 with duration two appears twice.
In the count case, the feature for state 1,000 with duration two will have the value two, and in the indicator case it will have the value one. However, representing duration this way has a limitation. Consider, for example, that this distribution models the duration of a state in the training data; as you see, the duration value six doesn't have any occurrences in the training data. So if we model the data with a continuous distribution, we can apply a kind of smoothing and avoid this.
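A tiny sketch of the two encodings (counts versus indicators) over (clustered state, duration) pairs, using the example from the talk; the state ids are illustrative:

    from collections import Counter

    def duration_features(state_durations, as_indicator=False):
        """state_durations: list of (state_id, duration_in_frames) pairs."""
        counts = Counter(state_durations)
        if as_indicator:
            return {k: 1 for k in counts}   # presence/absence only
        return dict(counts)                  # how many times each pair occurred

    # Example from the talk: state 1000 with duration 2 appears twice.
    seq = [(1000, 2), (2431, 5), (1000, 2)]
    print(duration_features(seq))                     # {(1000, 2): 2, (2431, 5): 1}
    print(duration_features(seq, as_indicator=True))  # {(1000, 2): 1, (2431, 5): 1}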
As a first experiment, we model the duration at the HMM state level. So for each clustered state we model a duration distribution. Here in the graph you can see the empirical distribution of a state, and then we fit this empirical distribution with a normal distribution and a log-normal density function.
People have usually used the gamma distribution to model duration at the state level, but it seems like the log-normal distribution is a good choice. To extract features from the continuous duration models, we have to do some kind of quantization. Our approach is to divide the distribution into three regions: one region covers durations with low probability, another region durations with middle probability, and a third region durations with high probability values.
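A minimal sketch of one way to fit a log-normal duration model for a single state and quantize durations into three regions; the talk does not give the exact region boundaries, so the quantile cutoffs below are my own assumption:

    import numpy as np
    from scipy.stats import lognorm

    def fit_duration_model(durations):
        """durations: array of observed frame counts for one clustered state."""
        shape, loc, scale = lognorm.fit(np.asarray(durations, float), floc=0)
        return lognorm(shape, loc=loc, scale=scale)

    def duration_region(model, d):
        """Map a duration to a 'high', 'mid' or 'low' probability region."""
        lo50, hi50 = model.ppf(0.25), model.ppf(0.75)   # central 50% of the mass
        lo90, hi90 = model.ppf(0.05), model.ppf(0.95)   # central 90% of the mass
        if lo50 <= d <= hi50:
            return "high"
        if lo90 <= d <= hi90:
            return "mid"
        return "low"

    model = fit_duration_model([3, 4, 4, 5, 5, 5, 6, 7, 9, 12])
    print(duration_region(model, 5), duration_region(model, 30))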
Global linear models have been successfully used in discriminative language modeling. The decoding task in speech recognition maps an input utterance x to a word sequence y. GEN(x) is the function that enumerates the candidates for input x. The representation Phi maps each pair (x, y) to a feature vector, and there is a weight vector. In our case the feature vector will have acoustic, duration and lexical features.
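In the usual notation for global linear models (the standard Collins-style formulation, written out here for reference), the decoder picks

    F(x) = \arg\max_{y \in \mathrm{GEN}(x)} \; \mathbf{w} \cdot \Phi(x, y)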
F(x) is then the output of the discriminative model. We use the perceptron algorithm to build the discriminative models because, even with a large feature space, it converges quickly. In the perceptron algorithm we penalize the features of the best hypothesis when it doesn't match the oracle hypothesis, the hypothesis with the minimum word error rate, and we reward the features of the oracle hypothesis. Actually, we use the averaged perceptron algorithm to avoid overtraining. And we didn't add the baseline score to the training, because it could dominate the other features. So what we did is interpolate the score of the baseline speech recognizer with the output of the discriminative model using an interpolation weight.
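A compact sketch of averaged-perceptron reranking in this spirit (illustrative only, not the authors' implementation; the n-best data format, the use of higher-is-better scores, and the way the interpolation weight is applied are all my assumptions):

    from collections import defaultdict

    def dot(w, feats):
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    def train_averaged_perceptron(nbest_lists, epochs=3):
        """nbest_lists: list of n-best lists; each hypothesis is a tuple
        (features_dict, baseline_score, is_oracle)."""
        w, w_sum, steps = defaultdict(float), defaultdict(float), 0
        for _ in range(epochs):
            for nbest in nbest_lists:
                best = max(nbest, key=lambda h: dot(w, h[0]))
                oracle = next(h for h in nbest if h[2])
                if best is not oracle:
                    for f, v in oracle[0].items():
                        w[f] += v          # reward oracle features
                    for f, v in best[0].items():
                        w[f] -= v          # penalize 1-best features
                steps += 1
                for f, v in w.items():     # accumulate for averaging
                    w_sum[f] += v
        return {f: v / max(steps, 1) for f, v in w_sum.items()}

    def rescore(nbest, w, alpha=0.3):
        """Interpolate the baseline score with the discriminative score;
        here higher is better, whereas the talk speaks in terms of costs."""
        return max(nbest, key=lambda h: alpha * h[1] + (1 - alpha) * dot(w, h[0]))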
We test our approach on an Arabic transcription task with 200 hours of Arabic broadcast conversations. We decoded the data using cross-validation. We tested on dev07 and eval07, and the acoustic models of the baseline speech recognizer were trained with 1,000 hours of data. The phone set contains 45 phones, including the long vowels, and the speech recognizer contains 5,000 clustered pentaphone states, taking into account the word boundaries and the phonetic context. And the language model has a vocabulary size of 737K words and is a four-gram language model. So for each weighted finite-state transducer we extracted the 100-best unique hypotheses for training and the 1,000-best unique hypotheses for testing.
Here you can see in the first row the results for the baseline recognizer; then when we only build a discriminative model with lexical features, using around 1.1 million features in training, with an interpolation weight of .30; then when we add acoustic features; and finally when we include the duration features. We tested on the cross-validation set, dev07 and eval07, and we observed a gain of between 1.2 and 1.6 percent. And the proportion of duration features was much smaller than that of the lexical features.
We then applied the continuous duration models, after fitting the duration distributions with the gamma and log-normal distributions, and we didn't observe any gain. But looking at the features, we see that practically all the duration features from the test data appeared in the training data, which could be the reason for not having any gain. So, as a conclusion, we propose an extension of the discriminative language model using features of both the input and the output side of the speech recognizer.
We showed an improvement of between 1.2 and 1.6 percent with the discrete approach. We didn't get any improvement with the continuous approach, but maybe that is because there were not many duration features in the test data that were not in the training data. So, as a conclusion, we can say that we have proposed a framework that overcomes the weaknesses of hidden Markov models in modeling the acoustic state transitions and durations.
And I want to thank Brian [indiscernible] from IBM for providing the tools, and Brian Roark for his feedback and his perceptron algorithm. Thank you.
[applause]
>>: Okay. So we have just under ten minutes for questions. Raise your hand I can bring the
microphones around.
>>: Just a simple question first. On your first result slide there was a distinction I didn't quite catch. There. Back one more. So the bottom two, what's the difference between the two phi sub d(s) rows?
>> Meider Lehr: In this case we model, we encode the duration as counts.
>>: I see.
>> Meider Lehr: And as this case as indicator.
>>: Binary indicator functions.
>> Meider Lehr: Yes. We didn't have any significant differences in one or the other way of
encoding.
>>: And you end up learning a duration for each of these 5,000 penta phone states.
>> Meider Lehr: Yes.
>>: Okay. Interesting.
>>: Normally when people did this discriminative training for speech tasks, they used gradient-based methods for optimization rather than the perceptron. I remember IBM used to have a gradient method for training.
>> Meider Lehr: Yes. But here the feature space is quite big. So the training could be
computationally quite expensive.
>>: Okay. But I thought that you were using a gamma distribution or a log-normal distribution.
>> Meider Lehr: You are referring to when I model duration with a continuous distribution.
>>: Uh-huh. So in that case --
>> Meider Lehr: But --
>>: It would be more sensible to use gradient method for training.
>> Meider Lehr: I don't catch what --
>>: So you learn the parameters of the gamma distribution in log normal distribution. Do you
agree?
>> Meider Lehr: What I do is to fit the empirical duration distributions of the HMM states with the log-normal distribution and the gamma distribution.
>>: And so you use that as a model.
>> Meider Lehr: Extract from there the features to train the model with the Perceptron algorithm.
>>: Okay. So still discrete?
>> Meider Lehr: Yes. You have -- we do the quantization after modeling the distribution.
>>: I see. I think the more reasonable approach is to use that model directly and then learn the
parameters for those distributions, and that probably should be compared with that.
>> Meider Lehr: Okay.
>>: I think it's a hybrid discriminative-generative model, where you're reweighting the output from some generative model, and the perceptron is trying to learn a weight over maybe the log probability in your generative model. Log probability?
>> Meider Lehr: No we are using a linear approach.
>>: Go to the next slide. Next. Sorry.
>> Meider Lehr: So we are using the linear --
>>: Yes. But your theta sub D -- can you go back to slide 19? So theta sub d, gamma, s equals m, that means you're including a feature which is the log probability of the duration according to the gamma distribution.
>> Meider Lehr: Yes.
>>: You reweight that log probability using Perceptron. Does that answer your question?
>>: Yes. I see. Good.
>>: You have worked with Arabic, which is very rich in terms of durational minimal pairs. Does the method carry across to other languages such as English, where durational minimal pairs are almost rare?
>> Meider Lehr: This is something that we have in mind that to try this approach for English, for
example, and see if we will have any significant gain. Maybe here we got the significant gain
because of the nature of the language. Yes, you are right.
>>: Time for one or two more questions. One thing I was wondering about with your dataset: you said Arabic broadcast conversations, is it purely spontaneous speech, where people are actually conducting an interview, or is it partly prepared speech, or a combination?
>> Meider Lehr: Conversations are like phone calls, I think. Phone conversations.
>>: It's not like prepared news speech, spontaneous.
>> Meider Lehr: No, this is spontaneous.
>>: Any final questions? Well, let's thank our speaker again.
[applause]
Okay. So the next talk is going to be by Nicholas FitzGerald on summarization of evaluative text.
>> Nicholas FitzGerald: Okay. Hello. So I'm going to be talking about a system I developed called Assess, which is an abstractive summarization system for evaluative text. So, automatic summarization: the goal is to determine and express the most relevant information from a large collection of text. And this is dealing with a problem that's familiar, I'm sure, to a lot of us, information overload. For instance, if you search the Web for 'automatic summarization', you might get 85,000 pages, or for 'best digital camera', over 38 million.
And obviously no one can read that much, except maybe a grad student. But... so specifically for this project we're working on reviews. So, for instance, on Amazon or a website in Canada, dinehere.ca, you might have reviews of a camera or a restaurant; or on blogs, there's lots of opinions being expressed; or on message boards for things like stocks.
There's a lot of opinions out there that could be useful for various tasks. So this system as far as
we know is the first fully automatic abstractive summarization system to complete this task. All
the various parts existed but this is the first time they've all been put together.
And I managed to make an improvement over one of the steps of the pipeline. And so far we've
done preliminary testing on a wide range of different domains such as restaurants and digital
cameras, video games, et cetera. So just before I get into how this works, here's an example
summary. This is for a restaurant in Vancouver, and as you can see, it's summarizing based on
various aspects of the dining experience like service or the price, the food, et cetera. So there's
two main approaches in automatic summarization. The more traditional one which has been
pursued is extractive summarization which is where sentences are extracted from the source
input text. And this is generally easier and faster, partly because it can be framed as a binary
classification problem.
But the problem is that the summaries can lack coherence and it's difficult to aggregate
information effectively. So the approach we take in this pipeline is abstractive summarization,
which is where we first extract information into an internal format and then we use natural
language generation to generate new sentences expressing that information in a cohesive form.
So in this way we can get better aggregation and more coherence.
So here's an illustration of the difference. In extractive summarization you might have two sentences expressing contrasting opinions, for example, 'I thought the battery life was more than adequate' and 'We liked the camera but the battery ran out too quickly.' Obviously this doesn't make very much sense if you read it in that order, whereas with abstractive summarization you could say something like 'users had contrasting opinions about the battery life.' So again the pipeline is to extract the information into an internal data format and then generate a summary.
And for this project, our lab already had a natural language generation part of the pipeline. So I
was mostly focusing on the data abstraction.
So one other input to this system, whose use I'll explain later on, is a user-defined hierarchy. This is a hierarchy of the features of the product that are important to the user of the summarization system. So for a digital camera it would be things like convenience, battery life, picture quality, et cetera.
picture quality, et cetera.
And for a restaurant it could be things like service, ambiance and then the menu items. So
there are four main steps to the abstraction process. First, we're going to extract feature terms from the reviews. At this point we call these crude features, because they're taken directly from the input text: they can use different ways to refer to the same feature, there can be spelling mistakes, et cetera. For this we used Hu and Liu's method. The next step is to map these crude features onto the user-defined hierarchy, which is input by the user; in this way we can aggregate terms and we can reduce redundancy.
The next is to evaluate the opinions being expressed in sentences. So we're going to assign a
score to each sentence between negative 3 and plus 3, which is whether it's a positive or
negative sentence. And then content selection and natural language generation will express this
information in a summary.
So for step one, the crude feature extraction, I followed an algorithm by Hu and Liu which is based on the Apriori algorithm. It comes in two main steps. The first is to discover frequent features, which are words from noun phrases that commonly occur throughout the reviews. Then they have a step which supplements these frequent features with more infrequent features, which are noun phrases that co-occur with words that have been identified as sentiment words. This step reduces precision but increases recall. With the addition of the infrequent features, they reported precision of .72 and recall of about .8. For this step it was more important to have high recall: I tested some other algorithms that had higher precision, but we want good coverage at this point, because later on, when we're mapping onto the hierarchy, we'll be able to prune out a lot of the incorrectly identified features.
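A much-simplified sketch of the 'frequent feature' half of that idea: count candidate noun phrases across sentences and keep those above a support threshold. The real Hu and Liu method uses Apriori-style association mining and also adds infrequent features that co-occur with sentiment words; none of that is shown here.

    from collections import Counter

    def frequent_features(sentence_noun_phrases, min_support=0.5):
        """sentence_noun_phrases: list of lists of noun-phrase strings,
        one inner list per review sentence (from a chunker/parser)."""
        n_sentences = len(sentence_noun_phrases)
        counts = Counter(phrase.lower()
                         for sent in sentence_noun_phrases
                         for phrase in set(sent))
        return {term for term, c in counts.items()
                if c / n_sentences >= min_support}

    crude = frequent_features([["battery life", "camera"],
                               ["picture quality"],
                               ["battery life", "price"]],
                              min_support=0.5)
    print(crude)   # {'battery life'}  (appears in 2 of 3 sentences)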
So at this point we have the sentences with the features identified here highlighted in yellow and
extracted these features from the text. So what next? Now that we have these crude features,
the problem is that there's multiple terms which can refer to the same aspect of the product. And
there can be spelling mistakes and unfamiliar terminology.
So one approach we could take at this step is basic clustering based on word similarity. But this
can be quite prone to error and the difficulty is even once we have the clusters, it's not
necessary -- we then have to decide which of the crude features is the most -- is the most
appropriate to use. So the way we solve this problem is with the user-defined hierarchy.
So what we're going to do is map these crude features onto the hierarchy like this, and then the terms in the hierarchy will become the ones used in the summary. One other thing: here at the head of the hierarchy we have a node for any of the crude features which are not placed in the hierarchy, and that's how we can prune out incorrectly identified features from the previous step.
So the mapping is an extension of synonym detection, and this is a step in which I did most of my
work and managed to create some improvements.
So the first thing to be done was I identified 13 word similarity algorithms, which have been used,
which fell into two main classes. The lexicon-based ones which had previously been investigated
for this task and I supplemented these with some distribution-based similarity metrics.
So the lexicon-based ones are based on WordNet, which is a lexical database of around 150,000 English words, and it defines relationships such as hypernyms, hyponyms and synonyms. The hypernym-based lexical word similarity metrics are based on the hypernym trees, and are various metrics based on how far apart words are in terms of height and path distance. The other class are gloss-based ones, based on word overlap in the definitions of the words in this database. So these are the seven lexical ones used. Another important point is that, since these are based on word senses, and because word sense disambiguation would be impractical in this step, we just use the word senses with the maximum similarity.
The other class of algorithm were the distribution-based ones. These are based on the
assumption that words that mean similar things will occur commonly together in documents.
Obviously for this we need a large corpus. And one technique that's been used recently is to use
the Internet, to use search engine hits in order to compute distribution similarity. So, for example,
there's this one based on point-wise mutual information and another one called normalized
Google distance.
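Both search-engine scores can be written down compactly from hit counts. A minimal sketch, where hits_x, hits_y and hits_xy stand for page counts for the two terms and for their co-occurrence, and n_pages is an assumed index size (the exact PMI variant used in the cited work may differ):

    import math

    def pmi_score(hits_x, hits_y, hits_xy, n_pages):
        """Pointwise mutual information from co-occurrence counts."""
        p_x, p_y, p_xy = hits_x / n_pages, hits_y / n_pages, hits_xy / n_pages
        return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

    def ngd(hits_x, hits_y, hits_xy, n_pages):
        """Normalized Google distance (0 = identical usage, larger = less related)."""
        lx, ly, lxy, ln = (math.log(hits_x), math.log(hits_y),
                           math.log(hits_xy), math.log(n_pages))
        return (max(lx, ly) - lxy) / (ln - min(lx, ly))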
So now once we -- and these are based on the number of hits for the two words. So now that we
have these similarity metrics, first we need to normalize them between 0 and 1. And then
because these metrics are based on individual words, whereas the product terms could be
multiple word terms, we're going to take a weighted average of the similarity between words and
the term to get a score for similarity between two terms.
And one benefit of the search engine metrics is that they can be used for entire terms rather than
individual words. So you could search for the whole term picture quality and image quality rather
than the individual word. So that gives us another three metrics to use.
And then once -- so once we have this score between 0 and 1 for two terms, with the individual
metrics, we say if the average, the weighted average of similarity between a crude feature and a
user defined feature is above an empirically defined threshold theta, we'll map that crude feature
to the user defined feature.
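A minimal sketch of this mapping rule; the particular word-to-term weighting (best match per word, then a plain average) is my own simplification of the 'weighted average' described, and word_sim stands in for any of the normalized metrics:

    def term_similarity(term_a, term_b, word_sim):
        words_a, words_b = term_a.split(), term_b.split()
        scores = [max(word_sim(wa, wb) for wb in words_b) for wa in words_a]
        return sum(scores) / len(scores)

    def map_crude_feature(crude, udf_nodes, word_sim, theta=0.5):
        """Map to the most similar user-defined feature if above threshold theta."""
        best = max(udf_nodes, key=lambda u: term_similarity(crude, u, word_sim))
        return best if term_similarity(crude, best, word_sim) >= theta else None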
Now, in order to determine how accurate these mappings are, we have two main metrics which
were useful. The first one is accuracy, and it's based -- it's compared against a gold standard,
which was created by human annotators, and it's a measure of how far away on average a
mapped crude feature is from where it should have been in the gold standard.
So it's a score between 0 and 1, where 1 would mean a perfect mapping, where every crude feature is where it should be, and .5 means every crude feature is on average one edge away from where it should be in the tree.
The other one is recall, which only takes into account those crude features which were placed.
So it ignores the ones that were in the not placed node on the tree.
So preliminary results for the individual metrics, as you can see, they're all above .5. So they're
all doing quite well. The first thing to notice is that the lexicon-based ones did much better than
the distribution-based ones. So these new distribution-based search-engine metrics that I tested didn't work so well, though they're all above .5. But the problem with all these individual metrics is that
they require this empirically determined parameter theta. And as -- so this LCH score was one of
the lexical scores, and it achieved the highest, about .788. But as you can see, it's quite sensitive
to this parameter theta. If you go too far in either direction, the accuracy drops off quite a bit. But
what's even more sensitive to theta is the recall. So this is a problem because we don't
necessarily know ahead of time what the ideal value of theta should be.
And one other thing I want you to notice here is that even with zero percent recall, there's still
quite high accuracy. The reason for that is that zero percent recall means they've all been mapped to the not-placed node at the head of the tree, so they could still be quite close to where they should be. So it's not enough just to look at accuracy; we want to look at recall as well.
So I wanted to improve on this. So the next thing I did was to combine these classifiers with
ensemble methods. So I used a standard voting metric where we created a mapping with an odd
number, 3, 5 or 7 of these previously defined metrics, with their parameters set to zero. So we
have the highest possible recall for each of the individual classifiers.
And then if the majority mapped a given crude feature to a user defined feature, then we would
keep that mapping.
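A minimal sketch of the majority-vote rule just described; each mapper is one of the individual metrics run at its maximum-recall setting:

    from collections import Counter

    def vote_mapping(crude, udf_nodes, mappers):
        """mappers: odd-length list of functions (crude, udf_nodes) -> udf or None."""
        votes = Counter(m(crude, udf_nodes) for m in mappers)
        votes.pop(None, None)
        if not votes:
            return None
        udf, count = votes.most_common(1)[0]
        return udf if count > len(mappers) // 2 else None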
So this is the accuracy improvement with the best groups for N equals 3 and N equals 5. We've beaten all of the individual metrics in terms of accuracy by using the voting method. And also here's the improvement in terms of recall: we've also managed to improve on recall. One other important thing about this is that there's quite a good pattern for determining what would be a good collection of metrics to use in voting. Although I just reported the top score for N equals 3, there's quite a robust pattern where, if you choose one lexical score, one distributional score and one of these other scoring metrics, these are the top ten N equals 3 voting groups. So even though these distributional metrics didn't work very well individually, they've allowed us to get quite a good improvement when we use them in these voting groups.
So now that we've completed these first two steps, the third step of the pipeline is that we need to
determine the sentiments being expressed. So we want to determine semantic orientation, which
is a score -- we use a score between negative three and plus 3. So pretty good would be plus 1.
Overwhelmingly awful would be negative 3. I'm yet to see a review that said overwhelmingly
awful in it. Ideally we'd like to do it individually for every product in a sentence. For example, a
sentence like this camera produces nice pictures but the battery life is abysmal. You have a
positive score for pictures. Negative score for battery life.
Unfortunately, this sort of inter sentential sentiment analysis is very difficult to do and it's kind of
one of the main problems that's being worked on now. So for the purposes of this pipeline, we
made the simplification that we just take one score for the entire sentence. And if we had
sentences that had both a positive and a negative score in them, we'd throw the sentences out.
And in the datasets we looked at, that worked out to about one in six to one in eight of the
sentences.
But the hope is that with enough data it will all average out. So to do this scoring, we used SO-CAL, a system developed at SFU by Kimberly Voll and Maite Taboada. It uses a tagged lexicon of sentiment words, and these are modified by linguistic features such as negation and intensification: so 'good' might be in the lexicon with a score of plus 1, and 'very good' would make it a plus 2; 'bad' would be negative 1, and 'not bad' becomes plus 1. The score for a sentence is then a weighted average of all the sentiment expressions in the sentence.
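A hand-rolled sketch in the spirit of that description (this is not the SO-CAL code; the toy lexicon, intensifier weights and flip-style negation are illustrative only):

    LEXICON = {"good": 1, "great": 2, "bad": -1, "awful": -3}      # toy lexicon
    INTENSIFIERS = {"very": 2.0, "slightly": 0.5}
    NEGATORS = {"not", "never"}

    def sentence_score(tokens):
        scores, i = [], 0
        while i < len(tokens):
            tok = tokens[i].lower()
            if tok in LEXICON:
                score, j = float(LEXICON[tok]), i - 1
                # walk left over modifiers: intensifiers scale, negators flip
                while j >= 0 and tokens[j].lower() in (INTENSIFIERS.keys() | NEGATORS):
                    mod = tokens[j].lower()
                    score = -score if mod in NEGATORS else score * INTENSIFIERS[mod]
                    j -= 1
                scores.append(score)
            i += 1
        return sum(scores) / len(scores) if scores else 0.0

    print(sentence_score("the pictures are very good".split()))   # 2.0
    print(sentence_score("the battery is not good".split()))      # -1.0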
So now for a sentence we have all the features in the sentence and a score between plus 3 and
negative 3. And using the mapping that we determined, we can get a score for the overall user
defined feature. So, for example, picture quality, we have the mapping of picture quality images
and picture. So we have all these three scores from these sentences.
Finally, there's the natural language generation step. This was previous work in our lab. Content selection is based on the importance of a UDF, which is determined by how many reviews are talking about that given feature. Then we compute the average opinion and a
measure of controversiality, the degree to which users agreed on that given score. And then this
is expressed in a concise coherent form with links back to the source material.
So here again is that example summary. So we have it's expressing the reasons why people felt
positive about this restaurant, because diners found the service to be good, because the
customers had mixed opinions about the reservations, et cetera.
Then if you click on these numbers in the HTML output, you'll get a link to a sentence expressing
that opinion from the input.
And then here is an example of a similar summary for digital cameras. One benefit of abstractive summarization is that, once you've extracted this data, you can visualize it in other ways. For example, this is a treemap visualization which we also produce, which is a way of visualizing user opinions. Or you could create a natural language generation component written in another language, so you could have an abstractive summarizer which also translates. So the contributions of this work, again, are that it was the first time that all these elements had been put together in a completely working pipeline.
I was able to improve on the mapping step of crude features. And so far it's been successful in
preliminary testing on quite a wide range of domains.
And there are clear avenues for improvement. Obviously, for future work, I'd like to work on the fine-grained intra-sentential sentiment analysis, and we'd like to improve the expressiveness of the
natural language components so we can capture more information and improve the overall
performance of the pipeline.
So I'd like to thank the lab and there are my references. Thank you very much.
[applause]
>>: Okay. Thanks. So we have lots of time for questions. So I'll bring the microphones around.
>>: Hi. I just have a question. When you were doing the summarization, did you start with topics
in mind or did you find documents without any necessary topic or type of sentence to start from.
>> Nicholas FitzGerald: I'm not sure I know what you mean.
>>: Well, for instance, when you are doing the summarization of restaurants in town, did you look
for all the restaurants in town? Did you look for that specific restaurant in town? Did you start off
with compare and contrast Chamber Belgian restaurant to XYZ Restaurant.
>> Nicholas FitzGerald: No the input for this was just the text from dinehere.ca, specifically for
just one restaurant. So we're looking to summarize the opinions about one specific product at a
time. So you might have the input for one restaurant from dinehere.ca or from one digital camera
from Amazon, something like that.
>>: Also, then if you went from a specific -- did you try answering like a question from a specific
type of sentence, or did you just use phrase searches? Like you would in Google? Or Bing?
[laughter]
>> Nicholas FitzGerald: Well, what we're looking to generate a summary about the whole, about
the whole reviews. But the topics that are mentioned in the summary come from the user-defined
hierarchy that's input. So like when it's talking about service and food and the ambiance, that
comes from the hierarchy that's input.
>>: That people thought --
>> Nicholas FitzGerald: So if someone's only interested in one aspect of the product. Say they
only care about picture quality, you could just input that hierarchy with only picture quality. And
then it would only summarize reviews which mention that feature.
>>: Quick question about the section that you improved. So of the WordNet semantic similarities that you used, the Leacock-Chodorow one came out the best.
>> Nicholas FitzGerald: Yeah.
>>: Do you have any understanding why that was particularly the best with your results? Some
of those are information theoretic measures and others are graph measures.
>> Nicholas FitzGerald: Yeah, I didn't look -- I didn't look individually at why one would be better
than the other. I was really just looking to see how we could get the best possible mapping. So
that's something to look at. Potentially we could improve that in the future.
>>: So another interesting solution to your ensemble method might be to just run a quick learning algorithm over it, a supervised method, to actually choose the best combination of all those WordNet similarities that you've already calculated.
>> Nicholas FitzGerald: Yeah. The problem with that is that it might be domain-specific, whereas
we were hoping to get ideally something that would be unsupervised. So I was hoping that
because we had this regularity in which voting metrics worked well together, that that could be
used as an unsupervised sort of way of picking three that would work well or five that would work
well.
>>: One in the back corner here.
>>: I had a question about the user-defined feature hierarchy. When you define it, are you
defining one hierarchy, say, for digital cameras and you try to make it generic enough to where it
fits, it matches all the digital cameras on the market or do you find you have to do specific
hierarchies for various types of cameras.
>> Nicholas FitzGerald: You could do either. That sort of depends on what user is using the
system or what they wanted to get out of it. But so, for instance, there are websites that already
produce these hierarchies for Consumer Reports, et cetera, but if you were interested in a
specific product feature of one specific product, you could also add that to the hierarchy.
For instance, when I made the hierarchy for the restaurant, I went to that restaurant and found
their menu and put all the menu items on the hierarchy. If they had been mentioned in the
reviews they would be mentioned in the summary.
>>: The reason I'm asking about is how easily does this scale? Once you make this and say you
want to apply it to lots of products out there, is there going to have to be a human involved that
updates the hierarchy every time new products are added with new features.
>> Nicholas FitzGerald: I mean, if you're looking to compare, say, digital cameras there will be
features that are common to all digital cameras. But if you were looking at, say, a feature that
only existed for a specific camera, you would have to add it to the hierarchy. But I found in
practice that creating the hierarchy is really simple. It takes -- and it's also a way of adapting the
system to a specific user, because if one user maybe isn't interested in battery life, they want
picture quality or vice versa or perhaps a less experienced user might not know some of the, what
some of the features mean. So you could tailor it that way as well.
>>: Thank you.
>>: This is also on the feature hierarchy. I'm curious if any of your classifiers here or the
ensemble classifiers are sensitive to the number of features they're trying to classify things
against or are they sort of all individual binary for each feature.
>> Nicholas FitzGerald: No -- yeah, because it was just -- they're just mapping it to the
user-defined feature which is the most similar. So --
>>: But if you have only one user-defined feature you might get more false positives than if you
had 10.
>> Nicholas FitzGerald: Yes, that's true. I haven't looked at that aspect of it.
>>: I realize this may not be directly in your line of sight, but there's often a diachronic effect
where with a new camera, I'm dissatisfied generally but as I learn to use the camera I learn that I
love it. Is that anything that you're looking at.
>> Nicholas FitzGerald: Well, I mean --
>>: Is one negative review just canceling out the positive review from that user, if they even
make a second review.
>> Nicholas FitzGerald: Well, we're really looking on just how to summarize these kind of texts.
So in terms of what you do with the deployed system to deal with that sort of thing, that's not -- that wasn't part of what we're looking at. But I mean hopefully with a big enough dataset, enough
reviews, the hive mind would come up with the right conclusion.
>>: Question here.
>>: Hi, you had an illustration showing that, showing a summary with the customers of a
restaurant having conflicting opinions about something or another. That's quite a sophisticated
inference to make in a system which gets most of its mileage from information associated with the
individual words.
And intersentential relations are important both in the discourse information extraction process as
well as in the generation process I presume. Do you use or will you find it useful to employ any
sort of a formalism for capturing various types of relations between sentence fragments like as
contrast and elaboration and support and implication and things of that nature? For example, like
one might find common in rhetorical structure theory, which people use a lot in generation, I
suppose.
>> Nicholas FitzGerald: That's probably an important part of what we'll have to look at in order to
do the intersentential analysis. Rhetorical structure theory is used in the generation part,
obviously. But, yeah, that's something we'll have to look at.
>>: So your summaries look very impressive for [indiscernible] presumably because it's used in
generation a lot, right.
>> Nicholas FitzGerald: Yes.
>>: But presumably it would be useful in comprehension as well.
>> Nicholas FitzGerald: Yeah, that's true.
>>: We still have time for one or two more questions if there are any more.
>>: Could you just tell us a little bit more about how you choose what to present in the summary,
because it's an abstractive summary. But I'm guessing you have a lot of choices. How do you
choose what to say.
>> Nicholas FitzGerald: Well, I didn't work on the natural language generation part. But there's a
measure of importance, which is calculated based on the number of reviews which are talking
about a specific node on the tree, and it also takes into account the measures of importance of its
children on the hierarchy as well.
So that way we can determine which features are more important. So ones which are higher on
the hierarchy are given more weight. And then we're looking -- also looking to generate sort of a
varied summary. So we want to talk about a wide range of features as well. If you want to know
more about the natural language generation part you should talk to Dr. Carenini.
>>: I assumed that since you used a website like dinehere.ca people actually leave explicit
scores like stars and stuff. Can you use that to measure like how successful your summarization
is, measuring services and reservations and such.
>> Nicholas FitzGerald: Yeah, that's one of the ways we're looking at analyzing the overall
success of the system. Because like you said, people give scores -- the problem is that those
scores aren't very fine-grained. They might give an overall score for service or food or value,
whereas if you want to look at how successful the more individual features are it's not so useful.
But that's definitely something we're looking at.
>>: Is there a strong correlation.
>> Nicholas FitzGerald: We haven't looked at that yet.
>>: I often use the Internet for looking at reviews of cell phones as they appear in Verizon or
other carriers' websites. And I get a lot of use out of applying your method to the customers'
reviews of individual cell phones. Where a phone comes out and within a month there's 2,000
reviews of it. I'd like perhaps a super digest of these reviews using something like this, which is
like a meta review from all those.
>> Nicholas FitzGerald: Yeah. That's a good idea.
>>: Any final questions? Okay. Let's go ahead and thank our speaker.
[applause]
Okay. Our next speaker is Shafiq Joty who is going to be talking about topic modeling.
>>: Shafiq Joty: Okay. So I will be talking about our research on exploiting conversation features
for finding topics in e-mails. And I would like to thank Dr. Giuseppe Carenini, Dr. Raymond Ng
and Dr. Gabriel Murray. Let's see why we are focusing on conversations.
Because nowadays, in our daily lives, we deal with conversations in many different modalities. For example, we now have e-mails, blogs, forums, instant messaging, video conferencing, et cetera, and the Web has significantly increased their volume and complexity. Now, effective and efficient processing of this media can be of [indiscernible] value. For example, an e-mail or blog summary, or a topic view, can give you direct and quick access to the information content. Before joining a meeting or before buying a product, a manager can have a summary of what has been discussed, or a customer can have an overview of the customer reviews.
Now, consider this example of an e-mail conversation from the BC3 corpus. Here you can see Charles is e-mailing the mailing list asking which people would be interested in having a phone connection to a face-to-face meeting. Then the people who are interested reply to this e-mail. Afterwards, Charles again e-mails the mailing list about the time zone differences. So here you can easily see there are two different topics in this conversation. Now, what do we mean by topic modeling? A topic is something about which the participants of a conversation discuss or argue. For example, an e-mail thread about arranging a conference can have topics such as the location, the time, the registration, the food menu, the workshops and the schedule.
In our work, we are dealing with the problem of topic assignment: that means clustering the sentences into a set of coherent topical clusters. Now, let's see one example of this topic [indiscernible] for the example we just saw. Here you can see the first few sentences are in topic ID one, then these two sentences are in topic ID two, and one thing to notice is that topic ID one can be revisited here.
Okay. Now, why do we need this topic modeling? Our main goal here is to perform information extraction and summarization for conversations like e-mails, blogs or [indiscernible], and topic modeling is considered to be a prerequisite for further conversation analysis. And the applications of the derived structure are broad: text summarization, information ordering, automatic question answering, information retrieval and user interfaces.
Now, what are the challenges that we face when we are dealing with e-mails or other conversations? E-mails are different from general monologue or dialogue in various ways. They're asynchronous: that means different people sitting at different places at different times collaborate with each other. They're informal in nature; different people have different styles of writing, people tend to use short sentences, and often they are ungrammatical. Also, the topics do not change in a sequential way, and headers can be misleading when we are trying to find the topics in an e-mail.
Now, let's see the example here. You see this sentence [indiscernible]: this person is quoting one of these sentences, and he just wrote this short sentence. If you take this sentence individually, it does not bear any meaning. Again, here you can see Charles is using the same subject line to talk about a different topic, which is the time zone difference. And here you can see that topic one is revisited. So here the assumption of topical boundaries as in monologue or dialogue is not applicable. So this is the example: short and informal, headers can be misleading, and not sequential.
So the outline of the rest of the talk: first we'll see the existing methods for topic modeling, then how the existing methods can be applied to conversations, then we'll see whether we can do better than those, then the preliminary evaluation on a small development set, and then we'll focus on the current work that we're doing right now.
So, the existing methods in the literature. Topic segmentation, or finding the topic boundaries in written monologue or in dialogue, has received most of the attention, and the methods fall into one of two categories: supervised methods or unsupervised methods. In supervised methods, the problem is a binary classification problem: given a sentence break, you have to decide whether there should be a topical boundary or not, using a set of features. In the unsupervised case, these are the two popular models: one is the lexical chain-based segmenter, LCSeg, and the other is the probabilistic topic model, latent Dirichlet allocation. We'll see how these two models can be applied to conversations.
Recently, for conversation disentanglement, graph partitioning methods have been applied. But previously there was no work on e-mails, so this is what we're trying to do.
Now, first let's see how we can apply these existing methods to the e-mail conversations. First, the probabilistic topic model, that is, LDA. As you know, this is a generative topic model of documents, where the assumption is that a document is a mixture of topics and a topic is a multinomial distribution over words. To generate a document, you first select the topic distribution for the document; then, for each word in that document, you first select a topic from this distribution and then sample the word from that topic's word distribution.
In our case, each e-mail is considered as a document. Then we are given [indiscernible] a distribution of words over the topics. Assuming the words in a sentence occur independently, we can extend those word probabilities to sentence-level probabilities, and we can just take the argmax to find the topic for each sentence.
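A minimal sketch of that sentence-level assignment, assuming we already have per-topic word distributions from a fitted LDA model (the data layout is my assumption):

    import math

    def sentence_topic(sentence_words, word_topic_prob, floor=1e-12):
        """word_topic_prob: dict topic -> {word: p(word | topic)}.
        Picks the topic maximizing the product of per-word probabilities."""
        best_topic, best_logp = None, float("-inf")
        for topic, word_probs in word_topic_prob.items():
            logp = sum(math.log(word_probs.get(w, floor)) for w in sentence_words)
            if logp > best_logp:
                best_topic, best_logp = topic, logp
        return best_topic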
So this is how we applied the LDA model on our e-mail conversations. And let's see the
second model, that is, the lexical chain-based approach. As you know, lexical chains are chains of words that are semantically related, and the relations can be synonymy, repetition, hypernymy and hyponymy. But the lexical chain-based approach, LCSeg, uses only the repetition relation. The chains are ranked according to two measures: the number of words in a chain and the compactness of the chain. The words in a chain are given a score based on the rank of the chain. Then, once we have the scores for the words, we form pseudo-sentences of fixed length, and for each pair of consecutive windows we compute a similarity measure between them. If the measure falls below a threshold, a boundary is placed. So this is the traditional lexical chain-based approach.
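A rough sketch in the spirit of that procedure (not the actual LCSeg implementation; the absolute similarity threshold, in particular, is a simplification of LCSeg's minima detection):

    import math

    def cosine(a, b):
        keys = set(a) | set(b)
        num = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def window_vector(sentences, chain_score):
        """Represent a window of sentences as a term vector weighted by chain scores."""
        vec = {}
        for sent in sentences:
            for term in sent:
                vec[term] = vec.get(term, 0.0) + chain_score.get(term, 0.0)
        return vec

    def boundaries(sentences, chain_score, win=3, threshold=0.1):
        """Return indices i such that a boundary is placed after sentence i."""
        cuts = []
        for i in range(win, len(sentences) - win + 1):
            left = window_vector(sentences[i - win:i], chain_score)
            right = window_vector(sentences[i:i + win], chain_score)
            if cosine(left, right) < threshold:
                cuts.append(i - 1)
        return cuts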
Now, as you have noticed, both LDA and LCSeg only consider bag-of-words features and the temporal relation between the e-mails. But they ignore other important features, such as the conversation structure, the mention of names, topic shift keywords like 'now' and 'however', and the from-to relation. As examples of these important features, here you can see that people often quote sentences from other e-mails to talk about the same topic, and here you can see that people often use names to disentangle the conversation. So, to find the topics, we have to consider these features. What we need is to capture the conversation at a finer granularity level, consider all these important features, and then find a way to combine all of this into a model.
So let's see the way that we extract conversations from the e-mails. We analyze the actual body of the e-mails. By analyzing the bodies, we find there are two kinds of fragments: new fragments, with depth level zero, and quoted fragments, with depth level greater than zero. And we form a graph where the nodes represent the fragments and the edges between the nodes represent the referential relation between them. It becomes clearer with an example, so let's just see the example.
Here we have six e-mails in an e-mail thread. E-mail one has only one new fragment, A. E-mail two has a new fragment B, and it's replying to A. Then e-mail three has a new fragment C, replying to B. For simplicity we assume that a fragment can reply to its neighboring fragments. In e-mail four we have this fragment DE, but until we process e-mail five we don't know that D and E are two different fragments. So we have to find the different fragments in a first pass, and then we form the graph in a second pass. The result of this process is this graph, where the nodes are the fragments and the edges represent the referential relations between these fragments.
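A rough sketch of the construction just described, under my own simplifying assumptions: quotation depth is read off leading '>' markers, fragments are identified by their text, and an edge is added between fragments that are neighbors in some e-mail. Real fragment matching across e-mails is more involved than this.

    def split_fragments(body):
        """Group consecutive non-blank lines with the same quotation depth."""
        frags, cur_depth, cur_lines = [], None, []
        for line in body.splitlines():
            if not line.strip():
                continue
            stripped = line.lstrip("> ")
            depth = line[:len(line) - len(stripped)].count(">")
            if depth != cur_depth and cur_lines:
                frags.append((cur_depth, " ".join(cur_lines)))
                cur_lines = []
            cur_depth = depth
            cur_lines.append(stripped.strip())
        if cur_lines:
            frags.append((cur_depth, " ".join(cur_lines)))
        return frags

    def fragment_quotation_graph(emails):
        """emails: list of raw e-mail body strings; returns a set of edges."""
        edges = set()
        for body in emails:
            frags = [text for _, text in split_fragments(body)]
            for a, b in zip(frags, frags[1:]):   # neighbouring fragments
                edges.add((a, b))
        return edges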
So once we have this graph, let's see whether we can apply the existing methods on it. We chose to use LCSeg on this fragment quotation graph and to test whether it improves the performance or not. On each path of the fragment quotation graph we apply LCSeg and find the topic segments. Then, to combine the results of the different paths, we form a graph where the nodes are the sentences and the edge weights represent the number of paths in which the two sentences u and v fall in the same segment. And we find the optimal clusters with the normalized cut criterion.
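A minimal sketch of that aggregation step, with scikit-learn's SpectralClustering standing in for a normalized-cut solver on the precomputed sentence affinity matrix (the data layout is my assumption):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cooccurrence_matrix(n_sentences, segmentations):
        """segmentations: one dict per graph path, mapping each sentence index
        on that path to a segment id (sentences not on the path are absent)."""
        w = np.zeros((n_sentences, n_sentences))
        for seg in segmentations:
            for i, si in seg.items():
                for j, sj in seg.items():
                    if i != j and si == sj:
                        w[i, j] += 1
        return w

    def cluster_sentences(w, n_topics):
        model = SpectralClustering(n_clusters=n_topics, affinity="precomputed")
        return model.fit_predict(w)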
So this is how we applied LCSeg on the fragment quotation graph. Now, let's focus on our preliminary evaluation. For the evaluation metrics, note that, given two topic assignments, the metric should measure how well these two assignments agree. And here note that the widely used [indiscernible] is not applicable, because we may have a different number of topics in the two annotations. So what we did is adapt these three measures from this paper [indiscernible], which are more appropriate for this task.
These are the 1-to-1 measure, the loc-k measure and the M-to-1 measure; we'll see these measures in the next few slides. First, the 1-to-1 metric: it measures global similarity by pairing up the clusters of the two annotations in the way that maximizes the overlap. So here, for example, we have a source annotation and a target annotation. We first map each source cluster to the target cluster with which it has the highest overlap. So this is the source annotation, and then we have this optimal mapping: the reds are mapped to one target cluster, the greens and the blues to others, and so on. Once we have this optimal mapping, we relabel the source accordingly and then measure the overlap between the source and the target. Here you can see the overlap is 70 percent: this one is correct, this one is correct, this one is wrong, this one is wrong; out of ten, seven are correct and three are wrong.

Then the loc-k measure measures local agreement with a sliding window. For each sentence, in the source we first check, for the previous k sentences, whether each is in the same or a different cluster: for this one it is different, for this one it is the same, because this is red and this is red, but this one is different. We do the same with the target: this one is the same, this is the same, this is different. Then we measure the overlap between these two sets of judgments, and here you can see it's 66 percent.

Then there is the fact that some annotations can be more fine-grained than others. For this we use the M-to-1 measure, which maps each cluster of the first annotation to the single cluster in the second annotation with which it has the greatest overlap. This measure gives us an intuition of an annotator's specificity, but to compare the models we should not use this measure; we should use 1-to-1 and loc-k.
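A minimal sketch of the 1-to-1 measure using the Hungarian algorithm to find the overlap-maximizing pairing (scipy's linear_sum_assignment); labels are per-sentence cluster ids:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def one_to_one(source, target):
        """source, target: lists of cluster labels, one per sentence."""
        src_ids, tgt_ids = sorted(set(source)), sorted(set(target))
        overlap = np.zeros((len(src_ids), len(tgt_ids)))
        for s, t in zip(source, target):
            overlap[src_ids.index(s), tgt_ids.index(t)] += 1
        rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
        return overlap[rows, cols].sum() / len(source)

    print(one_to_one(["a", "a", "b", "b", "b"], [1, 1, 1, 2, 2]))  # 0.8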
For our evaluation, we are using these five threads from the BC3 corpus, and you can see the [indiscernible] agreement between the five annotators. We compared these annotations with the annotations of the paper those measures come from, on the chat corpus: this one is higher than that one, this one is lower, and this one is about the same. So it seems a feasible task to do. Now, let's see the results of the three systems that we described. Here you can see the probabilistic LDA model, the LCSeg model, and LCSeg applied with the fragment quotation graph. You can see that LCSeg is performing better than LDA, and LCSeg with the fragment quotation graph is performing better than the plain LCSeg method.
And the differences are significant. So right now, what we are trying to do is this proposed solution, where we intend to consider a rich feature set for each pair of sentences. Among the topic features we have LSA, latent semantic analysis, and LDA, latent Dirichlet allocation; then we have the fragment quotation graph, the speaker, and the mention of names; and among the lexical features we have the TF-IDF measure and the keywords. Our method is 'classify', which is supervised, and then 'cut'. We'll use a binary classifier to decide, given two sentences, whether they should be in the same topic or not. Then we'll form a weighted graph where the nodes are the sentences and the edge weights denote the class membership probability, and the problem becomes just a graph partitioning problem, which we can solve using the normalized cut criterion. Thanks.
[applause]
>>: Okay. Plenty of time for questions again. So if you would put your hands up, I could bring
the mics around.
>>: So I was wondering how rough on average how many topics do you find per --
>>: Shafiq Joty: In the five e-mail threads that we had, we found, I guess, 3.4 per thread on
average.
>>: Per thread. So you have five e-mail threads to work on.
>>: Shafiq Joty: Yeah.
>>: Did you have multiple people annotate.
>>: Shafiq Joty: Yeah, yeah, there were five annotators.
>>: Okay. Did they find it hard to come up with the same set of topics, even if they -- and even if
they, you know, kind of found the same span for a topic, did they label it in the same way.
>>: Shafiq Joty: Yeah, yeah, they labeled it. And, like you said, for the clustering purpose there is the label matching problem: one annotator can have one cluster which is labeled as cluster one, but the other annotator can have the same cluster as cluster two. But we made the annotators write the name of the topic. So our results are something like this: the inter-annotator agreement, here, on the e-mail corpus. Here you can see that in one of the e-mails there's like 100 percent agreement, they all agree; and this is the minimum agreement and this is the mean of it.
>>: So if they named the topic, were the names that the annotators gave kind of, would that have
helped the identification -- I was wondering if the LDA models would predict what the names for
the clusters should be, just out of curiosity.
>>: Shafiq Joty: Yeah, but I don't think that the LDA model is capable of doing that, because you can see that the performance here is quite low.
>>: Okay. Other questions.
>>: So then how often are people using the like angled brackets to actually show that they are
referring to, like, different topics? The thing is I actually don't use those in like any of my e-mails.
I'm just wondering how reliable --
>>: Shafiq Joty: In our corpus, people are using quotations quite often. Our corpus is the BC3 that we have at UBC. So people often use quotations.
>>: Yeah.
>>: So e-mail threads are so hard to come by. And you yourself have five. Can we initiate some
effort to try to collect more so that we can --
>>: Shafiq Joty: Yeah, we have 40. We have 40.
>>: You have 40.
>>: Shafiq Joty: Yeah, but just for this experiment we just tried with 5, just to -- it was a pilot
study. So we just did it with five. So right now we are trying to like use all this 40.
>>: This corpus is mainly based on -- so the BC 3 corpus is from the World Wide Web
Consortium. So it's a mailing list and we have lots of e-mails from this mailing list. There is no
problem of privacy. It's a public -- so the challenge is to annotate this data. But unlabeled data is
available.
>>: Shafiq Joty: And one thing is that I was trying to incorporate this conversation structure into the LDA model as a prior. The LDA model uses a Dirichlet prior for the document-topic and topic-word distributions. So I was trying to incorporate this network structure into the model, and I came up with one approach that allows us to do this. Right now I'm mostly done with this, so hopefully it will come up with some better results, and hopefully we'll see some papers.
>>: I was just going to ask about the sequential structure, because I think this is sort of what you
were just talking about. It sounded like you weren't really modeling -- like if you were talking
about topic one, maybe you're going to continue talking about topic one to the next reply to the
e-mail. I think it sounded like you sort of weren't doing that but that's sort of future work; is that
correct.
>>: Shafiq Joty: Which one.
>>: So if like the topic maybe will stay the same. I'm just talking about the transition between
topics.
>>: Shafiq Joty: Topics? We are considering this problem as a clustering problem. But we have this LCSeg feature, which considers whether sentences work together or not. So implicitly we are considering this.
>>: Okay.
>>: Thanks for your time. One question. So do you have a fixed number of topics when you're
running LDA or are you using a Chinese restaurant process to get an arbitrary number of topics?
Is it nonparametric.
>>: Shafiq Joty: No, we set the alpha parameter to be very low. And we check the probability of z, the latent variable, to see whether it goes below a threshold. If it's below that threshold, then that topic is gone; that topic is not there.
>>: I see. Because looking at your results here --
>>: Shafiq Joty: So it's kind of automatic cluster detection.
>>: Right. Looking at your results here, LDA seems to do much better on M-to-1 than it does on
any of the other evaluation metrics. I wonder if it's proposing too many topics.
>>: Shafiq Joty: I think, no, no, for this we fixed the number of topics; the number of topics is just what the humans found in this conversation, just to see the best output of LDA that one could get. But now I'm trying with this alpha set very low, and this parameter.
>>: Thanks.
>>: Shafiq Joty: And just to tell you that the prior that I'm using is Dirichlet tree so it allows you to
incorporate network structure.
>>: You might have experienced it, too, but I often am part of e-mail threads which tend to kind of
gradually proliferate in such a way that each respondent diverges slightly from the original topic of
the e-mail to such a point that soon further on what you respond to has nothing to do with the
original topic, maybe nothing to do with the topic at all. Do you have a method of quantifying the relevance to the original topic in such a way that you can just filter out all the nonsense at some point in time?
>>: Shafiq Joty: Yeah, like we have this fragment quotation graph. So we are considering the
distance between two sentences in this fragment quotation graph. So if the distance is higher
then we're considering the topic shift.
>>: Okay.
>>: Any more questions? We have a couple minutes. Anybody have a final question.
I have one question. So with e-mails there are so many cases where people will revisit topics, so it's not sequential. I'm wondering if you have a sense of how prevalent that is, and also whether it's
the same in e-mails versus chats, if you've noticed any differences with how topic structure is
realized in --
>>: Shafiq Joty: In chats, like in the multi-party chat it's different because different people are
interacting with each other. So there are many different conversations. But for the real chat, I did
not check it because we did not have this chat corpus in the real chat. We had just multi-party
chat. But I think it's kind of more synchronous in chat, but in e-mail it's asynchronous. So people
can revisit things.
>>: Okay. So if there's no more questions, let's thank our speaker again.
[applause]