18764 >>: Okay. I think we'll go ahead and get started with the next session. I hope everybody had a good lunch. There's probably still a few people finding their way back. So I think we'll get started. We have a few talks in this session on a variety of different NLP topics. Our first speaker is Meider Lehr, who is going to be talking about duration features and speech recognition. >> Meider Lehr: Okay. Hi. So I'm going to talk about discriminatively estimated joint acoustic, duration and language models for speech recognition. This is work I am doing with Izhak Shafran. Currently, in a speech recognizer, the parameters of the components are estimated independently. What I mean is that the parameters of the acoustic models and the language models are estimated independently. Consider the finite-state transducer representation of the language model, where the paths represent the word sequences and the weights represent the probabilities. These probabilities are estimated with maximum likelihood estimation or with discriminative modeling. Similarly, the lexicon is a deterministic mapping from words to phones. It may have weights representing the different pronunciations of a word, but these weights are also estimated independently from the language model and the acoustic model. The acoustic representation of the [indiscernible] sounds is with hidden Markov models. Hidden Markov models represent the temporal variability of speech with the transition probabilities, in a fixed linear left-to-right topology, and the spectral variability of speech with the observation models. The observation vectors have, as I mentioned, between 39 and 50 dimensions, while the state transitions are two-dimensional. This difference in dimensionality makes the state transition probabilities not very relevant. Transition probabilities only implicitly encode duration, so duration is not properly modeled in most speech recognizers. As you will see shortly, we estimate the weights of the transition probabilities, but we don't change the observation model. Duration information is useful, for example, to improve the performance of speech recognizers in noisy environments, since prosodic features are more robust to noise and channel variations. And it can also be useful for languages where duration plays a crucial role in discriminating between words. Phone duration is important in the comprehension of speech, for example to discriminate minimal pairs in [indiscernible], as in this case [indiscernible], and also in English it is a useful cue to discriminate between words such as "sit" and "seat" or "sip" and "seep". There are many studies in the literature that have added duration information. Some approaches are extensions of the HMM technique. In hidden semi-Markov models, they replace the self-transition probability a_ii with a duration distribution. In inhomogeneous hidden Markov models, not only the self-transition probability a_ii but also the transition probability a_ij depends on the duration distribution. And in expanded-state hidden Markov models, they represent each of the hidden Markov model states with another sub-HMM, and in that way they increase the number of parameters. These approaches haven't been widely adopted because they are computationally expensive, because we can no longer apply the standard training algorithms, and because they require more data, since we are adding more parameters to estimate. Furthermore, the improvement that they reported was very limited.
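A quick way to see why the standard HMM topology models duration only weakly: with a fixed self-loop probability a_ii, the implied state-duration distribution is geometric, which decays monotonically and cannot capture the peaked durations seen in real speech. The following is only a minimal illustrative sketch (the numbers are made up), not part of the system described in the talk.

```python
import numpy as np

def geometric_duration_pmf(a_ii, max_dur=20):
    """Duration pmf implied by an HMM self-loop probability a_ii:
    P(d) = a_ii**(d-1) * (1 - a_ii), for d = 1, 2, ..."""
    d = np.arange(1, max_dur + 1)
    return a_ii ** (d - 1) * (1.0 - a_ii)

# Example: a self-loop probability of 0.7 puts most mass on very short
# durations and decays monotonically, unlike empirical phone durations,
# which tend to peak a few frames in (hence the gamma/log-normal fits
# discussed later in the talk).
pmf = geometric_duration_pmf(0.7)
print(pmf[:5], pmf.sum())
```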
Another approach is the post-processing technique. People have used this approach to model duration at the word level, and then used this information to rescore the output of the speech recognizer. But this approach is not efficient for languages with a large vocabulary, like Arabic. To set the context of the talk, consider the finite-state transducer representation of the speech recognizer, which can be seen as a composition of different finite-state transducers, and the decoding graph can be made compact with standard FST operations like determinization and minimization. So G is the language model that defines the context dependency between words. L is the lexicon that maps words into phones. Then the context-dependency transducer maps the phones into context-dependent phones, and finally the HMMs are the acoustic representation. There are several recent papers that have addressed the issue of jointly estimating the parameters of the acoustic and the language models. In the first one, they adjusted the weights of the decoding graph after composing the acoustic and the language models, but this approach is computationally expensive. In the second and third works, they map multi-phone units into utterances, in the first case using segmental conditional random fields and in the second case using a maximum entropy approach. These approaches have been tested on constrained tasks. To motivate our approach, consider the word lattices output by the speech recognizer. The task of the current system is to pick the best hypothesis, the one with the lowest cost. But this hypothesis doesn't always match the oracle hypothesis, the hypothesis with the minimum word error rate. The speech recognizer makes systematic errors that the language model doesn't consider, and the discriminative language model compensates for these word-level errors. In a finite-state transducer representation, this can be seen as composing the output lattice with the discriminative language model, projecting onto the output side, then collapsing identical paths and finally picking the best hypothesis. However, the finite-state transducers that the speech recognizer outputs don't only contain word sequences; they also contain acoustic state sequences. So what we propose is, instead of ignoring that information, to take it as extra information to build the discriminative models. In the finite-state transducer representation, this can be seen as composing the lattice with two discriminative models, one on the output side and the other one on the input side. The resulting transducer is projected onto the output side, we collapse identical paths, and then we pick the hypothesis with the lowest cost. So in this case we are applying the correction both on the input side and on the output side. And so from the word lattices, from the finite-state transducers, we can extract word sequences and also state sequences at each time stamp. These sequences are usually represented as n-gram subsequences: so, for example, bigrams on the word sequences and unigrams on the state sequences. And what we do is estimate the weights that represent the context dependency between words and the weights that represent the context dependency between the transition probabilities.
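To make the feature space just described more concrete, here is a small sketch of the kind of features the discriminative model could be built on: word bigrams on the output side and (clustered state, duration) events on the input side. The feature naming, the input format and the count/indicator switch are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter

def extract_features(words, state_durations, use_counts=True):
    """Word bigrams on the output side plus (state, duration) unigrams
    on the input side of a recognizer hypothesis.

    words           : list of word strings, e.g. ["the", "cat", "sat"]
    state_durations : list of (clustered_state_id, duration_in_frames)
    use_counts      : True -> count features, False -> binary indicators
    """
    feats = Counter()
    # word bigrams with sentence-boundary padding
    for w1, w2 in zip(["<s>"] + words, words + ["</s>"]):
        feats[("word_bigram", w1, w2)] += 1
    # (clustered state, duration) unigrams
    for state, dur in state_durations:
        feats[("state_dur", state, dur)] += 1
    if not use_counts:  # indicator encoding: presence/absence only
        feats = Counter({k: 1 for k in feats})
    return feats

print(extract_features(["the", "cat"], [(1000, 2), (1000, 2), (57, 4)]))
```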
Another feature that we can easily add is duration. The speech recognizer provides duration at the word and [indiscernible] level, but for a large-vocabulary task, modeling duration at the word level is too sparse, because we don't have enough occurrences of each word to model representative distributions. At the phone level it is too coarse, because we don't have enough symbols to model the many variations. So modeling duration at the clustered-state level is a good trade-off. We encode duration in two ways: as counts and as indicators. As counts, we take into account how many times a feature appears within the utterance. As indicators, we take into account only the presence or absence of the feature. So consider the following clustered-state sequence, where state 1,000 with duration two appears twice. In the count case, the feature for state 1,000 with duration two will have the value two, and in the indicator case it will have the value one. However, representing the duration in this way has a limitation. Consider, for example, that this distribution models the duration of a state in the training data. As you see, the duration of six doesn't have any occurrence in the training data. So if we model the data with a continuous distribution, we can apply some kind of smoothing and avoid this. As a first experiment, we model the duration at the HMM state level, so for each HMM state we model a duration distribution. Here in the graph you can see the empirical distribution of a state, and then we fit this empirical distribution with the normal distribution and the log-normal density function. People have usually used the gamma distribution to model duration at the state level, but it seems the log-normal distribution is a good choice. To extract features from the continuous duration model, we have to do some kind of quantization. Our approach is to divide the distribution into three regions: one region will group durations with low probability, another region durations with middle probability, and a third region durations with high probability values. Global linear models have been successfully used in discriminative language modeling. The decoding task in speech recognition maps the input utterances X to word sequences Y. GEN(X) is the function that enumerates the candidates for input X. The representation Phi maps each pair (X, Y) to a feature vector, and there is a weight vector; in our case the feature vector will have acoustic, duration and lexical (language) features. Then F(X), the output of the discriminative model, is the candidate in GEN(X) with the highest score under the weight vector. And we use the perceptron algorithm to build the discriminative models, because even with a large feature space it converges quickly. In the perceptron algorithm, we penalize the features of the best-scoring hypothesis when it doesn't match the oracle hypothesis, the hypothesis with the minimum word error rate, and we reward the features of the oracle hypothesis. Actually, we use the averaged perceptron algorithm to avoid overtraining. And we didn't add the baseline score to the training, because it could dominate the other features. Instead, we interpolate the score of the baseline speech recognizer with the output of the discriminative model using an interpolation weight.
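A minimal sketch of the averaged-perceptron reranking loop described above, assuming feature dictionaries like those in the earlier sketch. The update rule (reward the oracle features, penalize the features of the current best hypothesis, average the weights over updates) follows the description in the talk, but the details here are simplified and not the authors' exact implementation.

```python
from collections import defaultdict

def sparse_dot(w, feats):
    """Dot product between a weight dict and a sparse feature dict."""
    return sum(w[f] * v for f, v in feats.items())

def train_averaged_perceptron(nbest_lists, oracle_ids, epochs=3):
    """nbest_lists : list of n-best lists; each hypothesis is a feature dict
    oracle_ids  : index of the minimum-WER (oracle) hypothesis per list
    Returns the averaged weight vector."""
    w = defaultdict(float)
    w_sum = defaultdict(float)
    updates = 0
    for _ in range(epochs):
        for nbest, oracle_id in zip(nbest_lists, oracle_ids):
            best_id = max(range(len(nbest)),
                          key=lambda i: sparse_dot(w, nbest[i]))
            if best_id != oracle_id:
                for f, v in nbest[oracle_id].items():  # reward oracle features
                    w[f] += v
                for f, v in nbest[best_id].items():    # penalize current best
                    w[f] -= v
            for f, v in w.items():                     # accumulate for averaging
                w_sum[f] += v
            updates += 1
    return {f: v / updates for f, v in w_sum.items()}
```

At test time, the perceptron score would then be interpolated with the baseline recognizer score using the interpolation weight mentioned above.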
We tested our approach on an Arabic transcription task with 200 hours of Arabic broadcast conversations, and we decoded that data using cross-validation. We tested on dev07 and eval07, and the acoustic models of the baseline speech recognizer were trained with 1,000 hours of data. The phone set contains 45 phones, including the long vowels, and the speech recognizer contains 5,000 clustered pentaphone states, taking into account word boundaries and cross-word context. The language model has a vocabulary size of 737K words and it's a four-gram language model. From each weighted finite-state transducer we extracted the 100-best unique hypotheses for training and the 1,000-best unique hypotheses for testing. Here you can see in the first row the results for the baseline recognizer; then when we build a discriminative model with only lexical features, using around 1.1 million features in training, with an interpolation weight of 0.30; then when we add acoustic features; and finally when we include the duration features. We tested on the cross-validation set, dev07 and eval07, and we observed a gain of between 1.2 and 1.6 percent, and the proportion of duration features was much smaller than that of the lexical features. We then applied the continuous duration models, after fitting the duration distributions with the gamma and log-normal distributions, and we didn't observe any gain. But looking at the features, we see that practically all the duration features from the test data appeared in the training data, so that could be a reason for not having any gain. So as a conclusion, we propose an extension of the discriminative language model using features of the input and the output side of the speech recognizer. We report an improvement of between 1.2 and 1.6 percent with the discrete approach. We didn't get any improvement with the continuous approach, but maybe that is because there weren't many duration features in the test data that were not in the training data. So, as a conclusion, we can say that we have proposed a framework that overcomes the weaknesses of the hidden Markov models in modeling the acoustic state transitions and durations. And I want to thank Brian [indiscernible] from IBM for providing the tools, and Brian Roark for his feedback and his perceptron algorithms. Thank you. [applause] >>: Okay. So we have just under ten minutes for questions. Raise your hand and I can bring the microphones around. >>: Just a simple question first. On your first results slide there was a distinction I didn't quite catch. There. Back one more. So the bottom two, what's the difference between the two phi duration features? >> Meider Lehr: In this case we encode the duration as counts. >>: I see. >> Meider Lehr: And in this case as indicators. >>: Binary indicator functions. >> Meider Lehr: Yes. We didn't see any significant difference between one or the other way of encoding. >>: And you end up learning a duration for each of these 5,000 pentaphone states. >> Meider Lehr: Yes. >>: Okay. Interesting. >>: Normally when people do this discriminative training for speech tasks, they use gradient-based methods for optimization rather than the perceptron. I remember IBM used to have a gradient method for training. >> Meider Lehr: Yes. But here the feature space is quite big, so the training could be computationally quite expensive. >>: Okay. But I thought that you were using a gamma distribution and a log-normal distribution. >> Meider Lehr: You are referring to when I model duration with a continuous distribution. >>: Uh-huh. So in that case -- >> Meider Lehr: But. >>: It would be more sensible to use a gradient method for training.
>> Meider Lehr: I don't catch what -- >>: So you learn the parameters of the gamma distribution and the log-normal distribution. Do you agree? >> Meider Lehr: What I do is fit the empirical duration distribution of each HMM state with the log-normal distribution and the gamma distribution. >>: And so you use that as a model. >> Meider Lehr: And extract from there the features to train the model with the perceptron algorithm. >>: Okay. So still discrete? >> Meider Lehr: Yes. We do our quantization after modeling the distribution. >>: I see. I think the more reasonable approach is to use that model directly and then learn the parameters for those distributions, and that probably should be compared with this. >> Meider Lehr: Okay. >>: I think it's a hybrid discriminative-generative model, where you're reweighting the output from some generative model, and the perceptron is trying to learn a weight over maybe the log probability from your generative model. Log probability? >> Meider Lehr: No, we are using a linear approach. >>: Go to the next slide. Next. Sorry. >> Meider Lehr: So we are using the linear -- >>: Yes. But your theta sub D -- can you go back to slide 19? So theta sub d, gamma of s equals m -- that means you're including a feature which is the log probability of the duration according to the gamma distribution. >> Meider Lehr: Yes. >>: You reweight that log probability using the perceptron. Does that answer your question? >>: Yes. I see. Good. >>: You have worked with Arabic, which is very rich in terms of durational minimal pairs. Is there a way to carry this across the board to other languages such as English, where durational minimal pairs are almost rare? >> Meider Lehr: This is something that we have in mind, to try this approach for English, for example, and see if we would have any significant gain. Maybe here we got the significant gain because of the nature of the language. Yes, you are right. >>: Time for one or two more questions. One thing I was wondering about with your dataset: you said Arabic broadcast conversations. Is it purely spontaneous speech, where people are actually conducting an interview, or is it partly prepared speech, a combination? >> Meider Lehr: The conversations are like phone calls, I think. Phone conversations. >>: It's not like prepared news speech; it's spontaneous. >> Meider Lehr: No, this is spontaneous. >>: Any final questions? Well, let's thank our speaker again. [applause] Okay. So the next talk is going to be by Nicholas FitzGerald on summarization of evaluative text. >> Nicholas FitzGerald: Okay. Hello. So I'm going to be talking about a system I developed called Assess, which is an abstractive summarization system for evaluative text. So, automatic summarization: the goal is to determine and express the most relevant information from a large collection of text. And this is dealing with a problem that's familiar, I'm sure, to a lot of us: information overload. For instance, if you search the Web for "automatic summarization", you might get 85,000 pages, or for "best digital camera", over 38 million. And obviously no one can read that much, except maybe a grad student. But... so specifically for this project we're working on reviews. So, for instance, on Amazon or a website in Canada, dinehere.ca, you might have reviews of a camera or a restaurant; or on blogs and message boards there are lots of opinions being expressed about things like stocks. There are a lot of opinions out there that could be useful for various tasks.
So this system, as far as we know, is the first fully automatic abstractive summarization system to complete this task. All the various parts existed, but this is the first time they've all been put together, and I managed to make an improvement on one of the steps of the pipeline. And so far we've done preliminary testing on a wide range of different domains such as restaurants, digital cameras, video games, et cetera. So just before I get into how this works, here's an example summary. This is for a restaurant in Vancouver, and as you can see, it's summarizing based on various aspects of the dining experience like the service, the price, the food, et cetera. So there are two main approaches in automatic summarization. The more traditional one which has been pursued is extractive summarization, where sentences are extracted from the source input text. This is generally easier and faster, partly because it can be framed as a binary classification problem, but the problem is that the summaries can lack coherence and it's difficult to aggregate information effectively. So the approach we take in this pipeline is abstractive summarization, where we first extract information into an internal format and then use natural language generation to generate new sentences expressing that information in a cohesive form. In this way we can get better aggregation and more coherence. Here's an illustration of the difference. With extractive summarization you might have two sentences expressing contrasting opinions, for example, "I thought the battery life was more than adequate" and "We liked the camera but the battery ran out too quickly." Obviously this doesn't make very much sense if you read it in that order, whereas with abstractive summarization you could say something like "users had contrasting opinions about the battery life." So again, the pipeline is to extract the information into an internal data format and then generate a summary. For this project, our lab already had the natural language generation part of the pipeline, so I was mostly focusing on the data abstraction. One other input to this system, whose use I'll explain later on, is a user-defined hierarchy. This is a hierarchy of the features of the product that are important to the user of the summarization system. For a digital camera it would be things like convenience, battery life, picture quality, et cetera, and for a restaurant it could be things like service, ambiance and then the menu items. So there are four main steps to the abstraction process. First we're going to extract feature terms from the reviews. At this point we call these crude features because they're taken directly from the input text, and they can use different ways to refer to the same feature; there could be spelling mistakes, et cetera. For this we used Hu and Liu's method. The next step is going to be to map these crude features onto that user-defined hierarchy, which is input by the user. In this way we can aggregate terms and reduce redundancy. The next is to evaluate the opinions being expressed in sentences: we're going to assign a score to each sentence between negative 3 and plus 3, according to whether it's a positive or negative sentence. And then content selection and natural language generation will express this information in a summary.
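To make the inputs concrete, here is one plausible way to represent the user-defined hierarchy and the "not placed" bucket mentioned above. The exact format used by the system isn't specified in the talk, so this is only an assumed illustration.

```python
# A user-defined feature (UDF) hierarchy for a digital camera, as a nested
# dict: each node maps a feature name to its children.  Crude features that
# cannot be placed anywhere end up in a special bucket at the root.
camera_hierarchy = {
    "camera": {
        "convenience": {"battery life": {}, "size": {}},
        "picture quality": {"resolution": {}, "color": {}},
        "price": {},
    }
}

NOT_PLACED = "not placed"   # root-level bucket for unmappable crude features

def all_udfs(node, prefix=()):
    """Flatten the hierarchy into full paths, for use by the later mapping step."""
    for name, children in node.items():
        yield prefix + (name,)
        yield from all_udfs(children, prefix + (name,))

print(list(all_udfs(camera_hierarchy)))
```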
So for step one, the crude feature extraction, I followed an algorithm by Hu and Liu which is based on the Apriori algorithm. It comes in two main steps. The first is to discover frequent features, which are words from noun phrases that commonly occur throughout the reviews. Then there is a step which supplements these frequent features with infrequent features, which are noun phrases that co-occur with words that have been identified as sentiment words. This step reduces precision but increases recall. With the infrequent features included, they reported a precision of .72 and a recall of about .8. For this step it was more important to have high recall. I tested some other algorithms that had higher precision, but we want good coverage at this point, because later on, when we're mapping onto the hierarchy, we'll be able to prune out a lot of the incorrectly identified features. So at this point we have the sentences with the features identified, here highlighted in yellow, and we've extracted these features from the text. So what next? Now that we have these crude features, the problem is that there are multiple terms which can refer to the same aspect of the product, and there can be spelling mistakes and unfamiliar terminology. One approach we could take at this step is basic clustering based on word similarity. But this can be quite prone to error, and the difficulty is that even once we have the clusters, we then have to decide which of the crude features is the most appropriate to use. So the way we solve this problem is with the user-defined hierarchy. What we're going to do is map these crude features onto the hierarchy like this, and then the terms in the hierarchy will become the ones used in the summary. One other thing: here at the head of the hierarchy we have a node for any of the crude features which are not placed in the hierarchy, and that's how we can prune out incorrectly identified features from the previous step. So the mapping is an extension of synonym detection, and this is the step in which I did most of my work and managed to create some improvements. The first thing to be done was to identify 13 word similarity algorithms which have been used, and they fell into two main classes: the lexicon-based ones, which had previously been investigated for this task, and I supplemented these with some distribution-based similarity metrics. The lexicon-based ones are based on WordNet, which is a lexical database of around 150,000 English words, and it defines relationships such as hypernyms, hyponyms and synonyms. The hypernym-based lexical word similarity metrics are based on hypernym trees, and are various metrics based on how far apart words are in terms of height and path distance. The other class are gloss-based ones, based on word overlap in the definitions of the words in this database. So these are the seven lexical ones used. Another important point is that, since these are based on word senses, and because word sense disambiguation would be impractical at this step, we just take the word senses with the maximum similarity.
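A sketch of the lexicon-based metrics just described, using NLTK's WordNet interface and taking the maximum similarity over all sense pairs, as in the talk. Path similarity is used here as a stand-in for the various hypernym-tree metrics (the system itself used seven lexical metrics, including Leacock-Chodorow), and the NLTK corpus data has to be downloaded separately.

```python
# pip install nltk; then: python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def max_sense_similarity(word1, word2):
    """WordNet path similarity, maximized over all sense pairs, since word
    sense disambiguation is impractical at this step (as noted in the talk).
    Returns 0.0 if either word is not in WordNet."""
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.path_similarity(s2)   # already in [0, 1]; may be None
            if sim is not None and sim > best:
                best = sim
    return best

print(max_sense_similarity("picture", "image"))
print(max_sense_similarity("battery", "ambiance"))
```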
The other class of algorithms were the distribution-based ones. These are based on the assumption that words that mean similar things will commonly occur together in documents. Obviously for this we need a large corpus, and one technique that's been used recently is to use the Internet — search engine hit counts — to compute distributional similarity. So, for example, there's one based on pointwise mutual information and another one called normalized Google distance, and these are based on the number of hits for the two words. So now that we have these similarity metrics, first we need to normalize them between 0 and 1. And then, because these metrics are based on individual words whereas the product terms could be multi-word terms, we take a weighted average of the similarities between the words in the terms to get a score for the similarity between two terms. One benefit of the search engine metrics is that they can be used for entire terms rather than individual words, so you could search for the whole term "picture quality" and "image quality" rather than the individual words; that gives us another three metrics to use. And then, once we have this score between 0 and 1 for two terms with the individual metrics, we say that if the weighted average similarity between a crude feature and a user-defined feature is above an empirically defined threshold theta, we map that crude feature to the user-defined feature. Now, in order to determine how accurate these mappings are, we have two main metrics which were useful. The first one is accuracy; it is computed against a gold standard created by human annotators, and it's a measure of how far away, on average, a mapped crude feature is from where it should have been in the gold standard. So it's a score between 0 and 1, where 1 would mean a perfect mapping, with every crude feature where it should be, and .5 means every crude feature is on average one edge away from where it should be in the tree. The other one is recall, which only takes into account those crude features which were placed; it ignores the ones that ended up in the "not placed" node of the tree. So, preliminary results for the individual metrics: as you can see, they're all above .5, so they're all doing quite well. The first thing to notice is that the lexicon-based ones did much better than the distribution-based ones, so these new distribution-based search engine metrics that I tested didn't work so well, though they're all above .5. But the problem with all of these individual metrics is that they require this empirically determined parameter theta. This LCH score was one of the lexical scores, and it achieved the highest accuracy, about .788, but as you can see, it's quite sensitive to this parameter theta: if you go too far in either direction, the accuracy drops off quite a bit. What's even more sensitive to theta is the recall. This is a problem because we don't necessarily know ahead of time what the ideal value of theta should be. And one other thing I want you to notice here is that even with zero percent recall, there's still quite high accuracy. The reason is that zero percent recall means everything has been mapped to the "not placed" node at the head of the tree, so the crude features can still be quite close to where they should be. So it's not enough just to look at accuracy; we want to look at recall as well.
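For concreteness, here is a sketch of the thresholded mapping just described: term similarity is computed as an average of word-level similarities, and a crude feature is mapped to its most similar user-defined feature only if that similarity clears the threshold theta. The uniform weighting and the best-match averaging are assumptions (the talk only says "weighted average"), and any word-similarity function — such as the WordNet sketch above or a normalized search-engine-based score — can be plugged in.

```python
def term_similarity(term_a, term_b, word_sim):
    """Average word-to-word similarity between two (possibly multi-word)
    terms.  For each word of term_a we take its best match in term_b and
    average; uniform weights are an assumption."""
    words_a, words_b = term_a.split(), term_b.split()
    sims = [max(word_sim(wa, wb) for wb in words_b) for wa in words_a]
    return sum(sims) / len(sims)

def map_crude_features(crude_features, udfs, word_sim, theta):
    """Map each crude feature to its most similar UDF if the similarity is
    above theta, otherwise to the 'not placed' bucket."""
    mapping = {}
    for cf in crude_features:
        best_udf, best_sim = max(
            ((u, term_similarity(cf, u, word_sim)) for u in udfs),
            key=lambda x: x[1])
        mapping[cf] = best_udf if best_sim >= theta else "not placed"
    return mapping

# e.g. with the WordNet similarity sketched earlier:
# map_crude_features(["image quality", "battery"],
#                    ["picture quality", "battery life"],
#                    max_sense_similarity, theta=0.3)
```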
So I wanted to improve on this. The next thing I did was to combine these classifiers with ensemble methods. I used a standard voting scheme where we create a mapping with an odd number — 3, 5 or 7 — of these previously defined metrics, with their thresholds set to zero, so that we have the highest possible recall for each of the individual classifiers. Then, if the majority map a given crude feature to a user-defined feature, we keep that mapping. So this is the accuracy improvement with the best group of three, for N equals 3 and N equals 5. We've beaten any of the individual metrics in terms of accuracy by using the voting scheme, and here's the improvement in terms of recall: we've also managed to improve on recall. One other important thing about this is that there's quite a good pattern for determining what would be a good collection of metrics to use in voting. Although I just reported the top score for three, there's quite a robust pattern: if you choose one lexical score, one distributional score and one of these other scoring metrics, these are the top ten — the top ten N-equals-3 voting groups. So even though the distributional metrics didn't work very well individually, they've allowed us to get quite a good improvement when we use them in these voting groups. So now that we've completed these first two steps, the third step of the pipeline is that we need to determine the sentiments being expressed. We want to determine semantic orientation, which is a score between negative 3 and plus 3. So "pretty good" would be plus 1; "overwhelmingly awful" would be negative 3. I have yet to see a review that said "overwhelmingly awful" in it. Ideally we'd like to do this individually for every feature in a sentence. For example, for a sentence like "this camera produces nice pictures but the battery life is abysmal", you would have a positive score for pictures and a negative score for battery life. Unfortunately, this sort of intra-sentential sentiment analysis is very difficult to do, and it's one of the main problems being worked on now. So for the purposes of this pipeline, we made the simplification that we just take one score for the entire sentence, and if we had sentences with both a positive and a negative score in them, we'd throw those sentences out. In the datasets we looked at, that worked out to about one in six to one in eight of the sentences. But the hope is that with enough data it will all average out. To do this scoring, we used SO-CAL, a system developed at SFU by Kimberly Voll and Maite Taboada. It uses a tagged lexicon of sentiment words, and these are modified by linguistic features such as negation and intensification: so "good" might be in the lexicon with a score of plus 1, and "very good" would make it a plus 2; "bad" would be negative 1, and "not bad" becomes plus 1. The score for a sentence is then a weighted average of all the sentiment expressions in the sentence. So now, for a sentence, we have all the features in the sentence and a score between plus 3 and negative 3, and using the mapping that we determined, we can get a score for the overall user-defined feature. So, for example, for picture quality we have the mapping of "picture quality", "images" and "picture", so we have all three of these scores from these sentences. Finally, there's the natural language generation step, and this was previous work in our lab. Content selection is based on the importance of a UDF, which is determined by how many reviews are talking about that given feature. Then we compute the average opinion and a measure of controversiality, the degree to which users agreed on that given score. And this is expressed in a concise, coherent form with links back to the source material.
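A small sketch of the per-UDF aggregation feeding content selection: importance from the number of sentences mentioning the feature, the average opinion, and a controversiality measure. The talk does not give the exact formula for controversiality, so the standard deviation used here is only an assumed proxy for "the degree to which users agreed".

```python
import statistics

def aggregate_udf_scores(sentence_scores):
    """sentence_scores : dict mapping UDF name -> list of sentence-level
    orientation scores in [-3, +3] (from SO-CAL, after the crude-feature
    mapping).  Returns per-UDF summary statistics for content selection."""
    summary = {}
    for udf, scores in sentence_scores.items():
        summary[udf] = {
            "importance": len(scores),          # how many sentences mention it
            "avg_opinion": statistics.mean(scores),
            # controversiality: how much users disagree; standard deviation is
            # an assumed proxy, not necessarily the measure used in the system
            "controversiality": statistics.pstdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary

print(aggregate_udf_scores({"service": [2, 3, 1, -1],
                            "reservations": [2, -2, -3, 3]}))
```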
So here again is that example summary. It's expressing the reasons why people felt positive about this restaurant: because diners found the service to be good, because the customers had mixed opinions about the reservations, et cetera. Then if you click on these numbers in the HTML output, you'll get a link to a sentence expressing that opinion from the input. And here is a similar example for digital cameras. One benefit of abstractive summarization is that once you've extracted this data you can visualize it in other ways. For example, this is a treemap visualization which we also produce, which is a way of visualizing user opinions. Or you could create a natural language generation component that was written in another language, so you could have an abstractive summarizer which was also translating. So the contributions of this work: again, this was the first time that all these elements had been put together in a completely working pipeline; I was able to improve on the mapping step for crude features; and so far it's been successful in preliminary testing on quite a wide range of domains. And there are clear avenues for improvement. Obviously, for future work I'd like to work on the fine-grained, intra-sentential sentiment analysis, and we'd like to improve the expressiveness of the natural language components so we can capture more information and improve the overall performance of the pipeline. So I'd like to thank the lab, and there are my references. Thank you very much. [applause] >>: Okay. Thanks. So we have lots of time for questions. So I'll bring the microphones around. >>: Hi. I just have a question. When you were doing the summarization, did you start with topics in mind, or did you find documents without any particular topic or type of sentence to start from? >> Nicholas FitzGerald: I'm not sure I know what you mean. >>: Well, for instance, when you were doing the summarization of restaurants in town, did you look for all the restaurants in town? Did you look for that specific restaurant in town? Did you start off with "compare and contrast Chambar Belgian Restaurant to XYZ Restaurant"? >> Nicholas FitzGerald: No, the input for this was just the text from dinehere.ca, specifically for just one restaurant. So we're looking to summarize the opinions about one specific product at a time. So you might have the input for one restaurant from dinehere.ca or for one digital camera from Amazon, something like that. >>: Also, then, did you try answering a question from a specific type of sentence, or did you just use phrase searches, like you would in Google? Or Bing? [laughter] >> Nicholas FitzGerald: Well, we're looking to generate a summary about the whole set of reviews. But the topics that are mentioned in the summary come from the user-defined hierarchy that's input. So when it's talking about service and food and the ambiance, that comes from the hierarchy that's input. >>: That people thought -- >> Nicholas FitzGerald: So if someone's only interested in one aspect of the product — say they only care about picture quality — you could just input a hierarchy with only picture quality, and then it would only summarize reviews which mention that feature. >>: Quick question about the section that you improved. So for the WordNet similarities that you used, Leacock-Chodorow came out the best. >> Nicholas FitzGerald: Yeah. >>: Do you have any understanding of why that was particularly the best with your results?
Some of those are information-theoretic measures and others are graph measures. >> Nicholas FitzGerald: Yeah, I didn't look individually at why one would be better than the other. I was really just looking to see how we could get the best possible mapping. So that's something to look at; potentially we could improve on that in the future. >>: So another interesting addition to your ensemble method might be to just run a quick learning algorithm over it, a supervised method, to actually choose the best combination of all those WordNet similarities that you've already calculated. >> Nicholas FitzGerald: Yeah. The problem with that is that it might be domain-specific, whereas we were hoping to get, ideally, something that would be unsupervised. So I was hoping that, because we had this regularity in which voting groups worked well together, that could be used as an unsupervised sort of way of picking three that would work well, or five that would work well. >>: One in the back corner here. >>: I had a question about the user-defined feature hierarchy. When you define it, are you defining one hierarchy, say, for digital cameras, and you try to make it generic enough that it matches all the digital cameras on the market, or do you find you have to do specific hierarchies for various types of cameras? >> Nicholas FitzGerald: You could do either. That sort of depends on which user is using the system or what they want to get out of it. So, for instance, there are websites that already produce these hierarchies, Consumer Reports, et cetera, but if you were interested in a specific feature of one specific product, you could also add that to the hierarchy. For instance, when I made the hierarchy for the restaurant, I went to that restaurant and found their menu and put all the menu items in the hierarchy, so if they had been mentioned in the reviews they would be mentioned in the summary. >>: The reason I'm asking is: how easily does this scale? Once you make this, and say you want to apply it to lots of products out there, is there going to have to be a human involved who updates the hierarchy every time new products are added with new features? >> Nicholas FitzGerald: I mean, if you're looking to compare, say, digital cameras, there will be features that are common to all digital cameras. But if you were looking at, say, a feature that only existed for a specific camera, you would have to add it to the hierarchy. But I found in practice that creating the hierarchy is really simple. And it's also a way of adapting the system to a specific user, because one user maybe isn't interested in battery life, they want picture quality, or vice versa, or perhaps a less experienced user might not know what some of the features mean. So you could tailor it that way as well. >>: Thank you. >>: This is also on the feature hierarchy. I'm curious if any of your classifiers here, or the ensemble classifiers, are sensitive to the number of features they're trying to classify things against, or are they all individual binary decisions for each feature? >> Nicholas FitzGerald: No -- yeah, because they're just mapping each crude feature to the user-defined feature which is the most similar. So -- >>: But if you have only one user-defined feature you might get more false positives than if you had 10. >> Nicholas FitzGerald: Yes, that's true. I haven't looked at that aspect of it.
>>: I realize this may not be directly in your line of sight, but there's often a diachronic effect where, with a new camera, I'm dissatisfied generally, but as I learn to use the camera I learn that I love it. Is that anything that you're looking at? >> Nicholas FitzGerald: Well, I mean -- >>: Is one negative review just canceling out the positive review from that user, if they even make a second review? >> Nicholas FitzGerald: Well, we're really looking at just how to summarize these kinds of texts. So in terms of what you do with the deployed system to deal with that sort of thing, that wasn't part of what we were looking at. But I mean, hopefully with a big enough dataset, enough reviews, the hive mind would come up with the right conclusion. >>: Question here. >>: Hi, you had an illustration showing a summary with the customers of a restaurant having conflicting opinions about something or other. That's quite a sophisticated inference to make in a system which gets most of its mileage from information associated with the individual words. And inter-sentential relations are important both in the extraction of discourse information as well as in the generation process, I presume. Do you use, or would you find it useful to employ, any sort of formalism for capturing various types of relations between sentence fragments, like contrast and elaboration and support and implication and things of that nature? For example, like one might find in rhetorical structure theory, which people use a lot in generation, I suppose. >> Nicholas FitzGerald: That's probably an important part of what we'll have to look at in order to do the inter-sentential analysis. Rhetorical structure theory is used in the generation part, obviously. But, yeah, that's something we'll have to look at. >>: So your summaries look very impressive for [indiscernible], presumably because it's used in generation a lot, right? >> Nicholas FitzGerald: Yes. >>: But presumably it would be useful in comprehension as well. >> Nicholas FitzGerald: Yeah, that's true. >>: We still have time for one or two more questions if there are any more. >>: Could you just tell us a little bit more about how you choose what to present in the summary? Because it's an abstractive summary, I'm guessing you have a lot of choices. How do you choose what to say? >> Nicholas FitzGerald: Well, I didn't work on the natural language generation part, but there's a measure of importance, which is calculated based on the number of reviews which are talking about a specific node on the tree, and it also takes into account the measures of importance of its children in the hierarchy as well. That way we can determine which features are more important, so ones which are higher in the hierarchy are given more weight. And we're also looking to generate a varied summary, so we want to talk about a wide range of features as well. If you want to know more about the natural language generation part you should talk to Dr. Carenini. >>: I assumed that since you used a website like dinehere.ca, people actually leave explicit scores, like stars and stuff. Can you use that to measure how successful your summarization is, measuring services and reservations and such? >> Nicholas FitzGerald: Yeah, that's one of the ways we're looking at for analyzing the overall success of the system. Because, like you said, people give scores -- the problem is that those scores aren't very fine-grained.
They might give an overall score for service or food or value, whereas if you want to look at how successful the more individual features are, it's not so useful. But that's definitely something we're looking at. >>: Is there a strong correlation? >> Nicholas FitzGerald: We haven't looked at that yet. >>: I often use the Internet for looking at reviews of cell phones as they appear on Verizon or other carriers' websites, and I'd get a lot of use out of applying your method to the customers' reviews of individual cell phones, where a phone comes out and within a month there are 2,000 reviews of it. I'd like perhaps a super-digest of these reviews using something like this, which is like a meta-review of all of those. >> Nicholas FitzGerald: Yeah. That's a good idea. >>: Any final questions? Okay. Let's go ahead and thank our speaker. [applause] Okay. Our next speaker is Shafiq Joty, who is going to be talking about topic modeling. >>: Shafiq Joty: Okay. So I will be talking about our research on exploiting conversation features for finding topics in e-mails. And I would like to thank Dr. Giuseppe Carenini, Dr. Raymond Ng and Dr. Gabriel Murray. Let's see why we are focusing on conversations. Nowadays, in our daily lives, we deal with conversations in many different modalities: we now have e-mails, blogs, forums, instant messaging, video conferencing, et cetera. And the Web has significantly increased their volume and complexity. Now, effective and efficient processing of these media can be of [indiscernible] value. For example, a summary or a topic view of an e-mail or blog can give you direct and quick access to the information content. So, before joining a meeting, a manager can have a summary of what has been discussed, or before buying a product, a customer can have a summary of the customer reviews. Now, consider this example of an e-mail conversation from the BC3 corpus. Here you can see Charles is e-mailing this mailing list asking which people would be interested in having a phone connection to a face-to-face meeting. Then the people who are interested reply to this e-mail. Afterwards, Charles again e-mails the mailing list about the time zone differences. So here you can easily see there are two different topics in this conversation. Now, what do we mean by topic modeling? A topic is something about which the participants of a conversation discuss or argue. For example, an e-mail thread about arranging a conference can have topics such as the location and time, the registration, the food menu, the workshops and the schedule. In our work, we are dealing with the problem of topic assignment; that means clustering the sentences into a set of coherent topical clusters. Now, let's see an example of this topic assignment for the example we just saw. Here you can see the first few sentences are in topic ID one, then these two sentences are in topic ID two, and one thing to notice here is that topic ID one can be revisited. Okay. Now, why do we need this topic modeling? Our main goal here is to perform information extraction and summarization for conversations like e-mails, blogs or [indiscernible], and topic modeling is considered to be a prerequisite for further conversation analysis. And the applications of the derived structure are broad: text summarization, information ordering, automatic question answering, information retrieval and user interfaces.
Now, what are the challenges that we face when we are dealing with e-mails or other conversations? E-mails are different from general monologue or dialogue in various ways. They're asynchronous; that means different people sitting at different places at different times collaborate with each other. They're informal in nature: different people have different styles of writing, people tend to use short sentences, and often they are ungrammatical. And the fact is that topics do not change in a sequential way, and headers can be misleading when we are trying to find the topics in an e-mail. Now, let's look at the example here. You see this sentence: this person is quoting one of the earlier sentences, and he just wrote this short sentence in reply. If you just take this sentence individually, it does not carry any meaning. Again, here you can see Charles is using the same subject line to talk about a different topic, which is the time zone differences. And here you can see that topic one is revisited. So here the assumption of sequential topic boundaries, as in monologue or dialogue, is not applicable. So this is the example: short and informal, headers can be misleading, and not sequential. The outline of the rest of the talk is: first we'll see the existing methods for topic modeling, then how the existing methods can be applied to conversations, then we'll see whether we can do better than those, then the preliminary evaluation on a small development set, and then we'll focus on the current work that we're doing right now. So, the existing methods in the literature. Topic segmentation, or finding the boundaries in written monologue or in dialogue, has got most of the attention, and the methods fall into one of two categories: supervised methods or unsupervised methods. In the supervised methods, the problem is just a binary classification problem where, given a sentence break, you have to decide whether there should be a topical boundary or not, and they use a set of features. In the unsupervised case, these are the two popular models: one is the lexical chain-based segmenter, and the other is the probabilistic topic model, latent Dirichlet allocation. So we'll see how these two models can be applied to conversations. Recently, for conversation disentanglement, graph-based methods have been applied, but previously there was no work on e-mails. So this is what we're trying to do. Now, first let's see how we can apply these existing methods to e-mail conversations. First, the probabilistic topic model, that is, latent Dirichlet allocation. As you know, this is a generative topic model of documents, where the assumption is that a document is a mixture of topics and a topic is a multinomial distribution over words. To create a document, you first select the topic distribution for the document; then, for each word in that document, you first select a topic from this distribution and then you sample the word from that topic's word distribution. In our case, the e-mail is considered as a document. Then, given the distribution of words over the topics, and assuming the words in a sentence are generated independently, we can extend the word probabilities to sentence-level probabilities, and we can just take the argmax to find the single topic for the sentence. So this is how we applied the general LDA model to our e-mail conversations.
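A sketch of the sentence-level topic assignment just described: given a fitted LDA model's topic-word probabilities and the e-mail's topic proportions, treat the words of a sentence as independent and take the argmax topic. The toy arrays below are placeholders for fitted parameters, not values from the actual experiments.

```python
import numpy as np

def assign_sentence_topic(sentence_word_ids, phi, theta):
    """phi   : (num_topics, vocab_size) topic-word probabilities from LDA
    theta : (num_topics,) topic proportions for this e-mail (the 'document')
    Assuming words are generated independently, the score of topic k for a
    sentence is log theta[k] + sum_w log phi[k, w]; return the argmax topic."""
    log_scores = np.log(theta)
    for w in sentence_word_ids:
        log_scores += np.log(phi[:, w])
    return int(np.argmax(log_scores))

# Toy example with 2 topics and a 4-word vocabulary (placeholder numbers):
phi = np.array([[0.5, 0.3, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])
theta = np.array([0.6, 0.4])
print(assign_sentence_topic([2, 3, 3], phi, theta))   # -> topic 1
```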
Now let's see the second model, the lexical chain-based approach. As you know, lexical chains are chains of words that are semantically related, and the relations can be synonymy, repetition, hypernymy and hyponymy; but the lexical chain-based approach we use (LCSeg) just uses the repetition relation. The chains are ranked according to two measures: the number of words in a chain and the compactness of the chain. The words in a chain are given a score which is the same as the score of the chain. Then, once we have the scores for the words, we form pseudo-sentences of fixed length. Then, for two consecutive fixed-length windows, we compute a similarity measure between them, and if the measure falls below a threshold, a boundary is placed. So this is the traditional lexical chain-based approach. Now, as you may have noticed, both LDA and LCSeg only consider bag-of-words features and the temporal relation between the e-mails, but they ignore other important features, such as the conversation structure, the mention of names, topic-shift cue words like "now" and "however", and the from-to relation. As examples of these important features, you can see that people often quote sentences from other e-mails to talk about the same topic, and here you can see that people often use names to disentangle the conversation. So, to find the topics, we have to consider these features. What we need is to capture the conversation at a finer granularity level, to consider all these important features, and then a way to combine all of this into a model. So let's see the way that we extract conversations from the e-mails. We analyze the actual body of the e-mails, and by analyzing the body of the e-mails we find there are two kinds of fragments: the new fragments, with depth level zero, and the quoted fragments, with depth level greater than zero. And we form a graph where the nodes represent the fragments and the edges between the nodes represent the referential relations between them. It becomes clearer with an example. Here we have six e-mails in an e-mail thread. E-mail one has only one fragment, a new fragment, A. E-mail two has one new fragment, B, and it's replying to A. Then e-mail three has a new fragment, C, and it's replying to B. For simplicity, we assume that a fragment can refer to its neighboring fragments. In e-mail four we have this fragment DE, but until we process e-mail five we don't know that D and E are two different fragments. So we have to find the different fragments in a first pass, and then we form the graph in a second pass. The result of this process is this graph, where the nodes are the fragments and the edges represent the referential relations between these fragments. So once we have this graph, let's see whether we can apply the existing methods on it or not. We chose to use LCSeg on this fragment quotation graph and to test whether it improves the performance or not. On each path of the fragment quotation graph, we apply LCSeg and find the topics. Then, to consolidate the results of the different paths, we form a graph where the nodes are the sentences and the edge weights represent the number of cases where two sentences u and v fall in the same segment, and we find the optimal clusters by the normalized cut criterion. So this is how we applied LCSeg on the fragment quotation graph.
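A simplified sketch of the LCSeg-style boundary detection described earlier in this part of the talk: lexical-chain scores per word, fixed-length windows, cosine similarity between adjacent windows, and a boundary wherever the similarity drops below a threshold. The chain scoring and the sharp-drop detection in the real LCSeg are more involved; this only illustrates the mechanics.

```python
import numpy as np

def boundaries_from_windows(word_scores, window=4, threshold=0.5):
    """word_scores : list of per-word score vectors (one dimension per
    lexical chain, nonzero where the word belongs to that chain).
    Forms fixed-length windows, compares consecutive windows with cosine
    similarity, and places a boundary where similarity falls below threshold."""
    X = np.array(word_scores, dtype=float)
    wins = [X[i:i + window].sum(axis=0)
            for i in range(0, len(X) - window + 1, window)]
    bounds = []
    for i in range(len(wins) - 1):
        a, b = wins[i], wins[i + 1]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sim = (a @ b) / denom if denom > 0 else 0.0
        if sim < threshold:
            bounds.append((i + 1) * window)   # boundary before this word index
    return bounds
```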
Now, let's focus on our preliminary evaluation. So, the evaluation metrics that we use: just note that the evaluation metric should be something that, given two topic assignments, measures how well the two assignments agree. And here note that the widely used [indiscernible] is not applicable, because we may have a different number of topics in the two annotations. So what we did is adapt three measures from this paper [indiscernible] that are more appropriate for this task: the 1-to-1 measure, the loc-k measure and the M-to-1 measure; we'll see these measures in the next few slides. So, the 1-to-1 evaluation metric: it measures the global similarity by pairing up the clusters of the two annotations in a way that maximizes the overlap. So here, for example, we have this source annotation and this target annotation. We first map each of the source clusters to the target cluster with which it has the highest overlap. So here you can see: this is the source annotation, then we have this optimal mapping — the reds are mapped to the greens and the blues are mapped to the whites, or something like this. So the source will now be this, and then we measure the overlap between this source and the target. Here you can see the overlap is 70 percent: this one is correct, this one is correct, this one is wrong, this one is wrong — so out of ten, seven are correct and three are wrong. Then the loc-k measure: it measures local agreement. Here, for example, is the sliding window protocol. For this sentence — this is the source and this is the target — we first check, for the previous k sentences, whether they are in the same or a different topic. So for this one it is different; for this one, the same, because this is red and this is red; but this is different. We do the same with the target: this is the same, this is the same, this is different. Then we measure the overlap between these two sets of judgments, and here you can see it's 66 percent. Then there's the fact that some annotations can be finer-grained than others. For this we use the M-to-1 measure, which maps each of the clusters of the first annotation to the single cluster in the second annotation with which it has the greatest overlap. This measure gives us an intuition of the annotators' specificity, but to compare the models we should not use this measure; we should use 1-to-1 and loc-k. For our evaluation, we are using five threads from the BC3 corpus, and you can see the [indiscernible] agreement between the five annotators. And we compared these annotator agreements with the agreements reported in that paper on the chat corpus, as a comparison: this one is higher than that one, this one is lower, this one is about the same. So it's a feasible task to do. Now, let's see the results of the three systems that we described. Here you can see the probabilistic LDA model, this is the LCSeg model, and this is LCSeg applied with the fragment quotation graph. You can see that LCSeg is performing better than [indiscernible], and LCSeg with the fragment quotation graph is performing better than the plain LCSeg method. And the differences are significant.
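A sketch of the 1-to-1 metric described above, computed with an optimal cluster pairing via the Hungarian algorithm from SciPy; annotations are given as one cluster label per sentence. The labels below are arbitrary toy values, not data from the actual evaluation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one(source, target):
    """source, target : lists of cluster labels, one per sentence.
    Pair up source and target clusters so that the total overlap is
    maximized, then return the fraction of sentences that agree."""
    src_labels = sorted(set(source))
    tgt_labels = sorted(set(target))
    overlap = np.zeros((len(src_labels), len(tgt_labels)))
    for s, t in zip(source, target):
        overlap[src_labels.index(s), tgt_labels.index(t)] += 1
    rows, cols = linear_sum_assignment(-overlap)    # maximize total overlap
    return overlap[rows, cols].sum() / len(source)

# Toy example (labels on each side are arbitrary; only the grouping matters):
print(one_to_one([1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
                 [4, 4, 5, 5, 5, 6, 6, 6, 6, 6]))
```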
So right now, what we are trying to do — this is the proposed solution — is to consider a rich feature set for each pair of sentences. In the topic features we have LSA, that is, latent semantic analysis, and LDA, latent Dirichlet allocation; we have the distance in the fragment quotation graph, the speaker, and the mention of names; and in the lexical features we have the TF-IDF measure and the cue words. Our method is classify — which is supervised — and then cut. We'll use one binary classifier to decide, given two sentences, whether they should be in the same topic or not. Then we'll form a weighted graph where the nodes are the sentences and the edge weights denote the class-membership probability, and the problem becomes just a graph partitioning problem, which we can solve using the normalized cut criterion. Thanks. [applause] >>: Okay. Plenty of time for questions again. So if you would put your hands up, I could bring the mics around. >>: So I was wondering, roughly on average, how many topics do you find per -- >>: Shafiq Joty: In the five e-mail threads that we had, we found, I guess, 3.4 per thread on average. >>: Per thread. So you have five e-mail threads to work on. >>: Shafiq Joty: Yeah. >>: Did you have multiple people annotate? >>: Shafiq Joty: Yeah, yeah, there were five annotators. >>: Okay. Did they find it hard to come up with the same set of topics, and even if they, you know, kind of found the same span for a topic, did they label it in the same way? >>: Shafiq Joty: Yeah, they labeled it. And, like you see, for the clustering purpose there is the label-matching problem: one annotator can have a cluster which is labeled as cluster one, but the other annotator can have the same cluster labeled as cluster two. But we made the annotators write the name of the topic. So our results are something like this: the inter-annotator agreement, here, on the e-mail corpus. Here you can see that for one of the e-mails it's 100 percent — they all agree — and this is the minimum agreement and this is the mean of it.
>>: Shafiq Joty: And one thing is that I was trying to incorporate this conversation structure into the LDL model as a prior. Like LDL model uses this for the document for the topic word distribution it uses when Dirichlet prior. So I was trying to incorporate this network structure into this model. So I came up with this one approach that will allow us to do this. So right now I'm kind of done with this. So as a machine learning process I did this hopefully it will come up with some better results. So hopefully we'll see some papers. >>: I was just going to ask about the sequential structure, because I think this is sort of what you were just talking about. It sounded like you weren't really modeling -- like if you were talking about topic one, maybe you're going to continue talking about topic one to the next reply to the e-mail. I think it sounded like you sort of weren't doing that but that's sort of future work; is that correct. >>: Shafiq Joty: Which one. >>: So if like the topic maybe will stay the same. I'm just talking about the transition between topics. >>: Shafiq Joty: Topics? We are considering this problem as a clustering problem. But we have this like -- we have this LC SIG feature. Which will like consider these features, like they work together or not. So implicitly we are considering this feature. >>: Okay. >>: Thanks for your time. One question. So do you have a fixed number of topics when you're running LDA or are you using a Chinese restaurant process to get an arbitrary number of topics? Is it nonparametric. >>: Shafiq Joty: No, we set the upper parameter that is very low. And we check the probability of Z, that's the latent variable, whether it goes below, below the threshold. And if it's below that threshold then that topic is gone. Like that topic is not there. >>: I see. Because looking at your results here ->>: Shafiq Joty: So it's kind of automatic cluster detection. >>: Right. Looking at your results here, LDA seems to do much better on M-to-1 than it does on any of the other evaluation metrics. I wonder if it's proposing too many topics. >>: Shafiq Joty: I think, no, no, for this we fixed the number of topics, like the number of topics is just the humans found in this conversation. So just to see like how -- just to see the best output of LDA that one could get. But now I'm trying with like this alpha very low and this parameter. >>: Thanks. >>: Shafiq Joty: And just to tell you that the prior that I'm using is Dirichlet tree so it allows you to incorporate network structure. >>: You might have experienced it, too, but I often am part of e-mail threads which tend to kind of gradually proliferate in such a way that each respondent diverges slightly from the original topic of the e-mail to such a point that soon further on what you respond to has nothing to do with the original topic, maybe nothing to do with the topic at all. Will you have a method of quantizing the relevance to the original topic in such a way that you can just filter out all the nonsense at some point in time. >>: Shafiq Joty: Yeah, like we have this fragment quotation graph. So we are considering the distance between two sentences in this fragment quotation graph. So if the distance is higher then we're considering the topic shift. >>: Okay. >>: Any more questions? We have a couple minutes. Anybody have a final question. I have one question. So you have e-mails so many issues that people will revisit topics. So that's not sequential. 
I'm wondering if you have a sense of how prevalent that is, and also whether it's the same in e-mails versus chats, if you've noticed any differences with how topic structure is realized in -- >>: Shafiq Joty: In chats, like in the multi-party chat it's different because different people are interacting with each other. So there are many different conversations. But for the real chat, I did not check it because we did not have this chat corpus in the real chat. We had just multi-party chat. But I think it's kind of more synchronous in chat, but in e-mail it's asynchronous. So people can revisit things. >>: Okay. So if there's no more questions, let's thank our speaker again. [applause]