>>: Welcome back to the short talks session. For this session we actually have five speakers and very little time for each, so I'm going to limit the questions to one or two, and we'll only have two minutes for questions. I'm really pleased to be able to introduce the five different papers here. The first paper is on approximate parsing with hedge grammars, and Mahsa Yarmohammadi is going to be talking about it. Thank you. Great. Thanks.
>> Mahsa Yarmohammadi: Okay. Can you hear me?
>>: Is your mic on there?
>> Mahsa Yarmohammadi: Yes, it's on. Okay, I'm going to talk about approximate parsing with hedge grammars today. First of all, thank you, and thanks to Northwest NLP for triggering this submission to the ACL. I was preparing this two-page extended abstract for Northwest NLP, and then I decided, why not submit the same work to the ACL? So Brian, Aaron and I submitted this paper, with a different title, to the ACL, and it got accepted.

Parsing for hierarchical structure is costly, and in fact some NLP applications skip it altogether and replace it with shallow proxies such as NP chunks, for example. The models used to derive these non-hierarchical structures are finite-state equivalent, so they are very fast. However, they omit all but the most basic syntactic segmentation, and in some applications, having some more degree of hierarchical structure might be useful in inference, even if it doesn't end up as a fully connected parse tree. So in this talk, I'm going to present hedge parsing as an approach which provides a full hierarchical structure for a local window while ignoring any constituent that falls outside this local window.

We define hedge parsing as discovering every constituent of length up to some span L. This is a parse tree for an example sentence. Every constituent in this parse tree has a span, which is the number of words it covers. For example, the span is five for these NP and VP constituents and 10 for this S constituent, showing that they cover five and 10 words respectively. It's 13 for this VP constituent here, and for the part-of-speech tags the span is just one, which means they cover just one word.

Okay, to obtain the constituents that cover up to L words, we apply a hedge transform to the original tree. We recursively remove the constituents that span more than L words and connect their children to the parent of the node. This is an example for L equal to seven: these are the nodes that were subject to removal because they span more than seven words. So we remove those, and we connect their children to the topmost node, which is S in this case.

A very interesting property of hedge parse trees is that their constituents are sequentially connected to the topmost node, as I showed with the dashed line here. Each hedge constituent is sequentially connected to S, and this property allows us to segment the sentence before parsing, parse these segments instead of the entire sentence, and then recombine the results. As I will show later, this gives us a huge speedup over parsing the entire sentence. This plot shows the percentage of retained constituents when we do the hedge transform on the Wall Street Journal Penn Treebank. Over 50% of constituents have a span of three or less, and for L equal to 15, for example, over 90% of constituents are retained in the hedge-transformed tree compared to the original parse tree. Okay.
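To make the transform concrete, here is a minimal Python sketch of the hedge transform as described, using a simple nested-tuple tree representation rather than any treebank API; it is an illustration of the idea, not the authors' implementation, and the toy tree at the end is invented for the example.

```python
# Minimal sketch of the hedge transform: a node is a (label, [children]) tuple,
# a leaf is a plain string (word). Constituents spanning more than L words are
# spliced out and their children attached to the nearest surviving ancestor.

def span(node):
    """Number of words a node covers."""
    if isinstance(node, str):                # a leaf (word)
        return 1
    _, children = node
    return sum(span(c) for c in children)

def hedge_transform(node, L):
    """Splice out every internal constituent of span > L; the root is kept."""
    if isinstance(node, str):
        return node
    label, children = node
    new_children = []
    for child in children:
        child = hedge_transform(child, L)
        if not isinstance(child, str) and span(child) > L:
            # remove the oversized constituent and promote its children
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (label, new_children)

# Toy usage:
tree = ("S", [("NP", ["the", "buyers"]),
              ("VP", ["bought", ("NP", ["a", "lot", "of", "shares"])])])
print(hedge_transform(tree, L=3))
# ('S', [('NP', ['the', 'buyers']), 'bought', 'a', 'lot', 'of', 'shares'])
```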
What are the methods for doing hedge parsing? As I said, hedge parsing is the approach of finding the hedge constituents of an input sentence. The baseline approach, which is the brute-force case, is to parse the entire sentence using a full context-free grammar and then hedge transform the result. As we can imagine, this has the maximum accuracy but minimum efficiency: because it has the full, rich knowledge of context, it is very accurate but very slow. I propose two alternative approaches over this baseline. In both, we constrain the search space of the CYK parser based on the constraint defined for hedges, and we also use a hedgebank grammar instead of a full context-free grammar; a hedgebank grammar is trained from hedge parse trees as opposed to full parse trees.

This is how we constrain the search space of the CYK parser. Since we limit the span of these non-terminals, we can limit the search space, as I've shown here, by closing the cells whose span is above L, which is seven in this case, except the cells which are along the periphery of the chart. The complexity of CYK parsing then reduces from n-cubed times the size of the grammar to the expression I show below. I'm not going through the details of how we compute this complexity, but [indiscernible], and you can ask me questions in the poster session, where I am presenting the same poster.

Okay. These are the approaches for hedge parsing besides the baseline. The first approach parses the entire sentence using a hedgebank grammar, and the second one presegments the sentence, parses each segment independently and simultaneously, and then combines the results. The task of hedge segmentation is to chunk the input sentence into complete hedges, so we trained a classifier and applied it at each word to decide whether that word can begin a new hedge or not. We define two tasks, one unlabeled and the other labeled. For the unlabeled task, we just have begin and not-begin tags, and for the labeled task, we also include the type of constituent, like NP or VP. We use a discriminative log-linear model, trained with the averaged perceptron algorithm, and these are the feature sets that we use.

This is our experimental setup. We use the Wall Street Journal Penn Treebank, with the standard training, development and test sections. We train Berkeley latent-variable grammars with six split-merge cycles. We use the BUBS parser in exhaustive CYK parsing mode, and we use a standalone machine with this specification to run the experiments. We evaluate our parsing results with precision, recall and F1 score using the standard EVALB script.

These are the results of parsing the development set, Section 24, for L equal to four and seven. You can see that we get an order of magnitude speedup when we parse using the hedgebank grammar as opposed to parsing with the full grammar, at the cost of 3% accuracy. And this shows the comparison between no segmentation and presegmentation before parsing. If we look at the [indiscernible] line here, you can see that segmentation is potentially very powerful for fast and accurate hedge parsing. In practice, we obviously cannot achieve that accuracy; we achieved this accuracy instead, and even in that case we got an order of magnitude speedup over the case where we do not segment before parsing, at the cost of almost 5% accuracy reduction. These are the results on our test set, or eval set.
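As a rough illustration of the chart constraint just described, here is a small Python sketch of which CYK chart cells stay open when hedge constituents are limited to span L: cells of span at most L, plus cells on the periphery of the chart, which I take here to mean cells that start at the first word or end at the last word so the top-level structure can still be built. This is not the BUBS parser's actual implementation; the cell indexing and the periphery definition are my reading of the description.

```python
# Sketch of the hedge constraint on a CYK chart, for illustration only.
# A cell (i, j) covers words i..j-1. It stays open if its span is at most L,
# or if it lies on the periphery of the chart (starts at the first word or
# ends at the last word), so the topmost node can still be built.

def open_cells(n, L):
    """Return the set of open chart cells for a sentence of n words."""
    cells = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            span = j - i
            on_periphery = (i == 0 or j == n)
            if span <= L or on_periphery:
                cells.add((i, j))
    return cells

n, L = 12, 4
print(len(open_cells(n, L)), "of", n * (n + 1) // 2, "cells stay open")  # 57 of 78
```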
Basically, the same pattern holds here, and these are the results for different L values from three to 20 on the eval set with our three different approaches: the baseline, no segmentation, and presegmentation before parsing. As you can see, we get a huge speedup when we segment before parsing, especially for small L, and if you look at this graph, you see that when we segment, we get a consistent degradation in precision and recall, which points to the need for an improved segmentation model in our task.

I introduced a novel partial parsing approach for fast syntactic analysis of the input, beyond shallow parsing. It allows for input segmentation before parsing. Our baseline segmentation model provides a significant speedup in parsing, but cascading errors remain a problem. And these are a few future directions. One is to investigate hedge parsing in combination with the methods that are available for prioritizing and pruning the cells of a parser. Another is to improve the hedge transform and the hedge segmentation model to achieve better accuracy. And another is to evaluate the hedge parsing idea on a real task, like incremental translation. Thank you so much.
>>: Great. Thank you very much. So while the next presenter gets ready, we have time for a couple of quick questions, and I have the mic here.
>>: So have you heard of vine grammars and vine parsing?
>> Mahsa Yarmohammadi: Yes.
>>: So how is this different from that?
>> Mahsa Yarmohammadi: That's the same concept but for dependency parsing; they use a similar kind of length constraint on the dependencies. It's the same idea as [indiscernible] parsing, and we mention it as related work in the introduction as well.
>>: Okay, and time for one more quick question while we get that set up. Very quick question here.
>>: A clarification, I guess. When you're evaluating, are you evaluating against the true full parse, or are you evaluating against a hedged tree?
>> Mahsa Yarmohammadi: No, that's a good question. It's against the hedge-transformed trees. Thanks for clarifying.
>>: Great. Okay, let's thank the speaker again. Thank you very much. Okay, we're getting set up here for our next speaker. It's going to be Xiao Ling, and the talk is going to be on context representation for named entity linking.
>> Xiao Ling: I'm Xiao Ling from the University of Washington. This is joint work with Sameer and my adviser, Dan. Today, I'm going to talk about context representation for named entity linking. Let me just briefly tell you what the task of named entity linking is. It's about identifying entity mentions in text and connecting them to their corresponding knowledge base entries, usually Wikipedia. There are a couple of major challenges for this problem. One is ambiguity, where people use the same entity mention to refer to different entities. For example, we have Seattle in both sentences, but the first one refers to the soccer team Seattle Sounders, while the second one refers to the city. The other big challenge is variability, also known as synonymy, where people tend to use different names for the same entity, like nicknames for Seattle, and also acronyms, like MSR for Microsoft Research. So next, I'm going to tell you about the general architecture of a standard named entity linking system.
Assuming the entity mention is given, like Seattle here, within some context (I'm going to simplify the context to a single sentence), we first generate a bunch of candidate entities for the system to select from. To be able to select the best one, we rank them using multiple scoring functions. For example, string similarity measures the similarity between the entity mention and the canonical form of the candidate entity. We can also look at the context around the entity mention and measure how similar it is to the representation of the entity. You can have other scoring functions as well, and in the end, we sum them up and pick the entity with the highest score. Today, I'm going to focus on the context similarity.

The majority of previous named entity linking literature has used a bag-of-words representation for the context of entity mentions. In the running example, we have a three-word bag: beat, Portland, yesterday. To compare that to the candidate entities, we look at the Wikipedia text of the corresponding entities and extract similar bags of words, and we compare the mention's bag of words to each of them. We observe a couple of issues here. First, the words in the bag are sometimes imprecise and give the prediction to the wrong candidate. Also, the bag might include irrelevant features, like yesterday, because yesterday doesn't help disambiguate the entity mention. What matters here is the verb beat. However, the issue is that beat might not occur in the bag of words for the correct entity.

What we are proposing here is to make use of the dependency graph of the sentence containing the entity mention. We focus on the direct dependencies linked to the head of the entity mention, and we extract features like these. The first feature basically says the entity mention is the subject of the verb beat. We also include some more specific ones, like the second one, which is a conjunction feature. It basically says the entity mention is the subject of beat, and also the object of beat is Portland.

To construct similar representations for the entities, we again look at the Wikipedia text and extract features using the same procedure. As a complement, we also look at a bunch of sentences from the web where the authors have voluntarily linked important mentions to the corresponding Wikipedia pages, and we apply the same feature extraction. After this processing, we have a huge matrix, where the rows are the entities and the columns are the features. The number of entities in this matrix is around 3 million, and the total number of features is around 700,000. As you might suspect, the matrix is very sparse, and it has lots of missing values. For example, we might not observe the expression that Seattle beat some other team in either the Wikipedia text or the web sentences, so it would be very difficult to compare this representation to the representation for the entity mention.
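Here is a minimal sketch of how dependency-based context features of this flavor could be extracted. It uses spaCy purely for illustration; the authors' parser and exact feature templates are not specified here, and the feature-string naming is my own.

```python
# Sketch of dependency-based context features for an entity mention, in the
# spirit of the features described above. The feature naming is illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def mention_features(sentence, mention):
    doc = nlp(sentence)
    # simplification: take the first token whose text matches the mention
    head = next(tok for tok in doc if tok.text == mention)
    governor = head.head
    feats = []
    # basic feature: the mention's grammatical role under its governor,
    # e.g. "nsubj(beat)"
    feats.append(f"{head.dep_}({governor.lemma_})")
    # conjunction features: pair it with the governor's other dependents,
    # e.g. "nsubj(beat) & dobj(beat)=Portland"
    for sib in governor.children:
        if sib is not head and not sib.is_punct:
            feats.append(f"{head.dep_}({governor.lemma_}) & "
                         f"{sib.dep_}({governor.lemma_})={sib.text}")
    return feats

print(mention_features("Seattle beat Portland yesterday.", "Seattle"))
# e.g. ['nsubj(beat)', 'nsubj(beat) & dobj(beat)=Portland', ...]
```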
So what we propose here is to perform a matrix completion task by learning a low-dimensional embedding of both entities and features, which captures the correlations among these features and also among similar entities within the matrix. To further encourage sharing and propagation of that information, we add the 500 most frequent Freebase types and fill in those values accordingly. So the final matrix is composed of two parts, the dependency features we extracted and the Freebase types, and we learn a low-dimensional vector for each of the entities and each of the features.

Here are some experimental results. First, this is more like a sanity check. We want to make sure the entity matrix is making sense, so what we did was basically pick some pairs of ambiguous entities. In this case, we picked Georgia the state and Georgia the country, and we compared their nearest neighbors according to the entity matrix. As you can see here, it's making sense, because the nearest neighbors for Georgia the state are mostly states, and the nearest neighbors for the country are mostly European countries. So that's nice.

Then we looked at real-world entity-linking data sets. The first thing we wanted was to make sure that the proposed representation is better than bag of words, so to isolate other factors, we just used the context similarity to rank the candidates. The y-axis shows the percentage of cases where one representation ranked the gold entity higher than the other. The blue bars are the winning percentage of our work, and the green bars are bag of words. In all the data sets we tested, our proposed representation actually dominates bag of words.
>>: Two minutes.
>> Xiao Ling: Got it. Okay, so these are some preliminary results on the end-to-end linking accuracy of our prototype system. The whole system is still under development. For example, we don't have a joint inference component that considers the predictions of the links together, but still, using simple string similarity and the context similarity we propose here, we have comparable results, at least in two of the seven data sets, and we are optimistic that when we further improve the system, it's going to get better. Also, just a quick reminder, our context representation is independent of the whole system, so it can be plugged into any other linking system. Just a quick recap: we propose a novel representation for modeling the mention context, and to combat sparsity in the entity matrix, we perform a matrix completion task, and we showed some preliminary experimental results at the end. So that's it.
>>: Great. Thank you very much. We have time for some questions, and your hand is the first one up here.
>>: Sure. Can we go back to your results slide? So I notice that you're doing well on a lot of the complicated data sets, like TAC, but something like MSNBC is such an easy data set to beat. I was wondering why this work is getting lower performance.
>> Xiao Ling: There could be many reasons why we are getting worse. One particular reason is that the MSNBC accuracy is evaluated over a bag of concepts within the whole document, while we are predicting the link for each entity mention independently, so it could be that for the same entity mention, we are predicting different things.
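For the nearest-neighbor sanity check described above, a minimal sketch could look like the following; it assumes you already have the learned entity embedding matrix and a name-to-row index from the matrix completion step (both hypothetical names here), and simply ranks entities by cosine similarity.

```python
# Sketch of the nearest-neighbor sanity check on learned entity embeddings.
# `entity_vecs` (num_entities x k) and `entity_index` are assumed to come from
# the matrix-completion step; the names and toy data here are placeholders.
import numpy as np

def nearest_neighbors(name, entity_vecs, entity_index, topn=5):
    row_to_name = {row: n for n, row in entity_index.items()}
    v = entity_vecs[entity_index[name]]
    sims = entity_vecs @ v / (
        np.linalg.norm(entity_vecs, axis=1) * np.linalg.norm(v) + 1e-12)
    ranked = np.argsort(-sims)
    return [row_to_name[i] for i in ranked if row_to_name[i] != name][:topn]

# Toy usage with random vectors, just to show the call:
rng = np.random.default_rng(0)
entity_index = {"Georgia_(U.S._state)": 0, "Georgia_(country)": 1, "Alabama": 2}
entity_vecs = rng.normal(size=(3, 8))
print(nearest_neighbors("Georgia_(U.S._state)", entity_vecs, entity_index))
```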
That will hurt precision if you look at the whole document as a bag of concepts, so this could be one reason, but there could be others.
>>: We have time for one more question while the next speaker, who is going to be Victoria Lin, comes up and gets set up. Another question, at the back there.
>>: Do you have a sense of [indiscernible] of the syntactic [indiscernible]? What is most important between the syntactic side of the matrix and the side with the other types? What happens if I use only the type side or only the syntactic feature part, or if I try to weight them?
>> Xiao Ling: I think the syntactic features and the Freebase type features are complementary to each other. I don't think it's possible to use only Freebase types, because there's no such clue for the entity mention, so you don't actually have anything to compare with. I don't have an exact answer for what's the most important syntactic feature in the matrix, and I think it's mention independent.
>>: Great. Thank you very much. Let's thank the speaker again. And while we're getting set up here, I'll have a chance to introduce the next speaker and the title of the next talk, which is going to appear on the screen behind me here -- good timing. Leveraging prior knowledge of output structure for learning with incomplete annotations, and the speaker today is going to be Victoria Lin, so over to you, Victoria, for 10 minutes. Thank you.
>> Victoria Lin: Good morning, everyone. Welcome to my talk. I'm Victoria. Today, I'm going to speak about a research project we did recently. The focus of this project is on scenarios where we learn with only incomplete annotations, but prior knowledge of the ideal prediction structure can help us train the classifier. This is joint work with our group from the University of Washington, with Sameer, Luke, Luheng and Ben Taskar. By the way, if you want to look up Ben Taskar on your phone, that's a name worth Googling.

The problem we look at is multi-label learning. The goal is to assign a set of labels to each data point. This is a problem we frequently encounter in different situations. For example, nowadays there is a variety of social bookmarking websites where users can assign keywords to their favorite websites, scientific publications or their everyday photographs. Multi-label learning can be seen as the simplest form of structured prediction, in the sense that the output structure is a flat set of labels. However, even with this simple structured output, as the number of labels we consider grows, one encounters a lot of different challenges when designing a machine learning algorithm for it. The most immediate one is that gathering complete annotations is extremely challenging. Think about this example: for a single image, what is the possible set of words you could apply to it? We see here there is a castle. There are pinnacles of the castle, and some people might want to call it a building. There are also the lights that are scattered around, etc. When users assign tags to these kinds of images, no effort is made towards completeness. Hence, the annotations we get from the users might look like this. In this case, if we treat those missing labels as inactives, our classifier would be confused, because the features corresponding to the missing objects are still present in the image.
Hence, what we really want to do is to treat the missing labels as question marks, in the sense that we are not sure whether they should be present or not. Our model handles this aspect by excluding them from the definition of our loss function, and on the other hand, we use prior knowledge about the desired prediction structure to help recover that information.

Here we present a mathematical formulation. Our input is represented as a design matrix X and a label matrix Y. Each row of these matrices corresponds to an example, so each row of X is a feature vector and each row of Y is a binary label vector, such that Y(i, j) equals one if example i is tagged with label j. As I said before, we really want to treat the zeros in Y as question marks, so our model works by completing the label matrix as training goes along. Our proposed approach is to have an inductive multi-label prediction model and also a label-completion model, and to do joint inference over both of them during training.

The matrix representation of Y is key to our innovation, because we have observed some entries of Y, namely the ones corresponding to the annotations provided by the users. By theory from the matrix completion field, if Y is known to satisfy certain structural properties, then with those samples one is able to recover the original Y matrix with high probability. The structural assumptions we make about Y come from two different aspects, both based on our prior knowledge. The first is that the large number of labels are actually highly correlated with each other, in the sense that groups of words tend to co-occur with each other. Hence, we would expect that the possible labels actually come from a smaller number of word groups. Mathematically, this means the latent representation of Y should be close to low rank. Also, out of the large number of possible tags, each example can only be associated with a small number of them. Hence, we also expect the true label matrix to be sparse.

With those structural assumptions, we are able to design the completion algorithm, which is fairly standard. We do this by fixing a desired rank K and defining the factorization as a product of two [indiscernible] matrices U and V, and we compute U and V by minimizing the reconstruction error over the observed entries of Y, so that you get a low-rank embedding that is close to your annotations. For the classifier prediction part, we just tried the simplest binary relevance logistic regression, which basically trains a logistic regression for each label, and we modified the standard logistic regression loss to factorize over only the observed entries of Y. And remember, what we really want to do is to connect those two parts jointly, so that when we complete the label matrix, it can take feature information into consideration, and our trained classifier can learn from the low-rank embedding. We close this loop by adding a term that minimizes the KL divergence between our classifier output and the low-rank embedding. In this sense, we are forcing consistency among the classifier output, the low-rank embedding and our observations.
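As a rough sketch of the kind of joint objective described here (not the paper's exact formulation, link functions or weighting), the three pieces could be written as follows: a low-rank completion term and a binary-relevance logistic loss, both restricted to the observed label entries, plus a KL term tying the classifier output to the low-rank embedding.

```python
# Sketch of a joint objective in the spirit described above. The exact losses,
# link functions and trade-off constants here are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_objective(X, Y, M, U, V, W, lam=1.0, eps=1e-9):
    """X: n x d features; Y: n x m labels; M: n x m mask of observed entries;
    U: n x k and V: m x k low-rank factors; W: d x m classifier weights."""
    Q = sigmoid(U @ V.T)      # low-rank "completed" label probabilities
    P = sigmoid(X @ W)        # binary-relevance classifier output
    # (1) completion loss, restricted to observed entries
    completion = np.sum(M * (Y - Q) ** 2)
    # (2) logistic loss, restricted to observed entries
    logistic = -np.sum(M * (Y * np.log(P + eps)
                            + (1 - Y) * np.log(1 - P + eps)))
    # (3) per-entry KL(Q || P), forcing consistency between the two views
    kl = np.sum(Q * np.log((Q + eps) / (P + eps))
                + (1 - Q) * np.log((1 - Q + eps) / (1 - P + eps)))
    return completion + logistic + lam * kl
```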
Another way to understand this model is that we have encoded the structural properties, low rank and sparsity, onto the low-rank embedding Q we compute, and by forcing Q and P to be close to each other, those soft constraints propagate from Q to P. Hence, Q acts as a regularizer on the structure of the classifier's output. This is the philosophy of the technique called posterior regularization, so we named our model posterior regularization low rank, PR-LR for short.

The joint objective defined in the previous slides is unfortunately non-convex, but it can be optimized using an EM-style algorithm. We initialize all the variables, and in the M-step, we fix the low-rank embedding and update the parameters. In the E-step, we fix the classifier output and compute the low-rank embedding again. Both the E-step and the M-step can be done efficiently using stochastic gradient descent, and the entire EM process converges fairly fast, within a few iterations.

Here we go over some of our experimental results. We evaluated the models on three data sets, all gathered from the social bookmarking websites BibSonomy and Delicious, and all with a fairly large number of labels, ranging from near 200 to near 1,000. This shows the number of training examples, the number of features and the number of labels. We calculate evaluation metrics such as label-averaged and example-averaged F1 measure, both computed using only the user-provided annotations.

The first set of experiments tests the advantage of joint modeling, so we compared with three baseline models that can be seen as subcomponents of our model. The first is just a naive binary relevance classifier trained by treating the missing labels as inactives. The second is basically the same, but the loss function is defined over only the observed entries, and we also add a sparsity regularizer on the output. The third does label completion and classifier training in two separate stages. The last one is our model. We can see here that by ignoring the missing labels we already get a fairly strong baseline, but on all three data sets, our model, which uses label correlation and does joint inference, shows a significant and consistent improvement.

We also compared with some standard multi-label learning algorithms that have previously published state-of-the-art results, but both of them are trained treating all the missing labels as inactives. Here we see that our model has performance close to the best state of the art here, HOMER, and especially on the bigger data sets, our improvement is quite significant. It's also worth mentioning that the training process of HOMER is very complex, so it didn't finish within one week on our largest data set; our model, on the other hand, is quite scalable.

In conclusion, I presented how, for a complex structured prediction problem, gathering complete annotations is an obstacle to designing a machine learning algorithm, and how prior knowledge about the desired output is very effective in providing extra supervision for pruning the learning space. Posterior regularization is an effective way to achieve this goal; it works by enforcing soft constraints on the classifier output.
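As a small illustration of the evaluation just described, here is one plausible way to compute example-averaged F1 against only the user-provided annotations; it is a sketch of the idea, and the paper's exact protocol may differ.

```python
# Sketch of example-averaged F1 computed against the observed (user-provided)
# positive annotations only; one plausible reading of the metric above.
import numpy as np

def example_averaged_f1(Y_obs, Y_pred):
    """Y_obs, Y_pred: n x m binary matrices (observed positives, predictions)."""
    f1s = []
    for gold, pred in zip(Y_obs, Y_pred):
        tp = np.sum((gold == 1) & (pred == 1))
        if gold.sum() == 0 and pred.sum() == 0:
            f1s.append(1.0)
            continue
        prec = tp / pred.sum() if pred.sum() > 0 else 0.0
        rec = tp / gold.sum() if gold.sum() > 0 else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0)
    return float(np.mean(f1s))

Y_obs = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 1, 1], [0, 1, 1]])
print(example_averaged_f1(Y_obs, Y_pred))   # toy example, about 0.73
```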
As future work, we would like to design better evaluation methods, especially for the cases where we only have incomplete annotations. We also want to test the promise of our model across domains. And lastly, we should compare with some of the recent machine-learning work on learning with incomplete labeling. Thank you.
>>: Thank you very much.
>>: Okay, we have time for a question here. Giuseppe?
>>: I was wondering if you use any prior knowledge to initialize the EM, and what kind of prior knowledge?
>> Victoria Lin: That's a good question. I think the answer is you can, but we didn't. We just randomly initialized all the variables. For example, for the low-rank matrix here, we just initialized the entries with random values, and hopefully the learning process gets you from the initialization to a local minimum.
>>: Great. Thanks, and I think one quick question while we're still setting up there. Any others? Yes, I'll bring the mic to you back here.
>>: Thanks. So the prior knowledge you encode is that there's a low rank kind of completion --
>> Victoria Lin: Yes, low rank and sparse, basically.
>>: So do you imagine extending the kind of prior knowledge you can put into that kind of optimization? Like, for example, biases in terms of how the data is incomplete?
>> Victoria Lin: Right, right. I think the question is that the data might be incomplete in a biased way, so certain labels tend to be missing. I think the answer is still that you can, because the joint model we present here is actually very flexible. We didn't make extra assumptions on how you do the label completion and how you do the training. So basically, if you have a better way to model the label completion part, you can just add it into the framework, and I think a similar alternating minimization would still apply. But I think that part is worth trying, because that's really the question we should ask.
>>: Thank you again very much. Okay, we have a couple more talks to go here, so welcoming my colleague from Vancouver back to the podium, we're going to be hearing a presentation on evaluating open relation extraction over conversational text. Over to you.
>> Yashar Mehdad: Thank you. Hi again. This is work that was done by one of our grad students, Mahsa, but since she couldn't be here today, I am going to present her work. It is more preliminary work about how we can use relation extraction tools, or open information extraction tools, for summarization, and mainly we are looking at some evaluation results to see whether we can use those tools for summarization as they are or whether we need some adaptation.

As I talked about conversational data in my previous talk, we have a very precious source of information in conversational data, so we are very interested in digging into those data, running some analytics, and gaining some information from them. And, as we know, it grows fast, it grows exponentially, and we need some way to deal with the information overload we get from such data. So the intuition is that for any kind of text processing or text analytics you are going to run over conversational data, what you need is to extract the valuable information inside the text.
So one of the things we looked at is information extraction, specifically open information extraction, where the relations are not predefined, so it's pretty open. We decided to look at these systems and see if we can extract valuable information, and then from there run our summarization and use that information in our summarization system. Open relation extraction tries to extract relation triples: an entity, a relation, and another entity. I have one example here about Facebook and WhatsApp: argument one is Facebook, argument two is WhatsApp, plus the relation between them. There are many advantages to using open relation extraction and open information extraction tools. One is that these tools are available nowadays, and many groups are working on improving them. They are scalable, so you can run them and extract relations quite robustly, and you don't need much, if any, domain-specific knowledge to run those systems on your data.

So we were mainly motivated to use open information extraction for summarization, especially because we have quite a few state-of-the-art open IE systems, such as ReVerb, OLLIE (which is an improved version of ReVerb), SONEX, TreeKernel and EXEMPLAR. But the fact is that conversational data are not like news. They are very noisy and less structured, so we are dealing with more problems with conversations, and it's not the case that the results you get when you run open information extraction on news are the same results you can expect over conversations. And that's not only forums and blogs but also, say, tweets: they are very short, less structured and full of acronyms, and that makes it difficult for such systems to extract the relations or information. We know, actually, that for most text preprocessing systems or tools we use for different applications in our research, when we run them on conversational data or other such noisy data, the performance really decreases. For example, when you run a named entity tagger or a syntactic parser over tweets, you get absolutely nothing if they are not trained on domain-specific data. And we know that about 8% of missed extractions and 32% of incorrect extractions in those well-known systems, ReVerb and OLLIE, come from incorrect parsing, so if we fix the parsing, we can solve those kinds of problems as well. So there are many challenges in dealing with conversational data.

In this preliminary work, we first of all sample a good data set from different sources of conversational data, from forums and blogs to e-mails, meetings and tweets, and we run current open information extraction systems on them and evaluate the results. At the same time, since some of the sentences in those kinds of data sets are not really well formed, or are quite complex for the systems to understand, we also look at whether text simplification techniques can help us do better information extraction or not. So those are basically the contributions of our work. For data set creation, we have this set of conversational data sets: we have reviews, e-mails, meetings, blogs and online discussions, and social networks. What we need to do is to take a good sample of each data set, put them together and run the systems, but sampling from such huge and varied data sets is not a simple random sampling problem.
You need to find a quite consistent sampling method to get your data set. So what we did was use [indiscernible] sampling with some features, some conversational features and some other features from the data sets, to get distributions that represent the many characteristics that exist in conversational data sets. These are the sets of features we use: we have syntactic features and conversational features, and we use them to cluster our data sets and sample from those clusters. For text simplification, we know that complex sentences can be simplified using these methods, so we use one of the state-of-the-art text simplification systems, called Tris, which is a statistical sentence simplification system. It's a very interesting model, so we tried to use that.

So in summary, we ran one of the state-of-the-art open information extraction systems, OLLIE, over our data set, evaluated the results manually, simplified the data set, ran it again, and then compared to see how the results changed. For evaluation, our metrics are the number of relations extracted, the accuracy of the extracted relations, and the confidence score coming from OLLIE.

For the results, I am giving you the important ones. We can see that in most cases text simplification helps the information extraction, which is one of the things we wanted to know: whether we should use text simplification before running information extraction or not. At the same time, the results for different data sets are different, and across different modalities, with or without simplification, the results differ in a number of cases. For example, when we run OLLIE by itself, the results on the Slashdot and BC3 corpora are the best, and the review corpus is the worst; when we simplify, the BC3 corpus is still the best and the reviews are still not very good. We found that text simplification is quite effective, especially when we have data like Twitter data, as our experiments showed, but it is not very effective in increasing the accuracy of open information extraction with OLLIE on the Slashdot data set. There is some analysis in our paper that you can have a look at.

So in conclusion, with conversational data sets we have lots of challenges: we have very complex kinds of text in terms of noise and lack of structure. We learned that e-mails and blogs are somewhat easier for relation extraction than product reviews, so probably for product reviews we should devise some other ways of using such knowledge for summarization. If you come to our poster, we talk about review summarization as well. For future work, we are thinking of using systems in combination, or different simplification methods, and we are also trying to take advantage of the extracted relations for summarization. Thank you very much.
>>: Thank you. Okay, we have time for a question or two.
>>: So by extracting relations, are you also interested in the relations between utterances, talking about --
>> Yashar Mehdad: Between utterances?
>>: Yes, such as --
>> Yashar Mehdad: Exactly, we are. We are very interested in that, and this is actually one of the things we are working on. So, for example, the rhetorical structure can be very interesting for seeing the structure of a conversation and using that for summarization.
We have some insights and some work done; we are at the evaluation phase, so if you come to the review summarization poster, I will talk more about the conversational features we use in our framework.
>>: Great, thank you very much.
>> Yashar Mehdad: Thank you.
>>: For our last talk of the morning, I am pleased to introduce Stephanie Lukin, who is going to be talking to us today about identifying subjective and figurative language in online dialogue. Over to you.
>> Stephanie Lukin: Okay. Hi. Thank you. My name is Stephanie Lukin, and I'm going to talk about identifying social language in online dialogue, specifically focusing on nastiness and sarcasm. Many of the current NLP tools are based on a monologic model of language from traditional media, such as the news, but as we've been hearing throughout the day so far, social media is becoming more prevalent, and the style of this language is very different. It consists of dialogue and emotion and informality, so in our work, we're interested in creating new models of dialogue by taking advantage of the abundance of all this social data.

In 2012, Walker et al. released the Internet Argument Corpus, which consists of annotated exchanges from online debate data. It was annotated by Mechanical Turkers. They were shown a dialogic turn like the one up here and were asked to evaluate the overall language of the responder, which is in bold. A variety of types of language were annotated, shown along the side, but of all these styles, we are specifically interested in sarcasm. But how can we define sarcasm? It's very difficult. There are a lot of different definitions, people don't interpret it in the same way, and furthermore, there's theoretical work that claims you need the context, or you need world knowledge, in order to determine if something is sarcastic. Just an important note: in the previous Mechanical Turk task, when we asked people to identify whether an utterance was sarcastic or not, we didn't give them any definition of sarcasm. We just asked, is this sarcastic, in your opinion? So in this work, we're hoping to examine the utterances that they picked out and try to home in on the aspects that they believed are actually sarcastic.

Looking at our forums data, we found sarcasm is present in about 10% of the data, so it is prevalent but still pretty scarce, and it's expensive to have human annotators collect more data. So we think it would be useful to have a technique with which we can learn sarcasm from a small amount of well-labeled, annotated data. We look at a method designed to do just this, from Thelen and Riloff and from Riloff and Wiebe, in 2002 and 2003. We recreate their method by applying it to our data, in hopes that we can learn new labels from the small amount of labeled data that we have. And we also decided to look at nasty language in addition to sarcastic language, to see if this Riloff and Wiebe method generalizes well to the domain of dialogue.

This is their method, in summary. They first develop cues. Their original task, I'll point out, was detecting subjectivity versus objectivity in a monologic domain, so the first thing they do is develop cues for identifying subjectivity. They use these learned cues to train a cue-based classifier, and the goal of the classifier is to maximize precision at the expense of recall.
They use this classifier as a first approximation on a large amount of un-annotated data, and because the precision of this classifier is high enough, they can take the predicted labels and learn new patterns from that data, specifically syntactic patterns. They then bootstrap this process and learn new patterns on un-annotated data. In our work, as we were following this process, we found that we can't achieve that high precision with just the cue words. So we use the cue-based classifier we developed, we also develop a pattern-based classifier, and we combine them together, and in the end we achieve fairly high precision for this task.

The first step is to develop these cue words for our sarcasm and nastiness domain, but as I mentioned, the theoretical work says that you may need context in order to determine if something is sarcastic. We do have labeled data already from our corpus, and we run a simple statistical analysis to select unigrams, bigrams and trigrams, but we also want to see what providing context will do. So we create a Mechanical Turk HIT where we show Turkers the quote and the response and ask them to pick out the things in the response that people could think are sarcastic. Here are some examples of the highest-rated cues we have; our statistical test was chi-squared, and MT is Mechanical Turk. There is some overlap; you can see "oh" and "oh yeah" appearing in both of them for sarcasm. But we also point out that in the chi-squared cues, things like "we" appear, which we think could just be an overtraining issue.

So we use the cues we found and build this cue-based classifier. To rank the cues in order of importance, we use their frequency and their reliability as measured on our development set. We train over a variety of thresholds, theta-one and theta-two, and we make a classification if two or more of these cues are present above those thresholds. Here are our results for sarcasm and nastiness. As I mentioned before, our goal is to create a high-precision classifier, and this is not very high precision. We do notice that nastiness does better than sarcasm, so this may be a first indicator that nastiness is easier to identify using cue words than sarcasm is. Okay, so we don't get very high precision, but we continue in the pipeline by next learning syntactic patterns, and we come back to this problem later.

So next we learn these syntactic patterns, and the point is that they can generalize across utterances; they don't have to be exact surface matches like the cue words. These are the templates that were used in the original Riloff and Wiebe work. As some more examples, this template, <noun> <prep> <noun phrase>, can match any of these examples that we find in our text. We then asked the question: are these templates tailored to the subjectivity domain for which they were developed? So we also look at our data and try to develop our own sarcastic patterns; here are some examples. "Oh" plus adverb, "oh right" and "oh sorry" appear a lot in our data. So we run our pattern classification, and what we call the baseline here is just using the Riloff and Wiebe syntactic patterns without our new cues or our new patterns. There is an increase, especially in nastiness, which is very good, and in sarcasm, it is better than the cue-based classifier was previously.
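Here is a minimal sketch of the kind of cue-based classification just described: keep cues whose frequency and reliability on a development set clear two thresholds (called theta1 and theta2 here), then label an utterance if at least two surviving cues appear in it. The threshold names, data structures and toy numbers are illustrative, not the paper's exact scoring.

```python
# Sketch of a high-precision cue-based classifier in the spirit described
# above. Cue statistics and thresholds below are invented for illustration.

def select_cues(cue_stats, theta1, theta2):
    """cue_stats maps a cue n-gram to (frequency, reliability on the dev set)."""
    return {cue for cue, (freq, rel) in cue_stats.items()
            if freq >= theta1 and rel >= theta2}

def classify(utterance, cues, min_hits=2):
    """Label the utterance positive if at least min_hits cues occur in it."""
    text = utterance.lower()
    return sum(1 for cue in cues if cue in text) >= min_hits

cue_stats = {"oh really": (25, 0.71), "oh yeah": (40, 0.65), "we": (300, 0.52)}
cues = select_cues(cue_stats, theta1=20, theta2=0.6)
print(classify("Oh really? Oh yeah, that'll totally work.", cues))   # True
```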
Looking at the new patterns, we see that they help very much in sarcasm, which was what we were expecting, but in nastiness, the precision jumps while recall drops 5%, which is maybe just 10 utterances or some very small number. In this case, we did not develop new patterns for nastiness, so that could be a reason why this looks a little strange.

Okay, so we've gone through this pipeline, but as I've mentioned a couple of times now, the cue-based classifier is not achieving the high precision we need to make this actually work. What we do have is a trained cue-based classifier and a trained pattern-based classifier, so we combine them. We have two variants. One makes a classification if either the cue-based or the pattern-based classifier says yes, this is sarcastic; that's the "or" in the table below. The other classifies an utterance as sarcastic only if they both say yes; that's the "and" in the table. Comparing against the cue-based classifier, which is the yellow, the 51%, the "or" classifier does much better in terms of recall: because we're using both classifiers, we're covering a lot more of the data. Precision is still increasing as well. With "and", we do very well in terms of precision, but as expected, recall is lower because we're being more specialized and focused in our classifications. In nastiness, we see the same pattern: the "or" classifier does better in terms of precision than the cue-based one, and with "and", our precision is 88% but with a lower recall of 30%.

So in conclusion, the goal of the Riloff and Wiebe work was to develop a high-precision classifier that can be used as a first approximation and then to learn syntactic patterns from a large amount of un-annotated data. We couldn't really achieve that with our cue-based classifier alone, but our combined classifier is on the right track. We've learned that context is pretty important for these cues: on their own, they're not doing as well as we expected, but the patterns do generalize well, especially our syntactic patterns. In future work, we're going to look at un-annotated data, run our classifier on it, have human annotators compare, and we're hoping to get a high amount of agreement and overlap.
>>: Thank you very much.
>> Stephanie Lukin: Thank you.
>>: So we have time for a few questions, and you probably want a nasty question so you can use a different [indiscernible].
>>: Hi, there. So, very interesting, thank you. I was wondering, when the Turkers review the utterances and annotate them for sarcasm and so on, do you have a sense of what the inter-annotator agreement was like? Did the Turkers generally agree on what was sarcastic, or was there a lot of variability there?
>> Stephanie Lukin: I can't remember the exact numbers, but for the utterances that we selected for our study, we made sure that there was high agreement. We had about seven annotators per utterance, and we picked ones where four or more agreed that it was sarcastic.
>>: Great. Thank you very much. Turning it over to you, thanks very much for this morning's session, and announcements relating to lunchtime.