>> Chris Quirk: So welcome, everyone. It's my pleasure to introduce Miguel Ballesteros, who is visiting. He's normally based in Barcelona, but he's currently doing a one-year stint at Carnegie Mellon, where he's been working with a number of really interesting people and pursuing directions in introducing structure into recurrent neural networks, including stack-LSTMs. He has been doing very influential and effective research in parsing and now beyond. So with that, Miguel, please.

>> Miguel Ballesteros: Okay. Thank you, Chris, for the introduction. Today I'm going to talk about transition-based natural language processing. You will understand what that means within the hour. First, you might be wondering who I am. I did my Ph.D. at the Complutense University of Madrid, in Spain. During that period I also spent time at Uppsala University working under Joakim Nivre, from whom I learned all the wonders of transition-based parsing models. Then I moved to Barcelona, where I worked on generation and on dependency parsing as well. I also did a research stay at the Singapore University of Technology and Design, where I worked on phrase-structure shift-reduce parsing. After that I moved to Carnegie Mellon University, where I have worked mainly with Noah Smith and Chris Dyer on parsing and other problems.

This talk is about linguistic structure prediction: you have sentences like "Mark Watney visited Mars," and you want to infer their linguistic structure. That structure can take many forms. One example is named entity recognition, where, given a sentence like "Mark Watney visited Mars," you infer that the chunk "Mark Watney" is a person and that "Mars" is a location. Another example is dependency parsing, where you recover binary dependency relations between words, giving you a dependency-based analysis of the sentence.

In this talk I take a supervised approach to linguistic structure prediction, in case anybody thinks I'm doing unsupervised learning. We have a treebank with gold structures, that is, pairs of a sentence and its gold structure, and the challenge is, given a new sentence, to infer its structure. I say this is a challenge because a new sentence can be something completely novel, a sentence that has never been written before, and the parser still has to analyze it properly, and we expect it to.

The dominant approach to linguistic structure prediction, not only to parsing, because this talk is not only about parsing, is what we call the graph-based approach. The idea is that you build a graph connecting all the words, you score all the edges with some machine learning model that says, for example, how likely "Diego" and "plays" are to be attached together, then you search for the best-scoring tree with a graph-based algorithm, and you output a dependency tree or whatever linguistic structure you need for the sentence. A toy sketch of that pipeline looks like this.
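A minimal illustrative sketch of the graph-based idea, not any particular system: the sentence, the scoring function, and the per-word argmax decoder are all placeholders, and a real graph-based parser would replace the last step with a maximum-spanning-tree search such as Chu-Liu/Edmonds over the same scores.

```python
import itertools

def score_edge(head, dependent):
    # Placeholder scorer: a real model would learn a compatibility score
    # for this head-dependent pair from the treebank.
    return -abs(head[0] - dependent[0])  # toy bias toward nearby attachments

def toy_graph_based_parse(words):
    # Nodes are (position, word), with an artificial ROOT at position 0.
    nodes = [(0, "ROOT")] + [(i + 1, w) for i, w in enumerate(words)]
    scores = {(h, d): score_edge(h, d)
              for h, d in itertools.permutations(nodes, 2) if d[0] != 0}
    # Naive decoding: the best head for each word independently. This can
    # create cycles; a real graph-based parser runs a maximum-spanning-tree
    # algorithm such as Chu-Liu/Edmonds over the same scores instead.
    return {d: max((h for h in nodes if h != d), key=lambda h: scores[(h, d)])
            for d in nodes[1:]}

print(toy_graph_based_parse("Mark Watney visited Mars".split()))
```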
The graph-based approaches have some problems, though, mainly speed, complexity, and the hard inference algorithms you have to implement. Of course they are also very powerful in the sense that they provide a global solution to the problem, which makes them very attractive for people who like to incorporate all the information they can into their models. The question of this talk is whether we can model linguistic structure prediction problems, such as NER or parsing, as a sequence of actions, and do so at the same level of accuracy as the graph-based approaches.

So what is a transition-based approach to natural language processing? Basically, we have a transition system, an abstract state machine. We process the sentence sequentially, from left to right. We normally have a buffer that holds the words we still have to process, and we may use different data structures; normally we use a stack and a buffer, as in the shift-reduce parsing of compilers, which you may be familiar with. At each step, an algorithm has to select the best operation, the best action, given the current parsing state. We normally use a classifier for this; you can pick whichever classifier you want, and people have tried many.

These transition-based models have a lot of promise, because they are fast and very flexible across different kinds of problems. They are linear in the length of the sentence, since you take a constant number of actions per word and finish when the sentence is consumed, which makes them very attractive. They are also flexible in that you can design whatever set of actions fits your problem. Of course they present challenges too. Feature engineering is hard, because you need to define the features over the stack and the buffer, do it properly, and that normally requires linguistic intuition. There is also a local-optimum problem: since the parser makes greedy decisions, it can commit to a wrong decision at some point, and that can hurt. In this talk I'm going to try to convince you that we can do much better than manual feature engineering by using a different representation, and that we can also mitigate the local-optimum problem by using a better classifier.

So I'm going to talk about powerful sequential stack-based approaches to natural language processing, that is, transition-based natural language processing. All of them are based on transition-based systems with stack-LSTMs, a model we presented at ACL 2015. I'm going to talk about a transition-based dependency parser; about extensions for semantic role labeling, doing joint syntactic and semantic parsing; about how we can use these powerful syntactic models for language modeling and generation; and about a transition-based NER system that was accepted yesterday. At the beginning, since this is how people started doing transition-based approaches to NLP, I'm going to talk about parsing. Okay. So what is dependency parsing?
Dependency parsing is syntactic parsing of natural language with dependency-based syntactic representations. These representations have a long tradition in descriptive and theoretical linguistics, and they are very popular right now in computational linguistics: we have a lot of treebanks, people have worked a lot on parsing, and there are a lot of baselines to beat. The main reason for this popularity is that dependency parsing is very useful for downstream applications such as question answering or machine translation, applications where syntax helps you do the task better.

More formally, a dependency graph for a sentence like this one is a directed graph in which the set of nodes represents the tokens, basically the words of your sentence, and a set of directed arcs represents the dependencies. Each dependency attaches a dependent to a head, and dependencies can be labeled with a dependency type. So you know that Watney is the subject of visited, and the arc is labeled with the subject relation.

I'm going to use Joakim Nivre's arc-standard transition system, which operates on a buffer and a stack with three actions. The first action is shift, which takes a word from the buffer and pushes it onto the stack. Then, whenever you have two words on the stack, you can create an arc from left to right or from right to left; these are the other two actions, left-arc and right-arc. And if you want to parse non-projective trees like this one, where you have crossing edges and there is no way to draw the tree without them, you can also incorporate a fourth action, swap, which reorders words on the stack; with it you can produce any reordering and parse any non-projective tree for the sentence. Even with the swap action, the parser remains fast, linear in expectation in the sentence length.

Let me walk through how the arc-standard system handles the sentence I'm using throughout the presentation. At initialization the buffer is full of the words of the sentence we are parsing and the stack is empty. Since the stack is empty, the first thing to do is a shift, taking the word Mark and pushing it onto the stack. With only one word on the stack, the only legal move in the arc-standard parser is another shift, so we push the word Watney. Now that we have two words on the stack, we can decide whether to shift another word from the buffer or to create an arc from Mark to Watney or from Watney to Mark. Here the best thing to do is the left-arc between Mark and Watney, and when we take it, the dependent, Mark, is removed from the stack. Before continuing the example, here is how these actions look in code.
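A minimal sketch of the arc-standard actions, assuming plain Python lists for the stack and buffer and a hard-coded oracle that replays this example; the real parser chooses each action with a classifier, and the labels here are illustrative.

```python
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))              # push the next input word

def left_arc(stack, buffer, arcs, label="dep"):
    head, dep = stack[-1], stack.pop(-2)     # top is head; below-top depends on it
    arcs.append((head, label, dep))

def right_arc(stack, buffer, arcs, label="dep"):
    head, dep = stack[-2], stack.pop(-1)     # below-top is head; top depends on it
    arcs.append((head, label, dep))

# Replaying the running example with a hard-coded action sequence:
stack, buffer, arcs = [], "Mark Watney visited Mars".split(), []
for action in [shift, shift, left_arc,       # attach Mark under Watney
               shift, left_arc,              # attach Watney under visited
               shift, right_arc]:            # attach Mars under visited
    action(stack, buffer, arcs)
print(arcs)
# [('Watney', 'dep', 'Mark'), ('visited', 'dep', 'Watney'),
#  ('visited', 'dep', 'Mars')]
```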
Back to the example. Since we again have only one word on the stack, we do another shift and push visited. Again, we can shift Mars or build an arc between the two words on the stack; here the classifier decides that the best thing is to create the arc making Watney the subject of visited. The parser keeps going, shifting Mars onto the stack, and finally it creates the arc from visited to Mars. That is how a transition-based dependency parser works. You can see the whole action sequence here, and the number of operations is linear in the length of the sentence: eight actions for four words, always about twice the number of words in this parser.

The question, then, is how the classifier selects the best action given a parsing state. We need features, and these features are defined over the stack and the buffer. I worked on MaltOptimizer, so I know very well how feature selection is done for transition-based parsers. Traditionally, people define fixed-window features over the stack and the buffer: say, take the first couple of words of the stack and the first couple of words of the buffer, and define singleton features over their part-of-speech tags, word forms, et cetera. You feed all these features into a classifier, and the classifier makes predictions. This is what a standard, default feature set looks like for a transition-based parser; there are also merged features, conjunctions of a couple of atomic features. And as you can see, on the buffer side you only look ahead to about position three.

But now we can do better, because we have recurrent neural networks, and with them we can get a much better look-ahead over what is coming in the buffer and what we have in the stack. The idea is that recurrent neural networks are good at learning to extract complex patterns from sequences. They are very good for data that is ordered and context-sensitive, where later elements of the sequence depend on previous ones, and that is exactly what we have in transition-based parsing with a stack and a buffer. They are also good at remembering information over long distances. However, the simple implementation of recurrent neural networks has a bias toward representing the most recent information, a consequence of what people call the vanishing gradient problem. So people came up with LSTMs, long short-term memory networks, a variant of recurrent neural networks designed to cope with vanishing gradients. They add an extra memory cell with gates that decide when to keep information and when to discard it, and with that they improve over plain RNNs on many tasks, including parsing.

What we presented at ACL builds on this. An LSTM, as you know, is a sequential model, an encoder of a sequence. But in transition-based parsing we need stacks. So we augmented the LSTM with a stack pointer and two constant-time operations.
The push operation is the same as feeding a new input to the LSTM. The pop operation simply moves the stack pointer back. With these two constant-time operations we can model the behavior of a stack, and we can parse very efficiently using stack-LSTMs. The intuition is that a summary of the stack contents is obtained by reading the output of the network at the location of the stack pointer.

Let me give you an example of how this works. Here is the stack-LSTM, with the stack pointer pointing at an output: this is the output layer, then the hidden layer and the input layer. For a stack that contains only the word Mark, the summary is the output the pointer indicates. Now we do a pop, removing that symbol from the stack; what happens in the stack-LSTM is that the pointer moves back to the empty-stack state. So, as you can see, we can model any stack contents. And if we then push another word, the pointer now indicates the new output, which is exactly what we need. If this is difficult to follow, I have another slide, the next one, which some people find clearer; some people like this one, some people like the other one, so I decided to keep both. Basically, when you push, you do the same thing as in a normal LSTM, and when you pop, you move the pointer back; a push after a pop then creates another branch, like in a tree. Yeah?

>>: So there's only one stack total?

>> Miguel Ballesteros: At the end you have one stack, yes.

>>: But its contents have a kind of tree structure to them?

>> Miguel Ballesteros: Yes. So at the end, for instance, if you are here and you want the output of your LSTM, you just run the recurrent net from wherever the pointer is.

>>: So you don't need to remember those other branches?

>> Miguel Ballesteros: You don't. But if you start removing them, the complexity grows, and you don't want that. The idea is that we just keep whatever we need. Of course you could get rid of the dead branches, but for parsing we don't. Yeah. Any more questions?

>>: Is this the same kind of model that appeared at NIPS?

>> Miguel Ballesteros: Which NIPS [inaudible]?

>>: [inaudible] I saw a similar paper. Maybe it's a different author. Is it a different kind of model or the same model?

>> Miguel Ballesteros: Well, this is the stack-LSTM, which was first published by me and other authors at ACL 2015.

>>: The one I'm thinking of was continuous [inaudible]; this one seems more discrete.

>> Miguel Ballesteros: Okay. So we'll keep going. Any more questions about this? No? Okay. So this is basically what we use: stack-LSTMs, from which you can get an encoding of the stack at any time. We are augmenting LSTMs with this stack pointer, which gives us stacks, and what we need in transition-based parsing are stacks. Here is the push/pop mechanism in code.
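A minimal sketch of the stack-LSTM interface, assuming PyTorch's `LSTMCell` purely for illustration; the key point is that pop never recomputes anything, it just moves an index back to the parent state, and a later push continues from whatever state the pointer indicates, leaving dead branches in place as discussed above.

```python
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    """Sketch of an LSTM augmented with a stack pointer and O(1) push/pop."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        h0 = torch.zeros(1, hidden_size)
        c0 = torch.zeros(1, hidden_size)
        # Each entry is (h, c, parent index); entry 0 is the empty stack.
        self.states = [(h0, c0, None)]
        self.pointer = 0

    def push(self, x):
        # Continue the LSTM from whatever state the pointer indicates;
        # pushing after a pop creates a new branch, old branches are kept.
        h, c, _ = self.states[self.pointer]
        h_new, c_new = self.cell(x, (h, c))
        self.states.append((h_new, c_new, self.pointer))
        self.pointer = len(self.states) - 1

    def pop(self):
        # Constant time: move the pointer back to the parent state.
        self.pointer = self.states[self.pointer][2]

    def summary(self):
        # The embedding of the whole stack is the output at the pointer.
        return self.states[self.pointer][0]

# Tiny usage example with random word vectors:
s = StackLSTM(input_size=4, hidden_size=8)
mark, watney = torch.randn(1, 4), torch.randn(1, 4)
s.push(mark)
s.push(watney)
s.pop()                   # back to the state where only "mark" was pushed
print(s.summary().shape)  # torch.Size([1, 8])
```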
So the next question, of course, is how we use these stack-LSTMs to do transition-based dependency parsing. We have a buffer, as you saw in the example, we have a stack, and we also have a list with the history of actions the parser has taken so far. Each of these three components is associated with a stack-LSTM, so at any time step we have one encoding of the stack, one of the buffer, and one of the action list. Strictly speaking, the list of actions is not a stack-LSTM but a plain LSTM, because you only push onto it; you never pop from the history, since you cannot change the past. The buffer and the stack, though, are genuine stack-LSTMs. We then use a rectified linear unit to produce the parser state embedding from the three encodings. It looks like this: here is the buffer with its stack pointer pointing at this element, here is the stack-LSTM for the stack, and here is the stack-LSTM for the actions taken so far; on top there is a rectified linear unit and a softmax over the actions, and the parser decides what to do given the encoding of the three stack-LSTMs.

Since this is a neural network parser, we have to represent the words as well as possible, so we need word embeddings. Everywhere you see words on the previous slide, we need vector representations for them. First, we have a learned vector representation for each word type, estimated from the training corpus; for out-of-vocabulary words, the words not included in the training corpus or very rare there, we use a fixed UNK representation. Second, we have pretrained word embeddings, trained with the neural language model of Ling et al. from NAACL last year. And third, we have a learned representation of the part-of-speech tag. We concatenate these three vectors, and that concatenation is the embedding we use for each word.

Another thing that is very, very useful in transition-based parsers is composition functions. In transition-based parsing we traditionally call these history-based features: the things that have already been decided. If you have already found the subject of a verb, you don't want to find another subject for the same word, right? Composition functions give you the embedding of a dependency subtree. For example, "overhasty decision" here is a subtree, and we want the best embedding we can get for it. What we do is run a recursive neural network over the embeddings of the head, the dependent, and the relation, plus a bias: you take the words overhasty and decision, combine them with the relation, and you get an embedding for the subtree. This is how we represent partial dependency structures, which are a very, very useful source of information in parsing. Here is a small sketch of the state embedding and this composition.
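A minimal sketch, with arbitrarily chosen names and dimensions; the rectified linear unit, the softmax over actions, and the composition over (head, dependent, relation) are the operations named in the talk, while the `tanh` nonlinearity and everything else here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_ACTIONS = 8, 5  # illustrative sizes

# Parser state: concatenate the three stack-LSTM summaries, apply a
# rectified linear unit, then a softmax over the parser actions.
W_state = nn.Linear(3 * DIM, DIM)
W_act = nn.Linear(DIM, N_ACTIONS)

def action_distribution(stack_sum, buffer_sum, history_sum):
    state = F.relu(W_state(torch.cat([stack_sum, buffer_sum, history_sum],
                                     dim=-1)))
    return F.softmax(W_act(state), dim=-1)

# Composition function: a recursive-network step that squeezes a
# (head, dependent, relation) triple into one subtree embedding,
# which then stands in for the head on the stack.
W_comp = nn.Linear(3 * DIM, DIM)

def compose(head_vec, dep_vec, rel_vec):
    return torch.tanh(W_comp(torch.cat([head_vec, dep_vec, rel_vec], dim=-1)))

# Usage: after an arc is built, the head's vector is replaced by the
# composed subtree vector before the next action is scored.
head, dep, rel = torch.randn(DIM), torch.randn(DIM), torch.randn(DIM)
subtree = compose(head, dep, rel)
probs = action_distribution(subtree, torch.randn(DIM), torch.randn(DIM))
print(probs.sum())  # ~1.0
```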
>>: Quick question. In the case where you only have dependents on one side, it seems unambiguous how to compute this, but if you have dependents on both sides, is there ambiguity about what you pull in first?

>> Miguel Ballesteros: There is some choice, but you are always composing head, dependent, and then relation, so you --

>>: But there could be -- say I have -- let's see. English is tough. But, well, even if I just say "John went home," went has two children, one is John and one is home. Does it matter in which order you consume them?

>> Miguel Ballesteros: It doesn't really matter. You will get a different embedding in each case --

>>: Exactly. One will be a left --

>> Miguel Ballesteros: One will be a left arc and the other a right arc. The order is whatever the parser chooses, but since this is a left-to-right parser, you normally get the same order every time, unless you are doing swapping and non-projective parsing, which can change it. Yes?

>>: For English, why is it a left-to-right parser?

>> Miguel Ballesteros: Because we are using a transition-based approach, in which we shift words one by one from the buffer to the stack, so it is always a left-to-right parser.

>>: But could it not equally have been --

>> Miguel Ballesteros: We actually tried it in the other direction, and it works similarly.

>>: I'm just thinking that English by default is right-branching in structure, so maybe it's actually easier to go from right to left.

>> Miguel Ballesteros: Some people have tried several variants: first left to right, then right to left, then some kind of combination of the two. They get a small improvement, but it's not a major difference. Since this is a transition-based parser, you should think of it as always parsing left to right, where you can also swap words if you want non-projective parsing for the languages that need it. Yes?

>>: But at any point in time [inaudible] left to right. How about doing something bidirectional?

>> Miguel Ballesteros: Yeah, you can do it bidirectionally. But in this case we are only consuming words in a left-to-right fashion. Okay? Sure.

Okay. So these are the first experiments we ran, the ones for the ACL paper, on the English Penn treebank with Stanford dependencies. We use predicted part-of-speech tags from the Stanford tagger at 97.3 percent, which is state-of-the-art accuracy for this treebank. And we use basically the same settings as Danqi Chen and Chris Manning, whose greedy left-to-right neural network parser is the closest published model to our stack-LSTM parser, so we compare our results with theirs. Unlabeled attachment score is the percentage of tokens with the correct head; labeled attachment score is the percentage of tokens with the correct head where we also get the label right. As you can see, compared with Chen and Manning we gain more than a point here, and for labeled attachment the picture is much the same, about a point better. So our parser is better in all conditions compared with Chen and Manning. What I'm going to show you now are ablation conditions, in which we remove the components I presented before, the composition functions, the word embeddings, or the part-of-speech tags, to see how the parser reacts. If we remove the part-of-speech tags, the parser drops to 92.67 unlabeled attachment score.
People normally look at this column when comparing parsers, and around 92.6 is one of the best results reported without part-of-speech tags, so this is a very, very strong result. If you remove the pretrained word embeddings, the parser gets 92.4: you are removing all the semantic and contextual information the embeddings carry, and the parser suffers, but it is still very competitive. And if you remove the composition functions, that is, the parser's representation of partial dependency structures, so that instead of the embedding of the partial tree we keep only the embedding of the head on the stack, the parser drops to 92.1, which is in line with previous research on transition-based parsing and history-based features. Yes?

>>: Are these ablations cumulative or [inaudible]?

>> Miguel Ballesteros: No, they are not cumulative; each one is relative to the full model --

>>: And you put the parser -- you're saying that came before [inaudible].

>> Miguel Ballesteros: Yes, yes, yes. That's why they are at the same level. That's true, it is a little bit confusing.

Okay. If we remove the stack-LSTMs and use stack-RNNs, so plain RNNs instead of LSTMs, the parser gets 92.3, which is still better than the feed-forward neural network but of course not competitive with the stack-LSTMs. Then we ran the experiment on the Chinese treebank. For this treebank we followed Yue Zhang and Stephen Clark, the people who started the experiments with it, using gold part-of-speech tags, again the same settings as Danqi Chen and Chris Manning in their EMNLP paper. Here we have more or less the same picture: the stack-LSTMs are about three points better than their parser, and these are state-of-the-art results for these settings. Again, in the ablation conditions you see differences. If you take out the part-of-speech tags, you see a big drop, and this is because we are using gold part-of-speech tags in this case, so the contribution of the part-of-speech information is higher.

Well, some conclusions for this part. This is the highest result ever published for a greedy parser, a greedy left-to-right transition-based dependency parser. It is very close to beam-search and complex graph-based parsers, or I would even say better than most of them, which is quite remarkable. And it runs in linear time with greedy decisions, so you get a parser that is super fast and provides one of the best results ever published. Okay. Any questions so far?

>>: I've got a question about the -- several slides ago you said the part-of-speech tags were embedded. But, I mean, there are only about 40 part-of-speech tags. I don't understand what an embedding of --

>> Miguel Ballesteros: Well, it's a small embedding, of low dimension, an embedding that we learn during training. And in some languages you have many more than 40 tags. If you go to Korean, and I'll show a Korean result later, you have a part-of-speech tagset with something like a thousand different tags.
So an embedding there can buy you something. Okay. So let's move on to character-based modeling of words. As you remember, we had an embedding of the part-of-speech tag, a pretrained embedding, and a learned representation of each word from the training corpus. We can do something better: character-based modeling of words. You run a bidirectional LSTM character by character over each word of your input sentence, once in each direction, concatenate the two resulting embeddings, and concatenate that with the part-of-speech embedding. The intuition is that you capture prefixes and suffixes from the character-based embedding, and that you can compute it for every word: you don't need to care whether a word was in the training vocabulary or out of vocabulary. This was published at EMNLP 2015; I'll show the results in a moment.

We projected these character-based representations of words into two dimensions with [inaudible] and plotted them. You can see a cluster of English past-tense words; you also see a lot of gerunds here, some adverbs over there, and words sharing the same endings clustered together. So the model is picking up on suffixes and prefixes, and that information can be very useful in parsing, because suffixes and prefixes are what gives part-of-speech information most of its power. So the intuition and the motivation behind this character-based embedding is that we can improve the handling of out-of-vocabulary words and improve performance for morphologically rich languages.

We ran an experiment comparing a baseline model, which the tables call WORDS, with the character-based model, which they call CHARS. The baseline is the same model whose English and Chinese results I presented before, but with the pretrained word embeddings removed, so there are no additional resources beyond the treebank itself. We experimented on treebanks from the shared task on Statistical Parsing of Morphologically Rich Languages: Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish. As you can see, some of these are agglutinative languages, in which words are built by stringing morphemes together. We also included Turkish, because Turkish morphology is agglutinative and we expected it to pair very well with character-based modeling. For completeness, we also ran English and Chinese. We use no explicit morphological features, to isolate the effect of the character-based representations, and no additional resources: just the training corpus and the parser, with no pretrained word or character embeddings. And we tried with and without part-of-speech tags, to see the effect of character-based modeling of words in both settings. Before the results, here is a sketch of the character model.
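A minimal sketch of that word representation, with a character vocabulary and dimensions invented for illustration; the word's embedding is the concatenation of the final hidden states of the two LSTM directions (in the parser, further concatenated with the part-of-speech embedding).

```python
import torch
import torch.nn as nn

CHAR_DIM, HIDDEN = 16, 32
chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

char_emb = nn.Embedding(len(chars), CHAR_DIM)
char_lstm = nn.LSTM(CHAR_DIM, HIDDEN, bidirectional=True)

def char_representation(word):
    # Embed the characters and run the biLSTM over them in both directions.
    ids = torch.tensor([chars[c] for c in word.lower() if c in chars])
    outputs, (h_n, c_n) = char_lstm(char_emb(ids).unsqueeze(1))
    # h_n has shape (2, 1, HIDDEN): the final state of each direction.
    # Their concatenation is the word's character-based embedding, and it
    # is computable for any word, in or out of vocabulary.
    return torch.cat([h_n[0, 0], h_n[1, 0]])

print(char_representation("overhasty").shape)  # torch.Size([64])
```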
Okay, the results. If we look at analytic languages such as English or Chinese, the parser with character-based embeddings is better, but only slightly: 0.3 for English, almost a full point for Chinese. Of course we didn't expect much for English or Chinese, because these are analytic languages in which morphology does not play a big role in the syntax. But if we move to agglutinative languages such as Basque, Hungarian, Korean or Turkish, we see very big improvements: Basque improves by more than six points, Hungarian by around eight, Korean by exactly ten points, and Turkish by more than three. So the parser behaves much better with character-based modeling of words than with atomic word representations. We also tried fusional and templatic languages, and the parser improves on all of them, as you can see in the bolded column. In Polish we saw a very, very big improvement, and the main reason is that the out-of-vocabulary rate there is super high; we observed that character-based modeling of words helps most when the out-of-vocabulary rate is high. In some languages, such as Arabic, the results stay more or less the same.

That was without part-of-speech tags. When we include part-of-speech tags, the picture changes a little. For English and Chinese you get almost the same numbers, or slightly better with plain words. But the big winners are still the agglutinative languages, Turkish, Hungarian, Basque and Korean, the ones where words are built from many morphemes: you still get very big improvements for parsing, and more than two points is a lot. And on average the parser with character-based representations is better than the one with word representations, as you can see here.

The conclusions for this part: character-based representations are useful. They obviate part-of-speech information for some languages; if you compare this column to this column, for Korean, for instance, the parser is actually even better without part-of-speech information, so you are getting better information from the character-based modeling than from the part-of-speech tags. They are most useful for agglutinative languages, the four in our experiments, where you see very big improvements. And they are also a nice approach to the out-of-vocabulary problem we saw in Polish: where the out-of-vocabulary rate is high, the parser gets better. We have an article with all these results submitted to Computational Linguistics; we're still waiting for the reviews.

Okay. So this fall I had the chance to teach a course in natural language processing, and one of the things I set up was to beat the parser. A student who works with me built an ensembling system using the outputs of several models of the stack-LSTM parser. You train several models, and you build a voting system like this one, in which, say, forty parsers think that these two words should be attached together; then you run a graph-based parser, Chu-Liu/Edmonds, over the votes, and the result we got is the best result ever recorded for dependency parsing in this setting. The voting step looks roughly like this.
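A minimal sketch of that ensembling step, with invented parser outputs: each parser contributes one vote per (head, dependent) arc, and the resulting vote counts become edge weights for a maximum-spanning-tree decoder such as Chu-Liu/Edmonds (not implemented here).

```python
from collections import Counter

def vote_edges(parser_outputs):
    """parser_outputs: one list of (head, dependent) arcs per parser.
    Returns a weighted graph, votes[(head, dependent)] = number of parsers
    proposing that arc; Chu-Liu/Edmonds would then extract the tree that
    maximizes the total number of votes."""
    votes = Counter()
    for arcs in parser_outputs:
        votes.update(arcs)
    return votes

# Three hypothetical parsers, disagreeing only on the head of "Mark":
p1 = [("visited", "Watney"), ("Watney", "Mark"), ("visited", "Mars")]
p2 = [("visited", "Watney"), ("Watney", "Mark"), ("visited", "Mars")]
p3 = [("visited", "Watney"), ("visited", "Mark"), ("visited", "Mars")]
print(vote_edges([p1, p2, p3]).most_common())
```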
And we are working on a research paper on this together. We can also do this cross-lingually. [inaudible] was interested in this parser, and we worked on a cross-lingual setup in which we train a language-universal dependency parser on a multilingual collection of treebanks. So instead of training on a single treebank, we train on several treebanks at once, and we also feed the parser multilingual word embeddings and typological information. The nice result of this research is that multilingual training outperforms the standard supervised monolingual training. This work is submitted to TACL; we expect reviews soon.

This is the main table. This column is the monolingual training, in which you take, say, the German treebank, train a model for German, and get a result. The result for English is only 88 because this is the universal dependencies data: you have only around 12,000 sentences, not the size of the Penn treebank. And the nice result is here: if you compare these columns, the multilingual training is better than the monolingual training. These columns are the language-universal parser, trained on the other languages and evaluated on German, and here is what happens when you add lexical embeddings for the target language, the language ID information, and predicted part-of-speech tags; you can see the parser gets better. The implications of this are significant, because if you want to parse low-resource languages for which you have no treebank data, you can use this model, and it is expected to be better than what monolingual training could give you. And we compared with the state-of-the-art systems in the same setting, with no data for the target language, and we are better than the state-of-the-art systems on these sets.

>>: I'm sorry, you're saying you don't have any data, so then --

>> Miguel Ballesteros: Take Swedish, say: you remove all the Swedish sentences from your training corpus, you train on all the other languages, and you evaluate on the Swedish test set.

>>: So when you say plus lexical, what does that mean?

>> Miguel Ballesteros: It means that you use explicit word embeddings for the target language, trained for that language.

>>: So trained on monolingual data.

>> Miguel Ballesteros: Yes, on monolingual, unlabeled data, which you can obtain for any language, or for several.

>>: I'm sorry, can you go back one slide? When you supply the language ID, what is that?

>> Miguel Ballesteros: The language ID comes with typological information, so it tells you [inaudible]. Basically it tells you how the syntax behaves in each language: whether it is left-branching, right-branching --

>>: How is that given to the model?

>> Miguel Ballesteros: Sorry?

>>: How do you give it [inaudible] --

>> Miguel Ballesteros: It's basically a signal for the -- you remember the stack-LSTMs?
So you get a signal, another embedding, for the language ID.

>>: So you have a set of typological features [inaudible] so on and so forth [inaudible] so you have these K different features --

>> Miguel Ballesteros: And you input an embedding of this to the parser, and the parser makes better predictions, as you can see. It improves for all languages, and for some languages you see a big improvement. You go to German, you include the typological information, which is something [inaudible]; when we didn't have that information the parser [inaudible], and when we included it, the parser started to do better. Sorry, I think it's [inaudible]. Okay.

So that is all the parsing material. As I promised at the beginning, I'm going to talk about something other than parsing. The question is whether we can model linguistic structure prediction problems other than parsing as sequences of actions. I hope I have convinced you that we can do it for parsing, so let's try other problems. There are two sub-questions: whether we can model another task jointly with syntactic parsing, and whether we can model a particular problem with its own set of actions. I'm going to give an example of each.

First, briefly, joint syntactic and semantic parsing with stack-LSTMs; this is another paper submitted to TACL. We have a joint model for syntax and semantic role labeling using stack-LSTMs. Instead of one stack we have two, one that keeps the syntactic information and another that keeps the semantic information: in one of them we keep the syntactic arcs, and in the other the semantic roles. And we got state-of-the-art joint parsing and semantic role labeling from this. We compare with the previous systems from the CoNLL 2008 and 2009 shared tasks; the closest published model is James Henderson's, which uses essentially the same algorithm but with incremental neural networks rather than stack-LSTMs. We improve over it, especially on the semantic task.

Okay. And now I'm going to talk about a paper that got accepted yesterday at NAACL 2016, called Neural Architectures for Named Entity Recognition. One of the models in that paper is a transition-based NER system with stack-LSTMs; these are the two main collaborators on the paper. Basically, we have an algorithm that constructs and labels chunks of the input sentence. It's a transition-based approach again, in the same style as the dependency parser. In this case we have three data structures: a buffer, a stack, and an output buffer that keeps what has already been decided. Let me give you an example of how this works with the same running example, Mark Watney, who is probably tired of being on Mars. You start with a buffer full of words, an empty stack, and an empty output buffer. The system decides to make a shift, taking a word from the buffer and putting it on the stack.
Now we can decide that Mark is already a complete entity and send it to the output buffer, or that it will form a chunk with something else, in which case we shift another word onto the stack. Again we can then decide to merge the two words on the stack into a chunk or to shift yet another word. In this case the system decides to merge them with what we call a reduce action, taking Mark and Watney and creating a single chunk; I'll explain in a moment how we squeeze this chunk into an embedding in a way that improves the results. Of course, when you reach a verb like visited, it is not part of an entity, so we can get rid of it directly with an out transition, which moves it straight to the output buffer. Then you shift Mars onto the stack, and since there is nothing left, you decide the best entity label for it, in this case location. And that gives you the result for this sentence.

The main motivation for this research is that most NER systems use external resources, as those of you who have worked on NER probably know: they look things up in gazetteers and external databases. We could do that too, we could feed that into our model, but we don't want to, because we also want this to work in low-resource scenarios. So the question is whether we can perform at the same level or even better without including any external resources such as gazetteers or databases with information about the entities in the world. We ran experiments on the CoNLL 2002 and 2003 datasets for English, German, Dutch and Spanish, and we used only word-form features; we didn't even use part-of-speech tags, to keep the approach cross-lingual, so you can apply it to any language in the same way.

This is what the system looks like, and as you can see it is similar to the parser: you have a stack-LSTM for the buffer, a stack-LSTM for the output buffer, a stack-LSTM for the stack, and the history of previous actions, and then a softmax over the actions, and the system decides what to do given the system state.

In order to represent words in a nice way, we did character-based modeling of words again. We thought character-based modeling would also be very useful for NER, because, for example, capital letters at the beginning of words are very informative; all these things are very, very useful. We also have pretrained word embeddings in a lookup table, the same embeddings we used in the parsing work. So for each word we run a bidirectional LSTM over its characters; for the word Mars, for instance, we get this. Then we look up the embedding of Mars in the lookup table, and we concatenate the two vectors. That is our word embedding for this system.

And, again, we have a composition function. Whenever we reduce, that is, take a chunk from the stack and move it to the output buffer, we have to squeeze its content into some kind of embedding. What we do is run a bidirectional LSTM over the embeddings of the chunk's tokens: for Mark Watney, we run a bidirectional LSTM over those two tokens together, and we also include the label, in this case person or location or whatever. The resulting representation is what we call the composition function, computed with bidirectional LSTMs. The transition loop itself looks roughly like this.
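A minimal sketch of the NER transitions, assuming plain lists and a hard-coded action sequence in place of the learned policy; in the real model the reduced chunk also goes through the biLSTM composition just described rather than staying a tuple of strings.

```python
def shift(stack, buffer, output):
    stack.append(buffer.pop(0))              # word may join a growing chunk

def out(stack, buffer, output):
    output.append((buffer.pop(0), "O"))      # next word is outside any entity

def reduce_chunk(label):
    def action(stack, buffer, output):
        chunk = tuple(stack)                 # the whole stack becomes one chunk
        stack.clear()                        # (the model squeezes it into an
        output.append((chunk, label))        #  embedding with a biLSTM)
    return action

# Replaying the running example with a hard-coded action sequence:
stack, buffer, output = [], "Mark Watney visited Mars".split(), []
for action in [shift, shift, reduce_chunk("PER"),
               out, shift, reduce_chunk("LOC")]:
    action(stack, buffer, output)
print(output)
# [(('Mark', 'Watney'), 'PER'), ('visited', 'O'), (('Mars',), 'LOC')]
```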
Okay. So these are the English NER results. This row is the stack-LSTM system; this is another system we presented in the same paper, the LSTM-CRF, which is actually a little better but at the same level. All the systems marked with a star incorporate external resources: gazetteers, part-of-speech tags, entity linking, et cetera. Our system uses only word forms, nothing else, and as you can see the results are still very, very competitive; this is one of the best results published for NER. The picture is the same for the other languages. For Spanish, the stack-LSTM system gets almost 84, the best result ever published, and the LSTM-CRF, the bidirectional-LSTM system from the same paper, is again a little better. For Dutch you have the same picture: these are the systems with external resources, and compared with the rest we have the best NER system using only word forms. And for German, again the same picture. And, as we saw in parsing, the character-based embeddings are extremely useful; they have a big impact on the results, especially for the stack-LSTM model.

So this is a state-of-the-art NER system. It is linear in the length of the sentence, as fast as the parser, so we can run it very quickly and produce very good results. Again, character-based representations, as with the dependency parser, provide very useful information for NER, for instance for out-of-vocabulary words. And we use only word-form features; we are not even including part-of-speech tags in the model. We could have, and when we include them we improve, but that was not the goal of the system. And there are no gazetteers and no external resources.

Now I'm going to talk briefly about another paper accepted at NAACL 2016, on what we call recurrent neural network grammars. This is a top-down phrase-structure parser, not a dependency parser, with language modeling results, and it again uses stack-LSTMs. The nice observation is that the stack-LSTM machinery for syntactic parsing that I showed you before is a very powerful discriminative model of language: you can infer syntax very nicely with it. And syntax, as you know, is very useful for generating language. So the idea is that we can use the same components we have in the stack-LSTM parser to create a generative model that can be evaluated as a language model. Think about it: whenever we would shift a word from the buffer to the stack, we can instead generate the word, predicting it conditioned on everything decided so far. We do this in the context of phrase-structure parsing, and we call the result RNNG, recurrent neural network grammars. It works more or less like this. It is a top-down parser, not a shift-reduce parser that builds the structure bottom-up. In the discriminative version you have a buffer and you shift words from it into the stack, opening and closing nonterminals as you go; at the end you have a phrase-structure tree of the sentence. The nice thing is that in the generative version, whenever you would shift, you instead predict the word, and then you can evaluate the whole model as a language model. Here is roughly what the generative action sequence looks like.
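A minimal sketch of a generative top-down derivation in the RNNG style, with a uniform toy word distribution standing in for the model's softmax; the real model conditions each choice of NT(X), GEN(w), and REDUCE on stack-LSTM encodings of the partial tree and action history, and the bracketing of the example sentence here is illustrative.

```python
import math

VOCAB = ["Mark", "Watney", "visited", "Mars"]

def gen_prob(word, stack):
    # Toy stand-in: the real model scores the next word with a softmax
    # conditioned on stack-LSTM encodings of the partial derivation.
    return 1.0 / len(VOCAB)

# Hard-coded derivation of (S (NP Mark Watney) (VP visited (NP Mars))):
actions = [("NT", "S"), ("NT", "NP"), ("GEN", "Mark"), ("GEN", "Watney"),
           ("REDUCE", None), ("NT", "VP"), ("GEN", "visited"), ("NT", "NP"),
           ("GEN", "Mars"), ("REDUCE", None), ("REDUCE", None),
           ("REDUCE", None)]

stack, logprob = [], 0.0
for act, arg in actions:
    if act == "NT":                        # open a nonterminal on the stack
        stack.append(("open", arg))
    elif act == "GEN":                     # generate a word instead of shifting
        stack.append(arg)
        logprob += math.log(gen_prob(arg, stack))
    elif act == "REDUCE":                  # close the most recent open bracket
        children = []
        while not isinstance(stack[-1], tuple):
            children.append(stack.pop())
        label = stack.pop()[1]
        stack.append("(%s %s)" % (label, " ".join(reversed(children))))

print(stack[0])                            # the completed tree
print("log p(sentence, tree) =", logprob)  # a joint model: a language model
```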
These are the language modeling perplexity results for the English Penn treebank. Against a sequential LSTM baseline, which conditions only on the previous words, we get better results by including the syntactic information of the stack-LSTMs, for both English and Chinese; you can see the improvement here and here. And the nice thing is that we can also evaluate the model as a parser, because it is a parser. On F1 for the English Penn treebank we get 92.4 without external resources, which is state of the art, one of the best results ever published for this treebank. For Chinese it is close but a little worse, around 83.2 or so, but still a very, very competitive result, considering again that it's greedy, left to right, and needs no external resources.

And this leads to ongoing work we are just starting: a transition-based approach to machine translation. As I said at the beginning, with a stack-based parser and a swap operation you can produce reorderings. So you have a buffer -- yes, you can.

>>: You can do [inaudible] reordering.

>> Miguel Ballesteros: You can do any reordering.

>>: We'll have to talk about that.

>> Miguel Ballesteros: I can show you. Well, it's hard because the action sequences can grow, but you can produce most of the reorderings. For some language pairs, such as Japanese-English, it's going to be more difficult.

>>: But the restrictions -- yeah, anyway, this is a totally minor detail. Please go on.

>> Miguel Ballesteros: Sure, sure, no problem. Anyway, we have started thinking about this, looking at the results of our RNNGs, where the language modeling results are very nice, and at transition-based non-projective parsing, where we swap words and do reordering. What if, instead of shifting a word from the source language, we generate a word in the target language, making a prediction of the target word? We are thinking about this, and I'm helping supervise a student on it. The final goal is a complete machine translation system that runs in linear time. It's a very big project, a very big goal, but we are trying, and I believe it's possible. It might also be interesting to do it jointly with parsing, so that whenever we translate we also produce a syntactic parse, since those embeddings could be useful for translation; I do believe syntax is useful for translation too. So this is something we are trying to do right now. So in conclusion -- yes, sorry.

>>: Sorry, you were saying about linear time, I mean, the Montreal LSTM model also runs in linear time, doesn't it? Maybe not with a very small constant, but it's also linear, right?

>> Miguel Ballesteros: Yes, yes, it is. Definitely. In this case the idea would be to do everything in [inaudible].

>>: All right. Yeah, but that's sort of what the Montreal [inaudible].

>> Miguel Ballesteros: Okay. Okay. Okay, so, the conclusions for this talk.
I presented powerful sequential models for natural language processing, all of them based on transition-based approaches with stack-LSTMs. I presented a state-of-the-art transition-based parser which is getting a lot of attention from the community; it is good both in results and in speed, because it runs in linear time and produces one of the best results ever published. I presented a fresh, state-of-the-art transition-based NER system that produces very strong results without using external resources and that can be extended to many languages and many tasks using character-based embeddings of words. I showed how we can do extensions for language modeling using a shift-reduce approach in which, instead of shifting the word from the buffer, you generate the word, and then use the model as a language model; and we produce state-of-the-art language modeling results compared with a powerful LSTM baseline. And I showed extensions for semantic role labeling, doing joint syntax and semantic role labeling with stack-LSTMs. I would like to acknowledge Noah Smith and Chris Dyer, who are co-authors on most of these papers, and my other collaborators on these papers. And I would like to thank all of you for your attention. Any questions?

[applause]

>> Miguel Ballesteros: No more questions? Yes.

>>: In one of these many papers you had an output, an input and a stack, and it sounded like you said you used a stack-LSTM for each of them, as opposed to --

>> Miguel Ballesteros: So the parser, you mean? So, for instance, this?

>>: Yes. I'm just curious.

>> Miguel Ballesteros: Yes, each of them is a stack-LSTM. In the case of NER, for the buffer you always push, push, push: you push all the words at the beginning, and then you start popping things. Whenever you shift something from the buffer to the stack, you pop it from the buffer.

>>: I see.

>> Miguel Ballesteros: And in the case of transition-based parsing, let me go back, it's the same: we fill the buffer with all the words at the beginning, and then we pop whenever we shift. But if you do non-projective parsing with the swap operation, you take a word from the stack and you put it back in the buffer, so you also push into the buffer during parsing; that is why the buffer has to be a stack-LSTM. The action history, on the other hand, can be modeled with a plain LSTM, because you are always pushing and never popping; it doesn't need to be a stack-LSTM.

>>: Great. Thank you.

>> Miguel Ballesteros: Okay?

>> Chris Quirk: Thanks, everyone.

>> Miguel Ballesteros: Thank you.