>> Miguel Ballesteros: Yep

advertisement
>> Miguel Ballesteros: Yep
>> Chris Quirk: It’s my pleasure to introduce Miguel Ballesteros who’s visiting. He’s currently doing a
one year visiting stint at Carnegie Mellon working with Chris Dyer and Noah Smith. On a variety of
interesting topics including some of these deep learning based dependency parsing approaches. His
permanent position is back in Barcelona which is a beautiful city. We’ll have to go visit him sometime.
Please, Miguel.
>> Miguel Ballesteros: Okay, thank you so much Chris for introduction. Today I’m going to talk about
Greedy Transition-Based Dependency Parsing with Stack-LSTMS. This is our transition-based parser that
we presented in ACL in an experiment that we have in EMNLP with Character-based representations.
This is the outline of the presentation. I’m going to start introducing you to Transition-based
dependency parsing for those of you that are not familiar with the topic. After that I’m going to talk
about Recurrent Neural Networks, LSTMs, and Stack-LSTMs.
After that I’m going to explain how we can do Transition-based parsing with these Stack-LSTMs that we
presented in ACL. After that I’m going to explain different approaches to word embeddings, especially
character-based presentations. I will try to motivate you to use them for out of vocabulary words. After
that we’ll talk about future and the things that we are doing right now.
This is Dependency Parsing. What you, the task here is to find a dependency tree on top of a sentence
which is labeled directed tree. We have a set of nodes labeled with words. A set of arcs labeled with
dependency types. You want to find for example the subject of a sentence you can find it like doing this
kind of task.
In Transition-based dependency parsing is a way of solving this problem by doing sequential algorithm.
It processes the sentence from left to right. You have a stack so basically a [indiscernible] or something
like this. You have a stack and you have a buffer.
In each step the parsing algorithm has to select one operation by means of a classifier. Operations
[indiscernible] to make a SHIFT which is to take a word from the buffer and put it on the stack, or to
make reduce action which is to make an ARC from LEFT to RIGHT or from RIGHT to LEFT.
This kind of parsing algorithm is very efficient because it has a linear number of operations in the length
of a sentence. This is why it’s very attractive. It’s becoming more attractive right now in the last, after
our language processing conference.
Let me give you an example of how this works. Imagine you want to parse a sentence, They told him a
story. This will be the initialization where you have a stack empty and a Buffer full of words. The
classifier here the best thing it can do, or the only thing it can do right now is make a SHIFT action. Take
the word They and put it in the stack. Then the std will make is another SHIFT action which is to take the
word told and put it in the stack. Whenever you have two words on the top of the stack you can either
make another SHIFT action or you can create an arc from left to right, or from right to left. In this case
the classifier will find this object of the sentence. It will continue the process and whenever you do an
arc the word that gets the arc it will be removed from the stack.
Then you can continue here making another SHIFT action. Then you will find the [indiscernible] object
of the sentence and it will continue here making another SHIFT. Then another SHIFT in this case you
have three words in the stack. Then we will find the determiner. Then you will find the [indiscernible]
object. At the end you will push the punctuation symbol and you will finish the parsing process with the
final dependency tree of the sentence.
This is how you parse a sentence. You can see the number of operations that you have, I think you have
here. Like ten actions and six words, so it’s linear in the length of the sentence. This is why it’s very
attractive for the parsing because it’s fast.
Okay, so Transition-based parsing has been studied in very different perspectives. People have started
doing like feature engineering Beam-search Dynamic oracles as a way of studying parsing in, from the
[indiscernible] perspective. Trying to get better results by improving how the parser learns on
information that it gets.
Also some people have studied this problem by doing different parsing algorithms, like coming up with
ideas for how to parse [indiscernible] trees, or how to be faster, or how to parse a [indiscernible] tree
with different algorithms.
Right now there is like a trend in the ACL conferences of doing Continuous-state parsing with neural
networks. This was started by Ian Titov and [indiscernible] Henderson back in two thousand and seven.
But in EMNLP two thousand fourteen there was this paper by [indiscernible] Chen and Chris Manning
which is a Transition-based parsing with neural networks. After that in ACL and EMNLP there were like a
lot of papers talking about this topic, and trying to improve these numbers. The two papers I’m
presenting today is the one that we have in ACL and the one that we have in EMNLP.
Okay, so let me explain you a little bit about Recurrent neural networks. These recurrent neural
networks are very good learning to extract complex patterns in sequences. They are very good for data
which is ordered and they have some kind of long distance dependencies in the sequence. They are also
very good remembering information over long distances as I said.
The problem with these recurrent neural networks is that they have some kind of bias toward
representing the most recent information that is given to these networks. There was idea of doing long
short-term memories of LSTMs which is like a variant of these recurrent neural networks. In which you
have an extra memory cell and some kind of gates that gives the gradient sometimes. Sometimes you
forget it and in this way you can remember the information from the past. But they still model
sequences in a left to right order which is what we do in Transition-based parsing. They are like a very
good fit for this.
What we represented in ACL is [indiscernible] learning model which is called Stack-LSTMs, which is like
an augmentation of LSTMs with a stack pointer and two constant-time operations. These operations are
the one that we have in computer science for any stack. One is to do a push to make a new input and
the other one is to make a pop which is the new thing that we have in LSTMs. Basically, you move the
stack pointer back.
The intuition here is that the summary of the stack contents or the encoding of the [indiscernible] that
you’re learning is given by the position of the stack pointer. Let me give you an example. This is like a
computation graph of an LSTM which is doesn’t have on a stack pointer. Whenever you make a new
input, so the first, this will be like the input layer. This will be like the hidden layers of memory cells.
This will be like the output layer.
Okay, so whenever you make a new input you create this kind of computation graph. Then
[indiscernible] you get the encoding vector of the layer sequence, okay. But with the Stack-LSTM
operations now we’re going to create a new input. You will do a push operation. Then you will move
the stack pointer which is this blue arrow that you have on top, okay.
But now you want to make a pop operation and you can do this. Then when it accesses output will be
this one so you forget about this. This is very good fit for Transition-based parsing because what you are
doing all the time is to make push and pop action. Yeah?
>>: Where is the stack in this case, the stack?
>> Miguel Ballesteros: Where is the stack?
>>: Yeah.
>> Miguel Ballesteros: The stack pointer is given by this stack pointer. The stack is given by stack
pointer. Now if I want to access the output of the LSTM at this point I will access where the stack
pointer is.
>>: Basically, you assume that the whole output is a stack. That a pointer has access of which…
>> Miguel Ballesteros: The pointer tells me where to start. [indiscernible] going back and get encoding.
>>: Okay.
>> Miguel Ballesteros: Also, look at how it goes. Now if I make a push action then you will get this kind
of structure in which the stack pointer points here. Then I have this kind of memory here. If now I make
another pop action then it will go back to this. It’s actually a very simple trick in which you can get like a
stack. You can actually point to the information that you want and the information that you have in the
stack, which is why we have in a Transition-based parser.
>>: Yes, that’s a memory network for Facebook I think that this new one can every memory work from
this. They do this kind of operation. I thought they do this one.
>> Miguel Ballesteros: I think there’s some people have this kind of…
>>: What’s the difference between the one that you’re doing that and the one they are doing?
>> Miguel Ballesteros: I don’t know.
>>: [indiscernible]
>> Miguel Ballesteros: Yeah?
>>: Now, the way you show that there’s only a single blue arrow which is the current top of the stack.
>> Miguel Ballesteros: Yes.
>>: But actually is it the case that there’s an underlying actually stack underneath?
>> Miguel Ballesteros: You have this same computation graph.
>>: Yes.
>> Miguel Ballesteros: You get the, now so for instance whenever you get the access output right now.
You will go through the arrows that these are how many arrows that you have.
>>: Right, but there’s actively a faint blue arrow that tells me when I pop I would go back to Y zero?
>> Miguel Ballesteros: Yes.
>>: If I pushed again then I would need and even fainter blue arrow off in the distance that would be
the result of two pops, right?
>> Miguel Ballesteros: Yes.
>>: Unlike the neural networks that neural memory networks you can’t jump arbitrarily through the
state. You can only move through those…
>> Miguel Ballesteros: You can only move like in a stack.
>>: Yes, you can only…
>> Miguel Ballesteros: You can only do a stack action.
>>: Right.
>> Miguel Ballesteros: Because well this is what we need because we are using a stack in Transitionbased parser.
>>: Right.
>> Miguel Ballesteros: We don’t need any fancy, well we don’t need more than that.
>>: But there’s a little bit more state than that the blue arrows only the top of the stack not the whole
stack.
>> Miguel Ballesteros: Yes, yes.
>>: Right, good, good.
>> Miguel Ballesteros: Yeah.
>>: This is the generic method not only for dependency tree parsing…
>> Miguel Ballesteros: Well, we represented this middle and then we tried dependency parsing. But
you could use it for anything.
>>: Okay.
>> Miguel Ballesteros: In which you need a stack of course.
>>: Okay.
>>: Output it can stop a layer, I mean this Y two is actually dependent on a hidden layer. It seems that
my understanding of this stack is actually the hidden, the history of hidden state, that’s the stack. Then
we output from this stack it’s through this upper layer. Then actually when you like push and pop
actually it’s the position on the hidden state that is changed.
>> Miguel Ballesteros: Well, not only the hidden state also the output layers is what you get. It’s the
stack point that is pointing to the output that you want, at this point.
>>: What is push actually?
>> Miguel Ballesteros: Push action is to add something to your network.
>>: [indiscernible]
>> Miguel Ballesteros: You, I don’t know you have this and then you do a push, then you add
something, this which wasn’t here before.
>>: Result of the hidden layer or being pushed from the upper?
>> Miguel Ballesteros: You have a new input so…
>>: [indiscernible]…
>> Miguel Ballesteros: Imagine that you want to do a recurrent neural network in the character so far of
our work, right.
>>: Yeah.
>> Miguel Ballesteros: Any push action would be to add. I don’t know if I wanted to process a sentence
like the word, word. The first input would be the W. Then the second one would be O. The third one
would be R.
>>: Okay.
>> Miguel Ballesteros: They would do push, push, push with these words.
>>: Okay.
>> Miguel Ballesteros: Okay, does that make sense?
>>: Are you going to do into more detail about the. How much more detail are you going to go into
because I still very confused about what? Like actually what these, because are these hard operations
or are these soft operations that are implemented as part of the learning process? Are you, is the
supervision signal telling you push this and pop this at this time?
>>: That is what I was wondering.
>>: Or is it learning all of that?
>> Miguel Ballesteros: Let me see, so when you have a Transition-based parser, right. You so push and
pop actions on your stack and buffer.
>>: What do you mean you do that? Does the supervised data tell you exactly when to push and pop
or…
>> Miguel Ballesteros: I’m going to explain that one in the next slide so…
>>: You consider the traditional LSTM as always push?
>> Miguel Ballesteros: Yes, the LSTM always push which is what I presented as before.
>>: Okay.
>> Miguel Ballesteros: Then if you, well you add input for whatever. You can call it push or however,
but you want an input.
>>: Yes, so with that understanding when would I pop? That was the, so when would that pop? I still
try to understand what is actually push and what is pop? You…
>> Miguel Ballesteros: The push is adding an input.
>>: Yeah.
>> Miguel Ballesteros: Pop is moving the stack pointer like going back to in the memory.
>>: [indiscernible]…
>> Miguel Ballesteros: Go back to the previous step.
>>: Memory is hidden state.
>> Miguel Ballesteros: No it’s not hidden state.
>>: [indiscernible]
>>: [indiscernible]
>>: If you go into more detail then.
>> Miguel Ballesteros: Yeah.
>>: You can just…
>> Miguel Ballesteros: Yeah, okay.
>>: There’s a way…
>> Miguel Ballesteros: Okay, let me see if you understand this how it works. This is how, basically how
it works. Now in Transition-based parsing we have the buffer as [indiscernible]. At the beginning we
have a stack. We also have a list with the history of actions, right. These three list or these stacks are
associated with a Stack-LSTM which is what I have presented before. That provides an encoding of the
contents.
Okay, so this is how it looks. We have this Stack-LSTM here for the stack. Then we have the Stack-LSTM
for the buffer. Then we have the Stack-LSTM for the list of actions, alright. Now, with these three
encodings we make a prediction which is then this action. Imagine that you want to do a SHIFT which
would be to take the word was and put it in the stack. Then we will make a push in this stack and a pop
in this stack.
>>: That’s actually distribution over actions.
>> Miguel Ballesteros: Do you understand how a Transition-based parser works? Then I will continue
again. Let me see, so this is the buffer, right. This is the stack. Then you make a SHIFT action. Then you
take a word from the buffer and you put it in the stack, right.
I do this the same stuff with a Stack-LSTM. Any input in my Stack-LSTM is a word in my buffer and my
stack in my parser. Now, when I do push and pop I am doing the same kind of operation.
>>: Well then one thing [indiscernible] is that we’re familiar with like attention based models where you
have a soft. You’re making a soft decision over a list of actions. Like which source [indiscernible] should
I pay attention to? Is this making a soft decision or is it making a hard decision?
>> Miguel Ballesteros: I think it’s make a hard decision.
>>: Hard decision. It’s hard.
>>: Okay.
>>: Because a supervision signal that tells you for this tree here’s a set of operations in the arc standard
arc [indiscernible] or whatever model you’re going to use. That would allow you to reconstruct that
tree.
>> Miguel Ballesteros: Yeah, of course and following an algorithm, which is the arc standard algorithm,
which is this one.
>>: [indiscernible]…
>> Miguel Ballesteros: Whenever I do a SHIFT or a use which is to make an arc I’m doing pop and push
actions in my Stack-LSTM, okay.
>>: Is it also correct to say that given a labeled dependency tree you can deterministically get a set of
actions?
>> Miguel Ballesteros: Actually it is how you train this model.
>>: Yeah.
>> Miguel Ballesteros: You add the label…
>>: There’s exact…
>>: Exactly one.
>> Miguel Ballesteros: That’s fantastic.
>>: Okay.
>> Miguel Ballesteros: Well it depends on the parsing algorithm. In arc standard you’ll get one or
maybe a little bit more than that. But there is a way of doing that, locating more than that which is
dynamic oracles. I’m presenting at the end how to do dynamic oracles.
But you have, yeah, you have like, you could have several ways of getting to the old tree. But let’s
assume that in this case let’s assume that you have only one. Let’s assume that you have a dependency
tree and you get a whole set of actions. You are learning this set, this transition which is basically this,
right, these set of actions leads you to the gold tree. This is why you learn.
>>: Then, at test time you’re predicting. You’re doing two predictions, one that says what should I out
of all the words I have left which one…
>> Miguel Ballesteros: What I’m placing I’m placing to the next action.
>>: Okay.
>> Miguel Ballesteros: Even, so imagine that you are here.
>>: You always know it’s going to be the last [indiscernible] word in the buffer?
>> Miguel Ballesteros: Yes.
>>: You just predict, you said to predict some…
>> Miguel Ballesteros: Given that you are here this is my stack. This is my buffer. I predict the next
action.
>>: Okay.
>> Miguel Ballesteros: My classifier, in this case my LSTM predicts the next action given the parsing
state. The parsing state is given by the stack, the buffer, and the list of actions, the history.
>>: The action is SHIFT plus the number of dependency types times two?
>>: This is unlabeled here.
>> Miguel Ballesteros: Yes, that’s it, perfect.
>>: Oh.
>> Miguel Ballesteros: No it’s not labeled, it’s labeled. You have SHIFT plus the list of actions times two,
yeah.
>>: Perfect.
>> Miguel Ballesteros: You can also have a swap operation which is I do include here which is to parse
an operated tree. But I think it’s going to be more complicated. But I use, you get the basics it’s going to
be, yeah?
>>: Each of the column [indiscernible] tree it is stored in [indiscernible] Stack-LSTMs, each of them has
LSTMs?
>> Miguel Ballesteros: Yes, so the stack, the buffer, the LSTMs. We also have a list, we get history of
action. We have like another LSTM with history.
>>: I see.
>> Miguel Ballesteros: In this case it’s not a stack because you are only pushing things there. But yeah
it’s the same signal but you are not pushing anything. Okay, make sense how it works. Okay, so this is
why I said, basically, when you have a recurrent neural network you’re always adding things on this
thing. Then whenever you have this stack pointer you are basically doing the same things that you do in
a Transition-based parser which is to move the stack pointer back. Getting the important information,
so the top of the stack is the most important information in order to get the next action even the
parsing state.
This is I think this is the whole picture. You have the stack which is a Stack-LSTM. You have a buffer
which is a Stack-LSTM and you have this history of actions. By using these three encoded Stack-LSTMs
you make the prediction by using a rectifier linear unit over this parsing state, okay.
We also have, well this is greedy decoding. We basically get any action and we stick to it. We can make
any mistakes we don’t have any back tracking in this parser. But still we have very high results. We also
have the non-projectivity with the SWAP operation.
Following your question we have the C plus all the possible actions for left arc and right arcs. But we
also have a SWAP operation. Whenever you make, whenever you break the linear order you make a
SWAP which is to swap the two words on top of the stack. This is something that it was presented by
[indiscernible] in two thousand and nine. We just re-implemented that in the parser. We’re not going
to do, go into the details but if you want I can give you the details.
Okay, so this is the parsing algorithm. It’s arc-standard plus SWAP. Then the operations as you said is
use right and the label which is this thing here, and reduce left and the label. Then you have also SHIFT
and SWAP.
Okay, so let’s go into the details about how the parser works. One of the things that is really important
in Transition-based parsing are partially built dependency structures. Whenever you make an arc you
create a partially built dependency tree, right.
This information is really at eleven. At the top of the stack instead of having words we also have like
partial trees. These partial trees are represented with composition functions, which is very similar to
[indiscernible] two thousand fifteen parser. These are representation of a dependency tree is computed
by having this composition function which is basically running the [indiscernible] function with a head
dependency relation plus some kind of bias. We get a vector or embedding of the partially built
dependency tree.
For instance we have another [indiscernible] decision we get this guy [indiscernible] representation in
which you get the composition of, and with over [indiscernible] and then over [indiscernible] with
decision. Yeah?
>>: Okay and so this is mostly going to be pre-determined except when you have children on both the
left and the right. I guess exactly when the arc-standard is ambiguous or the oracle might be. You just
follow whatever this extra set of actions are to do this commonality?
>> Miguel Ballesteros: Yeah, you follow the actions and this way you are getting the composition.
>>: Got you, yes.
>> Miguel Ballesteros: Yeah, okay and another thing that is very important is the word embedding. This
parser realize heavily on word embeddings using a neural net parser. The word embeddings here, what
we have in this parser is a variant of the skip n-gram model introduced by [indiscernible] Ling et al. and
Chris Dyer in two thousand fifteen. This is an [indiscernible] paper.
This word embedding where it customized with syntactic representation they used them to improve the
Chen and Manning parser, so basically, took them for our parser. They use a different set of parameters
which are used to predict each context word depending on its position. Just like I, in which model are
word embeddings. By doing that we get a representation for each word represented in a lot of
unlabelled data. We use that as an input to the parser.
To represent each input word in our parser we have a like a concatenation of three from vectors. We
have learned vector representation which is in the training set, right, for each word type. Whenever we
have an out of vocabulary word we have a fixed representation for a non-token, okay.
Then we have the word embedding which is a fixed vector representation from a neural language model
which is what I said in the previous slide. Then we also have a learned representation of the part of
speech tag. You have an out of vocabulary word you will get this kind of concatenation of three
different vectors. You have an in vocabulary word you have the concatenation of these tree different
vectors. This is how we represent each word in the parsing process.
Okay, so the English experiments. Sorry, initial experiments we did the experiment with English with the
Stanford Dependencies. We have predicted the part of the speech tags with Stanford tagger with
ninety-seven point three accuracy in the, with a [indiscernible] for cross experiment.
We also did experiments with the Chinese treebank five point one following a [indiscernible] Zhang and
Steven Clark in two thousand and eight. In this case we used a gold part of speech tags. These setting
are actually the same setting as was presented by [indiscernible] Chen and Chris Manning. We wanted
to basically compare to [indiscernible].
Alright, so these are the experiments in English. These are the result by Chen and Manning in two
thousand fourteen. They get a label that’s been scored. Are you familiar with what this is? A label
attaching the score is the percentage rate of the score in tokens that has a [indiscernible] head in the
tree. A label attaching the score is the percentage of the score in token that has a [indiscernible] head
and label. You have [indiscernible] both the head and the label.
They have ninety-one point eight for a label that’s been scored. This is the result of our Stack-LSTM
parser which is ninety-three point one which is actually one of the better results. Then I’m going to
show you like the [indiscernible] condition in which we remove the word embeddings. Or we remove
the part of speech tags. Or we remove the composition functions and see well how they are actually
getting good improvements in the parser.
If you remove the part of speech tag instead of having three vectors you have a concatenation of two
different vectors for the word embedding in which you don’t have a part of speech tags, okay. In this
case you remove them for English you get ninety-two point seven. It’s actually not that bad. You’ll get
only zero point four less than what you get before.
You get the, you remove the pretraining word embeddings that [indiscernible] vectors. In this case
whenever you have an out of vocabulary word your representation will be basically the part of speech of
the word, in this case by doing pretraining. Without pretraining without using a label later you’ll get
ninety-two point four which still is a very competitive result. But is not as good as you can get with
pretraining.
Then if we remove the LSTM, so we remove the memory cell. Basically we have a stack recurrent neural
network which is not LSTM anymore. In this case we get ninety-two point three. We see that the LSTMs
are actually very useful in this case. They are providing much better results. But this Stack-RNN is still a
Stack-RNN. It’s like you have basically removed the LSTM.
Then if you remove the composition function as I said before is actually very useful information because
it gives you the information that well you have found already the object or the subject of the sentence.
If you do that the parser goes on to ninety-two point two. But you still of course get, you still have the
list of history of actions. This is why it doesn’t go down a lot, okay.
>>: Is the last one that’s setting went to like what the [indiscernible] using LSTM for this is where you
just basically take the…
>> Miguel Ballesteros: No, the last one will be whenever you make an arc. What we do is to push into
the stack this kind of composition, of composition that represented before which was a [indiscernible].
You don’t do that you just push the word instead of doing the composition you just have the word.
>>: Okay.
>> Miguel Ballesteros: In this case you get lower results but still competitive numbers.
>>: [indiscernible] of training the embedding together with everything.
>> Miguel Ballesteros: Of training embedding this is presented in another paper which is not me. We
used embeddings as an input, so this is not…
>>: Okay, so presumably you have thoughts about [indiscernible] training?
>> Miguel Ballesteros: They have a lot of data so this will be the reason why we…
>>: Oh, makes more sense of training with the embedding vector as well.
>> Miguel Ballesteros: Yeah, they have a lot data and they do experiments in language modeling and
dependency parsing, but with a different parser. When we…
>>: [indiscernible] by pretraining style by learning the embedding vector together with the LSTM?
>> Miguel Ballesteros: They do embeddings training in order to get like language modeling. Then they
use these embeddings which are useful for these language model, with the objective function of getting
good language modeling. Then they use them to do parsing and also language modeling, and part of
speech tagging.
>>: In your case you just copy…
>> Miguel Ballesteros: In our case what we did is to use these embeddings and put it in the parser. We
don’t have [indiscernible] results of the training.
>>: Okay.
>>: Just a quick clarification question. I thought you had two embeddings. One that was learned along
with all the other parameters of the LSTM on all those words that are present in the training data.
>>: Yeah, for…
>>: Another embedding that’s concatenated next to it which is trained on a larger set of data for which
you do not concatenate the dependency parsers on.
>>: Oh, I see, okay.
>>: In that way you can leverage a word representation. Is that it?
>> Miguel Ballesteros: Is that this?
>>: Yes…
>> Miguel Ballesteros: It’s like, sorry because I misunderstood you then.
>>: Oh, okay.
>> Miguel Ballesteros: You have a learned vector representation for each word type which is this thing
whenever you have an out of vocabulary word in testing.
>>: Yeah.
>> Miguel Ballesteros: Then you have also the word embedding which is this input that I told you
before.
>>: I see.
>> Miguel Ballesteros: That we don’t have the results for training because we use them as an input for
the parser, okay.
>>: Right.
>> Miguel Ballesteros: Okay, so now the results for English. The results for English Chen and Manning
they got eighty-three point nine which is a very competitive result. Then we, our Stack-LSTM parser gets
eighty-seven point two with the same settings.
Now you remove the part of the speech tagging in this case it goes down to eight-two point two because
well the Chinese tree one is like more complicated in terms of out of vocabulary. Then you go, with
other composition functions you get eighty-five point three which still is very high numbers. Well the
stack by the way this eighty-seven point two is state of the art for this setting. This is like a very good
number.
Then if you remove the pretraining you get eighty-five point seven. If you remove the Stack-LSTM and
you use a Stack recurrent neural networks you get eighty-six point one. Yes?
>>: Weren’t these gold part of speech or in English they were predicted part of speech?
>> Miguel Ballesteros: In this case they are gold. This is also why when you remove the…
>>: This is [indiscernible].
>> Miguel Ballesteros: Yeah, this gold part of speech test, so this is also why when you remove the part
of the speech test the results go down much harder than the other case, of course.
>>: [indiscernible] what is the standard of the first lines of methods?
>> Miguel Ballesteros: The what?
>>: The C and M that the…
>> Miguel Ballesteros: This is a parser which is a Transition-based parser with arc-standard.
>>: [indiscernible]…
>> Miguel Ballesteros: Very similar, very similar to us. But this was presented in EMNLB in two
thousand fourteen. They use; well it’s a different parser.
>>: [indiscernible]
>>: You have the feet forward, right?
>> Miguel Ballesteros: They have a feet forward one.
>>: Feet forward, oh, now I see, okay. No recurrent neural network.
>> Miguel Ballesteros: In this case we have a recurrent neural network. Well several recurrent neural
networks to train the parser.
>>: Yeah.
>> Miguel Ballesteros: Okay.
>>: In your [indiscernible] experiments are these accumulative or are you removing P lists or
composition [indiscernible]?
>>: No.
>> Miguel Ballesteros: They are not accumulative.
>>: Thank you sir.
>> Miguel Ballesteros: You do it accumulative it weakens.
[laughter]
>>: I’m not [indiscernible]
>> Miguel Ballesteros: They are like, so this is like a Stack-LSTM. Imagine that you have the word StackLSTM here all the time.
>>: Okay, yeah.
>>: It should be like plus POS minus Transition-LSTM?
[laughter]
>> Miguel Ballesteros: Yes. I see, I see, okay.
>>: Or you can SHIFT the whole thing there.
[laughter]
>> Miguel Ballesteros: Yeah, yeah, [indiscernible] that’s a good question.
>>: Between both of these did you do any like just data analysis looking into C and M versus your LSTM
results? To see like what is this thing getting right that the previous model did not get right?
>> Miguel Ballesteros: What do you mean…
>>: Did it systematically improve?
>> Miguel Ballesteros: Well, actually well because [indiscernible] probability that there is some sense.
You are getting deductions all the time. This is why…
>>: Well, I was curious you know is this thing doing a better job of relative clauses or rare words.
>> Miguel Ballesteros: We didn’t in this experiment but it, I think is a [indiscernible]. You didn’t well to
be honest we didn’t have any, ever analysis in [indiscernible]. But it would be nice to do it definitely.
Okay, yeah.
>>: There seem to be several things at work here, right. There is compared to [indiscernible] there are
this LSTM aspect where it’s a feet forward.
>> Miguel Ballesteros: Yes.
>>: But also you have this embeddings, word embeddings.
>> Miguel Ballesteros: No they also have word embeddings in the…
>>: They also have word embeddings, so…
>> Miguel Ballesteros: Actually the word embeddings that we have were tested for this parser before...
>>: [indiscernible]
>> Miguel Ballesteros: For this parser before in this [indiscernible] paper that I know the author. But
they did say this. They get like ninety-two for English since we get ninety-three point one. They get a
ninety-two point two if I remember correctly. It still is [indiscernible] which is basically given…
>>: I was just curious why…
>> Miguel Ballesteros: Which is basically given because they use the feet forward neural network and
also that the good thing of this model is that you get the whole stack and the whole buffer. They use
features of this, the first word in the stack, first word in the buffer. In our case you get the whole stack
and the whole buffer encoded in your LSTM.
>>: [indiscernible]
>> Miguel Ballesteros: I think this is where the improvement goes.
>>: I’m most curious why if you hold everything else static. Just compare the feet forward versus LSTM
what’s the difference?
>> Miguel Ballesteros: Okay, it will be like a nice experiment to try but we didn’t do that. Shall I
continue? Yeah?
>>: For the architecture could you like give us a [indiscernible] diagram?
>> Miguel Ballesteros: Sure, I think it’s…
>>: Yeah, right…
>> Miguel Ballesteros: Here you can see it yes.
>>: Yeah, so you have an LSTM on a buffer that it goes across first the whole thing. Then every hidden
state is can you do it bidirectional? Because then…
>> Miguel Ballesteros: You could do it bidirectional but we didn’t try this experiment. But I think in this
case it doesn’t make a lot of sense to do it bidirectional because the most important information’s at the
top of the stack or the buffer. This word give you actually most important information. If you do it
bidirectional you are going to get also the last word.
>>: Well, so…
>> Miguel Ballesteros: Also, yeah.
>>: If you, but if you do it bidirectional and then had it be the concatenation of the two states.
Because…
>> Miguel Ballesteros: Yeah, yeah, you can do it.
>>: Great, is it better?
>> Miguel Ballesteros: You can do the experiment. We didn’t try to do bidirectional LSTMs in this case.
In the next frame that I’m going to present we did because like it made sense for us to do it. Because it’s
like recurrent neural network representations, but in this case we didn’t try. But I think it’s going to be,
it’s very straightforward to do it. We could do the experiment right away, yeah, why not.
>>: In terms of supervision signal for LSTM how did you get it? Did you get the result from the
supervision you know from the parsing?
>> Miguel Ballesteros: I don’t understand what you mean.
>>: Well to train LSTM you…
>>: You do deterministic transformation of the referenced parse.
>>: Oh, I see, okay. From then you take the label and then parse it…
>> Miguel Ballesteros: Then you confirm whether the label is correct or not, high [indiscernible] from
there.
>>: Okay, okay.
>> Miguel Ballesteros: Yeah?
>>: If we have a garden path sentence so, the horse raced pass the barn fell. You can’t go back and
revise the parse once you get [indiscernible]?
>> Miguel Ballesteros: No, in this kind of greedy parser you cannot.
>>: Okay.
>> Miguel Ballesteros: You make a mistake you stick to it…
>>: [indiscernible]…
>> Miguel Ballesteros: Until the end you are carrying this mistake with you.
>>: [indiscernible]…
>>: To [indiscernible] point if you did it like for German you know that the verbs are often
[indiscernible].
>> Miguel Ballesteros: Sure, sure.
>>: They would impact how you decide to parse considerably, so.
>> Miguel Ballesteros: Well there are ways of improving that. At the end of the presentation in
presenting dynamic oracles which is a way of getting rid of these early mistakes that you can do. You
can also run a Beam-search and try to get a lot of possible paths, but this is low…
>>: Have you tried that? Does it help?
>> Miguel Ballesteros: We tried in this parser and we get like zero point three improvement. It’s not
very big so we didn’t, so we actually seen that the parser is greedy, so very attractive. Because you get
linear number of operations and you don’t, well you get like a fast parser and fast implementation of the
parser. But…
>>: When you were playing…
>> Miguel Ballesteros: Yeah, sorry.
>>: When were you playing with the beam how large were you making the beam?
>> Miguel Ballesteros: Well, people have tried with the difference size of the beam. It depends also on
the [indiscernible] by what I know people have like sixty-four, fifty-two, this kind of beam sizes. I mean
you can go farther than that but it doesn’t make sense. You can, there are ways of getting rid of these
[indiscernible] backtracking. Like doing Beam-search and I like most to do dynamic oracles which is
what I’m presenting at the end, okay.
Okay, let me continue. Okay, so this also the effect of initialization. We have some [indiscernible]
initialization of the vectors and everything. With this doing like one hundred experiments and these are
the results in the test set. You see that the parsers actually, these are all the experiments that we have.
The parsers actually very stable, so you don’t see like a lot of difference, but it’s also true that the most
repetitive number is ninety-two point eight in this Stack-LSTM for English.
Okay, so in discussion let’s say that these are very high results for a greedy parser. Actually, this is the
best result for a greedy parser. What I say for Transition-based people saying that these are greedy
parser is very attractive. It’s very close to beam-search or complex graph-based parsers that are
presented before. This runs in linear time with greedy decisions.
In terms of parsing speed and memory, so this runs in a single core. It’s not very [indiscernible]
engineer. We could try to paralyze it and do it faster of course but we didn’t. It only required one
gigabyte of memory. You could try on your laptop if you want. In fifty milliseconds per sentence more
or less you get the output for a given sentence. In less than two minutes you parse entire Penn
treebank test set. This is like a very competitive parser in this sense. But of course you can do like
training a GPU or do it in parallel, or doing a lot of things to make it faster.
These are the experiments that I presented EMNLP. This is experiment with different word
representations. [indiscernible] try to motivate you. You are working in out of domain or out of
vocabulary. I think this actually a very good solution.
As I told you before the parser relies heavily on word embeddings. We discussed how we use word
embeddings. Like we train word embeddings and how we also trained embeddings in the training
process of the parser. This is like the baseline that we have for the experiment which is without pretrain word embedding. We only have like the embeddings learned within the training portions of the
parser.
Okay, so we have a learned vector representation for its word type and for the part of speech that we
concatenate both of them. Whenever you have an out of vocabulary word you concatenate that with a
[indiscernible] presentation.
Okay, so this is the Character-based representations. What we do here is to run a bidirectional LSTM to
the characters of the word. We have a start token and we have an end token. Then we run these
bidirectional LSTM character by character. Then we do it backwards. By doing that you get like a, then
you concatenate these two bidirectional encodings. You get a vector and then you concatenate that
with a part of a speech tag and by doing that you get a representation which is based on characters.
It has the contribution is actually two fold. This is very good to get a prefixes of fixes and in fixes
because it learns this information. It’s also very good in terms of getting out of vocabulary word
embeddings. Because whenever you can run these bidirectional LSTM with any word in the world,
actually you can do it with Twitter data or any data, or spelling mistakes. Then you will get an
embedding which is similar to the actual word or to the words that are in your training data. This is the
motivation.
Why character-based representations? Well morphemes and morphology are one of the things. You
know how you can create morphemes in a sentence in English of course. But English is easy. But
morphemes is in other languages [indiscernible] syntactic information. For example in German or
Basque, or Turkish this is an example of a word, every other word in Turkish. You can actually get like a
lot of agglutination. This information is given by these Character-based representations.
This is what they can do for English. This is a [indiscernible] visualization with t-sne of Character-based
representations. You can see here that we get some adjectives, sorry some [indiscernible] in English.
You can also see some past tense verbs. You are getting some kind of, part of speech information. Then
you get some [indiscernible] in [indiscernible]. Then you also get other words that are, you look at then
you see that for these two are similar to each other because they were end similar. They are like a
grouping together by what endings prefixes or suffixes.
Okay, let me give you an example of why this is a good thing to do in other languages such as Basque in
which you have. It’s an agglutinative language. Imagine that you want to parse a sentence Jasonek
datorren mutila ikusi du. I don’t speak Basque but these people from the Basque Country University
helped me to find this example, so it’s actually correct.
This sentence means Jason has seen the boy that comes. You can actually; this Jason could be
connected to the word to see which is a transitive verb. Or it could be connected to the boy to come, to
the verb to come, sorry. Then how do we know that this would be connected to the word to see
because it’s the correct one. Well, we know it because this Jason has the [indiscernible] case which is
this ek thing at the end of the sentence. This ek thing even defies that this is a [indiscernible] verb.
Actually, this could work with any name. You could put it with any name Chrisek, Miguelek, or whatever
you want to imagine. It will work. You will get this kind of; you will get this embedding which is useful
for that. This is why we use character-based representations for, more for morphological rich
languages.
Okay, so the hypothesis that we have is the standard word embeddings or the previous word
embeddings won’t capture this kind of morphological information. Then the parser will be blind for
that. It will be able to parse the sentences. But now the Character-based representation will encode
morphological features based on the characters. The parser will be very powerful for them.
[laughter]
It will be able to see that. Okay, so summing up. We augment the LSTM parser with Character-based
representations. Let’s see the results. We have experiments in twelve of the languages including
English, Chinese, and a lot of morphological rich languages.
The data that we use is from the Statistical Parsing of Morphologically Rich Languages which is data for
Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, and Swedish. You look at these
languages a little bit in detail.
You see that Basque and Hungarian, and Korean, and also yeah these three are agglutinative languages
in the sense that they agglutinate more things together. Then you also have Fusional and templatic
languages and analytical languages such as English and Chinese. We also did experiment for Turkish
because we expected to have these very good numbers in Turkish because this is an agglutinative
language.
The experiments are without any explicit morphological features. We do have also morphological
features we have the part of speech tags. But we don’t have any morphological features explicit to see
the fate of character-based representations. We don’t have additional resources such as pretrained
word or character-based representation which is also something that we could definitely do.
Okay, so this is experiment without part of speech tags. You see the results for the English and Chinese
treebanks. You see here like the [indiscernible] actually a zero point three improvement by using the
character-based representation when you don’t have part of speech tags for English.
Then when you move to agglutinative languages which is these four languages in where you have
Turkish, Hungarian, Basque, or Korean. With these kind of agglutination that I saw you before on it, it
motive you before. You see a lot of improvements. You see how for reasons for Korean that result goes
up to nine point seven even ten points in parsing is a lot, right. This is like a very nice result.
Then when you move to Fusional or templatic languages when you don’t have part of speech tags you
also see a lot of improvements. You see for instance in Polish you have like plus eleven here. But the
thing is that the Polish tree one has a lot of out of vocabulary words in the test set. This is also like a
good motivation because its character-based representations are getting this information doing testing.
Then they are good. But in other languages as you can see as the improvements as we can see here.
Now we move with part of speech tags. We have part of speech tags information. These are all
pretrained. In Chinese we still have gold but, sorry predictive. In the other languages all of them are
predictive part of the speech tags. For English you have ninety-two point six. But with the characterbased representation they don’t work anymore. Because with the information that they get is actually
very similar in this sense.
>>: Is there any way to appeal as information in the character-based representation of the
[indiscernible] as well because you…
>> Miguel Ballesteros: Well, we are doing it. We are concatenating the character-based
representation…
>>: With…
>> Miguel Ballesteros: With a [indiscernible].
>>: Oh.
>> Miguel Ballesteros: This is the [indiscernible].
>>: Still in [indiscernible]?
>> Miguel Ballesteros: In this case it’s not as good for English and Chinese but if you go to agglutinative
languages…
>>: I see.
>> Miguel Ballesteros: You still get a lot of improvement. You’ll still see that you, well the parser is
better in this sense. Actually, this one with part of speech tags these results are worse than the ones
without part of speech tags with only character-based representations. Because they are also getting
morphological features such as the case and these kind of, or numbered tends these kind of things that
are used for.
>>: Wait, can you go back. What you’re saying about this is that these are worse?
>> Miguel Ballesteros: Yeah, these are worse. You see for instance this one eighty-five…
>>: It’s, you’re…
>> Miguel Ballesteros: For example it’s eighty-eight point three…
>>: You’re basically saying it’s learning a [indiscernible] because it’s using the part of speech information
too much and it’s not learning an interesting…
>> Miguel Ballesteros: No, I am saying that with the character-based representation is you also learn
morphological features such as case. This is why the information that you get with them is better than
all the [indiscernible] part of speech tags, for some languages not for all of them.
>>: I guess what you’d expect to happen is that it just learns to ignore the POS tag in the best case.
>> Miguel Ballesteros: In some sense it happens, yes.
>>: But in English and Chinese it’s not properly ignoring those things.
>> Miguel Ballesteros: Yes.
>>: How do you attribute this? Is this just noisy optimization?
>> Miguel Ballesteros: I attribute it that the part of speech tags in English are probably better than my
character-based representations. That the word embeddings that we are getting they are also getting
information about semantic information. These kinds of things that we don’t have this information
when we only have character-based representations. Yes, please?
>>: You’re, it may run from the second column [indiscernible] from words to characters you’re
removing the word representation all together, right?
>> Miguel Ballesteros: Yes.
>>: Okay, so you could add the word representation as well as the character-based representation?
>> Miguel Ballesteros: Yes, yes, you could do both.
>>: This seems like it does a great job of…
>> Miguel Ballesteros: You could concatenate both of them.
>>: Yes.
>> Miguel Ballesteros: We had some explaining with part of speech tagging. In which we have the
character-based representation and the [indiscernible] embedding.
>>: Yes.
>> Miguel Ballesteros: We concatenate both of them and concatenating always helps in this case.
>>: Yes.
>>: But are there any cases where the second column in this slide is significantly like more than noise
worst than the second column in the last slides?
>> Miguel Ballesteros: Yes, you have cases in which I don’t know you go to Korean here…
>>: No, it’s still a little bit better.
>> Miguel Ballesteros: Eighty-eight point four and here well it’s very similar to it.
>>: Yeah, so, but it says, okay it’s a [indiscernible].
>> Miguel Ballesteros: Yeah, but the thing is that with character-based representation you’re also
learning morphological features. I have other experiment with these treebanks with a simple parser
which has HDMS. When you include morphological features the results go up a lot. This is why I think
character-based representation improves a lot in these languages.
>>: Yeah, so with the character do you do any embedding of the character, or no?
>> Miguel Ballesteros: Well, we are getting embedding by running the bidirectional LSTM.
>>: Okay, I see, okay, I see, so it does interrupt that.
>> Miguel Ballesteros: Yeah.
>>: [indiscernible] will copy from someone else.
>> Miguel Ballesteros: No, in this case we did…
>>: Okay, okay, okay.
>> Miguel Ballesteros: It isn’t…
>>: Are you taking the last two states and using…
>> Miguel Ballesteros: Sorry, sorry.
>>: Are you taking the last two states or…
>> Miguel Ballesteros: What do you mean the last two states?
>>: How are you getting it to a fixed vector size?
>> Miguel Ballesteros: You don’t have a fixed vector size. You just run the bidirectional LSTM. I mean
then you get the callings on this.
>>: Can you go back to the…
>>: You take the last two hidden space or you concatenate all the space?
>> Miguel Ballesteros: That [indiscernible] is this, is this one.
>>: Okay, so you take the, okay.
>> Miguel Ballesteros: Yeah, sorry, I have this in my mind but people probably forget, sorry about that.
[laughter]
You run bidirectional LSTM twice. Then you are adding input and input until you get at the end of the
word and you do it upwards. Yeah, the gold is available by the way, so if you want to take a look it’s
there.
Okay, so let me go back. Okay, so now if we are for the Fusional/templatic languages, the characterbased representation with part of speech tags. You don’t see improvements anymore. Actually, there is
like, more or less there is no preference of using character-based representations or using the previous
word embeddings.
>>: But what type of part of speech recognition do you use?
>> Miguel Ballesteros: We use predictive part of speech tags for our languages. These are the other
speeches are provided for these…
>>: Was it a neural part of speech tagger or just the standard…
>> Miguel Ballesteros: No, it’s an [indiscernible] it’s [indiscernible] in this space. It’s not LSTM part of
speech tag.
>>: But, okay, but…
>> Miguel Ballesteros: Its [indiscernible] is…
>>: A lot of kiss up character relevant features, right?
>> Miguel Ballesteros: No, so you get well the results are actually very high similar to the Stanford
target like ninety-seven for most of the languages, ninety-eight. You get like an error rating every
twenty words or something like this. They are good part of speech tags. They are not the best ones but
they are. They are not gold but they are actually a state of the art part of speech tag.
Okay, so we did those on experiment on the effect, on out of vocabulary words to see actually whether
this is actually even any improvements on that. This experiment is by having like a unique
representation for out of vocabulary words with character-based. This means that we have a fixed
representation for each word in the test set that we haven’t seen before. We run the character-based
bidirectional LSTM for the rest of the words and versus our model which is the one I represented before
in which you run the bidirectional LSTM for words. Is this understandable? Yes.
Okay, then the experiments the last that basically whenever the out of vocabulary rate increases you
can see Polish here in which we have this plus eleven result before. Whenever the out of vocabulary
rating increases the label attachment score also increases a lot.
We see that actually they use a lot of out of vocabulary words by running the character-based
representations for all words. But the character-based representations have a tremendous positive for
that. But they are not only for that. You also get a lot of improvements by using the morphology of the
languages.
This is more or less the picture. The model with character-based representations for all words is better
than the model for character-based representation with this fixed representation for out of vocabulary
words. It’s better than the model for words for all languages without part of speech tags. This is why I,
well I try to motivate that this is actually a very good fit for out of domain and you want to do that.
>>: Have you computed the scores for just the OOV words versus the non-OOV words?
>> Miguel Ballesteros: Yes, yes we did. Then you see that whenever you use this model the score of
OOV words is actually very similar to the score of non, of in vocabulary words.
>>: Okay, so does it then in some cases does it hurt the non-OOV words to add to the character-based
model?
>> Miguel Ballesteros: No, that’s [indiscernible].
>>: Okay…
>> Miguel Ballesteros: You get always improvements.
>>: Okay.
>> Miguel Ballesteros: But of course the OOV words still provide more errors than the non-OOV words.
But whenever you are character-based representation at least for parsing this is what I can say they get
a lot of improvements for everything.
Okay, so in this caption let’s say that this character-based representations are very useful. But they are
more useful for like agglutinative languages because they have this agglutinative of morphemes. They
obviate explicit for part of speech information for all languages which is actually a good way. This does
not mean that we don’t have part of speech. We have it but we have it encoded in characters, right.
We also have a nice approach for handling the out of vocabulary problem. These are very competitive
results compared to the best parsers of the state of the art. Actually these are the best greedy parser.
In some cases we even have the state of the art such as in Turkish.
In terms of parsing speed and memory of course running the character-based representation make the
parser slower, but not a lot. Instead of fifty milliseconds per sentence, you now get like eighty
milliseconds per sentence. This summarize that we measure. You need more memory, so instead of
having one gigabyte you need one point five gigabytes. Maybe you need like a better laptop but you still
can do it.
>>: [indiscernible] all the non-words they were just, you can pre-compute the representation for those?
>> Miguel Ballesteros: Yes of course you could do that, yeah. You will speed up a lot. But we did
experiment like the simplest impossible.
Okay, so now let’s move to the question that they asked me from the back. That how can you react to
like early mistakes? You train this parser with static oracles which is what we discussed before. You get
like a set of gold actions and you stick to it. You try to learn these set of gold actions. You apply locallyoptimal actions until sentence is parsed. You don’t have backtracking. The parser learns to behave in
the gold path.
This is still good: for English you get ninety-three point one, which is still a very good result, and the state of the art for some languages, but you can do better. What we do is use dynamic oracles, which means learning how to parse in difficult configurations in which the gold tree is not reachable anymore, because you have made a mistake earlier that you cannot fix. Then you say, well, let's at least get the rest right. But you also still learn the gold path to the gold tree.
This is more or less how training works in this case. If the gold tree is reachable, you allow actions leading to the gold tree. If the gold tree is not reachable, you allow actions leading to the best reachable tree. Sometimes in training you say: I want to trust the model. You let the model make an early mistake, and then you try to get back towards the gold tree by taking the best actions. That way you learn how to reach the best reachable tree, the tree closest to the gold tree.
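As a rough illustration of that training regime, here is a sketch of a dynamic-oracle loop with error exploration, continuing in Python. The interfaces (score_actions, dynamic_oracle, the state object, the exploration probability) are illustrative placeholders, not the actual API of the released Stack-LSTM parser.

import random

def train_sentence(model, sentence, gold_tree, dynamic_oracle, explore_prob=0.1):
    """One hypothetical training pass with dynamic-oracle exploration."""
    state = model.initial_state(sentence)
    loss = 0.0
    while not state.is_terminal():
        scores = model.score_actions(state)        # dict: legal action -> score
        # The dynamic oracle returns every action that still leads to the best
        # tree reachable from this configuration (the gold tree itself if no
        # unfixable mistake has been made yet).
        optimal = dynamic_oracle(state, gold_tree)
        loss += model.loss(scores, optimal)        # push scores toward that set
        if random.random() < explore_prob:
            action = max(scores, key=scores.get)   # trust the (possibly wrong) model
        else:
            action = max(optimal, key=lambda a: scores[a])   # follow the oracle
        state = state.apply(action)
    return loss

In practice such exploration would typically be switched on only after the model has had some initial training, which is exactly the point raised in the next exchange.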
>>: There’s no [indiscernible] it’s still a greedy at test time?
>> Miguel Ballesteros: It’s greedy at test time, yeah. But your classifier learns…
>>: So you fully pre-train the model and then you do this? Then you start applying it to the training data?
>> Miguel Ballesteros: Can you rephrase that?
>>: Because initially the model is going to be really bad. Like after the first epoch it's, nope, it's putting…
>> Miguel Ballesteros: Of course, of course.
>>: You do it, you wait until the model is good and then you start…
>> Miguel Ballesteros: Yes, yes, you wait, because otherwise the model is going to be very bad at the beginning.
>>: Correct.
>> Miguel Ballesteros: You don’t learn anything, yes.
>>: Okay.
>> Miguel Ballesteros: There are a lot of ways of doing that. People have studied this problem in different ways.
>>: Okay.
>> Miguel Ballesteros: We just do, more or less, this thing here: basically, sometimes you take one action and sometimes you take the [indiscernible] action. But you do it, as you said before, you try to, well, you don't do it with a very early model, because it doesn't make sense to do that.
>>: Right.
>> Miguel Ballesteros: Okay, so this is more or less how it works. With these dynamic oracles the parser for English, instead of getting ninety-three point one, now gets ninety-three point six, which is actually a very high result. For Chinese we also get some improvements: the parser before was at eighty-seven point two and now you go up to eighty-seven point seven. This is more or less what you could get with beam search or something like that, but you do it with greedy decoding, so this is a very nice improvement.
That last thing I mentioned is in submission at Computational Linguistics. Right now we are working on joint part-of-speech tagging and parsing: whenever you shift a word onto the stack you also [indiscernible] the part-of-speech tag of the word. This is implemented, but the Google people presented something similar at EMNLP, so maybe there is no chance to publish it.
We are also studying different approaches to beam search, [indiscernible] a way to avoid this kind of backtracking problem. Also, since we didn't do a lot of hyper-parameter tuning of the parser, we just did a few rounds of it, we would like to do grid search for hyper-parameter tuning, to get the most tuned parser possible.
We will also go in that direction with better optimization. We are also doing more experiments with character-based representations on out-of-domain data, trying to see the effect there: not only on morphologically rich languages but also on out-of-domain data.
The parser is available online; you can actually use it if you want, it's over there. This is CNN, Chris Dyer's neural network library, which we are using for all the experiments, since he is on the papers and is working on that. For outreach, these are the three papers that we have on this: the ACL paper in which we presented the Stack-LSTMs from the first part of the presentation, the character-based modeling that improves on that, and the Computational Linguistics submission in which we also have the dynamic oracles.
Thank you so much for your attention and all your questions.
[applause]
>>: Yes, so…
>> Chris Quirk: One thing I want to call out: Michelle is here, I'm sorry, Miguel…
>> Miguel Ballesteros: No problem.
>> Chris Quirk: …is here today. He has some free time this afternoon if you want to take some of it; there's about an hour free. Feel free to sign up for a half hour or [indiscernible]. But, sorry, please go ahead.
>>: What just happened in the first few slides you went through? You know, because I don't know anything about this.
>> Miguel Ballesteros: Okay.
>>: It looks like everything is very simple: for each step you get a supervision signal. You simply, if you use the standard…
>> Miguel Ballesteros: Yes, yes, this is how a Transition-based parser works.
>>: I was saying that [indiscernible] SHIFT and all that.
>> Miguel Ballesteros: Sure.
>>: I think it could be a very simple neural network, maybe three or four small neural networks, to do this, you know, [indiscernible] prediction.
>> Miguel Ballesteros: I’m saying that this is what we are doing, right.
>>: No, but, yeah you’re right.
[laughter]
But, well, actually [indiscernible], just looking at the example that you gave, you don't even need to do any memory stuff, because every single…
>> Miguel Ballesteros: What it is…
>>: You have that supervision there, right.
>> Miguel Ballesteros: Yes, well, this is parsing the sentence "They told him a story." You go to other languages, you go to other problems with non-projectivity, and you need the history, yeah.
>>: Okay, I see. But if you have enough data you probably don't even need to look at [indiscernible].
>>: You shouldn’t just make like a series of logistic regressions.
>>: And get a good result, right?
>>: Yeah.
>>: That’s what people use to do. Then Chen and Manning came along and said no, let’s do a feet
forward neural network instead and improved on those logistic regression classifiers.
>>: I see.
>>: Now he’s coming along and saying the feet forward neural network has a limited window of the
sentence. Let’s look at the whole sentence and see if we can make better global decisions.
>>: Yeah, I see.
>>: Yeah, I mean that was the history.
>>: Okay, yeah because…
>>: You’ve recapitulated the history…
>>: [indiscernible]
[laughter]
>>: [indiscernible]
>> Miguel Ballesteros: Before, you used to have a simple SVM classifier.
>>: I see.
>> Miguel Ballesteros: It just looks at a very short window: the three words on top of the stack and the three words at the front of the buffer.
>>: Okay.
>> Miguel Ballesteros: You take features of these words and try to get the prediction. They get good results.
>>: I see.
>> Miguel Ballesteros: They get like ninety for English doing that.
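For contrast with what the Stack-LSTM replaces, here is a rough sketch, again in Python, of that older feature-template style: a handful of atomic features from the top of the stack and the front of the buffer, handed to any off-the-shelf multiclass classifier. The templates and the token attributes (form, pos) are illustrative assumptions, not the exact feature set of any particular parser.

# Illustrative pre-neural feature extraction: a few atomic features over the top
# three stack items and the first three buffer items, to be fed to a multiclass
# classifier (SVM, perceptron, logistic regression). The `form` and `pos` fields
# belong to a hypothetical token object.
def extract_features(stack, buffer):
    feats = {}
    for i in range(3):
        if i < len(stack):
            tok = stack[-(i + 1)]          # i-th item from the top of the stack
            feats[f"s{i}.word"] = tok.form
            feats[f"s{i}.pos"] = tok.pos
        if i < len(buffer):
            tok = buffer[i]                # i-th item at the front of the buffer
            feats[f"b{i}.word"] = tok.form
            feats[f"b{i}.pos"] = tok.pos
    return feats   # e.g. {"s0.word": "told", "s0.pos": "VBD", "b0.word": "him", ...}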
>>: Yeah.
>>: Okay.
>> Miguel Ballesteros: But with that you can do better, right. Yes, of course you can implement the Transition-based parser…
>>: [indiscernible] classify something.
>> Miguel Ballesteros: But you don’t have to have a very accomplished model it works, right.
>>: Right.
>>: Well, in speech recognition it's predicting [indiscernible], right.
>>: [indiscernible]
>>: Like all I’ve got to do is train the right classifier and I’ll be done, right.
[laughter]
>>: [indiscernible] other words.
>> Miguel Ballesteros: Yeah.
>>: This is a [indiscernible], right. [indiscernible] through. [indiscernible], yeah, it’s amazing, yes.
>> Miguel Ballesteros: No, no, I don’t know but, so people have these kind of simple classifiers before.
Like perceptual learning for instance and then in order to get better results and to compatible with
character-based approaches, what they did is to do beam searches as discuss before. Do dynamic
oracles and now you put everything together you get a better parser, right. You then, it’s like better
results.
>>: Okay.
>> Miguel Ballesteros: Yeah, you have a question?
>>: Can you comment on the difference between this approach and Google's attention-based one?
>> Miguel Ballesteros: Well, the Google one is similar to the Chen and Manning one, but they also have the [indiscernible] search and [indiscernible] label [indiscernible]. They get further than that; they get ninety-four or something like that. But in our case we don't have that big amount of data, and we don't do all the hyper-parameter tuning that they do. It's what I said at the end: if I had the machines to do it, I would like to do a grid search, brute force, to see how far the model can go.
>>: [indiscernible]
>> Miguel Ballesteros: Well, you could tune a lot of things: the size of the hidden layers, the size of the input layers, finding a good initialization, sorry, a good initialization seed, for the vectors in the Transition-based parser. There are a lot of hyper-parameters you could explore, right.
>>: You mentioned Google's paper, I don't know. But looking back at the original slides, the Stack-LSTM output is actually [indiscernible] attention, right? I'm not sure Google has that. You could use attention, we could actually [indiscernible] instead of the hard…
>> Miguel Ballesteros: Yes you could do that. I bet there are people trying to do that…
>>: But I’m not sure…
>> Miguel Ballesteros: But I like this model more, okay.
>>: Okay.
>> Miguel Ballesteros: As well that’s my opinion. Because you get, this is how a Transition-based
parsing people is how you are with a recurrent neural network to whether you get light encoding from
the top of the stack and from the top of the buffer. This is the kind of information that you use in a
Transition-based parser. But of course you could use attention models and do much more things. We
would like to try that but time is needed.
>>: To tune that stack, I understand you have the push and pop, that signal [indiscernible]. It comes because you have the three stacks of [indiscernible], for example the [indiscernible] from one LSTM is the training signal for the other LSTM. That needs to be learned, right, just to…
>>: Those, it’s a deterministic transform of the parse.
[laughter]
>>: There’s a training signal?
>>: The training signal is all deterministic beforehand.
>>: I see, okay.
>>: That’s what…
>>: [indiscernible]…
>>: How is it very experimental in this task…
>>: [indiscernible]
>>: It’s a pretty simple way of improving your system namely for the test set. First you use the standard
parsing. Could you get a ninety percent already you actually can do label for test set.
>> Miguel Ballesteros: Sorry, sorry and can you rephrase your…
>>: I’m just saying that [indiscernible] you might actually in the whole test set that you get, number one
you don’t use [indiscernible] standard parsing.
>> Miguel Ballesteros: What is a standard parsing for you?
>>: I mean we’re the one that [indiscernible] parsing the…
>>: No that’s, you need a reference for that. I’d say you get this; you turn the dependency tree into a
stack. Into this, you get this…
>>: [indiscernible] correct, okay.
>>: Of just the…
>>: I understand that, so you get a stack not using these methods.
>>: Yeah.
>>: As you said a method. Then why do you have that…
>> Miguel Ballesteros: What is a standard?
>>: You use those labels to go back to the training set to retrain the system.
>>: No that is the training set then.
>>: No, not for the test set; you can just run it. You can [indiscernible] two systems, one the standard system.
>>: Why would you, if you want to do unsupervised training then you could just do it with the right…
>>: [indiscernible]…
>>: Yeah.
>> Miguel Ballesteros: People have tried all these things before too.
>>: I'd have to see that; yeah, it may not be as good, yeah.
>> Miguel Ballesteros: People have tried this, using one parser as the input to another parser, and stacking a lot of things.
>>: I see.
>> Miguel Ballesteros: This is totally possible, yeah. It improves the results.
>>: I see.
>> Miguel Ballesteros: But in this case we are presenting a single parser that runs on a single core and gets good results.
>>: That’s cool.
>> Miguel Ballesteros: That would take a lot of time, right, running two parsers, stacking one parser on another. [indiscernible], yeah.
>>: I see.
>>: This is dependency parsing.
>> Miguel Ballesteros: Yeah.
>>: What about constituency parsing?
>> Miguel Ballesteros: You could do the same process.
>>: I see.
>> Miguel Ballesteros: Because there are several phrase-structure parsers that work in a similar way.
>>: Yeah.
>> Miguel Ballesteros: You could actually do the same thing; yeah, you can definitely do it. Well, there is, I don't know, you could look at [indiscernible] from Singapore University, he has a very nice [indiscernible] parser that works with [indiscernible]. You could actually apply the same kind of algorithm with the Stack-LSTMs, yeah.
>> Chris Quirk: Thanks everybody. My plan now is to take Miguel out to lunch, so if anybody wants to
come join…
>>: We should thank the speaker.
>> Chris Quirk: No thanks again, yeah.
[applause]