>> Chris Quirk: So welcome, everyone. It's my pleasure to introduce Miguel
Ballesteros who's visiting. He's normally a professor in Barcelona, but he's
currently doing a one-year stint at Carnegie Mellon where he's been working
with a number of really interesting people there and pursuing directions in
introducing structure into recurrent neural networks including like
stack-LSTMs, has been doing some very influential and effective research in
parsing and now beyond. So with that, Miguel, please.
>> Miguel Ballesteros: Okay. Thank you, Chris, for the introduction. So today I'm going to talk about transition-based natural language processing. You will understand what it is in an hour.
So first you might be wondering who is Miguel. So basically I did my Ph.D. at the Complutense University of Madrid, Spain. During this period I also started doing [inaudible] at Uppsala University, working under [inaudible] Joakim Nivre, from whom I learned all the wonders of transition-based parsing models. Then I moved to Barcelona, where I have worked with people on generation and also on dependency parsing. Then I did a research stay at the Singapore University of Technology and Design, where I also worked on phrase-structure shift-reduce parsing. After that I moved to Carnegie Mellon University, where I have worked mainly with Noah Smith and Chris Dyer on parsing and other problems.
So this talk is about linguistic structure prediction, in which you basically have sentences like this one, Mark Watney visited Mars, and then you want to infer the linguistic structure.
The linguistic structure can be given in many different ways. So one way
could be, for instance, named entity recognition in which you have sentences
like Mark Watney visited Mars, and then you [inaudible] that Mark Watney, all
this chunk, is a person and also that Mars is a location.
So there are other things you can do for linguistic structure prediction, for
instance, dependency parsing in which you get dependency relations between
words, and these dependency relations are binary dependency relations.
Basically you get like some kind of dependency-based [inaudible] of
sentences.
So in this talk, I'm talking about a supervised approach to linguistic structure prediction, just in case somebody thinks I'm doing unsupervised. I'm talking about a supervised approach in which I have a treebank with gold structures: I basically have pairs of a sentence and its gold structure. And the challenge is, given a new sentence, to infer the structure of this sentence.
Okay. So I say that this is a challenge because it actually is a challenge, mainly if you want to parse a new sentence: this sentence can be something completely new, a sentence that has never been written before, and the parser has to be able to parse the sentence in the proper way. And we expect it to do it like this.
So the general approach, or basically the main approach, for parsing is what we call graph-based parsing, or the graph-based approach -- for linguistic structure prediction in general, not only parsing, because this talk is not only about parsing. So in this idea you basically build a graph connecting all words, you score all the edges with some kind of machine learning model that says how likely Diego and plays are to be attached together, and then you search for the best-scoring tree with some kind of graph-based algorithm, and you produce a dependency tree or any other linguistic structure for the sentence.
So the graph-based approaches have some problems. The problems are mainly speed and complexity, and a hard inference algorithm that you have to implement in order to run them. But of course they are also very powerful, in the sense that they present a global solution to the problem. And of course this makes them very attractive for people that like to incorporate all the information they can into their models.
The question of this talk is whether we can model linguistic structure prediction problems, such as NER or parsing, as a sequence of actions, and whether we can do it at the same level of accuracy as the graph-based approach.
Okay. So what is a transition-based approach to natural language processing? Basically what we have is a transition system, or an abstract state machine. We process a sentence in a sequential way, from left to right. And we normally have a buffer that keeps the words that we want to parse, or that we want to get the structure from. And we may use different data structures. Normally we use a stack and a buffer, like in shift-reduce compilers or shift-reduce parsing, as you may be familiar with.
And of course at each step we have an algorithm that has to select the best operation, or basically the best class or the best action to take, given the parsing state. We normally use a classifier, and you can pick the best classifier you want to do this; people have tried this in many different ways.
So they have a lot of promise, these transition-based models, because they are fast and very flexible for different kinds of problems. Basically they are linear in the length of the sentence, because you take an action for each word and you finish whenever you are done with the sentence, and this makes them very attractive. And they are also flexible in the sense that you can come up with any kind of set of actions to fit your problem.
Of course they present other challenges. Mainly, feature engineering is hard, because you need to define the features over the stack and the buffer, and you need to do it in a proper way; you normally need linguistic intuition in order to do that. And there are also the local optima problems. Since you are making greedy decisions, the parser can make a wrong decision at some point, and this can be a problem.
So in this talk, I'm basically going to try to convince you that we can actually do feature engineering much better by using a different kind of representation, and also that we can fix this local optima problem by using a better classifier.
So in this talk, I'm going to talk about powerful sequential stack-based approaches for natural language processing, or basically transition-based natural language processing. All of them are based on transition-based approaches with stack-LSTMs, which is a cool model we presented at ACL 2015.
So I'm going to talk about a transition-based dependency parser. I'm going to talk about how we can do extensions for semantic role labeling, doing joint syntactic and semantic parsing. I'm going to talk about how we can use these powerful syntactic models for language modeling and generation. And I'm going to talk about the transition-based NER system that was [inaudible] yesterday.
Okay. So at the beginning, since this is basically how people start to do
the transition-based approaches to NLP, I'm going to talk about parsing.
Okay. So what is dependency parsing? So basically dependency parsing is the
syntactic parsing of natural language in which we have dependency-based
syntactic representations of language.
Okay. It has a long tradition in descriptive and theoretical linguistics. And it is right now very popular, I would say more than [inaudible] popular, very popular in computational linguistics because, well, we have a lot of treebanks, people have worked a lot on parsing, and we have a lot of baselines to beat in dependency parsing.
And of course the main reason for this is that dependency parsing is very useful for downstream applications such as question answering or machine translation or many other applications for which you need syntax in order to do [inaudible] better.
So, more formally, a dependency graph for a sentence such as this one is basically a directed graph in which you have a set of nodes that represent the tokens -- basically the words in your sentence -- and a set of directed arcs that represent the dependencies.
So a dependency has a dependent and a head, and these dependencies can be labeled with a dependency type. So basically you know that Watney is the subject of visited, and you have it labeled with the subject dependency.
So I'm going to talk about Nivre's arc-standard, Joakim Nivre's arc-standard transition system, in which we have three different actions. So in this arc-standard parser you have a buffer and a stack, and then you get three different actions.
The first action is the shift, which takes a word from the buffer and puts it on the stack. And then, whenever you have a couple of words on the stack, you can create an arc from left to right or an arc from right to left. And this is why you have a couple of actions that I'll call left arc and right arc.
And if you want to parse non-projective trees such as this one, in which you have crossing edges and there is no way to draw the tree without crossing edges, you can also incorporate a swap action, which we call swap, in order to do reordering. By using that on a stack, you can do any kind of reordering you want, and you can also parse any non-projective tree for the sentence. And still, even with the swap action, the parser is fast, linear in the length of the sentence.
Okay. So I'm going to give you an example of how Nivre's arc-standard transition system works for this sentence that I am running throughout the presentation.
So at initialization you have a buffer full of words -- the words of the sentence we are parsing -- and a stack which is empty. And the transition here will be the initialization.
So the first thing to do here, since we have an empty stack, will be a shift action, taking the word Mark and putting it onto the stack. And whenever we have only one word on the stack, in the arc-standard parser the only thing we can do is another shift action. So we will take the word Watney and we will put it onto the stack.
And whenever we have a couple of words on the stack, we can decide whether to shift another word from the buffer to the stack, or to create an arc from Mark to Watney, or an arc from Watney to Mark. In this case the best thing to do is basically the left arc between Mark and Watney. And whenever we do this, we remove the dependent word from the stack.
Now, since we only have one word on the stack, we will do another shift action, and we will have visited on the stack. And, again, of course we can shift Mars or we can build an arc between these two words. So the parser, or the classifier, decides that the best thing to do is to create this arc between these two words, and then it keeps going, shifting Mars onto the stack, and finally it will create the arc from visited to Mars.
So this is how a transition-based dependency parser works. You can see you have all your actions here, and the number of operations is actually linear in the length of the sentence, because you have eight actions for four words. It's always twice the number of words in the [inaudible] parser.
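To make the action sequence concrete, here is a minimal Python sketch of the arc-standard system as just described (the function and action names are illustrative; the real parser picks each action with a trained classifier rather than following a fixed script):

    def arc_standard(words, actions):
        # stack and buffer hold tokens; arcs collects (head, dependent) pairs
        stack, buffer, arcs = [], list(words), []
        for action in actions:
            if action == "SHIFT":
                stack.append(buffer.pop(0))   # move the next word onto the stack
            elif action == "LEFT-ARC":        # top of stack heads the word below it
                dependent = stack.pop(-2)
                arcs.append((stack[-1], dependent))
            elif action == "RIGHT-ARC":       # word below heads the top of stack
                dependent = stack.pop()
                arcs.append((stack[-1], dependent))
        return arcs

    # Oracle action sequence for the running example:
    print(arc_standard(
        ["Mark", "Watney", "visited", "Mars"],
        ["SHIFT", "SHIFT", "LEFT-ARC",        # Mark is a dependent of Watney
         "SHIFT", "LEFT-ARC",                 # Watney is a dependent of visited
         "SHIFT", "RIGHT-ARC"]))              # Mars is a dependent of visited
    # [('Watney', 'Mark'), ('visited', 'Watney'), ('visited', 'Mars')]

With a final attachment to an artificial root, that is eight actions for four words, twice the sentence length.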
So the question is: given a parsing state, the classifier has to select the best action. How do we do that? Well, we need features. And these features are basically things that we define over the stack and the buffer. So [inaudible] MaltOptimizer, so I know very well how to do feature [inaudible] for transition-based parsers.
So basically people define features over the stack and the buffer, fixed window features. You say, okay, I take the first couple of words of the stack, the first couple of words of the buffer, and I define singleton features over part-of-speech tags, word forms, et cetera, et cetera. And by using that, I can feed everything, all these features, into a classifier, and the classifier will make predictions.
So this is what a feature set normally looks like, a standard feature set, or the default feature set, for a transition-based parser. These are like merged features, a couple of things. And as you can see, in the buffer you only go to depth three or something like this.
But now we can do better, because we have recurrent neural networks, and we can have a better look ahead at what is coming up in the buffer and what we have in the stack. So we can use recurrent neural networks to do transition-based parsing.
The idea is that recurrent neural networks are good at learning to extract complex patterns from sequences. They are very good for data which is ordered and context sensitive, where the later elements of the sequence depend on previous ones. And this is basically what we have in transition-based parsing, where we have a stack and a buffer, and these are ordered and context sensitive.
They're also good at remembering information over long distances. But recurrent neural networks have [inaudible] with the simple implementation. They have a bias toward representing the most recent information, and this is related to what people call the vanishing gradient problem.
So some people came up with [inaudible] LSTMs, or long short-term memory
networks, which is basically a variant of these recurrent neural networks
which are designed to cope with these vanishing gradient problems.
So they basically have an extra memory cell that keeps gradient and decides
when to keep the gradient and when to not keep the gradient. And by doing
that you improve the results of RNNs in many tasks. So they are better than
RNNs in several tasks and also in parsing.
And what we presented at ACL is basically this. We have an LSTM, which, as you know, sequentially encodes a sequence. In transition-based parsing, we need stacks. So we basically augmented LSTMs with a stack pointer and two constant-time operations. The push operation is the same as in a standard LSTM whenever we have a new input. And then we also have a pop operation, which basically moves the stack pointer back. And by doing that, we can model the behavior of a stack, and we can do parsing in a very efficient way by using stack-LSTMs.
So the intuition behind the stack-LSTM is that the summary of the stack contents is obtained by accessing the output at the location of the stack pointer. So I'm going to give you an example of how this works.
So this will be the stack-LSTM, okay, in which we have the stack pointer pointing to the output. So this will be like the output layer, then the hidden layer and the input layer of the stack-LSTM. This will be for a stack in which we only have one word, Mark. The summary is what the pointer is pointing at in the output layer of this one. Okay.
So now we do a pop: basically we remove the symbol from the stack. What happens in the stack-LSTM is that the pointer is now pointing to the empty stack. So, as you can see, we can model any kind of stack. And if we make another input, basically shifting one more element onto the stack, now the pointer is this one, which is what we need.
If this is difficult to understand, I have another slide, which is the next one, which I think is going to be better for some people. Some people like this one, some people like the other one, so I decided to keep both.
So basically, when you are doing push, you are doing the same thing as you do in an LSTM. And whenever you do pop, you basically move the pointer back. And if you then push, you basically create another branch in the tree, like this. Yeah.
>>: So there's only one stack total?
>> Miguel Ballesteros: At the end you have one stack, yes.
>>: But it has some kind of structure to it?
>> Miguel Ballesteros: Yes. So at the end, for instance, if you are here and you want the output of your LSTM, you just run the recurrent neural net from wherever the pointer is.
>>: So you don't need to remember those other branches?
>> Miguel Ballesteros: You don't. But if you start removing them, the complexity grows, and you don't want that. So the idea here is that we just keep whatever we need for this. And of course you can get rid of all these things except what you need for parsing. Yeah. Any more questions?
>>: So is this the same kind of model that appeared at NIPS?
>> Miguel Ballesteros: Which NIPS [inaudible]?
>>: [inaudible] I saw the same paper. Maybe it's a different author. Is it a different kind of model or the same model?
>> Miguel Ballesteros: Well, this is the stack-LSTM, which was first published by me and other authors at ACL, at ACL 2015.
>>: The continuous [inaudible] this one seems more discrete.
>> Miguel Ballesteros: Okay. So we'll keep going. Any more questions about this? No? Okay. So this is basically what we use. These are the stack-LSTMs, from which, whenever you want it, you get the encoding of a stack. So we are augmenting LSTMs with this stack pointer, and this allows us to model the stacks and do parsing. And what we need in transition-based parsing are basically stacks.
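As a rough sketch of this data structure, the push and pop interface might look like the following in PyTorch (illustrative code, not the authors' implementation; the paper's version keeps popped states around and moves a pointer, which is equivalent here because discarded branches are never revisited during parsing):

    import torch
    import torch.nn as nn

    class StackLSTM(nn.Module):
        # An LSTM whose most recent state can be discarded in constant time,
        # mimicking the stack pointer moving back after a pop.
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.cell = nn.LSTMCell(input_dim, hidden_dim)
            empty = (torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim))
            self.states = [empty]        # states[0] encodes the empty stack

        def push(self, x):               # x: tensor of shape (1, input_dim)
            h, c = self.states[-1]
            self.states.append(self.cell(x, (h, c)))

        def pop(self):                   # never pops the initial empty state
            if len(self.states) > 1:
                self.states.pop()

        def summary(self):               # encoding of the current stack contents
            return self.states[-1][0]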
So the next question, of course, is how we can use this to do transition-based dependency parsing with stack-LSTMs. We have a buffer, as you know, because I showed you the example; we have a stack; and we also have a list with the history of actions, which are basically the things that the parser has done so far.
So this list of actions, this buffer and this stack are each associated with a stack-LSTM. So we have one encoding of the contents of the stack, the buffer and the list of actions at any time step. The list of actions is not really a stack-LSTM, because you are only pushing things into it; you never pop things from the list of actions, because you cannot change the past. But the buffer and the stack are stack-LSTMs.
And we use a rectified linear unit to produce the parser state embedding given the three stack-LSTMs. So this basically looks like this. This will be the buffer, in which we have the stack pointer pointing to this element here. We also have the stack-LSTM for the stack, and we have the stack-LSTM for the list of actions taken so far by the parser. And this, of course, is a rectified linear unit with a softmax over the actions, and the parser decides what to do given the encoding of the three stack-LSTMs.
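A minimal sketch of that state computation, with layer sizes and names as illustrative assumptions:

    import torch
    import torch.nn as nn

    hidden_dim, n_actions = 100, 3                  # illustrative sizes
    combine = nn.Linear(3 * hidden_dim, hidden_dim)
    act_layer = nn.Linear(hidden_dim, n_actions)    # SHIFT, LEFT-ARC, RIGHT-ARC

    def action_distribution(stack_sum, buffer_sum, history_sum):
        # each summary is the (1, hidden_dim) output under its stack pointer
        state = torch.relu(combine(torch.cat(
            [stack_sum, buffer_sum, history_sum], dim=1)))
        return torch.softmax(act_layer(state), dim=1)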
So since this is a neural network parser, we have to represent the words in the best way possible. So we have word embeddings. As you can see here in the previous slide, a lot of these things are basically words, so we need to represent them.
So basically we have a learned vector representation for each word type. These are the vector representations that you see here, learned from the training corpus. And for out of vocabulary words, we use a fixed representation, like UNK, for the words that are not included in the training corpus, or rare words.
We also have pretrained word embeddings, trained with a neural language model -- the word embeddings by Ling et al. from NAACL last year. These are basically another word embedding that you have here.
And we also have a learned representation of the part-of-speech tag. So what we do is concatenate these three different vectors, and for each word we get a new embedding like this.
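As a sketch, with illustrative dimensions and a shared UNK row assumed for unseen words, the per-token input could be built like this:

    import torch
    import torch.nn as nn

    learned_emb = nn.Embedding(10000, 32)    # learned per word type; row 0 = UNK
    pos_emb = nn.Embedding(45, 12)           # learned per part-of-speech tag
    pretrained = torch.zeros(10000, 100)     # fixed pretrained vectors (lookup)

    def token_vector(word_id, tag_id):
        w, t = torch.tensor([word_id]), torch.tensor([tag_id])
        # concatenation of the three representations, as on the slide
        return torch.cat([learned_emb(w), pretrained[w], pos_emb(t)], dim=1)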
Another thing which is very, very useful in transition-based parsers is composition functions. In transition-based parsers, we normally call these history-based features, which are the things that have already been decided. So if you have already found the subject of a sentence, you don't want to find another subject for the same word. Right?
So composition functions basically give you the embedding of a subtree, of a dependency subtree. As you see here, overhasty and decision form a subtree, and we want to get the best embedding possible for it. So what we do is basically run a recursive neural network, or [inaudible], over head, dependent and relation, plus some kind of bias.
So you have the words overhasty and decision, you run this [inaudible] with the relation, and then you get an embedding from it. And this is basically how we represent [inaudible] dependency structures, which is a very, very useful source of information in parsing. Yes.
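A minimal sketch of that composition step (a single affine layer with a tanh, following the description above; the exact parameterization is an assumption):

    import torch
    import torch.nn as nn

    dim = 100                                 # illustrative embedding size
    U = nn.Linear(3 * dim, dim)               # combines head, dependent, relation

    def compose(head_vec, dep_vec, rel_vec):
        # embed the new subtree; the result replaces the head's vector on the stack
        return torch.tanh(U(torch.cat([head_vec, dep_vec, rel_vec], dim=1)))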
>>: Quick question. In the case where you only have dependents on one side, it seems like it's unambiguous how to compute this tree, but if you have dependents on both sides, there's ambiguity about what you pull in --
>> Miguel Ballesteros: Yes. There is some [inaudible], but you are always -- you are starting with head, dependent and then relation, so you --
>>: But there could be -- say I have -- let's see. The man -- let's see. English is tough. But, well, even if I just say John went home, went has two children, one is John and one is home. Does it matter, the order in which you consume the --
>> Miguel Ballesteros: It clearly doesn't matter. So you will get like a different embedding in which you --
>>: Exactly. One will be a left --
>> Miguel Ballesteros: One will be a left and the other one.
>>: You just pick it --
>> Miguel Ballesteros: You just pick it whenever the parser -- but since this is a left to right parser, you normally get the same order at the same time. Unless you are doing swapping and non-projective parsing, where it can change. Yes?
>>: For English, why is it a left to right parser?
>> Miguel Ballesteros: This is because we are using a transition-based approach to parsing, in which we basically shift words one by one from the buffer to the stack. So this is always a left to right parser.
>>: But could it not have equally been a stack --
>> Miguel Ballesteros: We actually tried to do it in the other way, and it works similarly.
>>: I'm just thinking because English by default has a right-branching structure, so it's actually maybe easier to go from right to left.
>> Miguel Ballesteros: Some people have tried several ways, like doing first left to right, then right to left, then some kind of [inaudible] system between these two. And they get a little bit of improvement, but it's not like the most influential modification. But of course, since this is a transition-based parser, you have to think about always parsing left to right, in which you can also swap words if you want to have non-projective parsing [inaudible] these languages. Yes.
>>: But at any point in time [inaudible] left to right. How about doing something bidirectional?
>> Miguel Ballesteros: Yeah, this is what we are talking about. You can do it bidirectionally. But in this case, we are only [inaudible] words in a left to right fashion. Okay? Sure. Okay.
So these are the experiments that we ran first, the ones for the ACL paper. These are for the English Penn treebank, with the Stanford dependencies. We use predicted part-of-speech tags from the Stanford tagger, at 97.3, which is state-of-the-art accuracy for this treebank. And then we use basically the same settings as Danqi Chen and Chris Manning, which is a feed-forward neural network transition-based parser, a greedy left to right parser, which is the closest published model to our stack-LSTM parser.
So we compare our results to their results. The unlabeled attachment score is the percentage of tokens with the correct head. The labeled attachment score is the percentage of tokens with the correct head where we also take the label into account.
Okay. So as you can see, compared to Chen and Manning we get more than one point here. And for the labeled score, we see more or less the same picture, about one point better. So our parser is basically better in all conditions compared to Chen and Manning.
What I'm going to show you now are ablation conditions, in which we remove the components presented before -- the composition functions, the word embeddings or the part-of-speech tags -- to see how the parser reacts to that.
So we remove the part-of-speech tags, and the parser goes to 92.67 [inaudible] score. People normally take this column into account in order to compare parsers. So 92.57 -- actually 92.6 -- is like one of the best results [inaudible] part-of-speech tags, so this is a very, very strong result.
If you remove the pretrained word embeddings, you see that the parser gets 92.4. You are removing all the semantic information, or all the contextual information, that you have in your word embeddings, and you see that the parser suffers, but it is still very competitive.
And if you remove the composition functions, the embedded partial dependency structures -- and when I say remove, basically instead of keeping the embedding of the partial tree on the stack, we keep only the embedding of the head -- the parser goes to 92.1, which of course goes in line with previous research on transition-based parsing and history-based features. Yes?
>>: Is this cumulative or [inaudible]?
>> Miguel Ballesteros: No, it is not cumulative. It is -- this one --
>>: And you put the parser -- you're saying that came before [inaudible].
>> Miguel Ballesteros: Yes, yes, yes. This is why they are at the same level. But that's true, [inaudible] a little bit confusing. Okay. If we remove the stack-LSTM and we use a stack [inaudible], so instead of using LSTMs we use RNNs, the parser gets 92.3, which is still better than [inaudible] neural network, but of course it is not competitive compared to the stack-LSTMs.
And then we did the experiment with the Chinese treebank. For this Chinese treebank we basically follow Yue Zhang and Stephen Clark, who are the people that started doing experiments with this treebank. We use gold part-of-speech tags, and this is, again, the same setting as Danqi Chen and Chris Manning in their EMNLP paper.
So here we have more or less the same picture. The stack-LSTMs are here, like three points better than the [inaudible] parser. These are the state-of-the-art results for these settings. And, again, for the ablation conditions, you see that there are differences. If you take out the part-of-speech tags, you see a big drop. And this is because we are using gold part-of-speech tags in this case, so here the contribution of part-of-speech [inaudible] is higher.
Okay. And, well, some conclusions for this part. This is the highest result ever published for a greedy parser, for a greedy left to right transition-based dependency parser. It is very close to beam-search or complex graph-based parsers, or even, I would say, better than most of them. So this is actually something very influential. And it runs in linear time with greedy decisions. So you get greedy decisions, you get a parser which is super fast, and it provides one of the best results ever published. Okay. So any questions so far?
>>: I've got a question about the -- several slides ago [inaudible] supplied the part-of-speech tags that were embedded. But, I mean, there's only 40 part-of-speech tags. I don't understand what an embedding of --
>> Miguel Ballesteros: Well, it's a small -- it's a small embedding of a few dimensions. Yeah. It's basically an embedding that we learn [inaudible] train. In some languages you have more than 40. If you go to Korean -- I'm going to show an example [inaudible] result in Korean later -- you have a part-of-speech tag set with basically like 1,000 different tags. You know, like a language in which you have 32,000 or something, but it's 1,000. So you can have something. Okay.
So let's move to the character-based modeling of words. As you remember, coming back to the question earlier, we have an embedding of part-of-speech tags, and then we also have a [inaudible] embedding and a [inaudible] representation of the words from the training corpus.
So we can do something better. We can do character-based modeling of words. So [inaudible] you can also run a bidirectional LSTM character by character over the words of your input sentence: you run it in one direction, then you run it in the other direction, you concatenate these two embeddings together, and you concatenate these with the embedding of the part-of-speech.
The intuition behind this is that you are going to get a lot of suffixes and prefixes from the character-based embedding, and also that you can do this for every word. You don't need to care about whether a word is out of vocabulary or in vocabulary; you can actually do it for every word in your input sentence. And this was published at EMNLP 2015. I'm going to show you the results later.
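A sketch of that character model, with illustrative sizes: an LSTM runs over the word's characters in both directions, and the two final states are concatenated, which works for any word, in or out of vocabulary:

    import torch
    import torch.nn as nn

    char_emb = nn.Embedding(500, 20)      # illustrative character vocabulary
    char_lstm = nn.LSTM(20, 25, bidirectional=True, batch_first=True)

    def char_representation(char_ids):
        # char_ids: list of character indices for one word
        x = char_emb(torch.tensor([char_ids]))   # (1, word_len, 20)
        _, (h, _) = char_lstm(x)                 # h: (2, 1, 25), fwd and bwd
        return torch.cat([h[0], h[1]], dim=1)    # (1, 50) word representation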
So we projected these character-based embeddings of words into a two-dimensional space: we took the embeddings, ran them through [inaudible], and plotted this. And as you can see here, we see a cluster of words that are basically words in past tense for English. You also see a lot of gerunds here, you see some adverbs over here, and you see the same word endings in these words here, and the same word endings in these words here.
So you see a lot of suffixes and prefixes, basically. And this information can be very useful in parsing, because suffixes and prefixes are what give you the most signal about part-of-speech. So by doing this character-based embedding, the intuition and the motivation is that we can improve out of vocabulary handling and we can improve the performance for morphologically rich languages.
So we did this experiment in which we have this baseline model, which is what we call WORDS. I put this here because it's in the tables. And we also have the character-based model, which we call CHARS. This baseline model is the same one I presented results for in Chinese and English before, but removing the pretrained word embeddings. So we don't have additional resources; we just have the embeddings learned from the treebank. Okay?
So we did experiments on some treebanks of the Statistical Parsing of Morphologically Rich Languages shared task. These are basically Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish. As you can see, some of them are agglutinative languages, where you're basically looking at morphemes.
And we also included Turkish, because we know that Turkish morphology has this agglutinative behavior, which is going to be very interesting with character-based modeling.
For completeness, we also ran it with English and Chinese. We don't have any explicit morphological features, just to see the effect of the character-based representations. And we don't use additional resources -- just the training corpus and the parser, and no pretrained word or character-based embeddings. We tried with and without parts of speech as well, to see the effect of character-based modeling of words.
Okay. So if we try with analytic languages such as English or Chinese, you see that the parser with character-based embeddings is better -- only 0.3 for English, and almost one point for Chinese. But of course we didn't expect too much for English or Chinese, because these are analytic languages in which morphology is not playing a big role for syntax.
But if we move to agglutinative languages such as Basque, Hungarian, Korean or Turkish, you see very big improvements. You see how the parser for Basque has more than six points of improvement. For Hungarian it goes up by about eight. For Korean it goes up exactly ten points, actually. And for Turkish it goes up more than three points. So you see how the parser actually behaves much better with character-based modeling of words compared to how it behaves with the word representations.
And we tried with fusional or templatic languages, and we see how the parser improves in all of them. As you can see, this column is already in bold, and we improve in all languages. In Polish we saw a very, very big improvement, and the main reason is that the out of vocabulary rate is super high. So we have basically seen that character-based modeling of words also gives you a lot when you have higher out of vocabulary rates. In some languages you see more or less the same results, such as Arabic or other languages.
That was without part-of-speech tags [inaudible]. When we include part-of-speech tags, we see that the picture changes a little bit. For English and Chinese, you get almost the same numbers, or a little bit better for words. But you see that the big improvements are still in the agglutinative languages -- Turkish, Hungarian, Basque and Korean. Basically, with [inaudible] morphemes and these kinds of things, you still see a very big improvement for parsing. More than two points is a lot. And on average the parser is better compared to the one with the word representations, as you can see here.
Okay. So the conclusions for this: of course, character-based representations are useful. They obviate part-of-speech information for some languages. If you compare this column to this column, in some cases -- you go to Korean, for instance -- it's actually even better when you don't have part-of-speech information. So you are getting better information from the character-based modeling than what you get from part-of-speech tags.
And they are more useful for agglutinative languages -- in our experiments, these four languages -- where you can see a very big improvement. And it's also a nice approach for the out of vocabulary problem we showed in Polish, where, when you have a higher out of vocabulary rate, the parser gets better.
So we have an article submitted to Computational Linguistics, still waiting for the reviews, with all the results that I have presented before. Okay.
So, well, this fall I had the chance to teach a course in natural language processing, and one of the things that I thought about is parsing. So I asked the students to beat the parser. And I had this student who's working with me, and we built this ensembling system by using the outputs of different models [inaudible] stack-LSTM parser. So basically you train several models with the stack-LSTM parser, and you build a [inaudible] system, like this one, okay, in which you count how many parsers think that two given words should be attached together.
And then you run a graph-based parser, a Chu-Liu/Edmonds parser, and the result we got is basically the best result ever recorded for dependency parsing. And we are working on a research paper together on this.
And we can also do this cross-lingually. So [inaudible] was interested in this parser, and we worked on this cross-lingual setup in which we train a language-universal dependency parser on a multilingual collection of treebanks. So basically, instead of training with a monolingual treebank, we're training with several treebanks.
And we also have multilingual word embeddings and typological information included in the parser. And the nice result of this research was that the multilingual training outperforms the standard monolingual supervised training.
This result is submitted to TACL; we expect reviews soon. And basically this is the main table. This is the monolingual training, in which you take a treebank for German, you train a model for German, and you get a result. The result for English is only 88 because this is the universal dependencies treebank; you only have 12,000 words, you don't have the size of the Penn treebank.
And the nice result is here, basically. You compare these to these, and you see that the multilingual training is better than the monolingual training. Basically all these columns are the language-universal setup, in which we train with other languages and evaluate on German. And here is when you include lexical embeddings for the target language, the language [inaudible] information, and fine part-of-speech tags, and you see how the parser gets better.
The implications of this are very big, because if you want to parse difficult languages for which you don't have any data, well, you can use this model, and it is expected to be better, actually, than the monolingual training. And we compared this with the state-of-the-art with the same setup -- so this is when you don't have any data for the target language -- and we are better than state-of-the-art systems in these settings.
>>: I'm sorry, you're saying you don't have any data, so then --
>> Miguel Ballesteros: So you basically -- this is Swedish, so you remove all the sentences from your training corpus of [inaudible], and you train with all the other languages and you evaluate on the [inaudible] set.
>>: So when you say plus lexical, what does that mean?
>> Miguel Ballesteros: It means that you use explicit embeddings for the target language that are trained with -- for this language.
>>: So trained on monolingual data.
>> Miguel Ballesteros: Yes, on monolingual data. Unlabeled data. Which you could take for any language, or for multiple.
>>: I'm sorry, can you go back one slide? When you -- when you supply the language ID, what is that?
>> Miguel Ballesteros: The language ID is typological information, so the language would tell you [inaudible]. But basically it gives you like the -- how the parse -- so how the syntactic information behaves in each language. So you know the left branching, the right branching --
>>: [inaudible] the model?
>> Miguel Ballesteros: Sorry?
>>: How do you give it [inaudible] --
>> Miguel Ballesteros: It's basically like a signal for the -- so you remember the [inaudible] stack-LSTMs? You get like a signal -- another embedding for the language ID.
>>: So you have a set of typological features [inaudible] so on and so forth [inaudible] so you have these K different features --
>> Miguel Ballesteros: And you input an embedding of this to the parser, and the parser makes a better prediction, as you can see. It improves for all languages.
So in some languages you actually see a big improvement. You go to German, and you include the [inaudible] information [inaudible], which is something [inaudible]. The parser, when we didn't have the [inaudible] information [inaudible] -- we included this, and the parser started to do better. Sorry. I think it's [inaudible].
Okay. So now, this is basically all the parsing stuff. As I promised at the beginning, I'm going to talk about something different from parsing. So the question is whether we can model other linguistic structure prediction problems, different from parsing, as sequences of actions.
So I hopefully convinced you that we can do it for parsing, and we can try to do it for other problems. So the question is whether we can model them jointly with syntactic parsing -- having syntactic parsing joint with another task -- or whether we can model a particular problem with its own set of actions. I'm going to give you an example of each of these two questions.
So first, joint syntactic and semantic parsing with stack-LSTMs. I'm going to explain this briefly; it's another paper submitted to TACL.
So we have a joint model for syntax and semantic role labeling, basically using stack-LSTMs. Instead of having one stack, we have a couple of stacks: one that keeps the syntactic information, another one that keeps the semantic information. So in one of them we keep all the [inaudible] that we get from the syntactic stack, and in the other one we have all the semantic roles we get from the stack. And we got state-of-the-art parsing and semantic role labeling from this, by using [inaudible] stack-LSTMs.
So we compare with previous systems. These are the systems of the [inaudible] 2009 and 2008 shared tasks. And this is the closest model published, which is James Henderson's model -- actually the same algorithm, but without using stack-LSTMs, using different neural networks. So we see how we improve [inaudible], especially on the semantic task. Okay.
And now I'm going to talk about one paper that got accepted yesterday at NAACL, NAACL 2016. This is a paper on neural network models for named entity recognition, and one of the things in this paper is a transition-based NER model with the stack-LSTM. These are the two main collaborators on the paper.
So basically we have an algorithm that constructs and labels chunks of input sentences. It's a transition-based approach again, in the same way as the dependency parser. In this case, we have three data structures. I'm going to give you an example of how this works. We have a buffer, we have a stack, and we also have an output buffer that keeps the things that have already been decided. And it's a similar approach to the parser.
So let me give you an example of how this works with the same running example -- Mark Watney, who is tired of being on Mars. So basically you have a buffer full of words, you have the stack, which is empty, and you have your output buffer, which is also empty at the beginning.
So the system decides to make a shift, taking a word from the buffer and putting it onto the stack. Now we can decide to say, well -- sorry, Mark can already be an entity and can go to the output buffer, or we can create a chunk with another thing, shifting another thing onto the stack.
And whenever we have this, again, we can decide to create the chunk between these two words or to shift another thing. In this case, it decides to do what we call a reduce, which is basically taking Mark Watney and creating this chunk. And later I'm going to explain how to put this information into an embedding in a nice way to improve the results.
Of course, when you have a verb, like visited, it's not in an entity, so we can basically get rid of it. And what we do is apply an output transition, which basically [inaudible] throws it to the output buffer.
And, of course, you can also shift Mars onto the stack. And whenever you have this, since there is nothing else, you can decide what the best entity label for this thing is. In this case, it's location. So this gives you the result for this sentence, here.
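Here is a minimal scripted sketch of that transition system (the action names are mine; the real system chooses each action with a classifier):

    def ner_transitions(words, actions):
        # SHIFT builds up a candidate chunk; OUT moves a non-entity word
        # straight to the output; REDUCE-<LABEL> turns the whole stack
        # into one labeled chunk.
        stack, buffer, output = [], list(words), []
        for action in actions:
            if action == "SHIFT":
                stack.append(buffer.pop(0))
            elif action == "OUT":
                output.append(buffer.pop(0))
            else:
                label = action.split("-", 1)[1]
                output.append((tuple(stack), label))
                stack.clear()
        return output

    print(ner_transitions(
        ["Mark", "Watney", "visited", "Mars"],
        ["SHIFT", "SHIFT", "REDUCE-PER", "OUT", "SHIFT", "REDUCE-LOC"]))
    # [(('Mark', 'Watney'), 'PER'), 'visited', (('Mars',), 'LOC')]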
Okay. So the main motivation of this research is that most NER systems use external resources, as you know; people that have worked on this probably know. They look things up in gazetteers and databases, external resources. We can do that, of course -- we can input that into our model -- but we don't want to, because we want to do this also for low-resource scenarios.
So the question is whether we can perform at the same level, or even better, without including any of these resources, such as gazetteers or external databases with information about entities in the world.
So we did experiments with the CoNLL 2002 and 2003 datasets for English, German, Dutch and Spanish, and we only use word-form features. We didn't even use part-of-speech tags, to keep this cross-lingual or multilingual, so that you can actually try it with any language like this.
So this is what the system looks like. As you can see, it's similar to the parser. You have the stack-LSTM for the buffer, you have the stack-LSTM for the reduce operations, you have your stack-LSTM for the output buffer, and you also have the stack-LSTM for the stack. And then you have the softmax over the actions, and you basically decide what to do given the parsing state -- or the system state, in this case.
So in order to come up with a nice way of representing words, what we did here is character-based modeling of words again. We thought, okay, character-based modeling is going to be very useful also for NER, because you also have the capital letters at the beginning. All these things are very, very useful.
And we also have pretrained word embeddings in a lookup table -- basically the same embeddings that we used for the previous publication in parsing, we use here in a lookup table. So for each word we run a bidirectional LSTM. For the word Mars, for instance, we get this. And then we have the lookup table entry for Mars, and we concatenate these two vectors together. And this is our embedding for the system.
And, again, we also have a composition function. So whenever we reduce -- whenever we take a chunk from the stack and put it into the output buffer -- we have to infer some kind of information and put it into some kind of embedding.
So what we did is basically run a bidirectional LSTM over the embeddings of these tokens together. So for Mark Watney, we run a bidirectional LSTM over these two tokens together. And we also include the label -- in this case location, or person, or whatever. And then we get the representation, which is what we call the composition function [inaudible] bidirectional LSTMs.
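A sketch of that chunk composition (the shapes and the way the label enters are assumptions):

    import torch
    import torch.nn as nn

    dim = 50                                  # illustrative token-vector size
    chunk_lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
    label_emb = nn.Embedding(4, 20)           # e.g. PER, LOC, ORG, MISC
    project = nn.Linear(2 * dim + 20, dim)

    def compose_chunk(token_vecs, label_id):
        # token_vecs: (1, chunk_len, dim) embeddings of the chunk's tokens
        _, (h, _) = chunk_lstm(token_vecs)    # final states, both directions
        parts = [h[0], h[1], label_emb(torch.tensor([label_id]))]
        return torch.tanh(project(torch.cat(parts, dim=1)))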
Okay. So these are the NER English results. This is the result of the stack-LSTM system. This is another system that we presented in the same paper, which is actually a little bit better, but at the same level. All the systems with a star are systems that incorporate basically [inaudible] resources or part-of-speech tags, gazetteers, linking, et cetera, et cetera. And our system only uses word forms, nothing else. And as you can see, the results are actually very, very competitive. This is one of the best results actually published for NER.
The same picture holds for other languages. If you go to Spanish, you see how the stack-LSTM system gets almost 84, the best result ever published, and the LSTM-CRF system -- which is a bidirectional LSTM thing that we have in the same paper -- is also a little bit better. And for Dutch you have the same picture. So these are the ones with [inaudible] external resources, and you compare with the rest: we have like the best NER system by only using word forms.
And for German we have the same picture. Of course, we see that the character-based embeddings, as we saw in parsing, are extremely useful. You see that they have a big impact on the results, especially for the stack-LSTM one.
So this is a state-of-the-art NER system. It's linear in the length of the sentence, so it is as fast as what we have with the parser. So basically we can run it in a very fast way and we can produce very nice results.
Again, character-based representations, as happened with the dependency parser, provide very, very useful information for NER, for instance when you have out of vocabulary words. And we are only using word-form features; we are not even including part-of-speech tags in the model. Of course we could have included them -- and when we include them, we improve -- but this was not the point of the system. And we don't have any gazetteers, no external resources.
So now I'm going to briefly talk about another paper that was accepted to NAACL 2016, which is what we call recurrent neural network grammars. This is a top-down phrase-structure parser -- so not a dependency parser, a phrase-structure parser. And we also include language modeling results, and we again use stack-LSTMs.
So the nice thing is that the stack-LSTM for syntactic parsing that I showed you before is a very powerful discriminative model for language, basically. So you can infer syntax in a very nice way. But syntax, as you know, is also very useful for generating language.
Okay. So the idea here is that we can use the same things that we have in the stack-LSTM parser to basically create a generative model that can be evaluated as a language model. So think about it. Whenever we shift a word from the buffer to the stack, we can also predict that word -- we can decide which word comes next. So we can actually use it as a language model.
So we do this in the context of phrase-structure parsing, and we call it RNNG, basically recurrent neural network grammars. This works more or less like this. It's a top-down parser; it's not like a [inaudible] parser that basically goes bottom-up. So you have the buffer here -- this would be the discriminative version of it -- and then you can also shift [inaudible] into your stack. At the end you basically have a phrase structure for the sentence. The nice thing is that whenever you shift, you can also predict the word. And then you can evaluate it as a language model.
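For a feel of the action space, a generative derivation of the running example might look like this (the bracketing is an assumed parse; the probability of the sentence is the product of the probabilities of these actions):

    # NT(X) opens a nonterminal, GEN(w) generates a word, and REDUCE
    # closes the most recently opened nonterminal.
    derivation = [
        "NT(S)",
        "NT(NP)", "GEN(Mark)", "GEN(Watney)", "REDUCE",
        "NT(VP)", "GEN(visited)",
        "NT(NP)", "GEN(Mars)", "REDUCE",
        "REDUCE",
        "REDUCE",
    ]   # yields (S (NP Mark Watney) (VP visited (NP Mars)))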
And these are the language model perplexity results that we got for the English Penn treebank. This is a sequential LSTM, basically an LSTM over the [inaudible] previous elements. We get better results by including the syntactic information of the stack-LSTMs, for both English and Chinese. As you can see, we have an improvement here and here.
And the nice thing is that we can also evaluate this as a parser, because it is a parser. If you go to the F1 scores for the English Penn treebank, you get, without external resources, 92.4, which is state of the art. It's one of the best results ever published for this treebank. And for Chinese it's close, but a little bit worse -- [inaudible] around 83.2 or something like this. But it is still a very, very competitive result, taking into account again that it's greedy, left to right, and you don't need any external resources.
And this is something where we are starting more ongoing work. We are thinking about a transition-based approach to machine translation in which, well, as I said at the beginning, with a stack-based parser and a swap operation you can actually come up with any kind of reordering. So you have a buffer -- yes?
>>: You can do [inaudible] reordering.
>> Miguel Ballesteros: You can do any reordering.
>>: We'll have to talk about that.
>> Miguel Ballesteros: I can show you. So basically you have -- well, it's hard because [inaudible] you can grow, but you can come up with most of the reorderings. If you go to some language pair such as Japanese-English, it's going to be more difficult.
>>: But the restrictions -- yeah, anyway, this is a total minor detail.
Please go on.
>> Miguel Ballesteros: Sure, sure, no problem. Anyway, so basically we are already starting on this; we have started thinking about this, looking at the results of our RNN grammars, where the language modeling results are super nice. And we can also do transition-based parsing and non-projective parsing, in which we swap words and do reordering.
What if, instead of shifting a word from the source language, we shift a word from the target language? So we make a prediction of the target language. Basically we are thinking about this, and I'm helping supervise a student on this.
And basically the final goal is to do a complete machine translation system that runs in linear time. This is a very big project, a very big goal, but we are trying to do it, and I believe it's possible.
It might also be interesting to do it jointly with parsing, so that whenever we see [inaudible], we also produce a syntactic parse. Using these embeddings can be useful for machine translation, because we know syntax is also useful for translation. So this is something we are trying to do right now.
So in conclusion -- yes, sorry.
>>: Sorry, you were saying about linear time. I mean, that Montreal LSTM model also runs in linear time, no? Not a very large constant, but it's also linear, right?
>> Miguel Ballesteros: Yes, yes. It is. Definitely. In this case the idea will be to do everything in [inaudible].
>>: All right. Yeah, but that's sort of what the Montreal [inaudible].
>> Miguel Ballesteros: Okay. Okay. Okay, so, in conclusion for this talk: I presented powerful sequential models for natural language processing, basically all of them based on transition-based approaches with stack-LSTMs. I presented a state-of-the-art transition-based parser, which is getting a lot of attention from the community. It's good both in terms of results and speed because, well, it runs in linear time with [inaudible] and produces one of the best results ever published.
I also presented a fresh new state-of-the-art transition-based NER system that produces very high results without using external resources and that can be extended to many languages and many tasks by using the character-based embeddings of words. I presented how we can do extensions for language modeling by using a shift-reduce approach in which, whenever we use shift, we also predict the word from the buffer, and we can use this as a language model.
And I showed that we can produce state-of-the-art results also for language modeling compared with a powerful LSTM. And also [inaudible] extensions for semantic role labeling, in which you can do joint syntax and semantic role labeling with stack-LSTMs. Sorry.
I would like to acknowledge Noah Smith and Chris Dyer, who are [inaudible] most of these papers and collaborators on most of my [inaudible] papers. And I would like to thank you, all of you, for your attention. Do you have any questions?
[applause]
>> Miguel Ballesteros: No more questions? Yes.
>>: There was one of these many papers where you had an output, an input and a stack, and it sounded like you said you used a stack-LSTM for each of them, as opposed to --
>> Miguel Ballesteros: So basically -- so the parser, you mean?
>>: I'm just curious.
>> Miguel Ballesteros: So, for instance, this?
>>: Yes.
>> Miguel Ballesteros: So it's a blend with stack-LSTMs, yes. So basically, in the case of NER, again, you always do push, push, push for the buffer, for instance; you push all the words, and then you start popping things. Whenever you shift something from the buffer to the stack, you [inaudible].
>>: I see.
>> Miguel Ballesteros: But in the case of transition-based parsing -- let me go back. So in the case of transition-based parsing, this is the same. We fill this with all the words at the beginning, and then we basically shift things. Whenever we shift, we pop. But if you do non-projective parsing with a swap operation, you basically take this word and you put it back here, so you also do pushes during parsing time. So this is [inaudible]. But this one [inaudible] can be modeled with an LSTM; this is basically the history, and you are always pushing things. You never pop [inaudible] stack-LSTMs.
>>: Great. Thank you.
>> Miguel Ballesteros: Okay?
>> Chris Quirk: Thanks, everyone.
>> Miguel Ballesteros: Thank you.