>> Matthew Hohensee: Okay, so this paper is called "Getting More from Morphology in Multilingual Dependency Parsing." This is based on my Master's thesis research at UW in CLMS with my adviser, Emily Bender, and we're also presenting it at [Inaudible] next month.
So I'll start with our research question: can we use linguistic knowledge of morphological agreement to improve dependency parsing performance on morphologically rich languages? I'll talk a little bit more later about the motivation for that question. But the basic idea is that we have all these languages that have a lot of morphology, suffixes, prefixes, other kinds of morphological markers, and we want to see if we can make use of that intelligently. That question is actually what came first, and then I decided that dependency parsing would be a good place to try to apply it.
So this is a little overview of what we did -- and I'll go into much more detail on all of this. We developed a simple model of morphological agreement, added it to a dependency parser called MSTParser and tested that on a sample of 20 treebanks. We were able to get accuracy improvements of up to 5.3% absolute accuracy. We found out that some of that was due to our model capturing things that weren't actually related to the agreement we were trying to model, so when we controlled for that the improvements were about 4.6%, which is still pretty good.
Outline of the talk: I'll give some background on agreement, on the CoNLL shared tasks which kind of set the stage for this kind of work, and on other related work and the parser; then methodology, some of the changes that we made to the parser and how we implemented that, and the data we collected and how it was prepared. And I'll talk about our experiments and results and conclude.
So to give some background on morphological agreement, there's this idea of morphological complexity, which varies widely across languages and refers to the number of different forms that a word can take depending on case, number, tense, all these different factors. At one end of this spectrum we have morphologically poor languages. A canonical example is Chinese, which doesn't really have any morphology. Words don't really take different forms depending on number, on person, any of that. English has a small amount of morphology. We conjugate verbs in a few cases; pronouns can be different depending on case. Nouns are different if they're plural. The other end of the spectrum is morphologically rich languages, or synthetic languages. A canonical example there is Czech, which has I think four genders and seven cases, and generally nouns can take forms for all combinations of those; although some of them are sort of collapsed, so that actually not all combinations are represented.
That's a little example with the same sentence in Czech and English, and I can't pronounce the Czech. But you can see that the adjective, which is the first word, is inflected as feminine, plural and nominative case. The noun: feminine, third-person, plural, nominative case. And then the verb at the end: third-person, plural and present tense. In the same sentence in English we only have a little bit of inflection: the noun is plural, and it's sort of inherently third-person, although that's not really marked in any way. And the verb is plural, because if it were singular it would be grows, not grow.
The CoNLL shared tasks in 2006 and 2007 focused on multilingual dependency parsing. The organizers collected about 12 or 13 treebanks which were parsed and annotated, and those were distributed. The participants trained systems on those and tested them on test sets that the organizers provided. The way the parsing was set up, they would start with tokenized and POS-tagged sentences which also have morphological information annotated on them, and predict the head and an arc label for each token.
So this is some sample data of what that looked like. For that sentence we have, for each word: the ID, which is just the index of the word in the sentence; the form, which is just the word; a lemma, which -- not all of the treebanks have lemmas in them, but some of them did, so we ended up using them where they were there. Then there are two POS fields, coarse and fine, and again not all of the treebanks had two POS fields. For instance, in the English Penn Treebank there's only one POS tag. And I'll talk a little bit more about this later, but we ended up generating coarse POS tags for all of those so that we would have two tags for each word.
There is a field with morphological information, which is the most important one for this work. And as you can see, it just includes information about any inflections or inherent sort of morphological information about the word. The name John is obviously third-person and singular, and pizza is singular. That's about all we have in this example.
And then for each token we have the head and the arc label, or the relation type. So just to really quickly go through this: in dependency parsing we're just looking at the head for each token, and everything is basically headed by the finite verb. So here, for instance, John is the subject. That's headed by ate, which is the finite verb, and that is in a subject relation.
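As a concrete illustration of the format just described, here is a minimal sketch of reading one token line with the fields mentioned above (ID, form, lemma, coarse POS, fine POS, morphological features, head, arc label). The exact column layout of any given treebank differs, so treat this as an assumption rather than the authors' actual code.

    # Sketch: parse one tab-separated token line into a dict. The column order
    # here (a subset of the CoNLL-X fields) is an assumption for illustration.

    def parse_token_line(line):
        cols = line.rstrip("\n").split("\t")
        feats = {} if cols[5] == "_" else dict(
            kv.split("=", 1) for kv in cols[5].split("|"))
        return {
            "id": int(cols[0]),        # index of the word in the sentence
            "form": cols[1],           # the word itself
            "lemma": cols[2],          # lemma, "_" if absent
            "cpos": cols[3],           # coarse POS tag
            "pos": cols[4],            # fine POS tag
            "feats": feats,            # e.g. {"person": "3", "number": "sg"}
            "head": int(cols[6]),      # index of the head token (0 = root)
            "deprel": cols[7],         # arc label, e.g. "SBJ"
        }

    print(parse_token_line("1\tJohn\tJohn\tNOUN\tNNP\tperson=3|number=sg\t2\tSBJ"))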
So what was found in the CoNLL tasks was that morphologically rich languages were the most difficult to parse. A lot of the parsers rely really heavily on word order, and as a sort of high-level generalization, morphologically rich languages tend to have more flexible word order, because you mark the constituent roles with morphology rather than with word order. So that was kind of a generalization that they came up with, and then there's this quote from the organizers saying that one of the most important goals for future research is to develop methods to parse morphologically rich languages with flexible word order.
So that's kind of the starting point for this work, what we wanted to try to tackle. This is a little summary of different approaches to using the morphological information, and a lot of these citations are participants in those CoNLL shared tasks -- not all of them, but most of them.
They don't fall too neatly into language-independent and language-specific, but I kind of tried to group them that way. So if we're looking at these treebanks and looking at a potential arc between a head and a dependent, and each of those tokens has morphological attributes, we can take each of those attributes separately and include that as a feature for the token. We can keep the entire list of attributes: for instance, if there's person, number and gender all marked, keep that as one single feature, which makes it a little bit more like a really finely-grained POS tag. Or we can take all the combinations of the head and dependent attributes. So if they each have three, then we'll combine those in all different ways and use each of those as a feature.
As for more language-specific approaches, we can use morphological information on a token to pick out function words or finite verbs, basically to supplement the POS tags if we need more specific information than we can get from the POS tags, adding detail to other features like POS tags. And then there was maybe one other approach that actually decided to model agreement. So if two words are marked the same way for an attribute, for instance they're both plural, generate a special feature for that. And that doesn't have to be language-specific, which is kind of the idea of our research, but it was done by these people in a language-specific way. So I think they only chose a couple of languages and modeled a few specific types of agreement.
In other related work, there has been a lot of work involving MaltParser, which is not the parser we used. But that's been tested on a lot of different treebanks and a lot of languages, and they've consistently found that incorporating the morphological information can give boosts in accuracy. So that showed us that there was promise in using this morphological information.
A little information on MSTParser; there's a citation. They use a machine learning approach: look at each arc between a potential head and dependent, and there's a whole group of features that are generated -- I'll talk about the features on the next slide. They enumerate features at the arc level, estimate feature weights and save all those. And then to decode they just find the highest-scoring parse based on those feature weights.
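To make the arc-factored idea concrete, here is a minimal sketch, not MSTParser itself: each candidate arc gets a score that is the dot product of a weight vector with that arc's features, and the parser then searches for the highest-scoring tree (MSTParser uses Eisner's algorithm or Chu-Liu/Edmonds for the search, which is omitted here). The feature names below are illustrative assumptions.

    # Sketch of arc-factored scoring: score(h, d) = w . f(h, d).
    # Finding the best tree over these scores is the parser's job; only the
    # scoring step is shown, with made-up feature templates.

    def arc_features(sent, h, d):
        # sent: list of dicts with "form" and "pos"; h, d: 1-based indices (0 = root)
        hpos = "ROOT" if h == 0 else sent[h - 1]["pos"]
        dpos = sent[d - 1]["pos"]
        direction = "R" if h < d else "L"
        return [f"hp={hpos}|dp={dpos}",
                f"hp={hpos}|dp={dpos}|dir={direction}",
                f"dist={abs(h - d)}"]

    def arc_score(weights, sent, h, d):
        return sum(weights.get(f, 0.0) for f in arc_features(sent, h, d))

    sent = [{"form": "John", "pos": "NNP"}, {"form": "ate", "pos": "VBD"},
            {"form": "pizza", "pos": "NN"}]
    weights = {"hp=VBD|dp=NNP|dir=L": 1.5, "hp=VBD|dp=NN|dir=R": 1.2}
    print(arc_score(weights, sent, 2, 1))  # score of the arc ate -> John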
And linguistic knowledge here is incorporated mostly via that feature design. So the way we were trying to add some more linguistic knowledge was by tinkering with the feature templates. These are the features that are used by MSTParser out of the box. There are lexicalized parent and child features, and a lot of POS tag features involving the word order, the parent, the child, surrounding words, intervening words. And then this is a list of the morphological features that it generates, which -- I guess that's a lot to take in. Basically what that all is, is the indices of the head and dependent and various combinations of the word forms, the lemmas, the direction of the attachment and the distance to the head or to the dependent. So here's a little summary of our agreement model.
We decided to look at two tokens and add an agreement or disagreement feature if the head and the dependent are both marked for the same attribute. In other words, if we're looking at, say, a noun and a verb and they're both marked third-person, then we would just generate a feature that says, you know, "This noun and this verb agree in person, and they are related." In the case that an attribute was not matched on the other token, we added an asymmetric feature. That just said, "Well, we have a noun and a verb. The noun is marked for some attribute, and the verb is not marked for that at all." So this approach is language-independent. We didn't really use any information about the languages; the treebanks were already annotated with morphological information that was generally in a format we could just use straight out of the treebank. It represents a kind of backoff: we're discarding a little bit of information, which is the actual value of the attribute. If, say, the noun and the verb are marked third-person, it doesn't matter that it's third-person; what matters is that they agree. So we're just trying to save the useful information here. And it's high-level, so we're letting the classifier use agreement as a feature rather than having to discover agreement relationships as it goes through all the features.
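Here is a minimal sketch of that agreement/disagreement feature generation under one reading of the description above; the feature name strings are illustrative assumptions, not the paper's exact templates.

    # Sketch: generate agreement / disagreement / asymmetric features for a
    # candidate head-dependent pair, given attribute -> value dicts per token.

    def agreement_features(head_feats, dep_feats):
        feats = []
        for attr in set(head_feats) | set(dep_feats):
            if attr in head_feats and attr in dep_feats:
                if head_feats[attr] == dep_feats[attr]:
                    feats.append(f"AGREE:{attr}")       # both marked, same value
                else:
                    feats.append(f"DISAGREE:{attr}")    # both marked, different value
            elif attr in head_feats:
                feats.append(f"HEAD_ONLY:{attr}")       # asymmetric: only head marked
            else:
                feats.append(f"DEP_ONLY:{attr}")        # asymmetric: only dependent marked
        return feats

    verb = {"person": "3", "number": "pl", "tense": "pres"}
    noun = {"person": "3", "number": "pl", "case": "nom"}
    print(agreement_features(verb, noun))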
These are our feature templates, and I'm a little low on time so I'll skip those. Sample sentence and features generated: that's a sentence of Czech, and that's just a list of the features we generate -- agreement features whenever two tokens agree, and asymmetric features when they don't. This is the list of the treebanks that we collected. There's a range there: Hindi-Urdu has an average of 3.6 morphological attributes per token. And we included the Penn Chinese Treebank sort of as a reminder that there's this whole set of languages that don't really have any morphological information at all. And then there's a range in between.
To prepare the data -- and I'll move pretty quickly through this -- we normalized the coarse POS tags. There is a paper by Petrov et al. that suggested a universal tag set. So in the case where there was only one set of POS tags, we generated these coarse tags, and if there were already coarse tags, we normalized them to be from the same tag set. We normalized the morphological information so that it would be useful to us, basically just into the form of attribute and value -- for instance, case equals nominative or gender equals feminine. And we generated morphological information for the English Penn Treebank, which doesn't have any. And we randomized the sentences, used five-fold cross-validation and averaged accuracy, run time and memory usage across the five folds. We can skip the projectivity thing for now. And we ran the whole parser on each treebank four times: once with the original features, sort of out of the box, once with just our features replacing the original features, once with both sets and once with neither.
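As an illustration of the kind of normalization described in this section, here is a minimal sketch; the fine-to-coarse tag mapping is a tiny made-up subset in the spirit of Petrov et al., not the actual conversion tables used in the work.

    # Sketch: normalize a treebank's morphology string into attribute=value
    # pairs and map a fine POS tag to a coarse "universal" tag. The mapping
    # below is a small illustrative subset, not the real Petrov et al. tables.

    FINE_TO_COARSE = {"NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
                      "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
                      "JJ": "ADJ", "RB": "ADV"}

    def normalize_feats(feat_string):
        # "case=nominative|gender=feminine" -> {"case": "nominative", ...}
        if feat_string in ("", "_"):
            return {}
        return dict(kv.split("=", 1) for kv in feat_string.split("|"))

    def coarse_tag(fine_tag):
        return FINE_TO_COARSE.get(fine_tag, "X")   # "X" for anything unmapped

    print(coarse_tag("VBZ"), normalize_feats("case=nominative|gender=feminine"))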
So in MSTParser, the word order and POS tag features were always retained; it's just the morphological features that we were swapping in and out, our version versus the original version. This, again -- I have a graphical summary of this on the next page. But what's important to point out: this is the twenty treebanks and the four feature configurations. And what I want to point out especially is that the run time and the number of features are roughly the same; this is no morphological features, and this is our set over here. So again the number of features is roughly the same for our version, which is "agr", the agreement model, and for no morphological features. The original version has much higher feature counts and run time because so many features are generated. And we were able to cut down the size of the feature set a lot.
This presents our accuracy results. The way to read this is: "no morphological" is not the original; it's the version of the parser with no morphological features at all. And that's the blue bar. And then the improvement due to adding each of the other feature sets is on top of that. So in every case the yellow one is the highest, even if it's only by a sliver, and that's the accuracy using only our feature set. And that performed better than using both feature sets, I think mostly because there were so many features generated by the original set that it was tending to swamp the classifier.
Moving pretty quickly through this, this is just sort of what our results generally look like. I should move ahead. The Hebrew dataset we used comes with both gold and automatically-predicted morphological information, so we ran it on both and basically found that our feature set was a little bit more robust to using the predicted tags and annotations than the original feature set was. The original feature set lost close to 3% and ours was closer to 2% when using the predicted tags.
I think maybe I will skip over this so we have time for some questions. To summarize it really quickly: we found that our feature set was capturing some information that was not related to morphology and that was improving accuracy, because the original feature set didn't have any feature with just POS tags and the arc label. So we compensated for that and improved the baseline a little bit. Controlling for that, these are our new results, which are due just to the agreement model and not to that side effect we were capturing with our feature set.
So it decreased our improvement numbers a little bit. Comparing to the previous slide, there's a little bit more of a correlation here between the amount of improvement -- the top of the yellow bar compared to the blue bar -- which means that we're getting less improvement due to our features, but the amount of improvement we're getting is more closely correlated with the morphological complexity of the language, which is -- I didn't mention that -- the X-axis. So on the left Hindi is the most morphologically rich language, and Chinese at the other end is the least. So we looked at the correlation statistically between the morphological complexity of the language and the improvement we were getting, and calculated the correlation coefficient for that, Pearson's r.
And once we compensated -- once we controlled for that PPL effect -- using just our feature set, we got about .75, which is a much stronger correlation, which is what we'd expect. Before controlling for that effect we were getting a much weaker correlation, because that effect, which I didn't go into in great detail, was, I think, obscuring the correlation between the complexity of the language and the improvement that we were getting.
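For reference, here is a minimal sketch of the kind of correlation computation being described: Pearson's r between a per-language morphological complexity score and the accuracy improvement. The numbers below are made up for illustration only, not the paper's data.

    # Sketch: Pearson's r between morphological complexity and accuracy gain.
    # The two lists below are toy values, not results from the paper.
    from math import sqrt

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    complexity = [3.6, 2.8, 2.1, 1.4, 0.3, 0.0]   # avg. morph attributes per token
    improvement = [4.6, 3.9, 2.8, 1.5, 0.4, 0.0]  # absolute accuracy gain (%)
    print(round(pearson_r(complexity, improvement), 3))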
So, future work: these are some things we want to work on in the future, and I want to try to get to some questions. The answer to the question we asked at the beginning is yes: we were able to improve parsing performance on all the languages with morphology and reduce feature counts and run times. And I guess we have maybe a minute or two left. So....
[ Applause ]
>> : Maybe a question or two?
>> : So you were using gold standard morphological tags, and you showed a comparison on Hebrew with automatically extracted ones. But Hebrew was one of the least morphologically rich languages that you tested. So, I mean, how viable is automatic extraction of these morphological features for more morphologically rich languages?
>> Matthew Hohensee: That's a good question. This was the only dataset we had that that was available for, so it would've been interesting to try it on the more complex languages. We did find that the no-morph version here actually had the worst performance. So when there were no morphological features, that's when we got the biggest hit from using the automatic tags. So that implies it was more the POS tagging, not the morphological annotations, that was decreasing the accuracy. But that's a good point.
>> : So I assume you did this on first-order features and decoding?
>> Matthew Hohensee: Yes. So MSTParser has second-order features about
POS tags and word order. We didn't do any -- None of our morphological
features were second-order; they were all first-order.
>> : Would you expect the same improvement? Does it actually cover
something that maybe the second-order features might cover if you just
had, you know, agreement features?
>> Matthew Hohensee: Right. That's a good question. Yeah. I don't know.
>> : All right. Thanks.
[ Applause ]
>> Matthew Hohensee: I don't really know what we have to do here.
>> Will Lewis: Okay, so for our second talk we have Anthony telling us
about Hello, Who is Calling? Can Words Reveal the Social Nature of
Conversations?
>> Anthony Stark: Thank you. A bit of a strange title. At its core this was an initial pilot study testing whether there was utility in using [Inaudible] for downstream processing, in particular whether we could get clinical applications running out of this. And it was done as a collaboration between the engineering and clinical sides.
So overall, the biggest possible picture of what we're trying to -- oh, I'll go over the outline of the talk first. I'll give you a brief motivation and an overview of the data that we collected, the corpus, the ASR that we used, and I'll also describe the engineering problem and of course some of the experimental results on top of that.
So at the biggest possible scope, what we're interested in doing is inferring the preferential, social and cognitive characteristics of an individual, and the best possible way to do that is to directly probe someone's brain. Now, that's not too far-fetched -- you've got MRIs -- but in a lot of applications that's not really practical. So we usually have to look at that through some sort of filter, and obviously speech is a pretty direct model of what people think, how they think and their cognitive capacity. And there are really two possible ways that you can look at that: you can go directly to the person and ask them their opinions, how healthy they are, etcetera, or you can observe them in their natural environment. And depending on what you really want out of your [inaudible] -- whether you want subjectivity and naturalness, whether you have design constraints, whether that person has time constraints -- you have to pick one or the other. And obviously for this study we picked the latter, where we're directly observing natural speech in the natural environment.
So what would the role of automation be in this? Well the biggest one
of course is it's a lot cheaper. If you've got a thousand hours of
recorded audio, it becomes very expensive and time-consuming if you
want to do the low-level human transcriptions and processing on this.
And the other important thing especially for our case was the privacy
constraints. If you're recording these people in their natural
environment, it could be very privacy-sensitive. And so we don't really
want people looking through these conversations.
So the current situation of the whole language processing field is that it's quite mature in separate areas. At the lowest level we've got ASR systems trained on thousands of hours of audio, and at the other end of the system we've got fairly mature text analysis. And even on top of that we've got a lot of psychology studies that show correlates between various behaviors and how language is used.
So unfortunately the literature is kind of sparse on testing that [inaudible], whether there is utility in using ASR transcripts right down at the end to see if you can find clinical outcomes. And really, word error rate is not the end goal here. So the big question of this study was to see how sensitive our application was to word error rate, whether we could deal with 10% word error rate or even 50% word error rate.
So the specific application that we're interested in is looking at elderly individuals to track cognitive health, health outcomes. The overall goal was to track these people and track their language as they age and whether they develop dementia. So on the clinical side of it, optimistically, we would want to provide a fully automated pipeline for diagnosing these individuals, or at least flagging them when they're possibly susceptible to developing dementia. But more pragmatically we're just trying to find correlates and possibly provide a first-pass screening tool, to see if we can map dementia onto the natural language record.
On the actual engineering side of things, which is what the rest of my talk will mostly be about, we're looking to see if we can efficiently, reliably and anonymously infer the type and frequency of social conversations. And as an additional point: are there any markers that are unique to the automated route?
So the corpus that we collected is focused on the telephone usage of these individuals, and they're all healthy, 75 years or older. The important thing here is that these are not subjects that we collected in isolation. They are part of an OHSU cohort that is already being studied for Alzheimer's, so there's a parallel data series of MRIs, clinical assessments of their physical health, mental health and activity reports -- very dense longitudinal records. And the overall goal is to sort of do a backtrack of health outcomes: track these people one year into the future, five years into the future, ten years into the future, see what sort of health outcomes they have, and then go back to the language record and see if we can find correlates or at least markers that give a good indication.
So the core of the study was to propose an additional telephone series
on top of this, and on top of the acoustic information, the actual
audio channel, we record the surface information: the call times,
durations, the numbers that they called, etcetera, etcetera.
So the reason why we actually started probing the telephone conversations is, firstly, that it gives a direct insight into the capacity for independent living, which is one of the main clinical outcomes that we're trying to look at. Secondly, it's quite a convenient channel to look at: they purposely use a telephone, and it has a fairly high SNR, relatively speaking. And this age demographic still uses telephones -- probably the only demographic that still does -- and they use them quite a lot, as I'll soon show you.
So our initial run was ten individuals over twelve months, and that was about twelve thousand individual calls. It worked out to be three or four calls per person per day, so it's quite extensive usage. And the interesting thing is we recorded all telephone conversations, all incoming, all outgoing, for twelve months, so it's a very dense record. On top of that we did enrollment and exit interviews, just so that we could get a few baselines for our ASR system. Currently we're collecting additional homes, so we've enrolled 45 more people and it's up to, so far, fifty thousand conversations and about two and a half thousand hours. So it is quite a large corpus at the moment.
So I mentioned before that we're trying to see what sort of utility we can get out of ASR. The ASR system we developed is nothing too special; there are no frills. We just wanted a standard, sort of out-of-the-box ASR system. So it's a standard Switchboard/Fisher maximum likelihood system; it's got some adaptation, speaker adaptation, a trigram model. And a very, very conservative estimate would be 30% word error rate, so that would be a floor. More realistically, it's probably pushing about 50% word error rate. So you're thinking about one in every three words is just completely incorrect, or one in every two words. So it's still a pretty hefty error that you'll get out of this sort of domain.
So, to the actual engineering problem that we had. The telephony surface records provide you a lot of useful information -- what number they called, how often they called that person -- and you can develop social networks out of this information. Unfortunately a lot of the high-level labels are still missing or incomplete. What I mean by that is: are they calling family members, are they calling friends, what sort of communications are they having? Is it just straight-up formal business communication or is it more friendly social chit-chat? And that's what we're really trying to get at with this data set from the clinical side of things.
So as I said, ASR error is pretty significant on this sort of open domain. And what we're trying to determine is: what is the sensitivity of the downstream analysis to this error? What features are robust? What inferences can we still make?
So our primary goal here -- we sort of did contrive some classification experiments: whether we could determine business calls from residential calls, whether we can tell family calls from non-family calls, and whether we can tell if they're calling a familiar contact or someone that they haven't called before. And also an additional one: trying to tell family calls apart from just friends and other residential lines.
So what we needed to do was process raw transcripts in a useful and an
anonymized format. So the only way that we managed to collect this
information was that we promised not to directly listen to their
telephone conversations, and we promised not to directly read
transcripts of these conversations.
So the first sort of classifier that you can think of is just straight-up n-grams, the content of the words. After doing a little bit of normalization on the transcripts, if we remove stop words and also get rid of the rare tokens, we can form a pretty good baseline system. So, just some pre-processing: we built 50/50 partitions and we used some cross-validation training.
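Here is a minimal sketch of that kind of bag-of-words baseline using scikit-learn; the specific vectorizer settings (English stop-word list, minimum document frequency) and the SVM choice are assumptions standing in for the preprocessing described, not the authors' exact setup, and the texts and labels are toy placeholders.

    # Sketch: unigram bag-of-words baseline over (noisy) ASR transcripts.
    # Stop-word removal and rare-token pruning approximate the preprocessing
    # described in the talk; the data below is made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    texts = ["hello this is doctor smith s office calling about your appointment",
             "hi grandma it s me just calling to say happy birthday love you",
             "your prescription is ready for pickup at the pharmacy counter",
             "hey it s your sister are you coming over for dinner on sunday"]
    labels = ["business", "family", "business", "family"]

    clf = make_pipeline(
        CountVectorizer(stop_words="english", min_df=1),  # e.g. min_df=3 with real data
        SVC(kernel="rbf"))                                # non-linear SVM
    print(cross_val_score(clf, texts, labels, cv=2).mean())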
So the n-grams surprisingly give very good performance, classification-wise. And they're all around about the same sort of region here, but you can see for the business and residential lines we can classify with about mid-80% accuracy, which is quite good for possibly 50% word error rate. The other tasks are a little bit more difficult -- telling family members apart from non-family members -- but still relatively high accuracy, well above the 50% chance baseline.
In terms of the actual n-grams, the unigram actually did the best, which is not too surprising. That was about a ten-thousand-word dictionary. Bigrams do a little worse, and trigrams really start dropping. That's not too surprising; if you have a 30% word error rate, that's going to give pretty bad trigrams.
Some other somewhat surprising results: the linear SVM did a little worse than the non-linear [inaudible]. And if we add in surface features -- the time that they made the call, how long the call was, etcetera -- they added very little classification utility. So it looks like you can get most of the contextual utility out of just the ASR transcripts, whether they're error-full or not.
So I sort of papered over how I truncated a lot of those unigrams. Essentially, if you've got a very poor estimate -- if a word appeared in only one or two conversations, so a rare word -- we dropped it. If you don't do that, you get very bad accuracy; you instantly take a 15 to 20% hit.
So you can ask the question: what sort of additional robustness can you gain through some sort of linear dimension reduction? And there are a few flavors that you can look at there. You can do a priori mapping, and you can do two types of data-driven mapping: supervised and unsupervised. For the a priori mapping we went to a social psychology study; it's called the Linguistic Inquiry and Word Count, and it sort of tries to map words down into sixty-ish categories of happy words, activity words, negative-emotion words, etcetera. And it didn't do particularly well; it wasn't really built on any sort of robust mathematical reduction.
If we move towards data-driven methods, we looked at Latent Semantic Analysis, and that did much better, even if you reduce it down to about ten semantic features. And if we use a supervised approach -- mutual information, just pruning out the features that have low mutual information -- you get even better results. So with a dictionary of 250 words, you can still tell these classes apart quite well. And what that shows is there's great utility in using ASR, because you can collect a lot of information, and the data-driven techniques do work quite well.
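A minimal sketch of the supervised reduction just mentioned, using scikit-learn's mutual-information feature selection; keeping the top-k words mirrors the 250-word figure quoted in the talk, but everything else here (toy data, vectorizer settings) is assumed.

    # Sketch: supervised dimension reduction by mutual information --
    # keep only the k unigrams most informative about the call label.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    texts = ["calling about your account balance and billing statement",
             "hi sweetheart just checking in how are the grandkids doing",
             "this is the clinic confirming your appointment on tuesday",
             "hey it s mom call me back when you get a chance love you"]
    labels = [0, 1, 0, 1]          # 0 = business, 1 = family (toy labels)

    X = CountVectorizer().fit_transform(texts)
    selector = SelectKBest(mutual_info_classif, k=5)   # e.g. k=250 with real data
    X_small = selector.fit_transform(X, labels)
    print(X_small.shape)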
So this raises another question: the ASR transcripts are clearly not accurate, but they do seem to be quite consistent. So even if they don't tell you exactly the right content, they are consistent enough that you can train a decent classifier. So another question you can ask is: will structure-based features work? Part-of-speech tags -- would those work on a 50% error rate transcript? And it turns out they actually do quite well. If we use a part-of-speech bigram, you do take a big hit relative to the content-based features, but still there's a surprising amount of utility there. And that actually does better than the psychology dictionary in a lot of cases. And that's a pretty crude representation of a whole conversation.
If you try mixing it in with the content-based features, it generally doesn't work; the content features pretty much supersede everything else here. So now that we've got a good baseline, you can start asking a few questions about how these conversations are structured. An interesting one that we looked at is: where does the actual utility come from? Does it come from the start of the conversation? The end? Or is it randomly distributed all over? So on this plot, the first bar of every group is samples taken from the start of the conversation, the middle one is from the end, and the third bar is just a random segment. And we ramp up the size of the segments. So you can see that with a 30-word opening, you get pretty much all of the utility straight away. This is the business/residential classifier, by the way.
The end of the conversation or random segments of the conversation are not too good until you get up to very large sample sizes. So I just showed there that you can get good classification with a forced 30-word window, and that's sort of contrary to the general results that show performance correlated with conversation length. You'd normally assume that if you had a long conversation, you'd have very robust features and you should get better results. But it turns out the openings of short and long conversations are fundamentally different. So if we look at the accuracy stratified by conversation length -- this is the straight-up unigram -- you can see the trend: very short conversations [inaudible] poor accuracy at 75.8, long conversations quite good accuracy at 93.8. The interesting thing is that if we truncate down to a 30-word window at the start, you still get the same picture. So if someone is going to have a long conversation, for example, they just more clearly enunciate their reasons for calling.
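As a trivial sketch of the truncation being described (the exact window handling is an assumption), taking only the opening of each transcript before featurization could look like this:

    # Sketch: keep only the first N words of each transcript (the "opening")
    # before building features, to test where the classification utility lives.
    def opening(transcript, n_words=30):
        return " ".join(transcript.split()[:n_words])

    calls = ["hello this is the bank calling about recent activity on your card " * 5,
             "hi honey it s dad just wanted to hear your voice and see how you are " * 5]
    truncated = [opening(c) for c in calls]
    print(len(truncated[0].split()))   # 30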
So, to sort of wrap up: we looked at whether there was merit in using ASR transcripts, and it does turn out that you can derive quite a lot of utility even if you have very high word error rates. So fidelity doesn't seem to be the prime mover here; it's the consistency of your recognizer. And you can make inferences with surprisingly small samples -- a 30-word opening with a 1000-word dictionary. And as future work, we have yet to look at lower-level acoustic features like [inaudible], talking rate, etcetera. All right.
[ Applause ]
[ Silence ]
>> : Hi, did you take a look at the errors to see whether the recognition is consistently missing words -- like, I'll never get the word school out of the recognition? Or is it like half the time I get school, half the time I miss something else, and they're kind of...?
>> Anthony Stark: Not directly. The way that we measured word error rate is we enrolled them in an interview, and that was very structured. And those were the only transcripts that we were allowed to look at. So it was very sparse, and that word error rate that I quoted was very much an extrapolation. So I can't actually answer that question too well.
>> : I have a question about how you determined your stop words. Some stop words are actually useful for distinguishing social interactions, so you might not want to throw away everything.
>> Anthony Stark: Partially just through flagging common words and then me going through and sort of pruning out ones that I a priori knew weren't too useful in the context, so "and" and "the."
>> : Very interesting work. I have a question about your bigram and
trigram model. So basically you were saying your ASR is consistent even
if it makes some errors. So I'm wondering if you have tried any
smoothing techniques for bigram and trigram? Because it seems like your
unigram works the best in this...
>> Anthony Stark: Yeah.
>> : ...case.
>> Anthony Stark: So I did do some [inaudible] smoothing and it did slightly better, but not significantly so. So I sort of just chopped it down to a lower-dimension feature and left it at that. It seemed robust to quite a lot of different pruning and smoothing techniques.
[ Applause ]
>> Will Lewis: So now...
>> : This is Shafiq.
>> Will Lewis: So now Shafiq will tell us about a novel discriminative
framework for sentence-level discourse analysis.
>> Shafiq Joty: Thanks. Hi. So in this talk I'll be presenting a novel discriminative framework for sentence-level discourse analysis. But as I'll say, our approach can easily be extended to [inaudible] texts. I'm Shafiq Joty, and this is a joint talk with my advisers Dr. Giuseppe Carenini and Dr. Raymond Ng.
So now let's see the problem first. We are following Rhetorical Structure Theory, which [inaudible] a tree-like discourse structure. So for example, given this sentence, "The bank was hamstrung in its efforts to face the challenges of a changing market by its links to the government, analysts say," the corresponding discourse tree is this. The leaves of the discourse tree correspond to contiguous atomic textual spans, which are called elementary discourse units, or EDUs. Then adjacent EDUs or larger spans are connected by rhetorical relations. For example, here the first two clauses are connected by an elaboration relation, and then the larger span is connected by an attribution relation.
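To make that structure concrete, here is a minimal sketch of an RST-style discourse tree for the example sentence; the exact EDU boundaries and nuclearity assignments shown are assumptions for illustration, not the corpus annotation.

    # Sketch: a tiny RST-style discourse tree. Leaves are EDUs; internal nodes
    # carry a relation and mark each child as nucleus (N) or satellite (S).
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Node:
        relation: Optional[str] = None      # None for leaf EDUs
        nuclearity: str = "N"               # role of this node under its parent
        text: Optional[str] = None          # EDU text for leaves
        children: Optional[List["Node"]] = None

    edu1 = Node(nuclearity="N", text="The bank was hamstrung in its efforts")
    edu2 = Node(nuclearity="S", text="to face the challenges of a changing market "
                                     "by its links to the government,")
    edu3 = Node(nuclearity="S", text="analysts say.")

    elaboration = Node(relation="Elaboration", nuclearity="N", children=[edu1, edu2])
    tree = Node(relation="Attribution", children=[elaboration, edu3])
    print(tree.relation, [c.relation or c.text for c in tree.children])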
Then [inaudible] in a relation can be either a nucleus or a satellite, depending on how central the message is relative to the other. So, the computational tasks in RST: given a sentence like this, the first computational task is to break the text into a sequence of EDUs, which is called discourse segmentation. Then the next task is to link these EDUs into a labeled hierarchical tree. This is called discourse parsing. The important thing to note here is that the fact that EDUs one and two are connected by an elaboration relation will have an effect on the higher-level relation, the attribution. These kinds of dependencies are called hierarchical dependencies. Again, if we had four EDUs, the relation between EDUs 1 and 2 would have an effect on the relation between the third and fourth; these kinds of dependencies are called sequential dependencies. So we need to model these kinds of dependencies in our model. So what's the motivation for
this rhetorical parsing?
Our main goal is to build computational models for different discourse [inaudible] tasks in asynchronous conversations, that is, conversations where participants collaborate with each other at different times, like e-mails or blogs. So we are mainly interested in topic modeling, then dialog act modeling and rhetorical parsing. We built supervised and unsupervised topic segmentation models to find the topical segments in an asynchronous conversation. Now we are working on topic labeling, so do come to our poster to learn more about that work. And we built an unsupervised model to cluster the sentences based on their [inaudible], like question or answer. Right now we are working on rhetorical parsing, and this talk is part of that project, but we'll only show the results on two monologue corpora. Rhetorical structure has been shown to be useful in many applications like text summarization, text generation, sentence compression and question answering, and we'd like to perform similar tasks in asynchronous conversations like e-mails and blogs.
So here's the outline of today's talk. I'll first briefly describe the
previous work. Then I'll present our discourse parser followed by the
discourse segmenter. Then we'll see the corpora or datasets on which we
performed our experiments. Then we'll see the evaluation metrics and
the experimental results. Finally I'll conclude with some future work.
So Soricut and Marcu first presented the publicly-available SPADE system, which comes with [inaudible] models for discourse segmentation and a sentence-level discourse parser. They take a generative approach, and their model is entirely based on lexico-syntactic features constructed from the lexicalized syntactic tree of the sentence. However, when the model [inaudible] in a discourse tree, they assume that the structure and the label are independent, and they do not model the sequential and hierarchical dependencies between the constituents.
Recently Hernault et al. presented the HILDA system, which comes with both a segmenter and a document-level parser. The model is based on SVMs. In the parser they use two SVMs in a cascaded fashion, where the job of the first SVM is to decide which of the adjacent spans should be connected; then, once this is decided, the job of the next, upper-level SVM is to connect these spans with an appropriate discourse relation. So as you can see, their approach is a [inaudible] approach, and they don't model the sequential dependencies.
Now, these two works are on newspaper articles. On a different genre, instruction manuals, Subba and Di Eugenio presented a shift-reduce parser. Their parser relies on an inductive logic programming based classifier, and they use rich semantic knowledge in the form of compositional semantics. However, their approach is not optimal, and they do not model the sequential or hierarchical dependencies.
On discourse segmentation, Fisher and Roark used a binary log-linear model which achieves state-of-the-art segmentation performance. And they showed that features extracted from the parse tree, the syntactic tree, are indeed important for discourse segmentation.
Now let's move on to the discourse parsing problem. For now just assume that the sentence has already been segmented into a sequence of EDUs; for example, here we have three EDUs in the sentence. The discourse parsing problem is to decide the right structure of the discourse tree. So we have to decide whether EDUs 2 and 3 should be connected into a larger span and then that larger span connected with EDU 1, or whether EDUs 1 and 2 should be connected and then that larger span connected with EDU 3. So you have to decide the right structure and the right labels for the internal [inaudible], which are the relation and the nuclearity status of the spans.
So our discourse parser has two components. The first one is the parsing model, which assigns a probability to all possible discourse trees for a sentence. Then the job of the parsing algorithm is to find the optimal tree. Okay? So, the requirements for our parsing model: we want a discriminative model, because it allows us to incorporate a large number of features, and it has been [inaudible] that discriminative models are in general more accurate than generative ones. We want to jointly model the structure and the label of the constituents. We want to capture the sequential and hierarchical dependencies between the constituents, and furthermore our parsing model should support an optimal parsing algorithm.
So here's our parsing model. Just assume that we are given a sequence of observed spans at level i of the discourse tree. Remember that we want to model the structure and the label jointly. So we put a hidden sequence of binary structure nodes on top of this; here the [inaudible] node is whether spans two and three should be connected or not, so this is a binary node. Then on top of this we put another hidden sequence of multinomial relation nodes; here the [inaudible] is: if spans 2 and 3 are connected, then what should be their relation? So you can see that we are modeling the structure and label jointly. Now, this is an undirected graphical model. If we model the output variables, that is, the structures and the relations, directly, this is basically a dynamic conditional random field, and we are modeling the sequential dependencies here.
Now you may be wondering how this model can be applied to obtain the probabilities of different discourse tree constituents. So here it is. We apply this model at different levels and compute the posterior marginals of the relation-structure pairs. For example, just consider that we have four EDUs in a sentence. Here's the corresponding CRF at the first level. We apply the CRF and compute the posterior marginal of this constituent, then this constituent, then this constituent. At the second level we have three possible sequences. In the first sequence EDUs 1 and 2 are connected into a larger span, so here's the corresponding CRF, and we compute the posterior marginals of this constituent and this constituent. The other possible sequence is where EDUs 2 and 3 are connected into a larger span; here's the corresponding CRF, and we compute the posterior marginals of this constituent and this constituent. The third possible sequence is this, where EDUs 3 and 4 are connected into a larger span, and the corresponding CRF is this; we compute the posterior marginals of this constituent and this constituent.
At the third level we have two possible sequences: one where EDUs 1 through 3 are connected into a larger span -- this is the corresponding CRF, and we compute the posterior marginal of this constituent -- and again, for EDUs 2 through 4 in a larger span, we compute the posterior marginal of this constituent. So by computing posterior marginals we have the probabilities for all the discourse tree constituents.
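Here is a minimal sketch of the enumeration being described: for a sentence with n EDUs, at each level we list the candidate sequences formed by merging one contiguous block of EDUs, and we would query a CRF inference routine for the posterior marginal of each candidate constituent. The helper crf_posterior_marginal is a placeholder stub standing in for DCRF inference, not the authors' code.

    # Sketch: enumerate the candidate span sequences used to score constituents
    # level by level (4 EDUs -> 3 candidate sequences at level 2, 2 at level 3,
    # 1 at the top). crf_posterior_marginal is a placeholder for DCRF inference.

    def crf_posterior_marginal(sequence, constituent):
        # Placeholder: a real implementation would run sum-product inference in
        # the dynamic CRF over this sequence and return P(connected, relation).
        return 0.5

    def candidate_sequences(n_edus, level):
        # At a given level, one contiguous block of `level` EDUs is merged into
        # a larger span; every other EDU stays a singleton span.
        sequences = []
        for start in range(1, n_edus - level + 2):
            merged = (start, start + level - 1)
            spans = [(i, i) for i in range(1, start)] + [merged] + \
                    [(i, i) for i in range(start + level, n_edus + 1)]
            sequences.append((spans, merged))
        return sequences

    for spans, merged in candidate_sequences(4, 2):
        print(spans, "->", crf_posterior_marginal(spans, merged))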
Now, these are the features used in our parsing model. Most of the features are from previous work. We have eight organizational features that mainly capture length and position. Then we have eight n-gram features that basically capture lexical and part-of-speech information. We have five dominance set features, which have been shown to be useful in SPADE. Then we have two contextual features and two substructure features. In the substructure features we incorporate the head node of the left and right rhetorical subtrees, and by means of this we are actually incorporating the hierarchical dependencies into our model.
Now, once we know how to derive the probabilities for different discourse tree constituents, the job of the parsing algorithm is to find the optimal tree. We have implemented a probabilistic CKY-like bottom-up parsing algorithm. For example, if we have four EDUs in a sentence, the dynamic programming table will have four-by-four entries, and we'll be using just the upper triangular portion of the table. So T(i, j) will [inaudible] the probability of this constituent, where m is the argmax over the possible structures and r ranges over the relations.
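Here is a minimal sketch of a probabilistic CKY-style combination over such a table; the function constituent_prob stands in for the DCRF posterior marginal of a (structure, relation) pair and just returns placeholder scores here, and RELATIONS is an illustrative label set, not the corpus inventory.

    # Sketch: probabilistic CKY over EDU spans. T[(i, j)] stores the best
    # probability, split point, and relation for span i..j. constituent_prob
    # is a stand-in for the parsing model's posterior marginal.

    RELATIONS = ["Elaboration", "Attribution", "Joint"]   # illustrative labels

    def constituent_prob(i, k, j, relation):
        # Placeholder score for combining span (i, k) and span (k+1, j) under
        # `relation`; a real system would query the DCRF model here.
        return 1.0 / (1 + (j - i)) / len(RELATIONS)

    def cky_parse(n_edus):
        T = {(i, i): (1.0, None, None) for i in range(1, n_edus + 1)}
        for length in range(2, n_edus + 1):
            for i in range(1, n_edus - length + 2):
                j = i + length - 1
                best = (0.0, None, None)
                for k in range(i, j):                       # split point
                    for rel in RELATIONS:
                        p = T[(i, k)][0] * T[(k + 1, j)][0] * constituent_prob(i, k, j, rel)
                        if p > best[0]:
                            best = (p, k, rel)
                T[(i, j)] = best
        return T

    print(cky_parse(4)[(1, 4)])   # best (prob, split, relation) for the whole sentence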
So as you can see, we are finding the optimum based on both the structure and the relation, so this approach will find the globally optimal discourse tree. Now, at this point we have described the discourse parser assuming that the text has already been segmented into EDUs. Let's see our discourse segmenter. The discourse segmentation problem is: given a text like this, break the text into a sequence of EDUs like this. And it has been shown that inaccuracy in segmentation is the primary source of inaccuracy in the discourse analysis pipeline, so we should have a good discourse segmentation model. We framed this problem as a binary classification problem where, for each word except the last word in a sentence, we have to decide whether there should be a boundary or not. And we're using a logistic regression classifier with L2 regularization, and to deal with the sparse boundary tags we're using a simple bagging technique.
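A minimal sketch of that kind of per-word boundary classifier, using scikit-learn's L2-regularized logistic regression; the toy features and data below are assumptions, and the bagging step is omitted.

    # Sketch: discourse segmentation as per-word binary classification.
    # Each word (except the sentence-final one) gets a label: 1 if an EDU
    # boundary follows it, else 0. Features are deliberately simplistic.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def word_features(words, i):
        return {"w": words[i], "w_next": words[i + 1],
                "comma_next": words[i + 1] == ","}

    words = ["The", "bank", "was", "hamstrung", ",", "analysts", "say", "."]
    X = [word_features(words, i) for i in range(len(words) - 1)]
    y = [0, 0, 0, 1, 0, 0, 0]        # boundary after "hamstrung" (toy labels)

    segmenter = make_pipeline(DictVectorizer(), LogisticRegression(penalty="l2", C=1.0))
    segmenter.fit(X, y)
    print(segmenter.predict(X))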
Now, these are the features used in our discourse segmentation model; you can find the details in the paper. We use the SPADE features. Then we use chunk and part-of-speech features, and we have some positional features and contextual features. Now let's see the datasets, or corpora. To validate the generality of our approach, we have experimented with two different corpora. The first one is the standard RST-DT dataset, which comes with 385 news articles and a split of 347 documents for training and 38 documents for testing. In terms of sentences, we have 7,673 sentences for training and 991 sentences for testing. And we are using the 18 [inaudible] relations which have been used in previous work; by attaching the nucleus-satellite statuses to these relations we get 39 distinct discourse relations.
Our other corpus is the instructional corpus delivered by Subba and Di Eugenio. It comes with 176 instructional how-to manuals and 3,430 sentences. We are using the same 26 primary relations and treat the reversals of the non-commutative relations as separate relations, so in our case the relation [inaudible] two different relations. And by attaching the nucleus-satellite statuses we get 70 distinct discourse relations.
Now, to the evaluation metrics. To measure the parsing performance we are using the unlabeled and labeled precision, recall and f-measure as described in Marcu's book. For discourse segmentation we are using the same approach as Soricut and Marcu and Fisher and Roark; that is, we measure the model's ability to find the intra-sentence EDU boundaries. So if a sentence contains three EDUs, that corresponds to two intra-sentence EDU boundaries, and we measure the model's ability to find those two boundaries. Now here are the results. Let's first see the parser performance when it is given manual, or gold, segmentation. So here's the result. Now, one important thing to note is that the previous studies mainly show their results on a specific test set. So to compare our approach with theirs, we have shown our results on that specific test set, but for generality we have also shown our results based on 10-fold cross-validation on each corpus. So here you can see that on the RST-DT test set our model, the DCRF model, consistently outperforms SPADE. Especially on relation labeling, it outperforms SPADE by a wide margin. And our results are consistent when we do the 10-fold cross-validation. And if we compare with the human agreement, our performance in [inaudible] is much closer to the human agreement.
And on the instructional corpus the improvement is even higher on all three metrics. Now let's see the results for discourse segmentation. Here you can see that on the RST-DT test set our logistic regression-based discourse segmenter outperforms HILDA and SPADE by a wide margin. And our results are comparable to Fisher and Roark's result on the same test set, but we are using fewer features, so it's more efficient in terms of time. And when you look at the 10-fold cross-validation, we get similar results. On the instructional corpus, our model outperforms SPADE by a wide margin. Now, if you compare the results between these two corpora, you can see that there's a substantial decrease on the instructional corpus. It may be due to inaccuracies in the syntactic parsing that we are using, and also maybe the tagger that we are using for features.
Now let's see the end-to-end evaluation of our system, that is, where the parser is given the automatic segmentation. So here are the results. On the RST-DT test set our model outperforms SPADE by a wide margin, and we get similar results when we do the 10-fold cross-validation. On the instructional corpus there is a substantial decrease, but we cannot compare with Subba and Di Eugenio because they do not report results based on automatic segmentation. However, if you compare the results between these two corpora, you can see that the substantial decrease is due to inaccuracies in the segmentation.
We have also performed an error analysis on the relation labeling, which is the hardest task. So here's the confusion matrix for the discourse relations. The errors can be explained by two [inaudible]. One is that the most frequent relations get confused with the less frequent ones; for example, here elaboration is confused with summary. And the other is that when two discourse relations are semantically similar, they tend to be confused with each other. So we need techniques like bagging to deal with this imbalanced distribution of the relations, and we need a richer semantic representation, like subjectivity or compositional semantics.
Okay. So to conclude, we have presented a discriminative framework for sentence-level discourse analysis. Our discourse parsing model is discriminative; we capture the structure and label jointly; we capture the sequential and hierarchical dependencies; and, furthermore, our model supports an optimal parsing algorithm.
And we have shown that our approach outperforms the state of the art by a wide margin. In future work we would like to extend this to multi-sentential text, apply it to asynchronous conversations like blogs and e-mails, and investigate whether segmentation and parsing can be done jointly or not.
So, that's it. Thanks.
[ Applause ]
>> : You go back to your slides around 12, 13 where you show the
parser.
>> Shafiq Joty: Yes.
>> : Go back, back. So I just want to know. Go back, back, back. Back.
Yeah. Right here. So when you're doing the process, when you build up
layer by layer, level by level, do you make all possible adjacent...
>> Shafiq Joty: Yeah.
>> : ...[inaudible]?
>> Shafiq Joty: Yeah.
>> : And then you compute the score that determines whether you want to
merge it or not?
>> Shafiq Joty: I take all possible sequences. So here, at level 4, we have two possible sequences. This is one...
>> : I see.
>> Shafiq Joty: ...one. This is the second one.
>> : Yeah, but that's [inaudible]. I mean you have one through four,
one through five and...
>> Shafiq Joty: Yeah, but...
>> : ...[inaudible] determine...
>> Shafiq Joty: Yeah, but...
>> : ...[inaudible]...
>> Shafiq Joty: But the spans can be just -- Only the adjacent spans
can be connected.
>> : Oh, I see. Okay.
>> Shafiq Joty: So here you can see one, two, three.
>> : Okay.
>> Shafiq Joty: Yeah.
>> : Okay, so now the score is based upon the posterior probability...
>> Shafiq Joty: Yeah, the posterior probability.
>> : ...[inaudible]. So that looks very, very similar to this recent
work done by the parsing [inaudible] by Stanford group which is...
>> Shafiq Joty: The feature-based CRF of...
>> : No, not feature. It's neural network-based.
>> Shafiq Joty: Oh.
>> : It's [inaudible].
>> Shafiq Joty: Oh.
>> : They're actually sure -- I think the difference really lies in how
do you compute...
>> Shafiq Joty: The posterior?
>> : ...the score that determines whether you want to merge it or not.
And that's actually the most crucial one.
>> Shafiq Joty: We are not aware of that talk, though.
>> : You're not aware of that, okay.
>> Shafiq Joty: Yeah.
>> : We can follow up with that afterwards. That'd be great. One more
quick question?
>> : Yeah, so I'm not very familiar with RST and I'm curious how it
handles spans that are not continuous. For example, "This talk which
covers work on RST was just ended," or something that when you have
elaboration...
>> Shafiq Joty: Which one? [Inaudible]...
>> : So I can think of in language...
>> Shafiq Joty: Yeah.
>> : ...some spans are not continuous.
>> Shafiq Joty: Yeah. Yeah. So there is, like, a graph theory of discourse that posits that the discourse [inaudible] should be a graph, not a tree. So yeah, there is recent work on this. But there is no computational [inaudible] that can, like, [inaudible].
>> : But in this case the spans will be defined on a...
>> Shafiq Joty: Yeah, like here...
>> : The EDU's will be defined...
>> Shafiq Joty: Yeah, the EDU's can be like separated not necessarily
adjacent. Yeah. Thanks.
[ Applause ]
>> Will Lewis: So for our last speaker Ryan will tell us about
measuring the divergence of dependency structures cross-linguistically
to improve syntactic projection algorithms. [Inaudible]...
>> Ryan Georgi: All right. So that's a bit of a mouthful. This is work that I did with Fei Xia and Will Lewis of UW and MSR. We'll just go right into it. So right now there are thousands of languages in the world, and most of those don't have a lot of resources. Most of NLP is focused on the handful that do have large annotated corpora, and the cost of creating new corpora for most of the world's languages is a pretty limiting factor.
So there is some previous work where we can project annotation from one language that does have the resources available to a language without them. But that annotation is generally of limited quality due to the differences between the languages. And it's not easy to tell beforehand how well projection will work from one language to another without specifically knowing facts about those languages and how they differ. So in this work we want to look at the impact of common divergence types -- differences between the languages -- on projection accuracy, and hopefully improve it.
So I'm going to start with a review of the previous work on projection and linguistic divergence. Then I'm going to take a look at how we detect and measure that divergence, then the experimental results, and conclude.
We used dependency structures for this task. They abstract away a little bit from some of the issues that phrase structures might have, where the representation is order-sensitive and has internal nodes that might not map so well. This kind of distills it down to the basic structures so that we can really see the core differences.
So, just a quick run-through of the basic projection algorithm, where we start with a bitext, in this case English and Hindi. First we get the word alignments. Then we run a monolingual parser on the English side and get an English dependency structure. Then we just use those word alignments to make a pseudo-Hindi dependency tree, assuming that the structure resembles what we see on the English side.
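Here is a minimal sketch of that projection step: given heads on the source (English) side and one-to-one word alignments, copy each dependency edge over to the target side. Real projection algorithms handle unaligned and multiply-aligned words much more carefully; this simplified version and its toy data are assumptions for illustration.

    # Sketch: project dependency heads from source to target through word
    # alignments (1-to-1 only). Token indices are 1-based; head 0 means root.
    # Unaligned target words are left headless (-1) in this simplified version.

    def project_heads(src_heads, alignment, n_tgt):
        # src_heads: {src_index: src_head_index}; alignment: {src_index: tgt_index}
        tgt_heads = {t: -1 for t in range(1, n_tgt + 1)}
        for s, t in alignment.items():
            src_head = src_heads[s]
            if src_head == 0:
                tgt_heads[t] = 0                       # root stays root
            elif src_head in alignment:
                tgt_heads[t] = alignment[src_head]     # follow the aligned head
        return tgt_heads

    # English "John ate pizza" with ate as root; toy alignment to a 3-word target.
    src_heads = {1: 2, 2: 0, 3: 2}
    alignment = {1: 2, 2: 3, 3: 1}    # toy 1-to-1 alignment (src -> tgt)
    print(project_heads(src_heads, alignment, 3))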
So projection allows us to take annotation from a resource-rich language and project it to one that has none, using the word alignment to bootstrap it. It's got the obvious advantage that we can do this with parallel text if we have the resources available for the language on one side of the bitext. Unfortunately this relies on what I'll call the Direct Correspondence Assumption: that when a word is aligned between one language and the other, the two words essentially have the same syntactic and semantic behavior. Obviously that's not always the case, so linguistic divergence covers the cases where the direct correspondence assumption is violated.
So Dorr's '94 paper outlines a number of types of divergence that can
occur between languages. I don't have time to go into all of them, so
here's just one, demotional divergence. So in this case the "like"
that's the head of the English dependency tree here gets demoted
beneath the head of the German sentence, which is "eat," and becomes a
different category. The noun label shown there is actually kind of
wrong; it should be adverbial. And we get a mismatch between
the dependency structures. So Dorr's divergence types are handy but
they really require language-dependent knowledge. You really have to
know that that's something that occurs in German versus English and
write that into whatever you're doing. So we'd like to see if we can
discern kind of the frequencies and the types that happen between
language pairs and see if we can do that programmatically kind of with
minimal knowledge about the language pairs.
So the first thing that we have to kind of figure out is how to define
what we mean by similarity. When you're looking at two trees, most tree
similarity metrics assume that you're looking at two representations of
the same string. When we're looking across languages, we're actually
going to have a different number of terminals so those metrics may not
necessarily come out to match even though they're representing the same
thing. So in this case we want to count the number of matching edges
between trees as made by the word alignments.
So the similarity for a tree pair, s and t here, is the percentage of
edges in s that match in t, as defined by those aligned words, where
the children are aligned and the parents of those children are aligned.
The similarity for the whole treebank is defined just as the percentage
over all the trees.
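To make that edge-match idea concrete, here is a minimal sketch using the same head-dictionary encoding as the projection example above (illustrative names only, not the paper's implementation). Note how the choice of denominator is exactly what makes the measure asymmetrical, which comes up again with the results.

def edge_match_similarity(s_heads, t_heads, alignment):
    """Fraction of edges in tree s whose aligned counterparts form an edge in t.

    s_heads, t_heads: dicts of dependent index -> head index (-1 = root).
    alignment: dict mapping indices in s to indices in t.
    The denominator is the number of edges in s, so sim(s, t) != sim(t, s)
    in general.
    """
    edges_in_s = [(d, h) for d, h in s_heads.items() if h != -1]
    if not edges_in_s:
        return 0.0
    matched = sum(
        1
        for dep, head in edges_in_s
        if dep in alignment
        and head in alignment
        and t_heads.get(alignment[dep]) == alignment[head]
    )
    return matched / len(edges_in_s)

So we have a number of different alignment types that we see frequently
when we're looking through the trees.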
We have the Merge Alignment, where multiple words on one side map to the
same word on the other. There's a swapped alignment, where a child and a
parent on one side map to an inverse relationship like in the German
example. And then we simply have a case with no alignment where we may
have a word aligned with another word but something, a spontaneous
intervening node, may kind of break what otherwise would be a match.
So we defined some operations that correspond to those different
alignment types. We have a merge operation where, if we have multiply
aligned words, we simply combine them and promote the children of the
child node to be children of the new merged node. We have the swap
operation, where we swap the child with the parent and move the children
along with it. And we have the remove operation, which is essentially
like the merge operation except we don't rename the new merged node.
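A rough sketch of what those three operations might look like on the same head-dictionary encoding (again, purely illustrative; a real implementation also has to track labels, alignments, and word order, which this ignores):

def merge(heads, child, parent):
    """Merge a child node into its parent: the child's own dependents are
    promoted to be dependents of the (merged) parent, and the child node
    is dropped."""
    return {d: (parent if h == child else h)
            for d, h in heads.items() if d != child}

def swap(heads, child, parent):
    """Swap a child with its parent: the child takes over the parent's
    attachment point, the parent becomes a dependent of the child, and the
    child's other dependents move along with it."""
    new_heads = dict(heads)
    new_heads[child] = heads[parent]   # child inherits the parent's head
    new_heads[parent] = child          # parent now hangs off the child
    return new_heads

def remove(heads, node):
    """Remove a spontaneous node, re-attaching its dependents to its head;
    essentially a merge with nothing to rename. Assumes node is not the root."""
    return merge(heads, node, heads[node])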
So with these operations we can go about calculating the divergence
between the treebanks. Using those different alignment types, we
perform the corresponding operation, and before and after we calculate
the edge-match percentage to see what impact there is from applying
that operation. And as we go all the way through we eventually
end up with two different trees that should resemble each other
maximally after taking into account all of those divergences.
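And a small driver along those lines, reusing the edge_match_similarity function and the operation sketches above (illustrative only; a real run loops over every tree pair in the treebank and aggregates the counts):

def measure_divergence(s_heads, t_heads, alignment, steps):
    """steps: list of (name, fn) pairs, where each fn takes and returns
    (s_heads, t_heads, alignment) with one operation type applied.
    Returns the edge-match score at the baseline and after each step."""
    scores = [("baseline", edge_match_similarity(s_heads, t_heads, alignment))]
    for name, fn in steps:
        s_heads, t_heads, alignment = fn(s_heads, t_heads, alignment)
        scores.append((name, edge_match_similarity(s_heads, t_heads, alignment)))
    return scores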
So just to walk through what that looks like. First we start off with a
bitext. In this case it's actually an interlinear gloss text. So the --
It's not shown here, but there's a one-to-one correspondence between
the source language here, which is Hindi, and the English gloss text
here. So each one of these tokens aligns directly with the one beneath
it. Then we can get the word alignment through matching the gloss with
the word in the translation.
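A toy sketch of that gloss-based alignment step, assuming exact string matches between gloss tokens and words in the free translation (real IGT alignment has to handle morpheme-level glosses, stemming, and repeated words; the tokens below are made up):

def align_via_gloss(gloss_tokens, translation_tokens):
    """Source tokens line up one-to-one with gloss tokens, so aligning the
    gloss to the translation also aligns the source to the translation.
    Returns a dict: source/gloss index -> translation index."""
    alignment = {}
    used = set()
    for i, gloss in enumerate(gloss_tokens):
        for j, word in enumerate(translation_tokens):
            if j not in used and gloss.lower() == word.lower():
                alignment[i] = j
                used.add(j)
                break
    return alignment

print(align_via_gloss(["book", "give-CAUS", "PAST"],
                      ["caused", "to", "give", "the", "book"]))
# {0: 4}  -- "give-CAUS" and "PAST" stay unaligned with exact matching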
So we end up with trees. These are actually the gold trees that look
something like this. As you can see there's a significant difference
between the "caused" given in English and the "give" with a causative
marker on it.
So first we identify the words that are spontaneous that weren't shown
in that alignment and remove those. Then we still have a multiple
alignment with the "caused" given versus the "give" causative, and we
want to merge that. And in between each one of these steps as we remove
these nodes, the percentage of the match between the trees increases
because now we no longer have those edges that are unaligned. And as we
merge these two nodes we lose an edge that wasn't matching previously
between "book" and "cause" on this side and what would be "book" and
"cause" on this side.
So we did this on four different language pairs: English-Hindi,
English-Swedish, English-German and German-Swedish. And we started off
with just the baseline before we apply any operation of how many
matches there were in the treebank. Then we removed every spontaneous
word from treebank so you have what the match will look like then. Then
applied merge and then swap. One thing that I should point out too is
there are two numbers listed, there's for English to Hindi and Hindi to
English because the denominator essentially in the percentage is the
number of edges in that tree for the language. It's actually an
asymmetrical measure. They're generally similar-ish especially after
we've removed spontaneous words. But, yeah, it is asymmetrical.
So one thing that was -- So they all correspond pretty similarly in
jumps after removal, after merge, after swap. One interesting thing is
the Hindi at the very end is at 90% match. All the other languages are
at 78 to 80% match. So this first table here is actually from the
Hindi-English treebank. And it's much -- These were actually guideline
sentences, with very clean, clear translations. The other three tables
here are from the Smultron database which has, again, supervised
parallel trees but the sentences are much looser translations of one
another. So after applying our operations, it still seems that there is
quite a bit of a gap to be closed.
So we drilled down a little bit deeper, trying to get a breakdown of the
operations as they applied to different part-of-speech tags. In English
and Hindi we see that verbs collapse with other verbs like in the
"give-caused" version especially with auxiliary verbs. Hindi tends to
have those outlined.
In English and German we see that nouns collapse pretty frequently with
other nouns as German does a lot of compounding that English doesn't.
For swaps there's a lot of head-changing in the verbs in Hindi, so the
English verbs will often change heads with the aligned verbs in Hindi.
And then again in German some of the compounding between the treebanks
causes nouns to swap position.
And then finally just removals. Determiners tended to be the most kind
of spontaneous between English and Hindi. English and German had other
determiners that just didn't match. And, yeah, there are some
interesting German-specific prepositions that got removed in the German
case.
So moving on. Just to review: we defined three operations that capture
common divergence patterns and measured the effect of those operations
on the four language pairs. And so what we want to do as future work is
perform these same tests on a much broader selection of languages.
Ideally since we have interlinear text in ODIN which is a lot like the
interlinear text we used for Hindi, we should be able to run that on
upwards of a hundred languages. We can try learning the high-frequency
transformations and applying those as rewrite rules to projected trees.
Or
what we're working on right now is actually using the inferred patterns
to inform a dependency parser that is trained across parallel treebanks
and then is fed parallel text. And, yeah. I got quite a bit of time for
questions.
[ Applause ]
>> : Hi. It was a good talk, thanks. I'm not [inaudible] to speak of
Hindi, but from what I could see the sentences look a bit artificial.
So...
>> Ryan Georgi: Yeah.
>> : So I was wondering if the numbers that you showed are indicative
of the nature of the corpus?
>> Ryan Georgi: Yeah, so these sentences are -- So interlinear text
does tend to have a slight bias in that it is usually instructive
sentences that are highlighting specific differences between whatever
the linguist's native language and the target language are. So there can
be a bit of a bias in those interlinear examples. So that does seem a
little artificial to you? Yeah? I mean, the ODIN database, for -- some
languages it's got thousands of IGTs. For some it's only got twenty or
thirty. We have seen that when you get enough examples the stiltedness
of a couple of the sentences will tend to average out as the linguists
look at different factors of the language. But, yeah, some of the
individual examples can be a little biased for instructive purposes.
>> : Okay. But do you know how this corpus was collected? Like
[inaudible]?
>> Ryan Georgi: Oh, yeah. So these sentences in the Hindi-English were
actually guideline sentences for annotating the Hindi Urdu treebank.
>> : Okay.
>> Ryan Georgi: Which I, yeah, I believe I have the citation down here.
I think that's still in progress. Yeah, the one at the top there. So
these in particular are going to be super instructive so that the
annotators can pick up the right thing to do.
>> : Okay. Thanks.
>> : So I just got a question about the results that you showed. So the
previous slide, yes. After the remove and merge steps, how many edges
are you basically knocking out of the tree?
>> Ryan Georgi: Yeah.
>> : And is there a trivial solution that gives you a tree with one
node in it on both sides that gets 100% on all of these [inaudible]?
>> Ryan Georgi: Yeah, so no we never get that bad. The trees are pretty
well -- So, so the alignments that we actually have were done first
statistically and then corrected by hand, so they're basically hand
alignments. And most of the stuff is in there. I don't have the numbers
up on here, but I think from beginning to end it was roughly a third
reduction in the edges.
>> : Okay.
>> Ryan Georgi: From...
>> : And these numbers are -- That's basically the denominator is
changing here, right, when...
>> Ryan Georgi: Yeah.
>> : ...you remove [inaudible]. Okay.
>> Ryan Georgi: Yeah, so it goes from -- Yeah, exactly.
>> : Okay. Okay, no that's fine.
>> Ryan Georgi: Yeah, yeah.
>> : I got the idea. Thanks.
>> Ryan Georgi: Yeah, occasionally there can be some cases where the
numerator will change as we merge something that was also aligned, but
the denominator will decrease at the same time.
>> : Sure. Okay.
>> Ryan Georgi: Yeah.
>> : I have two quick comments.
>> Ryan Georgi: Yeah.
>> : First, so I think it's the next slide where you break it down by
part of speech. I'd also be interested in the lexical level. There
might be like specific verbs...
>> Ryan Georgi: Yeah.
>> : ...like...
>> Ryan Georgi: Yeah.
>> : ..."like" versus [inaudible].
>> Ryan Georgi: Yeah, yeah.
>> : That might be accounting for a large percentage of...
>> Ryan Georgi: Yeah.
>> : ...what actually happens. So it'd be interesting to see if there
are just a few examples, a few specific constructions, that account for
everything or if it's a wider divergence between the two languages.
>> Ryan Georgi: Yeah, yeah. So with the Hindi examples in particular
the treebank -- And, again, I'm not a Hindi speaker so I didn't know.
If the -- The case markings in a lot of cases are separate words that
aren't mapping onto English and that, I'm guessing, is a pretty closed
class that would probably see the same repetition over and over again.
So, yeah, lexically it would be probably an [inaudible]. Yeah, good
point.
>> : And also to point you to work by Lori Levin and Alon Lavie on --
They didn't do dependency parsing, it was syntactic parsing, but it was
very similar, trying to project the two trees to each other.
>> Ryan Georgi: Yeah.
>> : So I'll give you that reference later.
>> Ryan Georgi: Cool. Thanks.
[ Applause ]
[ Silence ]