>> Geoffrey Zweig: Okay. We should be good to go now, I think. So
it's a pleasure to introduce Imed Zitouni today. Imed is a former
colleague of mine from IBM Research. He just recently decided to make
a change and move to Bing where he's in the Metrics Group. He got his
Ph.D. at the University of Nancy. After that he was a research staff
member at Bell Labs for a number of years before he joined IBM, where
he spent many years, actually, working on the IBM TALES system, which is
their commercial system for doing Arabic and Chinese broadcast news
transcription and all kinds of interesting analyses there.
He's also written a book recently published by Prentice Hall on
multi-lingual natural language processing. So here he is.
>> Imed Zitouni: Thank you. Thank you, Jeff, for this. Thank you
everyone for coming. So as you just said, I just joined Microsoft;
I'm a colleague now, even though the slides say IBM.
So I'm going to talk about my work over the last, let's say, six
years, mostly on TALES, and I will explain what I mean by that in a
little bit, if I can get this going.
Okay. This is the outline of this talk. So I will introduce TALES and
the information extraction component; that's what I want to focus on
today. Mostly I will focus on the mention detection part, the
coreference as well as the relations. I see a typo there.
Then I will show how we can transfer from one language to another, or
to many other languages -- how we make cross-language transfer -- and
what happens if we don't have enough data to train a new model.
So there I will talk about information propagation across languages.
Then, what happens when we deploy a commercial system where robustness
matters, because the text or the input signal can be very noisy; I
will also talk about that. Then I will show how all these techniques
can be applied to a different domain -- here I will show how we use
this in the healthcare space -- and then I will conclude my talk.
So I have a video to introduce TALES. It's better than doing that
through a slide. So I will try to do that now.
[video] ...multiple channels, generating 24/7, to serve as captions.
TALES' video tool capability allows near real-time monitoring of the
video as it's captured. Each row in the table corresponds to a show.
We see the show's network, the show's title, the language, the start
time of the show, and the duration of the show, including an indication
of how many minutes have been captured already. The live column
indicates that one show is still being captured while six others have
completed processing. Video is processed in two-minute segments,
labeled according to how many minutes into the show the segment began.
Those shown in blue are fully translated, while light gray indicates
segments not yet translated. Let's look at a segment of Al Jazeera
news. Hovering over the segment, the upper left corner shows us a
slide show of the key frames of that segment of video. Clicking on it,
we see the video clip with the automatically generated captions below.
[Foreign language]
>> Imed Zitouni: This is the real-time translation of the Arabic
spoken audio. [video] We can also use speech synthesis to dub the
translation over the original audio. Let's look at a Chinese segment
this time.
[Chinese]
We can also make captions appear when hovering over segments. These
features of the TALES video tools allow us to monitor the capture and
processing of current video being added to TALES' searchable database.
>> Imed Zitouni: These are snippets that are important for information
extraction, because one of the ideas of this tool is to allow
for -- let me get this.
>>: I'm sorry, can you search for things there on a specific topic, if
you want to find out about Hurricane Katrina?
>> Imed Zitouni: Absolutely you can. And actually we are using
OmniFind for that, as you see, to do the search. We do the search at
the level of snippets. It was developed internally to answer the
question who did what to whom. It's not the regular kind of search;
it's search based on metadata, based on entities as well as relations
between entities, and that's what I'm going to talk about today.
>>: TALES, top half [inaudible].
>> Imed Zitouni: It was initially -- so the research project is GALE.
The tool that IBM built on the GALE DARPA project, and that it is
trying to sell and show to customers, that is TALES. So all the
technology behind it we developed for GALE; however, we are presenting
it under a different framework.
>>: So that's the origin of TALES.
>> Imed Zitouni: Yes.
>>: Is it still going on now?
>> Imed Zitouni: It's applied to different things. It's applied to
healthcare; I will talk about that later on. It's one of the joint
development agreements we signed with Nuance.
>>: This is the beauty of this DARPA program. Because after you
develop the technology, you can sell it.
>>: Yes, IBM still has --
>> Imed Zitouni: Correct. This is actually the architecture of TALES.
You go from the key frame extraction, to speech detection, and then
speaker segmentation and clustering. Then you run your speech-to-text,
and here we have Arabic and Chinese. You have your information
extraction component on top of speech and also on top of text; then
the named entity transliteration part is mostly used for machine
translation, because you need that for transliteration; and then we
have the machine translation part, Arabic to English and Chinese to
English. You index all this and you can search it with Lucene or
OmniFind.
>>: UIMA, we heard it so many times. What is it now?
>> Imed Zitouni: What does it mean? So UIMA (Unstructured Information
Management Architecture) is a component that helps you: it makes the
integration of these components very easy. You don't have to be aware
of what is happening here.
>>: It's infrastructure.
>> Imed Zitouni: Infrastructure. It's kind of a middleware
infrastructure: you don't have to worry about what is happening at the
different components. Every component is going to be an annotator;
that's the way UIMA uses these components. This is a different
communicator that uses [inaudible]. So the advantage of that is I can
take out the IBM component and put in a Microsoft component, and it's
going to work instantly; there's no issue, no problem. The advantage
of this also for our customers is we tell them: this is the
architecture we have; if you want to use your own speech recognizer,
or your own machine translation, we can do that.
>>: Sounds more like object-oriented programming on a larger scale:
you build the application module by module.
>> Imed Zitouni: It's XML-based, so it helps regarding the interfaces.
Let me -- here, I got it. Here I wanted to take a couple of minutes to
highlight where my contribution really was in the previous years. I
contributed to these different components.
So as an example, for the speech recognition part, my contribution was
mostly on language modeling: using rich language models for Arabic,
putting that into a neural framework for language modeling, using an
additional set of features -- syntactic features, semantic labeling --
and also using diacritization for Arabic text, even though this was
really not successful in improving the [inaudible] of speech
recognition; still, it was an interesting area to explore.
>>: ALESO, what conference is that?
>> Imed Zitouni: That's the [inaudible] -- the Arab world equivalent;
it happens every couple of years in the Middle East. This one is
happening in Qatar.
>>: The neural network, that's a technique?
>> Imed Zitouni: Yes. So the technique was using a neural network;
there's a paper here. The advantage of using a neural network is that
you can throw in different kinds of information, and it will
discriminate among them to help your language model predict better.
And the idea here: in addition to the usual suspects for language
modeling, the n-grams at the word level, we did it at the morph level,
because Arabic is a highly morphological language; we also applied
word classes and used a semantic labeler as additional information,
and that helped not only on speech recognition but also on machine
translation.
Regarding the named entity transliteration, that was mostly work on
cross-lingual name spelling normalization: identifying name spelling
variants and all that. It helps mostly for machine translation. In
the machine translation part, my work was mostly on Arabic to English,
and there I worked at the word segmentation level, on boundary
detection, part-of-speech tagging and language modeling.
And today I'm going to focus on this part, which is the information
extraction part, and that will be the rest of my talk. Okay. So in
real life we start with text like this: there are plenty of tags, and
you want to process this data to extract some information from it. So
the first step is to tokenize the text: you remove all these tags, and
you separate the dots from the tokens, the punctuation as well. Then
you do sentence segmentation: you find the sentences. You also do
case restoration; in the sense that, here as an example, you restore
the capital letter, because it will help you later on in detecting
person names and all that.
Then we do parsing; we run semantic labeling on top of that; and we do
the mention detection part. The mention detection part is to recognize
objects that refer to an entity: like here Bin Laden; battle, which is
an event -- I'll talk about that later; CIA, which is an organization.
We are also interested in dates: we want to recognize the dates, and
so on and so forth. So once we recognize these objects, we want to
link the objects that refer to the same entity. Like here, I need
information telling me that this token leader, this person name Bin
Laden and this pronoun his all refer to the same entity, Bin Laden. I
also want to know that these events -- killed, shot, battle -- all
refer to the same [inaudible] event. The same thing happens for the
city here, or the country here, Pakistan. And the same thing happens
for the president: you have here a mention of president, you have here
Barack Obama, and both of them refer to the same person, the same
entity, the President of the United States.
>>: It makes it look like those are resolving to the actual President
Obama, but it's actually to a cluster of [inaudible]; you don't
actually link it to the actual --
>> Imed Zitouni: It's the cluster of mentions that refers to that same
entity.
>>: Undefined --
>> Imed Zitouni: Undefined things, yes. But you know that the
president in that text and the person in that text are the same. So
if you look at the name mentions, you may find the name of the entity.
Once we recognize entities, we want to find the relations.
>>: Just following up on that question, does the sort of absolute
resolution happen later, or is that just too hard? Like, to maintain
a set of things, a set of actual physical entities that things can
resolve to, and globally say this and this and this resolve to this.
>> Imed Zitouni: Yes.
>>: That comes later.
>> Imed Zitouni: No, that's at this stage.
>>: But that was -- at this stage you're talking about right now.
>> Imed Zitouni: At this stage.
>>: But that was just within the text. That was a group of things
within the text.
>> Imed Zitouni: Within the text. Within the text.
>>: Across documents, into the actual physical world where everything
refers to actual things.
>> Imed Zitouni: That comes much later.
>>: Does it come at all?
>> Imed Zitouni: This is what happens: we index at this level. We
index every single document at this level. And then when you do your
search, you are searching for the entity. Let's say I'm searching for
the entity Obama; I'm going to hit all the documents that have the
entity Obama.
>>: They have a mention of Obama?
>> Imed Zitouni: That have an entity, and the entity has a name
mention that is Obama.
>>: So you're indexing mentions and [inaudible], is that right?
>> Imed Zitouni: I'm indexing mentions and indexing entities and
relations. When I say indexing entities, I have IDs for them. It
doesn't have to be Obama; it's ID XYZ, a single ID.
>>: Not a single database.
>> Imed Zitouni: No, I don't have a single database. It's done on the
fly, based on the name mentions within the entity.
>>: Okay.
>> Imed Zitouni: And after that we also need to look into relations,
what we call relations. The idea here: if I have an entity killed,
which is an event, and I have the location Pakistan, I'd like to know
that the killing happened in Pakistan, and the name of the relation is
located-at. So we have a set of around 100 predefined relations, such
as manager-of, located-at, patient-of, and we try to do all this
together at the relation level. I'll come later to how we do it; just
now I'm trying to introduce the problem.
>>: Are you going to talk about the timing relation?
>> Imed Zitouni: Yes, the time relation. Actually, that's the idea:
if you have yesterday in the text, the yesterday has to be converted
to the exact date. If I have Sunday, May 1st, 2011 -- if I have
today, as in this happened today, it has to be translated, has to be
converted into the exact date. If I don't have any information, that
is based on the date of the document; and if I have some information
related here within the text, then, again, it's a kind of relation
that helps me. The relations always have specificity: we have what we
call specific versus generic relations, and when it's specific it
usually refers to a specific date, versus generic, which can be
yesterday, tomorrow, next month, things like that.
>>: Perhaps like [inaudible] has been [inaudible] in nature but
[inaudible] it's long term, is that something that --
>> Imed Zitouni: No, that's not.
>>: [inaudible].
>> Imed Zitouni: Correct. That's an event. The time normalization
mostly applies when, say, last year it happened that someone went
somewhere, and that last year you convert into an exact date.
And again, the whole idea we have in mind -- this is part of the GALE
project, the distillation -- is that we want to answer who did what to
whom, and where. The when is usually a date interval, from this date
to that date, and most of the documents have this information: you
have today's date, and the article tells you about what happened last
year.
So you need to know what last year means based on the document
appearing today.
>>: Tied to the entity -- for longer, two years ago this appeared,
[inaudible] a lot of [inaudible] -- so how do you associate the time
with the entity relation?
>> Imed Zitouni: I don't associate it with the entity here; I just
replace last year with the exact date. And if later on I have a query
asking what the occupation of that person was last year, again I will
translate last year into the exact date. And if both exact dates
match, then I will consider this as a hit.
>>: So just one clarification question. So the input is the output of
a translation system, right?
>> Imed Zitouni: Excuse me?
>>: Your input is the output of a translation system?
>> Imed Zitouni: No -- well, here during training -- you are talking
about decoding. There are two things. During training, it is not a
translation; it's the source language, either Arabic or English or
Chinese. During decoding, it can be both: it can be the output of a
translation system or a speech recognizer, or it can be the original
text.
Yes. So how do we do mention detection? Similar to many other
applications, such as part-of-speech tagging or chunking, we treat
this as sequential classification. The idea is to process the text
from left to right, or right to left depending on the language you're
considering, and for every token you make a decision whether it begins
a mention, is inside a mention, or is outside any mention. You run
your classifier, and you can take either the top N or the best answer.
I'll get into more details in a little bit.
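A minimal sketch of this begin/inside/outside (BIO) tagging scheme
follows; maxent_prob() is a hypothetical stand-in for the trained
maximum entropy scorer, and greedy decoding is shown where the talk
notes you can also keep the top N.

    # Sketch of BIO sequential tagging with a MaxEnt scorer.
    def maxent_prob(features):
        # Hypothetical: returns e.g. {"B-PER": 0.7, "I-PER": 0.1, "O": 0.2}
        raise NotImplementedError

    def tag_sentence(tokens):
        tags = []
        for i, token in enumerate(tokens):
            features = {
                "word": token,
                "prev_word": tokens[i - 1] if i > 0 else "<s>",
                "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
                # Markovian MaxEnt: the two previously predicted tags are
                # part of the history (this matters; see below).
                "prev_tag": tags[-1] if tags else "<s>",
                "prev_prev_tag": tags[-2] if len(tags) > 1 else "<s>",
            }
            probs = maxent_prob(features)
            tags.append(max(probs, key=probs.get))  # greedy: best answer only
        return tags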
Similar to other techniques, including speech recognition or MT for
that matter, you compute the probability of the sequence of tags given
the words. For that we use, again, the Bayes rule, similar to other
applications, and we work in the maximum entropy framework. Actually,
I think you know this: it was investigated by Berger and the Della
Pietras in 1996, where they found an interesting relationship between
the maximum likelihood estimates of exponential models and maximum
entropy models.
And this is the relationship that happens here. So estimating these
probabilities can be viewed under the maximum entropy framework and
also under the maximum likelihood framework: we choose the model that
maximizes the entropy over the set of models consistent with the
evidence, and the same model maximizes the likelihood over the set of
exponential models. The whole point is that the maximum entropy model
will not assume anything except the evidence, right? So if you had
all the data in the world available, you would be sure to converge to
the perfect model. In reality that's not true, because you don't see
all the events. Unless you have a dead language: if we took ancient
Egyptian, took all the data there is -- there is no new data happening
any more -- and trained on that a maximum entropy model, or any
discriminatively trained model, we could claim that we have a perfect
model. In reality we live with living languages that evolve and
change over time, so that's not a possibility. That's why we estimate
by maximum likelihood.
>>: Sequence -- I think CRF would be better.
>> Imed Zitouni: We tried CRF; we tried MaxEnt. I personally believe
that what matters is the features used, more than the approach itself.
Well, maybe I'm wrong, but this is how we see it in my group: really,
the difference between CRF and MaxEnt, if you look at the loss
functions, is that if you change the loss function of MaxEnt, you get
CRF. I understand one is local: MaxEnt mostly tries to optimize
locally and you hope that it will be globally good; CRF looks
globally. So it's time-consuming; it takes more time to train a CRF.
>>: You have both tools --
>> Imed Zitouni: Excuse me?
>>: Do you have the tools for both models, both CRF and MaxEnt?
>> Imed Zitouni: The tools?
>>: The software tools.
>> Imed Zitouni: The tools, yes, we have the tools. The reason we
prefer MaxEnt is partly historical: IBM believes it contributed to the
invention of MaxEnt, so they want to keep using it. Since there's no
big difference in performance, that's a good enough claim to keep
using MaxEnt.
>>: [inaudible] and there's somebody here in Bing who would say that
the weighted perceptron trains bigger and faster, and they learn as
well or better. Have you guys seen anything like that?
>> Imed Zitouni: The weighted perceptron -- again, you can go further
than that and include SVMs as well, support vector machines. All
these techniques are good; they are comparable. That's what I'm
trying to say: they are all based on discriminative training kinds of
approaches.
Really, the main difference between them is the features used, the
information you throw into them. So if you go with a basic feature
set, using only lexical input, nothing else, I would assume that you
will get comparable results.
>>: [inaudible] like, have you guys tried it?
>> Imed Zitouni: I didn't try the perceptron. We tried CRF; we tried
support vector machines; we tried hidden Markov models, the Markovian
path. The challenge with the Markovian path is that it's very hard to
include additional features. The advantage of MaxEnt and CRF is that
it's easy to add additional features. So it's for convenience more
than for the theory behind it.
So, yes, here I'll go through the features used for this. We use the
context of the word: if I'm trying to predict the tag of this word, I
use the context, the previous word and the next word. We are using
MaxEnt, but we are also using the Markov assumption in MaxEnt: we use
the two previous tags, we keep the history. That's very important,
because I saw in many papers people telling you CRF works better than
MaxEnt, where the implementation of MaxEnt is not the Markovian
MaxEnt; that makes a lot of difference, so watch out for it. Okay?
We use dictionaries; we call them gazetteers. We have huge
dictionaries of person names, locations, organizations; we throw these
in as features, as gazetteer information, along with document-level
trigger features; it's interesting to know what kind of document you
are handling.
And the output of other semantic classifiers. What I mean by
semantic: if you have other classifiers -- say a CRF classifier -- I
can use the output of that classifier as an input to my model. If
you're working on a different project, and this has happened: for GALE
we worked on ACE. They have different kinds of tags. I will not
throw that away; I will use it. I have a classifier trained on that
data; I use the output of that classifier and throw it in as a
feature. It helps.
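A sketch of how these feature sources might combine for one token; the
gazetteer entries and the auxiliary ACE-style tagger output below are
placeholders, not the actual TALES resources.

    # Illustrative feature extraction: lexical context, gazetteer
    # membership, and the output of another, previously trained
    # classifier (e.g. one trained on ACE tags) used as a feature.
    PERSON_GAZETTEER = {"obama", "laden"}  # placeholder entries

    def features_for(tokens, i, aux_tags):
        word = tokens[i].lower()
        return {
            "word": word,
            "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
            "in_person_gazetteer": word in PERSON_GAZETTEER,
            "aux_tag": aux_tags[i],  # output of the other classifier
        }

    print(features_for(["Bin", "Laden", "said"], 1, ["B-PER", "I-PER", "O"]))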
>>: So the capability [inaudible].
>> Imed Zitouni: Yes, it helps a lot. I mean, there are two points of
F-measure gained by doing these kinds of things, and two points of
F-measure at the level TALES is at, in the 80s, is important.
>>: What's the size of the functionality? That's good information:
how many entities, the dimensionality?
>> Imed Zitouni: The number?
>>: The number of classes.
>> Imed Zitouni: The number of classes is around 120.
>>: Okay.
>> Imed Zitouni: All right.
>>: That's a big number.
>> Imed Zitouni: Excuse me? It's very sparse.
>>: The words in context, the features you use, just [inaudible] maybe
one --
>> Imed Zitouni: We use a context of the two next and the two previous
words: five, it's a five-gram context. We did try something else as
well: we also used the parser information and semantic labeling
information. So you have the headword information, and the semantic
labeling will tell you more: you have the arguments, who is doing the
action, who is receiving the action. All this information we threw in
as features, and it helps, on top of the parser.
>>: I'm just -- it will be just so large, so you must have some
technique, too, because of the dimensionality.
>> Imed Zitouni: I see. Well, we use --
>>: [inaudible].
>> Imed Zitouni: Yes, that's why we use generalized iterative scaling
with a Gaussian prior. We need to do that. We cannot -- yes, I see
your point. No, we cannot train otherwise -- yes. True. We need to
do that.
This is the entity set -- that's why I'm saying this is the set of
entities; it's a little different from ACE. We cover what we want.
We have 116, times three, because we're interested in name mentions,
nominal mentions and pronominal mentions, and we differentiate between
them. The he is pronominal; president is nominal; Barack Obama is a
name.
This is an idea about the performance. I know that many people are
familiar with what's happening in ACE, and the same technique was used
in ACE, so I'm showing the performance here in terms of precision,
recall and F-measure. This performance on ACE is very competitive;
this same model was ranked at the top in the ACE evaluation. Applying
it to TALES, because we have many more classes and the data is maybe a
little bit sparser, the performance drops a little bit.
>>: ACE is [inaudible]? Is it part of GALE?
>>: Content extraction.
>> Imed Zitouni: The A is automatic, the C is content, the E is
extraction: Automatic Content Extraction. It's run by NIST. And yes,
this is the number of mentions we have, and that's the number of
documents we have.
Now, we talked about mentions; I want to do the coref part, so how do
I do that? I'll be brief here, but we have a paper at ACL for those
who want more details. So really, I take all the mentions here, and
my goal is to group them into entities with different IDs; for that I
use what we call the Bell tree algorithm. I start with the first
mention, and then I decide.
So I have all these mentions in here. I take the first one; I put it
into one class. Then for the second one I have two choices: the first
choice is that it belongs to the same class; the second choice is that
it starts its own class. All right? And then when I do that for the
third one, again, the choices I have are: either it belongs to an
existing class, or it starts another class; and the same thing from
here on. All right?
So again I use my classifier, my maximum entropy classifier, with some
threshold, because I cannot explore all of this. I try to estimate
the probability of linking -- linking meaning putting it in the same
class -- and the probability of starting a new entity. Again, for
that we use maximum entropy: the same classifier, the same framework.
The difference is in the features. When you do entities, it makes
more sense to use -- we were using lexical features such as string
match, acronyms, partial match; we are using distance features: how
far the two mentions are from each other in terms of number of words.
Also the mention and entity attributes: we did recognize the mention,
so we know the type, we know the level, we know the gender, we know
the number; we can use all that. Almost everyone uses these; what
maybe makes the difference for our approach is using syntactic
features based on the parse tree and semantic labeling. Compared to
what everyone else is using, the performance in terms of coref is at
this level. Using syntactic features helps for English; it helps for
Arabic; it didn't for Chinese, and the reason is that I don't know how
to read Chinese. I was not able to debug it: I ran my features, I got
those numbers, I said done. For English and Arabic I was able to
debug and find out how to improve things.
>>: Such as [inaudible] you see some errors?
>> Imed Zitouni: Again, we look at the dev set: we see what's going
wrong on the dev set and we try to improve the performance, the
features, on the dev set; then you have a blind test set. These
numbers are on the blind test set, right? But when you do your
training, you always look at some data to see the effect of these
kinds of features on that data.
>>: The kind of feature tuning?
>> Imed Zitouni: Right.
>>: I assume [inaudible] before?
>> Imed Zitouni: In terms of performance?
>>: The parser -- syntactic features should be more noisy on Arabic
than English.
>> Imed Zitouni: Should be more noisy?
>>: Arabic.
>> Imed Zitouni: That's true, it should be more noisy. However,
remember that Arabic has a rich morphology and all that, so you are
capturing plenty of information you're not capturing otherwise. When
you use context n-grams there, if you're working at the stem level or
at the morph level, you're picking up the morphs, so maybe the context
gets diluted a little bit. But if you add the parser and semantic
labeling, you get back information that you were losing because of
the morph details. And that helps.
>>: The algorithm you mentioned for the clustering -- it looks like an
exponential search. How do you handle that?
>> Imed Zitouni: Excuse me, repeat again?
>>: This algorithm describes the classes.
>> Imed Zitouni: Yes.
>>: That seems to be exponential [inaudible]. How do you handle it?
>> Imed Zitouni: It's an exponential problem. So let me go back.
During training you have all of this, so it's not exponential; it's
only during decoding. During decoding, if a path's probability is
low, you just get rid of it. It's similar to what we do in speech
recognition, all right? When you do your Viterbi. Even here with the
Bell tree you have a kind of Viterbi: you don't enumerate everything,
you eliminate the paths that you believe will not get you there. It's
the same technique applied here; you are using a Viterbi-style search
anyway. It's a Bell tree, but think about it as: you are exploring
this path, you have a probability here; you follow it, you get another
probability here; and if the score of this path is low, I'm not going
to follow it.
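A minimal sketch of that Bell tree search with beam pruning follows;
link_prob() and start_prob() are hypothetical stand-ins for the
trained MaxEnt classifier, and the beam width is illustrative.

    import heapq

    def link_prob(entity, mention):      # hypothetical MaxEnt score:
        raise NotImplementedError        # P(mention joins this entity)

    def start_prob(partition, mention):  # hypothetical MaxEnt score:
        raise NotImplementedError        # P(mention starts a new entity)

    # Each mention either links to an existing entity or starts a new
    # one; partial partitions with low scores are pruned, beam-style.
    def bell_tree_decode(mentions, beam_size=10):
        beam = [(1.0, [])]  # (score, partition); a partition is a list of entities
        for mention in mentions:
            hypotheses = []
            for score, partition in beam:
                for k, entity in enumerate(partition):  # link options
                    new = [list(e) for e in partition]
                    new[k].append(mention)
                    hypotheses.append((score * link_prob(entity, mention), new))
                hypotheses.append(                      # start-new-entity option
                    (score * start_prob(partition, mention), partition + [[mention]]))
            beam = heapq.nlargest(beam_size, hypotheses, key=lambda h: h[0])
        return beam[0][1]  # best-scoring partition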
>>: [inaudible].
>> Imed Zitouni: It's a beam search [phonetic], yes.
>>: So who provides the data --
>> Imed Zitouni: So there is ACE -- NIST provides some data -- and
also, for some applications, we have human annotators in house who
provide this as well.
>>: How much do you need to have -- what kind of data -- how much --
>> Imed Zitouni: Data?
>>: Annotation as well.
>> Imed Zitouni: The law is: the more, the better. But at the level
of mention detection, we found that with around a million tokens --
not mentions, tokens of text -- you start to get a reasonable model;
the improvement beyond that is limited. If you look at the ACE data,
it was in the range of 400K tokens, and we added on top of that.
>>: That's not a very big one. With 120 classes, on average per class
you have how many samples in training?
>> Imed Zitouni: So roughly speaking, 400K divided over the classes:
we are talking about a few thousand. A few thousand, yeah.
>>: Talking about ACE right here: the base is your model without the
syntactic features, and you added the syntactic features in the
experiment?
>> Imed Zitouni: Yes.
>>: Base, plus syntax.
>> Imed Zitouni: Yes.
>>: I remember a couple of other co-reference resolution papers, for
example from Andrew [inaudible]; did you consider those, or was there
some difference in --
>> Imed Zitouni: So there are two techniques. Compared to McCallum
[phonetic], I know we're using similar features; the difference is in
the technique itself: he tried CRF, we used MaxEnt. But again, I
believe it's not a big difference, because the features make the
difference there. There are other approaches that try to do mentions
and coref jointly, taking both of them together. We didn't explore
that path. We know that the overall numbers there are better; we
didn't implement it, although there's a noisy step in the middle, and
maybe the right thing is a joint approach where you can kind of learn
from your mistakes. We didn't explore that. We did it in two stages:
once the mentions are recognized, we keep them -- sometimes we try to
keep the N-best -- but we didn't try a joint approach between mention
detection and coref.
Relations. It's similar to the coref part: really, I have two
mentions and I try to find out whether I have a relation between them,
yes or no. To do that I train a classifier that, for every relation
and every pair of mentions, with all the features I can fire, detects
whether the relation exists or not. And the non-relation is by itself
a relation.
>>: What are the classes? How many types?
>> Imed Zitouni: The number of them is around 50. I will give more
details in a little bit.
So again, I'll skip this; we are using maximum entropy here as well, I
discussed that. What is important here is the kind of features we
have. If you have a person entity -- person visited location, that's
a good indication. If you have a person and a person, like the father
and the son, that's a good indication that they are correlated.
The organization-person pairs -- those are the kinds of features we
are using. Binary features are straightforward; for numeric features,
such as the distance between the two mentions and all that, it's
better to bin them. We cannot really use raw numbers in MaxEnt; it's
better to find bins, and that's how we use them.
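For example, a distance feature might be binned like this; the bucket
edges are illustrative, not the ones used in the system.

    # Turn a raw token distance between two mentions into a discrete
    # indicator feature suitable for MaxEnt. Bucket edges are made up.
    def distance_bin(n_tokens):
        for label, upper in (("0", 0), ("1", 1), ("2-3", 3),
                             ("4-7", 7), ("8-15", 15)):
            if n_tokens <= upper:
                return "dist=" + label
        return "dist=16+"

    print(distance_bin(5))  # dist=4-7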
So, again, these features: from the parse tree we find the path from
here to here; from the dependency tree we find the head words. And
these are the other kinds of features: I have the organization, I
have the person; I mentioned that.
And there are two approaches to doing relations: sequentially, or in a
cascaded way. Sequentially, we take every pair and recognize whether a
relation exists or not, as I mentioned here. In the cascaded approach,
you first -- because for a relation we anyway need to detect the
tense, the modality and the specificity, all right? So one way is to
separate all of them and detect one type at a time; the other is to
predict all of them at once.
And one issue with relations is that the number of non-relation
examples in the data is huge compared to the pairs where there is a
relation. We have to deal with that; that's a big issue.
>>: How do you do that? You get a sentence here and -- you go
pairwise?
>> Imed Zitouni: Yes, you have a sentence; you have mentions, or
entities, within the sentence; and you want to know whether a relation
exists or not.
>>: That's huge. You do it combinatorially?
>> Imed Zitouni: No, you do it at the sentence level only.
>>: Sentence only. Okay.
>> Imed Zitouni: For the whole document you have the entities -- the
mentions that are grouped into entities -- and that will help you.
But you do it at the sentence level.
And even at the sentence level, there are many entities with no links
between them, because they are not related. So when you look at the
training data, the number of non-relation labels is huge compared to
the number of relations. That's why we did this bagging approach: the
idea is to sample the data, create different samples, train many
classifiers, and then, if at least N of them vote for it, you call it
a relation.
So we can use the majority vote -- that's the known bagging -- or we
can use at least N votes. And we do it twice: first we detect whether
we have a relation or not, and once we know that we have a relation,
we detect what kind of relation it is. Okay?
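A sketch of that bagging scheme for the skewed relation data;
train_maxent() and classify() are hypothetical stand-ins for the real
toolkit, and the vote counts are illustrative.

    import random

    def train_maxent(examples):          # hypothetical trainer
        raise NotImplementedError

    def classify(model, mention_pair):   # hypothetical: "RELATION" or "NONE"
        raise NotImplementedError

    # Sample several balanced training sets from the skewed data, train
    # one classifier per sample, and call a pair a relation only if at
    # least min_votes classifiers agree.
    def train_bagged_detectors(positives, negatives, n_models=7, seed=0):
        rng = random.Random(seed)
        return [
            train_maxent(positives + rng.sample(negatives, len(positives)))
            for _ in range(n_models)
        ]

    def is_relation(models, mention_pair, min_votes=4):
        votes = sum(1 for m in models
                    if classify(m, mention_pair) == "RELATION")
        return votes >= min_votes  # majority (or at-least-N) vote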
>>: You have a binary [inaudible] or not?
>> Imed Zitouni: Yes.
>>: And then if you have, like, person/person, you have [inaudible]
for everything.
>> Imed Zitouni: Yes, so person/person can be related in several
ways; this is the kind of relations we have. It can be parent-of,
because that applies to person/person; it can be relative-of. Then in
the second stage you detect the specificity of the relation. Okay.
Now, what happens when we apply this to Arabic? I only see one Arabic
speaker, so I'll give a brief idea here. Take this as the English
text: if you assume the reader is Arabic, you increase the ambiguity,
because there are no vowels. So if you take a text and you remove the
vowels, you get something like this.
Now, there is the lack of capital letters; that adds another level of
ambiguity. And because of the rich morphology, some words are
attached to each other, like here; that adds another level. So a
sentence that was initially like this becomes like this, if you see it
from an Arabic perspective. And you need to handle that.
>>: One-to-one correspondence, letter to Arabic?
>> Imed Zitouni: Not letter to letter; it's at the word level, but it
happens that these get glued together. Here you remove the vowels;
there you remove the capital letters.
>>: It's a kind of presentation, is it?
>> Imed Zitouni: Yes, the presentation of this is the same. This is
English -- it's English that I wrote the Arabic way.
>>: Okay.
>> Imed Zitouni: So if you cannot read this, that's exactly how hard
the problem is, because this is English; but that's how Arabic
speakers read Arabic. All right? So what we do is segment the
text -- I'll go fast over this. We segment the text, separate it into
segments like here, and run the same classifier on the segments to
detect the entities and then the relations. And here I'm showing the
performance we have on Arabic, and these are the features used. What
I want to show next is: if I don't have enough data for a specific
language, what can I do? I have a rich language -- English is rich;
we have a bunch of resources, we have a bunch of annotated data.
That's not the case for other languages such as Arabic. How much time
do we still have?
>>: Ten minutes or so.
>> Imed Zitouni: Good. So the idea was how to use a rich language
such as English to improve other languages. Here our targets are
Arabic, Chinese, German, French, Spanish: these are the languages we
want to handle with TALES, and where we do have some data -- it's not
that we don't have data at all, but how much we have depends on the
language. I'm not going to -- I did the motivation already. So what
we did is the following: I have a resource-rich language, let's say
English, and I want to know how my Arabic model can benefit from the
English model. So what you do is: you train your English model, you
use all the features, you have the data; you do the usual things.
Okay?
So you end up with the English side carrying all the tags. You have
the alignment between the two sides -- you can use even publicly
available aligners like GIZA++ between the two languages, if you have
the translation -- and you propagate the mentions to the Arabic side,
or for that matter to Spanish or anything else, and you get your
mentions in the target language. Now, if you have that, what can you
do with it? You can take it as the final result, if you don't have
any annotated data at all in the target language; you say, this is my
result. Or you can use it as additional features in my framework. I
can build dictionaries from this data: if I have a huge amount of
data here -- because, I don't know, if you look at the European
Parliament, we have plenty of language pairs and they are all aligned;
this is huge -- you take all of this, extract gazetteers from it, and
use them as features. Or you build a model: you get training data
annotated this way, you use it to build a model, and you use that
model as a feature in your target language. Here is the same example
for Spanish: again, I tag the English part and then I propagate that
into Spanish.
So, as I said, the way we use it: if I don't have resources at all in
the target, I just take the propagated tags as the output of my
classifier. If I have some data, then I may use the propagated data
to train a classifier and use it as a feature; and there are the
gazetteers, building dictionaries and using them as dictionaries.
I'll compare these shortly. We tried it on data that is publicly
available so we can publish the results.
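A minimal sketch of the propagation step itself, assuming word
alignments are given as (source index, target index) pairs, for
instance from a GIZA++-style aligner; the example at the end is made
up.

    # Project mention spans from the tagged English side onto the target
    # side through word alignments, producing BIO tags for the target.
    def project_mentions(english_mentions, alignments, n_target_tokens):
        target_tags = ["O"] * n_target_tokens
        align = {}
        for src, tgt in alignments:
            align.setdefault(src, []).append(tgt)
        for label, start, end in english_mentions:  # e.g. ("PER", 0, 1)
            targets = sorted(t for s in range(start, end + 1)
                             for t in align.get(s, []))
            for j, t in enumerate(targets):
                target_tags[t] = ("B-" if j == 0 else "I-") + label
        return target_tags

    # "Barack Obama visited Spain", aligned one-to-one to a 4-token target:
    print(project_mentions([("PER", 0, 1)], [(0, 0), (1, 1), (2, 2), (3, 3)], 4))
    # ['B-PER', 'I-PER', 'O', 'O']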
The features used are lexical, lexical with syntax, and with semantic
features. And the point we want to make here is that the gain in
performance decreases with the amount of resources available in the
target language.
So you have a donor language, which here is English, with plenty of
resources. If you want to help another language and that language
already has substantial resources, you are not going to help it much.
If that language has poor resources, yes, you will help it, and that's
what we'll see soon. So in our case we use English. This is the
performance of English with our classifier; there is no cross-language
propagation there. And here you see the performance on Arabic,
Chinese and Spanish, where there are no resources at all in these
languages, so it's only propagation -- only the results from
propagation.
>>: Let me calculate --
>> Imed Zitouni: Excuse me?
>>: There's no target language --
>> Imed Zitouni: Yes, there's no target language data. This is only
English: I have only English data, and this is the performance I get
on Arabic.
And in Chinese and in Spanish. The reason the performance is better
here makes sense: Spanish is closer to English. Between Chinese and
Arabic it's hard to make a claim, because both of them are quite
different, but for whatever reason Chinese looks --
>>: What was -- what is the error rate for English in that case?
>> Imed Zitouni: 80 percent F-measure.
>>: That's not bad.
>> Imed Zitouni: And this is when we use lexical features, where we
already have some data in the target languages. We see that the
performance, like for Spanish, goes to 77, so we are getting close to
the 81 of English.
Here with syntax, and here with all the information. An interesting
point here is Arabic: for Arabic we are less than one point behind
English.
>>: So all you do is you just pump them into the dataset, pretending
that they're correct.
>> Imed Zitouni: Yes: training the model on that, extracting
dictionaries from that, and feeding that in, yes. So let me use the
last five minutes here. All these models that I'm showing -- when we
show this to a customer, the customer usually doesn't expect to have
clean text. Here you have clean text; you have the system behaving
properly. This is the likelihood, the MaxEnt probability for each
token, so everything is fine. Here we may get confused between number
one and number two, but still the performance is reasonable.
Now, our customer sometimes feeds in text like this. We have an
English model, but he still feeds in this kind of text, and he expects
the model to behave appropriately. And the model that I just
described -- this is what it does: very badly. And the same -- we
have another customer, from the finance sector, and this is the kind
of data they give us.
This is actually not the exact data, because I cannot show the exact
data, but I tried to mimic a little how it looks. And he expects a
behavior like this; but, again, the system doesn't behave the proper
way. Plenty of techniques have actually been proposed for dealing
with this noise, and we got inspired by those techniques. What we did
here: first of all, we wanted the system to know what is English
versus not English, with some probability. We process the text, and
then we try to find the SGML tags and all that part. And then, how do
we know that the text is English or not English? We are using the
perplexity as a measure here. We found that perplexity is a good
indicator of whether a sentence is good English or not. And it's not
a binary decision, good or bad: if the perplexity is low enough, this
is good English; if the perplexity is in the confusing middle range, I
don't know; and if the perplexity is very, very high, this should be
very bad English. Based on that, we create these different models: a
model trained on clean text; a mixed model, in the sense that it's
trained on this noisy data as well; and we also have models that use
only gazetteers, only dictionaries. And this is the performance.
This is the baseline, on clean text. We see here how it fails
miserably when you decode text that is not English. But we see that
with the technique we proposed, even though we lose a little bit --
and that's okay for commercial purposes -- in counterpart we catch up
on this kind of data.
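A sketch of this perplexity-based routing; lm_logprob() stands in for
a real n-gram language model lookup, and the thresholds are
illustrative, not the ones used in the system.

    def lm_logprob(token, context):
        # Hypothetical n-gram LM lookup: log2 P(token | context)
        raise NotImplementedError

    def perplexity(tokens):
        logprob = sum(lm_logprob(t, tuple(tokens[max(0, i - 2):i]))
                      for i, t in enumerate(tokens))
        return 2.0 ** (-logprob / len(tokens))

    def route(tokens, low=200.0, high=2000.0):
        ppl = perplexity(tokens)
        if ppl < low:
            return "clean-text model"       # reads like good English
        if ppl > high:
            return "gazetteer-only model"   # very bad English
        return "mixed model (trained with noisy data as well)"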
>>: So this is bilingual.
>> Imed Zitouni: This is not bilingual; this is English.
>>: If it's not bilingual, how can you do the information propagation?
>> Imed Zitouni: No, this has nothing to do with propagation -- I'm
sorry, I am back to only one language. The information propagation
part is done.
>>: That requires pairwise --
>> Imed Zitouni: Yes, that part is done. Now I'm back to my
monolingual part. Okay. So I'm on time, I think. So, all the things
I mentioned before -- the recent project is to apply them to
healthcare. Healthcare seems to be a similar problem. Doctors
dictate a lot of documents using ASR technology, and they also type
text. And when they send that to the insurance, what the insurance
wants is the ICD-9 or ICD-10 codes that match the procedures that
happened; they don't want to read the entire document, they're not
interested in that. Right now the ICD-9 and ICD-10 coders are doing
that manually; with this project we're trying to help them do it
automatically. So this is the kind of text you get in a medical
report, and it would be nice to know that the probably here is a
hedge. You need to detect that, because saying probably this, versus
asserting it, matters in the healthcare area. So terms like probably,
not really, not sure are mentions that we need to detect; we call
them hedges. And we need the relation between this and this: we have
chest pain, non-cardiac, and there's a split -- that means they
belong to the same attribute but they are split in the text. And
here modified-by: this problem modifies the meaning of this. So
again we do mention detection -- it's a sequential classifier -- and
we run the same kind of classifier to detect relations. And here is
another example, where she's on a medication: the dosage is 40
milligrams, it's taken by IV, it's an IV push daily, so you have the
frequency and the dosage. You need to recognize all of those, and
once you have that, it's fairly trivial to find the ICD-9 code and
send it to the insurance.
>>: The relation between the --
>> Imed Zitouni: Again, you have annotators, you have coders; right
now they do this job manually. So we are using that data, trying to
take advantage of it.
>>: Specify how many [inaudible].
>> Imed Zitouni: Right. We have annotators doing that, and we're
training models based on their data. We're using kind of active
learning and all that to expedite the process, but that's the idea.
Also, sometimes the same mentions belong to two different relations,
and we need to take care of that: here, does not consume cigarettes
or alcohol means does not consume cigarettes and does not consume
alcohol, so we need to get both of them, not only one.
It's a detail, but it's important. So anyway, to conclude: I tried to
present an end-to-end statistical approach for information extraction,
and showed how the same technique can be applied to different
languages; and, if we don't have enough resources in the target
language, how we can transfer information across languages -- though,
again, if the receiving target language already has enough resources,
this approach will not buy that much, so we need the combination of
resource-rich and resource-poor languages. We also talked about
robustness to noisy text, and how we can apply all this to other
areas such as healthcare. All of this gave very competitive
performance; in the healthcare space it's already used for commercial
purposes, and TALES has also shipped to a couple of customers. And
that's about it. Thank you.
[applause]
>> Geoffrey Zweig: Any last questions?
>>: For the healthcare problem, did you find syntactic features
useful?
>> Imed Zitouni: Syntactic features were useful, though not all of
the syntax; it's mostly the head word information that is very
important. That helps to find the split attributes. From the parser
itself, we use some chunking information to know the limits, where the
chunker stops; that's it. But actually, here our parser is trained on
the same kind of data as well, so we're not using a parser trained on
regular English text.
>> Geoffrey Zweig: Okay. Let's thank Imed again.
[applause]