>> Geoffrey Zweig: Okay. We should be good to go now, I think. So it's a pleasure to introduce Imed Zitouni today. Imed is a former colleague of mine from IBM Research. He recently decided to make a change and move to Bing, where he's in the Metrics Group. He got his Ph.D. at the University of Nancy. After that he was a research staff member at Bell Labs for a number of years before he joined IBM, where he spent many years working on the IBM TALES system, which is their commercial system for Arabic and Chinese broadcast news transcription and all kinds of interesting analyses. He's also written a book, recently published by Prentice Hall, on multilingual natural language processing. So here he is.

>> Imed Zitouni: Thank you. Thank you, Jeff, for this. Thank you everyone for coming. So as Jeff just said, I just joined Microsoft; I'm a colleague, even though the slides say IBM. I'm going to talk about my work over the last six years or so, mostly on TALES, and I will explain what I mean by that in a little bit, if I can get this going.

Okay, this is the outline of the talk. I will introduce TALES and its information extraction component; that's what I want to focus on today. Mostly I will focus on the mention detection part, on coreference, as well as on relations. I see a typo there. Then I will show how we can port a system from one language to another, or to many other languages, so how we do cross-language transfer, and what happens if we don't have enough data to train a new model; there I will talk about information propagation across languages. Then, what happens when we deploy a commercial system where robustness matters, because the text or the input signal can be very noisy; I will also talk about that. And then I will show how all these techniques can be applied to a different domain; here I will show how we use this in the healthcare space. Then I will conclude my talk.

So I have a video to introduce TALES; it's better than doing that through a slide. I will try to play it now.

[video] ...multiple channels generating 24/7 to serve as captions. TALES's video tool capability allows near-real-time monitoring of the video as it is captured. Each row in the table corresponds to a show. We see the show's network, the show's title, the language, the start time of the show, and the duration of the show, including an indication of how many minutes have been captured already. The live column indicates that one show is still being captured while six others have completed processing. Video is processed in two-minute segments, labeled according to how many minutes into the show the segment began. Those shown in blue are fully translated, while light gray indicates segments yet to be captured. Let's look at a segment of Al Jazeera news. Hovering over the segment, the upper left corner shows us a slide show of the key frames of that segment of video. Clicking on it, we see the video clip with the automatically generated captions below. [Foreign language]

>> Imed Zitouni: This is the real-time translation of the Arabic spoken audio.

[video] We can also use speech synthesis to dub the translation over the original audio. Let's look at a Chinese segment this time. [Chinese] We can also make captions appear when hovering over segments. These features of TALES's video tools allow us to monitor the capture and processing of current video being added to TALES's searchable database.
>> Imed Zitouni: These are snippets that are important for information extraction, because one of the ideas of this tool is to allow for -- let me get this.

>>: I'm sorry, can you search for things there on a specific topic, if you want to find out about Hurricane Katrina?

>> Imed Zitouni: Absolutely you can. And actually we are using OmniFind to do the search. We do the search at the level of snippets. It was developed internally to answer the question "who did what to whom." It's not the regular kind of search; it's search based on metadata, based on entities as well as relations between entities, and that's what I'm going to talk about today.

>>: TALE top half [inaudible].

>> Imed Zitouni: It was initially -- so the DARPA project is GALE. The tool that IBM built on the GALE DARPA project, and that it is trying to sell, to show to customers, that is TALES. All the technology behind it we developed for GALE; however, we are presenting it under a different framework.

>>: So that's the origin of GALE.

>> Imed Zitouni: Yes.

>>: Is it still going on now?

>> Imed Zitouni: It's applied to different things. It's applied to healthcare; I will talk about that later on. It's one of the joint development agreements we signed with Nuance.

>>: This is the beauty of this DARPA program. Because after you develop the technology, you can sell it.

>>: Yes, IBM still has --

>> Imed Zitouni: Correct. This is the architecture of TALES. You have the key frame extraction; you have to detect speech, speech detection and then speaker segmentation and clustering. Then you run your speech-to-text, and here we have Arabic and Chinese. You have your information extraction component on top of speech and also on top of text. The named entity transliteration part is mostly used for machine translation, because you need transliteration there. Then we have the machine translation part, Arabic to English and Chinese to English, and you index all this and you can search it with Lucene or OmniFind.

>>: UIMA, we heard of it so many times. What is it now?

>> Imed Zitouni: What does it mean? So UIMA is a component that helps you; it makes the integration of these components very easy. You don't want to be aware of what is happening here.

>>: It's infrastructure.

>> Imed Zitouni: Infrastructure. It's a kind of middleware, infrastructure; you don't have to worry about what is happening in the different components. Each component is going to be an annotator; that's the way UIMA uses these components. This is a different communicator that uses [inaudible]. The advantage of that is I can take out the IBM component and put in a Microsoft component, and it's going to work instantly; there's no issue, no problem of using it. The advantage of this for our customers is we tell them: this is the architecture we have. If you want to use your own speech recognizer, or your own machine translation, we can do that.

>>: Sounds more like object-oriented programming on a larger scale; you get an application module by module.

>> Imed Zitouni: It's XML-based, so it helps regarding the interfaces. Let me -- here, I got it. Here I wanted to take a couple of minutes to highlight where my contribution really was in the previous years. I contributed to these different components.
So as an example, for the speech recognition part, my contribution was mostly on language modeling: using rich language models for Arabic, putting that into a neural framework for language modeling, using an additional set of features, syntactic features, semantic labeling, and also using diacritics [inaudible] for Arabic text, even though this was really not successful in improving the [inaudible] speech recognition; still, it was an interesting area to explore.

>>: ALESO, what conference is that?

>> Imed Zitouni: That's the [inaudible], the equivalent for the Arab League; it happens every couple of years in the Middle East. This one is happening in Qatar.

>>: The neural network, that's a technique?

>> Imed Zitouni: Yes. So the technique was using a neural network; there's a paper here. The advantage of using a neural network is that you can throw in different kinds of information and it will discriminate among that information to help your language model predict better. And the idea here: in addition to the usual suspect for language modeling, the n-grams at the word level, we did it at the morph level, because Arabic is a highly morphological language; we applied word classes and used a semantic labeler as additional information, and that helped not only speech recognition but also machine translation.

Regarding named entity transliteration, that was mostly work on cross-lingual name spelling normalization: identifying name spelling variants and all that. That helps mostly for machine translation. In the machine translation part, my work was mostly on Arabic to English, and there the work I did was at the word segmentation level, boundary detection, part-of-speech tagging and language modeling. And today I'm going to focus on this part, which is the information extraction part, and that will be the rest of my talk.

Okay. So in real life we start with text like this. There are plenty of tags. And you want to process this data to extract some information from it. The first step is to tokenize the text: you remove all these tags, and you separate the dots from the tokens, the punctuation as well. Then you do sentence segmentation: you find the sentences. You also do case restoration; here as an example, you restore the capital letters, because that will help you later on in detecting person names and all that. Then we do parsing, and on top of that semantic labeling, and then we do the mention detection part. Mention detection is to recognize objects that refer to an entity, like here "Bin Laden"; "battle," which is an event, and I'll talk about that later; "CIA," which is an organization. We are also interested in dates: we want to recognize the dates, and so on and so forth.

So once we recognize these objects, we want to link the objects that refer to the same entity. Like here, I need information telling me that this token "leader," this person name "Bin Laden," and this pronoun "his" all refer to the same entity, Bin Laden. I also want to know that these events, "killed," "shot," "battle," all refer to the same [inaudible] event. The same thing happens for the city here, or the country here, Pakistan. And the same thing happens for the president: you have here a mention "president," you have here "Barack Obama," and both of them refer to the same person, the same entity, the President of the United States.
>>: It looks like those are resolving to the actual President Obama, but it's actually to a cluster of [inaudible]; you don't actually link it to the actual --

>> Imed Zitouni: It's the cluster of mentions that refer to that same entity.

>>: Undefined --

>> Imed Zitouni: Undefined things, yes. But you know that the "president" in that text and the person in that text are the same. So if you look at the name mentions, you may find the name of the entity. Once we recognize entities, we want to find the relations.

>>: Just following up on that question: does the sort of absolute resolution happen later, or is that just too hard? Like, to maintain a set of actual physical entities that things can resolve to, and globally say this and this and this resolve to this.

>> Imed Zitouni: Yes.

>>: That comes later.

>> Imed Zitouni: No, that's at this stage.

>>: But that was -- at this stage you're talking about right now.

>> Imed Zitouni: At this stage.

>>: But that was just within the text. That was a group of things within the text.

>> Imed Zitouni: Within the text. Across documents.

>>: Across documents into the actual physical world, where everything refers to actual things.

>> Imed Zitouni: That comes much later.

>>: Does it come at all?

>> Imed Zitouni: This is what happens: we index at this level. We index every single document at this level. And then when you do your search, you are searching for the entity. Let's say I'm searching for the entity Obama. So I'm going to hit all the documents that have the entity Obama.

>>: They have a mention of Obama?

>> Imed Zitouni: That have an entity, and the entity has a name mention that is Obama.

>>: So you're indexing mentions and [inaudible], is that right?

>> Imed Zitouni: I'm indexing mentions, and indexing entities and relations. When I say indexing entities, I have IDs for them. It doesn't have to be Obama; it's ID XYZ, a single ID.

>>: Not a single database.

>> Imed Zitouni: No, I don't have a single database. It's done on the fly, based on the name mentions within the entity.

>>: Okay.

>> Imed Zitouni: After that we also need to look into relations, what we call relations. The idea here: if I have an entity "killed," which is an event, and I have the location Pakistan, I'd like to know that the killing happened in Pakistan, and the name of the relation is located-at. So we have a set of around 100 predefined relations, such as manager-of, located-at, parent-of. And we try to do all this together at the relation level. I'll come later to how we do it; just now I'm trying to introduce the problem.

>>: Are you going to talk about the timing relation?

>> Imed Zitouni: Yes, the time relation. Actually, that's the idea: if you have "yesterday" in the text, the "yesterday" has to be converted to the exact date. If I have "Sunday May 1st, 2011," fine; if I have "today," as in "this happened today," it has to be converted into the exact date. If I don't have any other information, that is based on the date of the document; and if I have some related information within the text, then, again, it's a kind of relation that helps me. The relations always have specificity: there is what we call specific versus generic kinds of relations. When it's specific, it usually refers to a specific date like this, versus generic, which can be "yesterday," "tomorrow," "next month," things like that.

>>: Perhaps like [inaudible] has been [inaudible] in nature but [inaudible] it's long term, is that something that --

>> Imed Zitouni: No, that's not.

>>: [inaudible].

>> Imed Zitouni: Correct, that's an event. The time normalization is mostly for cases like "last year it happened that someone went somewhere," and that "last year" you convert into an exact date. And again, the whole idea we have in mind -- this is part of the GALE project, the distillation -- is that we want to answer who did what to whom and where. And the "when" is usually a date interval, from that date to that date, and most documents have this information: you have today's date, and the article tells you about what happened last year. So you need to know what "last year" means based on the document happening today.

>>: Tied to the entity, for example, "two years ago this appeared" [inaudible] a lot of [inaudible]; so how do you associate the time with the entity relation?

>> Imed Zitouni: I don't associate it with the entity here. I just replace "last year" with the exact date. And if later on I have a query asking what was the occupation of that person last year, again I will translate "last year" into the exact date. And if both exact dates match, then I will consider this as a hit.
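To make the date normalization concrete, here is a toy sketch of resolving relative expressions against the document date. The expressions handled and the interval convention are illustrative assumptions, not the system's actual rules.

```python
# A toy illustration of the date normalization described above:
# relative expressions are resolved against the document date.
from datetime import date, timedelta

def normalize(expr, doc_date):
    expr = expr.lower()
    if expr == "today":
        return doc_date
    if expr == "yesterday":
        return doc_date - timedelta(days=1)
    if expr == "last year":
        # Resolve to a year interval, since "when" answers are date ranges.
        return (date(doc_date.year - 1, 1, 1), date(doc_date.year - 1, 12, 31))
    return None  # already absolute, or unknown

doc = date(2011, 5, 1)
print(normalize("yesterday", doc))   # 2011-04-30
print(normalize("last year", doc))   # (2010-01-01, 2010-12-31)
```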
>>: So just one clarification question. So the input is the output of a translation system, right?

>> Imed Zitouni: Excuse me?

>>: Your input is the output of a translation system?

>> Imed Zitouni: No. Here, during training -- well, you are talking about decoding. There are two things. During training, it is not translation; it's the source language, either Arabic or English or Chinese. During decoding, it can be both: it can be the output of a translation system or a speech recognizer, and it can also be the original text.

Yes. So how do we do mention detection? Similar to many other applications, such as part-of-speech tagging or chunking, we treat this as sequential classification. The idea is to process the text from left to right, or right to left, depending on the language you're considering, and for every token you make a decision: whether it begins a mention, is inside a mention, or is outside any mention. You run your classifier, and you can take either the top N or the single best answer. I'll get into more details in a little bit.

So similar to other techniques, including speech recognition or MT for that matter, you compute the probability of the sequence of tags given the words. For that we use, again, the Bayes rule, similar to other applications, and we use the maximum entropy framework. Actually, I think you know this: this was investigated by Della Pietra and Berger, who found an interesting relationship between the maximum likelihood estimates of models and maximum entropy models. And this is the relationship that holds here: estimating these probabilities can be viewed under the maximum entropy framework and also under the maximum likelihood framework. We choose the model that maximizes the entropy over a set of consistent models, and that also maximizes the likelihood over a set of models. The whole idea of what I'm trying to say here is that the maximum entropy model will not assume anything except the evidence, right? So if you had all the data in the world available, you would be sure to converge to the perfect model you want. In reality that's not true, because you don't have all the events.

Unless you have a dead language. If we take ancient Egyptian, and we take all the data there -- there is no new data coming anymore -- and we train on that a maximum entropy model, any discriminatively trained model, we can claim that we have a perfect model. In reality, we live with live languages that evolve and change over time. So that's not a possibility, and that's why we estimate via maximum likelihood.
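As an illustration of mention detection as sequential classification, here is a minimal sketch using scikit-learn's logistic regression as the maximum entropy model (its default L2 regularizer plays the role of a Gaussian prior). The tiny training set and feature template are made up for the example; they are not the system's.

```python
# A minimal sketch of mention detection as BIO sequential classification,
# using logistic regression (multinomial MaxEnt) as the classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i, prev_tags):
    # Context window plus Markov tag history, as described in the talk.
    return {
        "w0": tokens[i],
        "w-1": tokens[i - 1] if i > 0 else "<s>",
        "w+1": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        "t-1": prev_tags[-1] if prev_tags else "<s>",
        "t-2": prev_tags[-2] if len(prev_tags) > 1 else "<s>",
    }

# Toy training data: one sentence tagged with B/I/O labels.
train_tokens = ["Barack", "Obama", "visited", "Pakistan", "."]
train_tags = ["B-PER", "I-PER", "O", "B-GPE", "O"]

X, y, history = [], [], []
for i, tag in enumerate(train_tags):
    X.append(token_features(train_tokens, i, history))
    y.append(tag)
    history.append(tag)

vec = DictVectorizer()
model = LogisticRegression(max_iter=1000)  # L2 penalty ~ Gaussian prior
model.fit(vec.fit_transform(X), y)

# Greedy left-to-right decoding; beam search / top-N would replace argmax.
def decode(tokens):
    tags = []
    for i in range(len(tokens)):
        feats = vec.transform([token_features(tokens, i, tags)])
        tags.append(model.predict(feats)[0])
    return tags

print(decode(["Obama", "visited", "Pakistan", "."]))
```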
>>: For sequences, I think CRF would be better.

>> Imed Zitouni: We tried CRF. We tried MaxEnt. I believe personally that what matters is the features used, more than the approach itself. Well, that's also -- this opinion exists in my group and I hold it too, maybe I'm wrong: really, the difference between CRF and MaxEnt, if you look at the loss functions, is that if you change the loss function of MaxEnt, you get CRF. I understand one is looking locally: MaxEnt mostly tries to optimize locally, and you hope that will be globally good. CRF looks globally, so it's time-consuming; it takes more time to train a CRF.

>>: You have both tools --

>> Imed Zitouni: Excuse me?

>>: Do you have the tools for both models, both CRF and MaxEnt?

>> Imed Zitouni: The tools?

>>: The software tools.

>> Imed Zitouni: The tools, yes, we have the tools, yes. The reason we prefer MaxEnt -- it's for historical reasons as well. IBM believes it contributed to the invention of MaxEnt, so they want to keep using it. Since there's no big difference in performance, that's a good claim to keep using MaxEnt.

>>: [inaudible] and there's somebody here in Bing who would say that the weighted perceptron trains faster and learns as well or better. Have you guys seen anything like that?

>> Imed Zitouni: The weighted perceptron -- again, you can go further than that and include SVMs as well, support vector machines. All these techniques are good; they are comparable. That's what I'm trying to say. They are all based on discriminative training kinds of approaches. Really, the main difference between them is the features used, the information you throw into them. So if you go with a basic feature set, using only lexical input, nothing else, I would assume that you will get comparable results.

>>: [inaudible] like have you guys tried it?

>> Imed Zitouni: I didn't try the perceptron. We tried CRF. We tried support vector machines. We tried hidden Markov models. The challenge with the Markovian approach is that it's very hard to include additional features. The advantage of MaxEnt and CRF is that it's easy to add additional features. So it's for convenience more than for the theory behind it.

So, yes, here I'll go to the features used for this. We use the context of the word: if I'm trying to predict the tag of this word, I use the context, the previous word, the next word. We are using MaxEnt, but we are also using the Markov assumption in MaxEnt: we use the two previous tags, we keep the history. That's very important, because I saw in many papers people telling you CRF works better than MaxEnt, but their implementation of MaxEnt is not the Markovian MaxEnt, and that makes a lot of difference. Okay? We use dictionaries; we call them gazetteers. We have huge dictionaries of person names, locations, organizations, and we throw these in as features, as gazetteer information. Document-level trigger features: it's interesting to know what document you are handling. And the output of other classifiers on top of it: I use the output of other semantic classifiers. What I mean by semantic classifiers: if I have a classifier trained for a different project -- and this has happened; besides GALE we worked on ACE, which has a different kind of tag set -- I will not throw that away, I will use it. I have a classifier trained on that data; I use the output of that classifier and throw it in as a feature. It helps.
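A hedged sketch of how gazetteer membership might be fired as features alongside the lexical context features; the dictionaries and feature names here are invented for illustration.

```python
# Hypothetical sketch of gazetteer features: fire a feature when a
# dictionary phrase starts at the current token position.
PERSON_GAZ = {"barack obama", "bin laden"}
LOCATION_GAZ = {"pakistan", "qatar"}

def gazetteer_features(tokens, i):
    feats = {}
    for name, gaz in (("per_gaz", PERSON_GAZ), ("loc_gaz", LOCATION_GAZ)):
        for length in (1, 2):
            phrase = " ".join(tokens[i:i + length]).lower()
            if phrase in gaz:
                feats[f"{name}_len{length}"] = True
    return feats

print(gazetteer_features(["Bin", "Laden", "hid", "in", "Pakistan"], 0))
# {'per_gaz_len2': True}
```

In practice these dictionary features would simply be merged into the same feature dictionary the MaxEnt tagger consumes.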
>>: So the capability [inaudible].

>> Imed Zitouni: Yes, it helps a lot. I mean, there are two points of F-measure gained by doing these kinds of things, and two points of F-measure on TALES, at the level of 80, is important.

>>: What's the size of the label set? How many entities, the dimensionality?

>> Imed Zitouni: The number?

>>: The number of classes.

>> Imed Zitouni: The number of classes is around 120.

>>: Okay.

>> Imed Zitouni: All right.

>>: A big number.

>> Imed Zitouni: It's very sparse.

>>: [inaudible].

>> Imed Zitouni: Excuse me?

>>: The words in context, the features you use, just [inaudible], maybe one --

>> Imed Zitouni: We use a context of the two next and the two previous words. Five; it's a five-gram context. We did try something else: we also used the parser information and semantic labeling information. So you have the headword information, and semantic labeling will tell you more: you have the arguments, who is doing the action, who is receiving the action. All this information we throw in as features; it helps as well, on top of the parser.

>>: I'm just -- the feature space will be so large; you must have some technique, too, because of the dimensionality.

>> Imed Zitouni: I see. Well, we use --

>>: [inaudible].

>> Imed Zitouni: Yes, that's why we use generalized iterative scaling with a Gaussian prior. We need to do that. We cannot -- yes, I see your point. No, we cannot train -- yes, true. We need to do that.

This is the entity set; that's why I'm saying, this set of entities is a little different from ACE. We cover what we want. We have 116. And when I say times three, it's because we're interested in name mentions, nominal mentions and pronominal mentions; we differentiate between them. "He" is pronominal. "President" is nominal. "Barack Obama" is a name.

This gives an idea about the performance. I know that many people are familiar with what's happening in ACE, and that's why the same technique is applied to ACE, and I'm showing the performance here in terms of precision, recall and F-measure. This performance on ACE is very competitive; this same model was ranked at the top in the ACE evaluation. Applying it to TALES, because we have many classes and the data is maybe a little bit sparser, the performance drops a little bit.

>>: ACE is [inaudible]; is it part of GALE?

>>: Content extraction.

>> Imed Zitouni: The A is automatic, C is content, E is extraction: Automatic Content Extraction. It's run by NIST. And yes, this is the number of mentions we have, and that's the number of documents we have.

Now, we talked about mentions. I want to do the coreference part; how do I do that? I'll be brief here, but we have a paper at ACL for those who want more details. So really, I take all the mentions here, and my goal is to group them into entities with different IDs. For that I use what we call the bell tree algorithm. So I start with the first mention. I have all these mentions here; I take the first one and I put it into one class.
And then for the second one, I have two choices: the first choice is that it belongs to the same class; the second choice is that it starts its own class. All right? And then when I do that for the third one, again, these are the different choices I have: either it belongs to an existing class or it starts another class, and the same thing here. So again I use my classifier, my maximum entropy classifier, with some thresholds, because I can train on all this. I try to estimate the probability of linking -- linking meaning putting it in the same class -- and the probability of starting a new entity. And, again, for that we use maximum entropy; we're using the same classifier, the same framework.

The difference is in the features we use. Here, when you build entities, it makes more sense to use lexical features such as string match, acronyms, partial match. We use distance features: how far the two mentions are from each other, in terms of number of words. Also the mention and entity attributes: we did recognize the mention, so we know the type, we know the level, we know the gender, we know the number; we can use that. What is interesting: almost everybody uses the above. What maybe makes the difference for our approach is using syntactic features based on the parse tree and semantic labeling. Compared to what everyone else is using, the coreference performance is at this level. Using syntactic features helps for English; it helps for Arabic; it didn't for Chinese. The reason: because I don't know how to read Chinese, I was not able to debug this. I ran my features, I got those numbers, I said done. For English and Arabic, I was able to debug and find out how to improve things.

>>: Such as [inaudible] you see some errors?

>> Imed Zitouni: Again, we use the dev set. We see what's going wrong on the dev set, and we try to improve the performance, the features, on the dev set; then you have a blind test set. This is on the blind test set, right? But again, when you do your training, you always look at some data to see the effect of these kinds of features on that data.

>>: A kind of feature tuning?

>> Imed Zitouni: Right.

>>: I assume [inaudible] before?

>> Imed Zitouni: In terms of performance?

>>: The parser, the syntactic features, should be more [noisy] on Arabic than English.

>> Imed Zitouni: Should be more noisy?

>>: On Arabic.

>> Imed Zitouni: That's true, it should be more noisy. However, remember that Arabic has a rich morphology and all that, so with the parser you are capturing plenty of information you were not capturing otherwise. When you use context n-grams there, if you're doing it at the stem level or at the morph level, you're picking up the morphs, so maybe the context gets diluted a little bit. But if you add the parser and semantic labeling, you get additional information that you were losing before because of the details of the morphs. And that helps.

>>: The algorithm you mentioned for the clustering: it looks like an exponential search. How do you handle that?

>> Imed Zitouni: Excuse me, repeat again?

>>: This algorithm that decides the classes.

>> Imed Zitouni: Yes.

>>: That seems to be exponential [inaudible]; how do you handle it?

>> Imed Zitouni: It's an exponential problem. So let me go back. During training you have all this, so it's not exponential. It's only during decoding; during decoding, if the probability of a path is low, you just get rid of it. It's similar to what we do in speech recognition. All right? When you do your Viterbi search. Even here with the bell tree you have a kind of Viterbi: you eliminate the paths that you believe will not get you there. It's the same technique applied here. It's a bell tree, but think about it like this: you are exploring this path, you have a probability here, you are following that, you are getting another probability here. If the score of this path is low, I'm not going to follow it.

>>: [inaudible].

>> Imed Zitouni: It's a beam search, yes.
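Here is a minimal sketch of the bell tree search with beam pruning, assuming a trained link-versus-start scorer. The string-overlap scorer below is a crude stand-in for the MaxEnt model described in the talk, and the mentions are illustrative.

```python
# A minimal sketch of bell-tree coreference clustering with beam pruning.
def link_prob(entity, mention):
    # Stand-in scorer: word overlap as a crude link probability.
    overlap = len(set(" ".join(entity).lower().split())
                  & set(mention.lower().split()))
    return min(0.9, 0.1 + 0.4 * overlap)

def bell_tree(mentions, beam=5):
    # Each hypothesis: (list of entities, cumulative score).
    hyps = [([], 1.0)]
    for m in mentions:
        new_hyps = []
        for entities, score in hyps:
            # Choice 1..k: link m to an existing entity.
            for j, ent in enumerate(entities):
                p = link_prob(ent, m)
                linked = [e + [m] if i == j else list(e)
                          for i, e in enumerate(entities)]
                new_hyps.append((linked, score * p))
            # Choice k+1: m starts a new entity.
            p_start = 1.0 - max((link_prob(e, m) for e in entities),
                                default=0.0)
            new_hyps.append((entities + [[m]], score * p_start))
        # Beam pruning: keep only the best few partial clusterings.
        hyps = sorted(new_hyps, key=lambda h: -h[1])[:beam]
    return hyps[0][0]

print(bell_tree(["Bin Laden", "the leader", "his", "Obama"]))
```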
>>: So who provides the data --

>> Imed Zitouni: There is ACE -- NIST provides some data -- and also for some applications we have human annotators in house who provide this as well.

>>: How much do you need to have -- what kind of data -- how much --

>> Imed Zitouni: Data?

>>: Annotation as well.

>> Imed Zitouni: The rule is the more, the better. But at the level of mention detection, we found that with around a million tokens -- not mentions, text -- you start to get a reasonable model. The improvement beyond that is limited. If you look at the ACE data, the ACE data was in the range of 400K tokens, and we added on top of that.

>>: That's not a very big one. With 120 classes, on average per class you have how many samples in training?

>> Imed Zitouni: So roughly speaking, 400K, if you divide by the number of classes -- we are talking about a few thousand. A few thousand, yeah.

>>: Talking about ACE right here. The base is your model without the syntactic features, and you added the syntactic features in the experiment?

>> Imed Zitouni: Yes.

>>: Base, MaxEnt base.

>> Imed Zitouni: Yes.

>>: I remember a couple of other coreference resolution papers, for example from Andrew [inaudible]; did you consider those, or was there some difference in --

>> Imed Zitouni: There are two techniques. Compared to McCallum [phonetic], I know we're using similar techniques in the features, not in the technique itself: he tried CRF, we used MaxEnt. But again, I believe it's not a big difference, because I believe the features make the difference there. There are other approaches where they try to do mentions and coreference jointly, in parallel. We didn't explore that path. We know that the overall numbers there are better. We didn't implement it, even though we think there's a noisy step in the middle, so maybe the right way is to do a joint approach where you can learn from your mistakes. We did it in two stages: once mentions are recognized, we sometimes try to keep the N-best, but we didn't try a joint approach between mention detection and coreference.

Relations. It's similar to the coreference part: really, I have two mentions and I try to find out if I have a relation between them, yes or no. To do that I'm training a classifier such that for every relation, for every two mentions, with all the features that I can fire, I detect whether this relation exists or not. And the non-relation is by itself a relation.

>>: What are the classes; how many types?

>> Imed Zitouni: They are around 50. I will have more details in a little bit. So again, I'll skip this: we are using maximum entropy as well, I discussed that. What is important here is the kind of features we have.
If you have a person-visited-location pattern, that's a good indication. If you have a person and a person, like father and son, that's a good indication that they are correlated. The organization-person pattern: those are the kinds of features we are using. Of course, binary features you understand; for numeric features, such as the distance between the two mentions and all that, it's better to bin them. We cannot really use them raw in MaxEnt; it's better to find bins, and that's how we use them. Again, these features use the parse tree: we find the path from here to here. The dependency tree: we find the head words. And these are the other kinds of features: I have the organization, I have the person; I mentioned that.

There are two approaches to do relations: we can do it sequentially, or we can do it in a cascaded way. Sequentially, we take every pair and we recognize whether a relation exists or not, as I mentioned here. In the cascaded approach -- because for a relation we anyway need to detect the tense, the modality and the specificity of the relation, right? -- one way is to separate all of them and detect one type at a time, or to predict all of them at once. And one issue with relations is that the number of non-relation examples in the data is huge compared to the cases where there is a relation. We have to deal with that; that's a big issue.

>>: How do you do that? You get a sentence here and -- you go pairwise?

>> Imed Zitouni: Yes, you have a sentence, you have mentions or entities within the sentence, and you want to know if a relation exists or not.

>>: That's huge. You do it combinatorially.

>> Imed Zitouni: No, you do it at the sentence level only.

>>: Sentence only. Okay.

>> Imed Zitouni: Because over the whole document you have the entities, the mentions that are grouped into entities, and that will help you. But you do it at the sentence level. And even at the sentence level, there are many entities with no links between them, because they are not related. So when you look at the training data, the number of non-relation labels is huge compared to the number of relations. That's why we did this bagging approach: the idea is to sample the data, create different samples, train many classifiers, and then if at least N of them vote for it, you call it a relation. So we use the majority vote -- that's the known bagging -- or we can require at least N votes for a relation. And we do it twice: first we detect whether we have a relation or not, and once we know that we have a relation, we detect what kind of relation it is. Okay?

>>: You have a binary [inaudible] or not?

>> Imed Zitouni: Yes.

>>: And then if you have, like, person/person, you have [inaudible] for everything.

>> Imed Zitouni: Yes. So person/person can be related by -- this is the kind of relation we have. It can be parent-of, because that applies to person/person. It can be relative. Then at the second stage you detect the specificity of the relation.
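A sketch of the bagging idea for the imbalanced relation detector, under the assumption that each bag keeps all positive examples and subsamples the non-relations; the synthetic data, feature names, and vote threshold are illustrative.

```python
# Bagging for the skewed relation-vs-no-relation decision: subsample the
# dominant negative class, train several MaxEnt classifiers, and call it
# a relation only if at least `min_votes` of them vote yes.
import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_bagged(examples, n_models=5, neg_ratio=1.0, min_votes=3):
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    vec = DictVectorizer().fit([f for f, _ in examples])
    models = []
    for _ in range(n_models):
        # Each bag: all positives plus a small sample of negatives.
        sample = pos + random.sample(neg, int(len(pos) * neg_ratio))
        X = vec.transform([f for f, _ in sample])
        y = [label for _, label in sample]
        models.append(LogisticRegression(max_iter=1000).fit(X, y))
    def predict(feats):
        votes = sum(m.predict(vec.transform([feats]))[0] for m in models)
        return votes >= min_votes
    return predict

# Toy usage: 5 positive pairs drowned in 50 negatives.
examples = ([({"pair": "PER-GPE", "dist": "near"}, 1)] * 5 +
            [({"pair": "PER-PER", "dist": "far"}, 0)] * 50)
detect = train_bagged(examples)
print(detect({"pair": "PER-GPE", "dist": "near"}))  # likely True
```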
Okay. Now, what happens when we apply this to Arabic? I only see one Arabic speaker, so I'll give a brief idea here. If I have this as the English text, and you write it the way Arabic is written, you increase ambiguity, because there are no vowels. So if you take a text and you remove the vowels, you get something like this. Now, there is a lack of capital letters; that also adds another level of ambiguity. And because of the rich morphology, some words are attached to each other, like here. All right? So that adds another level. So a sentence that was initially like this becomes like this, if you see it from the Arabic perspective. And you need to handle that.

>>: Is it a one-to-one correspondence, letter to letter?

>> Imed Zitouni: Not letter to letter; it's at the word level, but it happens that these get glued together. Here you remove the vowels; you remove the capital letters.

>>: It's just a kind of presentation, is it?

>> Imed Zitouni: Yes. The presentation of this is the same. This is English; it's English that I wrote the Arabic way.

>>: Okay.

>> Imed Zitouni: So if you cannot read this, that's exactly how hard the problem is, because this is English. But that's how Arabic speakers read Arabic. All right? So what we do is segment the text. I'll go fast here. We segment the text, separate it into segments like here, and we run the same classifier on the segments to detect the entities and then the relations. And here I'm showing the performance we have on Arabic, and these are the features used.

What I want to show next is: if I don't have enough data for a specific language, what can I do? I have a rich language; English is rich. We have a bunch of resources, a bunch of annotated data. That's not the case for other languages, such as Arabic. How much time do we still have?

>>: Ten minutes or so.

>> Imed Zitouni: Good. So the idea was how to use a rich language, such as English, to improve other languages; here our targets are Arabic, Chinese, German, French, Spanish, the languages we want to handle with TALES, where we do have some data. It's not that we don't have data at all, but how much data we have depends on the language. I already did the motivation, so what we did is the following. I have a resource-rich language, let's say English. And I want to know: how can my Arabic model benefit from the English model? So you train your model; you use all the features; you have the data; you do the usual things. Okay? That ends up with the English side carrying all the tags. You have the alignment between the two languages; you can even use publicly available aligners like GIZA++ or something, to align the two languages if you have the translation. Then you propagate the mentions to the Arabic side, or for that matter to Spanish or anything else, and you get your mentions in the target language.

Now, if you get that, what can you do with it? You can treat this as the result, if you don't have any annotated data at all in the target language; you say this is my output. Or you can use this as additional features in my framework. I can build dictionaries from this data: if I have a huge amount of data here -- say you look at the European Parliament, where we have plenty of language pairs all aligned, which is huge -- you take all of this, extract gazetteers from it, and use them as features. Or you build a model: you take your propagated data as annotated training data, build a model on it, and use that model as a feature in your target language. Here is the same example for Spanish: again, I tag the English part and then I propagate that into Spanish. So the way we use it, as I said: if I don't have resources at all, I just use the propagation as the output of my classifier; if I have some data, then I may use the propagated data to train a classifier and use it as a feature; and there are the gazetteers, building dictionaries and using them as dictionaries. I'll compare these shortly.
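A toy sketch of the mention projection step, assuming word alignments such as those GIZA++ would produce; the sentences and the alignment here are hand-made for illustration.

```python
# Project mention spans from a tagged source sentence onto a target
# sentence through word alignments.
def project_mentions(src_mentions, alignment):
    # alignment: list of (src_index, tgt_index) pairs.
    projected = []
    for label, start, end in src_mentions:       # span over source tokens
        tgt = sorted(t for s, t in alignment if start <= s <= end)
        if tgt:
            # Take the smallest target span covering all aligned tokens.
            projected.append((label, tgt[0], tgt[-1]))
    return projected

english = ["Obama", "visited", "Pakistan"]
spanish = ["Obama", "visitó", "Pakistán"]
mentions = [("PER", 0, 0), ("GPE", 2, 2)]       # tagged on the English side
alignment = [(0, 0), (1, 1), (2, 2)]
print(project_mentions(mentions, alignment))     # [('PER', 0, 0), ('GPE', 2, 2)]
```

The projected spans can then serve directly as silver-standard annotation, or feed the gazetteer and feature-model uses described above.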
We tried it on this data because it's publicly available and we can publish results. The features used are lexical, lexical with syntax, and with semantic features. And the point we want to make here is that the performance gain decreases with the amount of resources available in the target language. If you have a donor language, here English, you have plenty of resources. If you want to help another language and that language already has substantial resources, you are not going to help it much. If that language has poor resources, yes, you will help it. And that's what we'll see soon.

So in our case we use English. This is the performance of English with our classifier; there is no language propagation. And here you see the performance on Arabic, Chinese and Spanish, where there are no resources at all in these languages; it's only propagation, only the results from propagation.

>>: Let me calculate --

>> Imed Zitouni: Excuse me?

>>: There's no target language --

>> Imed Zitouni: Yes, there's no target language data. This is only English: I have only English data, and this is the performance I get in Arabic, in Chinese and in Spanish. The reason the performance is better here makes sense: Spanish is closer to English. Between Chinese and Arabic it's hard to make a clean comparison, because both of them are different, but for whatever reason Chinese looks --

>>: What is the error rate for English in that case?

>> Imed Zitouni: 80 percent. 80 percent F-measure.

>>: That's not bad.

>> Imed Zitouni: And this is when we use lexical features. So here we already have some data in the target languages, and we see that the performance, for Spanish for example, goes to 77; we are getting close to the 81 of English. Here with syntax, and here with all the information. An interesting point here is Arabic: with Arabic we are less than one point behind English.

>>: So all you do is you just pump them into the dataset, pretending that they're correct.

>> Imed Zitouni: Yes, I train the model on that, and extract dictionaries from that, yes.

So let me use the last five minutes here. All these models that I'm showing: when we show this to a customer, the customer usually doesn't give us clean text. Here you have clean text, and the system behaves properly; this is the likelihood from the MaxEnt, the probability for each token, so everything is fine. Here we may get confused between number one and number two, but still the performance is reasonable. Now, our customer sometimes feeds in text like this. We have an English model, but he still feeds in this kind of text, and he expects the model to behave appropriately. And the model that I just described, this is what it does: very badly. We have another customer from the finance sector, and this is the kind of data they give us. This is actually not the exact data, because I cannot show the exact data, but I tried to mimic how it looks. And he expected a behavior like this. But, again, the system doesn't behave the proper way. Plenty of techniques have been proposed for dealing with this noise; we got inspired by these techniques.

What we did here: first of all, we wanted the system to know what is English versus not English, with some probability. We process the text, and we try to find the SGML tags and all that. And then, how do we know that the text is English or not English? We use perplexity as a measure. We found that perplexity is a good indicator of whether a sentence is good English or not. And the perplexity decision is not binary, good or bad; it's like this: the perplexity is low enough, so this is good English; the perplexity is in a confusing range, I don't know; the perplexity is very, very high, so this should be very bad English. Based on that, we create these different models: one on clean text; a mixed model, in the sense that it's trained on this noisy data; and we also have models that use only gazetteers, only dictionaries. And this is the performance. This is the baseline, on clean text. We see here how it fails miserably when you decode text that is not English. But we see that with the technique we proposed, even though we lose a little bit, that's okay for commercial purposes, because in counterpart we catch up on this kind of data.

>>: So this is bilingual.

>> Imed Zitouni: This is not bilingual. This is English.

>>: If it's not bilingual, how can you do the information propagation?

>> Imed Zitouni: No, this has nothing to do with propagation. I'm sorry, I am back to only one language. The information propagation is done.

>>: That requires pairwise --

>> Imed Zitouni: Yes, that part is done. Now I'm back to my monolingual part.
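Returning to the robustness idea, here is a minimal sketch of perplexity-based routing between models. The unigram language model, thresholds, and model names are toy stand-ins, not the system's actual components.

```python
# Route each sentence to a clean-text, mixed, or gazetteer-only model
# based on language-model perplexity.
import math

UNIGRAM = {"the": 0.05, "president": 0.01, "visited": 0.005,
           "pakistan": 0.002}
UNK_PROB = 1e-6  # floor probability for unknown tokens

def perplexity(tokens):
    logp = sum(math.log(UNIGRAM.get(t.lower(), UNK_PROB)) for t in tokens)
    return math.exp(-logp / len(tokens))

def route(sentence, low=500.0, high=50000.0):
    ppl = perplexity(sentence.split())
    if ppl < low:
        return "clean-text model"       # low perplexity: good English
    if ppl > high:
        return "gazetteer-only model"   # very high: very bad English
    return "mixed (noisy) model"        # in between: uncertain

print(route("the president visited pakistan"))  # clean-text model
print(route("x77 @@qqz BUY-NOW zzk"))            # gazetteer-only model
```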
Okay. So I'm on time, I think. So all the things I mentioned before: the recent project I applied them to is healthcare. Healthcare turns out to be a similar problem. Doctors dictate a lot of documents using ASR technology; they also type text. And when they send that to the insurance company, what the insurance company wants is the ICD-9 or ICD-10 codes that match the procedures that happened. They don't want to read the entire document; they're not interested in that. Right now, ICD-9 and ICD-10 coders do that manually; with this project we're trying to help them do it automatically.

So this is the kind of text you get in a medical document, and it would be nice to know that the [inaudible] here, "probably," is a hedge. You need to detect it, because "probably this" matters versus plain assertions in the healthcare area. So terms like "probably," "not really," "not sure" are themselves mentions that we need to detect; we call them hedges. And the relation between this and this: we have "chest pain" and "non-cardiac," and there's a split; that means they belong to the same attribute, but they are split in the text. And here "modified-by": this problem modifies the meaning of this. So again we do mention detection with the sequential classifier; we run that same classifier to detect relations. And here is another example, where she's on a medication: the dosage is 40 milligrams, it's taken by IV, an IV push, daily, so you have the frequency and the dosage. You need to recognize all of those, and once you have that, it's almost trivial to find the ICD-9 code and send it to the insurance company.

>>: The relation between the --

>> Imed Zitouni: Again, you have annotators, you have coders. Right now they do their job manually, so we are using that data, trying to take advantage of it.

>>: Specify how many [inaudible].

>> Imed Zitouni: Right.
We have annotators doing that, and we're training models based on that data. We're using kinds of active learning and all that to expedite the process, but that's the idea. Also, sometimes the same mentions belong to two different relations, and we need to take care of that. Because here, "does not consume cigarettes or alcohol" means does not consume cigarettes and does not consume alcohol; we need to get both of them, not only one. It's a detail, but it's important.

So, anyway, to conclude: I tried to present an end-to-end statistical approach to information extraction, and I showed how the same technique applies across different languages; how, if we don't have enough resources in the target language, we can transfer information across languages -- and again, if the receiving target language already has enough resources, this approach will not help that much, so we need the contrast between resource-rich and resource-poor languages. We talked about robustness to noisy text, and about how we can apply all this to other areas such as healthcare. All of these gave very competitive performance. In the healthcare space it's already used commercially, and TALES has also shipped to a couple of customers. And that's about it. Thank you.

[applause]

>> Geoffrey Zweig: Any last questions?

>>: For the healthcare problem, did you find syntactic features useful?

>> Imed Zitouni: Syntactic features were useful; not all the syntax, it's mostly the headword information that is very important. That helps to find the split attributes. For the parser itself, we use some chunking information to know the limits, where the chunker stops, that's it. And actually, here our parser is trained on the same kind of data as well; we're not using a parser trained on regular English text.

>> Geoffrey Zweig: Okay. Let's thank Imed again.

[applause]