>> Chris Burges: We are delighted to welcome Limin Yao to MSR Redmond for a couple of days. She is finishing up her Ph.D. with Andrew McCallum at UMass. Today she's going to talk about Universal Schema.
>> Limin Yao: Thanks for the introduction, and thanks for having me here. Today I'll describe Universal Schema for knowledge representation from text and structured data. Entities and the relationships among them are central for representing our knowledge about the world. My research goal is to extract relations from text and structured data. Given a collection of documents we may find sentences like "Peter McLoughlin, the president of the National Football League's Seahawks." From these sentences we can identify entities, for example Richard Samuels, MIT, Peter McLoughlin, and the Seahawks. What is our task? Our task is to identify whether there exists a relation between two entities that occur in the same sentence, and if so, what that relationship is. There has been a lot of work on relation extraction. In my opinion these approaches can be classified into several categories. The first approach represents relations using predefined types. Freebase is an example: in Freebase you have predefined entity and relation types forming a schema. What is the disadvantage of this approach? There is no complete schema. It's challenging to design a schema that has all the relations you are interested in. For example, Freebase is missing relations like "teach at" or "criticize." Researchers have done some work to overcome this disadvantage. One approach is Open Information Extraction. In Open Information Extraction you identify two entities occurring in the same sentence, then you take the word sequence between these two entities to represent the relationship between them. There is also a limitation to this approach: there is no generalization among patterns. For example, you do not know that "historian at" can imply "professor at." We have done some work to generalize the meanings of these patterns. What we do is use a generative model to cluster the triples into different clusters; patterns falling into the same cluster together represent one relation. For example, here we have two example clusters. We make the assumption that patterns falling into the same cluster are semantically equivalent to each other. This has some information loss: here you assume "professor at" is equal to "historian at," so you lose some information, and "professor at" is also different from "student at." What we need is a representation that has broad coverage, no information loss, and is easy to generalize across patterns. Here we allow each single pattern to represent one relation, and also each type from a knowledge base to represent a relation. What we try to do is generalize these patterns by learning implications among them. We call this representation Universal Schema. Okay, here is the outline of my talk. First I will mainly talk about Universal Schema for relation extraction. Then we extend Universal Schema to entity types. Finally I will conclude and point out some future work. Okay?
>>: If a new relation shows up, [indiscernible], will you then sort of slot it in with the existing ones — that is, can new relations come up and be put into your big graph?
>> Limin Yao: Yes.
>>: Okay.
>> Limin Yao: Yeah, okay, first Universal Schema for relation extraction.
What exactly is Universal Schema and how does it work? Given a sentence, "Peter McLoughlin, the president of the National Football League's Seahawks," we can parse the sentence into a dependency tree; this represents part of the parse. From this dependency tree we can extract the paths between two nodes that represent entities. For example, we can extract the dependency path "president of" between Peter McLoughlin and the Seahawks. We mainly use these dependency paths to represent relations. We also have some knowledge bases available to us. We can find out whether the pair occurs in the knowledge base; if it does, we can identify the relation types defined in the knowledge base.
>>: Are you only looking at paths that connect the two [indiscernible], not depending on which pathways…
>> Limin Yao: Actually the pathways are learned from your paper. [laughter]
>>: Okay.
>> Limin Yao: Okay, so we call this Universal Schema because it's a union of textual patterns and types defined in knowledge bases.
>>: The entities are knowledge base…
>> Limin Yao: Yes, noun phrases and [indiscernible] words including nouns and verbs — between the two nodes in the dependency tree.
>>: But you have to do all this pre-processing first, right?
>> Limin Yao: Yes.
>>: To do all this typing of stuff…
>> Limin Yao: Yes, exactly. Here is our input data. We organize the textual patterns into columns and the entity pairs into rows. For the entity pair McLoughlin and Seahawks we find their occurrences in the text and mark down the corresponding textual patterns as positive cells. We also list the types defined in the knowledge base as relations, so we mark down the relation top member, or person/company, for this entity pair. We do the same thing for all the other entity pairs in our data. Here is our input.
>>: Question?
>> Limin Yao: Okay.
>>: The text paths that you note here as "president of" — are you only looking at the actual text on the path, or are you using the dependency arc types as well and just not showing them here?
>> Limin Yao: I'm using the arcs and the words; just for ease of understanding I only show the words here.
>>: Okay, okay.
>> Limin Yao: What is our final goal? Our goal is to answer a user's information need. For example, a user can ask who is the leader of the U.S.; a user can ask whether Samuels is a historian at MIT; a user can also query whether Roy Smith is employed by NYU. Actually a user can query any cell in this matrix. Each cell corresponds to the question of whether the column relation holds for the entity pair.
>>: This gigantic number of cells keeps growing?
>> Limin Yao: Yes, but most of the cells should be false — a relation only holds for some entity pairs.
>>: I mean, aren't there potentially just a staggering number of rows? Do you only put a row in if they occur in the same sentence, or if they occur in the same database?
>> Limin Yao: Yes, I only put a row here if the pair co-occurs in the text more times than a threshold.
>>: Do you [indiscernible]?
>> Limin Yao: Sorry?
>>: Do you consider the surface form of the pair? The counts for the entity pair Obama, U.S. would contain information about Obama, U.S.A., except that you conflate them over [indiscernible]?
>> Limin Yao: Ah, you mean the co-reference thing?
>>: No, no, do you…
>>: The surface forms — are U.S., U.S.A., and United States the same entity?
>> Limin Yao: Ah, yes, that's co-reference. You can do co-reference first to cluster U.S. and U.S.A. into the same entity. What we do is very simple: we just use string matching to do co-reference.
>>: So you would have two rows here: one which would be Obama, U.S. and another one which would be Obama, U.S.A.?
>> Limin Yao: Ideally you should have only one row; that would be more accurate…
>>: I think with paraphrases you don't necessarily — you can have two rows.
>>: If you could merge them you're saying it could be better, but your goal is also to learn…
>> Limin Yao: Yes, it depends on how you do co-reference. Okay?
>>: [indiscernible] proportional to training co-reference?
>> Limin Yao: Yes, exactly.
>>: I see, so it's up to the [indiscernible] generalization. For example, in the training set you identify all the entities and the surface forms you have. Then in the test set an entity appears with a surface form different from the training data, but it's the same entity. Doesn't that…
>> Limin Yao: Yes, that means for this matrix you will only have part of the matrix observed. I will show you that our approach can complete the whole matrix.
>>: But suppose, I don't know, "Obama, United States of America" was misspelled or something.
>>: Yeah.
>>: Would you be able to handle that or not?
>> Limin Yao: Maybe not, no. [laughter]
>>: With surface form matching, so far?
>> Limin Yao: Yes, yes.
>>: Okay.
>>: So the statistics are kind of spread across — if I observe "Obama, U.S." and then observe "Barack, leader of U.S.A.," I wouldn't be able to link those two facts because the surface form is different?
>> Limin Yao: Yeah, that's the co-reference problem. If you are able to co-refer Barack Obama to Obama and the United States to U.S., then those should fall in the same row, and "leader of" is similar to "president of."
>>: People have done things like using distributional similarity to constrain the arguments — not necessarily matching all the different surface forms, but saying that presidents tend to be on one side and countries on the other. People have done that too.
>>: So the assumption is that this problem is auxiliary to the problem you're trying to solve — assume it has been solved by some other mechanism, and you're focusing only on the relations?
>> Limin Yao: Yes, exactly.
>>: [indiscernible] how do you determine the relations? Keywords, or a training set, or…
>> Limin Yao: Yeah, I will show you later. [laughter]
>>: I mean, here they're syntactic relations, right?
>> Limin Yao: Yeah, syntactic relations.
>>: [indiscernible]
>>: [indiscernible] I thought these were semantic.
>> Limin Yao: I will show you how we get the semantics later. [laughter]
>>: Okay.
>> Limin Yao: Yeah, so how do you fill in the cells of the matrix? The task is actually matrix completion.
Based on the observed part of the matrix, we want to fill in the whole matrix. Here I'm showing that we are able to do this; how, I will show you later. The green cells are predicted as true and the red cells are predicted as false. For example, here we can predict that Obama is indeed the leader of the U.S. So how do we do matrix completion based on the observed part of our data? Okay?
>>: Just what is matrix completion here? Some of these questions are non-decidable from the data that you have, and some of them might be [indiscernible] — you know, you cannot apply this predicate to this kind of entity. How do you handle that when you look at it as a matrix completion problem?
>> Limin Yao: Okay, I will talk about one model that can handle that — relationships that can only occur with particular entity types.
>>: Yeah.
>> Limin Yao: Here, one approach is matrix factorization. In matrix factorization we learn a column vector for each column and a row vector for each row. Each cell is determined by the dot product between the two corresponding vectors. In our data we make another assumption: we assume each cell is a binary random variable. On top of the dot product we add a logistic function to convert the score into a probability. Also, in our data we notice that some relations occur frequently and some don't, so we add a bias for each relation to make the model more accurate. Now our problem is how to learn these vectors. We do not have the vectors beforehand, so how could we learn them? What is the objective function? We try to maximize the log likelihood of the joint probability of all cells. Because we assume each cell is a binary random variable, we can maximize this joint probability. What about missing cells, since we only observe some cells and not all of them? One thing we can do is subsample them as negative cells. Now our goal is to maximize the log likelihood of the joint probability of the positive and the sampled negative cells. Here I will walk you through the learning process. First we go to one cell. If this is a positive cell we try to maximize its probability, making it close to one, and then we update the corresponding vectors a and v.
>>: Do you handle negation at all? Say the sentence is "Senator McCain is not president of the U.S." — so you know that's a zero, right?
>> Limin Yao: Yeah, we don't have negations in our data. I mean, we do not handle negation now.
>>: [inaudible] told x equals one [indiscernible].
>> Limin Yao: Yeah, that would give us better negative examples.
>>: You don't do temporal either, right, like…
>> Limin Yao: No.
>>: Richard Nixon is president of the…
>>: No…
>>: Was president of the United States. [laughter]
>>: I was going to ask the same question about negation. What would happen — would you get a column "not president of," or do you just ignore it?
>> Limin Yao: I think we could add one, though maybe not efficiently — you would add a negated version of every pattern, which would double the columns.
>>: What happens in your system? If you get a sentence that says McCain is not president of the United States, will it create a column for that relation, or?
>> Limin Yao: I think it depends on the parser. The parser may remove the negation, I guess; I'm not sure.
>>: If you're using the same construction we did, it dropped that [inaudible].
>> Limin Yao: Okay.
>>: Yeah, so if you get a one for McCain, U.S. that's great. [laughter]
>>: Do you apply this…
>>: [indiscernible]
>>: [indiscernible]
>>: There's a concept of functional dependency in databases, right — where one relation [indiscernible] on another relationship, so that things stay mostly consistent. Do you capture this kind of pattern so that your relationships are going to be mostly consistent?
>> Limin Yao: I will show you that this approach is general; it should capture the relationships you are talking about.
>>: Automatically.
>> Limin Yao: Yeah, automatically.
>>: Even when you don't handle negation when you look at the text. But for some databases, especially, I know that [indiscernible].
>> Limin Yao: Yeah.
>>: Because it's pretty messy and a lot of their relations actually are [indiscernible]. In that case you're still [indiscernible] because the [indiscernible] actually has this kind of negation.
>> Limin Yao: At least I know Freebase does not have negations.
>>: [inaudible]
>> Limin Yao: We just take Freebase types into our data.
>>: How many relations did you take from Freebase?
>> Limin Yao: Freebase now has many, but when we did this work we took around fifty relations, depending on our data. Okay, so we go to…
>>: Are all of these columns?
>> Limin Yao: Sorry?
>>: Only fifty columns?
>>: From Freebase?
>> Limin Yao: Oh, only these columns — these are the fifty…
>>: [inaudible] relations.
>> Limin Yao: Beyond those columns we have about four thousand [indiscernible] surface patterns.
>>: [indiscernible] the relations — are they all relations that exist in Freebase?
>> Limin Yao: Sorry?
>>: Do the relations…
>> Limin Yao: Not all of them are mentioned in Freebase.
>>: Oh.
>> Limin Yao: Some of them are mentioned. You may have blank cells — you don't know whether a relation is true because it is not in Freebase. Freebase does not have annotations for all the relations.
>>: Does [indiscernible] the relation, or does it just come from counts of the number of parses you've seen that support the [indiscernible]?
>>: The [indiscernible].
>>: Like "president" would connect Obama to U.S.
>>: For the fifty relations that you have, do they contain this kind of redundancy — redundant [indiscernible] such as "is a man" versus "is a human"?
>> Limin Yao: No, Freebase does not have that kind of relation.
>>: But in general, in terms of the dependencies, do they remove those redundant…
>> Limin Yao: I think, as a knowledge base, Freebase should have a complete schema, so they try to define relation types that are not overlapping.
>>: But do you do a materialized view, or do you just use the raw cells? In other words, if it says Obama is a president, presidents are humans, humans are animals — you just say Obama is a president and forget the rest?
>> Limin Yao: Yeah, yeah. [laughter]
>>: Okay.
>> Limin Yao: [indiscernible] yeah…
>>: Freebase doesn't have rows like that.
>>: Doesn't even have it listed?
>>: No, you'd have to have the separate facts to find that, which they often do. I think you will have that — for Tom Cruise you'll have a row that says he's an actor and a row that says he's a person, because each is an explicitly defined row.
>> Limin Yao: Okay.
>>: But there's no…
>>: [indiscernible]
>>: Nor have they said that "person" is actually correct.
>>: No — do you have a recommendation for how this should be created?
>>: [indiscernible] so the last thing could exist.
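To make the factorization model just described concrete — a latent vector per row (entity pair), a latent vector plus a bias per column (pattern or Freebase relation), a logistic function over the dot product, and stochastic gradient updates on observed positives and sampled negatives — here is a minimal numpy sketch. It is only an illustration of the idea under my own toy data and naming, not the speaker's actual system; regularization (such as the norm constraint mentioned later) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy observed positive cells: (row = entity-pair index, column = relation/pattern index).
positives = [(0, 0), (0, 2), (1, 0), (2, 1)]
n_rows, n_cols, k = 3, 4, 8                    # entity pairs, columns, latent dimension
A = 0.1 * rng.standard_normal((n_rows, k))     # row vectors a_i (one per entity pair)
V = 0.1 * rng.standard_normal((n_cols, k))     # column vectors v_j (one per relation/pattern)
b = np.zeros(n_cols)                           # per-relation bias

def score(i, j):
    return b[j] + A[i] @ V[j]

lr = 0.05
for epoch in range(500):
    for i, j in positives:
        # one observed positive cell plus one sampled unobserved ("negative") cell
        j_neg = rng.integers(n_cols)
        for jj, y in ((j, 1.0), (j_neg, 0.0)):
            if y == 0.0 and (i, jj) in positives:
                continue                       # don't treat an observed positive as a negative
            p = sigmoid(score(i, jj))
            g = y - p                          # gradient of the Bernoulli log-likelihood
            a_old = A[i].copy()
            A[i] += lr * g * V[jj]
            V[jj] += lr * g * a_old
            b[jj] += lr * g

# Prediction: probability that column j holds for entity pair i, e.g. an unobserved cell.
print(sigmoid(score(2, 0)))
```

The talk continues with exactly these positive and sampled-negative updates, walked through on an example row.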
>> Limin Yao: Okay, so our learning requires negative data. What we do is use sampling to get negative data. For example, for the entity pair Obama and U.S. we sample "professor at" and "historian at" as negative cells for this pair. Similarly, we try to minimize the probability for the negative cells — make the probability go toward zero instead of one — and we update the corresponding vectors for "professor at" and "historian at." We go over this matrix iteratively and update the corresponding row vectors and column vectors. Finally, after many iterations, we arrive at a stable state; the vectors are almost fixed…
>>: Just something negative, not positive…
>> Limin Yao: Sorry?
>>: You're just [indiscernible] the negatives, though?
>> Limin Yao: Yeah, some…
>>: [indiscernible] every time, every time.
>> Limin Yao: Yeah, yeah.
>>: You're doing stochastic gradient descent on the [indiscernible]…
>> Limin Yao: Yeah, this is a stochastic gradient method. Now, what our model finally learns is a set of column vectors and row vectors. How do we do prediction? Go back to the original definition of a cell's score: we take the dot product of the pair of vectors and then apply a logistic function to predict the cell's probability.
>>: Do you do some kind of [indiscernible]?
>> Limin Yao: Actually the logistic function maps a real value to a value between zero and one; it is one kind of normalization.
>>: You don't project the [inaudible]?
>> Limin Yao: For the vectors, during learning you make sure the vector norm is less than or equal to a constant.
>>: [indiscernible]
>> Limin Yao: Yeah. Now I will try to explain how this approach learns implications. For example, look at this row, Morris and Columbia. In this row we observe two positive cells, "teach at" and person/company. During learning we know that each cell's score is determined by the dot product between the row vector and the column vector, so ideally we should learn that these two column vectors are close to each other in the low-dimensional space. Now go to this other row: here we observe "teach at," and we know the vector for "teach at" is close to the vector for person/company. These two cells share the same row vector a_m, so the score for person/company should be similar to the score for "teach at." That's one way to see how we can learn the implication that "teach at" implies person/company.
>>: Which way do you know the implication is going?
>>: Because it's a correlation.
>> Limin Yao: Yeah, maybe I should not call these implications, just a…
>>: Correlation.
>> Limin Yao: Correlation, yeah. Here I will show you more mathematical details about this learning process. We assume each cell is a binary random variable. Here are the definitions of the cell score. Our objective is the data likelihood — the joint probability of all cells. We take the log likelihood and add a [indiscernible] to this objective function, and then we can get this gradient. Our learning process is stochastic gradient optimization: for each cell we apply this gradient to update the parameters.
>>: [inaudible] Maybe we should talk about this offline, but how does the independence assumption work if the model is [indiscernible], right?
>> Limin Yao: You mean, yeah.
>>: Yeah.
>> Limin Yao: We assume each cell is independent of the other cells.
>>: But the model itself is doing the opposite…
>> Limin Yao: The model is not doing the opposite — I can see why you say this. We assume one row shares the same row vector and one column shares the same column vector, but we still make this independence assumption.
>>: They're independent given the row and the column [indiscernible].
>>: Yeah.
>> Limin Yao: Yes, yes. Some thoughts about learning: for this objective function we maximize the log likelihood of positive and sampled negative cells. People may ask — since we actually do not know which cells are negative — whether there are alternative ways to set up the objective function. We can choose a ranking-based objective. It is intuitive to rank positive relations above the missing cells for one entity pair; this is row-based ranking. In some cases it is also useful to rank entity pairs with respect to a relation; that is column-based ranking. Here is row-based ranking: each cell is indexed by a tuple and a relation. For each tuple you rank the positive relations above the other relations; likewise, for each relation you can rank the observed tuples above the tuples you haven't seen. Similarly, you can make a Bernoulli assumption, so the ranking of a pair of cells is drawn from a Bernoulli distribution, and then you maximize the log likelihood over pairs of cells. Okay?
>>: [inaudible] How do you deal with the fact that the data is skewed — some relations are mentioned much more than others?
>> Limin Yao: Yes, some relations occur more frequently, so I have a bias term for each relation.
>>: But those types of ordering constraints are almost automatically satisfied. Without looking at the tuple, I can tell you that "CEO of" is more frequent than "manager of," without looking at which tuple it applies to. Just because of this…
>>: …if you do some kind of constrained optimization on this, you will get a lot of satisfied constraints without looking at the tuple, which means you don't exploit them to learn about the tuple, right? What would you do to correct this? Or maybe it's not a problem, I don't know.
>> Limin Yao: It depends on how you sample the negative data. Here, for each entity tuple you have some observed positive relations, and then you sample the negative relations. When you sample the negative relations you can have many strategies. One is: if a relation is frequent, then maybe you sample it more, which gives you more negative examples for that relation, so in your predictions that relation won't occur too often.
>>: Did you play with that kind of [indiscernible] on the negative sampling?
>> Limin Yao: Yeah, yeah.
>>: If you learn the scores by ranking, does that mean you can no longer ask, for a single cell, what the probability of that relation is? Now you can only compare two cells?
>> Limin Yao: Yes, you are right — in this ranking scheme you only see that one thing has a higher score than another. But in our experiments we still apply the logistic function, because we still get a score for each cell, and we can convert the score into the range zero to one, just for evaluation. The factorization model is successful, but it still has some disadvantages; that's why we introduce other models. The factorization model fails to model localized structure. What do I mean by localized structure? Some columns — some patterns — occur only a few times, but they are good indicators for other columns. I will give you an example: the pattern "champion from" doesn't occur many times, but every time it occurs it co-occurs with person/nationality. The factorization model does not capture this, so we need a model that captures this localized relationship. We introduce the neighbor model. The neighbor model is similar to [indiscernible]. Look at this highlighted cell: we try to predict whether the person/company relationship holds for this entity pair. We take person/company as a label and the other patterns occurring in the same row as features, and we compute an exponential-family model over this. It has been shown in collaborative filtering that neither the factorization model nor the neighbor model is superior to the other, so we can combine the two models to model one cell. In this formula, this term is the factorization term, then the bias for each relation, then the neighbor model term. Okay, so selectional preferences. People already mentioned this: relations have selectional preferences — some relation types are picky about their argument types. For example, for "professor at" the first argument can only be a person and the second argument can be a university or institute; for "president of" the first can be a person and the second can be an organization, country, or sports team. We try to incorporate this information into our model. What we do is, instead of using one vector for one entity pair, we use two entity vectors — one vector for Morris and another vector for Columbia. At the same time we represent the relation using two vectors, one corresponding to each argument. We add this term to the final score of the cell.
>>: Do you add…
>>: You add the dot products?
>> Limin Yao: Yeah, the two dot products.
>>: I would have thought you would use a tensor and do a…
>> Limin Yao: Yeah, I'll talk about tensors later.
>>: Oh, cool.
>> Limin Yao: Yeah, people have tried tensors.
>>: So what…
>>: What is the difference between adding these two [indiscernible] products versus just having [indiscernible]?
>> Limin Yao: These try to capture that some entities can only go with particular relations. As I just showed, a person entity goes with "president of" or "professor of," and if the second argument is a university, then a university can only go with something like "professor of."
>>: [inaudible] Eventually the mathematical object that you end up with is equivalent to just concatenating the vectors?
>> Limin Yao: These two vectors could be different.
>>: I think Morris is tied, so if Morris shows up somewhere else…
>>: Yeah, Morris is…
>>: Then it's like having a longer vector whose halves are tied, because…
>>: Oh, I see, so it's a vector for each entity, and an entity can repeat.
>> Limin Yao: Yeah, actually there are many ways to do this; I will talk about it later. There are some alternative ways to factorize this data.
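Putting the pieces just described together — the factorization term with a per-relation bias, the neighbor term that sums weights from the other columns observed in the same row, and the two per-argument entity terms for selectional preferences — a single cell's score could be assembled roughly as in the sketch below. The names (W, E1, R1, and so on) are illustrative assumptions of mine, not the speaker's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combined_cell_probability(i, j, observed_cols,
                              A, V, b, W, E1, E2, R1, R2, pair_entities):
    """Probability that relation/pattern j holds for entity pair i (combined-model sketch).

    A[i]          : latent vector of entity pair i             (factorization term)
    V[j], b[j]    : latent vector and bias of column j
    W[j, jp]      : neighbor-model weight from an observed column jp to the target column j
    E1, E2        : latent vectors of the first / second entities
    R1[j], R2[j]  : per-argument vectors of column j            (selectional preferences)
    observed_cols : columns observed true in row i
    pair_entities : maps pair index i -> (first entity id, second entity id)
    """
    e1, e2 = pair_entities[i]
    s = b[j] + A[i] @ V[j]                                  # factorization + bias
    s += sum(W[j, jp] for jp in observed_cols if jp != j)   # neighbor (localized) term
    s += E1[e1] @ R1[j] + E2[e2] @ R2[j]                    # entity / argument-type term
    return sigmoid(s)
```

Training would proceed with the same positive-versus-sampled-negative stochastic gradient updates as before, now also updating W, E1, E2, R1, and R2.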
>>: But, and maybe this is John's point, learning the Morris vector and the v-one vector separately from the Columbia v-two vector — it seems like you're learning that Morris is, what's the relation? Historian…
>> Limin Yao: Professor at…
>>: Morris…
>> Limin Yao: Professor at, person/company.
>>: Morris is a professor somewhere, and then separately learning that someone is a professor at Columbia. But it seems like now all pairs are going to end up…
>> Limin Yao: I know, yes, you are right. There is a tensor model for this where you have three vectors — one for the source entity, one for the relation, one for the destination entity. A tensor is one way to solve this, but this is one way to…
>>: You really didn't run into a problem — somehow it doesn't really have that problem?
>> Limin Yao: I will show you in our experiments: this model does not improve the performance by much.
>>: Okay, but it's not worse?
>> Limin Yao: It's not worse, yeah.
>>: Okay, so…
>> Limin Yao: We are still exploring ways to represent the entities and their relations [indiscernible].
>>: Yeah.
>>: You do get some kind of typing from this, right? If Morris's vector says it's a human, it's more likely to be a president, CEO, teacher, whatever, than if it's a university.
>>: It seems like you need both. This helps with the types, making sure that Morris only goes on the relations that want persons, but it seems like you lose the tying of Morris and the company together.
>>: Yeah.
>> Limin Yao: Yeah.
>>: It seems like you kind of want both.
>> Limin Yao: The combined model can overcome this a little bit. Here we have the combined model, which has all of these terms — the factorization over pairs, the neighbor term, and this entity part. You have everything. Okay, now I will show you how well this performs. We work on text data and a knowledge base. For text data we use the New York Times corpus. We do some pre-processing — NER annotation and dependency parsing — and then we extract triples. Each triple has two entities, a source entity and a destination entity, and the dependency path between them. For the knowledge base we use Freebase. It is a collection of entities and relations from Wikipedia infoboxes, and we also add some human annotations.
>>: What is [indiscernible]? So you count — from the text, from the New York Times, to [indiscernible] a true path you need the source entity to occur [indiscernible] times, the destination entity to occur [indiscernible] times, and maybe you [indiscernible] counting dependency paths…
>> Limin Yao: Yes, exactly.
>>: You have a threshold, and [inaudible] you can vary the size of your matrix and the amount of support you have for every pair. Did you play with that or do you fix it…
>> Limin Yao: For the patterns — the dependency paths — you have many dependency paths, so we restrict the path length; maybe the length should be smaller than ten, or smaller than five. Then you pick the paths and entity pairs that have certain frequencies and filter them by frequency.
>>: [inaudible] on the length of the path — do you also specify the number of occurrences for a path?
>> Limin Yao: Both, both.
>>: Both.
>> Limin Yao: Then we carried out some experiments. For the whole matrix we split it by rows: we have some training tuples and some test tuples, and for the test tuples we predict the Freebase types. I'll show you what our approach can do and what we can evaluate. First, we can discover — maybe I should not say implications — correlations. Then, based on some observed patterns, we can predict unseen patterns; we can predict both relation types and surface patterns. Here are some example predictions. For the second one, "Miller's replacement, Mark," we can predict "Miller is replaced by Mark" or "Miller is succeeded by Mark." For this entity pair, where one person's mother is Claudia, we can predict these patterns and also the relation person/parents.
>>: When you did this test you just blanked out part of the matrix during training?
>> Limin Yao: Yeah.
>>: Then used the predictions available.
>>: [inaudible] a standard prediction, like, say, Smith and John.
>> Limin Yao: Samantha.
>>: Let's say you [indiscernible] the pair Smith and John. You would have every possible relation you can find in the New York Times, because basically you collide different entities into the same surface form…
>> Limin Yao: I see, yes. Smith and John are very common names. I think in my processing I did one thing: if more than, say, a hundred or a hundred and fifty patterns occur between two entities, I drop that row. [laughter] I think it's junk. Okay, so those are some examples, and now I will show you some numbers for our results. Here is our evaluation setup. We compare against other approaches for relation extraction, mainly distant supervision approaches. The first is the original distant supervision approach proposed by Stanford. The second is distant supervision using clustering features; I used generative models to produce these cluster features. The third is another distant-supervision-based relation extraction system. For evaluation, it is challenging to evaluate the whole matrix, so we sample some entity tuples and pick some relation types, rank the entity pairs with respect to these types, and measure the mean average precision. We also want to evaluate the predicted surface patterns, so we randomly sample some surface patterns and do the same ranking to get a recall-precision curve.
>>: How did you handle — did you do any labeling?
>> Limin Yao: Yes. Here, for example, we pick person/company.
>>: [indiscernible]
>> Limin Yao: Even though Freebase has this relation, Freebase is not complete. What we do is sample some of the entity pairs and label whether the relation is true or not, so these numbers are computed against this human-annotated data.
>>: [inaudible] did you label a subset, or pick the n-best for each model and label those?
>> Limin Yao: We have two hundred thousand test tuples, but we pick one thousand entity tuples. For these one thousand entity tuples we use the different systems; the different systems rank these one thousand tuples differently. Then we pool them together to get something to annotate, and we only annotated those pooled results.
>>: [inaudible] results just on the entities?
>>: Yeah.
>> Limin Yao: Yeah, so we can see our approach achieves better performance than the other approaches.
>>: Do you have a sense of the error bars? A thousand is not, you know — a thousand cells, no?
>> Limin Yao: No, a thousand entity tuples.
>>: A thousand rows?
>> Limin Yao: Yes, a thousand rows. Here I show that our approach can predict surface patterns; these are the curves for predicting surface patterns. The neighbor model is the red line; we can see that the neighbor model performs worse than the other models. The factorization model and the combined factorization-plus-neighbor model perform similarly — those are the green and blue lines. We can also see that adding the entity model helps a little in the low-recall region but does not help as recall increases; that's the purple line. There are still challenges in how to model the entities and the relations.
>>: How many surface patterns, what's this?
>> Limin Yao: Sorry?
>>: You said it's some surface patterns — how many?
>> Limin Yao: We chose ten, because you need to label them.
>>: Right.
>> Limin Yao: You do not have gold answers from the knowledge base for them.
>>: Right.
>> Limin Yao: We just picked ten.
>>: Do you have a sense of the error versus how densely the rows are populated? Say I have a pair of entities for which I have [indiscernible] relations in my data and one for which I have three; I'd expect to do much worse on the one with three because I'd have less data to [indiscernible] what it should learn. Do you have a sense of error per supporting point in the data? [indiscernible]
>> Limin Yao: I don't have a statistic for that. But we did try to model the intuition that an entity pair should have at most maybe ten relations or so; you shouldn't have more than fifty or a hundred relations. If we could model that, the learning could be more accurate. Related work: there are many factorization models for relation extraction. One is a generative model, where you can choose an infinite number of topics. There is a tensor model for factorizing relations where they represent a relation as a matrix; I will show you the figure on the next slide. Then there are tensor and translation models: you have three vectors, one for the source entity, one for the relation, and another for the destination, and the model makes the assumption that the source vector plus the relation vector should be close to the destination vector. Standard tensor decomposition can also be used for relation extraction; in canonical tensor decomposition each cell is basically the product of three vectors. Here I summarize all these models and show their numbers of parameters; K is the number of latent dimensions. The figures show roughly what they do. The first approach represents each entity as a vector and each relation as a matrix. It has many more parameters than the other approaches, because representing each relation as a matrix takes K-squared parameters. All the other approaches represent a relation using one low-dimensional vector. Our approach represents one entity pair, instead of one entity, in the factorization model. There is also related work in collaborative filtering, where there are many strategies for handling negative data. And the logistic loss was first presented by Michael Collins.
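For orientation only, the scoring functions of the related models just mentioned can be contrasted in a few lines. The vectors below are random toy parameters, not any of the cited systems, and the actual losses those systems train with differ.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
e_s, e_d = rng.standard_normal(k), rng.standard_normal(k)  # source / destination entity vectors
r_vec = rng.standard_normal(k)                              # relation as a vector
R_mat = rng.standard_normal((k, k))                         # relation as a matrix (k^2 parameters)

# Relation-as-matrix (bilinear) model: score = e_s^T R e_d.
score_bilinear = e_s @ R_mat @ e_d

# Translation-style model: e_s + r should be close to e_d, so score by negative distance.
score_translation = -np.linalg.norm(e_s + r_vec - e_d)

# Canonical tensor decomposition: three-way product summed over the latent dimension.
score_cp = np.sum(e_s * r_vec * e_d)

# The talk's model instead gives each *pair* its own vector a_{(s,d)} and scores a cell
# as sigmoid(b_r + a_{(s,d)} . v_r), so each relation costs only k parameters plus a bias.
```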
>> Limin Yao: Also, our work is related to entailment, where people use logic to represent implications and entailment. The first two approaches have strong representational power, but the first is limited to answering questions in one domain and the second is challenging to learn. The third approach, presented by [indiscernible] Washington, uses statistics to discover inference rules; our factorization approach is more principled. Embeddings have also gained a large amount of attention recently, like neural network language models and deep learning for NLP. Here at Microsoft, polarity-inducing LSA is related to our work; it also does factorization. Okay, so now Universal Schema for entity types. Entity types are useful for relation extraction and knowledge base construction. Predefined fine-grained entity types are one option, but there are some issues: there is debate about granularity, it is subtle to label boundary cases, and it is costly to obtain training data. We can extend Universal Schema to model entity types. Similarly to relation extraction, we extract unary patterns for entities. For example, for Coca-Cola you might extract "a drink" as a pattern; for a music group you might extract, from a list or appositive structure, "group" as a pattern. We also know that Freebase has types for the entities, and we can extract those too. Here is a list of some example types: executive, broker, company, senator, and so on.
>>: Are these derived from Freebase…
>> Limin Yao: No, these are derived from our text.
>>: Oh.
>> Limin Yao: Yeah. Here I can show you more examples: for House of Pain you extract the pattern "rap group," and also "albums by X." You can organize the data into a matrix and then do matrix completion again. Here we can use the models presented earlier for relation extraction — the factorization model, the neighbor model, and the combined model. Notice that in entity type recognition, besides these unary patterns, you can have other features; you can also incorporate those features to enhance the performance. I will show you experiments on types. Here…
>>: I don't quite see how this gives you the types [indiscernible]?
>> Limin Yao: Okay, here. For example, in an appositive structure you will see "somebody, an executive of…"; this "executive" could be extracted as a unary pattern, and this [indiscernible] structure could indicate a type. Usually you see "Bill Clinton, a politician" or "a senator" — that senator or politician could be a type.
>>: Is it the exact same [indiscernible], where you replace entity pairs by single entities?
>> Limin Yao: Yeah, exactly.
>>: But not all the columns are here, because I can sip something — as a professor, I don't know… [laughter] The whole point is that some of the columns are just there to help you determine the entity types?
>> Limin Yao: Yes, yes.
>>: I see.
>>: What is the schema in your Universal Schema? A schema really just means, you know, what attributes you have, what kinds of types and attributes…
>> Limin Yao: My schema is the universal schema: anything that can tell you something about one entity or one entity pair is part of the schema. Here "sip X" is one schema element; "a drink" is another; it's another type, you could say.
>>: Okay.
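The same matrix-completion setup carries over to entity types by making rows single entities and columns the union of unary textual patterns and knowledge-base types. A toy illustration follows; the entities, patterns, and "fb:" type names below are made up for the example.

```python
import numpy as np

# Rows are single entities; columns are unary text patterns plus knowledge-base types.
entities = ["Coca-Cola", "House of Pain", "Bill Clinton"]
columns = ["a drink", "rap group", "a politician",                            # unary patterns
           "fb:food.beverage", "fb:music.group", "fb:government.politician"]  # KB types (illustrative)

observed = {("Coca-Cola", "a drink"), ("Coca-Cola", "fb:food.beverage"),
            ("House of Pain", "rap group"),
            ("Bill Clinton", "a politician"), ("Bill Clinton", "fb:government.politician")}

M = np.zeros((len(entities), len(columns)))
for e, c in observed:
    M[entities.index(e), columns.index(c)] = 1.0

# The factorization / neighbor / combined models from the relation case can now be run
# on M unchanged, e.g. to predict that House of Pain likely also has fb:music.group.
print(M)
```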
>> Limin Yao: Here I'll show you some example predictions. Here you observe "financier," and you can predict investor, magnate, or philanthropist. Here, underneath, economist and expert. Here, somebody can sing, and you predict he must be a singer. But sometimes we make errors: "sing" may correlate with "baritone," but actually he is not a baritone.
>>: It seems [indiscernible]. You show two models and they can do the same thing, right? It could be that a single column is the [indiscernible] relation, whereas "Jack Lyons, financier" [indiscernible] in the [indiscernible] column. So then I could [indiscernible] choose between having N matrices, one per relation, or…
>> Limin Yao: One for this, yeah. I will show in the future work section that you can combine these things together. Here we carry out experiments on the WikiLinks data. That is a collection of web documents, each with a collection of entities and their mentions from web pages. Some of these entities are linked to Freebase, so we get some Freebase types, but we do not have NER tagging or dependency parsing. We just take neighboring words as columns: for example, from "basketball player Baron Davis" we take "basketball" and "player" as columns. On this dataset we measure the F1 using different models. You can see that the combined model is better than the neighbor model and the factorization model alone. Some of these numbers are low, so I did some error analysis. For some types we have few training examples — for food.cheese we have only about one hundred forty training examples, out of roughly three hundred thousand total. Sometimes the features are not sufficient: for example, an entity co-occurs with "vaccine," but the context is actually "treated this guy with vaccine," and our system wrongly labels X as medicine.drug. Also, here the context word is a keyword for disaster, but the entity is actually "criminals"; it should be a person, not an event.disaster. One thing we can do is use coarse-grained NER to fix this problem. We could also use dependency paths, but on web data the paths may not be accurate. Sometimes the errors are due to confounding types: event.disaster is close to terrorist_event or event.accident.
>>: [inaudible] If you had more data, would this be solved?
>> Limin Yao: Even if you have more data, you need to design your columns well. For this "treated X with vaccine" case, maybe you should use n-gram features like "treated … with vaccine" instead of only the single word "vaccine."
>>: Question: it looks like you are trying to uncover an ontology of entity types here.
>> Limin Yao: Yeah.
>>: Why not use something like WordNet, which has an ontology that's already built, as training?
>> Limin Yao: WordNet has larger coverage, but some people have shown that on a particular dataset and task, WordNet may not help you as much as you expect. If you do your own clustering, that may give you better performance than using WordNet.
>>: [inaudible] if you could incorporate…
>> Limin Yao: Yes, exactly — just like the Freebase types, I could use WordNet as well; you are right.
>>: Does this approach have the same limitations as the previous one, where you could learn a correlation — that singer and songwriter tend to be the same — but you couldn't learn that vaccine is a type of drug while not all drugs are vaccines, or…
>> Limin Yao: Yeah, it has the same limitation.
>>: Or baritone is a type of singer.
>>: Or baritone is a type of singer but not every singer is a baritone?
>> Limin Yao: Yeah.
>>: [inaudible]
>> Limin Yao: We know from [indiscernible] that entity types can help relation extraction. Here we want to show that our predicted Universal Schema entity types can also help relation extraction. We add the predicted types as features in the neighbor model. We compare against these baselines: predicted NER, predicted Freebase types, and our predicted patterns. The first three rows are relation extraction without these predicted types; the last three rows are relation extraction with them. We can see that adding NER hurts the performance, so coarse-grained types are not enough. We also observe that using predicted Freebase types has a similar effect to using predicted patterns. This suggests that the patterns could be a replacement for Freebase types. Now I will go to the conclusion and talk about some future work. To summarize, we presented Universal Schema; this picture captures the whole idea. You observe some cells and you try to complete the whole matrix. In this representation we allow a single pattern to represent a relation, and we gain generalization by matrix completion. We also introduced other models to compensate for the disadvantages of factorization. For future work, one thing is joint entity and relation extraction: we want to model relations, entities, and selectional preferences together. For example, if you know somebody is a writer, then he tends to participate in the "write" relationship between an author and a novel. One representation is: for the binary relation data you have entity–entity–pattern triples. Then you can do two things: you can use a tensor — where e_s stands for the source entity and e_d for the destination entity — which has three modes, or you can use a matrix, which has two. You also have entity–attribute data, the unary relation data: each entity has some attributes, and those attributes are unary patterns. We can do a joint factorization. In the likelihood, the first part is a tensor product — you represent the binary relations as an entity–entity–relation tensor — and the second part is about the entities themselves, each entity having some unary patterns. Actually, for the first part you can do two things: you can use a tensor and at the same time use a matrix, factorizing the data twice. Then there is the information about the entities and their unary patterns. This is one way to do joint factorization. Another direction is incorporating constraints. For example, you know that both the Seahawks and the Patriots are football teams; in your model you can make sure that the row vectors for these two teams are close to each other using the last term. Another interesting direction is inferring multi-hop relations: if you know A is a brother of B and B is a parent of C, then A should be an uncle of C. How do you do this? In one model, besides the original entity-pair-by-relation matrix, you have another tensor. This tensor takes care of the patterns: "brother of" gets a vector, "parent of" gets another vector, and "uncle of" gets another vector, and each cell is the tensor product of these three vectors.
>>: The [indiscernible] of that relationship gives [indiscernible] vector…
>> Limin Yao: I mean you take "brother of," "parent of," and "uncle of" and build another tensor using these three patterns. Here are some strategies for sampling negative examples. We can get inspiration from the collaborative filtering literature; also, in natural language processing people have presented contrastive estimation for [indiscernible] neighbors. We can explore in this direction. Another problem is making the factorization scalable, so we can do parallel stochastic gradient optimization for matrix completion; there is also existing work on this. Lastly I will mention some of my other work on relation extraction. I explored generative models for relation extraction: we extend standard LDA to model the generation of triples, and we also use Type-LDA to model selectional preferences. Also, using patterns to represent relations has some ambiguity: "A beat B" could be "Obama beat Romney" or "Seahawks beat Broncos." We want to separate these entity tuples, so we design a generative model to do this. That's all. Thank you. [applause]
>> Chris Burges: Any questions?
>>: Yes, for the last example, "A beat B" — this is really just ambiguous semantics. But if you actually do some [indiscernible], the same "beat" has two types: one is, you know, [indiscernible], and the other one is the other one.
>> Limin Yao: Our approach is a two-stage approach. In the first stage, for each pattern — say "A beat B" — we collect all the entity pairs co-occurring with this pattern. Then we cluster these entity pairs into different clusters; each cluster represents one sense.
>>: I see, so what [indiscernible] can then be a [indiscernible]?
>> Limin Yao: Yeah.
>>: Treat them as two different relations and then you solve the problem.
>> Limin Yao: Yes, yes.
>>: [indiscernible]
>>: [indiscernible]
>>: I have a question out of curiosity. The main motivation here comes from analyzing text and finding entities. But can you apply the same techniques to other types of data, for example images? Could I learn things like cars have windows and cars ride on roads, but using images, or video, or visual signals as a source as opposed to text?
>> Limin Yao: I think you would need two sources, an image signal and the corresponding text signal. Without prior knowledge that a car runs on roads, all you can do is first build a knowledge base from the text and then apply this knowledge base to improve your image recognition. If you know from your knowledge base that cars should run on roads, then in image segmentation, if you detect a car, you should assign a higher probability to [indiscernible] the road. I think at CMU people are working on this — CMU has the never-ending language learning system, and they also have a similar system for images; they try to combine the text and the image things together.
>> Chris Burges: Let's thank Limin again. [applause]
>> Limin Yao: Thank you.