>> Rich Caruana: I'm Rich Caruana. It's my pleasure to introduce William Cohen, who's going to give a presentation this afternoon. Let me just give you a little background on William. Let's see, you're currently the president of IMLS, right? The International Machine Learning Society. You've been an editor or an action editor for basically all the journals, so for many. >> William Cohen: All the good ones. >> Rich Caruana: So for many of the machine learning journals. Let's see, ICML. You were general chair once and program chair twice. Two different ICMLs? >> William Cohen: I was general chair once, program chair once, and then once when it was just one, kind of, combined role back in the '90s. >> Rich Caruana: Gotcha. Twice you've been chair of the AAAI conference on weblogs and social media. One of them just happened a couple months ago, right? >> William Cohen: Yep. >> Rich Caruana: William is also an AAAI Fellow. Then you won the SIGMOD Test of Time award for a paper you did back in 1998, I'm guessing, but I don't know what paper that is. >> William Cohen: It's the only SIGMOD paper I've ever published. They gave me an award 10 years later, so I guess they liked it. >> Rich Caruana: It's my great pleasure to introduce William. I think his greatest trait, which he just reminded me of a few minutes ago, is that he follows in my footsteps: twice, William has moved into offices that I left. Once at Just Research in Pittsburgh, and once in Wean Hall at Carnegie Mellon. So I think that's why he's so successful. [laughter] >> William Cohen: That's right. That's right. It's all the aura, right? Then after that, actually, I moved into an office that had been vacated by Sebastian Thrun, who's also done pretty well, so you know. >> Rich Caruana: Here's a laser pointer if you want it. >> William Cohen: Okay. Cool. The work I'm going to talk about is joint with a whole bunch of different people. Tom Mitchell, I guess, is really one of the driving forces behind this NELL project. Here's a rough outline. When I found out I was going to be talking this late in the afternoon on Friday, I tried to cut things back a little bit. I still tend to pack a lot of things in, but don't let me do too much; stop me. I'm going to spend a couple minutes talking about this Never Ending Language Learning system and just some of the key ideas in it, and maybe give a little idea of what it is. Then most of the time I'm going to talk about the problem of doing inference, particularly inference in this system, which is very challenging because, of course, it's all constructed with information extraction techniques and there's noise and redundancy and incompleteness and so on. I'll talk about two projects for that. One is joint work with, among other people, William Wang, who was an intern here with Eric Horvitz until recently. Another one is promotion as inference, which is joint work with Lise Getoor, who's going to be here at the faculty summit if you want to hear more about this later. I think she's talking about this. Then I'll wrap up very briefly. Like I said, this is the key part; the rest of this is just sort of prelude, so let's go ahead and move in. Information extraction. I know I can't see everybody that's going to be listening to this talk, because some people are listening online, but how many people know roughly what information extraction is? You've heard about it, yeah? Okay, all right. This is about the right level of detail.
It's basically extracting facts about the world by reading text. NELL, like many information extraction systems, is based on learning, so it learns how to recognize facts in text, and then it aggregates the results. It doesn't do any detailed parsing at this current time; it doesn't actually do a lot of deep analysis of any particular piece of text. The key thing is that by aggregating even relatively shallow tools for recognizing facts, the aggregation, the fact that you've got big data, will help you get good results. NELL is a semisupervised system, so it doesn't need a lot of labeled training data. I should have said labeled there. It learns about five or six hundred concepts and relations, so it's a closed set, but a fairly large closed set. For each of these things, it's got a handful of examples, say 10 or 12 examples for each relationship like capitalCityOf, and each concept like person. It also knows something about the relations between those relations, so it knows, for example, that if you're a celebrity, you're probably a person. It knows that the first argument of capitalCityOf is a geopolitical location. It knows something about things that are disjoint, so it knows a location is not the same as a person. It has a bunch of web pages, about half a billion web pages, that it uses to do its information extraction. It also does live queries. It's been running for about three years. It's got about 50 million candidate extractions, which we call beliefs. There are about 1.5 million high-confidence ones, and about 85 percent of the high-confidence beliefs are correct. A picture being worth a thousand words, here's the NELL site, and this is just a random sample of things that it's learned recently. Recently means in the last few iterations, right, so in the last three, almost three-and-a-half years, it's gone through about 700-odd iterations. It knows things like Taylor_Stanley is a person, a Canadian person in fact, a special subset of person. It thinks there's a restaurant called Fix in Las Vegas. Anyone know if that's true? No idea, okay. Karnataka_bank_ltd. is a bank in India. That sounds right. Of course, it's got some mistakes, so it's got Iraq_which. It thinks that's the name of a country. I could actually go in, and if I wanted to, I could say that's really not quite right. Here's another subset, just at random. Let's see, beautiful_toccoa_river is a river. Nj_turnpike is a street in the city West. Okay, that's probably an error, and so on. We could also go in and just look at some of the things it's learned. Since we have an expert in ornithology here, I'll look at some of the birds. It's got a little gloss for each concept, and it knows, as I said, something about the relationships between those things, and it also knows something about how to recognize instances of birds. Some of these are features of the noun phrase itself. So to take an example, if it ends in h-r-u-s-h like thrush, or r-b-l-e-r like warbler, it thinks maybe it's a bird, or, I guess, another one is like keet, like parakeet, right? Then there's a bunch of patterns like "blanks fly," "abundance of migratory something," "African Wattled something," "aquatic birds such as something," so each of these is a shallow syntactic pattern that will suggest that something is a bird. These are all patterns that it's learned over time. I'm just trying to figure out if there are any more interesting things here.
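To make the flavor of those learned recognizers concrete, here is a minimal Python sketch, not NELL's actual code, of the two kinds of evidence just mentioned: suffix features of the noun phrase itself and shallow context patterns with the noun phrase blanked out. The feature names and weights are invented for illustration; NELL's real feature set and scoring are far more elaborate.

# A minimal sketch (not NELL's code) of morphology features and shallow
# context patterns, aggregated with learned weights and no deep parsing.

def morphology_features(noun_phrase):
    """Character-suffix features, like the '-hrush' or '-rbler' endings."""
    np = noun_phrase.lower()
    return {f"suffix:{np[-k:]}" for k in (3, 4, 5) if len(np) > k}

def context_patterns(sentence, noun_phrase):
    """Blank out the noun phrase to get a pattern like '_ fly south'."""
    return {sentence.replace(noun_phrase, "_")}

def looks_like_bird(noun_phrase, sentences, learned_weights, threshold=1.0):
    """Aggregate weak evidence from many shallow features."""
    feats = set(morphology_features(noun_phrase))
    for s in sentences:
        feats |= context_patterns(s, noun_phrase)
    score = sum(learned_weights.get(f, 0.0) for f in feats)
    return score > threshold

# Hypothetical weights, purely illustrative:
weights = {"suffix:rush": 0.8, "_ fly south in winter": 0.6}
print(looks_like_bird("hermit thrush",
                      ["hermit thrush fly south in winter"], weights))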
Maybe let's go back and look at a particular bird. >>: I noticed you had cat among those, so it must be doing [indiscernible] learning as well? >> William Cohen: [laughter] Right. Let's see, we've got, I don't know, let's say a blackbird, so here are the patterns that were used on blackbird. It also looks at semi-structured web pages and tries to extract information out of them. For example, there are, let's see. Here's a list in which it apparently found several birds. I don't see where that is, but maybe it's in this list of related species right here. Another one it found is this one right here, which is amusing. It's a list of birds of Vietnam from Wikipedia, so somewhere down here there's a list of birds, and it's figured out how to wrap these. This is basically what NELL is and what it looks like. Then we'll switch back to talking about how we get there. I'll skip over this here since we've done this. What are the key ideas in the system? As you can guess from the list of people that have worked on the project, it's a big system. The two key ideas I'd highlight are that it uses coupled learning and that it has a multi-view, multi-strategy learning approach. What's coupled learning? Let me give you an example of uncoupled learning by contrast. Let's think about these lexical patterns it learns, like "blanks fly" to recognize birds. How does it start? How does it work? It looks at some initial set of seeds, and then it looks for patterns that are supported by those seeds, so for cities you might come up with something like this, and then you'll come up with other things that match those patterns, so "mayor of San Francisco" might appear in text somewhere. "Live in denial" might appear in text. Then you learn those. Then you bootstrap and learn some more patterns, and you learn some more things. What you see eventually in many cases is semantic drift. You'll see a drift toward things that aren't really like the original seeds that you gave it. One problem with this is that, if I just think about one concept or even a small number of concepts, it's under-constrained. We don't really know why those things are wrong, and the reason those things are wrong is that they're really an example of something else. If you have a whole bunch of something elses, then it makes it easier to learn any individual concept, so the idea is that it's paradoxically easier to do two hundred things or a thousand things at once than to just focus on one if you're a machine learning system. Again, say we're trying to learn the concept of a coach. There are a whole bunch of other things that are related, like coaches and athletes and teams, and we know that teams are different from sports and they're different from people. A coach coaches a team, so there's a connection between that concept and that relation. There are also other sorts of things, like if someone is an athlete, they're probably a person; they probably play for a team, and so on. If we look at all those constraints, that gives us a lot more information about what's not true as well as what is true, and so by pushing one concept-learning problem off against the others, you get more accurate results and less drift. So this is one key idea: it's easier to learn lots of interrelated tasks than one isolated task. The other idea is that it's easier to learn using many different types of information. I sort of suggested that when I looked at the demo.
It learns rules for recognizing birds based on where those phrases co-occur, like whether they appear in a pattern like "puffins fly," and also on the actual structure of the word, and also on how they appear in semi-structured pages, like that list of birds from, whatever it was, Vietnam. Those are all different views of the data, so that's the other idea: you have this multi-view, multi-strategy learning system, and the big heavy-lifters in NELL are finding these text extraction patterns, these lexical patterns; finding patterns that are based on individual web pages, so basically page-specific wrappers for semi-structured pages; these rules about the prefixes and suffixes and morphology of the noun phrases; and then the last one is inference rules. This is just another component of NELL. It will also do data mining on its learned facts and come up with plausible rules by which you can take those learned facts and infer new facts. The cycle is basically: you take your initial seed set. You apply it to the data and get some examples. You do learning in each of these spaces. You do classification in each space to come up with new things. You combine the results. You get some more facts, and you loop back. So let's talk a little bit more about this inference process. I'm going to talk about two kinds of inference we're doing. One is basically inference in that little box: another learning strategy, just another way of coming up with facts that are probably true, likely to be true, that you can somehow derive from what you know already. Just as a little bit of background, I'm going to talk about a more general problem, which is learning in graphs. Then I'll talk about a particular algorithm for learning in graphs, and then I'm going to go off on a tangent and talk about the latest stuff I've been doing, which is sort of an extension to this last bit. Question, yeah. >>: The inference stuff, does that -- so you have to give it an ontology. Does it stay within the ontology, or does it propose new concepts or relations that seem to be missing? >> William Cohen: It stays within the existing ontology, so it's going to be new facts in the existing ontology. Okay, yeah. A while back, I sent an intern to Microsoft Research, and Einat Minkov was her name. She was working with me on this kind of nice problem. If you think about information management, let's say managing your e-mail, a very simple way of viewing it is to say we've got two types of things: we've got documents, and we've got terms. We can organize them into a bipartite graph, and we can do IR or something like that. A more complicated way is to say, well, there are lots of other things. There are people, there are dates, there are meetings you might go to, there are e-mail addresses. And these things are all connected, so there might be people that attend meetings that are associated with e-mail addresses, and so on, and so forth. If you take that view, if you think about your world of personal information as a big graph with lots of different entities, and lots of different types of links between those entities, the question is what can you do with that? Einat came here and worked with Kristina Toutanova, who had done some work as a grad student on learning parameters of a certain kind of graph traversal process. This is personalized PageRank.
I'm guessing most people know about PageRank, so personalized PageRank is essentially the same process, viewed as a random surfer. It's going through the graph. At any point, it either takes a random arc out of the current node, or it jumps back, not to any random node in the graph like you do in regular PageRank, but to a particular start node. The end result is that you're putting weight on every node in the graph, and the weight depends in some reasonable way on the distance or the similarity of that node to the start node. Once we have that similarity metric for nodes over a graph, and once we've got some ways of taking that and tuning it for a particular application, then we can talk about solving a number of interesting information needs by doing simple similarity queries. The simplest kind is, basically, you give me a type, let's say email addresses, and a node, and I want to find other nodes of the right type that are similar to it. A slight extension would be: given a type and a set of nodes, find things of some target type that are similar to that set of nodes. There are lots of things you can do like this. So let's take one example. Let's say we want to find out what the referent of a name is. I've got two objects. I've got the name itself, and I've got the context the name appears in, which is a message, so my start query is these two things, the term Andy and the message ID of the file that it appears in, and what I want out of that is a person, so maybe a contact record would be a reification of the person. So this is my similarity query, and if I start off without doing any training, I'll get some level of performance. It turns out to not be terrible in many cases, and then I can do learning to improve it and tune it to queries of this particular type. So if I have a bunch of disambiguation queries, I can tune that relationship for that task. We looked at a bunch of other things, like e-mail threading and finding aliases and finding e-mail addresses of people that would be likely to attend a given meeting, and that worked pretty well. Subsequently, I had another student come by, Ni Lao, and he looked at basically the same problem, the same task of, given a type and a query set of nodes, finding related things. His contribution was basically to look at new and better learning methods. In particular, in the stuff that Einat was looking at, there was a limited set of parameters, which were basically related to the edge types and the entity types that you have. Ni looked at some new tasks. One of them was basically information management for scientists, and the language that he used was a little bit more expressive; the representation that he learned was a little more expressive. Here's an example of the sorts of things his system learned. Essentially, what you're learning in each of these cases is a similarity metric, so you can solve some particular class of queries better. The similarity metric is basically some strategy for doing a random walk on the graph. For Ni, what you basically had is a whole bunch of experts, each of which means you're doing a random walk across edges of a certain type. So, for example, you might start with a word; let me show you how one of these queries starts off.
Our query is the title of a paper, the year you're going to publish the paper, maybe the authors of the paper, and I think there were some entities, like gene entities, associated with that paper. So you start off with this meta-information, and the goal is to find the papers you'd like to cite. There's lots of data you can get for this, if you assume citations are mostly right, and you learn paths like this one. It says start with a word and then find a paper connected to that word. So we're starting off with an initial uniform distribution over all the words, and then we're taking a random walk from that to a set of papers. If I have a paper that's linked to by a bunch of the words in the title, then it will get more weight than one that's linked by just one word. If I have a very common word in the title, then it's going to be linked to a lot of papers, so its weight will get dispersed, whereas a rare word's weight will get dispersed less. So going from words to papers is kind of like traditional IR. In fact, this is one of the rules that gets learned: basically, type the title into a search engine, and those are things you might cite. Or you could go from there and find things that cite that paper, and then find things that are cited by those papers. That's sort of a co-citation rule. Ni's system basically does structure search and enumerates a large number of paths of this sort, and then it does a linear combination of these things, so it's basically finding weights for the things that come out of each path. Each of these paths is a random walk, so it's kind of smart about what it generates, and then additionally we're going to put weights on top that make the ensemble even smarter. So that's where we are. Here are a couple of other examples. This is papers cited in the last two years; this is one. These are some that are negatively weighted, so if you just look at recent papers, then generally you don't cite those as much as older papers, apparently. This was Ni's system. I was talking about this with Tom Mitchell, and he suggested the following idea: think about applying this learning system to predict relations inside the NELL knowledge base. There's a relation like athletePlaysInLeague, so this is in the ontology, you know you want to predict it, and it may be the case that there are chains of relations that will get you there reliably. So, for example, if I know that somebody is an athlete, and they play for this team, the Steelers, and the Steelers play in the NFL, then that chain may help me predict whether an athlete plays in a league. The question is, can you use this same sort of path learning to learn a similarity metric which basically says, here is how I can infer the second argument of athletePlaysInLeague from the first argument? So the task we're trying to do is learning a binary relation in the ontology. We tried it, and it actually works pretty well, and it's kind of interesting to see the sorts of things that it finds. One is this: the athlete plays in a league, and the league is part of some organization, so, let's see, I don't remember quite the semantics of this, but that's basically like this link right here with NELL's actual concept names. Here's another one which is kind of cute. You start with your athlete.
So I start with an athlete, and I go to a concept that he's a member of (so let me model that as just a node that indicates the concept athlete in this graph), and then you find all things that are instances of that, so now I have the set of all athletes, basically uniformly weighted. Now I want to find out the sports those athletes play, so at the end of this chain I basically have a distribution over sports that's weighted by the number of athletes that play that sport. So this is basically a prior distribution over possible fillers. Ni's algorithm learns a lot of rules that are kind of like this. It learns things that give you prior information. It learns some things that look like inference rules. It learns other things that look like they're basically trying to get around holes in the knowledge base. Here's one. I want to find out the home stadium of a team, so where do the Seahawks play? I can start with some member of the Seahawks and find out what team they play for. Now that really should take me back to the Seahawks, but it might not, because I may have two duplicate concepts. I may have different versions of the same entity because the de-duping in NELL is imperfect. Then I can find the home stadium of that team. That's a pretty good inference rule if you have some extra unduplicated entities. It learns a lot of interesting things. PRA has now been one of the components of NELL for a while. That's where we are. Now I'm going to talk about the next stage, so this is very recent work. It recently appeared in a workshop at ICML, and there are longer versions under review. This is actually cool stuff. I'm more interested in this than I have been about anything for a while. When Ni was around, we were talking about a number of possible extensions. Some of them he did for his thesis. Some of them seemed very hard to figure out how to do. One question is, can you make the rules that PRA learns in this knowledge completion setting recursive? After a while of thinking about it, we came up with this idea: maybe we can use the ideas from PRA in a more powerful representation, and thereby get a whole bunch of these extensions at once. Just to give you a concrete example again, for PRA, the paths it learns really look a lot like logical inference rules, which is why I'm calling this inference in the knowledge base. You might write this rule that takes you from an athlete to a sport by saying, well, you know, if you can tell in the knowledge base that this athlete's on a team, and that team plays that sport, then the athlete plays that sport. That's a rule. You're asserting something new via this set of rules, based on things that you already know in the knowledge base. Of course, you're doing this for lots of different relations, so you do it for athletePlaysSport. You also do it for teamPlaysSport, so it might learn a couple of rules for that. One of these might call the other, so this one calls this guy. You might conclude the team plays the sport if the athlete's on that team and the athlete plays that sport. All these relations are, by design, closely interconnected and redundant.
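As a rough illustration of how a PRA path acts like an inference rule, here is a toy Python sketch that follows a chain of typed edges through a small, made-up knowledge graph and scores candidate fillers with a linear combination of path features. The relation names, facts, path, and weight are hypothetical, and real PRA also models resets and learns the path weights from data.

# A toy sketch of a PRA-style path feature over a knowledge graph stored as
# {relation: {head: [tails]}}. A path (sequence of relation names) is followed
# as a random walk, giving a distribution over end nodes; a weight per path
# then combines several such distributions.
from collections import defaultdict

KB = {  # hypothetical facts, not real NELL data
    "athleteOnTeam":  {"hines_ward": ["steelers"]},
    "teamPlaysSport": {"steelers": ["football"]},
}

def follow_path(start, path, kb):
    """Random-walk distribution over nodes reached by following the path."""
    dist = {start: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, p in dist.items():
            tails = kb.get(rel, {}).get(node, [])
            for t in tails:
                nxt[t] += p / len(tails)   # spread weight uniformly over out-edges
        dist = dict(nxt)
    return dist

def score_candidates(start, paths_with_weights, kb):
    """Linear combination of path-feature distributions, as in PRA."""
    scores = defaultdict(float)
    for path, w in paths_with_weights:
        for node, p in follow_path(start, path, kb).items():
            scores[node] += w * p
    return dict(scores)

# Predicting athletePlaysSport(hines_ward, ?):
paths = [(("athleteOnTeam", "teamPlaysSport"), 1.5)]
print(score_candidates("hines_ward", paths, KB))   # {'football': 1.5}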
PRA can learn this, but it couldn't learn something where these rules actually get called in a mutually recursive fashion. What you'd really like is for the rules not to be limited to testing against the knowledge base, but to test whether a fact is either known or can be inferred, which basically means you need to have the separate rules for teamPlaysSport set up here, and for each possible relation you need some sort of ground-out rule that says, well, if it's already known in the knowledge base, then I'll conclude it to be true. Now this is a mutually recursive program, and the question is, how can we learn that? Everyone kind of clear where we are so far? Have I stymied anyone? Okay. This is the goal. We want to learn something mutually recursive. So here is the idea. I'm going to start with a mutually recursive program. I apologize, but I'm switching examples here, just to reuse slides. This is a different program that is basically doing some kind of similarity or label propagation over web pages. A page X is about topic Z if you've labeled it as about Z, or if it's similar to Y and Y is about Z, so that's kind of a propagation step. I'll say that X is similar to Y if there's a hyperlink from X to Y or if there's a word in common between the two. That's basically what these two rules are saying. The way I've set this up, there are features associated with these rules, so this one has two different features; this one has two different features. This one has one feature, so the rule is always true, and I've set it up this way so that when you get to this one, the head of the rule contains the variable W, which is the word, and I can tag the application of that rule with the word that's being used. So there's a feature for this rule which is just the word that's being used. What are those features about? Well, the features basically get put onto edges of a proof graph. The proof graph looks like this. We start with some goal. Up here in the right-hand corner, we have a copy of the World Wide Web. It's got four pages. One about fashion, one about sport. I'm sorry, this is a subset. Here are some words in each of these things. So to find out what page A is about, I can do my -- I don't have a hand-labeled fact for it, so I do a propagation step. It's got to be similar to something. There are two ways it can be similar. We could say it's linked to, I guess this is B. B is labeled fashion, so this gives me one solution, fashion. And we can chain through here: it's linked by a word, let's say the common word, sprinter. Sprinter is also in C, so that gets me to C, and C, well, I have to propagate that to D, and D is about sport. So there's another little chain that gets me to sport. There are two possible solutions in this search space. We now have a language for representing recursion, because we can have a recursive Prolog program, and the cool thing about it is that the proof space is a graph. It took me a long time to realize that this is a reasonable thing to think about. This proof space is a graph, and since it's a graph, I know something about how to tune this exploration process. I could imagine just doing the same personalized PageRank on the graph, which gives me a similarity metric between this query and these two nodes, fashion and sport, so it's a similarity between queries and possible answers to that query.
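Here is a small sketch of that idea in Python: score candidate answers by running personalized PageRank over the proof graph. The proof graph below is written out by hand for the toy "what is page a about" query; in ProPPR it would be generated by the prover, and the transition weights would come from learned feature weights on each rule application rather than being uniform.

# A minimal sketch of scoring answers with personalized PageRank over a
# hand-written toy proof graph (node -> successor proof states).

proof_graph = {
    "about(a,Z)":            ["sim(a,b)&about(b,Z)", "sim(a,c)&about(c,Z)"],
    "sim(a,b)&about(b,Z)":   ["about(b,Z)"],
    "about(b,Z)":            ["answer:fashion"],
    "sim(a,c)&about(c,Z)":   ["about(c,Z)"],
    "about(c,Z)":            ["sim(c,d)&about(d,Z)"],
    "sim(c,d)&about(d,Z)":   ["about(d,Z)"],
    "about(d,Z)":            ["answer:sport"],
    "answer:fashion":        [],
    "answer:sport":          [],
}

def personalized_pagerank(graph, start, alpha=0.2, iters=50):
    """Power iteration for a random walk with reset to the query node."""
    nodes = list(graph)
    p = {n: 0.0 for n in nodes}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: (alpha if n == start else 0.0) for n in nodes}
        for n in nodes:
            out = graph[n] or [start]        # dead ends walk back to the query
            for m in out:
                nxt[m] += (1 - alpha) * p[n] / len(out)
        p = nxt
    return p

scores = personalized_pagerank(proof_graph, "about(a,Z)")
# The shorter proof (via page b) ends up with more weight than the longer one.
print(scores["answer:fashion"], scores["answer:sport"])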
Then given that search process, I can now think about tuning it to give me the results I want, using standard techniques for learning on this graph I've constructed. So I've got lots of features. I've got lots of edges and nodes. I have some ideas about how to do this, so again, basically you've got this probabilistic transition through the graph. It depends on the features of the edges out of the node, which basically depend on the rules and sometimes the arguments of those rules as they were applied. There are implicit reset transitions, because I'm doing personalized PageRank, because I like doing that. That basically means that, since there's always some chance of a reset, it gets hard to do walks that are very long, so it's much easier to take the short path than a relatively long path, because on a long path you have a lot more chances to reset. So basically this favors answers that are supported by a lot of proofs and, in particular, a lot of short proofs. As for the semantics of this, it turns out there's a formalism called stochastic logic programs, explored 10 or 12 years ago, that was similar but didn't have the reset. The semantics are basically set up exactly the same way. And there's one thing about this which is really cool. Ideally, we'd like to take this small recursive program and apply it to a slightly larger subset of the web than the one I showed you on that slide, more than four pages. If the rules let you recurse across the whole web, so you can propagate labels all across the web, then the proof tree we're talking about is really, really big. So if I used stochastic logic programming for it, it wouldn't work, because building this proof tree is like doing a traversal of the whole web. But if I do a random walk with reset, it turns out the weights in a random walk with reset drop off very quickly. They drop off exponentially as you get further and further away from the reset node. That means you can get an approximation to the random walk with reset by just incrementally expanding away from the reset node. There's a nice, very simple algorithm for this that was described in a FOCS paper a few years back. It's called PageRank-Nibble, and they proved that the size of the subgraph you explore is bounded in terms of the approximation error and the reset parameter. So as long as you have some sort of lower bound on the reset, you can get a relatively small grounding. We know how to do -- I won't talk too much about the learning algorithm, because it's 4:06 on Friday and maybe you're thinking about the weekend now, but we're using an off-the-shelf learning algorithm for this, and the real key thing here is this fact: the size of the structure that you're producing, the ground structure that you create that corresponds to a query, the thing that you're doing inference over when you're doing the learning, is not that large. It doesn't depend on the size of the database; it just depends on these two parameters. That's a big deal, because in many other probabilistic representations, the size of the grounding is actually very large. For example, in a Markov logic network, you start out by building a Markov random field where the nodes are all the possible ground atoms in the logical theory. That's a huge set. In this case, since we've got a predicate like similar(X,Y), that's not just all the pages of the web, but all pairs of pages, the number of pages on the web squared.
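For contrast, here is a sketch of the kind of push-style approximation to personalized PageRank that gives the bounded grounding just described, in the spirit of PageRank-Nibble rather than a faithful reimplementation of it. The adjacency function is a hypothetical stand-in for a lazily explored graph that is far too large to ground out completely.

# A sketch of push-style approximate personalized PageRank: residual weight r
# is pushed into the estimate p only at nodes whose residual per out-edge
# exceeds epsilon, so the number of pushes is bounded by roughly
# 1 / (alpha * epsilon), independent of the size of the graph.

def approx_ppr(neighbors, start, alpha=0.2, eps=1e-4):
    p, r = {}, {start: 1.0}
    frontier = [start]
    while frontier:
        u = frontier.pop()
        out = neighbors(u)                      # neighbors fetched lazily
        if not out or r.get(u, 0.0) / len(out) <= eps:
            continue
        ru = r.pop(u)
        p[u] = p.get(u, 0.0) + alpha * ru       # keep a share of the mass at u
        share = (1 - alpha) * ru / len(out)
        for v in out:                           # push the rest to neighbors
            r[v] = r.get(v, 0.0) + share
            if r[v] / max(len(neighbors(v)), 1) > eps:
                frontier.append(v)
    return p

# Hypothetical adjacency function over an enormous graph:
def neighbors(node):
    return [node + 1, node + 2] if node < 10**9 else []

print(sorted(approx_ppr(neighbors, 0).items())[:5])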
Here are a couple of quick results just to give you an idea of how this works. I've suggested it's going to be faster than something like Alchemy, the Markov logic network system. The first thing we did was try to verify that, so we took a data set that the Alchemy people put together, and we took their Markov logic network and translated it, using a mechanical translation, to our representation. Markov logic networks let you have non-Horn clauses; I can only do Horn clauses, so the things that weren't Horn I just threw away, and that gives me this program. Has everyone got that now? Good. Okay. Here are the numbers. Here we're looking at inference. We're going from a tiny database that has four citations to a really, really small database that has eight citations. I'm sorry, this axis is log scale, so this is two to the fourth, 16 citations, up to two to the eighth, 256 citations. What you see here is that there are different learning algorithms applied to the underlying Markov logic network. There's also this lifted belief propagation strategy where you try not to unroll everything. But in every one of these cases, what you essentially see is that the time it takes to do the inference grows pretty quickly with the size of the data set. For our case, that blue line is what we've got, so it's independent of the database size, given that we've got alpha and epsilon fixed. And now we look at how well it works. This is before learning; we just put uniform weights on all features. This is AUC; these numbers are not too bad, not too great. With learning we improve in most of these cases. This is the Markov logic network, and this is all using the same subset of rules, the 21 rules that we actually could use in our system. If we put in all of Pedro's rules, then their system gets a little bit better, but we're still doing a little bit better than that. This is a very encouraging result: this is a reasonable bias, and it has the computational properties we want. There's also another really nice benefit of the fact that the grounding process is bounded in size. When you're doing learning, you basically get a whole bunch of queries, and for each query there's a bunch of correct and incorrect responses, so I want to say, what are the things that are the same as citation 120? There are some things that are right; there are some things that are wrong. Operationally, what we do is we start here. We do a proof for that query, and then that's grounded out to this little graph. We now have a little non-first-order construct; it's just a graph with labeled edges and nodes. And we're trying to tune the weights on those edges. The cool thing is those graphs are all separate, right? We could combine these things if we want, but we can also keep the graph for each query as a separate thing. The only coupling between these graphs is the parameters, the features that appear on the edges. We're optimizing the same metric that Jure Leskovec and Lars Backstrom were optimizing, but our algorithm is a little different: they used a Newton method, and what we're doing is stochastic gradient descent. Basically you compute the gradient with respect to a particular graph, you tweak the parameters, and then you go on to the next graph. This is really nice. It's a small-memory-type operation because you're only doing one graph at a time.
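Here is a much-simplified Python sketch of that training loop: each query is grounded to its own small structure, and stochastic gradient descent updates the shared feature weights one query at a time. To keep it short, the "grounding" here is just a list of candidate answers, each reached by a single rule application, and the objective is a softmax log-likelihood per query rather than the exact objective used in the actual system.

# Per-query SGD over shared edge-feature weights; each query's grounded graph
# is processed independently, which is also what makes parallelizing easy.
import math, random

def edge_score(features, w):
    return math.exp(sum(w.get(f, 0.0) for f in features))

def sgd_step(query_graph, w, lr=0.1):
    """query_graph: list of (features, label) for candidate answers;
    label is 1 for a correct answer, 0 otherwise."""
    z = sum(edge_score(f, w) for f, _ in query_graph)
    for feats, label in query_graph:
        prob = edge_score(feats, w) / z                 # softmax over this query only
        for f in feats:
            w[f] = w.get(f, 0.0) + lr * (label - prob)  # gradient of log-likelihood

def train(queries, epochs=20, seed=0):
    random.seed(seed)
    w = {}
    for _ in range(epochs):
        random.shuffle(queries)        # one small per-query graph at a time
        for q in queries:
            sgd_step(q, w)
    return w

# Two hypothetical grounded queries sharing the feature "rule:sameAuthor":
queries = [
    [(["rule:sameAuthor"], 1), (["rule:sameVenue"], 0)],
    [(["rule:sameAuthor"], 1), (["rule:recentOnly"], 0)],
]
print(train(queries))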
The only thing they share is the edge features, and you can also do these in parallel, so you can have a different thread looking at the inference process over each little subgraph. There's some synchronization, because you're sharing the edge features, but you can still get potentially a big win by parallelizing. So here are some other numbers. This is doing the learning, and we're looking here at how much you speed up as you add more threads. Ideally, if you have 16 threads, you get a 16X speed-up. We don't quite get that, but we're doing pretty well. That's the formalism. And like ten minutes ago or something, before you guys started dropping off and falling asleep, we were talking about why I wanted to do this. What I wanted to do was get to a mutually recursive set of rules, and try to learn that mutually recursive set of rules. So this is learning how to do inference in a stronger sense: rather than just learning how to do one step of the inference with the stuff I've got, I'm learning how to do multiple steps of the inference at once. That's really the motivation behind this. I don't have any way of doing structure learning right now for this language. What I did was use Ni Lao's PRA system. I just let it run, and it learns nonrecursive rules. I take them and syntactically transform them so they're recursive and they can call each other. Then I take that and train weights on the whole program. That's the experiment. So, a few details. The hypothesis is that this mutual recursion is going to help for the highly connected entities, the things that are related to a lot of things which are themselves related to a lot of things. So to get a sample that had that characteristic, I picked a couple of different seed entities, and then I did a simple random walk with reset, personalized PageRank, away from those. Personalized PageRank will give you things that are close to the seed, but also things that are sort of central, things that are well connected. It's kind of a mix between PageRank and distance. So I picked the top few things in that random walk, and that's a subset of NELL. Then I projected the knowledge base to those entities. I did that in a few different ways, and here are the results. Just to pick one of these out: if we take the top ten thousand things closest to the concept baseball, the AUC for the nonrecursive rules is about 75. The AUC for the recursive rules is about 99. We're getting a huge lift here, largely because of the extra recall we get from doing multiple rounds of inference. The results vary across these different samples somewhat, but there's always a big lift from looking at the recursion and learning a recursive set of rules. I'll just let people stare at those for a second. I really should have a better visualization of this. Another thing that's actually kind of interesting is that these entities, since they're very closely connected, statistically tend to also be more common. There are more facts about the things that are closest to baseball than about things that are farther away. What you also typically see is that as you go down this list, the test sets get harder and harder. The way I split into train and test is that the training set is everything up to some particular iteration, it might have been 713, and the test set is the things that are learned later on in the bootstrapping process.
This is the first way that we're now using joint inference in NELL. I'm going to talk a little bit about a second way that we're doing that. Lise Getoor is going to be at the faculty summit. I believe she's going to be talking about this to some extent, though I haven't actually checked with her, so maybe she's going to talk about something different. I don't know. In a little bit more detail, what NELL does is iterate over and over and over. Each iteration, it learns classifiers with each of these feature sets, including PRA, and basically does distant training for that feature set using the current knowledge base. It uses the current knowledge base to provide labels on whatever data it's got, sentences or semi-structured pages or what have you. Then once you've got the results of that learning, it's a classifier; you can extend it; you can get new candidates based on the unlabeled things that the classifiers fired on. So we get a whole bunch of candidates that come from different learning systems, and then we use some heuristics to find the best candidates, and those are -- in the system we call it promotion. They're promoted into the next version of the knowledge base, and then we go around the next loop. The knowledge base is a little bit bigger, so in each iteration the knowledge base grows, and it's a little bit better at recognizing things because it learns a few extra patterns and how to parse a few more web pages and so on. What's the hard part? Well, we're using some fairly ad hoc techniques here to find the good candidate beliefs. It would be nice to do this in a principled way. What's the right algorithm for doing promotion? What we're really trying to do here is deal with the fact that there's lots of noise in the database, many different types of noise, but also lots of redundancy, so it seems that the way we should deal with these many different types of noise and exploit the redundancy is to do some sort of joint reasoning. We should pool information about things like co-referent entities, or possibly co-referent entities. We should enforce mutual exclusion and other sorts of ontological relationships, so we should find something that jointly satisfies everything as well as possible, under the assumption that there is going to be noise, and also under the assumption that we don't really know what kind of noise there is. We don't know how much there is of each type. Just as an example of this, here is a set of extractions. There's one here that's way off: it thinks this entity is a bird rather than a country, which it really is. It thinks these two things might be the same thing, but it's not sure. It thinks that this is the capital of that place. It knows that things that have a capital are countries. It knows that countries and birds are disjoint sets; those are mutually exclusive. These are the facts we have, and we'd like to get to something like this. This is sometimes called, like at Google they're calling it, a knowledge graph, and this is certainly a graph. For a long time, Lise Getoor has been using the term knowledge graph identification for this problem of going from this sort of graph, which has many types of errors, to a clean, de-noised version of the graph. All right. We started talking, and we decided to try using her graph identification techniques for this problem.
A couple of students, Jay and Hui, are really doing all the work for this project, let's be honest. How does she do this? Again, she uses joint reasoning in a probabilistic logic, so it's different from the one I'm using, and it's interestingly complementary. I won't go too much into the details. Essentially you ground things out by instantiating your general rules in all possible ways, so unlike the ProPPR language, this PageRank version of Prolog, the grounding process can be relatively expensive. But one nice thing is that when you're done, the inference problem is convex, so finding the most probable explanation given your current weights for all the rules is a convex optimization problem. How do we turn this into a joint reasoning problem in her framework? We basically have a bunch of inference rules. One says, well, if NELL was going to promote it, then maybe it's a good idea to promote it. There are two types of promotions: one is for concepts and one is for relations. So this one is true if NELL wanted to promote the thing, and this one says, yeah, that's a real relation; that's a real label. Here's another one that says if there's any candidate, then go ahead and promote it: it's better to promote candidates than not to promote candidates. These are some rules which enforce consistency about things where there's a co-reference assertion, so if you think entity one and entity two are co-referent, then they should have the same labels, for example. There are also constraints that come from the ontology. For example, if L is a subset of P, say athlete is a subset of person, and entity E is in class L, so somebody's an athlete, then they had better be a person too; this enforces the subset relationship. These enforce the mutual exclusion relationships, and so on. These rules are actually too expressive for my logic; the problem is that these rules aren't Horn. These mutual exclusion things, I can't do in my logic, but we can do them in Lise's logic. Here are the rough results. We've done some more experiments, but just to give you an indication, there's a set of facts from a year or so ago where another group, Daniel Lowd, one of Pedro Domingos's former students, and some of his students, looked at basically the same sort of scenario, doing this promotion step within NELL as a joint reasoning process. They were using Markov logic networks. They very generously gave us their evaluation data and their predictions so we could reproduce their results. The joint inference runs in just a few seconds; the learning takes a little bit longer. The baseline here is NELL's learning method. Here, the AUC is about .76; the MLN does about .99, and PSL does about the same. This is actually a version of NELL's code that's been tweaked for just this test set, so the thresholds have been adjusted for that 25K test set. I'm going to go ahead and stop here and maybe spend a couple minutes with questions. Basically, the two things I've talked about are an overview of this NELL system and some of the key ideas behind it, and then an overview of these two lines of really recent research.
One is basically treating promotion, a key step in bootstrap learning systems, as an inference process, which isn't really very interesting if you think about bootstrapping in some very simple space where you just have one or two concepts, but is actually quite interesting when you're talking about bootstrapping in a nontrivial, significant ontology. The other thing I talked about was learning how to do inference over a knowledge graph, and in particular learning how to do this multistep, possibly recursive inference. We've taken past work based on random-walk-with-reset procedures and learning algorithms and extended it to first-order logic, which is actually kind of cool, because we get some very nice scalability properties for this and some very nice properties with respect to how we can parallelize that inference. Okay. That's basically as far as I'm going to go. So thanks a lot for your attention. You can clap now, and then you can ask questions. [applause] Matt? >>: Can you go to the previous slide? Did you compare ProPPR to PSL, like the Horn subset of -- >> William Cohen: We actually haven't done that. We could do the Horn subset, and that's on the list of things to do. The other thing on the list is to see whether some of the ideas from ProPPR's grounding procedure can be modified for PSL. These have been things that we've been plugging along with slightly independently, and we've been talking a lot about these things. I'd really like to see that happen. Yeah, one question is how much are those handful of things and how much do they matter? There are also some strategies that I've thought through. For example, we could set up rules that say, I am going to infer that there's a conflict if you see this sort of situation, and then tell the system, here are some positive examples of things you should infer, and by the way, please don't infer any conflicts, so every conflict is sort of a negative example of something you could infer. >>: Sort of pull the negative stuff out and do it as like a post hoc kind of thing? >> William Cohen: Yeah, something like that. To do that, there are some technical things we'd have to do. The easiest way of training something around that would be to say, well, these rules that allow me to infer conflicts, the guy says he doesn't want conflicts, so let's just weight those rules down, and then I'll have no conflicts. It's not quite as simple as just doing that, but I think that approach might be applicable. Certainly, we would need to have some sort of random-walky semantics for PSL if we're going to use the same kind of grounding strategy. Or maybe we don't; maybe it works okay without that. It's not clear, but it'll be good to do some experiments to get a feel for what the space of things is there. Any more questions? Yeah, one in the back. >>: Kind of, how do you deal with entity resolution? I guess I'm sure it's built in. Do you give pretty much every entity a unique identifier, and then? >> William Cohen: Oh, yeah. Let's see. Okay. Short answer, long answer. In NELL, there are typically two versions of an entity. One is the entity viewed as a string, and then there's the entity as assigned to a type.
If you look at something that's ambiguous, like apple, there'll be Apple the company and apple the fruit. There'll be strings like the string "apple," which can refer to either of those, or maybe "Apple Inc.," which will probably refer to only one of them. That's the representational scheme that NELL uses. The way it gets there is that it has a couple of classifiers, which I didn't put on that picture: there's one that looks for things that should be co-referent and aren't. Then there are also some heuristics, so if you have an entity that's got a couple of mutually exclusive high-confidence predictions, it will consider splitting that. That's the machinery that's in NELL. The results I gave were actually not just using those co-reference facts, though. They were using those plus some additional co-reference facts from some co-reference machinery that Lise's students put together, because they're kind of big co-reference people too. That's another discussion that Lise and I are going to have sometime in the future, about evaluating those changes and figuring that out. I mean, they clearly help with the promotion stuff. Is that something we want to bring into the main NELL system or not? Yeah, that's the shorter or longer story. Rich? >>: I was just curious about how you do experiments with NELL. Let's say you release something and then you realize a couple weeks later that it's actually creating a problem, and a bunch of bad facts are getting added to the system, a bunch of bad inferences. You must have some sort of rollback procedure where you can go back to before that happened and sort of continue. >> William Cohen: Yeah, right. That has actually happened in the past. The way NELL works is, after each round of bootstrapping, the knowledge base and all the learned patterns and classifiers are checkpointed, so we can, in fact, roll back at any point. But that said, it's really hard to do. I mean, typically -- >>: You can't roll that back, right? >> William Cohen: Sorry? >>: You can't roll the web back, so presumably -- >> William Cohen: Oh. Right, right, right, right. So that's -- >>: You can't repeat an experiment. >> William Cohen: Well, we do live queries. We take the pages that come out and we cache them, and then whenever you refetch the same page, you get it from the cache rather than from the web, which is not ideal for temporal things, but that gives us some more reproducibility. Then most of the patterns and stuff are built on the distilled version of the ClueWeb corpus, which is a static repository, a static crawl. >>: Do you have any sense, if you were to change the random seed and rerun from a certain time, how similar you would be after the two weeks? >> William Cohen: We haven't done that for the large web repository. I've done a bunch of experiments like that with a different version of NELL that was set up over a biomedical ontology. It certainly is quite sensitive to the seeds, in particular the quality of the seeds, so one of the limitations of NELL is that there's some engineering of the ontology and of the seeds to keep the system from breaking, because you're giving seeds that you, as the system designer, a smart grad student from CMU, know are going to be unambiguous relative to the sets of things that you know. They'll be frequent enough to be informative.
There's a big question of how robust NELL is when you get to a different ontology. That's one of the reasons why I did the biomedical thing; it's sort of like other designed ontologies. There are not a lot of interesting ontologies that have these sets of mutual exclusions and things like that that can be used. >>: Do you ever get compound-interest effects, where if you make something just one percent better, after six months everything is much, much better? Is there a thought that there might be a singularity in the [indiscernible] sense if you achieve a certain incremental improvement in the short term, or is this just bad thinking? >> William Cohen: You've got to ask Tom whether that's true and see whether he'll go out on a limb and say yes. >>: One percent worse can be catastrophically bad [laughter]. >> William Cohen: That's concept drift, right, and that clearly happens. We've had to roll back, and there are lots of reasons. One is that you'll run into things that are just sort of spammy. For example, the entity matching stuff was trained on some subset of things and it worked pretty well for a while. Then we changed the ontology, and then it stopped working, so suddenly it kind of went out of control. There are some hacks, basically, that we use to keep things from cascading in either direction too fast. One of the hacks, which we'd have to think about whether we want to incorporate in Lise's extension or not, is that no single component can introduce too many facts of any type at once, and that's enforced even in kind of ridiculous ways. We'll put in something that introduces facts about the geolocations of things, and we'll have a very conservative matching algorithm, and so the first time around, it'll say, okay, well, I know the geographic locations of a hundred thousand things, and it'll be like, okay, you have to put in ten thousand this iteration, ten thousand the next iteration, and so on. Yeah, there are geographic facts that are sort of waiting in the queue as we iterate, for example. >>: I'm wondering, what is the key difference between NELL and other knowledge bases such as Freebase or YAGO? >> William Cohen: Yeah. I guess the goals are kind of the same, to build a broad-coverage knowledge base. There are a lot of technical differences. YAGO2 is getting closer. YAGO has a smaller ontology, and it's done by extracting from a less diverse set of sources, mainly Wikipedia. Freebase is done primarily by merging structured databases, and the merging is very systematic and user guided. I think they're hoping to get to the same place, but we're trying to get there by different tools. Part of why we're doing it this way is that we like the idea of pushing the frontiers of semi-supervised learning and things like that as well. There are different motivations for doing these things, but there are a lot of commonalities; we hired a postdoc from the YAGO group, for example. There are a lot of commonalities between those different projects. >>: You could also [indiscernible] YAGO to [indiscernible]. >> William Cohen: That's something -- well, we haven't done YAGO. We've done some preliminary experiments with Freebase. There, the big thing is that there's not as much ontological information, so we have to infer things like mutual exclusion constraints. Doing Freebase as an initial thing is also a very reasonable sort of thing to do.
We're doing some experiments where we're merging results from NELL and YAGO and Freebase for applications, so we're also using NELL to, say, do distant training for extracting relations from individual sentences, and there's no reason not to supplement that data with other sources. Do you have a question? >>: This is probably blasphemy for NELL, but given a closed ontology, is there any benefit in knowing when to stop promoting facts into the knowledge base? For instance, for country, you've now got Iraq_which. >> William Cohen: Yeah. No, no, no, that's actually a great idea. That's a great question. >>: I mean, it's not never-ending anymore [laughter]. >> William Cohen: Right, well, yeah. That's right. It's not never-ending learning, that's true. Well, there are other parts of NELL that, for example, add new concepts or add things to the ontology, but yeah. I tease Tom and say, you're making this a selling point, but never ending is a synonym for slow, right? Really we should get to the end, right? [laughter] Yeah, I'm actually working with a student right now, Bhavana Dalvi. She's working on semi-supervised learning in a fixed ontology, where you know it's a fixed but incomplete ontology. A simple case is that you're doing semi-supervised learning into 20 classes, but no one's given you seeds for 10 of them. You only have seeds for 10 of the classes, so as you progress, eventually you'll start running out of countries and then you'll start grabbing other stuff. The question is, can you recognize when the countries stopped and you've started moving into the next outlying concept? Maybe there are things like US states, which are in your ontology, and you can say, whoa, that's a US state; it's not a country. But there might be other things that aren't in your ontology, like, I don't know, provinces of ancient Rome. Is Gaul a country? There may be things that have additional structure. >>: I wonder how NELL can deal with facts that depend on time. For example, an athlete belongs to one team, plays on one team; the next season, he might play on another team. Would promoting the new fact conflict with the existing database? >> William Cohen: Right now in NELL, if something's true, then that basically means semantically that it is or was true at some point in the past, so it doesn't consider time in the web site version. We have been doing a bunch of research on how to capture time. Partha Talukdar has been doing some stuff on this as part of his research. Our current approach for dealing with time is to look at a large corpus that has timestamped documents, like a newswire corpus or something like that, and have a similar ontology that talks about which things can co-occur and which things can't co-occur. For example, you can have a lot of senators, US Senators, at one time. You can only have one US President at one time. You can't be both a Senator and President at one time. If you look at the distribution of when you see facts that suggest that Barack Obama is a Senator, and when he's a President, and when George Bush is a President, then, if you look at that jointly, you can get some information as to when those transitions happen. That's how we're dealing with the time issue, and there are a lot of interesting issues. How do you infer those constraints, for example?
So far, that stuff hasn't been rolled into the existing NELL system. Matt? >>: Sorry if I missed this in the beginning, but where does the initial ontology come from? Is that -- so how many concepts and relations? >> William Cohen: It's less than 1,000 but more than 600. I don't remember the exact number right now. It changes over time, because part of the development process is that developers look at this, and you add things manually, and every time you add something, you stick a few things in, like how it's stated and so on. Yeah, the ontology is a big deal. >>: Yeah. >> William Cohen: But you can just download it if you want it. >>: Yeah, yeah. >> William Cohen: Okay. [applause]