>> Rich Caruana: I'm Rich Caruana. It's my pleasure to introduce William Cohen who's going to give a
presentation this afternoon. Let me just give you a little background on William. Let's see, you're
currently the president of IMLS, right? The International Machine Learning Society. You've been an
editor or an action editor for like all the journals basically, so for many.
>> William Cohen: All the good ones.
>> Rich Caruana: So for many of the machine learning journals. Let's see, ICML. You were general chair
once and program chair twice. Two different ICMLs?
>> William Cohen: I was general chair once, program chair once, and then once when it was like just
one, kind of, combined back in the '90s.
>> Rich Caruana: Gotcha. Twice you've been chair of the AAAI conference on weblogs and social
media. One of them just happened a couple months ago, right?
>> William Cohen: Yep.
>> Rich Caruana: William is also an AAAI Fellow. Then you won the SIGMOD award for the Test of Time
for a paper you did back in 1998, I'm guessing, but I don't know what paper that is.
>> William Cohen: It's the only SIGMOD paper I've ever published. They gave me an award 10 years
later, so I guess they liked it.
>> Rich Caruana: It's my great pleasure to introduce William. I think his greatest trait, which he just
reminded me of a few minutes ago, is that he follows in my footsteps, so twice, William has moved into
offices that I left. Once at Just Research in Pittsburgh, and once in Wean Hall at Carnegie Mellon. So I
think that's why he's so successful. [laughter]
>> William Cohen: That's right. That's right. It's all the aura, right? Then after that, actually, I moved
into an office that had been vacated by Sebastian Thrun, who's also done pretty well, so you know.
>> Rich Caruana: Here's a laser pointer if you want it.
>> William Cohen: Okay. Cool. The work I'm going to talk about is joint with a whole bunch of different
people. Tom Mitchell I guess is really one of the kind of driving forces behind this NELL project. Here's a
rough outline. When I found out I was going to be talking this late in the afternoon on Friday, I tried to
cut things back a little bit. I still tend to kind of pack a lot of things in, but don't let me kind of do too
much; stop me. I'm going to spend a couple minutes talking about this Never Ending Language Learning
system and just some of the key ideas in it and maybe give a little idea of what it is. Then most of the
time I'm going to talk about the problem of doing inference, particularly inference in this system, which
is kind of very challenging because, of course, this is all constructed with information extraction
techniques and there's noise and redundancy and incompleteness and so on. I'll talk about two projects
for that. One is joint work with, among other people, William Wang, who was an intern here with Eric
Horvitz until recently. Another one is Promotion as Inference, which is joint work with Lise Getoor,
who's going to be here at the faculty summit if you want to hear more about this later. I think she's
talking about this. Then I'll kind of wrap up very briefly. Like I said, this is kind of the key part. The rest
of this is just sort of prelude, and let's go ahead and move in. Information extraction. I know I can't see
everybody that's going to be listening to this talk, because some people are listening online, but how
many people know kind of roughly what information extraction is? You've heard about it, yeah? Okay,
all right. This is about the right level of detail. It's basically extracting facts about the world by reading
text. NELL, like many information extraction systems, basically it's based on learning, so it learns how to
recognize facts in text, and then it will aggregate the results, okay, so it doesn't do any detailed parsing
at this current time. It doesn't actually do a lot of deep analysis of any particular piece of text. The key
thing is that, sort of by aggregating even relatively shallow tools for recognizing facts, the
aggregation, the fact that you've got big data, will kind of help you get good results. NELL is a semi-supervised system, so it doesn't need a lot of labeled training data. I should have said label there. It
learns about five or six hundred concepts and relations, so it's a closed set, but a fairly large closed set.
For each of these things, it's got a handful of examples, say 10 or 12 examples for each relationship like
capitalCityOf, and each concept like person. It also knows something about the relations between those
relations, so it knows, for example, that if you're a celebrity, you're probably a person. It knows that the
first argument of capitalCityOf is a geopolitical location. It knows something about things that are
disjoint, so it knows a location is not the same as a person. It has a bunch of web pages, about half a
billion web pages it uses to do its information extraction. It also does live queries. It's been running for
about three years, okay. It's got about 50 million candidate extractions, which we call these beliefs.
There's about 1.5 million high-confidence ones. About 85 percent of the high-confidence beliefs are
correct. A picture being worth a thousand words, here's kind of the NELL site, and this is just a random
sample of things that it's learned recently. Recently means sort of in the last few iterations, right, so in
the last three, almost three-and-a-half years, sort of done through about 700-odd iterations. It knows
things like Taylor_Stanley is a person, a Canadian person, in fact, a special subset of person. It thinks
there's a restaurant called Fix in Las Vegas. Anyone know if that's true? No idea, okay.
Karnataka_bank_ltd. is a bank in India. That sounds right. Of course, it's got some mistakes, so it's got
Iraq_which. It thinks that's a name of a country. I could actually go in, and if I wanted to, I could sort of
say that's really not quite right. Here's another subset just kind of at random. Let's see
beautiful_toccoa_river is a river. Nj_turnpike is a street in the city West. Okay, that's probably an error,
and so on. We could also go in and just look at some of the things it's learned. Since we have an expert
in ornithology here, I'll look at some of the birds. It's got a little gloss for each concept, and it knows, as I
said, something about the relationships between those things, and it also knows something about how
to recognize instances of birds. Some of these are features of the noun phrase itself. So to take an
example, if it ends in h-r-u-s-h like thrush, or r-b-l-e-r like warbler, it thinks maybe it's a bird, or, I guess,
another one is like keet, like parakeet I guess, right? Then there's a bunch of patterns like blanks fly,
abundance of migratory something, African Wattled something, aquatic birds such as something, so
each of these are sort of kind of shallow syntactic patterns that will suggest that something is a bird.
These are all patterns that it's learned over time. I'm just trying to figure out if there's any more
interesting things here. Maybe let's go back and look at a particular bird.
>>: I notice you had cat among, so it must be doing [indiscernible] learning as well?
>> William Cohen: [laughter] Right. Let's see, we've got, I don't know, let's say a blackbird, so here are
the patterns that were used on blackbird. It also looks at semi-structured web pages and tries to extract
information out of semi-structured web pages. For example, there are, let's see. Here's a list which it
found apparently several birds in. I don't see where that is, but maybe it's in this list of related species
right here. Another one it found is this one right here, which is amusing. It's a list of birds from Vietnam
from Wikipedia, so somewhere down here there's a list of birds, and it's sort of figured out how to wrap
these. This is basically what NELL is and what it looks like. Then we'll kind of switch back to talking
about how we get there. Skip over this here since we've done this.
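To make the demo's feature types concrete, here is a minimal Python sketch of the two kinds of shallow evidence shown for birds: morphology of the noun phrase itself, and lexical context patterns. The suffixes and patterns come from the examples above, but the function and feature names are illustrative only, not NELL's actual code.

```python
# Minimal sketch (illustrative only) of shallow "is it a bird?" evidence:
# morphology of the noun phrase itself plus lexical context patterns.

BIRD_SUFFIXES = ("hrush", "rbler", "keet")            # thrush, warbler, parakeet
BIRD_CONTEXTS = ("{} fly", "abundance of migratory {}",
                 "African Wattled {}", "aquatic birds such as {}")

def candidate_bird_features(noun_phrase, sentence):
    """Return the shallow features that fire for this noun phrase in context."""
    feats = []
    for suffix in BIRD_SUFFIXES:
        if noun_phrase.lower().endswith(suffix):
            feats.append("suffix=" + suffix)
    for pattern in BIRD_CONTEXTS:
        if pattern.format(noun_phrase).lower() in sentence.lower():
            feats.append("context=" + pattern)
    return feats

print(candidate_bird_features("warbler", "an abundance of migratory warbler sightings"))
# -> ['suffix=rbler', 'context=abundance of migratory {}']
```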
What are the key ideas in the system? As you can sort of guess from the list of people that have worked
in the project, it's a big system. The two key ideas I'd highlight are it uses coupled learning and it also
has a multi-view, multi-strategy learning approach. What's coupled learning? Let me give you an
example of uncoupled learning by contrast. Let's think about these lexical patterns it learns, like blanks
fly to recognize birds, so how does it start? How does it work? It looks at some initial set of seeds, and
then it looks for patterns that are supported by those seeds, okay, so for cities you might come up with
something like this and they’ll come up with other things that sort of match those patterns, so “mayor of
San Francisco” might appear in text somewhere. “Live in denial” might appear in text. Then you learn
those. Then you bootstrap and learn some more patterns, and you learn some more things. What you
see eventually in many cases is semantic drift. You'll see a drift toward things that aren't really like the
original seeds that you gave it, so one problem with this is, if I just think about one concept or even a
small number of concepts, it's kind of under-constrained. We don't really know why those things are
wrong, and the reason those things are wrong is that they're really examples of something else. If you
have a whole bunch of something elses, then it makes it easier to learn any individual concept, so the
idea is it's kind of paradoxically sort of easier to do like two hundred things or a thousand things at once
than to just focus on one if you're a machine learning system. Again, if we're trying to learn the concept
of a coach, right. So there are a whole bunch of other things that are related, like coaches and athletes
and teams, and we know that teams are different from sports and they're different from people, right.
A coach coaches a team, so there's a connection between that concept and that relation, right. There
are also other sorts of things, like if someone is an athlete, they're probably a person; they
probably play for a team, and so on. If we look at all those constraints, that gives us a lot more
information about what's not true as well as what is true, and so by pushing one concept-learning
problem off against the others, you get more accurate results and less drift. So this is kind of one key
idea. It's easier to learn lots of interrelated tasks than one isolated task. The other idea is it's easier to
learn using many different types of information. So I sort of suggested that when I looked at the demo.
It learns rules about recognizing birds based on where those phrases co-occur. Like do they appear in a
pattern like “puffins fly?” And also the actual structure of the word, and also on how they appear in
semi-structured pages, like the list of birds from, whatever it was, Vietnam. So those are all different
views of the data, okay, so that's the other idea: you have this multi-view, multi-strategy learning system, and sort of the big heavy-lifters in NELL are finding these text extraction
patterns, these like lexical patterns, finding patterns that are based on individual web pages, so basically
sort of page-specific wrappers from semi-structured pages, these rules about the prefixes and suffixes
and morphology of the noun phrases, okay and then the last one is inference rules. This is just another
component of NELL. It will also basically do data mining on its learned facts and come up with plausible
rules by which you can take those learned facts and infer new facts. The cycle is basically you take your
initial seed set. You apply it to the data, get some examples. You do learning in each of these spaces.
You do classification in each space to come up with new things. You combine the results. You get some
more facts and you look back. So let's talk a little bit more about this inference process. I'm going to
talk really about two kinds of inference we're doing. One is basically inference sort of in that little box,
another learning strategy, just sort of another way of coming up with facts that are probably true, likely
to be true, that you can somehow derive from what you know already. Just as a little bit of background,
I'm going to talk about a more general problem which is learning in graphs. Then I'll talk about a
particular algorithm for learning in graphs, and then I'm going to kind of go off on a tangent and talk
about the latest stuff I've been doing, which is sort of like extensions to this last bit. Question yeah.
>>: The inference stuff, does that -- so you have to give it an ontology. Does it stay within the ontology
or does it propose new concepts or relations that seem to be missing?
>> William Cohen: It stays within the existing ontology, so it's going to come up with new facts in the existing
ontology. Okay, yeah. A while back, I sent an intern to Microsoft Research, and Einat Minkov was her
name. She was working with me on this kind of nice problem. If you think about information
management, let's say managing your e-mail, a very simple way of viewing it is say we've got two types
of things. We've got documents, and we've got terms. We can organize them into a bipartite graph, and
we can do IR or something like that. And a more complicated way, say, well, there are lots of other
things. There’re people, there’re dates, there’re meetings you might go to, there're e-mail addresses.
And these things are all connected, so there might be people that attend meetings that are associated
with e-mail addresses, and so on, and so forth. If you take that view, if you think about your world of
personal information as a big graph with lots of different entities, and lots of different types between
those entities, the question is what can you do with that? Einat came here and worked with Toutanova,
who had done some work as a grad student on learning parameters of a certain kind of graph traversal
process. So this is personalized page rank. I'm guessing most people know about page rank, so
personalized page rank is essentially the same process via sort of a random surfer. It's going through the
graph. At any point, it either takes a random arc out of that graph, or it jumps back, not to any random
node in the graph like you do in regular page rank, but to a particular person with [indiscernible] node
okay. The end result of that is that you're putting weight on every node in the graph and the weight
depends in some reasonable way on the distance or the similarity of that node to the start node. Once
we have the similarity metric for nodes over a graph, and once we've got some ways of taking that and
tuning it for a particular application, then we can talk about solving a number of interesting information
needs by doing simple similarity queries. The simplest kind is, basically, you give me a type, let's say e-mail addresses, and a node index, and I want to find other nodes of the right type that are similar to
that, or a slight extension would be give a type and a set of nodes, find things of some target type that
are similar to the set of nodes. So there are lots of things you can do like this. So let's take one example.
Let's say we want to find out what the referent of that name is okay. I’ve got two objects. I've got the
name itself, and I've got the context the name appears in, which is a message, so my start query is are
these two things, the term Andy and the message ID of the file that it appears in, and what I want out of
that is a person, so maybe a contact record would be a reification of person, all right, and so this is my
similarity query, and if I start off without doing any training, I'll get some level of performance. It turns
out to not be terrible in many cases, and then I can do learning to improve it and tune it to queries of
this particular type. So if I have a bunch of disambiguation queries, I can tune that relationship for that
task. We looked at a bunch of other things like e-mail threading and finding aliases and finding e-mail
addresses of people that would be likely to attend a given meeting, and that worked pretty well.
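A minimal sketch of the random-walk-with-reset similarity described above, assuming a simple adjacency-list graph. The real systems also learn parameters tied to edge and node types, which this sketch leaves out; node names and parameter values are illustrative.

```python
# Toy personalized PageRank / random walk with reset: weight every node by
# how close it is to the start (query) nodes.  Dangling mass is simply dropped.

def personalized_pagerank(adj, start_nodes, alpha=0.15, iters=50):
    """adj: node -> list of neighbor nodes; alpha: reset probability."""
    restart = {v: 1.0 / len(start_nodes) for v in start_nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {v: alpha * restart.get(v, 0.0) for v in p}   # reset back to the query
        for v, weight in p.items():
            out = adj.get(v, [])
            for u in out:                                    # spread the rest uniformly
                nxt[u] = nxt.get(u, 0.0) + (1 - alpha) * weight / len(out)
        p = nxt
    return p

# e.g., rank e-mail-address nodes by similarity to {term "andy", message 42}:
# scores = personalized_pagerank(graph, ["term:andy", "msg:42"])
# best = max((v for v in scores if v.startswith("email:")), key=scores.get)
```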
Subsequently, I had another student that came by, Ni Lao, and he looked at basically the same problem,
the same task of given a type and a query set of nodes, finding related things. And his contribution was
basically to look at new and better learning methods. In particular, in the stuff that Einat was looking at,
there was a limited set of parameters, which were basically related to the edge types and the entity
types that you have. Ni looked at some new tasks. One of them was basically information
management for scientists, and the language that he used was a little bit more expressive, the
representation that he learned was a little more expressive. So here's sort of an example of the
sorts of things his system learned. Essentially, what you're learning in each of these cases is a similarity
metric, so you can solve some particular class of queries better. So the similarity metric is basically
some strategy for doing random walk on the graph. And so for Ni what you basically had is you had a
whole bunch of experts each of which meant you're following, doing a random walk across edges of a
certain type. So, for example, you might start with a word. Let's say that here we're starting off.
Our query is the set of the title of a paper, the year you're going to publish the paper, maybe the
authors of the paper, and I think there were some entities like gene entities that were associated with
that paper. So you start off with this meta-information, and the goal is to find the papers you'd like to
cite. There's lots of data you can get for this, if you assume citations are mostly right, and you learn
paths like this one. It says start with a word and then find a paper connected to that word. So we're
starting off with an initial uniform distribution over all the words, and then we're taking a random walk
from that to a set of papers. So if I have a paper that's linked to by a bunch of the words in the title, then
it will get more weight than one that's just linked by one. If I have a word in the title like Z, then it's
going to be linked to a lot of papers, so its weight will get dispersed, whereas a rare word's weight will get
dispersed less. So going from word to papers is kind of like traditional IR. In fact, this is one of the rules
that gets learned is basically sort of type the title into a search engine, and those are things you might
cite, or you could basically go from there and find things that cite that paper, and then are cited by those
papers. That’s sort of a co-citation rule. Ni's system basically does structure search and enumerates a
large number of paths of this sort, and then it does a linear combination of these things, so it's basically
finding weights for the things that come out of each path. Each of these paths is a random walk, so it's
kind of smart about what it generates, and then additionally, we're going to put some weights on that
make the ensemble even smarter. So that's where we are. Here are a couple of other examples of
things. This is five papers cited in the last two years. This is one. These are some that are negatively
weighted, so if you just look at recent papers, then generally you don't cite those as much as older
papers apparently. This was Ni's system. I was talking about this with Tom Mitchell, and he suggested
the following idea: think about applying this learning system to basically predict relations inside the
NELL knowledge base. There's a relation like athlete plays in league, so this is in the ontology, you know
you want to predict it, and it may be the case that there are chains of relations that will get you there
reliably. So, for example, if I know that somebody is an athlete, and they play for this team, the Steelers,
and the Steelers play in the NFL, then that chain may help me predict whether an athlete plays in a
league. The question is can you use this same sort of path learning to learn a similarity metric which
basically says here is how I can infer the second argument of athlete plays in league from the first
argument. So that's the task we're trying to do: learning a binary relation in the ontology. So we tried
it and it actually works pretty well, and it's kind of interesting to see the sort of things that it finds, so
one is this athlete plays in a league, and the league is part of some organization, so let's see, because I
don't remember quite the semantics of this but that's basically sort of like this link right here with NELL's
actual concept names. Here's another one which is kind of cute. You start with your athlete. So I start
with an athlete, and I go to a concept that he's a member of, so there's just a node that indicates
the concept athlete in this graph, and then you find all things that are instances of that, so now I have
the set of all athletes basically uniformly weighted. And now I want to find out the sports those athletes
play, so at the end of this chain I basically have a distribution over sports that's weighted by the number
of athletes that play that sport. So this is basically sort of a prior distribution over possible fillers. Ni's
algorithm learns a lot of rules that are kind of like this. It learns things that sort of give you prior
information. It learns some things that look like inference rules. It learns other things that look like
they're basically sort of trying to get around holes in the knowledge base. Here's one. I want to find out
the home stadium of a team, so where do the Seahawks play. I can start with some member of the
Seahawks, find out what team they play for, now that really should be back to the Seahawks, but it
might not be because I may have two duplicate concepts. I may have different versions of the same
entity because the de-duping in NELL is imperfect. So then I could find the home stadium of that team.
That's a pretty good inference rule if you have some extra unduplicated entities. It learns a lot of
interesting things. PRA has now been one of the components of NELL for a while. That's where we are.
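Roughly, here is what that looks like in a toy Python sketch: each learned path type is a sequence of edge labels, the random-walk probability of reaching a candidate along that path is a feature, and a learned weight vector combines the features. The encoding and names are illustrative; the real system does structure search over paths and estimates the walks rather than enumerating them exactly.

```python
# Toy sketch of PRA-style scoring over a knowledge graph.
# kb[(node, relation)] -> list of neighbor nodes.

def follow_path(kb, start, path):
    """Random-walk distribution over end nodes after following a sequence of edge labels."""
    dist = {start: 1.0}
    for rel in path:
        nxt = {}
        for node, prob in dist.items():
            targets = kb.get((node, rel), [])
            for t in targets:                      # spread probability uniformly
                nxt[t] = nxt.get(t, 0.0) + prob / len(targets)
        dist = nxt
    return dist

def pra_score(kb, start, candidate, paths, weights):
    """Linear combination of one random-walk feature per learned path type."""
    return sum(w * follow_path(kb, start, p).get(candidate, 0.0)
               for p, w in zip(paths, weights))

# e.g., for athletePlaysInLeague one useful path might be
#   ("athletePlaysForTeam", "teamPlaysInLeague"),
# and the weights would be trained from facts already in the knowledge base.
```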
Now I'm going to talk about the next stage, so this is very recent work. It's recently appeared in a
workshop at ICML, and there are longer versions under review. This is actually cool stuff. I'm actually
more interested in this than I have been in anything for a while. When Ni was around,
we were talking about a number of possible extensions. Some of them he did for his thesis. Some of them
seem very hard to figure out how you would do. One question is can you make the rules that PRA learns
in this knowledge completion setting recursive. After a while of thinking about it, we came up
with this idea that maybe we can use the ideas from PRA in a more powerful representation, and thereby
get a whole bunch of these extensions at once. Just to give you sort of a concrete example again,
for PRA, what it learns, these paths really look a lot like logical inference rules, which is why I'm calling
this inference in the knowledge base. You might write this rule that takes you from an athlete to a sport
by saying, well, you know, if you're on the team, you can tell in the knowledge base that this athlete’s on
a team, and that team plays that sport, then the athlete plays that sport. That's a rule. You’re asserting
something new via this set of rules, based on things that you already know in the knowledge base. Of
course, you're doing this for lots of different relations, so you do it for athletePlaysSport. You also do it
for a teamPlaysSport, so it might learn a couple of rules for that. One of these might call the other, so
this one calls this guy. You might conclude the team plays the sport, if the athlete’s on that team, and
the athlete plays that sport. All these relations are, by design, closely interconnected and redundant.
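Written out concretely, the two rules just described look roughly like this toy Python rendering, which lets them call each other and also ground out against the knowledge base in the way discussed below. It only shows the logical structure; the point of the formalism introduced next is to attach learned weights and probabilities to these rule applications instead of returning a hard yes or no. Relation names follow the example above; the encoding, the entity names, and the depth bound are illustrative.

```python
# Toy rendering of two mutually recursive inference rules over a knowledge
# base of (relation, arg1, arg2) facts.  Each function first tries the
# ground-out rule ("it's already known"), then the inference rule.

def athlete_plays_sport(kb, athlete, sport, depth=2):
    if ("athletePlaysSport", athlete, sport) in kb:        # already in the KB
        return True
    if depth == 0:                                         # bound the recursion
        return False
    return any(team_plays_sport(kb, team, sport, depth - 1)
               for (rel, a, team) in kb
               if rel == "athletePlaysForTeam" and a == athlete)

def team_plays_sport(kb, team, sport, depth=2):
    if ("teamPlaysSport", team, sport) in kb:              # already in the KB
        return True
    if depth == 0:
        return False
    return any(athlete_plays_sport(kb, athlete, sport, depth - 1)
               for (rel, athlete, t) in kb
               if rel == "athletePlaysForTeam" and t == team)

kb = {("athletePlaysForTeam", "some_athlete", "steelers"),
      ("teamPlaysSport", "steelers", "football")}
print(athlete_plays_sport(kb, "some_athlete", "football"))   # True, inferred via the team
```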
PRA can learn this, but it couldn't learn something where these rules actually get called in a mutually
recursive fashion, so what you'd really like to do is have the rules basically not be limited to talk about
the knowledge base, but to basically test, see whether that fact is either known or it can be inferred,
which basically means you need to have the separate rules for teamPlaysSport set up here, and for each
possible thing you need to have some sort of ground-out rule that basically says well, if it's already
known in the knowledge base, then I'll conclude it to be true. Now this is a mutually recursive program,
and the question is how can we learn that? Everyone kind of clear where we are so far? Have I stymied
anyone? Okay. This is the goal. We want to learn something mutually recursive. So here is the idea. I'm
going to start with a mutually recursive program. I apologize, but I'm switching examples here, just to
kind of reuse slides. This is a different program that is basically doing some kind of similarity or label
propagation over web pages. A page X is about topic Z. If you've labeled it as about Z or if it's similar to
Y and Y is about Z, so that's kind of a propagation step. I'll say that X is similar to Y if there's a hyperlink
from X to Y or if there's a word in common between these two. That's basically what these two rules are
saying. The way I've set this up there's features associated with these rules, so this has two different
features; this has two different features. This has one feature, so the rule is always true, and I've set it
up this way, so when you get to this, the head of the rule contains the variable W which is the word, and
I can tag the application of that rule with the word that's being used. So there's a feature for this rule,
which is just the word that's being used. What are those features about? Well, the features basically
get put onto edges of a proof graph. The proof graph looks like this. We start with some goal. Up here
in the right-hand corner, we have a copy of the World Wide Web. It’s got four pages. One about
fashion, one about sport. I'm sorry, this is a subset. Here are some words in each of these things. So to
find out what page A is about, I can do my -- I don't have a hand labeling thing, so I look, do a
propagation step. It's got to be similar to something. There are two ways it can be similar. We could say
it's linked to, I guess this is B. B is labeled fashion, so this gives me one solution, fashion, and we can
chain through here. It's linked by a word, let's say the common word, sprinter. Sprinter is also in C, so
that gets me to C, and C looks like it's actually, well, I have to propagate that to D, and D is about sport.
So there's another little chain that gets me to sport. There are two possible solutions in the search
space. We now have a language for representing recursion, because we can have a recursive Prolog
program, and the cool thing about it is the proof space is a graph. It took me a long time to realize that
this is a reasonable thing to think about. This proof space is a graph, and since it's a graph, I know
something about how to tune this exploration process. I could imagine just basically doing the same
personalized page rank on the graph, which gives me a similarity metric between this query and these
two nodes, fashion and sport, so it's a similarity between queries and possible answers to that query.
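In toy form, reusing the personalized_pagerank sketch from earlier: build the proof graph for the query and score candidate answers by their random-walk-with-reset weight from the query node. The little graph below is hand-built to mirror the four-page example; in the real system it is produced by running the rules, and the edges also carry the rule and word features used for learning.

```python
# Hand-built proof graph for the query "what is page a about?", with the rule
# and feature that produced each edge noted in comments.

proof_graph = {
    "about(a,Z)":            ["sim(a,Y),about(Y,Z)"],      # propagation rule
    "sim(a,Y),about(Y,Z)":   ["about(b,Z)",                # sim via hyperlink a -> b
                              "about(c,Z)"],               # sim via shared word "sprinter"
    "about(b,Z)":            ["answer:fashion"],           # b is hand-labeled fashion
    "about(c,Z)":            ["sim(c,Y2),about(Y2,Z)"],    # propagate again from c
    "sim(c,Y2),about(Y2,Z)": ["about(d,Z)"],               # sim via hyperlink c -> d
    "about(d,Z)":            ["answer:sport"],             # d is hand-labeled sport
}

scores = personalized_pagerank(proof_graph, ["about(a,Z)"])
# With reset, the shorter proof gets more weight:
# scores["answer:fashion"] > scores["answer:sport"]
```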
Then given those -- given that search process, I can now think about tuning that search process to give
me the results I want, using standard techniques for learning and this graph here I've constructed. So
I've got lots of features. I’ve got lots of edges for the nodes. I have some ideas about how to do this, so
again, basically you’ve got this probabilistic transition through this graph. It depends on the features of
the node, which basically depend on the rules and sometimes the arguments of those rules as they were
applied. There are implicit reset transitions, because I'm doing personalized page rank, because I like
doing that. That basically means that, since there's a chance of reset at each step, it gets hard to do walks that
are very long, so it's much easier to take the short path than to take this relatively long path, because on a
long path you have a lot more chances to reset. So basically this favors looking for
answers that are supported by a lot of proofs and, in particular, a lot of short proofs. The semantics of
this, it turns out there's a formalism called stochastic logic programs that were similar but didn't have
the reset, that were explored 10 or 12 years ago. The semantics are basically all set up exactly the same
way. And there's one thing about this which is really cool. Ideally, we'd like to take this small recursive
program and apply it to a slightly larger subset of the web than the one I showed you in that slide, more
than four. If this proof tree will let you recurse across the whole web, so you can propagate labels all
across the web, then the proof tree we're talking about is really, really big. So if I use stochastic logic
programming for it, it wouldn't work, because building this proof tree is like doing a traversal of the
whole web. But if I want to do a random walk with reset, it turns out the weights in a random walk with
reset drop off very quickly. They drop off exponentially as you get further and further away from our
reset node. So that means that you can get an approximation to the random walk with reset by just
incrementally expanding away from the reset node. There's a nice algorithm for it, a very simple
algorithm that was described in a FOCS paper a few years back. It's called PageRank-Nibble, and they
proved that the size of the subgraph you explore is inversely proportional to the error and to the reset parameter. So as long
as you have some sort of lower bound on reset, you can get relatively small grounding. We know how
to do -- I won't talk too much about the learning algorithm, because it's 4:06 on Friday, maybe thinking
about the weekend now, but we're using an off-the-shelf learning algorithm for doing this, and the real
key thing here is this fact: that the size of the structure that you're producing, the ground structure that
you create that corresponds to a query, the thing that you're doing inference over when you're doing
the learning, is not that large. It doesn't depend on the size of the database; it just depends on these
two parameters. So that's a big deal, because if you have, in many other probabilistic representations, it
turns out the size of the grounding is actually very large, so for example, in a Markov logic network, you
start out by building a Markov random field where the nodes are all the possible atoms in the logical
theory. That's a huge set. In this case, since we've got similar(X,Y), that's basically not just all the nodes
of the web, but all the pairs of nodes of the web, or the number of nodes on the web squared. Here's a
couple of quick results just to give you an idea how this works. I've suggested it's going to be faster than
something like Alchemy, the Markov logic network system. The first thing we did was try to verify that, so we took a
data set that the Alchemy people put together, and we took their Markov logic network and we
translated it using a mechanical translation to our representation, and so Markov logic networks let
you have non-Horn clauses. I can only do Horn clauses. So the things that weren't Horn I just threw
away, so it gives me this program. Has everyone got that now? Good. Okay. Here are the numbers.
Here we're looking at inference. We're going from a tiny database that has four citations to a really,
really small database that has eight citations. I'm sorry; this is log. So this is two to the fourth, 16. Two
to the eighth, 256 citations. What you see here is there are different learning algorithms that are
applied to the underlying Markov logic network. There's also this lifted belief propagation strategy
where you try not to unroll everything. But in every one of these cases, what you essentially see is you
see that the time it takes to do the inference grows pretty quickly with the size of the data set. For our
case, that blue line is what we've got, so it's independent of the database size, given that we’ve got
alpha and epsilon fixed. And now we look at how well it works. This is before learning; we just put
uniform weights on all features. This is AUC; these numbers are not too bad, not too great. With
learning we improve on most of these cases. This is the Markov logic network, and this is all using the
same subset of rules, the 21 rules that we actually could use in our system. If we put in all of Pedro's
rules, then their system gets a little bit better, but we're still doing a little bit better than that. This is a
very encouraging result, that this is our reasonable bias, and it has the computational properties we
want. There's also another really nice benefit of this fact that the grounding process is bounded in size.
When you're doing learning, you basically get a whole bunch of queries, and for each query there's a
bunch of correct and incorrect responses, so I want to say what are the things that are the same as
citation 120. There are some things that are right; there are some things that are wrong. Operationally,
what we do is we start here. We do a proof for that query, and then that's grounded out to this little
graph. Now we have like a little non-first-order construct. It's just a graph with labeled edges and
nodes. And we're trying to tune the weights on these edges. The cool thing is those graphs are all
separate, right. We could combine these things if we want, but we can also keep the graph for each
query as a separate thing. The only coupling between these graphs is the parameters that appear, the
features that appear on the edges. Our algorithm is a little different: we're optimizing the same
metric that Uri and Lars Backstrom were optimizing, but they used like a Newton
method, and what we're doing is stochastic gradient descent. Basically you compute the
gradient with respect to a particular graph. You tweak the parameters and then you go on to the next
graph. This is really nice. It's sort of a small-memory-type operation because you're only doing one
graph at a time. The only thing that they share is the edge features, and you can also do these in
parallel, so you can have a different thread looking at the inference process over each little subgraph so
there's some synchronization, because you're sharing the edge features, but you can still get potentially
a big win by parallelizing. So this is some other numbers. This is doing the learning, and we're looking
here at how much you speed up as you add more threads. Ideally, if you have 16 threads, you get a 16X
speed-up. We don't quite get that, but we're doing pretty well. That's the formalism. And like ten
minutes ago or something, before you guys started dropping off and falling asleep, we were talking
about why I wanted to do that. What I wanted to do is I wanted to get to a mutually recursive set of
rules, and try and learn this mutually recursive set of rules. So this is learning how to do inference in a
stronger sense, rather than just learning how to do one step of the inference with the stuff I've got, I'm
learning how to do multiple steps of the inference at once. That's really the motivation behind this. I
don't have any way of doing structural learning right now for this language. What I did was I used Ni
Lao's PRA system. I just let it run and it learns nonrecursive rules. I take them and syntactically
transform them so they're recursive, and they can call each other. Then I take that and I train weights
on the whole program. That's the experiment. So a few details. What I did actually was -- so the
hypothesis is that this mutual recursion is going to help for the highly connected entities, the things that
are related to a lot of things, which are themselves related to a lot things. So to get at a sample that had
that characteristic, what I did is I picked a couple of different seed entities, and then I did just a kind of
simple random walk with reset, personalized page rank away from those, okay, so personalized page
rank will give you things that are close to the seed, but also give you things that are sort of central,
things that are well connected. It's kind of a mix between page rank and distance. So I picked the top
few things in that random walk, so that's a subset of NELL. Then I project the knowledge base to those
entities, so I did that in a few different ways, and here are the results. Just to pick one of these out here,
if we take the top ten thousand things closest to the concept baseball, the AUC for the nonrecursive
rules is about 75, okay. The AUC for the recursive rules is about 99. We're getting a huge lift here
largely because of the extra recall we get because we can do multiple rounds of inference. For this sort of
inference, the results vary across these different samples somewhat but there's always a big lift from
looking at the recursion and learning a recursive set of rules. I'll just let people stare at those for a
second. I really should have a better visualization of this. Another thing that's actually kind of
interesting is that these things also tend, since they're very closely connected, statistically they tend to
also be more common. There are more facts about the things that are kind of closest to baseball than
things that are kind of farther away. What you also see typically is you see as you go down this list the
test sets get harder and harder. The way I split into train and test is that the training set is everything up to
some particular iteration. It might have been 713. The test set are the things that are learned later on
in the bootstrapping process. This is the first way that we're now using joint inference in NELL.
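As a side note, the sampling procedure described a moment ago for building those test subsets (pick seed entities, take the top of a random walk with reset away from them, then project the knowledge base) can be sketched in a few lines, again reusing the toy personalized_pagerank function from earlier; the concept name and the function signature below are only illustrative.

```python
# Toy sketch: pick entities by random walk with reset from a seed, keep the
# top-k closest / most central ones, and project the knowledge base onto them.

def kb_subset(kb_facts, adjacency, seed, k=10000):
    """kb_facts: set of (relation, arg1, arg2); adjacency: entity -> neighbors."""
    scores = personalized_pagerank(adjacency, [seed])
    keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return {(r, a, b) for (r, a, b) in kb_facts if a in keep and b in keep}

# e.g., kb_subset(nell_facts, nell_graph, "concept:sport:baseball")
```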
I'm going to talk a little bit about a second way that we're doing that, okay. Lise Getoor is going to be at
the faculty summit. I believe she's going to be talking about this to some extent, though I haven't
actually checked with her, so maybe she's going to talk about something different. I don't know. Just in
a little bit more detail what NELL does is it basically iterates over and over and over, okay. Each iteration
it basically learns classifiers with each of these feature sets including PRA and basically does distant
training for that feature set using the current knowledge base. It uses the current knowledge base to
provide labels on whatever data it's got, sentences or semi structured pages or what have you. Then
once you've got the results of that learning, it's a classifier; you can extend it; you can get new
candidates based on the unlabeled things that the classifiers fired on. So we get a whole bunch of
candidates that are from different learning systems, and then we do some heuristics to find the
best candidates, and those are -- in the system we call it promotion, right. They're promoted into the
next version of the knowledge base, and then we go around the next loop. The knowledge base is a
little bit bigger, so in each iteration the knowledge base grows and it's a little bit better at recognizing
things because it learns a few extra patterns and how to parse a few more web pages and so on. What's
the hard part? Well, we're using some fairly ad hoc techniques here to find the good candidate beliefs.
It'll be nice to do this in a principled way. What's the right algorithm for doing promotion? What we're
really trying to do here is kind of deal with the fact that there's lots of noise in the database, many
different types of noise, but there's also lots of redundancy, so it seems that the way we should deal
with these many different types of noise and exploit the redundancy is to do some sort of joint
reasoning. We should pool information about things like co-referent entities or possibly co-referent
entities. We should enforce mutual exclusion and other sorts of ontological relationships, so we should
find something that sort of jointly satisfies everything as well as possible under the assumption that
there is going to be noise. Also under the assumption that we don't really know what kind of noise
there is. We don't know how much there is of each type. Just as an example of this, here is a set of
extracts. There's one here that's like way off. It thinks this entity is a bird rather than a country, which it
really is, okay. It thinks these things might be the same thing, but it's not sure. It thinks that this is the
capital of that place, okay. It knows that things that have a capital are countries. It knows that countries
and birds are disjoint sets. Those are mutually exclusive. These are the facts we have, and we'd like to
kind of get to something like this. So this is sometimes called, like in Google they're calling this a
knowledge graph, so this is a certainly a graph. For a long time, Lise Getoor has been looking at using
the term knowledge identification to talk about the problem of basically going from this sort of graph
which has many types of errors. She's been using the term knowledge identification for this process of
coming up with a clean de-noised version of a graph. All right. We started talking and we decided to try
using her graph identification techniques for this problem. A couple of students, Jay and Hui, are really
doing all the work for this project, let's be honest. How does she do this? Again, she uses joint
reasoning in a probabilistic logic, so it's different from the one I'm using, and it's interestingly kind of
complementary. I won't go too much into the details. Essentially you ground things out by instantiating
your general rules in all possible ways, so unlike the proPPR language, this page rank version of prolog,
the grounding process can be relatively expensive okay. But one nice thing is when you're done, the
inference problem is convex, so finding the most probable explanation given your current weights for all
the rules is a convex optimization problem. How do we turn this into a joint reasoning problem in her
framework? We basically have a bunch of inference rules. One basically says well, if NELL was going to
promote it, then maybe it's a good idea to promote it. There are two types of promotions. One is for
concepts and one is for relations. So this is basically true if NELL wanted to promote the thing, and
basically this sort of says yeah, that's a real relation; that's a label. Here's another one that basically says
if anything, if there's any candidate then go ahead and promote it. It's better to promote candidates than
not promote candidates. These are some rules which basically enforce consistency about things where
there's a co-reference assertion, so if you think entity one and entity two are co-referent, then they
should have the same labels, for example. There are also constraints that come from the ontology. For
example, if L is a subset of P, okay, athlete is a subset of person, and entity E is in class L, so somebody's
an athlete then they better be a person too, so this sort of enforces the subset relationship. These
enforce the mutual exclusion relationships and so on. These rules are actually too expressive for my
logic, so the problem is basically these rules aren't Horn, right. These mutual exclusion things, I can't do
those in my logic, but we can do them in Lise's logic. Here's kind of the rough results. So we've done
some more experiments, but just to give you an indication, there's a set of facts from a year or so ago
from another group. Daniel Lowd, one of Pedro Domingos's former students, and some of his students looked at
basically the same sort of scenario, doing this promotion step within NELL as a joint reasoning process.
They were using Markov logic networks. They very generously gave us their evaluation data so we could
-- and their kind of precise predictions so we could reproduce their results. The joint inference runs in
just a few seconds. The learning is a little bit longer. The baseline here is sort of NELL's learning
method, okay. Here, the AUC is about .76, the MLN does about .99, and the PSL does about the same.
This is actually sort of a version of NELL's code that's been tweaked for just sort of the test set, so the
thresholds have been adjusted for those 25 K test sets. I'm going to go ahead and stop here and maybe
spend a couple minutes with questions. Basically the two things I've talked about are sort of an
overview of this NELL system and some of the key ideas behind it. Then an overview of kind of these
two lines of really recent research. One is basically treating promotion, a sort of key step in bootstrap
learning systems, as an inference process, which isn't really very interesting if you think about
bootstrapping in some very simple space where you just have one or two concepts, but it's actually
quite interesting when you're talking about bootstrapping in sort of like a nontrivial, significant
ontology. The other thing I talked about was learning how to do inference over a knowledge graph and
in particular learning how to do this multistep, possibly recursive inference. We've taken kind of past
work which is based on kind of random walk with reset type procedures and learning algorithms and
extended it to first-order logic, which is actually kind of cool because we get some very nice scalability
properties for this and some very nice properties with respect to how we can parallelize that inference.
Okay. That's basically as far as I'm going to go. So thanks a lot for your attention. You can clap now and
then you can ask questions. [applause] Matt?
>>: Can you go to the previous slide? Did you compare proPPR to the PSL, like the Horn subset?
>> William Cohen: We actually haven't done that. We could do the Horn subset, and that's kind of on
the list of things to do, right, so the other thing that's kind of on the list to do is to see whether some of
the ideas from proPPR's grounding procedure can be modified for PSL, right. They've been kind of like
things that we've been kind of plunging along with sort of slightly independently, and we've been talking
a lot about these things. I'd really like to see that happen. Yeah, one question is how much are those
handful of things and how much do they matter? There are also some strategies that I've thought
through. For example, what we could basically do is set up rules that basically say I am going to infer that
there's a conflict if you see this sort of situation, okay, and then basically tell the system here's some
positive examples of things you should infer -- and by the way please don't infer any conflicts, so every
conflict is sort of a negative example of something you could infer.
>>: Sort of pull the negative stuff out and do it as like a post hoc kind of?
>> William Cohen: Yeah, something like that. To do that, there are like some technical things we'd have
to do. The easiest way of like training something around that would be just to basically sort of say well,
these rules that allow me to infer conflicts, the guy says he doesn't want conflicts, so let's just weight
these rules down, right, then I'll have no conflicts. It's not quite as simple as just doing that, but I think
that approach might be applicable. Certainly the question of well, we sort of would need to have some
sort of random walky semantics for PSL if we're going to use the same kind of grounding strategy, right.
Or maybe we don't. Maybe it works okay without that. It's not clear, but it'll be good to do some sort of
experiments to kind of feel what the space of things is there. Any more questions? Yeah, one in the
back.
>>: Kind of, how do you deal with entity resolution? I guess I'm sure it's built in. Do you give pretty much
every [indiscernible] a unique identifier and then?
>> William Cohen: Oh, yeah. Let's see. Okay. Short answer, long answer. In NELL, basically there's sort
of a typically two versions of an entity. One is basically sort of the entity viewed as a string, and then
there's the entity as assigned to a type, okay. If you look at something that's ambiguous like apple,
there'll be Apple the company; there'll be apple the fruit, okay. There'll be strings like the string apple,
right, which can refer to either of those, or maybe Apple Inc. which maybe will only refer to only one of
those. That's the representational scheme that NELL uses. The way it gets there is it sort of has a couple
of classifiers which I didn't put on that picture, so there's one that basically looks for things that should
be co-referent and aren't. Then there are also some heuristics, so if you have an entity that's got a couple
of mutually exclusive high-confidence predictions, it will consider splitting that. That's sort of the
machinery that's in NELL. The results I gave were actually not just using those co-reference facts
though, actually. It was actually using those and some additional co-reference facts from some co-reference machinery that Lise's students put together, because they're like kind of big co-reference
people too. That's another discussion that Lise and I are going to have sometime in the future about
sort of evaluating those changes and sort of figuring out. I mean they clearly help with the promotion
stuff. Is that something we want to bring into the main NELL system or not? Yeah that's kind of the
short or longer story. Rich?
>>: I was just curious about how you do experiments with NELL. Let's say you release something and
then you realize a couple weeks later that it's actually creating a problem and a bunch of bad facts
you're getting added to the system, a bunch of bad inferences. You must have some sort of rollback
procedure where you can go back before that happened and sort of continue.
>> William Cohen: Yeah, right. That has actually happened in the past. The way NELL works basically is
after each round of bootstrapping, the knowledge base, okay, and all the learned patterns and classifiers
are basically checkpointed, so we can, in fact, roll back at any point, but that said, it's really hard to do. I
mean typically -
>>: You can't roll that back, right?
>> William Cohen: Sorry?
>>: You can't roll the web back, so presumably -
>> William Cohen: Oh. Right, right, right, right. So that's -
>>: You can't repeat an experiment.
>> William Cohen: Well, we do live queries. We take the pages that come out and we cache them okay,
and then whenever you refetch the same page, you get it from the cache rather than from the web,
which is not ideal for temporal things, but that sort of gives us some more reproducibility. Then most of
the patterns and stuff are built on the distilled version of the clue web thing which is a static repository,
static crawl.
>>: Do you have any sense if you were to change the random seed and rerun from a certain time, how
similar you would be after the two weeks?
>> William Cohen: We haven't done that for the large web repository okay. I've done a bunch of
experiments like that with different versions -- a different version of NELL that was basically sort of set
up over a biomedical ontology. It certainly is quite sensitive to the seeds, in particular, the quality of the
seeds, so one of the limitations of NELL is that there's some engineering here of the ontology, and of the
seeds, to kind of keep the system from breaking, right, because you're giving seeds that you kind of know,
because you're the system designer, you're a smart grad student from CMU, right, you know that these are
going to be unambiguous relative to the sets of things that you know. They'll be frequent enough to be
informative. There's a big question of sort of how robust is NELL when you get to a different ontology.
One of the reasons why I did the biomedical thing, it's sort of like other designed ontologies. There are
not a lot of interesting ontologies that have these sets of mutual exclusion and things like that that can
be used.
>>: Do you ever get compound interest effects where if you make something just one percent better,
after six months everything is much, much better? Is there a thought that there might be singularity in
the [indiscernible] sense if you achieve a certain incremental improvement in the short term, or is this
just bad thinking?
>> William Cohen: You've got to ask Tom whether that's true and see whether he'll go out on a limb and
say yes.
>>: One percent worse can be catastrophically bad [laughter].
>> William Cohen: That's concept drift, right, and that sort of clearly happens. We've had to sort of like
roll back, and there are lots of reasons. I mean one is you'll like run into things that are just sort of
spammy, right, but yeah. For example, the entity matching stuff was trained on some subset of things
and it worked pretty well for a while. Then we changed the ontology, and then it stopped working,
right, so suddenly it kind of like went out of control. There are some kind of like hacks basically that we
use to kind of keep things from cascading in either direction too fast. One of the hacks, which we'd have
to think about whether we want to incorporate it in Lise's extension or not, is that no single component
can introduce too many facts of any type at once, and that's enforced even in kind of ridiculous ways.
We'll put in something that introduces facts about the geolocations of things and we'll have a very
conservative matching algorithm, and so the first time around, it'll say okay, well, I know the geographic
locations of a hundred thousand things, and it'll be like, okay, you have to put in ten thousand this
iteration, ten thousand that iteration and so on. Yeah, there are geographic facts that are sort of like
waiting to get in the queue as we iterate for example.
>>: I'm wondering what is the key difference between the NELL and other [indiscernible] such as
Freebase or YAGO.
>> William Cohen: Yeah. I guess the goals are kind of the same, to build sort of a broad coverage
knowledge base. There are a lot of technical differences. YAGO2's getting closer. YAGO has a smaller
ontology, and it's done by extracting from a less diverse set of sources, mainly Wikipedia. Freebase is
done primarily by merging structured databases, and the merging is sort of very systematic and user
guided. I mean I think they're sort of hoping to get to the same place, but we're sort of trying to get
there kind of by different tools. Part of why we're doing it this way is because we kind of like the idea of
pushing the frontiers of semi-supervised learning and things like that as well. There's different
motivations for doing these things, but I mean there's a lot of -- we hired a postdoc from the YAGO
group, for example. There's a lot of commonalities between those different projects.
>>: You can also [indiscernible] YAGO to [indiscernible].
>> William Cohen: That's something -- well, we haven't done YAGO. We've done some preliminary
experiments with Freebase. There, the big thing is that there's not as much ontological information, so
we have to sort of infer things like mutual exclusion constraints. Like doing Freebase as sort of an initial
thing is also like a very reasonable sort of thing to do. We're doing some experiments where we're like
merging results from NELL and YAGO and Freebase for applications, so we're also using NELL to, say, do
like distant training for extracting relations from individual sentences, and there's no reason not to
supplement that data with those sources. Do you have a question?
>>: This is probably blasphemy for NELL, but given a closed ontology, is there any benefit in knowing
when to stop promoting facts into the knowledge base? For instance, country, where you saw Iraq_which.
>> William Cohen: Yeah. No, no, no, that's actually a great idea. That's a great question.
>>: I mean it's not never ending anymore [laughter]
>> William Cohen: Right, well, so yeah. That's right. It's not never ending learning, that's true. Well, I
mean there are other parts of NELL that say, for example, add new concepts, right, or add things to the
ontology, but yeah. I tease Tom and say it's like you’re making this a selling point, but never ending is a
synonym for slow, right? Really we should get to the end, right? [laughter] Yeah, I'm actually working
with a student right now, Bhavana Dalvi. She's working on basically semi-supervised learning in a fixed ontology, where you know it's a fixed but incomplete ontology. Like, a
simple case is you're doing semi-supervised learning into 20 classes but no one's given you seeds for 10
of them. You only have seeds for 10 of the classes, so as you progress, eventually you'll start running
out of countries and then you'll start grabbing other stuff. The question is can you recognize when the
countries stopped and you've started moving into the next outlying concept. Maybe there are things
like US states which are in your ontology and you can say whoa, that's a US state; it's not a country. But
there might be other things that aren't in your ontology like, I don't know, provinces of ancient Rome,
right. Is Gaul a country? There may be things that sort of have additional structure.
>>: I wonder how NELL can deal with facts that depend on time. For example, an athlete belongs to one
team, plays in one team. The next season, he might play in another team. Would the promotion here
conflict with the existing database?
>> William Cohen: Right now NELL is like if something's true, then that basically semantically means it is
or was true at some point in the past, so it doesn't consider time in the web site version. We have been
doing a bunch of research on how to capture time. Partha Talukdar has been doing some stuff on this as
part of his research. Our current approach for dealing with time is to look at a large corpus that has
timestamped objects, like a newswire corpus or something like that and have a similar ontology that
talks about when things can sort of, which things can co-occur and which things can't co-occur. For
example, you can have a lot of senators, US Senators, at one time. You can only have one US President
at one time. You can't be both a Senator and President at one time. If you look at the distribution of
when you see facts that suggest that Barack Obama is a Senator and when he's a President and when
George Bush is a President, right, then you can get some sort of information about when those -- if you
look at that jointly, you can get some information as to when those transitions happen. That's kind of
how we're dealing with the time issue, and there are a lot of interesting issues. How do you infer those
constraints, for example? So far, that stuff hasn't been rolled into the existing NELL system. Matt?
>>: Sorry if I missed this in the beginning, but where does the initial ontology come from? Is that -- so
how many concepts and relations?
>> William Cohen: So it's less than 1,000 but more than 600. I don't remember the exact number right
now. It changes over time because part of the development process is developers look at this, right.
You manually add, and every time you add, you sort of stick a few things in, like how do you state it and
so on. Yeah, the ontology is a big deal.
>>: Yeah.
>> William Cohen: But you can just download it if you want it.
>>: Yeah, yeah.
>> William Cohen: Okay. [applause]