
>> David Lomet: So thank you all for coming. It's my pleasure to be able to introduce
today's speaker who is Gerhard Weikum from the Max-Planck Institute in Germany.
Gerhard has a long and distinguished career in the database area and has been a visitor
with us on a number of occasions before, some of which turned out really well, and so
we're hoping that this visit will also turn out well.
Gerhard is in fact a director of the Max-Planck Institute. He's an ACM fellow, and he is
the -- is it chair or president of the VLDB?
>> Gerhard Weikum: President.
>> David Lomet: President of the VLDB Endowment. So he's been a leader in the field
now for quite a number of years. And it's a real pleasure for me to be able to introduce
him. And he'll be talking about harvesting, searching, and ranking knowledge on the
Web. Gerhard.
>> Gerhard Weikum: Thank you, Dave, for the flattering intro. Even though it's flattering,
I still appreciate it.
So the vision that drives this work is the goal to turn the Web, or whatever new Web we will have
in the near future, into the world's most comprehensive knowledge base, into a semantic
database that does not only know about Web pages and text in Web pages but about
entities and relations between entities.
And the approach that we are pursuing with the people that I'm collaborating with is a
three-step procedure. So first is: get the knowledge, lift the Web pages into more explicit
notions of facts and organized lists. And there are different ways of doing this, so in this talk I will
mostly talk about leveraging hand-crafted, high-quality knowledge, ideally in the form of
ontologies, but maybe also looking at encyclopedias.
But there are other ways; we can use text mining, natural language processing, statistical
learning to go after the implicit facts that are embedded in natural language text. We are
doing some of this, too, but it will not be presented in this talk. Or we can try to
harness the folk wisdom, the wisdom of the crowds, that's implicit in social tagging and
other things along these lines.
Once we have the knowledge, the next step is how do we query it, and of course we want to
raise the bar here as well, so we don't only want to run simple QA
queries but go after sophisticated things, and you will see examples. And it's not just
searching but also ranking results. Often you end up with a lot of results, and so ranking
is absolutely crucial.
And finally all that should be done efficiently and scalably. So why can't we do this with
Google today? Suppose we run Google just limited to the Wikipedia domain; that's the
closest approximation to my goals, but that does not work well, and here's just a bunch
of examples where this Google centric approach would fail, so in natural sciences here's
one example from life sciences. You are often interested in specific entities and then
relationships between these entities. That theme also shows up in humanities, so I'm
working a little bit with people from humanities. So here the not so obvious connection
between Thomas Mann and Goethe is that Thomas Mann wrote a novel in the early 20th
century, (inaudible), which features Goethe, who was dead more than a hundred years by then,
as a character in the novel. So that would be cool to figure that one
out.
Then we have quiz-question-like things which are easy to answer for a human, because a
human can get the bits and pieces from the Web with some effort, and they might even be
structured, so birth dates, death dates and so on, pretty structured, but the human still
needs to find quite a few of these bits and pieces and connect them in the right way.
So these joins, if things were in a database, would be trivial, but doing these joins
directly over the Web is a pain. The answer by the way is unfortunately Max Planck, who
died in 1947 after he had lost the last of his four children.
And then some of them are actually simple queries but they involve ranking over lots of
results. And just to drive my point home I tried this one, which politicians are also
scientists. You can vary the way you express this to Google. You can write it as a
natural language question because Google has some abilities to deal with this and the
results don't get better than this, right, so this is mostly about scientists and politicians
debating global warming, et cetera.
And what's wrong here? So this is one of my favorite quotations, by Frank Zappa. The
first line is the key point here: information is not knowledge. This is really raw
information, these are raw assets, but it's not knowledge, it's not lifted to the higher level,
and you can read here the further levels above the level of knowledge, right.
And the first line is very good, the rest I don't know, maybe he smoked too much grass at
that time.
(Laughter).
So this is a screen shot from one of the prototype engines that we've built, called
NAGA. Of course we're not good at doing (inaudible), and I will explain what's going on
here. But we do ranking, and the ranking is pretty good, because the data comes
from Wikipedia and there are thousands of people that qualify, and what we get at the
very top is Benjamin Franklin, who invented the lightning rod and also parts of the U.S.
Constitution. Paul Wolfowitz is well known, Angela Merkel is the German chancellor and
she has a degree in physics and wrote a dissertation about physical chemistry. So this is
about the best you can get here in terms of ranking.
I don't have any real slides or an elaborated part of the talk on related work. This could be
open ended; I could talk half an hour or longer only about related work. Essentially
what's going on is, I see three major areas which have been making great advances in the last
few years, and we position ourselves essentially in the intersection of these. So of
course Web search does look at entities, but in a limited way, driven by business
opportunities: if it's about products, if it's about locations, then yes, of course
they do some form of dealing with named entities. To my knowledge, they don't go far in
terms of relations among entities.
Information extraction has gone a long way based on text mining and other techniques,
and for about a decade we've seen that ranking over structured, semi-structured data
also makes a lot of sense. So here's the outline. It's essentially three systems and three
papers. The first is about the YAGO system and the knowledge base that we've built
this way; this corresponds to a paper we had in last year's (inaudible). Then it's the
search engine and the ranking over this data, called NAGA, which appeared in ICDE this
year. And the last one, right now not yet integrated but in the process of being put
underneath as the engine underneath the other stuff, is a very fast RDF engine
coined RDF-3X, for RDF triple express, which is going to appear in this year's (inaudible).
And by the way, you need to drive me, or you can drive me, in terms of where I put
emphasis and which parts I may have to skip. It's maybe a bit too much material. So
when you have questions or when you think I should rush on, let me know, right.
So now the first part is about building knowledge bases. How do we go about this? So
we could do text-mining-based information extraction, for example from this old
Wikipedia page about Max Planck, and that would essentially move the text into a
seemingly structured database in terms of records, so we can find the birth dates, birth
places of scientists, scientific results, more details about these, maybe even combining
some of what we see here with other sources. Not everything that you see on the right is
actually written here. Collaborations and so forth.
And the techniques for doing this are a combination of methodologies, so natural language
processing plays a role. Pattern matching -- some of the low hanging fruit is just regular
expression matching, so birth dates are actually easy; email addresses would be even
easier. Statistical learning might play a role, so we need training material. In order to
determine whether something should move from here to the structured side or not, we can
also reject hypotheses.
And some of that is boosted by assets like lists of place names; for example
Kiel is a place name, so it would be in the dictionary of German cities. Jonathan?
>>: I'm just wondering about the degree to which the discovery of the fields that you (inaudible),
so for instance scientific result, and to what degree that's automatic versus (inaudible)
scientific results is important. (Inaudible).
>> Gerhard Weikum: Okay. Maybe the last part kind of addresses this, so let me make a few
remarks. One is, this looks like a database, but it's kind of a different database. So
there's uncertainty. Usually we don't have confidence one in these extractions. There
might be a few exceptions -- because birth dates are so successful we might have
confidence close to one.
But certainly not in things like inventions and so on, right. And in the long
run we need to quantify these confidence measures and later on take them into
consideration when we reason over this knowledge base. So there's uncertainty that we
need to carry along, right. Does this roughly address your concern?
And also different techniques give you different confidence, right, and so do the kinds of
relationships; there are some easy ones and some tough ones, right. This is certainly not
among the easy ones.
So the second remark is that some of the techniques are very expensive. We actually built a
tool a few years ago which feeds a dependency parser and feature representations
into a statistical learner, but the dependency parser, which is a deep natural language
parsing technique, takes about a minute on a long Wikipedia article -- maybe not on this
one because it's shorter. So it does not scale up.
The next thing, again coming back to the question or remark here, is that the confidence is
sometimes much less than one, right. So you can sometimes really fish in the dark and
get hairy fact candidates, and then you end up with some ugly problem of configuring all
kinds of thresholds and parameters of your extraction machinery, and that's a black art,
right.
So although we have worked along these lines and we're still pursuing it, at some
point we actually took a step back and wanted to at least produce a core of accurate
knowledge that would not suffer from these problems, also with better scalability
properties.
So we looked at the available high-quality, hand-crafted knowledge sources as opposed to
arbitrary natural language text, and we wanted sources that are already closer to a
knowledge base. So ideally maybe already in a logic-based representation like
ontologies, but there are no super big ontologies available, so the closest approximation
to an ontology that we've been working with would be things like
WordNet, for example, which is kind of a thesaurus but can be turned into a
lightweight ontology.
And this one here -- this is a screen shot -- would tell you there's a concept of
scientists and there are special cases, hyponyms or subclasses, of scientists, so that might
help in getting the knowledge for answering the politicians-and-scientists query.
Interestingly, WordNet also sometimes has instances, but here it has Roger Bacon, an
alchemist from the Middle Ages, and doesn't seem to have Aristotle, and many others are
just -- I mean, this is just a totally arbitrary set of samples and it's very small. And that tells
you something about the strengths and weaknesses of these approaches. It's very
strong in the taxonomic relations, so it has lots of concepts, abstract concepts, and
essentially is-a relations among them and part-of relations and things of that kind, but it
lacks knowledge about individual entities.
So there should be thousands of physicists or scientists in this ontology, but there are not.
The source for this other side of knowledge, the individual entities, is Wikipedia. Now it
looks superficially like we're back to text mining because of these entries, but we're not; if
you look close enough and carefully enough you see this is actually structured data, this
so-called Infobox, and these Infoboxes appear pretty frequently. And this is the source
code of the Infobox. It actually follows a template for scientists, and there are quite a few
of these templates. There's one for pop bands, there's one for companies, so you
see the CEO of a company, you see the drummer of a pop band, and so on, and here for
example you see the (inaudible), right, and other interesting facts.
So this can be harvested fairly easily. There's more. Many Wikipedia articles are
placed manually, but with community-based quality control, into a rich category
system. So the Max Planck page for example is in these categories, so even if we
didn't get the birth and death year from the Infoboxes, which might be the case for some
lesser known scientists, then these categories tell us at least the years.
They also give us instance-of relations, so we can figure out that Max Planck was a
physicist, or a German physicist, which again is a specialization. Sometimes we need to
be careful: Max Planck is not a physics, right, he's a physicist. That's an important detail
and makes a difference.
So we did this systematically, we went systematically after the Infoboxes and the
Wikipedia category system and built a fairly big knowledge base that we've coined
YAGO, for Yet Another Great Ontology, and the way we represent the extracted facts is in
the form of triples: two entities connected by a binary relation, right.
So you can represent these in logic, if you wish, or in RDF. RDF is kind of a natural
representation here. We not only harvested Wikipedia in this systematic manner, but we
connected the extracted facts to the existing WordNet taxonomic backbone. This is maybe
the not so easy thing: we tried to place every entity that we extract from Wikipedia into
the right classes in the taxonomic space of WordNet, and when we invent new classes
from Wikipedia -- category names like German physicists, which in this combination is not
in WordNet, why would it be, right -- we need to map them into the taxonomic space and make
sure it becomes a subclass of physicist, and then WordNet would also know that physicist
is a subclass of scientist, and so on.
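To make the representation concrete, here is a minimal Python sketch of a YAGO-style store: binary facts as triples plus a subclassOf backbone, with transitive class lookup. The relation and class names are illustrative stand-ins, not YAGO's actual identifiers.

facts = {
    ("Max_Planck", "type", "German_physicists"),
    ("German_physicists", "subclassOf", "physicist"),
    ("physicist", "subclassOf", "scientist"),
    ("Max_Planck", "bornIn", "Kiel"),
}

def classes_of(entity):
    # All classes of an entity, following subclassOf edges transitively.
    direct = {o for (s, p, o) in facts if s == entity and p == "type"}
    result, frontier = set(direct), list(direct)
    while frontier:
        c = frontier.pop()
        for (s, p, o) in facts:
            if s == c and p == "subclassOf" and o not in result:
                result.add(o)
                frontier.append(o)
    return result

print(classes_of("Max_Planck"))   # {'German_physicists', 'physicist', 'scientist'}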
So this gives you a more complete but still very tiny picture of what we did here.
It's publicly available. It's pretty big: it has close to two million entities
and more than 50 million facts. Facts are always instances of binary relations, and it's
much bigger than Cyc, for example, and SUMO and a bunch of other things that are
available in the public.
We are smaller than this other concurrent endeavor to harvest Wikipedia called DBpedia,
but part of the explanation is that we actually gave YAGO to them, so it's incorporated,
right. And they have a lot more facts because they don't have that goal of staying
mostly consistent and staying at a very high level of accuracy; they just form unions of
things, so if they get redundancy they just add, add, add, and they create links. But it's
pretty noisy, pretty redundant, and so they cannot properly reason about consistency.
We're also in the process of giving this to the Freebase guys, which is a startup company
in the bay area where the approach is a bit different. They would say: why should we
just harvest what's existing already -- rather, we get a big community to actually enter facts
in the form of database records. So they should already be entered in structured form.
But here again the issue is how do you get a coherent whole as opposed to just a huge
collection of detached bits and pieces, right? So I think by using YAGO as a backbone
they should be in a much better position.
We manually evaluated accuracy by extensive sampling, and the accuracy is -- this is a
conservative figure because we used Wilson confidence intervals and everything.
And the errors that we actually observed are often errors that already come with
Wikipedia.
So I'm not saying this is deep science; it's not very scientific at all. The things we
did earlier that did not work so well were a lot deeper and required -- so this is picking low
hanging fruit by the right engineering.
The extractor software is also made open source now, and it runs pretty fast, so in a few
hours it processes all of Wikipedia and combines it with WordNet, so you could repeat
it on a daily basis if you wished.
But the engineering has its problems, so here are a few examples where
you need the right engineering. So it's non-trivial. One is finding the
instances of the instance-of relation. This mostly comes from seeing an
entity appearing in certain categories in the Wikipedia system, and as I said earlier, there are
good category names like Nobel laureates and bad ones like American music of the
20th century. Many singers are in those categories, but a singer is not a piece of music.
So it turns out a simple heuristic works pretty well: we run a noun-group parser over these
noun phrases, and if the so-called head word is plural then it's indeed a class; if it's
singular we should not apply it. So physics and music are singular.
And there are some meta categories, like disputed articles, where you just need to hand
code, hard wire, that you should not fall into this trap.
>>: How do you know physics is singular?
>> Gerhard Weikum: Oh, these noun-group parsers have that.
>>: Okay.
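To make the head-word heuristic just discussed concrete, here is a toy sketch. The real system relies on a noun-group parser to find the head word and decide its number; the tiny lexicon and suffix test below are only illustrative stand-ins for that.

KNOWN_SINGULAR = {"physics", "music", "politics"}   # lexicon stand-in for the parser

def head_word(category_name):
    # Crude head-word guess: drop prepositional modifiers, take the last token.
    core = category_name.split(" of ")[0].split(" in ")[0]
    return core.split()[-1].lower()

def is_class_category(category_name):
    # Plural head word -> the category denotes a class of entities.
    head = head_word(category_name)
    return head not in KNOWN_SINGULAR and head.endswith("s")

print(is_class_category("German physicists"))                   # True  -> class
print(is_class_category("American music of the 20th century"))  # False -> not a class
print(is_class_category("Nobel laureates in physics"))          # True  -> class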
>> Gerhard Weikum: Something similar shows up here. So sometimes we
actually learn about new classes, composite or specialized classes like
Nobel laureates in physics, right. And we do capture this, right, so we want to keep
that, but we need to connect it to the right super class in WordNet, and here these
techniques help a lot. So for example here we know that people is the plural of
person, right? So parsers know this. Natural language tools are pretty good these days.
And by the way, we use other people's tools for that, right? So there's more in the
entity name space. Of course we cannot produce any miracles. We don't do entity name
(inaudible) here, but we actually harvest the ambiguity. If there are five ways of referring to
the same entity, we learn it from this input. So for example this one here points via
the Wikipedia redirection system to one thing, and this one points to the same thing. They
don't cross-reference each other directly, but by transitivity we know that they denote the
same entity -- namely, a little test for the audience.
>>: (Inaudible).
>> Gerhard Weikum: What?
>>: Isn't that the Andromeda --
>> Gerhard Weikum: Very good. I forgot to bring you some chocolate. Usually I give out
little rewards to the audience if they work well.
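A tiny sketch of how this transitivity can be exploited: treat each redirect or alias pair as an equivalence statement and group names with a union-find structure. The alias strings here are just examples.

class NameIndex:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def add_alias(self, alias, target):
        # alias and target denote the same entity (e.g., a Wikipedia redirect).
        self.parent[self.find(alias)] = self.find(target)

idx = NameIndex()
idx.add_alias("Andromeda Nebula", "Andromeda Galaxy")
idx.add_alias("M31", "Andromeda Galaxy")
print(idx.find("Andromeda Nebula") == idx.find("M31"))   # True: same entity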
And in general we have a methodology here. It's a collection of heuristics and rules,
right? But we have an overriding methodology that we coin reductive type checking, so
we are willing to get lots of fact candidates, but then we run them through some
scrutinizing procedure. So we want to make sure that the whole knowledge base stays
consistent.
So for example the taxonomic part should never become cyclic. And we do what we call
type checks in this knowledge base sense. So for example, for whatever
reason -- we also do text mining, right -- some text mining tool might give us
a conjecture that Max Planck is married to quantum physics, because the sentence said
Max Planck was married to his work, right?
But now the married-to relation is a relation between two people, ideally of opposite
gender, so we have a type violation and we reject that fact candidate.
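A minimal sketch of such a type check, assuming a hand-crafted signature catalog and the class information harvested before; all names here are illustrative.

relation_signature = {
    "marriedTo": ("person", "person"),
    "bornIn": ("person", "location"),
}

entity_classes = {
    "Max_Planck": {"person", "physicist"},
    "quantum_physics": {"field_of_science"},
    "Kiel": {"location", "city"},
}

def type_check(subject, relation, obj):
    # Accept a fact candidate only if both arguments match the relation signature.
    dom, rng = relation_signature[relation]
    return dom in entity_classes.get(subject, set()) and \
           rng in entity_classes.get(obj, set())

print(type_check("Max_Planck", "marriedTo", "quantum_physics"))  # False -> reject
print(type_check("Max_Planck", "bornIn", "Kiel"))                # True  -> keep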
So we mostly build on first order fact -- Jonathan, yes?
>>: You may have a (inaudible) co-authored with somebody who is also married to their
work and therefore conclude that they were married (inaudible).
(Laughter).
>> Gerhard Weikum: Wow. So thanks for -- this is a new project for a new student.
(Laughter).
Nothing is perfect here. I mean, you're right. So we mostly build on just vanilla first-order
logic, so these are just instances of binary relations. This goes a long way, but once in
a while you come across the need for higher-order representations. So here we see
Berlin is the capital of Germany, but we might also find Bonn is the capital of Germany,
and no country should have two capitals. We can also add constraints, right; so we
could have logical constraints in the knowledge base.
And the explanation is that these refer to different times. So how do we represent this?
This looks like higher order, because the valid-in relation is now a relation between a first-
order fact and some other constants. But there's a technique that's
been around in this community for decades, and RDF uses this tool; it's called reification.
We just give these facts here IDs, and then we can use the IDs as arguments to what
used to be higher-order facts.
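As a small illustration of the reification trick (with made-up relation names), each base fact gets an identifier, and temporal validity is then stated as ordinary triples about that identifier:

base_facts = {
    "f1": ("Bonn", "capitalOf", "Germany"),
    "f2": ("Berlin", "capitalOf", "Germany"),
}

meta_facts = [
    ("f1", "validDuring", "1949-1990"),   # statements about fact IDs,
    ("f2", "validDuring", "1990-now"),    # so everything stays first order
]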
Just to make sure you don't think only Germany has problems with this: the world in
the U.S., at least in California, is also not so simple, but of course it comes from an
Austrian, so you can blame it on Europe.
(Laughter).
So if I have time -- here are a few words now on where we stand and what we are doing
now. This goes back to actually deeper scientific things.
So we believe YAGO knows all the interesting entities. It's not really true, but we could
make it true. So if you're the mayor of some village, you'd better be in Wikipedia,
and typically you are.
If you're the drummer of some garage band and you want to make a career, you are in
Wikipedia. So exceptions are computer scientists -- Wikipedia is not strong in this regard,
but we could systematically harvest DBLP or, what's it called, DBLife, and I mean
there are sources for this. Then there are for example biochemical entities. Of course
Wikipedia knows a lot of enzymes, proteins, drugs, diseases, but this is a big zoo of
terminology. There are things like UMLS for the life scientists; they also do their job, do
their homework and organize their terminological and taxonomic things.
So this could be leveraged and imported. So I think we're pretty much done in terms of
entities. What we are missing is the coverage in terms of relations, both relation types --
there are so many interesting relation types that we don't know about -- and then also the
instances for the relations. So we do go back to the text mining now. And our tool that we
built a few years ago works as follows: it runs natural language sentences through a
dependency parser and builds these graph representations with cryptic tagging going on,
but an expert can read this, and this is still a syntactic representation but it comes closest
from a linguistic viewpoint to a semantic interpretation.
If you stay within context-free grammars here.
And we can use, for example, the shortest dependency path between the two
arguments of interest as a feature representation for statistical learning, and if we are willing
to have some training we could mark this as a positive sample for a learner and this as a
negative sample, and then the learner could digest sentences like this, right, and then
would say yes, indeed, Paris is located on the Seine. Hopefully, right?
But this is expensive, as I mentioned earlier. And this is why we actually did not pursue it
as aggressively as we could have done two years ago. But I think we are now in a much
better position to reconsider this, because YAGO as a backbone gives us a head
start now. So we can filter out many uninteresting sentences that we come across
because they don't contain two entities that might possibly be related to each other, and
we get a lot of this information from YAGO already.
We can quickly identify the relation arguments and we can do type checks, so if we are
after the runs-through relation, we'd better look at sentences where one argument,
one entity, is of type river and the other is of type location, at least, right. And we could do
more fancy things, like check that the river and the city are at least on the same
continent, otherwise why would they be nearby?
So these are things we are looking at now, ongoing work, no hard results yet. We're
particularly focusing on time aspects, because we have lots of interesting facts -- some
of this is low hanging fruit, like the CEOs of companies and things of that kind -- but it's time
evolving, so it's interesting to check out at what time points which facts hold.
Okay. So something got messed up here in the order. We'll see. Okay. So the second part
is the search. Now that we have this knowledge base, we also thought about how do we
search it, and then, whenever queries produce too many results, how do we do the ranking.
Yes, please?
>>: I had a question about the information (inaudible). So it seems easier (inaudible)
possible to extract (inaudible) from Wikipedia and other sources, but how do you know
(inaudible)? Are these time coded, updated automatically from some sources?
>> Gerhard Weikum: So that's a very good question. Some of the relation names come
just from the Infoboxes, right, so it says headquarters or something, or CEO, and so on, right.
So we can infer then that this is a relation between a person and a company, and we
name it after what we see in the Infoboxes.
In other cases we have to hand craft like a catalog of interesting relation names.
And this is a bit unsatisfactory. On the other hand, I gave this some thought and I asked
some pretty smart people, and what we are asking for is a tall order. What we are asking
for is the universal catalog of entity relationship types. We all know this from
school, right; the world has more than just departments and employees, right? We're
not talking instances, only the types: shipment and order and so on. So we want to do
this universally. What are interesting relationship types in the world? So what's the best
data dictionary or catalog of relationship types?
And none of the people I've talked to, and some of them have been working on this
conceptual modeling for decades, had a good answer. So I had good hopes that for
example something like Google Base would give me clues. It's not about relations but it's
about attribute names. And I talked to Alon Halevy, for example, and he told
me forget it, Google Base is all about used cars, right.
And I would love to see for example interesting relations between people. Like, once in a
while I watch strange movies with complicated plots, and after a while I'm totally lost and I
don't know who is actually the nephew of whom and who falls in love with whom -- well,
this is the easy part actually -- who is jealous and so on, right. So these are all interesting
relations, and only between people. So a very good question, but it's a tall order.
Okay. Other questions?
So then the search. Because we like the graph-based visualization of the data, we also
came up with a graph-based language. But you can rehash this into some other
representation. And some of these queries are actually pretty straightforward from a
database viewpoint, so this is a vanilla conjunctive query. Easy thing in SQL or any of
the other established languages.
In graph form, all we do is replace some of the graph nodes by variables, and then the
semantics is: you find bindings for these variables from the data such that, after you
replace the variables with the bindings, this becomes a subgraph, or is seen as a subgraph,
in the data.
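A rough sketch of this semantics, with illustrative data: a query is a list of triple patterns whose '?'-prefixed positions are variables, and an answer is a binding that turns every pattern into a fact of the data.

from itertools import product

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(patterns, facts):
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding, fact in product(bindings, facts):
            b, ok = dict(binding), True
            for p_term, f_term in zip(pattern, fact):
                if is_var(p_term):
                    if b.get(p_term, f_term) != f_term:
                        ok = False
                        break
                    b[p_term] = f_term
                elif p_term != f_term:
                    ok = False
                    break
            if ok:
                extended.append(b)
        bindings = extended
    return bindings

data = {("Angela_Merkel", "type", "politician"),
        ("Angela_Merkel", "type", "scientist"),
        ("Benjamin_Franklin", "type", "politician"),
        ("Benjamin_Franklin", "type", "scientist")}
query = [("?x", "type", "politician"), ("?x", "type", "scientist")]
print(match(query, data))   # bindings of ?x: Merkel and Franklin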
So we have some additions which are no longer that straightforward, so here the
Thomas Mann-Goethe relationship question would actually be addressed by, or
formulated with, a big wild card. And we call these relatedness or connectedness queries.
So here's a bunch of entities; tell me how they are related, do they have commonality, do
they interact in some way? And the answer would mostly be the labels of the, in this
case even paths, that connect the entities.
And that's a special case -- just a second -- a special case of having regular expressions in the
language. And here we don't have a good characterization, not yet, of what
this actually entails in terms of expressiveness, complexity and so on. But nevertheless
we wanted to have this, so this is one way of reasoning about the strong
German universities, and now you see it has little disjunctions here, it has a star
because the located-in hierarchy can have variable depth, and so forth. Jonathan?
>>: I was just thinking that with the connectedness query we had something that
(inaudible). If there's multiple paths how do you sort of pick which one?
>> Gerhard Weikum: I'll come to that. I'll come to that. So the ranking is a big issue
here. Right now this is just the query and we get result sets, and then the ranking is
actually very important.
And we can also have queries over these facts, right, and there's also a linear
syntax for this.
So now we come to the ranking and here I messed up my slides so I need to go back and
then forward again. I messed up order of slides for whatever reason.
So now I ask a query like: Fisher is a scientist, what else is Fisher known for, right? And
maybe there are many Fishers -- there are indeed, right? And we run this without ranking,
so we get exact results, but the order in which we get the results back -- and there's
thousands of them -- is arbitrary. And this is the top result: Ronald Fisher is an alumnus of
this college in Cambridge. And when you go a little bit further in the ranking -- so Ronald
Fisher is a good result. Everybody knows -- well, who was Ronald Fisher? He
invented maximum likelihood estimation. He is probably the most important statistician of
the last century.
And then there are two unknown Fishers. Why would they be on ranks two and three? They
just happen to be, right. Then there's Ronald Fisher the theorist, Ronald Fisher the
colleague -- very important property -- Ronald Fisher the organism, and then Ronald Fisher
the entity, and so on, right.
And so this is wrong; I mean, this is not what we want. This is the flawed ranking that you
get unless you have ranking. Ranking is greatly underappreciated. I think this really must be
a first-class citizen here; it cannot be an afterthought.
And indeed we have developed a statistical language model for this graph-based
representation which computes better rankings, namely it gives us Ronald Fisher the
mathematician, the statistician, the president of the Royal Statistical Society -- these are encoded
IDs, so you have to follow the relationships in order to find that out -- then Ronald Fisher
started his career doing crop experiments, and for giving significance
to them he had to invent statistics and then changed the field, and so on.
So it's pretty good. So what are the other criteria -- and here one of them is the answer to your
question, Jonathan. There are three big dimensions for the ranking criteria. The
first one is confidence, because even though we have been driven a lot by this YAGO
extractor, which is high accuracy, we should have other extractors as well, and then
the accuracy or confidence in the extracted facts varies widely.
And there are two subdimensions to this. There's the certainty of the extractor. I mean, if
you use a risky learning-based technique with very little training data, your confidence
cannot be as high as, for example, when extracting email addresses by regular expression
matching.
And the other dimension is the authenticity and the authority of the sources -- actually that's
two subdimensions, right, these are different things. And you see this illustrated just by
examples. So a straightforward sentence, Max Planck was born in Kiel, from a high-
authority source which never lies, should be taken almost for granted. Whereas if you
see this weird sentence, they believe Elvis hides on Mars, in this strange blog, be careful
about this.
>>: (Inaudible) sign the authority (inaudible).
>> Gerhard Weikum: There's -- this is the big picture here. So we have pragmatic
implementations of this. Now we can break this down to do this or that. So one of the
easy things is you go by PageRank, right. So authority is easier than authenticity.
Authenticity is about telling the truth, right, as opposed to -- I mean, it can be a high-
authority source about jokes, right? Okay. So the second dimension, and I will elaborate a
little bit on that one, right, because I think it's the most interesting one.
We have implemented something on each of those, but there needs to be more work,
right, to make this systematic.
So informativeness is about telling something interesting. Don't tell me Fisher is an
entity. And ideally this would be a subjective criterion, and in fact the word is a concept in
linguistics and cognitive sciences, where it would mean: tell me something that I didn't
know before. But then you need to personalize it, you need to know something about the
user, and for the time being we don't have this.
So for the time being, we would say: tell me something that most people find interesting.
So it's really driven by frequency statistics of different kinds. But we are
working on personalizing it, and then it could mean: tell me something that I didn't know
before. And it's illustrated again by example. Suppose we are asking, what's Einstein
known for? Most people would prefer an answer like Einstein was a scientist rather
than Einstein was a vegetarian. Of course if you are yourself an accomplished physicist
maybe you do prefer the second one. But without knowing anything about the user we
can't do this.
Now, you cannot just precompute relevance or informativeness of facts, because this is
context dependent. It depends on the query. You take a slightly different query, give
me important vegetarians, and then this one, which should rank low here, should rank high
there, right? So there's no easy solution to this. And I will explain in the next slide how
we do it.
The third dimension is what Jonathan asked before: results should be compact in
some form. If it's just a path between entities, we prefer short paths over long ones; maybe
the path labels and how dominant they are in the overall
knowledge base might play a role, and things of that kind. And sometimes we want to
connect more than just two entities, and then we actually look at compact graphs. This leads
to (inaudible) computations unfortunately, so we're also biting here into a computationally
expensive bullet.
But we're doing work along these lines. Again, illustration by example. So we ask how is
Einstein related to Bohr. This is a good answer; inferring that Tom Cruise is also a
vegetarian and was born in the year in which Niels Bohr died is kind of weird, right, so it
should rank lower. But you need ranking to get this out, right?
So how do we do the informativeness part of the ranking? We follow best practice in
information retrieval, which is based on statistical language models. Now, these are
generative probabilistic models, so the rationale is that you have a bunch of documents
and each document is viewed as a probability distribution over observable features, for
example words or normalized terms, and often you might postulate a parametric form for
this, so for example a multinomial distribution. And then the document itself can be
used to estimate the parameters, but maybe you want a background corpus for smoothing
the parameter estimation.
Now, the query is treated as something that would be generated by a document.
On first glance it's kind of an odd rationale, but this is how it works. So you pretend
the query is a sample from the probability distribution of this document or this document
or that document or that document. And you prefer in the ranking those documents for
which the likelihood of actually observing that query is highest. So this is the model. And
then of course you can use Bayesian arguments to reverse everything, and then it looks a
bit different.
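In formulas, the textbook query-likelihood model looks roughly like this (a generic form with linear smoothing against a background corpus C, not the exact NAGA model): documents d are ranked by the likelihood of generating the query q = q_1 ... q_m,

P(q \mid d) \;=\; \prod_{i=1}^{m} \Big( \lambda\, P(q_i \mid d) + (1-\lambda)\, P(q_i \mid C) \Big),
\qquad
P(q_i \mid d) \;=\; \frac{\mathrm{tf}(q_i, d)}{|d|} .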
So we applied this, too, but we don't just have text documents. Now, some
of these language models have been carried over to attributes and records, but not to
relationships, to my knowledge. So this is something new we did. And often you run into
both sparseness issues for parameter estimation and computational tractability, so,
just like many other people, we factorize our probabilistic models into independent
components, and here the unit of factorization is essentially one edge. So the smallest
meaningful subgraph is one edge, and its label and its two endpoint entities together
constitute the fact.
And now, so these would be the things that generate queries, and this is a query now with
a parameter. So in some sense we're asking: given this bornIn Goethe Frankfurt
fact, what's the probability of generating this query? Or in other words, you can give it some
intuitive meaning: if a user asked this query and were presented this fact, would the user
be satisfied?
I happen to be born also in Frankfurt, right, so there's a good test case, but I would
not claim that I should rank higher than Goethe, right. So I'm the other candidate for
generating this query.
And then there are some bells and whistles here which we're going to skip. And I'm going
to explain by example how we then estimate these query likelihoods. In the end it
boils down to simple correlation statistics. So you might say, well, very easy and very
simple, but the framework is much more powerful, because we could now generalize it and
take more things into consideration and stay in the same formal apparatus, which is nice.
So let's do this by example. So here these two facts could generate this query, so what
we do is estimate the probability that the answer part which binds to the variable $X,
namely myself or Goethe, appears in some corpus given that the other, the input
constants, appear. So it's about: what's the probability of seeing Goethe given that you see
born in and Frankfurt, or what's the probability of seeing me given that you see born in
and Frankfurt. And then for different query types, depending on how many variables we
have -- we can have two variables in a meaningful way -- things change a little bit.
Now, the actual estimates for these conditional probabilities could come -- according to the
book they should come from your main corpus, but that might be easily misleading. If
you do this on the knowledge graph itself and you happen to have way more physicists
than vegetarians, then, because of the way the conditional probabilities are defined, that would
actually give a higher rank to the vegetarian fact, right, because the is-a-vegetarian count
appears in the denominator here.
And we actually tried this; it doesn't work well, because no knowledge base is
perfectly balanced and it doesn't have redundancy, and redundancy
helps a lot in statistics. So we actually go for the Web in this case, or some sample of Web
data that we precompute, and do the estimations there.
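A toy version of this estimate, with made-up co-occurrence counts standing in for the precomputed Web statistics:

def cooccurrence_count(*terms):
    table = {                                          # stand-in for Web statistics
        ("Goethe", "born in", "Frankfurt"): 12000,
        ("Gerhard Weikum", "born in", "Frankfurt"): 40,
        ("born in", "Frankfurt"): 50000,
    }
    return table.get(terms, 0)

def informativeness(candidate, relation, constant):
    # P(candidate | relation, constant), estimated from co-occurrence counts.
    joint = cooccurrence_count(candidate, relation, constant)
    marginal = cooccurrence_count(relation, constant)
    return joint / marginal if marginal else 0.0

for x in ("Goethe", "Gerhard Weikum"):
    print(x, informativeness(x, "born in", "Frankfurt"))
# Goethe scores far higher and is therefore ranked first.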
You saw this. So we did some systematic user studies with question-answering queries
and some queries by ourselves. We compared to a bunch of competitors. You
quickly get into an apples-versus-oranges comparison here, but at least you're not
comparing apples against T-bone steaks, right, and so the results have to be
interpreted very carefully, and I will not elaborate on this. The main point is that some
techniques -- actually Google does very well on many of the question-answering things.
Yahoo Answers, which matches new questions against manually answered
existing questions from a big catalog, is lousy. This is a question-answering system done
by MIT which also utilizes Wikipedia -- this is why it's interesting to compare with. This is
keyword graph search, where we actually use our engine but take their scoring model
for the ranking. And this is us, so we are doing pretty well.
Questions so far?
>>: I just want to interpret the numbers on the previous slide. So in terms of the tasks
that you had (inaudible) you gave them like a broad question, you told them somehow
research the answer to the following question then you (inaudible) whatever.
>> Gerhard Weikum: No, no, no. We -- well, indirectly, yes. So this is why I said it's
apples versus oranges. Because there are different inputs, ways of taking the input; they also have
different (inaudible), right. So we tried various ways of formulating the queries for each of
them. We did, and "we" is the pool of people that are working together. So we tried to
be as good as possible for each of them, right.
For our system there was typically one canonical way of formulating the query,
but here you have tons of ways, you're right. But we tried. We gave them a fair chance.
We tried lots of ways and this is the outcome. So take this with a big grain of salt. Don't
overrate it.
Five minutes?
>> David Lomet: 15.
>> Gerhard Weikum: 15. Okay. So there's plenty of time. So the last part is now about
efficiency. And it's not yet integrated, right. So YAGO can be downloaded, both the data,
the knowledge base, and the extractors. NAGA is kind of a prototype. We have a
Web demo but it's not very stable, and sometimes it's awfully slow, sometimes it's fast.
Most of the queries on the previous slide can be answered in a few seconds, right, even
with NAGA and the ranking. But sometimes we run into big problems.
So we also gave some thought to efficiency here, because we're using a knowledge
representation model that is close to RDF, so we got into RDF, and also because last
year's best paper award in (inaudible) went to an engineering
exercise on RDF. So I felt like we could do better, and indeed, together with
one of the people in my group, Thomas Neumann, we did something better.
So why RDF, and why can't you just reuse your good old SQL engine for this? So why
RDF is pretty clear. I mean, here's another of these knowledge graph excerpts, but the
edges pretty much correspond to RDF triples. In RDF terminology these would be called
subject-property-object triples, and there's lots of ID encoding going on, and strictly
speaking some of these must be URLs or URIs and so on, so I'm using simplified syntax
here.
Now, why can't you just reuse some engine -- or is it at least a good healthy exercise to
rethink architectures? Well, RDF lends itself pretty nicely to this new paradigm known as
pay-as-you-go data spaces. You just enter data, and in fact this is probably why
biologists like it more than for example XML, or why they don't put everything into a
relational database first.
So you don't need to worry about schemas; maybe some schema evolves, but then you can
quickly change the data or group it or whatever. Maybe it never evolves; that's also
fine, right. So it's schema last, if ever, right? And this is a nice paradigm that we like.
The triples form an entity-relationship graph, but because of the triples this is very fine grained.
So there's no notion of distinguishing attributes of an entity from its relationships; they
are all properties. So this blows up the whole thing, and syntactically you don't have an
easy way of grouping things into more manageable units. As a result, the queries are big
joins; often you have lots of star joins but also long chains of joins along relationships, and
therefore the physical design is very critical.
But if we are in this new world of data spaces, or supporting scientists, or supporting
kids -- high school kids browsing and discovering things in a knowledge
base -- there's no workload predictability. You cannot just say last year's workload looked like
this, next year's workload will be like that. So it's totally ad hoc, right.
So the language that people advocate -- I don't like it, but still this is out as a World Wide Web
standard -- is coined SPARQL. It's pretty much select-project-join combinations encoded in
so-called triple patterns. Everything with a question mark is a variable that can be in the
place of a subject, a property, or an object. Then you see the relations like isBornIn and
so on, and you see some constants.
And the dot between these triple patterns is actually a conjunction, so this
entails a join in relational jargon.
There are some complications. In particular the relation can also be a variable, which just
makes it interesting. It means you don't care so much about the schema, so for example
here we would say some person is related to some town and the town to some country,
but we don't say exactly by which relations -- whether the person is born in that city,
died in the city, lived in the city, knows someone who lives in the city if that were one
relation, or the town is the capital of some state that once belonged to this country, and so on.
So if you unfolded this into SQL it would be a huge union, and then for each of the cases
you get long, big joins, right. So this is a pain. This is what prompted the Abadi
paper. To some extent I'm repeating the arguments from last year's paper.
And then there's some typing, of course, which so far we think we know how to address,
but we haven't done it in the engine so far.
Now, coming back to physical design, there are different prevalent approaches. This RDF
thing has been around for a decade, and everybody just smiled at it; only
recently do people take it more seriously. So the oldest approach is probably: you put
everything into this one big table with only triples and then you need to operate over this,
which for all the queries entails lots of self joins with this big table.
So Abadi said this really sucks, you cannot do this. But it's our approach, so
you will see why it works. Then Abadi actually advocated this thing here: you group the
triples by the same properties, and you do this in the extreme form, so you end up with the
maximum form of partitioning, and then, because the property is encoded in the table
name, you have almost a column store. So no big surprise, they actually use a column
store for storing and indexing this and doing the query processing.
There are things in between which are kind of closer to the conceptual
entity-relationship model, but they are difficult to handle; in particular they face
a big physical design problem, and I haven't seen big success stories on this.
So our engine -- well, I'm glad (inaudible) is here -- so finally I found a case for the RISC-style
engine paradigm that (inaudible) and I formulated in a paper back in
(inaudible) 2000. So the RISC here really means reduced complexity, in the sense of
the original RISC processors.
And the rationale is, well, first of all let's build an RDF engine and a kernel for RDF and
nothing else -- not RDF plus XML or an RDBMS with whatever else. Simplify all operations:
the SPARQL language, when you look at its core, is pretty simple. Reduce implementation
choices: all the joins are pretty much merge joins. Optimize for the simple and frequent
case and radically eliminate tuning knobs, so we don't have any physical design here at all,
right, because it's the same for all possible data.
And this is essentially, from a bird's-eye view, the solution. The first part is a standard
engineering trick: don't bother with these long literals, URLs, and strings; encode
everything into integer IDs, and then you're dealing with fixed-length ID triples all the
time. And then actually this giant triple-table approach is not so stupid, and you can
actually afford to exhaustively index it.
So overall we build 17 indexes, and that's good for every possible RDF database.
Because we're dealing with these fixed-length triples, compression kicks in great. So
it's wonderful.
The query processing works almost only with merge joins. We have hash joins also in
the repertoire, but they're hardly ever needed, and because of this big simplification
the query optimizer also has a lighter task, so we can afford an exact dynamic-programming-
based traversal of the search space of execution plans, and this goes up
to 20 to 30 joins, so we always find the optimal plan within much less than a second, right, within
less than a hundred milliseconds.
We also have some new techniques, semi-new techniques, on the statistics for the data.
So now on the indexing. The literature on RDF is strange; you really see
where these people are coming from, so they are really dreaming of the semantic super
strong thing, but when it comes to systems they cannot count one, two, three,
right. So you have these triples -- subject, predicate or property, object -- and there are
six permutations of them, so you can have six indexes that make sense. And the
literature has proposals like, we should have an SPO index plus OSP plus PSO, and then
they stop, right. So why have only these three and not all six, right?
So we actually build all six exhaustively. They are all directly in the leaves of a clustered
B+ tree. The data could actually be thrown away; everything is in the indexes. And
the compression is so good that the total size of the indexes is less than the original data.
Now the beauty here is that this means for whatever scan or join we encounter, we
always have the right ordering of the three components.
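A sketch of the idea (with toy ID-encoded triples): keep the triples sorted under all six component orders, so any scan already delivers the sort order a merge join needs.

from itertools import permutations

triples = [(1, 10, 100), (1, 11, 101), (2, 10, 100)]   # (subject, property, object) IDs

indexes = {}
for order in permutations((0, 1, 2)):                  # SPO, SOP, PSO, POS, OSP, OPS
    key = "".join("SPO"[i] for i in order)
    indexes[key] = sorted(tuple(t[i] for i in order) for t in triples)

# e.g. all triples with object 100, already sorted in object-subject order:
print([t for t in indexes["OSP"] if t[0] == 100])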
We do even more. So these are the first six indexes. Now, the next six ones: we also
index all binary projections of these triples in all six possible permutations, and
the missing component is replaced by a count aggregate. And this comes in very
handy when you have to deal with duplicates. Sometimes the SPARQL semantics
forces you to actually produce a bag result with duplicates, and we never need to carry
these around. So we only work with these indexes and then have the counter available,
and as we join we can actually (inaudible) the counters and only carry around the
(inaudible) of the result bag, as opposed to this duplication.
And we also do this for unary projections, so this is six plus six plus three, 15 indexes,
and then we have the mappings between the literals and the IDs. So these are the 17
indexes. And you never, ever need anything more, right. So there's no materialized path
or view indexing, et cetera.
Now, on query optimization, as I said, because of the simplicity we can afford essentially
exhaustive plan enumeration in dynamic programming order. So my co-author, Thomas
Neumann, has done a lot of work along these lines and knows how to do these things,
right. So we measured this for more than 20 joins, and we're also within 100 milliseconds
of finding the cost-optimal plan. Of course it's optimal relative to the cost model; if the
cost model is wrong, you may not have a good plan.
And on that latter aspect, we build just standard histograms. It turns out we can
quickly compute them from these count-aggregated indexes. What we do is we
successively merge neighboring entries in the indexes in greedy order and thereby
approximate an equi-depth histogram. We do so until the structure is small enough to be
called a histogram, so that we can keep it within whatever size limit we give it. This is done
for all six orderings as well.
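A sketch of that construction under these assumptions (greedy merging of neighboring entries from a count-aggregated index until a size budget is met; not the exact RDF-3X code):

def build_histogram(entries, max_buckets):
    # entries: (key, count) pairs in index order
    buckets = [([key], count) for key, count in entries]
    while len(buckets) > max_buckets:
        # merge the neighboring pair with the smallest combined count
        i = min(range(len(buckets) - 1),
                key=lambda j: buckets[j][1] + buckets[j + 1][1])
        merged = (buckets[i][0] + buckets[i + 1][0],
                  buckets[i][1] + buckets[i + 1][1])
        buckets[i:i + 2] = [merged]
    return buckets

entries = [("p1", 5), ("p2", 7), ("p3", 100), ("p4", 3), ("p5", 2)]
print(build_histogram(entries, 3))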
And in addition, now in SPARQL we thought a bit about where the biggest error in the (inaudible)
estimation is, and it seems that this is in these long join paths. So you essentially start with
one entity and you want to connect it to some other entity over five relations or properties.
But we make a big error here only if we actually have lots of results for this. So that
would mean that this path here, P1, P2, P3, P4, P5 -- these label sequences must be
frequent.
If they are infrequent, making an independence assumption doesn't hurt you much, right,
because you underestimate anyway. Jonathan?
>>: Do you have a sense for sort of if there is such a thing, a typical query and how long
these join paths are --
>> Gerhard Weikum: No, no, no. Well, you will see our benchmark in a minute.
Other questions?
>>: So if you have a set of facts this would work, I think. Now, if you throw in the
schemas as well, then in terms of (inaudible) because everything else is (inaudible)
something and this is in class so everything is connected to each other. So (inaudible)
self join, you know, whatever (inaudible) you have roughly, maybe two of these
(inaudible). So I'm wondering --
>> Gerhard Weikum: We would not store it as a million tables. We keep -- we're pretty
convinced about our storage and indexing scheme. So if you impose a schema, it's a
conceptual thing.
>>: (Inaudible).
>> Gerhard Weikum: Oh, this particular technique here at some point might become
useless, but it's not expensive, so unless -- I mean, you could also materialize these
paths, right, which is what the Abadi paper does; it uses some techniques along these lines, too.
But that means you get into physical design, so then actually building the materialized
view, maintaining it as you get updates and so on -- there's a cost.
Here there's hardly any cost at all in building this synopsis, and the worst that can happen
is that it doesn't help a lot. But it will be accurate. I mean, if you have super
frequent paths -- in fact, in the original benchmarking data that Abadi used, which we
also used, you have this situation -- then they just tell you the truth, yeah. I mean,
certain queries do become more expensive, but for the queries that we are looking
at we're doing fine. I don't see the explosion; even if you have a million
tables you would not have joins between a million tables, right? So the
join paths would be of some bounded length, right? The synopsis may be useless at some
point, right, but it doesn't hurt, right? In fact, we have a fallback. These
precomputed statistics would be exact if you didn't have any selections in addition to the joins.
But of course you have these. Now you cannot precompute each and every thing, so
what we do then, if there are additional selections -- and these are typically along joins
with properties that play the role of an attribute, this is why I named them A; they are also
only properties, and there would be a constant here -- is we actually
chop this up into pieces, the largest possible pieces, so that for every piece we have a
selectivity estimator in our repertoire, and then we postulate independence and do the
usual arithmetic. This goes a long way.
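A minimal sketch of that decomposition, with a made-up statistics table standing in for the real synopses:

def estimate_selectivity(fragments, stats, default=0.1):
    sel = 1.0
    for fragment in fragments:
        # use the precomputed statistic for the fragment if one exists,
        # otherwise fall back to a default guess; multiply under independence
        sel *= stats.get(fragment, default)
    return sel

stats = {("bornIn", "livesIn"): 0.002,       # frequent join-path statistic
         ("type", "scientist"): 0.01}        # selection statistic
print(estimate_selectivity([("bornIn", "livesIn"), ("type", "scientist")], stats))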
So, benchmarking. As you know there are (inaudible) and benchmark assumptions. But
we went broader than the Abadi paper. So this is just the setup. We had two opponents:
essentially reimplementing the Abadi approach, but not using C-Store but MonetDB,
which Abadi himself recommended because C-Store is no longer maintained as open
source, and on this machine MonetDB was easy to run, whereas C-Store would have been a
pain.
And we emulated one of the more traditional approaches of dealing with triple stores for
RDF by layering it on Postgres, right. So we're following a bit what Abadi did.
We had three data sets. This Barton library catalog is what Abadi used.
Then we applied it to our own knowledge base; this is why this whole work fits into the
theme of the talk. And we experimented just for the fun of it with an excerpt from a social
tagging platform and actually made the tags property names, which is maybe odd and
maybe not a smart way of doing it, but it gave us a data set with very
different characteristics, because this had more than 100 thousand different property
names as opposed to the others, which have like hundreds, right. So we wanted something else.
And then you see some of the queries. So this is from Abadi's benchmark -- he actually
cheated a bit, he phrased the queries in SQL, and SPARQL doesn't have
the kind of aggregation operators that he actually used.
So what we implemented is two of them, and we really followed his rules. Then on YAGO we used
queries of the kind that you have seen before; on the social tagging data we used
things like this, so we played a lot just with the tags. And the query set -- we asked people
in the group, obviously. I mean, this is an ad hoc benchmark, but by the breadth of things
we think we got some insights.
And this could easily entail 20 joins, right? And here are some of the results. In terms of
load times and database sizes, the systems were roughly comparable. MonetDB
seems to create indexes on the fly, so after load time it looks a lot smaller; once you run
the benchmark, it's a lot bigger than our data.
This is for the three data sets; each of them came with like a dozen queries or so, and this
shows you the geometric means of run times for both warm and cold caches, and you see
that we are almost always between one and two orders of magnitude
faster than the competitors.
And interestingly, of the two competitors, neither dominates the other, so you also
see that the Barton experiment that Abadi did is some specific setting, right, so it does not
easily generalize to lots of other things; here the order is just reversed, right. And the
social tagging data posed yet other characteristics.
So in conclusion, I told you about this big vision and mission that we're pursuing in my
group: take the most valuable raw assets of information and lift them to a level of explicit
knowledge in terms of entities and relations between entities. We're pursuing this
knowledge harvesting theme. I talked mostly about the semantic approach going after
encyclopedic and ontological sources. We do pursue text mining as well, and also a bit of
mining social tagging information -- of course this is (inaudible) and also interesting. I
see big potential synergies in combining these approaches, but I also see big problems.
So how do you deal with consistency when you use hybrid machinery? In general
there are tradeoffs between consistency and uncertainty. I mean, you might go for
consistency, but then you throw away too many uncertain facts and you suffer in
coverage. And an important issue is long-term evolution: today's knowledge is out
of date tomorrow, so people correct wrong facts in a knowledge base, knowledge
changes. How do you tell which of these situations applies? So if you wanted to maintain a
knowledge base with time annotations over a long time, until some time horizon, you run into
lots of additional problems that we haven't even understood yet.
In terms of ranking, I think we did something interesting. We're in the process of
expanding and generalizing this, and especially making it a personalized ranking
approach. And likewise on the efficiency and scalability issues I think we did an
interesting piece of work, but there's a lot more to do. Thank you.
(Applause)
>> Gerhard Weikum: More questions?
>> David Lomet: Jonathan?
(Laughter)
>> Gerhard Weikum: Okay. Let's take them offline. Thank you.
(Applause)