>> David Lomet: So thank you all for coming. It's my pleasure to be able to introduce today's speaker, who is Gerhard Weikum from the Max-Planck Institute in Germany. Gerhard has a long and distinguished career in the database area and has been a visitor with us on a number of occasions before, some of which turned out really well, and so we're hoping that this visit will also turn out well. Gerhard is in fact a director of the Max-Planck Institute. He's an ACM fellow, and he is the -- is it chair or president of the VLDB? >> Gerhard Weikum: President. >> David Lomet: President of the VLDB Endowment. So he's been a leader in the field now for quite a number of years. And it's a real pleasure for me to be able to introduce him. And he'll be talking about harvesting, searching, and ranking knowledge on the Web. Gerhard. >> Gerhard Weikum: Thank you, Dave, for the flattering intro. Even though it's flattering, I still appreciate it. So the vision that drives this work is the goal to turn the Web -- whatever new Web we will have in the near future -- into the world's most comprehensive knowledge base, into a semantic database that does not only know about Web pages and text in Web pages but about entities and relations between entities. And the approach that we are pursuing with the people that I'm collaborating with is a three-step procedure. So first is: get the knowledge, lift the Web pages into more explicit notions of facts and ontologies, and there are different ways of doing this. In this talk I will mostly talk about leveraging hand-crafted, high-quality knowledge, ideally in the form of ontologies, but maybe also looking at encyclopedias. But there are other ways: we can use text mining, natural language processing, statistical learning to go after the implicit facts that are embedded in natural language text. We are doing some of this, too, but it will not be presented in this talk. Or we can try to harness the folk wisdom, the wisdom of the crowds, that's implicit in social tagging and other things along these lines. Once we have the knowledge, the next step is how do we query it, and of course we want to raise the bar here as well, so we not only want to run simple QA-style queries but go after sophisticated things, and you will see examples. And it's searching but also ranking the results. Often you end up with a lot of results, and so ranking is absolutely crucial. And finally all that should be done efficiently and scalably. So why can't we do this with Google today? Suppose we run Google just limited to the Wikipedia domain; that's the closest approximation to my goals, but that does not work well, and here's just a bunch of examples where this Google-centric approach would fail. So in natural sciences, here's one example from life sciences: you are often interested in specific entities and then relationships between these entities. That theme also shows up in the humanities, so I'm working a little bit with people from the humanities; here the not-so-obvious connection between Thomas Mann and Goethe is that Thomas Mann wrote a novel in the early 20th century, (inaudible), which features Goethe -- who had been dead for more than a hundred years by then -- as a character in the novel. So it would be cool to figure that one out.
Then we have quiz-question-like things which are easy for a human to answer, because a human can get the bits and pieces from the Web with some effort, and they might even be structured -- birth dates, death dates and so on, pretty structured -- but the human still needs to find quite a few of these bits and pieces and connect them in the right way. So these joins, if things were in a database, would be trivial, but doing these joins directly over the Web is a pain. The answer, by the way, is unfortunately Max Planck, who died in 1947 after he had lost the last of his four children. And then some of them are actually simple queries, but they involve ranking over lots of results. And just to drive my point home I tried this one: which politicians are also scientists. You can vary the way you express this to Google. You can write it as a natural language question, because Google has some abilities to deal with this, and the results don't get better than this, right; this is mostly about scientists and politicians debating global warming, et cetera. And what's wrong here -- this is one of my favorite quotations, by Frank Zappa. The first line is the key point here: information is not knowledge. This is really raw information, these are raw assets, but it's not knowledge; it's not lifted to the higher level, and you can read here the further levels above knowledge, right. And the first line is very good; the rest I don't know, maybe he smoked too much grass at that time. (Laughter). And this is a screen shot from one of the prototype engines that we've built, called NAGA. Of course we're not good at doing (inaudible), and I will explain what's going on here. But we do ranking, and the ranking is pretty good: the data comes from Wikipedia, there are thousands of people that qualify, and what we get at the very top is Benjamin Franklin, who invented the lightning rod and also parts of the U.S. Constitution. Paul Wolfowitz is well-known, Angela Merkel is the German chancellor and she has a degree in physics and wrote a dissertation about physical chemistry. So this is about the best you can get here in terms of ranking. I don't have any real slides or an elaborate part of the talk on related work. This could be open ended; I could talk half an hour or longer only about related work. Essentially what's going on is I see three major areas which have been making great advances in the last few years, and we position ourselves essentially in the intersection of these. So of course Web search does look at entities, but in a limited way, driven by business opportunities: if it's about products, if it's about locations, then yes, of course they do some form of dealing with named entities. To my knowledge, they don't go far in terms of relations among entities. Information extraction has gone a long way based on text mining and other techniques, and for about a decade we've seen that ranking over structured and semi-structured data also makes a lot of sense. So here's the outline. It's essentially three systems and three papers. The first is about the YAGO system and the knowledge base that we've built this way; this corresponds to a paper we had in last year's (inaudible). Then it's the search engine and the ranking over this data, called NAGA, which appeared in ICDE this year.
And the last one is right now not yet integrated, but we are in the process of putting this underneath, as the engine underneath the other stuff; this is a very fast RDF engine coined RDF-3X, for RDF Triple eXpress, which is going to appear in this year's (inaudible). And by the way, you can drive me in terms of where I put emphasis and which parts I may have to skip. It's maybe a bit too much material. So when you have questions or when you think I should rush on, let me know, right. So now the first part is about building knowledge bases. How do we go about this? We could do text mining based information extraction, for example from this old Wikipedia page about Max Planck, and that would essentially move the text into a seemingly structured database in terms of records, so we can find the birth dates, birth places of scientists, scientific results, more details about these, maybe even combining some of what we see here with other sources. Not everything that you see on the right is actually written here. Collaborations and so forth. And the techniques for doing this are a combination of methodologies, so natural language processing plays a role. Pattern matching -- some of the low-hanging fruit is just regular expression matching, so birth dates are actually easy. Email addresses would be even easier. Statistical learning might play a role, so we need training material. In order to determine whether something should move from here to the structured side or not, we can also reject hypotheses. And some of that is boosted by semantic assets like lists of place names; for example, Kiel is a place name, so it would be in a dictionary of German cities. Jonathan? >>: I'm just wondering about the degree to which the discovery of the fields that you (inaudible), so for instance scientific result, and to what degree that is automatic versus (inaudible) scientific results is important. (Inaudible). >> Gerhard Weikum: Okay. Maybe the last part kind of addresses this, so let me make a few remarks. One is: this looks like a database, but it's kind of a different database. There's uncertainty. Usually we don't have confidence one in these extractions. There might be a few exceptions -- because birth dates are so successful we might have confidence close to one -- but certainly not in things like inventions and so on, right. And in the long run we need to quantify these confidence measures and later on take them into consideration when we reason over this knowledge base. So there's uncertainty that we need to carry along, right. Does this roughly address your concern? And also different techniques give you different confidence, right, and among the kinds of relationships there are some easy ones and some tough ones, right. This is certainly not among the easy ones. The second remark is that some of the techniques are very expensive, so we actually built a tool a few years ago which employs a dependency parser and feeds feature representations into a statistical learner, but the dependency parser, which is a deep natural language parsing technique, takes about a minute on a long Wikipedia article -- maybe not on this one because it's shorter. So it does not scale up. The other thing is, again coming back to the question or remark here, the confidence is sometimes much less than one, right, so you can sometimes really fish in the dark and get hairy fact candidates, and then you end up with the ugly problem of configuring all kinds of thresholds and parameters of your extraction machinery, and that's a black art, right.
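As a rough illustration of the pattern-matching end of this spectrum, here is a minimal sketch of regular-expression-based birth-date extraction with a per-fact confidence. The pattern, the sample sentence, and the confidence values are hypothetical, not the actual extractor discussed in the talk.

```python
import re

# Hypothetical sentence, e.g. from a Wikipedia abstract.
sentence = "Max Planck (born April 23, 1858, in Kiel) was a German physicist."

# A simple pattern for "(born <Month> <day>, <year>, in <Place>)".
pattern = re.compile(
    r"\(born\s+(?P<month>[A-Z][a-z]+)\s+(?P<day>\d{1,2}),\s+(?P<year>\d{4}),\s+in\s+(?P<place>[A-Z][a-zA-Z]+)\)"
)

match = pattern.search(sentence)
if match:
    # Regex-based date extraction is reliable, so we attach a high (but not
    # perfect) confidence; riskier, learning-based extractors would attach
    # lower values that later reasoning has to take into account.
    birth_date_fact = {
        "subject": "Max_Planck",
        "relation": "bornOnDate",
        "object": f"{match.group('year')}-{match.group('month')}-{match.group('day')}",
        "confidence": 0.95,
    }
    birth_place_fact = {
        "subject": "Max_Planck",
        "relation": "bornIn",
        "object": match.group("place"),
        "confidence": 0.90,
    }
    print(birth_date_fact)
    print(birth_place_fact)
```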
So although we have worked along these lines and we're still pursuing it, we at some point actually took a step back and wanted to at least produce a core of accurate knowledge that would not suffer from these problems, also with better scalability properties. So we looked at the available high-quality, hand-crafted knowledge sources, as opposed to arbitrary natural language text, and we wanted sources that are already closer to a knowledge base -- ideally maybe already in a logic-based representation like ontologies. But there are no super big ontologies available, so the closest approximation to an ontology that we've been working with would be things like WordNet, for example, which is kind of a thesaurus but can be turned into a lightweight ontology. And this one here -- this is a screen shot -- would tell you there's a concept of scientist and there are special cases, hyponyms or subclasses of scientist, so that might help getting the knowledge for answering the politicians-and-scientists query. Interestingly, WordNet also sometimes has instances; here it has Roger Bacon, an alchemist from the Middle Ages, but it doesn't seem to have Aristotle and many others -- I mean, this is just a totally arbitrary set of samples and it's very small. And that tells you something about the strengths and weaknesses of these approaches. It's very strong in the taxonomic relations, so it has lots of concepts, abstract concepts, and essentially is-a relations among them and part-of relations and things of that kind, but it lacks knowledge about individual entities. There should be thousands of physicists or scientists in this ontology, but there are not. The source for this other side of the knowledge, the individual entities, is Wikipedia. Now it looks superficially like we're back to text mining because of these entries, but we're not; if we look closely and carefully enough you see this is actually structured data. This is the so-called Infobox, and these Infoboxes appear pretty frequently. And this is the source code of the Infobox. It actually follows a template for scientists, and there are quite a few of these templates. There's one for pop bands, there's one for companies, so you see the CEO of a company, you see the drummer of a pop band, and so on, and here for example you see the (inaudible), right, and other interesting facts. So this can be harvested fairly easily. There are more things. Many Wikipedia articles are placed manually, but with community-based quality control, into a rich category system. So the Max Planck page, for example, is here in these categories, and even if we didn't get the birth and death year from the Infoboxes, which might be the case for some lesser known scientists, these categories tell us at least the year. They also give us instance-of relations, so we can figure out that Max Planck was a physicist, or a German physicist, but that's again a specialization. Sometimes we need to be careful: Max Planck is not a physics, right, he's a physicist; that's an important detail and makes a difference. So we did this systematically, we went systematically after the Infoboxes and the Wikipedia category system and built a fairly big knowledge base that we've coined YAGO, for Yet Another Great Ontology, and the way we represent the extracted facts is in the form of triples, two entities connected by a binary relation, right. So you can represent these in logic, if you wish, or in RDF. RDF is kind of a natural representation here.
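To make that representation concrete, here is a tiny sketch of what such facts look like as triples; the relation names and the Turtle-style serialization are illustrative, not the exact YAGO vocabulary.

```python
# Facts as (subject, relation, object) triples, YAGO-style.
facts = [
    ("Max_Planck", "type",        "physicist"),
    ("physicist",  "subClassOf",  "scientist"),
    ("Max_Planck", "bornIn",      "Kiel"),
    ("Max_Planck", "hasWonPrize", "Nobel_Prize_in_Physics"),
]

# The same facts in an RDF-like Turtle serialization (illustrative prefix).
for s, p, o in facts:
    print(f"y:{s} y:{p} y:{o} .")
```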
We not only harvested Wikipedia in this systematic manner, but we connected the extracted facts to the existing WordNet taxonomic backbone. This is maybe the not-so-easy thing: we tried to place every entity that we extract from Wikipedia into the right classes in the taxonomic space of WordNet, and when we invent new classes from Wikipedia category names, like German physicist -- this combination is not in WordNet, why would it be, right -- we need to map it into the taxonomic space and make sure it becomes a subclass of physicist, and then WordNet would also know that physicist is a subclass of scientist and so on. So this gives you a more complete but still very tiny picture of what we did here. It's publicly available. It's pretty big: it has close to two million entities and more than 50 million facts. Facts are always instances of binary relations, and it's much bigger than Cyc, for example, and SUMO and a bunch of other things that are publicly available. We are smaller than this other, concurrent endeavor to harvest Wikipedia called DBpedia, but part of the explanation is that we actually gave YAGO to them, so it's incorporated, right. And they have a lot more facts because they don't have that goal of staying mostly consistent and staying at a very high level of accuracy; they just form unions of things, so if they get redundancy they just add, add, add, and they create links. But it's pretty noisy, pretty redundant, and so they cannot properly reason about consistency. We're also in the process of giving this to the Freebase guys, which is a startup company in the Bay Area where the approach is a bit different. They would say, why should we just harvest what exists already; instead they get a big community to actually enter facts in the form of database records, so they are already entered in structured form. But here again the issue is how do you get a coherent whole as opposed to just a huge collection of detached bits and pieces, right? So I think by using YAGO as a backbone they should be in a much better position. We manually evaluated accuracy by extensive sampling, and the accuracy is -- this is a conservative figure, because we used Wilson confidence intervals and everything. And the errors that we actually observed are often errors that already come with Wikipedia. So I'm not saying this is deep science; it's not very scientific at all. The things we did earlier that did not work so well were a lot deeper and required more; this is picking low-hanging fruit with the right engineering. The extractor software has also been made open source now, and it runs pretty fast: in a few hours it processes all of Wikipedia and combines it with WordNet, so you could repeat it on a daily basis if you wish. But the engineering has its problems, so here are a few examples where you need the right engineering. It's non-trivial. One is finding the instances of the instance-of relation. This mostly comes from seeing an entity appear in certain categories in the Wikipedia category system, and, as I said earlier, there are good category names like Nobel laureates and bad ones like American music of the 20th century. Many singers are in those categories, but a singer is not a piece of music.
So it turns out a simple heuristic works pretty well: we run a noun group parser over these noun phrases, and if the so-called head word is plural then it's indeed a class; if it's singular we should not apply it. So physics and music are singular. And there are some meta categories like disputed articles where you just need to hand code, hard wire, that you should not fall into this trap. >>: How do you know physics is singular? >> Gerhard Weikum: Oh, these noun group parsers have that. >>: Okay. >> Gerhard Weikum: Something similar shows up here. Sometimes we actually learn about new classes, like composite or specialized classes such as Nobel laureates in physics, right. And we do capture this, right, we want to keep that, but we need to connect it to the right superclass in WordNet, and here too these techniques help a lot. So for example here we know that people is the plural of person, right? Parsers know this. Natural language tools are pretty good these days. And by the way, we use other people's tools for that, right? Then there are more things in the entity name space. Of course we cannot produce any miracles. We don't do entity name (inaudible) here, but we actually harvest the ambiguity. If there are five ways of referring to the same entity, we learn it from this input. So for example this one here points, via the Wikipedia redirection system, to one thing, and this one points to the same thing. They don't cross-reference each other directly, but by transitivity we know that they denote the same entity, namely -- a little test for the audience. >>: (Inaudible). >> Gerhard Weikum: What? >>: Isn't that the Andromeda -- >> Gerhard Weikum: Very good. I forgot to bring you some chocolate. Usually I give out little rewards to the audience if they work well. And in general we have a methodology here. It's a collection of heuristics and rules, right? But we have an overriding methodology that we coin reductive type checking: we are willing to get lots of fact candidates, but then we run them through a scrutinizing procedure. We want to make sure that the whole knowledge base stays consistent. So for example the taxonomic part should never become cyclic. And we do what we call type checks in this knowledge base sense. So when we see, for example -- for whatever reason, we also do text mining, right -- some text mining tool would give us a conjecture that Max Planck is married to quantum physics, because the sentence said Max Planck was married to his work, right? But now the marriedTo relation is a relation between two people, ideally of opposite gender, so we have a type violation and we reject that fact candidate. So we mostly build on first order -- Jonathan, yes? >>: You may have a (inaudible) co-authored with somebody who is also married to their work and therefore conclude that they were married (inaudible). (Laughter). >> Gerhard Weikum: Wow. So thanks for -- this is a new project for a new student. (Laughter). Nothing is perfect here. I mean, you're right. So we mostly build on just vanilla first-order logic, so these are just instances of binary relations. This goes a long way, but once in a while you come across the need for higher-order representations. Here we see Berlin is the capital of Germany, but we might also find Bonn is the capital of Germany, and no country should have two capitals. We can also add constraints, right; we could have logical constraints in the knowledge base. And the explanation is that these refer to different times.
So how do we represent this? This looks like higher order, because the validIn relation is now a binary relation between a first-order fact and some other constants. But there's a technique that's been around in this community for decades, and RDF uses this tool; it's called reification. We just give these facts IDs, and then we can use the IDs as arguments to what used to be higher-order facts. Just to make sure you don't think only Germany has problems with this: the world in the U.S., at least in California, is also not so simple, but of course it comes from an Austrian, so you can blame it on Europe. (Laughter). So if I have time, here are a few words now on where we stand and what we are doing now. This goes back to actually deeper scientific things. We believe YAGO knows all the interesting entities. It's not really true, but we could make it true. So if you are the mayor of some village, you better be in Wikipedia, and typically you are. If you're the drummer of some garage band and you want to make a career, you are in Wikipedia. Exceptions are computer scientists; Wikipedia is not strong in this regard, but we could systematically harvest DBLP or -- what's it called -- DBLife, and I mean there are sources for this. Then there are, for example, biochemical entities. Of course Wikipedia knows a lot of enzymes, proteins, drugs, diseases, but this is a big zoo of terminology. There are things like UMLS; the life scientists also do their homework and organize their terminological and taxonomic assets. So this could be leveraged and imported. So I think we're pretty much done in terms of entities. What we are missing is the coverage in terms of relations, both relation types -- there are so many interesting relation types that we don't know about -- and then also the instances of the relations. So we do go back to the text mining now. And our tool that we built a few years ago works as follows: it runs natural language sentences through a dependency parser and builds these graph representations, with cryptic tagging going on, but an expert can read this, and this is still a syntactic representation, but it comes closest, from a linguistic viewpoint, to a semantic interpretation if you stay within the context-free paradigm here. And we can use, for example, the shortest dependency path between the two arguments of interest as a feature representation for statistical learning, and if we are willing to do some training we could mark this as a positive sample for a learner and this as a negative sample, and then the learner could digest sentences like this, right, and would say yes, indeed, Paris is located on the Seine. Hopefully, right? But this is expensive, as I mentioned earlier. And this is why we actually did not pursue it as aggressively as we could have done two years ago. But I think we are now in a much better position to reconsider this, because YAGO as a backbone gives us a head start now. We can filter out many uninteresting sentences that we come across because they don't contain two entities that might possibly be related to each other, and we get a lot of this information from YAGO already. We can quickly identify the relation arguments, and we can do type checks: if we are after the runs-through relation, we better look at sentences where one argument, one entity, is of type river and the other is of type location at least, right.
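A minimal sketch of this kind of type-based filtering of candidate argument pairs, assuming a toy dictionary of entity types standing in for what YAGO would provide; the relation signature and the entity names are made up for illustration.

```python
# Toy type information, standing in for the YAGO taxonomy.
entity_types = {
    "Seine":  {"river", "entity"},
    "Paris":  {"city", "location", "entity"},
    "Goethe": {"person", "writer", "entity"},
}

# Type signature of the target relation: runsThrough(river, location).
relation_signature = {"runsThrough": ("river", "location")}

def candidate_pairs(sentence_entities, relation):
    """Keep only entity pairs whose types match the relation signature."""
    dom, rng = relation_signature[relation]
    for e1 in sentence_entities:
        for e2 in sentence_entities:
            if e1 != e2 and dom in entity_types[e1] and rng in entity_types[e2]:
                yield (e1, e2)

# A sentence mentioning Seine, Paris, and Goethe only yields (Seine, Paris)
# as a candidate argument pair for runsThrough; the other pairs are filtered out.
print(list(candidate_pairs(["Seine", "Paris", "Goethe"], "runsThrough")))
```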
And we could do more fancy things, like check that the river and the city are at least in the same constituent -- otherwise why would they be nearby? So these are things we are looking at now, ongoing work, no hard results yet; we're particularly focusing on time aspects. Because we have lots of interesting facts -- some of this is low-hanging fruit, like the CEOs of companies and things of that kind -- but it's time-evolving, so it's interesting to check out at what time points which facts hold. Okay. So something got messed up here in the order. We'll see. Okay. So the second part is the search. Now that we have this knowledge base, we also thought about how we search it, and then, whenever queries produce too many results, how we do the ranking. Yes, please? >>: I had a question about the information (inaudible), so it seems easier (inaudible) possible to extract (inaudible) from Wikipedia and other sources, but how do you know (inaudible). Are these time coded, updated automatically from some sources? >> Gerhard Weikum: So that's a very good question. Some of the relation names come just from the Infoboxes, right, so it says headquarters or something, or CEO, and so on, right. And we can infer then that this is a relation between a person and a company, and we name it after what we see in the Infoboxes. Otherwise we have to hand-craft a catalog of interesting relation names. And this is a bit unsatisfactory; on the other hand, I gave this some thought and I asked some pretty smart people, and what we are asking for -- it's a tall order. What we are asking for is the universal catalog of entity-relationship types. We all know this from school, right, but the world has more than just departments and employees, right? We're not talking about instances, only the types -- shipment and order and so on. So we want to do this universally. What are the interesting relationship types in the world? What's the best data dictionary or catalog of relationship types? And none of the people I've talked to, and some of them have been working on conceptual modeling for decades, had a good answer. So I had good hopes that, for example, something like Google Base would give me clues. It's not about relations but it's about attribute names. And I talked to Alon Halevy, for example, and he told me forget it, Google Base is all about used cars, right. And I would love to see, for example, interesting relations between people: once in a while I watch strange movies with complicated plots, and after a while I'm totally lost and I don't know who is actually the nephew of whom and who falls in love with whom -- well, this is the easy part actually -- who is jealous of whom, and so on, right. So these are all interesting relations, and that's only between people. So a very good question, but it's a tall order. Okay. Other questions? So then the search: because we like the graph-based visualization of the data, we also came up with a graph-based query language. But you can rehash this into some other representation. And some of these queries are actually pretty straightforward from a database viewpoint, so this is a vanilla conjunctive query -- an easy thing in SQL or any of the other established languages. In graph form, all we do is replace some of the graph nodes by variables, and then the semantics is: you find bindings for these variables from the data such that, after you replace the variables with the bindings, this becomes a subgraph of -- is seen as a subgraph in -- the data.
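A rough sketch of this subgraph-matching semantics over the triple view of the knowledge graph; the query asks for politicians who are also scientists, and the tiny data set and relation names are made up for illustration.

```python
# Toy knowledge graph as (subject, relation, object) triples.
triples = {
    ("Angela_Merkel",     "type", "politician"),
    ("Angela_Merkel",     "type", "physicist"),
    ("physicist",         "subClassOf", "scientist"),
    ("Benjamin_Franklin", "type", "politician"),
    ("Benjamin_Franklin", "type", "scientist"),
}

# Conjunctive query: ?x type politician AND ?x type scientist.
# (For brevity this sketch ignores subClassOf expansion; the real engine would
#  also follow the taxonomy, so Merkel would qualify via physicist -> scientist.)
query = [("?x", "type", "politician"), ("?x", "type", "scientist")]

def matches(query, triples):
    """Enumerate bindings of ?x that turn the query into a subgraph of the data."""
    results = []
    candidates = {s for (s, p, o) in triples}
    for binding in candidates:
        grounded = [(binding if s == "?x" else s, p, o) for (s, p, o) in query]
        if all(t in triples for t in grounded):
            results.append(binding)
    return results

print(matches(query, triples))   # -> ['Benjamin_Franklin']
```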
So we have some additions which are no longer that straightforward; here the Thomas Mann-Goethe relationship question would actually be addressed by, or formulated with, a big wild card. And we call these relatedness or connectedness queries. So here's a bunch of entities: tell me how they are related, do they have commonalities, do they interact in some way? And the answer would mostly be the labels of the paths, in this case even a single path, that connect the entities. And that's a special case -- just a second -- a special case of having regular expressions in the language. And here we don't have a good characterization yet of what this actually entails in terms of expressiveness, complexity and so on. But nevertheless we wanted to have this, so this is one way of reasoning about the strong German universities, and now you see it has little disjunctions here, it has a star because the locatedIn hierarchy can have variable depth, and so forth. Jonathan? >>: I was just thinking that with the connectedness query we had something that (inaudible). If there are multiple paths how do you sort of pick which one? >> Gerhard Weikum: I'll come to that. I'll come to that. The ranking is a big issue here. Right now this is just the query and we get result sets, and then the ranking is actually very important. And we can also have queries over these facts, right, and there's also a linear syntax for this. So now we come to the ranking, and here I messed up my slides, so I need to go back and then forward again. I messed up the order of slides for whatever reason. So now I ask a query like: Fisher is a scientist, what else is Fisher known for, right? And maybe there are many Fishers. There are indeed, right? And we run this without ranking, so we get exact results, but the order in which we get the results back -- and there are thousands of them -- is arbitrary. And this is the top result: Ronald Fisher is an alumnus of this college in Cambridge. And when you go a little bit further in the ranking -- so Ronald Fisher is a good result. Everybody knows -- well, who was Ronald Fisher? He invented maximum likelihood estimation. He is probably the most important statistician of the last century. And then there are two unknown Fishers. Why would they be at ranks two and three? They just happen to be, right. Then there's Ronald Fisher the theorist, Ronald Fisher the colleague -- very important property -- Ronald Fisher the organism, and then Ronald Fisher the entity, and so on, right. And so this is wrong; I mean, this is not what we want. This is the flawed ranking. But unless you have ranking, this is what you get -- ranking is greatly underappreciated. I think it really must be a first-class citizen here; it cannot be an afterthought. And indeed we have developed a statistical language model for this graph-based representation which computes better rankings, namely it gives us Ronald Fisher the mathematician, the statistician, the president of the Royal Statistical Society -- these are encoded IDs, so you have to follow the relationships in order to find that out -- then Ronald Fisher started his career doing crop experiments, and then, to give significance to them, he had to invent statistics and change the field, and so on. So it's pretty good. So what are the criteria -- and one of them is the answer to your question, Jonathan. There are three big dimensions for the ranking criteria.
The first one is confidence, because even though we have been driven a lot by this YAGO extractor, which is high accuracy, we should have other extractors as well, and then the accuracy or confidence in the extracted facts varies widely. And there are two subdimensions to this. There's the certainty of the extractor: I mean, if you use a risky learning-based technique with very little training data, your confidence cannot be as high as, for example, extracting email addresses by regular expression matching. And the other dimension is the authenticity and the authority of sources -- actually it's two subdimensions, right, these are different things. And you see this illustrated just by examples. A straightforward sentence, Max Planck was born in Kiel, from a high-authority source which never lies, should be taken almost for granted. Whereas if you see this weird sentence, they believe Elvis hides on Mars, in this strange blog, be careful about this. >>: (Inaudible) assign the authority (inaudible). >> Gerhard Weikum: This is the big picture here. So we have pragmatic implementations of this. Now we can break this down to do this or that. One of the easy things is you go by PageRank, right. So authority is easier than authenticity. Authenticity is about telling the truth, right. As opposed to -- I mean, it can be a high-authority source about jokes, right? Okay. So the second dimension -- and I will elaborate a little bit on that one, right, because I think it's the most interesting one. We have implemented something for each of those, but there needs to be more work, right, to make this systematic. So informativeness is about telling something interesting. Don't tell me Fisher is an entity. And ideally this would be a subjective criterion -- in fact, the word is a concept in linguistics and cognitive science, where it would mean: tell me something that I didn't know before. But then you need to personalize it; you need to know something about the user, and for the time being we don't have this. So for the time being, we would say: tell me something that most people find interesting. So it's really driven by frequency statistics of different kinds. But we are working on personalizing it, and then it could mean: tell me something that I didn't know before. And it's illustrated again by example. Suppose we are asking what's Einstein known for. Most people would prefer an answer like Einstein was a scientist rather than Einstein was a vegetarian. Of course, if you are yourself an accomplished physicist, maybe you do prefer the second one. But without knowing anything about the user we can't do this. Now, you cannot just precompute relevance or informativeness of facts, because this is context dependent; it depends on the query. You take a slightly different query, give me important vegetarians, and then this fact, which should rank low here, should rank high there, right? So there's no easy solution to this. And I will explain in the next slide how we do it. The third dimension is what Jonathan asked before: results should be compact in some form. If it's just a path between entities we prefer short paths over long ones; maybe the path labels and how dominantly they are used in the overall knowledge base might play a role, and things of that kind.
And sometimes we want to connect more than just two entities, and then we actually look at compact graphs; this leads to (inaudible) computations unfortunately, so we're also biting into a computationally expensive bullet here. But we're doing work along these lines. Again, illustration by example. So we ask how Einstein is related to Bohr. This is a good answer; inferring that Tom Cruise is also a vegetarian and was born in the year in which Niels Bohr died is kind of weird, right, so it should rank lower. But you need ranking to get this out, right? So how do we do the informativeness part of the ranking? We follow best practice in information retrieval, which is based on statistical language models. Now, these are generative probabilistic models, so the rationale is that you have a bunch of documents and each document is viewed as a probability distribution over observable features, for example words or normalized terms, and often you might postulate a parametric form for this, for example a multinomial distribution. And then the document itself can be used to estimate the parameters. But maybe you want a background corpus for smoothing the parameter estimation. Now, the query is treated as something that would be generated by a document. At first glance it's kind of an odd rationale, but this is how it works. You pretend the query is a sample from the probability distribution of this document, or this document, or that document, or that document. And you prefer in the ranking those documents for which the likelihood of actually observing that query is highest. So this is the model. And then of course you can use Bayesian arguments to reverse everything, and then it looks a bit different. So we applied this too, but we don't just have text documents. Now, some of these language models have been carried over to attributes and records, but not to relationships, to my knowledge. So this is something new we did. And often you run into both sparseness issues for parameter estimation and also computational tractability, so, just like many other people, we factorize our probabilistic models into independent components, and here the unit of factorization is essentially one edge. So the smallest meaningful subgraph is one edge -- it's labeled, and its two endpoints are the entities that together constitute the fact. And now these would be the things that generate queries, and this is a query now with a variable. So in some sense we're asking: given this bornIn(Goethe, Frankfurt) fact, what's the probability of generating this query? Or, in other words, you can give it some intuitive meaning: if a user asked this query and were presented this fact, would the user be satisfied? I happen to be born in Frankfurt as well, right, so there's a good test case, but I would not claim that I should rank higher than Goethe, right. So I'm the other candidate for generating this query. And then there are some bells and whistles here which we're going to skip. And I'm going to explain by example how we then estimate these query likelihoods. In the end it boils down to simple correlation statistics. So you might say, well, very easy and very simple, but the framework is much more powerful, because we could now generalize it and take more things into consideration and stay within the same formal apparatus, which is nice. So let's do this by example.
So here these two facts could generate this query, so what we do is we estimate the probability that the answer part, which binds to the variable $x -- namely myself or Goethe -- appears in some corpus given that the other, the input constants, appear. So it's about: what's the probability of seeing Goethe given that you see bornIn and Frankfurt, or what's the probability of seeing me given that you see bornIn and Frankfurt. And then for different query types, depending on how many variables we have -- we can have two variables in a meaningful way -- things change a little bit. Now, where do the actual estimates for these conditional probabilities come from? According to the book they should come from your own corpus, but that might easily be misleading: you could do this on the knowledge graph itself, and if you happen to have way more physicists than vegetarians, then because of the way the conditional probabilities are defined, that would actually give a higher rank to the vegetarian fact, right, because the is-a-vegetarian count appears in the denominator here. And we actually tried this; it doesn't work well, because no knowledge base is perfectly balanced and it doesn't have redundancy, and redundancy helps a lot in statistics, so we actually go to the Web in this case, or some sample of Web data that we precompute, and do the estimations there. You saw this: we did some systematic user studies with question answering queries and some queries of our own. We compared against a bunch of competitors. You quickly get into an apples versus oranges comparison here, but at least you're not comparing apples against T-bone steaks, right, so the results have to be interpreted very carefully and I will not elaborate on this. The main point is that some techniques do well -- actually Google does very well on many of the question answering things. Yahoo! Answers, which matches new questions against manually answered existing questions from a big catalog, is lousy. This is a question answering system done by MIT which also utilizes Wikipedia; this is why it's interesting to compare with. This is keyword graph search; here we actually use our engine but take their scoring model for the ranking. And this is us, so we are doing pretty well. Questions so far? >>: I just want to interpret the numbers on the previous slide. So in terms of the tasks that you had (inaudible), you gave them like a broad question, you told them somehow research the answer to the following question, then you (inaudible) whatever. >> Gerhard Weikum: No, no, no. Well, indirectly, yes. This is why I said it's apples versus oranges. Because they take different input, different ways of taking the input, they also have different (inaudible), right. So we tried various ways of formulating the queries for each of them. We did -- and "we" is the pool of people that we're working with. So we tried to be as good as possible for each of them, right. For our system there was typically one way, one canonical way of formulating the query, but here you have tons of ways, you're right. But we tried. We gave them a fair chance. We tried lots of ways, and this is the outcome. So take this with a big grain of salt. Don't overrate it. Five minutes? >> David Lomet: 15. >> Gerhard Weikum: 15. Okay. So there's plenty of time. So the last part is now about efficiency. And it's not yet integrated, right. YAGO can be downloaded, both the data, the knowledge base, and the extractors. NAGA is kind of prototypical.
We have a Web demo but it's not very stable; sometimes it's awfully slow, sometimes it's fast. Most of the queries on the previous slide can be answered in a few seconds, right, even with NAGA and the ranking. But sometimes we run into big problems. So we also gave some thought to efficiency here, because we're using a knowledge representation model that is close to RDF, so we got into RDF, and also because last year's best paper in (inaudible) won the best paper award with an engineering exercise on RDF. So I felt like we could do better, and indeed, together with one of the people in my group, Thomas Neumann, we did something better. So why RDF, and why can't you just reuse your good old SQL engine for this? Why RDF is pretty clear. I mean, here's another of these knowledge graph excerpts, and the edges pretty much correspond to RDF triples. In RDF terminology these would be called subject-property-object triples, and there's lots of ID encoding going on, and strictly speaking some of these must be URLs or URIs and so on, so I'm using simplified syntax here. Now, why can't you just reuse some engine, or is it at least a good healthy exercise to rethink architectures? Well, RDF lends itself pretty nicely to this new paradigm known as pay-as-you-go dataspaces. You just enter data, and in fact this is probably why biologists like it more than, for example, XML, or why they don't put everything into a relational database first. You don't need to worry about schemas; maybe some schema evolves, but then you can quickly change the data or group it or whatever. Maybe it never evolves. That's also fine, right. So it's schema-last, if ever, right? And this is a nice paradigm that we like. The triples form an entity-relationship graph, but because of the triples this is very fine-grained. There's no notion of distinguishing attributes of an entity from its relationships; they are all properties. So this blows up the whole thing, and syntactically you don't have an easy way of grouping things into more manageable units. As a result, the queries are big joins; often you have lots of star joins but also long chains of joins along relationships, and therefore the physical design is very critical. But if we are in this new world of dataspaces, or supporting scientists, or supporting high school kids browsing and discovering things in a knowledge base, there's no workload predictability. You cannot just say last year's query mix looked like this, next year's mix will be like that. So it's totally ad hoc, right. So the language that people advocate -- I don't like it, but still it's set up as a World Wide Web standard -- is called SPARQL. It's pretty much select-project-join combinations encoded in so-called triple patterns. Everything with a question mark is a variable that can be in the place of a subject, a property, or an object. Then you see the relations like isBornIn and so on, and you see some constants. And the dot in between these triple patterns is actually a conjunction, so it entails a join in relational jargon. There are some complications. In particular, the relations can also be variables, which just makes it interesting.
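For illustration, here is roughly what such a query looks like when run through a generic RDF library; the rdflib usage is just a convenient stand-in (it is not the engine discussed here), and the data file, prefix, and relation names are hypothetical.

```python
from rdflib import Graph

g = Graph()
g.parse("yago_excerpt.ttl", format="turtle")  # hypothetical small excerpt of the knowledge base

# Two joined triple patterns, plus one pattern with a variable in the
# predicate position: find people connected to Kiel by any relation.
q = """
PREFIX y: <http://example.org/yago/>
SELECT ?person ?rel ?city WHERE {
  ?person y:isBornIn ?city .
  ?city   y:locatedIn y:Germany .
  ?person ?rel        y:Kiel .
}
"""
for row in g.query(q):
    print(row.person, row.rel, row.city)
```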
It means you don't care so much about the schema, so for example here we would say some person is related to some town and the town to some country, but we don't say exactly by which relations -- whether the person was born in that city, died in the city, lived in the city, knows someone who lives in the city, if that were one relation, or this is the capital of some city that once belonged to this country, and so on. So if you unfolded this into SQL it would be a huge union, and then for each of the cases you get long, big joins, right. So this is a pain. This is what prompted the Abadi paper. To some extent I'm repeating the arguments from last year's paper. And then there's some typing, of course, which so far we think we know how to address, but we haven't done it in the engine so far. Now, coming back to physical design, there are different prevalent approaches. This RDF thing has been around for a decade and everybody just smiled at it; only recently have people taken it more seriously. The oldest approach is probably that you put everything into this one big table with only triples and then you need to operate over this, which for all the queries entails lots of self-joins with this big table. So Abadi said this really doesn't work, you cannot do this. But it's our approach, so you will see why it works. Then Abadi actually advocated this thing here: you group the triples by the same properties, and you do this in the extreme form so you end up with the maximum form of partitioning, and then, because the property is encoded in the table name, you have almost a column store. So, no big surprise, they actually use a column store for storing and indexing it and doing the query processing. There are things in between which are closer to a conceptual entity-relationship model, but they are difficult to handle; in particular they face a big physical design problem, and I haven't seen big success stories on this. So our engine -- well, I'm glad (inaudible) is here, so finally I found a case for the RISC-style engine paradigm that (inaudible) and I formulated in a vision paper in (inaudible) 2000. RISC here really means reduced complexity, in the sense of the original RISC processors. And the rationale is, well, first of all, let's build an RDF engine, a kernel for RDF and nothing else, not RDF plus XML or an RDBMS with whatever else. Simplify all operations. The SPARQL language, when you look at its core, is pretty simple. Reduce implementation choices: all the joins are pretty much merge joins. Optimize for the simple and frequent case, and radically eliminate tuning knobs, so we don't have any physical design here at all, right, because it's the same for all possible data. And this is, from a bird's-eye view, essentially the solution. The first part is a standard engineering trick: don't bother with these long literals, URLs, and strings; encode everything into integer IDs, and then you're dealing with fixed-length ID triples all the time. And then this giant triple table approach is actually not so stupid, and you can afford to index it exhaustively. So overall we build 17 indexes, and that's good for every possible RDF database. Because we're dealing with these fixed-length triples, compression kicks in great. So it's wonderful. The query processing works almost exclusively with merge joins.
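A toy sketch of these two ideas -- dictionary-encoding literals into integer IDs and keeping the triples sorted in several permutations so that scans and merge joins always see the order they need. Only two of the orderings are hinted at here, and all names and data are illustrative.

```python
# Dictionary-encode strings into compact integer IDs.
dictionary, reverse = {}, []
def encode(term):
    if term not in dictionary:
        dictionary[term] = len(reverse)
        reverse.append(term)
    return dictionary[term]

raw_triples = [
    ("Max_Planck", "bornIn", "Kiel"),
    ("Kiel", "locatedIn", "Germany"),
    ("Goethe", "bornIn", "Frankfurt"),
    ("Frankfurt", "locatedIn", "Germany"),
]
id_triples = [tuple(encode(t) for t in triple) for triple in raw_triples]

# Two of the six permutation "indexes": triples sorted in SPO and in POS order.
spo = sorted(id_triples)                                # (subject, property, object)
pos = sorted((p, o, s) for (s, p, o) in id_triples)     # (property, object, subject)

# With a sorted order available for every permutation, joins can be done as
# merge joins over the shared component, e.g. join bornIn(?x, ?c) with
# locatedIn(?c, Germany) on the city ?c.
born_in, located_in, germany = dictionary["bornIn"], dictionary["locatedIn"], dictionary["Germany"]
birth_cities = sorted(o for (s, p, o) in spo if p == born_in)
german_cities = sorted(s for (p, o, s) in pos if p == located_in and o == germany)

i = j = 0
while i < len(birth_cities) and j < len(german_cities):
    if birth_cities[i] == german_cities[j]:
        print(reverse[birth_cities[i]], "is a German city someone was born in")
        i += 1; j += 1
    elif birth_cities[i] < german_cities[j]:
        i += 1
    else:
        j += 1
```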
We also have hash joins in the repertoire, but they're hardly ever needed, and because of this big simplification the query optimizer also has a lighter task, so we can afford exact dynamic programming based traversal of the search space of execution plans, and this goes up to 20 to 30 joins, so we always find the plan within much less than a second, right, within less than a hundred milliseconds. We also have some new techniques, semi-new techniques, on the statistics for the data. So now on the indexing: the literature on RDF is strange. You really see where these people are coming from; they are really dreaming of the semantic super strong thing, but when it comes to systems they cannot count one, two, three, right. So you have these triples -- subject, predicate (or property), object -- and there are six permutations of them, so you can have six indexes that make sense. And the literature has proposals like: we should have an SPO index plus OSP plus PSO, and then they stop, right. So why have only these three and not all six, right? So we actually build all six exhaustively. They are all directly in the leaves of a clustered B+-tree. The data could actually be thrown away; everything is in the indexes. And the compression is so good that the total size of the indexes is less than the original data. Now the beauty here is that this means, for whatever scan or join we encounter, we always have the right ordering of the three components. We do even more. So these are the first six indexes. Now, the next six are these: we also index all binary projections of these triples in all six possible permutations, and the missing component is replaced by a count aggregate. And this comes in very handy when you have to deal with duplicates. Sometimes the SPARQL semantics forces you to actually produce a bag result with duplicates, and we never need to carry these around. We only work with these indexes and then have the counts available, and as we join we can actually (inaudible) the counters and only carry around the (inaudible) of the result, as opposed to this duplication. And we also do this for unary projections, so this is six plus six plus three, 15 indexes, and then we have the mappings between the literals and the IDs. So these are the 17 indexes. And you never, ever need anything more, right. So there's no materialized path or view indexing, et cetera. Now, on query optimization, as I said, because of the simplicity we can afford essentially exhaustive plan enumeration in dynamic programming order. My co-author, Thomas Neumann, has done a lot of work along these lines and knows how to do these things, right. We measured this for more than 20 joins, and we find the cost-optimal plan within 100 milliseconds. Of course it's optimal relative to the cost model; if the cost model is wrong, you may not have a good plan. And on that latter aspect, we build just standard histograms. It turns out we can quickly compute them from these count-aggregated indexes. What we do is we successively merge neighboring entries in the indexes in greedy order and approximate an equi-depth histogram. We do so until the structure is small enough to be called a histogram, so that we can keep it within whatever size limit we give it. This is done for all six orderings as well. And in addition, now, for SPARQL we thought a bit about where the biggest error in the (inaudible) estimation is, and it seems that this is in these long join paths.
So you essentially start with one entity and you want to connect it to some other entity over five relations or properties. But we make a big error here only if we actually have lots of results for this. So that would mean that this path here, P1, P2, P3, P4, P5 -- these label sequences must be frequent. If they are infrequent, making an independence assumption doesn't hurt you much, right, because you underestimate anyway. Jonathan? >>: Do you have a sense for, sort of, if there is such a thing, a typical query and how long these join paths are -- >> Gerhard Weikum: No, no, no. Well, you will see our benchmark in a minute. Other questions? >>: So if you have a set of facts this would work, I think. Now, if you throw in the schemas as well, then in terms of (inaudible), because everything else is (inaudible) something and this is-a class, so everything is connected to each other. So (inaudible) self join, you know, whatever (inaudible) you have roughly, maybe two of these (inaudible). So I'm wondering -- >> Gerhard Weikum: We would not store it as a million tables. We are pretty convinced about our storage and indexing scheme. So if you impose a schema, it's a conceptual thing. >>: (Inaudible). >> Gerhard Weikum: Oh, this particular technique here might at some point become useless, but it's not expensive, so unless -- I mean, you could also materialize these paths, right, and the Abadi paper uses some techniques along these lines, too. But that means you get into physical design, so then actually building the materialized view, maintaining it as you get updates, and so on -- there's a cost. Here there's hardly any cost at all in building this synopsis, and the worst that can happen is they don't help a lot. But they will be accurate. I mean, if you have super frequent paths -- in fact, in the original benchmarking data that Abadi used, which we also used, you have this situation -- then they just tell you the truth, yeah. I mean, certain queries do become more expensive, but still, for the queries that we are looking at, we're doing fine. I don't see the explosion; even if you have a million tables you would not have joins between a million tables, right? So the join paths would be somewhat bounded in length, right? The synopses may be useless at some point, right, but they don't hurt, right? In fact, we have to have a fallback. When these are precomputed, they would be exact if you didn't have any selections in addition to the joins. But of course you have these. Now, you cannot precompute each and everything, so what we do then, if there are additional selections -- and typically they come along with joins on properties that play the role of an attribute, which is why I named them A; they are also only properties, and there would be a constant here -- is that we chop this up into the largest possible pieces such that for every piece we have a selectivity estimator in our repertoire, and then we postulate independence and do the usual arithmetic. This goes a long way. So, benchmarking. As you know, there are (inaudible) and benchmark assumptions. But we went broader than the Abadi paper. So this is just the setup. We had two opponents, essentially reimplementing the Abadi approach, but not using C-Store but MonetDB, which Abadi himself recommended because C-Store is no longer maintained as open source, and on this machine we could get MonetDB to run easily, whereas C-Store would have been a pain.
And we emulated one of the more traditional approaches of dealing with triple stores for RDF, layering it on PostgreSQL, right. So we're following a bit what Abadi did. We had three data sets: this Barton library catalog is what Abadi used. Then we applied it to our own knowledge base -- this is why this whole work fits into the theme of the talk. And we experimented, just for the fun of it, with an excerpt from a social tagging platform and actually made the tags property names, which is maybe odd and maybe not a smart way of doing it, but it gave us a data set with very different characteristics, because this had more than 100 thousand different property names as opposed to these, which have like hundreds, right. So we wanted something else. And then you see some of the queries. This is from Abadi's benchmark; he actually cheated a bit, he phrased the queries in SQL, and SPARQL doesn't have the kind of aggregation operators that he actually used. So we implemented these too and really followed his rules. Then on YAGO we used queries of the kind that you have seen before; on the social tagging data we used things like this, so we played a lot just with the tags. And we had a set -- we asked people in the group, obviously. I mean, this is an ad hoc benchmark, but in the scheme of things we think we got some insights. And this could easily entail 20 joins, right? And here are some of the results. In terms of load times and database sizes, the systems were roughly comparable. MonetDB seems to create indexes on the fly, so after load time it looks a lot smaller; once you run the benchmark, it's a lot bigger than our data. This is for the three data sets -- each of them came with like a dozen queries or so -- and this shows you the geometric means of run times for both warm and cold caches, and you see that we are almost always between one and two orders of magnitude faster than the competitors. And interestingly, of the two competitors, neither dominates the other, so you also see that the Barton experiment that Abadi did is one specific setting, right; it does not easily generalize to lots of other things, so here the order is just reversed, right. And the social tagging data posed yet other characteristics. So in conclusion, I told you about this big vision and mission that we're pursuing in my group: lift the Web's most valuable raw information assets to a level of explicit knowledge in terms of entities and relations between entities. We're pursuing this knowledge harvesting theme. I talked mostly about the semantic approach, going after encyclopedic and ontological sources. We do pursue text mining as well, and also a bit of mining social tagging information; of course this is (inaudible) and also interesting. I see big potential synergies in combining these approaches, but I also see big problems. How do you deal with consistency when you use hybrid machinery? In general there are tradeoffs between consistency and uncertainty. I mean, you might go for consistency, but then you throw away too many uncertain facts and you suffer in coverage. And an important issue is long-term evolution: today's knowledge is out of date tomorrow, people correct wrong facts in a knowledge base, knowledge changes. How do you tell which of these situations applies?
So if you wanted to maintain a knowledge base with time annotations over a long time, up to some time horizon, you run into lots of additional problems that we haven't even understood yet. In terms of ranking, I think we did something interesting. We're in the process of expanding and generalizing this, and especially making it a personalized ranking approach. And likewise, on the efficiency and scalability issues I think we did an interesting piece of work, but there's a lot more to do. Thank you. (Applause) >> Gerhard Weikum: More questions? >> David Lomet: Jonathan? (Laughter) >> Gerhard Weikum: Okay. Let's take them offline. Thank you. (Applause)