Information-Rich Programming in F# with Semantic Data

>> Evelyn Viegas: Good morning everybody, it's a real pleasure to have here with us Steffen Staab from the University of Koblenz-Landau in Germany. Steffen is a professor for databases and information systems and he's cited among the top 10 researchers at WWW and Semantic Web conferences. Today he's going to talk about some work he has been doing with Don Syme from Microsoft Research Cambridge on using semantic technologies to help make sense of data. More broadly, Steffen has also been working on using semantic technologies to support software modeling, but I won't go into what we and his team have been doing, because he's about to talk about it himself. I'll just mention that we came back from POPL a couple of days ago, where we had a workshop on data-centric programming and where there was a lot of interest in these approaches of bringing more tooling to help developers first make sense of data. So with that, Steffen, please.

>> Steffen Staab: Thank you very much for your kind introduction, Evelyn. The work we're going to talk about includes a couple of people from my team; it was initiated by Evelyn and Don to look at how we can bring some of these semantic web technologies and semantic web data structures into F#. Among the people from my team there is Martin Leinberger, a PhD student who is also the [indiscernible], and I must frankly confess I'm a little too far away from the code base to show this in a competent manner myself. Okay, so our point of departure is this quite famous picture of the Linked Open Data Cloud. For those who have not seen it: who has seen it before? I guess many will have? No, not so many. Okay, so what does it symbolize? Each of these bubbles stands for a data source, and some colleagues from Mannheim, Chris Bizer and his team, have collected how these different data sources (each one of these bubbles holds hundreds of thousands or millions of fact triples) are connected to other data sources, following this linked-web paradigm. Many of them are connected to DBpedia, and DBpedia is the fact base derived from Wikipedia, the stuff that also drives knowledge graphs and other kinds of approaches. The interesting thing is that it's like the web, but not on a document basis, on a fact basis. What is also interesting is that this data has a lot of different data structures, a lot of different schemas, and some of it actually has very little schema, just the facts. So it's very heterogeneous, and by now it amounts to billions of RDF triples, and even some famous database researchers, for example Gerhard Weikum in his SIGMOD blog, say this is actually a nice kind of environment to try your approaches if you deal with heterogeneous data, and if you think of big data not only in terms of massive amounts of data with a simple schema, but in terms of a lot of variety in the data. So I will use this as a kind of motivation for my talk, though it's very clear that when I talk about understanding new data sources that come in, it need not be this kind of linked data that so many web people look at; it could also just be a relational data source that you encounter where you don't know the schema and that you have to deal with.
So, when we look at these different bubbles, each one representing an individual data source with millions of facts, you see in detail that they cover very different domains. For example, here you have stuff from PubMed for the medical domain, or DrugBank, the Gene Ontology, this kind of thing. Here you have a column which is about publications, from the ACM or from the DBLP server, which provides a bibliography, or from our partner, the Leibniz Institute for the Social Sciences, with whom we collaborate very closely; they publish metadata and all kinds of data about scientific literature in the social sciences. Here you have geographical information. Here you have the New York Times, which is publishing a lot of linked data about the articles they have, just putting them on the web and making them accessible. The BBC is doing the same kind of thing, so you have this media domain which is very active, and then you have these generic fact bases like DBpedia, but also Freebase, which is now part of Google, or YAGO, which was done by Weikum and his team. Before I go into a few more of the foundations of the Semantic Web, I have to explain the very simple foundation that we have there. What is this very simple foundation? We talk about data, and data for us simply means that you have some subject, and this subject is represented by a URI. A URI is nothing else than a globally unique identifier, and you can take this identifier and ask: please give me more information about you. Such subjects are related by a specific predicate to an object. This object can be another URI, or it can just be a string or an integer. That's the very simple data model behind RDF: extremely simplistic, much poorer than relational databases, but at the same time a common denominator for publishing data. And for this kind of data you can describe a schema. That schema has classes, it can have class hierarchies (you will see a very trivial example on the next slide), and you can describe how different classes are connected by predicates, where a predicate has a certain domain constraint and a certain range constraint. RDF is very simple; you don't even have cardinalities. If you want to bring those in, you need a more sophisticated standard called OWL, the Web Ontology Language. Okay, here is a very simple example, a little bit trivial but I think nice enough to tell you the story. So what is the example? Here RDF Resource is a kind of top class that includes everything, and then you have a subclass like creature. All these URIs come with namespaces, as you know them from XML, so you can really distinguish, let's say, a creature that I describe in my example from a creature as it would be described in some biology domain or any other domain you like. We always abbreviate them here just to make it more readable, but you could mix them completely: you could take some URIs coming from the Microsoft domain, some from a biology domain and some from my example, and mix them all together into a giant global graph. And here, for this class creature, you have several subclasses like dog or person, and you have attributes like hasName and hasAge, and we could continue here and describe what the range constraints are.
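To make the triple model concrete, here is a minimal sketch in F# of what was just described, a node being either a URI or a literal and a fact being a subject, predicate, object triple; the example.org URIs are invented for illustration.

```fsharp
// A minimal sketch of the RDF data model just described (illustrative only;
// the example.org URIs are made up). A node is either a URI, i.e. a globally
// unique identifier, or a literal such as a string or an integer.
type Node =
    | Uri of string
    | Literal of string

// A fact is a triple: subject and predicate are URIs, the object is a Node.
type Triple = { Subject: string; Predicate: string; Object: Node }

// Example fact: Hasso has the name "Hasso".
let fact =
    { Subject   = "http://example.org/Hasso"
      Predicate = "http://example.org/hasName"
      Object    = Literal "Hasso" }
```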
For example, the name would come as a string or the age would come as an integer; we've left that out of the picture here. And then here you see another example where between dog and person there is a relationship, a hasOwner relationship. Its domain is dog: a dog may have an owner, which is a person. This is all the schema level. And here you also have the data level, where Hasso is a dog having an owner, which is Bob, who is a person. Okay, a very simple example, nothing fancy about it. When I talk about this example, please always keep in mind that this might directly be an RDF data source, or maybe you take an established mapping of, let's say, a relational data source into this very simple framework. There is for example the standard R2RML, which was standardized about one and a half years ago by the W3C; it is a W3C recommendation, and in the meanwhile there are very efficient mappers from the relational world to this triple world. What I mean with very efficient: some colleagues like Juan Sequeda from Texas have execution engines where it basically makes no difference whether you ask at the triple level or at the original relational level. So you can really map efficiently between these different worlds. The interesting part is that for the largest part of this talk I will assume that I have a nice schema description. As I told you for the Linked Open Data Cloud, that's not always true; some of the data does not have a schema, but that's for the very end of my talk, where we will talk a little bit about how we can deal with the situation when we don't have an explicit schema description. And also, to keep it simple, I don't talk about different schema languages; I just talk about this schema language and this kind of data level. Okay, and that describes the agenda of my talk. I first want to talk about, given these assumptions, how we can bring these data sources into the F# language and Visual Studio: how can we find out what is in an unknown data source, out there among the hundreds of data sources in this Linked Open Data Cloud, or one that you encounter just by integrating another part of your company, another data source that you have not dealt with before. In the second part, I want to go a little bit into a very preliminary user evaluation. I say very preliminary because it has a couple of flaws that we are aware of, but I think it still shows a certain tendency for what the approach does well and what it doesn't do so well yet. And eventually, if I still have time, I don't know, let's see, I want to talk a little bit about our indexing approach, where we look at this huge cloud and need to find a particular piece of information: we want to find something about creatures, about dogs or persons. Of course this kind of information may be spread all over the place, not just in a single data source, and initially we may not know where it is. Then we have to look at querying such that we find the corresponding data sources and then explore them, integrate them and program against them. Okay. So let's start with the first part, where we encounter some unknown data source. Just pick your favorite one; you want to explore this data source and program against the data that's available there. I think that's not such an uncommon scenario if you do something like integrating different databases.
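As a reference point for the rest of the talk, the toy schema and data just described could be written down as plain triples roughly like this; the ex:, rdf: and rdfs: prefixes abbreviate namespaces, and the ex: namespace is hypothetical.

```fsharp
// The toy example as plain (subject, predicate, object) triples.
// Schema level: class hierarchy plus domain and range of hasOwner.
let schemaTriples =
    [ "ex:Creature", "rdfs:subClassOf", "rdfs:Resource"
      "ex:Dog",      "rdfs:subClassOf", "ex:Creature"
      "ex:Person",   "rdfs:subClassOf", "ex:Creature"
      "ex:hasName",  "rdfs:domain",     "ex:Creature"
      "ex:hasAge",   "rdfs:domain",     "ex:Creature"
      "ex:hasOwner", "rdfs:domain",     "ex:Dog"
      "ex:hasOwner", "rdfs:range",      "ex:Person" ]

// Data level: Hasso is a dog owned by Bob, who is a person.
let dataTriples =
    [ "ex:Hasso", "rdf:type",    "ex:Dog"
      "ex:Bob",   "rdf:type",    "ex:Person"
      "ex:Hasso", "ex:hasOwner", "ex:Bob" ]
```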
And for the sake of making it a little better understandable, I won't go so much into the formal definitions that we have in the tech report, but rather show it more by an example application. The example application here may be, taking the tiny toy example I gave, that you want to collect dog license fees and want to send email reminders to the dog owners. We assume that we have this nice graph again that I've shown you before, and you would now have to do a couple of tasks to write this very simple program. The first task that you have to do as a programmer, as I indicated here in the title, is to really explore the schema and find out what is in this data source. Yeah? So you really assume you don't know how the data is described there, you encounter it, and you want to find the types that represent the data you're interested in. Of course you see directly on this slide that this may be person and dog, for example, but in general, when you approach a data source, you just don't know whether it's person or human, whether it's dog or canine, or however the formulation might be. So you really need to look around. Classically you might do this first task with a kind of browser, or if you work in the semantic web you might write a SPARQL query; SPARQL is another W3C recommendation that has been around for a couple of years now, there is at least a second version of SPARQL by now, and there are different kinds of engines that support SPARQL and support the RDF data format in the background. You can ask questions like: give me all classes for which there is a subject of that class, so with such a query you find all the classes around that have instances. You can do that, but of course it's not so easy to formulate the queries to explore this kind of unknown data source; you easily get lost. And if you are in a situation where you start here, because that's the top class of all the different classes, it's not so easy to navigate through by asking these queries, because every time you want to do a refinement you have to completely change the query. So in a naive approach, you would still have to do it like this. And then, when you have found out, okay, I'm really interested in the dog and person RDF types, then of course to program against them you somehow have to mirror these types in your programming environment. So basically what you have to do is take these types and provide code types. The terminology always overlaps here, so I try to speak of RDF types for the data source and, when I talk about F# for example, of code types, because both are of course type systems, but one is in your data source and the other is in your programming environment. You need to establish both type systems, and you need to map between them. So after you've identified these as the types you're interested in, you have to establish these code types in F#, for example. You might say something like: a creature is a certain class, and it has certain attributes like hasName and hasAge, and we then have subtypes like dog, which would be another code type in F#, or person. They inherit all the attributes from creature, and in addition they may have further attributes like a hasOwner property or maybe some tax number for dogs. Okay, and that's the second task.
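A sketch of what the naive versions of tasks one and two might look like: a generic SPARQL query to discover which classes actually have instances, and hand-written F# code types mirroring the toy schema. The type and member names simply follow the example above; they are not what LiteQ generates.

```fsharp
// Task 1, naive exploration: ask the data source which classes have instances.
let discoverClasses = """
    SELECT DISTINCT ?class
    WHERE { ?subject a ?class . }
"""

// Task 2, naive code type creation: hand-written F# types mirroring the RDF types.
type Creature(uri: string, name: string, age: int) =
    member _.Uri     = uri
    member _.HasName = name
    member _.HasAge  = age

type Person(uri: string, name: string, age: int) =
    inherit Creature(uri, name, age)

type Dog(uri: string, name: string, age: int, owner: Person, taxNumber: string) =
    inherit Creature(uri, name, age)
    member _.HasOwner  = owner
    member _.TaxNumber = taxNumber
```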
So the programmer now basically has to write this down, which is actually very much a copying task, because the properties are not really new; they're already given in the schema of the data source. Next, of course, comes data querying. You want to query, say: I'm interested in all dog owners, because I want to remind them to renew the licenses for their dogs. Okay, so what do you have to do? Well, once you have your persons and dogs, you are interested in persons that are owners of dogs, so in a naive approach you may ask a SPARQL query and just find these owners by writing this kind of query. Although possible, be aware that in general, if it's an unknown data source, you don't really know what these relations look like or what their names are. Here they even look nice and short, but in general these are very long unique identifiers in the semantic web, so they may have long namespace-qualified names if you encounter a real data source, not just such a toy example. Okay, so you write this query. And once you have the query, you then have to instantiate your objects from this data source, manipulate them and program against them. So here you have to develop the functionality around your query. Typically in F# you would have to formulate a query string, which would be, for example, the SPARQL query I've shown you on the previous slide, and then you could evaluate it and iterate over the results to create persons and send email reminders to them. Those would be the four steps that I would identify here: first, exploring and understanding what's in the data source; then creating your types in your programming environment, the code types; then querying your data according to that; and finally developing the functionality around it, which is almost like turning these data instances into code objects. All quite nice and well, but a little bit laborious. And the idea of LiteQ was to say: let's try to put these different tasks together and provide a kind of framework that just makes it easier for the developer to handle them, with fewer different tools, less machinery and less boilerplate code than we have so far. Behind this, as you may already recognize, is the idea of LINQ and of type providers; you will see some of these ideas shining through, and we will have a comparison with regard to these kinds of approaches later on. Now, the core idea for supporting these different tasks is to have a query language which we call NPQL, the Node Path Query Language. It allows you to traverse the graph, this RDF graph that contains information about the schema and information about the data. The nice thing about RDF is that it hardly distinguishes the two, so they are not really completely separate: all of the schema information as well as the data information is encoded in triples, and you can ask for these triples using SPARQL. NPQL is, you could say, syntactic sugar to formulate your SPARQL queries in a way that makes it easier for a developer to go through those different tasks with the same kind of syntax, instead of having completely different syntaxes all the time. For this language we have then developed three different kinds of semantics. First, you want to just go through this graph; that's the exploration part.
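Stepping back for a moment, the naive F# version of tasks three and four described above would look roughly like this; runSelectQuery and sendReminder are hypothetical stand-ins for whatever SPARQL client and mail code a real project would use.

```fsharp
// Tasks 3 and 4, naive version: query string, evaluate, iterate, act.
let dogOwnersQuery = """
    SELECT DISTINCT ?owner
    WHERE { ?dog a ex:Dog .
            ?dog ex:hasOwner ?owner . }
"""

// runSelectQuery and sendReminder are placeholders passed in as parameters.
let remindDogOwners (runSelectQuery: string -> seq<string>)
                    (sendReminder: string -> unit) =
    dogOwnersQuery
    |> runSelectQuery          // URIs bound to ?owner, as plain strings
    |> Seq.iter sendReminder   // e.g. look up each owner's email and send the reminder
```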
But then you want to query for data, and for that we have a typical extensional semantics; no big surprise here, that's just what you would expect. And then we also thought: well, we not only want the extension, that is, all the data that falls under the pattern, we also don't want to duplicate the types that are already in the data source. So what we then define is an intensional semantics for basically the same kind of queries, which gives you back the type descriptions. You don't have to reinvent these type descriptions, because the query just retrieves them, and then you can put them together with the extensions; I will show you in a minute how this works. And once you have that, you can not only use it for nicely writing your code, you can even make suggestions to the developer about what to write next. That's what I call the autocompletion semantics: when you're writing these queries you are supported all the time with suggestions of what could be filled in at a particular point in your query. So let me again show these different kinds of semantics and tasks by example. The first thing would be to explore the graph. You start, let's say, with RDF Resource, then you have an operator here for subtype navigation, and you go to creature. So you change from this context to this context; you go from one node in the graph to another node in the graph. This can become a little more complicated: you can also start at a different point, a different node in the graph. So you would start here with dog and then navigate along a property; not very surprisingly, navigating over hasOwner you land at the person node. And now we can define what the extensional semantics of that are. Assume we are, for example, here at the dog node; we have navigated there from the top class down to creature, down to dog. Having by this means selected dog as our current context for evaluation, we can walk through the hasOwner relationship with this dot operator, and then we can just extensionally evaluate what this query means. The query looks almost like an XPath query, just with slightly different operators. Then you say: I'm now interested in all the nodes that fulfill these conditions. If you think in terms of having been at dog and having navigated via the hasOwner relationship to person, you use the extension to retrieve all persons who own dogs, because that's the context from which you came, and you end up here with Bob, because Bob in this example is the owner of Hasso. Okay, so we explore this path and then we say we're interested in all the objects that fit the description built up while walking along the graph. Now it becomes interesting.

>> Audience: When you said that there's not much distinction between data and schema, does that mean that if my query had been instead, say, resource to creature, and then I follow that subclass, I would get all schema elements that are subclasses of creature? Or do you have to get down into the data before you can [indiscernible].

>> Steffen Staab: The extensional semantics here always looks for the instances. What you could easily do here is meta-modeling; in fact RDF has some meta-modeling approach. So creature, for example, is a subclass of resource, but it's also an instance of RDFS Class.
Yeah, so you could walk from resource to RDFS Class and then say, with an extensional query, give me the extension of RDFS Class, and creature would belong to it. So RDFS has this meta-modeling kind of thing. And thanks for asking a question; I forgot to mention initially, please interrupt me any time when you're interested in a particular explanation, I think that's better. Yeah, so that's easily possible. And see here again this kind of query: when we look for the extension, we can do the same kind of query, navigate to dog and then to the owner. That's the same query as before, and it basically means we're interested in some persons here, like Bob. But now, if we ask for the intension, we do not look for Bob, but rather for the intensional description of the nodes down here. So this expression no longer refers to the instances like Bob who are owners of dogs; because we look at the intension, it looks at how we can describe the type of these instances, and obviously this type has all the attributes of person. This means that with the same kind of query expression we get the instances, and with the same kind of query, where only the last operator is different, we get the schema description of these instances, which we may then use very nicely for creating our F# type description. So we can now say in F#, using this kind of query, I want to describe... whoops, sorry, that's just the old version; it should actually show the person class being returned here, and I have not updated this properly. And then what has been retrieved according to the intension would just appear here. The point is that once you have formulated this query, you can build it into your environment and say: please give me the objects and type them at the same time. And you can use the same kind of thing for the exploration. So here we have a query that says I want to go from resource to creature, from here to here, and then it depends on what kind of operators you use. I've only given a partial account of the operators here, but let's say you use subclass navigation: this expression here would not be a complete expression according to our query definition, because it still expects another class description, or class term, I would rather say. But what we can do now is that, based on the context up to here and based on this operator, we can suggest appropriate next terms to fill into this place. So what we basically do at this point is define the autocompletion semantics and make suggestions, and we make suggestions depending on the operators. There are some operators that look for instances; Martin will show you one of them in the demo, and those instances we derive using the extensional semantics of this expression plus some further information. Or we can look at types and properties and derive the properties or types that would be appropriate in such a situation; here it would be the types, according to the intensional description of the preceding query expression. Okay. So here, obviously, when you look at creature, we may look at what the subclasses of creature are and suggest both dog and person as appropriate subclasses. And these are suggested to the programmer who writes his code and completes his query, because they are the direct subclasses.
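To summarize, here is an illustrative rendering of the same path expression read under the three semantics, evaluated against the toy graph. The operator spelling is paraphrased; it is not the actual NPQL grammar.

```fsharp
// Illustrative only: the path  Resource <subtype> Creature <subtype> Dog . hasOwner
// read under the three semantics discussed above, on the toy graph.
type Semantics =
    | Extensional       // evaluate to the matching instances
    | Intensional       // evaluate to a type description of those instances
    | Autocompletion    // evaluate to suggestions for the next query term

let exampleResult semantics =
    match semantics with
    | Extensional    -> [ "ex:Bob" ]                         // persons who own a dog
    | Intensional    -> [ "ex:Person (hasName, hasAge)" ]    // their type description
    | Autocompletion -> [ "ex:Dog"; "ex:Person" ]            // suggested after "Creature <subtype>"
```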
That was just one of the features; here is one more to give you a little bit of the flavor of what LiteQ is about. It's a slightly more complicated query, not a lot more complicated but a tiny bit, and what it says is that if you start at dog, we may be interested not just in all dogs, because there may be dogs without owners and we cannot collect a license fee for them if they don't have owners; we can only collect license fees if there are owners. So we can restrict the set of dogs to those dogs that have owners, and this would be the corresponding query for doing exactly that. If some of you are familiar with description logics (are some of you? Have you encountered description logics before? It's a subset of first-order logic), this is very much like a concept description in description logics, where you can take such expressions and ask for instances of such a complex concept, and where you can also do query subsumption. That's what we're aiming at here. And of course in this context the dogs that have owners would include Hasso; if there were another dog without an owner, it would not be returned here. It's a very restrictive form of querying. It's not giving you full SPARQL support; it's very much focused on traversing the graph and restricting the set of nodes that you're encountering. I would call it a left-associative conjunctive query. You cannot build arbitrary queries this way. In SPARQL you have the full power of all kinds of conjunctive queries and more; we don't have that here, and that's by design at this point in time, because we really wanted to support the developer in writing these kinds of queries, suggesting to him appropriate schema information, properties, classes, or at the data level even instances, so he can complete his query and work with that. So we didn't target a full query language like SQL, LINQ or SPARQL, but rather a subset. So we now have the exploration task, and you've also seen how it is supported with autocompletion. We have the type definition task using the intensional semantics, we have the query task supported by the extensional semantics, and the fourth task that I mentioned before in the naive approach was to create objects, manipulate them and make them persistent. You want to get the objects from the data source, work with them, and then also give the results back to the data source. For that you have to develop functionality around the query, and if we look at how we can do this here, it would be, first, some boilerplate code like including our LiteQ mechanism, and you have to define the data source. In the future we will think about how to go from one data source to another in the Linked Open Data web, but right now you have to indicate where exactly this source is. And then you write your query here: given this data source, you write a query starting from creature, navigating down to dog, navigating to hasOwner. The operators look a little bit different from what I've shown you in the slightly more formal part, but they do exactly the same. So what you get here are all the dog owners, and then you can just take these dog owners, iterate over them and, for example, send them an email reminder.
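A self-contained sketch of the shape of the code just described. The Person record and the queryDogOwners function are stand-ins defined here; in LiteQ itself the type provider generates the types from the intension and translates the navigation from creature to dog to hasOwner into SPARQL against the configured endpoint, using its own operator syntax as shown in the demo.

```fsharp
// Stand-in for the type that LiteQ would derive from the intensional query.
type Person = { Uri: string; HasName: string; HasAge: int }

// Stand-in for: given a data source, evaluate Creature -> Dog . hasOwner extensionally.
let queryDogOwners (sparqlEndpoint: string) : seq<Person> =
    // In LiteQ the navigation is translated into a SPARQL query against the
    // endpoint; here we just return an empty stub so the sketch compiles.
    Seq.empty

let sendReminder (p: Person) =
    printfn "Dear %s, please renew your dog license." p.HasName

// The results come back already typed (Person, plus the Creature attributes
// it inherits), so no manual casting or re-declaration of the schema is needed.
queryDogOwners "http://example.org/sparql" |> Seq.iter sendReminder
```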
So the nice thing here is that you directly get the extensional semantics, and in this assignment you also assign the types to these dog owners. They are not just arbitrary data objects: these objects are typed with at least person, and also with the hierarchy above person, including creature. So in this expression you can also use the information from the creature class, which has the name attribute, for example, or the age attribute. We have a preliminary implementation of this approach supporting F#. This is the website. It's really preliminary, so it's not anything like production code; it's good enough to look at the basic idea. We know quite a few things we still have to do to make it usable in a productive environment, which will take us a couple of weeks, but it's good enough to take a look at and see the principle behind it. I would now suggest we switch over to Martin, who will show you a demo of the system as it is now. So we'll switch over the screens and also hand over the explanation to you. Oh yes.

>> Martin Leinberger: So what we will do is actually query the data source. A very simple first query would be to query for all dogs, which looks like this. In this case we actually start one step below RDF Resource, because RDF Resource is always there, so we start at creature, and the data source behind it is exactly the same as the schema we've shown you before. So we go to creature, we do a subtype navigation and choose dog, get the extension, and then we can do something like print out the names of all dogs: get hasName, there we go. And if we run this, we hopefully get a list of all dogs that are in our data source. Right now, as Steffen mentioned before, we are just using RDF Schema; that's why you will always get basically a list of possibilities back, like here a sequence or list that contains the strings Hasso and Bello, because you can't really restrict the number of triples in the schema, so we have to assume there could always be more. So like this you could get all dogs, and then you could also go to the individual level. You could say: show me all individuals of dog, and luckily there are only two in our data source. So let's choose Bello if we want to actually get an object, so we can work with Bello, and we can now also print out his tax number.

>> Audience: I find it very interesting that dogs are taxed on this.

>> Martin Leinberger: So, that's a literal German translation. We will have to look it up; it should probably say license fee number or something like that, I don't know. It's a German example. [laughter] Or a false friend among the different words. All right, so the tax number of Bello is 1234, so we can now go ahead and change that. And you see this tax number property is suggested because we know the type of Bello; that helps us here to show the right kind of properties. So again, it's typed information.

>> Audience: Have you thought about having some operators that would be less [indiscernible].

>> Martin Leinberger: Yes, we have definitely thought about it. This was just a first iteration. We were excited to get started, and this was the first thing that came to mind.
>> Steffen Staab: One thing: the first idea was also to have the kind of operators that I showed you before on the slides, but the implementation uses the type provider mechanism in F#, and there we currently have to misuse this kind of notation. Because the dot notation already means something in type providers, we are not completely flexible in choosing the grammar; if we could choose the grammar arbitrarily, we would go for something closer to the version I had on the slides.

>> Martin Leinberger: Actually we're trying to look into DSLs in F# in the future and trying to explore a bit in this direction, but we're also open to other suggestions, so if anybody has a good idea we would be happy. Yeah, so we just changed the tax number of Bello and this will be persisted in the store, so if we run it again, it will still be the new tax number. In the background, basically all these queries and manipulations are translated into SPARQL queries and SPARQL updates, so we don't care what kind of store runs in the background as long as it is SPARQL compatible.

>> Audience: Would you show... I didn't know where dog was in the hierarchy. Do you have a way to search the data context for any...

>> Martin Leinberger: Right now, in this implementation, you would need to go down the hierarchy. Right now we default to starting at creature, which is a huge limitation, and in this implementation the only good thing is that the autocompletion can help you in finding your type.

>> Audience: All the individuals under creature, like Bello, exist under creature, that individual...

>> Martin Leinberger: Yeah, you could also do that but...

>> Audience: It doesn't go upward, it doesn't go up the class hierarchy.

>> Martin Leinberger: No.

>> Steffen Staab: Although we could do that; actually, in our first draft we had operators for that, but then we wanted to keep it simple first and get a first version going. I mean, it's still the middle of the project, it's not finished.

>> Audience: You know, I just think about when people get data, the first thing I want to do is see it, right.

>> Steffen Staab: Yeah, absolutely.

>> Audience: When you're going to see it, select star, you know... That exploratory nature is what...

>> Steffen Staab: Yeah, okay. And the next thing we would have to do there is limits; you can do limits in SPARQL like you can in SQL, so you don't get like millions of individuals, because that doesn't make sense, right.

>> Audience: [inaudible]

>> Steffen Staab: Yes.

>> Audience: So this is a [indiscernible] that you're running against, which is the type provider framework, and the underlying data, which is huge.

>> Steffen Staab: Yes.

>> Martin Leinberger: I mean, in the end it's always problematic. If you queried DBpedia and tried to list individuals, you would just get overwhelmed by the result.

>> Steffen Staab: And you actually have more than one way to arrive at dog, because you can arrive there by going down the hierarchy, but you can also cross-navigate. Like you've seen with person: we could navigate from creature to person, but we could also go to dog and ask who the owners are, and so we'd also arrive at person. So there are different possibilities to arrive at the same types, because you really have a graph. That's also a major difference to the usual type provider approaches, where you rather have a tree in your exploration.
So the tree basically goes down, for example, a hierarchy, and then you may have some data at the leaves of the tree. That's the typical way the Freebase type provider works, for example. But here we have this graph, which also makes it a bit more complicated to arrange the interaction with the autocompletion engine. Okay, all right. So we switch back to the slide deck. I have here a comparison with different kinds of approaches, which I think also tells you a little bit about how we would position our approach. First, LINQ. Well, LINQ is very good at querying for data and then also manipulating the objects. You have to define your types yourself in order to use them and work with them in these program expressions, and of course you don't have the schema exploration part. But for working with the objects and querying, you have the full expressiveness there. For LiteQ we have two columns: one column is what we plan to have in the end, and the other is what we have now, which is of course not yet the full thing. Even in the end we will not have full SPARQL, for example, so it will not be as powerful in querying as SPARQL or as LINQ. But that's the trade-off, where we really want to support the user in writing queries and be able to type them automatically. If we don't require this automatic typing, we can allow more complex queries. If you don't want the autocompletion, or if you do autocompletion in a very different way (that's another thing we are discussing now, how to better support autocompletion for SPARQL), then we would not have this kind of restriction. Right now we only have a subset of the operators that we have in our concept. And if you look at something like the Freebase type provider, you have some sort of data querying, but it's much, much more restrictive than even what we have now in LiteQ. Okay, here we have two kinds of type providers. Of course it depends on how the individual type providers are implemented, so it's a little difficult to say the type provider mechanism is like this or that, because it really depends on the implementation. So we have indicated two different type providers here, one for XML and one for Freebase. For example, the schema exploration that happens in type providers was also sort of the motivation for us to do this kind of exploration.

>> Audience: Sorry, just want to make sure we have apples-to-apples comparisons here. You said LINQ. What exactly are you referring to? Just the query API level inside the runtime? Or are you referring to trying to have connectivity to an actual persistent store?

>> Steffen Staab: I'm not sure where you draw the line there. I mean, what you have with LINQ is the possibility of doing type inference, because you don't have query expressions as strings, you really have them as code in your program, and you have of course all the possibilities for doing the querying in your program code directly.

>> Audience: When you say LINQ, what is the data source?

>> Steffen Staab: Well, it depends of course what you query, whether it's XML or whether it's relational.

>> Audience: Which is why I'm wondering what the column actually means. For instance, if instead of LINQ you put in Entity Framework as a LINQ provider, then you get number two and you possibly get number one.

>> Martin Leinberger: Okay, I was not aware of this, but we should look it up. Thanks. That's called Entity...?
>> Audience: Entity Framework.

>> Martin Leinberger: Entity Framework. Okay, I was not aware of that. I was thinking here of XML and relational, no? Yeah, for the XML type provider you have schema exploration, but it's restricted to trees, for example. The information in Freebase is actually not restricted to trees either; that's why we've indicated it's not the full exploration of everything. For code type creation, our current version uses erased types. These erased types are used at design time for doing the autocompletion, but then they're thrown away, and that has some negative consequences for our current implementation: we don't have the full possibility to switch on the type of an object at runtime. That's something we would like to have. So that's where we want to go, having real, full types in the full hierarchy, which is not possible with the erased-type mechanism as it is now. Here I was not quite sure, so that's why I put the question mark, but we believe it also uses erased types; we did not do a full investigation. And for data querying: the Freebase type provider basically asks for the instances of a particular class. If you have the country class and you get all its instances, you would of course get all the countries, but not much more. And once we have full types, not just erased types, we can also do new object creation. Currently we can't do that. What we can do right now is really take the objects, as Martin has shown you, and manipulate their attributes, but what we would also like to do is say: okay, here are these types like dog from the RDF data source; make me a new object and also make it persistent in the data source. That's currently not possible, but I think once we have the full types that should be easy enough to implement. So that's a rough comparison to some of the related work. There's a lot to do now for LiteQ; the current implementation is a prototype. I just mentioned the erased-type problem. We need to do some optimizations in the code to have full lazy evaluation, which we currently don't have. We also want to further analyze which types are really needed just at design time and which are really needed at runtime, so we can sparsely add types to the programming framework. At least now we don't throw all the types into the DLL, but I think we can do some more optimizations there and make a finer distinction. What we also still need to do, and have not yet done because it's really just the middle of the project, is to look precisely at our query language and its expressiveness, and how it compares to other approaches. In particular, one thing that I think is really interesting, which goes a little beyond our usual scope because we're mostly at home in this data modeling world looking at RDF: we had quite a few projects where we did modeling with UML class diagrams, meta-modeling and this kind of thing, and we could use description logics to explain a lot of that and of the queries. What I think is really interesting is then to look at the programming world, with the type inference you have in F# or in ML, and see how these kinds of types that are derived ad hoc, where you have dogs that have owners; these are kind of description logic types that are then put into a lattice of different types in the data source description.
How would this mechanism interact with the type inference you have in your programming world, where you derive new types? I think it would probably be [indiscernible] behavior how these different lattices interact. This could mean that new objects you have created can be placed at a lower point in the hierarchy in the data source, or you may have a more refined static analysis of your types in the programming world, because you have this whole type hierarchy that you take from the data source. So it's a little bit of speculation here, but that's one of the research aims we have. So let me come to a little bit of evaluation of LiteQ. We've thought about how we can evaluate it, and it's really a tentative evaluation; I would not want to say it's good enough to submit to a software engineering conference, for example, but it indicates a little where the problems are and also where the advantages are. First we looked at how we can evaluate LiteQ at all, and determined that the process as a whole is a little hard to evaluate, for various reasons: it's hard to see what the counterpart would be, and we also had practical problems. For example, we used people from the institute, and we didn't have many test subjects who know F#, or even just functional programming, very well, so it would have been quite difficult to come up with a fair evaluation. So we decided to rather compare only NPQL against SPARQL, with the hypothesis that NPQL with this autocompletion allows for effective query writing while being more efficient when compared against SPARQL. That of course means this is a very focused evaluation that does not use some of the advantages of LiteQ, because it goes only into the exploration part and the data querying part, and not, for example, into reusing the queries for programming as in task four, where you would write code functionality against them. So we had 11 participants, although we afterwards eliminated one subject because he was not able to handle SPARQL at all; he was able to handle NPQL but not SPARQL, and we thought it too unfair to include his times. So we still had 10 subjects remaining for analysis. They were students: undergrad students, PhD students, a few postdocs. The setup was a pre-questionnaire about their knowledge to get a classification, a little bit of training in RDF and SPARQL in case they needed it, and also in NPQL, which of course was new for all of them. Then we had some tasks for them to solve, and a post-questionnaire about what they liked and what they disliked. So here is a little insight into the participants. We looked at their programming skills: all classified themselves as intermediate programmers or better. Object orientation: eight of them would say they are intermediate or better. Functional programming: only four would say they are intermediate or above, and four would say they have no knowledge of functional programming at all. When we looked at which functional languages they knew, it was mostly Lisp and Haskell; F# was mentioned once. And we looked at .NET, how well they know the framework and are able to work in this environment: we had one expert, two beginners and seven who had not encountered .NET yet. And then we looked at SPARQL.
There were three individuals who said they had intermediate knowledge or above for SPARQL, and we classify them on the next slides as SPARQL experts; seven had indicated less than intermediate, so we classify them here as SPARQL novices. We had a training phase, about a 20-minute live presentation by Martin and another PhD student, Stefan Scheglmann. We showed two SPARQL queries and let them work with the environment to write SPARQL queries themselves, so they got just a little bit of ad hoc training, just five minutes, and also five minutes for NPQL queries, writing them in the Visual Studio environment, so they would basically know how to operate the corresponding tools. There were nine different tasks to solve. We split the participants into two groups: one group would do half of the tasks in SPARQL and the other half in NPQL, and the other group was just reversed, so we would not have a bias toward one or the other kind of task. Half the tasks were done using Visual Studio, the other half using SPARQL in a web interface; you have actually seen the web interface before in some of the slides, a very simple web interface. The task types were really there to see how well you can navigate and explore the data, corresponding to the programmer's task one of finding out what is in the source. The source was very small: we had just about 50 facts. There was no big task yet where you had thousands of schema elements and tens of thousands of facts or so.

>> Audience: Was that intentional? I mean, I guess, was it because you couldn't find a source, because a big source was...

>> Steffen Staab: No, finding a big source is not a problem.

>> Audience: Oh, okay, so why didn't you use a larger source?

>> Steffen Staab: This was a kind of preliminary trial. There is still work we have to do on optimization to deal with large numbers of facts. We will have that, I'm not concerned about that part, but at the time we did the study it was not yet the case. Then there was retrieving and answering questions about the data, task type three. We also included two tasks that were intentionally not solvable with NPQL. I mentioned to you that some tasks we just cannot do, because you cannot count, for example, and we wondered what people would do and when they would back off and say, okay, I cannot do that. So that was intentional by design. And we took the durations to task completion. Okay, so here is, I think, the interesting part. Again, don't take these numbers too seriously. It was still a pre-trial; it was not fully formalized, so just know the numbers lie a bit. But the tendency was really that the SPARQL novices appreciated the support by NPQL. The SPARQL experts did not benefit so much when you compare this number against that one. There was a little bit of slack because they were starting to talk about the advantages and disadvantages, and it was not a fully formalized trial yet; that would be the next step. So these are rather tentative results, and we have to be a little careful about these numbers. But it shows you a little bit the effect that people who are really familiar with SPARQL would not gain quite as much, while for the others it was way simpler to deal with. And the unsolvable tasks really gave people problems; just to find out that you can't solve something, you need to understand what you can do and what you can't do. Here is an evaluation per task, for the nine different tasks.
The grey ones were unsolvable, and you see the average is typically a little bit better for NPQL overall, but there were also more novices here than SPARQL experts.

>> Audience: How do you define completion of an unsolvable task?

>> Steffen Staab: We really just took the duration. In the next trial we will have to do a cutoff at five minutes or so.

>> Audience: I see.

>> Steffen Staab: And eventually, what they did: there was for example a task, on a small fact base, where we asked how many dogs there are. They could just return all the dogs and count them by hand; that was allowed. But they could not directly say, count the number of dogs for me, which you can do in SPARQL. And of course that's only really possible with a small fact base, not when you have half a million dogs.

>> Audience: Or enough time.

>> Steffen Staab: Yeah. [laughter]

>> Audience: I'm curious whether they recognized that it was not solvable with NPQL.

>> Martin Leinberger: Some of them did. Some of them said, okay, I cannot count, so I'm going to write here "not solvable". Others just took it as they encountered it, without any comment.

>> Steffen Staab: And the point of course is, you can always make the language more complex, but at some point you will lose this functionality of getting the types. So that's a trade-off point which we still need to explore and understand better in a formal way. Then in the post-questionnaire we asked: do you really want to explore a data source in your development environment? Four of them said yes, three actually said no, so maybe it's not for all of them; maybe some of them would rather say, I don't care about that. And three didn't have a clear preference.

>> Audience: As opposed to what?

>> Steffen Staab: Well, you can of course just have a browser, a browser where you wander around and get a nice graph; there are browsers like that around. And then you just switch to your IDE and program it. I think part of that was also because the IDE was really too slow. I think when that becomes faster, the numbers will also change a bit, but still it's remarkable that some didn't like it or didn't need it so much. NPQL is easier to use than SPARQL: well, some of them would agree with that. But one needs better support when writing SPARQL queries; basically there is no good SPARQL query writing support right now, and we're looking into this question as well, because that would also help, of course. It's not so straightforward, because there are so many dimensions in which you can continue a query once you've written one line, and it's not obvious at all how to do that. If you have a large set of queries that people have already asked, then of course you can do recommendations, but if you don't have that, it's not quite clear what to do. And then better response times; that's one of the tasks we have to work on. My tentative conclusion is that LiteQ is still in pre-alpha status. We see some advantages, and once we handle things better, I think some of those results will become stronger. Okay, so how much more time do we have? 20 minutes? Okay, so I can talk a little bit about my third part. We now have this framework for looking at an unknown data source, exploring it and querying against it, but the problem of course may be how to find this data source in the first place, right?
And that's what the last part is about: SchemEx, constructing an index such that you can determine where to find certain information, and also inducing a schema if you only have facts, which happens quite often with these semi-structured data sources. So here is a typical example. You're interested, for example, in documents that appeared in some conference proceedings and were written by a computer scientist, and you have a SPARQL query for that. So that's a SPARQL query: you're interested in some x of type document and of type in-proceedings, where there is some creator that is a computer scientist. And you have the Linked Data Cloud, and you want to find out where such a piece of information can be found, right? What we do is compute an index, and then we can answer this question, which for example would say: here in the Linked Data Cloud there is an ACM data source mirroring the bibliographic metadata of the ACM library, and there is DBLP as another data source. There are some more, but these are two that you could ask. So in order to query them, you need schema information; you need to know where data conforming to this or that schema can be found. And we look at two types of schema information. First there is explicit schema information: an entity may be directly assigned to a class. In the example, I assigned this variable to the class document and to the class in-proceedings; that's an explicit assignment. But what you will actually see quite often, and there are a couple of papers out there now in the semantic web community that have looked at this, is implicit information: quite often there is more schema information in how entities are related to each other than in the assignment to classes. It's not always clear, though; there's variance in there. In some data sources the class information really dominates; in other data sources there is more information about the schema just in the data items and how they are related to each other. So the schema-level index that we build is for entities, and these entities may be related to particular classes as types, they may be related to other entities of other classes, and they may be related to strings, integers and other kinds of data types. And the idea of course is to say: we have certain structures, and we map these structures to determine that in data source one, or two, or whatever, we find entities that conform to this schema structure. So what do we do? One thing is type clusters. Type clusters are combinations of classes that appear together. So we may have, let's say, a president who also was an actor, right? You would have a type cluster for this kind of entity, and you would then say, for such a type cluster, where among your hundreds of data sources you have entities that fall into two or three or four of these classes. Here we are interested in documents and in-proceedings, so we are interested in that type cluster, and we find that DBLP and ACM both have entities falling into it. Then we have the entities themselves, and we say two entities are equivalent if they refer via the same properties to equivalent other entities. And what we compute here is a 1-bisimulation.
If you have a little bit of background in XML schema induction, people like [indiscernible] computed DataGuides in order to come up with a schema description for XML documents that don't have an explicit one, and here we do a similar thing. We restrict it to a 1-bisimulation; if you do arbitrary bisimulation you may run into difficulties with regard to complexity, but here we just look for such a description: what are the properties that appear together for the entities? Then we say, okay, these property combinations appear in certain data sources. For example, if you look at creator, we would find that the BBC metadata, but also the DBLP metadata, includes this creator property. And we can combine this with restrictions on the destination. So we have an equivalence class that's built by looking at a type cluster, like document and in-proceedings, and a restriction according to the 1-bisimulation, looking at the different properties like creator. And it's not restricted to an arbitrary creator: for example, it could be restricted to a creator that is of type person; another creator might be of type organization or something like that. And then we find what belongs to this class in these different data sources, right? So we have here a schema description that's induced by a combination of the different classes that entities have, together with the different properties these entities have and their range types. And of course the next step for us, if we have data with little explicit schema information, is to use this kind of schema induction in order to present it in the F# environment and say: here, look, that's the type that you can reuse. And then here you have the payload, the actual data sources out there on the web that you can query for this kind of data. So if you look at document and in-proceedings, we have a type cluster; we can build an equivalence class from this type cluster and from the bisimulation that restricts creator to computer scientists, and then see that only DBLP has this kind of information, because the BBC does not have stuff that was created by computer scientists, I believe. Okay, and when you then have a SPARQL query, you can look at exactly this kind of structural information in order to derive that you should rather ask DBLP than any of the other data sources to find entities of that kind. We can precisely compute what this looks like. We were then wondering about exploiting principles of locality. One thing was to look at streams of data coming in: when we parse the data out there on the linked data web, we don't need to look at the full payload; even if we just look at parts, we get rather good quality. So we do a computation that's restricted to a piece of memory we can handle easily. We looked empirically at how well this works, and we find that even with rather small cache sizes, of just a hundred K and so on, we end up with rather good values for precision and recall. So we lose a little, but we don't lose too much even with rather small cache sizes. That was actually also something we presented two years ago at the Billion Triple Challenge; the semantic web conference always has these kinds of challenges for handling large data. It was much appreciated because we could handle this very efficiently on the kind of hardware we had.
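A minimal sketch of the index structure described in this part, under the assumption that an equivalence class pairs a type cluster with property restrictions from the 1-bisimulation, and that the index maps such classes to the data sources containing matching entities. The class, property and source names simply follow the example in the talk.

```fsharp
// Equivalence classes combine a type cluster with property/range restrictions.
type TypeCluster = Set<string>
type PropertyRestriction = { Property: string; RangeClass: string }
type EquivalenceClass = { Types: TypeCluster; Restrictions: PropertyRestriction list }

// Toy index: each equivalence class points to the data sources containing matching entities.
let index : (EquivalenceClass * string list) list =
    [ ({ Types = set [ "Document"; "InProceedings" ]
         Restrictions = [ { Property = "creator"; RangeClass = "ComputerScientist" } ] },
       [ "DBLP" ])
      ({ Types = set [ "Document"; "InProceedings" ]
         Restrictions = [ { Property = "creator"; RangeClass = "Person" } ] },
       [ "DBLP"; "ACM" ]) ]

// Given the classes and restrictions mentioned in a query, return candidate data sources.
let candidateSources (types: TypeCluster) (restrictions: PropertyRestriction list) =
    index
    |> List.filter (fun (ec, _) ->
        Set.isSubset types ec.Types
        && restrictions |> List.forall (fun r -> List.contains r ec.Restrictions))
    |> List.collect snd
    |> List.distinct

// "Documents in proceedings created by a computer scientist" -> [ "DBLP" ]
let sources =
    candidateSources
        (set [ "Document"; "InProceedings" ])
        [ { Property = "creator"; RangeClass = "ComputerScientist" } ]
```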
It gave us the first place. So that concludes this part. We still need to further explore the schema induction part and query federation based on these structures. One thing we were also intensively discussing in a seminar last summer was how to handle the querying of data in the linked data web. There are two main approaches right now. One is a very fine-grained one: you always ask a URI, give me your description. You ask the individual object. You can do that, and it's easy to publish data that way, but if you want to do this for many objects it just doesn't scale, because you have too many messages. The other, very expressive one is where you open up your data source to arbitrary SPARQL queries. It's very expressive: you can ask select star and get everything, or do all kinds of interesting restrictions, but it's not a very stable solution. At least right now, SPARQL endpoints are not very stable, and it costs you a lot if people ask arbitrary queries. So we're discussing whether something like a restricted query language, whether it's NPQL or something in this direction, might be quite a nice trade-off between being too fine-grained, costing too many queries and too many network messages, and something which is too expressive like SPARQL, which costs you too much. But that's still an ongoing discussion. And that brings me to my last slide: what is the future of these kinds of approaches in general, and what do we look at? Searching for this data would of course include things like keyword indices, stuff we had in other systems we built, but not in what I described today. Understanding this data: there are a few papers out there now from our group and from two or three other groups that look at what it means to compute statistics over this kind of data in order to derive a schema and understand which properties are important. Most properties that are allowed according to the schema are not filled, so you get a Zipf-like distribution of how these properties are used. You can look at schema.org, the stuff that's supported by Bing and by Google, and these properties are not uniformly used across the different sites. Then of course more intelligent queries: the one query I indicated for SchemEx was still a rather fixed one, where you know exactly what properties you are looking for; that will not always be the case. And programming against the distributed data: that's something which has rather been ignored in the semantic web community up to now, and it may be quite nice if you want to bring this community on board, because it's really a pain to write string-based queries that are very brittle, where you don't have type safety and all those things. We hope that our kind of approach helps in this direction. So, thank you for your attention.

>> [applause]

>> Evelyn Viegas: Are there any questions?

>> Audience: You know, I would just like to see some more of the tasks that you used for this.

>> Steffen Staab: So we have some more information on the webpage. Are the individual questions also there?

>> Martin Leinberger: I'm not sure.

>> Steffen Staab: I'm not sure either. We should put them there; that was the plan. Let me see. But you can definitely explore more on this webpage, and if something is missing from there, just send me an email and we'll provide it to you one way or the other. Here, that was the webpage.

>> Evelyn Viegas: All right, thank you.

>> [applause]