Information-Rich Programming in F# with Semantic Data

>> Evelyn Viegas: Good morning everybody, it's a real pleasure to have here with us Steffen
Staab from the University of Koblenz-Landau in Germany. Steffen is a professor for databases
and information systems and he's cited among the top 10 researchers at WWW and Semantic
Web conferences. And today he's going to talk about some work that we've been working on
with Don Syme from Microsoft Research Cambridge on actually using some semantic
technologies to help make sense of data. But more broadly, Steffen has been working also on
using semantic technologies to support software modeling and I won't talk about what we've
been doing and what Steffen has been doing with his team now because he's just going to talk
about it. I'll just mention that we came back from POPL like a couple of days ago, where we had
a workshop on data-centric programming and where there was a lot of interest in these
approaches of bringing more tooling to help developers first make sense of data. So with that,
Steffen please.
>> Steffen Staab: Thank you very much for your kind introduction, Evelyn. So, the work we're
going to talk about includes a couple of people from my team; it was actually initiated by Evelyn
and Don to look at how we can bring some of these semantic web technologies and semantic web
data structures into F#. And among the people from my team there is Martin Leinberger; he's a
PhD student in our team and he's also the [indiscernible], and I must frankly confess I'm a little
bit too far away from the code base really to show this in a competent manner.
Okay, so our point of departure is this quite famous picture of the Linked Open Data Cloud. For
those who have not seen it — who of you has seen it before? I guess many will have? No, not so
many. Okay, so what does it symbolize? Each of these bubbles stands for a data source, and
some colleagues from Mannheim, Chris Bizer and his team, have collected these different data
sources — each one of these bubbles has hundreds of thousands or millions of triples of facts —
and how each of these data sources is connected to other data sources, following this linked
data web paradigm. Okay, many of them are connected to DBpedia, and DBpedia is the fact
base derived from Wikipedia. All right, so the stuff that also drives knowledge graphs and other kinds
of approaches. And the interesting thing about this is that it's like the web, but not on a
document basis but on a fact basis. What is also interesting is that this data has a lot of different
data structures, a lot of different schemas — well, and some of it actually has very little schema
and just the facts. So it's very heterogeneous, and by now it amounts to billions of RDF triples,
and even some famous database researchers — here is, for example, Gerhard Weikum in his
SIGMOD blog — say that this is actually a nice kind of environment to try your approaches if you
deal with heterogeneous data, and if you think of big data not only in terms of having massive
amounts of data with a simple schema, but in terms of having a lot of variety in the data.
So I will use this as a kind of motivation for my talk, though it's very clear that when I talk
about understanding new data sources that come in, it need not be this kind of linked data that
so many web people look at, it could also just be your relational data source that you encounter
where you don't know the schema and that you have to deal with.
So, when we look at these different bubbles, each one representing an individual data source
with millions of facts, you see that when you look at them in detail you have very different
domains. For example, here you have stuff about PubMed, for the medical domain, or
DrugBank, the Gene Ontology, this kind of thing. And here you have a kind of column which is
about publications, from the ACM or from the DBLP server, which provides you a bibliography, or
from our partner here, the Leibniz Institute for the Social Sciences, with whom we collaborate very
closely. They look at publishing metadata and all kinds of data about scientific literature in the
social sciences. Here you have stuff about geographical information. Here you have the
New York Times, which is publishing a lot of linked data about the articles they have, just putting them
on the web and making them accessible. The BBC is doing the same kind of thing, where you have
this media domain which is very active, and then you have these generic databases, fact bases
like DBpedia, but also Freebase, which is actually a part of Google, or YAGO, which was
done by Weikum and his team.
When I talk about a few more of the foundations of the semantic web, I have to explain to you
the very simple foundation that we have there. What is this very simple foundation? We talk
about data, and data for us is simply that you have some subject — and such a subject, or here this
particular subject, is represented by a URI. A URI is nothing else than a globally unique identifier.
So you can take this identifier and ask it: please give me more
information about you. Okay. And such subjects are related by a specific predicate to an
object. This object can be another URI; it can also be just a string or an integer. And that's
the very simple data model behind RDF. So it is extremely simplistic, much poorer than relational
databases, but, at the same time, a common denominator for publishing data. And for
this data you can describe the schema. That schema has classes and it can have
class hierarchies — you will see a very trivial example from me on the next slide — and you can
describe how different classes are connected by predicates, and a predicate has a certain
domain constraint and a certain range constraint. RDF is very simple; you don't even have
cardinalities. If you want to bring those in, you need a more sophisticated standard, which is
called OWL, the Web Ontology Language. Okay, here is a very simple example, a little bit
trivial but I think nice enough to tell you the story. So what is the example? Here rdfs:Resource
is like a kind of top class that includes everything. And then you have a subclass like creature.
All these URIs come with namespaces, as you know them from XML. So you can really
distinguish, let's say, a creature that I described in my examples from a creature as it would be
described in some biology domain, or in some other domain as you like. And we always
abbreviate them here just to make it more readable, but you could mix them completely. You
could take some URIs that are coming from the Microsoft domain with some biology domain
and with my example; you could mix them all together in a giant global graph. And here, for this
class creature, you have several subclasses like dog or person, and you have attributes described,
like hasName and hasAge, and we could continue here to describe what the range
constraints are — for example, that the name would come as a string or that the age would come as an
integer. We've left that out of the picture here. And then here you see another example where
between dog and person there is a relationship, the hasOwner relationship, so the domain is dog:
a dog may have an owner, which is a person. And this is all at the schema level. And here you also
have the data level, where you have Hasso as a dog having an owner which is Bob,
who is a person. Okay. So a very simple example, nothing very fancy about that. When I talk
about this example, please always keep in mind that this might directly be an RDF data source, or
maybe you take an established mapping of, let's say, a relational data source into this very
simple kind of framework. There is, for example, the standard R2RML, which was
standardized about one and a half years ago by the W3C. It is a W3C recommendation, and in the
meanwhile there are very efficient mappers from this relational world to this kind of triple
world. What I mean by very efficient: some colleagues like Juan Sequeda from Texas have
execution engines where it basically doesn't make a difference whether you then ask at this
triple level or whether you ask at the original relational level. So you can really efficiently map
between these different worlds. The interesting part here is that I will assume, for the largest
part of this talk, that I have a nice schema description. As I told you for the Linked Open Data
Cloud, that's not always true. Some of the data do not have a schema, but that's for the last part of
my talk, at the very end, where I will talk a little bit about how we can deal with the situation
when we don't have an explicit schema description. And also, to make it simple, I don't talk
about different schema languages; I just talk about this schema language and this kind of data
level. Okay, and that describes the agenda of my talk.
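As a concrete aside, the toy graph described above can be written down as plain subject-predicate-object triples. The following is a minimal F# sketch of that idea, using abbreviated example URIs and the property names from the slides; it is purely illustrative and not part of the LiteQ code base.

// Schema-level and data-level triples of the toy example, with abbreviated URIs.
let toyGraph : (string * string * string) list =
    [ // schema level
      ":Creature", "rdfs:subClassOf", "rdfs:Resource"
      ":Dog",      "rdfs:subClassOf", ":Creature"
      ":Person",   "rdfs:subClassOf", ":Creature"
      ":hasOwner", "rdfs:domain",     ":Dog"
      ":hasOwner", "rdfs:range",      ":Person"
      // data level
      ":Hasso",    "rdf:type",        ":Dog"
      ":Bob",      "rdf:type",        ":Person"
      ":Hasso",    ":hasOwner",       ":Bob" ]

// For example, list all subjects typed as :Dog.
let dogs =
    toyGraph
    |> List.filter (fun (_, p, o) -> p = "rdf:type" && o = ":Dog")
    |> List.map (fun (s, _, _) -> s)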
I first want to talk about how, given these assumptions, we can bring these data sources into the
F# language and Visual Studio. How can we find out what is in an unknown data source that's
out there among these hundreds of data sources in this Linked Open Data
Cloud, or that you encounter just by integrating another part of your company and integrating
another data source that you have not dealt with before?
In the second part, I want to go a little bit into a very preliminary user evaluation. I say very
preliminary because it has a couple of flaws that we are aware of, but I think it still shows a
certain tendency for what the approach does well and what it doesn't do so well yet. And
eventually, if I still have time, I don't know let's see, I want to talk a little bit about our indexing
approach where we look at this huge cloud and we need to find a particular piece of
information, we want to find something about creatures, about dogs or persons. And of course
this kind of information may be spread all over the place, not just in a single data source and
initially we may not know where it is. And then we have to look at querying such that we find
the corresponding data sources and then explore them and integrate them and program
against them. Okay.
So let's start with this first part, where we encounter some unknown data source. Just pick your
favorite one; then you want to explore this data source and program against the data that's
available there. So I think that's quite a not-so-uncommon scenario if you do some kind of
integration of different databases. And for the sake of making it a little bit better understandable, I
don't go so much into the formal definitions that we have in the tech report, but rather
show things more by an example application. Here the example application may be, taking this tiny toy
example that I gave, that you want to collect dog license fees and want to send email
reminders to the dog owners. And we assume here that we have this nice kind of graph again
that I've shown you before. And you would now have to do a couple of tasks to solve this very
simple, to do this very simple program. So the first task that you have to do as a programmer,
now as I indicated here in the title is to really explore the schema and find out what is in this
data source. Yeah? So you really assume you don't know how the data is described there and
you encounter that. And you want to find the types that represent the data you're interested
in. So, of course you see directly here on this slide that this may be person and dog for example.
But of course in general when you approach data source you just don't know whether it's
person or human or whether it's dog or canine or however the formulation might be. So you
really need to look around. Classically this first task you might do with a kind of browser, or
if you work in the semantic web, you might do a SPARQL query — SPARQL is another W3C
recommendation that's been around for a couple of years now. By now there is at least a second
version of SPARQL around, and there are different kinds of engines that support SPARQL and
support the RDF data format in the background. And you can ask questions like: give me all
classes for which there is a subject that is of that type — so by such a query you find all the classes
around that have instances. You can do that, but of course it's
not so easy to formulate the queries to explore this kind of unknown data source; you easily get
lost. And if you are in a situation where, say, you start here because that's the top class for all
different classes, it's not so easy to navigate through by asking these queries, because every
time you want to do a refinement you have to completely change the query. So that would be how,
in a naive approach, you would still have to do it. And then when you have found out,
okay, I'm really interested in the dog and the person RDF types, well then of course to program
against that you somehow have to mirror these kinds of types in your programming environment.
So basically what you have to do is take these kinds of types and provide code types — the
terminology always overlaps here, so I try to speak of RDF types when I mean the RDF side, and
of code types when I talk about F#, for example, because both are of course
typing systems, but the one is in your data source and the other is in your programming
environment. And you somehow need to establish both type systems, and you need to map
between them both, right. So, after you've identified these as the types
that you're interested in, you have to establish these code types in F#, for example. So you might
say something like: a creature is a certain class and it has certain attributes like hasName and
hasAge, and we then can have subtypes like dog, which would be another code type in F#,
or person. They inherit all the attributes from creature, and in addition they may have
further attributes like the hasOwner property or maybe some tax number for dogs. Okay, and
that's the second task. So the programmer now basically has to write this down, which is actually
very much a copying task, because the properties are not really new; they're already given in the
schema of the data source.
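For illustration, the hand-written code types for this second task might look roughly like the following F# sketch. The names (Creature, Person, Dog, HasName, HasOwner, TaxNumber) simply mirror the toy schema; this is not the LiteQ-generated code, just the kind of boilerplate a programmer would otherwise copy by hand.

// Hand-written code types mirroring the RDF schema of the toy example.
type Creature(uri: string, name: string, age: int) =
    member this.Uri = uri
    member this.HasName = name
    member this.HasAge = age

type Person(uri: string, name: string, age: int) =
    inherit Creature(uri, name, age)

type Dog(uri: string, name: string, age: int, owner: Person, taxNumber: string) =
    inherit Creature(uri, name, age)
    member this.HasOwner = owner
    member this.TaxNumber = taxNumber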
Next, of course, what you have to do is data querying. So you want to query — say, I'm interested in
all dog owners because I want to remind them to renew the licenses for their dogs. Okay, so
what do you have to do? Well, once you have your persons and dogs, you are interested in
persons that are owners of dogs, so in a naive approach you may ask a SPARQL query and just
find these owners by putting forward this kind of query. Although that is possible, be aware that in
general, if it's an unknown data source, you don't really know what these relations look like or what
their names are. And here they even look nice and short, but in general these are very long unique
identifiers in the semantic web, right. So they may have long, package-like names if you encounter a
real data source, not just such a toy example. Okay, that's just writing this query. And once you have this
query, of course you then have to instantiate your objects from this data source and
manipulate them and program against them. Okay, so here we have to develop the
functionality around your query. So typically in F# you would have to formulate a query string.
This would be for example this kind of SPARQL query I've shown you on the previous slide. And
then you could evaluate this and then you could iterate over it to create persons and send
email reminders to them according to the emails that I described. And that would be like the
four steps that I would identify here. So first exploring and understanding what's in the data
source, then creating your types in your programming environment, the code types. Then
querying your data according to that and developing the functionality around that. And for that
it's almost like turning these data instances into code objects. All quite nice and well, but a little
bit laborious. And the idea of LiteQ was to say, let's try to put these different task together and
provide a kind of a framework that just makes it easier for the developer to handle these tasks
with fewer different tools and fewer machinery and less boilerplate code than we have so far.
And of course, behind this idea — you may already recognize it if you think of LINQ
and of type providers — you will see some of these ideas shining through here. Again, we will
later on have a comparison with regard to these kinds of approaches.
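Spelled out, the naive four-step version might look something like the following F# sketch. The helpers runSelectQuery and sendReminder, as well as the email property, are hypothetical stand-ins for whatever SPARQL client and mail code you would actually use; the point is only that the query lives in a brittle string and all typing work is manual.

// Hypothetical helpers -- any SPARQL client / mail library would do here.
let runSelectQuery (endpoint: string) (query: string) : seq<Map<string, string>> =
    failwith "plug in your favourite SPARQL client"

let sendReminder (email: string) =
    printfn "Sending dog-license reminder to %s" email

// Task 3: the query as a plain, untyped string (ex:email is an assumed property).
let ownersQuery = """
    PREFIX ex: <http://example.org/>
    SELECT ?person ?email WHERE {
        ?dog    a ex:Dog ;
                ex:hasOwner ?person .
        ?person ex:email ?email .
    }"""

// Task 4: iterate over untyped result rows and turn them into actions by hand.
let remindAllOwners () =
    runSelectQuery "http://example.org/sparql" ownersQuery
    |> Seq.iter (fun row -> sendReminder row.["email"])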
Now, the core idea for supporting these different tasks is to have a kind of query
language which we call NPQL. NPQL is the Node Path Query Language, and it allows you to
traverse the graph — traverse this RDF graph that contains information about the schema and
information about the data. The nice thing about RDF is that it hardly distinguishes
the two, so they are not really completely separate: all of the schema information as well as the
data information is encoded in triples, and you can ask for these triples using SPARQL. And NPQL
is, you could say, just syntactic sugar to formulate your SPARQL queries in a way that makes
it easier for a developer to go through those different tasks with the same kind of syntax, and
not with completely different syntaxes all the time.
For this language we have then developed three different kinds of semantics, right. So first you want
to just go through this graph — that's the exploration part. But then you want to query for data,
and for that we have a typical extensional semantics; no big surprise here, that's just
what you would expect. And then we also thought: well, we do not only want the
extension, that is, all the data that falls into the pattern, but we really also don't want to duplicate the
types that we already have in the data source. So what we then define is an intensional semantics for
basically the same kind of queries, which gives you back these type descriptions. So you don't
have to reinvent these type descriptions, because the query just retrieves them, and you can then
sort of put them together with the extensions. I will show you in a minute how this works. And once you
have that, you can actually not only use it for nicely writing your code, but you can even
suggest to the developer what to write next. And that's what I call here the auto completion
semantics: when you're writing these kinds of queries, you are supported all the time with
suggestions of what you could fill in at a particular point in the query.
So let me again show these different types of semantics and tasks by example. The first
thing would be to say you want to explore this graph. So you start, let's say, with RDF resource,
and then you have an operator here for subtype navigation, and then you go to creature.
So you change from this context to that context; you go from one node in the graph to another
node in the graph. This can become a little bit more complicated. You can also start at a
different point, a different node in the graph. So you would start here with dog and then you
navigate along the property; not very surprisingly, navigating over hasOwner, you land at the
person node. And now we can define what the extensional semantics
of that are. Well, assume we are for example here at the dog node; we have navigated there from
the top class down to creature, down to dog. Having by this means selected dog as our current
context for evaluation, we can walk through the hasOwner relationship by this dot operator,
and then we can just extensionally evaluate what this query means. So this query looks
almost like an XPath query, with slightly different operators. And then you just say: I'm now
interested in all the nodes that fulfill these conditions. And if you think in terms of having
been here at this dog and having navigated by this hasOwner relationship to person, you use
the extension to retrieve all persons who own dogs, because that's the context from which
you came, and you end up here with Bob, because Bob in this example is the owner of Hasso.
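As a rough illustration (not the exact translation LiteQ generates), the extensional reading of this path — dogs, then their owners — corresponds to a SPARQL query along these lines, written here as an F# string with example prefixes.

// Extensional reading of the path Dog . hasOwner: all persons owning a dog.
let dogOwnersSparql = """
    PREFIX ex:  <http://example.org/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT DISTINCT ?owner WHERE {
        ?dog rdf:type    ex:Dog .
        ?dog ex:hasOwner ?owner .
    }"""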
Okay so, we have seen, we explore this path and then we say we're interested in one of the
kinds of objects that fit my description as I was walking along the graph. Now it becomes
interesting.
>> Audience: When you said that there's not much distinction between data and schema, does
that mean that if my query had been instead like resource to creature and then ask, then follow
that subclass, I would get all schema elements that are subclasses of creature? Or do you have
to get down into the data before you can [indiscernible].
>> Steffen Staab: The extensional semantics here always looks for the instances. What
you could easily do here is meta-modeling — in fact, RDF actually has
some meta-modeling approach. So this creature, for example, is a subclass of resource, but it's also
an instance of rdfs:Class. So you could walk from resource to rdfs:Class and then say,
with an extensional query, give me the extension of rdfs:Class, and creature would belong to it. So
RDFS has this meta-modeling kind of thing. And thanks for asking a question — I forgot to mention
initially, please interrupt me any time when you're interested in a particular explanation; I
think it's better. Yeah, so that's easily possible. And see here again this kind of query
for which we look for the extension: we can do the same kind of query and navigate
to dog and then to the owner. So that's the same kind of query as before, and that basically
means we're interested in some persons here, like Bob. But now, if we ask for the intension, that
means we do not look for Bob, but we rather look for the intensional description of the people
down here. Right, so this expression no longer refers to the instances like Bob
who are owners of dogs; because we look at the intension, it looks at how we can describe
the type of these instances. And obviously this type has all the attributes of person. So this
means that by using the same kind of query expression we get the instances, and with the
same kind of query, where only the last operator is different, we get the schema description of these
instances, which we may then use very nicely for creating our F# type description. Yeah, so we
can now say in F#, using this kind of query, I want to describe — whoops, sorry, that's just the old
version; here it should actually describe the person class being returned, and I have probably not
updated this. And then you would just see here what has been retrieved according to
the intension. The thing here is that once you have formulated this query, you can then
build this into your environment and say: please give me the objects and type them at the same
time. And you can use the same kind of thing for the exploration. So here we have a query that
says: I want to go from resource to creature, from here to here. And then it depends on what
kind of operators you use. I've only given a partial account of the operators here, but let's say
you look for subclass navigation: this expression here would actually not be a complete
expression according to our query description, because it really looks here for another class
description — or class term, I would rather say. But what we can now do is that, based on the
context up to here and based on this kind of operator, we can suggest what would be
appropriate next terms to fill into this place. So what we basically do at this point is we
define these auto completion semantics and make suggestions. And we make suggestions depending
on the operators. There are some operators that look for instances — Martin will show you
one of them in the demo — and the instances we derive using the extensional semantics
of this expression plus some further information. Or we can look at types and properties, and
we derive the properties or types that may be appropriate in such a situation; here it would be the
types, according to the intensional description of the preceding query expression. Okay. So
here, obviously, when you look at creature, we may then look at what the subclasses
of that creature are and suggest that both dog and person would be appropriate subclasses. And these
are suggested to the programmer who writes his code — because they are the direct subclasses —
to complete his query. This is just one of the features, to give you a
little bit of the flavor of what LiteQ is about. It's a little bit more complicated query, not a lot
more complicated but a tiny bit and what it does here is to say that if you start at dog, we may
be interested not just in all dogs, because there may be dogs without owners and we cannot
then have some license fee for them if they don't have owners, but we can only have license
fees if there are owners. And then we can restrict the set of these dogs to those dogs that have
owners, right. And this would be a corresponding query for doing exactly that. If some of you
are familiar with description logics — are some of you? Have you encountered description logics
before? It's a subset of first-order logic — this is very much like a concept description in
description logics, where you can then look at such expressions and ask for instances of such a
complex concept, and where you can also do a kind of query subsumption. And that's what
we're aiming at here. And of course, in this context, the dogs that have owners would
include Hasso; if there were another dog without an owner, it would not be returned here.
It's a very restrictive form of querying. It's not giving you the full SPARQL support, right; it's very
much focused on traversing the graph and restricting the number of nodes that you're
encountering. So I would call it a left-associative conjunctive query. You cannot build
arbitrary queries this way. In SPARQL you have full power for all kinds of conjunctive
queries and more; we don't have that here. And that's by design at this point in time, because
we really wanted to support the developer in writing these kinds of queries and suggesting to
him what would be appropriate schema information — properties, classes, or even, at the
data level, instances — to complete his query and to work with that, right. So we didn't target a full
query language like SQL, LINQ or SPARQL, but rather a subset of that.
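For intuition about the intensional semantics mentioned above (and again not the exact translation LiteQ performs): where the extensional reading of a path returns instances, the intensional reading returns a type description, which one could approximate with a schema-level SPARQL query such as the following, collecting the properties applicable to Person and its superclasses.

// Intensional reading, roughly: which properties describe the type at the end
// of the path?  Here: all properties whose domain is Person or a superclass.
let personTypeDescriptionSparql = """
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?property WHERE {
        ex:Person rdfs:subClassOf* ?class .
        ?property rdfs:domain ?class .
    }"""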
So we now have the exploration task — you've also seen how this should be supported with auto
completion. We have the type definition task, using the intensional semantics. We have the
query task, supported by the extensional semantics. And the fourth task that I mentioned
before, in the naive way, was to create objects, manipulate them and make them persistent,
right. So you want to get the objects from the data source, you want to work with them, and
then also give the results back to the data source. And of course, for that you have to develop
functionality around the query. If we look at how we can do this here, it would just be,
okay, first some boilerplate code like including our LiteQ mechanism, and you have to
define the data source. In the future we will think about how to go from one data source to
another data source in the Linked Open Data web, but right now you just have to indicate
where exactly this source is. And then you write your query here. So, given this data source, you
write a query going from creature, navigating down to dog, navigating over hasOwner — the
operators look slightly different than what I've shown you in the slightly more formal part, but
they do exactly the same thing. And what you get here is all these dog owners. And then you can
just take these dog owners, iterate over them, and for example send them an email
reminder. So the nice thing here is that you directly get the extensional semantics, and in this
assignment you also assign the types to these dog owners. It's not just that you have some
arbitrary data objects here; these data objects are typed with at least person, and also with the
hierarchy above person, including creature. So you can then also use, in this expression, the
information from the creature class, like the hasName attribute for example, or
the hasAge attribute.
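In code, the intent just described is roughly the following sketch. It is hypothetical, not the actual LiteQ API surface: getDogOwners stands in for the provider-generated query over the path Creature -> Dog -> hasOwner, and Person is the hand-written code type from the earlier sketch.

// What the LiteQ pipeline effectively hands back: the extension of the path,
// already typed as Person (stand-in function, not the real provider call).
let getDogOwners () : seq<Person> =
    failwith "stand-in for the query generated from the NPQL path"

let remindOwners () =
    for owner in getDogOwners () do
        // Members inherited from Creature (HasName, HasAge) are available too.
        printfn "Dear %s, please renew the dog license." owner.HasName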
We have a preliminary implementation of this approach supporting F#. So this is the website.
It's really preliminary, so it's not anything like production code, but it's good enough to look at
the basic idea. And we know quite a couple of things that we have to do to make it usable in a
production environment, which will take us a couple of weeks. But it's good enough to take a
look at and see what's the principle behind it. And I would now suggest that we switch over to
Martin who will show you some demo of the system as it is now. And so we'll switch over the
screens and also hand over for explanation to you. Oh yes.
>> Martin Leinberger: So what we will do is, we will actually query the data source. And a very
simple query would be to query for all dogs, which looks like this. In this case we
actually start one step below RDF resource, because RDF resource is always there. So we start at
creature, and basically the data source behind it is exactly the same as we've shown you before
in the schema. So we go to creature, we do a subtype navigation and choose dog and get the
extension, and then we can do something like print out the names of all dogs. Get has name,
there we go. And if we run this, we hopefully get a list of all dogs that are in our data source.
Right now it's, well as Steffen mentioned before, we are just using RDF schema. That's why you
will always get basically a list of possibilities back. Like here, a sequence or list that just contains
the string Hasso or Bello because you can't really restrict the number of triples in the schema,
so we have to assume that there are always more. Yeah so like this, you could get all dogs and
then you could also go to the individual level. So you could say, show me all individuals of dogs,
and luckily there are only two of them in our data source. So let's choose Bello, and we
actually get an object so we can work with Bello. And we can now also print out his tax number.
>> Audience: I find it very interesting that dogs are taxed on this.
>> Martin Leinberger: So, that's a literal German translation. We would actually have to look it up;
it should probably say license fee number or something like that, I don't know. It's a
German example. [laughter]
It's a false friend among the different words. All right, so the tax number of Bello is 1234, so we
can now go ahead and change that. And you see, this tax number — that's supported because we
know the type of Bello. That helps us here to show the right kind of properties. So this is again
typed information.
>> Audience: Have you thought about having some operators that would be less
[indiscernible].
>> Martin Leinberger: Yes, we have definitely thought about it. This was just a first iteration.
You know we want to, we were excited to get started and this was the first thing that came to
mind.
>> Steffen Staab: One thing — I mean, the first idea was also to have the kind of operators that I
showed you before on the slide, but the implementation we use is the type provider mechanism
in F#, and there we currently have to somewhat misuse this kind of notation. Because
the dot notation already means something in the type providers, we are not completely
flexible in choosing the grammar, and if we could choose the grammar arbitrarily, we would go
for something closer to the version I had on the slides.
>> Martin Leinberger: Actually, we're trying to look into DSLs in F# in the future and trying to
explore a bit in this direction, but we're also open to other suggestions, so if anybody has a good
idea we would be happy. Yeah, so we just changed the tax number of Bello, and this will be
persisted in the store. So if we run it again, it will still show the new tax number. And in the
background, basically all these queries and manipulations are translated to SPARQL queries
and SPARQL updates. So basically we don't care what kind of store runs in the
background as long as it is SPARQL compatible.
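For readers unfamiliar with SPARQL Update, a property change like the tax-number edit in the demo would translate into something along these lines — an illustrative update with example URIs, not the exact statement LiteQ emits.

// Illustrative SPARQL Update for changing Bello's tax number (example URIs).
let updateTaxNumber = """
    PREFIX ex: <http://example.org/>
    DELETE { ex:Bello ex:taxNumber ?old }
    INSERT { ex:Bello ex:taxNumber "5678" }
    WHERE  { ex:Bello ex:taxNumber ?old }"""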
>> Audience: Could you show — I didn't know where dog was in the hierarchy. Do you have a
way to search the data context for it?
>> Martin Leinberger: Right now, in this implementation, you would need to go down the
hierarchy. Right now we are forced to start at creature, which is a huge limitation, and
in this implementation the only good thing is that the auto completion can, you know, help you
in finding your type.
>> Audience: All the individuals under creature, like Bello exists under creature, that
individual...
>> Martin Leinberger: Yeah, you could also do that but...
>> Audience: It doesn't go upward, it doesn't go up to a higher class.
>> Martin Leinberger: No.
>> Steffen Staab: Although we could do that — actually, in our first draft we had operators for
that — we wanted to keep it simple first and get the first version going. I mean, it's still the
middle of the project; it's not finished.
>> Audience: You know, I just think about when people get data all right the first thing I want
to do is like see it, right.
>> Steffen Staab: Yeah, absolutely.
>> Audience: When you're going to see it, select star, you know... That exploratory nature is
what...
>> Steffen Staab: Yeah, okay. And well, the next thing we would have to do there: you can do
limits in SPARQL like you can do in SQL too, so you don't get like millions of individuals, because that
doesn't make sense, right.
>> Audience: [inaudible]
>> Steffen Staab: Yes
>> Audience: So this is a [indiscernible] that you're running against which is the type provider
framework and the underlying data which is huge.
>> Steffen Staab: Yes.
>> Martin Leinberger: I mean, in the end it's always problematic. If you would query DBpedia
and try to list individuals, you would just get overwhelmed by the result.
>> Steffen Staab: And you have actually more than one way to arrive at dog because you can
arrive there by going down the hierarchy but you can also cross navigate. Right, like you've seen
that with person. We could navigate from creature to person, we can also go to dog and say
who are the owners. So we'd also arrive at person. So there are different possibilities to arrive
at the same types. Because you really have a graph. That's also a major difference to the usual
type provider approaches where you rather have a tree in your exploration. So the tree is
basically going down, for example, a hierarchy, and then you may have some data at the leaves
of the tree. That's the typical way the Freebase type provider, for example, works. But here we have
this graph, which also makes it a bit more complicated to sort of arrange the interaction
with the auto completion engine. Okay, all right.
So we switch back to the slide deck. And I have here a comparison with different kinds of
approaches, which also, I think, tells you a little bit about how we would position the approach. First
LINQ. Well, LINQ is very good at querying for the data and then also manipulating the objects.
You have to define your types yourself in order to then use them and to work with them in these
program expressions. And of course, you don't have the schema exploration part. But for
working with the objects and querying, you have here the full expressiveness. For LiteQ we
have here two columns: one column is sort of what we plan to have in the end, and the
other column is what we have now. That's of course not yet the full thing, so even here, in the
end, we will not have full SPARQL, for example. So it would not be as powerful in querying as
SPARQL, or as powerful in querying as LINQ. But that's this trade-off, where we really
want to support the user in writing queries and be able to type them automatically, right. If we
don't require this automatic typing, we can come up with more complex queries. If you don't want
to do the auto completion, or if you do auto completion in a very different way — that's another
thing we are discussing now, how to better support auto completion for SPARQL — then we
would not have this kind of restriction. Well, we only have a subset of the operators that we
have in our concept. And if you look at something like the Freebase type provider, you have some
sort of data querying, but it's much, much more restrictive than even what we have now in
LiteQ. Okay, well, here we have two kinds of type providers. Of course, it depends on how different
type providers are implemented, so it's a little bit difficult to say the type provider mechanism is
like this, because it really depends on the implementation. So we have indicated here two
different type providers, one for XML and one for Freebase. So, for example, the schema
exploration that happens in those type providers was also sort of the motivation for us to do this
kind of exploration.
>> Audience: Sorry, just want to make sure we have apples-to-apples comparisons here. You
said LINQ. What exactly are you referring to? Just the query API level inside the runtime? Or are you
referring to trying to have connectivity to an actual persistent store?
>> Steffen Staab: I'm not sure where you draw the line there. I mean, what you have here with
LINQ is you have sort of the possibility for doing type inference, because you don't have the
query expressions as strings, but you really have them as code in your program, and you have of
course all the possibilities for doing the querying in your program code directly.
>> Audience: When you say LINQ, what is the data source?
>> Steffen Staab: Well, it depends of course on what you query with, whether it's XML or whether
it's relational.
>> Audience: Which is why I'm wondering what the column actually means — so for instance, if
instead of LINQ you put in Entity Framework as a LINQ provider, then you get number two and
you possibly get number one.
>> Martin Leinberger: Okay, I was not aware of this, but we should look it up. Thanks for that.
That's called Entity...?
>> Audience: Entity framework.
>> Martin Leinberger: Entity Framework. Okay, I was not aware of that. I was thinking
here of XML and the relational one, no? Yeah, for XML type providers you have schema exploration,
but for example it's restricted there to trees. The information in Freebase, however, is actually not
restricted to trees. So that's why we have indicated it's not the full exploration of
everything. Code type creation, including in our current version, uses erased types. For these erased
types, the types are used at design time for doing the auto completion, but then they're
thrown away. And that also has some negative consequences for our current implementation:
we don't have the full possibility to do a switch based on the type of an object at runtime.
That's something we would like to have. Right, so that's where we want to go here: have
real, full types in the full hierarchy, but that is not part of the type provider mechanism as it is now.
Here I was not quite sure — that's why I put the question mark — but we believe it also uses erased
types; we did not do a full investigation.
And for data querying: here the Freebase type provider basically lets you ask for the instances of a
particular class. If you have the country class and you get all the instances of that, you would of
course get all the countries — but not much more. And once we have the full types, not just the
erased types, we can also do new object creation. Currently we can't do that. So what we can
do right now is really take the objects, as Martin has shown you, and manipulate their attributes, but what
we would also like to do is say: okay, here are these types, like dog, from the RDF data source;
make me a new object and also make it persistent in the data source. And that's currently not
possible, so we don't have that right now, but I think once we have the full types that should be
easy enough to implement. So that's a rough comparison to some of the related work.
There's a lot to do now for LiteQ; the current implementation is a prototype. I just
mentioned this erased types problem. We need to do some optimizations of the code, to
have full lazy evaluation, which we currently don't have. We also want to further analyze
which types are really needed just at design time and which are really needed at runtime, so we
can sparsely add types to the programming framework. At least right now we don't throw
all the types into the DLL, but I think we can do some more optimizations there to make a
finer distinction. What we also still need to do, and have not yet done because it's really just the
middle of this project, is to look precisely at our query language and its
expressiveness, and how it compares to other kinds of approaches. And in particular, one thing
that I think is really interesting, which also goes a little bit beyond our scope, because we're
mostly at home here in this data modeling world looking at RDF: we had quite some projects
where we did modeling with UML class diagrams, meta-modeling and this kind of thing, and we
could use description logics to explain a lot of that and of the queries. What I think is really
interesting is then to look at the programming world, where you have type inference as in F#
or in ML, and then see how these kinds of types that are derived ad hoc — where you have dogs
that have owners, which are kind of description-logic types that are then put into a lattice of
different types in the data source description — how this mechanism would interact with the type
inference you have in your programming world, where you then derive new types. And I think
that would probably be a [indiscernible] behavior, how these different lattices interact. And this
could mean that you have new objects — for example, new objects that you have created that
you can place at a lower point in the hierarchy in the data source — or you may have a more
refined, for example, static analysis of your types in the programming world, because you
have all this type hierarchy that you take from the data source. So it's a little bit of speculation
here, but that's one of the research aims that we have there.
So let me come to a little bit of the evaluation of LiteQ. We've thought about how we can evaluate it,
and it's really a tentative evaluation. I would not want to say it's really good
enough to submit to a software engineering conference, for example, but it indicates
a little bit where the problems are and also where the advantages are. So first we looked at
how we can evaluate LiteQ at all, and then determined that the process as a whole is a little bit
hard to evaluate, for various reasons, because it's hard to see what the
counterpart would be. And we also had practical problems. For example, we used people from the
institute, and we didn't have so many test subjects who would know F# or even just
functional programming very well. So it would be quite difficult to come up with a fair
evaluation. So we decided to rather compare only NPQL against SPARQL, with the
hypothesis being that NPQL with this auto completion allows for effective query writing while being more efficient when compared against SPARQL, right. And that of course means this is a
very focused evaluation that does not use some of the advantages of LiteQ, because it really
goes only into the exploration part and the data querying part, and not, for example, into reusing
the queries for programming as in task four, or writing code functionality against them.
So we had 11 participants, although we actually eliminated one subject afterwards because the
subject was not able to handle SPARQL at all. He was able to handle NPQL but not
SPARQL, and we then thought it too unfair to include his times. So we still had 10
subjects remaining for analysis. They were undergrad students, PhD students, and a
few postdocs. The setup was to have a pre-questionnaire about their knowledge, to have a
classification, to give them a little bit of training in RDF and SPARQL in case they needed that, and
also in NPQL, which of course was new for all of them. And then we had some tasks for them to
solve, and a post-questionnaire about what they liked and what they disliked.
So here is a little bit of insight into that evaluation. We looked at programming skills, where all of
them classified themselves as intermediate programmers or
better. Object orientation: eight of them would say they are intermediate or better in that.
Functional programming: only four would say they are intermediate or above; four would say
they have no knowledge of functional programming. When we looked at which functional
programming languages they knew, Lisp and Haskell were mentioned most; F# was mentioned once
here. And we looked at .NET — how well would they know the framework and be able to work in this
environment. We had one expert, two beginners, and seven who had not encountered .NET
yet. And then we looked at SPARQL. There were three individuals who said they would
have intermediate knowledge or above for SPARQL, and we classified them on the next slides as
SPARQL experts; seven had indicated less than intermediate, so we classify them here
as SPARQL novices. And we had a training phase: about a 20-minute live presentation by Martin
and another PhD student, Stefan Scheglmann, who demonstrated two SPARQL queries and let them
work with the environment to write SPARQL queries. So that gave them just a little bit of ad hoc
training, just five minutes, and also five minutes for NPQL queries and writing them in the
Visual Studio environment, so they would basically know how to operate the corresponding
tools. There were nine different tasks to solve. We split the participants into two
groups: one group would do half of the tasks in SPARQL and the other half in NPQL,
and for the other group it was just reversed, so we would not have a bias toward one or the other kind of
task. Half the tasks were done using Visual Studio, the other half using SPARQL in the web
interface — you have actually seen the web interface before in some of the slides, a very simple
web interface. The task types were really there to explore how well you can navigate
and explore the data, so it corresponds to the programmer's task one of finding out what is in the
source. The source was very small; we had just about 50 facts. It was not a big task yet where
you had thousands of schema elements and tens of thousands of facts or so.
>> Audience: Was that intentional? I mean I guess was it because you couldn't find a source
because a big source was...
>> Steffen Staab: No, finding a big source is not a problem.
>> Audience: Oh, okay, so why didn't you use a larger source?
>> Steffen Staab: This was a kind of preliminary trial. There's still the optimization work we have to
do to deal with large numbers of facts. We will have that; I'm not concerned
about that part, it was just not quite there yet when we did this. And then, retrieving and
answering questions about the data, that was task three. And we also included two tasks that were
intentionally not solvable with NPQL. I mentioned to you that some tasks we just cannot
do, because you cannot count, for example, and so we wondered what people would do, or
when they would back off and say, okay, I cannot do that. So that was intentional by design.
And we took the durations to task completion.
Okay, so here is, I think, the interesting part. Again, don't take these numbers too seriously. It was
still like a pretrial; it was not fully formalized, so just know the numbers may lie a bit. But
the tendency was really that the SPARQL novices really appreciated the support by NPQL. The
SPARQL experts did not benefit so much, when you compare this number against that one. There
was a little bit of slack because they started to talk about the advantages and
disadvantages, and it was not a fully formalized trial yet, so that would be the next step. So these are
really rather tentative results; we have to be a little bit careful about these numbers. But it
shows you a little bit the effect that if people are really familiar with SPARQL, they would not
gain quite as much, while for the others it was way simpler to deal with. And
the unsolvable tasks really gave people problems, just to find out that they couldn't solve
them — for this you need to understand what you can do and what you can't do. Here is an
evaluation per task, for the nine different tasks. The grey ones were unsolvable, and you see
that typically the average is a little bit better for NPQL overall, but there were also more
novices here than SPARQL experts.
>> Audience: How do you define completion of an unsolvable task?
>> Steffen Staab: We really just took the duration. So in the next trial we
will have to do a cut-off at five minutes or so.
>> Audience: I see.
>> Steffen Staab: And eventually, what they did — there was for example a task, on a small fact
base, where we asked: how many dogs are there? They could just return all the dogs and count them
by hand; that was allowed. But they could not directly say, count the number of dogs,
which you can do in SPARQL. And of course it also means that this is only really possible with a
small fact base — right, not when you have half a million dogs.
>> Audience: Or enough time.
>> Steffen Staab: Yeah. [laughter]
>> Audience: I'm curious whether they recognized that it was not solvable with NPQL,
basically.
>> Martin Leinberger: Some of them did. Some of them said, okay, I cannot count, so I'm going
to write here: not solvable. Others just took it as they encountered it, without any comment.
>> Steffen Staab: And the point of course is, I mean, you can always make the language more
complex, but at some point you will lose this functionality of getting the types, you know. And
so that's a kind of trade-off point which we still need to explore and understand
better in a formal way. And then in the post-questionnaire we asked this question: do you really
want to explore a data source in your development environment? Four of them said yes, three
actually said no — so maybe it's not for all of them; maybe some of them would rather say, I don't
care about that — and three didn't have a clear preference.
>> Audience: As opposed to what?
>> Steffen Staab: Well, you can of course just have a browser. You can have a browser where
you wander around and you get a nice graph — there are browsers like that around — and then
you just switch to your IDE and program it. I think part of that also was because the IDE was
really too slow. So I think when that becomes faster, those numbers will also change a bit, but
still it's remarkable that some didn't like it or didn't need it so much. NPQL is easier to use than
SPARQL — well, some of them would agree to that. But one needs better support when
writing SPARQL queries; so basically there's no good SPARQL query-writing support right now,
and we're looking into this question as well, because that would also help, of course. It's not so
straightforward, because you have so many dimensions in which you can continue a query once
you've written one line, and it's not obvious at all how to do that. If you have a large set of
queries that people have already asked, then of course you can do
recommendations. But if you don't have that, it's not quite clear what to do. And then better
response times — that's one of the tasks; we have to work on that. My conclusion, a tentative
conclusion, is that LiteQ is still in pre-alpha status. We see some advantages, and of course once
we handle things better, some of those results will become stronger, I think.
Okay, so we have, how much more time do we have? 20 minutes? Okay, so I can talk a little bit
about my third part. We have now this framework for looking at an unknown data source.
Exploring it, querying against it, but the problem of course may be how to find this data source
in the first place, right. And that's what the last part is about, about SchemEx. Constructing an
index such that you can determine where to find certain information and also then induce a
schema if you only have facts. And that's happening quite often with the semi structured data
sources. So here would be a typical example. You're interested, for example, in some set of
documents that appeared in some conference proceedings and were written by a computer scientist, and
you have a SPARQL query for that. So that's a SPARQL query: you're interested in some x of
type Document and of type InProceedings, where there's some creator that is a ComputerScientist.
And you have the Linked Data Cloud, and you want to find out where you can find such a piece of
information, right. What we do is, we compute an index and can then answer this question,
which for example would say: here in the Linked Data Cloud there is an ACM data source mirroring
the bibliographic metadata of the ACM library, and there's DBLP, another data source.
And there are some more, but these are two that you could ask. So, in order to query them,
you need schema information. You need to know where data
conforming to this or that schema would be around. And we need to look at two types of
schema information. First, we need explicit schema information: an entity, for example, may be
directly assigned to a class. So in the example, I assigned this variable to the class Document
and to the class InProceedings — that's an explicit assignment. But actually, what you will see
often — and there are a couple of papers out there now in the semantic web community that
looked at implicit information — is that quite often there is more schema information in how
entities are related to each other than in the assignment to classes. But it's not always clear, so
there's variance in there. In some data sources, it's really the class information that dominates;
in some data sources, you really get more information about the schema just by looking at the
data items and how they are related to each other. Yeah, so the schema-level index that we
then build is for entities, and these entities may be related to particular classes or types; they
may be related to other entities of other classes, and they may be related to strings, integers and
other kinds of built-in datatypes. And the idea of course is to say: well, we have certain structures,
and we map these structures to the data sources, to determine that in data source one and two, or whatever, we
find entities that conform to the schema structure. So what do we do? Well, one thing is to say we
have type clusters. Type clusters are combinations of classes that appear together. So do we
have, let's say, a president that also was an actor, right? Then you would have a type cluster for this
kind of entity. And you would then say, for such a type cluster, where in your different data
sources — your hundreds of data sources — you have entities that fall into two or three or four of
these classes. Now here, we are interested in documents and proceedings, and we are interested
in that type cluster. And we find that in DBLP and in ACM we both have entities that fall into this
type cluster. Then we have entities, and we say they're equivalent if they refer to the same
attributes. And what we do here is compute a 1-bisimulation. So if you have a little bit of background in XML schema induction, with people like
[indiscernible] who computed DataGuides in order to come up with a schema
description of XML documents that don't have an explicit schema description — here
we do a likewise thing. We restrict it to a 1-bisimulation; if you do arbitrary bisimulation you
may run into difficulties with regard to the complexity, but here we just look for such a
description: what are the kinds of properties that appear together for the entities? And then we say,
okay, these property combinations appear in certain data sources. For example, if you look
at creator, then we would find that the BBC metadata, but also the DBLP metadata,
includes this creator property. And then we can combine this with restrictions on
the destination. So we say we have an equivalence class that's built by looking at a type cluster
like Document and InProceedings, and a restriction according to the 1-bisimulation looking at
the different properties like creator. And it's not restricted to an arbitrary creator: for
example, it could be restricted to a creator that is of type Person; another creator might be of type
Organization or something like that. And then we find something that belongs into this class in
these different data sources. Right? And so we have then here a schema description that's
induced by a combination of looking at the different classes that entities have, together with
different properties that these entities have and their range types. And of course the next step
for us to do is that if we have data with little explicit schema information, we use this kind of
schema induction in order to present it in the F# environment. And say here, look that's the
type that you can reuse. And then here you have the payload, the actual data sources out there
on the web that you can query for this kind of data. So if you look at Document and
InProceedings, we have a type cluster. We can build an equivalence class from this type cluster
and from the bisimulation that restricts creator to computer scientists, and then see that only
in DBLP do we have this kind of information, because the BBC does not have stuff that was created by
computer scientists, I believe. Okay, and when you then have a SPARQL query, you
can look at exactly this kind of structural information in order to derive that you should rather
ask DBLP than any of the other data sources to find entities of that kind.
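The example query being federated here looks roughly like this, written as an F# string; the class and property URIs are illustrative, since the real sources use their own vocabularies.

// Documents that are conference papers (InProceedings) with a creator
// who is a computer scientist -- the kind of query SchemEx helps to route.
let federatedQuery = """
    PREFIX ex: <http://example.org/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?x WHERE {
        ?x a ex:Document , ex:InProceedings ;
           dc:creator ?c .
        ?c a ex:ComputerScientist .
    }"""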
We can precisely compute what this looks like. We were then wondering about exploiting
principles of locality. One thing was to look at streams of data that come in: when we parse
the data out there on the linked data web, we don't need to look at the full
payload; even if we just look at parts, we get rather good quality. So we do a
computation that's restricted to a piece of memory that we can handle easily. And we looked
empirically at how well this works, and we find that actually, even with rather small cache sizes of
just a hundred K or so, we end up with rather good values for precision and for recall. So
we lose a little bit, but we don't lose too much even with rather small cache sizes. And that was
actually also something we presented two years ago at the Billion Triple Challenge. The
Semantic Web Conference always has these kinds of challenges for handling large data. It
was much appreciated, because on this kind of hardware we could handle it very
efficiently, and it gave us the first place. So that concludes this part. We still need to explore further
the schema induction part and the query federation based on these structures.
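To make the index idea a bit more concrete, one could picture a much simplified SchemEx-style index as a map from a schema pattern — a type cluster plus the property combination from the 1-bisimulation — to the data sources containing matching entities. The following is a toy F# sketch, purely illustrative and not the actual SchemEx data structures.

// A schema pattern: the classes an entity has (type cluster)
// and the properties it uses (from the 1-bisimulation).
type SchemaPattern =
    { TypeCluster : Set<string>
      Properties  : Set<string> }

// Toy index: pattern -> data sources that contain conforming entities.
let index : Map<SchemaPattern, string list> =
    [ ( { TypeCluster = set [ "Document"; "InProceedings" ]; Properties = set [ "creator" ] },
        [ "DBLP" ] )
      ( { TypeCluster = set [ "Document" ]; Properties = set [ "creator" ] },
        [ "DBLP"; "ACM" ] ) ]
    |> Map.ofList

// Route a query: which sources should we ask for this pattern?
let sourcesFor pattern =
    index |> Map.tryFind pattern |> Option.defaultValue []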
One thing that we were intensively discussing also in the seminar last summer was, how to
handle the querying of data in the linked data Web. There are two main approaches right now.
One is a very fine-grained one: you always ask a URI, give me your description. You ask the
individual object. You can do that; it's easy to publish that. But if you want to do this for many
objects, it just doesn't scale, because you have too many messages. There's the other, very
expressive one, where you open up your data source to arbitrary SPARQL queries. It's very
expressive: you can ask select star and you get everything, or do all kinds of interesting
restrictions, but it's not a very stable solution. At least right now, a SPARQL endpoint is not very
stable, and it costs you a lot if people do arbitrary queries. So we're discussing whether
something like a restrictive query language — whether it's NPQL or something in this direction —
might be quite a nice trade-off between being too fine-grained, costing too many queries and too
many network messages, and something which is too expressive, like SPARQL, which costs you too
much. But it's still an ongoing discussion. And that brings me to my last slide: what is the future of
these kinds of approaches in general, and what do we look at? Searching for this
data would of course include stuff like keyword indices — stuff we had in other systems we
did, but not in what I described today. Understanding this data: there are a few papers out
there now, from our group and from two or three other groups, that look at what it means
to look at statistics of this kind of data to derive a schema, to understand what the
important properties are. Most properties that are allowed according to the schema are not filled,
so we have something like a Zipf distribution of how these properties are used. All right, you can look
at schema.org — stuff that's supported by Bing, stuff that is supported by Google — and these
properties are not uniformly used over the different sites. And then of course more intelligent
queries. I mean, the one query I indicated now for SchemEx was still a rather fixed one,
where you know exactly what properties you are looking for; that will not always be the case. And
programming against the distributed data — that's something which has been rather ignored in the
semantic web community up till now. And it may be quite nice if you want to bring this community on
board, because it's a pain really to write string-based queries that are very brittle and where you
don't have type safety and all those things. And we hope that our kind of approach helps in this
direction. So, thank you for your attention.
>> [applause]
>> Evelyn Viegas: Are there any questions?
>> Audience: You know, I would just like to know some more about the kinds of tasks that were used for this.
>> Steffen Staab: So we have some more information on the webpage. Are the individual
questions also there?
>> Martin Leinberger: I'm not sure.
>> Steffen Staab: I'm not sure. We should put them there. That was the plan. Let me see, but
definitely you can explore more on this webpage and if something is missing from there, just
send me an email and we'll provide it to you in one or the other way. Here, that was the
webpage.
>> Evelyn Viegas: All right, thank you.
>> [applause]