Document 17864769

>> Nikhil Swamy: All right. Thank you all for coming. I'm really
delighted to welcome Philip Wadler to MSR. He's visiting us for the
next couple of days. Most of you probably know Phil from his work on
all kinds of stuff. Functional programming in Haskell, Aizu
programming in generics for Java, databases and XML and query languages
related to that stuff. And I guess he's going to talk mostly about
query languages today and how that fits in with general purpose
programming languages.
So looking forward to hearing about that. Also a little plug for
tomorrow. So at an unspecified time tomorrow Phil will talk about --
>> Philip Wadler: Unspecified location.
>> Nikhil Swamy: And unspecified location, Phil will talk about
relational parametricity. If you're interested in hearing about that,
send me an e-mail. I'll announce it on some of the Rise lists. If
you're not part of Rise, send me an e-mail and I'll tell you where that
meeting will be at some point tomorrow. Without further ado, Phil,
managing query --
>> Philip Wadler: Thank you very much. It's a great pleasure to be
here and to be able to talk to this audience. I'm particularly pleased
I know some of the people in here are researchers and some are
developers. And we'll talk about both those things. So thank you very
much for coming.
Just a bookkeeping thing, to mention, for the more researchy ones of
you, this talk that Nikhil mentioned, a long time ago I did something
called theorems for free. And the way the magic in it works is through
something called semantic parametricity due to John Reynolds. The way
that semantic parametricity is explained is not surprisingly with
semantics. And turns out the semantic explanation of it gets a bit
weighty and difficult. So a few years ago I wrote a paper called the
Girard-Reynolds Isomorphism that explains the same thing but in a
simpler way using the translation between two systems, one representing
the programming language and one representing a logic. And they're
interesting translations in fact both ways.
If you're at all interested in semantic parametricity this is a bit
easier way of explaining what's going on. If you know about it and you
want some extra insight or you don't know about it and you want to know
what it is, then this would be an appropriate talk.
And this is work I did many years ago that unlike most of my other work
is not very highly cited and I think it should be. So I thought I'll
go and talk to people about it. Right. Let's see. So that's the
first thing out of the way. The second thing out of the way is Judith
what are you doing here? You've seen this talk.
>>: [indiscernible].
>> Philip Wadler: You can. This is the same talk I gave at DDFP,
same title even. But there's more data. We've got more data since I
gave it at DDFP.
>>: I have my questions, you see.
>> Philip Wadler: Even better. So the last time I gave this talk at
DDFP and also at the Midlands Graduate School, it had this title: The
Essence of Language Integrated Query. I'm pleased to tell you that
this paper will be presented at ICFP but not with this title.
They said: Oh, no, we're not sure that's really the essence of
language integrated query. So they said you have to change the title.
So the new title is called a Practical Theory of Language Integrated
Query. In many ways this is a better title because in fact what I'm
going to show you is that there's a practical side and a theoretical
side to what we're doing and I'll talk a little bit how those interact.
But in some ways it's a worse title, because this title alludes to one
of John Reynolds' classic papers, the Essence of Algol.
John Reynolds died recently. So it's a shame that that bit of tribute
to him has gone away.
I should mention, by the way, that the work on semantic parametricity
is also John Reynolds and so the talk tomorrow will certainly be a
tribute to him. So the tribute's gone away, but his influence, of
course, goes on.
Right. What is the difference between theory and practice? You all
know the answer to this riddle. In theory, there is no difference but
in practice there is. How many people -- that's an old saw. How many
people have heard that joke before? Yeah, all of you.
>>: Attributed to young [indiscernible].
>> Philip Wadler: That I did not know.
>>: But maybe he also [indiscernible].
>> Philip Wadler: Oh, okay. Good. I will get that citation from you
afterwards so I can add it. So this is going to be our touchstone for
the talk, what is the difference between theory and practice. And I'd
like this to be an interactive talk, you guys all just interacted with
me, good. If I say something you don't understand, please do ask.
There are many, many, many database programming languages. Here are a
few of them. I don't think I'll go through these in any detail at all,
except to point out it goes all the way back to Kleisli many years ago
and done by people at Penn who are now my colleagues at Edinburgh, so I
ought to mention them. But because Kleisli in fact is named that
because of the use of monads in this work, influenced by
some things I had done and it in turn has a deep influence on what I'm
going to show you today.
So there are many, many systems here.
>>: [indiscernible].
>> Philip Wadler: Thank you. I better fix that. This is the one that
I worked on. And one of the points is that all the way back to here
you'll see the roots of the ideas I'm talking about. So these are very
deep ideas, but I don't want to claim credit for that. What I'm going
to be talking about is how do you take one of these database query
languages that's a nice database query language and integrate it with a
programming language that's a nice programming language? You've got
the database language. You've got the programming language. How do
you get them to play nicely together?
And that, of course, is exactly what LINQ is intended to address. And
LINQ was in fact in part based again on ideas of monads that go back to
the Kleisli work and to my own work.
Okay. And we're going to show a way of making this go even better.
And these are our goals. What I'm going to do first is skip to the end
and show you something. I really should move this slide to the
beginning. So there's the whole talk for you.
What I wanted to show you is this bit of results. So one of the
contributions of this talk is I'm going to go through some of a small
set of examples, and there are more examples in the paper. And these
all cover a set of things that I think are important to do. In one way
this is the essence of language integrated queries. We've tried to
extract the essence of what are some of the important things we want to
do.
So these are all sample programs you can see in the talk or read in the
paper. And here's what happens if you try to compile them under F# 2.0
or F# 3.0. And those little Xs mean it didn't compile. It fell over
waving its legs in the air. You can see F# 2.0 falls over on some things
and F# 3.0 falls over on some different things. And our stuff doesn't
fall over on any of these. In fact, our stuff is guaranteed not to
fall over on any of these. We will give you a theory. So here's the
theory bit of it that says for things in this subset it always works.
So that's quite nice. You can know for a certain set of programs that
you might write as queries that they will not fall over kicking their legs
in the air. Now the subset we deal with, this is the difference
between theory and practice, the subset is smaller than everything you
might want to do. There are other things you want to do. You will see
that the key idea that we're doing here is we're taking programs and
normalizing them. This is the amount of time it takes to normalize.
The nice thing if you're accessing a database, accessing a database is
expensive. So normalization, down in the noise. Right? You can
afford the time to normalize if what you're doing is accessing a
database. And that's what these figures show you.
But as I say we only deal with a subset of queries in the theory. What
happens in practice? So we took every single query that's in the F# 3.0
documentation and ran it through our system in the normalizer, and the
point is they all work. So that's the practice side of it. In
practice, this could be put in F# 3.0 today and more things would work and
nothing would break.
So that's the take-away lesson. So let me go back and explain how it
works. So I mentioned that part of what we want to do is extract out a
set of problems that capture the kinds of things you want to do. So
here's my list to explain the kinds of things we want to do,
abstraction over values. Pretty much everybody can do that.
Abstraction over predicates. You can do higher order queries,
composition of queries, composition is always a good thing. Dynamic
generation of queries. That's very important.
And something that's often not dealt with well. And type safety is a
nice thing to have. We also want very much the Goldilocks property.
The number of queries that you generate should be not too few and not
too many. And so what's just right in this case is very easy to
characterize, just right means exactly one.
So one query in your program should turn into one SQL query sent to the
database. And I'm going to be working only with SQL, but I believe
that the ideas I'm showing would extend to other things, for instance,
queries written in a different query language like XQuery. Possibly
even these ideas could extend to integrating GPU code into a general
programming language, or something like that.
So we want exactly one. We don't want too few. What too few means is
no queries came out, it couldn't do it. Failure, kicking its legs in
the air and not too many. You don't want to have what looks like one
query end up generating a thousand different queries, and one of the
things I showed you in that result table does generate literally a
thousand queries from one query, and you don't want that to happen.
And then as I mentioned, the theory is not everything you could write
in LINQ. The theory is basically dealing with select from where
queries and exist and union. What we don't deal with is group by or
sort by, and those of course would be very important. It would be nice
to extend the theory to include those.
As I've mentioned the practice includes those already. But the theory
doesn't yet. We can't give you any guarantees about them. And just as
a notational convention doing exactly what happens in LINQ, every time
I say list, I mean bag, because we're going to a database, the ordering
is not relevant. As I mentioned we're not dealing with sort by. Okay?
So I will say list and bag interchangeably. Is everybody familiar with
bag? Some people prefer multi-set. What do people use here, bag or
multi-set? Fine, I will continue to say bag.
Okay. So here's an example. So I've got a database consisting of some
people and their ages. And some couples. And couples here have a her
and a him. If you've been paying attention to what's going on with the
law in the UK, very soon I'm going to have to apply schema update to
this schema. But for the time being we'll be old-fashioned and have
her and him.
>>: I don't know how you're going to do that since the columns are
distinct.
>> Philip Wadler: Yeah, it's much easier to have a her and a him. You
can have a partner one and partner two.
>>: Yeah, but the problem is how do you really have two partners,
neither one is first nor second.
>> Philip Wadler: Yes, fortunately that entire question is orthogonal
to this talk. So here's a typical query written in SQL. Find all the
women -- what we're going to return is the women's name and the
difference between the woman's age and the man's age. And what we're
going to be drawing from is couples as C, people as W. People as M.
And of course W is going to be woman. M is going to be man. And the
her must be the woman's name and the him must be the man's name and the
woman's age is greater than the man's age. This is just finding all
couples where the woman is older than the man and we're printing the
woman's name and the difference in the ages.
Indeed, if they were symmetric it would be hard to use this example.
So here's our database. Well, wait, no, here's our database. But what
we want to do is look at the data in the programming language. So we
want some way viewing the database as data in the programming language.
There's an old idea about how we do this. We are going to view each
table as a bag of records. And since we have several tables we
actually have a record of bags of records.
So here we have a record with two fields people and couples. And
people is this bag of records and each record has a name and an age.
And couples, again, is a bag of records where each record has a her and
a him. Everybody suitably bored? If there's something you don't
understand, please do ask a question. No questions yet? Yes?
>>: Before you alluded to just when you make a query and just kind of
keels over and puts its legs in the air, what exactly does that mean in
essence? Does that mean the query is just invalid or --
>> Philip Wadler: We'll see examples where it might have trouble
generating a suitable SQL query. So this is what happens in LINQ. You
can write LINQ things that denote queries, but at runtime it tries to
turn that into an SQL query and it fails. That's what I meant by
kicking its legs in the air.
Our type of database here is just a record of lists of records. So
this is just encoding the type of the database. I'm saying here, okay,
so DB prime is going to be this data. And I've put a prime on it to
indicate you don't really want to do this. I'll explain why in a
minute. But the idea is we've got a construct that says read in the
whole -- look at the people database. Read it in. Convert it to a
data structure and store that in memory.
And then we could write out our query now in ordinary F#. This is
ordinary F#. This corresponds exactly to our query. It says iterate
over the couples table. Let C vary over that. W varies over
people. M varies over people. Do a condition, just the same one as we
had before. The her field of C is the same as the name field of W. The
him field of C is the same as the name field of M, and the age of W is
greater than the age of M, and we yield as before a name field which is the
woman's name and a difference field which is the difference between the
woman's age and the man's age and there we get the answer. There,
we're done. We've now integrated our queries into our programming
language, and we're writing them just as ordinary programs in our
programming language. This is fine.
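
As a rough illustration of the in-memory version just described (a Python sketch, not the F# from the slides), the query is three nested loops over a record of bags of records. The names and ages below are assumed sample values, since the slide data is not reproduced in this transcript.

```python
# Assumed sample data: a record (dict) of bags (lists) of records (dicts).
db = {
    "people": [
        {"name": "Alex", "age": 60}, {"name": "Bert", "age": 55},
        {"name": "Cora", "age": 33}, {"name": "Drew", "age": 31},
        {"name": "Edna", "age": 21}, {"name": "Fred", "age": 60},
    ],
    "couples": [
        {"her": "Alex", "him": "Bert"},
        {"her": "Cora", "him": "Drew"},
        {"her": "Edna", "him": "Fred"},
    ],
}

def differences(db):
    # c ranges over couples, w and m both range over people,
    # so the work is |couples| * |people|^2 when done in memory.
    return [
        {"name": w["name"], "diff": w["age"] - m["age"]}
        for c in db["couples"]
        for w in db["people"]
        for m in db["people"]
        if c["her"] == w["name"] and c["him"] == m["name"] and w["age"] > m["age"]
    ]

result = differences(db)
```

With the assumed data, only the couples where the woman is older survive the condition.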
Why did I put a prime down there? Well, of course this is fine except
for one tiny problem. And the tiny problem is doing it this way is
insane. Other than that, it's perfectly fine. Why is this insane?
Most of you will probably have already figured this out. Here I've got
about ten records. Of course, in a real database I might have 10,000
or 10 million or 10 billion or 10 trillion records, and reading 10
trillion records into the computer's memory might not be feasible.
The other problem is if you look at this query, this specifies for C
ranging over all of couples. For W ranging over all of people and for
M ranging over all of people. So these are large, it would be the size
of couples times the size of people squared, because we ran over people
twice. So that could be huge. This is at least cubic in the size of
the database. And what you'd like is something that runs much faster
than that. Of course, with indexing as an actual SQL query, this would
run much faster than that.
So apart from the fact that this is insanely inefficient, it's perfect.
So what we'd like to do is write something like this and generate SQL
and actually have the SQL execute against the database. How do we do
that? So the key idea which already exists and is used in F# for
exactly this purpose is quotation.
So the claim is quotation is the essence of language integrated query.
So a quotation just means a data structure that represents an
expression. And the type of a data structure representing expression
of type A will be written Expr A. So here I've got Expr DB is the type
of quotations that return values of type DB, where DB is exactly what
we had before. And quotations in F# are written in these brackets
written angle bracket at and at angle bracket. And the quoted code I
will always have written in blue. So now instead of saying actually
read in database people, we say just bind DB to a quotation of
something that stands for access to database people. So now, before,
remember, we just had here's an expression in F# of this type. It
returns a list of name difference records. Now we're going to return
an Expr of the same thing.
So we just write the thing in quotation brackets. And other than that
it's exactly as it was before. And now what we do instead of just
saying differences, we say run of differences. So run is going to be
this construct in our language that takes a quoted thing and runs it
against the database. And we get the same answers as before. So what
does run do? It computes the quoted expression. It simplifies the
quoted expression or normalizes it. It then takes the normalized
quoted expression and translates that to SQL. We will see that once
you've done normalization, it looks almost exactly like SQL. So this
step is very easy. You ship it off to the database and run it. That
step's very easy. You get back a table. You translate that table back
to a data structure in the host language that can be processed. So all
of these steps are very easy. Turns out this is rather easy as well.
So in fact all these steps are pretty easy. The hard work, of course,
is actually executing the SQL query, but we're using existing SQL
implementations to do that.
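
The last steps of run can be sketched as follows, assuming normalization has already happened. This is a hypothetical Python sketch using the standard sqlite3 module, not the F# implementation from the talk: NormalQuery, to_sql and run are illustrative names, and the table contents are assumed sample values.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class NormalQuery:
    """A query already in normal form: nested for, one if, one yield."""
    sources: list    # [(table, alias), ...] from the nested `for`s
    condition: str   # the `if` test, written over the aliases
    fields: dict     # output field -> expression, from the `yield`

def to_sql(q):
    # After normalization the structure maps directly onto SELECT/FROM/WHERE.
    select = ", ".join(f"{expr} AS {name}" for name, expr in q.fields.items())
    tables = ", ".join(f"{table} AS {alias}" for table, alias in q.sources)
    return f"SELECT {select} FROM {tables} WHERE {q.condition}"

def run(conn, q):
    # Ship the SQL to the database and turn the table back into host data.
    cur = conn.execute(to_sql(q))
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.execute("CREATE TABLE couples (her TEXT, him TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alex", 60), ("Bert", 55), ("Cora", 33),
                  ("Drew", 31), ("Edna", 21), ("Fred", 60)])
conn.executemany("INSERT INTO couples VALUES (?, ?)",
                 [("Alex", "Bert"), ("Cora", "Drew"), ("Edna", "Fred")])

differences = NormalQuery(
    sources=[("couples", "c"), ("people", "w"), ("people", "m")],
    condition="c.her = w.name AND c.him = m.name AND w.age > m.age",
    fields={"name": "w.name", "diff": "w.age - m.age"},
)
rows = run(conn, differences)
```

The sketch leaves out the normalizer itself; it only shows why the translation step is easy once the query has that shape.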
And now here's the guarantee. If you stick to the subset language that
I mentioned, which is just the only construct we're going to use are
for, if, yield and also concatenation of bags and the exist construct,
which given a bag returns true or false depending on whether or not
it's empty. So just testing a bag for emptiness, unioning bags and
for, if and yield. If you just write your program using those and in
addition your answer type is what is called flat, meaning it's a bag of
record of scalars. So we've seen tables are just that. They're bags
of record of scalars. So if your answer is a table, which it better be
if it's the answer from an SQL query. The first requirement is just
the answer has to have the obvious type for a table return from an SQL
query.
Second thing is you only do permitted operations. So those are for,
if, yield, as we just saw. Union, exist, and, right, if you're doing
addition or less than or whatnot, they better be operations that the
database supports. Check that this number is prime? Can't do that unless
it's actually supported in your database language, which last time I
looked at SQL it did not have an is prime primitive. And in particular
recursion is not generally supported in SQL. You can't use recursion.
And finally, your uses of database must all be consistent. Right? If
we expand this program out each of these expands to a use of database,
they all better be the same database, right? If you try to access two
different databases then you'll be in trouble. Of course we'll put as
many tables in the database as we like, but it all had better be one
database. So all these constraints are I think quite reasonable if you
want to translate into SQL.
Okay. So I mentioned all these great things we want to be able to do.
Abstraction. Composition and dynamic generation of code, how does that
work? So the first thing you want to do, of course, is to be able to
abstract over values. So here's something range. And range takes a
pair of integers and returns a list of names. It must be a table
again. So we make a list of records. Each containing a name field.
And we're going to find everybody whose age is greater than or equal to
the first integer, and less than the second integer.
So it's a function for each W in the database. Check the age. And
notice this is a range over all people. So it's women and men. And
then we yield the name. So we're going to find all people whose age
they're 30 somethings between greater than or equal to 30 less than 40
and that's Cora and Drew. Notice, by the way, that the abstraction
here is going on in the quoted code. I'll say more about that later.
But that's actually slightly surprising. I'll come back to that. So
that's fairly straightforward, right? Everything supports abstracting
over values. That's the obvious thing that you want to be able to do.
A more sophisticated thing to do is abstracting over predicates. So
now let's take an arbitrary predicate over integers and return a list
of names. So now we take the predicate as an argument and again
iterate over everybody in people. And if the age satisfies the
predicate then we return the name. The predicate of course can be
between 30 and 40, and then it would return Cora and Drew as before or
the predicate might be people whose age is even because mod works
because it's supported in SQL. Again if we had prime here it wouldn't
work because prime is not supported in SQL.
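
One way to picture abstraction over predicates is to let the predicate be a function on quoted code rather than on values. The Python sketch below is an assumption-laden stand-in for F# quotations: quoted code is modeled as SQL text, satisfies is a hypothetical helper, and the table contents are assumed sample values.

```python
import sqlite3

def satisfies(p):
    # p takes the quoted age expression and returns a quoted condition,
    # so it can only be built from operations the database understands.
    return f"SELECT w.name AS name FROM people AS w WHERE {p('w.age')}"

def thirties(x):
    return f"30 <= {x} AND {x} < 40"

def even(x):
    return f"{x} % 2 = 0"   # mod is fine because SQL supports it

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alex", 60), ("Bert", 55), ("Cora", 33),
                  ("Drew", 31), ("Edna", 21), ("Fred", 60)])

in_thirties = sorted(r[0] for r in conn.execute(satisfies(thirties)))
even_aged = sorted(r[0] for r in conn.execute(satisfies(even)))
```

An arbitrary host-language predicate such as a primality test has no SQL text to produce here, which mirrors the restriction described above.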
So now we're invoking satisfies on an arbitrary function. Again, the
function not surprisingly has to be written in the quoted language. So
everything here is just inside quotes. Any questions about that yet?
Okay. So now we can extract -- yes?
>>: So the expression of integer to names, I'm surprised it's not an
expression of integer to an expression of names.
>> Philip Wadler: Right. So this is exactly the point I just
mentioned. I said it's slightly surprising that this bit of the
abstraction and this bit of the abstraction is happening inside the
programming language. And you just confirmed that by saying I'm
surprised. I expected it to be Expr of int to Expr of names. I'll
return to that later. I'm glad you said you were surprised. I was
surprised, too.
Okay. And then, of course, we want to compose queries. So here's
something getAge, which is going to run over a loop of people and if
the name matches, returns the age. And notice that using these
constructs, the best we can get coming out is actually a list of ages,
because there's no way of converting a list to a single value. That's
not one of our supported operators. So the closest we can get to
getAge is return a list of the ages.
And then we can compose getAge with range. So given two strings, we
will return everybody. We'll find the age of the first person, the age
of the second person and we'll return everybody whose age is between,
that's greater than or equal to the age of the first person and less
than the age of the second person. So these are all people that are at
least as old as Edna but younger than Bert. That turns out to be Cora,
Drew and Edna.
Now we're just in a very straightforward way composing queries to build
a larger query. So we call getAge. It's going to return list. Let A
be everything in that list. Actually, there's only one possibility.
The same for B and find everything in range.
We'd like to turn this into just -- right. [indiscernible] was kind
enough to read this paper and he said what's hard about that? Execute
getAge on the database. Execute getAge again on the database, execute range
on the database. Three queries, you've got your answer. Yes indeedy.
But we don't want to do it that way. We want one query. Every time we
call run, it should be a single query.
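
To make the goal concrete, here is a Python sketch that composes the two queries in memory and compares the answer with one plausible single SQL query for the same composition. The SQL text is an assumption about what a normalizer could produce, not output quoted from the talk, and the data values are assumed.

```python
import sqlite3

people = [
    {"name": "Alex", "age": 60}, {"name": "Bert", "age": 55},
    {"name": "Cora", "age": 33}, {"name": "Drew", "age": 31},
    {"name": "Edna", "age": 21}, {"name": "Fred", "age": 60},
]

def get_age(s):
    # Returns a bag of ages, since there is no list-to-value operator.
    return [u["age"] for u in people if u["name"] == s]

def range_(a, b):
    return [w["name"] for w in people if a <= w["age"] < b]

def compose(s, t):
    # Three logical queries glued together in the host language.
    return [n for a in get_age(s) for b in get_age(t) for n in range_(a, b)]

# One query instead of three: the two getAge calls become extra joins.
SINGLE_QUERY = """
SELECT w.name AS name
FROM people AS u, people AS v, people AS w
WHERE u.name = ? AND v.name = ? AND u.age <= w.age AND w.age < v.age
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [(p["name"], p["age"]) for p in people])

in_memory = sorted(compose("Edna", "Bert"))
from_sql = sorted(r[0] for r in conn.execute(SINGLE_QUERY, ("Edna", "Bert")))
```

Both routes give the same bag of names on the assumed data; the point of the normalizer is to get from the first form to the second automatically.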
And it turns out turning that into a single SQL query is not entirely
trivial. And then finally of course we would like dynamically
generated queries. This is extremely important. You fiddle around
with your Web interface. This is almost every Web program in
existence. What you're really doing is through the Web interface
building some data structure, turning that data structure into a query,
executing that query on a database, and then displaying the answer.
That's pretty much every Web application in existence.
So as a simple example of that, here's a data structure that represents
predicates. So above will mean greater than or equal to the given
integer. Below means less than the given integer. We can take and/or
of arbitrary predicates. So not surprisingly this structure T0 of type
predicate represents the query we had before, which are people at least
30 and less than 40. And this is something different, not of the or of below
30 and above 40, which happens to extensionally be the same predicate,
not the same piece of code but true for the same values. Any questions
about that? That's how we represent predicates.
How would we turn this into a query? We use our Web interface. We
build up a predicate and we want to query against the database. Again,
this can be pretty straightforward. I'll write a function P that takes
a predicate and returns an Expr of int to bool that, given an age,
says whether the predicate is satisfied. There's an operation
that takes an integer into an Expr of integer, which is called lift.
This percent sign means splice into some code. We've been using that
before. To splice into our database or call to getAge or to range.
People didn't ask about that. So I assume it's fairly straightforward.
So that just takes a value of type Expr and splices it in to get a
bigger value of type Expr. In this case what you want to splice in is an
integer, so we convert the integer with lift to an Expr of integer, and we can
splice it in.
And this is the obvious thing, the function over X that if the given
integer is less than or equal to X. Similarly for below. And then for
"and" we just recursively apply P to T and U to convert those into
predicates, apply each of those to X and "and" the results. Similarly for
or, similarly for not. You couldn't imagine a much more
straightforward piece of code than this.
Now, remember our database doesn't handle recursion. This is, of
course, recursive code. We're recursively invoking P. But we're doing
the recursion at query generation time, not query execution time. So
recursion at query generation time is fine. Query execution time is
not.
Notice also, remember, you said before you wanted something type was
something goes to Expr of something. Here we have this. The type of
this code is given a predicate return an Expr of int to bool. The
predicate argument can't be inside Expr because we're taking apart the
predicate at query generation time not query execution time.
So things are inside the Expr when we want them to happen at query
execution time, outside when we want them to happen at query generation
time.
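
The generation-time versus execution-time split can be sketched in Python as below. This is an illustrative stand-in, not the talk's F# code: quoted code is modeled as SQL text, the class names follow the predicate constructors mentioned above, and P is written as a function from a predicate to a function on quoted expressions.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Above:
    n: int

@dataclass(frozen=True)
class Below:
    n: int

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

@dataclass(frozen=True)
class Not:
    arg: object

def P(t):
    # The recursion on t runs now, at query generation time.
    # Only the text it returns is executed later by the database.
    if isinstance(t, Above):
        return lambda x: f"({t.n} <= {x})"
    if isinstance(t, Below):
        return lambda x: f"({x} < {t.n})"
    if isinstance(t, And):
        return lambda x: f"({P(t.left)(x)} AND {P(t.right)(x)})"
    if isinstance(t, Or):
        return lambda x: f"({P(t.left)(x)} OR {P(t.right)(x)})"
    if isinstance(t, Not):
        return lambda x: f"(NOT {P(t.arg)(x)})"
    raise TypeError(f"not a predicate: {t!r}")

t0 = And(Above(30), Below(40))
t1 = Not(Or(Below(30), Above(40)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alex", 60), ("Bert", 55), ("Cora", 33),
                  ("Drew", 31), ("Edna", 21), ("Fred", 60)])

def names_where(t):
    sql = f"SELECT w.name FROM people AS w WHERE {P(t)('w.age')}"
    return sorted(r[0] for r in conn.execute(sql))

answers0 = names_where(t0)
answers1 = names_where(t1)
```

The predicate value t0 lives outside the quoted code, which is why its type is predicate rather than a quoted predicate; only the generated condition is quoted.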
And then not surprisingly, P of T0. You just expand that out using
the code I gave you before and it turns into this, which is kind of
messy, because we've got -- here's the above 30. Here's the below 40.
And then here's the "and" of those two things.
And this is, of course, needlessly complicated code but we'll normalize
it. And the normalizing here is very easy. We just substitute X for X
in this case. And we end up getting this for the normalized code,
which is fine for executing. So now I can do satisfies of P applied to
T0. And this gives, of course, the same answers as before. And of
course since T1 is the same predicate, this gives the same answers as
before. Now we've got dynamically generated code because we can build
T0 at runtime.
So notice that T0 here, its type would be predicate, not Expr of
predicate. But it generates an Expr. Okay. Any questions about that?
That's the basic technique. There's one other thing we can do, which
is really cool, which is nesting.
Sometimes it's very useful to build a non-flat data structure as part
of fielding your query. I'll give you what I think is a compelling
example of this. And our point is going to be it's perfectly fine to
nest as long as the answer itself is not nested but flat.
Again, a good area for future work is what do you do if you want the
answer itself to be nested. There's nice work by the Ferry group, which I
showed at the beginning. They show that if your nesting is D deep, if you
have a list of list of lists for your answer, that's three deep, they
can do it with three queries, not one, but three queries, the number
depending on the depth of nesting. That's work that's been done.
Adapting that to this work would be interesting future work. Let's
restrict ourselves, the answer is flat but there's maybe some nesting
in the query. Why would you want to do that? Here's an example.
Here's some company. It's got four departments. Products, quality,
research and sales and each department has some employees.
And each employee has tasks they can do. So Alex knows how to build
stuff. Bert knows how to build stuff. Cora knows how to abstract,
build and design things. Fred knows how to call. And so that makes
sense because like Fred works in sales. So it makes sense he knows how
to call. Cora knows how to abstract. That makes sense because she
works in research. Alex knows how to build that makes sense because
she works in product. So how would I represent this organization?
It's got three tables, departments, employees and tasks that we've just
seen. So there's our organization represented and ready to do a query.
Here's the query I want to do. Find departments where every employee
can do a given task. So here I'm asking find departments where every
employee knows how to abstract. Turns out there are two departments
like that. One is research. Not surprisingly.
The other is quality. That's maybe surprising. But not too
surprising, because if you go back and look you see that the quality
department in this company has no employees. So if you ask is it true
that every single employee in quality knows how to abstract, the answer
is yes because there are none of them.
So here's the query. Very straightforward, right? Well, let's see.
We go over all the departments. Let D be a department. Let's go over
all the employees. Find those employees that are in that department.
And then for those find -- let's go over tasks. Find all the tasks
done by that employee and check if the task is this task U that we're
interested in. So U is the name in this case abstract.
So now we're forming an inner list here for each given employee of all
the tasks that that employee can do and we're saying, right, if one of
the tasks the employee can do is the given one, then yield an empty
record. So then this list will be non-empty if the employee can do the
given task. Then we take the negation of that. So this is now the
list of all employees that cannot do the given task. We test that list
for emptiness and will yield something if that's true. And if that list is empty,
then that means there's no employee in that department that cannot do
the given task.
Nice straightforward piece of code, right? If you were writing SQL,
the exact analog of this is the cleanest thing you could write. This
is actually the native code that somebody would write in SQL to answer
this query.
But as we've just seen, it's not really very clear. Can we structure
this in a way to make the query easier to read? So that involves
nesting. So here's a more logical way of structuring the data that we
have. We're going to have a bag of departments -- a bag of records.
And the record has a department name and a list of all the employees.
For each employee we have the employee name and a list of all the tasks
that employee can do.
So Cora can do three tasks. Drew can do two tasks and so on. And
quality has no employees. So now we want to build the nested data.
That's easy. Here's the straightforward query that takes the three
flat tables that I showed you and gives you a nested organization
structure.
Very straightforward. Then given a nested organization structure, so
here's some higher order queries you might want. And these all
actually exist in LINQ with pretty much these names. "Any", which
takes a list of As and a predicate over As and returns true if some
value in that list satisfies the given predicate. So this is just the
exists predicate. "All" is the dual of that. So run "any" over the negation
of the predicate and check that nothing satisfies it; this is just the
normal way of defining for-all in terms of there-exists using double
negation, using De Morgan's law. This returns true if everything in the list
satisfies the given predicate. And "contains" is just checking whether,
given a list of As and a value U, there's some value in
the list that is equal to the given U. So given a list of Xs and a value U,
the predicate we use with "any" is just X equal to U. So this is a check
for containment: check whether the value U appears in the list.
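The three operators can be sketched in a few lines of Python. The trailing underscores only avoid clashing with Python's built-in `any` and `all`; the definitions follow the description above rather than any particular library.

```python
def any_(xs, p):
    """True if some element of xs satisfies p (the exists predicate)."""
    return len([{} for x in xs if p(x)]) > 0


def all_(xs, p):
    """True if every element satisfies p, defined via any_ and negation."""
    return not any_(xs, lambda x: not p(x))


def contains(xs, u):
    """True if u appears in xs."""
    return any_(xs, lambda x: x == u)


print(any_([1, 2, 3], lambda x: x > 2))
print(all_([], lambda x: False))
print(contains(["build", "design"], "abstract"))
```

Note that `all_` over an empty list is true, which is why a department with no employees passes the expertise test.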
So now we can rewrite expertise much more cleanly. Remember, before we
had expertise prime. Here the prime doesn't mean this is insanely
inefficient, here the prime means this is insanely difficult to read.
So here's an equivalent query, which I think is not insanely difficult
to read, which is just for each D in the nested organization, extract
from that the employees, use the predicate that the tasks field of the
employee contains the given task name U. And if that's true, yield
that department name. So again running expertise over abstract we get
quality and research just as before, because the two predicates are
equivalent.
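A sketch of the nested version in Python. The nested data is written out literally here so the block stands alone; the rows are assumed sample data picked to agree with the result quoted in the talk, and `all_` and `contains` are hypothetical helper names.

```python
def all_(xs, p):
    """True if every element of xs satisfies p."""
    return len([x for x in xs if not p(x)]) == 0


def contains(xs, u):
    """True if u appears in xs."""
    return len([x for x in xs if x == u]) > 0


# Nested organization: departments hold employees, employees hold tasks.
nested_org = [
    {"dpt": "Product", "employees": [
        {"emp": "Alex", "tasks": ["build"]},
        {"emp": "Bert", "tasks": ["build"]},
    ]},
    {"dpt": "Quality", "employees": []},
    {"dpt": "Research", "employees": [
        {"emp": "Cora", "tasks": ["abstract", "build", "design"]},
        {"emp": "Drew", "tasks": ["abstract", "design"]},
        {"emp": "Edna", "tasks": ["abstract", "call", "design"]},
    ]},
    {"dpt": "Sales", "employees": [
        {"emp": "Fred", "tasks": ["call"]},
    ]},
]


def expertise(u):
    """Departments where every employee can do task u."""
    return [
        d["dpt"]
        for d in nested_org
        if all_(d["employees"], lambda e: contains(e["tasks"], u))
    ]


print(expertise("abstract"))
```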
Okay. So that's why you might want to do nesting. And now in just a
moment I'll say how we support all this.
>>: SQL should be extended with the nested structures. It would make
so many people happy.
>> Philip Wadler: Possibly you want a bigger programming language.
But I'm not going to address that problem. I'm going to say SQL is a
hard object to move. But you guys can change LINQ if you want to. And
this doesn't even require any changes to LINQ, just a change to the
LINQ query provider, the thing that changes the LINQ expression tree
into SQL. So this is an easy way of getting all that power without
needing to change SQL and getting it easily within LINQ.
Now I'll go to the point you were surprised by and that I think
everybody should have been surprised by. I was surprised by it. I was
so surprised I can still remember exactly where I was when I realized:
Oh, this is surprising! I was riding through -- in fact well named,
the Links, in Edinburgh, with the castle to my left. What's going on
here? So what's interesting here. The way we wrote range, we took a
pair of integers, all done inside the quotes. And you might well have
been surprised by that because you might have expected it outside the
quotes. The natural thing to do is to say, well, really take two
integers, but just to separate out the problems, let's say take two
Exprs of integers, and we want to plunk those in in the right place in
the query. Wouldn't that be the best way of doing it? Well, no,
because that gets in the way of building, doing compositions that
return just one query. So then when we did compose we'd like that to
give a single query.
As it was, that was easy. What would happen if we put the quotations
on the outside? Put the quotations on the outside so let's make the
minimum change we could. So we could also put these strings on the
outside. Let's even leave the strings on the inside. Let's just
assume only this one did the expected thing and put them on the
outside. What goes wrong? What goes wrong is we call -- again we call
getAge of S and getAge of T for A and B. But A and B now are in -- range
now is expecting quoted things. So we give it A quoted and B
quoted. And in fact there are type systems for quotations in which this
is a fine bit of code and everybody is happy. There's something called
MetaML, where this is a perfectly fine thing to do. F# supports typed
quotations, but it does not support the type system of MetaML. What
happens if you try to do this in F#? It gives you an error message. I
will now give you a loose translation of the error message. The error
message says: what are you, crazy? You've given me a quotation for A,
but A isn't bound within that quotation. Turns out A is bound in an
enclosing quotation. But F# doesn't handle that. So if F# could
handle it, this would be a perfectly fine way to do things. But it
can't. So what you have to do is put it on the inside instead. That's
why the surprising thing happens.
>>: Expr is applicative?
>> Philip Wadler: What do you mean when you say Expr is applicative?
>>: I'm thinking applicative in Haskell, actually.
>> Philip Wadler: Right. You have an Expr that's of a function, an Expr
of type A to B, and another of type A, and you want to apply the two.
>>: Yes.
>> Philip Wadler: Here's some whiteboard. So the type of F is an Expr
of A to B. And the type of X is an Expr of A.
>>: Can you apply those -- is there [indiscernible] gives you an IP to
that?
>> Philip Wadler: We just do percent F and that does exactly what you
want.
So if you read the paper -- I'm not going to do it here, but if you
read the paper, there are things in it about what is called
open quotation. Open quotation is where a variable is not bound in the
quotation. These are easy to deal with as long as you have a very
sophisticated type system that includes in the type of the Expr the
types of all -- sorry, the names and types of all the free variables
that appear inside quotes.
So it's very straightforward to deal with. It's just F# doesn't deal
with it. And we show that if you take a very simple version of such a
language in fact basically we say instead of having an open variable,
lambda abstract over it. That's a fairly straightforward thing to do.
That's all we've done all the time. Instead of having free variables,
we just abstract over them.
So in fact if you had perfect -- if you were willing to change your
programming language, you would be perfectly fine to add one of the
many different ways of dealing with open quotation to it. Surprising
result, the aha moment I had, because I was doing it for ages without
realizing it, the paper was almost ready to send off. In the week we
were sending off I suddenly went, wait, this is amazing. We're putting
all the quotations inside rather than outside. And the reason we're
doing it is because then we don't need open quotation.
And you never need open quotation because you can just lambda abstract
instead. And that's all cheap because we're going to normalize things
before we generate the query. So the fact that we've got lots of extra
lambda abstractions that get applied doesn't matter. It all gets
normalized out.
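To illustrate why the extra lambda abstractions are harmless, here is a small Python sketch of quoted terms as tuples with a beta-reducing normalizer. Everything in it is an assumption made for illustration: the term encoding, the names `in_range`, `get_age` and `compose`, and the simplification that `in_range` builds a predicate rather than a full query. Substitution here assumes all bound variable names are distinct, so it is not capture-avoiding in general.

```python
def var(x): return ("var", x)
def lam(x, body): return ("lam", x, body)
def app(f, a): return ("app", f, a)
def const(c): return ("const", c)
def op(name, *args): return ("op", name, tuple(args))


def subst(t, x, s):
    """Replace free occurrences of variable x in t by s (distinct names assumed)."""
    tag = t[0]
    if tag == "var":
        return s if t[1] == x else t
    if tag == "const":
        return t
    if tag == "lam":
        return t if t[1] == x else ("lam", t[1], subst(t[2], x, s))
    if tag == "app":
        return ("app", subst(t[1], x, s), subst(t[2], x, s))
    if tag == "op":
        return ("op", t[1], tuple(subst(a, x, s) for a in t[2]))
    raise ValueError(tag)


def normalize(t):
    """Beta-reduce everywhere, including under lambdas."""
    tag = t[0]
    if tag in ("var", "const"):
        return t
    if tag == "lam":
        return ("lam", t[1], normalize(t[2]))
    if tag == "op":
        return ("op", t[1], tuple(normalize(a) for a in t[2]))
    if tag == "app":
        f, a = normalize(t[1]), normalize(t[2])
        if f[0] == "lam":
            return normalize(subst(f[2], f[1], a))
        return ("app", f, a)
    raise ValueError(tag)


def has_app(t):
    """True if any application node remains in the term."""
    if t[0] == "app":
        return True
    if t[0] == "lam":
        return has_app(t[2])
    if t[0] == "op":
        return any(has_app(a) for a in t[2])
    return False


# Quotations of functions: each piece is a closed quoted lambda.
in_range = lam("a", lam("b", lam("w",
    op("and", op("<=", var("a"), var("w")), op("<", var("w"), var("b"))))))
get_age = lam("s", op("ageOf", var("s")))
compose = lam("s", lam("t",
    app(app(in_range, app(get_age, var("s"))), app(get_age, var("t")))))

result = normalize(app(app(compose, const("Edna")), const("Bert")))
print(result)
print(has_app(result))
```

After normalization no application nodes remain: the lambdas introduced to keep every quotation closed have all been reduced away.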
So there's nothing deep here, but as I say it was surprising enough to
almost make me fall off my bicycle.
>>: Isn't there another problem, too, that range actually runs the
query already, so you would --
>> Philip Wadler: No, you never run a query until you say run.
>>: Range prime I thought was [indiscernible] to be running.
>> Philip Wadler: No, range prime just had a prime because it's Expr
to Expr.
>>: I see.
>> Philip Wadler: By the way, there's a typo here that should be Expr
names not names.
>>: [indiscernible].
>>: I didn't notice it was outside.
>> Philip Wadler: There's the run.
>>: But still range prime is a -- so you would apply it to runtime
still in the upper --
>> Philip Wadler: Yes.
>>: In a system like MetaML you have the nested quotations. What you're
saying is take MetaML and quotations and start applying...
>> Philip Wadler: This would be perfectly fine --
>>: [indiscernible].
>> Philip Wadler: This would be perfectly fine in MetaML, yes.
>>: But arbitrary MetaML do quotations, the misinformation do
transformation, put all quotations inside and use lambda abstractions.
>> Philip Wadler: So the interesting question is could any program in
MetaML be changed to this other style, and I don't know the answer to
that yet.
>>: So you don't know the exact -- for this system it holds? But you
don't know yet like when would it fail to --
>> Philip Wadler: Yes. I detailed in detail how and Haskell
implements the MetaML system for the quotations. As of a week ago it has
that in it as a feature.
>>: Way advanced.
[laughter].
>>: So behind, Tim.
>> Philip Wadler: So what I'm saying here is in a language that only
supports closed quotations, like F#, obviously you better use closed
quotations if you can rather than open quotations. That works for us
because of the normalization. And if you want a more complicated way of
saying it: prefer quotations of functions to functions of quotations.
That is, where you wanted a function that took a quotation to a
quotation, we've got a quotation of a function.
There's a more complicated example in the paper involving changing
queries over XPath into SQL. Look at the paper. I won't show you
that. But what I do want to do is take a couple of minutes to show you
how it works. So this is scheduled from --
>>: We have the room until three.
>> Philip Wadler: Until three but people probably want to go. People
want to leave, should I finish at 2:30? Is that what people would
like.
>>: You can go up to 2:30 and maybe a bit past.
>> Philip Wadler: Okay. Let's aim for 2:30. So this is a very
straightforward type system for the language we've got. All I want to
show you, this is the typing for expressions. It's just what you would
expect. Notice that we've got recursion as a construct in the language
for expressions.
Here's the typing for quoted terms. It's again exactly what you'd
expect. Everything in blue this time. Notice that we've got access to
the database in the language of quoted expressions, but not in the
language of unquoted. So things outside of quotes you can use
recursion. Things inside you can't.
Now, in practice the way you do this is you just run over the quote and
see if there is anything not permitted there, like recursion. So at
runtime you'll need to check that.
In the theory of the language, we check for that in
the type system. And then the key things are the things that move
between quoted expressions and unquoted expressions. Notice quoted
expressions need an extra thing in the type: things in gamma are free
variables that appear unquoted. Things in delta are free variables
that appear inside the quotes, like fun X where X is inside a quote.
And then the things that move between just move between the two
judgments. Quoting moves from one judgment to the other. And antiquote
takes you the other way around. Run takes the quoted thing and runs it,
at type T where T, remember, is, what was it, bag of records of scalars.
Your table type. And lift, remember, takes something of base type, so
that's O, and turns it into an Expr representing the same thing.
So now we need to normalize. The rules for normalization are all very
familiar. If you've got a function application, substitute. If you
have a record, extract the record field. These are both called beta
reduction. There are very standard rules for for and yield. These are
called the monad laws. If you've got a for over something that's
immediately doing a yield, just substitute. If you have two fors you can
rearrange them. That's called the associativity law. And you have
other laws that are very straightforward. If you've got a for of an
if, turn it to an if of a for. You have a for over the empty record,
that of course would just be the empty record. If you have a for over
concatenation (we didn't even see examples involving this; this is union),
that would just turn into a union of two fors. And of course if of
true and if of false reduce in the obvious ways. Notice we're using the
variant of if-then-else where the else clause returns empty. It's a where
clause in SQL.
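Written out, the rules just listed look roughly as follows. This is a reconstruction from the spoken description in an assumed notation, not a copy of the slide: $[\,]$ stands for the empty collection, $\uplus$ for union, and the usual side conditions on free variables (capture-avoiding substitution, $x$ not free in $N$ for the associativity rule) are taken for granted. The fragment needs `amsmath`.

```latex
\begin{align*}
(\mathsf{fun}\ x \to N)\ M
  &\longrightarrow N[x := M] \\
\{\ell_1 = M_1, \ldots, \ell_n = M_n\}.\ell_i
  &\longrightarrow M_i \\
\mathsf{for}\ (x\ \mathsf{in}\ \mathsf{yield}\ M)\ N
  &\longrightarrow N[x := M] \\
\mathsf{for}\ (y\ \mathsf{in}\ \mathsf{for}\ (x\ \mathsf{in}\ L)\ M)\ N
  &\longrightarrow \mathsf{for}\ (x\ \mathsf{in}\ L)\ \mathsf{for}\ (y\ \mathsf{in}\ M)\ N \\
\mathsf{for}\ (x\ \mathsf{in}\ \mathsf{if}\ L\ \mathsf{then}\ M)\ N
  &\longrightarrow \mathsf{if}\ L\ \mathsf{then}\ \mathsf{for}\ (x\ \mathsf{in}\ M)\ N \\
\mathsf{for}\ (x\ \mathsf{in}\ [\,])\ N
  &\longrightarrow [\,] \\
\mathsf{for}\ (x\ \mathsf{in}\ L \uplus M)\ N
  &\longrightarrow (\mathsf{for}\ (x\ \mathsf{in}\ L)\ N) \uplus (\mathsf{for}\ (x\ \mathsf{in}\ M)\ N) \\
\mathsf{if}\ \mathsf{true}\ \mathsf{then}\ M
  &\longrightarrow M \\
\mathsf{if}\ \mathsf{false}\ \mathsf{then}\ M
  &\longrightarrow [\,]
\end{align*}
```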
>>: Don't you worry about captures at this point?
>> Philip Wadler: Yes. Capture avoiding substitution, of course.
And so these are very standard rules. They go way back. And then we
need some non-standard rules just to make sure it's in SQL format. So
SQL is not completely compositional. You can use union but only at the
top level. If you have a for of a union, turn it into a union of fors.
This, by the way, is the only place where we rearrange the order. This
is why we have bags, apart from the fact SQL supports bags.
So again, if we have a for over an empty list we'll have to
turn it into an empty list; turns out SQL doesn't support that. And
these are just all straightforward things. The most interesting one is
you cannot have two where clauses in SQL. So if you have two
successive where clauses turn them into one.
You can't have -- you can only have a where inside a for. If you have
an if outside the for, push it inside. So this pushes all our ifs to
the end. So let's -- and then this has all the standard properties you
would want for both of these relations.
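The second, SQL-shaping set of rules might be written as below. Again this is a reconstruction from the spoken description in an assumed notation; in particular, placing the union and the empty collection in the body of the for is my reading of "a for of a union" and "a for over an empty list" in this phase, and only the four rules mentioned in the talk are shown. The fragment needs `amsmath`.

```latex
\begin{align*}
\mathsf{for}\ (x\ \mathsf{in}\ L)\ (M \uplus N)
  &\hookrightarrow (\mathsf{for}\ (x\ \mathsf{in}\ L)\ M) \uplus (\mathsf{for}\ (x\ \mathsf{in}\ L)\ N) \\
\mathsf{for}\ (x\ \mathsf{in}\ L)\ [\,]
  &\hookrightarrow [\,] \\
\mathsf{if}\ L\ \mathsf{then}\ \mathsf{if}\ M\ \mathsf{then}\ N
  &\hookrightarrow \mathsf{if}\ (L \wedge M)\ \mathsf{then}\ N \\
\mathsf{if}\ L\ \mathsf{then}\ \mathsf{for}\ (x\ \mathsf{in}\ M)\ N
  &\hookrightarrow \mathsf{for}\ (x\ \mathsf{in}\ M)\ \mathsf{if}\ L\ \mathsf{then}\ N
\end{align*}
```

The first rule is the one that can reorder results, which is why the semantics is stated over bags rather than lists.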
>>: I'm surprised you pushed it to the end, bubble these to the top as
much as you can.
>> Philip Wadler: The most efficient thing is to bubble the ifs to the
top. The thing SQL supports is at the end. SQL forces you to push the
wheres to the end. What we'll do is push the wheres to the end and the
SQL optimizer will bubble them up again. But this has all the standard
properties. The reductions preserve typing, and they are strongly
normalizing and confluent. You can apply them in any order whatsoever.
So all of this is very straightforward.
>>: You said they're strongly normalized but you didn't show the
reductions for the recursive functions, right?
>> Philip Wadler: We're reducing quoted terms.
>>: [indiscernible].
>> Philip Wadler: But only reducing the quoted stuff and the quoted
stuff can't include recursion. That's why it's all easy. And I should
mention that you can find these rules at least going back to the old
papers on Kleisli. Ezra Cooper, working on the Links team with
me, wrote a paper that has essentially these rules and these rules in
it. They were done all at once. Strong normalization was hard to
prove. One of the small innovations in this paper is we break it into
two sets of rules. For the straightforward ones, strong normalization is
well known. For these you have to prove strong normalization, but it's
straightforward.
>>: Do you really have strong reduction, can you de[indiscernible].
>> Philip Wadler: Yeah.
>>: Beta outside, call by name reduction?
>> Philip Wadler: Yeah, we're doing call by name reduction in the
quoted terms. Doesn't matter.
>>: Can you reduce even underneath that I suppose --
>> Philip Wadler: Yes because there's no side effects.
>>: I don't see the rules that let you do that.
>> Philip Wadler: What rule lets you do it? This rule.
>>: Reduce N without having --
>> Philip Wadler: Sorry, it's the compatible closure of these rules so
you can reduce anywhere.
>>: Okay.
>> Philip Wadler: Example. Remember compose. Let me show you how all
these rules help us out. Here's compose of Edna and Bert. We take the
definition of compose, expand it out. That involves calling getAge and
range. We expand everything out and we get this. So this is in fact
what happens after you have spliced everything together. So at the
point we need to normalize, this is what we're given.
So now we're going to normalize this. Let's see, what have we got?
We've got this function applied to Edna and Bert. We'll substitute
Edna and Bert. We've got this. We've got some fors with fors inside
of them and ifs inside of them, but remember we have rules that would
percolate those out. We go ahead and do that. We do it here again.
Now we've just got a bunch of fors and ifs. But we've got ifs
alternating with fors. We have these rules, push it to the end. We
do that. We use the rule that combines this. So all these separate
ifs now get combined together into one big if. And this looks exactly
like SQL. Right?
Select, where -- sorry. These are froms, this is where, and this is
the select at the front. So this has to go to the front. But that's
about it. So there's the corresponding SQL. And we execute that and
we get the answer.
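The last step, reading a normal form off as SQL, can be sketched as a small Python function. The function `to_sql`, the table name `people`, and its `name` and `age` columns are assumptions for illustration; the normal form is passed in already split into its generators (the fors), its conditions (the single combined if), and its yielded record.

```python
def to_sql(froms, where, select):
    """Render a normalized query as one SQL string.

    froms:  list of (variable, table) pairs, one per 'for'
    where:  list of condition strings from the single combined 'if'
    select: dict mapping output field name to its expression (the 'yield')
    """
    select_part = ", ".join(f"{expr} AS {name}" for name, expr in select.items())
    from_part = ", ".join(f"{table} AS {v}" for v, table in froms)
    sql = f"SELECT {select_part} FROM {from_part}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


# Normal form for compose("Edna", "Bert"): three fors, one if, one yield.
query = to_sql(
    froms=[("s", "people"), ("t", "people"), ("w", "people")],
    where=[
        "s.name = 'Edna'",
        "t.name = 'Bert'",
        "s.age <= w.age",
        "w.age < t.age",
    ],
    select={"name": "w.name"},
)
print(query)
```

The only reordering needed is the one mentioned above: the yield, which comes last in the comprehension, moves to the front as the SELECT clause.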
Okay? And this is what you saw before, right? So -- oh, the nested
one, if you run it in F# 3.0, it will run but it actually acts on the
nested structure, and to act on the nested structure it issues the query
as many times as there are departments. So if you have 100
departments, this will execute 100 queries. Whereas we execute one
query. So it's a lot faster. For everything else we're either slightly
faster or slightly slower than F# in either variant. And they're all
much of a muchness. And the point is the tables are big enough that
the normalization time doesn't matter and is quite tiny.
And this is from moderate sized tables with about 5,000 entries.
Really big tables the normalization really doesn't matter. And this
is, by the way, using a very slow normalizer. I'm sure you could write
faster ones.
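The difference between one query per department and a single query can be illustrated with Python's built-in `sqlite3` module. This is only an analogy for the behaviour described, not the F# query provider: the schema, the generated rows, and the function names are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (dpt TEXT);
CREATE TABLE employees (dpt TEXT, emp TEXT);
""")
conn.executemany("INSERT INTO departments VALUES (?)",
                 [(f"dept{i}",) for i in range(100)])
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(f"dept{i}", f"emp{i}") for i in range(100)])


def per_department():
    """Walk the nested structure: one employees query per department."""
    issued = 0
    result = []
    depts = [row[0] for row in conn.execute("SELECT dpt FROM departments")]
    for d in depts:
        rows = conn.execute(
            "SELECT emp FROM employees WHERE dpt = ?", (d,)).fetchall()
        issued += 1  # counts only the per-department queries
        result.extend((d, e) for (e,) in rows)
    return result, issued


def single_query():
    """Fetch the same pairs with one flat join."""
    rows = conn.execute(
        "SELECT d.dpt, e.emp FROM departments d "
        "JOIN employees e ON e.dpt = d.dpt").fetchall()
    return rows, 1


nested_rows, nested_count = per_department()
flat_rows, flat_count = single_query()
print(nested_count, flat_count)
```

Both approaches return the same pairs; what changes is the number of round trips to the database.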
>>: Why does F#2 or F#3 fall over in some cases?
>> Philip Wadler: Ha, because they're not doing the normalization.
They could be, but they're not.
>>: What does it mean, like SQL they cannot generate SQL for it?
>> Philip Wadler: You can build up the Expr that represents the query
and then you hand it over to the query provider, which is supposed to
turn it into SQL. And it says, well, I don't know what to do here.
>>: It changes between different releases.
>> Philip Wadler: Changes between different releases. So for the
subset I've described, we can always guarantee it works. And as I
mentioned if you're outside that subset for all the queries in the
standard documentation, it works. And again the times are, the
normalization time is small. So this is something that somebody could
just sit down and implement today. Right. We've implemented it.
Turns out there's one problem with F# 3.0 in hooking in and using your
own data provider, your own thing that converts LINQ expressions into
SQL queries. It would be very nice if somebody on the F# team would do
that so people can start using this new technique. That can be done
now. And there are all the details. But, please, you can download the
paper from my website. There will be an updated version by the end of
the month which will be the camera ready copy for ICFP and please have
a look if you would like to see more details.
So these were our goals. You've seen they've all been achieved. And
finally I want to return to this old question of what is the difference
between theory and practice? So we do have two different things.
We've got the theory that I've showed you and we've got the practical
implementation. Right. And the theory doesn't apply to all programs,
but if you just do normalization over ordinary programs, it never makes
things worse, not surprisingly. So anything that ran before still runs
after you normalize it. And some things that didn't run before do run,
if you normalize it.
What we actually got here is a recipe. And the recipe says what do you
do if you want to take some arbitrary domain-specific language and
integrate it into your programming language? Well, just write out the
domain-specific language in the same syntax as your programming
language and quote it.
Actually, it's not necessarily. And then normalize it. The one
thing -- what normalizations do you need? It will vary from language
to language but you certainly want beta reduction. And sometimes maybe
even beta reduction will be adequate. I'm guessing that for generating
GPU code it's just beta reduction. I'm not sure. We'll look at that
as an example next.
Some reviewers of the paper point out, wait a minute, you do not need
the host language and the quoted language to be the same. In fact, in
our case they weren't really the same. It's the same syntax, but the
host language has recursion. The quoted language has the database
construct. So in fact they differ just slightly. In fact, you could
just have your quoted language be completely different. Nobody does
that, though, in practice.
In practice, people implement quotation that's for the language that
contains the quotation. So the quotations in F# are of F# code. The
quotations in Haskell are in Haskell code and so on.
So what is the difference between theory and practice in our work?
Well, in theory there is a difference, but in practice there isn't.
Thank you very much.
[applause]
>>: So I'm curious if you looked around and found evidence of people
running into this limitation of F# and complained about it, basically.
Or sort of developer communities running into this obstacle,
essentially.
>> Philip Wadler: Right. That's a very good question. The answer is
you can find various blog posts saying -- Tomas Petricek,
for example, who is on the F# team, has done clever work in getting
things like our dynamic queries to work. Done [indiscernible] if you
want dynamic query to work here's how you do it. He gives a clever
recipe. He doesn't have any theory that says it's always going to work
or anything like that.
So this is actually a very good question. How often do people bump
into this in practice? We don't know. There's no systematic data.
There's a little bit of anecdotal data from people doing blog posts
saying this is how I managed to get this thing to go through LINQ. But
in fact I would say we don't have convincing knockdown evidence to
prove it's a problem.
>>: Moving chart because when people complain we say we'll fix that
case?
>> Philip Wadler: Looking at this, these are all things you'd
like to do, and sometimes you can and sometimes you can't. So this is
sort of our knock-down evidence, saying there's a problem.
>>: Do you know about other languages, like C# LINQ providers,
[indiscernible].
>> Philip Wadler: We've not tested any of this in C#. That's a good
question. And C# doesn't actually give you quotation per se; it's all
hidden. They give you operators that build up expression trees but they
don't have quotation per se. Doing this in C# would not be quite so
easy. And one of our recommendations would be: put quotation into your
programming language because it gives people this as an option.
>>: So F# [indiscernible] that is what is it that it provides in its
more complicated algorithm?
>> Philip Wadler: It's complicated enough we haven't looked exactly
at what this is doing. We're hoping they will just adopt it and use it.
>>: So you don't know, for example --
>> Philip Wadler: Drop a note saying you should implement Phil's stuff
right now, I want it. Sorry?
>>: For example, it's at least not the case that they're just missing
the normalization phase or that they do --
>> Philip Wadler: No, it's exactly just that they're missing the
normalization phase. You just add normalization and it will work, because
the back end that we used for this is F# 3.0. But the only thing we
changed is we did some normalization first, that's it. So just
normalizing is all you need to do.
>>: You said the quoted language does not support [indiscernible] of
any sort. Could we have that dropped in kind of feature where you can
have -- providing any -- not expression, but any value passed in
expanded in place. If I got that correctly.
>> Philip Wadler: Right. You're referring to the lift operator that
would convert an integer to an Expr of integer or string.
>>: Yes.
>> Philip Wadler: Right.
>>: So in that case, is it -- well, is it the case --
>> Philip Wadler: But that's a value. You use recursion to compute
the value but then you just have the value.
>>: What about -- a function itself? Sorry if I misunderstood.
>> Philip Wadler: Ah, lift applies to base types, not to functions.
>>: Okay. So --
>> Philip Wadler: Functions you must start with quoted code. You
can't take an arbitrary function and turn it into a quoted expression
tree that describes that function.
>>: Okay. Thank you.
>>: Did you have a question?
>>: I'll ask afterwards.
>>: All right. Well, let's thank the speaker again.
[applause]