>> Phil Bernstein: Hi, I'm Phil Bernstein. It's a pleasure to be introducing James Terwilliger this morning.
James had a great visit last summer to MSR, where he did work on mapping LINQ queries to schematized
XML data.
Today he's here to talk about his Ph.D. thesis research on graphical user interfaces as updateable views.
James has a lot of experience in several years of commercial development, in addition to his Ph.D. work.
So I'm sure it's going to be interesting to understand how he'd like the environment for developers to look
in the future. James, it's all yours.
>> James Terwilliger: Good morning, everyone. Echo. Excellent.
The title of my talk today is graphical user interfaces as updateable views. But just as a point of
clarification, I'm not talking about taking a graphical user interface and magically transforming it into an
updateable view. What I mean is they already are one. Your GUI is a view and it is updateable. And what
we want to do is leverage that in a way that can provide some very interesting results.
So we're going to take the graphical user interface, treat it as a conceptual model for the data in the
database, allow developers to program to that data model and also come up with a new way to map the
user interface data to physical schema called a channel.
We'll talk a lot about that. But first a reminder of what we're talking about. This is a form from a
commercially available piece of software that does clinical endoscopies. Everything on this form in terms of
its structure should be very familiar. We have things like radio arrays, text boxes, list boxes.
And the terminology you might not understand, but if you happen to be in the field of clinical epidemiology
or endoscopy, or even if you were a statistician who was running queries on clinical data you would
understand what SA or NSA or any of these things meant.
So this was a form that was designed specifically for use by a domain expert. Forms are everywhere. This
is an example of one built in ASP.NET. This is a web form that does site management and content
management. So the same kinds of graphical things are in play here. Text boxes.
You have a tree view. Same kinds of controls at work. Because this paradigm of software development
is popular, there are lots of tools out there dedicated to the purpose of designing these things.
So this is one from Eclipse that does Standard Widget Toolkit user interfaces. This is one from Microsoft
Access. Microsoft has a brand new one based on the new .NET Framework, in this case called Expression
Blend. But the one we're all most familiar with in this room, I would imagine, is the form builder that
comes with Microsoft Visual Studio.
And I think we've all pretty much used this software before. You have some sort of a toolbox of common
controls on the left-hand side. And when you drag and drop one of these controls onto the field of the form,
you see a graphical representation of that, but then you also, behind the scenes, get all the code necessary
to generate that form.
So what we have done is put together a framework we call guava, short for GUI as view, or graphical user
interface as view. This software starts with the UI designer or developer programming a user interface
using a familiar tool, Visual Studio, in a familiar way, with no changes, except that we ask the developer to
use the guava library of widgets instead of the Visual Studio widgets, which happen to be identical. The
guava library is merely an extension of the Visual Studio components.
From that we generate a few things. We generate a default database instance and schema. We generate all
the insert, update, and delete statements necessary to communicate between the user interface and the
default schema, as well as the database bindings. Then there's this additional piece, an application-specific
query interface. I'll talk at great length about that.
So what you get for free is a fully functional database-backed application with a default schema. You can
think of this like a user-interface-to-relational mapper, for which you can come up with your own acronym. I
provide three possibilities for you.
Subsequently, a database developer or designer might come in and decide that that default schema that
comes from the user interface might be insufficient for various physical reasons. And so we also provide a
declarative mapping that the database developer can use to transform that schema based on whatever
physical design they have in mind.
So what do we do behind the scenes? You start with this user interface built on the graphical user
interface library we provide. This creates a structure called the G tree, a hierarchical representation of all
the data in that application.
From that, you get the default instance of the database with the schema and all the bindings, et cetera, that
are necessary to communicate with UI. You also get a default mapping to physical database. This is
called a channel. I introduced this briefly in the subtitle of the talk. And this functions very similarly to what
you think of as middleware in a three-tier application.
You get a default one for free with guava but the database designer can come in and modify that over time.
Finally, given a natural schema and the channel we give you the physical database instance.
Obviously, there's more to a piece of software than just a UI and the database transformation, so we offer
various ways for business logic to interact with all of our default mappings. You can intercept the default
mappings and rewrite them or you can completely escape the system. But the most interesting arrow is the
one between business logic and natural schema, because we allow the developer to code directly to the
natural schema and transform all of those statements as necessary.
Just for scoping information, what we're currently looking at is a situation where you have a single user
interface application running on top of your physical storage. We have some ideas for how to get multiple
applications to work together. I can talk about that later if people are interested.
But for the purposes of this talk, we're looking at one UI per application. We're also looking at building new
applications. So if you have an existing application, you might want to translate to using the guava GUI or
the widget library. But we're primarily focused on the case where you have a new application you're trying
to build and we want to give you all these things for free.
>>: Just clarify, you said single GUI, you don't mean one form.
>> James Terwilliger: I don't mean one form. I mean a single GUI application. So you might have, let's
say, you might have a situation where you have an entire graphical user interface for, say, billing and
another one for endoscopies and another one for such and such. We're focused on the entire one for
endoscopies, there can be as many forms as you want in the application.
The final piece that we give you is this application-specific query interface. This is of vital interest to us
because, first of all, this was the entry point for us into this research. And so I'm going to be speaking
about these two components, mostly, for the talk. The reason is that we want to give people who have
domain expertise the ability to write queries against an interface that looks like this, one that was designed
with care and with the understanding of a domain expert in mind, as opposed to having to write queries
against a database, even with the help of a data dictionary, which might not be available.
So we want to give people this instead of what you have here. We see that as a much better view of the
data. So the contributions that we see of this work: first of all, we provide the library for UI-driven
development, so as a user interface designer you start with your UI. We'll explain why we think that's a
good way to approach designing software. Also, we provide the architecture and algorithms to get from the
user interface to your default schema and your query interface. That constitutes the first part of the talk.
Second, we give you the schema language and an implementation of that schema language to
transform all of these default things we give you into something more physically appropriate. And that
constitutes the second major portion of the talk.
Also, we have an end-to-end prototype that we've implemented and applied to a practical application,
working in concert with an actual software development group.
And I'll touch on that as I look at the other two topics over the course of the talk. So that's how today's talk
is going to proceed. We'll start with the user interface, because that's where the architecture starts. So,
first of all, user interface. The user interface is a conceptual model. It's a conceptual model because it has
all the various things that we think of as being in a conceptual model. We have entities, we have attributes,
and we have relationships between those things.
And those correspond in a natural fashion to the constructs you see in forms in a user interface. So a form
becomes an entity or, say, a grid control on a form becomes an entity.
The attributes of the forms are the text boxes, the lists, the radio arrays, those various things. And the
relationships in the user interface correspond to the relationships between the forms. So modal
relationships. Launch relationships. Or, say, lookups between one form and the data in another form.
So whether or not they know it, the UI developer, when they're writing a UI, is in fact
designing data, writing a conceptual model.
In fact, it's the conceptual model that the user sees, which is very important for certain kinds of
software processes like rapid prototyping and agile methods, where the customer is very much integrated
into the process. So frequently providing a user interface that conveys what the contents of an application
are is very important.
The key structure in all of our artifact generation is this thing called the G tree. It's a tree because the
nature of user interface is that it's hierarchical. It's hierarchical in two senses. It's hierarchical based on the
controls within a form, because you have a containment relationship between controls on the form and it's
also hierarchical based on the launch relationships between forms. And that's the essence of what a G
tree is. You're stitching those two hierarchies together into a single hierarchy that represents all the data
that's present in a single application.
Each node in a G tree contains information about the control it represents. That includes things like the
domain of the control. But it also contains information about the context of the control. And that's important
for generating this query interface. So we'll gather information like the text that appears before the control,
or one of the more interesting things I think is the tool tip that appears when you mouse over a control.
All of those things that provide context to the user as they're using the interface.
>>: For most applications you think the tree is sufficient? Because you might have the back button that
takes you back to the previous page, for example.
>> James Terwilliger: For most applications I see a tree is sufficient, because whenever you have
something like a back button, those two forms tend to have a one-to-one relationship between them,
in which case you can encapsulate that as a collection of forms that all have sort of a single-layer
relationship between them.
Like a wizard interface is an example of that, where you have three or four different forms. You can go
back between them. But they're really all a representation of a single entity.
So this is an example application that you're going to see for the next few minutes. It's a very simple part of
an application. Two forms. The relationship between these two forms is that when you click the details
button on the form on the left, it launches the form on the right.
Translating from this to a G tree is relatively straightforward. You start with the top-level form. Call that an
entity node, because it corresponds to a structure in the UI that's an entity. And you follow the control
relationship down through the normal Windows Forms hierarchy.
But when you get to a control like the details button, you have to follow the relationship across the details
button to the new form. And that's how you stitch together the launch hierarchy with the control hierarchy.
And what you get is a single tree.
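The stitching of the control hierarchy and the launch hierarchy described above can be sketched roughly as follows. This is a minimal illustration, not guava's implementation: the `Control` and `GNode` classes, the `launches` field, and the placement of controls on the two forms are all assumptions made for the example, and it assumes launch relationships never form a cycle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Control:
    """Hypothetical description of one UI control."""
    name: str
    kind: str                                   # "textbox", "checkbox", "button", "panel", ...
    children: List["Control"] = field(default_factory=list)
    launches: Optional[str] = None              # name of the form a button opens, if any


@dataclass
class GNode:
    name: str
    role: str                                   # "entity", "container" or "attribute"
    children: List["GNode"] = field(default_factory=list)


def build_gtree(form_name: str, forms: Dict[str, List[Control]]) -> GNode:
    """Create an entity node for a form, then walk the form's controls."""
    root = GNode(form_name, "entity")
    for control in forms[form_name]:
        root.children.extend(_walk(control, forms))
    return root


def _walk(control: Control, forms: Dict[str, List[Control]]) -> List[GNode]:
    if control.launches is not None:
        # Cross the launch relationship: the launched form becomes a child entity.
        return [build_gtree(control.launches, forms)]
    if control.children:
        # Container control: keep following the containment hierarchy.
        node = GNode(control.name, "container")
        for child in control.children:
            node.children.extend(_walk(child, forms))
        return [node]
    return [GNode(control.name, "attribute")]


# Assumed layout of the two example forms (field placement is illustrative only).
forms = {
    "Endoscopy": [
        Control("Endoscopist", "textbox"),
        Control("ProcedureComplete", "checkbox"),
        Control("Details", "button", launches="EndoscopyDetails"),
    ],
    "EndoscopyDetails": [
        Control("Severity", "radio"),
        Control("ComplicationsOccurred", "checkbox"),
    ],
}
tree = build_gtree("Endoscopy", forms)
```

In this sketch the details button does not appear as a node of its own; the form it launches takes its place as a child entity, which is what joins the two hierarchies into one tree.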
From the G tree you also get a default database instance. And we call it the natural schema, because it
has a natural relationship with the user interface from which it came. So, first of all, the domains of the
various attributes are relatively easy to identify. For something like a check box, you know
that it's a boolean; for something like a string, you know it's going to be something like a varchar with a
character limit that matches the text box. I can't tell you how many times I've seen bugs arise in software
where the text limit of the text box and the width of the database column are different, and that's what ends
up causing the database problems: truncation without error.
So you get the lengths and domains of the various attributes directly from the user interface, avoiding that
problem. The translation from the whole G tree to a relational schema is this: you take the entity
nodes in the tree and translate those into tables. You take the attribute nodes and translate those into
the various columns in the nearest entity node's table. There will be an example of this in one more slide.
The launch relationships between forms turn into foreign keys as one would expect. The difference is if
you have a one-to-one relationship like the example application I showed you a couple slides ago, that
turns into a foreign key from primary key to primary key. And if you have a multiple launch relationship,
where you have something like new, edit, delete functionality, where there's more than one, say, physician
for a given patient, you introduce a new foreign key column and you create a foreign key from
the new column to the parent table.
I'm skipping over more complicated controls. I have an example of that if people are interested. The basic
idea behind a complex control is you can either serialize it to a single element and call it a single control,
single attribute in a table. Or if you have some way of breaking it down into entities and attributes the
developer can specify that in a semi-automated fashion.
So for our given example, here was the G tree we saw before. It had two entities, endoscopy and
endoscopy details. And what's on the right is the natural schema associated with it. And you can see the
foreign key between endoscopy details and endoscopy and that foreign key induces a one-to-one
relationship between those tables.
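As a rough sketch of the translation just described (entity nodes become tables, attribute nodes become columns, launch relationships become foreign keys), the following generates DDL strings from a small nested description of the tree. Everything specific here is an assumption for illustration: the dictionary layout, the surrogate `id` key, the `VARCHAR(255)` fallback type, and the column names. It is not the schema guava actually emits.

```python
def column_type(kind, max_length):
    """Map a control kind to an assumed SQL type."""
    if kind == "checkbox":
        return "BOOLEAN"
    if kind == "textbox":
        # The column width comes straight from the text box's character limit.
        return f"VARCHAR({max_length})"
    return "VARCHAR(255)"  # assumed fallback for other simple controls


def natural_schema(entity, parent=None, multiple=False):
    """Return CREATE TABLE statements for an entity node and its descendants."""
    if parent is None:
        cols = ["id INTEGER PRIMARY KEY"]
    elif multiple:
        # Multiple launch: a new foreign key column pointing at the parent table.
        cols = ["id INTEGER PRIMARY KEY",
                f"{parent}_id INTEGER REFERENCES {parent}(id)"]
    else:
        # One-to-one launch: foreign key from primary key to primary key.
        cols = [f"id INTEGER PRIMARY KEY REFERENCES {parent}(id)"]
    for name, kind, max_length in entity["attributes"]:
        cols.append(f"{name} {column_type(kind, max_length)}")
    ddl = [f"CREATE TABLE {entity['name']} ({', '.join(cols)})"]
    for child, child_multiple in entity.get("children", []):
        ddl.extend(natural_schema(child, entity["name"], child_multiple))
    return ddl


# Assumed contents of the two example entities.
endoscopy = {
    "name": "Endoscopy",
    "attributes": [("Endoscopist", "textbox", 40), ("ProcedureComplete", "checkbox", None)],
    "children": [
        ({"name": "EndoscopyDetails", "attributes": [("Severity", "radio", None)]}, False),
    ],
}
ddl = natural_schema(endoscopy)
```

Because the text box length feeds the column definition directly, the mismatch between UI limit and column width mentioned earlier cannot arise in this sketch.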
So with that in mind, here's what we want to provide to a domain expert. This isn't the query interface
we've implemented yet. But this is the one we're striving for. And think of this like query by example. So if
I'm trying to construct the query I see on the bottom, I say show me the endoscopist and the severity. Think
of those like print nodes in QBE. For endoscopies that have been completed, where Bob was the
anesthetist. Physicians are on a first-name basis. And where complications occurred. Very much query by
example, certainly our inspiration for this work.
This is the interface we have implemented. It contains all of the same information that you saw on the
previous query interface, it's just in a different structure. The structure you see, especially in the main field
of the query interface, is exactly the G tree. You're seeing the hierarchical representation of the entire user
interface.
And what you see on the right-hand side is the context information for a particular node and the ability to
specify a condition on that node. So what you see here is specifying an equality condition for staff equals
male, but you're also seeing what the tool tip was. In this case there wasn't any. And also what the leading
text was. The leading text was gender. You can also see location and size information, which isn't as
relevant, but we wanted to give you as much information about the context as possible. You can also see in
this query interface you have an advantage over the previous one in that you can search the QI. So if you
want to find all locations in the entire user interface that, for instance, pertain to name, you can type in
"contains name" and you find where it's located. So you can do a search on schema effectively.
So specifying a query is like annotating a schema. The G tree is like a schema, so we're going to annotate
it with the various query parameters that we have. In this case, you are printing the endoscopist and severity
and we're applying conditions to anesthetist, procedure complete and complications occurred. The
algorithm for translating this into a query is that you start by taking the smallest possible subtree that
contains all of those elements and then translate that using a relatively straightforward algorithm. The
interesting part of the algorithm is that the relationship between entity nodes in the tree is governed by
joins, as we would expect.
So what you see on the left here is the relational algebra of the query that's specified. And this is relational
algebra expressed against that natural schema. The default database instance.
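A minimal sketch of that translation, under stated assumptions: the tree is reduced to two lookup tables (`entity_of` and `parent_of`, both hypothetical), every launch relationship is one-to-one so child and parent share an `id` key, conditions are equalities only, and the output is parameterized SQL rather than the relational algebra shown on the slide.

```python
def path_to_root(entity, parent_of):
    """Return [entity, parent, ..., root]."""
    path = [entity]
    while parent_of.get(path[-1]) is not None:
        path.append(parent_of[path[-1]])
    return path


def translate(prints, conditions, entity_of, parent_of):
    """Turn an annotated G tree into (sql, params).

    prints      -- attributes to output (assumed non-empty)
    conditions  -- {attribute: required value} equality conditions
    entity_of   -- {attribute: name of its nearest entity node}
    parent_of   -- {entity: parent entity, or None for the root}
    """
    # Entities touched by the annotations, in order of first mention.
    used = list(dict.fromkeys(entity_of[a] for a in [*prints, *conditions]))
    paths = [path_to_root(e, parent_of) for e in used]

    # Root of the smallest subtree: the deepest entity common to every path.
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    root = max(common, key=lambda e: len(path_to_root(e, parent_of)))

    # Collect the subtree's entities top-down so parents precede children.
    subtree = []
    for path in paths:
        for entity in reversed(path[: path.index(root) + 1]):
            if entity not in subtree:
                subtree.append(entity)

    sql = "SELECT " + ", ".join(f"{entity_of[a]}.{a}" for a in prints)
    sql += f" FROM {root}"
    for entity in subtree[1:]:
        # Assumed one-to-one launch: child and parent share a primary key.
        sql += f" JOIN {entity} ON {entity}.id = {parent_of[entity]}.id"
    if conditions:
        sql += " WHERE " + " AND ".join(f"{entity_of[a]}.{a} = ?" for a in conditions)
    return sql, list(conditions.values())


# Assumed placement of the example attributes on the two entities.
entity_of = {
    "Endoscopist": "Endoscopy",
    "Anesthetist": "Endoscopy",
    "ProcedureComplete": "Endoscopy",
    "Severity": "EndoscopyDetails",
    "ComplicationsOccurred": "EndoscopyDetails",
}
parent_of = {"Endoscopy": None, "EndoscopyDetails": "Endoscopy"}

sql, params = translate(
    ["Endoscopist", "Severity"],
    {"Anesthetist": "Bob", "ProcedureComplete": True, "ComplicationsOccurred": True},
    entity_of,
    parent_of,
)
```

If every annotated attribute lived on the details form, the subtree would shrink to that one entity and no join would be emitted.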
>>: So if a form is used as a sub form in multiple other forms, it's going to be duplicated in this?
>> James Terwilliger: Yes, that's correct. So if I see a form such as, let's say there's a details form like
medical history or something like that, that happens to be a child form of multiple parent forms. You're
going to see it in multiple locations within the G tree, that's correct.
>>: Does that lead to any technical problems? If it's duplicated like that?
>> James Terwilliger: So the biggest problem is that when it appears in the natural schema, what you
effectively have is a foreign key relationship from the details form to multiple parents which is not normally
allowed. So in guava we have a generalized notion of foreign keys that allows that to be possible.
So in that case you would have a foreign key from, say, details to colonoscopy and EGD and flex SIG.
>>: (Inaudible).
>> James Terwilliger: It could still be a one-to-one or one-to-many relationship. It's just a disjunction. So
you say I have this entry in the details table and it must be present in at least one of the parents.
>>: I see. Okay.
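The disjunctive foreign key described in this exchange can be pictured as a check that each child row appears under at least one of the parents. The sketch below is only an illustration of that rule; the function name, the table names, and the assumption that keys are directly comparable across parent tables are not taken from the talk.

```python
def orphaned_keys(child_keys, parent_tables):
    """Return the child keys that are absent from every parent table."""
    all_parent_keys = set().union(*parent_tables.values())
    return sorted(k for k in child_keys if k not in all_parent_keys)


# Hypothetical key sets for three parent forms sharing one details form.
parents = {"Colonoscopy": {1, 2}, "EGD": {3}, "FlexSig": set()}
details = [1, 3, 4]
orphans = orphaned_keys(details, parents)
```

Here key 4 is the only violation, since it is present in none of the three parents.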
>> James Terwilliger: The expressive power of this query interface is comparable to conjunctive queries,
except that it does not have direct comparison between two columns. With the exception of that restriction,
and given that we allow joins across foreign keys as well, it's roughly comparable to conjunctive queries but
strictly less.
Now, that's the expressive power of this query interface. Similar to something like in the entity framework,
we also allow queries to be written directly against the conceptual model. This would be the SQL that
corresponds to the queries that you saw before.
In the current implementation, the queries are specified in extended relational algebra. If you express the
query in extended relational algebra, you can express any query you wish, not only the ones that
are in the query language of the query interface we provide.
There's a fair amount of related work in this area. Obviously our inspiration for this work is query by
example, but there are some other research projects new and old pertaining to the information content of
forms and also for presenting usable user interfaces to domain experts or people who are maybe not
necessarily technical experts.
We see it as the advantage of our approach that we give it to you for free with the development process. A
few notes about the implementation of this part. We started by extending existing Windows form widgets.
So things like text box, radio list, all the normal things. And because we did that, we allow developers to
use Visual Studio Form Designer in the normal way. You don't have to use a special tool, and you don't
have to use a special process, with one exception I'm going to get to in a second.
The natural schema object itself is implemented as a DataSet object. For those unfamiliar with DataSet,
that's an object that comes with the .NET Framework that is an in-memory database: effectively in-memory
tables, and you can express relationships between them.
And there's a thin API that sits on top of this DataSet that intercepts inserts, updates, deletes, queries, et
cetera, applies them to the DataSet and also propagates them through the channel so they have an effect on
the physical database.
The one exception to using it in the normal way is that the relationship between forms is implemented as a
new property on clickable items, such as hyperlinks or buttons. And the reason why is because we made
an implementation choice early on that we weren't going to just analyze code. We weren't going to analyze
the events between buttons.
We were going to provide an additional feature so that if you wanted to click a button and have a form
launch we're just going to say, you know, just tell us what the form is you want launched and we'll do it for
you. We do it behind the scenes using reflection.
The query interface is implemented as a control that you just drop onto a form. So you can drop the query
interface anywhere you want in the application. Specify what the root of the tree is and it will generate the
query interface for you.
The way that the software works is when the application is launched, the framework loads the first form into
memory and walks the hierarchy of the application, but it does so behind the scenes. You don't see anything.
So it walks the whole hierarchy but you don't ever see any of that. And also the application includes "okay"
and "cancel" buttons that are automatically hooked up to the insert, update, and delete statements that are
generated. You can override that if you need to. But the second -- when you normally add a form to the
project you have a form that's derived from Form, the class. In our case you derive it from G form and you
automatically get okay and cancel buttons that you can override.
If you don't, you get all this default functionality. There are a few lessons I learned from walking through
this. The most important one came from modifying custom controls to work with guava. So if there was a
control that somebody wrote from scratch, we went ahead and took a few examples of controls like this and
modified them to work with guava. And what we learned from that experience was that we didn't actually
need to implement anything new. We just needed to change the existing functionality so that it met our
interfaces.
That was important. If a developer is going to use this in the future we don't want to impose an additional
burden just so that things can work with guava. We found that to be very interesting and useful.
We also saw it encouraged good encapsulation. So we saw at least one example of a custom control that
was spread out over six classes, and, yeah, six. Awesome.
And to make it work with guava it had to be encapsulated into one. We have implemented a representative
sample of the application that we've been working with. This is how it mapped out. We converted a total of
361 controls. And the pertinent part of this pie chart is that the only ones we had trouble with were the two
1s. And one of them was just because we didn't support the image data type at the time. And the other
one referred to external data. So it just needs to bypass things and refer to physical storage. Everything
else it was either a direct conversion or had an obvious direct conversion if we put a little more time into it.
I've already covered the query interface. In the interests of time what I'd like to do is skip to a quick video
demonstration, if that's okay. So what I have is a video of the user interface part of guava in action. So this
is -- first, what I'm doing I'm blowing up my database tables. Nothing up my sleeve. Everything is
generated automatically.
So I delete my database, and I'm just going to reconstruct the root database. I'm not going to create any
tables. And now I'm going to switch over to my development environment and the first thing you're going to
see is the code for a form. And you notice that the code for the form is pretty much empty, as you would
expect. The only difference between this and the standard form template is that we have an
additional constructor that's used for the guava things behind the scenes. But the developer
wouldn't write this code. This code would just be included in the project template. This is the
form designer. I'm going to click on a control in the form designer. This is the tab control.
If you look down in the properties window you'll see it's a G tab control instead of a normal tab control. And
here's the text box. It's just a G text box instead of a text box. The only difference to the developer is
that you use the tools that are available in our toolbox, like that text box. That's a G radio array button
that you see there.
Now we're going to run the application. The first thing you'll notice is the application comes right up.
There's not a lot of overhead behind the scenes. We don't have scalability results for a large application
yet. But it's important that the application comes up quickly.
This is the query interface that corresponds to this application. I'm walking the control hierarchy. And in
this case I'm looking at information about a patient. So to demonstrate how the QI works, I'm going to go
over and add a patient because I've been feeling a little under the weather lately I'm going to go ahead and
add myself. My name is James Terwilliger.
A little bit of background information. Go ahead and try this as your real Social Security number. I dare
you. If I remember correctly, yeah, I have a little bit of gender crisis there for a second, and, okay. So I've
just added myself as a patient. Now I'm going to go back to the query interface and do a very simple query
for myself. I'm looking for the first name and last name of all males.
I hit query, and I see that I get returned. And the last thing I want to show is that I'm in fact in the database.
That there's nothing behind the scenes working. So I switch over to my database. I refresh to make sure
that the most current tables are there. I see that I have quite a few tables corresponding to the forms in my
application.
If I open the contents of that table, I see that I have exactly one row that corresponds to me. So that is half
of guava. That is the half of guava that takes a user interface and provides you all the default information.
From here -- well, I'll talk about the contributions at the end because we're a little low here on time. But
from there what we do is we work on the physical design decisions as implemented by a database
developer. So this is the part that takes the default schema and maps it to a physical database.
Just as a reminder, we are here in the architecture. Guava does give you one for free. But the one for free
is just a default one-to-one mapping. So if you want any physical design, this is where you put it. A lot of
work has been put into creating database mappings and physical design. In particular, relational lenses and
both-as-view take approaches similar to the one you're about to see, which is sort of an algebraic,
one-operator-at-a-time kind of approach. But there are other ones, including extract, transform, load. I hear
there's this entity framework thing that Microsoft provides --
So with all this stuff in mind, why is it that we need a new one? Well, we have two groups of requirements
here. First of all, the mapping needs to be able to handle a certain language of statements. It needs to be
able to translate a certain number of things as expressed against a natural schema, translate it into the
physical schema. That includes the standard queries and DML insert, update, delete. It includes schema
evolution. We want to be able to take changes as specified against the user interface and therefore the
natural schema and translate those into changes against the physical database and do that without any
additional intervention.
We also want guaranteed information preservation. This is important because, as we recognized in the title
slide, graphical user interfaces are updateable views. And so we want to have the user interface always be
the definitive view of data. No weird side effects. Definitely no loss of data. We don't ever want to be in a
position where we add a patient and it gets dropped. We need information preservation and also key
enforcement preservation.
But there are also the expressive requirements. What do we want the physical database to look like in terms of the
natural schema? In particular, pivot and unpivot transformations. We've seen countless instances where
the physical database will have data that is in a generic format, whereas the natural schema of the
interface will be in a specific format. I'll give an example of that in a slide or two. But the idea is that we
have expressive requirements that can't necessarily be met by any of the existing tools. So to say that
slightly slower, there isn't a single one of the tools on the left that can give us all of the requirements on the
right.
And so we set out to develop an approach that can cover everything that's on the right-hand side. I
mentioned a pivot. Pivot is a relatively complicated transformation and is not handled by most mapping
languages.
In this case, if you start with the data on the left and move to the right, it's called a pivot. If you go the
opposite direction, it's an unpivot. It's a transformation often used in data warehousing.
You have things that are data on the left that are schema on the right. That's what makes it very difficult for
most modern mapping languages; they can't handle a mapping between data and schema like this. If you
wanted to write a SQL statement that translated from the left to the right, it would be what you see on the
bottom right-hand corner; that's if you are privileged enough to have a database, such as SQL Server, with a
pivot operator. If not, it's a left outer join nightmare.
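To make the data-versus-schema point concrete, here is a small sketch of pivot and unpivot over in-memory rows. The triple layout `(key, attribute, value)`, the `id` column name, and the sample values are assumptions for illustration, and absent attributes are simply left out of a record rather than stored as NULL.

```python
def pivot(rows):
    """(key, attribute, value) triples -> one record per key, attributes as columns."""
    table = {}
    for key, attribute, value in rows:
        table.setdefault(key, {"id": key})[attribute] = value
    return list(table.values())


def unpivot(records):
    """One record per key -> (key, attribute, value) triples."""
    return [
        (record["id"], attribute, value)
        for record in records
        for attribute, value in record.items()
        if attribute != "id" and value is not None
    ]


# Generic format: attribute names are data.
generic = [
    (1, "Severity", "Mild"),
    (1, "Endoscopist", "Alice"),
    (2, "Severity", "Severe"),
]
# Specific format: the same attribute names are now column names.
specific = pivot(generic)
round_trip = unpivot(specific)
```

Going from `generic` to `specific` turns the values "Severity" and "Endoscopist" into column names, which is exactly the step a fixed-schema mapping cannot express.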
So this is the kind of complicated mapping that we want to be able to support in our mapping language. So
what is a channel? So as mentioned very briefly before, a channel is like this algebraic set of discrete
transformations. You start at the natural database and you apply operators one at a time to the natural
database, corresponding to your physical design decisions. And what you get after you push the natural
database through those transformations is your physical database design.
Another view of this that's more graphical would look like this. So, for instance, you're going to apply a
horizontal merge to three tables, T-1 through T-3, and then you're going to unpivot that. Separately you're
going to take tables T-4 and T-5 and merge them and partition them and do some additional operations to them.
This is the exact same channel that you saw. If I jump back for a minute, this channel is just a serialized
version of this channel. This is just a more graphical way of looking at it.
What is also clear from this is that the exact output of this channel is determined by the input. So what do I
mean by that? I haven't specified anywhere on these operators what columns or what the domains are or
anything like that. And that's what's important for schema evolution in this point. So if I were to suddenly
add a column for T-2. H merge knows how to handle the add column statement for T-2. Propagates it into
an equivalent form and pushes it through to the subsequent operators.
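A minimal sketch of the point that a channel's output is determined by its input, under stated assumptions: the names `hmerge` and `run_channel` and the dictionary schema representation are hypothetical and are not the Guava implementation. For brevity the sketch recomputes the output schema from the evolved input, whereas the talk describes pushing the add column statement itself through the operator.

```python
from typing import Callable, Dict, List

Schema = Dict[str, List[str]]  # table name -> column names


def hmerge(schema: Schema, sources: List[str], target: str) -> Schema:
    """Horizontal merge: fold several tables into one table whose columns
    are the union of the source columns, in order of first appearance."""
    merged: List[str] = []
    for name in sources:
        for col in schema[name]:
            if col not in merged:
                merged.append(col)
    out = {t: cols for t, cols in schema.items() if t not in sources}
    out[target] = merged
    return out


def run_channel(schema: Schema, steps: List[Callable[[Schema], Schema]]) -> Schema:
    """Apply each operator in order. No operator names a column explicitly."""
    for step in steps:
        schema = step(schema)
    return schema


# The channel only says which tables to merge, not which columns exist.
channel = [lambda s: hmerge(s, ["T1", "T2", "T3"], "T")]

natural = {"T1": ["id", "a"], "T2": ["id", "b"], "T3": ["id", "c"]}
before = run_channel(natural, channel)

# Schema evolution: add a column to T2; the same channel absorbs it.
evolved = {**natural, "T2": natural["T2"] + ["b2"]}
after = run_channel(evolved, channel)
```

The channel definition is unchanged between the two runs; only the input schema differs.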
We see this as a reasonably familiar design scenario for database developers, because they've already got
one that's similar: ETL. This is a screen shot just from Microsoft SQL Server Integration Services, but this is
a relatively common type of tool that's familiar to database developers.
Our transformation is seen here. We have seven operations that we support. These are the operations
that are necessary to support the requirements as set out before. We have four operators designated for
partitioning and merging various tables. We have pivoting and unpivoting which is unique to our
framework, to our knowledge. And also Apply, which can apply an invertible function iteratively over the
rows of a table. So this is our language of transformations.
Each operator can be defined in terms of extended relational algebra and can be defined in terms of how to
translate an instance of the natural database into an instance of the physical database.
Now, of course, that's not actually what's happening in the user interface. In user interface you have
something like an updateable view where you're executing queries and other kinds of statements against the
natural database but you're not executing them there, you're propagating them through the channel.
The statements that we support are queries in extended relational algebra. Inserts, updates, deletes.
Some schema evolution DDL statements, including adding tables, renaming columns and so forth. Adding
and dropping foreign key constraints. We also have two additional statements. One is a looping construct
that is basically what it says here: for each tuple T with a query S do the sequence -- query Q,
do the sequence of statements S. We also have an error check statement, which is: if the query Q returns a
nonempty result raise an error.
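The two extra statements can be sketched as follows. This is an illustrative toy interpreter over in-memory tables, not the Guava statement language: the class names `Insert`, `ForEach`, and `ErrorIf` and the function `execute` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Database = Dict[str, List[Row]]
Query = Callable[[Database], List[Row]]


@dataclass
class Insert:
    table: str
    row: Row


@dataclass
class ForEach:
    """For each tuple t returned by query, run the statements body(t)."""
    query: Query
    body: Callable[[Row], list]


@dataclass
class ErrorIf:
    """Raise an error if query returns a nonempty result."""
    query: Query
    message: str


def execute(db: Database, stmt) -> None:
    if isinstance(stmt, Insert):
        db[stmt.table].append(dict(stmt.row))
    elif isinstance(stmt, ForEach):
        # Copy the result first so the body may modify the tables it reads.
        for row in list(stmt.query(db)):
            for inner in stmt.body(row):
                execute(db, inner)
    elif isinstance(stmt, ErrorIf):
        if stmt.query(db):
            raise ValueError(stmt.message)
    else:
        raise TypeError(f"unsupported statement: {stmt!r}")


db: Database = {"src": [{"id": 1}, {"id": 2}], "dst": []}
copy_all = ForEach(lambda d: d["src"], lambda t: [Insert("dst", {"id": t["id"]})])
execute(db, copy_all)
```

Because the body of a loop is itself a list of statements from the same set, a translation that emits loops and error checks stays inside the language, which is the closure property described next.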
This set of statements is closed under our transformations. What that means is that if you have any kind of
statement of the form you see here and you push it through an operator, you're going to see a collection of
statements here. You're not going to get anything bizarre. Anything that operators don't understand.
We need -- incidentally, the reason why loop and error are even in this list is because of that closure
property. For instance, it is the case that the pivot operator you cannot push an insert statement through
without doing a little bit of additional processing.
So how do we do this? In normal view fashion, you execute a query against the natural database. It's
pushed through the operator. It gets translated into a query on the opposite side. It's translated in very
similar fashion to view unfolding. So any reference to an affected table is replaced by a sub query that is
equivalent to the table you're replacing. And I have a couple of information preservation formulas at the
bottom. I don't expect you to follow them, necessarily. But the formulas themselves, what you need to
come away from it is that the formulas themselves guarantee information preservation.
So if you were to execute a query against the natural schema, what you get as a result is the result you
expect. It is the result as if there were no operator there. Similarly, for DML statements like insert, you
execute the insert against the operator. What you get is some sort of translation. The translation might not
be an insert. The translation might be multiple statements. It might be things that aren't inserts at all. In
this case it's an insert, update followed by some looping. It's still within that list of statements but it is an
equivalent transformation and it's an equivalent transformation if the expression on the bottom is true.
In this case, for inserts. And what that big long formalism says is if I run the single table query T against
the thing I just inserted stuff into, I should get back what I had before plus my new rows.
DDL operates in a very similar fashion. The interesting part here is that a DDL statement executed against
the natural schema may result in both DDL and DML on the physical side. But it's still closed. We know
how to execute or we know how to translate DDL. We know how to translate DML.
Just to give you a couple of concrete examples about these operators. This one is the vertical partition
operator. Vertical partition is you take the set of columns in the table and you distribute them amongst two
tables. One reason why you might do this is if you have sparse columns and you separate them into a
second table. Make sure if there's a row in the second table that's all nulls, that basically you just don't
include it. So it's potentially a space-saving feature. This is how you declare V partition in the lower
right-hand corner.
As mentioned before, we need both a definition of the operator and then also a definition of their translation
of statements. So in this case you have translation of the schema in terms of what the input columns and
the output columns were of tables. And then you also have a translation of data in terms of relational
algebra. So you see that the output instance of the table on the left corresponds to the columns that we
expect. It's just a projection. And then the output on the right is the same thing except we're eliminating all
of the empty rows.
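The instance-level semantics just described can be sketched like this. The function name `vpartition`, the row representation, and the sample patient columns are assumptions for illustration only; the slide's own declaration syntax is not reproduced here.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def vpartition(
    rows: List[Row], key: str, left_cols: List[str], right_cols: List[str]
) -> Tuple[List[Row], List[Row]]:
    """Split one table into two that share the key column.

    The left output is a plain projection. The right output is the same
    kind of projection, minus rows whose non-key columns are all null.
    """
    left = [{c: r[c] for c in [key] + left_cols} for r in rows]
    right = [
        {c: r[c] for c in [key] + right_cols}
        for r in rows
        if any(r[c] is not None for c in right_cols)
    ]
    return left, right


patients = [
    {"id": 1, "name": "Ann", "allergy": None, "implant": None},
    {"id": 2, "name": "Bob", "allergy": "latex", "implant": None},
]
left, right = vpartition(patients, "id", ["name"], ["allergy", "implant"])
```

Patient 1 has no sparse data, so it appears only in the left table; that is where the space saving comes from.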
>>: What is it you're propagating here. You have the V partition as the operator.
>> James Terwilliger: So we're not propagating anything on this slide. This is providing a semantics to the
operator. This is what the operator does. So if you were to consider the channel in terms of translating full
instantiated instances of the natural schema into a fully instantiated instance of the physical database, this
is how you would go about it.
>>: So it's got an input schema producing an output schema, input data that's producing --
>> James Terwilliger: Correct. And then we have to describe what we do with all the various statements
that we support. So to give a couple of examples, if you see an insert statement, you would break it into
two insert statements, one on the left-hand side one on the right-hand side. The one on the right-hand side
you can drop if it's not inserting anything other than the key.
So notice that this insert has to respect the definition of the translation in the previous slide. And what I
mean by respect is that you have the input instance, then you have the input instance plus the inserted
rows. If you push those through, you should get what you expect.
Similar process for queries. In this case if you translate -- if you push a query through you translate any
references to this left table and you translate that into a left outer join between the two output tables. And
it's a left outer join specifically so you can recover those rows that you had dropped because they didn't
have any information in it.
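A small sketch of those two translations for vertical partition, using Python's standard `sqlite3` module. The table names `t_left` and `t_right`, the column names, and the helper `insert_patient` are hypothetical; the point is only that one natural insert becomes up to two physical inserts, and that a reference to the natural table becomes a left outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t_left  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE t_right (id INTEGER PRIMARY KEY, allergy TEXT, implant TEXT);
    """
)


def insert_patient(conn, pid, name, allergy, implant):
    """Translate one insert on the natural table into physical inserts."""
    conn.execute("INSERT INTO t_left VALUES (?, ?)", (pid, name))
    # Drop the right-hand insert if it would carry nothing but the key.
    if allergy is not None or implant is not None:
        conn.execute(
            "INSERT INTO t_right VALUES (?, ?, ?)", (pid, allergy, implant)
        )


# Any reference to the natural table is replaced by this subquery. The
# outer join recovers rows that were never written to t_right.
NATURAL_T = """
    SELECT l.id AS id, l.name AS name, r.allergy AS allergy, r.implant AS implant
    FROM t_left AS l LEFT OUTER JOIN t_right AS r ON l.id = r.id
"""

insert_patient(conn, 1, "Ann", None, None)
insert_patient(conn, 2, "Bob", "latex", None)

rows = conn.execute(f"SELECT * FROM ({NATURAL_T}) AS t ORDER BY id").fetchall()
right_count = conn.execute("SELECT COUNT(*) FROM t_right").fetchone()[0]
```

Querying through `NATURAL_T` returns both patients exactly as they were inserted, even though only one row was stored on the right-hand side.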
A more complicated and significantly more interesting operator is the unpivot operator. Unpivot is like the
example that you saw before going from things as schema to things as data. So in this case the before
image of a table is patient and you have a number of attributes corresponding to the data for each patient.
After you unpivot it, you end up with something on the right, where you have effectively key attribute value
triples.
This is a pretty common thing to happen in commercial software. You'll notice that the key has changed.
It's now patient ID and attribute, instead of patient ID. So a single patient's data is spread out over a
number of rows in the patient table after you've applied the unpivot.
Just like any other operator, we need to define it in terms of its semantics and also in terms of its
translation. So we can demonstrate that its schema, the columns, the output columns of the table look like
the original keys, plus your attribute column and your value column.
In terms of data, the output instance looks like this unpivot of the original input instance. And unpivot can
be expressed in slightly longer relational algebra. You see the relationship here, unpivot looks like a fair
amount of unions put together.
So you take something like the key and your Social Security number and then you union that with the key
and your name. And then union that with the key and so forth. Insert statements translate into a number of
insert statements corresponding to the various triples that you're inserting. And query replacement turns
into, translates all references to T to a pivot of T. Notice this is the opposite of what we did with the data
instance, because we're trying to effectively undo the transformation that we did to the database instances
as a whole.
So as you push the query through, every reference to T is going to be replaced with a pivot of T. It's true
that not all DBMSs understand the pivot operator but they always understand the left outer join version of
the pivot operator. So if the relational algebra ends up being pushed to the database, when that happens --
we can recognize whether it's a database that supports pivot. If it doesn't, we replace it with the other form
of the operator.
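To make the union form of unpivot and the left outer join form of pivot concrete, here is a sketch using Python's standard `sqlite3` module (SQLite has no pivot operator, so it is a fair stand-in for the fallback case). The table and column names are made up, and the sketch assumes null values are simply not stored as triples, which the talk does not state; under that assumption a patient whose attributes are all null would not round-trip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient (id INTEGER PRIMARY KEY, ssn TEXT, name TEXT);
    INSERT INTO patient VALUES (1, '111-11-1111', 'Ann'), (2, NULL, 'Bob');

    -- Unpivot: one (key, attribute, value) projection per column, unioned.
    CREATE TABLE patient_eav AS
        SELECT id, 'ssn' AS attribute, ssn AS value
        FROM patient WHERE ssn IS NOT NULL
        UNION ALL
        SELECT id, 'name', name
        FROM patient WHERE name IS NOT NULL;
    """
)

# Pivot without a pivot operator: one left outer join per attribute.
PIVOT_VIA_JOINS = """
    SELECT k.id AS id, s.value AS ssn, n.value AS name
    FROM (SELECT DISTINCT id FROM patient_eav) AS k
    LEFT OUTER JOIN patient_eav AS s ON s.id = k.id AND s.attribute = 'ssn'
    LEFT OUTER JOIN patient_eav AS n ON n.id = k.id AND n.attribute = 'name'
    ORDER BY k.id
"""

original = conn.execute("SELECT id, ssn, name FROM patient ORDER BY id").fetchall()
recovered = conn.execute(PIVOT_VIA_JOINS).fetchall()
triples = conn.execute("SELECT COUNT(*) FROM patient_eav").fetchone()[0]
```

Each additional attribute adds one more join, which is why the hand-written version grows unwieldy for wide tables.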
So each operator has to support all of these information preservation conditions. They can be spelled out
as follows. If I pose a query against a natural schema, I'll get back the same result as if the physical
schema matched the natural schema. So I'm not losing any data and I get exactly the data that I want and
that can be any query. Any query I push through I should get exactly what I want, no losses, no extra data.
And then, for instance, the second example here, if I issue an insert update or delete, then I get exactly
back what I would have gotten on the original table modified by my insert, update and delete. So this is a
fancy way of saying that working against the natural schema the entire physical database should be
transparent and completely lossless.
And these statements can be formalized. You've seen a couple of these statements. There are five
statements total. And each of these correspond to one of the translations that you saw previously. So, for
instance, O query of QT means what is the query translation for the single table query referring to T. So
this is, for instance, in the second one it's saying if I push the query through and I execute it against the
pushed through database, after executing the push through DML, I'll get back exactly what I want.
So this is all just a way of saying that I'm -- all this formalism is describing the preservation properties and
it's just another way of saying that operator O must always return what I expect when I issue queries or any
other statements.
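The first two conditions can be checked mechanically on a toy operator. The sketch below uses an Apply-style operator with an invertible function (centimeters stored as millimeters); the class `ApplyOp` and its method names are hypothetical and stand in for the translations named on the slide, which are not reproduced here.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # (key, length)


class ApplyOp:
    """Stores a column transformed by `forward`; `inverse` undoes it."""

    def __init__(self, forward: Callable[[int], int], inverse: Callable[[int], int]):
        self.forward = forward
        self.inverse = inverse

    def translate_instance(self, natural: List[Row]) -> List[Row]:
        """Definition of the operator on a whole database instance."""
        return [(k, self.forward(v)) for k, v in natural]

    def translate_insert(self, physical: List[Row], new_rows: List[Row]) -> None:
        """An insert on the natural table becomes an insert of transformed rows."""
        physical.extend((k, self.forward(v)) for k, v in new_rows)

    def translate_query(self, physical: List[Row]) -> List[Row]:
        """The single-table query T, answered from physical storage."""
        return [(k, self.inverse(v)) for k, v in physical]


op = ApplyOp(forward=lambda cm: cm * 10, inverse=lambda mm: mm // 10)
natural = [(1, 170), (2, 182)]
new_rows = [(3, 165)]

# Condition 1: querying through the operator returns the natural data.
physical = op.translate_instance(natural)
query_preserved = op.translate_query(physical) == natural

# Condition 2: after a pushed-through insert, the query returns the
# original rows plus the new rows.
op.translate_insert(physical, new_rows)
insert_preserved = op.translate_query(physical) == natural + new_rows
```

A test like this exercises one instance only; the talk's claim is the stronger one that the conditions are proved for every instance.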
It's also the case that operator O for any operator must respect foreign keys. So that means that if I insert
something that would violate a foreign key as expressed in the natural schema, then something down at
the physical schema must also throw an error. There's additional formalism to go with that, but I figured
perhaps this was quite enough formalism for one day.
It's also the case that deletes must cascade. So all of the usual things we expect from referential integrity
must also be true in the physical database. In terms of implementation of this part of the architecture, just a
couple of quick notes, the operators are implemented using the visitor pattern. We've gone through a
couple of implementations of the channel and the most recent one leveraged my experience from last year
as an intern working with Entity Framework. So I saw what a good design pattern for translation of these
things looks like.
Queries, DML and DDL are represented by syntax trees similar to canonical query trees or canonical
command trees here at Microsoft. We've developed an ETL-like tool for designing channels. We designed
it using the domain-specific language tool that comes with the Visual Studio SDK.
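As a rough illustration of an operator implemented as a visitor over a statement tree, consider the sketch below. The node classes (`TableRef`, `LeftOuterJoin`, `Select`) and the visitor `VPartitionVisitor` are invented for this example and are not the Guava or Entity Framework types; the rewrite shown is the vertical partition query translation described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    name: str

    def accept(self, visitor):
        return visitor.visit_table(self)


@dataclass(frozen=True)
class LeftOuterJoin:
    left: object
    right: object
    key: str

    def accept(self, visitor):
        return visitor.visit_join(self)


@dataclass(frozen=True)
class Select:
    source: object
    predicate: str

    def accept(self, visitor):
        return visitor.visit_select(self)


class VPartitionVisitor:
    """Rewrites references to one natural table into a join of its two parts."""

    def __init__(self, table: str, left: str, right: str, key: str):
        self.table, self.left, self.right, self.key = table, left, right, key

    def visit_table(self, node: TableRef):
        if node.name == self.table:
            return LeftOuterJoin(TableRef(self.left), TableRef(self.right), self.key)
        return node  # unaffected tables pass through unchanged

    def visit_join(self, node: LeftOuterJoin):
        return LeftOuterJoin(node.left.accept(self), node.right.accept(self), node.key)

    def visit_select(self, node: Select):
        return Select(node.source.accept(self), node.predicate)


visitor = VPartitionVisitor("patient", "patient_main", "patient_sparse", "id")
query = Select(TableRef("patient"), "name = 'Ann'")
translated = query.accept(visitor)
```

Each operator would supply its own visitor, and a channel would apply them one after another to the same tree.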
As James checks his watch, next week -- no, different joke. Different joke. Yes, next week we're doing a
comprehensive performance analysis on this. To verify this statement that time spent in the channel is
negligible compared to the time spent executing anything else.
This is a normal requirement of middleware in software development. You can't have the middleware be
the dominant factor or even a noticeable factor when it comes to UI execution.
We've been waiting to do this analysis until we were certain that we've done as much optimization on the
output SQL and also on the channel itself as we could. So I think we're ready to do that.
I can say that when we ran these results last year against a previous implementation, time spent within the
channel was about three orders of magnitude less than time spent within the database.
So at that point it met this hypothesis; we just need to retest that hypothesis. A few more notes about
implementation. This is the amount of code it takes to implement each of these operators. It ranges from
around 400 lines. About 300 lines for function application up to about 800 lines. Horizontal merge is
actually a little tricky because it's horizontally merging tables that might not be union compatible. The keys
have to be union compatible but the other columns don't. So handling schema evolution in that case is kind
of an interesting problem. That's why it takes more code there.
The whole code for the whole framework comes to about 25,000 lines. The UI library would be
substantially smaller if it weren't for the fact that we had some orthogonal work that we were trying to relate
guava with some other research we were doing at Portland State. This is an estimated line. In my
judgment I would say the actual size is about 15% less, maybe around 22,000 lines of code or so for the
whole application.
Most of the effort in designing the channel in terms of establishing what the formalism was behind
everything, we went through a couple of iterations of that. But defining what the action is of seven different
operators on 15 different statements then proving the vast majority of those with respect to our formal
equations did take quite a bit of time.
But once all of that effort had been done, it was fairly straightforward process to write each one of the
operators.
>>: So these proofs are against some abstraction. You're not proving code.
>> James Terwilliger: No, we're not proving code.
>>: Some spec with what the operator does.
>> James Terwilliger: Yes, so we define what the operator does against instances of the database, and we
treat that as the definitive semantics of what that operator does. Is what it does to a whole instance. And
then we verify that every single one of our statement translations respects that and is
information-preserving.
Writing the SQL Server provider was intricate. What I mean by the provider is the thing that takes those
trees and translate those into the native syntax of SQL Server. It was a little bit tricky because of things like
pivot and unpivot statements and function application to get those to work correctly took a little bit of effort.
But it is up and running. So let's see, how are we doing on time?
>>: Good.
>> James Terwilliger: So this is a section dedicated to looking at some of the current work and
opportunities that we're doing with guava. What I think I'll do is touch on schema evolution and then move
to the conclusion.
But if you're interested at all in how we're experimentally evaluating or referential integrity or this thing
called application operators, which are operators above and beyond the seven that we've described, I can
come back to those things if people are interested.
So in terms of schema evolution, you can think of the physical database in terms of a function that takes in
a UI and a channel. Because we have a default representation of a UI and the channel is what transforms
that into your physical storage. So how do you change the physical database? Well, you could have
changed the UI or you could have changed the channel.
You could have altered the operators in it. So in the user interface you might have added forms. You
might have changed some controls. You might have changed some context information. A lot of the
changes that you might make to a UI would have a direct impact on the data you want to store in your
physical storage, but not all of them necessarily do.
The channel will either propagate all of these changes automatically or in a very, very rare instance throw
an error. And the only case it would throw an error is something like you've added a control and
somewhere down the line the name of that control happened to conflict with something.
But in that case you would know exactly where the conflict happened and it would be a very easy
resolution. There's also the possibility that the channel itself might change. You might add new operators,
change the physical design layout. Changes to the channel are separate from changes to the user
interface. So if you change the channel, the UI is completely unaffected. So the channel is mostly for
designing the physical database. It has nothing to do with the UI.
>>: Question. So when you work on the UI, let's say you've modified a couple of things. And you want to
(inaudible) so you (inaudible) all those operations. You compare the old UI with the new UI and you find
your place from the DDL major operations on the database?
>> James Terwilliger: Yes.
>>: (Inaudible).
>> James Terwilliger: That's exactly the deployment. Is that you make some changes to the user
interface. Those user interfaces, the user interface changes are captured atomically, if you were to drop a
control on the form it would register as an add column against the natural schema in a log behind the
scenes. And then the next time you run the application -- the very first thing that happens any time you run
the application after it generates the G tree is it pushes the natural schema through to initialize the
operators.
Make sure it's aware of all the schema that's in place and it will also generate the database on the other
side. But if it notices that there were some DDL statements in the log, what it will first do is push through
the old natural schema and then push through the DDL statements in that order.
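The startup sequence just described might be sketched as follows. Everything here is a stand-in: the `Channel` class is a trivial one-operator channel that only renames tables, and `start_application`, the log format, and the form and column names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, List[str]]
AddColumn = Tuple[str, str]  # (table, column), logged when a control is dropped


class Channel:
    """Trivial stand-in channel: physical tables are prefixed copies."""

    def __init__(self) -> None:
        self.physical: Schema = {}

    def push_schema(self, natural: Schema) -> None:
        """Initialize the operators with the schema already in place."""
        self.physical = {f"phys_{t}": list(cols) for t, cols in natural.items()}

    def push_add_column(self, table: str, column: str) -> None:
        """Translate an add column on the natural schema into physical DDL."""
        self.physical[f"phys_{table}"].append(column)


def start_application(old_natural: Schema, ddl_log: List[AddColumn]) -> Channel:
    channel = Channel()
    # First push the old natural schema, then the logged DDL, in that order.
    channel.push_schema(old_natural)
    for table, column in ddl_log:
        channel.push_add_column(table, column)
    return channel


old_natural = {"procedure_form": ["id", "indication"]}
ddl_log = [("procedure_form", "sedation")]  # a new control on the form
channel = start_application(old_natural, ddl_log)
```

Replaying the log after the old schema means the channel sees each change as an incremental statement rather than as an unexplained difference between two schemas.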
So updates to the user interface. We've covered this a little bit. But the key is that because the natural
schema is directly related to the user interface itself, that changes to the UI have a relatively straightforward
correlation to changes to the natural schema, which can be expressed in DDL, just standard relational
DDL. Like alter table.
And here the changes made to the UI are recorded and translated into DDL statements that are packaged
and pushed at a later time when it's necessary. You can do more complicated changes. I have a little bit
of material if people are interested in how to do that a little bit later. But the idea is that if you want to do
something more complicated like change the domain of a control, that can be expressed in terms of more
atomic things, and we already know how to push the atomic things through. So actually in the interests of
time I'd like to skip forward to the conclusion.
But if people are interested we can come back to some of these other topics. So future work. Conclusion.
What we see as the contributions of this work, again, we see there's a framework for generating a database
backed operation from the UI.
So a user interface centric design methodology which we think might support the Agile and like extreme
programming kinds of software development. An application-specific query interface which was our initial
research question that we think this framework does in fact solve that research goal.
We have a middleware abstraction tool for providing information preserving transformations that includes
some expressive power that's not available in other mapping languages.
We see the channel as something that can be used independent of the UI portion. So if you want an
information preserving transformation just from one database to another like let's say from your say your
production database to your data warehouse. We see that as possible.
And then also a comprehensive approach to schema evolution. The attempt to be -- the intent of which is
to avoid having to write manual database upgrade scripts, which is a common problem I've seen.
We have four publications to date. So this work is out there. All of the -- one major aspect of future work
we see is seeing whether the channel is applicable in alternative data models. So so far everything
you've seen is in the relational model which has the advantage that there is well understood DDL language
on how to translate one schema to another.
Alternatively, we might look at XML. See if we can have channels between XML and XML. In terms of
schema XML you can create and drop schemas, validate data against a schema. But there's not yet a
language for incrementally evolving XML schemas.
We might consider other models RDF, et cetera. Or we might also have a situation where the natural
schema is in relational and part of the physical schema is in XML or vice versa. So having an operator that
can do shredding or translation between models.
To give you a taste of how XML would work, remember that a channel you need to know what the
expressive power of the operators are but you also need to know what your list of statements are.
So your list of operators would have to be things that are information-preserving. In this case you might
have something like promote attribute to elephant -- to an element in XML.
Yeah, it's close to lunch, I think.
Invert hierarchy within an XML would be an information, would be an interesting information-preserving
XML transformation. So you'd have that definition of expressive power. But then you'd also have the list of
supported statements. We know what queries look like in XML. We have several languages for that.
XPath, XQuery, XSLT, all of those. We know that several databases support an XML DML-like language
that can do things like insert nodes. Change individual values, et cetera.
But the DDL is kind of the trick here, is there's no DDL for XML. So what do you do when you want to
evolve your XML schema? There's a couple of possibilities. We could invent a language with the
expressive power of XML; that would be very hard. You could do something like this last bullet which we
found kind of interesting where you generate an XSLT against the schema and we could translate that
XSLT against the schema automatically into XSLT against the data, so the data that met the
validation before would meet the validation after.
Other things worth looking at pushing other things through a database, through a channel. So physical
constraints. Like foreign keys we've done. But maybe other kinds of constraints we could push through.
Physical statistics we've done a little bit of work on.
So if you know what the statistics are for histograms, for instance, on the natural schema, we can give you
close or estimated statistics on the physical database.
We would like to analyze this framework qualitatively, present it to some software developers see what they
think about it. I've had some interaction with some Agile programmers who are very interested in this but
would like to formalize that kind of study.
Also future work possibility could be how can you construct a channel that gives you the maximum benefits.
So use a channel for optimized physical database tuning.
And that concludes my talk. Thank you.
(Applause).
>> Phil Bernstein: Questions? Very quiet crowd. This is really uncharacteristic.
>> James Terwilliger: It's because I was so eloquent.
>>: Given so many times that you've already figured out where all the questions lie, do you think that -- you
talked a bit about XML at the end. Do you think that would be actually a good thing to support XML in the
database? Do you think it's just -- would it actually benefit the framework, or is it just something that you
would do because people felt the need to have XML in the database?
>> James Terwilliger: It's certainly the case that when I brought up this research in an academic setting
one of the first questions is wow, this is really cool, can you do it in XML. So at the very least there's an
interesting research question there. In terms of its application to actual software engineering, I haven't
seen very many software applications that use XML underneath natively. Usually they have a data set or
some other relational-like structure that they bind to.
But it might certainly be the case that even though it's bound to relational in the natural schema, it might
use an operator to translate to XML on the way to storage and actually store it as XML. Sort of a channel
to stored XML if you will.
And then have XML translation that would further optimize at that point. So that's -- that's where I see XML
channels as being more likely, as you introduce XML in the channel and then you actually apply operators
from there.
>>: Do you have any sense of -- presumably this is supposed to improve productivity of developers putting
together these sorts of applications. Obviously the query interface is a nice -- that's better functionality.
But you're automating so much, I guess the idea is that this will be a more productive environment for
people. Do you have a sense of how much more productive?
>> James Terwilliger: So I've had a few masters students work with me over the past year or two and their
reaction to this was I want it, I want it, I want it. I've worked with one or two people who, as their profession
do Agile development. And they gave me a pretty similar reaction. I think that it's a big open question in
terms of the best way to interface the business logic portion with the stuff that we generate automatically.
We provide one way of doing it. But we're not sure it's the best way to do it. So we'd still need more
results on that.
>>: Did you do any surveying of medical people to ask them how they would respond to this? I notice that
this is funded with some sort of grant from medical.
>> James Terwilliger: Yes, so the majority of this work was done under the Collins Medical Trust Grant
and also an NSF grant, at the bottom. We've been working pretty closely with this software development
group underneath Oregon Health and Science University that includes physicians. It includes clinical
terminology aware statisticians. So people running data warehouse-type queries over a data warehouse
but who are required to understand clinical terminology not to a diagnostic level.
And then also developers. So everywhere from developer on the scale all the way to total domain expert.
And within this group we've demonstrated both the query interface and the software development to each
phase of that. That's where our feedback has been coming from.
It is certainly the case that we want to open this up to evaluation by more people.
>>: So have any of the medical people tried to generate an application with this framework?
>> James Terwilliger: Not yet. So far we've been translating the application in-house, but having
developers look in. But it's still at an informal level.
>>: You said you didn't actually do this. The query interface is pretty cool. But at the end you have this
more generic implementation. Is that just level of effort or were there other technical?
>> James Terwilliger: There's one technical impediment but beyond that it's all level of effort. It's all
coding. And the level of effort comes to -- let me skip forward a few slides, since I have other things. And
animations apparently. Here it is. So the control you see on the right is one that's directly out of the
application that we've translated.
>>: They do that. They draw pictures.
>> James Terwilliger: Yes, so they provide the stomach in the control and then what you do in this control
is you either draw a little area or click a little button and it will come up with the findings window and says
what finding did you find at this location. Then there's some functionality behind it.
In this control it's easy to guavalize. It has a very simple relational background. We didn't change any of
the behavior. This control works in guava and it's beautiful and everything works well and how do you
query it? That's the technical impediment. You have a complex control like this. What does this control
look like in the query interface. We've actually given this one a query interface. We modified the behavior
of the control slightly so you can toggle between query mode and data entry mode. And the query mode
looks like you draw a rectangle and it translates that rectangle into did I find a finding within this boundary.
That's certainly not the only queryizable version. It's definitely the case that for every complex control like
this you would impose on the programmer the need to come up with a query version of it. That is a
technical limitation that is not present in the current implementation because all you're seeing is the G tree.
So you don't have to implement any additional things behind it.
We can queryize this. But it's not automatic. It requires extra stuff.
>>: So this kind of touched on my question. I was kind of just wondering if you're targeting some specific
class of applications. Because the examples seem to be very easily mapped to a relational data model.
So just wondering if there's some specific target you're looking at.
>> James Terwilliger: Good question. I don't know if the microphone caught that. So let me repeat the
question. The question was, is this framework geared toward a particular class of application based on
maybe its internal data model, et cetera. It's been my experience working in development that a relational
data model underlying it is certainly the most common thing right now. So gearing it towards relational
model I think isn't as much a restriction currently. That may change in the future. It is also the case that
we're geared towards data entry-centric applications.
So if you had an application that, say, it was a graphical user interface but it was all for the monitoring and
sensor data, it's not clear to me how to guavize an application of that character yet. So this is certainly for
the case where you're both entering and viewing data, not just viewing data. So that does narrow the class
of application down a little bit. To give you an example of an application like that, I once worked for a
company that did software for silicon manufacturing. It was just basically a gigantic microscope that did a
bunch of readings and millings on silicon. This would not be applicable to that application. It had a
database. It had a user interface on top of the database, but it was all about visualizing. And so it's not
clear how I would apply this to that kind of application.
>> Phil Bernstein: Thanks again.
(Applause)