David Lomet: It's a pleasure for me to welcome Carlo Curino from Politecnico di Milano, and, by way of UCLA, who will talk to us about some combination of temporal databases and schema evolution. The system is called Panta Rhei. Carlo?

>> Carlo Curino: Thank you very much for the introduction. I will present part of my Ph.D. thesis, which is basically about database evolution. Our motto is Panta Rhei, which means everything is in a state of flux. This is particularly true for business reality, which continuously changes and forces the information systems underneath to continuously adapt to evolving requirements. This has been studied as a big problem: evolving the software of an information system actually costs a lot, and can account for up to 75% of the overall cost.

As a database person, I focus on the problem of the evolution of the data management core of an information system. In particular, I concentrate on the issue of schema evolution, that is, how we can better support the applications when the schema needs to change, and on the problem of archiving the contents of a database, for example to meet a company's legal obligations. And of course, one of the things companies continuously do is merge with and acquire each other, so you have to integrate information systems, and at the data level that is often a problem of data integration or data exchange and the like.

During my Ph.D. I had the great opportunity to visit Professor Zaniolo at UCLA for over a year, so my thesis sits a little bit on the two sides of the Atlantic Ocean and on two sides of the problem. At Politecnico I worked on data integration and data extraction with an approach based on artificial intelligence techniques, and on some work on context-aware data filtering. At UCLA I worked on the problem of schema evolution and database archival. Today I will not have time to talk about both, so I will focus on the problem of evolution in databases. Of course, I'm available if you want to talk about the rest later in the meeting; I will be happy to.

So, the problem of schema evolution. It is a well-known and longstanding problem, and traditionally it has been very difficult, at least for academic researchers, to understand exactly what the overall impact of evolution on a database is and what characteristics this type of evolution has. So before starting the actual work on schema evolution, we decided we needed a better understanding of the problem. What we did is analyze information systems, in particular web information systems. Some of them are open source in nature, like Wikipedia, so we had a good set of information about what they were doing. Wikipedia is a popular website, and its relational database back end actually stores the entire content of the website, so it was a good example: it's very popular, everyone knows about it, and it's open source, so we could both get at the data and release the information we obtained. It turned out to be a pretty interesting case: we found over 170 schema versions in four and a half years of lifetime. And there was also a big need for archiving the content of the database.
They actually needed a transaction-time database, even if they probably wouldn't phrase it that way, because about 30% of their schema was dedicated to timestamping the content, maintaining archived versions of the content, deleted versions of the images, et cetera. The use of this study has been to assess the severity of the problem itself and, in a sense, to guide the design of our systems, in particular the language of schema modification operators which, as I will present, is quite central in our work.

To give you an idea, I am showing a couple of the statistics we collected for the Wikipedia case. One shows the schema size in terms of number of columns in each subsequent version: it goes from about 100 to 250, so it is growing quite a lot, and this goes together with the popularity of the website and all the features that have been added little by little. And on the other side --

>>: So what's the cause of the dramatic downward --

>> Carlo Curino: The spikes? Okay. The problem is that even in the publicly released schemas, which are the ones we are using, not the internal development ones, sometimes there is a syntactic mistake in the schema, so the script that is supposed to load the schema will not actually work and will only load two or three tables, and you end up with a spike like this. I left them there because, in a sense, they represent part of the problem of evolution.

On the other side, I show what the impact on the applications can be. What we show is the query success rate if we take a bunch of queries that used to run and were designed for schema version 28 and blindly execute them on the subsequent schema versions. As you can see, there is not much success in doing that. It means that even a few differences in the schema can actually break a lot of queries. This represents part of what we want to try to solve.

So the Wikipedia example was very useful for us and gave us a lot of feedback, so we decided to go a little beyond Wikipedia. We developed a tool suite to automate the collection of this kind of information about the evolution of a system and to run analyses on top of it in an automatic way. The process at the moment is collecting a very large data set of these evolution histories. Right now we are at around 180 evolution histories, from scientific databases, like CERN, the nuclear research center in Geneva, and several genetic databases, plus of course open source information systems and some administrative databases from the Italian government. So it is creating basically this big pool of real examples, which we want to use for benchmarking tools that support schema evolution and to have a clear idea of what schema evolution looks like. At the moment we are planning to survey about 30 open source, commercial, and academic tools, to create a good picture of what the problem is and what the solutions are. And I guess in general this kind of data can be useful not only for benchmarking schema evolution, but for creating benchmarks in general: for example, the idea of a mapping composition benchmark that [unintelligible] was suggesting in one of his recent papers could probably be based on some real data, trying to see how the various mapping tools would do in following the evolution that happened in these cases. Of course, we're going to release everything for public use.
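[For concreteness, a sketch of the kind of breakage the "query success" experiment measures. Table and column names are loosely modeled on MediaWiki's early single-table layout and the later page/revision/text split that comes up again later in the talk; they are illustrative, not the exact versions studied.]

```sql
-- Illustrative only; not the exact Wikipedia DDL.

-- Query written against an early schema version, where page text
-- lived in a single table `cur`:
SELECT cur_text
FROM   cur
WHERE  cur_namespace = 0 AND cur_title = 'Database';

-- After a later evolution step splits the content across page,
-- revision, and text, the old query simply fails, and the same
-- information requires something like:
SELECT t.old_text
FROM   page     p
JOIN   revision r ON r.rev_id = p.page_latest
JOIN   text     t ON t.old_id = r.rev_text_id
WHERE  p.page_namespace = 0 AND p.page_title = 'Database';
```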
So let's see how we try to solve the problem now. We analyzed the problem, and it looks pretty clear what it is. The Panta Rhei framework is the effort we are making at UCLA to address the problem of evolution. There are three main systems that collaborate tightly to solve this problem. One is the PRISM system, which is the first one I will present, and of which I will give you a short demo; it basically supports schema evolution for snapshot databases, that is, regular databases, and uses schema mappings and query rewriting to do so. The second system is called PRIMA, and it basically adds archival functionality on top of what PRISM is capable of doing; there are a bunch of optimizations we had to put in place to make things run fast enough. A third component complements them by maintaining a sort of documentation of the evolution, that is, metadata histories, and allows temporal queries on top of the metadata history, which can be useful.

So let's see. Schema evolution. We typically start from a database DB1, with its corresponding schema S1, and a bunch of queries Q1 that run just fine for the business needs. But then we need to store more data, or store data differently to be faster, or support new functionality, so we need to move to a new schema, S2. What typically happens is that the database administrator writes down some SQL scripts that change the schema, some SQL scripts that migrate the data from schema 1 to schema 2, and the application developers probably need to tweak their queries to run on the new schema. This is typically what happens today. Unfortunately, this happens again and again and again, so we repeat this process every time we change the schema.

To give you an idea: for Wikipedia, we said 170 schema versions. If you go toward the scientific databases, one genetic database had 410 schema versions in nine years of history. That means roughly a schema version every week or every ten days. That is a little bit too much: it is a lot of reworking of the queries. In particular, in the case of Wikipedia, the impact on queries is that up to 70% of the queries need to be manually adapted after each evolution step. That is the worst case, but it means that with hundreds of versions I am doing a lot of work. It is also a problem for the data migration, because every time we migrate data there is a risk of losing data, because I did something wrong in the migration script, or I might create redundancy. Even the efficiency of the migration matters: if you talk to people who work on astronomical databases, they say even adding one single column might take weeks, because they have terabytes of data in each table. So even the migration might be costly. And the efficiency of the new design is at stake: how fast will my queries run in the new design?

So what we would like schema evolution to be: we would like the design process itself to be somewhat assisted and predictable, so that I have an idea of what the outcome will be for the schema, the data, et cetera. The migration scripts: we don't really want to take care of those by hand, so let's have some tools generate them. And for the legacy queries, it would be wonderful to have them somewhat automatically adapted to run on the new schema. The same goes for views, updates, and the other objects we have within a database system.

>>: In this ideal world, it's presuming that when you move from database to database, it's monotonic in terms of the information content, right?
>> Carlo Curino: No, no, it does not need to be. It might be that some --

>>: Query Q1 talks about something that you decided not to store anymore. There's no way --

>> Carlo Curino: In that case, you cannot support query Q1 anymore. That would be the user's choice. If, for example, I have a query that asks about the engineers, and I fire all the engineers in my company, I don't need to support that query anymore, right?

So this is more or less what we try to do with our system. Here is a general view of the PRISM system. To support schema evolution, the evolution design, we define this language of schema modification operators. It's not a new idea; we just try to make it work on our examples. By analyzing this language, we can foresee what the impact will be on the schema, on the data, and on the queries. For automated data migration, we just generate the SQL migration scripts out of this language, and to automate query support we derive logical mappings, that is, correspondences between subsequent schema versions, and use this information to do the query rewriting. Current work is basically trying to make the same machinery work also for updates, and so to get a better handle on the integrity constraints, propagate them, and see what happens to updates. That will be more tricky.

So, the language we have: schema modification operators. Each operator represents an atomic change, and by combining them we can create complex evolution steps. They have been designed and tested on the Wikipedia example, and they cover the entire Wikipedia history. If we apply them to the other cases we collected, we get a coverage of the evolution steps of about 99.9%, which I would call a success for now; we will see later how we can squeeze out that last 0.1%.

The idea is that each operator works both on the schema and on the data. So when I say JOIN TABLE A, B INTO C, I mean: create the new table C, populate it with the join of the two input tables, and drop the input tables. From a data migration point of view, each operator is fairly simple and compact, and what we do is just have an SQL script that implements inside the DBMS the semantics of each SMO. There are some optimization issues: we do some optimization at the level of the single SMO, and there is more work that can be done by considering sequences; if you add and then remove things, or if you do multiple renamings, you can probably optimize further.

Now, the logical mappings. The deal is that I need to do query rewriting, so I basically need some logical representation of the relationship between subsequent schema versions. What I have at the moment is the input schema and the SMO, and by applying the SMO I get the output schema. So what I do is look at each of these SMOs and see what correspondence it generates between the input and output schemas. The language we use is disjunctive embedded dependencies (DEDs), a fairly powerful logical mapping language developed by Alin Deutsch and Val Tannen [phonetic]. This is basically the minimal language we need to cover all our SMOs, and we will see how we use it for the rewriting. So in this case, for example, if I join table R and S into T where some condition holds, I will write it like this. The first one basically says that whatever I find in table R and table S that satisfies the condition will be found in table T.
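[The dependencies on the slide are not captured in the transcript; a plausible reconstruction for JOIN TABLE R, S INTO T WHERE cond, assuming R(a,b) and S(b,c) are joined on the shared attribute b:]

```latex
% Reconstruction, not verbatim from the slide.
\forall a,b,c \;\; R(a,b) \wedge S(b,c) \wedge \mathit{cond}(a,b,c) \;\rightarrow\; T(a,b,c)
\forall a,b,c \;\; T(a,b,c) \;\rightarrow\; R(a,b) \wedge S(b,c) \wedge \mathit{cond}(a,b,c)
```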
The second one basically states the converse of the first: in table T there is only what came from table R and table S and satisfied the condition. Together, they create an equivalence mapping between the two subsequent schemas.

Based on this language, we can say we have a mapping M that relates the two schemas, and we can define our query answering semantics. Let's say that database D2 is basically database D1 migrated by this mapping, plus or minus whatever updates come in under the new schema, okay? That's the starting point. What we want to do now is answer the query Q1 over database D2. The problem is that query Q1 is written against schema version one, and database D2 is under the other schema version, so we have to find a way to connect them. We can assume we have an inverted mapping, M inverse, that brings the data back under schema S1. If that is true, my query answer will be query Q1 evaluated on the version of D2 that has been migrated back. This sounds just fine; it's what we want to achieve. The problem is that I will not really migrate the data back every time I have to answer a query. Think about that genetic database: I am not going to materialize the database 410 times, and every time I get an update I would have to propagate it through all the previous versions. So this is just what we want to achieve. How we will achieve it is by finding an equivalent query, Q1 prime, that, if executed directly on top of D2, produces the same result as executing the original query Q1 on the version of D2 that has been migrated back. This is our job: finding this query Q1 prime.

And the problem is that we were assuming we had an inverted mapping, M inverse. The user is giving me the mapping M, the forward one, so now I have to invert it. These DEDs are not really friendly when it comes to invertibility; this has been studied recently by [unintelligible], and the language is a bit too powerful to do inversion in general. So what we do instead is invert the SMOs and derive the inverted DEDs from them. This works because, for the DEDs generated by the SMOs, we need the entire power of the language but we do not use all of its constructs in arbitrary ways: we know the shapes that can appear.

So the idea here is the following. When we have to invert SMOs, there are still a couple of issues. One is that not every SMO has a perfect inverse: you can imagine that if I drop a table, there is not much I can do to invert it and go back to the exact same database. So we use the notion of quasi-inverse, again introduced recently by the IBM group. Intuitively, a quasi-inverse is the best we can do: whatever data was not lost on the way forward, we will try to bring back. That's roughly the intuition.
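[In symbols, a reconstruction of the answering semantics just described; the notation is mine, not taken verbatim from the slides.]

```latex
% M maps schema S1 to S2; Q1 is written against S1.
D_2 \;\approx\; M(D_1) \;\pm\; \text{updates issued under } S_2
\mathit{ans}(Q_1, D_2) \;:=\; Q_1\!\left(M^{-1}(D_2)\right)
\text{goal: find } Q_1' \text{ over } S_2 \text{ with } Q_1'(D_2) = Q_1\!\left(M^{-1}(D_2)\right) \text{ for every } D_2
```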
>>: [inaudible] The best that you can do might be better than the best that I can do.

>> Carlo Curino: Yeah. Well, the definition is even more formal than that, as you can expect from this group of people.

>>: [inaudible].

[laughter]

>>: Before you go down this path, why not process the query by treating mapping M as a view mapping and then just use answering queries using views [inaudible]? Would you come up with a different answer than doing it this way?

>> Carlo Curino: One of the options we offer is actually, once we invert the mapping, to use that information to generate views and to support the original queries through views. The idea is --

>>: That would be a global solution?

>> Carlo Curino: Yeah, because we use the inverse to support it.

>>: Right. I'm wondering if it would come up with a different result than just sort of directly applying the query, treating the mapping [inaudible].

>> Carlo Curino: Thank you, but there are issues like equivalence versus the general data integration approach, which only requires sound answers. I think it might differ; I haven't considered that.

So, the other issue is that not every SMO has a unique inverse. But think about this: we are at design time here, so we can still ask the database administrator to help us and to disambiguate the cases in which the inverse is not unique.

A couple of examples. Quasi-inverse: for example, you have a JOIN of R and S into T, and let's say it is a lossy join, in which not every tuple of R and S participates in the join, so I am losing information by doing this. But I can use a DECOMPOSE to go back, redistributing T into tables R and S: if I decompose the data back, whatever survived the join will be brought back. The actual definition of quasi-inverse means that if I then bring that data forward again, I lose nothing more; I obtain the identity on the target.

About multiple inverses: one of the SMOs is COPY TABLE R INTO S. We create two copies of the same table, and now, how do we go back? It depends on what we did with these two tables, okay? And the database administrator can tell us what the use of these tables is in the application. If I have some guarantee that they are perfectly redundant all the time, then it is enough to drop table S, and I will go back to the exact same database. If table R is sort of an old copy and table S is the one where the new, interesting data actually comes in, then probably the DBA wants to drop table R and rename S into R, because he wants to bring that data back. Or, if both receive interesting material, we may want to merge them, that is, union them back into table R. Choosing among these depends basically on what the semantics of the rewriting should be: how are we going to support the old queries on top of the new database?

Once we have the inverted SMOs, we can derive the inverted DEDs, which now specify the relationship in the other direction, from D2 towards D1, and we can use this information by means of a technique called chase and backchase, which basically adds atoms to the query and removes atoms from the query while guaranteeing equivalence. Of course, we do not do it randomly: we start from queries expressed against schema S1, we try to add atoms from schema S2 using the mappings we have, and we try to remove the atoms from S1. If I succeed in getting the query completely expressed only on S2 while still being equivalent, then I have succeeded in my rewriting. The chase and backchase was introduced by Popa and Deutsch, and in fact we use MARS as our query rewriting engine: a highly optimized implementation of the chase and backchase developed at UCSD by Alin Deutsch, who is also a co-author of some of the papers of the Panta Rhei framework. So we use MARS to do this query rewriting.
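[The three inverse choices just described for COPY TABLE R INTO S, written out as the SQL each one amounts to; purely illustrative.]

```sql
-- Which inverse is right depends on how the application used R and S.

-- 1. R and S stayed perfectly redundant: just discard the copy.
DROP TABLE S;

-- 2. New, interesting data went only into S: keep S's content as R.
DROP TABLE R;
ALTER TABLE S RENAME TO R;

-- 3. Both tables received data: merge them back into R.
INSERT INTO R
SELECT * FROM S;     -- or SELECT DISTINCT, if exact duplicates should collapse
DROP TABLE S;
```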
So, what are the issues from a practical point of view? We have a web-based interface that supports the database administrator in doing the evolution design, which I will preview in a couple of slides, and a runtime where we can support the old queries either by online query rewriting or, as I was anticipating before, by creating composed views that go from schema S1 to schema S2. To create these views, I use the same query rewriting engine, but I use it at design time: I issue a query like "select star" from a table of the old schema, I rewrite it into an equivalent query on the target database, and now I can set up a view whose name is the old table and whose body is whatever the rewriting is. In this way, I don't need to trust my research prototype to run at run time to support the applications.

>>: So you try to generate that view and, if M was not invertible, what kind of view comes out? Do you give up, or do you just produce the best you can do?

>> Carlo Curino: No, with the chase and backchase either I can make it to an equivalent query or I just give up, basically. I know it's different from what you can do with the system you were developing, which can materialize part of it along the way. Definitely, from this point of view, either you obtain an equivalent query or you just cannot support the query.

>>: Long-term, if you evolve a database: short-term, obviously, it's great if you can take your old queries and run them against the new database. But at some point you've got to translate the queries. This can't go on; you can't keep running them through a rewriting engine every time. So presumably you do need to do the inverse and produce the view and be able to modify the queries.

>> Carlo Curino: To me, this can be a tool that also supports the work of the application developer. The point is that you would have had to do the rewriting yourself without a tool like this. If you use the tool, it provides a suggestion of how you can do the rewriting, and probably what you want to do is check the rewriting, make sure it is optimal and works fine for you, and then embed that new query directly inside your application. So you can try to make it completely transparent for the application, and the application will keep working, but most probably what will happen in a real scenario is that people want to see what the rewriting is, and the tool just speeds up the process of migrating the applications. That's my view on this.

>>: In your experience, are the rewritings comprehensible by humans?

>> Carlo Curino: Very much, very much, because the MARS rewriting engine was born as an optimization engine, so it also removes extra, unneeded atoms and tries to squeeze the query down. The results I got on Wikipedia, which I will show you with the tool, make sense. The only funny thing is that, for example, the aliases it chooses for the tables will be something like "XO" instead of something more meaningful for a human. But if you accept this little thing, it's more or less reasonable.

>>: Could you give it a try to derive names for the new tables from the old ones?

>> Carlo Curino: I would say so, I would say so. Just take the initial of the table and make it unique somehow, and if it's already unique, you would say table page as P instead of as XO.
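[For concreteness, a sketch of the composed-view deployment described above; the employee/person/job names are invented, not from the talk.]

```sql
-- Sketch with invented names: suppose the evolution decomposed the old
-- table employee(id, name, dept, title) into person(id, name) and
-- job(id, dept, title).  PRISM-style deployment rewrites
--   SELECT * FROM employee
-- once, at design time, and installs the result as a view, so legacy
-- applications keep working unchanged:
CREATE VIEW employee AS
SELECT p.id, p.name, j.dept, j.title
FROM   person p
JOIN   job    j ON j.id = p.id;
```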
So when we finished this, we thought: okay, how can we test this? How can we make sure it actually works? So we ran tests. Again, we used Wikipedia and the schemas of Wikipedia. For the queries, we actually had access to the online profiler that records the workload on the Wikipedia installations, and we extracted ten query templates from those queries. We ran those, and we used the data Wikipedia actually released to make the execution time measurements. There were two things we wanted to verify: one, that the system covers a lot of queries, that is, that it is actually capable of rewriting a lot of queries; and two, that the rewritings are decent in speed and result. First piece of good news: 92.7% -- yes?

>>: I was just wondering if the downward spikes in the left graph there correlate with the downward spikes in the previous graph.

>> Carlo Curino: Exactly correct.

>>: They're exactly the same?

>> Carlo Curino: Exactly the same.

>>: Okay.

>> Carlo Curino: I just interrupted the line so as not to make the graph look too funny, but the rewriting success also spikes down there, because there is no schema to run on in those versions.

>>: So those are not simply a dotted line, those are breaks? That has significance with the spikes below.

>> Carlo Curino: Yeah, they are perfectly aligned down there. It's just Photoshop on the final figure.

So the idea here is the following: 97.2% of the evolution steps are completely automated by the system, in the sense that every query that can be rewritten is rewritten by the system. In the 2.8% of cases in which the system didn't fully succeed, it was still able to deal with 17% of the queries in the group of queries that needed to be rewritten. So we are basically taking care of most of the work, and the human is focused on a small portion of what he was doing by hand before, which is already pretty good. And, just to give you an idea, the queries that only the user can rewrite are situations in which he is using pure domain knowledge: he knows that, even though a table has been dropped, the same information can also be found by correlating two other tables in some completely ad hoc way. So I don't think there is any way to improve these numbers, at least for a while.

The other question was: how good are the rewritten queries? Fairly good news: the rewritten queries are fairly close to what the user produced by hand. Their execution time, measured on one of the databases released by Wikipedia, is actually quite close to that of the hand-written ones. About the gap in performance, and that is the average, especially for query 13 and maybe a couple of other queries, the problem was the following: the user was using information about an integrity constraint that was not specified, something like a foreign key that was not explicit in the Wikipedia schema, and he was using it to remove joins from the query. If we make that integrity constraint explicit and feed it into the rewriting engine, the engine performs the same simplification and achieves basically the same performance, almost indistinguishable from the one obtained by the user. And that was good news; we were quite happy.
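[A sketch of the kind of join elimination being discussed; the foreign key, table, and column names are assumptions loosely modeled on the Wikipedia schema.]

```sql
-- Assume the (previously implicit) constraint is declared, and rev_page
-- is NOT NULL:
--   ALTER TABLE revision
--     ADD FOREIGN KEY (rev_page) REFERENCES page(page_id);
-- Then a rewritten query containing a join that adds no columns and no filter,
SELECT r.rev_id, r.rev_timestamp
FROM   revision r
JOIN   page     p ON p.page_id = r.rev_page;
-- can be simplified into the cheaper, equivalent query the user wrote by hand:
SELECT r.rev_id, r.rev_timestamp
FROM   revision r;
```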
So we wrote a demo that got accepted at ICDE, and I will give you a short overview of the system. Okay, I think it looks good. This interface is somewhere in between something meant for a demo and what an actual tool would look like, so there are things which are definitely academic and things that probably might be useful.

The starting point is just setting some configuration parameters, like the database to connect to, the schema we are going to work on, and the schema where we are going to store the result of our work. The first interesting part is this: I can see what the schema is, and I can issue SMOs, that is, modifications to that schema, and see what is going to happen, and the system helps me in doing this. For example, I start with something simple, like RENAME TABLE archive INTO my_archive. The system shows me that table archive will disappear and that in the final schema I will have a table my_archive. Let's say I like that. Then, for example, I'm crazy and I want to drop the table page from Wikipedia. This means the entire Wikipedia will stop working, but assume that's my goal. The system says the syntax is fine, but pay attention, because the drop table is not information preserving. It's like telling you you're crazy, but a little more politely than that. So we drop the table page. Then, I don't know, we can copy table blobs into blobs2, and now it is telling me, which is pretty obvious, that this copy is generating redundancy; go ahead if you want. I want to go ahead. And then I'll show you an example of, let's say, decompose: I decompose table text, which contains the text of Wikipedia, into T1, where I put the key and the actual text, and T2, where I put the key and the flags and the other attributes. So I throw it in there, and again, here are T1 and T2.

Now we go to the next part and see what happens with the inverses and what the system can do with them. What the system does is, for each SMO, try to come up with an inverse. So, for example, for RENAME TABLE archive INTO my_archive it proposes the opposite, my_archive into archive; it's fairly obvious. For the drop of table page, it will say: well, the data is gone, we cannot do much about that, but at least, to have the same-looking schema, we will create an empty table page with the same attributes. For the copy of table blobs, here it suggests a merge of the two tables. Of course, I can override this and say, no, drop table blobs2 instead, for example, because I know the two will be just redundant. And for the decompose it has come up with a join of the two, in this case using the common attributes to make the join. What we are doing now is being a little more specific with integrity constraints, so it can also tell you: you are missing one foreign key needed to make sure that the join will do the right work in bringing back the data, and so on and so forth. And here, in the --

>>: So merge is union?

>> Carlo Curino: Yeah, merge is union. It's a pretty good choice for the name, I know. Yeah, merge is basically doing a union.

So here it shows me, given a query workload that I gave as input, preloaded in the system, what it can do, that is, how many queries can make it up to this point. In particular, if I click, it shows me what the rewriting of a query will look like. For example, for the table archive: select from archive, and it says the rewriting will be a select from my_archive, aliased as XO. So it's just doing that.
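[A sketch of the migration script the tool would generate for the DECOMPOSE step above; the column names are placeholders for "key plus actual text" and "key plus flags", not necessarily the real Wikipedia ones.]

```sql
-- Illustrative migration for:
--   DECOMPOSE TABLE text INTO t1(old_id, old_text), t2(old_id, old_flags)
CREATE TABLE t1 AS SELECT old_id, old_text  FROM text;
CREATE TABLE t2 AS SELECT old_id, old_flags FROM text;
DROP TABLE text;
```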
Now, as you can see, at a certain point the success rate drops to 80%. Let's see what happened there. The point is that there were two queries running on table page, and if the table is not there anymore, there is not much the system can do. But instead of just creating an empty table, if I know that I have another copy of the table page, which I kept there just in case and which is full of all the data I need, I can just say, for example, rename that other table to page, or copy that table to page, and the system will then do the rewriting. And when I go on, let's say to the last step, I see here that my query on text will run on T1 and T2, make the join, and then apply the selection.

Now let me go back to the slides for just one second and move to what I call the "mad DBA" example. I guess no one would want to do this much damage to a table, but let's assume it, just to test our system. So I do the decomposition as before. On one side, I partition the data horizontally, I rename the table, I rename one column, and I join it with another random table. On the other side, I do a tricky thing: I take the attribute old_flags, split it into two sub-attributes, and then drop the original one. So to go back, what I will need to do is concatenate the two sub-attributes. And on this other side, I add an attribute "provenance", which is needed because later on I will merge, that is, make a union, with another table that contains some random data I don't want; I will use the provenance when I go back to partition the data, in order to keep only the tuples that pertain to the original table. So let's see what happens when I do this.

Okay. Now let me go back to the presentation, because I'm not seeing the screen. What's going on? Okay, I will have to switch to mirroring, otherwise I don't see what you see. Okay. So I go back to the SMO design, I remove all these SMOs, and I load a previous one. Okay, I don't know why I'm not seeing it... okay, and I load it. So basically this set of SMOs is the one we've seen in the graph, just entered sequentially. Now I go to the inverse side and see what happens here. Of course, I'm doing funny things, like dropping columns, so it's telling me: okay, for this inverse I'm creating just an empty column, and it's highlighting it in red to tell me, you might want to do something there, make it something better than that, okay? So what I do in this case is say that I will create this attribute as the concatenation of split_flag_1 and split_flag_2, which were the two attributes representing the left and right halves of my initial old_flags attribute. So I am recreating the column old_flags, populating it with the concatenation of those two. The functions are limited to operate at the level of a single table, and they can be user-defined functions or whatever you want to put there. Let me copy it here so I am faster. Then, the other one to edit: it says here that I merged the two tables, and now, to go back, what I try to do is use the provenance column to decide where each tuple should go. In this case, I use provenance equal to 'old', which was the value I set for the text2 table, to send that data back there.

The system, in the meantime, executes all the queries throughout all the evolution steps, and it seems that everything worked fine. In fact, we didn't really remove any information: we just reshaped the data in the database in many fun and different ways, but we only moved data around. And if we look at our table text now, it has gotten pretty nasty, but it's correct. I spent about 20 minutes reading it and making sure. So it looks good to me.
So basically, it is concatenating the two columns that were split; it is merging back the two sides of the decomposition on top; it is using provenance equal to 'old' to filter out the data that was just added randomly there; and it is doing a union, because we had a horizontal partition at a certain point, and it is putting everything back together, with the attributes renamed back, for example. The other rewritten queries do the same kind of job. Now, if I want to make sure, I have one tab here that I call validation: the idea is that I can run some extra queries and check that everything is fine. Just for academic purposes, I'm showing here the entire set of DEDs, which is endless and horrible, and which constitutes the internals of the system: the mappings between the subsequent versions that are used by our MARS engine.

And finally, we get to the deployment phase, in which, down here, I generate the SQL migration script, which basically implements one SMO at a time and migrates the data. This is the part that, as I was saying, can be optimized, because it might notice that something is materialized and then processed again, so it could be collapsed into a single step. And here I am generating the inverted views; in particular, I want to show you the view for the text table. It's this one here, and as you can see it is doing the union between the two sides, et cetera, okay? So it is supporting the table text as a complex view on top of whatever is in the database now. This is, of course, a composed view, because otherwise the tool would have generated one view per step and you would get a chain of views, and long chains of views don't perform well inside the database, so we compose and generate a single view directly. So, okay. That's the demo.
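[A simplified, hedged sketch of the shape of the generated backward-compatibility view for the text table in this scenario. It is not the tool's actual output, and all table and column names are placeholders.]

```sql
CREATE VIEW text AS
SELECT p1.old_id,
       p1.old_text,
       CONCAT(f.split_flag_1, f.split_flag_2) AS old_flags   -- undo the column split
FROM   t1_part_a p1
JOIN   t2_merged f ON f.old_id = p1.old_id
WHERE  f.provenance = 'old'          -- drop rows that came from the other, merged table
UNION ALL
SELECT p2.old_id,
       p2.old_text,
       CONCAT(f.split_flag_1, f.split_flag_2) AS old_flags
FROM   t1_part_b p2                  -- second half of the horizontal partition
JOIN   t2_merged f ON f.old_id = p2.old_id
WHERE  f.provenance = 'old';
```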
So, current and future work on PRISM. One direction is extending the SMOs to deal more extensively with integrity constraints, that is, being able to add and remove integrity constraints and to see how we propagate them through the existing SMOs. This will give us better guarantees: instead of just saying "this decompose might not be information preserving," if I know how I am doing the decompose and which integrity constraints hold, I can be sure, and give better feedback to the user. We are also modeling updates on top of the mappings between databases, and we are modifying the rewriting engine in order to rewrite the updates as well. There are a few issues there, but it seems to be working fine.

>>: When you find a transformation which doesn't preserve information, can you always, in some automated way, add something to the new schema which preserves the information?

>> Carlo Curino: I don't think it can always be done; it depends on what the user wants to do. In some cases it can. For example, when I did the merge, the trick you're suggesting is exactly what I did when I added the provenance column: it was done because I knew I was going to do a merge and I needed something to go back. In the case of Wikipedia there was a situation like that, in which I added a fake column and then removed it, because later on in the schema, by comparing two existing attributes for equality, I could regenerate, say, a one for the provenance column, or a zero if they were not equal. In Wikipedia, they used to store pages as "current" and "old", and then they put them all together and split them horizontally into page and revision; so by doing some comparison I could tell whether a tuple was a current one or not. So sometimes it's possible. And I would say it depends on what you want to do. If you want to drop a table, you can copy it somewhere else beforehand, but it's kind of tricky.

>>: You were asking a different question. You want to extend the target schema so it has all the remaining information?

>>: Yes.

>>: You can always do that; you can just duplicate any of the source stuff that's missing. The question is whether you can do it minimally.

>>: Yeah.

>>: There's some kind of a diff: imagine that you're just taking the union of the source schema and the target schema, and you want to get rid of all the source schema stuff that's subsumed by the target schema somehow.

>> Carlo Curino: I got it, okay. Another piece of future work on PRISM is extending the SMOs to cover aggregates and data transformations. In the data we collected this seems to be a very rare case; it would probably be less rare if we had a chance to obtain ETL transformations and those kinds of scenarios. So we definitely want to extend the SMOs in that direction; I think it is a needed extension. We also want to apply the query rewriting feature of PRISM to the problem of determining data provenance in ETL scenarios for data warehouses: when you have a data warehouse, you use ETL to load it, and we could rewrite a query that runs on the data warehouse as a query that runs on the original databases. I don't really want to execute it there, because for performance reasons that would not be smart, but it gives a clue of where the data was coming from. That's future work.

One more thing we didn't talk much about, which is what [unintelligible] is doing, is the idea that, by having sophisticated tools that support schema evolution, you can think of a design methodology for databases which is less up-front and rigid, and more like the agile methodologies for software development: something like "design as you go," where you allow people to design the schema and then keep evolving it during the design, because they have better support from the tool side.

So now, I have more time here, so tell me if I'm going long. Let's start talking about the archival part. This was PRISM, which supports legacy queries on top of an evolving snapshot database, and which also stores the historical metadata: of course it knows what the schemas are and the relationships between the schemas. If we also maintain a transaction-time archive of the original snapshot database, we can think about supporting temporal queries on top of this archive. The problem is that we are under schema evolution, so we will see that this is quite challenging.

As for the motivation for transaction-time databases, you could ask David, but I think we all agree that maintaining an archive of a database can be very useful for a bunch of reasons. The problem now is: how can I archive, and pose complex temporal queries on top of, an archive whose schema evolves? There are three challenges. First, how to achieve perfect archival: we said that during evolution the user might say drop table, but of course I don't want to drop the entire history of that table; otherwise, what's the point of having an archive if I lose that piece of information? Second, I want to be able to query this thing without too much trouble caused by the schema evolution, and there are issues about which query language to use. And third, of course, achieving decent performance.
I have a big database that keeps growing, complex temporal queries, temporal coalescing, and whatever else, so performance will be a challenge.

For perfect archival, the idea is to maintain original-schema archiving: the data will be stored under the schema in which it first appeared. This way, we do not corrupt the archive. We will then need to take care of the schema evolution in a somewhat automatic way, because we don't want the user to have to issue queries against every version of the schema. And to achieve performance while keeping an expressive query language, we decouple the logical and the physical layer, as I will show you, and we plug in extra temporal-specific optimizations, for removing temporal joins, optimizing temporal coalescing, and so on.

So, we said the schema must be the original one; in fact, if I migrated the data, I would say we lose information. So we want to keep the data under the original schema. Now the problem is that my entire history is broken into a series of different schemas, and I need to support queries somehow. One option is that the user specifies a temporal query by splitting it into a series of sub-queries, each one of which runs on a different schema version. But I don't want the biologist to issue 410 different queries to obtain the history of some gene or something; I don't think that is a decent interface. So the idea in PRIMA is basically to let the user query one version, in this case the last one, and, as an answering semantics, to migrate the data to that version and answer there. Again, this is good for us as an answering semantics, but it is not what we want to do in practice, because it would mean materializing the history under every version. So what we actually do is let the user issue the query against one version of the schema and rewrite the query with techniques similar to the ones before, except that now these are temporal queries and things change a little bit.

We said that for a transaction-time database one of the issues is the query language. We need an expressive query language: we want to issue not only snapshot queries but also more complex queries. There have been many proposals that try to extend the standard, and now some basic support for this is making inroads into the actual SQL standard. There are two competing approaches for transaction-time databases. One is tuple-level timestamping, in which you basically add timestamps at the tuple level: every time the tuple changes one of its attributes, you create a copy of that tuple with a new timestamp for the new lifespan of that tuple. The problem is that you create some redundancy, and when you take the projection of one attribute you may get duplicates. If you look at these two tuples, here we have two meeting time spans: if I project only the department number, and I want to answer my user also with the information about the lifespan of that value, I will need to collapse those two duplicates and unify the two time spans. As for the redundancy, there are other ways to remove it; I know ImmortalDB is doing interesting compression techniques to avoid this duplication. Still, you have the problem of doing temporal coalescing, which is a very expensive task.
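[A small worked example, with invented data, of the coalescing problem just described under tuple-level timestamping.]

```sql
-- Invented example. Tuple-level timestamping: one row per tuple version.
--   employee(empno, title,             deptno, ts_start,     ts_end)
--   (100,    'Engineer',        'D1', '2005-01-01', '2006-01-01')
--   (100,    'Senior Engineer', 'D1', '2006-01-01', '2008-01-01')

-- "For how long was employee 100 in department D1?" -- the projection
SELECT deptno, ts_start, ts_end
FROM   employee
WHERE  empno = 100;
-- returns two value-equivalent rows with meeting time spans,
-- ('2005-01-01','2006-01-01') and ('2006-01-01','2008-01-01'),
-- which must be coalesced into a single period ('2005-01-01','2008-01-01').
-- Under attribute-level timestamping the deptno value is stored once,
-- already carrying its full span, so no coalescing is needed here.
```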
There are also approaches with attribute-level timestamping, where you basically attach to each attribute value its own lifespan. This has been argued in the literature to be better: you have less redundancy and you need less temporal coalescing, because when I filter on the department number the value is already represented only once. The problem is that this does not look like a relation anymore; it's a funny object, not a relational table. So how do we deal with it? We use XML to represent it. Before someone thinks I'm crazy, I want to say that I will show you the advantages of XML at the logical level; I am not going to store the history of a big relational database in XML and try to execute XML queries for real.

>>: You had a question on that before.

>> Carlo Curino: Plenty, plenty. So the idea is that we can represent a table in XML like this: we have database, table, row, and attribute elements, and the timestamps are attributes of the value elements. Changing the value of one attribute of a tuple just means adding one new element here, with a new timestamp, to represent the new value. The reason we use XML is that XQuery will be our query language. XQuery is actually a pretty good language when we have to express temporal queries. The good thing is that complex temporal queries are reasonably simple to express. I know [unintelligible] ran several tests with a lot of students, and he says they can understand XQuery much better than the temporal extensions of SQL, so he definitely decided XQuery was the way to go. And the good thing is that you need no extensions: there is no special treatment for the temporal aspects.

Here are a few examples. For instance, the classic one: give me the history of the title of a given employee. That means filtering to get one employee and then getting the titles; if you think about the previous slide, it means getting this group of title elements here, which is exactly the history. Another example: we can find the employees in a snapshot, where the snapshot is basically a filter on time start and time end with respect to a given moment in time. Or I can say: retrieve the employees who worked in a given department and left before a given moment in time. This is fairly simple: I say that the department number equals the given one and that the time end is before that instant, meaning the person was working in that department and left before that moment in time. And of course we can think about more complex things, like: give me the people who worked in the same department as [unintelligible] in 1987 and are now working somewhere else, which is also simple to express like this. So when I have evolution, I will rewrite such a query into an equivalent query that spans the various schema versions; here is an example. I want to go fast so I can finish.

Some rewriting performance. Let's suppose we go with XML, first of all. The rewriting is pretty difficult: we are now rewriting XQuery, so we need to use XICs, which are basically the equivalent of the DEDs for XML, and if we do not do any optimization, for a hundred schema versions we get over 100 seconds of rewriting time. That's not nice. If we prune SMOs and do some SMO compression, we can get it down to a little less than half a second for 100 schema versions, which is still a bit too much, but it is acceptable because temporal queries are not issued very often. So we can assume for now that this is fine. When we try to do the execution, though, big trouble comes. We tried to execute the queries, okay --
We took a bunch of temporal queries to run on this, and this red line actually means that the system crashed: the XQuery engine crashed before giving the answer. So we need to plug in temporal optimizations, for example optimizing the temporal joins, removing unneeded joins, and doing minimal source detection, which removes subgoals from the XQuery. With these it can actually finish this time, and it takes, you know, ten or twelve seconds, but the database is extremely small. That is not something we want in practice: taking that long on the history of a table which is, say, 640 kilobytes is not reasonable.

So now we want to go relational. We still like XQuery as a language, so our query interface remains XQuery, but what we actually do is shred the XML document into H-tables. H-tables are basically a relational way of representing the XML document, in which you split the various attributes into different tables. The idea is that we can now rewrite the XQuery into SQL over those tables; this comes from the ArchIS system that was developed at UCLA. And once we have SQL running on that, we can use SQL rewriting, which, as we will see, is faster, to do the version adaptation on top of the various portions of the history. The nice thing is that, with the evolution handled at this level, if we want to plug in a new temporal version of SQL it is enough to provide the translation from that input query language to SQL over these H-tables, and the evolution will be taken care of underneath.

Just to look at the performance: the rewriting time goes down to roughly 12 to 15 milliseconds, which is definitely better than the half a second we had before. For execution performance, we compared Galax [phonetic], which is another XQuery engine; how DB2 deals with XML; our H-tables, that is, our relational shredding of the XML; tuple-level timestamping; and the plain snapshot database, because, in fact, the example I'm showing here is a current-snapshot query. There are a bunch of other graphs for this example in the paper. Here the database size grows from about 1 megabyte to about 1 gigabyte of data, so it is definitely more reasonable than what we had before. As you can see, we get about four orders of magnitude of improvement over Galax by executing it this way on top of DB2; we are very close to what tuple-level timestamping can do, or even better as the size grows; and we are very close to the snapshot database, which, for a current-snapshot query, is sort of the gold standard: better than that we cannot go.

I'm almost done. One other issue we mentioned is temporal coalescing, that is, merging duplicates with meeting time spans. Just by moving from tuple-level timestamping to H-tables, we greatly reduce the need for coalescing: we are about five times faster on queries that would need coalescing under tuple-level timestamping. But still --

>>: Is that because of reduced data sizes?

>> Carlo Curino: Just because they are pre-coalesced, basically. You don't really need to coalesce anything under attribute-level timestamping, because each value is already represented once, with the overall time span you need. So you basically remove the need for part of the temporal coalescing.

>>: But I was just wondering whether that was due to the fact that the data was that much smaller.

>> Carlo Curino: Partially, it's probably due to that as well. It's probably also what we see here, why the H-tables do better than tuple-level timestamping: we have a few less tuples to deal with.
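[A hedged sketch of the H-table idea and of the kind of SQL the XQuery is rewritten into; the table layout and names are invented for illustration and are not the exact ArchIS schema.]

```sql
-- Invented H-table layout (one table per attribute, attribute-level timestamps):
--   employee_title (empno, title,  ts_start, ts_end)
--   employee_dept  (empno, deptno, ts_start, ts_end)

-- "History of the titles of employee 100" becomes a plain relational query:
SELECT title, ts_start, ts_end
FROM   employee_title
WHERE  empno = 100;

-- A snapshot as of 2007-06-30, joining the attribute histories back together:
SELECT t.empno, t.title, d.deptno
FROM   employee_title t
JOIN   employee_dept  d ON d.empno = t.empno
WHERE  DATE '2007-06-30' >= t.ts_start AND DATE '2007-06-30' < t.ts_end
  AND  DATE '2007-06-30' >= d.ts_start AND DATE '2007-06-30' < d.ts_end;
```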
But still, since due to the schema evolution we break the history into pieces, we still need to do some coalescing there, because every time I change the schema I basically end up with a little duplication for the tuples that are still alive at that moment in time. So we defined a new way to do the coalescing that exploits the characteristics of the problem; in fact, it works only for these partitions, it is not a general coalescing technique. Compared to SSC, which is the best coalescing technique in the literature, developed by a previous student of Professor [unintelligible], we go a couple of orders of magnitude faster, again because we are tackling the coalescing only for the partitions.

The last part is HMM, the historical metadata manager: once we understand how to use XQuery on top of the data, it is enough to have a sort of versioned information schema, and then we can run temporal XML queries on top of the schema history itself. For example, the statistics in the slides at the beginning can be obtained with XQuery.

So, concluding. PRISM supports schema evolution for snapshot databases: it takes care of the data migration, derives logical mappings, and rewrites queries. PRIMA supports perfect archival of databases under schema evolution, provides intuitive temporal queries, with XQuery as our default, which we can think about extending with new query languages, and supports query answering performance with an architecture that decouples the query interface from the logical layer and plugs in temporal optimizations to run faster. If you're interested, there are demo papers and everything else at the website.

>>: Thank you very much.

[applause]

>>: Can you comment on the -- there's a bunch of languages along the way here. Can you comment on the expressiveness of each of them? What kinds of queries are you handling? The mappings, you said, are this embedded dependency language, which I assume is basically conjunctive queries plus equality? And also, where do you run into trouble? Have you tried to extend this to richer queries or richer mappings; would things start falling apart?

>> Carlo Curino: Okay. So the queries we support in the PRISM system are basically unions of conjunctive queries. That's the baseline, and you can probably extend it with aggregates, I guess, without too much trouble, but we have to figure that out more precisely.
Doesn't mean -- there might be cases in which you cannot rewrite, because it cannot deal with the negation explicitly. So we're actually implementing that right now, and it is basically trying to see how much not complete it is. So if it can deal with, like, 80, 90 percent of the cases, it's still useful. If it deals with five percent of the cases, we cannot claim to have as much. So that's one thing. The XQuery language, the query supported basically a relational queries. They're expressed in XQuery because it's nicer to talk about the temporal aspect, but that's just, those are basically just relational queries on top of a query. In fact if you see, we do not use the full power of XML in the XQuery, because it's a specific kind of XML schema that is comfortable to be used. The advantage of doing that is not really the power of -- because we want to use the full power of XQuery, but it is that there is no need for extension to XQuery, so whatever optimization comes in for that language, it can be exploited without, you know, requiring tweaking and adaptation to these SQL new version of SQL that we can come up with. >>: You handle all of XPath? That would be impossible, but how much XPath? >> Carlo Curina: I didn't do much test in this. This was a merely work for the other students. I don't know how much he implemented XPath. 26 >>: There are some [unintelligible] problems buried in there when you start working. >> Carlo Curina: At the moment, it's very limited, I don't know how much of the limitation due to the fact that the guy was lazy and didn't implement the entire parts and did everything, or how much was actually, you know, a real limitation, technical difficulty. So I know the example we tested were not extremely difficult. I would say there were fairly interesting temporal queries from a temporal point of view. Not extremely challenging for XPath, basically just filtering on top of the attributes in different ways. And it is that you can, using again XPath, you can define some comfortable function that's, for example, [unintelligible] overlap. So you just give two element and it's just like tell me if they are overlapping in time. This will make the user queries even more comfortable as a language, basically. >>: I had a question about a couple versus attribute in the time stamping. We do this deferential compression and it's not clear to me that that isn't simply a low level implementation choice, whether you want to refer to it as tuple, double, or attribute level. The information is the same, right? It's just a matter of how efficiently you represent [unintelligible], right? >> Carlo Curina: I guess the problem is that, okay, you're doing -- if I got right what you're doing, because I read the paper, but I might be wrong. You're finding a smart way to compress the data, but still conceptually, that's a tuple level time stamping. So you still have one attribute would appear several times, but actually has been compressed just for performance reasons. >>: That's right. >> Carlo Curina: The moment in which you have to -- if you limit to snapshot queries, probably make no difference. The moment in which you want to return, either do a more complex queries that need, for example, temporal join or if you want to do queries that return to the user the time life of that attribute, at that point, with tuple level time stamping, no matter how you implement it, I think you still need to do, to find the duplicate and find how many the tuple exists and get the time span. 
>> Carlo Curina: With attribute-level time stamping, on the other hand, that is not needed, because the attribute appears only once, with its full time span right there.

>>: So if I can translate that, I think you're conceding that the information is there, but the way to express the queries is harder in one view than in the other?

>> Carlo Curina: Not only the way you express the query, but especially the cost when you compute the query. Say I have to compute a temporal coalescing -- whatever language is on top tells me: do this projection and then do the temporal coalescing, because I want to know for how long someone worked in the department. At that point, in one case you have the attribute with the information directly there; I just have to read off the time-start and time-end attributes, and that's already what I need. In the other case, the attribute will appear multiple times. From a storage point of view you might have squeezed the representation of the attribute, but I think you still have the multiple time stamps, which the query --

>>: [inaudible] that operates directly on the compressed representation.

>> Carlo Curina: I guess you can. I guess you can go in that direction. Actually, what we are doing for the temporal coalescing, when we split due to the partitioning over time, is basically to maintain one extra attribute that tells us where this data was coming from in the previous version of the schema, plus some extra [unintelligible] around there. You can probably do something similar, saying I don't only have the time stamp for the entire tuple, but also somehow a reference to where the time stamp for this attribute is, or something like that. Definitely, you can get something like that.

>>: So, changing the subject: querying across time, of course, is complicated by schema evolution. And there's an equally big problem of data evolution that needs to be solved.

>> Carlo Curina: Say again?

>>: Data evolution. So the fact that I've got some genetic database and they changed the gene name in 2006. Or, you know, in my company we used to have winter reviews, but then we opened an office in New Zealand, so now they're called January reviews, because "winter" didn't make sense anymore. I'm not sure what my question is, but it seems like it would be interesting to think about how you sprinkle in equivalence statements about data as well as equivalence statements about schema, so that I can get all the history of the information about this gene, including that --

>> Carlo Curina: This is the point where we changed the name.

>>: Including that under the previous name, or something.

>> Carlo Curina: I think there is one option for doing that; [unintelligible] somehow can be adapted to work that out. When we do the add column, for example, we can add a column using some function to populate it. So what you can think of doing is something like: I add a column, and I generate the value inside that column by some transformation function that takes gene A and, say, calls it gene B. And then you have, of course, the drop column and the inverse. In the query rewriting, up to a certain point in the history the queries will just ask for the column by name, and from a certain point on they will ask for that column and, whenever they mention it, run the function on top of it. So my query starts by saying select where gene name equals Gene A, and this will hold up to a certain point of the history; from there on, the sub-queries will say something like where gene equals change-name of Gene A. Something like this. That's something I think you can think of doing. I don't know how much it can harm performance, because you will start issuing function calls throughout the execution. So it becomes --
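(A minimal sketch, in Python, of the rewriting idea just described. Everything here is hypothetical -- change_name, the RENAMES table and rewrite_filter are illustrative stand-ins, not PRISM's actual interface -- but it shows how an old filter can keep its literal value against the pre-rename part of the history and be routed through the rename function, or its extensional lookup table, afterwards.)

    # Minimal sketch of rewriting a value filter across a data-level rename.
    RENAMES = {"GeneA": "GeneB"}          # e.g. a gene renamed in 2006

    def change_name(value: str) -> str:
        # The transformation function used when the new column was populated.
        return RENAMES.get(value, value)

    def rewrite_filter(column: str, value: str,
                       version: int, rename_version: int) -> str:
        # Before the rename the literal is kept; from the rename on, the
        # filter goes through the rename function (or its lookup table).
        if version < rename_version:
            return f"{column} = '{value}'"
        return f"{column} = '{change_name(value)}'"

    # An old query asking for GeneA is rewritten per portion of the history.
    print(rewrite_filter("gene_name", "GeneA", version=20, rename_version=25))
    # gene_name = 'GeneA'
    print(rewrite_filter("gene_name", "GeneA", version=30, rename_version=25))
    # gene_name = 'GeneB'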
>>: That function might have an extensional representation too, you know.

>> Carlo Curina: Actually, that's the way we deal with functions in the rewriting engine: we take the function, we say okay, let's assume there's a table that implements this function, we do the rewriting assuming that table exists, and then we turn it back into a function call. So definitely, you can think of doing something like that as well. And I think it's --

>>: So you'd translate this data evolution into a schema evolution, basically, or at least into a temporal schema?

>>: I think that's right. It's promoting it, up to saying -- yeah, you say it's a new column that has the new names. And so now it's --

>> Carlo Curina: That's similar to the problem of when you want to do [unintelligible]. At the moment, we can sort of support it by saying: you add a column, add a column, add a column, based on how many different values you have inside. But, of course, that's not the way to go. I know there is some interesting work by [unintelligible] and other people at UCSC and IBM on techniques to deal with this. And probably, when we want to extend the SMOs in the direction of data transformations, we might also try to tackle the problem you are suggesting and see what to do. Or maybe we can have some extra integrity constraints that we can plug into the system that speak about the values and say, for example, Gene A is equal to Gene B, or something like this, and then use the rewriting engine to do the magic in the query as well.

>>: We should probably wrap up.

>> Carlo Curina: I want to thank you very much. Thank you very much.