>>: It's a real pleasure to introduce Y.C. Tay, professor from the National University of Singapore. I've known Tay for almost 30 years; we first met back when he was a graduate student at Harvard, and he ended up doing a Ph.D. thesis in the database area on the performance of locking systems, which was really landmark work at the time and is still very relevant. He uncovered the abstraction of lock thrashing, which was behavior that had been observed but nobody had really understood from an analytic standpoint, and he also did a good job of separating out contention for resources other than locks from the contention that arises from locks. People still read this work today; it's really as relevant now as it was then. In general, most of his research is on analytic models of performance. He recently came out with an introductory book on the topic, a monograph from the Morgan & Claypool Synthesis series on analytic performance models of computer systems, published just last month, and you can see a pointer to it on his Web page. His Web page also has a nice pointer to a lecture from a couple of years ago that he gave on universal cache miss equations for memory hierarchies. So he works on hard problems that are of pretty general interest to performance and computer systems, and today he's going to tell us how to upsize a database so that you can use it for performance evaluation of database systems.

>> Y.C. Tay: So I'm from the other side of the ocean, and coming here one thing I learned is always to have an advertisement. As Stu mentioned, this book just came out. It's less than 100 pages, and the idea is to introduce the elementary performance modeling tricks to people who have no interest in becoming performance analysts; they just need some rough model of how their system is behaving. The techniques are illustrated with discussions of 20 papers, including some work that was done here. So take a look. It's an easy read, I promise. So, UpSizeR.

>>: You're going to autograph them later? [laughter]

>> Y.C. Tay: The book should be read in soft copy because the 20 papers are hyperlinked. So I don't know how to autograph electronically.

>>: I don't know, do they publish hard copies?

>> Y.C. Tay: They do. They do. Yeah. But it's best read in soft copy. So this is work done with the students Bing Tian Dai, Daniel Wang, Yuting Lin and Eldora Sun. What I'm going to do is describe two problems. One I call the dataset scaling problem, and the other the social network attribute value correlation problem, and of course introduce UpSizeR. It's a first attempt at solving the dataset scaling problem, and it has within it what I call a strawman solution to the attribute value correlation problem. You'll very soon see why I call it a strawman solution. So, the motivation: we are in the era of big data. Big data means different things to different people — it could mean genome data or astronomy data — but in our context here I'm talking about Internet services. Some of these applications have insane growth rates, and I imagine if you are the developer of something that you aspire to be the next Twitter, you worry that your dataset is going to grow so large next month that your system may not be able to cope with it. So if you're looking forward, you might want to test the scalability of the system with some, by definition, synthetic dataset.
So how can you do that? One way might be to use the TPC benchmarks. The TPC benchmarks let you specify the size of the dataset — one gig, 100 gigs — and generate it for you, and there are different flavors as well. They're supposed to be domain-specific, but they're not application-specific. If you think about something like TPC-W, which is supposed to be for e-commerce, and Amazon and eBay are roughly speaking e-commerce, how relevant that benchmark is for Amazon or eBay is not clear. So the idea here is that we want something that's application-specific. Rather than design one benchmark for this application and another benchmark for that, what we want is one single tool where you come with your dataset and I'll scale it up for you. So the dataset scaling problem is: you give me a set of relational tables — the R in UpSizeR is capitalized because I want to emphasize that we're talking about relational tables, so people don't misunderstand me to be upsizing a graph or something like that — and a scale factor s, and I'll generate a dataset D prime that is similar. Here's where I use my prop. You come to me — I'm UpSizeR — with dataset D, and I'll scale it up to D prime, and these two are going to be similar. The trick here, of course, is what it means to be similar, right? It could be statistical properties: the statistical properties here must be similar to the statistical properties there. Or, if this was generated by a social network, then the social network here must be similar in some sense. When defining the problem, since UpSizeR is supposed to work for different applications, I wanted a definition of similarity that would work for any application. So our way around the problem is to say that when you come to me with the dataset, you also have in mind a certain set of queries that you're going to run, and the way you're going to judge whether D prime and D are similar is to run those queries on D prime and see whether the results are what you expected. That's how you're going to judge whether D prime is similar. As suggested by the name, when we conceived the problem we were thinking of upsizing — that's why we call it UpSizeR — with a scale factor bigger than 1. When I went around and talked to people, it seemed there's also some interest in same-sizing, making a copy that is the same size. Why would you want to make a copy that's the same size? Well, various scenarios. For example, I'm a vendor and you hire me; you have a dataset and you hire me to do something with it, but you don't want to show me your real dataset. What you can do is run UpSizeR to make a synthetic copy and let me work with the synthetic dataset. That's one possibility. Another possibility is in the cloud setting: you may want to test a different layout for your dataset in the cloud, or a different index implementation or whatever, and you're going to do all of this in the cloud, but you don't want to upload your data into the cloud. So what you might do is generate a synthetic copy in the cloud. Or, for people who are already working in the cloud and want to do some experiments somewhere else in the cloud, you don't want to ship all the data there; what you might do is just extract the statistics, go there, blow it up and get a synthetic copy. There are various scenarios you can think of. But actually, when I talked to a particular company — this is a big company, and they have no shortage of big datasets.
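A minimal sketch of the problem statement in code, just to pin down the terms used above; the Dataset container and the upsize signature are hypothetical illustrations, not UpSizeR's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    # table name -> list of rows (tuples); a stand-in for a set of relational tables
    tables: dict = field(default_factory=dict)

def upsize(D: Dataset, s: float) -> Dataset:
    """Hypothetical entry point for the dataset scaling problem:
    s > 1 upsizes, s == 1 produces a same-size synthetic copy, s < 1 downsizes.
    How the tables of the output are generated is what the rest of the talk describes."""
    raise NotImplementedError

# Similarity is judged by the user's own workload: run each query Q on D and on the
# synthetic D' and compare the results, since there is no application-independent
# definition of "similar".
```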
What they're actually interested in is downsizing. So why would you be interested in downsizing? For them, they see their customers struggling to get a sample of their dataset. You have a dataset that is this big and you have some prototype application that you want to debug. Heaven forbid you run your application on the big dataset; you want to take a small sample of the dataset and debug your application there. Sampling — what's so difficult about sampling? If this thing has a million users and you want to test your application on a thousand users, you randomly sample a thousand users. What's so difficult? Well, it's not so simple, because you might randomly sample a thousand users, and these thousand users are somehow connected to referenced objects, and those objects are somehow connected to other users, and you have to pull out those users too — it's like spaghetti, and in the end you get way more than a thousand users. So for various reasons you might also be interested in downsizing. Now, the UpSizeR algorithm will work for all three of these cases. Okay. So I would like to say there's no previous work, but I'd be lying. I already mentioned the TPC benchmarks, and if you trace them backwards you get the Wisconsin benchmark that was done 27 years ago. In the Wisconsin benchmark the relations are completely synthetic, meaning that they did not try to scale them up from some empirical dataset, and that particular approach propagated down to the TPC benchmarks. That's how they do it. It's not that they did not consider the possibility of scaling up empirical datasets, but they encountered some issues. When you try to scale up an empirical dataset and make the statistics the same, if your table has only 20 tuples, then you worry whether these 20 tuples are telling you anything about the real underlying distribution. For us this is not a problem, because the datasets we're dealing with nowadays are so huge. Another issue they encountered: why were they interested in synthetic datasets? They wanted to compare a particular vendor's offering against another vendor's, and use these benchmarks to do the comparison. When you want to do that comparison, you don't just have the dataset; you actually run some queries, and you want to tune your data and your queries so that you can test the join algorithms or the buffer management or whatever. This tuning is much easier if everything is synthetic; if you use an empirical dataset it's much harder to do the tuning. For me here, the way I formulated the problem —

>>: Question. If I want to buy supplies for my company, I want the query that I run tomorrow morning for my business — the closer to that, the more honest the benchmark will be compared to these other benchmarks.

>> Y.C. Tay: So they wanted a benchmark where you can test this vendor's offering against that vendor's. You want to tune the queries, et cetera, so you can test what this system is actually doing versus what that system is actually doing.

>>: If I wanted to do research on the database — but if I want to use it like a user, no.

>> Y.C. Tay: Okay. That was the idea, their particular argument. For us it's not an issue, because UpSizeR doesn't worry about queries. UpSizeR assumes that you have a set of queries in mind, but it doesn't make use of that set of queries. So this is important to remember.
In formulating the problem — in formulating a solution — it's entirely possible to factor in the set of queries that you want to answer when doing the upsizing, but we don't do that. The problem that they came up against was that scaling an empirical dataset is hard, and so they were saying that the dataset scaling problem is hard. But it's been 27 years — a long time in this business — and it's about time we take another crack at the problem. So for you guys I assume I don't have to go through this. I'm going to use the Flickr dataset. In the Flickr dataset, users upload photographs and make comments and tag the photographs, et cetera, and my job is to upsize it. So you have a photo table with a primary key and a foreign key, the user ID, which indicates who uploaded the photograph, and you have nonkey attributes for the date, size, et cetera. The user ID here points to another table, where it's the primary key, with a user name, user location, et cetera — these are the nonkey attributes. And then you have a comment table, which says that this particular user commented on this particular photograph, and a tag table, which says that this particular user tagged this particular photograph. So what you get is a schema graph. What assumptions do we use? We start with the following. I'm going to assume that the primary key is just a single attribute; sometimes it could be two, but right now I assume it's just one. I'm going to assume that a table has at most two foreign keys. In the example that I'm using, the user table has no foreign key, the photo table has one foreign key, the tag table has two foreign keys and the comment table has two foreign keys. And that's it — no three foreign keys. I'm going to assume that the schema graph is acyclic. I'm going to assume that the degree distribution is static. Let me define what degree distribution means by giving you the intuition first. A user will upload a number of photographs. If you take a histogram, you can turn it into a distribution that says with such-and-such probability a user uploads one photograph, two photographs, et cetera. I'm going to assume that this distribution is the same — the degree distribution for this and the degree distribution for that are the same. That is not necessarily reasonable, right? Think about why we want to do this. We want to do this because you're afraid that when the dataset gets to be this size, your system cannot cope. So there's a time element involved. If you're thinking about uploading photographs, by the time you get to here, everybody would have uploaded more photographs, right? The tail of the distribution would have shifted. So it's maybe not reasonable to assume that the degree distribution is static, but for the time being let's use that assumption. I'm going to assume the nonkey attributes depend only on the key values. That means, in the case of the user table, these values here only depend on the key values; they don't depend on the other tables, and don't depend on the other tuples. This key is completely synthetic; it has no meaning. So what does it mean to depend on it? What it means is that these values here can be independently generated. So when you give me the empirical dataset, I'm going to mine it. I'm going to find out the distributions of the ages, the sexes, the names, et cetera, and I'm going to use those distributions to independently generate the values here.
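To make the degree distribution just defined concrete, here is a small sketch in plain Python of mining the user-to-photo degree distribution from the empirical photo table; the tuple layout with the foreign key in column 1 is an assumption for illustration:

```python
from collections import Counter

def degree_distribution(photo_rows, fk_index=1):
    """Empirical distribution of 'photographs uploaded per user'.
    photo_rows: list of tuples (photo_id, user_id, ...); fk_index points
    at the foreign key column (hypothetical layout)."""
    per_user = Counter(row[fk_index] for row in photo_rows)     # degree of each user
    hist = Counter(per_user.values())                           # degree -> number of users
    total = sum(hist.values())
    return {deg: count / total for deg, count in hist.items()}  # degree -> probability

# Toy example: user u1 uploaded 2 photos, u2 uploaded 1.
photos = [("p1", "u1"), ("p2", "u1"), ("p3", "u2")]
print(degree_distribution(photos))   # {2: 0.5, 1: 0.5}
```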
And later on I'll worry about correlations, but at the moment let's assume they're independent. Okay. For more sophisticated tables like this one, this part in general is a function of the key values, right? If I know that this user is a gardener, there could be a function that tells me what the nonkey attributes might be over here. So that is nontrivial; the correlation is nontrivial. And I'm going to assume that the data correlations are not induced by a social network. That's not true for the Flickr network, but let's just use these simplifying assumptions for the time being. So UpSizeR uses three ideas. The first idea is degree, and it's really simple. This user uploads four photographs, so the degree of Y with respect to photo is four. User X makes two comments, so the degree of X with respect to comment is two; user Y makes one comment, so the degree of Y with respect to comment is one. Look at it for a moment and you realize that what I'm talking about here is a bipartite graph: every edge in the schema graph induces a bipartite graph between the rows here, and the degree is just the degree in that bipartite graph — that's why we use that terminology. The next thing we worry about is the joint degree distribution. I'm a user; I upload so many photographs — that's my degree — and I make so many comments — that's my degree. And these two degrees are going to be correlated, because a user likes to comment on his own photographs. So we have to respect that correlation. Then, in a table like comments, for something like a social network, the users might cluster — there might be a social network of gardeners, a social network of bird watchers, et cetera — and the photographs might be clustered around flowers or kites or cars or whatever, and it's more likely that a gardener would comment on a flower. So there's some co-clustering distribution that we have to respect. For a table with two foreign keys, the way UpSizeR deals with the correlation between the two foreign keys is to look at a co-clustering distribution. So these are the three ideas that we work with. The algorithm is quite simple. The first thing we do is sort the four tables. Remember that there's a schema graph on the four tables. When we sort it, the first one to come out is the one with no foreign key, then the ones whose foreign keys point to whatever has already come out, like the photo table, and then the ones whose foreign keys point to those, like these two. That's the order in which we generate the tables. Okay. So how do we generate the user table? Your empirical dataset has a million users and you want to upsize it to two million. Very simple: I generate two million primary key IDs, and then for each one of them I generate the rest of the nonkey values, and that's done. Then the —

>>: [Inaudible] nonkey value.

>> Y.C. Tay: Sorry?

>>: How do you determine the nonkey values?

>> Y.C. Tay: So, remember the assumption that a nonkey value depends only on the key values. And since in this case the key values have no meaning, that means the nonkey values are a function of the key in some trivial way, which means I can independently generate the nonkey values. So —

>>: But the nonkey values have properties — the name column, say. The name column might have a certain length distribution.

>> Y.C. Tay: I mine this guy for the name distribution, the age distribution, the sex, whatever, and I use those to generate those values.
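A minimal sketch of generating the user table as just described, plus the degree-based photo generation that comes next, under the stated assumptions (meaningless integer keys, nonkey values drawn independently from mined distributions, static degree distribution); the helper names are mine, not UpSizeR's:

```python
import random

def generate_users(n_users, nonkey_samplers):
    """Make n_users synthetic users: fresh primary keys plus nonkey values
    drawn independently from distributions mined off the empirical table."""
    return [(uid,) + tuple(s() for s in nonkey_samplers) for uid in range(n_users)]

def generate_photos(users, degree_dist):
    """For each synthetic user, draw a degree from the mined user->photo degree
    distribution and emit that many photo tuples pointing back at the user."""
    degrees, probs = zip(*degree_dist.items())
    photos, pid = [], 0
    for (uid, *_rest) in users:
        k = random.choices(degrees, weights=probs)[0]   # static-distribution assumption
        for _ in range(k):
            photos.append((pid, uid))                   # (photo_id, user_id, ...)
            pid += 1
    return photos

# Usage with toy samplers and the distribution mined earlier.
users = generate_users(4, [lambda: random.choice(["Ann", "Bob"]),   # name sampler
                           lambda: random.randint(18, 80)])         # age sampler
photos = generate_photos(users, {1: 0.5, 2: 0.3, 3: 0.2})
```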
So at this moment I don't worry about correlation; I just generate them independently.

>>: When you talk about a name, you're not modeling it — it doesn't come from a dictionary, it could just be a string of characters?

>> Y.C. Tay: Right. In this first-cut version of UpSizeR, we are focused on the key values, so we don't do anything sophisticated with the nonkey values. So the next thing to generate is the photo table. How do we do the photo table? The first thing we do is mine this for the degree distribution, the degree from user to photo: this user uploads so many photographs, so we know that distribution. There are two million users here, so for each one of them I pick a degree value from the distribution. User number one has degree three, so he uploads three photographs and I generate three tuples; user two has degree five, so I generate five tuples; done. That's easy enough. Now for the correlation, for the case where there are two foreign keys: what I have to do is assign a degree from the user table to this column and a degree from the photo table to this column. Now, the degree from the user table to this column is correlated with the degree from user to photo, and that one has already been generated, so I have to take a conditional probability. No problem, as long as I respect the correlation. The other thing I have to worry about is the correlation between these two columns. For that, what I do is first generate a comment ID. I know the co-clustering distribution, so I just assign the comment ID to a co-cluster — let's say I assign it to the co-cluster for gardeners and flowers. After I've assigned it to the co-cluster for gardeners and flowers, I have to pick a particular gardener and a particular flower. If this gardener commented on ten flowers and that gardener commented on three flowers, then with higher probability I assign the comment to the first gardener. So it's common sense. This is what I do: I assign a co-cluster, I pick a user in the co-cluster according to the degree, and I pick a photograph according to the degree. When I'm done with that, I have the three key values, and since the nonkey attributes depend on the key values, I can generate the rest of them, and that is one tuple. And then I repeat. Every time I assign a comment to a user I reduce his degree; every time I assign it to a photograph I reduce its degree. And I do that until the degrees are zero. Then I generate the tag table similarly. So it's very simple. The actual algorithm, the complete algorithm, has certain complications, but the basic ideas are already here. So, relaxing the assumptions. Suppose the primary key has two attributes. Then what we do is create a synthetic attribute as a primary key, work with the fake primary key, and when we are done generating everything we throw away the synthetic primary key. Suppose a table has more than two foreign keys. The co-clustering of the foreign keys can use any algorithm; it's orthogonal to UpSizeR. If it has three foreign keys, just find me a three-dimensional co-clustering algorithm and I'll work with that. It doesn't matter to us. Suppose the schema graph is cyclic. A good example of that might be that the manager is supposed to be an employee, and then you have an edge like this. I think we can deal with that.
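A rough sketch of the two-foreign-key step just described: assign each comment to a co-cluster, then pick a user and a photo inside that co-cluster in proportion to their remaining degrees. The co-cluster assignments and degree bookkeeping here are simplified, hypothetical inputs, not UpSizeR's actual data structures:

```python
import random

def generate_comments(n_comments, cocluster_probs, user_deg, photo_deg,
                      user_cluster, photo_cluster):
    """cocluster_probs: {(user_cluster, photo_cluster): probability};
    user_deg / photo_deg: remaining user->comment and photo->comment degrees;
    user_cluster / photo_cluster: which co-cluster each user/photo belongs to
    (all assumed to be mined beforehand)."""
    comments = []
    clusters, probs = zip(*cocluster_probs.items())
    for cid in range(n_comments):
        uc, pc = random.choices(clusters, weights=probs)[0]     # assign comment to a co-cluster
        users = [u for u, c in user_cluster.items() if c == uc and user_deg[u] > 0]
        photos = [p for p, c in photo_cluster.items() if c == pc and photo_deg[p] > 0]
        if not users or not photos:
            continue                                            # cluster exhausted; real code would rebalance
        u = random.choices(users, weights=[user_deg[x] for x in users])[0]
        p = random.choices(photos, weights=[photo_deg[x] for x in photos])[0]
        user_deg[u] -= 1                                        # use up one unit of degree each
        photo_deg[p] -= 1
        comments.append((cid, u, p))                            # (comment_id, user_id, photo_id)
    return comments
```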
If it's a cycle where one relation goes to another relation and comes back, I think we can deal with that too. But if it's an arbitrary [inaudible] cycle, then I'm not sure how we should do that. Remember why we need to assume the schema graph is acyclic: it gives us a way to sort the tables so we know what to generate first, right? But if it's an arbitrary cycle — not just like this, but like that and like that and whatever — where do you start? Until somebody gives me a realistic scenario, I'm not going to worry about it. Then there's the assumption that the degree distribution is static. Sometimes degree distributions are not static. For example, the number of countries in the world — you can't simply upsize that. For us, what we do is upsize it anyway and then hash it back, and that changes the degree distribution. A more interesting example is this case: remember, for Flickr I said the tail of the distribution would have shifted. So suppose the degree distribution changes. There are various options we can use. One option is that when you come to me with the empirical dataset, you also tell me how the degree distributions change, and I will use that instead. Right? Or, to tell me how the degree distributions have changed, you can give me two snapshots and I'll extrapolate from them. If you don't have two snapshots, I can go in here, mine the time column, for example, and from the times you uploaded the photographs extrapolate and say how many photographs you will have uploaded by next month. So there are many options for dealing with a changing distribution. Next, a tuple's nonkey attributes may depend on more than its key attributes. I'm assuming that nonkey attributes only depend on key attributes — not on the other tables, not on the other tuples. Let's look at one example where they depend on the other tuples, and how we can deal with that. In the case of tag, for example, if you are a bird watcher, the tag values you use might be correlated: you're more likely to use bird and tree and sky, less likely to use car and bicycle or whatever. So there's some correlation in the tag values here. How do we deal with that? Remember, the UpSizeR algorithm as described is based on the key values. The tables have already been generated, and now you want to fill in the tag values so as to capture this correlation. Let's say this particular user needs a certain number of tag values, and you're trying to give him tag values that are correlated. So you need to generate a vector of tag values for this particular user. What I do is go into the empirical dataset and sample from it some tag vector, and this tag vector will presumably have some coherence. If it's too long, I just throw away some of the values; if it's too short, I add some more. When I add some more, I have to respect the correlation among the tag values, which means that when you give me the empirical dataset I have to mine the correlations in the tag values. That's very compute-intensive. So if you try to capture all the correlations in the nonkey attributes, it's going to be a major pain — but if that's what you need, then that's what you need. Okay. Up to this point, I think we have the basic bag of tricks for dealing with classical datasets — banking data or retailer data, et cetera. But now comes the hard part.
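A sketch of the tag-vector step just described in words: sample a tag vector from the empirical data, then trim or extend it to the needed length, using a mined pairwise co-occurrence table to keep added tags coherent. The names and the co-occurrence structure are assumptions for illustration only:

```python
import random

def sample_tag_vector(need, empirical_vectors, cooccur):
    """need: how many tag values this synthetic user should get.
    empirical_vectors: tag vectors sampled from the real dataset.
    cooccur: {tag: {other_tag: weight}}, a mined pairwise correlation table."""
    vec = list(random.choice(empirical_vectors))      # start from a real, coherent vector
    if len(vec) > need:
        vec = random.sample(vec, need)                # too long: throw some away
    while len(vec) < need:                            # too short: add correlated tags
        seed = random.choice(vec)
        candidates = {t: w for t, w in cooccur.get(seed, {}).items() if t not in vec}
        if candidates:
            tags, weights = zip(*candidates.items())
            vec.append(random.choices(tags, weights=weights)[0])
        else:                                         # fall back to any unused empirical tag
            pool = [t for v in empirical_vectors for t in v if t not in vec]
            if not pool:
                break
            vec.append(random.choice(pool))
    return vec

# Toy usage: bird-watcher-like vectors and their mined co-occurrence weights.
birder_vectors = [["bird", "tree", "sky"], ["bird", "nest", "tree"]]
cooc = {"bird": {"tree": 5, "sky": 3, "nest": 2}, "tree": {"bird": 5, "sky": 2}}
print(sample_tag_vector(4, birder_vectors, cooc))
```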
What if the dataset were generated by a social network? The issue there is that if the data is generated, induced, by social interactions, I have to understand what the correlations look like, so that when you give me the dataset I can extract from it the social interactions, the social network; then I can scale up the social network and use the bigger social network to induce the correlations in the synthetic dataset. So I need to know this mapping; I need to reverse-engineer the social interactions. And I don't know how to do that. There's no literature on this. There's a lot of work on graph-theoretic views of social interactions, and a lot of work on algorithms for social networks, but as far as I can tell there's no work on a database-theoretic view of social interactions: if a relational database were induced by a social network, what kind of correlations would it have? I've not seen papers of that kind. So since there's no literature on this, we just do some half-assed thing. This is our hypothesis: friends are similar, and because they're similar their foreign key identities can be swapped. What does that mean? Let me give you the example. In the Flickr dataset I'm going to define friends as two people commenting on each other's photographs. We take that as evidence that they're friends, and you can extend that to Facebook, writing on each other's walls, for example. Similar — how do I tell whether two users are similar? We just look at the tag vectors. If this guy uses this particular tag vector and that guy uses that particular tag vector, we take the cosine, and we use that to judge how likely it is that they're going to be friends. So now, how to replicate the social network? This thing is generated by a social network, and I need to replicate that in this dataset. The first thing I do is go in here and strike out all the cases where people comment on their own photographs — this is a technicality; almost everybody does that. Then, because I defined friends as commenting on each other's photographs, I can actually figure out who the friends are in this dataset; I can reverse-engineer it. Now, I've already generated the synthetic tables, and what I need to do is go and tweak the values in those tables so that I capture the social network — up to this point I have not done that. First of all, I need to know how many pairs to tweak. In the original, empirical dataset there are so many edges, so many pairs of friends. For the synthetic one, I scale that number up by s. Why s? Shouldn't it be s squared? Well, I checked the literature: the network could be growing exponentially, but the number of friends you have doesn't grow exponentially, except for some outliers. My understanding is that the way a social network grows is that you add more sub-social-networks, rather than one social network blowing up; and if it grows by adding sub-social-networks, then yes, the number of friend pairs does grow linearly. So that's how we're going to do it. I'm going to go in and tweak the attribute values so that I have this number of friends. I'll pick M pairs from this set, and I'll decide whether they're going to be friends by calculating the cosine: if the cosine is 0.3, then with probability 0.3 I make them friends.
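The similarity test can be sketched directly: treat each user's tag usage as a vector, take the cosine, and make the pair friends with that probability. This is only the stated hypothesis written down; note that the match is on exact tag strings, which is why a bird/birds mismatch scores zero later in the talk:

```python
import math
import random
from collections import Counter

def cosine(tags_x, tags_y):
    """Cosine between two users' tag usage, treated as multisets of tag strings."""
    vx, vy = Counter(tags_x), Counter(tags_y)
    dot = sum(vx[t] * vy[t] for t in vx)                 # exact string match only
    nx = math.sqrt(sum(c * c for c in vx.values()))
    ny = math.sqrt(sum(c * c for c in vy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def make_friends(tags_x, tags_y):
    """Declare the pair friends with probability equal to their cosine."""
    return random.random() < cosine(tags_x, tags_y)

print(cosine(["bird", "tree", "sky"], ["bird", "nest"]))   # about 0.41
print(cosine(["bird", "chicken"], ["birds", "hen"]))       # 0.0: no shared strings
```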
So for each pair that I choose: this guy commented on his own photograph, and that guy commented on his own photograph. Now I swap these two. When I swap them, they comment on each other's photographs. We do that for M pairs, and that's how we generate the social network in the synthetic dataset. Okay, enough of that. Now for some data. We've crawled Flickr to get a base set, F1. The data is stored in four tables, and these are the real values crawled. Then we take this base set and upsize it by 1, meaning we make a synthetic copy. These two values should be the same; it's just a floating point thing, so this is not a problem. This is measuring the table sizes, the number of tuples in these four tables. These two tables are generated using the degree distributions, and you see that the disagreement is already significant. Those two tables are created by the co-clustering, and apparently it works decently. Then we crawled some more — at that point we had about one and a half million comments — and F2.81 is a dataset that has 2.81 times the number of users in F1. We did a similar thing, and I'll let you judge how well it works. This is the one with the most severe disagreement, and it's not clear why that is so; it could be that comments have some recursive property — when I comment on your stuff, you comment on my stuff — and it blows up, something like that. So this is comparing the table sizes. But I said we should judge the quality of UpSizeR by looking at the queries, right? So we tested it on four queries with different numbers of joins, and here you see what we got; I think it's pretty decent up to this point. You can look at this as estimating join sizes, and one comment I got from some people was that this kind of estimation should be built into query optimizers. So anyway, these are the queries.

>>: The F9.11 numbers — are they Flickr or are they synthetic?

>> Y.C. Tay: This is real.

>>: These are real, and these are your UpSizeR's?

>> Y.C. Tay: Yes, this is taking the original dataset and upsizing it by 9.11. So these exercises are checking whether we were able to capture the correlation among the key values. Now, what about nonkey values? We tested retrieval on a nonkey value, like the tag value bird here, for example, and the agreement is not so great now. It's okay here, but for the smaller sizes it's not so great, and it's not clear to me why it's better for the bigger sets and not so good on the smaller sets. We didn't try very hard. In the case of this particular query, we are actually testing whether we captured the correlation among the nonkey values, the tag values, and we didn't try very hard: as I mentioned, just capturing the pairwise correlation between tag values is already very compute-intensive, and if you try to capture the correlation among more than two tag values, it's horrendous. So we didn't try so hard on that one. I imagine we can improve on these values. Up to this point, the social network is not in the queries. Now we want to test how this performs if the query refers to the social network — what would the comparison look like? Remember that UpSizeR deals with the social network using the swap technique, so we compare the real query results against UpSizeR with and without the swap, and we'll see how it goes.
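Going back to the swap step described at the start of this passage, a toy sketch of what swapping the foreign key identities does, assuming comment rows of the form (comment_id, user_id, photo_id):

```python
def swap_to_make_friends(comments, cx, cy):
    """Swap-technique sketch: cx and cy index two self-comments, i.e. rows where
    each user commented on their own photo. Swapping the photo ids makes the two
    users comment on each other's photos, i.e. makes them 'friends' under the
    talk's definition of friendship."""
    cid_x, ux, px = comments[cx]
    cid_y, uy, py = comments[cy]
    comments[cx] = (cid_x, ux, py)   # x now comments on y's photo
    comments[cy] = (cid_y, uy, px)   # y now comments on x's photo

# Toy example: u1 and u2 each commented on their own photo (p1 owned by u1, p2 by u2).
cmts = [(0, "u1", "p1"), (1, "u2", "p2")]
swap_to_make_friends(cmts, 0, 1)
print(cmts)   # [(0, 'u1', 'p2'), (1, 'u2', 'p1')]
```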
So the first query is: retrieve pairs of friends. And you see that without the swap, you completely cannot capture it, right? You have millions of users, millions of photographs, millions of comments; the probability that you and I will comment on each other's photographs is basically zero. So you really have to go in and tweak the values if you want to capture the social network. Then, the way we replicate the social network is to look at whether two users are similar, and the way we check whether two users are similar is to look at tag vectors — so we want to see whether this is actually a good reflection of whether they comment on each other's photographs. So you retrieve X and Y who are friends and have at least N tags that they both use, and the agreement is still decent until about here. So maybe we can scale not by a factor of ten, but something less than that is okay.

>>: What breaks at 9.11? Looks like a photo —

>> Y.C. Tay: I don't know. We should look into that. Remember, the way we built up this dataset is we crawled and crawled and crawled. It could be that there's something wrong with this particular dataset; maybe the way we did the crawling, we might have messed up or whatever. It's not clear. So, social interactions have structure, and this is the major issue. One structure would be the triangle: a friend of a friend is likely to be a friend. We didn't try that — we're not so ambitious. We just wanted to see whether we can capture the V shapes: X and Y are friends and Y and Z are friends, so can we capture that? And the answer is clearly no. So what happened? It was doing well up to this point. It turns out that, yes, we can scale up the number of friends accurately, but we cannot scale up the topology. In Flickr there are some hotshots with many friends, like this, in a star shape, and when we scale it up to F prime we lose that star shape — the star shape just dissipates. So we have to figure out some way of doing that. Now, it may have to do with the way we define similarity: remember, we replicate the friendship graph by looking at similarity, and similarity is defined in terms of tag vectors. Currently, if this tag vector were bird, chicken, and that tag vector were birds, hen, then the cosine is zero. You can imagine doing better than that — you put in an ontology or whatever — but my students tell me you need a lot more than just that: you will fix some of the structural issues, but you can't fix all of them. So, I described two problems. The first is the dataset scaling problem, and as I said, UpSizeR now has what I believe is the basic set of techniques for scaling up any classical dataset. But when it comes to the nonkey attributes, there are all kinds of correlations in real data, and I imagine the kind of correlations one application is interested in will differ from the kind of correlations another application is interested in. So if UpSizeR is to be a tool that will work for any application, we're going to need a lot of help; we can't do this ourselves. So we are releasing UpSizeR as open source, and we hope that if it is useful, people will contribute the relevant changes and additions. The other problem that I introduced is the social network attribute value correlation problem — quite a mouthful; I can't figure out a shorter way of describing it.
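For the structural check just mentioned, a short sketch of counting V shapes (pairs X-Y and Y-Z of friend edges sharing a node): a node with d friends contributes d(d-1)/2 of them, which is why losing the star-shaped hubs collapses the count even when the edge count scales correctly:

```python
from collections import defaultdict

def count_v_shapes(friend_pairs):
    """Count V shapes: pairs of friendship edges that share an endpoint.
    Each node with degree d contributes d*(d-1)/2 V shapes, so a star hub
    with many friends dominates the count."""
    deg = defaultdict(int)
    for x, y in friend_pairs:
        deg[x] += 1
        deg[y] += 1
    return sum(d * (d - 1) // 2 for d in deg.values())

star = [("hub", f"u{i}") for i in range(10)]        # one user with 10 friends
chain = [(f"u{i}", f"u{i+1}") for i in range(10)]   # 10 edges arranged in a path
print(count_v_shapes(star), count_v_shapes(chain))  # 45 vs 9: same edge count, very different structure
```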
And we'll be the first to admit that the swap technique is a dog; it's not going to generalize. So what would be a general technique for replicating a social network in a relational dataset? Here we'll need some kind of theory to help us, and right now, because there's no database-theoretic work on social networks, we don't know what to do. If there were some theory built around the relationship between the social network and the relational dataset, it would give us something to work with. I think it's a wide-open problem — it's completely blank now, and anyone who gets there first will collect the citations. This is the kind of thing I encourage students in the universities to do. You guys are too old for that. But.... Okay, I'll be happy to take more questions.

>>: Could you run UpSizeR with a scale factor less than one and see how well you predict the past? It's easier than predicting the future.

>> Y.C. Tay: Predict the past — we haven't done that. We should do that. This is something I intend to do in the next set of experiments, yeah.

>>: Can you give us a sense of performance, how long it takes?

>> Y.C. Tay: I keep forgetting to get the numbers. If you want to capture the nonkey attribute correlations, it could take days. But this kind of number really depends on what machines you're using and how many of them you throw at it, so I don't know how meaningful it is. If you don't care about the nonkey attributes and you only worry about the correlation among the keys, I think for F1 it was on the order of an hour to make a synthetic copy. And F1 is one and a half million comments and half a million photographs.

>>: Is it naturally parallelizable?

>> Y.C. Tay: Some parts of it can be parallelized, but I think there are other parts that don't parallelize so well. This sequence of generation cannot be parallelized: in general, these two guys both refer to these two guys, so these two tables have to be generated together. But as I mentioned, you can change the code so that you generate these two in parallel. Furthermore, if you go deeper into the hierarchy, there could be other tables that use a completely different set of keys, and since those refer to a completely different set of keys, they can be generated independently of these two. So some of it can be parallelized, but this particular sequence has to be sequential: that one has to come first, then that one, and then these two.

>>: Seems like you could have a preprocessing phase that figures out what the parallelism opportunities are just by analyzing the foreign key relationships, really, to figure out where you can partition it.

>> Y.C. Tay: Yeah, certainly. You might not even need to do that; you just put the tables in a queue and take them off the queue as their dependencies are done.

>>: I see. Just generate the —

>> Y.C. Tay: Just queue up the tables, and if a processor is idle, it pulls out the next one to generate.

>>: You could horizontally partition here, can't you?

>> Y.C. Tay: Of these tables?

>>: Yeah. I mean, you're following paths of foreign keys, but within a table it seems like you can —

>> Y.C. Tay: If you assume that the tuples don't have correlations, then you can generate those tuples in parallel — that's very fine-grained parallelism.

>>: Anyone else?

>> Y.C. Tay: Great, we are on time.

[applause]
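As a postscript to the parallelism discussion above, a sketch of the queue idea: a table becomes ready to generate once every table its foreign keys reference has already been generated. The schema description format and function names here are hypothetical, not part of UpSizeR:

```python
def generation_order(fk_refs):
    """fk_refs: {table: set of tables its foreign keys point to}.
    Yields batches of tables that can be generated in parallel, in an order
    that respects foreign-key dependencies (the schema graph must be acyclic)."""
    remaining = {t: set(refs) for t, refs in fk_refs.items()}
    done = set()
    while remaining:
        ready = [t for t, refs in remaining.items() if refs <= done]
        if not ready:
            raise ValueError("cyclic schema graph")
        yield ready                      # these tables depend only on finished ones
        done.update(ready)
        for t in ready:
            del remaining[t]

flickr = {"user": set(), "photo": {"user"},
          "comment": {"user", "photo"}, "tag": {"user", "photo"}}
print(list(generation_order(flickr)))
# [['user'], ['photo'], ['comment', 'tag']] -- comment and tag can be generated in parallel
```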