>>: It's a real pleasure to introduce Y.C. Tay, professor from the National University of Singapore. I've known Tay for almost 30 years, back when he was a graduate student at Harvard, which is when we first met, and he ended up doing a Ph.D. thesis in the database area on the performance of locking systems, which was really landmark work at the time and is still very relevant.
He uncovered the abstraction of lock thrashing, which was behavior that had been observed but nobody had really understood from an analytic standpoint, and he also did a good job in separating out the issue of contention for resources other than locks from the contention that arises from locks. People still read this work today. It's really as relevant now as it was. In general, most of his research is on analytic models of performance.
He recently came out with an introductory book on the topic, a monograph from the Morgan Claypool synthesis series, on analytic performance models of computer systems, that was published just last month. And you can see a pointer to that on his Web page online. His Web page also has a nice pointer to a lecture from a couple of years ago that he gave on universal cache miss equations for memory hierarchies.
So he works on hard problems that are of pretty general interest to performance in computer systems, and today he's going to tell us how to upsize a database so that you can use it for performance evaluation of database systems.
>> Y.C. Tay: So I'm from the other side of the ocean, and coming here one thing I
learned is always to have an advertisement. So as Stu mentioned, this just came out,
this book. Less than 100 pages, and out of the -- and the idea is to introduce the
elementary performance modeling tricks to people who have no interest in becoming
performance analysts. They just need to have some rough model of how their system is
behaving.
The techniques are actually illustrated with discussions of 20 papers, and some work that was done here. So take a look. It's an easy read, I promise.
So UpSizeR.
>>: You're going to autograph them later?
[laughter]
>> Y.C. Tay: The book should be read in soft copy because the 20 papers are
hyperlinked. So I don't know how to autograph electronically.
>>: I don't know, do they publish hard copies?
>> Y.C. Tay: They do. They do. Yeah. But it's best read in soft copy. So this is work
done with the students, Bing Tian Dai, Daniel Wang, Yuting Lin and Eldora Sun. What
I'm going to do is describe two problems. One I call the dataset scaling problem and the other one the social network attribute value correlation problem, and of course introduce UpSizeR. It's the first attempt at solving the dataset scaling problem. And it has within it what I call a strawman solution to the attribute value correlation problem. And you'll very soon see why I call it a strawman solution.
So the motivation we are in the era of big data. And big data means different things to
different people. It could mean genome data or astronomy data. But in our context
here, I'm talking about Internet services.
So some of these applications have insane growth rates. And I imagine if you are the developer of something that you aspire to be the next Twitter, you worry that your dataset is going to grow so large next month that your system may not be able to cope with it. So if you're looking forward, you might want to test the scalability of the system with some -- by definition synthetic -- dataset. So how can you do that? One way might be to use TPC benchmarks. TPC benchmarks allow you to specify the size of the dataset -- one gig, 100 gig -- and generate it for you, and there are different flavors as well. And they're supposed to be domain-specific.
But they're not application-specific. If you think about something like TPC-W, which is supposed to be for e-commerce, and Amazon and eBay are roughly speaking e-commerce, how relevant that benchmark is for Amazon or eBay is not clear.
So the idea here is that we want something that's application-specific. So rather than
design one benchmark for this application and a benchmark for that, what we want is
one single tool where you come with your dataset and I'll scale it up for you.
So the dataset scaling problem is that you give me a set of relational tables D -- in UpSizeR, the R is capitalized because I want to emphasize that I'm talking about relational tables, so people don't misunderstand that I'm upsizing a graph or something like that -- and a scale factor S, and I'll generate a dataset D prime that is similar.
Here's where I use my probe. You come to me. I'm UpSizeR and you come to me with
dataset D and I'll scale it up to D prime. These two are going to be similar.
The trick here, of course, is what does it mean to be similar. Right? It could be
statistical properties. The statistical properties in here must be similar to the statistical
properties in here, or it could be graph -- if this is generated by a social network, then the
social network here must be similar in some sense.
So when defining the problem, since UpSizeR is supposed to work for different applications, I wanted a definition of similarity that is application-independent -- one that would work for any application.
So our way around the problem is this: when you come to me with the dataset, you also have in mind a certain set of queries that you're going to run. And the way you're going to judge whether D prime and D are similar is to run the queries on D prime and see whether the results are what you expected.
So that's how you're going to judge whether D prime is similar. As the name suggests, when we conceived the problem we were thinking of upsizing -- that's why we call it UpSizeR -- scaling with a factor bigger than 1.
When I went around and talked to people, it seemed there was some interest in same-sizing, making a copy that is the same size. Why would you want to make a copy that's the same size? Well, various scenarios. For example, I'm a vendor and you hire me -- you have a dataset and you hire me to do something with your dataset.
But you don't want to show me your real dataset. What you can do is run UpSizeR to make a synthetic copy, and you let me work with the synthetic dataset. That's one possibility. Another possibility, in the cloud setting: you may want to test a different layout for your dataset in the cloud, or you might want to test a different index implementation or whatever. You're going to do all of this in the cloud, but you don't want to upload your data into the cloud. So what you might do is generate a synthetic copy in the cloud. Or, for people who are already working in the cloud and want to do some experiments somewhere else in the cloud, you don't want to ship all the data there.
So what you might want to do is extract the statistics, go there, blow it up and get a synthetic copy. There are various scenarios you can think of.
But actually when I talked to a particular company, this is a big company. And they have
no shortage of big datasets. What they're actually interested in is the downsizing. So
why would you be interested in downsizing? They see their customers struggling with trying to get a sample of their dataset.
So you have a dataset that is this big, and you have some prototype application that you want to debug. Now, heaven forbid you run your application on this big dataset. You want to take a small sample of the dataset and then debug your application there.
Sampling -- what's so difficult about sampling? If this thing has a million users and you want to test your application on a thousand users, you randomly sample a thousand users. What's so difficult? Well, it's not so simple, because you might randomly sample a thousand users, and these thousand users are somehow connected to referenced objects, and those objects are somehow connected to other users, and you have to pull out those users too, and it's like spaghetti -- in the end you get way more than a thousand users.
So for various reasons, you might also be interested in downsizing. Now, the UpSizeR
algorithm will work for all these three instances. Okay. So I would like to say there's no
previous work, but I'd be lying. So I already mentioned TPC benchmarks, and if you trace them backwards you get to the Wisconsin benchmark, which was done 27 years ago. In the Wisconsin benchmark, the relations are completely synthetic -- meaning that they did not try to scale up from some empirical dataset. And that particular approach propagated down to the TPC benchmarks. That's how they do it.
So it's not that they did not consider the possibility of scaling up empirical datasets, but they encountered some issues. When you try to scale up an empirical dataset and you try to keep the statistics the same, if your table has only 20 tuples, then you worry whether those 20 tuples are telling you anything about the real underlying distribution.
For us this is not a problem, because the datasets we're dealing with nowadays are so huge. Another issue they encountered: why were they interested in synthetic datasets? They wanted to compare one vendor's offering against another vendor's, and use these benchmarks to do that comparison. When you want to do this comparison, you don't just have the dataset. You actually run some queries, and you want to tune your data and your queries so that you can test join algorithms or the buffer management or whatever.
So this tuning is much easier if everything is synthetic; if you use an empirical dataset it's much harder to do the tuning. For me here, the way I formulated the problem --
>>: Question. If I want to buy supplies for my company, I want the query to be the query that I run tomorrow morning for my business -- the closer it is, the more honest the benchmark will be than these other benchmarks.
>> Y.C. Tay: So you want to have a benchmark where you can test this vendor's offering against that vendor's. So you want to tune the queries, et cetera, so you can test what this system is actually doing versus what that system is actually doing.
>>: If I wanted to do research on the database, if I want to use it like a user, no.
>> Y.C. Tay: Okay. That was their idea, their particular argument. So for us it's not an issue because UpSizeR doesn't worry about queries. UpSizeR assumes that you have a set of queries in mind, but it doesn't make use of that set of queries. So this is important to remember: in formulating a solution it's entirely possible to factor in the set of queries that you want to answer when doing the upsizing, but we don't do that.
The problem that they came up against was that scaling an empirical dataset is hard -- they're saying that the dataset scaling problem is hard. But it's been 27 years. That's a long time in this business, and it's about time we take another crack at the problem.
So for you guys I assume I don't have to go through this. I'm going to use the Flickr
dataset. So in the Flickr dataset, users upload photographs and make comments and
tag the photographs, et cetera. And my job is to upsize it.
So in the Flickr dataset, you have a photo table with a primary key and a foreign key value that indicates who uploaded the photograph, and you have nonkey attributes for the date, the size, et cetera.
So the user ID here is pointing to another table: it's the primary key for another table with a user name, user location, et cetera. These are the nonkey attributes. And then you'll have a comment table, which says that this particular user commented on this particular photograph, and a tag table that says a particular user tagged this particular photograph.
So what you get now is a schema graph. What assumptions do we use? We start with the following. I'm going to assume that the primary key is just a single attribute. Sometimes it could be two, but right now I assume it's just one. I'm going to assume that a table has at most two foreign keys. So in the example that I'm using, the user table has no foreign key. The photo table has one foreign key.
The tag table has two foreign keys and the comment table has two foreign keys. And that's it -- no three foreign keys. I'm going to assume that the schema graph is acyclic.
I'm going to assume that the degree distribution is static. I'll define what degree distribution means; let me give you the intuition first.
A user will upload a number of photographs. If you take a histogram, you can change it into a distribution that says with probability so-and-so you upload one photograph, two photographs, et cetera.
I'm going to assume that this distribution is the same -- the degree distribution for this and the degree distribution for this are the same. That is not necessarily reasonable, right? Think about why we want to do this. We want to do this because you're afraid that when the dataset gets to be this size, your system cannot cope.
So there's a time element involved. So if you're thinking about uploading photographs, by the time you get to here, everybody would have uploaded more photographs, right? The tail of the distribution would have shifted. So it's maybe not reasonable to assume that the degree distribution is static.
But for the time being let's use that assumption. I'm going to assume the non-key
attributes depend only on the key values. So the non-key attributes here only depend on
the key values. That means in the case of a user table, these values here only depend
on the key values, don't depend on the other tables, don't depend on the other TUPLs.
This thing is completely synthetic. It has no meaning. What does it mean to depend on
this? What it means is that these values here can be independently generated.
So when you give me the empirical dataset I'm going to mine it. I'm going to find out
what's the distribution on the age, the sex, the names, et cetera, and I'm going to use
those distributions to independently generate the values here. And later on I'll worry
about correlations.
But at the moment let's assume that it's independent. Okay. So for more sophisticated tables like this one, these values in general are a function of the key values, right? If I know that this user is a gardener, there could be a function that tells me what the nonkey attributes might be over here. So that is nontrivial -- the correlation is nontrivial.
Okay. And I'm going to assume that the data correlations are not induced by a social network. That's not true for the Flickr network, but let's just use these simplifying assumptions for the time being.
So UpSizeR uses three ideas. The first idea is the degree. It's really simple. This user uploads four photographs, so the degree of Y to photo is four. And user X makes two comments, so the degree of X to comment is 2. User Y makes one comment, so the degree of Y to comment is 1. Look at it for a moment and you realize that what I'm talking about here is a bipartite graph. Every edge in the schema graph induces a bipartite graph between the rows here, and the degree is just the degree in that bipartite graph; that's why we use that terminology.
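To make the bookkeeping concrete, here is a small Python sketch -- toy tables with made-up layouts, not the actual UpSizeR code -- that computes these bipartite degrees and turns the histogram into a degree distribution:

    from collections import Counter

    # Toy photo and comment tables; layouts and values are illustrative only.
    photo = [(1, 'Y'), (2, 'Y'), (3, 'Y'), (4, 'Y'), (5, 'X')]    # (photo_id, user_id)
    comment = [(10, 'X', 1), (11, 'X', 2), (12, 'Y', 5)]          # (comment_id, user_id, photo_id)

    # Degree of each user toward the photo table (how many photographs uploaded).
    deg_user_photo = Counter(uid for _, uid in photo)
    # Degree of each user toward the comment table (how many comments made).
    deg_user_comment = Counter(uid for _, uid, _ in comment)
    print(deg_user_photo['Y'], deg_user_comment['X'], deg_user_comment['Y'])   # 4 2 1

    # Turn the histogram of photo-degrees into a degree distribution.
    users = set(deg_user_photo) | set(deg_user_comment)
    hist = Counter(deg_user_photo[u] for u in users)
    degree_dist = {k: n / len(users) for k, n in hist.items()}
    print(degree_dist)   # {4: 0.5, 1: 0.5} for this toy data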
The next thing we worry about is the joint degree distribution. So I'm a user. I upload so many photographs -- there's my degree. And I make so many comments, and that's my degree. And these two degrees are going to be correlated, because a user likes to comment on his own photographs.
So we have to respect that correlation. Then, in a table like comment, for something like a social network, the users might cluster -- there might be a social network for gardeners, a social network for bird watchers, et cetera -- and the photographs might be clustered around flowers or kites or cars or whatever.
And it's more likely that a gardener would comment on a flower. So there's some co-clustering distribution that we have to respect. So for a table with two foreign keys, the way UpSizeR deals with the correlation between the two foreign keys is to look at a co-clustering distribution.
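A rough sketch of these two statistics on the same toy tables; the cluster labels are made up here, since any co-clustering algorithm could supply them:

    from collections import Counter

    photo = [(1, 'Y'), (2, 'Y'), (3, 'Y'), (4, 'Y'), (5, 'X')]
    comment = [(10, 'X', 1), (11, 'X', 2), (12, 'Y', 5)]

    deg_photo = Counter(uid for _, uid in photo)
    deg_comment = Counter(uid for _, uid, _ in comment)
    users = set(deg_photo) | set(deg_comment)

    # Joint degree distribution: P(a user has photo-degree j AND comment-degree k).
    joint = Counter((deg_photo[u], deg_comment[u]) for u in users)
    joint_degree_dist = {jk: n / len(users) for jk, n in joint.items()}
    print(joint_degree_dist)        # {(4, 1): 0.5, (1, 2): 0.5}

    # Co-cluster distribution for the comment table, given cluster labels that
    # some co-clustering algorithm is assumed to have produced beforehand.
    user_cluster = {'X': 'gardener', 'Y': 'birdwatcher'}
    photo_cluster = {1: 'flower', 2: 'flower', 3: 'bird', 4: 'bird', 5: 'bird'}
    pairs = Counter((user_cluster[u], photo_cluster[p]) for _, u, p in comment)
    cocluster_dist = {c: n / len(comment) for c, n in pairs.items()}
    print(cocluster_dist)           # roughly {('gardener','flower'): 0.67, ('birdwatcher','bird'): 0.33}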
So these are the three ideas that we work with. The algorithm is quite simple. The first thing we do is we sort the four tables. Remember that there's a schema graph on the four tables. So when we sort it, the first one to come out would be the one with no foreign key. Then the one with a foreign key pointing to whatever has already come out, like this one for photo. And then the ones with foreign keys that point to whatever else has come out, like these two. This is the order in which we generate the tables.
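A minimal sketch of that sorting step, assuming the Flickr-style schema; the table names are illustrative:

    # Schema graph as "table -> tables its foreign keys point to" (illustrative).
    fkeys = {
        'user':    [],
        'photo':   ['user'],
        'comment': ['user', 'photo'],
        'tag':     ['user', 'photo'],
    }

    def generation_order(fkeys):
        """Sort tables so that each table comes after every table it references.
        Assumes the schema graph is acyclic, as in the talk."""
        done, order = set(), []
        while len(order) < len(fkeys):
            for t, refs in fkeys.items():
                if t not in done and all(r in done for r in refs):
                    order.append(t)
                    done.add(t)
        return order

    print(generation_order(fkeys))    # ['user', 'photo', 'comment', 'tag']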
Okay. So how do we generate the user table? So your empirical dataset has a million
users. You want to upsize it to two million. Very simple. I generate two million
primary key IDs and then for each one of them I generate the rest of the nonkey values, and that's done.
Then the --
>>: [Inaudible] nonkey value.
>> Y.C. Tay: Sorry.
>>: How do you determine the nonkey value?
>> Y.C. Tay: So, remember the assumption that a nonkey value only depends on the key values. And since in this case the key values have no meaning, the nonkey values are a function of the key values in some trivial way.
So that means that I can independently generate the nonkey values. So --
>>: But the nonkey values have properties. The name column might have a certain length distribution.
>> Y.C. Tay: I mine this guy for the name distribution, the age distribution, the sex,
whatever, and I use those to generate those values. So at this moment I don't worry
about correlation. I just generate them independently.
>>: When you talk about a name, you're not modeling it -- it doesn't come from a dictionary, it could just be a string of characters?
>> Y.C. Tay: Right. In this first cut version of UpSizeR, we are focused on the key
values. So we don't do anything sophisticated with the nonkey values.
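Here is a minimal sketch of that user-table step, assuming mined value distributions with made-up names and numbers; it is not the released UpSizeR code:

    import random

    # Value distributions mined from the empirical user table (numbers made up).
    name_dist = (['Ann', 'Bob', 'Chen'], [0.5, 0.3, 0.2])
    age_dist  = ([25, 35, 50],           [0.6, 0.3, 0.1])

    def upsize_user_table(n_old, scale):
        """Generate scale * n_old users: fresh primary keys, with non-key
        attributes drawn independently from the mined distributions."""
        rows = []
        for uid in range(int(scale * n_old)):
            name = random.choices(*name_dist)[0]
            age = random.choices(*age_dist)[0]
            rows.append((uid, name, age))       # (user_id, name, age)
        return rows

    users = upsize_user_table(n_old=1000, scale=2)   # a million users in the talk
    print(len(users))                                # 2000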
So the next thing to generate is the photo table. So how do we do the photo table? Remember, the first thing we do is we mine this for the degree distribution -- the degree from user to photo. This user uploads so many photographs, so we know that distribution. There are two million users here. So what I do is, for each one of them, I pick a degree value from the distribution. So user number one has degree three: he uploads three photographs, so I generate three tuples. User two has five photographs, so I generate five tuples. Done. That's easy enough.
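A sketch of that photo-table step, with a made-up degree distribution:

    import random

    # Degree distribution mined from the empirical data:
    # P(a user uploads k photographs); numbers are made up.
    degrees, probs = [1, 2, 3, 5, 10], [0.4, 0.3, 0.15, 0.1, 0.05]

    def generate_photo_table(user_ids):
        rows, next_pid = [], 0
        for uid in user_ids:
            k = random.choices(degrees, probs)[0]   # pick this user's degree
            for _ in range(k):                      # one tuple per uploaded photograph
                rows.append((next_pid, uid))        # non-key attributes filled in later
                next_pid += 1
        return rows

    photos = generate_photo_table(range(2000))      # two million users in the talk
    print(len(photos))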
Now for the correlation, for the case where there are two foreign key values, right? What I'm going to do is assign a degree from the user table to this column and a degree from the photo table to this column. Now, the degree from the user table to this column is correlated with the degree from user to photo. And that one has already been generated. So I have to take a conditional probability. No problem.
So as long as I respect that correlation. The other thing I have to worry about is the correlation between these two columns. So for the correlation between these two columns, what I do is first I generate a comment ID. I know the co-clustering distribution, so I just assign the comment ID to a co-cluster. Let's say I assign it the co-cluster for gardeners and flowers. Now after I've assigned it the co-cluster for gardeners and flowers, I have to pick a particular gardener and a particular flower. If this gardener comments on 10 flowers and that gardener comments on three flowers, then with higher probability I assign the comment to the first gardener. So it's common sense.
So this is what I do: I assign a co-cluster, I pick a user in the user cluster according to the degree, and a photograph according to the degree. When I'm done with that, I will have the three key values. Once I have the three key values, the nonkey attributes depend on the key values, so I can generate the rest of them, and that will be one tuple. And then I keep doing this. Every time I assign a comment to a user I reduce his degree, and every time I assign it to a photograph I reduce its degree. And I do that until the degrees are zero. And then I generate the tag table similarly.
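A sketch of that generation loop, assuming the co-cluster labels, the co-cluster distribution and the per-tuple degrees have already been mined or generated; all names are illustrative, not the UpSizeR code:

    import random

    def generate_comment_table(n_comments, cocluster_dist,
                               cluster_users, cluster_photos, user_deg, photo_deg):
        """For each comment id: pick a co-cluster, then a user and a photo inside it
        with probability proportional to their remaining degrees, then consume
        those degrees."""
        clusters, weights = zip(*cocluster_dist.items())
        rows = []
        for cid in range(n_comments):
            uc, pc = random.choices(clusters, weights)[0]
            cand_u = [u for u in cluster_users[uc] if user_deg[u] > 0]
            cand_p = [p for p in cluster_photos[pc] if photo_deg[p] > 0]
            if not cand_u or not cand_p:
                continue                              # cluster exhausted in this sketch
            u = random.choices(cand_u, [user_deg[x] for x in cand_u])[0]
            p = random.choices(cand_p, [photo_deg[x] for x in cand_p])[0]
            user_deg[u] -= 1
            photo_deg[p] -= 1
            rows.append((cid, u, p))                  # non-key attributes come afterwards
        return rows

    # Tiny demo with made-up degrees and cluster labels.
    print(generate_comment_table(
        3, {('gardener', 'flower'): 0.7, ('birdwatcher', 'bird'): 0.3},
        {'gardener': ['X'], 'birdwatcher': ['Y']}, {'flower': [1, 2], 'bird': [5]},
        {'X': 2, 'Y': 1}, {1: 1, 2: 1, 5: 1}))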
So it's very simple. The complete algorithm has certain complications, but the basic ideas are already here. Now, relaxing the assumptions. So
suppose the primary key has two attributes, then what we do is we create a synthetic
attribute as a primary key. We work with the fake primary key, and then when we are
done generating everything we throw away the key, the synthetic primary key.
Suppose it has more than two foreign keys. For the co-clustering of the foreign keys we can use any algorithm -- it's orthogonal to UpSizeR. If it has three foreign keys, just find me a three-dimensional co-clustering algorithm and I'll work with that. It doesn't matter
to us.
Suppose the schema graph is cyclic. A good example of that might be: the manager is supposed to be an employee, and then you have an edge like this. I think we can deal with that. If it's a cycle where one relation goes to another relation and it comes back, I think we can deal with that, too. But if it's a [inaudible] cycle, then I'm not sure how we should do that. Remember why we need to assume this is acyclic. We need to assume it's acyclic because that gives us a way to sort the tables, and we know what to generate first. Right? But if it's an arbitrary cycle -- not just like this but like that, like that and whatever -- where do you start? Until somebody gives me a realistic scenario, I'm not going to worry about it.
Next, the degree distribution being static. Sometimes degree distributions are not static. For example, the number of countries in the world -- you can't simply upsize that. So for us, what we do is we upsize it anyway and then we hash it back, and that changes the degree distribution.
A more interesting example is this case. Remember for Flickr, when I said the tail of the distribution would have shifted. So suppose the degree distribution changes. There are various options we can use. One option is that when you come to me with the empirical dataset, you also tell me how the degree distributions change, and then I will use that instead. Right? If you're going to tell me how the degree distributions have changed, give me two snapshots and I'll do an extrapolation from that.
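For instance, one naive way to extrapolate from two snapshots -- purely illustrative, not necessarily what UpSizeR does:

    def extrapolate_degree_dist(dist_t1, dist_t2, steps_ahead=1):
        """Naive linear extrapolation of a degree distribution from two snapshots
        one time step apart; just one conceivable option."""
        ks = set(dist_t1) | set(dist_t2)
        raw = {k: max(0.0, dist_t2.get(k, 0.0)
                      + steps_ahead * (dist_t2.get(k, 0.0) - dist_t1.get(k, 0.0)))
               for k in ks}
        total = sum(raw.values())
        return {k: v / total for k, v in raw.items()}

    # The tail shifts right as users keep uploading photographs.
    print(extrapolate_degree_dist({1: 0.7, 2: 0.3}, {1: 0.6, 2: 0.3, 3: 0.1}))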
If you don't have two snapshots, I can go in here and mine the time column, for example, and use the times at which you uploaded photographs to extrapolate and say, by next month, how many photographs you would have uploaded. There are many options for dealing with a changing distribution. The next assumption: a tuple's nonkey attributes depend only on its key attributes.
I'm assuming that nonkey attributes only depend on key attributes -- they don't depend on the other tables and don't depend on the other tuples. Let's look at one example where they do depend on the other tuples; how can we deal with that? So, for example, in the case of tag: if you are a bird watcher, the tag values you use might be correlated somehow. For example, you're more likely to use bird and tree and sky, less likely to use car and bicycle or whatever. So in that case there's some correlation in the tag values here. So how do we deal with that?
So the way we deal with that: remember, the UpSizeR algorithm as described is based on the key values. So the tables have already been generated; now you want to fill in the tag values to capture this correlation. So let's say this particular user uses so many tag values, and you're trying to give him tag values that are correlated. What you need to do is generate a vector of that many tag values for this particular user.
Now, what I do is I go into the empirical dataset, and I'll sample from it some tag vector. And this tag vector will presumably have some coherence.
If I need fewer tag values and this one is too long, I just throw away some of them. If it's too short, I add some more.
When I add some more, I have to respect the correlation among the tag values.
Which means that when you give me the empirical dataset, I have to mine the correlation in the tag values. It's very compute intensive.
So if you try to capture all the correlations in the nonkey attributes, it's going to be a major pain.
But if that's what you need, then that's what you need.
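A sketch of that sample-trim-pad step, with made-up helper names and toy data:

    import random

    def synthetic_tag_vector(n_needed, empirical_vectors, all_tags, cooccur):
        """Sample an empirical tag vector and trim or pad it to the needed length.
        cooccur[t] lists tags that co-occur with t in the real data; all names
        here are illustrative placeholders."""
        v = list(random.choice(empirical_vectors))       # borrow a coherent vector
        if len(v) > n_needed:
            v = random.sample(v, n_needed)               # too long: drop some tags
        while len(v) < n_needed:                         # too short: add correlated tags
            seed = random.choice(v) if v else random.choice(all_tags)
            candidates = [t for t in cooccur.get(seed, all_tags) if t not in v]
            v.append(random.choice(candidates or all_tags))
        return v

    empirical = [['bird', 'tree', 'sky'], ['car', 'bicycle']]
    cooccur = {'bird': ['tree', 'sky', 'nest'], 'car': ['bicycle', 'road'],
               'tree': ['bird', 'sky'], 'bicycle': ['car', 'road']}
    print(synthetic_tag_vector(4, empirical,
                               ['bird', 'tree', 'sky', 'nest', 'car', 'bicycle', 'road'],
                               cooccur))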
Okay. Up to this point, I think we have the basic bag of tricks for dealing with classical datasets -- banking data or retailer data, et cetera. But now comes the hard part.
What if the dataset were generated by a social network?
The issue there is that if the data is induced by social interactions, I have to understand what the correlations look like, so that when you give me the dataset I can extract from it the social interactions -- the social network -- then scale up the social network, and then use the bigger social network to induce the correlations in the synthetic dataset.
So I need to know this mapping. I need to reverse engineer the social interactions.
And I don't know how to do that. There's no literature on this.
There's a lot of work on graph-theoretic views of social interactions. There's a lot of work on algorithms for social networks.
But as far as I can tell, there's no work on a database-theoretic view of social interactions. If the relational database were induced by a social network, what kind of correlations would it have? I've not seen papers of that kind. So since there's no literature on this, we just do some half-assed thing.
So this is our hypothesis: friends are similar, and because they're similar, their foreign key identities can be swapped.
What does that mean? Let me give you an example, right?
So in the Flickr dataset I'm going to define friends as two guys commenting on each other's photographs.
We take that as evidence that they're friends, and you can extend that to Facebook -- writing on each other's walls, for example. Similar: how do I tell whether the two are similar?
We just look at the tag vectors.
So if this guy uses this particular tag vector and that guy uses that particular tag vector, we take the cosine, and we use that to judge how likely it is that they're going to be friends.
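A sketch of that similarity test, treating each tag vector as a bag of tags:

    import math
    from collections import Counter

    def tag_cosine(tags_a, tags_b):
        """Cosine similarity between two users' tag vectors (bags of tags)."""
        a, b = Counter(tags_a), Counter(tags_b)
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # The cosine is then used as the probability that the two users are made friends.
    print(tag_cosine(['bird', 'tree', 'sky'], ['bird', 'sky', 'lake']))   # about 0.67
    print(tag_cosine(['bird', 'chicken'], ['birds', 'hen']))              # 0.0 -- no ontology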
So now, how to replicate the social network. This thing is generated by a social network, and I need to replicate that in this particular dataset. So the first thing I do is I go in here and find all those guys who comment on their own photographs. This is a technicality -- almost everybody does that.
And then, because I defined friends as commenting on each other's photographs, I can actually figure out who the friends are in this particular dataset. I can reverse engineer it. And now I've generated the synthetic dataset for that, right?
Remember, I've already generated the tables; now what I need to do is go and tweak the values in those tables so that I capture the social network.
Right? Up to this point I have not done that. First of all, I need to know how many pairs to tweak. So in the original, empirical dataset there are so many edges, so many pairs of friends. For the synthetic one, I scale that up by S.
Why S? Shouldn't it be S squared?
So I checked the literature. This thing could be growing exponentially, but the number of friends you have doesn't grow exponentially, except for some outliers.
So my understanding of that is that the way a social network grows is by adding more sub social networks, rather than one social network blowing up. And if it grows by adding sub social networks, then, yeah, the number of edges does grow linearly. So that's how we're going to do it.
So I'm going to go in and tweak the attribute values so that I have this number of friends. So I'll pick M pairs from this set, and I will decide whether they're going to be friends or not by calculating the cosine. So if the cosine is 0.3, then with probability 0.3 I make them friends.
So for each pair that I choose -- this guy commented on his own photograph, that guy commented on his own photograph.
Now I swap these two. When I swap these two, they comment on each other's photographs. We do that for M pairs, and that's how we generate the social network for this one. Okay. Enough of that.
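A sketch of that swap step, reusing the tag_cosine helper from the sketch above; the data layout is assumed for illustration, not taken from UpSizeR:

    import random

    def swap_to_make_friends(self_comments, tag_vectors, m_pairs, comments):
        """self_comments: {user_id: id of a comment the user made on their own photo}.
        comments: {comment_id: (user_id, photo_id)}.
        Pick pairs of such users and, with probability equal to their tag-vector
        cosine, swap the user ids so that each now comments on the other's photo."""
        users, made, tries = list(self_comments), 0, 0
        while made < m_pairs and len(users) >= 2 and tries < 100 * m_pairs:
            tries += 1
            u, v = random.sample(users, 2)
            if random.random() < tag_cosine(tag_vectors[u], tag_vectors[v]):
                cu, cv = self_comments[u], self_comments[v]
                comments[cu] = (v, comments[cu][1])    # v now comments on u's photograph
                comments[cv] = (u, comments[cv][1])    # u now comments on v's photograph
                users.remove(u)
                users.remove(v)
                made += 1
        return comments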
Now for some data.
>> Y.C. Tay: So we've crawled Flickr to get a base set, F1. The data we store in four tables. So these are real crawled values, and then we take this base set and we upsize it by 1, meaning we make a synthetic copy. These two values should be the same; it's just a floating point thing. So this is not a problem.
So this is measuring the table sizes -- the number of tuples in these four tables.
These tables are generated by the degree distribution, and you see that the disagreement is already significant. Those two tables are created by the co-clustering, and apparently it works decently. And then we crawl some more; at one point we have one and a half million comments.
We crawl some more, so F2.81 is a dataset that has 2.81 times the number of users in F1, and we did a similar thing, and I'll let you judge how well this works. This is the one with the most severe disagreement, and it's not clear why that is so. It could be that comments have some recursive property -- when I comment on your stuff, you comment on my stuff, and it blows up -- something like that.
So this is comparing the table sizes. But I said we should judge the quality of UpSizeR by looking at the queries, right?
So we tested it on four queries with different numbers of joins. And here you see what we got, and I think it's pretty decent up to this point. You can look at this as estimating join sizes, and one comment that I got from some people was that this kind of estimation should be built into query optimizers.
So anyway, these are the queries.
>>: F9.11 -- is that Flickr or is it synthetic?
>> Y.C. Tay: This is real.
>>: These are real. And these are your upsized ones?
>> Y.C. Tay: Yes, this is taking the original dataset and upsizing it by 9.11. So this is checking whether we were able to capture the correlation among the key values. Now, what about nonkey values?
So we tested retrieval on the nonkey values -- like the tag value bird here, for example.
And the agreement is not so great now. It's okay here, but for the smaller sizes it's not so great.
And it's not clear to me why it's better for the bigger sets and not so good on the smaller sets.
We didn't try very hard. In the case of this particular query here, we are actually testing whether we captured the correlation among the nonkey values, the tag values.
And we didn't try very hard. As I mentioned, just capturing the pairwise correlation between tag values is already very compute intensive.
If you try to capture the correlation among more than two tag values, it's horrendous. So we didn't try so hard on that one. I imagine that we can improve on these values. So up to this point, the social network is not in the queries. Now we want to test how this performs if the query refers to the social network -- how would the comparisons look?
So remember that UpSizeR deals with the social network with the swap technique. So now we have to compare the real query results against UpSizeR with and without the swap.
We're going to see how it goes. So the first query is: retrieve pairs of friends. And you see that without the swap, you completely cannot capture it. Right?
You have millions of users and millions of photographs, millions of comments. The probability that you and I will comment on each other's photographs is basically zero. So you really have to go in and tweak the values if you want to capture the social network.
Then, the way we replicate the social network is to look at whether two users are similar, and the way we check whether two users are similar is to look at tag vectors. We want to see whether this is actually a good reflection of whether they comment on each other's photographs.
So you retrieve X and Y who are friends and have at least N tags that they both use.
And the agreement is still decent until here. So maybe we cannot scale by a factor of ten, but something less than that is okay.
>>: What breaks at 9.11? Looks like a photo --
>> Y.C. Tay: I don't know. We should look into that. It could just be that you have something -- so remember, the way we built up this dataset is we crawled and crawled and crawled.
It could be that there's something wrong with this particular dataset. Maybe the way we did the crawling we might have messed up or whatever. It's not clear.
So social interactions have structure. And this is the major issue. One structure would be a triangle -- a friend of a friend is likely to be a friend. We didn't try that -- we're not so ambitious.
We just wanted to see whether we can capture the V shapes:
X and Y are friends and Y and Z are friends, so can we capture that? And the answer is clearly no.
So what happened? It was doing well up to this point.
It turns out that, yeah, we can scale up the number of friends accurately, but we cannot scale up the topology. It turns out that in Flickr there are some hot shots with many friends, like this, in a star shape. And when we scale it up to F prime, we lost that star shape. The star shape just dissipated. So we have to figure out some way of doing it.
Now, it may be that it has to do with the way we replicate the friendship graph -- by looking at similarity, where similarity is defined in terms of tag vectors.
So currently, if this tag vector is bird, chicken, and that tag vector is birds, hen, then the cosine is 0. So you can imagine actually doing better than that, right? You put in an ontology or whatever.
But my students tell me you need a lot more than just that -- you will fix some of the structural issues but you can't fix all of it. So I described two problems.
First is the dataset scaling problem, and as I say, UpSizeR now has what I believe is the basic set of techniques for scaling up any classical dataset. But when it comes to the nonkey attributes, there are all kinds of correlations in real data, and I imagine that the kinds of correlations one application is interested in differ from the kinds of correlations another application is interested in.
So if UpSizeR is to be a tool that will work for any application, we're going to need a lot of help. We can't do this ourselves. So we are releasing UpSizeR as open source and hope that if it is useful for people, then people will contribute by making the relevant changes and additions to it. The other problem that I introduced is the social network attribute value correlation problem. Quite a mouthful.
I can't figure out a shorter way of describing this problem.
And we'll be the first to admit that the swap technique is a dog.
It's not going to generalize. So what would be a general technique for replicating a social network in a relational dataset? Here we'll need some kind of theory to help us, and right now, because there's no database-theoretic work on social networks, we don't know what to do.
If there were some theory built around this -- the relationship between the social network and the relational dataset -- it would give us something to work with. And I think it's a wide-open problem.
It's completely blank now. Anyone who gets there first will collect the citations. And this is what I encourage students in the universities to do. You guys are too old for that.
But....
Okay. I'll be happy to take more questions.
>>: Could you run UpSizeR with a scale factor less than one -- how well do you predict the past? It's easier than predicting the future.
>> Y.C. Tay: Predict the past. We haven't done that. We should do that. This is
something I intend to do in the next set of experiments, yeah.
>>: Can you give us a sense of performance, how long it takes?
>> Y.C. Tay: I keep forgetting to get the numbers. So if you want to capture the non-key attribute correlation, it could take days.
But this kind of number really depends on what you're using and how many of them you throw at it. So I don't know how meaningful it is.
If you don't care about the nonkey attributes -- you're only upsizing, you only worry about the correlation among the keys --
I think for F1 it was on the order of an hour to make a synthetic copy. And F1 is one and a half million comments and half a million photographs.
>>: Is it naturally parallelizable?
>> Y.C. Tay: Some parts of it can be parallelized. But I think there are some other parts that don't parallelize so well.
This sequence of generation cannot be parallelized. In general, here, these two guys both refer to these two guys.
So these two tables have to be generated together. But, I mean, you can change the code so you can generate these in parallel. And furthermore, if you go deeper into the hierarchy, there could be some other relations that use a completely different set of keys, and since those refer to a completely different set of keys, they can be generated independently of these two.
So some part of it can be parallelized. But this particular sequence will have to be sequential: that one has to come first, and then that one, and then these two.
>>: Seems like you could have a preprocessing phase that figures out what the parallelism opportunities are by just analyzing the foreign key relationships, really, to figure out where you can partition it.
>> Y.C. Tay: Yeah, certainly. You might not even need to do that.
You just put the tables in a queue and then take them off the queue once whatever they depend on has been done.
>>: I see. Just generate the --
>> Y.C. Tay: Just queue up these tables, and if a processor is idle, you can just pull out the next one to generate.
>>: You could horizontally partition here, can't you?
>> Y.C. Tay: Of these tables?
>>: Yeah. I mean, you're following paths of foreign keys, but within a table it seems like you can --
>> Y.C. Tay: If you assume that the tuples don't have correlations, then you can generate the tuples in parallel, for example -- very fine-grained parallelism.
>>: Anyone else?
>> Y.C. Tay: Great, we are on time.
[applause]