>> Cheng Huang: Hello everyone and welcome to the afternoon talks. Today we will have two
talks by Professor Alex Dimakis and his student Dimitris, so we'll have a two-hour slot in total. So
the first talk will take about 45 minutes to an hour and then we will take a short break, and
then we will have Dimitris talking about a second topic. So the first one is erasure codes for big
data over Hadoop. The second topic is large scale sparse PCA for Twitter analysis. So that one
we'll start about an hour from now. So let me first introduce Alex. Alex is currently an assistant
professor at USC, so he has longtime ties to Microsoft Research, so Alex won a Microsoft
Fellowship when he was a graduate student and then he did an internship here with our group
and then he graduated from Berkeley around 2008, and he did a postdoc at Caltech for one
year and he is now an assistant professor at USC. So Alex has done some work on coding which
attracted a lot of attention in the society, so the most well-known one is called regenerating
codes. So for regenerating codes he has won a dissertation award from Berkeley and also the
2010 IEEE Data Storage Committee Best Paper Award, and he has won the NSF CAREER Award and is a chaired assistant professor right now, so I will keep it short on that side and let Alex
tell us the story about erasure codes for Hadoop.
>> Alex Dimakis: Thank you, Cheng. Thanks for the invitation. This is joint work with my
students at USC, Mahesh, Megas and Dimitris; so Dimitris will be giving a talk afterwards. A
part of it is joint work with Scott, Ramkumar and Dhruba from Facebook. The overview is
basically the following. The first message and the most important message that is making this
relevant is the fact that data seems to be growing faster than infrastructure now and the cost of
storing it is increasing very fast. In particular 3X replication which is sort of the standard thing
that people do is becoming too expensive. So the main message that I think most large-scale
storage systems currently use already is for cold data, and we will define cold later, but when the data is not used or read that much, instead of using 3x replication we use some form of erasure coding and we save a lot of storage. A lot of the data is actually cold data, depending on the application again, but there seem to be very big gains in storage
that you can have. The interesting message that I want to say is that most storage systems
today are basically using Reed-Solomon codes, and the main message I want to say is that classical codes like this are unsuitable for these distributed problems, or in other words I am saying that there are other codes
that are much better than Reed-Solomon. What my lab is working on for a long time is basically
creating new codes that are optimized for distributed problems, and information-theoretic bounds [inaudible] on capacity as to what are the best codes one can hope for, so we have a sense of
knowing how far we are from optimal. I'll just give a very quick overview of how codes are used
to store. Usually we have a file or data object and we cut it up into blocks or chunks depending
on which company you work at. I will just call them blocks here, so you cut it in blocks. In the
coding theory language you start with k blocks, and then you produce codes like this very simple code that says just take two blocks, XOR them together and produce a simple XOR block, and this is a (3, 2) MDS code. Note that compared to the standard systems/CS language, where people would call this a 2+1, meaning two information blocks and one parity, I use the standard coding theory language, which means (3, 2), so the number of failures I can tolerate is n minus k, so one
failure here. So this of course is a single parity code used all over the place and can tolerate
one failure. This code is a 4, 2 MDS code so that means that it has the best distance possible
which means it has the best fault tolerance possible which means it can tolerate any two
erasures, and you can see of course, for example, if you lose these two blocks you can recover
from the other two linear equations. So the general idea, just to set up my notation: an (n, k) erasure code is a black box which takes k packets and produces n packets. These n
packets typically have the same size as the original k, but we will be relaxing that later, so we
will see about that, but that is important. And the key property that it provides you is that any
k out of these n suffice to reconstruct the original k. If these coded packets have the same size
as the original, then this is optimal, so this is called MDS if they have the same size. If we start
playing with the sizes, then it is a different story, and there are a lot of bounds on what is going
on. That is called the optimal reliability that you can get for that given level of redundancy and
that is well known and there are many off-the-shelf codes that people can use. The new aspect
as I said is that now we have packets that are distributed in a network, so classical codes have a lot of suboptimalities. So let's talk about Hadoop. I gave a talk about the state-of-the-art for distributed storage codes two years ago in this group. I'll tell you what has been new since
those last two years and I will tell you what is now the state-of-the-art. I know most people are
interested in the practical aspects, so there are two parts of the talk. There is a practical part
that starts at this slide and then there is a theoretical part, and depending on how many people
are awake I will keep on going depending on interest and time. So the practical aspect, let's
establish what current Hadoop is doing. Let's say you have a file in Hadoop. I am looking at the
HDFS, the Hadoop file system, and let's say you have a 640 MB file; you cut it up into blocks. Here I chose 64MB to be the block size; sometimes it can be larger, up to 256MB. Now you have these 10 blocks that represent your file. The standard thing is 3x replication, which means that every block is replicated twice. Now you have
these blocks; you distribute them over servers in your cluster and you are happy. The problem
with that is there is a very large storage overhead and as the data grows faster than the
infrastructure grows as I said this cost is dominating, so it is a very big problem. So Facebook
recently created a component that is open source that is called HDFS RAID that is a component
that works on top of HDFS which is used in warehouses in production, and what it does is after
it detects that the file, let's say our file here has been cold, so it hasn't been accessed let's say
for some period of time that you can set, then it deletes the replicas and creates these four
parity blocks instead. These are created by a 14, 10 Reed-Solomon code, so these are four Reed-Solomon parities, and now the replication of the file is lowered: there is a default replication that is 3X in Hadoop and that is lowered to 1X, and there is a separate file that is called a parity file that contains only the four parities corresponding to this file. HDFS RAID is a component that keeps track, for every file that is RAIDed, that's the word, of another file that is the parity file for that file, which consists of four parity blocks for every 10 blocks of the original. So that's the way it works in HDFS RAID. So when we compare HDFS with straight 3x replication to
HDFS RAID, you see of course this system can tolerate two missing blocks only and it has a
storage cost of 3X, whereas this system can tolerate four missing blocks and has a storage cost
of 1.4 X, so that is more reliable of course and it is using half the storage. The problem of
course is that in this system you cannot read in parallel; that is one of the problems. And that's
why they use it for data that is cold, archival cold data. And so HDFS RAID is the open-source system. Another paper that included a system from the CMU group was called DiskReduce, which introduced this idea. This deployment of Reed-Solomon at Facebook has, up to the last time I got information, saved around 5 petabytes. This is for a data warehouse cluster that was storing around 30 petabytes, so of course they did not Reed-Solomon everything; they were slowly increasing which files were going to be RAIDed and this
is the storage savings which is quite significant, 5 out of 30 taken out of storage is quite a
significant storage savings. However, Reed-Solomon has several limitations that limit its use. So currently in the data warehouses, and my case studies are all focusing
on data warehousing. When we talk about storing photos, storing windows, storing search
indices, they have different usage scenarios and so they may have different trade-offs, but for
data analytics, the main issue seems to be that they got to 8% or maybe 10% of the data being Reed-Solomon coded, and even from that they got big benefits as I mentioned before. Our goal would be to create new codes that would allow you to go to coding 40 or 50% of the data, hopefully, and the question is what is stopping you here. The first issue is maybe you
don't have enough cold data and that's not true. To the best of my knowledge most data
warehousing scenarios would have enough cold data to go up to 40%. Again it depends on the
scenario, but that is not the main bottleneck. If we could go up to 40% coded warehouse you
could save petabytes of course, which is a very important gain. The bottleneck, I want to argue, is what's called the repair problem, or different versions of the repair problem. So what is the
repair problem? Here I am showing a 10, 14 Reed-Solomon code and just these 14 blocks were
stored in different machines in the cluster. When you lose a machine, you have to repair it.
You don't lose a block; you lose a machine, so when you lose a machine you lose thousands of blocks. When the machine stops sending a heartbeat to the name node, as it's called in HDFS, then the name node says okay, maybe this guy will come up.
Maybe it waits 10 minutes, I don't know, depending on the configuration, and then after a
while it triggers a repair job, a MapReduce job that is repairing every single block that was in
that node that is dead. Of course it may still come back, but that is the way that it is working.
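As a rough conceptual sketch of that trigger logic (the names and the block map below are made up for illustration; this is not the actual HDFS code path):

```python
import time

HEARTBEAT_TIMEOUT = 600.0                          # hypothetical threshold, e.g. 10 minutes
block_map = {"node7": ["blk_1", "blk_2", "blk_3"]}  # toy map: node -> blocks stored on it

def repair_jobs_for_dead_node(node, last_heartbeat, now):
    """If a node has been silent past the timeout, emit one repair task per block
    that lived on it; in HDFS RAID each such task becomes a MapReduce job."""
    if now - last_heartbeat < HEARTBEAT_TIMEOUT:
        return []                                   # maybe this guy will still come back
    return [("repair", blk) for blk in block_map.get(node, [])]

print(repair_jobs_for_dead_node("node7", last_heartbeat=0.0, now=time.time()))
```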
So a MapReduce job starts for every block that is lost that needs to re-create that block. In
order to create that block if you have Reed-Solomon code and you use a naïve way of repairing
one block failure, you need to access 10 blocks to do that. At some other machine somewhere
you need to talk to 10 out of those 13, right, because you lost one, so you need to read 13
blocks, transfer 13 blocks in your data center, move them to this guy and this guy basically runs
a Reed-Solomon decoding procedure here that's basically a mapper that is running here, opens
streams to 10 others, runs the decoding, produces all of the 10 blocks, stores block three, so this is three prime, and three prime will now be called three, and then throws away the rest of the blocks. That's the way that Reed-Solomon is decoded right now.
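To make the cost of this naive repair concrete, here is a toy sketch; it is not Reed-Solomon, just a floating-point Vandermonde code in numpy, used only to illustrate the any-k-of-n property and that rebuilding a single block means reading and decoding from k surviving blocks:

```python
import numpy as np

k, n = 4, 6                        # toy (n, k) code; HDFS RAID uses (14, 10)
block_len = 8
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(k, block_len)).astype(float)   # k original data blocks

# Vandermonde encoding matrix: any k of its n rows are invertible, so any k coded
# blocks are enough to reconstruct the original k (the MDS "any k of n" property).
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
coded = G @ data                   # n coded blocks, one per row

lost = 2                           # the machine holding coded block 2 dies
survivors = [i for i in range(n) if i != lost][:k]   # naive repair: grab any k survivors

recovered = np.linalg.solve(G[survivors], coded[survivors])   # decode all k data blocks
rebuilt = G[lost] @ recovered                                  # re-encode only the lost one
assert np.allclose(rebuilt, coded[lost])
print(f"read {len(survivors)} blocks to rebuild 1 block")      # k-fold repair traffic
```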
There are some smarter ways of decoding Reed-Solomon that have appeared in information
theory recently, but they have several issues in being implemented. They don't have
very big benefits. We don't know the minimum possible repair for Reed-Solomon in particular.
We know over all codes what is the capacity, so I will talk about that. So the first question you
may ask is who cares? Is that a big deal? Are there a lot of repairs? So here we ask is repair
frequent. So this is a trace of a 3000 node Facebook data warehouse production cluster and
what you see here is a number of failures every day. You see these are the days here. So you
see that it seems like often there are 20, 40, something very funny happened here and it goes
all the way up to 100 out of the 3000 nodes are down. So that is quite a lot of failures
happening even on a typical day. Yeah?
>>: So how do I read this graph? So among the December 26 how many failures were total
here?
>> Alex Dimakis: This is the situation. It's not per day. It has higher resolution. I think it may
be per hour or somehow aggregated and this is the number of nodes that I think were not
reporting…
>>: So a failed node is counted a number of times before it's completely repaired?
>> Alex Dimakis: I don't know exactly the tools that they used to produce this how it is exactly
counted, but the key thing that we care about really is how many repair jobs are triggered.
That's the thing that we care about. There is a threshold that says okay, if the server is not
giving a heartbeat for a minute then they don't start the repair job. And I don't know exactly
how this tool is reporting a death; maybe it's reported dead after a minute and I don't know
exactly how. But the main point of this chart is that there are a ton of failures and there are a
ton of repair jobs. So here, let's say a typical node would be 10 terabytes, 15
terabytes thereabouts. So when you lose that guy you have lost 15 terabytes worth of blocks,
so when you have replication you have to go and copy all of those blocks so you move in the
network 15 terabytes for every node you lose, so let's say on a typical day you have 20 such
jobs happening. You have to move 300 terabytes in your network. The problem is now if this
node was Reed-Solomon encoded, you would have to move 10 times more data and you have
to read 10 times more data. So we estimate, by the way, that the total traffic in this cluster would be around 2 petabytes per day, and we estimate that 20 to 30% of the network traffic is repair traffic on any given day with the common configuration of 8% Reed-Solomon coded data. If you want to encode 50% of your cluster, then the repair traffic would
completely saturate your data center network, so that's our estimate right now. It would be
completely impossible; all your system would be doing is repairing itself at this rate of failure with this data. So that basically says it's impossible to go up to 50%, because you would have a ton of repair traffic in your network. So to clarify one point, the point of that is that this is an important problem and a probable bottleneck. In order to talk about
repair I need to distinguish between two concepts. You remember here I lost block number
three; we are calling it three and now I want to repair that block, so there are two notions that
you can think of here. Exact repair means that the block that I recover is exactly the block that I
lost. There is a looser concept called functional repair, which means that the block I repaired is
not the block I lost but it is a new block that gives me still any k out of n guarantee, but even
though this was a systematic block which means data, now this can become a new parent, so
this problem, the functional repair problem is a special case of exact repair and it's a much
easier special case of exact repair information theoretically. I just need to have yes?
>>: [inaudible] hit on the latency if the cold data is read?
>> Alex Dimakis: Of course, exactly. If you are in a situation that is purely archival and you
almost never read, you may be okay with functional repair, so functional repair is a much easier
problem than exact, so we have explicit practical very simple codes that achieve the capacity
for this problem. As I will say, even after so much research we don't really have practical codes that achieve the capacity for the exact repair problem in some cases, so if you are willing to
tolerate the problem that you will be losing systematic blocks and you will be replacing them
with linear equation parities, so you have latency, read latency problems, if you are willing to
take that then there are codes that you can use that are called functional regenerating codes,
very, very simple things. If you are not willing to tolerate that which is mostly the case, then
we have to talk about exact repair. There are papers that talk about peer-to-peer backup using functional repair, but I don't know; it depends on the application. I just wanted to
make that distinction. Now there are different metrics that we care about. Observe here again
that there are different things that happen. The newcomer, let me go back to where I was. So
this guy is called the newcomer. The newcomer is the node that hosts the repaired block. The first metric we care about is how many bits traveled in the network in total; that is called the repair bandwidth. The second metric we care about is how many bits were read
from the discs. This can be different, so there are some schemes that read from the disk, do
local processing and then send data, so these schemes are not very convenient if you're using
MapReduce, so the implementation is more complicated, but disk I/O and network can be
different. And the third metric that we care about is how many nodes this guy talks to. This guy right now talks to 10 nodes. Even if you read only a little from each node, that is a different story; for this metric we only care about how many nodes you contact. So there are these three metrics. The first one I
said is the number of bits communicated called repair bandwidth. The second is the number of
bits read from the disk during a single node repair called disk I/O, and the third metric is the
number of nodes accessed to repair a single node failure which is called locality of the code. So
let me tell you a little bit; this is April 2012, right? I give different talks and I want to make sure that what is open and what is closed is up-to-date, because there is a lot happening; yesterday for
example, there was a paper on this, so there are a lot of things happening. For this problem,
for the repair bandwidth, the capacity which is the minimum possible information theoretically
over all codes is known. It is known not for the whole region, so as I will briefly discuss later in
the theory section of my talk, there is a whole tradeoff between storage and repair
communication. The capacity is known to be achievable for the two extreme points only, but
we do know what is the best possible for this case. We do not know the intermediate points
and more important for practical use, we don't really have high-rate practical codes known for the most useful point, which is the minimum storage point, so that problem is an important
open problem and I think there are no practical deployable codes for this. For the second
problem, there are codes that are near optimal here, but we don't have an exact capacity bound, so it's an active area. For the disk I/O problem, the capacity, the minimum
information to be read is open, is unknown. There is an obvious bound that you can give that
says you will never be sending things you never read. So an obvious bound is that the repair
bandwidth is always a bound on the disk I/O, but can you actually match the disk I/O to be
equal to this capacity? We don't know if there exist codes that can read exactly as much as
they transfer and repair all failures. There is again some work on this but the general, both the
capacity and the [inaudible] codes are open for this problem. Is it clear? Does anybody have
any questions? The third problem is the problem is a simpler metric which doesn't care how
much you read or how much you transfer; it just cares if you talk to a node or if you don't, so I
just want to keep the repairs as called localized, so I will like single block failure to correspond
to a node talks only to three other nodes. So this is called locality. What is very recently the
capacity for scaler codes was discovered by some people in this room [inaudible] so I will talk
about this, and there are very few explicit code instructions for specific cases and we have one
and I will talk about that. So this is sort of the landscape of the metrics that people care about;
depending on the problem people care about different metrics differently. I'm going to talk
about the practical implementation we did now and then we have a whole theory discussion if
you want to learn the bounds and the codes for each one of those things. Are there any
questions for this? Maybe? So let me just move to a little bit about what is practically deployed, and we will get to the theory later. So this is a real system that was deployed at the Chinese University of Hong Kong. It was using a regenerating code and it was built by [inaudible], and it was running on eight nodes, and then there was a bigger version that was published in FAST this year that was running on a bigger cluster. This is how failures are introduced into the system, so this is me
cutting off one of the discs. Okay, it's not true. This was one implementation that used a
specific code that I can't tell you about. The main problem with that code is that the disk I/O
was very high, so for the Hadoop scenario it wasn't exactly a practically useful code. Let me tell you a little bit more about the Hadoop MapReduce infrastructure and
the implementation of one specific new code that we did in Hadoop. Hadoop and MapReduce
the way I understand is basically that it is the open source version of the Google file system and
the Google MapReduce framework, but it is open source and free. It consists of two
components, the file system which is called HDFS and the data analytics processing system that
is called MapReduce. The software is called Apache Hadoop MapReduce. So hundreds of
companies are using Hadoop and tens of startups are developing tools for Hadoop so it is a
very, very exciting area. There are a lot of buzzwords happening. The BigData buzzword now I
think is basically founded on these tools. In particular, the tool that we care about is HDFS RAID
which I talked about a little bit. It is a Reed-Solomon add-on module that sits and works with
HDFS. It exists as open source. You can download it. It is not very user-friendly, so you have to fix configuration files and it won't quite compile, so you need to hack it up a little bit. If anybody wants more information on downloading HDFS RAID, send me an e-mail; we have spent a month trying to make it work, but it's functional and it's used in production; slightly different versions are used in production by Facebook, Alibaba and some other companies. So
the currently available HDFS RAID basically has two modes. It says okay, either 3x replication or Reed-Solomon with any k and n you want, usually it would be a 5, 7 or a 10, 14, or one of those codes. Or a single parity code which basically deletes the third replica and replaces the third replica with one XOR, so two replicas, one XOR; that's one other mode that you can run. What we did is we implemented what is called a locally repairable code,
which is a new code into the system, into HDFS RAID and our HDFS RAID with our new code is
publicly available, so if anybody wants to play around with it, we would be very happy to share it with you; it is on [inaudible]. So let me tell you how our code works, this locally repairable code.
So the Reed-Solomon, let's focus on a 10, 14 Reed Solomon, for example; it takes 10 blocks and
produces four parity blocks using a Reed-Solomon equation here. By the way, the encoding is
done once the name node detects that a file is going to be RAIDed, there is another program
running called the RAID node that says I'm going to take that file, delete the replicas, run a
MapReduce job that erases blocks and produces these Reed-Solomon blocks. Here is the first
idea. Let's just add a simple XOR. Let's just take these five blocks, XOR them together and
produce a new block the same size as these blocks and have this extra block and install it
somewhere X1. Let's do the same thing here and let's do the same thing here. Now the first
observation now is that let's say I lose this block number two here. If I lose this block then I can
repair this block by just reading the other blocks in the XOR and the XOR, very simple thing,
simplest thing in the world. So this one is a local XOR, for example; it is a very simple idea. Local XORs allow single block recoveries by transferring only five blocks, which in
the example is 320MB instead of 640MB. The problem of course with this idea is that now you
are storing more, so now you are storing 17 blocks compared to storing 14, so your storage
overhead goes up. Is there anything cool that you can do now that you are coding? So you can do the following with this idea. First of all, before XORing
these together, you could multiply, think of these as elements of a finite field, you can multiply
by any elements in the finite field that you want, C1, C2, C3, C4, C5 as long as they are not zero
and then still if you lose one of the blocks you can recover by RAIDing the other five, no
problem as long as they are not zero. So I have the freedom to choose any coefficients I want. I
say here any linear combinations as long as I can invert these coefficients and repair locally. So
here is now a strange thing. I can choose any coefficients here and any coefficients here and
any coefficients here, so what are the best coefficients I can choose? One idea was to just XOR
them together which corresponds to using ones, but is that optimal? Is that providing the best
fault tolerance in the system? For example, how many failures can I tolerate here? I can
definitely tolerate four; you can see that, right? Just by using these I can tolerate four. I don't
even need it. But can I tolerate five? It is not clear to see from this construction whether you can tolerate five erasures or not. The thing is, depending on how you choose these coefficients, the distance of the code, the fault tolerance of the code, can change. So how you choose
these coefficients in an optimal way, we don't have a general polynomial time method of doing
that. You can do it; we have a way that does it that says if we choose these coefficients
randomly in a large finite field we can bound the probability that this code is not matching the
best distance possible, or we could check exponential many subsets, and for a 10, 14 maybe we
can check, but in particular, what we did is if we choose this, you see, these coefficients and
these coefficients here, these are ones now but they don't have to be ones. These coefficients
and these coefficients depend on how you choose these coefficients. So if you set these
coefficients to be Reed-Solomon of a specific type, extended, with a specific finite field and a specific primitive polynomial, then we proved in our paper that if you choose these to be all
ones, then it is optimal. But in general we don't have a general construction for choosing them.
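A minimal sketch of just the local-repair idea described above, with plain XOR local parities over byte blocks (the coefficient optimization and the actual HDFS RAID layout are not reproduced here):

```python
import numpy as np

block_len = 16
rng = np.random.default_rng(1)
group = [rng.integers(0, 256, block_len, dtype=np.uint8) for _ in range(5)]  # 5 data blocks

# Local parity for the group: XOR of the 5 blocks (i.e. all coefficients equal to 1).
local_parity = np.bitwise_xor.reduce(group)

# A single failure inside the group is repaired from the 4 surviving blocks plus the
# local parity: 5 reads instead of the 10 a (14, 10) Reed-Solomon repair would need.
lost = 2
survivors = [b for i, b in enumerate(group) if i != lost]
rebuilt = np.bitwise_xor.reduce(survivors + [local_parity])
assert np.array_equal(rebuilt, group[lost])
```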
So here is another cool thing you can do here. You can choose these coefficients, these and
these and these so that this funny business happens. If I take this local parity and this local parity and I XOR these local parities together, I get this local parity, so it turns out you can do that, and if you do that, do you see now what I can do? We call it an implied parity. I don't even have to store this anymore, because I can XOR these two together and get it, so how do I repair? I don't store X3, so when I have a failure, when this node fails, I am going to read its friends one, two, three; I am going to read these XORs, XOR them together, produce the implied parity, XOR the implied parity with the other guys and get back P2. So I can still repair, and now you can verify that the locality of every
single block is five; before the locality of those was four. But now the locality of everything is
five. So this is the code we implemented in HDFS. So single block failures, all single block
failures can be repaired by accessing 5 blocks versus 10 in the naïve Reed-Solomon. We store
16 blocks total, so we have a little bit of storage overhead, 1.6 versus 1.4 and we implemented
this system and we tested it, and I want to tell you what the tests look like on two different
systems. Here I say in general, choosing the coefficients is nontrivial, must check exponentially
many subsets for linear dependencies. It's an open problem to do it in polynomial time,
explicitly. And one lemma that we have is that for the 10, 14 which is apparently deployed at Facebook, using all ones works, with this specific choice.
So our Java implementation is fairly simple. All you need is an encode function and a decode function; this is the whole encode function of the Reed-Solomon. It is a
very simple thing. You just need the arithmetic in a finite field and then you need basically to
hack a few other things so that Hadoop, when it has a failure, doesn't talk to everybody,
because the component that decides who talks to whom during a failure is not in the decode
function, so we had to hack that, and the decode function was very easy to change.
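The finite-field arithmetic he refers to really is small. Here is a hedged sketch of a GF(2^8) multiply and a systematic encode against an arbitrary coefficient matrix; the coefficients in the usage line are placeholders for illustration, not the generator actually used by HDFS RAID or by the code in the talk:

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8), reduced by a primitive polynomial (here x^8+x^4+x^3+x^2+1)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return p

def encode_parities(data_blocks, coeffs):
    """Systematic encode: parity j is a GF(2^8) linear combination of the data blocks,
    applied byte by byte; coeffs has one row of k coefficients per parity block."""
    parities = []
    for row in coeffs:
        parity = bytearray(len(data_blocks[0]))
        for c, block in zip(row, data_blocks):
            for i, byte in enumerate(block):
                parity[i] ^= gf256_mul(c, byte)
        parities.append(bytes(parity))
    return parities

# Toy usage: 4 data blocks and 2 parities with made-up coefficients.
data = [bytes([i] * 8) for i in (1, 2, 3, 4)]
print(encode_parities(data, coeffs=[[1, 1, 1, 1], [1, 2, 4, 8]]))
```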
So this is our name node, as it is called, running; this was January of this year. The version of Hadoop running is ours; USC three-XOR is the name of the code that I told you about, a not very sexy name. We need to find a better name. And we had [inaudible]; this was 50 nodes
initially but we were killing nodes so at this point there were 37 nodes. This is on Amazon and
we had uploaded I guess 100 GB and we were keeping track of how long it takes to repair and
how much network [inaudible]. So the total experiment involved 100 machines on Amazon EC2, in one of the experiments. 50 machines were running HDFS RAID, the Facebook Reed-Solomon version, and 50 were running our crazy code that I just described, the three-XOR one. This is
what is now called locally repairable code, LRC, okay? The name, we will have to find the name
that works. So we uploaded say 50 files, each 640 MB on Amazon and we were killing nodes
and measuring network traffic, disk I/O, CPU, how long it took to repair and all of those things.
This is a measurement, so here for example we kill one node and you see the Facebook cluster is creating this much network traffic and our cluster is creating around half; this is the red line. The repair duration was roughly the same in that case. Then we killed one node again. Then
we killed one node again; don't ask me why there is variability because I don't know. There
were just a lot of things going on. And many things like Hadoop is doing all sorts of strange
things every now and then that we didn't really understand, but they were orthogonal to our
experiments. And then we went to lunch; this is lunch break. Then we killed the two nodes,
three nodes, three nodes, three nodes and two nodes again, so what is the message of this?
The interesting thing is when we killed a lot of nodes it also finishes faster, you see? So the red
guy is done repairing where the blue one, the Reed-Solomon is not done yet. Yes?
>>: [inaudible] Facebook [inaudible] code?
>> Alex Dimakis: I'm using exactly the HDFS and the HDFS RAID used in the Facebook cluster, but this is not an experiment in their cluster.
>>: Are they using open source?
>> Alex Dimakis: Yes.
>>: They contribute that?
>> Alex Dimakis: Yes. So everything is open source, I mean the Hadoop part, so that we have
another experiment in the Facebook cluster, in a small Facebook cluster, in the test cluster, but
it's very small. We were just testing, comparing the numbers we got from Amazon to the
numbers we got on this, but it's not that big of a scale, so the files were bigger but we didn't
have 50 nodes.
>>: So for the single [inaudible] case [inaudible]?
>> Alex Dimakis: I don't know. I think that is because the time was dominated by other things.
It wasn't dominated by this guy over here.
>>: [inaudible].
>> Alex Dimakis: Honestly, I don't know exactly what is going on, but when we saw this, when
we were loading more and more data and when we had more and more failures, the time it
took was faster for our system [inaudible] was slower of course, but the time it took was not
something that we understood at this point.
>>: [inaudible] something is wrong with it…
>> Alex Dimakis: Yes, there are a lot of things that were strange here. This is just a plot of CPU
utilization, so Amazon allows us to measure that. Our code, so for example our code for a
simple one failure is not decoding Reed-Solomon, just doing XORing so we expected it to be
much, much smaller for our case compared to the Reed-Solomon, but we don't really see that
and probably what's happening is the CPU is not dominated by the decoding anyway. The CPU
is, all that this says is that it is roughly the same, so the CPU is roughly the same. In theory we
expected it to be much faster. The decoding is much faster but there are a lot of other things
that are going on that probably dominate the CPU anyway, at least at this scale. This is the
most important thing. It says that for the Facebook cluster, which again, I say it is an Amazon
cluster running Facebook software, the bytes read, the HDFS bytes read during recovery from
data loss is more than double. So disk I/O is the blue thing here. This is single node failures.
These are, so these are three node failures, oh there it says. Okay great. So this is one node
failing, 16 blocks lost. One node failing, 17 blocks lost. One node failing, 14. Here we lost three
nodes, three nodes, two nodes, two nodes. Consistently it's 2.5, 2.6X less, and that is very reasonable; in theory we can compute exactly what it is, because we read five blocks and their construction actually reads 15. It should read 10 but the actual implementation does not; it opens streams to all 15. That was the way it was done in open source. Now they've fixed that,
so now they open only to 10, but that version was opening 15.
>>: With more than one failure then you would read more than five so [inaudible]?
>> Alex Dimakis: Okay, very good. With more than one failure there are different things that
can happen, okay, good point, very good point. Let's look at this. So this is the code that we
have. Let's say that two nodes fail. If two nodes fail, first of all let's call this whole thing a
stripe. Two nodes failing could influence only one block per stripe in which case no problem. If
two nodes failing influence two blocks in one stripe, there are two things that can happen again. Either, for example, block two and block seven are lost, in which case the local repairs work, but if block two and block three fail then local repairs do not work. So the way we
implemented this is what's called a light decoder. So there is a light decoder that tries to repair
just from the locals. If it can't do it it throws an exception and says I give up, and then the
standard Facebook Reed-Solomon decoder takes over, and we made sure that the Reed-Solomon decoder was still their own Reed-Solomon decoder, because we didn't want to make a less efficient or more efficient Reed-Solomon, so we compare exactly the same thing. Okay, so basically we introduced a light decoder which sits in front of the standard procedure. But most
two failures would still be either influencing only zero or one block from each stripe or if it was
two it was quite common that you got benefits. So there, it depends on how much a stripe is
influenced. So that's why it's not obvious to measure the benefits theoretically; but these were the observations. The new storage code reduces the bytes read roughly by 2.6X, both in theory
and measurement. The network bandwidth consumed is reduced by approximately a factor of
two. The disadvantages: we use 14% more storage, and the CPU we thought would be better, but we didn't measure that, so it's similar. So the other interesting thing is that in
several cases we were 30 to 40% faster in repairs, especially in the larger scale repairs or the
larger scale systems that we tested, and that is important because that increases your actual
availability, so it increases the [inaudible] and the MTTDL, you know, the mean time to data loss. In the theoretical analysis you can say you get a ton more nines of availability from this
code compared to Reed-Solomon. The gains, that is the other thing that I wanted to say, is that
if you are willing to use bigger codes, so 45, 50 LRC would give you super good storage
efficiency, very close to one, reasonable locality, you read seven blocks or something in order to
repair one and the benefits in availability would be incredibly high because Reed-Solomon you
can't really, you can't really get a big Reed-Solomon because then you would have to read 50
blocks to repair one, so you would have 50 X, so it's a disaster. So one important conceptual
point that these locally repairable codes allow you to do is to have, to go to big codes for the
first time, and going to big codes can give you a ton more availability and a ton better storage
efficiency. Yes?
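A minimal sketch of the light-decoder idea described above: try the local XOR repair first, and give up (so a global Reed-Solomon decode can take over) when more than one block of a local group is missing. The data layout is a toy, not the HDFS RAID one:

```python
import numpy as np

def xor_blocks(blocks):
    return np.bitwise_xor.reduce(list(blocks))

def light_decode(group, local_parity):
    """Light decoder sketch: repair within one local group if exactly one block of the
    group is missing (None); otherwise give up so the global Reed-Solomon decoder
    can take over."""
    missing = [i for i, b in enumerate(group) if b is None]
    if len(missing) != 1:
        raise ValueError("light decoder gives up; fall back to the global decoder")
    survivors = [b for b in group if b is not None]
    return missing[0], xor_blocks(survivors + [local_parity])

# Toy usage: a 5-block local group plus its XOR parity, with block 2 lost.
rng = np.random.default_rng(2)
blocks = [rng.integers(0, 256, 8, dtype=np.uint8) for _ in range(5)]
parity = xor_blocks(blocks)
idx, rebuilt = light_decode([b if i != 2 else None for i, b in enumerate(blocks)], parity)
assert idx == 2 and np.array_equal(rebuilt, blocks[2])
```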
>>: [inaudible] benefit from the [inaudible] because you're repair node could become a
bottleneck when you are reading 10 pieces into the same node.
>> Alex Dimakis: Let's understand that part. Several people asked us about that, so let's
understand this. You have to think of how this thing actually works in a real system. So there are many nodes. A node is a bucket that stores thousands
of these blocks from different stripes. When you lose a node all of these blocks have to be
repaired. They are not all repaired in one place. There is a MapReduce job that says this block
will be repaired there, this block will be repaired there, so there is a placement policy that HDFS
is running and depending on the placement policy it will, so there is no issue of everything
being read in one point because you see…
>>: [inaudible] for say [inaudible] you look at repair of just one file. Is that on a single node?
>>: No. The blocks in the file are spread out [inaudible].
>> Alex Dimakis: The block, so the…
>>: Blocks, it schedules a task for each block that is lost. It will determine if the blocks are
under-replicated and make a MapReduce job per block regardless of which node it is on.
>> Alex Dimakis: Yes, but per block is right. Per block, you read 50 blocks; that's true. Yeah,
that's true.
>>: [inaudible] per block for five. Let's say five only one block [inaudible] so [inaudible] traffic
on that, maybe it's the one that [inaudible] current generation, so even if you consider one
file…
>> Alex Dimakis: No, but that's what I'm saying. So you are absolutely correct that per block,
one block is being repaired at one node and that block needs to talk to five others. If it needed
to talk to 10 others that would be double the traffic arriving here, fine. That is true. However,
this does not scale because as we go to thousands of nodes and thousands of blocks, the code
still stays at 10, 14. So now if you had indeed a very large code, say you had a code that was a 50, 55, then if one block had to read 50, perhaps that would be a problem, but it does not grow as you scale the nodes and scale the files. That's orthogonal, because the files are distributed and every node is doing a little local block repair. So per stripe you are correct; if per stripe a degree of 50 would be a problem, then yes, but that degree does not scale.
That's all I'm saying. Okay good. Other questions? Yes?
>>: I’m missing why the read I/O and the network are different.
>> Alex Dimakis: Okay, because the network is TCP/IP and God knows what it does. So I don't
know, so the way that the network is happening is Hadoop, so HDFS RAID opens a stream and…
>>: Is it just overhead or transmission stop?
>> Alex Dimakis: There are a ton of things that are happening. It's not explained by the theory.
So I don't expect these things to, as we grow I don't expect these things to become very
different. I think they would become a constant factor, but in practice, that is roughly 2X versus
2.6 X so this difference I have no idea what is going on because it opens TCP/IP streams and
God knows what it's doing. And then we were killing nodes at the same time, right, so there
are a lot of things going on.
>>: It looks a lot like it's reading all 13 copies. That would come out to 2.6, but that would
come out to 2.6 for the I/O's as well.
>> Alex Dimakis: Yeah, but you see the network was not consistent at all, so…
>>: [inaudible] right so it could balance out?
>> Alex Dimakis: Yeah, but there is a lot of strange things. This is what Amazon calls network
in. The network there's something else that, so this is the sum of the arriving network traffic at
each node. The network out was slightly different. I don't know exactly what the Amazon monitor is measuring, and there are a lot of crazy things that Hadoop is doing. Sometimes for
example, Hadoop might send, so this measures for example, the pings, right; it measures the
heartbeats. It measures many other things that are not just our own business, so there are a
lot of things that are running in parallel to us. Yeah?
>>: If you subtract the number of times the light decoder failed to recover so you had to fall
back to the…
>> Alex Dimakis: Ah, good question. We can probably get that from our traces. I don't have it
now, but I know that just from this, this was very clean. So HDFS by itself was much cleaner
than the network and this is reported by the way by Hadoop. It's not reported by Amazon.
From this we can estimate, and most of the time it was basically the light decoders. In the
cases after we have killed a lot of stuff, our cluster was getting smaller and smaller and then it
was more likely that three failures would kill two things from the same stripe, but most of the time it was the light decoder; and if you have a big cluster then it's very uncommon that two failures hit the same stripe.
>>: Looking toward [inaudible]…
>> Alex Dimakis: No. No. [inaudible].
>>: How was it reported by Amazon? Is it per [inaudible] or application?
>> Alex Dimakis: There was a tool that you can get aggregate or per VM. [inaudible] as they
call it. Yes?
>>: Do you have any measures of liveness or usability of the cluster during these processes? I
know with replication level 3 they use it both for recoverability of data as well as scaling apps, so
they will choose nodes to be as close to the data as possible and you can run more jobs
touching the same data this way.
>> Alex Dimakis: You are saying if you run another MapReduce job at the same time?
>>: Right, if you have actual workload, right. So this is a system sitting idle and things just
dying.
>> Alex Dimakis: Right.
>>: And self recover and nothing happened.
>> Alex Dimakis: That's right.
>>: Now user workload will force if you have any kind of metric like TeraSort it will actually eat
bandwidth, so was it a meaningful, how much bandwidth was being used is now meaningful to
task overhead as well as the disk reads and so with HDFS triple replication the data is live while
it's being recovered. In this case isn't the data dead while it's being recovered? Don't you have
to block tasks?
>> Alex Dimakis: So it depends. All are very good points. First of all we never, as you saw we
never compared replication to any of the coding schemes here. Replication is much faster if
you run another MapReduce job at the same time; of course it can read other things. So we
think of it as a multilayer system. For hot data it is 3x replication. If you decide to code, then we are just comparing the coding options here, so that's the first step. That's why we never compared to replication, because that is a different layer of the abstraction. Now between codes we block fewer things. We talk to fewer nodes most of the
time. We create less network. We measured, we did a few small experiments where we were
running another job at the same time and then we killed nodes and saw the repairs, and we did
see the big benefits of Reed-Solomon compared to, sorry LRC compared to Reed-Solomon, but
they were super noisy. We have a few plots on that but so far we haven't included them, but
yes, certainly you are right. The other job will create traffic in this I/O and it will take longer to
complete if you are doing a repair, but you need to repeat that many times because sometimes
you kill something and it is not being touched by the other job, so there is a lot of averaging you
have to do in order to get nice plots and we didn't do that at that scale, but yes, you are
absolutely correct. Did I answer all of your questions?
>>: I think, yeah. I think there's still a question of the practicality of it, of both implementations
in terms of we don't see that it even does block, like the jobs could try to read it and just fail to
read blocks and just fall apart.
>> Alex Dimakis: Yes, but that is, there is…
>>: That is something, you would have to see the application actually running with a live--we
can see that the idle system consumes fewer resources and theoretically things should be
higher throughput but we can't see that the implementation is sound enough to carry real
traffic. So that's something you just have to mess with.
>> Alex Dimakis: So HDFS RAID is running in production, so it has all of the mechanisms to protect against what you are saying, and the hacked version that we have should do the same. How much better it does, I haven't consistently measured, but we expect it to be better. So these are the conclusions of that. Yes?
>>: [inaudible] minimum system scaling does a number of nodes, right? Because if a node is
failing you are repairing portions on different nodes, right?
>> Alex Dimakis: Right.
>>: And these nodes are doing other work. They are serving other systematic pieces for the
other traffic order so if you don't have enough nodes, each node will have too much burden of
the repair and I guess that was a question.
>> Alex Dimakis: That is true, but what this is, this is more true for Reed-Solomon compared to
LRC, right?
>>: Yes, but if you go from 3X replication to [inaudible] repair, there you are increasing the
minimum [inaudible] threshold of system scale because you need to do more work for repair so
for the same amount of storage failed at the norm. You need to spread it out to more nodes.
>> Alex Dimakis: Yeah, but the system scale is the least of my worries here because…
>>: [inaudible] lower bound.
>> Alex Dimakis: Yes, but there is 33,000 nodes is easily in a cluster.
>>: 33,000 nodes will not be on the [inaudible].
>> Alex Dimakis: Not 33, three [inaudible], yeah.
>>: So you have to also look at same [inaudible] because as you go up the hierarchy and the
network is oversubscribed so repair traffic becomes a bottleneck.
>> Alex Dimakis: So in our case repair traffic is a bottleneck already.
>>: If you have inter-cluster repair traffic that will become even more of a bottleneck; what I am
saying is you have to have the repair localized to islands of high interconnect bandwidth so that
the network does not become the bottleneck.
>> Alex Dimakis: So placing things and repairing locally in a multi-hierarchical, rack-aware or cross-data-center setting is very interesting. Right now, to the best of my knowledge, there are very few systems that do a cross-data-center Hadoop. We are very interested in this but we haven't
worked on it, so, but definitely that is the next step to worry about, yeah.
>>: So much like read or write [inaudible] have a percentage of which the threshold for using
the read or write [inaudible] versus the [inaudible] section is no longer worth it; you just go to
the critical section. It seems like for this particular algorithm the correlation between machine
failure and rack failure there is a rate at which if machines and racks go down as one unit, like
maybe they share a power supply, the Reed Solomon is the better approach versus trying to do
local because local would fail every time if you take out your local groups and…
>> Alex Dimakis: No, no. There are two things. There are two notions of local that we need to
distinguish. Local you mean placing blocks in the same rack.
>>: By local I am assuming that the five is…
>> Alex Dimakis: In the same rack.
>>: Is fairly local, that supposes for all of them.
>> Alex Dimakis: Okay. That is not true in the current implementation. I can see why you may want to do that. Right now the default placement is random for 3x replication. Actually for 3x replication it tries to keep two copies in the same rack and one across, but when they code it, the Reed-Solomon blocks are across racks, so each one of the 14 blocks is in one of 14 racks, and we think that is what you want if you want to maximize reliability. We didn't hack the placement. Right now the implementation is just placing 16 blocks in 16 different racks, so that
is what it will do now. If you want to play around with placement you can do that, but we
haven't gotten there and I think it is a messy world, but sure you may get benefits, but it
depends exactly as you say, how often do nodes fail? How often do racks fail? And how much bandwidth you have: the in-rack is a one GB switch and the cross-rack is a four GB switch. If they were much different, then again you have to decide. Placement, agreed, is another business we have to talk about. Right now we are spreading it maximally.
>>: It seems like you should be able to compute the costs of the local versus the cost of the
Reed-Solomon and then determine--there is a rate of machine failure at which the local is no
longer worth doing. 99% machine failure…
>> Alex Dimakis: Compared to rack failure and also network and in the network.
>>: [inaudible] crossover here.
>> Alex Dimakis: Yes, I understand.
>>: And I think it would be interesting to have that--I mean this seems mostly theoretical in
terms of where it's coming from…
>> Alex Dimakis: Yeah.
>>: So it would be an interesting value to have. There are things like Google who built the GFS
on things sitting on pizza boxes, like little pizza boxes. Not something like the concept of a rack
box, but physical pizza boxes, when they first made GFS. I mean the probability of the place
catching fire and everything going up is much higher, was very high at the time, so it would
be interesting to have an actual value [inaudible] as people design their data centers they
could, they could sort of pick and choose.
>> Alex Dimakis: I think it's interesting. I actually have one student who is looking in this
direction so we may ping you more on it. Okay. So in the interest of time and as I correctly
predicted this is the back to theory slide, which is finishing, so I am done [laughter].
>>: I have a question as it relates to reliability.
>> Alex Dimakis: Yes.
>>: Can we go back to the code.
>> Alex Dimakis: Yes.
>>: So this code I understand I mean of course it can basically fix four failures. Can it basically
fix all five failures, for example a failure of six…
>> Alex Dimakis: That is a very good question. The 10, 14 provably cannot. So there is no 10, 14 with this locality that can correct five, but if you change the numbers a little bit, and that was the theory part, there is a theory that tells you exactly how many: tell me your locality, tell me your global parities, and I can tell you what's the best distance you can get. For this case, no, we cannot tolerate five, but if you have a slightly different Reed-Solomon, I think it was 9, 4, I don't remember now, if you have 9 and 4 parities then these would add to the distance, so definitely there are cases where they do.
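For reference, the locality-versus-distance tradeoff being invoked here is presumably the known bound for an (n, k) code whose information symbols have locality r; this exact formulation is my assumption, not something stated on the slide:

$$ d_{\min} \;\le\; n - k - \left\lceil \frac{k}{r} \right\rceil + 2. $$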
>>: I just want to comment. I think Cheng's is basically our code. We have basically distance-maximized [inaudible] codes.
>> Alex Dimakis: Yes, very good. So I think I will stop here because well I am out of time. So
thank you very much.
>>: These are all Byzantine faults, right?
>> Alex Dimakis: This is worst-case, yes.
>> Cheng Huang: Okay, let's thank our speaker. [applause]. Welcome to the second part of the talk. This talk will be given by Dimitris. Dimitris is currently a third year PhD student at USC in Alex Dimakis’ group. He will tell us, now that the data is reliably stored, how you can efficiently do analytics on top of it.
>> Dimitris S. Papailiopoulos: I am going to talk about large-scale sparse PCA, so I am going to
give an introduction to the technique and present a new algorithm for it that runs on big data sets, specifically on Twitter analysis. This is joint work with some collaborators from USC and my previous advisor from Greece. So I will talk about sparse PCA, what this tool is about and why it can be useful, and then I am going to tell you why sparse PCA is an intractable problem, and then I will introduce a new approximation algorithm which we think is suitable for large-scale problems, and then I will introduce a framework that is good for Twitter analysis which we call Eigen Tweets. A very short overview
for principal component analysis. It is basically a dimensionality reduction tool that is used for
clustering [inaudible] and applications like that. Sparse PCA is a variant of principal component
analysis which is in particular useful when we care about interpretability which I'm going to
explain later. Then I'm going to present the algorithm that we have for sparse PCA which is
really fast when we tested it on large data sets. I will basically explain the techniques on the
Twitter model, but we can really consider any data set. So we tested this on
tweets. Tweets are sentences that are comprised of very few words, like five words, a small
number of words. What we do is we generate a very long vector where each index corresponds to a word, and whenever a word appears in a tweet I put a 1 there. This is a very long vector; for example, its length is about 50,000 in the data sets that we are testing, and the
vector is super sparse, so it has like 5 or 10 nonzero entries. This is our sample vector of length
50 K. What we do is we collect a bunch of these tweet vectors into a big matrix. This matrix S, the sample matrix, contains all of these vectors. Each
vector is a tweet and each row corresponds to a word. Wherever there is a 1 it means that this
tweet for example has this word. What we want to do is we want to find a sparse vector that
closely matches my data set, so I want to find a vector that is really close, a tweet that matches
the tweets in my data set. So what is the metric that I care about? It is the metric of the sum of
projections. I have this vector and I take inner products with my data set, so the inner product with a tweet in my data set tells me how well this vector matches each vector in my data set. So I want to maximize the sum of these projections, these inner products. So I want to solve this problem here. So the way we can solve it is principal component analysis. The first step is we create
an empirical correlation matrix R, which is produced by taking the inner products of S with itself. Basically the (i, j) entry of R counts exactly the number of times word i appears together with word j in my data set. If I want to find the vector that maximizes the projections, which is this problem, this problem can be cast as the maximization here, so instead of maximizing the norm of X transpose S [inaudible] squared, I can just maximize X transpose S S transpose X, and this is this quadratic form here.
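Putting the objective described so far into symbols (a restatement in my notation, with $s_i$ the i-th tweet column of the sample matrix $S$ and $R = S S^{\top}$):

$$ \max_{\|x\|_2 = 1} \; \sum_i \big(s_i^{\top} x\big)^2 \;=\; \max_{\|x\|_2 = 1} \; \|S^{\top} x\|_2^2 \;=\; \max_{\|x\|_2 = 1} \; x^{\top} R x. $$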
Now this problem we know how to solve. The solution to this is the top eigenvector of my
empirical sample correlation matrix. So SVD can give me a solution to this problem in polynomial time; the [inaudible] complexity is n squared, but it can be faster depending on the sparsity of my sample matrix, that is, on how many nonzero entries my sample matrix has. Now the problem with PCA is it doesn't actually solve the problem I
want solved. The problem is that the top eigenvector, the solution to this optimization problem
is going to be a super dense vector, so the solution of this maximization here is going to be a
vector which has nonzero loadings [inaudible] and each entry corresponds to a specific word.
Now having a dense vector is not good because this is supposed to be a tweet and a tweet that
consists of thousands of words is not interpretable. It doesn't make any sense. So I would like to have a super sparse vector that does the same job, that maximizes this inner product. So I would
like to have a sparse vector because sparsity means interpretable, so I would like to have a
vector like that. I would like to have a vector that had entries like strong, earthquake, Greece
and morning which would indicate that the main topic of a particular tweet data set is that
there was an earthquake in Greece, for example. So this is a sparse vector is much better than
a dense vector in terms of interpretability. I want to solve the same problem with the other
constant of sparse, so I want to maximize this quadratic form subject to the L2, the [inaudible]
constraint and the cardinality constraint, so this means that I want to find the vector that has
only k nonzero entries. In practice I am going to just enforce it to be five or six or something
like that, a really small and constant number with respect to my problem size. This constraint
theory is exactly what makes my problem NP hard. So this is a cardinality constraint which
makes the problem intractable. Sparse PC is NP hard. So there have been introduced many
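For reference, the optimization just described can be written out as follows; this is my transcription of the problem being discussed, with R = S Sᵀ the empirical correlation matrix and ‖x‖₀ denoting the number of nonzero entries (the notation is assumed, since the slide itself is not in the transcript):

\[
\max_{x \in \mathbb{R}^n} \; x^{\top} R \, x
\qquad \text{subject to} \qquad \|x\|_2 = 1, \quad \|x\|_0 \le k .
\]

Dropping the cardinality constraint ‖x‖₀ ≤ k gives back vanilla PCA, solved by the top eigenvector of R.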
So many approximations and relaxation schemes have been introduced. The easiest thing one can do is simply compute the vanilla PCA and threshold it, just keep the k highest, the k maximum absolute-value entries. There are also some regression techniques, some semidefinite relaxations, a generalized power method, and a newer technique based on a [inaudible] descent algorithm, but in general the problem is intractable, and especially when I go to very large data sets, all of these methods, with the exception of this one, basically become intractable. So when I talk about data sets of hundreds of thousands of entries and hundreds of thousands of words, these methods cannot run. So we want to introduce an approximation, starting with something really easy. I'll start with the assumption that my correlation matrix is rank one; if my correlation matrix were rank one, my problem would be easy. The approximation that we have basically solves sparse PCA [inaudible]: if my correlation matrix has constant rank, then sparse PCA is a polynomial-time problem. But this is not enough, and we also introduce a feature elimination, so in the tweet data set this will eliminate words that will never appear in my optimal solution. This is key to running our approximation on large data sets.
The key is that we can always decompose R, because it is positive semi-definite, as the sum of the outer products of its eigenvectors: V1 is the leading eigenvector, V2 is the second eigenvector, and so on. We approximate R with R1, and R1 is going to be a rank-one matrix, which is this outer product. Now I am asking the same question again: if R is rank one, if I use this matrix, can I solve the sparse PCA problem? This is my problem here, where I replace R with R1, the outer product, and all of this basically boils down to a maximization: I want to maximize this absolute sum. Do you think this is an easy problem? Basically, it boils down to a sorting problem: you just need to keep the k maximum absolute values of V1, and those are going to be the optimal indices, the optimal k words in a sense. This problem I can easily solve. I just need to sort the absolute values of V1 and keep the k maximum. This is basically equivalent to thresholding: I compute, like, vanilla PCA, keep the leading eigenvector, and threshold the entries, zeroing out the n minus k smallest entries. So if my correlation matrix were rank one, then I would know how to solve the problem; it reduces to a sorting problem.
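The rank-one solver just described is only a few lines. A minimal sketch, with illustrative names and under the assumption that v1 is the leading eigenvector of the rank-one approximation:

```python
import numpy as np

def rank1_sparse_pc(v1, k):
    """Best k-sparse unit vector for a rank-one matrix lambda_1 * v1 v1^T:
    keep the k largest-magnitude entries of v1 and renormalize."""
    support = np.argsort(np.abs(v1))[-k:]   # indices of the k largest |v1| entries
    x = np.zeros_like(v1)
    x[support] = v1[support]
    x /= np.linalg.norm(x)                  # unit l2 norm, exactly k nonzeros
    return x, support
```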
Now let's take it a step further and check what happens when R has rank two. When R has rank two, when I do the rank-two approximation, instead of keeping just one eigenvector I keep the sum of these two outer products. I have the same maximization here, and I replace my matrix with this matrix here, built from V1 and V2, the two leading eigenvectors of my initial correlation matrix. What I do now is introduce a new vector c of phi, which is going to unlock the low-rank structure of R; this vector here is basically going to give me instances of the rank-one problem. C of phi is a vector that spans the unit circle in dimension two, so it has unit norm. Now I will use the Cauchy-Schwarz inequality in the following way. This is my matrix, this is the [inaudible]. If I take the inner product of c, this guy here, so this is like a vector, right? If I take the inner product of this guy with this guy, then by the Cauchy-Schwarz inequality I know that this thing is less than or equal to the product of these two norms. Because I am constraining c to have unit norm, this guy is basically one, so I have this inequality here, which becomes an equality when c of phi is collinear with this vector here. So I introduce a variational characterization of the norm, of this norm. What I will do basically is the following: instead of solving this problem here, the initial sparse PCA problem for the rank-two matrix, I will solve a double maximization over both X and phi of this quantity here. Because this thing is less than or equal to this thing, and the maximum of this thing is equal to this thing, the maximum over phi and X is going to be equal to the maximum of this. This is a variational characterization of the problem, and it kind of looks more complicated, but this is exactly what unlocks the poly-time approximation.
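Written out, the rank-two argument just described looks like the following; the notation is my reconstruction from the talk, with R₂ = λ₁v₁v₁ᵀ + λ₂v₂v₂ᵀ the rank-two approximation and V = [√λ₁ v₁, √λ₂ v₂]:

\[
x^{\top} R_2\, x \;=\; \|V^{\top} x\|_2^2
\;=\; \max_{\phi \in [0,\pi)} \bigl(c(\phi)^{\top} V^{\top} x\bigr)^2,
\qquad c(\phi) = (\cos\phi,\ \sin\phi)^{\top},
\]

by Cauchy-Schwarz, with equality exactly when c(φ) is collinear with Vᵀx. Therefore

\[
\max_{\|x\|_2=1,\ \|x\|_0\le k}\; x^{\top} R_2\, x
\;=\; \max_{\phi}\ \max_{\|x\|_2=1,\ \|x\|_0\le k}\ \bigl((V c(\phi))^{\top} x\bigr)^2 ,
\]

and for every fixed φ the inner problem is a rank-one instance with the vector v(φ) = V c(φ).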
The key question is: what happens if we fix the angle phi? If we fix this angle phi here, then this is just a vector; this is going to be a fixed vector, and the problem becomes a max of V transpose X, the max of an inner product with a fixed vector. This is a rank-one instance. It is exactly equivalent to the rank-one instance I had before, which I can still solve by just sorting my elements and keeping the k maximum elements. The clue is that the sparse PC is going to be the solution of one of the many rank-one instances that exist out there. If I were able to scan all phi angles and keep all possible sparse PC candidates, then the maximizer across all phis would be my optimal sparse PC. The problem now is that phi is continuous; it is an angle between zero and pi, and I cannot just scan all possible angles. The whole idea that's going to lead to the next step is that the optimal solution here depends only on the sorting of my V of phi vector: when the sorting changes, the locally optimal vector changes. The clue is the following. If I take this vector here, the one I was considering, and write it down again, this guy has n coefficients, and each of these coefficients is a continuous function, a continuous [inaudible] of my angle. So what I will do now is just plot these coefficients and see what's going on. Here I have a random matrix V and I just plot my V of phi: this is the first coefficient of V of phi, the second, the third, and so on, and these [inaudible] are plotted as a function of phi.
Now, what does it mean to fix an angle? If I fix an angle here, I know how to solve the problem; I would just do the sorting thing and that's it. So for this angle here, my locally optimal three-sparse vector is going to be the vector that has non-zero loadings on the [inaudible] denoted by the top three curves that intersect this line here. The whole idea is to check what is going on with the sortings in this spanogram; we call this figure a spanogram. If I am able to track all sortings, all possible sortings, I will be able to track all possible rank-one instances, which means I will be able to track all possible sparse PC candidates. When does the sorting change? The sorting changes when two of these curves intersect. So I have many intersections in this spanogram, and all I want to do is find all of these intersections and compute, at each intersection, the locally optimal sparse PC vector. Now I want to compute the complexity. I need to find the number of intersections, because the number of intersections determines the number of rank-one instances. This is a simple problem: you just need to solve the simple equation where two curves become equal, and because of the absolute values there are two solutions per pair. So the total number of intersections is two times n choose two, because I have n choose two ways to pick a pair of curves. These are exactly the curve crossings that I have, and this is exactly the number of sparse PC candidates that I have for my rank-two approximation.
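Here is a hedged sketch of the rank-two procedure as I read it from the talk (not the speakers' implementation). It assumes V2 is the n×2 matrix [√λ₁v₁, √λ₂v₂], enumerates the angles where two curves cross, solves a rank-one instance on each side of every crossing, and scores each candidate on the quadratic form, which matches the "plug the candidates into the initial metric" step recapped next. It is meant for the small n left after the feature elimination described later; on the full vocabulary the O(n²) crossings would be prohibitive.

```python
import numpy as np
from itertools import combinations

def rank2_sparse_pc(V2, R, k):
    """V2: n x 2 scaled eigenvectors; R: correlation matrix (dense or sparse); k: sparsity."""
    a, b = V2[:, 0], V2[:, 1]
    # Curve i is |a_i cos(phi) + b_i sin(phi)|.  Two curves cross where
    # a_i cos + b_i sin = +/- (a_j cos + b_j sin)  =>  tan(phi) = -(a_i -+ a_j)/(b_i -+ b_j).
    angles = [0.0, np.pi / 2]
    for i, j in combinations(range(len(a)), 2):
        for s in (+1.0, -1.0):
            db = b[i] - s * b[j]
            if abs(db) > 1e-12:
                angles.append(np.arctan(-(a[i] - s * a[j]) / db) % np.pi)
    # Evaluate just to each side of every crossing, since the sorting is tied at the crossing itself.
    eps = 1e-7
    test_angles = [p % np.pi for phi in angles for p in (phi - eps, phi + eps)]
    best_val, best_x = -np.inf, None
    for phi in test_angles:
        v_phi = a * np.cos(phi) + b * np.sin(phi)      # the rank-one instance at this angle
        support = np.argsort(np.abs(v_phi))[-k:]
        x = np.zeros(len(a))
        x[support] = v_phi[support]
        x /= np.linalg.norm(x)
        val = float(x @ (R @ x))                       # score the candidate on the quadratic form
        if val > best_val:
            best_val, best_x = val, x
    return best_x, best_val
```

A slightly stronger variant would, for each candidate support, take the top eigenvector of the k×k submatrix of R instead of the renormalized v(φ); the sketch keeps the simpler choice.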
The whole idea here is that we introduced a new spherical vector which unlocks the rank-two structure of my matrix and gives me a new view of the problem. I am able to track intersections, which correspond to unique sparse PC candidates, and I know that within these candidates the optimal one exists. What I need to do is just plug the intersection angles into my rank-one solver, which is just a sorting, and keep all of the candidates, which are going to be quadratic in n. Then I just need to plug them into my initial metric and keep the maximizer. This is going to be... huh?
>>: [inaudible] these are dimensions?
>> Dimitris S. Papailiopoulos: Right, so n here is the number of words, which is 50 K, which means I cannot basically run anything [inaudible], right? This algorithm as we have it right now we cannot run on very large problem sizes. The complexity of this thing is poly-time, and we can generalize it to rank-d matrices by introducing a spherical vector which scans a hypersphere instead of a circle, and the theorem is that if you approximate your initial correlation matrix with a constant-rank matrix plus an identity, where the constant-rank part has rank d, then I can compute the optimal sparse PC of that in time which is polynomial in n. Now the problem is that the rank is in the exponent here, right? So although it's poly-time, it is not tractable if you go to…
>>: One small n is big n.
>> Dimitris S. Papailiopoulos: Yes.
>>: So small n is [inaudible].
>> Dimitris S. Papailiopoulos: Right. So we also have an approximation guarantee that tells you what the gap with the optimal solution is, how far away you are from what the optimal would give you. Now this is not enough, because as I told you, the exponent, although it is going to be constant if the rank is constant, is large enough that we cannot run it on big data sets. So if I have like 100 K words, if my problem size is 100 K, then the complexity is prohibitive. But depending on the sortings of these curves in my spanogram, I can drop curves, which means I can drop lines, which means I can drop words which will never occur in my top-k candidates. The clue is that some curves never get into the top k curves, which means that if a curve never goes up there, then the corresponding word never shows up in a k-sparse locally optimal vector. The idea is to track the k top curves: compute the maximum amplitudes of the curves, find the minimum level that the k top curves always stay above, and then all the curves whose maximum is below that level I can drop. So this is an algorithm that eliminates curves, which translates to an elimination of features, an elimination of words.
Gaussian V of rank two and a k of 10, having this scale here is something we then do with 100,
1000, going up to a million, the number of words or the number of features I am left with after I
do my elimination algorithm, is kind of growing logarithmic with my problem size. What does
this basically mean? It means that if I am at the million words here for example and I want to
solve my problem, I can equivalent solve a problem on 93 words without losing optimality. A
problem on a million words is equivalent to a problem on 93 words for this specific instance of
my matrix. This is basically telling you that for randomness this works good, but we want to
find out if for real data this gives us any benefit.
>>: This is for rank approximation?
>> Dimitris S. Papailiopoulos: Yes. This is for rank approximation. It's kind of similar.
>>: Similar?
>> Dimitris S. Papailiopoulos: Yes. Yes. Yes.
>>: So say rank five would be [inaudible]?
>> Dimitris S. Papailiopoulos: It's kind of like d times log n, roughly. We first want to see how well the elimination algorithm works on real data sets. We fix k again to be equal to 10, and we have a data set that consists of hundreds of thousands of tweets, each tweet has a constant number of words, about 5 or 10, and the number of unique words in the data set is, yeah, basically n. When I have 1 K words, there are only about 21 words that I need to check, and if I go up to, you know, 200 K, then again it's only like 34. So this basically means that the practical data set validates our…
>>: [inaudible].
>> Dimitris S. Papailiopoulos: It's better, yes. It is slightly better. The whole idea is that you can do the rank approximation on your correlation matrix, run the feature elimination, and then run the algorithm that we have for constant-rank correlation matrices. We want to use this framework for Twitter analysis. What happens with Twitter is that it puts out many tweets and each tweet kind of has a tag on it: some are about world news, local news, [inaudible], you know, article pointers, and we would like to have a black box that takes all of these tweets and outputs, you know, events, major events and hot topics during a specific date or a specific time window, like a week or something like that. So we have a framework called Eigen Tweets, and what we want it to do is this: we feed it tweets, which in our data set, which I am going to talk about in the next slide, are about 5 K tweets per hour or 50 K tweets per day, and we want this black box to give us the major events in our data set. We would like to have something here that says the first Eigen Tweet has to do with an event, an earthquake; the second has to do with an uprising, a protest; and the other has to do with deals, summer and stuff like that. We want to kind of cluster our data, give directions in our data, which is going to explain what's going on.
The Eigen Tweets algorithm is basically what I told you earlier. The first step is you just compute the sample correlation matrix. You do an eigenvalue decomposition to get the d leading eigenvectors, run the feature elimination algorithm, and then run the algorithm for the low-rank approximation. Now, this algorithm is going to output a [inaudible] of words, and what we do here is we zero-force these words in our data set: we eliminate these words from our data set and then we run this whole method again to get the second sparse PC. This zero-forcing here enforces the sparse principal components to be orthogonal to each other. Now, the hardest thing here is basically the eigenvalue decomposition, so assuming that after the initial elimination we are just left with a logarithmic number of words, this is what dominates the computation, roughly speaking.
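Putting the pieces together, here is a sketch of the Eigen Tweets pipeline as just described: eigen-decomposition, feature elimination, constant-rank sparse PCA, then zero-forcing the selected words and repeating. It reuses the hypothetical helpers sketched earlier (`leading_eigenvectors`, `eliminate_features`, `rank2_sparse_pc`); the structure is my reading of the talk, not the speakers' code.

```python
import numpy as np

def eigen_tweets(S, k=10, num_components=3):
    """S: sparse words x tweets matrix. Returns one word-index support per Eigen Tweet."""
    S = S.tolil(copy=True)                    # lil format lets us zero out word rows between rounds
    components = []
    for _ in range(num_components):
        S_csc = S.tocsc()
        V2, lams = leading_eigenvectors(S_csc, d=2)       # top-2 eigenvectors of R = S S^T
        V2 = V2 * np.sqrt(lams)                           # scale columns by sqrt(eigenvalues)
        keep = eliminate_features(V2, k)                  # in practice only tens of words survive
        R_small = (S_csc[keep, :] @ S_csc[keep, :].T).toarray()
        x_small, _ = rank2_sparse_pc(V2[keep, :], R_small, k)
        support = keep[np.nonzero(x_small)[0]]
        components.append(support)                        # the k words of this Eigen Tweet
        for i in support:                                 # zero-force: disjoint supports => orthogonal PCs
            S[i, :] = 0
    return components
```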
Basically, our data set consists of Greek tweets that we crawled for months. We built the crawler using SideBar software, and the data set that we ran our experiments on has millions of tweets. The flow of tweets is about 3 K tweets per hour, 50 K tweets per day, and about 1 million a month. The number of unique words in our analysis is about 2 K per hour; per day it is about 50 K, and per month about 300 K. This basically means that the number of tweets grows much faster than the number of unique words. The sample set that we picked was between May 1st and May 10th of 2011; the number of tweets in this time window was 500 K and the number of unique words was 200 K, so the problem size, the n that we had before, is 200 K. My rank-one approximation, which is a simple eigenvalue decomposition plus the [inaudible], takes almost a second; the rank two, 7 seconds. The elimination gives you back only 20 words, which means that out of all of these hundreds of thousands of words we just need to keep 20 words here, and the rank-three approximation needs a few more words to operate and takes a little bit more time.
The first question is: why would you use sparse PCA and not just pick the 10 most frequent words, or the ten highest-correlation words? The thing is, that is just a bag of words; the most frequent words are not going to tell you what's going on. It's just: year, Greece, love, Twitter, May, Osama, Laden and so on; that was when bin Laden had been found. The same holds for the highest-correlation words. All I am trying to say is that picking the most frequent words is not good for classifying; it is not a good way to find the trends of what's going on in your data set.
We'll start with the rank-one approximation. The first Eigen Tweet, the first principal component, has to do with love, so it's love, Greece, know, received, Greek, happy and all of that. The second principal component is not really good: it says year, Greece, Osama, Laden, mommies, world, May. This is basically not a good [inaudible]; it doesn't really tell you what's going on with your data set, and that's because the rank-one approximation is not sufficient for the principal components to give you good trends, trends that are interpretable. The third PC is a little bit better. It says home, Facebook, Veggos, Thanasis, job, nice, days; these were all words about Thanasis Veggos, one of the most beloved comedians in Greece, who had died. So it kind of gives you what's going on, but it seems that topics are mixing with each other, and that's because of the insufficiency of the approximation, so we take it one step further and go for a rank-two approximation. Oh, by the way, here the approximation guarantee is almost 80%, which means there is at most a 20% gap from the optimal, from the full-rank solution. I take the rank-two approximation, which leaves only 20 words out of 200 K words, and the first PC we see is identical to the first PC of the rank-one approximation. The second PC kind of clears things up: it says year, Greece, mommies, world, mothers, May and Twitter, so basically in this time period there was also Mother's Day. The third PC has to do with the death of Osama bin Laden, so now we see that the topics are starting to clear up. The approximation guarantee is better here: we have an 86% approximation guarantee, which means we are roughly at most 14% away from the optimum. And for the rank-three approximation we just get slightly better PCs, and the approximation guarantee is slightly better as well; we are at most 10% from the optimal.
So sparse principal component analysis is an intractable problem, but there are some nice tractable relaxations. The first result that we have here is that if your correlation matrix is well approximated by a low-rank matrix, then we can solve the problem in poly-time. This is not enough for large-scale problems. For large-scale problems we need something better, and what we do is feature elimination. We have a feature elimination technique, which gives you a better result and a much better algorithm, in the sense that it is equivalent to the initial one but runs on a very small problem size, and it runs really well. We use this approximation for a new framework that we call Eigen Tweets, which is basically sparse principal component analysis in the context of Twitter. For future work, the basic thing is that the algorithm is fully parallelizable: when I compute locally optimal candidates, this can be done in a parallel fashion; the same goes for the elimination algorithm, and the same goes for the SVD. All three building blocks of Eigen Tweets can be computed in parallel, and we would like to implement this framework in Hadoop MapReduce and see how it performs on very, very big data sets. Thank you.
[applause].
>>: To track all of the intersections, can all of that be done in parallel?
>> Dimitris S. Papailiopoulos: Yes. Basically what you need to do is just feed the matrix to the workers, so if you get like 10 CPUs, you feed the matrix to each CPU, and each CPU can compute an intersection, or a couple of intersections, or a small number of intersections. For each intersection you compute candidates, so each machine can compute different candidates, and all of these together comprise the candidate set, so you know that the optimal sparse PC is in there.
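A toy illustration of the data-parallel split described here: the candidate angles are independent, so chunks of them can be scored on different workers and only the overall maximizer kept. Everything below (chunking, names) is my own sketch, not the speakers' implementation.

```python
import numpy as np
from multiprocessing import Pool

def score_chunk(args):
    """Score a chunk of candidate angles; return (best value, best sparse PC) for the chunk."""
    V2, R, k, angles = args
    a, b = V2[:, 0], V2[:, 1]
    best = (-np.inf, None)
    for phi in angles:
        v_phi = a * np.cos(phi) + b * np.sin(phi)
        support = np.argsort(np.abs(v_phi))[-k:]
        x = np.zeros(len(a))
        x[support] = v_phi[support]
        x /= np.linalg.norm(x)
        val = float(x @ (R @ x))
        if val > best[0]:
            best = (val, x)
    return best

def parallel_rank2(V2, R, k, angles, workers=10):
    chunks = np.array_split(np.asarray(angles), workers)
    with Pool(workers) as pool:
        results = pool.map(score_chunk, [(V2, R, k, c) for c in chunks])
    return max(results, key=lambda t: t[0])     # overall (best value, best sparse PC)
```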
>>: [inaudible] so they go in a lockstep way, right?
>> Dimitris S. Papailiopoulos: They don't need to exchange information. Basically the only thing they need is the matrix, and for that matrix each one just picks its own distinct angles to work on.
>>: [inaudible] has to go across all of them?
>> Dimitris S. Papailiopoulos: For the elimination you do need to exchange information between executions, but you can still run it in parallel; you definitely need to kind of synchronize, yeah.
>>: So once you start doing this in parallel [inaudible] some take longer [inaudible] repairing
from [inaudible].
>> Dimitris S. Papailiopoulos: Yeah.
>>: Can I take a look at the complexities slide?
>> Dimitris S. Papailiopoulos: This one?
>>: Yeah. So up here what is S?
>> Dimitris S. Papailiopoulos: Okay. Basically S is the sparsity of my tweets, so in this example it is like 5 to 10. If my matrix is super sparse, it means that my inner products can be computed very fast: if I have a vector that is super long but has only five nonzero entries, then an inner product takes constant time to compute, because I only do a constant number of multiplications. This is why all of these complexities depend on the sparsity of the matrix. If my matrices were super dense, then even computing the correlation matrix would be a tough job to do.
>>: And what is the n small?
>> Dimitris S. Papailiopoulos: The n small is the number of features, the number of words that I
need to optimize over after I have run my elimination algorithm.
>>: When you use the [inaudible] matrix [inaudible] so is there any value to somehow keep the
order of the words, would it give you better results?
>> Dimitris S. Papailiopoulos: [inaudible].
>>: "By Microsoft" and "Microsoft by" are very different things, but then you put them in your vector and "by Microsoft" and "Microsoft by" correspond to the same vector?
>> Dimitris S. Papailiopoulos: Yeah, right, they correspond to the same vector, right.
>>: So then they get the same value. Is there any way to keep the order of the words?
>> Dimitris S. Papailiopoulos: You can do that, but it will give you a much bigger data set; right now your data set is a matrix, and you would need a tensor to keep the order.
>>: You can have one [inaudible] pairs of words, but then your vector would be [inaudible] longer. It would go as the number of words squared; it would explode, but I wonder if there is any shortcut. People do all kinds of things. This one is a bag of words; it is simplistic. But yes, you are right, people do these kinds of things sometimes. But there is no [inaudible] sparse PCA on that.
>>: [inaudible] would the elimination of feature words help [inaudible] the concept of eliminating [inaudible]?
>>: Yeah, this method would work. Definitely, this is interesting; this method would have no
problem. If your features become pairs of words, no problem. I mean it's all the same.
>>: [inaudible].
>>: The question is, like, can you use pairs of words as the features [inaudible]. How does the complexity [inaudible]?
>> Dimitris S. Papailiopoulos: You know for sure the thing is going to be n squared. That's really what you know for sure. So you are getting a +1 on the exponent of everything.
>>: [inaudible].
>> Dimitris S. Papailiopoulos: Huh?
>>: [inaudible] words [inaudible]. I guess consider [inaudible].
>>: Is this the ideal approach, or did you actually want to find the optimal [inaudible] what's happening as opposed to interaction? Is there a plot that demonstrates that it actually trends towards the optimal approach?
>> Dimitris S. Papailiopoulos: I definitely cannot plot your data set because it's, yeah, exactly, but what you can do for sure is take the sparse principal components, which, as you say, are just index sets, like five indices of words, and then you can project your data set onto, like, three principal components, and this will kind of give you a clustering of your data set: how much each point leans towards each kind of topic, each principal component. That's the way you can visualize this, by projecting onto lower dimensions.
>>: So I guess you stripped articles, right? Because I didn't see the top dimensions being [inaudible], "and", "I" and "you"?
>> Dimitris S. Papailiopoulos: Yes, yes. There is a normalization phase, so as a kind of first-order approach we throw away links, references and stuff like that, all of these things, and just keep words, and not all words: we only keep the words that have length more than three characters, so we throw away the short words, and we also have a list of words, like "me", "you", stuff like that, words that don't say anything about the content of the tweet, which we throw away even before we start doing anything, because otherwise they are going to populate your first principal components. That's the whole idea.
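A rough sketch of the normalization described in that answer: strip links, mentions and references, drop tokens of three characters or fewer, and remove a stop-word list, keeping only tokens that carry content. The regexes and the tiny stop-word list below are my assumptions, not the speakers' exact pipeline.

```python
import re

STOPWORDS = {"this", "that", "with", "from", "have", "what"}   # illustrative only

def clean_tweet(text):
    text = re.sub(r"https?://\S+", " ", text)      # throw away links
    text = re.sub(r"[@#]\w+", " ", text)           # throw away mentions and hashtags
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens
            if len(t) > 3 and t not in STOPWORDS]  # keep words longer than 3 chars, drop stop words

print(clean_tweet("Strong earthquake in Greece this morning http://t.co/xyz @news"))
# -> ['strong', 'earthquake', 'greece', 'morning']
```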
>>: So the first thing you do is strip the most common words?
>> Dimitris S. Papailiopoulos: Right, yeah.
>>: Then we strip the least common words, like statements.
>>: The most common words are [inaudible].
>> Dimitris S. Papailiopoulos: Right.
>>: But stripping away the least common words probably is never going to throw away something that [inaudible] to keep.
>>: Are they stemmed? So, like, "ran" and "run" are stored at that same index in the vector? Like, are they normalized so that "running" and "ran" are all mapped to the same index in the vector, or do you have the same…
>> Dimitris S. Papailiopoulos: Yeah. That we should do, but we don't do it now. We should do that, though, yeah. It would definitely give you better results, because if there is a small typo, or cases like what you said, they end up as different indices even though they should be the same, but we don't handle that here. For example, you have Greece written in Latin characters and Greece written in Greek characters, and these should be mapped to the same index, but we did not do that. That is a nice direction; we should follow that as well.
>> Cheng Huang: Okay. Let's thank the speaker again. [applause].