>> Cheng Huang: Hello everyone and welcome to the afternoon talks. Today we will have two
talks by Professor Alex Dimakis and his student Dimitris, so we'll have a two-hour slot in total. So
the first talk will take about 45 minutes to an hour and then we will take a short break, and
then we will have Dimitris talking about a second topic. So the first one is erasure codes for big
data over Hadoop. The second topic is large scale sparse PCA for Twitter analysis. So that one
we'll start about an hour from now. So let me first introduce Alex. Alex is currently an assistant
professor at USC, so he has longtime ties to Microsoft Research, so Alex won a Microsoft
Fellowship when he was a graduate student and then he did an internship here with our group
and then he graduated from Berkeley around 2008, and he did a postdoc at Caltech for one
year and he is now an assistant professor at USC. So Alex has done some work on coding which
attracted a lot of attention in the society, so the most well-known one is called regenerating
codes. So for regenerating codes he has won a dissertation award from Berkeley and also the
2010 IEEE Data Storage Committee Best Paper Award, and he has won the NSF CAREER Award and is a chaired assistant professor right now, so I will keep it short on that side and let Alex
tell us the story about erasure codes for Hadoop.
>> Alex Dimakis: Thank you, Cheng. Thanks for the invitation. This is joint work with my
students at USC, Mahesh, Megas and Dimitris; so Dimitris will be giving a talk afterwards. A
part of it is joint work with Scott, Ramkumar and Dhruba from Facebook. The overview is
basically the following. The first message and the most important message that is making this
relevant is the fact that data seems to be growing faster than infrastructure now and the cost of
storing it is increasing very fast. In particular 3X replication which is sort of the standard thing
that people do is becoming too expensive. So the main message that I think most large-scale
storage systems currently use already is for cold data, and we will define cold later, but when the data is not used or read that much, instead of using 3x replication we use some form of erasure coding and we save a lot of storage. A lot of the data is actually cold data, depending on the application again, but there seem to be very big gains in storage
that you can have. The interesting message that I want to say is that most storage systems
today are basically using Reed-Solomon codes, and the main message I want to say is that classical codes like this are unsuitable for these distributed problems, or in other words I am saying that there are other codes
that are much better than Reed-Solomon. What my lab is working on for a long time is basically
creating new codes that are optimized for distributed problems, and information-theoretic bounds [inaudible] on capacity as to what are the best codes one can hope for, so we have a sense of
knowing how far we are from optimal. I'll just give a very quick overview of how codes are used
to store. Usually we have a file or data object and we cut it up into blocks or chunks depending
on which company you work at. I will just call them blocks here, so you cut it in blocks. In the
coding theory language you start with k blocks, and then you produce codes like this very simple code that says just take two blocks, XOR them together and produce a simple XOR block, and this is a (3, 2) MDS code. Note that compared to the standard systems/CS language, where people would call this a 2+1, meaning two information blocks and one parity, I use the standard coding theory language, which means (3, 2), so the number of failures I can tolerate is n minus k, so one
failure here. So this of course is a single parity code used all over the place and can tolerate
one failure. This code is a 4, 2 MDS code so that means that it has the best distance possible
which means it has the best fault tolerance possible which means it can tolerate any two
erasures, and you can see of course, for example, if you lose these two blocks you can recover
from the other two linear equations. So the general idea, just to set up my notation: an (n, k) erasure code is a black box which takes k packets and produces n packets. These n
packets typically have the same size as the original k, but we will be relaxing that later, so we
will see about that, but that is important. And the key property that it provides you is that any
k out of these n suffice to reconstruct the original k. If these coded packets have the same size
as the original, then this is optimal, so this is called MDS if they have the same size. If we start
playing with the sizes, then it is a different story, and there are a lot of bounds on what is going
on. That is called the optimal reliability that you can get for that given level of redundancy and
that is well known and there are many off-the-shelf codes that people can use. The new aspect
as I said is that now we have packets that are distributed in a network, so classical codes have a lot of suboptimalities. So let's talk about Hadoop. I gave a talk about the state-of-the-art for distributed storage codes two years ago in this group. I'll tell you what has been new since
those last two years and I will tell you what is now the state-of-the-art. I know most people are
interested in the practical aspects, so there are two parts of the talk. There is a practical part
that starts at this slide and then there is a theoretical part, and depending on how many people
are awake I will keep on going depending on interest and time. So the practical aspect, let's
establish what current Hadoop is doing. Let's say you have a file in Hadoop. I am looking at the
HDFS, the Hadoop file system, and let's say you have a 640 MB file; you cut it up into blocks. Here I chose 64MB to be the block size; sometimes it can be larger, up to 256MB. Now you have these 10 blocks that represent your file. The standard thing is 3x replication, which means that every block is replicated twice. Now you have
these blocks; you distribute them over servers in your cluster and you are happy. The problem
with that is there is a very large storage overhead and as the data grows faster than the
infrastructure grows as I said this cost is dominating, so it is a very big problem. So Facebook
recently created a component that is open source that is called HDFS RAID that is a component
that works on top of HDFS which is used in warehouses in production, and what it does is after
it detects that the file, let's say our file here has been cold, so it hasn't been accessed let's say
for some period of time that you can set, then it deletes the replicas and creates these four
parity blocks instead. These are created by a 14, 10 Reed-Solomon code, so these are four Reed-Solomon parities, and now the replication of the file is lowered: there is a default replication that is 3X in Hadoop and that is lowered to 1X, and there is a separate file that is called a parity file that contains only the four parities corresponding to this file. HDFS RAID is a component that keeps track, for every file that is RAIDed, that's the word, of another file that is the parity file for that file, which consists of four parity blocks for every 10 blocks of the original. So that's the way it works in HDFS RAID. So when we compare HDFS with straight 3x replication to
HDFS RAID, you see of course this system can tolerate two missing blocks only and it has a
storage cost of 3X, whereas this system can tolerate four missing blocks and has a storage cost
of 1.4 X, so that is more reliable of course and it is using half the storage. The problem of
course is that in this system you cannot read in parallel; that is one of the problems. And that's
why they use it for data that is cold, archival cold data. And so HDFS RAID is the open-source system. Another paper that included a system from the CMU group was called DiskReduce, which introduced this idea. This deployment of Reed-Solomon at Facebook has, up to the last time I got information, saved around 5 petabytes. This is for a data warehouse cluster that was storing around 30 petabytes, so of course they did not Reed-Solomon everything; they were slowly increasing which files were going to be RAIDed and this
is the storage savings which is quite significant, 5 out of 30 taken out of storage is quite a
significant storage savings. However, Reed-Solomon has several limitations that limit its use. So currently in the data warehouses, and my case studies are all focusing
on data warehousing. When we talk about storing photos, storing windows, storing search
indices, they have different usage scenarios and so they may have different trade-offs, but for
data analytics, the main issue seems to be that they got to 8% or maybe 10% of the data being Reed-Solomon coded, and even from that they got big benefits as I mentioned before. Our goal would be to create new codes that would allow you to go to coding 40 or 50% of the data, hopefully, and the question is what is stopping you here. The first issue is maybe you
don't have enough cold data and that's not true. To the best of my knowledge most data
warehousing scenarios would have enough cold data to go up to 40%. Again it depends on the
scenario, but that is not the main bottleneck. If we could go up to 40% coded warehouse you
could save petabytes of course, which is a very important gain. The bottleneck, I want to argue, is what's called the repair problem, or different versions of the repair problem. So what is the
repair problem? Here I am showing a 10, 14 Reed-Solomon code and just these 14 blocks were
stored in different machines in the cluster. When you lose a machine, you have to repair it.
You don't lose a block; you lose a machine, so when you lose a machine you lose thousands of blocks. When the machine stops sending a heartbeat to the name node, as it's called in HDFS, then the name node says okay, maybe this guy will come up.
Maybe it waits 10 minutes, I don't know, depending on the configuration, and then after a
while it triggers a repair job, a MapReduce job that is repairing every single block that was in
that node that is dead. Of course it may still come back, but that is the way that it is working.
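As a rough conceptual sketch of that trigger logic (the names and the block map below are made up for illustration; this is not the actual HDFS code path):

```python
import time

HEARTBEAT_TIMEOUT = 600.0                          # hypothetical threshold, e.g. 10 minutes
block_map = {"node7": ["blk_1", "blk_2", "blk_3"]}  # toy map: node -> blocks stored on it

def repair_jobs_for_dead_node(node, last_heartbeat, now):
    """If a node has been silent past the timeout, emit one repair task per block
    that lived on it; in HDFS RAID each such task becomes a MapReduce job."""
    if now - last_heartbeat < HEARTBEAT_TIMEOUT:
        return []                                   # maybe this guy will still come back
    return [("repair", blk) for blk in block_map.get(node, [])]

print(repair_jobs_for_dead_node("node7", last_heartbeat=0.0, now=time.time()))
```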
So a MapReduce job starts for every block that is lost that needs to re-create that block. In
order to create that block if you have Reed-Solomon code and you use a naïve way of repairing
one block failure, you need to access 10 blocks to do that. At some other machine somewhere
you need to talk to 10 out of those 13, right, because you lost one, so you need to read 13
blocks, transfer 13 blocks in your data center, move them to this guy and this guy basically runs
a Reed-Solomon decoding procedure here that's basically a mapper that is running here, opens
streams to 10 others, runs the decoding, produces all of the 10 blocks, stores block three, so this is three prime, and three prime will now be called three, and then throws away the rest of the blocks. That's the way that Reed-Solomon is decoded right now.
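To make the cost of this naive repair concrete, here is a toy sketch; it is not Reed-Solomon, just a floating-point Vandermonde code in numpy, used only to illustrate the any-k-of-n property and that rebuilding a single block means reading and decoding from k surviving blocks:

```python
import numpy as np

k, n = 4, 6                        # toy (n, k) code; HDFS RAID uses (14, 10)
block_len = 8
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(k, block_len)).astype(float)   # k original data blocks

# Vandermonde encoding matrix: any k of its n rows are invertible, so any k coded
# blocks are enough to reconstruct the original k (the MDS "any k of n" property).
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
coded = G @ data                   # n coded blocks, one per row

lost = 2                           # the machine holding coded block 2 dies
survivors = [i for i in range(n) if i != lost][:k]   # naive repair: grab any k survivors

recovered = np.linalg.solve(G[survivors], coded[survivors])   # decode all k data blocks
rebuilt = G[lost] @ recovered                                  # re-encode only the lost one
assert np.allclose(rebuilt, coded[lost])
print(f"read {len(survivors)} blocks to rebuild 1 block")      # k-fold repair traffic
```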
There are some smarter ways of decoding Reed-Solomon that have appeared in information
theory recently, but they have several issues in being implemented. They don't have
very big benefits. We don't know the minimum possible repair for Reed-Solomon in particular.
We know over all codes what is the capacity, so I will talk about that. So the first question you
may ask is who cares? Is that a big deal? Are there a lot of repairs? So here we ask is repair
frequent. So this is a trace of a 3000 node Facebook data warehouse production cluster and
what you see here is a number of failures every day. You see these are the days here. So you
see that it seems like often there are 20, 40, something very funny happened here and it goes
all the way up to 100 out of the 3000 nodes are down. So that is quite a lot of failures
happening even on a typical day. Yeah?
>>: So how do I read this graph? So among the December 26 how many failures were total
here?
>> Alex Dimakis: This is the situation. It's not per day. It has higher resolution. I think it may
be per hour or somehow aggregated and this is the number of nodes that I think were not
reporting…
>>: So a failed node is counted a number of times before it's completely repaired?
>> Alex Dimakis: I don't know exactly the tools that they used to produce this how it is exactly
counted, but the key thing that we care about really is how many repair jobs are triggered.
That's the thing that we care about. There is a threshold that says okay, if the server is not
giving a heartbeat for a minute then they don't start the repair job. And I don't know exactly
how this tool is reporting a death; maybe it's reported dead after a minute and I don't know
exactly how. But the main point of this chart is that there are a ton of failures and there are a
ton of repair jobs. So here, let's say a typical node would be 10 terabytes, 15
terabytes thereabouts. So when you lose that guy you have lost 15 terabytes worth of blocks,
so when you have replication you have to go and copy all of those blocks so you move in the
network 15 terabytes for every node you lose, so let's say on a typical day you have 20 such
jobs happening. You have to move 300 terabytes in your network. The problem is now if this
node was Reed-Solomon encoded, you would have to move 10 times more data and you have
to read 10 times more data. So we estimate, by the way, that the total traffic in this cluster would be around 2 petabytes per day, and we estimate that 20 to 30% of the network traffic is repair traffic on any given day with the common configuration of 8% Reed-Solomon coded data. If you want to encode 50% of your cluster, then the repair traffic would
completely saturate your data center network, so that's our estimate right now. It would be
completely impossible; all your system would be doing is repairing itself at this rate of failure with this data. So that basically says it's impossible to go up to 50%, because you would have a ton of repair traffic in your network. So to clarify one point, the point of that is that this is an important problem and a probable bottleneck. In order to talk about
repair I need to distinguish between two concepts. You remember here I lost block number
three; we are calling it three and now I want to repair that block, so there are two notions that
you can think of here. Exact repair means that the block that I recover is exactly the block that I
lost. There is a looser concept called functional repair, which means that the block I repaired is
not the block I lost but it is a new block that gives me still any k out of n guarantee, but even
though this was a systematic block which means data, now this can become a new parent, so
this problem, the functional repair problem is a special case of exact repair and it's a much
easier special case of exact repair information theoretically. I just need to have yes?
>>: [inaudible] hit on the latency if the cold data is read?
>> Alex Dimakis: Of course, exactly. If you are in a situation that is purely archival and you
almost never read, you may be okay with functional repair, so functional repair is a much easier
problem than exact, so we have explicit practical very simple codes that achieve the capacity
for this problem. As I will say, even after so much research we don't really have practical codes that achieve the capacity for the exact repair problem in some cases, so if you are willing to
tolerate the problem that you will be losing systematic blocks and you will be replacing them
with linear equation parities, so you have latency, read latency problems, if you are willing to
take that then there are codes that you can use that are called functional regenerating codes,
very, very simple things. If you are not willing to tolerate that which is mostly the case, then
we have to talk about exact repair. There are papers that talk about peer-to-peer backup using functional repair, but I don't know; it depends on the application. I just wanted to
make that distinction. Now there are different metrics that we care about. Observe here again
that there are different things that happen. The newcomer, let me go back to where I was. So
this guy is called the newcomer. The newcomer is the node that hosts the repaired block. The first metric we care about is how many bits traveled in the network in total; that is called the repair bandwidth. The second metric we care about is how many bits were read
from the discs. This can be different, so there are some schemes that read from the disk, do
local processing and then send data, so these schemes are not very convenient if you're using
MapReduce, so the implementation is more complicated, but disk I/O and network can be
different. And the third metric that we care about is how many nodes this guy talks to. This guy right now talks to 10 nodes. Even if you read only a little from each node, that is a different story; for this metric we only care about how many nodes you contact. So there are these three metrics. The first one I
said is the number of bits communicated called repair bandwidth. The second is the number of
bits read from the disk during a single node repair called disk I/O, and the third metric is the
number of nodes accessed to repair a single node failure which is called locality of the code. So
let me tell you a little bit; this is April 2012, right? I give different talks and I want to make sure that what is open and what is closed is up-to-date, because there is a lot happening; yesterday for
example, there was a paper on this, so there are a lot of things happening. For this problem,
for the repair bandwidth, the capacity which is the minimum possible information theoretically
over all codes is known. It is known not for the whole region, so as I will briefly discuss later in
the theory section of my talk, there is a whole tradeoff between storage and repair
communication. The capacity is known to be achievable for the two extreme points only, but
we do know what is the best possible for this case. We do not know the intermediate points
and more important for practical use, we don't really have high-rate practical codes known for the most useful point, which is the minimum storage point, so that problem is an important
open problem and I think there are no practical deployable codes for this. For the second
problem, there are codes that are near optimal here, but we don't have an exact capacity bound, so it's an active area. For the disk I/O problem, the capacity, the minimum
information to be read is open, is unknown. There is an obvious bound that you can give that
says you will never be sending things you never read. So an obvious bound is that the repair
bandwidth is always a bound on the disk I/O, but can you actually match the disk I/O to be
equal to this capacity? We don't know if there exist codes that can read exactly as much as
they transfer and repair all failures. There is again some work on this but the general, both the
capacity and the [inaudible] codes are open for this problem. Is it clear? Does anybody have
any questions? The third problem is the problem is a simpler metric which doesn't care how
much you read or how much you transfer; it just cares if you talk to a node or if you don't, so I
just want to keep the repairs as called localized, so I will like single block failure to correspond
to a node talks only to three other nodes. So this is called locality. What is very recently the
capacity for scaler codes was discovered by some people in this room [inaudible] so I will talk
about this, and there are very few explicit code instructions for specific cases and we have one
and I will talk about that. So this is sort of the landscape of the metrics that people care about;
depending on the problem people care about different metrics differently. I'm going to talk
about the practical implementation we did now and then we have a whole theory discussion if
you want to learn the bounds and the codes for each one of those things. Are there any
questions for this? Maybe? So let me just move to a little bit about what is practically deployed, and we will get to the theory later. So this is a real system that was deployed at the Chinese University of Hong Kong. It was using a regenerating code and it was built by [inaudible], and it was running on eight nodes, and then there was a bigger version that was published in FAST this year that was running on a bigger cluster. This is how failures are introduced into the system, so this is me
cutting off one of the discs. Okay, it's not true. This was one implementation that used a
specific code that I can't tell you about. The main problem with that code is that the disk I/O
was very high, so for the Hadoop scenario it wasn't exactly a practically useful code. Let me tell you a little bit more about the Hadoop MapReduce infrastructure and
the implementation of one specific new code that we did in Hadoop. Hadoop and MapReduce
the way I understand is basically that it is the open source version of the Google file system and
the Google MapReduce framework, but it is open source and free. It consists of two
components, the file system which is called HDFS and the data analytics processing system that
is called MapReduce. The software is called Apache Hadoop MapReduce. So hundreds of
companies are using Hadoop and tens of startups are developing tools for Hadoop so it is a
very, very exciting area. There are a lot of buzzwords happening. The BigData buzzword now I
think is basically founded on these tools. In particular, the tool that we care about is HDFS RAID
which I talked about a little bit. It is a Reed-Solomon add-on module that sits and works with
HDFS. It exists as open source. You can download it. It is not very user-friendly, so you have to fix configuration files and it won't quite compile, so you need to hack it up a little bit. If anybody wants more information on downloading HDFS RAID, send me an e-mail; we have spent a month trying to make it work, but it's functional and it's used in production; slightly different versions are used in production by Facebook, Alibaba and some other companies. So
the currently available HDFS RAID basically has two modes. It says okay, either 3x replication or Reed-Solomon with any k and n you want, usually it would be a 5, 7 or a 10, 14, or one of those codes. Or a single parity code which basically deletes the third replica and replaces the third replica with one XOR, so two replicas, one XOR; that's one other mode that you can run. What we did is we implemented what is called a locally repairable code,
which is a new code into the system, into HDFS RAID and our HDFS RAID with our new code is
publicly available, so if anybody wants to play around with it, we would be very happy to share it with you; it is on [inaudible]. So let me tell you how our code works, this locally repairable code.
So the Reed-Solomon, let's focus on a 10, 14 Reed Solomon, for example; it takes 10 blocks and
produces four parity blocks using a Reed-Solomon equation here. By the way, the encoding is
done once the name node detects that a file is going to be RAIDed, there is another program
running called the RAID node that says I'm going to take that file, delete the replicas, run a
MapReduce job that erases blocks and produces these Reed-Solomon blocks. Here is the first
idea. Let's just add a simple XOR. Let's just take these five blocks, XOR them together and
produce a new block the same size as these blocks and have this extra block and install it
somewhere X1. Let's do the same thing here and let's do the same thing here. Now the first
observation now is that let's say I lose this block number two here. If I lose this block then I can
repair this block by just reading the other blocks in the XOR and the XOR, very simple thing,
simplest thing in the world. So this one is a local XOR, for example; it is a very simple idea. Local XORs allow single block recoveries by transferring only five blocks, which in
the example is 320MB instead of 640MB. The problem of course with this idea is that now you
are storing more, so now you are storing 17 blocks compared to storing 14, so your storage
overhead goes up. Is there anything cool that you can do now that you are coding? So you can do the following with this idea. First of all, before XORing
these together, you could multiply, think of these as elements of a finite field, you can multiply
by any elements in the finite field that you want, C1, C2, C3, C4, C5 as long as they are not zero
and then still if you lose one of the blocks you can recover by RAIDing the other five, no
problem as long as they are not zero. So I have the freedom to choose any coefficients I want. I
say here any linear combinations as long as I can invert these coefficients and repair locally. So
here is now a strange thing. I can choose any coefficients here and any coefficients here and
any coefficients here, so what are the best coefficients I can choose? One idea was to just XOR
them together which corresponds to using ones, but is that optimal? Is that providing the best
fault tolerance in the system? For example, how many failures can I tolerate here? I can
definitely tolerate four; you can see that, right? Just by using these I can tolerate four. I don't
even need it. But can I tolerate five? It is not clear to see from this construction whether you can tolerate five erasures or not. The thing is, depending on how you choose these coefficients, the distance of the code, the fault tolerance of the code, can change. So how you choose
these coefficients in an optimal way, we don't have a general polynomial time method of doing
that. You can do it; we have a way that does it that says if we choose these coefficients
randomly in a large finite field we can bound the probability that this code is not matching the
best distance possible, or we could check exponential many subsets, and for a 10, 14 maybe we
can check, but in particular, what we did is if we choose this, you see, these coefficients and
these coefficients here, these are ones now but they don't have to be ones. These coefficients
and these coefficients depend on how you choose these coefficients. So if you set these
coefficients to be Reed-Solomon of a specific type, extended, with a specific finite field and a specific primitive polynomial, then we proved in our paper that if you choose these to be all
ones, then it is optimal. But in general we don't have a general construction for choosing them.
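A minimal sketch of just the local-repair idea described above, with plain XOR local parities over byte blocks (the coefficient optimization and the actual HDFS RAID layout are not reproduced here):

```python
import numpy as np

block_len = 16
rng = np.random.default_rng(1)
group = [rng.integers(0, 256, block_len, dtype=np.uint8) for _ in range(5)]  # 5 data blocks

# Local parity for the group: XOR of the 5 blocks (i.e. all coefficients equal to 1).
local_parity = np.bitwise_xor.reduce(group)

# A single failure inside the group is repaired from the 4 surviving blocks plus the
# local parity: 5 reads instead of the 10 a (14, 10) Reed-Solomon repair would need.
lost = 2
survivors = [b for i, b in enumerate(group) if i != lost]
rebuilt = np.bitwise_xor.reduce(survivors + [local_parity])
assert np.array_equal(rebuilt, group[lost])
```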
So here is another cool thing you can do here. You can choose these coefficients, these and
these and these so that this funny business happens. If I take this local parity and this local parity and I XOR these local parities together, I get this local parity, so it turns out you can do that, and if you do that, do you see now what I can do? We call it an implied parity. I don't even have to store this anymore, because I can XOR these two together and get it, so how do I repair? I don't store X3, so when I have a failure, when this node fails, I am going to read its friends one, two, three; I am going to read these XORs, XOR them together, produce the implied parity, XOR the implied parity with the other guys and get back P2. So I can still repair, and now you can verify that the locality of every
single block is five; before the locality of those was four. But now the locality of everything is
five. So this is the code we implemented in HDFS. So single block failures, all single block
failures can be repaired by accessing 5 blocks versus 10 in the naïve Reed-Solomon. We store
16 blocks total, so we have a little bit of storage overhead, 1.6 versus 1.4 and we implemented
this system and we tested it, and I want to tell you what the tests look like on two different
systems. Here I say in general, choosing the coefficients is nontrivial, must check exponentially
many subsets for linear dependencies. It's an open problem to do it in polynomial time,
explicitly. And one lemma that we have is that for the 10, 14 which is apparently deployed at Facebook, using all ones works, with this specific choice.
So our Java implementation is fairly simple. All you need is an encode function and a decode function; this is the whole encode function of the Reed-Solomon. It is a
very simple thing. You just need the arithmetic in a finite field and then you need basically to
hack a few other things so that Hadoop, when it has a failure, doesn't talk to everybody,
because the component that decides who talks to whom during a failure is not in the decode
function, so we had to hack that, and the decode function was very easy to change.
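The finite-field arithmetic he refers to really is small. Here is a hedged sketch of a GF(2^8) multiply and a systematic encode against an arbitrary coefficient matrix; the coefficients in the usage line are placeholders for illustration, not the generator actually used by HDFS RAID or by the code in the talk:

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8), reduced by a primitive polynomial (here x^8+x^4+x^3+x^2+1)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return p

def encode_parities(data_blocks, coeffs):
    """Systematic encode: parity j is a GF(2^8) linear combination of the data blocks,
    applied byte by byte; coeffs has one row of k coefficients per parity block."""
    parities = []
    for row in coeffs:
        parity = bytearray(len(data_blocks[0]))
        for c, block in zip(row, data_blocks):
            for i, byte in enumerate(block):
                parity[i] ^= gf256_mul(c, byte)
        parities.append(bytes(parity))
    return parities

# Toy usage: 4 data blocks and 2 parities with made-up coefficients.
data = [bytes([i] * 8) for i in (1, 2, 3, 4)]
print(encode_parities(data, coeffs=[[1, 1, 1, 1], [1, 2, 4, 8]]))
```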
So this is our name node, as it is called, running; this was January of this year. The version of Hadoop running is ours; USC three-XOR is the name of the code that I told you about, a not very sexy name. We need to find a better name. And we had [inaudible]; this was 50 nodes
initially but we were killing nodes so at this point there were 37 nodes. This is on Amazon and
we had uploaded I guess 100 GB and we were keeping track of how long it takes to repair and
how much network [inaudible]. So the total experiment involved 100 machines on Amazon EC2, in one of the experiments. 50 machines were running HDFS RAID, the Facebook Reed-Solomon version, and 50 were running our crazy code that I just described, the three-XOR one. This is
what is now called locally repairable code, LRC, okay? The name, we will have to find the name
that works. So we uploaded say 50 files, each 640 MB on Amazon and we were killing nodes
and measuring network traffic, disk I/O, CPU, how long it took to repair and all of those things.
This is a measurement, so here for example we kill one node and you see the Facebook cluster is creating this much network traffic and our cluster is creating around half; this is the red line. The repair duration was roughly the same in that case. Then we killed one node again. Then
we killed one node again; don't ask me why there is variability because I don't know. There
were just a lot of things going on. And many things like Hadoop is doing all sorts of strange
things every now and then that we didn't really understand, but they were orthogonal to our
experiments. And then we went to lunch; this is lunch break. Then we killed the two nodes,
three nodes, three nodes, three nodes and two nodes again, so what is the message of this?
The interesting thing is when we killed a lot of nodes it also finishes faster, you see? So the red
guy is done repairing where the blue one, the Reed-Solomon is not done yet. Yes?
>>: [inaudible] Facebook [inaudible] code?
>> Alex Dimakis: I'm using exactly the HDFS and the HDFS RAID used in the Facebook cluster, but this is not an experiment in their cluster.
>>: Are they using open source?
>> Alex Dimakis: Yes.
>>: They contribute that?
>> Alex Dimakis: Yes. So everything is open source, I mean the Hadoop part, so that we have
another experiment in the Facebook cluster, in a small Facebook cluster, in the test cluster, but
it's very small. We were just testing, comparing the numbers we got from Amazon to the
numbers we got on this, but it's not that big of a scale, so the files were bigger but we didn't
have 50 nodes.
>>: So for the single [inaudible] case [inaudible]?
>> Alex Dimakis: I don't know. I think that is because the time was dominated by other things.
It wasn't dominated by this guy over here.
>>: [inaudible].
>> Alex Dimakis: Honestly, I don't know exactly what is going on, but when we saw this, when
we were loading more and more data and when we had more and more failures, the time it
took was faster for our system [inaudible] was slower of course, but the time it took was not
something that we understood at this point.
>>: [inaudible] something is wrong with it…
>> Alex Dimakis: Yes, there are a lot of things that were strange here. This is just a plot of CPU
utilization, so Amazon allows us to measure that. Our code, so for example our code for a
simple one failure is not decoding Reed-Solomon, just doing XORing so we expected it to be
much, much smaller for our case compared to the Reed-Solomon, but we don't really see that
and probably what's happening is the CPU is not dominated by the decoding anyway. The CPU
is, all that this says is that it is roughly the same, so the CPU is roughly the same. In theory we
expected it to be much faster. The decoding is much faster but there are a lot of other things
that are going on that probably dominate the CPU anyway, at least at this scale. This is the
most important thing. It says that for the Facebook cluster, which again, I say it is an Amazon
cluster running Facebook software, the bytes read, the HDFS bytes read during recovery from
data loss is more than double. So disk I/O is the blue thing here. This is single node failures.
These are, so these are three node failures, oh there it says. Okay great. So this is one node
failing, 16 blocks lost. One node failing, 17 blocks lost. One node failing, 14. Here we lost three
nodes, three nodes, two nodes, two nodes. Consistently it's 2.5, 2.6X less, and that is very reasonable; in theory we can compute exactly what it is, because we read five blocks and their construction actually reads 15. It should read 10 but the actual implementation does not; it opens streams to all 15. That was the way it was done in open source. Now they've fixed that,
so now they open only to 10, but that version was opening 15.
>>: With more than one failure then you would read more than five so [inaudible]?
>> Alex Dimakis: Okay, very good. With more than one failure there are different things that
can happen, okay, good point, very good point. Let's look at this. So this is the code that we
have. Let's say that two nodes fail. If two nodes fail, first of all let's call this whole thing a
stripe. Two nodes failing could influence only one block per stripe in which case no problem. If
two nodes failing influence two blocks in one stripe, there are two things that can happen again. Either, for example, block two and block seven are lost, in which case the local repairs work, but if block two and block three fail then local repairs do not work. So the way we
implemented this is what's called a light decoder. So there is a light decoder that tries to repair
just from the locals. If it can't do it it throws an exception and says I give up, and then the
standard Facebook Reed-Solomon decoder takes over, and we made sure that the Reed-Solomon decoder was still their own Reed-Solomon decoder, because we didn't want to make a less efficient or more efficient Reed-Solomon, so we compare exactly the same thing. Okay, so basically we introduced a light decoder which sits in front of the standard procedure. But most
two failures would still be either influencing only zero or one block from each stripe or if it was
two it was quite common that you got benefits. So there, it depends on how much a stripe is
influenced. So that's why it's not obvious to measure the benefits theoretically; but these were the observations. The new storage code reduces the bytes read roughly by 2.6X, both in theory
and measurement. The network bandwidth consumed is reduced by approximately a factor of
two. The disadvantages: we use 14% more storage, and the CPU we thought would be better, but we didn't measure that, so it's similar. So the other interesting thing is that in
several cases we were 30 to 40% faster in repairs, especially in the larger scale repairs or the
larger scale systems that we tested, and that is important because that increases your actual
availability, so it increases the [inaudible] and the MTTDL, you know, the mean time to data loss. In the theoretical analysis you can say you get a ton more nines of availability from this
code compared to Reed-Solomon. The gains, that is the other thing that I wanted to say, is that
if you are willing to use bigger codes, so 45, 50 LRC would give you super good storage
efficiency, very close to one, reasonable locality, you read seven blocks or something in order to
repair one and the benefits in availability would be incredibly high because Reed-Solomon you
can't really, you can't really get a big Reed-Solomon because then you would have to read 50
blocks to repair one, so you would have 50 X, so it's a disaster. So one important conceptual
point that these locally repairable codes allow you to do is to have, to go to big codes for the
first time, and going to big codes can give you a ton more availability and a ton better storage
efficiency. Yes?
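A minimal sketch of the light-decoder idea described above: try the local XOR repair first, and give up (so a global Reed-Solomon decode can take over) when more than one block of a local group is missing. The data layout is a toy, not the HDFS RAID one:

```python
import numpy as np

def xor_blocks(blocks):
    return np.bitwise_xor.reduce(list(blocks))

def light_decode(group, local_parity):
    """Light decoder sketch: repair within one local group if exactly one block of the
    group is missing (None); otherwise give up so the global Reed-Solomon decoder
    can take over."""
    missing = [i for i, b in enumerate(group) if b is None]
    if len(missing) != 1:
        raise ValueError("light decoder gives up; fall back to the global decoder")
    survivors = [b for b in group if b is not None]
    return missing[0], xor_blocks(survivors + [local_parity])

# Toy usage: a 5-block local group plus its XOR parity, with block 2 lost.
rng = np.random.default_rng(2)
blocks = [rng.integers(0, 256, 8, dtype=np.uint8) for _ in range(5)]
parity = xor_blocks(blocks)
idx, rebuilt = light_decode([b if i != 2 else None for i, b in enumerate(blocks)], parity)
assert idx == 2 and np.array_equal(rebuilt, blocks[2])
```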
>>: [inaudible] benefit from the [inaudible] because you're repair node could become a
bottleneck when you are reading 10 pieces into the same node.
>> Alex Dimakis: Let's understand that part. Several people asked us about that, so let's
understand this. You have to think of how this thing actually works in a real system. So there are many nodes. A node is a bucket that stores thousands
of these blocks from different stripes. When you lose a node all of these blocks have to be
repaired. They are not all repaired in one place. There is a MapReduce job that says this block
will be repaired there, this block will be repaired there, so there is a placement policy that HDFS
is running and depending on the placement policy it will, so there is no issue of everything
being read in one point because you see…
>>: [inaudible] for say [inaudible] you look at repair of just one file. Is that on a single node?
>>: No. The blocks in the file are spread out [inaudible].
>> Alex Dimakis: The block, so the…
>>: Blocks, it schedules a task for each block that is lost. It will determine if the blocks are
under-replicated and make a MapReduce job per block regardless of which node it is on.
>> Alex Dimakis: Yes, but per block is right. Per block, you read 50 blocks; that's true. Yeah,
that's true.
>>: [inaudible] per block for five. Let's say five only one block [inaudible] so [inaudible] traffic
on that, maybe it's the one that [inaudible] current generation, so even if you consider one
file…
>> Alex Dimakis: No, but that's what I'm saying. So you are absolutely correct that per block,
one block is being repaired at one node and that block needs to talk to five others. If it needed
to talk to 10 others that would be double the traffic arriving here, fine. That is true. However,
this does not scale because as we go to thousands of nodes and thousands of blocks, the code
still stays at 10, 14. So now if you had indeed a very large code, say you had a code that was a 50, 55, then if one block had to read 50, perhaps that would be a problem, but it does not grow as you scale the nodes and scale the files. That's orthogonal, because the files are distributed and every node is doing a little local block repair. So per stripe you are correct; if per stripe a degree of 50 would be a problem, then yes, but that degree does not scale.
That's all I'm saying. Okay good. Other questions? Yes?
>>: I’m missing why the read I/O and the network are different.
>> Alex Dimakis: Okay, because the network is TCP/IP and God knows what it does. So I don't
know, so the way that the network is happening is Hadoop, so HDFS RAID opens a stream and…
>>: Is it just overhead or transmission stop?
>> Alex Dimakis: There are a ton of things that are happening. It's not explained by the theory.
So I don't expect these things to, as we grow I don't expect these things to become very
different. I think they would become a constant factor, but in practice, that is roughly 2X versus
2.6 X so this difference I have no idea what is going on because it opens TCP/IP streams and
God knows what it's doing. And then we were killing nodes at the same time, right, so there
are a lot of things going on.
>>: It looks a lot like it's reading all 13 copies. That would come out to 2.6, but that would
come out to 2.6 for the I/O's as well.
>> Alex Dimakis: Yeah, but you see the network was not consistent at all, so…
>>: [inaudible] right so it could balance out?
>> Alex Dimakis: Yeah, but there is a lot of strange things. This is what Amazon calls network
in. The network there's something else that, so this is the sum of the arriving network traffic at
each node. The network out was slightly different. I don't know exactly what the Amazon monitor is measuring, and there are a lot of crazy things that Hadoop is doing. Sometimes for
example, Hadoop might send, so this measures for example, the pings, right; it measures the
heartbeats. It measures many other things that are not just our own business, so there are a
lot of things that are running in parallel to us. Yeah?
>>: If you subtract the number of times the light decoder failed to recover so you had to fall
back to the…
>> Alex Dimakis: Ah, good question. We can probably get that from our traces. I don't have it
now, but I know that just from this, this was very clean. So HDFS by itself was much cleaner
than the network and this is reported by the way by Hadoop. It's not reported by Amazon.
From this we can estimate, and most of the time it was basically the light decoders. In the
cases after we have killed a lot of stuff, our cluster was getting smaller and smaller and then it
was more likely that three failures would kill two things from the same stripe, but most of the time it was the light decoder; and if you have a big cluster then it's very uncommon that two failures hit the same stripe.
>>: Looking toward [inaudible]…
>> Alex Dimakis: No. No. [inaudible].
>>: How was it reported by Amazon? Is it per [inaudible] or application?
>> Alex Dimakis: There was a tool that you can get aggregate or per VM. [inaudible] as they
call it. Yes?
>>: Do you have any measures of liveness or usability of the cluster during these processes? I
know with replication level 3 they use it both for recoverability of data as well as scaling apps, so
they will choose nodes to be as close to the data as possible and you can run more jobs
touching the same data this way.
>> Alex Dimakis: You are saying if you run another MapReduce job at the same time?
>>: Right, if you have actual workload, right. So this is a system sitting idle and things just
dying.
>> Alex Dimakis: Right.
>>: And self recover and nothing happened.
>> Alex Dimakis: That's right.
>>: Now user workload will force if you have any kind of metric like TeraSort it will actually eat
bandwidth, so was it a meaningful, how much bandwidth was being used is now meaningful to
task overhead as well as the disk reads and so with HDFS triple replication the data is live while
it's being recovered. In this case isn't the data dead while it's being recovered? Don't you have
to block tasks?
>> Alex Dimakis: So it depends. All are very good points. First of all we never, as you saw we
never compared replication to any of the coding schemes here. Replication is much faster if
you run another MapReduce job at the same time; of course it can read other things. So we
think of it as a multilayer system. For hot data it is 3x replication. If you decide to code, then we are just comparing the coding options here, so that's the first step. That's why we never compared to replication, because that is a different layer of the abstraction. Now between codes we block fewer things. We talk to fewer nodes most of the
time. We create less network. We measured, we did a few small experiments where we were
running another job at the same time and then we killed nodes and saw the repairs, and we did
see the big benefits of Reed-Solomon compared to, sorry LRC compared to Reed-Solomon, but
they were super noisy. We have a few plots on that but so far we haven't included them, but
yes, certainly you are right. The other job will create traffic in this I/O and it will take longer to
complete if you are doing a repair, but you need to repeat that many times because sometimes
you kill something and it is not being touched by the other job, so there is a lot of averaging you
have to do in order to get nice plots and we didn't do that at that scale, but yes, you are
absolutely correct. Did I answer all of your questions?
>>: I think, yeah. I think there's still a question of the practicality of it, of both implementations
in terms of we don't see that it even does block, like the jobs could try to read it and just fail to
read blocks and just fall apart.
>> Alex Dimakis: Yes, but that is, there is…
>>: That is something, you would have to see the application actually running with a live--we
can see that the idle system consumes fewer resources and theoretically things should be
higher throughput but we can't see that the implementation is sound enough to carry real
traffic. So that's something you just have to mess with.
>> Alex Dimakis: So HDFS RAID is running in production, so it has all of the mechanisms to protect against what you are saying, and the hacked version that we have should do the same. How much better it does, I haven't consistently measured, but we expect it to be better. So these are the conclusions of that. Yes?
>>: [inaudible] minimum system scaling does a number of nodes, right? Because if a node is
failing you are repairing portions on different nodes, right?
>> Alex Dimakis: Right.
>>: And these nodes are doing other work. They are serving other systematic pieces for the
other traffic order so if you don't have enough nodes, each node will have too much burden of
the repair and I guess that was a question.
>> Alex Dimakis: That is true, but what this is, this is more true for Reed-Solomon compared to
LRC, right?
>>: Yes, but if you go from 3X replication to [inaudible] repair, there you are increasing the
minimum [inaudible] threshold of system scale because you need to do more work for repair so
for the same amount of storage failed at the norm. You need to spread it out to more nodes.
>> Alex Dimakis: Yeah, but the system scale is the least of my worries here because…
>>: [inaudible] lower bound.
>> Alex Dimakis: Yes, but there is 33,000 nodes is easily in a cluster.
>>: 33,000 nodes will not be on the [inaudible].
>> Alex Dimakis: Not 33, three [inaudible], yeah.
>>: So you have to also look at same [inaudible] because as you go up the hierarchy and the
network is oversubscribed so repair traffic becomes a bottleneck.
>> Alex Dimakis: So in our case repair traffic is a bottleneck already.
>>: If you have inter-cluster repair traffic that will become even more of a bottleneck; what I am
saying is you have to have the repair localized to islands of high interconnect bandwidth so that
the network does not become the bottleneck.
>> Alex Dimakis: So placing things and repairing locally in a multi-hierarchical, rack-aware or cross-data-center setting is very interesting. Right now, to the best of my knowledge, there are very few systems that do a cross-data-center Hadoop. We are very interested in this but we haven't
worked on it, so, but definitely that is the next step to worry about, yeah.
>>: So much like read or write [inaudible] have a percentage of which the threshold for using
the read or write [inaudible] versus the [inaudible] section is no longer worth it; you just go to
the critical section. It seems like for this particular algorithm the correlation between machine
failure and rack failure there is a rate at which if machines and racks go down as one unit, like
maybe they share a power supply, the Reed Solomon is the better approach versus trying to do
local because local would fail every time if you take out your local groups and…
>> Alex Dimakis: No, no. There are two things. There are two notions of local that we need to
distinguish. Local you mean placing blocks in the same rack.
>>: By local I am assuming that the five is…
>> Alex Dimakis: In the same rack.
>>: Is fairly local, that supposes for all of them.
>> Alex Dimakis: Okay. That is not true in the current implementation. I can see why you may want to do that. Right now the default placement is random for 3x replication. Actually for 3x replication it tries to keep two copies in the same rack and one across, but when they code it, the Reed-Solomon blocks are across racks, so each one of the 14 blocks is in one of 14 racks, and we think that is what you want if you want to maximize reliability. We didn't hack the placement. Right now the implementation is just placing 16 blocks in 16 different racks, so that
is what it will do now. If you want to play around with placement you can do that, but we
haven't gotten there and I think it is a messy world, but sure you may get benefits, but it
depends exactly as you say, how often do nodes fail? How often do racks fail? And how much bandwidth you have: the in-rack is a one GB switch and the cross-rack is a four GB switch. If they were much different, then again you have to decide. Placement, agreed, is another business we have to talk about. Right now we are spreading it maximally.
>>: It seems like you should be able to compute the costs of the local versus the cost of the
Reed-Solomon and then determine--there is a rate of machine failure at which the local is no
longer worth doing. 99% machine failure…
>> Alex Dimakis: Compared to rack failure and also network and in the network.
>>: [inaudible] crossover here.
>> Alex Dimakis: Yes, I understand.
>>: And I think it would be interesting to have that--I mean this seems mostly theoretical in
terms of where it's coming from…
>> Alex Dimakis: Yeah.
>>: So it would be an interesting value to have. There are things like Google who built the GFS
on things sitting on pizza boxes, like little pizza boxes. Not something like the concept of a rack
box, but physical pizza boxes, when they first made GFS. I mean the probability of the place
catching fire and everything going up is much higher, was very high at the time, so it would
be interesting to have an actual value [inaudible] as people design their data centers they
could, they could sort of pick and choose.
>> Alex Dimakis: I think it's interesting. I actually have one student who is looking in this
direction so we may ping you more on it. Okay. So in the interest of time and as I correctly
predicted this is the back to theory slide, which is finishing, so I am done [laughter].
>>: I have a question as it relates to reliability.
>> Alex Dimakis: Yes.
>>: Can we go back to the code.
>> Alex Dimakis: Yes.
>>: So this code I understand I mean of course it can basically fix four failures. Can it basically
fix all five failures, for example a failure of six…
>> Alex Dimakis: That is a very good question. The 10, 14 provably cannot. So there is no 10, 14 with this locality that can correct five, but if you change the numbers a little bit, and that was the theory part, there is a theory that tells you exactly how many: tell me your locality, tell me your global parities, and I can tell you what's the best distance you can get. For this case, no, we cannot tolerate five, but if you have a slightly different Reed-Solomon, I think it was 9, 4, I don't remember now, if you have 9 and 4 parities then these would add to the distance, so definitely there are cases where they do.
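For reference, the locality-versus-distance tradeoff being invoked here is presumably the known bound for an (n, k) code whose information symbols have locality r; this exact formulation is my assumption, not something stated on the slide:

$$ d_{\min} \;\le\; n - k - \left\lceil \frac{k}{r} \right\rceil + 2. $$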
>>: I just want to comment. I think Cheng's is basically our code. We have basically distance-maximized [inaudible] codes.
>> Alex Dimakis: Yes, very good. So I think I will stop here because well I am out of time. So
thank you very much.
>>: These are all Byzantine faults, right?
>> Alex Dimakis: This is worst-case, yes.
>> Cheng Huang: Okay, let's thank our speaker. [applause]. Welcome to the second part of the talk. This talk will be given by Dimitris. Dimitris is currently a third year PhD student at USC in Alex Dimakis’ group. He will tell us, now that the data is reliably stored, how you can efficiently do analytics on top of it.
>> Dimitris S. Papailiopoulos: I am going to talk about large-scale sparse PCA, so I am going to
give an introduction to the technique and present a new algorithm for it that runs on big data sets, specifically on Twitter analysis. This is joint work with some collaborators from USC and my previous advisor from Greece. So I will talk about sparse PCA, what this tool is about and why it can be useful, and then I am going to tell you why sparse PCA is an intractable problem, and then I will introduce a new approximation algorithm which we think is suitable for large-scale problems, and then I will introduce a framework that is good for Twitter analysis which we call Eigen Tweets. A very short overview
for principal component analysis. It is basically a dimensionality reduction tool that is used for
clustering [inaudible] and applications like that. Sparse PCA is a variant of principal component
analysis which is in particular useful when we care about interpretability which I'm going to
explain later. Then I'm going to present the algorithm that we have for sparse PCA which is
really fast when we tested it on large data sets. I will basically explain the techniques on the
Twitter model, but we can really consider any data set. So we tested this on
tweets. Tweets are sentences that are comprised of very few words, like five words, a small
number of words. What we do is we generate a very long vector where each index corresponds to a word, and whenever a word appears in a tweet I put a 1 there. This is a very long vector; for example, its length is about 50,000 in the data sets that we are testing, and the
vector is super sparse, so it has like 5 or 10 nonzero entries. This is our sample vector of length
50 K. What we do is we collect a bunch of these tweet vectors into a big matrix. This matrix S, the sample matrix, contains all of these vectors. Each
vector is a tweet and each row corresponds to a word. Wherever there is a 1 it means that this
tweet for example has this word. What we want to do is we want to find a sparse vector that
closely matches my data set, so I want to find a vector that is really close, a tweet that matches
the tweets in my data set. So what is the metric that I care about? It is the metric of the sum of
projections. I have this vector and I take inner products with my data set, so the inner product with a tweet in my data set tells me how well this vector matches each vector in my data set. So I want to maximize the sum of these projections, these inner products. So I want to solve this problem here. So the way we can solve it is principal component analysis. The first step is we create
an empirical correlation matrix R, which is produced by taking the inner products of S with itself. Basically the (i, j) entry of R counts exactly the number of times word i appears together with word j in my data set. If I want to find the vector that maximizes the projections, which is this problem, this problem can be cast as the maximization here, so instead of maximizing the norm of X transpose S [inaudible] squared, I can just maximize X transpose S S transpose X, and this is this quadratic form here.
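Putting the objective described so far into symbols (a restatement in my notation, with $s_i$ the i-th tweet column of the sample matrix $S$ and $R = S S^{\top}$):

$$ \max_{\|x\|_2 = 1} \; \sum_i \big(s_i^{\top} x\big)^2 \;=\; \max_{\|x\|_2 = 1} \; \|S^{\top} x\|_2^2 \;=\; \max_{\|x\|_2 = 1} \; x^{\top} R x. $$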
Now this problem we know how to solve. The solution to this is the top eigenvector of my
empirical sample correlation matrix. So SVD can give me a solution to this problem in polynomial time; the [inaudible] complexity is n squared, but it can be faster depending on the sparsity of my sample matrix, that is, on how many nonzero entries my sample matrix has. Now the problem with PCA is it doesn't actually solve the problem I
want solved. The problem is that the top eigenvector, the solution to this optimization problem
is going to be a super dense vector, so the solution of this maximization here is going to be a
vector which has nonzero loadings [inaudible] and each entry corresponds to a specific word.
Now having a dense vector is not good because this is supposed to be a tweet and a tweet that
consists of thousands of words is not interpretable. It doesn't make any sense. So I would like to have a super sparse vector that does the same job, that maximizes this inner product. So I would
like to have a sparse vector because sparsity means interpretable, so I would like to have a
vector like that. I would like to have a vector that had entries like strong, earthquake, Greece
and morning which would indicate that the main topic of a particular tweet data set is that
there was an earthquake in Greece, for example. So this is a sparse vector is much better than
a dense vector in terms of interpretability. I want to solve the same problem with the other
constant of sparse, so I want to maximize this quadratic form subject to the L2, the [inaudible]
constraint and the cardinality constraint, so this means that I want to find the vector that has
only k nonzero entries. In practice I am going to just enforce it to be five or six or something
like that, a really small and constant number with respect to my problem size. This constraint
theory is exactly what makes my problem NP hard. So this is a cardinality constraint which
makes the problem intractable. Sparse PC is NP hard. So there have been introduced many
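For reference, the optimization just described can be written out as follows; this is my transcription of the problem being discussed, with R = S Sᵀ the empirical correlation matrix and ‖x‖₀ denoting the number of nonzero entries (the notation is assumed, since the slide itself is not in the transcript):

\[
\max_{x \in \mathbb{R}^n} \; x^{\top} R \, x
\qquad \text{subject to} \qquad \|x\|_2 = 1, \quad \|x\|_0 \le k .
\]

Dropping the cardinality constraint ‖x‖₀ ≤ k gives back vanilla PCA, solved by the top eigenvector of R.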
So many approximations and relaxation schemes have been introduced. The easiest thing one can do is simply compute the vanilla PCA and threshold it, just keep the k highest, the k maximum absolute-value entries. There are also some regression techniques, some semidefinite relaxations, a generalized power method, and a newer technique based on a [inaudible] descent algorithm, but in general the problem is intractable, and especially when I go to very large data sets, all of these methods, with the exception of this one, basically become intractable. So when I talk about data sets of hundreds of thousands of entries and hundreds of thousands of words, these methods cannot run. So we want to introduce an approximation, starting with something really easy. I'll start with the assumption that my correlation matrix is rank one; if my correlation matrix were rank one, my problem would be easy. The approximation that we have basically solves sparse PCA [inaudible]: if my correlation matrix has constant rank, then sparse PCA is a polynomial-time problem. But this is not enough, and we also introduce a feature elimination, so in the tweet data set this will eliminate words that will never appear in my optimal solution. This is key to running our approximation on large data sets.
The key is that we can always decompose R, because it is positive semi-definite, as the sum of the outer products of its eigenvectors: V1 is the leading eigenvector, V2 is the second eigenvector, and so on. We approximate R with R1, and R1 is going to be a rank-one matrix, which is this outer product. Now I am asking the same question again: if R is rank one, if I use this matrix, can I solve the sparse PCA problem? This is my problem here, where I replace R with R1, the outer product, and all of this basically boils down to a maximization: I want to maximize this absolute sum. Do you think this is an easy problem? Basically, it boils down to a sorting problem: you just need to keep the k maximum absolute values of V1, and those are going to be the optimal indices, the optimal k words in a sense. This problem I can easily solve. I just need to sort the absolute values of V1 and keep the k maximum. This is basically equivalent to thresholding: I compute, like, vanilla PCA, keep the leading eigenvector, and threshold the entries, zeroing out the n minus k smallest entries. So if my correlation matrix were rank one, then I would know how to solve the problem; it reduces to a sorting problem.
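The rank-one solver just described is only a few lines. A minimal sketch, with illustrative names and under the assumption that v1 is the leading eigenvector of the rank-one approximation:

```python
import numpy as np

def rank1_sparse_pc(v1, k):
    """Best k-sparse unit vector for a rank-one matrix lambda_1 * v1 v1^T:
    keep the k largest-magnitude entries of v1 and renormalize."""
    support = np.argsort(np.abs(v1))[-k:]   # indices of the k largest |v1| entries
    x = np.zeros_like(v1)
    x[support] = v1[support]
    x /= np.linalg.norm(x)                  # unit l2 norm, exactly k nonzeros
    return x, support
```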
Now let's take it a step further and check what happens when R has rank two. When R has rank two, when I do the rank-two approximation, instead of keeping just one eigenvector I keep the sum of these two outer products. I have the same maximization here, and I replace my matrix with this matrix here, built from V1 and V2, the two leading eigenvectors of my initial correlation matrix. What I do now is introduce a new vector c of phi, which is going to unlock the low-rank structure of R; this vector here is basically going to give me instances of the rank-one problem. C of phi is a vector that spans the unit circle in dimension two, so it has unit norm. Now I will use the Cauchy-Schwarz inequality in the following way. This is my matrix, this is the [inaudible]. If I take the inner product of c, this guy here, so this is like a vector, right? If I take the inner product of this guy with this guy, then by the Cauchy-Schwarz inequality I know that this thing is less than or equal to the product of these two norms. Because I am constraining c to have unit norm, this guy is basically one, so I have this inequality here, which becomes an equality when c of phi is collinear with this vector here. So I introduce a variational characterization of the norm, of this norm. What I will do basically is the following: instead of solving this problem here, the initial sparse PCA problem for the rank-two matrix, I will solve a double maximization over both X and phi of this quantity here. Because this thing is less than or equal to this thing, and the maximum of this thing is equal to this thing, the maximum over phi and X is going to be equal to the maximum of this. This is a variational characterization of the problem, and it kind of looks more complicated, but this is exactly what unlocks the poly-time approximation.
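Written out, the rank-two argument just described looks like the following; the notation is my reconstruction from the talk, with R₂ = λ₁v₁v₁ᵀ + λ₂v₂v₂ᵀ the rank-two approximation and V = [√λ₁ v₁, √λ₂ v₂]:

\[
x^{\top} R_2\, x \;=\; \|V^{\top} x\|_2^2
\;=\; \max_{\phi \in [0,\pi)} \bigl(c(\phi)^{\top} V^{\top} x\bigr)^2,
\qquad c(\phi) = (\cos\phi,\ \sin\phi)^{\top},
\]

by Cauchy-Schwarz, with equality exactly when c(φ) is collinear with Vᵀx. Therefore

\[
\max_{\|x\|_2=1,\ \|x\|_0\le k}\; x^{\top} R_2\, x
\;=\; \max_{\phi}\ \max_{\|x\|_2=1,\ \|x\|_0\le k}\ \bigl((V c(\phi))^{\top} x\bigr)^2 ,
\]

and for every fixed φ the inner problem is a rank-one instance with the vector v(φ) = V c(φ).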
The key question is: what happens if we fix the angle phi? If we fix this angle phi here, then this is just a vector; this is going to be a fixed vector, and the problem becomes a max of V transpose X, the max of an inner product with a fixed vector. This is a rank-one instance. It is exactly equivalent to the rank-one instance I had before, which I can still solve by just sorting my elements and keeping the k maximum elements. The clue is that the sparse PC is going to be the solution of one of the many rank-one instances that exist out there. If I were able to scan all phi angles and keep all possible sparse PC candidates, then the maximizer across all phis would be my optimal sparse PC. The problem now is that phi is continuous; it is an angle between zero and pi, and I cannot just scan all possible angles. The whole idea that's going to lead to the next step is that the optimal solution here depends only on the sorting of my V of phi vector: when the sorting changes, the locally optimal vector changes. The clue is the following. If I take this vector here, the one I was considering, and write it down again, this guy has n coefficients, and each of these coefficients is a continuous function, a continuous [inaudible] of my angle. So what I will do now is just plot these coefficients and see what's going on. Here I have a random matrix V and I just plot my V of phi: this is the first coefficient of V of phi, the second, the third, and so on, and these [inaudible] are plotted as a function of phi.
Now, what does it mean to fix an angle? If I fix an angle here, I know how to solve the problem; I would just do the sorting thing and that's it. So for this angle here, my locally optimal three-sparse vector is going to be the vector that has non-zero loadings on the [inaudible] denoted by the top three curves that intersect this line here. The whole idea is to check what is going on with the sortings in this spanogram; we call this figure a spanogram. If I am able to track all sortings, all possible sortings, I will be able to track all possible rank-one instances, which means I will be able to track all possible sparse PC candidates. When does the sorting change? The sorting changes when two of these curves intersect. So I have many intersections in this spanogram, and all I want to do is find all of these intersections and compute, at each intersection, the locally optimal sparse PC vector. Now I want to compute the complexity. I need to find the number of intersections, because the number of intersections determines the number of rank-one instances. This is a simple problem: you just need to solve the simple equation where two curves become equal, and because of the absolute values there are two solutions per pair. So the total number of intersections is two times n choose two, because I have n choose two ways to pick a pair of curves. These are exactly the curve crossings that I have, and this is exactly the number of sparse PC candidates that I have for my rank-two approximation.
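Here is a hedged sketch of the rank-two procedure as I read it from the talk (not the speakers' implementation). It assumes V2 is the n×2 matrix [√λ₁v₁, √λ₂v₂], enumerates the angles where two curves cross, solves a rank-one instance on each side of every crossing, and scores each candidate on the quadratic form, which matches the "plug the candidates into the initial metric" step recapped next. It is meant for the small n left after the feature elimination described later; on the full vocabulary the O(n²) crossings would be prohibitive.

```python
import numpy as np
from itertools import combinations

def rank2_sparse_pc(V2, R, k):
    """V2: n x 2 scaled eigenvectors; R: correlation matrix (dense or sparse); k: sparsity."""
    a, b = V2[:, 0], V2[:, 1]
    # Curve i is |a_i cos(phi) + b_i sin(phi)|.  Two curves cross where
    # a_i cos + b_i sin = +/- (a_j cos + b_j sin)  =>  tan(phi) = -(a_i -+ a_j)/(b_i -+ b_j).
    angles = [0.0, np.pi / 2]
    for i, j in combinations(range(len(a)), 2):
        for s in (+1.0, -1.0):
            db = b[i] - s * b[j]
            if abs(db) > 1e-12:
                angles.append(np.arctan(-(a[i] - s * a[j]) / db) % np.pi)
    # Evaluate just to each side of every crossing, since the sorting is tied at the crossing itself.
    eps = 1e-7
    test_angles = [p % np.pi for phi in angles for p in (phi - eps, phi + eps)]
    best_val, best_x = -np.inf, None
    for phi in test_angles:
        v_phi = a * np.cos(phi) + b * np.sin(phi)      # the rank-one instance at this angle
        support = np.argsort(np.abs(v_phi))[-k:]
        x = np.zeros(len(a))
        x[support] = v_phi[support]
        x /= np.linalg.norm(x)
        val = float(x @ (R @ x))                       # score the candidate on the quadratic form
        if val > best_val:
            best_val, best_x = val, x
    return best_x, best_val
```

A slightly stronger variant would, for each candidate support, take the top eigenvector of the k×k submatrix of R instead of the renormalized v(φ); the sketch keeps the simpler choice.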
The whole idea here is that we introduced a new spherical vector which unlocks the rank-two structure of my matrix and gives me a new view of the problem. I am able to track intersections, which correspond to unique sparse PC candidates, and I know that within these candidates the optimal one exists. What I need to do is just plug the intersection angles into my rank-one solver, which is just a sorting, and keep all of the candidates, which are going to be quadratic in n. Then I just need to plug them into my initial metric and keep the maximizer. This is going to be... huh?
>>: [inaudible] these are dimensions?
>> Dimitris S. Papailiopoulos: Right, so n here is the number of words, which is 50 K, which means I cannot basically run anything [inaudible], right? This algorithm as we have it right now we cannot run on very large problem sizes. The complexity of this thing is poly-time, and we can generalize it to rank-d matrices by introducing a spherical vector which scans a hypersphere instead of a circle, and the theorem is that if you approximate your initial correlation matrix with a constant-rank matrix plus an identity, where the constant-rank part has rank d, then I can compute the optimal sparse PC of that in time which is polynomial in n. Now the problem is that the rank is in the exponent here, right? So although it's poly-time, it is not tractable if you go to…
>>: One small n is big n.
>> Dimitris S. Papailiopoulos: Yes.
>>: So small n is [inaudible].
>> Dimitris S. Papailiopoulos: Right. So we also have an approximation guarantee that tells you what the gap with the optimal solution is, how far away you are from what the optimal would give you. Now this is not enough, because as I told you, the exponent, although it is going to be constant if the rank is constant, is large enough that we cannot run it on big data sets. So if I have like 100 K words, if my problem size is 100 K, then the complexity is prohibitive. But depending on the sortings of these curves in my spanogram, I can drop curves, which means I can drop lines, which means I can drop words which will never occur in my top-k candidates. The clue is that some curves never get into the top k curves, which means that if a curve never goes up there, then the corresponding word never shows up in a k-sparse locally optimal vector. The idea is to track the k top curves: compute the maximum amplitudes of the curves, find the minimum level that the k top curves always stay above, and then all the curves whose maximum is below that level I can drop. So this is an algorithm that eliminates curves, which translates to an elimination of features, an elimination of words.
Gaussian V of rank two and a k of 10, having this scale here is something we then do with 100,
1000, going up to a million, the number of words or the number of features I am left with after I
do my elimination algorithm, is kind of growing logarithmic with my problem size. What does
this basically mean? It means that if I am at the million words here for example and I want to
solve my problem, I can equivalent solve a problem on 93 words without losing optimality. A
problem on a million words is equivalent to a problem on 93 words for this specific instance of
my matrix. This is basically telling you that for randomness this works good, but we want to
find out if for real data this gives us any benefit.
>>: This is for rank approximation?
>> Dimitris S. Papailiopoulos: Yes. This is for rank approximation. It's kind of similar.
>>: Similar?
>> Dimitris S. Papailiopoulos: Yes. Yes. Yes.
>>: So say rank five would be [inaudible]?
>> Dimitris S. Papailiopoulos: It's kind of like d times log n, roughly. We first want to see how well the elimination algorithm works on real data sets. We fix k again to be equal to 10, and we have a data set that consists of hundreds of thousands of tweets, each tweet has a constant number of words, about 5 or 10, and the number of unique words in the data set is, yeah, basically n. When I have 1 K words, there are only about 21 words that I need to check, and if I go up to, you know, 200 K, then again it's only like 34. So this basically means that the practical data set validates our…
>>: [inaudible].
>> Dimitris S. Papailiopoulos: It's better, yes. It is slightly better. The whole idea is that you can do the rank approximation on your correlation matrix, run the feature elimination, and then run the algorithm that we have for constant-rank correlation matrices. We want to use this framework for Twitter analysis. What happens with Twitter is that it puts out many tweets and each tweet kind of has a tag on it: some are about world news, local news, [inaudible], you know, article pointers, and we would like to have a black box that takes all of these tweets and outputs, you know, events, major events and hot topics during a specific date or a specific time window, like a week or something like that. So we have a framework called Eigen Tweets, and what we want it to do is this: we feed it tweets, which in our data set, which I am going to talk about in the next slide, are about 5 K tweets per hour or 50 K tweets per day, and we want this black box to give us the major events in our data set. We would like to have something here that says the first Eigen Tweet has to do with an event, an earthquake; the second has to do with an uprising, a protest; and the other has to do with deals, summer and stuff like that. We want to kind of cluster our data, give directions in our data, which is going to explain what's going on.
The Eigen Tweets algorithm is basically what I told you earlier. The first step is you just compute the sample correlation matrix. You do an eigenvalue decomposition to get the d leading eigenvectors, run the feature elimination algorithm, and then run the algorithm for the low-rank approximation. Now, this algorithm is going to output a [inaudible] of words, and what we do here is we zero-force these words in our data set: we eliminate these words from our data set and then we run this whole method again to get the second sparse PC. This zero-forcing here enforces the sparse principal components to be orthogonal to each other. Now, the hardest thing here is basically the eigenvalue decomposition, so assuming that after the initial elimination we are just left with a logarithmic number of words, this is what dominates the computation, roughly speaking.
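Putting the pieces together, here is a sketch of the Eigen Tweets pipeline as just described: eigen-decomposition, feature elimination, constant-rank sparse PCA, then zero-forcing the selected words and repeating. It reuses the hypothetical helpers sketched earlier (`leading_eigenvectors`, `eliminate_features`, `rank2_sparse_pc`); the structure is my reading of the talk, not the speakers' code.

```python
import numpy as np

def eigen_tweets(S, k=10, num_components=3):
    """S: sparse words x tweets matrix. Returns one word-index support per Eigen Tweet."""
    S = S.tolil(copy=True)                    # lil format lets us zero out word rows between rounds
    components = []
    for _ in range(num_components):
        S_csc = S.tocsc()
        V2, lams = leading_eigenvectors(S_csc, d=2)       # top-2 eigenvectors of R = S S^T
        V2 = V2 * np.sqrt(lams)                           # scale columns by sqrt(eigenvalues)
        keep = eliminate_features(V2, k)                  # in practice only tens of words survive
        R_small = (S_csc[keep, :] @ S_csc[keep, :].T).toarray()
        x_small, _ = rank2_sparse_pc(V2[keep, :], R_small, k)
        support = keep[np.nonzero(x_small)[0]]
        components.append(support)                        # the k words of this Eigen Tweet
        for i in support:                                 # zero-force: disjoint supports => orthogonal PCs
            S[i, :] = 0
    return components
```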
Basically, our data set consists of Greek tweets that we crawled for months. We built the crawler using SideBar software, and the data set that we ran our experiments on has millions of tweets. The flow of tweets is about 3 K tweets per hour, 50 K tweets per day, and about 1 million a month. The number of unique words in our analysis is about 2 K per hour; per day it is about 50 K, and per month about 300 K. This basically means that the number of tweets grows much faster than the number of unique words. The sample set that we picked was between May 1st and May 10th of 2011; the number of tweets in this time window was 500 K and the number of unique words was 200 K, so the problem size, the n that we had before, is 200 K. My rank-one approximation, which is a simple eigenvalue decomposition plus the [inaudible], takes almost a second; the rank two, 7 seconds. The elimination gives you back only 20 words, which means that out of all of these hundreds of thousands of words we just need to keep 20 words here, and the rank-three approximation needs a few more words to operate and takes a little bit more time.
The first question is: why would you use sparse PCA and not just pick the 10 most frequent words, or the ten highest-correlation words? The thing is, that is just a bag of words; the most frequent words are not going to tell you what's going on. It's just: year, Greece, love, Twitter, May, Osama, Laden and so on; that was when bin Laden had been found. The same holds for the highest-correlation words. All I am trying to say is that picking the most frequent words is not good for classifying; it is not a good way to find the trends of what's going on in your data set.
We'll start with the rank-one approximation. The first Eigen Tweet, the first principal component, has to do with love, so it's love, Greece, know, received, Greek, happy and all of that. The second principal component is not really good: it says year, Greece, Osama, Laden, mommies, world, May. This is basically not a good [inaudible]; it doesn't really tell you what's going on with your data set, and that's because the rank-one approximation is not sufficient for the principal components to give you good trends, trends that are interpretable. The third PC is a little bit better. It says home, Facebook, Veggos, Thanasis, job, nice, days; these were all words about Thanasis Veggos, one of the most beloved comedians in Greece, who had died. So it kind of gives you what's going on, but it seems that topics are mixing with each other, and that's because of the insufficiency of the approximation, so we take it one step further and go for a rank-two approximation. Oh, by the way, here the approximation guarantee is almost 80%, which means there is at most a 20% gap from the optimal, from the full-rank solution. I take the rank-two approximation, which leaves only 20 words out of 200 K words, and the first PC we see is identical to the first PC of the rank-one approximation. The second PC kind of clears things up: it says year, Greece, mommies, world, mothers, May and Twitter, so basically in this time period there was also Mother's Day. The third PC has to do with the death of Osama bin Laden, so now we see that the topics are starting to clear up. The approximation guarantee is better here: we have an 86% approximation guarantee, which means we are roughly at most 14% away from the optimum. And for the rank-three approximation we just get slightly better PCs, and the approximation guarantee is slightly better as well; we are at most 10% from the optimal.
So sparse principal component analysis is an intractable problem, but there are some nice tractable relaxations. The first result that we have here is that if your correlation matrix is well approximated by a low-rank matrix, then we can solve the problem in poly-time. This is not enough for large-scale problems. For large-scale problems we need something better, and what we do is feature elimination. We have a feature elimination technique, which gives you a better result and a much better algorithm, in the sense that it is equivalent to the initial one but runs on a very small problem size, and it runs really well. We use this approximation for a new framework that we call Eigen Tweets, which is basically sparse principal component analysis in the context of Twitter. For future work, the basic thing is that the algorithm is fully parallelizable: when I compute locally optimal candidates, this can be done in a parallel fashion; the same goes for the elimination algorithm, and the same goes for the SVD. All three building blocks of Eigen Tweets can be computed in parallel, and we would like to implement this framework in Hadoop MapReduce and see how it performs on very, very big data sets. Thank you.
[applause].
>>: To track all of the intersections, can all of that be done in parallel?
>> Dimitris S. Papailiopoulos: Yes. Basically what you need to do is just feed the matrix to the workers, so if you get like 10 CPUs, you feed the matrix to each CPU, and each CPU can compute an intersection, or a couple of intersections, or a small number of intersections. For each intersection you compute candidates, so each machine can compute different candidates, and all of these together comprise the candidate set, so you know that the optimal sparse PC is in there.
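A toy illustration of the data-parallel split described here: the candidate angles are independent, so chunks of them can be scored on different workers and only the overall maximizer kept. Everything below (chunking, names) is my own sketch, not the speakers' implementation.

```python
import numpy as np
from multiprocessing import Pool

def score_chunk(args):
    """Score a chunk of candidate angles; return (best value, best sparse PC) for the chunk."""
    V2, R, k, angles = args
    a, b = V2[:, 0], V2[:, 1]
    best = (-np.inf, None)
    for phi in angles:
        v_phi = a * np.cos(phi) + b * np.sin(phi)
        support = np.argsort(np.abs(v_phi))[-k:]
        x = np.zeros(len(a))
        x[support] = v_phi[support]
        x /= np.linalg.norm(x)
        val = float(x @ (R @ x))
        if val > best[0]:
            best = (val, x)
    return best

def parallel_rank2(V2, R, k, angles, workers=10):
    chunks = np.array_split(np.asarray(angles), workers)
    with Pool(workers) as pool:
        results = pool.map(score_chunk, [(V2, R, k, c) for c in chunks])
    return max(results, key=lambda t: t[0])     # overall (best value, best sparse PC)
```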
>>: [inaudible] so they go in a lockstep way, right?
>> Dimitris S. Papailiopoulos: They don't need to exchange information. Basically the only thing they need is the matrix, and for that matrix each one just picks its own distinct angles to work on.
>>: [inaudible] has to go across all of them?
>> Dimitris S. Papailiopoulos: For the elimination you do need to exchange information between executions, but you can still run it in parallel; you definitely need to kind of synchronize, yeah.
>>: So once you start doing this in parallel [inaudible] some take longer [inaudible] repairing
from [inaudible].
>> Dimitris S. Papailiopoulos: Yeah.
>>: Can I take a look at the complexities slide?
>> Dimitris S. Papailiopoulos: This one?
>>: Yeah. So up here what is S?
>> Dimitris S. Papailiopoulos: Okay. Basically S is the sparsity of my tweets, so in this example it is like 5 to 10. If my matrix is super sparse, it means that my inner products can be computed very fast: if I have a vector that is super long but has only five nonzero entries, then an inner product takes constant time to compute, because I only do a constant number of multiplications. This is why all of these complexities depend on the sparsity of the matrix. If my matrices were super dense, then even computing the correlation matrix would be a tough job to do.
>>: And what is the n small?
>> Dimitris S. Papailiopoulos: The n small is the number of features, the number of words that I
need to optimize over after I have run my elimination algorithm.
>>: When you use the [inaudible] matrix [inaudible] so is there any value to somehow keep the
order of the words, would it give you better results?
>> Dimitris S. Papailiopoulos: [inaudible].
>>: "By Microsoft" and "Microsoft by" are very different things, but then you put them in your vector and "by Microsoft" and "Microsoft by" correspond to the same vector?
>> Dimitris S. Papailiopoulos: Yeah, right, they correspond to the same vector, right.
>>: So then they get the same value. Is there any way to keep the order of the words?
>> Dimitris S. Papailiopoulos: You can do that, but it will give you a much bigger data set; right now your data set is a matrix, and you would need a tensor to keep the order.
>>: You can have one [inaudible] pairs of words, but then your vector would be [inaudible] longer. It would go as the number of words squared; it would explode, but I wonder if there is any shortcut. People do all kinds of things. This one is a bag of words; it is simplistic. But yes, you are right, people do these kinds of things sometimes. But there is no [inaudible] sparse PCA on that.
>>: [inaudible] would the elimination of feature words help [inaudible] the concept of eliminating [inaudible]?
>>: Yeah, this method would work. Definitely, this is interesting; this method would have no
problem. If your features become pairs of words, no problem. I mean it's all the same.
>>: [inaudible].
>>: The question is, like, can you use pairs of words as the features [inaudible]. How does the complexity [inaudible]?
>> Dimitris S. Papailiopoulos: You know for sure the thing is going to be n squared. That's really what you know for sure. So you are getting a +1 on the exponent of everything.
>>: [inaudible].
>> Dimitris S. Papailiopoulos: Huh?
>>: [inaudible] words [inaudible]. I guess consider [inaudible].
>>: Is this the ideal approach, or did you actually want to find the optimal [inaudible] what's happening as opposed to interaction? Is there a plot that demonstrates that it actually trends towards the optimal approach?
>> Dimitris S. Papailiopoulos: I definitely cannot plot your data set because it's, yeah, exactly, but what you can do for sure is take the sparse principal components, which, as you say, are just index sets, like five indices of words, and then you can project your data set onto, like, three principal components, and this will kind of give you a clustering of your data set: how much each point leans towards each kind of topic, each principal component. That's the way you can visualize this, by projecting onto lower dimensions.
>>: So I guess you stripped articles, right? Because I didn't see the top dimensions being [inaudible], "and", "I" and "you"?
>> Dimitris S. Papailiopoulos: Yes, yes. There is a normalization phase, so as a kind of first-order approach we throw away links, references and stuff like that, all of these things, and just keep words, and not all words: we only keep the words that have length more than three characters, so we throw away the short words, and we also have a list of words, like "me", "you", stuff like that, words that don't say anything about the content of the tweet, which we throw away even before we start doing anything, because otherwise they are going to populate your first principal components. That's the whole idea.
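A rough sketch of the normalization described in that answer: strip links, mentions and references, drop tokens of three characters or fewer, and remove a stop-word list, keeping only tokens that carry content. The regexes and the tiny stop-word list below are my assumptions, not the speakers' exact pipeline.

```python
import re

STOPWORDS = {"this", "that", "with", "from", "have", "what"}   # illustrative only

def clean_tweet(text):
    text = re.sub(r"https?://\S+", " ", text)      # throw away links
    text = re.sub(r"[@#]\w+", " ", text)           # throw away mentions and hashtags
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens
            if len(t) > 3 and t not in STOPWORDS]  # keep words longer than 3 chars, drop stop words

print(clean_tweet("Strong earthquake in Greece this morning http://t.co/xyz @news"))
# -> ['strong', 'earthquake', 'greece', 'morning']
```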
>>: So the first thing you do is strip the most common words?
>> Dimitris S. Papailiopoulos: Right, yeah.
>>: Then we strip the least common words, like statements.
>>: The most common words are [inaudible].
>> Dimitris S. Papailiopoulos: Right.
>>: But stripping away the least common words probably is never going to throw away something that [inaudible] to keep.
>>: Are they stemmed? So, like, "ran" and "run" are stored at that same index in the vector? Like, are they normalized so that "running" and "ran" are all mapped to the same index in the vector, or do you have the same…
>> Dimitris S. Papailiopoulos: Yeah. That we should do, but we don't do it now. We should do that, though, yeah. It would definitely give you better results, because if there is a small typo, or cases like what you said, they end up as different indices even though they should be the same, but we don't handle that here. For example, you have Greece written in Latin characters and Greece written in Greek characters, and these should be mapped to the same index, but we did not do that. That is a nice direction; we should follow that as well.
>> Cheng Huang: Okay. Let's thank the speaker again. [applause].