>> Doug Burger: It's my pleasure to introduce Jangwoo Kim from Carnegie Mellon, a graduating doctoral student who is here on a job interview. I've been very impressed by Jangwoo Kim so far, so I'm very much looking forward to his talk, which looks at the implications of process variation for SRAM memories. This work has a lot of momentum behind it, and I'm very excited to hear about it. Jangwoo Kim worked for Babak Falsafi at Carnegie Mellon, a good personal friend of mine, and he comes very highly recommended. Babak said he was so good, he could handle any questions we asked. So just let the guns blast. I'm kidding with that last part. But I'm sure that will be the case. So without further ado, we're looking forward very much to your talk. Thank you for coming. >> Jangwoo Kim: Good morning, everyone. My name is Jangwoo Kim. Thanks for coming to my talk. Today I'm going to explain one of my works, which improves memory system reliability and manufacturability using efficient multi-bit error correcting codes. So when we say reliability and manufacturability, we're talking about errors: what do we do with errors in devices? Errors can be classified as soft or hard errors by their nature, and hard errors can be further classified as manufacture-time errors or runtime errors. Here I'm showing three cases of errors. The first one is a manufacture-time hard error. This is mostly the result of device variability: some transistors are faster than others, and sometimes we lose the value in a cell. So if we keep using this device, these errors don't always occur, but there's a possibility of losing values at runtime. A runtime hard error is basically device wear-out, or some device getting broken during runtime in the field. So the real problem is that these errors occur with increasing frequency and at increasing scale. So how do errors occur in memory systems?
So in previously we used to have a small number of errors and each error occurred only a few bits of, small scale errors. As we go to more deep scale with technologies and everything gets smaller and then we apply all voltage, so we're expecting more errors, and each error will take, affect a large scale information loss. So we need a mechanism to tolerate those increasing frequency of errors and large scale information loss in the field. >>: So are you asserting that a high energy neutron knocking a silicon out of the lattice can actually cause multiple errors in the frames? >> Jangwoo Kim: That's one of the reasons. So, okay, these days we are seeing up to four bit errors from single particle failure. But by decreasing the dimension of both dimensions, we're expecting up to eight to 16. That's the industry set. That's only one reason. The other case is even soft error can propagate to the sharing or substrate. Actually secondary impact. And this is observed in the real world. So actually it is up to 16 cells technology. It's a rare event, but that could happen. >>: So it's a cascading effect of silicon atoms? Going through multiple cells in the lattice. >> Jangwoo Kim: Yes. Then they share the substrate and the wall wells. So to mitigate the problem, you can put more contact or we can separate ourselves so that's the circuit level solutions. But this is only about soft errors. I'm not just talking about soft errors. >>: I understand. But you've got the little lightning bolt. >> Jangwoo Kim: Yes. >>: That sounds great from a [inaudible] point of view. You have a particle detector here. >> Jangwoo Kim: Particle detector? [Laughter]. >>: Eventually we'll have a Neutrino detector. >>: That's really valuable. >> Jangwoo Kim: I didn't know about that way. So conventional protections assume only a small scale errors and small number of errors. Here I'm showing two cases of our commission of protection. You have to figure you have to use rows to repair the defect. 
Here you use a conventional single error correction code and combine it with two-way bit interleaving. If you combine one-bit ECC with two-way interleaving, you can correct up to two clustered bit errors. The problem is that after we repair certain groups of defects using existing redundancy, the remaining errors require other protection. And for soft errors, we can correct up to 2-bit clustered errors, but if we get larger errors we cannot correct them. So conventional protection techniques cannot repair many errors or large-scale information loss. So the natural way of dealing with this problem is probably to increase error coverage by extending the existing schemes. So first, we can apply a multi-bit ECC, or we can increase the degree of interleaving, or we can just put in more redundancy. But obviously if we use multi-bit ECC, we need extra coding logic and extra space for check bits. So that's a significant VLSI overhead. A higher degree of bit interleaving can correct only clustered bit errors, and it's very power inefficient: it can incur a very high amount of power. I will come back to this point in later slides. Obviously large redundancy takes a larger area. So my solution is to use a two-dimensional protection inside each memory array. So here I apply a conventional protection, a simple conventional protection, as the first line of defense. For this one I assume we can detect errors at runtime with minimum latency. But we can also apply a larger-scale code, which is applied only once to larger-scale data. By using this strong code, we can correct larger-scale errors. Basically we're spreading the responsibility for error detection and correction over two different codes. Any questions? So we have this common-case simple code and a more sophisticated, complex code that is handled in the background, and we can combine them.
We get a synergistic impact: higher error coverage at lower overhead. >>: One question: if you were only concerned about soft errors and not hard errors, what about protection through packaging, basically hardening through packaging? >> Jangwoo Kim: Okay. You mean the circuit technology for hardening? >>: Yeah, for packaging. >> Jangwoo Kim: Yeah, I mean, this is always about the cost, what we get and how much we pay. >>: Circuit. >> Jangwoo Kim: Obviously for some space applications, I mean, they can use circuit-hardened technologies. But mass-produced commercial products don't have that much protection in terms of circuits. They actually prefer a single error detection code and some interleaving. This is the more popular choice. >>: Right. So alpha particles, my understanding is that alpha particles are easy to shield against, and so people have sort of done that. But the higher neutrons require -- >> Jangwoo Kim: Alpha particles are not a problem anymore. We solved the problem using shielding techniques. That's only a baseline. We're mostly worried about the high-energy particles coming from the sky. But still we can do certain technology -- I mean make hardened technologies. But that's not a popular solution. So my application is two-dimensional error coding in caches. We apply a horizontal code, a single error detection code, or optionally one that can correct single-bit defects, but assume an error detection code for the moment. And we put a vertical code at the bottom of the array. So basically the vertical code is also a simple code. But when you see the whole picture, with the horizontal code and the vertical code combined, that becomes a strong code. So we identify a certain amount of defects. Since we have higher error coverage, we repair using redundancy only for the larger-scale defects. And we also experience soft errors during runtime; when you use both codes, we can correct all those errors.
So when you use 2-D error coding, we can get very high multi-bit error coverage at very low overhead, and it can be used for defect tolerance, too. >>: Can you talk about the -- are you going to talk about the downsides to this approach later? >> Jangwoo Kim: Yes, of course. >>: And in the background. >> Jangwoo Kim: So usually when you write a value to the memory, we have to compute extra check bits. This becomes the extra code. For the horizontal code, when we write something, we update the horizontal code. For the vertical code we also have to update the vertical code. This is related to Doug's question. So this vertical code update is performed off the critical path of normal operations. That's what the background means. So this vertical code update is done -- it's done, how can I say? >>: Later? >> Jangwoo Kim: It's not later. Actually it can be done later. But it doesn't affect normal operations. Actually, I'll come back to that one. >>: It's done incrementally. >> Jangwoo Kim: Yes, incrementally. Actually we can pipeline this update logic and completely separate it from the normal memory read and write. >>: So it's just the latency? Because the bank is not structured in a way that makes getting all of the column. That's the problem. >>: Okay. >>: When you read out a row, you get the whole horizontal row. >> Jangwoo Kim: Actually, your answer is better than my answer. So -- >>: I'm old and gray. >> Jangwoo Kim: Yeah, so it's the horizontal and vertical code. It's considered a product code in coding theory. To update the vertical code and horizontal code, the normal way is to read everything: we read the whole array at once to compute the code. But that doesn't happen in a cache because we only modify a single line. So the simple way to update the vertical code is to take the difference -- when we write the horizontal code, we get the difference between the old data and the new data, and that difference is applied to the vertical code. That update is done in the background of normal operations.
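The incremental vertical-code update described here can be sketched in a few lines. This is only an illustration under stated assumptions: the vertical code is modeled as one plain XOR parity row over all data rows, and the class and method names (`ParityProtectedArray`, `write`, `recompute_vertical`) are hypothetical, not taken from the talk.

```python
from functools import reduce
from operator import xor


class ParityProtectedArray:
    """Toy memory array with one vertical parity row (assumed: plain XOR of all rows)."""

    def __init__(self, rows):
        self.rows = list(rows)
        # The full-array computation happens once, at initialisation.
        self.vertical = reduce(xor, self.rows, 0)

    def write(self, index, new_value):
        old_value = self.rows[index]  # read-before-write: fetch the old data
        self.rows[index] = new_value
        # Apply only the difference between old and new data to the vertical code.
        self.vertical ^= old_value ^ new_value

    def recompute_vertical(self):
        """Reference computation that reads every row; used here only as a check."""
        return reduce(xor, self.rows, 0)


array = ParityProtectedArray([0b1010, 0b0110, 0b1111])
array.write(1, 0b0001)
assert array.vertical == array.recompute_vertical()
```

The point of the sketch is that `write` touches one row and the parity row only; the cost it leaves visible is the extra read of the old data.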
That doesn't affect normal operations, the normal reads and writes to memory. >>: Okay. So you can [inaudible]. >> Jangwoo Kim: I'm sorry? >>: So there's a window where you have enough data for the vertical code. >> Jangwoo Kim: Right. Actually, there's overhead. We actually have to read the old value before we write. That becomes really -- so that actually incurs extra port bandwidth. So I'll come back to that point. That's really the overhead of -- >>: Reading -- >> Jangwoo Kim: Right. >>: Of course, works really well with DRAMs because you [inaudible]. >> Jangwoo Kim: Actually, some processors do that for the cache, too, for partial ECC writes. When you write a small byte, you have to read the whole data to compute the new ECC. Latest optimum core processor, they actually do read before write every time for read. And RFID do that, too. >>: They're not going through every row. >> Jangwoo Kim: Yeah, yeah, right. >>: Do you end up with a problem if there's been an error and you do an overwrite, then? >> Jangwoo Kim: I'm sorry? >>: On reads, presumably you're checking the normal code, right? >> Jangwoo Kim: Right. >>: If, when you do read before write, you check up on all horizontal code? >> Jangwoo Kim: Yes, of course, yes. >>: You spread that bit out. >> Jangwoo Kim: Yes, of course. If we find errors, we have to correct them right away. But the message is we only detect errors for these configurations. So I'm going to briefly go over the existing protection schemes and explain why they don't scale. So this is the typical way of making multi-bit ECC. We can start with single-bit ECC. We take the 64-bit data and compute the ECC bits using an XOR tree. This is the implementation of a multiplication. Then we get the 72-bit ECC code word. So this is already significant overhead in terms of power and area.
If you want to increase the error correction coverage, we have to duplicate this, linearly increasing the circuit overhead and the check-bit area overhead. Obviously we see a sizable overhead increase, and it takes larger area and power. So these two graphs show the energy overhead and the storage overhead. So on the X axis I'm increasing error correction strength from an eight-bit error detection code to an eight-bit error correction code. When you apply ECC to different word sizes, we get linearly increasing storage and energy overhead. So that backs up my explanation from the previous slide. >>: I have a question. So it seems like as you increase your width, the width of the word, you would increase your energy overhead but decrease your storage overhead. >> Jangwoo Kim: Yes. So this is the nature of coding. When you apply a specific code to larger data, the relative area of the extra check bits decreases. >>: But the relative energy increases. >> Jangwoo Kim: But this also includes the bitline driving energy. So now we have fewer bits, so we are actually saving some power in the bitline driving. >>: That's fine. Okay. I was wondering why the cross-over. >> Jangwoo Kim: Yeah. So another way to protect against multi-bit soft errors is to use a higher degree of bit interleaving. This is a simple illustration of interleaving. So we take four-to-one bit interleaving, four-way interleaving: we get four words, and we interleave them bit by bit. And so the point of this scheme is that when you get four clustered bit errors, they are manifested as a single-bit error per code word. So this is actually used in every processor these days. But since we are making a long word line, on every cache read and write we have to drive all the bitlines. So that's the limitation, and that takes lots of power. So this is the energy overhead caused by bit interleaving. As we increase the interleaving from no interleaving to 16-way interleaving, the energy increases by up to 1600 percent. >>: Can you go back?
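The four-way bit interleaving described by the speaker can be illustrated with a small sketch. It assumes a simple layout in which bit i of every word sits side by side in the physical row; the helper names and the 8-bit word width are hypothetical choices for the example, not details from the talk.

```python
def interleave(words, width):
    """Physical row layout: bit 0 of every word, then bit 1 of every word, and so on."""
    row = []
    for bit in range(width):
        for word in words:
            row.append((word >> bit) & 1)
    return row


def deinterleave(row, ways):
    """Rebuild the logical words from the interleaved physical row."""
    words = [0] * ways
    for pos, value in enumerate(row):
        bit, way = divmod(pos, ways)
        words[way] |= value << bit
    return words


def flip_cluster(row, start, length):
    """Model a clustered upset: flip `length` physically adjacent cells."""
    damaged = list(row)
    for pos in range(start, start + length):
        damaged[pos] ^= 1
    return damaged


WAYS, WIDTH = 4, 8
words = [0x3C, 0xA5, 0x0F, 0x96]
row = interleave(words, WIDTH)
damaged_words = deinterleave(flip_cluster(row, start=9, length=4), WAYS)
# Count how many bits each logical word lost to the 4-cell cluster.
flips_per_word = [bin(a ^ b).count("1") for a, b in zip(words, damaged_words)]
assert flips_per_word == [1, 1, 1, 1]
```

Each logical word sees exactly one flipped bit, which is what a single-error-correcting code per word can handle. The sketch says nothing about the energy cost of driving the longer row, which is the limitation raised in the talk.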
So why couldn't you put the support in the array to only read out and drive the lines for the interleaved word that you want? >> Jangwoo Kim: Actually, it's timing. So we have to precharge everything, and we just take the corresponding word. If you put an extra circuit in this timing-critical path, that takes much time. Am I missing the question? >>: No, you understood the question. Just doesn't seem like the timing is that critical. Putting a transistor on the SRAM, putting a transistor on the driver for the precharge, it seems like you could turn that on or off in advance and save a lot of energy. >> Jangwoo Kim: Actually, it's not that simple to -- let me see. So basically we make another array of multiplexing. So actually that can take a lot longer time than we think. So next we have large redundancy. Okay. The fundamental problem with redundancy. So the nature of our defects is small-scale defects: many defects, randomly distributed. But the redundancy we conventionally use in memory is at the granularity of lines, rows or arrays. If we keep repairing small defects using larger-scale lines and columns, then we end up with a way larger area. So this graph shows the defect rate tolerable by a given amount of redundancy. So here 10 percent means we gave 10 percent extra area for the purpose of redundancy. Even with that amount of redundancy we only tolerate up to .2 percent defect rate. So obviously we're wasting a lot of area. So let's -- so I'm going to show how to overcome this problem and achieve multi-bit error tolerance. So, again, the architecture is based on a horizontal code and a vertical code, and these vertical and horizontal codes are basically error detection codes. So these can be implemented as interleaved parity codes. This is a fast code. When you combine those horizontal and vertical codes you can correct large-scale errors. So why -- so we should understand why 2-D coding works.
So when you apply a coding mechanism, we have to decide on a coding strength and the data size. For example, conventional codes apply simple coding, a single error correction code, to small data. And that way they achieve fast error detection and correction latency. Obviously this is a simple code, so the error coverage is limited. 2-D coding actually keeps this approach as a simple first-dimension code, and it adds a second-dimension code by combining a strong code and larger data. Basically, we apply a product code to the entire array. And this is only used for correction, so latency doesn't become a problem. By adding it, we get the synergistic impact of low latency and high error coverage. That's why 2-D coding works. So let's go over how 2-D coding actually detects and corrects errors. So during a read, the horizontal code will detect errors. It doesn't know the exact locations, but it can know that this row has errors. Then by combining the vertical code and the horizontal code it can reconstruct the whole line. So basically our error correction coverage is based on how much area we allocate for the vertical code. So do you have any questions? So let's see how 2-D coding makes a scalable solution. So I'm starting with four-bit error coverage. You combine one-bit ECC with four-way bit interleaving, so you correct up to four-bit errors. If we want to increase this error coverage to 32 bits, then we have to increase the degree of interleaving and the strength of the error correction code. Here I'm assuming a four-bit error correction code and eight-way bit interleaving. That way we can correct 32-bit errors. However, you're making extremely long word lines, which takes more power, and you're also making a larger area for the four-bit ECC check bits. So that takes extra coding logic and extra area. So instead of doing this we can apply 2-D coding. So basically we're not changing anything about the memory array. We simply replace the existing one-bit ECC with an EDC. Basically similar overhead. And we only add a 32-bit EDC.
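The detect-then-reconstruct flow just described can be sketched as follows. This is a simplified illustration, not the design from the talk: the horizontal code is assumed to be a 4-way interleaved parity over an 8-bit row, the vertical code is assumed to be a single XOR row, only one row is assumed to be corrupted, and all names are hypothetical.

```python
from functools import reduce
from operator import xor


def interleaved_parity(value, width=8, ways=4):
    """Detection-only horizontal code: one parity bit per interleaved lane."""
    return tuple(
        sum((value >> bit) & 1 for bit in range(lane, width, ways)) & 1
        for lane in range(ways)
    )


def protect(rows):
    """Compute the horizontal code per row and one vertical XOR row for the array."""
    horizontal = [interleaved_parity(r) for r in rows]
    vertical = reduce(xor, rows, 0)
    return horizontal, vertical


def read_row(rows, horizontal, vertical, index):
    value = rows[index]
    if interleaved_parity(value) == horizontal[index]:
        return value  # common case: fast detection, no error found
    # Rare case: the row is known to be bad but the bad bits are not located,
    # so rebuild the whole row from the vertical code and the other rows.
    others = reduce(xor, (r for i, r in enumerate(rows) if i != index), 0)
    return vertical ^ others


rows = [0b10110010, 0b01101100, 0b11110000, 0b00011011]
horizontal, vertical = protect(rows)
original = rows[2]
rows[2] ^= 0b00111100  # a cluster of four adjacent bit flips in one row
recovered = read_row(rows, horizontal, vertical, 2)
assert recovered == original
```

With a single vertical row, the sketch recovers one damaged row at a time; covering errors that span several rows would need more vertical code, which matches the remark that coverage depends on the area allocated to the vertical code.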
That's our error detection code at the bottom. So our overhead is only a small area located at the bottom of the array. So, again, actually I'm going back to that deferred question. 2-D coding is not free. To update the vertical code, we have to know the difference between the old data and the new data. So that means every write becomes a read-before-write operation. So read-before-write operations basically add an extra read of the array to normal operations, and that can incur extra port contention. So theoretically it can affect the performance of normal operations. But in the real world, read/write contention turns out not to be that serious a problem. For example, some architectures actually support read-before-write operations. This is mostly for [inaudible] of ECC. And for those architectures, 2-D coding doesn't incur any extra bandwidth. We can also use techniques such as port stealing. Port stealing means we separate the read and write parts of the operation, and we issue the read using idle cycles, so we don't have to issue the read and write as back-to-back operations. And for popular multi-threaded architectures this latency can be hidden simply by scheduling another thread. So throughput isn't affected much. Questions? >>: There's lots of things we could do if we have lots of threads. >> Jangwoo Kim: Yes. Of course. So I evaluated 2-D coding using Flexus. That's our full-system simulator developed at Carnegie Mellon. So I have two baselines, two multiprocessor baselines. The first one is a fast MP. Fast MP means we have fat write audible quarter and four audible processors [phonetic]. With ECMP we have simple cores and we have many, eight cores. To get the high IP, we put error cache and also cache is large. This is basically similar to Intel 120 processors. To provide one, we tried to model Sun's [inaudible] processor. We have a small, after cache we have many banks. And there's only one port for error and cache because this is multi-thread processors.
We also evaluate -- so we implement 2-D coding in the entire cache hierarchy over the two baseline systems, and we evaluate using a six workload. So we basically evaluate using two server workload and 2-D scientific work code. So this is the purpose overhead of 2-D coding during normal operations. In other words, this is overhead what we pay from the extra per contention. So you're not assuming -- so we are assuming the baseline doesn't have the read before write operations as a baseline. So for each baseline, there's the six work load. And the Y axis is purpose overhead, normalized to the baseline system which doesn't use 2-D coding. So higher bar is higher performance loss. And the first bar is the performance overhead when we apply 2-D coding to error and cache only. But the second bar is the same but this time we enabled postaling. And the third bar is purpose overhead when used 2-D coding cache, and the fourth bar is when we apply 2-D coding to both L and L caches. >>: The second? >> Jangwoo Kim: The second bar. So we protect only our cache in 2-D coding but this time we apply the postaling. So what we do -- I'll explain. When we see the read/write operations, we take read -- we address the operations. We issue using idle cycles. So there's a certain distance between read and write part. So as you see ->>: So why do you see the performance loss when you're doing port sealing? >> Jangwoo Kim: We are saving lots of performance overhead. But this is not the perfect way. Sometimes we don't have idle cycles. >>: Then you're actually ->> Jangwoo Kim: Full, then, yes. So we see the overhead purpose, overhead is really small, like 3 percent. But really for more intensive application like ocean, this baseline already takes like a half a bandwidth without 2-D coding. So encoding, adding 20 percent, actually it was 20 percent. 20 percent extra read encode high performance overhead, 10 percent. >>: You said ->> Jangwoo Kim: I'm sorry? >>: How large a working set? 
>> Jangwoo Kim: How large a working set? >>: Yeah. >> Jangwoo Kim: So for -- I don't have the exact number for I cache, but what I know is server workload, we are spending -- half of the miss is from I cache miss. So that's a typical behavior of a server workload. So I cache doesn't look at, have the full instruction footprint. So we have many missing out of cache. So data work set. So we see, the reason -- one of the misses, I also don't have a specific number. >>: [Inaudible]. >> Jangwoo Kim: Actually, so this is the actual breakdown. So during 100 cycles, this is how much time is spent from each category. So this is cache to cache. So we are spending this much time. The white part is the data cache and the other one is cache and the other is cache data and this stripe is instructions. So for link CMPD, multi-architecture so we hide this latency and we get almost no sensitivity in throughput. So this is implementation VRS overhead of 2-D coding compared to reducing scheme. For this graph we assume we want to achieve a 32 bit error coverage. So for the design point include two dimensional coding which use simple error detection code in both directions; and, three, the rest of the three end points combine different degree of interleaving and different strength of error correcting code. For example, when you combine eight-way interleaving with four bit ECC then you can get 32 bit error coverage. And so every overhead, force bar is coding, overhead, how much we allocate for check bit. And the second bar is error detection latency. This is a coding latency. And the third bar is how much energy we spent. >>: What's the normalization point? >> Jangwoo Kim: Normalization point is 2-way simple error detection code. So the big message is 2-D coding is almost similar, has almost similar overhead to conventional 2-way single error correction code, even though we achieve 32 bit multi-bit error coverage. But other design points have to consume lots of power over a larger area. 
Actually, one thing I want to say is this is also huge improvement in terms of latency. Certain microprocessors cannot use ECC in their own cache because even single error correction is slower than, too slow for them. For this one, you can use error detection code using error free mode. So we can actually implement 2-D coding without encoding that much latency even compared to single bit error correction code. So I already showed this graph. So we have given amount of redundancy for four megabyte SRAM. And red one was a defect tolerance. If you apply 2-D coding, at this time we assume the horizontal code can correct a single defect per word. So we get huge improvement in fault tolerance. So almost, this is like -so if you have 2-D coding, .2 percent redundancy. You have tolerance as having 10 percent of redundancy. >>: Is that assumed randomly? >> Jangwoo Kim: This is random. Actually, this is random. I'm just showing some potential here, forced line potential. So I feel that 2-D coding can be built to achieve multi-bit error tolerance at very low overhead. And I also show that 2-D coding can be used for defect tolerance if we have error correction code as horizontal code. >>: Can you go back for a second? I want to make sure I understand. So why is -- I'm missing something fundamental here. Why is this a higher cell defect rate better? >> Jangwoo Kim: This is a tolerable defect rate. >>: I see. So you're already under your bar. >> Jangwoo Kim: Right. >>: And you're just -- you're just saying these are both good solutions but you get better latency? >> Jangwoo Kim: Right. >>: Okay. >> Jangwoo Kim: So the question is, if we are going to use 2-D error coding for variability tolerance, variability tolerance, so what we should know what's the defect rate in the future. We should know the error correction strength required for that kind of tolerance. And also we should know how this tolerance can be translated as build performance power or scalability. 
So I'm moving to the variability tolerance of 2-D error coding. So what's the variability? Now we're having for 65 to 45 nanometer vision these days, and we are moving to 22 to 32-nanometer vision. We should expect to see high device variability. You have two transistors sitting next to each other, you'll have more than 30 percent difference in terms of power and latency. So we are expecting to see many different kinds of device variabilities such as unstable cells. So we lose value during reduction [phonetic] or some cells are really slow and some cells taking too much power to idle time. So what we can do is we can just discard our chips which has this bad cells. Right? So that is really -- I can't tell this number, but they say losing lots of money by discarding these chips in the industry. And also what it can do, you can pair using certain redundancy. Or we can just let chips, ship the chips which you want lower frequency or just take more power. So this is a typical processor meaning. So the big message is it's really difficult to scale SRAM for future technologies. So this is a typical curve. What I'm showing here is a histogram of 5,000 SRAM chips and X axis of latency of V chip and Y axis count. So this is -- every graph looks like this with a long tail at the end. So what I'm showing here is, if we want 100 percent yield, we make a cut at three significant points. If we make that cut, most of the chips can actually be faster than this yield cut point; but if we move these lines to the left, we can make a chip faster but we're losing larger chips because this is the target. So really important thing to note here is yield is not just one number. So we have a trade-off, a set of yield and a set of performance. This is not just the performance. This is the same story for performance power or scalability. >>: Couldn't you just label them, measure them differently in Excel? Slow, medium and fast? I think that happens. 
>> Jangwoo Kim: So actually I got the same question from Intel. And they said potentially we can do that. But the reason we aren't doing it is that we'd spend too much time testing each chip and labeling each one. These days the cost of a chip is determined by area and the time to market. These days we already spend half of the time on testing. >>: Would it be cheaper to just test them once and throw them out? >> Jangwoo Kim: They're doing binning, but -- they're doing binning but they're doing very coarse binning. So [inaudible] 1.8, 1.6. That's -- but finding one is really hard. >>: It's also hard because you have so many structures. If you just had one, it might be easy to test for. And you've got to make sure you don't miss something. So it's dangerous to do the fine-grained stuff. >> Jangwoo Kim: Okay. I've been hinting all along that we can use multi-bit error correction code to tolerate variability errors. So suppose we have multi-bit error correction code. We can use some portion of the code to mask the bad cells. We can get higher yield, performance and power. Since we are repairing these bad cells using error correction code, we don't have to use a very large amount of redundancy. But we also need some portion of the multi-bit error correction code to support runtime reliability. This was a huge challenge for the industry. I mean, some people already said we can use ECC for variability tolerance, but they gave up because they cannot give up soft error tolerance in the field. So assume we have multi-bit error correction code. Now we have to know how much potential there is. So I measured the potential of multi-bit ECC for variability tolerance. I did circuit-level simulations and modeling. I modeled intra-die variation. Intra-die variation is random variation existing within the die. It's a major source of variation in the future.
And this is basically a combination of pure random mismatch and some spatial correlations among the mismatches. And I designed five SRAM cell design points. So the 45-nanometer baseline is the existing baseline cell. I tuned it for 20 percent higher density and 20 percent lower power. And then in addition we evaluate a 32-nanometer cell and a 22-nanometer cell. This is like an imaginary cell. Basically we use ITRS roadmap realistic cell design points. And also I assumed the 6T standard cell [phonetic]. It's symmetric. So I applied intra-die variation to channel width, length and threshold voltage. And we measure the full characteristics of each cell. We get the read latency, the write latency, the read stability and the minimum voltage. Minimum voltage basically tells how much power it can consume. So this is how I model intra-die variations. So there's -- so I use a hierarchical variability model. So the bottom bar has like random cell-to-cell mismatches. And the highest bar has two mismatches. We add all these variables, and you get realistic intra-die variations. And this is how it is typically represented in a 64-by-64 cell array. So this is a case of threshold voltage. Different colors show different voltages. And as you see, most of this is random. There are certain hotspots because of spatial correlations. >>: What was the middle layer? >> Jangwoo Kim: So the [inaudible]. What do you mean middle layer? >>: You have three layers that you're adding up there. >> Jangwoo Kim: It's not three layers. I have like a whole -- it's a lot of layers. I added like one bit to one bit and data two bit to two bit. Actually, I add everything to get this variability model. >>: How do you validate that? Does the literature already validate these models? >> Jangwoo Kim: These models were validated in the literature. Actually, this is pretty new. >>: Yeah.
>> Jangwoo Kim: So using this variability model, cell variability model, I conduct multi-cell simulations to get the 5,000 3 DSM chip. What I did, I take each cell from that variability model and construct a 32 bit SRAM chip. And the important thing, the worst case cell determined is the performance of chip, power of a chip. Those things. Then we assume that single error correction code per word and 5 percent redundancy for array. This is a typical amount given to a latest Intel processor. They actually use 30 megabyte, up to 19 megabyte of cache, they actually, three megabyte space for the reliability. So this is actually the simulation number. The results from simulations. I'm showing legal latency and [inaudible] latency over the 45-nanometer baseline cells, cell design. So what I show is this black line histogram is the baseline. You don't use -- you see redundancy for variability tolerance. The blue line use single error correction code to tolerate availability errors. And the green one is using post vendors and single bit error correction code. So the important message here this was why a old-hundred person [inaudible]. Now I can move to this line, move to this point. Then we get huge improvement in the latency of write latency, still made 100 percent yield. So the actual number is important but it's not as important as this. So the thing is now my chip can run at the best case chip of the previous square. Right? So that's a huge advantage. This is the same patterns for rescalability or minimum voltage. For example, we can move with this line to this point. That means we can get the lowest power chip over the old distributions. So the message here is we really have to use ECC, and we combine ECC with the amount of redundance to get variability tolerance. 
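The argument that the worst-case cell sets the chip's operating point, and that per-word ECC lets the chip ignore that cell, can be sketched with a toy Monte Carlo model. Everything numeric here is an assumption made for illustration: cell latencies are drawn from a lognormal distribution, the chip has 64 words of 72 bits, redundancy is not modeled, and the function names are hypothetical. It does not reproduce the circuit-level simulations described in the talk.

```python
import random


def chip_latency(words, correctable_per_word=0):
    """Latency of the slowest cell in the chip that ECC does not mask.

    `words` is a list of per-word cell latencies. With single error correction
    (correctable_per_word=1) the slowest cell of each word is treated as a
    correctable error, so the second-slowest cell sets that word's latency.
    """
    worst_unmasked = []
    for cells in words:
        ranked = sorted(cells, reverse=True)
        worst_unmasked.append(ranked[correctable_per_word])
    return max(worst_unmasked)


def sample_chip(rng, n_words=64, bits_per_word=72):
    """Assumed variability model: independent lognormal cell latencies."""
    return [
        [rng.lognormvariate(0.0, 0.1) for _ in range(bits_per_word)]
        for _ in range(n_words)
    ]


rng = random.Random(0)
chips = [sample_chip(rng) for _ in range(200)]
baseline = [chip_latency(chip) for chip in chips]
with_sec = [chip_latency(chip, correctable_per_word=1) for chip in chips]

# The 100-percent-yield cut is the slowest chip in each population.
assert max(with_sec) <= max(baseline)
```

Comparing `max(baseline)` with `max(with_sec)` gives the shift in the 100 percent yield cut for this toy population; spending the correction capability this way leaves none for soft errors in the same word, which is why the talk pairs it with a second code.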
>>: There's three different forms, colors, in using the same area or --

>> Jangwoo Kim: I'm assuming -- as you're saying -- it's the same area. I'm assuming that the existing error code is there for soft error tolerance, and five percent is used for runtime repairs. We're not using them for variability tolerance. So this is for the rest of the cell design points. We start at 45-nanometer technology with higher density and low voltage, or 32 nanometer and 22 nanometer. So what I'm showing here is, one here is the operational point at the one sigma point, and five percent redundancy is used for redundancy. So for future technologies we have to use five percent redundancy to mask out really bad cells. I will presume five percent is used for variability tolerance. And we set the yield at 84 percent to get a reasonable operation point. Then we use ECC to correct the rest of the cells. I mean, the other bad cells. And then we get this bar [phonetic] for each point. I know this is hard to see. But basically what I'm showing is, if I see a bar higher than one, that means I improved the yield from 84 percent to 100 percent. And I also improved the performance, power, and reliability. I expect questions.

>>: I have a question. Maybe I missed something. But how do you tolerate variability with error correction codes? Is this variable in time or variable in space?

>> Jangwoo Kim: Variability in -- what is variability in space?

>>: Basically you have different latencies of different cells.

>> Jangwoo Kim: Right. Okay. So, for example, latency. If the cells are too slow, then when you read the value, this value will be, is not -- it contains the incorrect value. So you will detect those errors.

>>: How do you know if it's incorrect if it's not changing in time?

>> Jangwoo Kim: Changing in time?

>>: So the value you read over there will always be the same, right?

>> Jangwoo Kim: No, when you write it, you made a correct code, or assumed the correct value.
But that value changes in the memory. Or we actually started with the wrong value. But the coding algorithm knows that this value is wrong. So actually this is not different from a soft error.

>>: You're coding it before you're writing it in.

>> Jangwoo Kim: Right.

>>: When you read it out --

>> Jangwoo Kim: The coding itself is not vulnerable [phonetic]; it's not subject to variability errors.

>>: So the follow-up question then is the error correction scheme, which basically -- I'm assuming you need both the vertical and the horizontal code, or just the vertical code is easy. There's no latency penalty for that?

>> Jangwoo Kim: Latency penalty?

>>: In order to do sort of the variability hiding where you're on the fly, lot of cells to --

>> Jangwoo Kim: We're assuming 2-D error coding for the large-scale error correction, and we're assuming in-line error correction for variability errors. I actually explain that one in the next slide. You are on the right page. So redundancy overhead -- this is showing the required error correction strength and the redundancy to get [inaudible] from the previous graph. So what I'm showing is this block error rate from a single failure source, and these three lines have different strengths of error correction code.

>>: By failure you mean a variability induced permanent error?

>> Jangwoo Kim: When I say 30 percent from here, from this failure point, ECC has to correct 30 percent of the blocks. So this shaded region is actually the block error rate observed from the previous simulations.

>>: You say ECC correct 30 percent but misses the error rate. It gets better when you go to the right or worse?

>> Jangwoo Kim: Worse.

>>: It's the opposite of what you just said. It's a fraction but ECC is not corrective.

>> Jangwoo Kim: Let me see. So okay. So I have to define better and bad. When I say better, it means we have less variability.

>>: Block failure.

>> Jangwoo Kim: Right. But in terms of the question, how much will you tolerate.
That's -- that's getting better when we move to the higher ECC. But the thing is, if you want to tolerate the higher error rate, you have to correct many more blocks [inaudible]. I'm trying to show that in this graph.

>>: You're answering a more complex question than I was asking. You really want to be the director of the graph [phonetic]. It gets more expensive to support that as you go.

>> Jangwoo Kim: That's the message. For example, if we have 30 percent as the block error rate and we have a single bit error correction code, we have to allocate 40 to 50 percent latency overhead. Or if you have a double error correction code, you need only less than 10 percent for overhead. Actually, some other proposals show this graph also. And what they did, they sacrifice [phonetic] 50 percent of cache to get the same achievement. So what I've been showing is it's really good to have the double error correction code dedicated to variability tolerance. And to get the soft error tolerance we need additional write capability [phonetic]. I assume that's done in 2-D coding, but we need to detect the [inaudible] errors, because if we have two-bit errors and there's a soft error, it becomes a three-bit error. We still have to detect these errors. So let's go back to ECC. So why is ECC not implemented in the real world, not in microprocessors? Let's start with understanding why a single error correction code is fast. So to encode, to get the 72-bit code, you take the 64-bit data. And then we multiply it with this matrix [phonetic]. This is implemented by XOR computations. So we get 72-bit data. We store that value in the memory. And when you read, we take that value and we do a similar process to get the syndrome. So if the syndrome is zero, we know there are no errors. If there's a nonzero value, we know there are errors. This is the error detection point. So if there are errors, how a single error correction code corrects them is you examine the syndrome value.
So if you have a 72-bit value and you have single-bit errors, there are only 72 cases. So basically you examine whether the syndrome matches one of the 72 error patterns. If that matches, then we know the error location. We invert it. That's how we correct it. This can be done very fast and it can be implemented as 72 [inaudible]. It's not really a CAM, but it's like 72 parallel circuits. So why is the DEC [phonetic] so slow? For encoding and decoding it's almost the same. The only difference is the different size of the encoding and decoding matrix. This can be done in parallel. So up to this point, encoding and error detection can take almost the same latency as a single error correction code. But the problem is when you have errors. So now we have to assume two-bit errors, right? If you have two-bit errors from 78-bit data, there are more than 3,000 error patterns, right? So to make a really fast decoding circuit we have to make 3,003 -- how do I -- this is not really done in any microprocessors. In the real world what they do is they pipeline this circuit as a series of lookups, small table lookups and multiplications. It easily takes hundreds of cycles. Now we're proposing using ECC to tolerate variability at runtime. This is always happening. If you take the hundred-cycle latency, it can't be implemented. So now I just showed why the multi-bit ECC cannot be built for variability tolerance. But there's a way to mitigate this latency if we know the error locations. So for variability errors [phonetic], we know the error locations. So if you know the error locations, you can use an erasure coding algorithm. Erasure means, when there are errors in the code, the errors occurred only in this place. Those are called erasures. So variability errors are just like erasures. You are doing a test and we know this value is wrong and we put this in the erasures. So there's [inaudible] the minimum distance of the code [phonetic].
It has to satisfy these equations. We can correct T random errors and [inaudible] number of erasures. And we can also detect ED errors. So the conventional single error correction code has a minimum distance of 4. So this is the case of having a minimum distance of 4. We can configure many different kinds of error correcting code by changing the configuration of the encoding and decoding matrix. For example, we can make a three-bit error detection code or a conventional single error correction code. If you know two erasures, we can make a double error correction, error detection code. I'm not sure why this is a good idea.

>>: Hold on. So I get all of that, except if you can kind of --

>> Jangwoo Kim: So what's the erasure coding algorithm? So we take the original word and put zeros at these two locations, and put ones at those locations [phonetic]. Right. So each word has up to two errors now. The important thing is, if you add up the error numbers [phonetic], the two numbers, it's always two. So, in other words, in at least one of these blocks you have at most one error. Okay? So what that means: if we apply a single error correction code to each word, from at least one of them we get the original value.

>>: Why is that a guarantee?

>> Jangwoo Kim: Okay. So when you put 00 here, this one has up to two errors. Right? And this one also has up to two errors. But if either one of them has one error -- so if you add them up, you always get two. Okay. So this is always 2. And each one is no more than 2. So at least one is less than or equal to 1. So if you're trying to make a very simple view of erasure coding, that's what you have behind it.

>>: You're copying the word and you're replacing. In some sense you have all possibilities for the erasures, a copy with all possibilities.

>> Jangwoo Kim: Right. So this is how we implement the erasure code. So basically we put in a single error correction mechanism, but we just doubled it and applied it to each word.
And we just examine which one corrected the value, and we take that value. The important thing is this latency is almost the same as single error correction latency. Compared to a hundred cycles or a thousand cycles, this is a huge win. Area and power [phonetic]: you're just doubling the single error correction code, so it's similar to twice the power overhead. It's also much less than a conventional double error correction code. So the message is we can correct two variability errors at runtime. And we can also detect another bit error, which is the case of a soft error. This is ongoing work. So --

>>: Can you go back? So I see how this gets you fast SECDED [phonetic], but I don't see the connection to --

>> Jangwoo Kim: This is not a fast SECDED [phonetic]. We're taking the existing SECDED [phonetic] and making that --

>>: I understand. But I'm just -- okay. Got it.

>> Jangwoo Kim: This is ongoing work. So the real overhead of erasures is if you have to store the erasures. For this work we know this bit is wrong. But there's a way to detect erasures at runtime. This is one case. What we do, we examine the bit lines from each cell. There are only two valid values [phonetic], 1-0. That's how we read the value. But if you get the all 0-0 value, that means this is a slow cell. And this is different, 1-1; it takes too much power. So basically we are pulling the latency of this circuit. So if we see this works, we can detect erasures at runtime in the field. So, the final picture: 2-D coding is a fast code, an error detecting code, a simple error detection code. We're assuming many defects and a single soft error [phonetic]. We only repair larger-scale variability errors using redundancy. Then during runtime we will correct up to two variability errors. If you see, for example, if you have a soft error, it becomes a large-scale error. Then 2-D coding will take care of those errors. So we are working on this one, this array, to tape out the prototype this year using a 90-nanometer [phonetic] process.
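The two-copy erasure trick described above (fill the two known-bad positions with 0s in one copy and 1s in the other, run an ordinary single-error-correcting decoder on both, keep the copy that decodes) can be sketched as follows. This is an illustrative toy, not the speaker's circuit: it uses a small extended Hamming (8,4) code with minimum distance 4 instead of the 72-bit code from the talk, and the function names `encode`, `sec_decode`, and `erasure_decode` are made up for the example.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits as an extended Hamming (8,4) codeword (list of 8 bits)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit         # overall parity bit gives minimum distance 4
    return word + [overall]


def sec_decode(word):
    """Correct up to one flipped bit; return None when a double error is detected."""
    syndrome = 0
    for pos in range(1, 8):    # XOR of the 1-based positions of set bits
        if word[pos - 1]:
            syndrome ^= pos
    parity = 0
    for bit in word:
        parity ^= bit
    fixed = list(word)
    if parity == 1:            # odd number of flips: treat as a single error
        index = syndrome - 1 if syndrome else 7  # syndrome 0 means the parity bit
        fixed[index] ^= 1
        return fixed
    if syndrome == 0:          # even parity, zero syndrome: clean word
        return fixed
    return None                # even parity, nonzero syndrome: two errors


def erasure_decode(word, erasures):
    """Recover a codeword whose bits at the two `erasures` indices are untrusted.

    One copy gets 0s at the erased positions and the other gets 1s. The error
    counts of the two copies always sum to 2, so at least one copy has at most
    one error and the single-error decoder recovers it.
    """
    decoded = set()
    for fill in (0, 1):
        trial = list(word)
        for index in erasures:
            trial[index] = fill
        fixed = sec_decode(trial)
        if fixed is not None:
            decoded.add(tuple(fixed))
    # Accept only when the surviving copies agree; otherwise report failure.
    return list(decoded.pop()) if len(decoded) == 1 else None


if __name__ == "__main__":
    for data in product((0, 1), repeat=4):
        codeword = encode(data)
        stored = list(codeword)
        stored[2], stored[5] = 1, 0            # cells 2 and 5 return junk
        assert erasure_decode(stored, (2, 5)) == codeword
    print("all 16 codewords recovered with two erased positions")
```

Both decoders can run in parallel, which is why the speaker describes the latency as close to that of plain single error correction. The agreement check at the end is one simple way to flag the case where an extra error on top of the two erasures makes the copies decode to different words.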
So the summary of erasure coding [phonetic] is we can use the 2-D coding idea to provide variability tolerance. And by doing this you get a higher yield or higher performance or lower power. So let me just briefly go over the related work. So there have been proposals to use multi-bit error correction code for multi-bit soft errors and variability errors. Most of them actually use multi-bit error correction codes as they are. They take much latency or power. Another way, one group of proposals, is to duplicate values in different places. It can be duplicated in an extra cache, for example, or duplicated in another array. But it actually requires extra area or takes larger bandwidth and power. Also it leaves multi-bit error protection to the other array. So for variability tolerance, there's been some recent work with statistical models and circuit simulations that does Monte Carlo analysis simulation to get the variability model. So basically we are showing similar problems for the future. And there are lots of techniques to meet the variability errors. And obviously we can use the existing single error correction code, where we lose runtime reliability. Also, we can apply larger redundancy by [inaudible] redundancy, or we can disable memory. The proposal is to disable more than 50 percent of the cache to save on power. There's a bunch of techniques that use nonstandard cell structures or circuit technologies; but, again, if this achieves something, we can apply our technique to improve further. So we aren't contrary to these techniques. So let me just go over what I have been doing and what I'm going to do in the future. So I spent a long time at Carnegie Mellon University developing fast simulations [phonetic] for the research. So we're pretty proud of these simulations. And I basically worked on reliable architectures. So at some point I worked on developing reliable processors.
So using redundant multi-threading or redundant processor cores and reducing the comparison bandwidth [phonetic] or something like that. I also applied a memory RAID [phonetic] technique to survive node failure in a DSM server which contains many memory nodes. So it took some effort from the operating system, too. Then for this talk, my latest work is 2-D error coding, or erasure coding, to get multi-bit error tolerance or variability tolerance.

>>: You said you're referring to [inaudible] memory?

>> Jangwoo Kim: Yes. So what I've done: lose one node, we guarantee.

>>: I see what you're doing.

>> Jangwoo Kim: This is the memory. So I'm currently working on this part to get a prototype chip and then trying to evaluate the potential improvement. And so for my future work I'm interested in -- so my view is that future processors should allow easier design, easier testing, easier manufacturing, or easier repairing. So for this one I must assume a fully distributed architecture with permability [phonetic] and some complication [phonetic] support. I'm still assuming multi-level protection at different granularities. Also I'm still very interested in fast simulations [phonetic]. So as we go to multi-core we still have the problem of accuracy and latency of simulations. So at CMU we spent lots of time to bring up hardware and software co-simulations [phonetic]. And I'm very interested in that one, too. Also, I'm interested in fast memory or virtualization technologies to get optimum resource utilization. So that's another area that I'm going to investigate. So in conclusion of this talk, I showed why we should worry about memory reliability and manufacturability. And I proposed a two-dimensional error coding, or fast multi-bit error detection code, to correct soft errors and variability errors in the field at very low overhead. So that concludes my talk and I'll take questions.
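Two pieces of arithmetic behind the decoder discussion in the talk can be written out. The slide's own equation is not captured in the transcript, so the relation below is the standard textbook condition rather than a quote, and the symbols $t$ (random errors corrected), $e_d$ (errors detected), and $\rho$ (erasures corrected) are chosen here for illustration. The 78-bit codeword length is likewise an inference from the 3,003 figure quoted in the talk, not something the transcript states clearly.

```latex
% Standard condition on the minimum distance d_min of a block code
% (assumed to be the relation referred to on the slide):
\[
  d_{\min} \;\ge\; t + e_d + \rho + 1, \qquad e_d \ge t .
\]
% With d_min = 4, the configurations mentioned in the talk all fit:
\[
  (t, e_d, \rho) = (1, 2, 0) \;\text{(single correction, double detection)},\quad
  (0, 3, 0) \;\text{(triple detection)},\quad
  (0, 1, 2) \;\text{(two erasures corrected, one error detected)} .
\]
% Number of error patterns a fully parallel decoder must match:
\[
  \binom{72}{1} = 72, \qquad \binom{78}{2} = \frac{78 \cdot 77}{2} = 3003 .
\]
```

The jump from 72 patterns to 3,003 is the reason a fully parallel double-error decoder is considered impractical in the talk, and the third configuration is the one the erasure scheme relies on.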
(applause)

>>: So can this be used to increase reliability of memories that use multiple voltage levels to increase storage density?

>> Jangwoo Kim: Of course, right? Even for -- so even for multi-level voltage or whatever, it will manifest as errors, right? Randomly distributed errors. And a long tail. As long as we have a distribution with a long tail at the end, error correction code is a very good candidate, because you take care of most of the cases using small overhead.

>>: Are you thinking primarily of flash or DRAM as well?

>>: Actually, more -- I didn't think of flash.

>> Jangwoo Kim: Actually, I can talk about flash, too.

>>: Flash is natural.

>> Jangwoo Kim: So flash, actually, to get the multi-level cell [phonetic] we still have to make a distinction among the voltage levels. But still we have a huge problem in the variability of these levels. So these days mass-produced flash has only up to like two-bit MLC. So I'm actually interested in using my technique to get four-bit MLC, something like that.

>>: Seems like a natural.

>> Jangwoo Kim: Any other questions?

>>: I had one other quick one. So you're applying most of this work to SRAMs. And, in particular, caches. And it seems like you could partition caches into -- when you read cache data, if it's clean, you really only care about error detection because you just flush the pipe and do it from memory. If it's dirty, you really want to care about error correction. But much of the data in your cache is either dead, very little of your data is still alive and dirty.

>> Jangwoo Kim: Actually, that's 100 percent correct for soft errors. For variability errors, it's not really true, because we are randomly introducing errors, right? So even for clean data we have to correct it right away. It's not just about detection. We have to correct it, if this is used for variability errors.

>>: Your solution of timing variability is about memory.

>>: What's that?
>>: If your solution to the cell is to go out to memory, you haven't helped yourself.

>>: I'm thinking about soft errors.

>> Jangwoo Kim: Soft errors, you are right. Also, it's not only about cache. For example, for variability errors, you can actually twist it. So I only showed those four different design points. Latency-by-latency reliability [phonetic]. But you can change it by existing changes to the size of each cell. For example, if we are going to use this for, say, a store that is timing critical, you can actually make a transistor more vulnerable to something other than timing. Something like that. There's some potential to use it for different error structures. Actually, we are trying to model the store queue and the L1 cache and the L2 cache as two different cases using 2-D coding.

>>: You don't typically think your time coded --

>> Jangwoo Kim: Then the issue queue. Or the register file [phonetic].

>> Doug Burger: Any other questions? We'll let him off the hook. All right. Thank you very much.

>> Jangwoo Kim: Thanks.