>> Onur Mutlu: It's a great pleasure to introduce Subhasish Mitra who is an
assistant professor at Stanford. Subhasish was with Intel before he joined
Stanford. He was a principal engineer over there. He worked there for five years, and
he has built up very influential techniques in robust systems design. His
work spans from the circuit level to the system level. In particular, his widely adopted test
compression technique X-Compact at Intel received the Intel Achievement
Award, which is the highest honor that Intel --
>> Subhasish Mitra: That's fine.
>> Onur Mutlu: And he's been working on Built-in Soft Error Resilience
techniques. I think he will talk a little bit about that and online self-test. And with
that --
>> Subhasish Mitra: Thank you, Onur, you know, for inviting me over here. And
actually we are having a very good collaboration going on. My student Yan Jing
[phonetic] is sitting over here. She is actually doing her internship at
MSR this summer. So hopefully, you know, there will be a lot of interest in this
kind of stuff and I would like to have a two way conversation, rather than a one
way traffic over here, so please feel free to stop me and ask questions, that way
it will be more fun.
So let's say that you designed a system and let's say we start from a hardware
chip and we perfectly made sure that everything was correct and it was perfectly
formally verified and then it was tested right so there were no defects and you
had this chip in the field and then you see errors happening left and right in the
storage nodes, in the flip-flops, in the memories and the combinational logic, and
one of the big sources of these problems is what is called radiation-induced
soft errors, which happen because of particles in the packaging material and
neutrons that are coming from cosmic rays, basically, okay?
So the point over here is that in the past no one used to care about these kinds of
problems, only the space guys used to worry about radiation-induced soft errors,
but going forward at sub-45 nanometer technologies as we will see on the next
slide, you have to worry about almost all parts of your design and its resilience to
soft errors. You have to worry about memories, you have to worry about
flip-flops, and you probably have to worry about combinational logic as well, although
it's not quite clear whether combinational logic is going to play a big role or
not.
But this is a big problem that one has to worry about and here is a quote from
one of the IT executives and this came up in the Forbes Magazine a couple
years back in response to one of the processors that were built and sold to these
guys. So who cares about soft errors? You know, why are we talking about it?
Well, it all depends on error rates. And I'm not going to get into the details of
where I got this number from, but you will roughly find if you look at soft error
rates of individual components in a system, you will get a number something
like this: if you had a 20,000-processor server farm, you would have one major
flip-flop error every 20 days.
Now, this by itself doesn't say whether you could have a big problem as a result of this or not.
So for example here is an actual chip that I worked on in the past, and if you look
at the various components of this chip and you worry about memories, you worry
about flip-flops, you worry about combinational logic and you worry about the
contributions of these various components to the overall soft error rate of this
chip and this is what you find.
For example, this particular chip was a storage chip and it had a lot of on-chip
memories, and those on-chip memories would typically be protected using ECC, you
know, error correcting codes. Still there were lots of on-chip memories that were
not protected, or the designers decided not to protect these on-chip memories,
because every chip at the end of the day has to satisfy a certain soft error goal.
Almost no chip has a zero soft error rate. As long as you satisfy your goal that
your customer wants, you're okay.
For that particular chip it was fine not to protect this on-chip -- this remaining
unprotected memory that's shown over here. But that's not what the biggest
concern was at that time. The concern was how in the world we were going to
deal with these flip-flops, which had almost the same chunk, if you look at it, of the
overall soft error rate distribution as the unprotected memory.
Because for the unprotected memory, if one had to go put protection in, one
would have to go and put in ECC or more error correcting codes. There was
a path that was known to be able to solve that problem versus for flip-flops, you
know, you cannot use ECC, and the question is what else could you do? And
that was one of the burning questions at that time.
So going back to my previous point of, you know, who cares about soft errors: as
we said over here, if you had a 20,000-processor server farm, you could have
roughly one major flip-flop error every 20 days. And this number has a
meaning. People have done these studies where they have shown that most soft
errors do not matter at the system level; it's only roughly around five percent of
the soft errors that really matter at the system level. Then, you know, people do
not agree, some say 10 percent, some say one percent, some say five percent.
It doesn't really matter because this 20 days could become 200 days or it could
become four days or whatever. Those are all very important numbers from our
standpoint.
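To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python; every rate and count below is an assumed placeholder, not the data behind the slide, chosen only so the result lands in the same ballpark as the quoted 20 days.

    # Back-of-envelope soft-error math for a large server farm.
    # All rates below are made-up placeholders, not the numbers from the talk.

    FIT_PER_FLIPFLOP = 0.001          # assumed failures per 10^9 device-hours per flip-flop
    FLIPFLOPS_PER_PROCESSOR = 2e6     # assumed flip-flop count per processor
    PROCESSORS = 20_000               # size of the server farm
    DERATING = 0.05                   # fraction of raw flips that matter at the system level

    raw_fit = FIT_PER_FLIPFLOP * FLIPFLOPS_PER_PROCESSOR * PROCESSORS
    visible_fit = raw_fit * DERATING          # FIT = errors per 1e9 hours
    mtbf_hours = 1e9 / visible_fit
    print(f"one system-visible flip-flop error every {mtbf_hours / 24:.1f} days")

With these assumed inputs the sketch prints roughly 21 days; changing the derating from five percent to one percent or to ten percent simply moves that number up or down, which is exactly the point being made about 20 versus 200 versus four days.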
And what could these major effects be? Well, you know, one of the major
effects could be silent data corruption, which means, you know, that one could go
to the bank and deposit $20,000 and depending on which way the bit flipped you
could either be very happy or you could be very sad. And so you know, that's
really a big issue, and that's why almost every chip today has two soft error goals.
One is about silent data corruption: they want to have very low rates of silent
data corruption.
Versus they may look at detected but uncorrected errors, which means that you
know that there is a problem in your system but you really do not know how to
deal with it which means that either you have to do some kind of recovery or
something and depending on how recovery is implemented, you could have
system downtime or not. And we know -- I don't have to preach this to this
audience that the cost of downtime could be really high.
So again, it's not that you have to worry about the silent errors, you also have to
worry about detected but uncorrectable errors and do something about them.
And since I'm an academic, wearing my academic hat, I just said memory soft
errors is a solved problem. That's not really true. You know, there are lots of open
issues that we could discuss. But at a bigger scale, I think these are much more
open problems than memory soft errors.
So taking a step back, the soft errors are not the only problems that we have to
worry about going forward. Now, if you look at some of the technology-related
problems: you have to worry about radiation for sure, the soft errors that we were
talking about, and you have to worry about this [inaudible] which is called erratic bits.
These erratic bits -- actually Intel had a paper in IEDM in 2005 that was kind of
one of the first papers that talked about these erratic bits. What they found, for
their really low power processors, was that the minimum voltage at which
the memories could work was erratically shifting over time.
Which means that one way to solve the problem is to have a big voltage guard
band, but then [inaudible] is going to go up; now these are targeted for really low power
applications and then your battery life is affected. And, you know, people
were really concerned about a problem like this. So these are kind of the
sources of temporary errors. Versus going forward you also have to worry about
these other sources of problems, which include aging: transistors are like human
beings and they age over time, and the amount of aging depends on how much
workload, you know, these transistors have been running.
You have to worry about early-life failures. Typically what happens is when
chips are designed and tested on the test floor, these chips are stressed at a very
high voltage and at a high temperature so that the weak chips are screened out.
Now, if you stress these chips too much, this is what happens to the chips.
And this was actually one of the highest
performance microprocessor chips from a company -- not Intel, because I used to
work at Intel -- and you can see that what really happened is that because of
extremely high leakage currents in the burn-in oven at the higher temperature
these chips actually were fried.
So which means that either you can apply a lot of stress which will kill your good
chips or you're going to apply very little stress, which is -- which won't be enough
to stress your bad chips, which means now you have to do something about the
so-called early-life failures, otherwise these chips will fail within the warranty
period and, you know, the vendor has to deal with it.
And of course, you know, you have to worry about process variations.
And I'm sure you've heard about process variations a lot, and so, you know,
that's kind of the last one on my list, although, you know, it could be something
very important; these others are the ones that are not talked about a whole
lot. But these are some of the mechanisms that people are worried about going
forward.
So how do we deal with these problems? Well, what we think at Stanford in our
research group is that to be able to deal with the problems of
soft errors and erratic bits, you will use a technique which is called BISER, or
Built-In Soft Error Resilience that I will talk about today. And this BISER
technique is going to correct all these temporary errors that you would worry
about. And then for the two sides of the bathtub curve, which have to do with
early-life failures and wear-out: as I said, burn-in is getting difficult, and almost all the
alternatives to burn-in that the industry developed in the '80s are on their Sunset
Boulevard right now, because everybody is
reaching their limits basically. And IDDQ testing is an example of that.
So for these early-life failures and for aging problems, you will be using a technique
that I'll also talk about, which is called circuit failure prediction, where you will be
collecting a lot of data while the system is running and based on that data you
will try to tell where the system is actually going and where it could fail
because of these aging problems. And to be able to do a good prediction you
may have to do a very thorough online self-test, because we know, with any prediction,
if the input data is not good enough then your prediction won't be very good. So
that's why you may want to do a very thorough online self-test. But given this circuit
failure prediction together with online self-test, we think that can resolve the
problems of early-life failures and wear-out and aging and so on. Yes?
>>: What's the timeframe for wear-out?
>> Subhasish Mitra: So in the past people used to worry about wear-out, you
know, like really late in the game, you know, for example you know like 20 years
or 15 years or something like that. So today what has happened is that wear-out
happens from day zero. So for example if you have a chip, your PMOS
transistors' threshold voltages will be degrading, you know, just
from the day that you are, you know, exercising the chip.
Now, one way to deal with the problem is to say that well, you know, if I had -- if
my speed as a result changed by five percent, let's say over five years and just
making these numbers up, then you may not worry about it, because you could
say I would just put a guard band of five percent in my frequency and I could
claim that I do not have an aging problem. Although there is an aging problem
that's going on that has been resolved by doing pessimistic design.
Now, there is a big worry going forward that if you want to make these
chips work for seven years or something like that, which would be the case for
enterprise applications, you may have to go and put in as much as 20 percent of
guard band to deal with the problems. Now, that becomes an important issue because
as we know very well that our speeds are not going up, number one, number 2,
our variations are going up, which means that you have to put a guard band for
variations and then now you have to go and put a guard band for aging and you
put guard bands over guard bands over guard bands, and how many guard
bands are you going to put, basically? So that's where the worry is. So it depends on
how you look at the problem of aging. Does that answer your question?
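As a rough illustration of how such independent margins stack up, here is a tiny sketch; the percentages are invented for this example only.

    # Guard bands multiply up: a few "small" margins stack into a large frequency loss.
    # Percentages are illustrative only.
    margins = {"aging": 0.05, "process variation": 0.08, "voltage noise": 0.05}

    usable = 1.0
    for name, m in margins.items():
        usable *= (1.0 - m)
    print(f"remaining usable speed: {usable:.1%}")   # ~83% of nominal frequency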
>>: I'm curious if I buy a computer today is there, you know, is there a warranty
on how long the processor will run?
>> Subhasish Mitra: Yes. Roughly. So there are two kinds of warranties. So
there is one kind of warranty today which you buy a processor, you know, that if it
fails, if it breaks within the warranty period completely then, you know, somebody
is going to replace your processor. And that warranty makes sense today
because, you know, people are still doing this burn-in, and since they're doing
burn-in, most of the weak chips are, you know, gone on the
manufacturing floor, so, you know, maybe they will have
some goals like, you know, a hundred in a million or 50 in a million that
could be bad because of that thing. So that's one kind of warranty.
And the other kind of warranty is basically not really a warranty, but it's more
like a use-condition thing that says that even if you buy a PC, maybe you use the
PC for only three years, or five years -- you know, I don't know, probably three
years is the right number. And if you use it for three years, then let's try
to target by design and by guard banding that you get a certain level of speed,
by doing something like that. Now, that's becoming very murky because, you know,
our speeds are not even constant anymore inside CPUs because of dynamic
voltage and frequency scaling and so on.
So, you know, it's not clear cut. But at least, when they say
that this thing is going to work at a certain gigahertz as its speed, they take
those speed guard bands into account today.
>>: So I shouldn't be surprised if after seven years --
>> Subhasish Mitra: Oh, yes, absolutely.
>>: It either breaks or.
>> Subhasish Mitra: Absolutely. So that's the whole reason, for people that do overclocking
to play games, you know -- that's why
they are going to now have even tags, isn't it, inside the processor or something to
find out if this was overclocked or not, and then they are not going to replace your
processor. You know, if you came back after two years and they found that you
overclocked it -- it has to do with reliability basically. That's it. That's aging
basically.
So what we think is that by using this concept of BISER and circuit failure
prediction and online self-test, we can have a consistent story about a new way of
thinking about designing robust systems. It's kind of a departure from the classical
thinking of, you know, having massive redundancy or massive online checking
to be able to deal with the problems but more like understanding the underlying
physics of these mechanisms to come up with very optimized solutions and
hopefully I'll be able to convince you that the costs of doing that will be much
cheaper than, you know, just, you know, than just relying on redundancy to be
able to solve some of these problems.
Now, to be able to do that, the reason that I came here and the reason you know
[inaudible] got so interested in it and we were doing this collaboration is what is
most important is this global optimization. Because I will go and I will put a lot of
bells and whistles inside the hardware to be able to deal with some of these problems,
but then you need optimization at high levels of the stack to be able to
orchestrate my hardware mechanisms the right way at the right time so that you
do not pay a full, a very high system level cost.
So the story about cutting down the cost is going to be in this global optimization,
and hopefully I'll be able to show you some examples of, you know, how that
would happen. So that's why it is important for, you know, folks like yourselves to
be interested in these kinds of problems. So that's the story about technology,
and you know, I'm sure you have heard people saying that, oh, gee, you know,
these are never going to be problems anymore because everybody's going to
stop at 22 nanometers and the semiconductor industry is not going
to go beyond 22 nanometers.
So you know, what I tell my students, and, you know, everybody, is that we are still
actually in business so far as our research group is concerned, because
even if you do not worry about technology related problems that we talked about
before, still even if you stop at 22 nanometers still we'll be building complex
systems. And if we build complex systems that means that we have to deal with
the problems of bugs. And validation is -- will still continue to be a problem.
And if you look at hardware design today, hardware design is mostly about
validation today. And even in validation, what people have found is that while
significant progress has been made in pre-silicon validation, which is often called
verification, where, you know, people use simulation-based verification, formal
verification and so on, what people are finding is that post-Silicon validation is one of
the big costs in hardware design today. And by post-Silicon validation what we
mean is: the chip came back, you plugged it into the system, and what you saw is a
blue screen and from there you have to find out which one of those billion
transistors is actually malfunctioning because it just takes only one logic gate to
malfunction in a synchronous design to be able to create any problem at the
lower level -- at one level up.
So from that level, how do you find out -- yes?
>>: So when you say the gate is malfunctioning, you don't mean it's because of
logic, it's because of --
>> Subhasish Mitra: Timing, timing issues, absolutely. So it could be -- actually
it's good if it is a logic error because if it is a logic error then it can be repeatable,
okay. But if it was a timing error, because some noise happened or because some
slight variation happened, you know, those are like Heisenbugs,
which means that when you try to observe it, the
problem is not going to happen; it's only in a certain electrical state of your
system that the problem is going to show up and it will be extremely hard to
reproduce. This is like, you know, the car has a problem and we go to the
mechanic and the mechanic says no there is no problem, you know, everything is
just fine.
And they're seeing it, and that's where the biggest cost is. They're saying that
today up to 35 to 40 percent of the total cost is about this. And I have a
quote from an Intel executive later in my talk which says that this
will actually dominate design cost unless we do something about it. So, you know, we are
in business.
And even if the technology stops -- we have a technique called IFRA, actually,
which can go and actually do this thing: from a system level failure you can
kind of pinpoint whether it's the ALU or the scheduler or
something in the hardware that's creating the problem, and we can do it very
efficiently. And I won't get into details of that, I'll just mention it later, but if
anybody is interested I'll talk about it.
>>: [inaudible].
>> Subhasish Mitra: I can talk to you.
>>: You're doing some machine learning stuff there?
>> Subhasish Mitra: Not really machine learning, but it is a data collection approach: I
do collect some very specific kind of data, based on the architecture,
and I do post-analysis of that in some clever ways, and I'll be very happy
to talk about that.
So you know, you know, I'm around today and you know, you can reach me later
on. Please feel free. Yes?
>>: [inaudible] acronym IFRA?
>> Subhasish Mitra: Yes, IFRA stands for Instruction Footprint Recording and
Analysis. Which makes sense, doesn't it: when instructions are passing through
the processor, they have some footprints that you collect, and you have to be very
careful about what kinds of footprints those are, and you record them concurrently.
You do not do anything about it. And when a crash happens you scan them out
and you analyze them to find, to diagnose, the failing locations.
>>: [inaudible] happening in hardware.
>> Subhasish Mitra: Yes?
>>: You store all this data in hardware buffers?
>> Subhasish Mitra: Only 60 kilobytes of hardware buffers. Not everything. You're
not going to be able to store billions of instructions, isn't it?
>>: Yes, of course.
>> Subhasish Mitra: Of course. But then that's why what is very important is that
you have to reduce your time from the occurrence of the problem to the time of
the exposure of the problem to one level up and that's where we play some tricks
to cut down that time, to be able to do that. And actually we have some actual
demonstrations that it works in real life. So I'll be very happy to talk about it.
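A minimal software sketch of the idea, with all structure names, buffer sizes, and the diagnosis heuristic invented for illustration -- the real IFRA technique records architecture-specific footprints in a small on-chip buffer and does much cleverer post-analysis:

    from collections import deque, Counter

    class FootprintBuffer:
        """Tiny model of a footprint recorder: keep only the last N entries."""
        def __init__(self, entries=4096):               # small, fixed-size on-chip buffer
            self.buf = deque(maxlen=entries)

        def record(self, pc, unit, signature):
            # In hardware this happens concurrently, with no impact on execution.
            self.buf.append((pc, unit, signature))

        def scan_out(self):
            # After a crash, the buffer contents are scanned out for post-analysis.
            return list(self.buf)

    def diagnose(footprints, golden):
        """Post-analysis sketch: flag units whose recorded signatures disagree with
        expected (e.g., re-simulated) signatures for the same instructions."""
        suspects = Counter()
        for (pc, unit, sig) in footprints:
            if golden.get((pc, unit)) not in (None, sig):
                suspects[unit] += 1
        return suspects.most_common()

The key property the sketch preserves is that only a bounded window of recent footprints survives, which is why keeping the error-to-crash latency short matters so much.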
Okay. So that kind of is the big picture. There is another project that we work on,
that I'll show two slides on, which is, you know, blue sky. We are even looking
beyond CMOS, and actually we showed the first experimental demonstration of
being able to create complex logic gates out of carbon nanotube transistors. And
I'll share some pictures with you later on. But that's like in a really blue sky, but
it's a lot of fun.
Okay? So let's go back and let's get into the details. First I'll start with this
Built-In Soft Error Resilience technique and I'll show you how we can self-correct
these radiation-induced soft errors.
So here is the big picture. Here is the key idea of all this BISER stuff. If you look
at traditional error detection and recovery, number one, it is expensive, and,
number two, it is expensive not only in terms of hardware cost, it's also expensive
in terms of design methodology. I know of at least one microprocessor where
people very seriously thought of implementing error detection and recovery and
[inaudible] backed off just because of the nightmare of validating that the recovery
mechanism will actually recover the way you want it to recover at the right time.
And you know, that validation is very expensive.
Versus this Built-In Soft Error Resilience that I will talk about, you will see that I'm
going to redesign the flip-flops inside any digital design and that redesign of the
flip-flops will self-correct any errors that would happen. Okay? And as we will
see, you will be able to correct errors both in latches and combinational
logic, although just to keep it high level I will focus on latches so that you get
the essence of it, and I'll be very happy to talk about, you know, how one could
do it for combinational logic. And very recently we showed that these are useful
not only for the radiation-induced soft errors but also the erratic bit errors that I was
talking about, these erratic shifts in Vmin and so on. So my message is:
correct errors, don't just detect them.
So here is how it works. So let's say that you have a combinational logic block and, you
know, I will show this at the transistor level. There won't be too many transistors, and this is
the only transistor diagram I have in my presentation, so please don't worry about
it.
So let's say you have a combinational logic block and the output of the combinational
logic is connected to a latch -- not a flip-flop, a latch, okay, suppose. You could do
the same thing for flip-flops. And suppose, you know, that I give you another
latch in parallel to this latch. Just, you know, take my word for it today, okay. And I'll
talk about where this latch comes from and what's the cost and all that kind of
stuff.
So one thing that you could do, of course, is to say that, well, you know, I'll connect
the output of this combinational logic to this latch and I'll connect the output of the
combinational logic to that latch. Of course I could do that. And then you could
say, well, you know, now that I have these two latches, I could stick in a
comparator, for example, and if there was an error, there will be an error flag
which says that you got an error. Now, that's not a very good idea.
The reason it's not a very good idea is the following: now think of a design
with a million latches, and each of those latches will be holding an error signal
saying that, you know, whether I found an error or not. To be able to do any kind
of a recovery you have to grab all those error signals and you have to pass them
on to your recovery block to tell the recovery block that gee, you know, now you
have to do recovery. And suddenly your chip design will be dominated by all the
routing of these error signals that will have to go to the recovery block. And the
designers will just hate you because you know, routing is already a big problem
and these routing of error signals, gee, they will say I'd rather stay with the old
technology and not deal with it.
So instead of doing that, what you do is you just insert this structure
with four transistors. And this is actually a very well-known structure that was
invented by Muller in 1959 at the University of Illinois, okay. And this is called a
C element, which people have used extensively in asynchronous circuit design. And
this C element works in the following way.
So when the two inputs of the C element are the same, it acts as an inverter. So
for example you can see that these are As and Bs, 0011 it acts as an inverter.
When the two inputs of the C element are different, it holds the previous value at
its output. So let's look at what happens over here.
So if you didn't have any errors in these two latches, then, you know, the two
inputs of the C element will be the same, it will act as an inverter, logically it will
do the right thing. Now, electrically it will add some more load and so on, and you
have to deal with it, okay.
Now, if the two inputs of the C element are different, which means one of these
latches has an error, there will be a mismatch over here and the C element will retain the
previous value at its output. Now, who in the world has said that this previous
value is the correct value? Because if the previous value is not the correct value,
I have not done anything good by having this C element isn't it and why am I
calling it a self-correcting design? So on the next slide using cardboard
animation I will formally prove that the previous value will always be the correct
value actually. And that's why it will work. Okay. And here is how it works.
So again, you know, you have these two latches -- the system latch and the redundant
latch -- and your combinational logic is going to write something into the latches. So
the way this whole thing works is the following. First let's say
the combinational logic was trying to write a zero into these two latches. We
know that the clock input of these latches has to be one. That's when the
latches are in transparent mode, that's when you can write into the latches.
Now, the key observation -- and this is very well known in the soft error literature -- is that
when the latches are transparent they are not vulnerable to soft errors, because the
latches are being very strongly driven by the upstream combinational logic. So
what happens is, well, you know, the zero got written over here and over here, this
acted as an inverter, you got a one at the output, which is the inverted output.
And now the clock input went from one to zero; that's when the latches are
actually storing the value and that's when you can have soft errors in the
latches. Say a flip happens on this latch -- doesn't really matter, it could be any
one of these latches. The C element sees a mismatch at its two inputs, it's
going to block, and the output continues to stay at the correct value.
By taking advantage of this fact, you can show that you can -- it's a self
correcting design because it returns the correct value at its output no matter what
happens.
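A small behavioral model of that argument, written as a sketch -- the latch and clock-phase handling are simplified down to exactly the two phases just described, and the class and function names are invented:

    class CElement:
        """Muller C-element: inverts when inputs agree, otherwise holds its output."""
        def __init__(self, init=1):
            self.out = init
        def eval(self, a, b):
            if a == b:
                self.out = 1 - a      # acts as an inverter
            return self.out            # on mismatch, the previous value is retained

    def biser_cycle(data, upset_latch=None):
        # Phase 1: clock high, both latches transparent and strongly driven -> no upsets.
        system_latch = redundant_latch = data
        c = CElement()
        c.eval(system_latch, redundant_latch)      # output settles to the correct (inverted) value
        # Phase 2: clock low, latches hold; a particle strike may flip one of them.
        if upset_latch == "system":
            system_latch ^= 1
        elif upset_latch == "redundant":
            redundant_latch ^= 1
        return c.eval(system_latch, redundant_latch)

    # The output stays at the correct (inverted) value no matter which single latch flips.
    assert biser_cycle(0) == biser_cycle(0, "system") == biser_cycle(0, "redundant") == 1
    assert biser_cycle(1) == biser_cycle(1, "system") == biser_cycle(1, "redundant") == 0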
>>: [inaudible] you said that you actually split it into two pieces, two phases.
>> Subhasish Mitra: Yes.
>>: And you made this observation that the error happened only --
>> Subhasish Mitra: Only in the second phase. If the error happened in the first
phase there would be no way you could help me. So, you know, the way I think
about it is that -- and it's what I have found in most of the problems
that I have worked on -- as we go one level deep and we
understand kind of the source of the issue, you know, that's where physics
comes in, and if we understand the physics a little bit, there are so many interesting
solutions that one can come up with without having to pay that much of a price.
If I didn't use that particular property, then you know, everything will blow off and
a scheme like this would not work.
>>: There's also this hidden assumption that both cannot go bad.
>> Subhasish Mitra: I'll talk about that, you know, yes. So there is an issue of
what happens with single-event multiple upsets, and I'll share some data with
you, real data, on what the vulnerability to that is.
So you can think of this as a flip-flop, you know. So now you can think of for
example when people do design, they have a technology library, and the library
has AND gates, OR gates, NAND gates; they have a regular flip-flop, they have a scan
flip-flop. Now, you can think of having a BISER flip-flop inside your library basically.
Now, so far as the cost is concerned, you will be thinking, you know, what are
the costs of doing something like this? That's what I'm going to share on the next
slide, and then I'll try to share some really recent data -- you know, it took
almost a miracle, you know, getting some people to agree that I can
share this data with people -- and... yeah, go ahead.
>>: What happens if there is an error in the C element?
>> Subhasish Mitra: Right. That's a very good question. So let's try to
understand. What happens if there is an error in the C element? Well, if there is
an error in the C element, then under a single-error assumption these two latches
are going to drive the C element very strongly. So what
you will see is, at most, a glitch at the output of the C element.
And, you know, and that glitch will go away in the subsequent stages of the
combinational logic. But very good question. So please, you know, please ask
these questions, otherwise it's very hard to talk about everything. Okay. But
good. And I'm making people think.
So what's the cost of doing something like this? So of course at the flip-flop level
so let's look at power cost first because that's what people care about the most.
And then we'll talk about area costs. So at the flip-flop level, I got two latches
that are transitioning, so of course I'm going to have 2X the power at the flip-flop
level, and I cannot, you know, do anything about it. But you know, again, nobody
sells flip-flops in an actual design, everybody sells the full design.
So we looked at the Alpha processor because it was open source, and we got
hold of this error injector that Professor [inaudible] at Illinois developed, and based on
that, we found that -- if you can find the right flip-flops to protect -- not all
flip-flops are equally important from an architectural-level error standpoint.
So if you can find the right flip-flops to protect -- for example, if you wanted to
reduce your chip-level soft error rate by 2X, you just protect a small fraction,
say 15 percent of the flip-flops, with a two percent power penalty.
If you wanted to cut down the chip-level soft error rate by 10X, you protect
50 percent of the flip-flops, the 50 percent most important flip-flops, and you pay a
nine percent chip-level power penalty for doing something like this. Okay. Yes?
>>: You predict the important ones by injecting.
>> Subhasish Mitra: So.
>>: And determining --
>> Subhasish Mitra: Here, yes, here we were injecting errors, you know, doing a
fault injection and determining whether it was important or not. On the -- either
on the next slide or two slides from here I will show you a formal verification
technique for doing that actually. So you know, because that's an important
question. Now, that's where, you know, optimization comes in. Given that we can
solve the problem at the lowest level, to be able to contain cost now we have to
look at other layers, to be able to do the cost
optimization.
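One way to picture that selection step is a simple greedy choice over fault-injection data; this is only an illustrative sketch, with an invented input format, not the actual flow used for the Alpha study.

    # Greedy sketch: rank flip-flops by their contribution to chip-level SER
    # (e.g., from fault-injection data) and protect the most "efficient" ones first.
    def select_flops_to_protect(flops, target_ser_reduction):
        """flops: list of (name, ser_contribution, power_cost_of_biser) tuples."""
        total_ser = sum(s for _, s, _ in flops)
        budget = total_ser * (1 - 1 / target_ser_reduction)   # SER that must be removed
        removed, power = 0.0, 0.0
        chosen = []
        # Best SER reduction per unit of added power first.
        for name, ser, p in sorted(flops, key=lambda f: f[1] / f[2], reverse=True):
            if removed >= budget:
                break
            chosen.append(name)
            removed += ser
            power += p
        return chosen, power

For a 2X target the budget works out to half the total soft error rate, for 10X it is ninety percent of it, which mirrors the two design points quoted above.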
So this is what you see. Now, what are the chip-level area costs of doing
something like this? Well, you know, of course at the flip-flop level your area
doubles because your flip-flop has gotten bigger, but then there are two benefits
that work in your favor. Number one, we know that in a real design you're
always dominated by long interconnects, versus here what you are adding is
local transistors and local interconnects.
And you know, we have done these experiments, you know place and route and
everything, you know, my friends at various other companies have done these
experiments and the chip level area cost of doing something like this is extremely
small. Sometimes you see one percent, sometimes you see 0.5 percent,
sometimes people have even seen zero percent, because, you know, their chips
have so much white space, because these are synthesized designs rather than hand-placed,
you know, custom designs.
And you know, the next question is how do you really optimize
the BISER insertion, the question that you were asking. There is actually another
benefit of using this BISER flip-flops in post-Silicon validation and test that I'll talk
about later.
So let's focus on this thing that I was talking about, how do you go and find out
which flip-flops are the most important flip-flops? Of course you could do a lot of
fault injections or error injections to be able to do that. The thing with error
injection is that if error injection tells you that there is a problem that's good news
because then you know that there is a problem. If your error injection tells you
that there is no problem, the trouble is you really do not know whether there is no
problem or whether it's simply because you have a [inaudible] issue here,
okay.
So that's where -- actually this is joint work with Professor [inaudible] at UC
Berkeley -- we were using a formal verification approach to do it. We
were saying, look, what's going on over here? So if you have a bunch of
flip-flops, and let's say you were worrying about a soft error in a flip-flop, you could
model that as a two-state machine. So, you know, there was no soft error to start
with, at some arbitrary time a soft error happened, and from that point on there
is no other soft error that would happen, everything would be fine with that
flip-flop.
And now you just take a cross-product of this particular two-state machine with
the formal model of your design, and then you can check for properties of the
design, and then you could come up with two answers. If the properties pass,
then you don't have to protect that flip-flop, otherwise if the properties do not
pass, you do have to protect the flip-flop basically.
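A toy, explicit-state version of that idea is sketched below; a real flow would use a model checker and symbolic techniques, and the step/check hooks here are hypothetical stand-ins for the design's next-state function and the property of interest.

    from itertools import product

    def needs_protection(step, check, init_state, flop_index, inputs, depth=10):
        """Explicit-state sketch: inject a single bit-flip in one flip-flop at some
        arbitrary cycle (the two-state error machine firing once), then check whether
        the property can ever fail within a bounded horizon."""
        for flip_cycle in range(depth):
            for stimulus in product(inputs, repeat=depth):
                state = init_state
                for t, inp in enumerate(stimulus):
                    if t == flip_cycle:          # the error machine fires exactly once
                        state = (state[:flop_index]
                                 + (state[flop_index] ^ 1,)
                                 + state[flop_index + 1:])
                    state = step(state, inp)
                    if not check(state):
                        return True              # a soft error here is observable: protect it
        return False                             # bounded evidence that BISER can be skipped

This brute-force enumeration is exponential and only meant to show the cross-product idea; the rigor in the real approach comes from the model checker exploring all behaviors symbolically.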
And you know, you can use a model checker and so on, and we did it. What
you find is that by doing this, for example, for the SpaceWire communication
protocol chip design, which is open source, you can cut down the cost, the power
cost, of BISER by 4X. So, you know, it's something around five percent or
something like that at the chip level that you have to pay. And
it's a rigorous way of really proving that you can actually do it. But of course --
>>: [Inaudible] there are some latches there, it doesn't matter what values they
have.
>> Subhasish Mitra: Yes.
>>: What the [inaudible].
>> Subhasish Mitra: Exactly. So that's a very good question. So what are
they doing? Well, there are two things that you can take benefit of. First of all,
since it was a communication protocol, there are already built-in mechanisms in
the communication protocol itself to deal with errors, okay. For example, if you
have CRC errors or if you have timeout errors, okay, they will
automatically be dealt with -- the system will deal with them. So that's why, even if those
latches -- those are care bits, okay, but still, you know, you do not have to go and
protect them, because they are taken care of at the system level, at the protocol
level. But then there is another thing, which is that if you have a single-cycle error
in a latch and it does not matter, that does not make the latch a don't care. You know,
the usual notion is that if there is a stuck-at fault in a particular design, meaning a signal is
always stuck at one, and you never see a difference, then it's a don't care.
But in this particular case, it's a single-cycle error, which is different from
saying that, you know, you don't care about that bit in the system. So you know, I
can actually show you examples where you could have a single-cycle
error that would not matter, but if that particular flip-flop was gone your
complete design would be messed up.
>>: But I thought if you had a single-cycle error, because that could be
repeated at every cycle, it's like you are completely existentially quantifying out the
contents of that flip-flop.
>> Subhasish Mitra: Yeah, but you know, you could
always have architectural constraints -- for example, these are the sequential
don't cares -- you could say that every time I have a one over here, I have a
zero over there, okay, something like that, okay. So if you have architectural
constraints -- you know, what you said would be true if there were no
constraints on the contents of the flip-flops. But if there are only a certain set of
bits in the flip-flops that are care bits, and the others are sequentially don't-care bits,
then you may run into a situation like that.
But for this particular design, what we found was that we were basically taking
advantage of the protocol-level design.
>>: [Inaudible] comment about that, also.
>> Subhasish Mitra: Yes.
>>: So it seems to me -- so this BISER stuff is really cool because you do it
only locally and --
>> Subhasish Mitra: I agree.
>>: And fix the problem. Okay. But now you're trying to save that stuff.
>> Subhasish Mitra: Yes.
>>: And you're trying to not use it.
>> Subhasish Mitra: Yes.
>>: And then what you're going to do is if you don't use it, then you might trigger
all this high level protocol stuff.
>> Subhasish Mitra: Sure.
>>: Right?
>> Subhasish Mitra: Absolutely.
>>: Which might have its own expense. Of course nothing is for free.
>> Subhasish Mitra: Absolutely, yes. I absolutely agree. And that's where, you
know, that's where the optimizer -- that's why the optimization is so important to
understand that at what level you want to go and solve the problem. I absolutely
agree with you. Yes. There is also another issue isn't it? You ready to make a
comment?
>>: Well, also, I mean if you make the reliability at the lower level stronger, then
you can eliminate some of the protocol at the higher level. I mean so that [inaudible].
>> Subhasish Mitra: At the end of the day, that's why it's so important to have a
cross-layer optimization, exactly. There is another issue, isn't it, which we
didn't talk about: who in the world said that we had the complete specification of
that particular thing that you --
>>: [Inaudible].
>> Subhasish Mitra: But it's garbage in, garbage out, isn't it? Like if you didn't
verify with respect to the right set of properties, it goes back to the coverage of
the properties that you verified against. But here is the point that I wanted to
make: there are interesting techniques, and one could even think of, for
example, combining error injection with something like this to be able to tell,
you know, what you could do. This is kind of an interesting spin to the whole thing,
the whole problem.
There is another benefit of using a technique like, you know, BISER that I was
showing before, and the benefit is that actually you can turn protection on and
off. So here is actually a flip-flop design that's inside many of your processors.
This is Intel's scan flip-flop design. And Intel decided to implement a scan
flip-flop in this particular way for the following reason: so that while the
system is running they could scan out the data, which is a snapshot of the system,
without having to stop the system. And this is very useful for post-Silicon
validation purposes. And that's why they had this spare flip-flop.
And now you can think of reusing the same thing for protection, and now you
have what I call a design-for-quality flip-flop, okay? Because that's going
to do soft error correction in the field, that's going to help for scan test on
the tester floor, and then that's going to also enable you to do post-Silicon debug
in, you know, a post-Silicon debug environment. And oh, by the way, since you
can turn this flip-flop on and off, that means that, again, if you
understood your system well enough, you could be in between two modes: you
could be in a high-reliability mode and you could be in an economy mode where,
yeah, the area you are stuck with, but area is not the biggest deal; the power is
the bigger thing, which you can trade off either statically or dynamically.
This brings us to the question of, you know, application-level error
rates.
So, you know, when I was at Intel and it's not about Intel it's with almost any
company that you can think of, you know, there was this whole issue of having
one single core, one single processor core that will run for all applications, you
know, like for laptops, for servers and so on, and then there were this -- there
were these wars you know that who pays the price of reliability because you
know, more power, you know, more soft error protection, you know, the goals
could be very different depending on what market space you are talking about.
By having this kind of dynamic way of, you know, turning protection on and off --
what we call reconfigurable protection -- you know, you can trade off; you can say,
well, you know, for applications that do not really care about soft
errors, you know, I turn the protection off, otherwise I turn the protection on. And
the question is, are there any benefits of even doing it dynamically during runtime?
I don't know the answer. So that's why these are question marks. But these are
interesting things that are enabled by a flip-flop design like that.
>>: So one thing in your design [inaudible] you don't actually know if it's
happening?
>> Subhasish Mitra: Yes.
>>: So have you thought at all about that, in other words [inaudible].
>> Subhasish Mitra: You're not the first person to ask me the question, as you
can guess, okay. So I have two answers to this, okay. My first answer is that
why do you need to know? If I can correct, okay. You know, there are so many
things that happened inside our design. Do you know about everything? And
too much knowledge is sometimes not very good. Okay.
>>: [Inaudible].
>> Subhasish Mitra: Yeah. Yeah. But at the same time, one could come up
with, you know, incarnations of these flip-flops, for example, where you could have
error checkers as well, you know, rather than just having the C element, at a
higher cost of course, and there were some companies that were even, you
know, suggesting something like that. So, yeah, you could do that.
>>: Have you thought at all about like a statistical error detection network
because for example in the case you're talking about where we're reconfigurable,
if you had an environment where you're seeing a lot of errors you might want to
turn on the error protection.
>> Subhasish Mitra: Yes. Exactly. You could clearly do something like that,
yes. And the good news is that unlike typical power shutdown
mechanisms that take a lot of cycles to get into a power-saving mode or back into
a regular mode, here it actually takes four clock cycles to turn protection
on and off, because you do it using scannable signals that are already present
inside the chip.
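A sketch of what such reconfigurable protection control could look like in software terms; the thresholds and the scan-bit interface are assumptions made for illustration, not part of any actual design.

    class ReconfigurableProtection:
        """Sketch: turn BISER protection on when the observed upset rate gets high.
        Thresholds and the write_scan_bit hook are placeholders."""
        def __init__(self, on_threshold=1e-3, off_threshold=1e-4):
            self.on_threshold = on_threshold
            self.off_threshold = off_threshold
            self.enabled = False

        def update(self, upsets, cycles, write_scan_bit):
            rate = upsets / cycles
            if not self.enabled and rate > self.on_threshold:
                self.enabled = True
                write_scan_bit(1)   # only a few clock cycles via existing scan signals
            elif self.enabled and rate < self.off_threshold:
                self.enabled = False
                write_scan_bit(0)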
So, you know, somebody already asked this question of what happens for
multiple upsets. Now, if I was giving this talk to you last year, I would have said,
oh, you know, no one cares about multiple upsets, it's all single upsets.
But what has happened is that these companies have actual data
now to show that -- first of all, for memories it was known for a long time
that you have single-event multiple upsets. But they have started seeing
single-event multiple upsets at 45 nanometers for flip-flops now. And the rate
of these upsets is much lower -- say maybe it's like for five percent of the cases or
maybe one percent of the cases you will see single-event multiple upsets.
It's like [inaudible] I'm okay with, you know, 20X or something like that. On the next
slide I have some actual data for BISER that I will share with you. But the big
comment is that the single-error assumption is clearly not
sufficient. And to be able to do the analysis of single-event multiple upsets you
have to worry about layout, you have to worry about technology CAD.
And we are doing some work, and we think that BISER can cut down single-event
multiple upsets by more than 100X -- actually it could be 300X. Because,
to give you the heuristic explanation of why that would be the case:
in BISER, an error occurs only if the corresponding nodes are both upset, and the
corresponding nodes are not placed close to each other, while the non-corresponding
nodes are; so you still end up having a compact layout, but the sensitive pairs have
kind of moved apart. Yes?
>>: [Inaudible] in memory?
>> Subhasish Mitra: In some sense you can think of it that way, but at a much
finer scale. But you can think of it that way, yes. What amazed me was that in July I
was at this small conference called IOLTS, or the International On-Line Testing
Symposium; they asked me to give an invited talk on soft errors, and I show up
there, give my talk, and the next speaker is Dr. Norbert Seifert from Intel, who I
worked with when I was at Intel, and he stands up and he starts showing data on
my BISER flip-flops. So it seems that, you know, there were some experiments
that were done by some folks at 45 nanometers on a test chip, and BISER was
there basically, and you can see what they found. So you know, I worked
with Norbert, I said, you know, can I create one slide that will be approved by you
that I can show to people, and finally, you know, they came back and
said, yes, you know, I could do something.
So for alpha particles they found that BISER [inaudible] soft error rates by more
than 1000X. For neutrons, they found that BISER [inaudible] soft error rates by more
than 100X -- actually it's far more than 100X. You know, I was told not to tell the
exact number basically.
And what was found is that BISER was very effective in correcting single-event
multiple upsets, and the reason is, first of all, the diversity that
I was talking about, and also what was happening is BISER doesn't have any
self-regenerative feedback; it's not like, you know, if you have an error, you
make up for it and you feed the same thing back into the flip-flop -- it's kind
of, you know, a feed-forward kind of design. And that's why, you know, the
multiple upset effects are very small in BISER, and so, you know, when they were
saying this is 100X and all that stuff, that, you know, includes even single-event
multiple upsets. So that's something that I was very happy about. And of
course Norbert said that I have to write this mandatory thing, that you can get
more reduction at higher cost.
And you know, I would agree with that. Okay. Okay. So that's what I have to
say about Built-In Soft Error Resilience or BISER, and now I'll move on to the
next topic which is about circuit failure prediction.
So let me remind you where we are. We said that for normal operation you
care about the soft errors and temporary errors and so on, and that's why you
would be using this Built-In Soft Error Resilience, versus for early-life failures
and wear-out you'll be using this idea that we are calling circuit failure prediction
with this CASP online self-test.
So let's look at what circuit failure prediction is. Well, the whole idea behind
circuit failure prediction is to be able to predict failures before errors appear in
system data and states. And this is in contrast to classical error detection where,
you know, you find a problem after the errors have appeared in system data and
states. Of course, you know, something like a radiation-induced soft error you are
not going to be able to predict, because it's a pretty
random phenomenon.
But for transistor aging and the early-life failures that I was talking about before,
there is a gradual degradation that's associated with these mechanisms. And this
gradual degradation shows up as [inaudible], and since it shows up as
delay shifts you can go and predict, you can tell where the system is going
before it has actually failed. And what are the pros and cons of doing circuit
failure prediction? Well, since you are
finding these problems before errors appear in system data and states, you don't
have an issue of data corruption. You do not have an issue of having to deal
with very high error rates -- and, you know, recovering from high error rates is a
problem. And you end up with very good self-diagnosis, because now you are
collecting all this information over the several billions of cycles that the system is
running, to be able to predict whether something is going bad or not. So you
know what the behavior of the system was before it actually failed, or, you know,
before it actually went to a failure scenario.
And that's why for these mechanisms, you know, transistor aging and early-life
failures, because of this gradual degradation which results in delay shifts
because of the physical phenomena, you can go and predict failures before
errors set in in the system. And by the way, Shakespeare explained several
centuries back why circuit failure prediction is a good idea.
So what is the applicability of circuit failure prediction? Well, as I said, it has to
do with degradation. And degradation shows up as delay shifts. Now, the
conventional wisdom about a delay shift is that a delay shift makes things slow, you
know, and you can see a lot of people talking about it. The startling thing that
we found -- and this is, you know, based on several
generations of test chips that we have been doing, where we have been
measuring stuff -- we found that these delay shifts do not have to be positive delay
shifts, in the sense that they do not have to slow things down; they actually can
make things fast before things break.
But still they are delay shifts, and as long as you can find delay shifts you can tell
whether there is something wrong going on in your system before the
system fails. Okay? So that's kind of the key idea over here.
Now, as I said, for transistor aging it's pretty well established how these
delay shifts happen, and these are positive delay shifts. Things get slow
basically. But for early-life failures -- for example gate oxide problems -- you
know, as I said, we actually have experimental data and we are clearly seeing that the
delay shifts can happen in both directions before the chip absolutely fails
basically. But still they are delay shifts, and if you can find delay shifts you should
be able to find whether there is a problem, and this can even provide several
different kinds of alternatives to burn-in.
Now, how am I going to do circuit failure prediction? So, you know, that was a
very fuzzy, kind of, you know, very philosophical discussion that we had
until now about, you know, how we should be able to do circuit failure
prediction. Now, the question is, what kind of support do I need on chip to be
able to do this kind of prediction, and there are two ways to do it. One is
concurrently with the application execution, which means while the system is
running, while your application is running, you have special flip-flop
designs, special sensors, that would look at delay shifts and would try to find out
whether there have been any delay shifts that happened inside the system. Or you
could do it using a periodic online self-test, which means that the system is
running as it is, and every so often you run a very [inaudible] online self-test and
you find out what's going on with these delay shifts in the system. And that
even provides some cost savings opportunity, because you may not need these
sensors and you may not need to pay the power or the area price of doing these
sensors and so on. But still you can find delay shifts.
So I will get into the details on the next few slides on, you know, how one could
design some sensors like this and how one could do this kind of online self test.
So let me be more concrete, and let me show you one concrete example of
circuit failure prediction in the context of, you know, NBTI aging. NBTI stands for
Negative Bias Temperature Instability. It's an aging mechanism that happens in
PMOS transistors and became very
prominent starting at 90 nanometer technology.
Basically what happens is the threshold voltage degrades over time which means
that the drive current goes down over time, which means that the PMOS delay
goes up over time.
And the factors that determine the amount of aging, they're like human beings,
you know, so it depends on temperature, it depends on voltage, it depends on
the workload because it is related to the percentage of time that the PMOS
transistor is on.
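For intuition, NBTI-induced threshold shift is often described with an empirical power-law-plus-Arrhenius form; the sketch below uses placeholder constants, since the real coefficients are fitted per technology and are not given in the talk.

    import math

    def nbti_delta_vth(t_seconds, duty_cycle, temp_kelvin,
                       A=2.7e-3, n=0.16, Ea=0.08, k=8.617e-5):
        """Commonly quoted empirical power-law form for the NBTI threshold-voltage
        shift. All constants here are placeholders, not fitted values."""
        return A * (duty_cycle * t_seconds) ** n * math.exp(-Ea / (k * temp_kelvin))

    # Aging grows with time, workload (fraction of time the PMOS is on) and temperature:
    print(nbti_delta_vth(3 * 365 * 24 * 3600, duty_cycle=0.5, temp_kelvin=358))

The shape is what matters here: the shift keeps growing with stress time and temperature, which is why the worst-case guard band over a seven-year enterprise lifetime can get so large.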
So as we discussed before about speed guard bands and so on, the
current industrial practice to be able to deal with this transistor aging is basically
a [inaudible]: you say [inaudible] if you want to use this thing for seven years in an
enterprise environment, this is the maximum amount of aging one could ever get,
based on the worst-case workload, the worst-case [inaudible], the worst-case
voltage, and [inaudible] going to go and put a speed guard band based
on that. If that speed guard band was two percent you would not worry about it,
you would just go and put it in. If that speed guard band is 20 percent, then you
would worry about it and you would want to be able to do something about it, and
that's the big worry going forward: whether you are going to have very large
guard bands when talking about 50 nanometer technology or something like this.
And I'll show you an example of how one uses this concept of circuit failure
prediction to eliminate such worst-case guard bands.
So here is how it works. So what you do is to start with, you start off with a very
small guard band. So for example, instead of working correctly for up to 7 years,
you say I want to work correctly for up to one day or 15 days or ten days, doesn't
really matter, okay. You start off with a very small guard band -- and I will stick
with 15 days, that sounds like a good number. And during
these 15 days, what you do is, while the system is running, you find these delay shifts,
and you find these delay shifts either using sensors or using online self-test or
something like that. And then at the end of the 15 days what you want to find out
is whether I got enough aging in various parts of the chip or not.
Since I can find delay shifts, I'll be able to tell how much aging I have gotten in
various parts of the chip. And based on this aging information, either you will
decide that you do not have to make any changes to your guard band or to your
system, or you will do some kind of self healing, and luckily there are various
options available for self healing. For example, you can go and change the body
bias -- the body bias knob may go away, you know, when you go to a 32
nanometer technology or something like this, but still, you know, it's a knob that
one could use depending on the company where the processors are built -- you
could adjust VDD, or you could even adjust speed, you know, in a very fine
grained way. Or actually people have found that if you let these transistors rest
a little bit, then these transistors recover from their aging, which means that, you
know, you could even use spare cores to run your tasks and let these transistors,
you know, go to a spa, you know, and recover from aging and then again work
fine, you know, after so many days or something like this.
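As a rough sketch of that prediction-and-healing loop, here is a toy version; the epoch length, thresholds, and the particular knobs chosen at each level are illustrative assumptions, not the actual policy from the talk:

    # Hedged sketch of the circuit failure prediction loop: ship with a small
    # guard band, measure the delay shift each epoch, then pick a self-healing
    # knob.  Epoch length, thresholds, and knob names are illustrative only.

    GUARD_BAND_PCT = 2.0      # small initial guard band instead of 7-year worst case
    EPOCH_DAYS = 15           # re-evaluate aging every 15 days

    def end_of_epoch_action(delay_shift_pct):
        """delay_shift_pct: aging-induced delay shift seen by sensors / self-test."""
        if delay_shift_pct < 0.5 * GUARD_BAND_PCT:
            return "no change"                  # well within the guard band
        elif delay_shift_pct < GUARD_BAND_PCT:
            return "adjust body bias / VDD"     # nudge a circuit-level knob
        elif delay_shift_pct < 2 * GUARD_BAND_PCT:
            return "lower clock frequency"      # fine-grained speed adjustment
        else:
            return "rest this core on a spare"  # let the transistors recover

    for shift in (0.4, 1.5, 3.0, 5.0):
        print(f"epoch shift {shift}% -> {end_of_epoch_action(shift)}")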
Now, that means that you have to design flip-flops with built-in aging sensors to
be able to find these delay shifts, or you have to use online self test. And -- yes?
>>: [Inaudible] twice the age mean it just seems like you can [inaudible].
>> Subhasish Mitra: Absolutely, yes.
>>: And [inaudible] the case.
>> Subhasish Mitra: Yes.
>>: And you would still get [inaudible].
>> Subhasish Mitra: Right. So what will happen is that it depends on the
granularity basically. It depends on your voltage domains, it depends on your
body bias domains and so on. Yes. So, but you are stuck with that granularity basically.
Maybe in the future we can think of very fine grained ways of doing the self
healing that I do not know of right now.
>>: [Inaudible]. So is the variance pretty big across --
>> Subhasish Mitra: Yes, the variance is big across the chip, yes. And one could take
advantage of that. Okay. So, the cost of putting in these sensors: we'll have
special flip-flops, as I will show next, with built-in aging sensors, and as we can see
the costs of doing something like this are very small. And by the way, a lot of
people have suggested ring oscillators and so on to collect this aging data.
That's not very scalable, because ring oscillators' activity factors are very
different from the activity factors or the signal probabilities of your actual design,
and they may not even show up the same way basically. So a ring oscillator
may or may not age as much as your actual design. So what you want to do is
you want your actual design to tell you how much it has aged, and that's what you
can do with these circuit failure prediction techniques.
So the big question is what kinds of aging sensors you will be using and how you
will be doing an online self test to be able to do the circuit failure prediction. So
let's talk about these two things for the remaining part of this presentation and
we'll be done. So again, the idea behind these sensors to find delay shifts is the
following. Today, for example, let's say you had a guard band TG, the 15 day
guard band that we were talking about. When you get the chip out the door today,
you know that all signals will transition before this TG today.
But later, depending on the slack of your paths and depending on how much the
chip has actually aged, these transitions may or may not creep into the guard band.
And that is actually the signature that you can detect using, for example, a special
sensor designed inside the flip-flop that tells, for the combinational
logic whose output is connected to the input of this flip-flop,
where its signals are transitioning with respect to that clock edge. And basically
that's what an aging sensor does.
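A minimal software analogy of what such a sensor checks, assuming we can observe each path's latest transition time relative to the clock edge; all timing numbers and path names below are made up for illustration:

    # Hedged sketch: flag paths whose latest transition has crept into the
    # guard band before the clock edge (a software stand-in for the
    # in-flip-flop aging sensor; all timing numbers are illustrative).

    CLOCK_PERIOD_PS = 500.0
    GUARD_BAND_PS = 10.0           # T_G: transitions must settle this early

    def creeping_paths(latest_transition_ps):
        """latest_transition_ps: dict path_name -> latest observed transition time."""
        threshold = CLOCK_PERIOD_PS - GUARD_BAND_PS
        return [p for p, t in latest_transition_ps.items() if t > threshold]

    observed = {"alu_carry": 470.0, "decode_sel": 493.0, "lsq_tag": 488.0}
    print(creeping_paths(observed))   # -> ['decode_sel'] has entered the guard band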
There are lots of details about this thing: you have to make sure that the aging
sensor itself is aging [inaudible], you have to make sure that this aging sensor is
not very big, you have to be careful about where you optimize -- you do not want
to place these aging sensors in all flip-flops of your design -- and so on, but, you
know, I wanted to give the essence of this thing, so I won't talk about those
details. One important thing to note is that, again, you have made the flip-flops
bigger, but you have not added any global interconnects, because you'll be using
the scan chain that already exists on the chip to get the information out. And
why is that possible? Well, I'll come to this point a little later, but let's get to the
scan chain thing.
Because it is a prediction technique, you are taking advantage of the gradual
degradation mechanism, so you do not need to get this information out -- to know
whether something has actually aged -- right at the cycle where you have
captured that delay shift. You can wait for a while, you know, like the 15 days we
were talking about, before you get that information out, which means that you
do not really have to go and, you know, scan out with extra dedicated interconnects
to do this thing; you can use the existing scan chain to get the information out, you
do not have to add any global interconnects.
The other thing that people worry about a lot with, you know, these kinds of
techniques, is whether it imposes any additional hold time constraints in your
design. There has been some work that imposes additional hold time constraints,
and hold time constraints are not good, you know, no one wants hold time issues
because they can completely mess up your design. But techniques like this
never impose any hold time constraints.
So do these things work in real life? I was pretty high level about, you know, the
details, but these things do work in real life; there are actual test chips which
show that these things are operational and they do the right thing. Now, as we
said, to be able to do a good circuit failure prediction you have to apply a very
thorough online self test. And why? Because it could very well be that your
application is not exercising the paths that are aging, because it depends on the
profiles of your application, and you know, for any prediction mechanism it's
garbage in, garbage out. You know, if you do not give good inputs, you are not
going to get good predictions.
Online self test is also useful for hard failure detection and diagnosis.
Now, there are some constraints on doing online self test. First of all, since
we are talking about delay shifts, we are talking about very thorough delay tests.
You cannot simply do some stuck-at fault testing and say, well, you know, I'm done;
you have to do pretty thorough delay tests to be able to find these problems.
That's number one. Number two is that when you do online self test, you cannot have
any visible system level downtime, because if the system is down, that's of no use.
There should be minimal performance impact. That means you have to
be clever about how you schedule these online self tests.
Of course, it has to be low cost, and of course it has to have minimal design flow impact.
Now, let's look at some of the existing techniques for online self test, and why
they're inadequate. For example, you could think of logic built-in self test, or LBIST,
which I'm sure many of you are familiar with, or have at least heard about. Logic built-in
self test has very low test coverage with respect to delay faults; you cannot get
very high delay test coverage out of the random patterns that logic built-in
self test uses. The cost can be very high, and the design flow can be really messy. So that's
why people have not really used logic built-in self test in a big way, even for
regular manufacturing tests.
There is this idea from 1986 by Mel Breuer from USC called roving
emulation. Basically, if I cast this idea into these
terms, what they were talking about is that if you have multiple
cores and you have one spare core, you always check your core
under test against that one core; it's like duplication, a kind of multiplexed
duplication basically. But that wasn't exactly the idea that, you know, Mel Breuer was
talking about at that time; he was talking about an emulation engine, so you can
emulate almost anything, it's part of that. But you know, I'm just giving a very high
level view.
An issue is that, you know, if you're relying on applications, you can already cover
what the application exercises just by finding those delay shifts. That's not the biggest
problem. The whole reason for doing online self test is very high coverage. And
if you still rely on the application to give you the coverage, again, you do not
know what you are getting. So you have to have pretty high coverage online self
tests, which these techniques cannot give. So what do you do? Well, you know,
you do something very simple and stupid, and that's what we call CASP. CASP
stands for Concurrent Autonomous Stored Patterns. It's very simple.
Concurrent, why? Because, well, you know, if you have a multicore system you can
test one or more cores while the rest of the cores are doing their job, which
means no visible system level downtime.
Autonomous because there will be an on-chip test controller, and that's almost the
only thing you have to implement to be able to do something like this, so, you
know, no one has to use a tester or do this thing manually.
And the third is the funniest thing, which is that you can just go and store all
the test patterns that you could think of in an off-chip flash. And that will be it.
And you can store whatever patterns you want: you can store all your
production tests, your structural and functional tests with quantified coverage and
so on, and, you know, you will get really good test coverage which you could
not get with built-in self test. And why does this make sense today? This
sounds like a trivial and stupid idea. Why does it make sense today?
Well, in the past people could not think of an idea like this, because the cost of
non-volatile storage was so high, but today just having an off-chip flash to
store these test patterns is nothing, and the amount of storage we are
talking about is, you know, okay, like I will show you some results for the
[inaudible], you will find it's like, you know, six megabytes of storage or
something like this. Okay, I'll give you a gigabyte, okay. Still, you know, it's
nothing compared to, you know, the amount of non-volatile storage that I am
getting. So that's why the major technology trends favor this idea. This is a
classic example of something that did not make any sense 10 years back. But
with the technology trends changing, we should not think about these complex
ways of solving the problem; there are far simpler ways of solving it.
So what we do in multicore designs is, you know, online, concurrent self test,
and that's basically what is being worked on over here, you know, at MSR: doing
very optimized schedulers for running these online self tests. And
then we go and store all of our test patterns in local non-volatile storage.
It's also very flexible and upgradable in the field: after your system is in the field,
if you find that a specific kind of failure mechanism is more prevalent than the others,
you can go and change your test patterns because, you know, Microsoft people
are very good at helping people download patches; now you can go and
download test pattern patches basically while the system is running, based on
what kind of failure mechanisms you see.
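To make the flow concrete, here is a minimal sketch of a CASP-style test round, using toy stand-in objects for the cores and the stored patterns; none of these names or the response model come from the actual implementation:

    # Hedged sketch of a CASP-style round: pick one core at a time, isolate it,
    # apply stored test patterns at speed, compare responses, and put it back.
    # The Core object and pattern format are toy stand-ins, not a real platform API.

    class Core:
        def __init__(self, name, broken=False):
            self.name, self.broken, self.online = name, broken, True
        def isolate(self):  self.online = False   # rest of the system keeps running
        def rejoin(self):   self.online = True
        def apply_at_speed(self, stimulus):
            # Toy response model: a broken core corrupts the response.
            return stimulus ^ (1 if self.broken else 0)

    def casp_round(cores, stored_patterns):
        """stored_patterns: (stimulus, expected) pairs, e.g. loaded from off-chip flash."""
        failed = []
        for core in cores:
            core.isolate()
            ok = all(core.apply_at_speed(s) == e for s, e in stored_patterns)
            core.rejoin()
            if not ok:
                failed.append(core.name)
        return failed

    patterns = [(0b1011, 0b1011), (0b0110, 0b0110)]   # trivially, expected == stimulus
    cores = [Core("c0"), Core("c1", broken=True), Core("c2")]
    print(casp_round(cores, patterns))                # -> ['c1']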
And the big reason why this has been enabled today is that two things have
happened in the testing world that people, you know, sometimes do not know
about. One is this idea of test compression, which can cut down the test volume
and test time by 10X to 100X, up to 1000X. And if we can do this massive
test compression, we do not need as much storage, we do not need as much test
time to, you know, run these tests. And another trend that's happening in the
testing world is on-chip support for at-speed tests. In the past people used to run
tests from the testers, but then, you know, when you have a two gigahertz
microprocessor and, you know, a 500 megahertz tester, there is no way
you can do at-speed tests from the tester, so you need on-chip support to be able
to run at-speed tests. And if you have this on-chip support to run these at-speed
tests, we should make use of it in the system to run these online tests, and
that's why it becomes more and more relevant.
Do you have a question?
>>: [Inaudible] the hardware built into the chip?
>> Subhasish Mitra: Yes. So basically, you know, some funny ways of doing XOR
gates, and if you do the XOR -- so these are not your regular compression
algorithms, okay? They do not work, okay. But these are some funny
connections of [inaudible] which kind of go back to coding theory, actually. It
gives rise to a new class of error correcting codes, actually. And if you do that,
you can cut down your test set volume very significantly without impacting test
quality, and, you know, if you're interested, that's another talk, but I can tell you
more details; that's something very close to my heart.
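To give a flavor of the XOR idea, here is a toy response compactor in that spirit; the particular wiring matrix below is arbitrary and chosen just for illustration, not the actual X-Compact construction:

    # Hedged sketch: compacting many scan-chain output bits into a few signature
    # bits using XOR trees.  The connection matrix here is arbitrary; real
    # compactors choose it carefully (coding theory) to preserve error visibility.

    SCAN_CHAINS = 6
    OUTPUTS = 3
    # Each scan chain feeds a distinct subset of outputs (rows = chains, cols = outputs).
    CONNECTIONS = [
        [1, 1, 0],
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
        [1, 0, 0],
        [0, 1, 0],
    ]

    def compact(scan_bits):
        """scan_bits: one bit per scan chain for this shift cycle."""
        out = [0] * OUTPUTS
        for chain, bit in enumerate(scan_bits):
            for col in range(OUTPUTS):
                out[col] ^= bit & CONNECTIONS[chain][col]
        return out

    good = [1, 0, 1, 1, 0, 0]
    bad  = [1, 0, 0, 1, 0, 0]        # single-bit error in chain 2
    print(compact(good), compact(bad))  # signatures differ, so the error is visible

The design choice that matters is that every chain drives a unique nonzero subset of outputs, so a single erroneous chain always perturbs the signature in a distinguishable way.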
So we looked at this -- actually [inaudible] did this work -- we looked at CASP for the
open source OpenSPARC T1, because it's open source, and, you know, if I show numbers you
won't be able to say that this is just some academic speaking; this is a
real design. As you can see, even if we have only 10X compression, we're
talking about six megabytes of storage, which is nothing, isn't it? Look at the
extremely high coverages that we are getting. This true-time test is basically the
most [inaudible] test that the test community knows today. And getting 93
percent of this true-time coverage is extremely high. You know? People don't
get that in real life, but this is a real design.
And the test time per core is roughly around 300 milliseconds, and this is
assuming a flash; if you had a hard disk, this number
would go up. Assuming a flash, you know, this includes the time to bring in the
patterns, run the tests, compare the results and, you know, tell whether
this chip works or not. And look, all this happens with an extremely small area
impact of 0.01 percent. And [inaudible] had to change 10,000 lines of very
[inaudible] of this open source microprocessor to be able to implement something
like that. So this is something very doable, something very practical.
And this is how it works. So, for example, you know, there is a test scheduler
which says, well, core four is now selected for testing; core four is temporarily
isolated, the test is run, post processing is done, and this continues to happen.
And, as I'm sure you have all guessed, this scheduler is very important, because
the scheduler can completely mess up your entire application performance if you
are not careful about what was running and if you are not careful about the
scheduling. Actually, once you have this idea of online self test, you can make
use of virtualization to optimize your online self test. And the reason is the
following: if you wanted to implement the whole thing in hardware you would run
into a problem, and the problem is this. Let's say one of the cores is being tested.
While it is being tested, the rest of the cores are running, and they may be
generating requests for this core which is under test, and now you will need
buffers to store these requests, and you will not know how many buffers you
would need.
Instead, if you have virtualization support, you could, you know, either stall the
requests and bring the core back, or, if you had a spare core, you could even
transfer the activity of the core which is to be tested onto that spare core and
rove around, while the rest of the system sees a constant number of cores. So
you could do all that stuff, and we have been able to do this on an actual platform,
which is shown over here, and, you know, you could have an application
performance impact of less than two percent if you had a spare core and roved
things around.
And you could do even better things, you know, that's what [inaudible] is working
on, on scheduling techniques so that you may not even need that spare core for
example, and still continue to do your work.
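A toy sketch of that rove-the-spare-core idea; the scheme and names below are illustrative stand-ins, not the actual virtualization platform or its API:

    # Hedged sketch of the virtualization-assisted scheme: instead of buffering
    # requests to a core under test, migrate its work to a spare core so the
    # rest of the system always sees a constant number of active cores.

    def rove_and_test(active_cores, spare, run_self_test):
        """Test every active core once while keeping len(active_cores) cores doing work."""
        results = {}
        for i in range(len(active_cores)):
            victim = active_cores[i]
            active_cores[i] = spare            # migrate the victim's work to the spare
            results[victim] = run_self_test(victim)   # victim is now idle and testable
            spare = victim                     # the just-tested core becomes the new spare
        return results

    cores = ["c0", "c1", "c2", "c3"]
    print(rove_and_test(cores, "c4", run_self_test=lambda c: "pass"))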
Okay. So I'm almost done. I'll talk about some of the other research projects
very quickly before I end this talk. But please feel free to stop me and ask
questions.
The first one is about post-Silicon validation; I promised that I would show some
slides. So as I said over here, this is a typical, you know, development flow for
almost anything, and post-Silicon validation cost is one of the biggest costs. And,
you know, [inaudible] is the vice president of design and
technology solutions at Intel; this is, you know, what he said, and these are the
numbers that are [inaudible] about today's post-Silicon validation costs, and
everybody thinks that this is going to go up.
And there are two major causes of this cost. As we talked about, one is
localization and the other is these electrical bugs. And this is an actual number
that I got from actual people who do actual post-Silicon validation at these
companies: for one electrical bug it could take days to weeks to localize it
before, you know, the circuit debug guys can come in and fix it.
Okay. So this is what we do in [inaudible]; it stands for Instruction Footprint
Recording and Analysis, and it works mainly for processors right now. And the reason it
works for processors is that the way we choose these instruction footprints,
the way we choose these analyses, is very tied to the microarchitecture of a
processor.
But if you can do that, you can eliminate the limitations of existing techniques,
because you do not need any full system level reproducibility, you do not need
any full system level simulation, but you can still accurately tell the location of a
bug. And by accurately I mean that you can tell which unit, which hardware block,
it came from, and the size of the block will be something like around 5,000 to
10,000 two-input NAND gates. You know, this is pretty good granularity, because
now the circuit level debug guys can go and test the heck out of that particular
block to find the speed path that created this problem.
And at a very high level, this is my only slide which gets into any details
of this [inaudible]. What we do is that in the design phase we insert these
recorders inside the chip design, so you can think of it as, for every pipeline
stage there is a small recorder associated with that pipeline stage which stores
very specific information corresponding to that pipeline stage.
And the total amount of storage that you need for these recorders overall, for the
Alpha microprocessor chip, is roughly around 60 kilobytes of distributed storage.
And, you know, when the post-Silicon validation is happening, these
recorders store, you know, special information in a non-intrusive way, because
we do not stop the system or anything, it's all happening in parallel, and then a
failure is detected. That's what is important, because you want to detect the failure
fast.
Now, if the error appeared, you know, a billion cycles after the
beginning of your validation run, you don't care about that. What you
care about is the latency from the appearance of the error to the appearance of
the failure. So that's what you want to make, you know, very short, and
that's where some of those error checking techniques play a very special role.
Once the failure is detected, you scan out all the recorder contents and you
post-analyze them offline using some special analysis techniques. Actually, we use
some ideas from [inaudible]: control flow analysis, dependency analysis and so on.
You know, you can think of using [inaudible] basically to do diagnosis, and
you do not need any system level simulation because we are looking at
self-consistency. We know the assembly code, the binary, of what you were
running on the post-Silicon validation platform, and we have this information from
the recorders, and we look for places where things are self-consistent: if you
wrote something, whatever you read back should be the same thing that you
wrote, and so on. And that lets us localize these bugs.
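As a very small sketch of one such self-consistency check; the footprint record format below is an invented simplification, not the actual recorder contents:

    # Hedged sketch: one IFRA-style self-consistency check -- every recorded load
    # should see the last recorded store to the same address.  An inconsistency
    # points at the hardware block whose footprint disagrees.

    def check_load_store_consistency(footprints):
        """footprints: time-ordered list of ('store'|'load', pipeline_stage, addr, value)."""
        last_store = {}
        suspects = []
        for kind, stage, addr, value in footprints:
            if kind == "store":
                last_store[addr] = value
            elif kind == "load" and addr in last_store and value != last_store[addr]:
                suspects.append(stage)        # inconsistency localizes to this block
        return suspects

    trace = [("store", "LSU", 0x40, 7), ("load", "LSU", 0x40, 7),
             ("load", "LSU", 0x40, 5)]       # second load disagrees -> suspect LSU
    print(check_load_store_consistency(trace))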
Again, this is a 2,000-foot view of the thing, but it does work in real life. And as I said,
you know, we also do some work beyond [inaudible], blue sky stuff on, you know,
designs with carbon nanotube field-effect transistors. They have big promise; you
know, the energy-delay benefit of using carbon nanotube FETs could be around,
you know, 40X or something like that, at least 10X. So somebody at a
company said, you know, if you guys tell us 10X, then after all the work is done
it will probably be 50 percent, but we'll still be very happy about it. So that's
why you have to start with a big number. And 10X is true.
But there are some major showstoppers. You cannot grow these carbon
nanotubes parallel to each other, you know, like you can look in a --
>>: [Inaudible].
>> Subhasish Mitra: Carbon nanotubes.
>>: [Inaudible].
>> Subhasish Mitra: Oh, okay. What are carbon nanotubes? Good. So carbon
nanotubes are very simple. You have graphene, okay, you have a sheet of
graphene. And what is graphene? Well, graphene is a very thin sheet of graphite.
And what is graphite? The stuff that you use in your lead pencils, basically. So if
you draw a line, or if you shade a region, you get multiple layers of graphene
basically. And, you know, if you take one of those layers of graphene out and
then roll that sheet of graphene, you will get a tube, and that is what a
carbon nanotube is basically.
>>: It's embedded in the --
>> Subhasish Mitra: So what we do is that we -- so you can grow carbon
nanotubes on silicon or on quartz, okay? Now, people have found, we
have found, that these carbon nanotubes grow better on quartz than on silicon. I
have some pictures of carbon nanotubes on silicon; you will see that they really
look like noodles, okay. Versus here, as you can see, they look, you know,
kind of straight, with some, you know, kinks and bends and so on. And we can
actually grow them at full wafer scale and so on.
Now, if you want to make logic out of these carbon nanotubes, as you can guess,
if things go left and right, you know, how in the world can you design logic that
would ever work, isn't it? So you have to worry about mispositioning of the
carbon nanotubes; they can be misaligned. You cannot say I want a carbon
nanotube here, I do not want a carbon nanotube there, and so on. You have to
deal with that problem. And oh, by the way, [inaudible] a fraction of the carbon
nanotubes turn out to have zero band gap, which means they act as metal, which
means they are always conducting, which means that if you are creating transistors
out of those, they will essentially be shorts. Given all these things, you want to make
a real technology out of this, okay.
And we haven't solved the problem -- I'm not saying we have solved the problem -- but
we actually had the first VLSI demonstration this June at the VLSI Symposium,
where, you know, we could have full wafer-scale growth of carbon nanotubes, and we
were growing them on quartz because they have better alignment behavior on
quartz, but nobody wants to use quartz for doing circuits, isn't it. So we can
actually use a sticker-like transfer technique to lift the carbon nanotubes off the
quartz and put them on silicon; we can do that. And after doing that, we
actually have a theory, a graph theory result, where we can prove that no matter how
misaligned the carbon nanotubes are, as long as they are directional, which means
they are not spiralling around in one particular place, there is a way to create
layouts of logic circuits so that even if the carbon nanotubes are misaligned and
mispositioned, you will always have correct logic functions no matter what.
And these are some examples -- this is the pull-up of an AND gate, this is the pull-up of
an OR gate, and you can see, you know, look at the currents; they
actually work in real life. These are some more complex circuits, like an
AND-OR-invert and an OR-AND-invert and so on. So [inaudible] where we can create
single logic gates, and, you know, we are working on more stuff as we speak.
>>: This [inaudible] chip was made in the Stanford lab?
>> Subhasish Mitra: It was created at the Stanford lab, yes, exactly. It was
fabricated at the Stanford lab, yes. So, you know, those are some of the other
projects that are going on in our group. And, you know, I did some other stuff
before I joined Stanford; basically, you know, I like to do stuff,
it doesn't really matter what it is. So this was, as I said, the
coding theory thing that we developed, you know.
I went to Stanford for graduate study, for my
Ph.D., and, you know, we did a lot of stuff: we worked on self-repairing FPGAs,
we had this big project on configurable computing, and we were saying, well, if you
have more faults, you know, you create another configuration that
avoids that particular location of the fault, and still you can continue to run your
application and so on.
We showed all that stuff, and then we also did pure software techniques for
hardware fault tolerance basically, and those were deployed on the ARGOS satellite
up in space, and that actually sent back data on what kinds of soft errors happen
up in space basically. They had a radiation-hardened microprocessor and
they had a commercial microprocessor with no hardware fault tolerance, and with
the software techniques, the commercial processor did better than the
radiation-hardened processor. And, you know, there was lots of interesting stuff basically
in that space.
So that kind of concludes my talk. I hope that you know I got the message
across that going forward for robust system design there are lots of challenges
but there are lots of new opportunities that are either created by understanding
the physics or because of changing cost constraints in future technologies. But
that means that we have to think of new ways of doing things. And I hope what
we talked about today will bring about a sea change in future designs.
So thank you very much.
[applause]