>> Onur Mutlu: It's a great pleasure to introduce Subhasish Mitra who is an
assistant professor at Stanford. Subhasish was with Intel before he joined
Stanford. He was a principal engineer over there. He worked there for five years, and
he has built up very influential techniques in robust systems design. His
work spans from the circuit level to the system level. In particular, his widely adopted test
compression technique X-Compact at Intel received the Intel Achievement
Award, which is the highest honor that Intel --
>> Subhasish Mitra: That's fine.
>> Onur Mutlu: And he's been working on Built-in Soft Error Resilience
techniques. I think he will talk a little bit about that and online self-test. And with
that --
>> Subhasish Mitra: Thank you, Onur, you know, for inviting me over here. And
actually we are having a very good collaboration going on. My student Yan Jing
[phonetic] is sitting over here. She is actually doing her internship at
MSR this summer. So hopefully, you know, there will be a lot of interest in this
kind of stuff and I would like to have a two way conversation, rather than a one
way traffic over here, so please feel free to stop me and ask questions, that way
it will be more fun.
So let's say that you designed a system and let's say we start from a hardware
chip and we perfectly made sure that everything was correct and it was perfectly
formally verified and then it was tested right so there were no defects and you
had this chip in the field and then you see errors happening left and right in the
storage nodes, in the flip-flops, in the memories and the combinational logic, and
one of the big sources of these problems is what is called radiation-induced
soft errors, which happen because of particles in the packaging material and
neutrons that are coming from cosmic rays, basically, okay?
So the point over here is that in the past no one used to care about these kinds of
problems, only the space guys used to worry about radiation-induced soft errors,
but going forward at sub-45 nanometer technologies as we will see on the next
slide, you have to worry about almost all parts of your design and its resilience to
soft errors. You have to worry about memories, you have to worry about
flip-flops, and you probably have to worry about combinational logic as well, although
it's not quite clear whether combinational logic is going to play a big role or
not.
But this is a big problem that one has to worry about and here is a quote from
one of the IT executives and this came up in the Forbes Magazine a couple
years back in response to one of the processors that were built and sold to these
guys. So who cares about soft errors? You know, why are we talking about it?
Well, it all depends on error rates. And I'm not going to get into the details of
where I got this number from, but you will roughly find if you look at soft error
rates of individual components in a system, you will get a number something
like this: if you had a 20,000-processor server farm, you would have one major
flip-flop error every 20 days.
Now, this by itself doesn't say whether you could have a big problem as a result of this or not.
So for example here is an actual chip that I worked on in the past, and if you look
at the various components of this chip and you worry about memories, you worry
about flip-flops, you worry about combinational logic and you worry about the
contributions of these various components to the overall soft error rate of this
chip and this is what you find.
For example, this particular chip was a storage chip and it had a lot of on-chip
memories, and those on-chip memories would typically be protected using ECC, you
know, error correcting codes. Still there were lots of on-chip memories that were
not protected, or the designers decided not to protect these on-chip memories,
because every chip at the end of the day has to satisfy a certain soft error goal.
Almost no chip has a zero soft error rate. As long as you satisfy your goal that
your customer wants, you're okay.
For that particular chip it was fine not to protect this on-chip -- this remaining
unprotected memory that's shown over here. But that's not what the biggest
concern was at that time. The concern was how in the world we were going to
deal with these flip-flops, which had almost the same chunk, if you look at it, of the
overall soft error rate distribution as the unprotected memory.
Because for the unprotected memory, if one had to go put protection in, one
would have to go and put in ECC or more error correcting codes. There was
a path that was known to be able to solve that problem versus for flip-flops, you
know, you cannot use ECC, and the question is what else could you do? And
that was one of the burning questions at that time.
So going back to my previous point of, you know, who cares about soft errors: as
we said over here, if you had a 20,000-processor server farm, you could have
roughly one major flip-flop error every 20 days. And this number has a
meaning. People have done these studies where they have shown that most soft
errors do not matter at the system level; it's only roughly around five percent of
the soft errors that really matter at the system level. Then, you know, people do
not agree, some say 10 percent, some say one percent, some say five percent.
It doesn't really matter because this 20 days could become 200 days or it could
become four days or whatever. Those are all very important numbers from our
standpoint.
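To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python; every rate and count below is an assumed placeholder, not the data behind the slide, chosen only so the result lands in the same ballpark as the quoted 20 days.

    # Back-of-envelope soft-error math for a large server farm.
    # All rates below are made-up placeholders, not the numbers from the talk.

    FIT_PER_FLIPFLOP = 0.001          # assumed failures per 10^9 device-hours per flip-flop
    FLIPFLOPS_PER_PROCESSOR = 2e6     # assumed flip-flop count per processor
    PROCESSORS = 20_000               # size of the server farm
    DERATING = 0.05                   # fraction of raw flips that matter at the system level

    raw_fit = FIT_PER_FLIPFLOP * FLIPFLOPS_PER_PROCESSOR * PROCESSORS
    visible_fit = raw_fit * DERATING          # FIT = errors per 1e9 hours
    mtbf_hours = 1e9 / visible_fit
    print(f"one system-visible flip-flop error every {mtbf_hours / 24:.1f} days")

With these assumed inputs the sketch prints roughly 21 days; changing the derating from five percent to one percent or to ten percent simply moves that number up or down, which is exactly the point being made about 20 versus 200 versus four days.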
And what could these major effects be? Well, you know, one of the major
effects could be silent data corruption, which means, you know, that one could go
to the bank and deposit $20,000 and depending on which way the bit flipped you
could either be very happy or you could be very sad. And so you know, that's
really a big issue, and that's why almost every chip today has two soft error goals.
One is about silent data corruption: they want to have very low rates of silent
data corruption.
Versus they may look at detected but uncorrected errors, which means that you
know that there is a problem in your system but you really do not know how to
deal with it which means that either you have to do some kind of recovery or
something and depending on how recovery is implemented, you could have
system downtime or not. And we know -- I don't have to preach this to this
audience that the cost of downtime could be really high.
So again, it's not that you have to worry about the silent errors, you also have to
worry about detected but uncorrectable errors and do something about them.
And since I'm an academic, wearing my academic hat, I just said memory soft
errors is a solved problem. That's not really true. You know, there are lots of open
issues that we could discuss. But at a bigger scale, I think these are much more
open problems than memory soft errors.
So taking a step back, the soft errors are not the only problems that we have to
worry about going forward. Now, if you look at some of the technology-related
problems: you have to worry about radiation for sure, the soft errors that we were
talking about, and you have to worry about this [inaudible] which is called erratic bits.
These erratic bits -- actually Intel had a paper in IEDM in 2005 that was kind of
one of the first papers that talked about these erratic bits. What they found, for
their really low power processors, was that the minimum voltage at which
the memories could work was erratically shifting over time.
Which means that one way to solve the problem is to have a big voltage guard
band, but then [inaudible] is going to go up; now these are targeted for really low power
applications and then your battery life is affected. And, you know, people
were really concerned about a problem like this. So these are kind of the
sources of temporary errors. Versus going forward you also have to worry about
these other sources of problems, which include aging: transistors are like human
beings and they age over time, and the amount of aging depends on how much
workload, you know, these transistors have been running.
You have to worry about early-life failures. Typically what happens is when
chips are designed and tested on the test floor, these chips are stressed at a very
high voltage and at a high temperature so that the weak chips are screened out.
Now, if you stress these chips too much, this is what happens to the chips.
And this was actually one of the highest
performance microprocessor chips from a company -- not Intel, because I used to
work at Intel -- and you can see that what really happened is that because of
extremely high leakage currents in the burn-in oven at the higher temperature
these chips actually were fried.
So which means that either you can apply a lot of stress which will kill your good
chips or you're going to apply very little stress, which is -- which won't be enough
to stress your bad chips, which means now you have to do something about the
so-called early-life failures, otherwise these chips will fail within the warranty
period and, you know, the vendor has to deal with it.
And of course, you know, you have to worry about process variations.
And I'm sure you've heard about process variations a lot, and so, you know,
that's kind of the last one on my list, although, you know, it could be something
very important; these others are the ones that are not talked about a whole
lot. But these are some of the mechanisms that people are worried about going
forward.
So how do we deal with these problems? Well, what we think at Stanford in our
research group is that to be able to deal with the problems of
soft errors and erratic bits, you will use a technique which is called BISER, or
Built-In Soft Error Resilience that I will talk about today. And this BISER
technique is going to correct all these temporary errors that you would worry
about. And then for the two sides of the bathtub curve, which have to do with
early-life failures and wear-out: as I said, burn-in is getting difficult, and almost all the
alternatives to burn-in that the industry developed in the '80s are on their Sunset
Boulevard right now, because everybody is
reaching their limits basically. And IDDQ testing is an example of that.
So for these early-life failures and for aging problems, you will be using a technique
that I'll also talk about, which is called circuit failure prediction, where you will be
collecting a lot of data while the system is running and based on that data you
will try to tell where the system is actually going and where it could fail
because of these aging problems. And to be able to do a good prediction you
may have to do a very thorough online self-test, because we know, with any prediction,
if the input data is not good enough then your prediction won't be very good. So
that's why you may want to do a very thorough online self-test. But given this circuit
failure prediction together with online self-test, we think that can resolve the
problems of early-life failures and wear-out and aging and so on. Yes?
>>: What's the timeframe for wear-out?
>> Subhasish Mitra: So in the past people used to worry about wear-out, you
know, like really late in the game, you know, for example you know like 20 years
or 15 years or something like that. So today what has happened is that wear-out
happens from day zero. So for example if you have a chip, your PMOS
transistors' threshold voltages will be degrading, you know, just
from the day that you are, you know, exercising the chip.
Now, one way to deal with the problem is to say that well, you know, if I had -- if
my speed as a result changed by five percent, let's say over five years and just
making these numbers up, then you may not worry about it, because you could
say I would just put a guard band of five percent in my frequency and I could
claim that I do not have an aging problem. Although there is an aging problem
that's going on that has been resolved by doing pessimistic design.
Now, there is a big worry going forward that if you want to make these
chips work for seven years or something like that, which would be the case for
enterprise applications, you may have to go and put in as much as 20 percent of
guard band to deal with the problems. Now, that becomes an important issue because
as we know very well that our speeds are not going up, number one, number 2,
our variations are going up, which means that you have to put a guard band for
variations and then now you have to go and put a guard band for aging and you
put guard bands over guard bands over guard bands, and how many guard
bands are you going to put, basically? So that's where the worry is. So it depends on
how you look at the problem of aging. Does that answer your question?
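As a rough illustration of how such independent margins stack up, here is a tiny sketch; the percentages are invented for this example only.

    # Guard bands multiply up: a few "small" margins stack into a large frequency loss.
    # Percentages are illustrative only.
    margins = {"aging": 0.05, "process variation": 0.08, "voltage noise": 0.05}

    usable = 1.0
    for name, m in margins.items():
        usable *= (1.0 - m)
    print(f"remaining usable speed: {usable:.1%}")   # ~83% of nominal frequency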
>>: I'm curious if I buy a computer today is there, you know, is there a warranty
on how long the processor will run?
>> Subhasish Mitra: Yes. Roughly. So there are two kinds of warranties. So
there is one kind of warranty today which you buy a processor, you know, that if it
fails, if it breaks within the warranty period completely then, you know, somebody
is going to replace your processor. And that warranty makes sense today
because, you know, people are still doing this burn-in, and since they're doing
burn-in, most of the weak chips are, you know, gone on the
manufacturing floor, so, you know, maybe they will have
some goals like, you know, a hundred in a million or 50 in a million that
could be bad because of that thing. So that's one kind of warranty.
And the other kind of warranty is basically not really a warranty, but it's more
like a use-condition thing that says that even if you buy a PC, maybe you use the
PC for only three years, or five years -- you know, I don't know, probably three
years is the right number. And if you use it for three years, then let's try
to target by design and by guard banding that you get a certain level of speed,
by doing something like that. Now, that's becoming very murky because, you know,
our speeds are not even constant anymore inside CPUs because of dynamic
voltage and frequency scaling and so on.
So, you know, it's not clear cut. But at least, when they say
that this thing is going to work at a certain gigahertz as its speed, they take
those speed guard bands into account today.
>>: So I shouldn't be surprised if after seven years --
>> Subhasish Mitra: Oh, yes, absolutely.
>>: It either breaks or.
>> Subhasish Mitra: Absolutely. So that's the whole reason, for people that do overclocking
to play games, you know -- that's why
they are going to now have even tags, isn't it, inside the processor or something to
find out if this was overclocked or not, and then they are not going to replace your
processor. You know, if you came back after two years and they found that you
overclocked it -- it has to do with reliability basically. That's it. That's aging
basically.
So what we think is that by using this concept of BISER and circuit failure
prediction and online self-test, we can have a consistent story about a new way of
thinking about designing robust systems. It's kind of a departure from the classical
thinking of, you know, having massive redundancy or massive online checking
to be able to deal with the problems but more like understanding the underlying
physics of these mechanisms to come up with very optimized solutions and
hopefully I'll be able to convince you that the costs of doing that will be much
cheaper than, you know, just, you know, than just relying on redundancy to be
able to solve some of these problems.
Now, to be able to do that, the reason that I came here and the reason you know
[inaudible] got so interested in it and we were doing this collaboration is what is
most important is this global optimization. Because I will go and I will put a lot of
bells and whistles inside the hardware to be able to deal with some of these problems,
but then you need optimization at high levels of the stack to be able to
orchestrate my hardware mechanisms the right way at the right time so that you
do not pay a full, a very high system level cost.
So the story about cutting down the cost is going to be in this global optimization,
and hopefully I'll be able to show you some examples of, you know, how that
would happen. So that's why it is important for, you know, folks like yourselves to
be interested in these kinds of problems. So that's the story about technology,
and you know, I'm sure you have heard people saying that, oh, gee, you know,
these are never going to be problems anymore because everybody's going to
stop at 22 nanometers and the semiconductor industry is not going
to go beyond 22 nanometers.
So you know, what I tell my students, and, you know, everybody, is that we are still
actually in business so far as our research group is concerned, because
even if you do not worry about technology related problems that we talked about
before, still even if you stop at 22 nanometers still we'll be building complex
systems. And if we build complex systems that means that we have to deal with
the problems of bugs. And validation is -- will still continue to be a problem.
And if you look at hardware design today, hardware design is mostly about
validation today. And even in validation, what people have found is that while
significant progress has been made in pre-silicon validation, which is often called
verification, where, you know, people use simulation-based verification, formal
verification and so on, what people are finding is that post-Silicon validation is one of
the big costs in hardware design today. And by post-Silicon validation what we
mean is: the chip came back, you plugged it into the system, and what you saw is a
blue screen and from there you have to find out which one of those billion
transistors is actually malfunctioning because it just takes only one logic gate to
malfunction in a synchronous design to be able to create any problem at the
lower level -- at one level up.
So from that level, how do you find out -- yes?
>>: So when you say the gate is malfunctioning, you don't mean it's because of
logic, it's because of --
>> Subhasish Mitra: Timing, timing issues, absolutely. So it could be -- actually
it's good if it is a logic error because if it is a logic error then it can be repeatable,
okay. But if it was a timing error, because some noise happened or because some
slight variation happened, you know, those are like Heisenbugs,
which means that when you try to observe it, the
problem is not going to happen; it's only in a certain electrical state of your
system that the problem is going to show up and it will be extremely hard to
reproduce. This is like, you know, the car has a problem and we go to the
mechanic and the mechanic says no there is no problem, you know, everything is
just fine.
And they're seeing it, and that's where the biggest cost is. They're saying that
today up to 35 to 40 percent of the total cost is about this. And I have a
quote from an Intel executive later in my talk which says that this
will actually dominate design cost unless we do something about it. So, you know, we are
in business.
And even if the technology stops -- we have a technique called IFRA, actually,
which can go and actually do this thing: from a system level failure you can
kind of pinpoint whether it's the ALU or the scheduler or
something in the hardware that's creating the problem, and we can do it very
efficiently. And I won't get into details of that, I'll just mention it later, but if
anybody is interested I'll talk about it.
>>: [inaudible].
>> Subhasish Mitra: I can talk to you.
>>: You're doing some machine learning stuff there?
>> Subhasish Mitra: Not really machine learning, but it is a data collection approach: I
do collect some very specific kind of data, based on the architecture,
and I do post-analysis of that in some clever ways, and I'll be very happy
to talk about that.
So you know, you know, I'm around today and you know, you can reach me later
on. Please feel free. Yes?
>>: [inaudible] acronym IFRA?
>> Subhasish Mitra: Yes, IFRA stands for Instruction Footprint Recording and
Analysis. Which makes sense, doesn't it: when instructions are passing through
the processor, they have some footprints that you collect, and you have to be very
careful about what kinds of footprints those are, and you record them concurrently.
You do not do anything about it. And when a crash happens you scan them out
and you analyze them to find, to diagnose, the failing locations.
>>: [inaudible] happening in hardware.
>> Subhasish Mitra: Yes?
>>: You store all this data in hardware buffers?
>> Subhasish Mitra: Only 60 kilobytes of hardware buffers. Not everything. You're
not going to be able to store billions of instructions, isn't it?
>>: Yes, of course.
>> Subhasish Mitra: Of course. But then that's why what is very important is that
you have to reduce your time from the occurrence of the problem to the time of
the exposure of the problem to one level up and that's where we play some tricks
to cut down that time, to be able to do that. And actually we have some actual
demonstrations that it works in real life. So I'll be very happy to talk about it.
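A minimal software sketch of the idea, with all structure names, buffer sizes, and the diagnosis heuristic invented for illustration -- the real IFRA technique records architecture-specific footprints in a small on-chip buffer and does much cleverer post-analysis:

    from collections import deque, Counter

    class FootprintBuffer:
        """Tiny model of a footprint recorder: keep only the last N entries."""
        def __init__(self, entries=4096):               # small, fixed-size on-chip buffer
            self.buf = deque(maxlen=entries)

        def record(self, pc, unit, signature):
            # In hardware this happens concurrently, with no impact on execution.
            self.buf.append((pc, unit, signature))

        def scan_out(self):
            # After a crash, the buffer contents are scanned out for post-analysis.
            return list(self.buf)

    def diagnose(footprints, golden):
        """Post-analysis sketch: flag units whose recorded signatures disagree with
        expected (e.g., re-simulated) signatures for the same instructions."""
        suspects = Counter()
        for (pc, unit, sig) in footprints:
            if golden.get((pc, unit)) not in (None, sig):
                suspects[unit] += 1
        return suspects.most_common()

The key property the sketch preserves is that only a bounded window of recent footprints survives, which is why keeping the error-to-crash latency short matters so much.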
Okay. So that kind of is the big picture. There is another project that we work on,
that I'll show two slides on, which is, you know, blue sky. We are even looking
beyond CMOS, and actually we showed the first experimental demonstration of
being able to create complex logic gates out of carbon nanotube transistors. And
I'll share some pictures with you later on. But that's like in a really blue sky, but
it's a lot of fun.
Okay? So let's go back and let's get into the details. First I'll start with this
Built-In Soft Error Resilience technique and I'll show you how we can self-correct
these radiation-induced soft errors.
So here is the big picture. Here is the key idea of all this BISER stuff. If you look
at traditional error detection and recovery, number one, it is expensive, and,
number two, it is expensive not only in terms of hardware cost, it's also expensive
in terms of design methodology. I know of at least one microprocessor where
people very seriously thought of implementing error detection and recovery and
[inaudible] backed off just because of the nightmare of validating that the recovery
mechanism will actually recover the way you want it to recover at the right time.
And you know, that validation is very expensive.
Versus this Built-In Soft Error Resilience that I will talk about, you will see that I'm
going to redesign the flip-flops inside any digital design and that redesign of the
flip-flops will self-correct any errors that would happen. Okay? And as we will
see, you will be able to correct errors both in latches and combinational
logic, although just to keep it high level I will focus on latches so that you get
the essence of it, and I'll be very happy to talk about, you know, how one could
do it for combinational logic. And very recently we showed that these are useful
not only for the radiation-induced soft errors but also the erratic bit errors that I was
talking about, these erratic shifts in Vmin and so on. So my message is:
correct errors, don't just detect them.
So here is how it works. So let's say that you have a combinational logic block and, you
know, I will show this at the transistor level. There won't be too many transistors, and this is
the only transistor diagram I have in my presentation, so please don't worry about
it.
So let's say you have a combinational logic block and the output of the combinational
logic is connected to a latch -- not a flip-flop, a latch, okay, suppose. You could do
the same thing for flip-flops. And suppose, you know, that I give you another
latch in parallel to this latch. Just, you know, take my word for it today, okay. And I'll
talk about where this latch comes from and what's the cost and all that kind of
stuff.
So one thing that you could do, of course, is to say that, well, you know, I'll connect
the output of this combinational logic to this latch and I'll connect the output of the
combinational logic to that latch. Of course I could do that. And then you could
say, well, you know, now that I have these two latches, I could stick in a
comparator, for example, and if there was an error, there will be an error flag
which says that you got an error. Now, that's not a very good idea.
The reason it's not a very good idea is the following: now think of a design
with a million latches, and each of those latches will be holding an error signal
saying that, you know, whether I found an error or not. To be able to do any kind
of a recovery you have to grab all those error signals and you have to pass them
on to your recovery block to tell the recovery block that gee, you know, now you
have to do recovery. And suddenly your chip design will be dominated by all the
routing of these error signals that will have to go to the recovery block. And the
designers will just hate you because you know, routing is already a big problem
and these routing of error signals, gee, they will say I'd rather stay with the old
technology and not deal with it.
So instead of doing that, what you do is you just insert this structure
with four transistors. And this is actually a very well-known structure that was
invented by Muller in 1959 at the University of Illinois, okay. And this is called a
C element, which people have used extensively in asynchronous circuit design. And
this C element works in the following way.
So when the two inputs of the C element are the same, it acts as an inverter. So
for example you can see that these are As and Bs, 0011 it acts as an inverter.
When the two inputs of the C element are different, it holds the previous value at
its output. So let's look at what happens over here.
So if you didn't have any errors in these two latches, then, you know, the two
inputs of the C element will be the same, it will act as an inverter, logically it will
do the right thing. Now, electrically it will add some more load and so on, and you
have to deal with it, okay.
Now, if the two inputs of the C element are different, which means one of these
latches has an error, there will be a mismatch over here and the C element will retain the
previous value at its output. Now, who in the world has said that this previous
value is the correct value? Because if the previous value is not the correct value,
I have not done anything good by having this C element isn't it and why am I
calling it a self-correcting design? So on the next slide using cardboard
animation I will formally prove that the previous value will always be the correct
value actually. And that's why it will work. Okay. And here is how it works.
So again, you know, you have these two latches -- the system latch and the redundant
latch -- and your combinational logic is going to write something into the latches. So
the way this whole thing works is the following. First let's say
the combinational logic was trying to write a zero into these two latches. We
know that the clock input of these latches has to be one. That's when the
latches are in transparent mode, that's when you can write into the latches.
Now, the key observation -- and this is very well known in the soft error literature -- is that
when the latches are transparent they are not vulnerable to soft errors, because the
latches are being very strongly driven by the upstream combinational logic. So
what happens is, well, you know, the zero got written over here and over here, this
acted as an inverter, you got a one at the output, which is the inverted output.
And now the clock input went from one to zero; that's when the latches are
actually storing the value and that's when you can have soft errors in the
latches. Say a flip happens on this latch -- doesn't really matter, it could be any
one of these latches. The C element sees a mismatch at its two inputs, it's
going to block, and the output continues to stay at the correct value.
By taking advantage of this fact, you can show that you can -- it's a self
correcting design because it returns the correct value at its output no matter what
happens.
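A small behavioral model of that argument, written as a sketch -- the latch and clock-phase handling are simplified down to exactly the two phases just described, and the class and function names are invented:

    class CElement:
        """Muller C-element: inverts when inputs agree, otherwise holds its output."""
        def __init__(self, init=1):
            self.out = init
        def eval(self, a, b):
            if a == b:
                self.out = 1 - a      # acts as an inverter
            return self.out            # on mismatch, the previous value is retained

    def biser_cycle(data, upset_latch=None):
        # Phase 1: clock high, both latches transparent and strongly driven -> no upsets.
        system_latch = redundant_latch = data
        c = CElement()
        c.eval(system_latch, redundant_latch)      # output settles to the correct (inverted) value
        # Phase 2: clock low, latches hold; a particle strike may flip one of them.
        if upset_latch == "system":
            system_latch ^= 1
        elif upset_latch == "redundant":
            redundant_latch ^= 1
        return c.eval(system_latch, redundant_latch)

    # The output stays at the correct (inverted) value no matter which single latch flips.
    assert biser_cycle(0) == biser_cycle(0, "system") == biser_cycle(0, "redundant") == 1
    assert biser_cycle(1) == biser_cycle(1, "system") == biser_cycle(1, "redundant") == 0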
>>: [inaudible] you said that you actually split it into two pieces, two phases.
>> Subhasish Mitra: Yes.
>>: And you made this observation that the error happened only --
>> Subhasish Mitra: Only in the second phase. If the error happened in the first
phase there would be no way you could help me. So, you know, the way I think
about it is that -- and it's what I have found in most of the problems
that I have worked on -- as we go one level deep and we
understand kind of the source of the issue, you know, that's where physics
comes in, and if we understand the physics a little bit, there are so many interesting
solutions that one can come up with without having to pay that much of a price.
If I didn't use that particular property, then you know, everything will blow off and
a scheme like this would not work.
>>: There's also this hidden assumption that both cannot go bad.
>> Subhasish Mitra: I'll talk about that, you know, yes. So there is an issue of
what happens with single-event multiple upsets, and I'll share some data with
you, real data, on what the vulnerability to that is.
So you can think of this as a flip-flop, you know. So now you can think of for
example when people do design, they have a technology library, and the library
has AND gates, OR gates, NAND gates; they have a regular flip-flop, they have a scan
flip-flop. Now, you can think of having a BISER flip-flop inside your library basically.
Now, so far as the cost is concerned, you will be thinking, you know, what are
the costs of doing something like this? That's what I'm going to share on the next
slide, and then I'll try to share some really recent data -- you know, it took
almost a miracle, you know, getting some people to agree that I can
share this data with people -- and... yeah, go ahead.
>>: What happens if there is an error in the C element?
>> Subhasish Mitra: Right. That's a very good question. So let's try to
understand. What happens if there is an error in the C element? Well, if there is
an error in the C element, then under a single-error assumption these two latches
are going to drive the C element very strongly. So what
you will see is, at most, a glitch at the output of the C element.
And, you know, and that glitch will go away in the subsequent stages of the
combinational logic. But very good question. So please, you know, please ask
these questions, otherwise it's very hard to talk about everything. Okay. But
good. And I'm making people think.
So what's the cost of doing something like this? So of course at the flip-flop level
so let's look at power cost first because that's what people care about the most.
And then we'll talk about area costs. So at the flip-flop level, I got two latches
that are transitioning, so of course I'm going to have 2X the power at the flip-flop
level, and I cannot, you know, do anything about it. But you know, again, nobody
sells flip-flops in an actual design, everybody sells the full design.
So we looked at the Alpha processor because it was open source, and we got
hold of this error injector that Professor [inaudible] at Illinois developed, and based on
that, we found that -- if you can find the right flip-flops to protect -- not all
flip-flops are equally important from an architectural-level error standpoint.
So if you can find the right flip-flops to protect -- for example, if you wanted to
reduce your chip-level soft error rate by 2X, you just protect a small fraction,
say 15 percent of the flip-flops, with a two percent power penalty.
If you wanted to cut down the chip-level soft error rate by 10X, you protect
50 percent of the flip-flops, the 50 percent most important flip-flops, and you pay a
nine percent chip-level power penalty for doing something like this. Okay. Yes?
>>: You predict the important ones by injecting.
>> Subhasish Mitra: So.
>>: And determining --
>> Subhasish Mitra: Here, yes, here we were injecting errors, you know, doing a
fault injection and determining whether it was important or not. On the -- either
on the next slide or two slides from here I will show you a formal verification
technique for doing that actually. So you know, because that's an important
question. Now, that's where, you know, optimization comes in. Given that we can
solve the problem at the lowest level, to be able to contain cost now we have to
look at other layers, to be able to do the cost
optimization.
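One way to picture that selection step is a simple greedy choice over fault-injection data; this is only an illustrative sketch, with an invented input format, not the actual flow used for the Alpha study.

    # Greedy sketch: rank flip-flops by their contribution to chip-level SER
    # (e.g., from fault-injection data) and protect the most "efficient" ones first.
    def select_flops_to_protect(flops, target_ser_reduction):
        """flops: list of (name, ser_contribution, power_cost_of_biser) tuples."""
        total_ser = sum(s for _, s, _ in flops)
        budget = total_ser * (1 - 1 / target_ser_reduction)   # SER that must be removed
        removed, power = 0.0, 0.0
        chosen = []
        # Best SER reduction per unit of added power first.
        for name, ser, p in sorted(flops, key=lambda f: f[1] / f[2], reverse=True):
            if removed >= budget:
                break
            chosen.append(name)
            removed += ser
            power += p
        return chosen, power

For a 2X target the budget works out to half the total soft error rate, for 10X it is ninety percent of it, which mirrors the two design points quoted above.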
So this is what you see. Now, what are the chip-level area costs of doing
something like this? Well, you know, of course at the flip-flop level your area
doubles because your flip-flop has gotten bigger, but then there are two benefits
that work in your favor. Number one, we know that in a real design you're
always dominated by long interconnects, versus here what you are adding is
local transistors and local interconnects.
And you know, we have done these experiments, you know place and route and
everything, you know, my friends at various other companies have done these
experiments and the chip level area cost of doing something like this is extremely
small. Sometimes you see one percent, sometimes you see 0.5 percent,
sometimes people have even seen zero percent, because, you know, their chips
have so much white space, because these are synthesized designs rather than hand-placed,
you know, custom designs.
And you know, the next question is how do you really optimize
the BISER insertion, the question that you were asking. There is actually another
benefit of using this BISER flip-flops in post-Silicon validation and test that I'll talk
about later.
So let's focus on this thing that I was talking about, how do you go and find out
which flip-flops are the most important flip-flops? Of course you could do a lot of
fault injections or error injections to be able to do that. The thing with error
injection is that if error injection tells you that there is a problem that's good news
because then you know that there is a problem. If your error injection tells you
that there is no problem, the trouble is you really do not know whether there is no
problem or whether it's simply because you have a [inaudible] issue here,
okay.
So that's where -- actually this is joint work with Professor [inaudible] at UC
Berkeley -- we were using a formal verification approach to do it. We
were saying, look, what's going on over here? So if you have a bunch of
flip-flops, and let's say you were worrying about a soft error in a flip-flop, you could
model that as a two-state machine. So, you know, there was no soft error to start
with, at some arbitrary time a soft error happened, and from that point on there
is no other soft error that would happen, everything would be fine with that
flip-flop.
And now you just take a cross-product of this particular two-state machine with
the formal model of your design, and then you can check for properties of the
design, and then you could come up with two answers. If the properties pass,
then you don't have to protect that flip-flop, otherwise if the properties do not
pass, you do have to protect the flip-flop basically.
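A toy, explicit-state version of that idea is sketched below; a real flow would use a model checker and symbolic techniques, and the step/check hooks here are hypothetical stand-ins for the design's next-state function and the property of interest.

    from itertools import product

    def needs_protection(step, check, init_state, flop_index, inputs, depth=10):
        """Explicit-state sketch: inject a single bit-flip in one flip-flop at some
        arbitrary cycle (the two-state error machine firing once), then check whether
        the property can ever fail within a bounded horizon."""
        for flip_cycle in range(depth):
            for stimulus in product(inputs, repeat=depth):
                state = init_state
                for t, inp in enumerate(stimulus):
                    if t == flip_cycle:          # the error machine fires exactly once
                        state = (state[:flop_index]
                                 + (state[flop_index] ^ 1,)
                                 + state[flop_index + 1:])
                    state = step(state, inp)
                    if not check(state):
                        return True              # a soft error here is observable: protect it
        return False                             # bounded evidence that BISER can be skipped

This brute-force enumeration is exponential and only meant to show the cross-product idea; the rigor in the real approach comes from the model checker exploring all behaviors symbolically.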
And you know, you can use a model checker and so on, and we did it. What
you find is that by doing this, for example, for the SpaceWire communication
protocol chip design, which is open source, you can cut down the cost, the power
cost, of BISER by 4X. So, you know, it's something around five percent or
something like that at the chip level that you have to pay. And
it's a rigorous way of really proving that you can actually do it. But of course --
>>: [Inaudible] there are some latches there, it doesn't matter what values they
have.
>> Subhasish Mitra: Yes.
>>: What the [inaudible].
>> Subhasish Mitra: Exactly. So that's a very good question. So what are
they doing? Well, there are two things that you can take benefit of. First of all,
since it was a communication protocol, there are already built-in mechanisms in
the communication protocol itself to deal with errors, okay. For example, if you
have CRC errors or if you have timeout errors, okay, they will
automatically be dealt with -- the system will deal with them. So that's why, even if those
latches -- those are care bits, okay, but still, you know, you do not have to go and
protect them, because they are taken care of at the system level, at the protocol
level. But then there is another thing, which is that if you have a single-cycle error
in a latch and it does not matter, that does not make the latch a don't care. You know,
the usual notion is that if there is a stuck-at fault in a particular design, meaning a signal is
always stuck at one, and you never see a difference, then it's a don't care.
But in this particular case, it's a single-cycle error, which is different from
saying that, you know, you don't care about that bit in the system. So you know, I
can actually show you examples where you could have a single-cycle
error that would not matter, but if that particular flip-flop was gone your
complete design would be messed up.
>>: But I thought if you had a single-cycle error, because that could be
repeated at every cycle, it's like you are completely existentially quantifying out the
contents of that flip-flop.
>> Subhasish Mitra: Yeah, but you know, you could
always have architectural constraints -- for example, these are the sequential
don't cares -- you could say that every time I have a one over here, I have a
zero over there, okay, something like that, okay. So if you have architectural
constraints -- you know, what you said would be true if there were no
constraints on the contents of the flip-flops. But if there are only a certain set of
bits in the flip-flops that are care bits, and the others are sequentially don't-care bits,
then you may run into a situation like that.
But for this particular design, what we found was that we were basically taking
advantage of the protocol-level design.
>>: [Inaudible] comment about that, also.
>> Subhasish Mitra: Yes.
>>: So it seems to me -- so this BISER stuff is really cool because you do it
only locally and --
>> Subhasish Mitra: I agree.
>>: And fix the problem. Okay. But now you're trying to save that stuff.
>> Subhasish Mitra: Yes.
>>: And you're trying to not use it.
>> Subhasish Mitra: Yes.
>>: And then what you're going to do is if you don't use it, then you might trigger
all this high level protocol stuff.
>> Subhasish Mitra: Sure.
>>: Right?
>> Subhasish Mitra: Absolutely.
>>: Which might have its own expense. Of course nothing is for free.
>> Subhasish Mitra: Absolutely, yes. I absolutely agree. And that's where, you
know, that's where the optimizer -- that's why the optimization is so important to
understand that at what level you want to go and solve the problem. I absolutely
agree with you. Yes. There is also another issue isn't it? You ready to make a
comment?
>>: Well, also, I mean if you make the reliability at the lower level stronger, then
you can eliminate some of the protocol at the higher level. I mean so that [inaudible].
>> Subhasish Mitra: At the end of the day, that's why it's so important to have a
cross-layer optimization, exactly. There is another issue, isn't it, which we
didn't talk about: who in the world said that we had the complete specification of
that particular thing that you --
>>: [Inaudible].
>> Subhasish Mitra: But it's garbage in, garbage out, isn't it? Like if you didn't
verify with respect to the right set of properties, it goes back to the coverage of
the properties that you verified against. But here is the point that I wanted to
make: there are interesting techniques, and one could even think of, for
example, combining error injection with something like this to be able to tell,
you know, what you could do. This is kind of an interesting spin to the whole thing,
the whole problem.
There is another benefit of using a technique like, you know, BISER that I was
showing before, and the benefit is that actually you can turn protection on and
off. So here is actually a flip-flop design that's inside many of your processors.
This is Intel's scan flip-flop design. And Intel decided to implement a scan
flip-flop in this particular way for the following reason: so that while the
system is running they could scan out the data, which is a snapshot of the system,
without having to stop the system. And this is very useful for post-Silicon
validation purposes. And that's why they had this spare flip-flop.
And now you can think of reusing the same thing for protection, and now you
have what I call a design-for-quality flip-flop, okay? Because that's going
to do soft error correction in the field, that's going to help for scan test on
the tester floor, and then that's going to also enable you to do post-Silicon debug
in, you know, a post-Silicon debug environment. And oh, by the way, since you
can turn this flip-flop on and off, that means that, again, if you
understood your system well enough, you could be in between two modes: you
could be in a high-reliability mode and you could be in an economy mode where,
yeah, the area you are stuck with, but area is not the biggest deal; the power is
the bigger thing, which you can trade off either statically or dynamically.
This brings us to the question of, you know, application-level error
rates.
So, you know, when I was at Intel and it's not about Intel it's with almost any
company that you can think of, you know, there was this whole issue of having
one single core, one single processor core that will run for all applications, you
know, like for laptops, for servers and so on, and then there were this -- there
were these wars you know that who pays the price of reliability because you
know, more power, you know, more soft error protection, you know, the goals
could be very different depending on what market space you are talking about.
By having this kind of dynamic way of, you know, turning protection on and off --
what we call reconfigurable protection -- you know, you can trade off; you can say,
well, you know, for applications that do not really care about soft
errors, you know, I turn the protection off, otherwise I turn the protection on. And
the question is, are there any benefits of even doing it dynamically during runtime?
I don't know the answer. So that's why these are question marks. But these are
interesting things that are enabled by a flip-flop design like that.
>>: So one thing in your design [inaudible] you don't actually know if it's
happening?
>> Subhasish Mitra: Yes.
>>: So have you thought at all about that, in other words [inaudible].
>> Subhasish Mitra: You're not the first person to ask me the question, as you
can guess, okay. So I have two answers to this, okay. My first answer is that
why do you need to know? If I can correct, okay. You know, there are so many
things that happened inside our design. Do you know about everything? And
too much knowledge is sometimes not very good. Okay.
>>: [Inaudible].
>> Subhasish Mitra: Yeah. Yeah. But at the same time, one could come up
with, you know, incarnations of these flip-flops, for example, where you could have
error checkers as well, you know, rather than just having the C element, at a
higher cost of course, and there were some companies that were even, you
know, suggesting something like that. So, yeah, you could do that.
>>: Have you thought at all about like a statistical error detection network
because for example in the case you're talking about where we're reconfigurable,
if you had an environment where you're seeing a lot of errors you might want to
turn on the error protection.
>> Subhasish Mitra: Yes. Exactly. You could clearly do something like that,
yes. And the good news is that unlike typical power shutdown
mechanisms that take a lot of cycles to get into a power-saving mode or back into
a regular mode, here it actually takes four clock cycles to turn protection
on and off, because you do it using scannable signals that are already present
inside the chip.
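A sketch of what such reconfigurable protection control could look like in software terms; the thresholds and the scan-bit interface are assumptions made for illustration, not part of any actual design.

    class ReconfigurableProtection:
        """Sketch: turn BISER protection on when the observed upset rate gets high.
        Thresholds and the write_scan_bit hook are placeholders."""
        def __init__(self, on_threshold=1e-3, off_threshold=1e-4):
            self.on_threshold = on_threshold
            self.off_threshold = off_threshold
            self.enabled = False

        def update(self, upsets, cycles, write_scan_bit):
            rate = upsets / cycles
            if not self.enabled and rate > self.on_threshold:
                self.enabled = True
                write_scan_bit(1)   # only a few clock cycles via existing scan signals
            elif self.enabled and rate < self.off_threshold:
                self.enabled = False
                write_scan_bit(0)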
So, you know, somebody already asked this question of what happens for
multiple upsets. Now, if I was giving this talk to you last year, I would have said,
oh, you know, no one cares about multiple upsets, it's all single upsets.
But what has happened is that these companies have actual data
now to show that -- first of all, for memories it was known for a long time
that you have single-event multiple upsets. But they have started seeing
single-event multiple upsets at 45 nanometers for flip-flops now. And the rate
of these upsets is much lower -- say maybe it's like for five percent of the cases or
maybe one percent of the cases you will see single-event multiple upsets.
It's like [inaudible] I'm okay with, you know, 20X or something like that. On the next
slide I have some actual data for BISER that I will share with you. But the big
comment is that the single-error assumption is clearly not
sufficient. And to be able to do the analysis of single-event multiple upsets you
have to worry about layout, you have to worry about technology CAD.
And we are doing some work, and we think that BISER can cut down single-event
multiple upsets by more than 100X -- actually it could be 300X. Because,
to give you the heuristic explanation of why that would be the case:
in BISER, an error occurs only if the corresponding nodes are both upset, and the
corresponding nodes are not placed close to each other, while the non-corresponding
nodes are; so you still end up having a compact layout, but the sensitive pairs have
kind of moved apart. Yes?
>>: [Inaudible] in memory?
>> Subhasish Mitra: In some sense you can think of it that way, but at a much
finer scale. But you can think of it that way, yes. What amazed me was that in July I
was at this small conference called IOLTS, or the International On-Line Testing
Symposium; they asked me to give an invited talk on soft errors, and I show up
there, give my talk, and the next speaker is Dr. Norbert Seifert from Intel, who I
worked with when I was at Intel, and he stands up and he starts showing data on
my BISER flip-flops. So it seems that, you know, there were some experiments
that were done by some folks at 45 nanometers on a test chip, and BISER was
there basically, and you can see what they found. So you know, I worked
with Norbert, I said, you know, can I create one slide that will be approved by you
that I can show to people, and finally, you know, they came back and
said, yes, you know, I could do something.
So for alpha particles they found that BISER [inaudible] soft error rates by more
than 1000X. For neutrons, they found that BISER [inaudible] soft error rates by more
than 100X -- actually it's far more than 100X. You know, I was told not to tell the
exact number basically.
And what was found is that BISER was very effective in correcting single-event
multiple upsets, and the reason is, first of all, the diversity that
I was talking about, and also what was happening is BISER doesn't have any
self-regenerative feedback; it's not like, you know, if you have an error, you
make up for it and you feed the same thing back into the flip-flop -- it's kind
of, you know, a feed-forward kind of design. And that's why, you know, the
multiple upset effects are very small in BISER, and so, you know, when they were
saying this is 100X and all that stuff, that, you know, includes even single-event
multiple upsets. So that's something that I was very happy about. And of
course Norbert said that I have to write this mandatory thing, that you can get
more reduction at higher cost.
And you know, I would agree with that. Okay. Okay. So that's what I have to
say about Built-In Soft Error Resilience or BISER, and now I'll move on to the
next topic which is about circuit failure prediction.
So let me remind you where we are. We said that for normal operation you
care about the soft errors and temporary errors and so on, and that's why you
would be using this Built-In Soft Error Resilience, versus for early-life failures
and wear-out you'll be using this idea that we are calling circuit failure prediction
with this CASP online self-test.
So let's look at what circuit failure prediction is. Well, the whole idea behind
circuit failure prediction is to be able to predict failures before errors appear in
system data and states. And this is in contrast to classical error detection where,
you know, you find a problem after the errors have appeared in system data and
states. Of course, you know, something like a radiation-induced soft error you are
not going to be able to predict, because it's a pretty
random phenomenon.
But for transistor aging and the early-life failures that I was talking about before,
there is a gradual degradation that's associated with these mechanisms. And this
gradual degradation shows up as [inaudible], and since it shows up as
delay shifts you can go and predict, you can tell where the system is going
before it has actually failed. And what are the pros and cons of doing circuit
failure prediction? Well, since you are
finding these problems before errors appear in system data and states, you don't
have an issue of data corruption. You do not have an issue of having to deal
with very high error rates -- and, you know, recovering from high error rates is a
problem. And you end up with very good self-diagnosis, because now you are
collecting all this information over the several billions of cycles that the system is
running, to be able to predict whether something is going bad or not. So you
know what the behavior of the system was before it actually failed, or, you know,
before it actually went to a failure scenario.
And that's why for these mechanisms, you know, transistor aging and early-life
failures, because of this gradual degradation which results in delay shifts
because of the physical phenomena, you can go and predict failures before
errors set in in the system. And by the way, Shakespeare explained several
centuries back why circuit failure prediction is a good idea.
So what is the applicability of circuit failure prediction? Well, as I said, it has to
do with degradation. And degradation shows up as delay shifts. Now, the
conventional wisdom about a delay shift is that a delay shift makes things slow, you
know, and you can see a lot of people talking about it. The startling thing that
we found -- and this is, you know, based on several
generations of test chips that we have been doing, where we have been
measuring stuff -- we found that these delay shifts do not have to be positive delay
shifts, in the sense that they do not have to slow things down; they actually can
make things fast before things break.
But still they are delay shifts, and as long as you can find delay shifts you can tell
whether there is something wrong going on in your system before the
system fails. Okay? So that's kind of the key idea over here.
Now, as I said, for transistor aging it's pretty well established how these
delay shifts happen, and these are positive delay shifts. Things get slow
basically. But for early-life failures -- for example gate oxide problems -- you
know, as I said, we actually have experimental data and we are clearly seeing that the
delay shifts can happen in both directions before the chip absolutely fails
basically. But still they are delay shifts, and if you can find delay shifts you should
be able to find whether there is a problem, and this can even provide several
different kinds of alternatives to burn-in.
Now, how am I going to do circuit failure prediction? So, you know, that was a
very fuzzy, kind of, you know, very philosophical discussion that we had
until now about, you know, how we should be able to do circuit failure
prediction. Now, the question is, what kind of support do I need on chip to be
able to do this kind of prediction, and there are two ways to do it. One is
concurrently with the application execution, which means while the system is
running, while your application is running, you have special flip-flop
designs, special sensors, that would look at delay shifts and would try to find out
whether there have been any delay shifts that happened inside the system. Or you
could do it using a periodic online self-test, which means that the system is
running as it is, and every so often you run a very [inaudible] online self-test and
you find out what's going on with these delay shifts in the system. And that
even provides some cost savings opportunity, because you may not need these
sensors and you may not need to pay the power or the area price of doing these
sensors and so on. But still you can find delay shifts.
So I will get into the details on the next few slides on, you know, how one could
design some sensors like this and how one could do this kind of online self test.
So let me be more concrete, and let me show you one concrete example of
circuit failure prediction in the context of, you know, NBTI aging. NBTI stands for
Negative Bias Temperature Instability. It's an aging mechanism that happens in
PMOS transistors and became very
prominent starting at 90 nanometer technology.
Basically what happens is the threshold voltage degrades over time which means
that the drive current goes down over time, which means that the PMOS delay
goes up over time.
And the factors that determine the amount of aging, they're like human beings,
you know, so it depends on temperature, it depends on voltage, it depends on
the workload because it is related to the percentage of time that the PMOS
transistor is on.
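For intuition, NBTI-induced threshold shift is often described with an empirical power-law-plus-Arrhenius form; the sketch below uses placeholder constants, since the real coefficients are fitted per technology and are not given in the talk.

    import math

    def nbti_delta_vth(t_seconds, duty_cycle, temp_kelvin,
                       A=2.7e-3, n=0.16, Ea=0.08, k=8.617e-5):
        """Commonly quoted empirical power-law form for the NBTI threshold-voltage
        shift. All constants here are placeholders, not fitted values."""
        return A * (duty_cycle * t_seconds) ** n * math.exp(-Ea / (k * temp_kelvin))

    # Aging grows with time, workload (fraction of time the PMOS is on) and temperature:
    print(nbti_delta_vth(3 * 365 * 24 * 3600, duty_cycle=0.5, temp_kelvin=358))

The shape is what matters here: the shift keeps growing with stress time and temperature, which is why the worst-case guard band over a seven-year enterprise lifetime can get so large.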
So as we discussed before about speed guard bands and so on, the
current industrial practice to be able to deal with this transistor aging is basically
a [inaudible]: you say [inaudible] if you want to use this thing for seven years in an
enterprise environment, this is the maximum amount of aging one could ever get,
based on the worst-case workload, the worst-case [inaudible], the worst-case
voltage, and [inaudible] going to go and put a speed guard band based
on that. If that speed guard band was two percent you would not worry about it,
you would just go and put it in. If that speed guard band is 20 percent, then you
would worry about it and you would want to be able to do something about it, and
that's the big worry going forward: whether you are going to have very large
guard bands when talking about 50 nanometer technology or something like this.
And I'll show you an example of how one uses this concept of circuit failure
prediction to eliminate such worst-case guard bands.
So here is how it works. So what you do is to start with, you start off with a very
small guard band. So for example, instead of working correctly for up to 7 years,
you say I want to work correctly for up to one day or 15 days or ten days, doesn't
really matter, okay. You start off with a very small guard band -- and I will stick
with 15 days, that sounds like a good number. And during
these 15 days, what you do is, while the system is running, you find these delay shifts,
and you find these delay shifts either using sensors or using online self-test or
something like that. And then at the end of the 15 days what you want to find out
is whether I got enough aging in various parts of the chip or not.
Since I can find delay shifts, I'll be able to tell how much aging I have gotten in
various parts of the chip. And based on this aging information, either you will
decide that you do not have to make any changes to your guard band or to your
system, or you will do some kind of self healing, and luckily there are various
options available for self healing. For example, you can go and change the body
bias -- the body bias knob may go away, you know, when you go to a 32
nanometer technology or something like this, but still, you know, it's a knob that
one could use depending on the company where the processors are built -- you
could adjust VDD, or you could even adjust speed, you know, in a very fine
grained way. Or actually people have found that if you let these transistors rest
a little bit, then these transistors recover from their aging, which means that, you
know, you could even use spare cores to run your tasks and let these transistors,
you know, go to a spa, you know, and recover from aging and then again work
fine, you know, after so many days or something like this.
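As a rough sketch of that prediction-and-healing loop, here is a toy version; the epoch length, thresholds, and the particular knobs chosen at each level are illustrative assumptions, not the actual policy from the talk:

    # Hedged sketch of the circuit failure prediction loop: ship with a small
    # guard band, measure the delay shift each epoch, then pick a self-healing
    # knob.  Epoch length, thresholds, and knob names are illustrative only.

    GUARD_BAND_PCT = 2.0      # small initial guard band instead of 7-year worst case
    EPOCH_DAYS = 15           # re-evaluate aging every 15 days

    def end_of_epoch_action(delay_shift_pct):
        """delay_shift_pct: aging-induced delay shift seen by sensors / self-test."""
        if delay_shift_pct < 0.5 * GUARD_BAND_PCT:
            return "no change"                  # well within the guard band
        elif delay_shift_pct < GUARD_BAND_PCT:
            return "adjust body bias / VDD"     # nudge a circuit-level knob
        elif delay_shift_pct < 2 * GUARD_BAND_PCT:
            return "lower clock frequency"      # fine-grained speed adjustment
        else:
            return "rest this core on a spare"  # let the transistors recover

    for shift in (0.4, 1.5, 3.0, 5.0):
        print(f"epoch shift {shift}% -> {end_of_epoch_action(shift)}")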
Now, that means that you have to design flip-flops with built-in aging sensors to
be able to find these delay shifts, or you have to use online self test. And -- yes?
>>: [Inaudible] twice the age mean it just seems like you can [inaudible].
>> Subhasish Mitra: Absolutely, yes.
>>: And [inaudible] the case.
>> Subhasish Mitra: Yes.
>>: And you would still get [inaudible].
>> Subhasish Mitra: Right. So what will happen is that it depends on the
granularity basically. It depends on your voltage domains, it depends on your
body bias domains and so on. Yes. So, but you are stuck with that granularity basically.
Maybe in the future we can think of very fine grained ways of doing the self
healing that I do not know of right now.
>>: [Inaudible]. So is the variance pretty big across --
>> Subhasish Mitra: Yes, the variance is big across the chip, yes. And one could take
advantage of that. Okay. So, the cost of putting in these sensors: we'll have
special flip-flops, as I will show next, with built-in aging sensors, and as we can see
the costs of doing something like this are very small. And by the way, a lot of
people have suggested ring oscillators and so on to collect this aging data.
That's not very scalable, because ring oscillators' activity factors are very
different from the activity factors or the signal probabilities of your actual design,
and they may not even show up the same way basically. So a ring oscillator
may or may not age as much as your actual design. So what you want to do is
you want your actual design to tell you how much it has aged, and that's what you
can do with these circuit failure prediction techniques.
So the big question is what kinds of aging sensors you will be using and how you
will be doing an online self test to be able to do the circuit failure prediction. So
let's talk about these two things for the remaining part of this presentation and
we'll be done. So again, the idea behind these sensors to find delay shifts is the
following. Today, for example, let's say you had a guard band TG, the 15 day
guard band that we were talking about. When you get the chip out the door today,
you know that all signals will transition before this TG today.
But later, depending on the slack of your paths and depending on how much the
chip has actually aged, these transitions may or may not creep into the guard band.
And that is actually the signature that you can detect using, for example, a special
sensor designed inside the flip-flop that tells, for the combinational
logic whose output is connected to the input of this flip-flop,
where its signals are transitioning with respect to that clock edge. And basically
that's what an aging sensor does.
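A minimal software analogy of what such a sensor checks, assuming we can observe each path's latest transition time relative to the clock edge; all timing numbers and path names below are made up for illustration:

    # Hedged sketch: flag paths whose latest transition has crept into the
    # guard band before the clock edge (a software stand-in for the
    # in-flip-flop aging sensor; all timing numbers are illustrative).

    CLOCK_PERIOD_PS = 500.0
    GUARD_BAND_PS = 10.0           # T_G: transitions must settle this early

    def creeping_paths(latest_transition_ps):
        """latest_transition_ps: dict path_name -> latest observed transition time."""
        threshold = CLOCK_PERIOD_PS - GUARD_BAND_PS
        return [p for p, t in latest_transition_ps.items() if t > threshold]

    observed = {"alu_carry": 470.0, "decode_sel": 493.0, "lsq_tag": 488.0}
    print(creeping_paths(observed))   # -> ['decode_sel'] has entered the guard band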
There are lots of details about this thing: you have to make sure that the aging
sensor itself is aging [inaudible], you have to make sure that this aging sensor is
not very big, you have to be careful about where you optimize -- you do not want
to place these aging sensors in all flip-flops of your design -- and so on, but, you
know, I wanted to give the essence of this thing, so I won't talk about those
details. One important thing to note is that, again, you have made the flip-flops
bigger, but you have not added any global interconnects, because you'll be using
the scan chain that already exists on the chip to get the information out. And
why is that possible? Well, I'll come to this point a little later, but let's get to the
scan chain thing.
Because it is a prediction technique, you are taking advantage of the gradual
degradation mechanism, so you do not need to get this information out -- to know
whether something has actually aged -- right at the cycle where you have
captured that delay shift. You can wait for a while, you know, like the 15 days we
were talking about, before you get that information out, which means that you
do not really have to go and, you know, scan out with extra dedicated interconnects
to do this thing; you can use the existing scan chain to get the information out, you
do not have to add any global interconnects.
The other thing that people worry about a lot with, you know, these kinds of
techniques, is whether it imposes any additional hold time constraints in your
design. There has been some work that imposes additional hold time constraints,
and hold time constraints are not good, you know, no one wants hold time issues
because they can completely mess up your design. But techniques like this
never impose any hold time constraints.
So do these things work in real life? I was pretty high level about, you know, the
details, but these things do work in real life; there are actual test chips which
show that these things are operational and they do the right thing. Now, as we
said, to be able to do a good circuit failure prediction you have to apply a very
thorough online self test. And why? Because it could very well be that your
application is not exercising the paths that are aging, because it depends on the
profiles of your application, and you know, for any prediction mechanism it's
garbage in, garbage out. You know, if you do not give good inputs, you are not
going to get good predictions.
Online self test is also useful for hard failure detection and diagnosis.
Now, there are some constraints on doing online self test. First of all, since
we are talking about delay shifts, we are talking about very thorough delay tests.
You cannot simply do some stuck-at fault testing and say, well, you know, I'm done;
you have to do pretty thorough delay tests to be able to find these problems.
That's number one. Number two is that when you do online self test, you cannot have
any visible system level downtime, because if the system is down, that's of no use.
There should be minimal performance impact. That means you have to
be clever about how you schedule these online self tests.
Of course, it has to be low cost, and of course it has to have minimal design flow impact.
Now, let's look at some of the existing techniques for online self test, and why
they're inadequate. For example, you could think of logic built-in self test, or LBIST,
which I'm sure many of you are familiar with, or have at least heard about. Logic built-in
self test has very low test coverage with respect to delay faults; you cannot get
very high delay test coverage out of the random patterns that logic built-in
self test uses. The cost can be very high, and the design flow can be really messy. So that's
why people have not really used logic built-in self test in a big way, even for
regular manufacturing tests.
There is this idea from 1986 by Mel Breuer from USC called roving
emulation. Basically, if I cast this idea into these
terms, what they were talking about is that if you have multiple
cores and you have one spare core, you always check your core
under test against that one core; it's like duplication, a kind of multiplexed
duplication basically. But that wasn't exactly the idea that, you know, Mel Breuer was
talking about at that time; he was talking about an emulation engine, so you can
emulate almost anything, it's part of that. But you know, I'm just giving a very high
level view.
An issue is that, you know, if you're relying on applications, you can already cover
what the application exercises just by finding those delay shifts. That's not the biggest
problem. The whole reason for doing online self test is very high coverage. And
if you still rely on the application to give you the coverage, again, you do not
know what you are getting. So you have to have pretty high coverage online self
tests, which these techniques cannot give. So what do you do? Well, you know,
you do something very simple and stupid, and that's what we call CASP. CASP
stands for Concurrent Autonomous Stored Patterns. It's very simple.
Concurrent, why? Because, well, you know, if you have a multicore system you can
test one or more cores while the rest of the cores are doing their job, which
means no visible system level downtime.
Autonomous because there will be an on-chip test controller, and that's almost the
only thing you have to implement to be able to do something like this, so, you
know, no one has to use a tester or do this thing manually.
And the third is the funniest thing, which is that you can just go and store all
the test patterns that you could think of in an off-chip flash. And that will be it.
And you can store whatever patterns you want: you can store all your
production tests, your structural and functional tests with quantified coverage and
so on, and, you know, you will get really good test coverage which you could
not get with built-in self test. And why does this make sense today? This
sounds like a trivial and stupid idea. Why does it make sense today?
Well, in the past people could not think of an idea like this, because the cost of
non-volatile storage was so high, but today just having an off-chip flash to
store these test patterns is nothing, and the amount of storage we are
talking about is, you know, okay, like I will show you some results for the
[inaudible], you will find it's like, you know, six megabytes of storage or
something like this. Okay, I'll give you a gigabyte, okay. Still, you know, it's
nothing compared to, you know, the amount of non-volatile storage that I am
getting. So that's why the major technology trends favor this idea. This is a
classic example of something that did not make any sense 10 years back. But
with the technology trends changing, we should not think about these complex
ways of solving the problem; there are far simpler ways of solving it.
So what we do in multicore designs is, you know, online, concurrent self test,
and that's basically what is being worked on over here, you know, at MSR: doing
very optimized schedulers for running these online self tests. And
then we go and store all of our test patterns in local non-volatile storage.
It's also very flexible and upgradable in the field: after your system is in the field,
if you find that a specific kind of failure mechanism is more prevalent than the others,
you can go and change your test patterns because, you know, Microsoft people
are very good at helping people download patches; now you can go and
download test pattern patches basically while the system is running, based on
what kind of failure mechanisms you see.
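To make the flow concrete, here is a minimal sketch of a CASP-style test round, using toy stand-in objects for the cores and the stored patterns; none of these names or the response model come from the actual implementation:

    # Hedged sketch of a CASP-style round: pick one core at a time, isolate it,
    # apply stored test patterns at speed, compare responses, and put it back.
    # The Core object and pattern format are toy stand-ins, not a real platform API.

    class Core:
        def __init__(self, name, broken=False):
            self.name, self.broken, self.online = name, broken, True
        def isolate(self):  self.online = False   # rest of the system keeps running
        def rejoin(self):   self.online = True
        def apply_at_speed(self, stimulus):
            # Toy response model: a broken core corrupts the response.
            return stimulus ^ (1 if self.broken else 0)

    def casp_round(cores, stored_patterns):
        """stored_patterns: (stimulus, expected) pairs, e.g. loaded from off-chip flash."""
        failed = []
        for core in cores:
            core.isolate()
            ok = all(core.apply_at_speed(s) == e for s, e in stored_patterns)
            core.rejoin()
            if not ok:
                failed.append(core.name)
        return failed

    patterns = [(0b1011, 0b1011), (0b0110, 0b0110)]   # trivially, expected == stimulus
    cores = [Core("c0"), Core("c1", broken=True), Core("c2")]
    print(casp_round(cores, patterns))                # -> ['c1']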
And the big reason why this has been enabled today is that two things have
happened in the testing world that people, you know, sometimes do not know
about. One is this idea of test compression, which can cut down the test volume
and test time by 10X to 100X, up to 1000X. And if we can do this massive
test compression, we do not need as much storage, we do not need as much test
time to, you know, run these tests. And another trend that's happening in the
testing world is on-chip support for at-speed tests. In the past people used to run
tests from the testers, but then, you know, when you have a two gigahertz
microprocessor and, you know, a 500 megahertz tester, there is no way
you can do at-speed tests from the tester, so you need on-chip support to be able
to run at-speed tests. And if you have this on-chip support to run these at-speed
tests, we should make use of it in the system to run these online tests, and
that's why it becomes more and more relevant.
Do you have a question?
>>: [Inaudible] the hardware built into the chip?
>> Subhasish Mitra: Yes. So basically, you know, some funny ways of doing XOR
gates, and if you do the XOR -- so these are not your regular compression
algorithms, okay? They do not work, okay. But these are some funny
connections of [inaudible] which kind of go back to coding theory, actually. It
gives rise to a new class of error correcting codes, actually. And if you do that,
you can cut down your test set volume very significantly without impacting test
quality, and, you know, if you're interested, that's another talk, but I can tell you
more details; that's something very close to my heart.
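To give a flavor of the XOR idea, here is a toy response compactor in that spirit; the particular wiring matrix below is arbitrary and chosen just for illustration, not the actual X-Compact construction:

    # Hedged sketch: compacting many scan-chain output bits into a few signature
    # bits using XOR trees.  The connection matrix here is arbitrary; real
    # compactors choose it carefully (coding theory) to preserve error visibility.

    SCAN_CHAINS = 6
    OUTPUTS = 3
    # Each scan chain feeds a distinct subset of outputs (rows = chains, cols = outputs).
    CONNECTIONS = [
        [1, 1, 0],
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
        [1, 0, 0],
        [0, 1, 0],
    ]

    def compact(scan_bits):
        """scan_bits: one bit per scan chain for this shift cycle."""
        out = [0] * OUTPUTS
        for chain, bit in enumerate(scan_bits):
            for col in range(OUTPUTS):
                out[col] ^= bit & CONNECTIONS[chain][col]
        return out

    good = [1, 0, 1, 1, 0, 0]
    bad  = [1, 0, 0, 1, 0, 0]        # single-bit error in chain 2
    print(compact(good), compact(bad))  # signatures differ, so the error is visible

The design choice that matters is that every chain drives a unique nonzero subset of outputs, so a single erroneous chain always perturbs the signature in a distinguishable way.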
So we looked at this -- actually [inaudible] did this work -- we looked at CASP for the
open source OpenSPARC T1, because it's open source, and, you know, if I show numbers you
won't be able to say that this is just some academic speaking; this is a
real design. As you can see, even if we have only 10X compression, we're
talking about six megabytes of storage, which is nothing, isn't it? Look at the
extremely high coverages that we are getting. This true-time test is basically the
most [inaudible] test that the test community knows today. And getting 93
percent of this true-time coverage is extremely high. You know? People don't
get that in real life, but this is a real design.
And the test time per core is roughly around 300 milliseconds, and this is
assuming a flash; if you had a hard disk, this number
would go up. Assuming a flash, you know, this includes the time to bring in the
patterns, run the tests, compare the results and, you know, tell whether
this chip works or not. And look, all this happens with an extremely small area
impact of 0.01 percent. And [inaudible] had to change 10,000 lines of very
[inaudible] of this open source microprocessor to be able to implement something
like that. So this is something very doable, something very practical.
And this is how it works. So, for example, you know, there is a test scheduler
which says, well, core four is now selected for testing; core four is temporarily
isolated, the test is run, post processing is done, and this continues to happen.
And, as I'm sure you have all guessed, this scheduler is very important, because
the scheduler can completely mess up your entire application performance if you
are not careful about what was running and if you are not careful about the
scheduling. Actually, once you have this idea of online self test, you can make
use of virtualization to optimize your online self test. And the reason is the
following: if you wanted to implement the whole thing in hardware you would run
into a problem, and the problem is this. Let's say one of the cores is being tested.
While it is being tested, the rest of the cores are running, and they may be
generating requests for this core which is under test, and now you will need
buffers to store these requests, and you will not know how many buffers you
would need.
Instead, if you have virtualization support, you could, you know, either stall the
requests and bring the core back, or, if you had a spare core, you could even
transfer the activity of the core which is to be tested onto that spare core and
rove around, while the rest of the system sees a constant number of cores. So
you could do all that stuff, and we have been able to do this on an actual platform,
which is shown over here, and, you know, you could have an application
performance impact of less than two percent if you had a spare core and roved
things around.
And you could do even better things, you know, that's what [inaudible] is working
on, on scheduling techniques so that you may not even need that spare core for
example, and still continue to do your work.
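A toy sketch of that rove-the-spare-core idea; the scheme and names below are illustrative stand-ins, not the actual virtualization platform or its API:

    # Hedged sketch of the virtualization-assisted scheme: instead of buffering
    # requests to a core under test, migrate its work to a spare core so the
    # rest of the system always sees a constant number of active cores.

    def rove_and_test(active_cores, spare, run_self_test):
        """Test every active core once while keeping len(active_cores) cores doing work."""
        results = {}
        for i in range(len(active_cores)):
            victim = active_cores[i]
            active_cores[i] = spare            # migrate the victim's work to the spare
            results[victim] = run_self_test(victim)   # victim is now idle and testable
            spare = victim                     # the just-tested core becomes the new spare
        return results

    cores = ["c0", "c1", "c2", "c3"]
    print(rove_and_test(cores, "c4", run_self_test=lambda c: "pass"))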
Okay. So I'm almost done. I'll talk about some of the other research projects
very quickly before I end this talk. But please feel free to stop me and ask
questions.
The first one is about post-Silicon validation; I promised that I would show some
slides. So as I said over here, this is a typical, you know, development flow for
almost anything, and post-Silicon validation cost is one of the biggest costs. And,
you know, [inaudible] is the vice president of design and
technology solutions at Intel; this is, you know, what he said, and these are the
numbers that are [inaudible] about today's post-Silicon validation costs, and
everybody thinks that this is going to go up.
And there are two major causes of this cost. As we talked about, one is
localization and the other is these electrical bugs. And this is an actual number
that I got from actual people who do actual post-Silicon validation at these
companies: for one electrical bug it could take days to weeks to localize it
before, you know, the circuit debug guys can come in and fix it.
Okay. So this is what we do in [inaudible]; it stands for Instruction Footprint
Recording and Analysis, and it works mainly for processors right now. And the reason it
works for processors is that the way we choose these instruction footprints,
the way we choose these analyses, is very tied to the microarchitecture of a
processor.
But if you can do that, you can eliminate the limitations of existing techniques,
because you do not need any full system level reproducibility, you do not need
any full system level simulation, but you can still accurately tell the location of a
bug. And by accurately I mean that you can tell which unit, which hardware block,
it came from, and the size of the block will be something like around 5,000 to
10,000 two-input NAND gates. You know, this is pretty good granularity, because
now the circuit level debug guys can go and test the heck out of that particular
block to find the speed path that created this problem.
And at a very high level, this is my only slide which gets into any details
of this [inaudible]. What we do is that in the design phase we insert these
recorders inside the chip design, so you can think of it as, for every pipeline
stage there is a small recorder associated with that pipeline stage which stores
very specific information corresponding to that pipeline stage.
And the total amount of storage that you need for these recorders overall, for the
Alpha microprocessor chip, is roughly around 60 kilobytes of distributed storage.
And, you know, when the post-Silicon validation is happening, these
recorders store, you know, special information in a non-intrusive way, because
we do not stop the system or anything, it's all happening in parallel, and then a
failure is detected. That's what is important, because you want to detect the failure
fast.
Now, if the error appeared, you know, a billion cycles after the
beginning of your validation run, you don't care about that. What you
care about is the latency from the appearance of the error to the appearance of
the failure. So that's what you want to make, you know, very short, and
that's where some of those error checking techniques play a very special role.
Once the failure is detected, you scan out all the recorder contents and you
post-analyze them offline using some special analysis techniques. Actually, we use
some ideas from [inaudible]: control flow analysis, dependency analysis and so on.
You know, you can think of using [inaudible] basically to do diagnosis, and
you do not need any system level simulation because we are looking at
self-consistency. We know the assembly code, the binary, of what you were
running on the post-Silicon validation platform, and we have this information from
the recorders, and we look for places where things are self-consistent: if you
wrote something, whatever you read back should be the same thing that you
wrote, and so on. And that lets us localize these bugs.
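As a very small sketch of one such self-consistency check; the footprint record format below is an invented simplification, not the actual recorder contents:

    # Hedged sketch: one IFRA-style self-consistency check -- every recorded load
    # should see the last recorded store to the same address.  An inconsistency
    # points at the hardware block whose footprint disagrees.

    def check_load_store_consistency(footprints):
        """footprints: time-ordered list of ('store'|'load', pipeline_stage, addr, value)."""
        last_store = {}
        suspects = []
        for kind, stage, addr, value in footprints:
            if kind == "store":
                last_store[addr] = value
            elif kind == "load" and addr in last_store and value != last_store[addr]:
                suspects.append(stage)        # inconsistency localizes to this block
        return suspects

    trace = [("store", "LSU", 0x40, 7), ("load", "LSU", 0x40, 7),
             ("load", "LSU", 0x40, 5)]       # second load disagrees -> suspect LSU
    print(check_load_store_consistency(trace))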
Again, this is a 2,000-foot view of the thing, but it does work in real life. And as I said,
you know, we also do some work beyond [inaudible], blue sky stuff on, you know,
designs with carbon nanotube field-effect transistors. They have big promise; you
know, the energy-delay benefit of using carbon nanotube FETs could be around,
you know, 40X or something like that, at least 10X. So somebody at a
company said, you know, if you guys tell us 10X, then after all the work is done
it will probably be 50 percent, but we'll still be very happy about it. So that's
why you have to start with a big number. And 10X is true.
But there are some major showstoppers. You cannot grow these carbon
nanotubes parallel to each other, you know, like you can look in a --
>>: [Inaudible].
>> Subhasish Mitra: Carbon nanotubes.
>>: [Inaudible].
>> Subhasish Mitra: Oh, okay. What are carbon nanotubes? Good. So carbon
nanotubes are very simple. You have graphene, okay, you have a sheet of
graphene. And what is graphene? Well, graphene is a very thin sheet of graphite.
And what is graphite? The stuff that you use in your lead pencils, basically. So if
you draw a line, or if you shade a region, you get multiple layers of graphene
basically. And, you know, if you take one of those layers of graphene out and
then roll that sheet of graphene, you will get a tube, and that is what a
carbon nanotube is basically.
>>: It's embedded in the --
>> Subhasish Mitra: So what we do is that we -- so you can grow carbon
nanotubes on silicon or on quartz, okay? Now, people have found, we
have found, that these carbon nanotubes grow better on quartz than on silicon. I
have some pictures of carbon nanotubes on silicon; you will see that they really
look like noodles, okay. Versus here, as you can see, they look, you know,
kind of straight, with some, you know, kinks and bends and so on. And we can
actually grow them at full wafer scale and so on.
Now, if you want to make logic out of these carbon nanotubes, as you can guess,
if things go left and right, you know, how in the world can you design logic that
would ever work, isn't it? So you have to worry about mispositioning of the
carbon nanotubes; they can be misaligned. You cannot say I want a carbon
nanotube here, I do not want a carbon nanotube there, and so on. You have to
deal with that problem. And oh, by the way, [inaudible] a fraction of the carbon
nanotubes turn out to have zero band gap, which means they act as metal, which
means they are always conducting, which means that if you are creating transistors
out of those, they will essentially be shorts. Given all these things, you want to make
a real technology out of this, okay.
And we haven't solved the problem -- I'm not saying we have solved the problem -- but
we actually had the first VLSI demonstration this June at the VLSI Symposium,
where, you know, we could have full wafer-scale growth of carbon nanotubes, and we
were growing them on quartz because they have better alignment behavior on
quartz, but nobody wants to use quartz for doing circuits, isn't it. So we can
actually use a sticker-like transfer technique to lift the carbon nanotubes off the
quartz and put them on silicon; we can do that. And after doing that, we
actually have a theory, a graph theory result, where we can prove that no matter how
misaligned the carbon nanotubes are, as long as they are directional, which means
they are not spiralling around in one particular place, there is a way to create
layouts of logic circuits so that even if the carbon nanotubes are misaligned and
mispositioned, you will always have correct logic functions no matter what.
And these are some examples -- this is the pull-up of an AND gate, this is the pull-up of
an OR gate, and you can see, you know, look at the currents; they
actually work in real life. These are some more complex circuits, like an
AND-OR-invert and an OR-AND-invert and so on. So [inaudible] where we can create
single logic gates, and, you know, we are working on more stuff as we speak.
>>: This [inaudible] chip was made in the Stanford lab?
>> Subhasish Mitra: It was created at the Stanford lab, yes, exactly. It was
fabricated at the Stanford lab, yes. So, you know, those are some of the other
projects that are going on in our group. And, you know, I did some other stuff
before I joined Stanford; basically, you know, I like to do stuff,
it doesn't really matter what it is. So this was, as I said, the
coding theory thing that we developed, you know.
I went to Stanford for graduate study, for my
Ph.D., and, you know, we did a lot of stuff: we worked on self-repairing FPGAs,
we had this big project on configurable computing, and we were saying, well, if you
have more faults, you know, you create another configuration that
avoids that particular location of the fault, and still you can continue to run your
application and so on.
We showed all that stuff, and then we also did pure software techniques for
hardware fault tolerance basically, and those were deployed on the ARGOS satellite
up in space, and that actually sent back data on what kinds of soft errors happen
up in space basically. They had a radiation-hardened microprocessor and
they had a commercial microprocessor with no hardware fault tolerance, and with
the software techniques, the commercial processor did better than the
radiation-hardened processor. And, you know, there was lots of interesting stuff basically
in that space.
So that kind of concludes my talk. I hope that you know I got the message
across that going forward for robust system design there are lots of challenges
but there are lots of new opportunities that are either created by understanding
the physics or because of changing cost constraints in future technologies. But
that means that we have to think of new ways of doing things. And I hope what
we talked about today will bring about a sea change in future designs.
So thank you very much.
[applause]