>> Doug Burger: Okay. It's my pleasure to welcome Amin Ansari here from the
University of Illinois. Amin did his PhD at the University of Michigan and has been on a post-doc position at Illinois.
He comes very highly recommended and has a great stream of accomplishments in architecture. And so I'm really looking forward to hearing your job talk which you told me has been practiced many times. So it will be perfectly smooth.
>> Amin Ansari: All right. Hopefully. Puts a little bit of pressure on me.
Okay. Good morning, everyone. And welcome to this talk. Thank you so much for the introduction.
And I went to the University of Illinois after winning a National Science Foundation Computing
Innovation Fellowship and started working with Josep Torrellas after finishing my PhD.
And today I'm going to give an overview of what I've done over the last few years on a topic that I think is essential to computer architecture as we get to the end of Moore's law.
The title of this talk is Optimizing Power-Efficiency and Reliability In Extreme Scale
Computing. Here I'm going to start by looking at the significance of power and energy efficiency. As you know, improving energy efficiency has many implications. And here
I'm going to start by looking at a few of these implications.
The first thing is there are many devices that are battery operated, and they are becoming more popular such as PDAs, laptops and medical devices. And basically enhancing the energy efficiency can prolong the battery life of these devices.
For more throughput-oriented systems such as data centers, cost is the more pressing issue.
And this cost includes the cost of cooling, electricity, and thermal packaging, and enhancing the energy efficiency in such a domain can save millions of dollars for a typical data center per year.
At the other end we have device lifetime. High power consumption leads to hotspots on the chip, and since wear-out failures are highly dependent on the chip's temperature, enhancing the power efficiency can prolong the lifetime of all these computing devices.
And basically if we focus on power and energy efficiency, we can enhance all these different aspects at the same time.
Now we are looking at the trade-off between energy and reliability. I have a plot that shows the energy per operation on the Y axis, and we have the supply voltage on the X axis.
And this is for a typical CMOS process.
We start with the dynamic energy. And as you can see, as we increase the supply voltage we have this quadratic increase in the energy consumption. The next is the leakage energy. And as you can see, for low voltages we have this exponential increase in the leakage.
If you put these two together, we're going to get the green line that shows the total energy of the operation. And there is an optimal point at the very low voltages.
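A minimal sketch of the trend being described, with purely illustrative constants rather than numbers from the talk: dynamic energy grows roughly quadratically with Vdd, leakage energy per operation grows at low Vdd because the circuit slows down, and their sum has a minimum at a fairly low voltage.

```python
import numpy as np

# Illustrative energy-per-operation model; the constants below are assumptions
# for the sketch, not data from the talk.
VTH = 0.3      # assumed threshold voltage (V)
C_EFF = 1.0    # assumed effective switched capacitance (arbitrary units)
I_LEAK = 0.02  # assumed leakage factor (arbitrary units)

def dynamic_energy(vdd):
    return C_EFF * vdd ** 2            # E_dyn grows quadratically with Vdd

def delay(vdd):
    # alpha-power-law-style delay model: delay blows up as Vdd nears threshold
    return vdd / (vdd - VTH) ** 1.3

def leakage_energy(vdd):
    return I_LEAK * vdd * delay(vdd)   # slower operation -> more leakage per op

vdd = np.linspace(0.35, 1.0, 200)
total = dynamic_energy(vdd) + leakage_energy(vdd)
print(f"energy-optimal Vdd in this toy model: {vdd[np.argmin(total)]:.2f} V")
```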
Now, if you consider the cost of reliability, this is going to change a little. Since at ultra-low voltages we'll have an excessive amount of voltage noise, we need to deal with the timing failures that arise in that region, and repairing these faults --
>>: What are you defining reliability?
>> Amin Ansari: Reliability meaning that you can accomplish the operation that you're supposed to do.
>>: So if I go way up, though, the reliability problems are going to go way up as well, punch-through oxides or whatever. So what kind of supply voltage range are you talking about?
>> Amin Ansari: This is -- yeah, this is mostly for, say, one volt down to .2 volt. That is a reasonable voltage range. Yeah, you're right, if you go to, like, five volts, probably your wear-out failures will be so dominant that you need to have some other reliability mechanism at that point. That's true.
>>: So the reliability mechanism here is just -- it flips due to noise?
>> Amin Ansari: Mostly timing faults, meaning that the circuit cannot meet the timing constraint that you put on it, and it can cause a bit flip in the output that you have, yeah.
Go ahead.
>>: To understand the blue line -- I understand that you designed the circuit for that supply voltage. Because clearly if you reduce the supply voltage on the same circuit, you would lose on energy?
>> Amin Ansari: That's true.
>>: So as you reduce supply voltage, you thin out the oxide?
>> Amin Ansari: That's true. Yeah, that's true.
>>: So I don't agree with your answer. So if I'm having timing violations, I can just slow down, slow down my clock -- you know, when you're moving -- you're moving your voltage, you're slowing your clock anyway.
>> Amin Ansari: This is assuming the alpha [inaudible] model for clock scaling.
Basically you are not going to slow down your circuit by a factor of 1000X in order to allow it to operate. We assume a semi-linear relation between the voltage and frequency; that is how the circuit will act.
>>: Sure.
>> Amin Ansari: You're right, absolutely: If you are willing to reduce the clock frequency by, like, 1000X, basically you can let the circuit work.
>>: I care about correct results maybe is where you're going.
>>: Exactly.
>>: I'm assuming also that the reliability cost is for unreasonable [inaudible] detected early enough not to cause mischief. Because if [inaudible] application activates an air bag on me, that may be expensive in the next lawsuit.
>> Amin Ansari: That's true.
>>: Not to mention ICBMs.
>> Amin Ansari: Okay. So if we add this reliability cost to the green curve that we had, we're going to get this red curve here. Basically there is a substantial shift in the optimal operating voltage toward the right side, but still one can decide to operate at reasonably lower voltages where, at that point, you don't need to -- go ahead.
>>: What are the units that let you add those?
>> Amin Ansari: Sorry.
>>: Thank you.
>>: What were the units that let you add energy and reliability?
>> Amin Ansari: What's the unit?
>>: What is the units of reliability cost?
>> Amin Ansari: It's basically in terms of -- this is a schematic plot showing the trend, so there is no exact number on it. But the cost is basically the cost of detecting the fault, recovering or repeating the operation, or rolling back -- any mechanism that allows you to repeat the calculation and get the correct result.
>>: So the units are energy --
>> Amin Ansari: Yes. Definitely. That's energy.
>>: That's the answer to the question. So you're assuming that there's an energy cost to maintain reliability.
>> Amin Ansari: Exactly.
>>: Which you can -- and you could do a redundant computation, but you're doing it with a checkpoint-based architecture?
>> Amin Ansari: Something like that:
>>: Well, yes or no.
>> Amin Ansari: Yeah. I guess, yes.
>>: Yeah. Okay.
>> Amin Ansari: So we can either operate in a region, like, I don't know, one volt, where you wouldn't observe any failures due to the process variation, or you can choose to go to these lower voltages and try to operate at the optimal voltage value and basically save some energy at that point.
And you need to have these aggressive energy efficient techniques that can trade off reliability for energy efficiency and still can tolerate some of the errors that are happening in this region. And in the rest of this talk I will look at the ways that we can achieve this low-voltage operation by efficiently tolerating some of the errors that are happening in this region.
>>: So I -- go ahead, Scott.
>>: Are you making a distinction between errors that occur versus errors that have an impact or not?
>> Amin Ansari: Errors here are faults, I mean, faults that you can detect, and you need a mechanism to correct them.
>>: He's assuming that no errors actually manifest in architectural state eventually?
>> Amin Ansari: Yes. You need to correct it.
>>: But if I, for example, had a bit flip in an uninitialized portion of the cache.
>> Amin Ansari: That's true.
>>: That's not a -- that's not an error under your definition or it is an error?
>> Amin Ansari: You're talking about architectural masking and microarchitectural masking and different types of phenomena in that category.
Here by failure we mean something that can potentially corrupt the architectural state of the program.
>>: I have one more question. Notably absent when you shifted from power to energy is any discussion of performance.
>> Amin Ansari: We will discuss performance.
>>: Okay.
>> Amin Ansari: Later. But at this point, the thing that we are trying to do is basically assume linear scaling of performance with voltage; that is how this circuit will react, basically.
But that's a very good question: for energy and power there is a notion of performance there as well, and you need to look at it.
>>: I guess my point was that you're presupposing the optimal energy point is your desired operating point. And I'm not sure that it is.
>> Amin Ansari: We are targeting the [inaudible] domain. And for those types of systems, basically, the main criterion is energy, I think. Or at least we are assuming here that energy is a very big factor. If what you care about is battery life and the cost of computation, like the electricity bill, some of those map directly to energy. But you're right.
>>: Subthreshold -- device people use it because they care only about energy and not about performance, which is why no one uses it.
>> Amin Ansari: Yeah. I guess as I go through the talk and present some of the schemes, you'll see there is variation between the schemes that I present. Some of them care more about getting a constant throughput out of the system; some of them are willing to sacrifice some of the performance that you can get.
>>: Okay.
>> Amin Ansari: But this is the general trend I wanted to discuss: how the system behaves as we go to lower supply voltages.
So now, looking at this low-voltage operation: Vdd reduction is known to be the best lever for energy efficiency, since it gives a very big reduction in dynamic power and static power.
And, as I mentioned, we want to have this near-threshold voltage operation. That means we want to reduce the supply voltage to a value that's a bit higher than the threshold voltage, which maps to around 500 millivolts for the current technologies. And the main advantage we will get is that we can get a very big reduction in power and energy. But the main drawback is the speed: the chip's frequency can go substantially lower, like 10X is something that's very typical.
And another thing is that we will have a substantial increase in gate delay variation, which is a big obstacle in the way of this low-voltage operation.
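To make the trade-off concrete with a back-of-the-envelope calculation (the 1.0 V and 0.5 V operating points are assumed for illustration): dynamic energy per operation scales roughly with Vdd squared, while the talk quotes roughly a 10X frequency loss in this regime.

```python
# Back-of-the-envelope near-threshold scaling; the voltages are illustrative assumptions.
v_nominal = 1.0   # assumed nominal supply (V)
v_ntv = 0.5       # assumed near-threshold supply (V)

dyn_energy_saving = (v_nominal / v_ntv) ** 2   # E_dyn ~ Vdd^2, so about 4X here
freq_slowdown = 10                             # roughly 10X slower, as stated in the talk

print(f"dynamic energy per operation improves ~{dyn_energy_saving:.0f}X")
print(f"frequency drops ~{freq_slowdown}X, so single-thread performance suffers")
```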
>>: So -- I'm sorry to keep hassling you.
>> Amin Ansari: Oh, no, go ahead. I actually enjoy having conversations.
>>: Okay. So you make a bold statement that Vdd reduction is the best lever for energy efficiency.
>> Amin Ansari: I have discussed this with many people and -- actually, this is an -- I wouldn't be comfortable saying best. I would --
>>: But you [inaudible] best.
>> Amin Ansari: True. I would say one of the best.
>>: Okay.
>> Amin Ansari: It's definitely one of the best. I'm not sure whether it is the best one or not.
>>: Well, I would argue that specialization gives you a much bigger [inaudible] energy efficient than Vdd.
>> Amin Ansari: That's true. But for general-purpose computation, the circuit techniques you can do are very limited. You're right, if you have a completely application-specific processor, you probably gain several orders of magnitude better energy efficiency, that's true. I agree.
>>: To make sure I understand. So you save 4X on energy [inaudible] you do it 10 times slower and --
>> Amin Ansari: Yeah.
>>: And you save 4X on the energy?
>> Amin Ansari: Yes.
>>: So the 40X is a good market [inaudible].
>>: And you do near threshold and not subthreshold because --
>> Amin Ansari: No. We're not going to discuss subthreshold.
>>: Because?
>> Amin Ansari: Because, in general, I guess with subthreshold circuits you expect a very, very large rise in the fault rate, very, very large, and you need completely new techniques at the circuit level to deal with it.
Generally, architectural techniques can deal with failure rates that are reasonably high but not super high. Not like one out of every 10 instructions going wrong.
For subthreshold you generally need to deal with failures in a very, very aggressive manner. And also the performance that you get out of the circuit is very low in general.
Most of the subthreshold applications that I've seen are for, like, medical devices. Like you put a chip for a pacemaker in the heart, where performance is pretty much not a relevant issue. So with near threshold we're still hoping to get a decent performance out of the system. Yeah.
So, as I mentioned in the last bullet, variation is one of the obstacles we have here. And since this is an important issue, I'm going to explain a little more about what parameter variation is and how it's going to affect the chip. Basically it's defined as the deviation of device parameters from their nominal values, and two of these parameters are threshold voltage and effective channel length, which are more important for modeling purposes.
And here I have two plots. The one on the left-hand side shows the static power versus threshold voltage. And when you have a nominal chip, basically there's a nominal threshold voltage, and that maps to a certain static power.
As you add parametric variation, what happens is that you will have some devices with a lower VTH and some with a higher VTH. And since this curve is exponential as you go to lower VTH values, the devices at the lower end of this curve will have an exponentially higher leakage power.
So the net effect across all the devices that you have on a chip will be a substantial increase in the leakage power of the system. And on the right-hand --
>>: [inaudible] per device variation not per chip?
>> Amin Ansari: This is the static power of a particular device, one device. But inside the chip, when you add parameter variation, some of the devices will end up having a lower VTH and some of them a higher VTH. And for the ones that have a lower VTH the static power is much higher; therefore, the net effect that you get across the chip is basically a substantial increase in the static power.
On the right-hand side, I have a plot that shows the distribution of path delays for a pipeline stage. And on the X axis, we have the delay. And since the path that has the worst timing characteristic will determine the clock frequency, tau NOM, for example, corresponds to the clock frequency of the chip in this case.
As you add to the parametric variation, this plot becomes more flat, and the tail will shift to the right side. And some of the paths will have a much worse timing characteristic.
Therefore, the clock frequency of the chip needs to be reduced substantially.
As a result, both power consumption and the delay become pressing issues while we are targeting this low-voltage operation.
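A small Monte Carlo sketch of the asymmetry just described, under assumed distributions and device constants: per-device VTH varies symmetrically around its nominal value, but because leakage is exponential in minus VTH, the chip-wide average leakage ends up noticeably above the nominal device's leakage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative numbers: nominal VTH of 300 mV, 30 mV sigma of variation,
# and a subthreshold slope of about 90 mV/decade.
VTH_NOM = 0.300
SIGMA = 0.030
SLOPE = 0.090 / np.log(10)   # scale of the exponential, from mV/decade

def leakage(vth):
    # leakage ~ exp(-VTH / slope), normalized so the nominal device is 1.0
    return np.exp(-(vth - VTH_NOM) / SLOPE)

vth_samples = rng.normal(VTH_NOM, SIGMA, size=100_000)
print(f"nominal-device leakage (normalized): {leakage(VTH_NOM):.2f}")
print(f"mean leakage across varied devices:  {leakage(vth_samples).mean():.2f}")
```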
All right. Here I have the roadmap of this talk. Our objective is to enable low-voltage operation of high-performance microprocessors with low-cost solutions. And I have work --
>>: Are you really talking about high-performance processors in your threshold regimes?
>> Amin Ansari: Actually that's the objective of this talk, yes.
>>: Okay.
>> Amin Ansari: I will try.
>>: All right.
>> Amin Ansari: You can --
>>: Great.
>> Amin Ansari: You can see how it goes.
So I have worked on several different areas. But given the limited time that we have, in the rest of this talk I will present three solutions that I've proposed to tackle reliability and energy efficiency in some of the main components of a high-performance microprocessor.
We're going to start by looking at network-on-chip. I'll present Tangle. That's a work under submission. It's a collaborative research with Intel Labs under DARPA and DOE grants. And the objective is to dynamically do the voltage adaptation based on the errors that we observe in the network.
After that, we are going to look at cache hierarchy. I'm going to present Archipelago, which is a work that I published in HPCA 2011. It's a highly flexible cache architecture that tolerates SRAM failures at near threshold domain.
And at the end I'll present Necromancer, which is a solution for processor pipeline. It's published in ISCA 2010. And the extended version was published in IEEE Micro 2010.
And the objective here is to protect the processor pipeline with energy-efficient solutions.
And at the end of the talk, I will present some of the potential future research directions and also give a brief overview of my other research accomplishments.
So we are going to start with network-on-chip. What is the impact of variation on NoC?
Network-on-chip is especially vulnerable to variations. This is because its routers and links connect very distant parts of the chip. And due to the systematic variation, they will exhibit very different speed and power characteristics across the chip.
Therefore you need conservative voltage guardbands to tolerate the process variation.
And it has been shown in prior work that the network-on-chip can consume up to 40% of the chip power. Therefore, since power is strongly dependent on the supply voltage, there's a great opportunity here to save some energy by reducing some of the guardbands that have been added for tolerating process variation.
Here, to demonstrate the impact of variation on the routers, I have a plot that shows the probability of error versus supply voltage for a 64-router 2D mesh with one voltage regulator per router. And in this plot, as you can see, there are 64 curves.
And each of these curves shows the failure rate of a single stage of a single router.
And --
>>: Can you -- I'm sorry. Can you describe what you mean by having the voltage regulator per router? I thought this was the network-on-chip.
>> Amin Ansari: Yeah. We have one regulator per router inside the chip. It's a mesh network, for example, so you have many routers.
>>: So how much area does 64 voltage regulators take?
>> Amin Ansari: That's a very interesting question. I have multiple answers to that.
One of them is that, in order to do a limit study of how much we can potentially get, we did the study with 64 regulators per chip. We can change the scheme so that you combine some of the routers and put like four or eight of them on the same voltage regulator.
>>: Sure.
>> Amin Ansari: Another way to implement it is that there are LDOs, low-dropout voltage regulators. And you can implement this thing in a hierarchical fashion: put each, I don't know, four-by-four submesh on one of the voltage regulators and use LDOs close to each of these routers, since the variation in the voltage of each of them is very small.
Basically what the LDO does is vary the voltage by a very small amount, like a hundred millivolts for example.
And given the systematic variation, the routers that are close to each other show kind of similar behavior. So you can set the voltage of a whole submesh, like a 16-router submesh, with the on-chip voltage regulator and then tune each of them using an LDO, which has very high efficiency and low cost.
>>: Okay.
>> Amin Ansari: That was a great question. So, as you can see, at the high voltages the failure rate is extremely, extremely low, and as we go to the lower voltages we basically hit this steep curve and gradually get to a failure rate that is almost one, meaning that you are going to fail all the time. You're going to have failures in your computation.
And this curve is very steep, which shows process variation has a major impact on the routers. And one thing that I want you guys to look at is that at an error rate around 10 to the minus 18, which is a low enough rate to allow fault-free operation for a long time, you can see there is a very large variation in the voltages across the chip, from almost 530 millivolts all the way up to 750 millivolts.
This means that if you have a scheme that can adapt the voltage of each router based on the minimum Vdd at which it can comfortably operate, you can save a substantial amount of energy in the network. Our estimate shows 30 to 40% can be easily achieved.
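A rough sketch of that opportunity, using assumed per-router minimum voltages spread over the 530 to 750 millivolt range quoted from the plot: compare one worst-case guardbanded voltage for the whole NoC against each router sitting at its own minimum, with power taken as proportional to Vdd squared.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed: 64 routers whose minimum safe Vdd (at the target error rate) is spread
# roughly over the range quoted from the plot.
v_min = rng.uniform(0.53, 0.75, size=64)

v_guardband = v_min.max()                             # one worst-case voltage for all routers
power_guardbanded = len(v_min) * v_guardband ** 2     # dynamic power ~ Vdd^2 (relative units)
power_adaptive = np.sum(v_min ** 2)                   # each router at its own minimum

saving = 1 - power_adaptive / power_guardbanded
print(f"estimated NoC dynamic power saving: {saving:.0%}")
```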
>>: I think this chart is showing systematic process variation across the chip. That was the setup. But you're now arguing -- or is that incorrect?
>> Amin Ansari: Go on.
>>: Okay. But now you're arguing that you want to have per router or much more local control of these. I mean, this chart to me says we ought to bin our parts. Which we do already.
>> Amin Ansari: No. No. These are -- the network is in one chip. So all the routers are basically on one chip. How can you bin --
>>: [inaudible] that's how I read it. Ben [inaudible] early was talking about systematic process variations, what.
>>: Yes.
>>: What hits his network on a chip.
>> Amin Ansari: Inside one chip. So if you manufacture one chip, and inside of that chip you measure the voltage at which each of these routers has this error rate, that's what you're going to end up having.
>>: So each of these curves is a single router inside the chip.
>>: Exactly.
>>: And that entire router is assumed to be at some process value, but different routers have different variations of the process.
>> Amin Ansari: We actually use --
>>: [inaudible].
>> Amin Ansari: [inaudible] tools developed at Illinois that do process variation modeling in a very detailed manner. We did the synthesis and got the netlists of different routers, basically calculated the logical effort of each of those paths, and kind of fed them to VARIUS to get this result here.
So it includes the impact of variation in each of the single routers.
>>: All of the curves are identically shaped under that?
>> Amin Ansari: Yes. All of them pretty much have the same shape, meaning that as you decrease the voltage, the failure rate increases in a similar manner.
So what's the main idea behind our approach? Basically we are trying to achieve high energy efficiency by removing the Vdd margin that's added for variation and wearout while keeping the frequency constant.
So here is kind of an answer to one of the questions that [inaudible] asked, about what's going to happen to the frequency and the performance. Here our objective is to keep the frequency of the chip constant and try to remove any margin that is added for variation and wearout purposes.
The way that we are going to do this is to reduce the Vdd of each router to the minimum level that it can comfortably tolerate. We start with a high voltage for each of these routers. As you saw in this plot, something around 800 millivolts is a safe voltage: all of them can operate correctly and it allows fault-free operation. Then we periodically decrease the Vdd and monitor the errors that are happening inside the system. And, if needed, we are going to increase the voltages.
And we rely on some inexpensive error detection: since we want to have only a few bits added to each packet, we're going to use CRC. And since we want to do the error checking as infrequently as possible, we're going to do it end-to-end: we do the encoding at the source node and the error detection at the destination node.
>>: So this -- this is more of a protocol question. This is something that we've been wrestling with. Scott, I'm sure has thoughts on this too.
>>: So if you're going -- if you're not going to run the network at a guaranteed reliable state, then you're going to start flipping bits in your packets. And you're going to put your routers in all sorts of very, very scary conditions. Right? So you can route stuff randomly, you can change the destination node. How do you -- you know, how do you guarantee that you're not going to deadlock your network?
>> Amin Ansari: [inaudible].
>>: [inaudible] any protocol guarantees on top of that if you're not going to drain packets that have been corrupted, you know, at each node.
>> Amin Ansari: That's a great question. I will describe the scheme and I will try to answer your question there. It's a very, very great question. Okay.
As a result, what Tangle is trying to do is dynamically change the supply voltage of each router based on the errors that we observe inside the network. And this allows us to adapt to workload phases, temperature, and also wearout.
So here I'm going to describe how we handle the errors inside Tangle. We continuously monitor the errors. When an error occurs, the destination node drops the flit and waits for retransmission from the source node. At the source node we have a watchdog timer, and when the watchdog timer reaches a certain value, the source node sends a signal to our reliability management unit and asks for a Vdd increase.
Given the deterministic routing, the reliability management unit knows which subset of routers it needs to increase the voltage on, because it's XY routing. And if there's a failure, it knows what subset of routers this packet was supposed to go through. So it's going to increase the voltage on that path.
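A minimal sketch of the recovery flow just described; the names and structure here are my own shorthand rather than the paper's. The source times out waiting for an ACK and reports the failed transfer to the reliability management unit, which, knowing the deterministic XY route, raises the voltage of every router on that path before the retransmission.

```python
# Hypothetical sketch of Tangle-style recovery on an XY-routed mesh
# (voltage step and ceiling are illustrative assumptions).
VDD_STEP = 0.010   # assumed voltage bump per recovery, in volts
VDD_MAX = 0.800    # assumed safe ceiling

def xy_path(src, dst):
    """Routers visited under deterministic XY routing: all X hops first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    path = [(x, sy) for x in range(sx, dx, 1 if dx >= sx else -1)]
    path += [(dx, y) for y in range(sy, dy, 1 if dy >= sy else -1)]
    return path + [(dx, dy)]

class ReliabilityManagementUnit:
    def __init__(self, vdd):
        self.vdd = vdd                                  # router (x, y) -> current Vdd

    def report_timeout(self, src, dst):
        # the source timed out waiting for an ACK: raise Vdd on every router of the route
        for router in xy_path(src, dst):
            self.vdd[router] = min(self.vdd[router] + VDD_STEP, VDD_MAX)

# Usage: a 4x4 mesh that has converged to low voltages, then a timeout on one route.
rmu = ReliabilityManagementUnit({(x, y): 0.620 for x in range(4) for y in range(4)})
rmu.report_timeout(src=(0, 0), dst=(3, 2))
print(sorted(r for r, v in rmu.vdd.items() if v > 0.620))   # the routers on the XY path
```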
>>: So how do you -- how does the source node know that the flit was dropped?
>> Amin Ansari: Source node is -- has a watchdog timer that if it doesn't --
>>: [inaudible] source node.
>> Amin Ansari: If you don't get an ACK after a certain amount of time, you know that something went wrong. And then you tell your reliability management unit that something was wrong, do something for me: here is who I am, and here is the destination that this packet was supposed to go to.
Actually I have examples of this thing in the next slide.
>>: I don't need an example, I [inaudible] just -- so you're assuming that the source node's going to buffer everything until it receives an ACK. And what if the ACK gets corrupted and gets sent somewhere else?
>> Amin Ansari: That's a very good question. The same thing happens. Basically if the ACK gets corrupted, the source node still didn't get the ACK, so it's going to have a timeout. So it knows that the routers that this ACK was supposed to go through have some problem as well.
>>: But how does it know whether or not -- but so now it's going to retransmit something that's already been correctly received.
>> Amin Ansari: No. No. What you -- you run a voltage increase and then retransmit.
>>: Yeah.
>>: So I think what Doug is getting at now is that the destination got the message. And so it believes it's moving forward, but it's going to later on get another copy of that message because its ACK got dropped.
>> Amin Ansari: That is true. I guess we can use the reliability management unit to tell the destination node that, okay, this is a duplicate message, because that unit knows what message is going to arrive at the destination node as a duplicate -- that's the request that it got from the source node.
>>: Wait. Where is the reliability -- the reliability management unit is somewhere on the network, right?
>> Amin Ansari: It's a distributed unit, yeah. So we would build it in a hierarchical fashion. All the source nodes are connected to that unit.
>>: So you're going to get another copy of the message. And so whenever you get a message you have to talk to the reliability management unit to see if you've already received this message?
>> Amin Ansari: No, no, the reliability management unit can send you like an interrupt that says, okay -- because first it increases the voltage on a particular path and then the packets will go through. So it knows which node will eventually --
>>: [inaudible]. I don't care about the voltage increase because I could just assume that it does it wrong the first time and does it right the second time.
>> Amin Ansari: Yeah. That's pretty much what's happened.
>>: Yeah. So -- because, you know, other things can corrupt network packets. So you can't assume that it's always correct. You just increase the voltage and your probability of an error is lower.
>> Amin Ansari: Yeah.
>>: So now -- so whenever I receive -- so from a destination, whenever I receive a message I have no idea whether I've already received this message or not. Because this could be a retransmission.
>> Amin Ansari: But the reliability management unit knows that you're going to receive this packet again, right?
>>: How?
>> Amin Ansari: Because originally, when this packet got transferred and then there was a timeout at the source node, the source node knew that this packet needs to be retransmitted.
So when it tells the reliability management unit that a problem happened with this packet, the reliability management unit knows that this is going to be retransmitted. And it already knows the destination node for that.
>>: So which is -- there's some network between the reliability management units at each source and destination.
>> Amin Ansari: All the sources --
>>: That's communicating things. And is that also just as likely to be buggy? Or do you have something to --
>> Amin Ansari: No, that operates at the higher voltage always. Yeah. That network is always operating at the maximum voltage, around 800 millivolts, so that you don't observe errors on it.
>>: So I receive a packet. CRC checks out. I receive another packet. And how -- when I receive that packet, how does the reliability management unit tell me that that's a -- I mean, I could set a bit in the header that said this was a retransmit. But I've -- I mean --
>>: You want an answer or his answer?
>> Amin Ansari: There can be --
>>: [inaudible].
>> Amin Ansari: One of the things that I was proposing was that the reliability management unit basically knows that this packet is definitely going to get retransmitted. So it can tell the destination node that there will be a duplicate of this message. So you can just send an ACK and drop it.
Or, as you said, you can have a bit --
>>: Okay. Okay. So I guess what I'm uncomfortable with. I don't care about the reliability management unit --
>> Amin Ansari: Yeah, yeah.
>>: -- because I can also set a bit in the header that says this is a retransmit.
>> Amin Ansari: Yeah, that's true.
>>: I guess just [inaudible]. But when I receive this duplicate packet, what do I do with it?
>> Amin Ansari: Just drop it. Send an ACK to the source.
>>: But how do you know if I've already consumed it or not?
>> Amin Ansari: If the CRC check --
>>: So here's the scenario. I'm sorry to harp on this. But this -- I'm not sure this works.
>> Amin Ansari: All right.
>>: So I receive packet A.
>> Amin Ansari: Yeah.
>>: And -- what's that?
>>: I think he's okay. So keep rolling if you want him to get --
>>: Yeah. I receive packet A. I'm going to consume it. I'm not going to buffer it waiting for an ACK on my ACK. Right?
>> Amin Ansari: That's right.
>>: I consume it. Now -- and I send an ACK back. The ACK gets dropped or routed somewhere else. Somebody else receives an ACK. God knows what happens now.
>>: Make it disappear.
>>: Yeah, make it disappear. All right. And actually the CRC will catch that.
>> Amin Ansari: Yeah.
>>: Right? So now I send packet B, which is the same as packet A, and I have some state that says that this is a retransmit. And now I receive packet B. Do I consume packet B?
>> Amin Ansari: No. [inaudible].
>>: But what if packet A was dropped and I got packet B? Like if packet A was sent to the wrong place and it timed out, then it sends packet B. And I get packet B and I say okay, now I'm great, I can --
>> Amin Ansari: No. But at the destination you know that -- for example, say you have a packet number or something, and you know whether you consumed it or not, right? And when you get the duplicate, you know whether this number matches that one or not.
>>: So you're -- so you're saying I need to keep a log -- I need to come up with some unique ID for every packet I consume and keep a log of all of those packets and then do a cam against them to make sure that the packet that arrived isn't sitting in those?
>> Amin Ansari: Yeah. Not -- for -- like the log wouldn't be that long because that -- that would be the latency of basically doing this retransmission and changing the --
>>: [inaudible] network how can you bound the latency?
>> Amin Ansari: [inaudible] there's actually a chapter in the paper about how we deal with the congestion.
>>: Well, no, not how you deal with congestion but how you -- do you see where I'm going? Like I don't --
>>: I know where you're going and I know [inaudible] but I don't know whether you want me to take you there.
>>: Yes.
>>: Okay. So all you need at the destination is a piece of information for each source which says what is the packet ID that I have successfully gotten. And I'm going to make a rule that says once a packet is --
>>: You're guaranteeing in order delivery.
>>: We're assuming in order delivery.
>>: Okay. All right.
>>: And so what you do is when a packet arrives, if it is the next one, you increment that value, you consume it, you send the ACK back.
>>: Yeah. Yeah. Okay.
>>: And then if it is either too large or too small compared to that -- if it's too large, you drop it. If it's too small you drop it and ACK it.
>>: Right.
>>: And I think you're fine.
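A sketch of the rule just proposed, in my own illustrative encoding rather than anything from the paper: with in-order delivery, the destination keeps one expected sequence number per source; the next-expected packet is consumed and ACKed, an older one is a duplicate that is dropped but re-ACKed, and anything newer is simply dropped.

```python
# Hypothetical destination-side duplicate filter, assuming in-order delivery per source.
class Destination:
    def __init__(self):
        self.expected = {}                      # source id -> next sequence number to consume

    def receive(self, src, seq, payload):
        exp = self.expected.get(src, 0)
        if seq == exp:
            self.expected[src] = exp + 1
            return ("consume+ack", payload)     # new packet: deliver it and ACK
        if seq < exp:
            return ("drop+ack", None)           # duplicate of something consumed: re-ACK only
        return ("drop", None)                   # ahead of expected: drop, let the source retry

# Usage: original delivery, a duplicate caused by a lost ACK, then the next packet.
d = Destination()
print(d.receive("src0", 0, "A"))   # ('consume+ack', 'A')
print(d.receive("src0", 0, "A"))   # ('drop+ack', None)  -- retransmit after its ACK was lost
print(d.receive("src0", 1, "B"))   # ('consume+ack', 'B')
```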
>>: Is that what you do?
>> Amin Ansari: That sounds like a very interesting answer. We assume that a mechanism like that can be used to handle the --
>>: You haven't thought about this problem. Okay.
>>: Rather than putting the watchdog timer in the router and having an ultra-reliable network to say I got a bad packet, can't you put the watchdog timer in the reliability management unit, and then you send a message that things are okay, and if it doesn't arrive you start jacking up the voltage? Or would there be too many messages?
>> Amin Ansari: That's an interesting question. So you're saying put the watchdog timers in the reliability management unit and send -- basically acknowledge the things that are wrong or --
>>: Acknowledge the things that are right.
>> Amin Ansari: Only the things --
>>: If a node says things are all right, don't do anything. If a node doesn't say anything, right, jack up the voltage. Then it can be as unreliable as the rest of it. You may have many more packets, I don't know.
>> Amin Ansari: That's a different way of designing it. I think that makes sense. I don't know what are the tradeoffs here, basically, if you put the watchdog timer on the reliability management unit my guess is that you need probably a lot more communication with the reliability management unit to realize what's going on with different packets.
>>: But --
>> Amin Ansari: That's an alternative. We were trying to keep the reliability management unit network very, very low bandwidth so you don't need so much communication on it.
>>: Are you going to talk about how you lowered the voltage back down?
>> Amin Ansari: Yeah. I will talk a little about like what is the mechanism that we are using basically to change the voltage.
So here I have an example of what's happening to the supply voltage of a particular router over time. As I mentioned, we start with a relatively high voltage, around 800 millivolts, and we gradually reduce the voltage at the beginning of each epoch -- basically an epoch is a time period during which we first decrease the voltage and then monitor the network to see whether errors are happening or not. If they are happening, we're going to increase the voltage.
And we start with a relatively large tuning step, and as we gradually get to the lower voltage values, we are going to reduce this tuning step. As you can see, for example, the first step is around 60 or 70 millivolts, and it gradually goes down to around 10 millivolts, which is kind of getting to the physical limitations of the voltage regulator.
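A minimal sketch of that per-router epoch loop; the step schedule, shrink factor, and error model are illustrative assumptions, not the paper's numbers. Each epoch the controller probes a lower Vdd; if errors were seen it backs off, and over time the tuning step shrinks toward the regulator's resolution.

```python
# Hypothetical per-router voltage controller illustrating the epoch-based tuning.
class RouterVddController:
    def __init__(self, vdd=0.800, step=0.070, min_step=0.010):
        self.vdd = vdd
        self.step = step
        self.min_step = min_step
        self.last_step = step

    def start_epoch(self):
        self.last_step = self.step
        self.vdd -= self.step                            # probe a lower voltage this epoch
        self.step = max(self.step * 0.8, self.min_step)  # shrink toward the regulator's limit
        return self.vdd

    def report_errors(self):
        self.vdd += self.last_step                       # back off to the last good level

# Usage: a router whose true minimum safe Vdd is about 620 mV (assumed for the example).
V_SAFE = 0.620
ctrl = RouterVddController()
for epoch in range(10):
    v = ctrl.start_epoch()
    if v < V_SAFE:                  # stand-in for "errors were detected during this epoch"
        ctrl.report_errors()
    print(f"epoch {epoch}: running at {ctrl.vdd:.3f} V (step {ctrl.step * 1000:.0f} mV)")
```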
And for example, what's happening in epoch 4 is that when we go down to around 610, 620 millivolts, we're going to observe some errors in the network, we're going to raise the voltage and continue the operation. Then we decrease again. And in epoch 6 you will see that some other errors are happening.
And basically what's happening is that you gradually converge to a value that the router can comfortably operate at. You'll see errors every now and then, but the overhead of fixing those errors is very, very marginal compared to the time window that we have here. So --
>>: [inaudible]. So if I -- if I am within a nanosecond of an epoch boundary, or whatever your smallest amount is, and I get an error and need to increase the voltage, you are, in fact, going -- so you get an error, you're going to increase it, then decrease it almost immediately.
>> Amin Ansari: [inaudible].
>>: Okay.
>>: So do you -- I still don't understand how you can guarantee that you don't deadlock given that you're allowing the corrupted messages to flow through the mesh. Sorry to keep harping on --
>> Amin Ansari: No, no, I actually kind of enjoy this conversation. So I don't mind it.
>>: So how -- so if I flip a bit and now it gets routed to the wrong place, and that's happening arbitrarily in the network, now I can end up with a -- I can end up with a circular dependence on resources and deadlock?
>> Amin Ansari: That's not going to happen. Let me show you the example of how it's working and then we're going to discuss this particular example. Yeah. So here I have a very, very simple example. I think that you guys already have a very good vision of how it's working. Basically it's a four-by-four mesh; the darker color means that that router has been affected more by process variation and has the worse timing characteristic. And the lighter one means that it has the better timing characteristic.
So we expect that the ones that are lighter will end up with lower voltage values than the ones that are darker here, which end up with the higher voltage values.
We set everything to 800 millivolts. We want to transfer a packet from core 1, 1 to core 1, 4. Based on the deterministic routing, it goes through X first and then through Y. And it gets delivered without any errors. After a few epochs, what happens is that the voltage of all the routers will go down to around 700 millivolts.
And if you want to transfer the same packet, the thing that happens is that at router 1 or
--
>>: [inaudible].
>> Amin Ansari: Yeah. We're going to have a fault, and it's going to go to the next one, have a fault, and then get delivered. What happens is that at router 4, 4 we're going to drop that packet, raise the voltage of all the routers on that particular path, and basically do a retransmission.
>>: And --
>> Amin Ansari: And in the next epoch, basically, I have another example of another path, showing how the voltage changes happen on this path. Now, going back to the question that Doug had: what's going to happen if you have deadlocks?
So how can you have a deadlock? Basically deadlock means that you have a -- this problem starts at a particular node, right?
>>: Are you assuming store and forward?
>> Amin Ansari: Yes. Let's assume that this problem starts at a particular node. Like let's think about this particular example where we want to transfer this packet over this path.
>>: So -- well, I --
>> Amin Ansari: At some point this problem starts to happen, right?
>>: So you do [inaudible] each router? So you're buffering the whole message before you send any data from the message on to the next router on an on-chip network? No?
>> Amin Ansari: No.
>>: No. You do [inaudible] routing?
>> Amin Ansari: Yes. Definitely wormhole routing.
>>: Okay. So -- so now I have a -- the point you're making, I have a problem -- I have router 1, 3 and it sends a message down. And I have a problem with some other router, and it sends a message up. And now the buffers are full and they're both waiting for buffers to clear, but they're depending on each other. Okay?
>>: So I think where Doug is heading was that you want a message where I actually corrupt the destination address, correct?
>>: Yeah.
>>: Yes, I guess that's --
>>: Okay.
>> Amin Ansari: I guess one deadlock scenario that I can imagine is that the packet is going through the forwarding path and at some point you have a failure and it basically gets into a loop here, for example, something like that.
>>: [inaudible].
>> Amin Ansari: That's the scenario that you were describing, right?
>>: Yes. Classic, classic --
>> Amin Ansari: Yeah.
>>: [inaudible].
>> Amin Ansari: Yeah, I agree. So the thing that's happening is that the voltage of this router was wrong. So it's doing something incorrectly, right?
>>: No. The voltage is only going to affect the error rate.
>> Amin Ansari: Yeah.
>>: Okay? And if you -- if you up the voltage, you're going to change -- you're going to improve the error rate. But I've already incurred the error. And now I've got this deadlock where the buffers are full and the messages can't --
>>: They're going to flush the -- flush the buffers.
>> Amin Ansari: Okay. So only once an incorrect CRC is detected at the destination, correct?
>>: [inaudible] flush on a --
>> Amin Ansari: No, no. [inaudible] the source node.
>>: Okay. So at the source node you didn't get an ACK, which means you're going to up the power and you're going to flush the entire network?
>> Amin Ansari: No, not the entire network. Just the ones in that path. Just the routers in that path.
>>: But --
>>: No, that doesn't do it.
>>: Yeah, because I could have sent the routers -- I could have sent a message off that path. The message left the path.
>> Amin Ansari: Yes. But either that message gets delivered in this mesh or it doesn't.
You say that it's not going to get --
>>: All right. Here -- here -- here's the scenario. All right? I'm routing from the top corner to the bottom corner.
>> Amin Ansari: Yes.
>>: All right? And, yes, this sounds unlikely, but we're running it large scale. So now I have an error where I'm now going to route that message down this way. All right? So the -- and it's a long message. So it's going to fill up the buffers.
>>: I think you actually want to have the flip when it's coming down to go over. But keep going --
>>: Well -- well, actually no. Let's just say [inaudible]. And now I flip another bit. Now I try to send it from here to here, right? So I have two errors in the message. And now it's routing through and actually trying to route through itself.
>>: Okay. Sure.
>>: I mean, you can do -- you can construct a more likely example with two message, right, and know --
>>: Oh, I see.
>>: That message you now have -- and it's all off of --
>>: It's all off the path.
>>: Yup, I see it.
>> Amin Ansari: I kind of see what you're saying. The one answer I have is that --
>>: [inaudible] the buffers. Those nodes are completely wedged. But they're not sending or receiving. They're not going to time out. This guy times out because he didn't get it [inaudible].
>>: That's true.
>>: He flushes the buffers along that path. But the message is off that path, and your network --
>> Amin Ansari: One is that the likelihood of a message getting into such a scenario that basically two faults are happening such that --
>>: You could do it with two messages with higher probability. I already said that.
Don't -- never, never argue that the likelihood of the deadlock scenario --
>> Amin Ansari: Is very low.
>>: All right.
>>: Just remember we're Microsoft. We run at large --
>> Amin Ansari: Very large scale.
>>: Scale.
>> Amin Ansari: So any corner case can happen. That's true.
Okay. I guess the only answer that we would have in that scenario is that you probably need to flush all the routers inside your network. That's the only way that it's going to get fixed there. At the beginning --
>>: [inaudible] if you would add that. One of them is you can tell us why the message that was going down from R2,3 to R3,3 can't make that turn, or you're going to tell me about how Doug's stuck thing is eventually going to get flushed. Now, I think you might have an answer for either one of those.
>> Amin Ansari: Yeah, I think the one answer is that because of the X, Y routing it cannot go through that particular path. It cannot go -- well, it's actually possible, I mean, if the fault happens in a very, very particular manner, and, as Doug was saying, for a larger-scale system it might happen, you --
>>: So it is possible?
>> Amin Ansari: It is possible. So I can see that happening.
>>: [inaudible].
>>: Yes.
>>: That will go from Y to X routing in your router?
>> Amin Ansari: That is not Y to X routing. It's still X to Y routing. It comes here, X, Y, and X, Y.
>>: [inaudible] save yourself now, I think, by saying the R2,3, R3,3, flowing message cannot possibly go to R3,2. Is that true or is it not true?
>> Amin Ansari: This message going here?
>>: Yes.
>>: It can if its destination gets flipped in R3,3.
>>: I don't think it can -- I believe you have a router that does all Xs before Ys.
>> Amin Ansari: Yeah. But --
>>: So if you had a message that was moving on Y, you would never look at it and send it to X because that path doesn't exist --
>> Amin Ansari: Yeah, but the scenario that I can imagine happening is that this one says send the message to here, and basically you're going to send it over here.
Then this one says send it here. And basically you get a loop if you have these fault patterns that are complementary. This one [inaudible]. It's very, very unlikely, but I can see that happening.
>>: So then let me give you your second out. Let's take Doug's scenario. Doug's scenario is you have a tail -- a dog chasing its tail, one message R2,3, 3,3, 3,2, 2,2, 2,3.
>> Amin Ansari: Exactly.
>>: Okay. Tell me what happens in that network when I go forward now. What messages are delivered in that region?
>> Amin Ansari: You mean the other messages that --
>>: Other messages in your network that are trying to be delivered in that region. Are they?
>> Amin Ansari: They can -- they can be delivered. That message is going to flow so that other -- other things are not going to get --
>>: [inaudible] it's stuck?
>>: Buffers are full. It can't be transmitted.
>> Amin Ansari: But why are the buffers [inaudible]?
>>: Because they're filled up with the message and they can't be deallocated until the message makes forward progress. The message can't because the head of the message is waiting for a buffer to be freed up, to be transmitted.
>> Amin Ansari: But what if all of them try to go through the same path?
>>: Where Scott is going is that somebody's eventually going to try to transmit and they're going to time out and then they're going to flush that packet.
>> Amin Ansari: You're going to flush that packet.
>>: Right.
>> Amin Ansari: That's a very, very interesting answer. Yeah. I can imagine --
>>: Deadlock either stops stuff, in which case it stops stuff and your system will eventually recover. Or it doesn't stop stuff, and if it doesn't stop stuff you don't care.
>>: Yeah, that's true.
>> Amin Ansari: Thank you so much.
All right. I guess I describe this particular example. So the thing that's happening is that eventually basically the system will converge to a state that the voltage of all the routers are getting close to their optimal values. And the main reason for this is that the routers that have worse timing characteristic basically will end up being at the intersection of a larger number of paths. And those paths that are intersecting cause this router's voltage to go high -- up faster -- in a faster rate compared to the other ones.
Definitely for the very, very small networks, this might be not a correct statement. But we realized that for example for a 16 -- 16-node mesh and upward, all the simulations will end up having a similar -- will end up with this similar behavior.
So in summary, we showed a dynamic, router-oriented approach to save energy in the network-on-chip. It adapts to temperature, workload phases and wearout.
And we tried to achieve this in the presence of process variation by keeping the frequency unchanged, reducing the voltage of each router to the minimum value at which it can operate comfortably, and efficiently tolerating some of the occasional errors that happen in the network.
Applying Tangle to a 64-node mesh, we were able to reduce the energy consumption of the network-on-chip by around 28%. And this came with less than 5% overhead, for a wide variety of benchmark suites.
>>: So -- so -- so I wanted to talk about flushing the network now.
>> Amin Ansari: Yeah.
>>: So when I flush a path, I'm going to kill a bunch of messages that may be using routers along that path, that may be sitting along that path? Right?
>>: That's correct.
>>: Yeah.
>>: And so then they're going to time out, and so I'm going to flush that path. So if I've got a highly congested network, why doesn't the flush create a cascade of flushes? And I end up with [inaudible]?
>> Amin Ansari: It is possible. I can see that happening if you flush one path -- I think it might make more sense to say that you're going to flush the whole network.
These epochs are pretty long. So --
>>: But even if you flush the whole network, you still have the same problem, in that you've killed a bunch of messages and now those messages are going to time out and they're going -- I mean --
>> Amin Ansari: No, no, but when you flush you're going to reset the timer for each of them, basically.
>>: So the first -- so the first detected watchdog failure is going to be assumed to be the only problem that you can detect at that point.
>> Amin Ansari: At that point.
>>: You flush everybody.
>> Amin Ansari: Yes.
>>: You penalize that path --
>> Amin Ansari: True.
>>: And then you restart.
>> Amin Ansari: That's right.
>>: And even if there were other problems in the system, you're going to detect them with some future --
>> Amin Ansari: [inaudible].
>>: Message?
>> Amin Ansari: [inaudible].
>>: Okay.
>>: Yeah. I guess I didn't follow that. I still don't see -- I've killed a bunch of messages.
I don't know if they've been delivered or not. So they're buffered at the source.
>>: Yes.
>>: And I have to wait to get an ACK.
>>: Yes.
>>: And I've killed a bunch of ACKs, maybe an ACK of mine, so I'm going to time out. And then when I time out, something's gone wrong in the network, and that timeout could be a deadlock, so I have to flush.
>>: Yes.
>>: So every time -- based on your mechanism, every time you flush the network, you're going to kill some good messages if there's any traffic in the network, and then you're going to cause some timeouts and then you're going to flush again.
>>: It was that last sentence where you're wrong.
>>: Okay.
>>: So I'm not exactly sure how he does it, but I believe he -- the first timer goes off for the first message that is found to be a problem.
>>: First timeout.
>>: And he will then somehow wave a magic wand --
>>: And drain the network?
>>: And drain the net -- well, just dump everything in the network. And then tell every timer that the messages you had in flight are not going to be considered to be buggy in and of themselves, they're going to be buggy because of the path I had and so retransmit them and don't -- and reset the counters, and don't penalize them for having been caught up in this.
I think it's subtle, but I think it's doable.
>>: Yeah.
>>: I wouldn't want to write that code.
>>: Assuming you have the mechanisms --
>>: And it's microcode.
>>: [inaudible].
>>: Yeah. So while Doug's thinking --
>>: No, I buy it, I'm just trying to think if there are corner cases. I was worried about the amount of source buffering that we have to do. And you have to buffer -- and you have to have the detection 64 nodes to see the message --
>> Amin Ansari: Actually, buffering at the source node is a very important thing.
We have some collaborators from Intel Labs, and buffering at the source node is the main thing that they mention about this work that might be a limiting factor.
We tried to do a study of the buffering. And one thing that we realized is that buffering 64 flits at the source node is enough for a 64-node mesh network to have a relatively low performance overhead and keep the system working and not stalled a lot.
But that's a big drawback basically.
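A sketch of the source-side buffering being discussed; the 64-flit size and 6-flit messages come from the numbers mentioned here, while the structure and timeout value are my own illustrative assumptions. The source keeps every in-flight message until its ACK arrives, reports watchdog timeouts, and stalls injection when the retransmission buffer is full.

```python
from collections import OrderedDict

# Hypothetical source-node retransmission buffer for Tangle-style recovery.
BUFFER_FLITS = 64        # empirical size quoted for a 64-node mesh
FLITS_PER_MESSAGE = 6    # average message size, as discussed
TIMEOUT_CYCLES = 200     # assumed watchdog timeout (illustrative)

class SourceBuffer:
    def __init__(self):
        self.inflight = OrderedDict()              # seq -> (flits, cycle when sent)

    def can_inject(self, flits=FLITS_PER_MESSAGE):
        used = sum(f for f, _ in self.inflight.values())
        return used + flits <= BUFFER_FLITS        # otherwise the core must stall

    def send(self, seq, cycle, flits=FLITS_PER_MESSAGE):
        assert self.can_inject(flits), "source stalls: retransmission buffer full"
        self.inflight[seq] = (flits, cycle)

    def ack(self, seq):
        self.inflight.pop(seq, None)               # ACK received: the flits can be freed

    def timed_out(self, cycle):
        # messages whose watchdog expired; these get reported to the RMU and retransmitted
        return [seq for seq, (_, sent) in self.inflight.items()
                if cycle - sent >= TIMEOUT_CYCLES]

# Usage: send a message, receive its ACK, and check there is room to inject again.
buf = SourceBuffer()
buf.send(seq=0, cycle=0)
buf.ack(0)
print(buf.can_inject())   # True
```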
>>: Okay.
>>: Well -- sorry. The source buffer you need is 64 flits?
>> Amin Ansari: Yes, that is what we tried in --
>>: That's an empirical study.
>> Amin Ansari: Yeah. It's completely empirical.
>>: Well, but how many messages is that?
>> Amin Ansari: It -- it's -- we assume, I guess, six flits per message, yeah.
>>: Okay.
>>: And when you -- when you run out of buffering space --
>> Amin Ansari: You're going to stall.
>>: You just stall the process --
>> Amin Ansari: [inaudible].
>>: So you said that your -- the voltage per router tends to get close to the optimal?
>> Amin Ansari: That's true.
>>: How do you know that, or have quantified that or --
>> Amin Ansari: That's based on our static analysis, basically. We do a static analysis, as I showed in one of the earlier plots, so we have an idea of what voltage each of these routers should end up at.
And basically at the end of the day, we do a comparison and we see that these voltages that we get across the system are close to the ones that we got through the static analysis.
>>: Okay.
>> Amin Ansari: Yeah.
>>: So I want to do a quick back-of-the-envelope. So how many cycles is it router to router? Roughly?
>> Amin Ansari: Just the link itself or the whole --
>>: No, to get -- from a [inaudible].
>> Amin Ansari: [inaudible].
>>: The link is --
>> Amin Ansari: That's probably -- you can assume it's one cycle.
>>: But you're crossing clock domains, aren't you?
>> Amin Ansari: No, no, I said we keep the clock constant. That's the main objective.
>>: Yes. Okay. All right.
>> Amin Ansari: [inaudible] domains.
>>: Yeah. So you're crossing voltage domains --
>> Amin Ansari: Yeah.
>>: Does that --
>> Amin Ansari: Crossing voltage --
>>: [inaudible].
>> Amin Ansari: You need voltage level shifters, but it's not as bad as crossing a frequency domain.
>>: I don't mean the voltage, so -- but is it -- I mean, is it free in terms of time?
>> Amin Ansari: It is not --
>>: All right. So let's say four cycles. It sounds a little optimistic to me. But four cycles, and you've got a 64-node network. So your average message transmit time will be --
>>: But how many cycles is --
>> Amin Ansari: Nine -- sorry, 12. Three or two or something. 1.5 [inaudible]. It's going to be like 12.
>>: Yeah, yeah, it will be your eight -- we'll say 10. So it's 40 cycles for the first data packet to get through. And how many cycles does it take to get one flit through? Your flit size?
>> Amin Ansari: Two [inaudible].
>>: The router?
>>: The flit is how many bits on the network? If your flit was [inaudible].
[brief talking over].
>> Amin Ansari: Yeah, 120.
>>: 120 bits. And how wide are your links?
>> Amin Ansari: 64.
>>: All right. So it's two cycles. All right. So that's the noise. And an average message is 6 flits. That's 12 cycles. So you're looking at, let's say, 50 -- 50 cycles to transmit a message on average across the network. And then it's 50 cycles to return back. That's 100 cycles round trip. How long does it take you to flush the network?
>> Amin Ansari: That's -- I actually don't have --
>>: Yeah. Okay. So but -- yeah. Because you've changed the flush policy.
>> Amin Ansari: Yeah.
>>: So a hundred cycles per message. If I've got six messages source buffered, that's 600 cycles?
>> Amin Ansari: I guess so.
>>: Yeah. Well, I guess you could pipeline those. But what is your injection rate into the network? How often does a processor send --
>> Amin Ansari: I don't remember.
>>: Okay.
>> Amin Ansari: I don't remember. We don't have a lot of benchmarks.
What I imagine is that there are different injection rates for different benchmarks.
>>: [inaudible] just going to run little slots and see what --
>>: Given the size of the window, you know, and the number of messages in flight, how many would you need to buffer at the source before you started [inaudible], because there's going to be a lot of stuff.
>>: [inaudible] locality and et cetera.
>>: Yeah. Yeah. Okay. Well, we can move on.
>>: Okay.
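(A quick Python sketch of the back-of-the-envelope numbers discussed above; the hop count, per-router latency, flit counts, and buffer size are the assumed values from the discussion, not measured results.)

```python
# Back-of-the-envelope sketch of the message-latency discussion above.
# All parameters are assumptions for illustration, not measured values.

hops = 10                 # assumed average hop count across the network
router_latency = 4        # assumed cycles per router hop (the "four cycles" estimate)
flit_bits = 120           # flit size in bits
link_width = 64           # link width in bits
flits_per_message = 6     # assumed flits per message

cycles_per_flit = -(-flit_bits // link_width)        # ceil(120 / 64) = 2 cycles per flit
head_latency = hops * router_latency                  # head flit traversal: ~40 cycles
serialization = flits_per_message * cycles_per_flit   # streaming the body: ~12 cycles
one_way = head_latency + serialization                # ~50 cycles one way
round_trip = 2 * one_way                              # ~100 cycles round trip

source_buffer_flits = 64
messages_buffered = source_buffer_flits // flits_per_message  # roughly 10 messages

print(one_way, round_trip, messages_buffered)
```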
>> Amin Ansari: Okay. So in the second part of the talk, I will discuss some of the work that we did on the cache. The main reason that I presented Tangle --
>>: We only have about 10, 15 minutes left in your talk since we've dragged this out.
Would you mind jumping ahead to the Necromancer?
>> Amin Ansari: All right. So since we have limited time, I guess I'll present Necromancer, which is the work that was done in order to protect the processor pipeline.
So assuming that I have presented the cache work as well, we've presented some works that can protect the network-on-chip and the cache. And in order to maintain an acceptable level of yield for the whole processor, since the processor pipeline also consumes a substantial area of the chip, we want to have a mechanism that basically protects this part as well.
However, this is more challenging, since the processor pipeline is inherently more irregular. For example, in the network-on-chip all the routers look the same, and in the cache pretty much everything is SRAM, so it's easier to use redundancy-based techniques to repair those structures. Here we are trying not to rely on a redundancy-based approach, and instead use an alternative that makes the dead core do something useful and basically enhance the system throughput.
In the next few slides I'm going to describe how we are trying to exploit the functionality of the dead core to achieve better system throughput.
So given that the dead core contains a fault, we cannot trust that core to execute the program even for a short period of time. And the question becomes, how can we exploit such a core? The approach that we are taking is to use that core in order to accelerate another core. We have this dead core, it contains a hard fault, and we add a little core to the system that we call the animator core. The animator core is basically an older generation core running the same ISA with substantially fewer resources.
And then we -- sorry?
>>: On-chip or --
>>: Yeah.
>> Amin Ansari: So the animator core is running the main program, and we let the dead core also run the same program. We abstract the execution information of the dead core and extract some useful information from it that we call hints. And we send these hints to the animator core in order to enhance the performance of the animator core.
Here, in order to see how much speedup we can potentially get, I have a study that shows the IPC of different Alpha cores normalized to the simplest one. From left to right, they are a two-issue in-order, a four-issue in-order, a two-issue out-of-order, and a six-issue out-of-order core.
And we want to use the first three as different --
>>: Six --
>>: Out of order. Six.
>>: So --
>>: It's four issue out of order. It can dispatch six instructions [inaudible] but it will never sustain more than four?
>> Amin Ansari: It will never sustain more -- in our experience, we actually use six issue. So basically sorry if it's not EV6, EV --
>>: [inaudible].
>> Amin Ansari: Yeah. It's similar to EV6. We try to keep it six issue. Yeah, you're right, it might not dispatch six. So the thing that's happening is that from left to right, basically the complexity of these cores is increasing, and they need more resources. We're going to use the first three as different alternatives for our animator core, and we want to see how much performance we can get by providing perfect hints.
Here, perfect hints means that we're going to have perfect branch prediction and no L1 cache misses.
So this yellow bar is showing how much performance we're going to get with those perfect hints for the animator core. The thing that's interesting here is that if you can provide these correct hints to the animator core, the performance of a dual-issue out-of-order core can actually exceed the original performance of a six-issue out-of-order core. It means that there's a lot of opportunity for acceleration.
Obviously providing perfect hints is not possible, but we are trying to do our best to get close to this chart -- to this plot.
So here is the -- a little bit more detailed structure of Necromancer. On the left-hand side we have the dead core, and on the right-hand side we have the animator core.
Most of the hints go to a queue in order to get transferred between these two cores.
And the modules that we added to both these cores are highlighted in this figure.
So we are going to start by describing the dead core, and we will gradually go through the different structures here. The dead core basically runs the same program and provides the hints for the animator core. You can think about it as an external run-ahead engine, since it's the six-issue out-of-order core. And we have instruction cache hints, which are basically the PCs of committed instructions. We have data cache hints, which are the addresses of committed loads and stores. And we have branch prediction hints, which are branch prediction updates for the animator core.
And one modification that we make here is that we give the L1 cache of the dead core only read access to the L2 cache, since we want to preserve the memory state; we don't want to write its dirty lines back to the shared memory.
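(A minimal Python sketch of the three hint types just described; the field names and structure are illustrative assumptions, not the actual hardware format.)

```python
# Sketch of the three hint types extracted from the dead core's committed stream.
# Field names and encoding are illustrative assumptions, not the actual RTL.
from dataclasses import dataclass
from enum import Enum, auto

class HintType(Enum):
    ICACHE = auto()   # PC of a committed instruction (instruction-cache warm-up)
    DCACHE = auto()   # address of a committed load/store (data-cache warm-up)
    BRANCH = auto()   # branch outcome forwarded to the animator core's predictor

@dataclass
class Hint:
    kind: HintType
    pc: int           # program counter of the committing instruction
    payload: int = 0  # e.g. effective address for DCACHE, taken/not-taken for BRANCH

def hints_from_commit(pc, is_mem, mem_addr, is_branch, taken):
    """Generate the hints produced by one committed instruction on the dead core."""
    out = [Hint(HintType.ICACHE, pc)]
    if is_mem:
        out.append(Hint(HintType.DCACHE, pc, mem_addr))
    if is_branch:
        out.append(Hint(HintType.BRANCH, pc, int(taken)))
    return out
```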
>>: So there's --
>> Amin Ansari: Oh, go.
>>: -- something that I don't understand about this full speed. So I'm building -- I'm building a -- designing a processor. I'm going to have a bunch of cores on it.
>>: So the idea here is I'm going to put in the [inaudible] gathering logic and the ability to drop dirty lines and then push those out to the small core? Am I doing this from the day I line up the design --
>> Amin Ansari: Yeah, yeah, yeah.
>>: -- or am I only doing it when the first thing fails?
>> Amin Ansari: This is for the manufacturing faults. So basically you will know at the manufacturing time that something is wrong. You need to somehow fix it. Yeah.
So I have a slide -- I will jump to a slide I have that I think will be useful. That's how we're going to design the system. Oh, go ahead.
>>: If dead core cannot read what it has written, can it still run ahead far enough to do the --
>> Amin Ansari: There are different type of failures that are -- that can happen inside the -- inside the --
>>: [inaudible] allowing it to read what you've written, because it only has read access on the cache.
>> Amin Ansari: No, no, but it has local L1 cache, which is --
>>: [inaudible].
>> Amin Ansari: Yeah, yeah. L2 is --
>>: L2 [inaudible].
>> Amin Ansari: Yeah. Yeah.
>>: Sorry.
>> Amin Ansari: All right. So here I have a picture that basically shows how we're going to design the system. This is modeled after [inaudible] with 16 cores, and each cluster has four cores. And this is the vision, basically: we share an animator core across several aggressive cores. And at manufacturing time, we know whether any of them has a failure or not. Depending on that, we are going to hard-wire it, or blow some fuse or something, that's going to couple one of these cores with the animator core inside the group.
Should I move on?
So the animator core, as I mentioned, is an older generation core running the same ISA. Here I assume a two-issue out-of-order core. And since it has the precise state, we basically allow it to handle the exceptions. We treat the cache hints as prefetching information to warm up the local caches.
And we rely on a fuzzy hint disabling approach, based on continuous monitoring of the hint effectiveness, to figure out when we need to disable a particular type of hint to save energy and also reduce contention for the resources of the animator core.
And we copy the PC and architectural registers whenever we realize that the state of the dead core is so far away from the correct state that we want to do a resynchronization.
Sometimes it starts executing down the wrong path and goes on and on, and basically the hints that you are getting at that point are no longer useful. And then we do a resync between them.
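(An illustrative Python sketch of the two control decisions just described -- fuzzy hint disabling and state resynchronization. The thresholds and window sizes are made-up values for the example, not the tuned hardware parameters.)

```python
# Sketch (not the actual hardware) of fuzzy hint disabling based on measured
# hint effectiveness, plus the coarse-grained resynchronization decision.
# disable_threshold, resync_threshold, and window are illustrative assumptions.

class HintController:
    def __init__(self, disable_threshold=0.4, resync_threshold=0.1, window=1024):
        self.disable_threshold = disable_threshold  # per-hint-type usefulness floor
        self.resync_threshold = resync_threshold    # overall usefulness floor
        self.window = window                        # monitoring window, in hints
        self.useful = {}                            # hint type -> useful count
        self.total = {}                             # hint type -> total count

    def record(self, hint_type, was_useful):
        """Continuous monitoring: note whether a delivered hint actually helped."""
        self.useful[hint_type] = self.useful.get(hint_type, 0) + int(was_useful)
        self.total[hint_type] = self.total.get(hint_type, 0) + 1

    def should_disable(self, hint_type):
        """Fine-grained decision: stop applying this hint type for a while."""
        total = self.total.get(hint_type, 0)
        if total < self.window:
            return False
        return self.useful[hint_type] / total < self.disable_threshold

    def should_resync(self):
        """Coarse-grained decision: copy the PC and architectural registers
        from the animator core into the dead core and restart it."""
        total = sum(self.total.values())
        if total < self.window:
            return False
        return sum(self.useful.values()) / total < self.resync_threshold
```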
Since most of our communications are unidirectional, we're going to use a single queue for transferring the hints. And L2 warm-up is provided for free, so there is no need for extra hardware to do that communication between the two cores.
>>: So there is a lot of work on, you know, slipstream processors and SSRT and, you know, subordinate simultaneous cores and all of this. Is your big advance here -- clearly the coupling for yield is a novel step. Are you adding other hint mechanisms that do better than the prior work --
>> Amin Ansari: I think that's a very, very good question.
>>: -- core acceleration? Or are you basically taking the best practices from that prior work and applying it to this topology [inaudible].
>> Amin Ansari: That's a great question.
So the biggest advantage that we get, in terms of selling points, is definitely that we're using it for yield and [inaudible]. The second thing is that I would say the way we apply it is different from the prior work. But there are a few things that are completely different, and that's mainly this decision to disable a particular type of hint or to do the resynchronization. These are fine-grain decisions that we make that are not the case in the prior work.
And basically the way that we design the hint distribution and gathering is based on the fact that we know that we need to disable these hints once in a while.
>>: Yes.
>> Amin Ansari: Exactly.
>>: Okay.
>> Amin Ansari: So here, for the sake of time, I'll skip over the cache hints and I'll go over the lifecycle of a single branch prediction hint in our system.
The thing that's happening is that whenever there is a branch prediction update in the dead core, we send a signal to the hint gathering unit and we say, okay, generate a new hint. It's going to look into the queue. If the queue is full, it's going to stall, meaning that the dead core is already too far ahead of the animator core and it does not need to generate any new hints.
Otherwise, it's going to take that PC and add two more fields to it. One is the type of the hint. The other one is the H tag. The H tag is a field that we use in order to let the animator core know when to use that particular hint -- when is the right time to use it.
And we inject this hint into the queue. It gradually goes through the queue and gets to the head of the queue. At the animator core side we have a buffer that basically buffers these hints and allows us to get out-of-order access -- or random access -- to the different hints inside this buffer, because we have the cache hints as well.
And whenever the H tag is less than some empirical value, we know that at this point we can use this particular hint. And if the hint disabling unit is allowing that hint to be applied locally, we take the hint and send it to the fetch unit of the animator core.
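(A rough Python sketch of the branch-hint lifecycle just described: the gathering unit stalls when the queue is full, tags each hint with an H tag, and the animator side releases a hint only when the H tag falls below an empirical threshold and the disabling unit still allows that hint type. The queue depth, threshold, and the way the H tag is computed here are assumptions for illustration.)

```python
# Sketch of the hint gathering / release flow; constants are illustrative assumptions.
from collections import deque

QUEUE_DEPTH = 64       # assumed hint-queue depth
H_TAG_THRESHOLD = 8    # assumed empirical release threshold

hint_queue = deque()

def gather_branch_hint(pc, dead_core_commits, animator_commits):
    """Called on every branch-predictor update in the dead core."""
    if len(hint_queue) >= QUEUE_DEPTH:
        # Queue full: the dead core is far enough ahead, so stall hint generation.
        return False
    # Assumed H-tag computation: a rough measure of how far ahead the dead core is.
    h_tag = dead_core_commits - animator_commits
    hint_queue.append({"pc": pc, "type": "BRANCH", "h_tag": h_tag})
    return True

def release_hints(disabling_unit_allows):
    """Animator-core side: pop hints whose H tag says it is time to use them."""
    released = []
    while hint_queue and hint_queue[0]["h_tag"] < H_TAG_THRESHOLD:
        hint = hint_queue.popleft()
        if disabling_unit_allows(hint["type"]):
            released.append(hint)   # forwarded to the animator core's fetch unit
    return released
```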
In the fetch unit we make a small modification. I don't know whether it's small or not.
The original predictor of the animator core is a simple bimodal predictor. We're going to add another bimodal predictor that we call the Necromancer predictor. And this particular predictor will basically keep track of the hints that are coming in.
At a higher level we apply a tournament predictor that decides, for a given PC, whether the original branch predictor of the animator core or the Necromancer predictor should take over. And we are going to use this tournament predictor to decide whether, at a given point in time, we should disable the branch prediction hints or not.
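(A minimal Python sketch of the predictor arrangement just described: the animator core's original bimodal predictor, a second bimodal predictor trained from the dead core's branch hints, and a tournament chooser that picks between them per PC. Table sizes and counter behavior are illustrative assumptions.)

```python
# Sketch of the bimodal + Necromancer + tournament predictor arrangement.

class Bimodal:
    def __init__(self, entries=1024):
        self.table = [1] * entries              # 2-bit counters, weakly not-taken

    def _idx(self, pc):
        return pc % len(self.table)

    def predict(self, pc):
        return self.table[self._idx(pc)] >= 2   # predict taken if counter >= 2

    def update(self, pc, taken):
        i = self._idx(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

class TournamentPredictor:
    def __init__(self):
        self.base = Bimodal()      # animator core's original predictor
        self.necro = Bimodal()     # predictor trained from dead-core hints
        self.chooser = Bimodal()   # per-PC selector: "taken" means trust necro

    def predict(self, pc):
        use_necro = self.chooser.predict(pc)
        return self.necro.predict(pc) if use_necro else self.base.predict(pc)

    def hint_update(self, pc, taken):
        self.necro.update(pc, taken)             # branch hint arriving from the dead core

    def update(self, pc, taken):
        base_ok = self.base.predict(pc) == taken
        necro_ok = self.necro.predict(pc) == taken
        if base_ok != necro_ok:
            self.chooser.update(pc, necro_ok)    # shift trust toward the better predictor
        self.base.update(pc, taken)
```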
And in this slide I describe how we apply Necromancer to a larger CMP system.
In summary, we're going to use Necromancer to enhance the throughput of the system by exploiting dead cores. And we leverage a set of microarchitectural techniques to provide intrinsically robust hints, fine- and coarse-grained hint disabling, online monitoring of the hint effectiveness, and also dynamic state resynchronization between the two cores.
Applying Necromancer to a four-core CMP, on average we get 88% of the original performance of a live core. And this comes with modest area and performance overheads of five and nine percent, respectively.
>>: 88% is the performance of the Necromancer core with the hints compared to the original --
>> Amin Ansari: Exactly.
>>: [inaudible] issue core?
>> Amin Ansari: Exactly. Exactly. Compared to six issue. Exactly.
>>: And that's really just a pair? It's not a four --
>> Amin Ansari: It's just --
>>: Yeah.
>> Amin Ansari: Yes. But it means that it shares the net [inaudible].
>>: You gain --
>> Amin Ansari: Yeah.
>>: You're talking about a pairing of two cores --
>> Amin Ansari: Exactly.
>>: The sharing is not --
>> Amin Ansari: Not like sharing for performance, exactly.
So I guess that I will go over some of the potential future works and the -- some of the concluding remarks, if we have time.
>>: So I have another -- one more question.
>> Amin Ansari: Okay. Go ahead.
>>: So for a four-core CMP in the Necromancer --
>> Amin Ansari: Yeah.
>>: [inaudible] let's assume that the -- ignore L2 or anything like that. So each core with its L1 and all of that is 25 percent of that cluster in terms of area.
>> Amin Ansari: Yeah.
>>: And now I'm adding a new core, plus all of the traffic logic and buffering. So to add 5.3% to the area means that the hint logic plus the new core adds 5.3% area to that, which means that the core is a sixth the size of the other one? Of the big ones?
>> Amin Ansari: It's roughly one-fifth. This is including the L2s. The L2s are already there, so if you remove the L2s, it's going to be a bit more area overhead. You mentioned excluding L2s; I'm saying this 5% is including L2s.
But the area comparison that you're interested in is: what's the area of this core compared to that one? It's around a fifth.
>>: A fifth?
>> Amin Ansari: Yeah.
>>: What are the -- so what are the cache sizes in the two cores?
>> Amin Ansari: This one is 64 kilobytes for data and 64 for instructions. This one is in the 4 kilobyte range.
>>: Okay.
>> Amin Ansari: So it's very, very small.
>>: [inaudible].
>> Amin Ansari: [inaudible].
>>: Yeah.
>> Amin Ansari: So I'll go over the future work very quickly. Some of the things that I can work on in the same area of reliability and low-power design are, one, control theory for power management of the processor pipeline, caches, and network-on-chip. I'm actually currently working on this -- we have some collaboration with Intel Labs, and I have some students working on this at U of I.
And another area is that we want to exploit process and temperature variation for reducing the refresh energy of dynamic memories, since dynamic memories are becoming more common on chip, as in POWER7 and other processors.
Using temperature and process variation information can substantially decrease the number of refreshes that you need to do. Another thing is using more reconfigurable architectures for near-threshold voltages, and I have some ideas to extend some of the prior work that I've done for this domain.
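(A rough Python sketch of the refresh-energy idea above. The rule that DRAM retention roughly doubles for every ~10 C drop in temperature is a common rule of thumb, and all the numbers here are illustrative assumptions, not measurements.)

```python
# Illustration: a refresh interval chosen for worst-case temperature (e.g. 85 C)
# is overly conservative when the chip runs cooler, so fewer refreshes are needed.
# The scaling rule and all constants are assumptions for this sketch.

BASE_TEMP_C = 85.0          # worst-case temperature the spec assumes
BASE_REFRESH_MS = 64.0      # standard refresh period at worst case

def refresh_period_ms(temp_c, per_cell_margin=1.0):
    """Estimated safe refresh period at a given temperature.

    per_cell_margin < 1.0 models weak (process-variation-affected) cells
    that need more frequent refresh; > 1.0 models strong cells.
    """
    scaling = 2.0 ** ((BASE_TEMP_C - temp_c) / 10.0)
    return BASE_REFRESH_MS * scaling * per_cell_margin

# Example: at 55 C a typical cell could tolerate roughly 8x the worst-case
# refresh period, cutting refresh energy by about the same factor.
print(refresh_period_ms(55.0))        # ~512 ms
print(refresh_period_ms(55.0, 0.5))   # weak cell: ~256 ms
```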
Another thing that I'm excited about is to study the newer technology nodes and see what the tradeoffs between energy and reliability are, for example for PCM, carbon nanotubes, and spin-torque transfer memories.
Another thing that I'm excited about is to see, if we look at approximate computing, what reliability work can contribute to it. For example, understanding fault propagation and containment, measuring the accuracy of results, deciding what an acceptable level of accuracy is, limiting the fault rate, devising a fall-back mechanism, or even making more applications suitable for this domain. I think these are all interesting challenges in this domain.
Here I have a list of publications from the course of my graduate and postgraduate studies. I worked generally in three areas: reliability and fault tolerance, low-power design, and also single-thread performance and throughput.
In addition to what I've presented, I also worked on some other areas like transient fault recovery, wear-out-aware scheduling, online testing, refresh energy reduction for dynamic memories, energy-efficient accelerators, and also application-specific processors.
So in conclusion, as you know, we have more transistors these days than we can power on. Therefore, in order to scale performance in an energy-constrained environment, we need to improve computational efficiency.
And this requires rethinking computer architecture for energy efficiency from the ground up.
As we saw in this talk, reliability and energy efficiency are tradeable, and we can take advantage of techniques that tolerate process variation for this purpose. However, conventional reliability solutions are too expensive for mainstream high-performance microprocessors.
There is a need for new proposals that can trade reliability and energy efficiency by providing runtime adaptability and a high degree of reconfigurability.
Thank you so much for your time. I'll be happy to take any further question. I got very, very interesting feedbacks for the first week, and I really appreciate it.
[applause].
>> Doug Burger: Thank you very much. In the interest of time, I think you're going to be meeting with most of us anyway, so --
>> Amin Ansari: All right.
>> Doug Burger: [inaudible] have a question. Thank you for the talk.
>> Amin Ansari: All right.
>> Doug Burger: Definitely interactive. [inaudible]. You are on for 1:30.
>>: Yeah.