>> Darko Kirovski: Welcome. Today I would like to welcome Professor Cong
from UCLA Computer Science Department to Microsoft Research. He'll be
talking about converting C code to FPGAs.
I want to present Professor Cong as one of the, I would say, brightest stars of
electronic design automation over the past probably 15 years, in all of academia
as well as industry. He has done pretty much everything that there is in
academia. He's the chair of the department. He is an IEEE fellow, has produced a
couple of students who have ended up at Wisconsin, UT Austin, UIUC, et cetera.
He's the chief technology advisor to Magma. There's just tons of things; I could
spend pretty much the entire hour introducing Professor Cong. I think the best
thing is if we get down to the technical talk. Thank you, Jason.
>> Jason Cong: Darko, thanks very much for the introduction. It's my pleasure
to be here. Even before walking into the room Darko had lowered my expectations,
saying you may not even see people because people can sign up remotely, and so
on. In fact that was my expectation as well, because I always associate the name
Microsoft with leading software research. I didn't even expect the topic would be
of general interest, so I appreciate your being here.
But I think it's getting to the point that the boundary between software and
hardware is blurred quite a bit. There's also a lot of opportunity to use some of
the techniques I'm talking about for accelerated computing as a new computing
platform in general. So as we go through, I'd be happy to take questions and
have some discussions.
Since this is my first time talking at Microsoft Research, let me give you a
quick overview of some of the things we are doing. We are actually looking at a
number of topics, from physical synthesis for scalability and variability to
novel interconnects for communications. In particular we're looking at 3D
designs and also using RF interconnect for on-chip communication. This is
actually very useful for multicore types of designs. 3D will cut down the
communication between the different cores, core to cache, right, giving you a
new type of configuration. The RF, which I'll talk about in a moment, will give
you the capability to multiplex multiple channels on the same physical medium,
allowing you to have a reconfigurable interconnect.
And the topic I'm going to talk about more today is electronic system level
synthesis from the behavior level, the system level. Let's see. I'll just come
down to this slide and talk a little bit about 3D. It's a (inaudible) project at
UCLA that started about five, six years ago. We have developed a very complete
system. By the way, today if you go to any EDA vendors, CAD tool vendors, and you
want to buy a tool suite to do 3D design, that does not exist. So this is
typically how it works: DARPA or NSF will put in research dollars to enable the
technology first and then industry will follow. So in this case, in the second
phase we actually have a joint project with IBM, Penn State and UCLA. We pretty
much created a physical design flow from 3D floorplanning, placement and
detailed routing that interfaces with a commercial router, in this case Cadence's.
And all of this is on an open source data model called OpenAccess. If any
people here are interested in doing some architecture explorations, that's
actually very useful. We have several papers using this as a kind of physical
prototyping tool, just to say, hey, if we can actually design things in 3D
instead of 2D, how much impact will there be on latency, power, and so on, so forth.
This is a quick introduction to the RF part, where we have done a lot of work
looking at novel interconnect schemes. What we see so far is that today, because
of power limitations, we are limited in our communication bandwidth. I mean, the
processor runs at more or less four or five gigahertz. However, a basic
transmission line, if you look at it even in today's 90 nanometer process, can
carry signals at somewhere around 100 to 200 gigahertz.
And so we are only using five percent or less of the bandwidth available. So
the idea here is that we're going to multiplex multiple communication channels
on a physical wire, which is a well-terminated transmission line. And by doing
that, I can actually have a totally reconfigurable system. Of course you also
get the benefit that it goes at the speed of light, right, it's a much higher
bandwidth, but the more important part for a multicore, many-core system is
that I can talk to you on demand. We can create a channel at a particular
frequency; if we need to talk a lot more we can grab more frequencies, and when
we're done we can release them so someone else can talk.
This work was actually just published at the HPCA conference back in February
and received the best paper award, so the community seems to really like the
concept and is looking in this direction.
So this is kind of the idea. You can still have that same mesh network-on-chip
kind of arrangement. But you can superimpose a freeway on top of that. And with
that, different access points can actually create different shortcuts for you
to have dedicated communications. So at a very high level you can think about
it this way: you have surface streets, and you build a freeway on top of that,
so you can have a kind of reconfigurable interconnect between any two points.
Or you can actually have multiple pairs of them talking simultaneously.
The number of pairs is determined by the number of channels we can multiplex
on the high speed transmission line. Okay. But where I want to spend a little
bit of time today is on this electronic system level design, where what we have
done is enable raising the design abstraction from the current VHDL/Verilog to
a higher level. So why were we interested in doing that?
First, if you really design a customized chip, which (inaudible) has been our
research focus at UCLA in my lab, so we work with almost all the major
semiconductor companies very closely, you are looking at 10 million gates to 50
million gates, and even 10 million gates takes you half a million lines of RTL.
So it's a lot of code to create. I'll show you later on that by moving up the
level of abstraction you can reduce that by roughly a factor of 10. And then
also, almost all the interesting SOC designs, small chip designs, have either
one or multiple processors there, right. So it would be much nicer if you could
actually use a common specification, basically a C-like language, so part of it
goes to the embedded processors and the rest can be compiled, synthesized into
hardware.
Moreover, there's a lot of push in the hardware community to just be able to
have an executable specification. No one is willing to dive into writing half a
million lines of VHDL/Verilog code from a PowerPoint specification.
So they want to have a higher level execution model first, then validate that
model and then refine it into RTL.
So if you walk into almost all the major semiconductor companies today, they
actually have a lot of models written in C or C++ or a variation of C++ which
is called SystemC, basically a C++ program with a special class of extension
libraries to describe hardware behavior, communication, synchronization,
clocking and so on, so forth.
So in a way, the community or the industry is also ready for something of this
kind, because it reminds me of the revolution we went through 10, 12 years ago.
Remember when the Verilog language was first introduced, it was not intended as
a synthesis language, right, because Verilog allows you to write models at the
RT level, register transfer level, which can simulate 100X, 200X faster than
the gate level. So a lot of people started coding things in Verilog or later on
VHDL just for the purpose of simulation. But once you have all those models
available, the question is, why should I refine them manually into a gate-level
specification? Can we automate that process? That's where Synopsys started the
revolution, so we can all design at the RT level. So I think we are at a similar
transition point: if we can enable people to design circuits just using C, C++
and then refine that into the traditional RTL, that's actually where we want to go.
And then finally I'll show you that once we have this technology, it also
creates the capability of doing so-called accelerated computing or
reconfigurable computing, where you can couple CPUs with FPGAs in a kind of
platform that enables much higher performance at a much more reasonable power
budget.
Okay. So those are the reasons. The research started at UCLA in the year 2000
when we started looking in this direction. For me personally there was also
another reason to start working in this area, because clearly in chip design
today the performance is dominated by the interconnect, the wires. There's a
lot of concern saying that we cannot scale the performance any further, which
is true, but at that time we also realized not all the long wires are bad. A
long wire you have to cross in a single clock cycle is a performance
bottleneck. If you have a long wire that can cross multiple clock cycles,
that's fine. I said, okay, this is a great idea, maybe I should design a
synthesis system that allocates multiple clock cycles for every long wire.
So we looked along that direction a bit further, but it turned out that if you
want to do that kind of optimization in synthesis, you cannot start with the
existing RTL specification. Because by definition RTL is a cycle-accurate
specification: it tells you what to do at every clock cycle. You may actually
make a mistake as a designer in that you don't give me the freedom to optimize,
so I really want to have the total freedom to schedule the computation and also
the communication.
In order to do that, I have to move up to the behavior level. Of course once we
started doing that, we also saw a lot of benefits of just moving to a higher
level, for design productivity, for verification, for a number of other things.
So this is the prototype system we have developed at UCLA, and later on I can
also mention a little bit about the commercialization effort.
So we have a system where we can take C, C++ or SystemC as input, and on the
other side we can take the platform descriptions, for example whether this maps
to an FPGA or an ASIC, whether it's 90 nanometer or 65 nanometer.
Then we can compile it into a system level data model where we now have the
capability to map it, is there a pointer I can use, to map it into three
different alternatives. One is that you go to customized hardware through
behavior synthesis, and another is to go to customized processors or standard
processors; it doesn't have to be one, it could be multiple of them.
And then finally you can also generate the interface to put everything together.
The part I want to talk about a little further today is this automated hardware
synthesis part. So from a C level behavior specification we first go through a
number of compiler-like optimizations and transformations, for example loop
unrolling, pipelining, strength reduction, bitwidth analysis and optimization,
memory analysis. I didn't list all of them; for example we also have
interprocedural analysis, so on, so forth.
On the compiler side we decided to take a hybrid approach. In a lot of previous
efforts people started writing a dedicated compiler for this. I think that's
probably wrong, because this is where the community has done a lot of research;
it's best to leverage an open source compiler where you can take all the
existing effort and then add the hardware-specific passes on top of that.
That's the approach we took. I don't know how many of you are familiar with the
LLVM compiler from the University of Illinois. That was the one we actually
used as the base.
It's very big at Apple; it's an open source effort, with 10, 20 engineers from
Apple checking in code all the time. It's a very active effort.
From what I understand, that is the next compiler framework they're going to
move to. And then of course the core research here is in scheduling and
resource binding: how do you generate very efficient, competitive hardware. I
guess no one will argue with me about the motivation for doing something of
this kind. The question is how competitive you can be, right. So that's where
we put in a lot of effort.
I will tell you a little bit about the hard parts of what we are doing at a
very high level. And if we have some one-on-one time I'll go into more details.
And finally we generate the VHDL/Verilog. We can also generate SystemC so you
can simulate it without any license, because those are C++ programs. The target
can be FPGAs or ASICs. We also generate the constraints. The constraints could
be multicycle constraints or false path constraints.
We also have a plan to do very aggressive power optimization. By the way, this
is a unique advantage when you do synthesis from the behavior level. Power is a
big constraint, a big concern. So how do you minimize it? Really the most
effective way is just to shut things down when you are not using them, right?
Today we do coarse-grain power optimization at the OS level or system level: if
I am not using the disk for 30 seconds, I can shut off the disk.
But in fact, even when you are using the disk, if you look at the disk
controller, maybe only 15 percent of the circuits are active at any given
point; the other 85 percent are not active. But the designer does not have the
capability to go in and shut down the circuits one by one at that very fine
grain level. So that's the technology we can provide.
Because we do the entire scheduling, remember. We schedule the computation onto
all the resources. We know exactly what is going to be on and what is going to
be off. We had that idea for a long time. It used to be the case that there was
no way to communicate that to the downstream implementation tool; I couldn't
say, okay, I'm going to shut down this multiplier for 16 clock cycles, I'm
going to leave it on for another four clock cycles, right.
How do I tell the implementation tool? There was no way. Only very recently, a
few months ago, the industry got together, Synopsys, Cadence, Magma, and
created a format called UPF. How many have heard about that? The Unified Power
Format. It allows us to pass down the design intent from the high level.
They created the format not for this kind of automated tool, really for human
designers, right. So if you're an RTL designer and you're really very careful,
very clever, you can say, okay, these things go to high Vdd, these go to low
Vdd, this one will be off, and so on. So we think that is actually ideal for us
to pass the idea to the implementation tool. So that's also a very active
research area.
Let me show you one slide about results first, just to convince you that we can
do this. For example, this is the result from two years ago from one of our
industry partners that gave us an MPEG-4 decoder. They had chopped up the whole
design into a number of blocks. So these are the modules within that design and
these are the submodules, and these are the number of lines of C code; you
have, for example, a copy controller, motion compensation, there's a parser, a
decoder, so on, so forth.
All together it's about 5,000 lines of C code. And we can take that and
synthesize it into VHDL in an automated fashion. Altogether we get about 56 or
57,000 lines of code, so in terms of code density we get a reduction of about
10 to 1 or 11 to 1.
Not only that, in this work we mapped it to a Xilinx FPGA. These are the number
of slices used. All together it's less than half of what the smallest Virtex-2
device had at that time. And it also runs at the frames per second required.
So the design can also be very competitive. That's one example. Later on I'm
also going to show you some designs that are compared to manual designs.
So this is the logic we use, and this is the embedded memory we use.
>>: How (inaudible).
>> Jason Cong: I'm sorry?
>>: How are you shutting off components?
>> Jason Cong: So in this one we are not doing that. On the FPGA you cannot do
that, but with a customized ASIC design, right, we can do that. Today power is
a big concern, so when you go down to 45 nanometer, 65 nanometer, they provide
you with standard cell libraries but they also provide you so-called MTCMOS
cells, where you have the capability to have your power rail connected to the
real power rail through a sleep transistor. So when you are not using this part
of the logic, I can just cut it off, I can just turn off this sleep transistor.
So this small portion of the circuit will be off. Okay.
>>: (Inaudible).
>> Jason Cong: Yeah, so that's the way to solve the leakage. Because you can do
a lot of things to help with dynamic power; for leakage, the only way is to
turn it off when you are not using it. Right. But you can see, I mean, I can do
this at a very fine grain level. Say there are three multipliers in a chunk,
and let's say I'm now doing motion compensation, I can turn them off. This is
very scary for a human designer to do; they just cannot think at that level of
detail, right. So this kind of thing is perfect for automation.
So the advantages of doing this are quite clear. As I said, first, if you write
and model at the higher level there is typically a 100X to 1,000X speedup in
simulation time. I showed you that you can get better code density, and also,
more importantly, you can do very rapid exploration. I have some examples to
show.
And finally there are two kinds of exploration you can do. One is that you can
explore the boundary between hardware and software, as in that MPEG-4 design I
showed you. Maybe not all the modules are equally important; I can implement
some of them on the embedded processor and some of them in customized logic.
If you start with a uniform C-like specification you can experiment with that.
The second thing is that even for the same behavior model you can try different
latencies or frequencies. With RTL you just say, okay, can I implement this
thing at 250 megahertz, or 500 megahertz; you set a target, there's a team of
people working on it for several months, and you get a design.
But with this one, we can say, okay, what happens if you go with 500 megahertz,
what happens if you go with 400 megahertz, right? Each one of them will give
you a different area, and more importantly a different latency, a different
throughput and also different power.
So actually I believe we can get better results than human designers, for
multiple reasons. One is the power issue I told you about: it's very difficult
for designers to work at that fine grain level, deciding what is on and what is
off. And also, you have to make a lot of decisions when you generate RTL code.
Some designers of course are very experienced. They say, okay, I know the area
trade-off. But a lot of them are just starting; they have limited experience in
IP design. You have two computations, and you have to decide whether you share
an adder or not, or share a multiplier or not.
If you share it, what does it mean? You have to have a multiplexer right in
front of it, and then you decide at what time which data goes through. But you
have to remember: is the area of the multiplexer bigger, or the area of the
adder bigger? If you ask that question, even if you have done a lot of RTL
designs you may not know, because it depends on the platform you have, whether
you go to a Xilinx FPGA or you go to TSMC 65 nanometer. So we actually model
all this platform information so we can make those decisions when we generate
the RTL code. Whether you share or not share, you get different RTL code.
So for all these things, we believe once we automate them we can actually give
you better results. We are getting there. So let me give you some examples of
the types of optimizations we have.
One very important step is when you go from a behavior level model to a
cycle-accurate RTL model. In the behavior model there's no concept of cycles;
it's just the computation that needs to be done. In the RTL model you have to
say what computation needs to be carried out every clock cycle.
An analogy would be: Bill Gates may want to reach a hundred billion dollars
next year; that's a very high-level goal. The COO and the presidents are going
to refine that target quarter by quarter, maybe even month by month and week by
week, right, so that's what we call scheduling. Someone will probably also do
the allocation, saying division by division which one should achieve which
target, then department by department, so on, so forth. That we call resource
allocation.
So scheduling is an important step: given a computation graph, how do I figure
out what computation is carried out in what clock cycle? The reason it's
difficult is that first, you have very different applications: data intensive,
compute intensive, memory intensive. And also there are a lot of constraints.
For example, you have resource constraints; I only have (inaudible). Not only
that, once you go to an FPGA there are different types of resource constraints:
I have a number of logic slices, right, I also have built-in DSPs, I have
built-in memories. All of those will be constraints.
There are latency constraints, frequency constraints, relative timing
constraints. And then if you want to do better optimization you also have to be
careful: some operations can be done together in one clock cycle; we call that
operation chaining. Some operations actually have to spread out over multiple
clock cycles; those are called multicycle operations. And some of them have to
be handled with pipelining, right, so that even before the first operation is
finished I pump in another one.
Of course if you are an experienced designer, you can possibly do all of this,
right, but still there's a limit: when you get down to half a million lines of
code, imagine how much optimization you can do for every one of them? So this
is where we see ideal opportunities for optimization.
However, we looked at the literature very carefully. Really the problem is
very hard. There are basically two kinds of approaches. One is that people have
shown the (inaudible). So all they can do is these very simple-minded
heuristics. The most common one you probably have noticed is called list
scheduling. It's not only used for hardware synthesis, it's also used for
multiprocessor scheduling. You somehow order the operations or tasks in a
linear order, and with some heuristic weighting functions or priority functions
you decide what to execute next and you decide what resource it goes to.
And so it's efficient, but it's greedy. The other way is that they try to
capture all the constraints, so it becomes a linear programming formulation. It
turns out you can make some reasonable approximations so that almost all the
constraints are linear. The challenging part, however, is the variables: they
are integer variables, because you cannot say this operation happens at step
2.7; you have to say either clock 3 or clock 4, right, so they have to be
integers.
So that's where you have the difficulty. We actually made very significant
progress a couple of years ago. We came up with a new type of formulation using
a so-called system of difference constraints. Let me illustrate at a very high
level. Let's say you have an adder and a multiplier; the adder is faster, two
nanoseconds, the multiplier five nanoseconds. I have a target clock cycle of 10
nanoseconds, this is 100 megahertz, a slow clock on an FPGA. And let's say,
just for illustration purposes, the resources are only one multiplier and one
adder.
First you have the so-called data dependency constraints. It's not too
difficult to write them down. You have X3 minus X1 greater than or equal to
zero, basically saying X3 cannot execute before X1, right? I can also have X3
minus X2 greater than or equal to zero.
So for the dependency constraints you can write those down. Then you can look
at the frequency constraints. Okay, if I have a multiplication, another
multiplication and an addition (subtraction is the same thing), can I have all
three of them in the same clock cycle? No, because if you add five nanoseconds,
five nanoseconds and two, you exceed the target. So now I can say X5 minus X2
has to be at least 1, because they have to differ by at least one clock cycle,
right?
So that's the type of frequency constraint I have. And you also have resource
constraints, because I only have one multiplier. In that case, these two cannot
be in the same clock cycle, so I can say X3 minus X2 has to be at least 1,
because they have to be separated by at least one clock cycle.
So if you write out all these constraints you get a matrix and a variable
vector. These are what we call the scheduling steps, the control steps of those
operations. And it's basically a linear type of constraint. Where the
difficulty comes from is that these have to be integer variables. So in
general, if you write it in this way it is very difficult to solve.
But you probably noticed that when I generated the constraints I was very
careful that all my constraints have the form Xi minus Xj greater than or equal
to some constant.
If you do that kind of modeling, you can show this matrix has a nice property
called total unimodularity. Once you have total unimodularity, you can
guarantee that if you just solve the system using standard linear programming
techniques, you get integer solutions. So that's the one approach we take. And
once we have this basic framework done, the question becomes how do we
approximate all the constraints by this linear system of difference
constraints. It turns out we can do that very efficiently.
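A rough illustration of why the difference-constraint form matters: because
every constraint involves exactly two variables, the constraint matrix is
totally unimodular, and such a system can even be solved by a simple
Bellman-Ford-style relaxation over the constraint graph, which provably
produces an integer schedule. A minimal Python sketch (variable names and the
constraint set are hypothetical, loosely following the adder/multiplier
example above):

```python
# Solve a system of difference constraints x_i - x_j >= c.
# Repeated relaxation yields the ASAP (earliest) integer schedule.

def sdc_schedule(variables, constraints):
    """constraints: list of (i, j, c) meaning x[i] - x[j] >= c.
    Returns {var: cycle} or None if the constraints are infeasible."""
    step = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for i, j, c in constraints:
            if step[i] < step[j] + c:     # constraint violated: push x_i later
                step[i] = step[j] + c
                changed = True
        if not changed:
            return step
    return None  # still changing after n passes: constraints are infeasible

# Dependence: x3 after x1 and x2; frequency: x5 at least one cycle
# after x2; resource (one multiplier): x3 at least one cycle after x2.
cons = [("x3", "x1", 0), ("x3", "x2", 0),
        ("x5", "x2", 1), ("x3", "x2", 1)]
print(sdc_schedule(["x1", "x2", "x3", "x5"], cons))
# → {'x1': 0, 'x2': 0, 'x3': 1, 'x5': 1}
```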
Let me give you another example of the kind of optimization we put in. After
you schedule, you have data from operations in earlier clock cycles feeding
into operations in later clock cycles. And once you cross a clock boundary,
what do you need to do? You have to store the data, right, before it
disappears. So you actually introduce a few registers. This is a very sort of
(inaudible) problem called register binding. You have to figure out: I have all
these variables, these are their lifetime intervals, how many registers do I
need, and which variable goes to what register?
We do that in computer architecture design as well. Except that here, how you
assign the variables affects not only the number of registers, it also affects
the amount of interconnect, how the wires are connected, because I have to
introduce multiplexers when I have two registers feeding the same functional
unit.
So this is actually the best possible binding you can have: I have one
execution unit, I have two registers and two multiplexers, one with three
inputs and the other with two. But don't be deceived by this picture. Each wire
here is not one bit, right; it could be 64-bit data coming in, so those
multiplexers are huge. Actually we find that 30, 40 percent of the chip area
can be consumed just by all this steering logic.
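The register binding step described above, packing variables' lifetime
intervals into as few registers as possible, is classically solved with the
left-edge algorithm, which is optimal for interval conflicts. A small
illustrative Python sketch (the interval data is made up for the example, and
this ignores the interconnect cost the talk mentions):

```python
# Left-edge register binding: sort variables by the left edge (birth cycle)
# of their lifetime, then reuse the first register whose last value is dead.

def left_edge_binding(intervals):
    """intervals: {var: (birth_cycle, death_cycle)}.
    Returns {var: register_index} using the minimum number of registers."""
    binding = {}
    order = sorted(intervals, key=lambda v: intervals[v][0])
    reg_free_at = []              # per register: cycle after which it is free
    for v in order:
        birth, death = intervals[v]
        for r, free in enumerate(reg_free_at):
            if free <= birth:     # register's previous value is dead: reuse
                binding[v] = r
                reg_free_at[r] = death
                break
        else:                     # no register is free: allocate a new one
            binding[v] = len(reg_free_at)
            reg_free_at.append(death)
    return binding
```

So two variables whose lifetimes do not overlap end up sharing one physical
register, while overlapping ones are forced apart.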
But it turns out that if I actually have a register file, then my structure is
very simple: I can have an execution unit with two read ports and one write
port. Okay, so that's the model I have. And is it easy or difficult to get
register files, let's say in FPGA designs in our case? It's actually not too
difficult, because for those of you who have dealt with FPGAs, they have this
concept called block RAM, right, distributed throughout the chip.
Let me show you. And, yeah, I used to have a figure; I probably deleted it. In
most cases they have tens to hundreds of embedded memory blocks. Those
embedded memory blocks are ideal for implementing the register files.
Now here you have an interesting question. Can I divide my entire design into a
number of islands? Within each island I have a local register file, right, I
have some data routing logic, and the rest will be the computational units.
That's the architecture I want to create. Of course you could put everything
into one register file, but most likely that will not be very good, because you
have a limited number of read-write ports, so your concurrency is very poor.
So what you want to do is partition your computation into a number of islands,
and here you can actually do this.
So that was another algorithm we developed, to be able to exploit the
computation and communication locality to make efficient use of these embedded
memories on chip. We can actually get a very significant improvement if we do
this using the distributed register file approach versus the conventional
discrete register approach, right? We can reduce the area by almost 50 percent,
because you make use of embedded memories.
And not only that, when you reduce that much logic, the clock is also somewhat
improved as well. And there's also the question of how you generate these
register file islands, right? There are different approaches. We also
experimented with more efficient algorithms versus less efficient algorithms,
and we can see a 12, 13 percent difference there.
So this is another example of the type of optimization you can do. Finally,
I'll give you yet another example about how we do the communication between
different modules. Your design is not going to be just one module; your design
is going to have multiple modules. They communicate. How do they communicate?
The common ways are FIFOs, buses and shared memories. Shared memory is actually
kind of a different animal, but FIFOs and buses share a common property. What's
the commonality between them? I call them sequential communication media,
because the order you write into the bus is exactly the same order you read out
of it, right, and with FIFOs as well.
So that's actually one property we exploit. It turns out that once you have
different modules, typically today, by the way, when a (inaudible) does a
design or Intel does a design, they will divide it into a number of teams. Each
team will be responsible for one block. So each block itself is optimized, but
the communication between the blocks can actually be poor.
It turns out that how you order the data you generate from this module matters,
because it's going to feed into the next block through FIFOs, right? You
generate the custom logic block for one, you generate the custom logic block
for another; in a lot of cases people use a divide-and-conquer approach and
design each one independently. But it turns out that the ordering of your
generated data is actually crucial for the guys after you, because different
orderings can enable different types of implementation and also different
latency of the communication.
The idea is that they will have some dependencies on the data; if you can
generate the data they need first, they can get going, and you can actually
follow up with the rest of the data later on.
So that's one type of optimization we can have. What we did is, you can
actually prove there's a way you can take the control data flow graphs of all
the modules together, and then you can model those communication channels as
part of your scheduled resources. Then you actually have a much more global
picture.
So we can say that by doing this we get behavior and communication
co-optimization, and you actually end up with a much better result. In this
case we have a number of DSP-like applications. This is the traditional
approach, where you design every module separately and then hook them up
using FIFOs, versus the case where we do the behavior synthesis and the
communication synthesis simultaneously. You can see in some cases we get a
very good reduction, 30 percent, 40 percent.
And a big part of that is how you handle arrays. When you have array data, do
you just produce the data row by row or column by column, or are there clever
schemes that produce the data in some specialized order? That's the kind of
ordering we explore. When you pass that to the downstream implementation, you
basically have a different way to encode your matrix, right? But you don't want
to pass very unstructured data to the next module. So after you reorder the
matrix, you still want to compress it again; we also have a compression
technique showing that given this unstructured data, I can reorder it and still
pack it into a set of smaller arrays or some specially structured array. If you
don't do any optimization you can end up with lots of unstructured data, and
once you do that optimization it's a lot more reasonable.
So that's another type of optimization we put into the system. Okay.
Altogether we have been doing this research for six, seven years, and the
results look very promising. Some of my students graduated two and a half years
ago and were exploring what to do. One decision they made was to actually
license the tool from UCLA for commercialization. So there's a startup company
not too far from UCLA called AutoESL, and I'm also an advisor to the company,
doing the commercialization of this technology.
One thing where they have done a much better job is that they have very robust
language coverage now: complete support of C, C++, and SystemC. They also have
much better support for all the platforms. At the university we tend to focus
more on the algorithmic part, the core optimization engines.
They also came up with a simulation and verification flow where, if you have,
let's say, test programs and test data for the C models, you can reuse them to
test your RTL code. That actually saves very significant effort.
And the whole tool -- you can write the C code any other way, but the default
is that they provide an Eclipse-based framework where you can do the editing,
organize your designs, and then see the output.
So that was one design they have. Here is, for example, another design, where
they were taking part of a motion-compensation block to an ASIC implementation.
This shows the power of doing architecture exploration, right? You can set the
target frequency to one gigahertz, down to half a gigahertz, a quarter
gigahertz, and try different ones. You can see you get very different numbers
of cycles -- basically a totally different RTL implementation. The cell count
differs by more than a factor of two, the area also differs quite
significantly, and of course the critical path, the frequency, and the total
latency are all different, because the latency is a combination of the
frequency and the overall number of cycles.
So that kind of exploration is, in our opinion, also very important. It
actually affects your design quality very significantly. Yes?
>>: (Inaudible) how can you imagine with 900 megahertz? (Inaudible).
>> Jason Cong: How do you -- so this is in picoseconds, so this is basically --
>>: One nanosecond (inaudible).
>> Jason Cong: Yes. So basically that's the resulting frequency, right, 900, so
I think it's pretty close to a gigahertz. Not exactly.
>>: But you (inaudible)? Isn't the constraint broken?
>> Jason Cong: Yes, the constraint is broken. So here is a kind of classical
problem, right? You tell the design team to work very hard on timing closure by
giving them RTL, but there's no guarantee that RTL can be made to meet the
constraint, right? So this case basically shows you: you probably wanted to go
as fast as possible in frequency, and the downstream tool cannot achieve that;
that's one data point. Then you can say, okay, now I can back off, and there's
a different implementation.
So I think that's (inaudible) something we look at as a result. And by the way,
it's also not the case that the highest-frequency design will be the most
efficient. First, it can burn a lot of power, right; and also, if you look at
the product of the clock period and the cycle count, that is what actually
determines the performance.
What I want to show is that by having a single behavior model you can do this
type of exploration very easily, in a matter of hours.
They also have the flow to take a SystemC specification; in this case it's a
kind of data-security application. Very similar kind of reduction: you start
from a couple thousand lines of SystemC code, you generate RTL code of 23,000
lines, and at the TSMC 90-nanometer process this was about a 70K gate count.
I mentioned there's a simulation and verification flow where you can reuse the
behavioral test vectors and test bench to test your RTL code. One of the things
I want to mention a little bit is that having a tool like this has actually
enabled very interesting applications. Of course the primary use when we
developed this technology was for hardware design, but it turns out that once
we have this capability we can also use it for computation very efficiently.
Let me see -- I want to switch to another presentation real quickly and say a
few words about that, because I promised to show you one slide about the
quality of the results of the tool.
So this is one fairly recent data point. There's a large defense contractor we
work with. They wanted to implement 3DES on an FPGA, and they actually had a
manual reference design, basically coming from an IP vendor, as a reference
point. So they had a C-level model and wanted to see how that actually impacts
the design.
So it turned out that this is their manual implementation -- an unpipelined
implementation. Then they gave the behavior model to the tool, and the tool
also generated a very competitive design. You can see that in terms of
frequency and throughput they are the same; in terms of lookup-table area, the
logic is actually less; and the number of flip-flops is slightly higher. So
this is basically very comparable.
This tells you the tool can match an actual refined manual RTL design. But
what's more interesting is that with the same tool you can start experimenting.
This version is unpipelined. You can say, okay, I want to pipeline it: I'm
going to feed in data every 18 clock cycles. I believe the whole latency, if
you don't do anything, is actually 56 clock cycles. You keep that latency, but
you feed in data every 18 clock cycles; now your design is larger in area --
you can see that the logic is about the same but the number of flip-flops
increases, because these are the pipeline stages, right? Now the throughput
goes from 200 to 600. Then you can say, I want to put in data every 6 clock
cycles, and now you get something with 10X the throughput, and of course with
more area and more logic.
So first, the tool can match the manual designs in a very competitive way; but
moreover, it also gives you a lot of alternative solutions. I think that is
what it enables. Yes.
>>: What is the input? You said that it was basically a C program? How
modified was it before feeding it into this process? I'm assuming it wasn't
just an unmodified software implementation.
>> Jason Cong: There are two things. One is that in terms of language
coverage, it's actually very good. I'll show you one slide. One application of
this tool is at an investment bank, which actually surprised us.
So there's a group of people at the investment bank who wanted to implement
customized circuits to do stock-option calculation, financial engineering, all
of those things. Okay. Imagine that's probably the (inaudible) of that
financial system meltdown.
(Laughter) They were trying to be too smart.
So they actually took a totally different kind of code, written for a very
different purpose, and were able to put it through the system. We actually got
a very strong recommendation and endorsement from them. So in terms of language
coverage it is very good.
But it doesn't mean that every piece of program, once you can pass it through,
will give you a good result. You also have to think about the concurrency and
some of the structures. For example, I will also show you some UCLA experience
we have using the tool for (inaudible) computing: how you structure the code is
important.
Now, in terms of language coverage it is very good. It also gives you options:
for example, for this pipelining you can actually add directives in the code --
this loop I want to pipeline, and here's the initiation interval. If you don't
do anything, the tool will pick one, right? But you can also give your own
input. So it will still be C code. The good thing is that the C code you feed
into AutoPilot can also be parsed by GCC, right? But it gives you some
directives you can pass to the behavior synthesis.
>>: Was this financial institution using your software or the startup's?
>> Jason Cong: When it comes to real applications, it particularly goes through
the startup. This one is the same thing, because in terms of the language
coverage and all of that, they provide much better support. Yes?
>>: (Inaudible) pretty much (inaudible) for C, nothing is of the right size.
Would it be better if you had something more sensible, like (inaudible)
addressing, rather than funny boxes with six wires in and four wires out?
>> Jason Cong: So I did show a slide with AES; we also tried it with another
one. In that particular application they chose SystemC. This is the difference
between C and SystemC: SystemC is intended more as a C++ extension for hardware
designers, right? It gives you control over arbitrary precision, bit-level
manipulation, and some of the synchronization. So the reason is that when you
want more control, you probably don't want the vanilla C language; you want
something that gives you more power to do that. Yes?
>>: (Inaudible) because someone is looking at (inaudible) validations
themselves. How (inaudible) application (inaudible)?
>> Jason Cong: You can use a traditional profiler to do the profiling; that's
very easy. And not only that -- typically it's an iterative process, right? You
come up with some model, you go through the tool, you get to the FPGA, you do
some simulation or even just run at speed, and you can say, okay, this module
is too slow, and then come back and fix it. So it's more of an iterative
process. The tool will actually give you information from simulation about the
latency of every module.
Yeah, these are some of the marketing slides; I will skip them. They basically
talk about the unique strengths of the tool. I think there are just two parts I
will highlight. One is the language coverage, because we take a very different
approach. A lot of these EDA companies start by writing their own customized
compilers. I think that's actually the wrong approach: you want to leverage all
the research in the compiler community and build things on top of that.
The second thing is obviously the range of optimizations we are doing: the
scheduling, the resource binding, how we optimize communication. This
fine-grain power optimization I still think is a very important part of what
differentiates it from a lot of other tools.
So I want to say a little bit about one thing this enables: (inaudible)
computing. You have seen these slides many times. Clearly power is the barrier.
The industry's current approach is to go through parallelization: I stop the
frequency scaling and instead have more cores at a lower frequency. And the
question is how we go on from there. You can view these as very generic systems
with multiple cores, or you can actually have a large system with many of these
compute units, in a cabinet, in a cluster, whatever.
What we are saying at UCLA is, yes, that's one way to go, but we think there's
a lot more opportunity to gain power efficiency by customization. So we think
that could be the next trend after parallelization.
The reason I think that way is that I believe we actually have sufficient
computing power for most applications, if you narrow down to the users. I mean,
this presentation, MSN Messenger, email -- you really don't need very high
frequency, very high performance for those. However, if you narrow down to
every individual customer or every single enterprise, they do have one or two,
a handful of computation-intensive applications that need a lot of compute.
So the natural thing to think is whether we have a way to speed up those
applications for that particular user or particular enterprise, instead of just
raising the level of general-purpose computing.
So the computing-hungry class can be accelerated, for example, using dedicated
hardware. If you look at some of the latest announcements, they already include
encryption engines, right -- customized hardware for that. You can have
customized processors; we actually did research at UCLA on so-called ASIPs,
application-specific instruction-set processors. And you can use GPUs. The GPU
is not very power friendly, but it can give you a lot of performance. We in
particular are looking at the possibility of FPGAs, because they are
programmable, right -- they give you total customization of the implementation.
So it actually holds a lot of promise. If you look at the literature, there's a
conference dedicated to this area called FCCM -- what does it stand for? I
forget the exact acronym, but it's basically field-programmable custom
computing machines, or something like that.
But there has long been talk of barriers. I actually directed an effort even 12
years ago at UCLA, but I gave it up very quickly. First, at that time the FPGA
sat on the peripheral bus of a (inaudible) workstation. Going through the S-bus
to the processor and getting the data back took forever, so any gain you got
from the FPGA was useless, right? The second thing is, if you look at all these
papers with very impressive speedups on, say, numerical computation
(inaudible), all of them were written by low-paid graduate students using VHDL.
Real programmers will not program in Verilog; you have to actually get to C and
C++.
I think both barriers are disappearing, in a way. On the communication side
there's a lot of exciting chip-to-chip communication. For example, there's the
HyperTransport bus that AMD enabled, which allows you to go from a processor to
co-processors very efficiently, and Intel also opened up their front-side bus,
allowing very high-speed communication. Even PCI Express actually has quite
decent bandwidth for connecting processors with a co-processor. So that barrier
is becoming very small.
The other one -- for example, this is the unit we have at UCLA, and I will show
you some applications we've developed on it. You have the CPU -- this is an
Opteron -- with its own DDR memory. The board was originally intended for
dual-Opteron systems, so you can replace one of the Opterons with an FPGA, and
they communicate over the HyperTransport bus at very high speed.
The second thing is that with the kind of compilation technology I talked
about, you can really get to a hardware implementation very efficiently. So
this is one of the famous algorithms, the Black-Scholes algorithm for
stock-option pricing. When we were working with this financial institution, at
the beginning they were skeptical. They said, we will not give you any
proprietary code -- why don't you just go to this website, download this piece
of code, and tell me the result.
It was obviously not written for synthesis. But it turned out the only thing we
needed to rewrite was the power function -- and by now even that rewrite is
unnecessary, because the tool supports x to the p-th power directly. You can
take the code to the FPGA with really no modification and get an implementation
on a Xilinx FPGA. This thing runs at only about a hundred megahertz, because
the clock here is 10 nanoseconds. It's basically a customized machine going
through 400-some clock cycles, and every 12 clock cycles you can pump in new
data.
So we get about a 30X speedup over the software implementation on the AMD
machine. Not only that -- look at the power implications. The FPGA running at
100 megahertz consumes about six watts of power, while running it on the AMD
processor dissipates about 68 watts, so there's roughly a 10X difference there.
If you really want to bridge the performance gap by having ten Opteron
workstations, ten CPUs, you are talking about a difference of almost 100X in
power. So that's where we see the potential of what we call customized
computing. If this is a client's or an enterprise's primary interest, maybe we
don't have to build a very powerful general-purpose computer for it; we can
actually accelerate a lot of applications with co-processors. Yes?
>>: Does that FPGA do (inaudible) IEEE (inaudible)?
>> Jason Cong: Yes. I can show you what we did. In fact, again, this was done
at the AutoESL startup. What they did was add full support for IEEE 754
floating point. This is even better than the GPU, because it supports both
single precision and double precision.
And what we also added is bit-accurate fixed-point arithmetic. This is not only
a specification at the input: the tool will automatically propagate all the
bit-width information through the intermediate variables, so you can generate
very efficient hardware.
And the good thing is that for the investment bank this is music to their ears,
because they really don't want to learn anything about VHDL or Verilog; they
have no interest in anything about the FPGA, and the tool actually supports
C++. So he gave a very glowing review. By the way, there are a number of
companies claiming they can do C to RTL, but you should ask a few detailed
questions. For example, more than half of the companies will say they do not
support structs. That's a showstopper right away. A lot of them cannot support
arrays of structures or structures of arrays. Now, there are some fundamental
limitations you cannot get around.
For example, you cannot support dynamic memory allocation, because you cannot
generate transistors on the fly, right? And you cannot support unbounded
recursion, because everything is a finite resource. Those are considered
fundamental limitations; you have to rewrite the code somewhat. But all the
others are really artificial constraints that a lot of the tools impose, and we
don't have to face those problems.
Now, at UCLA I don't know too much about financial applications; when I saw
this I thought it was very interesting. But here is a particular application I
am familiar with, from the hardware design area. It turns out that one very
time-consuming task today is lithography simulation. We use a 193-nanometer
laser to print 45-nanometer transistors, right? It's a miracle, because you are
using a very wide brush to paint very fine features. But it works through a lot
of numerical techniques. In order to do that, you have to solve the so-called
Hopkins equations, and that can take hours or days.
So I asked a student to do an implementation. The research started in the early
part of last year, before the tool was available, and he spent about four
months doing RTL coding and got the system working. The good part is that by
last summer the tool was available, and he was able to go from C to RTL with it
in about a week or two. This is the core part of the loop he had to deal with.
Just from this project we also gained a lot of experience, because there was a
question asked earlier about what kind of C you can take. His initial C code
passed through the tool without any problem and produced RTL, but it did not
get very good performance. It turned out you have several loops, right: you
loop over the different layout rectangles; you also loop over the different
kernels -- this is the kernel decomposition from optical theory; and finally
you loop over the different pixels, right?
This is a very natural way to write it mathematically, but it's not a very good
way to implement in hardware. It's also not a good way to implement on
multicore CPUs, because it does not consider data locality, right? What you
want is for the kernels to be reused as much as possible, because the number of
kernels just affects your precision. You want one kernel to do all the
computation, then come back and do another kernel. That's actually easy to do:
you change the loop ordering so the kernel moves to the outermost iteration,
and within that you do some analysis -- do I process every rectangle and, for
each rectangle, all its pixels, or do I take each pixel and consider all the
rectangles, right?
That kind of code rewriting is still the human, creative part. So it's not that
once you have the tool you can write any C code, get an implementation, and no
longer need designers. We still need good designers who can think analytically,
even strategically, about how to make use of the tool to get the best result.
So he did that, and eventually we got about a 15X speedup, again over an
Opteron workstation, consuming about 15 watts compared to 86 watts. So you get
about 6X times 15X -- about 100X in power-performance efficiency. It's a fairly
successful example. Some of these data points really make you think there's a
lot to be gained from this concept of domain-specific computing, right? You can
actually create customized computers, customized machines, with much higher
performance but a much lower power profile, for a specialized application or a
specialized application domain.
So I think that's the material I wanted to cover; I'll be happy to use the rest
of the time for some discussion. Very quickly, I just want to show that there's
an ongoing effort at UCLA on processor-based design optimization. First, we can
create so-called application-specific instruction-set processors. The idea is
that with the FPGA, or some kind of programmable logic, we can generate
customized instructions. These customized instructions can implement a sequence
of operations that you execute very commonly, right? We can identify those and
create pattern libraries, and finally generate the new program you need. So
what we have is pattern generation and pattern selection -- deciding which set
of patterns should be mapped to specialized instructions -- and finally
generating the new program that uses them.
Not only that, we are looking at the possibility of synthesizing an entire
design, using this kind of methodology, into basically an application-specific
processor network. The difference here is that you can definitely customize the
hardware logic using xPilot or AutoPilot, but a lot of the other parts can just
be specialized processors: processors with different data widths, different
instructions, different register file and cache sizes. My view is that if you
look at design complexity, you can easily put in 1,000 to 2,000 reasonably
sized processors. Remember that the Core Duo is a billion-transistor design. I
remember when I entered graduate school the 386 came out -- that was a big
achievement, a one-million-transistor design. So you can put a thousand 386s
there without any problem.
So my view is to let the processors be the future standard cells, the building
blocks; the question is whether you can synthesize your design into them very
efficiently. We actually made very nice progress just recently: we can
synthesize an application into a homogeneous collection, a homogeneous set of
processors. Here is one illustration. These are the task graphs you have. In
the naive way you have one processor for each task, with a 10-nanosecond stage
delay and five stages; that's one implementation. But these two tasks can
actually share one processor, because I have 10 nanoseconds per stage, so it's
okay to put the two of them on one processor. In that case I get a four-stage
pipeline, right? And finally we can come up with an even more clever
implementation that gets three stages while still satisfying the 10-nanosecond
stage delay.
So the idea is that if you think of processors as your building blocks, you
have to think about how to map computation onto a network of processors very
efficiently. That's the ongoing research we are doing at UCLA. This paper was
published in FPGA '07, and we're looking at extensions in this direction.
This is another example, where we start looking at heterogeneous systems: part
of the design can run on MicroBlaze, the Xilinx embedded processor, and part of
it can go into customized hardware. You can see a significant performance gain
from this trade-off. But I think in today's designs what is more important is
the power gain: typically, when you go from a processor-based implementation to
a customized hardware implementation, you can gain power efficiency by
somewhere between 100X and 1,000X. That's actually what will make a difference.
So the conclusion I want to share with everyone is that I think behavior
synthesis technology today is mature enough to handle a lot of applications.
The improvements come from several areas: platform-based modeling, looking at
communication and interconnect optimization simultaneously, and a lot of the
advanced algorithms we developed that improve the quality of results. In the
end we can actually get RTL code very comparable to human designers', and
hopefully better. As I said, in terms of power and routability -- some of these
very late-stage metrics -- the RTL designer has a hard time dealing with them.
We are also looking beyond this, at synthesis at the system level, where
hopefully we can make use of both processors and customized hardware to get a
much more efficient implementation.
So that concludes my talk.
>> Darko Kirovski: Questions?
>>: So the last thing you mentioned, about the processors and the nice graph
showing tasks going from one processor to another -- can you characterize the
kinds of applications that you are basically trying to get into this paradigm?
>> Jason Cong: Yeah. I guess you are referring to this example. This could be a
task graph for a streaming type of application, like video encoding -- MPEG-4;
I think motion JPEG was the application we used. You can typically decompose
your design into a number of modules. It's stream processing, because data
keeps coming in, and it will be implemented in a pipelined fashion, so
typically you have a stage delay that determines the throughput, right? That's
given as a constraint.
Then the question is: given that throughput, how many processors do I need, and
what's the minimum number of processors? What's the minimum stage latency, how
many stages? I was just showing you a naive implementation: it needs, what,
seven processors and five stages. If you are a bit more clever, you need five
processors and four stages, and eventually you can do it in four processors,
right, three stages. So that's the trade-off we're looking at.
>>: Design tools are not exactly great in that respect, like routing clocks
(inaudible), so your numbers are taking the tools as they are, correct? So the
actual (inaudible) you'll be getting could be a lot better, right?
>> Jason Cong: Sure.
>>: Do you have an estimate on that?
>> Jason Cong: If we -- so the FPGA, to be honest, is not very efficient for
power in general; that was not their focus. But actually we did a lot of work
at UCLA just on FPGA architecture design; we published a sequence of workshop
papers. For example, you can introduce the concept of multi-Vdd FPGAs, right,
and you can also have regional clock gating, power gating, all of those. So I
believe the vendors are following that very carefully. We actually have very
good relationships with both companies; they have funded UCLA research for
about 16 years now, since the early 1990s when they were very small. So they
are taking a lot of it up.
But in the current commercial implementations, none of them have that
capability. So really the best you can do is optimize the switching activity,
right, and also improve data and memory locality. So I absolutely agree with
you; there's room for improvement.
Clock gating you can implement on an FPGA fairly easily, because they do have
flip-flops with enable signals. That's the same concept as when I say
fine-grain optimization: if you cannot shut off the power, at least you can
shut off the clock, right, and that will help the dynamic power. That's
actually fairly easy to do; I believe the tool does that already -- it can
generate the clock-gating signals.
>>: Have you tried it and seen one of those results?
>> Jason Cong: Yeah. I don't have the slides, but they have slides. I think it
was quite dramatic -- a 20, 30 percent reduction in power.
>>: I have a couple questions here. So when (inaudible).
>> Jason Cong: When you go to the FPGA to debug, we do not provide specific
tools. But there's a very nice tool called ChipScope -- I don't know how many
of you have used it. It can capture basically all the runtime information, and
you can read it out and actually look at the signal values.
>>: (Inaudible) I assume that you have some tools that (inaudible), right?
>> Jason Cong: Yes. So the --
>>: (Inaudible).
>> Jason Cong: (Inaudible) yes. In general it's not an easy problem, because,
remember, I kept saying that we want to leverage all these advanced compilers,
which is very good for quality of results, but it also messes up the
correspondence, because they do a lot of transformations. What we are thinking
is that we can give you an option: you compile with, let's say, a -g kind of
option that does less optimization and allows you to see a better
correspondence. Then once you have it correct, you can go back and do more
aggressive optimization.
>>: One more question -- this is more of a vision question. There's a lot of
speculation (inaudible) thinking there will be a hundred thousand CPUs on a
single chip, et cetera. There was a paper from a couple of years ago that
looked at the (inaudible) that you're going to have on the surface of the chip,
and the conclusion, from what I recall, was that if all the CPUs are under
stress, the thermal power and the general assumptions are such that you stop
getting any gains from the multicore approach beyond -- I think the result was
around (inaudible) -- mainly because of the leakage, right. So what do you see
there, what is your vision? Do you think it's going to be that easy to get
around the leakage? I know that's one of the major problems right now.
>> Jason Cong: Yeah. It's not easy to do that. In fact -- where was I? I was
just thinking that that's where the customization will be very useful, right?
And the type of customization we think about is that each processor can be
customized in terms of bit width, (inaudible), clock frequency, supply voltage.
You can have all these cores, and they don't have to run at the same clock rate,
they don't have to run at the same clock frequency, right? And also the
instruction set and the degree of reliability. So there are a lot of parameters.
And what you also want to do is maybe very fine-grained power control, because
even when you are doing the computation, maybe not every unit will be active
every cycle. So you can do a lot more, and that can probably extend your --
although there will always be some limitations from the power and thermal
constraints.
>>: So do you see like a thousand core chips (inaudible).
>> Jason Cong: As I said, in terms of complexity there's no difficulty in
putting a thousand cores on a chip, right. The bottlenecks, in fact, are, one,
the communication, I'd say, and the other one the power and thermal behavior. In
fact, some of the work I mentioned earlier about RF interconnect was also to
address the communication bottleneck.
>>: (Inaudible).
>> Jason Cong: (Inaudible) very power efficient, because we're transmitting
waves instead of charging and discharging the RCs.
>> Darko Kirovski: Any other questions? Yes.
>>: (Inaudible) how do they figure out that the (inaudible)? How do they
monitor that it's doing what it's supposed to do, (inaudible)?
>> Jason Cong: So how do you know the FPGA is working -- they actually have
test patterns you can use. Maybe at power-up time you can do the same thing as
when you test memories: you can test all the logic and interconnect quickly,
and then you can start loading your application onto it.
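[Editor's note: a hedged sketch of that power-up idea, added for illustration; it is my own example, and real FPGA self-test operates on configuration resources and routing, not on a C array. The analogy to memory test that the speaker draws can be modeled as a simple march pattern run over the resources before the application is loaded.]

```c
#include <stdint.h>
#include <stddef.h>

/* Simple march-style test over an array standing in for the
   device's testable cells: write all 0s, verify 0s while writing
   all 1s, then verify 1s. Returns 1 if every cell passed. */
int march_test(uint8_t *cells, size_t n) {
    for (size_t i = 0; i < n; i++)        /* pass 1: write 0s   */
        cells[i] = 0x00;
    for (size_t i = 0; i < n; i++) {      /* pass 2: read 0s,   */
        if (cells[i] != 0x00)             /*         write 1s   */
            return 0;
        cells[i] = 0xFF;
    }
    for (size_t i = 0; i < n; i++)        /* pass 3: read 1s    */
        if (cells[i] != 0xFF)
            return 0;
    return 1;  /* all cells passed; safe to load the application */
}
```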
>>: (Inaudible).
>> Jason Cong: I think you can get -- as a research topic we even looked at the
possibility of having some small logic that just keeps testing your entire
chip, one part at a time. I do not know whether they provide that or not, but
that's some of the ideas we have. Yes?
>>: I would just point out the first (inaudible) exactly that; they did not
have an (inaudible), they had a test (inaudible) pretty unreliable.
>> Jason Cong: It's sort of tricky. Basically we're thinking that you have to
be able to migrate some logic, right? You move the logic off this part so you
can test it here, and then you can test that part. We wrote some papers on this
topic, but I'm not sure it's really in any real system yet. In fact some of the
discussion was (inaudible).
>>: (Inaudible).
>> Jason Cong: I'm sorry?
>>: The 3D IC design -- are there actually chips coming out (inaudible)?
>> Jason Cong: We did a couple of tape-outs. For the first one we had limited
access to the foundry; the foundry we could use is (inaudible) lab, also funded
by the (inaudible). At UCLA we did one chip using an interesting concept called
capacitive coupling, so you actually don't need to have a hole that --
>>: (Inaudible).
>> Jason Cong: (Inaudible). Actually there was an ISSCC paper last year; it was
very high frequency, in the gigahertz range. So it was actually a successful
tape-out. We did a second tape-out where we have two microprocessors and a
cache. We got the chip back, but it had some I/O problems in the design, so we
haven't been able to (inaudible) that chip yet. But what I'm saying is that you
can certainly do that today.
By the way, a key part of a (inaudible) design is using thermal (inaudible),
because it can get very hot, right. We did a calculation: if you don't do
anything, it is very difficult to take the heat off. So first you can do
optimization on which active layer should be close and which one should be
further away. But not only that -- it turns out this (inaudible) process is an
SOI process, which is not thermal friendly. So we have to insert these vias.
The vias are the connections between the different layers, and they're very
good for thermal conduction. When you don't have enough of those, we actually
put in thermal vias ourselves. That's part of the --
>>: (Inaudible). Enormous amount of capacitance (inaudible)?
>> Jason Cong: So we just put them in; actually they are doing nothing
electrically.
>>: Oh, I see.
>> Jason Cong: So that's what the thermal-via optimization will do. There are
some interesting studies we can show. You can do it either during placement or
after placement.
>>: (Inaudible).
>> Jason Cong: (Inaudible) it's not that bad. We actually have papers, we have
data to show you that it's still reasonable and the amount is manageable.
>> Darko Kirovski: Any more questions?