>> Allesandro Forin: Good morning, everybody. Thank you for coming. And for
those watching here or later, thank you for your interest. Today we are talking
about Jason's intern project, which was a lot of fun, had some ups and downs,
like he will tell you. And the general idea was to see what could we do when
FPGA and Kinect are together. Jason.
>> Jason Oberg: Thanks, Allesandro. Thanks for coming everybody. As the
title suggests, we implemented the Kinect body recognition algorithm on an
FPGA. My prior research, which I've been doing for the last three years or so, is
looking at security, looking at how we design hardware for security and hardware
information flow tracking. This summer is very, very different.
I have done some past high performance computing with FPGAs; our group has
done a face detection algorithm on an FPGA, as well as a GPU, and recently
working with some graduate students on some high level synthesis. Now that
you know that I'm an actual person and I have come from somewhere.
But to give you a brief overview of my talk, I'm going to go through kind of the
motivation of why we wanted to explore this, go through the goals that we had,
or have, for my summer internship, what I hoped to get out of it and what
actually came out of it.
Then I'll go through some issues with the Kinect and USB and FPGAs, and then the meat of the
talk, the body recognition algorithm: I'll talk about the algorithm, give an overview, go
through the architecture that we designed to implement it, and then go through
our results, and then kind of at the end discuss some future work that this could
have.
So the motivation is we're going very, very low power. Things are getting more
embedded. There's very high demand for low power hardware acceleration,
especially since reconfigurable hardware helps because you can redesign: if your
specifications change, you can kind of redesign the hardware for your specific application
to be accelerated at low power.
And another motivation is Kinect has been a huge success. Everybody -- it's
blown up, the SDKs, it's really gotten a lot of good feedback. Also the FPGA
community is kind of lacking in example applications.
There's not -- they haven't really seen the push that we've seen from the software
guys using the Kinect. We haven't really seen that in hardware. So we're kind
of -- we want to kind of promote that and get that going.
And so they're very much lacking interfacing solutions and also just using the
Kinect in general. This in turn -- if we get this hardware community excited, this will
promote the use of the Kinect in an embedded environment and open the doors
for a lot of different hardware accelerated applications.
And so our end goal, which was rather an ambitious goal, was to build kind of the
Kinect system in hardware. So this is kind of the system that you typically see at
a high level, you have your Kinect connected to the XBox, which is connected
over USB.
The infrared sensor on the Kinect projects out a 3-D infrared image and receives
it back and is able to construct a depth image that says how far things are from
the sensor. So that gets passed to some sort of motion estimation on the XBox
which will say this particular region is likely to belong to this player. So it
associates player information with the depth image.
That gets passed into this kind of body recognition algorithm, which operates on
these annotated depth images, and returns a result of where different body parts
are likely to be. So in that kind of example you can see the head is -- there's a lot
of pixels that have a distribution around the head. This is very likely to be in this
part of the body.
That kind of does some post-processing which will lay out where your skeleton is
likely to be and then that kind of gets sent to the display.
So what we want to do is like I said build this entire thing, change what the XBox
does and do it in hardware and see what happens. So first thing we wanted to
do was interface the Kinect with the FPGAs via USB. And that was step one.
We want to be able to capture data from the Kinect.
The second thing that we wanted to do was once we have data to do this
skeletonization or the body part recognition algorithm and possibly some other
things. So once we get the data, see what we can do with it.
And then lastly we wanted to do kind of any sort of post-processing we could do,
either skeletonization or if we were going to do something else we could do
gesture tracking or something like that, we wanted to see what we could do, kind
of at the back end before displaying.
So this is a slide I had kind of at the very beginning when I was like this is what
we want to do. We wanted to, our primary goal was let's get this interfacing
solution to work. Let's give the FPGA guys something to be able to -- get the
Kinect running and stimulate some interest.
Then we had this addition, if we have time we'll get these algorithms going. And
so we kind of -- I had these three main stages where basically I had to write a
USB driver for a USB microcontroller that was on one of these FPGA
development boards by Xilinx, and then I had to write Verilog to read from that
microcontroller to actually capture the data once it was buffered, and we were
going to write some image processing stuff.
Run the algorithm. It turns out that this was kind of a complete disaster. I'll go
through why. So our primary goal completely shifted. I mean, we set up like
we're going to make this interface nice for everybody. But this all just turned
horribly bad. And so everything kind of switched. I mean, the first six weeks were
spent on this, everything just kind of changed, and we went toward this additional
goal, which was to build some sort of algorithm on the FPGA, and it turned out
we got a hold of the body part recognition code and started working with that.
So what happened with the USB? If you ever -- my advice to you if you ever
have to do anything with USBs on one of these boards, just avoid it. So it's
nasty. So we wanted to have this interface built, and our goal is to release it.
Everybody would have been happy.
There's no documentation. So we wanted to make a nice step-by-step procedure
so everybody could follow it. But it turns out that this chip provided on the
boards, there's not enough space for anything. I mean, the code alone barely
fits. You can't do any sort of buffering.
Actually, I had to optimize the code and hand remove things because at first
actually the code itself didn't fit. So you have no room for any sort of data. It's all
just instructions.
The chip doesn't even do USB 2.0. It actually does 1.0 plus. It does 1.0 with
some little extra benefits. But it doesn't -- it won't really work for anything that
has high bandwidth demands, like the Kinect. But I did get the mouse
and keyboard to work which was awesome. So I have that documented. So
hopefully we can send that to people and people will be happy and be like thanks
for not making me spend hours and hours to try to get the Kinect to work
because it's not really possible.
But it doesn't solve my problems. So we want to know how to interface with the
Kinect. We want to have some sort of way of doing this.
And so we started thinking about other options. And this whole time I was trying
to write an embedded driver for the Kinect to try to figure out how to do that. We
thought why not use what the software guys did. They have a nice driver that
they've released that works really well.
So the idea was to reuse that on the host PC and send everything to
the board over either ethernet or PCI express. So what we chose to use was this
Simple Interface for Reconfigurable Computing, SIRC, which was a debugging
tool developed by Ken and Allesandro and the rest of the embedded group. So
actually this platform turned out to be really good to use, and it was very easy.
So the high level overview of how we kind of use SIRC. So we have -- the setup
now is we have the Kinect connected to the host PC which has a host side SIRC
software running on it, which can buffer input data. So this transfer uses the
driver from the SDK so we don't have to deal with that.
Then data is sent over to the hardware over ethernet or PCI express. I use
ethernet because that's what I use, but you can use either. So you can see that
it captures the depth image. Can send it over to the ethernet to the FPGA. The
FPGA can operate on the depth image. Generate some results. Send it back,
do some lightweight post processing and then you can display it.
So we have this whole interface now built. We don't have to worry about USB with
this; SIRC and everything is really easy to use and it's already well documented.
And so it alone is kind of a good interface for FPGAs with the Kinect. So before I
get into the detail of that -- the decision forest algorithm, the body recognition rectangle
there -- it's probably good to give you guys an overview of how this
algorithm operates.
But beforehand, I'll just go through some example applications where this
algorithm is used. We specifically were kind of using it for this human
pose estimation, body part recognition. But it's been used for key point
recognition, for augmented reality. You can reference an object to a particular
other object and say these are the points that these match at. So algorithms
used for that.
It also can be used for object segmentation. You can say this is water, this is a
boat you can segment objects, the algorithm's been used for that and also for
something like organ detection, where you can say this is the lung
and this is the heart, things like that.
This algorithm we used, we used it for body part recognition, but it has a wide
array of things that you can do. So if you were to retrain for a different
application, you could still run this thing on the FPGA. It would just be detecting
your organs instead of your hand.
So kind of the basics of how this algorithm works. This is a nice example that we
obtained from Toby Sharp. You can imagine you have -- you're outside. You
step outside and that's, say, the world state. It enters your decision tree and it says
is it raining? If it's raining, then it's very likely to be wet outside. If it's not raining,
then you check -- if the sprinklers are on. If the sprinklers are on it's still likely to
be wet outside.
So this process is exactly how -- the algorithm just basically takes an object
input, it makes decisions along this binary tree, and at the bottom the leaves
represent the probability that that object belongs to some class.
So it just basically classifies the object with some probability. And just to go
through the pseudo code in general: you can imagine you have some V,
which is some object, and then a particular node. If this node has
an actual branch, if it's not a leaf, you want to evaluate that node's decision function
on your input object.
The last example it was, you know, is it raining outside, is the sprinkler on. You
compare it against a threshold for that node. Then if it's greater you go to the
right. If it's not, then you go to the left child. And once you hit the leaf nodes you
return that probability. This is the way it works at a high level.
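To make that pseudo code concrete, here is a minimal C sketch of the generic tree walk just described. The node layout, the feature function, and the tiny example tree are illustrative assumptions for this writeup, not the structures used in the actual implementation.

    /* Minimal sketch of the decision-tree walk described above.        */
    /* The node layout and feature function are illustrative only.      */
    #include <stdio.h>

    typedef struct Node {
        int   is_leaf;
        float threshold;          /* compared against the feature value */
        float leaf_probability;   /* valid only when is_leaf is set     */
        int   left, right;        /* indices of the child nodes         */
    } Node;

    /* Evaluate the node's decision function on the input object; here  */
    /* the "object" is just a number so the example stays small.        */
    static float feature(const Node *n, float v) { (void)n; return v; }

    float classify(const Node *tree, int root, float v) {
        int i = root;
        while (!tree[i].is_leaf) {
            /* Greater than the threshold: go right, otherwise go left. */
            i = (feature(&tree[i], v) > tree[i].threshold) ? tree[i].right
                                                           : tree[i].left;
        }
        return tree[i].leaf_probability;  /* probability for this class */
    }

    int main(void) {
        /* Tiny two-level "is it wet outside" style tree. */
        Node tree[] = {
            {0, 0.5f, 0.0f, 1, 2},   /* root: value > 0.5 ? right : left */
            {1, 0.0f, 0.1f, 0, 0},   /* left leaf: unlikely wet          */
            {1, 0.0f, 0.9f, 0, 0},   /* right leaf: likely wet           */
        };
        printf("%.2f\n", classify(tree, 0, 0.8f));   /* prints 0.90 */
        return 0;
    }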
So what we did is we're dealing with depth images. And with pixels. And so we
used the same exact algorithm, but our object that we want to classify is a
particular pixel. We want to see if it's part of a specific body part. So the pixel
enters the decision tree at the root. And then at each node you evaluate the
node's function on that pixel to determine whether to go left or right.
The computation ends up being -- the function's basically a difference between
neighboring pixels. It grabs another pixel in the image, compares
them, and makes a decision based on that.
Once you hit a leaf, it basically determines the likelihood that that pixel belongs to
that particular body part. So it might be this pixel is in your hand with 90 percent
probability.
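As a rough illustration of that per-node test, here is a simplified C sketch: compare the depth at the current pixel against the depth at an offset neighbor stored in the node, and branch on a threshold. The frame size, offsets, bounds handling, and any depth normalization used by the real trained trees are assumptions made only for this example.

    #include <stdint.h>

    #define WIDTH  320               /* placeholder frame size */
    #define HEIGHT 240

    typedef struct {
        int   du, dv;                /* pixel offset stored at this tree node */
        float threshold;
    } NodeFeature;

    static uint16_t depth_at(const uint16_t *depth, int x, int y) {
        if (x < 0 || x >= WIDTH || y < 0 || y >= HEIGHT)
            return 0xFFFF;           /* treat out-of-bounds as "far away" */
        return depth[y * WIDTH + x];
    }

    /* Returns 1 to go to the right child, 0 to go to the left child. */
    int evaluate_node(const NodeFeature *f, const uint16_t *depth, int x, int y) {
        int d0 = depth_at(depth, x, y);
        int d1 = depth_at(depth, x + f->du, y + f->dv);
        /* difference between the pixel and its offset neighbor */
        return (float)(d1 - d0) > f->threshold;
    }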
>>: Is it the return -- returned the probability that it's in all the body parts, right?
So you could see like probability that it's -- it's the probability it's your shoulder,
it's your elbow, so you get --
>> Jason Oberg: That's why I have multiple things here. So, yeah, they're
different weighted.
>>: So at each you can't get a probability vector for all the body parts.
>> Jason Oberg: Technically, yeah, but the actual implementation here it's
segmented so this might be hand, head, feet, and this one might not be head in
it. But in general I think --
>>: But you get more than one probability.
>> Jason Oberg: I think there's five.
>>: I think there's another one where you have probability of wet. This doesn't
give you a single like hand .5.
>> Jason Oberg: No, it gives you a weight. It varies. Each -- at least for the
database we're using, each leaf doesn't have all of them. But they're all
contained at some leaf. So I think it's in chunks of five and there's 31 total.
>>: Multiple trees.
>> Jason Oberg: And then there's multiple trees. You run this -- our thing has
three trees and you run it three times.
So this is the whole process. So we wanted to basically -- we wanted to build
that hardware. That's the whole algorithm. So how do we do that? We, first we
need to have somewhere to store the binary tree. We store the binary tree in off
chip DDR. Currently we modeled it because we didn't have time to physically
deploy it on the physical chip. But that's where the database would be. I think
it's roughly 24 megabytes. So it doesn't fit in the resources on an FPGA. So the
main points we have in our architecture here are: we're using the host machine with
kind of the diagram I had before, where you're using the Kinect USB
driver on the host side to fetch frames and then send them over ethernet to the
hardware. And so we have a FIFO load kind of section which initializes this FIFO
with the -- basically with all the active pixels. I mentioned before that the
depth frame comes in and it gets annotated with this player information.
Only those pixels that have a valid player in them are the ones you're interested in.
So those are the ones that actually -- this FIFO load logic checks and only loads
those into our sorting FIFO.
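A small C sketch of that FIFO-load step, assuming each pixel carries a player index where zero means "no player"; the frame size and entry layout here are placeholders, not the exact hardware formats.

    #include <stdint.h>

    #define NUM_PIXELS (320 * 240)   /* placeholder frame size */

    typedef struct {
        uint32_t pixel_index;        /* position of the pixel in the frame        */
        uint32_t node_index;         /* current tree node for this pixel (root=0) */
    } FifoEntry;

    /* Scan the player annotation and load only the active pixels.
       Returns how many entries went into the sorting FIFO.          */
    int fifo_load(const uint8_t *player, FifoEntry *fifo) {
        int count = 0;
        for (int i = 0; i < NUM_PIXELS; i++) {
            if (player[i] != 0) {             /* pixel belongs to some player */
                fifo[count].pixel_index = (uint32_t)i;
                fifo[count].node_index  = 0;  /* everyone starts at the root  */
                count++;
            }
        }
        return count;
    }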
So the secret sauce in this is this FIFO sort, which keeps, as you're making a
binary decision left or right, it keeps the ordering of the nodes in this FIFO
completely sorted. What that allows you to do is basically stream
data. So every time you pop a pixel, it's always going to be in order in terms of
node. So when you fetch a node from the database, you fetch its evaluation
function threshold, et cetera. That's all sequentially accessed. You don't
randomly bounce around in memory.
Then kind of the last block is this compute, which is the evaluation function that it
had in the pseudo code, where you evaluate the function. You pop the pixel,
evaluate the function and determine whether or not you're going to go to the left
or right child.
So how does this thing work? This is the cool thing that Allesandro and Ken
came up with randomly and they said this is perfect. It's pretty cool. The idea is
basically to, whenever you go to a right child, it's a binary tree so the right
children always have a higher node index than the left children.
So as you're popping pixels, if a child is going to go to the right you push it at the
back of the FIFO. So you just push it back into the queue. If it's going to go
left, you need to actually write it somewhere in the middle and rewrite what was
there back at the end. The idea is when you go left, it's going to be less than
had you gone right, so it needs to go somewhere in the middle. High level that's
how it works. I'll go through a specific example.
So imagine we have this very simple decision tree here. We have our FIFO. It's
actually a ring buffer. It wraps around. We have some pointer to head here. It's
going to first -- imagine it first pops the -- I guess it's yellowish, the yellow-brown,
yellow-red and black pixel and it pushes them to the right. So the tail was here.
It pushed two, two, two, because they're all in the second node now. And then
imagine now that the gray pixel, once it's popped by the head, it actually goes
left.
So what needs to happen if you want to stay sorted: since we read all zeros, the
head should pop a one next; you don't want to pop a two because then you're no
longer in sorted order.
So the idea here is that you pop the one -- or sorry, you pop the zero, you compute,
you evaluate that you need to go to one. You write that to where the hole was
and you shift the order.
By running this you're able to maintain a completely sorted FIFO. So as you see
when the head continues to access here, it will always access the nodes in sorted
order.
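To pin down that mechanism, here is a small host-style C model of the sorting ring buffer; the real design is in Verilog, and the capacity, entry layout, and missing full/empty checks here are simplifications for illustration. Node indices use the usual numbering where node i has children 2i+1 (left) and 2i+2 (right), so right children always carry the larger index.

    #include <stdint.h>

    #define CAP 4096                     /* ring capacity, power of two (placeholder) */

    typedef struct { uint32_t pixel, node; } Entry;

    typedef struct {
        Entry    buf[CAP];
        uint32_t head, mid, tail;        /* advance forever, used modulo CAP          */
        uint32_t max_node;               /* largest node index currently in the FIFO  */
    } SortFifo;                          /* start from an all-zero SortFifo           */

    static void push(SortFifo *f, Entry e) {
        if (e.node >= f->max_node) {
            /* Right child (or same value as the current max): append at the tail. */
            if (e.node > f->max_node) {
                f->max_node = e.node;
                f->mid = f->tail;        /* the block of largest entries starts here */
            }
            f->buf[f->tail % CAP] = e;
            f->tail++;
        } else {
            /* Left child, smaller than the current max: take the slot at mid,  */
            /* move the entry that was there to the tail, and advance mid.      */
            f->buf[f->tail % CAP] = f->buf[f->mid % CAP];
            f->buf[f->mid % CAP]  = e;
            f->tail++;
            f->mid++;
        }
    }

    static Entry pop(SortFifo *f) {
        return f->buf[f->head++ % CAP];  /* comes out in nondecreasing node order */
    }

Because popping always yields nondecreasing node indices, the node database can be fetched as a forward stream instead of with random accesses.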
So in doing so, you can take advantage of sequential memory accesses. So it's
really -- with DDR, if you're bouncing around in memory, you're going to get killed
by I think it's roughly ten times overhead in your latency. But if you can stream it
you get really, really high bandwidth. So by keeping the FIFO sorted you're able
to access the nodes in your database in a stream-like fashion.
This actually allows for really effective pre-fetching which we haven't added.
These dotted lines as I mentioned are not built, so we don't have a cache or
pre-fetch logic, but those are options you could have. Since the FIFO
is sorted you can certainly look into the FIFO, see what you're going to need, and
pre-fetch it, and you know it's going to be used for a long amount of time
because it's sorted.
You're not going to evict something and have to put it back. It's going to be
there, when it's gone, it's gone. Because your FIFO is completely sorted.
So to give you guys an overview of what sort of area we're dealing with with this
implementation: we synthesized it to a Xilinx Virtex 6, a larger FPGA. I think it's --
it's a pretty beefy device. But if you look at it, the logic is not completely
overbearing. If you look at the actual registers used, it's zero. And the actual
combinational logic, it's pretty small.
The biggest thing is the memory elements, which makes sense. We have a big
FIFO. We have these input/output buffers, and that tends to be the determining
factor. We actually cut this down by half yesterday, because of a last-minute
change. So that was cool. This actually used to be a lot larger.
>>: Would you happen to know how big those blocks are, how much RAM total?
>> Jason Oberg: What is the size of the block? It's eight, 18 K bits. 18 kilobits.
>>: That's a half. 36. If you take the whole thing.
>>: It's 172 times 18.
>> Jason Oberg: Kilobits.
>>: Most of these input/output buffer is the image 1912 times two. And the
output, which is not necessarily required, I mean it's by showing things is 250 K, I
think or 31 times 1912 times 31. We cut it down by --
>> Jason Oberg: Right. The biggest thing is the input/output buffers, the input
image frame. I should mention this. We have a downsampled version in
software, we haven't done it in hardware. It would be cut down by a factor of
four. The input buffer.
>>: The input buffer, which is the skeleton which is 72. [inaudible].
>> Jason Oberg: Right. So if you were able to do the post processing on a
FPGA, you would send like nothing. I mean, just 72 coordinates.
>>: Alternatively if you're going try to push the skeletonization into the Kinect by
putting FPGA there and only -- you could stream the output. So you wouldn't
need the buffer.
>> Jason Oberg: Yeah.
>>: Because if you figure 172 minus the 64 from the input buffer, 77 for the
output buffer, the 172 is much, much smaller.
>> Jason Oberg: So, I mean, even with this particular implementation, then it's a
V-6, but it fits on smaller devices, I think this will fit on a Virtex 5 as well. Which
is a step down from that.
>>: Here, the storage of the decision tree -- where would that go?
>> Jason Oberg: It's in off-chip DDR. It's actually in here. So everything here --
this is the results I'm displaying. So the database is 24 megabytes, roughly, at
least for this particular training.
So that's all stored off chip.
>>: That would require also the controller and the logic for that which is fairly
beefy, I think it's 10 percent it would be five -- and fooled around with that too.
>>: That's a good point.
>>: But that would be a big chunk of it.
>> Jason Oberg: You need that extra thing, and then that would definitely --
>>: Bigger than anything else. As it stands.
>> Jason Oberg: I mean, we have plenty -- at least for in terms of logic you can
see there's -- we're not doing too much. It's mostly just these really small
computations and we're just moving a lot of data around.
>>: Do you know -- I'm not familiar with the [inaudible] FPGA. Do you know
[inaudible].
>> Jason Oberg: I think the rule of thumb for a LUT [phonetic] is -- is it eight
gates?
>>: Conversion from LUTs to gates is at best very gray.
>> Jason Oberg: Well --
>>: The answer here is two percent utilization of the LX 240-AD is nothing. The
number of gates is basically nothing.
>>: I guess my question is more geared to if you take that nothing and you put a
thousand of them in there, is it still nothing? Or ->>: Yes. You do want to replicate it.
>>: Is it 12 gates, is it 10 gates, is it [inaudible].
>>: If you replicate, you still pay a lot of rounds to replicate, this block RAM only
allows two [inaudible] accesses, you wouldn't want a fragment there.
>>: You would start FIFO 2 which means you don't require any more, you just
need -- you don't need more copies you just need more --
>>: For FIFO, yes. But for input -- you mentioned the output buffers those would
have to be --
>>: It's more a case --
>>: What I would be mapping it to is FPGA.
>> Jason Oberg: I don't think there's a hard --
>>: Gates not much.
>>: It's very, very rough estimate. The whole device is probably somewhere
between 125,000 to a quarter million gates equivalent. So figure two percent of
that. But that is at best to be taken as an estimate, with a grain of salt.
>> Jason Oberg: It varies by device and then the -- depends on what you're
putting in the LUTs. They can be used as memory elements and they can be used
as logic elements.
>>: These are protected --
>> Jason Oberg: I should mention that, too, that's good -- this is with SIRC.
>>: The whole thing.
>> Jason Oberg: In the future, that's actually debugging. There's the ethernet
controller and buffer, you have to buffer the frame at some point but that's with all
the extra SIRC logic that wouldn't be there if you were to --
I had it right here. So this is basically SIRC. This is the input buffer, so you can
see the logic for SIRC. So --
>>: PM is actually --
>> Jason Oberg: This is the actual algorithm. This is the logic for that whole
algorithm. And then this input/output buffer, the image input buffer and output
buffer for buffering the output results.
Okay. So to kind of move on to our performance estimates. So we anticipate
pretty big speedups over the software version. If we look at this as -- we weren't
able to actually physically run it on the hardware because we don't have a DDR
controller yet, and things like that. But for our worst case estimate, as you can
imagine, we have the number of possible pixels, which is 19 K, times the height
of the tree, which for our particular database is 20. So each
pixel is going to go down from root to leaf, and that happens for all 19 K pixels,
and you do that three times, once for each tree.
So we estimate roughly about a million cycles. And if you clock this at 100
megahertz you're looking at roughly 87 frames per second.
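As a quick back-of-the-envelope check of those numbers in C, assuming roughly 19,200 active pixels, a tree depth of 20, three trees, one node evaluation per clock cycle, and a 100 MHz clock -- these are the figures quoted above, not measured results:

    #include <stdio.h>

    int main(void) {
        const double pixels     = 19200.0;   /* worst case: every pixel active  */
        const double tree_depth = 20.0;      /* root-to-leaf hops per tree      */
        const double trees      = 3.0;
        const double clock_hz   = 100e6;     /* assumed 100 MHz clock           */

        double cycles = pixels * tree_depth * trees;   /* ~1.15 million cycles  */
        double fps    = clock_hz / cycles;             /* ~87 frames per second */

        printf("cycles/frame = %.0f, fps = %.1f\n", cycles, fps);
        return 0;
    }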
I should mention, too, that I've never seen the entire frame be active pixels.
Because as I mentioned you only process the pixels that are part of the player.
So I think the most I've seen is just around 5,000.
So this essentially goes up by a factor of four. On the average case. This is
absolutely worst case. And then also this is likely to be clocked faster; 100
megahertz is kind of on the slow end for one of these devices. The DDR, once
it's added, will skew this estimate slightly because you could potentially have
stalls and things like that.
But keeping everything sorted, if we're able to stream quickly and effectively, it
should -- this shouldn't be too huge of a bottleneck.
And I should mention this is unparallelized hardware. So as I showed you just
have things streaming. There's no replication of resources. We've down
sampled the image in software, we just haven't built that in Verilog yet. You
could replicate the FIFOs, replicate the input buffers, and you could do a lot of
parallelism and speed this up substantially.
And so a little estimate. Software sequential. This is worst case that I ran on my
desktop, it's about ten frames per second with our software version, and then the
estimate for the hardware is roughly --
>>: I've got a question. For the software version, how optimized was it? Was it
the code you showed us before at the very beginning when you just described
how the algorithm appears to work?
>> Jason Oberg: It wasn't really optimized much at all. So it's not parallel. It's
sequential for one.
>>: It's not just --
>> Jason Oberg: Sorry?
>>: [inaudible].
>> Jason Oberg: There is. That's the Toby Sharp.
>>: The sequential means not even the one that runs the actual, it's parallel.
>> Jason Oberg: Yes.
>>: I have that. But for that you don't get the worse case, the average is about
three milliseconds.
>> Jason Oberg: Absolute worst case. So...
>>: The machine that you're running your software, sequential on, getting it
[inaudible] per second, what sort of -- what sort of CPU --
>> Jason Oberg: Xeon, quad core, but we were running --
>>: Single threaded?
>> Jason Oberg: No, no, it's single threaded. I think it's three gigahertz.
2.8, maybe. Close.
>>: Even that one, 87 frames per second, what will that [inaudible] on the
average case?
>> Jason Oberg: So like maybe I should.
>>: You can't really measure it worst case.
>> Jason Oberg: Because it's average. You're running it while you're looking at
it.
>>: We haven't test it with two players, rather before. That's what we use. You
can't use it -- worst case would be [inaudible].
>>: So the delta between your software sequential and your [inaudible] is that for
the [inaudible].
>>: Yes. The same image. In fact, the software is running --
>>: Nine X speedup. FPGA.
>> Jason Oberg: Like I say for FPGA that's the absolute, for the workload
that's --
>>: It's the best case scale, better or -- do you expect best case to be nine times
better?
>>: That's the formula. So the less pixel you have linearly, the better.
>>: Same here, this is if you had every pixel was [inaudible] into it you get the
whole energy. So it doesn't have --
>> Jason Oberg: It doesn't actually recognize you. Won't even pick you up if
you're that close.
>>: 87 frames per second if you're processing, every pixel.
>> Jason Oberg: It's normally 4,000. I haven't seen it bigger than five. This is
roughly four times bigger.
>>: 4,000 and for software sequential, what is it normally, not worse case?
Definition.
>>: They're both speed up the same.
>>: That's part of the image.
>> Jason Oberg: Exactly.
>>: Just depends how much wood is going into the wood chipper?
>>: Yes.
>>: For software, like CPU and GPU implementations, we found the majority of
the time was spent on the cache misses on the [inaudible] to, say, DDR, and
how much is that mitigated by this?
>> Jason Oberg: Well, I mean our expectation is because we're sorting it we're
not going to have any sort of -- so once you get towards the bottom, closer to the
leaves, things are more scattered. So you will have to do kind of more random
accesses, but we expect that since everything's nice and sorted we should be able
to stream it pretty well.
>>: [inaudible].
>> Jason Oberg: And we'll have prefetching. Since it's sorted we can prefetch
and have things cached from DDR ahead of time. So it's hard ->>: It happens a little over 13 [phonetic] without fetching, you can cache
everything. Beyond that you want to look at it and see which block am I going to
hit. And then you imagine what we're going to get because DDR will get you two
nodes per clock. So how many nodes are you going to fetch. And how many
programs.
The good news is that each one of them can be very small. So once you keep
down the ramifications somewhere else, the big [inaudible] prefetching. You can
look ahead.
>>: Sorry. So is it basically like for each pixel go down to a certain node and
then if it is supposed to go down in new cache does it go for a different pixel?
>> Jason Oberg: Talking about FIFO or cache?
>>: Yes.
>> Jason Oberg: So, actually, the idea is that it keeps -- it keeps that hole
pointer always pointing to the front of the biggest, the current biggest element. So
if you go left, it's going to be smaller than that element. And so you put it at that
location and you rewrite that entry to the tail, and then you increment kind of your middle
pointer, and you keep operating like that. And it maintains the sorting of it.
>>: Is that not clear?
>>: I'll go back.
>> Jason Oberg: It may or may not be obvious but it works.
>>: Think of a tree -- well known of trees. You have index. They have the
media. If we can sort the indexes that we're going to have access to, then we're
going to be sequential. Start memory also.
>>: How can you sort it when like every pixel going down.
>>: Because it's a binary tree. It's not a general sort. It's two levels. So what
he showed at the beginning is what they are going to be using --
>> Jason Oberg: So if you go --
>>: Does that imply you have to get the whole frame in before you can start?
Because you don't know whether you're going to be inserting something else that
you would have to consume?
>>: Yes and no. Meaning, you can't start processing until you have some data,
some pixel, high pixel data. You'll be scouring everything you have for pixels.
>>: But if you start before you've finished, then there's the danger that a new
incoming pixel will be one that if it had been sorted fully, it would have access
to --
>>: We weren't talking about the database now -- very good -- you're
talking about the images. The images -- it depends on the database, because
the database says once you process this, go look at that pixel farther away. So if you
know how far away the database is ever going to go, you can tell your thing to
start processing once you have half an image.
So we look into that. Conceivably you could do something like that. But that's it
and this --
>>: Right now you pull the whole frame in and you process --
>>: And it's not a big deal.
>>: So because that's basically the trick that you come up with. You can be
sequential. And we'll have goals. Right? Because start having a million nodes.
Only access 5,000 of those.
>>: Per pixel the tree is being resorted?
>>: No we sort the pixel.
>> Jason Oberg: We sort what node the pixel is on based on that.
>>: The computation of the pixel, this decision is where we got. You can only go
left or right for every one of these pixels. So we can keep this, and when you put
it back into the FIFO, we do this sort there.
>> Jason Oberg: So you have compute -- that's left to right and you put it --
>>: Can you learn how the process of pixel --
>>: Yes, it gets sorted. It's insertion sort?
>> Jason Oberg: Yeah. Yeah, the order you're processing them in.
>>: If you look at it as that schedule so you have 1912, whatever many already
computed so you can't ->>: When you process you go back to your next ->>: Won't come back until 5,000 nodes.
>> Jason Oberg: So last couple of slides here. So as I mentioned kind of briefly,
some future things we can do with this is there's no caching, there's no
prefetching. If we add these, we should be able to kind of -- there's a concern
about DDR, but even so, we have it sorted so we should be able to stream from
DDR. And prefetching will hopefully hide any potential problems we could have
from the delay of any random accesses that happen once you get lower in
the tree.
Also a downsampled database. Now like I said we have this working in our
kind of software version, which is a model of what we've planned to build. And
so by downsampling the database and downsampling the input image we can
have a smaller input buffer. We can essentially replicate, because we'll have the
space, and replicate the FIFOs, so we can greatly enhance the
parallelism here and get even more speedups than we already have or than we
already expect.
And so, kind of to conclude, this started off with just USB -- plug this thing in
and let's figure it out, see if we can get it to work. I was discouraged for about
five or six weeks. It turned out to work pretty well.
I learned a lot. There was a lot of hacking. A lot of things breaking and a lot of
waveforms. I want to thank Allesandro, Ken. I don't see Neil here. Neil and
Toby Sharp provided us with a lot of the software code that we were able to
adapt, which probably wouldn't have happened if we didn't have that. So I think I
can do a demo really quick.
>>: Hopefully it works.
>> Jason Oberg: Hopefully it will work.
>>: Not running on FPGA.
>> Jason Oberg: It's all Verilog. It's synthesized, the results are there. And it's
fully ->>: It is using SIRC so the software will send packets.
>> Jason Oberg: Right. The model that we have, that picture where it actually
sends to the FPGA, it's still happening here. We're simulating SIRC so we have
the driver grab the frame, it sends it over SIRC to our simulated hardware, which
we'll process and send it back through SIRC and so it's all happening on the
machine. But all the components are there.
>>: Deep in the back of this. So we had to resend the packets.
>> Jason Oberg: Let me make sure I don't mess anything up here. So like I said,
this is essentially the host machine. We have the Kinect capture and the
software side SIRC running here.
You can run this. Get rid of this. Each of these -- it's kind of congested because
of the resolution. But see if I can get this thing to do something. So it got me
walking over here.
It's hard to -- let me see if I can move the -- it's hard to see all of it. I wish there
was another way.
>>: Want to try that?
>> Jason Oberg: Hmm? So that was me. That was my body. There's my head.
This is something else moving on the -- I don't know what it's picking up. It's
probably this thing waving back and forth. But I'll just bring this up so you can
see. So it goes through each tree. Loading FIFO. It seems like it's still picking
up that thing moving otherwise it would say no active pixels.
But it reads the depth image. Loads the active pixels in the FIFO, starts
computing on each tree, once it's done it sends back to SIRC.
>>: Nowadays nothing is --
>> Jason Oberg: Yeah.
>>: [inaudible].
>> Jason Oberg: So once it picks you up, got him. Nice. It's taking a while.
>>: 25 seconds.
>> Jason Oberg: I think it got you when you were moving in. So I'll -- yeah.
Totally different body parts. And so like I said, this is all -- every component that I
had in that diagram except for that -- it's all simulated because we don't have a
physical DDR but all the logic, the FIFO, everything, SIRC all is written and
working.
And I think that's it.
>>: It's showing --
>> Jason Oberg: I'm sorry. It's 31 body parts. It keeps picking up on something
moving over here. It's weird. But it basically says how likely -- it says it gives you
the distribution of where each pixel is likely to be. So this top one I think is your
head. And so you can see it all. It does like head, torso, arm, neck. And it will
kind of give you those distributions and show you where --
>>: For each of the 31 parts, where is it more likely trying to go?
>> Jason Oberg: Exactly. So I think you can see. I don't know what this side
thing is. On the top one, the top one might be my head. I think it's actually
picking up on you.
>>: The bottom --
>> Jason Oberg: Is it the screen itself moving.
>>: Like the bottom row there the second one from the left and the third one from
the left, it's clearly the left leg/right leg.
>>: And somebody --
>>: Think about the size of those from that perspective, it's got to be the screen.
>> Jason Oberg: Because it does everything, hands, feet.
>>: I've never seen that transmit --
>> Jason Oberg: So you can see my arm.
>>: Yes.
>> Jason Oberg: Up there and the head. It thinks my hands are my head. So
some probability. Arm, torso. I think that is actually you sitting there.
>>: It's reversed left to right?
>> Jason Oberg: Sorry?
>>: It's reversed left to right from the perspective of the camera?
>> Jason Oberg: I'm not sure.
>>: View left to right.
>> Jason Oberg: Right.
>>: It seems like you can process almost 90 frames per second and maybe
essentially four times that, 360 frames per second, something like that. So the
frame rate, what do you need to get it from a temporal standpoint, 30 --
>>: The [inaudible].
>>: About 30. So I guess if you can do around 60 frames per second, my assumption is
there's sort of a way to, like -- you overkill, right? You do 30, what can you do to
sort of like make that larger or smaller and use --
>>: The big deal is --
>>: Yeah.
>> Jason Oberg: That and you can do other stuff. So you have more time to do
more things, fancier things.
>>: That would be one -- that you could reduce the [inaudible] considerably, which is
going to save power.
>>: Yeah. It's running 100 megahertz.
>> Jason Oberg: That's what we're estimating we're running at. We haven't --
>>: At anything like 90 frames per second, you can start each frame from the previous
frame, because people can't move that fast.
>> Jason Oberg: You could probably do that. I think that sounds -- that's a
little -- I mean I don't know enough about image processing and vision to do that.
But I think that would be definitely a possibility.
>>: [inaudible].
>> Jason Oberg: I know you can't move quick enough to get away from it. If you
could somehow reuse part of it.
>>: Definitely, going to higher frames per second, if we can, reduces the perceived
latency quite significantly. So that's definitely one thing we can support. We get
pretty much enough latency from the camera and processing.
>>: How long is it?
>>: Pretty close to 100 milliseconds if you're doing like all the -- the tracking and
the application.
>>: Besides the USB, though. That wouldn't change.
>>: Yeah.
>>: Do you also find people are filling in more trees?
>>: It's okay, we've got more capability than actually.
>> Jason Oberg: Yeah, you can -- if people ever find -- [laughter].
>>: This new 50 tree -- [laughter].
>>: That wouldn't work too well.
>>: So how many BRAMs do you think you require for this?
>> Jason Oberg: That was pretty much it. I mean, for the whole -- for at least
this particular one, because everything's there right now, the only thing we don't
have is the prefetching.
>>: BRAMs specifically, what count of BRAMs do you think you need?
>> Jason Oberg: BRAMs?
>>: [inaudible].
>> Jason Oberg: I'm sorry, 172.
>>: You don't see that shrinking.
>> Jason Oberg: No, if we downsample, that image will go down a lot.
>>: The reason I asked is if you switch to Spartan you'll get the DDR essentially for
free, but you'll need a largish part of the family to give you as many BRAMs
as you need.
>>: If you're able to stream the input, I imagine you're able to stream the output
rather than storing. Basically captured one entire frame and captured one entire
output, 50 percent. If you're able to stream that, then the core FIFO is only using.
>>: 30.
>> Jason Oberg: So we were buffering everything at the beginning and the end.
>>: 264 which gets divided by four and that's it. Because the output -- so the big
chunk would be the output.
>>: There's a bunch -- there's a tendency to go for a Virtex saying I don't care
how big and expensive this is because it's just -- but you actually pay in terms of
compile time and things like that.
>>: You're right. As much as for the core algorithm you can use a very small one --
because you're certainly not logic bound. If you eliminated the fact that you have
to store the entirety of the input image and output and sort of harbor a goal of 0
[phonetic].
>>: Still need the DDR.
>>: Sounds like you had --
>>: I understand but it's free.
>>: It's hard.
>>: With the available logic left on the FPGA you could have implemented USB. [laughter].
>> Jason Oberg: I mean, I was provided a chip with a data sheet and it was
still -- I mean, that sounds hard. I mean --
>>: Probably dye it.
>> Jason Oberg: The problem -- it wasn't so much the controller, it was the
driver. I mean, we had to -- because the USB chip I mean was functional. It
just -- it was lacking in external memory and space, but it was just staring at all
the driver code with all this stuff going on -- that's where the
complexity was, at least with getting something to work, at least a mouse and things like that.
>>: Keep in mind the Kinect is not a simple thing. It's a device with video
and audio and media.
>> Jason Oberg: We didn't even really know what the -- how the different
devices were separated and it's kind of not --
>>: So it wasn't so much the physical USB connection?
>> Jason Oberg: No, that's what got us to stop, because it was like even if we
get it to, get the driver working, it's not going to fit. So it doesn't matter.
But, yes, on both.
>>: Five weeks is not enough.
>> Jason Oberg: Five weeks is not enough to write a USB driver for the Kinect.
>>: Starting from zero, effectively --
>> Jason Oberg: I got a mouse to work, though. [laughter].
>>: One question I did have was how long did it take, in total? Five weeks?
>> Jason Oberg: I wrote all -- Allesandro helped me a lot. He took Toby Sharp's
C code and readapted how we should write the hardware and I wrote all the
Verilog in six weeks. I was writing stuff yesterday to get it done.
So we changed the output buffer to make it -- the block RAM size used to be
twice as big until yesterday at 4:00 p.m..
>>: That was really the, during that time did it include coming up with the idea of
FIFO or was it something else?
>> Jason Oberg: No, not the six weeks, the whole 12 weeks. Allesandro was
working on the sorting FIFO and the C code while I was doing USB stuff. And
then he was like, hey, I got this thing working now in C, and then he's like build it
in Verilog. And I was like, okay, and I took it and that's what we have here.
>>: So how long did it take you roughly from looking at the C code to the locality
of this data, stuff is nasty to, boom, this sort FIFO structure would work really
well?
>>: We went through many iterations, started off with the obvious code, it was fully
parallel. It took it and FIFO'd it [phonetic], but of course it wouldn't work on
the memory. So that's where we started to think of how can we do that.
And then I don't remember -- can you remember?
>>: It was probably two weeks that it took us.
>> Jason Oberg: Didn't take us that long.
>>: Of actually coming up.
>> Jason Oberg: Once we got the code, when you gave it to me it wasn't that
long. Two weeks, three weeks at most. And then I took it and spent the last six
weeks writing the Verilog.
>>: When we were actually looking at the algorithm it was patently clear that
there were certain kind of bumps and stuff that we had to clear up.
>>: But the report, however, that was actually the thing that was most useful,
because people started talking about how exactly it's stored and so that's what
gave me intuition, I think. Instead of going level by level parallel, maybe that's
the way we need to deal with the DDR memory code.
Okay. No more questions? We'll close early. [applause]