>> Allesandro Forin: Good morning, everybody. Thank you for coming. And for
those watching here or later, thank you for your interest. Today we are talking
about Jason's intern project, which was a lot of fun, had some ups and downs,
like he will tell you. And the general idea was to see what could we do when
FPGA and Kinect are together. Jason.
>> Jason Oberg: Thanks, Allesandro. Thanks for coming everybody. As the
title suggests, we implemented the Kinect body recognition algorithm on an
FPGA. My prior research, which I've been doing for the last three years or so, is
looking at security, looking at how we design hardware for security and hardware
information flow tracking. This summer is very, very different.
I have done some past high performance computing with FPGAs; our group has
done a face detection algorithm on an FPGA, as well as a GPU, and recently
working with some graduate students on some high level synthesis. Now that
you know that I'm an actual person and I have come from somewhere.
But to give you a brief overview of my talk, I'm going to go through kind of the
motivation of why we wanted to explore this, go through the goals that we had,
or have, for my summer internship, what I hoped to get out of it and what
actually came out of it.
Then I'll go through some issues with the Kinect and USB and FPGAs, and then the meat of the
talk, the body recognition algorithm: I'll talk about the algorithm, give an overview, go
through the architecture that we designed to implement it, and then go through
our results, and then kind of at the end discuss some future work that this could
have.
So the motivation is we're going very, very low power. Things are getting more
embedded. There's very high demand for low power hardware acceleration,
especially since reconfigurable hardware helps because you can redesign: if your
specifications change, you can kind of redesign the hardware for your specific application
to be accelerated at low power.
And another motivation is Kinect has been a huge success. Everybody -- it's
blown up, the SDKs, it's really gotten a lot of good feedback. Also the FPGA
community is kind of lacking in example applications.
There's not -- they haven't really seen the push that we've seen from the software
guys using the Kinect. We haven't really seen that in hardware. So we're kind
of -- we want to kind of promote that and get that going.
And so they're very much lacking interfacing solutions and also just using the
Kinect in general. This in turn -- if we get this hardware community excited, this will
promote the use of the Kinect in an embedded environment and open the doors
for a lot of different hardware accelerated applications.
And so our end goal, which was rather an ambitious goal, was to build kind of the
Kinect system in hardware. So this is kind of the system that you typically see at
a high level, you have your Kinect connected to the XBox, which is connected
over USB.
The infrared sensor on the Kinect projects out a 3-D infrared image and receives
it back and is able to construct a depth image that says how far things are from
the sensor. So that gets passed to some sort of motion estimation on the XBox
which will say this particular region is likely to belong to this player. So it
associates player information with the depth image.
That gets passed into this kind of body recognition algorithm, which operates on
these annotated depth images, and returns a result of where different body parts
are likely to be. So in that kind of example you can see the head is -- there's a lot
of pixels that have a distribution around the head. This is very likely to be in this
part of the body.
That kind of does some post-processing which will lay out where your skeleton is
likely to be and then that kind of gets sent to the display.
So what we want to do is like I said build this entire thing, change what the XBox
does and do it in hardware and see what happens. So first thing we wanted to
do was interface the Kinect with the FPGAs via USB. And that was step one.
We want to be able to capture data from the Kinect.
The second thing that we wanted to do was once we have data to do this
skeletonization or the body part recognition algorithm and possibly some other
things. So once we get the data, see what we can do with it.
And then lastly we wanted to do kind of any sort of post-processing we could do,
either skeletonization or if we were going to do something else we could do
gesture tracking or something like that, we wanted to see what we could do, kind
of at the back end before displaying.
So this is a slide I had kind of at the very beginning when I was like this is what
we want to do. We wanted to, our primary goal was let's get this interfacing
solution to work. Let's give the FPGA guys something to be able to -- get the
Kinect running and stimulate some interest.
Then we had this addition, if we have time we'll get these algorithms going. And
so we kind of -- I had these three main stages where basically I had to write a
USB driver for a USB microcontroller that was on one of these FPGA
development boards by Xilinx, and then I had to write Verilog to read from that
microcontroller to actually capture the data once it was buffered, and we were
going to write some image processing stuff.
Run the algorithm. It turns out that this was kind of a complete disaster. I'll go
through why. So our primary goal completely shifted. I mean, we set up like
we're going to make this interface nice for everybody. But this all just turned
horribly bad. And so everything kind of switched. I mean, the first six weeks were
spent on this, everything just kind of changed, and we went toward this additional
goal, which was to build some sort of algorithm on the FPGA, and it turned out
we got a hold of the body part recognition code and started working with that.
So what happened with the USB? If you ever -- my advice to you if you ever
have to do anything with USBs on one of these boards, just avoid it. So it's
nasty. So we wanted to have this interface built, and our goal is to release it.
Everybody would have been happy.
There's no documentation. So we wanted to make a nice step-by-step procedure
so everybody could follow it. But it turns out that this chip provided on the
boards, there's not enough space for anything. I mean, the code alone barely
fits. You can't do any sort of buffering.
Actually, I had to optimize the code and hand remove things because at first
actually the code itself didn't fit. So you have no room for any sort of data. It's all
just instructions.
The chip doesn't even do USB 2.0. It actually does 1.0 plus. It does 1.0 with
some little extra benefits. But it doesn't -- it won't really work for anything that
has high bandwidth demands, like the Kinect. But I did get the mouse
and keyboard to work which was awesome. So I have that documented. So
hopefully we can send that to people and people will be happy and be like thanks
for not making me spend hours and hours to try to get the Kinect to work
because it's not really possible.
But it doesn't solve my problems. So we want to know how to interface with the
Kinect. We want to have some sort of way of doing this.
And so we started thinking about other options. And this whole time I was trying
to write an embedded driver for the Kinect to try to figure out how to do that. We
thought why not use what the software guys did. They have a nice driver that
they've released that works really well.
So the idea was to reuse that on the host PC and send everything to
the board over either ethernet or PCI express. So what we chose to use was this
Simple Interface for Reconfigurable Computing, SIRC, which was a debugging
tool developed by Ken and Allesandro and the rest of the embedded group. So
actually this platform turned out to be really good to use, and it was very easy.
So the high level overview of how we kind of use SIRC. So we have -- the setup
now is we have the Kinect connected to the host PC which has a host side SIRC
software running on it, which can buffer input data. So this transfer uses the
driver from the SDK so we don't have to deal with that.
Then data is sent over to the hardware over ethernet or PCI express. I use
ethernet because that's what I use, but you can use either. So you can see that
it captures the depth image. Can send it over to the ethernet to the FPGA. The
FPGA can operate on the depth image. Generate some results. Send it back,
do some lightweight post processing and then you can display it.
So we have this whole interface now built. We don't have to worry about USB with
this; SIRC and everything is really easy to use and it's already well documented.
And so it alone is kind of a good interface for FPGAs with the Kinect. So before I
get into the detail of that -- the decision forest algorithm, the body recognition rectangle
there -- it's probably good to give you guys an overview of how this
algorithm operates.
But beforehand, I'll just go through some example applications where this
algorithm is used. We specifically were kind of using it for this human
pose estimation, body part recognition. But it's been used for key point
recognition, for augmented reality. You can reference an object to a particular
other object and say these are the points that these match at. So algorithms
used for that.
It also can be used for object segmentation. You can say this is water, this is a
boat you can segment objects, the algorithm's been used for that and also for
something like organ detection, where you can say this is the lung
and this is the heart, things like that.
This algorithm we used, we used it for body part recognition, but it has a wide
array of things that you can do. So if you were to retrain for a different
application, you could still run this thing on the FPGA. It would just be detecting
your organs instead of your hand.
So kind of the basics of how this algorithm works. This is a nice example that we
obtained from Toby Sharp. You can imagine you have -- you're outside. You
step outside and that's, say, the world state. It enters your decision tree and it says
is it raining? If it's raining, then it's very likely to be wet outside. If it's not raining,
then you check -- if the sprinklers are on. If the sprinklers are on it's still likely to
be wet outside.
So this process is exactly how -- the algorithm just basically takes an object
input, it makes decisions along this binary tree, and at the bottom the leaves
represent the probability that that object belongs to some class.
So it just basically classifies the object with some probability. And just to go
through the pseudo code in general: you can imagine you have some V,
which is some object, and then a particular node. If this node has
an actual branch, if it's not a leaf, you want to evaluate that node's decision function
on your input object.
The last example it was, you know, is it raining outside, is the sprinkler on. You
compare it against a threshold for that node. Then if it's greater you go to the
right. If it's not, then you go to the left child. And once you hit the leaf nodes you
return that probability. This is the way it works at a high level.
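To make that pseudo code concrete, here is a minimal C sketch of the generic tree walk just described. The node layout, the feature function, and the tiny example tree are illustrative assumptions for this writeup, not the structures used in the actual implementation.

    /* Minimal sketch of the decision-tree walk described above.        */
    /* The node layout and feature function are illustrative only.      */
    #include <stdio.h>

    typedef struct Node {
        int   is_leaf;
        float threshold;          /* compared against the feature value */
        float leaf_probability;   /* valid only when is_leaf is set     */
        int   left, right;        /* indices of the child nodes         */
    } Node;

    /* Evaluate the node's decision function on the input object; here  */
    /* the "object" is just a number so the example stays small.        */
    static float feature(const Node *n, float v) { (void)n; return v; }

    float classify(const Node *tree, int root, float v) {
        int i = root;
        while (!tree[i].is_leaf) {
            /* Greater than the threshold: go right, otherwise go left. */
            i = (feature(&tree[i], v) > tree[i].threshold) ? tree[i].right
                                                           : tree[i].left;
        }
        return tree[i].leaf_probability;  /* probability for this class */
    }

    int main(void) {
        /* Tiny two-level "is it wet outside" style tree. */
        Node tree[] = {
            {0, 0.5f, 0.0f, 1, 2},   /* root: value > 0.5 ? right : left */
            {1, 0.0f, 0.1f, 0, 0},   /* left leaf: unlikely wet          */
            {1, 0.0f, 0.9f, 0, 0},   /* right leaf: likely wet           */
        };
        printf("%.2f\n", classify(tree, 0, 0.8f));   /* prints 0.90 */
        return 0;
    }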
So what we did is we're dealing with depth images. And with pixels. And so we
used the same exact algorithm, but our object that we want to classify is a
particular pixel. We want to see if it's part of a specific body part. So the pixel
enters the decision tree at the root. And then at each node you evaluate the
node's function on that pixel to determine whether to go left or right.
The computation ends up being -- the function's basically a difference between
neighboring pixels. It grabs another pixel in the image, compares
them, and makes a decision based on that.
Once you hit a leaf, it basically determines the likelihood that that pixel belongs to
that particular body part. So it might be this pixel is in your hand with 90 percent
probability.
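As a rough illustration of that per-node test, here is a simplified C sketch: compare the depth at the current pixel against the depth at an offset neighbor stored in the node, and branch on a threshold. The frame size, offsets, bounds handling, and any depth normalization used by the real trained trees are assumptions made only for this example.

    #include <stdint.h>

    #define WIDTH  320               /* placeholder frame size */
    #define HEIGHT 240

    typedef struct {
        int   du, dv;                /* pixel offset stored at this tree node */
        float threshold;
    } NodeFeature;

    static uint16_t depth_at(const uint16_t *depth, int x, int y) {
        if (x < 0 || x >= WIDTH || y < 0 || y >= HEIGHT)
            return 0xFFFF;           /* treat out-of-bounds as "far away" */
        return depth[y * WIDTH + x];
    }

    /* Returns 1 to go to the right child, 0 to go to the left child. */
    int evaluate_node(const NodeFeature *f, const uint16_t *depth, int x, int y) {
        int d0 = depth_at(depth, x, y);
        int d1 = depth_at(depth, x + f->du, y + f->dv);
        /* difference between the pixel and its offset neighbor */
        return (float)(d1 - d0) > f->threshold;
    }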
>>: Is it the return -- returned the probability that it's in all the body parts, right?
So you could see like probability that it's -- it's the probability it's your shoulder,
it's your elbow, so you get --
>> Jason Oberg: That's why I have multiple things here. So, yeah, they're
different weighted.
>>: So at each you can't get a probability vector for all the body parts.
>> Jason Oberg: Technically, yeah, but the actual implementation here it's
segmented so this might be hand, head, feet, and this one might not be head in
it. But in general I think --
>>: But you get more than one probability.
>> Jason Oberg: I think there's five.
>>: I think there's another one where you have probability of wet. This doesn't
give you a single like hand .5.
>> Jason Oberg: No, it gives you a weight. It varies. Each -- at least for the
database we're using, each leaf doesn't have all of them. But they're all
contained at some leaf. So I think it's in chunks of five and there's 31 total.
>>: Multiple trees.
>> Jason Oberg: And then there's multiple trees. You run this -- our thing has
three trees and you run it three times.
So this is the whole process. So we wanted to basically -- we wanted to build
that hardware. That's the whole algorithm. So how do we do that? We, first we
need to have somewhere to store the binary tree. We store the binary tree in off
chip DDR. Currently we modeled it because we didn't have time to physically
deploy it on the physical chip. But that's where the database would be. I think
it's roughly 24 megabytes. So it doesn't fit in the resources on an FPGA. So the
main points we have in our architecture here are: we're using the host machine with
kind of the diagram I had before, where you're using the Kinect USB
driver on the host side to fetch frames and then send them over ethernet to the
hardware. And so we have a FIFO load kind of section which initializes this FIFO
with the -- basically with all the active pixels. I mentioned before that the
depth frame comes in and it gets annotated with this player information.
Only those pixels that have a valid player in them are the ones you're interested in.
So those are the ones that actually -- this FIFO load logic checks and only loads
those into our sorting FIFO.
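A small C sketch of that FIFO-load step, assuming each pixel carries a player index where zero means "no player"; the frame size and entry layout here are placeholders, not the exact hardware formats.

    #include <stdint.h>

    #define NUM_PIXELS (320 * 240)   /* placeholder frame size */

    typedef struct {
        uint32_t pixel_index;        /* position of the pixel in the frame        */
        uint32_t node_index;         /* current tree node for this pixel (root=0) */
    } FifoEntry;

    /* Scan the player annotation and load only the active pixels.
       Returns how many entries went into the sorting FIFO.          */
    int fifo_load(const uint8_t *player, FifoEntry *fifo) {
        int count = 0;
        for (int i = 0; i < NUM_PIXELS; i++) {
            if (player[i] != 0) {             /* pixel belongs to some player */
                fifo[count].pixel_index = (uint32_t)i;
                fifo[count].node_index  = 0;  /* everyone starts at the root  */
                count++;
            }
        }
        return count;
    }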
So the secret sauce in this is this FIFO sort, which keeps, as you're making a
binary decision left or right, it keeps the ordering of the nodes in this FIFO
completely sorted. What that allows you to do is basically stream
data. So every time you pop a pixel, it's always going to be in order in terms of
node. So when you fetch a node from the database, you fetch its evaluation
function threshold, et cetera. That's all sequentially accessed. You don't
randomly bounce around in memory.
Then kind of the last block is this compute, which is the evaluation function that it
had in the pseudo code, where you evaluate the function. You pop the pixel,
evaluate the function and determine whether or not you're going to go to the left
or right child.
So how does this thing work? This is the cool thing that Allesandro and Ken
came up with randomly and they said this is perfect. It's pretty cool. The idea is
basically to, whenever you go to a right child, it's a binary tree so the right
children always have a higher node index than the left children.
So as you're popping pixels, if a child is going to go to the right you push it at the
back of the FIFO. So you just push it back into the queue. If it's going to go
left, you need to actually write it somewhere in the middle and rewrite what was
there back at the end. The idea is when you go left, it's going to be less than
had you gone right, so it needs to go somewhere in the middle. High level that's
how it works. I'll go through a specific example.
So imagine we have this very simple decision tree here. We have our FIFO. It's
actually a ring buffer. It wraps around. We have some pointer to head here. It's
going to first -- imagine it first pops the -- I guess it's yellowish, the yellow-brown,
yellow-red and black pixel and it pushes them to the right. So the tail was here.
It pushed two, two, two, because they're all in the second node now. And then
imagine now that the gray pixel, once it's popped by the head, it actually goes
left.
So what needs to happen if you want to stay sorted: since we read all zeros, the
head should pop a one next; you don't want to pop a two because then you're no
longer in sorted order.
So the idea here is that you pop the one -- or sorry, you pop the zero, you compute,
you evaluate that you need to go to one. You write that to where the hole was
and you shift the order.
By running this you're able to maintain a completely sorted FIFO. So as you see
when the head continues to access here, it will always access the nodes in sorted
order.
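To pin down that mechanism, here is a small host-style C model of the sorting ring buffer; the real design is in Verilog, and the capacity, entry layout, and missing full/empty checks here are simplifications for illustration. Node indices use the usual numbering where node i has children 2i+1 (left) and 2i+2 (right), so right children always carry the larger index.

    #include <stdint.h>

    #define CAP 4096                     /* ring capacity, power of two (placeholder) */

    typedef struct { uint32_t pixel, node; } Entry;

    typedef struct {
        Entry    buf[CAP];
        uint32_t head, mid, tail;        /* advance forever, used modulo CAP          */
        uint32_t max_node;               /* largest node index currently in the FIFO  */
    } SortFifo;                          /* start from an all-zero SortFifo           */

    static void push(SortFifo *f, Entry e) {
        if (e.node >= f->max_node) {
            /* Right child (or same value as the current max): append at the tail. */
            if (e.node > f->max_node) {
                f->max_node = e.node;
                f->mid = f->tail;        /* the block of largest entries starts here */
            }
            f->buf[f->tail % CAP] = e;
            f->tail++;
        } else {
            /* Left child, smaller than the current max: take the slot at mid,  */
            /* move the entry that was there to the tail, and advance mid.      */
            f->buf[f->tail % CAP] = f->buf[f->mid % CAP];
            f->buf[f->mid % CAP]  = e;
            f->tail++;
            f->mid++;
        }
    }

    static Entry pop(SortFifo *f) {
        return f->buf[f->head++ % CAP];  /* comes out in nondecreasing node order */
    }

Because popping always yields nondecreasing node indices, the node database can be fetched as a forward stream instead of with random accesses.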
So in doing so, you can take advantage of sequential memory accesses. So it's
really -- with DDR, if you're bouncing around in memory, you're going to get killed
by I think it's roughly ten times overhead in your latency. But if you can stream it
you get really, really high bandwidth. So by keeping the FIFO sorted you're able
to access the nodes in your database in a stream-like fashion.
This actually allows for really effective pre-fetching which we haven't added.
These dotted lines as I mentioned are not built, so we don't have a cache or
pre-fetch logic, but those are options you could have. Since the FIFO
is sorted you can certainly look into the FIFO, see what you're going to need, and
pre-fetch it, and you know it's going to be used for a long amount of time
because it's sorted.
You're not going to evict something and have to put it back. It's going to be
there, when it's gone, it's gone. Because your FIFO is completely sorted.
So to give you guys an overview of what sort of area we're dealing with with this
implementation: we synthesized it to a Xilinx Virtex 6, a larger FPGA. I think it's --
it's a pretty beefy device. But if you look at it, the logic is not completely
overbearing. If you look at the actual registers used, it's zero. And the actual
combinational logic, it's pretty small.
The biggest thing is the memory elements, which makes sense. We have a big
FIFO. We have these input/output buffers, and that tends to be the determining
factor. We actually cut this down by half yesterday, because of a last-minute
change. So that was cool. This actually used to be a lot larger.
>>: Would you happen to know how big those blocks are, how much RAM total?
>> Jason Oberg: What is the size of the block? It's eight, 18 K bits. 18 kilobits.
>>: That's a half. 36. If you take the whole thing.
>>: It's 172 times 18.
>> Jason Oberg: Kilobits.
>>: Most of these input/output buffer is the image 1912 times two. And the
output, which is not necessarily required, I mean it's by showing things is 250 K, I
think or 31 times 1912 times 31. We cut it down by --
>> Jason Oberg: Right. The biggest thing is the input/output buffers, the input
image frame. I should mention this. We have a downsampled version in
software, we haven't done it in hardware. It would be cut down by a factor of
four. The input buffer.
>>: The input buffer, which is the skeleton which is 72. [inaudible].
>> Jason Oberg: Right. So if you were able to do the post processing on a
FPGA, you would send like nothing. I mean, just 72 coordinates.
>>: Alternatively if you're going try to push the skeletonization into the Kinect by
putting FPGA there and only -- you could stream the output. So you wouldn't
need the buffer.
>> Jason Oberg: Yeah.
>>: Because if you figure 172 minus the 64 from the input buffer, 77 for the
output buffer, the 172 is much, much smaller.
>> Jason Oberg: So, I mean, even with this particular implementation, then it's a
V-6, but it fits on smaller devices, I think this will fit on a Virtex 5 as well. Which
is a step down from that.
>>: Here, the storage of the decision tree -- where would that go?
>> Jason Oberg: It's in off-chip DDR. It's actually in here. So everything here --
this is the results I'm displaying. So the database is 24 megabytes, roughly, at
least for this particular training.
So that's all stored off chip.
>>: That would require also the controller and the logic for that which is fairly
beefy, I think it's 10 percent it would be five -- and fooled around with that too.
>>: That's a good point.
>>: But that would be a big chunk of it.
>> Jason Oberg: You need that extra thing, and then that would definitely --
>>: Bigger than anything else. As it stands.
>> Jason Oberg: I mean, we have plenty -- at least for in terms of logic you can
see there's -- we're not doing too much. It's mostly just these really small
computations and we're just moving a lot of data around.
>>: Do you know -- I'm not familiar with the [inaudible] FPGA. Do you know
[inaudible].
>> Jason Oberg: I think the rule of thumb for a LUT [phonetic] is -- is it eight
gates?
>>: Conversion from LUTs to gates is at best very gray.
>> Jason Oberg: Well --
>>: The answer here is two percent utilization of the LX 240-AD is nothing. The
number of gates is basically nothing.
>>: I guess my question is more geared to if you take that nothing and you put a
thousand of them in there, is it still nothing? Or ->>: Yes. You do want to replicate it.
>>: Is it 12 gates, is it 10 gates, is it [inaudible].
>>: If you replicate, you still pay a lot of rounds to replicate, this block RAM only
allows two [inaudible] accesses, you wouldn't want a fragment there.
>>: You would start FIFO 2 which means you don't require any more, you just
need -- you don't need more copies you just need more --
>>: For FIFO, yes. But for input -- you mentioned the output buffers those would
have to be --
>>: It's more a case --
>>: What I would be mapping it to is FPGA.
>> Jason Oberg: I don't think there's a hard --
>>: Gates not much.
>>: It's very, very rough estimate. The whole device is probably somewhere
between 125,000 to a quarter million gates equivalent. So figure two percent of
that. But that is at best to be taken as an estimate, with a grain of salt.
>> Jason Oberg: It varies by device and then the -- depends on what you're
putting in the LUTs. They can be used as memory elements and they can be used
as logic elements.
>>: These are protected --
>> Jason Oberg: I should mention that, too, that's good -- this is with SIRC.
>>: The whole thing.
>> Jason Oberg: In the future, that's actually debugging. There's the ethernet
controller and buffer, you have to buffer the frame at some point but that's with all
the extra SIRC logic that wouldn't be there if you were to --
I had it right here. So this is basically SIRC. This is the input buffer, so you can
see the logic for SIRC. So --
>>: PM is actually --
>> Jason Oberg: This is the actual algorithm. This is the logic for that whole
algorithm. And then this input/output buffer, the image input buffer and output
buffer for buffering the output results.
Okay. So to kind of move on to our performance estimates. So we anticipate
pretty big speedups over the software version. If we look at this as -- we weren't
able to actually physically run it on the hardware because we don't have a DDR
controller yet, and things like that. But for our worst case estimate, as you can
imagine, we have the number of possible pixels, which is 19 K, times the height
of the tree, which for our particular database is 20. So each
pixel is going to go down from root to leaf, and that happens for all 19 K pixels,
and you do that three times, once for each tree.
So we estimate roughly about a million cycles. And if you clock this at 100
megahertz you're looking at roughly 87 frames per second.
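As a quick back-of-the-envelope check of those numbers in C, assuming roughly 19,200 active pixels, a tree depth of 20, three trees, one node evaluation per clock cycle, and a 100 MHz clock -- these are the figures quoted above, not measured results:

    #include <stdio.h>

    int main(void) {
        const double pixels     = 19200.0;   /* worst case: every pixel active  */
        const double tree_depth = 20.0;      /* root-to-leaf hops per tree      */
        const double trees      = 3.0;
        const double clock_hz   = 100e6;     /* assumed 100 MHz clock           */

        double cycles = pixels * tree_depth * trees;   /* ~1.15 million cycles  */
        double fps    = clock_hz / cycles;             /* ~87 frames per second */

        printf("cycles/frame = %.0f, fps = %.1f\n", cycles, fps);
        return 0;
    }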
I should mention, too, that I've never seen the entire frame be active pixels.
Because as I mentioned you only process the pixels that are part of the player.
So I think the most I've seen is just around 5,000.
So this essentially goes up by a factor of four. On the average case. This is
absolutely worst case. And then also this is likely to be clocked faster; 100
megahertz is kind of on the slow end for one of these devices. The DDR, once
it's added, will skew this estimate slightly because you could potentially have
stalls and things like that.
But keeping everything sorted, if we're able to stream quickly and effectively, it
should -- this shouldn't be too huge of a bottleneck.
And I should mention this is unparallelized hardware. So as I showed you just
have things streaming. There's no replication of resources. We've down
sampled the image in software, we just haven't built that in Verilog yet. You
could replicate the FIFOs, replicate the input buffers, and you could do a lot of
parallelism and speed this up substantially.
And so a little estimate. Software sequential. This is worst case that I ran on my
desktop, it's about ten frames per second with our software version, and then the
estimate for the hardware is roughly --
>>: I've got a question. For the software version, how optimized was it? Was it
the code you showed us before at the very beginning when you just described
how the algorithm appears to work?
>> Jason Oberg: It wasn't really optimized much at all. So it's not parallel. It's
sequential for one.
>>: It's not just --
>> Jason Oberg: Sorry?
>>: [inaudible].
>> Jason Oberg: There is. That's the Toby Sharp.
>>: The sequential means not even the one that runs the actual, it's parallel.
>> Jason Oberg: Yes.
>>: I have that. But for that you don't get the worse case, the average is about
three milliseconds.
>> Jason Oberg: Absolute worst case. So...
>>: The machine that you're running your software, sequential on, getting it
[inaudible] per second, what sort of -- what sort of CPU --
>> Jason Oberg: Xeon, quad core, but we were running --
>>: Single threaded?
>> Jason Oberg: No, no, it's single threaded. I think it's three gigahertz.
2.8, maybe. Close.
>>: Even that one, 87 frames per second, what will that [inaudible] on the
average case?
>> Jason Oberg: So like maybe I should.
>>: You can't really measure it worst case.
>> Jason Oberg: Because it's average. You're running it while you're looking at
it.
>>: We haven't test it with two players, rather before. That's what we use. You
can't use it -- worst case would be [inaudible].
>>: So the delta between your software sequential and your [inaudible] is that for
the [inaudible].
>>: Yes. The same image. In fact, the software is running --
>>: Nine X speedup. FPGA.
>> Jason Oberg: Like I say for FPGA that's the absolute, for the workload
that's --
>>: It's the best case scale, better or -- do you expect best case to be nine times
better?
>>: That's the formula. So the less pixel you have linearly, the better.
>>: Same here, this is if you had every pixel was [inaudible] into it you get the
whole energy. So it doesn't have --
>> Jason Oberg: It doesn't actually recognize you. Won't even pick you up if
you're that close.
>>: 87 frames per second if you're processing, every pixel.
>> Jason Oberg: It's normally 4,000. I haven't seen it bigger than five. This is
roughly four times bigger.
>>: 4,000 and for software sequential, what is it normally, not worse case?
Definition.
>>: They're both speed up the same.
>>: That's part of the image.
>> Jason Oberg: Exactly.
>>: Just depends how much wood is going into the wood chipper?
>>: Yes.
>>: For software, like CPU and GPU implementations, we found the majority of
the time was spent on the cache misses on the [inaudible] to, say, DDR, and
how much is that mitigated by this?
>> Jason Oberg: Well, I mean our expectation is because we're sorting it we're
not going to have any sort of -- so once you get towards the bottom, closer to the
leaves, things are more scattered. So you will have to do kind of more random
accesses, but we expect that since everything's nice and sorted we should be able
to stream it pretty well.
>>: [inaudible].
>> Jason Oberg: And we'll have prefetching. Since it's sorted we can prefetch
and have things cached from DDR ahead of time. So it's hard ->>: It happens a little over 13 [phonetic] without fetching, you can cache
everything. Beyond that you want to look at it and see which block am I going to
hit. And then you imagine what we're going to get because DDR will get you two
nodes per clock. So how many nodes are you going to fetch. And how many
programs.
The good news is that each one of them can be very small. So once you keep
down the ramifications somewhere else, the big [inaudible] prefetching. You can
look ahead.
>>: Sorry. So is it basically like for each pixel go down to a certain node and
then if it is supposed to go down in new cache does it go for a different pixel?
>> Jason Oberg: Talking about FIFO or cache?
>>: Yes.
>> Jason Oberg: So, actually, the idea is that it keeps -- it keeps that hole
pointer always pointing to the front of the biggest, the current biggest element. So
if you go left, it's going to be smaller than that element. And so you put it at that
location and you rewrite that entry to the tail, and then you increment kind of your middle
pointer, and you keep operating like that. And it maintains the sorting of it.
>>: Is that not clear?
>>: I'll go back.
>> Jason Oberg: It may or may not be obvious but it works.
>>: Think of a tree -- well known of trees. You have index. They have the
media. If we can sort the indexes that we're going to have access to, then we're
going to be sequential. Start memory also.
>>: How can you sort it when like every pixel going down.
>>: Because it's a binary tree. It's not a general sort. It's two levels. So what
he showed at the beginning is what they are going to be using --
>> Jason Oberg: So if you go --
>>: Does that imply you have to get the whole frame in before you can start?
Because you don't know whether you're going to be inserting something else that
you would have to consume?
>>: Yes and no. Meaning, you can't start processing until you have some data,
some pixel, high pixel data. You'll be scouring everything you have for pixels.
>>: But if you start before you've finished, then there's the danger that a new
incoming pixel will be one that if it had been sorted fully, it would have access
to --
>>: We weren't talking about the database now -- very good -- you're
talking about the images. The images -- it depends on the database, because
the database says once you process this, go look at that pixel farther away. So if you
know how far away the database is ever going to go, you can tell your thing to
start processing once you have half an image.
So we look into that. Conceivably you could do something like that. But that's it
and this --
>>: Right now you pull the whole frame in and you process --
>>: And it's not a big deal.
>>: So because that's basically the trick that you come up with. You can be
sequential. And we'll have goals. Right? Because start having a million nodes.
Only access 5,000 of those.
>>: Per pixel the tree is being resorted?
>>: No we sort the pixel.
>> Jason Oberg: We sort what node the pixel is on based on that.
>>: The computation of the pixel, this decision is where we got. You can only go
left or right for every one of these pixels. So we can keep this, and when you put
it back into the FIFO, we do this sort there.
>> Jason Oberg: So you have compute -- that's left to right and you put it --
>>: Can you learn how the process of pixel --
>>: Yes, it gets sorted. It's insertion sort?
>> Jason Oberg: Yeah. Yeah, the order you're processing them in.
>>: If you look at it as that schedule so you have 1912, whatever many already
computed so you can't ->>: When you process you go back to your next ->>: Won't come back until 5,000 nodes.
>> Jason Oberg: So last couple of slides here. So as I mentioned kind of briefly,
some future things we can do with this is there's no caching, there's no
prefetching. If we add these, we should be able to kind of -- there's a concern
about DDR, but even so, we have it sorted so we should be able to stream from
DDR. And prefetching will hopefully hide any potential problems we could have
from the delay of any random accesses that happen once you get lower in
the tree.
Also a downsampled database. Now like I said we have this working in our
kind of software version, which is a model of what we've planned to build. And
so by downsampling the database and downsampling the input image we can
have a smaller input buffer. We can essentially replicate, because we'll have the
space, and replicate the FIFOs, so we can greatly enhance the
parallelism here and get even more speedups than we already have or than we
already expect.
And so, kind of to conclude, this started off with just USB -- plug this thing in
and let's figure it out, see if we can get it to work. I was discouraged for about
five or six weeks. It turned out to work pretty well.
I learned a lot. There was a lot of hacking. A lot of things breaking and a lot of
waveforms. I want to thank Allesandro, Ken. I don't see Neil here. Neil and
Toby Sharp provided us with a lot of the software code that we were able to
adapt, which probably wouldn't have happened if we didn't have that. So I think I
can do a demo really quick.
>>: Hopefully it works.
>> Jason Oberg: Hopefully it will work.
>>: Not running on FPGA.
>> Jason Oberg: It's all Verilog. It's synthesized, the results are there. And it's
fully ->>: It is using SIRC so the software will send packets.
>> Jason Oberg: Right. The model that we have, that picture where it actually
sends to the FPGA, it's still happening here. We're simulating SIRC so we have
the driver grab the frame, it sends it over SIRC to our simulated hardware, which
we'll process and send it back through SIRC and so it's all happening on the
machine. But all the components are there.
>>: Deep in the back of this. So we had to resend the packets.
>> Jason Oberg: Let me make sure I don't mess anything up here. So like I said,
this is essentially the host machine. We have the Kinect capture and the
software side SIRC running here.
You can run this. Get rid of this. Each of these -- it's kind of congested because
of the resolution. But see if I can get this thing to do something. So it got me
walking over here.
It's hard to -- let me see if I can move the -- it's hard to see all of it. I wish there
was another way.
>>: Want to try that?
>> Jason Oberg: Hmm? So that was me. That was my body. There's my head.
This is something else moving on the -- I don't know what it's picking up. It's
probably this thing waving back and forth. But I'll just bring this up so you can
see. So it goes through each tree. Loading FIFO. It seems like it's still picking
up that thing moving otherwise it would say no active pixels.
But it reads the depth image. Loads the active pixels in the FIFO, starts
computing on each tree, once it's done it sends back to SIRC.
>>: Nowadays nothing is --
>> Jason Oberg: Yeah.
>>: [inaudible].
>> Jason Oberg: So once it picks you up, got him. Nice. It's taking a while.
>>: 25 seconds.
>> Jason Oberg: I think it got you when you were moving in. So I'll -- yeah.
Totally different body parts. And so like I said, this is all -- every component that I
had in that diagram except for that -- it's all simulated because we don't have a
physical DDR but all the logic, the FIFO, everything, SIRC all is written and
working.
And I think that's it.
>>: It's showing --
>> Jason Oberg: I'm sorry. It's 31 body parts. It keeps picking up on something
moving over here. It's weird. But it basically says how likely -- it says it gives you
the distribution of where each pixel is likely to be. So this top one I think is your
head. And so you can see it all. It does like head, torso, arm, neck. And it will
kind of give you those distributions and show you where --
>>: For each of the 31 parts, where is it more likely trying to go?
>> Jason Oberg: Exactly. So I think you can see. I don't know what this side
thing is. On the top one, the top one might be my head. I think it's actually
picking up on you.
>>: The bottom --
>> Jason Oberg: Is it the screen itself moving.
>>: Like the bottom row there the second one from the left and the third one from
the left, it's clearly the left leg/right leg.
>>: And somebody --
>>: Think about the size of those from that perspective, it's got to be the screen.
>> Jason Oberg: Because it does everything, hands, feet.
>>: I've never seen that transmit --
>> Jason Oberg: So you can see my arm.
>>: Yes.
>> Jason Oberg: Up there and the head. It thinks my hands are my head. So
some probability. Arm, torso. I think that is actually you sitting there.
>>: It's reversed left to right?
>> Jason Oberg: Sorry?
>>: It's reversed left to right from the perspective of the camera?
>> Jason Oberg: I'm not sure.
>>: View left to right.
>> Jason Oberg: Right.
>>: It seems like you can process almost 90 frames per second and maybe
essentially four times that, 360 frames per second, something like that. So the
frame rate, what do you need to get it from a temporal standpoint, 30 --
>>: The [inaudible].
>>: About 30. So I guess if you can do around 60 frames per second, my assumption is
there's sort of a way to, like -- you overkill, right? You do 30, what can you do to
sort of like make that larger or smaller and use --
>>: The big deal is --
>>: Yeah.
>> Jason Oberg: That and you can do other stuff. So you have more time to do
more things, fancier things.
>>: That would be one -- that you could reduce the [inaudible] considerably, which is
going to save power.
>>: Yeah. It's running 100 megahertz.
>> Jason Oberg: That's what we're estimating we're running at. We haven't --
>>: At anything like 90 frames per second, you can start each frame from the previous
frame, because people can't move that fast.
>> Jason Oberg: You could probably do that. I think that sounds -- that's a
little -- I mean I don't know enough about image processing and vision to do that.
But I think that would be definitely a possibility.
>>: [inaudible].
>> Jason Oberg: I know you can't move quick enough to get away from it. If you
could somehow reuse part of it.
>>: Definitely, going to higher frames per second, if we can, reduces the perceived
latency quite significantly. So that's definitely one thing we can support. We get
pretty much enough latency from the camera and processing.
>>: How long is it?
>>: Pretty close to 100 milliseconds if you're doing like all the -- the tracking and
the application.
>>: Besides the USB, though. That wouldn't change.
>>: Yeah.
>>: Do you also find people are filling in more trees?
>>: It's okay, we've got more capability than actually.
>> Jason Oberg: Yeah, you can -- if people ever find -- [laughter].
>>: This new 50 tree -- [laughter].
>>: That wouldn't work too well.
>>: So how many BRAMs do you think you require for this?
>> Jason Oberg: That was pretty much it. I mean, for the whole -- for at least
this particular one, because everything's there right now, the only thing we don't
have is the prefetching.
>>: BRAMs specifically, what count of BRAMs do you think you need?
>> Jason Oberg: BRAMs?
>>: [inaudible].
>> Jason Oberg: I'm sorry, 172.
>>: You don't see that shrinking.
>> Jason Oberg: No, if we downsample, that image will go down a lot.
>>: The reason I asked is if you switch to Spartan you'll get the DDR essentially for
free, but you'll need a largish part of the family to give you as many BRAMs
as you need.
>>: If you're able to stream the input, I imagine you're able to stream the output
rather than storing. Basically captured one entire frame and captured one entire
output, 50 percent. If you're able to stream that, then the core FIFO is only using.
>>: 30.
>> Jason Oberg: So we were buffering everything at the beginning and the end.
>>: 264 which gets divided by four and that's it. Because the output -- so the big
chunk would be the output.
>>: There's a bunch -- there's a tendency to go for a Virtex saying I don't care
how big and expensive this is because it's just -- but you actually pay in terms of
compile time and things like that.
>>: You're right. As much as for the core algorithm you can use a very small one --
because you're certainly not logic bound. If you eliminated the fact that you have
to store the entirety of the input image and output and sort of harbor a goal of 0
[phonetic].
>>: Still need the DDR.
>>: Sounds like you had --
>>: I understand but it's free.
>>: It's hard.
>>: With the available logic left on the FPGA you could have implemented USB. [laughter].
>> Jason Oberg: I mean, I was provided a chip with a data sheet and it was
still -- I mean, that sounds hard. I mean --
>>: Probably dye it.
>> Jason Oberg: The problem -- it wasn't so much the controller, it was the
driver. I mean, we had to -- because the USB chip I mean was functional. It
just -- it was lacking in external memory and space, but it was just staring at all
the driver code with all this stuff going on -- that's where the
complexity was, at least with getting something to work, at least a mouse and things like that.
>>: Keep in mind the Kinect is not a simple thing. It's a device with video
and audio and media.
>> Jason Oberg: We didn't even really know what the -- how the different
devices were separated and it's kind of not --
>>: So it wasn't so much the physical USB connection?
>> Jason Oberg: No, that's what got us to stop, because it was like even if we
get it to, get the driver working, it's not going to fit. So it doesn't matter.
But, yes, on both.
>>: Five weeks is not enough.
>> Jason Oberg: Five weeks is not enough to write a USB driver for the Kinect.
>>: Starting from zero, effectively --
>> Jason Oberg: I got a mouse to work, though. [laughter].
>>: One question I did have was how long did it take, in total? Five weeks?
>> Jason Oberg: I wrote all -- Allesandro helped me a lot. He took Toby Sharp's
C code and readapted how we should write the hardware and I wrote all the
Verilog in six weeks. I was writing stuff yesterday to get it done.
So we changed the output buffer to make it -- the block RAM size used to be
twice as big until yesterday at 4:00 p.m..
>>: That was really the, during that time did it include coming up with the idea of
FIFO or was it something else?
>> Jason Oberg: No, not the six weeks, the whole 12 weeks. Allesandro was
working on the sorting FIFO and the C code while I was doing USB stuff. And
then he was like, hey, I got this thing working now in C, and then he's like build it
in Verilog. And I was like, okay, and I took it and that's what we have here.
>>: So how long did it take you roughly from looking at the C code to the locality
of this data, stuff is nasty to, boom, this sort FIFO structure would work really
well?
>>: We went through many iterations, started off with the obvious code, it was fully
parallel. It took it and FIFO'd it [phonetic], but of course it wouldn't work on
the memory. So that's where we started to think of how can we do that.
And then I don't remember -- can you remember?
>>: It was probably two weeks that it took us.
>> Jason Oberg: Didn't take us that long.
>>: Of actually coming up.
>> Jason Oberg: Once we got the code, when you gave it to me it wasn't that
long. Two weeks, three weeks at most. And then I took it and spent the last six
weeks writing the Verilog.
>>: When we were actually looking at the algorithm it was patently clear that
there were certain kind of bumps and stuff that we had to clear up.
>>: But the report, however, that was actually the thing that was most useful,
because people started talking about how exactly it's stored and so that's what
gave me intuition, I think. Instead of going level by level parallel, maybe that's
the way we need to deal with the DDR memory code.
Okay. No more questions? We'll close early. [applause]