>> Hugues Hoppe: It's my pleasure to welcome and introduce Fredo Durand, who's a professor at MIT, and he almost needs no introduction. He is one of the superstars in graphics with an amazing number of papers at SIGGRAPH and SIGGRAPH Asia in recent years. He's one of the founders of the Computational Photography Symposium and Conference. He has really strong influence in those areas. Today he's going to give us an overview, I think, of several different areas of his work at MIT, so welcome, Fredo. >> Fredo Durand: Thank you. [applause] It's a pleasure to be back. Hugues is one of the few people who can pronounce my name correctly, so it's always good to be here. Yeah, so today I am going to show you a few overviews of recent work we've done, and then I'll spend more time on an area that we're pretty excited about, which is to use computation to reveal things that are hard to see with the naked eye. But before this, let me show you a couple of things we've done in good old image synthesis, in a compiler for image processing, and something about online education and video lecture authoring. A few words about global illumination. It's still not a completely solved problem. Simulating the interplay of light within scenes is still computationally extremely challenging, and anything we can do to make it more efficient is really needed. With a number of coauthors from Finland, Jaakko Lehtinen being the lead one, we came up with a new technique that makes better use of the samples that we use to sample light transport, and we build on this technique called Metropolis Light Transport that seeks to place sample light rays proportional to the light contribution. It's an old idea by Eric Veach from a while back. It's pretty good at it. The only problem is that a lot of areas of light space contribute a lot to the image but are kind of boring. There isn't much going on, and so you shouldn't need that many samples to compute them efficiently. In particular, in a scene like this, this whole area is very uniform, and you shouldn't need that many samples to compute it, so instead what Jaakko and the guys came up with is this idea of sampling not according to the image contribution, but according to the contribution to the gradient of the image. And you see here this is the sample density. We really focus on these areas where things matter, so here I visualize it in the space of the image, but under the hood everything happens in the space of light paths. So you know, a light path could be something that goes from the light, bounces off the screen, on the ground, and then on you. So it's in this complex abstract space that we have to sample according to the gradient of the image. And so this path space, in a very simplified version, could be shown like this: you know, maybe this is my receiving surface, this is where I want to compute light, and this is my light source. So for each point here I want to take into account the contribution from all the points here, so abstractly I can show it as a 2D space where this is my light, and these are my pixels or my surface coordinates. In regular Metropolis, you know, because there's this occluder, this whole area of light space doesn't contribute to the image, and this one contributes a lot. So with regular Metropolis, your samples, after your Markov chain process, would be distributed roughly like this, and then you just count the number of samples for each pixel and you're done. You get your approximation of the image.
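[To make the sampling idea concrete, here is a minimal toy sketch: a 1D Metropolis random walk whose target is proportional to a made-up scalar "contribution" function, with the estimate obtained by binning the accepted states, as just described. The contribution function, proposal width, and bin count are all hypothetical; this is a 1D analogue, not the path-space algorithm from the paper.]

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Toy 1D stand-in for the "image contribution" of a light path: one bright
// region plus a flat, boring area. Purely hypothetical.
double contribution(double x) {
    return std::exp(-8.0 * (x - 0.3) * (x - 0.3)) + 0.2;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::normal_distribution<double> step(0.0, 0.05);

    const int kBins = 50, kSamples = 200000;
    std::vector<double> histogram(kBins, 0.0);

    double x = 0.5;                    // current state of the Markov chain
    double fx = contribution(x);
    for (int i = 0; i < kSamples; ++i) {
        double y = x + step(rng);      // propose a mutation of the current sample
        if (y >= 0.0 && y <= 1.0) {
            double fy = contribution(y);
            if (uni(rng) < fy / fx) {  // Metropolis acceptance: target proportional to contribution
                x = y;
                fx = fy;
            }
        }
        // The estimate is just the count of samples landing in each bin ("pixel").
        histogram[std::min(kBins - 1, (int)(x * kBins))] += 1.0;
    }
    for (int b = 0; b < kBins; ++b)
        std::printf("%4.2f  %f\n", (b + 0.5) / kBins, histogram[b] / kSamples);
}
```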
So instead what we're doing is we're sampling this space according to the finite-difference contribution to the gradient of the image. So instead of having one sample, we get pairs of samples where we look at the difference. And we place these proportional to the gradient, so we get a lot of these in this area of the simplified light space and not so many of these in this area. The cool thing is that not only do we have a placement of the samples that's more according to where things actually happen, but in addition, in areas like these, all these samples do tell us that there isn't anything happening, that the gradient should be zero in all these areas. So not only do you get a better representation of places where things change, but you also get more information: even though your sampling density is low, the information that you have is pretty strong, and you know that you should reconstruct something pretty smooth. So the basic idea is pretty simple. It gets fairly messy as you do it in the general path space with paths of arbitrary lengths, et cetera, and you have to pay attention to the math. This would be an example of the main math you need to pay attention to: instead of approximating [indiscernible] images like this, you're going to get a result like that, because somewhere there's a Jacobian that creeps in, and if you don't pay attention, you won't be computing the integral you think you're computing. I'll refer you to the paper if you like math, integrals, and changes of coordinates. If you do the right thing and take this term into account, you actually get the true approximation. So this will be presented at SIGGRAPH by Jaakko this summer. Now, no transition, because the next topic is completely different. Something I've become very excited about is the potential of online education and digital education in general. I'm very interested in this style of video lecture that you probably saw popularized by the Khan Academy, where you see a virtual whiteboard lecture and the writing going on as the person narrates what they're doing. A lot of people find these compelling. I won't get into the debate of whether that's the best format, but certainly there are a lot of people who wish they could generate content like this and who don't have the talent of Sal Khan to get it right in one go. Because the sad truth is that authoring software to do things like this, I mean, it's not that it's bad; it's that it's nonexistent. I mean, people just take some screen capture software and whatever drawing program they like. I know a lot of people like Microsoft Journal or Microsoft PowerPoint, but the bad news with these things is that you have no editing capability. If you get anything wrong you essentially have to restart your lecture from scratch, and so I claim that this is very similar to typing text on a typewriter. So sure, I mean, with a typewriter you can correct your mistakes. You can restart the page from scratch, or maybe you can scratch things out and write the correct word later, but it's really like the Stone Age of authoring capabilities, and we have all grown accustomed to being able to edit text in a very nonsequential manner. I mean, the order in which you're going to type the letters really doesn't have to be the final order of the text.
You want to, you know, insert sentences, delete some, correct some words, maybe even reorganize and put this paragraph in front of that other paragraph, and so all these modern tools really made the creative process, the authoring process, very nonsequential. And so because I was so appalled by the state of the art of the tools out there, I decided that I could do better, and so I decided to implement my own software. And this is a short video that shows you what it can do. [Playing video] Authoring handwritten lectures with current approaches is challenging because you must get everything right in one go. It is hard to correct mistakes; content cannot be added in the middle; you must carefully plan how much space you need; and audio synchronization is hard because writing tends to be slower. We present Pentimento, which enables the nonsequential authoring of handwritten video lectures. Pentimento relies on a new sparse and structured representation that builds on space-time strokes and adds discrete temporal events for dynamic effects, such as color changes or basic animation. We also decouple audio and visual time and use a simple retiming structure based on discrete correspondences interpolated linearly. This makes it easy to maintain and edit synchronization. Let's look at an authoring session with Pentimento. We can record strokes over time with a standard pen interface. The lecture area is the white rectangle in the center. If we run out of space, we can keep writing in the red safety margin, stop recording, and edit the layout using a familiar vector graphics interface; however, our edits are retroactive and affect the stroke from its inception. The lecture's temporal flow is preserved, and the equation looks as if it was written with the correct size in the first place. We continue our derivation, but we decide that we went too fast and that an extra step might help students. We move this equation down to make room for the new step using another retroactive edit. We move the time slider back to where we want to insert the new content; we press record and add the new line. We perform more layout refinement and complete our demonstration. In this scenario, we have focused on visuals first, and we now move on to recording the audio. We first make the audio and visual timelines visible. We proceed piece by piece and select in the visuals the part that we want to narrate. We press the audio recording button and say our text. The audio gets automatically synchronized with the selected visuals. We proceed to other visuals and record the corresponding audio. We can also select in the audio timeline and record. Our approach relies on discrete synchronization constraints between the audio and visual time, which are visualized as red ellipses in the audio timeline. We can add constraints by selecting the visuals, moving the time slider to the audio time where the appropriate narration occurs, and using the timing menu or keyboard shortcut. Here we set the end of these visuals to occur when this audio is heard. We can also drag constraints to change the audio time or the visual time, and the visual timeline and the main view reflect the change. We can also delete silence, and the visuals get sped up automatically. Once we have recorded the audio, we realize that the derivation could be clearer if we replace the mean mu by e of x. We first make space for the change. We select the strokes we want to replace and press the redraw button.
We write e of x, and the timing of the new content automatically conforms to the old one and preserves audio synchronization. We also realize that some of the narration doesn't have corresponding illustrations. We make space and use draw to add visuals without affecting the audio. We derive a fundamental identity for variance. Variance is usually written sigma square. It is defined as the expectation of the square difference between x and its expectation. We can distribute the square, which gives us the expectation of x square minus two x e of x plus e of x square. We use linearity and take constants such as two and e of x outside of the expectation. We get e of x square minus two e of x times e of x plus e of x square. We clean up this e of x times e of x and get e of x square minus two e of x square plus e of x square. We now cancel one of the two negative e of x square with a positive e of x square and we get the final equation. Variance is equal to e of x square minus the square of e of x. Ray tracing is a fundamental computer graphics algorithm. It allows us to go from a 3D scene to an image. The scene is represented digitally. For example, a sphere is encoded by the x, y, z coordinates of its center and its radius— >> Fredo Durand: This is the topic that got me started on this, because I tried to make one with standard tools, and I started my drawing too big. I didn't have enough space. It was a disaster and I just stopped and didn't retry until I had my tool right. >>: -by a viewpoint, a viewing direction, and a field of view. Our goal is to compute the color of each pixel. The algorithm is as follows: For each pixel in the image we create a ray from the viewpoint to the pixel. Then for each geometric primitive in the scene, we compute the intersection between the ray and the primitive, and we only keep the intersection that is closest to the eye. Once we have found which primitive is visible at the pixel, we need to compute its color, which is called shading. We take into account the position of light sources and cast additional rays— >> Fredo Durand: In this case the audio was recorded first, and then I did the visuals based on the audio. You can do it in whatever order you want. >>: -Pentimento has also been used in video lectures on a variety of topics that include probabilities, bar codes, Magellan's voyage, diffraction, computational geometry, and many others. The executable is included in the submission as well as a quick manual. Thank you. >> Fredo Durand: And so I'm hoping, well, I'm hoping to spend a lot of my summer debugging this thing and getting the UI usable, and hopefully by early fall it'll be released free, open source, blah, blah, blah, so. >>: Do you use it already? >> Fredo Durand: I use it for my class. I've been using it in lecture. There are a number of extra bits and pieces that I modified to make it usable in lecture. I've been enjoying it. I don't think the students have been enjoying it as much [laughter] because teaching new content with a new tool, where you spend your brain power thinking, oh, is it going to crash? Did I screw up this part of the code?, is maybe not the best idea, but it's been kind of fun to use a Wacom tablet in lecture rather than the blackboard or a small tablet PC, yeah. Another completely different topic.
Maybe I'll try to go even faster on this one because I think that Jonathan is going to come to MSR to give a talk soon, and he understands all this a lot better, but just as some advertisement for his talk: this is a compiler that we created to get really high-performance image processing. The two people who really made it happen are Jonathan Ragan-Kelley, who is a grad student finishing with me, and Andrew Adams, who was a postdoc with me and is now at Google. The goal really is to get high performance in image processing, and we all know that these days you can't get good performance without parallelism and that parallelism is hard to achieve. Both the multicore and the SIMD aspects are really tough, but equally important is to achieve locality, meaning that you want your data to stay in the various cache levels as much as possible. This is equally if not more difficult to achieve, and the combination of these two makes writing high-performance code really hard. Usually you have to play with the tradeoff between various aspects. You know, locality and parallelism are the two big goals that you want to achieve, and very often, the price you have to pay is that you're going to need to do redundant work. In image processing, that typically means you organize your computation according to tiles rather than computing the whole image for each stage of your computation, you know. Stage one, whole image. Stage two, whole image. You're going to merge the stages and compute them tile by tile. This maximizes locality and parallelism, but the price you have to pay is that you usually have to do redundant work at the boundary of the tiles. Usually we tend to think of performance coming from a good interplay between powerful hardware and a good algorithm, and that these are the two knobs that we have to make our computation as fast as possible. Most of us are just software people, so all we can do is write a good piece of software, but we think that it's useful to split this notion of a program into two sub-notions. One of them is the algorithm itself, and given an algorithm, given a set of arithmetic computations that you want to achieve essentially, there's a big question of the organization of this computation in space and time. The best choice will give you the best tradeoffs. So what do I mean by the separation of algorithm and organization of the computation? Well, we can start from something very simple, which the compiler people have known and exploited for a long time. Just look at this very simple two-stage blur. So this is a 3 x 3 box filter. The first stage is a blur in x. Double loop on the pixels. Blur in x is just the average of three neighbors—actually it's the sum here, but it's the same, and then we do a blur in y. So you know, this is one piece of program that does this computation, but I can swap the order of those two loops, the for x, for y. It's still the same algorithm. The computation is just organized differently, but in this particular case, I think you get a 15x speedup. Oops, messing up my slide. Just because you get much better locality by doing the loops in the same order things are stored. So this is pretty well known, that the order of the loops can be changed, and most decent compilers will do it, but if you want to get high-performance image processing, we want to take this notion of separating the algorithm from the organization a little bit further.
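[For reference, a minimal sketch of the two loop orderings just described, assuming row-major storage. This is a toy illustration rather than the code from the talk; the actual speedup depends on image size and hardware.]

```cpp
#include <vector>

// Same algorithm, two organizations: only the loop nest differs.
// With row-major storage, putting x in the inner loop walks memory
// contiguously and keeps cache lines hot; the swapped version strides
// through memory and is much slower on large images.
void blur_x_rows_outer(const std::vector<float>& in, std::vector<float>& out,
                       int width, int height) {
    for (int y = 0; y < height; ++y)          // rows outer
        for (int x = 1; x < width - 1; ++x)   // x inner: contiguous accesses
            out[y * width + x] =
                in[y * width + x - 1] + in[y * width + x] + in[y * width + x + 1];
}

void blur_x_cols_outer(const std::vector<float>& in, std::vector<float>& out,
                       int width, int height) {
    for (int x = 1; x < width - 1; ++x)       // swapped order: strided accesses
        for (int y = 0; y < height; ++y)
            out[y * width + x] =
                in[y * width + x - 1] + in[y * width + x] + in[y * width + x + 1];
}
```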
Now, if you look at an actually high-performance version of this 3x3 blur, it might look something like this. Believe it or not, the same algorithm that we had before is still hidden here. I don't expect you to understand what's going on here. You've got some SIMD stuff, or our pairs of loops have turned into four loops because things are tiled, again, to maximize locality. The main message is that, A, this code is really ugly and hard to maintain, and that the changes are pretty deep and global. It's not just that you're going to optimize your inner loop and write it in assembly and all that. It's also that you really reorganized your computation and that your whole code is affected, which in particular means that it's very difficult to have a library approach to this problem. Because a library can optimize every single stage of your pipeline and then you put them together, but for actually really fast code, you want optimization that goes across stages of your pipeline, and libraries don't naturally do this. And this code, by the way, gives you another order of magnitude speedup, so by swapping the order of the loops, we got a factor of 15, and here we get another factor of 11. If you're a MATLAB programmer who thinks that all you have to do to get fast image processing is to code it in C++, well, it depends. If you code this C++, yeah, you'll get really fast image processing, but it's two orders of magnitude faster than a naive—well, a very, very naive C++ version. The ordering of the loops, everybody should get right, but this whole reorganization of the computation is actually hard, both because the low-level mechanics of it are difficult. I mean again, this is pretty ugly code. You need to change things at many levels of your pipeline, but it's also hard just at the high level because you don't know what a good strategy might be. And if you're a company like Adobe, where they do really—and I'm sure Microsoft has people like this too. You know, you have people who will spend months optimizing one pipeline, trying to parallelize it this way, trying tiles here, maybe global computation there, but it takes them a month to come up with one, to implement one strategy, and so maybe you're going to try a different strategy if that one doesn't seem to be the best one. Maybe if you have a lot of time, you're going to implement a third strategy, but by then you just have to ship the product, and you're going to stop. So it's pretty tough to come up with the best option. And so Halide's answer, our compiler's, our language's answer, is to separate the notion of algorithm, which in practice we encode in a simple functional formulation, so here you've got the blurx and blury. You just put it in terms of: the output of this blurx function is this, as a function of its input. And similarly, blury has this expression and uses blurx as input. It's very simple, and this algorithm will not change as you try to optimize its organization. And for this we have a co-language where you have simple instructions that allow you to specify things like tiling, parallelism, SIMD and things like this. What the schedule does is two things. For each pair of input and output functions, so for example, blurx and blury, it specifies the order in which you're going to traverse the pixels for this function. It also specifies when its input should be computed. So blury needs blurx to be computed, so when are we going to compute blurx? Are we going to compute it all at once for the whole image?
Are we only going to compute it for a small subset of pixels around the pixel that we need? These are the two big, high-level decisions that you have to make for each input-output pair. And Jonathan will tell you a lot more about this, but the cool thing is that it's not just a random set of small tricks to optimize your thing. You really get a nice parameterization of the space of tradeoffs and the space of schedules along various axes, which you can specify as the granularity at which you compute things on the one hand, and the granularity at which you store things on the other, and we showed that all these points in the space correspond to different tradeoffs and might be valuable for various algorithms. And that's an animation. Just to, you know, give you a teaser. First of all, this is the C++ code I showed you before. This is the corresponding Halide program. These two pieces of code have exactly the same performance, and by the way, Halide is embedded in C++, so it's reasonably easy to incorporate it into your C++ program. We use embedded—it's an embedded language. To give you a sense of the kind of performance that we can get, Jonathan spent his summer at Adobe last year, and he took one of the stages in the Camera Raw pipeline, the one that does shadows, highlights, clarity. It's a local Laplacian filter algorithm that was developed by Sylvain Paris, and the Adobe version was implemented by one of their really strong programmers. I mean, to give you a sense, this guy is in a team that has only two people. It's him and Thomas Knoll, the inventor of Photoshop, so he's really good. He spent three months to optimize this code. His code was ten times faster than the research code that he started with, but that took him 1500 lines of code just for this stage of the pipeline. So Jonathan spent the summer there, reimplemented this algorithm, and optimized it with Halide, and within one day he had code that was 60 lines instead of 1500 lines and that ran actually two times faster than the Adobe code that we started from. And the thing that's even cooler is that our language targets not only x86, but also GPUs and ARM cores, so in no time, maybe changing the schedule a little bit because the tradeoffs are not the same, you can get the GPU version. Yes? >>: I was wondering, are you constrained by the layout of input data, or is that a flexible thing that this gets to optimize? >> Fredo Durand: In my experience the layout of the input data is not that critical because in— >>: For a 3x3 box blur? >> Fredo Durand: Pardon? >>: For a 3x3 box blur? >> Fredo Durand: Yeah, we've been disappointed. Yeah, I had a master's student that I told, you know, go work on the layout, and let's also allow people to optimize their layout. So far he has come back saying he gets no performance gain. We can talk about this, and you should talk about it with Jonathan. Kind of the intuition is that you have enough intermediate stages, and the prefetchers and the cache systems, especially on an x86, are so good, that as long as you do the granularity of the computation itself right, the layout in the cache is going to end up being right, and things are going to work out okay. It was kind of surprising, yeah. I was not—I had to find another master's subject for this guy, so it was a little bit of a surprise, yeah. So go see Jonathan's talk whenever he visits. The language is open source at Halide-lang.org or something like this. The documentation is still, you know, it's a research project.
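[For reference, here is roughly what the algorithm/schedule separation looks like for the two-stage blur, following the public Halide examples. The scheduling choices and tile sizes below are illustrative, not the tuned schedules discussed in the talk, and exact API details such as the form of realize may differ across Halide versions.]

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Some small input to blur (values kept small to avoid 16-bit overflow).
    Buffer<uint16_t> input(1024, 1024);
    for (int y = 0; y < input.height(); ++y)
        for (int x = 0; x < input.width(); ++x)
            input(x, y) = (uint16_t)((x + y) % 256);

    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // The algorithm: a purely functional definition of each stage.
    blur_x(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3;
    blur_y(x, y) = (blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3;

    // The schedule: how the computation is organized, kept separate.
    blur_y.tile(x, y, xi, yi, 256, 32)   // traverse the output in tiles
          .vectorize(xi, 8)              // SIMD across the inner x dimension
          .parallel(y);                  // multicore across rows of tiles
    blur_x.compute_at(blur_y, x)         // produce blur_x per tile, not whole-image
          .vectorize(x, 8);

    Buffer<uint16_t> out =
        blur_y.realize({input.width() - 2, input.height() - 2});
}
```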
We're hoping to create a bunch of tutorials this summer so that it's a little easier to pick up. Then there's a lot of enthusiasm about it at Adobe and especially at Google, where in particular Zalman Stern, who's the guy who created Lightroom at Adobe and has now moved to Google, is very excited about it and has been contributing a variety of things, including a JavaScript backend, which I don't completely understand, but he wanted to have fun. He's also contributed a lot of exciting stuff, and we're still working on the compiler, and we're going to make it more and more useful, hopefully. All right. Now I come to the actual chunk of my talk that's going to be reasonably coherent, and I want to tell you about a whole research area that my colleague Bill Freeman and I are very excited about, which is to use computation to reveal things that are hard to see with the naked eye. I think that, in general, this is a topic that has been exciting for centuries in science and engineering, and scientists have developed lots of tools to go beyond the limits of human vision, you know, starting with telescopes, microscopes, and things like this, x-rays. And if you're interested in the area, I have a keynote talk I gave last year where I put together a lot of these tools, from outside computer science for most of them, and it was quite exciting to discover some of them that I didn't know about. If you want to look at it, go see my slides. A lot of them are really fun. I especially like the stuff that takes phenomena that are not visual in nature and makes them visible. But the particular subarea that I am going to talk about today is looking at videos where apparently nothing is happening, but in fact, you have a lot of changes and motion that are just below the threshold of human vision. So all these pictures are actually videos, and you can't see anything moving, but that doesn't mean that there's no signal, that there's nothing happening. Certainly this person is alive, so he must be breathing and his heart is beating, and you can't tell from this basic video, but we've developed techniques that you can use to amplify what's going on here and reveal things like these phenomena. So here, we reveal the reddening of the face as blood flows through it, so with each heartbeat, there's a little more blood in the face and you get a little bit redder. To give you a sense, if you have an 8-bit image, that's maybe half a value, but we can extract it and amplify it and show you things like this. Even when your eyes are still, we have microsaccades and microtremors, and we can amplify these. Structures that look still are actually swaying in the wind, et cetera, et cetera. So we started embarking on this journey a while back. In 2005, we published this work called Motion Magnification, where we took videos as input with some motion really hard to see, like this beam structure here that is bending a little bit when the person is playing on the swing. With our technique, we were able to take this very small motion and amplify it, and the way we did this is to use standard computer vision techniques and image-based rendering ideas. We took the video; we used motion analysis; we analyzed feature points, and actually the algorithm that Ce Liu, the main author of this work, developed to analyze motion is quite sophisticated and quite robust to things like occlusion. Given these trajectories, we did a little bit of clustering to extract different modes of motion, so in practice we want to amplify this red segment.
And we do various things like advecting the motion vectors further, doing a little bit of texture synthesis to fill the holes, and at the end we get these beautifully magnified videos. And we were quite excited about these results, but unfortunately at the time the work didn't have as much impact as we hoped for, partially because this technique was quite costly. We're talking about hours of computation to get these results, and the algorithm was sophisticated enough that if you didn't have Ce Liu next to you to make it run, it was really hard to use, to the point where, whenever we wanted to compare to this algorithm for our new work, we've been unable to rerun the old code, as sad as that might be. And so this is partially why we developed a new, much simpler technique, which we call Eulerian video magnification, and that was presented at SIGGRAPH last summer. So this is work with a number of people; the three main ones who made it happen are Hao-Yu Wu, who was a master's student with me at the time; Michael Rubenstein, who's a superstar grad student whom I believe Microsoft should hire, hopefully, if you guys are smart; and Eugene Shih, who is a former grad student working at Quanta Research. Then a number of faculty members who gave opinions. Actually, I should say this is a project where I feel everyone on the list contributed at least one equation, actually did some work. In order to understand the difference between this new work and the old work we did on motion magnification, we need to borrow metaphors from the fluid dynamics community, where they make a very strong distinction between Lagrangian approaches and Eulerian approaches. What they mean by this is that in the Lagrangian perspective, you take a little piece of matter, a little atom of water, and you follow it over time as it travels through the medium. In contrast, Eulerian fans just take one block of space at a fixed location and look at what comes in and out at this local position. So this one is a fixed frame; this one looks at moving frames. And of course, our previous work on motion magnification was essentially a Lagrangian approach, where we would look at each of these pieces of the scene and see how they travel through the image, and then we just make them travel farther. In contrast, the work I'm about to introduce just looks at more or less individual pixels, individual screen locations, looks at the color changes at this location, and amplifies them. And the basic idea is really, really simple. You just look at the color value at each pixel. You consider it as a time series. So this is my time axis, this is my intensity axis, or my red, green, or blue axis. I mean, you know, very standard time-domain signal processing. I can do whatever temporal filter I want, so typically we extract a range of temporal frequencies, so if you're looking at the heartbeat, it's going to be around 1 hertz, give or take. You amplify these time frequencies for this pixel and for all the other pixels independently, and then you just put them back in your video when you're done. In practice, it's a little more sophisticated than this. Not by a lot, actually. I mean, we just add a spatial pyramid on top of this, and that's kind of about it. Really, the basic principle is just independent temporal processing on each pixel. A little bit of spatial pooling to reduce noise, a little bit of pyramid processing to be able to control which spatial frequencies are amplified, but that's about it.
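[A minimal per-pixel sketch of that idea on a synthetic clip, for concreteness: every pixel's time series gets a temporal band-pass, and the band-passed signal is amplified and added back. The band-pass here is a crude difference of two first-order IIR low-pass filters, and the spatial pyramid and pooling are omitted; the frame size, frequencies, and amplification factor are made up for the example, so this is only a toy stand-in for the published method.]

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy Eulerian magnification on a synthetic 16x16 "video": each pixel carries
// a tiny ~1 Hz intensity wobble that we band-pass in time and amplify.
// The band-pass is the difference of two first-order IIR low-pass filters;
// the published method uses better temporal filters plus a spatial pyramid.
int main() {
    const int W = 16, H = 16, frames = 300;
    const double fps = 30.0, alpha = 50.0;   // amplification factor
    const double r1 = 0.4, r2 = 0.05;        // IIR smoothing factors (two cutoffs)
    const double kPi = 3.14159265358979323846;

    std::vector<double> low1(W * H, 128.0), low2(W * H, 128.0);

    for (int t = 0; t < frames; ++t) {
        // Synthesize a frame: constant base intensity plus an invisible wobble.
        std::vector<double> frame(W * H);
        for (int i = 0; i < W * H; ++i)
            frame[i] = 128.0 + 0.3 * std::sin(2.0 * kPi * 1.0 * t / fps);

        // Independent temporal filtering at every pixel location.
        const int center = (H / 2) * W + W / 2;
        double magnified_center = 0.0;
        for (int i = 0; i < W * H; ++i) {
            low1[i] += r1 * (frame[i] - low1[i]);
            low2[i] += r2 * (frame[i] - low2[i]);
            double band = low1[i] - low2[i];            // crude temporal band-pass
            double magnified = frame[i] + alpha * band; // amplify and add back
            if (i == center) magnified_center = magnified;
        }
        std::printf("t=%3d  input=%8.3f  magnified=%8.3f\n",
                    t, frame[center], magnified_center);
    }
}
```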
So I'll first show you how it can amplify color changes, which is unsurprising since that's kind of what we do. Maybe the more surprising aspect, which we really did not expect, is that it also amplifies spatial motion. Yes. >>: The signature that you collect from the visual domain to get the information or the signal is not always decoupled from some noise and some artifacts that may be due to some other processes. How would you go about filtering them and getting the real—the child can move, and you know, the pixel difference could be because of that rather than his heartbeat. >> Fredo Durand: Well, ask me again at the end of the presentation if I didn't answer, but the main thing we do about noise, at least in the first version of the technique, is just spatial averaging, a spatial low-pass. I'm not saying that you couldn't do something smarter, but that's all we're doing. So yeah, as I said, the basic color amplification technique is pretty simple. You take your time series. Typically, especially for the heartbeat, we do a pretty strong low-pass on the image. I told you that the amplitude is less than one value, so you need to average a number of pixels before you get enough signal compared to your noise, but yeah, you just choose your frequency band in the time domain and just amplify. You get these cool results. Actually, that's how the project started. We were working on heart rate extraction, something similar to what the new Kinect has, what our colleagues at the Media Lab have developed, and we needed a debugging tool to understand what we were analyzing, so we decided to just visualize these color variations. It's actually kind of cool. It works on regular video cameras. You don't need special acquisition setups. We were even able to play with footage from the Batman and verify that this guy has a pulse, which he does. You can extract the heart rate, so again, we're not the first ones to do this. We believe we do it better than other people, but we did some validation at a local hospital, and at least for sleeping babies, our technique works as well as a regular monitor. Actually, I was surprised to discover that regular monitors are not that good at extracting heart rate, which was surprising to me, but anyway. So as I said, we mostly developed this technique as a debugging tool for heart rate extraction, and when we looked at the first videos we processed, it was like, this is weird. This person is moving a lot compared to how much he was moving in the original video. What the hell is going on? And we really had no idea. It was really an accidental discovery, and we went to the blackboard and tried to figure out what happens, and that's what I'm going to explain in a minute. So you know, there's the fact that we can make things move, in what seems to be kind of a Lagrangian sense, that these pixels are moving farther. We need to study how local motion relates to intensity changes in order to understand why our simple technique actually does amplify spatial motion. So let's look at what happens at a pixel when we have a translating object. So don't be confused. My horizontal axis is now space, so this is the x coordinate in my image. And the vertical axis is still intensity, so here I have a very simple case where, you know, my intensity profile happens to be a sine wave, and it's moving to the right versus the next frame in blue. Right? I've got my velocity dx/dt here.
So now, we're interested in how a single pixel changes under this translation. So this particular pixel happens to become brighter. This one is becoming darker, so obviously the intensity variation depends on the location, but we want to understand how it relates to spatial motion. And it's kind of obvious if you're looking at this diagram how this little horizontal edge relates to this vertical intensity difference. You've got a triangle here where the missing edge of the triangle is actually the slope of my intensity, so it's essentially the image gradient. So if I have an object translating with this horizontal velocity, the amount of vertical intensity change is going to be proportional to the slope of the intensity, the image gradient. So if you don't like diagrams and you're more of an algebraic person, one way of looking at it is we're interested in the temporal intensity derivative, di/dt, and you can argue that di/dt is di/dx times dx/dt. So that's the gradient, that's the velocity, and this is something that's really well known in the optical flow community. That's how your Lucas–Kanade, your Horn–Schunck algorithms extract velocity given intensity changes. So of course, in our case we don't know this. We could know this, but we don't care about it. All we do is we take this intensity change that's visualized vertically here and make it even bigger, so let's see what happens when we do this. We take this intensity change and we magnify it, and we do the same at each pixel, so in this case here, the intensity change is negative, and we make it even more negative. You see in this image that as we do this, it looks like we transported the sine wave further to the right. And again, if you're an algebraic person, we made di/dt bigger by a factor alpha, which kind of, if you have the same di/dx, the same image gradient, suggests that there was a dx/dt velocity that was bigger. And let me— >>: Is the magnification bounded by the resolution? >> Fredo Durand: Oh, yeah. Yeah, so in the paper we have derivations that relate your spatial frequencies, the velocity, the amplification factor, and we also look at how this compares to the Lagrangian approach. It's kind of interesting because the sensitivity to noise is not the same, so in some regime one is better, and in some regime the other one is better. Let me show you a quick demo. So this is the same, well, this is a Gaussian bump in this case, and you know, it's moving horizontally, right, but now what we're studying is actually the vertical changes. So these are my intensity changes, and if I amplify them vertically, you see that it looks like my Gaussian moved farther to the right. And it's not a perfect approximation. I mean, you see that we overshoot here. Not surprisingly, this is the area where the second derivative of my function is pretty high, because fundamentally, this works only if my local derivative model, my di/dt equals di/dx times dx/dt, is valid, and so this is when your first-order Taylor expansion is a valid approximation. But since we're interested in very small motions that are impossible to see with the naked eye, this is precisely the regime in which this matters, and this kind of explains it. I mean, a number of people had proposed to do temporal processing on pixel values, but nobody had applied it to very tiny motions, and this is where this amplification of spatial translation actually works. I should say this is a visualization that was done by an undergrad, Lili Sun, and these kinds of visualizations are great summer projects for beginning undergrads.
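[A tiny 1D numerical check of that first-order argument, for concreteness: amplifying the per-pixel temporal change of a slightly shifted Gaussian bump is compared against the truly larger shift. The bump width, the shift delta, and the factor alpha are arbitrary values chosen for the example.]

```cpp
#include <cmath>
#include <cstdio>

// 1D illustration of the first-order argument: if I(x, t+1) is I(x, t) shifted
// by a tiny delta, then I + alpha * (dI/dt) is approximately I shifted by
// (1 + alpha) * delta, as long as the Taylor expansion holds.
double bump(double x) { return std::exp(-x * x / (2.0 * 0.5 * 0.5)); } // Gaussian profile

int main() {
    const double delta = 0.01;   // true sub-pixel translation per frame
    const double alpha = 10.0;   // amplification factor

    for (double x = -2.0; x <= 2.0; x += 0.25) {
        double i0 = bump(x);                        // frame t
        double i1 = bump(x - delta);                // frame t+1, shifted right
        double amplified = i0 + alpha * (i1 - i0);  // amplify the temporal change
        double true_shift = bump(x - (1.0 + alpha) * delta); // what we hope to match
        std::printf("x=%5.2f  amplified=%.4f  true=%.4f  error=%+.5f\n",
                    x, amplified, true_shift, amplified - true_shift);
    }
}
```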
I lied a little bit. Our processing is not purely pixel based. As I kind of said, we first do a spatial decomposition, and we do the processing independently on each scale of the pyramid. Some scales might not be amplified because we know that the approximation is not going to work. And so we can amplify motion like this baby breathing, which is Michael Rubenstein's baby, and you see that the spatial motion is really amplified. You also see a little bit of the overshooting where it gets too bright here, and this is the same thing as with the Gaussian bump. You might have seen this one. Maybe I'll skip it. I like this one because it shows that you can do different temporal processing to extract different phenomena. So this is our input video. It's a high-speed, 600-frame-per-second video of a guitar, because we want to capture the audio time frequencies, and if you amplify frequencies between 72 and 92 hertz, you see the motion of the top string, and if you choose a different frequency band, 100 to 120, you see the second string, because this one is an A versus this one that was an E. So you have these degrees of freedom to choose which temporal component you're going to amplify. I think this video is broken, but I'll show it better later. As I said, we did a study that compares the Lagrangian versus the Eulerian approach and showed, at least with a simple model and for simple cases, that there's a regime where the Eulerian is better and one where the Lagrangian is better. Yeah. It's actually kind of cool how some components of noise get canceled with the Eulerian, because in the Lagrangian version, you're computing velocity from pixel variations, and then you're creating pixel variations from these motion vectors, and so you end up accumulating error along the way. And thanks to our colleagues at Quanta Research, we have a web version that people can use. I don't think in this room it's as critical, because we also have MATLAB code, and this is essentially a one-hour project to reproduce, but it's been very useful for people who are not computer scientists, who have been able to try out our technique for their applications. So a number of people have posted YouTube videos created with our code. Someone has been using it for pregnant women, belly visualization; it gets a little freaky and very alien-like. Here's another one. You see we had to use some video stabilization, because we amplify any motion we see, so you probably need to remove camera motion. Somebody else has used it to visualize the spatial flow of blood in the face by using color grading after our process. It's not real science if you don't have a guinea pig, and it turns out some people did apply the method on a guinea pig. I don't remember how they said it. This is the first Eulerian-magnified guinea pig in the world. Actually, one of my colleagues who does biology at Stanford is interested in using pretty much something like this to look at the breathing of some of their lab mice, to see what's going on with their cancer research and to be able to tell earlier whether something has an effect or not. We've been pretty excited, especially about all the interest that we've gotten from people in a lot of different areas, but we're still a little frustrated with the amount of noise that we end up with for some of these videos, because of course, when you amplify pixel variations, it's not just the signal that gets amplified.
And so in order to reduce noise, we came up with a new technique that will be presented at SIGGRAPH this summer, and it was developed by Neal Wadhwa and Michael Rubenstein, still in collaboration with Bill Freeman. In order to understand the difference with the old, as in one-year-old, version, you have to remember what I said: essentially, the Eulerian perspective works when you have a first-order Taylor expansion that's valid, and essentially, it assumes that locally, the image has a linear intensity with respect to space. So if your image is a linear ramp, things work perfectly. Unfortunately, A, noise gets amplified as well as whatever linear ramp you actually had; and B, especially because we need to use multi-scale processing with the pyramid, in practice, the bands of an image pyramid don't look like a bunch of linear ramps. They look more like a bunch of local sine waves, because that's what a band-pass gives you. The kind of artifacts that you see are where this first-order Taylor expansion breaks, and things like here are where you really get overshoot or undershoot, the same thing we saw with the Gaussian. And so we created a new technique that, instead of assuming that images are locally linear ramps, assumes they're locally sine waves, which is great if you want to do multi-scale processing. And the good thing with sine waves is that we know how to shift them. We have the wonderful shift theorem that tells us that if your image undergoes translation, the only thing that happens to your Fourier representation, your sine wave representation, is a change in the phase of your Fourier coefficients. We know how to translate sine waves, so all we have to do is come up with local sine waves, so in practice we use steerable pyramids, which are essentially local wavelets that look a little bit like this, very similar to Gabor wavelets, but the version of the steerable pyramid that we use is actually complex-valued. So most people use image pyramids that are real-valued, but we can get complex-valued pyramids that have both an odd and an even component to them, so just like your Fourier transform is not a bunch of sines, it's a bunch of complex exponentials, complex-valued steerable pyramids give you both a real and an imaginary part, which allows you to get a local notion of phase, which you can then use for processing and for amplifying local motion. There are beautiful Fourier-domain constructions in all this. We give some of the details in the paper, and you can look at Simoncelli's webpage for more information. So whereas the previous approach used a Laplacian pyramid and just did linear amplification of the time variation, we use a steerable pyramid that has both a notion of scale and a notion of orientation and that, instead of being real-valued, is complex-valued, and we turn these complex numbers into an amplitude and a phase, and the processing that we do is on the phase. So we take the local variations of phase and we amplify them, which really directly turns into increasing the local spatial translations, and it works a lot better than the old technique.
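[For intuition, here is a toy 1D version of the phase idea using a global Fourier transform instead of the local, complex steerable pyramid that the actual method relies on: the per-frequency phase difference between two frames of a slightly shifted bump is amplified and the signal is resynthesized, which approximates a much larger shift. The signal, shift, and amplification factor are invented for the example, and a naive DFT is used to keep it self-contained.]

```cpp
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

// Global-Fourier analogue of phase-based magnification on a 1D signal:
// amplify the per-frequency phase difference between two frames rather than
// the intensity difference, then resynthesize. The published method does this
// on local phase in a complex steerable pyramid; this only shows the
// shift-theorem mechanics.
using cd = std::complex<double>;
const double kPi = 3.14159265358979323846;

std::vector<cd> dft(const std::vector<cd>& in, double sign) {
    int n = (int)in.size();
    std::vector<cd> out(n);
    for (int k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (int t = 0; t < n; ++t)
            sum += in[t] * std::polar(1.0, sign * 2.0 * kPi * k * t / n);
        out[k] = sum;
    }
    return out;
}

double bump(double x) { return std::exp(-(x - 32.0) * (x - 32.0) / 50.0); }

int main() {
    const int n = 64;
    const double delta = 0.2, alpha = 9.0;   // tiny shift, amplification factor

    std::vector<cd> f0(n), f1(n);
    for (int i = 0; i < n; ++i) {
        f0[i] = bump(i);          // frame t
        f1[i] = bump(i - delta);  // frame t+1: sub-pixel shift to the right
    }
    std::vector<cd> F0 = dft(f0, -1.0), F1 = dft(f1, -1.0);

    // Amplify the phase change of each Fourier coefficient.
    std::vector<cd> Fmag(n);
    for (int k = 0; k < n; ++k) {
        double dphi = std::arg(F1[k] / F0[k]);               // phase change from motion
        Fmag[k] = F0[k] * std::polar(1.0, (1.0 + alpha) * dphi);
    }
    std::vector<cd> mag = dft(Fmag, +1.0);                   // inverse DFT (unnormalized)

    for (int i = 0; i < n; ++i)
        std::printf("%2d  magnified=%7.4f  true=%7.4f\n",
                    i, mag[i].real() / n, bump(i - (1.0 + alpha) * delta));
}
```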
So in red is the old technique. Let me play it once. We have a Gaussian bump that's moving to the right. Blue is the ground truth. We are increasing the motion. Green is our new approximation, and red is the old one. So let me start it again. The beginning works fine. The old one breaks pretty quickly. You see this very strong overshoot. The new one does better for longer, until eventually things go a little crazy, especially when you get phase wraparound. But roughly speaking, the new method tends to work four times better, meaning you can use an amplification factor that's four times higher. The second big improvement is that it reacts much better to noise. So with the old technique, with noise, you just amplified the noise linearly, at least the temporal component that you selected, and so you get crazy noise like this. You saw effects like this in some of the videos I've shown. Even in the baby one, there was a lot more noise. With the new technique, we only modify the local phase, so we don't amplify noise amplitude. We just shift noise around, and so the local amplitude of noise stays the same. It just moves a lot more, so the noise performance is way, way better, and you should hopefully see this in these results. This is the old version. This is the new version, and you see that noise performance is significantly improved. You don't get the overshoot around the areas of motion. The kind of artifact that you might get is a little bit of ringing. These are other results. This is the old version. This is the new version. Same thing here. You see that the noise performance is really dramatically improved. By the way, we tried to apply denoising to the old technique. In some cases it helps a lot, but in some cases it actually hurts way more than it helps, so the new version is way more robust to noise. We can amplify changes that are as tiny as the small refraction changes when you have hot air around a candle. So in this case, the small changes in index of refraction due to air temperature cause small shifts in the background, and we can visualize this, amplify it, and give you a sense of the air flow. We're currently working on techniques that would be able to extract quantitative information from this and give you air velocity information. In the SIGGRAPH paper we've got a lot more information, including how to play with the space versus overcompleteness tradeoff to get a bigger range of motion. We have some ground truth comparisons to physical phenomena that are ten times bigger, and we show that the technique gives you something reasonable. We encourage you to go see the talk at SIGGRAPH that Neal and Michael will give, and if you want to try it, the webpage I mentioned now has the new version. I'll show you the beginning of a video we created for the NSF that demonstrates this. Actually, I'll show you different pieces if I can get it to work. Yes. Skip the intro. Mostly I'll show it because the explanation uses my video authoring system. >>: ...as blood pulsates through it... the subtle breathing motions of a baby. These changes are hidden in ordinary videos we capture with our cameras and smartphones. Given an input video, for each pixel we analyze the color variation over time and amplify this variation, which gives us a magnified version of the video. >> Fredo Durand: So this is a case, actually, where, because my handwriting is terrible, I first did a version of this mini lecture. Actually, I used a lot of resizing and spatial layout, as you can imagine, and then I asked one of my students who has much better handwriting to just select the stuff and rewrite it, and because I have this redrawing tool, all the audio synchronization was preserved. Let me show one of the cool results. This one I really like.
So this is a high-speed video of an eye that's static, but even when they're static, our eyes move a tiny little bit, and we're hoping that this might be useful to some doctors. >>: When a person fixates on a point, the eye may move from subtle head motions or from involuntary eye movements known as microsaccades. Such motions are very hard to notice, even in this close-up shot of the eye, but become apparent when amplified 150 times. >> Fredo Durand: One final, very brief mention of something that'll be presented at CVPR this summer: we have a new technique with Guha Balakrishnan and John Guttag that analyzes heartbeats from video, but instead of using color information we use motion information. I'll show you why that even works. >>: In this video, we demonstrate that it's possible to analyze cardiac pulse from regular videos by extracting the imperceptible motions of the head caused by blood flow. Recent work has enabled the extraction of pulse from videos based on color changes in the skin due to blood circulation. If you've seen someone blush, you know that pumping blood to the face can produce a color change. In contrast, our approach leverages a perhaps more surprising effect: the inflow of blood doesn't just change the skin's color, it also causes the head to move. This movement is too small to be visible with the naked eye, but we can use video amplification to reveal it. Believe it or not, we all move like bobbleheads at our heart rate, but at a much smaller amplitude than this. Now, you might wonder what causes the head to move like this. At each cardiac cycle, the heart's left ventricle contracts and ejects blood at high speed into the aorta. During the cycle, roughly 12 grams of blood flows to the head from the aorta via the carotid arteries on either side of the neck. It is this influx of blood that generates a force on the head. Due to Newton's Third Law, the force of the blood acting on the head equals the force of the head acting on the blood, causing a reactionary cyclical head movement. To demonstrate this process, we created a toy model using a transparent mannequin head where rubber tubes stand for simplified arteries. Instead of pumping blood we will pump compressed air provided by this air tank, and I can release the air using this valve. Now, watch what happens as I open and close the valve once a second, similar to a normal heart rate. Ready? Here. This motion is fairly similar to the amplified motion of real heads that we've seen before. We exploit this effect to develop a technique that can analyze pulse in regular videos of a person's head. Our method takes an input video of the stationary— >> Fredo Durand: Most of the components are standard. It's Lucas–Kanade tracking, a little bit of PCA, a little bit of extraction, and the cool thing is that at the end, we get not just the heart rate. We also get individual beat locations, which gives us beat variations. This is a histogram of beat lengths according to the ECG or our motion technique. And I'm told that this has diagnostic applications, but don't ask me too much. I'm not the right kind of doctor. And unlike the color version, you can get heart rates from the back of someone's head or in Halloween situations. So really, the thing I'd like to emphasize is that I think this whole area of revealing invisible things using computational tools is very rich, and I think that in vision and graphics, we really have the right intellectual tools to make a lot of things happen, and I encourage everyone to do research in this area. Thank you.
[applause] >> Hugues Hoppe: Since we've gone over time, you're welcome to leave, but we'll stick around for questions. Five minutes for questions. >>: Hi. I'm [indiscernible]. I work here in the [indiscernible] group. So my question for the pulse from head motion is, how robust are you to regular head motion? >> Fredo Durand: Depends how regular regular is. >>: Random head motion. Let's say I'm working in front of my laptop. >> Fredo Durand: Yeah, so I mean, this is the thing we're trying to test, you know, how far we can go in people's activity. So if you're, like, running on a treadmill, it's not going to work. Typing on the keyboard seems to be fine. Then we're trying to find where the exact limit is. Actually, the biggest motion that we have to fight against is breathing, for people who are static. >>: The second part to my question is how much prior information do you need to get that beat signal out? Like, do you actually specify a band like you did in the Eulerian? >> Fredo Durand: The band is pretty broad. I think these days we just specify if it's a newborn, because their heart rate is so much higher. I don't remember what band he's using, but I think it's like 0.5 hertz to 2 or 3 hertz, something like this. It's reasonably broad. >>: So did you accelerate your video amplification algorithms using your language? >> Fredo Durand: Yeah, no. I've hired a student to do this this summer. We want a real-time version on a mobile device, and so yeah, we want them to do this. The phase-based version is a little more tricky. Partially there are degrees of freedom in which steerable pyramids you use exactly, and there's probably—it's actually a more general issue for the compiler, where so far we've assumed that the algorithm is fixed, but we all know that when you try to optimize your processing you might decide, oh, you know, I'll use a cheaper version of blur in order to get the performance that I want. This is exactly the kind of stuff that's going to happen, I think, with the pyramids. Yeah. >>: For your latest head motion work, have you done quantitative comparisons against the previous work? In other words, is it better or comparable? >> Fredo Durand: Yes, so we've been comparing with color, and sometimes it's better. Sometimes it's worse. And we're trying to come up with the best-of-both kind of thing, yeah. It looks like the motion is less sensitive to noise, because as long as you have strong edges that move, you really need a lot of noise before you mess up the notion of an edge, but at reasonable noise levels it's less clear which one is better. >>: Following up on that question: if sometimes one is better and sometimes the other, could you possibly also combine them? >> Fredo Durand: Absolutely, yes. >>: You haven't done that? >> Fredo Durand: I have a student who's supposed to be working on it. I haven't seen any results yet. >>: Have you tried this on, like, a big motion [indiscernible]? >> Fredo Durand: Yeah, it does crazy stuff. I mean. >>: Only if you've got a fixed or lack of motion, track motion [indiscernible]? >> Fredo Durand: Yeah, so the larger the motion, the less you can amplify the high spatial frequencies, so usually the way we get away with motion that's too big is we just give up on the high frequencies. But then at some point you get to such low frequencies that nothing much happens. The phase-based one can go a little further, yeah. >>: Have you looked into retiming videos as well? Because a lot of retiming now is done with optical flow and pixel tracking.
I'm interested in how the grid-based approach works. >> Fredo Durand: I'm interested in this too, yeah. We want to try. Absolutely, yeah. The phase-based method is very interesting because it's halfway between something Lagrangian and something Eulerian, and I think there are lots of things that you usually do with advection that might be interesting to do with this technique, because it's more direct, and so I think, as a result, it will tend to be more robust. Yes. >>: It sounds like you've done some amount of experimentation with things that can't be seen at all and turning them into visual output essentially. Have you looked at all at taking something like a laser bounced off of a faraway window and turning it back into an audio signal? Have you looked at doing that with just visible light? I put my high-speed camera in your window, and I can peel the audio off. You're peeling the audio on both sides, but the laser offers the same [indiscernible] >> Fredo Durand: Yes, we're interested in this. We've made some early experiments, especially—I don't know if you saw, there was a bluish membrane that we had that's a rubber membrane. We started playing with a loudspeaker and trying to record the motion and get some audio back, but it's very preliminary at this point. Yeah, we're very interested in this. All right. >>: There is motion in the sun and the stars which can only be recorded by cameras. >> Fredo Durand: Pardon? >>: We detect that motion in the sun and stars with cameras and big telescopes. Have you tried, like, that motion, and I guess [indiscernible] >> Fredo Durand: No, we haven't tried that. >>: Because it's really low frequency [indiscernible] >> Fredo Durand: Yeah, we should try. >> Hugues Hoppe: Thank you, Fredo. >> Fredo Durand: Thank you. [applause]