
>> Hugues Hoppe: It's my pleasure to welcome and introduce Fredo Durand, who's a professor at MIT,
and he almost needs no introduction. He is one of the superstars in graphics with an amazing number of
papers at SIGGRAPH and SIGGRAPH Asia in recent years. He's one of the founders of the Computational
Photography Symposium and Conference. He has really strong influence in those areas. Today he's
going to give us an overview, I think, of several different areas of his work at MIT, so welcome, Fredo.
>> Fredo Durand: Thank you. [applause] It's a pleasure to be back. Hugues is one of the few people
who can pronounce my name correctly, so always good to be here. Yeah, so today I am going to show
you a few overviews of recent work we've done, and then I'll spend more time on an area that we're
pretty excited about, which is to use computation to reveal things that are hard to see with the naked
eye. But before this, let me show you a couple of things we've done in good old image synthesis, a
compiler for image processing, and something about online education and video lecture authoring. A
few words about global illumination. It's still not a completely solved problem. Simulating the interplay
of light within scenes is still computationally extremely challenging, and anything we can do to make it
more efficient is really needed. With a number of coauthors from Finland, Jaakko Lehtinen being the
lead one, we came up with this new technique that makes better use of the samples that we use to
sample light transport, and we build on this technique called Metropolis Light Transport that seeks to
place sample light rays proportional to the light contribution. It's an old idea by Eric Veach from a while
back. It's pretty good at it. The only problem is that a lot of areas of light space contribute a lot to the
image but are kind of boring. There isn't much going on, and so you shouldn't need that many samples
to compute them efficiently. In particular in a scene like this, this whole area is very uniform, and you
shouldn't need that many samples to compute it, so instead what Jaakko and the guys came up with is
this idea of sampling according not to the image contribution, but according to the contribution to the
gradient of the image. And you see here, this is the sample density. We really focus on these areas
where things matter, so here I visualize it in the space of the image, but under the hood everything
happens in the space of light paths. So you know, a light path could be something that goes from the light,
bounces off the screen, on the ground, and then on you. So it's in this complex abstract space that we
have to sample according to the gradient of the image. And so this path space, in a very
simplified version, could be shown like this: you know, maybe this is my receiving surface, this is where I
want to compute light, and this is my light source. So for each point here I want to take into account the
contribution from all the points here, so abstractly I can show it as a 2D space where this is my light,
these are my pixels or my surface coordinates. In regular Metropolis, you know, because there's this
occluder, this whole area of light space doesn't contribute to the image, and this one contributes a lot.
So with regular Metropolis, your samples after your Markov chain process would be distributed roughly
like this, and then you just count the number of samples for each color and you're done. You get your
approximation of the image. So instead what we're doing is we're sampling this space according to the
finite difference contribution to the gradient of the image. So instead of having one sample, we get like
pairs of samples where we look at the difference. And we place this proportional to the gradient, so we
get a lot of these in this area of the simplified light space and not so many of these in this area. The cool
thing is that not only do we have a placement of the samples that's more according to where things
actually happen, but in addition, in areas like these all these samples do tell us that there isn't anything
happening, that the gradient should be zero in all these areas. So not only do you get a better
representation of places where things change, but you also get more information that even though your
sampling distribution is low, the information that you have is pretty strong and you know that you
should reconstruct something pretty smooth. So the basic idea is pretty simple. It gets fairly messy as
you do it in the general path space with paths of arbitrary lengths and et cetera, and if you don't pay
attention to the math—this would be an example of some of the math, the main math you need to pay
attention to, so instead of approximating [indiscernible] images like this, you're going to get a result like
that because somewhere there's a Jacobian that creeps in, and you're not—if you don't pay attention,
you won't be computing the integral you think you're computing, and I'll refer you to the paper if you
like. Math, integrals, and changes of coordinate. If you do the right thing and take this term into
account, you actually get the true approximation. So this will be presented at SIGGRAPH by Jaakko this summer.
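To make the gradient-domain idea concrete, here is a minimal 1D toy, not the paper's path-space algorithm: most of the sample budget goes to finite-difference gradient estimates where the gradient is large, only a coarse direct estimate is kept, and the result is reconstructed with a small screened Poisson least-squares solve. The names, noise model, and budget below are made up for illustration.

```python
# 1D toy of gradient-domain sampling and reconstruction (NOT the paper's
# path-space algorithm): spend most samples on finite-difference gradient
# estimates where the gradient is large, keep only a coarse direct estimate,
# and reconstruct with a screened Poisson solve.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
truth = np.where(x < 0.5, 0.2, 0.8) + 0.05 * np.sin(20 * x)   # "image" with one edge

def noisy(values, counts, sigma=0.3):
    """Monte Carlo estimate: noise variance shrinks with the sample count."""
    return values + sigma * rng.standard_normal(values.shape) / np.sqrt(counts)

# Place gradient samples proportionally to |gradient| (plus a small floor),
# mimicking "sample according to the contribution to the gradient".
grad_truth = np.diff(truth)
p = np.abs(grad_truth) + 1e-3
p /= p.sum()
grad_counts = np.maximum((5000 * p).astype(int), 1)
grad_est = noisy(grad_truth, grad_counts)
direct_est = noisy(truth, np.full(n, 4))          # coarse, noisy direct estimate

# Screened Poisson reconstruction: minimize
#   lam * ||I - direct_est||^2 + ||D I - grad_est||^2,  with D = finite differences.
lam = 0.05
D = np.diff(np.eye(n), axis=0)
recon = np.linalg.solve(lam * np.eye(n) + D.T @ D,
                        lam * direct_est + D.T @ grad_est)

print("RMSE, direct estimate :", np.sqrt(np.mean((direct_est - truth) ** 2)))
print("RMSE, gradient-domain :", np.sqrt(np.mean((recon - truth) ** 2)))
```

The flat regions come out smooth because the near-zero gradient samples there are strong evidence that nothing is changing, which is exactly the argument above.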
Now, no transition because the next topic is completely different. Something I've become
very excited about is the potential of online education and digital education in general. I'm very
interested in this style of video lecture that you probably saw popularized by the Khan Academy where
you see a virtual whiteboard lecture where you see the writing going on as the person is narrating what
they're doing. A lot of people find these compelling. I won't get into the debate of whether that's the
best format, but certainly there are a lot of people who wish they could generate content like this and
who don't have the talent of Sal Khan to get it right in one go. Because the sad truth is that authoring
tools to do things like this, I mean, it's not that they're bad; it's that they're nonexistent. I mean
people just take some screen capture software and whatever drawing program they like. I know a lot of
people like Microsoft Journal or Microsoft PowerPoint, but the bad news with these things is that you
have no editing capability. If you get anything wrong you essentially have to restart your lecture from
scratch, and so I claim that this is very similar to typing text on a typewriter. So sure, I mean with a
typewriter you can correct your mistakes. You can restart the page from scratch, or maybe you can
scratch things out and write the correct word later, but it's really like the Stone Age of authoring capabilities,
and we have all grown accustomed to being able to edit text in a very nonsequential manner. I mean the
order in which you're going to type the letters really doesn't have to be the final order of the text. You
want to, you know, insert sentences, delete some, correct some words, maybe even reorganize or put
this paragraph in front of that other paragraph, and so all these modern tools really made the creative
process, the authoring process, very nonsequential. And so because I was so appalled by the state of
the art of tools out there, I decided that I could do better, and so I decided to implement my own
software. And this is a short video; I'll show you what it can do. [Playing video.] Authoring
handwritten lectures with current approaches is challenging because you must get everything right in
one go. It is hard to correct mistakes; content cannot be added in the middle; you must carefully plan
how much space you need and audio synchronization is hard because writing tends to be slower. We
present Pentimento, which enables the nonsequential authoring of handwritten video lectures.
Pentimento relies on a new sparse and structured representation that builds on space-time strokes and
adds discrete temporal events for dynamic effects, such as color changes or basic animation. We also
decouple audio and visual time and use a simple retiming structure based on discrete correspondences
interpolated linearly. This makes it easy to maintain and edit synchronization.
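As an aside from the transcript: the retiming structure just described, discrete audio/visual correspondences interpolated linearly, can be sketched in a few lines. This is a hypothetical illustration, not Pentimento's code; the correspondence values are invented.

```python
# Hypothetical sketch of the retiming described above: a handful of
# (visual_time, audio_time) correspondence points, interpolated linearly,
# map stroke timestamps onto the narration timeline.
import numpy as np

# Correspondence constraints, e.g. "these strokes should end when this audio is heard".
visual_keys = np.array([0.0, 4.0, 9.0, 15.0])   # seconds on the drawing timeline
audio_keys  = np.array([0.0, 6.5, 11.0, 20.0])  # seconds on the narration timeline

def visual_to_audio(t_visual):
    """Piecewise-linear map from visual time to audio time."""
    return np.interp(t_visual, visual_keys, audio_keys)

stroke_times = np.array([1.0, 4.5, 12.0])
print(visual_to_audio(stroke_times))  # where each stroke appears in the final video
```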
Let's look at an authoring session with Pentimento. We can record strokes over time with a standard pen interface. The lecture
area is the white rectangle in the center. If we run out of space, we can keep writing in the red safety
margin, stop recording, and edit the layout using a familiar vector graphics interface; however, our edits
are retroactive and affect the stroke from its inception. The lecture's temporal flow is preserved, and
the equation looks as if it was written with the correct size in the first place. We continue our
derivation, but we decide that we went too fast and that an extra step might help students. We move
this equation down to make room for the new step using another retroactive edit. We move the time
slider back to where we want to insert the new content; we press record and add the new line. We
perform more layout refinement and complete our demonstration. In this scenario, we have focused on
visuals first, and we now move on to recording the audio. We first make the audio and visual timelines
visible. We proceed piece by piece and select in the visuals the part that we want to narrate. We press
the audio recording button and say our text. The audio gets automatically synchronized with the
selected visuals. We proceed to other visuals and record the corresponding audio. We can also select in
the audio timeline and record. Our approach relies on discrete synchronization constraints between the
audio and visual time, which are visualized as red ellipses in the audio timeline. We can add constraints
by selecting the visuals, moving the time slider to the audio time where the appropriate narration
occurs, and using the timing menu or keyboard shortcut. Here we set the end of these visuals to occur
when this audio is heard. We can also drag constraints to change the audio time or the visual time, and
the visual timeline and the main view reflect the change. We can also delete silence and the visuals get
sped up automatically. Once we have recorded the audio, we realize that the derivation could be
clearer if we replace the mean mu by e of x. We first make space for the change. We select the strokes
we want to replace and press the redraw button. We write e of x and the timing on the new content
automatically conforms to the old one and preserves audio synchronization. We also realize that some
of the narration doesn't have corresponding illustrations. We make space and use draw to add visuals
without affecting the audio. We derive a fundamental identity for variance. Variance is usually written
sigma squared. It is defined as the expectation of the squared difference between X and its expectation.
We can distribute the square, which gives us the expectation of X squared minus two X times E of X, plus
the square of E of X. We use linearity and take constants such as two and E of X outside of the expectation.
We get the expectation of X squared, minus two times E of X times E of X, plus the square of E of X. We
clean up this E of X times E of X and get the expectation of X squared minus two times the square of E of X,
plus the square of E of X. We now cancel one of the two negative squares of E of X with the positive one,
and we get the final equation: variance is equal to the expectation of X squared minus the square of the
expectation of X.
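For readability, here is the derivation being narrated, written out in standard notation:

```latex
\begin{align*}
\operatorname{Var}(X) &= \mathbb{E}\!\left[\left(X - \mathbb{E}[X]\right)^2\right] \\
  &= \mathbb{E}\!\left[X^2 - 2X\,\mathbb{E}[X] + \mathbb{E}[X]^2\right] \\
  &= \mathbb{E}[X^2] - 2\,\mathbb{E}[X]\,\mathbb{E}[X] + \mathbb{E}[X]^2 \\
  &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 .
\end{align*}
```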
Ray tracing is a fundamental computer graphics algorithm. It allows us to go from a 3D scene
to an image. The scene is represented digitally. For example, a sphere is encoded by the x,y,z
coordinates of its center and its radius—
>> Fredo Durand: This is the topic that got me started on this because I tried to make one with standard
tools, and I started my drawing too big. I didn't have enough space. It was a disaster and I just stopped
and didn't retry until I had my tool right.
>>: -by a viewpoint, a viewing direction, and a field of view. Our goal is to compute the color of each
pixel. The algorithm is as follows: For each pixel in the image we create a ray from the viewpoint to the
pixel. Then for each geometric primitive in the scene, we compute the intersection between the ray and
the primitive, and we only keep the intersection that is closest to the eye. Once we have found which
primitive is visible at the pixel, we need to compute its color, which is called shading. We take into
account the position of light sources and cast additional rays—
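A minimal sketch of the loop the narration describes, with a hypothetical two-sphere scene, a single point light, and Lambertian shading; shadow rays and the camera model are simplified, and none of this is the demo's actual code.

```python
# Minimal sketch of the ray-tracing loop described in the narration:
# for each pixel, build a ray, intersect it with every primitive, keep the
# closest hit, then shade it from the light direction (shadow rays omitted).
import numpy as np

WIDTH, HEIGHT = 64, 48
eye = np.array([0.0, 0.0, 0.0])
light_pos = np.array([5.0, 5.0, -2.0])
spheres = [  # (center, radius, color)
    (np.array([0.0, 0.0, -5.0]), 1.0, np.array([1.0, 0.2, 0.2])),
    (np.array([1.5, 0.5, -6.0]), 1.0, np.array([0.2, 0.2, 1.0])),
]

def intersect_sphere(origin, direction, center, radius):
    """Closest positive ray parameter t, or None if the ray misses the sphere."""
    oc = origin - center
    b = 2.0 * np.dot(direction, oc)
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * c          # direction is unit length, so a == 1
    if disc < 0:
        return None
    t = (-b - np.sqrt(disc)) / 2.0
    return t if t > 1e-4 else None

image = np.zeros((HEIGHT, WIDTH, 3))
for j in range(HEIGHT):
    for i in range(WIDTH):
        # Ray from the viewpoint through the pixel (simple pinhole camera).
        px = (i + 0.5) / WIDTH - 0.5
        py = 0.5 - (j + 0.5) / HEIGHT
        direction = np.array([px, py, -1.0])
        direction /= np.linalg.norm(direction)

        closest_t, hit = None, None
        for center, radius, color in spheres:          # keep the nearest intersection
            t = intersect_sphere(eye, direction, center, radius)
            if t is not None and (closest_t is None or t < closest_t):
                closest_t, hit = t, (center, color)

        if hit is not None:                            # shading (Lambertian)
            center, color = hit
            point = eye + closest_t * direction
            normal = (point - center) / np.linalg.norm(point - center)
            to_light = light_pos - point
            to_light /= np.linalg.norm(to_light)
            image[j, i] = color * max(np.dot(normal, to_light), 0.0)
```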
>> Fredo Durand: In this case the audio was recorded first, and then I did the visual based on the audio.
You can do it in whatever order you want.
>>: -Pentimento is also used in video lectures on a variety of topics that include probabilities, bar codes,
Magellan's voyage, diffraction, computational geometry, and many others. The executable is included in
the submission as well as a quick manual. Thank you.
>> Fredo Durand: And so I'm hoping, well, I'm hoping to spend a lot of my summer debugging this thing
and getting the UI usable and hopefully by early fall it'll be released free, open source, blah, blah, blah
so.
>>: Do you use it already?
>> Fredo Durand: I use it for my class. I've been using it in lecture. There are a number of extra bits and
pieces that I modified to make it usable in lecture. I've been enjoying it. I don't think the students have
been enjoying it as much [laughter] because teaching new content with a new tool where you spend
your brain power thinking oh, is it going to crash? Did I screw up this part of code, is maybe not the best
idea, but it's been kind of fun to use a Wacom tablet in lecture rather than the blackboard or a small
tablet PC, yeah. Another completely different topic. Maybe I'll try to go even faster on this one because
I think that Jonathan is going to come to MSR to give a talk soon, and he understands all this a lot better,
but just as some advertisement for his talk, this is a compiler that we created to get really high-performance
image processing. The two people who really made it happen are Jonathan Ragan-Kelley,
who is a grad student finishing with me, and Andrew Adams, who is a postdoc with me and is now at
Google. The goal really is to get high performance in image processing, and we all know that these days
you can't get good performance without parallelism and that parallelism is hard to achieve. And both
the multicore and the SIMD aspect are really tough, but equally important is to achieve locality, meaning
that you want your data to stay in the various cache levels as much as possible. This is equally if not
more difficult to achieve, and the combination of these two makes writing high-performance code really
hard. Usually you have to play with the tradeoff between various aspects. You know locality and
parallelism are the two big goals that you want to achieve, and very often, the price you have to pay is
that you're going to need to do redundant work. In image processing, that's typically that you organize
your computation according to tiles rather than computing the whole image for each stage of your
computation, you know. Stage one, whole image. Stage two, whole image. You're going to merge the
stages and compute tile by tile; this maximizes locality and parallelism, but the price you have to pay is
that you usually have to do redundant work at the boundary of the tiles. Usually we tend to
think of performance coming from good interplay between powerful hardware and good algorithm and
that these are the two knobs that we have to make our computation as fast as possible. For most of us,
we're just software people, so all we can do is write a good piece of software, but we think that it's
useful to split this notion of a program into two sub-notions. One of them is the algorithm itself, and
given an algorithm, given a set of arithmetic computations that you want to achieve essentially, there's a
big question of the organization of this computation in space and time. The best choice will give you the
best tradeoffs. So what do I mean by the separation of algorithm and organization of the computation?
Well, we can start from something very simple, which the compiler people have known and exploited
for a long time. Just look at this very simple two-stage blur. So this is a 3 x 3 box filter. The first stage is
a blur in x. Double loop on the pixels. Blur in x is just the average of three neighbors—actually it's the
sum here, but it's the same, and then we do a blur in y. So you know, this is one piece of program that
does this computation, but I can swap the order of those two loops, the for x, for y. It's still the same
algorithm. The computation is just organized differently, but in this particular case, I think you get a 15x
speed up. Oops, messing up my slide. That's just because you get much better locality by doing the loops in
the same order in which things are stored.
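To make the loop-swap point concrete, here is the two-stage blur written both ways in plain Python. The two orderings are the same algorithm and produce the same result; the speedup quoted in the talk comes from compiled, row-major code, where the x-innermost traversal touches memory contiguously. This is an illustrative sketch, not the slide's code.

```python
# Same 3x3 box-blur algorithm, two organizations of the computation.
# In compiled, row-major code, traversing x in the inner loop touches memory
# contiguously and is what yields the large speedup mentioned in the talk;
# in pure Python the point is only that the two orderings are equivalent.
import numpy as np

def blur_x_outer(img):
    h, w = img.shape
    blur_x = np.zeros_like(img)
    blur_y = np.zeros_like(img)
    for x in range(1, w - 1):          # x outermost: strided memory access
        for y in range(h):
            blur_x[y, x] = img[y, x - 1] + img[y, x] + img[y, x + 1]
    for x in range(1, w - 1):
        for y in range(1, h - 1):
            blur_y[y, x] = blur_x[y - 1, x] + blur_x[y, x] + blur_x[y + 1, x]
    return blur_y

def blur_y_outer(img):
    h, w = img.shape
    blur_x = np.zeros_like(img)
    blur_y = np.zeros_like(img)
    for y in range(h):                 # y outermost: rows visited contiguously
        for x in range(1, w - 1):
            blur_x[y, x] = img[y, x - 1] + img[y, x] + img[y, x + 1]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            blur_y[y, x] = blur_x[y - 1, x] + blur_x[y, x] + blur_x[y + 1, x]
    return blur_y

img = np.random.rand(64, 64)
assert np.allclose(blur_x_outer(img), blur_y_outer(img))  # same algorithm
```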
So this is pretty well known, that the order of the loops can be changed, and most decent compilers will do it, but if you want to get high-performance image
processing, we want to take this notion of separating the algorithm from the organization a little bit
further. Because if you look at an actually high-performance version of this 3x3 blur, it might look like
something like this. Believe it or not, the same algorithm that we had before is still hidden here. I don't
expect you to understand what's going on here. You've got some SIMD stuff or pairs of loops have
turned into four loops because things are tiled, again, to maximize this locality and visual locality. The
main message is that A, this code is really ugly and hard to maintain and that the changes are pretty
deep and global. It's not just that you're going to optimize your inner loop and write it in assembly and
all that. It's also that you really reorganized your computation and that your whole code is affected,
which in particular means that it's very difficult to have a library approach to this problem. Because the
library can optimize every single stage of your pipeline and then you put them together, but for actually
really fast code, you want optimization that goes across stages of your pipeline, and libraries don't
naturally do this. And this code, by the way, gives you another order of magnitude speed up, so by
swapping the order of the loops, we got a factor of 15, and here we get another factor of 11. If you're
MATLAB programmers who think that all you have to do to get fast image processing is to code it in C++,
well, it depends. If you code this C++, yeah, you'll get really fast image processing, but it's two orders of
magnitude faster than a naive—well, a very, very naive C++. The ordering of the loop, everybody should
get this right, and so this whole reorganization of computation is actually hard, both because the low
level mechanics of it are difficult. I mean again, this is pretty ugly code. You need to change things at
many levels of your pipeline, but it's also hard just at the high level because you don't know what a good
strategy might be. And if you're a company like Adobe, where they do really—and I'm sure Microsoft
has people like this too. You know, you have people who will spend months optimizing one pipeline,
trying to parallelize it this way, trying tiles here, maybe global computation there, but it takes them a
month to come up with one, to implement one strategy, and so maybe you're going to try a different
strategy if that one doesn't seem to be the best one. Maybe if you have a lot of time, you're going to
implement a third strategy, but by then you just have to ship the product, and you're going to stop. So
this is pretty tough to come up with the best option.
And so Halide's answer, our compiler's, our language's answer, is to separate the notion of
algorithm, which in practice we encode in a simple functional formulation, so here you've got the blurx
and blury. Just put it in terms of the output of this blurx function is this as a function of its input. And
similarly, blury has this expression and uses blurx as input. It's very simple and this algorithm will not
change as you try to optimize its organization. And for this we have a co-language where you have
simple instructions that allow you to specify things like tiling, parallelism, SIMD and things like this.
What the schedule does is two things. For each pair of input and output functions, so for example, blurx
and blury, it specifies the order in which you're going to traverse the pixels for this function. It also
specifies when its input should be computed. So blury needs blurx to be computed, so when are we
going to compute blurx?
Are we going to compute it all at once for the whole image? Are we only going to compute it for a small
subset of pixels around the pixel that we need? These are the two big, high-level decisions that you
have to make for each input-output pair.
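A plain-Python sketch of those two decisions for the blur example, conceptual only, not Halide syntax: schedule A computes blurx for the whole image before blury, while schedule B computes blury tile by tile and recomputes the blurx rows each tile needs, trading redundant work for locality and easy parallelism.

```python
# Two "schedules" for the same blur-x-then-blur-y algorithm.
# Schedule A computes blur_x for the whole image before blur_y (maximal reuse,
# poor producer/consumer locality). Schedule B computes blur_y tile by tile and
# recomputes the few blur_x rows each tile needs (better locality and easy
# parallelism, at the price of redundant work on tile boundaries).
# Conceptual sketch only; Halide expresses these choices declaratively.
import numpy as np

def blur_x_row(img, y):
    return img[y, :-2] + img[y, 1:-1] + img[y, 2:]

def schedule_a_whole_image(img):
    h = img.shape[0]
    bx = np.stack([blur_x_row(img, y) for y in range(h)])       # stage 1, whole image
    return bx[:-2] + bx[1:-1] + bx[2:]                           # stage 2, whole image

def schedule_b_tiled(img, tile=32):
    h = img.shape[0]
    out = np.zeros((h - 2, img.shape[1] - 2))
    for y0 in range(0, h - 2, tile):                             # one strip of output rows
        y1 = min(y0 + tile, h - 2)
        # Recompute exactly the blur_x rows this tile consumes (y0 .. y1 + 1),
        # some of which were already computed for the previous tile.
        bx = np.stack([blur_x_row(img, y) for y in range(y0, y1 + 2)])
        out[y0:y1] = bx[:-2] + bx[1:-1] + bx[2:]
    return out

img = np.random.rand(256, 256)
assert np.allclose(schedule_a_whole_image(img), schedule_b_tiled(img))
```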
And Jonathan will tell you a lot more about this, but the cool thing is that it's not just a random set of small tricks to optimize your thing. You really get a nice
parameterization of the space of tradeoffs and the space of schedules along various axes which you can
specify as the granularity at which you compute things on the one hand, and the
granularity at which you store things on the other, and we showed that all these points in the space correspond
to different tradeoffs and might be valuable for various algorithms. And that's an animation. Just to,
you know, give you a teaser. First of all, this is the C++ code I showed you before. This is the
corresponding Halide program. These two pieces of code have exactly the same performance, and by
the way, Halide is embedded in C++, so it's reasonably easy to incorporate it into your C++ program. We
use embedded—it's an embedded language. To give you a sense of the kind of performance that we
can get, Jonathan spent his summer at Adobe last year, and he took one of the stages in the Camera Raw
pipeline, the one that does shadow/highlight and clarity. It's a local Laplacian filter algorithm that was
developed by Sylvain Paris, and the Adobe version was implemented by one of their really strong
programmers. I mean to give you a sense, this guy is in a team that has only two people. It's him and
Thomas Knoll, the inventor of Photoshop, so he's really good. He spent three months to optimize this
code. His code was ten times faster than the research code that he started with, but that took him 1500
lines of code just for this stage of the pipeline. So Jonathan spent the summer there, reimplemented
this algorithm, and optimized it with Halide, and within one day he had code that was 60 lines instead of
1500 lines and that actually ran two times faster than the Adobe code that we started from. And the
thing that's even cooler is that our language targets not only x86, but also GPUs and ARM cores, so in no
time, maybe changing the schedule a little bit because the tradeoffs are not the same, you can
get the GPU version. Yes?
>>: I was wondering are you constrained by the layout of input data, or is that a flexible thing that this
gets to optimize?
>> Fredo Durand: In my experience, the layout of the input data is not that critical because in—
>>: A 3x3 box blur?
>> Fredo Durand: Pardon?
>>: For a 3x3 box blur?
>> Fredo Durand: Yeah, we've been disappointed. Yeah I had a master's student that I told, you know,
go work on the layout, and let's also allow people to optimize their layout. So far he has come back
saying I get no performance gain. We can talk about this, and you should talk about it with Jonathan.
Kind of the intuition is that you have enough intermediate stages, and the prefetchers and the cache
systems, especially on an x86 are so good, that as long as you do the granularity of the computation
itself right, the layout in the cache is going to end up being right, and things are going to work out okay.
It was kind of surprising, yeah. I was not—I had to find another master's subject for this guy, so it was a
little bit of a surprise, yeah. So go see Jonathan's talk whenever he visits. The language is open source
at Halide-lang.org or something like this. The documentation is still, you know, it's a research project.
We're hoping to create a bunch of tutorials this summer so that's a little easier to pick up. Then there's
a lot of enthusiasm about it at Adobe and especially at Google where, in particular Zalman Stern, who's
the guy who created Lightroom at Adobe and now moved to Google, is very excited about it and has
been contributing a variety of things, including a JavaScript backend, which I don't completely
understand, but he wanted to have fun. He's also contributed a lot of exciting stuff, and we're still
working on the compiler, and we're going to make it more and more useful hopefully. All right. Now I
come to the actual chunk of my talk that's going to be reasonably coherent, and I want to tell you about
a whole research area that my colleague Bill Freeman and I are very excited about, which is to use
computation to reveal things that are hard to see with the naked eye. I think that, in general, this is a
topic that has been exciting for centuries in science and engineering, and scientists have developed lots
of tools to go beyond the limits of human vision, you know, starting with telescopes, microscopes, and
things like this, x-ray. And if you're interested in the area, I have a keynote talk I gave last year where I
put together a lot of these tools from outside computer science for most of them, and it was quite
exciting to discover some of them that I didn't know about. If you want to look at it, go see my slides.
A lot of them are really fun. I especially like the stuff that takes phenomena that are not visual in nature
and make them visible. But the particular subarea that I am going to talk about today is looking at
videos where apparently nothing is happening, but in fact, you have a lot of changes and motion that
are just below the threshold of human vision. So all these pictures are actually videos, and you can't see
anything moving but that doesn't mean that there's no signal, there's nothing happening. Certainly this
person is alive, so he must be breathing and his heart is beating, and you can't tell from this basic video,
but we've developed techniques that you can use to amplify what's going on here and reveal things like
these phenomena. So here, we reveal the reddening of the face as blood flows through it, so with each
heartbeat, there's a little more blood in the face and you get a little bit redder. To give you a sense, if
you have an 8-bit image, it's maybe half a value, but we can extract it and amplify it and show you
things like this. Even when your eyes are still, we have micro saccades and micro tremors, and we can
amplify these. Structures that look still are actually swaying in the wind, et cetera, et cetera. So we
started embarking on this journey a while back. In 2005, we published this work called Motion
Magnification where we took videos as input with some motion really hard to see, like this beam
structure here is bending a little bit when the person is playing with the swing. With our technique, we
were able to take this very small motion and amplify it, and the way we did this is use standard
computer vision techniques and image-based rendering ideas. We took the video; we used motion
analysis; we analyzed feature points, and actually the algorithm that Ce Liu, the main author of this
work, developed to analyze motion is quite sophisticated and quite robust to things like occlusion.
Given these trajectories, we did a little bit of clustering to extract different modes of motion, so in
practice we want to amplify this red segment. And we do various things like advecting the motion
vector further, doing a little bit of texture synthesis to fill the holes, and at the end we get these
beautifully magnified videos. And we were quite excited about these results, but unfortunately at the
time the work didn't have as much impact as we hoped for, partially because this technique was quite
costly. We're talking about hours of computation to get these results and the algorithm was
sophisticated enough that if you didn't have Ce Liu next to you to make it run, it was really hard to use
to the point where, whenever we wanted to compare to this algorithm for our new work, we've
been unable to rerun the old code, as sad as it might be. And so this is partially why we developed a
new, much simpler technique which we call Eulerian video magnification and that was presented at
SIGGRAPH last summer. So this is work with a number of people, the three main ones who made it
happen are Hao-Yu Wu, who is a master's student with me at the time; Michael Rubenstein, who's a
superstar grad student whom I believe Microsoft should hire, hopefully, if you guys are smart; and
Eugene Shih, who is a former grad student working at Quanta Research. Then a number of faculty
members who gave opinions. Actually, I should say that this is a project where I feel everyone on the list
contributed at least one equation, actually did some work. In order to understand the difference
between this new work and the old work we did on motion magnification, we need to borrow
metaphors from the fluid dynamics community, where they make the very strong distinction between
Lagrangian approaches and Eulerian approaches. What they mean by this is that in the Lagrangian
perspective, you take a little piece of matter, a little atom of water, and you follow it over time as it
travels through the medium. In contrast, Eulerian fans just take one block of space at the same location
and look at what comes in and out at this local position. So this one is a fixed frame; this one looks
at moving frames. And of course, our previous work on motion magnification was essentially a
Lagrangian approach where we would look at each of these pieces of the scene and see how they travel
through the image, and then we just make them travel farther. And in contrast, the work I'm about to
introduce just looks at more or less individual pixels, individual screen locations, look at the color
changes at this location and amplifies them. And the basic idea is really, really simple. You just look at
the color value at each pixel. You consider it as a time series. So this is my time axis, this is my intensity
axis, or my red, green, or blue axis. I mean, you know, very standard time domain signal processing. I
can do whatever temporal filter I want, so typically we extract a range of temporal frequencies, so if
you're looking at the heartbeat, it's going to be around 1 hertz, give or take. You amplify these time
frequencies for this pixel and for all the other pixels independently, and then you just put them back in
your video when you're done. In practice, it's a little more sophisticated than this. Not by a lot actually.
I mean we just add a spatial pyramid on top of this, and that's kind of about it. Really, the basic principle
is just independent temporal processing on each pixel. A little bit of spatial pooling to reduce noise, a
little bit of pyramid processing to be able to control which spatial frequencies are amplified, but that's
about it.
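The pipeline just described fits in a few lines of numpy/scipy. This is a sketch of the idea, spatial low pass, per-pixel temporal band pass, amplify, add back; it is not the authors' released MATLAB code, and the parameters are placeholders.

```python
# Sketch of linear Eulerian video magnification: spatially low-pass each frame,
# band-pass every pixel's time series around the frequency of interest,
# multiply that band by alpha, and add it back to the original video.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import butter, filtfilt

def magnify(video, fps, f_lo=0.8, f_hi=1.2, alpha=50.0, spatial_sigma=10.0):
    """video: float array of shape (T, H, W), grayscale, values in [0, 1]."""
    # Spatial pooling to beat sensor noise (the heartbeat signal is well below
    # one 8-bit intensity level per pixel).
    blurred = np.stack([gaussian_filter(f, sigma=spatial_sigma) for f in video])

    # Temporal band-pass, applied independently to every pixel's time series.
    b, a = butter(2, [f_lo, f_hi], btype="bandpass", fs=fps)
    band = filtfilt(b, a, blurred, axis=0)

    # Amplify the selected temporal band and put it back into the video.
    return np.clip(video + alpha * band, 0.0, 1.0)

# Example: 150 frames at 30 fps, amplify variations around 1 Hz (the pulse).
video = np.random.rand(150, 120, 160)
out = magnify(video, fps=30)
```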
So I'll first show you how it can amplify color changes, which is unsurprising since that's kind of what we do. Maybe the more surprising aspect that we really did not expect is that it also amplifies
spatial motion. Yes.
>>: The process signature that you collect from the visual domain to get the information or the signal is
not always decoupled from some noise and some artifacts that may be due to some other processes.
How would you go about filtering them and getting the real—the child can move, and you know, the
pixel difference could be because of that rather than his heartbeat.
>> Fredo Durand: Well, ask me again at the end of the presentation if I didn't answer, but the main
thing we do about noise, at least in the first version of the technique, is just spatial averaging, spatial low
pass. I'm not saying that you couldn't do something smarter, but that's all we're doing. So yeah, as I
said, the basic color amplification technique is pretty simple. You take your time series. Typically,
especially for the heartbeat, we do a pretty strong low pass on the image. I told you that the amplitude
is less than one value, so you need to average a number of pixels before you get enough signal
compared to your noise, but yeah, you just choose your frequency band in the time domain and just
amplify. You get these cool results. Actually that's how the project started. We were working on a
heart rate extraction, something similar to what the new Kinect has, what our colleagues at the media
lab have developed, and we needed a debugging tool to understand what we were analyzing, so we
decided to just visualize these color variation changes. It's actually kind of cool. It works on regular
video cameras. You don't need to do special acquisition set ups. We were even able to play with
footage from the Batman and verify that this guy has a pulse, which he does. You can extract the heart
rate, so again, we're not the first ones to do this. We believe we do it better than other people, but we
did some validation at a local hospital, and at least for sleeping babies, our technique works as well as a
regular monitor. Actually, I was surprised to discover the fact that regular monitors
are not that good at extracting heart rate, which was surprising to me, but anyway. So as I said, we
mostly developed this technique as a debugging tool for heart rate extraction, and when we looked at
the first videos we processed, it was like, this is weird. This person is moving a lot compared to how
much he was moving in the original video. What the hell is going on? And we really had no idea. It was
really an accidental discovery, and we went to the blackboard and tried to figure out what happens, and
that's what I'm going to explain in a minute. So you know, the fact that we can make things move in
what seems to be kind of a Lagrangian aspect that, you know, these pixels are moving farther away. We
need to study how local motion relates to intensity changes in order to understand why our simple
technique actually does amplify spatial motion. So let's look at what happens at a pixel when we have a
translating object. So don't be confused. My horizontal axis is now space, so this is the X coordinate in
my image. And the vertical axis is still intensity, so here I have a very simple case where, you know, my
intensity profile happens to be a sine wave, and it's moving to the right versus the next frame in blue.
Right? I've got my velocity dx/dt here. So now, we're interested in how a single pixel changes under this
translation. So this particular pixel happens to become brighter. This one is becoming darker, so
obviously the intensity variation depends on the location, but we want to understand how it relates to
spatial motion. And it's kind of obvious if you're looking at this diagram how this little horizontal edge
relates to this vertical intensity difference. You've got a triangle here where the missing edge of the
triangle is actually the slope of my intensity, so it's essentially the image gradient. So if I have an object
translating with this horizontal velocity, the amount of vertical intensity change is going to be
proportional to the slope of the intensity of the image gradient. So if you don't like diagrams and you're
more of an algebraic person, one way of looking at it is we're interested in the temporal intensity
derivative, di/dt, and you can argue that di/dt is di/dx times dx/dt, so that's the gradient. That's the
velocity, and this is something that's really well known in the optical flow community. That's how your
Lucas–Kanade, your Horn–Schunck algorithm, extracts velocity given intensity changes. So of course, in
our case we don't know this. We could know this, but we don't care about it. All we do is we take this
intensity change that's visualized vertically here and make it even bigger, so let's see what happens
when we do this. We take this intensity change and we magnify it, and we do the same at each pixel, so
in this case here, the intensity change is negative, and we make it even more negative. You see in this
image that as we do this, it looks like we transported the sine wave further to the right. And again, if
you're an algebraic person, we made di/dt bigger by a factor alpha, which kind of, if you have the same
di/dx, the same image gradient, suggests that there was a dx/dt velocity that was bigger. And let me—
>>: Is the magnification bounded by the resolution?
>> Fredo Durand: Oh, yeah. Yeah, so in the paper we have derivations that relate your spatial
frequencies, the velocity, the amplification factor, and we also look how this compares to the Lagrangian
approach. It's kind of interesting because the sensitivity to noise is not the same, so in some regime one
is better, and in some regime the other one is better. Let me show you a quick demo. So this is the
same, well, this is a Gaussian bump in this case, and you know, it's moving horizontally, right, but now
what we're studying is actually the vertical changes. So these are my intensity changes, and if I amplify
them vertically, you see that it looks like my Gaussian moved farther to the right. And it's not a perfect
approximation. I mean you see that we overshoot here. Not surprisingly, this is the area where the
second derivative of my function is pretty high because fundamentally, this works only if my local
derivative model, my di/dt, di/dx is valid, and so this is when your first-order Taylor expansion is a valid
option. But since we're interested in very small motions that are impossible to see with the naked eye, this is
precisely the regime in which this matters, and this kind of explains it. I mean, a number of people had
proposed to do temporal processing on pixel values, but nobody had applied it to very tiny motion, and
this is where this amplification of spatial translation actually works. I should say this is a visualization
that was done by an undergrad Lili Sun, and these kind of visualizations are great summer projects for
beginning undergrads.
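A sketch of that kind of visualization, assuming a 1D Gaussian bump: amplifying the per-pixel temporal difference approximates a larger translation, and the error concentrates where the second derivative is large, which is the overshoot seen in the demo.

```python
# 1D illustration of why amplifying per-pixel intensity changes approximates
# amplified motion: for a small shift delta, I(x, t+1) - I(x, t) is roughly
# -dI/dx * delta (first-order Taylor), so scaling that difference by (1 + alpha)
# looks like a shift of (1 + alpha) * delta. The match breaks where the
# profile's second derivative is large.
import numpy as np

x = np.linspace(-5, 5, 1001)
def bump(shift):
    return np.exp(-0.5 * (x - shift) ** 2)

delta, alpha = 0.05, 10.0
frame0 = bump(0.0)
frame1 = bump(delta)

magnified = frame0 + (1.0 + alpha) * (frame1 - frame0)    # Eulerian amplification
true_large_shift = bump((1.0 + alpha) * delta)            # what it approximates

print("max deviation from the true larger shift:",
      np.max(np.abs(magnified - true_large_shift)))
```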
I lied a little bit. Our processing is not purely pixel based. As I kind of said, we first do a spatial decomposition, and we do the processing independently on each scale of the pyramid.
Some scales might not be amplified because we know that the approximation is not going to work. And
so we can amplify motion like this baby breathing, which is Michael Rubenstein's baby, and you see that
the spatial motion is really amplified. You also see a little bit of the overshooting where it gets too
bright here, and this is the same thing as with the Gaussian bump. You might have seen this one.
Maybe I'll skip it. I like this one because it shows that you can do different temporal processing to
extract different phenomena. So this is our input video. It's a high-speed, 600-frame-per-second video of a guitar
because we want to capture the audio time frequencies, and if you amplify frequencies between 72 and
92 hertz, you see the motion of the top string, and if you choose a different frequency band, 100 to 120,
you see the second string, because this one is an A versus this one that was an E. So you have these
degrees of freedom to choose which temporal component you're going to amplify. I think this video is
broken, but I'll show it better later. As I said, we did a study that compares the Lagrangian versus the
Eulerian approach and showed, at least with a simple model and for simple cases, that there's a
regime where the Eulerian is better and one where the Lagrangian is better. Yeah. It's actually kind of
cool how some components of noise get canceled with the Eulerian because in the Lagrangian version,
you're computing velocity from pixel variation, and then you're creating pixel variation from these motion
vectors and so you end up accumulating error along the way. And thanks to our colleagues at Quanta
Research, we have a web version that people can use. I don't think in this room it's as critical because
we also have MATLAB code, and this is essentially a one-hour project to reproduce, but it's been very
useful for people who are not computer scientists, who have been able to try out our technique for their
application. So a number of people have posted YouTube videos created with our code. Someone
has been using it for pregnant-women belly visualization; it gets a little freaky and very alien-like.
Here's another one. You see we had to use some video stabilization because we amplify any motion we
see, so you probably need to remove camera motion. Somebody else has used it to visualize the spatial
flow of blood in the face by using color grading after our process. It's not real science if you don't have a
guinea pig, and it turns out some people did apply the method to a guinea pig. I don't remember how
they said it. This is the first Eulerian-magnified guinea pig in the world. Actually, one of my colleagues
who does biology at Stanford is interested in using pretty much something like this to look at the
breathing of some of their lab mice, to see what's going on with their cancer research and to be able to
tell earlier whether something is having an effect or not. We've been pretty excited about especially all the
interest that we've gotten from people in a lot of different areas, but we're still a little frustrated with
the amount of noise that we end up with for some of these videos because, of course, when you amplify pixel
variations, it's not just the signal that gets amplified. And so in order to reduce noise, we came up with
a new technique that will be presented at SIGGRAPH this summer, and it was developed by Neal
Wadhwa and Michael Rubenstein and still in collaboration with Bill Freeman. In order to understand the
difference with the old, as in one-year-old, version, you have to remember what I said: essentially, the
Eulerian perspective works when you have a first-order Taylor expansion that's valid, and essentially,
it assumes that locally, the image has a linear intensity with respect to space. So if your image is a linear
ramp, things work perfectly. Unfortunately A, noise gets amplified as well as whatever linear ramp you
actually had; and B, especially because we need to use multi-scale processing with the pyramid, in
practice, a band of an image pyramid doesn't look like a bunch of linear ramps. It looks more like a
bunch of local sine waves, because that's what a band pass gives you. The kind of artifact that you see is
where this first-order Taylor expansion breaks, and places like here are where you get real overshoot or
undershoot, the same thing we saw with the Gaussian. And so we created a new technique that instead of
assuming that images are locally linear ramps, assumes that they're locally sine waves, which is great if you want to do
multi scale processing. And the good thing with sine waves is that we know how to shift them. We have
the wonderful Fourier shift theorem that tells us that if your image undergoes translation, the only thing
that happens to your Fourier representation, your sine wave representation, is a change in the phase of
your Fourier coefficient. We know how to translate sine waves, so all we have to do is come up with
local sine waves, so in practice we use steerable pyramids, which are essentially local wavelets that look
a little bit like this, very similar to Gabor wavelets, but the variant of the steerable pyramid that we use is
actually complex-valued. So most people use image pyramids that are real-valued, but we can get
complex-valued pyramids that have both an odd and even component to them, so just like your Fourier
transform is not a bunch of sines, it's a bunch of complex exponentials, complex-valued steerable
pyramids give you both a real and an imaginary part which allows you to get a local notion of phase
which you can then use for processing and for amplifying local motion. There are beautiful Fourier
domain constructions in all this. We give some of the details in the paper, and you can look at
Simoncelli's webpage for more information. So whereas the previous approach used a Laplacian
pyramid, and just did linear amplification of the time variation, we use the steerable pyramid that has
both a notion of scale and a notion of orientation that instead of being real-valued is complex-valued,
and we turn these complex numbers into an amplitude and a phase, and the processing that we do is on
the phase. So we take the local variations of phase and we amplify them, which really directly turns into
increasing the local spatial translations, and it works a lot better than the old technique.
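A rough 1D analog of the phase idea, only to illustrate the principle: instead of a complex steerable pyramid, take the analytic signal of a narrow-band profile with a Hilbert transform, amplify how its local phase changes between frames, and resynthesize. Amplitude, and therefore noise amplitude, is left untouched; the values below are invented.

```python
# 1D analog of phase-based motion magnification: represent a band-limited
# profile by its analytic signal A(x)*exp(i*phi(x)) (Hilbert transform), then
# amplify how the local phase changes between frames. Only the phase, i.e. the
# position, moves; this is a toy stand-in for the complex steerable pyramid.
import numpy as np
from scipy.signal import hilbert

x = np.linspace(0, 1, 2048)
def frame(shift):                       # narrow-band wiggle translating slowly
    return np.sin(2 * np.pi * 40 * (x - shift)) * np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)

alpha = 20.0
ref, cur = frame(0.0), frame(1e-4)

a_ref, a_cur = hilbert(ref), hilbert(cur)
dphi = np.angle(a_cur * np.conj(a_ref))             # local phase change between frames
magnified = np.real(np.abs(a_cur) * np.exp(1j * (np.angle(a_ref) + (1 + alpha) * dphi)))

# 'magnified' approximates frame((1 + alpha) * 1e-4) more faithfully than
# linearly amplifying the raw intensity difference would.
```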
So in red is the old technique. So let me play it once. We have a Gaussian bump that's moving to the right. Blue is
ground true. We are increasing the motion. Green is our new approximation, and red is the old one. So
let me start it again. The beginning works fine. The old one breaks pretty quickly. You see this very
strong overshoot. The new one does better for longer, until eventually things go a little crazy, especially
when you get phase wraparounds. But roughly speaking, the new method tends to work four times
better, meaning you can use an amplification factor that's four times higher. The second big
improvement is that it reacts much better to noise. So with the old technique, with noise, you just
amplified the noise linearly, at least the temporal component that you selected, and so you get crazy
noise like this. You saw effects like this in some of the videos I've shown. Even in the baby one, there
was a lot more noise. With the new technique, we only modified the local phase, so we don't amplify
noise amplitude. We just shift noise around, and so the local amplitude of noise stays the same. It just
moves a lot more, so the noise performance is way, way better, and you should hopefully see this in
these results. This is the old version. This is the new version, and you see that noise performance is
significantly improved. You don't get the overshoot around the area of motion. The kind of artifact that
you might get is a little bit of ringing. These are other results. This is the old version. This is the new
version. Same thing here. You see that the noise performance is really dramatically improved. By the
way, we tried to apply de-noising to the old technique. In some cases it helps a lot, but in some cases it
actually hurts way more than it helps, so the new version is way more robust to noise. We can amplify
changes that are as tiny as the small refraction changes when you have hot air around a candle. So in
this case, the small changes in index of refraction due to air temperature cause small shifts in the
background, and we can visualize this and amplify it and give you a sense of air flow. We're currently
working on techniques that would be able to extract quantitative information from this and give you air
velocity information. In the SIGGRAPH paper we've got a lot more information including how to play
with the space versus overcompleteness tradeoff to get a bigger range of motion. We have some ground truth
comparisons to physical phenomena that are ten times bigger, and we show that the technique gives
you something reasonable. We encourage you to go see the talk at SIGGRAPH that Neal and Michael
will give, and if you want to try it, the webpage I mentioned now has the new version. I'll show you the
beginning of a video we created for the NSF that demonstrates this. Actually, I'll show you different
pieces if I can. Yes. Skip the intro. Mostly I'll show it because the explanation uses my video
authoring system.
>>: As blood pulsates through the subtle breathing motions of a baby. These changes are hidden in
ordinary videos we capture with our cameras and smartphones. Given an input video, for each pixel we
analyze the color variation over time and amplify this variation which gives us a magnified version of the
video.
>> Fredo Durand: So this is a case actually where because my handwriting is terrible, I first did a version
of this mini lecture. Actually I used a lot of resizing and spatial layout, as you can imagine, and then I
asked one of my students that has much better handwriting to just select the stuff and rewrite it, and
because I have this redrawing tool, all the audio synchronization was preserved. Let me show one of the
cool results. This one I really like. So this is a high-speed video of an eye that's static, but even when
they're static, our eyes move a tiny little bit, and we're hoping that this might be useful to some doctors.
>>: When a person fixates at a point, the eye may move from subtle head motions or from involuntary
eye movements known as micro saccades. Such motions are very hard to notice, even in this close-up
shot of the eye, but become apparent when amplified 150 times.
>> Fredo Durand: One final very brief mention of something that'll be presented at CVPR this summer,
we have a new technique with Guha Balakrishnan and John Guttag that analyzes beats from video, but
instead of using color information we use motion information. I'll show you why that even works.
>>: In this video, we demonstrate that it's possible to analyze cardiac pulse from regular videos by
extracting the imperceptible motions of the head caused by blood flow. Recent work has enabled the
extraction of pulse from videos based on color changes in the skin due to blood circulation. If you've
seen someone blush, you know that pumping blood to the face can produce a color change. In contrast,
our approach leverages a perhaps more surprising effect: The inflow of blood doesn't just change the
skin's color, it also causes the head to move. This movement is too small to be visible with the naked
eye, but we can use video amplification to reveal it. Believe it or not, we all move like bobbleheads at
our heart rate, but at a much smaller amplitude than this. Now, you might wonder what causes the
head to move like this. At each cardiac cycle, the heart's left ventricle contracts and ejects blood at a
high speed to the aorta. During the cycle, roughly 12 grams of blood flows to the head from the aorta
via the carotid arteries on either side of the neck. It is this influx of blood that generates a force on the
head. Due to Newton's Third Law, the force of the blood acting on the head equals the force of the
head acting on the blood, causing a reactionary cyclical head movement. To demonstrate this process,
we created a toy model using a transparent mannequin head where rubber tubes stand for simplified
arteries. Instead of pumping blood we will pump compressed air provided by this air tank, and I can
release the air using this valve. Now, watch what happens as I open and close the valve once a second,
similar to a normal heart rate. Ready? Here. This motion is fairly similar to the amplified motion of real
heads that we've seen before. We exploit this effect to develop a technique that can analyze pulse in
regular videos of a person's head. Our method takes an input video of the stationary—
>> Fredo Durand: Most of the components are standard. It's Lucas–Kanade tracking, a little bit of PCA,
a little bit of extraction, and the cool thing is that at the end, we get not just the heart rate. We also get
individual beat locations, which gives us beat variations. This is a histogram of beat lengths according
to the ECG or our motion technique.
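A hedged sketch of such a pipeline built from standard components, OpenCV feature tracking, a temporal band pass, PCA, and peak picking; the structure and parameters are illustrative, not the CVPR system itself.

```python
# Sketch of pulse-from-head-motion with standard components: track feature
# points on the head, band-pass their vertical trajectories around plausible
# heart rates, run PCA to pull out the dominant periodic motion, then pick
# peaks to get individual beats. Illustrative only, not the authors' system.
import cv2
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def pulse_from_video(path, fps, f_lo=0.75, f_hi=3.0):
    cap = cv2.VideoCapture(path)
    ok, first = cap.read()
    prev = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

    tracks = []                                   # vertical coordinate of each point, per frame
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        # For brevity, points whose status flag reports a tracking failure are not pruned.
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        tracks.append(pts[:, 0, 1].copy())        # keep only the y coordinates
        prev = gray
    cap.release()

    y = np.asarray(tracks)                        # shape (frames, points)
    b, a = butter(4, [f_lo, f_hi], btype="bandpass", fs=fps)
    y_band = filtfilt(b, a, y - y.mean(axis=0), axis=0)

    # PCA over the trajectories; project onto the principal direction as the pulse signal.
    _, _, vt = np.linalg.svd(y_band, full_matrices=False)
    pulse = y_band @ vt[0]

    peaks, _ = find_peaks(pulse, distance=fps * 60 / 180)   # at most 180 bpm
    rate_bpm = 60.0 * fps / np.mean(np.diff(peaks))
    return rate_bpm, peaks                         # rate plus individual beat locations
```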
And I'm told that this has diagnostic applications, but don't ask me too much. I'm not the right kind of doctor. And unlike the color version, you can get heart rates from
the back of someone's head or in Halloween situations. So really, the thing I'd like to emphasize is I
think this whole area of revealing invisible things using computational tools is very rich, and I think that in
vision and graphics, we really have the right intellectual tools to make a lot of things happen, and I
encourage everyone to do research in this area. Thank you. [applause].
>> Hugues Hoppe: Since we've gone over time, you're welcome to leave, but we'll stick around for
questions. Five minutes for questions.
>>: Hi. I'm [indiscernible]. I work here in the [indiscernible] group. So my question for the pulse from
head motion is, how robust are you to regular head motion?
>> Fredo Durand: Depends how regular regular is.
>>: Random head motion. Let's say I'm working in front of my laptop.
>> Fredo Durand: Yeah, so I mean this is the thing we're trying to test, you know, how far we can go in
people's activity. So if you're like running on a treadmill, it's not going to work. Typing on the keyboard
seems to be fine. Then we're trying to find where the exact limit is. Actually the biggest motion that we
have to fight against is breathing for people being static.
>>: The second part to my question is how much prior information do you need to get that beat signal
out? Like do you actually specify a band like you did in the Eulerian?
>> Fredo Durand: The band is pretty broad. I think these days we just specify if it's a newborn because
their heart rate is so much higher. I don't remember what band he's using, but I think it's like .5 hertz to
2 or 3 hertz like this. It's reasonably broad.
>>: So did you accelerate your video magnification algorithms using your language?
>> Fredo Durand: Yeah, no. I've hired a student to do this this summer. We want a real-time version on
a mobile device, and so yeah, we want them to do this. The phase-based version is a little more tricky for
this. Partially there are degrees of freedom in which steerable pyramid you use exactly, and there's
probably—it's actually a more general issue for the compiler where so far we've assumed that the
algorithm is fixed, but we all know that when you try to optimize your processing you might decide oh,
you know, I'll use a cheaper version of blur in order to get the performance that I want. This is exactly
the kind of stuff that's going to happen, I think, with the pyramids. Yeah.
>>: For your latest head motion work, have you done quantitative comparisons against the previous
work? In other words, is it better or comparable?
>> Fredo Durand: Yes, so we've been comparing with color and sometimes it's better. Sometimes it's
worse. And we're trying to come up with the best of both kind of thing, yeah. It looks like the motion is
less sensitive to noise because as long as you have strong edges that move, you really need a lot of noise
before you mess up the notion of an edge, but at reasonable noise levels it's less clear which one is
better.
>>: Following up on that question, could you also, like, if it's the case that sometimes one is better,
sometimes the other, you could possibly also combine them?
>> Fredo Durand: Absolutely, yes.
>>: You haven't done that?
>> Fredo Durand: I have a student who's supposed to be working on it. I haven't seen any result yet.
>>: Have you tried this on like a big motion [indiscernible]?
>> Fredo Durand: Yeah, it does crazy stuff. I mean.
>>: Only if you got a fixed or lack of motion, track motion [indiscernible]?
>> Fredo Durand: Yeah, so the larger the motion the less you can amplify the high spatial frequencies,
so usually the way we get away with motion that's too big is we just give up on the high frequencies.
But then at some point you get so low frequency that nothing much happens. The phase based one can
go a little further, yeah.
>>: Have you looked into retiming videos as well because a lot of retiming now is done with optical flow
and pixel tracking. I'm interested in how the grid based approach works.
>> Fredo Durand: I'm interested in this too, yeah. We want to try. Absolutely, yeah. The phase based
method is very interesting because it's halfway between something Lagrangian and something Eulerian,
and I think there are lots of things that you usually do with advections that might be interesting to do
with this technique because it's more direct and so I think as a result, it will tend to be more robust. Yes.
>>: It sounds like you've done some amount of experimentation with things that can't be seen at all and
turning them into visual output, essentially. Have you looked at all at taking something like a laser
bounced off of a faraway window and turning it back into an audio signal? Have you looked at doing that
with just visible light? I point my high-speed camera at your window, and I can peel the audio off.
You're peeling the audio on both sides, but the laser offers the same [indiscernible]
>> Fredo Durand: Yes, we're interested in this. We've made some early experiments, especially—I don't
know if you saw, there was a bluish membrane that we had, that's a rubber membrane. We started
playing with a loudspeaker and trying to record the motion and getting some audio back, but it's very
preliminary at this point. Yeah we're very interested in this. All right.
>>: There is motion in the sun and the stars which can only be recorded by cameras.
>> Fredo Durand: Pardon?
>>: We detect that motion in sun and stars with cameras and big telescopes. Have you tried like that
motion, and I guess [indiscernible]
>> Fredo Durand: No, we haven't tried that.
>>: Because it's really low frequency [indiscernible]
>> Fredo Durand: Yeah, we should try.
>> Hugues Hoppe: Thank you, Fredo.
>> Fredo Durand: Thank you. [applause]