>> Zicheng Liu: Hello. Hi. Welcome to this talk. So today I'm really pleased to introduce
Sofien Bouaziz -- is that the way to say it?
>> Sofien Bouaziz: Sofien Bouaziz.
>> Zicheng Liu: Bouaziz?
>> Sofien Bouaziz: Yeah.
>> Zicheng Liu: Yeah. So Sofien has been doing really cool work. He's been working on both
the computer vision side and the computer graphics side. He uses machine learning [inaudible]
to do both. I have seen quite a few of his papers. I really like this work because it uses the
Kinect. I mean, right now he's going to have a demo which is [inaudible] but is basically also
Kinect.
So of course facial animation has been a favorite subject of mine, so I'm glad to see that
he's really moving this forward. He's really advancing the state of the art in this direction. So...
>> Sofien Bouaziz: Thank you. Thank you for the introduction. So I'm very glad to be here
today. So, yeah, I'm Sofien Bouaziz. I'm a Ph.D. student at EPFL under the supervision of
Professor Mark Pauly.
And one of the projects I'm working on is this realtime facial animation project. So first I will
introduce myself a bit and explain how this project was born at EPFL, and then I will go through
the paper that we have at SIGGRAPH this year on how to do realtime facial animation with the
Kinect without any training.
So I started my Ph.D. in 2010 in Switzerland at EPFL. And this is the EPFL campus. I
think I'm very lucky because this is a very beautiful place. It's a bit like Seattle. And this is me
when I started my Ph.D., when I had more hair on my head. Yeah.
So my research is basically at the intersection of three different fields, which are computer
graphics, computer vision, and machine learning.
I've been more recently interested in geometry processing and how to deform 3D meshes
under certain types of constraints. And this is very relevant for face tracking.
I've also done some work on sparsity and [inaudible] and how to apply that to registration.
And finally I've also done a bit of work on machine learning, and I'm mostly interested in
semisupervised learning and unsupervised learning.
But the project that I am -- that is really at the heart of my Ph.D. is this facial animation project.
And what is interesting with this project is that it's really at the intersection of these three fields,
computer vision, computer graphics, and machine learning.
And I think [inaudible] when you want to have systems that deal with real-world data, you
somehow need this combination of things.
So what is facial animation? There are mainly two methods to create facial animations. The
first one is to do it manually in an animation software package, as shown on the left side.
And this is quite time-consuming because an animator needs to come and place sliders one by
one and to do that for a lot of key frames. So here what you see is a caricature of my supervisor,
and it's me moving a slider to open his mouth.
And the second method is to use motion capture, which consists of capturing the expressions of an
actor using some cameras and retargeting these expressions onto a virtual character. And what I
will talk about is mainly this second one, motion capture.
So I was really lucky because when I started my Ph.D. -- in fact, two months after I started
my Ph.D., in November 2010 -- Microsoft released the Microsoft Kinect. And I remember we
bought the Kinect and we used the initial drivers to get some data out of it. And at that time I
was working on faces, so I wanted to know what a face looked like in the Kinect data.
And it looked like this. So because of the wide-angle lens, the face is only 160 pixels by
160 pixels, and that is when you're very close to the device, because there is also a near cutoff.
And you can imagine that an eye is then only about 15 pixels wide in this data.
So we were looking at that and we were a bit disappointed, especially because we had this 3D
scanner in our lab, and its quality is much better. You see a lot of details. And on the other side,
with the Kinect you don't really see much.
So we said it would be very challenging to do something with this data, but why not try. So
a few months later we got something like this. And this is what we submitted to SIGGRAPH
2011.
So now two years have passed, because that was 2011, and what have we done since then? In
fact, with my friends and colleagues -- my supervisor Mark Pauly, myself, and Brian Amberg --
we created a startup called Faceshift based on this technology.
And so we are quite happy now because some people use our system and they can do much
better things than what we could have done ourselves, because they have really nice 3D
models. And this is one of the first videos that someone sent us. And this video is a raw recording.
There is no cleaning going on. And this is the type of quality that you can get in terms of facial
tracking with our system.
So I will just do a quick demo of our system, just very quickly, to show you the quality.
So here is our software. And this is -- the software runs with a Kinect and [inaudible]. And this is
the type of quality we can get out of it. And then we can animate some [inaudible]. So just let me
close that and then push on through the talk.
You can find the software online. There is a 30-day trial version, so if you want to try it, feel free
to try.
And now we'll talk a bit more about what we have done for SIGGRAPH this year, which is
doing the same thing but without any training. Because, in fact, the software that I have just
shown needs a training phase where the user has to scan his face to create a model of his face, and
then you use that to track the person by registering the face model to the 3D and 2D data.
So our paper at SIGGRAPH this year is called Online Modeling for Realtime Facial Animation.
And this work has been developed at EPFL in collaboration with Yangang Wang from
Tsinghua University and Mark Pauly from EPFL.
So during the last three years, approximately, RGB-D devices have drastically improved,
going from expensive and [inaudible] setups like this scanner that we have in our lab to devices
that can be used with a video game console and, more recently, embedded into a laptop frame.
And both the size and the cost have drastically decreased to target the consumer market, but the
quality has also decreased.
So while these devices are now ready to be mass produced, very few algorithms are fully taking
advantage of their capabilities and are ready to be deployed at a large scale.
So in this paper, what was the goal? The goal was to develop a face tracking algorithm that you
could directly deploy with these devices. That means you want something that is realtime
and training free, so someone can come, sit down, and directly use the system to track his face,
like in this video.
[inaudible] it needs to be accurate enough that the person impersonating the avatar has
the feeling that he is really driving it. So you need it to be responsive and accurate.
And the last thing is that you need to be robust to lighting variation, because a lot of people are
either playing video games in the dark or the lighting in the room is not good. So what you want is
something that is robust to occlusion and to bad lighting conditions.
So these were the [inaudible] requirements that we wanted for the system in order to be able to
deploy it with this kind of device.
And what is nice is that, since no training is needed anymore, this [inaudible] enables new
applications in virtual interaction. In this example, which we developed in collaboration with
HEAD Geneve and Faceshift, an observer can simply step in front of the picture and start
animating the virtual painting. And you see the Kinect is embedded into the frame.
And so you can not only animate an avatar, but you can also use your face as a controller. So in
this painting the user can drive the weather and, with the rotation of his head, also drive the
camera. And you see, again, the Kinect is embedded into the laptop frame.
So then when he blows, wind comes, and he really uses his face like a controller.
So let's put this work into context. We will first compare our approach with existing realtime
systems in terms of quality and usability.
On one side of the spectrum we have marker-based approaches. These methods [inaudible]
allow realtime and accurate tracking, but they require a specialized [inaudible] setup
and need careful calibration, which reduces the usability of the system for the consumer
market.
On the other side of the spectrum we have webcam-based approaches, which only need a
webcam and usually no training, but unfortunately these approaches lack [inaudible] and
are quite sensitive to bad lighting conditions.
Then we have other types of webcam-based approaches where you need to do a training phase in
which you create a [inaudible] model of the face, and these approaches are a bit more robust.
However, you need a training phase, so it's not really something that you can deploy directly to
the consumer market.
At SIGGRAPH 2011 we presented this approach for doing face tracking based on the
Kinect, but in this approach you also need to scan the face of the user first in an
offline process, and then you're able to track. And usually users don't want to go through this
process of scanning and then tracking. They want to come in front of the device and directly
animate an avatar.
So this year we present an approach that is totally training free. And this puts our
approach somewhere in the spot where the quality is pretty good and it's very usable, because
people just come in front of the device and can use the system. The only difference with the
webcam approaches is that you need to buy a Kinect. And this is a bit more costly [inaudible].
You have a tradeoff between cost and quality.
At SIGGRAPH this year, there are two interesting related works that also try to improve the
quality of markerless motion capture: one from Cao et al. from Microsoft Research Asia and
one from Li et al., and both of them have very nice tracking quality. However, they also require
some training.
So now let's see how our system works. As input, our system gets a depth map and an
image coming from the RGB-D device, like a Kinect. And we formulate the tracking problem as
the registration of a parametric face model to the input data. So this face is parameterized, and
you need to find the parameters of this face in order to align it properly to the depth map and the
image.
And then when you have these parameters, which are the rotation, the translation, and the internal
parameters of the face, you can retarget them to [inaudible].
So in our system we use a blendshape model as the parametric face model. So what is a
blendshape model? A blendshape model is defined by a neutral face, which is a 3D mesh, and a set
of facial expressions -- like mouth open, eyes closed, smile. And you can express that as a
matrix where B0, the first column, is the neutral pose and B1 to BN are the different facial
expressions. So you take your vertices, you vectorize them, and stack them into the matrix.
Now, to create a new expression, what you can do is combine the neutral pose with a
linear combination of the displacements from the neutral to the different expressions, expressed as
the [inaudible].
So, for example, if you want to create this face, you can take the neutral and then add a
percentage of mouth open -- so the displacement from neutral to mouth open -- and a bit of stretch
left and right. And then you would get this face. So it's a linear model.
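To make the linear model concrete, this is the standard way such a blendshape combination is usually written (a sketch in generic notation; the exact symbols in the paper may differ):

```latex
F(\mathbf{x}) \;=\; \mathbf{b}_0 \;+\; \sum_{i=1}^{n} x_i\,(\mathbf{b}_i - \mathbf{b}_0)
            \;=\; \mathbf{b}_0 \;+\; \Delta B\,\mathbf{x}, \qquad 0 \le x_i \le 1,
```

where b0 is the vectorized neutral mesh, b1 to bn are the vectorized expression meshes, the columns of the matrix ΔB are the displacements bi - b0, and x holds the blendshape weights -- the sliders an animator would move.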
>>: Is it the same as the [inaudible] model?
>> Sofien Bouaziz: So the [inaudible] model is only for identity. It's like a PCA model where you
take a point in the space and you can create a neutral face. In this case it's kind of similar, but
you take a point in the space and you create an expression. So you take --
>>: But the map is essentially --
>> Sofien Bouaziz: The map is the same. It's a linear model. Yeah. Exactly. And we'll talk a
bit about [inaudible] a bit later.
So now the registration problem consists in finding a linear combination of blendshape bases so
that the face is aligned with the depth map and the image.
Now, if we know the blendshape model of the user -- so what the neutral face of the user and his
expressions are -- we can just optimize for the blendshape weights in order to best fit the model to
the input data.
And this is what we did in SIGGRAPH 2011, and this is what you saw in the previous video. We
optimized for these weights in realtime in order to align the parametric face model.
But then the issue is that you need to know beforehand what the blendshape model is, and
usually this requires a prior training because you don't know what the face of the user looks like.
So instead, what we want to do, in fact, is to optimize for everything. We treat everything as
an unknown. We treat the blendshape model as an unknown and the weights as unknowns, because
you don't know what the face of the user is and you don't know what the expressions he is
doing are.
But, as you see, everything is an unknown now, so it's kind of a challenging problem: how can we
deal with this? In order to do that, we developed an algorithm that is [inaudible] optimization.
So it works as follows. First you receive the image and the depth map from the Kinect. And
then you start tracking the facial expression by fixing the blendshape model. So
what you do is assume that your blendshape model is correct and optimize for the
weights only. Then you get the parameters of rigid motion and blendshape weights, and you can
retarget them.
In the second step you assume that the weights are correct and then you optimize for the
blendshape model. So you alternate: you first fix the blendshapes and optimize for the weights,
and then fix the weights and optimize for the blendshapes.
So now, if you initialize your system with a generic blendshape model that doesn't look like your
user but like a generic face, and you run this alternating minimization between blendshapes and
weights, you can get something like this, where the face gets adapted over time in order to match
the user-specific geometry. But you can also do tracking. So what we are doing, in fact, is
modeling while doing the tracking at the same time.
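As a rough illustration of this alternating scheme, here is a minimal Python-style sketch. The helpers solve_tracking, refine_blendshapes, and retarget are hypothetical placeholders standing in for the two optimization steps and the retargeting described above, not the actual implementation:

```python
def track_and_model(frames, generic_blendshapes):
    """Alternating optimization: track with the current model, then refine the model."""
    blendshapes = generic_blendshapes          # start from a generic template face
    for depth_map, image in frames:
        # Step 1: fix the blendshape model, solve for rigid pose + expression weights.
        pose, weights = solve_tracking(blendshapes, depth_map, image)
        retarget(pose, weights)                # drive the avatar immediately
        # Step 2: fix pose + weights, adapt the blendshape model to the user.
        blendshapes = refine_blendshapes(blendshapes, pose, weights,
                                         depth_map, image)
    return blendshapes
```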
So let's first see how we do the tracking part. To compute the tracking parameters, we assume
that we have a fixed blendshape model. This step corresponds to finding the blendshape weights X
such that the resulting pose matches the depth map and the image.
To solve this, we formulate an optimization that combines three terms. The first term is the
distance from the generated 3D model to the depth map and the image. So it's like a fitting
term. The second term is a smoothness term that says that, temporally, you want your
animation to be smooth. So you don't want to go from mouth open to smiling in one
frame. You want things to be smooth and behave nicely. And the last term is a sparsity term,
and we will see a bit later why we have this term.
So let's take a deeper look at these terms. The fitting term is expressed as a quadratic energy where
the matrix A and the vector c summarize the registration constraints. These registration
constraints are [inaudible] constraints, where you want consistency between the colors of a frame,
and point-to-plane constraints, meaning that you want each vertex of the mesh to be close to the
geometry. And I invite you to look at our first paper from SIGGRAPH 2011 to learn a bit more
about how these constraints are formulated.
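As a hedged sketch of what such a quadratic fitting term typically looks like (the exact formulation and weighting are in the SIGGRAPH 2011 paper), the depth constraints are point-to-plane terms and the image constraints pull projected vertices toward their 2D targets:

```latex
E_{\text{fit}}(\mathbf{x}) \;=\; \sum_{i}\big(\mathbf{n}_i^{\top}(\mathbf{v}_i(\mathbf{x}) - \mathbf{p}_i)\big)^2
\;+\; \lambda \sum_{j}\big\|\Pi(\mathbf{v}_j(\mathbf{x})) - \mathbf{u}_j\big\|^2
\;=\; \|A\,\mathbf{x} - \mathbf{c}\|^2,
```

where v_i(x) is a mesh vertex generated by the blendshape weights x, p_i and n_i are its closest depth-map point and normal, Π projects a vertex into the image, and u_j is the corresponding 2D target. For a fixed rigid pose and fixed correspondences, all of this is linear in x and stacks into the matrix A and vector c mentioned above.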
Then you want to apply smoothness, which for us is simply done by penalizing the second-order
difference of the blendshape weights over time. So, classical things.
And then we have the sparsity energy, which basically penalizes the L1 norm of the blendshape
weights, and this is to get as few activations of the blendshapes as possible. Notice that
now we have a convex problem, so we can get an optimal solution at each frame.
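Putting the three terms together, the per-frame tracking problem has roughly this shape (a sketch; the smoothness term is written here as a second-order temporal difference, and the weights λ are illustrative):

```latex
\min_{\mathbf{x}_t}\;\; \|A\,\mathbf{x}_t - \mathbf{c}\|^2
\;+\; \lambda_{\text{smooth}}\,\|\mathbf{x}_t - 2\mathbf{x}_{t-1} + \mathbf{x}_{t-2}\|^2
\;+\; \lambda_{\text{sparse}}\,\|\mathbf{x}_t\|_1 .
```

All three terms are convex in x_t, which is why a globally optimal solution can be computed at every frame.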
So there are some advantages to using this L1 norm instead of the L2 norm to regularize the
blendshape weights. The first is the regularization of [inaudible] blendshape composition artifacts,
because you try to activate only a small set of blendshapes at a time. And usually, because the
blendshape model is meaningful, you don't want all shapes to be activated at the same time. You
don't do a smile plus a mouth open plus a lot of other things. So you want to have a minimal
description of the shape that the user is doing.
And the second reason why we have this term is that when you do the retargeting and
you give those curves to an animator who wants to clean them, they don't want to have
a lot of small activations, because this is annoying for the animators to clean up. So those are
mainly the two reasons why we have this L1 term.
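To illustrate why the L1 term yields few activations, here is a minimal sketch of solving such an L1-regularized quadratic with proximal gradient descent (ISTA). This only illustrates the technique; it is not the solver used in the actual system:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrinks small entries exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def solve_sparse_weights(A, c, lam, n_iter=500):
    """Minimize 0.5 * ||A x - c||^2 + lam * ||x||_1 with ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the quadratic's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - c)           # gradient of the quadratic part
        x = soft_threshold(x - grad / L, lam / L)
    return x                               # many entries end up exactly zero
```

The soft-thresholding step is what drives most blendshape weights exactly to zero, which gives the sparse, animator-friendly activation curves described above; an L2 penalty would only shrink the weights without zeroing them.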
So that was the tracking part, and now what is really the focus of this paper is how to do the
modeling. In the previous section, we have seen how the blendshape weights can be computed
using a given blendshape model. So now we will do the reverse. We will fix the blendshape
weights and optimize for the blendshape model. And so we developed an optimization that
contains two terms.
First we have a fitting term. And this is exactly the same fitting term that we had for the
tracking, which means that you want your model to be close to the image and to the depth map.
And then you have a prior term, and this term basically regularizes the deformation of the
blendshapes, because you want the blendshapes to stay nice and to stay like a human face. You
don't want the blendshapes to deform into a monkey or into something else. So this prior term is
there to regularize the deformation.
So the blendshape model that we use in our system is composed of 34 poses plus the
neutral, and each of those meshes contains around 7,500 vertices. Now, one
way that you could deform the blendshape model is to take each vertex independently and try to
move it so that it better fits the data. But the main issue is that you get around
800,000 dimensions if you do that. Because you have 35 meshes with around 8,000 vertices, and
each vertex is in three dimensions, so if you try to take everything as unknown, you have a
massive problem.
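Just to make that count explicit with the numbers above:

```latex
(34 + 1)\ \text{meshes} \;\times\; 7{,}500\ \text{vertices} \;\times\; 3\ \text{coordinates} \;\approx\; 800{,}000\ \text{unknowns}.
```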
Therefore, directly optimizing the vertex positions is not really a good idea,
especially if you want to do that in realtime. You have only 33 milliseconds per frame to adapt
the face. So it's not a lot of time.
So what we want is to do some kind of [inaudible] and decrease the number of dimensions,
right? So how can we do that? First we express the neutral pose of the user using an identity
PCA model, like the [inaudible] model. And this model captures the variation of face geometry
across different users. So if the unknown is Y, then for any Y you can create the neutral face of a
person with the PCA model of identity.
What we also have is the generic model that you have seen in the previous slides. And
this model somehow defines what type of shapes you want to track in realtime. So you have this
knowledge that you want to track mouth open, smile, and so on, encoded in the generic model.
So what do we do to create the expressions of the user? We basically use a deformation
transfer operator that takes the deformations of the template model and applies them onto the
user-specific model. Meaning that in the template blendshape model you know how to go from
neutral to mouth open, and you want to take this deformation and apply the same deformation to
go from the neutral of the user to the mouth open of the user.
And the interesting thing is that this [inaudible] can be made linear, so it is just a matrix
multiplication. So then each expression can be expressed as a linear transformation of the neutral
pose.
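In other words (a sketch of the structure following the description above), each user-specific expression is obtained by applying a precomputed linear deformation transfer operator to the user's neutral pose:

```latex
\mathbf{b}_i \;=\; T_i^{*}\,\mathbf{b}_0, \qquad i = 1,\dots,n,
```

where each T_i^* encodes how the generic template deforms from its neutral to expression i and is computed once on the template.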
However, while the PCA model represents the large-scale variability of the face geometry in the
neutral expression, it doesn't really capture user-specific details very well, like the asymmetric
curves of a face. The PCA model is symmetric.
Similarly, deformation transfer [inaudible] just transfers the deformations of the template
blendshape model to your user-specific neutral pose, and this doesn't really account for the
variability of the expressions of the user. So what we want is to be able to capture those details
better. In order to take into account user-specific details, we also add a smooth deformation field
to the neutral and to the expressions, defined as a linear combination of the last eigenvectors of
the graph Laplacian matrix of the 3D mesh of the neutral pose. And for this I invite you to take a
look at [inaudible] processing, because it would take me a long time to explain how it works.
>>: How many of the columns [inaudible]?
>> Sofien Bouaziz: How many what?
>>: [inaudible].
>> Sofien Bouaziz: How many vectors? This is nice -- next slide.
>>: Oh, sorry.
>> Sofien Bouaziz: That's okay. So now we have a parameterization. And by increasing the
number of PCA and Laplacian bases, the modeling gets better and better. We found that if
you use around 50 PCA bases and 100 Laplacian bases per shape, we get a pretty good
reconstruction. And now we have only approximately 5,000 dimensions,
which is around 0.7 percent of the total number of dimensions that we had before -- a
pretty decent decrease in the number of dimensions. It's still 5,000 dimensions, but it's much
better than the 800,000 that we had before.
And this parameterization is also interesting because now we can express the blendshapes as a
linear system in an unknown vector U that contains the parameters of the PCA model and of the
Laplacian eigenbasis.
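A sketch of the resulting parameterization, with the notation slightly simplified from the paper: the neutral is spanned by the identity PCA model plus a smooth correction field, and each expression is the transferred neutral plus its own smooth correction,

```latex
\mathbf{b}_0 \;=\; \bar{\mathbf{m}} + P\,\mathbf{y} + E\,\mathbf{z}_0, \qquad
\mathbf{b}_i \;=\; T_i^{*}\,(\bar{\mathbf{m}} + P\,\mathbf{y}) + E\,\mathbf{z}_i,
```

where m̄ is the PCA mean face, the columns of P are the identity PCA basis, the columns of E are the last eigenvectors of the graph Laplacian (the smooth deformation field), and the unknown vector U stacks y and all the z_i. Every blendshape b_i is linear in U.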
>>: So E is precomputed I suppose.
>> Sofien Bouaziz: E is precomputed.
>>: [inaudible] P star is also precomputed.
>> Sofien Bouaziz: On the generic template model, yes. So the only unknowns are basically the
PCA weights and the weights that you apply to the eigenbasis of the Laplacian to get the smooth
deformation field.
>>: So we apply these precomputed P stars to a newly created mesh?
>> Sofien Bouaziz: Yes.
>>: [inaudible] some artifacts.
>> Sofien Bouaziz: You get pretty good results, yeah. Because this deformation is done in a
way that tries to minimize distortion. So in the paper we have, in the appendix, an explanation
of why this works pretty well.
So now we can express the minimization in terms of the unknown vector U that contains the PCA
weights and the Laplacian coefficients. Similar to the tracking, we have the fitting term, and this
remains quadratic. However, [inaudible] to get meaningful expressions, it is necessary to
regularize it with a prior term, and this prior term is quadratic with a diagonal matrix. And I
invite you to look at the paper to see how it's built. But basically it tries to minimize the bending
distortion and to stay in a good region of the PCA model.
>>: What is wrong with the first X, right next to the green check mark?
>> Sofien Bouaziz: Well, maybe you don't see it here, but there is some distortion near the
lips. It starts looking a bit like a monkey. In fact it's very interesting, because when you
start modifying faces and you don't regularize things very well, you always end up with
those kinds of monkey faces. And this is exactly what we try to avoid, because people see
directly that it is not looking like a human.
>>: [inaudible] do you use [inaudible] images at all, the color channel? I mean, [inaudible]
feature detection [inaudible].
>> Sofien Bouaziz: Yes. So into this matrix A of constraints and the vector C, which is also
constraints, you can put whatever constraints you want. In this case you have [inaudible]
constraints and point-to-plane constraints. So these matrices contain exactly the same constraints
that you are using for tracking. You try to keep the same constraints in order to say, okay, these
weights came from the tracking; now, what is the geometry that goes with these weights in order
to fit the data? So if you want to put features into this matrix, you can put features too.
So now, this is really nice. We have this optimization. But the fitting term that is presented here
is only for one frame. You have a constraint matrix A for one frame that tells you what the
[inaudible] is for this frame and what the point-to-plane constraints are for this frame.
What we would like to do is in fact have a constraint matrix A and vector C for each frame
and try to minimize this energy, which is the fitting from the first frame to the last frame.
However, in order to optimize this energy, we do not want to keep all the matrices -- all the
constraints from the first frame, second frame, third frame -- because this is kind of
[inaudible] in terms of memory, and also because you cannot keep an infinite amount of
constraints.
So in order to do that, we developed an online algorithm that keeps the memory constant and is
able to deal with all the constraints of all the frames that you have seen before.
So let's take a look at this minimization. The first thing to notice is that minimizing this
quadratic energy corresponds to solving the following linear system, also called the normal
equations.
So we can call the left-hand side M_T, because the last frame is T, and the right-hand side
y_T, because we have added frame T.
Now what we can see is that we can build this iteratively, meaning that M_T is in fact
equal to M_{T-1} plus A transpose A -- the product of the transpose of the constraint
matrix with itself. And similarly for the right-hand side.
So now we can design a nice algorithm to do this modeling while keeping the information from all
the previous frames: first you take M at time T, which is just a matrix, so you keep only a matrix
in memory -- it's like a buffer -- and you add the constraint matrices at time T to this buffer. You
do the same for the right-hand side, and then you solve the [inaudible] linear system. This linear
system is pretty big, but because you do it at each frame, what we do is just a few steps of
[inaudible] on this linear system. So this is like an online learning algorithm.
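Here is a minimal Python sketch of that online accumulation idea, assuming a quadratic diagonal prior and using a few conjugate gradient steps per frame as the "few steps" mentioned above; the names and details are illustrative, not the actual implementation:

```python
import numpy as np

class OnlineModelRefiner:
    """Constant-memory summary of all past frame constraints.

    Minimizing sum_t ||A_t u - c_t||^2 + u^T D u (D a diagonal prior) amounts to
    solving the normal equations (sum_t A_t^T A_t + D) u = sum_t A_t^T c_t,
    so only the accumulated matrix M and right-hand side y need to be stored.
    """

    def __init__(self, dim, prior_diag):
        self.M = np.diag(prior_diag).astype(float)   # starts as the prior D
        self.y = np.zeros(dim)
        self.u = np.zeros(dim)                       # current model parameters

    def add_frame(self, A_t, c_t):
        self.M += A_t.T @ A_t                        # fold the new constraints into the buffer
        self.y += A_t.T @ c_t

    def refine(self, n_cg_steps=3):
        # A few conjugate gradient steps per frame, warm-started at the previous
        # solution, keep the per-frame cost bounded.
        r = self.y - self.M @ self.u
        p = r.copy()
        rs = r @ r
        for _ in range(n_cg_steps):
            Mp = self.M @ p
            alpha = rs / (p @ Mp + 1e-12)
            self.u += alpha * p
            r -= alpha * Mp
            rs_new = r @ r
            p = r + (rs_new / (rs + 1e-12)) * p
            rs = rs_new
        return self.u
```

The memory stays constant because M and y have a fixed size no matter how many frames have been accumulated.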
>>: [inaudible].
>> Sofien Bouaziz: So the pose has been computed in the tracking phase.
>>: Oh, so this is -- that does not come in here.
>> Sofien Bouaziz: No. Because the face has been aligned, and you assume that
the alignment and the expression are good. And then you deform the model to fit the data. And
[inaudible]. And you will see in the --
>>: [inaudible] so the second phase.
>> Sofien Bouaziz: Yes. And you will see this here. So here we have initialized the system
with a wrong face and a wrong rotation. In the upper right corner -- yes, the upper
right corner -- you are seeing the face without the rigid motion. I took out the rigid motion
to let you see what the deformation is doing. And in the bottom right corner
you have the same face but with the rigid motion.
So now the system is initialized with the wrong rotation, the wrong translation, and the wrong
face model. And at some point the face starts to deform and tries to
realign to the face of the person in order to fit the data better. And you get the tracking at the
same time.
So it's exactly what we wanted. We wanted the model to adapt to the data and to be able to
track. And even with a pretty bad initialization, it can recover. And this is the first frame,
where you have the wrong initialization of rotation and translation, and this is the last frame. So
you get a much better fit in terms of the model.
So now it is demo time for the system, and I hope everything will work out as usual.
>>: [inaudible] expressions, you will not be able to get all the bases [inaudible].
>> Sofien Bouaziz: Yeah. So in the paper we have a term that basically looks at how
long a face -- or an expression -- has been seen. We accumulate -- we
look at how long an expression has been seen, and then after a certain time we fix it.
So we have [inaudible] in the algorithm. But I did not present it here.
So this is our system. I just came in front of the device and the face got adapted. So
now let me reset it and you will see the adaptation again. It's really fast. The thing is, it's
very difficult to see the adaptation because we run at 30 frames per second. That means that in
one second we have already run the optimization about 30 times. So it goes pretty fast.
>>: Can you try it on someone else [inaudible]?
>> Sofien Bouaziz: Sure, we'll try.
>>: [inaudible] need it to be representative.
>> Sofien Bouaziz: Let me restart it. Hopefully. Okay. So the first thing is -- wait. It's
tracking me now.
>>: [inaudible].
>> Sofien Bouaziz: Yeah. Let me reset that. Okay. Okay. So look straight at it first. You
can do facial expressions and stuff.
>>: [inaudible].
>> Sofien Bouaziz: Yeah. So you'll notice it will fit -- I mean, it's very difficult to
get the cheek parts, because if you -- yeah, if you turn, you will get it.
>>: So it's already being adapted.
>> Sofien Bouaziz: Yeah. So your nose is pretty well fit -- I mean, it's very difficult to get the
cheek parts because of the -- yeah, if you turn, you will get it.
>>: [inaudible].
>>: Yeah, it definitely looks different from your face.
>>: Yeah.
>> Sofien Bouaziz: Yeah.
>>: You can recognize me from the mesh?
>> Sofien Bouaziz: I don't think it's -- yeah, I don't think the goal is really to recognize; it's to
get good tracking.
>>: It certainly recognized that it's somebody Asian. I'm not sure what it is about the face, but it
definitely looks ->> Sofien Bouaziz: [inaudible].
>>: Yeah, because my nose bridge, nose bridge is lower.
>> Sofien Bouaziz: So now like -- now I can swap with you [inaudible].
>>: I mean, it's interesting how much of your facial identity is not in the shape but is in things
like your hair and other things, your mustache and so on. [inaudible] would make a huge
difference --
>>: In terms of recognition.
>>: Recognition, you know, I don't know how well recognition works on just pure face shape. It
makes you look very young.
>>: Thank you. I need that.
>>: Thank you.
>> Sofien Bouaziz: You're welcome. Yeah, it's true that you get a lot from other things like hair
and texture. But this model is also pretty smooth, because the Kinect data are not of very
high quality. So with better data and by increasing the number of bases, you can probably get
a better reconstruction. So right now we [inaudible] the model in order to not be [inaudible]
to noise, basically.
>>: Do you have to track eye gaze?
>> Sofien Bouaziz: We track eye gaze, but this is not in the -- yeah, I did not talk about this.
>>: [inaudible].
>> Sofien Bouaziz: So eye gaze is done in another way. It's not done by registration, because that
is very complicated. Even the eyelids are not done by registration. So there's a lot of processing
going on there in the tracking.
>>: [inaudible].
>> Sofien Bouaziz: Yes. You have not changed the blinking.
>>: Yeah, I noticed that, yeah, the blinking.
>> Sofien Bouaziz: [inaudible] this is good. So what about future work? First, [inaudible] we
already build automatically a fully [inaudible] avatar, with the expressions and the identity.
So we would like to add texture and other facial features like hair, and we would also like to
optimize for those things -- imagine you could [inaudible] optimize at the same time which
texture best matches the images. So this would be one of the other goals that we have in
mind.
Second is to improve the tracking, because it's not perfect. We would also like to integrate
speech in order to [inaudible], because this is one of the main challenges that we have right now:
if someone is speaking, to be able to see that what he's saying really matches how the lips
move. This is a really challenging thing.
And the thing that I have been considering lately is to do the same procedure for active
appearance models and active shape models. Active appearance models and active shape models
are [inaudible] models. And usually either you create a model that is user specific and then you
get good results for tracking, or you use a generic model that you train on different people, and
then usually the performance is much worse.
So why not initialize the system with a generic active appearance model and try to adapt it
to the user's specific face? So this is the kind of extension that I'm considering right
now, too.
>>: [inaudible] currently in your system, the demo system, I assume you do some kind of
feature tracking. Do you use ASM -- a generic ASM model at all?
>> Sofien Bouaziz: No, not really. We use classifiers for the feature tracking.
>>: [inaudible] how many feature points do you --
>> Sofien Bouaziz: So what you have seen here is the well-known set. But in our software -- so
in the software we have lip feature points, I think we have around 10, and then we have
eyelids and stuff.
So to conclude, we are really excited about the progress of RGB-D devices in terms of quality,
cost, and size. And we believe that [inaudible] people will be able to really use facial motion
capture technology directly at home, and that this will open up new types of applications and
change the way people interact with computers.
We are also really impatient to see what kinds of applications will be possible with this kind of
technology. Beyond applications in HCI, gaming, communication, and security, we also
think that there will be many other interesting applications coming up.
So I just want to come back to this schematic that I presented at the beginning, where I was
saying that this project is at the intersection of computer vision, computer graphics, and machine
learning. And I think that this is a good description of this project, because you have a
registration part, which is totally computer vision. Then we have a 3D modeling part, which is
on the computer graphics side.
And something that I have not really said is that this type of algorithm is very close to
dictionary learning algorithms in machine learning, where you try to optimize for the
dictionary over a [inaudible] basis and this kind of thing. So there are a lot of relations between
these kinds of optimization and machine learning -- or dictionary learning.
So finally I would like to thank a lot of people: Faceshift, HEAD Geneve, Tsinghua University,
the Swiss National Science Foundation that is funding me, and all the [inaudible] this paper.
And there are two more things that I would like to say. First, I'm giving a course at
SIGGRAPH with Mark Pauly on 2D/3D registration, where you will see how to construct
[inaudible] of constraints like [inaudible] flow, point-to-plane constraints, and so on. So if you're
interested to know a bit more about the tracking side of this system, and you are at SIGGRAPH,
please come to this course.
And also Faceshift has a booth at SIGGRAPH, booth 837, where you will be able to try it. We
are currently reimplementing this SIGGRAPH paper in Faceshift. It's almost done. And
hopefully it will be even better, because we have feature trackers and things like that in Faceshift.
So thank you very much, and I am ready for your questions.
[applause]
>>: Have you tried getting a depth map from a time-of-flight sensor, and are you going to try
stuff with the new Kinect?
>> Sofien Bouaziz: So I would like to try stuff with the new Kinect, but I heard that you cannot
plug it into a computer. So I was a bit disappointed with that.
>>: Well, it soon will be possible, but not right now.
>> Sofien Bouaziz: Yeah. So when it is available, sure, I will definitely try with the new
Kinect. Especially since the texture is much better, it seems, so I'm sure we'll get much better
tracking accuracy and modeling. And time-of-flight we did not really try. But it's
something that we should probably do.
>>: So in online tracking problems people usually have the problem of drifting; that is, if
you update the model in the wrong way, in a slightly wrong way, then you will accumulate
errors. So I wonder how you address this kind of problem. Is it because of the
regularization terms you used, or other tricks that you --
>> Sofien Bouaziz: I will say yes and no. So one thing, yeah, obviously all this regularization
that we put in is key to the quality of this tracking system. This is one thing. So for the
tracking side, it doesn't drift because we fit to the 3D. And the way we fit to the 3D locks the
shape pretty well at each frame. And then [inaudible] is self-constrained. So everything is
self-constrained, so it's regularized pretty well and it kind of locks onto the data. So it doesn't
really drift. It would drift only if you had only temporal constraints, for example. But here you
also have a static constraint at each frame, which is fitting to the depth map. Is that what you're
asking?
>>: Yes.
>> Sofien Bouaziz: Yes.
>>: I wonder how you -- I mean, how you choose the weights between the regularization
and -- because if you regularize too much, then you don't look at --
>> Sofien Bouaziz: Sure.
>>: Yes. And if you don't regularize, then you get -- you get drifting.
>> Sofien Bouaziz: Sure. So we would say this is always a problem with any computer vision
system. We have weights to tune. Right now it's done by experience. It's working well. We
would say you could try to learn them by using some data. We did not try that for now, because
for now we have this and it's okay. But this is probably future work, [inaudible]
how to learn these weights from data. Sure. This is interesting.
>>: [inaudible] models and then you have this [inaudible] generic based class [inaudible] the
expressions.
>> Sofien Bouaziz: Yes.
>>: So sometimes the PCA blendshapes and expression bases, they may -- they overlap.
They're not very separate. So you may confuse -- maybe there's some expression which is not
[inaudible] correctly but then you may confuse [inaudible]. You may add [inaudible].
>> Sofien Bouaziz: Sure. It's one of the reasons why we add this L1 term: to activate
as few shapes as possible and get a minimal description of the signal. Because if you try with L2,
you may have a lot of signal going on, and then you will start mixing everything and it will
be terrible. So this [inaudible] is one thing. I did not describe very properly why it works
with this L1 term, but it is also one of the keys to why it's robust, because --
>>: That helps, but I still -- you will not [inaudible].
>> Sofien Bouaziz: Yes. But you are basically doing that over time. So whenever it starts
converging, then you reconverge to something. So if it has converged, it will converge pretty
well, because you align the data more and more. But at time T at the beginning, yeah --
when you start, you will get confused at the beginning. Because the model is wrong, you
will get wrong blendshape weights. And then because the weights are wrong, you
somehow get wrong blendshapes. But then, if you regularize your problem properly, as you have
seen, it can converge pretty nicely.
So everything is a mixture of regularization in order to hold things together. But, yeah, I
agree with you that it is not always a trivial task to build these kinds of systems, yeah.
>>: [inaudible] I saw in your demo previously that sometimes the eyes [inaudible] the tracking
is not very good, because maybe the model has closed -- maybe it is closing the eyes and the
person is not --
>> Sofien Bouaziz: Yeah. So --
>>: [inaudible].
>> Sofien Bouaziz: Yeah. So right now, in the system that I've shown, we don't do anything for
the eye placement. We don't try to place the eyes properly. And when we track the eyes, we
need to know where the eyes are located in the image. So if the eye placement is a bit wrong,
you can get wrong tracking.
In the reimplementation that we are doing in the Faceshift software, we now do
some detection of the eye position and add that into the constraints. So the eye tracking will be
better. Yeah. So the reason is that the eye placement is not very well done, and we then use the
eye placement to track the eyes, to know where the eye is in the image. This is the reason why.
>>: Is there any modeling of the tongue or teeth?
>> Sofien Bouaziz: Yeah, as I was telling [inaudible], the other funny thing is that every time
someone tries the system, after two minutes they do "ah" and ask, where is my tongue?
And then we say, for now we don't have a tongue. But, yeah, we are talking about that.
Maybe at some point we should consider using the teeth -- [inaudible] the opening of the teeth,
model that, detect if the tongue is outside. All those things are future work. But they are very
difficult, because you don't see the tongue much in the data most of the
time if you don't do "ah."
>>: Seems like we could see the teeth.
>> Sofien Bouaziz: Oh, so in the mesh that we have, we have teeth and a tongue, yes. And those
get adapted. But they are adapted in a heuristic way. We basically fit them with a model to
adapt them. I can show you again. But they are not actually fitted to the data. It's just to
make things look better.
So a lot of things in this tracking system are also there to make things look good and plausible.
And adding the tongue and teeth makes things look much better. So I'll just come back to this to
show you the teeth and tongue.
>>: [inaudible].
>> Sofien Bouaziz: Yeah. But the tongue doesn't move.
>>: [inaudible] stick out your tongue. Basically there was -- you need -- the weakness is that
you really need to recognize that the person, the user, is sticking out his tongue; the
system doesn't know how to [inaudible].
>> Sofien Bouaziz: Yes.
>>: [inaudible] tongue recognition.
>> Sofien Bouaziz: I think this could be an interesting paper, but quite an odd -- quite an odd paper.
>>: [inaudible].
>> Sofien Bouaziz: Yes. Let me put that.
>>: [inaudible] only on the CPU?
>> Sofien Bouaziz: Yes.
>>: Because you run the iterations at 30 frames per second, right?
>> Sofien Bouaziz: Yes. I mean, all the time -- you track all the time, basically. So this is on the
CPU. So right now the Faceshift software that you have seen uses only one core. So it's not
[inaudible]. And the other reason why is because we want to provide an SDK that
people will be using for online gaming. And then you don't want to use all the cores of the
computer while people -- like if you open Skype and then your computer starts burning, it's
usually not good.
So we are trying to keep it low. This means that we do a lot of code optimization. But what
you have seen here runs basically mostly on one core at 30 frames per second, which means
that you have 33 milliseconds per frame to solve this messy problem. So there are
some other smart things going on in the software that I did not explain here.
There are a bit more details in the SIGGRAPH paper on what we did. Because the matrix
that we have here has a lot of symmetric parts that you can factor out, and this kind of thing.
So it's not [inaudible]. If you try to implement it like this straight away, it will probably take you
a bit of time before getting the same performance, because we did try many things too. Even the
ICP [inaudible] in the alignment problem -- like [inaudible] an ICP to make it realtime is not that
trivial a thing. And so this was also one of the main things that we did in all this software
and this work: how to make it fast. And this took a lot of experimentation, I would say.
>>: So when we download the software --
>> Sofien Bouaziz: What?
>>: When we download software from the -- can we download the software from the Faceshift
Web site?
>> Sofien Bouaziz: Yes. So Faceshift comes with a 30-day trial version for free.
>>: Trial version?
>> Sofien Bouaziz: Yeah, you can use it. You can do a scan of your face, which I have not shown.
I don't know if people are interested to see that, but you can scan your face.
>>: [inaudible] plug in a Kinect?
>> Sofien Bouaziz: Yes.
>>: Does it work?
>> Sofien Bouaziz: Yes. So any OpenNI-compliant RGB-D device -- you plug it in, and then you
do what I did here. And so you can scan your face, get this 3D model of your face, and you
can track and record, but only for 30 days. And then at some point you maybe need to pay for it.
But that comes later.
>>: [inaudible] do you use any kind of retargeting on the blendshapes?
>> Sofien Bouaziz: Sorry?
>>: [inaudible] or, for example, when you have the other avatars that don't look
human and you match them -- so you probably use the human face and then retarget to the other
face or --
>> Sofien Bouaziz: So you're asking about the retargeting -- can we transfer the expressions of my
face to the avatar, right?
>>: [inaudible] or introducing the blendshapes?
>> Sofien Bouaziz: So we use the blendshape weights for the retargeting. Right now it's done in
a very simple way: you create your avatar with the same blendshapes that you are tracking
with. But we are working on [inaudible] machine learning to learn the retargeting, to learn
some mapping between the blendshape weights and the avatar weights. Because the avatar can
have bones and different parameters that you don't have -- that you don't want to have -- in your
tracking system.
>>: Because, I mean, I usually work with bones and [inaudible], and that's one problem
[inaudible] when you want to go from blendshapes to [inaudible].
>> Sofien Bouaziz: So, yeah. We will soon submit a paper on that problem, in fact, which is
how to solve the retargeting when you have a tracking system and the [inaudible] has
different types of parameters. But right now, in what you have seen, they just have the same
blendshapes. Yeah.
>>: [inaudible] seems to be -- to do more gestural -- like similar gesture motion or targeting
[inaudible] and a model like a human face. So were you also thinking of playing around with
that? And, I don't know, for example, if you do one expression, then the model does an extreme
version of it?
>> Sofien Bouaziz: Yes. Yes. We are thinking a lot about that, in fact. It's what I call
style-based retargeting. Usually a human doesn't have the same dynamics as a monkey,
for example -- I like the monkey example. And how can you retarget from your motion capture
system to another rig that has another dynamic, like a cartoon? A cartoon face will
usually have much bouncier dynamics than yourself. And this is [inaudible], but we are
thinking about it a lot. It's something that I'm working on -- yeah, exactly this problem that
you're mentioning.
And this is one big problem with motion capture, in fact. A lot of motion capture data
[inaudible] directly from movies, for example, because the animator is animating for his
[inaudible] monkey. So, for example, in the movie King Kong, they were recording an actor and
then trying to retarget the expressions of the actor to King Kong. But you need to clean things
up a lot, because usually the dynamics are not good.
And so what they do is that the actor usually tries to mimic monkey expressions and do very
extreme poses. In motion capture you have actors especially for motion capture, trained
for motion capture, doing extreme poses. But this is one of the main drawbacks of motion
capture right now.
And if we can solve this problem, I believe motion capture will be used much more extensively
in movies and so on. But right now it's not at the state where, in a movie, you can just record
an actor, take the data, and just project it onto an avatar. It just doesn't work. There's a lot
of cleaning, a lot of cleaning going on. Yeah.
>>: So I'm probably the odd person here because I don't do video stuff, but I do physical
animatronics, and we're doing a lot of work with kids, getting them to build animatronic figures
and program them. And the process for creating a show for an animatronic figure is actually
very analogous, in that basically you have a very small number of degrees of freedom, and these
are programmed typically one by one, you know, usually with a joystick in realtime, and
people try to go in and edit it and fix it up.
Have you looked at -- because typically our characters -- because, you know, we want to stay
away from human characters because they're very hard to do well, so we typically have animal
characters. And they have very limited motions that we need to retarget to, in your speak. Have
you looked at this problem at all and how you might approach making a tool -- sort of, we're
interested in trying to make it easier for kids who want to program these shows to put them
together.
And I see a lot of potential for kids kind of acting this out directly and having their figures do it,
but the figures are totally random. Kids create them. You don't have a good model of them.
You really need to capture the degrees of freedom of the figure first.
>> Sofien Bouaziz: So let me summarize the problem, if I understand it well. You want to design
a face and you need some degrees of freedom to design the animation of this face,
right? And then you don't really know how to place these degrees of freedom?
>>: Yeah.
>> Sofien Bouaziz: Yeah. So, you know, we have a really similar problem. We use a
blendshape model, and this defines the final degrees of freedom in our tracking system. And one
thing that we don't know is what the best blendshape model is if you want to do tracking, for
example, or even to do some animation.
And when you talk to animators about human faces, each of them has different parameters,
different ways to move the face, different degrees of freedom. And it's just amazing to see how
different things are between animators.
And what you say is exactly this: a human face is very complicated to animate without good
degrees of freedom. And I believe that for animals it is probably the same problem -- you don't
know where and how to place the degrees of freedom, right?
And recently we were asking ourselves how to build exactly what you said: how to build the
model, for example, for tracking, and how to place those degrees of freedom such that the
tracking will be good. And this is a hard task.
But if you want to talk about it more a bit later, we can. Because I'm not
totally into animatronics, so I'm not sure I replied very well to your question. But if you explain
it to me a bit better after the talk, then we can discuss it.
>> Zicheng Liu: Any other questions?
>>: So do you think -- would you mind if some people wanted to stay and try the demo?
>> Sofien Bouaziz: Sure.
>> Zicheng Liu: So if you guys want to stay here and try the demo, you're welcome [inaudible].
Okay [inaudible] thank the speaker again.
[applause]