
REAL-TIME HEAD POSE ESTIMATION USING DEPTH MAP FOR
AVATAR CONTROL
Yu Tu (屠愚), Chih-Lin Zeng (曾志霖), Che-Hua Yeh (葉哲華), Ming Ouhyoung (歐陽明)
Dept. of Computer Science and Information Engineering,
National Taiwan University
{tantofish, yesazcl, chyei}@cmlab.csie.ntu.edu.tw, ming@csie.ntu.edu.tw
ABSTRACT
In this paper, we propose a system that estimates head poses in real-time using depth information only. We first track the user's nose and sample a set of 3D points around it. We then fit a plane to this point cloud by the least square error method; the normal vector of the plane yields the yaw and pitch angles of the user's head orientation. In addition, fitting an ellipse to the head boundary gives the roll angle. Our system uses a simple, easily accessible data acquisition device, the Microsoft Kinect sensor, which makes the system easy to set up at the cost of high noise in the depth data. Because the algorithm relies on depth data only, the system also works in environments without light. We demonstrate that 3D head pose estimation can be achieved in real-time with noisy depth data and without user calibration.
Keywords: Head Pose Estimation; Depth Map; Kinect; Least Square Error Plane; Real-Time Tracking; Nose Tracking; Markerless Performance Capture
1. INTRODUCTION
A successful interaction system should be robust, respond to the user in real-time, and run without error for a long time. Head pose is an important cue for inferring the user's gaze orientation, and it can also be used to control an avatar. We model the head pose in three-dimensional space with three rotation parameters: roll, pitch, and yaw (Fig. 1).
Fig. 1: Illustration of the three degrees of freedom of head pose.
Fig. 2: The color image and corresponding depth map captured by Kinect; both have a resolution of 640 x 480 at 30 fps.
State-of-the-art methods for head pose estimation can roughly be divided into several categories depending on the kind of input data they need (i.e., color image or depth map). There are many works on head pose estimation from color images [15]. Color-image-based algorithms can further be divided into feature-based [8, 1, 10, 11, 18] and appearance-based [7, 2, 4, 17, 5, 6, 18] methods. However, methods that rely on color images are sensitive to illumination, so environments with weak or no light may lead to inaccurate estimates.
Fig. 3: Visualized overview of the online processing pipeline: real-time depth/color data acquisition while the user acts, head pose estimation (reverse rotation, least square plane, and ellipse fitting), and avatar control.
Thanks to fast depth map generation systems such as [12], many works use depth data as additional information to overcome some of the limitations of color images [1, 14, 20], but those methods still require appearance cues. Therefore, several recent works use depth data as their primary information [3, 9, 13, 19]. Breitenstein et al. [3] proposed a system that can handle large rotation angles in real time using a GPU, but its computational complexity is clearly higher than that of our approach. The state-of-the-art works [16, 19], while capable of handling large rotation angles, need training data, whereas ours does not.
Recently, Microsoft has released a device, the Microsoft Xbox Kinect, which simultaneously captures a color image and a depth map at 30 fps (Fig. 2). Kinect uses infrared rays to acquire the depth map, but it has some limitations. First, the depth map is noisy: we measured an average flickering rate of about 3% for a pixel of a static object, and the maximum flickering rate, which appears at object edges, exceeds 30%. As a result, we have to handle the noisy data while preserving the accuracy of our algorithm; otherwise the output parameters would flicker all the time, and the estimated head pose would flicker even when the user's head is static. Secondly, there are many holes with no depth information within the depth map, caused by occlusion and specular reflection that prevent the infrared rays from being received by Kinect. For accuracy, it is better not to use data points that contain no depth information. Thirdly, the Kinect sensor cannot acquire depth data when an object is too close to or too far from the camera; even when it does return depth data at such extreme distances, the data are imprecise and unusable. In our measurements, 1 m to 6.5 m is an appropriate distance range for sufficiently precise depth information. By addressing these problems, our system becomes more robust and accurate.
In this paper, we propose an approach that estimates the head pose by finding the nose position in the depth map, sampling a point cloud around the nose, and fitting a least square error plane to it; the plane normal then represents the face orientation. Since we do not require any user setup, tracking the nose is a difficult task, because depth is the only useful information we have. We make the assumption that the nose is the nearest point to the camera when a user faces the depth camera, and we use this assumption to track the user's nose.
The rest of this paper is organized as follows. The overall system workflow is briefly introduced in Sec. 2, which explains what each stage does. Section 3 describes the implementation: how the proposed algorithm locates the nose position using depth data and how the rotation angles are calculated. Our experimental results are presented in Sec. 4. Finally, we present conclusions and future work in Sec. 5.
2. SYSTEM OVERVIEW
Our system overview is illustrated in Fig. 4, where each box represents a procedure. The head pose estimation process is divided into two parts: one for yaw and pitch (left part) and the other for roll (right part). In the left part, after a new depth image is captured from Kinect, the proposed algorithm locates the nose position and samples points around it. We then fit a least square plane to those points, and the plane's normal vector represents the face orientation. In the right part, we define an appropriate depth threshold to extract the head boundary, to which an ellipse is fitted. Both parts pass through a history table, which smooths the estimated parameters in order to tackle the flickering problem. The final smoothed output parameters can be used to animate the virtual avatar in real-time.
Fig. 1: Perspective projection model used in this paper to retrieve the 3D point cloud from Kinect: the camera sits at the origin (0, 0, 0), the focal plane lies at focal length f, and a 3D point p(x, y, z) in camera coordinates projects to a 2D image point P(X, Y) with depth value z.
Fig.4: System flow chart. After data acquisition, our
system can be divided into two parts: Plane fitting for
yaw and pitch estimation; Ellipse fitting for roll
estimation. History table keeps track of the results in
order to filter out the outliers and smooth the output
parameters.
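To make the data flow of Fig. 4 concrete, the following per-frame sketch strings the stages together in Python. It is only an illustration: the helper names (segment_head, sample_around_nose, the 1500 mm threshold, and so on) are placeholders rather than the authors' implementation, and the individual stages are detailed in Sec. 3.

```python
# Minimal per-frame sketch of the Fig. 4 pipeline (helper names are placeholders).

def process_frame(depth_map, history, prev_yaw, prev_pitch):
    points = depth_to_point_cloud(depth_map)          # Sec. 3.1, Eq. 2
    head_points = segment_head(points)                # head/background segmentation elided
    t_x, t_y, t_z = estimate_translation(head_points) # Sec. 3.2, Eq. 3

    # Right part of Fig. 4: head boundary -> ellipse fit -> roll angle.
    roll = estimate_roll(depth_map, depth_threshold=1500)   # threshold is illustrative

    # Left part of Fig. 4: reverse rotation -> nose -> plane fit -> yaw, pitch.
    nose = track_nose(head_points, prev_yaw, prev_pitch)
    samples = sample_around_nose(head_points, nose, n=300)
    A, B, C = fit_plane(samples)
    yaw, pitch = plane_normal_to_yaw_pitch(A, B)

    # History table filters outliers and smooths the output parameters.
    yaw, pitch, roll = history.smooth((yaw, pitch, roll))
    return (t_x, t_y, t_z), (yaw, pitch, roll)
```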
3. IMPLEMENTATION
In general, we assume that the head pose consists of six parameters. Three of them are translation parameters with respect to the x-, y-, and z-axes; the rest are rotation parameters: yaw, pitch, and roll. Our goal is to precisely estimate these six parameters in real-time while preserving temporal coherence.
3.1. Preprocessing
As mentioned in the previous chapters, the proposed algorithm uses the Microsoft Kinect to retrieve input data. Each frame is a depth map of VGA resolution whose pixel values range from 0 to 10000 millimeters. Taking the depth value as a point's z coordinate, the x and y coordinates are still unknown. A simple perspective projection model handles this issue. Let the camera be the origin of the 3D world coordinate system, with its view direction as the positive z-axis. The focal plane is located at a distance f in front of the camera. A point p(x, y, z) on the surface of an object in the 3D scene is projected to a point P(X, Y) on the 2D focal plane, where
Fig. 2: (a) Detect the boundary of the user's head; (b) smooth the boundary by averaging neighboring points; (c) fit an ellipse that best matches the smoothed head boundary; (d) apply the angle of the resulting ellipse to the virtual avatar.
X  f
x
y
, Y  f
z
z
Eq.1
The problem in this case can be stated as follows. The 2D coordinates of a data point P(X, Y) and its depth z are known, while the 3D coordinates x and y remain unknown. Knowing that the Kinect camera's focal length f is 575 pixels, the 3D point can be recovered as:
\[
p = (x, y, z) = \left( z\,\frac{X}{f},\; z\,\frac{Y}{f},\; z \right)
\tag{Eq. 2}
\]
Thus, the following steps of this work operate on the 3D point cloud retrieved by this preprocessing.
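As a concrete illustration of the back-projection in Eq. 2, the sketch below converts a Kinect depth map into a 3D point cloud with numpy. It assumes the principal point is at the image centre and drops zero-depth (hole) pixels as suggested in the introduction; it is an illustrative reading of the preprocessing, not the authors' code.

```python
import numpy as np

FOCAL_LENGTH = 575.0  # Kinect focal length in pixels, as stated in Sec. 3.1

def depth_to_point_cloud(depth_mm):
    """Back-project a 640x480 depth map (in millimetres) using Eq. 2."""
    h, w = depth_mm.shape
    # Image coordinates relative to the image centre (assumed principal point).
    X, Y = np.meshgrid(np.arange(w) - w / 2.0, np.arange(h) - h / 2.0)
    z = depth_mm.astype(np.float64)
    valid = z > 0                      # holes in the depth map carry no information
    x = z * X / FOCAL_LENGTH           # x = z * X / f  (Eq. 2)
    y = z * Y / FOCAL_LENGTH           # y = z * Y / f
    return np.stack([x[valid], y[valid], z[valid]], axis=1)   # N x 3 points
```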
3.2. Translation estimation.
This part is easy to accomplish and is not the main part of our work. By computing the center of the point cloud of the user's head as in Eq. 3, we can easily estimate the translation parameters.
\[
(t_x, t_y, t_z) = \frac{1}{N} \sum_{i=1}^{N} (x_i, y_i, z_i)
\tag{Eq. 3}
\]
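A one-line numpy realization of Eq. 3, assuming the head point cloud is given as an N x 3 array:

```python
import numpy as np

def estimate_translation(head_points):
    """Eq. 3: the head translation is the centroid of its point cloud."""
    return head_points.mean(axis=0)    # (t_x, t_y, t_z)
```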
Fig. 3: (a) Detect and track the user's nose; (b) sample several points from the nose's neighboring area; (c) apply the least square error algorithm to fit a plane to the sampled points.
3.3. Rotation estimation.
We propose a novel algorithm for rotation estimation. The estimation is divided into two parts: ellipse fitting for the roll angle, and least square error plane fitting for the yaw and pitch angles.
3.3.1 Roll angle estimation
To estimate the ellipse that best matches the user's head, we first define the pixels that belong to the head boundary. We dynamically set a depth threshold to crop out the background pixels and find the left-most and right-most foreground pixels in each row; these pixels represent the head boundary, as shown in Fig. 2(a). However, there is a problem that we cannot ignore: the Kinect data is so noisy that the acquired depth values flicker constantly. This causes a temporal coherence problem in which our virtual avatar trembles all the time, which is neither realistic nor what we want. Therefore, we add a smoothing term to handle this issue. This smoothing process adjusts the coordinates of each head boundary pixel by averaging over a defined number of successive boundary pixels (Fig. 2(b)).
Smoothing term:
\[
(\bar{x}_i, \bar{y}_i) = \frac{1}{2l+1} \sum_{k=i-l}^{i+l} (x_k, y_k)
\tag{Eq. 4}
\]
After smoothing, a least square error ellipse is fitted to these boundary points (Fig. 2(c)). Having obtained the best-fitting ellipse, we take its rotation angle as the estimated roll angle and apply this parameter to a virtual avatar; Fig. 2(d) shows the result.
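The roll branch can be sketched as follows. This is an illustrative implementation that uses OpenCV's cv2.fitEllipse for the least square ellipse, a simple per-row boundary extraction, and the moving average of Eq. 4; the depth threshold and the window half-length l are assumptions, not the authors' values.

```python
import numpy as np
import cv2

def head_boundary(depth_mm, depth_threshold):
    """Left-most and right-most foreground pixel of every row (Sec. 3.3.1)."""
    fg = (depth_mm > 0) & (depth_mm < depth_threshold)
    left, right = [], []
    for row in range(fg.shape[0]):
        cols = np.flatnonzero(fg[row])
        if cols.size:
            left.append((cols[0], row))
            right.append((cols[-1], row))
    return np.array(left, np.float32), np.array(right, np.float32)

def smooth(points, l=3):
    """Eq. 4: replace each boundary point by the mean of 2l+1 successive points."""
    kernel = np.ones(2 * l + 1) / (2 * l + 1)
    return np.stack([np.convolve(points[:, 0], kernel, mode="same"),
                     np.convolve(points[:, 1], kernel, mode="same")], axis=1)

def estimate_roll(depth_mm, depth_threshold):
    """Fit a least-square-error ellipse to the smoothed boundary; its angle gives the roll."""
    left, right = head_boundary(depth_mm, depth_threshold)
    boundary = np.vstack([smooth(left), smooth(right)]).astype(np.float32)
    (cx, cy), axes, angle = cv2.fitEllipse(boundary)
    return angle    # mapping this angle to the avatar's roll convention is application-specific
```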
3.3.2 Pitch and yaw estimation
Besides the roll angle, the yaw and pitch angles must be estimated as well. The necessary steps are introduced in the rest of this chapter. The main idea is that a human face can roughly be considered as a plane, and the normal vector of this plane can represent the orientation of the actor's face. Our goal is to reconstruct this plane.
Fig. 4: The nose has the shallowest depth value in the point cloud within a small rotation range (a), while other parts of the head take over the shallowest position at larger rotations (b). The red circle indicates the detected shallowest point.
To achieve this goal, least square error plane fitting is applied to the 3D point cloud. However, Kinect does not tell us which of the 3D points belong to the actor's face and which do not; it gives us the whole captured scene instead. In order to sample a consistent area of the user's face across different frames and different head poses, we focus on nose detection and nose tracking: we simply define the nose's neighboring area to be the face area. Figure 3 shows the pipeline of this step.
We observe from experiments that most of the time the nose is the part of the head nearest to the camera, except when the user turns his head by a sufficiently large angle. This observation leads to the initial guess that the nose is the point with the shallowest depth value in the point cloud. This initial guess remains robust within a small rotation range (Fig. 4(a)). However, at larger rotations other parts of the head take over the shallowest position: for example, the glasses or a cheek becomes the shallowest point when the yaw angle exceeds about 20 degrees, and the chin or the fringe becomes the shallowest point when the pitch angle exceeds about 15 degrees (Fig. 4(b)).
This problem can be tackled by the following step. A human head can only rotate by a small angle within a short moment such as one thirtieth of a second, which is the time between two consecutive frames. We therefore take advantage of the temporal information already calculated in the previous iteration: a reverse rotation matrix, built from the yaw and pitch angles generated in the last iteration, is applied to the whole point cloud to rotate the head back to the frontal pose. After this transformation, the new point cloud builds up a head that faces straight toward the camera.
Note that the camera is the origin of the world coordinate system. The reverse rotation is given by:
\[
\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix}
=
\begin{bmatrix}
\cos\theta_y & 0 & -\sin\theta_y \\
0 & 1 & 0 \\
\sin\theta_y & 0 & \cos\theta_y
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 \\
0 & \cos\theta_p & \sin\theta_p \\
0 & -\sin\theta_p & \cos\theta_p
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
\tag{Eq. 5}
\]
Eq. 5: Rotation matrices for turning the point cloud of the user's head back to the frontal pose, where \(\theta_y\) and \(\theta_p\) are the yaw and pitch angles estimated in the previous iteration.
Even though the adjusted point cloud looks like an incomplete face from the camera's view, as long as the nose was captured in the original depth map we can successfully track it simply by finding the shallowest point in the adjusted point cloud. Figure 5 illustrates the situation in which the user rotates by such a large angle that another part of his head (green circle in Fig. 5(a)) takes over the shallowest position. After reverse rotation by the parameters of the previous iteration (Fig. 5(b)), the nose again has the shallowest depth value (red circle). The orange arrow shows the rotation direction.
Fig. 5: A reverse rotation transform is applied to the whole point cloud (a) to rotate the head back to the frontal pose (b). The green circle indicates the shallowest point before the reverse rotation, while the red circle indicates the new shallowest point afterwards.
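A minimal sketch of the reverse rotation of Eq. 5 and the shallowest-point nose search, assuming the head point cloud is an N x 3 numpy array and the previously estimated angles are given in radians; names and sign conventions are illustrative.

```python
import numpy as np

def reverse_rotate(points, yaw, pitch):
    """Eq. 5: undo the previously estimated yaw and pitch (angles in radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    R_yaw = np.array([[cy, 0, -sy],
                      [ 0, 1,   0],
                      [sy, 0,  cy]])
    R_pitch = np.array([[1,   0,  0],
                        [0,  cp, sp],
                        [0, -sp, cp]])
    # Apply R_yaw * R_pitch to every (row-vector) point.
    return points @ (R_yaw @ R_pitch).T

def track_nose(head_points, prev_yaw, prev_pitch):
    """The nose is the point with the shallowest depth after reverse rotation."""
    frontal = reverse_rotate(head_points, prev_yaw, prev_pitch)
    return head_points[np.argmin(frontal[:, 2])]   # return the original 3D point
```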
As mentioned earlier in this chapter, a human face can roughly be considered as a plane, so the normal vector of this plane can represent the orientation of the actor's face. Since the system has detected the user's nose, we simply consider the nose's neighboring area to be the user's face. In this paper, we sample 300 points from the defined face area (Fig. 6) and fit a least square error plane to these sample points. The normal vector of the fitted plane is taken as the face orientation of the user; that is to say, the direction the normal vector points to is exactly the direction the user faces.
Fig. 6: We sample 300 points from the defined face area and fit a least square error plane to them. Any sample points that have no depth information are ignored.
An algebraic solution to the least square approximation problem is introduced below. Let the plane's linear equation be Ax + By + C = z.
Equation 6 states that every point sampled from the point cloud lies on the plane Ax + By + C = z:
\[
\begin{bmatrix}
x_1 & y_1 & 1 \\
x_2 & y_2 & 1 \\
\vdots & \vdots & \vdots \\
x_n & y_n & 1
\end{bmatrix}
\begin{bmatrix} A \\ B \\ C \end{bmatrix}
=
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}
\tag{Eq. 6}
\]
where \((x_i, y_i, z_i)\) is the 3D coordinate of a sample point and n denotes the number of sample points. This is an overdetermined linear system and can be solved by multiplying both sides by the transpose of the coefficient matrix:
\[
\begin{bmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
1 & 1 & \cdots & 1
\end{bmatrix}
\begin{bmatrix}
x_1 & y_1 & 1 \\
x_2 & y_2 & 1 \\
\vdots & \vdots & \vdots \\
x_n & y_n & 1
\end{bmatrix}
\begin{bmatrix} A \\ B \\ C \end{bmatrix}
=
\begin{bmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
1 & 1 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}
\tag{Eq. 7}
\]
After simplification, the least square plane coefficients can be obtained by solving the following equation:
\[
\begin{bmatrix}
\sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i y_i & \sum_{i=1}^{n} x_i \\
\sum_{i=1}^{n} x_i y_i & \sum_{i=1}^{n} y_i^2 & \sum_{i=1}^{n} y_i \\
\sum_{i=1}^{n} x_i & \sum_{i=1}^{n} y_i & \sum_{i=1}^{n} 1
\end{bmatrix}
\begin{bmatrix} A \\ B \\ C \end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{n} x_i z_i \\
\sum_{i=1}^{n} y_i z_i \\
\sum_{i=1}^{n} z_i
\end{bmatrix}
\tag{Eq. 8}
\]
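Eq. 8 can be solved directly with numpy; the sketch below is a straightforward transcription, assuming the 300 sampled face points are stored as an N x 3 array.

```python
import numpy as np

def fit_plane(samples):
    """Solve Eq. 8 for the plane z = A*x + B*y + C through the sampled face points."""
    x, y, z = samples[:, 0], samples[:, 1], samples[:, 2]
    M = np.array([[np.sum(x * x), np.sum(x * y), np.sum(x)],
                  [np.sum(x * y), np.sum(y * y), np.sum(y)],
                  [np.sum(x),     np.sum(y),     len(x)   ]])
    b = np.array([np.sum(x * z), np.sum(y * z), np.sum(z)])
    A, B, C = np.linalg.solve(M, b)
    return A, B, C   # the plane normal is (A, B, -1)
```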
Having the solution of Eq. 8, we express the plane's normal vector \(\vec{N} = (A, B, -1)\) in terms of the yaw and pitch angles. Figure 7 illustrates the relation between the normal vector \(\vec{N}(A, B, -1)\) and the pose parameters {yaw, pitch}. The angle α between \(\vec{N}_{yz}\) and the negative z-axis denotes the pitch angle, where \(\vec{N}_{yz} = (0, B, -1)\) is obtained by projecting \(\vec{N}(A, B, -1)\) onto the y-z plane. Likewise, the angle β between \(\vec{N}_{xz}\) and the negative z-axis denotes the yaw angle, where \(\vec{N}_{xz} = (A, 0, -1)\) is obtained by projecting \(\vec{N}(A, B, -1)\) onto the x-z plane.
Fig. 7: The relation between the normal vector \(\vec{N}\) and the pose parameters {yaw, pitch}; α denotes pitch while β denotes yaw.
Summarizing Figure 7, we derive an equation for transforming \(\vec{N}(A, B, -1)\) into (α, β):
\[
\alpha = \cos^{-1}\!\left(\frac{1}{\sqrt{B^2 + 1}}\right), \qquad
\beta = \cos^{-1}\!\left(\frac{1}{\sqrt{A^2 + 1}}\right)
\tag{Eq. 9}
\]
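A small helper corresponding to Eq. 9 as reconstructed above (pitch from B, yaw from A). Recovering the sign of each angle from the sign of the corresponding coefficient is an assumption, since the arccosine alone only gives the magnitude.

```python
import numpy as np

def plane_normal_to_yaw_pitch(A, B):
    """Eq. 9: angles between the projected plane normal and the negative z-axis (in degrees)."""
    pitch = np.degrees(np.arccos(1.0 / np.sqrt(B * B + 1.0))) * np.sign(B)
    yaw   = np.degrees(np.arccos(1.0 / np.sqrt(A * A + 1.0))) * np.sign(A)
    return yaw, pitch   # which direction counts as positive is an assumed convention
```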
3.4 History table
As mentioned earlier in this paper, our system works in natural environments using a non-intrusive, commercially available 3D sensor, the Microsoft Kinect, but the convenience and simplicity of setup come at the cost of high noise in the acquired data. Our system should remain robust when the depth map sequence flickers or when missing data covers so large an area that the algorithm cannot work. In addition to the preliminary smoothing in the estimation stage, a history table is maintained to keep track of the estimated result in every frame. First, we filter out results with impossible angles, for instance a yaw angle of 60 degrees coming right after a yaw angle of 5 degrees. Second, the table automatically averages the latest n results in order to smooth the estimation. We use this smoothed result as the final output of our system to animate the virtual avatar.
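A possible sketch of the history table follows; the window length n and the jump threshold used for outlier rejection are illustrative values, not the authors' settings.

```python
from collections import deque
import numpy as np

class HistoryTable:
    """Keeps the latest n pose estimates; rejects implausible jumps, then averages."""

    def __init__(self, n=5, max_jump_deg=20.0):
        self.history = deque(maxlen=n)      # window length n is illustrative
        self.max_jump = max_jump_deg        # outlier threshold is illustrative

    def smooth(self, pose):
        pose = np.asarray(pose, dtype=np.float64)   # e.g. (yaw, pitch, roll) in degrees
        if self.history and np.any(np.abs(pose - self.history[-1]) > self.max_jump):
            pose = self.history[-1]          # drop an impossible jump, reuse the last result
        self.history.append(pose)
        return np.mean(self.history, axis=0)  # average of the latest n results
```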
4. RESULTS
We present results of our real-time performance capture. The output of our system is a continuous stream of head pose parameters. Figure 8 shows three people, each making ten arbitrary poses. The first and fourth columns show the pose the user makes; note that the system does not use any information from these color images, since a color image may become unusable when the light is off. The second and fifth columns show the depth maps captured by the 3D sensor. The third and sixth columns visualize the output of our system by using the output parameters to control a virtual avatar.
The capability of our system mainly relies on nose tracking. For the yaw and pitch angles, as long as our system successfully detects the nose, it can generate acceptable corresponding angles; conversely, once nose tracking fails, the estimation is very likely to fail as well.
On the other hand, roll angle estimation is very robust in the proposed system: as long as the user can make the roll pose, the ellipse fit yields a stable roll estimate.
Fig. 8: Results for three users, each making ten arbitrary poses (color image, depth map, and the controlled virtual avatar).
5. CONCLUSION AND FUTURE WORK
We have presented a system that estimates head poses using pure depth information in real-time, together with a novel method to track the user's nose within the depth map. After the nose is located, points are sampled around it and a least square plane that approximates the user's face is fitted to them; the yaw and pitch parameters are then generated from the plane's normal vector, and the roll parameter is obtained simply by fitting an ellipse to the head boundary. Our parameter generation method is intuitive and easy to understand. Compared with other methods that also use pure depth information as their primary cue, our system does not require the user to perform any setup when the system starts and works without any training data.
Our system has some limitations. First, if the user's fringe is closer to the depth camera than the nose, our system may locate the fringe instead of the nose; the reason is that our system locates the nose by reverse-rotating the input depth data of the head according to the previously estimated head pose parameters and then finding the point closest to the depth camera. Therefore, it is recommended that the most salient point of the face be the nose when using our system. Second, Kinect originally acquires depth maps at 30 FPS, but with our algorithm the frame rate drops to about 21 FPS on an Intel Q8800 2.8 GHz CPU; at this frame rate, if the user makes relatively fast head rotations, our system cannot track the nose position very robustly. Third, hair length also affects the estimation accuracy; for example, shoulder-length hair may reduce the accuracy of the roll angle.
As mentioned before, the maximum angle of head rotation is limited by whether the nose information is still usable. Therefore, our future work is to increase the maximum angle by using other facial information and to make the system more robust.
REFERENCES
[1] R. Yang and Z. Zhang. Model-based head pose tracking
with stereovision. Aut. Face and Gestures Rec., 2002.
[2] L.-P. Morency, P. Sundberg, and T. Darrell. Pose
estimation using 3d view-based eigenspaces. In Aut. Face
and Gestures Rec., 2003.
[3] M. D. Breitenstein, D. Kuettel, T. Weise, L. Van Gool, and
H. Pfister. Real-time face pose estimation from single
range images. In CVPR, 2008.
[4] V. N. Balasubramanian, J. Ye, and S. Panchanathan. Biased manifold embedding: A framework for person-independent head pose estimation. In CVPR, 2007.
[5] M. Osadchy, M. L. Miller, and Y. LeCun. Synergistic face
detection and pose estimation with energy-based models.
In NIPS, 2005.
[6] M. Storer, M. Urschler, and H. Bischof. 3d-mam: 3d
morphable appearance model for efficient fine head pose
estimation from still images. In Workshop on Subspace
Methods, 2009.
[7] M. Jones and P. Viola. Fast multi-view face detection. Technical Report TR2003-096, Mitsubishi Electric Research Laboratories, 2003.
[8] T. Vatahska, M. Bennewitz, and S. Behnke. Feature-based
head pose estimation from images. In Humanoids, 2007.
[9] S. Malassiotis and M. G. Strintzis. Robust real-time 3d head pose estimation from range data. Pattern Recognition, 38:1153–1165, 2005.
[10] Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In Aut. Face and Gestures Rec., 2000.
[11] J. Yao and W. K. Cham. Efficient model-based linear
head motion recovery from movies. In CVPR, 2004.
[12] T. Weise, B. Leibe, and L. Van Gool. Fast 3d scanning
with automatic motion compensation. In CVPR, 2007.
[13] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang. 3d
deformable face tracking with a commodity depth camera.
In ECCV, 2010.
[14] L.-P. Morency, P. Sundberg, and T. Darrell. Pose
estimation using 3d view-based eigenspaces. In Aut. Face
and Gestures Rec., 2003.
[15] E. Murphy-Chutorian and M. Trivedi. Head pose
estimation in computer vision: A survey. TPAMI,
31(4):607–626, 2009.
[16] G. Fanelli, J. Gall, and L. Van Gool. Real-time head pose estimation with random regression forests. In CVPR, 2011.
[17] L. Chen, L. Zhang, Y. Hu, M. Li, and H. Zhang. Head
pose estimation using fisher manifold learning. In
Workshop on Analysis and Modeling of Faces and
Gestures, 2003.
[18] J. Whitehill and J. R. Movellan. A discriminative
approach to frame-by-frame head pose tracking. In Aut.
Face and Gestures Rec., 2008.
[19] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. In SIGGRAPH, 2011.
[20] E. Seemann, K. Nickel, and R. Stiefelhagen. Head pose
estimation using stereo vision for human-robot interaction.
In Aut. Face and Gestures Rec., 2004.