REAL-TIME HEAD POSE ESTIMATION USING DEPTH MAP FOR AVATAR CONTROL

Yu Tu (屠愚), Chih-Lin Zeng (曾志霖), Che-Hua Yeh (葉哲華), Ming Ouhyoung (歐陽明)
Dept. of Computer Science and Information Engineering, National Taiwan University
{tantofish, yesazcl, chyei}@cmlab.csie.ntu.edu.tw, ming@csie.ntu.edu.tw

ABSTRACT

In this paper, we propose a system that estimates head poses in real-time using pure depth information. We first track the user's nose and sample a set of 3D points around it. We then fit a plane to this point cloud by the least square error method; the normal vector of the plane yields the yaw and pitch angles of the user's head orientation. In addition, fitting an ellipse to the head boundary gives the roll angle. Our system uses a simple data acquisition device, the Microsoft Kinect sensor. Its simplicity and easy access make our system easy to set up, at the cost of high noise in the depth data. Because the algorithm uses depth data only, the system also works in environments without light. We demonstrate that 3D head pose estimation can be achieved in real-time with noisy depth data and without user calibration.

Keywords: Head Pose Estimation; Depth Map; Kinect; Least Square Error Plane; Real-Time Tracking; Nose Tracking; Markerless Performance Capture

1. INTRODUCTION

A successful interaction system should be robust, respond to the user in real-time, and run without error for a long time. Head pose is a very important cue for inferring the user's gaze orientation, and it can also be used to control an avatar. We model the head pose in three-dimensional space with three rotation parameters: roll, pitch and yaw (Fig. 1).

Fig. 1: Illustration of the three degrees of freedom of the head pose.

Fig. 2: The color image and corresponding depth map captured by Kinect; both have a resolution of 640 x 480 at 30 fps.

State-of-the-art methods for head pose estimation can roughly be divided into several categories depending on the kind of input data they require (i.e., color image or depth map). There are many works on head pose estimation from color images [15]. The color-image-based algorithms can be further divided into feature-based [8, 1, 10, 11, 18] and appearance-based [7, 2, 4, 17, 5, 6, 18] methods. However, methods that rely on color images are sensitive to illumination, so weak-light or no-light environments may degrade the estimation accuracy.

Fig. 3: Visualized overview of the online processing pipeline, from real-time depth/color data acquisition of the acting user, through head pose estimation (reverse rotation, least square plane, ellipse fitting), to avatar control.

Thanks to fast depth map generating systems such as [12], many works use depth data as additional information to overcome some of the limitations of color images [1, 14, 20]. However, these methods still require appearance cues. Therefore, several recent works use depth data as their primary information [3, 9, 40, 19]. Breitenstein et al. [3] proposed a system that can handle large rotation angles in real time using the GPU, but its computational complexity is clearly larger than that of our proposed approach. While capable of large rotation angles, the state-of-the-art works [16, 19] need training data, whereas ours does not.

Recently, Microsoft has released a device, the Microsoft Xbox Kinect, which simultaneously captures a color image and a depth map at 30 fps (Fig. 2). Kinect uses infrared rays to acquire the depth map, but there are some limitations to its use. First, the depth map is noisy: we measured the average flickering rate of a pixel on a static object to be about 3%.
The maximum flickering rate is larger than 30% and appears at object edges. As a result, we have to handle the noisy data while preserving the accuracy of our algorithm; otherwise the output parameters will flicker all the time, and the estimated head pose will flicker even when the user's head is static. Secondly, there are many holes within the depth map that contain no depth information, due to occlusion and specular reflection that prevent the infrared rays from being received by the Kinect. For the sake of accuracy, it is better not to use data points that contain no depth information. Thirdly, the Kinect sensor cannot acquire depth data when an object is too close to or too far from the camera. Moreover, even when the Kinect does return depth data at extreme distances, those data are imprecise and unusable. In our measurements, 1 m to 6.5 m is an appropriate distance range for sufficiently precise depth information. After addressing these problems, our system becomes more robust and accurate.

In this paper, we propose an approach that estimates the head pose by finding the nose position in the depth map, sampling a point cloud around the nose, and fitting a least square error plane to it; the plane normal then represents the face orientation. Because we do not require any user setup, tracking the nose is a difficult task: depth is the only useful information we have. We make the assumption that the nose is the nearest point to the camera when a user faces the depth camera, and we use this assumption to track the user's nose.

The rest of this paper is organized as follows. The overall system workflow is briefly introduced in Sec. 2, which explains what each stage does. The implementation, i.e., how the proposed algorithm locates the nose position using depth data and how the rotation angles are calculated, is described in Sec. 3. Our experimental results are presented in Sec. 4. Finally, we present conclusions and future work in Sec. 5.

2. SYSTEM OVERVIEW

Our system overview is illustrated in Fig. 4, where each box represents a procedure. The head pose estimation process is divided into two parts: one for yaw and pitch (left part) and the other for roll (right part). For the left part, the proposed algorithm locates the nose position after a new depth image is captured from the Kinect, and then samples points around the nose position. A least square plane is then fitted to those points, and the plane's normal vector represents the face orientation. For the right part, we define an appropriate depth threshold to find the head boundary, to which an ellipse is fitted. The results of both parts pass through a history table, which smooths the estimated parameters in order to tackle the flickering problem. The final smoothed parameters are used to animate the virtual avatar in real-time.

After data acquisition, our system can be divided into two parts: plane fitting for yaw and pitch estimation, and ellipse fitting for roll estimation. The history table keeps track of the results in order to filter out outliers and smooth the output parameters.

Fig. 4: System flow chart. Depth image acquisition is followed by nose detection, point cloud sampling and least square error plane fitting (yaw and pitch generation), and by head boundary detection and ellipse fitting (roll generation); the results pass through the history table and drive the virtual avatar animation.

Fig. 5: Perspective projection model used in this paper for retrieving the 3D point cloud from the Kinect. The camera is at the origin (0, 0, 0); a scene point p(x, y, z) with depth value z is projected through the focal plane at focal length f onto the 2D image coordinate P(X, Y).
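As a concrete illustration of the history table mentioned in the overview (and detailed later in Sec. 3.4), the following minimal Python sketch, which is not the authors' implementation, rejects estimates that jump implausibly far from the previous result and averages the latest n accepted results. The class name HistoryTable, the table length n = 10 and the 30-degree jump threshold are assumed values for illustration.

```python
import numpy as np
from collections import deque

class HistoryTable:
    """Keeps the latest n pose estimates, rejects implausible jumps,
    and outputs their average to smooth the flickering parameters."""

    def __init__(self, n=10, max_jump_deg=30.0):
        self.results = deque(maxlen=n)
        self.max_jump_deg = max_jump_deg   # assumed outlier threshold

    def update(self, yaw, pitch, roll):
        pose = np.array([yaw, pitch, roll], dtype=float)
        if self.results:
            # Reject physically impossible jumps between consecutive
            # frames, e.g. 60 degrees of yaw right after 5 degrees.
            jump = np.abs(pose - self.results[-1])
            if np.any(jump > self.max_jump_deg):
                pose = self.results[-1]    # keep the previous estimate
        self.results.append(pose)
        # Smoothed output: average of the latest n accepted results.
        return tuple(np.mean(self.results, axis=0))

# Example: feed one per-frame estimate and read back the smoothed pose.
table = HistoryTable(n=10)
smoothed_yaw, smoothed_pitch, smoothed_roll = table.update(5.0, -2.0, 1.0)
```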
3. IMPLEMENTATION

In general, we assume that the head pose consists of six parameters. Three of them are translation parameters with respect to the x-axis, y-axis and z-axis; the rest are rotation parameters: yaw, pitch and roll. Our goal is to estimate these six parameters precisely in real-time while preserving temporal coherence.

3.1. Preprocessing

As mentioned in the previous chapters, the proposed algorithm uses the Microsoft Kinect to retrieve input data. A depth map of VGA resolution is retrieved as a frame, with pixel values ranging from 0 to 10000 millimeters. Treating the depth value as a point's z coordinate, the x and y coordinates are still unknown. A simple perspective projection model handles this issue. Let the camera be the origin of the 3D world coordinate system, with its view direction along the positive z-axis. The focal plane is located at a distance f in front of the camera. A point p(x, y, z) on the surface of an object in the 3D scene is projected to a point P(X, Y) on the 2D focal plane, where

$X = f\,\frac{x}{z}, \qquad Y = f\,\frac{y}{z}$   (Eq. 1)

Fig. 6: Detect the boundary of the user's head (a), and smooth the boundary by averaging neighboring points (b). Fit an ellipse that best matches the smoothed head boundary (c), and apply the angle of the resulting ellipse to the virtual avatar (d).

The problem in this case can be stated as follows: the 2D coordinates of a data point P(X, Y) and its depth z are known, while the 3D coordinates x and y remain unknown. Knowing that the Kinect camera's focal length f is 575, the 3D point can be recovered as:

$p(x, y, z) = \left(\frac{zX}{f}, \frac{zY}{f}, z\right)$   (Eq. 2)

Thus the following steps of this work operate on the 3D point cloud retrieved by this preprocessing.

3.2. Translation estimation

This part is easy to accomplish and is not the main part of our work. By calculating the centroid of the point cloud of the user's head as in Eq. 3, we can easily estimate the translation parameters:

$(t_x, t_y, t_z) = \frac{1}{N}\sum_{i=1}^{N}(x_i, y_i, z_i)$   (Eq. 3)

Fig. 7: Detect and track the user's nose (a), sample several points from the nose's neighboring area (b), and apply the least square error algorithm to fit a plane to the sampled points (c).

3.3. Rotation estimation

We propose a novel algorithm for rotation estimation. The main work of this estimation is divided into two parts: ellipse fitting for the roll angle, and least square error plane fitting for the yaw and pitch angles.

3.3.1 Roll angle estimation

To estimate the ellipse which best matches the user's head, we first define the pixels which belong to the head boundary. We dynamically set a depth threshold to crop out the background pixels, and find the leftmost and rightmost pixels of each row; these pixels represent the head boundary, as shown in Fig. 6(a). However, there is a problem that we cannot ignore. The Kinect is so noisy that the acquired data flicker constantly. This causes a temporal coherence problem: the virtual avatar would tremble all the time, which is neither realistic nor what we want. Therefore, we add a smoothing term to handle this issue. The smoothing process adjusts the coordinates of each head boundary pixel by averaging over a window of successive boundary pixels (Fig. 6(b)):

$(x_i, y_i) = \frac{1}{2l+1}\sum_{k=i-l}^{i+l}(x_k, y_k)$   (Eq. 4)

After smoothing, a least square error ellipse is fitted to these boundary points (Fig. 6(c)). Having obtained the best-fitting ellipse, we take its rotation angle as the estimated roll angle and apply this parameter to the virtual avatar. Fig. 6(d) shows the result.
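To make Sec. 3.3.1 concrete, the following is a minimal Python sketch of the roll estimation step, not the authors' implementation: it extracts the head boundary under a depth threshold, smooths it with the moving average of Eq. 4, and fits an ellipse with OpenCV's cv2.fitEllipse. The function name estimate_roll and the default window half-width l = 3 are illustrative assumptions.

```python
import numpy as np
import cv2

def estimate_roll(depth, depth_threshold, l=3):
    """Sketch of roll estimation (Sec. 3.3.1).

    depth: HxW depth map in millimeters (0 = no reading).
    depth_threshold: foreground/background cut in millimeters.
    l: half-width of the boundary smoothing window (Eq. 4).
    Returns the ellipse rotation angle in degrees, or None.
    """
    # Keep only foreground pixels assumed to belong to the head.
    mask = (depth > 0) & (depth < depth_threshold)

    left, right = [], []
    for row in range(mask.shape[0]):
        cols = np.flatnonzero(mask[row])
        if cols.size:
            left.append((cols[0], row))     # leftmost head pixel
            right.append((cols[-1], row))   # rightmost head pixel
    # Traverse the contour left side top-to-bottom, then right side
    # bottom-to-top, so successive points are neighbors on the boundary.
    boundary = left + right[::-1]
    if len(boundary) < 5:                   # fitEllipse needs >= 5 points
        return None

    pts = np.array(boundary, dtype=np.float32)
    # (2l+1)-tap moving average over successive boundary pixels (Eq. 4);
    # the few samples at both ends are only roughly smoothed.
    kernel = np.ones(2 * l + 1) / (2 * l + 1)
    pts[:, 0] = np.convolve(pts[:, 0], kernel, mode="same")
    pts[:, 1] = np.convolve(pts[:, 1], kernel, mode="same")

    # Least square error ellipse; its rotation (degrees) is the roll
    # estimate. Mapping it to the avatar's roll convention may need an
    # offset, which is left as an assumption here.
    (_, _), (_, _), angle = cv2.fitEllipse(pts)
    return angle
```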
3.3.2 Pitch and yaw estimation

In addition to the roll angle, the yaw and pitch angles must be estimated as well. A sequence of steps is introduced in the rest of this chapter. The main idea is that a human face can roughly be considered as a plane, and the normal vector of this plane can represent the orientation of the actor's face. Our goal is to reconstruct this plane.

Fig. 8: The nose point has the shallowest depth value in the point cloud within a small rotation range (a), while for larger rotations other parts of the head take over the shallowest position (b). The red circle indicates the detected shallowest point.

To achieve this goal, least square error plane fitting is applied to the 3D point cloud. However, the Kinect does not tell us which 3D points belong to the actor's face and which do not; it gives us the whole captured scene instead. In order to sample a fixed area of the user's face across different frames and different head poses, we focus on nose detection and nose tracking: we simply define the nose's neighboring area to be the face area. Fig. 7 shows the pipeline of this step.

It is observed from experiments that most of the time the nose is the part of the head nearest to the camera, except when the user turns the head by a sufficiently large angle. This observation leads to the initial guess that the nose point has the shallowest depth value in the point cloud. This initial guess remains robust within a small rotation range (Fig. 8(a)). However, for larger rotations other parts of the head take over the shallowest position: for example, the glasses or a cheek becomes the shallowest point when the yaw angle exceeds about 20 degrees, and the chin or fringe becomes the shallowest point when the pitch angle exceeds about 15 degrees (Fig. 8(b)).

This problem can be tackled by the following step. A human head can only rotate by a small angle within a short moment such as one thirtieth of a second, which is the time between two consecutive frames. We therefore take advantage of the temporal information already calculated in the previous iteration: a reverse rotation matrix is applied to the whole point cloud to rotate the head back to the frontal pose, using the yaw and pitch angles generated in the last iteration. After this transformation, the new point cloud builds up a head that faces straight forward to the camera. The reverse rotation is:

$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_p & -\sin\theta_p \\ 0 & \sin\theta_p & \cos\theta_p \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}$   (Eq. 5)

Eq. 5 is the rotation matrix for turning the point cloud of the user's head back to the frontal pose; θ_y and θ_p are the yaw and pitch angles estimated in the previous iteration, negated so that the rotation is reversed.

Fig. 9: A reverse rotation transform is applied to the whole point cloud (a) to rotate the head back to the frontal pose (b). The green circle indicates the shallowest point before the transform, while the red circle indicates the new shallowest point after the reverse rotation.

Even though the adjusted point cloud looks like an incomplete face from the camera's view, as long as the nose has been captured in the original depth map, we can successfully track the nose simply by finding the shallowest point of the adjusted point cloud. Fig. 9 illustrates the situation in which the user rotates by such a large angle that another part of the head (green circle in Fig. 9(a)) takes over the shallowest position. After the reverse rotation with the parameters of the previous iteration (Fig. 9(b)), the nose has the shallowest depth value again (red circle in Fig. 9(b)). The orange arrow shows the rotation direction.
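The nose tracking step described above can be sketched as follows in Python (an illustrative reconstruction, not the authors' code): the previous frame's yaw and pitch are negated and applied as the reverse rotation of Eq. 5, and the nose is then taken as the point with the smallest z. The function name track_nose and the angle conventions (degrees, yaw about the y-axis, pitch about the x-axis) are assumptions.

```python
import numpy as np

def track_nose(points, prev_yaw_deg, prev_pitch_deg):
    """Sketch of nose tracking by reverse rotation (Eq. 5).

    points: Nx3 array of head points (x, y, z) in camera coordinates.
    prev_yaw_deg, prev_pitch_deg: angles estimated in the previous frame.
    Returns the index of the point assumed to be the nose.
    """
    # Negate the previous angles so the rotation is reversed.
    ty, tp = np.radians(-prev_yaw_deg), np.radians(-prev_pitch_deg)

    # Yaw: rotation about the y-axis.
    Ry = np.array([[ np.cos(ty), 0.0, np.sin(ty)],
                   [ 0.0,        1.0, 0.0       ],
                   [-np.sin(ty), 0.0, np.cos(ty)]])
    # Pitch: rotation about the x-axis.
    Rx = np.array([[1.0, 0.0,         0.0        ],
                   [0.0, np.cos(tp), -np.sin(tp)],
                   [0.0, np.sin(tp),  np.cos(tp)]])

    # Rotate the whole point cloud back toward the frontal pose.
    frontal = points @ (Ry @ Rx).T

    # With the head roughly frontal again, the nose is assumed to be
    # the point with the shallowest depth (smallest z).
    return int(np.argmin(frontal[:, 2]))
```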
"α" denotes pitch while "β"denotes yaw. Fig.6: we sample 300 points from the defined face area, and fit a least square error plane to these sample points. If any sample points happen to be points that have no depth information, ignore them. Note that the camera stands for the origin of world coordinate. Equation for the reverse rotation goes: 0 0 x x' cos( y ) 0 sin( y ) 1 y ' 0 1 0 0 cos( ) sin( p ) y p z ' sin( y ) 0 cos( y ) 0 sin( p ) cos( p ) z Eq.5 : Rotation matrix for turning the point cloud of user’s head back to normal pose. y and p represents yaw and pitch angle estimated in the previous iteration. Even though the adjusted point cloud looks like an incomplete face in camera view, as long as the nose has been captured in the origin depth map, we can successfully track the nose just by finding the shallowest point from the adjusted point cloud. Figure.5 illustrates the situation when user rotates so large angle that other part of his head (green circles in Fig.5.a) takes over the shallowest position. After reverse rotation by the parameters of previous iteration (Fig.5.b), nose keeps the shallowest depth value again (red circles in Fig.5.a). Orange arrow shows the rotating direction. As mentioned earlier in this chapter, human’s face can roughly be considered as a plane. Thus the normal vector of this plane can represent the orientation of the actor’s face. Since the system has detected user’s nose, we can simply consider the nose’s neighboring area to be the user’s face. In this paper, we sample 300 points from the defined face area (Fig.6), and fit a least square Equation.7 indicates that every point sampled from the point cloud is on the plane Ax+By+C=z. x1 x 2 xn y1 y2 yn 1 z1 A 1 z 2 B C 1 z n Eq.6 Where (𝑥𝑖 , 𝑦𝑖 , 𝑧𝑖 ) is the 3D coordinate of a sample point, and “n” denotes the number of sample points. This is an over determine linear system and can be solved by: x1 y 1 1 x2 ... y2 ... 1 ... x xn 1 x yn 2 1 xn y1 1 A x1 y2 1 B y1 C 1 yn 1 x2 ... y2 ... 1 ... z xn 1 z yn 2 1 zn Eq.7 After simplification, least square plane coefficients can be obtained by solving the following equation: in1 xi2 n i 1 xi yi in1 xi in1 xi yi in1 yi2 in1 yi in1 xi A in1 xi zi in1 yi B in1 yi zi in1 1 C in1 zi Eq.8 Having the solution of Eq.8, we change the ⃑ (𝐴, 𝐵, −1) in expression of the plane’s normal vector 𝑁 terms of yaw and pitch angle. Figure 6 illustrates the ⃑ (𝐴, 𝐵, −1) and pose relation between normal vector 𝑁 ⃑ 𝑦𝑧 parameter {𝑦𝑎𝑤, 𝑝𝑖𝑡𝑐ℎ}. "α" is the angle between 𝑁 and negative z-axis and denotes pitch angle, where ⃑ 𝑦𝑧 (0, 𝐵, −1) is obtained by projecting 𝑁 ⃑ (𝐴, 𝐵, −1) 𝑁 onto y-z plane. On the other hand, "β" is the angle ⃑ 𝑥𝑧 and negative z-axis and denotes yaw angle, between 𝑁 ⃑ 𝑥𝑧 (𝐴, 0, −1) is obtained by projecting where 𝑁 ⃑ (𝐴, 𝐵, −1) onto x-z plane. 𝑁 To put figure.6 into conclusion, we induce an ⃑ (𝐴, 𝐵, −1) to (α, β). As the equation for transforming 𝑁 following: cos 1 ( 1 A 1 2 2 ) cos 1 ( 1 B 12 2 ) Eq.9 3.4 History table As mentioned earlier in this paper, although this system works in natural environment using non-intrusive, commercially available 3D sensor as Microsoft Kinect. The convenience and simplicity of setup comes at the cost of high noise in the acquired data. Our system should be robust when the depth map sequence is flickering or when data missing so large area that the algorithm can’t work. 
3.4 History table

As mentioned earlier in this paper, this system works in a natural environment using a non-intrusive, commercially available 3D sensor, the Microsoft Kinect. The convenience and simplicity of the setup come at the cost of high noise in the acquired data. Our system should remain robust when the depth map sequence flickers or when data are missing over such a large area that the algorithm cannot work.

In addition to the preliminary smoothing in the estimation stage of the implementation, a history table is maintained to keep track of the estimated result of every frame. First, we filter out results with impossible angles, for instance a yaw angle of 60 degrees coming right after a yaw angle of 5 degrees. Second, the table automatically averages the latest n results in order to smooth the estimation. We use this smoothed result as the final output of our system to animate the virtual avatar.

4. RESULTS

We present results of our real-time performance capture. The output of our system is a continuous stream of head pose parameters. Fig. 12 demonstrates three people, each making ten arbitrary poses. The first and fourth columns show the pose the user makes. Note that the system does not use any information from this color image, because the color image may become unusable when the light is off. The second and fifth columns show the depth maps captured by the 3D sensor. The third and sixth columns visualize the output of our system by using the output parameters to control a virtual avatar.

The capability of our system mainly relies on nose tracking. For the yaw and pitch angles, as long as our system successfully detects the nose, it generates acceptable corresponding yaw and pitch angles. On the contrary, once the system fails in nose tracking, the estimation is highly likely to fail. On the other hand, roll angle estimation is very robust in the proposed system: as long as the user can make the roll pose, the roll angle is estimated reliably (Fig. 12).

5. CONCLUSION AND FUTURE WORK

We have presented a system that estimates head poses in real-time using pure depth information, together with a novel method to track the user's nose within the depth map. The system samples points around the nose position and fits a least square plane that approximates the user's face to those points; the yaw and pitch parameters are then generated from the normal vector of the plane, and the roll parameter is obtained by fitting an ellipse to the head boundary. This parameter-generating method is intuitive and easy to understand. Compared with other methods that also use pure depth information as their primary cue, our system does not require user setup when the system starts and works without any training data.

Our system has some limitations. First, if the user's fringe is closer to the depth camera than the nose, our system may locate the fringe instead of the nose. The reason is that our system locates the nose position by reverse-rotating the input depth data of the head according to the previously estimated head pose parameters and then finding the point closest to the depth camera. Therefore, it is recommended that the most salient point of the face be the nose when using our system. Second, the Kinect originally acquires depth maps at 30 fps, but with our algorithm added, the frame rate drops to about 21 fps on an Intel Q8800 2.8 GHz CPU. At this frame rate, if the user makes fast head rotations, our system cannot track the nose position robustly. Third, hair length also affects the estimation accuracy; for example, shoulder-length hair may reduce the accuracy of the roll angle.

As mentioned before, the maximum head rotation angle is limited by whether the nose information is still usable. Therefore, future work for our system is to increase the maximum angle by using other facial information and to make the system more robust.

REFERENCES

[1] R. Yang and Z. Zhang. Model-based head pose tracking with stereovision. In Aut. Face and Gestures Rec., 2002.
[2] L.-P. Morency, P. Sundberg, and T. Darrell. Pose estimation using 3d view-based eigenspaces. In Aut. Face and Gestures Rec., 2003.
[3] M. D. Breitenstein, D. Kuettel, T. Weise, L. Van Gool, and H. Pfister. Real-time face pose estimation from single range images. In CVPR, 2008.
[4] V. N. Balasubramanian, J. Ye, and S. Panchanathan. Biased manifold embedding: A framework for person-independent head pose estimation. In CVPR, 2007.
[5] M. Osadchy, M. L. Miller, and Y. LeCun. Synergistic face detection and pose estimation with energy-based models. In NIPS, 2005.
[6] M. Storer, M. Urschler, and H. Bischof. 3d-mam: 3d morphable appearance model for efficient fine head pose estimation from still images. In Workshop on Subspace Methods, 2009.
[7] M. Jones and P. Viola. Fast multi-view face detection. Technical Report TR2003-096, Mitsubishi Electric Research Laboratories, 2003.
[8] T. Vatahska, M. Bennewitz, and S. Behnke. Feature-based head pose estimation from images. In Humanoids, 2007.
[9] S. Malassiotis and M. G. Strintzis. Robust real-time 3d head pose estimation from range data. Pattern Recognition, 38:1153–1165, 2005.
[10] Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In Aut. Face and Gestures Rec., 2000.
[11] J. Yao and W. K. Cham. Efficient model-based linear head motion recovery from movies. In CVPR, 2004.
[12] T. Weise, B. Leibe, and L. Van Gool. Fast 3d scanning with automatic motion compensation. In CVPR, 2007.
[13] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang. 3d deformable face tracking with a commodity depth camera. In ECCV, 2010.
[14] L.-P. Morency, P. Sundberg, and T. Darrell. Pose estimation using 3d view-based eigenspaces. In Aut. Face and Gestures Rec., 2003.
[15] E. Murphy-Chutorian and M. Trivedi. Head pose estimation in computer vision: A survey. TPAMI, 31(4):607–626, 2009.
[16] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In CVPR, 2011.
[17] L. Chen, L. Zhang, Y. Hu, M. Li, and H. Zhang. Head pose estimation using fisher manifold learning. In Workshop on Analysis and Modeling of Faces and Gestures, 2003.
[18] J. Whitehill and J. R. Movellan. A discriminative approach to frame-by-frame head pose tracking. In Aut. Face and Gestures Rec., 2008.
[19] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. In SIGGRAPH, 2011.
[20] E. Seemann, K. Nickel, and R. Stiefelhagen. Head pose estimation using stereo vision for human-robot interaction. In Aut. Face and Gestures Rec., 2004.