Automatic Creation of a Talking Head From a Video Sequence
Authors: KyoungHo Choi, Jenq-Neng Hwang
Presented by: Rubie F. Viñas (方如玉)
Adviser: Dr. Shih-Chung Chen (陳世中)
Date: December 31, 2008
Presented by STU Institute of Electrical Engineering

Outline
- Introduction
- Facial Feature Extraction
- Face Shape Extractor
- Probability Networks
- A Least-Square Approach to Adapt a 3D Face Model
- Implementation and Experimental Results
- Conclusion and Future Work
- References

Abstract
- The paper presents a real-time system that extracts 2D facial features from a video sequence and builds a talking head.
- Fully automatic: no user intervention is required.
- Pipeline: facial features from the video sequence -> probabilistic network -> 3D face model -> audio-to-visual conversion technique -> talking head.

I. Introduction

Introduction
- Advances in computing power and in the numerical algorithms of graphics and image processing make it possible to build a realistic 3D face from a video sequence captured with a regular PC camera.
- Existing methods need user intervention at an initialization stage: the user provides feature points (on two orthogonal frames, or on multiple frames) to generate a photo-realistic 3D face model.
- Techniques that build high-quality face models are computationally expensive and time consuming.

Introduction (cont.)
- Integrating talking heads is highly desirable for enriching the human-computer interface of various multimedia applications: video conferencing, e-commerce, virtual anchors.
- Recent research targets talking heads that do not require high-quality animation, and fast, easy ways to build a 3D face, so that many different face models can be generated in a short time.
- User intervention is still required: the user provides several corresponding points on two frames of a video sequence, or feature points in a single frontal image.

Approaches for Creating a 3D Face Model
1. Use a generic 3D face model (e.g., from a 3D scanner): deform the model by calculating new coordinates for all of its vertices.
   1. Lee: treats the deformation of the vertices of a 3D model as an interpolation of the displacements of given control points; the Dirichlet free-form deformation technique calculates the new 3D coordinates of the deformed model.
   2. Pighin: also poses model deformation as an interpolation problem, using radial basis functions to find new 3D coordinates for the vertices of a generic 3D model.
2. Use multiple 3D face models: find the 3D coordinates of all vertices of a new model from given feature points by combining multiple 3D models, i.e., by calculating the parameters of the combination.
   1. Blanz: used a laser scanner (Cyberware™) to generate a 3D model database, and considered a new face model as a linear combination of the shapes of the 3D faces in the database.
   2. Liu: simplified the linear-combination idea by designing key 3D faces that are combined linearly to build a new model, eliminating the need for a large 3D face database.
- Merit of linear combination: linearly created face objects cannot produce an unnatural, "wrong" face, a very important property when creating a 3D face model without user intervention.

Approaches Using a Single Image
- Computationally cheap and fast; suitable for generating multiple face models in a short time.
1. Valle: manually extracted feature points and an interpolation technique, with anthropometric and a priori information used to estimate the depth of the 3D face model.
2. Lin: a 2D mesh model; radial basis functions obtain the coordinates of the polygon mesh of a 3D model.
3. Kuo: 18 feature points; the talking head is animated by mesh warping, with the control points of a mesh manually adjusted to fit the eyes, nose, and mouth in the input image.
- Problem: the depth information of the created 3D model is not as accurate as in the more labor-intensive approaches.

Paper
- A real-time system that automatically builds a 3D face model from a video sequence; no user input is needed.
- Objective: present a real-time system that extracts facial features automatically and builds a 3D face model without user intervention.
- Components:
  - Face shape extractor
  - Probabilistic network
  - A least-square approach
  - Talking-head system

II. Facial Feature Extraction

Finding Rough Location
- Find the rough locations of the target feature points.
- Facial features have high energy around them: they form deep valleys in the luminance distribution.
- Test images (front view); a 2D valley detection filter locates the eyebrows, eyes, and mouth.
- Valley energy equation:

  Ev(x0) = max_{x = l..k} { dR(x0, x) + dL(x0, x) - |dR(x0, x) - dL(x0, x)| }

  where dR(x0, x) = f(x0 + x) - f(x0) and dL(x0, x) = f(x0 - x) - f(x0).
- The filter is applied in the horizontal and vertical directions; the histogram distributions of its responses give the rough positions of the facial components.

Finding Exact Location
- Thresholding with Tff is performed to make a two-tone image.
- Pseudo moving difference: a contour-extraction method borrowed from moving-object analysis (motion estimation and moving-object segmentation). A moving-difference image, taken between an image at time T1 and an image at time T2, produces a high-intensity band at the object boundary while turning the homogeneous region around the object into a low-intensity region.
- Here the object is not actually moving, so there is no need to create a second image to take the difference from: the difference is taken with a shifted copy of the same frame (a synthetic motion),

  d_{mx,my}(x, y) = f(x, y) - f(x - mx, y - my)

  where mx and my are the movements in the x and y directions.
- The difference image leaves a strong intensity band around the boundary (example: mouth region with 0, 1, and 4 pixels moved down).

Finding Exact Location (cont.)
- Mouth and eye models, i.e., parametric curves, extract the outside contours.
- Search p1, p2 and h1, h2 so as to maximize the energy of the intensity of the pixels on the curve.
- Example: binarization (Tff = 80) of the pseudo-moving-difference result, and the mouth contour extracted with the parametric curve.

III. Face Shape Extractor

Face Shape
- One of the most important features in creating a 3D model.
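The feature-extraction steps above (the 1D valley energy, then the pseudo moving difference) can be sketched in NumPy. This is a minimal illustration: the search radii l, k and the shift amounts are assumed values, not ones the slides fix.

```python
import numpy as np

def valley_energy(f, l=1, k=3):
    """1D valley energy along a scanline of a grayscale image:
        Ev(x0) = max_{x=l..k} { dR + dL - |dR - dL| },
        dR(x0,x) = f(x0+x) - f(x0),  dL(x0,x) = f(x0-x) - f(x0).
    Note dR + dL - |dR - dL| = 2*min(dR, dL): a pixel scores high only
    when intensity rises on BOTH sides, i.e. it sits in a luminance
    valley (eyebrows, eyes, mouth)."""
    f = f.astype(np.int64)
    ev = np.zeros(len(f))
    for x0 in range(k, len(f) - k):
        ev[x0] = max(
            (f[x0 + x] - f[x0]) + (f[x0 - x] - f[x0])
            - abs((f[x0 + x] - f[x0]) - (f[x0 - x] - f[x0]))
            for x in range(l, k + 1)
        )
    return ev

def pseudo_moving_difference(img, mx=0, my=1):
    """Pseudo moving difference d_{mx,my}(x,y) = f(x,y) - f(x-mx, y-my):
    the frame is differenced with a shifted copy of ITSELF, so no second
    image is needed, yet a strong intensity band appears around object
    boundaries. np.roll wraps at the borders, so border rows/columns of
    the result are not meaningful."""
    f = img.astype(np.int64)
    return f - np.roll(np.roll(f, my, axis=0), mx, axis=1)
```

A dark pixel flanked by brighter neighbors on both sides gets a large valley energy, while edges (bright on one side only) score low, which is why projecting the filter response into horizontal and vertical histograms localizes the facial components.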
- Classification of face shapes: square, triangular, trapezoid, long, narrow.
- The face shape is represented by an ellipse controlled by three anchor points P1, P2, P3; the left and right anchor points move up and down while the width and height keep the same size.

Three Anchor Points Detection
- Threshold: Tfs = 0.5 × the average intensity of the inside of the face.
- Step 1: compute an edge image.
- Step 2: find the three anchor points: P1 at 180° (moves up/down), P2 at 270° (fixed position), P3 at 0° (moves up/down). The face shape is assumed symmetric.
- Step 3: draw the ellipse

  x²/a² + y²/b² = 1

  where a is the distance between P1 and P2 along x, and b is the distance between P1 and P2 along y.
- Add the intensities of the edge pixels on the ellipse that are located lower than the left and right anchor points and record the sum; move the left and right anchor points up and down to find the parameters of the ellipse that produce the maximum boundary energy for the face shape. Optimal face shape:

  (â, b̂) = argmax_{a,b} Σ_{(x,y) ∈ el} E( a·sqrt(1 - (y/b)²), b·sqrt(1 - (x/a)²) ),  such that P1, P2, P3 ∈ el

  where E(x, y) is the intensity of the edge image and el is the subset of pixels on the ellipse located lower than the left and right anchor points.

IV. Probability Networks

Literature Review
- Uses: locating human faces in a scene; tracking the deformations of local features.
- Cipolla: a probabilistic framework that combines different facial features and face groups, achieving a high confidence rate for face detection in a complicated scene.
- Huang: a probabilistic network for local feature tracking, modeling the locations and velocities of selected feature points.

Probabilistic Framework
- Maximally use the facial feature evidence to decide the correctness of the extracted facial features before a 3D face model is built.
- (Figure: MPEG-4 feature definition points, FDPs, on the face.)
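The ellipse search of the face shape extractor described above can be sketched as a grid search over (a, b) that sums edge intensity along the lower part of the ellipse. The center, candidate ranges, and sampling density are illustrative assumptions.

```python
import numpy as np

def lower_ellipse_energy(edge, cx, cy, a, b, n_samples=181):
    """Sum of edge-image intensity over the lower half of the ellipse
    x^2/a^2 + y^2/b^2 = 1 centred at (cx, cy). Image y grows downward,
    so +b*sin(t) for t in [0, pi] traces the chin side -- the pixels
    below the left/right anchor points in the slide's energy term."""
    t = np.linspace(0.0, np.pi, n_samples)
    xs = np.clip(np.round(cx + a * np.cos(t)).astype(int), 0, edge.shape[1] - 1)
    ys = np.clip(np.round(cy + b * np.sin(t)).astype(int), 0, edge.shape[0] - 1)
    return float(edge[ys, xs].sum())

def fit_face_shape(edge, cx, cy, a_candidates, b_candidates):
    """Pick the (a, b) maximising boundary energy -- a minimal stand-in
    for moving the anchor points P1/P3 up and down."""
    scored = [(lower_ellipse_energy(edge, cx, cy, a, b), a, b)
              for a in a_candidates for b in b_candidates]
    _, a_best, b_best = max(scored)
    return a_best, b_best
```

On a synthetic edge image containing the lower half of an ellipse, the search recovers its axes from a small candidate set.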
Probabilistic Framework (cont.)
- Each node of the network is a random variable (an FDP); each arc represents the conditional dependency between two nodes.
- Network hierarchy (topology): a face net built on top of a shape net, a mouth net, and an eye net.

V. A Least-Square Approach to Adapt a 3D Face Model

Approach
- A robust and stable algorithm used to build a photo-realistic and natural 3D face model.
- Literature review: Liu showed that linearly combining multiple 3D models is a promising way to generate a photo-realistic 3D model.
- Paper approach: a new face model is described as a linear combination of key 3D face models.
  - Strong point: the multiple face models constrain the shape of the new 3D face, preventing the algorithm from producing an unrealistic 3D face model.
- Similarity with Liu: the 3D model is a linear combination of a neutral face and some deformation vectors.
- Difference with Liu: a least-square approach, rather than an iterative one, finds the coefficient vector for creating the new 3D face model, and the model is built from a video sequence with no user input.

3D Model
- A modified version of the 3D face model developed by Parke and Waters.
- A 3D model editor builds a complete head-and-shoulders model, including the ears and teeth.
- Face geometry:

  F = F0 + Σ_{i=1..m} ci · Di

  where F is the face geometry, F0 the neutral face, and c = (c1, c2, ..., cm) the coefficient vector that decides the amount of variation applied to the vertices of the neutral face model.
- 16 face models are used.
- Face geometry F = (v1, ..., vn)^T, where vi = (xi, yi, zi)^T.
- Deformation vector D = (δv1, ..., δvn)^T: the amount of variation in the size and location of the vertices of the 3D model.

3D Model Adaptation
- All coefficients are decided at once by solving the least-square problem with the SVD (singular value decomposition):

  ĉ = argmin_c Σ_{j=1..n} [ (Vj - F0j) - Σ_{i=1..m} ci · Dij ]²

  where n is the number of extracted features, m the number of deformation vectors, Vj an extracted feature (containing an x and y location), F0j a vertex of the neutral 3D model projected onto 2D using the current camera parameters, and Dij the corresponding vertex of deformation vector Di.
- 8 shape vectors: 1. wide face, 2. thin face, 3. big mouth, 4. small mouth, 5. big nose, 6. small nose, 7. big eyes, 8. small eyes.
- 8 position vectors: 1. eyes minimum horizontal, 2. eyes maximum horizontal, 3. eyes minimum vertical, 4. eyes maximum vertical, 5. mouth minimum vertical, 6. mouth maximum vertical, 7. nose minimum vertical, 8. nose maximum vertical.
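The least-square fit above can be sketched with NumPy, whose `linalg.lstsq` is SVD-based, so all coefficients are indeed decided at once. The geometry here is a synthetic stand-in (random projected vertices with a known coefficient vector), not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 16          # n extracted 2D features, m deformation vectors

# Hypothetical stand-ins: F0_2d holds the neutral-model vertices already
# projected onto 2D; D_2d[i] is deformation vector i, likewise projected.
F0_2d = rng.normal(size=(n, 2))
D_2d = rng.normal(size=(m, n, 2))

# Simulate "extracted features" V_j from a known coefficient vector,
# so recovery can be checked.
c_true = rng.normal(size=m)
V = F0_2d + np.tensordot(c_true, D_2d, axes=1)

# Stack the residual equations (V_j - F0_j) = sum_i c_i * D_ij into one
# linear system A c = y and solve by SVD-based least squares.
A = D_2d.reshape(m, -1).T          # shape (2n, m)
y = (V - F0_2d).reshape(-1)        # shape (2n,)
c_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Rebuild the adapted geometry from the estimated coefficients.
F_new = F0_2d + np.tensordot(c_hat, D_2d, axes=1)
```

With n = 20 features (40 scalar equations) and m = 16 coefficients the system is overdetermined, which is the situation the slides describe; the least-square solution then absorbs extraction noise instead of fitting it exactly.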
- (Figure: deformation vector for a wide face.)

3D Face Models to Find Deformation Vectors
- A neutral 3D face model, and the 3D face models used to calculate the deformation vectors.
- 1294 polygons: good enough for realistic facial animation.

Procedure to Adapt a Generic Model from a Video Sequence
- Facial features are extracted from the video sequence, and the extraction is approved by the probability networks.
- The m 3D models are linearly combined to generate a new 3D model, according to the coefficient vector (c1, c2, ..., cm) calculated by the least-square approach from the extracted facial features.
- The vertices of the new 3D model are projected onto the input face (orthogonal projection) to get the face texture and the texture coordinates.
- The adapted 3D model is rendered with the texture map.

VI. Implementation and Experimental Results

Automatic Creation of a 3D Face Model
- Face movement in the input video sequences: the input face is translated and rotated α° about the x-axis and β° about the y-axis.
- The algorithm catches the best facial orientation for extracting and verifying the facial features.
- Samples of the input video sequences: a neutral face, looking at the camera, rotating about the x and y axes.
- Requirements for a real-time system analyzing a video sequence:
  1. Locating the face should not be called every frame.
  2. The facial features found in previous frames should be exploited to provide a better result in the current frame.
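The texture-mapping step of the adaptation procedure above can be sketched as an orthographic projection. The alignment and the bounding-box normalization are our illustrative assumptions; the slide only states that vertices are projected onto the input face to obtain texture coordinates.

```python
import numpy as np

def texture_coordinates(vertices):
    """Orthogonal projection of the adapted model's vertices onto the
    input image plane, returning per-vertex (u, v) texture coordinates
    in [0, 1]. Assumes the model is already aligned with the image
    (x right, y up, z toward the camera); normalising to the bounding
    box is an illustrative choice, not spelled out on the slide."""
    xy = np.asarray(vertices, dtype=float)[:, :2]   # drop z: orthographic
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    uv = (xy - lo) / (hi - lo)
    uv[:, 1] = 1.0 - uv[:, 1]                       # image v runs downward
    return uv
```

Each vertex thus samples the input frame at a fixed (u, v), so rendering the adapted mesh with this map reproduces the captured face.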
Block Diagram of the Real-Time System
- Face location: normalized RG color space (not called every frame; run at the best facial orientation).
- Valley detection filter: histogram distributions give the rough positions of the facial components.
- Nose: recursive thresholding (the threshold is increased); the nose holes have the lowest intensity around the nose.
- Mouth and eyes: pseudo moving difference.
- Face shape extractor: ellipse search area.
- Quality control agent: checks the correctness and suitability of the extracted features; the 3D model is then built.

Speech-Driven Talking Head System
- Block diagram of the encoder (a PC with a camera) and the decoder, connected via the Internet.
- Constrained optimization makes the system robust in noisy environments.
- FDPs (Facial Definition Parameters) are obtained automatically from a video sequence; FAPs (Facial Animation Parameters) drive the animation.

Transmitting Speech Via the Internet
- G.723.1: a dual-rate speech coder for multimedia communications and the most widely used standard codec for Internet telephony, with low-bit-rate coding at 5.3 kbps and 6.3 kbps.
- Implementation:
  1. The 3D coordinates and texture information of the adapted 3D model are sent to the decoder via the TCP protocol.
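The face-location step in the block diagram above relies on the normalized RG color space; a minimal sketch follows. The skin thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalized_rg(img):
    """Convert an RGB image to the normalized RG color space:
    r = R/(R+G+B), g = G/(R+G+B). Normalizing out overall intensity
    makes skin tones cluster tightly in (r, g), so a cheap per-pixel
    test can locate the face without running every frame."""
    rgb = img.astype(np.float64)
    s = rgb.sum(axis=-1, keepdims=True)
    s[s == 0] = 1.0                      # avoid division by zero on black
    nrg = rgb / s
    return nrg[..., 0], nrg[..., 1]

def skin_mask(img, r_lo=0.35, r_hi=0.55, g_lo=0.25, g_hi=0.40):
    """Box test in (r, g); the bounds here are assumed, for illustration."""
    r, g = normalized_rg(img)
    return (r > r_lo) & (r < r_hi) & (g > g_lo) & (g < g_hi)
```

The largest connected region of the mask would then give the face bounding box inside which the valley filter and the other extractors operate.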
  2. The coded speech and animation parameters are sent to the decoder via the UDP protocol.
- Selected bit rates for audio and visual parameter transmission: 6.3 kbps and 1.0 kbps, respectively.

Encoder and Decoder Implementation
- Encoder: the input video. Decoder: the virtual face in a talking-head window, driven by the decoded animation parameters and speech.

Experimental Results
- 9 fps on a Pentium III 600 MHz PC, with n = 20 feature points and m = 16 deformation vectors.
- The face shape extractor was tested on users with different face shapes and orientations.
- Detection rate: 64% on 1180 selected frames of the testing video sequences (frontal view, rotation angle < 5°); 20 video sequences were recorded, 2000 frames in total.

Experimental Results (cont.)
- Errors (rejected facial features): complex background, failure of eye detection, eye closing.
- The probabilistic network increased the extraction accuracy, using 50 frontal face images from the PICS database (University of Stirling, http://pics.psych.stir.ac.uk/).
- The quality control agent models the features with 2D Gaussian distributions fitted by the expectation-maximization (EM) algorithm and adjusts the thresholds: Tff from 80 to 60; Tfs from 0.5 × AveI to 1 × AveI.

Created 3D Models
- The talking head system's performance was evaluated successfully; 20 people participated, scoring the animated talking head on a 5-point scale (bad = 1, excellent = 5):

  Score                 1   2   3   4   5
  Photo-realistic       0   0   0   4   8
  Natural               0   0   5   7   0
  Audio quality         0   0   0  12   0
  Synchronization       0   0   2  10   0
  Overall performance   0   0   1  11   0

- The automatic scheme produces a 3D model that is quite realistic and good enough for various Internet applications.
- (Figures: 3D models wearing cloth 1 and cloth 2; the animated talking head.)

VII.
Conclusions and Future Work

Conclusions
- Face shape extractor: an ellipse model controlled by three anchor points; an accurate and computationally cheap method.
- Probabilistic network: verifies whether the extracted features are good enough to build a 3D face model.
- Least-square approach: adapts a generic 3D model to the features extracted from the input video by calculating the coefficient vector required to fit the generic model to the input face.
- Talking-head system: generates the FAPs and FDPs for an MPEG-4 facial animation system, with no user intervention.
- Internet applications such as virtual conferences and virtual storytellers, which require neither much head movement nor high-quality facial animation.

Future Work
1. A more accurate mouth and eye extraction scheme, to improve the quality of the created 3D model.
2. Handle input faces that are not just neutral.
3. Remove the limitation on the shape of the mouth and eyes.
4. Add modeling of hair.

VIII. References

References A
1. Won-Sook Lee, Marc Escher, Gael Sannier, Nadia Magnenat-Thalmann, "MPEG-4 Compatible Faces from Orthogonal Photos," International Conference on Computer Animation, 1999, pp. 186-194.
2. P. Fua and C. Miccio, "Animated Heads from Ordinary Images: A Least-Squares Approach," Computer Vision and Image Understanding, vol. 75, no. 3, 1999, pp. 247-259.
3. Frederic Pighin, Richard Szeliski, David H. Salesin, "Resynthesizing Facial Animation through 3D Model-Based Tracking," Proc. Seventh IEEE International Conference on Computer Vision, vol. 1, 1999, pp. 143-150.
4. Zicheng Liu, Zhengyou Zhang, Chuck Jacobs, Michael Cohen, "Rapid Modeling of Animated Faces From Video," Microsoft Research Technical Report MSR-TR-2000-11.
5. Ana C. Andres del Valle and Jorn Ostermann, "3D Talking Head Customization By Adapting a Generic Model to One Uncalibrated Picture," IEEE International Symposium on Circuits and Systems, 2001, pp. 325-328.
6. C. J. Kuo, R.-S. Huang, and T.-G. Lin, "3-D Facial Model Estimation From Single Front-View Facial Image," IEEE Transactions on CSVT, vol. 12, no. 3, 2002, pp. 183-192.
7. L. Moccozet, N. Magnenat Thalmann, "Dirichlet Free-Form Deformations and their Application to Hand Simulation," Proc. Computer Animation '97, 1997, pp. 93-102.
8. V. Blanz and T. Vetter, "A Morphable Model for the Synthesis of 3D Faces," Computer Graphics, Annual Conference Series, SIGGRAPH 1999, pp. 187-194.
9. Eric Cosatto and Hans Peter Graf, "Photo-Realistic Talking-Heads from Image Samples," IEEE Transactions on Multimedia, vol. 2, no. 3, 2000, pp. 152-163.
10. I-Chen Lin, Cheng-Sheng Hung, Tzong-Jer Yang, Ming Ouhyoung, "A Speech Driven Talking Head System Based on a Single Face Image," Seventh Pacific Conference on Computer Graphics and Applications, 1999, pp. 43-49.
11. http://www.ananova.com/
12. Ru-Shang Wang, Yao Wang, "Facial feature extraction and tracking in video sequences," IEEE International Workshop on Multimedia Signal Processing, 1997, pp. 233-238.
13. D. Reisfeld, Y. Yeshurun, "Robust detection of facial features by generalized symmetry," 11th IAPR International Conference on Pattern Recognition, 1992, pp. 117-120.
14. M. Zobel, A. Gebhard, D. Paulus, J. Denzler, H. Niemann, "Robust facial feature localization by coupled features," Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 2-7.
15. Y. Tian, T. Kanade, and J. Cohn, "Robust Lip Tracking by Combining Shape, Color and Motion," 4th Asian Conference on Computer Vision, 2000.

References B
16. J. Luettin, N. A. Thacker, and S. W. Beet, "Active Shape Models for Visual Speech Feature Extraction," Electronic Systems Group Report No. 95/44, University of Sheffield, UK, 1995.
17. Changick Kim and Jenq-Neng Hwang, "An Integrated Scheme for Object-Based Video Abstraction," ACM International Multimedia Conference, 2000.
18. Leslie G. Farkas, Anthropometry of the Head and Face, Raven Press, 1994.
19. K. C. Yow and R. Cipolla, "A Probabilistic Framework for Perceptual Grouping of Features for Human Face Detection," Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition '96, 1996, pp. 16-21.
20. Hai Tao, R. Lopez, Thomas Huang, "Tracking Facial Features Using Probabilistic Network," Automatic Face and Gesture Recognition, 1998, pp. 166-170.
21. ISO/IEC FDIS 14496-1 Systems, ISO/IEC JTC1/SC29/WG11 N2501, November 1998.
22. ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, November 1998.
23. Psychological Image Collection at Stirling (PICS). Available at http://pics.psych.stir.ac.uk/.
24. J. Luettin, N. A. Thacker, and S. W. Beet, "Active Shape Models for Visual Speech Feature Extraction," Electronic Systems Group Report No. 95/44, University of Sheffield, UK, 1995.
25. Y. Tian, T. Kanade, and J. Cohn, "Robust Lip Tracking by Combining Shape, Color and Motion," 4th Asian Conference on Computer Vision, 2000.
26. K. H. Choi and Jenq-Neng Hwang, "Creating 3D Speech-Driven Talking Heads: A Probabilistic Approach," IEEE International Conference on Image Processing, 2002.
27. Fabio Lavagetto, "Converting Speech into Lip Movement: A Multimedia Telephone for Hard of Hearing People," IEEE Transactions on Rehabilitation Engineering, vol. 3, no. 1, 1995, pp. 90-102.
28. Ram R. Rao, Tsuhan Chen, Russell M. Mersereau, "Audio-to-Visual Conversion for Multimedia Communication," IEEE Transactions on Industrial Electronics, vol. 45, no. 1, 1998, pp. 15-22.
29. Frederic I. Parke, Keith Waters, Computer Facial Animation, A.K. Peters, 1996.
30. "Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s," ITU-T Recommendation G.723.1, March 1996.