ENTERFACE’10 VISION BASED HAND PUPPET
Final Presentation

PROJECT OBJECTIVE
To develop a multimodal interface for manipulating the low- and high-level aspects of 3D hierarchical digital models.
- The hands and the face of possibly separate performers will be tracked.
- Their gestures and facial expressions will be recognized.
- Estimated features will be mapped to digital puppets in real time.

PROJECT OBJECTIVE
The project involves:
- Tracking of both hands: background segmentation, skin color filtering, particle filtering
- Facial parameter tracking: active appearance models, filtering
- Hand pose estimation: dimensionality reduction
- Gesture and expression recognition: hidden semi-Markov models, keyframe classification
- Visualization: skeleton model, inverse kinematics, physics
- Networking: XML

WORK PACKAGES
- WP1: Data collection and ground-truth creation for the pose estimation module
- WP2: Hand posture modeling and training
- WP3: Stereovision based hand tracking
- WP4: Vision based facial expression tracking

WORK PACKAGES
- WP5: Gesture and expression spotting and recognition
- WP6: Skeleton and 3D model generation
- WP7: Development of the graphics engine with skeletal animation support
- WP8: Network protocol design and module development

FLOWCHART OF THE SYSTEM

WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION
PROBLEM: Hand pose estimation requires annotated images for training.
- Each hand pose must be exactly known, which is not possible without special devices such as data gloves.
- As this process requires a lot of work, we create the training images synthetically.
- Poser: software that can manipulate models with skeletons and render photorealistic images, scriptable via Python.

WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION
Poser:
- Imitates a stereo camera setup and produces photorealistic renders
- Automatically generates the silhouettes from the rendered images
- Allows Python scripts to manipulate any parameter of the scene
- A single script can generate an entire multi-camera dataset

WP1: DATA COLLECTION - METHODOLOGY
We iteratively increase the complexity of the visualized hand, starting with an open hand.
- Start with 3 degrees of freedom: side to side (waving), bend (up-down), twist
- We created 2x1000 images for training:
  - Created for a certain stereo camera setup
  - Manipulated each d.o.f. in sequence
  - Extracted the silhouettes
  - Saved along with the generating parameters

WP1: DATA COLLECTION - CONCLUSION
Using Poser to generate training images is a very efficient method:
- Can potentially create a very large database in a few hours
- It is very simple to create any multi-camera setup; every extrinsic and intrinsic camera parameter can be set via a Python script
- Automatically extracts the silhouettes
- Provides high-level manipulation parameters for the body parts, e.g. grasping and spreading for the hand

WP2: HAND POSE ESTIMATION
AIM: To estimate hand skeleton parameters using hand silhouettes from multiple cameras.
IDEA: Use dimensionality reduction to map silhouettes to a space with much lower dimensionality. When an unknown silhouette arrives, search for the closest known point in the reduced space (see the sketch below).
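As a rough illustration of this idea (not the project's actual code), the sketch below assumes a precomputed database of low-dimensional embeddings paired with the joint angles that generated them, and a single camera; the function and variable names (estimate_pose, project, latent_db, pose_db) are placeholders.

    import numpy as np

    # Hypothetical precomputed training data (names are placeholders):
    #   latent_db: (N, d) low-dimensional embeddings of the training silhouettes
    #   pose_db:   (N, 3) joint angles (side-to-side, bend, twist) that generated them
    #   project:   a function mapping a 6400-dim silhouette vector into the d-dim space
    def estimate_pose(silhouette_80x80, project, latent_db, pose_db):
        """Return the joint angles of the closest training silhouette."""
        x = silhouette_80x80.reshape(-1).astype(float)  # 80x80 -> 6400-dim vector
        z = project(x)                                   # map to the reduced space
        dists = np.linalg.norm(latent_db - z, axis=1)    # distances to all known points
        return pose_db[np.argmin(dists)]                 # pose of the nearest neighbour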
WP2: MANIFOLD LEARNING WITH SPARSE-GPLVM
- Poser's hand model is used for rendering.
- Hand silhouette images (80x80) are rendered: 80x80 = 6400-dimensional silhouette vector.
- 1000 training samples per camera have been captured with a Python script by iterating over x, y and z in the ranges below:
  x = [0°, 90°], y = [-90°, +90°], z = [-60°, +60°]
- Two cameras, placed orthogonal to each other, are simulated, so that 2000 training samples are collected.

WP2: PREPROCESSING, GPLVM AND LEARNING A FORWARD MAPPING WITH NNs
- GPLVM is a non-linear probabilistic PCA.
- For additional speed gains, a conventional PCA has been applied as a preprocessing step, capturing 99% of the total variance; GPLVM is applied afterwards. This made the optimization process ~4 times faster.
- GPLVM finds a backward mapping from the latent space (2D) to the PCA feature space (~250D for 99% variance).
- For fast generation of initial search points, a forward mapping from the feature space to the latent space is trained using a NN with 15 hidden neurons.

WP2: HAND POSE ESTIMATION FLOWCHART
Per camera: capture frame -> find foreground -> filter skin colors -> extract silhouette -> apply PCA -> use the NN to map from PCA space to the 2D latent space -> nearest neighbor classifier in the latent space (GPLVM provides the mapping back from 2D).

WP2: CLASSIFICATION
- A 2-dimensional latent space has been found in a smooth fashion by GPLVM optimization; therefore a nearest neighbor matcher has been used as the classifier in the latent space.
- The ground-truth angles of the hand poses are known. An exact pose match is looked for; any divergence from the exact angles is counted as a classification error.
- For the synthetic environment prepared with Poser, a classification performance of 94% has been reached in the 2D latent space.

WP3: STEREOVISION BASED HAND TRACKING
Objective:
- Obtain the 3D position of the hands
- Enable real-time, low-noise tracking of robust features for gesture recognition tasks
- Track some intuitive spatial parameters to map them directly to the puppet
Approach:
- Skin color as the main cue for hand location
- Stereo camera to obtain 3D information
- Particle filtering for robust tracking

WP3: STEREOVISION BASED HAND TRACKING
Skin-color segmentation:
- Bayesian color model
- Chromatic color space (HS)
- Train the color model on image regions obtained from a face tracker

WP3: STEREOVISION BASED HAND TRACKING
Particle filtering (CONDENSATION):
- Initialization (midterm result):
  - The biggest skin-colored blob is assumed to be the hand
  - Stereo matching to obtain the 3D hand location
- Tracking (new):
  - Color cue: accumulated skin color probability, weighted by the percentage of skin-colored pixels in the particle ROI
  - Depth cue: deviation of the ROI disparity from the disparity value implied by the particle location

WP4: VISION-BASED EMOTION RECOGNITION
AIM:
- To enable the digital puppet to imitate facial emotions
- To change the digital puppet's states using facial expressions
METHODOLOGY:
- Active shape model (ASM) based facial landmark tracker: track all shape parameters (a set of points)
- Extract useful features manually: eyebrow leverage, lip width, etc.
- Classify the features, using an HSMM or a nearest neighbor classifier

WP4: VISION-BASED EMOTION RECOGNITION
- The ASM is trained using the annotated set of face images.
- The search starts from the mean shape, aligned to the face located by a global face detector.
- The following steps are repeated until convergence (a schematic sketch of this loop follows):
  - Adjust the locations of the shape points by template matching of the image texture around each landmark and propose a new shape
  - Conform this new shape to a global shape model (based on PCA)
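The sketch below is only a schematic of that loop, not the project's implementation: each landmark is moved to its best local template match, and the proposal is then projected onto the PCA shape model. The function best_local_match, the convergence test, and the omission of pose alignment and shape-parameter clamping are all simplifying assumptions.

    import numpy as np

    def asm_search(image, mean_shape, eigvecs, best_local_match, max_iters=50, tol=0.5):
        """Schematic ASM fit: alternate local template matching with a PCA shape constraint."""
        shape = mean_shape.copy()                 # start from the mean shape (aligned to the face)
        for _ in range(max_iters):
            # 1) propose a new position for every landmark by template matching around it
            proposed = np.array([best_local_match(image, pt) for pt in shape])

            # 2) conform the proposal to the global PCA shape model
            b = eigvecs.T @ (proposed.ravel() - mean_shape.ravel())   # shape parameters
            constrained = (mean_shape.ravel() + eigvecs @ b).reshape(shape.shape)

            if np.linalg.norm(constrained - shape) < tol:             # converged?
                return constrained
            shape = constrained
        return shape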
REAL-TIME VISION-BASED EMOTION RECOGNITION
- Video frame captured from an ordinary webcam
- Facial landmark tracking based on Active Shape Models (both generic and person-specific models)
- Feature extraction based on intensity changes in specific face regions and on distances between specific landmarks
- Emotion recognition: six universal expressions (happiness, sadness, surprise, anger, fear, disgust)

WP4: VISION-BASED EMOTION RECOGNITION – NEAREST NEIGHBOR

WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
- Has to "spot" gestures and expressions in continuous streams: no start-end signals, and it should recognize only when a command is over
- Should run in real time
- We use hidden semi-Markov models (HSMMs): inhomogeneous, explicit duration models

WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Why HSMMs?
- HMMs model duration lengths only implicitly, via self-transition probabilities:
  - This imposes a geometric distribution on each duration
  - Variance and mean are correlated: high mean means high variance
- A geometric distribution and/or high variance do not suit every application (speech, hand gestures, expressions, ...)
- HSMMs explicitly model durations; HMMs are a special case of HSMMs

WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Training module:
- Developed in MATLAB; no real-time requirement, yet very fast and does not require many samples
- Previously experimented with 25 hand gestures and continuous streams; achieved a 99.7% recognition rate
- For this project, also experimented with facial expressions: six expressions, long continuous streams for training (not annotated)
- Results look good; no numerical results due to the lack of ground truth (future work)

WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Recognition module:
- Converted to an online algorithm
- Uses the recent history to determine the current state, using Viterbi on a large HSMM
- As expressions are independent, this does not introduce much error (about 1.5% in the number of misclassified frames)
- Runs in real time in MATLAB (not ported to C++ yet)
Performance analysis: most of the error is attributable to noise, global head motion, and a rather weak vector quantization method.

WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION
Preliminary results

WP6: SKELETON AND 3D MODEL GENERATION
- We have utilized a skeletal animation technique.
- The skeleton is predetermined and consists of 16 joints and accompanying bones (an illustrative representation follows).
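As an illustration of how such a predetermined skeleton might be represented (the joint names, the two-joint example and the 4x4 matrix convention below are assumptions, not the project's actual data layout): each joint stores its parent index and a local transform, and world transforms are obtained by walking the hierarchy from the root.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Joint:
        name: str
        parent: int            # index of the parent joint, -1 for the root
        local: np.ndarray      # 4x4 local transform relative to the parent

    def world_transforms(joints):
        """Compose local transforms along the hierarchy (parents listed before children)."""
        world = [None] * len(joints)
        for i, j in enumerate(joints):
            world[i] = j.local if j.parent < 0 else world[j.parent] @ j.local
        return world

    # Minimal two-joint example; a full rig would define all 16 joints this way.
    skeleton = [
        Joint("root",  -1, np.eye(4)),
        Joint("spine",  0, np.eye(4)),
    ]
    print([w.shape for w in world_transforms(skeleton)])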
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
- Supports skeletal animation for the predetermined skeleton:
  - Reads skeleton parameters at each frame from incoming command files
  - Applies the parameters to the model in real time
- Allows different models to be bound to the skeleton: the same skeleton can be bound to different 3D models
- Supports inverse kinematics: allows absolute coordinates as commands
- Supports basic physics (gravity): allows forward kinematics via forces

WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
- Forward kinematics: "Given the angles at all of the robot's joints, what is the position of the hand?"
- Inverse kinematics: "Given the desired position of the robot's hand, what must be the angles at all of the robot's joints?"
Cyclic-Coordinate Descent (CCD) algorithm for IK:
- Traverse the linkage from the distal joint inwards
- Optimally set one joint at a time
- Update the end effector with each joint change
- At each joint, minimize the difference between the end effector and the goal

WP7: DEVELOPMENT OF THE GRAPHICS ENGINE
Future work:
- Implement and optimize CCD (90% complete)
- Load geometry data from Autodesk FBX files
- Advanced shading for the puppets, e.g. fur
- Rig to multiple models
- Choose and implement a convenient method for visualizing face parameters and expressions

WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT
- The "visualization computer" acts as a server: it listens to the other computers and accepts binary XML files
- Works over TCP/IP
- The XML is parsed and the parameters are extracted
- Each packet may contain several parameters and commands: either low-level joint angles as a set, or a high-level command such as a new gesture or expression

WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT
Threaded TCP/IP server, binary XML:

    <?xml version="1.0" encoding="UTF-8" ?>
    <handPuppet timeStamp="str" source="str">
      <paramset>
        <H rx="f" ry="f" rz="f" />
        <ER ry="f" rz="f" />
        <global tx="f" ty="f" tz="f" rx="f" ry="f" rz="f" />
      </paramset>
      <anim id="str" />
      <emo id="str" />
    </handPuppet>

CONCLUSION
- Individual modules: most of the modules are nearly complete
- Final application: tracked features are not yet bound to the skeleton parameters; the model skin and animations are missing
- New ideas that emerged during the workshop:
  - Estimate the forward mapping for GPLVMs using NNs
  - Use HSMMs for facial expressions
  - Fit a 3D ellipse to the 3D point cloud of the hand
  - Extract manual features such as edge activity on the forehead

FUTURE WORK
- Once hand tracking is complete, gestures will be trained using HSMMs
- All MATLAB code will be ported to C++ (mostly OpenCV)
- Hand pose complexity will be gradually increased, until it is no longer possible in real time
- Inverse kinematics will be fully implemented
- A face model capable of showing emotions will be incorporated into the 3D model for easy visualization