ENTERFACE’10
VISION BASED HAND PUPPET
Final Presentation
PROJECT OBJECTIVE

To develop a multimodal interface to manipulate the low- and high-level aspects of 3D hierarchical digital models.

 The hands and the face of possibly separate performers will be tracked.
 Their gestures and facial expressions will be recognized.
 Estimated features will be mapped to digital puppets in real-time.
PROJECT OBJECTIVE

The project involves:

 Tracking of both hands
  Background segmentation
  Skin color filtering
  Particle filtering
 Facial parameter tracking
  Active appearance models
  Filtering
 Hand pose estimation
  Dimensionality reduction
  Keyframe classification
 Gesture and expression recognition
  Hidden semi-Markov models
 Visualization
  Skeleton model
  Inverse kinematics
  Physics
 Networking
  XML
WORK PACKAGES

WP1: Data collection and ground-truth creation for the pose estimation module
WP2: Hand posture modeling and training
WP3: Stereovision based hand tracking
WP4: Vision based facial expression tracking
WORK PACKAGES

WP5: Gesture and expression spotting and recognition
WP6: Skeleton and 3D model generation
WP7: Development of the graphics engine with skeletal animation support
WP8: Network protocol design and module development
FLOWCHART OF THE SYSTEM
WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION

PROBLEM: Hand pose estimation requires annotated images for training.

 Each hand pose must be exactly known, which is not possible without special devices such as data gloves.
 As manual annotation requires a lot of work, we create the training images synthetically.

Poser: a software package that can manipulate models with skeletons and render photorealistic images via Python scripts.
WP1: DATA COLLECTION AND GROUND-TRUTH CREATION FOR POSE ESTIMATION

Poser
 Imitates a stereo camera setup and produces photorealistic renders
 Automatically generates the silhouettes from the rendered images
 Allows Python scripts to manipulate any parameter of the scene
 A single script can generate an entire multicamera dataset
WP1: DATA COLLECTION - METHODOLOGY

We iteratively increase the complexity of the visualized hand, starting with an open hand.
 Start with 3 degrees of freedom:
  Side to side (waving)
  Bend (up-down)
  Twist

We created 2 x 1000 images for training, as sketched below:
 Created for a specific stereo camera setup
 Manipulated each d.o.f. in sequence
 Extracted the silhouettes
 Saved them along with the generating parameters
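
A minimal sketch of such a generation loop in Python. The render_hand() helper is hypothetical and stands in for the PoserPython rendering calls, whose exact names vary by Poser version:

import itertools
import json
import numpy as np

def render_hand(side, bend, twist, camera):
    """Hypothetical stand-in for the PoserPython render call: returns
    an 80x80 binary silhouette for the given joint angles and camera."""
    raise NotImplementedError

# Angle grids for the three degrees of freedom (degrees).
sides = np.linspace(0, 90, 10)
bends = np.linspace(-90, 90, 10)
twists = np.linspace(-60, 60, 10)

dataset = []
for side, bend, twist in itertools.product(sides, bends, twists):
    for camera in ("left", "right"):                    # stereo pair
        silhouette = render_hand(side, bend, twist, camera)
        dataset.append({"camera": camera,
                        "angles": [side, bend, twist],  # generating parameters
                        "silhouette": silhouette.tolist()})

with open("hand_dataset.json", "w") as f:
    json.dump(dataset, f)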

WP1: DATA COLLECTION - CONCLUSION

Using Poser to generate training images is a very efficient method:
 Can potentially create a very large database in a few hours
 It is very simple to create any multicamera setup
  Every extrinsic and intrinsic camera parameter can be set via Python script
 Automatically extracts the silhouettes
 Provides high-level manipulation parameters for the body parts
  e.g. grasping and spreading for the hand
WP2: HAND POSE ESTIMATION

AIM:
 To estimate hand skeleton parameters using hand silhouettes from multiple cameras

IDEA:
 Use dimensionality reduction to map silhouettes to a space with much lower dimensionality
 When an unknown silhouette arrives, search for the closest known point in the reduced space
WP2: MANIFOLD LEARNING WITH SPARSE-GPLVM

Poser's hand model is used for rendering.
 Hand silhouette images (80x80) are rendered.
 80x80 = 6400-dimensional silhouette vector.
 1000 training samples per camera have been captured by iterating over x, y and z with a Python script:
  x = [0°, 90°], y = [-90°, +90°], z = [-60°, +60°]
 2 cameras that are placed orthogonal to each other are simulated.
 2000 training samples are collected in total.
WP2: PREPROCESSING, GPLVM AND LEARNING A FORWARD MAPPING WITH NNS

GPLVM is a non-linear probabilistic PCA.
 For additional speed gains, conventional PCA has been applied as a preprocessing step, capturing 99% of the total variance.
 GPLVM is applied afterwards. This made the optimization process ~4 times faster.
 GPLVM finds a backward mapping from the latent space (2D) to the PCA feature space (~250D for 99% variance).
 For fast generation of initial search points, a forward mapping from feature space to latent space is trained using a neural network with 15 hidden neurons; a sketch follows below.
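
A minimal sketch of the PCA preprocessing and the forward mapping, assuming scikit-learn (an assumption; the tooling is not named in the slides). The GPLVM step itself is omitted and its latent coordinates are taken as given:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# X: N x 6400 silhouette vectors; Z: N x 2 latent points found by GPLVM.
# Both file names are hypothetical placeholders.
X = np.load("silhouettes.npy")
Z = np.load("gplvm_latent.npy")

# PCA preprocessing: keep enough components for 99% of the variance (~250D).
pca = PCA(n_components=0.99)
F = pca.fit_transform(X)

# Forward mapping (PCA feature space -> latent space) with 15 hidden neurons,
# used at run time to generate initial search points quickly.
forward = MLPRegressor(hidden_layer_sizes=(15,), max_iter=2000)
forward.fit(F, Z)

z0 = forward.predict(pca.transform(X[:1]))  # initial latent guess for a new frame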

WP2: HAND POSE ESTIMATION FLOWCHART

(flowchart) Capture frame -> Find foreground -> Filter skin colors -> Extract silhouette -> Use PCA -> Use NN from PCA space to latent space -> Use GPLVM from 2D per camera -> Nearest neighbor classifier (in the 2D latent space)
WP2: CLASSIFICATION

A 2-dimensional latent space has been found in a smooth fashion by the GPLVM optimization.
 Therefore a nearest neighbor matcher has been used as the classifier in the latent space; see the sketch below.
 Ground-truth angles of the hand poses are known. An exact pose match is looked for; any divergence from the exact angles is counted as a classification error.
 For the synthetic environment prepared with Poser, a classification performance of 94% has been reached in the 2D latent space.
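
A minimal sketch of the latent-space matching, again assuming scikit-learn and the hypothetical data files from the previous sketch:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

Z = np.load("gplvm_latent.npy")     # N x 2 latent training points
pose_ids = np.load("pose_ids.npy")  # ground-truth pose label per sample

matcher = KNeighborsClassifier(n_neighbors=1)
matcher.fit(Z, pose_ids)

# z_new: latent point of an incoming silhouette (from the forward mapping).
z_new = np.array([[0.1, -0.3]])
print(matcher.predict(z_new))       # label of the closest known pose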

WP3: STEREOVISION BASED HAND TRACKING

Objective
 Obtain the 3D position of the hands
 Enable real-time, low-noise tracking of robust features for gesture recognition tasks
 Track some intuitive spatial parameters to map them directly to the puppet

Approach
 Skin color as the main cue for hand location
 Stereo camera to obtain 3D information
 Particle filtering for robust tracking

WP3: STEREOVISION BASED HAND TRACKING

Skin-color segmentation
 Bayesian color model
 Chromatic color space (HS)
 Train the color model on image regions obtained from a face tracker
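
A minimal sketch of such a Bayesian skin model over the chromatic (H, S) plane, assuming OpenCV and NumPy; the project's actual implementation is not shown in the slides:

import cv2
import numpy as np

def train_skin_model(hs_skin, hs_all, bins=32):
    """Histogram-based Bayes rule: P(skin | H,S) =
    P(H,S | skin) * P(skin) / P(H,S). Inputs are N x 2 arrays of (H, S)."""
    rng = ((0, 180), (0, 256))                    # OpenCV ranges for H and S
    h_skin, _ = np.histogramdd(hs_skin, bins=(bins, bins), range=rng)
    h_all, _ = np.histogramdd(hs_all, bins=(bins, bins), range=rng)
    prior = len(hs_skin) / float(len(hs_all))
    likelihood = h_skin / max(h_skin.sum(), 1.0)
    evidence = np.maximum(h_all / max(h_all.sum(), 1.0), 1e-9)
    return likelihood * prior / evidence

def skin_probability(bgr, model, bins=32):
    """Per-pixel skin probability map for a BGR frame."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    h = hsv[..., 0].astype(int) * bins // 180
    s = hsv[..., 1].astype(int) * bins // 256
    return np.minimum(model[h, s], 1.0)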

WP3: STEREOVISION BASED HAND TRACKING

Particle filtering (CONDENSATION)

Initialization (midterm result)
 The biggest skin-colored blob is assumed to be the hand
 Stereo matching to obtain the 3D hand location

Tracking (new); a weighting sketch follows below
 Color cue: accumulated skin color probability, weighted by the percentage of skin-colored pixels in the particle ROI
 Depth cue: deviation of the ROI disparity from the disparity value implied by the particle location
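
A minimal sketch of how the two cues could be combined into a particle weight, assuming NumPy; roi_around() is a hypothetical helper and the constants are illustrative:

import numpy as np

def roi_around(img, x, y, r=15):
    """Small square region of interest around (x, y), clipped to the image."""
    x, y = int(x), int(y)
    return img[max(y - r, 0):y + r, max(x - r, 0):x + r]

def particle_weight(particle, prob_map, disparity_map, focal_baseline, sigma_d=2.0):
    """Weight one (x, y, z) particle by the color cue and the depth cue."""
    x, y, z = particle
    roi = roi_around(prob_map, x, y)
    # Color cue: accumulated skin probability, weighted by skin-pixel coverage.
    coverage = np.mean(roi > 0.5)
    color_cue = roi.mean() * coverage
    # Depth cue: compare the observed ROI disparity with the disparity the
    # particle's depth implies (d = f * b / z for a rectified stereo pair).
    d_expected = focal_baseline / z
    d_observed = np.median(roi_around(disparity_map, x, y))
    depth_cue = np.exp(-0.5 * ((d_observed - d_expected) / sigma_d) ** 2)
    return color_cue * depth_cue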

WP4: VISION-BASED EMOTION RECOGNITION

AIM:
 To enable the digital puppet to imitate facial emotions
 To change the digital puppet's states using facial expressions

METHODOLOGY:
 Active shape model (ASM) based facial landmark tracker
 Track all shape parameters - a set of landmark points
 Extract useful features manually
  Eyebrow leverage, lip width, etc.
 Classify the features
  Using HSMMs
  Using a nearest neighbor classifier

WP4: VISION-BASED EMOTION RECOGNITION

The ASM is trained using an annotated set of face images.
 The search starts from the mean shape, aligned to the face located by a global face detector.
 The following steps are repeated until convergence (see the sketch below):
  Adjust the locations of the shape points by template matching of the image texture around each landmark, and propose a new shape
  Conform this new shape to a global shape model (based on PCA)
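
A minimal sketch of that search loop, assuming NumPy; best_local_matches() is a hypothetical helper standing in for the per-landmark template matching:

import numpy as np

def asm_search(image, mean_shape, eigvecs, eigvals, n_iter=20, tol=0.5):
    """Alternate local template matching with projection onto the PCA
    shape model until the shape stops moving. mean_shape is L x 2,
    eigvecs is 2L x k, eigvals holds the k model variances."""
    shape = mean_shape.copy()                        # start from the aligned mean
    for _ in range(n_iter):
        # 1. Propose: move each landmark to its best local texture match
        #    (best_local_matches is hypothetical).
        proposed = best_local_matches(image, shape)
        # 2. Conform: project onto the shape model and clamp the coefficients
        #    to +/- 3 standard deviations so the shape stays plausible.
        b = eigvecs.T @ (proposed - mean_shape).ravel()
        b = np.clip(b, -3 * np.sqrt(eigvals), 3 * np.sqrt(eigvals))
        new_shape = mean_shape + (eigvecs @ b).reshape(mean_shape.shape)
        if np.max(np.abs(new_shape - shape)) < tol:  # converged
            return new_shape
        shape = new_shape
    return shape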

REAL-TIME VISION-BASED EMOTION RECOGNITION

(flowchart) Video frame captured from an ordinary webcam -> Facial landmark tracking based on active shape models (both generic and person-specific models) -> Feature extraction based on intensity changes in specific face regions and distances of specific landmarks -> Emotion recognition of the six universal expressions: happiness, sadness, surprise, anger, fear, disgust
WP4: VISION-BASED EMOTION RECOGNITION - NEAREST NEIGHBOR
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION

Has to "spot" gestures and expressions in continuous streams
 No start/end signals
 Should recognize only when a command is over
 Should run in real time
 We use hidden semi-Markov models (HSMMs)
  Inhomogeneous explicit-duration models
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION

(figure) HMM vs. HSMM model topology
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION

Why HSMMs?
 HMMs model duration lengths only implicitly
  Using self-transition probabilities
  This imposes a geometric distribution on each duration
  Variance and mean are correlated: a high mean forces a high variance
 A geometric distribution and/or a high variance do not suit every application
  Speech, hand gestures, expressions, ...
 HSMMs explicitly model durations; a numerical illustration follows below
 HMMs are a special case of HSMMs
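
A small numerical illustration of the limitation, assuming only NumPy: with self-transition probability a, an HMM state duration d follows P(d) = a^(d-1) * (1-a), whose variance is tied to its mean, while an HSMM can attach any explicit duration distribution:

import numpy as np

a = 0.9                                  # HMM self-transition probability
d = np.arange(1, 200)
geometric = (a ** (d - 1)) * (1 - a)     # implied duration distribution
mean = 1 / (1 - a)                       # 10 frames
var = a / (1 - a) ** 2                   # 90: high mean forces high variance

# An HSMM can instead use any explicit duration pmf, e.g. one concentrated
# tightly around 10 frames, which a geometric pmf can never express.
explicit = np.exp(-0.5 * ((d - 10) / 1.5) ** 2)
explicit /= explicit.sum()
print(mean, var, float(explicit @ d))    # HSMM keeps mean 10, variance ~2.25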
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION

Training module
 Developed in MATLAB - no real-time requirement
 Yet very fast, and does not require many samples
 Previously experimented with 25 hand gestures in continuous streams
  Achieved a 99.7% recognition rate
 For this project, also experimented with facial expressions
  Six expressions
  Long continuous streams for training - not annotated
  Results look good; no numerical results due to lack of ground truth (future work)
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION

Recognition module
 Converted to an on-line algorithm
 Uses the recent history to determine the current state, running Viterbi on a large HSMM
  As expressions are independent, this does not introduce much error (about 1.5% in the number of misclassified frames)
 Runs in real time in MATLAB (not ported to C++ yet)

Performance analysis
 Most of the error is attributable to:
  Noise
  Global head motion
  A rather weak vector quantization method
WP5: GESTURE AND EXPRESSION SPOTTING AND RECOGNITION

Preliminary results
WP6: SKELETON AND 3D MODEL GENERATION

We have utilized a skeletal animation technique.
 The skeleton is predetermined and consists of 16 joints and accompanying bones.
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE

Supports skeletal animation for the predetermined skeleton
 Reads skeleton parameters at each frame from incoming command files
 Applies the parameters to the model in real time
Allows different models to be bound to the skeleton
 The same skeleton can be bound to different 3D models
Supports inverse kinematics
 Allows absolute coordinates as commands
Supports basic physics (gravity)
 Allows forward kinematics via forces
WP7: DEVELOPMENT OF THE GRAPHICS ENGINE

Forward kinematics:
 "Given the angles at all of the robot's joints, what is the position of the hand?"
Inverse kinematics:
 "Given the desired position of the robot's hand, what must be the angles at all of the robot's joints?"
Cyclic Coordinate Descent (CCD) algorithm for IK; a sketch follows below
 Traverse the linkage from the distal joint inwards
 Optimally set one joint at a time
 Update the end effector with each joint change
 At each joint, minimize the difference between the end effector and the goal
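
A minimal 2D CCD sketch in Python with NumPy, illustrative only (the engine itself works on the 3D skeleton):

import numpy as np

def chain_points(angles, lengths):
    """Joint positions of a planar chain (relative angles), base at the origin."""
    pts, total = [np.zeros(2)], 0.0
    for ang, ln in zip(angles, lengths):
        total += ang
        pts.append(pts[-1] + ln * np.array([np.cos(total), np.sin(total)]))
    return np.array(pts)

def ccd_ik(angles, lengths, goal, n_iter=100, tol=1e-3):
    """Cyclic Coordinate Descent: set one joint at a time, traversing
    from the distal joint inwards, until the end effector reaches the goal."""
    angles = np.asarray(angles, dtype=float)
    for _ in range(n_iter):
        for i in reversed(range(len(angles))):
            pts = chain_points(angles, lengths)
            end, pivot = pts[-1], pts[i]
            # The optimal rotation for joint i aligns pivot->end with pivot->goal.
            angles[i] += (np.arctan2(*(goal - pivot)[::-1])
                          - np.arctan2(*(end - pivot)[::-1]))
        if np.linalg.norm(chain_points(angles, lengths)[-1] - goal) < tol:
            break
    return angles

# Example: a three-link arm reaching for a point.
print(ccd_ik([0.3, 0.2, 0.1], [1.0, 1.0, 1.0], np.array([1.5, 1.0])))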

WP7: DEVELOPMENT OF THE GRAPHICS ENGINE

Future work
 Implement and optimize CCD (90% complete)
 Load geometry data from Autodesk FBX files
 Advanced shading for puppets, e.g. fur
 Rig to multiple models
 Choose and implement a convenient method for visualizing face parameters and expressions
WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT

The "visualization computer" acts as a server and listens to the other computers
 Accepts binary XML files
 Works over TCP/IP
The XML is parsed and the parameters are extracted
 Each packet may contain several parameters and commands
  Either low-level joint angles as a set
  Or a high-level command, such as a new gesture or expression

WP8: NETWORK PROTOCOL DESIGN AND MODULE DEVELOPMENT

Threaded TCP/IP server
 Binary XML

<?xml version="1.0" encoding="UTF-8" ?>
<handPuppet timeStamp="str" source="str">
  <paramset>
    <H rx="f" ry="f" rz="f" />
    <ER ry="f" rz="f" />
    <global tx="f" ty="f" tz="f" rx="f" ry="f" rz="f" />
  </paramset>
  <anim id="str" />
  <emo id="str" />
</handPuppet>
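
A minimal sketch of how a server could parse such a packet once the binary XML has been decoded to text, using only Python's standard library (illustrative; the actual module is not shown in the slides):

import xml.etree.ElementTree as ET

packet = """<handPuppet timeStamp="0" source="tracker">
  <paramset><H rx="0.1" ry="0.0" rz="0.2" /></paramset>
  <anim id="wave" />
</handPuppet>"""

root = ET.fromstring(packet)

# Low-level parameters: every joint element inside <paramset>.
paramset = root.find("paramset")
if paramset is not None:
    for joint in paramset:
        angles = {k: float(v) for k, v in joint.attrib.items()}
        print("set joint", joint.tag, angles)

# High-level commands, if present in the packet.
for cmd in ("anim", "emo"):
    node = root.find(cmd)
    if node is not None:
        print("trigger", cmd, node.attrib["id"])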
CONCLUSION

Individual modules
 Most of the modules are nearly complete

Final application
 Tracked features not yet bound to skeleton parameters
 Model skin and animations missing

New ideas that emerged during the workshop
 Estimate forward mappings for GPLVMs using NNs
 Use HSMMs for facial expressions
 Fit a 3D ellipse to the 3D point cloud of the hand
 Extract manual features such as edge activity on the forehead

FUTURE WORK

Once hand tracking is complete, gestures will be trained using HSMMs.
 All MATLAB code will be ported to C++ (mostly OpenCV).
 Hand pose complexity will be gradually increased, until no longer feasible in real time.
 Inverse kinematics will be fully implemented.
 A face model capable of showing emotions will be incorporated into the 3D model for easy visualization.
