A Random Forest Approach to Segmenting and Classifying Gestures
Ajjen Joshi1, Camille Monnier2, Margrit Betke1 and Stan Sclaroff1
1 Department of Computer Science, Boston University, Boston, MA 02215 USA
2 Charles River Analytics, Cambridge, MA 02138 USA
Abstract— This work investigates a gesture segmentation and
recognition scheme that employs a random forest classification
model. Our method trains a random forest model to recognize
gestures from a given vocabulary, as presented in a training
dataset of video plus 3D body joint locations, as well as out-of-vocabulary (non-gesture) instances. Given an input video
stream, our trained model is applied to candidate gestures
using sliding windows at multiple temporal scales. The class
label with the highest classifier confidence is selected, and
its corresponding scale is used to determine the segmentation
boundaries in time. We evaluated our formulation in segmenting and recognizing gestures from two different benchmark
datasets: the NATOPS dataset of 9,600 gesture instances from
a vocabulary of 24 aircraft handling signals, and the ChaLearn
dataset of 7,754 gesture instances from a vocabulary of 20 Italian communication gestures. The performance of our method
compares favorably with state-of-the-art methods that employ
Hidden Markov Models or Hidden Conditional Random Fields
on the NATOPS dataset.
I. INTRODUCTION
The problem of spotting and recognizing meaningful gestures has been an important research endeavor in the fields
of computer vision and pattern recognition. Research in this
domain has a broad scope of applications such as recognizing
sign-language symbols, enabling video surveillance, and developing new modes of human-computer interaction, among
others.
A common approach in solving the gesture segmentation
and classification problem involves separating it into two subproblems, where the task of segmentation precedes the
task of recognition. In this method [1], [2], [3], the focus
is on first finding the gesture segmentation boundaries in
time. The candidate gestures produced by the segmentation
algorithm are then classified. One of the limitations of this
approach is the dependence of classification on segmentation:
a good gesture classification algorithm will fail to yield
desirable results if the segmentation algorithm is inaccurate.
Another disadvantage of this method is the difficulty in
distinguishing contiguously occurring gestures.
Ours is a unified approach that simultaneously performs the tasks of segmentation and classification. In methods such as ours [4], [5], gesture intervals that receive above-threshold scores from the classifier are taken as the segmented, labeled gestures. Thus, we attempt to design a
framework capable of automatically and accurately spotting
and classifying gestures present in a set of test videos, given
a training set of RGBD videos and 3D joint locations with
multiple examples of all gestures in a gesture vocabulary.
We take a random forest approach to creating a fast and
accurate classifier. Random forests are an example of an
ensemble method, where multiple classifiers engage in a
voting strategy to provide the final prediction. They have been
applied to good effect in real-time human pose recognition
[6], object segmentation [7], and image classification [8]
among others. Many works dealing with spatiotemporal
signals, such as gestures, employ graphical models such as
Conditional Random Fields (CRFs) [9], [10], and Hidden
Markov Models (HMMs) [5], [11] in order to model relationships and variations in both the temporal and spatial
domains. We show that taking a random forest approach to gesture classification tasks can be beneficial because random forests are
often simpler to implement and easier to train than graphical
models, while providing a comparable (and in some cases,
better) accuracy in recognition.
The key contributions of this work are: (1) the design
of a simple framework that employs a single multi-class
random forest classification model to distinguish gestures
from a given vocabulary in a continuous video stream, (2) the
fusion of 3D joint-based features with color and appearance-based features to create an accurate feature representation of
gestures that is robust to variations in user height, distance
of user to sensor and speed of execution of gesture, and (3)
the creation of a uniform feature descriptor for gestures to
account for the variability in their length by dividing gestures
into a fixed number of temporal segments followed by the
concatenation of the representative feature vectors of each
temporal segment.
II. RELATED WORK
Here, we list and briefly explain some of the important
methods that have been used in gesture recognition and
are relevant to our work. A more comprehensive survey of
gesture recognition techniques can be found elsewhere [12].
Nearest neighbor models are often used in gesture classification problems. Malassiotis et al. [13] used a k-NN classifier
to classify static sign language hand gestures. A normalized
cross-correlation measure was used to compare the feature
vector of an input image with those in the k-NN model.
Dynamic Time Warping (DTW) can be used to compute a
matching score between two temporal sequences, a variant
of which was used by Alon et al. [4]. A drawback of k-NN
models is the difficulty in defining distance measures that
clearly demarcate different classes of time series observations.
A Hidden Markov Model (HMM) is another widely used
tool in temporal pattern recognition, having been implemented in applications of speech recognition, handwriting
recognition, as well as gesture recognition. Starner et al. [11]
employed an HMM-based system to recognize American
Sign Language symbols. One difficulty in implementing HMMs is determining an appropriate number of hidden
states, which can be domain-dependent.
The Conditional Random Field (CRF), introduced by Lafferty et al. [14], is a discriminative graphical model with an
advantage over generative models, such as HMMs: the CRF
does not assume that observations are independent given the
values of the hidden variables. Hidden Conditional Random
Fields (HCRF) use hidden variables to model the latent
structure of the input signals by defining a joint distribution
over the class label and hidden state labels conditioned on
the observations [15]. HCRFs can model the dependence
between each state and the entire observation sequence,
unlike HMMs, which only capture the dependencies between
each state and its corresponding observation. Song et al.
used a Gaussian temporal-smoothing HCRF [9] to classify
gestures that combine both body and hand signals. They also
presented continuous Latent Dynamic CRFs [10] to classify
unsegmented gestures from a continuous input stream of
gestures.
Randomized decision forests have been applied in a variety of ways in problems related to classifying gestures.
Miranda et al. [16] used a gesture recognition scheme based
on decision forests, where each node in a tree in the forest
represented a keypose, and the leaves of the trees represented
gestures corresponding to the sequence of keyposes that
constitute the gesture as one traverses down a tree from
root to leaf. Gall et al. [17] used Hough forests to perform
action recognition. In Hough forests, a set of randomized
trees is trained to perform a mapping from a densely-sampled
d-dimensional feature space into corresponding votes in
Hough space. Demirdjian et al. [18] proposed the use of
temporal random forests in order to recognize temporal
events. Randomized decision forests have been shown to be
robust to the effects of noise and outliers. Moreover, they
generalize well to variations in data [19]. Thus, random
forests are suitable for classification tasks involving data
such as gestures because data collected by image and depth
sensors can be sensitive to noise and their execution can
exhibit a high level of variance.
III. SYSTEM OVERVIEW
A. Training
An overview of how our gesture recognition framework is
trained is shown in Figure 1. Here, we explain in detail the
elements of our framework:
1) Input: The input to our framework consists of RGB
images, depth maps and 3D skeletal joint data for every
frame of the videos in the datasets. Each input video contains
several in-vocabulary gestures and is labeled with ground
truth temporal segmentation as well as class labels. Let
c be the number of different gestures that are present in
the gesture vocabulary. We used all instances of each of
the c different gestures and created additional examples to
represent a non-gesture class by randomly selecting intervals of varying length between two gestures. This set of examples was used to train a (c + 1)-class random forest classifier.

Fig. 1. Pipeline view of training our gesture recognition framework
2) Feature Extraction: Each training example consists
of a varying number of frames, each of which is described
by a feature descriptor. Our system computes normalized
positional and velocity features for nine different skeletal
body joints (left and right shoulders, elbows, wrists and
hands, as well as the head joint). Since gestures are
performed by subjects with different heights, at different
distances from the camera sensor, we first normalized the
positional coordinates of the users’ joints using the length
of the user’s torso as a reference. The normalized position
vector for joint j at time t is:
  W_j(t) = \frac{W_{rj}(t) - W_{rhip}(t)}{l},    (1)

where W_{rj}(t) is the raw position vector for joint j at time t, W_{rhip}(t) is the raw position vector for the hip joint at time t, and l is the length of the torso, defined as:

  l = \|W_{head} - W_{hip}\|.    (2)
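As an illustration of this normalization, here is a minimal NumPy sketch of Eqs. (1)-(2); the array layout and joint indices are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def normalize_joints(raw_joints, hip_idx, head_idx):
    """Normalize raw 3D joint positions per Eqs. (1)-(2).

    raw_joints: array of shape (num_frames, num_joints, 3) holding raw
    sensor coordinates (hypothetical layout). Returns an array of the
    same shape, hip-centered and scaled by the per-frame torso length
    ||head - hip||.
    """
    hip = raw_joints[:, hip_idx, :]                      # (T, 3)
    head = raw_joints[:, head_idx, :]                    # (T, 3)
    torso_len = np.linalg.norm(head - hip, axis=1)       # l, shape (T,)
    centered = raw_joints - hip[:, None, :]              # W_rj - W_rhip
    return centered / torso_len[:, None, None]           # divide by l

# Example: 100 frames, 20 joints; hip at index 0, head at index 3 (assumed)
frames = np.random.rand(100, 20, 3)
normalized = normalize_joints(frames, hip_idx=0, head_idx=3)
```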
Our system uses the normalized positional coordinates (W_x, W_y, W_z) of these nine joints along with their rotational values (R_x, R_y, R_z, R_w), which are provided with the dataset, and computes values for their velocities (W'_x, W'_y, W'_z, R'_x, R'_y, R'_z, R'_w).
Thus, 126 features are extracted from the 3D skeletal data for every frame. In addition, we found that
augmenting our skeletal feature vector with Histogram of
Oriented Gradients (HOG) features [20] on 32x32 pixel
squares centered on the left and right hands helps improve
classification accuracy. The HOG features for each of the
two squares can be represented as a 324-dimensional vector. We obtain a dimensionality-reduced representation by
performing Principal Component Analysis (PCA) and using
the first 20 principal components for each hand. Thus, every
frame of every instance in our training set is represented by
a 166-dimensional feature descriptor.
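The hand-patch descriptor can be approximated as in the sketch below; the HOG parameters (9 orientations, 8×8-pixel cells, 2×2-cell blocks) are our assumption, chosen because they reproduce the stated 324-dimensional length on a 32×32 patch, and scikit-image/scikit-learn stand in for whichever implementation was actually used.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

def hand_hog(patch_32x32):
    # On a 32x32 patch, 8x8 cells and 2x2-cell blocks give 3x3 block
    # positions * 4 cells * 9 orientation bins = 324 values.
    return hog(patch_32x32, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# Fit PCA on HOG descriptors from the training frames (one hand shown).
train_patches = np.random.rand(1000, 32, 32)                # placeholder patches
hog_train = np.array([hand_hog(p) for p in train_patches])  # (1000, 324)
pca = PCA(n_components=20).fit(hog_train)
hog_reduced = pca.transform(hog_train)                      # (1000, 20) per hand
```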
3) Gesture Representation: In order to remove the effects
of noisy measurements, we first smoothed all features using a
moving average filter spanning 5 frames. Smoothing features
slightly improved classification accuracy. Because instances
of gestures and non-gestures in our training set are temporal
sequences of varying length, there arises the need to represent
every gesture with a feature vector of the same length. We
achieved this by dividing the gesture into 10 equal-length
temporal segments, and representing each temporal segment
with a vector of the median elements of all features. Using 10
temporal segments provided a balance between keeping the
feature representation concise, while encapsulating enough
temporal information useful in discerning the gesture classes.
Using the median elements of all features provided better
performance than using the mean feature value, or the feature
value corresponding to the median frame of the temporal segment. The representative vectors of each temporal segment
were then concatenated into a single feature vector.
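A minimal sketch of this fixed-length gesture representation follows, assuming each gesture arrives as a (frames × 166) feature matrix; splitting with np.array_split is one reasonable reading of "equal-length temporal segments".

```python
import numpy as np

def gesture_descriptor(features, num_segments=10):
    """Convert a variable-length gesture (T x D feature matrix) into a
    fixed-length descriptor: split it into `num_segments` roughly equal
    temporal segments, take the per-feature median of each segment, and
    concatenate the results (length num_segments * D)."""
    segments = np.array_split(features, num_segments, axis=0)
    medians = [np.median(seg, axis=0) for seg in segments]
    return np.concatenate(medians)

# A 47-frame gesture with 166 features per frame -> 1660-d descriptor
gesture = np.random.rand(47, 166)
descriptor = gesture_descriptor(gesture)   # shape (1660,)
```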
4) Random Forest Training: We defined the training set as D = {(X_1, Y_1), ..., (X_n, Y_n)}. Here, (X_1, ..., X_n) are the uniform-length feature vectors representing each gesture or non-gesture, and (Y_1, ..., Y_n) are their corresponding class labels.
A random forest classification model consists of several decision tree classifiers {t(x, φ_k), k = 1, ...} [19]. Each decision tree t(x, φ_k) in the forest is grown to full depth. Here x is an input vector and φ_k is a random vector used to generate a bootstrap sample of objects from
the training set D. The ideal number of trees in our random
forest model was determined to be 500 by studying the Out-of-Bag (OOB) error rate in the training data.
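One way to study the OOB error as trees are added is sketched below with scikit-learn's warm_start mechanism; the data shapes are placeholders, and this is not necessarily how the number of trees was selected in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(500, 1660)            # placeholder gesture descriptors
y_train = np.random.randint(0, 25, size=500)   # c + 1 = 25 class labels (assumed)

# Grow the forest incrementally and record the OOB error after each step.
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                max_features="sqrt", random_state=0)
oob_errors = []
for n_trees in range(50, 751, 50):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_train, y_train)
    oob_errors.append((n_trees, 1.0 - forest.oob_score_))
# Pick the smallest forest beyond which the OOB error stops improving.
```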
Let d be the dimensionality of the feature vector of the inputs. At each internal node of the tree, m features are selected randomly from the available d, such that m < d. m = √d provided the highest accuracy among other common choices for m (1, 0.5√d, 2√d, d). From the m chosen
features, the feature that provides the most information gain
is selected to split the node. Information gain (I) can be
defined as:
  I_j = H(S_j) - \sum_{k \in \{L,R\}} \frac{|S_j^k|}{|S_j|} H(S_j^k),    (3)

where S_j is the set of training points at node j, H(S_j) is the Shannon entropy at node j before the split, and S_j^L and S_j^R are the sets of points at the left and right children, respectively, of the parent node j after the split.
The Shannon entropy can be defined as:

  H(S) = -\sum_{c \in C} p_c \log(p_c),    (4)

where S is the set of training points and p_c is the probability
of a sample being class c.
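The split criterion of Eqs. (3) and (4) can be written out directly, as in the didactic sketch below (scikit-learn's "entropy" criterion computes the same quantity internally); the label arrays are hypothetical.

```python
import numpy as np

def shannon_entropy(labels):
    """H(S) = -sum_c p_c log(p_c) over the class labels in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent_labels, left_labels, right_labels):
    """Eq. (3): entropy of the parent set minus the size-weighted
    entropy of the left and right children produced by a split."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * shannon_entropy(left_labels) \
             + (len(right_labels) / n) * shannon_entropy(right_labels)
    return shannon_entropy(parent_labels) - weighted

# Example: a split that separates two classes perfectly has maximal gain.
parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(parent, parent[:3], parent[3:]))   # ~0.693 (ln 2)
```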
We trained and saved a random forest classification model
based on the features that we extracted. There is a need to
strengthen the classifier’s ability to accurately detect intervals
of non-gestures because the randomly chosen intervals of
non-gestural examples fail to fully model the class of non-gestures. In order to achieve this, we applied the random
forest model on continuous input of the training set and
collected false positives and false negatives, which are examples of intervals from the training set that the classifier
fails to classify correctly. The set of false positives and false
negatives instances is then added to the original training set,
and the random forest is re-trained using the new extended
set of training examples. This bootstrapping process, as in Marin et al. [21], is repeated iteratively until the number of false positives falls below a threshold,
which was empirically determined to be one false positive
per training sample.
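A condensed sketch of this training stage with scikit-learn is given below; the helper mine_hard_examples is hypothetical and stands in for running the trained model over continuous training input and collecting the misclassified intervals.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_forest(X, y):
    # 500 fully grown trees, sqrt(d) features tried per split, entropy criterion
    return RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                  criterion="entropy", n_jobs=-1,
                                  random_state=0).fit(X, y)

def mine_hard_examples(forest, training_videos):
    """Hypothetical helper: run the trained forest over continuous training
    input and return descriptors/labels of intervals it gets wrong
    (false positives and false negatives)."""
    raise NotImplementedError

# X: fixed-length gesture descriptors, y: class labels (placeholders)
X = np.random.rand(300, 1660)
y = np.random.randint(0, 25, size=300)
forest = train_forest(X, y)

# Iterative bootstrapping: augment the training set with hard examples and
# retrain until the number of false positives drops below the threshold.
# for _ in range(max_rounds):
#     X_hard, y_hard = mine_hard_examples(forest, training_videos)
#     if len(X_hard) <= threshold:
#         break
#     X, y = np.vstack([X, X_hard]), np.concatenate([y, y_hard])
#     forest = train_forest(X, y)
```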
B. Testing
The task during testing is to use our trained random forest
model to determine the temporal segmentation as well as
class labels of gestures in a continuous video. We performed
multi-scale sliding window classification to predict the class
labels of the gestures, as well as their start and end points.
For each input video, gesture candidates were constructed
at different temporal scales. Let f_s be the number of frames in the shortest gesture in the training set and f_l be the number of frames in the longest gesture in the training set. Then, the temporal scales ranged from length f_s to length f_l, in increments of 5 frames. Let G = {g_1, ..., g_n} be the set of
gesture candidates at different temporal scales. At each scale,
a candidate gesture gi was constructed by concatenating the
feature vectors at an interval specified by the temporal scale,
so that the dimensions of the feature vector matched those
of the gestures used to train the classification model.
Within a buffer of length larger than the longest temporal scale, a sliding window was used to construct gesture
candidates at each temporal scale. For a buffer of size b, the
number of gesture candidates at scale s_i is equal to b − s_i + 1.
We chose b to be 100 frames, which is marginally greater
than the maximum length of a gesture in the training set.
Gesture candidates generated by the sliding window within
the temporal neighborhood defined by the buffer at each
scale were classified by our trained random forest model and competed to generate a likely gesture candidate G_{s_i} at that scale. Since gesture candidates in the neighborhood of where the gesture is truly temporally located tend to be classified as the same gesture, we performed Non-Maxima Suppression to select the most likely gesture candidate. That is, for each scale s_i, b − s_i + 1 gesture candidates were generated and the one classified with the highest confidence (G_{s_i}) within a
temporal neighborhood was selected. The confidence score
is the percentage of decision trees that vote for the predicted
class. Finally, the likely gesture candidates at the various
scales competed to generate the final predicted gesture within
the buffer.
Therefore, within the buffer, the scale of the final predicted
gesture helps determine the segmentation boundaries of the
gesture, whereas its class label is that which is predicted by
the random forest classifier. The end point of the predicted
gesture was chosen to be the start point of the new buffer.
This process was then repeated until the end of the test video
was reached.
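To make the search concrete, the sketch below slides windows of every temporal scale over one buffer of per-frame features and keeps the most confident in-vocabulary candidate; the per-scale non-maxima suppression and cross-scale competition described above are collapsed into a single running maximum, and scikit-learn's averaged class probabilities serve as a proxy for the fraction-of-trees vote.

```python
import numpy as np

def window_descriptor(window, num_segments=10):
    # Fixed-length descriptor: median-pooled features of 10 temporal segments
    return np.concatenate([np.median(s, axis=0)
                           for s in np.array_split(window, num_segments, axis=0)])

def detect_in_buffer(forest, frame_features, scales, non_gesture_label=0):
    """Slide windows of every temporal scale over a buffer of per-frame
    features (buffer_len x D) and return the most confident in-vocabulary
    candidate as (label, start, end, confidence)."""
    best = (non_gesture_label, 0, 0, 0.0)
    for scale in scales:                           # e.g. range(f_s, f_l + 1, 5)
        for start in range(len(frame_features) - scale + 1):
            window = frame_features[start:start + scale]
            probs = forest.predict_proba(window_descriptor(window).reshape(1, -1))[0]
            label = forest.classes_[np.argmax(probs)]
            # sklearn averages per-tree class probabilities, a close proxy
            # for the fraction-of-trees-voting confidence described above
            if label != non_gesture_label and probs.max() > best[3]:
                best = (label, start, start + scale, probs.max())
    return best
```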
Fig. 2. RGB, Depth, and User-Mask Segmentation of a subject performing
gesture 1 ’I Have’ in the NATOPS dataset
IV. DATASETS
Here, we describe the nature of the datasets we have used
to test our gesture recognition system.
A. NATOPS
The Naval Air Training and Operating Procedures Standardization (NATOPS) gesture vocabulary comprises a set
of gestures used to communicate commands to naval aircraft
pilots by officers on an aircraft carrier deck. The NATOPS
dataset [22] consists of 24 unique aircraft handling signals,
which is a subset of the set of gestures in the NATOPS
vocabulary, performed by 20 different subjects, where each
gesture has been performed 20 times by all subjects. An
example gesture is illustrated in Figure 2. The samples were
recorded at 20 FPS using a stereo camera at a resolution of
320 x 240 pixels. The dataset includes RGB color images,
depth maps, and mask images for each frame of all videos.
A 12 dimensional vector of body features (angular joint
velocities for the right and left elbows and wrists), as well as
an 8 dimensional vector of hand features (probability values
for hand shapes for the left and right hands) collected by
Song et al. [22] was also provided for all frames of all
videos of the dataset.
B. ChaLearn
The ChaLearn dataset was provided as part of the 2014
Looking at People Gesture Recognition Challenge [23]. The
focus of the gesture recognition challenge was to create
a gesture recognition system trained on several examples
of each gesture category performed by various users. The
gesture vocabulary contains 20 unique Italian cultural and
anthropological signs. Figure 3 shows an example gesture
being performed.
The data used to train the recognition system contains
a total of 7,754 manually labeled gestures. Additionally, a validation set with 3,363 labeled gestures was provided to test the performance of the trained classifier. During the final evaluation phase, another 2,742 gestures were provided. The dataset includes RGB and depth video along with 3D skeletal joint positions for each frame.

Fig. 3. RGB, Depth, and User-Mask Segmentation of a subject performing gesture 1 'sonostufo' in the ChaLearn dataset
V. EXPERIMENTS
Here we describe the experiments performed to evaluate
our gesture recognition system on the two datasets. We used
the NATOPS dataset to evaluate our gesture classification
system in a non-continuous setting. We used a set of gesture
samples to train our gesture classifier, and tested its performance on a test-set of segmented gestures. The ChaLearn
dataset consists of training and test videos where the user
performs both in-vocabulary and out-of-vocabulary gestures,
with intervals of gestural silence or transitions. Thus, we used
the ChaLearn dataset to test the performance of our system
on continuous input.
From the NATOPS dataset, we trained our gesture recognition model with the following feature sets in order to
formulate a good feature representation:
(a) 3D skeletal joints and hand-shape based feature set
(SK+HS): This feature set [9] consists of 20 unique
features for each frame of every gesture. The
extracted features are angular joint velocities for the
right and left elbows and wrists, as well as probability
values of hand shapes for the left and right hands. Since
each gesture instance is described by a single feature
descriptor obtained by concatenating 10 representative feature vectors, the feature vector representing a gesture instance is of length 200.
(b) Appearance-based feature set (EOD): Each frame of the gesture instances is represented by a 400-dimensional feature vector, which was calculated using randomly pooled edge-orientation and edge-density features. Each gesture example is represented by a single feature vector of length 4,000.
(c) EODPCA: In this feature representation, we reduced the above 4,000-d feature space to a 200-d feature space via Principal Component Analysis (PCA).
(d) SK+HS+EODPCA: This feature set was obtained by concatenating the 200-d 3D skeletal joints and hand-shape based (SK+HS) feature descriptor of a gesture with the corresponding dimensionality-reduced edge-orientation and density (EODPCA) feature descriptor to form a 400-d feature vector for every gesture.

TABLE I
AVERAGE CLASSIFICATION ACCURACY ON ALL 24 GESTURES OF THE NATOPS DATASET

Feature set                      Average Classification Accuracy
Feature set a (SK+HS)            84.77%
Feature set b (EOD)              76.63%
Feature set c (EODPCA)           67.74%
Feature set d (SK+HS+EODPCA)     87.35%

TABLE II
PERFORMANCE COMPARISON ON PAIRS OF SIMILAR GESTURES WITH OTHER APPROACHES (THE HMM, HCRF, AND LINKED HCRF) PRESENTED BY SONG ET AL. [24]

Classifier                  Average Classification Accuracy
HMM                         77.67%
HCRF                        78.0%
Linked HCRF                 87.0%
Random Forest (ours)        88.1%
For each feature set described above, we trained random
forests with 500 trees on 19 subjects and tested on the remaining subject in a leave-one-out cross-validation approach.
We computed the average recognition accuracy (averaged
across all subjects and all gestures) of the random forest
classifier on the four different feature sets (a) - (d) of the
NATOPS dataset for all 20 test subjects each performing the
24 gestures in the vocabulary (Table I). The feature set containing 3D skeletal joints and hand-shape features (SK+HS)
is correctly classified 84.77% of the time, whereas the feature
set containing features based on edge density and orientation
is correctly classified 76.63% of the time. This suggests, in
our case, that 3D joint-based features encode more
class-discerning information than features based on edge
density and orientation. However, the highest classification
accuracy of 87.35% is achieved on the feature set that combines joint-based features with appearance-based features,
suggesting the benefit of combining the two approaches of collecting features.

Fig. 4. Some pairs of similar gestures in the NATOPS dataset
Fig. 5. Confusion Matrix for pairs of similar gestures in the NATOPS dataset
Gesture pairs (2, 3), (10, 11), and (20, 21) were often confused, with one gesture in the pair misclassified as the other (Figure 4). Figure 5
uses a confusion matrix to illustrate the misclassifications
between these pairs of similar gestures.
We compared the classification performance of our random forest classifier with the performance of other classifiers
that have been used on this dataset (Table II). Our random
forest approach on the challenging subset of similar gestures,
tested on samples from 5 subjects as specified by Song et al.
[24], yields results that exceed those produced by the state-of-the-art Linked HCRF (Table II). The graphical models
presented by Song et al. [24] were trained using feature set a
(SK+HS), whereas we use feature set d (SK+HS+EODPCA)
to train our gesture recognition model.
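The leave-one-subject-out protocol used above can be sketched with scikit-learn's LeaveOneGroupOut; the array sizes and subject grouping below are placeholders rather than the actual NATOPS splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

# X: gesture descriptors, y: gesture labels, subjects: subject ID per sample
X = np.random.rand(480, 400)                   # placeholder SK+HS+EODPCA descriptors
y = np.random.randint(0, 24, size=480)
subjects = np.repeat(np.arange(20), 24)        # 20 subjects (hypothetical split)

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    n_jobs=-1, random_state=0)
    forest.fit(X[train_idx], y[train_idx])
    accuracies.append(forest.score(X[test_idx], y[test_idx]))
print(np.mean(accuracies))                     # accuracy averaged over held-out subjects
```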
From the ChaLearn dataset, we trained our gesture recognition model with the following feature sets:
a Raw 3D skeletal joint data (RAW): Features contain
unedited raw skeleton data, that is, each frame consists
of 9 values for all 20 joints. The feature vector per frame
has 180 dimensions, and per gesture has 1800 dimensions.
b Normalized skeletal joint positions and velocities (SKPV):
This feature set contains normalized positional and velocity data for 9 joints. The feature vector per frame has 126
dimensions, and per gesture has 1260 dimensions.
c Normalized skeletal joint positions, velocities and accelerations (SKPVA): This feature set contains positional,
velocity, and acceleration data for 9 joints. The feature
vector per frame has 189 dimensions, and per gesture has
1890 dimensions.
d Appearance-based features (HOG): This feature set contains Histogram of Oriented Gradients (HOG) data for
32x32 pixel boxes around 9 joints (head, left shoulder,
left elbow, left wrist, left hand, right shoulder, right elbow,
right wrist, right hand). The feature vector per frame has
2916 dimensions, and per gesture has 29,160 dimensions.
e SK+HOGPCA: This feature set was obtained by concatenating the 1260-d feature vector of normalized skeletal
joint positions and velocities (SK) with the 400-d feature
vector of HOG data for 32x32 pixel squares around the left
and right hands whose dimensionality has been reduced by
PCA. The resultant feature vector per gesture example is
1660-d.
For each feature set described above, we trained random
forests with 500 trees on gesture instances from the training
and validation sets, and tested the performance of our classifier on the test dataset. The division of the data into training,
validation and test sets has been described earlier [23].
The feature set that combines the normalized positional
and velocity information (SKPV), with HOG features of the
hands (HOGPCA), is correctly classified 88.91%
of the time (Table III), which is the highest average classification accuracy of all feature sets.
The iterative procedure of training a random forest improves its capacity to correctly classify and segment gestures.
This is evident in the increase in average classification
accuracy in the test set (Figure 6).
TABLE III
AVERAGE CLASSIFICATION ACCURACY ON ALL 20 GESTURES OF THE CHALEARN DATASET

Feature set                   Average Classification Accuracy
Feature set a (RAW)           81.45%
Feature set b (SKPV)          88.12%
Feature set c (SKPVA)         83.50%
Feature set d (HOG)           54.65%
Feature set e (SK+HOGPCA)     88.91%

Fig. 6. Plot of number of misclassifications and Jaccard index score with number of iterations of training the classifier

TABLE IV
JACCARD INDEX SCORES ON CHALEARN GESTURE RECOGNITION CHALLENGE 2014 [23]

Method                        Jaccard Index Segmentation and Classification Score
Deep Neural Network [25]      0.84
Our Score                     0.68
Competition Baseline [23]     0.37

Table IV shows the Jaccard score of our method compared
with the baseline and winning scores of the ChaLearn
gesture recognition challenge. The competition winner, a
team from Laboratoire d’InfoRmatique en Image et Systèmes
d’information (LIRIS), used features extracted from skeleton
joints and a deep neural network classifier to achieve a
Jaccard score of 0.84 [25]. Our classifier achieves a good
recognition accuracy of 88.91% on pre-segmented gestures.
Our Jaccard score of 0.68 underlines the difficulty of achieving optimal results in classification tasks where temporal
segmentation is not provided.
VI. CONCLUSION
We have presented a random forest framework for a multi-gesture classification problem. The method consists of first
creating a uniform fixed-dimensional feature representation
of all gesture samples, and then using all training samples
to train a random forest. On a challenging subset of the
NATOPS dataset, our approach yields results comparable
to those produced by graphical models such as HCRFs.
Although a random forest classifier does not explicitly model
the inherent temporal nature of gestural data as done by
graphical models, its accuracy on this particular dataset exceeds that achieved by graphical models such as
HMMs, and different variants of HCRFs, which are presented
by Song et al. [24]. Additionally, our experiments show
that classification accuracy was improved by combining 3D
skeletal joint-based features with appearance-based features,
thus underlining the importance of a well-chosen feature set
for a classification task.
On the ChaLearn dataset, our classifier yields an average
accuracy of 88.91% when tested on a set of segmented
gestures. However, the task of simultaneously detecting
and classifying gestures is a more difficult challenge than
classifying accurately segmented gestures.
The strengths of our framework lie in its simplicity and speed, its capacity to generalize well to variations in user
size, distance to the sensor, speeds at which gestures are
performed, as well as its robustness to the effects of sensor
noise. One area of the framework that can be improved is the
process of selecting and creating better feature sets. Many
additional features, such as joint-pair distances used by Yao
et al. [26], can be experimented with in order to improve the
accuracy of our framework. Additionally, selecting a small
group of features over an interval of frames to split a node
in a decision tree, instead of selecting a single feature at a
single frame, might be better suited to the purpose of learning
complex spatio-temporal objects such as gestures.
ACKNOWLEDGMENTS
The authors would like to thank the reviewers for their
constructive feedback. This work was supported in part
by National Science Foundation grants IIS-0910908 and MRI-1337866, and US Navy contract N00014-13-P-1152.
REFERENCES
[1] J. Alon, V. Athitsos, and S. Sclaroff, “Accurate and efficient gesture
spotting via pruning and subgesture reasoning,” in Proceedings of
the 2005 IEEE International Conference on Computer Vision (ICCV
2005), 2005, pp. 189–198.
[2] H. Junker, O. Amft, P. Lukowicz, and G. Tröster, “Gesture spotting
with body-worn inertial sensors to detect user activities,” Pattern
Recognition, vol. 41, no. 6, pp. 2010–2024, 2008.
[3] M. Betke, O. Gusyatin, and M. Urinson, “Symbol design: A user-centered method to design pen-based interfaces and extend the functionality of pointer input devices,” Universal Access in the Information
Society, vol. 4, no. 3, pp. 223–236, 2006.
[4] J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff, “Simultaneous localization and recognition of dynamic hand gestures,” in Proceedings of
the 2005 IEEE Motion Workshop on Application of Computer Vision
(WACV/MOTIONS 2005), vol. 2. IEEE, 2005, pp. 254–260.
[5] H.-K. Lee and J.-H. Kim, “An HMM-based threshold model approach
for gesture recognition,” IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI 1999), vol. 21, no. 10, pp. 961–973, 1999.
[6] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio,
A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition
in parts from single depth images,” Communications of the ACM,
vol. 56, no. 1, pp. 116–124, 2013.
[7] F. Schroff, A. Criminisi, and A. Zisserman, “Object class segmentation
using random forests,” in Proceedings of the 2008 British Machine
Vision Conference (BMVC 2008), 2008, p. 10pp.
[8] A. Bosch, A. Zisserman, and X. Muñoz, “Image classification using
random forests and ferns,” in Proceedings of the 2007 IEEE International Conference on Computer Vision (ICCV 2007). IEEE, 2007,
pp. 1–8.
[9] Y. Song, D. Demirdjian, and R. Davis, “Multi-signal gesture recognition using temporal smoothing Hidden Conditional Random Fields,” in
Proceedings of the 2011 IEEE International Conference on Automatic
Face & Gesture Recognition and Workshops (FG 2011). IEEE, 2011,
pp. 388–393.
[10] Y. Song, D. Demirdjian, and R. Davis, “Continuous body and hand
gesture recognition for natural human-computer interaction,” ACM
Transactions on Interactive Intelligent Systems (TIIS 2012), vol. 2,
no. 1, p. 5, 2012.
[11] T. Starner and A. Pentland, “Real-time American Sign Language
recognition from video using Hidden Markov Models,” in Motion-Based Recognition. Springer, 1997, pp. 227–243.
[12] S. Mitra and T. Acharya, “Gesture recognition: A survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part
C: Applications and Reviews, vol. 37, no. 3, pp. 311–324, 2007.
[13] S. Malassiotis, N. Aifanti, and M. G. Strintzis, “A gesture recognition
system using 3D data,” in Proceedings of the 2002 IEEE International
Symposium on 3D Data Processing, Visualization and Transmission.
IEEE, 2002, pp. 190–193.
[14] J. Lafferty, A. McCallum, and F. Pereira, “Conditional Random
Fields: Probabilistic models for segmenting and labeling sequence
data,” Proceedings of the 2001 International Conference on Machine
Learning (ICML 2001), pp. 282–289, 2001.
[15] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell,
“Hidden Conditional Random Fields,” IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI 2007), vol. 29, no. 10, pp.
1848–1852, 2007.
[16] L. Miranda, T. Vieira, D. Martinez, T. Lewiner, A. W. Vieira, and M. F.
Campos, “Real-time gesture recognition from depth data through key
poses learning and decision forests,” in Proceedings of the 2012 IEEE
SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI
2012). IEEE, 2012, pp. 268–275.
[17] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky, “Hough
forests for object detection, tracking, and action recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence (PAMI
2011), vol. 33, no. 11, pp. 2188–2202, 2011.
[18] D. Demirdjian and C. Varri, “Recognizing events with temporal
random forests,” in Proceedings of the 2009 ACM International
Conference on Multimodal Interfaces (ICMI 2009). ACM, 2009,
pp. 293–296.
[19] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.
5–32, 2001.
[20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Proceedings of the 2005 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR 2005), vol. 1. IEEE, 2005,
pp. 886–893.
[21] J. Marin, D. Vázquez, A. M. López, J. Amores, and B. Leibe, “Random forests of local experts for pedestrian detection,” in Proceedings
of the 2013 IEEE International Conference on Computer Vision (ICCV
2013). IEEE, 2013, pp. 2592–2599.
[22] Y. Song, D. Demirdjian, and R. Davis, “Tracking body and hands for
gesture recognition: NATOPS aircraft handling signals database,” in
Proceedings of the 2011 IEEE International Conference on Automatic
Face & Gesture Recognition and Workshops (FG 2011). IEEE, 2011,
pp. 500–506.
[23] S. Escalera, X. Baró, J. Gonzalez, M. A. Bautista, M. Madadi,
M. Reyes, V. Ponce, H. J. Escalante, J. Shotton, and I. Guyon,
“ChaLearn looking at people challenge 2014: Dataset and results,”
in Proceedings of the 2014 IEEE European Conference on Computer
Vision (ECCV 2014) ChaLearn Workshop on Looking at People.
IEEE, 2014.
[24] Y. Song, L. Morency, and R. Davis, “Multi-view latent variable
discriminative models for action recognition,” in Proceedings of 2012
IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2012). IEEE, 2012, pp. 2120–2127.
[25] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout, “Multi-scale deep
learning for gesture detection and localization,” in Proceedings of the
2014 IEEE European Conference on Computer Vision (ECCV 2014)
ChaLearn Workshop on Looking at People, 2014.
[26] A. Yao, J. Gall, G. Fanelli, and L. J. Van Gool, “Does human action
recognition benefit from pose estimation?” in Proceedings of the 2011
British Machine Vision Conference (BMVC 2011), vol. 3, 2011, p. 6.