Jeff K. T. Tang and Howard Leung
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
kttang@cs.cityu.edu.hk, howard@cityu.edu.hk
A system that can retrieve logically relevant 3D captured motions is useful in game and animation production. We present a robust logical relevance metric based on the relative distances among joints. Existing methods select a universal subset of features for all kinds of queries, which may not characterize the variations in different queries well. To overcome this limitation, we propose an Adaptive Feature Selection (AFS) method that abstracts the characteristics of the query with a Linear Regression Model, so that different feature subsets can be selected according to the properties of each specific query. With a Graded Relevance Feedback (GRF) algorithm, the feature subset is refined according to the graded relevance of the feedback samples, enhancing retrieval performance. Using an Ontology that predefines the logical relevance between motion classes in terms of graded relevance, the proposed AFS-GRF algorithm is evaluated and shown to outperform other class-specific feature selection and motion retrieval methods.
Keywords: Adaptive Feature Selection, Logical Similarity, Graded Relevance, Relevance Feedback, Motion Retrieval, 3D Human Motion Capture
1. Introduction

An increasing number of 3D motion capture databases, such as (CMU, 2003) from Carnegie Mellon University and (HDM, 2005) from Hochschule der Medien, are available for public use. However, motion capture is time consuming and labor intensive, so there is a rising need to reuse existing captured data, which makes animation and game production more efficient. In real applications, captured motions (even of the same type) have large spatial and temporal variations.
This is due to the style difference among persons. Moreover, even the same person cannot perform a movement identically each time. Hence, a reliable system to retrieve logically relevant motions is desired. However, it is challenging
to represent the logical meaning of a motion and this topic is quite new in motion analysis research.
In human motion analysis research, it has been shown that numerical similarity metrics (e.g. joint position difference, joint angle difference) are insufficient to identify similar motions with style differences (Müller and Röder, 2005; Tang et al., 2008). Recent works showed that it is possible to abstract the logical meaning of motions with relational measures. For example, Müller and Röder measure whether a joint is in place with a Boolean value, while other researchers (Tang et al., 2008) consider the relative distances between joints. The relative motion between joints appears to be a robust measure in the context of logical relevance.
For high-dimensional data such as motion data, we have to select a subset of features that works best for the similarity evaluation. In motion retrieval systems, the queries most likely differ in their characteristics each time. Classical feature selection methods that select a universal feature set for all queries cannot achieve the best retrieval result for individual cases. It is challenging to design a retrieval method that can automatically determine a set of tailor-made features from the query itself, without prior knowledge of its motion class.
Relevance Feedback can be used to refine the feature selection by allowing the user to give positive and negative samples to the system. In most systems, the query is refined to move closer to the samples in the feature vector space. Some researchers instead update the weights of features so that important features are boosted; the weighting is updated according to the feedback samples, so that the next result conforms better to the user's intent. If the binary relevance becomes graded (i.e. "highly relevant", "somewhat relevant", "not so relevant", etc.), more care is required in designing the updating method for the relevance feedback.
This article makes three major contributions. First, we propose using the Variance of Joint Relative Distance (VJRD) to characterize human motions in terms of logical relevance. Second, we propose an Adaptive Feature Selection (AFS) method that abstracts the characteristics of the query and determines a feature subset specific to that query. Third, we propose a Graded Relevance Feedback (GRF) method that makes use of feedback samples with known graded relevance to refine the feature subset and enhance the retrieval performance.
The article is organized as follows. Section 2 describes related work. We give an overview of our proposed method in Section 3, and the details of our proposed Adaptive Feature Selection with Graded Relevance Feedback (AFS-GRF) method in Section 4. We evaluate the performance of the proposed methods against existing methods in Section 5. Finally, we conclude with a discussion of future directions in Section 6.
2. Related Work

In general, information retrieval (IR) can be content-based or semantic-based. In content-based IR, the features do not carry semantic meaning. Semantic-based IR, on the other hand, has been widely adopted in semantic web searching.
In semantic-based IR, there are basically two types: keyword-based and ontology-based. For keyword-based semantic IR, the user defines keywords that describe each data entry; a vector space model describes the features with weights, and a more discriminative keyword receives a higher weight. In Ontology-based IR systems, relationships among different "concepts" can be defined by a high-level ontology (Hotho et al., 2001). In web/document retrieval, the semantic ontology of the keywords is predefined such that the similarity between the query and the data is given by the weighted sum of the semantics (Gruber, 1993; Guarino, 1995; Breitman et al., 2006). This approach can also be used to organize 3D captured motions (Chung et al., 2005) and possibly assign semantic meanings to content in retrieval systems (Zhang, 2008). It is thus possible to apply the Ontology approach to organize motion capture data according to their logical relevance for a retrieval system.
In content-based motion retrieval systems, a logical similarity metric is useful for retrieving motions of similar semantic meaning but different style, timing, etc. There are only a few related works because this field is still new. The Geometric (or Boolean) feature proposed by (Müller and Röder, 2005) is apparently the first robust logical relevance metric. It considers point-to-plane relationships such that a binary value indicates whether the condition is true or false. However, it produces a large feature dimension, and a subset of good features has to be selected at different times. The Joint Relative Distance (JRD) proposed by (Tang et al., 2008) is another logically relevant metric that considers only the relations between joints, with a much smaller set of features.
Traditionally, the performance of a retrieval system is evaluated by binary relevance (either relevant or irrelevant). In recent years, graded relevance has become a popular choice for evaluating retrieval performance. Graded relevance is represented by levels such as "highly relevant", "somewhat relevant", "irrelevant", etc., instead of just a Boolean value (relevant/irrelevant). Graded relevance measures are mainly derived from the Cumulative Gain (CG). The gain is a score awarded at a particular rank for the retrieval of a relevant item; in a graded relevance measure, different gain values can be assigned for different levels of relevance, with a larger gain for a more relevant retrieved item. (Sakai, 2003) proposed the Average Gain Ratio (AGR), which considers the ratio of the CGs of the retrieved ranking to the ideal ranking. (Sakai, 2007) surveyed and compared various graded relevance measures including Cumulative Gain (CG), normalized Cumulative Gain (nCG), Discounted Cumulative Gain (DCG), normalized Discounted Cumulative Gain (nDCG), Average normalized Discounted Cumulative Gain (AnDCG), and the Q-measure (Sakai, 2004). AnDCG and the Q-measure are the best choices because they discount the gains of late-arriving relevant documents. In our experiment, the graded relevance measure is the Average normalized Discounted Cumulative Gain (AnDCG) (Sakai, 2007) among motion classes, and the performance of our proposed motion retrieval system using AFS-GRF is evaluated through this measure.
Feature selection chooses important features (or filters out non-useful ones) in order to obtain better performance. Sequential Forward/Backward Selection (SFS/SBS) has been commonly used (Guyon and Elisseeff, 2003). However, these are brute-force methods that maximize IR performance by selecting/eliminating features to form new subsets. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) can reduce the feature dimension by choosing the feature subset with the highest projection energies, i.e. the most discriminative ones. PCA and LDA have been applied to model class-specific characteristics in pattern classification (Sharma et al., 2006; Liu et al., 2003) and face recognition (Kim et al., 2003) respectively.
More specific to human motion retrieval, a Boolean Motion Template can be built by averaging the characteristics of all samples in the same logical class (Müller et al., 2006; Baak et al., 2008). Furthermore, some researchers divided the human body into parts and built separate indices for these body parts (Deng et al., 2009). These methods will be compared with our proposed method.
Researchers have applied relevance feedback to enhance the performance of IR systems. A classical way is to update the current query in the feature vector space so that it moves closer to the positive (relevant) samples and further from the negative (irrelevant) samples (Rocchio, 1971). Blind/pseudo relevance feedback can automatically expand the query by assuming the top-N ranks are relevant (Buckley et al., 1995). Others suggested updating the feature weights: one method maximizes the variations of features in the database over those in the feedback samples and reweights the features accordingly (Aksoy et al., 2000). Similarly, a subset of the most significant features can be selected through relevance feedback, which is useful for IR systems dealing with high-dimensional data such as images (Tusk et al., 2003). Most existing relevance feedback methods consider binary relevance (i.e. positive/negative feedback); only a few researchers have considered graded relevance in retrieval systems (Sakai, 2007; Keskustalo, 2008; Järvelin, 2009). They considered a blind relevance feedback method in which the query is expanded through the relative average term frequency (RATF), which depends on the graded ranks.
3. Overview of the Proposed Method

3.1. Overview
Our proposed method consists of three parts, and the flow is shown in Figure 1. First, we preprocess the data by extracting useful features and organizing them according to their logical relevance. Next, we train a Linear Regression Model relating the feature vectors of all samples to their optimal feature subsets. In the motion retrieval phase, the Adaptive Feature Selection (AFS) selects an initial subset of features automatically based on the input query. Through the Graded Relevance Feedback (GRF), a number of retrieved samples with known relevance are fed back to the system to refine the initial feature set, and hence better retrieval performance can be achieved. In this section, we explain the preprocessing step and highlight the motion retrieval and relevance feedback steps; the details of the proposed AFS-GRF method are presented in the next section.
Fig. 1. The flow of the proposed Adaptive Feature Selection and Graded Relevance Feedback algorithm.
3.2. Data Collection and Representation
Our experimental data contains 1099 single-person movements. 90% of these motions were captured by ourselves (CityU, 2011); the remaining 10% are adopted from the well-known public motion capture database provided by Carnegie Mellon University (CMU, 2003). We converted our captured motion clips into the standard Biovision Hierarchy (BVH) format, in which the body hierarchy is well-defined; the BVH version of the CMU motion data is available at (CGSPEED, 2008). We selected the "walking", "running" and "jumping" motions from the public database because their sample sizes are relatively large. These three classes are common to the dataset used in existing work by (Deng et al., 2009), making our performance evaluation more objective.
Since the CMU and CityU datasets were captured with different settings, their numbers of joints differ: the hierarchy in a CMU motion contains 31 joints, whereas the hierarchy in a CityU motion contains 21 joints. Hence, we generalized the human representation as shown in Figure 2, considering the 15 joints common to both sets: (1) Head, (2) Neck, (3) Root, (4) Right Shoulder, (5) Right Elbow, (6) Right Hand, (7) Left Shoulder, (8) Left Elbow, (9) Left Hand, (10) Right Hip, (11) Right Knee, (12) Right Foot, (13) Left Hip, (14) Left Knee, and (15) Left Foot. Our method can easily accommodate motion data captured by other laboratories under different conditions.
Fig. 2. The representation of the human body in motion capture data.
In our motion retrieval system, we assume that each sample is a primitive move, i.e. a self-complete movement. Hence, we first segmented the motion clips into primitive moves automatically, simply cutting at the postures where the joint accelerations are not significant (Shum et al., 2007). This approach is simple but has proven effective in existing work.
3.3. VJRD Feature Extraction
In each segmented primitive move, we extract the Joint Relative Distance (JRD), calculated as the pair-wise Euclidean distance (i.e. L2-norm) between any two joints (Tang et al., 2008). The variance of each JRD over the duration of the movement is used to characterize the motion, and we name it VJRD. Physically, the variance measures the extent to which each JRD deviates from its mean. Intuitively, motions with similar logical meaning (even in different styles) have some joints moving in a similar way, which is reflected in the VJRD values. A zero VJRD indicates a pair of joints that are relatively static to each other, while a non-zero VJRD indicates a pair of joints with substantial relative movement.
There are 15 joints in our human model, so we have 15×14/2 = 105 joint pairs. Some rigid pairs (i.e. on the same bone or on the torso) are filtered out because their JRDs are almost unchanged all the time. Hence the total number of joint pairs we used becomes 105 pairs – 14 (bone pairs) – 15 (torso pairs) = 76.
The formulation of VJRD is as follows. Let M_A = \{P_{A1}, P_{A2}, \ldots, P_{AT}\} be a primitive move A of T frames (postures), where each posture contains N_p = 76 joint pairs. The JRD of the p-th joint pair (1 \le p \le N_p) at the t-th posture (1 \le t \le T) is calculated as the L2-norm between the two joints J_i and J_j at frame t:

    JRD(t, p) = d_{L2}(J_i(t), J_j(t))    (1)

A normalization is applied to each JRD(t, p) to scale it into a common range [0, 1], making the measure robust to different body sizes. We denote the normalized JRD_A(t, p) by nJRD_A(t, p), and its mean over the T frames by \overline{nJRD}_A(p). The VJRD of the p-th joint pair is formulated as:

    VJRD_A(p) = \frac{1}{T} \sum_{t=1}^{T} \left( nJRD_A(t, p) - \overline{nJRD}_A(p) \right)^2    (2)
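As a concrete illustration, Equations (1)-(2) can be sketched in Python. The (T, J, 3) array layout and the max-based normalization into [0, 1] are assumptions made for this sketch; the paper only states that the JRDs are scaled into a common range.

```python
import numpy as np

def vjrd(motion, joint_pairs):
    """Variance of Joint Relative Distance (VJRD) per joint pair.

    motion: (T, J, 3) array of 3D joint positions over T frames (assumed layout).
    joint_pairs: list of (i, j) joint index pairs (76 non-rigid pairs in the paper).
    """
    # Eq. (1): per-frame L2-norm between the two joints of each pair -> (T, P)
    jrd = np.stack([np.linalg.norm(motion[:, i] - motion[:, j], axis=1)
                    for i, j in joint_pairs], axis=1)
    njrd = jrd / jrd.max()        # scale into [0, 1] (normalization scheme assumed)
    return njrd.var(axis=0)       # Eq. (2): variance over the T frames
```

A pair of joints that stay at a fixed distance yields a zero VJRD, while a pair with relative movement yields a positive value, matching the intuition above.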
3.4. Data Organization
According to the logical relevance, we organized the motion classes in an Ontology hierarchy as shown in Figure 3. It is a tangled tree of motion classes linked by an "include" relationship, i.e. a lower-level motion class inherits the characteristics of higher-level classes. For example, a "Left Punch" is not only a "Punch" but also a "Left hand movement". There are 36 leaf-node classes, which contain the finest details and are regarded as distinct classes. Table 1 shows the types and numbers of motions in our data. We use the Ontology to determine the graded relevance among motion classes.
Fig. 3. The Ontological hierarchy of our motion dataset.
Table 1. The types/classes of motions and their numbers of samples.

Motion Type   Motion Classes                                               Number of samples
Fighting      Left Straight Punch, Right Straight Punch, Left Hook Punch,  318
              Right Hook Punch, Left Upper Kick, Right Upper Kick,
              Left Lower Kick, Right Lower Kick
Dancing       20 kinds of A-go-go dance moves                              300
Sport         Basketball motions (Shooting, Dribbling and Defending)       189
Locomotion    Walking, Running and Jumping                                 110
Others        Random movements, Reacting moves and Head moves              182
Total                                                                      1099
3.5. Graded Relevance among Motion Classes
We determined the relevance between any two leaf-node classes by a graded relevance scheme, which considers the degree of relevance between two motions rather than a hard relevant/irrelevant decision. From the Ontology graph in Figure 3, we observe that the closer two classes are, the higher their relevance. Hence, we formed a matrix of graded relevance among the 36 classes. Figure 4 shows a simplified version considering only the 8 classes of fighting motions at Level 0. Each entry represents the grade g(C_i, C_j) (also known as the "Gain") of the two classes C_i and C_j, determined by

    g(C_i, C_j) = d_{node\_max} - d_{node}(C_i, C_j) + 1,

where d_{node}(C_i, C_j) is the node distance between the two classes and d_{node\_max} is the maximum distance among all classes. A range of values (from 1 to 7) represents the relevance in ascending order, e.g. 1 for "irrelevant" classes, 5 for "somewhat relevant" classes, and 7 for the most relevant classes.
Class  1  2  3  4  5  6  7  8
  1    7  5  3  3  1  1  1  1
  2    5  7  3  3  1  1  1  1
  3    3  3  7  5  1  1  1  1
  4    3  3  5  7  1  1  1  1
  5    1  1  1  1  7  5  3  3
  6    1  1  1  1  5  7  3  3
  7    1  1  1  1  3  3  7  5
  8    1  1  1  1  3  3  5  7

Fig. 4. The Graded Relevance Matrix of 8 fighting motion classes.
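The gain formula above can be sketched directly from pairwise node distances in the ontology; the distance values in the usage example are illustrative and not taken from the paper's actual ontology.

```python
def gain_matrix(node_dist):
    """Graded relevance (gain) matrix from ontology node distances.

    node_dist[i][j] is the node distance d_node(Ci, Cj) between classes i and j.
    Applies g(Ci, Cj) = d_node_max - d_node(Ci, Cj) + 1, so the closest pair
    (distance 0, i.e. a class with itself) gets the largest gain and the most
    distant pair gets gain 1.
    """
    d_max = max(max(row) for row in node_dist)
    return [[d_max - d + 1 for d in row] for row in node_dist]
```

For two classes at node distance 2 from each other (an illustrative value), `gain_matrix([[0, 2], [2, 0]])` yields `[[3, 1], [1, 3]]`.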
3.6. Motion Retrieval Method
We now present how to compute the dissimilarity between two primitive moves M_A and M_B, given in Equation (3). We resample them to the same number of frames and calculate the Euclidean distance over each selected dimension p, where N_p is the size of the selected feature set. Here, w_p is the weight for feature p; it is initially set to 1/N_p but is updated during the relevance feedback described in Section 4.2. If the moves M_A and M_B are similar, the length-ratio factor f_LR is close to 1 and the distance terms are small, so d(M_A, M_B) is small; otherwise the value is much larger.
    d(M_A, M_B) = f_{LR} \sum_{p=1}^{N_p} w_p \sum_{t=1}^{T} d_{L2}\left( JRD_A(t, p), JRD_B(t, p) \right)    (3)
Sometimes the numbers of frames of M_A and M_B are very different, which biases the dissimilarity measure. To overcome this, the dissimilarity score is multiplied by a length-ratio factor f_{LR} = max\_length(M_A, M_B) / min\_length(M_A, M_B). In our data, the length ratio between the motion samples in each class and the class mean is about 1.67. This factor therefore penalizes the score when the length of a sample is very different from that of the query, without outweighing the original measurement.
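A sketch of the dissimilarity measure of Equation (3) with the length-ratio factor. The (T, P) matrix layout for the resampled, per-pair JRD values is an assumption, and the per-frame L2-norm of a single scalar feature reduces to an absolute difference.

```python
import numpy as np

def dissimilarity(jrd_a, jrd_b, weights, len_a, len_b):
    """Eq. (3) sketch.

    jrd_a, jrd_b: (T, P) JRD matrices after resampling both moves to T frames.
    weights:      per-feature weights w_p (initially 1/N_p).
    len_a, len_b: original frame counts, used for the length-ratio factor f_LR.
    """
    f_lr = max(len_a, len_b) / min(len_a, len_b)      # length-ratio penalty
    per_feature = np.abs(jrd_a - jrd_b).sum(axis=0)   # sum of per-frame distances
    return f_lr * (weights * per_feature).sum()
```

Identical moves of equal length give a score of 0; a large length mismatch inflates the score through f_LR without replacing the distance term.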
In this work, we consider an example-based motion retrieval method. Given a primitive move as the query Q, we search the moves \{D_i \mid 1 \le i \le I\} in the database, where I is the size of the database, and rank them in descending relevance, i.e. ascending dissimilarity d(Q, D_i) given by Equation (3). With the initial set of features selected by AFS, the selected features are assumed to be equally important, so the same weight w_p is assigned to each. When Relevance Feedback is applied, the importance of the features in the subset is examined to refine the weights.
3.7. Performance Evaluation with Graded Relevance
We evaluated the performance of retrieval systems under the graded relevance scheme by AnDCG, one of the common measures considered by researchers working on information retrieval with graded relevance.
The Gain g(i) is the graded relevance value determined in Section 3.5, where i is the rank of a retrieved motion. The Discounted Cumulative Gain (DCG) is determined by summing the discounted gains dg(i), which is the gain g(i) normalized by the term \log_a i, where a is the minimum rank at which discounting applies. Let r be the number of retrieved items. DCG is given by Equation (4):

    dcg(r) = \sum_{i=1}^{r} dg(i) = g(1) + \sum_{i=2}^{r} \frac{g(i)}{\log_a i}    (4)
The Average normalized Discounted Cumulative Gain (AnDCG) is the average of the ratios dcg(r)/dcg_I(r), where l is the size of the ranked output, dcg(r) is the DCG of the retrieved ranking, and dcg_I(r) is the DCG of the ideal ranking, i.e. the gain values sorted in descending order. AnDCG is given by Equation (5). It ranges from 0 to 1 and is a kind of average precision measure; the higher the AnDCG value, the better the retrieval performance.

    AnDCG(l) = \frac{1}{l} \sum_{r=1}^{l} \frac{dcg(r)}{dcg_I(r)}    (5)
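Equations (4)-(5) can be sketched as follows, assuming discounting starts at rank a = 2 (a common choice; the paper leaves a unspecified):

```python
import math

def dcg(gains, a=2):
    """Eq. (4): g(1) plus discounted gains g(i)/log_a(i) for ranks i >= 2."""
    total = gains[0]
    for i, g in enumerate(gains[1:], start=2):
        total += g / math.log(i, a)
    return total

def andcg(gains, ideal_gains):
    """Eq. (5): mean over cutoffs r of dcg(r) / dcg_I(r)."""
    l = len(gains)
    return sum(dcg(gains[:r]) / dcg(ideal_gains[:r]) for r in range(1, l + 1)) / l
```

A perfect ranking (retrieved gains already in descending order) scores exactly 1; swapping relevant items toward late ranks lowers the score, which is the discounting behavior that motivates choosing AnDCG.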
4. Adaptive Feature Selection with Graded Relevance Feedback (AFS-GRF)

The motion retrieval result can be enhanced by selecting a subset of features that works best for ranking the samples in the database in descending logical relevance. Our proposed Adaptive Feature Selection (AFS) estimates an initial subset of features that characterizes the query motion well. This initial feature subset is then refined by Graded Relevance Feedback (GRF). The details of each part of the algorithm are discussed in the following subsections.
4.1. Adaptive Feature Selection (AFS)
Our hypothesis is that there exists a feature subset that gives optimal retrieval performance, that this subset may vary across queries, and that it can be abstracted from a query without knowing its motion class. A Regression Model is proposed here to explore the relations among the motion features and decide whether each feature should be selected. The AFS consists of two steps: training a Regression Model, and applying the Regression Model for motion retrieval. Before the training step, we first obtain the ground-truth feature subset for each class using class-specific sequential backward selection.
4.1.1. Ground-truth Class-specific Feature Subset Estimation
It is tedious to select the optimal feature subset for each motion class manually. Hence, we apply a classic sequential feature selection method. As suggested by (Guyon and Elisseeff, 2003), Sequential Backward Selection (SBS) can select a combination of features that gives good retrieval performance. To select the ground-truth feature subset for each class, we apply SBS to each class separately; a single SBS is thus repeated 36 times, and this scheme is called class-specific SBS (C-SBS).

Figure 5 illustrates how SBS works on each class of queries. Suppose there are k features in the data. There are k candidate subsets, each with k-1 features, formed by removing one feature. The performance of each subset is computed and the subset with the best performance is chosen; the excluded feature m is added to the removal list. This process is repeated on the current subset until only one feature is left. The SBS is repeated for each of the 36 classes. Although the performance can be very good, it is a brute-force approach and computationally expensive, so it is invoked only once to determine the ground-truth feature subset for each class. During retrieval, we apply our proposed Adaptive Feature Selection (AFS) method to determine the feature subset from each input query automatically.
Fig. 5. The Sequential Backward Selection (SBS).
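A minimal Python sketch of the SBS loop described above, assuming a caller-supplied `perf` function that scores a feature subset (e.g. by AnDCG); all names are illustrative.

```python
def sbs(features, perf):
    """Sequential Backward Selection sketch.

    At each step, drop the feature whose removal hurts performance least,
    and remember the best-scoring subset seen along the way.
    """
    current = list(features)
    best_subset, best_score = current[:], perf(current)
    while len(current) > 1:
        # k candidate subsets, each with one feature removed
        candidates = [[f for f in current if f != m] for m in current]
        current = max(candidates, key=perf)
        score = perf(current)
        if score >= best_score:
            best_subset, best_score = current[:], score
    return best_subset
```

Running this once per class (C-SBS) yields the ground-truth subsets used to train the Regression Model; the cost is the brute-force O(k^2) performance evaluations noted in the text.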
A flag vector z denotes which features have been selected. Consider a scenario with the full set of four features F = \{f_1, f_2, f_3, f_4\} and optimal feature subset F_O = \{f_3, f_4\}; then z = [-1\ -1\ +1\ +1], where the i-th element is +1 (or -1) if the i-th feature is present (or absent) in the optimal subset. In our experiment, the flag vector z of each sample has dimension 1 \times 76. If there are N_{tr} training samples, the flag vectors are stacked to form an N_{tr} \times 76 flag matrix Y, which is used to train the Regression Model.
4.1.2. Training a Regression Model
A Regression Model B is trained by relating the VJRD features to the flag vectors resulting from C-SBS. A simplified example illustrates the idea. Assume there are four features in total and three training samples in the dataset. The feature vectors are aggregated into a 3 \times 5 feature matrix

    A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\ a_{31} & a_{32} & a_{33} & a_{34} & a_{35} \end{bmatrix}.

The term a_{ij} represents feature j of training sample i, with 1 \le i \le 3 and 1 \le j \le 4; these features are the independent variables of the Regression Model. The term a_{i5} is set to the constant 1 for 1 \le i \le 3 and serves as the intercept.
Furthermore, we have a 3 \times 4 flag matrix

    Y = \begin{bmatrix} -1 & -1 & +1 & +1 \\ +1 & -1 & +1 & -1 \\ +1 & -1 & -1 & -1 \end{bmatrix}

denoting which features should be selected for each training sample (i.e., features 3 and 4 for training sample 1; features 1 and 3 for training sample 2; feature 1 for training sample 3). With the Regression Model, the flag vector z_1 = [Y_{11}\ Y_{12}\ Y_{13}\ Y_{14}] of training sample 1 is represented as the set of linear equations in (6), where the \beta_{ij} are unknown coefficients to be solved.
    Y_{11} = a_{11}\beta_{11} + a_{12}\beta_{21} + a_{13}\beta_{31} + a_{14}\beta_{41} + \beta_{51} = -1
    Y_{12} = a_{11}\beta_{12} + a_{12}\beta_{22} + a_{13}\beta_{32} + a_{14}\beta_{42} + \beta_{52} = -1
    Y_{13} = a_{11}\beta_{13} + a_{12}\beta_{23} + a_{13}\beta_{33} + a_{14}\beta_{43} + \beta_{53} = +1
    Y_{14} = a_{11}\beta_{14} + a_{12}\beta_{24} + a_{13}\beta_{34} + a_{14}\beta_{44} + \beta_{54} = +1    (6)
If we also consider z_2 and z_3 in a similar manner for training samples 2 and 3, we have a total of 12 equations and 20 model coefficients \beta_{ij}.
A Model Matrix

    B = \begin{bmatrix} \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\ \beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\ \beta_{31} & \beta_{32} & \beta_{33} & \beta_{34} \\ \beta_{41} & \beta_{42} & \beta_{43} & \beta_{44} \\ \beta_{51} & \beta_{52} & \beta_{53} & \beta_{54} \end{bmatrix}

is hence formed, and these equations can be written in matrix form as Equation (7):

    Y = A B    (7)
In our experiment, each training set has about 500 samples and each sample has 76 VJRD features. Since the number of samples N_{tr} is always higher than the number of features N_F, we solve the system of linear equations by minimizing the Euclidean norm between AB and Y using a pseudo-inverse method (Roger, 1956). The pseudo-inverse of A is denoted by A^+ = (A^T A)^{-1} A^T. Hence, the coefficient matrix B of the Regression Model is obtained by Equation (8):

    B = A^+ Y    (8)
4.1.3. Applying Regression Model for Motion Retrieval
An example-based retrieval system is presented: a motion sample is input as the query in order to retrieve relevant motions. With the trained Regression Model B, the significant features of the query can be determined. The following example illustrates how to select the feature subset from a query. Let Q be the vector containing the features of the input motion plus a constant intercept term of 1, and let z_Q be the flag vector of the feature set. Substituting Q for A and using the trained Regression Model B in Equation (7), z_Q is obtained by Equation (9).
    z_Q = Q B = \begin{bmatrix} q_1 & q_2 & q_3 & q_4 & 1 \end{bmatrix} \begin{bmatrix} \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\ \beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\ \beta_{31} & \beta_{32} & \beta_{33} & \beta_{34} \\ \beta_{41} & \beta_{42} & \beta_{43} & \beta_{44} \\ \beta_{51} & \beta_{52} & \beta_{53} & \beta_{54} \end{bmatrix}    (9)

z_Q is a vector of real numbers that decides whether each feature is selected. The decision threshold is set to 0, i.e. the selection depends on the sign of each entry of z_Q. For example, if z_Q is equal to [-1.02, +2.68, +0.1, -4.22], the 2nd and 3rd entries are positive, so features 2 and 3 are selected for similarity comparison in retrieving motions.
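The selection rule of Equation (9) can be sketched as:

```python
import numpy as np

def select_features(query_features, B):
    """Eq. (9) sketch: z_Q = [q, 1] B; feature p is selected when z_Q[p] > 0."""
    q = np.append(query_features, 1.0)   # append the constant intercept term
    z_q = q @ B
    return np.flatnonzero(z_q > 0)       # indices of selected features
```

The returned index set then drives the dissimilarity computation of Equation (3), with equal initial weights over the selected features.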
4.2. Refining Retrieval with Graded Relevance Feedback (GRF)
In our proposed relevance feedback method, the top-N ranked samples in the current retrieval result are fed back into the system. In a real scenario with an actual user, the graded relevance between the query and each feedback sample would be tagged by the user via the retrieval system. In this experiment, the graded relevance of the feedback samples is assumed to be known, which allows us to evaluate the performance of relevance feedback. The feature subset is refined iteratively, and the retrieval result moves closer to the user's intent.
In relevance feedback research, some researchers update the query so that it moves closer to the feedback samples in the feature space; others update the feature weights so that more significant features receive higher weights. In our approach, we do not update the query but instead introduce an offset vector \Delta z to update the flag vector z_Q. The offset is determined by the graded relevance and the dissimilarities between the feedback samples and the query. To determine \Delta z, we consider the contribution of each feature: the gains (i.e. the graded relevance) of the top-N ranked samples show how well each feature ranks the feedback samples. A feature with a high accumulated gain performs better, so the corresponding entry in \Delta z is increased, and vice versa. In particular, the total feature set is divided into two parts: the selected subset (the currently selected features) and the unselected subset (the remaining features). In each new iteration, we suppress the worst-performing feature i in the selected set and boost the best-performing feature j in the unselected set. This is done by assigning an offset value \delta (0.3 in our experiment): the i-th element \Delta z(i) is set to -\delta, while the j-th element \Delta z(j) is set to +\delta.
However, if \Delta z is updated based only on the gains of the feedback samples, the algorithm fails in the marginal case of identical gains, i.e. when the gains of all feedback samples are equal (say 3). This is more likely to happen when fewer feedback samples are provided. To avoid this problem, we also consider the Mean Accumulated Dissimilarity (MAD) of each feature, the average of the dissimilarity values between the query and each feedback sample with gain g(i). Let S_i(p, j) be the p-th feature of the j-th feedback sample with gain i; the MAD is given by Equation (10):

    MAD_i(p) = \frac{1}{|S_i|} \sum_{j=1}^{|S_i|} d\left( Q(p), S_i(p, j) \right)    (10)
The gain g(i) is now weighted by the corresponding MAD_i such that g'(i) = g(i) \cdot MAD_i, which is used to calculate the AnDCG, i.e. the performance of each feature. The weight of each feature w_p is given by normalizing the MAD value. After the above process, an offset vector \Delta z and the feature weights w_p are determined, and a new flag vector is computed as z'_Q = z_Q + \Delta z. In calculating the dissimilarity measure between the query and each sample, the selected features are further weighted by w_p. In the marginal case where all feedback samples have identical gains, the system considers the performance of each feature by using MAD only.
5. Performance Evaluation

The experimental settings are provided in the first part of this section. Our proposed approach is then evaluated in several aspects: the proposed VJRD features are compared with other similarity metrics; the proposed Adaptive Feature Selection (AFS) method is compared with other feature selection methods; the performance of our proposed motion retrieval method is compared with existing retrieval methods; and finally, the performance of the proposed Graded Relevance Feedback (GRF) is shown.
5.1. Experimental Settings
In our proposed method, the performance of different motion retrieval approaches is evaluated by a graded relevance scheme. To verify the robustness of the systems, cross-validation is used: the data is divided into four equal parts, with the data samples in each part selected randomly. In each trial, 2 out of the 4 parts, i.e., 50% of the data, are used for training, and the remaining 50% is used for testing. Hence, there are $\binom{4}{2} = 6$ trials.
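The six train/test splits follow directly from choosing 2 of the 4 parts, for instance:

```python
from itertools import combinations

parts = ["P1", "P2", "P3", "P4"]           # the four random partitions
trials = list(combinations(parts, 2))      # each trial trains on 2 parts
print(len(trials))                         # C(4, 2) = 6 trials
```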
5.2. Evaluation of the Variance of Joint Relative Distance (VJRD)
The Variance of Joint Relative Distance (VJRD) is compared with two other motion similarity measures: the joint angle difference and Boolean features. The joint angle difference is a classical measure that has long been adopted in many early works for evaluating motion similarity (Wang et al., 2008; Tam et al., 2007). Boolean features (Müller et al., 2005), also known as Geometric features, describe the relations between body parts (a joint or a plane of joints). The value of a Boolean feature is either in place (the feature value is 1) or not in place (the feature value is 0). The 39 Boolean features suggested by (Müller and Röder, 2006) are used in the experiment. The properties of the joint angle difference and Boolean features are shown in Table 2. In the table, θ_i^A and θ_i^B are the Euler angles of motions A and B about a particular joint axis, and F_i^A and F_i^B represent the i-th Boolean features of motions A and B, respectively.
The frame correspondence between the two motions is required when using the joint angle difference or Boolean features on motions of different durations; it can be obtained by uniform scaling or dynamic time warping. In contrast, our proposed VJRD features are computed over all frames such that uniform scaling or time warping is not required; hence VJRD is more efficient.
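The efficiency argument can be made concrete with a sketch of the feature computation (our own Python illustration, assuming per-frame 3D joint positions; the exact feature set used in the paper may differ):

```python
import numpy as np

def vjrd(motion):
    """Variance of Joint Relative Distance (VJRD), sketched from the text:
    for each joint pair, take the Euclidean distance between the two
    joints in every frame, then the variance of that distance over the
    whole motion.  No frame alignment is needed because the variance is
    taken over time, so motions of different lengths remain comparable.

    motion : array of shape (frames, joints, 3) of joint positions
    """
    _, joints, _ = motion.shape
    features = []
    for a in range(joints):
        for b in range(a + 1, joints):
            dist = np.linalg.norm(motion[:, a] - motion[:, b], axis=1)
            features.append(np.var(dist))   # one feature per joint pair
    return np.array(features)
```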
Table 2. Properties of other distance measures.

| | Joint angle difference | Boolean feature |
| --- | --- | --- |
| Range of feature values | Normalized Euler angles, −180° < θ < 180° | 1 = in place, 0 = not in place |
| Feature dimension | 15 joints × 3 dimensions = 45 | 39 |
| Distance measure per frame | Absolute distance $d = \sum_i \lvert \theta_i^A - \theta_i^B \rvert$ | Manhattan distance $d = \sum_i \lvert F_i^A - F_i^B \rvert$ |
The performances in terms of AnDCG under C-SBS feature selection for the aforesaid features are plotted against the dimension of the reduced feature subset in Figure 6 (to focus on our argument, the x-axis is clipped to show only up to 35 selected features).
Fig. 6. Performance comparison among various features (AnDCG vs. dimension of reduced feature subset; curves: Variance of JRD, Boolean feature, joint angle difference).
The performance of retrieval using the joint angle difference is the worst among all, as it only accounts for the numerical difference. On the other hand, both Boolean features and VJRD achieve better performances (max. 0.8502 and 0.8909, respectively). Boolean features perform better than the joint angle difference because they consider the Geometric (semantic) meaning of the movement rather than the exact numerical similarity values. However, a binary semantic can only show two states of a relation, i.e., either yes (the joints are in place) or no (they are not). Our proposed VJRD features consider the statistical variations of the joint relations, which are more informative. This explains why VJRD outperforms the Boolean features in this experiment.
5.3. Evaluation of Adaptive Feature Selection (AFS)
The effectiveness of the feature selection directly affects the retrieval performance. Dimension reduction methods such as PCA (or LDA) are classic methods to select a feature subset such that the similarity (or distinctness) can be enhanced. In this experiment, traditional dimension reduction methods (PCA/LDA), the brute-force method (C-SBS), and our proposed Adaptive Feature Selection (AFS) method are compared; the result is shown in Figure 7, where the performances in terms of AnDCG are plotted against the dimension of the reduced feature set. The mean performance of our proposed AFS method over all classes is the second highest among all tested methods and is the closest to that of C-SBS, which is regarded as the ground-truth result.
Fig. 7. Performance comparison with various feature selection methods (AnDCG vs. dimension of feature subset; curves: C-SBS, proposed AFS, C-LDA, PCA, C-PCA).
Among the dimension reduction methods, both PCA and Class-specific PCA are shown to be unsuitable for feature selection, since PCA projects similar data together and hence has weak discriminative power. On the other hand, Class-specific LDA works better than the PCA methods, but when a new class of data is present the system has to be retrained each time, so it is not scalable to a large dataset. The performance of our proposed method is very close to that of C-SBS; however, our method is far more efficient than the brute-force C-SBS since it is much faster in obtaining the feature selection labels. More specifically, the brute-force method takes 1.22 minutes per query, whereas our proposed method takes just 1.09 seconds per query. Moreover, there is no need to train the system separately for each class.
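The cost gap is explained by the structure of the brute-force baseline; a generic Sequential Backward Selection sketch (our illustration, with a caller-supplied `evaluate` function standing in for AnDCG scoring on training data) shows why it requires many evaluations:

```python
def sbs(features, evaluate, target_dim):
    """Sequential Backward Selection, a sketch of the brute-force
    C-SBS baseline: start from the full feature set and repeatedly
    drop the feature whose removal hurts the score the least.
    The nested loop means evaluate() is called O(n^2) times for n
    features, which is what makes C-SBS slow compared with AFS.
    """
    subset = list(features)
    while len(subset) > target_dim:
        # try removing each remaining feature; keep the best result
        scored = []
        for f in subset:
            trial = [x for x in subset if x != f]
            scored.append((evaluate(trial), trial))
        _, subset = max(scored, key=lambda t: t[0])
    return subset
```

For example, with `evaluate=sum`, `sbs([1, 2, 3, 4], sum, 2)` drops the smallest-valued features first.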
5.4. Evaluation of Motion Retrieval scheme
Our method is compared with other retrieval methods, i.e., the Boolean Motion Template method (Müller and Röder, 2006; Baak et al., 2008) and the Hierarchical method (Deng et al., 2009). Instead of Precision-Recall curves, their performances are compared in terms of the graded relevance measure AnDCG; a higher value represents a better overall ranking. The retrieval result is shown in Table 3.
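For reference, the graded relevance measure builds on the standard normalized DCG; the following is a sketch of the underlying computation (the paper's AnDCG averages such scores over queries, and the exact discount variant used here is an assumption):

```python
import numpy as np

def ndcg(gains):
    """Normalized Discounted Cumulative Gain for a ranked list of
    graded relevance gains.  Higher is better; 1.0 is the ideal
    ranking (all high-gain samples ranked first)."""
    gains = np.asarray(gains, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = np.sum(gains * discounts)
    ideal = np.sum(np.sort(gains)[::-1] * discounts)
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg([3, 3, 2, 1, 0]))   # ideal ranking -> 1.0
print(ndcg([3, 2, 3, 0, 1]))   # suboptimal ranking -> below 1.0
```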
The performance of the Hierarchical Indexing method is better than that of the Boolean Motion Template method. The Boolean Motion Template method relies on the matched template, so mismatches are expected when the intra-class variation among samples is large. This is the case in our experimental data because the motions are classified according to their logical meaning. Although Geometric features can also define logical meaning well, an extra sophisticated feature selection step is needed to enhance the performance: as proposed by the authors in (Müller et al., 2005), manual effort in addition to a fuzzy selection is needed. On the other hand, the Hierarchical Indexing method contains more detail and is more scalable, so its retrieval can be better in the high-recall region. In addition, it segments the motion into a sequence of sub-moves such that the temporal detail is preserved.
Our method outperforms the existing methods because the features selected by the Regression Model can effectively abstract the logical meaning of each input motion. Even when a new motion is added to the database, our system can still suggest a good initial set of features. Our proposed method is useful for retrieving different kinds of query motions. Figure 8 illustrates an example retrieval result using our proposed method.
Table 3. Performance comparison among several retrieval methods (performance in graded relevance).

| Methods | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 | Trial 6 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Boolean Motion Template | 0.8358 | 0.8264 | 0.8203 | 0.8341 | 0.8237 | 0.8311 | 0.8286 |
| Motion hierarchy indexing | 0.8745 | 0.8768 | 0.8757 | 0.8742 | 0.8755 | 0.8739 | 0.8751 |
| Proposed method | 0.9137 | 0.9133 | 0.9155 | 0.9136 | 0.9143 | 0.9143 | 0.9141 |
Fig. 8. Example retrieval result of our proposed method (Query = Left Straight Punch). The top-ranked results are all LSP (Rank 1: score 1.2021; Rank 2: 1.2198; Rank 3: 1.2256); LHP first appears at Rank 15 (1.2510) and RSP at Rank 16 (1.2518); RHP appears at Ranks 68 and 69 (1.3031, 1.3056), RUK at Rank 78 (1.3109), and Agogo15 at Rank 98 (1.3241).
5.5. Evaluation of Graded Relevance Feedback (GRF)
We compared the performance of GRF across different numbers of iterations and feedback samples in terms of AnDCG. In general, the performance with GRF is better than the baseline performance (without GRF, at iteration 0). We denote GRF@N as the case when the N top-ranked samples of known relevance are fed back into the system. Figure 9 shows the result of applying Graded Relevance Feedback (GRF) to refine the initial subset selected by our proposed AFS method. For each value of N, the performance increases and converges at about 5 iterations, with the sharpest rise in the first 2 iterations. GRF@20 achieved the best performance.

Figure 10 shows the change in Precision-Recall over different numbers of GRF iterations at GRF@20, where we considered the binary relevance of retrieving the most relevant class. The retrieval performed better with an increasing number of iterations and converged at about 5 iterations, which is consistent with the result evaluated under the graded relevance measure. This shows that our proposed GRF method robustly assists the Adaptive Feature Selection in attaining better performance.
Fig. 9. The performance of Graded Relevance Feedback (AnDCG vs. number of GRF iterations; curves: GRF@25, GRF@20, GRF@15, GRF@10, GRF@5).
Fig. 10. The Precision-Recall curves for different numbers of iterations at GRF@20 (curves: no iteration through 5 iterations).
5.6. Prototype of Our Retrieval System
We have implemented a prototype of the proposed motion retrieval system; its interface is shown in Figure 11. It allows the user to open a motion file as a query example. The query motion is rendered on the left-hand side (in red) while the retrieved motions are rendered on the right-hand side (in blue). When the user clicks the "Relevance Feedback" button, the user is asked to click on a retrieved sample and enter its relevance as a grade value. The samples are then fed into the system, which triggers the graded relevance feedback, and the newly retrieved samples are rendered.
Fig.11.
The interface of our proposed motion retrieval system.
In this article, we presented an Adaptive Feature Selection and Graded Relevance Feedback (AFS-GRF) scheme that enhances the performance of an example-based motion retrieval system. In our experimental data, we organized a total of 36 classes of different movements with an Ontology hierarchy and determined their logical relevance. We modeled the logical relevance of the 3D motion capture data based on the variances of the relative distances between joints (VJRD). With the graded relevance measure, we evaluated the results retrieved by the proposed method in terms of logical relevance.
In addition, we proposed an Adaptive Feature Selection (AFS) method that abstracts the characteristics of the query by finding a Linear Regression Model that identifies the relationship between the ground-truth feature subset and the training data. Our method adapts to different kinds of queries and is superior to existing methods that select features based only on the global statistics of the feature distribution. The proposed AFS-GRF algorithm outperforms other class-specific feature selection methods and motion retrieval methods.
Furthermore, we refined the feature selection by Graded Relevance Feedback (GRF): the system updates the retrieval result using the accumulated distances between the query and the feedback samples of different graded relevance values. With a small number of iterations and feedback samples, our method showed good retrieval performance.
Our method can be applied to retrieve 3D motion capture data that can be reused for producing new animations or games. We have built a prototype motion retrieval system with our proposed AFS-GRF method to illustrate this concept. As future work, the retrieved motions can be selected and stored temporarily for synthesizing new animations by techniques such as motion graphs (Kovar et al., 2002). Moreover, it is not always convenient for the user to open a motion file to retrieve similar motions; a possible direction is to explore more sophisticated ways to input the query, such as performing abstract movements with a cheaper motion sensing device. Last but not least, our joint relative measurements can be extended to represent the interactions of multiple bodies.
The work described in this paper was substantially supported by a grant from the
Research Grant Council of the Hong Kong Special Administrative Region, China
[project No. CityU 1165/09E]. The authors would like to thank the anonymous reviewers for their helpful comments to improve the paper.
Aksoy, S., and Haralick, R.M., 2000. A Weighted Distance Approach to Relevance Feedback. In: Proc. International Conference on Pattern Recognition (ICPR 2000), Barcelona, Spain, vol. 4, pp. 812–815.
Baak, A., Müller, M., Seidel, H.P., 2008. An efficient algorithm for keyframe-based motion retrieval in the presence of temporal deformations. In: Proc. ACM International
Conference on Multimedia Information Retrieval (MIR), Vancouver, British Columbia,
Canada, pp. 451–458.
Breitman, K., Casanova, M.A., Truszkowski, W., 2006. Semantic Web: Concepts, Technologies and Applications (NASA Monographs in Systems and Software Engineering). Springer-Verlag New York, Inc.
Buckley, C., Allan, J., Salton, G., 1995. Automatic Routing and Retrieval Using Smart: TREC-2. Inf. Process. Manage., 31(3): 315–326.
CityU Motion Capture Lab, City University of Hong Kong, 2010. <http://mocap.cs.cityu.edu.hk/>.
Accessed: May 29, 2011.
CMU Graphics Lab Motion Capture Database, Carnegie Mellon University, 2003.
<http://mocap.cs.cmu.edu/>. Accessed: May 29, 2011.
CGSPEED, 2008, <http://www.cgspeed.com>. Accessed: March 13, 2011.
Chung, H.S., Kim, J.M., Byun, Y.C., Byun, S.Y., 2005. Retrieving and Exploring Ontology-Based
Human Motion Sequences. In: Proc: the ICCSA (3), pp. 788–797.
Deng, Z., Gu, Q., Li, Q., 2009. Perceptually consistent example-based human motion retrieval. In:
Proc. 2009 Symposium on interactive 3D Graphics and Games (I3D’09), Boston,
Massachusetts, February 27 - March 01, 2009. ACM, New York, NY, pp. 191–198.
Gruber, T.R., 1993. A translation approach to portable ontologies. Knowledge Acquisition, 5(2): 199–220.
Guarino, N., 1995. Formal ontology, conceptual analysis and knowledge representation. International Journal of Human-Computer Studies, 43(5–6): 625–640.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of
Machine Learning Research, 3: 1157–1182 (March)
HDM Motion Capture Database (HDM05), Hochschule der Medien, 2005. <http://www.mpi-inf.mpg.de/resources/HDM05/>. Accessed: May 29, 2011.
Hotho, A., Maedche, A., Staab, S., 2001. Ontology-based text clustering. In: Proc. IJCAI-2001
Workshop “Text Learning: Beyond Supervision”, August, Seattle, USA.
Järvelin, K., 2009. Interactive Relevance Feedback with Graded Relevance and Sentence
Extraction: Simulated User Experiments. In: Proc. 18th ACM Conference on Information and knowledge management, Hong Kong, China, 26 November 2009, pp. 2053-2056.
Keogh, E.J., Palpanas, T., Zordan, V.B., Gunopulos, D., Cardle, M., 2004. Indexing large human-motion databases. In: Proc. 30th VLDB Conf., Toronto, pp. 780–791.
Keskustalo, H., Järvelin, K., Pirkola, A., 2008. Evaluating the effectiveness of relevance feedback based on a user simulation model: effects of a user scenario on cumulated gain value.
Information Retrieval, 11(3): 209-228, June 2008.
Kim, T.K., Kim, H., Hwang, W., Kee, S.C., Kittler, J., 2003. Face description based on decomposition and combining of a facial space with LDA. In: Proc. IEEE International
Conference on Image Processing, pp. 877–880, Spain.
Kovar, L., Gleicher, M., Pighin, F., 2002. Motion Graphs. ACM Transactions on Graphics, 21(3): 473–482.
Liu, X., Chen, T., Thornton, S.M., 2003. Eigenspace updating for non-stationary process and its application to face recognition. In Pattern Recognition, 36: 1945–1959.
Müller, M., Röder, T., 2006. Motion templates for automatic classification and retrieval of motion capture data. In: Proc: 2006 ACM SIGGRAPH/Eurographics Symposium on Computer
Animation (SCA), Vienna, Austria, pp. 137–146.
Müller, M., Röder, T., Clausen, M., 2005. Efficient content-based retrieval of motion capture data.
ACM Transactions on Graphics (TOG), 24(3): 677–685.
Rocchio, J.J., 1971. Relevance Feedback in Information Retrieval. In The SMART Retrieval
System, Experiments in Automatic Document Processing, pages 313–323. Prentice Hall,
Englewood Cliffs, New Jersey, USA.
Penrose, R., 1956. On best approximate solutions of linear matrix equations. Proc. Cambridge Philosophical Society, 52: 17–19.
Sakai, T., 2003. Average gain ratio: a simple retrieval performance measure for evaluation with multiple relevance levels. In: Proc. 26th Annual international ACM SIGIR Conference on
Research and Development in information Retrieval (SIGIR’03), Toronto, Canada, July
28 – August 01, 2003, pp. 417–418.
Sakai, T., 2004. Ranking the NTCIR systems based on multigrade relevance. In: Proc. of Asia information retrieval symposium 2004, pp. 170–177.
Sakai, T., 2007. On the reliability of information retrieval metrics based on graded relevance.
Information Processing and Management, 43(2), 531–548.
Sharma, A., Paliwal, K.K., Onwubolu, G.C., 2006. Class-dependent PCA, MDC and LDA: A combined classifier for pattern classification. In Pattern Recognition, 39(7): 1215–1229.
Shum, H.P., Komura, T., Yamazaki, S., 2007. Simulating competitive interactions using singly captured motions. In: Proc. 2007 ACM Symposium on Virtual Reality Software and Technology (VRST '07), CA, U.S.A., pp. 65–72.
Tam, G.K., Zheng, Q., Corbyn, M., Lau, R.W., 2007. Motion Retrieval Based on Energy
Morphing. In: Proc. Ninth IEEE International Symposium on Multimedia, pp. 210–220.
Tang, J.K., Leung, H., Komura, T., Shum, H.P., 2008. Emulating human perception of motion similarity. Comput. Animat. Virtual Worlds, 19(3–4): 211–221.
Tusk C., Koperski K., Aksoy S., and Marchisio G. 2003. Automated feature selection through relevance feedback. In: Proc. IEEE International Geoscience and Remote Sensing
Symposium, 2003. IGARSS ’03. Vol. 6, pp. 3691–3693, July 2003.
Wang, X., Yu, Z., Wong, H.S., 2008. Searching of Motion Database Based on Hierarchical SOM.
In: Proc. IEEE International Conference on Multimedia and Expo 2008, pp.1223–1236.
Zhang, J., 2008. A Novel Video Searching Model Based on Ontology Inference and Multimodal
Information Fusion. In: Proc. 2008 international Symposium on Computer Science and
Computational Technology – Vol. 2 (December 20 - 22, 2008). ISCSCT. IEEE Computer
Society, Washington, DC, pp. 489–492.