Preprint


Retrieval of Logically Relevant 3D Human Motions by Adaptive Feature Selection with Graded Relevance Feedback

Jeff K. T. Tang and Howard Leung

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

kttang@cs.cityu.edu.hk, howard@cityu.edu.hk

Abstract

A system that can retrieve logically relevant 3D captured motions is useful in game and animation production. We present a robust logical relevance metric based on the relative distances among the joints. Existing methods select a universal subset of features for all kinds of queries, which may not characterize the variations in different queries well. To overcome this limitation, we propose an Adaptive Feature Selection (AFS) method that abstracts the characteristics of the query with a Linear Regression Model, so that different feature subsets can be selected according to the properties of the specific query. With a Graded Relevance Feedback (GRF) algorithm, we refine the feature subset to enhance the retrieval performance according to the graded relevance of the feedback samples. With an Ontology that predefines the logical relevance between motion classes in terms of graded relevance, the proposed AFS-GRF algorithm is evaluated and shown to outperform other class-specific feature selection and motion retrieval methods.

Keywords: Adaptive Feature Selection, Logical Similarity, Graded Relevance, Relevance Feedback, Motion Retrieval, 3D Human Motion Capture

1. Introduction

An increasing number of 3D motion capture databases, such as (CMU, 2003) from Carnegie Mellon University and (HDM, 2005) from Hochschule der Medien, are available for public use. However, motion capture is time consuming and labor intensive, so there is a rising need to reuse existing captured data, which makes animation and game production more efficient. In real applications, captured motions (even of the same type) have large spatial and temporal variations.

This is due to style differences among persons. Moreover, even the same person cannot perform a movement identically each time. Hence, a reliable system to retrieve logically relevant motions is desired. However, it is challenging to represent the logical meaning of a motion, and this topic is quite new in motion analysis research.

In human motion analysis research, it has been shown that numerical similarity metrics (e.g. joint position difference, joint angle difference) are insufficient to identify similar motions with style differences (Müller and Röder, 2005; Tang et al., 2008). Recent works showed that it is possible to abstract the logical meaning of motions with relational measures. For example, Müller and Röder measure whether a joint is in place or not with a Boolean value, while other researchers (Tang et al., 2008) consider the relative distances between joints. The relative motion between joints appears to be a robust measure in the context of logical relevance.

For high-dimensional data (e.g. motion data), we have to select a subset of features that works best with the similarity evaluation. In motion retrieval systems, the queries are most likely different each time in terms of characteristics. Classical feature selection methods that select a universal feature set for all queries cannot achieve the best retrieval result for individual cases. It is challenging to design a retrieval method that can automatically determine a set of tailor-made features from the query itself, without prior knowledge of its motion class.

Relevance Feedback can be used to refine the feature selection. It is achieved by allowing the user to give some positive and negative samples to the system. In most systems, the query is refined to be closer to the samples in the feature vector space. Some researchers consider updating the weights of features such that important features are boosted. The weighting is updated according to the feedback samples, and hopefully the result will conform better to the user's desire next time. If the binary relevance becomes graded (i.e. "highly relevant", "somewhat relevant", "not so relevant", etc.), more considerations are required in designing the updating method for the relevance feedback.

We make three major contributions in this article. First, we propose the variance of Joint Relative Distance (VJRD) to characterize human motions with logical relevance. Second, we propose an Adaptive Feature Selection (AFS) method that is able to abstract the characteristics of the query and determine a feature subset specific to that query. Third, we propose a Graded Relevance Feedback (GRF) method that makes use of feedback samples with known graded relevance to refine the feature subset and enhance the retrieval performance.

The article is organized as follows. Section 2 describes the related work. We give an overview of our proposed method in Section 3, and the details of our proposed Adaptive Feature Selection with Graded Relevance Feedback (AFS-GRF) method in Section 4. We evaluate the performance of the proposed methods against existing methods in Section 5. Finally, we conclude with a discussion of future directions in Section 6.

2. Related work

In general, information retrieval (IR) can be content-based or semantic-based.

In content-based IR, the features do not contain semantic meaning. Semantic-based IR, on the other hand, has been widely adopted in semantic web searching.

In semantic-based IR, there are basically two types: keyword-based and ontology-based. For keyword-based semantic IR, the user defines keywords that are used to describe each data entry. A vector space model is used to describe the features with weights; if a keyword is more discriminative, its weight is higher. In ontology-based IR systems, relationships among different "concepts" can be defined by a high-level ontology (Hotho et al., 2001). In web/document retrieval, the semantic ontology of the keywords is predefined such that the similarity between the query and the data is given by the weighted sum of the semantics (Gruber, 1993; Guarino, 1995; Breitman et al., 2006). This approach can also be used to organize 3D captured motions (Chung et al., 2005), and possibly to assign semantic meanings to the content in retrieval systems (Zhang, 2008). It is possible to apply the ontology approach to organize motion capture data according to their logical relevance for a retrieval system.

In content-based motion retrieval systems, a logical similarity metric is useful to retrieve motions of similar semantic meaning but different style, timing, etc. There are only a few related works because this field is still new. The Geometric (or Boolean) feature proposed by (Müller and Röder, 2005) is apparently the first robust logical relevance metric. It considers point-to-plane relationships such that a binary value indicates whether the condition is true or false. However, it produces a large number of features, and a subset of good features has to be selected at different times. The Joint Relative Distance (JRD) proposed by (Tang et al., 2008) is another logical relevance metric that considers only the relations between joints, with a much smaller set of features.

Traditionally, the performance of a retrieval system is evaluated by binary relevance (either relevant or irrelevant). In recent years, the graded relevance measure has become a popular choice for evaluating retrieval performance. Graded relevance is represented by different levels such as "highly relevant", "somewhat relevant", "irrelevant", etc., instead of just a Boolean value (relevant/irrelevant). Graded relevance measures are mainly derived from the average precision on Cumulative Gain (CG). The term gain is a score that awards a particular rank for the retrieval of a relevant item. For a graded relevance measure, different gain values can be assigned to different levels of relevance: a larger gain is assigned to a more relevant retrieved item. (Sakai, 2003) proposed the Average Gain Ratio (AGR), which considers the ratio of the CGs of the retrieved ranking to the ideal ranking. (Sakai, 2007) surveyed and compared various graded relevance measures including Cumulative Gain (CG), normalized Cumulative Gain (nCG), Discounted Cumulative Gain (DCG), normalized Discounted Cumulative Gain (nDCG), Average normalized Discounted Cumulative Gain (AnDCG), and Q-measure (Sakai, 2004). AnDCG and Q-measure are the best choices because they discount the gains of late-arriving relevant documents. In our experiment, the graded relevance measure is given by the Average normalized Discounted Cumulative Gain (AnDCG) (Sakai, 2007) among motion classes. The performance of our proposed motion retrieval system using AFS-GRF is evaluated through the AnDCG measure.

Feature selection is used to select important (or filter out non-useful) features in order to obtain better performance. Sequential Forward/Backward Selection (SFS/SBS) has been commonly used (Guyon and Elisseeff, 2003). However, these are brute-force methods that maximize the IR performance by selecting/eliminating features to form new subsets. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) can reduce the feature dimension by choosing the feature subset with the highest projection energies, i.e. the most discriminative ones. Both PCA and LDA can be applied to model characteristics specific to each class, in pattern classification (Sharma et al., 2006; Liu et al., 2003) and face recognition (Kim et al., 2003) respectively.


More specific to human motion retrieval, a Boolean Motion Template that averages the characteristics of all samples in the same logical class has been built (Müller et al., 2006; Baak et al., 2008). Furthermore, some researchers divided the human body into parts and built separate indices for these body parts (Deng et al., 2009). These methods will be compared with our proposed method.

Researchers have applied relevance feedback to enhance the performance of IR systems. A classical way is to update the current query in the feature vector space so that it gets closer to the positive (relevant) samples and further away from the negative (irrelevant) samples (Rocchio, 1971). Blind/pseudo relevance feedback can automatically expand the query by assuming the top-N ranks are relevant (Buckley et al., 1995). On the other hand, some researchers suggested updating the weights of features. This method maximizes the variations of features in the database over those in the feedback samples, and the features can be reweighted accordingly (Aksoy et al., 2000). Similarly, a subset of the most significant features can be selected through relevance feedback, which is useful for IR systems that deal with high-dimensional data such as images (Tusk et al., 2003). Most existing relevance feedback methods consider binary relevance (i.e. positive/negative feedback), and only a few researchers have considered graded relevance in retrieval systems (Sakai, 2007; Keskustalo, 2008; Järvelin, 2009). They considered a blind relevance feedback method in which the query is expanded through the relative average term frequency (RATF) that depends on the graded ranks.

3. Proposed Method

3.1. Overview

Our proposed method consists of three parts, and the flow is shown in Figure 1. First, we preprocess the data by extracting useful features and organizing them according to their logical relevance. Next, we train a Linear Regression Model between the feature vectors of all samples and their optimal feature subsets. In the motion retrieval phase, the Adaptive Feature Selection (AFS) automatically selects an initial subset of features based on the input query. Through the Graded Relevance Feedback (GRF), a number of retrieved samples with known relevance are fed back to the system to refine the initial feature set, and hence a better retrieval performance can be achieved. In this section, we explain the preprocessing step and give a highlight of motion retrieval and relevance feedback. The details of the proposed AFS-GRF method are presented in the next section.

Fig. 1. The flow of the proposed Adaptive Feature Selection and Graded Relevance Feedback algorithm.

3.2. Data Collection and Representation

Our experimental data contain 1099 single-person movements. 90% of these motions were captured on our own (CityU, 2011). The remaining 10% are adopted from a well-known public motion capture database provided by Carnegie Mellon University (CMU, 2003). We converted our captured motion clips into the standard Biovision Hierarchy (BVH) format, in which the body hierarchy is well-defined. The BVH version of the CMU motion data is available at (CGSPEED, 2008). We selected the "walking", "running" and "jumping" motions from the public database because their sample sizes are relatively large. These three classes are common to the dataset used in existing work by (Deng et al., 2009), thus making our performance evaluation more objective.

Since the CMU and CityU datasets are captured with different settings, their numbers of joints are different. The hierarchy in a CMU motion contains 31 joints whereas the hierarchy in a CityU motion contains 21 joints. Hence, we generalized the human representation as shown in Figure 2. We considered the 15 joints common to both sets, i.e. (1) Head, (2) Neck, (3) Root, (4) Right Shoulder, (5) Right Elbow, (6) Right Hand, (7) Left Shoulder, (8) Left Elbow, (9) Left Hand, (10) Right Hip, (11) Right Knee, (12) Right Foot, (13) Left Hip, (14) Left Knee, and (15) Left Foot. Our method can easily adopt motion data captured by other laboratories under different conditions.

Fig. 2. The representation of the human body in motion capture data.

In our motion retrieval system, we assume that each sample is a primitive move, i.e. a self-complete movement. Hence, we first segmented the motion clips into primitive moves automatically, simply at the postures where the joint accelerations are not significant (Shum et al., 2007). This is simple but has proven effective in existing work.

3.3. VJRD Feature Extraction

In each segmented primitive move, we extracted the Joint Relative Distance (JRD), which is calculated as the pair-wise Euclidean distance (i.e. L2-norm) between any two joints (Tang et al., 2008). The variance of each JRD over the duration of the movement is used to characterize the motion, and we name it VJRD. Physically, the variance measures the extent to which each JRD changes from its mean. Intuitively, motions of similar logical meaning (even with different styles) have some joints moving in a similar way, which is reflected in the value of VJRD. A zero VJRD represents a pair of joints that are relatively static to each other, while a non-zero VJRD represents a pair of joints that have substantial relative movement.


There are 15 joints in our human model, so we have 15×14/2 = 105 joint pairs. Some rigid pairs (i.e. on the same bone or on the torso) are filtered out because their JRDs are almost unchanged all the time. Hence the total number of joint pairs we used becomes 105 pairs – 14 (bone pairs) – 15 (torso pairs) = 76.

The formulation of VJRD is shown below:

Let $M_A = \{P_{A1}, P_{A2}, \ldots, P_{AT}\}$ be a primitive move $A$ of $T$ frames (postures), and let each posture contain $N_p = 76$ joint pairs. The JRD of the $p$-th ($1 \le p \le N_p$) joint pair at the $t$-th ($1 \le t \le T$) posture is calculated as the L2-norm between the two joints $J_i$ and $J_j$ at frame $t$:

$$JRD_A(t, p) = d_{L2}\big(J_i(t), J_j(t)\big) \qquad (1)$$

A normalization is applied to each $JRD_A(t, p)$ to scale it into a common range [0, 1] so that it becomes robust to different body sizes. We denote the normalized $JRD_A(t, p)$ by $nJRD_A(t, p)$, and its mean over the $T$ frames by $\overline{nJRD_A}(p)$. The VJRD of the $p$-th joint pair is formulated as:

$$VJRD_A(p) = \frac{1}{T} \sum_{t=1}^{T} \big( nJRD_A(t, p) - \overline{nJRD_A}(p) \big)^2 \qquad (2)$$
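For illustration, the VJRD computation of Equations (1)-(2) can be sketched in a few lines of Python. The array layout, the helper name `vjrd_features`, and the use of a single body-size constant for the [0, 1] scaling are our assumptions; the paper only states that each JRD is normalized into [0, 1].

```python
import numpy as np

def vjrd_features(joints, pairs):
    """Variance of Joint Relative Distance (VJRD), Eqs. (1)-(2), for one primitive move.

    joints : (T, N_joints, 3) array of 3D joint positions over the T frames.
    pairs  : list of (i, j) index tuples for the 76 non-rigid joint pairs.
    """
    # Eq. (1): pair-wise L2 distance per frame, giving a (T, len(pairs)) JRD matrix
    jrd = np.stack([np.linalg.norm(joints[:, i] - joints[:, j], axis=1) for i, j in pairs], axis=1)
    # Scale into [0, 1] by one overall body-size constant so that the features are
    # comparable across performers (this particular normalisation is our assumption)
    njrd = jrd / jrd.max()
    # Eq. (2): variance of each normalised JRD over the T frames
    return np.mean((njrd - njrd.mean(axis=0)) ** 2, axis=0)
```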

3.4. Data Organization

According to their logical relevance, we organized the motion classes into an Ontology hierarchy as shown in Figure 3. It is a tangled tree of motion classes linked by an "include" relationship, i.e. a lower-level motion class inherits the characteristics of higher-level classes. For example, a "Left Punch" is not only a "Punch" but also a "Left hand movement". There are 36 leaf-node classes, which contain the finest details and are regarded as distinct classes. Table 1 shows the types and numbers of motions in our data. We use the Ontology to determine the graded relevance among motion classes.


Fig. 3. The Ontological hierarchy of our motion dataset.

Table 1. The types/classes of motions and their numbers of samples

Motion Type   Motion Classes                                                        Number of samples
Fighting      Left Straight Punch, Right Straight Punch, Left Hock Punch,           318
              Right Hock Punch, Left Upper Kick, Right Upper Kick,
              Left Lower Kick, Right Lower Kick
Dancing       20 kinds of A-go-go dance moves                                       300
Sport         Basketball motions (Shooting, Dribbling and Defending)                189
Locomotion    Walking, Running and Jumping                                          110
Others        Random movements, Reacting moves and Head moves                       182
Total                                                                               1099


3.5. Graded Relevance among Motion Classes

We determined the relevance between any two leaf-node classes by a graded relevance scheme, which considers the degree of relevance between two motions rather than a hard relevant/irrelevant decision. From the Ontology graph shown in Figure 3, we observe that the closer two classes are, the higher their relevance. Hence, we formed a matrix of graded relevance among the 36 classes. Figure 4 shows a simplified version (considering only the 8 classes of fighting motions at Level 0). Each entry represents the grade $g(C_i, C_j)$ (also known as the "Gain") of the two classes $C_i$ and $C_j$, which is determined by $g(C_i, C_j) = d_{node\_max} - d_{node}(C_i, C_j) + 1$, where $d_{node}(C_i, C_j)$ is the distance between the two classes and $d_{node\_max}$ is the maximum distance among all classes. A range of values (from 1 to 7) is used to represent the relevance in ascending order, e.g. 1 for "irrelevant" classes, 5 for "somewhat relevant" classes, and 7 for the most relevant classes.

Fig. 4. The Graded Relevance Matrix of the 8 fighting motion classes:

Class   1  2  3  4  5  6  7  8
  1     7  5  3  3  1  1  1  1
  2     5  7  3  3  1  1  1  1
  3     3  3  7  5  1  1  1  1
  4     3  3  5  7  1  1  1  1
  5     1  1  1  1  7  5  3  3
  6     1  1  1  1  5  7  3  3
  7     1  1  1  1  3  3  7  5
  8     1  1  1  1  3  3  5  7
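The grading scheme above can be illustrated with a minimal sketch that measures the node distance between two classes in the ontology graph and applies $g(C_i, C_j) = d_{node\_max} - d_{node}(C_i, C_j) + 1$. The adjacency map, the class names, and the value of d_node_max below are hypothetical placeholders, not the paper's actual ontology.

```python
from collections import deque

def node_distance(adj, c1, c2):
    """Number of edges on the shortest path between two class nodes in the ontology graph."""
    dist, queue = {c1: 0}, deque([c1])
    while queue:
        node = queue.popleft()
        if node == c2:
            return dist[node]
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    raise ValueError("classes are not connected in the ontology")

def grade(adj, c1, c2, d_node_max):
    """Graded relevance g(Ci, Cj) = d_node_max - d_node(Ci, Cj) + 1."""
    return d_node_max - node_distance(adj, c1, c2) + 1

# Hypothetical fragment of the ontology (undirected "include" links)
adj = {
    "Punch": ["Left Straight Punch", "Right Straight Punch"],
    "Left Straight Punch": ["Punch"],
    "Right Straight Punch": ["Punch"],
}
print(grade(adj, "Left Straight Punch", "Right Straight Punch", d_node_max=6))  # 6 - 2 + 1 = 5
```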

3.6. Motion Retrieval Method

We now describe how to compute the dissimilarity between two primitive moves $M_A$ and $M_B$, which is given in Equation (3). We resample them to the same number of frames $T$ and accumulate the distance of each selected dimension $p$, where $N_p$ is the size of the selected feature set. Here, $w_p$ is the weight of feature $p$; initially it is set to $1/N_p$, but it is updated during the relevance feedback described in Section 4.2. If the moves $M_A$ and $M_B$ are similar, the value $d(M_A, M_B)$ will be close to 1; otherwise the value will be much larger.

$$d(M_A, M_B) = f_{LR} \sum_{p=1}^{N_p} w_p \sum_{t=1}^{T} d_{L2}\big(JRD_A(t, p), JRD_B(t, p)\big) \qquad (3)$$

Sometimes, the numbers of frames of $M_A$ and $M_B$ are very different, which biases the dissimilarity measure. To overcome this deficiency, a length-ratio factor $f_{LR} = \mathrm{max\_length}(M_A, M_B) / \mathrm{min\_length}(M_A, M_B)$ is multiplied into the dissimilarity score. In our data, the length ratio between the motion samples in each class and the class mean is about 1.67. Therefore, this factor penalizes the score when the length of a sample is very different from that of the query, without outweighing the original measurement.

In this work, we consider an example-based motion retrieval method. Given a primitive move as the query $Q$, we search the moves $\{D_i \mid 1 \le i \le I\}$ in the database, where $I$ is the size of the database, and rank them in descending relevance, i.e. ascending dissimilarity $d(Q, D_i)$ as given by Equation (3). With the initial set of features selected by AFS, the selected features are assumed to be equally important, so the same weight $w_p$ is assigned to each feature. When relevance feedback is applied, the importance of the features in the subset is examined to refine the weights.
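A minimal sketch of this ranking procedure is given below, assuming the query and each database move already carry their resampled JRD matrix and original frame count; the dict layout and function names are ours, not the paper's.

```python
import numpy as np

def dissimilarity(jrd_a, jrd_b, len_a, len_b, weights, selected):
    """Dissimilarity d(M_A, M_B) of Eq. (3) between two primitive moves.

    jrd_a, jrd_b : (T, N_p) per-frame JRD matrices, already resampled to a common T.
    len_a, len_b : original frame counts before resampling (for the length-ratio factor).
    weights      : per-feature weights w_p.
    selected     : indices (or boolean mask) of the features chosen by AFS.
    """
    f_lr = max(len_a, len_b) / min(len_a, len_b)          # length-ratio penalty
    per_feature = np.abs(jrd_a[:, selected] - jrd_b[:, selected]).sum(axis=0)
    return f_lr * np.sum(np.asarray(weights)[selected] * per_feature)

def rank_database(query, moves, weights, selected):
    """Rank database moves by ascending dissimilarity, i.e. descending relevance to the query."""
    scores = [dissimilarity(query["jrd"], m["jrd"], query["len"], m["len"], weights, selected)
              for m in moves]
    order = np.argsort(scores)
    return order, np.asarray(scores)[order]
```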

3.7. Performance Evaluation with Graded Relevance

We evaluate the performance of retrieval systems under the graded relevance scheme by AnDCG, which is one of the common measures considered by researchers working on information retrieval with graded relevance.

The gain $g(i)$ is the graded relevance value determined in Section 3.5, where $i$ is the rank of a retrieved motion. The Discounted Cumulative Gain (DCG) is obtained by summing the discounted gains $dg(i)$, where the gain $g(i)$ is divided by the term $\log_a i$ and $a$ is the minimum rank to which discounting is applied. Let $r$ be the number of retrieved items. The DCG is given by Equation (4):

$$dcg(r) = \sum_{i=1}^{r} dg(i) = g(1) + \sum_{i=2}^{r} \frac{g(i)}{\log_a i} \qquad (4)$$

The Average normalized Discounted Cumulative Gain (AnDCG) is the average of the ratio of $dcg(r)$ to $dcg_I(r)$ over the ranked output, where $l$ is the size of the ranked output, $dcg(r)$ is the DCG of the retrieved ranking, and $dcg_I(r)$ is the DCG of the ideal ranking. The ideal ranking consists of the Gain values sorted in descending order. The AnDCG is given by Equation (5). It ranges from 0 to 1 and is a kind of average precision measure; the higher the AnDCG value, the better the retrieval performance.

$$AnDCG(l) = \frac{1}{l} \sum_{r=1}^{l} \frac{dcg(r)}{dcg_I(r)} \qquad (5)$$
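Under these definitions, the evaluation measure can be sketched as follows; the base a = 2 and the function names are assumptions (the paper does not fix a), and the gains of the retrieved and ideal rankings are assumed to have the same length l.

```python
import numpy as np

def dcg(gains, a=2):
    """dcg(r) of Eq. (4) for every cutoff r: gains at ranks >= a are divided by log_a(rank)."""
    gains = np.asarray(gains, dtype=float)
    ranks = np.arange(1, len(gains) + 1)
    discounts = np.where(ranks < a, 1.0, np.log(ranks) / np.log(a))
    return np.cumsum(gains / discounts)

def andcg(gains, ideal_gains):
    """Average normalised DCG of Eq. (5): mean over all cutoffs of dcg(r) / dcg_I(r)."""
    d_ideal = dcg(np.sort(ideal_gains)[::-1])   # ideal ranking: gains sorted in descending order
    return np.mean(dcg(gains) / d_ideal)
```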

4. Adaptive Feature Selection with Graded Relevance Feedback (AFS-GRF)

The motion retrieval result can be enhanced by selecting a subset of features that works best at ranking the samples in the database in descending logical relevance. Our proposed Adaptive Feature Selection (AFS) estimates an initial subset of features that well characterizes the query motion. This initial feature subset is then refined by Graded Relevance Feedback (GRF). The details of each part of the algorithm are discussed in the following subsections.

4.1. Adaptive Feature Selection (AFS)

Our hypothesis is that there exists a subset of features that gives the optimal retrieval performance, that this subset may vary among different queries, and that we are able to abstract it from a query without knowing its motion class. A Regression Model is proposed here to explore the relations among the motion features to decide whether a feature should be selected or not. The AFS consists of two steps: training a Regression Model, and applying the Regression Model for motion retrieval. Before the training step, we first obtain the ground-truth feature subset for each class using class-specific sequential backward selection.


4.1.1. Ground-truth Class-specific Feature Subset Estimation

It is tedious to select the optimal feature subset for each motion class manually. Hence, we apply a classic sequential feature selection method. As suggested by (Guyon and Elisseeff, 2003), Sequential Backward Selection (SBS) can select a combination of features that gives good retrieval performance. In order to select the ground-truth feature subset for each class, we apply SBS for each class separately. Therefore, a single SBS is repeated 36 times, and this scheme is called class-specific SBS (C-SBS).

Figure 5 illustrates how SBS works on each class of queries. Suppose there are k features in the data; then k candidate feature subsets can be formed, each with one feature taken away. The performance of each subset is computed and the subset with the best performance is chosen; the excluded feature m is added to the elimination list. This process is repeated on the current subset until only one feature is left, as sketched in the code below. The SBS is repeated for each of the 36 classes. Although the performance can be very good, it is a brute-force approach and computationally expensive. It is invoked only once, to determine the ground-truth feature subset for each class. During retrieval, we apply our proposed Adaptive Feature Selection (AFS) method to determine the feature subset from each input query automatically.

Fig. 5. The Sequential Backward Selection (SBS).
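A minimal sketch of this elimination loop, assuming an `evaluate` callback that returns the retrieval performance (e.g. AnDCG) of a candidate subset on the training samples of one class:

```python
def sequential_backward_selection(features, evaluate):
    """Class-specific SBS: repeatedly drop the feature whose removal hurts retrieval least."""
    current = list(features)
    best_subset, best_score = list(current), evaluate(current)
    elimination_order = []
    while len(current) > 1:
        # Evaluate every subset obtained by removing exactly one feature
        scored = [(evaluate([f for f in current if f != m]), m) for m in current]
        score, removed = max(scored)          # keep the best-performing reduced subset
        current.remove(removed)
        elimination_order.append(removed)
        if score > best_score:
            best_score, best_subset = score, list(current)
    return elimination_order, best_subset
```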


A flag vector $z$ can be used to denote which features have been selected. Consider the scenario where the whole feature set is $F = \{f_1, f_2, f_3, f_4\}$ and the optimal feature subset is $F_O = \{f_3, f_4\}$; then the vector $z$ becomes $[-1\ -1\ +1\ +1]$, where the $i$-th element is labeled $+1$ (or $-1$) if the $i$-th feature is present (or absent) in the optimal subset. In our experiment, the dimension of the flag vector $z$ of each data sample is $1 \times 76$. If there are $N_{tr}$ training samples, the flag vectors $z$ are aggregated to form an $N_{tr} \times 76$ flag matrix $Y$, which is used to train the Regression Model.
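For example, the flag matrix Y can be assembled from the per-sample optimal subsets as follows (0-based feature indices; the helper name is ours):

```python
import numpy as np

def flag_matrix(optimal_subsets, n_features=76):
    """Build the N_tr x n_features flag matrix Y: +1 where a feature is in the optimal subset, -1 otherwise."""
    Y = -np.ones((len(optimal_subsets), n_features))
    for row, subset in enumerate(optimal_subsets):
        Y[row, list(subset)] = +1.0
    return Y

# e.g. flag_matrix([{2, 3}], n_features=4) -> [[-1., -1., +1., +1.]]  (features f3, f4 selected)
```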

4.1.2. Training a Regression Model

A Regression Model (B) is trained by relating the VJRD features to the flag vectors resulting from C-SBS. Here we use a simplified example to illustrate the idea. Assume that there are four features in total and three training samples in the dataset. The feature vectors are aggregated to form a $3 \times 5$ feature matrix

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\ a_{31} & a_{32} & a_{33} & a_{34} & a_{35} \end{bmatrix}.$$

The term $a_{ij}$ represents feature $j$ of training sample $i$, with $1 \le i \le 3$ and $1 \le j \le 4$, and these features are the explanatory variables in the Regression Model. The term $a_{i5}$ is set to the constant 1 for $1 \le i \le 3$; it is the intercept in the Regression Model.

Furthermore, we have a $3 \times 4$ flag matrix

$$Y = \begin{bmatrix} -1 & -1 & +1 & +1 \\ +1 & -1 & +1 & -1 \\ +1 & -1 & -1 & -1 \end{bmatrix}$$

denoting which features should be selected for each training sample (i.e., features 3 and 4 for training sample 1; features 1 and 3 for training sample 2; and feature 1 for training sample 3). With a Regression Model, the flag vector $z_1 = [Y_{11}\ Y_{12}\ Y_{13}\ Y_{14}]$ for training sample 1 is represented as the set of linear equations given by (6), where the $\beta_{ij}$ are unknown coefficients to be solved.

$$\begin{aligned}
Y_{11} &= a_{11}\beta_{11} + a_{12}\beta_{21} + a_{13}\beta_{31} + a_{14}\beta_{41} + \beta_{51} = -1 \\
Y_{12} &= a_{11}\beta_{12} + a_{12}\beta_{22} + a_{13}\beta_{32} + a_{14}\beta_{42} + \beta_{52} = -1 \\
Y_{13} &= a_{11}\beta_{13} + a_{12}\beta_{23} + a_{13}\beta_{33} + a_{14}\beta_{43} + \beta_{53} = +1 \\
Y_{14} &= a_{11}\beta_{14} + a_{12}\beta_{24} + a_{13}\beta_{34} + a_{14}\beta_{44} + \beta_{54} = +1
\end{aligned} \qquad (6)$$


If we also consider $z_2$ and $z_3$ in a similar manner for training samples 2 and 3, we have a total of 12 equations and 20 model coefficients $\beta_{ij}$. A model matrix

$$B = \begin{bmatrix}
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\
\beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\
\beta_{31} & \beta_{32} & \beta_{33} & \beta_{34} \\
\beta_{41} & \beta_{42} & \beta_{43} & \beta_{44} \\
\beta_{51} & \beta_{52} & \beta_{53} & \beta_{54}
\end{bmatrix}$$

is hence formed, and these equations can be written in matrix form as Equation (7):

$$Y = A B \qquad (7)$$

In our experiment, each training set has about 500 samples, while each sample has 76 VJRD features. Since the number of training samples $N_{tr}$ is always larger than the number of features $N_F$, we solve the system of linear equations by minimizing the Euclidean norm between $AB$ and $Y$ using the pseudo-inverse method (Penrose, 1956). The pseudo-inverse of $A$ is $A^{+} = (A^{T} A)^{-1} A^{T}$. Hence, the matrix of coefficients $B$ for the Regression Model is obtained by Equation (8):

$$B = A^{+} Y \qquad (8)$$

4.1.3. Applying Regression Model for Motion Retrieval

An example-based retrieval system is presented: a motion sample is input as the query in order to retrieve relevant motions. With the trained Regression Model $B$, the significant features of the query can be determined. The following example illustrates how to select the feature subset from the query. Let $Q$ be the row vector that contains the features of the input motion plus a constant intercept term of 1, and let $z_Q$ be the flag vector of the feature set. Substituting $Q$ for $A$ and using the trained Regression Model $B$ in Equation (7), $z_Q$ is obtained by Equation (9):

$$z_Q = Q B = \begin{bmatrix} q_1 & q_2 & q_3 & q_4 & 1 \end{bmatrix}
\begin{bmatrix}
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\
\beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\
\beta_{31} & \beta_{32} & \beta_{33} & \beta_{34} \\
\beta_{41} & \beta_{42} & \beta_{43} & \beta_{44} \\
\beta_{51} & \beta_{52} & \beta_{53} & \beta_{54}
\end{bmatrix} \qquad (9)$$

$z_Q$ is a vector of real numbers that helps us decide whether a feature is selected. The decision threshold is set to 0, meaning that the feature selection depends on the sign of $z_Q$. For example, if $z_Q$ is equal to $[-1.02, +2.68, +0.10, -4.22]$, then since the 2nd and 3rd entries of $z_Q$ are positive, features 2 and 3 are selected for the similarity comparison in retrieving motions.
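Applying the trained model to a new query then reduces to one matrix-vector product and a sign test; a short sketch (function names ours):

```python
import numpy as np

def select_features(query_features, B):
    """Predict the flag vector z_Q = [q, 1] B (Eq. 9) and keep the features with positive entries."""
    q = np.append(np.asarray(query_features, dtype=float), 1.0)   # append the intercept term
    z_q = q @ B
    selected = np.flatnonzero(z_q > 0)                            # decision threshold at 0
    return z_q, selected

# e.g. if z_q = [-1.02, +2.68, +0.10, -4.22], the 2nd and 3rd features (indices 1 and 2) are selected
```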

4.2. Refining Retrieval with Graded Relevance Feedback (GRF)

In our proposed relevance feedback method, the top-N ranked samples in the current retrieval result are fed back into the system. In a real scenario with an actual user, the graded relevance between the query and each feedback sample would be tagged by the user via the retrieval system. In this experiment, the graded relevance of the feedback samples is assumed to be known, which allows us to evaluate the performance of relevance feedback. The feature subset is refined iteratively, and the retrieval result gets closer to the user's desire.

In relevance feedback research, some researchers update the features such that the query gets closer to the feedback samples in the feature space. Others update the weights of features such that more significant features are given higher weights. In our approach, we do not update the query; instead, we introduce an offset vector $\Delta z$ to update the flag vector $z_Q$. The offset is determined from the graded relevance and the dissimilarities between the feedback samples and the query. To determine the offset vector $\Delta z$, we consider the contribution of each feature. The gains (i.e. the graded relevance) of the top-N ranked samples show how well each feature ranks the feedback samples. A feature with a high accumulated gain performs better, so the corresponding entry in $\Delta z$ is increased, and vice versa. In particular, the total feature set is divided into two parts: the selected subset (the currently selected features) and the unselected subset (the remaining features). In the new iteration, we suppress the effectiveness of the worst-performing feature $i$ in the selected set, while boosting the best-performing feature $j$ in the unselected set. This is done by assigning an offset value $\delta$ (0.3 in our experiment): the $i$-th element of the offset vector, $\Delta z(i)$, is set to $-\delta$, while the $j$-th element, $\Delta z(j)$, is set to $+\delta$.

However, if we only update $\Delta z$ based on the gains of the feedback samples, the algorithm no longer works in marginal cases with identical gains, i.e. when the gains of the feedback samples are all equal (say 3). This is more likely to happen when a smaller number of feedback samples is provided. To avoid this problem, we also consider the Mean Accumulated Dissimilarity (MAD) of each feature, which is the average of the dissimilarity values between the query and each feedback sample of gain $g(i)$. Let $S_i(p, j)$ be the $p$-th feature of the $j$-th feedback sample with gain $i$; the MAD is given by Equation (10):

$$MAD_i(p) = \frac{1}{|S_i|} \sum_{j=1}^{|S_i|} d\big(Q(p), S_i(p, j)\big) \qquad (10)$$

The gain $g(i)$ is then weighted by the corresponding $MAD_i$ such that $g'(i) = g(i) \times MAD_i$, which is used to calculate the AnDCG, i.e. the performance of each feature. The weight of each feature, $w_p$, is given by normalizing the MAD value. After the above process, an offset vector $\Delta z$ and the weights $w_p$ of each feature are determined. A new flag vector $z'_Q$ is computed as $z'_Q = z_Q + \Delta z$. In calculating the dissimilarity measure between the query and each sample, the selected features are further weighted by $w_p$. In the marginal case where all feedback samples have identical gains, the system considers the performance of each feature by using MAD only.

5. Experiments and Results

The experimental settings are provided in the first part of this section. Then our proposed approach is evaluated in several respects. Our proposed VJRD features are evaluated and compared with other similarity metrics. Our proposed Adaptive Feature Selection (AFS) method is then evaluated and compared with other feature selection methods. The performance of our proposed motion retrieval method is compared with other existing retrieval methods. Finally, the performance of our proposed Graded Relevance Feedback (GRF) is shown.

5.1. Experimental Settings

In our proposed method, the performance of different motion retrieval approaches is evaluated by the graded relevance scheme. To verify the robustness of the systems, cross-validation is used. The data are divided into four equal parts, with the samples in each part selected randomly. In each trial, 2 out of the 4 parts, i.e. 50% of the data, are used for training, and the remaining 50% is used for testing. Hence, there are $\binom{4}{2} = 6$ trials.

5.2. Evaluation of the Variance of Joint Relative Distance (VJRD)

The Variance of Joint Relative Distance (VJRD) is compared with other motion similarity measures: the joint angle difference and the Boolean features. The joint angle difference is a classical measure that has long been adopted in many early works for evaluating motion similarity (Wang et al., 2008; Tam et al., 2007). Boolean features (Müller and Röder, 2005), also known as Geometric features, describe the relations between body parts (a joint or a plane of joints). The value of a Boolean feature is either in place (the feature value is 1) or not in place (the feature value is 0). The 39 Boolean features suggested by (Müller and Röder, 2006) are used in the experiment. The properties of the joint angle difference and Boolean features are shown in Table 2. In the table, $\theta_i^A$ and $\theta_i^B$ are the Euler angles of motions A and B about a particular joint axis, and $F_i^A$ and $F_i^B$ represent the $i$-th Boolean features of motions A and B, respectively.

Frame correspondence between the two motions is required when using the joint angle difference and Boolean features if the two motions have different durations; this can be obtained by uniform scaling or dynamic time warping. In contrast, our proposed VJRD is computed over all frames, so uniform scaling or time warping is not required, which makes it more efficient.


Table 2. Properties of other distance measures.

                              Joint angle difference                             Boolean feature
Range of feature values       Normalized Euler angles, -180° < θ < 180°          1 = in place, 0 = not in place
Feature dimension             15 joints × 3 dimensions = 45                      39
Distance measure per frame    Absolute difference d_i = |θ_i^A - θ_i^B|          Manhattan (L1) distance d = Σ_i |F_i^A - F_i^B|

The performances in terms of AnDCG under C-SBS feature selection among the aforesaid features are plotted against the dimension of the reduced feature subset in Figure 6 (To focus on our argument, the x -axis is clipped to show only up to 35 selected features).

Fig. 6. Performance comparison among various features: AnDCG plotted against the dimension of the reduced feature subset for the Variance of JRD, Boolean features, and joint angle difference.

The retrieval performance using the joint angle difference is the worst among all, as it only accounts for numerical differences. On the other hand, both the Boolean features and VJRD achieve better performance (maxima of 0.8502 and 0.8909 respectively). The Boolean features perform better than the joint angle difference because they consider the Geometric (semantic) meaning of the movement rather than exact numerical similarity values. However, a binary semantic can only show two states of a relation, i.e. either yes (the joints are in place) or no (they are not in place). Our proposed VJRD features consider the statistical variations of the joint relations, which are more informative. This explains why VJRD outperforms the Boolean features in this experiment.

5.3. Evaluation of Adaptive Feature Selection (AFS)

The effectiveness of the feature selection directly affects the retrieval performance. Dimensionality reduction methods such as PCA (or LDA) are classic methods for selecting a feature subset such that the similarity (or distinctness) is enhanced. In this experiment, traditional dimensionality reduction methods (PCA/LDA), the brute-force method (C-SBS), and our proposed Adaptive Feature Selection (AFS) method are compared, and the result is shown in Figure 7, where the performance in terms of AnDCG is plotted against the dimension of the reduced feature set. The mean performance of our proposed AFS method over all classes is the second highest among all tested methods; it is the closest to C-SBS, which is regarded as the ground-truth result.

Fig. 7. Performance comparison with various feature selection methods (C-SBS, proposed AFS, C-LDA, PCA, and C-PCA): AnDCG plotted against the dimension of the feature subset.

Among the dimension reduction methods, both PCA and class-specific PCA are shown to be unsuitable for feature selection, since PCA tries to project similar data together and hence has weak discriminative power. Class-specific LDA works better than the PCA methods, but when a new class of data is introduced the system has to be retrained each time, so it is not scalable to a large dataset. The performance of our proposed method is very close to that of C-SBS. However, our method is far more efficient than the brute-force C-SBS since it obtains the feature selection labels much faster: the brute-force method takes 1.22 minutes per query whereas our proposed method takes just 1.09 seconds per query. Moreover, there is no need to train the system separately for each class.

5.4. Evaluation of Motion Retrieval scheme

We compare our method with other retrieval methods, i.e. the Boolean Motion Template method (Müller et al., 2006; Baak et al., 2008) and the Hierarchical method (Deng et al., 2009). Instead of Precision-Recall curves, their performances are compared in terms of the graded relevance measure AnDCG; a higher value represents a better overall ranking. The retrieval result is shown in Table 3.

The performance of the Hierarchical Indexing method is better than that of the Boolean Motion Template method. The performance of the Boolean Motion Template relies on the matched template, and mismatches are expected if the intra-class variation among samples is large. This is the case in our experimental data because the motions are classified according to their logical meaning. Although Geometric features can also define logical meaning well, an extra sophisticated feature selection is needed to enhance the performance: as proposed by the authors in (Müller and Röder, 2005), manual effort in addition to a fuzzy selection is needed. On the other hand, the Hierarchical Indexing method contains more detail and is more scalable, so its retrieval can be better in the high-recall region. In addition, it segments the motion into a sequence of sub-moves such that the temporal detail is preserved.


Our method outperforms the existing methods because the selected features from the Regression Model can effectively abstract the logical meaning of each input motion. Even when a new motion is added to the database, our system can still suggest an initial set of good features. Our proposed method is useful to retrieve different kinds of query motions. Figure 8 illustrates an example retrieval result using our proposed method.

Table 3. Performance comparison among several retrieval methods (performance in graded relevance).

Methods                    Trial 1   Trial 2   Trial 3   Trial 4   Trial 5   Trial 6   Average
Boolean Motion Template    0.8358    0.8264    0.8203    0.8341    0.8237    0.8311    0.8286
Motion hierarchy indexing  0.8745    0.8768    0.8757    0.8742    0.8755    0.8739    0.8751
Proposed method            0.9137    0.9133    0.9155    0.9136    0.9143    0.9143    0.9141


Fig. 8. Example retrieval result of our proposed method (Query = Left Straight Punch). Selected ranks, retrieved classes, and dissimilarity scores: Rank 1 LSP (1.2021), Rank 2 LSP (1.2198), Rank 3 LSP (1.2256), Rank 15 LHP (1.2510), Rank 16 RSP (1.2518), Rank 68 RHP (1.3031), Rank 69 RHP (1.3056), Rank 78 RUK (1.3109), Rank 98 Agogo15 (1.3241).


5.5. Evaluation of Graded Relevance Feedback (GRF)

We compared the performance of GRF for different numbers of iterations and feedback samples in terms of AnDCG. In general, the performance with GRF is better than the baseline performance (without GRF, at iteration 0). We denote by GRF@N the case where the N top-ranked samples of known relevance are fed back into the system. Figure 9 shows the result of applying Graded Relevance Feedback (GRF) to refine the initial subset selected by our proposed AFS method. For each value of N, the performance increases and converges at about 5 iterations, with the sharpest rise in the first 2 iterations. GRF@20 achieves the best performance.

Figure 10 shows the change in Precision-Recall over different numbers of GRF iterations at GRF@20, where we consider the binary relevance of retrieving the most relevant class. The retrieval performs better with an increasing number of iterations and converges at about 5 iterations, which is consistent with the result evaluated by the graded relevance measure. This shows that our proposed GRF method robustly assists the Adaptive Feature Selection in attaining better performance.

Fig. 9. The performance of Graded Relevance Feedback: AnDCG plotted against the number of GRF iterations for GRF@5, GRF@10, GRF@15, GRF@20, and GRF@25.


Fig. 10. The Precision-Recall curves for different numbers of iterations at GRF@20.

5.6. Prototype of Our Retrieval System

We have implemented a prototype of the proposed motion retrieval system and the interface is shown in Figure 11. It allows the user to open a motion file as a query example. The query motion is rendered on the left hand side (in red color) while the retrieved motions are rendered on the right hand side (in blue color).

When the user clicks the “Relevance Feedback” button, the user will be asked to click on a retrieved sample and enter its relevance with a grade value. The samples are then fed into the system that triggers the graded relevance feedback.

The newly retrieved samples will be rendered.


Fig. 11. The interface of our proposed motion retrieval system.

6. Conclusion and Future work

In this article, we presented an Adaptive Feature Selection and Graded Relevance Feedback (AFS-GRF) method that enhances the performance of an example-based motion retrieval system. In our experimental data, we organized a total of 36 classes of different movements into an Ontology hierarchy and determined their logical relevance. We modeled the logical relevance of the 3D motion capture data based on the variances of the relative distances between joints (VJRD). With the graded relevance measure, we evaluated the results retrieved by the proposed method in terms of logical relevance.

In addition, we proposed an Adaptive Feature Selection (AFS) method that abstracts the characteristics of the query by finding a Linear Regression Model that identifies the relationship between the ground-truth feature subsets and the training data. Our method adapts to different kinds of queries and is superior to existing methods that select features based only on the global statistics of the feature distribution. The proposed AFS-GRF algorithm outperforms other class-specific feature selection methods and motion retrieval methods.

Furthermore, we refined the feature selection by Graded Relevance Feedback (GRF). The system updates the retrieval result using the accumulated distances between the query and the feedback samples of different graded relevance values.


Through a small number of iterations and feedback samples, our method showed a good retrieval performance.

Our method can be applied to retrieve 3D motion capture data for reuse in producing new animations for games. We have built a prototype motion retrieval system with our proposed AFS-GRF method to illustrate this concept. As future work, the retrieved motions could be selected and stored temporarily for synthesizing new animations with techniques such as motion graphs (Kovar et al., 2002). Moreover, it is not convenient to require the user to open a motion file in order to retrieve similar motions; a possible direction is to explore more sophisticated ways to input the query, such as inputting abstract movements with a cheaper motion sensing device. Last but not least, our joint relative measurements can be extended to represent the interactions of multiple bodies.

Acknowledgement

The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 1165/09E]. The authors would like to thank the anonymous reviewers for their helpful comments to improve the paper.

Reference

Aksoy, S., and Haralick, R.M., 2000. A Weighted Distance Approach to Relevance Feedback. In:

Proc. the International Conference on Pattern Recognition (ICPR00). Barcelona , Spain, vol. 4, pp.812 – 815.

Baak, A., Müller, M., Seidel, H.P., 2008. An efficient algorithm for keyframe-based motion retrieval in the presence of temporal deformations. In: Proc. ACM International

Conference on Multimedia Information Retrieval (MIR), Vancouver, British Columbia,

Canada, pp. 451–458.

Breitman, K., Casanova, M.A., Truszkowski, W., 2006. Semantic Web: Concepts, Technologies

Verlag New York, Inc.

Buckley, C., Allan, J., Salton, G., 1995. Automatic Routing and Retrieval Using Smart: TREC-2.

Inf. Process. Manage. 31(3): 315-326 (1995)


CityU Motion Capture Lab, City University of Hong Kong, 2010. <http://mocap.cs.cityu.edu.hk/>.

Accessed: May 29, 2011.

CMU Graphics Lab Motion Capture Database, Carnegie Mellon University, 2003.

<http://mocap.cs.cmu.edu/>. Accessed: May 29, 2011.

CGSPEED, 2008, <http://www.cgspeed.com>. Accessed: March 13, 2011.

Chung, H.S., Kim, J.M., Byun, Y.C., Byun, S.Y., 2005. Retrieving and Exploring Ontology-Based

Human Motion Sequences. In: Proc: the ICCSA (3), pp. 788–797.

Deng, Z., Gu, Q., Li, Q., 2009. Perceptually consistent example-based human motion retrieval. In:

Proc. 2009 Symposium on interactive 3D Graphics and Games (I3D’09), Boston,

Massachusetts, February 27 - March 01, 2009. ACM, New York, NY, pp. 191–198.

Gruber, T.R., 1993. A translation approach to portable ontologies. Knowledge Acquisition, 5(2): 199–220.

Guarino, N., 1995. Formal ontology, conceptual analysis and knowledge representation.

International. Journal of Hum.-Comput. Stud. 43(5–6): 625-640 (December).

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of

Machine Learning Research, 3: 1157–1182 (March)

HDM Motion Capture Database (HDM05), Hochschule der Medien, 2005. <http://www.mpi-inf.mpg.de/resources/HDM05/>. Accessed: May 29, 2011.

Hotho, A., Maedche, A., Staab, S., 2001. Ontology-based text clustering. In: Proc. IJCAI-2001

Workshop “Text Learning: Beyond Supervision”, August, Seattle, USA.

Järvelin, K., 2009. Interactive Relevance Feedback with Graded Relevance and Sentence

Extraction: Simulated User Experiments. In: Proc. 18th ACM Conference on Information and knowledge management, Hong Kong, China, 26 November 2009, pp. 2053-2056.

Keogh, E.J., Palpanas, T., Zordan, V.B., Gunopulos, D., Cardle, M., 2004. Indexing large human motion databases. In: Proc. 30th VLDB Conf., Toronto, pp. 780–791.

Keskustalo, H., Järvelin, K., Pirkola, A., 2008. Evaluating the effectiveness of relevance feedback based on a user simulation model: effects of a user scenario on cumulated gain value.

Information Retrieval, 11(3): 209-228, June 2008.

Kim, T.K., Kim, H., Hwang, W., Kee, S.C., Kittler, J., 2003. Face description based on decomposition and combining of a facial space with LDA. In: Proc. IEEE International

Conference on Image Processing, pp. 877–880, Spain.

Kovar, L. Gleicher, M., Pighin, F., 2002. Motion Graphs. ACM Transactions on Graphics, 21(3), pp.473–482.

Liu, X., Chen, T., Thornton, S.M., 2003. Eigenspace updating for non-stationary process and its application to face recognition. In Pattern Recognition, 36: 1945–1959.

Müller, M., Röder, T., 2006. Motion templates for automatic classification and retrieval of motion capture data. In: Proc: 2006 ACM SIGGRAPH/Eurographics Symposium on Computer

Animation (SCA), Vienna, Austria, pp. 137–146.

Müller, M., Röder, T., Clausen, M., 2005. Efficient content-based retrieval of motion capture data.

ACM Transactions on Graphics (TOG), 24(3): 677–685.


Rocchio, J.J., 1971. Relevance Feedback in Information Retrieval. In The SMART Retrieval

System, Experiments in Automatic Document Processing, pages 313–323. Prentice Hall,

Englewood Cliffs, New Jersey, USA.

Penrose, R., 1956. On best approximate solution of linear matrix equations. In: Proc. Cambridge Philosophical Society 52: 17–19.

Sakai, T., 2003. Average gain ratio: a simple retrieval performance measure for evaluation with multiple relevance levels. In: Proc. 26th Annual international ACM SIGIR Conference on

Research and Development in information Retrieval (SIGIR’03), Toronto, Canada, July

28 – August 01, 2003, pp. 417–418.

Sakai, T., 2004. Ranking the NTCIR systems based on multigrade relevance. In: Proc. of Asia information retrieval symposium 2004, pp. 170–177.

Sakai, T., 2007. On the reliability of information retrieval metrics based on graded relevance.

Information Processing and Management, 43(2), 531–548.

Sharma, A., Paliwal, K.K., Onwubolu, G.C., 2006. Class-dependent PCA, MDC and LDA: A combined classifier for pattern classification. In Pattern Recognition, 39(7): 1215–1229.

Shum, H.P., Komura, T., Yamazaki, S., 2007. Simulating competitive interactions using singly

Tam, G.K., Zheng, Q., Corbyn, M., Lau, R.W., 2007. Motion Retrieval Based on Energy

Morphing. In: Proc. Ninth IEEE International Symposium on Multimedia, pp. 210–220.

Tang, J.K., Leung, H., Komura, T., Shum, H.P., 2008. Emulating human perception of motion similarity. Comput. Animat. Virtual Worlds, 19(3–4): 211–221.

Tusk C., Koperski K., Aksoy S., and Marchisio G. 2003. Automated feature selection through relevance feedback. In: Proc. IEEE International Geoscience and Remote Sensing

Symposium, 2003. IGARSS ’03. Vol. 6, pp. 3691–3693, July 2003.

Wang, X., Yu, Z., Wong, H.S., 2008. Searching of Motion Database Based on Hierarchical SOM.

In: Proc. IEEE International Conference on Multimedia and Expo 2008, pp.1223–1236.

Zhang, J., 2008. A Novel Video Searching Model Based on Ontology Inference and Multimodal

Information Fusion. In: Proc. 2008 international Symposium on Computer Science and

Computational Technology – Vol. 2 (December 20 - 22, 2008). ISCSCT. IEEE Computer

Society, Washington, DC, pp. 489–492.

