Neurocomputing 105 (2013) 61–69
Mining spatiotemporal video patterns towards robust action retrieval
Liujuan Cao a, Rongrong Ji b,*, Yue Gao c, Wei Liu b, Qi Tian d

a Harbin Engineering University, Harbin 150001, China
b Columbia University, New York City 10027, United States
c Department of Automation, Tsinghua University, 100086, China
d University of Texas at San Antonio, San Antonio 78249-1644, United States

* Corresponding author. Tel.: +1 917 655 3190. E-mail address: jirongrong@gmail.com (R. Ji).
Available online 17 October 2012

Abstract
In this paper, we present a spatiotemporal co-location video pattern mining approach with application to robust action retrieval in YouTube videos. First, we introduce an attention shift scheme to detect and partition the focused human actions in YouTube videos, which is based upon visual saliency modeling [13] together with face [35] and body [32] detectors. From the segmented spatiotemporal human action regions, we extract 3D-SIFT [17] descriptors. Then, we quantize all interest points detected in the reference YouTube videos into a vocabulary, based on which each individual interest point is assigned a word identity. An Apriori based frequent itemset mining scheme is then deployed over the spatiotemporally co-located words to discover co-location video patterns. Finally, we fuse both visual words and patterns and leverage a boosting based feature selection to output the final action descriptors, which incorporates the ranking distortion of conjunctive queries into the boosting objective. We carried out quantitative evaluations on both the KTH human motion benchmark [26] and 60 hours of YouTube videos, with comparisons to the state of the art.
Keywords:
Video search
Spatiotemporal descriptor
Visual vocabulary
Visual pattern mining
Social media
Scalable multimedia representation
1. Introduction
With video sharing websites like YouTube (www.youtube.com), MySpace (www.myspace.com), and Yahoo Video (video.yahoo.com), there is nowadays an ever increasing amount of user-contributed videos on the Web. To manage these growing video collections, content-based accessing, browsing, search and analysis techniques have emerged. In this paper, we investigate the task of human action retrieval from user contributed videos on the Web, which has emerging potential in many related applications such as video surveillance, abnormal action detection and human behavior analysis. More specifically, we focus on the retrieval of actor-independent actions: we only care about the motion patterns rather than the visual appearance of the actor, so the task is not near-duplicate action matching. Yet, it also differs from the traditional action recognition scenario in that there is no predefined action category, which ensures our scalability.
Searching actions in user contributed videos is not trivial. Besides the difficulties of foreground motion segmentation and representation, challenges also come from the uncontrolled quality of these user contributed videos. For instance, such videos are typically of low resolution, with unstable global (camera) motion, and of very short duration. These challenges largely degrade the robustness of state-of-the-art video search techniques [33] deployed on the popular bag-of-words representation. Meanwhile, unlike video surveillance with its stable or regular camera motion, detecting and tracking human motion in these user contributed videos is much more difficult.
However, there is still good news from user tagging as well as from the way these videos are captured: First, there is a large amount of cheaply available tags, which provide weak supervision (due to their noise) to guide the design of a robust and discriminative action representation, referred to as video patterns in this work, in contrast to the bag-of-words. Second, most consumer videos keep the person of interest in good focus compared to the other foreground and background motions, which inspires us to incorporate visual attention modeling to robustly detect and track the human actions.
A typical paradigm of content based action search involves three components, i.e. action segmentation, spatiotemporal description, and feature indexing with similarity scoring. We briefly describe them as follows:
Action segmentation refers to segmenting the human action regions from static or moving backgrounds and from other foreground motions. Different from visual tracking or background subtraction, the segmentation target is the human body and nothing else. Different from pedestrian detection, the uncontrolled quality of user contributed videos, e.g. heavy shaking and blurring, raises the challenge of detecting human actions in the wild.
Spatiotemporal description aims to robustly and discriminatively describe the segmented human actions. Ideally, this description should be at the body part level, e.g. as in the recent work on body part inference from RGB-Depth Kinect data [34]. Again, this is not realistic for user contributed videos such as YouTube, where neither depth nor stable footage is guaranteed. Subsequently, to improve upon the popular bag-of-features action representation, higher-level representations are demanded to capture the eigen statistics of human actions.
Feature indexing and similarity scoring: The final step is to build an effective feature representation by fusing the bag-of-features descriptors with their higher-level abstraction. Its ultimate goal is to improve the ranking performance for possible query actions. To this end, we will further show a way to incorporate the ranking loss of possible queries into the feature representation and indexing in a principled way.
In this paper, we tackle the problem of action retrieval from user-shared videos with three main contributions:

First, we introduce an attention shift scheme for fast and robust human motion detection and segmentation, which is deployed over the visual saliency model [13] with the integration of human face [35] and body [32] detectors to produce human-specific responses. Our attention shift scheme well exploits the recording preferences of user contributed videos, and in our quantitative evaluations it outperforms the classic mean shift [36] and particle filtering based trackers.
Second, we introduce a spatiotemporal co-location video pattern mining scheme, which is deployed over 3D-SIFT [17] based interest points. For each human motion segment, we model the spatiotemporal co-location statistics of interest point neighborhoods as transactions. We pool all transactions from videos with the same tag and run Apriori based frequent itemset mining [37] to discover co-location video patterns, which are then treated as a high-level abstraction of the bag-of-features action descriptors.
Finally, we propose to boost an optimal human action representation over both the bag-of-features and the mined patterns, incorporating the ranking loss of sampled conjunctive queries into the boosting objective. This enables us to discover, by boosting based feature selection, the most discriminative feature representation that best fits the potential action searches. Fig. 1 shows our proposed co-location video pattern mining based action search framework.

We have quantified the performance of our proposed scheme in two evaluations: First, we compare our scheme to (1) HoG + Bag-of-Features and (2) 3D-SIFT + Bag-of-Features feature representations on the KTH human action benchmark [26]. Second, we have crawled over 60 hours of videos from YouTube, based on which we evaluate our action search scheme with comparisons to state-of-the-art alternatives.
The rest of this paper is organized as follows: Section 2 reviews
our related work. Section 3 introduces our attention shift based
human action segmentation and partition. Section 4 presents our
spatiotemporal co-location video pattern mining. Section 5 presents our boosting based discriminative feature representation
and indexing. We give quantitative evaluations in Section 6.
Finally, we conclude in Section 7 and discuss our future work.
2. Related work
Action search: While many previous works have focused on visual search and recognition of still images [40–47], limited work has directly handled human action retrieval. Shih et al. [5] proposed to search for a single foreground action against a static background. Ferrari et al. [6] searched human poses by combining human body parts with motion trajectory similarity. However, human body estimation is not precise enough, especially for videos containing severe viewpoint changes, as in our scenario.

Our work follows an unsupervised setting, which differs from action recognition [7–11] that classifies human actions into several predefined categories and therefore cannot be well scaled up.
Visual saliency: Visual saliency analysis in videos aims to capture the specific foreground motion that is the focus of video viewers. As shown in [12], such salient motion is transferable from one actor to another. Taking advantage of this, recent works [14–16] further extended the image saliency analysis of Itti et al. [13] to the video domain. For instance, Li et al. [16] presented a dynamic attention model that combines spatial and temporal saliency to detect switching of the focused region.
Video pattern mining: There are several previous works on video pattern mining [28–31]. For instance, Sivic et al. [28] adopted the spatial configuration of viewpoint invariant features from principal objects or scenes to mine the most frequent feature combinations. Quack et al. [29] proposed to mine frequently occurring objects and scenes by finding recurring spatial arrangements of affine covariant regions. Yuan et al. [30] proposed to mine a visual phrase lexicon, each phrase being a meaningful spatially co-occurrent pattern of visual words. Quack et al. [31] further proposed to mine spatial configurations of local features occurring frequently on a given object. Compared to our method, all the above works output the mined video patterns without regard to an optimal combination that improves the subsequent video (action) search performance.

Fig. 1. The proposed co-location video pattern mining based action search framework.
Spatiotemporal interest points: Spatiotemporal interest points model actions as a bag-of-words representation [26,32], which in general involves three phases: (1) extracting spatiotemporal interest points from action videos, (2) clustering the interest points based on their descriptors (e.g. histograms of oriented gradients [32], SIFT [1]), and (3) classifying/recognizing actions using the occurrence frequencies of the clustered points. Research has focused on each step to improve the final action recognition performance. For instance, to extract robust and informative interest points, various criteria have been proposed, e.g. cornerness [21]. Liu et al. [10] further clustered interest points by exploring the correlation between points and action categories.
3. Robust human action extraction
We define an action as a consecutive spatiotemporal region set, which contains the appearance and movement of the same person within a video shot. Given another video shot as the query input, action retrieval is to retrieve shots in the video dataset that contain identical or near-identical actor movements to the query. To clarify, we focus on ‘‘actor-independent’’ action retrieval, which considers actions with similar motion trajectories as similar, without regard to their appearance differences.
As discussed in Section 1, our first step is to distill actions from videos, which subdivides each shot into several human-focused actions and eliminates interference from (multiple) concurrent actions and global motions (caused by camera motion). We propose an attention shift model to achieve this goal, as outlined in Fig. 2, which combines Focus of Attention (FOA) detection with a switching mechanism so as to partition spatiotemporally salient actions in each shot. In particular, we also incorporate the face detector [35] and body part detector [32] to post-filter the partitioned motion regions, such that motions without humans (e.g. vehicles or animals) are filtered out.
3.1. Extracting focused actions
As preliminary processing, we first carry out shot boundary detection based on graph partition, which has been demonstrated to be effective in the TRECVID 2005 SBD competition.

To distinguish foreground actions from backgrounds, a straightforward strategy is background modeling, such as a Gaussian mixture model. However, it cannot handle the scenario of moving cameras. Moving object tracking is another feasible solution to locate the actions; however, how to detect changes of the focused action is left unexplored. We address this issue by a spatiotemporal saliency detection, which is well investigated in the image domain [13] using a saliency map to detect human-focused regions. In this paper, we present a spatiotemporal saliency analysis strategy to locate the human-focused foreground actions and to eliminate interference from concurrent actions and camera motions.
Spatial saliency: We detect the spatial saliency by using the saliency map in [13]. We compute 42 feature maps: 6 for intensity, 12 for color, and 24 for orientation. The spatial saliency r_p^{Spatial} of each pixel p (at location (x, y)) is measured by the weighted summation of all conspicuity maps from the different channels:

    r_p^{\mathrm{Spatial}} = \sum_{n} g_n M_n(p), \quad n \in \{I, C, O\},    (1)

where I, C, O represent the three channels (intensity, color, orientation) and g_n is the weighting parameter. The conspicuity map M_n is calculated by summing up all feature maps belonging to the corresponding feature channel.
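To make Eq. (1) concrete, the following minimal Python sketch combines per-channel conspicuity maps into a spatial saliency map. It assumes the 42 Itti-style feature maps have already been computed and grouped by channel; the map values and channel weights g_n below are placeholders rather than values used in the paper.

```python
import numpy as np

def spatial_saliency(feature_maps, weights):
    """Combine per-channel conspicuity maps into a spatial saliency map (Eq. 1).

    feature_maps: dict channel -> list of 2-D arrays (precomputed feature maps)
    weights:      dict channel -> scalar weight g_n (placeholder values)
    """
    channels = ("I", "C", "O")  # intensity, color, orientation
    saliency = None
    for n in channels:
        # Conspicuity map M_n: sum of all feature maps of this channel.
        M_n = np.sum(np.stack(feature_maps[n], axis=0), axis=0)
        contrib = weights[n] * M_n
        saliency = contrib if saliency is None else saliency + contrib
    return saliency

# Toy usage with random maps standing in for the 6/12/24 Itti feature maps.
h, w = 120, 160
maps = {"I": [np.random.rand(h, w) for _ in range(6)],
        "C": [np.random.rand(h, w) for _ in range(12)],
        "O": [np.random.rand(h, w) for _ in range(24)]}
g = {"I": 1.0, "C": 1.0, "O": 1.0}   # placeholder channel weights
r_spatial = spatial_saliency(maps, g)
```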
Motion saliency: For the successive frames in a shot, a motion saliency map is built using the dynamic attention model [16], which detects human-focused motions that are both constant inside the object and salient against the background.

First, we adopt the optical flow [18] feature to determine the number of concurrent motions within each shot. For each point p, we calculate a weighted structural tensor M of the optical flow (u, v, w). Once the optical flows (u, v, w) are constant within a neighborhood N, M should be rank deficient with rank(M) ≤ 2. We then select the first three eigenvalues λ_1, λ_2, λ_3 to define the continuous rank-deficiency measurement d_M:

    d_M(p) = \begin{cases} 0, & \mathrm{tr}(M) < \gamma, \\ \dfrac{\lambda_3^2}{0.5\,\lambda_1^2 + 0.5\,\lambda_2^2 + \epsilon}, & \text{otherwise}, \end{cases}    (2)

where γ is adopted to handle the case rank(M) = 0. Then, the motion saliency of each point p is detected as (ε is a constant tuning parameter)

    r^{\mathrm{Temporal}}(x, y) = \dfrac{d_M(x, y)}{d_M(x, y) + \epsilon}.    (3)
Spatiotemporal FOA: We then multiply the spatial and temporal saliency scores of each candidate grid to obtain a spatiotemporal saliency map, in which the saliency of each pixel p is determined by

    r_p^{\mathrm{Spatiotemporal}} = \sum_{i \in N_p^{\mathrm{Spatial}}} r_i^{\mathrm{Spatial}} \times \sum_{j \in N_p^{\mathrm{Temporal}}} r_j^{\mathrm{Temporal}}.    (4)

We first determine the FOA center by finding the maximum of \sum_{q \in N_p} r_q^{\mathrm{Spatiotemporal}} G(q), where G is a normalized Gaussian function centered at p and covering the same number of points as N_p^{\mathrm{Spatial}}.
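A rough sketch of Eq. (4) and the FOA-center rule is given below, assuming per-frame spatial and temporal saliency maps are already available. The box-filter pooling over N_p and the Gaussian-blurred maximum are our approximations of the neighborhood sums and the G(p) weighting, not the authors' exact implementation; scipy.ndimage is assumed for the filtering.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spatiotemporal_saliency(r_spatial, r_temporal, win=9):
    """Per-pixel product of locally pooled spatial and temporal saliency (Eq. 4).
    The sums over N_p^Spatial / N_p^Temporal are approximated by a box filter
    of size `win` (an assumed neighborhood size, not a value from the paper)."""
    pooled_spatial = uniform_filter(r_spatial, size=win) * win * win
    pooled_temporal = uniform_filter(r_temporal, size=win) * win * win
    return pooled_spatial * pooled_temporal

def foa_center(r_st, sigma=5.0):
    """Locate the focus-of-attention center as the maximum of the
    Gaussian-weighted (G(p)) spatiotemporal saliency."""
    smoothed = gaussian_filter(r_st, sigma=sigma)
    return np.unravel_index(int(np.argmax(smoothed)), smoothed.shape)

# Toy usage on random saliency maps.
h, w = 120, 160
r_sp, r_tm = np.random.rand(h, w), np.random.rand(h, w)
center_y, center_x = foa_center(spatiotemporal_saliency(r_sp, r_tm))
```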
Fig. 2. The proposed attention shift model for focused action segmentation and partition.
Second, we define a locating rectangle in each frame to distinguish the human-focused foreground region from the background and concurrent motions as Rec_{X,Y} = \{\forall x_i \in X, \forall y_i \in Y \mid x_{min} \le x_i \le x_{max}, y_{min} \le y_i \le y_{max}\}. We expand the rectangle box from the FOA center until the box encloses all pixels with saliency strength higher than a pre-defined threshold T_{Saliency}. For pixels within this box, once the optical flow similarity between a pixel and the background is lower than a pre-defined threshold T_{OF}, we remove that pixel from the FOA region.

For all the detected action regions, we further use the cascade face detector [35] together with the HOG (Histograms of Oriented Gradients) [32] human detector to decide whether a moving foreground is a human. As a result, spatiotemporal regions without humans are filtered out from our subsequent processing.
3.2. Detecting attention shift
For each detected focused action sequence, a ‘‘shift’’ from one person to another is identified in order to partition action regions that belong to different people. To measure whether an action is performed by the same person, we define a shifting rate A_{Shift} in Eq. (5), which evaluates the spatial consistency of the concerned action region frames for action partition:

    A_{\mathrm{Shift}} = \begin{cases} 1, & \mathrm{CenDis}(i, j) > T_C \ \&\ \mathrm{DiaVar} < T_D, \\ 0, & \mathrm{CenDis}(i, j) < T_C \ \&\ \mathrm{DiaVar} > T_D, \end{cases}    (5)

where CenDis(·) is the distance between the geometrical centers of two given regions i and j, and DiaVar(·) is the variance of their diameters. Once A_{Shift} between the target action regions (i and j) of successive frames satisfies Eq. (5), an ‘‘Attention Shift’’ is detected. From each detected ‘‘Attention Shift’’ position, we segment the action region sequence into two different actions.
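A minimal sketch of the shift test in Eq. (5); the diameter-variance proxy and the handling of the undecided case are our assumptions, and the thresholds T_C and T_D are placeholders.

```python
import math

def attention_shift(center_i, center_j, dia_i, dia_j, t_c, t_d):
    """Shift test of Eq. (5) between the focused regions of two successive
    frames. Thresholds t_c, t_d are placeholders, not values from the paper."""
    cen_dis = math.dist(center_i, center_j)     # CenDis: distance of region centers
    dia_var = (dia_i - dia_j) ** 2 / 2.0        # DiaVar: two-sample variance of the diameters
    if cen_dis > t_c and dia_var < t_d:
        return 1                                # attention shift: cut the sequence here
    if cen_dis < t_c and dia_var > t_d:
        return 0                                # same person keeps the focus
    return 0                                    # undecided cases treated as "no shift" (assumption)

# Toy usage.
print(attention_shift((10, 12), (85, 40), 30.0, 31.0, t_c=50.0, t_d=5.0))  # -> 1
```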
4. Mining co-location action patterns

4.1. Spatiotemporal descriptor
We adopt the 3D-SIFT [17] feature to characterize a given action segment based on its spatiotemporal interest points. Without loss of generality, other spatiotemporal features [19–24] could also be adopted in this phase, such as the Hessian-STIP [19] and 3D-Gabor [20]. Following the principle of [17], for each pixel in the action region we encode the gradient magnitudes and orientations within its spatiotemporal 3D space, in which the spatial and temporal angles (θ and φ) are calculated as

    \theta(x, y, t) = \arctan(L_y / L_x),    (6)

    \phi(x, y, t) = \arctan\!\left( \dfrac{L_t}{\sqrt{L_x^2 + L_y^2}} \right).    (7)

For pixel (x, y), L_x = L(x+1, y, t) - L(x-1, y, t) and L_y = L(x, y+1, t) - L(x, y-1, t) represent the gradients in the x and y directions, respectively, and L_t = L(x, y, t+1) - L(x, y, t-1) represents the gradient in the t direction.

Pixels that are local maxima or minima among their spatiotemporal neighbors are retained as interest point candidates. Rotation invariance is then achieved by aligning each key point with its dominant 3D direction (rotating to θ = φ = 0). Subsequently, we sample the spatiotemporal neighborhood regions (3 × 3 in the spatial domain and 8 in the temporal domain, each containing 4 × 4 × 4 spatiotemporal pixels) around the target interest point. For each 3D sub-region, similar to SIFT [1], we divide θ and φ into equal bins to quantize the orientations into a histogram. Finally, we assemble these quantizations over all 3D sub-regions of the interest point to form the final 3D-SIFT descriptor.
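The per-pixel gradient angles of Eqs. (6) and (7) can be sketched as follows; arctan2 is used instead of a plain arctan for quadrant handling (a small deviation from the printed formula), and the volume[x, y, t] axis ordering is an assumption.

```python
import numpy as np

def gradient_angles(volume, x, y, t):
    """Spatial and temporal orientation angles of a 3-D video volume
    at pixel (x, y, t), following Eqs. (6)-(7)."""
    L = volume
    Lx = L[x + 1, y, t] - L[x - 1, y, t]     # gradient along x
    Ly = L[x, y + 1, t] - L[x, y - 1, t]     # gradient along y
    Lt = L[x, y, t + 1] - L[x, y, t - 1]     # gradient along time
    theta = np.arctan2(Ly, Lx)               # arctan(Ly / Lx), quadrant-aware
    phi = np.arctan2(Lt, np.sqrt(Lx ** 2 + Ly ** 2))
    magnitude = np.sqrt(Lx ** 2 + Ly ** 2 + Lt ** 2)
    return theta, phi, magnitude

# Toy usage on a random grayscale clip indexed as volume[x, y, t].
clip = np.random.rand(64, 64, 16)
th, ph, m = gradient_angles(clip, 10, 10, 5)
```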
4.2. Spatiotemporal vocabulary
Based on the 3D-SIFT features extracted from all 3D action segments, we build a visual vocabulary using hierarchical K-Means [2–4]. Each quantized word contains the descriptors that are closer to the feature center of this word than to the others in the 3D-SIFT feature space. As a result, each action segment is represented as a bag-of-features (BoF).

For a focused action, the BoF sequence encodes spatial, motion, and appearance information all together in the action representation. To ensure scalability, we adopt our indexing model to conduct an approximate nearest-neighbor search: In retrieval, for each 3D-SIFT descriptor extracted from the query action, the (inverted) indexed action clips falling within the identical visual word are picked out as ranking candidates.

Given a visual vocabulary, a query action q (or an action clip d) can be represented as an N-dimensional vector of words. The relevance between q and d can be calculated as the cosine similarity between their BoF vectors. That is,

    r(d, q) = \dfrac{\sum_{i=1}^{N} w_{di} w_{qi}}{|d|\,|q|},    (8)

where w_{di} is the weight of the ith word in action clip d and w_{qi} is the weight of the ith word in the query action clip q. Similar to document retrieval, Term Frequency (TF) and Inverse Document Frequency (IDF) [25] weighting can also be incorporated.

Indexing: We further inverted-index each action segment under its corresponding non-zero words. As a result, the candidate action selection in online search is very fast due to our hierarchical model structure: For each search, the time complexity is O(log(N)) in a vocabulary with N words.
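The BoF relevance of Eq. (8) together with inverted-index candidate selection can be sketched as below; TF-IDF weighting and the hierarchical quantization step are omitted, and all sizes are toy values.

```python
import numpy as np
from collections import defaultdict

def cosine_relevance(bof_d, bof_q):
    """Relevance between an indexed action clip and a query clip as the
    cosine similarity of their BoF vectors (Eq. 8)."""
    denom = np.linalg.norm(bof_d) * np.linalg.norm(bof_q)
    return float(np.dot(bof_d, bof_q) / denom) if denom > 0 else 0.0

def build_inverted_index(bof_vectors):
    """Map each visual word id to the clips in which it has non-zero weight."""
    index = defaultdict(list)
    for clip_id, vec in enumerate(bof_vectors):
        for word_id in np.nonzero(vec)[0]:
            index[int(word_id)].append(clip_id)
    return index

def search(query_vec, bof_vectors, index, top_k=10):
    """Candidate selection through the inverted index, then cosine ranking."""
    candidates = set()
    for word_id in np.nonzero(query_vec)[0]:
        candidates.update(index.get(int(word_id), []))
    scored = [(cosine_relevance(bof_vectors[c], query_vec), c) for c in candidates]
    return sorted(scored, reverse=True)[:top_k]

# Toy usage with a 1000-word vocabulary and 50 indexed clips.
rng = np.random.default_rng(0)
db = rng.poisson(0.02, size=(50, 1000)).astype(float)
idx = build_inverted_index(db)
print(search(db[3], db, idx, top_k=5))
```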
4.3. Mining co-location action patterns
Concurrent word coding: Suppose we have in total M tags crawled from the YouTube videos. For each tag, we have a set of human action segments whose original videos were labeled with this tag. We perform the co-location video pattern mining offline over all tags and their reference action segments, which outputs a set of video patterns as a pattern pool for the subsequent boosting in Section 5.

For each reference action segment I_i (i ∈ [1, n]) of a given tag, suppose there are J 3D-SIFT descriptors L(i) = [L_1(i), ..., L_J(i)] extracted from I_i, with their spatial positions S(i) = [S_1(i), ..., S_J(i)] (each S_j, j ∈ [1, J], belonging to the jth local descriptor). We quantize L(i) into an m-dimensional Bag-of-Features histogram V(i) = [V_1(i), ..., V_m(i)]. For each S_j(i), we scan its K-nearest neighborhood to identify all concurrent words:

    T_j(i) = \{ L_{j'} \mid L_{j'}(i) \in L(i),\ S_{j'}(i) \in \mathrm{Neighbor}_K(S_j(i)) \},    (9)

where T_j(i) is the item (if any) built for S_j(i) in I_i. The ensemble of all possible items from I_i is

    T(i) = \{ T_j(i) \}_{j \in [1, J]}.    (10)

Again, the ensemble of all possible T(i) for i ∈ [1, n] forms an itemset D_k = {T(1), ..., T(n)} for [I_1, ..., I_n] at pattern order k. From D_k, we are going to mine n_p patterns, denoted P_k = {P_1, ..., P_{n_p}}.
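A possible reading of the transaction construction in Eqs. (9) and (10): one itemset per quantized interest point, built from its K nearest neighbors in (x, y, t). Including the center word itself and using Euclidean distance are assumptions; the paper's Neighbor_K definition may differ.

```python
import numpy as np

def build_transactions(word_ids, positions, k=5):
    """Build one spatiotemporal transaction per interest point (Eqs. 9-10).

    word_ids:  length-J array of quantized word identities L_1..L_J
    positions: (J, 3) array of (x, y, t) positions S_1..S_J
    k:         neighborhood size (placeholder value)
    """
    positions = np.asarray(positions, dtype=float)
    transactions = []
    for j in range(len(word_ids)):
        # K nearest interest points around S_j.
        dists = np.linalg.norm(positions - positions[j], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]              # skip the point itself
        item = frozenset(int(word_ids[n]) for n in neighbors) | {int(word_ids[j])}
        transactions.append(item)
    return transactions

# Toy usage: six interest points with quantized word ids and (x, y, t) positions.
words = np.array([3, 7, 3, 9, 7, 1])
pos = np.random.rand(6, 3)
print(build_transactions(words, pos, k=2))
```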
Distance based pattern mining: Previous work on visual pattern mining mainly resorted to Transaction based Co-location pattern Mining (TCM): For instance, the works in [28–30] built transaction features by coding the k nearest 2D spatially nearby words. This spatial configuration can be further refined by the scales of the interest points to impose scale invariance on the transactions [31]. A transaction in TCM can be defined by coding the spatial layout of neighboring words. Then, frequent itemset mining schemes like Apriori [37] can be further deployed to discover word combinations (of maximal order k) as patterns.

TCM can be formulated as follows: Let V = {V_1, V_2, ..., V_m} be the set of all potential items and let D = {T_1, T_2, ..., T_n} be the set of transactions. Each T_i is an itemset, i.e. a combination of some of the items in V. Any two transactions T_i and T_j are not duplicated.
We then define the support of an itemset A (A ⊆ D) as follows:

    \mathrm{support}(A) = \dfrac{|\{ T \in D \mid A \subseteq T \}|}{|D|}.    (11)

If and only if support(A) ≥ s, the itemset A is defined as a frequent itemset of D, where s is the threshold restricting the minimal support rate. We then define the confidence of each frequent itemset as follows:

    \mathrm{confidence}(A \rightarrow B) = \dfrac{|\{ T \in D \mid (A \cup B) \subseteq T \}|}{|\{ T \in D \mid A \subseteq T \}|} = \dfrac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)},    (12)

where A and B are two subsets. The confidence in Eq. (12) is defined as the maximal correctness likelihood of B given that A is also correct. The idea of confidence based restriction can guarantee that the co-location visual patterns discover the minimal item subsets that are most helpful in representing the visual features at a given order K (k is initialized from 2 to K in our implementation).
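The support and confidence measures of Eqs. (11) and (12) translate directly into code; the sketch below treats each transaction as a set of word identities.

```python
def support(itemset, transactions):
    """support(A) = |{T in D : A is a subset of T}| / |D|  (Eq. 11)."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """confidence(A -> B) = support(A union B) / support(A)  (Eq. 12)."""
    sup_a = support(a, transactions)
    if sup_a == 0:
        return 0.0
    return support(set(a) | set(b), transactions) / sup_a

# Toy usage with word ids as items.
D = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({2, 4}), frozenset({1, 2, 4})]
print(support({1, 2}, D))          # 0.75
print(confidence({1}, {2}, D))     # 1.0
```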
We further define the Association Hyperedge of a subset A = {V_1, V_2, ..., V_l} as

    \mathrm{AH}(A) = \dfrac{1}{N} \sum_{i} \mathrm{confidence}(A - \{V_i\} \rightarrow V_i),    (13)

which gives a minimal association hyperplane to bound A. By checking all possible itemset combinations in D from order 2 to K, the itemsets with support(·) ≥ s and AH ≥ g are defined as frequent patterns.

However, the TCM procedure would generate repeated patterns in texture regions that contain dense words, which degrades its discriminability. Distance-based Co-location Pattern Mining (DCM) refines TCM by introducing two new measurements, the participation index (pi) and the participation ratio (pr), which overcome TCM's oversensitivity in texture regions.
Formally speaking, let C = {V_1, V_2, ..., V_m} be the set of in total m words. DCM first introduces the R-reachable measurement as the basis of both pi and pr: Two words V_i and V_j are R-reachable if and only if dis(V_i, V_j) < d_{thres}, where dis(·) is a distance metric such as the Euclidean distance and d_{thres} is the distance threshold. For a given word V_i, we define the participation ratio pr(C, V_i) as the percentage of subsets C − {V_i} in C that are R-reachable:

    \mathrm{pr}(C, V_i) = \dfrac{|\pi_{V_i}(\mathrm{instance}(C))|}{|\mathrm{instance}(V_i)|},    (14)
where π is the relational projection operation with deduplication. The participation index pi is then defined as

    \mathrm{pi}(C) = \min_{i=1}^{k} \{ \mathrm{pr}(C, V_i) \},    (15)

where pi describes the frequency of the subset C in the neighborhood. Note that only item subsets with pi larger than a given threshold can be defined as patterns in DCM. The mining procedure is then identical to the Apriori [37] process. Algorithm 1 outlines the workflow of our proposed scheme.

Algorithm 1. Distance based spatiotemporal co-location video pattern mining.

1: Input: visual vocabulary V; reference action segments {I_i}_{i=1}^{n}; reference actions with respect to tag Object_j, {{I_i^{Object_j}}_{i=1}^{n_{reference}}}_{j=1}^{n_{object}}; bag-of-words histograms {V(1), ..., V(n)}; support threshold s; confidence threshold g; maximal pattern order K; sparse factor a.
2: Output: CBoP pattern set {Q_i}_{i=1}^{t_{selected}}.
3: Spatiotemporal coding: for T ∈ I_j (j ∈ [1, J]) do
4:     Build itemset {D}^{Object_j} candidates by 3D sphere coding of the 3D points using Eq. (9);
5: end
6: Ensemble {D}^{Object_j} for all tags (j ∈ [1, n_{object}]) as D = {D}_{i=1}^{t};
7: Distance based pattern mining:
8: Calculate the supports and confidences of all patterns within P using Eqs. (11) and (12);
9: Filter out unreliable patterns with thresholds s and g;
10: Generate the mined pattern collection P = {P_i}_{i=1}^{t} using Eqs. (15) and (14).
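The participation ratio and participation index of Eqs. (14) and (15) can be sketched as follows. This is a simplified, pairwise reading of R-reachability in which a word instance participates if every other word of the co-location has an instance within d_thres; the paper's exact instance enumeration (the projection π) may differ.

```python
import numpy as np

def r_reachable_instances(instances_by_word, colocation, d_thres):
    """For each word in `colocation`, find its instances that participate in at
    least one R-reachable grouping of the full co-location C (simplified)."""
    participating = {w: set() for w in colocation}
    for w_i in colocation:
        for idx_i, p_i in enumerate(instances_by_word[w_i]):
            # The instance participates if every other word has some instance
            # within distance d_thres of it (pairwise approximation).
            ok = all(
                any(np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)) < d_thres
                    for p_j in instances_by_word[w_j])
                for w_j in colocation if w_j != w_i)
            if ok:
                participating[w_i].add(idx_i)
    return participating

def participation_index(instances_by_word, colocation, d_thres):
    """pi(C) = min_i pr(C, V_i), with pr the fraction of a word's instances
    that take part in the co-location (Eqs. 14-15)."""
    part = r_reachable_instances(instances_by_word, colocation, d_thres)
    ratios = [len(part[w]) / max(len(instances_by_word[w]), 1) for w in colocation]
    return min(ratios)

# Toy usage: two words with 2-D instance positions.
inst = {7: [(0, 0), (10, 10)], 9: [(0, 1), (50, 50)]}
print(participation_index(inst, [7, 9], d_thres=3.0))   # 0.5
```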
5. Boosting discriminative features
Finally, we present a boosting based discriminative feature selection over both the visual words and the mined visual patterns. We sub-sample a set of conjunctive queries from the reference action segments, based on which we define our boosting objective as minimizing the ranking distortions of the ground truth labels. We describe our boosting procedure as follows:
Conjunctive queries: We sample a set of conjunctive queries similar to [38], denoted [I'_1, ..., I'_{n_{sample}}]. Then, each action is used to search for similar actions using the original bag-of-features over V, which results in the following training set:

    \mathrm{Query}(I'_1) = [A^1_1, A^1_2, \ldots, A^1_R],
        \vdots
    \mathrm{Query}(I'_{n_{sample}}) = [A^{n_{sample}}_1, A^{n_{sample}}_2, \ldots, A^{n_{sample}}_R],    (16)

where A^j_i is the ith returned result of the jth query. We expect the boosted words/patterns from P ∪ V to retain [A^j_1, A^j_2, ..., A^j_R] for each jth action query as much as possible.
To this end, we define [w_1, ..., w_{n_{sample}}] as an error weighting vector over the n_{sample} queries in the query log, which measures the ranking consistency loss in the word/pattern selection. We denote the selected word/pattern subset as C. At the tth boosting iteration, we have the currently selected (t−1) words or patterns C^{t−1}. To select the next (tth) discriminative word or pattern from P ∪ V, we estimate the ranking preservation of the current selection C^{t−1} as
    \mathrm{Loss}(I'_i) = w^{t-1}_i \sum_{r=1}^{R} \mathrm{Rank}(A^i_r)\, W_{A^i_r}\, \| f(C^{t-1}_{I'_i}) - V_{A^i_r} \|^2,    (17)
where i ∈ [1, n_{sample}]; Rank(A^i_r) is the current position of the originally rth returned result of the action query I'_i; and f(·) is the bag-of-features recovery function. [w^{t-1}_1, ..., w^{t-1}_{n_{sample}}] is the (t−1)th error weighting, which measures the ranking loss of the jth query (j ∈ [1, n_{sample}]). Then, the overall ranking loss is
    \mathrm{Loss}_{\mathrm{Rank}} = \sum_{i=1}^{n_{sample}} w^{t-1}_i \sum_{r=1}^{R} \mathrm{Rank}(A^i_r)\, W_{A^i_r}\, \| f(C_{I'_i}) - V_{A^i_r} \|^2.    (18)
And the next best word or pattern C_t is selected from P ∪ V by minimizing

    C_t = \arg\min_{j} \sum_{i=1}^{n_{sample}} w^{t-1}_i \sum_{r=1}^{R} \mathrm{Rank}(A^i_r)\, W_{A^i_r}\, \| f(\{C + C_j\}_{I'_i}) - V_{A^i_r} \|^2.    (19)
Subsequently, we update the error weighting w^{t-1}_i of each ith query according to its corresponding loss in Eq. (19).
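The boosting loop of Eqs. (17)-(19) amounts to a greedy, re-weighted feature selection. The sketch below leaves Rank(·), W, and the recovery error ||f(·) − V||² as caller-supplied callbacks, since their exact computation depends on the index; the weight normalization at the end of each round is our addition, not stated in the paper.

```python
def boost_select(candidates, queries, returned, rank_of, weight_of, recover, n_select):
    """Greedy boosting-style selection of words/patterns (sketch of Eqs. 17-19).

    candidates: list of candidate word/pattern ids (the pool P union V)
    queries:    list of query ids I'_1 ... I'_n_sample
    returned:   dict query id -> list of returned action ids A_1 ... A_R (Eq. 16)
    rank_of:    callback (query, result) -> current rank position Rank(A_r)
    weight_of:  callback result -> importance weight W_{A_r}
    recover:    callback (selected ids, query, result) -> squared recovery error
                under the current selection
    n_select:   number of words/patterns to keep
    """
    w = {q: 1.0 / len(queries) for q in queries}        # per-query error weights
    selected = []
    for _ in range(n_select):
        best, best_loss = None, float("inf")
        for c in candidates:
            if c in selected:
                continue
            trial = selected + [c]
            # Weighted, rank-aware recovery loss of the trial selection (Eq. 19).
            loss = sum(w[q] * sum(rank_of(q, a) * weight_of(a) * recover(trial, q, a)
                                  for a in returned[q])
                       for q in queries)
            if loss < best_loss:
                best, best_loss = c, loss
        if best is None:
            break
        selected.append(best)
        # Re-weight each query by its residual ranking loss (Eq. 17), then normalize.
        for q in queries:
            w[q] = sum(rank_of(q, a) * weight_of(a) * recover(selected, q, a)
                       for a in returned[q])
        total = sum(w.values()) or 1.0
        w = {q: w[q] / total for q in queries}
    return selected
```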
6. Experimental results
6.1. Action search in 60-hour YouTube videos
Database: In our experiments, we have crawled over 60 hours of YouTube videos from the ‘‘most viewed’’ lists between September and October 2010. We partition these videos into shots and distill the focused actions from each shot. Then, we filter out actions with durations of less than 2 s. After this filtering, we obtain over 6000 actions with sufficient duration and focus. Among them, 200 actions are selected to build our query set. We construct the database from the remaining 5800 actions and offline extract 3D-SIFT descriptors from each of them.
Search model: We build a 4-branch, 24-level vocabulary using hierarchical K-Means clustering [2] to generate a BoF vector for each action segment. In tree construction, if a node has fewer than 2000 features, we stop its K-Means division, no matter whether it has reached the deepest level or not. On a computer with an Intel Pentium IV 3.00 GHz CPU and 1.0 GB RAM, the typical time and memory costs are: tree construction time 2 h, retrieval time 4 s per action (including feature extraction), and memory cost 200 MB.
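The vocabulary-tree construction described above (4 branches, splitting stopped once a node holds fewer than 2000 features or the maximum depth is reached) can be sketched as follows; scipy's kmeans2 merely stands in for whatever K-Means implementation was actually used.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

class VocabNode:
    def __init__(self):
        self.centers = None        # (branch, dim) cluster centers of this node
        self.children = []         # one child per branch; empty at a leaf
        self.word_id = None        # leaf id assigned after construction

def build_tree(features, branch=4, min_node=2000, depth=0, max_depth=24):
    """Hierarchical K-Means vocabulary tree: a node stops splitting when it
    holds fewer than `min_node` features or the maximum depth is reached."""
    node = VocabNode()
    if len(features) < min_node or depth >= max_depth:
        return node                                    # leaf node
    node.centers, labels = kmeans2(features, branch, minit="++")
    for b in range(branch):
        node.children.append(
            build_tree(features[labels == b], branch, min_node, depth + 1, max_depth))
    return node

def assign_word_ids(node, next_id=0):
    """Depth-first numbering of the leaves, so each leaf becomes a visual word."""
    if not node.children:
        node.word_id = next_id
        return next_id + 1
    for child in node.children:
        next_id = assign_word_ids(child, next_id)
    return next_id

def quantize(node, feat):
    """Descend from the root to the leaf (visual word) of one descriptor."""
    while node.children:
        b = int(np.argmin(np.linalg.norm(node.centers - feat, axis=1)))
        node = node.children[b]
    return node.word_id

# Toy usage with random 64-D stand-ins for 3D-SIFT descriptors.
feats = np.random.rand(10000, 64)
root = build_tree(feats)
n_words = assign_word_ids(root)
word = quantize(root, feats[0])
```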
Evaluation protocol: We use precision@N to evaluate the search performance, which is widely used in image/document retrieval systems. For each given method, after automatically ranking the top 10 similar actions in our database, we ask a group of subjects to label the similar actions: That is, for each query action, users are asked to give a binary judgment on whether a returned action is similar, which enables us to measure precision@N. (Note that we cannot go through the entire database, so we did not measure recall in our experiments.)
With and without attention shift: First, we examine the reasonability of our action segmentation based on the Attention Shift model. With identical implementations of the remaining steps, we compare our Attention Shift model with a method that applies global motion compensation and includes all moving foregrounds in the action representation. This alternative approach is denoted as ‘‘Global Compensation’’ in our discussions.

Fig. 3. The quantitative results of attention shift based action segmentation with comparisons to the ‘‘Global Compensation’’ based scheme.
As presented in Fig. 3, using the Attention Shift model for action selection, the precision@N is better than with the Global Compensation approach. This derives from the fact that the salient action representation can automatically and effectively capture the human-concerned action regions without loss of precision.
Action extraction and recognition are still open problems. To design an appropriate action extraction algorithm, one key consideration is the constitution of the actions and videos in the corresponding database. We have found that for YouTube videos the capturer typically focuses on the foreground human action (if any), in contrast to some other post-edited videos. In addition, in most cases these action regions occupy a large portion of the entire shot, being mostly close-up rather than telephoto shots. In cases that contain multiple acting people (concurrent actions), most shots have only one salient action at each time stamp. Based on these two observations, we believe our saliency-based method is very suitable for extracting actor actions from sitcoms or movies. Nevertheless, for wide baseline or telephoto videos, our method is not the best choice, especially when there are multiple moving objects with small occupying windows.

We also include face detection and tracking to enhance our saliency map based foreground action detection. Furthermore, we constrain the scales of the extracted foreground, which further eliminates unstable extractions from small action regions.
Quantitative comparisons: Third, we present a group of comparisons of our video pattern based action search against: (1) action search based on a partial 3D-SIFT description, and (2) action search based on optical flow. The first baseline calculates action similarity with an identical procedure based on a simple bag-of-features, retaining only the temporal dimensions of the 3D-SIFT features. It can be regarded as another case of actor-independent search, since only the motion information is kept in the action representation. The second baseline directly calculates optical flow features [18] to replace the 3D-SIFT features, with the remaining steps identical to our framework.

As presented in Fig. 4, our strategy reports the best performance among all three approaches. We analyze this figure from the following two aspects:
For consumer videos, due to their low quality, bag-of-features approaches might not be an optimal choice. To this end, the capability to capture eigen representations (or latent feature spaces) is a fundamental step towards good search performance. We achieve this through visual pattern mining and ranking loss integrated boosting.

Adopting optical flow features alone cannot achieve satisfactory results compared with 3D-SIFT features. Rather than being used directly, the main advantage of optical flow lies in its capability to build correspondences between frames.

Fig. 4. The quantitative results of our co-location video pattern mining for action search.

Fig. 5. Action search examples of our proposed boosting based action representation.
Fig. 5 further shows case studies of action retrieval results for four representative queries. Each row corresponds to one query operation. The left frame shows the query shot with its detected action; the remaining frames show the top four returned results.
Efficiency analysis: We further give the time cost of our scheme for online ranking, which contains the two following steps:

1. Inverted Indexing Ranking, in which we search the boosted words + patterns with inverted action indexing. For in total m_{word} words and m_{pattern} patterns, the time complexity of linear scanning is O(m_{word} + m_{pattern}) + O(k log(k)) for the k actions picked from the index files (from these k we rank the top n actions as our initial ranking results, with cost O(k log(k))).

2. Spatiotemporal Re-ranking, in which we use the dynamic time warping (DTW) distance [39] to re-rank the top n returned action segments by dynamic programming; the time complexity is O(t^2) for an action with t frames, and the overall cost is O(n t^2) for the n actions from the initial ranking.
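The DTW re-ranking step follows the standard O(t^2) dynamic-programming recursion of [39]; the per-frame feature vectors fed to it below are random stand-ins, since the paper does not spell out which frame-level descriptor is warped.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two per-frame feature sequences,
    O(t^2) for sequences of t frames (used for spatiotemporal re-ranking)."""
    a, b = np.asarray(seq_a, dtype=float), np.asarray(seq_b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])        # frame-level distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy usage: re-rank candidate clips by DTW distance to the query.
query = np.random.rand(20, 8)                 # 20 frames, 8-D per-frame features
candidates = {cid: np.random.rand(np.random.randint(15, 30), 8) for cid in range(5)}
reranked = sorted(candidates, key=lambda cid: dtw_distance(query, candidates[cid]))
```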
Table 1 further shows the computational cost (measured in seconds) of the offline building time and of the different parts of our online search pipeline. It is obvious that the initial ranking largely improves the search efficiency, enabling our online application. Subsequently, the DTW matching does not significantly degrade our online search efficiency.
Table 1
Computational cost of offline building, as well as of the different parts of online search.

Step | Time cost (s)
Offline building | 26,385
Online search without initial ranking | 1397
Solely inverted indexing based search | 1.826
Inverted indexing with DTW matching | 8.763
6.2. Comparisons on KTH database

To further validate our performance on standard evaluation benchmarks, we carry out a group of experiments on the KTH human motion database [26], one of the largest available video sequence benchmarks for human action recognition. It contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping). Each sequence is performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3), and indoors (s4).

Similar to [27], we search the query clip against the database and return the candidate sequences with the top similarities. Identical to [27], we conduct two kinds of leave-one-out cross validation to compare our performance to [27]. The first one is to select one sequence as the query clip and search all other sequences in the database (leave-one-sequence-out, LOSO in [27]). The second one is to select one person as the query and search all other persons in the database (leave-one-person-out, LOPO). Note that, to offer an identical setting to Fig. 8 in [27], we also use the entire sequence as the query.

Fig. 6. Quantitative performance on the KTH human motion benchmark [26] with comparisons to the work in [26,27].

As shown in Fig. 6, our visual patterns with boosting based selection achieve better performance than [27]. This is mainly because our scheme, as explained above, is robust and discriminative enough to capture the eigen representation of each individual action. In addition, since we use the hierarchical spatiotemporal vocabulary as the coarse search phase, our search efficiency is also guaranteed.

Our 60-hour YouTube video dataset is also very challenging. First, compared to the coherent backgrounds in the KTH database [26], our 60-hour YouTube videos contain a large amount of background clutter as well as foreground occlusions. Second, many scenarios in our 60-hour YouTube videos have dynamic and moving backgrounds, which is a very challenging setting for precise action extraction. Third, in many cases, viewpoint changes and camera zoom in/out can largely affect the size of the extracted 3D action regions, which is more complex than the KTH database [26].
7. Conclusion
In this paper, we propose a robust and discriminative action search paradigm, specialized for searching user contributed YouTube videos that are typically of uncontrolled quality. Our contributions are threefold: First, we propose an attention shift model for saliency-driven human action segmentation and partition. Our second contribution is a spatiotemporal co-location video pattern mining paradigm, aiming to discover eigen word combinations that capture the motion patterns, based on Distance based Co-location pattern Mining. Finally, we propose a novel boosting based discriminative feature selection scheme, which incorporates the ranking distortions into the
boosting objective to optimize the feature descriptor towards an
optimal action retrieval. We have conducted extensive evaluations on a 60-hour YouTube video dataset as well as the KTH
human motion benchmark [26], with comparisons to the state-of-the-art.
References
[1] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J.
Comput. Vision. 60 (2) (2004) 91–110.
[2] D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: CVPR,
2006.
[3] G. Schindler, M. Brown, R. Szeliski, City-scale location recognition, in: CVPR,
2007.
[4] R. Ji, X. Xie, H. Yao, W.-Y. Ma, Hierarchy vocabulary optimization for effective
and transferable retrieval, in: CVPR, 2009.
[5] T.K. Shih, C.-S. Wang, Y.-K. Chiu, Y.-T. Hsin, C.-H. Huang, On automatic action
retrieval of martial arts, in: ICME, 2004.
[6] V. Ferrari, M. Marin-Jimenez, A. Zisserman, Pose search: retrieving people
using their pose, in: CVPR, 2009.
[7] O. Masoud, N. Papanikolopoulos, A method for human action recognition,
Image Vis. Comput. 21 (2003) 729–743.
[8] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human
actions from movies, in: CVPR, 2008.
[9] A. Yilmaz, M. Shah, Recognizing human actions in videos acquired by
uncalibrated moving cameras, in: ICCV, 2005.
[10] J. Liu, S. Ali, M. Shah, Recognizing human actions using multiple features, in:
CVPR, 2008.
[11] J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action
detection, in: CVPR, 2009.
[12] H.E. Pashler, The Psychology of Attention, MIT Press, 1999.
[13] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid
scene analysis, Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[14] W.-H. Cheng, W.-T. Chu, J.-L. Wu, A visual attention based region-of-interest
detection, IEICE Trans. Inf. Syst. E88-D (7) (2005) 1578–1586.
[15] Y. Zhai, M. Shah, Visual attention detection in video sequences using
spatiotemporal cues, ACM Multimedia (2006) 815–824.
[16] S. Li, M.-C. Lee, Efficient spatiotemporal-attention-driven shot matching,
ACM Multimedia (2007) 178–187.
[17] P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its
application to action recognition, ACM Multimedia (2007) 357–360.
[18] D.J. Fleet, Y. Weiss, Optical Flow Estimation, in: Handbook of Mathematical
Models in Computer Vision, Springer, 2006.
[19] G. Willems, T. Tuytelaars, L.V. Gool, An efficient dense and scale-invariant
spatio-temporal interest point detector, in: ECCV, 2008.
[20] Y. Wang, C.-S. Chua, Face recognition from 2D and 3D images using 3D Gabor
filters, Image Vis. Comput. 23 (11) (2005) 1018–1028.
[21] I. Laptev, T. Lindeberg, Space–time interest points, in: ICCV, 2003.
[22] Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using
volumetric features, in: ICCV, 2005.
[23] A. Oikonomopoulos, I. Patras, M. Pantic, Spatiotemporal salient points for
visual recognition of human actions, Trans. Syst. Man Cybern. B (2006).
[24] S.-F. Wong, R. Cipolla, Extracting spatiotemporal interest points using global
information, in: ICCV, 2007.
[25] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval,
Inf. Process. Manag. (1998), pp. 513–523.
[26] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM
approach, in: ICPR, 2004.
[27] H. Ning, T. Han, D. Walther, M. Liu, T. Huang, Hierarchical space–time model
enabling efficient search for human actions, Circuits Syst. Video. Technol. 19
(6) (2008) 808–820.
[28] J. Sivic, A. Zisserman, Video data mining using configurations of viewpoint
invariant regions, in: CVPR, 2004.
[29] T. Quack, V. Ferrari, L.V. Gool, Video mining with frequent itemset configurations, in: CIVR, 2006.
[30] J. Yuan, Y. Wu, M. Yang, Discovery of collocation patterns: from visual words
to visual phrases, in: CVPR, 2007.
[31] T. Quack, V. Ferrari, B. Leibe, L.V. Gool, Efficient mining of frequent and
distinctive feature configurations, in: ICCV, 2007.
[32] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in:
CVPR, 2005.
[33] http://trecvid.nist.gov/.
[34] J. Shotton, A. Fitzgibbon, M. Cook, A. Blake, Real-time human pose recognition
in parts from single depth images, in: CVPR, 2011.
[35] http://opencv.willowgarage.com/wiki/FaceDetection.
[36] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space
analysis, Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[37] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of
items in large databases, in: SIGMOD, 1993.
[38] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, W. Gao, Location discriminative
vocabulary coding for mobile landmark search, Int. J. Comput. Vis. (2011).
[39] D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series,
in: Advances in Knowledge Discovery in Databases, AAAI Workshop, 1994.
[40] R. Ji, H. Yao, X. Sun, B. Zhong, W. Gao, Towards semantic embedding in visual
vocabulary, in: IEEE International Conference on Computer Vision and
Pattern Recognition, 2010.
[41] R. Ji, X. Xie, H. Yao, W.-Y. Ma, Mining city landmarks from blogs by graph
modeling, ACM Multimedia (2009) 105-114.
[42] R. Ji, X. Lang, H. Yao, Z. Zhang, Semantic supervised region retrieval using
keyword integrated Bayesian reasoning, Int. J. Innovative Comput. Inf.
Control 3 (6) (2008) 1645–1656.
[43] X. Liu, R. Ji, H. Yao, P. Xu, X. Sun, T. Liu, Cross-media manifold learning for
image retrieval and annotation, Multimedia Inf. Retr. (2008) 141–148.
[44] R. Ji, P. Xu, H. Yao, Z. Zhang, X. Sun, T. Liu, Directional correlation analysis of
local Haar binary pattern for text detection, in: IEEE International Conference
on Multimedia and Expo, 2008.
[45] R. Ji, H. Yao, W. Liu, X. Sun, Q. Tian, Task dependent visual codebook
compression, IEEE Trans. Image Process. 21 (4) 2282–2293.
[46] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval
framework based on semi-supervised ranking and relevance feedback, IEEE
Trans. Pattern Anal. Mach. Intell. 34 (5) (2012) 723–742.
[47] Y. Yang, Y. Zhuang, F. Wu, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE
Trans. Multimedia 10 (3) (2008) 437–446.
Liujuan Cao is currently a Ph.D. candidate in Department of Computer Science and Technology at Harbin
Engineering University, China. She received her B.S.
degree and M.E. degree in Department of Computer
Science and Technology at Harbin Engineering University, in 2005 and 2008, respectively. Her research
interests include multimedia information retrieval,
machine learning, pattern recognition, and vector
map watermarking.
Rongrong Ji is currently a Postdoctoral Researcher at the Electronic Engineering Department, Columbia University. He obtained his Ph.D. from the Computer Science Department, Harbin Institute of Technology. His research interests include image retrieval and annotation, and video retrieval and understanding. During 2007–2008, he was a research intern at the Web Search and Mining Group, Microsoft Research Asia, mentored by Xing Xie, where he received the Microsoft Fellowship in 2007. From May to June 2010, he was a visiting student at the University of Texas at San Antonio, working with Professor Qi Tian. From July to November 2010, he was also a visiting student at the Institute of Digital Media, Peking University, under the supervision of Professor Wen Gao. He has published over 40 refereed journal and conference papers, including in IJCV, TIP, CVPR, ACM MM, IJCAI, AAAI, TOMCCAP, and IEEE Multimedia. He serves as a reviewer for IEEE Signal Processing Magazine, IEEE Transactions on Multimedia, SMC, TKDE, and the ACM Multimedia conference, among others, as an associate editor of the International Journal of Computer Applications, and as a session chair at ICME 2008. He is a member of the IEEE and ACM.
Yue Gao is currently a Ph.D. candidate in Department
of Automation at Tsinghua University, China. He
received his B.S. degree in Electronic Engineering from
the Harbin Institute of Technology, China, in 2005, and
his M.E. degree in School of Software from Tsinghua
University, China, in 2008, respectively. His research
interests include multimedia information retrieval,
machine learning, and pattern recognition.
Qi Tian is currently an Associate Professor in the
Department of Computer Science, the University of
Texas at San Antonio (UTSA). During 2008–2009, he
took one-year Faculty Leave at Microsoft Research Asia
(MSRA) in the Media Computing Group (former Internet Media Group). He received his Ph.D. in 2002 from
UIUC. Dr. Tian’s research interests include multimedia
information retrieval and computer vision. He has
published about 110 refereed journal and conference
papers in these fields. His research projects were
funded by NSF, ARO, DHS, HP Lab, SALSI, CIAS, CAS
and Akiira Media System, Inc. He was the coauthor of a
Best Student Paper in ICASSP 2006, and co-author of a
Best Paper Candidate in PCM 2007. He was a nominee for the 2008 UTSA President's Distinguished Research Award. He received the 2010 ACM Service Award for ACM Multimedia 2009. He has been a Guest Editor of IEEE Transactions on Multimedia, ACM Transactions on Intelligent Systems and Technology, Computer Vision and Image Understanding, Pattern Recognition Letters, and EURASIP Journal on Advances in Signal Processing, and is an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology and on the Editorial Board of the Journal of Multimedia. He is a Senior Member of IEEE (2003) and a Member of ACM (2004).
Wei Liu received his M.Phil. and Ph.D. degrees in
Electrical Engineering from Columbia University, New
York, NY, USA in 2012. Currently, he is the Josef Raviv
Memorial Postdoctoral Fellow at IBM Thomas J. Watson
Research Center, Yorktown Heights, NY, USA. He worked
as an intern at Kodak Research Laboratories and IBM
Thomas J. Watson Research Center in 2010 and 2011,
respectively. His research interests include machine
learning, computer vision, pattern recognition, and
information retrieval. Dr. Liu won the 2011–2012 Facebook Fellowship.