Neurocomputing 105 (2013) 61–69

Mining spatiotemporal video patterns towards robust action retrieval

Liujuan Cao (a), Rongrong Ji (b,*), Yue Gao (c), Wei Liu (b), Qi Tian (d)

a Harbin Engineering University, Harbin 150001, China
b Columbia University, New York City 10027, United States
c Department of Automation, Tsinghua University, Beijing 100086, China
d University of Texas at San Antonio, San Antonio 78249-1644, United States

* Corresponding author. Tel.: +1 917 655 3190. E-mail address: jirongrong@gmail.com (R. Ji).

Abstract

Available online 17 October 2012

In this paper, we present a spatiotemporal co-location video pattern mining approach with application to robust action retrieval in YouTube videos. First, we introduce an attention shift scheme to detect and partition the focused human actions in YouTube videos, built upon visual saliency modeling [13] together with face [35] and body [32] detectors. From the segmented spatiotemporal human action regions, we extract 3D-SIFT [17] descriptors. We then quantize all interest points detected in the reference YouTube videos into a vocabulary, based on which each individual interest point is assigned a word identity. An Apriori-based frequent itemset mining scheme is then deployed over the spatiotemporally co-located words to discover co-location video patterns. Finally, we fuse the visual words and patterns and leverage boosting-based feature selection to output the final action descriptors, incorporating the ranking distortion of conjunctive queries into the boosting objective. We carried out quantitative evaluations on both the KTH human motion benchmark [26] and over 60 hours of YouTube videos, with comparisons to the state of the art.

Keywords: Video search; Spatiotemporal descriptor; Visual vocabulary; Visual pattern mining; Social media; Scalable multimedia representation

1. Introduction

With the popularity of video sharing websites such as YouTube, MySpace, and Yahoo Video, there is an ever increasing amount of user-contributed videos on the Web. To manage these growing collections, content-based accessing, browsing, search, and analysis techniques have emerged. In this paper, we investigate the task of human action retrieval from user-contributed videos on the Web, which has emerging potential in many related applications such as video surveillance, abnormal action detection, and human behavior analysis. More specifically, we focus on retrieving actor-independent actions, i.e. we only care about the motion patterns rather than the visual appearance of the actor, which differs from near-duplicate action matching. It also differs from the traditional action recognition scenario in that no predefined action categories are assumed, which ensures scalability.

Searching actions in user-contributed videos is not trivial. Besides the difficulties of foreground motion segmentation and representation, challenges also come from the uncontrolled quality of these videos. For instance, such videos are typically of low resolution, contain unstable global (camera) motion, and have very short durations. These challenges largely degrade the robustness of state-of-the-art video search techniques [33] built on the popular bag-of-words representation.
Meanwhile, lacking the stable or regular camera setups of video surveillance, detecting and tracking human motion in these user-contributed videos is much more difficult. However, there is still good news from user tagging and from users' video capturing habits. First, there is a large amount of cheaply available tags, which provide weak supervision (they are noisy) to guide the design of a robust and discriminative action representation, referred to as video patterns in this work in contrast to the bag-of-words representation. Second, most consumer videos keep a good focus on the person of interest compared to other foreground and background motions, which inspires us to incorporate visual attention modeling to robustly detect and track human actions.

A typical paradigm of content-based action search involves three components, i.e. action segmentation, spatiotemporal description, and feature indexing with similarity scoring. We brief them as follows:

Action segmentation refers to segmenting the human action regions from the static or moving backgrounds and from other foreground motions. Different from visual tracking or background subtraction, the segmentation target is specifically the human body. Different from pedestrian detection, the uncontrolled quality of user-contributed videos, e.g. heavy shaking and blurring, raises the challenge of detecting human actions in the wild.

Spatiotemporal description aims to robustly and discriminatively describe the segmented human actions. Ideally, this description should be at the body part level, as in the recent work on body part inference from RGB-Depth Kinect data [34]. Again, this is not realistic for user-contributed videos such as YouTube, where neither depth nor stable video is guaranteed. Consequently, to improve upon the popular bag-of-features action representation, higher-level representations are demanded to capture the eigen statistics of human actions.

Feature indexing and similarity scoring: the final step is to build an effective feature representation by fusing the bag-of-features descriptors with their higher-level abstraction. Its ultimate goal is to improve the ranking performance for possible query actions. To this end, we will further show a way to incorporate the ranking loss of possible queries into the feature representation and indexing in a principled way.

In this paper, we tackle the problem of action retrieval from user-shared videos with three main contributions. First, we introduce an attention shift scheme for fast and robust human motion detection and segmentation, which is deployed over the visual saliency model [13] with the integration of human face [35] and body [32] detectors to produce human-specific responses. Our attention shift scheme exploits the recording preferences of user-contributed videos and, in our quantitative evaluations, outperforms the classic mean shift [36] and particle filtering based trackers. Second, we introduce a spatiotemporal co-location video pattern mining scheme, which is deployed over 3D-SIFT [17] based interest points.
For each human motion segment, we model the spatiotemporal co-location statistics of interest point neighborhoods as transactions. We pool all transactions from videos sharing the same tag and run Apriori-based frequent itemset mining [37] to discover co-location video patterns, which are then treated as a high-level abstraction of the bag-of-features action descriptors. Finally, we propose to boost an optimal human action representation over both the bag-of-features and the mined patterns, incorporating the ranking loss of sampled conjunctive queries into the boosting objective. This enables us to discover, by boosting-based feature selection, the most discriminative feature representation that best fits the potential action search. Fig. 1 shows the proposed co-location video pattern mining based action search framework.

We have quantified the performance of the proposed scheme in two evaluations. First, we compare our scheme to (1) HOG + Bag-of-Features and (2) 3D-SIFT + Bag-of-Features representations on the KTH human action benchmark [26]. Second, we crawled over 60 hours of videos from YouTube, on which we evaluate our action search scheme with comparisons to state-of-the-art alternatives.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 introduces our attention shift based human action segmentation and partition. Section 4 presents our spatiotemporal co-location video pattern mining. Section 5 presents our boosting-based discriminative feature representation and indexing. We give quantitative evaluations in Section 6. Finally, we conclude in Section 7 and discuss future work.

2. Related work

Action search: While many previous works have focused on visual search and recognition of still images [40–47], limited work has directly addressed human action retrieval. Shih et al. [5] proposed to search for a single foreground action against a static background, and Ferrari et al. [6] searched human poses by combining human body parts with motion trajectory similarity. However, human body estimation is not precise enough, especially for videos containing severe viewpoint changes, as in our scenario. Our work follows an unsupervised setting, which differs from action recognition [7–11] that classifies human actions into several predefined categories and therefore cannot be scaled up well.

Visual saliency: Visual saliency analysis in videos aims to capture the specific foreground motion that is the focus of video viewers. As shown in [12], such salient motion is transferable from one actor to another. Taking advantage of this property, recent works [14–16] further extended the image saliency analysis of Itti et al. [13] to the video domain. For instance, Li et al. [16] presented a dynamic attention model that combines spatial and temporal saliency to detect switching of the focused region.

Video pattern mining: There are several previous works in video pattern mining [28–31]. For instance, Sivic et al. [28] adopted the spatial configuration of viewpoint-invariant features from principal objects or scenes to mine the most frequent feature combinations. Quack et al. [29] proposed to mine frequently occurring objects and scenes by finding recurring spatial arrangements of affine covariant regions. Yuan et al. [30] proposed to mine a visual phrase lexicon, each phrase being a meaningful spatially co-occurrent pattern of visual words.
Quack et al. [31] further proposed to mine spatial configurations of local features occurring frequently on a given object. Compared to our method, all of the above works output the mined video patterns without regard to an optimal combination that improves the subsequent video (action) search performance.

Fig. 1. The proposed co-location video pattern mining based action search framework.

Spatiotemporal interest points: Spatiotemporal interest points model actions as a bag-of-words representation [26,32], which in general involves three phases: (1) extracting spatiotemporal interest points from action videos, (2) clustering the interest points based on their descriptors (e.g. histograms of oriented gradients [32], SIFT [1]), and (3) classifying/recognizing actions using the occurrence frequencies of the clustered points. Research has focused on each step to improve the final action recognition performance. For instance, to extract robust and informative interest points, various criteria have been proposed, e.g. cornerness [21]. Liu et al. [10] further clustered interest points by exploring the correlation between points and action categories.

3. Robust human action extraction

We define an action as a consecutive spatiotemporal region set, which contains the appearance and movement of the same person in a video shot. Given another video shot as the query input, action retrieval is to retrieve shots in the video dataset that contain identical or near-identical actor movements to the query. To clarify, we focus on "actor-independent" action retrieval, which considers actions with similar motion trajectories as similar, regardless of their appearance differences.

As discussed in Section 1, our first step is to distill actions from videos, which subdivides each shot into several human-focused actions and eliminates the interference from (multiple) concurrent actions and global motions (caused by camera motion). We propose an attention shift model to achieve this goal, as outlined in Fig. 2, which combines Focus of Attention (FOA) detection with a switching mechanism to partition spatiotemporally salient actions in each shot. In particular, we also incorporate the face detector [35] and the body part detector [32] to post-filter the partitioned motion regions, so that motions without humans (e.g. vehicles or animals) are filtered out.
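As a concrete illustration of this human post-filter, the following Python sketch uses OpenCV's cascade face detector (the tool referenced in [35]) and its default HOG people detector in the spirit of [32]. It is a minimal sketch, not the authors' implementation; the cascade file path, detector parameters, and the helper name contains_human are illustrative assumptions.

# Hedged sketch: post-filtering candidate motion regions with OpenCV's cascade
# face detector and default HOG people detector. Parameters are assumptions.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def contains_human(region_bgr):
    """Return True if a face or a pedestrian is detected inside the region."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    if len(faces) > 0:
        return True
    bodies, _ = hog.detectMultiScale(region_bgr, winStride=(8, 8))
    return len(bodies) > 0

# Usage: keep only the partitioned motion regions that pass the human check,
# e.g. kept = [r for r in candidate_regions if contains_human(crop(frame, r))]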
3.1. Extracting focused actions

As preliminary processing, we first carry out shot boundary detection based on graph partitioning, which has been demonstrated to be effective in the TRECVID 2005 SBD competition. To distinguish foreground actions from backgrounds, a straightforward strategy is background modeling, such as a Gaussian mixture model. However, it cannot handle moving cameras. Moving object tracking is another feasible solution to locate actions, but it leaves unaddressed how to detect changes of the focused action. We address this issue with spatiotemporal saliency detection, building on the image-domain saliency map [13] used to detect human-focused regions. In this paper, we present a spatiotemporal saliency analysis strategy to locate the human-focused foreground actions and to eliminate the interference from concurrent actions and camera motions.

Spatial saliency: We detect spatial saliency using the saliency map of [13]. We compute 42 feature maps: 6 for intensity, 12 for color, and 24 for orientation. The spatial saliency r_p^{Spatial} of each pixel p (at location (x,y)) is measured by the weighted summation of the conspicuity maps from the different channels:

r_p^{Spatial} = \sum_{n} g_n M_n(p), \quad n \in \{I, C, O\},    (1)

where I, C, O denote the intensity, color, and orientation channels and g_n is the weighting parameter. The conspicuity map M_n is calculated by summing up all feature maps belonging to the corresponding feature channel.

Motion saliency: For the successive frames in a shot, a motion saliency map is built using the dynamic attention model [16], which detects human-focused motions that are both consistent inside the object and salient against the background. First, we adopt optical flow [18] features to determine the number of concurrent motions within each shot. For each pixel p, we calculate a weighted structural tensor M of the optical flow (u, v, w). If the optical flow (u, v, w) is constant within a neighborhood N, M is rank deficient with rank(M) ≤ 2. We then use the first three eigenvalues λ1, λ2, λ3 to define a continuous rank-deficiency measurement d_M:

d_M(p) = \begin{cases} 0, & \mathrm{tr}(M) < \gamma, \\ \dfrac{\lambda_3^2}{0.5\,\lambda_1^2 + 0.5\,\lambda_2^2 + \epsilon}, & \text{otherwise}, \end{cases}    (2)

where γ is adopted to handle the case rank(M) = 0. Then, the motion saliency of each pixel p is detected as (ε is a constant tuning parameter)

r^{Temporal}(x,y) = \frac{d_M(x,y)}{d_M(x,y) + \epsilon}.    (3)

Spatiotemporal FOA: We then multiply the spatial and temporal saliency scores of each candidate grid to obtain a spatiotemporal saliency map, in which the saliency of each pixel p is determined by

r_p^{Spatiotemporal} = \sum_{i \in N_p^{Spatial}} \sum_{j \in N_p^{Temporal}} r_i^{Spatial}\, r_j^{Temporal}.    (4)

We first determine the FOA center by finding the maximum of \sum_{q \in N_p} r_q^{Spatiotemporal} G(p), where G(p) is a normalized Gaussian function centered at p with the same number of points as N_p^{Spatial}.

Fig. 2. The proposed attention shift model for focused action segmentation and partition.

Second, we define a locating rectangle in each frame to distinguish the human-focused foreground region from backgrounds and concurrent motions as Rec_{X,Y} = {∀x_i ∈ X, ∀y_i ∈ Y | x_min ≤ x_i ≤ x_max, y_min ≤ y_i ≤ y_max}. We expand the rectangle from the FOA center until it encloses all pixels with saliency strength higher than a pre-defined threshold T_Saliency. For pixels within this box, if the optical flow similarity between a pixel and the background is lower than a pre-defined threshold T_OF, the pixel is removed from the FOA region.

For all the detected action regions, we further use the cascade face detector [35] together with the HOG (Histograms of Oriented Gradients) [32] human detector to decide whether a moving foreground is a human. As a result, spatiotemporal regions without humans are filtered out from subsequent processing.
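The spatiotemporal FOA step of Eqs. (1)-(4) can be sketched as follows, assuming the per-frame spatial and motion saliency maps have already been computed (e.g. with [13] and [16]). This is a simplified sketch, not the authors' code; the function name locate_foa, the smoothing bandwidth, and the thresholding rule are assumptions.

# Hedged sketch of Eq. (4) and the FOA localization: combine spatial and motion
# saliency multiplicatively, smooth with a Gaussian (standing in for G(p)),
# take the maximum as the FOA center, and grow a box over salient pixels.
import numpy as np
from scipy.ndimage import gaussian_filter

def locate_foa(spatial_sal, motion_sal, t_saliency=0.5, sigma=5.0):
    """spatial_sal, motion_sal: HxW float arrays normalized to [0, 1]."""
    combined = spatial_sal * motion_sal                  # pixel-wise product, Eq. (4)
    smoothed = gaussian_filter(combined, sigma=sigma)    # Gaussian-weighted pooling
    cy, cx = np.unravel_index(np.argmax(smoothed), smoothed.shape)  # FOA center

    # Expand a rectangle to enclose pixels above the saliency threshold.
    ys, xs = np.where(combined >= t_saliency * combined.max())
    if ys.size == 0:
        return (cx, cy), (cx, cy, cx, cy)
    box = (xs.min(), ys.min(), xs.max(), ys.max())
    return (cx, cy), box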
3.2. Detecting attention shift

For each detected focused action sequence, a "shift" from one person to another is identified to partition action regions that belong to different people. To measure whether an action is performed by the same person, we define a shifting indicator A^{Shift} in Eq. (5), which evaluates the spatial consistency of the concerned action region across frames for action partition:

A^{Shift} = \begin{cases} 1, & \mathrm{CenDis}(i,j) > T_C \ \text{and}\ \mathrm{DiaVar} < T_D, \\ 0, & \mathrm{CenDis}(i,j) < T_C \ \text{and}\ \mathrm{DiaVar} > T_D, \end{cases}    (5)

where CenDis(·) is the distance between the geometric centers of two given regions i and j, and DiaVar(·) is the variance of their diameters. Once A^{Shift} between the target action regions (i and j) of successive frames satisfies Eq. (5), an "Attention Shift" is detected. At each detected "Attention Shift" position, we segment the action region sequence into two different actions.

4. Mining co-location action patterns

4.1. Spatiotemporal descriptor

We adopt the 3D-SIFT [17] feature to characterize a given action segment based on its spatiotemporal interest points. Without loss of generality, other spatiotemporal features [19–24] could also be adopted in this phase, such as Hessian-STIP [19] and 3D-Gabor [20]. Following the principle of [17], for each pixel in the action region we encode the gradient magnitudes and orientations within its spatiotemporal 3D neighborhood, in which the spatial and temporal angles (θ and φ) are calculated as

\theta(x,y,t) = \arctan\!\left(\frac{L_y}{L_x}\right),    (6)

\phi(x,y,t) = \arctan\!\left(\frac{L_t}{\sqrt{L_x^2 + L_y^2}}\right).    (7)

For pixel (x,y), L_x = L(x+1,y,t) − L(x−1,y,t) and L_y = L(x,y+1,t) − L(x,y−1,t) represent the gradients in the x and y directions, respectively, and L_t = L(x,y,t+1) − L(x,y,t−1) represents the gradient in the t direction. Pixels that are local maxima or minima among their spatiotemporal neighbors are retained as interest point candidates. Rotation invariance is then achieved by rotating each key point into its dominant 3D direction (so that θ = φ = 0). Subsequently, we sample the spatiotemporal neighborhood regions (3×3 in space and 8 in time, each containing 4×4×4 spatiotemporal pixels) around the target interest point. For each 3D sub-region, similar to SIFT [1], we divide θ and φ into equal bins to quantize the orientations into a histogram. Finally, we assemble these quantizations over all 3D sub-regions of the interest point to form the final 3D-SIFT descriptor.

4.2. Spatiotemporal vocabulary

Based on the 3D-SIFT features extracted from all 3D action segments, we build a visual vocabulary using hierarchical K-Means [2–4]. Each quantized word contains the descriptors that are closer to this word's feature center than to any other in the 3D-SIFT feature space. As a result, each action segment is represented as a bag-of-features (BoF). For a focused action, the BoF sequence encodes spatial, motion, and appearance information together in the action representation. To ensure scalability, we adopt our indexing model to conduct an approximate nearest-neighbor search: in retrieval, for each 3D-SIFT descriptor extracted from the query action, the inverted-indexed action clips falling within the same visual word are picked out as ranking candidates.

Given a visual vocabulary, a query action q (or an action clip d) can be represented as an N-dimensional vector of words. The relevance between q and d is calculated as the cosine similarity between their BoF vectors:

r(d,q) = \frac{\sum_{i=1}^{N} w_{di}\, w_{qi}}{|d|\,|q|},    (8)

where w_{di} is the weight of the ith word in action clip d and w_{qi} is the weight of the ith word in the query action clip q. As in document retrieval, Term Frequency (TF) and Inverse Document Frequency (IDF) [25] weighting can also be incorporated.

Indexing: We further inverted-index each action segment under its non-zero words. As a result, in online search the candidate action selection is very fast due to our hierarchical model structure: for each search, the time complexity is O(log(N)) for a vocabulary with N words.
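The following Python sketch illustrates the BoF relevance score of Eq. (8) combined with an inverted index over non-zero words. It is a minimal sketch of the general technique, not the authors' system; the class name BoFIndex and its interface are assumptions, and IDF weighting is left out for brevity.

# Hedged sketch of Eq. (8) with an inverted index: candidates sharing at least
# one word with the query are gathered from the inverted file, then ranked by
# cosine similarity of their BoF histograms.
from collections import defaultdict
import numpy as np

class BoFIndex:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.inverted = defaultdict(list)   # word id -> list of segment ids
        self.vectors = {}                   # segment id -> BoF vector

    def add(self, seg_id, bof):
        self.vectors[seg_id] = bof
        for w in np.nonzero(bof)[0]:
            self.inverted[w].append(seg_id)

    def search(self, query_bof, top_n=10):
        candidates = set()
        for w in np.nonzero(query_bof)[0]:
            candidates.update(self.inverted[w])
        qn = np.linalg.norm(query_bof) + 1e-12
        scores = []
        for seg_id in candidates:
            d = self.vectors[seg_id]
            r = float(np.dot(d, query_bof)) / (np.linalg.norm(d) * qn + 1e-12)  # Eq. (8)
            scores.append((r, seg_id))
        return sorted(scores, reverse=True)[:top_n]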
4.3. Mining co-location action patterns

Concurrent word coding: Suppose we have in total M tags crawled from the YouTube videos. For each tag, we have a set of human action segments whose original videos were labeled with this tag. We perform co-location video pattern mining offline over all tags and their reference action segments, which outputs a set of video patterns as a pattern pool for the subsequent boosting in Section 5.

For each reference action segment I_i (i ∈ [1,n]) of a given tag, suppose there are J 3D-SIFT descriptors L(i) = [L_1(i), ..., L_J(i)] extracted from I_i with spatial positions S(i) = [S_1(i), ..., S_J(i)] (each S_j(i), j ∈ [1,J], belonging to the jth local descriptor). We quantize L(i) into an m-dimensional bag-of-features histogram V(i) = [V_1(i), ..., V_m(i)]. For each S_j(i), we scan its K-nearest neighborhood to identify all concurrent words:

T_j(i) = \{L_{j'} \mid L_{j'}(i) \in L(i),\ S_{j'}(i) \in \mathrm{Neighbor}_K(S_j(i))\},    (9)

where T_j(i) is the item (if any) built for S_j(i) in I_i. The ensemble of all possible items from I_i is

T(i) = \{T_j(i)\}_{j \in [1,J]}.    (10)

In turn, the ensemble of all T(i) for i ∈ [1,n] forms an itemset D_k = {T(1), ..., T(n)} for [I_1, ..., I_n] at pattern order k. From D_k, we mine n_p patterns P_k = {P_1, ..., P_{n_p}}.

Distance based pattern mining: Previous works in visual pattern mining mainly resorted to Transaction based Co-location pattern Mining (TCM). For instance, the works in [28–30] built transaction features by coding the k nearest 2D spatially nearby words. This spatial configuration can be further refined by the scales of the interest points to impose scale invariance into the transactions [31]. A transaction in TCM is defined by coding the spatial layout of neighboring words. Then, frequent itemset mining schemes such as Apriori [37] can be deployed to discover word combinations (of maximal order k) as patterns.

TCM can be formulated as follows. Let V = {V_1, V_2, ..., V_m} be the set of all potential items and let D = {T_1, T_2, ..., T_n} be the set of transactions. Each T_i is a possible set of items, i.e. a combination of items from V, named an itemset. Any two T_i and T_j are distinct. We then define the support of an itemset A (A ⊆ D) as

\mathrm{support}(A) = \frac{|\{T \in D \mid A \subseteq T\}|}{|D|}.    (11)

If and only if support(A) ≥ σ, the itemset A is defined as a frequent itemset of D, where σ is the threshold restricting the minimal support rate. We then define the confidence of each frequent itemset as

\mathrm{confidence}(A \to B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} = \frac{|\{T \in D \mid (A \cup B) \subseteq T\}|}{|\{T \in D \mid A \subseteq T\}|},    (12)

where A and B are two subsets. The confidence in Eq. (12) is the maximal correctness likelihood of B given that A is also correct. The confidence-based restriction guarantees that the co-location visual patterns discover the minimal item subsets that are most helpful in representing the visual features at a given order K (in our implementation, k ranges from 2 to K). We further define an Association Hyperedge of a subset A = {V_1, V_2, ..., V_l} as

AH(A) = \frac{1}{N}\sum_{i} \mathrm{confidence}(A \setminus \{V_i\} \to V_i),    (13)

which gives a minimal association hyperplane to bound A. By checking all possible itemset combinations in D from order 2 to K, the itemsets with support(·) ≥ σ and AH ≥ γ are defined as frequent patterns. However, the TCM procedure generates repeated patterns in texture regions that contain dense words, which degrades its discriminability.
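For concreteness, the support and confidence measures of Eqs. (11)-(12) over word transactions can be computed as in the sketch below. This is an illustrative sketch only; the transactions here are toy word-id sets standing in for the co-located items built with Eq. (9).

# Hedged sketch of Eqs. (11)-(12): support and confidence over transactions.
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset` (Eq. (11))."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / float(len(transactions))

def confidence(A, B, transactions):
    """support(A u B) / support(A), as in Eq. (12)."""
    sa = support(A, transactions)
    return support(set(A) | set(B), transactions) / sa if sa > 0 else 0.0

# Toy example: each transaction is a set of co-located word ids.
transactions = [{3, 7, 12}, {3, 7}, {7, 12, 25}, {3, 7, 25}]
print(support({3, 7}, transactions))        # 0.75
print(confidence({3}, {7}, transactions))   # 1.0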
Distance-based Co-location pattern Mining (DCM) refines TCM by introducing two new measurements, the participation index (pi) and the participation ratio (pr), which overcome TCM's over-sensitivity in texture regions. Formally, let C = {V_1, V_2, ..., V_m} be the set of the m words in total. DCM first introduces the R-reachable measurement as the basis of both pi and pr: two words V_i and V_j are R-reachable if and only if dis(V_i, V_j) < d_thres, where dis(·) is a distance metric such as the Euclidean distance and d_thres is the distance threshold. For a given word V_i, we define the participation ratio pr(C, V_i) as the percentage of subsets C \ {V_i} in C that are R-reachable:

pr(C, V_i) = \frac{|\pi_{V_i}(\mathrm{instance}(C))|}{|\mathrm{instance}(V_i)|},    (14)

where π is the relational projection operation with deduplication. The participation index is then defined as

pi(C) = \min_{i=1}^{k}\{pr(C, V_i)\},    (15)

which describes the frequency of the subset C in the neighborhood. Note that only item subsets with pi larger than a given threshold are defined as patterns in DCM. The mining procedure is otherwise identical to the Apriori [37] process. Algorithm 1 outlines the work flow of the proposed scheme.

Algorithm 1. Distance based spatiotemporal co-location video pattern mining.

Input: visual vocabulary V; reference action segments {I_i}_{i=1}^{n}, grouped by tag Object_j as {{I_i}_{i=1}^{n_reference}}_{j=1}^{n_object}; bag-of-words histograms {V(1), ..., V(n)}; support threshold σ; confidence threshold γ; maximal pattern order K; sparse factor α.
Output: CBoP pattern set {Q_i}_{i=1}^{t_selected}.
1. Spatiotemporal coding: for each local descriptor position S_j (j ∈ [1, J]) in each reference segment do
2.     build the itemset candidates {D}_{Object_j} by 3D sphere coding of the 3D interest points using Eq. (9);
3. end
4. Ensemble {D}_{Object_j} over all tags (j ∈ [1, n_object]) as D = {D}_{i=1}^{t};
5. Distance based pattern mining: calculate the supports and confidences of all candidate patterns within D using Eqs. (11) and (12);
6. Filter out unreliable patterns with thresholds σ and γ;
7. Generate the mined pattern collection P = {P_i}_{i=1}^{t} using Eqs. (14) and (15).

5. Boosting discriminative features

Finally, we present a boosting-based discriminative feature selection over both the visual words and the mined visual patterns. We sub-sample a set of conjunctive queries from the reference action segments, based on which we define our boosting objective as minimizing the ranking distortion of the ground truth labels. The boosting procedure is as follows:

Conjunctive queries: We sample a set of conjunctive queries, similar to [38], as [I'_1, ..., I'_{n_sample}]. Each action is then used to search for similar actions using the original bag-of-features over V, which results in the following training set:

\mathrm{Query}(I'_1) = [A^1_1, A^1_2, \ldots, A^1_R], \quad \ldots, \quad \mathrm{Query}(I'_{n_{sample}}) = [A^{n_{sample}}_1, A^{n_{sample}}_2, \ldots, A^{n_{sample}}_R],    (16)

where A^j_i is the ith returned result of the jth query. We expect the boosted words/patterns from P ∪ V to retain [A^j_1, A^j_2, ..., A^j_R] for each jth action query as much as possible. To this end, we define [w_1, ..., w_{n_sample}] as an error weighting vector over the n_sample queries in the user query log, which measures the ranking consistency loss in the word/pattern selection. We denote the selected word/pattern subset as C. At the tth boosting iteration, we have the current (t−1) selected words or patterns C_{t−1}. To select the next (tth) discriminative word or pattern from P ∪ V, we estimate the ranking preservation of the current selection C_{t−1} as

\mathrm{Loss}(I'_i) = w^{t-1}_i \sum_{r=1}^{R} \mathrm{Rank}(A^i_r)\, W_{A^i_r}\, \| f(C_{t-1}, I'_i) - V_{A^i_r} \|^2,    (17)

where i ∈ [1, n_sample]; Rank(A^i_r) is the current position of the originally rth returned result of the action query I'_i; and f(·) is the bag-of-features recovery function. [w^{t−1}_1, ..., w^{t−1}_{n_sample}] is the (t−1)th error weighting, which measures the ranking loss of the jth query (j ∈ [1, n_sample]). Then the overall ranking loss is

\mathrm{Loss}_{Rank} = \sum_{i=1}^{n_{sample}} w^{t-1}_i \sum_{r=1}^{R} \mathrm{Rank}(A^i_r)\, W_{A^i_r}\, \| f(C_{t-1}, I'_i) - V_{A^i_r} \|^2,    (18)

and the next best word or pattern C_t is selected from P ∪ V by minimizing

C_t = \arg\min_{j} \sum_{i=1}^{n_{sample}} w^{t-1}_i \sum_{r=1}^{R} \mathrm{Rank}(A^i_r)\, W_{A^i_r}\, \| f(\{C + C_j\}, I'_i) - V_{A^i_r} \|^2.    (19)

Subsequently, we update the error weighting w^{t−1}_i of each ith query according to its corresponding loss in Eq. (19).
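One reading of Eqs. (17)-(19) is a greedy forward selection in which, at each boosting round, the word or pattern minimizing the weighted ranking-distortion loss is added and the query weights are then updated from their residual losses. The sketch below follows that reading; it is an assumption-laden simplification, not the authors' procedure, and the callable ranking_loss stands in for the bracketed term of Eq. (17).

# Hedged sketch of a greedy, boosting-style word/pattern selection per
# Eqs. (17)-(19). `pool` holds candidate word/pattern ids from P u V;
# `ranking_loss(selected, query)` is an assumed callable.
import numpy as np

def boost_select(pool, queries, ranking_loss, n_rounds):
    selected = []
    weights = np.ones(len(queries)) / len(queries)     # initial error weights
    for _ in range(n_rounds):
        best_j, best_cost = None, np.inf
        for j in pool:
            if j in selected:
                continue
            # Eq. (19): weighted loss if candidate j were added to the selection.
            cost = sum(w * ranking_loss(selected + [j], q)
                       for w, q in zip(weights, queries))
            if cost < best_cost:
                best_j, best_cost = j, cost
        selected.append(best_j)
        # Re-weight each query by its current residual loss, as after Eq. (19).
        losses = np.array([ranking_loss(selected, q) for q in queries])
        weights = losses / (losses.sum() + 1e-12)
    return selected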
6. Experimental results

6.1. Action search in 60-hour YouTube videos

Database: In our experiments, we crawled over 60 hours of YouTube videos from the "most viewed" lists between September and October 2010. We partition these videos into shots and distill the focused actions from each shot. We then filter out actions with durations of less than 2 s. After this filtering, we obtain over 6000 actions with sufficient duration and focus. Among them, 200 actions are selected to build our query set. We construct the database from the remaining 5800 actions and extract 3D-SIFT descriptors offline from each of them.

Search model: We build a 4-branch, 24-level vocabulary using hierarchical K-Means clustering [2] to generate a BoF vector for each action segment. In tree construction, if a node has fewer than 2000 features, we stop its K-Means division, no matter whether it has reached the deepest level or not. On a computer with an Intel Pentium IV 3.00 GHz CPU and 1.0 GB RAM, the typical time and memory costs are: tree construction time 2 h, retrieval time 4 s per action (including feature extraction), and memory cost 200 MB.

Evaluation protocol: We use precision@N to evaluate the search performance, which is widely used in image/document retrieval systems. For each given method, after automatically ranking the top 10 similar actions in our database, we ask a group of subjects to label the similar actions: for each query action, users give a binary judgment on whether each returned action is similar, which enables us to measure precision@N. (Note that we cannot go through the entire database, so we do not report recall in our experiments.)
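The precision@N protocol amounts to averaging, over queries, the fraction of the top-N returns judged relevant. A minimal sketch, with toy judgments, is given below; the function name and data layout are illustrative assumptions.

# Hedged sketch of precision@N: per-query binary relevance judgments for the
# top-ranked returns, averaged over all queries.
def precision_at_n(judgments, n=10):
    """judgments: list of per-query binary relevance lists, ordered by rank."""
    per_query = [sum(j[:n]) / float(n) for j in judgments]
    return sum(per_query) / len(per_query)

# Toy example with two queries judged on their top-5 returns:
print(precision_at_n([[1, 1, 0, 1, 0], [1, 0, 0, 0, 1]], n=5))  # 0.5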
With and without attention shift: First, we evaluate the effectiveness of our action segmentation based on the attention shift model. With identical implementations of the remaining steps, we compare our attention shift model against a method that adopts global motion compensation and includes all moving foregrounds in the action representation. This alternative approach is denoted as "Global Compensation" in our discussion.

Fig. 3. The quantitative results of attention shift based action segmentation with comparisons to the "Global Compensation" based scheme.

As presented in Fig. 3, the precision@N obtained with the attention shift model for action selection is better than that of the Global Compensation approach. This derives from the fact that the salient action representation can automatically and effectively capture the human-focused action regions without loss of precision.

Action extraction and recognition are still open problems. To design an appropriate action extraction algorithm, one key consideration is the constitution of actions and videos in the corresponding database. We have found that, for YouTube videos, the video capturer typically focuses on the foreground human action (if any), rather than producing post-edited footage. In addition, in most cases these action regions occupy a large portion of the shot, which is mostly captured with close-up rather than telephoto lenses. Even in cases containing multiple acting people (concurrent actions), there is mostly only one salient action at each time stamp. Based on these observations, we believe our saliency-based method is well suited to actor action extraction from sitcoms or movies. Nevertheless, for wide-baseline or telephoto videos, our method is not the best choice, especially when there are multiple moving objects occupying small windows. To mitigate this, we also include face detection and tracking to enhance our saliency-map based foreground action detection, and we constrain the scales of the extracted foreground, which further eliminates unstable extractions from small action regions.

Quantitative comparisons: Next, we compare our video pattern based action search to: (1) action search based on a partial 3D-SIFT description, and (2) action search based on optical flow. The first baseline calculates action similarity with the identical bag-of-features procedure but retains only the temporal dimensions of the 3D-SIFT features; it can be regarded as another form of actor-independent search, since only the motion information is kept in the action representation. The latter baseline directly uses optical flow features [18] in place of the 3D-SIFT features, with the rest of the framework unchanged.

As presented in Fig. 4, our strategy reports the best performance among the three approaches. We analyze this figure from two aspects. First, for consumer videos, due to their low quality, bag-of-features approaches might not be an optimal choice; the capability to capture eigen representations (or latent feature spaces) is therefore a fundamental step toward good search performance, which we achieve through visual pattern mining and ranking-loss-integrated boosting. Second, adopting optical flow features alone cannot achieve satisfactory results compared with 3D-SIFT features; rather than direct usage, the main advantage of optical flow lies in its capability to build correspondences between frames.

Fig. 4. The quantitative results of our co-location video pattern mining for action search.

Fig. 5. Action search examples of our proposed boosting based action representation.

Fig. 5 further shows four case studies of action retrieval results for four representative queries. Each row is a query operation: the left frame shows the query shot with its detected action, and the remaining frames show the top four returned results.

Efficiency analysis: We further give the time cost of our scheme for online ranking, which consists of the following two steps:

1. Inverted indexing ranking: we search the boosted words and patterns with the inverted action index. For m_word words and m_pattern patterns in total, the time complexity of the linear scan is O(m_word + m_pattern) + O(k log(k)) for the k actions picked from the index files (from these k we rank the top n actions as our initial ranking results, with cost O(k log(k))).

2. Spatiotemporal re-ranking: we use the dynamic time warping (DTW) distance [39] to re-rank the top n returned action segments with dynamic programming; the time complexity is O(t^2) for an action with t frames, and the overall cost is O(n·t^2) for the n actions from the initial ranking (see the DTW sketch below).
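The DTW distance used in the re-ranking step is the classic O(t^2) dynamic programming recursion over per-frame descriptor sequences, as in [39]. The sketch below is a generic implementation of that recursion, not the authors' code; the frame-level feature choice is an assumption.

# Hedged sketch of the DTW distance [39] used for spatiotemporal re-ranking.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a, seq_b: lists/arrays of per-frame feature vectors."""
    ta, tb = len(seq_a), len(seq_b)
    D = np.full((ta + 1, tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost = np.linalg.norm(np.asarray(seq_a[i - 1]) - np.asarray(seq_b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[ta, tb]

# Re-ranking usage: sort the top-n candidates of the initial ranking by DTW
# distance to the query's frame-level descriptor sequence, e.g.
# reranked = sorted(top_n, key=lambda c: dtw_distance(query_seq, c.seq))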
Table 1 shows the computational cost (measured in seconds) of offline building and of the different parts of our online search pipeline. It is evident that the initial ranking greatly improves the search efficiency, enabling our online application, while the subsequent DTW matching does not significantly degrade the online search efficiency.

Table 1. Computational cost of offline building and of the different parts of online search.

Step                                        Time cost (s)
Offline building                            26,385
Online search without initial ranking       1397
Solely inverted indexing based search       1.826
Inverted indexing with DTW matching         8.763

Our 60-hour YouTube video dataset is also very challenging. First, compared to the coherent backgrounds in the KTH database [26], our 60-hour YouTube videos contain a large amount of background clutter as well as foreground occlusions. Second, many scenes in these videos have dynamic and moving backgrounds, which is a very challenging scenario for precise action extraction. Third, in many cases, viewpoint changes and camera zooming can largely affect the size of the extracted 3D action regions, which is more complex than the KTH database [26].

6.2. Comparisons on the KTH database

To further validate our performance on standard benchmarks, we carry out a group of experiments on the KTH human motion database [26], one of the largest available video sequence benchmarks for human action recognition. It contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping). Each sequence is performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3), and indoors (s4). Similar to [27], we search the query clip against the database and return the candidate sequences with the top similarities. Identical to [27], we conduct two kinds of leave-one-out cross validation to compare our performance to [27]. The first is to select one sequence as the query clip and search all other sequences in the database (leave-one-sequence-out, LOSO in [27]). The second is to select one person as the query and search all other persons in the database (leave-one-person-out, LOPO). Note that, to offer a comparison identical to Fig. 8 in [27], we also use the entire sequence as the query.

As shown in Fig. 6, our visual patterns with boosting-based selection achieve better performance than [27]. This is mainly because our scheme, as explained above, is robust and discriminative enough to capture the eigen representation of each individual action. In addition, since we use the hierarchical spatiotemporal vocabulary as the coarse search phase, our search efficiency is also guaranteed.

7. Conclusion

In this paper, we propose a robust and discriminative action search paradigm, specialized for searching user-contributed YouTube videos of typically uncontrolled quality. Our contributions are threefold. First, we propose an attention shift model for saliency-driven human action segmentation and partition.
Our second contribution is a spatiotemporal co-location video pattern mining paradigm, aiming to discover eigen word combinations that capture the motion patterns, based on Distance based Co-location pattern Mining. Finally, we propose a novel boosting based discriminative feature selection scheme, which incorporates the ranking distortions into the boosting objective to optimize the feature descriptor toward optimal action retrieval. We have conducted extensive evaluations on a 60-hour YouTube video dataset as well as the KTH human motion benchmark [26], with comparisons to the state of the art.

Fig. 6. Quantitative performance on the KTH human motion benchmark [26] with comparisons to the work in [26,27].

References

[1] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 91–110.
[2] D. Nister, H. Stewenius, Scalable recognition with a vocabulary tree, in: CVPR, 2006.
[3] G. Schindler, M. Brown, R. Szeliski, City-scale location recognition, in: CVPR, 2007.
[4] R. Ji, X. Xie, H. Yao, W.-Y. Ma, Hierarchy vocabulary optimization for effective and transferable retrieval, in: CVPR, 2009.
[5] T.K. Shih, C.-S. Wang, Y.-K. Chiu, Y.-T. Hsin, C.-H. Huang, On automatic action retrieval of martial arts, in: ICME, 2004.
[6] V. Ferrari, M. Marin-Jimenez, A. Zisserman, Pose search: retrieving people using their pose, in: CVPR, 2009.
[7] O. Masoud, N. Papanikolopoulos, A method for human action recognition, Image Vis. Comput. 21 (2003) 729–743.
[8] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: CVPR, 2008.
[9] A. Yilmaz, M. Shah, Recognizing human actions in videos acquired by uncalibrated moving cameras, in: ICCV, 2005.
[10] J. Liu, S. Ali, M. Shah, Recognizing human actions using multiple features, in: CVPR, 2008.
[11] J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action detection, in: CVPR, 2009.
[12] H.E. Pashler, The Psychology of Attention, MIT Press, 1999.
[13] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[14] W.-H. Cheng, W.-T. Chu, J.-L. Wu, A visual attention based region-of-interest detection, IEICE Trans. Inf. Syst. E88-D (7) (2005) 1578–1586.
[15] Y. Zhai, M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in: ACM Multimedia, 2006, pp. 815–824.
[16] S. Li, M.-C. Lee, Efficient spatiotemporal-attention-driven shot matching, in: ACM Multimedia, 2007, pp. 178–187.
[17] P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: ACM Multimedia, 2007, pp. 357–360.
[18] D.J. Fleet, Y. Weiss, Optical flow estimation, in: Handbook of Mathematical Models in Computer Vision, Springer, 2006.
[19] G. Willems, T. Tuytelaars, L.V. Gool, An efficient dense and scale-invariant spatio-temporal interest point detector, in: ECCV, 2008.
[20] Y. Wang, C.-S. Chua, Face recognition from 2D and 3D images using 3D Gabor filters, Image Vis. Comput. 23 (11) (2005) 1018–1028.
[21] I. Laptev, T. Lindeberg, Space-time interest points, in: ICCV, 2003.
[22] Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, in: ICCV, 2005.
[23] A. Oikonomopoulos, I. Patras, M. Pantic, Spatiotemporal salient points for visual recognition of human actions, IEEE Trans. Syst. Man Cybern. B (2006).
[24] S.-F. Wong, R. Cipolla, Extracting spatiotemporal interest points using global information, in: ICCV, 2007.
[25] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process. Manag. 24 (5) (1988) 513–523.
[26] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: ICPR, 2004.
[27] H. Ning, T. Han, D. Walther, M. Liu, T. Huang, Hierarchical space-time model enabling efficient search for human actions, IEEE Trans. Circuits Syst. Video Technol. 19 (6) (2009) 808–820.
[28] J. Sivic, A. Zisserman, Video data mining using configurations of viewpoint invariant regions, in: CVPR, 2004.
[29] T. Quack, V. Ferrari, L.V. Gool, Video mining with frequent itemset configurations, in: CIVR, 2006.
[30] J. Yuan, Y. Wu, M. Yang, Discovery of collocation patterns: from visual words to visual phrases, in: CVPR, 2007.
[31] T. Quack, V. Ferrari, B. Leibe, L.V. Gool, Efficient mining of frequent and distinctive feature configurations, in: ICCV, 2007.
[32] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005.
[33] http://trecvid.nist.gov/
[34] J. Shotton, A. Fitzgibbon, M. Cook, A. Blake, Real-time human pose recognition in parts from single depth images, in: CVPR, 2011.
[35] http://opencv.willowgarage.com/wiki/FaceDetection
[36] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[37] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: SIGMOD, 1993.
[38] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, W. Gao, Location discriminative vocabulary coding for mobile landmark search, Int. J. Comput. Vis. (2011).
[39] D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in: Advances in Knowledge Discovery in Databases, AAAI Workshop, 1994.
[40] R. Ji, H. Yao, X. Sun, B. Zhong, W. Gao, Towards semantic embedding in visual vocabulary, in: CVPR, 2010.
[41] R. Ji, X. Xie, H. Yao, W.-Y. Ma, Mining city landmarks from blogs by graph modeling, in: ACM Multimedia, 2009, pp. 105–114.
[42] R. Ji, X. Lang, H. Yao, Z. Zhang, Semantic supervised region retrieval using keyword integrated Bayesian reasoning, Int. J. Innovative Comput. Inf. Control 3 (6) (2008) 1645–1656.
[43] X. Liu, R. Ji, H. Yao, P. Xu, X. Sun, T. Liu, Cross-media manifold learning for image retrieval and annotation, in: ACM Multimedia Information Retrieval, 2008, pp. 141–148.
[44] R. Ji, P. Xu, H. Yao, Z. Zhang, X. Sun, T. Liu, Directional correlation analysis of local Haar binary pattern for text detection, in: ICME, 2008.
[45] R. Ji, H. Yao, W. Liu, X. Sun, Q. Tian, Task dependent visual codebook compression, IEEE Trans. Image Process. 21 (4) (2012) 2282–2293.
[46] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Trans. Pattern Anal. Mach. Intell. 34 (5) (2012) 723–742.
[47] Y. Yang, Y. Zhuang, F. Wu, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Trans. Multimedia 10 (3) (2008) 437–446.

Liujuan Cao is currently a Ph.D. candidate in the Department of Computer Science and Technology at Harbin Engineering University, China. She received her B.S. and M.E. degrees from the Department of Computer Science and Technology at Harbin Engineering University in 2005 and 2008, respectively.
Her research interests include multimedia information retrieval, machine learning, pattern recognition, and vector map watermarking.

Rongrong Ji is currently a Postdoctoral Researcher in the Department of Electrical Engineering, Columbia University. He obtained his Ph.D. from the Department of Computer Science, Harbin Institute of Technology. His research interests include image retrieval and annotation, and video retrieval and understanding. During 2007–2008, he was a research intern with the Web Search and Mining Group, Microsoft Research Asia, mentored by Xing Xie, where he received the Microsoft Fellowship in 2007. From May to June 2010, he was a visiting student at the University of Texas at San Antonio, working with Professor Qi Tian. From July to November 2010, he was also a visiting student at the Institute of Digital Media, Peking University, under the supervision of Professor Wen Gao. He has published over 40 refereed journal and conference papers, including in IJCV, TIP, CVPR, ACM MM, IJCAI, AAAI, TOMCCAP, and IEEE Multimedia. He serves as a reviewer for IEEE Signal Processing Magazine, IEEE Transactions on Multimedia, SMC, TKDE, and the ACM Multimedia conference, among others, as an associate editor of the International Journal of Computer Applications, and served as a session chair at ICME 2008. He is a member of the IEEE and ACM.

Yue Gao is currently a Ph.D. candidate in the Department of Automation at Tsinghua University, China. He received his B.S. degree in Electronic Engineering from the Harbin Institute of Technology, China, in 2005, and his M.E. degree from the School of Software, Tsinghua University, China, in 2008. His research interests include multimedia information retrieval, machine learning, and pattern recognition.

Qi Tian is currently an Associate Professor in the Department of Computer Science, the University of Texas at San Antonio (UTSA). During 2008–2009, he took a one-year faculty leave at Microsoft Research Asia (MSRA) in the Media Computing Group (formerly the Internet Media Group). He received his Ph.D. in 2002 from UIUC. Dr. Tian's research interests include multimedia information retrieval and computer vision. He has published about 110 refereed journal and conference papers in these fields. His research projects have been funded by NSF, ARO, DHS, HP Lab, SALSI, CIAS, CAS, and Akiira Media System, Inc. He was a co-author of a Best Student Paper at ICASSP 2006 and a co-author of a Best Paper Candidate at PCM 2007. He was a nominee for the 2008 UTSA President's Distinguished Research Award. He received the 2010 ACM Service Award for ACM Multimedia 2009. He is a Guest Editor of IEEE Transactions on Multimedia, ACM Transactions on Intelligent Systems and Technology, Computer Vision and Image Understanding, Pattern Recognition Letters, and EURASIP Journal on Advances in Signal Processing; an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology; and a member of the Editorial Board of the Journal of Multimedia. He has been a Senior Member of the IEEE since 2003 and a Member of the ACM since 2004.

Wei Liu received his M.Phil. and Ph.D. degrees in Electrical Engineering from Columbia University, New York, NY, USA, in 2012. Currently, he is the Josef Raviv Memorial Postdoctoral Fellow at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He worked as an intern at Kodak Research Laboratories and the IBM Thomas J. Watson Research Center in 2010 and 2011, respectively. His research interests include machine learning, computer vision, pattern recognition, and information retrieval.
Dr. Liu won the 2011–2012 Facebook Fellowship.