WEAKLY SUPERVISED TOPIC GROUPING OF YOUTUBE SEARCH RESULTS

Liujuan Cao, Rongrong Ji, Wei Liu, Yue Gao, Ling-Yu Duan, and Chaoguang Men

School of Computer Science, Harbin Engineering University, Harbin, 150001, China
Department of Electrical Engineering, Columbia University, New York City, 10027, United States
Department of Automation, Tsinghua University, Beijing, 100871, China
Institute of Digital Media, Peking University, Beijing, 100871, China

ABSTRACT

Recent years have witnessed an explosive growth of user contributed videos on websites like YouTube and Metacafe, which typically provide query-by-keyword functionality to facilitate user browsing. For a given query, the returned videos are usually mixed up from multiple topics, which complicates user browsing. Therefore, their diversification and grouping are highly demanded to improve the user experience. However, the tagging and content quality of user contributed videos is uncontrolled, which hinders their precise grouping. In this paper, we present a weakly supervised topic grouping paradigm to diversify the returned videos of a given keyword query. Our grouping is based on a bag-of-words visual signature quantized over the spatiotemporal STIP descriptors [1] extracted from each returned video. First, we adopt a min-Hashing based visual similarity in combination with the tagging similarity to group the returned videos. Based on the initial grouping configurations, we mine co-occurred discriminative sub-signatures, based on which we iteratively refine the first step. Such iteration handles the noise in visual content and tagging well, since neither is fully trusted during the grouping. We validate our scheme on over 2,000 video clips crawled from a set of YouTube keyword query results. Compared to alternative approaches, our scheme shows superior robustness and precision.

Index Terms— Social Media, Video Retrieval, Search Result Diversification, Hashing, Pattern Mining

1. INTRODUCTION

In recent years, there has been an ever-growing amount of user contributed videos on online sharing websites such as YouTube, Facebook, and MySpace. Such social repositories allow users to upload, tag, comment on, and share videos with almost uncontrolled freedom. Most of the video sharing websites above provide query-by-keyword functionality. In this scenario, the returned videos are typically mixed up from multiple topics, each of which contains near-duplicate or semantically similar results. Therefore, deduplicating and diversifying the returned videos can further reduce the user browsing burden. This is of great importance for state-of-the-art video sharing websites such as YouTube and Metacafe to improve their user experience.

Fig. 1. The work flow of our proposed weakly supervised topic grouping paradigm for YouTube search results.

Towards diversifying the query-by-keyword returned videos, a straightforward solution is to group the related videos online into "topics"¹ by using their tag and/or visual content signatures. However, for user contributed videos, both the tags and the video contents are likely too noisy for precise grouping. For instance, the tags could be incorrect or missing, and the videos could have low resolution or be re-edited. As a result, neither the tags nor the visual contents of user contributed videos are fully trustable for achieving optimal grouping.
In this paper, we look at the issue of grouping the returned videos into topics with respect to their noisy tags and visual contents, which is challenging yet unexplored in the literature. We tackle this "topic grouping" from a weakly supervised clustering perspective. Our method is built upon the bag-of-visual-words signature quantized from the STIP descriptors [1] extracted from each returned video. First, we adopt a min-Hashing similarity over the bag-of-visual-words signatures to carry out the initial topic grouping, which is refined based upon the tagging similarity. Second, we mine a discriminative and co-occurred sub-signature from the initial grouping configurations, which is fed back to adaptively iterate the first grouping procedure. Such iteration sharpens the unreliable signature, acting as an optimal subspace representation. It is worthwhile to mention that we carefully control the pattern orders in sub-signature mining to ensure online efficiency. As a result, the returned videos can be robustly and efficiently organized into multiple "topics" to facilitate user browsing. Figure 1 outlines the work flow of our proposed paradigm, deployed over YouTube search results in a query-by-keyword scenario; it is general enough to be further extended to refine the search of other user contributed video repositories.

¹ In this paper, we define "topic" as the grouping results of returned videos, rather than following the definition used in traditional topic models.

2. RELATED WORK

Social Video Mining and Recognition: There is an ever-increasing research focus on user-shared videos from websites like YouTube and Metacafe. For instance, TRECVID has included the task of consumer video search and recognition since 2010. Donmez et al. [2] and Whitehill et al. [3] leveraged YouTube metadata to assist video annotation tasks. Duan et al. [4] adopted transfer learning and pyramid matching to classify consumer videos by using classifiers learnt from YouTube videos.

Visual Search Diversification: Diversifying image search results has recently been investigated in [5][6][7][8][11][13]. For instance, Weinberger et al. [7] diversified the returned image list based on textual feature ambiguity and topic diversity. Zhang et al. [5] proposed an Affinity Ranking to group returned images by optimizing the diversity and information richness. Song et al. [6] proposed a search result re-ranking by topic richness analysis. Leuken et al. [8] adopted lightweight clustering and dynamic feature weighting to diversify image search results with representative image selection.

Weakly Supervised Learning: Our work is inspired by an emerging machine learning area named "weakly supervised learning" [14][15], whose task is to jointly learn from multiple labeling sources without fully trusting the initial labels or cues. In practice, a variety of real-world problems can be formalized as such a multi-labeler problem, where the performance of different annotators can vary widely. Without ground truth, how to learn classifiers and simultaneously evaluate the trustworthiness of individual features has recently been investigated in [14][15], which in general model this problem as robust learning from multiple unreliable feature sources.

3. WEAKLY SUPERVISED GROUPING

Extracting Bag-of-Visual-Words Signature: To represent the contents of individual returned videos, we adopt a bag-of-visual-words signature deployed over a visual vocabulary representation.
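As an illustration of this representation, the following is a minimal sketch (not the authors' implementation) of quantizing a video's STIP descriptors into a bag-of-visual-words histogram. It assumes the visual vocabulary (k-means cluster centers) has already been built as described in the next paragraph; the array names are placeholders.

```python
import numpy as np

def bag_of_visual_words(stip_descriptors, vocabulary):
    """Quantize a video's local STIP descriptors into a bag-of-visual-words histogram.

    stip_descriptors: (J, D) array, one row per STIP descriptor of the video.
    vocabulary:       (K, D) array of k-means cluster centers (the K visual words).
    Returns a length-K count histogram, i.e. the signature V_i.
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    dists = ((stip_descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    # Assign each descriptor to its nearest visual word.
    nearest_word = dists.argmin(axis=1)
    # Count how many descriptors fall into each word bin.
    return np.bincount(nearest_word, minlength=vocabulary.shape[0])
```

Each returned video would yield one such histogram V_i under this sketch.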
To build this vocabulary, we collected a set of 1,000 videos from the "Most Viewed" YouTube videos from August to September 2011. We extract STIP descriptors [1] from each video. Then, we adopt k-means clustering to cluster all STIP descriptors from these videos into a visual vocabulary V with K words in total. Nevertheless, other well-used features or vocabularies [9][10][12] can also be adopted without loss of generality.

In online search, for a given query keyword, suppose we have N returned videos {Video_i}_{i=1}^N crawled from the YouTube API. For each Video_i, we also extract a set of STIP descriptors STIP_i = [STIP_i^1, ..., STIP_i^j, ..., STIP_i^J], which are quantized using V into a bag-of-visual-words signature V_i = [V_i^1, ..., V_i^K].

Min-Hashing based Online Grouping: We then convert V_i into a min-Hash based representation [16] to efficiently group the returned videos online. First, to unify the weighting histogram of V_i for min-Hashing, we duplicate each bin whose score is larger than 1 (e.g., the jth bin V_i^j is duplicated three times if V_i^j = 3). This results in a set-based representation of visual words, denoted as S, where each element takes the value 0 or 1. Then, the similarity between two input signatures S_1 and S_2 is the ratio of the number of elements in their intersection to that in their union:

sim(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2|    (1)

We adopt min-Hashing to estimate sim(S_1, S_2) without linearly scanning all signatures of the N returned videos. To this end, we carry out R random permutations π_r of the vocabulary V to define the min-Hashing operation min π_r(S_i), based on which the similarity score between S_1 and S_2 is defined as:

mHash(S_1, S_2) = |{r : min π_r(S_1) = min π_r(S_2)}| / R    (2)

Suppose that we have in total R hashing bins for the ith signature S_i, i.e., {min π_1(S_i), ..., min π_R(S_i)}. Based on the above pairwise similarity mHash(S_1, S_2), we group the N returned videos into M clusters by k-means clustering. To reduce the false positive rate, we only keep the score of Equation 2 when it is larger than τ; otherwise we set the score to 0.

We further incorporate the tag similarity of each returned video to refine the above min-Hashing similarity:

Sim(Video_1, Video_2) = mHash(S_1, S_2) · exp(−||T_1 − T_2||^2)    (3)

where T_i is the bag-of-textual-words vector extracted from the tags and comments of the ith returned video.
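For concreteness, here is a minimal sketch (our illustration, not the authors' code) of the min-Hashing similarity of Equations 1–2 and the tag-refined combination of Equation 3. It assumes the set S is given as the indices of non-zero bins obtained above, that collisions are averaged over the R permutations, and that `tau` and the tag vectors are placeholders.

```python
import numpy as np

def minhash_signature(word_set, num_perms, vocab_size, seed=0):
    """R min-hash values of a set of visual-word indices, i.e. min pi_r(S) for r = 1..R."""
    # The same seed must be used for every video so that all signatures share the
    # same R random permutations.
    rng = np.random.default_rng(seed)
    words = np.fromiter(word_set, dtype=np.int64)
    sig = np.empty(num_perms, dtype=np.int64)
    for r in range(num_perms):
        perm = rng.permutation(vocab_size)   # random permutation pi_r of the vocabulary
        sig[r] = perm[words].min()           # min pi_r(S)
    return sig

def mhash(sig1, sig2, tau=0.0):
    """Eq. 2: fraction of permutations whose min-hash values collide; low scores are zeroed."""
    score = float(np.mean(sig1 == sig2))
    return score if score > tau else 0.0

def video_similarity(sig1, sig2, tags1, tags2, tau=0.0):
    """Eq. 3: min-Hashing similarity weighted by the tag term exp(-||T_1 - T_2||^2)."""
    return mhash(sig1, sig2, tau) * float(np.exp(-np.sum((tags1 - tags2) ** 2)))
```

The resulting pairwise scores can then feed the k-means grouping described above without comparing the full bag-of-visual-words signatures directly.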
Mining Co-occurred Discriminative Sub-Signature: Based on the initial partition of the returned videos into M groups, we select N_pos signatures that are true positives for each group. We then mine discriminative sub-signatures from these N_pos × M signatures in total. We adopt APriori [17] to mine the sub-signature, which is very efficient when we control the pattern order to be at most 2, together with an empirical setting of N < 10 and M < 5. In general, APriori identifies the set of elements in S that co-occur more frequently in the N_pos positive sets than in the N_neg negative sets². Formally speaking, for a hashing bin combination A ⊆ S (i.e., a possible combination of hashing bins in S), we want to measure its discriminability in generating its consequent part (e.g., by adding additional bins to A, denoted B), which forms an association rule A ⇒ B. This is achieved by defining the support and confidence of A ⇒ B.

² The negative sets are randomly sampled from pairs of videos belonging to different groups among the M groups.

The support is calculated as:

sup(A ⇒ B) = |{T | T ∈ D, (A ∪ B) ⊆ T}| / |D|    (4)

which measures the statistical significance of this rule. D is the set of all possible transactions built from combinations of the non-zero bins in S, and T denotes a hashing bin set (transaction) in D. The confidence of a rule is:

conf(A ⇒ B) = sup(A ∪ B) / sup(A) = |{T | T ∈ D, (A ∪ B) ⊆ T}| / |{T | T ∈ D, A ⊆ T}|    (5)

Intuitively, the support of this rule is the probability of the joint occurrence of A and B, i.e., P(A, B), while the confidence is the conditional probability P(B|A).

Based on the initial grouping, we define videos in an identical group to form the "positive" set α+ and videos from different groups to form the "negative" set α−. Then, we run APriori on the transactions D built from both the positive and negative sets. The mining results include rules of the form (A, B) ⇒ (α+, α−), where P((α+, α−)|A, B) is given by the confidence of the rule. We expect that P(α+|A, B) ≫ P(α−|A, B); if so, this transaction is included into the co-occurred sub-signature, denoted as C. The mined sub-signature is then converted into a new vector and appended to the original signature V, so as to pull the signatures of positive examples closer together.

Iterative Weakly Supervised Grouping: The above two processes are iterated to output the final grouping of the returned videos, as in Algorithm 1, based on which we enable users to view the returned results organized into groups, as in Figure 2.

Algorithm 1: Work Flow of Iterative Topic Grouping
Input: Iteration t = 0, maximal iteration T, query keyword Q, bag-of-words signatures of the returned videos {V_i}_{i=1}^N, tag descriptors {T_i}_{i=1}^N, group number M.
Output: Partition of {V_i}_{i=1}^N into M groups.
  while t < T do
    Run k-means with the min-Hashing similarity of {S_i}_{i=1}^N (converted from {V_i}_{i=1}^N) to obtain M groups;
    Sample (α+, α−) from these M groups;
    Mine discriminative co-occurred sub-signatures C from (α+, α−) and {S_i} by APriori;
    Add the sub-signature C into V; t++;
  end
  for each group in the M groups do
    Find the S_j with the nearest min-Hashing distance to the other videos in this group;
    Select S_j as the representative video;
  end

Fig. 2. Exemplar grouping results by querying YouTube videos using keyword "Columbia".
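To make the mining step of Algorithm 1 concrete, the following is a rough sketch (an illustration under the definitions of Equations 4–5, not the authors' implementation) of scoring order-2 hashing-bin combinations over positive and negative transactions and keeping those that occur far more often among positives; the thresholds `min_support` and `min_ratio` are assumed values.

```python
from itertools import combinations

def mine_sub_signature(pos_transactions, neg_transactions,
                       min_support=0.1, min_ratio=3.0):
    """Select co-occurred, discriminative bin pairs (pattern order <= 2).

    pos_transactions / neg_transactions: lists of sets of non-zero bin indices,
    built from videos of the same group (alpha+) and of different groups (alpha-).
    """
    def support(itemset, transactions):
        # Eq. 4: fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / max(len(transactions), 1)

    candidate_bins = sorted(set().union(*pos_transactions))
    mined = []
    for pair in combinations(candidate_bins, 2):
        itemset = set(pair)
        sup_pos = support(itemset, pos_transactions)
        sup_neg = support(itemset, neg_transactions)
        # Keep the pair if it is frequent among positives and its confidence toward
        # alpha+ clearly dominates that toward alpha-.
        if sup_pos >= min_support and sup_pos >= min_ratio * max(sup_neg, 1e-6):
            mined.append(pair)
    return mined
```

Under this reading, the mined pairs would form the sub-signature C that Algorithm 1 appends to V before the next grouping iteration.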
4. EXPERIMENTS

Data Collection and Labeling: We collect over 2,000 videos by querying YouTube with a set of 20 keywords³ using the YouTube API. For the returned videos of each specific query, we manually label the videos into multiple groups as our grouping ground truth. From all videos, we extract STIP descriptors [1] and adopt hierarchical k-means to quantize all descriptors into an initial vocabulary. Table 1 gives the averaged grouping time comparison on a regular PC with a Pentium Dual-Core 2.54 GHz CPU and 2 GB memory.

³ Keywords: adam lambert, bad romance, christian bale, columbia, christmas, inauguration, kanye west, michael jackson, boat, obama inauguration, hatton, pacquiao, paranormal activity, susan boyle, climb, tiger woods, usain bolt, watchmen, wedding.

Table 1. Time cost of online result grouping.
Method                     Iterations   Mining    Grouping
K-means with BoW           10           0s        2.35s
Our method (order = 2)     5            3.78s     0.57s
Our method (order = 3)     10           11.76s    0.62s
Our method (order = 4)     15           25.36s    0.91s

Baselines and Evaluation Criteria: (1) K-Means: group the bag-of-visual-words signatures by k-means clustering, without iteratively mining the discriminative sub-signature. (2) Without Mining Co-occurred Patterns: only conduct the visual signature grouping based on the min-Hashing similarity, without iteratively mining the discriminative sub-signature. (3) Without Tag Refinement: run the entire procedure as in Section 3 without integrating the tag similarity into Equation 3. (4) Without Multi-iterations: iterate only one round of both topic grouping and sub-signature mining. We evaluate the grouping effectiveness using the Average AP over all groups: Average AP = (1/n) Σ_{i=1}^n AP_i, where AP_i is the Average Precision of the ith group.

Quantitative Results: Figure 4 shows the Average AP of online grouping when tuning the number of groups in the k-means clustering. In general, the best performance comes from iteratively mining the discriminative sub-signatures to refine the clustering results in a weakly supervised manner. Meanwhile, with a moderate group number, the corresponding precision is more or less insensitive for our final scheme, while it is very sensitive for the K-Means baseline and the baselines without discriminative/iterative sub-signature mining.

Fig. 4. Average AP comparison of online grouping with respect to different numbers of groups.

Table 1 shows the runtime of the different grouping schemes. If we carefully control the order of the mined patterns and the number of returned videos, online grouping is moderately efficient⁴. This is partially because min-Hashing avoids linearly scanning all signatures during similarity measurement, which accelerates the iterative online grouping.

⁴ Nevertheless, running our algorithm on (parallel) servers instead would further reduce the time cost substantially, or equivalently enable us to handle more returned videos for each query or to conduct higher-order mining.

Figure 3 further shows the individual Average APs with respect to the different keywords in our query set (with the group number set to 5). While the grouping precision varies considerably across keywords, we found that our proposed grouping is consistently superior to all alternatives. In addition, iteratively refining the signatures largely boosts the grouping precision, as seen by comparing baselines (3)(4) to (1)(2).

Fig. 3. Case study of Average APs of the online grouping results with respect to different keywords.

5. CONCLUSION AND FUTURE WORK

In this paper, we investigate the issue of online grouping the returned videos into "topics" to facilitate efficient browsing of YouTube videos in a query-by-keyword scenario, where neither the video contents nor their tags are fully trustable for direct usage. To this end, we propose a weakly supervised topic grouping paradigm with two iterative phases: (1) a min-Hashing based fast online grouping with tag similarity refinement, and (2) an APriori based discriminative sub-signature mining to learn an optimal visual representation, which is iteratively fed back to refine the grouping. Our scheme outperforms a group of alternative schemes in quantitative evaluations over 2,000 YouTube videos returned from 20 keyword queries. Finally, how to better select representative videos for each group will be investigated in our future work.

6. REFERENCES

[1] I. Laptev and T. Lindeberg. Space-time interest points. ICCV, 2003.
[2] P. Donmez and J. Carbonell. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. CIKM, 2008.
[3] J. Whitehill, P. Ruvolo, J. Bergsma, T. Wu, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. NIPS, 2009.
[4] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. CVPR, 2010.
[5] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma. Improving web search results using affinity graph. SIGIR, 2005.
[6] K. Song, Y. Tian, W. Gao, and T. Huang. Diversifying the image retrieval results. ACM Multimedia, 2006.
[7] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag ambiguity. ACM Multimedia, 2008.
[8] R. van Leuken, L. Garcia, and X. Olivares. Visual diversification of image search results. WWW, 2009.
[9] R. Ji and H. Yao. Visual & textual fusion for region retrieval: From both fuzzy matching and Bayesian reasoning aspects. Multimedia Information Retrieval, 2007.
[10] R. Ji, P. Xu, and H. Yao. Directional correlation analysis of local Haar binary pattern for text detection. ICME, 2008.
[11] X. Liu, R. Ji, H. Yao, and P. Xu. Cross-media manifold learning for image retrieval & annotation. ICMR, 2010.
[12] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao. Location discriminative vocabulary coding for mobile landmark search. International Journal of Computer Vision, 2012.
[13] R. Ji, Y. Gao, B. Zhong, H. Yao, and Q. Tian. Mining Flickr landmarks by modeling reconstruction sparsity. TOMCCAP, 2011.
[14] V. Raykar, S. Yu, L. Zhao, A. Jerebko, C. Florin, G. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. ICML, 2009.
[15] O. Dekel and O. Shamir. Good learners for evil teachers. ICML, 2009.
[16] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-Hash and TF-IDF weighting. BMVC, 2008.
[17] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. VLDB, 1994.