WEAKLY SUPERVISED TOPIC GROUPING OF YOUTUBE SEARCH RESULTS
Liujuan Cao, Rongrong Ji†, Wei Liu†, Yue Gao, Ling-Yu Duan, and Chaoguang Men
School of Computer Science, Harbin Engineering University, Harbin, 150001, China
†Department of Electrical Engineering, Columbia University, New York City, 10027, United States
Department of Automation, Tsinghua University, Beijing, 100871, China
Institute of Digital Media, Peking University, Beijing, 100871, China
ABSTRACT
Recent years have witnessed an explosive growth of user contributed videos on websites like YouTube and Metacafe, which usually provide a query-by-keyword functionality to facilitate user browsing. For a given query, the returned videos typically mix multiple topics together, which burdens user browsing. Therefore, their diversification and grouping are highly desirable to improve the user experience. However, the tagging and content quality of user contributed videos is uncontrolled, which hinders precise grouping. In this paper, we present a weakly supervised topic grouping paradigm to diversify the videos returned for a given keyword query. Our grouping is based on a bag-of-words visual signature quantized over the spatiotemporal STIP descriptors [1] extracted from each returned video. First, we adopt a min-Hashing based visual similarity, in combination with tag similarity, to group the returned videos. From the initial grouping configuration, we then mine co-occurring discriminative sub-signatures, based on which we iteratively refine the first step. Such iteration handles the noise in visual content and tagging well, since neither is fully trusted during the grouping. We validate our scheme on over 2,000 video clips crawled from a set of YouTube keyword query results. Compared with alternative approaches, our scheme shows superior robustness and precision.
Index Terms— Social Media, Video Retrieval, Search
Result Diversification, Hashing, Pattern Mining
1. INTRODUCTION
In recent years, there has been an ever-growing amount of user contributed videos on online sharing websites such as YouTube, Facebook and MySpace. Such social repositories allow users to upload, tag, comment on and share videos with almost uncontrolled freedom. Most of the video sharing websites above provide a query-by-keyword functionality. In this scenario, the returned videos are typically mixed up from multiple topics, each of which contains near-duplicate or semantically similar results. Therefore, de-duplicating and diversifying the returned videos can further reduce the user browsing burden.
Fig. 1. The work flow of our proposed weakly supervised
topic grouping paradigm for YouTube search results.
This is of great importance for state-of-the-art video sharing websites such as YouTube and Metacafe to improve their user experience.
Towards diversifying the videos returned for a keyword query, a straightforward solution is to group related videos into "topics"¹ online, using their tag and/or visual content signatures. However, for user contributed videos, both the tags and the video contents are likely too noisy for precise grouping. For instance, the tags could be incorrect or missing, and the videos could have low resolution or be re-edited. As a result, neither the tags nor the visual contents of user contributed videos are fully trustable for achieving the optimal grouping.
In this paper, we look at the problem of grouping the returned videos into topics in the presence of noisy tags and visual contents, which is challenging yet unexplored in the literature. We tackle this "topic grouping" from a weakly supervised clustering perspective. Our method builds on the bag-of-visual-words signature quantized from the STIP descriptors [1] extracted from each returned video. First, we adopt a min-Hashing similarity over the bag-of-visual-words signatures to carry out the initial topic grouping, which is refined based upon tag similarity. Second, we mine discriminative, co-occurring sub-signatures from the initial grouping configuration, which are fed back to adaptively iterate the first grouping procedure. Such iteration sharpens the unreliable signature, acting as an optimal subspace representation.
¹ In this paper, we define "topic" as the grouping result of returned videos, rather than in the sense of traditional topic models.
It is worthwhile to mention that we carefully control the pattern order in sub-signature mining to ensure online efficiency. As a result, the returned videos can be robustly and efficiently organized into multiple "topics" to facilitate user browsing. Figure 1 outlines the work flow of our proposed paradigm deployed over YouTube search results in a query-by-keyword scenario; the paradigm is general enough to be further extended to refine the search of other user contributed video repositories.
2. RELATED WORK
Social Video Mining and Recognition: There is an ever-increasing research focus on user shared videos from websites like YouTube and Metacafe. For instance, TRECVID has included the task of consumer video search and recognition since 2010. Donmez et al. [2] and Whitehill et al. [3] leveraged YouTube metadata to assist video annotation tasks. Duan et al. [4] adopted transfer learning and pyramid matching to classify consumer videos by using classifiers learned from YouTube videos.
Visual Search Diversification: Diversifying image search results has recently been investigated in [5][6][7][8][11][13]. For instance, Weinberger et al. [7] diversified the returned image list based on textual feature ambiguity and topic diversity. Zhang et al. [5] proposed an Affinity Ranking to group returned images by optimizing diversity and information richness. Song et al. [6] proposed a search result re-ranking by topic richness analysis. Leuken et al. [8] adopted lightweight clustering and dynamic feature weighting to diversify image search results with representative image selection.
Weakly Supervised Learning: Our work is inspired by a recent machine learning direction named "weakly supervised learning" [14][15], whose task is to jointly learn from multiple labeling sources without fully trusting the initial labels or cues. In practice, a variety of real-world problems can be formalized as such a multi-labeler problem, where the performance of different annotators varies widely. Without ground truth, how to learn classifiers and simultaneously evaluate the trustworthiness of the individual sources has recently been investigated in [14][15], which in general model this problem as robust learning from multiple unreliable feature sources.
3. WEAKLY SUPERVISED GROUPING
Extracting Bag-of-Visual-Words Signature: To represent the contents of individual returned videos, we adopt a bag-of-visual-words signature deployed over a visual vocabulary representation. To build this vocabulary, we collected a set of 1,000 videos from the "Most Viewed" YouTube videos from August to September 2011 and extracted STIP descriptors [1] from each video. Then, we adopt k-means clustering to cluster all STIP descriptors from these videos into a visual vocabulary V with K words in total. Nevertheless, other well-used features or vocabularies [9][10][12] can also be adopted without loss of generality.
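As an illustration, the following Python sketch shows how such a bag-of-visual-words signature could be computed; it is a minimal sketch, assuming the STIP descriptors have already been extracted offline, and it uses flat k-means from scikit-learn as a stand-in for the clustering step. The function names and the choice of K are illustrative, not part of the original system.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, K=1000, seed=0):
    """Cluster all STIP descriptors of the offline video set into K visual words."""
    all_desc = np.vstack(descriptor_sets)              # (total_descriptors, dim)
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(all_desc)

def bow_signature(descriptors, vocabulary):
    """Quantize one video's STIP descriptors into its K-bin signature V_i."""
    words = vocabulary.predict(descriptors)             # nearest visual word per descriptor
    return np.bincount(words, minlength=vocabulary.n_clusters)
```

Here descriptor_sets is a list of (num_descriptors, dim) arrays, one per offline video, and bow_signature is applied to every returned video at query time.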
In online search, for a given query keyword, suppose we have N returned videos {Video_i}_{i=1}^{N} crawled through the YouTube API. For each Video_i, we extract a set of STIP descriptors STIP_i = [STIP_i^1, ..., STIP_i^j, ..., STIP_i^J], which are quantized using V into a bag-of-visual-words signature V_i = [V_i^1, ..., V_i^K].
Min-Hashing based Online Grouping: We then convert V_i into a min-Hash based representation [16] to efficiently group the returned videos online.
First, to unify the weighting histogram of V_i for min-Hashing, we duplicate each bin whose score is larger than 1 (e.g., the jth bin V_i^j is duplicated three times if V_i^j = 3). This results in a set-based representation of visual words, denoted as S, where each element takes a binary value of 0 or 1.
Then, the similarity between two input signatures S_1 and S_2 is the ratio of the number of elements in their intersection to that in their union:

sim(S_1, S_2) = |S_1 ∩ S_2| / |S_1 ∪ S_2|    (1)
We adopt min-Hashing to estimate sim(S_1, S_2) without linearly scanning all signatures of the N returned videos. To this end, we carry out R random permutations π_r of the vocabulary V to define the min-Hashing operation min π_r(S_i), based on which the similarity score between S_1 and S_2 is estimated as:

mHash(S_1, S_2) = |{r : min π_r(S_1) = min π_r(S_2)}| / R    (2)

Each signature S_i is thus represented by R hashing bins {min π_1(S_i), ..., min π_R(S_i)}. Based on the pairwise similarity mHash(S_1, S_2), we group the N returned videos into M clusters by k-means clustering. To reduce the false positive rate, we only keep the score of Equation 2 when it is larger than a threshold τ; otherwise the score is set to 0.
We further incorporate the tag similarity of each returned video to refine the above min-Hashing similarity:

Sim(Video_1, Video_2) = mHash(S_1, S_2) exp(−||T_1 − T_2||²)    (3)

where T_i is the bag-of-textual-words signature extracted from the tags and comments of the ith returned video.
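For concreteness, a minimal Python sketch of Equations 1-3 is given below. It treats the set representation S_i simply as the occupied (non-zero) bins of the histogram, estimates the Jaccard similarity by the fraction of agreeing min-hashes over R vocabulary permutations, and weights the result by the tag similarity. R, τ, and the function names are illustrative assumptions rather than the exact settings of the paper.

```python
import numpy as np

def minhash_signature(bow_hist, perms):
    """Min-hash the set of occupied visual-word bins under R vocabulary permutations."""
    occupied = np.flatnonzero(bow_hist)        # set-based representation S_i
    if occupied.size == 0:                     # guard for a video with no descriptors
        return np.full(len(perms), len(perms[0]))
    return np.array([perm[occupied].min() for perm in perms])

def minhash_similarity(h1, h2, tau=0.2):
    """Eq. (2): fraction of agreeing min-hashes, kept only if it exceeds tau."""
    score = float(np.mean(h1 == h2))           # estimates the Jaccard similarity of Eq. (1)
    return score if score > tau else 0.0

def video_similarity(h1, h2, t1, t2, tau=0.2):
    """Eq. (3): visual min-hash similarity weighted by tag/comment similarity."""
    return minhash_similarity(h1, h2, tau) * float(np.exp(-np.sum((t1 - t2) ** 2)))

# R random permutations of the K-word vocabulary (values are illustrative)
K, R = 1000, 64
rng = np.random.default_rng(0)
perms = [rng.permutation(K) for _ in range(R)]
```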
Mining Co-occurring Discriminative Sub-Signatures: Based on the initial partition of the returned videos into M groups, we select N_pos signatures that are true positives for each group. We then mine discriminative sub-signatures from these N_pos × M signatures in total. We adopt APriori [17] for this mining, which is very efficient when we restrict the pattern order to ≤ 2 together with an empirical setting of N < 10 and M < 5. In general, APriori identifies the set of elements in S that co-occur more frequently in the N_pos positive sets than in the N_neg negative sets².
² The negative sets are randomly sampled from video pairs belonging to different groups among the M groups.
Fig. 2. Exemplar grouping results by querying YouTube
videos using keyword “Columbia”.
Formally, for a hashing bin combination A ⊆ S (i.e., a possible combination of hashing bins in S), we want to measure its discriminability in generating its consequent part (e.g., formed by adding additional bins to A, denoted as B), called an association rule A ⇒ B. This is achieved by defining the support and confidence of A ⇒ B. The support is calculated as:
sup(A ⇒ B) = |{T | T ∈ D, (A ∪ B) ⊆ T}| / |D|    (4)
which measures the statistical significance of this rule. Here D is the set of all possible transactions built from combinations of the non-zero bins in S, and T is a hashing bin set in D that contains A. The confidence of a rule is:
conf(A ⇒ B) = sup(A ∪ B) / sup(A) = |{T | T ∈ D, (A ∪ B) ⊆ T}| / |{T | T ∈ D, A ⊆ T}|    (5)
Intuitively, the support of this rule is the probability of the joint occurrence of A and B, i.e., P(A, B), while the confidence is the conditional probability P(B|A).
Based on the initial grouping, we define videos in an identical group to form the "positive" set α+ and videos from different groups to form the "negative" set α−. Then, we run the APriori mining on transactions D from both the positive and negative sets. The mining results include rules of the form (A, B) ⇒ (α+, α−), where P((α+, α−)|A, B) is given by the confidence of the rule. We expect that P(α+|A, B) ≫ P(α−|A, B); if so, this pattern is included in the co-occurring sub-signature, denoted as C. The mined sub-signature is then converted into a new vector and appended to the original signature V, so as to pull the signatures of positive examples closer together.
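Because the pattern order is capped at 2, the APriori search can be approximated by directly enumerating single bins and bin pairs. The Python sketch below scores each candidate with the support of Eq. (4) and the confidence of Eq. (5) toward the positive set, and keeps those whose confidence clearly favors α+; the thresholds and the function name are illustrative assumptions.

```python
from itertools import combinations

def mine_subsignature(pos_sets, neg_sets, min_support=0.1, min_conf=0.7, max_order=2):
    """Keep bin combinations (order <= 2) that co-occur in positive videos far more
    often than in negative samples, using support (Eq. 4) and confidence (Eq. 5)."""
    transactions = [(s, +1) for s in pos_sets] + [(s, -1) for s in neg_sets]
    n_total = len(transactions)
    candidates = set()
    for s in pos_sets:                              # candidates drawn from positive videos only
        bins = sorted(s)
        candidates.update((b,) for b in bins)
        if max_order >= 2:
            candidates.update(combinations(bins, 2))
    mined = []
    for pattern in candidates:
        cover = [label for s, label in transactions if set(pattern) <= s]
        if not cover:
            continue
        support = len(cover) / n_total              # sup(A => B), Eq. (4)
        conf_pos = cover.count(+1) / len(cover)     # confidence toward alpha+, Eq. (5)
        if support >= min_support and conf_pos >= min_conf:
            mined.append(pattern)
    return mined                                    # the co-occurring sub-signature C
```

With the empirical settings N < 10 and M < 5 mentioned above, the number of candidate pairs stays small, which is what keeps this step affordable online.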
Iterative Weakly Supervised Grouping: The above two processes are iterated to output the final grouping of returned videos, as summarized in Algorithm 1, based on which users can view the returned results organized into groups, as shown in Figure 2.
4. EXPERIMENTS
Data Collection and Labeling: We collected over 2,000 videos by querying YouTube with a set of 20 keywords³ using the YouTube API.

³ Keywords: adam lambert, bad romance, christian bale, columbia, chrismas, inauguration, kanye west, michael jackson, boat, obama inauguration, hatton, pacquiao, paranormal activity, susan boyle, climb, tiger woods, usain bolt, watchmen, wedding.
Algorithm 1: Work Flow of Iterative Topic Grouping

Input: Iteration t = 0, maximal iteration T, query keyword Q, bag-of-words signatures of the returned videos {V_i}_{i=1}^{N}, tag descriptors {T_i}_{i=1}^{N}, group number M.
Output: A partition of {V_i}_{i=1}^{N} into M groups.
1. while t < T do
2.     Run k-means with the min-Hashing similarity of {S_i}_{i=1}^{N} (converted from {V_i}_{i=1}^{N}) to form M groups;
3.     Sample (α+, α−) from these M groups;
4.     Mine discriminative co-occurring sub-signatures C from (α+, α−) and {S_i} by APriori;
5.     Append sub-signature C to V;
6.     t++;
7. end
8. for each of the M groups do
9.     Find the S_j with the nearest min-Hashing distance to the other videos in the group;
10.    Select video j as the representative video;
11. end
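To show how the pieces fit together, here is a compact Python sketch of the loop in Algorithm 1, reusing mine_subsignature from the earlier sketch. Plain k-means over the (augmented) histogram signatures is used as a stand-in for the min-Hashing-similarity clustering, and each mined pattern is appended as one binary feature; M, T, the sampling sizes, and the helper names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def pattern_features(bows, patterns):
    """One binary feature per mined pattern: 1 if the video contains every bin of it."""
    sets = [set(np.flatnonzero(v)) for v in bows]
    return np.array([[float(set(p) <= s) for p in patterns] for s in sets])

def iterative_topic_grouping(bows, M=5, T=3, seed=0):
    """Alternate grouping and per-group sub-signature mining, appending the mined
    sub-signature to the signatures before the next round (cf. Algorithm 1)."""
    bows = np.asarray(bows, dtype=float)
    sets = [set(np.flatnonzero(v)) for v in bows]
    feats, rng, labels = bows, np.random.default_rng(seed), None
    for _ in range(T):
        labels = KMeans(n_clusters=M, random_state=seed, n_init=10).fit_predict(feats)
        patterns = []
        for m in range(M):
            members = np.flatnonzero(labels == m)
            others = np.flatnonzero(labels != m)
            pos = [sets[i] for i in members]
            neg = [sets[i] for i in rng.choice(others, size=min(members.size, others.size), replace=False)]
            patterns += mine_subsignature(pos, neg)     # sketch from the mining step above
        if patterns:
            feats = np.hstack([bows, pattern_features(bows, patterns)])
    return labels
```

The representative-video selection at the end of Algorithm 1 (the group member with the smallest min-Hashing distance to the others) is omitted here for brevity.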
Table 1. Time cost of online result grouping.

Method                    Iterations   Mining    Grouping
K-means with BoW          10           0 s       2.35 s
Our method (order = 2)    5            3.78 s    0.57 s
Our method (order = 3)    10           11.76 s   0.62 s
Our method (order = 4)    15           25.36 s   0.91 s
For the returned videos of each specific query, we manually label the videos into multiple groups as our grouping ground truth. From all videos, we extract STIP descriptors [1] and adopt hierarchical k-means to quantize all descriptors into an initial vocabulary. Table 1 gives the average grouping time comparison on a regular PC with a Pentium Dual-Core 2.54 GHz CPU and 2 GB memory.
Baselines and Evaluation Criteria: (1) K-Means: group the bag-of-visual-words signatures by k-means clustering, without iteratively mining the discriminative sub-signature. (2) Without Mining Co-occurred Patterns: only conduct the visual signature grouping based on the min-Hashing similarity, without iteratively mining the discriminative sub-signature. (3) Without Tag Refinement: run the entire procedure of Section 3 without integrating tag similarity into Equation 3. (4) Without Multi-iterations: run only one round of topic grouping and sub-signature mining.
We evaluate our grouping effectiveness using the Average AP among all groups: Average AP = (1/n) Σ_{i=1}^{n} AP_i, where AP_i is the Average Precision of the ith group.
Quantitative Results: Figure 4 shows the Average AP of online grouping when tuning the number of groups in the k-means clustering.
Fig. 3. Case Study of Average APs of the online grouping results with respect to different keywords.
Fig. 4. Average AP comparison of online grouping with respect to different numbers of groups.
In general, the best performance comes from iteratively mining the discriminative sub-signatures to refine the clustering results in a weakly supervised manner. Meanwhile, with a moderate group number, the corresponding precision is largely insensitive for our final scheme, while it is very sensitive for the K-Means baseline and the baselines without discriminative/iterative sub-signature mining.
Table 1 shows the runtime of the different grouping schemes. If we carefully control the order of mined patterns and the number of returned videos, online grouping is moderately efficient⁴. This is partially because min-Hashing avoids linearly scanning all signatures during similarity measurement, which accelerates the iterative online grouping.
Figure 3 further shows the individual Average APs with respect to different keywords in our query set (group number is 5). While different keywords vary considerably in grouping precision, our proposed grouping is consistently superior to all alternatives. In addition, comparing baselines (3)(4) with (1)(2) shows that iteratively refining the signatures largely boosts the grouping precision.
5. CONCLUSION AND FUTURE WORK
In this paper, we investigate the issue of online grouping returned videos into "topics" to facilitate efficient browsing of YouTube videos in a query-by-keyword scenario, where neither the video contents nor their tags are fully trustable for direct usage. To this end, we propose a weakly supervised topic grouping paradigm with two iterative phases: (1) a min-Hashing based fast online grouping with tag similarity refinement and (2) an APriori based discriminative sub-signature mining to learn an optimal visual representation, which is iteratively fed back to refine the grouping. Our scheme outperforms a group of alternative schemes in quantitative evaluations on over 2,000 YouTube videos returned from 20 keyword queries. Finally, how to select representative videos for each group will be investigated in our future work.

⁴ Nevertheless, running our algorithm on (parallel) servers would further reduce the time cost by orders of magnitude, or equivalently enable us to handle more returned videos per query or to conduct higher-order mining.
6. REFERENCES
[1] I. Laptev and T. Lindeberg. Space-time interest points. ICCV, 2003.
[2] P. Donmez and J. Carbonell. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. CIKM, 2008.
[3] J. Whitehill, P. Ruvolo, J. Bergsma, T. Wu, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. NIPS, 2009.
[4] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. CVPR, 2010.
[5] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma. Improving web search results using affinity graph. SIGIR, 2005.
[6] K. Song, Y. Tian, W. Gao, and T. Huang. Diversifying the image retrieval results. ACM Multimedia, 2006.
[7] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag ambiguity. ACM Multimedia, 2008.
[8] R. van Leuken, L. Garcia, and X. Olivares. Visual diversification of image search results. WWW, 2009.
[9] R. Ji and H. Yao. Visual & textual fusion for region retrieval: From both fuzzy matching and Bayesian reasoning aspects. Multimedia Information Retrieval, 2007.
[10] R. Ji, P. Xu, and H. Yao. Directional correlation analysis of local Haar binary pattern for text detection. ICME, 2008.
[11] X. Liu, R. Ji, H. Yao, and P. Xu. Cross-media manifold learning for image retrieval & annotation. ICMR, 2010.
[12] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao. Location discriminative vocabulary coding for mobile landmark search. International Journal of Computer Vision, 2012.
[13] R. Ji, Y. Gao, B. Zhong, H. Yao, and Q. Tian. Mining Flickr landmarks by modeling reconstruction sparsity. TOMCCAP, 2011.
[14] V. Raykar, S. Yu, L. Zhao, A. Jerebko, C. Florin, G. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. ICML, 2009.
[15] O. Dekel and O. Shamir. Good learners for evil teachers. ICML, 2009.
[16] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-Hash and TF-IDF weighting. BMVC, 2008.
[17] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. VLDB, 1994.