
Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence
Understanding and Predicting Interestingness of Videos
Yu-Gang Jiang, Yanran Wang, Rui Feng,
Xiangyang Xue, Yingbin Zheng, Hanfang Yang
School of Computer Science, Fudan University, Shanghai, China
Abstract
The amount of video available on the Web is growing explosively. While some videos are very interesting and receive high ratings from viewers, many others are less interesting or even boring. This paper conducts a pilot study on the understanding of human perception of video interestingness, and demonstrates a simple computational method to identify more interesting videos. To this end we first construct two datasets of Flickr and YouTube videos, respectively. Human judgements of interestingness are collected and used as the ground truth for training computational models. We evaluate several off-the-shelf visual and audio features that are potentially useful for predicting interestingness on both datasets. Results indicate that audio and visual features are equally important, and the combination of both modalities shows very promising results.
Introduction
The measure of interestingness of videos can be used to improve user satisfaction in many applications. For example, in Web video search, for videos with similar relevance to a query, it would be good to rank the more interesting ones higher. Similarly, this measure is also useful in video recommendation: users will be more satisfied, and as a result the stickiness of a video-sharing website will be largely improved, if the recommended videos are interesting and attractive.
While great progress has been made in recognizing semantic categories (e.g., scenes, objects and events) in videos (Xiao et al. 2010; Laptev et al. 2008), a model for automatically predicting the interestingness of videos remains elusive due to the subjectiveness of this criterion. In this paper we define interestingness as a measure that is collectively reflected in the judgements of a large number of viewers. We found that, while a particular viewer may have a very different judgement, most viewers generally reach a clear agreement on whether one video is more interesting than another. The problem we pose here is: given two videos,
can a computational model automatically analyze their contents and predict which one is more interesting? To answer
this challenging question, this paper conducts an in-depth study using a large set of off-the-shelf video features and a state-of-the-art prediction model. Most selected features have been shown effective in recognizing video semantics, and a few are chosen particularly for this task based on our intuitions. We investigate the feasibility of predicting interestingness using two datasets collected from Flickr and YouTube respectively, both with ground-truth labels generated based on human judgements (see example frames in Figure 1).

Figure 1: Example frames of videos from the Flickr and YouTube datasets we collected. For each dataset, the video on top is considered more interesting than the other one according to human judgements.
This paper makes two important contributions:
• We construct two benchmark datasets with ground-truth labels to support the study of video interestingness (available at www.yugangjiang.info/research/interestingness). We analyze the factors that may reflect interestingness on both datasets, which are important for the development of a good computational model.
• We evaluate a large number of features (visual, audio, and high-level attributes) by building models for predicting interestingness. This provides critical insights into the feasibility of solving this challenging problem using current techniques.
Notice that our aim is to have a general interestingness prediction that most people would agree with, not one specifically tailored for a particular person. While the latter might also be predictable with a sufficient amount of training data and supporting information like browsing history, it is out of the scope of this work.
Related Work
There have been a number of works focusing on predicting
aesthetics of images (Datta et al. 2006; Ke, Tang, and Jing
2006; Murray, Marchesotti, and Perronnin 2012), which is
a related yet different goal. Several papers directly investigated the problem of image interestingness. In (Katti et al.
2008), the authors used human experiments to verify that
people can discriminate the degree of image interestingness
in very short time spans, and discussed the potential of using this criterion in various applications like image search.
The use of interestingness was also explored in (Lerman,
Plangprasopchok, and Wong 2007) to achieve improved and
personalized image search results. A more recent work in
(Dhar, Ordonez, and Berg 2011) presented a computational
approach for the estimation of image interestingness using
three kinds of attributes: compositional layouts, content semantics, and sky-illumination types (e.g., cloudy and sunset). The authors found that these high-level attributes are
more effective than the standard features (e.g., color and
edge distributions) used in (Ke, Tang, and Jing 2006), and
further combining both of them can achieve significant improvements.
Another work that is perhaps the most related to ours is
(Liu, Niu, and Gleicher 2009), where the authors used Flickr
images to measure the interestingness of video frames.
Flickr images were assumed to be mostly interesting compared to many video frames, since the former are generally well taken and selected. A video frame is considered
interesting if it matches (using image local features) to a
large number of images. Promising results were observed
on travel videos of several famous landmarks. Our work is
fundamentally different since predicting the interestingness
of a video frame is essentially the same problem as for images. The video-level prediction in this paper demands serious consideration of several other clues, such as audio. To the best of our knowledge, there is no existing work studying computational approaches for modeling video-level interestingness.
Datasets
To facilitate this study we need benchmark datasets with ground-truth interestingness labels. Since no such dataset is publicly available, we collected two new datasets.

The first dataset was collected from Flickr, which has a criterion called "interestingness" to rank its search results. While the exact way of computing this criterion is unknown, it is clear that various subjective measures of human interaction are considered, such as the number of views, tags and comments, the popularity of video owners, etc. We used 15 keywords to form 15 Flickr searches and downloaded the top 400 videos from each search, using the interestingness-based ranking. The keywords include "basketball", "beach", "bird", "birthday", "cat", "dancing", "dog", "flower", "graduation", "mountain", "music performance", "ocean", "parade", "sunset" and "wedding". We found that the interestingness ranking of videos is not locally stable, i.e., the order of nearby videos may change from time to time. This may be because the criterion is frequently updated based on the most recent user interactions. We therefore follow a previous work on image interestingness analysis (Dhar, Ordonez, and Berg 2011) and use the top 10% of the videos from each search as "interesting" samples and the bottom 10% as "uninteresting" samples. The final dataset thus contains 1,200 videos (15 × 400 × (10% + 10%)), with an average duration of around 1 minute (20 hours in total).

We underline that although the Flickr search engine already has the function of estimating video interestingness, it relies on human interactions/inputs, which become available only after a certain time period. A computational model based purely on content analysis is desirable as it can predict interestingness immediately. Besides, content-based interestingness estimation can be used in other domains where human interactions are not available.

The second dataset consists of advertisement videos downloaded from YouTube (we chose YouTube because Flickr has far fewer ad videos, whose interestingness rankings are also unreliable due to the limited number of views those videos receive). Practically, it may be useful if a system could automatically predict which ad is more interesting. We collected ad videos in 14 categories: "accessories", "clothing&shoes", "computer&website", "digital products", "drink", "food", "house application", "houseware&furniture", "hygienic products", "insurance&bank", "medicine", "personal care", "phone" and "transportation". 30 videos were downloaded for each category, covering ads of a large number of products and services. The average video duration of this dataset is 36 seconds.

After collecting the YouTube data, we assigned an "interestingness" score to each video, defined by the average ranking provided by human assessors after viewing all videos in the corresponding category. Ten assessors (five males and five females) were involved in this annotation process. Each time, an assessor was shown a pair of videos and asked to tell which one is more interesting (or whether the two are equally interesting). Labels from the 10 assessors were consolidated to form fine-grained interestingness rankings of all the videos in each category.

We analyzed the videos and labels in order to understand whether there are clear clues that are highly correlated with interestingness, which would be helpful for the design of the computational models. While the problem was found to be very complex and challenging as expected, one general observation is that videos with humorous stories, attractive background music, or better professional editing tend to be more interesting. Figure 1 shows a few example video frames from both datasets.
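To make the annotation step concrete, the sketch below illustrates one simple way to consolidate pairwise judgements into a per-category ranking. The paper does not specify the exact aggregation scheme, so the win-counting (Borda-style) rule and the data structures here are assumptions for illustration only.

```python
# Minimal sketch (not the authors' exact procedure): aggregate pairwise
# judgements into a per-category ranking by counting wins (Borda-style).
from collections import defaultdict

def consolidate_pairwise(judgements, video_ids):
    """judgements: list of (winner, loser) tuples from all assessors;
    ties can simply be skipped. Returns video ids sorted from most to
    least interesting."""
    wins = defaultdict(float)
    for winner, _loser in judgements:
        wins[winner] += 1.0
    # Videos never chosen as winner still need an entry.
    for v in video_ids:
        wins.setdefault(v, 0.0)
    return sorted(video_ids, key=lambda v: wins[v], reverse=True)

# Hypothetical usage for one YouTube ad category with 30 videos:
# ranking = consolidate_pairwise(judgements_from_10_assessors, list(range(30)))
```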
Predicting Interestingness
With the datasets we are now able to develop and evaluate
computational models. A central issue in building a successful model is the selection or design of video features, which are fed into machine learning algorithms for prediction. A large number of features are studied, ranging from visual and audio features to high-level semantic ones (a.k.a. attributes). These features are selected based on past experience in video analysis (e.g., object/scene recognition) and intuitions about what may be useful for predicting interestingness. We briefly describe each of them in the following subsection.
Figure 2: Overview of our method for predicting video interestingness. Multi-modal features (visual features, audio features, and high-level attribute features) are extracted from each video, combined via multi-modal fusion, and fed into a Ranking SVM to produce the prediction results.
Features
Color Histogram in HSV space is the first feature to be
evaluated, as color is an important clue in visual perception.
SIFT (Scale Invariant Feature Transform) feature has
been popular for years, producing outstanding performance
in many applications. We adopt the standard pipeline as described in (Lowe 2004). Since SIFT computation on every
frame is computationally expensive and nearby frames are
visually very similar, we sample one frame per second. The same set of sampled frames is used in computing all the
other features except the audio ones. The SIFT descriptors
of a video are quantized into a bag-of-words representation
(Sivic and Zisserman 2003) using a spatial pyramid with a
codebook of 500 codewords. This representation has been
widely used in the field of image/video content recognition.
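As a concrete illustration of this quantization step, a minimal sketch is given below. It assumes OpenCV and scikit-learn are available, omits the spatial pyramid for brevity, and is not the exact implementation used in our experiments.

```python
# Sketch of the bag-of-words quantization described above (spatial pyramid
# omitted for brevity). Assumes OpenCV (cv2) and scikit-learn.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = sift.detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_codebook(training_frames, n_words=500):
    # Cluster descriptors from a set of training frames into 500 codewords.
    descs = np.vstack([sift_descriptors(f) for f in training_frames])
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(descs)

def video_bow(sampled_frames, codebook, n_words=500):
    """Quantize all SIFT descriptors of a video (one frame per second)
    into an L1-normalized bag-of-words histogram."""
    hist = np.zeros(n_words)
    for frame in sampled_frames:
        desc = sift_descriptors(frame)
        if len(desc):
            words = codebook.predict(desc.astype(np.float32))
            hist += np.bincount(words, minlength=n_words)
    return hist / max(hist.sum(), 1.0)
```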
HOG (Histogram of Gradients) features (Dalal and
Triggs 2005) are effective in several computer vision tasks
such as human detection. Different from SIFT which is computed on sparsely detected patches, we compute HOG descriptors over densely sampled patches (on the same set of
frames). Following (Xiao et al. 2010), HOG descriptors in a
2×2 neighborhood are concatenated to form a single descriptor, since a higher-dimensional feature is normally more discriminative.
The descriptors are converted to a bag-of-words representation for each video, in the same way as we quantize the SIFT
features.
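The following rough sketch illustrates the dense extraction and 2×2 concatenation described above; the cell size and the number of orientation bins are our own assumptions rather than the exact settings used in the experiments.

```python
# Rough sketch of the dense HOG variant described above: per-cell gradient
# orientation histograms on a regular grid, with 2x2 neighbouring cells
# concatenated into one descriptor (cell size and bin count are assumptions).
import numpy as np

def dense_hog_descriptors(gray, cell=8, bins=9):
    gray = gray.astype(np.float32)
    gx = np.gradient(gray, axis=1)
    gy = np.gradient(gray, axis=0)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi               # unsigned orientation in [0, pi)
    h, w = gray.shape
    ny, nx = h // cell, w // cell
    cells = np.zeros((ny, nx, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(ny):
        for j in range(nx):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            cells[i, j] = np.bincount(b, weights=m, minlength=bins)
    # Concatenate each 2x2 neighbourhood of cells into a 4*bins descriptor;
    # the descriptors are later quantized into a bag-of-words, as for SIFT.
    descs = [np.concatenate([cells[i, j], cells[i, j+1],
                             cells[i+1, j], cells[i+1, j+1]])
             for i in range(ny - 1) for j in range(nx - 1)]
    return np.array(descs)
```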
SSIM (Self-Similarities) is another type of local descriptor, which is computed by quantizing a correlation map of
a densely sampled frame patch within a larger circular window (Shechtman and Irani 2007). We set the patch size and
the window radius to 5 × 5 and 40 pixels, respectively. Similarly, SSIM descriptors are also quantized into a bag-of-words
representation.
GIST is a global visual feature, capturing mainly the
global texture distribution in a video frame. Specifically,
it is computed based on the output energy of Gabor-like
filters (8 orientations, 4 scales) over a 4 × 4 image grid
(Oliva and Torralba 2001). The final descriptor is 512-d
(8 × 4 × 4 × 4). We use the averaged GIST descriptor over
the sampled frames to represent a video.
MFCC: Research in neuroscience has found that multiple
senses work together to enhance human perception (Stein
and Stanford 2008). Audio information is intuitively a very
important clue for our work. The first audio feature we consider is the well-known mel-frequency cepstral coefficients
(MFCC), which are computed for every 32ms time-window
(an audio frame) with 50% overlap. The MFCC descriptors of a video are mapped to a bag-of-words representation, in the same way as the visual features are quantized.
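A minimal sketch of this extraction step, using the librosa library, is shown below; the number of coefficients (13) is an assumption, and the resulting per-frame descriptors would subsequently be quantized as described above.

```python
# Sketch of the MFCC extraction described above: 32 ms analysis windows
# with 50% overlap (13 coefficients is our own assumption).
import librosa
import numpy as np

def video_mfcc_frames(audio_path, n_mfcc=13):
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    win = int(0.032 * sr)          # 32 ms analysis window
    hop = win // 2                 # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop)
    return mfcc.T                  # one descriptor per audio frame,
                                   # later quantized into a bag-of-words
```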
Spectrogram SIFT is an audio feature motivated by recent computer vision techniques. First, an auditory image is
generated based on the constant-Q spectrogram of a video’s
sound track, which visualizes the distribution of energy in
both time and frequency. SIFT descriptors are then computed on the image and quantized into a bag-of-words representation. Similar ideas of using vision techniques for audio
representation were explored in (Ke, Hoiem, and Sukthankar
2005).
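The idea can be sketched as follows, assuming librosa and OpenCV are available; the rendering parameters are illustrative choices rather than the exact ones used in our system.

```python
# Sketch of the Spectrogram SIFT idea: render a constant-Q spectrogram as a
# grayscale image and run SIFT on it (rendering parameters are assumptions).
import cv2
import librosa
import numpy as np

def spectrogram_sift(audio_path):
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    cqt = np.abs(librosa.cqt(y, sr=sr))
    img = librosa.amplitude_to_db(cqt, ref=np.max)          # energy in dB
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc    # quantized into a bag-of-words like the visual SIFT
```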
Audio-Six: We also include another compact audio feature consisting of six basic audio descriptors that have been
frequently adopted in audio and music classification, including Energy Entropy, Signal Energy, Zero Crossing Rate,
Spectral Rolloff, Spectral Centroid, and Spectral Flux. This
six-dimensional feature is expected to be useful as we observed that music seems to have a certain degree of positive
correlation with video interestingness.
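A possible implementation of these six descriptors, averaged over the clip, is sketched below. The frame and hop sizes are assumptions, and energy entropy is approximated here by the entropy of each frame's normalized spectrum rather than any particular published definition.

```python
# Sketch of the six basic descriptors in Audio-Six, averaged over the clip.
# Spectral flux and energy entropy are computed by hand; frame/hop sizes and
# the entropy approximation are our own choices.
import librosa
import numpy as np

def audio_six(audio_path, frame=1024, hop=512, eps=1e-10):
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    S = np.abs(librosa.stft(y, n_fft=frame, hop_length=hop))
    n_frames = 1 + max(0, (len(y) - frame) // hop)
    energy = np.array([np.sum(y[i*hop:i*hop+frame] ** 2) for i in range(n_frames)])
    p = S / (S.sum(axis=0, keepdims=True) + eps)             # per-frame spectrum
    energy_entropy = -np.sum(p * np.log2(p + eps), axis=0)    # entropy approximation
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=frame, hop_length=hop)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame, hop_length=hop)[0]
    flux = np.sqrt(np.sum(np.diff(S, axis=1) ** 2, axis=0))
    return np.array([energy_entropy.mean(), energy.mean(), zcr.mean(),
                     rolloff.mean(), centroid.mean(), flux.mean()])
```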
Classemes (Torresani, Szummer, and Fitzgibbon 2010)
is a high-level descriptor consisting of prediction scores of
2,659 semantic concepts (objects, scenes, etc.). Models of
the concepts were trained using external data crawled from
the Web. Different from the visual and audio features listed
above, each dimension of this representation has a clear semantic interpretation (often referred to as an attribute in recent literatures). The authors of (Dhar, Ordonez, and Berg
2011) reported that high-level attributes (e.g., scene types
and animals) are effective for predicting image interestingness. Like the GIST descriptor, an averaged vector of the
Classemes descriptors of all the sampled frames is used to
represent a video.
ObjectBank (Li et al. 2010) is another high-level attribute
descriptor. Different from Classemes which uses frame-level
concept classification scores, ObjectBank is built on the local response scores of object detections. Since an object may
appear in very different sizes, detection is performed in multiple image (frame) scales. 177 object categories are adopted
in ObjectBank.
Style Attributes: We also consider the photographic style
attributes defined in (Murray, Marchesotti, and Perronnin
2012) as another high-level descriptor, which is formed
by concatenating the classification outputs of 14 photographic styles (e.g., Complementary Colors, Duotones, Rule
of Thirds, Vanishing Point, etc.). These style attributes were
found useful for evaluating the aesthetics and interestingness
of images, and it is therefore interesting to know whether
similar observations could extend to videos.
Among these representations, the first five are low-level
visual features, the next three are audio features, and the last
three are high-level attribute descriptors. Due to space limitations, we cannot elaborate on the generation process of each
of them. Interested readers are referred to the corresponding
references for more details.
Classifier and Evaluation
As it is difficult to quantify the degree of interestingness of a video, our aim in this work is to train a model to compare the interestingness of two videos, i.e., to tell which one is more interesting. Given a set of videos, the model can then be used to generate a ranking, which is often sufficient in practical applications. We therefore adopt Joachims' Ranking SVM (Joachims 2003) to train prediction models. SVM is used here because it is one of the most successful algorithms in many video analysis applications, and the Ranking SVM algorithm designed for ranking-based problems has been popular in information retrieval. For histogram-like features such as the color histogram and the bag-of-words representations, we adopt the χ² RBF kernel, which has been observed to be a good choice. For the remaining features (GIST and the attribute descriptors), the Gaussian RBF kernel is used.

Training data are organized in the form of video pairs, with ground-truth labels indicating which video in each pair is more interesting. Since the Flickr dataset does not have a stable fine-grained ranking, similar to (Dhar, Ordonez, and Berg 2011), we generate training and test pairs between the top 10% and bottom 10% of the videos (the total number of pairs is 600 × 600/2). For both datasets, we use 2/3 of the videos for training and the rest for testing. The datasets are randomly partitioned ten times and a model is trained for each train-test split. We report both the mean and standard deviation of the prediction accuracies, which are computed as the percentage of test video pairs that are correctly ranked.

Fusing multiple features is important since many features are complementary. We use kernel-level fusion, which linearly combines the kernels computed from each single feature to form a new kernel. Equal weights are adopted for simplicity, although it is worth noting that dynamic weights learned from cross-validation or multiple kernel learning might further improve performance.
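To make the training and evaluation protocol concrete, the sketch below shows a simplified stand-in: an approximate additive χ² feature map per feature, equal-weight early fusion, and the classic RankSVM reduction to binary classification on feature differences. It is only an illustrative approximation of the Joachims Ranking SVM with exact χ²/Gaussian RBF kernels and kernel-level fusion used in our experiments.

```python
# Simplified stand-in for the pipeline above: approximate chi-square feature
# maps per feature, equal-weight early fusion, and a RankSVM-style pairwise
# reduction trained with a linear SVM. The actual experiments use Joachims'
# Ranking SVM with exact chi-square/Gaussian RBF kernels and kernel fusion.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC

def fuse_features(feature_blocks):
    """feature_blocks: list of (n_videos, d_i) non-negative histograms.
    Each block is mapped through an approximate chi-square feature map
    and the blocks are concatenated with equal weights."""
    mapped = [AdditiveChi2Sampler().fit_transform(X) for X in feature_blocks]
    return np.hstack(mapped)

def train_pairwise_ranker(X, pairs):
    """pairs: list of (i, j) with video i judged more interesting than j.
    Classic RankSVM reduction: classify the sign of feature differences."""
    diffs = np.vstack([X[i] - X[j] for i, j in pairs] +
                      [X[j] - X[i] for i, j in pairs])
    labels = np.array([1] * len(pairs) + [-1] * len(pairs))
    return LinearSVC(C=1.0).fit(diffs, labels)

def pairwise_accuracy(model, X, test_pairs):
    """Fraction of test pairs ranked correctly (the metric reported here)."""
    scores = X @ model.coef_.ravel()
    return np.mean([scores[i] > scores[j] for i, j in test_pairs])
```

With linear models on the mapped features, concatenating the blocks with equal weights plays roughly the same role as averaging the per-feature kernels.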
Results and Discussions
In this section, we train models using the multi-modal features and measure their accuracies on both datasets. We first report results of the visual, audio, and attribute features separately, and then evaluate the fusion of all the features.

Visual feature prediction: We first examine the effectiveness of the visual features. Results are summarized in Figure 3(a). Overall, the visual features achieve very impressive performance on both datasets. For example, the SIFT feature produces an accuracy of 74.5% on the Flickr dataset and 67.0% on the YouTube dataset. Among the five evaluated visual features, Color Histogram is the worst (68.1% on Flickr and 58.0% on YouTube), which indicates that, while color is important in many human perception tasks, it is probably not an important clue for evaluating interestingness. Intuitively this is reasonable, since it is mainly the semantic-level story that really attracts people. Although other features like SIFT and HOG do not have direct semantic interpretations, they are the cutting-edge features for recognizing video semantics.

We also evaluate the combination of multiple visual features. Since there are too many feature combinations to be evaluated, we start from the best single feature and then incrementally add new features. A feature is dropped if it does not contribute in a fusion experiment. As shown in Figure 3(a), fusing visual features does not always improve the results. In addition to Color Histogram, GIST also degrades the performance, probably because it is a global descriptor and is not robust to content variations such as object scale changes.

Audio feature prediction: Next, we experiment with the three audio-based features. Figure 3(b) gives the results. We see that all the audio features, including the very compact Audio-Six descriptor, are discriminative for this task. This verifies that the audio channel conveys very useful information for human perception of interestingness. Spectrogram SIFT produces the best performance on Flickr (mean accuracy 74.7%) and MFCC has the highest accuracy on YouTube (64.8%). These individual audio features already show better performance than some of the evaluated visual features.

The three audio features are also very complementary, which is not surprising since they capture very different information from video soundtracks. By combining them, we get an accuracy of 76.4% on Flickr and 65.7% on YouTube.

Attribute feature prediction: The last set of features to be evaluated are attribute-based. As shown in Figure 3(c), while the attribute features are all effective, they do not work as well as we expected. Compared with the audio and visual features, the results are similar or slightly worse. This may be because the models for detecting these attributes were trained using external data (mostly Web images), which are visually very different from the video data used in our work. This domain difference largely limits the performance of this set of features. We therefore conjecture that significant performance gains may be attained with new, carefully designed attribute models trained using video data.
1116
2
3
4
5
12 123 124 125
YouTube
1
2
3
4
5
12 123 124 125
100
90
80
70
60
50
40
30
20
10
0
Flickr
Accuracy
Accuracy
1
100
90
80
70
60
50
40
30
20
10
0
1
2
3
12
123
YouTube
Accuracy
100
90
80
70
60
50
40
30
20
10
0
Flickr
Accuracy
Accuracy
Accuracy
100
90
80
70
60
50
40
30
20
10
0
1
2
(a)
3
12
(b)
123
100
90
80
70
60
50
40
30
20
10
0
100
90
80
70
60
50
40
30
20
10
0
Flickr
1
2
3
12
23
12
23
YouTube
1
2
3
(c)
Figure 3: Accuracies (%) of interestingness prediction on both Flickr and YouTube datasets, using models trained with individual features and their fusion. (a) Visual features (1. SIFT; 2. HOG; 3. SSIM; 4. GIST; 5. Color Histogram). (b) Audio features
(1. MFCC; 2. Spectrogram SIFT; 3. Audio-Six). (c) Attribute features (1. Style Attributes; 2. Classemes; 3. ObjectBank). Notice
that a feature is dropped immediately if it does not contribute to a fusion experiment, and therefore not all the feature combinations are experimented. The best feature combinations are “123”, “123” and “23” in the visual, audio and attribute feature sets
respectively.
Comparing the three attribute features, Classemes and ObjectBank are significantly better than the Style Attributes, which is an interesting observation since the latter were claimed effective for predicting image aesthetics and interestingness (Dhar, Ordonez, and Berg 2011; Murray, Marchesotti, and Perronnin 2012). This indicates that video interestingness is mostly determined by complex high-level semantics and has little correlation with spatial and color compositions such as the Rule of Thirds.
Fusing visual, audio and attribute features: Finally, we
train models by fusing features from all the three sets. In
each set, we choose the subset of features that show the
best performance in the intra-set fusion experiments (e.g.,
Classemes and ObjectBank in the attribute set, as indicated
in Figure 3). Table 1 (the "Overall" rows) gives the
accuracies of each feature set and their fusion. Fusing visual and audio features always leads to substantial performance gains (76.6%→78.6% on Flickr and 68.0%→71.7%
on YouTube), which clearly show the value of jointly modeling audio and visual features for predicting video interestingness.
The attribute features further boost the performance from 78.6% to 79.7% on Flickr but do not improve the results on
YouTube. This is perhaps because most Flickr videos were
captured in less constrained environments by amateur consumers. As a result, contents of the Flickr videos are more
diverse, for which the large set of semantic attributes may be
more discriminative.
Generally the results on the Flickr dataset are better than
those on the YouTube dataset, because the training and test
video pairs in Flickr are generated only between top and bottom ranked videos (see discussions in Section “Datasets”),
which are apparently easier to predict.
Per-category performance and analysis: We also analyze the prediction accuracy of each category in both
datasets, to better understand the strength and limitation of
our current method. Results are listed in Table 1. We see
that the performance is very good for some of the categories
like “mountain” in Flickr and “medicine” in YouTube. Many
highly rated “mountain” and “sunset” videos contain background music, which is the main reason that audio features
alone already produce very impressive results. In contrast,
videos from categories like “graduation” are very diverse
and do not have a very clear audio clue. Overall we observe similar performance from the audio and visual features
on both datasets. While the fusion of audio-visual features
shows clear performance increases for most categories, we emphasize that the current fusion method, i.e., average kernel fusion, is clearly not an optimal choice. Deeper investigation of joint audio-visual feature design might lead to a big performance leap.
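As a simple illustration of how the fixed equal weights could be replaced, the hypothetical sketch below searches fusion weights on held-out pairs, reusing the train_pairwise_ranker and pairwise_accuracy helpers sketched in the Classifier and Evaluation section; it is not part of the reported experiments.

```python
# Hypothetical extension of the earlier sketch: pick per-modality fusion
# weights by grid search on held-out pairs instead of using equal weights.
# Weighting each mapped feature block by sqrt(w) corresponds to weighting
# its (linear) kernel by w.
import itertools
import numpy as np

def search_fusion_weights(mapped_blocks, train_pairs, val_pairs,
                          grid=(0.0, 0.5, 1.0, 2.0)):
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=len(mapped_blocks)):
        if not any(w):
            continue                      # skip the all-zero combination
        X = np.hstack([np.sqrt(wi) * B for wi, B in zip(w, mapped_blocks)])
        model = train_pairwise_ranker(X, train_pairs)
        acc = pairwise_accuracy(model, X, val_pairs)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```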
Figure 4 shows some challenging examples. Our method has difficulties in handling cases where the videos have similar visual and audio clues. For example, both videos in the "food" category contain people sitting around a dining table while talking about something related to the products. The main difference between these videos is the sense of humor, or the attractiveness of the story, which cannot be captured by our current features. One promising direction to tackle this challenge is to design a better set of high-level visual/audio attributes for interestingness prediction. In addition, speech recognition and natural language processing techniques may be adopted to help understand the conversations, which would be very useful.
Table 1: Per-category prediction accuracies (%) on the Flickr (top) and YouTube (bottom) datasets, using visual, audio, and attribute features and their fusion.

Flickr
Category | Visual | Audio | Attribute | Vis.+Aud. | Vis.+Aud.+Att.
basketball | 73.9±11.6 | 70.5±7.5 | 64.8±8.3 | 73.5±10.2 | 74.2±9.6
beach | 87.3±5.1 | 77.0±4.1 | 81.0±7.5 | 82.6±4.8 | 83.3±5.3
bird | 74.3±6.9 | 78.6±8.4 | 75.3±6.5 | 80.9±7.9 | 82.4±8.3
birthday | 86.9±4.5 | 83.8±6.6 | 80.8±4.5 | 87.1±6.4 | 88.0±5.7
cat | 68.7±8.1 | 63.4±10.2 | 60.3±7.3 | 64.9±7.6 | 66.9±7.0
dancing | 62.6±11.1 | 73.2±9.9 | 74.0±8.1 | 65.6±13.5 | 70.2±12.6
dog | 71.9±10.3 | 72.2±6.0 | 68.5±6.9 | 74.0±7.8 | 72.9±6.3
flower | 82.6±6.7 | 82.2±5.2 | 81.7±5.5 | 85.5±3.4 | 86.2±2.3
graduation | 76.4±11.9 | 69.7±6.1 | 71.9±11.7 | 78.9±7.0 | 74.7±8.8
mountain | 83.4±5.4 | 88.1±3.0 | 81.0±6.6 | 89.4±3.0 | 90.4±2.4
music perf. | 81.0±5.0 | 77.4±6.5 | 80.7±8.5 | 83.3±6.4 | 86.0±4.7
ocean | 68.4±7.1 | 76.3±7.4 | 74.0±9.5 | 76.2±6.6 | 78.0±6.5
parade | 77.8±8.1 | 72.1±5.9 | 76.6±6.5 | 71.3±8.9 | 76.0±9.1
sunset | 71.6±9.6 | 82.6±6.8 | 69.0±7.2 | 82.3±7.5 | 81.3±7.8
wedding | 81.5±8.1 | 79.2±8.9 | 82.1±7.1 | 83.5±7.0 | 84.9±6.6
Overall | 76.6±2.4 | 76.4±2.2 | 74.8±1.8 | 78.6±2.5 | 79.7±2.7

YouTube
Category | Visual | Audio | Attribute | Vis.+Aud. | Vis.+Aud.+Att.
accessories | 77.8±8.6 | 63.6±8.8 | 73.8±7.4 | 74.0±9.1 | 74.9±7.5
clothing&shoes | 67.3±9.8 | 73.3±9.0 | 53.3±5.9 | 76.0±7.3 | 73.1±7.9
computer&website | 71.8±4.5 | 70.7±8.8 | 70.9±9.5 | 72.2±8.7 | 74.9±7.0
digital products | 66.2±9.5 | 61.6±15.5 | 71.8±11.9 | 74.4±7.5 | 75.6±8.7
drink | 70.4±13.0 | 73.3±13.0 | 60.4±12.6 | 75.1±10.3 | 74.2±10.9
food | 53.6±12.5 | 57.8±8.5 | 55.1±8.9 | 57.8±8.0 | 59.8±8.8
house application | 74.2±7.6 | 68.7±9.0 | 68.2±9.1 | 74.0±8.8 | 72.2±7.9
houseware&furniture | 60.4±7.6 | 64.2±8.3 | 59.8±7.6 | 68.9±7.6 | 66.7±6.6
hygienic products | 63.3±10.6 | 72.9±4.7 | 55.6±10.8 | 73.8±8.3 | 72.7±6.8
insurance&bank | 70.4±8.4 | 56.2±11.2 | 64.9±13.3 | 66.0±12.5 | 62.9±13.9
medicine | 72.4±7.9 | 71.8±7.6 | 67.1±9.4 | 74.0±9.3 | 76.0±8.1
personal care | 62.0±11.4 | 67.8±9.4 | 59.6±12.4 | 72.0±7.6 | 71.6±7.6
phone | 65.3±8.8 | 70.7±8.6 | 67.1±10.0 | 75.1±5.8 | 73.3±6.2
transportation | 76.4±7.3 | 53.8±11.4 | 72.7±7.0 | 70.2±9.4 | 71.1±8.0
Overall | 68.0±2.2 | 65.7±2.7 | 64.3±2.0 | 71.7±3.5 | 71.4±2.8
Figure 4: Challenging examples from two categories, "cat" (Flickr) and "food" (YouTube), that have low prediction accuracies using our method. In each case the upper video is more interesting than the lower one according to human judgements, but the two videos contain very similar visual and audio features.
Conclusions
We have conducted a pilot study on video interestingness,
a measure that is useful in many applications such as Web
video search and recommendation. Two datasets were collected to support this work, with ground-truth labels generated based on human judgements. These are valuable resources and will be helpful for stimulating future research
on this topic. We evaluated a large number of features on the
two datasets, which led to several interesting observations.
Specifically, we found that current audio and visual features
are effective in predicting video interestingness, and fusing
them significantly improves the performance. We also observed that some findings from previous works on image interestingness estimation do not extend to the video domain (e.g., the style attributes are not helpful for this task).

While our initial results are very encouraging, the topic deserves further investigation, as there is still room for performance improvement. Promising directions include the joint modeling of audio-visual features, the design of more suitable high-level attribute features, and speech recognition and analysis with natural language understanding. It would also be interesting to deploy and rigorously evaluate the interestingness prediction models in real-world applications such as video search and recommendation.
Acknowledgments
This work was supported in part by two grants from
the National Natural Science Foundation of China
(#61201387 and #61228205), a National 973 Program
(#2010CB327900), two grants from the Science and
Technology Commission of Shanghai Municipality
(#12XD1400900 and #12511501602), and a National 863
Program (#2011AA010604), China.
References
Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Datta, R.; Joshi, D.; Li, J.; and Wang, J. Z. 2006. Studying aesthetics in photographic images using a computational approach. In Proc. of European Conference on Computer Vision.
Dhar, S.; Ordonez, V.; and Berg, T. L. 2011. High level describable attributes for predicting aesthetics and interestingness. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Joachims, T. 2003. Optimizing search engines using clickthrough data. In Proc. of the ACM Conference on Knowledge Discovery and Data Mining.
Katti, H.; Bin, K. Y.; Chua, T. S.; and Kankanhalli, M. 2008. Pre-attentive discrimination of interestingness in images. In Proc. of IEEE International Conference on Multimedia and Expo.
Ke, Y.; Hoiem, D.; and Sukthankar, R. 2005. Computer vision for music identification. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Ke, Y.; Tang, X.; and Jing, F. 2006. The design of high-level features for photo quality assessment. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Laptev, I.; Marszalek, M.; Schmid, C.; and Rozenfeld, B. 2008. Learning realistic human actions from movies. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Lerman, K.; Plangprasopchok, A.; and Wong, C. 2007. Personalizing image search results on Flickr. In Proc. of AAAI Workshop on Intelligent Techniques for Web Personalization.
Li, L.; Su, H.; Xing, E.; and Fei-Fei, L. 2010. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In Advances in Neural Information Processing Systems.
Liu, F.; Niu, Y.; and Gleicher, M. 2009. Using web photos for measuring video frame interestingness. In Proc. of International Joint Conference on Artificial Intelligence.
Lowe, D. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60:91–110.
Murray, N.; Marchesotti, L.; and Perronnin, F. 2012. AVA: A large-scale database for aesthetic visual analysis. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42:145–175.
Shechtman, E., and Irani, M. 2007. Matching local self-similarities across images and videos. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
Sivic, J., and Zisserman, A. 2003. Video Google: A text retrieval approach to object matching in videos. In Proc. of IEEE International Conference on Computer Vision.
Stein, B. E., and Stanford, T. R. 2008. Multisensory integration: current issues from the perspective of the single neuron. Nature Reviews Neuroscience 9:255–266.
Torresani, L.; Szummer, M.; and Fitzgibbon, A. 2010. Efficient object category recognition using classemes. In Proc. of European Conference on Computer Vision.
Xiao, J.; Hays, J.; Ehinger, K.; Oliva, A.; and Torralba, A. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition.