Event-Centric Twitter Photo Summarization

by

Chung-Lin Wen

M.S., National Taiwan University (2009)
B.B.A., National Taiwan University (2006)

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author ........................................................
Program in Media Arts and Sciences
May 9, 2014

Certified by ...................................................
Ramesh Raskar
Associate Professor
Thesis Supervisor

Accepted by ....................................................
Pattie Maes
Associate Academic Head, Program in Media Arts and Sciences
Event-Centric Twitter Photo Summarization
by
Chung-Lin Wen
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning
on May 9, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science
Abstract
We develop a novel algorithm based on spectral geometry that summarizes a photo collection into a small subset that represents the collection well. While the definition of a good summarization might not be unique, we focus on two metrics in this thesis: representativeness and diversity. By representativeness we mean that each sampled photo should be similar to other photos in the dataset. The intuition is that by regarding each photo as a "vote" toward the scene it depicts, we want to include the photos that have high "votes". Diversity is also desirable because repeating the same information is an inefficient use of the limited space we have for summarization. We achieve these seemingly contradictory properties by applying diversified sampling on the denser part of the feature space.
The proposed method uses the diffusion distance to measure the distance between any given pair of photos in the dataset. By emphasizing the connectivity of the local neighborhood, we achieve better accuracy than previous methods that used the global distance. The Heat Kernel Signature (HKS) is then used to separate the denser part of the data from the sparser part. By intersecting the denser parts generated by different features, we are able to remove most of the outliers, i.e., photos that have few similar photos in the dataset. Farthest Point Sampling (FPS) is then applied to give a diversified sampling, which produces our final summarization.
The method can be applied to any image collection that has a specific topic but also a fair proportion of outliers. One scenario that especially motivated us to develop this technique is the set of Twitter photos of a specific event. Microblogging services have become a major way for people to share new information. However, the huge amount of data, the lack of structure, and the highly noisy nature prevent users from effectively mining useful information from them. There are methods based on textual data, but the absence of visual information makes them less valuable. To the best of our knowledge, this study is the first to address visual data in Twitter event summarization. Our method's output can produce a kind of "crowd-sourced news", useful for journalists as well as the general public.
We illustrate our results by summarizing recent Twitter events and comparing them with summaries generated from metadata such as retweet counts. Our results are of at least the same quality although produced by a fully automatic mechanism. In some cases, because metadata can be biased by factors such as the number of followers, our results compare even better. We also note from our initial pilot study that the high-quality photos we found have little overlap with highly-retweeted photos. This suggests that the signal we found is orthogonal to the retweet signal and that the two signals can potentially be combined to achieve even better results.
Thesis Supervisor: Ramesh Raskar
Title: Associate Professor
Event-Centric Twitter Photo Summarization
by
Chung-Lin Wen

Thesis Advisor ................................................
Ramesh Raskar
Associate Professor
Program in Media Arts and Sciences

Thesis Reader .................................................
Deb Roy
Associate Professor
Program in Media Arts and Sciences

Thesis Reader .................................................
Cesar Hidalgo
Assistant Professor
Program in Media Arts and Sciences
Acknowledgments
I would like to express my appreciation to my advisor, Prof. Ramesh Raskar, for his support during my time at MIT. The time spent in the Camera Culture group has been the most unique and exciting experience of my life.

I would also like to thank my readers, Prof. Deb Roy and Prof. Cesar Hidalgo, for providing constructive critiques during the proposal and the completion of this thesis.

In addition, Dr. Dan Raviv, the postdoc I worked closely with on this project, offered very helpful advice. Without his support, this project could not have been accomplished.

I also enjoyed my discussions with my office mate Nikhil Naik very much. It is very valuable to have people working in similar directions in a very diverse lab.

Finally, I would like to thank my family, Yuki and my parents. Yuki, my dearest wife, always supports me with endless love and care. My parents always offer me their encouragement and support. I feel blessed to have you as my family and I would like to take this opportunity to express my deepest appreciation.
Contents

1 Introduction
   1.1 Brief review of image collection summarization techniques
   1.2 Brief review of Twitter event summarization techniques
2 Problem statement
3 Spectral Geometry
   3.1 Diffusion maps
      3.1.1 Similarity graph
      3.1.2 Markov process
      3.1.3 Sparsified weight matrix
      3.1.4 Diffusion maps and diffusion distance
   3.2 Heat Kernel Signature and Auto Diffusion Function
   3.3 Farthest Point Sampling
4 Image Collection Summarization Algorithm
   4.1 Data cleaning
   4.2 Feature generation and affinity matrix
   4.3 Diffusion maps and diffusion distance
   4.4 Outliers removal
   4.5 Farthest point sampling
   4.6 Feature selection and parameter tuning
   4.7 Implementation details
5 Results and Evaluation
   5.1 Evaluation
   5.2 Discussion
      5.2.1 A crowd-sourcing approach
      5.2.2 Textual data and metadata integration
      5.2.3 Near-duplicate photos and its influence
6 Conclusion
List of Figures

1-1 The image search result on Twitter via the query term "#Oscars2014" as of April 2014. Only a few of them are directly related to the event.
3-1 The difference between linear and non-linear clustering, demonstrated by a Swiss roll dataset. [Photo credit: scikits-learn.org]
3-2 TED image scatter plot according to the first two dimensions in Y
3-3 An example of HKS values. Blue colors indicate lower values while red colors indicate higher values.
4-1 System overview
4-2 Duplicated images in the input photo collection collected from the Twitter API
4-3 Examples of outliers and non-outliers
5-1 Part of the photos from the Media Lab City Car dataset
5-2 The diffusion distance to the first data point in the Media Lab City Car dataset
5-3 Data points in the Media Lab City Car dataset in the diffusion maps embedding projected to 3D. The color indicates the frame numbers.
5-4 The Media Lab City Car dataset summarized in ten images
5-5 Part of the photos from the TED event dataset
5-6 Images from the TED dataset with the lowest and highest HKS values
5-7 Summarizing TED 2014 in five photos
5-8 Part of the photos from the Oscar dataset
5-9 Intersection of the lowest 15% HKS values applied on both the normalized color histogram and GIST
5-10 Summarizing Oscar 2014 in five photos
5-11 Photos most retweeted with the hashtag "#TED2014"
5-12 Photos most retweeted with the hashtag "#Oscars2014"
5-13 The result of our method and baseline methods on Oscar dataset #2
5-14 A prototype for Rep or Not, a crowd-sourcing approach to the image collection summarization problem
5-15 Almost identical images in the Oscar dataset
List of Tables

5.1 The result of comparing the proposed method with baselines on two events using Amazon Mechanical Turk
5.2 The result of comparing the proposed method with baselines on a larger dataset of Oscar photos using Amazon Mechanical Turk
Chapter 1
Introduction
With the advance of mobile devices and social networks, the way people exchange information has fundamentally changed. As for news, a few big news media companies are no longer the only sources of information. People nowadays go to all kinds of social media, micro-blogging sites, or blogs to look for the information they are interested in. The ability to broadcast interesting information has effectively been distributed to ordinary people outside the news companies.

Twitter is one of the most popular micro-blogging platforms, which an enormous number of users have adopted to share information for its real-time nature compared to other media. The unbiased view of Twitter has also made it ideal for social movements such as the Arab Spring. While potentially containing useful information, micro-blogging platforms raise several challenges. First, potentially useful information is buried in a significant amount of irrelevant tweets, such as mundane daily-life sharing or broadcasted private discussions. Second, for a detected event, the amount of information might still well exceed the human perceptual capacity to process it in a reasonably short time. Hence, effectively detecting and summarizing events is one of the critical steps toward effective data mining on micro-blogging platforms.
Much effort has already been put into the fields of event detection and summarization in the research community [2, 3, 23, 24, 26, 28, 37]. These studies used textual data or metadata, such as the content of the tweets or the number of tweets, as the key features. While these studies aim at detecting and describing ongoing events, an essential part of the
event information is still missing, namely, the visual information. Humans are visual by
nature. This is also the reason why newspapers and magazines spend much effort and
budget on photographs or graphics.
Visual information is not only essential for presenting events to users, but also valuable
for event summarization. First, it is not uncommon for people to use different phrases to
describe the same idea. On the other hand, the same set of phrases can be used to describe
differing ideas. Second, in a highly casual platform like Twitter, people tend to tweet about
a topic with a photo but without using more specific vocabulary. It is not uncommon to see
a tweet with photos attached while the caption is simply "OMG", for instance. In contrast,
photos taken from the same event usually share common visual patterns. We believe that
by including visual information, a more useful event summarization can be achieved.
As mobile devices become more popular and their cameras more powerful, people are taking far more pictures to share their personal experiences or to spread new information. However, how to utilize these huge amounts of visual data remains an ongoing research topic in the data mining and computer vision communities. Without a sound summarization technique, it is difficult for users to find the information that is relevant to them. For instance, if one wants to see interesting photos of the 2014 Oscars, he or she can search twitter.com for the hashtag "#Oscars2014". However, what the user gets is a list of photos ordered without a clear cue about relevance, as shown in Figure 1-1.
Therefore, it is essential to develop effective ways to summarize a photo collection. The task is fundamentally challenging, however, because visual data is by nature harder to summarize than textual data. Hence, in the computer vision community, many previous works on summarizing a collection of photos [9, 30] have been proposed. However, most of them use photographer community sites such as Flickr. In comparison, Twitter photo collections are far more challenging and less explored.
Given the extremely casual nature of Twitter, users can share a wide variety of pictures.
Some of them are mundane daily-life sharing (e.g., selfies), which is usually of little value unless the reader is socially close to the writer. Others are of better value in terms
of news, such as political demonstrations, festivals, or conferences, which might be interesting to a wider audience. Given this highly noisy nature, standard clustering techniques on visual features are unlikely to perform well.

Figure 1-1: The image search result on Twitter via the query term "#Oscars2014" as of April 2014. Only a few of them are directly related to the event.
To address this issue, we developed a novel algorithm, based on spectral geometry, that can deal with noisy datasets such as Twitter photos. Specifically, in contrast to most previous works (e.g., [30]) that used only the global distance as the similarity metric between images, we use the diffusion distance [8], which puts emphasis on local connectivity. This increases the accuracy of the similarity between images, as we will detail in a later section.
In addition, we made a key observation: users tweet a photo because they think the photo is worth sharing in some way. Consequently, a tweeted photo can be regarded as a "vote" toward an event. For instance, people might tweet photos of their selfies or lunch, but presumably we will not find many photos in the collection similar to such a particular photo. In contrast, for an interesting parade or conference, there might be a fair number of similar photos. Hence, we can use this property to pick representative photos and remove the outliers. The Heat Kernel Signature (HKS) can be used to quantitatively measure the density of a point and its local neighborhood. Finally, we apply Farthest Point Sampling (FPS), which maximizes the distance between sampled points.
To the best of our knowledge, this work is the first to incorporate spectral geometry techniques into the context of image collection summarization. It is also the first work toward effective visual information summarization on a popular micro-blogging platform. With this novel framework, we can effectively summarize a potentially large photo collection, even a noisy dataset such as Twitter photos. The technique can be generalized to other micro-blogging platforms or photo-sharing websites. We argue that this work provides a key feature that is missing in current summarization techniques and thus makes one critical step toward a crowd-sourced news platform.
The thesis is organized as follows. In the rest of this chapter, we briefly review the related work in the literature. In Chapter 2, we formalize our problem and explain our metrics for a good summarization. In Chapter 3, we review the mathematical background of spectral geometry, as our work is built on top of these concepts. Chapter 4 introduces the details of the image collection summarization algorithm. Chapter 5 presents the results of our algorithm and its evaluation. Finally, we conclude the thesis with a discussion and future directions.
Our work overlaps with two areas. On the one hand, the web mining community dealing with Twitter event detection tries to extract useful information from a huge number of tweets. On the other hand, the computer vision community dealing with image collection summarization aims at effectively conveying the essence of a scene. In the rest of this chapter, we discuss these areas and their relation to our work.
1.1 Brief review of image collection summarization techniques
As the applications are compelling, there are quite a few previous works in the field of image collection summarization. Simon et al. [30] pioneered the topic of computationally summarizing a large online photo collection. Their approach builds a term-document matrix in a visual bag-of-words framework via SIFT descriptors. A quality function is then defined to maximize the similarity between each photo and its closest "canonical view", as Simon et al. termed it. Two penalty terms discourage the system from (a) including too many canonical views and (b) including canonical views that are too similar to each other. The quality function is then optimized using a greedy algorithm: in each iteration, the algorithm includes the view that produces the maximum marginal increase.
There are several differences between our method and theirs. First, by considering the global distance, their quality function maximizes something irrelevant when points are too far apart. While this might be acceptable when the photo collection has intrinsic clusters, we argue that our method is suitable for more general cases. Second, by picking the canonical views that maximize the similarity between themselves and the represented individual views, their method can be biased by outliers. In contrast, our method considers the density of data points and is thus more robust to noise and outliers.
Kennedy and Naaman [22] extend the work of Simon et al. with a similar goal: picking representative photos for a single location. The work claims novelty in (1) the ability to discover potential tags and locations for landmarks and (2) a novel clustering-based algorithm to pick representative photos given the discovered tags and locations. For the first part, TF-IDF was used to discover location-dependent but time-invariant tags. For the second part, clusters were first formed by applying the K-Means algorithm to visual features. Then, a set of heuristics was used to rank the clusters and images. Finally, the top images within the top clusters were included as representative images.
Crandall et al. [9] propose a system that performs the following tasks on a large online photo collection. First, the geographical locations of photos are used to identify "interesting" areas where people are more likely to take photos. This is done by applying the mean shift algorithm to the latitudes and longitudes of the photos. Second, for highly-photographed areas, the system picks representative photos by first performing spectral clustering and then using the photos with the highest similarity as representatives, where similarity is defined by the number of matched SIFT descriptors. The second task overlaps with our topic to some extent, but there are also fundamental differences. The paper used a spectral clustering technique but did not take full advantage of its properties. For instance, the dense part of the feature space is defined in a straightforward way that can be biased by highly similar outliers.
There are also several works aimed at providing a computational framework for predicting aesthetics or "interestingness" in a photo, which can in turn be used to rank photos and summarize a collection. Datta et al. [11] use a machine learning approach to predict the aesthetics of photographs; features such as colorfulness and a low depth-of-field indicator were extracted. Dhar et al. [12] built a model to predict interestingness. The model used high-level describable features, such as the presence of a salient object, the rule of thirds, and low depth of field, in combination with low-level features. The authors argue that these high-level features add useful information to the prediction of aesthetics and interestingness. The high-level features are trained on a manually labelled training set, using relevant visual cues to train the classifier.
Chen and Abhishek [5] also detect events on a visual platform, namely Flickr. However, they used only metadata such as tags, because of the difficulty of applying data analysis techniques to visual data.
Several works tackle the problem of organizing photos with geographical information. Jaffe et al. [20] propose a system for generating summaries and visualizations of a large collection of geo-referenced photographs. The system conducts a modified version of Hungarian clustering on locations to produce a hierarchy of clusters. Clusters are then ranked according to factors such as time, location, quality, and relevance. However, the essential factor, relevance, is externally given, and no visual similarity measures are used. Epshtein et al. [14] built a system for hierarchical photo organization that reconstructs viewing frustums and chooses as the representative photo the one that maximizes viewed relevance.
While these works present convincing results, none of them deals with Twitter photo collections, which we believe require a special mechanism due to their noisy and casual nature.
1.2 Brief review of Twitter event summarization techniques
Sakaki et al. [28] built an emergency event detection system by regarding individual Twitter users as "social sensors". A given event topic is used to train a classifier so that each tweet in turn provides a binary output. Then, techniques such as Kalman filtering or particle filtering can be applied to the output to extract the location and time of the event.
Ritter et al. [26] built TWICAL, a system that performs automatic event extraction and categorization. To determine whether an event is significant, TWICAL looks at the co-occurrence of tweets for a specific phrase. To categorize the event type, the system adopts an unsupervised approach based on latent variable models.
Mathioudakis and Koudas [24] demonstrate TwitterMonitor, which uses bursty keywords to detect trends on Twitter, where a bursty keyword is defined by an unusually high rate of occurrence of a specific keyword.
Becker et al. [2] use an incremental clustering technique to perform online real-world event detection on Twitter. The features they used mainly consist of four categories: temporal features, social features, topical features, and Twitter-centric features. Becker et al. have also built another system [3] that selects representative tweets for events.
Lee and Sumiya [23] developed a system that automatically detects abnormal patterns, namely events in their definition, mainly via the number of tweets and the size of crowds in a given region. The regions of interest (RoIs) are automatically constructed via clustering techniques on latitude and longitude.
Fruin et al. [16] also tried to visualize news photos. However, instead of obtaining the source in a crowd-sourced way, they scrape the photos from manually-picked news media sources. While the system is useful, it is questionable whether it scales well, since it is difficult to define a list of all representative news media.
While the aforementioned works succeed at detecting events to a certain degree using textual data or metadata, to the best of our knowledge, our work is the first to incorporate visual information into the framework.
Chapter 2
Problem statement
The problem we are trying to solve can be formalized in the following way. Given a potentially large photo collection $X$, we would like to find a small subset $X_s \subset X$ such that $X_s$ summarizes $X$ well. While the definition of a good summarization might not be unique, in this project we are trying to find summarizations that are both representative and diverse.

By representativeness we mean that it should be possible to find many photos in $X$ that are similar to the photos in $X_s$. The intuition behind this principle is that each photo can be regarded as a vote for a scene. People take a photo and share it on social media because they think something is worth sharing. Hence, when forming a summarization, it is desirable to include the photos that have the highest number of "votes".

By diversity we mean that the photos in $X_s$ should be dissimilar from each other so that we minimize redundancy. The major motivation for summarization is to use few entries to represent the whole set; repeating the same information is thus undesirable.
At first look, these two criteria might sound somewhat contradictory. However, we frame the problem so that they are in fact complementary: we achieve representativeness by using only the denser part of the feature space, and then, within the denser part, we maximize the diversity of the sampled points.
In this thesis, we mainly focus on the photo summarization problem, although we also note that problems such as layout or the integration of the textual part have practical value. We briefly discuss such issues in Chapter 5.
Chapter 3
Spectral Geometry
As this thesis relies heavily on the concepts of spectral geometry, in this chapter we briefly guide the reader through the mathematical background in the context of the problem we are trying to solve. Specifically, we discuss diffusion maps and the diffusion distance, the Heat Kernel Signature (HKS), and Farthest Point Sampling (FPS).
3.1 Diffusion maps
Given a dataset X with n data points, the structure can be revealed by the similarities of the data points to each other within the set. For instance, K-Means is a clustering method that iteratively partitions the data points into subsets according to their distances to the centroids. However, methods like this do not work well for more complex datasets. For example, for the dataset shown in Figure 3-1a, clusters are assigned according to the global distance, as opposed to the distance along the manifold.

To address more complex datasets, manifold learning methods have been proposed. Diffusion Maps is one such technique: it models the data as a Markov process, and the similarity between two points is defined as the probability that one point travels to the other in a given time frame. The main advantage of doing so is the focus on local connectivity instead of global distance. The intuition is that distances are usually only meaningful within a local neighborhood. For instance, two images that are very similar might be photos of the same object taken in slightly different settings. In contrast, when comparing an
image with two very different images, there is no general rule for determining which one is "closer". Indeed, by focusing on connectivity instead of the global distance, we are able to achieve better clustering results, such as in Figure 3-1b.

Figure 3-1: The difference between linear and non-linear clustering, demonstrated by a Swiss roll dataset. (a) Clustered by K-Means clustering. (b) Clustered by spectral clustering. [Photo credit: scikits-learn.org]
Diffusion Maps transfers the original data points into an embedding where the Euclidean distance equals the diffusion distance, a measure of how well the points are connected to each other, as elaborated later. It mainly consists of the following steps, which we discuss in order in the following sections:

1) Construct the affinity matrix
2) Model the Markov process
3) Non-linearly map the data to the new embedding
3.1.1 Similarity graph
Given a dataset X with n points, a standard way to model the data in the spectral geometry literature is the similarity graph G = (V, E). In the graph, V is the set of vertices, where each vertex represents a data point, and the edges E represent how well the vertices are connected to each other. In order to measure the similarity between any given pair of points, a similarity kernel $k(x_i, x_j)$ should be defined. The kernel should satisfy the following properties:

1) Symmetry: $k(x_i, x_j) = k(x_j, x_i)$
2) Non-negativity: $k(x_i, x_j) \geq 0$
3) Locality: $k(x_i, x_j) \approx 0$ for $\|x_i - x_j\| > \varepsilon$, where $\varepsilon$ is a given distance threshold.

Notice that the locality constraint implies that any pair of points whose distance is more than $\varepsilon$ will be considered negligibly weakly connected, as $k(x_i, x_j)$ goes to zero quickly outside the $\varepsilon$-neighborhood. This relates to the notion of local connectivity: only distances within a certain neighborhood are considered meaningful, while distances outside the neighborhood give us almost no information. This property also differentiates Diffusion Maps and related methods from methods that consider the global distance, such as Principal Component Analysis (PCA).
The choice of kernel function is application-dependent, as similarity should be measured in the context of the specific application. One common choice, and the one we use in this thesis, is the Gaussian kernel:

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$$

The choice of the scale parameter $\sigma$ is not trivial and influences the overall connectivity of the graph and thus the final result. For a smaller $\sigma$, the locality constraint is intensified, which can result in a poorly-connected graph. On the other hand, a greater $\sigma$ tends to make the graph better connected; however, noise outside the local neighborhood can then get in and influence the final result. Selecting an appropriate $\sigma$ can be an analytical process, as in Hein and Audibert [18], but determining it empirically is also quite common.
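To make this concrete, the following minimal sketch (in Python with NumPy and SciPy, the stack used in our implementation; all function and variable names here are hypothetical) builds a Gaussian-kernel affinity matrix from a feature matrix and picks σ with the empirical median heuristic described in Section 4.2:

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_affinity(X, c=10.0):
        # Pairwise L2 distances between feature vectors (one row per point).
        R = cdist(X, X)
        # Empirical sigma: c times the median nearest-neighbor distance
        # (column 1 skips the zero self-distance); see Section 4.2.
        sigma = c * np.median(np.sort(R, axis=1)[:, 1])
        return np.exp(-(R / sigma) ** 2), sigma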
3.1.2 Markov process
The weight matrix W can then be constructed as the adjacency matrix form of the similarity graph:

$$W_{ij} = k(x_i, x_j) \qquad (3.1)$$

Each entry $W_{ij}$ represents how well a point $x_i$ is connected to another point $x_j$. If we normalize $k(x_i, x_j)$ by the total outgoing degree $d(x_i) = \sum_j W_{ij}$, then $p(x_i, x_j) = k(x_i, x_j)/d(x_i)$ can be seen as the transition probability of a Markov process from node $x_i$ to node $x_j$. A degree matrix D can be formed with the $d(x_i)$ as diagonal entries:

$$D = \mathrm{diag}\bigl(d(x_1), d(x_2), \ldots, d(x_n)\bigr)$$
Then we can form the transition probability matrix by normalizing W with D:

$$P = D^{-1} W \qquad (3.2)$$

With P, advancing the Markov process by t steps corresponds to raising P to the power t, i.e., $P^t$. We denote each entry of $P^t$ by $p_t(x_i, x_j)$.
As we use density as an important criterion for representative images and outliers, one issue with equation 3.2 is that it does not take the sampling density into account. It is well studied in [8] that a family of anisotropic diffusion processes can be obtained by specifying how much influence of the density is desired. The family of diffusion processes can be parameterized by a single variable $\alpha$. Specifically, we change the Laplacian calculation to:

$$L = D^{-\alpha} W D^{-\alpha} \qquad (3.3)$$

When $\alpha = 1$, the diffusion process reduces to an approximation of the Laplace-Beltrami operator; in this case, the distribution of the points is not considered. When $\alpha = 1/2$, the Markov chain is an approximation of the diffusion of a Fokker-Planck equation, and the influence of the density is considered. The latter case is ideal for our usage, resulting in:

$$L = D^{-1/2} W D^{-1/2} \qquad (3.4)$$

For a more detailed explanation, we refer the reader to Coifman and Lafon [8].
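As an illustrative sketch (hypothetical names; W is a kernel matrix as above), the normalizations of equations 3.2 and 3.3 can be written as:

    import numpy as np

    def normalize_kernel(W, alpha=0.5):
        d = W.sum(axis=1)             # degrees d(x_i)
        P = W / d[:, None]            # P = D^{-1} W        (equation 3.2)
        Da = d ** alpha
        L = W / np.outer(Da, Da)      # L = D^{-a} W D^{-a} (equation 3.3)
        return P, L

The default alpha = 0.5 corresponds to the Fokker-Planck case used in this thesis.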
3.1.3 Sparsified weight matrix
Constructing W according to equation 3.1 generates a fully-connected matrix. However, as most pairs of points lie outside each other's $\varepsilon$-neighborhood, the matrix will contain mostly near-zero values. In terms of the similarity graph, this means most of the edges are very weakly connected. Hence, we can sparsify the matrix by either:

1) setting a threshold and zeroing everything below it, or
2) keeping only the k nearest neighbors of each point.

The sparsification serves two purposes. First, it speeds up the computation. Second, points that are negligibly weakly connected should have no influence; ruling out these edges further increases the accuracy of the algorithm.

However, the sparsification results in an asymmetric W, so that equation 3.2 generates an asymmetric L as well. Fortunately, by using equation 3.3, L is symmetrized.
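A minimal sketch of the second option, keeping the k nearest neighbors of each point (hypothetical names), might look like:

    import numpy as np

    def sparsify_knn(W, k=20):
        W = W.copy()
        for i in range(W.shape[0]):
            weak = np.argsort(W[i])[:-k]   # all but the k largest entries
            W[i, weak] = 0.0               # drop negligibly weak edges
        return W                           # generally asymmetric; see above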
3.1.4 Diffusion maps and diffusion distance
Applying Singular Value Decomposition (SVD) to P generates the singular values $\lambda_i$ and the left and right singular vectors $\phi_i, \psi_i$. The singular values and right singular vectors can then be used to construct a mapping termed diffusion maps [8]:

$$\Psi_t(x_i) = \left[\lambda_1^t \psi_1(i), \lambda_2^t \psi_2(i), \ldots, \lambda_n^t \psi_n(i)\right]^T$$

The new embedding Y has the desirable property that, when all eigenvectors are used, the Euclidean distances between points equal the diffusion distances:

$$D_t(x_i, x_j) = \|\Psi_t(x_i) - \Psi_t(x_j)\|$$

Diffusion Maps is a very useful technique for dimensionality reduction. If we visualize the images in 2D using the first two coordinates of the new embedding, we can see that similar images are close to each other while outliers tend to lie in a sparse part of the graph, as shown in Figure 3-2. This property is useful for removing outliers, as detailed in a later section.
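A sketch of the embedding step (hypothetical names; assuming L is the symmetric matrix of equation 3.4, so that the SVD coincides with an eigendecomposition, and dropping the leading trivial component by convention):

    import numpy as np

    def diffusion_maps(L, t=1, ndim=3):
        U, lam, Vt = np.linalg.svd(L)
        psi = Vt.T                                  # right singular vectors
        # Skip the first (trivial) component and scale by lambda_i^t.
        return psi[:, 1:ndim + 1] * lam[1:ndim + 1] ** t

    def diffusion_distance(Y, i, j):
        # Euclidean distance in the embedding equals the diffusion distance.
        return np.linalg.norm(Y[i] - Y[j])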
29
0.04
F
1
U
0.02
Ii
-0.04
U
-0.06
-0.08
-0.5
-0.4
-0.3
-0.2
-0.1
0.0
0.1
Figure 3-2: TED image scatter plot according to the first two dimension in Y
30
0.2
3.2 Heat Kernel Signature and Auto Diffusion Function
The Heat Kernel Signature (HKS) [32] is a way to compactly encode the geometric information of a shape. Gebal et al. [17] propose another closely related technique that shares similar properties, referred to as the Auto Diffusion Function (ADF). Both works are time-dependent, as they encode the geometric information in a multi-scale way. In this thesis, we follow their work but consider one specific time, which is chosen empirically. In the rest of this section, we first introduce the definitions of HKS and ADF, and then illustrate the application of these techniques to our study.

HKS can be represented by the following equation:

$$\mathrm{HKS}(x, t) = \sum_{i=0}^{\infty} e^{-\lambda_i t} \phi_i(x)^2 \qquad (3.5)$$

where $\lambda_i$ and $\phi_i$ are the i-th eigenvalue and eigenfunction of the Laplace-Beltrami operator. Similarly, ADF can be expressed by the following equation:

$$\mathrm{ADF}(x) = \sum_{i=0}^{\infty} e^{-k\lambda_i/\lambda_1} \phi_i(x)^2 \qquad (3.6)$$
As we can see from the definitions, the two methods are very similar. In the rest of the thesis we focus the discussion on HKS, but the reader should note that the same discussion also applies to ADF.

The property we seek from HKS is that, for a fixed t, HKS(x, t) captures the density of the local neighborhood of the point x. This property is highly applicable for filtering out outliers and focusing on the popular views only, as detailed in a later section. With a slight abuse of notation, we denote HKS(x, t) with fixed t as the HKS value in the rest of the thesis.

An example of HKS values is shown in Figure 3-3. As we can see from the figure, the denser part has lower values while the sparser part has higher values.
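For our graph setting, the HKS value of equation 3.5 with a fixed t can be approximated by truncating the sum to the leading eigenpairs. A sketch (hypothetical names; Lap is assumed to be a graph-Laplacian approximation of the Laplace-Beltrami operator):

    import numpy as np

    def hks_values(Lap, t=0.1, k=50):
        lam, phi = np.linalg.eigh(Lap)        # symmetric eigendecomposition
        idx = np.argsort(lam)[:k]             # k smallest eigenvalues
        lam, phi = lam[idx], phi[:, idx]
        # Truncated sum of equation 3.5; one HKS value per data point.
        return (np.exp(-lam * t) * phi ** 2).sum(axis=1)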
Figure 3-3: An example of HKS values. Blue colors indicate lower values while red colors indicate higher values.
3.3 Farthest Point Sampling
Farthest Point Sampling (FPS), originally introduced by Eldar et al. [13], is a technique for sampling from a set of data points X so that the sampled points are dissimilar to each other in terms of a given distance metric d. In this way, for any point $x_i$ in the dataset, the distance between the point and the closest sampled point is minimized; thus, the distortion from not being present in the sampled set is minimized.

Put more formally, assume a pairwise distance matrix D is given, where each entry $D_{ij} = d(x_i, x_j)$ represents the distance between points $x_i$ and $x_j$. FPS is a progressive process, and the first sampled point can either be given by the user or generated randomly. At each succeeding step, the next sampled point is determined by the distances to the currently sampled points: the next sampled point $x_n$ is the point that maximizes the distance to the set S of currently sampled points:

$$x_n = \arg\max_j d_t(j)$$

where $d_t$ is the minimum distance seen so far:

$$d_t(j) = \min_{i \in S} D_{ij}$$

We then add $x_n$ to S and continue to the next iteration until the number of sampled points reaches the desired number.
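The procedure translates directly into code. A sketch (hypothetical names; D is a pairwise distance matrix such as the diffusion distances):

    import numpy as np

    def farthest_point_sampling(D, m, seed=0):
        sampled = [seed]
        d_min = D[seed].copy()          # d_t(j): distance to nearest sampled point
        for _ in range(m - 1):
            nxt = int(np.argmax(d_min)) # point farthest from the sampled set
            sampled.append(nxt)
            d_min = np.minimum(d_min, D[nxt])
        return sampled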
Chapter 4
Image Collection Summarization Algorithm
In this chapter, we go through our algorithm for summarizing a photo collection in detail. In a nutshell, the key steps are: generating an affinity matrix according to the selected feature(s), computing the diffusion distance, removing outliers according to the Heat Kernel Signature, and applying Farthest Point Sampling. The overall workflow is summarized in Figure 4-1. Note that two features are used in the figure, but the pipeline can be extended to any small number n of features. We discuss each component in detail in the following sections.
4.1 Data cleaning
As mentioned earlier, Twitter data is by nature very noisy. One problem is that it contains a fair number of duplicates. This is a problem because we determine the "voting" for a specific scene by counting how many similar images there are in the dataset. For example, Figure 4-2 shows duplicate images that ended up being selected as part of the summarization.

Eliminating the duplicated images is conceptually straightforward. However, we need to consider the computational complexity. A brute-force solution is to compare every image to all the other images in the dataset. This results in an algorithm of complexity $O(n^2 \times w \times h)$, where n, w, and h are the number of data points and the width and height of the images, respectively. This can be impractical when n is non-trivial.
Figure 4-1: System overview. Multiple features can be used.

Figure 4-2: Duplicated images in the input photo collection collected from the Twitter API.
We use a hashing mechanism to eliminate this problem. For each image, we compute an MD5 hash and use it as a key into a hash table. While traversing the dataset, we check whether the hashed key is already in the table; if so, a duplicate has been found and we remove the image from the set. This results in an algorithm of complexity O(n).
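A sketch of this hashing step (hypothetical names; note that, like the description above, it only catches byte-identical duplicates):

    import hashlib

    def deduplicate(image_paths):
        seen, kept = set(), []
        for path in image_paths:
            with open(path, 'rb') as f:
                key = hashlib.md5(f.read()).hexdigest()
            if key not in seen:          # first occurrence: keep it
                seen.add(key)
                kept.append(path)
        return kept                      # O(n) over the collection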
4.2 Feature generation and affinity matrix
When comparing the similarity of a given pair of images, pixel-by-pixel comparison is clearly too naive: it is too sensitive to the setting and alignment of the images. Finding the right features to describe an image has long been an active direction in computer vision research. In this project, we referenced the SUN Database by Xiao et al. [38] for a list of state-of-the-art visual features. The list includes densely sampled SIFT, sparse SIFT (a bag-of-words approach [31]), GIST [35], HOG [10], Tiny Images [34], and so on.

In addition, we implemented a Gaussian-blurred vectorized image feature and a color histogram feature on our own. For the blurred vectorized image, we subsample the image to a fixed dimension and apply a Gaussian blur; we then vectorize the image into the feature vector.
Given a dataset X with n data points, we first generate a feature matrix F where each row represents a data point encoded by the chosen feature. We normalize each column into a unit vector so that each dimension contributes equally to the distance.

For the color histogram, a histogram with 32 bins is constructed for each of the R, G, and B channels. Each histogram is then normalized by the number of pixels. Finally, the three histograms are concatenated to form the feature vector for a data point.
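A sketch of this feature (hypothetical names; our actual implementation used OpenCV, as noted in Section 4.7, but NumPy suffices to illustrate):

    import numpy as np

    def color_histogram_feature(image):
        # image: (h, w, 3) uint8 array; returns a 96-dimensional vector.
        n_pixels = image.shape[0] * image.shape[1]
        feats = []
        for ch in range(3):                                    # R, G, B
            hist, _ = np.histogram(image[:, :, ch], bins=32, range=(0, 256))
            feats.append(hist / n_pixels)                      # normalize
        return np.concatenate(feats)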
We then generate the distance matrix R from the pairwise distances of F, where $R_{ij}$ shows the affinity of two points $x_i$ and $x_j$ and is defined by the distance between rows i and j of F. We use the L2 distance except in the case of histograms, where the Earth Mover's Distance [27] is used.
As mentioned in the previous chapter, a Gaussian kernel is used to construct a weight matrix in which only local distances matter. The scale parameter $\sigma$ is determined by taking the median of the distances to the closest neighbors in R, scaled by a hand-picked parameter c, which is usually around 10 in our system. The weighting matrix is then constructed by applying the Gaussian function to $R_{ij}$:

$$W_{ij} = \exp\left(-\left(\frac{R_{ij}}{\sigma}\right)^2\right)$$
4.3 Diffusion maps and diffusion distance
With the weighting matrix, we then calculate the diffusion distance matrix, mostly following the steps of the previous chapter. First, we construct the diagonal degree matrix with diagonal entries:

$$D_{ii} = \sum_j W_{ij}$$

Then the normalized graph Laplacian is constructed. As mentioned in section 3.1.2, the matrix is normalized as:

$$L = D^{-1/2} W D^{-1/2}$$

Singular Value Decomposition (SVD) is then applied to the graph Laplacian L to generate the eigenvalues $\lambda_i$ and the left and right eigenvectors $\phi_i, \psi_i$. With the eigenvalues and eigenvectors of L, we can construct the diffusion maps:

$$\Psi_t(x_i) = \left[\lambda_1^t \psi_1(i), \ldots, \lambda_n^t \psi_n(i)\right]^T$$

By definition, the diffusion distance can then be calculated as the Euclidean distance between data points:

$$D_t(x_i, x_j) = \|\Psi_t(x_i) - \Psi_t(x_j)\|_2$$
Figure 4-3: Examples of outliers and non-outliers. (a) Examples of non-outliers in the Oscar dataset. (b) Examples of outliers in the Oscar dataset.
4.4 Outliers removal
As mentioned in the previous chapter, given a set of data points, Farthest Point Sampling (FPS) tries to maximize diversity by maximizing the distance between sampled points. This works well if the dataset does not contain outliers. With outliers, however, the process tends to select the outliers, since they typically have larger distances to the non-outliers. In real-world data, outliers are very common; for instance, Figure 4-3 shows some examples of outliers in our data. Hence, a method for removing outliers before FPS is applied is necessary.

As discussed in the previous chapter, the Heat Kernel Signature [32] has the nice property that its value is lower in the denser parts and vice versa. We calculate the HKS value for each data point according to equation 3.5. The data points are then sorted according to this value, and FPS is applied to the data points with the lowest HKS values. In practice, for highly noisy datasets such as Twitter images, we apply FPS to only the lowest 20% of the data points.

To further remove outliers, one can repeat the process for different features and apply FPS to the intersection of the densest parts under each feature. The principle is similar to RANSAC [15] in the sense that by taking the union of the outliers, we increase the robustness of the process.
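A sketch of this multi-feature filtering (hypothetical names; each entry of hks_per_feature is assumed to be the vector of HKS values computed under one feature):

    import numpy as np

    def dense_subset(hks_per_feature, rate=0.2):
        n = len(hks_per_feature[0])
        keep = int(rate * n)
        dense = None
        for hks in hks_per_feature:
            low = set(np.argsort(hks)[:keep])   # densest (lowest-HKS) points
            dense = low if dense is None else dense & low
        return sorted(dense)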
4.5 Farthest point sampling
After the outlier removal process described in section 4.4, the subset $X_s$ of the dataset will have no or only a few outliers, and FPS can be applied to it. Specifically, from the new embedding Y generated by Diffusion Maps, we keep only the subset of rows $Y_s$ whose points are in $X_s$. From there we calculate the diffusion distance matrix $D_s$ between the points in $Y_s$. Then FPS can be applied to $D_s$ as described in section 3.3.
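Putting the pieces together, a hypothetical end-to-end driver over the sketches above might read:

    import numpy as np

    # Hypothetical driver combining the sketches from Chapter 3 and this chapter.
    Y = diffusion_maps(L, t=1, ndim=10)            # embed all points
    dense = dense_subset([hks_color, hks_gist])    # indices kept after outlier removal
    Ys = Y[dense]                                  # restrict the embedding to X_s
    Ds = np.linalg.norm(Ys[:, None] - Ys[None, :], axis=-1)  # pairwise diffusion distances
    summary = [dense[i] for i in farthest_point_sampling(Ds, m=5)]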
4.6 Feature selection and parameter tuning
Which features to use depends on the nature of the dataset. For a clean and continuous dataset, the blurred vectorized image performs best. In this case, for each image, the difference between itself and its local neighbors is only a small offset; the blurred image can thus connect the images by averaging neighboring pixels. On the other hand, for noisy datasets we have found that features capturing overall statistics, such as the normalized color histogram, sparse SIFT, and GIST, perform better.

There are a few parameters in our system. The important ones include the scale parameter $\sigma$ of the Gaussian kernel, the time t for Diffusion Maps, and the filtering rate k for HKS.

The choice of $\sigma$ affects the locality of the graph. With a smaller $\sigma$, the system becomes stricter about whether two images are connected, but this might result in a poorly connected graph. We choose the parameter empirically, around 10 times the median of the closest-neighbor distances, and increase $\sigma$ as the dataset gets noisier.

From our experience, t = 1 usually gives the best result. We also tried the commute time distance [39], but it did not improve the result.

The choice of k also depends on the noisiness of the data. We can filter out nothing if we are certain that the dataset contains no outliers. For a noisier dataset such as Twitter images, we apply FPS to the densest part (around 20%) only.
4.7 Implementation details
We implemented most of the system, including the diffusion maps, Heat Kernel Signature, and Farthest Point Sampling, in Python, using libraries such as NumPy, SciPy, and Scikit-Learn. For part of the feature generation, the Matlab code from the SUN Database [38] was used. OpenCV was used for the histogram calculation and the Earth Mover's Distance.

We ran the code on a late-2012 MacBook Pro with a 2.9 GHz Intel Core i7 CPU and 8 GB of 1600 MHz DDR3 RAM. The computation time is dominated by the pairwise distance calculation. For around 1K data points and reasonable dimensionality, the elapsed time ranges from a few seconds to tens of seconds. For instance, the Oscar dataset #2, with 2250 images, using the color histogram (96 columns) as the feature, took 23 seconds to compute the diffusion distance and HKS.
Chapter 5
Results and Evaluation
In this chapter, we present the results of applying the proposed image collection summarization technique.
We first demonstrate the effectiveness of the diffusion distance and farthest point sampling by applying the technique to photos taken from many points surrounding an object in a full circle. Part of the photos is shown in Figure 5-1. Since we circle the object, the first few frames have a view similar to the last few frames. This fact is captured by the diffusion distance: if we plot the diffusion distance to the first data point, as shown in Figure 5-2, the distances form an inverted-U shape. As expected, the first few frames are similar to the first frame; in addition, the last few frames can also be "connected" through the last frame.
One desirable property of diffusion maps is that the technique can recover the intrinsic low-dimensional variables. Vectorized blurred images were used as feature vectors in this example, so the observed dimension is w × h, where w and h are the width and height of the blurred image, respectively. However, the intrinsic variable is simply the angle from which the camera views the car. The diffusion maps are able to recover this intrinsic parameter, as shown in Figure 5-3, where we reduce the dimensionality to only 3. We note that this example is less trivial than the toy example shown in Talmon et al. [33], since our on-foot recording produced some shakiness.
After the diffusion distance matrix is constructed, FPS was applied with m = 10 and the first image as the seed. The result is shown in Figure 5-4. As we can see from the figure, the angles at which the pictures were taken are roughly evenly sampled. By our problem statement, this constitutes a summarization with diversity, because redundancy is minimized while presenting an event/object.

Figure 5-1: Part of the photos from the Media Lab City Car dataset.

Figure 5-2: The diffusion distance to the first data point in the Media Lab City Car dataset.

Figure 5-3: Data points in the Media Lab City Car dataset in the diffusion maps embedding projected to 3D. The color indicates the frame numbers.

Figure 5-4: The Media Lab City Car dataset summarized in ten images.
We also collected Twitter images and verified our technique on real Twitter events. The Twitter Streaming API was used to collect the data: tweets from the United States that had geo information and photos attached were collected. As of the time of writing, five months of tweets and images had been collected, resulting in a dataset of around two terabytes with millions of images. The collected tweets were stored in a MongoDB-powered NoSQL database. Potential candidate events are systematically investigated by sorting hashtags by their frequency of occurrence; events are then discovered from high-frequency hashtags that are time-dependent. Here, we present two events: TED 2014 and Oscars 2014.

We obtained the TED dataset by querying the database with the #TED2014 hashtag, yielding 331 images. Part of the images is shown in Figure 5-5. As we can see from the figure, many images actually have little to do with the event. If we simply apply the techniques used for the Media Lab City Car dataset, the result will be unsatisfactory because of the outliers.
Hence, an outlier removal mechanism needs to be developed in order to obtain a summarization that is more representative. To remove the outliers, we apply the heuristic discussed in the introduction: non-outliers should appear in the denser part of the feature space because there are more similar data points nearby. In contrast, outliers usually appear in the sparse part of the feature space, since each distinguishes itself from the other points and thus has few data points nearby. As described in the previous chapter, HKS is a highly applicable technique for this task. We applied HKS with a fixed small t and then sorted the images according to the value. In our setting, photos with the lowest values correspond to the data points in the densest part of the feature space, and vice versa. Applying the technique to the feature space generated by the normalized color histogram feature yields the result shown in Figure 5-6, where the images with the lowest and highest HKS values are listed. As we can see from the figure, images with the lowest HKS values indeed correlate better with the TED event, while images with the highest HKS values tend to be outliers.

Figure 5-5: Part of the photos from the TED event dataset.

Figure 5-6: Images from the TED dataset with the lowest and highest HKS values. (a) Images with the lowest HKS value. (b) Images with the highest HKS value.
To further eliminate the outliers, the process can be repeated for multiple features. Figure 5-7a shows the images with the lowest 5% to 10% HKS values when applied to the affinity matrix generated by the normalized color histogram. As we can see from the figure, there are still a few outliers. However, if we intersect the lowest 20% of HKS values applied on both the normalized color histogram and the sparse SIFT feature, we obtain the result shown in Figure 5-7b. While there are still one or two outliers, compared to the version filtered by only a single feature, this subset is much more suitable for applying FPS.

After filtering out the outliers, we apply FPS to sample diversely from the filtered subset; the result is shown in Figure 5-7. Our technique focuses on the most common scene (e.g., the talk by Edward Snowden) but also includes different settings of the same scene to attain diversity.
Another dataset we experimented with is the Oscar dataset. By querying the database with the hashtag "#Oscars2014" and filtering out duplicated images, we obtained 1165 images. Part of the images is shown in Figure 5-8. As we can see from the figure, this dataset is even noisier than the TED dataset. HKS can again be used to separate the denser part of the data points from the sparse part.
Normalized color histogram and GIST descriptor [35] was used to generate the affinity
matrix in this dataset. Figure 5-9a and 5-9c show the photos with lowest HKS value using
normalized color histogram and GIST as the feature, respectively. The intersection of the
top 15% of the two sets is shown in Figure 5-9. For comparison, photos with highest HKS
49
00028jpg
OOOZgjPg
00030jpq
O3Op
00031.jpg
00032Jpg
00033J9
000341p0
000354P
00036J9
00037?jP
00036Jpg
00039.jpg
(a) Images from TED dataset with lowest 5% to 10% HKS value, affinity matrix
generated by normalized color histogram
00001Jpg
00005.jpg
000190pg
00023jpg
00035.jpg
00036jpg
00047jpg
00048jpg
00049jPq
00053JPG
00065.Jpg
0006.Jpg
00079JP9
00086jpg
00091Jpg
00094.jpg
0009S.jpg
00096Jpg
00098.jpg
50
(b) Intersection of the lowest 20% HKS value applied on both normalized color and histogram
and sparse SIFT feature
00001jpg
00048jpg
00091.JPg
OO095Jpg
00086.jpg
Figure 5-7: Summarizing TED 2014 in five photos
180310.995205Jg
180353.3391og
180512.813400pg
4
180605.136 53j"g
180905.1780059
180932.078733"
180945371377mg
1A0950.886121,g
18121
4
4
301 03pg
18127.92Z833pg
181403249431M
181405.453657J"9
rw
3239168pg
182002.551811g
152051.2283082g
182330382173JM
182340.M 974O9
1823S367l177.g
182455.35864gg
1845601730,g
18253L83399g
182539-429328J8
182606.411411J09
18Z807.9Z460g
182B18.967956jp9
182933j508SSg
18294L4118024in
183045376428og
18312L099934g
183142.276S28og
183143?038&g
183250.A55S15#g
183405.84789M
1S343799234J00
193448-
163$07414179M
18362L114461pg
183710.476440.og
183748.311601ipg
184035.098940Mpg
184056.93274200
184210398164g
184330
9750JPg
184342.166069jpg
184437.071S21,0g
18443810767SJog
184453.64438jog
184518192641pg
184552.72287j09
184633516078g
184710484966,mg
184818240118JOG
184939780459g
18493.319012jg
18499.096134jog
185034l1786j g
185IO8.2847M5Jg
185lS6.934559g
18S333.677408Pg
18533895549jog
185427
185606.070235OOg
185714.017SO7jpg
181807 481345kg
185837AI79134,g
185844-626306jg
185924.561680.og
181935_6S9937 g
1900l0.84755 Jpg
190033.838685JDO
190039.15292jPg
181856
4
4
185 03_9421 2s98
.88473jg
Figure 5-8: Part of the photos from the Oscar data set
51
09
3
DD03009g
00031jpg
00032jpg
000 ,M
0004jf
00035409
00036jog
000374p
OO384pgq
00039
00040jpg
00041jpg
000424g
00043jpg
0004410p
(a) Photos from Oscar dataset with low HKS value using normalized color histogram
as feature
(b) Photos from the Oscar dataset with high HKS values, using the normalized color histogram as the feature
(c) Photos from the Oscar dataset with low HKS values, using GIST as the feature
(d) Photos from the Oscar dataset with high HKS values, using GIST as the feature
Figure 5-9: Intersection of the lowest 15% HKS values computed on both the normalized color histogram and GIST
Figure 5-10: Summarizing Oscar 2014 in five photos
values generated by the normalized color histogram and GIST are shown in Figures 5-9b and 5-9d, respectively.
The normalized color histogram gives better results for filtering out the outliers. However, the GIST result is also useful in the sense that a photo has to be a "non-outlier" under both features to be included in the filtered subset. Although a few outliers remain in the intersection, this can be addressed by further tuning the parameters or features, and none of them were included in the final summarization. We also note that we omitted a few images from Figure 5-9a to reduce redundancy: there are several images similar in form to the famous selfie by Ellen DeGeneres, differing only by trivial modifications or image compression offsets.
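As a minimal sketch of this intersection filter, assume one HKS value per photo for each feature, with lower values indicating denser neighborhoods; the arrays hks_color and hks_gist below are random placeholders for the real per-feature HKS values, and the 15% fraction mirrors the setting above.

import numpy as np

# Placeholder per-photo HKS values; in the real pipeline these come
# from the heat kernel on each feature's affinity graph.
rng = np.random.default_rng(0)
hks_color = rng.random(1165)
hks_gist = rng.random(1165)

def dense_core(hks_values, keep_fraction):
    # Indices of photos whose HKS value falls within the lowest
    # keep_fraction of the collection, i.e., the denser part.
    cutoff = np.quantile(hks_values, keep_fraction)
    return set(np.flatnonzero(hks_values <= cutoff))

# Keep only photos that look like non-outliers under both features.
filtered = sorted(dense_core(hks_color, 0.15) & dense_core(hks_gist, 0.15))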
Finally, we applied FPS on the intersection; the resulting summarization is shown in Figure 5-10. As we can see from the figure, all images are directly relevant to the event. Although our algorithm does not use clusters explicitly, the summarization is formed by different parts of the feature space.
Figure 5-11: Photos most retweeted with hashtag "#TED2014"
In particular, two images come from the famous selfie, two from the red carpet photos, and one from a red carpet photo shown on television (which is visually different from the original red carpet photos). Our system captures that many photos are in fact modifications of the famous selfie and includes one of them in the summarization. In this particular example, our system achieves both representativeness and diversity.
5.1 Evaluation
In this section, we present an evaluation comparing our results with ones generated from metadata. When measuring the popularity of a tweet, people usually look at the number of retweets. Hence, one way to summarize a Twitter image collection is to pick the tweets that are retweeted the most. To some extent, this method overlaps with ours in the sense that popularity serves as a "vote" toward the "interestingness" of a scene. Figure 5-11 shows the photos that were retweeted the most with the hashtag "#TED2014".
Figure 5-12: Photos most retweeted with hashtag "#Oscars2014"
As we can see from the figure, both our result and the most retweeted photos share the scene of Edward Snowden speaking. This validates our result by showing that our method is able to capture the important moment. In addition, our result demonstrates better diversity by including different speakers.
Figure 5-12 shows the photos that were retweeted the most with the hashtag "#Oscars2014". Compared to our result, the representativeness of the most retweeted photos seems low, because almost none of them are actually related to the event. This also validates one of our hypotheses: the most retweeted photos can be biased by other factors. For instance, a user with more followers tends to get more retweets (e.g., Figure 5-13a), and tweets about a controversial topic (e.g., Figure 5-13b) or an ongoing social movement (e.g., Figure 5-13c) might get more retweets because of their supporters. Hence, a summarization that picks the most retweeted photos can be biased by one or more of the above factors.
One limitation of this comparison is that not all photos related to the event are included, but only those carrying the specific tag we used. In this particular example,
(a) Number of retweets biased by the number of followers
(b) Number of retweets biased by a controversial topic
(c) Number of retweets biased by an ongoing social movement
(d) The famous selfie by Ellen DeGeneres
Event / Method         Proposed method    Most retweeted
Event 1: TED 2014      60%                40%
Event 2: Oscar 2014    95%                5%

Table 5.1: The result of comparing the proposed method with the baseline on two events, using Amazon Mechanical Turk
Event / Method                          Proposed method    Most retweeted
Oscar 2014 (#Oscars & #Oscars2014)      52%                48%

Table 5.2: The result of comparing the proposed method with baselines on a larger data set of Oscar photos, using Amazon Mechanical Turk
the famous Oscar selfie taken by Ellen DeGeneres carries the hashtag "#Oscars" instead of "#Oscars2014" (Figure 5-13d), so it was not included in the summarization obtained by counting retweet numbers. This result also demonstrates an advantage of our approach, since users do not always know which hashtag will be the most appropriate one. By looking at visual features over a meaningful amount of data, however, the pattern can be identified and a reasonable summarization generated.
We also verified our result on Amazon Mechanical Turk (MTurk) by comparing it with the most retweeted tweets for the two events mentioned above. For each event, we uploaded our input photo collection and the two summarizations to MTurk. We asked users to choose which summarization they think is better, without knowing which result was generated by which method. We provided clear criteria for our definition of a good summarization, namely representativeness and diversity. We noted that representativeness is generally weighted higher than diversity, because a diverse set of outliers provides very little information.
We conducted the MTurk experiment and collected 20 responses. The result is shown in Table 5.1; each value shows the percentage of users who prefer a specific method. For the TED data set, our result is slightly better than the most retweeted photos, or at least of similar quality. Considering that our method is purely automatic, we think this illustrates its usefulness. As for the Oscar data set, it is perhaps no surprise that most users agree that our method produces the better result.
(a) Result of Oscar data set #2 by our method
(b) Photos of Oscar data set #2 that are most retweeted
Figure 5-13: The result of our method and baseline methods on Oscar data set #2
One can certainly argue that using both hashtags would be more favorable for the most retweeted method. We see the original setup as an advantage of our method, in the sense that users do not need to know the most appropriate hashtags a priori. Nevertheless, we were still interested in comparing our method with baseline methods on a data set generated by the most appropriate hashtags, so we built another data set in which photos carry either "#Oscars" or "#Oscars2014" as a hashtag. In addition to the most retweeted photos, we would also like to compare with the most favorited photos. Again, we uploaded the input photo collection, our result (Figure 5-13a), and the most retweeted photos (Figure 5-13b) to Amazon Mechanical Turk. We asked Turkers which summarization they prefer, similar to the experiment above; the result, based on 50 responses, is shown in Table 5.2. As we can see, users still prefer our method in this setting. We think the reason might be that the baseline results still contain a few outliers biased by the factors mentioned before.
[Screenshot of the "Rep or Not" HIT. Instructions shown to workers: "There are two images associated with a specific event in the following. Please choose the one that better represents the event. For a good summarization, we mean 1) representativeness: there are many similar photos in the original collection; 2) diversity: the chosen photos don't repeat themselves. Ideally, we want both properties in the summarization, but generally speaking representativeness is much more important than diversity." The HIT links to the full photo collection (around 2249 images for Oscar 2014), asks workers to browse it, then to pick the more representative of two images via radio buttons and submit; estimated completion time is 10 seconds per HIT, and feedback can be sent to clwen@mit.edu.]
Figure 5-14: A prototype for "Rep or Not", a crowd-sourcing approach to the image collection summarization problem
5.2 Discussion
From the results above, we found several interesting topics to discuss. First, we discuss how our work can potentially be extended into a crowd-sourcing approach. Second, we explain how to integrate the textual part of event summarization. Finally, we discuss how near-duplicate photos influence the performance of the system.
5.2.1 A crowd-sourcing approach
Recently, crowd-sourcing and human-in-the-loop computation have been extensively studied. For example, VisionBlocks [4] [6] provides users with an interface for visual programming, and CrowdCam [1] helps users navigate crowd images. For a survey on this topic, we refer the reader to [36]. Among these, "hot-or-not" style crowd-sourcing has produced interesting results such as Place Pulse [29] and StreetScore [25], where users are asked which of two images looks safer, wealthier, etc. The patterns can then be analyzed to derive the visual features that make a place look "safer".
It is tempting to speculate whether the same approach is applicable to image collection summarization. We have built a Human Intelligence Task (HIT) on Amazon Mechanical Turk; Figure 5-14 shows its interface. We name the task "Rep or Not", as it collects ground truth about whether users think a specific image is representative. Compared to projects such as Place Pulse, the problem we are solving is more challenging in the sense that whether an image is representative depends not only on the image per se, but on the relation between the image and the whole collection. Hence, although it is an interesting topic, how the ground truth provided by users can be incorporated is not straightforward. However, we do note that the ground truth, combined with ranking algorithms, can be used as another form of evaluation.
Specifically, let us assume the data was collected as pairwise comparisons over the whole image collection. We can then rank the representativeness of each image with the TrueSkill algorithm [19] or a similar method. In TrueSkill, the representativeness of each image is modeled by a normal distribution N(μ, σ²). All representativeness distributions are initialized to the same normal distribution with chosen parameters μ₀ and σ₀. After that, we update the distributions with each round of "matches". Suppose two images x_i and x_j had a match in which x_i was chosen as more representative than x_j. The update is as follows:
\begin{align}
\mu_i &\leftarrow \mu_i + \frac{\sigma_i^2}{c}\, f\!\left(\frac{\mu_i - \mu_j}{c}\right) \nonumber \\
\mu_j &\leftarrow \mu_j - \frac{\sigma_j^2}{c}\, f\!\left(\frac{\mu_i - \mu_j}{c}\right) \nonumber \\
\sigma_i^2 &\leftarrow \sigma_i^2 \left[ 1 - \frac{\sigma_i^2}{c^2}\, g\!\left(\frac{\mu_i - \mu_j}{c}\right) \right] \nonumber \\
\sigma_j^2 &\leftarrow \sigma_j^2 \left[ 1 - \frac{\sigma_j^2}{c^2}\, g\!\left(\frac{\mu_i - \mu_j}{c}\right) \right] \nonumber \\
c^2 &= 2\beta^2 + \sigma_i^2 + \sigma_j^2 \tag{5.1} \\
f(\theta) &= \mathcal{N}(\theta) / \Phi(\theta) \nonumber \\
g(\theta) &= f(\theta)\left(f(\theta) + \theta\right) \nonumber
\end{align}
where ε is an empirically chosen parameter representing the probability that two players will tie (omitted from the simplified update above), and β indicates the per-game variance. 𝒩(θ) and Φ(θ) denote the standard normal probability density function and cumulative distribution function, respectively.
Assuming enough comparisons are conducted, the TrueSkill scores converge, and we can assign a representativeness score r_i to each image. For instance, one simple choice is to assign r_i linearly according to the rank:

r_i = n − rank(i)

where n is the number of images.
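As a sketch, the update for a single comparison can be written as follows, omitting the draw margin for simplicity; the default β is an arbitrary placeholder rather than a tuned value.

import math

def phi(t):
    # Standard normal probability density.
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):
    # Standard normal cumulative distribution function.
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def f(t):
    return phi(t) / Phi(t)

def g(t):
    v = f(t)
    return v * (v + t)

def update(winner, loser, beta=4.0):
    # One match update: winner and loser are (mu, sigma^2) pairs.
    # beta is a hypothetical per-game variance, not a tuned value.
    (mu_i, s2_i), (mu_j, s2_j) = winner, loser
    c = math.sqrt(2 * beta ** 2 + s2_i + s2_j)
    t = (mu_i - mu_j) / c
    mu_i += s2_i / c * f(t)
    mu_j -= s2_j / c * f(t)
    s2_i *= 1 - s2_i / c ** 2 * g(t)
    s2_j *= 1 - s2_j / c ** 2 * g(t)
    return (mu_i, s2_i), (mu_j, s2_j)

# Hypothetical usage: image i beat image j in one pairwise comparison.
skill_i, skill_j = update((25.0, 69.4), (25.0, 69.4))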
With this, two tasks can be done. First, we can use the scores to evaluate the system proposed in Section 4. Second, we can train a regression system that predicts the representativeness of images in an input collection, which offers another approach to the summarization problem.
For evaluation, we can score any summarization, whether generated by our proposed method or not, by the following quality function, which measures both representativeness and diversity:
\begin{align}
Q(S) &= \sum_{i \in S} r_i - W \cdot \mathrm{Sim}(S) \tag{5.2} \\
\mathrm{Sim}(S) &= \sum_{i, j \in S,\ i \neq j} S(i, j) \nonumber
\end{align}
The basic idea is to maximize representativeness while minimizing similarity. W is a parameter that controls the weighting between the two objectives. S(i, j) is a measure of similarity between points x_i and x_j and is thus application-dependent.
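A minimal sketch of this quality function, together with a greedy heuristic for maximizing it, follows; r stands for the vector of crowd-derived scores r_i, sim for a precomputed pairwise similarity matrix S(i, j), and the greedy strategy is an illustrative choice rather than an exact optimizer.

def quality(S, r, sim, W):
    # Q(S) from Equation 5.2: total representativeness of the chosen
    # indices minus W times their total pairwise similarity. Each
    # unordered pair is counted once; the factor of two relative to
    # the ordered sum can be absorbed into W.
    S = list(S)
    rep = sum(r[i] for i in S)
    pair_sim = sum(sim[i, j] for a, i in enumerate(S) for j in S[a + 1:])
    return rep - W * pair_sim

def greedy_summary(r, sim, k, W=1.0):
    # Grow the summary one image at a time, always adding the image
    # that most increases Q. Simple, but not an exact maximizer.
    chosen = []
    for _ in range(k):
        rest = (i for i in range(len(r)) if i not in chosen)
        chosen.append(max(rest, key=lambda i: quality(chosen + [i], r, sim, W)))
    return chosen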
Following this framework, we can also derive another automatic approach to the image collection summarization problem. We can extract features as described in Section 4.2, with the crowd-sourced ranking data as the target variable, and train a regression system to predict it. A common choice is Support Vector Regression, which finds a regression function f(x) that approximates y:
\begin{equation}
f(x) = (w \cdot x) + b, \qquad w, x \in \mathbb{R}^N,\ b \in \mathbb{R} \tag{5.3}
\end{equation}
The ideal regression function f(x) minimizes the error between the predicted and target values while also controlling model complexity, by minimizing the following objective:
\begin{equation*}
\frac{1}{2}\lVert w \rVert^2 + C \sum_i \lvert y_i - f(x_i) \rvert_\varepsilon
\end{equation*}
Once we have predicted representativeness with the regression system, we can again find a summarization by maximizing the quality function in Equation 5.2.
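A hedged sketch of this regression step using scikit-learn's SVR (an assumption about tooling, not what was used in this thesis) follows, with random placeholders standing in for the image features of Section 4.2 and the crowd-derived scores r_i.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 64))   # placeholder image features (e.g., histograms)
y = rng.random(200)         # placeholder representativeness scores r_i

# The epsilon-insensitive loss corresponds to the |y_i - f(x_i)|_eps term
# above, and C trades training error against the ||w||^2 / 2 complexity term.
model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y)
scores = model.predict(X)   # predicted representativeness, ready for ranking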
One issue with an automatic system built on crowd-sourced data is scalability: the number of comparisons grows at least linearly with the number of images in the collection, and if we account for the noise produced by Turkers, the complexity can be even steeper. However, we see an opportunity to integrate our system with the crowd-sourcing approach. As we noted in Section 4.4, our system is good at automatically removing most of the
outliers. In practice, however, we still see a few outliers in the denser part of the feature space. These outliers are probably visually similar to non-outliers, which makes them hard to separate in a fully automatic approach. Hence, we can derive the following algorithm. First, in the outlier removal step, we focus on the top k% of images with the lowest HKS value. Second, we apply the crowd-sourced comparison described in this subsection and rank the images accordingly. Finally, a regression model can be used, or FPS can be applied to the top-ranked images. In other words, the crowd-sourcing approach can be a useful extension to our proposed method.
5.2.2 Textual data and metadata integration
As mentioned in Section 2, our goal in this study is to summarize Twitter events from a visual perspective. Hence the textual summarization part, although practically valuable, is beyond the scope of this project. However, our work can be a valuable addition to textual summarization techniques. One straightforward way to combine the visual and textual parts is to compute the summaries separately and integrate them in a well-designed layout. Another is to select the tweets whose photos were chosen by our algorithm as the representative tweets; in that case, both photos and texts are included in the summary. Moreover, our algorithm can be used in image search: given a keyword, images can be sorted by HKS value, which indicates how many similar photos are in the data set and hence how many people are paying attention to the scene.
One interesting future application is to summarize photos across the globe using location data. We could summarize what people are paying attention to in different cities at different times, serving as a kind of crowd-sourced local news. This can be done by dividing the world into several metropolitan areas, applying the summarization framework described in this thesis to each of them, and visualizing the highlighted photos on a world-scale map (a sketch of the regional grouping appears below). This is one of our original motivations for attacking the image collection summarization problem, and we think it is an interesting future direction for this project.
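A minimal sketch of the regional grouping follows; the fixed-degree lat/lon grid standing in for metropolitan areas, the .lat/.lon attributes on photos, and the minimum cell size are all assumptions for illustration.

from collections import defaultdict

def summarize_by_region(photos, summarize, cell_deg=1.0, min_photos=50):
    # Bucket geo-tagged photos into a lat/lon grid (a crude stand-in
    # for metropolitan areas), then run the per-collection summarizer
    # on each sufficiently populated cell.
    cells = defaultdict(list)
    for p in photos:
        key = (int(p.lat // cell_deg), int(p.lon // cell_deg))
        cells[key].append(p)
    return {key: summarize(group)
            for key, group in cells.items() if len(group) >= min_photos}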
Figure 5-15: Almost identical images in the Oscar data set
5.2.3 Near-duplicate photos and their influence
As mentioned in Section 4.1, we removed duplicated data so that the system would not be biased by it. However, there remain nearly identical photos with slight variations (due to cropping, different hue, image compression offsets, etc.) that are not filtered out by our system. As these photos are almost identical, they can arguably be regarded as another form of retweet. We refer to such photos as "visual retweets", as they differ from the general form of retweet, which is based on a combination of factors such as text.
For example, Figure 5-15 shows the photos from the Oscar data set with the lowest HKS values computed on the color histogram feature. As we can see from the figure, many of them are the same image with slight variations. Hence, to a certain degree, one can argue that our system uses the retweet signal. It is also because of this signal that we are able to include the famous selfie by Ellen DeGeneres, found within a group of highly similar photos, in our summarization (Figure 5-10).
This naturally leads to several questions. First, does the fact that the retweet signal was used undermine the value of our system? Second, to what degree is the signal present in our system? Third, can we remove the signal? Our answers are as follows.
We argue that the value of our system is not undermined by the fact that the retweet signal is, to some degree, present in it. Clearly, our system does not operate purely on the retweet signal; if it did, our result would be identical or very similar to the most retweeted photos. Instead, our system generates a very different result from the most retweeted photos and is preferred by users. For instance, Figure 5-13 shows fairly different results from our method and from the most retweeted photos. Moreover, our system is able to include Ellen's selfie even though it carries a different hashtag (#Oscars rather than #Oscars2014). The basic assumption of our algorithm is that by looking at what photos people are posting, we can find representative photos among the similar ones. Whether users physically took the photo themselves does not affect the fact that the photos are similar and worth sharing. Therefore, the contribution of our system lies in its ability to extract patterns from similar photos, whether they are visual retweets or not.
We further examined how significant the visual-retweet phenomenon is in our data sets. In the Oscar data set, about 20 to 30 of the 1165 images are Ellen's selfie with slight variations. In the TED data set, manual examination found little or no such phenomenon. In addition, the summarization photos we found generally have low retweet numbers. For the TED summarization generated by our method, the retweet counts from left to right in Figure 5-7 are 18, 6, 3, 4, and 1, respectively. For the Oscar summarization, the retweet counts from the second-left to the rightmost photo in Figure 5-10 are 0, 15, 0, and 0, respectively. The first photo in Figure 5-10 appears to have been deleted, so we cannot get its retweet count, but the first three visual retweets of it in the dataset returned retweet counts of 0, 4, and 0. Hence, within the two summarizations we generated, we speculate that only Ellen's selfie is influenced by the visual retweet signal, while the others arise from regular similar photos. We think Ellen's selfie is special in that it is a) highly worth sharing and b) physically photographable by only very few people at that time and place. Hence,
this helps to explain why the visual retweet works and is needed in this case. We also note that the retweet numbers suggest the photos we found have little overlap with the highly retweeted photos. This implies that the signal we found is orthogonal to the general retweet signal, and the combination of the two signals could be further investigated.
Finally, one might ask whether it is viable to filter out visual retweets. In the current system, we filter out images that are exactly identical. Following the same logic, we could introduce a threshold ε so that any pair of images whose distance is smaller than ε has one member filtered out. The tricky part is that ε might not be easy to tune: if ε is too small, the system fails to filter out the almost identical photos; if ε is too large, it filters out the regular similar photos, removing the very signal our system relies on. In addition, feature selection also plays a role, and near-duplicate image detection is itself non-trivial; several papers have been published in this field, for instance [7], [21], and [40]. A minimal sketch of such a threshold filter appears below.
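The sketch assumes a precomputed pairwise distance matrix D; the greedy keep-first strategy and the example ε values are illustrative choices, not tuned settings.

import numpy as np

def drop_near_duplicates(D, eps):
    # Scan photos in order; keep a photo only if its distance to every
    # photo kept so far exceeds eps, so one representative survives per
    # near-duplicate group.
    kept = []
    for i in range(D.shape[0]):
        if all(D[i, j] > eps for j in kept):
            kept.append(i)
    return kept

# Illustrative sensitivity sweep on a random symmetric distance matrix.
rng = np.random.default_rng(0)
A = rng.random((50, 50))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)
for eps in (0.05, 0.2, 0.5):
    print(eps, len(drop_near_duplicates(D, eps)))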
In summary, we do not consider the value of our system undermined by the fact that it partly uses the visual retweet signal. Also, although further examination might be needed, we speculate that the influence of visual retweets is limited. Finally, it is possible to partly remove visual retweets by setting a carefully chosen threshold on the minimum distance within the image collection.
Chapter 6
Conclusion
In this thesis, we presented a novel method to summarize a potentially large image collection, with the motivating scenario of summarizing Twitter photos of events. The method is composed of three main parts. First, diffusion distance is used to model the connectivity between data points. Second, outliers are filtered out according to their Heat Kernel Signature values. Third, Farthest Point Sampling is applied to sample the denser part of the feature space. With this method, we achieve both representativeness and diversity. Representativeness is achieved by focusing on the denser part of the feature space, where data points have more similar photos in the dataset; intuitively, this can be thought of as focusing on scenes that receive more "votes" than others. Diversity, on the other hand, is achieved by applying FPS, which maximizes the distance between sampled points.
We presented results of our method. On a clean and continuous dataset generated by a video circling an object, the system samples evenly, providing roughly evenly spaced views of the object. We also presented results on real-world Twitter data: summarizations of TED and the Oscars in 2014. Both summarizations include images directly relevant to the event despite the high ratio of outliers, and different parts of the dataset are sampled. We compared our results with the most retweeted photos and found that our results are of at least similar quality despite coming from an automatic method. In some cases our method does even better, because retweet numbers can be biased by a number of factors.
There are several limitations to our method. First, the selection of features and the tuning of parameters are not trivial: different datasets usually require different features, and Diffusion Maps are relatively sensitive to parameters, for instance the σ used in the Gaussian kernel. Second, the system is ignorant of high-level semantic knowledge. For instance, red carpet photos can be detected, but the system is indifferent to who is actually on the red carpet, as long as the photos are visually similar. This, however, is a fundamental question in computer vision and probably beyond the scope of this thesis.
We foresee several future directions. First, we want to explore combining visual and textual data as features. Textual data can be noisy and casual in a social media environment, and metadata can be biased by a number of factors, but they could be useful if combined cleverly with visual data. Second, we want to extend the event summarization technique into an event discovery method. Finally, it would be interesting to extend the method to online clustering, so that the summarization can happen in real time and update itself as new information arrives.
Bibliography

[1] Aydin Arpa, Luca Ballan, Rahul Sukthankar, Gabriel Taubin, Marc Pollefeys, and Ramesh Raskar. CrowdCam: Instantaneous navigation of crowd images using angled graph. In 3DTV-Conference, 2013 International Conference on, pages 422-429. IEEE, 2013.

[2] Hila Becker, Mor Naaman, and Luis Gravano. Beyond trending topics: Real-world event identification on Twitter. In ICWSM, 2011.

[3] Hila Becker, Mor Naaman, and Luis Gravano. Selecting quality Twitter content for events. ICWSM, 11, 2011.

[4] Abhijit Bendale, Kevin Chiu, Kshitij Marwah, and Ramesh Raskar. VisionBlocks: A social computer vision framework. In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 521-526. IEEE, 2011.

[5] Ling Chen and Abhishek Roy. Event detection from Flickr data through wavelet-based spatial analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 523-532. ACM, 2009.

[6] Kevin Chiu and Ramesh Raskar. Computer vision on tap. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, pages 31-38. IEEE, 2009.

[7] Ondrej Chum, James Philbin, and Andrew Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In BMVC, volume 810, pages 812-815, 2008.

[8] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006.

[9] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web, pages 761-770. ACM, 2009.

[10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.

[11] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Studying aesthetics in photographic images using a computational approach. In Computer Vision - ECCV 2006, pages 288-301. Springer, 2006.

[12] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. High level describable attributes for predicting aesthetics and interestingness. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1657-1664. IEEE, 2011.

[13] Yuval Eldar, Michael Lindenbaum, Moshe Porat, and Yehoshua Y. Zeevi. The farthest point strategy for progressive image sampling. Image Processing, IEEE Transactions on, 6(9):1305-1315, 1997.

[14] Boris Epshtein, Eyal Ofek, Yonatan Wexler, and Pusheng Zhang. Hierarchical photo organization using geo-relevance. In Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, page 18. ACM, 2007.

[15] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.

[16] Brendan C. Fruin, Hanan Samet, and Jagan Sankaranarayanan. TweetPhoto: photos from news tweets. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 582-585. ACM, 2012.

[17] K. Gębal, Jakob Andreas Bærentzen, Henrik Aanæs, and Rasmus Larsen. Shape analysis using the auto diffusion function. In Computer Graphics Forum, volume 28, pages 1405-1413. Wiley Online Library, 2009.

[18] Matthias Hein and Jean-Yves Audibert. Intrinsic dimensionality estimation of submanifolds in R^d. In Proceedings of the 22nd International Conference on Machine Learning, pages 289-296. ACM, 2005.

[19] Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill: A Bayesian skill rating system. In Advances in Neural Information Processing Systems, pages 569-576, 2006.

[20] Alexandar Jaffe, Mor Naaman, Tamir Tassa, and Marc Davis. Generating summaries and visualization for large collections of geo-referenced photographs. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 89-98. ACM, 2006.

[21] Yan Ke, Rahul Sukthankar, and Larry Huston. Efficient near-duplicate detection and sub-image retrieval. In ACM Multimedia, volume 4, page 5, 2004.

[22] Lyndon S. Kennedy and Mor Naaman. Generating diverse and representative image search results for landmarks. In Proceedings of the 17th International Conference on World Wide Web, pages 297-306. ACM, 2008.

[23] Ryong Lee and Kazutoshi Sumiya. Measuring geographical regularities of crowd behaviors for Twitter-based geo-social event detection. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Location Based Social Networks, pages 1-10. ACM, 2010.

[24] Michael Mathioudakis and Nick Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 1155-1158. ACM, 2010.

[25] Nikhil Naik, Jade Philipoom, Ramesh Raskar, and César Hidalgo. StreetScore - predicting the perceived safety of one million streetscapes. In IEEE Computer Vision and Pattern Recognition Workshops, 2014.

[26] Alan Ritter, Oren Etzioni, Sam Clark, et al. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104-1112. ACM, 2012.

[27] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99-121, 2000.

[28] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851-860. ACM, 2010.

[29] Philip Salesses, Katja Schechtner, and César A. Hidalgo. The collaborative image of the city: mapping the inequality of urban perception. PLoS ONE, 8(7):e68400, 2013.

[30] Ian Simon, Noah Snavely, and Steven M. Seitz. Scene summarization for online image collections. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1-8. IEEE, 2007.

[31] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1470-1477. IEEE, 2003.

[32] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer Graphics Forum, volume 28, pages 1383-1392. Wiley Online Library, 2009.

[33] Ronen Talmon, Israel Cohen, Sharon Gannot, and Ronald R. Coifman. Diffusion maps for signal processing: A deeper look at manifold-learning techniques based on kernels and graphs. Signal Processing Magazine, IEEE, 30(4):75-86, 2013.

[34] Antonio Torralba, Robert Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958-1970, 2008.

[35] Antonio Torralba, Kevin P. Murphy, William T. Freeman, and Mark A. Rubin. Context-based vision system for place and object recognition. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 273-280. IEEE, 2003.

[36] Catherine Wah. Crowdsourcing and its applications in computer vision. University of California, San Diego, 2006.

[37] Jianshu Weng and Bu-Sung Lee. Event detection in Twitter. In ICWSM, 2011.

[38] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3485-3492. IEEE, 2010.

[39] Luh Yen, Marco Saerens, Amin Mantrach, and Masashi Shimbo. A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-793. ACM, 2008.

[40] Dong-Qing Zhang and Shih-Fu Chang. Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 877-884. ACM, 2004.