Adaptive Linking between Text and Photos
Using Common Sense Reasoning
Henry Lieberman and Hugo Liu
MIT Media Laboratory, Software Agents Group
20 Ames St., E15-320G
Cambridge, MA 02139, USA
lieber@media.mit.edu, hugoliu@mit.edu
Abstract
In a hypermedia authoring task, an author often wants to set
up meaningful connections between different media, such as
text and photographs. To facilitate this task, it is helpful to
have a software agent dynamically adapt the presentation of
a media database to the user's authoring activities, and look
for opportunities for annotation and retrieval. However,
potential connections are often missed because of
differences in vocabulary or semantic connections that are
"obvious" to people but that might not be explicit.
ARIA (Annotation and Retrieval Integration Agent) is a software agent that acts as an assistant to a user writing e-mail
or Web pages. As the user types a story, it does continuous
retrieval and ranking on a photo database. It can use
descriptions in the story to semi-automatically annotate
pictures. To improve the associations beyond simple
keyword matching, we use natural language parsing
techniques to extract important roles played by text, such as
"who, what, where, when". Since many of the photos depict
common everyday situations such as weddings or recitals,
we use a common sense knowledge base, Open Mind, to fill
in semantic gaps that might otherwise prevent successful
associations.
Introduction
The task described in this paper is the robust retrieval of
annotated photos by a keyword query. By “annotated
photos”, we mean a photo accompanied by some metadata
about the photo, such as keywords and phrases describing
people, things, places, and activities depicted in the photo.
By “robust retrieval”, we mean that photos should be retrievable not just by the explicit keywords in the annotation, but also by other keywords conceptually related to the event depicted in the photo.
In the retrieval sense, annotated photos behave similarly
to documents because both contain text, which can be
exploited by conventional IR techniques. In fact, common query enrichment techniques such as thesaurus-based keyword expansion, developed for document retrieval, may very well work for the photo retrieval domain.
However, keyword expansion using thesauri is limited in its usefulness, because keywords expanded by their synonyms can still only retrieve documents directly related to the original keyword. Furthermore, naïve synonym expansion may actually contribute more noise to the query than benefit: if a keyword's sense cannot be disambiguated, then synonyms for all the word senses of that word may be used in the expansion, and this has the potential to retrieve many irrelevant documents.
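To make the word-sense problem concrete, the following Python sketch (illustrative only; the toy thesaurus is invented for exposition and is not a resource used in this work) shows how expanding a keyword with the synonyms of all of its senses pollutes a query:

# Toy thesaurus: each keyword maps to synonyms grouped by sense.
TOY_THESAURUS = {
    "groom": {
        "wedding sense": ["bridegroom", "newlywed"],
        "stable sense": ["stableman", "hostler"],
        "tidying sense": ["neaten", "tidy"],
    },
}

def naive_expand(keyword):
    """Union of synonyms across ALL senses: the problematic strategy."""
    expanded = {keyword}
    for synonyms in TOY_THESAURUS.get(keyword, {}).values():
        expanded.update(synonyms)  # no disambiguation: every sense leaks in
    return expanded

print(naive_expand("groom"))
# A query about a wedding photo now also matches stables and housekeeping.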
Related Work
Attempting to overcome the limited usefulness of
keyword expansion by synonyms, various researchers have
tried to use slightly more sophisticated resources for query
expansion. These include dictionary-like resources such as
lexical semantic relations (Voorhees, 1994), and keyword
co-occurrence statistics (Peat, 1991), as well as resources generated dynamically through relevance feedback, such as global document analysis (Xu, 1996) and collaborative concept-based expansion (Klink, 2001).
Although some of these approaches are promising, they
share some of the same problems as naïve synonym
expansion. Dictionary-like resources such as WordNet
(Fellbaum, 1998) and co-occurrence frequencies, although more sophisticated than plain synonyms, still operate mostly at the word level and suggest expansions that are lexically motivated rather than conceptually motivated. Relevance feedback, though somewhat more successful than dictionary approaches, requires additional user action and so cannot be considered fully automated retrieval, which makes it inappropriate for our task.
Photos vs. Documents
With regard to our domain of photo retrieval, we make a
key observation about the difference between photos and
documents. We argue that a photo has more structure and
is more predictable than a document, even though that
structure may not be immediately evident. The contents of
a typical document such as a web page are hard to predict,
because there are too many types and genres of web pages
and the content doesn’t usually follow a stereotyped
structure. However, typical photos, such as those found in your photo album, exhibit more predictable structure.
Photos capture people and things in certain places,
participating in certain events and activities. Many of
these situations depicted, such as a wedding, a hike, or a
recital are common to human experience, and therefore
have a high level of predictability.
Take, for example, a picture annotated with the keyword “bride”. Even without looking at the photo, a person may be able to successfully guess who else is in the photo and what situation is being depicted. Common sense says that brides are usually found at weddings; that the people around her usually include the groom, the father of the bride, and bridesmaids; that weddings may take place in a chapel or church; and that there may be a wedding cake, a walk down the aisle, and a wedding reception.
World Semantics
Knowledge about the spatial, temporal, and social relations of the everyday world, as exemplified above, is part of commonsense knowledge. We also call this world semantics, referring to the meaning of concepts and how concepts relate to each other in the world.
The mechanism we propose for robust photo retrieval
uses a world semantic resource in order to expand concepts
in existing photo annotations with concepts that are, inter
alia, spatially, temporally, and socially related. More
specifically, we automatically constructed our resource
from a corpus of English sentences about commonsense by
first extracting predicate argument structures, and then
compiling those structures into a Concept Node Graph,
where the nodes are commonsense concepts, and the
weighted edges represent commonsense relations. The
graph is structured like MindNet (Dolan, 1998).
Performing concept expansion using the graph is modeled
as spreading activation. Relevance of a concept is
measured as world semantic proximity between nodes on
the graph.
This paper is structured as follows: First, we discuss the
source and nature of the corpus of commonsense
knowledge used by our mechanism. Second, a discussion
follows regarding how our world semantic resource was
automatically constructed from the corpus. Third, we show
the spreading activation strategy for robust photo retrieval.
The paper concludes with a discussion of the larger system
to which this mechanism belongs, potential application of
this type of resource in other domains, and plans for future
work.
Open Mind: A Corpus of
Commonsense Knowledge
The source of the world semantic knowledge used by our mechanism is the Open Mind Commonsense Knowledge Base (OMCS) (Singh, 2002), an endeavor at the MIT Media Laboratory that aims to allow a web community of teachers to collaboratively build a database of “common sense” knowledge. OMCS contains over 400,000 semi-structured English sentences about commonsense, organized into an ontology of commonsense relations such as the following:
- A is a B
- You are likely to find A in/at B
- A is used for B
By semi-structured English, we mean that many of the sentences loosely follow one of 20 or so sentence patterns in the ontology. However, the words and phrases represented by A and B (see above) are not restricted.
Some examples of sentences in the knowledge base are:
- Something you find in (a restaurant) is (a waiter)
- The last thing you do when (getting ready for bed) is (turning off the lights)
- While (acting in a play) you might (forget your lines)
The parentheses above denote the part of the sentence
pattern that is unrestricted. The major limitations of OMCS are two-fold. First, there is much ambiguity: word senses are not disambiguated, and the more complex sentences cannot be fully parsed. Second, some of the knowledge is noisy, in that many sentences do not truly reflect common sense, and no claims are made about the coverage of the commonsense in the knowledge base.
The Open Mind Commonsense Knowledge Base is often
compared to its more famous counterpart, the CYC
Knowledge Base (Lenat, 1998). CYC contains over
1,000,000 hand-entered rules that constitute “common
sense”. Unlike OMCS, CYC represents knowledge using
formal logic, and ambiguity is minimized. Unfortunately,
the CYC corpus is not publicly available at this time.
Even though OMCS is a slightly noisy and ambiguous corpus, it is still suitable for our task, because we only need to mine conceptual relations from it using shallow techniques.
Constructing a World Semantic Resource
In this section, we describe how a subset of the knowledge
in OMCS is extracted and structured to be useful to the
photo retrieval task. First, we apply sentence pattern rules
to the raw OMCS corpus and extract crude predicate
argument structures, where predicates represent
commonsense relations and arguments represent
commonsense concepts. Second, concepts are normalized
using natural language techniques. Third, the predicate
argument structures are read into a Concept Node Graph,
where nodes represent concepts, and edges represent
predicate relationships. Edges are weighted to indicate the
strength of the relationship.
Extracting Predicate Argument Structures
The first step of extracting predicate argument structures is
to apply a fixed number of mapping rules to the sentences
in OMCS. Each mapping rule captures a different
commonsense relation. The commonsense relations that interest us for constructing our world semantic resource for photos fall under the following general categories of knowledge:
1. Classification: A dog is a pet
2. Spatial: San Francisco is part of California
3. Scene: Things often found together are: restaurant,
food, waiters, tables, seats
4. Purpose: A vacation is for relaxation; Pets are for
companionship
5. Causality: After the wedding ceremony comes the
wedding reception.
6. Emotion: A pet makes you feel happy;
Rollercoasters make you feel excited and scared.
In our extraction system, mapping rules can be found
under all of these categories. To explain mapping rules, we
give an example of knowledge from the aforementioned
Scene category:
somewhere THING1 can be is PLACE1
somewherecanbe
THING1, PLACE1
0.5, 0.1
The first line in a mapping rule is a sentence pattern.
THING1 and PLACE1 approximately bind to a word or
phrase, which is later mapped to a set of canonical
commonsense concepts. Line 2 specifies the name of this
predicate relation. Line 3 specifies the arguments to the
predicate, and corresponds to the variable names in line 1.
The pair of numbers on the last line represents the confidence weights given to the forward relation (left to right) and the backward relation (right to left), respectively, for this predicate relation. These weights also correspond to the weights associated with the directed edges between the nodes THING1 and PLACE1 in the graph representation.
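As a concrete illustration, the mapping rule above might be realized as a regular expression over a sentence, as in the following sketch (our own rendering for exposition; the paper does not show the actual extraction code):

import re

# The "somewherecanbe" rule: pattern, predicate name, argument names,
# and (forward, backward) confidence weights, as given above.
RULE = {
    "pattern": re.compile(r"^somewhere (?P<THING1>.+) can be is (?P<PLACE1>.+)$"),
    "predicate": "somewherecanbe",
    "args": ("THING1", "PLACE1"),
    "weights": (0.5, 0.1),
}

def apply_rule(sentence, rule):
    m = rule["pattern"].match(sentence.strip().lower())
    if m is None:
        return None
    args = tuple(m.group(name) for name in rule["args"])
    return (rule["predicate"], args, rule["weights"])

print(apply_rule("Somewhere a bride can be is at a wedding", RULE))
# -> ('somewherecanbe', ('a bride', 'at a wedding'), (0.5, 0.1))
# The raw text blobs still need to be normalized into concepts.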
It is important to distinguish the value of the forward relation of a particular rule from that of the backward relation. For example, let us consider the commonsense
fact, “somewhere a bride can be is at a wedding.” Given
the annotation “bride,” it may be very useful to return
“wedding.” However, given the annotation “wedding,” it
seems to be less useful to return “bride,” “groom,”
“wedding cake,” “priest,” and all the other things found in
a wedding. For our problem domain, we will generally penalize the direction in a relation that returns hyponymic concepts as opposed to hypernymic ones.
Approximately 20 mapping rules are applied to all the sentences (400,000+) in the OMCS corpus. From this, a crude set of predicate argument relations is extracted. At this point, the text blob bound to each of the arguments needs to be normalized into concepts.
Normalizing Concepts
Because any arbitrary text blob can bind to a variable in a
mapping rule, these blobs need to be normalized into
concepts before they can be useful. There are two categories of concepts that accommodate the majority of the commonsense knowledge in OMCS: Noun Phrases (things, places, people) and Activity Phrases (e.g., “walk the dog,” “buy groceries”).
To normalize a text blob into a Noun Phrase or Activity Phrase, we part-of-speech tag the blob and use the tags to filter it through a miniature noun phrase and activity phrase grammar. If the blob does not fit the grammar, it is massaged until it fits, or else rejected altogether. The final step involves normalizing the verb tenses and the number of the nouns. Only after this is done can the predicate argument structure be added to our repository. A brief description of each of the aforementioned steps is given below.
The aforementioned noun phrase and activity phrase
grammar is as follows:
NOUN PHRASE:
(PREP) (DET|POSS-PRON) NOUN
(PREP) (DET|POSS-PRON) NOUN NOUN
(PREP) NOUN POSS-MARKER (ADJ) NOUN
(PREP) (DET|POSS-PRON) NOUN NOUN NOUN
(PREP) (DET|POSS-PRON) (ADJ) NOUN PREP NOUN
ACTIVITY PHRASE:
(PREP) (ADV) VERB (ADV)
(PREP) (ADV) VERB (ADV) (DET|POSS-PRON) (ADJ) NOUN
(PREP) (ADV) VERB (ADV) (DET|POSS-PRON) (ADJ) NOUN NOUN
The application of the grammar behaves like a filter. Optional tokens, shown in parentheses, may be matched by the input, but they are filtered out of the output. For example, the phrase “in your playground” will match the first rule, but after the grammar is applied, the phrase becomes just “playground”.
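The following self-contained sketch illustrates this filtering behavior for the first noun phrase rule; a toy tag lookup stands in for a real part-of-speech tagger, so the vocabulary and tags here are invented for illustration:

# Toy POS lookup in place of a real tagger.
TOY_TAGS = {"in": "PREP", "your": "POSS-PRON", "the": "DET",
            "playground": "NOUN"}

# First noun phrase rule: (PREP) (DET|POSS-PRON) NOUN
# Parenthesized positions are optional and are dropped from the output.
NP_RULE = [("PREP",), ("DET", "POSS-PRON"), ("NOUN",)]
OPTIONAL = {0, 1}

def filter_phrase(words):
    tags = [TOY_TAGS.get(w) for w in words]
    out, i = [], 0
    for pos, allowed in enumerate(NP_RULE):
        if i < len(tags) and tags[i] in allowed:
            if pos not in OPTIONAL:      # optional matches are filtered out
                out.append(words[i])
            i += 1
        elif pos not in OPTIONAL:
            return None                  # mandatory token missing: reject
    return " ".join(out) if i == len(words) else None

print(filter_phrase(["in", "your", "playground"]))  # -> playground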
Concept Node Graph
To model concept expansion as a spreading activation task,
we convert the predicate argument structures gathered
previously into a Concept Node Graph shown in Figure 1.
[Figure 1 diagram omitted: concept nodes such as bride, groom, and wedding, linked by weighted directed edges labeled with predicates such as FoundTogether and PartOf.]
Figure 1. A portion of the Concept Node Graph. Nodes are
concepts, and edges correspond to predicate relations.
The following statistics were compiled on the automatically constructed resource:
- 400,000+ sentences in the OMCS corpus
- 50,000 predicate argument structures extracted
- 20 predicates in mapping rules
- 30,000 concept nodes
- 160,000 edges
- average branching factor of 5
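A minimal sketch of the underlying data structure, assuming the graph is stored as nested dictionaries of directed, weighted edges (an implementation choice made here for illustration, not a claim about the actual system):

from dataclasses import dataclass, field

@dataclass
class PredArg:
    predicate: str   # e.g. "somewherecanbe"
    arg1: str        # normalized concept, e.g. "bride"
    arg2: str        # normalized concept, e.g. "wedding"
    fwd: float       # forward confidence weight (arg1 -> arg2)
    bwd: float       # backward confidence weight (arg2 -> arg1)

@dataclass
class ConceptNodeGraph:
    edges: dict = field(default_factory=dict)  # edges[src][dst] = weight

    def add_edge(self, src, dst, weight):
        # keep the strongest weight if the same edge arrives twice
        best = self.edges.setdefault(src, {})
        best[dst] = max(weight, best.get(dst, 0.0))

def build_graph(structures):
    g = ConceptNodeGraph()
    for s in structures:
        g.add_edge(s.arg1, s.arg2, s.fwd)
        g.add_edge(s.arg2, s.arg1, s.bwd)
    return g

g = build_graph([PredArg("somewherecanbe", "bride", "wedding", 0.5, 0.1)])
print(g.edges)  # {'bride': {'wedding': 0.5}, 'wedding': {'bride': 0.1}}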
Concept Expansion By Spreading Activation
In this section, we explain how concept expansion is modeled as spreading activation. We propose two heuristics for re-weighting the graph to improve relevance. Examples of the spreading activation are then given.
In spreading activation, the origin node is the concept we
wish to expand and it is the first node to be activated.
Next, all the nodes one hop away from the origin node are
activated, then two levels away, and so on. Nodes will
continue to be activated so long as their activation score
meets the activation threshold, which is a number between
0 and 1.0. Given nodes A and B, where A has an edge pointing to B, the activation score (AS) of B is:
AS(B) = AS(A) * weight(edge(A, B))
When no more nodes are activated, we have found all
the concepts that expand the input concept.
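The procedure can be sketched as follows, assuming the nested-dictionary graph of the earlier sketch; the traversal details (breadth-first passes, keeping the best score per node) are our own choices where the text leaves them open:

def expand(graph, origin, threshold=0.1):
    scores = {origin: 1.0}            # the origin node is activated first
    frontier = [origin]
    while frontier:                   # each pass activates one more hop
        next_frontier = []
        for a in frontier:
            for b, w in graph.get(a, {}).items():
                score = scores[a] * w   # AS(B) = AS(A) * weight(edge(A, B))
                if score >= threshold and score > scores.get(b, 0.0):
                    scores[b] = score
                    next_frontier.append(b)
        frontier = next_frontier
    scores.pop(origin)                # report only the expansions
    return sorted(scores.items(), key=lambda kv: -kv[1])

edges = {"bride": {"wedding": 0.5, "veil": 0.3},
         "wedding": {"groom": 0.5, "bride": 0.1}}
print(expand(edges, "bride"))
# -> [('wedding', 0.5), ('veil', 0.3), ('groom', 0.25)]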
Heuristics to Improve Relevance
One problem that can arise with spreading activation is that nodes activated two or more hops away from the origin node may quickly lose relevance, causing the search to lose focus. One reason for this is noise. Because concept nodes do not make distinctions between different word senses, it is possible that a node represents many different word senses, so activating nodes more than one hop away risks exposure to noise. Although associating weights with the edges provides some measure of relevance, these weights form a homogeneous class for all edges of a common predicate (recall that the weights came from mapping rules).
We identify two opportunities to re-weight the graph to improve relevance: reinforcement and popularity.
Reinforcement
[Figure 2 diagram omitted: node C is reached from origin node A both through node B via two-hop path P and directly via one-hop path Q.]
Figure 2. An example of reinforcement.
As illustrated in Figure 2, we make the observation that if node C is two hops away from the origin node A through path P, and also one hop away from node A through path Q, then node C is more likely to be relevant than if path Q had not existed. This is the idea of reinforcement. We define reinforcement as two or more corroborating pieces of evidence, represented by paths, that two nodes are related. The stronger the reinforcement, the higher the potential relevance. Looking at this another way, if three or more nodes are mutually connected, they form a cluster, and any two nodes in the cluster have enhanced relevance because the other nodes provide additional paths for reinforcement. Applying this, we re-weight the graph by detecting clusters and increasing the weight on edges within each cluster.
Popularity
The second observation we make is that if an origin node A has a path through node B, and node B has 100 children, then each of node B's children is less likely to be relevant to node A than if node B had had 10 children. This is based on empirical evidence with commonsense knowledge, though it is also a common technique used in spreading activation (Salton, 1988).
We refer to nodes with a large branching factor as popular. It so happens that popular nodes in our graph tend to be very common concepts in commonsense knowledge, or tend to have many different word senses or word contexts. This makes their children, in general, less relevant.
[Figure 3 diagram omitted: bride leads to bridesmaid, whose few children (bf = 2) are wedding concepts, and to groom, whose many children (bf = 129) range over both wedding and beauty concepts.]
Figure 3. Illustrating the negative effects of popularity.
As illustrated in Figure 3, the concept bride may lead to bridesmaid and groom. Whereas bridesmaid is a more specific concept, not appearing in many contexts, groom is a less specific concept. In fact, different senses and contexts of the word can mean “the groom at a wedding,” “grooming a horse,” or “he is well-groomed.” This causes groom to have a much larger branching factor.
It seems that even though our knowledge is commonsense, there is more value associated with more specific concepts than with general ones. To apply this principle, we visit each node and discount the weights on each of its edges based on the following rule (α and β are constants):
newWeight = oldWeight * discount
discount = 1 / log(α * branchingFactor + β)
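A sketch of this re-weighting pass, using the nested-dictionary graph from the earlier sketches; the natural logarithm and the constants α = 1 and β = 3 are illustrative choices, not the tuned values used in the system:

import math

ALPHA, BETA = 1.0, 3.0

def discount_popular_nodes(edges):
    for src, targets in edges.items():
        bf = len(targets)                   # branching factor of src
        d = 1.0 / math.log(ALPHA * bf + BETA)
        for dst in targets:
            targets[dst] *= d               # damp every outgoing edge

edges = {("groom"): {("sense%d" % i): 0.5 for i in range(129)},  # popular
         "bridesmaid": {"wedding": 0.5, "dress": 0.5}}           # specific
discount_popular_nodes(edges)
print(round(edges["groom"]["sense0"], 3),
      round(edges["bridesmaid"]["wedding"], 3))
# -> 0.102 0.311 : the popular node's edges are damped far more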
Examples
Below are actual runs of the concept expansion program
using an activation threshold of 0.1.
>>> expand("bride")
('wedding', '0.3662') ('woman', '0.2023')
('ball', '0.1517') ('tree', '0.1517')
('snow covered mountain', '0.1517')
('flower', '0.1517') ('lake', '0.1517')
('cake decoration', '0.1517') ('grass', '0.1517')
('groom', '0.1517') ('tender moment', '0.1517')
('veil', '0.1517') ('tuxedo', '0.1517')
('wedding dress', '0.1517') ('sky', '0.1517')
('hair', '0.1517') ('wedding boquet', '0.1517')
>>> expand('london')
('england', '0.9618') ('ontario', '0.6108')
('europe', '0.4799') ('california', '0.3622')
('united kingdom', '0.2644') ('forest', '0.2644')
('earth', '0.1244')
>>> expand("symphony")
('concert', '0.5') ('piece music', '0.4')
('kind concert', '0.4') ('theatre', '0.2469')
('screw', '0.2244') ('concert hall', '0.2244')
('xylophone', '0.1') ('harp', '0.1')
('viola', '0.1') ('cello', '0.1')
('wind instrument', '0.1') ('bassoon', '0.1')
('bass fiddle', '0.1')
>>> expand("listen to music")
('relax', '0.4816') ('be entertained', '0.4816')
('have fun', '0.4')
("understand musician's feelings", '0.4')
('feel emotionally moved', '0.4')
('hear music', '0.4') ('end silence', '0.4')
('understand', '0.4') ('mother', '0.2')
('feel emotion brings', '0.136')
('get away', '0.136') ('listen', '0.136')
('change psyche', '0.136') ('show', '0.1354')
('dance club', '0.1295') ('frisbee', '0.1295')
('scenery', '0.124') ('garden', '0.124')
('spa', '0.124') ('bean bag chair', '0.124')
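To indicate how such expansions serve retrieval, the following sketch (our construction; the file name and matching scheme are invented, and the expansion table is hand-copied from the “bride” run above) matches a query keyword against a photo's original and expanded annotations:

# Partial expansion of "bride", copied from the run above.
BRIDE_EXPANSION = {"wedding": 0.3662, "woman": 0.2023, "groom": 0.1517,
                   "veil": 0.1517, "wedding dress": 0.1517}

def score_photo(query, annotations, expansions):
    if query in annotations:
        return 1.0                      # exact annotation match
    return expansions.get(query, 0.0)   # world-semantic match

photo = {"file": "img042.jpg", "annotations": {"bride"}}
print(score_photo("wedding", photo["annotations"], BRIDE_EXPANSION))  # 0.3662
print(score_photo("casino", photo["annotations"], BRIDE_EXPANSION))   # 0.0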
Conclusion
In this paper, we presented a mechanism for robust photo retrieval that performs concept expansion on a photo’s annotations using a world semantic resource. The
resource was automatically constructed from the publicly
available Open Mind Commonsense corpus. Sentence
patterns were applied to the corpus, and simple predicate
argument structures were extracted. After normalizing
arguments into syntactically neat concepts, a weighted
concept node graph was constructed. Concept expansion was modeled as spreading activation over the graph. To
improve relevance in spreading activation, the graph was
re-weighted according to arguments for reinforcement and
against popularity.
This work has not yet been formally evaluated. Any evaluation will likely take place in the context of the larger system in which this mechanism is used: the Annotation and Retrieval Integration Agent (ARIA) (Lieberman, 2001). ARIA is an assistive software agent that automatically
learns annotations for photos by observing how users place
photos in emails and web pages. It also monitors the user
as she types an email and finds opportunities to suggest
relevant photos. The idea of using world semantics to
make the retrieval robust comes from recognizing that
given one annotation about the situation depicted in a
photo, it becomes possible to predict other annotations
given some knowledge about spatial, temporal, and social
relations of the everyday world. While the knowledge extracted from OMCS does not give very complete coverage of many different concepts, we believe that the concept expansions that do occur add value to the retrieval process. Sometimes the concept expansions are
irrelevant, but because ARIA engages in opportunistic
retrieval, the user does not suffer as a result. We
sometimes refer to ARIA as being “fail-soft” because the
user does not rely on the software to suggest photos. But
when the software does suggest a relevant photo, the user
feels that ARIA has been helpful.
Robust photo retrieval is not the only IR task in which
semantic resources extracted from OMCS have been
successfully applied. (Liu, 2002a) used OMCS to generate effective search queries by analyzing the user’s search goals. (Liu, 2002b) describes how story scripts can be generated given a seed story sentence.
In general, OMCS can benefit any program that deals
with concepts of the everyday world, or assists the user in
performing an everyday task. However, because of limitations such as the noise and ambiguity associated with this corpus, it is likely to be useful only at a very shallow level, such as in various IR tasks.
Future work is planned to improve the performance of
the mechanism presented in this paper. One major
limitation that we have encountered is noise, stemming
from ambiguous word senses and contexts. To overcome
this, we hope to apply word sense disambiguation
techniques to the concepts and the query, using word sense
co-occurrence statistics, WordNet, or LDOCE. A similar
approach could be taken to disambiguate meaning contexts,
but it is less clear how to proceed. We also hope to migrate from sentence patterns to a broad-coverage parser, so that we can extract kinds of commonsense relations from the corpus that cannot ordinarily be captured by a sentence pattern.
Acknowledgements
We thank our colleagues Push Singh and Kim Waters at the MIT Media Lab, Tim Chklovski at the MIT AI Lab, and Erik Mueller at IBM, who are also working on the problem of commonsense reasoning, for their contributions to our collective understanding of the issues. We would especially like to thank Push for leading the Open Mind effort and for his advocacy of commonsense reasoning.
References
Borchardt, G. C. (1992). Understanding Causal
Descriptions of Physical Systems. Proc. AAAI Tenth
National Conference on Artificial Intelligence, San Jose,
CA, 2-8.
Davis, E. (1990) Representations of commonsense
knowledge. San Mateo, Calif.: Morgan Kaufmann.
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic
Lexical Database. MIT Press, Cambridge, MA.
Ide, N. and Véronis, J. (Eds.) 1998. Special Issue on Word
Sense Disambiguation. Computational Linguistics, 24(1).
Jurafsky, D. and Martin, J. (2000). Speech and Language
Processing: An Introduction to Natural Language
Processing, Computational Linguistics and Speech
Recognition, pp. 501-661. Prentice Hall.
Klink, S. (2001) Query reformulation with collaborative
concept-based expansion. Proceedings of the First
International Workshop on Web Document Analysis,
Seattle, WA.
Lenat, D. (1998) The dimensions of context-space, Cycorp
technical report, www.cyc.com.
Lieberman, H., & Selker, T. (2000). Out of context:
Computer systems that adapt to, and learn from, context.
IBM Systems Journal, 39(3,4):617-632.
Lin, D. (1998). Using collocation statistics in information
extraction. In Proc. of the Seventh Message Understanding
Conference (MUC-7).
Liu, H., Lieberman, H., Selker, T. (2002a). GOOSE: A Goal-Oriented Search Engine With Commonsense. In Proceedings of the 2002 International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, Malaga, Spain.
Liu, H., Singh, P. (2002b). MAKEBELIEVE: Using Commonsense to Generate Stories. In Proceedings of the National Conference on Artificial Intelligence (AAAI-02), Student Abstract. Seattle, WA.
Minsky, M. (2000). Commonsense-Based Interfaces. Communications of the ACM, 43(8), 66-73.
Mueller, E. T. (2000). A calendar with common sense. In
Proceedings of the 2000 International Conference on
Intelligent User Interfaces, 198-201. New York:
Association for Computing Machinery.
Peat, H. J. and Willett, P. (1991). The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the ASIS, 42(5), 378-383.
Salton G. and Buckley C. (1988). On the Use of Spreading
Activation Methods in Automatic Information Retrieval, In
Proc. 11th Ann. Int. ACM SIGIR Conf. on R&D in
Information Retrieval (ACM), 147-160.
Shneiderman, B., Byrd, D., and Croft, B. (1998) Sorting
out searching: A user-interface framework for text
searches, Communications of the ACM 41, 4 (April 1998),
95-98.
Singh, P. (2002). The public acquisition of commonsense
knowledge. In Proceedings of AAAI Spring Symposium:
Acquiring (and Using) Linguistic (and World) Knowledge
for Information Access. Palo Alto, CA, AAAI.
Voorhees, E. (1994). Query expansion using lexical-semantic relations. In Proc. of ACM SIGIR Intl. Conf. on Research and Development in Information Retrieval, pages 61-69.
Xu, J. and Croft, W.B. (1996). Query Expansion Using Local and Global Document Analysis. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 4-11.