
User-centered reduction of the semantic gap in
content-based image retrieval
Mark Pardijs
Universiteit Twente
Master Human Media Interaction
Capita Selecta
m.r.pardijs@student.utwente.nl
ABSTRACT
Content-Based Image Retrieval (CBIR) systems provide a
way to search for images which are not annotated with text,
by using image features such as color, shape and texture.
The problem with this approach is the semantic gap: user semantics, the users' abstract idea of what they expect from search results, are hard to relate to concrete image features. This article gives several pointers for reducing this semantic gap. The gap can be reduced by using algorithms which incorporate user semantics, and by giving users more influence on the search process. Promising developments are automated annotation, Latent Semantic Indexing (LSI) and user influence on the execution of algorithms: assigning weights to features and giving relevance feedback.
1. INTRODUCTION
The enormous increase of digital information available locally or on the Internet makes it almost impossible to annotate digital objects manually. Therefore, automatic indexing and retrieval systems are being developed. The research area of Information Retrieval (IR) has mainly focused on the disclosure of textual documents; in recent years, multimedia objects, including images, have also been taken into account. Content-Based Image Retrieval (CBIR) systems aim to recognize and retrieve information based on the content of images instead of looking at metadata provided with the images. The image content used mainly consists of color, shape and texture features. The problem with searching these features in images is the semantic gap: the difference between the high-level user semantics and the low-level image features, which need to be connected (E. van den Broek, 2006; Datta, Li, & Wang, 2005). In other words, it is hard to translate the user's need for a specific image in a manner comprehensible to a CBIR system. Although there are classic methods of performance measurement in IR, recall and precision, they indicate how well the system performs on the given query, not how useful the results actually are to the user (McDonald, Lai, & Tait, 2001).
There are two perspectives on this problem.
• User-specific: how to translate the user need into a clear query to the system which gives meaningful results. The focus lies on the interaction with the user.
• System-specific: how to process the user query effectively and efficiently to get satisfying results.
The research proposed below focuses on the user side of reducing the semantic gap. This is an interesting field, since there is a need for an approach to IR which looks at user intentions. Traditionally, the focus lies on laboratory experiments without taking user satisfaction into account. Recent initiatives, for example the book ’The Turn’ (Ingwersen & Jarvelin, 2005), address this problem and point out that a more cognitive view is needed. By a cognitive view, they mean that a system should have categories and concepts in mind which reflect the users’ world and incorporate the users’ way of information processing.
In a survey of over 100 CBIR systems, held in the year 2000, the main conclusion stated (Veltkamp & Tanase, 2002, p. 57): “It is widely recognized that most current content-based image retrieval systems work with low level features (color, texture, shape), and that next generation systems should operate at a higher semantic level.” The key to building more effective CBIR systems is not improving the user side or the system side in isolation, but moving the system side towards the user side.
2. RESEARCH GOAL
The goal of this research is to find directions for reducing the semantic gap, focusing on the movement towards the user side. First, the origin and backgrounds of the semantic gap are investigated: why is there a semantic gap, which human perceptions are involved in it, and which system components are at stake? Secondly, new developments in CBIR research are discussed.
2.1 Research questions
The research subjects presented above lead to the following research topics and questions:
• Backgrounds of the semantic gap
– What are the user perception aspects?
– What are the system perception aspects?
• How can the semantic gap be reduced?

Figure 1: Overview of CBIR
3. METHOD
Information retrieval is a young research area with ongoing developments and improvements. The research questions will be answered by conducting a literature survey of current CBIR developments with respect to the proposed subject. Interesting models which have been put into practice will be checked on their results. The rest of this paper is organized as follows. First, the backgrounds of the semantic gap are discussed, including a description of commonly used CBIR algorithms. Then, valuable solutions from new research initiatives for reducing the semantic gap are selected, concluding with an overview of future prospects and a discussion.
4. BACKGROUND: THE SEMANTIC GAP
In text retrieval, a text query is directly matched against the text in the document. Obviously, this is not the case for images. Searching images with textual queries is nevertheless widely implemented, the most-used example being Google image search (Google, 2007). Although this is the most common form of image search, it does not actually look at the images themselves, but at their surrounding text or manually added annotations. Since annotating images is tedious, and surrounding text is not necessarily related to the images, more advanced search methods have to be designed which search on visual features, matching them against the features of the images in the database. The main problem causing the semantic gap is that users do not assess images on these visual features but on semantics, i.e. the meaning of an image. Since an image allows far more interpretations than a text, visual similarity can be totally different from semantic similarity. Examples of high-level semantics which are hard to express in pictures are the activities taking place in the image or the emotions the picture evokes. Sometimes users even have no idea what they are looking for; they just want something which is ’the same idea’ as a previously found picture, or a picture which is coherent with the idea they have in mind. Of course, judging this coherency is only possible once the search results are returned. Moreover, ’the same’ can be very subjective. The semantic gap is widely discussed in the literature. Below we first describe the user perception aspects of the semantic gap, followed by the system aspects.
4.1 User perception aspects
From the user point of view, there are several aspects which contribute to the semantic gap. Users have an abstract notion of what will satisfy them (Zhao & Grosky, 2002). Often they have no exact image in mind of what they would like to see, but are looking for some concept or theme, like a picture about a football game or a picture with a car in it. The user has no picture in mind, but an abstract, high-level idea. How to find pictures which relate to this idea is exactly the bridge needed to reduce the semantic gap.

Searching is very subjective: different persons search in different ways. This subjectivity is also influenced by language difficulties. Due to synonymy and polysemy, ambiguity arises. Synonymy is the existence of several words for the same thing; polysemy means that the same word can be used to indicate different things. For example, the word Cherokee can be used in the context of cars (Jeep Cherokee) but can also denote a Native American people. The judgement of results is subjective as well. Users assess the relevance of an image in different ways, not only among users but even for the same user at different stages of the search process. Choi and Rasmussen point out several factors which influence the judgement of pictures. These factors tend to change during the search process and depend on the user's task. Among these factors are topicality, accuracy, novelty and accessibility. Especially topicality (is the picture found ’on the topic’ of the search query?) changes along the search process.

All these points have their origin in the friction between high-level semantics and low-level features. These two are discussed below.

4.1.1 High level semantics
There is a many-to-many relationship between user ideas and images (Zhao & Grosky, 2002), illustrated by Figure 2. 1) Users can have the same idea in mind, but find different pictures to fulfill this need. On the other hand, 2) different ideas can lead to the same picture as a satisfying result. The implication of 1) is that a query should return multiple results which the user, not the system, can decide on. 2) calls for a multiple and adjustable input approach to express different ideas. The user should be able to influence the search process in order to find results that are satisfying to his preference.

Figure 2: User idea vs. image

Moreover, humans are able to see things in an image that are not really visible in the picture itself. We can see things that are absent: when we see a portion of a race car, we know it is a race car and that the picture intends to show a race car, because it is familiar to us.
There are three components in the user-based definition of relevance: cognitive, situational and dynamic aspects. These aspects are closely related to the topicality problem addressed before.

1. Cognitive: a given document changes the mental state, influencing judgement; presentation also influences judgement, since visual presentation is processed faster.

2. Situational: depends on the type of information seeking and the situation.

3. Dynamic: decisions evolve during the evaluation process.

For example, a user has in mind a beach with a lot of sand and a blue, calm sea. To keep it simple, he just enters the search term beach, but this also yields results with rocky beaches and stormy seas. He then realizes his search criteria were too wide, and narrows his search query.
4.2 System perception aspects
The system perception aspects consist of matching algorithms which decide whether a picture is relevant to a query or not. The main problem with system perception and its algorithms is the difference between human vision and computer vision (E. L. v. d. Broek, 2005). For a user, an image can evoke a certain emotion which influences the judgement of the image. Users can recognize different shapes very accurately, whereas computer systems are able to classify shapes only with an accuracy far behind that of humans. Another example is that humans can easily recognize occluded objects, whereas a computer cannot. To sum up, a content-based retrieval system judges an image purely on its low-level features, while the user searches with a semantic mindset.

The system side depicted in Figure 1 will now be described, to give an idea of what is possible with current CBIR technologies. The most prevalent techniques are color, shape and texture extraction (Smeulders, Worring, Santini, Gupta, & Jain, 2000).
4.2.1 Color

Figure 3: Matching colors
By looking at every color in an image, a color histogram
is created. The histogram tells how much of every color is
present in the image. For example, the left image in Figure 3
will yield a color histogram of approximately 30% yellow and
70% blue. When a user has in mind a beach with a blue sea like the left image in Figure 3, and searches for it in a color-based search tool (for example, using a color picker or an example image), the system will look for images containing yellow and blue colors. The problem here is that images which do not contain a beach or sea but have the same color histogram will also show up in the results. For example, the right image in Figure 3 depicts a desert with a blue sky, which looks like the beach image but surely is not what the user wants. The low-level features of the pictures are the same (same color histogram) but the semantics (beach or desert) are not. The user recognizes this and is unsatisfied, but the system is not aware of its malfunctioning and considers its precision (the fraction of retrieved images that are relevant) to be high. A minimal sketch of such histogram matching is given below.
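The following sketch (our illustration, not code from any of the cited systems; the image names and bin count are our own assumptions) builds a coarse RGB histogram for two synthetic images, a ’beach’ and a ’desert’, and compares them with histogram intersection:

import numpy as np

def color_histogram(image, bins=8):
    # Quantize each RGB channel into `bins` levels and count how often
    # each quantized color occurs, normalized so the histogram sums to 1.
    quantized = (image // (256 // bins)).reshape(-1, 3)
    index = (quantized[:, 0] * bins + quantized[:, 1]) * bins + quantized[:, 2]
    hist = np.bincount(index, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    # Similarity in [0, 1]; 1 means identical color distributions.
    return np.minimum(h1, h2).sum()

# Two synthetic 100x100 images, both roughly 30% yellow and 70% blue.
beach = np.zeros((100, 100, 3), dtype=np.uint8)
beach[70:] = (255, 255, 0)   # yellow sand at the bottom
beach[:70] = (0, 0, 255)     # blue sea above

desert = np.zeros((100, 100, 3), dtype=np.uint8)
desert[:70] = (0, 0, 255)    # blue sky above
desert[70:] = (255, 255, 0)  # yellow sand below

print(histogram_intersection(color_histogram(beach), color_histogram(desert)))
# -> 1.0: identical histograms, completely different semantics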
4.2.2 Shape
There are algorithms which are able to distinguish shapes in pictures. With the shape technique, the same problem arises as with color: objects can have the same shape, but be totally different. For example, a palm tree can roughly have the same shape as a cheerleader, as illustrated by Figure 4. A sketch of one common way to compare shapes is given below.

Figure 4: Matching shapes
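As an illustration (ours, not a method from the paper), one common way to compare shapes is matching Hu moments, available in OpenCV as matchShapes; the two silhouettes below are crude stand-ins for the palm tree and the cheerleader:

import cv2
import numpy as np

def silhouette(points):
    # Draw a filled polygon on a blank canvas as a binary shape mask.
    canvas = np.zeros((200, 200), dtype=np.uint8)
    cv2.fillPoly(canvas, [np.array(points, dtype=np.int32)], 255)
    return canvas

a = silhouette([(100, 20), (60, 180), (140, 180)])   # tall, roughly triangular
b = silhouette([(100, 30), (65, 180), (135, 180)])   # a very similar outline

# matchShapes compares Hu moments of the two masks; lower scores mean
# more similar shapes, regardless of what the shapes actually depict.
score = cv2.matchShapes(a, b, cv2.CONTOURS_MATCH_I1, 0.0)
print(score)  # close to 0: 'same shape', possibly very different semantics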
4.2.3 Texture
Texture analysis tries to reveal differences in textures. Obviously, searching on a texture basis can also yield irrelevant results. For example, take a basketball and an orange: in color and shape they are almost the same, but the texture is different, as shown by Figure 5. A crude numeric illustration is given below.

Figure 5: Matching textures
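A crude texture measure (our sketch; real systems use richer descriptors such as Gabor filters or local binary patterns) is the mean squared local gradient, which separates a smooth surface from a dimpled one of the same average color:

import numpy as np

def texture_energy(gray):
    # Mean squared local gradient of a grayscale image: higher values
    # indicate a busier, more textured surface.
    gy, gx = np.gradient(gray.astype(float))
    return float(np.mean(gx ** 2 + gy ** 2))

rng = np.random.default_rng(0)
smooth = np.full((64, 64), 128.0)                      # e.g. a smooth orange
dimpled = smooth + 25 * rng.standard_normal((64, 64))  # e.g. a dimpled basketball

print(texture_energy(smooth), texture_energy(dimpled))  # 0.0 vs. a large value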
5. REDUCING THE SEMANTIC GAP
Based on the differences between user and system perception, it may seem impossible to carry out satisfying content-based image retrieval. On the other hand, as stated before, text-based image retrieval also has several drawbacks. Many CBIR initiatives focus on improving the system side, i.e. the isolated matching algorithms such as color or shape, without incorporating user semantics. Hence, they do not actually bridge the gap towards the user. In this section, some promising models and directions for reducing the semantic gap are given, ordered by the problems defined in section 4.
5.1 Problem: Matching algorithms
Following from the previously described image features on which retrieval can be done, it should be obvious that using only one of the features color, shape or texture can lead to undesired results. Several scenarios can occur when searching for an image: the image can be annotated with text or not, and its features can be very distinctive or not. For instance, images with distinctive colors but vague shapes should obviously be judged on their color histogram. The query can also be very variable: one can search for a distinctive shape, so the shape feature should be emphasized, or color can be very important in the query. Therefore, a combination algorithm is proposed which can easily be modified by the user. The gap is narrowed by letting the user directly modify the algorithm at stake. As Belkhatir, Mulhem, and Chiaramella put it:
“The growing need for intelligent systems, i.e. systems capable of bridging this semantic gap, leads to new architectures combining multiple characterizations of the image content.” (Belkhatir et al., 2004)

A system which already implements this approach is CIRES (Iqbal & Aggarwal, 2002). With CIRES, it is possible to assign weights to image features. For example, it is possible to give much attention to the color in an image, but discard the texture differences. A minimal sketch of such a user-weighted combination is given below.
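The sketch below (ours, not CIRES's actual code; the feature names and similarity functions are placeholders) shows the basic idea of a user-weighted combination of feature similarities:

def combined_score(query, image, similarities, weights):
    # Weighted sum of per-feature similarity scores; the user adjusts
    # `weights` to emphasize or discard individual features.
    total = sum(weights.values()) or 1.0
    return sum(w * similarities[name](query, image)
               for name, w in weights.items()) / total

# Placeholder similarities operating on dicts of precomputed feature values.
similarities = {
    "color":   lambda q, i: 1.0 - abs(q["color"] - i["color"]),
    "texture": lambda q, i: 1.0 - abs(q["texture"] - i["texture"]),
}
query = {"color": 0.70, "texture": 0.20}
image = {"color": 0.65, "texture": 0.90}

# Emphasize color and discard texture differences entirely:
print(combined_score(query, image, similarities, {"color": 1.0, "texture": 0.0}))
# Weigh both features equally:
print(combined_score(query, image, similarities, {"color": 0.5, "texture": 0.5}))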
5.1.1 CIRES
CIRES tackles two of the problems: high level semantics and the isolated use of feature extraction algorithms. Iqbal and Aggarwal identify that the problem of matching algorithms especially exists for pictures with man-made objects such as buildings and bridges. Although there are shape extraction algorithms which focus on distinct segments in a picture, they mostly do not recognize the fact that a man-made object is built from several segments with different shapes and textures. Perceptual grouping does reveal this structure, so that common man-made shapes in pictures can be identified. It incorporates higher level semantics since it is able to recognize objects in pictures. Especially for man-made and landscape pictures, the combination of structure, color and texture performs better than just color and texture: on man-made objects this means an increase in accuracy of 17%, on landscapes 8%. The differences in the categories birds, bugs, mammals and flowers are minimal. Since their new use of structure does not always give better results, weights can be assigned to the color, texture and structure extraction, thereby supporting a broader range of queries; see Figure 6, taken from their demo (Iqbal & Aggarwal, 2007).

Figure 6: CIRES demo screenshot
5.2 Problem: High level semantics
The main problem here is that the user sees things in images which cannot be translated to physical features, such as activities taking place or emotions which are evoked by a picture. Li and Wang propose automated annotation to resolve this problem. Other algorithms also use this approach, but ALIPR is the only one which works in real time.
5.2.1 ALIPR
ALIPR (Automated Linguistic Indexing of Pictures - Real
Time) annotates images by looking at a large database of images which are already annotated and categorized in groups.
ALIPR groups images in categories, also called semantic concepts. It uses the already annotated Corel image database,
which provides 599 concepts. Concepts are identified by
tags, such as ’sail, boat, ocean’ or ’landscape, mountain,
grass’. ALIPR defines models of image features for each concept by making a signature of each image in a concept and building a generative model from it. A signature consists of color and texture features. ALIPR can compare any image to the category models in its training database, examine the correlation with the categories, and annotate the image similarly to the best-matching category. Hence it is able to annotate the query image with the annotation words from the database. This results in a more semantic way of searching, since users can simply query for concepts they have in mind. For example, in their online demonstration system (Li & Wang, 2007), the query ’car outside’ yields pictures like Figure 7, but the query ’car indoor’ yields pictures like Figure 8. Even the subtle distinction ’car inside’, with pictures taken from within a car, is possible; see Figure 9.
About their results, Li and Wang (2006, p. 10) state: “When the top 15 words are used to describe each image, above 98% are correctly annotated by some words.” These results were obtained by looking at 5,400 real-world pictures and examining the annotations manually. In combination with their computational efficiency, which makes real-time annotation possible, ALIPR is a valuable contribution to the CBIR field. Their real-time annotation is possible firstly because their algorithm characterizes image features as a statistical distribution, without looking at each individual object within images. Secondly, they have a cumulative approach: the annotation algorithm only has to be trained for images of new concepts; previous concepts are stored in profiling models. A toy sketch of annotation by nearest concept model is given below.
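The toy sketch below (ours; ALIPR's actual models are generative distributions over color and texture signatures, not single vectors) conveys the idea of annotating an image with the tags of the closest concept model:

import numpy as np

# Hypothetical concept models: tag set -> a summary feature vector.
concepts = {
    ("sail", "boat", "ocean"):          np.array([0.1, 0.8, 0.6]),
    ("landscape", "mountain", "grass"): np.array([0.7, 0.3, 0.2]),
}

def annotate(signature, concepts):
    # Return the tag set of the concept whose model lies closest
    # to the image signature in feature space.
    return min(concepts,
               key=lambda tags: np.linalg.norm(concepts[tags] - signature))

new_image = np.array([0.2, 0.7, 0.5])  # signature of an unseen image
print(annotate(new_image, concepts))   # -> ('sail', 'boat', 'ocean')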
Figure 7: Query ’car outside’

Figure 8: Query ’car indoor’

Figure 9: Query ’car inside’

5.3 Problem: Linguistic problems
The limitations caused by synonymy and polysemy can be reduced by applying memory learning (Han, Ngan, Li, & Zhang, 2005), which uses the usage history to identify the relationship between low-level features and the semantic meaning of images. Another technique is Latent Semantic Indexing (LSI) (Zhao & Grosky, 2002). LSI tries to reveal the underlying semantic nature of image contents, and thus to find correlations between visual features and the semantics of visual documents or objects.
5.3.1 LSI
Zhao and Grosky use LSI to reduce the problem of synonymy, by clustering images which look different in their low-level features but do have similarities. LSI is a complex process; in short, it comes down to revealing this similarity by looking at features which co-occur in two different images; these images are then considered similar. For example, an image (image A) which is related to the concept ’sea + beach’ may not contain typical sea colors, but may contain features which occur in many ’sea + beach’ pictures; LSI can then recognize that image A is also about the concept ’sea + beach’. Hence, searching for a feature which does not directly occur in image A still returns image A, since LSI found that it is in many ways similar to the concept which is searched for. Results of experiments show that merging LSI into ’traditional’ color histogram based retrieval improves results (unfortunately, they do not give exact figures). A small numeric sketch of the LSI idea is given below.
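The sketch below (ours, not Zhao and Grosky's implementation) uses a truncated singular value decomposition of a small feature-by-image matrix to show how an image missing a feature can still end up close to related images in concept space:

import numpy as np

# Rows are low-level visual features, columns are images. The last image
# (image A) lacks feature 0 ('typical sea colors') but shares the other
# features with the two 'sea + beach' images.
A = np.array([
    [1, 1, 0],   # feature 0: sea colors
    [1, 1, 1],   # feature 1: a co-occurring 'beach' feature
    [1, 1, 1],   # feature 2: another co-occurring feature
    [0, 0, 0],   # feature 3: an unrelated feature
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1                 # keep only the strongest latent concept
concepts = Vt[:k].T   # each image as a k-dimensional concept vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Image A is close to the 'sea + beach' images in concept space,
# even though it misses the sea-color feature itself.
print(cosine(concepts[2], concepts[0]))  # ~1.0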
5.4 Using relevance feedback
Since image retrieval seems to be very user-dependent, relevance feedback (RF) mechanisms are more important here than in textual retrieval. The human perception of an image is subjective and depends on viewing the image. Thus, a user should be able to determine relevance criteria before the search as well as after it, as some relevance criteria only emerge when a user views images. To acknowledge the subjectiveness of the search process, RF should not be applied system-wide, but per user (for example, only within one search session). RF is mostly used to adjust the search query, instead of having influence on the search algorithm. However, a search database which incorporates previously given RF can hold more semantic information. Xiaofei He describes a learning algorithm which takes RF as input to re-rank images (He, 2004). In fact, both short-term and long-term learning should be applied: short-term learning is used for effective retrieval for a specific user at a specific time, while long-term learning improves the overall execution of the search algorithm. A classic sketch of query-adjusting RF is given below.
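As a minimal sketch of query-adjusting relevance feedback (a classic Rocchio-style update, our illustration; He (2004) uses a more sophisticated incremental subspace learning method), the query feature vector is moved towards images the user marked relevant and away from those marked non-relevant:

import numpy as np

def update_query(query, relevant, nonrelevant,
                 alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query feature vector towards the mean of the images the
    # user marked relevant and away from the mean of the non-relevant ones.
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q

query = np.array([0.5, 0.5])                   # e.g. (colorfulness, edge density)
relevant = np.array([[0.9, 0.4], [0.8, 0.5]])  # images the user liked
nonrelevant = np.array([[0.1, 0.9]])           # an image the user rejected

print(update_query(query, relevant, nonrelevant))
# The updated query sits closer to what this particular user judged relevant.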
6. CONCLUSION
Overall, a step forward in reducing the semantic gap consists of:

1. Using, in addition to known algorithms which extract color, shape, texture or other features, new algorithms which better incorporate high level semantics, as CIRES, ALIPR and LSI tend to do.

2. On-the-fly influence on the search process. This can be done by persistent relevance feedback.

3. Letting user input decide how algorithms are used, by providing a choice of features to include in the search process, and adjusting the weights of these features.

Figure 10 illustrates this. At the presentation layer of the system, the user side, the user can input adjustments which should be made to the algorithm. The user side translates these to the system side, depicted by the dashed line. Another new input is the input of user semantics on the system side. This is, for example, the ALIPR approach: the system side is modelled to reflect the user's semantic behaviour.

Figure 10: New overview

This conclusion is elaborated upon in the next section.
7. DISCUSSION
The effort put into system methods is often not as effective as the promising laboratory research results suggest. Much research focuses on complex algorithms and on increasing recall and precision, while the user-oriented semantic perspective does not connect to this. It seems more important to improve the user side of CBIR than to find smarter, more detailed feature extraction algorithms. Several points at the user side can be improved, accompanied by the reduction methods described in section 5.
Apart from the algorithm(s) used, a well-designed user interface is needed so the user can quickly navigate to example images to make his vague idea more concrete. (The definition of a ’well-designed user interface’ is beyond the scope of this article.) One example of a user-friendly interface is the following: the user enters a simple text query which expresses his first idea. In the returned results, the user looks for images which are relevant. When an image suits well, the system can search for more similar images. Hereby, semantic-based feature algorithms and RF are connected; a sketch of this loop is given below.
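The loop just described might look as follows (our sketch; all function names are hypothetical placeholders, not an existing API):

def search_session(text_query, search_by_text, search_by_example, pick_relevant):
    # Start from a rough text search; then iteratively refine by example
    # images the user marks as relevant (a form of relevance feedback).
    results = search_by_text(text_query)
    while (chosen := pick_relevant(results)) is not None:
        results = search_by_example(chosen)
    return results

# Toy demo with stand-in functions:
final = search_session(
    "beach",
    search_by_text=lambda q: ["img1", "img2"],
    search_by_example=lambda img: [img + "_similar"],
    pick_relevant=lambda results: None,  # the user is immediately satisfied
)
print(final)  # ['img1', 'img2']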
It is important to adapt the algorithm used to the goal of the user; otherwise the algorithm can be very fancy, but if the user is not able to reach his goal, it is useless. For pictures representing vague ideas, ALIPR's automatic annotation can be used to group images which apparently share the same vague idea. For images with distinctive features, LSI can be used to group images which look different at first sight but, when clustered, represent the same concept, thereby providing the user with better search results.

Including subjective user characteristics in the search is important to address the subjectiveness of the search process. This can be done by making combination algorithms possible. The key point is that it is not a matter of statically adjusting the algorithms, but of the connection between the algorithm and the user. Relevance feedback can also support this subjectiveness: with RF, the LSI algorithm can be trained to cluster more images, providing more relevant results.

A drawback of the methods which include user feedback, in the form of RF or adjustable weights, is that a lot of feedback is expected from the user, while at the same time the search process has to be user-friendly. On the other hand, demanding users are probably willing to put some effort into their search process, as long as they get satisfying results.
Zhao and Grosky acknowledge the limitations of current CBIR techniques. Neither textual annotation nor visual features can capture all of the image contents and semantics, but their combined vector space model is designed to integrate visual features and textual annotation, so further improvement is possible.
Another way of improving the engagement of users is to give more explanation about the search process. When users comprehend why the system makes certain choices, they can adjust their search, or give relevance feedback which changes the system's behavior.

After all, incorporating user semantics in CBIR seems an ongoing and never-ending process. This area of research is constantly renewing itself due to the application of new algorithms and improving technology. For example, digital cameras are more and more capable of saving relevant context information. Modern cameras can save an image together with settings such as the brightness or aperture used, which tell something about the context of the picture taken. This may provide new ways of searching digital images.
References

Belkhatir, M., Mulhem, P., & Chiaramella, Y. (2004). The outline of an ’intelligent’ image retrieval engine. In ICWI (pp. 1228-1251).

Broek, E. van den. (2006). Tekst- en beeldanalyse in een zoekmachine met menselijke trekjes [Text and image analysis in a search engine with human traits]. Informatie Professional, 8(3).

Broek, E. L. van den. (2005). Human-centered content-based image retrieval.

Broek, E. van den, Rikxoort, E. van, & Schouten, T. (2005). Human-centered object-based image retrieval. Lecture Notes in Computer Science (Advances in Pattern Recognition).

Carson, C., Thomas, M., Belongie, S., Hellerstein, J. M., & Malik, J. (1999). Blobworld: A system for region-based image indexing and retrieval. In VISUAL ’99: Proceedings of the Third International Conference on Visual Information and Information Systems (pp. 509-516). London, UK: Springer-Verlag.

Choi, Y., & Rasmussen, E. M. (2002). Users’ relevance criteria in image retrieval in American history. Information Processing & Management, 38(5), 695-726.

Datta, R., Li, J., & Wang, J. Z. (2005). Content-based image retrieval: Approaches and trends of the new age. In MIR ’05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval (pp. 253-262). New York, NY, USA: ACM Press.

Google. (2007). Google image search. http://www.google.com/images.

Han, J., Ngan, K., Li, M., & Zhang, H. (2005, April). A memory learning framework for effective image retrieval. IEEE Transactions on Image Processing, 14(4), 511-524.

He, X. (2004). Incremental semi-supervised subspace learning for image retrieval. In MULTIMEDIA ’04: Proceedings of the 12th Annual ACM International Conference on Multimedia (pp. 2-8). New York, NY, USA: ACM Press.

Ingwersen, P., & Jarvelin, K. (2005). The turn: Integration of information seeking and retrieval in context. Dordrecht, The Netherlands: Springer.

Iqbal, Q., & Aggarwal, J. (2002). CIRES: A system for content-based retrieval in digital image libraries.

Iqbal, Q., & Aggarwal, J. (2007). CIRES: Content based image retrieval system. http://amazon.ece.utexas.edu/~qasim/cires.htm.

Li, J., & Wang, J. Z. (2006). Real-time computerized annotation of pictures. In MULTIMEDIA ’06: Proceedings of the 14th Annual ACM International Conference on Multimedia (pp. 911-920). New York, NY, USA: ACM Press.

Li, J., & Wang, J. Z. (2007). ALIPR - automatic image tagging and visual image search. http://alipr.com.

Ma, W. Y., & Manjunath, B. S. (1997). NETRA: A toolbox for navigating large image databases. In ICIP ’97: Proceedings of the 1997 International Conference on Image Processing (Vol. 1, p. 568). Washington, DC, USA: IEEE Computer Society.

McDonald, S., Lai, T.-S., & Tait, J. (2001). Evaluating a content based image retrieval system. In SIGIR ’01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 232-240). New York, NY, USA: ACM Press.

Muller, H. (2002). User interaction and evaluation in content-based visual information retrieval.

Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.

Veltkamp, R., & Tanase, M. (2002). Content-based image retrieval systems: A survey.

Vries, A. de, Kazai, G., & Lalmas, M. (2004). Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit.

Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9), 947-963.

Zhao, R., & Grosky, W. I. (2002). Bridging the semantic gap in image retrieval. Hershey, PA, USA: IGI Publishing.