User-centered reduction of the semantic gap in content-based image retrieval

Mark Pardijs
Universiteit Twente
Master Human Media Interaction, Capita Selecta
m.r.pardijs@student.utwente.nl

ABSTRACT
Content-Based Image Retrieval (CBIR) systems provide a way to search for images which are not annotated with text, by using image features such as color, shape and texture. The problem with this approach is the semantic gap: user semantics, the abstract idea of what users expect from search results, are hard to relate to these concrete image features. This article gives several pointers for reducing the semantic gap: using algorithms which incorporate user semantics, and giving users more influence on the search process. Promising developments are automated annotation, Latent Semantic Indexing (LSI) and user influence on the execution of algorithms: assigning weights to features and giving relevance feedback.

1. INTRODUCTION
The enormous increase of digital information available locally or on the Internet makes it almost impossible to annotate digital objects manually. Therefore, automatic indexing and retrieval systems are developed. The research area of Information Retrieval (IR) has mainly focused on the disclosure of textual documents. In the last few years, multimedia objects, including images, have also been taken into account. Content-Based Image Retrieval (CBIR) systems aim to recognize and retrieve images based on their content instead of looking at metadata provided with the images. The content features most used are color, shape and texture. The problem with searching on these features is the semantic gap: the difference between the high-level user semantics and the low-level image features, which need to be connected (E. van den Broek, 2006; Datta, Li, & Wang, 2005). In other words, it is hard to translate the user's need for a specific image in a manner comprehensible to a CBIR system. Although there are classic methods of performance measurement in IR, recall and precision, they indicate how well the system performs on the given query, not how useful the results actually are to the user (McDonald, Lai, & Tait, 2001).
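For reference, both measures are defined over the result set: with R the set of images actually relevant to the query and A the set of images retrieved,

\[
\text{precision} = \frac{|R \cap A|}{|A|}, \qquad
\text{recall} = \frac{|R \cap A|}{|R|}.
\]

Both are computed against a fixed ground truth of relevance, which is exactly why they say little about how useful an individual user finds the results.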
There are two perspectives on this problem:
• User-specific: how to translate the user need into a clear query to the system which gives meaningful results. The focus lies on the interaction with the user.
• System-specific: how to process the user query effectively and efficiently to get satisfying results.

The research proposed below focuses on the user side of reducing the semantic gap. This is an interesting field, since there is a need for an approach to IR which looks at user intentions. Traditionally the focus lies on laboratory experiments, without taking the user satisfaction into account. Recent initiatives, for example the book 'The Turn' (Ingwersen & Jarvelin, 2005), address this problem and point out that a more cognitive view is needed. By a cognitive view, they mean that a system should have categories and concepts in mind which reflect the users' world and incorporate the users' way of information processing. In a survey of over 100 CBIR systems, conducted in 2000, the main conclusion stated (Veltkamp & Tanase, 2002, p. 57): "It is widely recognized that most current content-based image retrieval systems work with low level features (color, texture, shape), and that next generation systems should operate at a higher semantic level." The key to building more effective CBIR systems is not improving the user side or the system side in isolation, but moving the system side towards the user side.

2. RESEARCH GOAL
The goal of this research is to find directions for reducing the semantic gap, focusing on the movement towards the user side. First, the origin and backgrounds of the semantic gap are investigated: why is there a semantic gap, which human perceptions are involved in it, and which system components are at stake. Secondly, new developments in CBIR research are discussed.

2.1 Research questions
The research subjects presented above lead to the following research topics and questions:
• Backgrounds of the semantic gap
– What are the user perception aspects?
– What are the system perception aspects?
• How can the semantic gap be reduced?

Figure 1: Overview of CBIR

3. METHOD
Information retrieval is a young research area with ongoing developments and improvements. The research questions will be answered by a literature survey of current CBIR developments with respect to the proposed subject. Interesting models which have been put into practice will be checked on their results. The rest of this paper is organized as follows. First the backgrounds of the semantic gap are discussed, including a description of commonly used CBIR algorithms. Then, valuable solutions from new research initiatives for reducing the semantic gap are selected, concluding with an overview of future prospects and a discussion.

4. BACKGROUND: THE SEMANTIC GAP
In text retrieval, a text query is directly matched to the text in the document. Obviously, this is not possible for images. Searching images with textual queries is nevertheless widely implemented, the best-known example being Google image search (Google, 2007). Although this is the most used form of searching images, it does not actually look at the images themselves, but at their surrounding text or manually added annotations. Since annotating images is tedious, and surrounding text is not necessarily related to the images, more advanced search methods have to be designed, which match visual features of a query against features of the images in the database. The main problem causing the semantic gap is that users do not assess images on these visual features but on semantics, i.e. the meaning of an image. Since an image allows far more interpretations than text, visual similarity can be totally different from semantic similarity. Examples of high-level semantics which are hard to express in visual features are the activities taking place in an image or the emotions a picture evokes. Sometimes users have no precise idea what they are looking for; they just want something which is 'the same idea' as a previously found picture, or some picture which is coherent with the idea they have in mind. Of course, judging this coherency is only possible once the search results are returned. Moreover, 'the same' can be very subjective. In the literature, the semantic gap is widely discussed. Below we first describe the user perception aspects of the semantic gap, followed by the system aspects.

4.1 User perception aspects
From the user's point of view there are several aspects which contribute to the semantic gap. Users have an abstract notion of what will satisfy them (Zhao & Grosky, 2002). Often they have no exact image in mind of what they would like to see, but are looking for some concept or theme, like a picture about a football game or a picture with a car in it. The user has no picture in mind, but an abstract, high-level idea. How to find pictures which relate to this idea is exactly the bridge to be built in reducing the semantic gap.
Searching is also very subjective: different persons search in different ways. This subjectivity is influenced by language difficulties as well. Due to synonymy and polysemy, ambiguity arises. Synonymy is the existence of several words for the same thing; polysemy means that the same word can be used to indicate different things. For example, the word Cherokee can be used in the context of cars (Jeep Cherokee) but can also denote a Native American person. The judgement of results is subjective too. Users assess the relevance of an image in different ways, not only among users but even for the same user at different stages of the search process. Choi and Rasmussen (2002) point out several factors which influence the judgement of pictures. These factors tend to change during the search process and depend on the user's task. Some of the factors are topicality, accuracy, novelty and accessibility. Especially topicality (is the picture found 'on the topic' of the search query?) changes along the search process. All these points have their origin in the friction between high-level semantics and low-level features. These two are discussed below.

4.1.1 High-level semantics
There is a many-to-many relationship between user ideas and images (Zhao & Grosky, 2002), illustrated by Figure 2: 1) users can have the same idea in mind, but find different pictures to fulfill this need; on the other hand, 2) different ideas can lead to the same picture as a satisfying result. The implication of 1) is that a query should return multiple results for the user, not the system, to decide on. 2) calls for a multiple and adjustable input approach to express different ideas. The user should be able to influence the search process in order to find results that satisfy his preference.

Figure 2: User idea vs. image

Humans are able to see things in an image that are not really visible in the picture itself. We can see things that are absent: when we see only a portion of a race car, we know it is a race car and that the picture intends to show a race car, because it is familiar to us.

There are three components in the user-based definition of relevance: cognitive, situational and dynamic aspects. These aspects are closely related to the topicality problem addressed before.
1. Cognitive: a given document changes the mental state, which influences judgement; the presentation also influences judgement, and visual presentation is processed faster.
2. Situational: relevance depends on the type of information seeking and on the situation.
3. Dynamic: decisions evolve during the evaluation process.
For example, a user has in mind a beach with a lot of sand and a blue, calm sea. To keep it simple, he just enters the search term beach, but this also yields results with rocky beaches and stormy seas. He then realizes his search criteria were too wide, and refines his search query.

4.2 System perception aspects
The system perception aspects consist of matching algorithms which decide whether a picture is relevant to a query or not. The main problem with the system perception and its algorithms is the difference between human vision and computer vision (E. L. van den Broek, 2005). For a user, an image can evoke a certain emotion which influences the judgement of the image. Users can recognize different shapes very accurately, whereas computer systems are able to classify shapes with an accuracy far behind that of humans. Another example is that humans can easily recognize occluded objects, where a computer cannot. To sum up, a content-based retrieval system purely judges an image on its low-level features, while the user searches with a semantic mindset. The system side depicted in Figure 1 will now be described, to give an idea of what is possible with current CBIR technologies. The most prevalent techniques are color, shape and texture extraction (Smeulders, Worring, Santini, Gupta, & Jain, 2000).

4.2.1 Color
By counting every color in an image, a color histogram is created. The histogram tells how much of every color is present in the image. For example, the left image in Figure 3 will yield a color histogram of approximately 30% yellow and 70% blue. When a user has in mind a beach with a blue sea like the left image in Figure 3 and searches for it in a color-based search tool (for example, using a color picker or an example image), the system will look for images containing yellow and blue colors. The problem here is that images which do not contain a beach or sea but have the same color histogram will also show up in the results. For example, the right image in Figure 3 depicts a desert with a blue sky, which looks like the beach image but surely is not what the user wants. The low-level features of the pictures are the same (same color histogram) but the semantics (beach or desert) are not. The user recognizes this and is unsatisfied, but the system is not aware of its malfunctioning and considers its precision (the fraction of retrieved images that is relevant) to be high.

Figure 3: Matching colors
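To make the color technique concrete, the following sketch computes a coarse RGB histogram and scores two images by histogram intersection. It is a minimal illustration, not the algorithm of any particular system discussed here; the bin count and the intersection measure are common but arbitrary choices.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Coarse RGB histogram of an (H, W, 3) uint8 image,
    normalized so that the bins sum to 1."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / pixels.shape[0]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; identical color distributions score 1."""
    return np.minimum(h1, h2).sum()
```

Note that the beach and the desert image of Figure 3 would score close to 1 under this measure, which is precisely the failure mode described above: identical low-level statistics, different semantics.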
4.2.2 Shape
There are algorithms which are able to distinguish shapes in pictures. With shape techniques, the same problem arises as with color: objects can have the same shape, but be totally different. For example, a palm tree can roughly have the same shape as a cheerleader, as illustrated by Figure 4.

Figure 4: Matching shapes

4.2.3 Texture
Texture analysis tries to reveal differences in texture. Obviously, searching on a texture basis can also yield irrelevant results. Take, for example, a basketball and an orange: in color and shape they are almost the same, but the texture is different, as shown by Figure 5.

Figure 5: Matching textures

5. REDUCING THE SEMANTIC GAP
Given these differences between user and system perception, it may seem impossible to carry out satisfying content-based image retrieval. On the other hand, as stated before, text-based image retrieval also has several drawbacks. Many CBIR initiatives focus on improving the system side, i.e. the isolated matching algorithms for color or shape, without incorporating user semantics. Hence, they do not actually bridge the gap towards the user. In this section, some promising models and directions for reducing the semantic gap are given, ordered by the problems defined in Section 4.

5.1 Problem: Matching algorithms
Following from the previously described image features on which retrieval can be done, it should be obvious that using only one of the features color, shape or texture can lead to undesired results. Several scenarios can occur when searching for an image: the image can be annotated with text or not, and the features can be very distinctive or not. For instance, images with distinctive colors but vague shapes should obviously be judged on their color histogram. The query can also vary: one can search for a distinctive shape, so the shape feature should be emphasized, or color can be the most important aspect of the query. Therefore, a combination algorithm is proposed which can easily be modified by the user. The gap is then reduced by letting the user directly adjust the algorithm at stake. As Belkhatir, Mulhem, and Chiaramella (2004) put it: "The growing need for intelligent systems, i.e. being capable of bridging this semantic gap, leads to new architectures combining multiple characterizations of the image content." A system which already implements this approach is CIRES (Iqbal & Aggarwal, 2002). With CIRES, it is possible to assign weights to image features. For example, it is possible to give much attention to the color in an image, but discard the texture differences.
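A user-adjustable combination can be as simple as a weighted sum of per-feature distances. The sketch below assumes distance functions for the individual features already exist; the linear weighting and the function names are illustrative assumptions, not the actual CIRES formula.

```python
def combined_distance(query, image, weights, distance_fns):
    """Weighted combination of per-feature distances (each in [0, 1]).

    weights:      user-chosen emphasis, e.g.
                  {"color": 0.8, "shape": 0.2, "texture": 0.0}
    distance_fns: maps each feature name to a distance function.
    """
    total = sum(weights.values())
    return sum(w * distance_fns[f](query, image)
               for f, w in weights.items()) / total
```

With the example weights above, texture differences are discarded entirely and color dominates the ranking, mirroring the weight-assignment behaviour described in the text.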
5.1.1 CIRES
CIRES tackles two of the problems: high-level semantics and the isolated use of feature extraction algorithms. Iqbal and Aggarwal identify that the problem of matching algorithms especially exists in pictures with man-made objects such as buildings and bridges. Although there are shape extraction algorithms which focus on distinct segments in a picture, they mostly do not recognize the fact that a man-made object is built from several segments with different shapes and textures. Perceptual grouping does reveal this structure, so that common man-made shapes in pictures can be identified. This incorporates higher-level semantics, since the system is able to recognize objects in pictures. Especially for man-made and landscape pictures, the combination of structure, color and texture performs better than color and texture alone: on man-made objects this means an increase in retrieval accuracy of 17%, on landscapes 8%. The differences in the categories birds, bugs, mammals and flowers are minimal. Since the use of structure does not always give better results, weights can be assigned to the color, texture and structure extraction, thereby supporting a broader range of queries; see Figure 6, taken from their online demo (Iqbal & Aggarwal, 2007).

Figure 6: CIRES demo screenshot

5.2 Problem: High-level semantics
The main problem here is that the user sees things in images which cannot be translated to physical features, such as activities taking place or emotions evoked by a picture. Li and Wang (2006) propose automated annotation to resolve this problem. Other systems also use this approach, but their ALIPR is the only one which works in real time.

5.2.1 ALIPR
ALIPR (Automated Linguistic Indexing of Pictures - Real Time) annotates images by looking at a large database of images which are already annotated and categorized in groups. ALIPR groups images in categories, also called semantic concepts. It uses the already annotated Corel image database, which provides 599 concepts. Concepts are identified by tags, such as 'sail, boat, ocean' or 'landscape, mountain, grass'. ALIPR defines a model of image features for each concept by making a signature of each image in the concept and building a generative model from these signatures. A signature consists of color and texture features. ALIPR can compare any image to the concept models in its training database, examine the correlation with the concepts, and annotate the image like the best-matching concepts. Hence it is able to annotate a query image with annotation words from the database. This results in a more semantic way of searching, since users can simply query for concepts they have in mind. For example, in the online demonstration system (Li & Wang, 2007), the query 'car outside' yields pictures like Figure 7, while the query 'car indoor' yields pictures like Figure 8. Even the subtle distinction 'car inside', with pictures taken from within a car, is possible; see Figure 9. About the results, Li and Wang (2006, p. 10) state: "When the top 15 words are used to describe each image, above 98% are correctly annotated by some words." These results were obtained by looking at 5,400 real-world pictures and examining the annotation manually. In combination with the computational efficiency which makes real-time annotation possible, ALIPR is a valuable contribution to the CBIR field. The real-time annotation is possible, first, because the algorithm characterizes image features by a statistical distribution, without looking at each individual object within images. Secondly, it has a cumulative approach: the annotation algorithm only has to be trained for images of new concepts, while previous concepts are stored in profiling models.

Figure 7: Query 'car outside'
Figure 8: Query 'car indoor'
Figure 9: Query 'car inside'
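ALIPR's actual concept models are generative and statistical; as a deliberately simplified sketch of the underlying idea, matching a new image's signature against per-concept models and borrowing the tags of the closest concepts, consider the following (nearest-mean matching here is a placeholder for ALIPR's real modelling):

```python
import numpy as np

def annotate(signature, concept_models, top_k=3):
    """Tag an image with the words of its closest concepts.

    signature:      color + texture feature vector of the new image.
    concept_models: (tags, mean_signature) pairs, one per concept,
                    learned from an annotated training collection.
    """
    ranked = sorted(concept_models,
                    key=lambda cm: np.linalg.norm(signature - cm[1]))
    words = []
    for tags, _ in ranked[:top_k]:   # closest concepts first
        words.extend(tags)
    return words
```

The cumulative property described above corresponds to the fact that a new concept only requires adding one more entry to concept_models, without retraining the rest.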
5.3 Problem: Linguistic problems
The limitations of synonymy and polysemy can be reduced by applying memory learning (Han, Ngan, Li, & Zhang, 2005), which uses the usage history to identify the relationship between low-level features and the semantic meaning of images. Another technique is Latent Semantic Indexing (LSI) (Zhao & Grosky, 2002). LSI tries to reveal the underlying semantic nature of image contents, and thus to find correlations between visual features and the semantics of visual documents or objects.

5.3.1 LSI
Zhao and Grosky use LSI to reduce the problem of synonymy, by clustering images which look different at the level of low-level features but do have similarities. LSI is a complex process; in short, it comes down to revealing this similarity by looking at features which co-occur in different images, which are then considered similar. For example: an image (image A) which is related to the concept 'sea + beach' may not contain typical sea colors, but may contain features which occur in many 'sea + beach' pictures; LSI can then recognize that image A is also about the concept 'sea + beach'. Hence, searching for a feature which does not directly occur in image A still returns image A, since LSI found that it is in many ways similar to the concept searched for. Results of experiments show that merging LSI into 'traditional' color histogram based retrieval improves results (unfortunately, no exact figures are given).
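The LSI step itself is well defined: build a feature-by-image matrix, factorize it with a truncated singular value decomposition, and compare images in the reduced space, where features that co-occur across the collection collapse onto shared latent dimensions. A minimal sketch (how the matrix is filled and the choice of k are application decisions):

```python
import numpy as np

def lsi_coordinates(feature_image_matrix, k):
    """Project images into a k-dimensional latent space via truncated SVD.

    feature_image_matrix: rows are visual features, columns are images.
    Returns a (k, n_images) matrix of latent image coordinates.
    """
    u, s, vt = np.linalg.svd(feature_image_matrix, full_matrices=False)
    return np.diag(s[:k]) @ vt[:k, :]

def cosine(a, b):
    """Similarity of two images in the latent space."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

In this space, image A from the 'sea + beach' example can end up close to typical sea pictures even though it lacks the raw feature being searched for, because the features it does have co-occur with that feature elsewhere in the collection.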
5.4 Using relevance feedback
Since image retrieval is very user-dependent, relevance feedback (RF) mechanisms are even more important than in textual retrieval. The human perception of an image is subjective and depends on viewing the image. Thus, a user should be able to determine relevance criteria after the search as well as before it, as some relevance criteria only emerge when a user views images. To acknowledge the subjectiveness of the search process, RF should not be applied system-wide, but per user (for example, only within one search process). RF is mostly used to adjust the search query, instead of influencing the search algorithm. However, a search database which incorporates previously given RF can contain more semantic information. He (2004) describes a learning algorithm which takes RF as input to re-rank images. In fact, both short-term and long-term learning should be applied: short-term learning is used for effective retrieval for a specific user at a specific time, while long-term learning improves the overall search algorithm.
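Adjusting the search query with RF is classically done with a Rocchio-style update, moving the query vector toward images the user marked relevant and away from those marked non-relevant. The sketch below applies it per user and per session, as argued above; the alpha, beta and gamma values are conventional defaults, not taken from the cited work.

```python
import numpy as np

def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """One relevance-feedback round in a shared feature space.

    query:       current query vector.
    relevant:    list of feature vectors the user marked relevant.
    nonrelevant: list of feature vectors the user marked non-relevant.
    """
    q = alpha * query
    if relevant:
        q = q + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q
```

Keeping the updated query local to one search session implements the per-user, per-process scope of RF that the subjectiveness argument demands.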
6. CONCLUSION
Overall, a step forward in reducing the semantic gap consists of:
1. Using, in addition to known algorithms which extract color, shape, texture or other features, new algorithms which better incorporate high-level semantics, as CIRES, ALIPR and LSI tend to do;
2. On-the-fly influence on the search process, which can be provided by persistent relevance feedback; and
3. Letting user input decide how algorithms are used, by providing a choice of features to include in the search process and by adjusting the weights of these features.
Figure 10 illustrates this. At the presentation layer of the system, the user side, the user can input adjustments which should be made to the algorithm. The user side translates these to the system side, depicted by the dashed line. Another new input is that of user semantics into the system side. This is, for example, the ALIPR approach: the system side is modelled to reflect the user's semantic behaviour.

Figure 10: New overview

This conclusion is elaborated upon in the next section.

7. DISCUSSION
The effort put into system methods is often not as effective as the promising laboratory results suggest. Much research focuses on complex algorithms and on increasing recall and precision, while the user-oriented semantic perspective does not connect to this. It seems more important to improve the user side of CBIR than to find smarter, more detailed feature extraction algorithms. Several points at the user side can be improved, accompanied by the reduction methods described in Section 5. Apart from the algorithm(s) used, a well-designed user interface is needed so that the user can quickly navigate to example images to make his vague idea more concrete. (The definition of a 'well-designed user interface' is beyond the scope of this article.) One example of a user-friendly interface is the following: the user enters a simple text query which expresses his first idea. Among the returned results, the user looks for relevant images. When an image suits well, the system can search for more similar images. Hereby, semantic-based feature algorithms and RF are connected. It is important to adapt the algorithm used to the goal of the user; an algorithm can be very sophisticated, but when the user is not able to reach his goal, it is useless. For pictures representing vague ideas, ALIPR's automatic annotation can be used to group images which apparently share the same vague idea. For images with distinctive features, LSI can be used to group images which at first sight look different but, when clustered, represent the same concept, thereby providing the user with better search results. Including subjective user characteristics in the search is important to accommodate the subjectiveness of the search process. This can be done by making combination algorithms possible. The key point is that it is not a matter of statically adjusting the algorithms, but of the connection between the algorithm and the user. Relevance feedback can also support this subjectiveness: with RF, the LSI algorithm can be trained to cluster more images, providing more relevant results.

A drawback of the methods which include user feedback, in the form of RF or adjustable weights, is that a lot of feedback is expected from the user, while at the same time the search process has to remain user-friendly. On the other hand, demanding users are probably willing to put some effort into their search process, as long as they get satisfying results. Zhao and Grosky acknowledge the limitations of current CBIR techniques: neither textual annotation nor visual features can capture all of the image contents and semantics, but their combined vector space model is designed to integrate visual features and textual annotation, so further improvement is possible. Another way of improving the engagement of users is to give more explanation about the search process. When users comprehend why the system makes certain choices, they can adjust their search, or give relevance feedback which changes the system's behavior. After all, incorporating user semantics in CBIR seems an ongoing and never-ending process. This area of research is constantly renewing itself due to the application of new algorithms and improving technology. For example, digital cameras are more and more capable of saving relevant context information. Modern cameras can save an image with accompanying settings like the brightness or aperture used, which tell something about the context of the picture taken. This may provide new ways of searching in digital images.

References
Belkhatir, M., Mulhem, P., & Chiaramella, Y. (2004). The outline of an 'intelligent' image retrieval engine. In ICWI (pp. 1228-1251).
Broek, E. van den. (2006). Tekst- en beeldanalyse in een zoekmachine met menselijke trekjes [Text and image analysis in a search engine with human traits]. Informatie Professional, 8(3).
Broek, E. L. van den. (2005). Human-centered content-based image retrieval.
Broek, E. van den, Rikxoort, E. van, & Schouten, T. (2005). Human-centered object-based image retrieval. Lecture Notes in Computer Science (Advances in Pattern Recognition).
Carson, C., Thomas, M., Belongie, S., Hellerstein, J. M., & Malik, J. (1999). Blobworld: A system for region-based image indexing and retrieval. In VISUAL '99: Proceedings of the Third International Conference on Visual Information and Information Systems (pp. 509-516). London, UK: Springer-Verlag.
Choi, Y., & Rasmussen, E. M. (2002). Users' relevance criteria in image retrieval in American history. Information Processing & Management, 38(5), 695-726.
Datta, R., Li, J., & Wang, J. Z. (2005). Content-based image retrieval: Approaches and trends of the new age. In MIR '05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval (pp. 253-262). New York, NY, USA: ACM Press.
Google. (2007). Google image search. http://www.google.com/images.
Han, J., Ngan, K., Li, M., & Zhang, H. (2005, April). A memory learning framework for effective image retrieval. IEEE Transactions on Image Processing, 14(4), 511-524.
He, X. (2004). Incremental semi-supervised subspace learning for image retrieval. In MULTIMEDIA '04: Proceedings of the 12th Annual ACM International Conference on Multimedia (pp. 2-8). New York, NY, USA: ACM Press.
Ingwersen, P., & Jarvelin, K. (2005). The turn: Integration of information seeking and retrieval in context. Dordrecht, The Netherlands: Springer.
Iqbal, Q., & Aggarwal, J. (2002). CIRES: A system for content-based retrieval in digital image libraries.
Iqbal, Q., & Aggarwal, J. (2007). CIRES: Content based image retrieval system. http://amazon.ece.utexas.edu/~qasim/cires.htm.
Li, J., & Wang, J. Z. (2006).
Real-time computerized annotation of pictures. In MULTIMEDIA '06: Proceedings of the 14th Annual ACM International Conference on Multimedia (pp. 911-920). New York, NY, USA: ACM Press.
Li, J., & Wang, J. Z. (2007). ALIPR - automatic image tagging and visual image search. http://alipr.com.
Ma, W. Y., & Manjunath, B. S. (1997). NeTra: A toolbox for navigating large image databases. In ICIP '97: Proceedings of the 1997 International Conference on Image Processing (Vol. 1, p. 568). Washington, DC, USA: IEEE Computer Society.
McDonald, S., Lai, T.-S., & Tait, J. (2001). Evaluating a content based image retrieval system. In SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 232-240). New York, NY, USA: ACM Press.
Muller, H. (2002). User interaction and evaluation in content-based visual information retrieval.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.
Veltkamp, R., & Tanase, M. (2002). Content-based image retrieval systems: A survey.
Vries, A. de, Kazai, G., & Lalmas, M. (2004). Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit.
Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9), 947-963.
Zhao, R., & Grosky, W. I. (2002). Bridging the semantic gap in image retrieval. Hershey, PA, USA: IGI Publishing.