Content-Based Retrieval Using Invariant Features, SelfOrganizing Maps, Concepts and Fuzzy Interval Numbers. C. Skourlas N. Vassilas Department of Informatics, Technological Educational Institute of Athens {cskourlas, nvas }@teiath.gr Abstract In the complex Cross Language Document Retrieval applications dealt in this paper, the goal is to assist the document matching stage when the documents contain images. Hence, we calculate the similarity between a submitted bilingual query and each document in the collection or retrieve images that are similar to a query image, based on features extracted from images. Fuzzy Interval Numbers (FINs) have been employed in various real-world application including numeric and non-numeric data. In this paper, the use of FINs classifier is proposed to handle problems of Cross Language Information Retrieval and documents’ classification. The FIN representation of documents is based on the use of the collection term frequency as the term identifier. Such a representation of documents seems to be suitable for Cross Language Information Retrieval without dictionary. In the recent years, Content-Based Image Retrieval (CBIR) evolved to an important research domain within the context of multimodal information retrieval. To assure improved image retrieval performance, even when the images are at different scales and orientations or corrupted by noise, we propose a set of global and local invariant features. Color quantization of the images using selforganizing maps are also used to lead to memory savings and improve, in some cases, image retrieval accuracy. Keywords: Content Based Image Retrieval, Concept Based Retrieval, Cross Language Information Retrieval, Multimodal Information Retrieval, Self-Organizing Maps, Fuzzy Interval Numbers 1. Introduction In the complex Information Retrieval applications, such as cross language document retrieval, the documents are typically represented using the vector space model. A set of N keywords (index terms) is used to represent each document as an N-dimensional vector with each element representing either the appearance of the corresponding keyword in the document or its relative frequency of occurrence. To correctly translate a keyword from one language to another, besides the use of a dictionary and a thesaurus (for the synonyms), word sense disambiguation techniques that assess the appropriate contextual meaning of the word have to be employed [Alevisos et al., 2007]. The similarity between a submitted query and each document in the collection is usually an inner product based vector matching operation. However, due to a poor selection of keywords, keyword sense ambiguities, inaccurate lexicons or incomplete thesauri, the retrieval accuracy is often rather low. In the image databases, the difficulty to annotate images so that to allow later retrieval that is acceptable by the subjective perception of the various future users makes text-based image retrieval an inappropriate approach [Rui et al., 1999]. It is, therefore, necessary to develop tools for retrieving information based on image content and also improve the cross language text retrieval without dictionaries and thesauri. In the recent years, Content-Based Image Retrieval (CBIR) evolved to an important research domain within the context of multimodal information retrieval [Faloutsos, 1995] and a number of CBIR systems and tools have already been developed [Rui], [Veltkamp, 2002], [Eakins, 2000]. CBIR could then be used to assist the document matching stage in complex multimodal information retrieval applications when the documents contain images. In the case that the documents also contain images, redesigning the similarity measures to include the contribution from CBIR systems could lead to significant improvements in document retrieval accuracy. Fuzzy Interval Numbers (FINs) and lattice algorithms have been employed in various realworld application including numeric and non-numeric data [Kaburlasos, 2006]. A FIN may be interpreted as a conventional Fuzzy Set; additional interpretations for a FIN are possible including a statistical interpretation. Use of FINs classifiers is proposed [Alevisos, 2007] to handle problems of Cross Language Information Retrieval. The FIN representation of documents is based on the use of the collection term frequency as the term identifier and seems to be suitable for Cross Language Information Retrieval without dictionary. Section 2 of this paper illustrates the proposed experimental methodology using a Euro coin database. The overall structure of the coin retrieval system is presented for the original coin collection as well as for the coin collection after vector quantization with Kohonen’s selforganizing maps. In section 3 the features extracted for content-based coin indexing are presented and the experimental results for three test datasets are discussed. In Section 4 a brief introduction to definition and construction of FINs is given. The FINs similarity is defined and applied to the similarity of queries and documents. The evaluation of FIN – techniques is given in section 5. Finally, conclusions and future work are discussed in the last section. 2. The Content Based Image Retrieval The experimental methodology consists of two distinct phases: a) the design phase, whereby we create the database and decide upon its indexing scheme, and b) the retrieval phase, in which, following the presentation of query-images, similar images are retrieved from the database. 2.1 Database Design In the first phase, a coin database is created using the following stages: Specification of a digital coin image collection Preprocessing of each image to detect the position of the coin Feature extraction from coin pixels Feature-based indexing of the database At the first stage, we downloaded coin images from the Internet and compiled a collection of 115 Euro coin faces [Vassilas, 2006]. Among these images, several coin designs, not yet in circulation, were included. All images were in the RGB color space and had a size of approximately 240x240 pixels. Next, each image is preprocessed in order to detect the position of the coin in the image and isolate it from the background. To this end, each image is first converted to gray-scale, then the edges are found using the Sobel operators and finally the Hough transform [Ballard, 1982] is applied on the black-and-white image of edges in order to detect the outer circle of the coin. The circular disk mask, thus created, is then used to isolate the coin from the background (see Fig. 1). a b c d Fig. 1. Preprocessing steps: a) original color image, b) Sobel edge detection on gray-scale image, c) mask from Hough transform, and d) isolated coin. Once the coin images are free from background interference, features are extracted from those pixels that belong exclusively to the coin area. The extracted feature vector from each image is then stored in order to form the index to the database. 2.2 Database Retrieval Retrieval from the coin database is performed by: Query-image presentation to the database system Preprocessing to detect the position of the coin in the query-image Feature extraction and query encoding with its feature vector Specification of similarity measure Computation of query features similarity with those of the database index Retrieval of the most similar database coins The query coin image presented to the system undergoes the same preprocessing, as that for the database coins, to isolate the coin from the background and to extract its corresponding feature vector. The similarity measures employed in this work between two feature vectors are the L1 and L2 distances defined by p 1/p Lp = ( Error!| fi( I ) – fi( J ) | ) for p = 1, 2 where f() represents the feature vector of an image and I, J are the database and query images respectively. The database coins are ranked according to their similarities to the query image and the N most similar coins are retrieved from the database and shown to the user. 2.3 Quantization of Coin Collection Kohonen’s self-organizing maps was the vector quantization method used in this work. The coin collection was quantized in two ways. First, Kohonen’s algorithm was applied to each coin image separately resulting to indexed images with distinct colormaps. Second, the same algorithm was applied to the whole coin collection simulataneously, resulting to indexed images with a common colormap. Fig. 2 shows some original coins (upper row) along with their indexed representations (lower row) for the second quantization case. Also shown to the right of this figure is the coins’ common colormap. Figure 2. Original (upper row) and quantized (bottom row) coins using a common colormap (right). The feature extraction and indexing phases for the database coins are the same as for the previously described methodology. During retrieval, the queries are also quantized with Kohonen’s algorithm resulting in an indexed representation with their own associated colormap. 3. Feature Extraction and experimental results Three different kinds of features were used, namely, color, shape and wavelet features. Color is known to be invariant to scale and orientation but is sensitive to illumination changes. Four moments (mean, standard deviation, skewness and kyrtosis) from each of the hue, saturation and value components of the HSV color space were extracted, for a total of twelve color features. The shape features were extracted from the gradient image which was obtained by first transforming the color image to grayscale and then using the Sobel operators. To achieve some invariance to illumination conditions, the [min, max] range of the gradient image was normalized to [0, 1]. Assuming that scale and orientation changes preserve most of the edges, the normalized polar historgram, i.e. the probability distribution of edge orientations (at 90 o with respect to the edge gradients), not only should it be, adequately, scale and rotation invariant but also can give an estimate of the rotation angle. The latter results from the normalized polar histogram circular correlation between the query image and each one of the database coins. Fig. 3 shows the normalized polar histograms of a database image and a query that is of a different size and orientation with respect to the original image. Figure 3. A database (left) and a query (right) coin along with their polar histograms. Actually, the procedure followed uses the L1 or L2 distance measures instead of the typical inner product correlation in order to determine the angle of rotation and the corresponding polar similarity score. In addition, in order to increase the robustness of the system, we extracted both, edge magnitudes and orientations from five equal-area concentric circular rings. The shape features extracted were: a) the mean, standard deviation, skewness and kyrtosis of the ring edge magnitudes, for a total of 5x4 = 20 features, and b) the normalized polar histograms of each ring for 5 degrees angular bins (36 features/ring), for a total of 180 polar features. Finally, for comparison purposes, we also extracted features from a three level wavelet analysis with the ‘db3’ mother wavelet of the Daubechies family. In particular, we extracted the mean and standard deviation of the wavelet coefficients for each detail image of the wavelet decomposition as well as for the level-3 approximation image, for a total of 10x2 = 20 wavelet features. 3.1 Discussion of the experiments The experiments on coin retrieval have been performed using three sets of queries. The first data set (DS1) contained queries at a lower resolution (approximately of 130x130 pixels) in order to test the effect of scale on the retrieval accuracy. The queries of the second data set (DS2) were at a different scale (70% the size of the original images), had a random rotation in the [-180o, 180o] range and were corrupted with additive zero mean gaussian noise with 0.1 standard deviation. Finally, the queries of the third data set (DS3) were at a random scale (between 50% and 100% of the original scale), at random orientations (in the [-180o, 180o] range) and with a random change of color in order to test scale, rotation and illumination invariances. Tables 1 and 2 show the retrieval accuracy based on color and edge (ring) magnitude features respectively, (for the means, standard deviations or their combination) for the three data sets while Tables 3 and 4 show the retrieval accuracy for the edge (ring) orientations and wavelet features respectively. Since the L1 distance measure gave slightly better results than L2 in most of the experiments, all results shown in the tables assume the L1 measure. Table 1. Retrieval based on color. Feature vector composed of: μ σ μ+σ 4.6 (7.0) 5.7 (5.1) 3.3 (4.9) DS1 6.5 (6.6) 25.6 (29.7) 7.2 (7.8) DS2 DS3 19.1 (19.4) 30.7 (27.3) 18.3 (22.2) Table 2. Retrieval using edge strengths. Feature vector composed of: μ σ μ+σ 10.1 (14.6) 23.9 (28.9) 13.8 (20.7) DS1 DS2 17.6 (19.6) 35.2 (30.6) 22.4 (23.0) DS3 32.3 (28.8) 35.6 (30.8) 30.9 (29.4) Table 3. Retrieval using edge angles. Feature vector: Normalized Polar Histogram 12.1 (17.1) DS1 22.7 (27.6) DS2 29.4 (31.9) DS3 Table 4. Retrieval using wavelet features. Feature vector composed of: μ σ μ+σ 36.3 (28.2) 32.7 (26.7) 36.7 (28.9) DS1 DS2 34.1 (27.9) 38.8 (30.3) 32.8 (28.2) DS3 36.3 (29.3) 31.7 (27.0) 31.3 (27.9) 4. Definition and construction of FINs FIN could be regarded as an abstract “mathematical object” and can have various interpretations and uses. Given a vector (a “population”) x = [x1,x2,,xN] of term frequencies (“measurements”) that are real numbers sorted in ascending order. The dimension of a vector x is denoted by dim(x) e.g. dim([2,-1]) = 2, dim([-3,4,0,-1,7]) = 5. The median(x) of the vector x = [x1,x2,,xN] is defined to be a number such that half of the N numbers x1,x2,,xN are smaller than median(x) and the other half are larger than median(x); for instance, median([x1,x2,x3]) = x2, with x1 < x2 < x3, whereas median([x1,x2,x3,x4]) = (x2 + x3)/2, with x1 < x2 < x3 < x4. A FIN can be computed for the vector x (population) by applying the following CALFIN algorithm (see Kaburlasos, 2006]. Algorithm CALFIN // Calculate FIN Let x be a vector of term frequencies (real numbers). Sort vector x incrementally. Initially vector pts is empty. function calfin(x) { while (dim(x) 1) medi:= median(x) insert medi in vector pts x_left:= elements in vector x less-than number median(x) x_right:= elements in vector x larger-than number median(x) calfin(x_left) calfin(x_right) endwhile } //function calfin(x) Sort vector pts incrementally. Store in vector val, dim(pts)/2 numbers from 0 up to 1 in steps of 2/dim(pts) followed by another dim(pts)/2 numbers from 1 down to 0 in steps of 2/dim(pts). Eventually Algorithm CALFIN computes two vectors, i.e. pts and val, where vector val includes the degrees of (fuzzy) membership of the corresponding real numbers in vector pts. In the figure 4 two queries or documents (vectors) are illustrated (plotted) as two Fuzzy Interval Numbers. Each value on the term axis represents a term (stem). Any “cut” at a given height h(0,1] defines two generalized intervals, denoted by [a’, c’]h, [b’, d’] h. In our case the generalized intervals are positive and intersecting. As we can see below, a bell-shaped mass function is used for the calculations of the distance between two FINs (or the similarity between two queries or documents which are represented by their FINs). Figure 4. Two queries / documents (vectors) illustrated as two Fuzzy Interval Numbers. Each value on the term axis represents a term (stem). A bell-shaped mass function is used for the calculation of the distance (similarity) between the documents. 4.1 Fuzzy Interval Numbers and FINs similarity The concept of the Generalized Intervals was used for introducing a metric into the lattice of the Fuzzy Interval Numbers (FINs) [Kaburlasos 2006]. The interpretation of a generalized interval depends on an application; for instance if a feature (a term) is present in a document it could be indicated by a positive generalized interval [Skourlas 2007]. The area “under” a generalized interval is a real number which could be calculated. A metric distance and an inclusion measure function in the set (lattice) of the generalized intervals Mh can be defined [Kaburlasos, 2006]. 4.2 Calculation of the distance between documents The FIN distance is used instead of the similarity measure between documents: the smaller the distance the more similar the documents. In the case of documents and for the distance calculations a bell-shaped mass function could be selected: (1 )h mh(t)= [ (1 )t ] 2 1 z t 1 1 z maxctf / Figure 4 illustrates how we calculate the distance of two FINs F1 and F2 (representing two documents) using the mass function. The points a, b, c, d are used to define the distance, at height h, dh(F1(h),F2(h)) = dh([a,b]h,[c,d]h); dh(F1(h),F2(h)) equals the sum of areas of the shaded regions. The distance of the two FINs at the height h is given by the sum of areas of the shaded regions and the distance between the two FINs is calculated using the definite integral of the distance at height h from h=0 to 1: 1 b' d' 0 a' c' Distance = (| f (t )dt | | f (t )dt |)dh 5 Examples of successful document retrieval and classification From our previous experimentation [Skourlas, 2007] we knew that if the length of the queries and the documents is greater then the results of the method are better. Sample 1: Coin Images’ annotations Unfortunately, the available annotations for the Euro coin images are rather short e.g. “2-cent denomination: Depicting a Corvette, i.e. a type of ship used during the Greek War of Independence (1821)”. «Kέρμα των 2 λεπτών: Κορβέτα, τύπος πλοίου που χρησιμοποιήθηκε στον Εθνικό Απελευθερωτικό Αγώνα (1821-1827)». Therefore, it was necessary to expand such annotations using relevant texts in order to have successful experimentation. For example, within the framework of the Olympic Games Coin Program (Athens 2004), the Greek Mint was striking series of commemorative coins. In the case of representation of Goddess Athena on such coins we added texts as the following: “Goddess Athena Athena was Zeus' favourite daughter. She was beautiful and brave, yet very wise as well. And it was absolutely natural that she was highly esteemed by the Greeks for it was in her figure that the most important ideals of the ancient Greek spirit, prudence and bravery, were personified. She was the goddess of wisdom, protector of cities who ensured the peaceful life of citizens. At the same time she was the goddess of war and power, the one who would greatly contribute to the final victory in the battlefield. It is no surprise therefore that she is often represented triumphant (Nikephoros), with a small figure of Nike (Victory) in hand. Athena is always depicted in full armament - helmet, spear and shield. Her symbols are the owl and the olive. The myth of Athena's birth is quite impressive …..”. Sample 2: The MED collection. The MED test set is composed of a collection of 1033 documents; a set of 30 queries, and a list of all the collection documents (“qrels”) that are relevant to each of the queries e.g.“qrels14”={23,24,25,26,28,29,454,455,456,457,459,461,463,466,467,468} We selected all the long queries (e.g. 14, 17, 20, 25, …). For example, query 14 is described by the following terms: “renal amyloidosis as a complication of tuberculosis and the effects of steroids on this condition. only the terms kidney diseases and nephrotic syndrome were selected by the requester. prednisone and prednisolone are the only steroids of interest”. The terms of the queries, and the frequencies of the terms (in the MED collection, in the “qrels”) were used in order to extract representative features (terms) for the “qrels” classes (see some of the features in the Table 5). Query Representative features (terms) “qrel14” Amyloidosis, Tuberculosis, Nephrotic, Prdnisone, Prednisolone “qrel25” Diabetes, Insipidus, Chlorothiazide, Sodium Table 5. Representative features for some classes (“qrels”) of the MED collection After that we used successfully various techniques based (or not) on the above statistics (terms frequencies). Table 6 summarizes some interesting experimental results. In this case we used the representative features (terms) to construct one centroid for each class. Then, we calculated the distance of every query from these centroids (see table 6). If the FIN-distance of a query from the centroid of a class is minimal then the query is classified to this class. For the calculations we also used a stop-words file (see [IDOM-server]), and the following parameters for the mass-function: ρ = 0, σ = 1, ν = 2, z = 0.05. It also seems that for the MED collection it is better to use TfIdf instead of simple tf (term frequency). Qrels Q14 Q17 Q20 Q25 Q27 Q29 14 12.7087 24.8059 28.4343 9.13131 71.354 25 17.3849 24.7493 27.8314 6.67153 71.2515 35.5123 35.1403 Table 6 Examples of successful classification Sample 3: Bilingual medical bibliographic records We also focus on various samples using bibliographic descriptions extracted from the bilingual bibliographic databases of the Greek National Documentation Centre (NDC). The rationale for selecting such bibliographic records is the following: Each bibliographic record (text, document) is publicly available in various formats (e.g. text, MARC XML). NDC is the host of various bibliographic (or not) databases that cover different research topics (e.g. medical bibliography, dissertations, financial reports and bibliography). For example, the records of the NDC were searched using the search terms “pediatrics” and “mastectomy”. All the records without abstract were deleted and then all the documents that have a search term in their keywords formed two classes: Pediatrics-class, Mastectomy-class. Then, we supposed that the rest of the documents were our queries. All these queries (documents) were successfully classified using the FIN-based similarity. 6. Conclusions and future work In this work, we developed a content-based coin retrieval system using color, shape and wavelet features with consideration to translation, scale and rotation invariance. A coin collection was used during the design phase to create the index to the database. In the retrieval phase, a weighted similarity measure was used to match the query’s feature vector to those of the index and retrieve the most similar database coins. Several experiments have been performed showing improved retrieval accuracy when using feature combinations. Moreover, contrary to intuition, vector quantization of the coin database using self-organizing maps, showed a significant improvement of system’s performance when the scaled-down and rotated queries were corrupted by additive noise or had color alterations, especially when a common colormap was used. We have also concluded that the FIN-based calculation of the similarity between queries and documents (vectors) is a novel method for solving various problems. We verified in our experiments that if we focus on the documents that are related to different search terms then we can easily apply FIN-based techniques and calculate correctly the distance (similarity) between them. We are currently considering on expanding the present research in respect to distributed processing architectures [13] and in applying the aforementioned techniques in specific use case scenarios such as in medical environments [14]. Acknowledgments This Project is co-funded by the European Social Fund and National Resources – (EPEAEKII)-ARXIMHDHS. References 1. Rui, Y., T.S. Huang and S.-F. Chang, “Image Retrieval: Current Techniques, Promising Directions and Open Issues”, J. of Visual Communication and Image Representation, Vol. 10, pp. 1-23, 1999. 2. Faloutsos, C. and D. Oard, “A Survey of Information Retrieval and Filtering Methods”, Technical Report, Dept. of Computer Science, Univ. of Maryland, College Park, No. CS-TR3514, Aug.1995. 3. Veltkamp, R.C., “Content-Based Image Retrieval Systems: A Survey”, Revision of Tech. Rep. UU-CS-2000-34, Dept. of Computer Science, Utrecht University, 2002. 4. Eakins, J.P. and M.E. Graham, “Content-Based Image Retrieval”, Tech. Rep. JTAP-039, JISC Technology Application Program, Newcastle upon Tyne, 2000. 5. Kaburlasos V. G., Towards a Unified Modeling and Knowledge –Representation based on Lattice Theory - Computational Intelligence and Soft Computing Applications, SpringerVerlag, Summer 2006 6. T. Alevisos, V.G. Kaburlasos, S. Papadakis, C. Skourlas, P. Belsis, Fuzzy Interval Number Techniques for Multilingual and cross Language Information Retrieval. International Conference on Enterprise Information Systems ICEIS 2007, Madeira Portugal 7. Ballard, D.H. and C.M. Brown, Computer Vision, Prentice Hall, Englewood Cliffs, N.J., 1982. 8. Kohonen, T., Self-Organizing Maps, Springer, Heidelberg, 1995. 9. Vassilas N., Skourlas C., Content-based coin retrieval using invariant features and self organizing maps, ICANN06, Athens, 2006 10. Skourlas C, Alevizos T, Belsis P, Fragos K., V. G Kaburlasos, Papadakis S. Fuzzy Interval Numbers (FINs) techniques and its applications in Natural Language Queries Processing and documents classification, BCI 2007, accepted 11. Glasgow IDOM server–IRCollections, http://ir.dcs.gla.ac.uk/resources/test_collections// 12. Greek National Documentation Centre: www.ekt.gr 13. P. Belsis, Challenges and Potential Solutions for Secure and Efficient Knowledge Leveraging in Coalitions, eJETA: The electronic Journal for E-Commerce Tools & Applications , Vo. 2, No. 1, 2006 14. Gritzalis S., Belsis P., Katsikas S. “Interconnecting autonomous medical domains: a security perspective” IEEE Engineering in Medicine and Biology Magazine (IEEE EMBS) Theme Issue on: "Image, Signal and Distributed Data Processing for Networked eHealth Applications" (accepted)