Combining Fast Search and Learning for Scalable Similarity Search
by
Hooman Vassef
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degrees of
Bachelor of Science in Computer Science and Engineering
and
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2000
© Hooman Vassef, MM. All rights reserved.
The author hereby grants to MIT permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part.
Author ..........................................................
       Department of Electrical Engineering and Computer Science
       May 22, 2000

Certified by ....................................................
       Tommi Jaakkola
       Assistant Professor
       Thesis Supervisor

Accepted by .....................................................
       Arthur C. Smith
       Chairman, Department Committee on Graduate Students
Combining Fast Search and Learning for Scalable Similarity Search
by
Hooman Vassef
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 2000, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science
Abstract
Current techniques for Feature-Based Image Retrieval make no provision for efficient indexing or fast learning steps, and are thus not scalable. I propose a new scalable simultaneous learning and indexing technique for efficient content-based retrieval of images that can be described by high-dimensional feature vectors. The scheme combines an efficient nearest neighbors search algorithm and a relevance feedback learning algorithm, which refines the raw feature space to the specific subjective needs of each new application, around a commonly shared compact indexing structure based on recursive clustering.
After a detailed analysis of the problem and a review of the current literature, the design rationale is given, followed by the detailed design. Results show that this scheme improves search accuracy while remaining scalable. Finally, directions for future work towards a fully operational database engine are given.
Thesis Supervisor: Tommi Jaakkola
Title: Assistant Professor
Acknowledgments
This work was conducted at the I.B.M. T. J. Watson Research Center, and was funded in
part by NASA/CAN contract no. NCC5-305.
Contents

1 Introduction: Towards Scalable and Reliable Similarity Search
  1.1 Objectives
  1.2 Similarity Search: The Basic Approach
  1.3 Towards Reliable and Scalable Similarity Search
      1.3.1 Improving Search Time: The Need for Efficient Indexing
      1.3.2 Improving Accuracy: The Need for a Learning Scheme
      1.3.3 Combining Efficient Indexing, Search and Learning
  1.4 Proposed Approach

2 Design Rationale: Combining Indexing and Learning
  2.1 Search Space and Metric
      2.1.1 Objective and Subjective Information Spaces
      2.1.2 Objective and Subjective Information Discriminated
      2.1.3 Objective and Subjective Information Mixed
  2.2 Designing a Dynamic Search Index
      2.2.1 The Need for Clustering
      2.2.2 The Need for Hierarchy
      2.2.3 The Need for Flexibility
  2.3 Designing a Scalable Learning Algorithm
      2.3.1 Insights from the S-STIR Algorithm
      2.3.2 Reshaping the Result Set with Attraction/Repulsion
      2.3.3 Generalization Steps
  2.4 Fast Nearest Neighbors Search
      2.4.1 Insights from the RCSVD Algorithm
      2.4.2 Adapting the Search Method to a Dynamic Index

3 Methods
  3.1 The Database Index
      3.1.1 Structure of the Index
      3.1.2 Building the Index
      3.1.3 Dynamics of the Index
      3.1.4 Auxiliary Exact Search Index
  3.2 The Leash Learning Algorithm
      3.2.1 Adapting the Result Set
      3.2.2 Generalization
  3.3 Searching for the Nearest Neighbors
      3.3.1 Exact Search for the Target
      3.3.2 Collecting the Nearest Neighbors

4 Results
  4.1 Class Precision
      4.1.1 Experimental Setup
      4.1.2 Motivation
      4.1.3 Test Results
  4.2 Structure Preservation
      4.2.1 Experimental Setup
      4.2.2 Motivation
      4.2.3 Test Results
  4.3 Objective Feedback
      4.3.1 Experimental Setup
      4.3.2 Motivation
      4.3.3 Test Results

5 Discussion
  5.1 Contributions
  5.2 Future Research
      5.2.1 Improving Learning Precision

List of Figures

2-1 Learning Illustration: Reshaping the Result Set
3-1 f_a and f_b, transition functions (here a, b = 2)
3-2 Learning Illustration: Generalization
3-3 Learning Illustration: Generalization Variant
4-1 Class Precision of the Leash Algorithm
4-2 Structure Preservation Results for the Leash Algorithm
4-3 Class Precision with Objective Feedback
4-4 Structure Preservation with Objective Feedback
Chapter 1
Introduction: Towards Scalable and Reliable Similarity Search
1.1 Objectives
The continuous increase in size, availability and use of multimedia databases has brought
a demand for methods to query elements by their multimedia content. Examples of such
databases include: a digital image archive scanned from geological core samples obtained
during an oil/gas exploration process, where search for a particular rock type would help
identify its strata; an archive of satellite images, where search by content would be used to
identify regions of the same land cover (a particular type of terrain) or land use (a populated
area, a particular kind of crop, etc.).
One approach would be to explicitly label archived elements and query them by providing a description, but the more practical and popular approach is to query by example.
In other words, content-based retrieval can be done by performing a search in the database
for the elements which are the most similar to a query sample.
While numerous such search methods have been proposed in the recent literature, the
specific purpose of this research was to design a similarity search engine which is reliable,
yet scalable to arbitrarily large multimedia databases.
1.2 Similarity Search: The Basic Approach
Most of the recent methods for retrieving images and videos by their content from large
archives use feature descriptors to index the visual information. That is, digital processing
methods are used to compute a fixed set of features from each archived element (image,
video, or other), and the element is then represented by a high-dimensional vector combining
these features in the search index. Search by example then consists of a similarity search
on the vector space, with the feature vector corresponding to the query sample provided by the user as the target.
In the simplest approach, this is done by performing a nearest neighbors search on the
space of the feature vectors used to represent the database elements in the index. Examples
of such systems include the IBM Query by Image Content (QBIC) system [8], the Virage
visual information retrieval system [1], the MIT Photobook [14], the Alexandria project at
UCSB [12, 2], and the IBM/NASA Satellite Image Retrieval System [16]. In these systems,
the comparison between the query and target feature vectors stored in the database is
typically based upon a simple fixed and objective metric, such as the Euclidean distance or
the quadratic distance [16], in order to reduce computational costs.
1.3 Towards Reliable and Scalable Similarity Search
The basic approach can be optimized towards two logical goals: making it scalable, and
making it more reliable. Scalability is attained by limiting the individual query processing
time and the size of the database index. Reliability is attained by improving the search
accuracy. These two research directions have thus far been explored mostly in an orthogonal
fashion.
1.3.1 Improving Search Time: The Need for Efficient Indexing
Naive Nearest Neighbors Search
Improving the query processing time inevitably requires a specialized database index, even
for a simple nearest neighbors search. Indeed, the naive approach, which consists of going through all the elements to collect the k nearest samples to the target sample, where k is the search number, requires touching every sample in the database, which is impractical for very large databases. The desired query processing time, roughly equivalent in complexity to the number of touched samples, is sub-linear.
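To make the baseline concrete, the following sketch (Python with NumPy; the function name is ours, not from the thesis) shows the naive linear scan that an efficient index is meant to replace:

    import numpy as np

    def naive_knn(samples: np.ndarray, target: np.ndarray, k: int) -> np.ndarray:
        # Touches every sample: O(n*d) distance computations per query,
        # which is exactly the cost a sub-linear index must avoid.
        dists = np.linalg.norm(samples - target, axis=1)
        return np.argsort(dists)[:k]  # indices of the k nearest samples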
Common High-Dimensional Indexing Methods are Inadequate
Indexing methods such as B-Trees, R-Trees or X-Trees [3] have been proposed to optimize
lookup time for one particular sample vector. However, their performance for nearest neighbors search, while satisfactory in low-dimensional spaces, is poor in high-dimensional vector
spaces, due to the "curse of dimensionality". Indeed, in a high-dimensional space, individual samples are all in relative proximity to each other, and a search of those indexes therefore ends up touching most of the samples in the database by exploring across all dimensions.
Fast Nearest Neighbors Search with RCSVD
The RCSVD index, proposed by Thomasian et al. [17], is designed specifically as a compact
index meant to improve nearest neighbors search to a sub-linear time complexity. RCSVD
builds its index by recursively combining clustering, which reduces the search range, with singular value decomposition, which reduces the dimensionality of the clusters and thus facilitates the nearest neighbors search within a given cluster.
1.3.2 Improving Accuracy: The Need for a Learning Scheme
Class Precision
The tasks performed on those database examples given earlier (Section 1.1) require the class
precision property: in any similarity search, the result set samples should all belong to the
same class (e.g. rock type, or terrain) as the query sample. This issue is trivial if the
feature vector space corresponding to the database discriminates classes naturally by its
topography, i.e. is such that any similarity search query will return only samples of the
same class (provided the query number is less than or equal to the number of samples of
that class).
However, a straight nearest neighbors search based on Euclidean distance, which is the
query method of choice for its computational simplicity, is unfortunately very unlikely to
yield class precision for several reasons.
First, no feature extractor is perfect, as in essence they all discard a significant amount
of information. In particular, if the database images are complex such that they require a
cognitive process and a certain knowledge base to interpret, it is not likely that any feature
extractor will discriminate samples with a good enough approximation of human perceptual
distance. Second, as mentioned in Section 1.3.1, in a high-dimensional space all samples
are in relative proximity to each other, i.e. distances are not very discriminative. Thus,
enhancing the feature extractors themselves to capture generally more relevant information
may provide only little improvement in search accuracy, if it implies adding noisy feature
dimensions to the vectors. Empirical results show indeed that a nearest neighbors search
does not discriminate classes well on most high-dimensional feature vector spaces.
Many recent approaches focus instead on developing some learning algorithm which uses
subjective information provided by the user, to refine some aspect of the search, either the
query itself, or for example the vector comparison metric. This refinement is meant to
overcome the disparity between human perceptual distance and the objective metric used
to compare feature vectors, and to induce class precision.
Learning as Implicit Classification
The intended applications would then greatly benefit from a form of classification of the
database samples, which would allow the similarity search to be restricted to one particular
class. For example, the user may have some explicit high-level knowledge of the features
used to index the archive, or similarly s/he might provide explicit subjective labels of which
the database might take advantage for classification.
However, as indicated earlier in Section 1.1, in this work I consider only search by
example, specifically in instances where no explicit cognitive knowledge is shared by the
database and the user. This constraint is often desirable for several reasons, notably the
difficulty for automatic generalization by the computer, and the impracticality of entering
such labels individually by the user. Furthermore, explicit classification would be done by
manually labeling all the samples, which is unrealistic in terms of manpower for very large
databases.
The need for a learning scheme is therefore apparent as the remaining alternative. Using
information provided by the user, a learning scheme can either modify the search method, or
create a new search space reflecting both that new information and the original information
of the vector space. Once again, because a nearest neighbors search is the desirable search
method for optimization, I will choose the latter scheme.
Learning from Relevance Feedback
Although the user and the database share no explicit knowledge base, implicit subjective
information can be provided by the user in the form of relevance feedback. This consists
usually of an iterative refinement process in which the user indicates the relevance or irrelevance of retrieved items in each individual query, where relevance is equivalent to class
precision.
Previously studied extensively in textual retrieval, an approach towards applying this
technique to image databases has been proposed recently by Minka and Picard [13]. In
this approach, the system forms disjunctions between initial image feature grouping (both
intra- and inter-image features) according to both positive and negative feedback given
by the users. A consistent grouping is found when the features are located within the
positive examples. A different approach was later proposed by one of the authors and became PicHunter [6, 5]. In PicHunter, the history of user selection is used to construct the system's
estimate of the user's goal image. A Bayesian learning method, based on a probabilistic
model of the user behavior, is combined with user selection to estimate the probability of
each image in the database. Instead of revising the queries, PicHunter tries to refine the
answers in reply to user feedback. Alternatively, an approach has also been proposed to
learn both feature relevance and similarity measure simultaneously [4]. In this approach,
the local feature relevance is computed from a least-square estimate, and a connectionist
reinforcement learning approach has been adopted to iteratively refine the weights. A
heuristic approach for the interactive revision of the weights for the components of the
feature vector has been proposed by Rui et al. [15]. An algorithm, S-STIR, based on
nonlinear multidimensional scaling of the feature space has been proposed by Li et al. [10].
1.3.3 Combining Efficient Indexing, Search and Learning
While all of these methods definitely provide some improvement in the accuracy of the
search, none of them take into account the need for efficient updates to an index designed
for fast search. Nor do they optimize the per-query learning time, which counts as part of
the query processing time, because the user may provide feedback after each query.
On the other hand, the RCSVD index relies on information which is valid only in a static
search space, thus inhibiting all incremental refinement of the index by relevance feedback.
Consequently, RCSVD is restricted to an objective nearest neighbors search and performs
with poor subjective accuracy in applications where class precision is important.
In summary, none of the previous methods are directly suitable to building a system
which is both scalable and reliable. The challenge is to build an index designed to both allow
for fast search, and be flexible to modifications by a relevance feedback learning algorithm.
Furthermore, a small (sub-linear) time complexity for each query is equally important for
the learning algorithm as it is for the search algorithm.
1.4 Proposed Approach
In this paper, I propose a new algorithm that combines a learning process similar to the
S-STIR method in that it refines the feature space based on user feedback, and an efficient high-dimensional indexing technique inspired by RCSVD, to address the issue of
simultaneously efficient and reliable search and learning. Specifically, the proposed method
approximates the idea of S-STIR's non-linear transformation of the whole feature space,
by a series of linear transformations which affect a significant fraction of the database, yet
have a sub-linear time complexity in terms of the database size, and can thus affordably
be applied after each query and make the system scalable. This learning algorithm directly
affects the index structure, which is created by recursive clustering, and directly shared
with the search algorithm. Cumulated learning steps will make the vector space converge
to a configuration where the search space for each query can be substantially pruned.
Chapter 2 is a preliminary statement of the goals and previous work leading directly to
our proposed solution, which itself is sketched in Chapter 3. The results in Chapter 4 show
the current level of reliability obtained with the system, and Chapter 5 concludes with a
discussion of the current results and of future research directions.
Chapter 2
Design Rationale: Combining Indexing and Learning
The purpose of this chapter is to walk through and justify each step of the design rationale
for the proposed approach. Once again, the primary objective of this research work is to
design an image search engine capable of learning from user interactions, and scalable to
large databases.
2.1 Search Space and Metric
The first step in designing a search engine is to decide on the search space and metric.
2.1.1 Objective and Subjective Information Spaces
The common approach of many learning algorithms is to maintain a distinction between
the objective information, namely the original space of feature vectors, and subjective information provided by the user. The subjective information space could possibly consist of an
unbounded record of user feedback, but that poses problems in terms of index size growth.
It will preferably remain within a fixed-size representation, where the past information is
weighted down to allow the accumulation of new feedback. While this effectively discards
some of the resolution from previous feedback information, it is usually designed such that
the overall cumulated subjective information will ideally converge towards a stable and
accurate representation of the user(s)'s knowledge.
2.1.2 Objective and Subjective Information Discriminated
Given the two information spaces, one option is to design the search metric to reflect
information from both spaces at the same time. However, this makes the search algorithm
harder to implement and scale.
The other option is to map the two spaces onto a new vector space, on which a straight
nearest neighbors search can be performed. This is notably the case for the S-STIR algorithm (Similarity Search Through Iterative Refinement) [10]. Specifically, S-STIR proposes
a learning method where the user information is combined with the original vector space using non-linear multidimensional scaling. This is done by first clustering the original feature
space, then mapping it to a new high-dimensional space where each dimension represents a
sample's proximity to the centroid of one given cluster. This space is then scaled by a matrix
W which is effectively the subjective information space, because the "iterative refinement"
process consists of adjusting the coefficients of W by relevance feedback. Specifically, W is
chosen such that in the new vector spaces, the structure of each class is preserved, while
the distances between elements within that class are minimized. This method ideally converges until the clusters reflect the subjective classes, in a space where they can be easily
discriminated.
Unfortunately, while a simple nearest neighbors search can be used to perform queries,
the difficulty resides now in designing, for that new vector space, a versatile search index
which would easily adapt to changes in W.
2.1.3 Objective and Subjective Information Mixed
Alternatively, one could accumulate the subjective information within the objective space,
modifying the original vectors directly to reflect the user's feedback. This has the drawback
that it destroys the original feature description of each sample, and results in a partial and
irreversible loss of the objective information. However, if this method can still be made
to converge to a stable and accurate representation of the user's knowledge (assuming the
user's feedback maintains a certain coherence), the information loss is inconsequential.
I chose this approach, for it has the advantage of simplifying the search and indexing
problem, because the search space is then the same as the original space, and the metric
can be as simple as Euclidean distance or quadratic distance.
2.2 Designing a Dynamic Search Index
The database index in this scalable system needs to fulfill two roles: it must provide the
necessary structure and landmarks for a fast nearest neighbors search algorithm, and all
the while allow a dynamic and quick modification to the whole indexed vector space to
reflect the information provided by the user feedback. This section steps through the logical requirements of such an index.
2.2.1 The Need for Clustering
An index for similarity search must keep similar elements closely connected. The estimate
of similarity between two elements, in any given state of knowledge of the database, is
reflected in the correlations between their feature vectors. Separating elements in clusters
of correlated features, as done in RCSVD, reflects the physical layout of the samples in
the vector space and consequently allows the search space to be greatly pruned. Furthermore,
regardless of the specifics of the learning algorithm, it should intuitively generalize in some
way the feedback information pertinent to one sample onto other samples in the same
cluster, as they are correlated.
2.2.2 The Need for Hierarchy
An unbounded and unorganized set of clusters does not facilitate search or learning. In a
tree-like hierarchy of clusters, delegation is possible, a prerequisite for sub-linear search and
learning time complexity. Indeed, in a hierarchy, each learning step, being always focused
around one target sample, can have an impact progressing from the local environment to
the global environment, following the hierarchy up along one branch. Similarly, a search
starting from the bottom of the hierarchy will remain along one branch, if the search space
is adequately pruned.
2.2.3 The Need for Flexibility
While a search index is usually meant to be built with only a one-time cost and make
every subsequent query efficient, this index needs to be dynamically modified to reflect
the changing state of subjective knowledge of the database. These modifications, while
possibly impacting the whole database, need to be achieved with little cost for each query,
in order to keep the efficiency. At the top of the hierarchy, a modification should have a
global effect, while at the bottom it should only have a local effect - yet both need to be achieved in constant time if we are to attain the logarithmic search and learning times we seek. This
requires each node of the index to have some sort of control handle over all its subnodes,
such that it can cause modifications to all of them in a single operation. Furthermore, the
initial distribution of samples in the clusters might prove to be an inaccurate subdivision of
the space, as this space gets modified by the learning process. Thus, this might require a
dynamic transfer of samples and tree nodes across the index, possibly at each query. This
process of restructuring the index must therefore also be achieved quickly at each query.
2.3 Designing a Scalable Learning Algorithm
This Section provides a high-level overview of how the learning process reshapes the vector
space to reflect the user's feedback, by applying criteria learned from the S-STIR algorithm
[10].
2.3.1 Insights from the S-STIR Algorithm
The S-STIR algorithm, described in Section 2.1.2, is not suitable as a scalable learning
scheme, principally because its per query learning step is super-linear in terms of the
database size, and furthermore it provides no support for an efficient index. However,
it uses two key concepts in the design of a learning algorithm which modifies the search
space:
Intra-Class Distance Reduction
Type inclusion can be achieved if the distances between elements of the same class are
reduced, making them all relatively closer to each other and farther from other classes.
Intra-Class Structure Preservation
Within one given class, samples should maintain a topography similar to that of the original
vector space. Otherwise, the original objective information is completely lost and the relative position and similarity measure of samples in the vector space may become nonsensical.
2.3.2 Reshaping the Result Set with Attraction/Repulsion
The scheme proposed to reshape the feature space consists of intuitive concepts: samples of
the same type attract each other, samples of different types repulse each other. Intuitively
again, similar samples which are already quite close together need not attract each other
any closer. Similarly, different samples which are already quite far apart need not repulse each other any further. This process is illustrated in Figure 2-1.

[Figure 2-1: Learning Illustration: Reshaping the Result Set. Legend: T = target sample; R = relevant sample; I = irrelevant sample.]
The learning algorithm must thus take the form of geometrical transformations which
reflect these intuitive behaviors. These transformations would modify the samples of a
query result set with respect to the target sample of the query, according to the feedback
given by the user. This process should reduce intra-class distances and relatively increase the inter-class gaps. Only the samples in the result set would be touched, generally an insignificant number with respect to the database size.
2.3.3 Generalization Steps
The process just described affects only the samples in the result set. By itself, it is useless,
affecting only a very small portion of a large database, and even potentially detrimental,
breaking the local topography and thus violating the structure preservation criterion.
Ideally, the local space of samples around the query result set should smoothly follow the
movements of the samples in that set, a transformation which could be achieved nicely
using for example topographical maps. However, topographical maps are computationally
expensive and are not easy to index.
Instead, I propose a scheme using the hierarchy of clusters in the index to approximate
such a transformation. The high-level idea (illustrated in Chapter 3, Figure 3-2), is the
following: if s is a moving sample, then any cluster along the hierarchy to which s belongs
should move in the same direction as s, by a fraction of the displacement of s. This fraction is
inversely proportional to the cluster's weight. This scheme is meant to implement structure
preservation, and generalize the impact of one query to the rest of the database, by following
a simple directive: if the sample s is moving, then the samples nearby should follow it a bit
to preserve the smoothness of the space, and the samples nearby are likely to be those in
the same cluster as s.
In the index, the information is relative across the hierarchy (see Section 2.2.3): specifically, this implies that vector coordinates are expressed with respect to the coordinates of the cluster immediately above. Thus, translating all the samples in a cluster can simply be done by changing the cluster coordinates. Consequently, this generalization step can be achieved by performing only one action at each level of the hierarchy when moving up an index branch, making the process sub-linear as desired.
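As a minimal illustration of this relative-coordinate scheme (class and field names are hypothetical, not from the thesis), one assignment to a cluster's coordinates translates every sample beneath it:

    import numpy as np

    class RelNode:
        """A node whose coordinates are relative to its parent's."""
        def __init__(self, offset, parent=None):
            self.offset = np.asarray(offset, dtype=float)
            self.parent = parent

        def absolute(self):
            # Absolute position = sum of relative offsets up the branch.
            base = self.parent.absolute() if self.parent else 0.0
            return base + self.offset

    cluster = RelNode([1.0, 2.0])
    sample = RelNode([0.5, -0.5], parent=cluster)
    cluster.offset += np.array([3.0, 0.0])  # one operation...
    print(sample.absolute())                # ...has moved the sample too: [4.5 1.5]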
2.4 Fast Nearest Neighbors Search
The proposed index replicates many features of the RCSVD index which make the RCSVD
search method scalable, and which we will presently describe. However, due to the dynamic nature of the new index, some features had to be discarded, and others re-adjusted
accordingly, the details of which are described in the second part of this section.
2.4.1 Insights from the RCSVD Algorithm
Briefly, the RCSVD indexing method consists of creating a recursive clustering hierarchy,
which is emulated in the present design. Additionally, RCSVD uses Singular Value Decomposition to find the directions of least variance for the samples in a particular cluster, and
discards them while preserving a reasonable amount of the total variance. Therefore, when
those features are discarded, the new "flattened" cluster is still a good approximation of the
original. When the dimensions are reduced to a sufficiently low number, the implementation
of a compact index using the Kim-Park method [9] is possible. This within-cluster index
allows a very fast nearest neighbors search method.
2.4.2 Adapting the Search Method to a Dynamic Index
The Kim-Park method relies on the static nature of sample vectors, and can therefore not
be used in the new index. Furthermore, dimensionality reduction is also undesirable in
a space where samples change constantly. These losses are compensated in scalability by
pushing the hierarchy further down to include fewer samples in the leaf clusters of the tree.
However, singular value decomposition will still be applied to keep tight hyper-rectangular
boundaries for the clusters, in order to prune the search space.
Chapter 3
Methods
This chapter presents the details of the proposed approach for a scalable and reliable
content-based retrieval engine. It is organized in three main sections, each describing one of
the novel designs combined by this approach: a compact indexing structure, made to reflect
the topographical organization of the vector space, and made flexible to facilitate changes
to said vector space; a learning algorithm made to accurately translate the user's feedback
on a query into changes in the vector space, and to perform those changes efficiently on the
index, i.e. in sublinear time; and finally, a nearest neighbors search algorithm which makes
proper use of the index structure so as to perform a query also in sublinear time.
3.1 The Database Index
The database index is a versatile structure: its purpose is to accommodate both nearest
neighbors queries and changes by the learning algorithm in sublinear time. First, the
structure of the index must be designed to prune the search space as much as possible for
nearest neighbors queries. Second, a process has to be defined by which such an index is
constructed on a database of raw feature vectors. Last, the dynamics of the index must
be established to provide the learning algorithm with methods for adapting, in sublinear
time, the vector space to the user's feedback, so as to improve the search accuracy without
however decreasing search speed.
3.1.1 Structure of the Index
A Tree of Clusters
In order to provide quick access to all the database samples, they are indexed by a reasonably balanced tree hierarchy, as stated previously in Section 2.2.2. At each tree node, the subdivision of the database samples in the index must reflect their spatial distribution.
Indeed, spatial locality is the basis of similarity search, therefore the search space can only
be pruned if said spatial locality corresponds to locality in the index. Hence, the structure
of the index is a hierarchy of clusters, much like the RCSVD index [17].
Structure of a Cluster
Each node of the index tree is a cluster. The leaf nodes of the tree (leaf clusters) directly index a set of samples, while the internal node clusters (superclusters) index a set of
subclusters.
The versatility of the index originates in the versatility of its cluster nodes. Aside from
the standard functions of a tree node, which are to hold index pointers to all its subnodes
and a back-pointer to its parent, each cluster serves two main purposes. On one hand, as a
delegate and container for all its subnodes, a cluster node captures certain useful collective
information about all its subnodes, which can be checked first, and potentially avoid having
to check more detailed information in its whole subtree, if the latter is found unnecessary.
On the other hand, as the root of a whole subtree, a cluster node exercises certain control
over its contents, which potentially avoids having to make individual modifications to each element of the subtree, if modifications to this node are enough.
Standard Tree Node Contents As a standard tree node, the cluster contains a pointer
to its parent as well as pointers to its children. For a leaf cluster, children consist of a set
of samples. For a supercluster, children are a set of subclusters.
In both cases, there exists a motivation to create a within-cluster index structure to allow
efficient access and sorting of the set of children. An example is the Kim-Park index used
inside RCSVD clusters [9, 17]. However, such an index is difficult to maintain here, because
the information on which it would rely is subject to changes by the learning algorithm
(in the case of the Kim-Park method, the relative location of the children). Instead, the
set of children is simply indexed as a linear array, for which the access and sorting costs
can remain low so long as the branching factor and the maximum leaf size are kept small
enough.
Collective Information on the Cluster Contents Besides the evident centroid vector and cluster weight (total number of samples in this cluster), a cluster keeps track of
information pertaining to its boundaries, and the variance of its elements.
Boundaries allow the search algorithm to check whether the cluster as a whole is beyond
the search radius, before trying to search its contents. Thus, these boundaries should be
as tight as possible, so as to effectively prune the search space. For example, a spherical
bound is inadequate given the amount of overlap it would cause with neighboring clusters,
especially in a high-dimensional space. Instead, I use hyper-rectangular boundaries. While
ellipsoid boundaries would be generally tighter, hyper-rectangular boundaries offer a good
tradeoff between tightness and computational simplicity for a bounds check. They are
recorded in the cluster by an orthonormal basis for the rectangle's directions, and a pair of
boundary distance values from the cluster's centroid, in each direction of the basis.
The variance of elements in a cluster offers a representation of the cluster's context.
This makes it possible to consider a cluster element relative to its context, and proves useful in the
learning algorithm, as shown later (Section 3.2.1), as well as in index maintenance checks,
to see if a particular sample or subcluster fits better in the context of another cluster. The
information recorded is the variance in each direction of the cluster's orthonormal basis
(from which the variance in any direction can be easily calculated).
Collective Control on the Cluster Contents As later discussed in Section 3.2.2, one
factor which allows the learning algorithm to run in sublinear time on this index is the
ability to translate all the elements contained in a cluster in one single operation, at any
level of the hierarchy. This element of control is given to the cluster over its subtree by the
cluster's reference point. All vectors in a cluster and its children use this relative reference
point as their origin. Thus, their absolute position can be translated, simply by translating
the cluster's reference point with respect to the cluster's parent.
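Collecting the elements of this section, a cluster node could be laid out as follows (a sketch with hypothetical field names, not code from the thesis):

    from dataclasses import dataclass, field
    from typing import Optional
    import numpy as np

    @dataclass
    class Cluster:
        parent: Optional["Cluster"] = None
        children: list = field(default_factory=list)   # samples (leaf) or subclusters
        is_leaf: bool = True
        ref_point: Optional[np.ndarray] = None  # origin of contained vectors, relative to parent
        centroid: Optional[np.ndarray] = None   # relative to ref_point
        weight: int = 0                         # total number of samples in the subtree
        basis: Optional[np.ndarray] = None      # rows: singular vectors (bounding directions)
        bounds: Optional[np.ndarray] = None     # boundary distances from centroid, per direction
        variances: Optional[np.ndarray] = None  # variance of the contents along each direction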
3.1.2 Building the index
Clustering
A hierarchy of clusters is obtained by recursively clustering the vector space. A number
of clustering algorithms have been proposed in the scientific literature, with varying time
complexities and degrees of optimality in the subdivision of space. Here, K-means, LBG
[11], TSVQ, and SOM [7] were considered. Currently the system uses either K-means or
TSVQ, for their time efficiency.
The number of clusters at each level is the tree's branching factor b. The clustering algorithm is applied recursively until a leaf node reaches a size less than or equal to a maximum leaf size l. Both b and l should again remain relatively small to keep within-cluster indexing simple. Using K-means, b can be specified to remain within two values b_min and b_max. The algorithm is run with an initial b = b_max, and run again if it converges with no more than b_max - b_min empty clusters. Also, l should be chosen accordingly to give K-means a good probability of convergence.
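A sketch of the recursive construction (using scikit-learn's KMeans as the clustering step; the SVD bookkeeping of the next subsection is omitted, and all names are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_index(samples: np.ndarray, b: int = 4, max_leaf: int = 32) -> dict:
        node = {"centroid": samples.mean(axis=0), "weight": len(samples)}
        if len(samples) <= max_leaf:
            node["samples"] = samples  # leaf cluster: index the samples directly
            return node
        labels = KMeans(n_clusters=b, n_init=5).fit_predict(samples)
        node["children"] = [build_index(samples[labels == i], b, max_leaf)
                            for i in range(b) if np.any(labels == i)]  # drop empty clusters
        return node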
SVD Boundaries
While it is a difficult operation to find the vector directions for the tightest hyper-rectangular
boundaries for a cluster, a good approximation is to compute the Singular Value Decomposition (SVD) of the set of samples in a cluster, and use the singular vectors as directions
for the rectangular boundaries. Furthermore, the singular values provide the variance information for the cluster.
For a leaf cluster, the scalar boundary values for each of those vectors are determined
as the maxima of the projections of all the samples on each singular vector direction. For
a supercluster, the furthest vertex of each child's bounding rectangle is determined in each
singular vector direction, and the scalar boundary value for that vector is thus calculated.
Additionally, a bit vector v_sign ∈ {+1, -1}^d (where d is the number of dimensions) is recorded for each child and each singular vector direction, indicating which combination of the child's singular vectors forms the furthest vertex in that particular direction. Indeed, as long as the child's singular vectors (i.e., boundary directions) remain unchanged, the furthest vertex in a particular direction stays the same.
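For a leaf cluster, this computation can be sketched as follows (NumPy; hypothetical names): the right singular vectors give the rectangle's directions, the projections give the scalar bounds, and the singular values give the variances.

    import numpy as np

    def svd_bounds(samples: np.ndarray):
        centroid = samples.mean(axis=0)
        centered = samples - centroid
        _, sing_vals, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt.T                       # sample coordinates in the SVD basis
        lo, hi = proj.min(axis=0), proj.max(axis=0)  # boundary values per direction
        variances = sing_vals**2 / len(samples)      # variance along each singular vector
        return centroid, vt, lo, hi, variances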
3.1.3 Dynamics of the index
As suggested in Section 2.3, the learning algorithm intends to modify sample vectors in the
index. In fact, in addition to displacing some sample vectors individually, it will displace
entire clusters by their reference points (see Section 3.1.1). These changes are bound to
disrupt the established structure of the index, and this structure must be brought back to
a coherent state by a series of updates. The different kinds of updates needed are described
in this Section, and define the dynamics of the index.
Boundaries Updates
When any sample is displaced, the boundaries of its leaf cluster must be checked to see
if they must be accordingly extended or contracted. This operation is simple for a single
sample vector: the vector is projected on each of the cluster's singular vectors and checked
against the cluster's boundary values.
This update must also be propagated up to the cluster's parents. For each singular vector direction of a supercluster, the furthest vertex of a modified child cluster's bounding rectangle can be quickly found using the sign vector v_sign previously recorded for each child and each singular vector direction (see Section 3.1.2). The child's singular vectors are scaled by the respective elements of v_sign, then added together. The resulting vertex is then checked to see if it exceeds the corresponding bound.
This operation is time-logarithmic in terms of database size, and could be expensive
if repeated often. However, as shown later, the vectors modified by one iteration of the
learning algorithm are all along one branch of the tree, so the boundary updates can be
done collectively along that branch.
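The propagated check can be sketched as follows (hypothetical names; symmetric extents are assumed for brevity): the stored sign vector selects, in one scaled sum, the corner of the child's bounding rectangle that is furthest in a given parent direction.

    import numpy as np

    def furthest_vertex(child_centroid, child_basis, child_bounds, v_sign):
        # child_basis: rows are the child's singular vectors; child_bounds: extent
        # along each of them; v_sign: entries in {+1, -1}, one per direction.
        return child_centroid + (v_sign * child_bounds) @ child_basis

    def exceeds(parent_dir, parent_bound, parent_centroid, vertex) -> bool:
        # Project the vertex on the supercluster direction and test the bound.
        return float((vertex - parent_centroid) @ parent_dir) > parent_bound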
Sample Transfer
The purpose of the index is to reflect and discriminate, with individual clusters or superclusters, the ideal groups of a classification for the database. However, it is unreasonable to expect all clusters or superclusters to exactly match a class for the database without prior knowledge. In consequence, any initial clustering of the space should be expected to place samples in clusters in which, subjectively, they do not belong. As samples move away from their original clusters after several iterations of subjective feedback and modifications by the learning algorithm, this is likely to cause an inflation of the clusters' boundaries, resulting in reduced search performance. Samples in a cluster must
remain densely packed in order to effectively prune the search space.
To solve this problem, it should be made possible to transfer samples across the hierarchy.
If a sample is found to have diverged too far and seems to belong better in another cluster,
it should be transferred to that cluster. A sample should belong in a cluster if it fits the
cluster's context best. Thus, dividing the sample's distance to the cluster centroid by the
standard deviation of the cluster in the sample-centroid direction yields a better measure of
the sample's fitness for that cluster than its absolute distance to the centroid. This standard
deviation is easily calculated by the variance information contained in the cluster.
The same applies to any level of the hierarchy: if an entire cluster is found to belong
better in another supercluster, it should be transferred there. The process for determining
whether a cluster fits a certain supercluster better than another is similar to the fitness test
process outlined above for a single sample.
Now, this transfer operation disturbs the index structure. Indeed, for any transferred
element (sample or cluster), two branches need to be modified: the branch above the
original location of the element, and the branch above its new location: each
cluster in those two branches will need its weight, centroid and bounds updated to reflect
the disappearance/appearance of the element in its subtree. A single transfer operation
has therefore a time-logarithmic cost in terms of database size. However, if a fitness test is
performed for every element of a branch of the index, the updates needed by the transfers
found necessary can be all performed together along the branch. It is therefore possible to
perform fitness tests and element transfers along a whole branch still in logarithmic time.
Thus, at each query, it can be reasonably done once, and a good candidate for this operation
is the branch leading up from the target sample.
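The fitness test just described, distance to the centroid scaled by the cluster's spread in that direction, might look as follows (a sketch; the variance along an arbitrary direction is derived from the per-direction variances stored in the cluster):

    import numpy as np

    def fitness(sample, centroid, basis, variances) -> float:
        """Smaller value = the sample fits the cluster's context better."""
        v = sample - centroid
        dist = np.linalg.norm(v)
        if dist == 0.0:
            return 0.0
        direction = v / dist
        coeffs = basis @ direction              # direction expressed in the cluster's basis
        sigma = np.sqrt(coeffs**2 @ variances)  # standard deviation along `direction`
        return dist / sigma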
SVD Refresh
Sometimes however, a sample may seem to be diverging from a cluster, when it is actually
the cluster that has changed. Indeed, the directions of the bounding rectangle may become
inadequate for tight boundaries. Thus, clusters in the database should periodically undergo
a maintenance check, where the singular value decomposition is recalculated, and the required updates are made to the cluster's parent (specifically, a recalculation of the v_sign vectors for this child).
3.1.4 Auxiliary Exact Search Index
The main index does not provide any means for locating the target sample, given a query
sample. Initially, the cluster subdivision of space has the optimal clustering property: at
any level, the closest cluster centroid to a given sample vector is the centroid of the cluster
containing the sample. This is used by RCSVD at each query to first locate the target
sample in the index, the initial step of a nearest neighbors search. Unfortunately, in this
index, samples and clusters move about, such that the optimal clustering property is lost,
and the target sample must be found differently.
To this effect, each sample in the main index is doubly indexed by an auxiliary index.
This can be an inexpensive structure such as a B-Tree, where samples are classified in their
original form. The exact sample lookup thus takes logarithmic time in terms of database
size.
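In miniature (a Python dict standing in for the B-Tree; names are illustrative), the auxiliary index keys each sample by its original, never-modified feature vector and returns its current location in the main index:

    import numpy as np

    aux_index: dict = {}

    def register(original_vector: np.ndarray, sample) -> None:
        # Key on the immutable original vector, not the current (moving) one.
        aux_index[original_vector.tobytes()] = sample

    def locate_target(query_vector: np.ndarray):
        # Exact lookup; O(1) for a dict, O(log n) for the B-Tree of the text.
        return aux_index.get(query_vector.tobytes())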
3.2 The Leash Learning Algorithm
This section describes a family of incremental learning algorithms, where each learning increment can be performed in sublinear time in terms of the database size, given the appropriate
index structure. This family is defined by a set of principles for both adapting the result
set of a query to user feedback, and then generalizing the effect to the rest of the database.
In each case, the principles are stated, followed by a few proposed implementations.
3.2.1 Adapting the Result Set
Principle
Relevance feedback information, as given by the user, applies directly to the samples of
the result set of the query. The first learning step should then consist of reflecting this
subjective information onto those samples. As explained in Section 2.3.2, the sample vectors
are directly modified by following these heuristics:
1. If two samples are judged to be different, they should be pushed apart if they are
close. On the other hand, it is unnecessary to push them further apart if they are
already far apart;
2. If two samples are judged to be similar, they should be brought steadily closer, unless
they are very close already.
These updates are implemented as linear transformations, in order to maintain computational simplicity. These transformations must be chosen so as to make the database
converge towards a reliable and stable state.
Implementation
Main Approach Assume that the user's feedback is encoded by the variable u ∈ [-1, 1], for which values between -1 and 0 indicate varying degrees of difference between s and T, and values between 0 and 1 indicate varying degrees of similarity.
The simplest transformation proposed is to move each sample in the result set either directly towards or away from the target sample T. That is, for each sample s in the result set, \vec{v} being the vector from s to T, s is translated by a vector \vec{g} collinear to \vec{v}. The scaling factor f from \vec{v} to \vec{g} is logically proportional to u: indeed, with positive feedback s should be translated towards T, with negative feedback away from T, and by an amount reflecting the degree of similarity/difference.
Furthermore, f should take into account the present distance \|\vec{v}\| between s and T, so as to reflect the learning heuristics above (Section 3.2.1): if s and T are similar and already close together, they need not be brought much closer, and f should have an accordingly smaller magnitude; if s and T are different but already quite far apart, they need not be pushed much farther, and f should have an accordingly smaller magnitude. This effect is captured by two transition functions, f_a and f_b. Lastly, the distance \|\vec{v}\| between s and T needs to be scaled to the context of T's cluster, hence it is divided by \sigma, the standard deviation of said cluster in the direction of \vec{v} (calculated using the singular vectors and values).
This gives:

    \vec{g}(\vec{v}, u) = f(\|\vec{v}\|/\sigma, u) \cdot \vec{v},    (3.1)

1. for -1 \le u < 0, f(x, u) = D_0 \cdot u \cdot f_a(x),
2. for 0 \le u \le 1, f(x, u) = S_0 \cdot u \cdot f_b(x),

where:

    f_a(x) = \frac{1}{1 + x^a}, \qquad f_b(x) = 1 - \frac{1}{1 + x^b}.    (3.2)
The constant parameters D_0, S_0 ∈ [0, 1] adjust the weights for the similarity and difference displacements, and a, b ≥ 2 are exponents used to give varying degrees of steepness to the transition functions f_a and f_b. The illustration in Figure 3-1 shows how these transition functions implement the learning heuristics.
While they are computationally simple, these linear transformations have the disadvantage that they are not commutative, and are difficult to formulate in a closed form. Hence, it is easier to experimentally determine the values of the parameters D_0, S_0, a and b that facilitate the convergence of the algorithm (see Chapter 4).
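Equations (3.1) and (3.2) translate directly into code (a sketch; the parameter values shown are illustrative, not the thesis's):

    import numpy as np

    def displacement(s, T, u, sigma, D0=0.5, S0=0.5, a=2, b=2):
        """Translation g applied to sample s given feedback u in [-1, 1]."""
        v = T - s                              # vector from s to T
        x = np.linalg.norm(v) / sigma          # distance scaled to the cluster context
        if u < 0:
            f = D0 * u / (1 + x**a)            # f_a: repulsion fades once far apart
        else:
            f = S0 * u * (1 - 1 / (1 + x**b))  # f_b: attraction fades once close
        return f * v                           # eq. (3.1): g = f(x, u) * v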
Issues and Variations It should be noted that the learning function above is not designed to converge to a fixed point. Furthermore, those linear transformations can cause
drastic changes to the concerned samples, and given the fact that these transformations are
not commutative, this implies that the order of the queries could create an undesired bias
in learning. Two variations from the main approach are thus proposed, addressing those
two respective issues.
[Figure 3-1: f_a and f_b, the transition functions (here a, b = 2).]
The first variation attempts to relax the transformation by only delaying its effect. If
an infinite number of queries were to be performed with this sample in the result set, the
displacement of the sample s would end up the same as in the main approach. However, the
inertial delay this creates can result in a smoother transition between queries, and reduce
the bias introduced by their order.
This delay is induced as the main transformation f(\|\vec{v}\|/\sigma, u) \cdot \vec{v} is reduced by a weight w ∈ (0, 1), and combined with the previous move \vec{g}_{q-1}:

    \vec{g}_q(\vec{v}, u, \vec{g}_{q-1}) = w \cdot f(\|\vec{v}\|/\sigma, u) \cdot \vec{v} + (1 - w) \cdot \vec{g}_{q-1}.    (3.3)
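In code, the delayed variant of equation (3.3) is a one-line blend of the fresh move with the previous one (continuing the sketch above; w = 0.3 is illustrative):

    def delayed_displacement(s, T, u, sigma, g_prev, w=0.3):
        # Exponential smoothing of successive moves, weight w in (0, 1).
        return w * displacement(s, T, u, sigma) + (1 - w) * g_prev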
The second variation is intended to cause a quick convergence. It consists of freezing
samples as more and more queries are performed on them. This is done by scaling f using
a decreasing function of the total number of queries q made on the considered sample. For
example:
1. for -1 \le u < 0, f(x, u) = D_0 \cdot (u/q) \cdot f_a(x),
2. for 0 \le u \le 1, f(x, u) = S_0 \cdot (u/q) \cdot f_b(x).

The disadvantage is, however, that samples might thus converge too fast to a database state which is far from the desired state. One compromise is to select slower decreasing functions of q:

1. for -1 \le u < 0, f(x, u) = D_0 \cdot (u/\sqrt{q}) \cdot f_a(x),
2. for 0 \le u \le 1, f(x, u) = S_0 \cdot (u/\sqrt{q}) \cdot f_b(x).
Alternatively, database administrators can reset q for all or some selected samples, if
they seem to have converged to a stable, yet unsatisfactory state.
3.2.2 Generalization
Principle
The second step in the learning process is to generalize the feedback information captured by
the transformations applied to the samples of the result set. As explained in Section 2.3.3,
generalization is accomplished by hierarchically extending the effect of the transformation
of each vector in the result set to the whole database.
There are two motivations for this process. First, the learning process consists of
smoothly combining the current information of the database with the user's feedback;
therefore, it should ensure that the structure of the local space around the target sample is preserved by smoothing out the disruption caused by the learning step previously
described (Section 3.2.1), because said structure captures the state of the database prior to
feedback. Second, the actual process of generalization consists in extending the effects of
the feedback to all the samples to which it may apply. Given two samples, the probability that they belong to the same class increases the lower their lowest common parent cluster is in the hierarchy. In other words, samples should be bound by a similar fate if
they belong to the same local cluster. Thus, when a sample is displaced in the first learning
step, it is somewhat logical that each cluster in the index branch above it should follow
the sample by an amount inversely proportional to the cluster's weight, as if "dragged on a
leash".
While it may potentially cause displacements in a number of samples comparable to the
size of the whole database, this transformation can be achieved in sublinear time, because
the index is designed such that the entire contents of a cluster can be displaced in one
operation, by simply displacing its reference point (as explained in Section 3.1).
Within the hierarchical cluster structure, the generalization process consists of propagating the individual displacements of each sample vector in the result set, up along its
branch of superclusters until a cluster is reached which contains both the target and the
moving sample. Specifically, each cluster in the hierarchy is displaced by a vector integrating the desired displacements for all the vectors which belong to that cluster and are thus
correlated to the moving sample, towards or away from the target, and weighted down by
an inertia factor for the cluster.
Implementation
Simple Approach The simplest transformation affects only the branch of clusters directly above the sample, and is illustrated in Figure 3-2.
First, each leaf cluster containing one sample s in the query result set is displaced in the
same direction as s, except by a fraction 1/I_C of the amount by which s moves, where I_C is
an inertia factor which can simply be the weight of the cluster. If the leaf cluster contains
several samples in the result set, it moves by the sum of the individual sample moves, again
scaled down by 1/I_C. This is done in one operation by simply moving the cluster's reference
point. However, because those of the cluster's samples which belong to the result set already made
a move on their own towards or away from the target sample, it makes no sense to make
them move again by the displacement incurred on the cluster's reference point. Hence, that
displacement should be subtracted from those samples in the result set.
Next, the same process is applied to the superclusters containing one or more clusters
which have been displaced. Once again, the resulting displacement for a supercluster's
reference point should be subtracted from those clusters among its children, which have
been displaced on their own already.
With C being the reference point of a cluster, I_C the cluster's inertia factor, \Delta \vec{v}_i the displacement of a sample vector, \Delta C_i the displacement of the reference point of an internal cluster's sub-cluster, and \Delta P the displacement of the parent cluster, this can be formulated by the following, for a leaf cluster:
    \Delta C = \frac{1}{I_C} \sum_i \Delta \vec{v}_i - \Delta P,    (3.4)

and for an internal node cluster:

    \Delta C = \frac{1}{I_C} \sum_i \Delta C_i - \Delta P.    (3.5)

[Figure 3-2: Learning Illustration: Generalization. Legend: T = target sample; R = relevant sample; arrows distinguish elements moving by themselves from elements moved by their parent. Elements modified: one branch.]
This simple transformation consists of only one operation at each level above each sample
in the result set. Thus, the running time for this step is O(k log n), where n is the database
size and k the query size.
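A sketch of this simple generalization step for one displaced sample (using the hypothetical Cluster fields of Section 3.1.1; the sample is assumed to carry `offset` and `parent` fields, and its own move is applied here as well):

    def leash_up(sample, delta_v, stop=None):
        """Apply a sample's move and drag its branch per eqs. (3.4)-(3.5)."""
        # Absolute move desired at each level: the child's move scaled by 1/I_C,
        # with the inertia factor I_C taken to be the cluster weight. The walk
        # stops at `stop`, the cluster containing both target and moving sample.
        branch, move = [], delta_v
        cluster = sample.parent
        while cluster is not None and cluster is not stop:
            move = move / cluster.weight
            branch.append((cluster, move))
            cluster = cluster.parent
        # Apply top-down: each reference point is relative to its parent, so its
        # change is the cluster's absolute move minus the parent's (the -dP term).
        parent_move = 0.0
        for cluster, abs_move in reversed(branch):
            cluster.ref_point += abs_move - parent_move
            parent_move = abs_move
        # The sample moves by delta_v net of the drag its leaf cluster applied.
        sample.offset += delta_v - parent_move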
Variation In the simple approach, all the clusters in the branch above a sample in the result set are displaced in the exact same direction, regardless of their shape or size. A smoother transformation would be, for example in the case of samples in a leaf cluster, to have all the samples in the cluster move independently towards or away from the target, by a distance equal to $1/I_C$ times the distance traveled by their sibling from the result set, instead of having them all move in the same direction by moving the parent leaf cluster. Similarly, it would be smoother to move the clusters within a supercluster independently.
This transformation therefore consists in displacing a result sample's siblings instead of its parent leaf cluster, and displacing the siblings of a cluster containing a result sample instead of that cluster's parent. This is illustrated in Figure 3-3, and sketched in code below. While it uses more displacement operations overall, given a reasonably small and consistent branching factor b, this variant is still sublinear, as it runs in $O(kb \log n)$.
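A sketch of the variant, under the same hypothetical attribute names as before; the signed step `dist` is assumed positive when the result-set element moved towards the target and negative when it moved away.

```python
import numpy as np

def step_towards(point, target, dist):
    """Move `point` by `dist` along the line towards `target`
    (a negative `dist` moves it away)."""
    direction = target - point
    norm = np.linalg.norm(direction)
    return point if norm == 0 else point + dist * direction / norm

def leash_generalize_siblings(node, dist, target):
    """Variant: displace a moved element's siblings independently, each by
    1/I_C of the distance the element travelled, instead of dragging the
    parent cluster; then recurse one level up with the scaled-down step."""
    while node.parent is not None:
        scaled = dist / node.parent.inertia
        for sib in node.parent.children:
            if sib is not node:
                sib.ref = step_towards(sib.ref, target, scaled)
        node, dist = node.parent, scaled
```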
3.3 Searching for the Nearest Neighbors
With the proposed database index structure, a specific nearest neighbors search algorithm can be used, capable of significantly pruning the search space and avoiding a scan of the entire index. This similarity search algorithm is once again inspired by the search algorithm designed for the RCSVD approach [17]. It first performs a preliminary exact search for the target sample corresponding to the query sample. This allows the nearest neighbors search to proceed from the bottom up, progressively increasing the search radius; every cluster in the index whose boundaries lie beyond the search radius can be pruned out of the search, ideally leaving only a small subset of the database to be explored.
3.3.1 Exact Search for the Target
As explained earlier (Section 3.1.4), the exact search for the target sample is performed using the auxiliary index. The target sample vector corresponding to the query sample is found, and its parent is designated as the primary cluster for the nearest neighbors search.
Before starting the nearest neighbors search, however, a check is performed on the sample's entire branch, to transfer, if necessary, the sample and/or any of its parents to the cluster where each respectively belongs, as described in Section 3.1.3.
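As a sketch of this preliminary step (the auxiliary index itself is described in Section 3.1.4; the dict-like interface assumed here is hypothetical):

```python
def locate_target(aux_index, query_key):
    """Exact search for the target: the auxiliary index resolves the query
    sample's key directly to its vector in the main index, whose parent
    leaf cluster becomes the primary cluster for the neighbor search."""
    target = aux_index[query_key]       # hypothetical dict-like lookup
    return target, target.parent
```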
26
supercluster
Legend
leaf cluster
T: target sample
R: relevant sample
- : element moving by itself
leaf cluster
H
R
-
-
: element moved by parent
T
I
I
Elements modified:
all the siblings
of a branch
R
Figure 3-3: Learning Illustration: Generalization Variant
27
3.3.2 Collecting the Nearest Neighbors
Given the query's primary cluster, the task is then to collect the k nearest neighbors (NN)
of the target sample, working up along the hierarchy. The proposed algorithm is outlined
below. Note that this is an exact NN search, not an approximation, within the current
vector space.
A k NN search is first performed in the primary cluster. If k exceeds the number of
elements in the cluster, some of the k NNs belong to the neighboring clusters. Even if k
samples are retrieved, samples in neighboring clusters might be closer to the target than
some of the samples retrieved so far in the primary cluster.
The search must therefore be extended to neighboring clusters in the rest of the database. Let $C_1$ be the primary cluster, $T$ the target sample, $P(C_i)$ the parent cluster of a cluster $C_i$, and $S$ the ordered set of $l$ samples retrieved so far, where $l \le k$. Let $D(C_i, T)$ denote the distance from the target sample to the boundaries of the cluster $C_i$, and $D(s_j, T)$ the distance of the $j$-th sample of $S$ to the target (where $S$ is ordered by increasing distance to the target). The search follows this double-iterative procedure:

- For all the clusters $C_i$ that are siblings of $C_1$ (i.e., $P(C_i) = P(C_1)$), excluding $C_1$ itself, in order of their distance $D(C_i, T)$ to the target, and while $(l < k)$ XOR $D(C_i, T) < D(s_k, T)$ (i.e., while $C_i$ is a candidate which could possibly contain some of the k NNs), do:
  - Find $j$ such that $D(s_{j-1}, T) \le D(C_i, T) < D(s_j, T)$. Samples in $S$ from 1 to $(j - 1)$ are retained in the result set during this iteration, for they are closer to the target than anything in $C_i$ could be.
  - Find the ordered set $S'$ of the $(k - j)$ NNs of $T$ in $C_i$. If $C_i$ is not a leaf cluster, a procedure very similar to this iteration is recursively applied to its children.
  - Merge $S'$ with the current elements $j$ through $k$ of $S$ in increasing order, and retain the first $(k - j)$ elements of the merged list as the new elements $j$ through $k$ of $S$.
- Set $C_1$ to be $P(C_1)$, and repeat the loop above until the root is reached, i.e. until $P(C_1)$ is the root cluster.

A sketch of this procedure in code follows.
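This sketch assumes ball-shaped clusters with hypothetical attributes `ref`, `radius`, `children`, `parent`, and `samples` (each sample exposing a vector `vec`); `boundary_dist` plays the role of $D(C, T)$. For brevity it maintains $S$ as a sorted, truncated candidate list rather than performing the explicit merge step.

```python
import numpy as np

def boundary_dist(cluster, t):
    """D(C, T): distance from the target t to the cluster's boundary,
    assuming a ball-shaped cluster (zero if t lies inside)."""
    return max(0.0, np.linalg.norm(t - cluster.ref) - cluster.radius)

def search_cluster(cluster, t, k, best):
    """Collect candidates from `cluster` into `best`, a list of
    (distance, sample) pairs kept sorted and truncated to length k."""
    if not cluster.children:                       # leaf cluster
        best.extend((np.linalg.norm(t - s.vec), s) for s in cluster.samples)
        best.sort(key=lambda p: p[0])
        del best[k:]
        return
    for child in sorted(cluster.children, key=lambda c: boundary_dist(c, t)):
        # Prune: if we already hold k candidates and this child lies entirely
        # beyond the k-th one, it (and every later sibling) can be skipped.
        if len(best) == k and boundary_dist(child, t) >= best[-1][0]:
            break
        search_cluster(child, t, k, best)

def knn(t, primary, k):
    """Bottom-up k-NN collection: search the primary cluster, then widen to
    its siblings level by level up to the root, pruning by boundary distance."""
    best = []
    search_cluster(primary, t, k, best)
    node = primary
    while node.parent is not None:
        siblings = (c for c in node.parent.children if c is not node)
        for sib in sorted(siblings, key=lambda c: boundary_dist(c, t)):
            if len(best) == k and boundary_dist(sib, t) >= best[-1][0]:
                break
            search_cluster(sib, t, k, best)
        node = node.parent
    return [s for _, s in best]
```

Because entire subtrees are rejected by their boundary distance alone, only a small neighborhood of the target is typically visited.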
Chapter 4
Results
The experiments reported here focus on the Leash learning algorithm, testing three aspects of it: first, its ability to discriminate subjective classes in the database; next, the extent to which it preserves the spatial structure within any given class; and finally, its effectiveness with objective feedback, that is, how the learning algorithm can be used simply to increase the gaps between pockets of spatially correlated samples.

All the experiments were run on a geological core sample database containing 2208 samples, all 60-dimensional vectors. The samples are pre-classified and pre-labeled (unbeknownst to the program) in groups of 32, where each group represents a particular rock type. To be exact, the 32 images in a group correspond to the same picture, viewed through 32 different positions of a sliding window.

The choice of this test dataset was based on two factors: first, its content reflects the type of industrial archives for which this retrieval system is being designed. Second, this particular dataset was chosen specifically because a standard nearest neighbor search algorithm performs with very poor precision on it (37% on average).

The tests consisted of 3000 queries in which the query sample was chosen at random from among the 2208 database samples, and the number of retrieved samples (the query size k) was chosen at random between 1 and 31; a sketch of this protocol follows.
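A minimal sketch of this test protocol; `database`, `run_query`, and `give_feedback` are placeholders for the engine's actual interface, not names from the thesis.

```python
import random

def run_tests(database, run_query, give_feedback, n_queries=3000):
    """Test protocol: random query sample, random query size k in [1, 31];
    the learning algorithm keeps modifying the index as the queries run."""
    for _ in range(n_queries):
        q = random.choice(database)
        k = random.randint(1, 31)
        results = run_query(q, k)
        give_feedback(q, results)
```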
4.1 Class Precision

4.1.1 Experimental Setup
These tests used the database's pre-assigned labels to provide simulated strong subjective feedback for the learning algorithm. That is, for each sample in the result set, u was set to 1 if the sample's label matched the query sample's label, and to -1 otherwise. The tests were performed with the ($S_0$, $D_0$, w) parameters set to various constant values, shown in Table 4.1.
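As a sketch, the simulated feedback reduces to a label comparison (assuming each sample object carries its hidden pre-assigned label as a hypothetical `label` attribute):

```python
def simulated_subjective_feedback(query_sample, result_set):
    """u = +1 for result samples of the query's class, u = -1 otherwise."""
    return {s: (1 if s.label == query_sample.label else -1)
            for s in result_set}
```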
4.1.2 Motivation
These tests are performed to show how the Leash algorithm, for various parameters, discriminates classes and improves the class precision of each query, i.e. the fraction of the result set that belongs to the same class as the query sample (ideally, this fraction should be 1). The results are interpreted by plotting the number of true positives (TP), samples of the same class, against the number of false positives (FP), samples of a different class.
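Counting TP and FP for one query is then straightforward (a sketch, with the same hypothetical `label` attribute; class precision is TP divided by the result-set size):

```python
def tp_fp(query_sample, result_set):
    """Count true positives (same class as the query) and false positives."""
    tp = sum(1 for s in result_set if s.label == query_sample.label)
    return tp, len(result_set) - tp
```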
                          S_0    D_0    w
Simple kNN, no learning   0      0      0
Experiment 1              0      0.5    1
Experiment 2              0.5    0      1
Experiment 3              0.5    0.5    1
Experiment 4              0.2    0.2    1
Experiment 5              0.1    0.1    1
Experiment 6              0.1    0.1    0.8

Table 4.1: Experimental Parameters for Class Precision Tests
Experiments 1 through 3 use an excessively high value of 0.5 for $S_0$, $D_0$, or both, with the purpose of testing the limits of the learning algorithm. Experiment 6, with $w = 0.8$, is intended to test the effect of the relaxed variant of the transformation proposed in Section 3.2.1 against the normal transformation of experiment 5.
4.1.3 Test Results
Figure 4-1 contains the TP/FP plots. Note that these are not the typical TP/FP plots, in that they are not obtained by exhaustively querying every sample in the database with all possible values of k. Instead, they represent the results of 3000 random queries, and for experiments 1 through 6, the database is continually modified by the learning algorithm as the queries are performed. As such, these plots are mere approximations of the actual TP/FP relations, and should not be expected to be monotonically increasing.

Nevertheless, they clearly display a precision improvement obtained with the Leash algorithm over a standard nearest neighbors search. The parameter $S_0$ is found to have the greater impact on class precision, but the best results are obtained when both $S_0$ and $D_0$ are non-zero. In the latter case, high values like 0.5 do not provide a significant advantage over more moderate values like 0.2 or 0.1. Finally, though further testing would be necessary for a firm conclusion, these tests suggest that the relaxation factor w does not have a significant effect.
4.2 Structure Preservation

4.2.1 Experimental Setup
These experiments are the same as the ones above (Section 4.1.1); the present data was collected concurrently with the previous data.
4.2.2 Motivation
Class precision alone is not enough to evaluate the learned precision of the Leash algorithm. Indeed, the relative similarity of samples within a class matters as well. That is, given a query size k, a query sample Q, its corresponding target T, and a class C with n elements such that T ∈ C, if k << n, then not every combination of k samples in C constitutes a good result set: only the actual k nearest neighbors of T within C do.
[Figure 4-1: Class Precision of the Leash Algorithm. True positives vs. false positives, for the simple kNN baseline and experiments 1 through 6.]

[Figure 4-2: Structure Preservation Results for the Leash algorithm. Fraction of true positives matching the natural nearest neighbors, plotted against the query size k, for each experiment.]

If k samples from C were returned regardless of how far they are from T,
then the learning algorithm would have achieved class precision only at the unacceptable
cost of losing all the objective information contained in the original feature vectors, by
destroying the internal spatial structure of C.
Thus, this section presents a stricter criterion for learning precision. For each query result set (with variables defined as above) in the previous experiments (Section 4.1), the k' true positive (TP) samples are retained, where k' ≤ k. Separately, a standard nearest neighbors query for T is performed on the subset of the original feature space containing only the sample vectors pre-classified in C. This set, N, consists of the natural nearest neighbors of T within class C. The structure preservation measure for these experiments is then the fraction of the set of true positives that matches samples in N.
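A sketch of this measure, assuming N is taken to have the same size k' as the set of true positives, that the original (unlearned) vectors of class C are available in a map from sample id to vector, and that each retained sample exposes a hypothetical `id` attribute:

```python
import numpy as np

def structure_preservation(true_positives, t_vec, class_vectors):
    """Fraction of the k' true positives that also appear among the k'
    natural nearest neighbors of the target within its class, computed
    in the original feature space. `class_vectors` maps the sample ids
    of class C to their original vectors."""
    k_prime = len(true_positives)
    if k_prime == 0:
        return 1.0
    natural = sorted(class_vectors,
                     key=lambda sid: np.linalg.norm(class_vectors[sid] - t_vec))
    natural_set = set(natural[:k_prime])
    return sum(1 for s in true_positives if s.id in natural_set) / k_prime
```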
4.2.3 Test Results
The structure preservation measure is plotted in Figure 4-2. The results are expected to converge naturally towards 1 as k increases and k' approaches n. For low values of k, the Leash algorithm is shown to perform poorly, especially if the parameter $S_0$ is set high. This seems to be the main aspect of the learning algorithm in need of improvement.
4.3 Objective Feedback

4.3.1 Experimental Setup
This test used objective feedback for the learning algorithm. That is, for each sample s in the query result set, the feedback given was a monotonically decreasing function of the distance between this sample and the target sample T; specifically, u > 0 when s was within a cluster's width of T, and u < 0 otherwise. A single experiment was performed, with $S_0 = 0.1$, $D_0 = 0.1$, and $w = 0.8$, and both class precision and structure preservation data were collected.
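The text specifies only monotonicity and the sign change at one cluster width, so the linear ramp in this sketch is an assumption:

```python
import numpy as np

def objective_feedback(sample_vec, target_vec, cluster_width):
    """Monotonically decreasing feedback: u > 0 within one cluster width
    of the target, u < 0 beyond it (linear ramp assumed)."""
    d = np.linalg.norm(sample_vec - target_vec)
    return 1.0 - d / cluster_width
```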
4.3.2 Motivation
This experiment tests how stable the learning algorithm is with respect to objective feedback, by measuring how well the structure of the original database is preserved. Ideally, the vector space should remain stable and its structure preserved close to 100%.
Furthermore, this experiment tests whether the Leash algorithm can be used to counter the "curse of dimensionality" (the fact that in a high-dimensional space, all samples lie at similar distances from each other), by increasing the gaps between clusters of spatially related samples.
4.3.3 Test Results
The class precision results with objective feedback are plotted in Figure 4-3, while the structure preservation is plotted in Figure 4-4. These plots show results similar to, if not better than, a standard test using simulated subjective feedback.
While this might seem surprising, it is actually consistent with previous results and could simply be an artifact of this dataset and the testing methods, because subjective feedback is simulated using objective information. Nevertheless, it shows that the initial clustering does well at discriminating classes, and that in this case the learning algorithm merely counteracts the "curse of dimensionality" by increasing the relative gap between clusters.
However, the structure preservation results, while not the worst observed, still emphasize this weakness in the learning algorithm, considering that objective feedback should in principle create little entropy in the database.
[Figure 4-3: Class Precision with Objective Feedback. True positives vs. false positives.]
[Figure 4-4: Structure Preservation with Objective Feedback. Fraction of true positives matching the natural nearest neighbors, plotted against the query size k.]
Chapter 5
Discussion
5.1 Contributions
While the specific design proposed in this research for a scalable learning algorithm does
offer some improvement in search precision, the main contribution of this research is the
framework proposed for scalable and reliable similarity search.
Indeed, in this thesis I present a generic scheme for a scalable similarity search engine, combining an efficient similarity search algorithm with an adaptive relevance feedback method that refines the feature space through incremental changes, around a common compact and versatile indexing structure. While the details of the learning algorithm remain open to more elaborate variations, the principles set forth for the structure and dynamics of the index should carry over to future work. They fulfill the dual goals this index accomplishes: implementing an efficient nearest neighbors search index, while allowing an elaborate set of coherent incremental modifications in sublinear time, to adapt the database to the desired feedback.
5.2 Future Research

5.2.1 Improving Learning Precision
While the results obtained in this study are encouraging, they are not yet satisfactory for
the desired performance. Many more tests need to be performed with larger datasets and
more test queries, in order to really evaluate the average reliability of the Leash algorithm,
notably using different displacement constants or functions for $S_0$ and $D_0$. Also, tests are
needed to evaluate the time performance of the search algorithm, showing whether the index
structure indeed effectively prunes the search space, both before and after the database has
gone through a reasonable amount of learning.
Furthermore, several directions should be explored in order to improve the design, such
as:
Elaborate Result Set Transforms While they would be subjected to similar attractive/repulsive forces, the samples in the result set could be made to move along an elastic mesh, instead of directly undergoing a translation collinear with the force to which they are subjected. This would be more likely to preserve the class structure. Once again, the additional computational cost of such a transformation is amortized by the fact that it applies to the result set alone and not to the whole database. Furthermore, a similar elastic mesh method could be applied to the generalization technique. Again, by keeping the leaf size and the branching factor small, the additional computational cost stays low, since the transformation applies to only a small number of elements at each level of the hierarchy.
Optimized within-Cluster Index On the other hand, one could try to lift the restrictions of small leaf size and branching factor by creating an efficient and versatile within-cluster indexing method for both leaf clusters and superclusters, allowing the elements to be quickly sorted by their spatial position while keeping this within-cluster index flexible towards modifications of its elements.
Bibliography
[1] J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, R. C. Jain,
and C. Shu. Virage image search engine: an open framework for image management.
In Symposium on Electronic Imaging: Science and Technology - Storage & Retrieval for Image and Video Databases IV, volume 2670, pages 76-87. IS&T/SPIE, 1996.
[2] M. Beatty and B. S. Manjunath. Dimensionality reduction using multidimensional
scaling for image search. In Proc. IEEE International Conference on Image Processing,
October 1997.
[3] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proc. 22nd Int'l Conf. on VLDB, pages 28-39, Bombay, India,
Sept. 1996.
[4] B. Bhanu, J. Peng, and S. Qing. Learning feature relevance and similarity metrics in
image databases. In IEEE Computer Vision and Pattern Recognition, June 1998.
[5] I. J. Cox, M. L. Miller, T. P. Minka, and P. N. Yianilos. An optimized interaction strategy for Bayesian relevance feedback. In SPIE Photonics West, 1998.
[6] I. J. Cox, M. L. Miller, S. M. Omohundro, and P. N. Yianilos. PicHunter: Bayesian relevance feedback for image retrieval. In Proceedings of the International Conference on Pattern Recognition, pages 361-369, 1996.
[7] B.S. Everitt. Cluster Analysis. John Wiley & Sons, 3rd edition, 1993.
[8] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani,
J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video
content: The QBIC system. IEEE Computer, 28(9):23-32, September 1995.
[9] B. S. Kim and S. B. Park. A fast k nearest neighbor finding algorithm based on the
ordered partition. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-8(6):761-766, Nov. 1986.
[10] C.-S. Li, J. R. Smith, and V. Castelli. S-stir: Similarity search through iterative
refinement. In Symposium on Electronic Imaging: Science and Technology - Storage
& Retrieval for Image and Video Databases VI. IS&T/SPIE, 1998.
[11] Y. Linde, A. Buzo, and R.M. Gray. An algorithm for vector quantizer design. IEEE
Trans. Communications, COM-28(1):84-95, January 1980.
[12] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image
data. IEEE Trans. Pattern Analysis Machine Intell. Special Issue on Digital Libraries,
(8), 1996.
38
[13] T. P. Minka and R. W. Picard. Interactive learning through a society of models. Technical Report 349, MIT, 1995.
[14] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for content-based manipulation of image databases. In Proceedings of the SPIE Storage and Retrieval Image
and Video Databases II, February 1994.
[15] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: A power tool
for interactive content-based image retrieval. IEEE Trans. on Circuit and Systems for
Video Technology, 8:644-655, Sept. 1998.
[16] J. R. Smith and S.-F. Chang. VisualSEEk: A fully automated content-based image query system. In Proc. International Conference on Image Processing, 1996.
[17] A. Thomasian, V. Castelli, and C.-S. Li. Clustering and singular value decomposition
for approximate indexing in high dimensional spaces. In ACM CIKM, Nov. 1998.