FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets

Christos Faloutsos*
AT&T Bell Laboratories, Murray Hill, NJ

King-Ip (David) Lin
Dept. of Computer Science, Univ. of Maryland, College Park

Abstract

A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in some k-dimensional space, using k feature-extraction functions provided by a domain expert [25]. Thus, we can subsequently employ highly fine-tuned spatial access methods (SAMs) to answer several types of queries, including the 'Query By Example' type (which translates to a range query), the 'all pairs' query (which translates to a spatial join [8]), the nearest-neighbor or best-match query, etc. However, designing feature-extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information, though, it is not obvious how to map objects into points. This is exactly the problem we study: we describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dis-similarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before, and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes, and other regularities that data-mining is looking for. We introduce an older method from pattern recognition, namely Multi-Dimensional Scaling (MDS) [51]; although unsuitable for indexing, we use it as a yardstick for our method. Then, we propose a much faster algorithm to solve the problem at hand, which in addition is capable of indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS (being linear, as opposed to quadratic, with respect to the database size N), while it manages to preserve distances and the overall structure of the data-set.

* On leave from Univ. of Maryland, College Park. This work was partially supported by the National Science Foundation under Grants No. EEC-94-02384, IRI-8958546, IRI-9205273 and CDR-8803012, with matching funds from Empress Software Inc. and Thinking Machines Inc., and by the Institute for Systems Research.
1 Introduction

The problem we focus on is fast searching in databases of traditional and multimedia objects, where the only information we can count on is a distance (dis-similarity) function D(*,*) between pairs of objects. A promising approach, suggested by Jagadish [25], is to rely on a domain expert to derive k feature-extraction functions, thus mapping each object into a point in k-dimensional space; then there is a plethora of highly fine-tuned spatial access methods (SAMs) to retrieve the points that answer a range query, an 'all pairs' query, a nearest-neighbor query, etc. However, for several 'exotic' datasets it is difficult to design feature-extraction functions, while the domain expert can readily provide a distance function. For example, typed English words can be compared with the editing distance (insertions, deletions and substitutions), and digitized voice in voice pattern recognition requires time-warping; in both cases, distances are easy to define, but features are not obvious. Overcoming this difficulty is the motivation behind this work: given only the distance function, we want to map the objects into points in k-d space so that the distances between the images closely match the distances between the corresponding objects. The benefits are (a) efficient retrieval in conjunction with a SAM, and (b) visualization and data-mining: with k=2 or 3, the points can be plotted, so that experts can detect clusters, correlations among attributes, and other regularities in large collections of traditional or multimedia data. An older method from pattern recognition, Multi-Dimensional Scaling (MDS), can perform such a mapping, but, as we shall see, it is too slow for large databases and cannot support queries efficiently; we propose a much faster, linear algorithm. Notice that the case where the objects come as feature vectors is a special case of the above setting, because we can always compute distances (e.g., Euclidean) by using the feature vectors of the corresponding objects.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
SIGMOD '95, San Jose, CA USA
(c) 1995 ACM 0-89791-731-6/95/0005 $3.50

Given a collection of objects and a distance function, users would most often like (a) to find objects similar to a given query object, (b) to find the pairs of objects that are most similar to each other, and (c) to visualize the distribution of the objects in some appropriately chosen space, in order to check for clusters and other regularities. Next, we shall use the following terminology:

Definition 1 The k-dimensional point P_i that corresponds to the object O_i will be called 'the image' of object O_i. That is, P_i = (x_{i,1}, x_{i,2}, ..., x_{i,k}).

Definition 2 The k-dimensional space containing the 'images' will be called the 'target space'.

Some applications that motivated the present work are listed next; some typical distance functions are also described.

Image databases: in a collection of shapes or color images, we would like to find objects similar to a given one (with respect to color, shape, texture, etc.). Typical distance functions use the Euclidean distance on selected features, such as moments of inertia for shapes [25] and color-histogram differences for colors [35].

Multimedia databases, with audio (voice, music), video clips, etc.: search-by-content is highly desirable, e.g., to retrieve music scores or video clips similar to a given one.

Medical databases, where 1-d objects (e.g., ECGs), 2-d images (e.g., X-rays) and 3-d images (e.g., MRI brain scans) [5] are stored: the ability to quickly retrieve past cases with similar symptoms would be valuable for diagnosis, as well as for medical teaching and research. Notice that the distance functions are complicated, typically requiring some warping of the two images, to make sure that the anatomical structures (e.g., bones) are properly aligned before we compute the differences [50].

Time series, such as financial data (stock prices, sales numbers) and scientific data (weather, geological, environmental, astrophysics data, etc.). A typical query would be 'find companies whose stock prices move similarly', or 'find past days on which the solar magnetic wind showed patterns similar to today's pattern' [11], in order to help with forecasting. A suitable distance function is the Euclidean distance (sum of squared errors) [2].

Similarity searching in string databases, as in the case of spelling, typing [30] and OCR error correction [26]: given a wrong string, we should find the closest strings from a dictionary of acceptable strings. The distance is typically the editing distance, i.e., the minimum number of insertions, deletions and substitutions needed to transform one string into the other. Similar requirements appear in DNA databases, with strings over the alphabet {A, G, C, T} [3].

Data-mining [1] and visualization applications: for example, given records of patients (with attributes like gender, age, blood pressure, etc.), we would like to detect any clusters or correlations among symptoms, demographic data and diseases [4].

In the above applications, two types of queries are of interest. Specifically:

Definition 3 The term 'range query' (or, equivalently, 'query-by-example' or 'similarity query') will signify queries of the following form: given a collection of objects and a desirable query object, find the objects that are within a user-defined distance epsilon from the query object.

Definition 4 The term 'all pairs' query (or, equivalently, 'spatial join') will signify queries of the form: in a collection of objects, find the pairs of objects which are within distance epsilon from each other. Again, epsilon is user-defined.

Mapping the objects into k-d points provides two major benefits:

1. It can accelerate the search time for queries. The reason is that we can employ highly fine-tuned Spatial Access Methods (SAMs), like the R*-trees [7] and the z-ordering [37]. These methods provide fast searching for range queries as well as spatial joins [8].

2. It can help with visualization, clustering and data-mining: plotting objects as points in k=2 or 3 dimensions can reveal much of the structure of the dataset, such as the existence of major clusters and the general shape of the distribution (linear versus curvilinear, Gaussian versus uniform, etc.). These observations provide powerful insights in formulating hypotheses and discovering rules.
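To make the first benefit concrete, here is a minimal, hypothetical Python sketch (not from the paper; all names are ours): it assumes the objects have already been mapped into k-d points and answers a range query (Definition 3) by a linear scan over the images. A real system would hand this scan to a SAM such as an R*-tree instead of scanning.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two k-d points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, query_point, eps):
    """Return the indices of all images within eps of the query's image."""
    return [i for i, p in enumerate(points) if euclidean(p, query_point) <= eps]

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]   # images of three objects
hits = range_query(points, (0.5, 0.5), 1.0)      # objects 0 and 1 qualify
```

The point of the mapping is precisely that this search operates on points only, never re-invoking the (possibly expensive) original distance function.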
General Problem Definitions

We shall refer to the general problem as the 'distance' case, to highlight the fact that only the distance function is available:

Problem 1 ('distance' case) Given N objects and the distance information about them (e.g., an N x N distance matrix, or simply the distance function D(*,*) between two objects), find N points in a k-dimensional space such that the distances are maintained as well as possible.

We expect that the distance function D() is non-negative, symmetric, and obeys the triangular inequality. In the 'target' (k-d) space, we typically use the Euclidean distance, because it is invariant under rotations; alternatives could be any of the Lp metrics, like the L1 ('city-block' or 'Manhattan') distance. The case where the objects come with features is a special case; we shall refer to it as the 'features' case:

Problem 2 ('features' case) Given N vectors with n attributes each, find N vectors in a k-dimensional space (k < n) such that the distances are maintained as well as possible. The motivation is dimensionality reduction: vectors with too many attributes make spatial access methods suffer from the 'dimensionality curse'.

Ideally, the mapping should fulfill the following requirements:

1. It should be fast to compute: O(N) or O(N log N), but not O(N^2) or higher, because the latter is prohibitive for large databases.

2. It should preserve distances, leading to small discrepancies (low 'stress'; see Eq. 1). The 'stress' of a mapping is defined as

    stress = sqrt( sum_{ij} ( dhat_{ij} - d_{ij} )^2 / sum_{ij} d_{ij}^2 )    (1)

where d_{ij} is the dis-similarity D(O_i, O_j) of the two objects and dhat_{ij} is the (Euclidean) distance between their images P_i and P_j; 'stress' gives the relative error that the distances of the mapping suffer from, on the average.

3. It should be fast to map a new, arbitrary object to its image: O(1) or O(log N) distance calculations, which is vital for 'queries-by-example'.

The outline of this paper is as follows. Section 2 presents a brief survey of related methods (MDS, SVD, etc.) and pointers to spatial access methods. Section 3 describes the proposed algorithm. Section 4 gives experimental results on real and synthetic datasets. Section 5 lists the conclusions.
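The 'stress' of Eq. 1, i.e., the square root of the ratio of the summed squared discrepancies to the summed squared original distances, takes only a few lines to compute. This is an illustrative sketch (function and argument names are ours); both arguments list the pairwise distances over the same ordering of object pairs.

```python
import math

def stress(orig_d, mapped_d):
    """Stress (Eq. 1): relative discrepancy between the original
    pairwise distances d_ij and the distances dhat_ij between the
    images, given as flat lists over the same pair ordering."""
    num = sum((dh - d) ** 2 for dh, d in zip(mapped_d, orig_d))
    den = sum(d ** 2 for d in orig_d)
    return math.sqrt(num / den)

# A perfect mapping has zero stress:
assert stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
```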
2 Survey

In this section we present a brief survey of Multi-Dimensional Scaling (MDS), which applies to the 'distance' case, and of the Karhunen-Loeve (K-L) transform and Singular Value Decomposition (SVD), which apply to the 'features' case; we also give pointers to the literature on spatial access methods and clustering.

2.1 Multi-Dimensional Scaling (MDS)

Multi-Dimensional Scaling (MDS) is used to discover the underlying (spatial) structure of a set of data items from the (dis)similarity information among them. There are several variations, but the basic method (e.g., [29]) works as follows: MDS starts with a guess for the positions of the N items in k-d space (e.g., at random), and then iteratively improves the guess, so as to minimize the 'stress' of the configuration (Eq. 1). The steepest-descent version examines every point and moves it in the direction that decreases the stress. Intuitively, the method treats each pairwise dis-similarity as a 'spring' between the corresponding two points, and tries to re-arrange the positions of the N points so as to minimize the overall tension of the springs.

Kruskal proposed the basic algorithm and the 'stress' function [28, 29]; Shepard [48] and Young [55] incorporated several generalizations, such as non-metric MDS, where only the rank order of the dis-similarities matters, rather than their individual values. MDS has been used in numerous, diverse applications, mainly to map perceived (dis)similarities quantitatively and qualitatively: in psychology, e.g., for the semantic structure of words and for personality traits, where it was applied to observers' ratings of 60 different traits ('warm' goes together with 'trusting', etc.) [55]; in political science, for ideological shifts and relationships; in market research; and even in physics, e.g., for spectra of different sorts of nuclear spins and gamma-ray data [41, 49, 39, 19].

For our problem, MDS is a natural candidate: it expects exactly a set of pairwise (dis)similarities and produces k-d images that preserve them as closely as possible.
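The iterative 'spring' idea behind MDS can be sketched in a few lines. This is a toy illustration of ours (not the routine used in the experiments): it starts from a random guess and greedily nudges each coordinate to reduce the raw squared discrepancy; a real MDS implementation would follow the analytic gradient of the stress.

```python
import math
import random

def pair_dists(pts):
    """All pairwise Euclidean distances, in (i, j), i < j order."""
    n = len(pts)
    return [math.dist(pts[i], pts[j]) for i in range(n) for j in range(i + 1, n)]

def raw_stress(target_d, pts):
    """Sum of squared discrepancies against the target distances."""
    return sum((c - t) ** 2 for c, t in zip(pair_dists(pts), target_d))

def toy_mds(target_d, n, k=2, iters=300, step=0.05, seed=0):
    """Greedy coordinate-descent sketch of the basic MDS iteration."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(k)] for _ in range(n)]
    for _ in range(iters):                       # steepest-descent-like sweeps
        for i in range(n):
            for dim in range(k):
                best = raw_stress(target_d, pts)
                for delta in (step, -step):      # probe both directions
                    pts[i][dim] += delta
                    if raw_stress(target_d, pts) < best:
                        break                    # keep the improving move
                    pts[i][dim] -= delta         # otherwise revert
    return pts
```

Note that every point update examines all pairwise distances, which is exactly why MDS costs O(N^2) per sweep; the sketch only exists to make that cost visible.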
However, MDS suffers from two drawbacks for database applications:

- It requires O(N^2) computation time, since it examines all pairwise distances. This is impractical for large databases: in typical MDS applications N was on the order of 10-100 items, whereas, e.g., in Information Retrieval the items could correspond to the vocabulary of a document collection, with N in the tens of thousands.

- It is not prepared for the 'query-by-example' operation: when a new query object arrives, it must be mapped to a point in the target space, and with MDS this requires O(N) distance calculations at best, because the placement of the new item depends on its distances to all N database items. Thus, answering a query would be as slow as sequential scanning, and the speed-up of a spatial access method is lost.

Despite these two drawbacks, MDS solves exactly the problem we pose; we therefore use it as a yardstick, against which we measure the speed and the 'stress' of our method. Overcoming the two drawbacks was the motivation behind the present paper.

2.2 Dimensionality reduction ('features' case)

In the 'features' case, the problem has been studied extensively in statistical pattern recognition and matrix algebra. There, the optimal way to map n-dimensional points to k-dimensional points (k < n) is the Karhunen-Loeve ('K-L') transform [12, 16], in the sense that it minimizes the mean squared error for the given k. The K-L transform computes the eigenvectors of the covariance matrix of the data, sorts them in decreasing eigenvalue order, and approximates each data vector by its projections on the first k eigenvectors. It is closely related to the Singular Value Decomposition (SVD) [54, 40] of the object-feature matrix, and it is often used in pattern matching [17] to choose the most important features (actually, linear combinations of features) for a given set of vectors. Figure 1 shows a set of 2-d points and the two directions x' and y' that the K-L transform suggests: if we are allowed only k=1 dimension, the best direction to project on is x', because projecting on it minimizes the squared error.

Figure 1: Illustration of the Karhunen-Loeve (K-L) transformation: the 'best' axis to project the points on is x'.
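As an illustration of the k=1 case of Figure 1, the following pure-Python sketch (our own, not from the paper) finds the 'best' axis x' as the leading eigenvector of the 2x2 sample covariance matrix, using the closed-form angle theta = (1/2) * atan2(2*s_xy, s_xx - s_yy); real implementations would call an SVD or eigensolver routine instead.

```python
import math

def best_axis(points):
    """Leading eigenvector (unit direction) of the 2x2 covariance
    matrix of a set of 2-d points: the K-L axis x' for k=1."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # closed-form principal angle of [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

pts = [(1, 1), (2, 2.1), (3, 2.9), (4, 4)]   # points near the line y = x
ax = best_axis(pts)                           # close to the 45-degree direction
```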
2.3 Retrieval and clustering

For the 'features' case, a plethora of spatial access methods (SAMs) is available; they form three classes: (a) tree-based methods, like the R-tree [24] and its variants (R+-tree [46], R*-tree [7], Hilbert R-tree [27], P-tree [23], etc.); (b) methods that use linear quadtrees [20] or, equivalently, the z-ordering [37, 38] or other space-filling curves [13, 18]; and (c) methods that use grid-files [36, 22] or variants like the hB-tree [31] (see [14] and [47] for surveys). All these methods, however, operate on points or rectangles in k-dimensional space. For the 'distance' case, there are methods that try to exploit the triangular inequality in order to prune the search on a range query [10, 6]; however, none of them tries to map the objects into points in a 'target space', nor do they provide a tool for visualization.

Finally, our work could be beneficial to clustering and Information Retrieval research, where many algorithms operating on distances have been proposed; see, e.g., [32, 43, 52] for surveys, [21] for applications in GIS, and [34] for applications in Information Retrieval.
3 Proposed Method

In the first part of this section we describe the proposed algorithm; we then discuss how it supports queries and give its complexity. Table 1 lists the symbols used and their definitions.

Table 1: Summary of Symbols and Definitions.

  N        number of objects in the database
  n        dimensionality of the original space ('features' case only)
  k        dimensionality of the 'target space'
  D(*,*)   the distance function between two objects
  x_i      the first coordinate (projection) of object O_i
  d_{a,b}  shorthand for the distance D(O_a, O_b)

3.1 Algorithm

The goal is to solve the 'distance' case, where the only input is the N x N distance matrix (or, equivalently, the distance function D()). The key idea is to pretend that the objects are points in some unknown, n-dimensional space, and to project these points on k mutually orthogonal directions; the challenge is to compute the projections from the distance matrix only. The heart of the method is to project the objects on a carefully selected 'line'. To do that, we choose two objects O_a and O_b (referred to as the 'pivot objects' from now on; how to choose them is discussed later, see Figure 4), and consider the 'line' that passes through them in the hypothetical n-d space. The projection of each object on that line can be obtained from the cosine law:

Lemma 1 (Cosine Law) In any triangle O_a O_i O_b, the cosine law gives

    d_{b,i}^2 = d_{a,i}^2 + d_{a,b}^2 - 2 x_i d_{a,b}    (2)

where x_i is the projection of O_i on the line (O_a, O_b); see Figure 2.

Proof: Apply the Pythagorean theorem to the two right triangles O_a E O_i and O_b E O_i, where E is the projection of O_i on the line; eliminating the height (E O_i) between the two resulting equations gives Eq. 2. QED

Eq. 2 can be solved for x_i, the first coordinate of object O_i:

    x_i = ( d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2 ) / ( 2 d_{a,b} )    (3)

Figure 2: Illustration of the cosine law: projection of O_i on the line (O_a, O_b).

Notice that the computation of x_i needs only the distances between objects, which are given. Observe also that the mapping is reasonably distance-preserving: for example, if O_i is close to the pivot O_a, its first coordinate x_i will be small.
Having computed the first coordinates, the problem is how to obtain the next ones. The idea is as follows: pretending that the objects are points in an n-dimensional space, consider the (n-1)-dimensional hyper-plane H that is perpendicular to the line (O_a, O_b) of the two pivots, and project all the objects on it; let O_i' stand for the projection of O_i (see Figure 3). The question is whether we can compute the distance function D'() between the projections, using only the original distance function D(). If the answer is affirmative, then we can recursively apply the same method: treat the projections as a new set of objects with distance function D'(), choose two new pivots, compute a second coordinate for every object, and so on, k times in total.

Figure 3: Projection on the hyper-plane H, perpendicular to the line (O_a, O_b) of the pivots.

The answer is indeed affirmative:

Theorem 1 On the hyper-plane H, the Euclidean distance D'() between the projections O_i' and O_j' can be computed from the original distance D() as follows:

    ( D'(O_i', O_j') )^2 = ( D(O_i, O_j) )^2 - ( x_i - x_j )^2    i, j = 1, ..., N    (4)

Proof: From the Pythagorean theorem on the triangle O_i C O_j (with the right angle at C), where C is chosen so that the segment (O_i C) is parallel to the pivot line. Since the length of (O_i C) equals (D E) = |x_i - x_j|, where D and E are the projections of O_i and O_j on the pivot line, and since (C O_j) = (O_i' O_j'), we have:

    ( O_i' O_j' )^2 = ( C O_j )^2 = ( O_i O_j )^2 - ( O_i C )^2    (5)

and the proof is complete. QED

The ability to compute D'() allows us to project on a second line, lying on the hyper-plane H and therefore orthogonal to the first line (O_a, O_b) by construction. Thus we can recursively apply the method, determining one more coordinate of every object with each recursion.
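Eq. 4 is all the recursion needs; a minimal sketch (names are our choosing, and the max() guard against tiny negative values from floating-point round-off is ours):

```python
import math

def projected_distance(d_ij, x_i, x_j):
    """Eq. 4: distance between the projections O_i', O_j' on the
    hyper-plane H perpendicular to the pivot line; x_i, x_j are the
    coordinates obtained from Eq. 3."""
    return math.sqrt(max(d_ij ** 2 - (x_i - x_j) ** 2, 0.0))

# Check on the plane: i=(1,1), j=(2,3), pivot line = x-axis, so the
# projections differ only in y and should be |1 - 3| = 2 apart.
d_ij = math.dist((1.0, 1.0), (2.0, 3.0))
assert abs(projected_distance(d_ij, 1.0, 2.0) - 2.0) < 1e-9
```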
The remaining question is how to choose the pivot objects O_a and O_b. Ideally, we would like the line (O_a, O_b) to be as 'long' as possible, so that the projections of the objects on it are spread apart; however, finding the pair of objects that are farthest apart would require O(N^2) distance computations. Thus, we propose the linear heuristic of Figure 4.

Algorithm 1 choose-distant-objects( O, dist() )
begin
1) Choose arbitrarily an object; let it be the second pivot object O_b;
2) let O_a = (the object that is farthest apart from O_b, according to the distance function dist());
3) let O_b = (the object that is farthest apart from O_a);
4) report the objects O_a and O_b as the desired pair of objects;
end

Figure 4: Heuristic to choose two distant objects.

Steps 2 and 3 can be repeated a constant number of times to improve the choice; in all our experiments we used 5 iterations. The complete method, algorithm 'FastMap', is given in Figure 5. It uses the following global variables:

  X[]   : the N x k array of coordinates; X[i, j] is the j-th coordinate of the i-th object;
  PA[]  : a 2 x k pivot array; PA[1, col] and PA[2, col] hold the ids of the pivot objects of the col-th recursive call, one pair per call;
  col#  : points to the column of the X[] array currently being updated; initially col# = 0.

Algorithm 2 FastMap( k, D(), O )
begin
1) if (k <= 0) { return; } else { col# ++; }
2) /* choose pivot objects */
   let O_a and O_b be the result of choose-distant-objects( O, D() );
3) /* record the ids of the pivot objects */
   PA[1, col#] = a; PA[2, col#] = b;
4) if ( D(O_a, O_b) = 0 )
   set X[i, col#] = 0 for every object O_i and return;
   /* since all remaining inter-object distances are 0 */
5) /* project the objects on the line (O_a, O_b) */
   for each object O_i, compute x_i using Eq. 3 and update the global array: X[i, col#] = x_i;
6) /* consider the projections of the objects on a hyper-plane perpendicular to the line (O_a, O_b); the distance function D'() between two projections is given by Eq. 4 */
   call FastMap( k - 1, D'(), O );
end

Figure 5: Algorithm FastMap.

The algorithm accepts as input a set O of N objects (e.g., ASCII documents), a distance function D() that obeys the triangular inequality, and the desired dimensionality k; it maps the objects into points in k-d space, maintaining the distances as well as possible. At each of the k recursive calls, it determines the coordinates of all the objects on a new axis; after the last call, the image of object O_i is P_i = (X[i,1], X[i,2], ..., X[i,k]). Notice that the pivot objects of each call are recorded in the pivot array PA[]; the reason is to facilitate queries. When a 'query-by-example' request arrives, the query object is mapped into a point of the target space by repeating the projection steps with respect to the recorded 'pivot objects', one coordinate per recursive call; that is, the query object is treated exactly like a database object, with a constant number (O(1)) of distance calculations per coordinate, i.e., O(k) in total, regardless of the database size N.

The complexity of FastMap is O(N k) distance calculations: the algorithm makes k recursive calls, and within each call the longest steps, choose-distant-objects() and the projection step 5, are each O(N). In addition, each distance calculation at recursion depth col requires Theta(col) = O(k) arithmetic operations, to subtract the contributions of the coordinates computed so far (Eq. 4); this is the price of computing D'() on the fly, rather than materializing the O(N^2) distance matrix of the projections, which would be prohibitive for large N.
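Putting Figures 4 and 5 together, here is a compact, self-contained Python rendition of FastMap as we read it; the recursion is unrolled into a loop over columns, and dist() plays the role of D'() at the current depth. All variable and helper names are ours, and this is a sketch, not the authors' implementation.

```python
import math

def fastmap(n_objects, D, k):
    """Map objects 0..n_objects-1 into k-d points, given only the
    symmetric distance function D(i, j).  Returns the coordinate
    array X (the images) and the list of pivot-id pairs (array PA)."""
    X = [[0.0] * k for _ in range(n_objects)]
    pivots = []

    def dist(i, j, col):
        # D'() at depth col: Eq. 4 applied once per column found so far.
        s = D(i, j) ** 2 - sum((X[i][c] - X[j][c]) ** 2 for c in range(col))
        return math.sqrt(max(s, 0.0))          # guard against round-off

    def choose_distant_objects(col):
        # Heuristic of Figure 4, with a constant number of passes.
        a, b = 0, 0
        for _ in range(5):
            a = b
            b = max(range(n_objects), key=lambda o: dist(a, o, col))
        return a, b

    for col in range(k):
        a, b = choose_distant_objects(col)
        pivots.append((a, b))
        d_ab = dist(a, b, col)
        if d_ab == 0.0:                        # all remaining distances are 0
            break
        for i in range(n_objects):             # Eq. 3 for every object
            d_ai, d_bi = dist(a, i, col), dist(b, i, col)
            X[i][col] = (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2 * d_ab)
    return X, pivots
```

For objects that really are points on a line, a single FastMap coordinate reproduces all pairwise distances exactly; for general metric data, the residual error shows up as stress.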
Due to space limitations, we omit a detailed example of the method on a collection of documents; this example, as well as more details, are available in a technical report [15].

4 Experiments

We implemented FastMap in 'C++' under UNIX(TM) and ran the experiments on a DECStation 5000/25. For MDS, we used the routines of the IMSL STAT/LIBRARY FORTRAN package. We compared the two methods with respect to (a) speed and (b) quality of output, measured by the 'stress' function (Eq. 1). The experiments involved two groups of datasets: synthetic ones, designed to illustrate the visualization and clustering abilities of each method, and real ones. The first synthetic dataset is as follows:

GAUSSIAN5D: N=120 points in 5-dimensional space, forming clusters. The cluster centers are the points (0,0,0,0,0), (10,0,0,0,0), (0,10,0,0,0), (0,0,10,0,0), (0,0,0,10,0) and (0,0,0,0,10); each cluster consists of 20 points that follow a Gaussian distribution around its center, with standard deviation sigma = 1 on each axis and covariance rho_{i,j} = 0 for i != j. The distance is the Euclidean distance. For such a dataset, a good mapping should reveal 6 clusters.
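As a concrete companion to the synthetic datasets, here is a sketch that generates GAUSSIAN5D-style clusters and the helix of Eq. 6. The seed, function names, and exact parameter defaults are our assumptions, not the paper's.

```python
import math
import random

def gaussian5d(points_per_cluster=20, sigma=1.0, seed=0):
    """Six Gaussian clusters in 5-d: one centered at the origin and one
    at distance 10 along each axis; unit variance, zero covariance."""
    rng = random.Random(seed)
    centers = [tuple(0 for _ in range(5))]
    centers += [tuple(10 if d == c else 0 for d in range(5)) for c in range(5)]
    return [tuple(rng.gauss(m, sigma) for m in center)
            for center in centers for _ in range(points_per_cluster)]

def spiral(n=30):
    """SPIRAL dataset of Eq. 6: n 3-d points on a helix of unit radius."""
    return [(math.cos(i / math.sqrt(2)), math.sin(i / math.sqrt(2)), i / math.sqrt(2))
            for i in range(n)]

g = gaussian5d()    # 6 * 20 = 120 points in 5-d
s = spiral()        # 30 points, i = 0, ..., 29
```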
SPIRAL: 30 points on a 3-dimensional spiral, as suggested in the pattern-recognition textbook of Duda and Hart [12, p. 243]:

    x1(i) = cos( x3(i) )
    x2(i) = sin( x3(i) )
    x3(i) = i / sqrt(2),    i = 0, 1, ..., 29    (6)

This dataset has no clusters; the points form a 1-d curve in 3-d space.

The real datasets are as follows:

DOCS: a collection of 35 text documents, in 7 groups of 5 documents each:
  ABS: Abstracts of computer science technical reports.
  BBR: Reports about basketball games.
  CAL: 'Calls for papers' for technical conferences.
  MAT: Portions of the Bible (taken from the Gospel of Matthew, King James' Version).
  REC: Cooking recipes.
  WOR: 'World News': documents about the Middle East (October 1994).
  SAL: 'For Sale' advertisements for computers and software.
The documents were taken from text repositories on the Internet (e.g., wuarchive.wustl.edu). Each document is turned into a vector; after normalizing the vectors to unit length, the distance function is the Euclidean distance, which is closely related to the 'cosine similarity' function that is popular in Information Retrieval (for more details, see [15]).

WINE: N=154 records with the results of a chemical analysis of wines grown in the same region in Italy, but derived from three different cultivars. The dataset is available electronically from the UC-Irvine repository of machine-learning databases (ics.uci.edu, directory ftp/pub/machine-learning-databases/wine). We expect to see 3 clusters, one per cultivar.

4.1 Comparison with MDS

In the first group of experiments we compare the two methods, FastMap and traditional MDS, with respect to their response time, as a function of the database size N and of the dimensionality k of the target space.
We used subsets of the 'WINE' dataset of varying sizes, namely N = 45, 60, 75, 90 and 105, with k=2. Figure 6 plots the response time that each method required as a function of the database size N, in logarithmic scales on both axes. As visual aids, we also plot a straight line with slope 1 (labeled 'O(x)') and one with slope 2 (labeled 'O(x^2)'); these lines highlight the fact that MDS requires roughly quadratic time on the database size N, while FastMap requires linear time. The important conclusion is that the savings of FastMap over MDS increase dramatically with the database size.

Figure 6: Response time vs. database size N for the WINE dataset; MDS (dashed line) vs. FastMap (solid line). Both axes are logarithmic.

Next, we study the response time as a function of the dimensionality k of the target space. We used a 60-point subset (N=60) of the WINE dataset and varied k from 2 to 6. Figure 7 shows the results, again in logarithmic scales on both axes. The conclusion is that FastMap achieves dramatic savings in time even for small k: the time of our method increases roughly linearly with k, while the time of MDS grows much faster.

Figure 7: Response time vs. dimensionality k of the target space, for a 60-point subset (N=60) of the WINE dataset; MDS (dashed line) vs. FastMap (solid line). Both axes are logarithmic.
The last question is how the two methods compare with respect to the quality of their output, in conjunction with the time they require. Figure 8 plots, for each method and for k=2 and 3 on the WINE dataset, the 'stress' of the final result against the response time (both axes logarithmic); each method is thus characterized by its 'price/performance' point. The 'ideal' point is the origin (0,0), which stands for zero stress in zero time; the closer a method comes to it, the better. We see that FastMap achieves almost the same 'stress' values as MDS, but an order of magnitude faster; alternatively, within the same amount of time, it can produce a mapping of much lower 'stress'. Thus FastMap gives dramatic savings in time with little or no loss in the quality of the output.

4.2 Clustering/visualization

The second group of experiments illustrates the ability of FastMap to help with visualization and clustering. Unless otherwise stated, each point in the scatter-plots that follow is drawn as a letter, indicating the cluster it belongs to. Note that the resulting axes f1, f2, ... are 'fictitious': they are the 'FastMap-attributes' of each object and, in general, they do not correspond to any of the original attributes.

4.2.1 Synthetic data

The first experiment uses the GAUSSIAN5D dataset: N=120 synthetic points in 5-d space, forming 6 disjoint clusters with 20 points per cluster. Figure 9 presents the results of FastMap for k=3, in three scatter-plots: (a) gives f2 vs. f1, (b) gives f3 vs. f1, and (c) gives the 3-d scatter-plot (f1, f2, f3). In each of the first two scatter-plots we can detect at least 4 clusters; the 3-d scatter-plot (c) confirms this observation and, moreover, shows that all 6 clusters are completely separated in the 'target' space.
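The 'stress' measure used in the comparisons above can be computed as follows. This is an illustrative sketch of one common (Kruskal-type) form; the paper's Eq. 6 may differ in normalization details:

```python
import math

def stress(orig, mapped):
    # 'stress' over the list of pairwise distances: 0 means the mapped
    # distances match the original ones exactly; larger values mean a
    # worse mapping.  (One common form; Eq. 6 may normalize differently.)
    num = sum((dh - d) ** 2 for d, dh in zip(orig, mapped))
    den = sum(d * d for d in orig)
    return math.sqrt(num / den)
```

The inputs are the original pairwise dis-similarities and the Euclidean distances of the mapped k-d points, listed in the same pair order.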
Figure 7: Response time vs. dimensionality k for MDS (dashed line) and FastMap (solid line), on a N=60 subset of the WINE dataset; both axes in logarithmic scale.

The next experiment involves the SPIRAL dataset. Figure 10(a) shows the original points in 3-d space, and Figure 10(b) shows the result of FastMap for k=2. Notice that the projections give much information about the shape of the dataset: the resulting points show no obvious clusters and seem to form a 1-d curve with some oscillation, revealing the intrinsically 1-dimensional ('curve') nature of the original dataset.

Figure 9: FastMap (k=3) on the GAUSSIAN5D dataset: (a) f2 vs. f1, (b) f3 vs. f1 and (c) the 3-d scatter-plot (f1, f2, f3).

Figure 10: (a) The points of the SPIRAL dataset in 3-d space and (b) the result of FastMap for k=2.
4.2.2 Real data

Next, we present the results on the real datasets. Figure 11 shows the results of FastMap on the WINE dataset, in its entirety. The symbols '+', '□' and '?' denote the members of the three classes (cultivars), respectively. The f1-f2 scatter-plot for k=2 (Figure 11(a)) already separates some of the clusters; the scatter-plots for k=3 (Figures 11(b) and (c)) give an even better picture and manage to separate the three classes almost completely.

Figure 11: FastMap on the WINE dataset: (a) f2 vs. f1 for k=2; (b) f2 vs. f1 and (c) f3 vs. f1 for k=3.

The last experiment uses the DOCS dataset, to illustrate the ability of FastMap to handle objects for which only a distance function is available. The results for k=3 are shown in Figure 12: (a) gives the 3-d scatter-plot of the whole dataset, and (b) gives more detail, after zooming into the center of (a). Notice that FastMap manages to cluster the documents well: the 7 classes are separated from each other, in only k=3 dimensions.

Figure 12: The DOCS dataset after FastMap in k=3-d space: (a) a 3-d scatter-plot of the whole dataset and (b) the result of zooming into the center of (a), for more detail.

5 Conclusions

We have proposed FastMap, a fast, linear algorithm that maps objects into points in a k-dimensional space (k is user-defined), so that the distances between the objects are preserved as well as possible. The algorithm needs only a distance function, which a domain expert can provide; no feature-extraction functions [25] are required. Such a mapping is useful for traditional and multimedia databases alike. Firstly, it accelerates searching: once the objects become k-d points, we can use highly optimized and readily available spatial access methods (R-trees [20], R*-trees [7], etc.) to answer several types of queries, including 'query-by-example' (i.e., range) queries, nearest-neighbor queries [41] and 'all pairs' queries (spatial joins) [9, 8]. Secondly, it helps with data-mining, cluster analysis and visualization: the objects of the database can be plotted as points in 2-d or 3-d space, so that the human eye can infer clusters, regularities and correlations; this holds even for 'non-traditional' databases, where a domain expert can provide a distance function but where it is hard to design feature-extraction functions.

The design of FastMap fulfills all of our design goals:
1. It solves the general problem (the 'distance' case), which includes the 'features' case as a special case; specialized methods for the 'features' case, such as the Karhunen-Loève (K-L) transform or, equivalently, the Singular Value Decomposition (SVD), cannot handle the general case.

2. It is linear on the database size N, and therefore much faster than Multi-Dimensional Scaling (MDS), which is quadratic; at the same time, it is able to map a new, arbitrary object into a k-d point in O(k) distance calculations, regardless of the database size N. This property leads to fast indexing and fast 'queries-by-example'.

3. It uses only elementary theorems from geometry (such as the cosine law): at each of its k recursive calls, it projects each object on an appropriate line.

With respect to the quality of the output, we measured the 'stress' achieved by FastMap on real and synthetic datasets. The result is that FastMap achieves the same 'stress' levels as MDS, in a fraction of the time; moreover, even for a low target dimensionality (k=2 or 3 dimensions), it managed to separate all the existing clusters of our datasets and to reveal the overall shape of the data (e.g., the 1-d nature of the SPIRAL dataset).

The second contribution of this paper is that it introduces tools from diverse fields into the arsenal of database research: from matrix algebra, the Singular Value Decomposition (SVD) and the Karhunen-Loève (K-L) transform; from the social sciences, Multi-Dimensional Scaling (MDS). MDS has been used in the social sciences for the analysis of dis-similarity data; being quadratic and iterative, though, it is unable to handle large datasets. The K-L transform (or SVD) provides the optimal solution for the 'features' case, but it is unable to handle the general 'distance' case.

Future work includes the application of FastMap to multimedia databases, to automatically determine 'features' for indexing, and the development of interactive visualization and data-mining tools on top of it.

Acknowledgments

We would like to thank Dr. Joseph B. Kruskal of AT&T Bell Labs for providing the source code for MDS and for answering several questions on it; Patrick M. Murphy and David W. Aha for maintaining the UC-Irvine Repository of Machine Learning Databases and Domain Theories; and Howard Elman and Doug Oard for their help with the SVD algorithms and the document datasets.
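To make the listed properties concrete — the cosine-law projection, the k recursive calls, and the fact that each axis needs only distance computations — here is a minimal, illustrative Python sketch of the FastMap idea. It is not the authors' implementation; in particular, the pivot-choosing heuristic is simplified to two 'farthest-object' passes:

```python
import math

def fastmap(objects, dist, k):
    # Map each object to a k-d point while approximately preserving dist().
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # squared 'residual' distance at recursion depth col:
        # D'(i,j)^2 = D(i,j)^2 - sum of squared coordinate diffs found so far
        s = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            s -= (coords[i][c] - coords[j][c]) ** 2
        return max(s, 0.0)

    for col in range(k):
        # simplified pivot heuristic: start anywhere, jump to the farthest
        # object, then to the object farthest from that one
        a = 0
        b = max(range(n), key=lambda i: d2(a, i, col))
        a = max(range(n), key=lambda i: d2(b, i, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all residual distances are zero; nothing left to map
        dab = math.sqrt(dab2)
        for i in range(n):
            # cosine law: coordinate of object i on the line through a and b
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords

# tiny demo: the four corners of the unit square, with Euclidean distance
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
euclid = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
emb = fastmap(square, euclid, 2)
```

On point sets that are already Euclidean, choosing k equal to the original dimensionality recovers the pairwise distances exactly (up to rotation and reflection), which is an easy sanity check for the sketch.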
Appendix

This is the code for the K-L transform in Mathematica [53]:

(* given a matrix mat with $n$ vectors of $m$ attributes,
   create a matrix with the $n$ vectors and their first $k$
   most 'important' attributes (i.e., the K-L expansions of
   these $n$ vectors) *)
KLexpansion[ mat_, k_:2 ] := mat . Transpose[ KL[mat, k] ];

(* given a matrix with $n$ vectors of $m$ dimensions,
   compute the first $k$ singular vectors, i.e., the axes
   of the first $k$ K-L expansion *)
KL[ mat_, k_:2 ] := Module[
    {n, m, avgvec, newmat, val, vec},
    {n, m} = Dimensions[mat];
    avgvec = Apply[ Plus, mat ] / n //N;
    (* translate the vectors, so that their mean is zero *)
    newmat = Table[ mat[[i]] - avgvec, {i, 1, n} ];
    {val, vec} = Eigensystem[ Transpose[newmat] . newmat ];
    vec[[ Range[1, k] ]]
];
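For readers without Mathematica, here is a rough Python analogue of the KL[] function above. It is an illustration only (the function name kl_axes is ours): it finds the top-k eigenvectors of the mean-centered scatter matrix by power iteration with deflation, instead of calling a library eigensolver.

```python
import math

def kl_axes(mat, k=1, iters=200):
    # Return the top-k eigenvectors (as rows) of the scatter matrix of the
    # mean-centered data.  Pure stdlib; for real use, prefer an SVD or
    # eigendecomposition routine from a numerical library.
    n, m = len(mat), len(mat[0])
    avg = [sum(row[j] for row in mat) / n for j in range(m)]
    centered = [[row[j] - avg[j] for j in range(m)] for row in mat]
    # scatter matrix S = X^T X of the centered data
    S = [[sum(centered[i][a] * centered[i][b] for i in range(n))
          for b in range(m)] for a in range(m)]
    axes = []
    for _ in range(k):
        v = [1.0] * m
        for _ in range(iters):
            w = [sum(S[a][b] * v[b] for b in range(m)) for a in range(m)]
            # deflate: remove components along the axes already found
            for ax in axes:
                dot = sum(w[a] * ax[a] for a in range(m))
                w = [w[a] - dot * ax[a] for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        axes.append(v)
    return axes
```

Projecting the centered data onto these axes gives the K-L expansion, mirroring what KLexpansion[] does with the rows returned by KL[].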
References

[1] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In Foundations of Data Organization and Algorithms (FODO) Conference, Evanston, Illinois, October 1993. Also available through anonymous ftp from olympos.cs.umd.edu: /pub/TechReports/fodo.ps.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. Proc. ACM SIGMOD, pages 207-216, May 1993.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. Proc. of VLDB Conf., pages 487-499, September 1994.

[4] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990.

[5] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. QBISM: a prototype 3-d medical image database system. IEEE Data Engineering Bulletin, 16(1):38-42, March 1993.

[6] Ricardo A. Baeza-Yates, Walter Cunto, Udi Manber, and Sun Wu. Proximity matching using fixed-queries trees. In M. Crochemore and D. Gusfield, editors, Combinatorial Pattern Matching (CPM), LNCS 807, pages 198-212, Asilomar, CA, June 1994. Springer-Verlag.

[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. Proc. ACM SIGMOD, pages 322-331, May 1990.

[8] Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. Multi-step processing of spatial joins. Proc. ACM SIGMOD, pages 197-208, May 1994.

[9] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. Efficient processing of spatial joins using R-trees. Proc. ACM SIGMOD, pages 237-246, May 1993.

[10] W.A. Burkhard and R.M. Keller. Some approaches to best-match file searching. Comm. of the ACM (CACM), 16(4):230-236, April 1973.

[11] Committee on Physical, Mathematical and Engineering Sciences. Grand Challenges: High Performance Computing and Communications, The FY 1992 U.S. Research and Development Program. National Science Foundation, 1992.

[12] Susan T. Dumais. Latent semantic indexing (LSI) and TREC-2. In D.K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 105-115, Gaithersburg, MD, March 1994. Special Publication 500-215.

[13] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[14] Christos Faloutsos and King-Ip (David) Lin. FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Technical Report CS-TR-3383, UMIACS-TR-94-132, ISR TR 94-80, Dept. of Computer Science, Univ. of Maryland, College Park, 1994. (URL ftp://olympus.cs.umd.edu/sigmod95.ps)

[15] Christos Faloutsos and S. Roseman. Fractals for secondary key retrieval. Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 247-252, March 1989. Also available as UMIACS-TR-89-47 and CS-TR-2242.

[16] Peter W. Foltz and Susan T. Dumais. Personalized information delivery: an analysis of information filtering methods. Comm. of the ACM (CACM), 35(12):51-60, December 1992.

[17] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.

[18] I. Gargantini. An effective way to represent quadtrees. Comm. of the ACM (CACM), 25(12):905-910, December 1982.

[19] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 2nd edition, 1989.

[20] A. Guttman. R-trees: a dynamic index structure for spatial searching. Proc. ACM SIGMOD, pages 47-57, June 1984.

[21] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.

[22] K. Hinrichs and J. Nievergelt. The grid file: a data structure designed to support proximity queries on spatial objects. Proc. of the WG'83 (Intern. Workshop on Graph Theoretic Concepts in Computer Science), pages 100-113, 1983.

[23] H.V. Jagadish. Linear clustering of objects with multiple attributes. ACM SIGMOD Conf., pages 332-342, May 1990.

[24] H.V. Jagadish. Spatial search with polyhedra. Proc. Sixth IEEE Int'l Conf. on Data Engineering, February 1990.

[25] H.V. Jagadish. A retrieval technique for similar shapes. Proc. ACM SIGMOD Conf., pages 208-217, May 1991.

[26] Mark A. Jones, Guy A. Story, and Bruce W. Ballard. Integrating multiple knowledge sources in a bayesian OCR post-processor. First International Conference on Document Analysis and Recognition (ICDAR), Saint-Malo, France, September 1991.

[27] Ibrahim Kamel and Christos Faloutsos. Hilbert R-tree: an improved R-tree using fractals. In Proc. of VLDB Conference, pages 500-509, Santiago, Chile, September 1994.

[28] Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964.

[29] Joseph B. Kruskal and Myron Wish. Multidimensional Scaling. SAGE publications, Beverly Hills, 1978.

[30] Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377-440, December 1992.

[31] David B. Lomet and Betty Salzberg. The hB-tree: a multiattribute indexing method with good guaranteed performance. ACM TODS, 15(4):625-658, December 1990.

[32] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354-359, 1983.

[33] A. Desai Narasimhalu and Stavros Christodoulakis. Multimedia information systems: the unfolding of a reality. IEEE Computer, 24(10):6-8, October 1991.

[34] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. Proc. of VLDB Conf., pages 144-155, Santiago, Chile, September 1994.

[35] Wayne Niblack, Ron Barber, Will Equitz, Myron Flickner, Eduardo Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin. The QBIC project: querying images by content using color, texture and shape. SPIE 1993 Intl. Symposium on Electronic Imaging: Science and Technology, Conf. 1908, Storage and Retrieval for Image and Video Databases, February 1993. Also available as IBM Research Report RJ 9203 (81511), February 1993.

[36] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM TODS, 9(1):38-71, March 1984.

[37] J. Orenstein. Spatial query processing in an object-oriented database system. Proc. ACM SIGMOD, pages 326-336, May 1986.

[38] J.A. Orenstein. A comparison of spatial query processing techniques for native and parameter spaces. Proc. of ACM SIGMOD Conf., pages 343-352, 1990.

[39] William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in C. Cambridge University Press, Cambridge, England, 1988.

[40] A. Ravishankar Rao and Jerry Lohse. Identifying high level features of texture perception. SPIE Conference, San Jose, February 1992.

[41] N. Roussopoulos, Steve Kelley, and F. Vincent. Nearest neighbor queries. Proc. of ACM-SIGMOD, May 1995. To appear.

[42] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[43] David Sankoff and Joseph B. Kruskal. Time Warps, String Edits and Macromolecules: the Theory and Practice of Sequence Comparison. Addison-Wesley Publishing Company, Reading, MA, 1983.

[44] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. 13th International Conference on VLDB, pages 507-518, England, September 1987. Also available as SRC-TR-87-32, UMIACS-TR-87-3, CS-TR-1795.

[45] Dennis Shasha and Tsong-Li Wang. New techniques for best-match retrieval. ACM TOIS, 8(2):140-158, April 1990.

[46] R.N. Shepard. The analysis of proximities: multidimensional scaling with an unknown distance function, I and II. Psychometrika, 27:125-140 and 219-246, 1962.

[47] Roger N. Shepard, A. Kimball Romney, and Sara Beth Nerlove, editors. Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, vol. I. Seminar Press, New York, 1972.

[48] Gilbert Strang. Linear Algebra and its Applications. Academic Press, 2nd edition, 1980.

[49] A.W. Toga, P.K. Banerjee, and E.M. Santori. Warping 3d models for inter-brain comparisons. Neurosc. Abs., 16:247, 1990.

[50] W.S. Torgerson. Multidimensional scaling: I. theory and method. Psychometrika, 17:401-419, 1952.

[51] C.J. Van Rijsbergen. Information Retrieval. Butterworths, London, England, 2nd edition, 1979.

[52] Dimitris Vassiliadis. The input-state space approach to the prediction of auroral geomagnetic activity from solar wind variables. Int. Workshop on Applications of Artificial Intelligence in Solar Terrestrial Physics, September 1993.

[53] Stephen Wolfram. Mathematica. Addison Wesley, 2nd edition, 1991.

[54] Forrest W. Young. Multidimensional Scaling: History, Theory and Applications. Lawrence Erlbaum associates, Hillsdale, New Jersey, 1987.