Dimensionality Reduction
E.G.M. Petrakis

- Given N vectors in n dimensions, find the k most important axes to project them on
- k is user defined (k < n)
- Applications: information retrieval and indexing
  - identify the k most important features, or
  - reduce the indexing dimensions for faster retrieval (low-dimensional indices are faster)

Techniques

- Eigenvalue analysis techniques [NR'92]
  - Karhunen-Loeve (K-L) transform
  - Singular Value Decomposition (SVD)
  - both need O(N^2) time
- FastMap [Faloutsos & Lin 95]
  - dimensionality reduction and mapping of objects to vectors
  - O(N) time

Mathematical Preliminaries

- For an n x n square matrix S, a unit vector x, and a scalar λ such that Sx = λx:
  - x: eigenvector of S
  - λ: eigenvalue of S
- The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real
- rank r of a matrix: the maximum number of independent columns (or rows)

Example 1

- Intuition: S defines an affine transform y = Sx that involves scaling and rotation
- eigenvectors: unit vectors along the new directions
- eigenvalues denote the scaling along those directions
- For S = [2 1; 1 3] the eigenpairs are
    λ1 = 3.62, u1 = [0.52 0.85]^T (major axis)
    λ2 = 1.38, u2 = [0.85 -0.52]^T

Example 2

- If S is real and symmetric (S = S^T) then it can be written as S = UΛU^T
  - the columns of U are the eigenvectors of S
  - U: column orthogonal (UU^T = I)
  - Λ: diagonal, holding the eigenvalues of S
- For the matrix above:
    [2 1; 1 3] = [0.52 0.85; 0.85 -0.52] diag(3.62, 1.38) [0.52 0.85; 0.85 -0.52]^T

Karhunen-Loeve (K-L)

- Project into a k-dimensional space (k < n), minimizing the error of the projections (sum of squared differences)
- K-L gives a linear combination of axes sorted by importance; keep the first k dimensions
- [Figure: 2-dimensional points and the two K-L directions; for k = 1, keep x']

Computation of K-L

- Put the N vectors as rows in a matrix A = [a_ip]
- Compute B = [a_ip - ā_p], where ā_p = (1/N) Σ_{i=1..N} a_ip is the average of column p
- Compute the covariance matrix C = B^T B
- Compute the eigenvectors of C and sort them in decreasing eigenvalue order
- Approximate each object by its projections on the directions of the first k eigenvectors

Intuition

- Subtracting the column averages shifts the origin to the center of gravity of the vectors, so B has zero column means
- C represents attribute-to-attribute similarity; C is square, real, and symmetric
- The eigenvectors and eigenvalues are computed on C, not on A
- C defines the affine transform that minimizes the projection error
- Each vector is approximated by its projections along the first k eigenvectors

Example

- Input vectors [1 2], [1 1], [0 0], so A = [1 2; 1 1; 0 0] and the column averages are 2/3 and 1
- Then B = [1/3 1; 1/3 0; -2/3 -1] and C = B^T B = [2/3 1; 1 2]
- Eigenpairs of C:
    λ1 = 2.53, u1 = [0.47 0.88]^T
    λ2 = 0.13, u2 = [-0.88 0.47]^T

SVD

- For general rectangular matrices, e.g., an N x n matrix (N vectors, n dimensions)
- groups similar entities (documents) together
- groups similar terms together; each group of terms corresponds to a concept
- Given an N x n matrix A, write it as A = UΛV^T
  - U: N x r, column orthogonal (r: the rank of A)
  - Λ: r x r diagonal matrix (non-negative values, in descending order)
  - V: n x r, column orthogonal (so V^T is r x n)

SVD (cont'd)

- A = λ1 u1 v1^T + λ2 u2 v2^T + ... + λr ur vr^T, where u_i, v_i are the column vectors of U and V
- SVD identifies rectangular "blobs" of related values in A
- the rank r of A is the number of blobs
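Before the retrieval example, here is a minimal numpy sketch of the K-L recipe above, reproducing the worked example; the variable names are mine, and the last line is simply numpy's built-in SVD for comparison with A = UΛV^T:

```python
import numpy as np

# The three input vectors of the K-L example, as rows of A
A = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [0.0, 0.0]])

B = A - A.mean(axis=0)             # subtract the column averages (2/3 and 1)
C = B.T @ B                        # covariance matrix, here [2/3 1; 1 2]

evals, evecs = np.linalg.eigh(C)   # eigh handles real symmetric matrices
order = np.argsort(evals)[::-1]    # decreasing eigenvalue order
evals, evecs = evals[order], evecs[:, order]
# evals ~ [2.53, 0.13]; evecs[:, 0] ~ [0.47, 0.88] (up to sign)

k = 1
A_k = B @ evecs[:, :k]             # projections on the first k K-L directions

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
```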
Example

Term/document frequency matrix A:

             data  information  retrieval  brain  lung
  CS-TR1       1        1           1        0      0
  CS-TR2       2        2           2        0      0
  CS-TR3       1        1           1        0      0
  CS-TR4       5        5           5        0      0
  MED-TR1      0        0           0        2      2
  MED-TR2      0        0           0        3      3
  MED-TR3      0        0           0        1      1

- Two types of documents: CS and Medical
- Two concepts (groups of terms):
  - CS: data, information, retrieval
  - Medical: brain, lung

Example (cont'd)

- A = UΛV^T with rank r = 2:
    U = [0.18 0; 0.36 0; 0.18 0; 0.90 0; 0 0.53; 0 0.80; 0 0.27]
    Λ = diag(9.64, 5.29)
    V^T = [0.58 0.58 0.58 0 0; 0 0 0 0.71 0.71]
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- e.g., v12 = 0: the term "data" has zero similarity with the 2nd (medical) concept

SVD and LSI

- SVD leads to "Latent Semantic Indexing" (http://lsi.research.telcordia.com/lsi/LSIpapers.html)
- Terms that occur together are grouped into concepts
- When a user searches for a term, the system determines the relevant concepts to search
- LSI works with vectors in the concept space instead of the n-dimensional document space
- the concept space has lower dimensionality

Examples of Queries

- Find the documents containing the term "data": the query vector is q = [1 0 0 0 0]^T
- Translate q to concept space: q_c = V^T q
    q_c = [0.58 0.58 0.58 0 0; 0 0 0 0.71 0.71] [1 0 0 0 0]^T = [0.58 0]^T
- The query is related to the CS concept and unrelated to the medical concept
- LSI also returns documents containing the terms "retrieval" and "information", which are not specified in the query

FastMap

Works with distances and has two roles:
1. Map objects to vectors so that their distances are preserved (then apply SAMs, Spatial Access Methods, for indexing)
2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible

Main Idea

- Pretend the objects are points in some unknown n-dimensional space
- project these points on k mutually orthogonal axes, computing the projections from distances only
- The heart of FastMap is the method that projects objects on a line:
  - take two objects that are far apart (the pivots)
  - project all objects on the line that connects the pivots

Project Objects on a Line

- O_a, O_b: pivots; O_i: any object; d_ij: shorthand for the distance D(O_i, O_j)
- Apply the cosine law:
    d_bi^2 = d_ai^2 + d_ab^2 - 2 x_i d_ab
  and solve for the projection:
    x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)
- x_i is the first coordinate in the k-dimensional space
- If O_i is close to O_a, x_i is small

Choose Pivots

- Heuristic: (1) start from an arbitrary object; (2) let O_a be the object farthest from it; (3) let O_b be the object farthest from O_a
- Complexity: O(N); finding the optimal (farthest) pair would require O(N^2) distance computations
- steps 2-3 can be repeated 4-5 times to improve the accuracy of the selection

Extension to Many Dimensions

- Consider the (n-1)-dimensional hyperplane H that is perpendicular to the line O_a O_b
- Project the objects on H and apply the previous step:
  - choose two new pivots
  - the new x_i is the next coordinate of each object
  - repeat until k-dimensional vectors are obtained
- The distance on H is not D; call D' the distance between projected objects

Distance on the Hyperplane H

- By the Pythagorean theorem:
    D'(O_i', O_j')^2 = D(O_i, O_j)^2 - (x_i - x_j)^2
- so D' on H can be computed from the original distances and the coordinates assigned so far
- the ability to compute D' allows choosing a second pivot line on H, and so on

Algorithm

- [The original slide lists the recursive FastMap pseudo-code; a sketch follows below]
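The following is a compact sketch of the recursion just described, assuming a precomputed N x N distance matrix; the function and parameter names are mine, not from the paper:

```python
import numpy as np

def fastmap(D, k, n_pivot_iters=5):
    """Sketch of FastMap [Faloutsos & Lin 95]: map N objects to k-dim
    points using only their N x N distance matrix D."""
    N = D.shape[0]
    X = np.zeros((N, k))           # output coordinates
    D2 = D.astype(float) ** 2      # squared distances on the current hyperplane
    pivots = []                    # recorded per dimension, to map queries later

    for col in range(k):
        # choose pivots: start anywhere, repeatedly jump to the farthest object
        b = 0
        for _ in range(n_pivot_iters):    # "steps 2-3 repeated 4-5 times"
            a = int(np.argmax(D2[b]))
            b = int(np.argmax(D2[a]))
        d_ab2 = D2[a, b]
        if d_ab2 == 0:                    # all remaining distances are zero
            break
        pivots.append((a, b))
        # cosine law: x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)
        x = (D2[a] + d_ab2 - D2[b]) / (2.0 * np.sqrt(d_ab2))
        X[:, col] = x
        # Pythagoras on H: D'^2 = D^2 - (x_i - x_j)^2, clipped at zero
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return X, pivots
```

For example, `X, pivots = fastmap(D, 2)` returns 2-dimensional coordinates plus the pivot pairs needed to project queries later.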
Observations

- Complexity: O(kN) distance calculations
  - k: the desired dimensionality
  - k recursive calls, each taking O(N)
- The algorithm records the pivots of each call (dimension) to facilitate queries:
  - a query is mapped to a k-dimensional vector by projecting it on the pivot line of each dimension
  - O(1) computation per step: no need to choose pivots again

Observations (cont'd)

- The projected vectors can be indexed; a mapping to 2-3 dimensions also allows visualization of the data space
- FastMap assumes a Euclidean space (triangle inequality), which does not always hold (at least after the second step)
- Because the pivots are approximations, some (squared) projected distances can become negative; turn negative distances into 0

Application: Document Vectors

- With θ the angle between two document vectors, so that similarity(d1, d2) = cos(θ) is their cosine similarity:
    distance(d1, d2) = 2 sin(θ/2) = sqrt(2 (1 - cos θ)) = sqrt(2 (1 - similarity(d1, d2)))
  (a small numeric sketch of this conversion appears after the references)
- [Figure: FastMap on 10 documents for 2 and 3 dimensions: (a) k = 2, (b) k = 3]

References

- C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996
- W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988
- LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html
- C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. ACM SIGMOD, 1995
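As noted above, a short sketch (all data and variable names are hypothetical) that turns cosine similarities into the distances FastMap expects, reusing the fastmap() function sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.random((10, 5))                 # 10 hypothetical document vectors
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
sim = np.clip(unit @ unit.T, -1.0, 1.0)    # similarity(d1, d2) = cos(theta)

# distance(d1, d2) = sqrt(2 * (1 - similarity(d1, d2))), as on the slide
D = np.sqrt(2.0 * (1.0 - sim))

X, pivots = fastmap(D, 2)                  # 2-dim coordinates, e.g. for plotting
```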