10-603/15-826A: Multimedia Databases and Data Mining
SVD - part I (definitions)
C. Faloutsos
Multi DB and D.M. Copyright: C. Faloutsos (2001)

Outline
Goal: ‘Find similar / interesting things’
• Intro to DB
• Indexing - similarity search
• Data Mining

Indexing - Detailed outline
• primary key indexing
• secondary key / multi-key indexing
• spatial access methods
• fractals
• text
• Singular Value Decomposition (SVD)
• multimedia
• ...

SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
• Complexity
• Case studies
• Additional properties

SVD - Motivation
• problem #1: text - LSI: find ‘concepts’
• problem #2: compression / dim. reduction

Problem - specs
• ~10**6 rows; ~10**3 columns; no updates;
• random access to any cell(s); small error: OK
SVD - Definition
(reminder: matrix multiplication)
\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}_{3 \times 2}
\times
\begin{bmatrix} 1 \\ -1 \end{bmatrix}_{2 \times 1}
=
\begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix}_{3 \times 1}
\]

SVD - Definition
\[
A_{[n \times m]} = U_{[n \times r]} \; L_{[r \times r]} \; (V_{[m \times r]})^T
\]
• A: n x m matrix (e.g., n documents, m terms)
• U: n x r matrix (n documents, r concepts)
• L: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
• V: m x r matrix (m terms, r concepts)

SVD - Properties
THEOREM [Press+92]: it is always possible to decompose matrix A into A = U L V^T, where
• U, L, V: unique (*)
• U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other)
  – U^T U = I; V^T V = I (I: identity matrix)
• L: its diagonal entries are positive, and sorted in decreasing order (the slides call them ‘eigenvalues’; strictly, they are the singular values of A, i.e., the square roots of the eigenvalues of A^T A)

SVD - Example
• A = U L V^T - example: rows are documents (4 CS documents, then 3 MD documents); columns are terms (data, inf., retrieval, brain, lung)
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
\]
• first column of U: CS-concept; second column of U: MD-concept
• U: doc-to-concept similarity matrix
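The numbers in the 7 x 5 example can be re-derived with a few lines of code. The following is a minimal sketch, assuming Python with NumPy (not one of the packages the slides name). `numpy.linalg.svd` returns min(n, m) singular values, so the sketch trims the factors to the r non-zero ones; the 1e-10 cutoff is an arbitrary assumption. The signs of matching columns of U and V may come out flipped relative to the slide, which is why absolute values are printed.

```python
import numpy as np

# Document-term matrix from the example: 4 CS documents, then 3 MD documents;
# the first three columns are CS terms, the last two are MD terms.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# Thin SVD: U is 7x5, s holds 5 singular values, Vt is 5x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r non-zero singular values (r = rank of A).
r = int(np.sum(s > 1e-10))       # cutoff is an assumed tolerance
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
L = np.diag(s)

print("rank r =", r)
print("L diagonal:", np.round(s, 2))                 # strength of each concept
print("|U| =\n", np.round(np.abs(U), 2))             # document-to-concept
print("|V^T| =\n", np.round(np.abs(Vt), 2))          # term-to-concept (transposed)
print("U L V^T == A:", np.allclose(U @ L @ Vt, A))
```

Trimming to r columns is what turns the library's output into the n x r, r x r, m x r shapes used in the definition.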
SVD - Example
• A = U L V^T - example, reading off the remaining factors:
  – L: 9.64 is the ‘strength’ of the CS-concept
  – V: term-to-concept similarity matrix; the first row of V^T, (0.58 0.58 0.58 0 0), is the CS-concept

SVD - Interpretation #1
‘documents’, ‘terms’ and ‘concepts’:
• U: document-to-concept similarity matrix
• V: term-to-concept similarity matrix
• L: its diagonal elements: ‘strength’ of each concept

SVD - Interpretation #2
• best axis to project on (‘best’ = min sum of squares of projection errors)
• SVD: gives the best axis to project on, v1
  – minimum RMS error
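The ‘best axis’ claim can be checked numerically. The sketch below, assuming Python with NumPy, treats each row of the 7 x 5 example matrix as a point and compares the sum of squared projection errors for v1 against 1000 random directions. The helper `projection_error`, the seed, and the number of trials are illustrative assumptions. Because the data are not centered, the axis is a line through the origin.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

_, _, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                      # axis for the largest singular value


def projection_error(points, v):
    """Sum of squared distances from each row of `points` to the line spanned by v."""
    v = v / np.linalg.norm(v)
    residual = points - np.outer(points @ v, v)
    return float(np.sum(residual ** 2))


rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
random_dirs = rng.normal(size=(1000, A.shape[1]))
best_random = min(projection_error(A, d) for d in random_dirs)

error_v1 = projection_error(A, v1)
print("error on v1:          ", round(error_v1, 4))
print("best random direction:", round(best_random, 4))
print("v1 is at least as good:", error_v1 <= best_random + 1e-9)
```

Random search only illustrates the property; it does not prove it.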
SVD - Interpretation #2
• A = U L V^T - example:
  – v1: the first row of V^T, (0.58 0.58 0.58 0 0), is the first projection axis
  – 9.64: measures the variance (‘spread’) on the v1 axis
  – U L gives the coordinates of the points in the projection axis
• More details
• Q: how exactly is dim. reduction done?
• A: set the smallest singular values to zero; here, 5.29 becomes 0 while U and V^T stay as they are:
\[
A \approx U \times
\begin{bmatrix}
9.64 & 0 \\
0 & 0
\end{bmatrix}
\times V^T
\]
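As a sketch of the "U L gives the coordinates" statement, assuming Python with NumPy: the code below computes U L for the 7 x 5 example, confirms that it equals the rows of A projected on the axes in V, and then zeroes the smallest singular value to get one coordinate per document. Variable names such as `coords_1d` are illustrative, and signs may be flipped relative to the slide.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :2], s[:2], Vt[:2]   # the example matrix has rank 2

# U L: one row of concept-space coordinates per document.
coords = U * s                        # same as U @ np.diag(s)

# Projecting the rows of A on the axes v1, v2 gives the same numbers.
projected = A @ Vt.T

# Dim. reduction to one axis: zero the smallest singular value.
s_reduced = s.copy()
s_reduced[-1] = 0.0
coords_1d = (U * s_reduced)[:, 0]

print("U L == A V:", np.allclose(coords, projected))
print("|U L| =\n", np.round(np.abs(coords), 2))
print("|1-d coordinates| =", np.round(np.abs(coords_1d), 2))
```

Storing `coords_1d` together with v1 takes 7 + 5 numbers instead of the 35 cells of A, which is the compression the motivation slides ask for, at the cost of some error.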
SVD - Interpretation #2
• with 5.29 set to zero, the second column of U and the second row of V^T drop out:
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
\approx
\begin{bmatrix}
0.18 \\ 0.36 \\ 0.18 \\ 0.90 \\ 0 \\ 0 \\ 0
\end{bmatrix}
\times
\begin{bmatrix} 9.64 \end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]

SVD - Interpretation #2
Exactly equivalent: ‘spectral decomposition’ of the matrix:
\[
A_{[n \times m]}
=
\begin{bmatrix} u_1 & u_2 \end{bmatrix}
\times
\begin{bmatrix} l_1 & 0 \\ 0 & l_2 \end{bmatrix}
\times
\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}
= l_1 u_1 v_1^T + l_2 u_2 v_2^T + \dots
\]
• r terms; each u_i is n x 1, each v_i^T is 1 x m
• assume: l1 >= l2 >= ...
• approximation / dim. reduction: by keeping the first few terms (Q: how many?)
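The spectral decomposition can be written out term by term. The following is a minimal sketch, assuming Python with NumPy; the helper name `keep_terms` is hypothetical. It rebuilds the 7 x 5 example from its rank-1 terms and reports, for each k, the squared error over all cells next to the sum of squares of the dropped l_i's. The two agree, which is a standard property of the truncated SVD.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)


def keep_terms(k):
    """Sum of the first k rank-1 terms l_i * u_i * v_i^T (k >= 1)."""
    return sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))


for k in (1, 2):
    A_k = keep_terms(k)
    squared_error = float(np.sum((A - A_k) ** 2))   # error over all cells
    dropped = float(np.sum(s[k:] ** 2))             # squares of the dropped l_i
    print(f"k={k}: squared error={squared_error:.4f}, dropped l_i^2={dropped:.4f}")

print(np.round(keep_terms(1), 2))   # CS block kept, MD rows become zero
```

Each term is an n x 1 column times a 1 x m row, so keeping k terms stores k * (n + m + 1) numbers instead of n * m.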
SVD - Interpretation #2
• A (heuristic - [Fukunaga]): keep 80-90% of ‘energy’ (= sum of squares of l_i’s)
  – in A = l1 u1 v1^T + l2 u2 v2^T + ..., assume: l1 >= l2 >= ...

SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
  – #1: documents/terms/concepts
  – #2: dim. reduction
  – #3: picking non-zero, rectangular ‘blobs’
• Complexity
• Case studies
• Additional properties

SVD - Interpretation #3
• finds non-zero ‘blobs’ in a data matrix: in the A = U L V^T example, the 4 x 3 CS block and the 3 x 2 MD block each correspond to one concept
• Drill: find the SVD, ‘by inspection’!
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
= \;?? \times \;?? \times \;??
\]
• Q: rank = ??
• A: rank = 2 (2 linearly independent rows/cols)
• first guess, one column of U and one row of V^T per blob:
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
1 & 0 \\
0 & 1 \\
0 & 1
\end{bmatrix}
\times
\begin{bmatrix}
?? & 0 \\
0 & ??
\end{bmatrix}
\times
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
\]
• orthogonal??
SVD - Interpretation #3
• column vectors: are orthogonal - but not unit vectors; after normalizing:
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
1/\sqrt{3} & 0 \\
1/\sqrt{3} & 0 \\
1/\sqrt{3} & 0 \\
0 & 1/\sqrt{2} \\
0 & 1/\sqrt{2}
\end{bmatrix}
\times
\begin{bmatrix}
?? & 0 \\
0 & ??
\end{bmatrix}
\times
\begin{bmatrix}
1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} & 0 & 0 \\
0 & 0 & 0 & 1/\sqrt{2} & 1/\sqrt{2}
\end{bmatrix}
\]
• and the singular values are:
\[
L =
\begin{bmatrix}
3 & 0 \\
0 & 2
\end{bmatrix}
\]
• Q: How to check we are correct?
• A: SVD properties:
  – matrix product should give back matrix A
  – matrix U should be column-orthonormal, i.e., columns should be unit vectors, orthogonal to each other
  – ditto for matrix V
  – matrix L should be diagonal, with positive values

SVD - Complexity
• O(n * m * m) or O(n * n * m) (whichever is less)
• less work, if we just want singular values
• or if we want first k singular vectors
• or if the matrix is sparse [Berry]
• Implemented: in any linear algebra package (LINPACK, matlab, Splus, mathematica ...)

SVD - conclusions so far
• SVD: A = U L V^T: unique (*)
• U: document-to-concept similarities
• V: term-to-concept similarities
• L: strength of each concept
• dim. reduction: keep the first few strongest singular values (80-90% of ‘energy’)
  – SVD: picks up linear correlations
• SVD: picks up non-zero ‘blobs’

References
• Berry, Michael: http://www.cs.utk.edu/~lsi/
• Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press.
• Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press.
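The four checks listed for the ‘by inspection’ drill under Interpretation #3 can be run mechanically. The following is a minimal sketch, assuming Python with NumPy rather than one of the packages the slides name; the dictionary `checks` and its labels are illustrative. It enters the hand-derived U, L, V for the 5 x 5 drill matrix and tests each property, then compares L against the singular values the library computes.

```python
import numpy as np

# Drill matrix: a 3x3 blob and a 2x2 blob of ones.
A = np.array([[1, 1, 1, 0, 0]] * 3 + [[0, 0, 0, 1, 1]] * 2, dtype=float)

# Factors found by inspection.
a, b = 1 / np.sqrt(3), 1 / np.sqrt(2)
U = np.array([[a, 0], [a, 0], [a, 0], [0, b], [0, b]])
L = np.diag([3.0, 2.0])
V = np.array([[a, 0], [a, 0], [a, 0], [0, b], [0, b]])

l = np.diag(L)
checks = {
    "U L V^T gives back A": np.allclose(U @ L @ V.T, A),
    "U is column-orthonormal": np.allclose(U.T @ U, np.eye(2)),
    "V is column-orthonormal": np.allclose(V.T @ V, np.eye(2)),
    "L is diagonal": np.allclose(L, np.diag(l)),
    "L is positive and decreasing": bool(np.all(l > 0) and np.all(np.diff(l) <= 0)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")

# Cross-check: singular values only, which is the cheaper computation.
print(np.round(np.linalg.svd(A, compute_uv=False), 4))
```

Passing `compute_uv=False` corresponds to the "less work, if we just want" the diagonal of L remark on the complexity slide.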