SVD part-I

10-603/15-826A: Multimedia Databases and Data Mining
SVD - part I (definitions)
C. Faloutsos
Outline
Goal: ‘Find similar / interesting things’
• Intro to DB
• Indexing - similarity search
• Data Mining
Multi DB and D.M.
Copyright: C. Faloutsos (2001)
Indexing - Detailed outline
• primary key indexing
• secondary key / multi-key indexing
• spatial access methods
• fractals
• text
• Singular Value Decomposition (SVD)
• multimedia
• ...
SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
• Complexity
• Case studies
• Additional properties
SVD - Motivation
• problem #1: text - LSI: find ‘concepts’
• problem #2: compression / dim. reduction
SVD - Motivation
• problem #2: compress / reduce
dimensionality
Problem - specs
• ~10**6 rows; ~10**3 columns; no updates;
• random access to any cell(s) ; small error: OK
SVD - Definition
(reminder: matrix multiplication)
\[
\underbrace{\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}}_{3 \times 2}
\times
\underbrace{\begin{bmatrix} 1 \\ -1 \end{bmatrix}}_{2 \times 1}
=
\underbrace{\begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix}}_{3 \times 1}
\]
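The reminder above can be checked numerically. This is a minimal sketch using NumPy (an assumption; the slides do not name a tool), with the variable names `M`, `x` and `y` chosen only for illustration.

```python
import numpy as np

# The 3x2 matrix and the 2x1 column vector from the reminder.
M = np.array([[1, 2],
              [3, 4],
              [5, 6]])
x = np.array([[1],
              [-1]])

# (3x2) times (2x1) gives a 3x1 result: each entry is one row of M dotted with x.
y = M @ x
print(y.shape)    # (3, 1)
print(y.ravel())  # [-1 -1 -1]
```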
SVD - Definition
A[n x m] = U[n x r] L[r x r] (V[m x r])^T
• A: n x m matrix (e.g., n documents, m terms)
• U: n x r matrix (n documents, r concepts)
• L: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
• V: m x r matrix (m terms, r concepts)
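To make the shapes concrete, here is a small sketch with NumPy's `np.linalg.svd`. The sizes (6 documents, 4 terms) and the random data are hypothetical. Note two assumptions of this sketch: NumPy returns V^T rather than V, and with `full_matrices=False` it returns min(n, m) singular values, which equals the rank r only when the matrix has full rank.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
n, m = 6, 4                      # hypothetical sizes: 6 documents, 4 terms
A = rng.random((n, m))

# "Thin" SVD: s holds the diagonal of L, Vt is V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = len(s)                       # min(n, m); the rank for a generic dense A

L = np.diag(s)
print(U.shape, L.shape, Vt.shape)   # n x r, r x r, r x m

# Multiplying the three factors gives back A (up to rounding).
A_rebuilt = U @ L @ Vt
print(np.allclose(A, A_rebuilt))
```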
SVD - Properties
THEOREM [Press+92]: always possible to decompose matrix A into A = U L V^T, where
• U, L, V: unique (*)
• U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other)
– U^T U = I; V^T V = I (I: identity matrix)
• L: diagonal entries (the singular values) are positive, and sorted in decreasing order
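These properties are easy to test on any matrix. The sketch below uses a hypothetical random 5 x 3 matrix and NumPy; the flag names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed, for repeatability only
A = rng.random((5, 3))                  # hypothetical 5 x 3 data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                                # NumPy returns V^T, so transpose back

r = len(s)
u_orthonormal = np.allclose(U.T @ U, np.eye(r))   # U^T U = I
v_orthonormal = np.allclose(V.T @ V, np.eye(r))   # V^T V = I
sorted_desc = bool(np.all(np.diff(s) <= 0))       # decreasing order
all_positive = bool(np.all(s > 0))                # holds for a full-rank matrix
print(u_orthonormal, v_orthonormal, sorted_desc, all_positive)
```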
SVD - Example
• A = U L V^T - example (rows: 4 CS documents, then 3 MD documents; columns: the terms data, inf., retrieval, brain, lung):
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
\]
• first column of U and first row of V^T: CS-concept; second column of U and second row of V^T: MD-concept
• U: doc-to-concept similarity matrix
• 9.64: ‘strength’ of CS-concept
• V: term-to-concept similarity matrix
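The numbers in the example can be reproduced with a few lines of NumPy. This is a sketch, not the method used to prepare the slides; the tolerance used to decide the rank is an assumption, and absolute values are printed because singular vectors are only determined up to sign.

```python
import numpy as np

# Document-term matrix from the example: 4 CS documents, then 3 MD documents.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the singular values that are not numerically zero.
r = int(np.sum(s > 1e-10 * s[0]))
print(r)                               # 2
print(np.round(s[:r], 2))              # [9.64 5.29]

# Signs may be flipped relative to the slide, so compare magnitudes.
print(np.round(np.abs(U[:, :r]), 2))
print(np.round(np.abs(Vt[:r, :]), 2))
```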
SVD - Interpretation #1
‘documents’, ‘terms’ and ‘concepts’:
• U: document-to-concept similarity matrix
• V: term-to-concept sim. matrix
• L: its diagonal elements: ‘strength’ of each
concept
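One way to read this interpretation off the factors is to ask, for each document and each term, which concept has the largest weight. The sketch below does this for the CS/MD example with NumPy; the term ordering and the use of `argmax` on absolute values are assumptions of this illustration.

```python
import numpy as np

terms = ["data", "inf.", "retrieval", "brain", "lung"]   # assumed column order
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2   # two concepts: index 0 is the CS-concept, index 1 the MD-concept

# Strongest concept per document (rows of U) and per term (columns of V^T).
doc_concept = np.argmax(np.abs(U[:, :k]), axis=1)
term_concept = np.argmax(np.abs(Vt[:k, :]), axis=0)

print(doc_concept)    # [0 0 0 0 1 1 1]
print(dict(zip(terms, term_concept.tolist())))
```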
SVD - Interpretation #2
• best axis to project on: (‘best’ = min sum of
squares of projection errors)
SVD - interpretation #2
• SVD: gives the best axis to project on (v1)
• minimum RMS error
SVD - Interpretation #2
• A = U L V^T - example (same matrices as in ‘SVD - Example’):
– v1: the first row of V^T, [0.58 0.58 0.58 0 0], is the best axis to project on
– 9.64 (first diagonal entry of L): variance (‘spread’) on the v1 axis
– U L gives the coordinates of the points on the projection axes
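The claim that U L holds the coordinates of the points on the projection axes can be checked directly: projecting the rows of A onto v1, v2 (that is, computing A V) should give the same numbers. A minimal NumPy sketch, with illustrative variable names:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# U L: scale column j of U by the j-th singular value.
coords_UL = U[:, :k] * s[:k]

# A V: dot product of every row of A with the axes v1 and v2.
coords_AV = A @ Vt[:k, :].T

print(np.allclose(coords_UL, coords_AV))
print(np.round(np.abs(coords_UL), 2))   # magnitudes; signs depend on the solver
```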
SVD - Interpretation #2
• More details
• Q: how exactly is dim. reduction done?
• A: set the smallest singular values to zero (here 5.29), which drops the matching column of U and row of V^T:
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
\approx
\begin{bmatrix}
0.18 \\ 0.36 \\ 0.18 \\ 0.90 \\ 0 \\ 0 \\ 0
\end{bmatrix}
\times
9.64
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0
\end{bmatrix}
\approx
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]
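The truncation step can be sketched as follows, assuming NumPy. Zeroing every singular value except the largest leaves the CS block intact and wipes out the MD block; the Frobenius norm of the error then equals the dropped singular value.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the strongest concept: set all other singular values to zero.
s_trunc = s.copy()
s_trunc[1:] = 0.0
A1 = U @ np.diag(s_trunc) @ Vt

# Error of the rank-1 approximation, measured with the Frobenius norm.
err = np.linalg.norm(A - A1)
print(np.round(np.abs(A1), 2))
print(round(float(err), 2))   # 5.29
```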
SVD - Interpretation #2
Exactly equivalent: ‘spectral decomposition’ of the matrix:
\[
A
=
\begin{bmatrix} u_1 & u_2 & \cdots \end{bmatrix}
\times
\begin{bmatrix} l_1 & & \\ & l_2 & \\ & & \ddots \end{bmatrix}
\times
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \end{bmatrix}
=
l_1 u_1 v_1^T + l_2 u_2 v_2^T + \cdots
\]
• A: n x m (in the example, the 7 x 5 document-term matrix)
• r terms in the sum
• each u_i: n x 1; each v_i^T: 1 x m
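The equivalence between the matrix product and the sum of rank-1 terms can be verified numerically. A short NumPy sketch on the example matrix, where `np.outer` builds each n x m term u_i v_i^T:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2   # rank of the example matrix

# Sum of r rank-1 matrices: l_i * (n x 1 column) * (1 x m row).
spectral_sum = np.zeros_like(A)
for i in range(r):
    spectral_sum += s[i] * np.outer(U[:, i], Vt[i, :])

print(np.allclose(A, spectral_sum))
```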
SVD - Interpretation #2
approximation / dim. reduction:
by keeping the first few terms (Q: how many?)
\[
A = l_1 u_1 v_1^T + l_2 u_2 v_2^T + \cdots, \qquad l_1 \ge l_2 \ge \cdots
\]
A (heuristic - [Fukunaga]): keep 80-90% of ‘energy’ (= sum of squares of l_i’s)
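The energy heuristic can be sketched as a small helper. The function name `rank_for_energy` and its `fraction` parameter are hypothetical, introduced only for this illustration; the input is the list of singular values sorted in decreasing order.

```python
import numpy as np

def rank_for_energy(singular_values, fraction=0.9):
    """Smallest k whose first k singular values keep `fraction` of the energy."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / np.sum(energy)
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(energy))   # guard against rounding at fraction = 1.0

# Singular values of the CS/MD example: energies are about 93 and 28.
s = [9.64, 5.29]
k_90 = rank_for_energy(s, 0.90)   # the first term alone keeps only ~77%
k_75 = rank_for_energy(s, 0.75)
print(k_90, k_75)                 # 2 1
```

In the example the first concept carries roughly 77% of the energy, so an 80-90% target keeps both terms, while a looser 75% target would keep only the CS-concept.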
SVD - Detailed outline
• Motivation
• Definition - properties
• Interpretation
– #1: documents/terms/concepts
– #2: dim. reduction
– #3: picking non-zero, rectangular ‘blobs’
• Complexity
• Case studies
• Additional properties
SVD - Interpretation #3
• finds non-zero ‘blobs’ in a data matrix
• in the example of ‘SVD - Example’, each concept picks out one rectangular block of non-zero entries:
– CS-concept (9.64): rows 1-4, columns 1-3
– MD-concept (5.29): rows 5-7, columns 4-5
SVD - Interpretation #3
• Drill: find the SVD, ‘by inspection’!
• Q: rank = ??
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
= \; ?? \times ?? \times ??
\]
• A: rank = 2 (2 linearly independent rows/cols)
• first attempt: use the 0/1 block patterns as the columns of U and the rows of V^T
\[
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
1 & 0 \\
0 & 1 \\
0 & 1
\end{bmatrix}
\times
\begin{bmatrix}
?? & 0 \\
0 & ??
\end{bmatrix}
\times
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
\]
– orthogonal??
• column vectors: are orthogonal - but not unit vectors; normalize them
• and the singular values are 3 and 2:
\[
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
1/\sqrt{3} & 0 \\
1/\sqrt{3} & 0 \\
1/\sqrt{3} & 0 \\
0 & 1/\sqrt{2} \\
0 & 1/\sqrt{2}
\end{bmatrix}
\times
\begin{bmatrix}
3 & 0 \\
0 & 2
\end{bmatrix}
\times
\begin{bmatrix}
1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3} & 0 & 0 \\
0 & 0 & 0 & 1/\sqrt{2} & 1/\sqrt{2}
\end{bmatrix}
\]
• Q: How to check we are correct?
SVD - Interpretation #3
• A: SVD properties:
– matrix product should give back matrix A
– matrix U should be column-orthonormal, i.e.,
columns should be unit vectors, orthogonal to
each other
– ditto for matrix V
– matrix L should be diagonal, with positive
values
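The checklist above translates directly into code. The sketch below applies it to the drill's hand-derived factors using NumPy; the dictionary of checks and its key names are illustrative only.

```python
import numpy as np

# The drill matrix and the factors found 'by inspection'.
A = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)
a, b = 1 / np.sqrt(3), 1 / np.sqrt(2)
U = np.array([[a, 0], [a, 0], [a, 0], [0, b], [0, b]])
L = np.diag([3.0, 2.0])
V = np.array([[a, 0], [a, 0], [a, 0], [0, b], [0, b]])

checks = {
    "product gives back A": np.allclose(U @ L @ V.T, A),
    "U column-orthonormal": np.allclose(U.T @ U, np.eye(2)),
    "V column-orthonormal": np.allclose(V.T @ V, np.eye(2)),
    "L diagonal": np.allclose(L, np.diag(np.diag(L))),
    "L positive": bool(np.all(np.diag(L) > 0)),
}
for name, ok in checks.items():
    print(name, bool(ok))
```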
SVD - Complexity
• O( n * m * m) or O( n * n * m) (whichever is less)
• less work, if we just want singular values
– or if we want first k singular vectors
– or if the matrix is sparse [Berry]
• Implemented: in any linear algebra package (LINPACK, MATLAB, S-PLUS, Mathematica ...)
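As one illustration of the "just the singular values" case, NumPy's `np.linalg.svd` accepts `compute_uv=False` and then skips building U and V. The matrix size below is hypothetical, and no timing is claimed here. For the first-k and sparse cases, routines such as SciPy's `scipy.sparse.linalg.svds` exist but are not used in this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)     # arbitrary seed, for repeatability only
A = rng.random((200, 50))          # hypothetical dense 200 x 50 matrix

# Singular values only: U and V are never formed.
s_only = np.linalg.svd(A, compute_uv=False)

# Full thin SVD for comparison: same singular values, plus the factors.
U, s_full, Vt = np.linalg.svd(A, full_matrices=False)

print(s_only.shape)                    # (50,)
print(np.allclose(s_only, s_full))
```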
SVD - conclusions so far
• SVD: A = U L V^T : unique (*)
– U: document-to-concept similarities
– V: term-to-concept similarities
– L: strength of each concept
• dim. reduction: keep the first few strongest singular values (80-90% of ‘energy’)
– SVD: picks up linear correlations
• SVD: picks up non-zero ‘blobs’
References
• Berry, Michael: http://www.cs.utk.edu/~lsi/
• Fukunaga, K. (1990). Introduction to Statistical Pattern
Recognition, Academic Press.
• Press, W. H., S. A. Teukolsky, et al. (1992). Numerical
Recipes in C, Cambridge University Press.