Dimensionality Reduction

Given N vectors in n dimensions, find the k most important axes to project them on
- k is user-defined (k < n)
Applications: information retrieval & indexing
- identify the k most important features, or
- reduce the indexing dimensions for faster retrieval (low-dimensional indices are faster)
Techniques
Eigenvalue analysis techniques [NR'92]
- Karhunen-Loeve (K-L) transform
- Singular Value Decomposition (SVD)
- both need O(N^2) time
FastMap [Faloutsos & Lin 95]
- dimensionality reduction and mapping of objects to vectors
- O(N) time
Mathematical Preliminaries
For an n×n square matrix S, a unit vector x, and a scalar value λ such that Sx = λx:
- x: eigenvector of S
- λ: eigenvalue of S
The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real.
Rank r of a matrix: the maximum number of independent columns (or rows).
Example 1
Intuition: S defines an affine transform y = Sx that involves scaling and rotation
- eigenvectors: unit vectors along the new directions
- eigenvalues denote the scaling
Eigenvectors of
$$S = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}:\qquad
\lambda_1 = 3.62,\; u_1 = \begin{pmatrix} 0.52 \\ 0.85 \end{pmatrix} \text{ (major axis)};\qquad
\lambda_2 = 1.38,\; u_2 = \begin{pmatrix} 0.85 \\ -0.52 \end{pmatrix}$$
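These eigenpairs are easy to verify numerically; a minimal numpy sketch (not part of the original slides; eigenvector signs may flip):

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 3.0]])

    # eigh: eigendecomposition for symmetric matrices,
    # eigenvalues returned in ascending order
    eigvals, eigvecs = np.linalg.eigh(S)

    # sort in descending eigenvalue order (largest = major axis)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    print(np.round(eigvals, 2))        # [3.62 1.38]
    print(np.round(eigvecs[:, 0], 2))  # +/- [0.53 0.85]  (major axis u1)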
Example 2
If S is real and symmetric (S = S^T) then it can be written as S = UΛU^T
- the columns of U are the eigenvectors of S
- U: column-orthonormal (UU^T = I)
- Λ: diagonal, with the eigenvalues of S
2 1 0.52 0.85  3.62 0  0.52 0.85 
S


  0 1.38 0.85  0.52
1
3
0
.
85

0
.
52

 



Karhunen-Loeve (K-L)
Project onto a k-dimensional space (k < n) minimizing the error of the projections (sum of squared differences)
K-L gives a linear combination of axes sorted by importance
- keep the first k dimensions
[Figure: 2-dim points and the two K-L directions; for k = 1, keep the x' axis]
Computation of K-L
- Put the N vectors as rows in A = [a_{ip}]
- Compute B = [a_{ip} - \bar a_p], where $\bar a_p = \frac{1}{N}\sum_{i=1}^{N} a_{ip}$
- Covariance matrix: C = B^T B
- Compute the eigenvectors of C
- Sort them in decreasing eigenvalue order
- Approximate each object by its projections on the directions of the first k eigenvectors
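A minimal numpy sketch of these steps (illustrative, not from the slides; A and k are the inputs defined above):

    import numpy as np

    def kl_transform(A, k):
        """Project the rows of A onto the first k K-L directions."""
        B = A - A.mean(axis=0)                 # shift origin to the center of gravity
        C = B.T @ B                            # covariance matrix C = B^T B
        eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric
        order = np.argsort(eigvals)[::-1]      # decreasing eigenvalue order
        Uk = eigvecs[:, order[:k]]             # first k eigenvectors
        return B @ Uk                          # projections on those directions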
Intuition
- B shifts the origin to the center of gravity of the vectors (by \bar a_p) and has zero column mean
- C represents attribute-to-attribute similarity
- C is square, real, and symmetric
- eigenvectors and eigenvalues are computed on C, not on A
- C denotes the affine transform that minimizes the projection error
- approximate each vector by its projections along the first k eigenvectors
Example
Input vectors: [1 2], [1 1], [0 0]. Then
$$A = \begin{pmatrix} 1 & 2 \\ 1 & 1 \\ 0 & 0 \end{pmatrix}$$
with column averages 2/3 and 1, so
$$B = \begin{pmatrix} 1/3 & 1 \\ 1/3 & 0 \\ -2/3 & -1 \end{pmatrix},\qquad
C = B^T B = \begin{pmatrix} 2/3 & 1 \\ 1 & 2 \end{pmatrix}$$
$$\lambda_1 = 2.53,\; u_1 = \begin{pmatrix} 0.47 \\ 0.88 \end{pmatrix};\qquad
\lambda_2 = 0.13,\; u_2 = \begin{pmatrix} -0.88 \\ 0.47 \end{pmatrix}$$
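Running the kl_transform sketch from the previous slide on this data (hypothetical usage; signs may flip) reproduces these numbers:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [1.0, 1.0],
                  [0.0, 0.0]])

    B = A - A.mean(axis=0)                       # column averages 2/3 and 1
    C = B.T @ B
    print(np.round(np.linalg.eigvalsh(C), 2))    # [0.13 2.53] (ascending)
    print(np.round(kl_transform(A, k=1), 2))     # 1-dim projections along u1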
SVD
For general rectangular matrices
- N×n matrix (N vectors, n dimensions)
- groups similar entities (documents) together
- groups similar terms together; each group of terms corresponds to a concept
Given an N×n matrix A, write it as A = UΛV^T
- U: N×r column-orthonormal matrix (r: rank of A)
- Λ: r×r diagonal matrix of non-negative values in descending order
- V: n×r column-orthonormal matrix (so V^T is r×n)
SVD (cont'd)
$$A = \lambda_1 u_1 v_1^T + \lambda_2 u_2 v_2^T + \dots + \lambda_r u_r v_r^T$$
- the u_i, v_i are the column vectors of U, V
- SVD identifies rectangular "blobs" of related values in A
- the rank r of A: the number of blobs
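A minimal numpy sketch of the decomposition and of the rank-1 sum above (illustrative; numpy returns the diagonal of Λ as a vector s):

    import numpy as np

    A = np.random.rand(8, 5)                   # any N x n matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # rebuild A as a sum of rank-1 "blobs": lambda_i * u_i * v_i^T
    A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
    assert np.allclose(A, A_rebuilt)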
Example

    Term/Document   data  information  retrieval  brain  lung
    CS-TR1             1            1          1      0     0
    CS-TR2             2            2          2      0     0
    CS-TR3             1            1          1      0     0
    CS-TR4             5            5          5      0     0
    MED-TR1            0            0          0      2     2
    MED-TR2            0            0          0      3     3
    MED-TR3            0            0          0      1     1

Two types of documents: CS and Medical
Two concepts (groups of terms):
- CS: data, information, retrieval
- Medical: brain, lung
Example (cont'd)
$$A = U\Lambda V^T =
\begin{pmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{pmatrix}
\begin{pmatrix} 9.64 & 0 \\ 0 & 5.29 \end{pmatrix}
\begin{pmatrix} 0.58 & 0.58 & 0.58 & 0 & 0 \\ 0 & 0 & 0 & 0.71 & 0.71 \end{pmatrix}$$
r = 2
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- v_{12} = 0: "data" has zero similarity with the 2nd (medical) concept
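A quick numpy check of this decomposition (signs may flip, hence the abs; values rounded to two decimals):

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(np.round(s[:2], 2))             # [9.64 5.29]
    print(np.round(np.abs(U[:, :2]), 2))  # columns: document-to-concept similarities
    print(np.round(np.abs(Vt[:2]), 2))    # rows: concept-to-term similarities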
SVD and LSI
SVD leads to "Latent Semantic Indexing" (LSI)
(http://lsi.research.telcordia.com/lsi/LSIpapers.html)
- terms that occur together are grouped into concepts
- when a user searches for a term, the system determines the relevant concepts to search
- LSI maps documents and queries to vectors in the concept space instead of the n-dimensional term space
- the concept space has lower dimensionality
Examples of Queries
Find documents with the term "data":
$$q = (1, 0, 0, 0, 0)^T$$
Translate the query vector q to concept space:
$$q_c = V^T q =
\begin{pmatrix} 0.58 & 0.58 & 0.58 & 0 & 0 \\ 0 & 0 & 0 & 0.71 & 0.71 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 0.58 \\ 0 \end{pmatrix}$$
- the query is related to the CS concept and unrelated to the medical concept
- LSI also returns documents that contain the terms "retrieval" and "information", which are not specified by the query
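Continuing the numpy sketch from the SVD example above (hypothetical usage, reusing its Vt), the mapping to concept space is a single matrix-vector product:

    q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: the term "data"
    qc = Vt[:2] @ q                               # project into the 2-dim concept space
    print(np.round(np.abs(qc), 2))                # [0.58 0.]  -> CS concept only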
FastMap
Works with distances and has two roles:
1. Maps objects to vectors so that their distances are preserved (then apply SAMs for indexing)
2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that the distances are preserved as much as possible
Main idea
Pretend that the objects are points in some unknown n-dimensional space
- project these points on k mutually orthogonal axes
- compute the projections using distances only
The heart of FastMap is the method that projects two objects on a line:
- take two objects that are far apart (the pivots)
- project all objects on the line that connects the pivots
Project Objects on a Line
Apply the cosine law:
$$d_{bi}^2 = d_{ai}^2 + d_{ab}^2 - 2 x_i d_{ab}
\;\Rightarrow\;
x_i = \frac{d_{ai}^2 + d_{ab}^2 - d_{bi}^2}{2 d_{ab}}$$
- O_a, O_b: pivots; O_i: any object
- d_{ij}: shorthand for the distance D(O_i, O_j)
- x_i: the first coordinate in the k-dimensional space
- if O_i is close to O_a, x_i is small
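A direct transcription of this formula (a sketch; dist stands for whatever distance function D the application supplies):

    def project_on_line(dist, Oa, Ob, Oi):
        """First coordinate of Oi on the line through pivots Oa, Ob (cosine law)."""
        d_ab = dist(Oa, Ob)
        d_ai = dist(Oa, Oi)
        d_bi = dist(Ob, Oi)
        return (d_ai**2 + d_ab**2 - d_bi**2) / (2 * d_ab)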
Choose Pivots
[Algorithm box: the pivot-selection heuristic; see the sketch below]
- Complexity: O(N)
- the optimal algorithm would require O(N^2) time
- steps 2, 3 can be repeated 4-5 times to improve the accuracy of the selection
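The algorithm box itself did not survive extraction; the sketch below follows the pivot heuristic described in the FastMap paper (my paraphrase, so treat the details as assumptions): seed with an arbitrary object (step 1), take the object farthest from it (step 2), then the object farthest from that one (step 3), iterating a few times.

    import random

    def choose_pivots(objects, dist, iters=5):
        """Heuristically pick two far-apart objects, O(N) distance calls per pass."""
        Ob = random.choice(objects)                        # step 1: arbitrary seed
        for _ in range(iters):                             # repeat steps 2, 3
            Oa = max(objects, key=lambda o: dist(o, Ob))   # step 2: farthest from Ob
            Ob = max(objects, key=lambda o: dist(o, Oa))   # step 3: farthest from Oa
        return Oa, Ob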
Extension for Many Dimensions
Consider the (n-1)-dimensional hyperplane H that is perpendicular to the line O_a O_b
- project the objects on H and apply the previous step
- choose two new pivots
- the new x_i is the next coordinate of each object
- repeat this step until k-dimensional vectors are obtained
The distance on H is not D
- D': distance between the projected objects
Distance on the Hyper-Plane H
Pythagorean theorem:
$$D'(O_i', O_j')^2 = D(O_i, O_j)^2 - (x_i - x_j)^2$$
- D' on H can be computed from the Pythagorean theorem
- the ability to compute D' allows a second line to be computed on H, and so on
Algorithm
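The original slide presents the full FastMap pseudocode as a figure that did not survive extraction. Below is a minimal Python sketch assembled from the preceding slides and the FastMap paper (names and structure are my own, so treat it as an illustration rather than the authors' exact pseudocode):

    import math, random

    def fastmap(objects, dist, k):
        """Map each object to a k-dim vector so distances are roughly preserved."""
        n = len(objects)
        coords = [[0.0] * k for _ in range(n)]
        pivots = []                          # pivot pair recorded per dimension

        def d_h(i, j, col):
            """Distance projected on the current hyperplane (Pythagorean theorem)."""
            d2 = dist(objects[i], objects[j]) ** 2
            for c in range(col):
                d2 -= (coords[i][c] - coords[j][c]) ** 2
            return math.sqrt(max(d2, 0.0))   # clamp negative values to 0

        for col in range(k):
            # choose pivots: farthest-pair heuristic on the projected distance
            b = random.randrange(n)
            for _ in range(5):
                a = max(range(n), key=lambda i: d_h(i, b, col))
                b = max(range(n), key=lambda i: d_h(i, a, col))
            d_ab = d_h(a, b, col)
            if d_ab == 0.0:                  # all remaining distances are zero
                break
            pivots.append((a, b))
            for i in range(n):               # cosine-law projection on the pivot line
                coords[i][col] = (d_h(a, i, col) ** 2 + d_ab ** 2
                                  - d_h(b, i, col) ** 2) / (2 * d_ab)

        return coords, pivots

Each recursive level picks two pivots, records them, and fills one coordinate per object using the cosine-law projection on the hyperplane distance D'.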
Observations
Complexity: O(kN) distance calculations
- k: the desired dimensionality
- k recursive calls, each taking O(N)
The algorithm records the pivots of each call (dimension) to facilitate queries:
- a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension
- O(1) distance computations per step: no need to recompute pivots
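A sketch of that query mapping, reusing the pivots recorded by the fastmap sketch above (hypothetical helper; the pivot-pair distances could be cached to avoid recomputing them):

    import math

    def map_query(q, objects, coords, pivots, dist):
        """Map a query object to k dims using the pivots recorded by fastmap()."""
        qc = []                                    # query coordinates, one per dim
        for col, (a, b) in enumerate(pivots):
            # squared distances projected on the current hyperplane (Pythagoras)
            d_ab2 = dist(objects[a], objects[b]) ** 2
            d_qa2 = dist(q, objects[a]) ** 2
            d_qb2 = dist(q, objects[b]) ** 2
            for c in range(col):
                d_ab2 -= (coords[a][c] - coords[b][c]) ** 2
                d_qa2 -= (qc[c] - coords[a][c]) ** 2
                d_qb2 -= (qc[c] - coords[b][c]) ** 2
            d_ab = math.sqrt(max(d_ab2, 0.0))
            qc.append((max(d_qa2, 0.0) + d_ab**2 - max(d_qb2, 0.0)) / (2 * d_ab))
        return qc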
Observations (cont'd)
The projected vectors can be indexed
- a mapping to 2-3 dimensions allows for visualization of the data space
Assumes a Euclidean space (triangle inequality)
- not always true (at least after the second step)
The pivots are only approximate
- some (squared) projected distances come out negative
- turn negative distances into 0
Application: Document Vectors
Document distance from cosine similarity (θ: the angle between the two document vectors):
$$\mathrm{distance}(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2(1 - \cos\theta)} = \sqrt{2(1 - \mathrm{similarity}(d_1, d_2))}$$
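A one-line sketch of this conversion (similarities assumed in [-1, 1]); its output can serve as the dist function of the fastmap sketch above:

    import numpy as np

    def doc_distance(similarity):
        """Chord distance 2*sin(theta/2) from cosine similarity."""
        return np.sqrt(2.0 * (1.0 - similarity))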
FastMap on 10 documents for 2 & 3 dims
[Figure: the resulting placements for (a) k = 2 and (b) k = 3]
References
- C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996
- W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988
- LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html
- C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data Mining and Visualization of Traditional and Multimedia Datasets", Proc. of SIGMOD, 1995