Singular Value Decomposition in Text Mining
Ram Akella
University of California, Berkeley
Silicon Valley Center/SC
Lecture 4b
February 9, 2011
Class Outline
 Summary of last lecture
 Indexing
 Vector Space Models
 Matrix Decompositions
 Latent Semantic Analysis
 Mechanics
 Example
Summary of previous class
 Principal Component Analysis
 Singular Value Decomposition
 Uses
 Mechanics
 Example: swap rates
Introduction
How can we retrieve information using a search engine?
 We can represent the query and the documents as vectors (vector space model).
 However, before constructing these vectors we must perform some preliminary document preparation.
 Documents are retrieved by finding the closest distance between the query vector and the document vectors.
 Which distance is the most suitable for retrieving documents?
Search engine
Document File Preparation
 Manual Indexing
 Relationships and concepts between topics can be established.
 It is expensive and time consuming.
 It may not be reproducible if it is destroyed.
 The huge amount of information available suggests a more automated system.
Document File Preparation
Automatic indexing
To build an automatic index, we need to perform two steps:
 Document analysis: decide what information or which parts of the document should be indexed.
 Token analysis: decide which words should be used in order to obtain the best representation of the semantic content of the documents.
Document Normalization
After this preliminary analysis we need to perform further preprocessing of the data:
 Remove stop words
 Function words: a, an, as, for, in, of, the, ...
 Other frequent words
 Stemming
 Group morphological variants
 Plurals: "streets" -> "street"
 Adverbs: "fully" -> "full"
 Current stemming algorithms can make some mistakes, e.g. "police", "policy" -> "polic"
File Structures
Once we have eliminated the stop words and applied the stemmer to the documents, we can construct:
Document File
 We extract the terms that should be used in the index and assign a number to each document.
File Structures
Dictionary
 We construct a searchable dictionary of terms by arranging them alphabetically and indicating the frequency of each term in the collection.

Term        Global Frequency
banana      1
cranb       2
Hanna       2
hunger      1
manna       1
meat        1
potato      1
query       1
rye         2
sourdough   1
spiritual   1
wheat       2
File Structures
Inverted List
 For each term we record the documents in which it appears and the position of the term within them (a small Python sketch follows the table).

Term        (Doc, Position)
banana      (5,7)
cranb       (4,5); (6,4)
Hanna       (1,7); (8,2)
hunger      (9,4)
manna       (2,6)
meat        (7,6)
potato      (4,3)
query       (3,8)
rye         (3,3); (6,3)
sourdough   (5,5)
spiritual   (7,5)
wheat       (3,5); (6,6)
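As an illustration (not part of the original slides), here is a minimal Python sketch of building such an inverted list; the toy documents and their numbering are assumptions made for the example.

```python
from collections import defaultdict

# Toy tokenized documents (already stop-worded and stemmed) -- illustrative only.
docs = {
    3: ["rye", "wheat", "query"],
    5: ["sourdough", "banana"],
    6: ["cranb", "rye", "wheat"],
}

# Map each term to the (document, position) pairs where it occurs.
inverted = defaultdict(list)
for doc_id, tokens in docs.items():
    for pos, term in enumerate(tokens, start=1):
        inverted[term].append((doc_id, pos))

# Print the alphabetically sorted inverted list, as in the table above.
for term in sorted(inverted):
    print(term, inverted[term])
```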
Vector Space Model
 The vector space model can be used to represent terms and documents in a text collection.
 A collection of n documents indexed by m terms is represented by an m × n matrix A, whose rows correspond to the terms and whose columns correspond to the documents.
 Once we have constructed the matrix, we can normalize its columns so that each document vector has unit length (see the sketch below).
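As an illustration (not from the original slides), the following numpy sketch builds a small term-by-document count matrix and normalizes its columns to unit length; the toy documents are assumptions made for the example.

```python
import numpy as np

# Toy tokenized documents (already stop-worded and stemmed) -- illustrative only.
documents = [
    ["wheat", "rye", "sourdough"],
    ["banana", "manna", "wheat"],
    ["query", "rye"],
]

# Vocabulary (matrix rows) and the m x n term-by-document count matrix A.
terms = sorted({t for doc in documents for t in doc})
A = np.zeros((len(terms), len(documents)))
for j, doc in enumerate(documents):
    for t in doc:
        A[terms.index(t), j] += 1

# Normalize each column so that every document vector has unit Euclidean length.
A_unit = A / np.linalg.norm(A, axis=0, keepdims=True)
print(terms)
print(np.round(A_unit, 3))
```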
Vector Space Models
Query Matching
 If we want to retrieve a document we should:
 Transform the query into a vector.
 Look for the document vector most similar to the query vector. One of the most common similarity measures is the cosine between the vectors, defined as

$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}$$

where $a_j$ is the j-th document vector and $q$ is the query vector.
Example:
 Using the book titles, we want to retrieve books about "Child Proofing".
 Over the dictionary terms, the query vector is q = (0, 1, 0, 0, 0, 1, 0, 0)^T.
 The resulting cosines are cos θ_2 = cos θ_3 = 0.4082 and cos θ_5 = cos θ_6 = 0.500.
 With a threshold of 0.5, the 5th and 6th books would be retrieved (a numpy sketch of this kind of matching follows).
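A minimal numpy sketch of cosine-based query matching (illustrative only; the term-by-document matrix, the query, and the threshold are assumptions, not the book-title data from the slide):

```python
import numpy as np

# Illustrative term-by-document matrix (columns are documents) and query vector.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([0.0, 1.0, 1.0, 0.0])

# cos(theta_j) = a_j^T q / (||a_j||_2 ||q||_2) for every document column a_j.
scores = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Retrieve the documents whose cosine exceeds a chosen threshold.
threshold = 0.5
retrieved = np.where(scores >= threshold)[0]
print(np.round(scores, 4), retrieved)
```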
Term weighting
 In order to improve retrieval, we can give some terms more weight than others.
 Weights are usually built from a local term weight (the importance of term i within document j) combined with a global term weight (the importance of term i across the whole collection).
 Two quantities used in these weighting schemes are

$$\chi(r) = \begin{cases} 1 & \text{if } r > 0 \\ 0 & \text{if } r = 0 \end{cases}
\qquad\text{and}\qquad
p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}},$$

where $f_{ij}$ is the frequency of term i in document j.
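The specific local and global weighting formulas on the original slide did not survive extraction. As one common instance of a local-times-global scheme (an assumption here, not necessarily the scheme used in the lecture), this sketch applies a tf-idf-style weighting in numpy:

```python
import numpy as np

def tfidf_weight(F):
    """Weight a term-by-document frequency matrix with a local * global scheme.

    Local weight: log(1 + f_ij). Global weight: idf_i = log(n / df_i), where
    df_i is the number of documents containing term i. tf-idf is only one of
    several possible local/global combinations.
    """
    n = F.shape[1]
    local = np.log1p(F)                   # dampened within-document frequency
    df = np.count_nonzero(F, axis=1)      # document frequency of each term
    global_w = np.log(n / df)             # rarer terms receive larger weights
    return local * global_w[:, None]

F = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.0, 3.0, 0.0]])
print(np.round(tfidf_weight(F), 3))
```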
Synonymy and Polysemy
 Synonymy: documents drawing on the terms {auto, engine, bonnet, tyres, lorry, boot} and {car, emissions, hood, make, model, trunk} will have a small cosine, but they are related.
 Polysemy: documents drawing on the terms {car, emissions, hood, make, model, trunk} and {make, hidden, Markov, model, emissions, normalize} will have a large cosine, but they are not truly related.
Matrix Decomposition
To produce a reduced-rank approximation of the document matrix, we first need to be able to identify the dependence between columns (documents) and rows (terms). Two useful decompositions are:
 QR Factorization
 SVD Decomposition
QR Factorization
 The document matrix A can be decomposed as

$$A = QR$$

where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix.
 This factorization can be used to determine a basis for the column space of A.
 It can also be used to describe the semantic content of the corresponding text collection.
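As a quick illustration (not part of the original slides), numpy's QR routine can be used to check the factorization on a small term-by-document matrix:

```python
import numpy as np

# Small illustrative term-by-document matrix (m = 4 terms, n = 3 documents).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# "complete" mode returns the full m x m orthogonal Q and the m x n upper-triangular R.
Q, R = np.linalg.qr(A, mode="complete")

print(np.allclose(Q @ R, A))             # A = QR
print(np.allclose(Q.T @ Q, np.eye(4)))   # Q is orthogonal
print(np.allclose(R, np.triu(R)))        # R is upper triangular
```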
Example
(Numeric Q and R factors of the example term-by-document matrix A, as shown on the original slides.)
Query Matching
 We can rewrite the cosine using this decomposition:

$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \,\|q\|_2}
             = \frac{(Q_1 r_j)^T q}{\|Q_1 r_j\|_2 \,\|q\|_2}
             = \frac{r_j^T (Q_1^T q)}{\|r_j\|_2 \,\|q\|_2}$$

 where $r_j$ refers to column j of the matrix R, and the last step uses the fact that $Q_1$ has orthonormal columns, so $\|Q_1 r_j\|_2 = \|r_j\|_2$.
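A minimal check of this identity (an illustration, not from the slides), using the thin QR factorization so that Q1 has orthonormal columns:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])

Q1, R = np.linalg.qr(A)  # thin factorization: Q1 has orthonormal columns

# Cosines computed directly and through the factorization agree.
direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
via_qr = (R.T @ (Q1.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))
print(np.allclose(direct, via_qr))
```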
Singular Value Decomposition (SVD)
 This decomposition provides reduced-rank approximations in both the column space and the row space of the document matrix.
 It is defined as

$$A = U \Sigma V^T,$$

where U is m × m, Σ is m × n, and V is n × n.
 The columns of U are orthonormal eigenvectors of $AA^T$, the columns of V are orthonormal eigenvectors of $A^T A$, and the singular values $\sigma_1 \ge \dots \ge \sigma_r$ on the diagonal of Σ are the square roots of the nonzero eigenvalues of $A^T A$ (equivalently, of $AA^T$).
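An illustrative numpy sketch (not part of the slides) showing the decomposition and the relationship between singular values and eigenvalues:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild A from the factors: place the singular values in the m x n matrix Sigma.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))

# The singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s, np.sqrt(eigvals)))
```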
Latent Semantic Analysis (LSA)
 It is the application of the SVD in text mining.
 We decompose the term-document matrix A into the three matrices U, Σ, and V.
 In the orientation used in the example below (rows of A are terms, columns are documents), the rows of U correspond to terms and the rows of V correspond to documents.
Latent Semantic Analysis
 Once we have decomposed the document matrix A we can reduce its rank.
 This allows us to account for synonymy and polysemy in the retrieval of documents.
 Select the vectors associated with the largest singular values σ in each matrix, set the rest to zero, and reconstruct the matrix as $A_k = U_k \Sigma_k V_k^T$.
Latent Semantic Analysis
Query Matching
 The cosines between the query vector q and the n document vectors can be written as

$$\cos\theta_j = \frac{(A_k e_j)^T q}{\|A_k e_j\|_2 \,\|q\|_2}
             = \frac{(U_k \Sigma_k V_k^T e_j)^T q}{\|U_k \Sigma_k V_k^T e_j\|_2 \,\|q\|_2},$$

 where $e_j$ is the j-th canonical vector of dimension n.
 This formula can be simplified as

$$\cos\theta_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \,\|q\|_2},
\qquad s_j = \Sigma_k V_k^T e_j,
\qquad j = 1, 2, \ldots, n.$$
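A small numpy sketch of this simplification (illustrative only; the matrix, query, and rank k are assumptions made for the example):

```python
import numpy as np

def lsa_cosines(A, q, k):
    """Cosines between query q and the documents of A in a rank-k LSA space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    S = Sk @ Vtk                     # column j of S is s_j = Sigma_k V_k^T e_j
    num = S.T @ (Uk.T @ q)           # s_j^T (U_k^T q) for every document j
    den = np.linalg.norm(S, axis=0) * np.linalg.norm(q)
    return num / den

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])
print(np.round(lsa_cosines(A, q, k=2), 4))
```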
Example
Apply the LSA method to the following technical memo titles:
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Example
First we construct the term-by-document matrix:

            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
Example
The resulting decomposition is the following:

U = (rows correspond to the terms in the order listed above)
 0.22  -0.11   0.29  -0.41  -0.11  -0.34   0.52  -0.06  -0.41
 0.20  -0.07   0.14  -0.55   0.28   0.50  -0.07  -0.01  -0.11
 0.24   0.04  -0.16  -0.59  -0.11  -0.25  -0.30   0.06   0.49
 0.40   0.06  -0.34   0.10   0.33   0.38   0.00   0.00   0.01
 0.64  -0.17   0.36   0.33  -0.16  -0.21  -0.17   0.03   0.27
 0.27   0.11  -0.43   0.07   0.08  -0.17   0.28  -0.02  -0.05
 0.27   0.11  -0.43   0.07   0.08  -0.17   0.28  -0.02  -0.05
 0.30  -0.14   0.33   0.19   0.11   0.27   0.03  -0.02  -0.17
 0.21   0.27  -0.18  -0.03  -0.54   0.08  -0.47  -0.04  -0.58
 0.01   0.49   0.23   0.03   0.59  -0.39  -0.29   0.25  -0.23
 0.04   0.62   0.22   0.00  -0.07   0.11   0.16  -0.68   0.23
 0.03   0.45   0.14  -0.01  -0.30   0.28   0.34   0.68   0.18
Example
Σ = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)
Example
V = (rows correspond to the documents c1-c5, m1-m4)
 0.20  -0.06   0.11  -0.95   0.05  -0.08   0.18  -0.01  -0.06
 0.61   0.17  -0.50  -0.03  -0.21  -0.26  -0.43   0.05   0.24
 0.46  -0.13   0.21   0.04   0.38   0.72  -0.24   0.01   0.02
 0.54  -0.23   0.57   0.27  -0.21  -0.37   0.26  -0.02  -0.08
 0.28   0.11  -0.51   0.15   0.33   0.03   0.67  -0.06  -0.26
 0.00   0.19   0.10   0.02   0.39  -0.30  -0.34   0.45  -0.62
 0.01   0.44   0.19   0.02   0.35  -0.21  -0.15  -0.76   0.02
 0.02   0.62   0.25   0.01   0.15   0.00   0.25   0.45   0.52
 0.08   0.53   0.08  -0.03  -0.60   0.36   0.04  -0.07  -0.45
Example
 We will perform a rank-2 reconstruction:
 We keep the first two vectors (those associated with the two largest singular values) in each matrix and set the rest to zero.
 We then reconstruct the document matrix as $A_2 = U_2 \Sigma_2 V_2^T$ (a numpy sketch follows).
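As an illustration (not part of the original slides), the following numpy sketch builds the term-by-document matrix from the example and performs the rank-2 reconstruction; rounding the result to two decimals should reproduce the matrix shown on the next slide.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Term-by-document matrix from the example (rows = terms, columns = c1..c5, m1..m4).
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)

# Rank-2 reconstruction: keep the two largest singular values, drop the rest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for term, row in zip(terms, np.round(A2, 2)):
    print(f"{term:10s}", row)
```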
Example
A_2 =
            c1     c2     c3     c4     c5     m1     m2     m3     m4
human      0.16   0.40   0.38   0.47   0.18  -0.05  -0.12  -0.16  -0.09
interface  0.14   0.37   0.33   0.40   0.16  -0.03  -0.07  -0.10  -0.04
computer   0.15   0.51   0.36   0.41   0.24   0.02   0.06   0.09   0.12
user       0.26   0.84   0.61   0.70   0.39   0.03   0.08   0.12   0.19
system     0.45   1.23   1.05   1.27   0.56  -0.07  -0.15  -0.21  -0.05
response   0.16   0.58   0.38   0.42   0.28   0.06   0.13   0.19   0.22
time       0.16   0.58   0.38   0.42   0.28   0.06   0.13   0.19   0.22
EPS        0.22   0.55   0.51   0.63   0.24  -0.07  -0.14  -0.20  -0.11
survey     0.10   0.53   0.23   0.21   0.27   0.14   0.31   0.44   0.42
trees     -0.06   0.23  -0.14  -0.27   0.14   0.24   0.55   0.77   0.66
graph     -0.06   0.34  -0.15  -0.30   0.20   0.31   0.69   0.98   0.85
minors    -0.04   0.25  -0.10  -0.21   0.15   0.22   0.50   0.71   0.62

Note that in the reconstructed matrix the word "user" now has weight in the documents where the word "human" appears, even though the two terms never co-occur in the original titles.