Fall 2015/16
• Several reasons for reducing the dimensions of the matrix:
1. Reduce the memory footprint for storing the corpus.
2. The matrix is extremely sparse -- term frequency follows Zipf's law.
3. Transform the vectors from the "term" space to a "topic" space, which allows documents on similar topics to sit close to each other even when they use different terms (e.g., documents using the words "pet" and "cat" are mapped to the same topic based on their co-occurrence).
• Common techniques:
– SVD (Singular Value Decomposition) for general matrix factorization
– LDA (Latent Dirichlet Allocation) for topic extraction
– Other techniques: Factor analysis, multi-dimensional scaling, etc.
– Feature selection – remove unimportant terms/columns
• Singular Value Decomposition (SVD) of an m x n real or complex matrix X is a factorization of the form X = U Σ V, where:
– U is an m x r real or complex matrix satisfying the orthogonality condition U^T U = I (r x r).
– r is the rank of the matrix.
– I is an r x r identity matrix.
– Σ is an r x r diagonal matrix consisting of r positive “singular values” σ1 ≥ σ2 ≥ ... ≥ σr > 0.
– V is an r x n matrix satisfying the orthogonality condition V V^T = I (r x r).
A good tutorial on SVD: http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
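The factorization and the orthogonality conditions above can be checked numerically. A minimal NumPy sketch (the matrix X and its size are arbitrary illustrations, not from the slides):

```python
import numpy as np

# Arbitrary illustrative matrix; full_matrices=False gives the reduced
# ("economy") SVD, where U is m x r and V is r x n (returned as Vt here).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = len(s)              # rank of X (here 3, since X is full rank)
Sigma = np.diag(s)      # r x r diagonal matrix of singular values

assert np.allclose(U @ Sigma @ Vt, X)         # the factorization holds
assert np.allclose(U.T @ U, np.eye(r))        # orthogonality of U
assert np.allclose(Vt @ Vt.T, np.eye(r))      # orthogonality of V
assert np.all(s[:-1] >= s[1:]) and s[-1] > 0  # decreasing positive values
```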
• When SVD is applied to a Term * Doc matrix, you can get the
‘concepts’ or ‘topics’ of the documents.
– Where a concept/topic is a vector of ‘membership/strength values’ of the documents.
– Documents:
Doc 1: Error: invalid message file format
Doc 2: Error: unable to open message file using message path
Doc 3: Error: Unable to format variable
– These three documents generate the following 11 x 3 term-document matrix A:

                     doc 1  doc 2  doc 3
Term 1   error         1      1      1
Term 2   invalid       1      0      0
Term 3   message       1      2      0
Term 4   file          1      1      0
Term 5   format        1      0      1
Term 6   unable        0      1      1
Term 7   to            0      1      1
Term 8   open          0      1      0
Term 9   using         0      1      0
Term 10  path          0      1      0
Term 11  variable      0      0      1
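The counts above can be reproduced directly from the three documents. A minimal Python sketch (tokenizing by whitespace split is a simplification):

```python
import numpy as np

terms = ["error", "invalid", "message", "file", "format",
         "unable", "to", "open", "using", "path", "variable"]
docs = [
    "error invalid message file format",
    "error unable to open message file using message path",
    "error unable to format variable",
]

# 11 x 3 term-document matrix of raw term counts.
A = np.array([[doc.split().count(term) for doc in docs] for term in terms])

print(A.shape)   # (11, 3)
print(A[2, 1])   # "message" occurs twice in doc 2 -> 2
```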
– First project the first document vector d1 into a three-dimensional SVD space by the matrix multiplication U^T d1, where d1 = term-frequency vector for document 1 (using the unweighted counts here).
– And then write this in transposed form, d1^T U, with column labels.
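The projection step can be sketched with NumPy, using the term-document matrix from the example above. Note that the signs of SVD dimensions are implementation-dependent, so the projected values may differ in sign from the slides' numbers:

```python
import numpy as np

# Term-document matrix from the example (rows = terms, cols = docs).
A = np.array([
    [1, 1, 1],  # error
    [1, 0, 0],  # invalid
    [1, 2, 0],  # message
    [1, 1, 0],  # file
    [1, 0, 1],  # format
    [0, 1, 1],  # unable
    [0, 1, 1],  # to
    [0, 1, 0],  # open
    [0, 1, 0],  # using
    [0, 1, 0],  # path
    [0, 0, 1],  # variable
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

d1 = A[:, 0]      # term-frequency vector for doc 1 (unweighted counts)
proj = U.T @ d1   # 3-dimensional projection of doc 1 into SVD space
print(proj)
```

Because d1 lies in the column space of A (and hence of U), the projection preserves its length.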
– The SVD dimensions are ordered by the size of their singular values (“their importance”). Therefore, the document vector can simply be truncated to obtain a lower-dimensional projection:
• The 2-D representation for doc 1 keeps only the first two coordinates, 1.63 and .49.
– As a final step, the Text Cluster node then normalizes the coordinate values so that the sum of squares for each document is 1.0:
• Using this document’s 2-D representation, 1.63² + .49² = 2.897 and √2.897 ≈ 1.70.
• Therefore, the final 2-D representation for doc 1 is obtained by dividing each coordinate by 1.70.
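The truncate-and-normalize step can be sketched as follows (the 3-D projection values are illustrative, loosely based on the 1.63 and .49 above; the third coordinate is made up):

```python
import numpy as np

# Hypothetical 3-D SVD projection of a document (illustrative values).
proj = np.array([1.63, 0.49, 0.12])

proj_2d = proj[:2]                    # truncate: keep the two dimensions
                                      # with the largest singular values
norm = np.sqrt(np.sum(proj_2d ** 2))  # here sqrt(1.63^2 + .49^2) ~= 1.70
doc_2d = proj_2d / norm               # normalized 2-D representation

assert np.isclose(np.sum(doc_2d ** 2), 1.0)  # sum of squares is now 1.0
```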
– The tiny example given here has a term-document matrix of rank r = 3. (The rank is always less than or equal to the minimum of the number of documents and the number of terms.)
– In actual practice, the rank of the term-document matrix will usually be in the thousands, so the SVD algorithm is used to dramatically reduce the dimensionality of the data.
– The SVD algorithm derives SVD dimensions in order of decreasing singular value σ_i.
– The number of SVD dimensions to keep, k, is based on looking at these singular values and establishing a cut-off value.
– The user specifies a maximum dimension M (default=100 and highest allowed value=500) for the number of SVD dimensions to keep.
– The SVD algorithm produces the M singular values in decreasing order.
– The sum of the M singular values (squared) acts as a metric for the amount of information in the document collection; treating the sum of the top M squared values as the “total information” is useful for arriving at a reasonable cutoff.
– The user also specifies an SVD Resolution value:
• High=100%
• Medium=5/6=83.3%
• Low=2/3=66.67% (the default)
– Based on these two settings, the SAS Text Cluster node uses a simple algorithm to decide on the final number of SVD dimensions to use.
– To illustrate the logic for deciding the number of dimensions to use:
• Suppose the user sets Max SVD Dimensions=100 and SVD
Resolution=Low (66.7%).
• Assume that you are working with a big document collection so that the rank of the term-document matrix is much larger than 100.
• Let the sum of the first 100 squared singular values be C = Σ_{i=1}^{100} σ_i².
• The algorithm determines the minimum dimension k ≤ 100 such that (Σ_{i=1}^{k} σ_i²) / C ≥ .667.
• In the end, k SVD dimensions will be kept.
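The cut-off logic can be sketched as a small function. `choose_k` and the geometrically decaying example values are illustrative, not the SAS implementation:

```python
import numpy as np

def choose_k(singular_values, max_dim=100, resolution=2/3):
    """Smallest k <= max_dim whose top-k squared singular values
    capture at least `resolution` of the top-max_dim total."""
    s = np.asarray(singular_values, dtype=float)[:max_dim]
    sq = s ** 2
    ratios = np.cumsum(sq) / sq.sum()    # fraction of "total information"
    k = int(np.searchsorted(ratios, resolution)) + 1
    return min(k, len(s))

# Example: geometrically decaying singular values.
s = 0.9 ** np.arange(100)
k = choose_k(s, max_dim=100, resolution=2/3)   # k = 6 for this decay
```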