Dimensionality Reduction


CSC 594 Topics in AI – Text Mining and Analytics

Fall 2015/16

6. Dimensionality Reduction


Dimensionality Reduction

• There are several reasons for reducing the dimensions of the matrix:

1. Reduce the memory footprint for storing the corpus.

2. The matrix is extremely sparse, since term frequencies follow Zipf’s law.

3. Transform the vectors from the "term" space to a "topic" space, which allows documents on similar topics to lie close to each other even when they use different terms (e.g., documents using the words "pet" and "cat" are mapped to the same topic based on their co-occurrence).

• Common techniques:

– SVD (Singular Value Decomposition) for general matrix factorization

– LDA (Latent Dirichlet Allocation) for topic extraction

– Other techniques: Factor analysis, multi-dimensional scaling, etc.

– Feature selection – remove unimportant terms/columns


Singular Value Decomposition (SVD)

• Singular Value Decomposition (SVD) of an m × n real or complex matrix X is a factorization of the form

𝑋 = 𝑈Σ𝑉ᵀ

– U is an m × r real or complex matrix satisfying the orthogonality condition UᵀU = I.

  • r is the rank of the matrix X.

  • I is the r × r identity matrix.

– Σ is an r × r diagonal matrix consisting of r positive “singular values” σ₁ ≥ σ₂ ≥ … ≥ σᵣ > 0.

– V is an r × n matrix satisfying the orthogonality condition VVᵀ = I.
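Below is a minimal sketch of this factorization in NumPy (an assumption for illustration; numpy.linalg.svd with full_matrices=False gives the reduced form described above, and the example matrix is made up):

import numpy as np

# Small made-up term-document matrix (5 terms x 3 docs) just for illustration.
X = np.array([[1., 0., 1.],
              [1., 2., 0.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

# full_matrices=False returns the reduced SVD: U is m x min(m, n), Vt is min(m, n) x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)                              # diagonal matrix of singular values

print(np.allclose(X, U @ Sigma @ Vt))           # True: X = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(len(s))))     # orthogonality: U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(len(s))))   # orthogonality: V V^T = I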


SVD Example (1)

𝑋 = 𝑈Σ𝑉ᵀ

(Figure: an example factorization showing the matrices X, U, Σ, and V.)

A good tutorial on SVD: http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm


SVD for Concept/Topic Extraction

• When SVD is applied to a term × document matrix, you can extract the ‘concepts’ or ‘topics’ of the documents.

– A concept/topic is a vector of ‘membership/strength’ values over the documents.
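Continuing the NumPy sketch above, each row of Vᵀ can be read as one concept/topic with one strength value per document (an illustration of the idea, not SAS output):

# Reusing s and Vt from the SVD sketch above.
for topic_idx, doc_strengths in enumerate(Vt):
    # doc_strengths[j] = membership/strength of document j in this topic
    print(f"topic {topic_idx}: singular value {s[topic_idx]:.3f}, "
          f"doc strengths {np.round(doc_strengths, 3)}")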


SVD Example (2)

– Documents:

Doc 1: Error: invalid message file format

Doc 2: Error: unable to open message file using message path

Doc 3: Error: Unable to format variable

– These three documents generate the following 11 × 3 term-document matrix A.

                      doc 1   doc 2   doc 3
Term 1   error          1       1       1
Term 2   invalid        1       0       0
Term 3   message        1       2       0
Term 4   file           1       1       0
Term 5   format         1       0       1
Term 6   unable         0       1       1
Term 7   to             0       1       1
Term 8   open           0       1       0
Term 9   using          0       1       0
Term 10  path           0       1       0
Term 11  variable       0       0       1
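As a sketch, the same 11 × 3 matrix can be rebuilt with scikit-learn's CountVectorizer (an assumption for illustration; the fixed vocabulary list keeps the rows in the same term order as the table above):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Error: invalid message file format",
    "Error: unable to open message file using message path",
    "Error: Unable to format variable",
]
terms = ["error", "invalid", "message", "file", "format",
         "unable", "to", "open", "using", "path", "variable"]

# Supplying the vocabulary fixes the row order; counts are raw term frequencies.
vectorizer = CountVectorizer(vocabulary=terms)
A = vectorizer.fit_transform(docs).T.toarray()   # transpose: terms as rows, docs as columns
print(A.shape)   # (11, 3)
print(A)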


– First, project the first document vector into a three-dimensional SVD space by the matrix multiplication 𝑈ᵀd₁ = d̂₁.

  • d₁ = term-frequency vector for document 1 (using the unweighted counts here).

– Then write this in transposed form with column labels: d̂₁ᵀ.
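A NumPy sketch of this projection, using the matrix A built above (note that SVD column signs are arbitrary, so the projected values may differ in sign from the SAS output):

# A is the 11 x 3 term-document matrix from the previous sketch.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

d1 = A[:, 0]         # term-frequency vector for document 1 (unweighted counts)
d1_hat = U.T @ d1    # project d1 into the 3-dimensional SVD space
print(d1_hat)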


– The SVD dimensions are ordered by the size of their singular values (“their importance”). Therefore, the document vector can simply be truncated to obtain a lower dimensional projection:

• The 2-D representation for doc 1 consists of the first two coordinates of d̂₁.

– As a final step, the Text Cluster node then normalizes the coordinate values so that the sums of squares for each document are 1.0:

• Using this document’s 2-D representation, the sum of squares is 1.63² + 0.49² ≈ 2.897, and √2.897 ≈ 1.70.

• Therefore, the final 2-D representation for doc 1 is obtained by dividing each coordinate by 1.70.
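The truncation and normalization steps, sketched in NumPy (continuing from d1_hat above):

d1_2d = d1_hat[:2]                                       # keep the 2 dimensions with the largest singular values
d1_2d_normalized = d1_2d / np.sqrt(np.sum(d1_2d ** 2))   # rescale so the sum of squares is 1.0
print(d1_2d_normalized, np.sum(d1_2d_normalized ** 2))   # the sum of squares prints as 1.0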


Dimensionality Reduction

– The tiny example given here has a term-document matrix of rank r = 3. (The rank is always less than or equal to the minimum of the number of documents and the number of terms.)

– In actual practice, the rank of the term-document matrix will usually be in the thousands, so the SVD algorithm is used to dramatically reduce the dimensionality of the data.

– The SVD algorithm derives SVD dimensions in order of their singular values σᵢ.

– The number of SVD dimensions to keep is based on looking at these singular values and establishing a cut-off value k.


Dimensionality Reduction (cont.)

– The user specifies a maximum dimension M (default=100 and highest allowed value=500) for the number of SVD dimensions to keep.

– The SVD algorithm produces the M singular values in decreasing order.

– The sum of the M singular values (squared) acts as a metric for the amount of information in the document collection.

– Treating the sum of the top M squared values as the “total information” is useful for arriving at a reasonable cutoff.


SVD “Resolutions”

– The user also specifies an SVD Resolution value:

• High=100%

• Medium=5/6=83.3%

• Low=2/3=66.67% (the default)

– Based on these two settings, the SAS Text Cluster node uses a simple algorithm to decide on the final number of SVD dimensions to use.


Selecting Number of Dimensions

– To illustrate the logic for deciding the number of dimensions to use:

• Suppose the user sets Max SVD Dimensions=100 and SVD Resolution=Low (66.7%).

• Assume that you are working with a big document collection so that the rank of the term-document matrix is much larger than 100.

• Let the sum of the first 100 squared singular values be C = σ₁² + σ₂² + ⋯ + σ₁₀₀².

• The algorithm determines the minimum dimension k ≤ 100 such that (σ₁² + ⋯ + σₖ²) / C ≥ 0.667.

• In the end, k SVD dimensions will be kept.
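A sketch of this cut-off rule as a small function (select_svd_dimensions is a hypothetical helper written for illustration, not the Text Cluster node's actual code):

import numpy as np

def select_svd_dimensions(singular_values, max_dims=100, resolution=2/3):
    """Return the smallest k <= max_dims whose squared singular values
    account for at least `resolution` of the total information C."""
    s = np.asarray(singular_values)[:max_dims]
    C = np.sum(s ** 2)                       # "total information" of the top max_dims values
    cumulative = np.cumsum(s ** 2) / C       # fraction of C explained by the first k values
    return int(np.searchsorted(cumulative, resolution) + 1)

# Example with made-up, steeply decaying singular values:
sv = 100.0 * 0.9 ** np.arange(200)
print(select_svd_dimensions(sv, max_dims=100, resolution=2/3))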

