Clustering Search Results Using PLSA
洪春涛
Outline
• Motivation
• Introduction to document clustering and the PLSA algorithm
• Work progress and test results
Motivation
• Current Internet search engines give us too much information
• Clustering the search results may help users find the desired information quickly
[Figure: Google search results for "Truman Capote", mixing pages about the writer Truman Capote with pages about the film Truman Capote]
A demo of the search results from Google.
Document clustering
• Put 'similar' documents together
=> How do we define 'similar'?
Vector Space Model of documents
The Vector Space Model (VSM) sees a document as a vector of terms:

Doc1: "I see a bright future."
Doc2: "I see nothing."

        I    see    a    bright    future    nothing
doc1    1    1      1    1         1         0
doc2    1    1      0    0         0         1
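As a small illustration (not from the original slides), such a term-count vector can be built in a few lines of C++; the whitespace tokenizer and the name termVector are assumptions:

#include <map>
#include <sstream>
#include <string>

// Build a term-count vector for one document: term -> occurrence count.
// Naive whitespace tokenizer; punctuation and case handling are omitted.
std::map<std::string, int> termVector(const std::string& doc) {
    std::map<std::string, int> counts;
    std::istringstream tokens(doc);
    std::string term;
    while (tokens >> term)
        ++counts[term];
    return counts;
}

For Doc1 above, termVector("I see a bright future.") gives each of its five tokens a count of 1.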
Cosine as Distance Between Documents
The distance between doc1 and doc2 is then defined as

\cos(\mathrm{doc}_1, \mathrm{doc}_2) = \frac{\mathrm{doc}_1 \cdot \mathrm{doc}_2}{|\mathrm{doc}_1| \, |\mathrm{doc}_2|}
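A minimal sketch of this computation over two sparse term vectors (my illustration, assuming the termVector representation above):

#include <cmath>
#include <map>
#include <string>

// cos(d1, d2) = (d1 . d2) / (|d1| * |d2|) over sparse term-count vectors.
double cosine(const std::map<std::string, int>& d1,
              const std::map<std::string, int>& d2) {
    double dot = 0, norm1 = 0, norm2 = 0;
    for (const auto& [term, count] : d1) {
        norm1 += double(count) * count;
        auto it = d2.find(term);               // dot product over shared terms
        if (it != d2.end()) dot += double(count) * it->second;
    }
    for (const auto& [term, count] : d2)
        norm2 += double(count) * count;
    return dot / (std::sqrt(norm1) * std::sqrt(norm2));
}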
Problems with cosine similarity
• Synonymy: different words may have the same meaning
  – 'car manufacturer' = 'automobile maker'
• Polysemy: a word may have several different meanings
  – 'Truman Capote' may mean the writer or the film
=> We need a model that reflects the 'meaning'
Probabilistic Latent Semantic Analysis
Graphical model of PLSA:
[Figure: PLSA graphical model — documents D link to latent classes Z with probabilities P(z|d), and latent classes link to words W with probabilities P(w|z). D: document, Z: latent class, W: word]

P(d, w) = P(d)\, P(w \mid d)
P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d)

These can also be written as:

P(d, w) = \sum_{z \in Z} P(z)\, P(w \mid z)\, P(d \mid z)
• Through Maximum Likelihood estimation, one gets the estimated parameters:
  – P(d|z): this is what we want, a document-topic matrix that reflects the meanings of the documents
  – P(w|z)
  – P(z)
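For completeness (this step is implicit on the slide), the objective maximized is the log-likelihood of the observed term counts in the standard PLSA formulation:

\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w)

where n(d, w) is the number of times word w occurs in document d (the n_d_w matrix introduced later).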
Our approach
1. Get the P(d|z) matrix by PLSA, and
2. Run the k-means clustering algorithm on that matrix (sketched below)
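A minimal sketch of the clustering step (my reconstruction; the slides give no code): each document is a Z-dimensional row of the P(d|z) matrix, and one k-means iteration assigns rows to the nearest centroid, then recomputes the centroids:

#include <algorithm>
#include <limits>
#include <vector>

// One k-means iteration over the D rows of P(d|z) (row-major, size D*Z).
// 'centroid' holds K cluster centers (each of length Z); 'assign' maps each
// document to its nearest center by squared Euclidean distance.
void kmeansStep(const std::vector<double>& p_d_z, int D, int Z,
                std::vector<std::vector<double>>& centroid,
                std::vector<int>& assign) {
    const int K = static_cast<int>(centroid.size());
    // Assignment step.
    for (int d = 0; d < D; d++) {
        double best = std::numeric_limits<double>::max();
        for (int k = 0; k < K; k++) {
            double dist = 0;
            for (int z = 0; z < Z; z++) {
                double diff = p_d_z[d * Z + z] - centroid[k][z];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; assign[d] = k; }
        }
    }
    // Update step: each centroid becomes the mean of its assigned rows.
    std::vector<int> count(K, 0);
    for (auto& c : centroid) std::fill(c.begin(), c.end(), 0.0);
    for (int d = 0; d < D; d++) {
        count[assign[d]]++;
        for (int z = 0; z < Z; z++)
            centroid[assign[d]][z] += p_d_z[d * Z + z];
    }
    for (int k = 0; k < K; k++)
        if (count[k] > 0)
            for (int z = 0; z < Z; z++) centroid[k][z] /= count[k];
}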
Problems with this approach
• PLSA takes too much time
  – Solution: optimization & parallelization
Algorithm Outline
Expectation-Maximization (EM) Algorithm:
E-step: compute the posterior P(z|d,w) from the current parameters
M-step: re-estimate P(w|z), P(d|z), and P(z) from the posteriors
Tempered EM (TEM): dampen the E-step with an exponent β to reduce overfitting
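For reference, the standard PLSA updates (Hofmann, 1999), with n(d, w) the count of word w in document d, are:

E-step (tempered with inverse temperature \beta; \beta = 1 gives plain EM):

P(z \mid d, w) = \frac{P(z)\,[P(d \mid z)\, P(w \mid z)]^{\beta}}{\sum_{z'} P(z')\,[P(d \mid z')\, P(w \mid z')]^{\beta}}

M-step:

P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w)
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w)
P(z) \propto \sum_{d, w} n(d, w)\, P(z \mid d, w)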
Basic Data Structures
p_w_z_current, p_w_z_prev: dense double matrices of size W*Z
p_d_z_current, p_d_z_prev: dense double matrices of size D*Z
p_z_current, p_z_prev: double arrays of length Z
n_d_w: sparse integer matrix with N non-zero entries
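A minimal C++ rendering of these structures (names from the slide; the compressed-sparse-row layout for n_d_w is an assumption):

#include <vector>

// Dense parameter tables, double-buffered between iterations.
std::vector<double> p_w_z_current, p_w_z_prev; // W*Z, row-major: [w*Z + z]
std::vector<double> p_d_z_current, p_d_z_prev; // D*Z, row-major: [d*Z + z]
std::vector<double> p_z_current, p_z_prev;     // length Z

// n_d_w in CSR form: for each document, its terms and their counts.
struct SparseCounts {
    std::vector<int> row_ptr; // size D+1; doc d owns entries row_ptr[d]..row_ptr[d+1)
    std::vector<int> term_id; // size N: term index w of each non-zero
    std::vector<int> count;   // size N: n(d, w)
};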
Lemur Implementation
• On-demand calculation of p_z_d_w
• Computational complexity: O(W * D * Z^2)
• For the new3 dataset, containing 9,558 documents and 83,487 unique terms, it takes days to finish one TEM iteration
Optimization of the Algorithm
• Reduce complexity
  – calculate p_z_d_w just once per iteration
  – complexity reduced to O(N*Z)
• Reduce cache misses by reordering loops so that z is innermost (skeleton here; a fuller sketch follows):

for (int d = 0; d < numDocs; d++) {               // documents
    for (int w = 0; w < numTermsInThisDoc; w++) { // non-zero terms of doc d
        for (int z = 0; z < numZ; z++) {          // innermost: contiguous z-rows
            ...
        }
    }
}
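A sketch of what the O(N*Z) pass might look like (my reconstruction, not the authors' code), reusing the structures sketched earlier: for each non-zero n(d, w), the tempered posterior P(z|d,w) is computed on the fly from the previous iteration's tables and immediately folded into the M-step sums, so p_z_d_w is never materialized:

#include <cmath>
#include <vector>

// One fused E+M pass over the sparse counts (one TEM iteration).
// Reads the *_prev tables; accumulates unnormalized sums into *_current,
// which must be zeroed beforehand. Final renormalization is omitted.
void emIteration(const SparseCounts& n_d_w, int numDocs, int numZ, double beta) {
    std::vector<double> post(numZ); // P(z|d,w) for the current (d, w) pair
    for (int d = 0; d < numDocs; d++) {
        for (int i = n_d_w.row_ptr[d]; i < n_d_w.row_ptr[d + 1]; i++) {
            const int w = n_d_w.term_id[i];
            const double n = n_d_w.count[i];
            // E-step: tempered posterior, normalized over z.
            double norm = 0;
            for (int z = 0; z < numZ; z++) {
                post[z] = p_z_prev[z] *
                          std::pow(p_d_z_prev[d * numZ + z] *
                                   p_w_z_prev[w * numZ + z], beta);
                norm += post[z];
            }
            // M-step: accumulate n(d,w) * P(z|d,w) into the new tables.
            for (int z = 0; z < numZ; z++) {
                const double c = n * post[z] / norm;
                p_w_z_current[w * numZ + z] += c;
                p_d_z_current[d * numZ + z] += c;
                p_z_current[z] += c;
            }
        }
    }
}

Each non-zero entry costs O(Z) work, so one iteration is O(N*Z) overall.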
Parallelization: Access Pattern
• Problem: data races when concurrent threads update the shared parameter matrices
• Solution: divide the co-occurrence table into blocks
Block Dispatching Algorithm
[Figure: dispatching of co-occurrence blocks to worker threads]
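One plausible scheduling scheme (an assumption; the slide gives only a figure): split the documents and the terms into P stripes each, yielding a P x P grid of blocks, and in each of P rounds let thread t process block (t, (t + round) mod P). No two concurrent blocks then share a document row or a term column, so updates to p_d_z_current and p_w_z_current cannot race. A sketch with OpenMP:

// Hypothetical round-based block schedule: P threads over a P x P block grid.
// processBlock is assumed to run the fused E+M pass restricted to one
// document stripe and one term stripe (updates to p_z_current still need a
// reduction or atomics, omitted here).
void processBlock(int docStripe, int termStripe);

void dispatchBlocks(int P) {
    for (int round = 0; round < P; round++) {
        #pragma omp parallel for
        for (int t = 0; t < P; t++) {
            const int docStripe = t;
            const int termStripe = (t + round) % P; // shifted diagonal
            processBlock(docStripe, termStripe);    // conflict-free this round
        }
        // Implicit barrier after the parallel for separates the rounds.
    }
}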
Block Dividing Algorithm
[Figure: block division of the co-occurrence table, shown on the cranmed dataset]
Experiment Setup
Speedup
[Chart: speedup at 1P, 2P, 4P, and 8P on the new3, la12, and cranmed datasets, in two panels for the HPC134 and Tulsa machines; y-axis: speedup, 0 to 8]
Memory Bandwidth Usage
[Chart: memory bandwidth usage in MB/s (y-axis from 500 to 8000) for the new3, la12, and cranmed datasets at 1P, 2P, 4P, and 8P]
Memory-Related Pipeline Stalls
[Chart: millions of CPU cycles (0 to 40,000), broken into stall_mem and other, for the new3, la12, and cranmed datasets at 1P, 2P, 4P, and 8P]
Available Memory Bandwidth of the Two Machines
[Chart: available memory bandwidth in MB/s (0 to 8000) for HPC134 and Tulsa at 1P, 2P, 4P, and 8P]
END
Backup slides
Test Results
Table 1. F-score of PLSA and VSM

Dataset    PLSA      VSM
tr23       0.4977    0.5273
k1b        0.8473    0.5724
sports     0.7575    0.5563
Table 2. Time for one EM iteration (in seconds)

sizeZ        10     20     50     100
Lemur        29     48     263    1015
Optimized    2      3.2    7      13

Uses the k1b dataset (2,340 docs, 21,247 unique terms, 530,374 term occurrences in total).
Thanks!