Clustering Search Results Using PLSA
洪春涛
2015/4/10

Outline
• Motivation
• Introduction to document clustering and the PLSA algorithm
• Work in progress and test results

Motivation
• Current Internet search engines give us too much information
• Clustering the search results may help users find the desired information quickly

[Figure: a demo of Google search results for "Truman Capote", mixing pages about the writer Truman Capote with pages about the film Truman Capote]

Document Clustering
• Put 'similar' documents together
=> How do we define 'similar'?

Vector Space Model of Documents
The Vector Space Model (VSM) sees a document as a vector of terms:

  Doc1: "I see a bright future."
  Doc2: "I see nothing."

  term      doc1  doc2
  I          1     1
  see        1     1
  a          1     0
  bright     1     0
  future     1     0
  nothing    0     1

Cosine as Distance Between Documents
The distance between doc1 and doc2 is then defined by

  cos(doc1, doc2) = (doc1 · doc2) / (|doc1| * |doc2|)

Problems with Cosine Similarity
• Synonymy: different words may have the same meaning
  – "car manufacturer" = "automobile maker"
• Polysemy: a word may have several different meanings
  – "Truman Capote" may mean the writer or the film
=> We need a model that reflects the 'meaning'

Probabilistic Latent Semantic Analysis
[Figure: PLSA graphical model, with documents D generating latent classes Z, which generate words W, and example conditional probabilities on the edges]
• D: document, Z: latent class, W: word

  P(d, w) = P(d) · P(w | d),  where  P(w | d) = Σ_{z∈Z} P(w | z) · P(z | d)

This can also be written as:

  P(d, w) = Σ_{z∈Z} P(z) · P(w | z) · P(d | z)

• Through maximum-likelihood estimation, one gets the estimated parameters P(z), P(w | z), and P(d | z)
• P(d | z) is what we want: a document-topic matrix that reflects the meanings of the documents

Our Approach
1. Get the P(d | z) matrix by PLSA
2. Use the k-means clustering algorithm on that matrix

Problems with This Approach
• PLSA takes too much time
  – solution: optimization & parallelization

Algorithm Outline
Expectation-Maximization (EM) algorithm:

  E-step:  P(z | d, w) = P(z) P(d | z) P(w | z) / Σ_{z'∈Z} P(z') P(d | z') P(w | z')
  M-step:  P(w | z) ∝ Σ_d n(d, w) P(z | d, w)
           P(d | z) ∝ Σ_w n(d, w) P(z | d, w)
           P(z)     ∝ Σ_d Σ_w n(d, w) P(z | d, w)

  Tempered EM: the E-step is damped as P(z | d, w) ∝ P(z) [P(d | z) P(w | z)]^β, with β ≤ 1

Basic Data Structures
• p_w_z_current, p_w_z_prev: dense double matrices, W × Z
• p_d_z_current, p_d_z_prev: dense double matrices, D × Z
• p_z_current, p_z_prev: double arrays of length Z
• n_d_w: sparse integer matrix with N non-zero entries

Lemur Implementation
• Calculates p_z_d_w on demand, whenever it is needed
• Computational complexity: O(W · D · Z²)
• For the new3 dataset, with 9,558 documents and 83,487 unique terms, it takes days to finish one TEM iteration

Optimization of the Algorithm
• Reduce complexity
  – calculate p_z_d_w just once per iteration
  – complexity reduced to O(N · Z)
• Reduce cache misses by reordering the loops (a sketch of the full iteration follows this slide):

  for (int d = 0; d < numDocs; d++) {
    for (int w = 0; w < numTermsInThisDoc; w++) {
      for (int z = 0; z < numZ; z++) {
        ...
      }
    }
  }
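A minimal sketch of one tempered-EM iteration with the optimized d → w → z loop order: p(z | d, w) is computed once per (d, w) pair and folded straight into the M-step accumulators, which is what yields the O(N · Z) cost. The CSR layout of n_d_w, the function name em_iteration, and everything not named on the Basic Data Structures slide are assumptions for illustration, not the actual optimized source.

  // One tempered-EM iteration over a CSR-like sparse n_d_w.
  // Array names follow the "Basic Data Structures" slide; the CSR
  // layout and the beta handling are assumptions of this sketch.
  #include <cmath>
  #include <vector>

  struct SparseCounts {            // n_d_w: one row per document
      std::vector<int> row_start;  // size numDocs + 1
      std::vector<int> term_id;    // size N: term index of each non-zero
      std::vector<int> count;      // size N: n(d, w)
  };

  void em_iteration(const SparseCounts& n_d_w, int numDocs, int numZ,
                    double beta,
                    const std::vector<double>& p_d_z_prev,   // D * Z
                    const std::vector<double>& p_w_z_prev,   // W * Z
                    const std::vector<double>& p_z_prev,     // Z
                    std::vector<double>& p_d_z_current,      // zeroed, D * Z
                    std::vector<double>& p_w_z_current,      // zeroed, W * Z
                    std::vector<double>& p_z_current)        // zeroed, Z
  {
      std::vector<double> p_z_dw(numZ);   // E-step result for one (d, w)
      for (int d = 0; d < numDocs; d++) {
          for (int i = n_d_w.row_start[d]; i < n_d_w.row_start[d + 1]; i++) {
              const int w = n_d_w.term_id[i];
              const int c = n_d_w.count[i];
              // E-step (tempered): p(z|d,w) ∝ p(z) * [p(d|z) p(w|z)]^beta
              double norm = 0.0;
              for (int z = 0; z < numZ; z++) {
                  p_z_dw[z] = p_z_prev[z] *
                              std::pow(p_d_z_prev[d * numZ + z] *
                                       p_w_z_prev[w * numZ + z], beta);
                  norm += p_z_dw[z];
              }
              // M-step accumulation: fold n(d,w) * p(z|d,w) into the
              // current-iteration tables; p_z_d_w is never stored.
              for (int z = 0; z < numZ; z++) {
                  const double r = c * p_z_dw[z] / norm;
                  p_d_z_current[d * numZ + z] += r;
                  p_w_z_current[w * numZ + z] += r;
                  p_z_current[z] += r;
              }
          }
      }
      // A final pass (omitted) renormalizes each table to sum to 1.
  }

Iterating d → w → z keeps the accesses to row d of p_d_z and row w of p_w_z contiguous, which is where the cache-miss reduction claimed above comes from.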
Parallelization: Access Pattern
[Figure: access pattern of the EM updates over the co-occurrence table; concurrent updates to shared parameter rows cause a data race]
• Data race solution: divide the co-occurrence table into blocks

Block Dispatching Algorithm
[Figure: how blocks of the co-occurrence table are dispatched to worker threads]

Block Dividing Algorithm
[Figure: block division of the co-occurrence table for the cranmed dataset]

Experiment Setup
[Figure: configuration of the two test machines, HPC134 and Tulsa]

Speedup
[Figure: speedup on HPC134 and Tulsa for the new3, la12, and cranmed datasets at 1, 2, 4, and 8 processors (1P–8P)]

Memory Bandwidth Usage
[Figure: memory bandwidth usage in MB/s (500 to 8000) for new3, la12, and cranmed at 1P, 2P, 4P, and 8P]

Memory-Related Pipeline Stalls
[Figure: millions of CPU cycles spent on memory stalls (stall_mem) versus other work, for new3, la12, and cranmed at 1P, 2P, 4P, and 8P]

Available Memory Bandwidth of the Two Machines
[Figure: available memory bandwidth in MB/s of HPC134 and Tulsa at 1P, 2P, 4P, and 8P]

END

Backup Slides

Test Results

Table 1. F-scores of PLSA and VSM

  dataset  PLSA    VSM
  tr23     0.4977  0.5273
  k1b      0.8473  0.5724
  sports   0.7575  0.5563

Table 2. Time of one EM iteration (in seconds), on the k1b dataset (2,340 documents, 21,247 unique terms, 530,374 total terms)

  sizeZ      10    20    50    100
  Lemur      29    48    263   1015
  Optimized  2     3.2   7     13

Thanks!
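One more backup sketch, for the Block Dispatching slide: assuming the co-occurrence table is cut into a B × B grid, the blocks (i, (i + r) mod B) processed in round r cover pairwise-disjoint document rows and pairwise-disjoint term columns, so worker threads never write the same rows of p_d_z_current or p_w_z_current. The names B, run_blocked_em, and process_block are illustrative, not taken from the actual implementation.

  // Diagonal block schedule: B sequential rounds, B race-free blocks each.
  #include <thread>
  #include <vector>

  // EM updates restricted to the documents of block `doc_block` and the
  // terms of block `term_block`; body elided (see the iteration sketch
  // earlier in the deck).
  void process_block(int doc_block, int term_block) { /* ... */ }

  void run_blocked_em(int B) {
      for (int r = 0; r < B; r++) {              // one round per diagonal
          std::vector<std::thread> workers;
          for (int i = 0; i < B; i++)            // disjoint rows and columns
              workers.emplace_back(process_block, i, (i + r) % B);
          for (auto& t : workers) t.join();      // barrier between rounds
      }
  }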