Study of the parallel techniques for dimensionality reduction and its impact on quality of the text processing algorithms

Marcin Pietroń 1,2, Maciej Wielgosz 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków

Agenda
- Text classification
- System architecture
- Metrics
- Dimensionality reduction
- Experiments and results
- Conclusions and future work

Text classification
- A common and important problem in internet and big data processing
- Subject to real-time processing requirements
- Preceded by text preprocessing
- Clustering is one of the techniques that supports text classification

System architecture
- Text preprocessing
- Dictionary and model transformation
- SVD
- K-means

System architecture
- Document corpus generation (e.g. by a crawler)
- Text preprocessing (implemented with the gensim library: lemmatization, stoplist, etc.)
- SVD
- K-means as the clustering method (grouping documents into the chosen domains)

Quality metrics
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F = 2 × (Precision × Recall) / (Precision + Recall)

Entropy
E(C_j) = - Σ_{i=1..k} (n_ij / n_j) · log(n_ij / n_j)
E = Σ_{j=1..m} (n_j / n) · E(C_j)
where n_ij is the number of documents of class i in cluster C_j, n_j is the size of cluster C_j, and n is the total number of documents.

Dimensionality reduction
SVD: A = U × Σ × V^T, where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and Σ is the diagonal matrix of singular values.

Dimensionality reduction
Random projection: vectors are projected onto a reduced space by a special random matrix R, so that distances between points are approximately preserved in the reduced space: A' = R × A (for instance, using the
Achlioptas random projection matrix).

Results and experiments

domain      | number of clusters | precision    | recall       | F-measure
business    | 3.9 (0.3)          | 0.81 (0.022) | 0.56 (0.077) | 0.66 (0.034)
culture     | 3 (0)              | 0.37 (0.015) | 0.7 (0.061)  | 0.48 (0.024)
automotive  | 4.8 (0.4)          | 0.39 (0.007) | 0.56 (0.021) | 0.45 (0.01)
science     | 2.1 (0.3)          | 0.39 (0.014) | 0.74 (0.016) | 0.51 (0.014)
sport       | 4.8 (0.4)          | 0.39 (0.007) | 0.56 (0.021) | 0.45 (0.01)
(mean values; standard deviations in parentheses)

employed algorithm    | entropy
vsm+kmeans            | 0.28 (0.012)
vsm+tfidf+kmeans      | 0.17 (0.019)
vsm+tfidf+svd+kmeans  | 0.16 (0.006)

Results and experiments
[Figure: entropy mean (approx. 0.75-1.05) plotted against reduction sizes from 2 to 7900]

GPU implementation
NVIDIA Tesla M2090 (GPGPU) vs. Intel Xeon E5645 (CPU)

reduction size | GPGPU [ms] | CPU [ms]
10             | 33         | 80
20             | 77         | 305
30             | 107        | 420
40             | 161        | 624

Conclusions and future work
- Applying more algorithms lowers the entropy
- The GPU can efficiently reduce the time of text classification
- Future work: hardware implementation of random projection
- Future work: GPU acceleration of K-means

Questions
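The quality metrics from the slides (precision, recall, F-measure, and the cluster-weighted entropy) can be sketched in plain Python. This is an illustrative implementation, not the authors' code, and the example counts and labels are made up.

```python
import math

def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

def cluster_entropy(true_labels, cluster_ids):
    """Weighted mean entropy over clusters:
    E(C_j) = -sum_i (n_ij/n_j) * log(n_ij/n_j),  E = sum_j (n_j/n) * E(C_j)."""
    n = len(true_labels)
    total = 0.0
    for c in set(cluster_ids):
        members = [t for t, p in zip(true_labels, cluster_ids) if p == c]
        n_j = len(members)
        e_j = 0.0
        for cls in set(members):
            p_ij = members.count(cls) / n_j  # fraction of class `cls` in cluster
            e_j -= p_ij * math.log(p_ij)
        total += (n_j / n) * e_j
    return total
```

Perfectly pure clusters yield entropy 0, so lower entropy (as in the results table) indicates better clustering.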
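The SVD step of the pipeline (reducing document vectors before K-means) can be sketched with NumPy. The document-term matrix, its dimensions, and the target rank k below are synthetic assumptions for illustration, not the authors' data or implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.random((100, 500))       # e.g. 100 documents x 500 tf-idf terms (synthetic)

# Thin SVD: A = U @ diag(S) @ Vt, singular values in S sorted descending
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 20                           # target dimensionality (assumed)
A_reduced = U[:, :k] * S[:k]     # documents as k-dimensional vectors for clustering

# Rank-k approximation of A; its error shrinks as k grows
A_k = (U[:, :k] * S[:k]) @ Vt[:k]
```

The rows of A_reduced would then be fed to K-means in place of the original high-dimensional document vectors.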
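The Achlioptas random projection mentioned on the dimensionality-reduction slide can likewise be sketched with NumPy: entries of R take the values +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6, scaled by 1/sqrt(k). The matrix sizes here are made up, and documents are kept as row vectors, so the projection is written X @ R rather than the slide's R × A.

```python
import numpy as np

def achlioptas_matrix(d, k, rng):
    """Sparse Achlioptas projection matrix of shape (d, k): entries are
    +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6, scaled by 1/sqrt(k)."""
    entries = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                         size=(d, k), p=[1/6, 2/3, 1/6])
    return entries / np.sqrt(k)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))      # 50 documents, 1000-dim vectors (synthetic)
R = achlioptas_matrix(1000, 100, rng)    # project 1000 -> 100 dimensions
X_reduced = X @ R                        # reduced representation
```

By the Johnson-Lindenstrauss property, pairwise distances in X_reduced approximate those in X, which is what makes clustering in the reduced space meaningful.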