Study of the parallel techniques for dimensionality reduction and its impact on quality of the text processing algorithms

Marcin Pietroń 1,2, Maciej Wielgosz 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków

Agenda
- Text classification
- System architecture
- Metrics
- Dimensionality reduction
- Experiments and results
- Conclusions and future work

Text classification
- A common and important problem in internet and big data processing
- Subject to real-time processing requirements
- Preceded by text preprocessing
- Clustering is one of the techniques that supports text classification

System architecture
- Text preprocessing
- Dictionary and model transformation
- SVD
- K-means

System architecture
- Document corpus generation (e.g. by a crawler)
- Text preprocessing (implemented with the gensim library: lemmatization, stoplist, etc.)
- SVD
- K-means as the clustering method (grouping documents into the chosen domains)

Quality metrics
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F = 2 × (Precision × Recall) / (Precision + Recall)

Entropy
E(C_j) = - Σ_{i=1..k} (n_ij / n_j) · log(n_ij / n_j)
E = Σ_{j=1..m} (n_j / n) · E(C_j)
where n_ij is the number of documents of class i in cluster C_j, n_j is the size of cluster C_j, and n is the total number of documents.

Dimensionality reduction
SVD: A = U × Σ × V^T, where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and Σ is the diagonal matrix of singular values.

Dimensionality reduction
Random projection: vectors are projected onto a reduced space by a special random matrix R, so that distances between points are approximately preserved in the reduced space: A' = R × A (for instance, using the
Achlioptas random projection matrix).

Results and experiments

domain      | number of clusters | precision    | recall       | F-measure
business    | 3.9 (0.3)          | 0.81 (0.022) | 0.56 (0.077) | 0.66 (0.034)
culture     | 3 (0)              | 0.37 (0.015) | 0.7 (0.061)  | 0.48 (0.024)
automotive  | 4.8 (0.4)          | 0.39 (0.007) | 0.56 (0.021) | 0.45 (0.01)
science     | 2.1 (0.3)          | 0.39 (0.014) | 0.74 (0.016) | 0.51 (0.014)
sport       | 4.8 (0.4)          | 0.39 (0.007) | 0.56 (0.021) | 0.45 (0.01)
(mean values; standard deviations in parentheses)

employed algorithm    | entropy
vsm+kmeans            | 0.28 (0.012)
vsm+tfidf+kmeans      | 0.17 (0.019)
vsm+tfidf+svd+kmeans  | 0.16 (0.006)

Results and experiments
[Figure: entropy mean (approx. 0.75-1.05) plotted against reduction sizes from 2 to 7900]

GPU implementation
NVIDIA Tesla M2090 (GPGPU) vs. Intel Xeon E5645 (CPU)

reduction size | GPGPU [ms] | CPU [ms]
10             | 33         | 80
20             | 77         | 305
30             | 107        | 420
40             | 161        | 624

Conclusions and future work
- Applying more algorithms lowers the entropy
- The GPU can efficiently reduce the time of text classification
- Future work: hardware implementation of random projection
- Future work: GPU acceleration of K-means

Questions
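The quality metrics from the slides (precision, recall, F-measure, and the cluster-weighted entropy) can be sketched in plain Python. This is an illustrative implementation, not the authors' code, and the example counts and labels are made up.

```python
import math

def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

def cluster_entropy(true_labels, cluster_ids):
    """Weighted mean entropy over clusters:
    E(C_j) = -sum_i (n_ij/n_j) * log(n_ij/n_j),  E = sum_j (n_j/n) * E(C_j)."""
    n = len(true_labels)
    total = 0.0
    for c in set(cluster_ids):
        members = [t for t, p in zip(true_labels, cluster_ids) if p == c]
        n_j = len(members)
        e_j = 0.0
        for cls in set(members):
            p_ij = members.count(cls) / n_j  # fraction of class `cls` in cluster
            e_j -= p_ij * math.log(p_ij)
        total += (n_j / n) * e_j
    return total
```

Perfectly pure clusters yield entropy 0, so lower entropy (as in the results table) indicates better clustering.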
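The SVD step of the pipeline (reducing document vectors before K-means) can be sketched with NumPy. The document-term matrix, its dimensions, and the target rank k below are synthetic assumptions for illustration, not the authors' data or implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.random((100, 500))       # e.g. 100 documents x 500 tf-idf terms (synthetic)

# Thin SVD: A = U @ diag(S) @ Vt, singular values in S sorted descending
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 20                           # target dimensionality (assumed)
A_reduced = U[:, :k] * S[:k]     # documents as k-dimensional vectors for clustering

# Rank-k approximation of A; its error shrinks as k grows
A_k = (U[:, :k] * S[:k]) @ Vt[:k]
```

The rows of A_reduced would then be fed to K-means in place of the original high-dimensional document vectors.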
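The Achlioptas random projection mentioned on the dimensionality-reduction slide can likewise be sketched with NumPy: entries of R take the values +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6, scaled by 1/sqrt(k). The matrix sizes here are made up, and documents are kept as row vectors, so the projection is written X @ R rather than the slide's R × A.

```python
import numpy as np

def achlioptas_matrix(d, k, rng):
    """Sparse Achlioptas projection matrix of shape (d, k): entries are
    +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6, scaled by 1/sqrt(k)."""
    entries = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                         size=(d, k), p=[1/6, 2/3, 1/6])
    return entries / np.sqrt(k)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))      # 50 documents, 1000-dim vectors (synthetic)
R = achlioptas_matrix(1000, 100, rng)    # project 1000 -> 100 dimensions
X_reduced = X @ R                        # reduced representation
```

By the Johnson-Lindenstrauss property, pairwise distances in X_reduced approximate those in X, which is what makes clustering in the reduced space meaningful.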