
Study of the parallel techniques for dimensionality reduction and its impact on quality of the text processing algorithms
Marcin Pietroń1,2, Maciej Wielgosz1,2, Michał Karwatowski1,2, Kazimierz Wiatr1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków
Agenda
Text classification
System architecture
Metrics
Dimensionality reduction
Experiments and results
Conclusions and future work
Text classification
A common and important problem in Internet and big data processing
Often subject to real-time processing requirements
Preceded by text preprocessing
Clustering is one of the techniques that supports text classification
System architecture
(Diagram: text preprocessing → dictionary and model transformation → SVD → K-means)
System architecture
Document corpus generation (e.g. with a crawler)
Text preprocessing (implemented with the gensim library: lemmatization, stop-word list, etc.)
SVD
K-means as the clustering method (clustering documents into the chosen domains); a minimal sketch of the whole pipeline is shown below
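A minimal sketch of this pipeline, assuming the gensim and scikit-learn libraries; the example documents, stoplist and cluster count are placeholders, and gensim's LsiModel stands in here for the SVD step:

```python
# Sketch of the pipeline: preprocessing -> dictionary/TF-IDF -> SVD (LSI) -> K-means.
# The corpus, stoplist and parameters below are illustrative placeholders.
from gensim import corpora, models, matutils
from sklearn.cluster import KMeans

docs = [
    "the match ended with a late goal",           # sport
    "the new sedan offers a hybrid engine",       # automotive
    "researchers published results on graphene",  # science
]

# Text preprocessing: tokenization and a simple stoplist (lemmatization omitted).
stoplist = {"the", "a", "with", "on"}
tokenized = [[w for w in d.lower().split() if w not in stoplist] for d in docs]

# Dictionary and model transformation: bag-of-words, then TF-IDF weighting.
dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(t) for t in tokenized]
tfidf = models.TfidfModel(bow)

# SVD / latent semantic indexing: reduce the term space to a few latent topics.
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=2)
reduced = matutils.corpus2dense(lsi[tfidf[bow]], num_terms=2).T

# K-means clusters the reduced document vectors into the chosen domains.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```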
Quality metrics
$\mathrm{precision} = \frac{tp}{tp + fp}$
$\mathrm{recall} = \frac{tp}{tp + fn}$
$F = 2 \times \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
Entropy
$E(C_i) = -\sum_{h=1}^{k} \frac{n_{hi}}{n_i} \log\left(\frac{n_{hi}}{n_i}\right)$
$\mathrm{Entropy} = \sum_{i=1}^{k} \frac{n_i}{n} E(C_i)$
Dimensionality reduction
SVD:
$A = U \times \Sigma \times V^{T}$
where U is the matrix of left singular vectors, V is the matrix of right singular vectors and $\Sigma$ is a diagonal matrix of singular values
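A minimal numpy sketch of SVD-based reduction; the term-document matrix A and the target rank k are placeholders:

```python
import numpy as np

A = np.random.rand(1000, 50)                   # placeholder: 1000 terms x 50 documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10                                         # target reduced dimensionality
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in the k-dimensional latent space
```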
Dimensionality reduction
Random projection:
vectors are projected into a reduced space by special random matrices; distances between points are approximately preserved (up to a scale factor) in the reduced space
$A' = A \times R$ (e.g. with the Achlioptas random projection matrix R)
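A minimal numpy sketch of an Achlioptas-style random projection; the matrix sizes below are placeholders:

```python
import numpy as np

def achlioptas_matrix(d, k, rng):
    """d x k projection matrix with entries sqrt(3/k) * {+1, 0, -1},
    drawn with probabilities 1/6, 2/3, 1/6 (Achlioptas, 2003)."""
    signs = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / k) * signs

rng = np.random.default_rng(0)
A = rng.random((500, 2000))           # placeholder: 500 documents x 2000 term features
R = achlioptas_matrix(2000, 100, rng)
A_reduced = A @ R                     # 500 x 100; pairwise distances approximately preserved
```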
Results and experiments
Clustering quality per domain (mean values, standard deviations in parentheses):

domain        number of clusters   precision      recall         F-measure
business      3.9 (0.3)            0.81 (0.022)   0.56 (0.077)   0.66 (0.034)
culture       3 (0)                0.37 (0.015)   0.7 (0.061)    0.48 (0.024)
automotive    4.8 (0.4)            0.39 (0.007)   0.56 (0.021)   0.45 (0.01)
science       2.1 (0.3)            0.39 (0.014)   0.74 (0.016)   0.51 (0.014)
sport         4.8 (0.4)            0.39 (0.007)   0.56 (0.021)   0.45 (0.01)
Entropy for the employed algorithms:

employed algorithms     Entropy
vsm+kmeans              0.28 (0.012)
vsm+tfidf+kmeans        0.17 (0.019)
vsm+tfidf+svd+kmeans    0.16 (0.006)
Results and experiments
(Plot: entropy mean versus reduction size; entropy mean between about 0.75 and 1.05 for reduction sizes from 2 to 7900.)
GPU implementation
NVIDIA Tesla M2090 vs. Intel Xeon E5645

reduction size   GPGPU [ms]   CPU [ms]
10               33           80
20               77           305
30               107          420
40               161          624
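The timings above come from the authors' own GPGPU implementation; purely as an illustration of running such a reduction on a GPU, a minimal sketch assuming the CuPy library (not the benchmarked code):

```python
# Illustration only: GPU-accelerated SVD-based reduction with CuPy (assumed library).
import cupy as cp

A = cp.random.rand(4000, 2000, dtype=cp.float32)   # placeholder term-document matrix
U, s, Vt = cp.linalg.svd(A, full_matrices=False)   # SVD executed on the GPU

k = 40                                             # reduction size, as in the table above
A_k = U[:, :k] @ cp.diag(s[:k]) @ Vt[:k, :]
cp.cuda.Stream.null.synchronize()                  # wait for the GPU work to finish
```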
Conclusions and future work
Combining more processing steps (TF-IDF, SVD) lowers the clustering entropy
GPUs can efficiently reduce the time of text classification
Future work: hardware implementation of random projection
Future work: GPU acceleration of K-means
Questions