Recent Trends in Text Mining

Girish Keswani
gkeswani@micron.com
Text Mining?
- What?
- Why?
  - Data Mining on Text Data
  - Information Retrieval
  - Confusion Set Disambiguation
  - Topic Distillation
- How?
  - Data Mining

Organization
- Text Mining Algorithms
- Jargon Used
- Background: Data Modeling, Text Classification, and Text Clustering
- Applications
- Experiments {NBC, NN and ssFCM}
- Further work
- References
Text Mining Algorithms
- Classification Algorithms
  - Naïve Bayes Classifier
  - Decision Trees
  - Neural Networks
- Clustering Algorithms
  - EM Algorithms
  - Fuzzy Clustering
Jargon
- DM: Data Mining
- IR: Information Retrieval
- NBC: Naïve Bayes Classifier
- EM: Expectation Maximization
- NN: Neural Networks
- ssFCM: Semi-Supervised Fuzzy C-Means
- Labeled Data (Training Data)
- Unlabeled Data
- Test Data
Background: Modeling
- Vector Space Model
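In the vector space model each document becomes a weighted term vector and documents are compared by the angle between their vectors. A minimal sketch, assuming scikit-learn is available (the documents below are invented for illustration):

```python
# Vector space model sketch: documents -> TF-IDF term vectors -> cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text mining applies data mining to text data",
    "data mining finds patterns in databases",
    "neural networks can classify documents",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # shape: (n_documents, vocabulary_size)
print(cosine_similarity(X[0], X[1]))      # similarity between documents 0 and 1
```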
Background: Modeling
- Generative Models of Data [13]: probabilistic
  - "To generate a document, a class is first selected based on its prior probability, and then a document is generated using the parameters of the chosen class distribution."
  - NBC and EM algorithms are based on this model.
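As a concrete illustration of that generative view, the toy sketch below draws a class from its prior, samples words from that class's word distribution, and classifies with Bayes' rule. The vocabulary, priors, and word probabilities are invented, not taken from the experiments later in the deck.

```python
import numpy as np

# Toy generative model: pick a class from P(class), then draw words from P(word | class).
vocab = ["free", "price", "meeting", "report"]
priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": np.array([0.45, 0.35, 0.10, 0.10]),
    "ham":  np.array([0.05, 0.10, 0.40, 0.45]),
}
rng = np.random.default_rng(0)

def generate_document(n_words):
    cls = rng.choice(list(priors), p=list(priors.values()))
    return cls, list(rng.choice(vocab, size=n_words, p=word_probs[cls]))

def classify(words):
    # NBC: argmax over classes of log P(class) + sum of log P(word | class).
    def log_posterior(cls):
        return np.log(priors[cls]) + sum(
            np.log(word_probs[cls][vocab.index(w)]) for w in words)
    return max(priors, key=log_posterior)

print(generate_document(5))
print(classify(["free", "price", "free"]))
```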
Importance of Unlabeled Data?
[Diagram: data sets A through G spanning Labeled Data, Unlabeled Data, and Test Data]
- Provides access to the feature distribution in set F using joint probability distributions.
How to make use of Unlabeled Data?
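A minimal sketch of the EM-with-Naïve-Bayes recipe described in [2] and [3]: train NBC on the labeled documents, then repeatedly (E-step) assign probabilistic labels to the unlabeled documents and (M-step) retrain on everything. It assumes scikit-learn and dense count matrices; the hard-label-plus-confidence-weight M-step is a simplification of the true soft-posterior update, and the function names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    # Initial model from the labeled documents only.
    clf = MultinomialNB().fit(X_labeled, y_labeled)
    X_all = np.vstack([X_labeled, X_unlabeled])
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled documents.
        probs = clf.predict_proba(X_unlabeled)
        # M-step (simplified): retrain on labeled data plus unlabeled data
        # given their most probable class, weighted by that probability.
        y_unlabeled = clf.classes_[probs.argmax(axis=1)]
        weights = np.concatenate([np.ones(len(y_labeled)), probs.max(axis=1)])
        y_all = np.concatenate([y_labeled, y_unlabeled])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=weights)
    return clf
```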
Experimental Results [1]
Using NBC, EM and ssFCM
Experimental Results [2]
Using NBC and EM
Extensions and Variants of these Approaches
- Authors in [6] propose the concept of a Class Distribution Constraint matrix
- Results on Confusion Set Disambiguation
Automatic Title Generation [7]: Using the EM Algorithm
- Non-extractive approach
Relational Data [9]
- A collection of data in which the relations between entities are made explicit is known as relational data
- Probabilistic Relational Models
Commercial Use / Products
- IBM Text Analyzer [11]
- SAS Text Miner [12]
  - Singular Value Decomposition
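The Singular Value Decomposition used in products like SAS Text Miner [12] amounts to projecting the term-document matrix onto a few "concept" dimensions (latent semantic indexing). A small NumPy sketch with an invented count matrix:

```python
import numpy as np

# Term-document count matrix: rows are terms, columns are documents (made-up values).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # keep the two strongest concepts
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # each document in concept space
print(doc_vectors)
```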
Filtering Junk Email
- Decision tree based (toy sketch below)
- Hotmail, Yahoo

Advanced Search Engines
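A toy sketch of decision-tree-based junk-mail filtering in the spirit of the Hotmail/Yahoo example; the messages, labels, and depth limit are invented, and scikit-learn is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

mails = [
    "win a free prize now",
    "cheap offer act now",
    "meeting agenda attached",
    "quarterly project report",
]
labels = ["junk", "junk", "ok", "ok"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)                    # bag-of-words counts
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(tree.predict(vectorizer.transform(["free prize meeting"])))
```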
Applications: Search Engines
- Vivisimo Search Engine (www.vivisimo.com)
Experiments
- NBC: Naïve Bayes Classifier (probabilistic)
- NN: Neural Networks
- ssFCM: Semi-Supervised Fuzzy Clustering (fuzzy; see the sketch below)
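The sketch below shows one common semi-supervised fuzzy c-means variant, in which the memberships of labeled points are clamped to their classes while unlabeled points follow the usual FCM updates; it may differ in detail from the ssFCM formulation used in [1]. Only NumPy is assumed, and `labels` uses -1 to mark unlabeled points.

```python
import numpy as np

def ssfcm(X, labels, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Semi-supervised fuzzy c-means: labels[i] in {0..n_clusters-1}, or -1 if unlabeled."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # random fuzzy memberships
    labeled = labels >= 0
    U[labeled] = 0.0
    U[labeled, labels[labeled]] = 1.0                 # clamp labeled points to their class
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Euclidean distances from every point to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)  # standard FCM membership update
        U_new[labeled] = U[labeled]                   # keep labeled memberships fixed
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers

# Tiny made-up example: two labeled points seed two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.8], [2.5, 2.4]])
labels = np.array([0, -1, 1, -1, -1])
U, centers = ssfcm(X, labels, n_clusters=2)
print(U.round(2))
```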
Datasets (20 Newsgroups Data)

Sampling I:
Dataset   # Features
min2      --
min4      9467
min6      5685

Sampling II:
Dataset    Sampling Percentage   Number of Features
Sample25   25%                   13925
Sample30   30%                   15067
Sample35   35%                   16737
Sample40   40%                   16871
Sample45   45%                   17712
Sample50   50%                   19135
[Diagram: Raw Data -> Sampling I / Sampling II -> Vectors]
Naïve Bayes Classifier (Sample25 and Sample30)

% TRAINING   % TEST   ACCURACY %
20           80       34.4637
63           36       48.4945
76           23       50.9322
82           17       47.7728
86           13       48.9971

20           80       31.5436
63           36       48.0729
76           23       47.8661
82           17       50.5568
86           13       50.4587

33           66       39.1137
66           33       46.4233
77           22       48.5528
83           16       52.7383
86           13       51.2136

33           66       39.26
66           33       47.0192
77           22       48.8439
83           16       49.6907
86           13       51.6169
Naïve Bayes Classifier
[Plot: accuracy % versus normal quantile for Sample25 and Sample30]
NBC
[Plots: accuracy % versus % TRAINING, and accuracy % versus % TEST, for Sample25 and Sample30]
ssFCM
[Plots: effect of labeled data (accuracy % versus % LABELED) and effect of unlabeled data (accuracy % versus % UNLABELED)]

ssFCM
[Plot: accuracy % by sample, sample25 through sample50]
Further Work
- Ensemble of Classifiers [16] (a rough sketch follows)
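As a rough illustration of the ensemble idea in [16], the sketch below combines three classifiers by majority vote using scikit-learn's VotingClassifier; the choice of base learners and their parameters is arbitrary, not taken from [16].

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

ensemble = VotingClassifier(
    estimators=[
        ("nbc", MultinomialNB()),
        ("tree", DecisionTreeClassifier(max_depth=10)),
        ("nn", MLPClassifier(hidden_layer_sizes=(50,), max_iter=300)),
    ],
    voting="hard",   # each member votes; the majority class wins
)
# Usage (X_train, y_train, X_test are whatever document vectors were built above):
# ensemble.fit(X_train, y_train)
# predictions = ensemble.predict(X_test)
```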
Further Work
- Knowledge Gathering from Experts
- E.g., 3-class data: input data {C1, C2, C3}
[Diagram: test data fed to a classifier, which assigns one of C1, C2, or C3]
References
[1] "Text Classification using Semi-Supervised Fuzzy Clustering," Girish Keswani and L. O. Hall, IEEE WCCI 2002.
[2] "Using Unlabeled Data to Improve Text Classification," Kamal Paul Nigam.
[3] "Text Classification from Labeled and Unlabeled Documents using EM," Kamal Paul Nigam et al.
[4] "The Value of Unlabeled Data for Classification Problems," Tong Zhang.
[5] "Learning from Partially Labeled Data," Martin Szummer et al.
[6] "Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint," Yoshimasa Tsuruoka and Jun'ichi Tsujii.
[7] "Automatic Title Generation using EM," Paul E. Kennedy and Alexander G. Hauptmann.
[8] "Unlabeled Data Can Degrade Classification Performance of Generative Classifiers," Fabio G. Cozman and Ira Cohen.
[9] "Probabilistic Classification and Clustering in Relational Data," Ben Taskar et al.
[10] "Using Clustering to Boost Text Classification," Y. C. Fang et al.
[11] IBM Text Analyzer: "A decision-tree-based symbolic rule induction system for text categorization," D. E. Johnson et al.
[12] "SAS Text Miner," Reincke.
[13] "Pattern Recognition," Duda and Hart, 2000.
[14] "Machine Learning," Tom Mitchell.
[15] "Data Mining," Margaret Dunham.
[16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/