Data Mining Final Project

Nick Foti
Eric Kee
Topic: Author Identification
• Author Identification
– Given writing samples, can we determine who
wrote them?
– This is a well studied field
• See also: “stylometry”
– This has been applied to works such as
• The Bible
• Shakespeare
• Modern texts as well
Corpus Design
• A corpus is:
– A body of text used for linguistic analysis
• Used Project Gutenberg to create corpus
• The corpus was designed as follows
– Four authors of varying similarity
• Anne Brontë
• Charlotte Brontë
• Charles Dickens
• Upton Sinclair
– Multiple books per author
• Corpus size: 90,000 lines of text
Dataset Design
• Extracted features common in literature
– Word Length
– Frequency of “glue” words
• See Appendix A and [1,2] for list of glue words
• Note: corpus was processed using
– C#, Matlab, Python
• Data set parameters are
– Number of dimensions: 309
• Word length and 308 glue words
– Number of observations: ≈ 3,000
• Each observation ≈ 30 lines of text from a book
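The feature extraction described above might look roughly like the following sketch. It assumes glue-word features are relative frequencies per observation (the slides do not say whether counts or frequencies were used), uses a tiny illustrative subset of the Appendix A list, and the function names are hypothetical.

```python
import re

# Tiny illustrative subset; the full 308-word list is in Appendix A.
GLUE_WORDS = ["a", "and", "but", "of", "the", "to"]

def tokenize(text):
    """Lowercase the text and split it into words, keeping ASCII apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def extract_features(lines, glue_words=GLUE_WORDS):
    """Return [average word length, relative frequency of each glue word]."""
    words = tokenize(" ".join(lines))
    if not words:
        return [0.0] * (1 + len(glue_words))
    avg_len = sum(len(w) for w in words) / len(words)
    # Assumption: frequency = occurrences divided by words in the observation.
    freqs = [words.count(g) / len(words) for g in glue_words]
    return [avg_len] + freqs

def make_observations(book_lines, lines_per_obs=30):
    """Cut a book into consecutive chunks of lines, one feature vector each."""
    chunks = [book_lines[i:i + lines_per_obs]
              for i in range(0, len(book_lines), lines_per_obs)]
    return [extract_features(chunk) for chunk in chunks]
```

With the full glue-word list, each vector would have 309 entries: word length followed by 308 frequencies.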
Classifier Testing and Analysis
• Tested classifier with test data
– Used testing and training data sets
• 70% for training, 30% for testing
– Used cross-validation
• Analyzed classifier performance
– Used ROC plots
– Used confusion matrices
• Used a common plotting scheme for confusion matrices
[Figure: example confusion-matrix plot with bars Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP; values shown: 78%, 55%, 45%, 22%. Red dots indicate true-positive cases.]
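A minimal sketch of the 70/30 split and of a row-normalized confusion matrix follows. The function names and the fixed random seed are assumptions, and the slides do not state exactly how their TP and FP percentages were defined, so the rates here are simply "fraction of each true class assigned to each predicted class".

```python
import random

def train_test_split(observations, labels, train_fraction=0.7, seed=0):
    """Shuffle indices and split them into training and testing parts."""
    indices = list(range(len(observations)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(indices)))
    train, test = indices[:cut], indices[cut:]
    pick = lambda idx: ([observations[i] for i in idx], [labels[i] for i in idx])
    return pick(train), pick(test)

def confusion_rates(true_labels, predicted_labels):
    """For each true class, the fraction of its observations given each predicted class."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    rates = {}
    for t in classes:
        total = sum(counts[t].values())
        rates[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return rates
```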
Binary Classification
Word Length Classification
• Calculated average word length for each
observation
• Computed Gaussian kernel density from word
length samples
• Used ROC curve to calculate cutoff
– Optimized sensitivity and specificity with equal
importance
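The steps above can be sketched as follows, assuming the cutoff is chosen by scanning thresholds on the per-observation average word length and maximizing sensitivity + specificity (one way of weighting the two equally). The bandwidth argument and function names are assumptions, not values from the slides.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

def best_cutoff(low_class, high_class):
    """Scan candidate cutoffs; return the one maximizing sensitivity + specificity.

    Values above the cutoff are assigned to `high_class`.
    """
    values = sorted(set(low_class) | set(high_class))
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best, best_score = None, -1.0
    for c in candidates:
        sensitivity = sum(v > c for v in high_class) / len(high_class)
        specificity = sum(v <= c for v in low_class) / len(low_class)
        if sensitivity + specificity > best_score:
            best, best_score = c, sensitivity + specificity
    return best, best_score
```

The density function is only used for plotting each author's word-length distribution; the cutoff itself comes from the threshold scan.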
Word Length: Anne B. vs Upton S.
[Figure: confusion-matrix plot. Anne B. TP 100%, Anne B. FP 0%, Upton Sinclair FP 0%, Upton Sinclair TP 100%.]
Word Length: Brontë vs. Brontë
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP; values shown: 100%, 78.1%, 21.9%, 0%.]
Principal Component Analysis
• Used PCA to find a better axis
• Notice: distribution similar to word length distribution
• Is word length the only useful dimension?
[Figure: Anne Brontë vs. Upton Sinclair; panels: Word Length Density, PCA Density]
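As a rough illustration of projecting observations onto the first principal component, the sketch below estimates the leading eigenvector of the sample covariance by power iteration. This is a dependency-free stand-in, not the authors' implementation; for 309 dimensions a linear-algebra library would be the practical choice. Function names are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Estimate the leading eigenvector of the sample covariance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v

def project(rows, means, axis):
    """Project each centered observation onto the given axis (1-D scores)."""
    return [sum((r[j] - means[j]) * axis[j] for j in range(len(axis)))
            for r in rows]
```

The 1-D scores returned by `project` can then be treated like the word-length values: estimate a density per author and pick a cutoff.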
Principal Component Analysis
Without word length
• It appears that word length is the most useful axis
• We’ll come back to this…
[Figure: PCA Density, Anne Brontë vs. Upton Sinclair]
K-Means
• Used K-means to find dominant patterns
– Unnormalized
– Normalized
• Trained K-means on training set
• To classify observations in test set
– Calculate distance of observation to each class
mean
– Assign observation to the closest class
• Performed cross-validation to estimate
performance
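The classification rule above (assign each test observation to the closest class mean) can be sketched as below. It assumes class means are computed from labeled training observations, that distance is Euclidean, and that "normalized" means z-scoring each dimension with training-set statistics; none of these details are confirmed by the slides.

```python
import math

def column_stats(rows):
    """Per-dimension mean and standard deviation (used for normalization)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return means, stds

def normalize(rows, means, stds):
    """Z-score each dimension using statistics from the training set."""
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))] for r in rows]

def class_means(rows, labels):
    """Average the training observations of each class."""
    sums, counts = {}, {}
    for r, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(r))
        for j, x in enumerate(r):
            acc[j] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def classify(row, means_by_class):
    """Assign the observation to the class with the closest mean (Euclidean)."""
    return min(means_by_class, key=lambda y: math.dist(row, means_by_class[y]))
```

For the normalized variant, the same training-set `means` and `stds` would be applied to the test observations before calling `classify`.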
Unnormalized K-means
Anne Brontë vs. Upton Sinclair
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP; values shown: 98.1%, 92.1%, 7.9%, 1.9%.]
Unnormalized K-means
Anne Brontë vs. Charlotte Brontë
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP; values shown: 95.7%, 74.7%, 25.3%, 4.3%.]
Normalized K-means
Anne Brontë vs. Upton Sinclair
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP; values shown: 53.3%, 50.6%, 49.4%, 46.7%.]
Normalized K-means
Anne Brontë vs. Charlotte Brontë
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP; values shown: 86.7%, 84.2%, 15.8%, 13.3%.]
Discriminant Analysis
• Performed discriminant analysis
– Computed with equal covariance matrices
• Used average Omega of class pairs
– Computed with unequal covariance matrices
• Quadratic discrimination fails because covariance
matrices have 0 determinant (see equation below)
– Computed theoretical misclassification probability
• To perform quadratic discriminant analysis
– Compute Equation 1 for each class
– Choose class with minimum value
(1)
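For reference, a common form of the quadratic discriminant score that fits the steps above (compute one value per class, choose the minimum) is sketched here. It assumes Gaussian classes; the symbols \mu_k (class mean), \Omega_k (class covariance, following the slides' "Omega") and \pi_k (class prior) are assumptions and may not match the authors' Equation 1.

```latex
% Quadratic discriminant score for class k (assumed form)
d_k(x) = (x - \mu_k)^{\top} \Omega_k^{-1} (x - \mu_k)
         + \ln \lvert \Omega_k \rvert - 2 \ln \pi_k

% Two classes, shared covariance \Omega, equal priors:
% theoretical misclassification probability
P(\mathrm{err}) = \Phi\!\left(-\tfrac{\Delta}{2}\right),
\qquad
\Delta^2 = (\mu_1 - \mu_2)^{\top} \Omega^{-1} (\mu_1 - \mu_2)
```

When \lvert \Omega_k \rvert = 0, the inverse does not exist and the log-determinant is undefined, which is consistent with the failure of quadratic discrimination noted above. Here \Phi is the standard normal cumulative distribution function.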
Discriminant Analysis
Anne Brontë vs. Upton Sinclair
Empirical P(err) = 0.116
Theoretical P(err) = 0.149
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP; values shown: 96.2%, 92.2%, 3.8%, 7.8%.]
Discriminant Analysis
Anne Brontë vs. Charlotte Brontë
Empirical P(err) = 0.152
Theoretical P(err) = 0.181
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP; values shown: 92.7%, 89.2%, 10.8%, 7.3%.]
Logistic Regression
• Fit linear model to training data on all dimensions
• Threw out singular dimensions
– Left with ≈ 298 coefficients + intercept
• Projected training data onto synthetic variable
– Found threshold by minimizing error of misclassification
• Projected testing data onto synthetic variable
– Used threshold to classify points
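A minimal sketch of this procedure follows: fit a logistic model, project each observation onto the linear predictor (the "synthetic variable"), choose the threshold with the fewest training misclassifications, and reuse it on test data. Fitting by plain gradient descent, the learning rate, and the function names are all assumptions made to keep the example self-contained.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear_score(row, intercept, coefs):
    """The 'synthetic variable': the linear predictor for one observation."""
    return intercept + sum(c * x for c, x in zip(coefs, row))

def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent on the log-loss."""
    d = len(rows[0])
    intercept, coefs = 0.0, [0.0] * d
    for _ in range(epochs):
        grad0, grad = 0.0, [0.0] * d
        for r, y in zip(rows, labels):
            err = sigmoid(linear_score(r, intercept, coefs)) - y
            grad0 += err
            for j in range(d):
                grad[j] += err * r[j]
        intercept -= learning_rate * grad0 / len(rows)
        coefs = [c - learning_rate * g / len(rows) for c, g in zip(coefs, grad)]
    return intercept, coefs

def best_threshold(scores, labels):
    """Pick the cutoff on the scores that misclassifies the fewest training points."""
    values = sorted(set(scores))
    candidates = [values[0] - 1.0] + [(a + b) / 2.0
                                      for a, b in zip(values, values[1:])]
    def errors(t):
        return sum((s > t) != bool(y) for s, y in zip(scores, labels))
    return min(candidates, key=errors)
```

A test observation is then assigned to class 1 when its `linear_score` exceeds the threshold found on the training data.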
Logistic Regression
Anne Brontë vs Charlotte Brontë
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP; values shown: 92%, 89.5%, 10.5%, 8%.]
Logistic Regression
Anne Brontë vs Upton Sinclair
[Figure: confusion-matrix plot with bars Anne B. TP, Anne B. FP, Upton S. TP, Upton S. FP; values shown: 98%, 99%, 2%, 2%.]
4-Class Classification
4-Class K-means
• Used K-means to find patterns among all
classes
– Unnormalized
– Normalized
• Trained using a training set
• Tested performance as in 2-class K-means
• Performed cross-validation to estimate
performance
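Cross-validation as used here can be sketched with a generic k-fold loop. The number of folds is not given in the slides, so `k=5` is an assumption, and `fit_predict` is a hypothetical callback standing in for whichever classifier is being evaluated.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, fit_predict, k=5, seed=0):
    """Average accuracy over k folds.

    `fit_predict(train_rows, train_labels, test_rows)` is a hypothetical
    callback returning one predicted label per test row.
    """
    folds = k_fold_indices(len(rows), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        predictions = fit_predict([rows[i] for i in train],
                                  [labels[i] for i in train],
                                  [rows[i] for i in held_out])
        correct = sum(p == labels[i] for p, i in zip(predictions, held_out))
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)
```

Each observation is held out exactly once, so the averaged accuracy uses all of the data for testing without ever testing on a point the classifier was trained on.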
Unnormalized K-Means
4-Class Confusion Matrix
[Figure: one group of bars (CD, AB, CB, US) per author: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair; bars are marked TP or FP; values shown: 88%, 87%, 59%, 54%, 34%, 22%.]
Normalized K-Means
4-Class Confusion Matrix
[Figure: one group of bars (CD, AB, CB, US) per author: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair; bars are marked TP or FP; values shown: 67%, 70%, 67%, 67%, 27%, 26%, 20%.]
Additional K-means testing
• Also tested K-means without word length
– Recall that we had perfect classification with 1D
word length (see plot below)
– Is K-means using only 1 dimension to classify?
[Figure: word-length classification plot, Anne Brontë vs. Upton Sinclair]
Note: perfect classification only occurs between Anne B. and Sinclair
Unnormalized K-Means (No Word Length)
• K-means can classify without word length
4-Class Confusion Matrix
[Figure: one group of bars (CD, AB, CB, US) per author: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair; bars are marked TP or FP; values shown: 72%, 44%, 43%, 35%, 35%, 33%, 29%.]
Multinomial Regression
• Multinomial distribution
– Extension of binomial distribution
• Random variable is allowed to take on n values
• Used multinom(…) to fit a log-linear model to the
training data
– Used 248 dimensions (max limit on computer)
– Returns 3 coefficients per dimension and 3
intercepts
• Found the probability that each observation belongs to
each class
Multinomial Regression
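• Multinomial Logit Function is
\[ \Pr(y_i = j) = \frac{\exp(c_j + \beta_j^{\top} x_i)}{1 + \sum_{k} \exp(c_k + \beta_k^{\top} x_i)} \]
where \beta_j are the coefficients and c_j are the intercepts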
• To classify
– Compute probabilities
• Pr(yi = Dickens), Pr(yi = Anne B.), …
– Choose class with maximum probability
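The classification step can be sketched as below, assuming a baseline-category logit: three sets of intercepts and coefficients for the non-baseline authors, with the remaining author as the baseline at score 0 (consistent with "3 coefficients per dimension and 3 intercepts"). Which author served as the baseline is not stated, and the function names are hypothetical.

```python
import math

def class_probabilities(x, intercepts, coefficients, baseline="baseline"):
    """Baseline-category multinomial logit.

    `intercepts[j]` and `coefficients[j]` hold c_j and beta_j for every
    non-baseline class j; the baseline class has linear score 0.
    """
    scores = {baseline: 0.0}
    for j in intercepts:
        scores[j] = intercepts[j] + sum(b * v for b, v in zip(coefficients[j], x))
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def classify(x, intercepts, coefficients, baseline="baseline"):
    """Choose the class with the maximum probability."""
    probs = class_probabilities(x, intercepts, coefficients, baseline)
    return max(probs, key=probs.get)
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged but avoids overflow when some linear scores are large.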
Multinomial Regression
4-Class Confusion Matrix
[Figure: one group of bars (CD, AB, CB, US) per author: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair; bars are marked TP or FP; values shown: 86%, 78%, 83%, 93%.]
Multinomial Regression
(Without Word Length)
• Multinomial regression does not require word length
4-Class Confusion Matrix
[Figure: one group of bars (CD, AB, CB, US) per author: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair; bars are marked TP or FP; values shown: 79%, 76%, 79%, 91%.]
Appendix A: Glue Words
I a aboard about above across after again against ago ahead all almost along alongside already
also although always am amid amidst among amongst an and another any anybody anyone
anything anywhere apart are aren't around as aside at away back backward backwards be because
been before beforehand behind being below between beyond both but by can can't cannot could
couldn't dare daren't despite did didn't directly do does doesn't doing don't done down during
each either else elsewhere enough even ever evermore every everybody everyone everything
everywhere except fairly farther few fewer for forever forward from further furthermore had
hadn't half hardly has hasn't have haven't having he hence her here hers herself him himself his
how however if in indeed inner inside instead into is isn't it its itself just keep kept later least less
lest like likewise little low lower many may mayn't me might mightn't mine minus more
moreover most much must mustn't my myself near need needn't neither never nevertheless next
no no-one nobody none nor not nothing notwithstanding now nowhere of off often on once one
ones only onto opposite or other others otherwise ought oughtn't our ours ourselves out outside
over own past per perhaps please plus provided quite rather really round same self selves several
shall shan't she should shouldn't since so some somebody someday someone something
sometimes somewhat still such than that the their theirs them themselves then there therefore
these they thing things this those though through throughout thus till to together too towards
under underneath undoing unless unlike until up upon upwards us versus very via was wasn't way
we well were weren't what whatever when whence whenever where whereas whereby wherein
wherever whether which whichever while whilst whither who whoever whom with whose within
why without will won't would wouldn't yet you your yours yourself yourselves
Conclusions
• Authors can be identified by their word usage frequencies
• Word length may be used to distinguish between the Brontë
sisters
– Word length does not, however, extend to all authors (See Appendix C)
• The glue words describe genuine differences between all four
authors
– K-means distinguishes the same patterns that multinomial regression
classifies
• This indicates that supervised training finds legitimate patterns, rather than
artifacts
• The Brontë sisters are the most similar authors
• Upton Sinclair is the most different author
Appendix B: Code
• See Attached .R files
Appendix C: Single Dimension 4-Author Classification
Classification using Multinomial Regression
4-Class Confusion Matrix
[Figure: one group of bars (CD, AB, CB, US) per author: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair; bars are marked TP or FP; values shown: 96%, 94%, 54%, 46%, 22%, 11%, 6%, 3%.]
References
[1] Argamon, Saric, Stein, “Style Mining of Electronic Messages for Multiple
Authorship Discrimination: First Results,” SIGKDD 2003.
[2] Mitton, “Spelling checkers, spelling correctors and the misspellings of poor
spellers,” Information Processing and Management, 1987.