Gender Identification of SMS Texts By: Shannon Silessi

advertisement
Gender Identification of SMS Texts
By: Shannon Silessi
Introduction
Introduction
U.S. wireless users send &
receive an average of 6
billion text messages a day
Visual anonymity can be
misused and exploited by
criminals
Cyber forensics methods are
needed for detecting SMS
authors for use in criminal
persecution cases
Introduction (Cont’d)
Authorship
Characterization
Categorizing an author’s
text according to
sociolinguistic attributes
such as:
•
•
•
•
•
•
Gender
Age
Educational Background
Income
Nationality
Race
Introduction (Cont’d)
Most research in
the area of
authorship
characterization
has been
conducted on
larger, more formal
written documents
Unusual
characteristics of
SMS make it
difficult to apply
traditional
stylometric
techniques
Introduction (Cont’d)
Characteristics of SMS text messages
Limited to 140 characters [3]
Often contain abbreviations or
written representations of
sounds
(e.g. ‘kt’ for Katie) [3]
Various phonetic spellings for
verbal effects
(‘hehe’ for laughter and
‘muaha’ for evil laughter) [3]
Emoticons, such as 
(representing a frown) [3]
Combined letters & numbers
for compression
(‘CUL8R’ for ‘See You Later’)
[3]
Background
• 545 psycholinguistic & genderpreferential cues [4]
• Dataset: collection of all English
language stories produced by Reuters
journalists between August 20, 1996
and August 19, 1997 [4]
• Used messages that contained 200 <
1000 words
Background (Cheng Cont’d)
Enron email dataset
• Messages containing 50
< 1000 words [4]
Best classification result
• SVM with 76.75%
(Reuter’s) & 82.23%
(Enron) accuracies [4]
Examined performance of
Bayesian-based logistic
regression, Ada-Boost
decision tree, & Support
Vector Machine (SVM) [4]
Examination of
parameter performance
• Accuracy increases with
increasing number of
words [4]
Background (Cheng Cont’d)
Issues
Examination of feature sets
• Word-based features &
function words were more
important [4]
•
•
Categorization of documents by
gender was based on perceived
gender of a person’s name
Unequal amount of male vs.
female authored documents
Background (Cont’d)
Argamon et al proposed using content-based features & style-based features [5]
Bayesian Multinomial
Regression [5]
Issues
Dataset: blog posts
by 19,320 authors [5]
Content features
more effective
classifiers [5]
Varying length of texts – ranging
from several hundred to tens of
thousand of words
Background (Cont’d)
• Algorithms: C4.5, knearest neighbor,
Naïve Bayes, & SVM
[2]
Background (Orebaugh Cont’d)
Datasets:
• IM conversation logs from 19 authors
collected by the Gaim and Adium
clients over a three year period [2]
• IM logs between undercover agents
and 100 different child predators that
are publicly available from U.S.
Cyberwatch [2]
Optimal algorithm was SVM,
using 356 features [2]
Background (Cont’d)
• 83 features [6]
• Algorithms: WEKA’s Bagging
variant for classification with
REPTree as base classifier [6]
• Dataset: 1,672 NY Times
opinion blogs written by 100
male & 100 female authors [6]
• Use of only syntactic feature
group: 77.03% [6]
Background
• Dataset: NUS SMS corpus [7]
• Cosine similarity performed best to
calculate distance between two vectors [7]
• As the # of stacked messages increases,
accuracy increases, but saturation is
reached around 20 messages [7]
• 545 psycholinguistic & gender-preferential
cues [4]
• Dataset: collection of all English language
stories produced by Reuters journalists
between August 20, 1996 and August 19,
1997 [4]
• Used messages that contained 200 < 1000
words
Methodology
Hybrid approach
using a classification
technique & N-gram
modeling
Dataset
• NSU SMS corpus
Accuracy will be
measured by the
percentage of correct
author gender
classifications
N-gram model
Classification using
Naïve Bayes
• longest common string
Methodology (Cont’d)
References
•
[1]
•
[2]
•
[3]
•
[4]
US Consumers Send Six Billion Text Messages a Day. 2013. CTIA-The
Wireless Association. http://www.ctia.org/resource-library/facts-andinfographics/archive/us-text-messages-sms .
A. Orebaugh and J. Allnutt. 2010. “Data Mining Instant Messaging
Communications to Perform Author Identification for Cybercrime
Investigations,” Lecture Notes of the Institute for Computer Sciences, Social
Informatics and Telecommunications Engineering.
http://dx.doi.org/10.1007/978-3-642-11534-9_10
M. Rafi. 2008. “SMS Text Analysis: Language, Gender and Current Practices,”
Online Journal of TESOL France. http://www.tesolfrance.org/Documents/
Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20
and%20Current%20Practice%20_1_.pdf.
N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. “Author gender
identification from text,” Digital Investigation: The International Journal of Digital
Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158.
References (Cont’d)
•
[5]
•
[6]
•
[7]
•
[8]
S. Argamon et al. 2009. “Automatically profiling the author of an anonymous
text,” Communications of the ACM - Inspiring Women in Computing.
http://dl.acm.org.ezproxy.shsu.edu/citation.cfm?id=1461928.1461959&coll=DL
&dl=ACM&CFID=615755468&CFTOKEN=20724033
J. Soler and L. Wanner. 2014. “How to Use Less Features and Reach Better
Performance in Author Gender Identification,” Proceedings of the Ninth
International Conference on Language Resources and Evaluation (LREC).
R. Ragel, P. Herath, and U. Senanayake. 2013. “Authorship Detection of SMS
Messages Using Unigrams,” Eighth IEEE International Conference on
Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314
Z. Miller, B. Dickinson, and W. Hu. 2012. “Gender Prediction on Twitter Using
Stream Algorithms with N-Gram Character Features,” International Journal of
Intelligence Science. http://dx.doi.org/10.4236/ijis.2012.224019
Download