Gender Identification of SMS Texts
Shannon Silessi
Computer Science Department
Sam Houston State University
Huntsville, TX, USA
sgs008@shsu.edu

Cihan Varol
Computer Science Department
Sam Houston State University
Huntsville, TX, USA
cvarol@shsu.edu
Abstract— Short message service (SMS) has become a very popular medium for communication due to its convenience and low cost. SMS text messaging makes it easy to provide a false name, age, gender and location in order to hide one’s true identity. Consequently, it has become important to establish new cyber forensics methods, such as detecting SMS message authors, as a post-hoc analysis technique deemed useful in criminal prosecution cases. In this paper, we propose a new combined method for classifying the gender of the authors of SMS text messages.
Keywords—SMS text; authorship classification; n-gram
I. INTRODUCTION
Short message service (SMS) has become a very popular
medium for communication due to its convenience and low
cost. As of December 2012, U.S. wireless users sent and
received an average of 6 billion text messages per day, or
69,635 text messages every second [1]. From a marketing
viewpoint, companies may be interested in knowing
demographics of people that like or dislike their products.
Unfortunately, SMS messages can also be used as a
communication channel for gangs, terrorists, and drug dealers.
The pervasive use of SMS is increasing the amount of digital
evidence available on cellular phones. Additionally, SMS
messaging makes it easy to provide a false name, age, gender
and location in order to hide one’s true identity. Consequently, it has become important to establish new cyber forensics methods, such as detecting SMS message authors, as a post-hoc analysis technique deemed useful in criminal prosecution
cases.
Categorizing an author’s text according to sociolinguistic
attributes such as gender, age, educational background,
income, linguistic background, nationality, profession,
psychological status, and race is known as authorship
characterization [2]. These attributes are aimed at inferring an
author’s characteristics rather than identity [2].
Most research in the area of authorship characterization has
been conducted on larger, more formal written documents.
SMS text messages have unusual characteristics making it hard
or impossible to apply traditional stylometric techniques, such
as frequency counts and word similarity. SMS text messages
are limited to 140 characters and often contain a unique style of
language and emoticons [3]. SMS text messages often contain
words that are abbreviated or written representations of the
sounds and compressions of what the intended reader should
perceive of the message (e.g. ‘kt’ instead of Katie) [3].
Emoticons, such as :-( (representing a frown) and :-)
(representing a smile), are a representation of facial
expressions or body language, which would otherwise be
missing from non face-to-face communication [3]. Senders of
SMS text messages also tend to use different phonetic spellings
in order to create different types of verbal effects in their
messages, such as ‘hehe’ for laughter and ‘muaha’ for an evil
laughter [3]. Letters and numbers are also often combined for
compression and convenience (e.g. ‘See you’ can be texted as
‘CU’ and ‘See You Later’ can be texted as ‘CUL8R’) [3]. All
of these facts make it difficult to use syntactic means to
characterize the authorship of SMS text messages.
II. BACKGROUND
Cheng et al. investigated author gender identification for
short length, multi-genre, content-free text, such as the ones
found in many Internet applications [4]. In their article, they
proposed 545 psycho-linguistic and gender-preferential cues
along with stylometric features to build the feature space [4].
Their initial design began with a set of features that remained
relatively constant for a large number of messages written by
authors of the same gender [4]. From there, a model (or
classifier) was built to determine the category of a given
message [4].
They identified four steps to the gender identification process:
1) collecting a suitable corpus of text messages to be
the dataset;
2) identifying features that are significant indicators
of gender;
3) extracting feature values from each message
automatically;
4) building a classification model to identify the
author’s gender of a candidate text message [4].
One of the datasets used in their study was a collection of
all English language stories produced by Reuters journalists
between August 20, 1996 and August 19, 1997 [4]. After
categorizing the documents by gender, they only kept the
messages that contained more than 200 but less than 1000
words [4]. Additionally, they removed messages that contained too many quotes of other people’s statements [4].
The other dataset used in the study was a collection of e-mails
from Enron consisting of 8970 e-mails that were collected over
three-and-a-half years from about 108 users [4]. Only e-mails
with more than 50 words and less than 1000 words were used
in their study [4].
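As a rough illustration of this filtering step, the sketch below keeps only documents whose length falls inside a word-count window and drops quote-heavy ones. Since [4] does not state an exact quote threshold, the ratio used here is an assumption.

```python
# Sketch of the corpus filtering described in [4]: keep documents within
# a word-count window and drop documents dominated by quoted material.
def keep_document(text, min_words=200, max_words=1000, max_quote_ratio=0.3):
    words = text.split()
    if not (min_words < len(words) < max_words):
        return False
    quoted_spans = text.count('"') // 2  # crude estimate of quoted passages
    # The 0.3 threshold is our assumption; [4] only says "too many quotes".
    return quoted_spans / max(len(words), 1) < max_quote_ratio

docs = ["too short", " ".join(["word"] * 500)]
print([keep_document(d) for d in docs])  # [False, True]
```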

For their analysis, they classified five sets of gender-related features: (1) character-based; (2) word-based; (3) syntactic; (4) structure-based; and (5) function words [4]. Twenty-nine stylometric features were used for their character-based features, such as the number of white-space characters and the number of special characters [4]. Their word-based features included 33 statistical metrics such as vocabulary richness, Yule’s K measure and an entropy measure [4]. Additionally, 68 psycho-linguistic features were used [4].
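As a concrete illustration of two of those word-based metrics, the sketch below computes Yule’s K and a word-frequency entropy. These are the standard formulations, which may differ in detail from the exact variants used in [4], and the tokenization is deliberately simplistic.

```python
# Sketch of two word-based richness metrics named in [4]:
# Yule's K and word-frequency entropy.
import math
from collections import Counter

def yules_k(words):
    n = len(words)
    freq_of_freqs = Counter(Counter(words).values())  # {occurrence count: #types}
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)

def entropy(words):
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in Counter(words).values())

words = "the cat sat on the mat and the dog sat too".split()
print(yules_k(words), entropy(words))
```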
To examine the performance of different classification
techniques, the Bayesian-based logistic regression, the AdaBoost decision tree and support vector machine (SVM) were applied separately [4]. For both datasets, SVM outperformed the Bayesian logistic regression and the AdaBoost decision
tree [4]. When the size of the training examples was increased,
the performance of the Bayesian logistic regression did not
change significantly [4].
However, there was a sharp
improvement in the AdaBoost decision tree when the size of
the training examples increased [4]. The best classification
result produced by SVM had accuracies of 76.75% and 82.23%
for two different datasets [4]. This implied that the authors of
neutral news (Reuters Newsgroup) are more difficult to
classify than the personal e-mails (Enron) [4].
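To illustrate this kind of comparison, the following sketch cross-validates three classifier families on a placeholder feature matrix, with scikit-learn assumed. LogisticRegression stands in for Bayesian logistic regression here, so this is an analogy to the setup in [4], not their implementation.

```python
# Hedged sketch: comparing classifier families similar to those in [4]
# on a placeholder feature matrix with fake gender labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 30))      # placeholder feature matrix
y = rng.integers(0, 2, 200)    # placeholder gender labels

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("adaboost", AdaBoostClassifier()),
                  ("svm", SVC(kernel="linear"))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())
```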
To examine the impact of the parameters on the classification performance, several experiments were conducted on the minimum number of words per message and the number of messages in the sample set [4]. The results showed that the accuracy increases as the number of words per message increases, since more words in one message may contain more information about the personal writing style and the corresponding gender influence [4].
To evaluate the significance of the proposed feature sets, SVM was applied to the sub-dataset of messages containing a minimum of 100 words, using one feature set at a time [4]. Although all five subsets contributed to the gender identifier, the word-based features and function words were shown to be more important gender classifiers [4].
One of the problems with their approach was that the initial categorization of the documents by the gender of the journalists appeared to be based solely on the perceived gender of a person’s name. In today’s society, there are many women bearing a historically male name and vice versa. This could possibly have impacted the results of the study. Another factor that could have affected the results was the unequal number of male-authored versus female-authored documents. It raises the question of whether an equal number of male- and female-authored documents would have increased the accuracy of the training datasets.
Argamon et al. proposed using two types of features for authorship profiling: content-based features and style-based features [5]. In their study, they used Systemic Functional Linguistics to represent the taxonomies describing meaningful distinctions among various function words and parts-of-speech [5]. The taxonomies were represented as trees with the following labeling structure [5]:
Roots – sets of parts-of-speech (articles, auxiliary verbs, conjunctions, prepositions, pronouns)
Children nodes – subclasses of the parent node (various sorts of personal pronouns)
Leaves – sets of individual words
For each node in the taxonomic trees, the frequency of the node’s occurrence in a text, normalized by the number of words in the text, was used to build the feature set [5], as sketched below. The Bayesian Multinomial Regression learning algorithm was used for learning the weight vector [5].
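The following is a minimal sketch of those node-frequency features over a hand-built two-level taxonomy. The word lists are illustrative placeholders, not the SFL taxonomies from [5].

```python
# Sketch: frequency of each taxonomy node, normalized by text length.
# The taxonomy below is illustrative, not the SFL trees used in [5].
taxonomy = {
    "pronouns": {"personal": {"i", "you", "she"}, "possessive": {"my", "your"}},
    "conjunctions": {"coordinating": {"and", "but", "or"}},
}

def node_frequencies(text):
    words = text.lower().split()
    n = len(words) or 1
    feats = {}
    for root, children in taxonomy.items():
        root_count = 0
        for child, leaves in children.items():
            child_count = sum(w in leaves for w in words)
            feats[f"{root}/{child}"] = child_count / n  # child-node feature
            root_count += child_count
        feats[root] = root_count / n                    # root-node feature
    return feats

print(node_frequencies("you and i went but she stayed"))
```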
The dataset used for the gender profiling experiment consisted of full sets of posts by 19,320 blog authors, where the gender of the author was self-reported [5]. The dataset included an equal number of male and female authors, and the texts ranged in length from several hundred to tens of thousands of words [5]. In their experiment, the content features were more effective classifiers than the style features [5]. The style features that proved most useful for gender discrimination were determiners and prepositions (markers of male writing) and pronouns (markers of female writing) [5]. The content features that proved most useful for gender discrimination were words related to technology for males and words related to personal life or relationships for females [5].
As noted by the authors, the content features used in this study were appropriate for gender classification on this specific collection of blogs [5]. Other content features may be more relevant for other types of text [5]. A potential problem with their approach was that the gender of the authors was self-reported. It is well known that people often misrepresent themselves online in order to mask their true identity.
Orebaugh and Allnutt discuss the unique style of IM (instant messaging) messages [2]. They note that IM messages include special linguistic elements, such as abbreviations and computer and Internet terms, referred to as netlingo [2]. These characteristics, which are more likely to represent the true stylistic habits of an author, give authors an identifiable writeprint [2]. A writeprint is the term associated with an author’s unique online writing style [2].
In their study, during the classification stage, writeprints
were used as input to build classification models [2]. They
used the WEKA data mining toolkit to perform classification
on the writeprints using the C4.5, k-nearest neighbor, Naïve
Bayes, and SVM classifiers [2]. Their proposed feature set
used a 356-dimensional vector including lexical, syntactic, and
structural features [2]. Since content specific features are
highly dependent on the topic of the messages, their feature set
taxonomy did not include content specific features [2]. The IM
feature set taxonomy included several new stylistic features,
such as abbreviations and emoticons, which are frequently
found in instant messaging communications [2]. The lexical
features primarily consisted of frequency totals and were
further broken down into emoticons, abbreviations, word-based, and character-based features [2]. The syntactic features
included punctuation and function words in order to capture an
author’s habits of organizing sentences [2]. Structural features,
which capture the way an author organizes the layout of text,
consisted of the average characters and words per message [2].
Two datasets were used for this research [2]. One dataset
consisted of personal IM conversation logs from 19 authors
collected by the Gaim and Adium clients over a three year
period [2]. The other dataset consisted of complete IM logs
between undercover agents and 100 different child predators
that are publicly available from U.S. Cyberwatch [2]. The
optimal algorithm was determined to be SVM, which was
achieved using the entire set of 356 features [2].
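For illustration, the sketch below extracts a handful of IM-specific writeprint features of the kind described in [2]. The emoticon and abbreviation lists are illustrative placeholders; the actual feature set in [2] has 356 dimensions.

```python
# Sketch of a few lexical/structural writeprint features like those in [2].
# Emoticon/abbreviation lists are placeholders, not the 356-feature set.
import re

EMOTICONS = [":-)", ":-(", ":)", ":("]
ABBREVIATIONS = {"lol", "brb", "cu", "l8r"}

def writeprint_features(messages):
    text = " ".join(messages)
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "emoticon_count": sum(text.count(e) for e in EMOTICONS),
        "abbreviation_count": sum(w in ABBREVIATIONS for w in words),
        "punctuation_count": len(re.findall(r"[.,;:!?]", text)),
        "avg_chars_per_message": sum(len(m) for m in messages) / len(messages),
        "avg_words_per_message": len(words) / len(messages),
    }

print(writeprint_features(["brb :-)", "cu l8r!"]))
```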
Soler and Wanner propose using a small number of features
that mainly depend on the structure of the texts as opposed to
other approaches that depend mainly on the content of the texts
and that use a huge number of features in the process of
identifying the gender of an author [6]. They argue that in previous studies, where a large number of features was used and dimension reduction was applied to minimize the complexity, the accuracy and transparency of the outcomes were affected negatively [6].
Their study included 83 features, which consisted of five different types of features: character-based features, word-based features, sentence-based features, dictionary-based features, and syntactic features [6]. WEKA’s Bagging variant
was used for classification with REPTree (a fast decision tree
learning algorithm) as base classifier [6]. The dataset used
consisted of a collection of postings of a New York (NY)
Times opinion blog, which is extremely multi-thematic [6]. It
contained 836 texts written by more than 100 male and 836
texts written by more than 100 female authors [6].
Using the whole, yet much smaller, set of features for the NY Times blog dataset, their experiments showed that accuracy could be obtained that is well within the range of the accuracies achieved by the state-of-the-art approaches in this area [6]. Using only the syntactic feature group, an
accuracy of 77.03% was achieved, demonstrating that there are
important differences in how men and women syntactically
structure their sentences [6]. On a separate dataset, the usage
of only syntactic features achieved an accuracy of 95.08% [6].
Compared to other studies that use thousands of features capturing only the content of writings, their approach demonstrated that the quality of the features is more important than their quantity [6]. However, it would be
interesting to examine the same datasets and features using
classifiers used in other research studies.
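WEKA’s Bagging with a REPTree base learner has a rough scikit-learn analog in BaggingClassifier over a pruned decision tree. The sketch below is that analog on placeholder data, not a reproduction of the setup in [6].

```python
# Rough scikit-learn analog of WEKA Bagging + REPTree (as used in [6]).
# REPTree is approximated by a depth-limited DecisionTreeClassifier.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((300, 83))     # 83 features, as in [6]; values are fake
y = rng.integers(0, 2, 300)   # placeholder gender labels

# Note: the keyword is base_estimator on scikit-learn < 1.2.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5),
                            n_estimators=10)
print(cross_val_score(bagging, X, y, cv=5).mean())
```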
Ragel et al. propose using an n-gram method for authorship detection [7]. The authors argue that traditional stylometric methods, such as average sentence/word length, usage frequencies of grammatical components such as nouns, verbs, and adjectives, frequency of functional words, and richness of vocabulary, fail when applied to SMS messages for several
reasons [7]. One reason is due to the size of a SMS message
[7]. Since SMS messages are limited to 140 characters, many
people find it difficult to fit all the words in a message [7].
Therefore, SMS messages tend to be less consistent with grammar, spelling and punctuation, and they cause users to introduce shortened words into SMS messages [7].
An n-gram based system can easily identify the author of a
message given recurrent errors and unique shortened words [7].
Additionally, punctuation will not cause any issues since the
sentence length is not taken in to consideration [7].
In an n-gram model, “n” represents the length of the consecutive sequences of tokens drawn from a sequence of data, where a token can be any unit of data [7]. In their study, the string of text written by a given author represented the sequence of data, and a word represented the token [7].
To measure the distance between two author-profile vectors, the Cosine similarity and Euclidean distance metrics were used [7].
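A minimal sketch of unigram author profiles compared with both metrics follows; the formulas are the standard cosine similarity and Euclidean distance, and the texts are toy stand-ins for the NUS data.

```python
# Sketch: unigram profiles compared via cosine similarity and Euclidean
# distance, as in [7]. The texts are toy stand-ins for the NUS corpus.
import math
from collections import Counter

def profile(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}  # unigram frequencies

def cosine(p, q):
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def euclidean(p, q):
    words = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))

train = profile("gonna b late cu l8r k")
test = profile("cu l8r gonna grab food k")  # could be several stacked SMSs
print(cosine(train, test), euclidean(train, test))
```

Stacking test messages, as in [7], amounts to concatenating them before profiling, which is why accuracy improves with the stack size.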
The authors used the NUS SMS corpus since it contains
more than fifty thousand SMS messages written in English [7].
Additionally, only messages from authors with at least fifty collected SMS messages were used [7].
For the first experiment, about twenty authors were chosen
having more than five hundred SMS messages each and the
five hundred longest SMS messages were selected for the test
[7]. To generate one large testing dataset, a few SMS
messages were stacked since the testing data in one SMS was
not enough [7]. The number of stacked SMS messages was
varied from one up to twenty stacked SMS messages in order
to check the effects of the SMS stacking [7]. The extracted
unigram profiles from the testing stack were then compared
against the training set using the Cosine similarity and the
Euclidean distance metrics [7]. The one having the highest
similarity or the lowest distance was determined to be the
author [7]. As a result, it was determined that the Euclidean
distance metric is not a sufficient metric to measure the
similarity of SMS author profiles [7]. However, the Cosine
similarity distances demonstrated an accuracy starting from
forty percent for only one SMS message and going up to ninety
percent when the number of SMS messages stacked was large
[7].
A second experiment was conducted to determine the effect
of the size of the training set by varying the number of training
SMS messages in increments of one hundred up to five
hundred SMS messages [7]. The results showed that as the number of test SMS messages stacked together increases, the accuracy of the results increases [7]. Notably, a saturation level was reached around twenty test SMS messages [7]. Additionally, there was not a significant increase in accuracy between one hundred and four hundred training SMS messages [7].
Another experiment was conducted to determine how well the algorithm works for a varying number of authors with little data available [7]. A random number generator was used to select a varying number of authors, ranging from five candidate authors up to seventy authors [7]. It was observed that when
the number of possible authors increases, the decrease in the
accuracy is close to linear [7].
They concluded that stacking ten SMS messages is optimal for near-best accuracy in detecting the author of a set of SMS messages using the unigram method with the Cosine similarity distance metric [7].
Miller et al. proposed using Perceptron and Naïve Bayes with selected 1- through 5-gram features to identify the gender of the authors of Twitter’s tweet text [8]. The authors used the Twitter Streaming API to download a set of 36,238 unlabeled tweets [8]. The tweets were manually labeled as male or female, and any texts where the gender was unclear or that were not written in English were removed [8]. This reduced the dataset to approximately 3000 users with about 60% female authors [8].
Additionally, the tweets were sorted by their minimum lengths
[8].
Six feature selection algorithms were run on the training
sets to reduce feature space and noise: Chi-Square, Information
Gain, Information Gain Ratio, Relief, Symmetrical
Uncertainty, and Filtered Attribute Evaluation [8]. In order to
determine the feature rankings, each algorithm was run on all
six sets of tweets with different minimum lengths [8]. Features
were selected only if they were contained in at least four of the
rankings [8]. This made up the feature set collection referred
to as Feature Set A [8]. A second feature set consisting of 15,000 features, referred to as Feature Set B, was also selected [8].
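As an illustration of one of those six selection algorithms, the sketch below ranks character 1- to 5-gram counts by chi-square. Scikit-learn is assumed, and the tweets and labels are placeholders.

```python
# Sketch: character 1-5 gram counts ranked by chi-square, one of the six
# selection algorithms listed in [8]. Tweets and labels are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

tweets = ["omg sooo cute!!!", "game stats posted tonight",
          "love this song", "new gpu benchmark results"]
labels = ["female", "male", "female", "male"]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 5))
X = vectorizer.fit_transform(tweets)

selector = SelectKBest(chi2, k=20).fit(X, labels)
kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)  # the 20 highest-scoring character n-grams
```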
The performance of the Perceptron and Naïve Bayes stream
algorithms was measured based on accuracy, balanced
accuracy and F-Measure [8].
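For reference, the standard definitions of these measures, in our notation rather than anything specific to [8], are:

```latex
% Standard metric definitions (our notation, not specific to [8]).
\[
P = \frac{TP}{TP+FP}, \quad
R = \frac{TP}{TP+FN}, \quad
F = \frac{2PR}{P+R}, \quad
\mathrm{BalancedAcc} = \frac{TPR + TNR}{2}
\]
```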
When the Perceptron algorithm was run using Feature Set
A, Perceptron demonstrated a high precision between 90% and
95% but a low recall (75% - 85% for most minimum lengths)
[8]. When the Naïve Bayes algorithm was run using Feature
Set A, Naïve Bayes had a slightly lower precision, but was able
to have higher F-Measure and recall rates (above 90%) [8].
When both algorithms were run using Feature Set B, Naïve
Bayes outperformed the Perceptron algorithm [8]. The Naïve
Bayes algorithm improved on every metric and performed best
with a minimum tweet length of 75 characters [8]. Since both
algorithms classified gender better using Feature Set B, it was
implied that a larger number of features was needed to
accurately classify the gender of the tweets [8].
One potential issue with their methodology is that they
manually classified the gender of the tweets in the training set.
This could have easily swayed their results. However, it would
be interesting to apply their approach to SMS text data.
III. METHODOLOGY
In this paper, we propose a new method that combines previous experiments. The diagram shown in Figure 1 illustrates an overview of the process that will be used to classify SMS text messages by gender. The NUS SMS text message corpus will be used to obtain the SMS text message data. After cleaning the data and separating it into the training and testing sets, a classification technique will be applied. Once an acceptable classification performance has been achieved, N-gram modeling will be applied to the data in order to boost the score of the classification technique. A distribution of N-grams will be used to create an author profile for each author. The accuracy will be measured by the percentage of correct author gender classifications.
Figure 1: Gender Identification Process
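The flow in Figure 1 can be summarized in code form. The sketch below is our reading of the proposed pipeline, assuming scikit-learn for the baseline classifier; the corpus loader and the final boosting step are illustrative assumptions, as the paper does not fix an implementation.

```python
# Illustrative sketch of the pipeline in Figure 1. The corpus loader and
# the n-gram boosting step are assumptions, not a fixed implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def load_nus_sms():
    # Placeholder standing in for the cleaned NUS SMS corpus.
    msgs = ["cu l8r :-)", "Meeting at 3pm.", "omg sooo fun", "stats posted"]
    labels = ["female", "male", "female", "male"]
    return msgs, labels

messages, genders = load_nus_sms()
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, genders, test_size=0.5, stratify=genders, random_state=0)

# Classification step: character n-grams cope with SMS spelling variants.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
clf = LinearSVC().fit(vec.fit_transform(X_tr), y_tr)
pred = clf.predict(vec.transform(X_te))
print("baseline accuracy:", accuracy_score(y_te, pred))

# Boosting step (omitted): per-gender n-gram profiles built from the
# training set would be used to adjust low-confidence predictions.
```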
IV. CONCLUSION
In this paper, we proposed a new combined method for authorship gender classification of SMS text messages. Previous approaches to authorship classification used either N-gram modeling or various classification techniques. The majority of previous experiments were also conducted on larger text documents. Due to the limited number of characters per SMS text message, classification techniques involving numerous feature extractions simply cannot be applied to SMS text messages.
REFERENCES
[1] US Consumers Send Six Billion Text Messages a Day. 2013. CTIA-The Wireless Association. Available: http://www.ctia.org/resourcelibrary/facts-and-infographics/archive/us-text-messages-sms.
[2] A. Orebaugh and J. Allnutt. 2010. “Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations,” Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. http://dx.doi.org/10.1007/978-3-642-11534-9_10.
[3] M. Rafi. 2008. “SMS Text Analysis: Language, Gender and Current Practices,” Online Journal of TESOL France. http://www.tesol-france.org/Documents/Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20and%20Current%20Practice%20_1_.pdf.
[4] N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. “Author gender identification from text,” Digital Investigation: The International Journal of Digital Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158.
[5] S. Argamon et al. 2009. “Automatically profiling the author of an anonymous text,” Communications of the ACM - Inspiring Women in Computing. http://dl.acm.org/citation.cfm?id=1461928.1461959.
[6] J. Soler and L. Wanner. 2014. “How to Use Less Features and Reach Better Performance in Author Gender Identification,” Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland.
[7] R. Ragel, P. Herath, and U. Senanayake. 2013. “Authorship Detection of SMS Messages Using Unigrams,” Eighth IEEE International Conference on Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314.
[8]
Z. Miller, B. Dickinson, and W. Hu. 2012. “Gender Prediction on
Twitter Using Stream Algorithms with N-Gram Character Features,”
International
Journal
of
Intelligence
Science.
http://dx.doi.org/10.4236/ijis.2012.224019.