Gender Identification of SMS Texts

Shannon Silessi
Computer Science Department
Sam Houston State University
Huntsville, TX, USA
sgs008@shsu.edu

Cihan Varol
Computer Science Department
Sam Houston State University
Huntsville, TX, USA
cvarol@shsu.edu

Abstract—Short message service (SMS) has become a very popular medium of communication due to its convenience and low cost. SMS text messaging makes it easy to provide a false name, age, gender, and location in order to hide one's true identity. Consequently, it has become important to establish new cyber forensics methods, such as detecting the authors of SMS messages, as a post-hoc analysis technique useful in criminal prosecution cases. In this paper, we propose a new combined method for classifying the gender of SMS text message authors.

Keywords—SMS text; authorship classification; n-gram

I. INTRODUCTION

Short message service (SMS) has become a very popular medium of communication due to its convenience and low cost. As of December 2012, U.S. wireless users sent and received an average of 6 billion text messages per day, or 69,635 text messages every second [1]. From a marketing viewpoint, companies may be interested in knowing the demographics of the people who like or dislike their products. Unfortunately, SMS messages can also be used as a communication channel by gangs, terrorists, and drug dealers. The pervasive use of SMS is increasing the amount of digital evidence available on cellular phones. Additionally, SMS messaging makes it easy to provide a false name, age, gender, and location in order to hide one's true identity. Consequently, it has become important to establish new cyber forensics methods, such as detecting the authors of SMS messages, as a post-hoc analysis technique useful in criminal prosecution cases.

Categorizing an author's text according to sociolinguistic attributes such as gender, age, educational background, income, linguistic background, nationality, profession, psychological status, and race is known as authorship characterization [2]. These attributes are aimed at inferring an author's characteristics rather than identity [2]. Most research in the area of authorship characterization has been conducted on larger, more formal written documents. SMS text messages have unusual characteristics that make it hard or impossible to apply traditional stylometric techniques, such as frequency counts and word similarity. SMS text messages are limited to 140 characters and often contain a unique style of language and emoticons [3]. They often contain words that are abbreviated or that are written representations of the sounds and compressions the intended reader should perceive in the message (e.g., 'kt' instead of Katie) [3]. Emoticons, such as :-( (representing a frown) and :-) (representing a smile), are representations of facial expressions or body language that would otherwise be missing from non-face-to-face communication [3]. Senders of SMS text messages also tend to use different phonetic spellings in order to create different kinds of verbal effects in their messages, such as 'hehe' for laughter and 'muaha' for an evil laugh [3]. Letters and numbers are also often combined for compression and convenience (e.g., 'See you' can be texted as 'CU' and 'See you later' can be texted as 'CUL8R') [3]. All of these facts make it difficult to use syntactic means to characterize the authorship of SMS text messages.

II. BACKGROUND
Cheng et al. investigated author gender identification for short, multi-genre, content-free texts, such as those found in many Internet applications [4]. In their article, they proposed 545 psycho-linguistic and gender-preferential cues, along with stylometric features, to build the feature space [4]. Their design began with a set of features that remained relatively constant across a large number of messages written by authors of the same gender [4]. From there, a model (or classifier) was built to determine the category of a given message [4]. They identified four steps in the gender identification process: 1) collecting a suitable corpus of text messages as the dataset; 2) identifying features that are significant indicators of gender; 3) automatically extracting feature values from each message; and 4) building a classification model to identify the gender of the author of a candidate text message [4].

One of the datasets used in their study was a collection of all English-language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4]. After categorizing the documents by gender, they kept only the messages that contained more than 200 but fewer than 1,000 words [4]. Additionally, they removed messages that contained too many quotes of other people's statements [4]. The other dataset was a collection of 8,970 Enron e-mails gathered over three and a half years from about 108 users [4]. Only e-mails with more than 50 and fewer than 1,000 words were used [4].

For their analysis, they defined five sets of gender-related features: (1) character-based; (2) word-based; (3) syntactic; (4) structure-based; and (5) function words [4]. Their character-based features comprised 29 stylometric features, such as the number of white-space characters and the number of special characters [4]. Their word-based features included 33 statistical metrics such as vocabulary richness, Yule's K measure, and an entropy measure [4]. Additionally, 68 psycho-linguistic features were used [4].

To examine the performance of different classification techniques, Bayesian-based logistic regression, the AdaBoost decision tree, and the support vector machine (SVM) were applied separately [4]. For both datasets, SVM outperformed Bayesian logistic regression and the AdaBoost decision tree [4]. When the size of the training set was increased, the performance of Bayesian logistic regression did not change significantly; however, the AdaBoost decision tree improved sharply [4]. The best classification results, produced by SVM, had accuracies of 76.75% and 82.23% on the two datasets [4]. This implied that the authors of neutral news (the Reuters newsgroup) are more difficult to classify than the authors of personal e-mails (Enron) [4]. To examine the impact of the parameters on classification performance, several experiments were conducted varying the minimum number of words per message and the number of messages in the sample set [4]. The results showed that accuracy increases with the number of words per message, since a longer message may carry more information about the author's personal writing style and the corresponding gender influence [4]. To evaluate the significance of the proposed feature sets, SVM was applied to the sub-dataset of messages containing a minimum of 100 words, using one feature set at a time [4]. Although all five subsets contributed to the gender identifier, the word-based features and function words were shown to be the more important gender classifiers [4].

One of the problems with their approach was that the initial categorization of the documents by the gender of the journalists appears to have been based solely on the perceived gender of a person's name. In today's society, there are many women bearing a historically male name and vice versa, which could have affected the results of the study. Another possible influence on the results was the unequal number of male-authored versus female-authored documents; it raises the question of whether an equal number of male- and female-authored documents would have improved the accuracy of the training datasets.
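To make these feature families concrete, the following minimal Python sketch (our own illustration, not the implementation of [4]; the function name and the specific features chosen are ours) computes a handful of character-based and word-based values of the kind listed above:

import math
import re
from collections import Counter

def stylometric_features(text):
    """A few illustrative character- and word-based features of the kind
    surveyed above (not Cheng et al.'s exact 29 + 33 feature set)."""
    chars = len(text)
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    # Character-based: proportions of white-space and special characters.
    whitespace = sum(c.isspace() for c in text)
    special = sum((not c.isalnum()) and (not c.isspace()) for c in text)

    # Word-based: vocabulary richness (type/token ratio) and an
    # entropy measure over the word-frequency distribution.
    richness = len(counts) / len(words) if words else 0.0
    entropy = -sum((n / len(words)) * math.log2(n / len(words))
                   for n in counts.values()) if words else 0.0

    return {
        "whitespace_ratio": whitespace / chars if chars else 0.0,
        "special_char_ratio": special / chars if chars else 0.0,
        "vocabulary_richness": richness,
        "word_entropy": entropy,
    }

print(stylometric_features("CU L8R :-) hehe, c u at the game!!"))

In [4], values of this kind, together with the syntactic, structure-based, and function-word features, form the vector handed to the classifier.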
Argamon et al. proposed using two types of features for authorship profiling: content-based features and style-based features [5]. In their study, they used Systemic Functional Linguistics to represent taxonomies describing meaningful distinctions among various function words and parts of speech [5]. The taxonomies were represented as trees with the following labeling structure [5]:

Roots – sets of parts of speech (articles, auxiliary verbs, conjunctions, prepositions, pronouns)
Children nodes – subclasses of the parent node (e.g., various sorts of personal pronouns)
Leaves – sets of individual words

For each node in the taxonomic trees, the frequency of the node's occurrence in a text, normalized by the number of words in the text, was used to build the feature set [5]. The Bayesian Multinomial Regression learning algorithm was used for learning the weight vector [5].
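Written out as a formula, this feature definition reads as follows (the notation is ours, chosen for illustration, and is not taken from [5]):

\[ f_v(t) = \frac{c_v(t)}{N(t)} \]

where \( c_v(t) \) is the number of tokens in text \( t \) that belong to the set of words covered by taxonomy node \( v \), and \( N(t) \) is the total number of words in \( t \).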
The dataset used for their gender profiling experiment consisted of the full sets of posts of 19,320 blog authors, where the gender of each author was self-reported [5]. The dataset included an equal number of male and female authors, and the texts ranged in length from several hundred to tens of thousands of words [5]. In their experiment, the content features were more effective classifiers than the style features [5]. The style features shown to be most useful for gender discrimination were determiners and prepositions (markers of male writing) and pronouns (markers of female writing) [5]. The content features shown to be most useful were words related to technology for males and words related to personal life or relationships for females [5]. As noted by the authors, the content features used in this study were appropriate for gender classification on this specific collection of blogs; other content features may be more relevant for other types of text [5]. A potential problem with their approach was that the gender of the authors was self-reported. It is well known that people often misrepresent themselves online in order to mask their true identity.

Orebaugh and Allnutt discuss the unique style of instant messaging (IM) messages [2]. They present that IM messages include special linguistic elements such as abbreviations and computer and Internet terms, referred to as netlingo [2]. These characteristics, which are more likely to represent the true stylistic habits of an author, give authors an identifiable writeprint, the term associated with an author's unique online writing style [2]. In their study, writeprints were used as input to build classification models during the classification stage [2]. They used the WEKA data mining toolkit to perform classification on the writeprints using the C4.5, k-nearest neighbor, Naïve Bayes, and SVM classifiers [2]. Their proposed feature set used a 356-dimensional vector including lexical, syntactic, and structural features [2]. Since content-specific features are highly dependent on the topic of the messages, their feature set taxonomy did not include them [2]. The IM feature set taxonomy included several new stylistic features, such as abbreviations and emoticons, which are frequently found in instant messaging communications [2]. The lexical features primarily consisted of frequency totals and were further broken down into emoticon, abbreviation, word-based, and character-based features [2]. The syntactic features included punctuation and function words in order to capture an author's habits of organizing sentences [2]. The structural features, which capture the way an author organizes the layout of a text, consisted of the average characters and words per message [2]. Two datasets were used for this research [2]. One consisted of personal IM conversation logs from 19 authors, collected with the Gaim and Adium clients over a three-year period [2]. The other consisted of complete IM logs between undercover agents and 100 different child predators, publicly available from U.S. Cyberwatch [2]. The optimal algorithm was determined to be SVM, applied to the entire set of 356 features [2].

Soler and Wanner propose using a small number of features that depend mainly on the structure of the texts, as opposed to other approaches that depend mainly on the content of the texts and use a huge number of features to identify the gender of an author [6]. They argue that in previous studies, which used a large number of features and applied dimensionality reduction to minimize the complexity, the accuracy and transparency of the outcomes were negatively affected [6]. Their study included 83 features of five different types: character-based, word-based, sentence-based, dictionary-based, and syntactic features [6]. WEKA's Bagging variant was used for classification, with REPTree (a fast decision tree learning algorithm) as the base classifier [6]. The dataset consisted of a collection of postings from a New York (NY) Times opinion blog, which is extremely multi-thematic [6]. It contained 836 texts written by more than 100 male authors and 836 texts written by more than 100 female authors [6]. Using their whole, yet much smaller, set of features on the NY Times blog dataset, their experiments achieved an accuracy well within the range of accuracies achieved by the state-of-the-art approaches in this area [6]. Using only the syntactic feature group, an accuracy of 77.03% was achieved, demonstrating that there are important differences in how men and women syntactically structure their sentences [6]. On a separate dataset, using only the syntactic features achieved an accuracy of 95.08% [6]. Compared to other studies that use thousands of features capturing only the content of the writing, their approach demonstrated that the quality of the features is more important than their quantity [6]. However, it would be interesting to examine the same datasets and features using the classifiers used in other research studies.
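For readers who want to reproduce the classification setup of [6] outside WEKA, the following is only a rough scikit-learn analogue: scikit-learn has no REPTree, so a plain DecisionTreeClassifier stands in as the bagged base learner, and random placeholder values replace the 83 structure-oriented feature values per text.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 200 texts, each described by 83 structure-oriented
# feature values, with binary gender labels (0 = male, 1 = female).
rng = np.random.default_rng(0)
X = rng.random((200, 83))
y = rng.integers(0, 2, 200)

# Bagged decision trees, loosely mirroring WEKA's Bagging + REPTree.
# (Older scikit-learn versions call this parameter base_estimator.)
clf = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5),
                        n_estimators=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print("mean cross-validated accuracy:", scores.mean())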
Ragel et al. propose using an n-gram method for authorship detection [7]. The authors argue that traditional stylometric methods, such as average sentence/word length, usage frequencies of grammatical components such as nouns, verbs, and adjectives, frequency of function words, and richness of vocabulary, fail when applied to SMS messages for several reasons [7]. One reason is the size of an SMS message [7]. Since SMS messages are limited to 140 characters, many people find it difficult to fit all the words into a message [7]. Therefore, SMS messages tend to be less consistent in grammar, spelling, and punctuation, and cause users to introduce shortened words [7]. An n-gram based system can easily identify the author of a message given recurrent errors and unique shortened words [7]. Additionally, punctuation does not cause any issues, since the sentence length is not taken into consideration [7].

In an n-gram model, "n" represents the number of consecutive tokens taken from a sequence of data, where the tokens can be any unit of data [7]. In their study, the string of text written by a certain author represented the sequence of data, and a word represented the token [7]. To measure the distance between two profile vectors, the Cosine similarity and Euclidean distance metrics were used [7]. The authors used the NUS SMS corpus, since it contains more than fifty thousand SMS messages written in English [7]. Only authors with at least fifty collected SMS messages were considered [7]. For the first experiment, about twenty authors with more than five hundred SMS messages each were chosen, and the five hundred longest SMS messages were selected for the test [7]. Because the testing data in a single SMS message was not enough, several SMS messages were stacked to generate one large testing sample [7]. The number of stacked SMS messages was varied from one up to twenty in order to check the effects of the stacking [7]. The unigram profiles extracted from the testing stack were then compared against the training set using the Cosine similarity and Euclidean distance metrics, and the author with the highest similarity (or the lowest distance) was selected [7]. As a result, it was determined that the Euclidean distance metric is not a sufficient metric for measuring the similarity of SMS author profiles [7]. However, the Cosine similarity demonstrated an accuracy starting at forty percent for a single SMS message and rising to ninety percent when the number of stacked SMS messages was large [7].

A second experiment was conducted to determine the effect of the size of the training set by varying the number of training SMS messages in increments of one hundred, up to five hundred [7]. The results showed that as the number of test SMS messages stacked together increases, the accuracy of the results increases [7]. Notably, a saturation level was reached around twenty test SMS messages [7]. Additionally, there was no significant increase in accuracy between one hundred and four hundred SMS messages [7]. Another experiment was conducted to determine how well the algorithm works for a varying number of authors with little data available [7]. A random number generator was used to select a varying number of authors, ranging from five candidate authors up to seventy [7]. It was observed that as the number of possible authors increases, the decrease in accuracy is close to linear [7]. They concluded that ten stacked SMS messages is optimal for near-best accuracy in detecting the author of a set of SMS messages using the unigram method with the Cosine similarity metric [7].
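A minimal sketch of this unigram-profile scheme follows (our own illustration, not the authors' code): each candidate author's training messages are collapsed into a normalized word-frequency vector, and a stack of test messages is attributed to the author whose profile scores highest under the Cosine similarity.

import math
from collections import Counter

def unigram_profile(messages):
    """Build a normalized unigram (word-frequency) profile from SMS texts."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse unigram profiles.
    (Per [7], this metric outperformed the Euclidean distance.)"""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Training profiles per candidate author; the test "stack" is several SMS
# messages from the unknown author combined into one sample.
profiles = {
    "author_a": unigram_profile(["c u l8r", "hehe ok c u"]),
    "author_b": unigram_profile(["see you later", "okay see you then"]),
}
test_stack = unigram_profile(["ok c u l8r hehe"])
best = max(profiles, key=lambda a: cosine_similarity(profiles[a], test_stack))
print("predicted author:", best)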
Miller et al. proposed using Perceptron and Naïve Bayes classifiers with selected 1-gram through 5-gram features to identify the gender of Twitter tweet text [8]. The authors used the Twitter Streaming API to download a set of 36,238 unlabeled tweets [8]. The tweets were manually labeled as male or female, removing any texts where the gender was unclear or that were not written in English [8]. This reduced the dataset to approximately 3,000 users, about 60% of whom were female [8]. Additionally, the tweets were sorted by their minimum lengths [8]. Six feature selection algorithms were run on the training sets to reduce feature space and noise: Chi-Square, Information Gain, Information Gain Ratio, Relief, Symmetrical Uncertainty, and Filtered Attribute Evaluation [8]. To determine the feature rankings, each algorithm was run on all six sets of tweets with different minimum lengths [8]. Features were selected only if they appeared in at least four of the rankings; this made up the collection referred to as Feature Set A [8]. A second feature set consisting of 15,000 features, referred to as Feature Set B, was also selected [8]. The performance of the Perceptron and Naïve Bayes stream algorithms was measured in terms of accuracy, balanced accuracy, and F-measure [8]. When run with Feature Set A, Perceptron demonstrated a high precision, between 90% and 95%, but a low recall (75%-85% for most minimum lengths) [8]. Naïve Bayes with Feature Set A had a slightly lower precision but achieved higher F-measure and recall rates (above 90%) [8]. When both algorithms were run with Feature Set B, Naïve Bayes outperformed Perceptron [8]. The Naïve Bayes algorithm improved on every metric and performed best with a minimum tweet length of 75 characters [8]. Since both algorithms classified gender better using Feature Set B, this implied that a larger number of features is needed to accurately classify the gender of tweets [8]. One potential issue with their methodology is that the gender of the tweets in the training set was classified manually, which could easily have swayed their results. Nevertheless, it would be interesting to apply their approach to SMS text data.

III. METHODOLOGY

In this paper, we propose a new method that combines previous experiments. The diagram in Figure 1 illustrates an overview of the process that will be used to classify SMS text messages by gender.

Figure 1: Gender Identification Process

The NUS SMS corpus will be used to obtain the SMS text message data. After cleaning the data and separating it into training and testing sets, a classification technique will be applied. Once an acceptable classification performance has been achieved, N-gram modeling will be applied to the data in order to boost the score of the classification technique. A distribution of N-grams will be used to create a profile for each author. Accuracy will be measured as the percentage of correct author gender classifications.
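Since the method is proposed rather than implemented at this stage, the following is only a schematic sketch of such a pipeline under assumed tooling (scikit-learn): the placeholder messages stand in for the cleaned NUS corpus, word unigram/bigram TF-IDF profiles stand in for the N-gram modeling step, and Multinomial Naïve Bayes stands in for whichever classification technique is ultimately selected.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Placeholder corpus: in the proposed method these would be cleaned
# messages from the NUS SMS corpus with known author genders.
texts = ["c u l8r hehe", "see you later", "ok c u then", "okay then bye"]
genders = ["f", "m", "f", "m"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, genders, test_size=0.5, random_state=0)

# Word n-gram profiles (unigrams and bigrams here) feed the classifier.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)

pred = clf.predict(vectorizer.transform(X_test))
print("gender classification accuracy:", accuracy_score(y_test, pred))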
IV. CONCLUSION

In this paper, we proposed a new combined method for authorship classification of SMS text messages. Previous approaches to authorship classification used either N-gram modeling or various classification techniques, and the majority of previous experiments were conducted on larger text documents. Due to the limited number of characters per SMS text message, classification techniques involving numerous extracted features simply cannot be applied to SMS text messages.

REFERENCES

[1] CTIA-The Wireless Association. 2013. "US Consumers Send Six Billion Text Messages a Day." Available: http://www.ctia.org/resource-library/facts-and-infographics/archive/us-text-messages-sms.
[2] A. Orebaugh and J. Allnutt. 2010. "Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations," Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. http://dx.doi.org/10.1007/978-3-642-11534-9_10.
[3] M. Rafi. 2008. "SMS Text Analysis: Language, Gender and Current Practices," Online Journal of TESOL France. http://www.tesolfrance.org/Documents/Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20and%20Current%20Practice%20_1_.pdf.
[4] N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. "Author gender identification from text," Digital Investigation: The International Journal of Digital Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158.
[5] S. Argamon et al. 2009. "Automatically profiling the author of an anonymous text," Communications of the ACM. http://dl.acm.org/citation.cfm?id=1461928.1461959.
[6] J. Soler and L. Wanner. 2014. "How to Use Less Features and Reach Better Performance in Author Gender Identification," Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland.
[7] R. Ragel, P. Herath, and U. Senanayake. 2013. "Authorship Detection of SMS Messages Using Unigrams," Eighth IEEE International Conference on Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314.
[8] Z. Miller, B. Dickinson, and W. Hu. 2012. "Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features," International Journal of Intelligence Science. http://dx.doi.org/10.4236/ijis.2012.224019.