Gender Identification of SMS Texts
By: Shannon Silessi

Introduction
• U.S. wireless users send & receive an average of 6 billion text messages a day [1]
• Visual anonymity can be misused and exploited by criminals
• Cyber forensics methods are needed for detecting SMS authors for use in criminal prosecution cases

Introduction (Cont’d)
Authorship Characterization
Categorizing an author’s text according to sociolinguistic attributes such as:
• Gender
• Age
• Educational Background
• Income
• Nationality
• Race

Introduction (Cont’d)
• Most research in the area of authorship characterization has been conducted on larger, more formal written documents
• Unusual characteristics of SMS make it difficult to apply traditional stylometric techniques

Introduction (Cont’d)
Characteristics of SMS text messages
• Limited to 140 characters [3]
• Often contain abbreviations or written representations of sounds (e.g. ‘kt’ for Katie) [3]
• Various phonetic spellings for verbal effects (‘hehe’ for laughter and ‘muaha’ for evil laughter) [3]
• Emoticons, such as :( (representing a frown) [3]
• Combined letters & numbers for compression (‘CUL8R’ for ‘See You Later’) [3]

Background
Cheng et al. [4]
• 545 psycholinguistic & gender-preferential cues [4]
• Dataset: collection of all English language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997 [4]
• Used messages containing between 200 and 1,000 words

Background (Cheng Cont’d)
Enron email dataset
• Messages containing between 50 and 1,000 words [4]
Examined performance of Bayesian-based logistic regression, Ada-Boost decision tree, & Support Vector Machine (SVM) [4]
Best classification result
• SVM with 76.75% (Reuters) & 82.23% (Enron) accuracies [4]
Examination of parameter performance
• Accuracy increases with increasing number of words [4]

Background (Cheng Cont’d)
Examination of feature sets
• Word-based features & function words were more important [4]
Issues
• Categorization of documents by gender was based on the perceived gender of a person’s name
• Unequal number of male- vs. female-authored documents

Background (Cont’d)
• Argamon et al. proposed using content-based features & style-based features [5]
• Bayesian Multinomial Regression [5]
• Dataset: blog posts by 19,320 authors [5]
• Content features were the more effective classifiers [5]
Issues
• Varying length of texts, ranging from several hundred to tens of thousands of words

Background (Cont’d)
Orebaugh and Allnutt [2]
• Algorithms: C4.5, k-nearest neighbor, Naïve Bayes, & SVM [2]

Background (Orebaugh Cont’d)
Datasets:
• IM conversation logs from 19 authors collected by the Gaim and Adium clients over a three-year period [2]
• IM logs between undercover agents and 100 different child predators that are publicly available from U.S. Cyberwatch [2]
Optimal algorithm was SVM, using 356 features [2]

Background (Cont’d)
Soler and Wanner [6]
• 83 features [6]
• Algorithms: WEKA’s Bagging variant for classification with REPTree as base classifier [6]
• Dataset: 1,672 NY Times opinion blogs written by 100 male & 100 female authors [6]
• Using only the syntactic feature group: 77.03% [6]

Background (Cont’d)
Ragel et al. [7]
• Dataset: NUS SMS corpus [7]
• Cosine similarity performed best for calculating the distance between two vectors [7]
• As the number of stacked messages increases, accuracy increases, but saturation is reached around 20 messages [7]

Methodology
• Hybrid approach using a classification technique & N-gram modeling
• Dataset
  • NUS SMS corpus
• Accuracy will be measured by the percentage of correct author gender classifications
• N-gram model
• Classification using Naïve Bayes
  • longest common string

References
• [1] US Consumers Send Six Billion Text Messages a Day. 2013. CTIA-The Wireless Association. http://www.ctia.org/resource-library/facts-andinfographics/archive/us-text-messages-sms
• [2] A. Orebaugh and J. Allnutt. 2010. “Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations,” Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. http://dx.doi.org/10.1007/978-3-642-11534-9_10
• [3] M. Rafi. 2008. “SMS Text Analysis: Language, Gender and Current Practices,” Online Journal of TESOL France. http://www.tesolfrance.org/Documents/Colloque07/SMS%20Text%20Analysis%20Language%20Gender%20and%20Current%20Practice%20_1_.pdf
• [4] N. Cheng, R. Chandramouli, and K. Subbalakshmi. 2011. “Author gender identification from text,” Digital Investigation: The International Journal of Digital Forensics & Incident Response. http://dl.acm.org/citation.cfm?id=2296158

References (Cont’d)
• [5] S. Argamon et al. 2009. “Automatically profiling the author of an anonymous text,” Communications of the ACM - Inspiring Women in Computing. http://dl.acm.org.ezproxy.shsu.edu/citation.cfm?id=1461928.1461959&coll=DL&dl=ACM&CFID=615755468&CFTOKEN=20724033
• [6] J. Soler and L. Wanner. 2014. “How to Use Less Features and Reach Better Performance in Author Gender Identification,” Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).
• [7] R. Ragel, P. Herath, and U. Senanayake. 2013. “Authorship Detection of SMS Messages Using Unigrams,” Eighth IEEE International Conference on Industrial and Information Systems (ICIIS). http://arxiv.org/abs/1403.1314
• [8] Z. Miller, B. Dickinson, and W. Hu. 2012. “Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features,” International Journal of Intelligence Science. http://dx.doi.org/10.4236/ijis.2012.224019
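
Illustrative sketch for the Methodology slide: the hybrid approach named there (N-gram modeling plus Naïve Bayes classification, scored by the percentage of correct gender classifications) could look roughly like the Python below. This is a minimal sketch under stated assumptions, not the presenter's implementation. The choice of character trigrams, add-one smoothing, the names `char_ngrams`, `NgramNaiveBayes`, and `accuracy`, and the toy messages and labels are all hypothetical; the slides do not specify them. The "longest common string" step is not covered, since the slides do not say how it is used. A real experiment would load labeled messages from the NUS SMS corpus instead of the placeholder data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded message."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (assumed design)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                      # n-gram length (assumption: trigrams)
        self.alpha = alpha              # additive smoothing constant
        self.ngram_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, messages, labels):
        """Accumulate per-class n-gram counts and class frequencies."""
        for text, label in zip(messages, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.class_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for one message."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # N-grams never seen in training are skipped rather than scored.
        grams = [g for g in char_ngrams(text, self.n) if g in self.vocab]
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(doc_count / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        """Return the class label with the highest log posterior."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def accuracy(model, messages, labels):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(model.predict(m) == y for m, y in zip(messages, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Placeholder data: the texts and the "F"/"M" labels are arbitrary and
    # only show the expected input shape, not real corpus content.
    train_msgs = ["c u l8r hehe", "omg hehe so late", "ok call u soon", "running late, call me"]
    train_labels = ["F", "F", "M", "M"]
    model = NgramNaiveBayes(n=3).fit(train_msgs, train_labels)
    print(model.predict("hehe c u soon"))
    print(accuracy(model, train_msgs, train_labels))
```

Working on character n-grams rather than whole words is one way to cope with the abbreviations, phonetic spellings, and letter-number compressions listed on the Introduction slides, since those forms rarely match dictionary words. Accuracy on held-out messages, not on the training set as in the demo, would be the figure that corresponds to the evaluation measure named on the Methodology slide.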