ChatCoder: Toward the Tracking and Categorization of Internet Predators
April Kontostathis, Lynne Edwards, Amanda Leatherman
Ursinus College

Where are we coming from?

Spring/Summer 2008
◦ Amanda Leatherman, Ursinus class of 2009, approaches Lynne Edwards, Associate Professor of Media and Communication Studies, about a new project.

April Kontostathis
Department of Mathematics and Computer Science

Summer 2009
Amanda and Lynne research related work
  Olson, L. N., Daggs, J. L., Ellevold, B. L., & Rogers, T. K. (2007). The communication of deviance: Toward a theory of child sexual predators' luring communication. Communication Theory, 17, 231-251.
Lynne and Amanda channel this project in two directions
◦ Modify the theory for the online environment
◦ Operationalize the theory

Original LCT Model (Olson et al.)
Gaining Access
◦ Characteristics of the perpetrator
◦ Characteristics of the victim
◦ Strategic placement
Deceptive Trust Development
Grooming
◦ Communicative desensitization
◦ Reframing
Isolation
Approach

Process
Read many transcripts from Perverted-justice.com
◦ … not an appealing job

Meanwhile …
I am planning a Fall 2008 Software Engineering course and looking for projects to assign to students
Lynne asks if my students can build a system to find phrases in the Perverted-Justice transcripts
… a collaboration is born!

Where are we now?

Revised LCT Model
Gaining Access
◦ Strategic Placement
Deceptive Trust Development
◦ Activities
◦ Compliments
◦ Personal Information Exchange
◦ Relationship Exchange
Grooming
◦ Communicative Desensitization
◦ Reframing
Isolation
Approach

Categorization Experiments
First Experiment
◦ Class: {Predator, Victim}
    32 instances, 16 in each class (talking to each other)
◦ Eight numeric attributes: count of tagged phrases in each category
    Activities, Personal Information, Compliments, Relationship, Reframing, Desensitization, Isolation, Approach

Results
Classifier: C4.5 (J48 in Weka)
3-fold cross validation
Success Rate: 59%
◦ baseline 50%
Confusion matrix

                     Classified as Predator   Classified as Victim
  Actual Predator              8                       8
  Actual Victim                5                      11

Decision Tree

  DesensitizationCount <= 35
  |   RelationshipCount <= 0
  |   |   ActivitiesCount <= 1
  |   |   |   IsolationCount <= 5: Predator (5.0/1.0)
  |   |   |   IsolationCount > 5: Victim (4.0)
  |   |   ActivitiesCount > 1: Predator (2.0)
  |   RelationshipCount > 0: Victim (10.0)
  DesensitizationCount > 35: Predator (11.0/1.0)

Predator vs. Victim Patterns
[Figure: Predator vs. Victim Patterns]

Categorization Experiments
Second Experiment
◦ Class: {PJ, Non-PJ}
    31 instances, 14 PJ Transcripts, 15 Non-PJ
    Non-PJ obtained from Dr. Susan Gauch, collected during her ChatTrack project
    PJ transcripts: both Victim and Predator were coded
◦ Same eight attributes

Results
Classifier: C4.5 (J48 in Weka)
3-fold cross validation
Success Rate: 93%
◦ baseline 48%
Confusion matrix

                      Classified as Not PJ   Classified as PJ
  Actually Not PJ             15                     0
  Actually PJ                  2                    12

Non-PJ vs. PJ
[Figure: Non-PJ vs. PJ]

Clustering Experiments
All 288 PJ Transcripts
K-Means Clustering
Same eight attributes
◦ column normalized
Four Clusters found
◦ minimum intra-cluster variation
◦ multiple runs to avoid local minima

Clusters Found
[Figure: Predator Category Cluster Centroids; y-axis: Normalized Activity Count; series: Cluster0, Cluster1, Cluster2, Cluster3]

Labeling the Clusters
60 Transcripts Analyzed Closely
Age Deception Data Categorized
◦ Four distinct ways that deception can be achieved when communicating with others
    1. Quantity
    2. Quality
    3. Relation
    4. Manner
  McCornack, S. A., Levine, T. R., Solowczuk, K. A., Torres, H. I., & Campbell, D. M. (1992). When the alteration of information is viewed as deception: An empirical test of information manipulation theory. Communication Monographs, 59, 17-29.
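
The clustering setup from the Clustering Experiments slide (eight count attributes, column normalized, k-means with four clusters, multiple runs keeping the solution with the least intra-cluster variation) might be sketched roughly as follows. This is an illustration in plain Python, not the tooling used for the experiments. The function names, the divide-by-column-maximum normalization, the number of restarts, and the toy count vectors are all assumptions made for the example; none of them come from the slides.

```python
import random


def normalize_columns(rows):
    """Scale each column to [0, 1] by dividing by its maximum (an assumed scheme)."""
    maxima = [max(col) or 1 for col in zip(*rows)]
    return [[value / m for value, m in zip(row, maxima)] for row in rows]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random starting centroids.

    Returns (labels, centroids, variation), where variation is the summed
    squared distance of every point to its assigned centroid.
    """
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    variation = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, centroids, variation


def kmeans(points, k, restarts=50, seed=0):
    """Repeat k-means and keep the run with the smallest intra-cluster variation."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[2])


# Hypothetical phrase counts per transcript, in the order: Activities,
# Personal Information, Compliments, Relationship, Reframing,
# Desensitization, Isolation, Approach. Invented for illustration only.
counts = [
    [2, 5, 1, 0, 3, 60, 2, 1],
    [3, 6, 1, 0, 4, 55, 3, 1],
    [1, 40, 2, 1, 0, 5, 1, 0],
    [2, 38, 3, 1, 1, 4, 1, 0],
    [30, 4, 20, 0, 1, 3, 0, 1],
    [28, 5, 22, 0, 1, 2, 0, 1],
    [1, 3, 0, 2, 1, 6, 25, 10],
    [2, 2, 1, 2, 1, 5, 27, 12],
]

normalized = normalize_columns(counts)
labels, centroids, variation = kmeans(normalized, k=4)
print(labels)
print(round(variation, 4))
```

Restarting from several random initializations matters because a single k-means run can settle in a poor local minimum; comparing runs by total within-cluster variation is one simple way to choose among them.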

Age data captured for all 288 transcripts

Age Deception Statistics

                          Number of Transcripts   Percentage of Transcripts
  No discussion of age              3                        5%
  Honest Predators                 36                       60%
  Deceptive Predators              21                       35%

Type of Deception
Quantity manipulation findings
◦ Honest predators' average real age was 31 years
◦ Deceptive predators' average real age was 38 years
Quality manipulation findings
◦ Average age given by deceptive predators was 27 years
Relation and Manner manipulation findings
◦ Rarely used by online sexual predators

Age Labeling – a bust

  Cluster   Total   Honest   Percent
  C0          70      50       71%
  C1         173     112       65%
  C2          16      12       75%
  C3          27      20       74%

Synergistic Activities
Content Analysis for the Web 2.0
◦ Misbehavior Detection Task
Pendar, N. (2007). Toward spotting the pedophile: Telling victim from predator in text chats. In Proceedings of the First IEEE International Conference on Semantic Computing, 235-241. Irvine, California.
◦ Study for the Termination of Online Predators (STOP)
Hughes, D., Rayson, P., Walkerdine, J., Lee, K., Greenwood, P., Rashid, A., May-Chahal, C., & Brennan, M. (2008). Supporting law enforcement in digital communities through natural language analysis. In Proceedings of the 2nd International Workshop on Computational Forensics (IWCF'08). Washington D.C., USA, August 2008.
◦ Isis – Protecting Children in Online Social Networks

Where are we going?

Data remains a big problem
◦ PJ data is problematic
◦ Access to large chat or “chat-like” collections is hard to get
Labeling is a bigger problem
◦ Finding predatory chat is a “needle in haystack” problem
Applications are nice, but applications need to be grounded in text mining and communicative theory research.

Acknowledgements
Amanda Leatherman
Lynne Edwards
Kristina Moore
Brian D. Davison and students at Lehigh Univ.
Ursinus College
◦ Media and Communication Studies faculty and students
◦ Mathematics and Computer Science faculty and students
Text Mining Workshop organizers and reviewers

Contact Information
April Kontostathis
Ursinus College
akontostathis@ursinus.edu
http://webpages.ursinus.edu/akontostathis
610-409-3000 x2650