ChatCoder: Toward the Tracking and Categorization of Internet Predators

April Kontostathis
Lynne Edwards
Amanda Leatherman
Ursinus College
Where are we coming from?

Spring/Summer 2008
◦ Amanda Leatherman, Ursinus class of 2009,
approaches Lynne Edwards, Associate Professor of
Media and Communication Studies, about a new
project.
April Kontostathis
Department of Mathematics and Computer Science
Summer 2009
Amanda and Lynne research related work
• Olson, L. N., Daggs, J. L., Ellevold, B. L., & Rogers, T. K. (2007). The communication of deviance: Toward a theory of child sexual predators' luring communication. Communication Theory, 17, 231-251.
• Lynne and Amanda channel this project in two directions
  ◦ Modify the theory for the online environment
  ◦ Operationalize the theory
Original LCT Model (Olson et al.)
• Gaining Access
  ◦ Characteristics of the perpetrator
  ◦ Characteristics of the victim
  ◦ Strategic placement
• Deceptive Trust Development
• Grooming
  ◦ Communicative desensitization
  ◦ Reframing
• Isolation
• Approach

Process

Read many transcripts from Perverted-justice.com
◦ … not an appealing job
Meanwhile …
• I am planning a Fall 2008 Software Engineering course and looking for projects to assign to students
• Lynne asks if my students can build a system to find phrases in the Perverted-Justice transcripts
• … a collaboration is born!

Where are we now?
Revised LCT Model
• Gaining Access
  ◦ Strategic Placement
• Deceptive Trust Development
  ◦ Activities
  ◦ Compliments
  ◦ Personal Information Exchange
  ◦ Relationship Exchange
• Grooming
  ◦ Communicative Desensitization
  ◦ Reframing
• Isolation
• Approach
Categorization Experiments
• First Experiment
  ◦ Class: {Predator, Victim}
    ▪ 32 instances, 16 in each class (talking to each other)
  ◦ Eight numeric attributes: count of tagged phrases in each category
    ▪ Activities
    ▪ Personal Information
    ▪ Compliments
    ▪ Relationship
    ▪ Reframing
    ▪ Desensitization
    ▪ Isolation
    ▪ Approach
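
Each transcript is reduced to eight numbers, one count of tagged phrases per category. A minimal sketch of that counting step is below. The phrase lists and the name `count_categories` are hypothetical placeholders for illustration; they are not ChatCoder's actual dictionary or code, and simple substring matching is an assumption.

```python
# Attribute order follows the slide's list of eight categories.
CATEGORIES = [
    "Activities", "PersonalInformation", "Compliments", "Relationship",
    "Reframing", "Desensitization", "Isolation", "Approach",
]

# Hypothetical phrase lists, for illustration only.
PHRASES = {
    "Activities": ["favorite band", "what do you like to do"],
    "PersonalInformation": ["how old are you", "where do you live"],
    "Compliments": ["you are cute"],
    "Isolation": ["are you alone", "home alone"],
    "Approach": ["can we meet"],
}


def count_categories(lines):
    """Return the eight per-category phrase counts for one speaker's lines."""
    counts = {category: 0 for category in CATEGORIES}
    for line in lines:
        text = line.lower()
        for category, phrases in PHRASES.items():
            for phrase in phrases:
                # Count every (non-overlapping) occurrence of the phrase.
                counts[category] += text.count(phrase)
    return [counts[category] for category in CATEGORIES]


example_vector = count_categories(["How old are you?", "are you alone? can we meet"])
```

The resulting vector, together with a Predator or Victim label, would form one instance for the classifier.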
Results
• Classifier: C4.5 (J48 in Weka)
• 3-fold cross validation
• Success Rate: 59%
  ◦ baseline 50%
• Confusion matrix

                   Classified as Predator   Classified as Victim
Actual Predator    8                        8
Actual Victim      5                        11
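
The success rate and baseline can be checked against the confusion matrix. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Confusion matrix from the slide, keyed by (actual, classified_as).
confusion = {
    ("Predator", "Predator"): 8,
    ("Predator", "Victim"): 8,
    ("Victim", "Predator"): 5,
    ("Victim", "Victim"): 11,
}

total = sum(confusion.values())
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
accuracy = correct / total

# Baseline: always guess the most frequent actual class.
class_totals = {}
for (actual, _), n in confusion.items():
    class_totals[actual] = class_totals.get(actual, 0) + n
baseline = max(class_totals.values()) / total

print(f"accuracy = {correct}/{total} = {accuracy:.1%}, baseline = {baseline:.0%}")
```

With 19 of 32 instances on the diagonal, the classifier is only about nine points above the 50% baseline.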
Decision Tree
DesensitizationCount <= 35
| RelationshipCount <= 0
| | ActivitiesCount <= 1
| | | IsolationCount <= 5: Predator (5.0/1.0)
| | | IsolationCount > 5: Victim (4.0)
| | ActivitiesCount > 1: Predator (2.0)
| RelationshipCount > 0: Victim (10.0)
DesensitizationCount > 35: Predator (11.0/1.0)
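
Read as code, the tree is four nested threshold tests. The function below is a direct transcription of the J48 output above; the function and parameter names are illustrative. In J48 notation, a leaf such as `(5.0/1.0)` means five training instances reached the leaf and one of them was misclassified.

```python
def classify(desensitization, relationship, activities, isolation):
    """Apply the decision tree from the slide to one set of category counts."""
    if desensitization > 35:
        return "Predator"
    if relationship > 0:
        return "Victim"
    if activities > 1:
        return "Predator"
    # Remaining branch: few desensitization, relationship and activity phrases.
    return "Predator" if isolation <= 5 else "Victim"


# Leaf sizes and leaf errors as printed in the tree.
leaf_sizes = [5, 4, 2, 10, 11]
leaf_errors = [1, 0, 0, 0, 1]
training_accuracy = (sum(leaf_sizes) - sum(leaf_errors)) / sum(leaf_sizes)
```

The leaves account for all 32 instances with 2 errors, so the tree fits the full data set far better than the cross-validated success rate suggests.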
Predator vs. Victim Patterns
[Figure: chart of predator vs. victim patterns]
Categorization Experiments
• Second Experiment
  ◦ Class: {PJ, Non-PJ} (PJ = Perverted-Justice)
    ▪ 29 instances: 14 PJ transcripts, 15 Non-PJ
    ▪ Non-PJ obtained from Dr. Susan Gauch, collected during her ChatTrack project
    ▪ PJ transcripts: both Victim and Predator were coded
  ◦ Same eight attributes
Results
• Classifier: C4.5 (J48 in Weka)
• 3-fold cross validation
• Success Rate: 93%
  ◦ baseline 48%
• Confusion matrix

                   Classified as Not PJ   Classified as PJ
Actually Not PJ    15                     0
Actually PJ        2                      12
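
Beyond the overall success rate, the matrix also gives precision and recall for the PJ class. The sketch below derives them from the four cells, which sum to 29; treating PJ as the positive class is a choice made here for illustration.

```python
# Cells of the confusion matrix, with PJ as the positive class.
true_negative = 15   # actually Not PJ, classified as Not PJ
false_positive = 0   # actually Not PJ, classified as PJ
false_negative = 2   # actually PJ, classified as Not PJ
true_positive = 12   # actually PJ, classified as PJ

n = true_negative + false_positive + false_negative + true_positive
pj_accuracy = (true_positive + true_negative) / n
pj_precision = true_positive / (true_positive + false_positive)
pj_recall = true_positive / (true_positive + false_negative)

print(f"n = {n}, accuracy = {pj_accuracy:.1%}, "
      f"precision = {pj_precision:.1%}, recall = {pj_recall:.1%}")
```

No ordinary chat was flagged as PJ, while two of the fourteen PJ transcripts were missed.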
Non PJ vs. PJ
[Figure: chart comparing Non-PJ and PJ transcripts]
Clustering Experiments
• All 288 PJ Transcripts
• k-means clustering
• Same eight attributes
  ◦ column normalized
• Four clusters found
  ◦ minimum intra-cluster variation
  ◦ multiple runs to avoid local minima
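
The clustering recipe above (normalize columns, run k-means several times, keep the run with the least intra-cluster variation) can be sketched in plain Python. This is not the implementation used in the study. Dividing each column by its maximum is an assumed normalization scheme, and the function names, restart count, and seed are illustrative.

```python
import random


def normalize_columns(rows):
    """Scale each column into [0, 1] by dividing by its maximum (assumed scheme)."""
    maxima = [max(col) or 1 for col in zip(*rows)]
    return [[value / m for value, m in zip(row, maxima)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (sse, assignment, centroids)."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Intra-cluster variation: sum of squared distances to assigned centroids.
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return sse, assignment, centroids


def kmeans(points, k, restarts=20, seed=0):
    """Repeat k-means from several random starts and keep the lowest-SSE run."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])


# Toy data with two obvious groups (not study data).
toy_rows = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
toy_sse, toy_assignment, toy_centroids = kmeans(normalize_columns(toy_rows), k=2)
```

For the study's setting, `points` would be the 288 normalized eight-attribute vectors and `k` would be 4.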
Clusters Found
[Figure: Predator Category Cluster Centroids; y-axis: Normalized Activity Count; series: Cluster0, Cluster1, Cluster2, Cluster3]
Labeling the Clusters
• 60 Transcripts Analyzed Closely
• Age Deception Data Categorized
  ◦ Four distinct ways that deception can be achieved when communicating with others
    1. Quantity
    2. Quality
    3. Relation
    4. Manner
  McCornack, S. A., Levine, T. R., Solowczuk, K. A., Torres, H. I., & Campbell, D. M. (1992). When the alteration of information is viewed as deception: An empirical test of information manipulation theory. Communication Monographs, 59, 17-29.
• Age data captured for all 288 transcripts
Age Deception Statistics

                       Number of Transcripts   Percentage of Transcripts
No discussion of age   3                       5%
Honest Predators       36                      60%
Deceptive Predators    21                      35%
Type of Deception
• Quantity manipulation findings
  ◦ Honest predators' average real age was 31 years
  ◦ Deceptive predators' average real age was 38 years
• Quality manipulation findings
  ◦ Average age given by deceptive predators was 27 years
• Relation and Manner manipulation findings
  ◦ Rarely used by online sexual predators
Age Labeling – a bust

Cluster   Total   Honest   Percent
C0        70      50       71%
C1        173     112      65%
C2        16      12       75%
C3        27      20       74%
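
The Percent column follows from Total and Honest. The sketch below recomputes it and adds the pooled honesty rate for comparison; the variable names are illustrative, and the pooled rate is derived here rather than taken from the slides.

```python
# (total transcripts, honest predators) per cluster, from the table.
clusters = {"C0": (70, 50), "C1": (173, 112), "C2": (16, 12), "C3": (27, 20)}

honest_percent = {
    name: round(100 * honest / total) for name, (total, honest) in clusters.items()
}

all_total = sum(total for total, _ in clusters.values())
all_honest = sum(honest for _, honest in clusters.values())
overall_percent = round(100 * all_honest / all_total)
spread = max(honest_percent.values()) - min(honest_percent.values())

print(honest_percent, f"overall = {overall_percent}%", f"spread = {spread} points")
```

Every cluster lies within about ten points of the pooled rate, which is presumably why age honesty did not work as a cluster label.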
Synergistic Activities
• Content Analysis for the Web 2.0
  ◦ Misbehavior Detection Task
• Pendar, Nick (2007). "Toward Spotting the Pedophile: Telling Victim from Predator in Text Chats." In Proceedings of the First IEEE International Conference on Semantic Computing, 235-241. Irvine, California.
  ◦ Study for the Termination of Online Predators (STOP)
• Hughes, D., P. Rayson, J. Walkerdine, K. Lee, P. Greenwood, A. Rashid, C. May-Chahal, and M. Brennan (2008). "Supporting Law Enforcement in Digital Communities through Natural Language Analysis." In Proceedings of the 2nd International Workshop on Computational Forensics (IWCF'08). Washington D.C., USA, August 2008.
  ◦ Isis – Protecting Children in Online Social Networks
Where are we going?

• Data remains a big problem
  ◦ PJ data is problematic
  ◦ Access to large chat or "chat-like" collections is hard to get
• Labeling is a bigger problem
  ◦ Finding predatory chat is a "needle in a haystack" problem
• Applications are nice, but applications need to be grounded in text mining and communicative theory research.
Acknowledgements





Amanda Leatherman
Lynne Edwards
Kristina Moore
Brian D. Davison and students at Lehigh Univ.
Ursinus College
◦ Media and Communication Studies faculty and
students
◦ Mathematics and Computer Science faculty and
students

Text Mining Workshop organizers and reviewers
Contact Information
April Kontostathis
Ursinus College
akontostathis@ursinus.edu
http://webpages.ursinus.edu/akontostathis
610-409-3000 x2650