Clickstream Models & Sybil Detection

Gang Wang (王刚)
UC Santa Barbara
gangw@cs.ucsb.edu
Modeling User Clickstream Events

User-generated events
 E.g., profile load, link follow, photo browse, friend invite
 Assume each event has an event type, userID, and timestamp
[Figure: sample log records with columns UserID, Event Generated, Timestamp]
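A minimal sketch of the log record described above, assuming each event is a (userID, event type, timestamp) triple. The names `ClickEvent` and `group_by_user` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class ClickEvent(NamedTuple):
    user_id: str
    event_type: str   # e.g. "profile_load", "link_follow"
    timestamp: float  # seconds; the time unit is an assumption


def group_by_user(events):
    """Return {user_id: [ClickEvent, ...]}, each list sorted by timestamp."""
    streams = defaultdict(list)
    for ev in events:
        streams[ev.user_id].append(ev)
    for clicks in streams.values():
        clicks.sort(key=lambda ev: ev.timestamp)
    return dict(streams)


log = [
    ClickEvent("u1", "profile_load", 10.0),
    ClickEvent("u2", "friend_invite", 11.0),
    ClickEvent("u1", "photo_browse", 4.0),
]
streams = group_by_user(log)
```

Each per-user list is one clickstream; everything later in the talk operates on these lists.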
Intuition: Sybil users act differently from normal users
 Goal-oriented: focus on specific actions, fewer “extraneous” events
 Time-limited: focused on efficient use of time, smaller gaps?
 Forcing Sybil users to mimic normal users → win?
System Overview
[Figure: system pipeline. Components: Clickstream Log, Sequence Clustering, Cluster Coloring (using Known Good Users), Incoming Clickstream classified as Legit or Sybils]
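The slide names the Cluster Coloring step without giving the rule. One possible reading, sketched below, labels a cluster legitimate when enough of its members are known good users. The function name `color_clusters` and the threshold `min_good_fraction` are assumptions, not taken from the talk.

```python
def color_clusters(clusters, known_good, min_good_fraction=0.05):
    """Label each cluster "legit" or "sybil".

    clusters: {cluster_id: set of user ids}
    known_good: set of user ids known to be legitimate
    min_good_fraction: hypothetical threshold, not from the slides
    """
    colors = {}
    for cid, members in clusters.items():
        good = len(members & known_good)
        fraction = good / len(members) if members else 0.0
        colors[cid] = "legit" if fraction >= min_good_fraction else "sybil"
    return colors


clusters = {0: {"a", "b", "c", "d"}, 1: {"x", "y", "z"}}
known_good = {"a", "b"}
colors = color_clusters(clusters, known_good)
```

Here cluster 0 contains known good users and is colored legit; cluster 1 contains none and is colored Sybil.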
Clickstream Models

Clickstream log
 User clicks (click type) with timestamp
Modeling Clickstream
 Event-only
  e.g. Sequence Model: order of events
  ABCDA
 Time-based
  e.g. Model: sequence of inter-arrival time
  {t1, t2, t3, …}
 Hybrid
  e.g. Model: sequence of click events with time
  A(t1)B(t2)C(t3)D(t4)A
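The three representations can be derived from one list of (event, timestamp) pairs. The sketch below assumes the t values in A(t1)B(t2)C(t3)D(t4)A are the gaps between consecutive clicks, which matches four gaps for five events. All function names are hypothetical.

```python
def event_sequence(clicks):
    """Event-only model: the order of events, timestamps dropped."""
    return [event for event, _ in clicks]


def inter_arrival_times(clicks):
    """Time-based model: gaps between consecutive clicks."""
    times = [t for _, t in clicks]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def hybrid_sequence(clicks):
    """Hybrid model: each event followed by the gap to the next event."""
    gaps = inter_arrival_times(clicks)
    out = []
    for i, (event, _) in enumerate(clicks):
        out.append(event)
        if i < len(gaps):
            out.append(gaps[i])
    return out


clicks = [("A", 0), ("B", 2), ("C", 3), ("D", 7), ("A", 12)]
print(event_sequence(clicks))       # ['A', 'B', 'C', 'D', 'A']
print(inter_arrival_times(clicks))  # [2, 1, 4, 5]
print(hybrid_sequence(clicks))      # ['A', 2, 'B', 1, 'C', 4, 'D', 5, 'A']
```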
Clickstream Clustering

Similarity Graph
 Vertices: users (or sessions)
 Edges: weighted by the similarity score of the two users' clickstreams

Clustering similar clickstreams together
 Graph partitioning using METIS
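A rough sketch of building the similarity graph as a weighted edge list. It assumes a distance in [0, 1] and takes similarity as 1 minus distance; both choices, the `min_similarity` cutoff, and the toy distance are assumptions for illustration. The partitioning step itself is left to METIS and is not shown.

```python
from itertools import combinations


def build_similarity_graph(streams, distance, min_similarity=0.0):
    """Return {(u, v): weight} for every user pair above the cutoff.

    streams: {user_id: clickstream representation}
    distance: function returning a value in [0, 1]
    weight = 1 - distance is an assumption, not from the slides
    """
    edges = {}
    for u, v in combinations(sorted(streams), 2):
        similarity = 1.0 - distance(streams[u], streams[v])
        if similarity > min_similarity:
            edges[(u, v)] = similarity
    return edges


def toy_distance(a, b):
    """Hypothetical stand-in: share of event types the two streams do not have in common."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


streams = {"u1": "AAB", "u2": "AAC", "u3": "DDD"}
graph = build_similarity_graph(streams, toy_distance)
```

Only the u1-u2 edge survives here, since u3 shares no events with the others.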
Q: How to compare two clickstreams?
Distance Functions Of Each Model

Click Sequence (CS) Model
 Ngram overlap
  S1= AAB
  S2= AAC
  ngram1= {A, B, AA, AB, AAB}
  ngram2= {A, C, AA, AC, AAC}
 Ngram+count
  S1= AAB
  S2= AAC
  ngram1= {A(2), B(1), AA(1), AB(1), AAB(1)}
  ngram2= {A(2), C(1), AA(1), AC(1), AAC(1)}
  Euclidean Distance
  V1=(2,1,0,1,0,1,1,0)/6
  V2=(2,0,1,1,1,0,0,1)/6

Time-based Model
 Compare the distribution of inter-arrival time
 K-S test

Hybrid Model
 Bucketize inter-arrival time
 Compute 5grams (similar to the CS Model)
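The distance functions can be sketched as follows, using S1 = AAB and S2 = AAC from the slide. Judging from the example, an "ngram" set holds every k-gram for k = 1..n, and V1, V2 are counts over the union (A, B, C, AA, AC, AB, AAB, AAC) divided by the total count of 6. The slides do not say which overlap measure is used, so Jaccard distance below is an assumption, as is using the two-sample K-S statistic directly as the distance. Function names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter


def ngram_counts(seq, n):
    """Count every k-gram of the string seq for k = 1..n."""
    counts = Counter()
    for k in range(1, n + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overlap_distance(s1, s2, n):
    """Ngram overlap as Jaccard distance (assumed measure)."""
    g1, g2 = set(ngram_counts(s1, n)), set(ngram_counts(s2, n))
    union = g1 | g2
    return 1.0 - len(g1 & g2) / len(union) if union else 0.0


def count_distance(s1, s2, n):
    """Ngram+count: Euclidean distance between normalized count vectors.

    Assumes both sequences are non-empty.
    """
    c1, c2 = ngram_counts(s1, n), ngram_counts(s2, n)
    t1, t2 = sum(c1.values()), sum(c2.values())
    return math.sqrt(sum((c1[g] / t1 - c2[g] / t2) ** 2
                         for g in set(c1) | set(c2)))


def ks_statistic(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
               for p in set(xs) | set(ys))


print(count_distance("AAB", "AAC", 3))    # sqrt(1/6), about 0.408
print(overlap_distance("AAB", "AAC", 3))  # 0.75
```

For the Hybrid Model, the same n-gram functions would apply once each inter-arrival time is replaced by its bucket label; the bucket boundaries are not given in the slides.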
Detection In A Nutshell

Inputs:
 Trained clusters
 Input sequences for testing

Methodology: given a test sequence A
 K-nearest neighbor: find the top-k nearest sequences in the trained clusters
 Nearest Cluster: find the nearest cluster based on average distance to the sequences in the cluster
 Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
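The three detection methods can be sketched side by side. This assumes each sequence has already been turned into a numeric vector (for example normalized n-gram counts), that distance is Euclidean, and that each trained cluster carries a legit/sybil label. The data layout and all names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_label(test, clusters, k=3):
    """K-nearest neighbor: majority label among the k closest training vectors."""
    scored = sorted(
        (euclidean(test, vec), label)
        for label, vectors in clusters.values()
        for vec in vectors
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def nearest_cluster_avg(test, clusters):
    """Nearest Cluster: smallest average distance to the cluster's members."""
    def avg_distance(vectors):
        return sum(euclidean(test, v) for v in vectors) / len(vectors)
    label, _ = min(clusters.values(), key=lambda c: avg_distance(c[1]))
    return label


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest_cluster_center(test, centers):
    """Nearest Cluster (center): compare against precomputed centers only."""
    label, _ = min(centers, key=lambda c: euclidean(test, c[1]))
    return label


clusters = {
    0: ("legit", [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
    1: ("sybil", [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]),
}
centers = [(label, centroid(vecs)) for label, vecs in clusters.values()]
test = [4.5, 5.5]
```

The center variant does one distance computation per cluster rather than one per training sequence, which is why it is the cheapest of the three at test time.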
Clustering Sequences

How well can each method separate Sybils from legitimate users?

(False positives, False negatives) of users

Model (Sequence Type)              Distance Function  20 clusters  50 clusters  100 clusters
Click Sequence Model (Categories)  unigram            (3%, 6%)     (1%, 7%)     (2%, 4%)
                                   unigram+count      (1%, 4%)     (1%, 3%)     (1%, 3%)
                                   10gram             (1%, 3%)     (1%, 3%)     (2%, 2%)
                                   10gram+count       (1%, 4%)     (2%, 4%)     (1%, 2%)
Time-based Model                   K-S Test           (9%, 8%)     (2%, 10%)    (5%, 10%)
Hybrid Model (Categories)          5gram              (3%, 2%)     (2%, 2%)     (2%, 2%)
                                   5gram+count        (3%, 4%)     (4%, 5%)     (1%, 2%)
Detection Accuracy

Basics
 Train on one group of users and test on the other group of users
 Clusters trained using Hybrid Model

Key takeaways
 High accuracy with 50 clicks in the test sequence
 Nearest Cluster (Center) method achieves high accuracy with minor computation overhead

(False positives, False negatives) of users

Number of Clicks in the  K-nearest         Nearest Cluster  Nearest Cluster
Sequence (length)        Neighbors (k=3)   (Avg. Distance)  (Center)
Length <=50              (1.5%, 2.1%)      (0.6%, 2.6%)     (0.4%, 2.3%)
Length <=100             (0.9%, 1.8%)      (0.2%, 2.5%)     (0.3%, 2.3%)
All                      (0.6%, 3%)        (0.4%, 2.8%)     (0.4%, 2.3%)
Can Model Be Effective Over Time?

Experiment method
 Train the model on the first two weeks of data
 Test on the following two weeks of data
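A small sketch of this time-based split, assuming events are (userID, event type, datetime) triples. The start date and event values below are made up for illustration, and `split_by_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def split_by_time(events, start, train_days=14, test_days=14):
    """Train on [start, start + train_days), test on the next test_days."""
    train_end = start + timedelta(days=train_days)
    test_end = train_end + timedelta(days=test_days)
    train = [e for e in events if start <= e[2] < train_end]
    test = [e for e in events if train_end <= e[2] < test_end]
    return train, test


start = datetime(2012, 1, 1)  # hypothetical start of the log
events = [
    ("u1", "profile_load", datetime(2012, 1, 3)),
    ("u1", "photo_browse", datetime(2012, 1, 20)),
    ("u2", "friend_invite", datetime(2012, 2, 15)),
]
train, test = split_by_time(events, start)
```

Events after the four-week window fall into neither set.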
(False positives, False negatives) of users

Model                 K-nearest         Nearest Cluster  Nearest Cluster
                      Neighbors (k=3)   (Avg. Distance)  (Center)
Click Sequence Model  (1.8%, 1%)        (3%, 2%)         (3%, 0.8%)
Hybrid Model          (3%, 2%)          (3%, 1%)         (1.2%, 1.4%)
Still Ongoing Work


With broad interest and applications

As Sybil detection tool
 Code being tested internally at Renren
 Trained with 10K users (2-week log)
 Testing on 1 million users (1-week log)
  5 Sybil clusters
  22K suspicious profiles
 Further improvement
  Training with longer clickstreams (half of users have <5 clicks in 2 weeks)
  More conservative in labeling Sybil clusters

As user modeling tool
 Code being tested by LinkedIn as user profiler
Some Useful Tools

Graph Partitioning
 METIS: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

Community Detection
 Louvain code: https://sites.google.com/site/findcommunities/
Other Ongoing Works/Ideas

Fighting against crowdturfing
 Crowdturfing: real users are paid to spam
 How to detect these malicious real users
  User behavior model
  Network-wide temporal anomaly detection

Information Dissemination
 Content sharing via social edges
  How often will users click on the content
  How often will users comment on the content

Sybil detection, targeted ad placement
Thank You!
Questions?
http://current.cs.ucsb.edu