Outlier Detection Sanaz Bahargam Evimaria Terzi Boston University

advertisement
Sanaz Bahargam
Boston University
Outlier Detection
We consider the problem of finding the outliers and partitioning the
data to some groups, given the desired partition representative and we
are allowed to remove a predefined number of outliers.
Target-ℓ Partitioning
Target Partitioning
A Big Perspective
!
Evimaria Terzi
Boston University
Given X = {x1, x2, .. , xn} and vectors τ1, τ2,…, τk partition X into k
Given X = {x1, x2, .. , xn} and vectors τ1, τ2,…, τk and ℓ, partition
partitions such that Σ || µ(Ci) - τi || (i=1..k) is minimized.
X \ X ℓ into k groups such that Σ || µ(Ci) - τi || (i=1..k) is minimized.
Target Partitioning Problem is NP-hard and NP-hard to approximate
Target-ℓ Partitioning Problem is NP-hard and NP-hard to approximate
Solving Target-ℓ Partitioning
Solving Target Partitioning
ℓ-Removal
Given X = {x1, x2, .. , xn} and a vector τ, find ℓ point such that
|| mean(X \ X ℓ), τ|| is minimized.
ℓ-Removal problem is NP-hard and NP-hard to approximate
Set Partitioning Problem < ℓ-Removal
K-means
Step 1: Partition the points into k groups, using the algorithm for target
partitioning algorithm.
Step2: For each partitions find i= 1.. ℓ points to remove from, using
algorithm for ℓ-removal problem.
Step3: Remove ℓ points:
D(i, j) = max { D(i - 1, j - q) + d(i, q)} 0 <= q <= j
Benefit Partition
➢ Cluster points using K-means
clustering to get cluster centers
➢ Find the matching between
cluster centers and targets
(min-weight perfect matching)
Assign each point to the
partition which benefits the
most from adding that point.
!
Solving ℓ-Removal
Integer Regression: Find a 0-1 vector S such that ||mean(X * S), τ||
is minimized, and S contains at least n - ℓ 1s
Intel Dataset
PLOS Dataset
Detected Outliers in PLOS:
“AID Enzymatic Activity Is Inversely Proportional to the Size of
Cytosine C5 Orbital Cloud”
Step 1: Find a nonnegative real-valued vector S minimizing cost
function
Random Data
K-means with
Targets
42422
Synthetic Data
K-means
30168
Benefit
Partitioning
29187
Random
partitioning
35355
Twitted: 0
HTML Page downloaded: 1583
PDF Downloaded: 262
XML Downloaded: 150
AVG: 38.55
AVG: 1734.49
AVG: 350.81
AVG: 187.95
!
“Performance of the PointCare NOW System for CD4 Counting
in HIV Patients Based on Five Independent Evaluations”
Twitted: 23
HTML Page downloaded: 704
PDF Downloaded: 71
XML Downloaded: 9
AVG: 38.55
AVG: 1734.49
AVG: 350.81
AVG: 187.95
Intel Dataset
Step 2: Transform S into an integer value vector which contains at least
n - ℓ 1s
K-means with
Targets
4310291
K-means
4453967
Benefit
Partitioning
95182
PLOS Dataset
Random
partitioning
126192
Reference
Selecting a Set of Characteristic Reviews, T. Lappas, M. Crovella, E.
1
Terzi. ACM SIGKDD 2012
Download