Sanaz Bahargam Boston University Outlier Detection We consider the problem of finding the outliers and partitioning the data to some groups, given the desired partition representative and we are allowed to remove a predefined number of outliers. Target-ℓ Partitioning Target Partitioning A Big Perspective ! Evimaria Terzi Boston University Given X = {x1, x2, .. , xn} and vectors τ1, τ2,…, τk partition X into k Given X = {x1, x2, .. , xn} and vectors τ1, τ2,…, τk and ℓ, partition partitions such that Σ || µ(Ci) - τi || (i=1..k) is minimized. X \ X ℓ into k groups such that Σ || µ(Ci) - τi || (i=1..k) is minimized. Target Partitioning Problem is NP-hard and NP-hard to approximate Target-ℓ Partitioning Problem is NP-hard and NP-hard to approximate Solving Target-ℓ Partitioning Solving Target Partitioning ℓ-Removal Given X = {x1, x2, .. , xn} and a vector τ, find ℓ point such that || mean(X \ X ℓ), τ|| is minimized. ℓ-Removal problem is NP-hard and NP-hard to approximate Set Partitioning Problem < ℓ-Removal K-means Step 1: Partition the points into k groups, using the algorithm for target partitioning algorithm. Step2: For each partitions find i= 1.. ℓ points to remove from, using algorithm for ℓ-removal problem. Step3: Remove ℓ points: D(i, j) = max { D(i - 1, j - q) + d(i, q)} 0 <= q <= j Benefit Partition ➢ Cluster points using K-means clustering to get cluster centers ➢ Find the matching between cluster centers and targets (min-weight perfect matching) Assign each point to the partition which benefits the most from adding that point. ! Solving ℓ-Removal Integer Regression: Find a 0-1 vector S such that ||mean(X * S), τ|| is minimized, and S contains at least n - ℓ 1s Intel Dataset PLOS Dataset Detected Outliers in PLOS: “AID Enzymatic Activity Is Inversely Proportional to the Size of Cytosine C5 Orbital Cloud” Step 1: Find a nonnegative real-valued vector S minimizing cost function Random Data K-means with Targets 42422 Synthetic Data K-means 30168 Benefit Partitioning 29187 Random partitioning 35355 Twitted: 0 HTML Page downloaded: 1583 PDF Downloaded: 262 XML Downloaded: 150 AVG: 38.55 AVG: 1734.49 AVG: 350.81 AVG: 187.95 ! “Performance of the PointCare NOW System for CD4 Counting in HIV Patients Based on Five Independent Evaluations” Twitted: 23 HTML Page downloaded: 704 PDF Downloaded: 71 XML Downloaded: 9 AVG: 38.55 AVG: 1734.49 AVG: 350.81 AVG: 187.95 Intel Dataset Step 2: Transform S into an integer value vector which contains at least n - ℓ 1s K-means with Targets 4310291 K-means 4453967 Benefit Partitioning 95182 PLOS Dataset Random partitioning 126192 Reference Selecting a Set of Characteristic Reviews, T. Lappas, M. Crovella, E. 1 Terzi. ACM SIGKDD 2012