Unsupervised One-Class Learning for Automatic Outlier Removal

Wei Liu¹, Gang Hua¹,², John R. Smith¹
¹IBM T. J. Watson Research Center   ²Stevens Institute of Technology
June 2014

Outliers
Outliers are pervasive in many statistics, computer vision, and pattern recognition problems.

Outlier Removal
We propose to automatically remove outliers from a corrupted one-class data source, where the proportion of outliers can be as high as 50%. Our approach works in a fully unsupervised manner, yielding a decision function f(x), with decision boundary f(x) = 0, to judge inliers vs. outliers.

Applications
Internet vision applications:
- Web image tag denoising (uses the signs of f).
- Web image re-ranking (uses the values of f).
[Figure: re-ranking examples for the queries "sky" and "people".]

Unsupervised One-Class Learning
The UOCL framework works on soft labels, minimizing
    prediction loss + neighborhood smoothness − average margin of inliers.
UOCL jointly learns a soft labeling and a one-class classifier f(x) in an unsupervised, self-guided fashion.

Kernel
Leverage a kernel function k(·,·) to formulate f(x) as a kernel expansion f(x) = Σᵢ αᵢ k(xᵢ, x). Use the manifold regularizer fᵀLf (with L a graph Laplacian) to enforce smoothness. Use soft labels taking two values (c⁺ for inliers, −c⁻ for outliers), chosen to bound the scale of f.

Alternating Optimization
Decompose the mixed optimization problem into two simpler subproblems and alternate between them:
- the α-subproblem (continuous, over the classifier coefficients);
- the y-subproblem (discrete, over the labels).
Each subproblem has an exactly optimal solution, so the alternating algorithm converges.

α-Subproblem
A constrained eigenvalue problem. The optimal solution [Gander et al. 1989] is determined by the smallest real eigenvalue of an auxiliary matrix.

y-Subproblem
A discrete optimization problem. Solve a related problem by sorting the values of f: the optimal y follows the same order as f, assigning positive labels to the top m elements (f > 0) and negative labels to the rest (f < 0).

Algorithm
1. Initialize α so that f gives the kernel density estimate.
2. Alternate between the α- and y-subproblems.
3. Output α and y.
[Figure: convergence curves of UOCL, objective value log(Q+n) vs. iteration #, on (a) UIUC-Scene and (b) Caltech-101; the objective levels off within about 8 iterations.]

Soft Labels
Try three kinds of hard/soft labels (c⁺, c⁻) satisfying the scale constraint. The adaptively balanced soft labels work best.

Experiments: Artificial Outliers
Outlier removal: mean F1 scores over object categories.

Experiments: Real-World Outliers
Image re-ranking: mean precision over the queries, with outlier proportions up to 60%.

Conclusions
- UOCL is highly robust to contamination of the input data and can suppress outliers at proportions up to 60%.
- The technical novelty lies in a self-guided joint learning mechanism coupling the one-class classifier with self-labeling.
- The adaptively balanced soft labels help handle the high outlier level.
- The alternating optimization algorithm achieves fast convergence.
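Appendix: the alternating scheme (initialize f as a kernel density estimate, then iterate a label step and a classifier step) can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's algorithm: the α-step below is plain Laplacian-regularized kernel ridge regression in place of the constrained eigenvalue solve, the affinity graph is the kernel matrix itself rather than a neighborhood graph, the inlier count m is assumed given, and the balanced label values c⁺, c⁻ are one plausible balancing choice.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def uocl_sketch(X, m, gamma=1.0, lam=0.1, mu=0.1, n_iter=20):
    """Alternating optimization in the spirit of UOCL (simplified sketch).

    m is an assumed inlier count; the alpha-step is Laplacian-regularized
    kernel ridge regression, NOT the paper's constrained eigenvalue solve.
    """
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # graph Laplacian, using the kernel itself as the affinity matrix
    # (a simplification; the paper uses a neighborhood graph)
    W = K - np.diag(np.diag(K))
    L = np.diag(W.sum(axis=1)) - W

    # balanced soft label values: total positive mass m*c_plus matches
    # total negative mass (n - m)*c_minus (one plausible balancing)
    c_plus = np.sqrt((n - m) / m)
    c_minus = np.sqrt(m / (n - m))

    def label_step(f):
        # y-step: the optimal labeling follows the sorted order of f,
        # so the top-m scores become inliers
        y = np.full(n, -c_minus)
        y[np.argsort(-f)[:m]] = c_plus
        return y

    alpha = np.ones(n) / n   # f = K @ alpha starts as the kernel density estimate
    # normal-equations matrix of ||K a - y||^2 + lam a'KLKa + mu a'Ka
    A = K @ K + lam * (K @ L @ K) + mu * K
    for _ in range(n_iter):
        y = label_step(K @ alpha)
        alpha = np.linalg.solve(A, K @ y)   # closed-form alpha-step
    f = K @ alpha
    return f, label_step(f)
```

On a toy one-class set (a tight inlier cluster plus scattered outliers), ranking by f recovers most true inliers even at a 33% outlier proportion.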