Handling Outliers and Missing Data in Statistical Data Models
Kaushik Mitra
Date: 17/1/2011
ECSU Seminar, ISI
Statistical Data Models
• Goal: Find structure in data
• Applications
– Finance
– Engineering
– Sciences
• Biological
– Wherever we deal with data
• Some examples
– Regression
– Matrix factorization
• Challenges: Outliers and Missing data
Outliers Are Quite Common
Google search results for 'male faces'
Need to Handle Outliers Properly
Removing salt-and-pepper (outlier) noise
[Figure: noisy image, Gaussian-filtered image, and the desired result]
Missing Data Problem
Missing tracks in structure from motion
Completing missing tracks
[Figure: incomplete tracks, tracks completed by a sub-optimal method, and the desired result]
Our Focus
• Outliers in regression
– Linear regression
– Kernel regression
• Matrix factorization in the presence of missing data
Robust Linear Regression for High-Dimensional Problems
What is Regression?
• Regression
– Find functional relation between y and x
• x: independent variable
• y: dependent variable
– Given
• Data: (y_i, x_i) pairs
• Model: y = f(x, w) + n
– Estimate w
– Predict y for a new x
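A minimal least-squares sketch of this setup in numpy (the data, sizes, and noise level below are made up for illustration):

```python
import numpy as np

# Toy instance of y = f(x, w) + n with a linear f
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))            # rows are the x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Estimate w by ordinary least squares, then predict y for a new x
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
x_new = rng.normal(size=D)
y_pred = x_new @ w_hat
```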
Robust Regression
• Real-world data is corrupted with outliers
• Outliers make estimates unreliable
• Robust regression
– Unknown
• Parameter, w
• Outliers
– Combinatorial problem
• N data points, k of them outliers
• C(N, k) possible outlier subsets
Prior Work
• Combinatorial algorithms
– Random sample consensus (RANSAC)
– Least Median of Squares (LMedS)
• Exponential in dimension
• M-estimators
– Robust cost functions
– Prone to local minima
Robust Linear Regression Model
• Linear regression model: y_i = x_i^T w + e_i
– e_i, Gaussian noise
• Proposed robust model: e_i = n_i + s_i
– n_i, inlier noise (Gaussian)
– s_i, outlier noise (sparse)
• Matrix-vector form
– y = Xw + n + s
• Estimate w, s
Stacking the N equations y_i = x_i^T w + n_i + s_i:

  [y_1; y_2; …; y_N] = [x_1^T; x_2^T; …; x_N^T] [w_1; w_2; …; w_D] + [n_1; n_2; …; n_N] + [s_1; s_2; …; s_N]
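For concreteness, a small numpy sketch that draws data from this robust model (the sizes, noise levels, and outlier magnitudes are assumptions, not values from the talk):

```python
import numpy as np

# Illustrative data drawn from the robust model y = Xw + n + s
rng = np.random.default_rng(1)
N, D, k = 200, 8, 20                    # N points, D dimensions, k outliers
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)

n = 0.05 * rng.normal(size=N)           # dense inlier (Gaussian) noise
s = np.zeros(N)                         # sparse outlier noise
out_idx = rng.choice(N, size=k, replace=False)
s[out_idx] = rng.uniform(-5.0, 5.0, size=k)

y = X @ w_true + n + s
```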
Simplification
• Objective (as in RANSAC): find the w that minimizes the number of outliers

  min_{s, w} ||s||_0 subject to ||y - Xw - s||_2 ≤ ε

• Eliminate w
– Model: y = Xw + n + s
– Premultiply by a matrix C with CX = 0 (such a C exists when N > D)
• Cy = CXw + Cs + Cn
• z = Cs + g
• g Gaussian
• Problem becomes:

  min_s ||s||_0 subject to ||z - Cs||_2 ≤ ε

• Solve for s → identify outliers → least squares on the inliers → w
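A sketch of the elimination step, continuing the toy data above (choosing C as an orthonormal basis of the left null space of X is one standard way to get CX = 0; the helper refit_inliers and its threshold are illustrative):

```python
import numpy as np

# Eliminate w: pick C whose rows span the left null space of X, so CX = 0
U, _, _ = np.linalg.svd(X)              # full SVD: U is N x N
C = U[:, D:].T                          # (N - D) x N
assert np.allclose(C @ X, 0.0, atol=1e-8)

z = C @ y                               # z = Cs + g with g Gaussian

def refit_inliers(X, y, s_hat, tol=1e-3):
    """Drop points flagged as outliers by s_hat, re-fit w by least squares."""
    inliers = np.abs(s_hat) < tol
    w_hat, *_ = np.linalg.lstsq(X[inliers], y[inliers], rcond=None)
    return w_hat
```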
Relation to Sparse Learning
• Solve:

  min_s ||s||_0 subject to ||z - Cs||_2 ≤ ε
– Combinatorial problem
• Sparse basis selection / sparse learning
• Two approaches:
– Basis Pursuit (Chen, Donoho, Saunders 1995)
– Bayesian Sparse Learning (Tipping 2001)
Basis Pursuit Robust Regression (BPRR)
• Solve:

  min_s ||s||_1 such that ||z - Cs||_2 ≤ ε

– Basis Pursuit Denoising (Chen et al. 1995)
– Convex problem
– Cubic complexity: O(N^3)
• From compressive sensing theory (Candes 2005)
– Equivalent to the original problem if
• s is sparse
• C satisfies the Restricted Isometry Property (RIP)
• Isometry: ||s_1 - s_2|| ≈ ||C(s_1 - s_2)||
• Restricted: to the class of sparse vectors
• In general, no guarantees for our problem
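A possible BPRR sketch using cvxpy for the l1 relaxation (the solver choice, the value of eps, and the reuse of z, C, and refit_inliers from the sketches above are assumptions):

```python
import cvxpy as cp

# Basis pursuit denoising on the reduced problem in s
eps = 0.5                                # noise-level parameter (assumed)
s_var = cp.Variable(N)
prob = cp.Problem(cp.Minimize(cp.norm1(s_var)),
                  [cp.norm(z - C @ s_var, 2) <= eps])
prob.solve()

s_hat = s_var.value                      # large entries flag the outliers
w_hat = refit_inliers(X, y, s_hat)       # least squares on detected inliers
```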
Bayesian Sparse Robust Regression (BSRR)
• Sparse Bayesian learning technique (Tipping 2001)
– Puts a sparsity-promoting prior on s: p(s) ∝ ∏_{i=1}^N 1/|s_i|
– Likelihood: p(z|s) = N(Cs, εI)
– Solves the MAP problem for p(s|z)
– Cubic complexity: O(N^3)
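A simplified EM-style sparse Bayesian learning loop in the spirit of Tipping (2001), shown only to illustrate the idea behind BSRR; the noise variance, iteration count, and update rule below are assumptions rather than the actual BSRR implementation:

```python
import numpy as np

# Hierarchical prior s_i ~ N(0, gamma_i); learn the gamma_i by EM
sigma2 = 0.01                            # inlier noise variance (assumed known)
gamma = np.ones(N)                       # per-coefficient prior variances
for _ in range(50):
    Sigma = np.linalg.inv(np.diag(1.0 / gamma) + C.T @ C / sigma2)
    mu = Sigma @ C.T @ z / sigma2        # posterior mean of s
    gamma = np.maximum(mu**2 + np.diag(Sigma), 1e-10)  # most shrink toward 0
s_hat = mu                               # sparse estimate of the outlier vector
```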
Setup for Empirical Studies
• Synthetically generated data
• Performance criteria
– Angle between the ground-truth and estimated hyperplanes
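One way to compute this criterion, assuming each hyperplane is represented by its parameter (normal) vector:

```python
import numpy as np

def hyperplane_angle_deg(w_true, w_est):
    """Angle (in degrees) between the two parameter vectors."""
    c = abs(w_true @ w_est) / (np.linalg.norm(w_true) * np.linalg.norm(w_est))
    return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))
```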
Vary Outlier Fraction
[Plots: angle error vs. outlier fraction for dimensions 2, 8, and 32]
• BSRR performs well in all dimensions
• Combinatorial algorithms such as RANSAC, MSAC, and LMedS are not practical in high dimensions
Facial Age Estimation
• FG-NET dataset: 1002 images of 82 subjects
• Regression
– y: age
– x: geometric feature vector
Outlier Removal by BSRR
• Label data as inliers and outliers
• Detected 177 outliers in 1002 images
• Leave-one-out testing
• BSRR mean absolute error (MAE): inliers 3.73, outliers 19.14, overall 6.45
Summary for Robust Linear Regression
• Modeled outliers as sparse variable
• Formulated robust regression as a sparse learning problem
– BPRR and BSRR
• BSRR gives the best performance
• Limitation: linear regression model
– Addressed next with a kernel model
Robust RVM Using a Sparse Outlier Model
Relevance Vector Machine (RVM)
• RVM model: y(x) = Σ_{i=1}^N w_i k(x, x_i) + w_0 + e
– k(x, x_i): kernel function
• Examples of kernels
– k(x_i, x_j) = (x_i^T x_j)^2: polynomial kernel
– k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)): Gaussian kernel
• Kernel trick: k(x_i, x_j) = ψ(x_i)^T ψ(x_j)
– Maps x_i to the feature space ψ(x_i)
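A small numpy helper for the Gaussian kernel above (the bandwidth sigma is a free parameter):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))
```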
RVM: A Bayesian Approach
• Bayesian approach
– Prior distribution : p(w)
– Likelihood : p( y | x, w )
• Prior specification
– p(w): sparsity-promoting prior, p(w_i) ∝ 1/|w_i|
– Why sparse?
• Use a smaller subset of training data for prediction
• Support vector machine
• Likelihood
– Gaussian noise
• Non-robust : susceptible to outliers
Robust RVM Model
• Original RVM model: y = Σ_{j=1}^N w_j k(x, x_j) + w_0 + e
– e, Gaussian noise
• Explicitly model outliers: e_i = n_i + s_i
– n_i, inlier noise (Gaussian)
– s_i, outlier noise (sparse and heavy-tailed)
• Matrix-vector form
– y = Kw + n + s
• Parameters to be estimated: w and s
Robust RVM Algorithms
• y = [K | I] w_s + n
– w_s = [w^T s^T]^T: sparse vector
• Two approaches
– Bayesian
– Optimization
Robust Bayesian RVM (RB-RVM)
• Prior specification
– w and s independent: p(w, s) = p(w)p(s)
– Sparsity-promoting prior for s: p(s_i) ∝ 1/|s_i|
• Solve for the posterior p(w, s | y)
• Prediction: use the w inferred above
• Computation: a bigger RVM
– w_s instead of w
– [K | I] instead of K
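A sketch of the augmented design this slide describes (the bias term w_0 is ignored here, and gaussian_kernel_matrix is the illustrative helper from above):

```python
import numpy as np

# RB-RVM viewed as "a bigger RVM": stack the unknowns, augment the design
K = gaussian_kernel_matrix(X, sigma=1.0)     # kernel matrix over training x_i
Phi = np.hstack([K, np.eye(K.shape[0])])     # [K | I], size N x 2N
# y ≈ Phi @ ws + n with ws = [w; s]; running a standard RVM / sparse Bayesian
# solver on (Phi, y) yields both the kernel weights w and the outlier vector s.
```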
Basis Pursuit RVM (BP-RVM)
• Optimization approach:

  min_{w_s} ||w_s||_0 subject to ||y - [K | I] w_s||_2 ≤ ε

– Combinatorial
• Closest convex approximation:

  min_{w_s} ||w_s||_1 subject to ||y - [K | I] w_s||_2 ≤ ε

• From compressive sensing theory
– Same solution if [K | I] satisfies the RIP
• In general, this cannot be guaranteed
Experimental Setup
Prediction: Asymmetric Outliers Case
Image Denoising
• Salt and pepper noise
– Outliers
• Regression formulation
– Image as a surface over a 2-D grid
• y: intensity
• x: position on the 2-D grid
• Denoised image obtained by prediction
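A sketch of how an image turns into regression data under this formulation (the image and its size are placeholders):

```python
import numpy as np

# Every pixel becomes a training pair: x = its 2-D grid position, y = intensity
img = np.random.rand(64, 64)                 # stand-in for a noisy image
r, c = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
X_pix = np.stack([r.ravel(), c.ravel()], axis=1).astype(float)   # x: positions
y_pix = img.ravel()                          # y: intensities
# Fitting the robust RVM on (X_pix, y_pix) and predicting back on the grid
# gives the denoised image; salt-and-pepper pixels are absorbed into s.
```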
Salt and Pepper Noise
Some More Results
[Figure: denoising results for RVM, RB-RVM, and a median filter]
Age Estimation from Facial Images
• RB-RVM detected 90 outliers
• Leave-one-person-out testing
Summary for Robust RVM
• Modeled outliers as sparse variables
• Jointly estimated parameter and outliers
• Bayesian approach gives very good results
Limitations of Regression
• Regression: y = f(x,w)+n
– Noise in only “y”
– Not always reasonable
• All variables have noise
– M = [x_1 x_2 … x_N]
– Principal component analysis (PCA)
• [x_1 x_2 … x_N] = AB^T
– A: principal components
– B: coefficients
– M = AB^T: matrix factorization (our next topic)
Matrix Factorization in the Presence of Missing Data
Applications in Computer Vision
• Matrix factorization: M=ABT
• Applications: build 3-D models from images
– Geometric approach (multiple views): Structure from Motion (SfM)
– Photometric approach (multiple lightings): photometric stereo
Matrix Factorization
• Applications in Vision
– Affine Structure
from Motion (SfM)
– Photometric stereo
• Solution: SVD
– M = USV^T
– Truncate S to rank r
• A = US^{1/2}, B = VS^{1/2}
– Affine SfM: M = [x_ij; y_ij] = CS^T, a rank-4 matrix
– Photometric stereo: M = NS^T, rank 3
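A minimal numpy sketch of the SVD-based factorization for the complete-data case described above:

```python
import numpy as np

def factorize_svd(M, r):
    """Rank-r factorization M ~ A @ B.T via a truncated SVD (complete data)."""
    U, svals, Vt = np.linalg.svd(M, full_matrices=False)
    S_half = np.diag(np.sqrt(svals[:r]))
    A = U[:, :r] @ S_half                    # A = U S^{1/2}
    B = Vt[:r].T @ S_half                    # B = V S^{1/2}
    return A, B
```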
Missing Data Scenario
• Missing feature tracks in SfM
[Figure: incomplete feature tracks]
• Specularities and shadows in photometric stereo
Challenges in Missing Data Scenario
• Can’t use SVD
• Solve instead (see the sketch at the end of this slide):

  min_{A,B} ||W ⊙ (M - AB^T)||_F^2 + λ(||A||_F^2 + ||B||_F^2)

• W: binary weight matrix (1 for observed entries, 0 for missing; ⊙ is the elementwise product), λ: regularization parameter
• Challenges
– Non-convex problem
– Newton's-method-based algorithm (Buchanan et al. 2005)
• Very slow
• Goal: design an algorithm that is
– Fast (handles large-scale data)
– Flexible enough to handle additional constraints
• Orthonormality constraints in orthographic SfM
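A small helper that evaluates the objective above for candidate factors (purely illustrative; it is not the proposed LRSDP solver):

```python
import numpy as np

def weighted_mf_cost(M, W, A, B, lam):
    """||W o (M - A B^T)||_F^2 + lam * (||A||_F^2 + ||B||_F^2)."""
    resid = W * (M - A @ B.T)                # W zeroes out the missing entries
    return np.sum(resid**2) + lam * (np.sum(A**2) + np.sum(B**2))
```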
Proposed Solution
• Formulate matrix factorization as a low-rank
semidefinite program (LRSDP)
– LRSDP: fast implementation of SDP (Burer, 2001)
• Quasi-Newton algorithm
• Advantages of the proposed formulation:
– Solve large-scale matrix factorization problem
– Handle additional constraints
Low-Rank Semidefinite Programming (LRSDP)
• Stated as:

  min_R  C • RR^T  subject to  A_l • RR^T = b_l,  l = 1, 2, ..., k

  (• denotes the matrix inner product, trace(C^T RR^T))
• Variable: R
• Constants
– C: cost matrix
– A_l, b_l: constraint matrices and values
• Challenge
– Formulating matrix factorization as an LRSDP
– Designing C, A_l, b_l
Matrix Factorization as LRSDP: Noiseless Case
• We want to formulate:

  min_{A,B} ||A||_F^2 + ||B||_F^2 subject to (AB^T)_{i,j} = M_{i,j} for (i,j) ∈ Ω

• As: min_R C • RR^T subject to A_l • RR^T = b_l, l = 1, 2, ..., |Ω|
• LRSDP formulation: take R = [A; B] (A, with m rows, stacked on top of B), so that
– ||A||_F^2 = trace(AA^T), ||B||_F^2 = trace(BB^T)
– ||A||_F^2 + ||B||_F^2 = trace(RR^T)
– (AB^T)_{i,j} = M_{i,j} ⟺ (RR^T)_{i, j+m} = M_{i,j}
– So C is the identity matrix and each A_l is an indicator matrix (checked numerically in the sketch below)
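A quick numerical check of the two identities used in this formulation (sizes are arbitrary):

```python
import numpy as np

m, n, r = 5, 4, 2                            # illustrative sizes
A = np.random.rand(m, r)
B = np.random.rand(n, r)
R = np.vstack([A, B])                        # R stacks A on top of B
G = R @ R.T

# trace(RR^T) = ||A||_F^2 + ||B||_F^2
assert np.isclose(np.trace(G), np.sum(A**2) + np.sum(B**2))

# the A B^T block sits off-diagonal: (RR^T)[i, m + j] = (A B^T)[i, j]
assert np.allclose(G[:m, m:], A @ B.T)
```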
Affine SfM
• Dinosaur sequence (72% missing data)
• MF-LRSDP gives the best reconstruction
Photometric Stereo
• Face sequence (42% missing data)
• MF-LRSDP and damped Newton give the best results
Additional Constraints: Orthographic Factorization
• Dinosaur sequence
Summary
• Formulated missing-data matrix factorization as an LRSDP
– Large scale problems
– Handle additional constraints
• Overall summary
– Two statistical data models
• Regression in the presence of outliers
– Role of sparsity
• Matrix factorization in the presence of missing data
– Low-rank semidefinite program
Thank you! Questions?