Kernel Methods
Dept. Computer Science & Engineering, Shanghai Jiao Tong University

Outline
• One-Dimensional Kernel Smoothers
• Local Regression
• Local Likelihood
• Kernel Density Estimation
• Naive Bayes
• Radial Basis Functions
• Mixture Models and EM

One-Dimensional Kernel Smoothers
• k-NN: $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
• The 30-NN curve is bumpy, since $\hat f(x)$ is discontinuous in $x$: the average changes in a discrete way as points enter and leave the neighborhood.

One-Dimensional Kernel Smoothers
• Nadaraya-Watson kernel-weighted average:
  $\hat f(x_0) = \dfrac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$
• Epanechnikov quadratic kernel:
  $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right)$, where $D(t) = \tfrac{3}{4}(1 - t^2)$ if $|t| \le 1$ and $0$ otherwise.

One-Dimensional Kernel Smoothers
• More general kernel:
  $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{h_\lambda(x_0)}\right)$
  – $h_\lambda(x_0)$: width function that determines the width of the neighborhood at $x_0$.
  – For the quadratic kernel, $h_\lambda(x_0) = \lambda$ is constant: the bias is roughly constant, while the variance varies inversely with the local density.
  – For the k-NN kernel, $h_k(x_0) = |x_0 - x_{[k]}|$, with $x_{[k]}$ the $k$-th closest point to $x_0$: the variance is roughly constant, while the bias varies inversely with the local density.
  – The Epanechnikov kernel has compact support.

One-Dimensional Kernel Smoothers
• Three popular kernels for local smoothing, each of the form
  $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right)$:
• The Epanechnikov and tri-cube kernels have compact support, but the tri-cube has two continuous derivatives at the boundary of its support, while the Epanechnikov has none.
• The Gaussian kernel has infinite support.

Local Linear Regression
• Boundary issue:
  – Kernel smoothers can be badly biased on the boundaries of the domain, because of the asymmetry of the kernel in that region.
  – Fitting straight lines locally, rather than constants, removes this bias exactly to first order.

Local Linear Regression
• Locally weighted linear regression makes a first-order correction.
• Solve a separate weighted least squares problem at each target point $x_0$:
  $\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, [y_i - \alpha(x_0) - \beta(x_0)\, x_i]^2$
• The estimate: $\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$
• With $b(x)^T = (1, x)$, $B$ the $N \times 2$ regression matrix with $i$-th row $b(x_i)^T$, and $W(x_0) = \mathrm{diag}(K_\lambda(x_0, x_i))$ the $N \times N$ diagonal weight matrix:
  $\hat f(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$

Local Linear Regression
• The weights $l_i(x_0)$ combine the weighting kernel $K_\lambda(x_0, \cdot)$ and the least squares operations: the equivalent kernel.

Local Linear Regression
• The expansion for $E\hat f(x_0)$, using the linearity of local regression and a series expansion of the true function $f$ around $x_0$:
  $E\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0) f(x_i) = f(x_0) \sum_{i=1}^{N} l_i(x_0) + f'(x_0) \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \dfrac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$
• For local linear regression, $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$.
• Hence the bias $E\hat f(x_0) - f(x_0)$ depends only on quadratic and higher-order terms in the expansion of $f$.

Local Polynomial Regression
• Fit local polynomials of any degree $d$:
  $\min_{\alpha(x_0),\, \beta_j(x_0),\, j = 1, \dots, d} \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j \right]^2$
  $\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^j$

Local Polynomial Regression
• The bias only has components of degree $d+1$ and higher.
• The reduction in bias comes at the cost of increased variance:
  $\mathrm{Var}(\hat f(x_0)) = \sigma^2 \|l(x_0)\|^2$, and $\|l(x_0)\|$ increases with $d$.
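A minimal sketch of the two smoothers above, assuming Python/NumPy; the function names (epanechnikov, nadaraya_watson, local_linear) and the simulated data are illustrative, not from the slides:

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average: sum K*y / sum K."""
    w = epanechnikov((x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    """Weighted least squares at x0, i.e. b(x0)^T (B^T W B)^{-1} B^T W y."""
    w = epanechnikov((x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])   # N x 2 regression matrix
    W = np.diag(w)                              # N x N kernel weight matrix
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return theta[0] + theta[1] * x0             # alpha(x0) + beta(x0) * x0

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(0, 0.3, 100)
print(nadaraya_watson(0.05, x, y, 0.2))  # biased near the left boundary
print(local_linear(0.05, x, y, 0.2))     # first-order boundary correction
```

Near the boundary the Nadaraya-Watson average is pulled toward interior points, while the local linear fit corrects this to first order, as in the boundary discussion above.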
Selecting the Width of the Kernel
• In the kernel $K_\lambda$, $\lambda$ is the parameter that controls the width:
  – For a kernel with compact support, $\lambda$ is the radius of the support region.
  – For the Gaussian kernel, $\lambda$ is the standard deviation.
  – For k-NN, $\lambda$ is the fraction $k/N$ of the sample.
• The window width drives a bias-variance tradeoff:
  – A narrow window gives high variance and low bias.
  – A wide window gives low variance and high bias.

Structured Local Regression
• Structured kernels:
  $K_{\lambda, A}(x_0, x) = D\!\left(\dfrac{(x - x_0)^T A\, (x - x_0)}{\lambda}\right)$
  – Introduce structure by imposing appropriate restrictions on $A$.
• Structured regression functions:
  $f(X_1, X_2, \dots, X_p) = \alpha + \sum_{j} g_j(X_j) + \sum_{k < l} g_{kl}(X_k, X_l) + \cdots$
  – Introduce structure by eliminating some of the higher-order terms.

Local Likelihood & Other Models
• Any parametric model can be made local:
  – Parameter associated with $y_i$: $\theta_i = \theta(x_i) = x_i^T \beta$
  – Log-likelihood: $l(\beta) = \sum_{i=1}^{N} l(y_i, x_i^T \beta)$
  – Inference for $\theta(X)$ local to $x_0$ uses the local log-likelihood:
    $l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l(y_i, x_i^T \beta(x_0))$
  – A varying coefficient model $\theta(z)$:
    $l(\theta(z_0)) = \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, l(y_i, \eta(x_i, \theta(z_0)))$, e.g. $\eta(x, \theta) = x^T \theta$

Local Likelihood & Other Models
• Logistic regression:
  $\Pr(G = j \mid X = x) = \dfrac{\exp(\beta_{j0} + \beta_j^T x)}{1 + \sum_{k=1}^{J-1} \exp(\beta_{k0} + \beta_k^T x)}$
• Local log-likelihood for the $J$-class model:
  $\sum_{i=1}^{N} K_\lambda(x_0, x_i) \left\{ \beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0) - \log\!\left[ 1 + \sum_{k=1}^{J-1} \exp\!\left( \beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0) \right) \right] \right\}$
• Because the local regressions are centered at $x_0$, the posterior estimate there involves only the fitted intercepts (a single-point Newton sketch appears after the Naïve Bayes section below):
  $\hat{\Pr}(G = j \mid X = x_0) = \dfrac{\exp(\hat\beta_{j0}(x_0))}{1 + \sum_{k=1}^{J-1} \exp(\hat\beta_{k0}(x_0))}$

Kernel Density Estimation
• A natural local estimate:
  $\hat f_X(x_0) = \dfrac{\#\{x_i \in N(x_0)\}}{N\lambda}$
• The smooth Parzen estimate:
  $\hat f_X(x_0) = \dfrac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$
  – For the Gaussian kernel, $K_\lambda(x_0, x_i) = \phi(|x_i - x_0| / \lambda)$.
  – The estimate becomes an average of Gaussian bumps; in $\mathbb{R}^p$,
    $\hat f_X(x) = \dfrac{1}{N} \sum_{i=1}^{N} \phi_\lambda(x - x_i) = \dfrac{1}{N (2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} \exp\!\left( -\tfrac{1}{2} \left( \|x_i - x\| / \lambda \right)^2 \right)$

Kernel Density Estimation
• Figure: a kernel density estimate for systolic blood pressure. The density estimate at each point is the average contribution from each of the kernels at that point.

Kernel Density Classification
• Bayes' theorem combines per-class density estimates $\hat f_j$ with class prior estimates $\hat\pi_j$:
  $\hat{\Pr}(G = j \mid X = x_0) = \dfrac{\hat\pi_j \hat f_j(x_0)}{\sum_{k=1}^{J} \hat\pi_k \hat f_k(x_0)}$
• The estimate for CHD uses the tri-cube kernel with a k-NN bandwidth.

Kernel Density Classification
• Figure: the population class densities and the posterior probabilities.

Naïve Bayes
• The naïve Bayes model assumes that, given a class $G = j$, the features $X_k$ are independent:
  $f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k)$
  – Each $\hat f_{jk}(X_k)$ is a one-dimensional kernel density estimate, or a Gaussian estimate, for coordinate $X_k$ in class $j$.
  – If $X_k$ is categorical, use a histogram.
• The logit then has the form of a generalized additive model:
  $\log \dfrac{\Pr(G = \ell \mid X)}{\Pr(G = J \mid X)} = \log \dfrac{\pi_\ell f_\ell(X)}{\pi_J f_J(X)} = \log \dfrac{\pi_\ell}{\pi_J} + \sum_{k=1}^{p} \log \dfrac{f_{\ell k}(X_k)}{f_{Jk}(X_k)} = \alpha_\ell + \sum_{k=1}^{p} g_{\ell k}(X_k)$
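A minimal sketch combining the last two sections, assuming Python/NumPy: per-coordinate Parzen estimates $\hat f_{jk}$ plugged into Bayes' theorem. It swaps in a Gaussian kernel for the slides' tri-cube/k-NN choice; the helper names, bandwidth, and simulated two-class data are illustrative assumptions:

```python
import numpy as np

def kde_1d(x, data, lam):
    """Parzen estimate (1/(N*lam)) * sum K((x - x_i)/lam), Gaussian kernel."""
    u = (x - data[:, None]) / lam                      # shape (N, len(x))
    return np.mean(np.exp(-0.5 * u**2), axis=0) / (lam * np.sqrt(2 * np.pi))

def naive_bayes_posterior(x, X_by_class, priors, lam):
    """P(G=j | x) proportional to pi_j * prod_k f_jk(x_k), via 1-D KDEs."""
    scores = []
    for X_j, pi_j in zip(X_by_class, priors):
        dens = [kde_1d(np.array([x[k]]), X_j[:, k], lam)[0]
                for k in range(X_j.shape[1])]
        scores.append(pi_j * np.prod(dens))            # independence assumption
    scores = np.array(scores)
    return scores / scores.sum()                       # Bayes' theorem

rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 1.0, size=(200, 2))            # class 0 sample
X1 = rng.normal([2, 1], 1.0, size=(200, 2))            # class 1 sample
print(naive_bayes_posterior([1.8, 0.9], [X0, X1], [0.5, 0.5], lam=0.5))
```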
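Back to the local likelihood section: the Newton-step sketch referenced there, for two-class local logistic regression at a single target point, assuming Python/NumPy. The Gaussian kernel weights and simulated labels are illustrative assumptions, not the slides' prescription:

```python
import numpy as np

def local_logistic(x0, x, y, lam, n_iter=20):
    """Maximize the kernel-weighted log-likelihood at x0 by Newton steps;
    with a centered design, the fitted intercept gives Pr(G=1 | X=x0)."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)        # Gaussian kernel weights
    B = np.column_stack([np.ones_like(x), x - x0])  # centered design (x_i - x0)
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-B @ beta))
        g = B.T @ (w * (y - p))                     # weighted score
        H = B.T @ (B * (w * p * (1 - p))[:, None])  # weighted Hessian
        beta += np.linalg.solve(H, g)
    return 1.0 / (1.0 + np.exp(-beta[0]))           # posterior at x0

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 300)
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-(4 * x - 2)))).astype(float)
print(local_logistic(0.5, x, y, lam=0.15))          # approximately 0.5
```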
Radial Basis Functions & Kernels
• Radial basis functions combine the local behavior of kernel methods with the flexibility of basis expansions:
  $f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^{M} D\!\left( \dfrac{\|x - \xi_j\|}{\lambda_j} \right) \beta_j$
  – Each basis element is indexed by a location or prototype parameter $\xi_j$ and a scale parameter $\lambda_j$.
  – A popular choice for $D$ is the standard Gaussian density function.

Radial Basis Functions & Kernels
• For simplicity, focus on least squares methods for regression, and use the Gaussian kernel.
• The RBF network model:
  $\min_{\{\lambda_j, \xi_j, \beta_j\}_{1}^{M}} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j \exp\!\left\{ -\dfrac{(x_i - \xi_j)^T (x_i - \xi_j)}{2\lambda_j^2} \right\} \right)^2$
• A common simplification is to estimate the $\{\lambda_j, \xi_j\}$ separately from the $\{\beta_j\}$ (a least-squares sketch appears at the end of this deck).
• An undesirable side effect of fixed widths is the creation of holes: regions of $\mathbb{R}^p$ where none of the kernels has appreciable support.

Radial Basis Functions & Kernels
• Renormalized radial basis functions:
  $h_j(x) = \dfrac{D(\|x - \xi_j\| / \lambda)}{\sum_{k=1}^{M} D(\|x - \xi_k\| / \lambda)}$
• The Nadaraya-Watson kernel regression estimator can be viewed as an expansion in renormalized radial basis functions:
  $\hat f(x_0) = \sum_{i=1}^{N} y_i \dfrac{K_\lambda(x_0, x_i)}{\sum_{j=1}^{N} K_\lambda(x_0, x_j)} = \sum_{i=1}^{N} y_i\, h_i(x_0)$
• Gaussian radial basis functions with fixed width can leave holes; renormalized Gaussian radial basis functions produce basis functions similar in some respects to B-splines.

Mixture Models & EM
• Gaussian mixture model:
  $f(x) = \sum_{m=1}^{M} \alpha_m\, \phi(x; \mu_m, \Sigma_m)$
  – The $\alpha_m$ are mixture proportions, with $\sum_{m=1}^{M} \alpha_m = 1$.
• EM algorithm for a two-component mixture:
  – Given $x_1, x_2, \dots, x_N$, the observed-data log-likelihood
    $l(y, \theta) = \sum_{i=1}^{N} \log\left[ \pi \phi_1(x_i) + (1 - \pi) \phi_2(x_i) \right]$
    is hard to maximize directly, since the sum sits inside the logarithm ("bad").
  – Suppose instead we observe latent binary $z_i$ such that $z_i = 1 \Rightarrow x_i \sim \phi_1$ and $z_i = 0 \Rightarrow x_i \sim \phi_2$. The complete-data log-likelihood
    $L(x, z, \theta) = \sum_{i:\, z_i = 1} \log\left[ \pi \phi_1(x_i) \right] + \sum_{i:\, z_i = 0} \log\left[ (1 - \pi) \phi_2(x_i) \right]$
    separates the two components ("good").

Mixture Models & EM
• Given $\theta^0$, compute
  $Q(\tilde\theta, \theta^0) = E\left[ L(x, z, \tilde\theta) \mid \theta^0, y \right]$ (E-step), then maximize $Q(\tilde\theta, \theta^0)$ over $\tilde\theta$ (M-step).
• In the two-component example, the responsibilities are
  $w_i = E(z_i \mid x_i, \hat\theta) = \dfrac{\hat\pi \hat\phi_1(x_i)}{\hat\pi \hat\phi_1(x_i) + (1 - \hat\pi) \hat\phi_2(x_i)}$
  and the expected complete-data log-likelihood becomes
  $Q(\theta) = \sum_{i=1}^{N} \left\{ w_i \log\left[ \pi \phi_1(x_i) \right] + (1 - w_i) \log\left[ (1 - \pi) \phi_2(x_i) \right] \right\}$

Mixture Models & EM
• Figure: application of mixtures to the heart disease risk factor study.

Mixture Models & EM
• Figure: mixture model used for classification of the simulated data.
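Returning to the RBF network slide, here is the least-squares sketch referenced there, assuming Python/NumPy: prototypes $\xi_j$ fixed by a simple unsupervised choice (quantiles of the inputs, an illustrative assumption, not the slides' prescription), a common fixed width $\lambda$, and the $\beta_j$ then fit by ordinary least squares:

```python
import numpy as np

def gaussian_basis(x, centers, lam):
    """Columns exp(-(x - xi_j)^2 / (2 lam^2)) per prototype, plus intercept."""
    Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / lam) ** 2)
    return np.column_stack([np.ones_like(x), Phi])

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(8 * x) + rng.normal(0, 0.2, 200)

centers = np.quantile(x, np.linspace(0.05, 0.95, 10))   # prototypes xi_j
Phi = gaussian_basis(x, centers, lam=0.1)
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least squares for beta_j
y_hat = gaussian_basis(np.array([0.5]), centers, 0.1) @ beta
print(y_hat)
```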
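Finally, a minimal sketch of the EM iteration above for a two-component univariate Gaussian mixture, assuming Python/NumPy; the initialization and the fixed iteration count are illustrative choices:

```python
import numpy as np

def phi(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=50):
    pi, mu1, mu2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()
    for _ in range(n_iter):
        # E-step: responsibilities w_i = E(z_i | x_i, theta)
        w = pi * phi(x, mu1, s1) / (
            pi * phi(x, mu1, s1) + (1 - pi) * phi(x, mu2, s2))
        # M-step: weighted means/variances and pi maximize Q(theta)
        mu1, mu2 = np.sum(w * x) / w.sum(), np.sum((1 - w) * x) / (1 - w).sum()
        s1 = np.sqrt(np.sum(w * (x - mu1) ** 2) / w.sum())
        s2 = np.sqrt(np.sum((1 - w) * (x - mu2) ** 2) / (1 - w).sum())
        pi = w.mean()
    return pi, (mu1, s1), (mu2, s2)

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_two_gaussians(x))   # recovers roughly pi=0.6, (-2, 1), (3, 0.5)
```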