Sharp Piece-wise Linear Smoothing and Deconvolution with an L0 Penalty

Paul Eilers, Erasmus Medical Center, Rotterdam, The Netherlands
Ralph Rippe, Leiden University Medical Center, Leiden, The Netherlands

The University of Warwick, March 2012

Data from genetics: DNA copy numbers in tumors

[Figure: log2(CNV signal) against position on the chromosome (Mbase), for arrays GBM 139.CEL and GBM 203-2.CEL]

Biological background, simplified

• We have two copies of each chromosome
• Most of our DNA is identical for everyone
• But not at millions of SNPs (single nucleotide polymorphisms)
• A SNP has two possible states, called alleles, say A and B
• Possible combinations (genotypes): AA, AB, BB
• Microarrays allow detection of the states of very many SNPs
• Using selective hybridization and fluorescence technology
• For each SNP we get two signals, a for A, b for B

Copy number variations

• Two signals, a proportional to A, b proportional to B
• Signal b is 0, c, 2c for AA, AB, BB genotypes
• Signal a is 2c, c, 0 for AA, AB, BB genotypes
• So a + b = 2c (save for noise, background, non-linearities, ...)
• Normal DNA is quite boring, but tumor DNA is not
• DNA segments may be deleted or multiplied (amplified)
• So we can have A, B, AA, AB, BB, AAA, AAB, ABB, and so on
• Copy number variations (CNV); a + b ≠ 2c will reflect that

CNV in normal (top) and tumor (bottom) DNA

[Figure: log2(CNV signal) for a normal array (ctr aff 1.CEL) and a tumor array (GBM 203-2.CEL)]

A simple smoother: Whittaker

• Whittaker (1923) proposed "graduation": minimize

  S2 = Σi (yi − zi)² + λ Σi (Δᵈzi)²

• Given a noisy data series y, it finds a smoother series z
• The operator Δᵈ forms differences of order d
• Today we call this penalized least squares
• Explicit solution, with a matrix D such that Δᵈz = Dz:

  (I + λD'D) ẑ = y

The Whittaker smoother (d = 2) in action

[Figure: Whittaker smooths of arrays GBM 139.CEL and GBM 203-2.CEL]

Critique of the Whittaker smoother, and a solution

• Noise is effectively reduced
• But jumps are rounded
• An alternative approach, inspired by the LASSO, minimizes

  S1 = Σi (yi − zi)² + λ Σi |Δzi|

• Total variation penalty, or "fused LASSO"
• Notice the first differences
• The L1 norm (instead of the L2 norm) in the penalty
• It is a big improvement

The L1 penalty in action

[Figure: L1-penalized smooths of arrays GBM 139.CEL and GBM 203-2.CEL]

Computation for the L1 penalty

• We could try quadratic programming techniques
• But there is an easier solution
• For any x and approximation x̃ we have |x| = x²/|x| ≈ x²/|x̃|
• Use a weighted L2 penalty, with vi = 1/|Δz̃i|:

  S1 = Σi (yi − zi)² + λ Σi vi (Δzi)²

• Iteratively update v and z̃
• Solve (I + λD'VD) z = y repeatedly, with V = diag(v)
• Some smoothing near 0: use vi = 1/√((Δz̃i)² + β²), with small β

Critique of the L1 penalty

• We certainly get a big improvement
• But the jumps are not completely crisp
• Solution: a penalty with (implicitly) the L0 norm
• Iterate as above, with the weighted quadratic penalty

  S0 = Σi (yi − zi)² + λ Σi vi (Δzi)²

• With vi = 1/((Δz̃i)² + β²) instead of vi = 1/√((Δz̃i)² + β²) (a code sketch follows)
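The iteration on the last two slides is only a few lines of R. Below is a minimal sketch, not the SCALA implementation: the function name lp_smooth and its defaults are our own, and it uses the spam package that appears later in the talk. Setting p = 1 gives the L1 weights, p = 0 the L0 weights.

library(spam)

# Iteratively reweighted smoothing with an Lp penalty on differences
# (p = 1: total variation; p = 0: the L0 penalty of these slides).
# A minimal sketch; lp_smooth and its defaults are ours, not SCALA's.
lp_smooth = function(y, lambda, p = 0, d = 1, beta = 1e-4,
                     maxit = 100, tol = 1e-6) {
  m = length(y)
  E = diag.spam(m)
  D = diff(E, differences = d)             # difference operator of order d
  z = solve(E + lambda * t(D) %*% D, y)    # first pass with weights v = 1
  for (it in 1:maxit) {
    dz = as.vector(D %*% z)                # current jump sizes
    v = (dz^2 + beta^2)^((p - 2) / 2)      # Lp weights; p = 0 or p = 1
    znew = solve(E + lambda * t(D) %*% diag.spam(v) %*% D, y)
    if (max(abs(znew - z)) < tol) {z = znew; break}
    z = znew
  }
  as.vector(z)
}

With d = 1 this gives the piecewise-constant fits shown on these slides; d = 2 gives the piecewise-linear variant that appears later.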
The L0 penalty in action

[Figure: L0-penalized smooths of arrays GBM 139.CEL and GBM 203-2.CEL]

Sparseness

• We have many equations (4032 here), but a banded system
• Computation time is linear in the data length
• The R package spam is great (sparse matrices, Matlab-style)
• The L2 system is solved in 20 milliseconds

  # Whittaker smoother
  m = length(y)
  E = diag.spam(m)
  D = diff(E, diff = 2)
  P = lambda * t(D) %*% D
  z = solve(E + P, y)

Optimal smoothing

• Interactive use in the hands of biologists is our main goal
• But we can try "optimal" smoothing, using AIC
• AIC = 2m log σ̂ + 2·ED
• With σ̂² = Σ(yi − ẑi)²/m (the ML estimate of the error variance)
• Effective dimension ED
• ED = tr{(I + λD'VD)⁻¹}, with V = diag(v)
• Serial correlation in the errors can spoil AIC (undersmoothing)

Computational details for AIC

• We need to compute ED = tr{(I + λD'VD)⁻¹}
• But this is a large (4032 × 4032) matrix
• Beautiful solution: the Hutchinson and de Hoog algorithm
• But not yet implemented (see the sketch below)
• For now: use 400 intervals and an indicator basis R = [rij]:

  S = Σi (yi − Σj rij aj)² + λ Σj vj (Δaj)²

• Adaptive weights vj = 1/((Δãj)² + β²)
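Until the Hutchinson and de Hoog algorithm is in place, a dense computation of ED is still feasible for a few thousand observations. A hedged sketch, assuming the lp_smooth function from the earlier sketch and recomputing the L0 weights at convergence:

# AIC profile over a grid of lambdas; ED is taken from a dense hat matrix,
# which is fine for m in the low thousands. Assumes lp_smooth() above.
aic_profile = function(y, lambdas, d = 1, beta = 1e-4) {
  m = length(y)
  D = diff(diag(m), differences = d)     # dense difference operator
  sapply(lambdas, function(lambda) {
    z = lp_smooth(y, lambda, p = 0, d = d, beta = beta)
    v = as.vector(1 / ((D %*% z)^2 + beta^2))       # L0 weights at z-hat
    # v * D multiplies row i of D by v[i], i.e. diag(v) %*% D
    H = solve(diag(m) + lambda * t(D) %*% (v * D))  # hat matrix
    ed = sum(diag(H))                               # ED = tr(H)
    sigma2 = sum((y - z)^2) / m                     # ML variance estimate
    m * log(sigma2) + 2 * ed                        # 2m log(sigma) + 2 ED
  })
}

A call like aic_profile(y, 10^seq(-2, 0, by = 0.1)) traces profiles like the ones on the next two slides.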
AIC for one array

[Figure: array GBM 180.CEL with the chosen smooth, and the AIC profile against log10(lambda)]

AIC for the other array

[Figure: array GBM 203-2.CEL with the chosen smooth, and the AIC profile against log10(lambda)]

Tests on simulated data

• We did some tests on simulated data
• Using results from the Bioconductor package VEGA
• Starting with adaptive weights vi ≡ 1
• The graphs show intermediate results
• And the size of the changes per iteration
• The smoother starts with a "wild" solution
• It gradually removes more and more detail

Example of convergence, little noise

[Figure: fits at iterations 1, 2, 5, 10 and 20, plus log10(dz) per iteration]

Example of convergence, more noise

[Figure: fits at iterations 1, 2, 5, 10 and 20, plus log10(dz) per iteration]

Can we trust it?

• The objective function is non-convex
• In contrast to penalties with the L2 or L1 norm
• Yet we consistently get quite good results
• The convergence history looks OK
• We cannot be sure that we found a global minimum
• But should we care?

A variation on our theme

• Using first order differences in an L0 penalty we get
  – constant segments
  – jumps between segments
• With second order differences we get
  – linear segments
  – kinks between segments

A kinky simulation

[Figure: smoothing with d = 2 and an L0 penalty (top) versus an L1 penalty (bottom)]

Non-normal data

• We are not limited to sums of squares
• The L0 penalty can be combined with any log-likelihood
• Example: the Poisson distribution
• Observed y, expected values μ = exp(η)
• Extend the IWLS algorithm of the GLM with an L0 penalty on Δη
• No technical complications
• Example: the famous coal mining disaster time series

Smoothing of coal mining data with L2 penalty

[Figure: coal mining disasters per year, 1860–1960, with L2 smooth, and the AIC profile against log10(lambda)]

Smoothing of coal mining data with L1 penalty

[Figure: coal mining data with L1 smooths, and the AIC profile]

Smoothing of coal mining data with L0 penalty

[Figure: coal mining data with L0 smooths, and the AIC profile]

Deconvolution

• We observed a "crisp" signal plus noise
• Sometimes the signal has been filtered before it was observed
• A crisp input, run through a filter
• So we have penalized deconvolution
• This is not hard to do
• Model: y = Cx + e
• Regression with a penalty on x (a sketch follows below)
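A minimal sketch of this deconvolution for the step-input case on the next slides: build a banded convolution matrix C from an impulse response g and put the reweighted L0 penalty on the first differences of x. The function name, g, and all settings are illustrative assumptions, not the SCALA code.

# L0-penalized deconvolution: minimize |y - Cx|^2 + lambda sum_i v_i (dx_i)^2,
# reweighting v as in the smoother. g is an assumed impulse response;
# the whole function is an illustrative sketch.
deconv_l0 = function(y, g, lambda, beta = 1e-4, maxit = 100, tol = 1e-6) {
  m = length(y)
  C = matrix(0, m, m)                     # banded convolution matrix
  for (j in 1:m) {
    i = j:min(m, j + length(g) - 1)
    C[i, j] = g[1:length(i)]
  }
  D = diff(diag(m))                       # first differences
  CtC = t(C) %*% C
  Cty = t(C) %*% y
  x = as.vector(solve(CtC + lambda * t(D) %*% D, Cty))   # start with v = 1
  for (it in 1:maxit) {
    v = as.vector(1 / ((D %*% x)^2 + beta^2))            # L0 weights
    # v * D multiplies row i of D by v[i], i.e. diag(v) %*% D
    xnew = as.vector(solve(CtC + lambda * t(D) %*% (v * D), Cty))
    if (max(abs(xnew - x)) < tol) {x = xnew; break}
    x = xnew
  }
  x
}

For example, a hypothetical exponential impulse response g = dexp(0:25, rate = 0.2) produces smeared steps like those on the next slide.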
Illustrating convolution with step input

[Figure: a piecewise-constant input signal (top) and its filtered output (bottom)]

Deconvolution with step input

[Figure: input and output (top); output and estimated input (bottom)]

Another application: spike deconvolution

• Many (technical) signals consist of peaks
• Often they can be described by convolution
• Spikes as input, convolved with a peaked impulse response
• Penalized regression: minimize S = ||y − Cx||² + κ||x||p
• Ridge regression, p = 2, is no use
• The LASSO, p = 1, is an improvement
• But p = 0 works best (a sketch follows at the end)

Spike deconvolution (simulated data)

[Figure: data, components, and fit; deconvolution with an L1 penalty (κ = 0.01) and with an L0 penalty (κ = 0.003)]

Summary

• We can implement the L0 norm as a weighted L2 norm
• We get surprisingly good results in segmentation and deconvolution
• The non-convex objective function seems no problem in practice
• Fast, sparse computations, linear in the data length
• Easily combined with any likelihood
• SCALA software for CNV smoothing
• And for much more

SCALA software (r.c.a.rippe@lumc.nl)

[Figure: screenshot of the SCALA software]
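For completeness, a sketch of the spike variant from the deconvolution slides: the same reweighting trick, with the penalty now on x itself, so vi = 1/(x̃i² + β²). The function name and defaults are again our own, not part of SCALA.

# Spike deconvolution: minimize |y - Cx|^2 + kappa |x|_0, implemented as
# reweighted ridge regression with v = 1 / (x^2 + beta^2). C is a
# convolution matrix, e.g. built as in the step-input sketch.
spike_deconv = function(y, C, kappa, beta = 1e-4, maxit = 100, tol = 1e-6) {
  CtC = t(C) %*% C
  Cty = t(C) %*% y
  x = as.vector(solve(CtC + kappa * diag(ncol(C)), Cty))  # ridge start, v = 1
  for (it in 1:maxit) {
    v = 1 / (x^2 + beta^2)                                # L0 weights on x
    xnew = as.vector(solve(CtC + kappa * diag(v), Cty))
    if (max(abs(xnew - x)) < tol) {x = xnew; break}
    x = xnew
  }
  x
}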