Sharp Piece-wise Linear Smoothing and Deconvolution with an L0 Penalty
Paul Eilers¹ and Ralph Rippe²
¹Erasmus Medical Center, Rotterdam, The Netherlands
²Leiden University Medical Center, Leiden, The Netherlands
The University of Warwick, March 2012
Data from genetics: DNA copy numbers in tumors
[Figure: log2(CNV signal) vs. position on chromosome (Mbase) for arrays GBM 139.CEL and GBM 203−2.CEL]
Biological background, simplified
• We have two copies of each chromosome
• Most of our DNA is identical for everyone
• But not at millions of SNPs (single nucleotide polymorphisms)
• A SNP has two possible states, called alleles, say A and B
• Possible combinations (genotypes): AA, AB, BB
• Microarrays allow detection of the states of very many SNPs
• Using selective hybridization and fluorescence technology
• For each SNP we get two signals, a for A, b for B
Copy number variations
• Two signals, a proportional to A, b proportional to B
• Signal b is 0, c, 2c for AA, AB, BB genotypes
• Signal a is 2c, c, 0 for AA, AB, BB genotypes
• So a + b = 2c (save for noise, background, non-linearities, ...)
• Normal DNA is quite boring, but tumor DNA is not
• DNA segments may be deleted or multiplied (amplified)
• So we can have A, B, AA, AB, BB, AAA, AAB, ABB, and so on
• Copy number changes (CNV); a + b ≠ 2c will reflect that
CNV in normal (top) and tumor (bottom) DNA
[Figure: log2(CNV signal) vs. position on chromosome (Mbase); arrays ctr aff 1.CEL (normal) and GBM 203−2.CEL (tumor)]
A simple smoother: Whittaker
• Whittaker (1923) proposed “graduation”: minimize
S₂ = ∑ᵢ (yᵢ − zᵢ)² + λ ∑ᵢ (∆ᵈ zᵢ)²
• Given a noisy data series y, it finds a smoother series z
• Operator ∆d forms differences of order d
• Today we call this penalized least squares
• Explicit solution, with matrix D such that ∆ᵈz = Dz:
(I + λD′D) ẑ = y
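As a quick dense base-R illustration of this explicit solution (a sketch only; y and lambda assumed given — the sparse implementation appears later on the Sparseness slide):

m <- length(y)
D <- diff(diag(m), differences = 2)            # second-order difference matrix
z <- solve(diag(m) + lambda * t(D) %*% D, y)   # solves (I + λD′D) ẑ = y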
The Whittaker smoother (d = 2) in action
[Figure: Whittaker smooth (d = 2) of arrays GBM 139.CEL and GBM 203−2.CEL; log2(CNV signal) vs. position on chromosome (Mbase)]
Critique of the Whittaker smoother, and a solution
• Noise is effectively reduced
• But jumps are rounded
• An alternative approach, inspired by the LASSO, minimizes
S₁ = ∑ᵢ (yᵢ − zᵢ)² + λ ∑ᵢ |∆zᵢ|
• Total variation penalty or “fused LASSO”
• Notice the first differences
• The L1 norm (instead of the L2 norm) in the penalty
• It is a big improvement
The L1 penalty in action
[Figure: L1 (fused LASSO) smooth of arrays GBM 139.CEL and GBM 203−2.CEL; log2(CNV signal) vs. position on chromosome (Mbase)]
Computation for the L1 penalty
• We could try quadratic programming techniques
• But there is an easier solution
• For any x and approximation x̃ we have |x| = x²/|x| ≈ x²/|x̃|
• Use a weighted L2 penalty, with vᵢ = 1/|∆z̃ᵢ|:
S₁ = ∑ᵢ (yᵢ − zᵢ)² + λ ∑ᵢ vᵢ (∆zᵢ)²
• Iteratively update v and z̃
• Solve (I + λD′VD) z = y repeatedly, with V = diag(v)
• Some smoothing near 0: use vᵢ = 1/√((∆z̃ᵢ)² + β²), with small β — a sketch follows
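A minimal dense sketch of this iteration in base R (y, lambda, and beta assumed given; illustrative only, not the authors' implementation):

m <- length(y)
D <- diff(diag(m))                    # first-order difference matrix
z <- y                                # start from the data
for (it in 1:30) {
  v <- 1 / sqrt(diff(z)^2 + beta^2)   # v_i ≈ 1/|∆z_i|, smoothed near zero
  P <- lambda * t(D) %*% (v * D)      # weighted penalty λD′VD
  z <- solve(diag(m) + P, y)          # solves (I + λD′VD) z = y
}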
Critique of the L1 penalty
• We certainly get a big improvement
• But the jumps are not completely crisp
• Solution: a penalty with (implicitly) the L0 norm
• Iterate as above with weighted quadratic penalty
S₀ = ∑ᵢ (yᵢ − zᵢ)² + λ ∑ᵢ vᵢ (∆zᵢ)²
• With vᵢ = 1/((∆z̃ᵢ)² + β²) instead of vᵢ = 1/√((∆z̃ᵢ)² + β²)
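Only the weight update changes relative to the L1 sketch above; as a hedged snippet:

v <- 1 / (diff(z)^2 + beta^2)   # L0 weights: squared denominator, no square root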
The L0 penalty in action
[Figure: L0 smooth of arrays GBM 139.CEL and GBM 203−2.CEL; log2(CNV signal) vs. position on chromosome (Mbase)]
Sparseness
• We have many equations (4032 here), but a banded system
• Computation time linear in data length
• R package spam is great (sparse matrices, Matlab-style)
• L2 system solved in 20 milliseconds
library(spam)  # sparse matrices, Matlab-style

# Whittaker smoother
m = length(y)
E = diag.spam(m)
D = diff(E, diff = 2)
P = lambda * t(D) %*% D
z = solve(E + P, y)
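The adaptive weights drop into the same sparse machinery; a hedged sketch of one reweighted update with spam (v and lambda assumed available from the iterations above):

D1 <- diff(diag.spam(m))                       # first differences for segmentation
P1 <- lambda * t(D1) %*% diag.spam(v) %*% D1   # weighted penalty, still banded
z  <- solve(diag.spam(m) + P1, y)              # sparse banded solve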
Optimal smoothing
• Interactive use in the hands of biologists is our main goal
• But we can try “optimal” smoothing, using AIC
• AIC = 2m log σ̂ + 2 ED
• With σ̂² = ∑ᵢ (yᵢ − ẑᵢ)²/m (ML estimate of the error SD)
• Effective dimension ED
• ED = tr[(I + λD′VD)⁻¹], the trace of the hat matrix, with V = diag(v)
• Serial correlation in errors can spoil AIC (undersmoothing)
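A dense sketch of this AIC computation (illustrative only; reuses y, lambda, the first-difference matrix D, and converged weights v from the sketches above):

H <- solve(diag(m) + lambda * t(D) %*% (v * D))   # hat matrix (I + λD′VD)⁻¹
z <- drop(H %*% y)
ED <- sum(diag(H))                                # effective dimension, tr(H)
sigma <- sqrt(mean((y - z)^2))                    # ML estimate of the error SD
aic <- 2 * m * log(sigma) + 2 * ED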
Computational details for AIC
• We need to compute ED = tr[(I + λD′VD)⁻¹], with V = diag(v)
• But this involves a large (4032 × 4032) matrix
• Beautiful solution: Hutchinson and De Hoog algorithm
• But not yet implemented
• For now: use 400 intervals and indicator basis R
S = ∑ᵢ (yᵢ − ∑ⱼ rᵢⱼ aⱼ)² + λ ∑ⱼ vⱼ (∆aⱼ)²
• Adaptive weights vⱼ = 1/((∆ãⱼ)² + β²) — sketched below
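A hedged base-R sketch of this interval device (nseg, lambda, beta illustrative; the indicator basis R maps each observation to its interval):

nseg <- 400
ix <- cut(seq_along(y), nseg, labels = FALSE)   # interval index per observation
R <- outer(ix, 1:nseg, "==") * 1                # m x nseg indicator basis
Dn <- diff(diag(nseg))                          # first differences of the coefficients
a <- rep(mean(y), nseg)
for (it in 1:30) {
  v <- 1 / (diff(a)^2 + beta^2)                 # adaptive L0 weights
  P <- lambda * t(Dn) %*% (v * Dn)
  a <- drop(solve(t(R) %*% R + P, t(R) %*% y))
}
z <- drop(R %*% a)                              # fitted piecewise-constant signal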
AIC for one array
[Figure: optimal L0 smooth of array GBM 180.CEL (top) and AIC profile vs. log10(lambda) (bottom)]
AIC for the other array
[Figure: optimal L0 smooth of array GBM 203−2.CEL (top) and AIC profile vs. log10(lambda) (bottom)]
Tests on simulated data
• We did some tests on simulated data
• Using results in the Bioconductor package VEGA
• Starting with adaptive weights vi ≡ 1
• Graphs will show intermediate results
• And the size of changes per iteration
• The smoother starts with a “wild” solution
• It gradually removes more and more details
Example of convergence, little noise
[Figure: low-noise simulation; intermediate solutions at iterations 1, 2, 5, 10, and 20, plus log10 of the change per iteration (convergence)]
Example of convergence, more noise
[Figure: higher-noise simulation; intermediate solutions at iterations 1, 2, 5, 10, and 20, plus log10 of the change per iteration (convergence)]
Can we trust it?
• The objective function is non-convex
• In contrast to penalties with L2 or L1 norm
• Yet we consistently get quite good results
• Convergence history looks OK
• We cannot be sure that we found a global minimum
• But should we care?
A variation on our theme
• Using first-order differences in an L0 penalty we get
– constant segments
– jumps between segments
• With second-order differences we get
– linear segments
– kinks between segments (see the sketch below)
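In the sketches above only the order of the differences changes; a hedged one-liner:

D <- diff(diag(m), differences = 2)   # second differences: penalize kinks, yielding linear segments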
A kinky simulation
[Figure: simulated piecewise-linear data smoothed with d = 2 and an L0 penalty (top) versus an L1 penalty (bottom)]
Non-normal data
• We are not limited to a sum of squares
• The L0 penalty can be combined with any log-likelihood
• Example: Poisson distribution
• Observed counts y, expected values μ = e^η
• Extend the IWLS algorithm of the GLM with an L0 penalty on ∆η
• No technical complications
• Example: the famous coal mining disaster time series
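A dense base-R sketch of this penalized IWLS (lambda and beta illustrative; not the authors' implementation):

m <- length(y)
D <- diff(diag(m))                        # first-order difference matrix
eta <- log(y + 1)                         # crude starting values for log(mu)
for (it in 1:50) {
  mu <- exp(eta)
  v <- 1 / (diff(eta)^2 + beta^2)         # adaptive L0 weights on the jumps in eta
  P <- lambda * t(D) %*% (v * D)          # weighted penalty matrix
  u <- eta + (y - mu) / mu                # IWLS working response
  eta <- solve(diag(mu) + P, mu * u)      # penalized weighted least squares step
}
mu <- exp(eta)                            # fitted expected counts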
Smoothing of coal mining data with L2 penalty
[Figure: coal mining disasters per year with the L2 smooth, and the AIC profile vs. log10(lambda)]
Smoothing of coal mining data with L1 penalty
[Figure: coal mining disasters per year with L1 smooths, and the AIC profile vs. log10(lambda)]
Smoothing of coal mining data with L0 penalty
[Figure: coal mining disasters per year with L0 smooths, and the AIC profile vs. log10(lambda)]
Deconvolution
• We observed a “crisp” signal plus noise
• Sometimes the signal has been filtered before
• A crisp input, run through a filter
• So we have penalized deconvolution
• This is not hard to do
• Model: y = Cx + e
• Regression with a penalty on x (see the sketch below)
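A dense sketch of this penalized deconvolution; the filter matrix C is built here as an assumed moving-average impulse response of length k (k, lambda, beta all illustrative):

m <- length(y)
k <- 10                                        # assumed filter length
C <- outer(1:m, 1:m, function(i, j) (i - j >= 0 & i - j < k) / k)
D <- diff(diag(m))
x <- rep(mean(y), m)                           # flat start
for (it in 1:30) {
  v <- 1 / (diff(x)^2 + beta^2)                # adaptive L0 weights on the jumps in x
  P <- lambda * t(D) %*% (v * D)
  x <- drop(solve(t(C) %*% C + P, t(C) %*% y)) # penalized regression update
}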
Illustrating convolution with step input
[Figure: step input (top) and filtered output (bottom)]
Deconvolution with step input
[Figure: input and output (top); output and estimated input (bottom)]
Another application: spike deconvolution
• Many (technical) signals consist of peaks
• Often they can be described by convolution
• Spikes as input, convolution with peaked impulse response
• Penalized regression: minimize
S = ‖y − Cx‖² + κ ‖x‖ₚ
• Ridge regression, p = 2, is no use
• LASSO, p = 1, is an improvement
• But p = 0 works best
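For spikes the adaptive weights sit on x itself rather than on its differences; a hedged adaptive-ridge sketch (C, kappa, beta as above, all assumed):

x <- drop(solve(t(C) %*% C + kappa * diag(m), t(C) %*% y))  # ridge start (p = 2)
for (it in 1:30) {
  W <- diag(1 / (x^2 + beta^2))               # L0 weights on the coefficients themselves
  x <- drop(solve(t(C) %*% C + kappa * W, t(C) %*% y))
}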
Spike deconvolution (simulated data)
[Figure: simulated data, components, and fit (top); deconvolution with L1 penalty, κ = 0.01 (middle); with L0 penalty, κ = 0.003 (bottom)]
Summary
• We can implement the L0 norm as a weighted L2 norm
• We get surprisingly good results in segmentation and deconvolution
• Non-convex objective function seems no problem in practice
• Fast, sparse computations, linear in data length
• Easily combined with any likelihood
• SCALA software for CNV smoothing
• And for much more
SCALA software (r.c.a.rippe@lumc.nl)