Nonlinear Curve Fitting Earl F. Glynn Scientific Programmer Bioinformatics

advertisement
Nonlinear Curve Fitting
Earl F. Glynn
Scientific Programmer
Bioinformatics
11 Oct 2006
1
Nonlinear Curve Fitting
• Mathematical Models
• Nonlinear Curve Fitting Problems
–
–
–
–
Mixture of Distributions
Quantitative Analysis of Electrophoresis Gels
Fluorescence Correlation Spectroscopy (FCS)
Fluorescence Recovery After Photobleaching (FRAP)
• Linear Curve Fitting
• Nonlinear Curve Fitting
–
–
–
–
Gaussian Case Study
Math
Algorithms
Software
• Analysis of Results
– Goodness of Fit: R2
– Residuals
• Summary
2
Mathematical Models
• Want a mathematical model to describe
observations based on the independent
variable(s) under experimental control
• Need a good understanding of underlying
biology, physics, chemistry of the problem
to choose the right model
• Use Curve Fitting to “connect” observed
data to a mathematical model
3
Nonlinear Curve Fitting Problems
Mixture Distribution Problem
0.04
0.03
0.00
0.01
0.02
Probability Density
0.05
0.06
Heming Lake Pike: Length Distribution
0
20
40
60
80
Length [cm]
Adapted from www.math.mcmaster.ca/peter/mix/demex/expike.html
4
Nonlinear Curve Fitting Problems
Mixture Distribution Problem
Heming Lake Pike: Distribution by Age Groups
0.06
Normal Probability Density Function
2
e
0.03
0.04
2π σ
− ( x − µ )2
2σ 2
Coefficient of Variation
0.02
Probability Density
0.05
f ( x) =
1
σ
µ
0.00
0.01
cv =
0
20
40
60
80
Length [cm]
Data are fitted by five normal distributions with constant coefficient of variation 5
Nonlinear Curve Fitting Problems
Quantitative Analysis of Electrophoresis Gels
Deconvolve a pixel profile of a banding pattern into a family of
Gaussian or Lorentzian curves
Das, et al, RNA (2005), 11:348
http://papakilo.icmb.utexas.edu/cshl-2005/lectures/
CSHL_Lecture05_khodursky.ppt#23
Takamato, et al, Nucleic Acids Research, 32(15), 2004,6p. 2
Nonlinear Curve Fitting Problems
Quantitative Analysis of Electrophoresis Gels
Many proposed functional forms besides
Gaussian or Lorentzian curves
DiMarco and Bombi, Mathematical functions for the representation of chromatographic peaks,
Journal of Chromatography A, 931(2001), 1-30.
7
Nonlinear Curve Fitting Problems
Fluorescence Correlation Spectroscopy (FCS)
Bacia, Kim & Schwille, “Fluorescence cross-correlation spectroscopy in living cells,”
Nature Methods, Vol 3, No 2, p. 86, Feb. 2006.
8
Nonlinear Curve Fitting Problems
Fluorescence Correlation Spectroscopy (FCS)
Note likely heteroscedasticity in data
From discussion by Joe Huff at Winfried Wiegraebe’s Lab Meeting, 11 Aug 2006
9
Nonlinear Curve Fitting Problems
Fluorescence Recovery After Photobleaching (FRAP)
From discussion by Juntao Gao at Rong Li’s Lab Meeting, 25 Sept 2006
10
Nonlinear Curve Fitting Problems
Fluorescence Recovery After Photobleaching (FRAP)
From discussion by Juntao Gao at Rong Li’s Lab Meeting, 25 Sept 2006
11
Linear Curve Fitting
•
•
•
•
•
Linear regression
Polynomial regression
Multiple regression
Stepwise regression
Logarithm transformation
12
x
i
Linear Curve Fitting
Linear Regression: Least Squares
Given data points ( , y).
i
We want the “best” straight line, ( , yˆ ),
through these points, where yˆ is the“fitted”
value at point :
i
i
yˆ = a + b x
i
i
13
x
i
Linear Curve Fitting
Linear Regression: Least Squares
(x,y) Data
5
Linear Fit
yˆ = a + b x
4
i
i
3
(xi,yi)
y
Error Function
N
[
( a , b ) = ∑ y i − (a
i =1
+ b⋅x )]
i
0
1
2
χ
2
0
1
2
x
3
4
Assume homoscedasticity (same variance)
14
2
Linear Curve Fitting
Linear Regression: Least Squares
Search (a,b) parameter space to minimize error function, χ2
Error Function
χ2
200
χ
2
χ
2
N
[
( a , b ) = ∑ y i − (a
i =1
+ b⋅x )]
i
(1.2, 0.9) = 1.9
150
100
Linear Fit
3
50
1
-1
0
0
1
a
2
b
2
yˆ = 1.2 + 0.9 x
i
i
15
2
x
i
Linear Curve Fitting
Linear Regression: Least Squares
5
Least Squares Line
0
1
2
y
3
4
y = 1.2 + 0.9x
0
1
2
3
4
x
How can (a,b) parameters be found directly without a search?
16
x
i
Linear Curve Fitting
Linear Regression: Least Squares
How can (a,b) parameters be found directly without a search?
• Differentiate χ2 with respect to parameters a and b
• Set derivatives to 0.
N
∂ χ2
= −2∑ yi − a − b ⋅ xi = 0
∂a
i =1
N
∂ χ2
= −2∑ xi ( yi − a − b ⋅ xi ) = 0
∂b
i =1
17
x
i
Linear Curve Fitting
Linear Regression: Least Squares
How can (a,b) parameters be found directly without a search?
Linear Fit
yˆ = a + b x
i
i
Simultaneous Linear Equations
 N
xi   a   ∑ y i 
∑
=


2  
b
∑ xi ∑ x i    ∑ xi y i 
18
x
i
Linear Curve Fitting
Linear Regression: Least Squares
How can (a,b) parameters be found directly without a search?
Linear Fit
yˆ = a + b x
i
i
i
x y x² xy
1
0 1
0 0
2
1 3
1 3
3
2 2
4 4
4
3 4
9 12
5
4 5 16 20
Sum 10 15 30 39
Simultaneous Linear Equations
 N
xi   a   ∑ y i 
∑
=


2  
b
∑ xi ∑ x i    ∑ xi y i 
 5 10  a  15 
10 30 b  = 39

   
15
39
a=
5
10
10
30 15 ⋅ 30 − 39 ⋅10 60
=
= 1.2
=
10
5 ⋅ 30 − 10 ⋅10 50
30
5
10
b=
5
10
15
39 5 ⋅ 39 − 10 ⋅15 45
=
=
= 0.9
10 5 ⋅ 30 − 10 ⋅10 50
30
19
x
i
Linear Curve Fitting
Linear Regression: Least Squares
> x <- 0:4
> y <- c(1,3,2,4,5)
> summary( lm(y ~ x) )
R solution
using lm (linear model)
Call:
lm(formula = y ~ x)
Residuals:
1
2
3
-0.2 0.9 -1.0
4
0.1
5
0.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
1.2000
0.6164
1.947
0.1468
x
0.9000
0.2517
3.576
0.0374 *
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-Squared: 0.81,
Adjusted R-squared: 0.7467
F-statistic: 12.79 on 1 and 3 DF, p-value: 0.03739
20
x
i
Linear Curve Fitting
Linear Regression: Least Squares
Assume homoscedasticity (σi = constant = 1)
Assume heteroscedasticity
y − ( a + b⋅ x ) 

χ ( a, b) = ∑ 

σi


2
N
i
2
i
i =1
Often weights σi are assumed to be 1.
21
Experimental measurement errors can be used if known.
Nonlinear Curve Fitting
Gaussian Case Study
0.1
0.2
0.3
0.4
0.5
0.6
Gaussian Data
0.0
y
0.00004
0.00055
0.00472
0.02739
0.10748
0.28539
0.51275
0.62335
0.51275
0.28539
0.10748
0.02739
0.00472
0.00055
0.00004
y
x
-2.0
-1.5
-1.0
-0.5
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
-2
-1
0
Normal Probability Density Function
1
2
3
4
5
x
f ( x) =
1
2π σ
2
e
− ( x − µ )2
2σ 2
22
Nonlinear Curve Fitting
Gaussian Case Study
χ
2
8
Minimum
μ = 1.5
σ = 0.8
Gradient descent
works well only
inside “valley”
here
6
4
2
0.4
0.6
0.8
ma
si g
1.0
-5
1.2
1.4
f ( x) =
1
2π σ
2
e
− ( x − µ )2
2σ 2
0
mu
5
χ
2
N
[
( µ , σ ) = ∑ yi − f ( xi )
i =1
Assume homoscedasticity
]
2
23
Nonlinear Curve Fitting
Gaussian Case Study
Derivatives may be useful for estimating parameters
y
0.0
0.1
0.2
0.3
0.4
Single Gaussian
-4
-2
0
2
4
2
4
2
4
x
y'
-0.2
0.0
0.2
1st Derivative
-4
-2
0
x
-0.4
-0.2
y''
0.0
0.2
2nd Derivative
-4
-2
0
x
U:/efg/lab/R/MixturesOfDistributions/SingleGaussian.R
24
Nonlinear Curve Fitting
Gaussian Case Study
Derivatives may be useful for determining number of terms
0.04
0.00
0.02
y
0.06
Heming Lake Pike
0
20
40
60
80
60
80
60
80
x
0.005
-0.005
y'
1st Derivative
0
20
40
x
y''
-0.006 -0.002
0.002
2nd Derivative
0
20
40
x
25
Nonlinear Curve Fitting
Math
Given data points (xi,yi).
Given desired model to fit (not always known):
y = y(x; a)
where there are M unknown parameters:
ak, k = 1,2,..., M.
The error function (“merit function”) is
y − y (x ; a ) 

χ (a) = ∑  σ

i


2
N
i
2
i
i =1
From Press, et al, Numerical Recipes in C (2nd Ed), 1992, p. 682
26
Nonlinear Curve Fitting
Math
Need to search multidimensional parameter space to
minimize error function, χ2
27
Nonlinear Curve Fitting
Math
Gradient of χ2 with respect to parameters a will be zero
at the minimum:
N
[ y − y (xi ; a )] ∂ y (xi ; a )
∂ χ2
= −2∑ i
k = 1,2,..., M
2
∂ ak
∂ ak
σi
i =1
βk (after dropping “-2”)
Taking the second derivative of χ2:
Often small and ignored
2 2
2
N
y (x i ; a ) 
1  ∂ y ( x i ; a ) ∂ y (x i ; a )
∂ χ
∂
= 2∑ 2 
− [ yi − y (xi ; a )]
∂ a k ∂ al
∂ ak
∂ al
∂ al ∂ a k 
i =1 σ i 
αkl = Hessian or “curvature” matrix (after dropping “2”)
From Press, et al, Numerical Recipes in C (2nd Ed), 1992, p. 682
28
Nonlinear Curve Fitting
Algorithms
• Levenberg-Marquardt is most widely used
algorithm:
– When “far” from minimum, use gradient descent:
∆al = constant ⋅ β l
– When “close” to minimum, switch to inverse Hessian:
M
∑α
l =1
kl
∆a l = β k
• “Full Newton-type” methods keep dropped
term in second derivative – considered more
robust but more complicated
• Simplex is an alternative algorithm
29
Nonlinear Curve Fitting
Algorithms
• Fitting procedure is iterative
• Usually need “good” initial guess, based on
understanding of selected model
• No guarantee of convergence
• No guarantee of optimal answer
• Solution requires derivatives: numeric or
analytic can be used by some packages
30
Nonlinear Curve Fitting
Software
IDL:
curvefit function; MPFIT: Robust non-linear least square curve fitting
(3 limited licenses)
• Joe Huff in Advanced
Instrumentation is quite-well versed in using MPFIT and applying it in IDL
Mathematica
(1 limited license)
MatLab:
Curve Fitting Toolbox
(1 limited license)
OriginPro:
Peak Fitting Module
(10 limited licenses)
PeakFit:
Nonlinear curve fitting for spectroscopy, chromatography and electrophoresis
(1 limited license)
R: nls function
• many statistics
• symbolic derivatives (if desired)
• flawed implementation: exact “toy” problems fail unless “noise” added
31
Nonlinear Curve Fitting
Software
NIST reference datasets with certified computational results
http://www.itl.nist.gov/div898/strd/general/dataarchive.html
32
Analysis of Results
• Goodness of Fit: R2
• Residuals
33
Goodness of Fit: R2
Coefficient of Determination
Percentage of Variance Explained
R
2
= 1−
Residual Sum of Squares (RSS)
Total Sum of Squares (SS) [Corrected for Mean]
(
yˆ i − yi )
∑
R =1−
∑ ( yi − y )
2
2
2
0 ≤ R2 ≤ 1
• “Adjusted” R2 compensates for higher R2 as terms added.
• A “good” value of R2 depends on the application.
• In biological and social sciences with weakly correlated
variables, and considerable noise, R2 ~ 0.6 might be
considered good.
• In physical sciences in controlled experiments, R2 ~ 0.6
might be considered low.
Faraway, Linear Models with R, 2005, p.16-18
34
Residuals
• Residuals are estimates of the true and unobservable errors.
• Residuals are not independent (they sum to 0).
“Curve fitting made easy,” Marko Ledvij, The Industrial Physicist, April/May 2003.
http://www.aip.org/tip/INPHFA/vol-9/iss-2/p24.html
35
Analysis of Residuals
• Are residuals random?
• Is mathematical model appropriate?
• Is mathematical model sufficient to
characterize the experimental data?
• Subtle behavior in residuals may suggest
significant overlooked property
Good Reference: “Analysis of Residuals: Criteria for Determining Goodness-of-Fit,”
Straume and Johnson, Methods in Enzymology, Vol. 210, 87-105, 1992.
36
Analysis of Residuals
Synthetic FRAP Data: Fit with 1 term when 2 terms are better
Near “perfect” fit, but why is there a pattern in the residuals?
37
Analysis of Residuals
Lomb-Scargle periodogram can indicate “periodicity” in the residuals
Flat line with all “bad” p-values would indicate “random” residuals
38
Analysis of Residuals
Synthetic FRAP Data: Fit with 2 terms
39
Analysis of Residuals
FCS Data and Heteroscedasticity
χ
Heteroscedasticity in Residuals
2
N
(a) = ∑
i =1
 y i − y (x i ; a ) 

σi

2
Scaling Factor
Scaled Residuals
Use F Test to test for unequal variances
FCS Residual Plots Courtesy of Joseph Huff, Advanced Instrumentation & Physics
40
Analysis of Residuals
Heteroscedasticity and Studentized Residuals
• Studentized residual is a residual divided by
an estimate of its standard deviation
• The “leverage” hii is the ith diagonal entry of
a “hat matrix.”
Studentize d Residual =
σˆ
εˆ
i
i
1 − hii
• Externally Studentized Residuals follow
Student’s t-distribution.
• Can be used to statistically reject “outliers”
See http://en.wikipedia.org/wiki/Studentized_residual
41
Summary
• A mathematical model may or may not be appropriate for
any given dataset.
• Linear curve fitting is deterministic.
• Nonlinear curve fitting is non-deterministic, involves
searching a huge parameter space, and may not converge.
• Nonlinear curve fitting is powerful
(when the technique works).
• The R2 and adjusted R2 statistics provide easy to
understand dimensionless values to assess goodness of fit.
• Always study residuals to see if there may be unexplained
patterns and missing terms in a model.
• Beware of heteroscedasticity in your data. Make sure
analysis doesn’t assume homoscedasticity if your data are
not.
• Use F Test to compare the fits of two equations.
42
Acknowledgements
Advanced Instrumentation & Physics
• Joseph Huff
• Winfried Wiegraebe
43
Download