Fitting data to a straight line

Fitting data to a model (whether a simple function or a more complicated model
involving a large set of parameters) is a problem encountered in most physical
studies. The general idea is to build a figure-of-merit function (or merit function) that
characterises the quality of the fit for a given model and set of parameters. It is usually
arranged so that small values of this function indicate good agreement with the data. The parameters
that give the smallest merit function are then the best-fit parameters. In the case where one
wants to fit the data to a straight line, the simple solution is to minimise the squared distance
between the data points and the fitted line.
However, the parameters estimated via this least-squares method are not the final answer,
since one also wants to know the errors associated with each best-fit parameter (in this case,
the slope and intercept of the line) and to describe the quality of the fit itself. Furthermore, this
simple method does not take into account the errors associated with the data points themselves.
The following sections briefly address these issues.
Note that most of it comes from chapter 15 of Numerical Recipes [3], which you can refer
to for more details on the problem.
1   Case where only one variable has associated errors
In the simplest case (i.e. Gaussian errors on the yi), the function one wants to minimise is
called the chi-square and is defined as:
    χ² = Σ_{i=1}^{N} [ (yi − y(xi; a1...aM)) / σi ]²        (1)
where xi and yi are the coordinates of the N data points (the sum runs over all data
points), y(xi; a1...aM) is the model (a function of the xi values and of the set of M parameters
a1 to aM) and σi is the error on each yi value. (Keep in mind that there are no errors on xi
at this point¹.)
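As a concrete illustration, eq.(1) can be evaluated in a few lines of numpy; the function name and toy data below are illustrative only, not part of the original text:

```python
import numpy as np

def chi_square(y, y_model, sigma):
    """Chi-square of eq.(1): squared residuals between the data y_i and
    the model values y(x_i; a_1...a_M), weighted by the errors sigma_i."""
    return float(np.sum(((y - y_model) / sigma) ** 2))

# Toy data lying exactly on the line y = 1 + 2x, with uniform errors:
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
sigma = np.array([0.5, 0.5, 0.5])

chi_square(y, 1.0 + 2.0 * x, sigma)  # 0.0: this model reproduces the data exactly
```

Any model that misses the data points increases the sum, so smaller values mean better agreement.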
In the case where one wants to fit a set of N data points to a straight line,
y(x) = y(x; a, b) = a + bx ,
(2)
eq.(1) becomes:
    χ²(a, b) = Σ_{i=1}^{N} [ (yi − a − b·xi) / σi ]²        (3)
¹ This case is widely encountered, either when the xi values are perfectly well defined (e.g.
dates, quantum numbers...) or when the errors on xi are negligible compared to the ones on yi.
which must then be minimised to find the best-fit a and b. At the minimum, the derivatives
of χ²(a, b) with respect to a and b vanish:
    0 = ∂χ²/∂a = −2 Σ_{i=1}^{N} (yi − a − b·xi) / σi²
    0 = ∂χ²/∂b = −2 Σ_{i=1}^{N} xi·(yi − a − b·xi) / σi²        (4)
Let us define the following quantities:
    S ≡ Σ_{i=1}^{N} 1/σi²,    Sx ≡ Σ_{i=1}^{N} xi/σi²,    Sy ≡ Σ_{i=1}^{N} yi/σi²,
    Sxx ≡ Σ_{i=1}^{N} xi²/σi²,    Sxy ≡ Σ_{i=1}^{N} xi·yi/σi²        (5)
Eq.(4) can then be written:
aS + bSx = Sy
aSx + bSxx = Sxy
(6)
The solution of these two equations in two unknowns is:
    ∆ ≡ S·Sxx − (Sx)²
    a = (Sxx·Sy − Sx·Sxy) / ∆        (7)
    b = (S·Sxy − Sx·Sy) / ∆
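These closed-form expressions translate directly into code. Below is a minimal numpy sketch of eqs.(5)-(7); the function name and sample data are illustrative:

```python
import numpy as np

def fit_line(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x, following eqs.(5)-(7):
    accumulate the sums S, Sx, Sy, Sxx, Sxy, then solve for a and b."""
    w = 1.0 / sigma ** 2                 # weights 1/sigma_i^2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x ** 2).sum(), (w * x * y).sum()
    Delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / Delta    # intercept, eq.(7)
    b = (S * Sxy - Sx * Sy) / Delta      # slope, eq.(7)
    return a, b

# Illustrative noisy measurements of a line:
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.9])
sigma = np.full_like(x, 0.1)
a, b = fit_line(x, y, sigma)             # best-fit intercept and slope
```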
These equations give the solution for the best-fit parameters a and b. We then need
to estimate the errors on those parameters, since the errors on the data must introduce
some uncertainty in their estimation. If the data points are independent, each
contributes its own bit of uncertainty to the parameters. Consideration of the propagation of
errors shows that the variance σf² in the value of any function f of the yi will be:
    σf² = Σ_{i=1}^{N} σi² (∂f/∂yi)²        (8)
In our present case, the derivatives of a and b with respect to yi can be directly evaluated
from:
    ∂a/∂yi = (Sxx − Sx·xi) / (σi²·∆)
    ∂b/∂yi = (S·xi − Sx) / (σi²·∆)        (9)
Summing these equations over all points as expressed in eq.(8), one gets:
    σa² = Sxx / ∆
    σb² = S / ∆        (10)
where σa and σb are the standard deviations (i.e. “errors”) in the estimates of a and b
respectively.
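Extending the previous sketch, eq.(10) gives the parameter errors almost for free; note that σa and σb depend only on the xi and σi, not on the measured yi values. A hypothetical implementation:

```python
import numpy as np

def fit_line_with_errors(x, y, sigma):
    """Best-fit a and b (eq. 7) together with their standard deviations
    sigma_a and sigma_b (eq. 10)."""
    w = 1.0 / sigma ** 2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x ** 2).sum(), (w * x * y).sum()
    Delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / Delta
    b = (S * Sxy - Sx * Sy) / Delta
    sigma_a = np.sqrt(Sxx / Delta)   # eq.(10); depends only on xi and sigma_i
    sigma_b = np.sqrt(S / Delta)
    return a, b, sigma_a, sigma_b
```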
The question of estimating the goodness-of-fit is left untreated here, essentially because
it cannot be answered analytically and thus requires the use of numerical tools. One can
nonetheless get a good sense of it by estimating χ²/ν, the chi-square divided by the number of
degrees of freedom (in the present case, ν = N − M = N − 2), which should be close to
1 for a good fit.
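A minimal sketch of this χ²/ν estimate (illustrative names, assuming the straight-line model of eq.(3)):

```python
import numpy as np

def reduced_chi_square(x, y, sigma, a, b):
    """Chi-square of eq.(3) divided by the number of degrees of freedom,
    nu = N - 2 for a two-parameter straight-line fit."""
    chi2 = float(np.sum(((y - (a + b * x)) / sigma) ** 2))
    return chi2 / (len(x) - 2)
```

Values much larger than 1 suggest a poor model or underestimated σi; values much smaller than 1 suggest overestimated errors.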
2   Case where both variables have associated errors
Things get much more difficult when there are errors in xi and yi . In that case, the chi-square
has to be written:
    χ²(a, b) = Σ_{i=1}^{N} (yi − a − b·xi)² / (σyi² + b²·σxi²)        (11)
and ∂χ²/∂b = 0 becomes a non-linear equation. The least-squares method
described above is then no longer valid and no simple analytical solution exists, so this case
will not be treated in detail here.
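Although no closed form exists, eq.(11) can still be minimised numerically. One simple (illustrative, by no means the only) approach: for a fixed slope b the optimal intercept a is a weighted mean, so the problem reduces to a one-dimensional minimisation over b, sketched below with a brute-force grid scan:

```python
import numpy as np

def profiled_chi2(b, x, y, sx, sy):
    """Chi-square of eq.(11) with the intercept profiled out: for a fixed
    slope b the optimal a is a weighted mean, so only b is non-linear."""
    w = 1.0 / (sy ** 2 + (b * sx) ** 2)        # effective weights of eq.(11)
    a = np.sum(w * (y - b * x)) / np.sum(w)    # best intercept for this slope
    return float(np.sum(w * (y - a - b * x) ** 2)), a

def fit_line_xy_errors(x, y, sx, sy, b_grid):
    """Minimise eq.(11) by scanning candidate slopes on a grid."""
    chi2s = [profiled_chi2(b, x, y, sx, sy)[0] for b in b_grid]
    b_best = float(b_grid[int(np.argmin(chi2s))])
    return profiled_chi2(b_best, x, y, sx, sy)[1], b_best   # (a, b)
```

In practice one would refine the grid around the minimum or use a proper 1-D minimiser; the scan is only meant to show that the extra b-dependence in the denominator makes the problem non-linear in b.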
For more on this, one can refer to section 15.3 in Numerical Recipes [3], the description
of the BCES estimator [1] or of Bayesian methods [2]. However, this is out of the scope of
the Spectroscopy experiment.
References
[1] M. G. Akritas and M. A. Bershady. Linear Regression for Astronomical Data with
Measurement Errors and Intrinsic Scatter. ApJ, 470:706, October 1996, arXiv:astro-ph/9605002.
[2] D. W. Hogg, J. Bovy, and D. Lang. Data analysis recipes: Fitting a model to data.
ArXiv e-prints, August 2010, arXiv:1008.4686.
[3] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes
in C: The Art of Scientific Computing. 1992.