PARAMETER ESTIMATION
The fundamental concept of parameter estimation is to determine optimal values of parameters
for a numerical model that predicts dependent variable outputs of a function, process or
phenomenon based on observations of independent variable inputs. For a given data
observation, independent variable inputs are grouped into an input vector of length ninp and
dependent variable outputs are grouped into an output vector of length nout. Corresponding
input and output vectors for a given data observation are called a training pair. In general,
training pairs from nobs number of data observations are pooled to estimate npar number of
model parameters.
Table of Contents
1.0 Least Squares Methods
1.1 Linear Models
1.1.1 Straight Line
1.1.2 Single Dependent Variable
1.1.2.1 Polynomial
1.1.2.1.1 Quadratic
1.1.2.1.2 Cubic
1.1.2.1.3 Quartic
1.1.2.1.4 Choice Of Polynomial Order
1.1.2.2 Harmonic
1.1.2.3 Plane
1.1.2.4 Bicubic
1.1.3 Multiple Dependent Variables
1.1.4 Weighting
1.1.5 Scaling
1.2 Linearized Models
1.2.1 Newtonian Cooling
1.2.2 Logarithmic Spiral
1.2.3 Circle
1.2.4 Sphere
1.3 Example Data
1.3.1 Anscombe's Data Set
2.0 Eigenvector Methods
2.1 Line
2.2 Plane
2.3 Ellipse
2.4 Ellipsoid
2.5 2D Quadric
2.6 3D Quadric
3.0 Levenberg-Marquardt
4.0 Simplex
5.0 Gradient Search
6.0 Neural Network
7.0 Genetic Algorithms
8.0 Simulated Annealing
9.0 Kalman Filter
1.0 LEAST SQUARES METHODS
The least squares optimality criterion minimizes the sum of squares of residuals between actual observed outputs and output values of the numerical model that are predicted from input observations.
1.1 LINEAR MODELS
Many processes exhibit true linear behavior. Many others operate over such small excursions of
input variable values that the output behavior appears linear.
1.1.1 STRAIGHT LINE
The classic least squares problem is to fit a straight line model with a single input x and a single
output y using parameters b and m as shown in Equation 1. Unfortunately, individual data
observations xi and yi may not fit the model perfectly due to experimental measurement error,
process variation or insufficient model complexity as shown in Equation 2. Multiple data
observations may be concatenated as shown in Equation 3 and represented in matrix form per
Equation 4. Even for optimal estimates of model parameters {}, each data observation will
have some residual error ei between the observed output yi and the predicted model output as
shown in Equations 5 and 6 .
\[ y = b + m\,x = \begin{bmatrix} 1 & x \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \qquad \text{for } n_{inp}=1,\; n_{out}=1,\; n_{par}=2 \]
Eq. 1

\[ y_i \approx \begin{bmatrix} 1 & x_i \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \]
Eq. 2

\[ \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix} \approx \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{n_{obs}} \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \]
Eq. 3

\[ \{Y\} \approx [X]\,\{\beta\} \qquad \text{for } \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix},\quad \underset{n_{obs} \times 2}{[X]} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{n_{obs}} \end{bmatrix},\quad \underset{2 \times 1}{\{\beta\}} = \begin{Bmatrix} b \\ m \end{Bmatrix} \]
Eq. 4
\[ \begin{Bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{n_{obs}} \end{Bmatrix} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix} - \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{n_{obs}} \end{bmatrix} \begin{Bmatrix} b \\ m \end{Bmatrix} \]
Eq. 5

\[ \{e\} = \{Y\} - [X]\,\{\beta\} \qquad \text{for } \underset{n_{obs} \times 1}{\{e\}} = \begin{Bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{n_{obs}} \end{Bmatrix} \]
Eq. 6
The scalar sum of squares SSQ of residual errors is shown in Equation 7. To minimize the sum of squares, one may set the partial derivative of the sum of squares with respect to the model parameters {β} equal to zero as shown in Equation 8. Rearranging these terms provides a linear matrix solution for optimal model parameters per Equation 9.
\[ SSQ = \{e\}^T \{e\} = \{Y\}^T \{Y\} - 2\,\{\beta\}^T [X]^T \{Y\} + \{\beta\}^T [X]^T [X]\,\{\beta\} \]
Eq. 7

\[ \partial SSQ / \partial \{\beta\} = -2\,[X]^T \{Y\} + 2\,[X]^T [X]\,\{\beta\} = 0 \]
Eq. 8

\[ \underset{2 \times 1}{\{\beta\}} = \Bigl( \underset{2 \times n_{obs}}{[X]^T}\; \underset{n_{obs} \times 2}{[X]} \Bigr)^{-1} \underset{2 \times n_{obs}}{[X]^T}\; \underset{n_{obs} \times 1}{\{Y\}} \]
Eq. 9
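As a concrete illustration of Equation 9, the following MATLAB/Octave sketch fits the straight-line model to a small synthetic data set; the data values and variable names are illustrative assumptions rather than examples taken from this text.

% Straight-line least squares fit, y = b + m*x (Equations 1-9)
% Synthetic example data (illustrative values only)
x = [1; 2; 3; 4; 5; 6];                 % independent variable inputs (nobs x 1)
y = [2.1; 3.9; 6.2; 7.8; 10.1; 11.9];   % dependent variable outputs (nobs x 1)

X = [ones(size(x)) x];                  % design matrix [1 x], nobs x 2

beta = (X' * X) \ (X' * y);             % normal equations, Equation 9
beta_qr = X \ y;                        % equivalent backslash (QR) solution

b = beta(1);                            % offset
m = beta(2);                            % slope
e = y - X * beta;                       % residuals, Equation 6
SSQ = e' * e;                           % sum of squares, Equation 7
fprintf('b = %.4f, m = %.4f, SSQ = %.4f\n', b, m, SSQ);

The backslash operator solves the same least squares problem by orthogonal factorization and is generally better conditioned than explicitly forming [X]T[X].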
Expanding this matrix solution as shown in Equation 10 provides valuable insights into the
nature of least squares estimates. If the input observations xi and the output observations yi are
both mean centered as shown in Equation 11, then the offset term b is zero and the slope m is
equal to the covariance of xi and yi divided by the variance of xi as shown in Equation 12.
Consequently, mean centered data preclude the need to compute an offset term for many linear
models.

\[ \begin{Bmatrix} b \\ m \end{Bmatrix} = \begin{bmatrix} n_{obs} & \sum_{i=1}^{n_{obs}} x_i \\[4pt] \sum_{i=1}^{n_{obs}} x_i & \sum_{i=1}^{n_{obs}} x_i^2 \end{bmatrix}^{-1} \begin{Bmatrix} \sum_{i=1}^{n_{obs}} y_i \\[4pt] \sum_{i=1}^{n_{obs}} x_i y_i \end{Bmatrix} \]
Eq. 10

\[ \text{for } x_o = \frac{1}{n_{obs}} \sum_{i=1}^{n_{obs}} x_i = 0 \quad \text{and} \quad y_o = \frac{1}{n_{obs}} \sum_{i=1}^{n_{obs}} y_i = 0 \]
Eq. 11

\[ \begin{Bmatrix} b \\ m \end{Bmatrix} = \begin{bmatrix} n_{obs} & 0 \\[2pt] 0 & \sum_{i=1}^{n_{obs}} x_i^2 \end{bmatrix}^{-1} \begin{Bmatrix} 0 \\[2pt] \sum_{i=1}^{n_{obs}} x_i y_i \end{Bmatrix} = \begin{Bmatrix} 0 \\[6pt] \dfrac{\sum_{i=1}^{n_{obs}} x_i y_i}{\sum_{i=1}^{n_{obs}} x_i^2} \end{Bmatrix} \qquad \text{for } x_o = 0,\; y_o = 0 \]
Eq. 12
r2 ????
1.1.2 SINGLE DEPENDENT VARIABLE
The methodology developed above may be generalized to polynomials or other linear models requiring npar ≥ 2 linear parameters as shown in Equation 13. Note that the number of data observations nobs must be greater than or equal to the number of estimated parameters npar to prevent singularity of the symmetric matrix ([X]T[X]).

\[ \underset{n_{par} \times 1}{\{\beta\}} = \Bigl( \underset{n_{par} \times n_{obs}}{[X]^T}\; \underset{n_{obs} \times n_{par}}{[X]} \Bigr)^{-1} \underset{n_{par} \times n_{obs}}{[X]^T}\; \underset{n_{obs} \times 1}{\{Y\}} \qquad \text{for } n_{out} = 1 \]
Eq. 13
1.1.2.1 POLYNOMIAL
1.1.2.1.1 QUADRATIC
A quadratic model is presented in Equations 14 and 15.
\[ y = a_0 + a_1 x + a_2 x^2 = \begin{bmatrix} 1 & x & x^2 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} \qquad \text{for } n_{inp}=1,\; n_{out}=1,\; n_{par}=3 \]
\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix},\qquad \underset{n_{obs} \times 3}{[X]} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & x_{n_{obs}}^2 \end{bmatrix} \]
Eq. 14

\[ \underset{3 \times 1}{\{\beta\}} = \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} \]
Eq. 15
1.1.2.1.2 CUBIC
A cubic model is presented in Equations 16 and 17.
\[ y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{Bmatrix} \qquad \text{for } n_{inp}=1,\; n_{out}=1,\; n_{par}=4 \]
\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix},\qquad \underset{n_{obs} \times 4}{[X]} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & x_{n_{obs}}^2 & x_{n_{obs}}^3 \end{bmatrix} \]
Eq. 16

\[ \underset{4 \times 1}{\{\beta\}} = \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{Bmatrix} \]
Eq. 17
1.1.2.1.3 QUARTIC
A quartic (fourth order) model is presented in Equations 18 and 19.
\[ y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 = \begin{bmatrix} 1 & x & x^2 & x^3 & x^4 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{Bmatrix} \qquad \text{for } n_{inp}=1,\; n_{out}=1,\; n_{par}=5 \]
\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix},\qquad \underset{n_{obs} \times 5}{[X]} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & x_1^4 \\ 1 & x_2 & x_2^2 & x_2^3 & x_2^4 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & x_{n_{obs}}^2 & x_{n_{obs}}^3 & x_{n_{obs}}^4 \end{bmatrix} \]
Eq. 18

\[ \underset{5 \times 1}{\{\beta\}} = \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{Bmatrix} \]
Eq. 19
1.1.2.1.4 CHOICE OF POLYNOMIAL ORDER
Engine speed [rpm]    Engine torque [N.m]
       800                   500
      1000                   547
      1200                   636
      1400                   679
      1600                   719
      1800                   724
      2000                   712
      2200                   671
      2400                   606
      2600                   575
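Adding a higher polynomial order cannot increase the residual sum of squares and usually reduces it, so the order should be chosen where further reductions become marginal rather than by SSQ alone. The following MATLAB/Octave sketch (the input scaling step and loop structure are illustrative choices, not prescribed by this text) fits the engine torque data above with increasing order and reports the SSQ of each fit.

% Compare polynomial fits of increasing order to the engine torque data
rpm    = (800:200:2600)';                                % engine speed [rpm]
torque = [500 547 636 679 719 724 712 671 606 575]';     % engine torque [N.m]

s = (rpm - mean(rpm)) / std(rpm);    % mean center and scale inputs (Section 1.1.5)

for order = 1:4
    X = ones(numel(s), order + 1);   % design matrix [1 s s^2 ... s^order]
    for j = 1:order
        X(:, j+1) = s .^ j;
    end
    beta = (X' * X) \ (X' * torque); % least squares parameters, Equation 13
    e    = torque - X * beta;        % residuals
    fprintf('order %d: SSQ = %.1f\n', order, e' * e);
end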
1.1.2.2 HARMONIC
A linear harmonic model is presented in Equations 17 and 18.
\[ y = \begin{bmatrix} 1 & \cos\theta & \sin\theta & \cos 2\theta & \sin 2\theta \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ b_1 \\ a_2 \\ b_2 \end{Bmatrix} \qquad \text{for } n_{inp}=1,\; n_{out}=1,\; n_{par}=5 \]
\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_{obs}} \end{Bmatrix},\qquad \underset{n_{obs} \times 5}{[X]} = \begin{bmatrix} 1 & \cos\theta_1 & \sin\theta_1 & \cos 2\theta_1 & \sin 2\theta_1 \\ 1 & \cos\theta_2 & \sin\theta_2 & \cos 2\theta_2 & \sin 2\theta_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & \cos\theta_{n_{obs}} & \sin\theta_{n_{obs}} & \cos 2\theta_{n_{obs}} & \sin 2\theta_{n_{obs}} \end{bmatrix} \]
Eq. 17

\[ \underset{5 \times 1}{\{\beta\}} = \begin{Bmatrix} a_0 \\ a_1 \\ b_1 \\ a_2 \\ b_2 \end{Bmatrix} \]
Eq. 18
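A minimal MATLAB/Octave sketch of the harmonic fit in Equations 17 and 18, using synthetic angle data (the angle spacing, coefficients and noise level are illustrative assumptions, not taken from this text):

% Two-term harmonic least squares fit (Equations 17-18)
theta = (0:15:345)' * pi/180;                     % input angles [rad]
y = 3 + 2*cos(theta) - sin(theta) ...
      + 0.5*cos(2*theta) + 0.1*randn(size(theta));% noisy harmonic signal

X = [ones(size(theta)) cos(theta) sin(theta) cos(2*theta) sin(2*theta)];
beta = (X' * X) \ (X' * y);                       % {a0 a1 b1 a2 b2}
disp(beta');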
1.1.2.3 PLANE
A linear planar model is presented in Equations 19 and 20.
\[ z = a + b\,x + c\,y = \begin{bmatrix} 1 & x & y \end{bmatrix} \begin{Bmatrix} a \\ b \\ c \end{Bmatrix} \qquad \text{for } n_{inp}=2,\; n_{out}=1,\; n_{par}=3 \]
\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n_{obs}} \end{Bmatrix},\qquad \underset{n_{obs} \times 3}{[X]} = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & y_{n_{obs}} \end{bmatrix} \]
Eq. 19

\[ \underset{3 \times 1}{\{\beta\}} = \begin{Bmatrix} a \\ b \\ c \end{Bmatrix} \]
Eq. 20
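A minimal MATLAB/Octave sketch of the planar fit in Equations 19 and 20, using synthetic coordinate data (illustrative values only):

% Linear plane fit z = a + b*x + c*y (Equations 19-20)
x = rand(20, 1) * 10;
y = rand(20, 1) * 10;
z = 2 + 0.5*x - 1.5*y + 0.1*randn(20, 1);   % noisy planar surface

X = [ones(size(x)) x y];                    % nobs x 3 design matrix
beta = (X' * X) \ (X' * z);                 % {a; b; c}
disp(beta');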
1.1.2.4 BICUBIC
A linear bicubic model is presented in Equations 21 and 22.
\[ z = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix} \begin{bmatrix} b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \\ b_4 & b_8 & b_{12} & b_{16} \end{bmatrix} \begin{Bmatrix} 1 \\ y \\ y^2 \\ y^3 \end{Bmatrix} \qquad \text{for } n_{inp}=2,\; n_{out}=1,\; n_{par}=16 \]
Eq. 21

\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n_{obs}} \end{Bmatrix} \]
Eq. 22a

\[ \underset{n_{obs} \times 16}{[X]} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & y_1 & x_1 y_1 & x_1^2 y_1 & x_1^3 y_1 & y_1^2 & x_1 y_1^2 & x_1^2 y_1^2 & x_1^3 y_1^2 & y_1^3 & x_1 y_1^3 & x_1^2 y_1^3 & x_1^3 y_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 & y_2 & x_2 y_2 & x_2^2 y_2 & x_2^3 y_2 & y_2^2 & x_2 y_2^2 & x_2^2 y_2^2 & x_2^3 y_2^2 & y_2^3 & x_2 y_2^3 & x_2^2 y_2^3 & x_2^3 y_2^3 \\ \vdots & & & & & & & & & & & & & & & \vdots \\ 1 & x_n & x_n^2 & x_n^3 & y_n & x_n y_n & x_n^2 y_n & x_n^3 y_n & y_n^2 & x_n y_n^2 & x_n^2 y_n^2 & x_n^3 y_n^2 & y_n^3 & x_n y_n^3 & x_n^2 y_n^3 & x_n^3 y_n^3 \end{bmatrix} \quad (n = n_{obs}) \]
Eq. 22b

\[ \underset{16 \times 1}{\{\beta\}} = \begin{Bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{16} \end{Bmatrix} \]
Eq. 22c

1.1.3 MULTIPLE DEPENDENT VARIABLES
Multiple simultaneous linear outputs for each input data observation may be modeled using this
methodology as shown in Equation 23. Note that the number of data observations nobs must be
greater than or equal to the number of parameters npar (not npar times nout) to prevent
singularity of the symmetric matrix ([X]T[X]).

\[ \underset{n_{par} \times n_{out}}{[\beta]} = \Bigl( \underset{n_{par} \times n_{obs}}{[X]^T}\; \underset{n_{obs} \times n_{par}}{[X]} \Bigr)^{-1} \underset{n_{par} \times n_{obs}}{[X]^T}\; \underset{n_{obs} \times n_{out}}{[Y]} \]
Eq. 23
A classic example of multiple linear outputs is to compute the magnitude and location of a
vertical force applied on a forceplate transducer comprised of two load cells as shown in Figure
xxx.
full affine strain/graphic transform – displacement, rotation, principal strain, principal directions
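A minimal MATLAB/Octave sketch of Equation 23 with synthetic data (the two outputs and their coefficients are illustrative assumptions, not the forceplate example): because all outputs share the same design matrix, every output column is solved in one set of normal equations.

% Multiple simultaneous linear outputs share one design matrix (Equation 23)
x = (0:0.5:5)';
X = [ones(size(x)) x];                                       % nobs x npar
Y = [(1 + 2*x), (4 - 0.5*x)] + 0.05*randn(numel(x), 2);      % nobs x nout outputs

B = (X' * X) \ (X' * Y);      % npar x nout parameter matrix
disp(B);                      % column j holds the parameters for output j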
1.1.4 WEIGHTING
A weighting factor wi may be assigned to each training pair to emphasize relative importance among the data observations. Residuals {e} may be accentuated or attenuated by pre-multiplying with a diagonal weighting matrix [W] to form weighted residuals {ε} as shown in Equation 23a. The least squares solution for this new set of weighted residuals is shown in Equation 23b.

\[ \begin{Bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{n_{obs}} \end{Bmatrix} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & w_{n_{obs}} \end{bmatrix} \begin{Bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{n_{obs}} \end{Bmatrix} = [W]\,\{e\} \]
Eq. 23a

\[ \{\beta\} = \bigl( [X]^T [W]^2 [X] \bigr)^{-1} [X]^T [W]^2 \{Y\} \]
Eq. 23b
Two concepts are typically used for weights. The simplest form is to use a value of 1 if that data
observation is present in the current data set while a value of 0 is used if that observation is
missing as shown in Equation 23c. This technique is often used for scientific devices that
employ the same data collection protocols for each data set (same nobs each time), but in which
certain observations may be missing or spurious (e.g. photogrammetry with landmarks that are
occasionally not visible).
wi² = 1 for valid data, wi² = 0 for missing data
Eq. 23c
The second approach is to set the square of each weight equal to the inverse of the expected
variance for that observation. The expected variance may be the variance of the respective
dependent variable output, the variance of the independent variable input or a pooled value
representative of both.
wi² = 1 / σi²  for σi² = variance of observation i
Eq. 23d
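A minimal MATLAB/Octave sketch of the weighted solution in Equation 23b, using the 0/1 weighting of Equation 23c to exclude one spurious observation (the data values are illustrative assumptions):

% Weighted least squares (Equations 23a-23d)
x = (1:8)';
y = 1 + 2*x;  y(5) = 40;                 % observation 5 is spurious
w2 = ones(size(x));  w2(5) = 0;          % squared weights: 1 = valid, 0 = missing (Eq. 23c)

X  = [ones(size(x)) x];
W2 = diag(w2);                           % diagonal matrix of squared weights
beta = (X' * W2 * X) \ (X' * W2 * y);    % weighted normal equations, Equation 23b
disp(beta');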
1.1.5 SCALING
Independent variable inputs may be mean centered, xi* = xi - xo, and then scaled so that the largest |xi*| becomes ±1.
1.2 LINEARIZED MODELS
It is often tempting to linearize models of nonlinear phenomena in an effort to arrive at estimates
for parameters. One should exercise caution however in that linear least squares solutions for
linearized models minimize the sum of squares of residuals for nonlinear forms of the dependent
variables rather than residuals of those dependent variables themselves.
1.2.1 NEWTONIAN COOLING
A classic example is to model the temperature T of an object as a function of time τ during exponential Newtonian cooling from an initial temperature T0 toward ambient temperature T∞ as shown in Equation 24. Taking the logarithm after a simple algebraic rearrangement provides the linearized model in Equation 25. If the ambient temperature T∞ is known, measuring temperatures Ti of the object at times τi allows direct estimation of the Newtonian cooling rate b and indirect estimation of the initial temperature difference above ambient with the linearized matrix model in Equations 26. Specifically, this approach minimizes residuals of the nonlinear function ln(T - T∞) rather than residuals of the dependent variable temperature T. Note that simultaneous estimation of ambient temperature T∞ requires an iterative algorithm.
\[ T = T_\infty + (T_0 - T_\infty)\,e^{-b\tau} \]
Eq. 24

\[ \ln(T - T_\infty) = \ln(T_0 - T_\infty) - b\,\tau = \begin{bmatrix} 1 & \tau \end{bmatrix} \begin{Bmatrix} \ln(T_0 - T_\infty) \\ -b \end{Bmatrix} \]
Eq. 25

\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} \ln(T_1 - T_\infty) \\ \ln(T_2 - T_\infty) \\ \vdots \\ \ln(T_{n_{obs}} - T_\infty) \end{Bmatrix},\qquad \underset{n_{obs} \times 2}{[X]} = \begin{bmatrix} 1 & \tau_1 \\ 1 & \tau_2 \\ \vdots & \vdots \\ 1 & \tau_{n_{obs}} \end{bmatrix},\qquad \underset{2 \times 1}{\{\beta\}} = \begin{Bmatrix} \ln(T_0 - T_\infty) \\ -b \end{Bmatrix} \]
Eq. 26
1.2.2 LOGARITHMIC SPIRAL
The logarithmic spiral often appears in nature, such as the cross-section of a nautilus shell, atmospheric low pressure spirals or the arms of galaxies, as shown in Equations 27. Given the center location xc and yc, the x and y data points on the spiral may be linearized as shown in Equations 28, providing the linearized matrix model of Equations 29. Again, this approach minimizes residuals of the nonlinear function ln(r) rather than residuals of the independent variables x and y. Simultaneous estimation of center location xc and yc requires iterative techniques. Note that there are no actual dependent variables for closed form geometric models estimated from coordinate data.
\[ r = a\,e^{b\theta},\qquad x = x_c + r\cos\theta,\qquad y = y_c + r\sin\theta \]
Eq. 27

\[ r = \sqrt{(x - x_c)^2 + (y - y_c)^2},\qquad \tan\theta = \frac{y - y_c}{x - x_c},\qquad \ln r = \ln a + b\,\theta = \begin{bmatrix} 1 & \theta \end{bmatrix} \begin{Bmatrix} \ln a \\ b \end{Bmatrix} \]
Eq. 28

\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} \ln r_1 \\ \ln r_2 \\ \vdots \\ \ln r_{n_{obs}} \end{Bmatrix},\qquad \underset{n_{obs} \times 2}{[X]} = \begin{bmatrix} 1 & \theta_1 \\ 1 & \theta_2 \\ \vdots & \vdots \\ 1 & \theta_{n_{obs}} \end{bmatrix},\qquad \underset{2 \times 1}{\{\beta\}} = \begin{Bmatrix} \ln a \\ b \end{Bmatrix} \]
Eq. 29
1.2.3 CIRCLE
Another example is to estimate the coordinates a and b for the center of a circle of unknown
radius r by measuring the coordinates of points x and y on the circle as shown in the familiar
nonlinear model in Equation 30. Expanding and rearranging provides the linearized model
shown in Equation 31. Measuring coordinates xi and yi for points on the circle allows direct
estimation of the center coordinates a and b and indirect estimation of the radius with the
linearized model in Equations 32. This approach minimizes residuals of the nonlinear function
(x² + y²) rather than residuals for the independent variables x and y.
(x - a)² + (y - b)² = r²
Eq. 30
\[ (x^2 + y^2) = \begin{bmatrix} 1 & x & y \end{bmatrix} \begin{Bmatrix} r^2 - a^2 - b^2 \\ 2a \\ 2b \end{Bmatrix} \]
Eq. 31

\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} x_1^2 + y_1^2 \\ x_2^2 + y_2^2 \\ \vdots \\ x_{n_{obs}}^2 + y_{n_{obs}}^2 \end{Bmatrix},\qquad \underset{n_{obs} \times 3}{[X]} = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & y_{n_{obs}} \end{bmatrix},\qquad \underset{3 \times 1}{\{\beta\}} = \begin{Bmatrix} r^2 - a^2 - b^2 \\ 2a \\ 2b \end{Bmatrix} \]
Eq. 32
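A minimal MATLAB/Octave sketch of the linearized circle fit in Equations 30 through 32, using synthetic points on a circle (the center, radius and noise level are illustrative assumptions):

% Linearized circle fit (Equations 30-32)
t = linspace(0, 2*pi, 40)';
x = 3 + 5*cos(t) + 0.05*randn(size(t));
y = -2 + 5*sin(t) + 0.05*randn(size(t));

X    = [ones(size(x)) x y];
Yv   = x.^2 + y.^2;                  % nonlinear function being fit
beta = (X' * X) \ (X' * Yv);

a = beta(2) / 2;                     % center x (direct)
b = beta(3) / 2;                     % center y (direct)
r = sqrt(beta(1) + a^2 + b^2);       % radius (indirect)
fprintf('a = %.3f, b = %.3f, r = %.3f\n', a, b, r);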
If radius r is known a priori, the problem may be reformulated to find the best fit for center coordinates a and b as shown in Equations 33. The solution for the term (-a² - b²) may not be completely consistent with the solution for terms (2a) and (2b) if the data does not exactly model the given radius r. Again, note that there are no actual dependent variables for closed form geometric models estimated from coordinate data.

\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} x_1^2 + y_1^2 - r^2 \\ x_2^2 + y_2^2 - r^2 \\ \vdots \\ x_{n_{obs}}^2 + y_{n_{obs}}^2 - r^2 \end{Bmatrix},\qquad \underset{n_{obs} \times 3}{[X]} = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & y_{n_{obs}} \end{bmatrix},\qquad \underset{3 \times 1}{\{\beta\}} = \begin{Bmatrix} -a^2 - b^2 \\ 2a \\ 2b \end{Bmatrix} \]
Eq. 33
1.2.4 SPHERE
Following the circle example above, one can estimate the coordinates a, b and c for the center of a sphere of unknown radius r as shown in Equations 34.

\[ \underset{n_{obs} \times 1}{\{Y\}} = \begin{Bmatrix} x_1^2 + y_1^2 + z_1^2 \\ x_2^2 + y_2^2 + z_2^2 \\ \vdots \\ x_{n_{obs}}^2 + y_{n_{obs}}^2 + z_{n_{obs}}^2 \end{Bmatrix},\qquad \underset{n_{obs} \times 4}{[X]} = \begin{bmatrix} 1 & x_1 & y_1 & z_1 \\ 1 & x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n_{obs}} & y_{n_{obs}} & z_{n_{obs}} \end{bmatrix},\qquad \underset{4 \times 1}{\{\beta\}} = \begin{Bmatrix} r^2 - a^2 - b^2 - c^2 \\ 2a \\ 2b \\ 2c \end{Bmatrix} \]
Eq. 34
1.3 EXAMPLE DATA
1.3.1 ANSCOMBE'S DATA SET
Ref: Anscombe, F.J. (1973) Graphs in statistical analysis. Amer. Statistician 27:17-21
Anscombe's datasets
x1=[10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00 ]';
y1=[ 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68 ]';
x2=x1;
y2=[ 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74 ]';
x3=x1;
y3=[ 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73 ]';
x4=[ 8.00 8.00 8.00 8.00 8.00 8.00 8.00 19.00 8.00 8.00 8.00 ]';
y4=[ 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.50 5.56 7.91 6.89 ]';
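Fitting the straight-line model of Section 1.1.1 to each of these four data sets illustrates why example data should always be plotted: all four fits return essentially the same parameters (offset near 3.0 and slope near 0.5) even though the underlying patterns differ radically. A minimal MATLAB/Octave sketch, assuming the vectors x1 through y4 above have already been entered:

% Straight-line fits to Anscombe's four data sets
Xs = [x1 x2 x3 x4];
Ys = [y1 y2 y3 y4];
for k = 1:4
    X = [ones(size(Xs(:,k))) Xs(:,k)];
    beta = (X' * X) \ (X' * Ys(:,k));
    fprintf('set %d: b = %.2f, m = %.3f\n', k, beta(1), beta(2));
end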
2.0 EIGENVECTOR METHODS
The linear methods described above minimize the sum of squares of residuals between observed
outputs and output values predicted from multiple input observations. However for some
numerical models, the concepts of independent inputs and dependent outputs are not well
defined. Fitting geometric models to k dimensional point coordinate data is an example. The
coordinates for each point are typically grouped into an input vector of length ninp=k and there
are no dependent variable outputs. In general, input vectors from nobs number of data
observations are pooled to estimate npar number of model parameters.
Linear least squares methods minimize the sum of squares of residuals parallel to the dependent
variable. For eigenvector methods, the residuals are nominally orthogonal to the model.
Orthogonal distance fitting …
2.1 LINE
The parametric equation for a line in k dimensional space describes the location of any point {x}
on the line in terms of a given point {p} on the line plus a directed scalar distance s measured
from {p} in the unit direction û along the line as shown in Equation 32. However, individual
data observations {xi} for points representing the line may not fit the model perfectly due to
experimental measurement error, process variation or insufficient model complexity as shown in
Equation 33.
\[ \underset{k \times 1}{\{x\}} = \underset{k \times 1}{\{p\}} + s\,\underset{k \times 1}{\hat{u}} \]
Eq. 32

\[ \{x_i\} \approx \{p\} + s\,\hat{u} \]
Eq. 33
The optimal least-squares estimate for the given point {p} on the line will be the centroid {xo} of the nobs data observations {xi} as shown in Equation 34. The symmetric centroidal point distribution matrix [S] may then be formed as shown in Equation 35. Its diagonal eigenvalue matrix [D] and orthogonal eigenvector matrix [V] shown in Equation 36 provide least-squares solutions.
\[ \{p\} = \{x_o\} = \frac{1}{n_{obs}} \sum_{i=1}^{n_{obs}} \{x_i\} \]
Eq. 34

\[ \underset{k \times k}{[S]} = \frac{1}{n_{obs}} \sum_{i=1}^{n_{obs}} \bigl( \{x_i\} - \{x_o\} \bigr) \bigl( \{x_i\} - \{x_o\} \bigr)^T \]
Eq. 35
[S] = [V] [D] [V]T
for [V] = [{v1}{v2}…{vk}] and [D] = diag( d1 d2 … dk )
Eq. 36
Perfect data observations of points on a line would produce a single non-zero eigenvalue of [S]
along a direction parallel to the line. Consequently for real-world data observations, the optimal
least-squares estimate for the unit direction û will be parallel to the eigenvector that
corresponds to the largest eigenvalue as shown in Equation 37. The largest eigenvalue will be
equal to the variance of the data points about their centroid along the line. The other eigenvalues are
principal values of variance perpendicular to the line.
û = {vj} / norm{vj}
for {vj} corresponding to largest dj for a line
Eq. 37
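A minimal MATLAB/Octave sketch of the eigenvector line fit in Equations 34 through 37, using synthetic two-dimensional points (the line and noise level are illustrative assumptions):

% Eigenvector (orthogonal distance) fit of a line to 2D points (Equations 32-37)
t = linspace(-5, 5, 50)';
P = [1 + 2*t, 3 - t] + 0.1*randn(50, 2);     % nobs x k point coordinates

xo = mean(P, 1);                             % centroid, Equation 34
Q  = P - xo;                                 % centered points
S  = (Q' * Q) / size(P, 1);                  % k x k distribution matrix, Equation 35

[V, D] = eig(S);                             % eigenvectors and eigenvalues, Equation 36
[~, j] = max(diag(D));                       % largest eigenvalue
u = V(:, j) / norm(V(:, j));                 % unit direction along the line, Equation 37
fprintf('direction = [%.3f %.3f]\n', u(1), u(2));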
2.2 PLANE
The equation for a plane in k dimensional space describes the location of any point {x} on the plane based on the perpendicular distance δ from the coordinate origin to the plane, as determined by a given point {p} on the plane and the unit normal n̂ to the plane, as shown in Equation 38. However, individual data observations {xi} for points representing the plane may not fit the model perfectly due to experimental measurement error, process variation or insufficient model complexity as shown in Equation 39.

\[ \{x\}^T \hat{n} = \{p\}^T \hat{n} = \delta \]
Eq. 38

\[ \{x_i\}^T \hat{n} \approx \{p\}^T \hat{n} = \delta \]
Eq. 39
The optimal least-squares estimate for the given point {p} on the plane will be the centroid {xo} of the nobs data observations {xi} as shown in Equation 34 above. The symmetric centroidal point distribution matrix [S], diagonal eigenvalue matrix [D] and orthogonal eigenvector matrix [V] shown in Equations 35 and 36 above again provide least-squares solutions as described below.
Perfect data observations of points on a plane would produce a single zero eigenvalue of [S] in a direction normal to the plane. Consequently for real-world data observations, the optimal least-squares estimate for the unit normal n̂ will be parallel to the eigenvector that corresponds to the smallest eigenvalue as shown in Equation 40. The smallest eigenvalue will be equal to the variance of the data points perpendicular to the plane. The other eigenvalues are principal values of variance within the plane.
n̂ = {vj} / norm{vj}
for {vj} corresponding to smallest dj for a plane
Eq. 40
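A minimal MATLAB/Octave sketch of the eigenvector plane fit, reusing the centroid and distribution matrix of Equations 34 and 35 and selecting the smallest eigenvalue per Equation 40 (the synthetic data are illustrative assumptions):

% Eigenvector fit of a plane to 3D points (Equations 38-40)
[xg, yg] = meshgrid(linspace(-2, 2, 10));
P = [xg(:), yg(:), 0.5*xg(:) - 0.25*yg(:) + 1] + 0.02*randn(100, 3);

xo = mean(P, 1);                        % centroid = point on the plane, Equation 34
Q  = P - xo;
S  = (Q' * Q) / size(P, 1);             % 3 x 3 distribution matrix, Equation 35

[V, D] = eig(S);
[~, j] = min(diag(D));                  % smallest eigenvalue
n = V(:, j) / norm(V(:, j));            % unit normal to the plane, Equation 40
delta = xo * n;                         % perpendicular distance from origin, Equation 38
fprintf('normal = [%.3f %.3f %.3f], delta = %.3f\n', n, delta);

Unlike the regression plane of Section 1.1.2.3, this fit minimizes distances measured perpendicular to the plane rather than residuals measured along z.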
2.3 ELLIPSE
2.4 ELLIPSOID
2.5 2D QUADRIC
2.6 3D QUADRIC
3.0 LEVENBERG-MARQUARDT
The Levenberg-Marquardt algorithm iteratively adjusts estimates of model parameters {β} to minimize residuals between measured dependent variable outputs {y} and predictions from a numerical model f(.) based on independent variable inputs {x} as shown in Equation 44. For a given set of model parameters {β}k at iteration k, each measured training pair {x}i and {y}i will have residuals {e}i,k as shown in Equation 45. For parameter updates {Δβ} shown in Equation 46, the Taylor series expansion for residuals at iteration k+1 may be written as shown in Equation 47 using the Jacobian [J] of the numerical model with respect to model parameters in Equation 48.

\[ \underset{n_{out} \times 1}{\{y\}} \approx f\Bigl( \underset{n_{inp} \times 1}{\{x\}},\; \underset{n_{par} \times 1}{\{\beta\}} \Bigr) \]
Eq. 44

\[ \{e\}_{i,k} = \{y\}_i - f\bigl( \{x\}_i,\; \{\beta\}_k \bigr) \]
Eq. 45

\[ \{\beta\}_{k+1} = \{\beta\}_k + \{\Delta\beta\} \]
Eq. 46

\[ \underset{n_{out} \times 1}{\{e\}_{i,k+1}} \approx \underset{n_{out} \times 1}{\{e\}_{i,k}} - \underset{n_{out} \times n_{par}}{[J]_{i,k}}\; \underset{n_{par} \times 1}{\{\Delta\beta\}} \]
Eq. 47

\[ \underset{n_{out} \times n_{par}}{[J]_{i,k}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial \beta_1} & \dfrac{\partial f_1}{\partial \beta_2} & \cdots & \dfrac{\partial f_1}{\partial \beta_{n_{par}}} \\[6pt] \dfrac{\partial f_2}{\partial \beta_1} & \dfrac{\partial f_2}{\partial \beta_2} & \cdots & \dfrac{\partial f_2}{\partial \beta_{n_{par}}} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial f_{n_{out}}}{\partial \beta_1} & \dfrac{\partial f_{n_{out}}}{\partial \beta_2} & \cdots & \dfrac{\partial f_{n_{out}}}{\partial \beta_{n_{par}}} \end{bmatrix} \qquad \text{evaluated at } \{x\}_i,\; \{\beta\}_k \]
Eq. 48
If only one training pair is available and the number of dependent variable outputs is equal to the number of parameters (nobs=1 and nout=npar), Equation 47 is deterministic and one can try to drive all nout residuals {e}i,k+1 to zero using Equation 49. This provides the classical Newton-Raphson root finding algorithm shown in Equation 50.

\[ \{\Delta\beta\} = [J]_{i,k}^{-1}\,\{e\}_{i,k} \]
Eq. 49

\[ \{\Delta\beta\} = -[J]_{i,k}^{-1}\, f\bigl( \{x\}_i,\; \{\beta\}_k \bigr) \qquad \text{for } \{y\}_i = 0 \]
Eq. 50
If only one training pair is available and the number of dependent variable outputs is larger than
the number of parameters (nobs=1 and nout>npar), residuals {e}i,k+1 at iteration k+1 can be
minimized by the standard linear least squares solution shown in Equation 51.
\[ \underset{n_{par} \times 1}{\{\Delta\beta\}} = \Bigl( \underset{n_{par} \times n_{out}}{[J]_{i,k}^T}\; \underset{n_{out} \times n_{par}}{[J]_{i,k}} \Bigr)^{-1} \underset{n_{par} \times n_{out}}{[J]_{i,k}^T}\; \underset{n_{out} \times 1}{\{e\}_{i,k}} \]
Eq. 51
However, if the number of parameters is greater than the number of dependent variable outputs, Equation 51 is row insufficient and multiple training pairs are required (nobs>1 for nout<npar). Residuals from all training pairs at iteration k shown in Equation 45 may be concatenated as shown in Equation 52, providing an aggregate sum of squares SSQ over all observations. Similarly, all residuals predicted at iteration k+1 for update {Δβ} in Equation 47 may be concatenated as shown in Equation 53. The linear least squares solution for parameter updates that will minimize the predicted aggregate SSQ at iteration k+1 is then shown in Equations 54 and 55.

\[ SSQ_k = \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{n_{obs},k} \end{Bmatrix}^T \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{n_{obs},k} \end{Bmatrix} = \sum_{i=1}^{n_{obs}} \{e\}_{i,k}^T\,\{e\}_{i,k} \]
Eq. 52

\[ \begin{Bmatrix} \{e\}_{1,k+1} \\ \{e\}_{2,k+1} \\ \vdots \\ \{e\}_{n_{obs},k+1} \end{Bmatrix} \approx \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{n_{obs},k} \end{Bmatrix} - \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{n_{obs},k} \end{bmatrix} \{\Delta\beta\} \]
Eq. 53

\[ \{\Delta\beta\} = \left( \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{n_{obs},k} \end{bmatrix}^T \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{n_{obs},k} \end{bmatrix} \right)^{-1} \begin{bmatrix} [J]_{1,k} \\ [J]_{2,k} \\ \vdots \\ [J]_{n_{obs},k} \end{bmatrix}^T \begin{Bmatrix} \{e\}_{1,k} \\ \{e\}_{2,k} \\ \vdots \\ \{e\}_{n_{obs},k} \end{Bmatrix} \]
Eq. 54

\[ \{\Delta\beta\} = \left( \sum_{i=1}^{n_{obs}} [J]_{i,k}^T\,[J]_{i,k} \right)^{-1} \left( \sum_{i=1}^{n_{obs}} [J]_{i,k}^T\,\{e\}_{i,k} \right) \]
Eq. 55
Equation 55 provides rapid second order Newtonian convergence but can become unstable if the square Jacobian summation is nearly singular. Levenberg and Marquardt showed that a positive factor λ added to the diagonal elements of the square Jacobian summation matrix as shown in Equation 56 can provide both rapid and stable convergence. For very small values of λ, this provides Newtonian convergence similar to Equation 55. For larger values of λ, this provides small but stable steps along the gradient as shown in Equation 57.
\[ \{\Delta\beta\} = \Bigl( \underset{n_{par} \times n_{par}}{\sum_{i=1}^{n_{obs}} [J]_{i,k}^T\,[J]_{i,k}} + \lambda\,[I] \Bigr)^{-1} \Bigl( \sum_{i=1}^{n_{obs}} [J]_{i,k}^T\,\{e\}_{i,k} \Bigr) \]
Eq. 56

\[ \{\Delta\beta\} \approx \frac{1}{\lambda} \sum_{i=1}^{n_{obs}} [J]_{i,k}^T\,\{e\}_{i,k} \qquad \text{for large } \lambda \]
Eq. 57
If parameter updates provide a stable step with smaller aggregate SSQ than prior iterations, factor λ is reduced in preparation for the next iteration. If parameter updates provide an unstable step with larger aggregate SSQ than prior iterations, those updates are rejected, factor λ is increased and the process is repeated. Typically λ is started at a value of 0.1, is reduced by a factor of 10 for stable steps, and is increased by a factor of 10 for unstable steps.

Convergence may be assessed by observing when absolute values of parameter updates are small while the aggregate SSQ approaches the expected standard deviation of residuals. Observing the progression of factor λ can also help indicate convergence.
The algorithm may be summarized as follows.
1) Postulate initial estimates for parameters {β}
2) Evaluate aggregate SSQ over all training pairs for the initial parameter estimates (Equation 52)
3) Set factor λ = 0.1
4) Proceed through all training pairs
   a) Evaluate all residuals {e}i,k (Equation 45)
   b) Evaluate all Jacobians [J]i,k (Equation 48)
   c) Accumulate the summations Σ [J]i,kT [J]i,k and Σ [J]i,kT {e}i,k over all training pairs (Equation 55)
5) Add factor λ to the diagonal and compute parameter updates {Δβ} (Equation 56)
6) Update parameters {β}k+1 (Equation 46)
7) Evaluate aggregate SSQ over all training pairs for the new parameter estimates (Equation 52)
8) If aggregate SSQ has been reduced:
   a) Reduce factor λNEW = λOLD / 10
   b) Proceed with the next iteration at step 4)
9) If aggregate SSQ has increased:
   a) Discard the new parameter estimates and use the immediate prior values
   b) Increase factor λNEW = λOLD * 10
   c) Proceed with the next iteration at step 5)
Because of its robust performance, the Levenberg-Marquardt method is often used with finite difference numerical approximations for the Jacobian [J] of the numerical model with respect to model parameters. Note that this Jacobian must be re-evaluated for each training pair at each iteration, whether analytically or numerically.
Penalty functions may be added to residuals to impose explicit or implicit inequality constraints
on parameters. However as with any gradient technique, convergence may dither across
constraint boundaries if the minimum SSQ is at a constraint boundary.
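A compact MATLAB/Octave sketch of the iteration summarized above, applied to the Newtonian cooling model of Section 1.2.1 with the ambient temperature estimated simultaneously; the model, data values, finite difference step and convergence tolerance are illustrative assumptions, and the Jacobian is approximated numerically as discussed above.

% Levenberg-Marquardt fit of T = Tinf + (T0 - Tinf)*exp(-b*tau)   (Equations 44-57)
% beta = [Tinf; T0; b]; Jacobian by forward finite differences
f = @(tau, beta) beta(1) + (beta(2) - beta(1)) .* exp(-beta(3) .* tau);

tau = (0:0.5:10)';                                   % synthetic observation times
y   = f(tau, [20; 80; 0.3]) + 0.2*randn(size(tau));  % noisy "measured" temperatures

beta = [10; 60; 0.1];                  % step 1: initial parameter estimates
lam  = 0.1;                            % step 3: initial factor lambda
SSQ  = sum((y - f(tau, beta)).^2);     % step 2: initial aggregate SSQ (Eq. 52)

for iter = 1:100
    e = y - f(tau, beta);                            % residuals (Eq. 45)
    J = zeros(numel(tau), numel(beta));              % Jacobian (Eq. 48), finite differences
    for j = 1:numel(beta)
        db     = zeros(size(beta));  db(j) = 1e-6 * max(1, abs(beta(j)));
        J(:,j) = (f(tau, beta + db) - f(tau, beta)) / db(j);
    end
    JtJ = J' * J;  Jte = J' * e;                     % accumulated summations (Eq. 55)

    dbeta   = (JtJ + lam * eye(numel(beta))) \ Jte;  % step 5: damped update (Eq. 56)
    betaNew = beta + dbeta;                          % step 6 (Eq. 46)
    SSQnew  = sum((y - f(tau, betaNew)).^2);         % step 7 (Eq. 52)

    if SSQnew < SSQ                                  % step 8: accept and reduce lambda
        beta = betaNew;  SSQ = SSQnew;  lam = lam / 10;
        if norm(dbeta) < 1e-8, break; end            % small updates indicate convergence
    else                                             % step 9: reject and increase lambda
        lam = lam * 10;
    end
end
fprintf('Tinf = %.2f, T0 = %.2f, b = %.3f, SSQ = %.3f\n', beta, SSQ);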
Levenberg, K. "A Method for the Solution of Certain Problems in Least Squares." Quart. Appl.
Math. 2, 164-168, 1944.
Marquardt, D. "An Algorithm for Least-Squares Estimation of Nonlinear Parameters." SIAM J.
Appl. Math. 11, 431-441, 1963.
4.0 SIMPLEX
Nelder, J.A. and Mead, R. "A Simplex Method for Function Minimization." Computer Journal,
7, 308-???, 1965.
5.0 GRADIENT SEARCH
6.0 NEURAL NETWORK
7.0 GENETIC ALGORITHMS
8.0 SIMULATED ANNEALING
9.0 KALMAN FILTER