By Jugal Kalita
Based on Chapter 17 of Chapra and Canale,
Numerical Methods for Engineers
Copyright © 2006 The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
• Describes techniques to fit curves (curve fitting) to discrete data to obtain intermediate estimates.
• There are two general approaches to curve fitting:
– Data exhibit a significant degree of scatter. The strategy is to derive a single curve that represents the general trend of the data.
– Data are very precise. The strategy is to pass a curve or a series of curves through each of the points.
• Two types of applications:
– Trend analysis. Predicting values of the dependent variable; may include extrapolation beyond the data points or interpolation between them.
– Hypothesis testing. Comparing existing mathematical model with measured data.
Simple Statistics
• In the course of a study, if several measurements are made of a particular quantity, additional insight can be gained by summarizing the data in one or more well-chosen statistics that convey as much information as possible about specific characteristics of the data set.
• These descriptive statistics are most often selected to represent
– The location of the center of the distribution of the data,
– The degree of spread of the data.
• Arithmetic mean. The sum of the individual data points $y_i$ divided by the number of points $n$:
$$\bar{y} = \frac{\sum y_i}{n}, \quad i = 1, \ldots, n$$
• Standard deviation. The most common measure of spread for a sample:
$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum (y_i - \bar{y})^2$$
• Variance. Represents spread by the square of the standard deviation:
$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$$
where $n - 1$ is the number of degrees of freedom.
• Coefficient of variation. Quantifies the spread of the data relative to its mean:
$$c.v. = \frac{s_y}{\bar{y}} \times 100\%$$
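As a quick illustration, the statistics above can be computed directly. A minimal Python sketch; the $y$ values are hypothetical, chosen only for illustration:

```python
# Descriptive statistics for a sample: mean, standard deviation,
# variance, and coefficient of variation (hypothetical data).
import math

y = [8.8, 9.4, 10.0, 9.8, 10.1, 9.5]
n = len(y)

ybar = sum(y) / n                        # arithmetic mean
St = sum((yi - ybar) ** 2 for yi in y)   # total sum of squares about the mean
var = St / (n - 1)                       # variance (n - 1 degrees of freedom)
sy = math.sqrt(var)                      # standard deviation
cv = sy / ybar * 100                     # coefficient of variation, in percent
```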
±1 standard deviation about the mean includes about 68% of the data; ±2 standard deviations, about 95.5%; ±3 standard deviations, about 99.7% (for normally distributed data).
• Fitting a straight line to a set of paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:
$$y = a_0 + a_1 x + e$$
$a_1$ – slope
$a_0$ – intercept
$e$ – error, or residual, between the model and the observations
Criteria for a “Best” Fit
• Minimize the sum of the residual errors for all available data:
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)$$
where $n$ = total number of points.
• However, this is an inadequate criterion, and so is the sum of the absolute values:
$$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - a_0 - a_1 x_i|$$
Figure 17.2 a) Minimize sum of residuals, b) Minimize absolute values of residuals, c) Minimize the maximum error of any individual point.
• Best strategy is to minimize the sum of the squares of the residuals between the measured y and the y calculated with the linear model:
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_{i,\text{measured}} - y_{i,\text{model}})^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
• Yields a unique line for a given set of data.
Least-Squares Fit of a Straight Line
$$\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i) = 0$$
$$\frac{\partial S_r}{\partial a_1} = -2 \sum \left[ (y_i - a_0 - a_1 x_i)\, x_i \right] = 0$$
Expanding:
$$0 = \sum y_i - \sum a_0 - \sum a_1 x_i$$
$$0 = \sum x_i y_i - \sum a_0 x_i - \sum a_1 x_i^2$$
Since $\sum a_0 = n a_0$, these become the normal equations, which can be solved simultaneously:
$$n a_0 + \left( \sum x_i \right) a_1 = \sum y_i$$
$$\left( \sum x_i \right) a_0 + \left( \sum x_i^2 \right) a_1 = \sum x_i y_i$$
giving
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$
$$a_0 = \bar{y} - a_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the mean values.
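The normal-equation formulas above translate directly into code. A minimal sketch; the data points are hypothetical:

```python
# Least-squares straight-line fit via the normal equations
# (hypothetical data).
def fit_line(x, y):
    """Return (a0, a1) for the model y = a0 + a1*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a0 = sy / n - a1 * sx / n                       # a0 = ybar - a1*xbar
    return a0, a1

x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
a0, a1 = fit_line(x, y)
```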
Figure 17.3
The residual in linear regression represents the vertical distance between a data point and the straight line.
Figure 17.4
a) Shows the spread of the data about the mean. b) Shows the spread of the data about the best-fit line. The reduction in spread from a) to b) indicates that the straight line describes the trend in the data better than the mean does.
Figure 17.5
Linear Regression with a) small and b) large residual errors.
“Goodness” of our fit
If
• the total sum of the squares around the mean for the dependent variable $y$ is $S_t$, and
• the sum of the squares of the residuals around the regression line is $S_r$,
then $S_t - S_r$ quantifies the improvement (error reduction) due to describing the data in terms of a straight line rather than as an average value:
$$r^2 = \frac{S_t - S_r}{S_t}$$
$r^2$ – coefficient of determination
$r = \sqrt{r^2}$ – correlation coefficient
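A sketch of computing $r^2$ from $S_t$ and $S_r$ as defined above; the data points are hypothetical, and the line is fitted with the normal-equation formulas:

```python
# Coefficient of determination for a straight-line fit
# (hypothetical data).
def r_squared(x, y, a0, a1):
    n = len(y)
    ybar = sum(y) / n
    St = sum((yi - ybar) ** 2 for yi in y)                       # about the mean
    Sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))   # about the line
    return (St - Sr) / St

x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
n = len(x)
a1 = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) \
     / (n * sum(xi * xi for xi in x) - sum(x) ** 2)
a0 = sum(y) / n - a1 * sum(x) / n
r2 = r_squared(x, y, a0, a1)
```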
• For a perfect fit, $S_r = 0$ and $r = r^2 = 1$, signifying that the line explains 100 percent of the variability of the data.
• For $r = r^2 = 0$, $S_r = S_t$ and the fit represents no improvement.
Linearization of Nonlinear relationships
• It is possible to linearize some nonlinear relationships.
• Figure 17.9: a) Exponential equation b) Power equation c) Saturation-growth-rate equation
(2) Power equation:
$$y = a_2 x^{b_2}$$
linearized as
$$\ln y = \ln a_2 + b_2 \ln x$$
(3) Saturation-growth-rate equation (models, e.g., population growth under limiting conditions):
$$y = a_3 \frac{x}{b_3 + x}$$
linearized as
$$\frac{1}{y} = \frac{1}{a_3} + \frac{b_3}{a_3} \frac{1}{x}$$
Be careful about the implied distribution of the errors. Always use the untransformed values for error analysis.
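The power-equation transform above can be sketched in code: fit a straight line in log–log space, then back-transform the intercept. The data are synthetic, generated from a known power law so the recovered coefficients can be checked:

```python
# Power-law fit via linearization: ln(y) = ln(a2) + b2*ln(x).
# Data generated from y = 2*x**1.5, so the fit recovers a2=2, b2=1.5.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi ** 1.5 for xi in x]

lx = [math.log(xi) for xi in x]
ly = [math.log(yi) for yi in y]

n = len(x)
sx, sy = sum(lx), sum(ly)
sxy = sum(a * b for a, b in zip(lx, ly))
sxx = sum(a * a for a in lx)
b2 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # exponent (slope in log space)
a2 = math.exp(sy / n - b2 * sx / n)             # coefficient, back-transformed
```

Note the caveat above: this minimizes squared error in the *transformed* (log) variables, which is not the same as least squares on the original data.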
• Some data are poorly represented by a straight line. For these cases a curve is better suited to fit the data. The least-squares method can readily be extended to fit the data to higher-order polynomials (Sec. 17.2).
The general linear least-squares model:
$$y = a_0 z_0 + a_1 z_1 + a_2 z_2 + \cdots + a_m z_m + e$$
where $z_0, z_1, \ldots, z_m$ are $m + 1$ basis functions. In matrix form:
$$\{Y\} = [Z]\{A\} + \{E\}$$
$[Z]$ – matrix of the calculated values of the basis functions at the measured values of the independent variable
$\{Y\}$ – observed values of the dependent variable
$\{A\}$ – unknown coefficients
$\{E\}$ – residuals
$$S_r = \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{m} a_j z_{ji} \right)^2$$
is minimized by taking its partial derivative with respect to each of the coefficients and setting the resulting equation equal to zero.
Least-Squares Regression
Given: $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Obtain: "best fit" curve
$$f(x) = a_0 Z_0(x) + a_1 Z_1(x) + a_2 Z_2(x) + \cdots + a_m Z_m(x)$$
The $a_i$'s are the unknown parameters of the model; the $Z_i$'s are known functions of $x$.
We will focus on two of the many possible types of regression models:
Simple Linear Regression: $Z_0(x) = 1$ and $Z_1(x) = x$
General Polynomial Regression: $Z_0(x) = 1,\ Z_1(x) = x,\ Z_2(x) = x^2, \ldots,\ Z_m(x) = x^m$
Least Squares Regression (cont'd):
General Procedure: for the $i$th data point $(x_i, y_i)$, we find the set of coefficients for which:
$$y_i = a_0 Z_0(x_i) + a_1 Z_1(x_i) + \cdots + a_m Z_m(x_i) + e_i$$
where $e_i$ is the residual error, the difference between the reported value and the model:
$$e_i = y_i - a_0 Z_0(x_i) - a_1 Z_1(x_i) - \cdots - a_m Z_m(x_i)$$
Our "best fit" will minimize the total sum of the squares of the residuals:
$$S_r = \sum_{i=1}^{n} e_i^2$$
Our "best fit" will be the function which minimizes the sum of squares of the residuals:
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - a_0 Z_0(x_i) - a_1 Z_1(x_i) - \cdots - a_m Z_m(x_i) \right)^2$$
Least Squares Regression (cont'd):
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - a_0 Z_0(x_i) - \cdots - a_m Z_m(x_i) \right)^2$$
To minimize this expression with respect to the unknowns $a_0, a_1, \ldots, a_m$, take derivatives of $S_r$ and set them to zero:
$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} Z_0(x_i) \left( y_i - a_0 Z_0(x_i) - \cdots - a_m Z_m(x_i) \right) = 0$$
$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} Z_1(x_i) \left( y_i - a_0 Z_0(x_i) - \cdots - a_m Z_m(x_i) \right) = 0$$
$$\vdots$$
$$\frac{\partial S_r}{\partial a_m} = -2 \sum_{i=1}^{n} Z_m(x_i) \left( y_i - a_0 Z_0(x_i) - \cdots - a_m Z_m(x_i) \right) = 0$$
Least Squares Regression (cont'd):
$$\{Y\} = [Z]\{A\} + \{E\} \quad \text{or} \quad \{E\} = \{Y\} - [Z]\{A\}$$
where $\{E\}$ and $\{Y\}$ are $n \times 1$, $[Z]$ is $n \times (m+1)$, and $\{A\}$ is $(m+1) \times 1$; $n$ = number of points, $m+1$ = number of unknowns.
$$\{E\}^T = [e_1\ e_2\ \ldots\ e_n], \quad \{Y\}^T = [y_1\ y_2\ \ldots\ y_n], \quad \{A\}^T = [a_0\ a_1\ \ldots\ a_m]$$
$$[Z] = \begin{bmatrix} Z_0(x_1) & Z_1(x_1) & \cdots & Z_m(x_1) \\ Z_0(x_2) & Z_1(x_2) & \cdots & Z_m(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ Z_0(x_n) & Z_1(x_n) & \cdots & Z_m(x_n) \end{bmatrix}$$
Least Squares Regression (cont'd):
$$\{E\} = \{Y\} - [Z]\{A\}$$
Then
$$S_r = \{E\}^T\{E\} = (\{Y\} - [Z]\{A\})^T (\{Y\} - [Z]\{A\})$$
$$= \{Y\}^T\{Y\} - \{A\}^T[Z]^T\{Y\} - \{Y\}^T[Z]\{A\} + \{A\}^T[Z]^T[Z]\{A\}$$
$$= \{Y\}^T\{Y\} - 2\{A\}^T[Z]^T\{Y\} + \{A\}^T[Z]^T[Z]\{A\}$$
Setting $\partial S_r / \partial a_i = 0$ for $i = 0, \ldots, m$ yields:
$$0 = 2[Z]^T[Z]\{A\} - 2[Z]^T\{Y\}$$
or
$$[Z]^T[Z]\{A\} = [Z]^T\{Y\}$$
$$[Z]^T[Z]\{A\} = [Z]^T\{Y\} \quad \text{(C&C Eq. 17.25)}$$
This is the general form of the normal equations. They provide $m+1$ equations in $m+1$ unknowns.
(Note that we end up with a system of linear equations.)
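The general normal equations can be sketched in code for polynomial regression ($Z_j(x) = x^j$). This is an illustrative pure-Python implementation, not the book's, and the data are synthetic (an exact quadratic, so the recovered coefficients can be checked):

```python
# Polynomial regression via the normal equations [Z]^T[Z]{A} = [Z]^T{Y}.
def solve(M, b):
    """Naive Gaussian elimination with partial pivoting (illustrative)."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        for r in range(k + 1, n):
            f = A[r][k] / A[k][k]
            for c in range(k, n + 1):
                A[r][c] -= f * A[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (A[k][n] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]
    return x

def poly_fit(xs, ys, m):
    """Coefficients a0..am minimizing Sr, with basis Z_j(x) = x**j."""
    n, p = len(xs), m + 1
    Z = [[xi ** j for j in range(p)] for xi in xs]
    ZtZ = [[sum(Z[i][r] * Z[i][c] for i in range(n)) for c in range(p)]
           for r in range(p)]
    ZtY = [sum(Z[i][r] * ys[i] for i in range(n)) for r in range(p)]
    return solve(ZtZ, ZtY)

# Exact quadratic data: y = 1 + 2x + 3x^2
x = [0, 1, 2, 3, 4]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]
a = poly_fit(x, y, 2)
```

In practice one would use a library least-squares solver (the normal equations square the condition number of $[Z]$), but the sketch mirrors the derivation above.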
Simple Linear Regression (m = 1):
Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ with $n > 2$
Obtain: "best fit" curve $f(x) = a_0 + a_1 x$ from the $n$ equations:
$$y_1 = a_0 + a_1 x_1 + e_1$$
$$y_2 = a_0 + a_1 x_2 + e_2$$
$$\vdots$$
$$y_n = a_0 + a_1 x_n + e_n$$
Or, in matrix form, $[Z]^T[Z]\{A\} = [Z]^T\{Y\}$:
$$\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{Bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{Bmatrix}$$
Simple Linear Regression (m = 1): Normal Equations
$[Z]^T[Z]\{A\} = [Z]^T\{Y\}$, upon multiplying the matrices, becomes
$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} = \begin{Bmatrix} \sum y_i \\ \sum x_i y_i \end{Bmatrix}$$
Normal equations for linear regression, C&C Eqs. (17.4-5).
(This form works well for spreadsheets.)
$$[Z]^T[Z]\{A\} = [Z]^T\{Y\}$$
Solving for $\{A\}$:
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$
$$a_0 = \frac{1}{n} \sum y_i - a_1 \frac{1}{n} \sum x_i = \bar{y} - a_1 \bar{x}$$
C&C equations (17.6) and (17.7)
$$[Z]^T[Z]\{A\} = [Z]^T\{Y\}$$
A better version of the first normal equation is:
$$a_1 = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}$$
which is easier and numerically more stable, but the second equation remains the same:
$$a_0 = \bar{y} - a_1 \bar{x}$$
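The centered form of the slope formula can be sketched as follows; the data points are hypothetical:

```python
# Straight-line fit using the centered (numerically more stable)
# slope formula, then a0 = ybar - a1*xbar (hypothetical data).
def fit_line_centered(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    a1 = (sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - a1 * xbar, a1

x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
a0, a1 = fit_line_centered(x, y)
```

Subtracting the means first avoids the catastrophic cancellation that the raw-sum formula can suffer when the $x_i$ are large and closely spaced.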
Common Nonlinear Relations:
Objective: use linear equations for simplicity.
Remedy: transform the data into linear form and perform regression.
Given: data which appear as:
(1) Exponential-like curve: $y = a_1 e^{b_1 x}$
(e.g., population growth, radioactive decay, attenuation of a transmission line)
Can also use: $\ln y = \ln a_1 + b_1 x$
1. In all regression models one is solving an overdetermined system of equations, i.e., more equations than unknowns.
2. How good is the fit? Often based on a coefficient of determination, $r^2$.
Standard Error of the Estimate
Precision: if the spread of the points around the line is of similar magnitude along the entire range of the data, then one can use
$$s_{y/x} = \sqrt{\frac{S_r}{n - (m+1)}} = \text{standard error of the estimate}$$
(the standard deviation in $y$) to describe the precision of the regression estimate, in which $m+1$ is the number of coefficients calculated for the fit (e.g., $m+1 = 2$ for linear regression).
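A sketch of computing the standard error of the estimate for a straight-line fit ($m+1 = 2$ coefficients); the data points are hypothetical:

```python
# Standard error of the estimate s_{y/x} = sqrt(Sr / (n - (m+1)))
# for a straight-line fit (hypothetical data).
import math

def std_error(x, y, a0, a1, m=1):
    """s_{y/x} for the line y = a0 + a1*x; m+1 = number of coefficients."""
    n = len(x)
    Sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(Sr / (n - (m + 1)))

x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
n = len(x)
a1 = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) \
     / (n * sum(xi * xi for xi in x) - sum(x) ** 2)
a0 = sum(y) / n - a1 * sum(x) / n
s_yx = std_error(x, y, a0, a1)
```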