
RM-05 Curve Fitting

CURVE FITTING
Pavan Chakraborty, Indian Institute of Information Technology - Allahabad
CURVE FITTING
Part 5
• Describes techniques to fit curves (curve fitting) to discrete data to obtain
intermediate estimates.
• There are two general approaches to curve fitting:
• Data exhibit a significant degree of scatter. The strategy is to derive a single curve that
represents the general trend of the data.
• Data is very precise. The strategy is to pass a curve or a series of curves through each of the
points.
• In engineering, two types of applications are encountered:
• Trend analysis. Predicting values of the dependent variable; this may include extrapolation beyond the data points or interpolation between them.
• Hypothesis testing. Comparing an existing mathematical model with measured data.
[Figure PT5.1]
Mathematical Background
Simple Statistics
• If, in the course of an engineering study, several measurements are made of a particular quantity, additional insight can be gained by summarizing the data in one or more well-chosen statistics that convey as much information as possible about specific characteristics of the data set.
• These descriptive statistics are most often selected to represent
• The location of the center of the distribution of the data,
• The degree of spread of the data.
• Arithmetic mean. The sum of the individual data points (y_i) divided by the number of points (n):

$$\bar{y} = \frac{\sum y_i}{n}, \qquad i = 1, \ldots, n$$
• Standard deviation. The most common measure of spread for a sample:

$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum (y_i - \bar{y})^2$$

or, equivalently,

$$s_y^2 = \frac{\sum y_i^2 - \left(\sum y_i\right)^2 / n}{n-1}$$
• Variance. The spread can also be represented by the square of the standard deviation:

$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$$

where n − 1 is the number of degrees of freedom.
• Coefficient of variation. Quantifies the spread of the data relative to its mean:

$$\mathrm{c.v.} = \frac{s_y}{\bar{y}} \times 100\%$$
[Figure PT5.2]

[Figure PT5.3]
Least Squares Regression
Linear Regression
• Fitting a straight line to a set of paired observations: (x1, y1), (x2,
y2),…,(xn, yn).
$$y = a_0 + a_1 x + e$$

where a_1 is the slope, a_0 the intercept, and e the error, or residual, between the model and the observations.
Criteria for a “Best” Fit
• One option is to minimize the sum of the residual errors for all available data:

$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)$$

where n is the total number of points.
• However, this is an inadequate criterion, and so is the sum of the absolute values:

$$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} \left| y_i - a_0 - a_1 x_i \right|$$
[Figure 17.2]
• The best strategy is to minimize the sum of the squares of the residuals between the measured y and the y calculated with the linear model:

$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_{i,\mathrm{measured}} - y_{i,\mathrm{model}} \right)^2 = \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right)^2$$

• This criterion yields a unique line for a given set of data.
Least-Squares Fit of a Straight Line
Setting the partial derivatives of S_r with respect to each coefficient to zero,

$$\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum (y_i - a_0 - a_1 x_i)\, x_i = 0$$

These expand into the normal equations, which can be solved simultaneously:

$$n a_0 + \left( \sum x_i \right) a_1 = \sum y_i$$

$$\left( \sum x_i \right) a_0 + \left( \sum x_i^2 \right) a_1 = \sum x_i y_i$$

$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad a_0 = \bar{y} - a_1 \bar{x}$$

where \bar{x} and \bar{y} are the mean values.
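As a quick illustration, these formulas can be evaluated directly in MATLAB. This is a minimal sketch, not from the original slides; the data vectors here are assumed for illustration only:

x = [1 2 3 4 5 6 7];                 % assumed sample data
y = [0.5 2.5 2.0 4.0 3.5 6.0 5.5];
n  = length(x);
a1 = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);   % slope
a0 = mean(y) - a1*mean(x);                                       % intercept
% check: polyfit(x,y,1) returns the same coefficients as [a1 a0]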
[Figure 17.3]

[Figure 17.4]

[Figure 17.5]
“Goodness” of our fit
If
• the total sum of the squares around the mean for the dependent variable, y, is S_t, and
• the sum of the squares of the residuals around the regression line is S_r,
then S_t − S_r quantifies the improvement, or error reduction, due to describing the data in terms of a straight line rather than as an average value:

$$r^2 = \frac{S_t - S_r}{S_t}$$

r² is the coefficient of determination; r = √(r²) is the correlation coefficient.
• For a perfect fit, S_r = 0 and r = r² = 1, signifying that the line explains 100 percent of the variability of the data.
• For r = r² = 0, S_r = S_t and the fit represents no improvement.
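These quantities are easy to compute numerically. A minimal sketch, assuming x, y, a0, and a1 from the straight-line sketch above:

St = sum((y - mean(y)).^2);        % total sum of squares around the mean
Sr = sum((y - (a0 + a1*x)).^2);    % sum of squared residuals around the line
r2 = (St - Sr)/St;                 % coefficient of determination
r  = sqrt(r2);                     % correlation coefficient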
Polynomial Regression
• Some engineering data is poorly represented by a straight line. For these cases, a curve is better suited to fit the data. The least-squares method can readily be extended to fit the data to higher-order polynomials (Sec. 17.2).
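For example (a sketch reusing the assumed x and y above), MATLAB's polyfit handles higher orders directly:

p2   = polyfit(x, y, 2);     % quadratic fit; coefficients returned highest power first
yhat = polyval(p2, x);       % evaluate the fitted polynomial
Sr2  = sum((y - yhat).^2);   % residual sum of squares of the quadratic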
General Linear Least Squares

$$y = a_0 z_0 + a_1 z_1 + a_2 z_2 + \cdots + a_m z_m + e$$

where z_0, z_1, …, z_m are m + 1 basis functions. In matrix form,

$$\{Y\} = [Z]\{A\} + \{E\}$$

where [Z] is the matrix of the calculated values of the basis functions at the measured values of the independent variable, {Y} the observed values of the dependent variable, {A} the unknown coefficients, and {E} the residuals.

$$S_r = \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{m} a_j z_{ji} \right)^2$$

S_r is minimized by taking its partial derivative with respect to each of the coefficients and setting the resulting equation equal to zero.
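In MATLAB the least-squares solution of {Y} = [Z]{A} is one backslash away. A minimal sketch; the basis functions chosen here (1, x, sin x) are an assumption for illustration:

Z = [ones(size(x(:))) x(:) sin(x(:))];   % each column is one basis function z_j(x_i)
A = Z \ y(:);                            % least-squares coefficients {A}
yfit = Z*A;                              % fitted values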
Data with Error
• Last time we showed that it is possible to fit a straight line of the form y(x) = a + bx to any data by minimizing chi-square:

$$\chi^2 = \sum \left( \frac{y_i - y(x_i)}{s_i} \right)^2 = \sum \frac{1}{s_i^2} \left( y_i - a - b x_i \right)^2$$
• We found that we could solve for the parameters a and b that minimize the difference between the fitted line
and the data (with errors si) as:
$$a = \frac{1}{\Delta} \left( \sum \frac{x_i^2}{s_i^2} \sum \frac{y_i}{s_i^2} - \sum \frac{x_i}{s_i^2} \sum \frac{x_i y_i}{s_i^2} \right)$$

$$b = \frac{1}{\Delta} \left( \sum \frac{1}{s_i^2} \sum \frac{x_i y_i}{s_i^2} - \sum \frac{x_i}{s_i^2} \sum \frac{y_i}{s_i^2} \right)$$

where

$$\Delta = \sum \frac{1}{s_i^2} \sum \frac{x_i^2}{s_i^2} - \left( \sum \frac{x_i}{s_i^2} \right)^2$$

Note: if the errors s_i are all the same, they cancel and so can be ignored (for common s, Σ 1/s_i² simply becomes N/s²).
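A minimal sketch (assumed, not from the slides) evaluating these weighted formulas directly; the data vectors x and y and the per-point errors s are assumed given:

w   = 1./s.^2;                              % weights 1/s_i^2
Del = sum(w)*sum(w.*x.^2) - sum(w.*x)^2;    % Delta
a   = (sum(w.*x.^2)*sum(w.*y) - sum(w.*x)*sum(w.*x.*y)) / Del;   % intercept a
b   = (sum(w)*sum(w.*x.*y)   - sum(w.*x)*sum(w.*y))     / Del;   % slope b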

MATLAB Commands for Linear Fits
• MATLAB has a “low level” routine called polyfit() that can be used to fit a linear function to a set of points (it assumes all errors are equal):
• x = 0:0.1:2.5;
• y = randn(1,26)*0.3 + 3 + 2*x; % makes a slightly noisy linear set of points, y = 3 + 2x
• p = polyfit(x,y,1) % fits a straight line and returns the values of b and a in p
p = 2.0661 2.8803 % in this case, the fit equation is y = 2.8803 + 2.0661x
• Plot the points, and overplot the fit using the polyval() function:
• plot(x,y,'.');
• hold on
• plot(x,polyval(p,x),'r')
[Figure: “Points and Fit” — the data points and the fitted line, y vs. x]
• Here is the result. The points have a scatter of s = 0.3 around the fit, as we specified above.
Fits with Unequal Errors
• MATLAB has another routine called glmfit() (generalized linear model regression) that can be used to specify weights (unequal errors). Say the errors are proportional to the square root of y (like Poisson):
• err = sqrt(y)/3;
• hold off
• errorbar(x,y,err,'.') % plots y vs. x, with error bars
• Now calculate the fit using glmfit():
• p = glmfit(x,y) % does the same as polyfit(x,y,1) (assumes equal errors)
• p = glmfit(x,y,'normal','weights',1./err.^2) % allows specification of errors via weights (the weights are inverse variances, 1/s_i^2), but the 'normal' distribution type must be included
• p = circshift(p,1) % unfortunately, glmfit returns a,b instead of b,a as polyval() wants; circshift() has the effect of swapping the order of the elements of p
• hold on
• plot(x,polyval(p,x),'r')
[Figure: “Points with Errors, and Fit” — the data points with error bars and the weighted fit, y vs. x]
Error Estimation
• We saw that when the errors of each point in the plot are the same, they cancel from the
equations for the fit parameters a and b. If we do not, in fact, know the errors, but we
believe that a linear fit is a correct fit to the points, then we can use the fit itself to
estimate those errors.
• First, consider the residuals (the deviations of the points from the fit, y_i − y(x_i)), calculated by
• resid = y - polyval(p,x)
• plot(x,resid)
[Figure: “Residuals vs. x” — the residuals scatter randomly about zero]
• You can see that this is a random distribution with zero mean, as it should be. As usual, you can calculate the variance, where now there are two fewer degrees of freedom (m = 2) because we used the data to determine a and b:

$$s^2 \approx \frac{1}{N - m} \sum \left( y_i - y(x_i) \right)^2 = \frac{1}{N - 2} \sum \left( y_i - a - b x_i \right)^2$$
• Indeed, typing std(resid) gives 0.2998, which is close to the 0.3 we used.
• Once we have the fitted line, either using individual weighting or assuming uniform errors,
how do we know how good the fit is? That is, what are the errors in the determination of
the parameters a and b?
Chi-Square Probability
• Recall that the value of χ² is

$$\chi^2 = \sum \left( \frac{y_i - y(x_i)}{s_i} \right)^2 = \sum \frac{1}{s_i^2} \left( y_i - a - b x_i \right)^2$$

• This should be about equal to the number of degrees of freedom, ν = N − 2 = 24 in this case. Since s_i is a constant 0.3, we can bring it out of the summation and calculate
• sum(resid.^2)/0.3^2
ans = 24.9594
• As we mentioned last time, it is often easier to consider the reduced chi-square,

$$\chi_\nu^2 = \frac{\chi^2}{\nu}$$

which is about unity for a good fit. In this case, χ²_ν = 24.9594/24 = 1.04. If we look this value up in Table C.4 of the text, we find that P ≈ 0.4, which means that if we repeated the experiment multiple times, about 40% would be expected to have a larger chi-square.
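If the Statistics Toolbox is available, the table lookup can be replaced by chi2cdf. A sketch (an assumption, not part of the original slides), reusing resid from above:

chi2 = sum(resid.^2)/0.3^2;     % chi-square, as computed above
nu   = 26 - 2;                  % degrees of freedom
P    = 1 - chi2cdf(chi2, nu);   % probability of a larger chi-square (about 0.4 here)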
Uncertainties in the Parameters
• We started out searching for a linear fit, y(x) = a + bx, where a and b are the parameters of the fit. What are the uncertainties in these parameters?
• We saw in Lecture 14 that errors due to a series of measurements propagate to the result for, e.g., a, according to

$$s_a^2 = \sum \left( s_i \frac{\partial a}{\partial y_i} \right)^2$$

• Since we have the expressions for a and b as

$$a = \frac{1}{\Delta} \left( \sum \frac{x_i^2}{s_i^2} \sum \frac{y_i}{s_i^2} - \sum \frac{x_i}{s_i^2} \sum \frac{x_i y_i}{s_i^2} \right), \qquad b = \frac{1}{\Delta} \left( \sum \frac{1}{s_i^2} \sum \frac{x_i y_i}{s_i^2} - \sum \frac{x_i}{s_i^2} \sum \frac{y_i}{s_i^2} \right)$$

• the partial derivatives are (note that Δ is independent of y_i)

$$\frac{\partial a}{\partial y_j} = \frac{1}{\Delta} \left( \frac{1}{s_j^2} \sum \frac{x_i^2}{s_i^2} - \frac{x_j}{s_j^2} \sum \frac{x_i}{s_i^2} \right), \qquad \frac{\partial b}{\partial y_j} = \frac{1}{\Delta} \left( \frac{x_j}{s_j^2} \sum \frac{1}{s_i^2} - \frac{1}{s_j^2} \sum \frac{x_i}{s_i^2} \right)$$
Uncertainties in the Parameters (continued)
• Inserting these into the expressions

$$s_a^2 = \sum \left( s_i \frac{\partial a}{\partial y_i} \right)^2, \qquad s_b^2 = \sum \left( s_i \frac{\partial b}{\partial y_i} \right)^2$$

we have, after some algebra (see text page 109),

$$s_a^2 = \frac{1}{\Delta} \sum \frac{x_i^2}{s_i^2}, \qquad s_b^2 = \frac{1}{\Delta} \sum \frac{1}{s_i^2}$$

• In the case of common s_i = s, these become

$$s_a^2 = \frac{s^2 \sum x_i^2}{N \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad s_b^2 = \frac{N s^2}{N \sum x_i^2 - \left( \sum x_i \right)^2}$$

• For our example, this is calculated as
• del = 26*sum(x.^2)-sum(x)^2
• siga = sum(x.^2)*0.3^2/del % gives 0.0131 (the variance s_a^2)
• sigb = 26*0.3^2/del % gives 0.0062 (the variance s_b^2)
Less Unknowns & More Equations
[Figure: scattered data points in the x–y plane]
Quantifying error in a curve fit
• Positive and negative errors should count the same (a data point may lie above or below the line).
• Greater errors should be weighted more heavily.
[Figure: the same data with a candidate straight line]
[Figure: the same data with a fitted curve y = f(x)]
Hunting for A Shape & Geometric Model to Represent A Data Set
Minimum Energy Method: Least Squares Method
If our fit is a straight line, f(x) = ax + b, then

$$\mathrm{Error} = \sum d_i^2 = \left( y_1 - f(x_1) \right)^2 + \left( y_2 - f(x_2) \right)^2 + \left( y_3 - f(x_3) \right)^2 + \left( y_4 - f(x_4) \right)^2$$

and, in general, for N points,

$$\mathrm{Error} = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^{N} \left( y_i - a x_i - b \right)^2$$
• The ‘best’ line has minimum error between the line and the data points.
• This is called the least squares approach, since the square of the error is minimized:

$$\text{Minimize } \; \mathrm{Error} = \sum_{i=1}^{N} \left( y_i - a x_i - b \right)^2$$

Take the derivative of the error with respect to a and b, and set each to zero:

$$\frac{\partial (\mathrm{Error})}{\partial a} = \frac{\partial}{\partial a} \sum_{i=1}^{N} \left( y_i - a x_i - b \right)^2 = 0, \qquad \frac{\partial (\mathrm{Error})}{\partial b} = \frac{\partial}{\partial b} \sum_{i=1}^{N} \left( y_i - a x_i - b \right)^2 = 0$$
$$\frac{\partial (\mathrm{Error})}{\partial a} = -2 \sum_{i=1}^{N} x_i \left( y_i - a x_i - b \right) = 0$$

$$\frac{\partial (\mathrm{Error})}{\partial b} = -2 \sum_{i=1}^{N} \left( y_i - a x_i - b \right) = 0$$

Solve for the a and b that make both of these equations equal to zero.
$$a \sum_{i=1}^{N} x_i^2 + b \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} x_i y_i$$

$$a \sum_{i=1}^{N} x_i + b N = \sum_{i=1}^{N} y_i$$

Put these into matrix form:

$$\begin{bmatrix} N & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{Bmatrix} b \\ a \end{Bmatrix} = \begin{Bmatrix} \sum y_i \\ \sum x_i y_i \end{Bmatrix}$$

Solving,

$$a = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad b = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{N \sum x_i^2 - \left( \sum x_i \right)^2}$$
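The matrix form above can be solved directly in MATLAB with the backslash operator. A minimal sketch, assuming data vectors x and y:

N  = length(x);
M  = [N sum(x); sum(x) sum(x.^2)];   % normal-equation matrix
v  = [sum(y); sum(x.*y)];            % right-hand side
ba = M \ v;                          % ba(1) = b (intercept), ba(2) = a (slope)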
Is a straight line suitable for each of these cases?
The Least-Squares mth Degree Polynomial
When using an mth degree polynomial

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m$$

to approximate the given set of data (x_1, y_1), (x_2, y_2), …, (x_n, y_n), where n ≥ m, the best fitting curve has the least squared error, i.e.,

$$\text{Minimize } \; \mathrm{Error} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i - a_2 x_i^2 - \cdots - a_m x_i^m \right)^2$$

To obtain the least squared error, the unknown coefficients a_0, a_1, …, a_m must yield zero first derivatives. Expanding the resulting equations gives the normal equations,

$$a_0 \sum_{i=1}^{n} x_i^j + a_1 \sum_{i=1}^{n} x_i^{j+1} + \cdots + a_m \sum_{i=1}^{n} x_i^{j+m} = \sum_{i=1}^{n} x_i^j y_i, \qquad j = 0, 1, \ldots, m$$

and the unknown coefficients can hence be obtained by solving these linear equations. No matter what the order j, we always get equations that are LINEAR with respect to the coefficients. This means we can use the following solution method.
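One such solution method, sketched below in MATLAB for a quadratic (m = 2 is an assumption; the data vectors x and y are assumed given):

V = [ones(size(x(:))) x(:) x(:).^2];   % columns are 1, x, x^2
a = (V'*V) \ (V'*y(:));                % solve the normal equations for a0, a1, a2
% polyfit(x,y,2) returns the same coefficients, highest power first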
Selection of Order of Fit
The 2nd and 6th order fits look similar, but the 6th has a ‘squiggle’ to it. Is it required or not?
Under Fit or Over Fit: Picking an Appropriate Order
• Underfit: the order is too low to capture obvious trends in the data.
• Overfit: over-doing the requirement for the fit to ‘match’ the data trend (order too high).
• Polynomials become more ‘squiggly’ as their order increases; a ‘squiggly’ appearance comes from inflections in the function.
General rule: pick a polynomial form at least several orders lower than the number of data points. Start with linear and add orders until the trends are matched.
Linear Regression Analysis
• Linear curve fitting
• Polynomial curve fitting
• Power-law curve fitting: y = ax^b, linearized as ln(y) = ln(a) + b ln(x)
• Exponential curve fitting: y = ae^(bx), linearized as ln(y) = ln(a) + bx (a linearized fit is sketched below)
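A sketch of the power-law case via this linearization; the data here are assumed, chosen to follow roughly y = 2x:

xp = [1 2 4 8 16];  yp = [2.1 4.2 8.3 16.5 33.2];   % assumed data
q  = polyfit(log(xp), log(yp), 1);                  % straight-line fit in log-log space
b  = q(1);                                          % exponent b
a  = exp(q(2));                                     % prefactor a, since intercept = ln(a)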
Goodness of fit and the correlation coefficient
• A measure of how good the regression line is as a representation of the data.
• It is possible to fit two lines to the data:
• (a) treating x as the independent variable and y as the dependent variable: y = ax + b, or
• (b) treating y as the independent variable and x as the dependent variable, described by a relation of the form x = a′y + b′.
• The procedure followed earlier can be applied again to find the best values of a′ and b′.
$$a' \sum_{i=1}^{N} y_i^2 + b' \sum_{i=1}^{N} y_i = \sum_{i=1}^{N} x_i y_i$$

$$a' \sum_{i=1}^{N} y_i + b' N = \sum_{i=1}^{N} x_i$$

Put these into matrix form:

$$\begin{bmatrix} N & \sum y_i \\ \sum y_i & \sum y_i^2 \end{bmatrix} \begin{Bmatrix} b' \\ a' \end{Bmatrix} = \begin{Bmatrix} \sum x_i \\ \sum x_i y_i \end{Bmatrix}$$

For x = a′y + b′,

$$a' = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum y_i^2 - \left( \sum y_i \right)^2}$$

Recast the second fit line as

$$y = \frac{1}{a'} x - \frac{b'}{a'}$$

so 1/a′ is the slope of this second line, which is not the same as the slope of the first line.
• The ratio of the slopes of the two lines is a measure of how good the form of the fit is to the data.
• In view of this, the correlation coefficient ρ is defined through the relation

$$\rho^2 = \frac{\text{slope of first regression line}}{\text{slope of second regression line}} = a \, a'$$

With

$$y = ax + b \;\Rightarrow\; a = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad x = a'y + b' \;\Rightarrow\; a' = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{N \sum y_i^2 - \left( \sum y_i \right)^2}$$

this gives

$$\rho^2 = \frac{\left( N \sum x_i y_i - \sum x_i \sum y_i \right)^2}{\left[ N \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ N \sum y_i^2 - \left( \sum y_i \right)^2 \right]}$$

$$\rho = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[ N \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ N \sum y_i^2 - \left( \sum y_i \right)^2 \right]}}$$
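A sketch (assumed, not from the slides) evaluating ρ from these sums; MATLAB's built-in corrcoef returns the same quantity:

N   = length(x);
rho = (N*sum(x.*y) - sum(x)*sum(y)) / ...
      sqrt((N*sum(x.^2) - sum(x)^2) * (N*sum(y.^2) - sum(y)^2));
% check: R = corrcoef(x,y); R(1,2) equals rho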
Correlation Coefficient
• The sign of the correlation coefficient is determined by the sign of the covariance. If the regression line has a negative slope, the correlation coefficient is negative; it is positive if the regression line has a positive slope.
• The correlation is said to be perfect if ρ = ±1.
• The correlation is poor if ρ ≈ 0.
• The absolute value of the correlation coefficient should be greater than 0.5 to indicate that y and x are related.
• In the case of a non-linear fit, a quantity known as the index of correlation is defined to determine the goodness of the fit.
• The fit is termed good if the variance of the deviates is much less than the variance of the y's.
• It is required that the index of correlation, defined below, be close to ±1 for the fit to be considered good.
$$\rho = \pm \sqrt{ 1 - \frac{\sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} }, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$
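Numerically, this is a two-liner. A sketch, under the assumption that yfit holds the fitted values f(x_i):

dev  = y(:) - yfit(:);                            % deviates from the fit
rho2 = 1 - sum(dev.^2)/sum((y - mean(y)).^2);     % squared index of correlation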
2=1.000
2=0.821
2=0.991
2=0.493
2=0.904
2=0.0526
Multi-Variable Regression Analysis
• The cases considered so far involved one independent variable and one dependent variable.
• Sometimes the dependent variable may be a function of more than one variable. For example, the relation of the form

$$\dot{m} = f\!\left( \Delta p,\; T,\; p,\; A,\; \frac{d_{orifice}}{d_{pipe}} \right)$$

is a common type of relationship for flow through an orifice or Venturi.
• The mass flow rate is the dependent variable and the others are independent variables.
Set up a mathematical model as:
d

 p 
c
d  orifice 
  a
m
p A 



 RT 
 d pipe 

b
e
Taking logarithm both sides

 d orifice 

 p 
   ln a   b ln 
lnm
 c lnp   d ln A  e ln 



 RT 
 d pipe 

Simply:
y  ln a   bl  cm  dn  eo
where y is the dependent variable, l, m, n, o and p are independent variables and a, b, c, d, e
are the fit parameters.
The least square method may be used to determine the fit parameters.
Let the data be available for set of N values of y, l, m, n, o, p values.
The quantity to be minimized is given by
N
Error    yi  a  bli  cmi  dni  eoi  fpi 
i 1
What is the permissible value of N ?
2
The normal equations are obtained by the usual process of setting the first partial derivatives with respect to the fit parameters to zero:

$$\frac{\partial (\mathrm{Error})}{\partial a} = -2 \sum_{i=1}^{N} \left( y_i - a - b l_i - c m_i - d n_i - e o_i - f p_i \right) = 0$$

$$\frac{\partial (\mathrm{Error})}{\partial b} = -2 \sum_{i=1}^{N} l_i \left( y_i - a - b l_i - c m_i - d n_i - e o_i - f p_i \right) = 0$$

and similarly for c, d, e and f. These yield

$$N a + b \sum l_i + c \sum m_i + d \sum n_i + e \sum o_i + f \sum p_i = \sum y_i$$

$$a \sum l_i + b \sum l_i^2 + c \sum l_i m_i + d \sum l_i n_i + e \sum l_i o_i + f \sum l_i p_i = \sum l_i y_i$$

and so on for the remaining parameters. These equations are solved simultaneously to get the six fit parameters. We may also calculate the index of correlation as an indicator of the quality of the fit. This calculation is left to you! A sketch of the simultaneous solution follows.