EM375 STATISTICS AND MEASUREMENT UNCERTAINTY
LEAST SQUARES LINEAR REGRESSION ANALYSIS
In this unit of the course we investigate fitting a straight line to measured (x, y) data pairs.
The equation we want to fit is of the form:
π‘Œπ‘Œ = π‘Žπ‘Žπ‘Žπ‘Ž + 𝑏𝑏
where a is the slope and b is the intercept of the line. If we only have two data pairs we can
fit a unique line to them. However, when we have more data pairs we could fit a large
number of different lines that pass through the scatter of data points. The problem
therefore arises: how do we choose the “best” straight line?
There are several different methods. The one we develop here is the least squares error
approach. It is the method most commonly used in engineering, and functions that calculate
the solution are readily available in Excel, MATLAB, and many handheld calculators.
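In MATLAB, for example, the fit is a one-line call to polyfit. A minimal sketch, using hypothetical example data (substitute your own measurements):

% Minimal sketch: straight-line least-squares fit in MATLAB.
% x and y are hypothetical example data vectors of equal length.
x = [0 1 2 3 4 5];
y = [0.3 1.1 2.2 2.8 4.1 5.2];
p = polyfit(x, y, 1);            % first-order (straight line) fit
slope = p(1); intercept = p(2);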
Let us assume we have n data pairs $(x_i, y_i)$ with $i = 1 \ldots n$. We make the assumption that all x
values have no uncertainty – they are known exactly. The only uncertainty is associated
with the y values. If we knew the actual slope and intercept of the line, we could predict
the expected y-value for any given x-value. We choose upper case Y to represent the expected
values and lower case y to represent the measured values. For each measurement there will
be an error $e_i$ (the difference) between the expected value and the observed value. Thus:
π‘Œπ‘Œπ‘–π‘– = π‘Žπ‘Žπ‘₯π‘₯𝑖𝑖 + 𝑏𝑏
𝑒𝑒𝑖𝑖 = π‘Œπ‘Œπ‘–π‘– − 𝑦𝑦𝑖𝑖
= π‘Žπ‘Žπ‘₯π‘₯𝑖𝑖 + 𝑏𝑏 − 𝑦𝑦𝑖𝑖
The values are demonstrated in the figure below.
[Figure: a scatter of measured points with a fitted line. For a given x-value, the expected value $Y_i$ lies on the line, the observed value $y_i$ is the measured point, and the error $e_i$ is the vertical distance between them.]
The approach to finding the “least squares best fit” line is to minimize the sum of the
squares of the errors, E.
$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(Y_i - y_i)^2 = \sum_{i=1}^{n}(a x_i + b - y_i)^2$$
Remember that at this stage we do not yet know the slope or intercept, so we cannot do
any numerical calculations with our data.
We wish to minimize the error E by choosing the 'best' slope and intercept. We do this by
differentiating the error function separately with respect to (wrt) the slope and the intercept,
and setting each result to zero to minimize the total error:
wrt slope, $a$:
$$\frac{\partial E}{\partial a} = \sum_{i=1}^{n} 2(a x_i + b - y_i)\,x_i = 2\left(a\sum_{i=1}^{n} x_i^2 + bn\bar{x} - \sum_{i=1}^{n} x_i y_i\right) = 0$$
wrt intercept, $b$:
$$\frac{\partial E}{\partial b} = \sum_{i=1}^{n} 2(a x_i + b - y_i) = 2n(a\bar{x} + b - \bar{y}) = 0$$
Recognizing (for example) that
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
We solve these two equations simultaneously to get:
∑ π‘₯π‘₯π‘₯π‘₯ − 𝑛𝑛π‘₯π‘₯Μ… 𝑦𝑦�
∑ π‘₯π‘₯ 2 − 𝑛𝑛π‘₯π‘₯Μ… 2
intercept: 𝑏𝑏� = 𝑦𝑦� − π‘Žπ‘ŽοΏ½π‘₯π‘₯Μ…
slope:
π‘Žπ‘ŽοΏ½ =
We call the resulting line $Y = \hat{a}x + \hat{b}$ the least-squares best-fit line (or just the "best-fit
line") to the data set $(x_i, y_i)$, where $\hat{a}$ is the best estimate of the
actual slope a, and $\hat{b}$ is the best estimate of the actual intercept b.
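These closed-form expressions translate directly into a few lines of MATLAB. A minimal sketch, using hypothetical example data and checking the hand-computed estimates against polyfit:

% Sketch: best-fit slope and intercept from the closed-form sums.
% The data are hypothetical; any equal-length vectors will do.
x = [1 2 3 4 5 6];
y = [1.9 4.1 6.2 7.8 10.1 12.2];
n = length(x);
xbar = mean(x); ybar = mean(y);
ahat = (sum(x.*y) - n*xbar*ybar) / (sum(x.^2) - n*xbar^2); % slope
bhat = ybar - ahat*xbar;                                   % intercept
p = polyfit(x, y, 1);  % p(1), p(2) match ahat, bhat to round-off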
UNCERTAINTY IN THE ESTIMATES
One concern is that the above equations give the “best” estimate, but how accurate is that
estimate? We investigate a method that gives us an estimate of the uncertainty in the
calculated slope and intercept. In other words, we wish to determine the confidence
intervals $\delta_{SLOPE}$ and $\delta_{INTERCEPT}$ such that, at a chosen level of significance:
$$\hat{a} - \delta_{SLOPE} < a < \hat{a} + \delta_{SLOPE}$$
$$\hat{b} - \delta_{INTERCEPT} < b < \hat{b} + \delta_{INTERCEPT}$$
In this course we quote the uncertainties, but do not develop them from first principles.
First we calculate the Mean Square Error (MSE) of the data set. In some texts (for example,
Wheeler & Ganji) the MSE is called the standard error of the estimate, $S_{y,x}$. The MSE can be
calculated as:
$$\mathrm{MSE} = S_{y,x} = \sqrt{\frac{\sum(\mathrm{residual})^2}{n-2}} = \sqrt{\frac{\sum(y_i - Y_i)^2}{n-2}} = \sqrt{\frac{\sum y_i^2 - n\hat{b}\bar{y} - \hat{a}\sum x_i y_i}{n-2}}$$
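As a numerical check, the two forms of $S_{y,x}$ can be evaluated side by side. A sketch continuing in the same workspace as the snippet above (it reuses the hypothetical x, y, n, ybar, ahat, and bhat):

% Sketch: standard error of the estimate, computed two ways.
Yfit = ahat*x + bhat;                       % expected values Y_i
Syx1 = sqrt(sum((y - Yfit).^2) / (n - 2));  % from the residuals
Syx2 = sqrt((sum(y.^2) - n*bhat*ybar - ahat*sum(x.*y)) / (n - 2));
% Syx1 and Syx2 agree to round-off error.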
The confidence intervals can be calculated as follows, where $t_{\alpha,n-2}$ is the Student-t statistic for a
level of significance $\alpha$ and $(n-2)$ degrees of freedom:
$$\delta_{SLOPE} = \pm t_{\alpha,n-2}\, S_{y,x} \sqrt{\frac{1}{\sum x_i^2 - n\bar{x}^2}}$$
$$\delta_{INTERCEPT} = \pm t_{\alpha,n-2}\, S_{y,x} \sqrt{\frac{1}{n} + \frac{(\tilde{x} - \bar{x})^2}{\sum x_i^2 - n\bar{x}^2}}$$
We see that the confidence interval for the slope is a single quantity.
However, the confidence interval for the intercept depends upon x. In the equation, $\tilde{x}$ is the
value of x for which we calculate the confidence interval. We can calculate the confidence interval
for a single value of $\tilde{x}$, or we can calculate it as a continuous function of $\tilde{x}$.
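A sketch of the two intervals in MATLAB, again continuing from the snippets above (tinv requires the Statistics and Machine Learning Toolbox; alpha is the chosen level of significance):

% Sketch: confidence intervals for the slope and the intercept.
alpha = 0.05;                        % example level of significance
t2    = tinv(1 - alpha/2, n - 2);    % two-sided Student-t statistic
Sxx   = sum(x.^2) - n*xbar^2;        % common denominator
dSlope = t2 * Syx1 * sqrt(1/Sxx);    % a single quantity
xt     = linspace(min(x), max(x));   % x-tilde values of interest
dIcept = t2 * Syx1 * sqrt(1/n + (xt - xbar).^2 / Sxx); % varies with x-tilde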
A strategy that is often sufficient for the type of problems we encounter in this course is to
determine the confidence interval for the special case when $\tilde{x}$ is the average value of x, that is, $\tilde{x} = \bar{x}$.
For this case the intercept confidence interval reduces to:
$$\delta_{INTERCEPT} = \pm t_{\alpha,n-2}\, S_{y,x} \sqrt{\frac{1}{n}} = \pm t_{\alpha,n-2} \sqrt{\frac{\sum(\mathrm{residual})^2}{n(n-2)}} \qquad \text{(only when } \tilde{x} = \bar{x}\text{)}$$
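The reduction is easy to confirm numerically with the variables from the previous sketch:

% Sketch: at x-tilde = xbar the squared term vanishes, so the full
% expression collapses to t * S_{y,x} / sqrt(n).
dFull    = t2 * Syx1 * sqrt(1/n + (xbar - xbar)^2 / Sxx);
dReduced = t2 * Syx1 / sqrt(n);
% dFull and dReduced are identical.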
PRESENTING THE FINAL RESULTS OF THE REGRESSION ANALYSIS
Whenever you calculate a least squares error regression line, your final result should never
just quote the best-fit slope and intercept, $\hat{a}$ and $\hat{b}$. You should ALWAYS include the
confidence intervals of the slope and intercept and the associated level of confidence. Also,
you should ALWAYS generate at least one graph that shows the original data and the
reconstructed best-fit line. The preference is also to include the confidence interval bands.
The graphs below give examples of the final plots you could produce for a linear
regression analysis. Both plots show the raw data as blue circular symbols and the best fit
line as a solid blue line.
The upper plot shows the intercept ± confidence interval. The solid green lines show the
confidence interval as a function of $\tilde{x}$. The red squares show the confidence interval calculated
when $\tilde{x} = \bar{x}$.
The lower plot shows the slope ± confidence interval as red solid lines. You should ensure
that these lines intersect at the coordinate $(\bar{x}, \bar{y})$.
The MATLAB code used to generate the confidence intervals is shown at the end of this
handout.
[Example figures omitted: two plots of raw data with the best-fit line and confidence-interval bands, as described above.]
PROCEDURE FOR DETERMINING A BEST-FIT STRAIGHT LINE USING LINEAR CORRELATION
AND REGRESSION
1. Plot the raw data.
2. Make a subjective decision about the data and any trends.
3. Calculate the correlation coefficient $r_{xy}$ and compare it to the values in the table; make a
   statistical decision (a sketch of this check follows the list).
4. Calculate the slope and intercept of the best-fit line.
5. Calculate the MSE (or standard error of the estimate).
6. Calculate the confidence intervals for the slope and intercept.
7. Plot the raw data and overlay the best-fit line with the ± confidence intervals. Remember
   that many data points will fall inside the CI, but expect a few points to fall outside it.
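For the statistical decision in step 3, the threshold that a correlation coefficient must exceed before it is unlikely to have arisen from random data can be computed directly. A minimal sketch, assuming x, y, and alpha are defined as in the earlier snippets (it mirrors the check in the full script at the end of this handout):

% Sketch: minimum correlation magnitude expected from random data.
R  = corrcoef(x, y); R = R(1, 2);    % sample correlation coefficient
t2 = tinv(1 - alpha/2, n - 2);       % Student-t, (n-2) DOF
minCorr = t2 / sqrt(n - 2 + t2^2);   % threshold for significance
significant = abs(R) > minCorr;      % true: correlation is significant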
FINAL DECISION BASED ON:
• SUBJECTIVE – does the data look like a straight line?
• STATISTICAL – is the correlation coefficient above the value expected for random data?
• GRAPHICAL – do the reconstructed best-fit line and the uncertainty (slope and
  intercept) lines "look OK"?
  o If the uncertainty is large compared with the change in y-value over the
    range of x data, the straight-line approximation may not be adequate.
• Although not presented in these notes, plotting the residual values versus x can often
  be helpful in identifying nonlinear trends in the data.
MATLAB "bare bones" code
%% Linear Regression code written by Prof. Ratcliffe 2015
% "Stripped down" so that it just shows the various calculations
% in a compact form.
% Example (x, y) data so the script runs; replace with your own data.
x = [1.0 2.1 2.9 4.2 5.1 5.9 7.0 8.1];    % x values (assumed exact)
y = [2.3 4.1 6.2 8.3 9.8 12.2 14.1 16.0]; % measured y values
alpha=0.1; conf=1-alpha; % level of significance, 10% here
n=length(x);
R=corrcoef(x,y); R=R(1,2); % correlation coefficient
% calculate the maximum correlation value for random data
t2=tinv(1-alpha/2,n-2); % student t for (n-2) degrees of freedom
minCorr=t2/sqrt(n-2+t2^2);
if abs(R)>minCorr
    fprintf('The correlation is statistically significant.\n');
else
    fprintf('There is insufficient evidence that the data are correlated.\n');
end
%% the actual linear regression
p=polyfit(x,y,1); slope=p(1); intercept=p(2);
%% residuals and MSE (the standard error of the estimate, S_y,x)
Y=slope*x+intercept;           % expected values on the best-fit line
resid=Y-y;                     % errors e_i = Y_i - y_i
MSE=sqrt(sum(resid.^2)/(n-2));
%% calculate the confidence intervals for intercept and slope
xbar=mean(x);ybar=mean(y);
xrange=max(x)-min(x);
xtilde=min(x):xrange/100:max(x);
ytilde=slope.*xtilde+intercept;
CI_intercept=MSE*t2*sqrt(1/n +(xtilde-xbar).^2/(sum(x.^2)-n*xbar^2));
CI_interceptXBar=t2*MSE/sqrt(n); % calculated at x-tilde=xbar
CI_slope=t2*MSE/sqrt(sum(x.^2)-n*xbar^2 );
%% plot intercept CI on figure 1
figure(1);clf;hold on
plot(x,y,'o');
xlabel('The x-data (Units)'); ylabel('The y-data (Units)');
refline(slope,intercept); % overlay best fit line
plot(xtilde,ytilde+CI_intercept,'g'); % plot upper CI in green
plot(xtilde,ytilde-CI_intercept,'g'); % lower CI
plot(xbar,ybar+CI_interceptXBar,'sr'); %x-tilde=xbar plot
plot(xbar,ybar-CI_interceptXBar,'sr');
% CI slope
figure(2);clf;hold on
plot(x,y,'o');
xlabel('The x-data (units)'); ylabel('The y-data (units)');
refline(slope,intercept); % overlay best fit line
% want the two CI lines to intersect at the mean (x, y) coordinate
tmpy=(slope + CI_slope)*xbar+intercept-ybar;
h=refline(slope+CI_slope,intercept-tmpy); % plot upper CI
set(h,'Color','r'); % make it red
tmpy=(slope - CI_slope)*xbar+intercept-ybar;
h=refline(slope-CI_slope,intercept-tmpy);
set(h,'Color','r')