EM375 STATISTICS AND MEASUREMENT UNCERTAINTY

LEAST SQUARES LINEAR REGRESSION ANALYSIS

In this unit of the course we investigate fitting a straight line to measured (x, y) data pairs. The equation we want to fit is of the form:

$$y = ax + b$$

where $a$ is the slope and $b$ is the intercept of the line. If we only have two data pairs we can fit a unique line to them. However, when we have more data pairs we could fit a large number of different lines that pass through the scatter of data points. The problem therefore arises: how do we choose the "best" straight line? There are several different methods. The one we develop here is the least squares error approach. It is the method most commonly used in engineering, and functions to calculate the solution are readily available in Excel, MATLAB, and many handheld calculators.

Let us assume we have $n$ data pairs $(x_i, y_i)$ with $i = 1 \ldots n$. We make the assumption that the $x$ values have no uncertainty – they are known exactly. The only uncertainty is associated with the $y$ values. If we knew the actual slope and intercept of the line, we could predict the expected $y$-value for any given $x$-value. We choose upper case $Y$ to represent the expected values and lower case $y$ to represent the measured values. For each measurement there will be an error $e_i$ (difference) between the expected value and the observed value. Thus:

$$Y_i = a x_i + b$$

$$e_i = Y_i - y_i = a x_i + b - y_i$$

The values are demonstrated in the figure below.

[Figure: measured data plotted against x, showing the observed value $y_i$, the expected value $Y_i$ on the fitted line, and the error $e_i$ between them.]

The approach to finding the "least squares best fit" line is to minimize the sum of the squares of the errors, $E$:

$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - y_i)^2 = \sum_{i=1}^{n} (a x_i + b - y_i)^2$$

Remember that at this stage we do not yet know the slope or intercept, so we cannot do any numerical calculations with our data. We wish to minimize the error $E$ by choosing the "best" slope and intercept. We do this by differentiating the error function separately with respect to (wrt) slope and intercept and setting each result to zero to minimize the total error:

wrt slope, $a$:

$$\frac{\partial E}{\partial a} = \sum_{i=1}^{n} 2(a x_i + b - y_i) x_i = 2\left(a \sum x_i^2 + b n \bar{x} - \sum x_i y_i\right) = 0$$

wrt intercept, $b$:

$$\frac{\partial E}{\partial b} = \sum_{i=1}^{n} 2(a x_i + b - y_i) = 2n(a \bar{x} + b - \bar{y}) = 0$$

where we recognize (for example) that

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

We solve these two equations simultaneously to get:

slope: $\hat{a} = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}$

intercept: $\hat{b} = \bar{y} - \hat{a} \bar{x}$

We call the resulting line $Y = \hat{a} x + \hat{b}$ the least-squares best fit line (or just the "best-fit line") for the data set $(x_i, y_i)$, where $\hat{a}$ is the best estimate of the actual slope $a$, and $\hat{b}$ is the best estimate of the actual intercept $b$.
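As a quick numerical check of these results, the MATLAB sketch below evaluates $\hat{a}$ and $\hat{b}$ directly from the summation formulas and compares them with the built-in polyfit function. The data vectors are made up purely for illustration.

% Minimal sketch: best-fit slope and intercept from the closed-form
% least-squares formulas. The data below are made up for illustration.
x = [1 2 3 4 5 6];                 % x values, assumed known exactly
y = [2.1 3.9 6.2 7.8 10.1 11.9];   % measured y values

n    = length(x);
xbar = mean(x);
ybar = mean(y);

aHat = (sum(x.*y) - n*xbar*ybar)/(sum(x.^2) - n*xbar^2); % slope
bHat = ybar - aHat*xbar;                                 % intercept

p = polyfit(x,y,1);  % same fit from MATLAB: p(1)=slope, p(2)=intercept
fprintf('slope:     %.4f (polyfit %.4f)\n', aHat, p(1));
fprintf('intercept: %.4f (polyfit %.4f)\n', bHat, p(2));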
UNCERTAINTY IN THE ESTIMATES

One concern is that the above equations give the "best" estimate, but how accurate is that estimate? We investigate a method that gives us an estimate of the uncertainty in the calculated slope and intercept. In other words, we wish to determine the confidence intervals $\delta_{Slope}$ and $\delta_{Intercept}$ such that, at a chosen level of significance:

$$\hat{a} - \delta_{Slope} < a < \hat{a} + \delta_{Slope}$$

$$\hat{b} - \delta_{Intercept} < b < \hat{b} + \delta_{Intercept}$$

In this course we quote the uncertainties, but do not develop them from first principles.

First we calculate the Mean Square Error (MSE) of the data set. In some texts (for example, Wheeler & Ganji) this quantity is called the standard error of the estimate, $s_{y,x}$; strictly, it is the square root of the mean square error. It can be calculated as:

$$\mathrm{MSE} = s_{y,x} = \sqrt{\frac{\sum (\mathrm{residual}_i)^2}{n-2}} = \sqrt{\frac{\sum (y_i - Y_i)^2}{n-2}} = \sqrt{\frac{\sum y_i^2 - \hat{b} \sum y_i - \hat{a} \sum x_i y_i}{n-2}}$$

The confidence intervals can be calculated as follows, where $t_{\alpha,n-2}$ is the Student-t statistic for a level of significance $\alpha$ and $(n-2)$ degrees of freedom:

$$\delta_{Slope} = \pm t_{\alpha,n-2}\, s_{y,x} \sqrt{\frac{1}{\sum x_i^2 - n \bar{x}^2}}$$

$$\delta_{Intercept} = \pm t_{\alpha,n-2}\, s_{y,x} \sqrt{\frac{1}{n} + \frac{(\tilde{x} - \bar{x})^2}{\sum x_i^2 - n \bar{x}^2}}$$

We see that the confidence interval for the slope is a single quantity. However, the confidence interval for the intercept depends upon $x$: in the equation, $\tilde{x}$ is the value of $x$ at which we calculate the confidence interval. We can calculate the confidence interval for a single value of $\tilde{x}$, or we can calculate it as a continuous function of $\tilde{x}$.

A strategy that is often sufficient for the type of problems we encounter in this course is to determine the confidence interval for the special case when $\tilde{x}$ is the average value of $x$, that is, $\tilde{x} = \bar{x}$. For this case the intercept confidence interval reduces to:

$$\delta_{Intercept} = \pm t_{\alpha,n-2}\, s_{y,x} \sqrt{\frac{1}{n}} = \pm t_{\alpha,n-2} \sqrt{\frac{\sum (\mathrm{residual}_i)^2}{n(n-2)}} \qquad \text{only when } \tilde{x} = \bar{x}$$

PRESENTING THE FINAL RESULTS OF THE REGRESSION ANALYSIS

Whenever you calculate a least squares error regression line, your final result should never just quote the best-fit slope and intercept, $\hat{a}$ and $\hat{b}$. You should ALWAYS include the confidence intervals of the slope and intercept and the associated level of confidence. Also, you should ALWAYS generate at least one graph that shows the original data and the reconstructed best-fit line. The preference is also to include the confidence interval bands.

The figures below give examples of the final plots you could produce for a linear regression analysis. Both plots show the raw data as blue circular symbols and the best-fit line as a solid blue line. The upper plot shows the intercept ± confidence interval: the solid green lines show the confidence interval as a function of $\tilde{x}$, and the red squares show the confidence interval calculated when $\tilde{x} = \bar{x}$. The lower plot shows the slope ± confidence interval as solid red lines. You should ensure that these lines intersect at the coordinate $(\bar{x}, \bar{y})$. The MATLAB code used to generate the confidence intervals is shown at the end of this handout.

[Figures: example final regression plots – upper: intercept confidence interval bands; lower: slope confidence interval lines.]

PROCEDURE FOR DETERMINING A BEST-FIT STRAIGHT LINE USING LINEAR CORRELATION AND REGRESSION

1. Plot the raw data.
2. Make a subjective decision about the data and any trends.
3. Calculate the correlation coefficient $r_{xy}$ and compare it to the values in the table. Make a statistical decision. (A minimal sketch of this test follows this list.)
4. Calculate the slope and intercept of the best-fit line.
5. Calculate the MSE (or standard error of estimate).
6. Calculate confidence intervals for the slope and intercept.
7. Plot the raw data and overlay the best-fit line with the ± confidence intervals. Remember that many data points will fall inside the CI; expect a few points to fall outside the CI.
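Step 3 requires a critical value for the correlation coefficient. When the table is not to hand, the critical value can be computed from the Student-t statistic, as is done in the full code at the end of this handout. A minimal self-contained sketch, using made-up data (note that tinv requires the Statistics and Machine Learning Toolbox):

% Minimal sketch: is the measured correlation statistically significant?
% The data vectors are made up for illustration.
x = [1 2 3 4 5 6];
y = [2.1 3.9 6.2 7.8 10.1 11.9];
n = length(x);
alpha = 0.1;                    % level of significance, 10% here

R = corrcoef(x,y); R = R(1,2);  % sample correlation coefficient
t2 = tinv(1-alpha/2, n-2);      % Student-t for (n-2) degrees of freedom
minCorr = t2/sqrt(n-2+t2^2);    % max correlation expected of random data

if abs(R) > minCorr
    fprintf('The correlation is statistically significant.\n');
else
    fprintf('There is insufficient evidence that the data are correlated.\n');
end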
FINAL DECISION BASED ON:

• SUBJECTIVE – does the data look like a straight line?
• STATISTICAL – is the correlation coefficient above the value expected of random data?
• GRAPHICAL – do the reconstructed best-fit line and the uncertainty (slope and intercept) lines "look OK"?
  o If the uncertainty is large compared with the change in y-value over the range of x data, the straight-line approximation may not be adequate.
• Although not presented in these notes, plotting the residual values versus x can often be helpful in identifying nonlinear trends in the data.

Matlab "bare bones" code

%% Linear Regression code written by Prof. Ratcliffe 2015
% This code WILL NOT RUN as-is: the data vectors x and y must be
% defined in the workspace first. It is "stripped down" so that it just
% shows the various calculations in a compact form.
alpha=0.1; conf=1-alpha; % level of significance, 10% here
n=length(x);
R=corrcoef(x,y); R=R(1,2); % correlation coefficient

% calculate the maximum correlation value expected for random data
t2=tinv(1-alpha/2,n-2); % student t for (n-2) degrees of freedom
minCorr=t2/sqrt(n-2+t2^2);
if abs(R)>minCorr
    fprintf('The correlation is statistically significant.\n');
else
    fprintf('There is insufficient evidence that the data are correlated.\n');
end

%% the actual linear regression
p=polyfit(x,y,1);
slope=p(1); intercept=p(2);

%% residuals and MSE (standard error of the estimate)
Y=slope*x+intercept;  % expected (fitted) y values
resid=Y-y;            % fitted minus observed
MSE=sqrt(sum(resid.^2)/(n-2));

%% calculate the confidence intervals for intercept and slope
xbar=mean(x); ybar=mean(y);
xrange=max(x)-min(x);
xtilde=min(x):xrange/100:max(x); % x-tilde values across the data range
ytilde=slope.*xtilde+intercept;
CI_intercept=MSE*t2*sqrt(1/n+(xtilde-xbar).^2/(sum(x.^2)-n*xbar^2));
CI_interceptXBar=t2*MSE/sqrt(n); % calculated at x-tilde = xbar
CI_slope=t2*MSE/sqrt(sum(x.^2)-n*xbar^2);

%% plot intercept CI on figure 1
figure(1);clf;hold on
plot(x,y,'o');
xlabel('The x-data (units)'); ylabel('The y-data (units)');
refline(slope,intercept); % overlay best fit line
plot(xtilde,ytilde+CI_intercept,'g'); % upper CI in green
plot(xtilde,ytilde-CI_intercept,'g'); % lower CI
plot(xbar,ybar+CI_interceptXBar,'sr'); % x-tilde = xbar points
plot(xbar,ybar-CI_interceptXBar,'sr');

%% plot slope CI on figure 2
figure(2);clf;hold on
plot(x,y,'o');
xlabel('The x-data (units)'); ylabel('The y-data (units)');
refline(slope,intercept); % overlay best fit line
% want the two CI lines to intersect at the mean (x, y) coordinate
tmpy=(slope+CI_slope)*xbar+intercept-ybar;
h=refline(slope+CI_slope,intercept-tmpy); % upper CI line
set(h,'Color','r'); % make it red
tmpy=(slope-CI_slope)*xbar+intercept-ybar;
h=refline(slope-CI_slope,intercept-tmpy); % lower CI line
set(h,'Color','r')
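The script above assumes that the data vectors x and y already exist in the workspace (and note that refline, like tinv, comes from the Statistics and Machine Learning Toolbox). A minimal driver, using synthetic data made up purely for illustration, might look like:

% Example driver: define the data, then run the "bare bones" script above.
x = 0:0.5:10;                          % x values, assumed known exactly
y = 2.3*x + 1.7 + 0.5*randn(size(x));  % synthetic noisy y measurements
% ... now execute the regression script (paste it here, or save it as a
% script file and run it with x and y in the workspace).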