EM375 STATISTICS AND MEASUREMENT UNCERTAINTY

LEAST SQUARES LINEAR REGRESSION ANALYSIS

In this unit of the course we investigate fitting a straight line to measured (x, y) data pairs. The equation we want to fit is of the form:

$$y = ax + b$$

where $a$ is the slope and $b$ is the intercept of the line. If we only have two data pairs we can fit a unique line to them. However, when we have more data pairs we could fit a large number of different lines that pass through the scatter of data points. The problem therefore arises: how do we choose the "best" straight line? There are several different methods. The one we develop here is the least squares error approach. It is the method most commonly used in engineering, and functions to calculate the solution are readily available in Excel, MATLAB, and many handheld calculators.

Let us assume we have $n$ data pairs $(x_i, y_i)$ with $i = 1 \ldots n$. We make the assumption that the $x$ values have no uncertainty – they are known exactly. The only uncertainty is associated with the $y$ values. If we knew the actual slope and intercept of the line, we could predict the expected $y$-value for any given $x$-value. We choose upper case $Y$ to represent the expected values and lower case $y$ to represent the measured values. For each measurement there will be an error $e_i$ (difference) between the expected value and the observed value. Thus:

$$Y_i = a x_i + b$$

$$e_i = Y_i - y_i = a x_i + b - y_i$$

The values are demonstrated in the figure below.

[Figure: measured data plotted against x, showing the observed value $y_i$, the expected value $Y_i$ on the fitted line, and the error $e_i$ between them.]

The approach to finding the "least squares best fit" line is to minimize the sum of the squares of the errors, $E$:

$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - y_i)^2 = \sum_{i=1}^{n} (a x_i + b - y_i)^2$$

Remember that at this stage we do not yet know the slope or intercept, so we cannot do any numerical calculations with our data. We wish to minimize the error $E$ by choosing the "best" slope and intercept. We do this by differentiating the error function separately with respect to (wrt) slope and intercept and setting each result to zero to minimize the total error:

wrt slope, $a$:

$$\frac{\partial E}{\partial a} = \sum_{i=1}^{n} 2(a x_i + b - y_i) x_i = 2\left(a \sum x_i^2 + b n \bar{x} - \sum x_i y_i\right) = 0$$

wrt intercept, $b$:

$$\frac{\partial E}{\partial b} = \sum_{i=1}^{n} 2(a x_i + b - y_i) = 2n(a \bar{x} + b - \bar{y}) = 0$$

where we recognize (for example) that

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

We solve these two equations simultaneously to get:

slope: $\hat{a} = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}$

intercept: $\hat{b} = \bar{y} - \hat{a} \bar{x}$

We call the resulting line $Y = \hat{a} x + \hat{b}$ the least-squares best fit line (or just the "best-fit line") for the data set $(x_i, y_i)$, where $\hat{a}$ is the best estimate of the actual slope $a$, and $\hat{b}$ is the best estimate of the actual intercept $b$.
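As a quick numerical check of these results, the MATLAB sketch below evaluates $\hat{a}$ and $\hat{b}$ directly from the summation formulas and compares them with the built-in polyfit function. The data vectors are made up purely for illustration.

% Minimal sketch: best-fit slope and intercept from the closed-form
% least-squares formulas. The data below are made up for illustration.
x = [1 2 3 4 5 6];                 % x values, assumed known exactly
y = [2.1 3.9 6.2 7.8 10.1 11.9];   % measured y values

n    = length(x);
xbar = mean(x);
ybar = mean(y);

aHat = (sum(x.*y) - n*xbar*ybar)/(sum(x.^2) - n*xbar^2); % slope
bHat = ybar - aHat*xbar;                                 % intercept

p = polyfit(x,y,1);  % same fit from MATLAB: p(1)=slope, p(2)=intercept
fprintf('slope:     %.4f (polyfit %.4f)\n', aHat, p(1));
fprintf('intercept: %.4f (polyfit %.4f)\n', bHat, p(2));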
UNCERTAINTY IN THE ESTIMATES

One concern is that the above equations give the "best" estimate, but how accurate is that estimate? We investigate a method that gives us an estimate of the uncertainty in the calculated slope and intercept. In other words, we wish to determine the confidence intervals $\delta_{Slope}$ and $\delta_{Intercept}$ such that, at a chosen level of significance:

$$\hat{a} - \delta_{Slope} < a < \hat{a} + \delta_{Slope}$$

$$\hat{b} - \delta_{Intercept} < b < \hat{b} + \delta_{Intercept}$$

In this course we quote the uncertainties, but do not develop them from first principles.

First we calculate the Mean Square Error (MSE) of the data set. In some texts (for example, Wheeler & Ganji) this quantity is called the standard error of the estimate, $s_{y,x}$; strictly, it is the square root of the mean square error. It can be calculated as:

$$\mathrm{MSE} = s_{y,x} = \sqrt{\frac{\sum (\mathrm{residual}_i)^2}{n-2}} = \sqrt{\frac{\sum (y_i - Y_i)^2}{n-2}} = \sqrt{\frac{\sum y_i^2 - \hat{b} \sum y_i - \hat{a} \sum x_i y_i}{n-2}}$$

The confidence intervals can be calculated as follows, where $t_{\alpha,n-2}$ is the Student-t statistic for a level of significance $\alpha$ and $(n-2)$ degrees of freedom:

$$\delta_{Slope} = \pm t_{\alpha,n-2}\, s_{y,x} \sqrt{\frac{1}{\sum x_i^2 - n \bar{x}^2}}$$

$$\delta_{Intercept} = \pm t_{\alpha,n-2}\, s_{y,x} \sqrt{\frac{1}{n} + \frac{(\tilde{x} - \bar{x})^2}{\sum x_i^2 - n \bar{x}^2}}$$

We see that the confidence interval for the slope is a single quantity. However, the confidence interval for the intercept depends upon $x$: in the equation, $\tilde{x}$ is the value of $x$ at which we calculate the confidence interval. We can calculate the confidence interval for a single value of $\tilde{x}$, or we can calculate it as a continuous function of $\tilde{x}$.

A strategy that is often sufficient for the type of problems we encounter in this course is to determine the confidence interval for the special case when $\tilde{x}$ is the average value of $x$, that is, $\tilde{x} = \bar{x}$. For this case the intercept confidence interval reduces to:

$$\delta_{Intercept} = \pm t_{\alpha,n-2}\, s_{y,x} \sqrt{\frac{1}{n}} = \pm t_{\alpha,n-2} \sqrt{\frac{\sum (\mathrm{residual}_i)^2}{n(n-2)}} \qquad \text{only when } \tilde{x} = \bar{x}$$

PRESENTING THE FINAL RESULTS OF THE REGRESSION ANALYSIS

Whenever you calculate a least squares error regression line, your final result should never just quote the best-fit slope and intercept, $\hat{a}$ and $\hat{b}$. You should ALWAYS include the confidence intervals of the slope and intercept and the associated level of confidence. Also, you should ALWAYS generate at least one graph that shows the original data and the reconstructed best-fit line. The preference is also to include the confidence interval bands.

The figures below give examples of the final plots you could produce for a linear regression analysis. Both plots show the raw data as blue circular symbols and the best-fit line as a solid blue line. The upper plot shows the intercept ± confidence interval: the solid green lines show the confidence interval as a function of $\tilde{x}$, and the red squares show the confidence interval calculated when $\tilde{x} = \bar{x}$. The lower plot shows the slope ± confidence interval as solid red lines. You should ensure that these lines intersect at the coordinate $(\bar{x}, \bar{y})$. The MATLAB code used to generate the confidence intervals is shown at the end of this handout.

[Figures: example final regression plots – upper: intercept confidence interval bands; lower: slope confidence interval lines.]

PROCEDURE FOR DETERMINING A BEST-FIT STRAIGHT LINE USING LINEAR CORRELATION AND REGRESSION

1. Plot the raw data.
2. Make a subjective decision about the data and any trends.
3. Calculate the correlation coefficient $r_{xy}$ and compare it to the values in the table. Make a statistical decision. (A minimal sketch of this test follows this list.)
4. Calculate the slope and intercept of the best-fit line.
5. Calculate the MSE (or standard error of estimate).
6. Calculate confidence intervals for the slope and intercept.
7. Plot the raw data and overlay the best-fit line with the ± confidence intervals. Remember that many data points will fall inside the CI; expect a few points to fall outside the CI.
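Step 3 requires a critical value for the correlation coefficient. When the table is not to hand, the critical value can be computed from the Student-t statistic, as is done in the full code at the end of this handout. A minimal self-contained sketch, using made-up data (note that tinv requires the Statistics and Machine Learning Toolbox):

% Minimal sketch: is the measured correlation statistically significant?
% The data vectors are made up for illustration.
x = [1 2 3 4 5 6];
y = [2.1 3.9 6.2 7.8 10.1 11.9];
n = length(x);
alpha = 0.1;                    % level of significance, 10% here

R = corrcoef(x,y); R = R(1,2);  % sample correlation coefficient
t2 = tinv(1-alpha/2, n-2);      % Student-t for (n-2) degrees of freedom
minCorr = t2/sqrt(n-2+t2^2);    % max correlation expected of random data

if abs(R) > minCorr
    fprintf('The correlation is statistically significant.\n');
else
    fprintf('There is insufficient evidence that the data are correlated.\n');
end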
FINAL DECISION BASED ON:

• SUBJECTIVE – does the data look like a straight line?
• STATISTICAL – is the correlation coefficient above the value expected of random data?
• GRAPHICAL – do the reconstructed best-fit line and the uncertainty (slope and intercept) lines "look OK"?
  o If the uncertainty is large compared with the change in y-value over the range of x data, the straight-line approximation may not be adequate.
• Although not presented in these notes, plotting the residual values versus x can often be helpful in identifying nonlinear trends in the data.

Matlab "bare bones" code

%% Linear Regression code written by Prof. Ratcliffe 2015
% This code WILL NOT RUN as-is: the data vectors x and y must be
% defined in the workspace first. It is "stripped down" so that it just
% shows the various calculations in a compact form.
alpha=0.1; conf=1-alpha; % level of significance, 10% here
n=length(x);
R=corrcoef(x,y); R=R(1,2); % correlation coefficient

% calculate the maximum correlation value expected for random data
t2=tinv(1-alpha/2,n-2); % student t for (n-2) degrees of freedom
minCorr=t2/sqrt(n-2+t2^2);
if abs(R)>minCorr
    fprintf('The correlation is statistically significant.\n');
else
    fprintf('There is insufficient evidence that the data are correlated.\n');
end

%% the actual linear regression
p=polyfit(x,y,1);
slope=p(1); intercept=p(2);

%% residuals and MSE (standard error of the estimate)
Y=slope*x+intercept;  % expected (fitted) y values
resid=Y-y;            % fitted minus observed
MSE=sqrt(sum(resid.^2)/(n-2));

%% calculate the confidence intervals for intercept and slope
xbar=mean(x); ybar=mean(y);
xrange=max(x)-min(x);
xtilde=min(x):xrange/100:max(x); % x-tilde values across the data range
ytilde=slope.*xtilde+intercept;
CI_intercept=MSE*t2*sqrt(1/n+(xtilde-xbar).^2/(sum(x.^2)-n*xbar^2));
CI_interceptXBar=t2*MSE/sqrt(n); % calculated at x-tilde = xbar
CI_slope=t2*MSE/sqrt(sum(x.^2)-n*xbar^2);

%% plot intercept CI on figure 1
figure(1);clf;hold on
plot(x,y,'o');
xlabel('The x-data (units)'); ylabel('The y-data (units)');
refline(slope,intercept); % overlay best fit line
plot(xtilde,ytilde+CI_intercept,'g'); % upper CI in green
plot(xtilde,ytilde-CI_intercept,'g'); % lower CI
plot(xbar,ybar+CI_interceptXBar,'sr'); % x-tilde = xbar points
plot(xbar,ybar-CI_interceptXBar,'sr');

%% plot slope CI on figure 2
figure(2);clf;hold on
plot(x,y,'o');
xlabel('The x-data (units)'); ylabel('The y-data (units)');
refline(slope,intercept); % overlay best fit line
% want the two CI lines to intersect at the mean (x, y) coordinate
tmpy=(slope+CI_slope)*xbar+intercept-ybar;
h=refline(slope+CI_slope,intercept-tmpy); % upper CI line
set(h,'Color','r'); % make it red
tmpy=(slope-CI_slope)*xbar+intercept-ybar;
h=refline(slope-CI_slope,intercept-tmpy); % lower CI line
set(h,'Color','r')
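The script above assumes that the data vectors x and y already exist in the workspace (and note that refline, like tinv, comes from the Statistics and Machine Learning Toolbox). A minimal driver, using synthetic data made up purely for illustration, might look like:

% Example driver: define the data, then run the "bare bones" script above.
x = 0:0.5:10;                          % x values, assumed known exactly
y = 2.3*x + 1.7 + 0.5*randn(size(x));  % synthetic noisy y measurements
% ... now execute the regression script (paste it here, or save it as a
% script file and run it with x and y in the workspace).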