Regression and Correlation

Introduction

One of the most important applications of statistics involves estimating the value of a response variable y, or predicting some future value of y, based upon knowledge of a set of independent variables x₁, x₂, ..., xₖ. For example, an engineer might want to relate the rate of malfunction y of a mechanical assembler to variables such as the speed of operation and the assembler operator. The objective would be to develop a prediction equation relating the dependent variable y to the independent variables, and then to use that equation to predict the rate of malfunction y for various combinations of operating speed and operator.

The models used to relate a dependent variable y to the independent variables x₁, x₂, ..., xₖ are termed regression models because they express the mean value of y for given values of x₁, x₂, ..., xₖ as a function of a set of unknown parameters. These parameters are estimated from sample data using a process described below. This basic approach is applicable to situations ranging from simple linear regression to more complex nonlinear multiple regression.

A Simple Linear Regression Model

Suppose that the developer of a new insulation material wants to determine the amount of compression produced on a 2-inch-thick specimen of the material when it is subjected to different amounts of pressure. Five experimental pieces of the material were tested under different pressures. The values of x (in units of 10 psi) and the resulting amounts of compression y (in units of 0.1 inch) are given in Table 1.

Table 1. Compression Versus Pressure for an Insulation Material

Specimen   Pressure, x (10 psi)   Compression, y (0.1 inch)
    1              1                         1
    2              2                         1
    3              3                         2
    4              4                         2
    5              5                         4

A plot of the data, called a scattergram, is shown in Figure 1.

[Figure 1. Scattergram for the data in Table 1: compression (0.1 inch) versus pressure (10 psi).]

Suppose we believe that the value of y tends to increase in a linear manner as x increases. Then we could select a model relating y to x by drawing a line through the points in the figure. Such a deterministic model – one that does not allow for errors of prediction – might be adequate if all of the points in the figure fell on the fitted line. However, this does not occur for the data of Table 1: no matter how the line is drawn through the points, at least some of the points will deviate substantially from the fitted line.

The solution to this problem is to construct a probabilistic model relating y to x – one that acknowledges the random variation of the data points about a line. One type of probabilistic model, the simple linear regression model, assumes that the mean value of y for a given value of x can be represented by a straight line, and that points deviate about this line of means by a random (positive or negative) amount equal to ε, i.e.,

    y = β₀ + β₁x + ε

where β₀ + β₁x represents the mean value of y for a given value of x (β₀ and β₁ are unknown parameters of the deterministic, nonrandom portion of the model) and ε represents the random error.
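To make the probabilistic model concrete, the short sketch below simulates observations from a straight-line model with random error. It is purely illustrative: the parameter values β₀ = 0, β₁ = 0.7, and σ = 0.6 are hypothetical choices (and numpy is assumed to be available), not estimates from the Table 1 data.

```python
# Illustrative sketch: simulate observations from y = b0 + b1*x + eps,
# where eps is a random error with mean 0. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
b0, b1, sigma = 0.0, 0.7, 0.6              # hypothetical parameters, for display only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # pressure settings, as in Table 1
eps = rng.normal(0.0, sigma, size=x.size)  # random errors: mean 0, constant variance
y = b0 + b1 * x + eps                      # line of means plus random error

print(np.column_stack([x, b0 + b1 * x, y]))  # columns: x, E(y), simulated y
```

Each simulated y value scatters above or below its line-of-means value, which is exactly the behavior the model assumes for real data.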
If we assume that the points deviate above and below the line of means, with some deviations positive, some negative, and with E(ε) = 0, then the mean value of y is

    E(y) = E(β₀ + β₁x + ε) = β₀ + β₁x + E(ε) = β₀ + β₁x

Therefore, the mean value of y for a given value of x, represented by E(y), is given by a straight line with y-intercept β₀ and slope β₁.

[Figure 2. Hypothetical line of means, E(y) = β₀ + β₁x, for the data of Table 1.]

In order to fit a simple linear regression model to a set of data, we must find estimators for the unknown parameters β₀ and β₁. Valid inferences about β₀ and β₁ will depend on the probability distribution of the random error ε; therefore, we must first make specific assumptions about ε. These assumptions, summarized below, are basic to every statistical regression analysis.

1. The mean of the probability distribution of ε is 0. That is, the average of the errors over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y, E(y), for a given value of x is E(y) = β₀ + β₁x.

2. The variance of the probability distribution of ε is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of ε is equal to a constant, say σ², for all values of x.

3. The probability distribution of ε is normal.

4. The errors associated with any two different observations are independent. That is, the error associated with one value of y has no effect on the errors associated with other y values.

The implications of the first three assumptions can be observed in Figure 3, which shows the distributions of the errors for three particular values of x, namely x₁, x₂, and x₃. Note that the relative frequency distributions of the errors are normal, with a mean of 0 and a constant variance σ² (each distribution has the same degree of spread or variability). In practice, regression analysis is reasonably robust with respect to modest departures from these assumptions.

[Figure 3. Normal error distributions with mean 0 and constant variance σ² at x = x₁, x₂, and x₃.]

The Method of Least Squares

In order to choose the "best fitting" line for a set of data, we must estimate the unknown parameters β₀ and β₁ of the simple linear regression model. These estimates can be found using the method of least squares. The reasoning behind the method can be seen by considering Figure 4, which shows a scattergram of the data points of Table 1 with a line drawn through them. The vertical line segments represent the deviations of the points from the line. Although there are many lines for which the sum of the deviations (or errors) is equal to 0, there is one and only one line for which the sum of the squares of the deviations is a minimum. This sum of squares of the deviations is called the sum of squares for error and is denoted by the symbol SSE. The line itself is termed the least squares or regression line.

To find the least squares line for a set of data, assume that we have a sample of n data points identified by corresponding values of x and y, i.e., (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). For the straight-line model, the response y in terms of x is

    y = β₀ + β₁x + ε

The line of means is

    E(y) = β₀ + β₁x

The fitted line, which we hope to find, is represented as

    ŷ = β̂₀ + β̂₁x

Thus, ŷ is an estimator of the mean value of y, E(y), and a predictor of some future value of y; β̂₀ and β̂₁ are estimators of β₀ and β₁, respectively.
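Before deriving the estimators formally, the least squares criterion can be illustrated numerically. The sketch below (plain Python, no libraries) passes several candidate lines through (x̄, ȳ) for the Table 1 data; every such line has deviations that sum to zero, but only one slope minimizes the sum of squared deviations.

```python
# Illustrative sketch (not the least squares fit itself): any line through
# (xbar, ybar) has deviations that sum to zero, but the sums of SQUARED
# deviations differ, and only one slope minimizes them.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
xbar = sum(x) / len(x)          # 3.0
ybar = sum(y) / len(y)          # 2.0

for slope in (0.3, 0.7, 1.1):   # candidate slopes; each line forced through (xbar, ybar)
    intercept = ybar - slope * xbar
    dev = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # every sum of deviations prints as 0.00; SSE is smallest at slope 0.7
    print(f"slope={slope}: sum of deviations = {sum(dev):+.2f}, "
          f"SSE = {sum(d * d for d in dev):.2f}")
```

The output shows SSE values of 2.70, 1.10, and 2.70 for slopes 0.3, 0.7, and 1.1; slope 0.7 is in fact the least squares slope found below.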
For a given data point, e.g., (x₁, y₁), the observed value of y is y₁, and the predicted value of y is obtained by substituting x₁ into the prediction equation

    ŷᵢ = β̂₀ + β̂₁xᵢ

The deviation of the ith value of y from its predicted value is

    yᵢ − ŷᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)

Then the sum of squares of the deviations of the y values about their predicted values, taken over all n data points, is

    SSE = Σ [yᵢ − (β̂₀ + β̂₁xᵢ)]²

The quantities β̂₀ and β̂₁ that make the SSE a minimum are called the least squares estimates of the population parameters β₀ and β₁, and the prediction equation ŷ = β̂₀ + β̂₁x is called the least squares line. The values of β̂₀ and β̂₁ that minimize the SSE are obtained by setting the two partial derivatives, ∂(SSE)/∂β̂₀ and ∂(SSE)/∂β̂₁, equal to 0 and solving the resulting simultaneous linear system of least squares equations. We then obtain

Slope:

    β̂₁ = SSxy / SSxx

y-intercept:

    β̂₀ = ȳ − β̂₁x̄

where n equals the sample size and (all sums running over i = 1, ..., n)

    SSxy = Σ (xᵢ − x̄)(yᵢ − ȳ) = Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n

    SSxx = Σ (xᵢ − x̄)² = Σ xᵢ² − (Σ xᵢ)²/n

In summary, we have defined the best-fitting straight line to be the one that satisfies the least squares criterion; i.e., its sum of squared errors is smaller than that of any other straight-line model.

The Least Squares Estimators

An examination of the formulas for the least squares estimators reveals that they are linear functions of the observed y values, y₁, y₂, ..., yₙ. Since we have assumed that the random errors associated with these y values, ε₁, ε₂, ..., εₙ, are independent, normally distributed random variables with mean 0 and variance σ², it follows that the y values are normally distributed with mean E(y) = β₀ + β₁x and variance σ².

An Estimator of σ²

In most practical situations, the variance σ² of the random error ε will be unknown and must be estimated from the sample data. Since σ² measures the variation of the y values about the line E(y) = β₀ + β₁x, it seems reasonable to estimate σ² by dividing SSE by an appropriate number:

    s² = SSE / (degrees of freedom for error) = SSE / (n − 2)

where

    SSE = Σ (yᵢ − ŷᵢ)² = SSyy − β̂₁SSxy

and

    SSyy = Σ (yᵢ − ȳ)² = Σ yᵢ² − (Σ yᵢ)²/n

Inferences About the Slope β₁

What could be said about the values of β₀ and β₁ in the hypothesized probabilistic model, y = β₀ + β₁x + ε, if x contributes no information for the prediction of y? The implication is that the mean of y, i.e., the deterministic part of the model E(y) = β₀ + β₁x, does not change as x changes; regardless of the value of x, the same value of y is always predicted. In the straight-line model, this means that the true slope β₁ is equal to 0. Therefore, to test the null hypothesis that x contributes no information for the prediction of y against the alternative hypothesis that these variables are linearly related with a slope differing from 0, we test

    H₀: β₁ = 0
    Hₐ: β₁ ≠ 0

If the data support the alternative hypothesis, we will conclude that x does contribute information for the prediction of y using the straight-line model (although the relationship between E(y) and x could be more complex than a straight line). Thus, to some extent, this is a test of the utility of the hypothesized model.
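The computations defined above are easy to carry out directly. The following sketch applies the SSxy, SSxx, SSyy, and s² formulas to the Table 1 data (pure Python, no libraries); the results agree with the Excel output shown in the next section.

```python
# Sketch: least squares estimates and s^2 for the Table 1 data, computed
# directly from the SSxy, SSxx, SSyy formulas above.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

SSxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # = 7
SSxx = sum(xi**2 for xi in x) - sum(x)**2 / n                       # = 10
SSyy = sum(yi**2 for yi in y) - sum(y)**2 / n                       # = 6

b1 = SSxy / SSxx          # slope estimate: 7/10 = 0.7
b0 = ybar - b1 * xbar     # intercept estimate: 2 - 0.7*3 = -0.1
SSE = SSyy - b1 * SSxy    # 6 - 0.7*7 = 1.1
s2 = SSE / (n - 2)        # 0.3667, matching the Excel MS (residual)

print(f"b1 = {b1}, b0 = {b0}, SSE = {SSE}, s^2 = {s2:.4f}")
```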
Since σ will usually be unknown, the appropriate test statistic is a Student's t statistic, so that the test for model utility of a simple linear regression is given by

One-tailed test:                    Two-tailed test:
    H₀: β₁ = 0                          H₀: β₁ = 0
    Hₐ: β₁ < 0 (or Hₐ: β₁ > 0)          Hₐ: β₁ ≠ 0

Test statistic:

    t = β̂₁ / s(β̂₁) = β̂₁ / (s/√SSxx)

Rejection region (one-tailed): t < −tα (or t > tα)
Rejection region (two-tailed): |t| > tα/2

Note that tα and tα/2 are based upon n − 2 degrees of freedom.

EXAMPLE

We can use the data in Table 1 to show the results of the necessary calculations using the linear regression analysis package in Excel:

    t = β̂₁ / (s/√SSxx) = 0.7 / 0.19 ≈ 3.7

EXCEL: LINEAR REGRESSION ANALYSIS FOR THE DATA IN TABLE 1

Pressure, x:     1  2  3  4  5
Compression, y:  1  1  2  2  4

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.903696114
R Square            0.816666667
Adjusted R Square   0.755555556
Standard Error      0.605530071
Observations        5

ANOVA
             df   SS    MS            F             Significance F
Regression    1   4.9   4.9           13.36363636   0.035352847
Residual      3   1.1   0.366666667
Total         4   6

              Coefficients  Standard Error  t Stat        P-value      Lower 95%    Upper 95%
Intercept     -0.1          0.635085296     -0.157459164  0.88488398   -2.12112675  1.92112675
X Variable 1  0.7           0.191485422     3.655630775   0.035352847  0.090607356  1.309392644

The spreadsheet does not report that the critical t value for a two-tailed test at α = 0.05 (i.e., t₀.₀₂₅ with 3 degrees of freedom) is 3.182. However, the p-value provided in the table (p = 0.035) supplies the information necessary to reject the null hypothesis and warrants the conclusion that the slope is not zero. Therefore, the sample evidence indicates that x contributes information for the prediction of y using a linear model for the relationship between compression and pressure.

The Coefficient of Correlation

The least squares slope β̂₁ provides useful information on the linear relationship, or association, between two variables y and x. Another way to measure association is to compute the Pearson product moment correlation coefficient r. The correlation coefficient, defined as

    r = SSxy / √(SSxx · SSyy)

provides a quantitative measure of the strength of the linear relationship between x and y in the sample, just as β̂₁ does. However, unlike the slope, the correlation coefficient r is scaleless; i.e., the value of r is always between −1 and +1, no matter what the units of x and y are.

Since both r and β̂₁ provide information about the utility of the model, it is not surprising that there is a similarity in their computational formulas. In particular, note that SSxy appears in the numerators of both expressions and, since both denominators are always positive, r and β̂₁ will always have the same sign. A value of r near or equal to 0 implies little or no linear relationship between y and x. In contrast, the closer r is to +1 or −1, the stronger the linear relationship between y and x; if r = +1 or r = −1, all the points fall exactly on the least squares line. Positive values of r imply that y increases as x increases; negative values imply that y decreases as x increases.

It is important to note that high correlation does not imply causality. If a large positive or negative value of the sample correlation coefficient r is observed, it is incorrect to conclude that a change in x causes a change in y. The only valid conclusion is that a linear trend may exist between x and y.

The population correlation coefficient is denoted by the symbol ρ.
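As a cross-check on the spreadsheet output above, the sketch below (assuming scipy is available) reproduces the slope t statistic, its two-tailed p-value, and the critical value t₀.₀₂₅ quoted in the text.

```python
# Sketch: slope t statistic and two-tailed p-value for the Table 1 fit,
# reproducing the Excel "X Variable 1" row.
from scipy import stats

n, SSxx = 5, 10.0
b1 = 0.7                                  # slope estimate
s = (1.1 / (n - 2)) ** 0.5                # s = sqrt(SSE/(n-2)) = 0.6055
se_b1 = s / SSxx ** 0.5                   # standard error of the slope, ~0.1915
t = b1 / se_b1                            # ~3.656
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value, ~0.0354
t_crit = stats.t.ppf(0.975, df=n - 2)     # critical value 3.182 quoted in the text

print(f"t = {t:.3f}, p = {p:.4f}, t_crit(0.025, 3 df) = {t_crit:.3f}")
```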
As expected, the population correlation coefficient ρ is estimated by the corresponding sample statistic r. It is easy to show that

    r = β̂₁ √(SSxx/SSyy)

Thus β̂₁ = 0 implies r = 0, and vice versa. Consequently, the null hypothesis H₀: ρ = 0 is equivalent to the hypothesis H₀: β₁ = 0. The only real difference between the least squares slope β̂₁ and r is the measurement scale; therefore, the information they provide about the utility of the least squares model is to some extent redundant. However, the slope β̂₁ provides additional information on the amount of increase (or decrease) in y for every 1-unit increase in x. For this reason, the slope is the preferred parameter for making inferences about the existence of a positive or negative linear relationship between two variables.

Test of Hypothesis for Linear Correlation

One-tailed test:                    Two-tailed test:
    H₀: ρ = 0                           H₀: ρ = 0
    Hₐ: ρ < 0 (or Hₐ: ρ > 0)            Hₐ: ρ ≠ 0

Test statistic:

    t = r √(n − 2) / √(1 − r²)

Rejection region (one-tailed): t < −tα (or t > tα)
Rejection region (two-tailed): |t| > tα/2

Note that tα and tα/2 are based upon n − 2 degrees of freedom.

The correlation coefficient r describes only the linear relationship between x and y. For nonlinear relationships, the value of r may be misleading, and other methods must be used for describing and testing such relationships.

The Coefficient of Determination

Another way to measure the contribution of x in predicting y is to consider how much the errors of prediction of y can be reduced by using the information provided by x. Suppose a sample of data has the scattergram shown in Figure 6a. If we assume that x contributes no information for the prediction of y, the best prediction for a value of y is the sample mean ȳ, which is represented by the horizontal line in Figure 6b. The vertical line segments in Figure 6b are the deviations of the points about the mean ȳ. Note that the sum of squares of the deviations for the model ŷ = ȳ is

    SSyy = Σ (yᵢ − ȳ)²

Now suppose that a least squares line is fitted to the same set of data, and the deviations of the points about the line are determined as indicated in Figure 6c. Comparison of the deviations about the prediction lines in parts b and c indicates that:

1. If x contributes little or no information for the prediction of y, then the sums of squares of the deviations for the two lines will be nearly equal, i.e.,

    SSyy = Σ (yᵢ − ȳ)² ≈ SSE = Σ (yᵢ − ŷᵢ)²

2. If x does contribute information for the prediction of y, then SSE will be smaller than SSyy. In fact, if all the points fall on the least squares line, then SSE = 0.

A convenient way of measuring how well the least squares equation ŷ = β̂₀ + β̂₁x performs as a predictor of y is to compute the reduction in the sum of squares of deviations that can be attributed to x, expressed as a proportion of SSyy. This quantity, termed the coefficient of determination, is given by

    r² = (SSyy − SSE) / SSyy = 1 − SSE/SSyy

In simple linear regression, it can be shown that this quantity is equal to the square of the simple linear correlation coefficient r. Note that r² is always between 0 and 1 because r is between −1 and +1. Thus, r² = 0.60 means that the sum of squares of the deviations of the y values about their predicted values has been reduced by 60% through the use of ŷ, instead of ȳ, to predict y. Or, more practically, r² = 0.60 implies that the straight-line model relating y to x can explain or account for 60% of the variation present in the sample of y values.
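The correlation quantities for the Table 1 data follow directly from the sums of squares computed earlier. The short sketch below (pure Python) also confirms that the t statistic for H₀: ρ = 0 is identical to the t statistic for H₀: β₁ = 0.

```python
# Sketch: correlation coefficient, coefficient of determination, and the
# t test for H0: rho = 0, using the Table 1 sums of squares.
SSxy, SSxx, SSyy, n = 7.0, 10.0, 6.0, 5

r = SSxy / (SSxx * SSyy) ** 0.5            # ~0.9037, the Excel "Multiple R"
r2 = r**2                                  # ~0.8167, the Excel "R Square"
t = r * (n - 2) ** 0.5 / (1 - r2) ** 0.5   # ~3.656, identical to the slope t

print(f"r = {r:.4f}, r^2 = {r2:.4f}, t = {t:.3f}")
```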
Model Estimation and Prediction

If we are satisfied that a useful model has been identified, we are ready to accomplish the original objectives for building the model, i.e., using the model to estimate or predict. In our example, we might predict or estimate the amount of compression for a particular level of pressure. The most common uses of a probabilistic model can be divided into two categories. The first is the use of the model for estimating the mean value of y, E(y), for a specific value of x. For example, we may want to estimate the mean amount of compression for all specimens of insulation subjected to a pressure of 40 psi (i.e., x = 4, since x is recorded in units of 10 psi). The second use of the model entails predicting a particular value of y for a given value of x. If we decide to install insulation in a particular piece of equipment in which we believe it will be subjected to a pressure of 40 psi, we will want to predict the compression for this particular specimen of insulation material. In the case of estimating a mean value of y, we are attempting to estimate the mean result of a very large number of experiments at the given x value. In the second case, we are trying to predict the outcome of a single experiment at the given x value.

Sampling Errors for the Estimator of the Mean of y

The standard deviation of the sampling distribution of the estimator ŷ of the mean value of y at a particular value of x, say xₚ, is

    σ(ŷ) = σ √( 1/n + (xₚ − x̄)²/SSxx )

where σ is the standard deviation of the random error ε.

Sampling Errors for the Predictor of an Individual Value of y

The standard deviation of the prediction error for the predictor ŷ of an individual y value at x = xₚ is

    σ(y − ŷ) = σ √( 1 + 1/n + (xₚ − x̄)²/SSxx )

where σ is the standard deviation of the random error ε. The true value of σ will rarely be known; therefore, we estimate σ by s.

The error in estimating the mean value of y, E(y), for a given value of x, say xₚ, is the distance between the least squares line and the true line of means, E(y) = β₀ + β₁x. This error, ŷ − E(y), is shown in the figure below.

[Figure: error of estimating the mean value of y at x = xₚ.]

In contrast, the error yₚ − ŷ in predicting some future value of y is the sum of two errors: the error of estimating the mean of y, E(y), plus the random error ε that is a component of the value of y to be predicted, as indicated in the figure below.

[Figure: error of predicting a future value of y at x = xₚ.]

Consequently, the error of predicting a particular value of y will usually be larger than the error of estimating the mean value of y for a particular value of x. Note from their respective formulas that both the error of estimation and the error of prediction take their smallest values when xₚ = x̄.

Using the least squares prediction equation to estimate a mean value of y, or to predict a particular value of y, for values of x that lie outside the range of the x values contained in the sample data may lead to errors of estimation or prediction that are much larger than expected.
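A minimal sketch of these two standard error formulas, with s substituted for σ, is given below (pure Python); evaluating it at xₚ = x̄ and at an endpoint of the Table 1 x range shows how both errors grow as xₚ moves away from x̄.

```python
# Sketch: estimated standard errors for the mean of y and for an individual y
# at x = xp, implementing the two formulas above with s in place of sigma.
def se_mean(s, n, xp, xbar, SSxx):
    """Estimated sigma(yhat): standard error for estimating E(y) at x = xp."""
    return s * (1 / n + (xp - xbar) ** 2 / SSxx) ** 0.5

def se_pred(s, n, xp, xbar, SSxx):
    """Estimated sigma(y - yhat): standard error for predicting one y at x = xp."""
    return s * (1 + 1 / n + (xp - xbar) ** 2 / SSxx) ** 0.5

# Table 1 values: s = 0.6055, n = 5, xbar = 3, SSxx = 10.
for xp in (3.0, 5.0):   # both errors are smallest at xp = xbar and grow away from it
    print(xp, se_mean(0.6055, 5, xp, 3.0, 10.0), se_pred(0.6055, 5, xp, 3.0, 10.0))
```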
EXAMPLE

Suppose a fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the residence and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in this suburb is selected. The amount of damage y and the distance x between the fire and the nearest fire station are recorded for each fire. The data and the regression analysis are incorporated in the following Excel spreadsheet.

LINEAR REGRESSION ANALYSIS: FIRE DAMAGE DATA

Distance x (miles)   Damage y (k$)
      3.4                26.2
      1.8                17.8
      4.6                31.3
      2.3                23.1
      3.1                27.5
      5.5                36.0
      0.7                14.1
      3.0                22.3
      2.6                19.6
      4.3                31.3
      2.1                24.0
      1.1                17.3
      6.1                43.2
      4.8                36.4
      3.8                26.1

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.960977715
R Square            0.923478169
Adjusted R Square   0.917591874
Standard Error      2.316346184
Observations        15

ANOVA
             df   SS            MS            F            Significance F
Regression    1   841.766358    841.766358    156.8861596  1.2478E-08
Residual     13   69.75097535   5.365459643
Total        14   911.5173333

              Coefficients  Standard Error  t Stat        P-value      Lower 95%    Upper 95%
Intercept     10.27792855   1.420277811     7.236562082   6.58556E-06  7.209605476  13.34625162
X Variable 1  4.919330727   0.392747749     12.52542054   1.2478E-08   4.070850963  5.767810491

[Figure: fire damage (k$) versus distance (miles), with the fitted least squares line.]

Application of the Methodology

Step 1. First, we hypothesize a model to relate the fire damage y to the distance x from the nearest fire station. We will hypothesize a straight-line probabilistic model:

    y = β₀ + β₁x + ε

Step 2. Next, we use statistical software to perform a linear regression. We find that the estimate of the slope is β̂₁ = 4.919331 and the estimate of the y-intercept is β̂₀ = 10.277929. Thus, the least squares equation is

    ŷ = 10.278 + 4.919x

The data and the prediction equation are shown in the figure accompanying the ANOVA table.

Step 3. Now we specify the probability distribution of the random error component ε. Although we know that the assumptions we previously considered are not completely satisfied (they seldom are for any practical problem), we are willing to assume that they are approximately satisfied for this example. The estimate of the variance σ² of ε is given in the table as the residual mean square, i.e., s² = MSE = 5.36546. The estimated standard deviation of ε is s = √5.36546 = 2.31635. The value of s implies that most of the observed fire damage (y) values will fall within approximately 2s ≈ 4.63 k$ of their respective predicted values.

Step 4: Test of Model Utility. We can now check the utility of the hypothesized model, i.e., whether x really contributes information for the prediction of y using the straight-line model. First, test the null hypothesis that the slope β₁ is equal to 0 – i.e., that there is no linear relationship between the fire damage and the distance from the nearest station – against the alternative that x and y are positively linearly related, i.e., H₀: β₁ = 0; Hₐ: β₁ > 0. The value of the t-test statistic is given in the row marked X Variable 1: t = 12.525, with an associated two-tailed probability p = 1.2478E-08 (the one-tailed p-value is half this). This small p-value leaves little doubt that x contributes information for the prediction of y.

Step 5: Numerical Descriptive Measures of Model Adequacy. The coefficient of determination is given as R Square, where r² = 0.9235. This value implies that about 92% of the sample variation in fire damage (y) is explained by the distance x. The coefficient of correlation r, which measures the strength of the linear relationship between y and x, is given as Multiple R, with a value r = 0.96. The high value of r confirms our conclusion that β₁ differs from 0.

Step 6: We are now prepared to use the least squares model for prediction.
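Before doing so, note that the Excel results used in Steps 2 through 5 can be reproduced in a few lines of code. The sketch below assumes scipy is available; scipy.stats.linregress returns the slope, intercept, correlation coefficient, two-tailed p-value, and slope standard error.

```python
# Sketch: reproducing the fire damage regression with scipy.stats.linregress.
from scipy import stats

x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
     43.2, 36.4, 26.1]

res = stats.linregress(x, y)
print(f"slope      = {res.slope:.6f}")               # ~4.919331  (Excel: X Variable 1)
print(f"intercept  = {res.intercept:.6f}")           # ~10.277929 (Excel: Intercept)
print(f"r          = {res.rvalue:.4f}")              # ~0.9610    (Excel: Multiple R)
print(f"r^2        = {res.rvalue**2:.4f}")           # ~0.9235    (Excel: R Square)
print(f"t (slope)  = {res.slope / res.stderr:.3f}")  # ~12.525
print(f"p (2-tail) = {res.pvalue:.4e}")              # ~1.2478e-08
```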
Suppose the insurance company wants to predict the fire damage if a major residential fire were to occur 3.5 miles from the nearest fire station, i.e., xₚ = 3.5. The predicted value can be calculated from our model with the coefficients from the ANOVA table, i.e., ŷ = 10.278 + 4.919xₚ. The result (using the full-precision coefficients) is ŷ = 27.4956, i.e., a predicted damage of about 27.5 k$. Note that we would not use this prediction model to make predictions for homes less than 0.7 miles or more than 6.1 miles from the nearest station, because those distances lie outside the range of x values in the sample; a straight-line model might not be appropriate for the relationship between the mean value of y and the value of x when stretched over a wider range of x values.

Summary of Linear Regression Methodology

1. Hypothesize a probabilistic model – for our use, a straight-line model whereby y = β₀ + β₁x + ε.

2. Use the method of least squares to estimate the unknown parameters in the deterministic component, β₀ + β₁x. The least squares estimates yield a model ŷ = β̂₀ + β̂₁x with a sum of squared errors (SSE) that is smaller than the SSE for any other straight-line model.

3. Specify the probability distribution of the random error component ε.

4. Assess the utility of the hypothesized model. Included here are making inferences about the slope β₁ and calculating r and r².

5. If we are satisfied with the model, we are prepared to use it to estimate the mean y value, E(y), for a given x, as well as to predict an individual y value for a specific value of x. (A sketch of this step in code is given after the table below.)

Summary: Appropriate Use of Regression and Correlation

                                              Nature of the two variables
Purpose of investigator                  Y random, X fixed           Y₁, Y₂ both random
Establish and estimate the dependence    Model I regression:         Model II regression
of one variable upon another             ŷ = β̂₀ + β̂₁x                (not described)
Establish and estimate the association   Meaningless for this case   Correlation coefficient
between two variables
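The sketch referenced in step 5 above: a point prediction at xₚ = 3.5 miles together with a 95% prediction interval built from the σ(y − ŷ) formula of the sampling-error section. The interval itself is an illustration beyond the original Excel analysis, and scipy is assumed to be available for the t quantile.

```python
# Sketch: point prediction and a 95% prediction interval at xp = 3.5 miles,
# combining the fitted fire damage model with the sigma(y - yhat) formula.
from scipy import stats

x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
n = len(x)
xbar = sum(x) / n
SSxx = sum(xi**2 for xi in x) - sum(x)**2 / n

b0, b1, s = 10.27792855, 4.919330727, 2.316346184   # values from the Excel output
xp = 3.5
yhat = b0 + b1 * xp                                  # ~27.4956 k$ (point prediction)

se = s * (1 + 1 / n + (xp - xbar) ** 2 / SSxx) ** 0.5   # prediction standard error
t = stats.t.ppf(0.975, df=n - 2)                         # 95% two-sided t, 13 df
print(f"yhat = {yhat:.4f} k$, 95% PI = ({yhat - t * se:.2f}, {yhat + t * se:.2f}) k$")
```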