Chapter 11 – Regression Analysis

Definition: When the values of two variables are measured for each member of a population or sample, the resulting data is called bivariate. When both variables are quantitative, we may represent the data set as a set of ordered pairs of numbers, $(x, y)$. The variable $x$ is called the input (or independent) variable; the variable $y$ is called the response (or dependent) variable.

We may examine the relationship between the two variables graphically using a scatter diagram, or scatterplot.

The simplest type of model relating two quantitative variables is called a simple linear regression model, in which there is an assumed linear relationship between two variables. One variable is called the independent variable, or predictor variable. The other variable is called the dependent variable, or the response variable.

Simple Linear Regression Model

The response variable is assumed to be related to the predictor variable according to the following equation:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$

where

$Y_i$ = the value of the response variable for the $i$th member of the sample,

$\beta_0$ = a parameter, called the intercept of the line of best fit, or the regression line,

$\beta_1$ = a parameter, called the slope of the line of best fit, or the regression line,

$x_i$ = the value of the predictor variable for the $i$th member of the sample,

$\varepsilon_i$ = a random error variable associated with the $i$th member of the sample; it is assumed that the random errors are independent and identically distributed, with $\varepsilon_i \sim \text{Normal}(0, \sigma^2)$.

A picture of the model is shown on p. 309.

Since it is assumed that a linear trend relationship exists between the predictor variable and the response variable, before we proceed to use the model, we must make a scatterplot to see whether the assumption of linearity is reasonable.

We need to use sample data to estimate the three parameters, $\beta_0$, $\beta_1$, and $\sigma^2$. The estimation will be done using the method of least squares.
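To make the model assumptions concrete, the following minimal Python sketch draws responses from the equation "response = intercept + slope × predictor + normal error". The function name `simulate_responses` and the parameter values (intercept 60, slope −1, error standard deviation 5) are hypothetical choices for illustration only; they are not estimates from any data set in these notes.

```python
import random


def simulate_responses(xs, beta0, beta1, sigma, seed=None):
    """Draw one response per x from Y = beta0 + beta1*x + eps,
    where the errors eps are independent Normal(0, sigma^2)."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable draws
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical predictor values and parameters (assumptions, not from the notes).
xs = [5, 10, 15, 20, 25, 30]
ys = simulate_responses(xs, beta0=60.0, beta1=-1.0, sigma=5.0, seed=1)

for x, y in zip(xs, ys):
    print(f"x = {x:5.1f}   y = {y:6.2f}")
```

Plotting such simulated pairs gives a scatterplot in which the points fall in a band around a straight line, which is the pattern one hopes to see in real data before fitting the model.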
Given a sample of size $n$, the data consists of ordered pairs, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. We will find the best estimators of the slope and intercept by minimizing the residual sum of squares (also called the error sum of squares):

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,$$

with respect to the two parameters. In doing this, we are minimizing the sum of the squared vertical distances of the data points from the line of best fit to the data. A concrete example is useful here.

Example: p. 302

Imagine constructing this scatterplot concretely as follows:

1) Draw the coordinate axes on a sheet of plywood.
2) Hammer nails into the board at each data point.
3) Obtain a thin wooden dowel and six rubber bands.
4) Place each rubber band around the dowel and one of the nails.
5) Wait until the dowel comes to rest.

The rest position of the dowel will be the minimum energy configuration of the system, the configuration for which there will be the least total stretching of the rubber bands. This position will also be the least squares regression line relating thermal conductivity and density.

We differentiate $SSE$ with respect to each parameter, and set each derivative equal to 0, obtaining

$$\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0,$$

and

$$\frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0.$$

This gives us two equations in two unknowns, called the normal equations:

$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i,$$

and

$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.$$

The solution is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} = \frac{SS_{xy}}{SS_{xx}},$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

Then the estimated regression line, or line of best fit to the data, is given by:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$

The estimate of the error variance is found from the error sum of squares to be

$$\hat{\sigma}^2 = MSE = \frac{SSE}{n - 2}.$$

There are only $n - 2$ degrees of freedom associated with the error sum of squares because two parameters, the slope and the intercept, have already been estimated.

To do inference, we need to know the distributional properties of the estimators, $\hat{\beta}_1$ and $\hat{\beta}_0$.
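The closed-form solution above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for statistical software: the function name `least_squares_fit` and the five toy data points are hypothetical and were chosen only so that the arithmetic is easy to follow by hand. The last lines check numerically that the fitted line satisfies both normal equations (the residuals sum to zero, and so do the residuals weighted by x).

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept, mse) for the least squares line,
    using slope = SS_xy / SS_xx and intercept = y_bar - slope * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    # Error sum of squares and its mean square on n - 2 degrees of freedom.
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)
    return slope, intercept, mse


# Hypothetical toy data (an assumption for illustration, not from the notes).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept, mse = least_squares_fit(xs, ys)
residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]

print("slope     =", round(slope, 4))
print("intercept =", round(intercept, 4))
print("MSE       =", round(mse, 4))
# Both sums should be zero up to floating-point rounding.
print("sum of residuals       =", round(sum(residuals), 10))
print("sum of x_i * residuals =", round(sum(x * e for x, e in zip(xs, residuals)), 10))
```

For these toy points the hand calculation gives $SS_{xx} = 10$ and $SS_{xy} = 6$, so the slope is 0.6, the intercept is 2.2, and $MSE = 2.4 / 3 = 0.8$.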
One of the basic assumptions of the model is that the random error terms, $\varepsilon_i$, are i.i.d. normal with mean 0 and common variance $\sigma^2$. Then $Y_i \sim \text{Normal}(\beta_0 + \beta_1 x_i, \sigma^2)$. Furthermore, the $Y$'s are independent of each other.

From the normal equations, it is clear that $\hat{\beta}_1$ is a linear function of the $Y$'s, and that $\hat{\beta}_0$ is also a linear function of the $Y$'s. We know that a statistic that is a linear function of independent normal random variables also has a normal distribution.

Specifically, it can be shown that both estimators are unbiased, and that

$$\hat{\beta}_1 \sim \text{Normal}\left( \beta_1, \frac{\sigma^2}{SS_{xx}} \right),$$

and that

$$\hat{\beta}_0 \sim \text{Normal}\left( \beta_0, \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}} \right) \right).$$

We may use these facts to do hypothesis testing and interval estimation about the slope and intercept. The standard error of $\hat{\beta}_1$ is given by

$$SE(\hat{\beta}_1) = \sqrt{\frac{MSE}{SS_{xx}}}.$$

The standard error of $\hat{\beta}_0$ is given by

$$SE(\hat{\beta}_0) = \sqrt{MSE \left( \frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}} \right)}.$$

Therefore, we find that

$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\dfrac{MSE}{SS_{xx}}}} \sim t(n - 2),$$

and that

$$\frac{\hat{\beta}_0 - \beta_0}{\sqrt{MSE \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}} \right)}} \sim t(n - 2).$$

We want to test whether there is a linear trend relationship between the predictor and the response variable. Our hypotheses are

$$H_0: \beta_1 = 0 \quad \text{v.} \quad H_a: \beta_1 \neq 0.$$

We may use the distributional properties of the estimated slope to find a test statistic: under $H_0$, the ratio $\hat{\beta}_1 / SE(\hat{\beta}_1)$ has a $t(n - 2)$ distribution, so we may do the hypothesis test using the t-distribution.

Example: The paper “A study of stainless steel stress-corrosion cracking by potential measurements” (Corrosion, 1962, pp. 425–432) reported on the relationship between applied stress (the predictor variable, $x$, in kg/mm²) and time to fracture (the response variable, $y$, in hours) for 18-8 stainless steel under uniaxial tensile stress in a 40% CaCl₂ solution at 100°C. Ten different settings of applied stress were used, and the resulting data values (as read from a graph which appeared in the paper) are given in the table below:

x_i    2.5    5     10    15    17.5    20    25    30    35    40
y_i    63     58    55    61    62      37    38    45    46    19

We want to 1) determine whether there is a linear trend relationship between applied tensile stress and time to fracture, and 2) estimate the relationship.
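As a numerical companion to the formulas for the standard errors and the t statistic, the sketch below applies them to the ten (stress, time) pairs of this example. The variable names are illustrative. Two items are assumptions rather than part of the derivation above: the significance level 0.05, and the two-sided critical value 2.306 for 8 degrees of freedom, which is copied from a standard t table instead of being computed.

```python
import math

# Applied stress (kg/mm^2) and time to fracture (hours) from the example.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]

n = len(stress)
x_bar = sum(stress) / n
y_bar = sum(hours) / n
ss_xx = sum((x - x_bar) ** 2 for x in stress)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(stress, hours))

# Least squares estimates of slope and intercept.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# Error variance estimate on n - 2 degrees of freedom.
sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(stress, hours))
mse = sse / (n - 2)

# Standard errors of the two estimators.
se_slope = math.sqrt(mse / ss_xx)
se_intercept = math.sqrt(mse * (1 / n + x_bar ** 2 / ss_xx))

# Test statistic for H0: slope = 0 against a two-sided alternative.
t_slope = slope / se_slope

# Assumption: alpha = 0.05, so the table value t(0.025; 8 df) = 2.306 is used.
T_CRITICAL = 2.306
reject_h0 = abs(t_slope) > T_CRITICAL

print(f"slope = {slope:.6f}, SE = {se_slope:.6f}, t = {t_slope:.4f}")
print(f"intercept = {intercept:.6f}, SE = {se_intercept:.6f}")
print("reject H0 at the 0.05 level:", reject_h0)
```

Because the alternative is two-sided, the decision compares the absolute value of the t statistic with the critical value; the sign of the statistic only indicates the direction of the estimated trend.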
We first do a scatterplot, using Excel:

[Figure: Scatterplot of Time to Fracture v. Tensile Stress. Horizontal axis: Tensile Stress (kg/square mm). Vertical axis: Time to Fracture (Hours).]

It appears that there is a moderately strong negative linear trend relationship between time to fracture and tensile stress.

Next we want to test whether this relationship generalizes to the entire population of 18-8 stainless steel samples.

Step 1: $H_0: \beta_1 = 0$ v. $H_a: \beta_1 \neq 0$.

Step 2: $n = 10$. $\alpha = 0.05$.

Step 3: The test statistic that will be used is $F = \dfrac{MSR}{MSE}$, which under the null hypothesis has an $F(1, n - 2) = F(1, 8)$ distribution.

Step 4: We will reject the null hypothesis if the value of the test statistic is greater than $F_{1, 8, 0.05} = 5.32$.

Step 5: We enter the data in Excel. We choose Tools, Data Analysis, and Regression. Excel produces the following ANOVA table.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.79531017
R Square             0.632518266
Adjusted R Square    0.58658305
Standard Error       9.124307466
Observations         10

ANOVA
              df    SS             MS             F              Significance F
Regression    1     1146.376106    1146.376106    13.76978954    0.005949788
Residual      8     666.0238938    83.25298673
Total         9     1812.4

                Coefficients     Standard Error    t Stat          P-value
Intercept       66.41769912      5.648129399       11.75923822     2.50156E-06
X Variable 1    -0.900884956     0.242775962       -3.710766706    0.005949788

Step 6: Since $F = 13.77 > 5.32$ (equivalently, the p-value 0.0059 is less than 0.05), we reject the null hypothesis at the 0.05 level of significance. We have sufficient evidence to conclude that $\beta_1 \neq 0$; i.e., there is a linear trend relationship between tensile stress and time to fracture.

Definition: The coefficient of determination is defined by

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}.$$

This quantity is the proportion of the variation of the response variable that is explained by the linear relationship between the predictor variable and the response variable.

In our example, $R^2 = 0.6325$. Hence 63.25% of the variation in time to fracture is explained by the linear relationship between tensile stress and time to fracture. A large value for $R^2$ (near 1) indicates that the model has good explanatory power.
A value for $R^2$ near 0 indicates that the model does not have good explanatory power.

The estimated regression equation (line of best fit) may also be read from the last table in the Excel output. We have

$$\hat{Y} = 66.4177 - 0.9009x.$$

This says that for every 1 kg/mm² increase in tensile stress, the time to fracture decreases by 0.9009 hours, on average.

If the applied tensile stress is 12 kg/mm², then the predicted time to fracture is

$$\hat{Y} = 66.4177 - (0.9009)(12) = 55.6069 \text{ hours}.$$
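The sums of squares, the F statistic, the coefficient of determination, and the prediction at 12 kg/mm² can be cross-checked without Excel. The following is a minimal sketch in plain Python; the helper names `fit_line` and `anova_summary` are illustrative and not part of any library. It uses the unrounded coefficients, so the prediction differs in the fourth decimal place from the value obtained with the rounded equation.

```python
# Applied stress (kg/mm^2) and time to fracture (hours) from the example.
stress = [2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40]
hours = [63, 58, 55, 61, 62, 37, 38, 45, 46, 19]


def fit_line(xs, ys):
    """Least squares (slope, intercept) via SS_xy / SS_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    return slope, y_bar - slope * x_bar


def anova_summary(xs, ys):
    """Return SST, SSR, SSE, F = MSR / MSE, and R^2 for the fitted line."""
    n = len(xs)
    slope, intercept = fit_line(xs, ys)
    y_bar = sum(ys) / n
    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)              # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    ssr = sst - sse                                      # explained variation
    msr = ssr / 1         # regression has 1 degree of freedom
    mse = sse / (n - 2)   # residual has n - 2 degrees of freedom
    return {"SST": sst, "SSR": ssr, "SSE": sse, "F": msr / mse, "R2": ssr / sst}


slope, intercept = fit_line(stress, hours)
summary = anova_summary(stress, hours)
predicted_at_12 = intercept + slope * 12

for name, value in summary.items():
    print(f"{name:>3} = {value:.4f}")
print(f"predicted time to fracture at 12 kg/mm^2 = {predicted_at_12:.4f} hours")
```

The printed sums of squares, F, and R² should agree with the ANOVA and Regression Statistics sections of the Excel output. Note that 12 kg/mm² lies inside the range of stresses used in the experiment (2.5 to 40), which is where the fitted line is supported by data.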