CHAPTER 7
REGRESSION
1. INTRODUCTION
2. THE REGRESSION EQUATION
   2.1. The Least Squares Regression Line
   2.2. The Predicted Value of y for a Given x Value, and the Residual e
   2.3. The Standard Error of Estimate
   2.4. Coefficient of Determination, R²
3. STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION REGRESSION
   3.1. Confidence Interval for Population Slope Parameter β1
        3.1.1. More About the Sampling Distribution of b1
   3.2. Test of Hypothesis for Population Slope Parameter β1
   3.3. Confidence Interval for Predicted Value of y for a Given x
   3.4. Confidence Interval for Mean Value of y for a Given x
4. USING EXCEL FOR REGRESSION ANALYSIS
   4.1. Understanding the Computer Regression Output
1. INTRODUCTION
To explain regression simply, suppose you want to find out what factor affects the students’ grades in the
statistics departmental common finals. What determines the variations in test scores? Why do some
students have higher scores than others? Suppose a friend offers the explanation that the variations in scores
are related to students’ heights. Another friend proposes that score variations are related to the number of
hours a student studies for the test. Your task is to find out which theory is more realistic (duh!).
Here we have two models before us attempting to explain the variations in student statistics test scores. Each
model consists of two variables: the dependent variable and the independent variable. In both models the
dependent variable (also called the explained variable) is test scores. In Model 1, however, the independent
variable (also called the explanatory variable) is student height and in Model 2 the independent variable is
hours studied. Suppose you select a random sample of 10 students. You obtain the departmental final scores
for the dependent variable. For the independent variable in Model 1 you measure their heights. For Model 2,
you ask the students to state as accurately as possible the number of hours they studied for the test. The
following are the hypothetical data for each model:
Model 1
  Dependent: y (Score)    Independent: x (Height in inches)
        52                        72
        56                        65
        56                        70
        72                        74
        72                        64
        80                        62
        88                        71
        92                        75
        96                        74
       100                        69

Model 2
  Dependent: y (Score)    Independent: x (Hours studied)
        52                        2.5
        56                        1.0
        56                        3.5
        72                        3.0
        72                        4.5
        80                        6.0
        88                        5.0
        92                        4.0
        96                        5.5
       100                        7.0
Your task is to find out which model better explains the variations in student scores. You cannot observe the influence (if any) simply by looking at the numbers. In other words, it is hard to see any pattern or association in differences in scores in relation to differences in either height or hours studied. A visual aid is much more descriptive than the plain numbers. The visual aid in regression is called the scatter diagram. The following are the scatter diagrams for the two models. In each model the independent variable is measured on the horizontal axis and the dependent variable on the vertical axis.
The scatter diagram for Model 1 shows that there is no relationship between student height and score,
because there is no recognizable pattern. But in the scatter diagram for Model 2 there is a recognizable
pattern showing that, in general, scores increase with the number of hours studied.
[Scatter diagrams. Model 1: test score (vertical axis) plotted against student height in inches (horizontal axis). Model 2: test score plotted against hours studied.]
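For readers who want to reproduce the scatter diagrams with software rather than by hand, the short sketch below does so in Python with matplotlib. This is only an illustration added here, not part of the chapter's Excel-based workflow; the list names score, height, and hours are assumed.

# Illustrative sketch (assumed Python/matplotlib): scatter diagrams for Models 1 and 2
import matplotlib.pyplot as plt

score  = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]            # dependent variable y
height = [72, 65, 70, 74, 64, 62, 71, 75, 74, 69]             # Model 1: x = height in inches
hours  = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]   # Model 2: x = hours studied

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(height, score)                 # Model 1: no recognizable pattern
ax1.set(title="Model 1", xlabel="Student height (inches)", ylabel="Test score")
ax2.scatter(hours, score)                  # Model 2: scores rise with hours studied
ax2.set(title="Model 2", xlabel="Hours studied", ylabel="Test score")
plt.show()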
2. THE REGRESSION EQUATION
To determine a more precise depiction of the relationship between the dependent variable y and the independent variable x, we need to describe the relationship as a mathematical equation. This equation is derived as the equation of the line that best fits the scatter diagram. Regression analysis provides the tools for fitting a regression line onto the scatter diagram. To draw any line in the xy plane, you must have a vertical intercept and a slope. The general equation for a straight line is the following:

y = b0 + b1x

Here b0 represents the vertical intercept and b1 the slope of the line. The slope represents the change in the value of y per unit change in x:

b1 = Δy/Δx

[Diagram: a straight line in the xy plane, with the slope shown as the rise Δy over the run Δx.]
One can fit a line to the scatter diagram manually. There are many possible lines that could be fitted in this
manner. However, there is a mathematical approach to fitting the most accurate, or best-fitting line. This
method is called the least squares method. In explaining the method, you will see why it is called the least
squares. We will use the data for Model 2 to explain how to find the regression equation.
The model we are dealing with in this discussion is called a simple linear regression model. It is "simple" because there is only one independent variable. If a model contains more than one independent variable, then it is called a multiple regression model. For example, in your model explaining the student scores, in addition to the number of hours studied, you may include the students' SAT scores as a second independent variable. The current model is also a "linear" regression model because the regression equation provides a straight regression line. The line is not curved.
The general form of a simple linear regression equation is as follows:

ŷ = b0 + b1x

Note that in the regression equation we use ŷ (y-hat) rather than y. In regression models, the symbol y (hat-less) represents the observed values of the dependent variable, the actual values observed in the sample. The distinction between y and ŷ will become apparent below.
2.1. The Least Squares Regression Line
The mathematical method used to obtain the regression line is called the least squares method because, with the resulting regression line, the sum of the squared vertical distances between the observed y values and the regression line is minimized (is the least). In the following diagram, the diamond-shaped markers represent the y values observed in the sample. For each value of x (hours) there is an observed value of y (actual score). The circular markers on the regression line represent the predicted values. Once you find the regression equation and draw the regression line, for each value of x there will be a corresponding predicted value of y on the line, which we denote by ŷ.
[Diagram: observed values y (diamond markers) scattered above and below the regression line; for each x, the point on the line is the predicted value ŷ, and the vertical gap is the error e = y − ŷ.]
The vertical distance between the observed value (y) and the predicted value (ŷ) is called the prediction error and is denoted by e: e = y − ŷ. Squaring the error terms and summing them, we obtain the sum of squared errors.

Σe² = Σ(y − ŷ)²

The least squares line assures that this sum of squares is minimized. You cannot find any other line that would provide a smaller sum of squared errors than the least squares line.
How do you obtain the least squares regression line? As explained, the equation for any straight line is obtained by determining the vertical intercept and the slope. In simple linear regression, the slope and vertical intercept are obtained using the following formulas:

Slope:                b1 = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²)

Vertical Intercept:   b0 = ȳ − b1·x̄

Using the data in Model 2 we can now determine the values of b1 and b0.
   y         x        xy        x²
   52       2.5      130       6.25
   56       1.0       56       1.00
   56       3.5      196      12.25
   72       3.0      216       9.00
   72       4.5      324      20.25
   80       6.0      480      36.00
   88       5.0      440      25.00
   92       4.0      368      16.00
   96       5.5      528      30.25
  100       7.0      700      49.00
  ȳ = 76.4  x̄ = 4.2  Σxy = 3438  Σx² = 205.00
Having computed x̄ = 4.2 and ȳ = 76.4, we can fill in the values in the formulas:

b1 = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²) = (3,438 − 10(4.2)(76.4)) / (205 − 10(4.2)²) = 8.014

b0 = ȳ − b1·x̄ = 76.4 − 8.014(4.2) = 42.741

The least squares regression equation for Model 2 is then:

ŷ = 42.741 + 8.014x

What does this equation imply? The slope value of (rounded) 8.0 means that for each additional hour of study the model predicts that the score will increase by 8 points. The vertical intercept of (rounded) 43 means that the model predicts that if a student did not study at all the score would be 43.
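The arithmetic above can be checked with a few lines of code. The following is a minimal sketch (assumed Python, not part of the chapter) that applies the least squares formulas to the Model 2 data.

# Minimal sketch (assumed Python): least squares slope and intercept for Model 2
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]
x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
n = len(x)
x_bar = sum(x) / n                                    # 4.2
y_bar = sum(y) / n                                    # 76.4
sum_xy = sum(xi * yi for xi, yi in zip(x, y))         # 3438
sum_x2 = sum(xi ** 2 for xi in x)                     # 205
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # slope, about 8.014
b0 = y_bar - b1 * x_bar                                          # intercept, about 42.741
print(round(b0, 3), round(b1, 3))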
2.2. The Predicted Value of y for a Given x Value, and the Residual e
An important function of the regression equation is that it enables us to predict the value of y for a given value of x. For example, if a student studies 6 hours, the model predicts that the score would be

ŷ = 42.741 + 8.014(6) = 90.8

You can thus predict the score for any number of hours studied. The difference between the predicted value ŷ for a given value of x and the observed value associated with that x value in the sample data is called the residual (or prediction error). For example, in the data when x = 6, the associated score y is 80. The residual is then

e = y − ŷ = 80 − 90.8 = −10.8

Now we compute all the predicted values and residuals for Model 2.
   y       x      ŷ = b0 + b1x     e = y − ŷ     e² = (y − ŷ)²
   52     2.5        62.78          −10.78          116.13
   56     1.0        50.76            5.24           27.51
   56     3.5        70.79          −14.79          218.75
   72     3.0        66.78            5.22           27.21
   72     4.5        78.80           −6.80           46.30
   80     6.0        90.83          −10.83          117.18
   88     5.0        82.81            5.19           26.92
   92     4.0        74.80           17.20          295.94
   96     5.5        86.82            9.18           84.31
  100     7.0        98.84            1.16            1.35
                                   Σe = 0.00      Σe² = 961.59
Note that the sum of squared residuals, or sum of squared errors (SSE), is:

SSE = Σe² = Σ(y − ŷ)² = 961.59

This value is the "least squares" mentioned above. There is no other line that would give you a smaller sum of squared errors. The least squares method of determining b0 and b1, the regression coefficients, guarantees that this sum will be the smallest possible (the least squares).
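Continuing the same sketch (again assumed Python rather than the chapter's worksheet), the predicted values, residuals, and SSE in the table above can be reproduced as follows.

# Predicted values, residuals, and SSE for Model 2 (continues the earlier sketch)
y_hat = [b0 + b1 * xi for xi in x]                   # predicted scores on the regression line
e = [yi - yhi for yi, yhi in zip(y, y_hat)]          # residuals e = y - y_hat; they sum to zero
sse = sum(ei ** 2 for ei in e)                       # sum of squared errors, about 961.59
print(round(sum(e), 2), round(sse, 2))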
2.3. Variance of Error and the Standard Error of Estimate
Note that the value of SSE is obtained by summing the squared deviations of the predicted values from the observed values of y. Using SSE we can obtain a summary measure similar to the variance and, as its square root, the standard deviation. The variance measure shows the average squared deviation of the observed y values from the regression line (ŷ). To compute this measure, denoted by var(e), divide the SSE by the degrees of freedom, which here is df = n − 2 = 8. This measure is also known as the Mean Square Error (MSE).

var(e) = MSE = Σ(y − ŷ)² / (n − 2)

The square root of var(e) is called the standard error of estimate and is denoted by se(e).

se(e) = √[Σe² / (n − 2)] = √[Σ(y − ŷ)² / (n − 2)] = √(SSE/df) = √MSE

se(e) = √[961.59 / (10 − 2)] = √120.198 = 10.964
In any given regression model, the more scattered the observed values of y around the regression line, the larger the se(e). If se(e) is large, we say that the regression line is not a good fit. The smaller the se(e), the better the fit. Compare the equation for se(e) to that of s, the standard deviation of y:

s = √[Σ(y − ȳ)² / (n − 1)]
These two equations are very similar. The standard error of estimate measures the deviations of the y values from the regression line. The standard deviation of y measures the deviations of the y values from the mean of y, ȳ. In the following diagram note that the mean ȳ is a horizontal line because there is a single value of ȳ for all values of x on the horizontal axis. Both s and se(e) measure deviations, or the degree of scatter, of the y values (the diamond-shaped markers) from a line: the standard error of estimate shows the average deviation of y from ŷ, and the standard deviation of y provides the average deviation of y from ȳ. Thus, the larger the se(e) and s values, the more scattered the data.
[Diagram: the observed y values scattered around both the upward-sloping regression line ŷ and the horizontal line ȳ.]
The standard error of estimate for Model 2 is:

se(e) = √(961.59 / 8) = 10.964
Compare this to the standard error for Model 1, where se(e) = 18.129 (this value is obtained directly, without going through the worksheet calculations, using the Excel function =STEYX(y range, x range)). Note that the standard error for Model 1 is significantly greater than that for Model 2. This indicates that the Model 2 regression line is a much better fit than Model 1.
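As a check on the worksheet numbers, the standard error of estimate can be computed directly from SSE. The sketch below (assumed Python; the one-step Excel equivalent is the =STEYX function mentioned above) continues from the previous code.

# Standard error of estimate for Model 2 (continues the earlier sketch)
import math
mse = sse / (len(y) - 2)        # variance of error var(e) = MSE, about 120.198
se_e = math.sqrt(mse)           # standard error of estimate, about 10.964
print(round(se_e, 3))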
2.4. Coefficient of Determination, R²

The fit of the regression line is a measure of the closeness of the relationship between x and y. The less scattered the observed y values are around the regression line, the closer the relationship between x and y. As explained above, se(e) is such a measure of the fit. However, se(e) has a major drawback. It is an absolute measure and, therefore, is affected by the absolute size, or the scale, of the data. The larger the values or scale of the data set, the larger the se(e).

To explain this drawback, consider the data in Model 2. Suppose the statistics test from which the scores are obtained is a 25-question multiple-choice test. For scoring purposes we can either assign 1 point to each question and measure the scores on a scale of 25, or assign 4 points to each question and measure the scores on a scale of 100. We can set up our Model 2 either way.
  Scores           Hours             Scores          Hours
  (Scale = 100)    Studied           (Scale = 25)    Studied
    52              2.5                13              2.5
    56              1.0                14              1.0
    56              3.5                14              3.5
    72              3.0                18              3.0
    72              4.5                18              4.5
    80              6.0                20              6.0
    88              5.0                22              5.0
    92              4.0                23              4.0
    96              5.5                24              5.5
   100              7.0                25              7.0
  se(e) = 10.964                      se(e) = 2.741
For the purpose of analyzing the impact of hours studied on test scores, it should make no difference which scale we use for test scores. But note that the standard error of estimate is higher when the scale is 100. Does this mean that the model is a better fit when the test score scale is 25? Of course not. Both versions of the model have exactly the same fit.

This discussion should make it clear that using se(e) as a measure of closeness of the fit suffers from the misleading impact of the absolute size or scale of the data used in the model. An alternative measure of the closeness of fit, which is not affected by the scale of the data, is the coefficient of determination, denoted by R² (r-square).

R-square measures the proportion of total variations in y (around the mean ȳ) explained by the regression (that is, by x). Mathematically, R² is the proportion of the total squared deviations of the y values from ȳ that is explained by the total squared deviations of the ŷ values (points on the regression line) from ȳ. To understand this statement consider the following diagram.
[Diagram: for the observation at x = 5.5, the observed value y = 96, the point on the regression line ŷ = 86.8, and the mean ȳ = 76.4. The total deviation (y − ȳ) is split into the explained deviation (ŷ − ȳ) and the unexplained deviation (y − ŷ).]
In the diagram, the horizontal line represents the mean of all the observed y values, ȳ = 76.4. The regression line is represented by the regression equation ŷ = 42.741 + 8.014x. A single observed value of y = 96 for a given x = 5.5 hours is selected. The vertical distance between this y value and ȳ is
π‘‡π‘œπ‘‘π‘Žπ‘™ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦 − 𝑦̅
𝑦 − 𝑦̅ = 96 − 76.4 = 19.6
The vertical distance between 𝑦̂ on the regression line and 𝑦̅ is
𝐸π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦̂ − 𝑦̅
𝑦̂ − 𝑦̅ = 86.8 − 76.4 = 10.4
As the diagram indicates, clearly this portion of the total deviation is due to (or explained by) the regression
model. That is, this deviation is explained by the independent variable π‘₯, hours of study.
The vertical distance between 𝑦 and 𝑦̂, the residual, is
π‘ˆπ‘›π‘’π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦 − 𝑦̂
𝑦 − 𝑦̂ = 96 − 86.6 = 9.2
Note that unexplained deviation is the familiar prediction error or residual.
Thus,
π‘‡π‘œπ‘‘π‘Žπ‘™ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝐸π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› + π‘ˆπ‘›π‘’π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘›
(𝑦 − 𝑦̅) = (𝑦̂ − 𝑦̅) + (𝑦 − 𝑦̂)
Repeating the same process for all values of y, squaring the resulting deviations, and summing the squared values, we have the following sums of squared deviations:

1. Sum of Squared Total Deviations
   Sum of Squares Total (SST): Σ(y − ȳ)²
2. Sum of Squared Explained Deviations
   Sum of Squares Regression (SSR): Σ(ŷ − ȳ)²
3. Sum of Squared Unexplained Deviations
   Sum of Squares Error (SSE): Σe² = Σ(y − ŷ)²

Mathematically and numerically it can be shown that

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

That is,

SST = SSR + SSE
The following worksheet for Model 2 shows that this equality holds:
   y      x       (y − ȳ)²     (ŷ − ȳ)²     (y − ŷ)²
   52    2.5       595.36       185.61       116.13
   56    1.0       416.16       657.65        27.51
   56    3.5       416.16        31.47       218.75
   72    3.0        19.36        92.48        27.21
   72    4.5        19.36         5.78        46.30
   80    6.0        12.96       208.09       117.18
   88    5.0       134.56        41.10        26.92
   92    4.0       243.36         2.57       295.94
   96    5.5       384.16       108.54        84.31
  100    7.0       556.96       503.52         1.35
         Σ(y − ȳ)² = 2798.40   Σ(ŷ − ȳ)² = 1836.81   Σ(y − ŷ)² = 961.59

Note that:

2798.40 = 1836.81 + 961.59
Rearranging the equation, we can write SSR as the difference between SST and SSE:

SSR = SST − SSE
Σ(ŷ − ȳ)² = Σ(y − ȳ)² − Σ(y − ŷ)²

Dividing both sides by SST, we have:

SSR/SST = SST/SST − SSE/SST = 1 − SSE/SST

Σ(ŷ − ȳ)² / Σ(y − ȳ)² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)²

As stated at the beginning of this discussion, R² measures the proportion of total deviations in y explained by the regression. Thus the left-hand side of the above equation is R²:

R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)² = SSR/SST

On the right-hand side of the equation, the ratio

Σ(y − ŷ)² / Σ(y − ȳ)² = SSE/SST

is the proportion of total deviations in y that is due to error or residual, that is, not explained by the regression. Thus the larger the ratio SSE/SST, the smaller R² will be.
For Model 2:

R² = SSR/SST = 1,836.81 / 2,798.40 = 0.6564

and

SSE/SST = 961.59 / 2,798.40 = 0.3436
Thus, when R² = 0.6564, nearly 66 percent of the variations or deviations in y, the test scores, are explained by the regression model, that is, by the independent variable x, the hours of study. The remaining 34 percent of the variations are due to other unexplained factors (you may call these factors the "unmeasurable attributes of an individual"). Note that if all the variations in y were explained by hours studied, then R² = 1. Thus, the values of R² vary from 0 to 1:

0 ≤ R² ≤ 1
Using Excel function RSQ(𝑦 range, π‘₯ range), we can find the 𝑅2 for Model 1:
For Model 1,
𝑅2 = 0.0605
As expected, 𝑅2 for Model 1 is near zero. There is, practically, no relationship between student height and
statistics test score.
Also note that the value of 𝑅2 is not affected by the scale of the data. You can check this for Model 2 using
Excel with the scores based on the scale of 25.
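The decomposition SST = SSR + SSE and the resulting R² can be verified in the same way. The sketch below (assumed Python, continuing from the earlier code) also rescales the scores to the 25-point scale to confirm that R² is unchanged, which is the same check suggested with the Excel =RSQ function.

# R-square for Model 2, and a check that it is unaffected by the score scale (assumed sketch)
sst = sum((yi - y_bar) ** 2 for yi in y)             # about 2798.40
ssr = sum((yhi - y_bar) ** 2 for yhi in y_hat)       # about 1836.81
print(round(ssr / sst, 4))                           # R-square, about 0.6564

def r_square(y_vals, x_vals):
    # R-square from a fresh least squares fit: 1 - SSE/SST
    n = len(x_vals)
    xb, yb = sum(x_vals) / n, sum(y_vals) / n
    b1 = (sum(a * b for a, b in zip(x_vals, y_vals)) - n * xb * yb) / \
         (sum(a * a for a in x_vals) - n * xb ** 2)
    b0 = yb - b1 * xb
    sse = sum((yv - (b0 + b1 * xv)) ** 2 for yv, xv in zip(y_vals, x_vals))
    sst = sum((yv - yb) ** 2 for yv in y_vals)
    return 1 - sse / sst

y_25 = [yi / 4 for yi in y]                          # same test rescored on a 25-point scale
print(round(r_square(y, x), 4), round(r_square(y_25, x), 4))   # both about 0.6564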
3. STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION
REGRESSION
Note that to check the validity of the proposition that there is a relationship between the test scores and the number of hours studied, we used the data from a sample of 10 students. Using the sample data we obtained the sample regression equation, the general form of which is

ŷ = b0 + b1x

To obtain the sample regression equation we have to determine the slope and the vertical intercept of the regression line from sample data. The sample regression equation is thus an estimate of the population regression equation. To construct the population regression equation we need to obtain the population slope and the population vertical intercept. But since we do not have access to the population data, we use the slope (b1) and the vertical intercept (b0) determined from the sample data as estimates of the population slope β1 and the population vertical intercept β0. This way the sample regression line becomes the estimator of the population regression line:

ŷ = β0 + β1x

Going back to the statistical inference for the population mean, we used the sample statistic x̄ as an estimator of the population parameter μ. Using x̄ we built a confidence interval for μ or performed a test of hypothesis. Similarly, in regression, we use the sample statistic b0 as the estimator of the population parameter β0, and the sample statistic b1 as the estimator of the population parameter β1. Using the two sample statistics in regression we can build confidence intervals or perform tests of hypotheses for the two population parameters.
3.1. Confidence Interval for Population Slope Parameter β1

To see the similarities between the confidence interval for μ, the population mean, and that for β1, first consider the formula for the confidence interval for μ from Chapter 5:

Confidence interval for μ:   L, U = x̄ ± t_{α/2, n−1} se(x̄)

where se(x̄) = s/√n. Note that the interval is built around x̄ using the margin of error, MOE = t_{α/2, n−1} se(x̄). The confidence interval for β1 has the same general characteristics. It is built around the sample statistic b1 with ±MOE. The MOE in all confidence intervals is always equal to the t score (or the z score) times the standard error of the relevant sample statistic. The confidence interval formula for β1 is then:

Confidence interval for β1:   L, U = b1 ± t_{α/2, n−2} se(b1)

Note that here the t score involves n − 2 degrees of freedom.² The term se(b1) is the standard error of the sampling distribution of b1. The formula for se(b1) is:

Standard error of the sampling distribution of b1:   se(b1) = se(e) / √Σ(x − x̄)²
3.1.1. More About the Sampling Distribution of b1
You should clearly recognize the fact that b1 is a summary characteristic obtained from a random sample. It is, therefore, like x̄, a sample statistic. In the discussion of the concept of sampling distribution in Chapter 4 we learned that the number of samples of size n obtainable from a parent population is infinite. There is, thus, an infinite number of sample statistics such as x̄ that one may obtain from these samples. Since the values of x̄ are obtained from randomly selected samples, x̄ is a random variable. The probability distribution of this random variable, we learned, is called a sampling distribution. We also learned that the center of gravity, or the mean, of the sample statistics is the corresponding parameter in the parent population. With respect to x̄, it was explained, the mean of the means is the population mean μ. Also, the measure of dispersion of the values of the sample statistic around their center of gravity is called the standard error. Thus, the measure of dispersion of the x̄ values around μ is se(x̄). Finally, in order to apply the sampling distribution of x̄ in statistical inference, the x̄ values must have a normal distribution.

We can apply the same concepts to the sample statistic b1. We can obtain an infinite number of b1 values from the infinite number of random samples. This makes b1 a random variable with a sampling distribution. The expected value, the center of gravity, or the mean of the b1 values is equal to the parameter of the parent population, β1. And the measure of dispersion of the b1 values is the standard error of b1, se(b1). Furthermore, in order to apply the sampling distribution of b1 for statistical inference, the b1 values must be normally distributed.³ The following diagram shows the similarities between the sampling distribution of x̄ and the sampling distribution of b1.
² The regression equation is obtained by estimating two population parameters, β0 and β1. For each population parameter estimated, we lose one degree of freedom.
³ In order for the sampling distribution of b1 to be normal, certain conditions must be present. They are not relevant to the discussion here. We assume these conditions are present for our discussion.
[Diagram: the sampling distribution of x̄ is centered at E(x̄) = μ; likewise, the sampling distribution of b1 is centered at E(b1) = β1.]
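To make the sampling distribution of b1 concrete, one can simulate it: draw many random samples from a population whose slope β1 is known, fit the least squares line to each sample, and examine how the resulting b1 values distribute around β1. The sketch below is purely illustrative (assumed Python with numpy; the population values β0 = 40, β1 = 8, and an error standard deviation of 11 are made-up assumptions, not taken from the text).

# Illustrative simulation (assumed values): the b1 values center on the population slope beta1
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 40.0, 8.0, 11.0, 10     # assumed population parameters and sample size
slopes = []
for _ in range(5000):                            # many random samples of size n
    x_s = rng.uniform(1.0, 7.0, n)               # hours studied
    y_s = beta0 + beta1 * x_s + rng.normal(0.0, sigma, n)
    slopes.append(np.polyfit(x_s, y_s, 1)[0])    # least squares slope b1 for this sample
print(np.mean(slopes), np.std(slopes))           # the mean is close to beta1 = 8: E(b1) = beta1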
Now, to build a 95% confidence interval for the population slope parameter in Model 2, in addition to the following quantities, we need to compute the standard error of b1:

b1 = 8.014
t_{α/2, n−2} = t_{0.025, 8} = 2.306
se(e) = 10.964
x̄ = 4.2

The following is the computation of Σ(x − x̄)², used in the denominator of se(b1).
π‘₯
2.5
1.0
3.5
3.0
4.5
6.0
5.0
4.0
5.5
7.0
∑(π‘₯ − π‘₯Μ…)2 =
se(𝑏1 ) =
10.964
√28.6
(π‘₯ − π‘₯Μ… )2
2.89
10.24
0.49
1.44
0.09
3.24
0.64
0.04
1.69
7.84
28.60
= 2.05
𝑀𝑂𝐸 = 𝑑𝛼⁄2,(𝑛−2) se(𝑏1 ) = 2.306(2.05) = 4.727
𝐿 = 𝑏1 − 𝑀𝑂𝐸 = 8.014 − 4.727 = 3.287
π‘ˆ = 𝑏1 + 𝑀𝑂𝐸 = 8.014 + 4.727 = 12.741
We are 95% confident that the population slope parameter 𝛽1 is between 3.287 and 12.741.
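The interval just computed can be verified in code. The following minimal sketch (assumed Python; scipy supplies the t value that the chapter reads from the t table) continues from the earlier Model 2 sketch, where x is the list of hours studied.

# 95% confidence interval for beta1 in Model 2 (assumed sketch)
import math
from scipy import stats

b1, se_e, x_bar, n = 8.014, 10.964, 4.2, 10
sxx = sum((xi - x_bar) ** 2 for xi in x)         # sum of squared deviations of x, about 28.6
se_b1 = se_e / math.sqrt(sxx)                    # standard error of b1, about 2.05
t_val = stats.t.ppf(0.975, n - 2)                # t(0.025, 8), about 2.306
moe = t_val * se_b1                              # margin of error, about 4.73
print(round(b1 - moe, 3), round(b1 + moe, 3))    # roughly 3.29 and 12.74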
3.2. Test of Hypothesis for Population Slope Parameter β1
Recall that to perform a test of hypothesis about the population parameters μ or π we stated a null and an alternative hypothesis and then compared a test statistic to a critical value. There the test could be either a two-tail, a lower-tail, or an upper-tail test. In performing a two-tail test for, say, μ, the critical value is t_{α/2, n−1} and the test statistic is t = (x̄ − μ0)/se(x̄). In regression analysis, performing a test of hypothesis for the population slope parameter β1, as you will see, is less complicated. The test is nearly always a two-tail test. Here is why!

In regression analysis we want to determine whether there is a relationship between x and y. If there is a relationship, then the variation in the value of y in response to changes in x is reflected in the slope of the regression line. The slope shows the change in the value of y per unit change in x. In the population regression equation, the slope is β1 = Δy/Δx. If there is no relationship between x and y, then there is no change in y in response to changes in x. Thus, the slope is zero. In performing a test of hypothesis about β1, the null hypothesis is that the slope is zero. Using inferential statistics, we want to reject the null hypothesis, to provide significant evidence that the slope is not zero.
The Null and Alternative Hypotheses for β1:

H0: β1 = 0
H1: β1 ≠ 0

To perform the test we need a critical value, which is

The Critical Value:   t_{α/2, n−2}

The test statistic for the hypothesis test has the same format as that for μ. The test statistic is again a t value, the numerator of which is the difference between the sample statistic b1 and the hypothesized value of the population parameter β1, and the denominator of which is the standard error of b1, se(b1):

t = (b1 − (β1)0) / se(b1)

However, note that the null hypothesis states that β1 = 0. Therefore, the test statistic simplifies as follows:

The Test Statistic:   t = b1 / se(b1)

To reject the null hypothesis that β1 = 0, the test statistic must exceed the critical value (in absolute value):

To reject the null hypothesis:   |t| > t_{α/2, n−2}
Example

Perform a test of hypothesis for the population slope parameter in Model 2.

H0: β1 = 0
H1: β1 ≠ 0

The critical value is:

t_{α/2, n−2} = t_{0.025, 8} = 2.306

The test statistic is:

t = b1 / se(b1) = 8.014 / 2.05 = 3.910

Since the test statistic exceeds the critical value, reject the null hypothesis that the population slope parameter is zero.
We can also use the test statistic to obtain the probability value. Using the Excel function T.DIST.2T, we compute 2 × P(t > |TS|).
=T.DIST.2T(x, deg_freedom)
=T.DIST.2T(3.91, 8) = 0.0045

This is a very small probability value. Therefore, H0: β1 = 0 is rejected for any α > 0.005.
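The test statistic and the probability value can be reproduced the same way. The sketch below (assumed Python; scipy's t.sf plays the role of Excel's =T.DIST.2T) continues from the confidence interval sketch above.

# Two-tail test of H0: beta1 = 0 for Model 2 (continues the earlier sketch)
from scipy import stats

t_stat = b1 / se_b1                              # about 3.91
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)     # two-tail prob value, about 0.0045
print(round(t_stat, 3), round(p_value, 4))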
4. USING EXCEL FOR REGRESSION ANALYSIS
Determining the regression equation and all the related analysis is a cumbersome process involving many calculations. In the discussion of regression above we went through all the calculations to explain the concepts. When conducting research, however, using a computer program is essential. The Excel spreadsheet provides a simple process for performing all the calculations shown above in one swoop. The following explains the steps in Excel:
1. Enter the data for the y and x variables.
2. Click on Data, then Data Analysis. Locate Regression in the provided list.
3. Click the box labeled Input Y Range and then select the cells containing the y data. Do the same in the box labeled Input X Range. Choose where you want Excel to show the output on the worksheet by clicking the box labeled Output Range and selecting the cell where you want the top left corner of the output to be printed.
4. Click OK and the following output will appear.
SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.8102
  R Square             0.6564
  Adjusted R Square    0.6134
  Standard Error      10.9635
  Observations        10

ANOVA
                 df     SS           MS           F          Significance F
  Regression      1     1836.8056    1836.806     15.2813    0.004487
  Residual        8      961.5944     120.1993
  Total           9     2798.4

                 Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
  Intercept        42.74126        9.28207       4.60471    0.00174    21.33676    64.14575
  X variable 1      8.01399        2.05007       3.90913    0.00449     3.28652    12.74145

4.1. Understanding the Computer Printout
Not all the items on the printout are familiar to you. Here we will consider only those we have discussed.

1. The vertical intercept b0: Shown under the column "Coefficients", in the row labeled "Intercept". b0 = 42.74126

2. The slope b1: Shown under the column "Coefficients", in the row labeled "X variable 1". b1 = 8.01399

3. Standard error of estimate se(e): Shown under "Regression Statistics", labeled "Standard Error". se(e) = √[Σ(y − ŷ)² / (n − 2)] = 10.9635

4. Sum of Squares Regression (SSR): Shown under the column labeled "SS" (meaning Sum of Squares) in the row labeled "Regression". SSR = Σ(ŷ − ȳ)² = 1836.8056

5. Sum of Squares Error (SSE): Shown at the intersection of column "SS" and row "Residual". SSE = Σ(y − ŷ)² = 961.5944

6. var(e): When SSE is divided by the degrees of freedom (the df associated with Residual, df = n − 2 = 8), the result is called the Mean Square Error (MSE), which is also known as var(e). This figure is shown under the column labeled "MS". Note that var(e) = MSE = Σ(y − ŷ)² / (n − 2) = SSE/df = 961.5944 / 8 = 120.1993

7. Sum of Squares Total (SST): Shown at the intersection of column "SS" and row "Total". SST = Σ(y − ȳ)² = 2798.4

8. R-Square (R²): Shown under "Regression Statistics". R² = SSR/SST = 0.65638

9. Standard error of b1, se(b1): Shown at the intersection of column "Standard Error" and row "X variable 1". se(b1) = se(e) / √Σ(x − x̄)² = 2.05007

10. 95% confidence interval for β1: Shown at the intersections of columns "Lower 95%" and "Upper 95%" with the row "X variable 1".
    L, U = b1 ± t_{α/2, df} se(b1)
    L = 8.01399 − 2.306(2.05007) = 3.28652
    U = 8.01399 + 2.306(2.05007) = 12.74145

11. Test statistic for H0: β1 = 0: Shown at the intersection of column "t Stat" and row "X variable 1".
    TS = b1 / se(b1) = 8.01399 / 2.05007 = 3.90913
12. Prob Value: Recall that an alternative approach to the test of hypothesis is the prob-value approach. In Chapter 6, in the discussion of the two-tail test of hypothesis for μ, it was stated that you reject the null hypothesis if the prob value is less than α (where α is the level of significance of the test). For a two-tail test the prob value is 2 × P(t > |TS|). The same argument applies to the test of hypothesis in the regression analysis. Let α = 0.05. Excel computes 2 × P(t > TS) = 2 × P(t > 3.90913) = 0.00449.⁴ This is shown at the intersection of column "P-value" and row "X variable 1". Note that this is a two-tail test. The prob value shown in the computer output is the area under the two tails of the t curve. Since 0.00449 < α = 0.05, reject the null hypothesis that the population slope β1 = 0.

⁴ The Excel command is =T.DIST.2T(x, deg_freedom).
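Outside of Excel, the same summary output can be produced in a single step with a statistics library. The sketch below (assumed Python with statsmodels, not part of the chapter) reports the coefficients, standard errors, t statistics, P-values, R Square, the ANOVA sums of squares, and the 95% confidence limits discussed above.

# Reproducing the regression output in code (assumed Python/statsmodels sketch)
import statsmodels.api as sm

hours  = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
scores = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]

X = sm.add_constant(hours)            # adds the intercept column
model = sm.OLS(scores, X).fit()       # ordinary least squares fit
print(model.summary())                # coefficients, standard errors, t stats, P-values, R-square
print(model.conf_int(alpha=0.05))     # Lower 95% and Upper 95% limits for b0 and b1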