CHAPTER 7  REGRESSION

1. INTRODUCTION
2. THE REGRESSION EQUATION
   2.1. The Least Squares Regression Line
   2.2. The Predicted Value of y for a Given x Value, and the Residual e
   2.3. Variance of Error and the Standard Error of Estimate
   2.4. Coefficient of Determination, R²
3. STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION REGRESSION
   3.1. Confidence Interval for Population Slope Parameter β₁
        3.1.1. More About the Sampling Distribution of b₁
   3.2. Test of Hypothesis for Population Slope Parameter β₁
   3.3. Confidence Interval for Predicted Value of y for a Given x
   3.4. Confidence Interval for Mean Value of y for a Given x
4. USING EXCEL FOR REGRESSION ANALYSIS
   4.1. Understanding the Computer Regression Output

1. INTRODUCTION

To explain regression simply, suppose you want to find out what factor affects students' grades on the statistics departmental common final. What determines the variations in test scores? Why do some students have higher scores than others? Suppose a friend offers the explanation that the variations in scores are related to students' heights. Another friend proposes that score variations are related to the number of hours a student studies for the test. Your task is to find out which theory is more realistic (duh!).

Here we have two models before us attempting to explain the variations in student statistics test scores. Each model consists of two variables: the dependent variable and the independent variable. In both models the dependent variable (also called the explained variable) is the test score. In Model 1, however, the independent variable (also called the explanatory variable) is student height, and in Model 2 the independent variable is hours studied.

Suppose you select a random sample of 10 students. You obtain the departmental final scores for the dependent variable. For the independent variable in Model 1 you measure their heights. For Model 2, you ask the students to state as accurately as possible the number of hours they studied for the test. The following are the hypothetical data for each model:

    Model 1                              Model 2
    Dependent    Independent             Dependent    Independent
    y: Score     x: Height (inches)      y: Score     x: Hours Studied
       52            72                     52            2.5
       56            65                     56            1.0
       56            70                     56            3.5
       72            74                     72            3.0
       72            64                     72            4.5
       80            62                     80            6.0
       88            71                     88            5.0
       92            75                     92            4.0
       96            74                     96            5.5
      100            69                    100            7.0

Your task is to find out which model better explains the variations in student scores. You cannot detect the influence (if any) simply by looking at the numbers. In other words, it is hard to see any pattern or association between differences in scores and differences in either height or hours studied. A visual aid is much more descriptive than the plain numbers. The visual aid in regression is called the scatter diagram. The following are the scatter diagrams for the two models. In each model the independent variable is measured on the horizontal axis and the dependent variable on the vertical axis. The scatter diagram for Model 1 shows that there is no relationship between student height and score, because there is no recognizable pattern. But in the scatter diagram for Model 2 there is a recognizable pattern showing that, in general, scores increase with the number of hours studied.

[Figure: Two scatter diagrams. Model 1 plots test score against student height and shows no recognizable pattern; Model 2 plots test score against hours studied and shows scores rising with hours.]
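The scatter diagrams are straightforward to reproduce in software. The following is a minimal Python sketch (an illustration only; it assumes the matplotlib library is available and is not part of this chapter's Excel workflow):

    import matplotlib.pyplot as plt

    # Hypothetical sample data for the 10 students
    score  = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]
    height = [72, 65, 70, 74, 64, 62, 71, 75, 74, 69]              # Model 1: x = height (inches)
    hours  = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]    # Model 2: x = hours studied

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(height, score)      # no recognizable pattern
    ax1.set(title="Model 1", xlabel="Student height", ylabel="Test score")
    ax2.scatter(hours, score)       # scores rise with hours studied
    ax2.set(title="Model 2", xlabel="Hours studied", ylabel="Test score")
    plt.show()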
2. THE REGRESSION EQUATION

To determine a more precise depiction of the relationship between the dependent variable y and the independent variable x, we need to describe the relationship as a mathematical equation. This equation is derived as the equation of the line that fits the scatter diagram best. Regression analysis provides the tools for fitting a regression line onto the scatter diagram.

To draw any line in the xy plane, you must have a vertical intercept and a slope. The general equation for a straight line is the following:

    y = b₀ + b₁x

Here b₀ represents the vertical intercept and b₁ the slope of the line. The slope represents the change in the value of y per unit change in x:

    b₁ = Δy/Δx

[Figure: A straight line in the xy plane, with the rise Δy and the run Δx marked along the line.]

One can fit a line to the scatter diagram manually. There are many possible lines that could be fitted in this manner. However, there is a mathematical approach to fitting the most accurate, or best-fitting, line. This method is called the least squares method. In explaining the method, you will see why it is called least squares. We will use the data for Model 2 to explain how to find the regression equation.

The model we are dealing with in this discussion is called a simple linear regression model. It is "simple" because there is only one independent variable. If a model contains more than one independent variable, it is called a multiple regression model. For example, in your model explaining the student scores, in addition to the number of hours studied, you might include the students' SAT scores as a second independent variable. The current model is also a "linear" regression model because the regression equation provides a straight regression line; the line is not curved. The general form of a simple linear regression equation is as follows:

    ŷ = b₀ + b₁x

Note that in the regression equation we use ŷ (y-hat) rather than y. In regression models, the symbol y (hatless) represents the observed values of the dependent variable, the actual values observed in the sample. The distinction between y and ŷ will become apparent below.

2.1. The Least Squares Regression Line

The mathematical method used to obtain the regression line is called the least squares method because, with the resulting regression line, the sum of the squared vertical distances between the observed y values and the regression line is minimized (is the least). In the following diagram, the diamond-shaped markers represent the y values observed in the sample. For each value of x (hours) there is an observed value of y (actual score). The circular markers on the regression line represent the predicted values. Once you find the regression equation and draw the regression line, for each value of x there will be a corresponding predicted value of y on the line, which we denote by ŷ.

[Figure: Observed y values (diamonds) scattered around the regression line; circles on the line mark the predicted values ŷ, and the vertical gap e = y − ŷ is labeled for one observation.]

The vertical distance between the observed value (y) and the predicted value (ŷ) is called the prediction error and is denoted by e: e = y − ŷ. Squaring the error terms and summing them, we obtain the sum of squared errors:

    Σe² = Σ(y − ŷ)²

The least squares line assures that this sum of squares is minimized. You cannot find any other line that would provide a smaller sum of squared errors than the least squares line. How do you obtain the least squares regression line? As explained, the equation for any straight line is obtained by determining the vertical intercept and the slope.
In simple linear regression, the slope and vertical intercept are obtained using the following formulas:

    Slope:               b₁ = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²)
    Vertical Intercept:  b₀ = ȳ − b₁x̄

Using the data in Model 2 we can now determine the values of b₁ and b₀.

       y          x         xy         x²
      52        2.5        130        6.25
      56        1.0         56        1.00
      56        3.5        196       12.25
      72        3.0        216        9.00
      72        4.5        324       20.25
      80        6.0        480       36.00
      88        5.0        440       25.00
      92        4.0        368       16.00
      96        5.5        528       30.25
     100        7.0        700       49.00
    ȳ = 76.4   x̄ = 4.2   Σxy = 3438  Σx² = 205.00

Computing x̄ = 4.2 and ȳ = 76.4, we can fill in the values in the formulas:

    b₁ = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²) = (3,438 − 10(4.2)(76.4)) / (205 − 10(4.2²)) = 8.014

    b₀ = ȳ − b₁x̄ = 76.4 − 8.014(4.2) = 42.741

The least squares regression equation for Model 2 is then:

    ŷ = 42.741 + 8.014x

What does this equation imply? The slope value of (rounded) 8.0 means that for each additional hour of study the model predicts that the score will increase by 8 points. The vertical intercept of (rounded) 43 means that the model predicts that if a student did not study at all the score would be 43.

2.2. The Predicted Value of y for a Given x Value, and the Residual e

An important function of the regression equation is that it enables us to predict the value of y for a given value of x. For example, if a student studies 6 hours, the model predicts that the score would be

    ŷ = 42.741 + 8.014(6) = 90.8

You can thus predict the score for any number of hours studied. The difference between the predicted value ŷ for a given value of x and the observed value associated with that x value in the sample data is called the residual (or prediction error). For example, in the data, when x = 6 the associated score y is 80. The residual is then

    e = y − ŷ = 80 − 90.8 = −10.8

Now compute all the predicted values and residuals for Model 2.

       y        x      ŷ = b₀ + b₁x    e = y − ŷ    e² = (y − ŷ)²
      52      2.5         62.78         -10.78         116.13
      56      1.0         50.76           5.24          27.51
      56      3.5         70.79         -14.79         218.75
      72      3.0         66.78           5.22          27.21
      72      4.5         78.80          -6.80          46.30
      80      6.0         90.83         -10.83         117.18
      88      5.0         82.81           5.19          26.92
      92      4.0         74.80          17.20         295.94
      96      5.5         86.82           9.18          84.31
     100      7.0         98.84           1.16           1.35
                                       Σe = 0.00     Σe² = 961.59

Note that the sum of squared residuals, or sum of squared errors (SSE), is:

    SSE = Σe² = Σ(y − ŷ)² = 961.59

This value is the "least squares" mentioned above. There is no other line that would give you a smaller sum of squared errors. The least squares method of determining b₀ and b₁, the regression coefficients, guarantees that this sum will be the smallest possible (the least squares).

2.3. Variance of Error and the Standard Error of Estimate

Note that the value of SSE is obtained by summing the squared deviations of the predicted from the observed values of y. Using SSE we can obtain a summary measure similar to the variance and, through its square root, a measure similar to the standard deviation. The variance measure shows the average squared deviation of the observed y values from the regression line (ŷ). To compute this measure, denoted by var(e), divide the SSE by the degrees of freedom, which here is df = n − 2 = 8. This measure is also known as the Mean Square Error (MSE).

    var(e) = MSE = SSE/(n − 2) = Σ(y − ŷ)²/(n − 2)

The square root of var(e) is called the standard error of estimate, and is denoted by se(e).
    se(e) = √[Σe²/(n − 2)] = √[Σ(y − ŷ)²/(n − 2)] = √(SSE/df) = √MSE

    se(e) = √[961.59/(10 − 2)] = √120.198 = 10.964

In any given regression model, the more scattered the observed values of y around the regression line, the larger the se(e). If se(e) is large, we say that the regression line is not a good fit. The smaller the se(e), the better the fit. Compare the equation for se(e) to that of s, the standard deviation of y:

    s = √[Σ(y − ȳ)²/(n − 1)]

These two equations are very similar. The standard error of estimate measures the deviations of the y values from the regression line. The standard deviation of y measures the deviations of the y values from the mean of y (ȳ). In the following diagram, note that the mean ȳ is a horizontal line because there is a single value of ȳ for all values of x on the horizontal axis. Both s and se(e) measure deviations, or the degree of scatter, of the y values (the diamond-shaped markers) from a line: the standard error of estimate shows the average deviation of y from ŷ, and the standard deviation of y provides the average deviation of y from ȳ. Thus, the larger the se(e) and s values, the more scattered the data.

[Figure: The same scatter of y values with two reference lines, the regression line ŷ and the horizontal mean line ȳ.]

The standard error of estimate for Model 2 is:

    se(e) = √(961.59/8) = 10.964

Compare this to the standard error for Model 1, where se(e) = 18.129.¹ Note that the standard error for Model 1 is significantly greater than that for Model 2. This indicates that the Model 2 regression line is a much better fit than the Model 1 line.

¹ This value is obtained directly, without going through the worksheet calculations, using the Excel function =STEYX(y range, x range).

2.4. Coefficient of Determination, R²

The fit of the regression line is a measure of the closeness of the relationship between x and y. The less scattered the observed y values are around the regression line, the closer the relationship between x and y. As explained above, se(e) is such a measure of the fit. However, se(e) has a major drawback. It is an absolute measure and, therefore, is affected by the absolute size, or the scale, of the data: the larger the values or scale of the data set, the larger the se(e). To explain this drawback, consider the data in Model 2. Suppose the statistics test from which the scores are obtained is a 25-question multiple-choice test. For scoring purposes we can either assign 1 point to each question and measure the scores on a scale of 25, or assign 4 points to each question and measure the scores on a scale of 100. We can set up Model 2 either way.

    Scores           Hours        Scores          Hours
    (Scale = 100)    Studied      (Scale = 25)    Studied
       52             2.5            13            2.5
       56             1.0            14            1.0
       56             3.5            14            3.5
       72             3.0            18            3.0
       72             4.5            18            4.5
       80             6.0            20            6.0
       88             5.0            22            5.0
       92             4.0            23            4.0
       96             5.5            24            5.5
      100             7.0            25            7.0
    se(e) = 10.964                 se(e) = 2.741

For the purpose of analyzing the impact of hours studied on test scores, it should make no difference which scale we use. But note that the standard error of estimate is larger when the scale is 100. Does this mean that the model is a better fit when the test score scale is 25? Of course not. Both versions of the model have exactly the same fit. This discussion should make it clear that using se(e) as a measure of closeness of fit suffers from the misleading impact of the absolute size or scale of the data used in the model.
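The worksheet calculations of Sections 2.1 through 2.3 are easy to reproduce in code. The following is a minimal Python sketch (an illustration only, not part of the chapter's Excel workflow) that computes b₁, b₀, SSE, and se(e) for Model 2 directly from the least squares formulas:

    import math

    y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]            # observed scores
    x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]   # hours studied
    n = len(y)

    x_bar = sum(x) / n        # 4.2
    y_bar = sum(y) / n        # 76.4

    # Least squares formulas for the slope and vertical intercept
    b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
         (sum(xi**2 for xi in x) - n * x_bar**2)              # 8.014
    b0 = y_bar - b1 * x_bar                                   # 42.741

    # Predicted values, residuals, SSE, and the standard error of estimate
    y_hat = [b0 + b1 * xi for xi in x]
    sse = sum((yi - yhi)**2 for yi, yhi in zip(y, y_hat))     # 961.59
    se_e = math.sqrt(sse / (n - 2))                           # 10.964

    print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, SSE = {sse:.2f}, se(e) = {se_e:.3f}")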
An alternative measure of the closeness of fit, one that is not affected by the scale of the data, is the coefficient of determination, denoted by R² (r-square). R² measures the proportion of the total variation in y (around the mean ȳ) that is explained by the regression (that is, by x). Mathematically, R² is the proportion of the total squared deviations of the y values from ȳ that is accounted for by the squared deviations of the ŷ values (points on the regression line) from ȳ. To understand this statement, consider the following diagram.

[Figure: For the observation y = 96 at x = 5.5, the total deviation from ȳ = 76.4 splits into the explained deviation (from ȳ up to ŷ = 86.8 on the regression line) and the unexplained deviation (from ŷ up to y).]

In the diagram, the horizontal line represents the mean of all the observed y values, ȳ = 76.4. The regression line is represented by the regression equation ŷ = 42.741 + 8.014x. A single observed value of y = 96 for a given x = 5.5 hours is selected. The vertical distance between this y value and ȳ is

    Total Deviation = y − ȳ
    y − ȳ = 96 − 76.4 = 19.6

The vertical distance between ŷ on the regression line and ȳ is

    Explained Deviation = ŷ − ȳ
    ŷ − ȳ = 86.8 − 76.4 = 10.4

As the diagram indicates, this portion of the total deviation is due to (or explained by) the regression model. That is, this deviation is explained by the independent variable x, hours of study. The vertical distance between y and ŷ, the residual, is

    Unexplained Deviation = y − ŷ
    y − ŷ = 96 − 86.8 = 9.2

Note that the unexplained deviation is the familiar prediction error, or residual. Thus,

    Total Deviation = Explained Deviation + Unexplained Deviation
    (y − ȳ) = (ŷ − ȳ) + (y − ŷ)

Repeating the same process for all values of y, squaring the resulting deviations, and summing the squared values, we have the following sums of squared deviations:

1. Sum of Squared Total Deviations, or Sum of Squares Total (SST): Σ(y − ȳ)²
2. Sum of Squared Explained Deviations, or Sum of Squares Regression (SSR): Σ(ŷ − ȳ)²
3. Sum of Squared Unexplained Deviations, or Sum of Squares Error (SSE): Σe² = Σ(y − ŷ)²

Mathematically and numerically it can be shown that

    Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

That is,

    SST = SSR + SSE

The following worksheet for Model 2 shows that this equality holds:

       y        x      (y − ȳ)²     (ŷ − ȳ)²     (y − ŷ)²
      52      2.5       595.36       185.61       116.13
      56      1.0       416.16       657.65        27.51
      56      3.5       416.16        31.47       218.75
      72      3.0        19.36        92.48        27.21
      72      4.5        19.36         5.78        46.30
      80      6.0        12.96       208.09       117.18
      88      5.0       134.56        41.10        26.92
      92      4.0       243.36         2.57       295.94
      96      5.5       384.16       108.54        84.31
     100      7.0       556.96       503.52         1.35
                     Σ(y − ȳ)²    Σ(ŷ − ȳ)²    Σ(y − ŷ)²
                     = 2798.40    = 1836.81    = 961.59

Note that:

    2798.40 = 1836.81 + 961.59

Rearranging the equation, we can write SSR as the difference between SST and SSE:

    SSR = SST − SSE
    Σ(ŷ − ȳ)² = Σ(y − ȳ)² − Σ(y − ŷ)²

Dividing both sides by SST, we have:

    SSR/SST = SST/SST − SSE/SST = 1 − SSE/SST
    Σ(ŷ − ȳ)²/Σ(y − ȳ)² = 1 − Σ(y − ŷ)²/Σ(y − ȳ)²

As stated at the beginning of this discussion, R² measures the proportion of total deviations in y explained by the regression. Thus the left-hand side of the above equation is R²:

    R² = Σ(ŷ − ȳ)²/Σ(y − ȳ)² = SSR/SST

On the right-hand side of the equation, the ratio

    Σ(y − ŷ)²/Σ(y − ȳ)² = SSE/SST

is the proportion of total deviations in y that is due to error, or residual, that is, not explained by the regression. Thus the larger the ratio SSE/SST, the smaller R² will be.
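As a quick check, the decomposition and R² can be computed in a few lines of Python, continuing the earlier sketch (again an illustration; it assumes the y, y_hat, and n variables defined there):

    # Continuing from the earlier sketch: y, y_hat, and n are already defined.
    y_bar = sum(y) / n

    sst = sum((yi - y_bar)**2 for yi in y)                  # 2798.40
    ssr = sum((yhi - y_bar)**2 for yhi in y_hat)            # 1836.81
    sse = sum((yi - yhi)**2 for yi, yhi in zip(y, y_hat))   # 961.59

    assert abs(sst - (ssr + sse)) < 1e-6                    # SST = SSR + SSE
    r_squared = ssr / sst                                   # 0.6564
    print(f"R-square = {r_squared:.4f}")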
For Model 2:

    R² = SSR/SST = 1836.81/2798.40 = 0.6564

and

    SSE/SST = 961.59/2798.40 = 0.3436

Thus, when R² = 0.6564, nearly 66 percent of the variation in y, the test scores, is explained by the regression model, that is, by the independent variable x, the hours of study. The remaining 34 percent of the variation is due to other, unexplained factors (you may call these factors the "unmeasurable attributes of an individual"). Note that if all the variation in y were explained by hours studied, then R² = 1. Thus, the values of R² vary from 0 to 1:

    0 ≤ R² ≤ 1

The closer to 0, the weaker the relationship between x and y; the closer to 1, the stronger the relationship. Using the Excel function =RSQ(y range, x range), we can find the R² for Model 1:

    For Model 1, R² = 0.0605

As expected, R² for Model 1 is near zero. There is practically no relationship between student height and statistics test score. Also note that the value of R² is not affected by the scale of the data. You can check this for Model 2 using Excel with the scores based on the scale of 25.

3. STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION REGRESSION

Note that to check the validity of the proposition that there is a relationship between test scores and number of hours studied, we used the data from a sample of 10 students. Using the sample data we obtained the sample regression equation, the general form of which is

    ŷ = b₀ + b₁x

To obtain the sample regression equation we have to determine the slope and the vertical intercept of the regression line from sample data. The sample regression equation is thus an estimate of the population regression equation. To construct the population regression equation we would need the population slope and the population vertical intercept. But since we do not have access to the population data, we use the slope (b₁) and the vertical intercept (b₀) determined from the sample data as estimates of the population slope β₁ and the population vertical intercept β₀. This way the sample regression line becomes the estimator of the population regression line:

    ŷ = β₀ + β₁x

Going back to the statistical inference for the population mean, we used the sample statistic x̄ as an estimator of the population parameter μ. Using x̄ we built a confidence interval for μ or performed a test of hypothesis. Similarly, in regression, we use the sample statistic b₀ as the estimator of the population parameter β₀, and the sample statistic b₁ as the estimator of the population parameter β₁. Using these two sample statistics we can build confidence intervals or perform tests of hypotheses for the two population parameters.

3.1. Confidence Interval for Population Slope Parameter β₁

To see the similarities between the confidence interval for μ, the population mean, and that for β₁, first consider the formula for the confidence interval for μ from Chapter 5:

    Confidence interval for μ:  L, U = x̄ ± t(α/2, n−1)·se(x̄)

where se(x̄) = s/√n. Note that the interval is built around x̄ using the margin of error, MOE = t(α/2, n−1)·se(x̄). The confidence interval for β₁ has the same general characteristics. It is built around the sample statistic b₁ with ±MOE. The MOE in all confidence intervals is always equal to the t score (or the z score) times the standard error of the relevant sample statistic.
The confidence interval formula for β₁ is then:

    Confidence interval for β₁:  L, U = b₁ ± t(α/2, n−2)·se(b₁)

Note that here the t score involves n − 2 degrees of freedom.² The term se(b₁) is the standard error of the sampling distribution of b₁. The formula for se(b₁) is:

    Standard error of the sampling distribution of b₁:  se(b₁) = se(e)/√Σ(x − x̄)²

3.1.1. More About the Sampling Distribution of b₁

You should clearly recognize that b₁ is a summary characteristic obtained from a random sample. It is, therefore, like x̄, a sample statistic. In the discussion of the concept of a sampling distribution in Chapter 4 we learned that the number of samples of size n obtainable from a parent population is infinite. There is, thus, an infinite number of sample statistics such as x̄ that one may obtain from these samples. Since the values of x̄ are obtained from randomly selected samples, x̄ is a random variable. The probability distribution of this random variable, we learned, is called a sampling distribution. We also learned that the center of gravity, or the mean, of the sample statistics is the corresponding parameter in the parent population. With respect to x̄, it was explained, the mean of the means was the population mean μ. Also, the measure of dispersion of the values of the sample statistic around their center of gravity is called the standard error. Thus, the measure of dispersion of the x̄ values around μ is se(x̄). Finally, in order to apply the sampling distribution of x̄ in statistical inference, the x̄ values must have a normal distribution.

We can apply the same concepts to the sample statistic b₁. We can obtain an infinite number of b₁ values from the infinite number of random samples. This makes b₁ a random variable with a sampling distribution. The expected value, the center of gravity, or the mean of the b₁ values is equal to the parameter of the parent population, β₁. And the measure of dispersion of the b₁ values is the standard error of b₁, se(b₁). Furthermore, in order to apply the sampling distribution of b₁ for statistical inference, the b₁ values must be normally distributed.³ The following diagram shows the similarities between the sampling distribution of x̄ and the sampling distribution of b₁.

² The regression equation is obtained by estimating two population parameters, β₀ and β₁. For each population parameter estimated, we lose one degree of freedom.
³ In order for the sampling distribution of b₁ to be normal, certain conditions must be present. They are not relevant to the discussion here. We assume these conditions hold for our discussion.

[Figure: Two bell-shaped curves. The sampling distribution of x̄ is centered at E(x̄) = μ; the sampling distribution of b₁ is centered at E(b₁) = β₁.]

Now, to build a 95% confidence interval for the population slope parameter in Model 2, in addition to the following quantities, we need to compute the standard error of b₁:

    b₁ = 8.014
    t(α/2, n−2) = t(0.025, 8) = 2.306
    se(e) = 10.964
    x̄ = 4.2

The following is the computation of Σ(x − x̄)², used in the denominator of se(b₁):

       x       (x − x̄)²
      2.5        2.89
      1.0       10.24
      3.5        0.49
      3.0        1.44
      4.5        0.09
      6.0        3.24
      5.0        0.64
      4.0        0.04
      5.5        1.69
      7.0        7.84
    Σ(x − x̄)² = 28.60

    se(b₁) = 10.964/√28.6 = 2.05

    MOE = t(α/2, n−2)·se(b₁) = 2.306(2.05) = 4.727
    L = b₁ − MOE = 8.014 − 4.727 = 3.287
    U = b₁ + MOE = 8.014 + 4.727 = 12.741

We are 95% confident that the population slope parameter β₁ is between 3.287 and 12.741.
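Continuing the earlier Python sketch (illustrative only; the t critical value is taken from scipy.stats, an assumed dependency), the interval can be computed as follows:

    import math
    from scipy import stats

    # Continuing from the earlier sketch: x, x_bar, b1, se_e, and n are defined.
    sxx = sum((xi - x_bar)**2 for xi in x)           # Σ(x − x̄)² = 28.60
    se_b1 = se_e / math.sqrt(sxx)                    # 2.05

    t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # t(0.025, 8) = 2.306
    moe = t_crit * se_b1                             # 4.727

    lower, upper = b1 - moe, b1 + moe                # (3.287, 12.741)
    print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")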
3.2. Test of Hypothesis for Population Slope Parameter β₁

Recall that to perform a test of hypothesis about the population parameter μ or π we stated a null and an alternative hypothesis and then compared a test statistic to a critical value. There the test could be either a two-tail, a lower-tail, or an upper-tail test. In performing a two-tail test for, say, μ, the critical value is t(α/2, n−1) and the test statistic is t = (x̄ − μ₀)/se(x̄). In regression analysis, performing a test of hypothesis for the population slope parameter β₁, as you will see, is less complicated. The test is nearly always a two-tail test. Here is why.

In regression analysis we want to determine whether there is a relationship between x and y. If there is a relationship, then the variation in the value of y in response to changes in x is reflected in the slope of the regression line. The slope shows the change in the value of y per unit change in x. In the population regression equation, the slope is β₁ = Δy/Δx. If there is no relationship between x and y, then there is no change in y in response to changes in x; the slope is zero. In performing a test of hypothesis about β₁, the null hypothesis is that the slope is zero. Using inferential statistics, we want to reject the null hypothesis, to provide significant evidence that the slope is not zero.

    The Null and Alternative Hypotheses for β₁:
    H₀: β₁ = 0
    H₁: β₁ ≠ 0

To perform the test we need a critical value, which is

    The Critical Value:  t(α/2, n−2)

The test statistic for the hypothesis test has the same format as that for μ. The test statistic is again a t value, the numerator of which is the difference between the sample statistic b₁ and the hypothesized value of the population parameter β₁, and the denominator of which is the standard error of b₁, se(b₁):

    t = (b₁ − (β₁)₀)/se(b₁)

However, note that the null hypothesis states that β₁ = 0. Therefore, the test statistic simplifies as follows:

    The Test Statistic:  t = b₁/se(b₁)

To reject the null hypothesis that β₁ = 0, the test statistic must exceed the critical value in absolute value:

    To reject the null hypothesis:  |t| > t(α/2, n−2)

Example

Perform a test of hypothesis for the population slope parameter in Model 2.

    H₀: β₁ = 0
    H₁: β₁ ≠ 0

The critical value is:

    t(α/2, n−2) = t(0.025, 8) = 2.306

The test statistic is:

    t = b₁/se(b₁) = 8.014/2.05 = 3.910

Since the test statistic exceeds the critical value, reject the null hypothesis that the population slope parameter is zero. We can also use the test statistic to obtain the probability value. Using the Excel function T.DIST.2T, we compute 2 × P(t > |t|):

    =T.DIST.2T(x, deg_freedom)
    =T.DIST.2T(3.91, 8) = 0.0045

This is a very small probability value. Therefore, H₀: β₁ = 0 is rejected for any α > 0.0045.
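In Python (again using scipy.stats as an assumed dependency, and continuing the earlier sketches), the test statistic and two-tail prob value are:

    from scipy import stats

    # Continuing from the earlier sketches: b1, se_b1, and n are defined.
    t_stat = b1 / se_b1                                 # 3.910
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)     # 0.0045

    print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
    # Reject H0: beta1 = 0 at any significance level above the p-value.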
4. USING EXCEL FOR REGRESSION ANALYSIS

Determining the regression equation and all the related analysis is a cumbersome process involving many calculations. In the above discussion of regression we went through all the calculations to explain the concepts. When conducting research, however, using a computer program is essential. The Excel spreadsheet provides a simple process for performing all the different calculations shown above in one swoop. The following explains the steps in Excel:

1. Enter the data for the y and x variables.
2. Click on Data, then Data Analysis, and locate Regression in the provided list.
3. Click the box labeled Input Y Range and then select the cells containing the y data. Do the same in the box labeled Input X Range.
4. Choose where you want Excel to show the output on the worksheet by clicking the box labeled Output Range and selecting the cell where you want the top left corner of the output to be printed. Click OK and the following output will appear.

SUMMARY OUTPUT

    Regression Statistics
    Multiple R            0.8102
    R Square              0.6564
    Adjusted R Square     0.6134
    Standard Error       10.9635
    Observations         10

    ANOVA
                   df        SS           MS           F         Significance F
    Regression      1     1836.8056    1836.806     15.2813       0.004487
    Residual        8      961.5944     120.1993
    Total           9     2798.4

                  Coefficients  Standard Error   t Stat    P-value   Lower 95%   Upper 95%
    Intercept       42.74126       9.28207      4.60471    0.00174   21.33676    64.14575
    X Variable 1     8.01399       2.05007      3.90913    0.00449    3.28652    12.74145

4.1. Understanding the Computer Regression Output

Not all of the items on the printout will be familiar to you. Here we will consider only those we have discussed.

1. The vertical intercept b₀: Shown under the column "Coefficients" in the row clearly labeled "Intercept". b₀ = 42.74126

2. The slope b₁: Shown under the column "Coefficients" in the row labeled "X Variable 1". b₁ = 8.01399

3. Standard error of estimate se(e): Shown under "Regression Statistics", labeled "Standard Error":

    se(e) = √[Σ(y − ŷ)²/(n − 2)] = 10.9635

4. Sum of Squares Regression (SSR): Shown under the column labeled "SS" (meaning Sum of Squares) in the row labeled "Regression". SSR = Σ(ŷ − ȳ)² = 1836.8056

5. Sum of Squares Error (SSE): Shown at the intersection of column "SS" and row "Residual". SSE = Σ(y − ŷ)² = 961.5944

6. var(e): When SSE is divided by the degrees of freedom (the df associated with Residual, df = n − 2 = 8), the result is called the Mean Square Error (MSE), which is also known as var(e). This figure is shown under the column labeled "MS". Note that

    var(e) = MSE = Σ(y − ŷ)²/(n − 2) = SSE/df = 961.5944/8 = 120.1993

7. Sum of Squares Total (SST): Shown at the intersection of column "SS" and row "Total". SST = Σ(y − ȳ)² = 2798.4

8. R-Square (R²): Shown under "Regression Statistics". R² = SSR/SST = 0.65638

9. Standard error of b₁, se(b₁): Shown at the intersection of column "Standard Error" and row "X Variable 1".

    se(b₁) = se(e)/√Σ(x − x̄)² = 2.05007

10. 95% confidence interval for β₁: Shown at the intersection of columns "Lower 95%" and "Upper 95%", on the one hand, and row "X Variable 1".

    L, U = b₁ ± t(α/2, df)·se(b₁)
    L = 8.01399 − 2.306(2.05007) = 3.28652
    U = 8.01399 + 2.306(2.05007) = 12.74145

11. Test statistic for H₀: β₁ = 0: Shown at the intersection of column "t Stat" and row "X Variable 1".

    t = b₁/se(b₁) = 8.01399/2.05007 = 3.90913

12. Prob value: Recall that an alternative approach to the test of hypothesis is the prob value approach. In Chapter 6, in the discussion of the two-tail test of hypothesis for μ, it was stated that you reject the null hypothesis if the prob value is less than α (where α is the level of significance of the test). The prob value for a two-tail test is 2 × P(t > |TS|), where TS is the test statistic. The same argument applies to the test of hypothesis in regression analysis. Let α = 0.05. Excel computes 2 × P(t > 3.90913) = 0.00449.⁴ This is shown at the intersection of column "P-value" and row "X Variable 1". Note that this is a two-tail test; the prob value shown in the computer output is the area under the two tails of the t curve.

⁴ The Excel command is =T.DIST.2T(x, deg_freedom).
Since 0.00449 < α = 0.05, reject the null hypothesis that the population slope β₁ = 0.
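For readers working outside Excel, the same summary output can be reproduced in Python with the statsmodels library (an assumed dependency, not part of this chapter's workflow). A minimal sketch:

    import statsmodels.api as sm

    y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]
    x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]

    X = sm.add_constant(x)         # adds the intercept column
    model = sm.OLS(y, X).fit()     # ordinary least squares fit

    # Prints coefficients, standard errors, t statistics, p-values,
    # R-squared, and 95% confidence intervals, matching the Excel output.
    print(model.summary())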