YALE School of Management
EMGT511: HYPOTHESIS TESTING AND REGRESSION
K. Sudhir

Lecture 3: Introduction to Regression

The first two sessions covered hypothesis testing about the mean, about the proportion, and about differences between means. In each case, the process started with a null hypothesis, and we identified decision rules based on extreme outcomes for a test statistic if the null hypothesis is true. Simple random sampling was critical to the testing procedure; without probability sampling, we could not justify the applications we did.

We now move to situations where we want to know the relationship between variables. Regression analysis is a method used to estimate an equation that expresses how one variable (the dependent or criterion variable) depends on one or more other variables (the independent or predictor variables). To justify statistical inference for regression analysis, we make assumptions. Obviously, the relevance of the conclusions we draw depends on the extent to which the assumptions are correct.

Examples of Regression

Testing for Gender Discrimination

A firm is sued for gender discrimination. The plaintiffs performed a hypothesis test on the average difference between the salaries of males and females at the firm and obtained a statistically significant difference. Attorneys for the firm argue that this evidence is moot, because they know that the males have higher salaries due to longer experience: historically there were more males in the firm, and therefore on average they tend to have greater experience. The question these attorneys want to address is whether there are average salary differences between males and females with comparable experience.

How does one compare salaries between men and women while controlling for experience? A regression equation helps us answer this question. As we will see later, we could create a variable called Gender that takes the value 0 if the person is male and 1 if the person is female. We could then estimate an equation such as:

$$\text{Salary} = \beta_0 + \beta_1 \text{Gender} + \beta_2 \text{Experience}$$

In this equation, we explore the discrimination issue by holding experience constant. That is, $\beta_1$ represents the average salary difference between males and females at the firm, controlling for experience. Here, Salary is the dependent variable; Gender and Experience are the independent variables.

A Model of Demand

Economists and others are frequently interested in estimating demand equations. An understanding of how demand depends on various activities may help managers decide on the allocation of resources. It can also help both managers and policy makers learn about the intensity of competition. If we consider a given firm in isolation and ignore competitors, we could use the following simplified representation. A model of linear effects of price and advertising on demand is:

$$\text{Sales} = \beta_0 + \beta_1 \text{Price} + \beta_2 \text{Advertising}$$

Sales is the dependent variable; Price and Advertising are the independent variables.
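To make this demand specification concrete, here is a minimal sketch, not part of the note itself: the language (Python), the simulated data, and the coefficient values 120, -2, and 0.5 are all illustrative assumptions. The sketch generates hypothetical price, advertising, and sales data from an equation of the form above and then recovers the coefficients by least squares.

```python
# Illustrative sketch only: the data and "true" coefficient values are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 200
price = rng.uniform(5, 15, size=n)        # hypothetical prices in $
advertising = rng.uniform(0, 50, size=n)  # hypothetical advertising spend

# Assumed demand equation: Sales = 120 - 2*Price + 0.5*Advertising + error
sales = 120 - 2 * price + 0.5 * advertising + rng.normal(0, 5, size=n)

# Least squares estimation of beta0, beta1, beta2
X = np.column_stack([np.ones(n), price, advertising])
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta_hat)  # estimates should be close to (120, -2, 0.5)
```

With enough data generated this way, the estimated coefficients land close to the assumed values, which is the sense in which regression "recovers" the demand relationship.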
Modeling the Experience Curve Effect

Firms often find that as they have produced more in the past, the marginal cost of production falls due to learning, or cumulative experience. Understanding the relationship between marginal cost and total production is useful for firms, because it helps them forecast how much and how quickly costs will fall over time. A typical regression equation for the experience curve is:

$$\ln(\text{Marginal Cost}) = \beta_0 + \beta_1 \ln(\text{Cumulative Production})$$

Academic Performance and GMAT Scores

Many business schools rely heavily on GMAT scores for selecting applicants. The assumption is that students with higher GMAT scores will have better academic performance. However, it is unclear how an admissions officer should combine GMAT scores with information on undergraduate academic performance, work experience, and so on. Unless an admissions officer analyzes the relationship between the academic performance of past graduates and the information available in the application file, s/he will use weights for the different independent variables that are "biased". A simplified regression analysis would examine how individuals' academic performance in a specific MBA program relates to independent variables such as the same individuals' GMAT scores (verbal, quant), undergraduate academic performance, quality of the undergraduate institution, type of undergraduate major, etc. Such an analysis can help the admissions director screen applicants on minimum academic competence, so that s/he can then concentrate on harder-to-assess dimensions such as career performance potential. For example, we could specify a simplified equation such as:

$$\text{MBA\_GPA} = \beta_0 + \beta_1 \text{GMAT\_VERB} + \beta_2 \text{GMAT\_QUANT} + \beta_3 \text{UG\_GPA} + \beta_4 \text{EXP\_YR}$$

Uses of Regression

Regression is useful for two primary purposes:

(1) Prediction and forecasting: Given specific values for the independent variables, we can predict or forecast a value for the dependent variable. For example, what is the predicted MBA GPA for an applicant with specific GMAT scores, and how good is this prediction?

(2) Description: The regression result tells us how one or more independent variables affect the dependent variable, assuming there is a causal relationship. This is useful for managers in deciding the "optimal" levels of the independent variables in order to achieve a desired outcome. For example, based on an estimated demand equation, a manager can learn how sales (and therefore profits) will change as a function of price. This enables the firm to decide on the optimal price so as to maximize profits.

A Simple Regression Problem

Consider the following data on prices and demand, and estimate a regression model for these data.

Unit sales (lbs.)   Price ($)
115                 5
105                 5
95                  10
105                 10
95                  15
85                  15

Prior to estimating a regression model, it is useful to graph the data.

[Figure: scatter plot of unit sales (lbs., vertical axis, roughly 90 to 120) against price ($, horizontal axis, 0 to 15) with the fitted line drawn through the points.]

The best-fitting line runs approximately through the middle of the data; intuitively, this is the regression line. Note that none of the data points fall exactly on the line. On average, these data show that a $5 increase in price results in a 10 lb. decrease in sales. Thus, the slope of a linear equation equals $-10/5 = -2$. By extrapolation we see that the line intersects the y-axis at 120. Thus, the average relationship between the two variables is

$$y_i = 120 - 2 x_i$$

where $y_i$ is unit sales in lbs. and $x_i$ is price in dollars. This is the regression equation. The interpretation of the slope coefficient in this problem is that if price increases by $1, demand is expected to decrease by 2 lbs.

We will verify that we can recover the parameters of the regression equation using Excel. At the end of this note, I detail how to do regressions in Excel. Right now let us look at the output of the regression for this example.
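As an aside before the Excel output, the same intercept and slope can be recovered directly from the six observations. The following is a minimal sketch; Python with numpy is an assumption (the note itself uses Excel), and the closed-form least squares formulas used here are standard but not derived in this note.

```python
# Sketch: fit the six (price, sales) observations from the note by least squares.
import numpy as np

price = np.array([5, 5, 10, 10, 15, 15], dtype=float)
sales = np.array([115, 105, 95, 105, 95, 85], dtype=float)

Sxx = np.sum((price - price.mean()) ** 2)                      # 100
Sxy = np.sum((price - price.mean()) * (sales - sales.mean()))  # -200

beta1_hat = Sxy / Sxx                                # -2.0
beta0_hat = sales.mean() - beta1_hat * price.mean()  # 120.0
print(beta0_hat, beta1_hat)
```

The estimates match the line read off the graph: an intercept of 120 and a slope of -2.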
Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.852803
R Square             0.727273
Adjusted R Square    0.659091
Standard Error       6.123724
Observations         6

ANOVA
              df    SS     MS      F          Significance F
Regression    1     400    400     10.66667   0.030906
Residual      4     150    37.5
Total         5     550

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   120            6.614378         18.14229   5.43E-05   101.6355    138.3645
Price       -2             0.612372         -3.26599   0.030906   -3.70022    -0.29978

Interpretation of the Slope Coefficient and Intercept

Slope: for a one-dollar increase in price, demand tends to decrease by 2 lbs.
Intercept: for a price of zero, demand is estimated to be 120 lbs.

Note: "tends to" and "estimated" are expressions that reflect the inexact nature of the relationship.

(Note: We should not interpret this intercept too literally, because we have never charged a price of zero and do not have adequate information about how sales behave when price is zero; regression estimates usually work well within the range of the x data used in the regression. In this case, we can be more confident of predictions in the price range of $5-$15, because this is the range of the data we used. Nevertheless, we should always attempt to estimate an equation that can be used for all possible values of the predictor variable(s), Price in this case.)

Assumptions Used in Regression and How They Tie to Statistical Inference

We discussed last time that when doing hypothesis testing about means and proportions, the use of simple random sampling allows us to claim that the expected value of the sample mean or sample proportion equals the value of the parameter. Simple random sampling also allows us to derive the variance of the random variable, and we refer to the Central Limit Theorem to justify the assumption that the random variable (sample mean or sample proportion) is normally distributed. For regression, we use historical data. To obtain formulas for statistical inference, we make the following assumptions about the error term $\epsilon_i$:

(1) $E(\epsilon_i) = 0$.
(2) $Var(\epsilon_i) = \sigma^2$ (written $\sigma_\epsilon^2$ when we want to be clear about what the variance refers to).
(3) $cov(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$; independence of errors.
(4) $\epsilon_i$ is normally distributed.

The figure below shows how these assumptions can be thought of pictorially.

[Figure: the regression line $y = \beta_0 + \beta_1 x$ with identical normal error distributions centered on the line at several values of x.]

As you can see from the picture, the dependent variable data y fall along the regression line, subject to a normal error term centered around the regression line. To interpret the picture, it is useful to think of each normal curve as projecting out of the page with its mean exactly on the regression line. From the picture we can see:

a. The error terms have mean zero, irrespective of the value of x, as indicated by the normal curves centered around the regression line.
b. The variance of the error term is constant irrespective of the value of x, as indicated by the identical normal curves.
c. The independence assumption cannot be shown in the picture.
d. The errors are normally distributed (as indicated by the bell curves).
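To make these four assumptions concrete, here is a minimal simulation sketch. It is not from the note; Python, the parameter values, and the choice of x values are illustrative assumptions. Data are generated with independent, mean-zero, constant-variance, normal errors, so the average of y at each x sits on the regression line, exactly as the picture suggests.

```python
# Sketch: generate data that satisfies the four regression assumptions.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 120.0, -2.0, 6.0   # illustrative parameter values

x = np.repeat([5.0, 10.0, 15.0], 50)     # any fixed set of x values will do
# Errors: independent draws from N(0, sigma^2) -> assumptions (1) through (4)
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + eps

# The conditional mean of y at each x lies on the regression line:
for v in (5.0, 10.0, 15.0):
    print(v, round(y[x == v].mean(), 1))  # close to 110, 100, 90
```

Violating any assumption (for example, letting the error variance grow with x) would break the formulas for standard errors derived next.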
Statistical Inference

We need these four assumptions about the error term to justify statistical inference (hypothesis testing and confidence intervals). Let us see how the assumptions help us.

1. Assumption (1), $E(\epsilon_i) = 0$, implies that the least squares estimates are unbiased: $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.

2. Assumptions (2) and (3) imply that

$$\sigma_{\hat{\beta}_0} = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} \qquad \text{and} \qquad \sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{S_{xx}}}, \qquad \text{where } S_{xx} = \sum_i (x_i - \bar{x})^2 .$$

Usually we do not know $\sigma$, so we estimate it as

$$s = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_i \hat{\epsilon}_i^2}{n-2}}, \qquad \text{where } \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i .$$

Replacing $\sigma$ with $s$ gives the estimated standard errors $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$.

3. Assumption (4), normality of $\epsilon_i$, lets us do hypothesis testing. For example, we can test a hypothesis about $\beta_1$ by constructing the following Z and t statistics. If the null value is $\beta_1^{Null}$ and $\sigma$ is known,

$$Z = \frac{\hat{\beta}_1 - \beta_1^{Null}}{\sigma_{\hat{\beta}_1}} .$$

If we estimate $\sigma$ with $s$, then we do a t-test. The corresponding t statistic is

$$t = \frac{\hat{\beta}_1 - \beta_1^{Null}}{s_{\hat{\beta}_1}} .$$

Statistical Inference Example: Performing a t-test

In the example problem above we estimated $\hat{\beta}_0 = 120$ and $\hat{\beta}_1 = -2$. Suppose we wish to test:

Null H0: $\beta_1 = 0$;  Alternative HA: $\beta_1 \neq 0$.

$$t = \frac{\hat{\beta}_1 - \beta_1^{Null}}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1 - \beta_1^{Null}}{s / \sqrt{S_{xx}}}$$

We need to estimate

$$s = \sqrt{\frac{\sum_i \hat{\epsilon}_i^2}{n-2}} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-2}} .$$

We go back to the data, using $\hat{y}_i = 120 - 2x_i$:

y_i    ŷ_i    ε̂_i    ε̂_i²
115    110     5      25
105    110    -5      25
95     100    -5      25
105    100     5      25
95      90     5      25
85      90    -5      25
Sum             0     150

So $s = \sqrt{150/4} = \sqrt{37.5} = 6.1$. We had computed $S_{xx} = \sum_i (x_i - \bar{x})^2 = 100$ (with $\bar{x} = 10$, the squared deviations are 25, 25, 0, 0, 25, 25). Hence, with $\hat{\beta}_1 = -2$, $\beta_1^{Null} = 0$, $s = 6.1$, and $S_{xx} = 100$, the t statistic is

$$t = \frac{-2 - 0}{6.1/\sqrt{100}} = \frac{-2}{0.61} = -3.28 .$$

Since $t_{n-2,\,0.025} = t_{4,\,0.025} = 2.776$ and $-3.28 < -2.776$, we reject H0.

Computing the confidence interval for the true slope:

$$\hat{\beta}_1 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_1} = -2 \pm 2.776 \times 0.61 = -2 \pm 1.7 .$$

We can be 95% sure that the true but unknown slope is between -3.7 and -0.3 (consistent with rejecting H0).

How Well Did the Regression Do? R-square and Adjusted R-square

[Figure: for each observation, the total deviation $y - \bar{y}$ splits into the residual $y - \hat{y}$ and the explained part $\hat{y} - \bar{y}$, shown relative to the regression line and the mean of y.]

It can be shown that

$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2$$
$$\text{SS(Total)} = \text{SS(Residual)} + \text{SS(Regression)}$$

where SS(Total) is the sum of squares of the total variation in y around its mean, SS(Regression) is the sum of squares of the variation in y explained by the regression, and SS(Residual) is the sum of squares of the variation in y left in the residuals. The proportion of the variation explained by the regression is defined as R-square, or $R^2$:

$$R^2 = \frac{\text{SS(Regression)}}{\text{SS(Total)}} = \frac{\text{SS(Total)} - \text{SS(Residual)}}{\text{SS(Total)}} = 1 - \frac{\text{SS(Residual)}}{\text{SS(Total)}}$$

Clearly $R^2$ is a number between 0 and 1, and it will always increase when a new variable is included in a regression.

In our example: suppose we did not use the x variable and estimated the model $y = \beta_0 + \epsilon$. Then our best estimate of $\beta_0$ is $\bar{y}$, and the sum of squared errors is $\sum_i (y_i - \bar{y})^2$:

y_i    ȳ      ε̂_i    ε̂_i²
115    100    15     225
105    100     5      25
95     100    -5      25
105    100     5      25
95     100    -5      25
85     100   -15     225
Sum            0     550

For this problem, the sum is 550. This is the total sum of squared deviations of y from its mean, referred to as the total variation in the y variable. We computed earlier that SS(Residual) = 150, so R-square = 1 - 150/550 = 0.727. Compare this against the Excel output we obtained above. This is the R-square of the regression, interpreted as the fraction of the sample variation in y that is explained by the regression. The worst any added variable can do is explain zero additional variation, so this R-square measure can only increase as you add more variables to the right-hand side of the equation. However, managers and statisticians like to have the simplest possible models of the world that provide correct inferences and explain the most variation, so it makes sense to penalize models with variables that do not explain much variation in y.
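The hand calculations above (s, the t statistic, the confidence interval, and R-square) can be reproduced numerically. The following is a sketch under the assumption that Python with numpy and scipy is available; the note itself uses Excel for this.

```python
# Sketch: reproduce the note's t-test, confidence interval, and R-square
# for the six (price, sales) observations.
import numpy as np
from scipy import stats  # used only for the t critical value (assumes scipy is available)

price = np.array([5, 5, 10, 10, 15, 15], dtype=float)
sales = np.array([115, 105, 95, 105, 95, 85], dtype=float)
n = len(price)

Sxx = np.sum((price - price.mean()) ** 2)                                 # 100
beta1 = np.sum((price - price.mean()) * (sales - sales.mean())) / Sxx     # -2
beta0 = sales.mean() - beta1 * price.mean()                               # 120

fitted = beta0 + beta1 * price
resid = sales - fitted
s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # 6.1237
se_beta1 = s / np.sqrt(Sxx)                 # 0.6124

t_stat = (beta1 - 0.0) / se_beta1           # -3.266 (the note rounds s to 6.1 and gets -3.28)
t_crit = stats.t.ppf(0.975, df=n - 2)       # 2.776
ci = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)  # about (-3.70, -0.30)

ss_total = np.sum((sales - sales.mean()) ** 2)  # 550
ss_resid = np.sum(resid ** 2)                   # 150
r_square = 1 - ss_resid / ss_total              # 0.727
print(t_stat, ci, r_square)
```

These values match the Excel output reproduced earlier, up to the rounding used in the hand calculation.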
Statisticians have therefore constructed an alternative measure, called adjusted R-square. When adding variables to a regression, it is therefore useful to look at whether the adjusted R-square increases, rather than whether the (unadjusted) R-square increases. However, this is not the only way to decide which model is best.

An intuitive way to think about the difference between R-square and adjusted R-square is as follows. R-square is the proportion of the sample variation in y that is explained by the regression; adjusted R-square is the proportion of the population variation in y that is explained by the regression. Adding a variable can never reduce the sample variation explained, but a variable that explains only a small proportion of the sample variation in y may actually hurt in explaining the overall population variation. Hence R-square will always go up, while adjusted R-square may go down, when variables without much explanatory power are added to the regression.

Doing Regressions in Excel

Below I outline the steps to do regressions in Excel.

1. Enter the data as shown below, with the labels (Y, X) in the first row.
2. Select Tools > Data Analysis from the menu.
3. In the dialog box that opens, select Regression.
4. In the Regression dialog box that opens:
   a. Enter the y and x data ranges in the appropriate fill-in boxes. (If there is more than one x variable, enter the full range; for example, with 2 x variables you would enter $B$1:$C$7.)
   b. Check the Labels checkbox; this makes sure the first row is recognized as labels.
   c. If you want a different confidence level (say 99%), enter it.
   d. You may request any plots you want; we will discuss these later.

The results of the regression are given below.

1. In the coefficient table at the bottom of the output, compare the estimates (called Coefficients), the standard errors, the t statistics, and the confidence intervals with what we computed earlier.
2. Look at the adjusted R-square.
3. The Standard Error of the regression is what we call s.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.852803
R Square             0.727273
Adjusted R Square    0.659091
Standard Error       6.123724
Observations         6

ANOVA
              df    SS     MS      F          Significance F
Regression    1     400    400     10.66667   0.030906
Residual      4     150    37.5
Total         5     550

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   120            6.614378         18.14229   5.43E-05   101.6355    138.3645
Price       -2             0.612372         -3.26599   0.030906   -3.70022    -0.29978
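As an aside, and not part of the course's Excel workflow: readers without the Analysis ToolPak can produce a broadly comparable summary with the Python statsmodels library. This is a sketch under the assumption that statsmodels is installed.

```python
# Sketch: the same regression via statsmodels instead of Excel's Analysis ToolPak.
import numpy as np
import statsmodels.api as sm

price = np.array([5, 5, 10, 10, 15, 15], dtype=float)
sales = np.array([115, 105, 95, 105, 95, 85], dtype=float)

X = sm.add_constant(price)        # adds the intercept column
results = sm.OLS(sales, X).fit()

print(results.summary())             # coefficients, standard errors, t stats,
                                     # p-values, R-square, adjusted R-square
print(results.conf_int(alpha=0.05))  # 95% confidence intervals ("Lower/Upper 95%")
```

The coefficient estimates, standard errors, t statistics, and R-square values should agree with the SUMMARY OUTPUT table above.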