70-208 Regression Analysis Week 3 Dielman: Ch 4 (skip Sub-Sec 4.4.2, 4.6.2, and 4.6.3 and Sec 4.7), Sec 7.1 Multiple Independent Variables • We believe that both education and experience affect the salary you earn. Can linear regression still be used to capture this idea? • Yes, of course • The “linear” part of “linear regression” means that the regression coefficients cannot enter the eq’n in a nonlinear way (such as β1^2 * x1) Multiple Independent Variables • Salaryi = β0 + β1 * Educi + β2 * Experi + μi • Graphing this equation requires the use of 3 dimensions, so the usefulness of graphical methods such as scatterplots and best-fit lines is somewhat limited now • As the number of explanatory variables increases, the formulas for computing the estimates of the regression coefficients become increasingly complex – So we will not cover how to solve them by hand Multiple Independent Variables • Equation that “best” describes the relationship btwn a dependent variable y and K independent variables x1, x2, … , xK can be written as: – y = β0 + β1 * x1 + β2 * x2 + … + βK * xK + μ – Note that I will mostly drop the “i” subscript moving fwd • The criterion for “best” is the same as it was for simple (i.e. K = 1) regression – the sum of the squared differences btwn the true values of y and the predicted values yhat should be as small as possible • β0,hat, β1,hat, β2,hat, … , βK,hat ensure that the sum of squared errors is minimized Labeling β • Sometimes we just use β0, β1, β2, … , βK to label the coefficients • Other times, it is useful to be more specific. For example, if x1 represents “education level”, it is better to write β1 as βeduc. 
– β0 is always written the same • The first regression below is more helpful in seeing and presenting your work than the second regression, even if we knew that y was salary, x1 was education, etc – Salary = β0 + βeduc * Educ + βexper * Exper + μ – y = β0 + β1 * x1 + β2 * x2 + μ • I will go back and forth with my labeling throughout the course. I just wanted you to understand the difference and why one way might be better in practice. Multiple Independent Variables • Ceteris paribus – all else equal • In the case of simple regression, we interpreted the regression coefficient estimate as meaning how much the dependent variable increased when the independent variable went up one unit • Implicit was the concept that the error terms for any two individuals were equally distributed, in other words, that all else was equal Multiple Independent Variables • It is very possible that that is a bad implicit assumption • That is one reason we like to add multiple explanatory variables. Once they are added, they are not part of the error term and can be explicitly accounted for when we interpret coefficient estimates • What the hell do I mean by all of this? Multiple Independent Variables • Go back to the salary example • Hopefully you all agree that education and experience are both highly likely to explain salary in statistically significant ways • But what if we didn’t have experience data, so we just ran the regression on salary and education? Multiple Independent Variables • What we would like to run: – Salary = β0 + β1 * Educ + β2 * Exper + μ • What we do run: – Salary = β0 + β1 * Educ + μ • Which means that experience has now been sucked into the error term. If experience levels (conditional on education) differ in our sample data set, the implicit assumption that the errors are equally distributed across all observations is wrong! 
• If we ran the 2nd regression written above, we would interpret β1,hat as the amount by which salary increases when education increases by one unit (implicitly saying all else, i.e. the “errors”, are equal, which I just argued is probably a poor assumption) Multiple Independent Variables • So now say we have the experience data and we can run the regression with 2 explanatory variables • Now we would interpret β1,hat as the amount by which salary increases when education increases by one unit AND EXPERIENCE IS THE SAME (plus the remaining information captured by the errors is the same across all observations) • So we explicitly take experience out of the error term and can now condition on it being the same when we interpret the education coefficient Multiple Independent Variables • But how well does the implicit, ceteris paribus, “error” assumption hold up even when both educ and exper are included? • Maybe still not very well. Everything you can think of is still being captured by the error terms except for education and experience levels. If these somehow differ systematically across observations, the assumption of equal error distributions is still wrong! Multiple Independent Variables • What do I mean by “everything you can think of”? Very simply, anything else that might (or might not!) affect salary. – Years of experience at current company – Number of extended family members that work at same company – Intelligence – How many sick days you took over the past 5 years – How many kids you have – How many siblings you have – How many different cities you’ve lived in – How many hot dogs you eat each year – Etc, etc, etc, blah, blah, blah Multiple Independent Variables • Let’s look at those closer – Years of experience at current company – Probably would have significant effect on salary. We should include this in the regression if we can get the data. – Number of extended family members that work at same company – Might or might not have an effect on salary. 
– Intelligence – Tough to measure, but could proxy for it using an IQ score. Very likely to affect salary, so it should be included in the regression, too. – How many sick days you took over the past 5 years – Kind of a measure of effort, so I think it would matter. – How many kids you have – Could matter, especially for women. – How many siblings you have – Doubtful it would be significant. – How many different cities you’ve lived in – Very unlikely to be significant. – How many hot dogs you eat each year – I’m literally just making stuff up at this point, so I doubt this would affect salary (unless we are measuring the salaries of competitive eaters, so note that context can matter when “answering” these questions) Multiple Independent Variables • So what happens if we think intelligence matters but it wasn’t included in the regression as a separate explanatory variable? • Then intelligence is rolled up into the error term. But if education and intelligence are highly correlated (smarter people have more years of education), then the errors are not the same across the individuals in the sample (E(μi|X) ≠ 0). In fact, those with higher education have “higher” error, by which I mean one component of the error term is systematically bigger for some individuals • This would make our ceteris paribus assumption false and we would end up with biased estimators! Multiple Independent Variables • What if we include insignificant variables because we are afraid of getting biased estimates if we don’t throw everything in? • Not really a problem. We will see how to evaluate whether there are any relevant gains to including additional variables. If there are, they should be kept in the regression. If the gains are negligible or even negative, drop those insignificant variables and fear not the repercussions of bias. 
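The omitted-variable bias just described can be seen in a quick simulation. The sketch below is illustrative only: the data are made up, with true coefficients chosen so that salary depends on both educ and exper, and exper is correlated with educ. OLS is implemented directly from the normal equations in pure Python; no regression library is assumed.

```python
import random

random.seed(0)

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(X, y):
    """OLS via the normal equations (X'X) b = (X'y); each row of X starts with a 1."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data: exper is correlated with educ, and both truly
# affect salary (true betas: intercept 10, educ 2, exper 1.5).
n = 5000
educ = [random.gauss(14, 2) for _ in range(n)]
exper = [5 + 0.8 * e + random.gauss(0, 2) for e in educ]
salary = [10 + 2 * e + 1.5 * x + random.gauss(0, 3)
          for e, x in zip(educ, exper)]

# Simple regression omits exper, so exper is sucked into the error term.
b_simple = ols([[1, e] for e in educ], salary)
# Multiple regression takes exper out of the error term.
b_mult = ols([[1, e, x] for e, x in zip(educ, exper)], salary)

print("simple educ coefficient:   %.2f" % b_simple[1])  # biased toward 2 + 1.5*0.8 = 3.2
print("multiple educ coefficient: %.2f" % b_mult[1])    # close to the true value of 2
```

The simple regression's education coefficient soaks up part of experience's effect, exactly the ceteris paribus failure described above; adding exper as a regressor restores an estimate near the true value.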
Multiple Independent Variables – Output • Look at and interpret output • Sales are dependent on Advertising and Bonus • Run the regression: – Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus • This equation can be interpreted as providing an estimate of mean sales for a given level of advertising and bonus payment. • If advertising is held constant, mean sales tend to rise by $1860 (1.86 thousands of dollars) for each unit increase in Bonus. If bonus is held fixed, mean sales tend to rise by $2470 (2.47 thousands of dollars) for each unit increase in Adv. Multiple Independent Variables – Output • Notice in the Excel output that the dof of the Regression is now 2 (always used to be 1). This is because there are 2 explanatory variables. The SSE, MSR, F, etc are calculated basically the same way as before, which we will go over very soon. • Look at Fig 4.7b on pg 141 of Dielman to see how Excel outputs all the regression information when multiple independent variables are included Multiple Independent Variables – Prediction • As in simple regression, when we run a multiple regression we can then predict, or estimate, values for y when we have values for every explanatory variable by solving for yhat • Back to sales example with Adv and Bonus only – Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus • Say Adv = 200 and Bonus = 150. What would we predict for Sales (i.e. what is Saleshat)? – Plug in Adv = 200 and Bonus = 150 – Saleshat = -516.4 + 2.47 * 200 + 1.86 * 150 = 256.6 Confidence Intervals and Hypothesis Testing • The confidence interval on βk,hat when K explanatory variables are included is – (βk,hat – tα/2,N-K-1 * sβk, βk,hat + tα/2,N-K-1 * sβk) – Notice the dof change on the t-value • Hypothesis testing on any one independent variable is the same as before. The default Excel test is shown below. 
– H0 : βk = 0 – Ha : βk ≠ 0 Hypothesis Testing • If the null on the previous slide is not rejected, then the conclusion is that, once the effects of all other variables in the regression are included, xk is not linearly related to y. In other words, adding xk to the regression eq’n is of no help in explaining any additional variation in y left unexplained by the other explanatory variables. You can drop xk from the regression and still have the same “fit”. Hypothesis Testing • If the null is rejected, then there is evidence that xk and y are linearly related and that xk does help explain some of the variation in y not accounted for by the other variables Hypothesis Testing • Are Sales and Bonus linearly related? • Use t-test – H0 : βBON = 0 – Ha : βBON ≠ 0 – Dec rule → reject null if test stat more extreme than t-value and do not reject otherwise – βBON,hat = 1.856 and sβBON = 0.715, so test stat = 1.856 / 0.715 = 2.593 – The t-value with 22 dof (from N-K-1) for a two-tailed test with α = 0.05 is 2.074. – Since 2.593 > 2.074, reject null – Yes, they are linearly related (even when Advertising is also accounted for) Hypothesis Testing • Could have used p-value or CI to answer the question on previous slide – Would have reached same conclusion – Don’t use full F when testing just one variable (more explanation later) Assessing the Fit • Recall SST, SSR, and SSE – SST = ∑ (yi – ybar)2 – SSR = ∑ (yi,hat – ybar)2 – SSE = ∑ (yi – yi,hat)2 • For SSR, dof is equal to number of explanatory variables K • For SSE, dof is N – K – 1 • So SST has N – 1 dof Assessing the Fit • Recall that R2 = SSR / SST = 1 – (SSE / SST) • It was a measure of the goodness of fit of the regression line and ranged from 0 to 1. If R2 was multiplied by 100, it represented the percentage of the variation in y explained by the regression. 
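The sums of squares just recalled are easy to verify numerically. Below is a minimal pure-Python sketch (the six-point data set is hypothetical, invented for illustration) that fits a simple regression with the closed-form formulas, then confirms that SST = SSR + SSE and that both forms of R2 agree:

```python
def simple_ols(x, y):
    """Closed-form simple-regression estimates (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

# Hypothetical data set (roughly y = 2x)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b0, b1 = simple_ols(x, y)
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

SST = sum((yi - ybar) ** 2 for yi in y)                 # total variation
SSR = sum((yh - ybar) ** 2 for yh in yhat)              # explained variation
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # unexplained variation

# For OLS with an intercept, SST = SSR + SSE, so both R2 formulas agree.
R2 = SSR / SST
print("R2 = %.4f (1 - SSE/SST = %.4f)" % (R2, 1 - SSE / SST))
```

Note the decomposition SST = SSR + SSE (and hence the equivalence of the two R2 formulas) holds for OLS fits that include an intercept; for an arbitrary yhat series it would not.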
• Drawback to R2 in multiple regression → As more explanatory variables are added, the value of R2 will never decrease even if the additional variables are explaining an insignificant proportion of the variation in y Assessing the Fit • From R2 = 1 – (SSE / SST), you can see that R2 gets increasingly closer to 1 since SSE falls any time any little tiny bit more variation in y is explained • Addition of unnecessary explanatory variables, which add little, if anything, to the explanation of the variation in y, is not desirable • An alternative measure is called adjusted R2, or Radj2 – “Adjusted” because it adjusts for the dof Assessing the Fit • Radj2 = 1 – (SSE / (N – K – 1)) / (SST / (N – 1)) • Now suppose an explanatory variable is added to the regression model that produces only a very small decrease in SSE. The divisor N-K-1 also falls since K has been increased by 1. It is possible that the ratio SSE / (N-K-1) increases if the decrease in SSE from the addition of another variable is not great enough to overcome the decrease in N-K-1, in which case Radj2 falls. Assessing the Fit • Radj2 no longer represents the proportion of variation in y explained by the regression (that is still captured only by R2), but it is useful when comparing two regressions with different numbers of explanatory variables. A decrease in Radj2 from the addition of one or more explanatory variables signals that the added variable(s) was of little importance in the regression, so it can be dropped. Assessing the Fit • F = MSR / MSE • MSR = SSR / K • MSE = SSE / (N – K – 1) • Full F statistic is used to test the following hypothesis: – H0 : β1 = β2 = … = βK = 0 – Ha : At least one coefficient above is not equal to 0 Assessing the Fit • Decision rule → reject null if F > fcrit(α; K, N-K-1) and do not rej otherwise • Failing to reject the null implies that the explanatory variables in the regression equation are of little or no use in explaining the variation in y. 
Rejection of the null implies that at least one (but not necessarily all) of the explanatory variables helps explain the variation in y. Assessing the Fit • Rejection of the null does not mean that all pop’n regression coefficients are different from 0 (though this may be true), just that the regression is useful overall in explaining y. • The full F test can be thought of as a global test designed to assess the overall fit of the model. • That’s why full F cannot be used for hypothesis testing on a single variable in multiple regression, but it could be used for the hypothesis testing on the single explanatory variable in simple regression (since that variable was the whole, “global” model) Sales Example • Show the calculation of F on the Excel sheet – Using SSE and SSR – Using MSE and MSR • Would we reject the null that all coefficients are equal to 0? – YES Comparing Two Regression Models • Remember that the t-test can check whether each individual regression coefficient is significant and the full F test can check the overall fit of the regression by asking whether any coefficient is significant • Partial F test is in between – it answers the question of whether some subset of coefficients are significant or not Comparing Two Regression Models • Want to test whether variables xL+1, … , xK are useful in explaining any variation in y after taking into account variation already explained by x1, … , xL variables • Full model has all K variables: – y = β0 + β1 * x1 + β2 * x2 + … + βL * xL + βL+1 * xL+1 + … + βK * xK + μ • Reduced model only has L variables: – y = β0 + β1 * x1 + β2 * x2 + … + βL * xL + μ Comparing Two Regression Models • Is the full model significantly better than the reduced model at explaining the variation in y? 
• H0 : βL+1 = … = βK = 0 • Ha : at least one of them isn’t equal to 0 • If null is not rejected, choose the reduced model • If null is rejected, xL+1, … , xK contribute to explaining y, so use the full model Comparing Two Regression Models • To test the hypothesis, use the following partial F statistic – Fpart = ((SSER – SSEF) / (K – L)) / ((SSEF) / (N – K – 1)), where the “R” stands for reduced model and “F” stands for full model • SSER – SSEF is always greater than or equal to 0 – Full model includes K – L extra variables which, at worst, explain none of the variation in y and in all likelihood explain at least a little of it, so SSE falls – This difference represents the additional amount of variation in y explained by adding xL+1, … , xK to the regression Comparing Two Regression Models • This measure of improvement is then divided by the number of additional variables included, K – L – Thus the numerator of Fpart is the additional variation in y explained per additional explanatory variable used • Reject null if Fpart > fcrit(α; K – L, N – K – 1) and do not reject otherwise Sales Example Revisited • Example 4.4, pg 152 of Dielman • Let’s add two more variables to the sales example from earlier • x3 is mkt share held by company in each territory and x4 is largest competitor’s sales in each territory • So the “reduced” model results we already have. 
They were shown earlier when just x1 (Adv) and x2 (Bonus) were included Sales Example Revisited • We need to see the full model results • Notice that R2 is higher for the full model (remember, R2 can never fall when more variables are added) but Radj2 is actually lower – This should be a clue that we will probably not reject the null on β3 and β4 when comparing the full and reduced models Sales Example Revisited • SSER = 181176, SSEF = 175855, K – L = 2, N – K – 1 = 20 – Note that this last value is the dof of SSE in the full model • So Fpart = ((181176 – 175855) / 2) / (175855 / 20) = 0.303 • fcrit(0.05; 2, 20) = 3.49 • Since 0.303 < 3.49, do not reject null • Conclude that there is no evidence that β3 or β4 differs from 0, so x3 and x4 should not be included in the regression Sales Example Revisited • Notice that the values for β0,hat, βADV,hat, and βBON,hat changed when we added additional variables – Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus – Saleshat = -593.5 + 2.51 * Adv + 1.91 * Bonus + 2.65 * Mkt_Shr – 0.121 * Compet • This should not surprise you. Some of what was previously rolled up into μ has now been explicitly accounted for, and that changes the way the initial set of explanatory variables relate to Sales. • Note that the inclusion of additional observations (i.e. we gather more data) could also adjust the estimates of β0,hat, etc • Every regression is different! (like snowflakes.......) Sales Example Revisited • If we chose to stick with the “full” sales model, we would include the x3 and x4 variables in predicting Saleshat – Even though they are insignificant, because the β0,hat, βADV,hat, and βBON,hat values changed with their inclusion, it would be wrong to make predictions without them (unless we re-ran the original regression where they were not even included) • So what is Saleshat for Adv = 500, Bonus = 150, Mkt_Shr = 0.5, and Compet = 100? – Saleshat = -593.5 + 2.51 * 500 + 1.91 * 150 + 2.65 * 0.5 – 0.121 * 100 = 937.2 Limits to K? 
• There are K + 1 coefficients that need to be estimated (β0, β1, … , βK) • We need at least K + 1 observations to estimate that many coefficients, so N must be at least K + 1 • Normally written as K ≤ N – 1 • This is similar to a concept from an algebra class you’d have taken in middle school, where we needed at least M equations to solve for X unknowns (i.e. M ≥ X) – Here, you can think of N being similar to the number of equations and K + 1 being the number of unknowns to be solved for Multicollinearity • For a regression of y on K explanatory variables, it is hoped that the explanatory variables are highly correlated with the dependent variable • However, it is not desirable for strong relationships to exist among the explanatory variables themselves • When explanatory variables are correlated with one another, the problem of multicollinearity is said to exist Multicollinearity • Seriousness of problem depends on degree of correlation • Some books list an additional assumption of OLS that the sample data X is not all the same value, and a follow-up assumption that X1 cannot directly determine X2 – The first point made in the last bullet hardly ever happens. As long as X varies in the population, the sample data will almost always vary unless the pop’n variation is minimal or the sample size is very small. – The second point made in the last bullet expressly forbids perfect multicollinearity to occur between any 2 explanatory variables Biggest Problem for MultiC • The std errors of regression coefficients are large when there is high multicollinearity among explanatory variables • The null hypo that the coefficients are 0 may not be rejected even when the associated variable is important in explaining variation in y • Summary: Perfect collinearity is fatal for a regression. Any small degree of multicollinearity increases std errors and is thus somewhat undesirable, though basically unavoidable. 
– We will look at one strategy for investigating multicollinearity and using it to inform our regression choices next (free preview: Fpart is useful) Baseball Example • Example comes from the Wooldridge text • I believe baseball player salaries are determined by years in the league, avg games played per year, career batting average, avg home runs per year, and avg RBIs per year • So the following regression is run: – log(salary) = β0 + β1 * years + β2 * games_yr + β3 * cavg + β4 * hr_yr + β5 * rbi_yr + μ – Ignore the log for now, that’s for next week. I just wanted to stay kosher with the example from my other book. Just think of it as “salary” if it really bothers you. Baseball Example • Results (standard errors in parentheses): – β0,hat = 11.19 (0.29) – β1,hat = 0.0689 (0.0121) – β2,hat = 0.0126 (0.0026) – β3,hat = 0.00098 (0.00110) – β4,hat = 0.0144 (0.0161) – β5,hat = 0.0108 (0.0072) • Plus N = 353 and SSEF = 183.186 Baseball Example • Simple t-test on the last three coefficients would say they are insignificant in explaining log(salary) • But any baseball fan knows that batting avg, home runs, and RBIs definitely are big factors in determining player salaries (and team performance for that matter) • So let’s run the reduced model where we drop out those three variables and check to see what the partial F statistic reveals Baseball Example • Results (standard errors in parentheses): – β0,hat = 11.22 (0.11) – β1,hat = 0.0713 (0.0125) – β2,hat = 0.0202 (0.0013) • Plus N = 353 and SSER = 198.311 Baseball Example • So Fpart = 9.55 (do the math yourself later, you have everything you need), and we reject null that β3 = β4 = β5 = 0 (and thus that batting avg, home runs, and RBIs have no effect on salary) • That may seem surprising in light of insignificant t-stats for all 3 in the full model regression Baseball Example • What is happening is that two variables, hr_yr and rbi_yr, are highly correlated (and less so for cavg), and this multicollinearity makes it difficult to uncover the partial effect of each variable – This is reflected in individual t-stats • Fpart stat tests whether all 3 variables above 
are jointly significant, and multicollinearity between them is much less relevant for testing this hypo • If we drop one of those variables, we would see the t-stat of the others increase by a lot (even to the point of significance). The point estimates might change up or down, but the standard errors would definitely fall. Dummy Variables • Dummy variables, or indicator variables, take on only two values → 0 or 1 • They indicate whether a sample observation from our data does (1) or does not (0) belong in a certain category – You can think of them as “yes” (1) or “no” (0) variables • Examples: – Gender – 1 if female, 0 otherwise – Race – 1 if white, 0 otherwise – Employment – 1 if employed, 0 otherwise – Education – 1 if college graduate, 0 otherwise Dummy Variables • Can also be used to capture deeper qualitative information – Is person A a US citizen? (1 if yes, 0 if no) – Is person A a baseball fan? (1 if yes, 0 if no) – Does person A own a computer? (1 if yes, 0 if no) – Is summer the favorite season of person A? (1 if yes, 0 if no) – Does firm Z sell video games? (1 if yes, 0 if no) – Has country Z signed a free trade agreement with Canada? (1 if yes, 0 if no) Dummy Variables • In regression analysis, we must always “leave out” one part of the indicator • Use gender as the example here – So Xmale = 1 if male, 0 otherwise might be included in the regression as an independent variable – But we cannot also include Xfemale = 1 if female, 0 otherwise in the regression – One “part” (here, the female indicator) must be left out – Why is this? Think back to the perfect collinearity problem discussed earlier. We can always define “female” completely in terms of “male” (Xfemale = 1 – Xmale). So both cannot be included in the regression or we get an error. 
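That “error” can be seen directly: with an intercept plus both Xmale and Xfemale, the columns of the data matrix are linearly dependent, so the normal equations have no unique solution. A small pure-Python sketch (the six-person sample is hypothetical):

```python
# Hypothetical sample: 1 = male, 0 = female for six people.
male = [1, 0, 1, 1, 0, 0]
female = [1 - m for m in male]  # "female" is fully determined by "male"

# Design matrix with an intercept AND both dummies.
X = [[1, m, f] for m, f in zip(male, female)]

# The intercept column equals male + female for every observation,
# so the columns are linearly dependent (perfect collinearity).
assert all(row[0] == row[1] + row[2] for row in X)

# X'X is then singular: its determinant is 0, so the normal
# equations (X'X) b = X'y have no unique solution.
k = len(X[0])
XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print("det(X'X) =", det3(XtX))  # prints 0 → drop one dummy to fix
```

Dropping either dummy (leaving the other group as the base level) removes the dependence and makes X'X invertible again.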
Dummy Variables • The group whose indicator is omitted from the regression serves as the base-level group for comparison • In the gender example, say I ran the following regression: – Salary = β0 + β1 * Educ + β2 * Male + μ Dummy Variables • Then the base-level group is females • The intercept for females is β0, while for males it is β0 + β2 • From where? – Indicated group (males) → Salary = β0 + β1 * Educ + β2 * Male + μ = β0 + β1 * Educ + β2 + μ = (β0 + β2) + β1 * Educ + μ – Non-indicated group (females) → Salary = β0 + β1 * Educ + β2 * Male + μ = (β0) + β1 * Educ + μ Dummy Variables • If we wanted to answer the question of whether or not men and women earn the same salary once education has been accounted for, a simple t-test would do the trick – H0 : β2 = 0 – Ha : β2 ≠ 0 – If we reject the null, then men and women earn different salaries even when education levels are accounted for (remember there’s all kinds of other stuff in μ though) Dummy Variables • How about a more complicated example of indicator variables? • Suppose firms in a sample are categorized according to the exchange on which they are listed (NYSE, AMEX, or NASDAQ). We believe the exchange they are on may have some predictive power over the value of the firm. – D1 = 1 if listed on NYSE, 0 otherwise – D2 = 1 if listed on AMEX, 0 otherwise – D3 = 1 if listed on NASDAQ, 0 otherwise Dummy Variables • Let NYSE be the base level, so leave its dummy out of the regression equation • Include firm-level assets and number of employees as additional independent variables • Value = β0 + β1 * D2 + β2 * D3 + β3 * Assets + β4 * Employees + μ • Then the NYSE intercept is β0, AMEX is β0 + β1, and NASDAQ is β0 + β2 Dummy Variables • When using indicator variables, the partial F statistic is used to test whether the variables are important as a group. 
The t-test on individual coefficients should not be used to decide whether individual indicator variables should be retained or dropped (except when there are only two groups represented, and thus only one indicator variable, such as the male/female salary regression a few slides back). • The indicator variables are designed to have meaning as a group, and are either all retained or all dropped as a group. Dropping individual indicators changes the meaning of the remaining ones. – Imagine dropping just D2 (AMEX) in the previous regression. Then D3 (NASDAQ) is kept, while the base-level group switches from D1 (NYSE) to simply “not D3” (which would include both NYSE and AMEX) Dummy Variables – Sales Example • This is example 7.3 on pg 279 of Dielman • Look at relationship between dependent variable (Sales) and a few independent variables (Advertising, Bonus). • Let’s add variables indicating the region of the US in which Sales are made. – South = 1 if territory is in the South, 0 otherwise – West = 1 if territory is in the West, 0 otherwise – Midwest = 1 if territory is in the Midwest, 0 otherwise Dummy Variables – Sales Example • Let Midwest be the base level group – So leave it out of the regression • Regression: – Sales = β0 + β1 * Adv + β2 * Bonus + β3 * South + β4 * West + μ • We find β3,hat = -258 and β4,hat = -210. What do those mean? – Since β3,hat = -258, Sales in the South are 258 units lower than sales in the Midwest (since Midwest is our comparison group) even if we condition on Adv and Bonus being the same (similar for β4,hat) Dummy Variables – Sales Example • It would be inappropriate to run simple t-tests on those coefficients to determine their significance. We need to use partial F. Think about how the interpretation of all indicators would change if we ran a t-test and decided to drop only β4 * West from the regression. 
• To determine whether there is a significant difference in sales for territories in different regions, the following hypotheses should be tested: – H0 : β3 = β4 = 0 – Ha : at least one of them isn’t equal to 0 Dummy Variables – Sales Example • The larger (full) model is again: – Sales = β0 + β1 * Adv + β2 * Bonus + β3 * South + β4 * West + μ • So the null stating that both indicators are 0, if not rejected, would mean no differences in sales in the regions exist, and the indicators can be dropped • The simpler (reduced) model: – Sales = β0 + β1 * Adv + β2 * Bonus + μ Dummy Variables – Sales Example • So Fpart = ((SSER – SSEF) / (K – L)) / MSEF = 17.3 • Decision is to reject null since 17.3 > fcrit(0.05; 2, 20) = 3.49 • Thus, at least one of the coefficients of the indicator variables is not 0. There are differences in average sales levels btwn the three regions. Keep the indicator variables in the regression. Suggested Problems from Dielman • Pg 148, #1 • Pg 158, #3 • Pg 169, #7 • Pg 170, #11 • Pg 173, #17 • Pg 285, #1
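As a closing check, the partial F values worked out in the examples above (0.303 for the sales example with Mkt_Shr and Compet added, 9.55 for the baseball example) can be reproduced with a short helper using the numbers given on the slides:

```python
def partial_f(sse_r, sse_f, k_minus_l, dof_f):
    """Partial F: extra variation in y explained per added variable,
    scaled by the full model's mean squared error SSEF / (N - K - 1)."""
    return ((sse_r - sse_f) / k_minus_l) / (sse_f / dof_f)

# Sales example: adding Mkt_Shr and Compet (K - L = 2, N - K - 1 = 20)
print(round(partial_f(181176, 175855, 2, 20), 3))     # 0.303

# Baseball example: adding cavg, hr_yr, rbi_yr (K - L = 3, N - K - 1 = 347)
print(round(partial_f(198.311, 183.186, 3, 347), 2))  # 9.55
```

In each case the statistic is then compared against fcrit(α; K − L, N − K − 1) to decide between the full and reduced models.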