ECONOMICS 309 FINAL STUDY-GUIDE: 4. REGRESSION AS A BEST-FITTING LINE: Regression is the tool economists use to understand the relationship between two or more variables. It is particularly useful where there are many variables and complex interactions between them (ex. unemployment and interest rates, money supply, exchange rates, inflation, etc.). Regression involving two variables (X, Y) is considered simple regression; multiple regression involves many variables. XY-plots reveal a great deal about the relationship between X and Y. A straight line drawn through the points on the XY-plot provides a convenient summary of the relationship between them (ex. Y = house price and X = lot size). The linear relationship between them is Y = alpha + beta*X, where alpha is the intercept of the line and beta is the slope. This equation is referred to as the regression line. In the real world no data points lie precisely on a straight line, so the linear regression model is only an approximation of the true relationship. Many factors that influence Y are impractical to collect data on, and omitting them means the model makes an error. Including this error (e), the equation is Y = alpha + beta*X + e. Y is the dependent variable, X is the explanatory variable, and alpha and beta are the coefficients. A model specifies how different variables interact. We can treat regression as a technique for generalizing correlation and interpret the numbers the regression model produces purely as reflecting the association between the variables. An implicit assumption of causality can be a problem, which motivates the development of further methods. Alpha(hat) and beta(hat) are estimates, i.e. our best guesses, of the unknown true values alpha and beta. The way we find these estimates is by drawing the straight line through the points on an XY-plot which fits best.
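The model Y = alpha + beta*X + e can be made concrete with a minimal sketch, assuming made-up lot-size and house-price numbers (all values here, including the intercept and slope, are hypothetical, not from the course):

```python
# A minimal sketch of the simple regression model Y = alpha + beta*X + e,
# using hypothetical lot-size/house-price numbers.
import random

random.seed(0)  # fixed seed so the simulated errors are reproducible

alpha_true = 30000.0   # intercept: price of a house on a zero-size lot (hypothetical)
beta_true = 70.0       # slope: extra dollars of price per extra square foot (hypothetical)

lot_sizes = [2000, 3000, 4000, 5000, 6000]           # X values
errors = [random.gauss(0, 5000) for _ in lot_sizes]  # e: everything the model omits

# Each observed price is a point on the regression line plus an error term.
prices = [alpha_true + beta_true * x + e for x, e in zip(lot_sizes, errors)]

for x, y in zip(lot_sizes, prices):
    print(f"lot size {x} sq ft -> price ${y:,.0f}")
```

Because of the error term, the simulated prices scatter around the line rather than lying exactly on it, which is the point of the paragraph above.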
The error is the distance between a data point and the true regression line. If we replace alpha and beta with their (hat) versions, we get a straight line which is generally a little different from the true regression line. The deviations from the estimated regression line are called residuals (u); errors and residuals are essentially the same idea, except that residuals are measured from the estimated line. Residuals are the vertical differences between the line drawn and the points (u1, u2, u3, ...). A good-fitting line will have small residuals. The usual way of measuring the size of the residuals is the SUM OF SQUARED RESIDUALS (SSR), given by: SSR = Σui^2 = Σ(Yi - alpha(hat) - beta(hat)Xi)^2. We want to find the best-fitting line, which minimizes the sum of squared residuals. For this reason, estimates found in this way are called ORDINARY LEAST SQUARES (OLS) estimates. INTERPRETING OLS ESTIMATES: Beta(hat) is the slope coefficient of the best-fitting straight line through the XY-plot. If beta(hat) is positive, X and Y are positively correlated. Beta(hat) can also be interpreted as the marginal effect of X on Y: a measure of how much X influences Y, or how much Y tends to change when X changes by one unit. Unusual (extreme) observations are called outliers. FITTED VALUES & R^2: The most common measure of fit is referred to as R^2 (for the simple regression model it is the correlation squared, but not for the multiple regression model). The fitted line does not pass precisely through each point on the plot (i.e. for each data point an error is made). The fitted value for observation i is the value that lies on the regression line corresponding to the Xi value for that particular observation. If you draw a vertical line through a particular point in the XY-plot, the intersection between this vertical line and the regression line is the fitted value corresponding to the point that you chose. Adding an i subscript indicates that we are referring to a particular observation.
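The OLS estimates that minimize the SSR have well-known closed-form formulas for simple regression: beta(hat) = Σ(Xi - X(bar))(Yi - Y(bar)) / Σ(Xi - X(bar))^2 and alpha(hat) = Y(bar) - beta(hat)*X(bar). A sketch with hypothetical data points:

```python
# OLS for simple regression via the closed-form formulas that minimize
# the sum of squared residuals. The five data points are hypothetical.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# beta_hat = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
           sum((x - x_bar) ** 2 for x in X)
# alpha_hat = Ybar - beta_hat * Xbar
alpha_hat = y_bar - beta_hat * x_bar

# Fitted values, residuals, and the SSR these estimates minimize.
fitted = [alpha_hat + beta_hat * x for x in X]
residuals = [y - f for y, f in zip(Y, fitted)]
ssr = sum(u ** 2 for u in residuals)

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}, SSR = {ssr:.4f}")
```

Any other line through these points would give a larger SSR; that is what "least squares" means.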
By looking at the actual Yi and the fitted Y(hat)i, we can gain a rough impression of the "goodness of fit" of the regression model. This measures how well the regression model fits and allows you to examine individual observations to determine which ones are close to the regression line and which are not. The difference between the actual and fitted value of Y is another way to express the residual: ui = Yi - Y(hat)i. Sometimes big outliers are of interest in their own right. A concept closely related to R^2 is the TOTAL SUM OF SQUARES (TSS): TSS = Σ(Yi - Y(bar))^2, the total variability of Y about its mean. The total variability of Y can be broken down into two parts: TSS = RSS + SSR, where RSS = Σ(Y(hat)i - Y(bar))^2 is the REGRESSION SUM OF SQUARES, the part of the variability of Y explained by the explanatory variable X. SSR is the sum of squared residuals, and a good-fitting regression model will make the SSR very small. Combining these yields a measure of fit: R^2 = 1 - (SSR/TSS) or, equivalently, R^2 = RSS/TSS. RSS, SSR, and TSS are all non-negative (TSS ≥ RSS and TSS ≥ SSR). This means that 0 ≤ R^2 ≤ 1. A regression line that fits all data points in the XY-plot perfectly will have no errors, hence SSR = 0 and R^2 = 1. In summary, high values of R^2 imply a good fit and low values a bad fit. If RSS is near TSS, the fit is good because the model accounts for almost all the variability in the dependent variable. 5. STATISTICAL ASPECTS OF REGRESSION: You can think of a point estimate as your best guess at what beta is. We can also obtain confidence intervals; for example, a 95% CI says "we are 95% confident that beta lies in this interval." The degree of confidence is referred to as the confidence level. WHICH FACTORS AFFECT THE ACCURACY OF THE ESTIMATE BETA(HAT)?: A fitted line whose points are narrow and bunched together around it is the most accurate. 1. Having more data points improves the accuracy of estimation. 2. Having smaller errors improves the accuracy of estimation. 3. Having a larger spread of values (i.e.
a large variance) of the explanatory variable (X) improves the accuracy of estimation. You want your data to be diverse and cover a broad spectrum, so you want X to have a high variance. Having a large spread of values (i.e. a large variance) for the error (e), however, does not help. CALCULATING A CONFIDENCE INTERVAL FOR BETA: The factors mentioned before enter the interval estimate for beta: the confidence interval. Sb is the standard deviation of beta(hat) and is often referred to as the standard error. The confidence interval is [beta(hat) - tb*Sb, beta(hat) + tb*Sb], where tb is the critical value from the Student-t distribution. Note that the more confident you wish to be about your interval, the wider it becomes. For ex. a 99% confidence interval is wider than a 95% confidence interval. tb decreases with N (the more data points you have, the smaller the CI); tb increases with the level of confidence you choose. TESTING WHETHER BETA = 0: One way to test whether beta = 0 is to look at the CI and see whether it contains 0. If you use the CI approach to hypothesis testing, your significance level is 100% minus your confidence level (so with a 95% CI you can say "I reject the hypothesis that beta = 0 at the 5% level of significance"). The alternative way of carrying out hypothesis testing is to calculate the test statistic (t-stat): t = beta(hat)/Sb. The P-value provides a direct measure of whether t is "large" or "small". 6. MULTIPLE REGRESSION: Not much changes in the statistical techniques between multiple and simple regression. Since multiple regression implies the existence of more than two variables, we cannot draw an XY-plot on a two-dimensional graph. If we have three explanatory variables, multiple regression involves fitting a line through a four-dimensional graph in which Y is plotted on one axis, X1 on the 2nd, X2 on the 3rd, and X3 on the 4th. OLS ESTIMATION AS A BEST-FITTING LINE: The multiple regression model with k explanatory variables is written Y = alpha + beta1*X1 + beta2*X2 + ... + betak*Xk + e. The SSR is SSR = Σ(Yi - alpha(hat) - beta1(hat)X1i - ... - betak(hat)Xki)^2.
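The confidence interval and t-stat above can be sketched in code. The standard error formula Sb = sqrt(SSR/(N-2)) / sqrt(Σ(Xi - X(bar))^2) is the usual one for simple regression; the data are hypothetical, and for simplicity the critical value 1.96 is the large-sample (normal) approximation to the Student-t value, so with few data points the exact interval would be somewhat wider:

```python
# Standard error, approximate 95% CI, and t-stat for beta_hat in simple
# regression, on hypothetical data. 1.96 is a normal approximation to
# the Student-t critical value (adequate only for larger N).
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
Y = [1.2, 2.1, 2.8, 4.5, 4.9, 6.2, 6.8, 8.1]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)

beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
alpha_hat = y_bar - beta_hat * x_bar

residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(X, Y)]
ssr = sum(u ** 2 for u in residuals)

# Sb: standard error of beta_hat.
s_b = math.sqrt(ssr / (n - 2)) / math.sqrt(sxx)

t_crit = 1.96  # approximate 95% critical value
ci_low = beta_hat - t_crit * s_b
ci_high = beta_hat + t_crit * s_b

t_stat = beta_hat / s_b  # test statistic for H0: beta = 0

print(f"beta_hat = {beta_hat:.3f}, Sb = {s_b:.4f}")
print(f"approx 95% CI: [{ci_low:.3f}, {ci_high:.3f}], t-stat = {t_stat:.2f}")
```

Here the CI excludes 0 and the t-stat is large, so the hypothesis beta = 0 would be rejected at the 5% level, matching the two equivalent testing approaches described above.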
STATISTICAL ASPECTS OF MULTIPLE REGRESSION: The statistical aspects of multiple regression are essentially identical to the simple regression case. INTERPRETING OLS ESTIMATES: The interpretation is largely the same, except that each coefficient now measures the marginal effect of its variable holding the other explanatory variables constant. PITFALLS OF USING SIMPLE REGRESSION IN A MULTIPLE REGRESSION CONTEXT: In the bedroom example, a simple regression model looks only at the number of bedrooms to explain the house price. Most of the time, though, there are other variables in play, and with multiple regression analysis we can include them; for example, bathrooms may matter more to the buyer than bedrooms. The simple regression combines the contributions of all these factors and allocates them to the only explanatory variable it can: bedrooms. Hence beta(hat) is too big. The multiple regression model allows us to disentangle the individual contributions of however many explanatory variables are assumed to affect the house price. OMITTED VARIABLE BIAS: Omitted variable bias affects the results. If we omit explanatory variables that should be present in the regression, and these omitted variables are correlated with the explanatory variables that are included, then the coefficients on the included variables will be wrong. You will almost always have some omitted variable bias, and there is little that can be done about it. 7. REGRESSION WITH DUMMY VARIABLES: Dummy variables are a way of turning qualitative variables into quantitative variables. Once this change is made, we can continue to use the formulas from the previous chapters. Formally, a dummy variable is a variable that can take on only two values: 0 and 1. SIMPLE REGRESSION WITH A DUMMY VARIABLE: The one-dummy-variable regression model is Y = alpha + beta*D + e. If we carry out OLS estimation of this regression model we obtain alpha(hat) and beta(hat). The straight-line relationship between Y and D gives a fitted value for the i-th observation of Y(hat)i = alpha(hat) + beta(hat)Di.
Since Di is either 0 or 1, Y(hat)i = alpha(hat) or Y(hat)i = alpha(hat) + beta(hat). MULTIPLE REGRESSION WITH DUMMY VARIABLES: The multiple regression model with several dummy explanatory variables is Y = alpha + beta1*D1 + ... + betak*Dk + e. MULTIPLE REGRESSION WITH DUMMY AND NON-DUMMY EXPLANATORY VARIABLES: In practice, you may have a mix of different types of explanatory variables. The simplest such case has one dummy variable (D) and one quantitative explanatory variable (X) in the regression: Y = alpha + beta1*D + beta2*X + e. We can extend this to multiple dummy and non-dummy explanatory variables. The following example has two of each: Y = alpha + beta1*D1 + beta2*D2 + beta3*X1 + beta4*X2 + e. The interpretation of results from this regression model combines elements from the previous examples in this chapter. STANDARD DEVIATION STEPS: 1. Work out the mean (the simple average of the numbers). 2. For each number: subtract the mean and square the result. 3. Work out the mean of those squared differences. 4. Take the square root of that and we are done! PRACTICE EXAM FOR ECON 309 FINAL: The residual (ui) measures the difference between the predicted value of the dependent variable and the mean value of the dependent variable (Y(bar))? - FALSE: It is the difference between Yi and Y(hat)i. The correlation coefficient should be between -1 and 1? - TRUE. The estimated intercept in a regression model must always be positive? - FALSE: It can be a negative number or 0. The estimated regression line is obtained by finding the values of alpha(hat) and beta(hat) that minimize the sum of the residuals? - FALSE: alpha(hat) and beta(hat) minimize the sum of the squared residuals. Why would someone want to include a dummy variable in a regression? - When we have qualitative factors whose effects on the dependent variable we want to explore, we can include a dummy variable to account for them.
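The dummy-variable regression Y = alpha + beta*D + e has a known closed-form result: OLS gives alpha(hat) equal to the mean of Y in the D = 0 group, and beta(hat) equal to the difference between the two group means. A sketch with hypothetical house prices (in thousands of dollars; D = 1 might mark, say, houses with air conditioning):

```python
# OLS on a single dummy variable reproduces group means: alpha_hat is the
# mean of the D = 0 group, beta_hat the difference in group means.
# Prices (thousands of dollars) and the dummy coding are hypothetical.
D = [0, 0, 0, 1, 1, 1, 1]
Y = [150.0, 160.0, 170.0, 200.0, 210.0, 190.0, 220.0]

n = len(D)
d_bar = sum(D) / n
y_bar = sum(Y) / n

# Ordinary OLS formulas, exactly as for any simple regression.
beta_hat = sum((d - d_bar) * (y - y_bar) for d, y in zip(D, Y)) / \
           sum((d - d_bar) ** 2 for d in D)
alpha_hat = y_bar - beta_hat * d_bar

# The same numbers recovered directly from group means.
mean_d0 = sum(y for d, y in zip(D, Y) if d == 0) / D.count(0)
mean_d1 = sum(y for d, y in zip(D, Y) if d == 1) / D.count(1)

print(f"alpha_hat = {alpha_hat:.1f}  (mean of D=0 group: {mean_d0:.1f})")
print(f"beta_hat  = {beta_hat:.1f}  (difference in means: {mean_d1 - mean_d0:.1f})")
```

This makes the fitted-value statement above concrete: Y(hat) takes only two values, the average price of the D = 0 houses and the average price of the D = 1 houses.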