Review for Mid-Term Exam

What is this course about?
Estimating how changes in one or more "explanatory" or "independent" variables, X1,…,Xk, change the value of a variable of interest, the "dependent" variable, Y, all else equal.

Why?
- To provide quantitative answers to quantitative questions:
  - Policy evaluation/design
  - Forecasting/prediction
- To evaluate the plausibility of economic hypotheses

Approach?
A scientific approach based on the principles of probability and statistics.

Major Topics Covered So Far
1. Estimating the population mean and the difference between two population means (this also provided the opportunity to review basic ideas from probability and statistics): Ch's 2 and 3; Problem Set 1
2. Estimating the simple linear regression model: Ch. 4; Problem Sets 2 and 3
3. Estimating the multiple linear regression model: Ch. 5; Problem Set 4

The same approach was taken to address each of these topics:

1. There is a population from which Y is drawn, and we are interested in the population mean or the conditional mean of Y: E(Y), E(Y|X), or E(Y|X1,…,Xk). The linear regression model is defined by the assumption, which may or may not be correct, that the conditional mean of Y given X (or X1,…,Xk) is a linear function of X (or X1,…,Xk).

2. We draw what we assume is an i.i.d. sample from the population:
   Y1,…,Yn
   (Y1,X1),…,(Yn,Xn)
   (Y1,X11,…,Xk1),…,(Yn,X1n,…,Xkn)

3. We use the sample to construct:
   - an estimator of the mean
   - test statistics
   - confidence intervals
   - descriptive statistics

An estimator is a procedure that is applied to compute an estimate of the parameters of interest; in our case, either the population mean or the conditional population mean. An estimator is a random variable, and its quality is evaluated by its sampling distribution.

The estimator we applied in all three cases: the OLS estimator.

Sampling distribution of the OLS estimator (under the appropriate assumptions):
- Unbiased estimator
- Consistent estimator
- (β̂ − β)/se(β̂) ~ N(0,1), for sufficiently large sample sizes

A test statistic is a random variable whose value is computed from the sample and whose distribution under the "null hypothesis" is known and has a simple form.

Consider null hypotheses of the form H0: β = b, where b is some known number. Under the null hypothesis, the t-statistic
   t = (β̂ − b)/se(β̂)
is drawn from a N(0,1) distribution, for sufficiently large sample sizes. For a given alternative hypothesis, HA: β ≠ b, HA: β < b, or HA: β > b, we can compare the calculated t-statistic against the percentiles of the N(0,1) distribution to determine the "p-value" of the test and, for a given test size (significance level), whether or not to reject the null.

In the multiple regression model, we encountered the F-statistic (and the closely related chi-square statistic) to test restrictions involving more than one coefficient:
- joint hypotheses (e.g., are the values of a group of coefficients all equal to zero?)
- a single restriction on the relationship among coefficients (e.g., is β2 = β3?)
Under the null hypothesis and for sufficiently large sample sizes, F ~ F(q,∞) or, equivalently, qF ~ χ2(q), where q is the number of restrictions that make up H0.

A confidence interval (or interval estimate) is a random interval derived from the sample. That is, it is an interval whose endpoints depend on the particular sample that was drawn and, therefore, will vary from sample to sample. An x% confidence interval for the parameter β (0 < x < 100) will contain the actual population value of β x% of the time.
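As a study aid, here is a minimal sketch (not part of the course materials) of how these pieces fit together in practice, using simulated data and Python's statsmodels library. The sample size, variable names, and true coefficient values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

# Simulate an i.i.d. sample from a population where, by construction,
# E(Y | X1, X2) = 1 + 2*X1 + 0*X2 (coefficients chosen for illustration).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
y = 1 + 2 * x[:, 0] + rng.normal(size=n)

X = sm.add_constant(x)            # regressors: const, x1, x2
results = sm.OLS(y, X).fit()      # the OLS estimator applied to the sample

print(results.params)             # the estimates (beta-hats)
print(results.bse)                # se(beta-hat)
print(results.tvalues)            # t-statistics for H0: beta = 0
print(results.pvalues)            # p-values against HA: beta != 0

# F-test of the joint hypothesis H0: beta1 = 0 and beta2 = 0 (q = 2 restrictions)
print(results.f_test("x1 = 0, x2 = 0"))
```

The reported t-statistics are exactly (β̂ − 0)/se(β̂), matching the formula above with b = 0, and the F-test output includes the p-value from the F(q, n−k−1) distribution.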
Under our usual assumptions, a (1−2α) confidence interval for β is given by
   β̂ ± z(1−α) · se(β̂),
where z(1−α) is the (1−α) percentile of the N(0,1) distribution.

Note: Heteroskedasticity vs. Homoskedasticity
The assumptions we have made allow the variance of the Y's for a given value of X (or for given values of X1,…,Xk) to depend on the value of X. That is, we allow the errors in the regression model to be heteroskedastic. Experience suggests that the more restrictive assumption of homoskedasticity is not very plausible in most applications with cross-sectional data.

The formulas for the standard error of β̂ (and, therefore, for the t-statistics that depend on this standard error) and for the F-statistic differ according to whether heteroskedasticity or homoskedasticity is assumed. Most regression software computes these standard errors and statistics under the default setting of homoskedasticity, but provides an option to override the default and compute them appropriately under the assumption of heteroskedasticity.

Descriptive statistics, including the standard error of the regression (SER), the R2, and the adjusted R2, are used to measure the amount of the variation in the observed values of the Y's accounted for by the variation in the X's. A high R2 does not imply a good and/or meaningful regression; a low R2 does not imply a bad and/or meaningless regression.
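To make the homoskedastic-default vs. robust distinction concrete, here is a short sketch in the same statsmodels setup as above; the data-generating process, in which the error variance depends on X, is an illustrative assumption, and cov_type="HC1" is one of statsmodels' heteroskedasticity-robust options.

```python
import numpy as np
import statsmodels.api as sm

# Simulate heteroskedastic errors: Var(u | X) grows with |X|.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n) * (0.5 + np.abs(x))   # error variance depends on x
y = 1 + 2 * x + u

X = sm.add_constant(x)
model = sm.OLS(y, X)

homosk = model.fit()                  # default: homoskedasticity-only SEs
robust = model.fit(cov_type="HC1")    # heteroskedasticity-robust SEs

print(homosk.bse, robust.bse)         # the two se(beta-hat) formulas differ

# 95% CI (alpha = 0.025, so 1 - 2*alpha = 0.95): beta-hat +/- z(0.975)*se(beta-hat)
beta_hat, se = robust.params[1], robust.bse[1]
print(beta_hat - 1.96 * se, beta_hat + 1.96 * se)
print(robust.conf_int(alpha=0.05))    # statsmodels' built-in interval

# Descriptive statistics: SER, R-squared, adjusted R-squared
print(np.sqrt(robust.mse_resid), robust.rsquared, robust.rsquared_adj)
```

Note that the coefficient estimates are identical across the two fits; only the standard errors, and hence the t-statistics and confidence intervals, change when the robust formula is used.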