Econ107 Applied Econometrics
Topic 1: An Overview of Regression Analysis (Studenmund, Chapter 1)

I. The Nature and Scope of Econometrics

There are lots of definitions of econometrics.
• Nobel Prize Committee
• Paul Samuelson, et al.: “Econometrics may be defined as quantitative analysis of actual economic phenomena.”
• Goldberger: “... application of economic theory, mathematics and statistical inference to the analysis of economic phenomena.”
• (Joke) E.E. Leamer: “There are two things you don’t want to see in the making – sausage and econometric research.”

II. Major Uses of Econometrics

1. Describing economic reality
2. Testing hypotheses about economic theory
3. Forecasting future economic activity

III. Econometric Methodology – Regression Analysis

An important methodology in econometrics is regression analysis, which typically follows the steps below. We use a famous example to illustrate.

1. State the hypotheses.
Keynes in the General Theory said that a $1 increase in income will lead to less than a $1 increase in overall consumption. We want to test this hypothesis: that the marginal propensity to consume (MPC) is less than 1.

2. Specify the mathematical model of the theory.
Keynes didn’t specify the exact nature of the relationship, but the theory suggests a simple linear relationship:
\[ C = \beta_0 + \beta_1 DI, \qquad 0 < \beta_1 < 1 \]
where C = aggregate consumption and DI = aggregate disposable income. The slope \(\beta_1\) is the MPC.

3. Specify the econometric model.
This purely mathematical model is of limited interest to the econometrician, because it assumes an exact or deterministic relationship between C and DI. We re-write the equation with a disturbance or error term \(\varepsilon\):
\[ C = \beta_0 + \beta_1 DI + \varepsilon \]
This is now an econometric model, or more precisely a linear regression model.

4. Obtain the data.
The only way to estimate the parameters of interest in this model is to obtain the necessary data. The data source could involve time series, cross-sectional or panel data.
Time series data are collected over time for the same country or other single aggregate economic unit (e.g., aggregate C and DI could be obtained for Singapore from 1950-2000). In this case, we’d normally re-write the equation with a ‘t’ subscript on the variables and disturbance term to denote ‘time’:
\[ C_t = \beta_0 + \beta_1 DI_t + \varepsilon_t \]

Cross-sectional data are collected for a sample of individuals, households, firms or other disaggregate economic entities at a point in time (e.g., C and DI could be obtained for a sample of 1,000 Singapore families during 2000). In this case, we’d normally re-write the equation with an ‘i’ subscript on the variables and disturbance term to denote ‘individual’:
\[ C_i = \beta_0 + \beta_1 DI_i + \varepsilon_i \]

Finally, panel data contain elements of both time series and cross-sectional data (e.g., C and DI could be obtained for all countries in the OECD during the period 1950-2000). Note that we have variation across countries at any single point in time, as well as variation across time. In this case, we’d normally re-write the equation with both an ‘i’ and a ‘t’ subscript on the variables and disturbance term to denote ‘country’ and ‘time’:
\[ C_{it} = \beta_0 + \beta_1 DI_{it} + \varepsilon_{it} \]

Time series or cross-sectional data could be plotted as a ‘scatter diagram’ below:
[Figure: scatter diagram]

5. Estimate the parameters in the econometric model.
Now it’s time to estimate the coefficients in the model. The basic idea is to come up with a ‘line’ that best ‘fits’ the data points. Imagine that this ‘regression analysis’ yields the following consumption function:
\[ \hat{C} = 336.9 + 0.820\,DI \]
These are the estimates of the 2 coefficients: \(\hat{\beta}_0 = 336.9\) and \(\hat{\beta}_1 = 0.820\). The ‘hat’ on C indicates that this is an ‘estimated’ consumption function or regression model.

6. Test the hypothesis.
Recall that we wanted to test Keynes’ hypothesis that the MPC was between zero and 1. The estimate of 0.820 looks reasonable, but we do not yet know whether there is any ‘statistical’ evidence that the true MPC is below 1.

7. Forecast or predict economic behaviour.
One of the other uses of this model is forecasting or predicting future economic behaviour. To predict C, however, we need to know future values of DI. Suppose you know that DI is going to be $65,000 (millions):
\[ \hat{C} = 336.9 + 0.820(65{,}000) = 53{,}636.9 \]
This also allows you to predict savings of $11,363.1 (millions), which is just the difference between DI and C (65,000 − 53,636.9).

8. Use the model for policy purposes.
The model can also be used for ‘control’ purposes. Suppose that C of 53.6 billion is insufficient to maintain full employment: there is not enough spending by households. The government could consider increasing DI through tax cuts to achieve a higher target. Suppose 62 billion is needed. Setting predicted consumption equal to the target and solving for DI:
\[ 62{,}000 = 336.9 + 0.820\,DI \quad\Rightarrow\quad DI = \frac{62{,}000 - 336.9}{0.820} = 75{,}198.9 \]
Thus, taxes need to be cut by just over $10 billion from forecasted levels (75,198.9 − 65,000 = 10,198.9).

IV. Types of Econometrics and Names of Variables in Regression

Econometrics is split into ‘theoretical’ and ‘applied’ fields. We end up ‘straddling’ these 2 approaches.

Theoretical econometrics concerns the development of basic estimation approaches, properties of estimators, etc. It is more closely related to mathematical statistics (e.g., proofs, axioms, ...).

Applied econometrics is built on this theoretical foundation. It applies estimation techniques to various areas of economic enquiry. Examples: Where to open a new restaurant? How much to spend on advertising? Should we fix the target interest rate? How many hours to spend studying for Econ107? Academics and the private and government sectors have increasingly used econometrics.

Regression analysis is the study of the relationship between a ‘Dependent Variable’ and one or more ‘Independent’ or ‘Explanatory Variables’.
In the linear regression model (or true regression line or population regression function)
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_K X_{Ki} + \varepsilon_i \]
• \(Y_i\) is called the dependent or left-hand-side variable or regressand, and is random;
• \(X_{ki}\) \((k = 1, \ldots, K)\) is called an independent or explanatory or right-hand-side variable or regressor, and can be fixed or random;
• \(\varepsilon_i\) is called the error or disturbance term, and is random;
• the \(\beta\)’s are called regression coefficients; they are unknown and fixed;
• \(\beta_0\) is the intercept coefficient;
• \(\beta_k\) \((k = 1, \ldots, K)\) are the slope coefficients.

The meaning of \(\beta_1\) is the impact of a one-unit increase in \(X_1\) on \(Y\), holding constant the other independent variables.

The estimated regression line (or sample regression function) is written as
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_K X_{Ki} \]
• \(\hat{Y}_i\) is called the ‘estimated’ or fitted value of \(Y_i\);
• \(\hat{\beta}_k\) \((k = 0, \ldots, K)\) is called an estimated regression coefficient.

Define \(e_i = Y_i - \hat{Y}_i\) and call \(e_i\) the residual.

When K = 1, the regression model is the Simple Linear Regression (SLR) model. When K > 1, the regression model is the Multiple Linear Regression (MLR) model.

V. Statistical vs. Deterministic Relationships

Regression analysis is concerned with a Statistical, not a Functional or Deterministic, dependence among variables. In statistical relationships, the variables are Random or Stochastic.

VI. Regression vs. Causation

Although regression analysis deals with the dependence of one variable on other variables, it doesn’t necessarily imply causation. A causal relationship must come from outside of statistics. Economic theory is supposed to provide the compelling evidence of causation.

VII. The True (or Population) Regression Function (PRF)

Suppose we have a small community of 12 families. We’re interested in studying the relationship between their weekly disposable income (X) and expenditure on food (Y). We want to predict the population mean of food expenditures, given some level of family income. The 12 families can be grouped into four income groups.
Each family within a group has the same disposable income. This is the entire population, not a sample.

Disposable Income (X)    Individual Food Expenditures (Y)    Average Food Expenditures
250                      78.00, 88.50, 96.00                 87.50
300                      77.50, 89.00, 96.50, 109.00         93.00
350                      90.50, 106.50                       98.50
400                      99.00, 103.00, 110.00               104.00

Plot these data points on the following diagram. This is often known as a Scatter Diagram.
[Figure: scatter diagram]

The ‘solid’ dots are the actual observations. The Conditional Mean or Conditional Expectation of Y at a given income level is
\[ E(Y \mid X = X_i) \]
and these are the averages in the last column of the table. The ‘circles’ are the conditional means. Clearly, food expenditures ‘on average’ increase with disposable income. This can be seen even more clearly by ‘connecting’ these conditional means with a straight line. This is the True (or Population) Regression Line. Note that in other settings it could also be a True (or Population) Regression Curve.

Geometrically, a population regression line or curve is simply the locus of the conditional means or expectations of the dependent variable for fixed values of the explanatory variable(s). In general, we could write the Population Regression Function (PRF) as:
\[ E(Y \mid X_i) = f(X_i) \]
where \(f\) is some function of the explanatory variable. We might anticipate that food consumption will be linearly related to disposable income. This is an initial assumption of our estimation. We could narrow this functional form to:
\[ E(Y \mid X_i) = \beta_0 + \beta_1 X_i \]
This is known as the linear PRF (or PR Line).

VIII. ‘Linearity’ in Regression Analysis

What do we mean when we say that our regression model is linear? There are two ways in which a model can fail to be linear.

One possibility is that the model is nonlinear in terms of the variables, for example:
\[ E(Y \mid X_i) = \beta_0 + \beta_1 X_i^2 \]
The second possibility is that the PRF is nonlinear in terms of the coefficients, for example:
\[ E(Y \mid X_i) = \beta_0 + \sqrt{\beta_1}\, X_i \]
Regression functions of this second kind will not be considered in this course, but the first kind (nonlinear in the variables, linear in the coefficients) will be. From now on, ‘linear regression models’ should be read as linear in terms of the parameters.

IX.
Adding the Disturbance Term to Our PRF

The PRF tells us the ‘average’ food expenditures for a given level of household income. But we know that any ‘particular’ household is unlikely to be exactly on this function. For this reason we rewrite the PRF as
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
where \(\varepsilon_i\) is a random variable with mean 0. There are lots of reasons why \(\varepsilon_i\) might exist.
• Minor influences on Y are omitted.
• The underlying theoretical equation might have a different functional form than the one chosen for the regression.
• Some purely random variation is always present.
• Measurement error in Y or X.

X. The Sample (Estimated) Regression Function

Thus far, we’ve dealt with the entire population and the PRF, and avoided any consideration of sampling. In most cases, we will never observe the entire population. We have to infer from a sample or samples what the PRF might look like. Note that we’re unlikely to know just how close we get to the truth.

Each sample we draw can be used to produce a Sample (Estimated) Regression Function (SRF), that is, the estimated regression function:
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]
Of course, we can replace the fitted value (\(\hat{Y}_i\)) on the LHS with the actual value of the dependent variable (\(Y_i\)). The LHS is then no longer an estimate; it’s the actual value. The RHS now includes the residual term \(e_i\):
\[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i \]
This means that the actual dependent variable can be decomposed into its fitted value and the residual:
\[ Y_i = \hat{Y}_i + e_i \]
This residual, like the disturbance, can be either positive or negative. We can either overestimate the actual value of \(Y_i\):
\[ Y_i - \hat{Y}_i = e_i < 0 \quad \text{if } Y_i < \hat{Y}_i \]
or underestimate it:
\[ Y_i - \hat{Y}_i = e_i > 0 \quad \text{if } Y_i > \hat{Y}_i \]

XI. Questions for discussion: Q1.10

XII. Run the height regression (Section 1.4) using the data file provided. Do further exploration according to Q1.4 and Q1.5.
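
The distinction between the PRF (Section VII) and an SRF with its residuals (Section X) can be checked numerically on the 12-family population. The sketch below is illustrative only. It assumes ordinary least squares (OLS) as the way of drawing the ‘best-fit’ line, a method these notes have not yet introduced, and the four-family sample is a hypothetical draw (one family per income group) chosen purely for this example. The helper name `ols` and all variable names are likewise assumptions of the sketch.

```python
# Population from Section VII: weekly disposable income (X) mapped to the
# food expenditures (Y) of every family at that income level.
population = {
    250: [78.00, 88.50, 96.00],
    300: [77.50, 89.00, 96.50, 109.00],
    350: [90.50, 106.50],
    400: [99.00, 103.00, 110.00],
}


def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points.

    Assumption of this sketch: OLS is used as the 'best-fit' rule.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Conditional means E(Y | X): the 'circles' in the scatter diagram.
cond_means = {x: sum(ys) / len(ys) for x, ys in population.items()}

# Line through all 12 families. Because the conditional means lie exactly
# on a straight line here, this recovers the linear PRF.
pop_x = [x for x, ys in population.items() for _ in ys]
pop_y = [y for ys in population.values() for y in ys]
b0_pop, b1_pop = ols(pop_x, pop_y)

# Hypothetical sample: one family picked from each income group.
sample_x = [250, 300, 350, 400]
sample_y = [78.00, 109.00, 90.50, 110.00]
b0_hat, b1_hat = ols(sample_x, sample_y)

# Fitted values and residuals e_i = Y_i - Y_hat_i for the sample.
fitted = [b0_hat + b1_hat * x for x in sample_x]
residuals = [y - f for y, f in zip(sample_y, fitted)]

print("Conditional means:", cond_means)
print("PRF: E(Y|X) = %.2f + %.3f X" % (b0_pop, b1_pop))
print("SRF: Y_hat  = %.2f + %.3f X" % (b0_hat, b1_hat))
for x, y, f, e in zip(sample_x, sample_y, fitted, residuals):
    verdict = "overestimate" if e < 0 else "underestimate"
    print("X=%d  Y=%.2f  Y_hat=%.2f  e=%+.2f  (%s)" % (x, y, f, e, verdict))
```

Under these assumptions the population line is \(E(Y \mid X) = 60 + 0.11X\), while this particular sample gives \(\hat{Y} = 46.5 + 0.155X\), with residuals of −7.25, 16, −10.25 and 1.5. A different sample of four families would give a different SRF, which is why in practice we cannot tell from one sample alone how close it is to the PRF.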