ECONOMETRICS I
CHAPTER 1: THE NATURE OF REGRESSION ANALYSIS
Textbook: Damodar N. Gujarati (2004), Basic Econometrics, 4th edition, The McGraw-Hill Companies

HISTORICAL ORIGIN OF THE TERM REGRESSION
• The term regression was introduced by Francis Galton.
• He found that, although there was a tendency for tall parents to have tall children and for short parents to have short children, the average height of children born of parents of a given height tended to move or “regress” toward the average height in the population as a whole. This tendency is called Galton’s law of universal regression.

THE MODERN INTERPRETATION OF REGRESSION
• Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) mean or average value of the former in terms of the known or fixed (in repeated sampling) values of the latter.

Examples of Regression Analysis
1. Reconsider Galton’s law of universal regression. We want to find out how the average height of sons changes, given the fathers’ height. See the scatter diagram, or scattergram, in Figure 1.1.
Figure 1.1 Hypothetical distribution of sons’ heights corresponding to given heights of fathers.
2. Consider the heights of boys measured at fixed ages. Notice that corresponding to any given age we have a range of heights. Therefore, knowing the age, we may be able to predict the average height corresponding to that age.
Figure 1.2 Hypothetical distribution of heights corresponding to selected ages.
5. A labor economist may want to study the rate of change of money wages in relation to the unemployment rate.
Figure 1.3
6.
From monetary economics it is known that, other things remaining the same, the higher the rate of inflation π, the lower the proportion k of their income that people would want to hold in the form of money, as depicted in Figure 1.4. A quantitative analysis of this relationship will enable the monetary economist to predict the amount of money, as a proportion of their income, that people would want to hold at various rates of inflation.
Figure 1.4 Money holding in relation to the inflation rate π

STATISTICAL AND DETERMINISTIC RELATIONSHIPS
• In regression analysis we are concerned with what is known as statistical dependence among variables, not the functional or deterministic dependence found, for example, in classical physics.
• In statistical relationships among variables we essentially deal with random or stochastic variables. These variables have probability distributions.

REGRESSION VERSUS CAUSATION
• Although regression analysis deals with the dependence of one variable on other variables, it does not necessarily imply causation.
• A statistical relationship per se cannot logically imply causation.

REGRESSION VERSUS CORRELATION
• In correlation analysis we try to measure the strength or degree of linear association between two variables. The correlation coefficient measures this strength of (linear) association.
• In regression analysis we try to estimate the average value of one variable on the basis of the fixed values of other variables.
• In correlation analysis we treat any two variables symmetrically. There is no distinction between dependent and explanatory variables, and both variables are considered random.
• Most of regression theory is based on the assumption that the dependent variable is stochastic but the explanatory variables are fixed or nonstochastic.
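The symmetry of correlation and the asymmetry of regression can be checked numerically. The following is a minimal sketch, not part of the textbook: it uses the GPA and study-hours cross-section that appears later in this chapter, and it assumes ordinary least squares as the estimation method (the chapter itself does not yet name one). The function names are illustrative.

```python
from math import sqrt

# Cross-sectional observations from the GPA / study-hours table in this chapter.
gpa = [3.5, 2.7, 1.9, 2.3, 2.0, 2.2, 2.5]
hours = [10, 8, 9, 5, 8, 6, 3]


def mean(values):
    return sum(values) / len(values)


def cross_deviations(a, b):
    """Sum of products of deviations of a and b from their means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))


def correlation(a, b):
    """Sample correlation coefficient; swapping a and b changes nothing."""
    return cross_deviations(a, b) / sqrt(
        cross_deviations(a, a) * cross_deviations(b, b)
    )


def ols_slope(y, x):
    """Least-squares slope from regressing y on x with an intercept
    (assumption: OLS is the estimator)."""
    return cross_deviations(x, y) / cross_deviations(x, x)


r_gpa_hours = correlation(gpa, hours)
r_hours_gpa = correlation(hours, gpa)
slope_gpa_on_hours = ols_slope(gpa, hours)
slope_hours_on_gpa = ols_slope(hours, gpa)

print("corr(GPA, hours):", r_gpa_hours)
print("corr(hours, GPA):", r_hours_gpa)
print("slope, GPA on hours:", slope_gpa_on_hours)
print("slope, hours on GPA:", slope_hours_on_gpa)
```

The two correlations coincide, whereas the two slopes differ: which variable is treated as dependent matters in regression but not in correlation. In this two-variable case the product of the two slopes equals the squared correlation coefficient.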
TERMINOLOGY
Dependent variable      Explanatory variable
Explained variable      Independent variable
Predictand              Predictor
Regressand              Regressor
Response                Stimulus
Endogenous              Exogenous
Outcome                 Covariate
Controlled variable     Control variable

• In a simple (two-variable) regression analysis we study the dependence of a variable on only a single explanatory variable, such as that of consumption expenditure on real income.
• In a multiple regression analysis we study the dependence of one variable on more than one explanatory variable, such as that of money demand on interest rates, income, and inflation.
• The term random is a synonym for the term stochastic. A random (stochastic) variable is a variable that can take on any set of values, positive or negative, with a given probability.

NOTATION
• Y: dependent variable
• X1, X2, … , Xk: explanatory variables
• Xk: kth explanatory variable
• Xki: ith observation on variable Xk (cross-sectional data)
• Xkt: tth observation on variable Xk (time series data)
• N (or T): the total number of observations or values in the population.
• n (or t, for time series data): the total number of observations in the sample.

TYPES OF DATA
• There are mainly three types of data for empirical analysis:
1. Time series data
2. Cross-sectional data
3. Pooled data

Time series data
• A time series is a set of observations on the values that a variable takes at different times.

Cross-sectional data
• Cross-sectional data are data on one or more variables collected at the same point in time.

GPA    study hours/week
3.5    10
2.7    8
1.9    9
2.3    5
2.0    8
2.2    6
2.5    3

Pooled data
• In pooled data there are elements of both time series and cross-sectional data.

time    GPA    study hours/week
2000    2.5    9
2000    2.7    8
2000    2.3    6
2005    1.9    5
2005    3.1    12
2010    2.4    7
2010    2.0    5
2010    3.9    11
2010    1.2    2

• Panel data are a special type of pooled data in which the same cross-sectional unit is surveyed over time.
person    time    GPA    study hours/week
1         2010    2.5    9
1         2011    2.7    7
1         2012    2.3    6
2         2010    1.9    8
2         2011    3.1    12
2         2012    2.4    6
3         2010    2.0    5
3         2011    3.9    11
3         2012    1.2    2

Sources of Data
• Government agencies (Department of Commerce...)
• International agencies (World Bank...)
• Surveys

In the social sciences the data that one generally obtains are nonexperimental in nature, that is, not subject to the control of the researcher.

The quality of the data used in economics is often not that good, for several reasons:
1. Possibility of observational errors.
2. Approximations and roundoffs.
3. Nonresponse to surveys may cause selectivity bias.
4. The sampling methods used in obtaining the data may vary so widely that it might be very difficult to compare the results.
5. Economic data are generally available at a highly aggregate level. Such highly aggregated data may not tell us much about the individual or micro-level units (GNP...).
6. Because of confidentiality, certain data can be published only in highly aggregate form (health data...).

The researcher should always keep in mind that the results of research are only as good as the quality of the data.
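Returning to the panel table above: because the same persons are observed in every year, one can follow each unit over time, which pooled data on different units do not allow. The sketch below is an illustration only, not part of the textbook. It groups the panel observations by person and computes year-to-year changes; the variable names are hypothetical.

```python
from collections import defaultdict

# (person, year, GPA, study hours/week), copied from the panel table above.
panel = [
    (1, 2010, 2.5, 9), (1, 2011, 2.7, 7), (1, 2012, 2.3, 6),
    (2, 2010, 1.9, 8), (2, 2011, 3.1, 12), (2, 2012, 2.4, 6),
    (3, 2010, 2.0, 5), (3, 2011, 3.9, 11), (3, 2012, 1.2, 2),
]

# Collect each person's observations: the cross-sectional unit is the key.
by_person = defaultdict(list)
for person, year, gpa, hours in panel:
    by_person[person].append((year, gpa, hours))

# Within-person changes between consecutive years.
changes = {}
for person, rows in by_person.items():
    rows.sort()  # order each person's observations by year
    changes[person] = [
        (later[0], round(later[1] - earlier[1], 1), later[2] - earlier[2])
        for earlier, later in zip(rows, rows[1:])
    ]

for person, deltas in sorted(changes.items()):
    for year, d_gpa, d_hours in deltas:
        print(f"person {person}, {year}: GPA change {d_gpa:+.1f}, "
              f"hours change {d_hours:+d}")
```

The same calculation is not possible with the pooled table, where nothing links an observation in one year to an observation in another.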