Econometrics - I Manini Ojha JSGP Fall, 2020 Manini Ojha (JSGP) EC-14 Fall, 2020 1 / 237 Introduction - Econometrics The term econometrics is attributed to Frisch; 1969 Nobel prize co-winner Haavelmo (1944) states “The method of econometric research aims, essentially, at a conjunction of economic theory and actual measurements, using the theory and technique of statistical inference as a bridge pier.” Frisch Manini Ojha (JSGP) Haavelmo EC-14 Fall, 2020 2 / 237 “[E]conometrics is by no means the same as economic statistics. Nor is it identical with what we call general economic theory, although a considerable portion of this theory has a definitely quantitative character. Nor should econometrics be taken as synonymous with the application of mathematics to economics. Experience has shown that each of these three view-points, that of statistics, economic theory, and mathematics, is a necessary, but not by itself a sufficient, condition for a real understanding of the quantitative relations in modern economic life. It is the unification of all three that is powerful. And it is this unification that constitutes econometrics.” Frisch, Econometrica, 1933, volume 1, pgs. 1-2. Manini Ojha (JSGP) EC-14 Fall, 2020 3 / 237 Econometric methods Useful in estimating economic relationships between variables Theory usually suggests the direction of change in one variable when another variable changes Simplification of reality Relationships b/w variables in practice are not exact and one must incorporate stochastic elements into the model Incorporating stochastic elements transforms the theory from one of an exact statement to a probabilistic description about expected outcomes Probability or stochastic approach to econometrics dates to Haavelmo (1944); 1989 Nobel Prize winner Manini Ojha (JSGP) EC-14 Fall, 2020 4 / 237 1 2 Used to test simple economic theories (demand, supply, business cycles, individual decisions etc.) Used to evaluate and implement govt and business policy Effectiveness of government programs (job-training, mid-day meals, employment creation, skill-acquisition). Determine effect on hourly wages, school performance, attendance etc. Empirical analysis helps us make empirical judgements as opposed to moral judgements Empirical analysis uses data to test a theory or estimate a relationship Manini Ojha (JSGP) EC-14 Fall, 2020 5 / 237 Data used can be experimental or non-experimental data Experimental data is collected in a lab or field Non-experimental is observational data. 
Researcher is a passive collector of the data.

Steps in empirical/econometric analysis
1. Formulate an interesting question
2. Construct a formal economic model (especially for testing theories, e.g. a demand equation) OR rely on intuition/reasoning to construct it
3. Construct an econometric model
4. State hypotheses of interest in terms of unknown parameters
5. Obtain relevant data for analysis
6. Use econometric methods to estimate the parameters in the model and test the hypotheses

Empirical/econometric analysis
Econometric analysis begins where economic theory concludes. An econometric model starts with a statement of a theoretical proposition, and the goal is to test this proposition through the statistical analysis of data. Econometrics requires: theory =⇒ data =⇒ methods

Example
You are a labor economist examining the effects of a job training program on workers' productivity. Simple intuition/reasoning suggests that factors like education, experience and training would affect productivity. Here wage is the dependent variable (y); educ, exper and training are the control/explanatory/independent variables (the x's). Workers are paid wages according to their productivity.

From economic analysis to econometric analysis
Simple reasoning such as this leads to a model:
wage = f(educ, exper, training)   (1)
wage = β0 + β1 educ + β2 exper + β3 training + u   (2)
where wage = hourly wage, educ = years of formal education, exper = years of workforce experience, training = weeks spent in job training.
A specific functional form must be specified for econometric analysis: from Eqn. (1) to Eqn. (2).

There may be several unobservables affecting the wages of a worker that we cannot measure; we incorporate these stochastic elements/unobserved factors in u. Adding other variables to Eqn. (2), such as family income, parents' education, age, etc., can reduce u but can never eliminate it entirely. Other factors like "innate ability", "quality of education" or "family background" that influence a person's wage remain included in u. Dealing with this error term or disturbance term u is the most important component of econometric analysis.

In Eqn. (2): β0, β1, β2, β3 are the parameters of the econometric model. They describe the directions and strengths of the relationships between wage and the factors used to determine wage/productivity. If the question of interest is how training affects wage, we are particularly interested in β3; β3 is then referred to as the parameter of interest. Hypotheses of interest can then be stated in terms of the unknown parameter: β3 = 0, i.e. hypothesizing that job training has no effect on wages.

Now we proceed to obtain relevant data on education, experience, training and wages, estimate the parameters in the model specified in Eqn. (2), and formally test our hypothesis of interest, for instance as sketched below.
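A minimal sketch of the last two steps, assuming simulated data (no course dataset accompanies the example) and that numpy and statsmodels are available; the coefficient values used to generate the data are arbitrary illustrations, not estimates.

```python
# Simulate data consistent with Eqn. (2), estimate it by OLS and test beta3 = 0.
# All parameter values below are made-up illustrations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
educ = rng.integers(8, 17, n)        # years of formal education
exper = rng.integers(0, 31, n)       # years of workforce experience
training = rng.integers(0, 5, n)     # weeks spent in job training
u = rng.normal(0, 2, n)              # unobserved factors
wage = 1.0 + 0.6 * educ + 0.1 * exper + 0.3 * training + u   # Eqn. (2)

X = sm.add_constant(np.column_stack([educ, exper, training]))
res = sm.OLS(wage, X).fit()
print(res.params)       # estimates of beta0, beta1, beta2, beta3
print(res.pvalues[3])   # p-value for the hypothesis of interest, beta3 = 0
```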
Types of data
- Cross-sectional data
- Time series data
- Pooled cross-sections
- Panel/longitudinal data

Cross-section data
Cross-sectional data consist of data on multiple agents — a sample of individuals, households, cities, states, countries, firms or a variety of other units — at a single point in time t = t0: {xi, yi}, i = 1, ..., N; e.g. x = years of schooling, y = wage.
In a pure cross-section, we ignore any minor differences in the timing of the survey (e.g. households interviewed on different days or weeks). We assume that the data have been obtained by random sampling from the underlying population, e.g. information on wages, education, experience etc. has been obtained from a random draw of 500 people from the working population.

Obs   wage   educ   exper   female   married
1     190    7      2       1        0
2     75     11     7       0        1
3     100    8      41      0        0
...   ...    ...    ...     ...      ...
555   80     16     15      0        1

The table contains data on 555 individuals and their wage, educ (number of years of education), exper (number of years of labour force experience), female (binary indicator for gender) and married (binary indicator for marital status).

The matrix of cross-section data can be represented as (each row is one observation, each column one variable):
x11  x12  x13  ...  x1k
x21  x22  x23  ...  x2k
 .    .    .         .
xN1  xN2  xN3  ...  xNk

Time series data
Time series data are data on a single agent at multiple points in time: {xt, yt}, t = 1, ..., T; e.g. x = real interest rate, y = investment. Examples: stock prices, GDP, inflation, unemployment, homicide rates, etc. Here, past events can influence future events; thus, time series data are rarely ever assumed to be independent across time.

Special attention is given to data frequency here: GDP is recorded quarterly, inflation monthly, stock prices daily, unemployment monthly. Monthly, daily, quarterly, etc. series show strong seasonal patterns that are important considerations in time series data.

The matrix of time series data would look like (rows are time periods t = 1, ..., T; columns are variables):
x11  x12  x13  ...  x1N
x21  x22  x23  ...  x2N
 .    .    .         .
xt1  xt2  xt3  ...  xtN
 .    .    .         .
xT1  xT2  xT3  ...  xTN

The table below shows time series data on minimum wages and GNP (in millions of 1954 dollars) in Puerto Rico [Castillo-Freeman and Freeman (1992)]:

Obs   year   avgmin   prunemp   prgnp
1     1950   0.20     15.4      878.7
2     1951   0.21     16.0      925.0
3     1952   0.23     14.8      1015.9
...   ...    ...      ...       ...
38    1987   3.35     16.8      4496.7

Panel/longitudinal data
Panel data are data on multiple agents at multiple points in time: {xit, yit}, i = 1, ..., N; t = 1, ..., T; e.g. x = income inequality in country i at time t, y = growth rate. A panel consists of a time series for each cross-sectional member in the data set. For example: wage, educ, exper for the same set of individuals over a period of 10 years; investment or financial data for the same set of firms over a period of 5 years. Note: a panel follows the same cross-sectional units over time, i.e. it follows the same set of households over time.
The table below shows a two-year panel on crime and related statistics for 150 cities in the US.

Pooled cross-section
The difference between a pooled cross-section and panel data is that the agents in the cross-sections can differ. Also referred to as repeated cross-sections, e.g. two cross-sectional household surveys in India (NSS) — not the same households. Let us look at data on the effect of property taxes on house prices: a random sample of house prices in 1993 and a new random sample of house prices in 1995. Observations 1 to 250 correspond to houses sold in 1993 and 251 to 520 correspond to houses sold in 1995.

Pooled cross-section versus panel data
Panel data require replication of the same units/agents/individuals/households over time, whereas pooled/repeated cross-sections do not require the same agents. Thus, panels, especially on households, individuals, firms, etc., are more difficult to obtain, e.g.
IHDS for India Observing the same units over time leads to advantages over cross-sectional or pooled cross-sectional data multiple observations on the same units allows us to control for unobserved characteristics of individuals, firms or households etc aids in causal inferences allows us to study the importance of lags in behaviour (many economic policies have an impact only after some time has passed) Manini Ojha (JSGP) EC-14 Fall, 2020 26 / 237 Causality and Ceteris Paribus Most econometric analysis concerns itself with tests of economic theory or evaluation of policy goal is usually to infer the effect of one variable on another does education have a causal effect on wages? finding an association between variables is only suggestive, but establishing a causality is compelling Recall the notion of ceteris paribus: effect of one variable on another holding everything else constant/‘other things equal’ effect of changing the price of a good on its quantity demanded while holding other factors like income, prices of other goods, tastes etc fixed Manini Ojha (JSGP) EC-14 Fall, 2020 27 / 237 Notion of causality is similar critical for policy analysis does one week of job training, while all other factors are held constant, improve worker’s productivity and in-turn wages? if we succeed in holding all other relevant factors fixed, we can find a causal impact of job training on wages This is a difficult task Key question in most econometric/empirical research is thus Have enough other factors been held fixed to make a case for causality? Manini Ojha (JSGP) EC-14 Fall, 2020 28 / 237 Example Measuring returns to education Economists are often interested in answering the question If a person is chosen from a population and given one more year of education, by how much will his/her wage increase? Implicit assumption here is: holding everything else (family background, intelligence etc.) constant Manini Ojha (JSGP) EC-14 Fall, 2020 29 / 237 Problem Experiment: Choose a group of people, randomly assign different amounts of education to them (infeasible!), compare wages If levels of education are assigned independently of other factors, then the expt. ignores that these other factors will yield useful results Non-experimental data Problem in non-experimental data for a large sample is that people choose their levels of education This means education levels is not determined independently of other factors that affect wages eg: people with higher innate ability choose higher levels of education. Higher innate ability leads to higher wages, thus there is a correlation between education levels and a critical factor that affects wages. Another problem is omitted variables Difficult to measure a lot of factors that may affect wages like innate ability. Thus ceteris paribus in true sense is difficult to justify. Manini Ojha (JSGP) EC-14 Fall, 2020 30 / 237 Classical Linear Regression Model (CLRM) Interested in the relationship between two variables, Y and X Scatterplots are typically a good way of examining the distribution (dbn), F(X,Y) Definition: A scatterplot is a plot with each data point appearing once. In econometrics, the variable on the y-axis is the dependent variable and the variable on the x-axis is the independent variable. Manini Ojha (JSGP) EC-14 Fall, 2020 31 / 237 Scatterplots are informative, but messy to present. 
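A small illustration, assuming simulated data and matplotlib/numpy, of a scatterplot together with the 'best' summarizing line; the slope formula (sample covariance over sample variance) is the one derived as Eqn. 9 below, and the data-generating values are arbitrary.

```python
# Scatterplot of y against x with the best-fitting line, on simulated data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)      # assumed relationship plus noise

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = sample cov / sample var
b0 = y.mean() - b1 * x.mean()                         # intercept

plt.scatter(x, y, s=10)                               # each data point appears once
plt.plot(np.sort(x), b0 + b1 * np.sort(x), color="red")  # fitted line
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.show()
```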
An alternative is to summarize the information contained in the scatterplot Can be done by finding the ‘best’ line in terms of fitting the data, and then reporting the intercept and the slope of the line Most common means of accomplishing this is known as regression analysis Manini Ojha (JSGP) EC-14 Fall, 2020 32 / 237 Terminology: simple regression Y Dependent variable Explained variable Response variable Predicted variable Regressand Manini Ojha (JSGP) X Independent variable Explanatory variable Control variable Predictor variable Regressor Covariate EC-14 Fall, 2020 33 / 237 An Aside: The term regression was originated by Galton (1886) Sir Francis Galton also introduced the concepts of correlation, standard deviation and bivariate normal dbn Manini Ojha (JSGP) EC-14 Fall, 2020 34 / 237 Simple Linear Regression We are interested in the relationship between x and y Explaining y in terms of x or Studying how y varies with changes in x Questions: Since there is never an exact relationship between two variables, how do we allow for other factors to affect y ? What is the functional relationship between y and x? How can we be sure we are capturing a ceteris paribus relationship between y and x (if that is the goal)? Manini Ojha (JSGP) EC-14 Fall, 2020 35 / 237 Resolve the ambiguities by writing down an equation relating y to x (3) y = β0 + β1 x + u This equation which is assumed to hold in the population of interest and defines the simple linear regression (SLR) model u is the error/disturbance term reflecting factors other than x that affect y treated effectively as unobserved SLR model is a population model When it comes to estimating β0 and β1 , we use a random sample of data and we must restrict how u and x are related Manini Ojha (JSGP) EC-14 Fall, 2020 36 / 237 Example of simple linear regression The following is a model relating a person’s wage to observed education level and other unobserved factors wage = β0 + β1 educ + u wage is measured in dollars per hour educ is measured in years of education β1 measures the change in hourly wage given another year of education holding all other factors fixed Manini Ojha (JSGP) EC-14 Fall, 2020 37 / 237 Linearity of Eqn.3 =⇒ each unit increase in x has the same effect on y regardless of where x starts from may not be realistic (allow for increasing returns ... later) Does the model in Eqn. 3 really lead to ceteris paribus conclusions about how x affects y ? 
Need to make certain assumptions to estimate a ceteris paribus effect Manini Ojha (JSGP) EC-14 Fall, 2020 38 / 237 Assumptions of CLRM Mean of the unobserved factors is zero in the population E (u) = 0 For wage-education example, this means we are assuming that things such as average ability are zero in the population of all working people Manini Ojha (JSGP) EC-14 Fall, 2020 39 / 237 u and x are uncorrelated E (u|x) = E (u) Average value of u does not depend on the value of x and is equal to average of u over the entire population Or u is mean independent of x For wage-education example, this implies that the average ability for a group of people from the population with 8 years of education is the same as the average ability for a group of people from the population with 16 years of education Manini Ojha (JSGP) EC-14 Fall, 2020 40 / 237 Combining both E (u|x) = 0 Called zero conditional mean assumption Implication: y = β0 + β1 x + u E (y |x) = β0 + β1 x + E (u|x) E (y |x) = β0 + β1 x (PRF ) which shows that E (y |x) is a linear function of x =⇒ 1 unit increase in x changes the expected value of y by the amount β1 Tells us about how the average value of y changes with x , not how y changes with x for all units of the population! Manini Ojha (JSGP) EC-14 Fall, 2020 41 / 237 Example Suppose that x is high school GPA and y is college GPA We are given that E (colGPA|hsGPA) = 1.5 + 0.5hsGPA Suppose hsGPA = 3.6, then E (colGPA|hsGPA) = 1.5 + 0.5(3.6) = 3.3 =⇒ Avg. collGPA for all high school graduates who attend college with a high school GPA is 3.3 Does not mean every student with hsGPA = 3.6 will have college GPA of 3.3 Some will have 3.3, some will have more, some less On average, colGPA = 3.3 Manini Ojha (JSGP) EC-14 Fall, 2020 42 / 237 Going back to Eqn.3 (y = β0 + β1 x + u) with this information, we can divide Eqn.3 into 2 parts y = E (y |x) + u (4) = β0 + β1 x + u Systematic part: β0 + β1 x represents E (y |x) which is the part of y explained by x Stochastic/unsystematic part: u part of y not explained by x Manini Ojha (JSGP) EC-14 Fall, 2020 43 / 237 From PRF to SRF Analogous to population regression function (PRF) is the concept of sample regression function (SRF) which represents the sample regression line Let us take a random sample of size n from the population Let the random sample be {(xi , yi ) : i = 1, ......, n}, then we can express the sample counterpart of E (y |x) in Eqn. 
4 as:
ŷi = β̂0 + β̂1 xi   (5)
ŷi is the estimator of E(y|x); β̂0 and β̂1 are the estimators of β0 and β1 respectively. Similarly, ûi can be regarded as an estimate of ui.

Our primary objective in regression analysis is to estimate the SRF, because more often than not the analysis is based on a sample from the population; because of sampling fluctuations, this estimate is at best an approximate one.

For x = xi, in terms of the sample regression, the observed yi can be expressed as
yi = ŷi + ûi   (6)
In terms of the population, it can be expressed as yi = E(y|xi) + ui.

From (6):
ûi = yi − ŷi = yi − β̂0 − β̂1 xi   (7)
The residual ûi is thus the difference between the actual and estimated y values. To achieve our goal of choosing the best estimates, let us say we choose the SRF such that the sum of the residuals, Σi ûi = Σi (yi − ŷi), is as small as possible.

Intuitively appealing, but not a very good criterion: if we minimize Σi ûi, we give equal importance to all residuals no matter how close or widely scattered the individual observations are from the SRF, and it is quite possible for the algebraic sum to be zero even when the fit is poor.

CLRM: Ordinary Least Squares Estimates (OLS)
OLS refers to a choice of the parameters β0 and β1 in Eqn. 3.
Question: How does one estimate the parameters β0 and β1?
Answer: Minimize the distance b/w the data points and the line.

OLS
We use the least-squares criterion / ordinary least squares (OLS) and minimize the sum of squared residuals:
Σi ûi² = Σi (yi − β̂0 − β̂1 xi)²
Define S = Σi ûi². OLS implies minimizing S over (β̂0, β̂1), i.e. ∂S/∂β̂0 = ∂S/∂β̂1 = 0. Do it!

OLS: β̂0
∂S/∂β̂0 = ∂/∂β̂0 [ Σi (yi − β̂0 − β̂1 xi)² ] = −2 Σi (yi − β̂0 − β̂1 xi) = 0
⇒ Σi yi − n β̂0* − β̂1 Σi xi = 0
⇒ n ȳ − n β̂0* − β̂1 n x̄ = 0
⇒ β̂0* = ȳ − β̂1 x̄   (8)
where ȳ is the sample average of the yi and likewise x̄ of the xi.

OLS: β̂1
Substituting β̂0* = ȳ − β̂1 x̄,
∂S/∂β̂1 = ∂/∂β̂1 [ Σi (yi − β̂0* − β̂1 xi)² ] = ∂/∂β̂1 [ Σi ((yi − ȳ) − β̂1 (xi − x̄))² ]
⇒ −2 Σi [(yi − ȳ) − β̂1* (xi − x̄)] (xi − x̄) = 0
known as the least-squares normal equation, which gives
β̂1* = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²   (9)

Eqn. 9 gives us the slope parameter, and using the slope estimate it is straightforward to obtain the intercept estimate. Note that
β̂1* = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)² = (sample covariance between x and y) / (sample variance of x) = cov(xi, yi) / var(xi)
∴ If x and y are positively correlated in the sample, β̂1 > 0; if x and y are negatively correlated, then β̂1 < 0.

Thus, Eqn. 8 and Eqn.
9 give us the OLS estimates The name OLS (ordinary least squares) comes from the fact that we obtain these estimates by minimizing the sum of squared residuals Manini Ojha (JSGP) EC-14 Fall, 2020 55 / 237 Once we have determined the OLS intercept and slope estimates, we form the OLS regression line (10) ŷ = β̂0 + β̂1 x β̂0 is the predicted value of y when x = 0 (although in some cases, it will not make sense to set x = 0) In most cases, we can write the slope estimate as β̂1 = 4ŷ 4x This is usually of primary interest and tells us how ŷ changes when x increases by a unit Manini Ojha (JSGP) EC-14 Fall, 2020 56 / 237 Terminology - regression When we run an OLS regression, without writing out the equation, we say we run the regression of y on x regress y on x Manini Ojha (JSGP) EC-14 Fall, 2020 57 / 237 Properties of OLS statistics Sum and thus the sample average of OLS residuals is zero n X ûi = 0 (11) i=1 Sample covariance between between the regressors and the OLS residuals is zero. n X xi ûi = 0 (12) i=1 The point (x̄i , ȳi ) is always on the OLS regression line. ȳ = β̂0 + β̂1 x̄ Manini Ojha (JSGP) EC-14 Fall, 2020 58 / 237 OLS Implication OLS formula for β̂1∗ uses the optimal choice of β̂0∗ β̂0∗ = ȳ − β̂1∗ x̄ If xi = x̄ ∀i, implying that var (x) = 0, then β̂1∗ is undefined. Thus, var (x) 6= 0 is an identification condition Sample average of the fitted values ŷi is the same as the sample average of the true yi (follows from Eqn. 6) ȳ = ŷ¯i Manini Ojha (JSGP) EC-14 Fall, 2020 59 / 237 Terminology Total sum of squares: measure of total sample variance in yi SST = n X (yi − ȳ )2 i=1 Explained sum of squares: measure of sample variance in ŷi (given that ȳ = ŷ¯i ) n X SSE = (ŷi − ȳ )2 i=1 Residuals sum of squares: measure of sample variance in ûi SSR = n X ûi2 i=1 Can write SST = SSE + SSR Manini Ojha (JSGP) EC-14 Fall, 2020 60 / 237 Proof: decompose yi for each observation into 2 components (explained and unexplained) yi − ȳ = (yi − ŷi ) + (ŷi − ȳ ) Unexplained part is the residual ûi = yi − ŷi With some algebra X X (yi − ȳ )2 = [(yi − ŷi ) + (ŷi − ȳ )]2 X = [ûi + (ŷi − ȳ )]2 X X X = ûi2 + (ŷi − ȳ )2 + 2 ûi (ŷi − ȳ ) X = SSR + SSE + 2 ûi (ŷi − ȳ ) (14) SST = SSR + SSE X ∵ ûi (ŷi − ȳ ) = 0 Manini Ojha (JSGP) EC-14 (13) Fall, 2020 61 / 237 This also implies P ûi2 (ŷi − ȳ )2 P 1= P + (yi − ȳ )2 (yi − ȳ )2 SSE SSR + 1= SST SST P Coefficient of determination, R 2 of the regression is defined as R2 = Manini Ojha (JSGP) SSR SSE =1− SST SST EC-14 (15) Fall, 2020 62 / 237 This is the fraction of total variation that is explained by the model i.e. fraction of variation in y that is explained by x R 2 is always between 0 and 1 because SSE can be no greater than SST R 2 = 1 means all data points lie on the same line and OLS provides a perfect fit to the data R 2 = 0 means a poor fit of the line Interpreted usually by multiplying by 100 to change into percentage terms Manini Ojha (JSGP) EC-14 Fall, 2020 63 / 237 Examples CEO salary and return on equity \ = 963.191 + 18.501roe salary n = 209, R 2 = 0.0132 Here, the regression only explains 1.3% of the total variation in CEO’s salary Voting outcomes and campaign expenditures \ = 26.81 + 0.464shareA voteA n = 173, R 2 = 0.856 Here, the regression explains 85% of the total variation in election outcomes Note: High R 2 does not necessarily mean regression has a causal interpretation. 
Low R 2 is infact quite common in social sciences, especially for cross-sectional analysis Manini Ojha (JSGP) EC-14 Fall, 2020 64 / 237 Functional Form Common specifications of the functional forms to incorporate non-linearities look like: y = β0 + β1 x + u 4y = β1 4x Log-level specification log (y ) = β0 + β1 x + u %4y = (100β1 )4x Log-log specification log (y ) = β0 + β1 log (x) + u %4y = β1 %4x Level specification Manini Ojha (JSGP) EC-14 Fall, 2020 65 / 237 Log - level specification Regression of log wages on years of education log (wage) = β0 + β1 educ + u Here, the interpretation of the regression coefficient is ∂wage ∂log (wage) 1 ∂wage wage β1 = = . = ∂educ wage ∂educ ∂educ i.e. %4in wages from 1 unit (year) increase in education (semi-elasticity) Manini Ojha (JSGP) EC-14 Fall, 2020 66 / 237 Log-log specification Regression of CEO salary on firm sales log (salary ) = β0 + β1 log (sales) + u Here, the interpretation of regression coefficient is β1 = ∂log (salary ) = ∂log (sales) ∂salary salary ∂sales sales i.e. %4in salary from a 1% increase in sales (constant-elasticity) Manini Ojha (JSGP) EC-14 Fall, 2020 67 / 237 Gauss-Markov assumptions for SLR (SLR.1) Linearity in parameters: in the population, y is related to x and u as y = β0 + β1 x + u where β0 , β1 are the population intercept and population slope parameters respectively (SLR.2) Random sampling: we have a random sample of size n,{(xi , yi ) : i = 1, 2, ..., n} following the population model above In terms of random sample, above equation can be written as yi = β0 + β1 xi + ui Manini Ojha (JSGP) EC-14 Fall, 2020 68 / 237 (SLR. 3) Sample variation in the explanatory variable: the sample outcomes on x, namely {xi , i = 1, 2, ...., n}, are not all the same value This is a weak assumption - only says that if x varies in the population, then x varies in the random sample as well unless population variation is minimal or sample is small. (SLR. 4) Zero conditional mean:E (u|x) = 0. This assumption coupled with random sampling implies E (ui |xi ) = 0 for all i i.e. the value of the explanatory variable must contain no information about the mean of the unobserved factors Manini Ojha (JSGP) EC-14 Fall, 2020 69 / 237 Note: SLR.4 in conjunction with SLR.3 allows for a technical simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional on the values of xi in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variables x 0 s is the same as treating the xi as fixed in repeated samples. Manini Ojha (JSGP) EC-14 Fall, 2020 70 / 237 Unbiasedness of OLS estimators What qualities do OLS estimators possess? Recall that we briefly touched upon this last semester in Stats- I Using SLR. 1 - SLR. 
4, we can show that the OLS estimators are unbiased:
E(β̂0) = β0 ;  E(β̂1) = β1

Proof
β̂1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)² = Σi yi (xi − x̄) / Σi (xi − x̄)²   [since Σi (xi − x̄) ȳ = 0]
   = Σi (xi − x̄)(β0 + β1 xi + ui) / Σi (xi − x̄)²

The numerator can be written as
Σi (xi − x̄) β0 + Σi (xi − x̄) β1 xi + Σi (xi − x̄) ui
= β0 Σi (xi − x̄) + β1 Σi (xi − x̄) xi + Σi (xi − x̄) ui
= β1 Σi (xi − x̄)² + Σi (xi − x̄) ui

Combining the numerator and denominator:
β̂1 = β1 + Σi (xi − x̄) ui / Σi (xi − x̄)²
∴ E(β̂1) = β1 + E[ Σi (xi − x̄) ui / Σi (xi − x̄)² ] = β1 + Σi (xi − x̄) E(ui) / Σi (xi − x̄)² = β1   (16)
(conditioning on the sample values of x, as discussed above, and using SLR.4 so that E(ui) = 0)

The proof for β0 is now straightforward:
E(β̂0) = E[ȳ − β̂1 x̄] = E[β0 + β1 x̄ + ū − β̂1 x̄] = β0 + β1 x̄ − x̄ E[β̂1] = β0 + β1 x̄ − x̄ β1 = β0   (17)

A small simulation (below) makes this unbiasedness property concrete.
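A minimal Monte Carlo sketch, assuming numpy and a made-up population satisfying SLR.1–SLR.4: across many repeated samples the OLS slope estimates average out to the true β1.

```python
# Draw many samples from a population with E(u|x) = 0 and check that the
# OLS slope estimates are centered on beta1. Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 3.0, 200, 5000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, n)
    u = rng.normal(0, sigma, n)                     # E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u
    xd = x - x.mean()
    slopes[r] = (xd @ (y - y.mean())) / (xd @ xd)   # Eqn. (9)

print(slopes.mean())   # close to beta1 = 2: the estimator is centered on the truth
print(slopes.std())    # the sampling spread, taken up next via Var(beta1_hat)
```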
Gauss-Markov assumptions for SLR
So we now know that the sampling distribution of β̂1 is centered around β1, i.e. β̂1 is unbiased. What can we say about how far we can expect β̂1 to be from β1 on average? This helps us choose the best estimator among a broad class of unbiased estimators. We will work with the measure of spread of the estimators: the variance.

(SLR.5) Homoskedasticity: the variance of the unobservable u, conditional on x, is constant. Also known as the constant variance assumption:
var(u|x) = σ²   (18)
i.e. the error has the same variance given any value of the explanatory variable. σ² is often called the error variance or disturbance variance, and the standard deviation of the error is √σ² = σ. For a random sample we write var(ui) = σ² ∀i. If the error variance depends on x, then we say the error term exhibits heteroskedasticity: Var(u|x) = σx². Read JW for more on homoskedasticity: Section 2-5b, JW 6th ed. HW!

The homoskedasticity assumption plays no role in showing that the estimators β̂0 and β̂1 are unbiased; this assumption tells us about the efficiency of the estimators.
Var(u|x) = E(u²|x) − [E(u|x)]² = E(u²|x) = σ²   (∵ E(u|x) = 0)
A larger σ means that the distribution of the unobservables affecting y is more spread out. Also, given homoskedasticity: V(y|x) = V(u|x) = σ² (convince yourself of this!).

Sampling variance of OLS estimators
The sampling variance of β̂1 is
Var(β̂1) = σ² / Σi (xi − x̄)² = σ² / SSTx   (19)
And
Var(β̂0) = σ² n⁻¹ Σi xi² / Σi (xi − x̄)² = σ² Σi xi² / (n · SSTx)   (20)

Proof of Eqn. 19. Recall that Var(R) = E[R²] − (E[R])² = E[(R − E[R])²], so Var(β̂1) = E[(β̂1 − β1)²] since β̂1 is unbiased. Start with (β̂1 − β1)²:
β̂1 = Σi yi (xi − x̄) / Σi (xi − x̄)² = Σi (xi − x̄)(β0 + β1 xi + ui) / Σi (xi − x̄)²
   = [β1 Σi (xi − x̄)² + Σi (xi − x̄) ui] / Σi (xi − x̄)²
   = β1 + Σi (xi − x̄) ui / Σi (xi − x̄)²
⇒ (β̂1 − β1)² = [ Σi (xi − x̄) ui ]² / [ Σi (xi − x̄)² ]²
Now take expectations, conditioning on the x's. The cross terms vanish because the errors are uncorrelated across observations and have mean zero, so
Var(β̂1) = E[(β̂1 − β1)²] = Σi (xi − x̄)² E(ui²) / [ Σi (xi − x̄)² ]² = σ² Σi (xi − x̄)² / [ Σi (xi − x̄)² ]²
∴ Var(β̂1) = σ² / Σi (xi − x̄)² = σ² / SSTx

Implications of the variance of the estimators
If σ² ↑ ⇒ Var(β̂1) ↑ : the larger the error variance, the larger Var(β̂1); more variation in the unobservables affecting y makes it more difficult to precisely estimate β1.
If var(x) ↓ ⇒ Var(β̂1) ↑ : we can learn more about the relationship between y and x if x is more dispersed and there is less 'noise' in the relationship.

Theorem - Gauss-Markov
In the class of linear, unbiased estimators of β, β̂OLS has the smallest variance. In other words, if there exists an alternative linear, unbiased estimator, say β̃, then Var(β̂OLS) ≤ Var(β̃). ∴ BLUE

Errors and residuals
Emphasizing the difference between errors and residuals, as it is crucial for estimating σ². The population model in terms of a random sample is
yi = β0 + β1 xi + ui
where ui is the error for observation i. Expressing yi in terms of the fitted value and residual:
yi = β̂0 + β̂1 xi + ûi
Comparing these 2 equations: errors show up in the equation containing the population parameters; residuals show up in the estimated equation; errors are never observed, while residuals are computed from the data.

We do not observe the errors ui, but we have estimates of them, namely the residuals ûi. The unbiased estimator of σ² is denoted σ̂², with E(σ̂²) = σ², where
σ̂² = SSR / (n − 2) = (1/(n − 2)) Σi ûi²   (21)
and the degrees of freedom are n − 2. Degrees of freedom: the total number of observations in the sample less the number of independent restrictions put on them.

Standard error of regression (SER) / root mean squared error: σ̂ = √σ̂². σ̂ is used to estimate the standard deviations of β̂0 and β̂1:
Var(β̂1) = σ² / SSTx , so sd(β̂1) = σ / √SSTx   (22)
∴ se(β̂1) = σ̂ / √SSTx   (23)
Eqn. 23 is the standard error of β̂1. It gives us an idea of how precise β̂1 is.

More assumptions
(SLR.6) Each error is normally distributed: ui | xi ∼ N(0, σ²). This assumption is needed when deriving the small-sample sampling distributions of the OLS estimators β̂0 and β̂1 and of the t-statistics used in hypothesis tests of β0 and β1.
The random errors ui and uj for two different observations are uncorrelated with each other, that is E[ui uj] = 0 for all i ≠ j. Also called spherical errors/disturbances.

Regression through the origin
Special case: a model without an intercept,
ỹi = β̃1 xi   (24)
Eqn. 24 is also called regression through the origin because the line passes through the point x = 0, ỹ = 0. Obtain the slope estimate using OLS:
min over β̃1 of Σi (yi − β̃1 xi)² ⇒ ∂/∂β̃1 [ Σi (yi − β̃1 xi)² ] = −2 Σi xi (yi − β̃1 xi) = 0
⇒ Σi xi (yi − β̃1 xi) = 0 ⇒ β̃1 = Σi xi yi / Σi xi²
Comparing β̂1 and β̃1: the two estimates are the same when x̄ = 0.
Prove ! [Hint: substitute x = 0 in Eqn. 9] Manini Ojha (JSGP) EC-14 Fall, 2020 88 / 237 Covariance and correlation Recall , sign of slope estimate P (xi − x̄)(yi − ȳ ) Cov (xi , yi ) P β̂1 = = 2 (xi − x̄) Var (xi ) Furthermore, ρxy = correlation coefficient between x and y Cov (x, y ) p =p Var (x) Var (y ) p Var (x) = β̂1 p Var (y ) p Var (y ) ∴ β̂1 = ρxy p Var (x) sgn{β̂1 } = sgn{ρxy } β̂1 ∝ ρxy Manini Ojha (JSGP) EC-14 Fall, 2020 89 / 237 Coefficient of determination and correlation coefficient In CLRM, R 2 = ρ2xy Proof: X X SSE = (ŷi − ȳ )2 = (β̂0 + β̂1 xi − (β̂0 + β̂1 x̄))2 X = β̂12 (xi − x̄)2 P β̂12 (xi − x̄)2 SSE 2 = P R = SST (yi − ȳ )2 2 Cov (x, y ) Var (x) = Var (x) Var (y ) 2 Cov (x, y ) = Var (x)Var (y ) )2 ( Cov (x, y ) p = p Var (x) Var (y ) = ρ2xy Manini Ojha (JSGP) EC-14 Fall, 2020 90 / 237 Wrap-up OLS Estimation Given a random sample {yi , xi }ni=1 , OLS minimizes the sum of squared residuals argmin β̂0 ,β̂1 n X i=1 ûi2 = argmin β̂0 ,β̂1 n X (yi − β̂0 − β̂1 xi )2 i=1 Solution implies Pn Pn (xi − x̄)(yi − ȳ ) yi (xi − x̄) Cov (xi , yi ) OLS i=1 P P β̂1 = = = ni=1 n 2 (x − x̄) x (x − x̄) Var (xi ) i=1 i i=1 I i β̂0OLS = ȳ − β̂1OLS x̄ Manini Ojha (JSGP) EC-14 Fall, 2020 91 / 237 Two goals: 1 Judge the quality of OLS estimates under CLRM assumptions (SLR.1 to SLR.5) 1 2 2 SLR.1 to SLR. 4 =⇒ unbiasedness SLR.1 to SLR. 5 =⇒ BLUE Alter OLS estimates when CLRM assumptions are violated Manini Ojha (JSGP) EC-14 Fall, 2020 92 / 237 Multiple Linear Regression Model Multiple regression analysis is more amenable to ceteris paribus analysis allows us to explicitly control for many other factors that simultaneously affect dependent variable If we add more factors to our model that are useful for explaining y =⇒ more of the variation in y can be explained Thus, MLR used to build better models for predicting y allows for much more flexibility Manini Ojha (JSGP) EC-14 Fall, 2020 93 / 237 Example 1: MLR Simple variation of our wage education example: wage = β0 + β1 educ + β2 exper + u where educ : no.of years of education exper : no.of years of experience in the labor market Compared with a SLR relating wage to educ, here in MLR we effectively take exper out of the error term put it explicitly in the equation Manini Ojha (JSGP) EC-14 Fall, 2020 94 / 237 Similar to SLR, in MLR we make assumptions about u Since, exper has been accounted for explicitly can measure the effect of educ on wages holding experience fixed or can measure the effect of exper on wages holding education fixed In SLR, because we didn’t take into account exper explicitly we assumed that experience (which was part of the error term) is uncorrelated with educ Manini Ojha (JSGP) EC-14 Fall, 2020 95 / 237 Example 2: MLR avgscore = β0 + β1 expend + β2 avginc + u If the question is how per-student spending (expend) affects average standardized test scores, then we are interested in β1 i.e. 
the ceteris paribus effect of expend on avgscore By including avginc, we are controlling for the effect of average family income on average score Likely important as average family income tends to be correlated with per-student expenditure Manini Ojha (JSGP) EC-14 Fall, 2020 96 / 237 Example 3: MLR MLR also useful in generalizing functional forms Suppose family consumption is a quadratic function of family income then cons = β0 + β1 inc + β2 inc 2 + u Here u contains all other factors affecting consumption apart from income Model depends on only one observed factor (even though more than one explanatory variable) Here β1 does not measure the effect of income on consumption holding inc 2 fixed as if inc changes, inc 2 also changes! Manini Ojha (JSGP) EC-14 Fall, 2020 97 / 237 Here, change in consumption w.r.t. to change in income is given by the marginal propensity to consume ∂cons ≈ β1 + 2β2 inc ∂inc Marginal effect of income on consumption depends on β1 and β2 and inc (level of income) more on this later... Manini Ojha (JSGP) EC-14 Fall, 2020 98 / 237 Model with 2 - independent variables Model y = β0 + β1 x1 + β2 x2 + u β0 is the intercept β1 measures the change in y w.r.t change in x1 , holding other factors fixed β2 measures the change in y w.r.t change in x2 , holding other factors fixed Manini Ojha (JSGP) EC-14 Fall, 2020 99 / 237 Model with k - independent variables General form y = β0 + β1 x1 + β2 x2 + .... + βk xk + u β0 is the intercept β1 measures the change in y w.r.t change in x1 , holding other factors fixed β2 measures the change in y w.r.t change in x2 , holding other factors fixed and so on Here, there are k-independent variables and an intercept thus, equation consists of k + 1 unknown population parameters terminology: 1- intercept parameter and k - slope parameters Note: no matter how many variables we include in our model, there will always be factors we cannot include =⇒ collectively contained in u Manini Ojha (JSGP) EC-14 Fall, 2020 100 / 237 Example 4: MLR Regressing CEO’s salary (salary ) on firm sales (sales) and CEO tenure (ceoten) log (salary ) = β0 + β1 log (sales) + β2 ceoten + β3 ceoten2 + u k =3 x1 = log (sales); x2 = ceoten; x3 = ceoten2 β1 : interpreted as the elasticity of salary w.r.t sales If β3 = 0, then β2 is the effect of one more year of ceoten on salary if β3 6= 0, then the effect of ceoten on salary is ∂log (salary ) = β2 + 2β3 ceoten ∂ceoten also called the marginal effect (more later...) Manini Ojha (JSGP) EC-14 Fall, 2020 101 / 237 Regressing CEO’s salary (salary ) on firm sales (sales) and CEO tenure (ceoten) log (salary ) = β0 + β1 log (sales) + β2 ceoten + β3 ceoten2 + u Reminder: This equation is an example of multiple linear regression linear in parameters βj non-linear in relationship between y and x 0 s At minimum, what MLR requires is that all factors in unobserved error term be uncorrelated with explanatory variables Manini Ojha (JSGP) EC-14 Fall, 2020 102 / 237 OLS estimates - 2 independent vars Model with 2 independent variables: estimated equation is given by ŷ = β̂0 + β̂1 x1 + β̂2 x2 β̂0 is the estimate of β0 β̂1 is the estimate of β1 β̂2 is the estimate of β2 OLS (like before) chooses estimates s.t. 
SSR is minimized:
min over (β̂0, β̂1, β̂2) of Σi ûi² = Σi (yi − ŷi)² = Σi (yi − β̂0 − β̂1 xi1 − β̂2 xi2)²

Indexing
It is important to master the meaning of the indexing of the independent/explanatory variables. Each independent variable carries 2 subscripts: the first subscript i refers to the observation number (i = 1, ..., n); the second subscript j distinguishes between the explanatory/independent variables (j = 1, ..., k).
Example (Wage-Educ-Exper): xi1 = educi is education for person i in the sample; xi2 = experi is experience for person i in the sample. The SSR is Σi (wagei − β̂0 − β̂1 educi − β̂2 experi)².
Thus, xij is the i-th observation on the j-th independent variable.

OLS estimates - k independent vars
General form: OLS minimizes the SSR
Σi (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − .... − β̂k xik)²
The k + 1 estimates are chosen to minimize the SSR; this is solved by multi-variable calculus. You get k + 1 equations in k + 1 unknowns β̂0, β̂1, ..., β̂k:
Σi (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − .... − β̂k xik) = 0
Σi (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − .... − β̂k xik) xi1 = 0
Σi (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − .... − β̂k xik) xi2 = 0
.
.
Σi (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − .... − β̂k xik) xik = 0
Called the OLS FOCs (a small numerical sketch of solving this system follows).
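A minimal sketch, assuming numpy and simulated data with arbitrary coefficients: the k + 1 FOCs are the normal equations X'X b = X'y, where X stacks a column of ones and the k regressors, and the residuals are orthogonal to every column of X.

```python
# Solve the OLS first-order conditions directly as the normal equations.
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 2.0 * x2 + rng.normal(0, 1, n)   # made-up population values

X = np.column_stack([np.ones(n), x1, x2])   # n x (k+1) matrix: intercept + regressors
b = np.linalg.solve(X.T @ X, X.T @ y)       # solves the k+1 FOCs exactly
print(b)                                     # [beta0_hat, beta1_hat, beta2_hat]

resid = y - X @ b
print(X.T @ resid)                           # ~ zeros: residuals orthogonal to each column
```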
Terminology
OLS regression line:
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + .... + β̂k xk   (25)
We say "we ran an OLS regression of y on x1, x2, x3, ..., xk" or "we regressed y on x1, x2, x3, ..., xk".

Interpreting the OLS regression function
Back to the model with 2 independent variables:
ŷ = β̂0 + β̂1 x1 + β̂2 x2   (26)
The intercept β̂0 is the predicted value of y when x1 = 0 and x2 = 0. Sometimes this is an interesting case; other times it will not make sense (later..). The estimates β̂1 and β̂2 have a partial effect interpretation.

From Eqn. 26: Δŷ = β̂1 Δx1 + β̂2 Δx2.
When x2 is held constant, Δx2 = 0, so Δŷ = β̂1 Δx1 and β̂1 = Δŷ/Δx1.
When x1 is held constant, Δx1 = 0, so Δŷ = β̂2 Δx2 and β̂2 = Δŷ/Δx2.

Example 5: MLR — Determinants of College GPA
colGPA_hat = 1.29 + 0.453 hsGPA + 0.0094 ACT,  n = 141
Slope coefficient on hsGPA: comment on the magnitude; direction: a positive partial relationship between colGPA and hsGPA. Interpretation: holding ACT fixed, one more point on hsGPA leads to almost half a point rise in colGPA. If we choose 2 students A and B with exactly the same ACT scores, but A has a 1-point higher hsGPA than B, then we predict A's colGPA to be 0.453 points higher than B's.
Slope coefficient on ACT: a positive partial relationship between colGPA and ACT; holding hsGPA fixed, 1 more point on ACT increases colGPA by less than 1/10th of a point.

Example 6: MLR
Model: log(wage)_hat = 0.284 + 0.092 educ + 0.0041 exper + 0.022 tenure
What is the estimated effect on wages of an individual staying one more year at the same firm?
Δlog(wage)_hat = 0.0041 Δexper + 0.022 Δtenure
Both experience and tenure increase by one year, so the resulting effect (holding education fixed) is Δlog(wage)_hat = 0.0041 + 0.022 = 0.0261, i.e. about a 2.6% increase in wages.

OLS fitted values and residuals
Once we obtain the OLS regression line ŷ = β̂0 + β̂1 x1 + .... + β̂k xk, we can obtain the fitted/predicted value for each observation, say for observation i:
ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + .... + β̂k xik
It is the predicted value we obtain by plugging the values of x1 to xk for observation i into the OLS regression line. Normally yi will not equal the predicted value ŷi: OLS minimizes the average squared prediction error, but says nothing about the prediction error for any particular observation. As before, the residual for each observation can be computed: ûi = yi − ŷi.
If ûi > 0 ⇒ ŷi < yi : yi is underpredicted. If ûi < 0 ⇒ ŷi > yi : yi is overpredicted.

Properties of OLS residuals
As before:
- The sample average of the residuals is zero, hence ȳ = ŷ̄.
- The sample covariance between each independent variable and the OLS residuals is zero ⇒ the sample covariance between the OLS fitted values and the OLS residuals is zero.
- The point (x̄1, x̄2, x̄3, ....., x̄k, ȳ) is always on the OLS regression line: ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + ..... + β̂k x̄k.

"Partialling out" interpretation of the slope parameter
In a model with k = 2: ŷ = β̂0 + β̂1 x1 + β̂2 x2, the slope parameter is
β̂1 = Σi r̂i1 yi / Σi r̂i1²   (27)
where the r̂i1 are the OLS residuals from a simple regression of x1 on x2 using the same sample (no proof required!). We follow a two-step process to get to the slope parameter:
1. First regress x1 on x2 and obtain the residuals r̂1 (y has no role here).
2. Then regress y on r̂1 to obtain β̂1.

"Partialling out" interpretation of MLR
The way to interpret this is: the residuals r̂i1 are the part of xi1 that is uncorrelated with xi2, i.e. xi1 after the effects of xi2 have been partialled out/netted out. β̂1 therefore measures the sample relationship between y and x1 after x2 has been partialled out.
In the model with k explanatory variables, β̂1 can still be written as in Eqn. 27, but the r̂i1 will be the residuals from the regression of x1 on x2, ..., xk. So it measures the effect of x1 on y after partialling out the effects of x2, ....., xk — verified numerically in the sketch below.
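A numerical check of the partialling-out result in Eqn. 27, assuming numpy and simulated data with arbitrary coefficients: the coefficient on x1 from the multiple regression coincides with the slope from regressing y on the residualized x1.

```python
# Verify Eqn. 27: beta1_hat from the multiple regression equals the slope from
# regressing y on the residuals of x1 after x2 has been partialled out.
import numpy as np

rng = np.random.default_rng(7)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)             # x1 and x2 deliberately correlated
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Multiple regression of y on (1, x1, x2)
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Step 1: regress x1 on x2 (with intercept) and keep the residuals r1_hat
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x1)
# Step 2: regress y on r1_hat; slope = sum(r1*y)/sum(r1^2) as in Eqn. 27
b1_partial = (r1 @ y) / (r1 @ r1)

print(beta_hat[1], b1_partial)                  # the two numbers coincide
```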
Comparison of SLR and MLR slope parameters
If SLR: ỹ = β̃0 + β̃1 x1 and MLR: ŷ = β̂0 + β̂1 x1 + β̂2 x2, then
β̃1 = β̂1 + β̂2 δ̃1
where δ̃1 is the slope parameter from the simple regression of x2 on x1: x̃2 = δ̃0 + δ̃1 x1. Thus, in general, β̃1 ≠ β̂1.

Special case (2 independent vars): β̃1 = β̂1 only if
1. the partial effect of x2 on ŷ is zero in the sample, i.e. β̂2 = 0, or
2. x1 and x2 are uncorrelated in the sample, i.e. δ̃1 = 0.
Special case (k independent vars): β̃1 = β̂1 only if
1. the OLS coefficients on x2 through xk are all zero, or
2. x1 is uncorrelated with each of x2, ....., xk.
Neither is likely in practice, but it is possible that the correlations are small, in which case the slope parameters will be similar.

Example
An econometrician wrongly regresses the height of individuals on their nutrition as follows:
h̃t = β̃0 + β̃1 nutrn
Now, say she realizes the correct model and regresses height on nutrition as well as HH income:
ĥt = β̂0 + β̂1 nutrn + β̂2 inc
Nutrition is likely positively correlated with income, s.t.
ĩnc = δ̃0 + δ̃1 nutrn
The effect of an individual's nutrition on height in the SLR is then made up of the partial effect of own nutrition on height (after partialling out the effect of income on height) plus the effect of income on height times the effect of nutrition on income:
β̃1 = β̂1 + β̂2 δ̃1

Goodness-of-Fit
Total variation in {yi}: total sum of squares, SST = Σi (yi − ȳ)².
Total variation in {ŷi}: explained sum of squares, SSE = Σi (ŷi − ȳ)².
Total variation in {ûi}: residual sum of squares, SSR = Σi ûi².
SST = SSE + SSR
R² = SSE/SST = 1 − SSR/SST is defined as the proportion of the sample variation in yi that is explained by the OLS regression line. It lies between 0 and 1.

Goodness-of-Fit
Recall R² = (correlation coefficient)² = ρ². R² usually increases when we add more and more explanatory variables to the regression; mathematically this happens because SSR never increases when more explanatory variables are added. It is therefore a poor tool for deciding whether another x should be added or not; that decision should ideally depend on whether x has a non-zero partial effect on y.

MLR - assumptions
MLR.1: Linearity in parameters
MLR.2: Random sampling
MLR.3: No perfect collinearity — none of the explanatory variables is constant and there are no exact linear relationships among the explanatory variables. Bivariate model: x1 and x2 are linearly independent. We allow for some correlation but no perfect correlation (otherwise the econometric analysis is meaningless).
MLR.4: Zero conditional mean
MLR.5: Homoskedasticity
Collectively MLR.1 through MLR.5 are called the Gauss-Markov assumptions.

MLR.3 - violation: case 1
Example: 2 candidates and an election; regress the percentage of the vote for candidate A on campaign expenditures:
voteA = β0 + β1 expendA + β2 expendB + β3 totexpend + u
The model violates MLR.3 because of perfect collinearity: x3 = x1 + x2, i.e. totexpend = expendA + expendB. Try interpreting β1: it measures the effect of _____ on ____ keeping ___ fixed. Nonsense! The solution to perfect collinearity is to simply drop one of the explanatory variables from the regression.

MLR.3 - violation: case 2
MLR.3 is violated if the same explanatory variable is measured in different units in the same regression, e.g. income measured both in rupees and in thousands of rupees.

MLR.3 - violation: case 3
MLR.3 also fails if the sample size n is too small in relation to k. In the general k-independent-variable model, there are k + 1 parameters; MLR.3 fails if n < k + 1. Why?
need k + 1 observations at least to estimate k + 1 parameters Manini Ojha (JSGP) EC-14 Fall, 2020 127 / 237 MLR.4. - violation Omitted variables If important explanatory variables are omitted from the regression: u may be correlated with x 0 s Reasons: data limitation or ignorance in case of actual application Endogeneity u is correlated with explanatory variables If xj is uncorrelated with u =⇒ “exogenous explanatory vars” If xj is correlated with u =⇒ “endogenous explanatory vars” Manini Ojha (JSGP) EC-14 Fall, 2020 128 / 237 Caution! Before we proceed, a word of caution: Do not confuse MLR. 3 and MLR.4 MLR. 3. rules out certain relationship between x 0 s (easier to deal with) MLR. 4. rules out relationship between u and x (difficult to identify and deal with) Violation −→ bias in OLS estimators If bivariate −→ bias in all 3 OLS estimators If k-independent variable model −→ bias in all k + 1 OLS estimators Manini Ojha (JSGP) EC-14 Fall, 2020 129 / 237 Unbiasedness Under the assumptions MLR. 1 through MLR. 4 E (β̂j ) = βj , j = 0, 1, ...., k OLS estimators are unbiased estimators of the population parameters Manini Ojha (JSGP) EC-14 Fall, 2020 130 / 237 Unbiasedness Useful to remember the meaning of unbiasedness: Run a regression of log (wages) on educ Find the estimate of the coefficient attached educ to be 9.2%. Tempting, to say something like “9.2% is an unbiased estimate of the return to education.” But, when we say that OLS is unbiased under MLR.1 through MLR.4, we mean the procedure by which the OLS estimates are obtained is unbiased we hope that we’ve obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or too small.” Manini Ojha (JSGP) EC-14 Fall, 2020 131 / 237 Issues in MLR 1 Inclusion of irrelevant variables/ over-specifying the model 2 Excluding a relevant variable/ under-specifying the model Manini Ojha (JSGP) EC-14 Fall, 2020 132 / 237 Including irrelevant vars/Overspecifying Means one (or more) of explanatory vars is included in the regression that have no partial effect on dependent variable y = β0 + β1 x1 + β2 x2 + β3 x3 + u Lets assume that model satisfies MLR.1 through MLR. 4 However, x3 has no effect on y after x1 and x2 have been controlled for i.e. β3 = 0 x3 may or may not be correlated with x1 or x2 No need to include x3 in the model when its coefficient in the population is zero. Has no effect on unbiasedness of OLS estimators Has undesirable effects on efficiency of OLS estimators (affects variances) Manini Ojha (JSGP) EC-14 Fall, 2020 133 / 237 Omitted variable/under-specifying Means we omit a variable that actually belongs to the true population model y = β0 + β1 x1 + β2 x2 + u This would ideally give the OLS regression ŷ = β̂0 + β̂1 x1 + β̂2 x2 Lets assume that this model satisfies MLR.1 through MLR. 4 Suppose, primary interest is in β1 (partial effect of x1 on y ) But due to some reason (say unavailability of data), we run the following regression instead ỹ = β̃0 + β̃1 x1 Manini Ojha (JSGP) EC-14 Fall, 2020 134 / 237 Omitted variable bias (OVB) Comparing β̃1 and β̂1 , we know (28) β̃1 = β̂1 + β̂2 δ̃1 δ̃1 depends only on independent variables, so we take it as non-random Bias: E (β̃1 ) = E (β̂1 + β̂2 δ̃1 ) = E (β̂1 ) + δ̃1 E (β̂2 ) = β1 + β2 δ̃1 Cov (x1 x2 ) = β1 + β2 Var (x1 ) E (β̃1 ) − β1 = β2 δ̃1 = OVB Manini Ojha (JSGP) EC-14 (29) Fall, 2020 135 / 237 OVB = 0 if β2 = 0 i.e. 
x2 does not appear in the true model or δ̃1 = 0 i.e. x1 and x2 are uncorrelated Sign of OVB β2 > 0 β2 < 0 Manini Ojha (JSGP) corr (x1 , x2 ) > 0 + - EC-14 corr (x1 , x2 ) < 0 + Fall, 2020 136 / 237 Upward bias or downward bias? If E (β̃1 ) > β1 , then β̃1 has an upward bias If E (β̃1 ) < β1 , then β̃1 has a downward bias Manini Ojha (JSGP) EC-14 Fall, 2020 137 / 237 Example: wage = β0 + β1 educ + β2 ability + u If however regress (due to data issues) wage = β0 + β1 educ + v Do you think there is an OVB here? What is the likely sign of OVB here? corr (x1 , x2 ) =? ; sgn{β2 } =? sgn{E (β̃1 ) − β1 } =? Manini Ojha (JSGP) EC-14 Fall, 2020 138 / 237 Example: avgscore = β0 + β1 expend + β2 povrate + u If however regress avgscore = β0 + β1 expend + v Is there an OVB here? What is the likely sign of OVB here? corr (x1 , x2 ) =? ; sgn{β2 } =? sgn{E (β̃1 ) − β1 } =? Manini Ojha (JSGP) EC-14 Fall, 2020 139 / 237 OVB contd.. Deriving the sgn{OVB} is more difficult in a general case (with more explanatory variables) Note: If corr (u, xj ) 6= 0, then all OLS estimators are biased Which assumption is violated here? Example: If population model: y = β0 + β1 x1 + β2 x2 + β3 x3 + u Satisfies MLR.1 through MLR. 4 But we omit x3 and end up estimating: ỹ = β̃0 + β̃1 x1 + β̃2 x2 Manini Ojha (JSGP) EC-14 Fall, 2020 140 / 237 Suppose corr (x2 , x3 ) = 0 but corr (x1 , x3 ) 6= 0 Tempting to think β̃1 is probably biased but β̃2 will not be. But both β̃1 and β̃2 will be biased! Unless! corr (x1 , x2 ) = 0 Rough guide to figure out the sign of OVB if corr (x1 , x2 ) = 0 is actually true, then what would the sign of bias be (x1 x3 ) given E (β̃1 ) = β1 + β3 cov var (x1 ) Manini Ojha (JSGP) EC-14 Fall, 2020 141 / 237 Variance of OLS estimators MLR.5 : homoskedasticity Example: wage = β0 + β1 educ + β2 exper + β3 tenure + u Homoskedasticity requires that the variance of unobserved error does not depend on levels of education, experience or tenure i.e. Var (u|educ, exper , tenure) = σ 2 If variance of error changes with any of the 3 −→ heteroskedasticity MLR. 1 through MLR. 5 are needed to get to variance of β̂j Var (β̂j ) = σ2 SSTj (1 − Rj2 ) (30) where Rj2 is the R-squared from the regression of xj on all other explanatory variables Manini Ojha (JSGP) EC-14 Fall, 2020 142 / 237 Variance of OLS estimator Var (β̂j ) = σ2 SSTj (1 − Rj2 ) Size of Var (β̂j ) is important. Why? Larger the variance the less precise the estimator larger the confidence intervals less accurate the hypotheses test Manini Ojha (JSGP) EC-14 Fall, 2020 143 / 237 Components of OLS variance Var (β̂j ) depends on 3 factors 1 2 3 σ2 SSTj Rj2 Manini Ojha (JSGP) EC-14 Fall, 2020 144 / 237 Component 1 - Error variance σ2: Larger the error variance −→larger the sampling variance for OLS estimator more “noise” in the equation −→ more difficult to estimate the precise partial effect of any xj on y For any given dependent variable y , the only way to reduce error variance: adding more x 0 s (that can explain more of y leaving “little” in the error u) problem: unfortunately, not always possible to find additional legitimate x 0 s that affect y Manini Ojha (JSGP) EC-14 Fall, 2020 145 / 237 Component 2 - Total sample variance in xj SSTj : Larger the SSTj −→smaller the sampling variance for OLS estimator as SSTj −→ 0 Var (β̂j ) −→ ∞ SSTj = 0 not allowed Everything else equal, we prefer to have as much variation in x 0 s as possible A nice way of increasing sample variation in each of the x 0 s increase the sample size itself i.e. 
Component 3 - Linear association between the x's, Rj²:
Distinct from the R-squared from the regression of y on the x's
Rj² appears in the regression of one explanatory variable on the others
Proportion of the total variation in xj that is explained by the other explanatory variables
Suppose k = 2: y = β0 + β1 x1 + β2 x2 + u
Then Var(β̂1) = σ² / [SST1 (1 − R1²)]
R1² is the R-squared from the regression of x1 on x2
A higher R1² means x2 explains much of the variation in x1, or x1 and x2 are highly correlated
For the general k-independent-variable model: if Rj² ↑ −→ (1 − Rj²) ↓ −→ Var(β̂j) ↑
Manini Ojha (JSGP) EC-14 Fall, 2020 147 / 237

Extreme cases:
If Rj² = 0 (i.e. xj has zero correlation with every other x — rare!), then we get the smallest Var(β̂j) for given SSTj and σ²
If Rj² = 1: which assumption is violated? MLR.3: perfect collinearity
Relevant case: if Rj² −→ 1 ("close" to 1), then Var(β̂j) −→ ∞
This case is called "multicollinearity": high but not perfect correlation between 2 or more independent variables
Not a violation of MLR.3
Manini Ojha (JSGP) EC-14 Fall, 2020 148 / 237

Multicollinearity: how much is too much?
Read [JW 5th ed] Chapter 3, p. 95-98, section on "Linear relationship among the independent variables" (homework!)
Manini Ojha (JSGP) EC-14 Fall, 2020 149 / 237
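As a companion to the multicollinearity reading, here is a small illustrative sketch, not from the slides: simulated data in Python/numpy, computing Rj² for each regressor and the corresponding variance inflation factor 1/(1 − Rj²), the term that drives up Var(β̂j) in Eqn. (30).

import numpy as np

rng = np.random.default_rng(1)
n = 500
# two highly correlated regressors plus one unrelated regressor (illustrative data)
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
y  = 1 + 2*x1 + 1*x2 + 0.5*x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])

def r_squared_j(X, j):
    # R^2 from regressing column j of X on all other columns (including the intercept)
    xj = X[:, j]
    others = np.delete(X, j, axis=1)
    fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
    ss_res = np.sum((xj - fitted) ** 2)
    ss_tot = np.sum((xj - xj.mean()) ** 2)
    return 1 - ss_res / ss_tot

for j, name in [(1, "x1"), (2, "x2"), (3, "x3")]:
    R2j = r_squared_j(X, j)
    print(f"{name}: R_j^2 = {R2j:.3f}, variance inflation factor 1/(1-R_j^2) = {1/(1-R2j):.1f}")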
Trade-off between variance and bias
Choice between including a particular variable or not in the model: analyze the trade-off
Recall: omitting a relevant variable −→ bias; including many variables −→ loss in efficiency
Manini Ojha (JSGP) EC-14 Fall, 2020 150 / 237

Suppose the true population model is
y = β0 + β1 x1 + β2 x2 + u   (31)
Consider 2 estimates of β1:
β̂1 from the MLR: ŷ = β̂0 + β̂1 x1 + β̂2 x2   (32)
β̃1 from the SLR: ỹ = β̃0 + β̃1 x1   (33)
Manini Ojha (JSGP) EC-14 Fall, 2020 151 / 237

When β2 ≠ 0, Eqn. 33 excludes a relevant variable −→ OVB unless corr(x1, x2) = 0; β̃1 is biased
Therefore, if bias is the only criterion used to decide which estimator is better, then β̂1 is preferred to β̃1
Manini Ojha (JSGP) EC-14 Fall, 2020 152 / 237

But! If variance is also brought into the picture, then things change
We know, conditioning on the values of x1 and x2 in the sample,
Var(β̂1) = σ² / [SST1 (1 − R1²)]   (34)
We also know
Var(β̃1) = σ² / SST1   (35)
Comparing Eqn. 34 and Eqn. 35: Var(β̃1) < Var(β̂1), i.e. β̃1 is more efficient, unless corr(x1, x2) = 0 =⇒ β̃1 = β̂1
Manini Ojha (JSGP) EC-14 Fall, 2020 153 / 237

If corr(x1, x2) ≠ 0, then the following conclusions:
1 When β2 ≠ 0: β̃1 is biased, β̂1 is unbiased, and Var(β̃1) < Var(β̂1)
2 When β2 = 0: β̃1 and β̂1 are both unbiased, and Var(β̃1) < Var(β̂1); including x2 in the model exacerbates the multicollinearity problem −→ a less efficient estimator
Traditionally, econometricians have compared the likely size of the bias with the reduction in variance to decide whether or not to include x2
Manini Ojha (JSGP) EC-14 Fall, 2020 154 / 237

Estimating σ² in MLR
Recall: in SLR, the estimate of σ² is σ̂²; similarly in MLR, the estimate of σ² is σ̂²
Recall, in SLR: σ̂² = Σ_{i=1}^{n} ûᵢ² / (n − 2)
In MLR: σ̂² = Σ_{i=1}^{n} ûᵢ² / (n − (k + 1))
where the degrees of freedom (df) for general OLS with n observations and k independent variables is n − k − 1
Manini Ojha (JSGP) EC-14 Fall, 2020 155 / 237

Since σ̂² = Σ_{i=1}^{n} ûᵢ² / (n − (k + 1)) and Var(β̂j) = σ² / [SSTj (1 − Rj²)],
sd(β̂j) = σ / √[SSTj (1 − Rj²)]
σ̂ estimates σ, so
se(β̂j) = σ̂ / √[SSTj (1 − Rj²)]
called the standard error of β̂j, utilized when we construct confidence intervals and conduct tests
Manini Ojha (JSGP) EC-14 Fall, 2020 156 / 237
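A hedged sketch, not from the slides, of how σ̂² and se(β̂j) can be computed directly from these formulas and cross-checked against a packaged routine; the data are simulated and statsmodels is assumed to be available.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k = 200, 2
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y  = 1 + 0.8*x1 - 0.4*x2 + rng.normal(scale=2.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# sigma^2 hat = SSR / (n - k - 1)
sigma2_hat = np.sum(res.resid ** 2) / (n - k - 1)

# se(beta1_hat) from the formula sigma_hat / sqrt(SST_1 * (1 - R_1^2))
sst1 = np.sum((x1 - x1.mean()) ** 2)
aux  = sm.OLS(x1, sm.add_constant(x2)).fit()      # regression of x1 on the other regressor
se_beta1 = np.sqrt(sigma2_hat / (sst1 * (1 - aux.rsquared)))

print("sigma^2 hat by formula:", sigma2_hat, "  statsmodels res.scale:", res.scale)
print("se(beta1) by formula  :", se_beta1,   "  statsmodels res.bse[1]:", res.bse[1])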
Efficiency of OLS: Gauss-Markov Theorem
Why do we use OLS instead of the wide variety of other estimation methods?
We know that under MLR.1 through MLR.4, OLS is unbiased
But there may be other unbiased estimators as well...
However, we also know that under MLR.1 through MLR.5, OLS has the smallest variance
Theorem (Gauss-Markov): Under assumptions MLR.1 through MLR.5, β̂0, β̂1, ..., β̂k are the best linear unbiased estimators (BLUEs) of β0, β1, ..., βk respectively.
Manini Ojha (JSGP) EC-14 Fall, 2020 157 / 237

Language of MLR Analysis
Note: many econometricians report that they have "estimated an OLS model" — incorrect language
Correct usage: "used the OLS estimation method" — OLS is not a model
A model describes an underlying population and depends on unknown parameters
Model: y = β0 + β1 x1 + β2 x2 + ... + βk xk + u
Can talk about the interpretation of βj (any one of the unknown parameters) without looking at the data, just by looking at the model
Of course, we learn much more about βj from the data
Manini Ojha (JSGP) EC-14 Fall, 2020 158 / 237

Language of MLR Analysis
Other methods of estimation exist: weighted least squares, least absolute deviations, instrumental variables, etc. (advanced econometrics)
Finally, it is important not to use imprecise language — it leads to vagueness on important considerations like assumptions
Example of correct usage: "I estimated the equation by ordinary least squares. Under the assumption that no important variables have been omitted from the equation, and assuming random sampling, the OLS estimator of β1 is unbiased. If the error term u has constant variance, the OLS estimator is actually best linear unbiased."
Manini Ojha (JSGP) EC-14 Fall, 2020 159 / 237

Inference
We know the expected value of the OLS estimators and the variance of the OLS estimators
For statistical inference, we need to know more than just the first 2 moments of β̂j
We need to know the full sampling distribution of β̂j
The sampling distributions of the OLS estimators depend entirely on the sampling distribution of the errors, once we condition on the values of the control variables in the sample
Manini Ojha (JSGP) EC-14 Fall, 2020 160 / 237

Normality Assumption
MLR.6: Normality assumption
The population error u is independent of the explanatory variables x1, x2, ..., xk and is normally distributed with zero mean and constant variance σ², such that u ∼ N(0, σ²)
Stronger assumption than any of the previous assumptions
If we make assumption MLR.6, we are necessarily making assumptions MLR.4 and MLR.5. Why?
Manini Ojha (JSGP) EC-14 Fall, 2020 161 / 237

Language
For cross-section regression analysis, the full set of assumptions means MLR.1 through MLR.6, collectively called the CLRM assumptions
Under these 6 assumptions, we call the model the Classical Linear Regression Model (CLRM)
Manini Ojha (JSGP) EC-14 Fall, 2020 162 / 237

In real-world applications, the assumption of normality of u is really an empirical matter
Often it is not true
No theorem says that wage conditional on education and experience is normally distributed
In fact, most probably this is not true: wage cannot be less than zero, so, strictly speaking, it cannot have a normal distribution
Nevertheless, we consider this assumption and ask whether the distribution is "close" to being normal
Often, using a transformation like logs yields a distribution closer to a normal
Example: log(price) tends to have a distribution that looks more normal than the distribution of price
The log is one of the most common transformations to get a skewed distribution looking more like a normal
Manini Ojha (JSGP) EC-14 Fall, 2020 163 / 237

Normal sampling distributions
Normality of the errors =⇒ normal sampling distributions of the OLS estimators:
Theorem (Normal Sampling Distributions): Under the CLRM assumptions MLR.1 through MLR.6, conditional on the sample values of the independent variables, β̂j ∼ N[βj, Var(β̂j)]. Therefore,
(β̂j − βj) / sd(β̂j) ∼ N(0, 1)   (36)
We standardized a normal r.v. by subtracting off its mean and dividing by its standard deviation to get a standard normal r.v.
Manini Ojha (JSGP) EC-14 Fall, 2020 164 / 237

Normal sampling distributions
The theorem implies:
1 Any linear combination of β̂0, β̂1, ..., β̂k is also normally distributed
2 Any subset of the β̂j has a joint normal distribution
Manini Ojha (JSGP) EC-14 Fall, 2020 165 / 237

Hypothesis Testing
Recall, in general θ denotes an unknown parameter and θ̂ is an estimate
If θ̂ is a unique value, then θ̂ is a point estimate
Hypothesis testing is a method of inference concerning the unknown population parameters
Manini Ojha (JSGP) EC-14 Fall, 2020 166 / 237

Begin with some definitions
Definition: A hypothesis is a statement about a population parameter
Definition: The two complementary hypotheses in a hypothesis testing problem are called the null and alternative hypotheses, denoted H0 and Ha (sometimes H1) respectively
Manini Ojha (JSGP) EC-14 Fall, 2020 167 / 237

Testing hypotheses about a single population parameter: the t-test
Population model: y = β0 + β1 x1 + ... + βk xk + u
We can hypothesize about the value of βj and use hypothesis testing to draw inference
Theorem (t-distribution for the standardized estimators): Under the CLRM assumptions,
(β̂j − βj) / se(β̂j) ∼ t_{n−k−1} = t_df   (37)
The theorem on the t-distribution is important, as it allows us to test hypotheses involving βj
Manini Ojha (JSGP) EC-14 Fall, 2020 168 / 237
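The theorem can be illustrated by simulation. The sketch below is illustrative only (Python with numpy/scipy, simulated data satisfying the CLRM assumptions): it draws many samples, forms (β̂1 − β1)/se(β̂1) in each, and compares the simulated quantiles with those of the t(n − k − 1) distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k, beta1 = 30, 2, 0.7
reps = 5000
tstats = np.empty(reps)

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y  = 1 + beta1*x1 + 0.5*x2 + rng.normal(size=n)    # normal errors (MLR.6)
    X  = np.column_stack([np.ones(n), x1, x2])
    b, ssr = np.linalg.lstsq(X, y, rcond=None)[:2]
    sigma2 = ssr[0] / (n - k - 1)                       # sigma^2 hat
    var_b1 = sigma2 * np.linalg.inv(X.T @ X)[1, 1]      # Var-hat of beta1_hat
    tstats[r] = (b[1] - beta1) / np.sqrt(var_b1)

# compare simulated quantiles with the t(n - k - 1) distribution
for q in (0.05, 0.5, 0.95):
    print(q, round(np.quantile(tstats, q), 3), round(stats.t.ppf(q, df=n - k - 1), 3))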
Primary interest is typically in testing the null hypothesis
H0: βj = 0   (38)
What does this mean?
Since βj measures the partial effect of xj on (the expected value of) y, after controlling for all other independent variables, Eqn. 38 means that, once x1, x2, ..., x_{j−1}, x_{j+1}, ..., xk have been accounted for, xj has no effect on the expected value of y
Example: returns to education
log(wage) = β0 + β1 educ + β2 exper + β3 tenure + u
H0: β2 = 0 means that once education and tenure have been accounted for, the number of years of past experience in the workforce has no effect on hourly wages
Manini Ojha (JSGP) EC-14 Fall, 2020 169 / 237

The statistic we use to test the null against any alternative is called the t-statistic or the t-ratio of β̂j:
t_{β̂j} = β̂j / se(β̂j)
We will say t_educ is the t-statistic for β̂_educ
If β̂j > 0 then t_{β̂j} > 0; if β̂j < 0 then t_{β̂j} < 0
For a given value of se(β̂j), a larger value of β̂j −→ a larger t_{β̂j}
Manini Ojha (JSGP) EC-14 Fall, 2020 170 / 237

The t-distribution was derived by Gosset (1908)
He worked at the Guinness brewery in Dublin, which prohibited publishing for fear of revealing trade secrets
Gosset instead published under the name "Student"
Hence, it is also known as the Student-t distribution
Manini Ojha (JSGP) EC-14 Fall, 2020 171 / 237

For the null H0: βj = 0,
look at the unbiased estimator β̂j and ask how far β̂j is from zero
A value of β̂j very far from zero provides evidence against the null
But sampling error exists in our estimate, which is accounted for by the s.e.
Thus, t_{β̂j} measures how many estimated standard deviations β̂j is away from zero
A t_{β̂j} sufficiently far away from zero will result in rejection of the null
The precise rule depends on the alternative hypothesis and the chosen level of significance
Manini Ojha (JSGP) EC-14 Fall, 2020 172 / 237

Caution
Never write the null as H0: β̂j = 0 !!
We test hypotheses about population parameters, NOT estimates from a particular sample!
Manini Ojha (JSGP) EC-14 Fall, 2020 173 / 237

Hypothesis testing procedure
The procedure typically entails:
1 Construction of a test statistic, which is a function of the sample estimate(s)
2 Specification of the rejection region for this test statistic
3 Comparison of the sample value with the rejection region
Manini Ojha (JSGP) EC-14 Fall, 2020 174 / 237

Choosing a rejection rule
How do we choose a rejection rule? First choose a significance level
Definition: The significance level is the probability of rejecting H0 when it is in fact true.
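Given a significance level and the degrees of freedom, the critical value can be read off the t distribution. The short sketch below is illustrative only (scipy assumed; df = 522 anticipates the wage example on the slides that follow) and shows the one-sided and two-sided cases.

from scipy import stats

df = 522            # n - k - 1 in the wage example below (assumed for illustration)
alpha = 0.05

c_right = stats.t.ppf(1 - alpha, df)        # one-sided (right-tailed) critical value
c_two   = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical value

print("5% one-sided c :", round(c_right, 3))   # ~1.648, close to the normal value 1.645
print("5% two-sided c :", round(c_two, 3))     # ~1.965, close to the normal value 1.960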
Suppose we have decided on a 5% significance level
The critical value at the 5% significance level is denoted by c
By the choice of this critical value, rejection of the null will occur for 5% of all random samples when the null is true
Manini Ojha (JSGP) EC-14 Fall, 2020 175 / 237

Testing a one-sided alternative
One-sided alternative:
H0: βj = 0, Ha: βj > 0; here the rejection rule is t_{β̂j} > c (right-tailed)
Or H0: βj = 0, Ha: βj < 0; here the rejection rule is t_{β̂j} < −c (left-tailed)
Manini Ojha (JSGP) EC-14 Fall, 2020 176 / 237

Rejection rule and computation of c
The rejection rule in a one-tailed test is that H0 is rejected in favor of Ha at the 5% significance level if
t_{β̂j} > c   (39)
To compute c, we need the significance level and the df
Rough guide: if df > 120, can use standard normal critical values
As the significance level ↓, the critical value ↑
Thus, we need larger and larger values of the t-statistic to reject the null
Manini Ojha (JSGP) EC-14 Fall, 2020 177 / 237

Right-tailed rejection region (figure)
Manini Ojha (JSGP) EC-14 Fall, 2020 178 / 237

Left-tailed rejection region (figure)
Manini Ojha (JSGP) EC-14 Fall, 2020 179 / 237

Example: Hourly wage model
log(wage)-hat = 0.284 + 0.092 educ + 0.0041 exper + 0.022 tenure
                (0.104)  (0.007)      (0.0017)       (0.003)
n = 526, R² = 0.316
Standard errors are provided in parentheses below the estimated coefficients
Use this equation to test whether the return to exper, controlling for educ and tenure, is zero in the population, against the alternative that it is positive
H0: ?  Ha: ?
How will you compute the t-statistic if the chosen significance level is 5%?
t_{β̂2} = t_exper = ? / ? = ?
Manini Ojha (JSGP) EC-14 Fall, 2020 180 / 237

Note: since df = 522, we can use standard normal critical values
At the 5% significance level, the critical value is c = 1.645
At the 1% significance level, the critical value is c = 2.326
Therefore, from the example equation, since t_exper ≈ 2.41, we can say β̂_exper (or exper) is statistically significant even at the 1% level, OR β̂_exper is statistically greater than zero even at the 1% significance level
Manini Ojha (JSGP) EC-14 Fall, 2020 181 / 237
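A quick check of this example (sketch only; the two numbers are taken from the estimated equation above, and scipy is assumed for the critical values and p-value):

from scipy import stats

beta_exper, se_exper = 0.0041, 0.0017     # coefficient and se on exper from the slide
n, k = 526, 3
df = n - k - 1                            # 522

t_exper = beta_exper / se_exper
print("t statistic:", round(t_exper, 2))                    # ~2.41

for alpha in (0.05, 0.01):
    c = stats.t.ppf(1 - alpha, df)                          # right-tailed critical value
    print(f"alpha = {alpha}: critical value {c:.3f}, reject H0: {t_exper > c}")

print("one-sided p-value:", round(stats.t.sf(t_exper, df), 4))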
Testing two-sided alternatives
Two-sided alternative:
H0: βj = 0, Ha: βj ≠ 0
The rejection rule in a two-tailed test looks at the absolute value of the t-stat:
|t_{β̂j}| > c   (40)
To compute c, we need the significance level and the df
When a specific alternative is not stated, it is usually considered to be two-sided
Manini Ojha (JSGP) EC-14 Fall, 2020 182 / 237

Manini Ojha (JSGP) EC-14 Fall, 2020 183 / 237

Language
If H0 is rejected in favor of H1 at the 5% level, we usually say that xj is statistically significant at the 5% level, or that xj is statistically different from zero at the 5% level
If H0 is not rejected, we say that xj is statistically not significant at the 5% level, or that we fail to reject H0 at the 5% level (rather than "we accept the null")
Manini Ojha (JSGP) EC-14 Fall, 2020 184 / 237

Testing other hypotheses about βj
Sometimes we are interested in whether βj is equal to another constant
Common examples: H0: βj = a, where a is the hypothesized value of βj; then
t = (β̂j − a) / se(β̂j)   (41)
t measures how many estimated standard deviations β̂j is away from the hypothesized value of βj
t = (estimate − hypothesized value) / standard error
Manini Ojha (JSGP) EC-14 Fall, 2020 185 / 237

If H0: βj = 1, H1: βj > 1, then find the critical value for a one-sided alternative exactly as before
We reject H0 in favor of H1 if t > c =⇒ β̂j is statistically greater than one at the appropriate significance level
If H0: βj = 1, H1: βj ≠ 1, then find the critical value for a two-sided alternative exactly as before
We reject H0 in favor of H1 if |t| > c =⇒ β̂j is statistically different from one at the appropriate significance level
Manini Ojha (JSGP) EC-14 Fall, 2020 186 / 237

If H0: βj = −1, H1: βj ≠ −1, then find the critical value for a two-sided alternative exactly as before, with
t = (β̂j + 1) / se(β̂j)
We reject H0 in favor of H1 if |t| > c =⇒ β̂j is statistically different from negative one at the appropriate significance level
∴ The difference is in how we compute the t-stat, not in how we obtain c
Manini Ojha (JSGP) EC-14 Fall, 2020 187 / 237

Recap - hypothesis testing
To test a hypothesis using the classical approach:
State the alternative hypothesis
Choose a significance level
Then determine a critical value based on the df and the significance level
Compute the value of the t-statistic
Compare the t-statistic with the critical value; the null is either rejected or not rejected at the given significance level
Manini Ojha (JSGP) EC-14 Fall, 2020 188 / 237

Manini Ojha (JSGP) EC-14 Fall, 2020 189 / 237

p-values
Rather than testing at different significance levels, we can try to answer the following:
Given the value of the t-stat, what is the smallest significance level at which the null would be rejected?
This is known as the p-value for the test It is a probability and always lies between 0 and 1 Manually computing requires detailed printed t-tables Regression package will do it for you Almost always its the p-value for testing the null against the two-sided alternative Manini Ojha (JSGP) EC-14 Fall, 2020 190 / 237 p-value summarizes the strength or weakness of an empirical evidence against the null Small p-values are evidence against the null Large p-values provide little evidence against the null If α denotes the significance level of the test (in decimal), then we reject the H0 if p-value < α we fail to reject the H0 if p-value > α Manini Ojha (JSGP) EC-14 Fall, 2020 191 / 237 Economic significance vs statistical significance Statistical significance of a variable xj is determined entirely by the size of tβ̂j Economic significance of a variable is related to the size and sign of β̂j Too much emphasis on statistical significance may lead to false conclusions that estimate is important even though the estimated effect is small driven by both size of β̂j and se(β̂j ) Manini Ojha (JSGP) EC-14 Fall, 2020 192 / 237 Example: log\ (wage) = 80.29 + 5.44educ + 0.269exper − 0.00013tenure (0.78) (0.52) (0.045) (0.00004) 2 n = 1534 R = 0.100 Discuss the statistical (compute t-stat) vs economic significance of tenure on predicted log (wage) Small sample size −→ less precise estimators, higher standard errors Large sample size −→ more precise estimators, smaller standard errors in comparison to coefficient estimate Sometimes large standard errors because of high multicollinearity Rj2 even if sample size is fairly large Manini Ojha (JSGP) EC-14 Fall, 2020 193 / 237 Confidence Intervals Also called interval estimates provide a range of likely values for the population parameter rather than just a point Given that β̂j − βj se(β̂j ) ∼ tn−k−1 Confidence interval (CI) for unknown βj is constructed as β̂j ± c.se(β̂j ) (42) where c is the critical value Manini Ojha (JSGP) EC-14 Fall, 2020 194 / 237 Confidence Intervals Lower bound of CI βj ∼ β̂j − c· se(β̂j ) (43) β j ∼ β̂j + c· se(β̂j ) (44) Upper bound of CI Manini Ojha (JSGP) EC-14 Fall, 2020 195 / 237 Example: 1 df=25, a 95% CI for any βj is given by [β̂j − 2.06· se(β̂j ), β̂j + 2.06· se(β̂j )] 2 df=25, a 90% CI for any βj is given by [β̂j − 1.71· se(β̂j ), β̂j + 1.71· se(β̂j )] 3 df=25, a 99% CI for any βj is given by [β̂j − 2.79· se(β̂j ), β̂j + 2.79· se(β̂j )] Example: for df=120, (using normal dbn) a 95% CI for any βj is given by [β̂j − 1.96· se(β̂j ), β̂j + 1.96· se(β̂j )] Manini Ojha (JSGP) EC-14 Fall, 2020 196 / 237 Testing hypothesis : single linear combination of parameters Till now we’ve seen how to test hypothesis about single unknown parameter βj In application, often required to test hypothesis about many population parameters, sometimes a combination of them Example: Kane & Rouse (1995) : population includes working people with HS degree log (wage) = β0 + β1 jc + β2 univ + β3 exper + u (45) jc: # years of attending 2 year college univ : # years of attending 4 year college exper : # months in the workforce Manini Ojha (JSGP) EC-14 Fall, 2020 197 / 237 If hypothesis of interest is whether 1 year at jc is worth 1 year at uni, then H0 : β1 = β2 i.e. 
one more year at jc and one more year at univ lead to the same ceteris paribus increase in wage
The alternative of interest is one-sided: a year at a junior college is worth less than a year at a university
H1: β1 < β2
Here, rewrite the null and alternative as
H0: β1 − β2 = 0
H1: β1 − β2 < 0
Manini Ojha (JSGP) EC-14 Fall, 2020 198 / 237

Construct the t-stat as
t = (β̂1 − β̂2) / se(β̂1 − β̂2)
The t-stat is based on whether the estimated difference β̂1 − β̂2 is sufficiently less than zero to warrant rejection of the null
Then choose a significance level and, based on the df, compute the critical value and test
The difficulty lies in getting se(β̂1 − β̂2)
Caution! se(β̂1 − β̂2) ≠ se(β̂1) − se(β̂2)
Recall Var(β̂1 − β̂2) = Var(β̂1) + Var(β̂2) − 2 Cov(β̂1, β̂2), so
se(β̂1 − β̂2) = √[Var(β̂1) + Var(β̂2) − 2 Cov(β̂1, β̂2)]
Manini Ojha (JSGP) EC-14 Fall, 2020 199 / 237

An easier method is as follows:
Define a new parameter
θ1 = β1 − β2   (46)
The null and alternative respectively become:
H0: θ1 = 0
H1: θ1 < 0
Manini Ojha (JSGP) EC-14 Fall, 2020 200 / 237

The model becomes (substituting Eqn. 46 into Eqn. 45):
log(wage) = β0 + β1 jc + β2 univ + β3 exper + u
          = β0 + (θ1 + β2) jc + β2 univ + β3 exper + u
          = β0 + θ1 jc + β2 (jc + univ) + β3 exper + u
log(wage) = β0 + θ1 jc + β2 totcoll + β3 exper + u   (47)
If we want to estimate θ1 directly and obtain its se, we must construct the new variable totcoll = jc + univ (total years of college) and include it in the regression
Manini Ojha (JSGP) EC-14 Fall, 2020 201 / 237

Eqn. 47 is simply a way of reformulating the original model, Eqn. 45
Can compare the coefficients and se's and check that the reformulation is correct:
β1 disappears and θ1 appears explicitly
β0 remains the same (in fact you will see that its estimate and se are the same)
β3 remains the same (in fact you will see that its estimate and se are the same)
The coefficient on the new variable totcoll and its se are the same as those on univ before
log(wage)-hat = 1.472 + 0.0667 jc + 0.0769 univ + 0.0049 exper
                (0.021)  (0.0068)    (0.0023)      (0.0002)
log(wage)-hat = 1.472 + 0.0102 jc + 0.0769 totcoll + 0.0049 exper
                (0.021)  (0.0069)    (0.0023)         (0.0002)
Manini Ojha (JSGP) EC-14 Fall, 2020 202 / 237

The reformulation is done so that we can estimate θ1 directly and get its se directly
Can compute a CI at the 95% confidence level for θ1 = β1 − β2 as θ̂1 ± 1.96 · se(θ̂1)
Manini Ojha (JSGP) EC-14 Fall, 2020 203 / 237
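A sketch of the reformulated regression, illustrative only: the data below are simulated stand-ins for the Kane & Rouse sample (so the numbers will not reproduce the estimates above), and statsmodels is assumed to be available.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
jc    = rng.integers(0, 3, size=n).astype(float)       # years of junior college (simulated)
univ  = rng.integers(0, 5, size=n).astype(float)       # years of university (simulated)
exper = rng.uniform(0, 240, size=n)                     # months in the workforce (simulated)
lwage = 1.5 + 0.05*jc + 0.08*univ + 0.004*exper + rng.normal(scale=0.4, size=n)

# Reformulated model: log(wage) = b0 + theta1*jc + b2*(jc + univ) + b3*exper + u
totcoll = jc + univ
X = sm.add_constant(np.column_stack([jc, totcoll, exper]))
res = sm.OLS(lwage, X).fit()

theta1_hat, se_theta1 = res.params[1], res.bse[1]       # true theta1 here is 0.05 - 0.08 = -0.03
print("theta1_hat =", theta1_hat, " se =", se_theta1)
print("95% CI:", theta1_hat - 1.96*se_theta1, theta1_hat + 1.96*se_theta1)
print("t statistic for H0: theta1 = 0 :", res.tvalues[1])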
Testing hypotheses: multiple linear restrictions
What do we do if we are interested in testing whether a set/group of independent variables has no partial effect on the dependent variable?
Model: y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + u
Null: H0: β2 = 0, β3 = 0, β4 = 0
The null here =⇒ once x1 and x5 have been controlled for, x2, x3, x4 have no effect on y and can be excluded from the model
These are called multiple restrictions
A test of multiple restrictions is called a multiple hypotheses test or a joint hypotheses test
Manini Ojha (JSGP) EC-14 Fall, 2020 204 / 237

What should the alternative be?
Ha: H0 is not true
This would hold if at least one of β2, β3, or β4 is different from zero (any or all could be different from 0)
Manini Ojha (JSGP) EC-14 Fall, 2020 205 / 237

Cannot use the t-test to see whether each variable is individually significant: an individual t-test does not put any restrictions on the other parameters
Another way of testing joint hypotheses, where SSR and R² play a role:
Recall, since the OLS estimates are chosen to minimize SSR, SSR always ↑ when x's are dropped from the model (the restricted model)
Compare the SSR in the model with all of the variables (the unrestricted model) with the SSR where the x's are dropped, and check the rejection rule
Manini Ojha (JSGP) EC-14 Fall, 2020 206 / 237

Unrestricted model: the model with more parameters
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + u   (48)
Restricted model: the model with fewer parameters than the unrestricted model
y = β0 + β1 x1 + β5 x5 + v   (49)
Question:
1 Which model out of Eqn. 48 and Eqn. 49 will have the greater SSR?
2 Which model out of Eqn. 48 and Eqn. 49 will have the greater R²?
Manini Ojha (JSGP) EC-14 Fall, 2020 207 / 237

SSRr > SSRur ; SSEr < SSEur =⇒ Rr² < Rur²   (50)
The way to test multiple restrictions / joint hypotheses is by using the F statistic / F-ratio:
F = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)]
where q is the number of restrictions (in our example q = 3)
Since SSRr is no smaller than SSRur, the F stat is always non-negative (strictly positive)
Manini Ojha (JSGP) EC-14 Fall, 2020 208 / 237

Since the restricted model has fewer parameters — and each model is estimated using the same n observations — dfr is always greater than dfur
The F statistic is distributed as an F random variable: F ∼ F(q, n−k−1)
Rejection rule: reject H0 in favor of H1 at the chosen significance level if F > c
Then we say β2, β3, β4 are jointly statistically significant, or jointly statistically different from zero
Manini Ojha (JSGP) EC-14 Fall, 2020 209 / 237

The F-statistic can also be written as
F = [(Rur² − Rr²)/q] / [(1 − Rur²)/(n − k − 1)]
∵ SSRur = SST(1 − Rur²) and SSRr = SST(1 − Rr²)
Called the R² form of the F-statistic
Easier to use this to compute the F stat, since R² is always reported by software packages, while SSR may not be
Note: here in the numerator Rur² comes first, as Rur² > Rr² (refer to Eqn. 50)
Manini Ojha (JSGP) EC-14 Fall, 2020 210 / 237

F statistic for overall significance of the model
A special set of exclusion restrictions
These restrictions have the same interpretation, regardless of the model
In the model with k independent variables, we can write the null hypothesis as
H0: x1, x2, ..., xk do not help to explain y
This null hypothesis is very pessimistic: it states that none of the explanatory variables has an effect on y
H0: β1 = β2 = ... = βk = 0
The alternative is that at least one of the βj is different from zero
Manini Ojha (JSGP) EC-14 Fall, 2020 211 / 237

How many restrictions are there? k restrictions: q = k
The restricted model here looks like y = β0 + u
What is Rr²? Zero (none of the variation in y is being explained, because there are no explanatory variables)
Thus, the F-stat:
F = [R²/q] / [(1 − R²)/(n − k − 1)] = [R²/k] / [(1 − R²)/(n − k − 1)]
Manini Ojha (JSGP) EC-14 Fall, 2020 212 / 237

Testing general linear restrictions
Testing exclusion restrictions is by far the most important application of F statistics
But sometimes, the restrictions implied by a theory are more complicated than just excluding some independent variables.
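Before turning to the general-restriction example below, here is an illustrative sketch (simulated data, not from the slides; statsmodels and scipy assumed) of the exclusion-restriction F test just described, computed both by the SSR and R² formulas and cross-checked against a packaged test.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 400
X_all = rng.normal(size=(n, 5))                                   # x1 ... x5 (illustrative)
y = 1 + 0.8*X_all[:, 0] + 0.5*X_all[:, 4] + rng.normal(size=n)    # x2, x3, x4 truly irrelevant

Xur = sm.add_constant(X_all)                    # unrestricted: all five regressors
Xr  = sm.add_constant(X_all[:, [0, 4]])         # restricted: drop x2, x3, x4
ur, r = sm.OLS(y, Xur).fit(), sm.OLS(y, Xr).fit()

q, df_ur = 3, n - 5 - 1
F_ssr = ((r.ssr - ur.ssr) / q) / (ur.ssr / df_ur)                            # SSR form
F_r2  = ((ur.rsquared - r.rsquared) / q) / ((1 - ur.rsquared) / df_ur)       # R^2 form

print("F (SSR form):", F_ssr, "  F (R^2 form):", F_r2)            # identical
print("5% critical value:", stats.f.ppf(0.95, q, df_ur))
print("p-value:", stats.f.sf(F_ssr, q, df_ur))

# cross-check with statsmodels: restriction matrix picking out beta2, beta3, beta4
R = np.zeros((3, 6)); R[0, 2] = R[1, 3] = R[2, 4] = 1.0
print(ur.f_test(R))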
log (price) = β0 +β1 log (assess)+β2 log (lotsize)+β3 log (sqrft)+β4 bdrms+u where price = house price; assess= the assessed housing value (before the house was sold); lotsize = size of the lot, in square feet; sqrft = square footage; bdrms = number of bedrooms Manini Ojha (JSGP) EC-14 Fall, 2020 213 / 237 Testing general linear restrictions Say want to test the following H0 :β1 = 1, β2 = 0, β3 = 0, β4 = 0 What does this null mean? How many restrictions are there? How will you test this? [Hint: write the restricted model, compare with unrestricted, get F -stat] Can you use the R 2 form of F stat here? Do it! Manini Ojha (JSGP) EC-14 Fall, 2020 214 / 237 Nested & non-nested models When you have two equations and neither equation is a special case of the other, called non-nested models The F - statistic only allows us to test nested models one model (the restricted model) is a special case of the other model (the unrestricted model) Manini Ojha (JSGP) EC-14 Fall, 2020 215 / 237 Effects of data scaling on OLS Statistics When variables are rescaled, the coefficients, standard errors, confidence intervals, t statistics, and F statistics change in ways that preserve all measured effects and testing outcomes Manini Ojha (JSGP) EC-14 Fall, 2020 216 / 237 Data scaling is often used for cosmetic purposes to reduce the number of zeros after a decimal point in an estimated coefficient to improve the appearance of an estimated equation while changing nothing that is essential Manini Ojha (JSGP) EC-14 Fall, 2020 217 / 237 Example Consider the equation relating infant birth weight to cigarette smoking and family income: \ = β̂0 + β̂1 cigs + β̂2 faminc bwght where bwght is child birth weight, in ounces cigs is number of cigarettes smoked by the mother while pregnant, per day faminc is annual family income, in thousands of dollars Manini Ojha (JSGP) EC-14 Fall, 2020 218 / 237 Manini Ojha (JSGP) EC-14 Fall, 2020 219 / 237 Estimate on cigs says if a woman smoked 5 more cigarettes per day, birth weight is predicted to be about .4634(5) = 2.317 ounces less t - stat on cigs is −5.06 (v. statistically significant) Change unit of measurement for dependent var: Suppose now we decide to measure birth weight in pounds instead of ounces Let bwghtlbs = bwght/16 be birth weight in pounds What happens to OLS statistics? Essentially dividing the entire equation by 16 Verify by looking at col (2) Manini Ojha (JSGP) EC-14 Fall, 2020 220 / 237 Estimates in col. (2) = col. (1) /16 coefficient on cigs = −.0289 if cigs were higher by five, birth weight would be .0289(5) = 0.1445 pounds lower Convert to ounces −→.1445(16) = 2.312 slightly different from the 2.317 we obtained earlier (due to rounding error) Point being: once the effects are transformed into the same units, we get exactly the same answer, regardless of how the dependent variable is measured Manini Ojha (JSGP) EC-14 Fall, 2020 221 / 237 What happens to statistical significance? Changing the dependent variable from ounces to pounds has no effect on how statistically important the independent variables are t-stat in col. (2) are identical to t-stat in col. (1) end-points for the CIs in col (2) are the endpoints in col. (1) divided by 16 R-squareds from the two regressions are identical SSRs differ (SSR in col. (2) = SSR in col. (1)/256) Manini Ojha (JSGP) EC-14 Fall, 2020 222 / 237 Change the unit of measurement of one of the independent variables, cigs. 
Define packs to be the number of packs of cigarettes smoked per day
What happens to the coefficients and the other OLS statistics now?
Look at col. (3): the t-stats are the same, the se's differ
Why have we not included both cigs and packs in the same equation?
Manini Ojha (JSGP) EC-14 Fall, 2020 223 / 237

Note: changing the unit of measurement of the dependent variable, when it appears in logarithmic form, does not affect any of the slope estimates; only the intercept changes
Note: changing the unit of measurement of any xj, where log(xj) appears in the regression, does not affect the slope estimates; only the intercept changes
Manini Ojha (JSGP) EC-14 Fall, 2020 224 / 237

Standardized coefficients / Beta coefficients
Sometimes, in econometric applications, a key variable is measured on a scale that is difficult to interpret
Example: the effect of test scores on wages
We are usually interested in how a particular individual's score compares to the population
Difficult to visualize what would happen to wages if the test score increased by 10 points
Makes more sense to think of it in terms of what happens to wages if the test score is one standard deviation higher
Thus, sometimes it is useful to obtain regression results when all variables (dependent and independent) have been standardized
Manini Ojha (JSGP) EC-14 Fall, 2020 225 / 237

A variable is standardized in the sample by subtracting off its mean and dividing by its standard deviation, i.e.
compute the z-score for every variable in the sample
run the regression using the z-scores
Manini Ojha (JSGP) EC-14 Fall, 2020 226 / 237

Start with the original OLS equation:
yi = β̂0 + β̂1 xi1 + β̂2 xi2 + ... + β̂k xik + ûi
Beta coefficients or standardized coefficients (b̂1) are simply the original coefficient β̂1 multiplied by the ratio of the standard deviation of x1 to the standard deviation of y, i.e. b̂1 = (σ̂1/σ̂y)·β̂1
No need to know the proof: take the average across the original equation, subtract it from the original, and divide by standard deviations
Manini Ojha (JSGP) EC-14 Fall, 2020 227 / 237

Interpretation of beta coefficients
If x1 ↑ by one s.d., then ŷ changes by b̂1 s.d.
Measuring effects not in terms of the original units of y or xj, but in standard deviation units
In a standard OLS equation, we cannot look at the sizes of the different coefficients and conclude that the explanatory variable with the largest coefficient is "the most important"
But when each xj has been standardized, comparing the magnitudes of the resulting beta coefficients is compelling
Note: when the regression equation has only a single explanatory variable x1, its standardized coefficient is simply the sample correlation coefficient between y and x1, and must lie in the range −1 to 1
Manini Ojha (JSGP) EC-14 Fall, 2020 228 / 237
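An illustrative sketch (simulated data, not from the slides; statsmodels assumed) showing the two equivalent ways of getting beta coefficients: running OLS on z-scores, or rescaling the original slopes by σ̂j/σ̂y.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(10, 15, size=n)        # e.g. a test score on a hard-to-interpret scale
x2 = rng.normal(0, 2, size=n)
y  = 5 + 0.3*x1 + 1.2*x2 + rng.normal(scale=3, size=n)

def zscore(v):
    return (v - v.mean()) / v.std(ddof=0)

# Way 1: run OLS on the z-scores of y and the x's; the slopes are the beta coefficients
Z = sm.add_constant(np.column_stack([zscore(x1), zscore(x2)]))
beta_std = sm.OLS(zscore(y), Z).fit().params[1:]

# Way 2: rescale the original slopes by sd(x_j)/sd(y)
res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
beta_check = res.params[1:] * np.array([x1.std(ddof=0), x2.std(ddof=0)]) / y.std(ddof=0)

print("beta coefficients via z-scores :", beta_std)
print("beta coefficients via rescaling:", beta_check)    # identical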
Models with logs in functional forms
Interpret the coefficients here:
log(price)-hat = 9.23 − 0.718 log(nox) + 0.306 rooms
β1 is the elasticity of price w.r.t. nox (pollution)
β2 is the change in log(price) when Δrooms = 1; multiply this by 100 and you get an approximate percentage change in price
Recall: 100·β2 is also called the semi-elasticity of price w.r.t. rooms: 30.6%
Manini Ojha (JSGP) EC-14 Fall, 2020 229 / 237

But till now, this was a simplistic interpretation
Exact interpretation in a log-level functional form:
%Δŷ = 100·[exp(β̂2 Δx2) − 1]
Thus, the exact interpretation of β̂2 in the housing price example is: when rooms increase by 1, i.e. Δrooms = 1, the percentage change in price is
%Δprice-hat = 100·[exp(0.306) − 1] = 35.8%
Manini Ojha (JSGP) EC-14 Fall, 2020 230 / 237

More on logs
Advantages:
1 We can be ignorant about the units of measurement of variables appearing in logs; slope coefficients are invariant to rescaling
2 When y > 0, models using log(y) as the dependent variable often satisfy the CLRM assumptions more closely than models using the level of y (since the distribution looks more like a normal)
3 Taking the log of a variable often narrows its range; narrowing the range of y and the x's can make OLS estimates less sensitive to outliers
Manini Ojha (JSGP) EC-14 Fall, 2020 231 / 237

More on logs
Disadvantages:
1 Sometimes the log transformation can actually create extreme values: when y is between 0 and 1 (such as a proportion) and takes on values close to zero, log(y) can be very large in magnitude
2 Cannot be used if a variable takes on zero or negative values
Manini Ojha (JSGP) EC-14 Fall, 2020 232 / 237

Models with quadratics
Quadratic functions are also used quite often in applied economics to capture decreasing or increasing marginal effects:
ŷ = β̂0 + β̂1 x + β̂2 x²
Here, Δŷ/Δx ≈ β̂1 + 2 β̂2 x
The relationship between x and y depends on the value of x
Manini Ojha (JSGP) EC-14 Fall, 2020 233 / 237

Example:
wage-hat = 3.73 + 0.298 exper − 0.0061 exper²
           (0.35)  (0.041)       (0.0009)
n = 526, R² = 0.093
The estimated equation implies that exper has a diminishing effect on wage
What is the shape of the quadratic in this case (the coefficient on x is positive and the coefficient on x² is negative)?
Find the maximum of the function
Manini Ojha (JSGP) EC-14 Fall, 2020 234 / 237

When models have quadratics, the shape can be U-shaped (β1 negative and β2 positive) or an inverted U / hump shape (β1 positive and β2 negative)
Manini Ojha (JSGP) EC-14 Fall, 2020 235 / 237

Models with interaction terms
Sometimes the partial effect, elasticity, or semi-elasticity of y w.r.t. an explanatory variable may also depend on the magnitude of another explanatory variable:
price = β0 + β1 sqrft + β2 bdrms + β3 sqrft·bdrms + β4 bthrms + u
Here the partial effect of bdrms on price is given by
Δprice/Δbdrms = β2 + β3 sqrft
If β3 > 0, then an additional bedroom leads to a larger increase in housing price for larger houses
Manini Ojha (JSGP) EC-14 Fall, 2020 236 / 237

Interaction term: leads to an interaction effect between square footage and the number of bedrooms
For summarizing interaction effects, we typically evaluate the effect at the mean value, the upper quartile, and the lower quartile, i.e. evaluate the effect of bdrms on price at the mean value, upper and lower quartiles, max and min of sqrft
Interesting to look at the average partial effect (APE): β2 + β3·(sample average of sqrft)
Manini Ojha (JSGP) EC-14 Fall, 2020 237 / 237
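A closing sketch (simulated housing data, purely illustrative; statsmodels assumed) of how the interaction effect β2 + β3·sqrft can be evaluated at the lower quartile, mean, and upper quartile of sqrft, and how the average partial effect is obtained.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
sqrft  = rng.uniform(800, 4000, size=n)
bdrms  = rng.integers(1, 6, size=n).astype(float)
bthrms = rng.integers(1, 4, size=n).astype(float)
price  = (20 + 0.12*sqrft + 5*bdrms + 0.01*sqrft*bdrms + 10*bthrms
          + rng.normal(scale=30, size=n))

X = sm.add_constant(np.column_stack([sqrft, bdrms, sqrft*bdrms, bthrms]))
res = sm.OLS(price, X).fit()
b2, b3 = res.params[2], res.params[3]      # coefficients on bdrms and on sqrft*bdrms

# Partial effect of bdrms on price: beta2 + beta3*sqrft, evaluated at several points
for label, s in [("lower quartile", np.percentile(sqrft, 25)),
                 ("mean",           sqrft.mean()),
                 ("upper quartile", np.percentile(sqrft, 75))]:
    print(f"effect of one more bedroom at {label} sqrft ({s:.0f}):", round(b2 + b3*s, 2))

print("average partial effect (at mean sqrft):", round(b2 + b3*sqrft.mean(), 2))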