1/28 Basic Econometrics in Transportation
Outline: Probability Theory; Random Variable; Characteristics of Probability Distributions; Common Probability Distributions; Statistical Concepts
Amir Samimi, Civil Engineering Department, Sharif University of Technology
Primary Sources: Statistical Inference (Casella and Berger) and Basic Econometrics (Gujarati)

2/28 Background
"You can never foretell what any one man will do, but you can say with precision what an average will be up to." (Sherlock Holmes)
The theory of probability has a long and rich history, dating back to the 17th century, when a mathematical formulation of gambling odds was developed.
The question is how to draw conclusions about a population of objects by conducting an experiment.

3/28 Definitions
Sample space: all the possible outcomes, or the population.
Event: any possible collection of outcomes.
Mutually exclusive events: the occurrence of one event prevents the occurrence of another.
Exhaustive events: together they exhaust all the possible outcomes of an experiment.

4/28 Probability
Definition: the proportion of times an event will occur in repeated trials of an experiment.
Calculus of probabilities:
0 ≤ P(A) ≤ 1 for every event A.
If A, B, C, . . . constitute an exhaustive set of events, then P(A + B + C + ···) = 1.
If A, B, C, . . . are mutually exclusive events, then P(A + B + C + ···) = P(A) + P(B) + P(C) + ···

5/28 Random Variable
A variable whose value is determined by the outcome of a chance experiment.
Notation: random variables are usually denoted by capital letters such as X; the values taken by them are denoted by small letters such as x.
A random variable may be:
Discrete
Finite: How many children in a family are disabled?
Infinite: How many people have blood cancer in a given population?
Continuous: How long will a light bulb last?
Intervals and preferences: How do you rank public transport service reliability, on a scale of 1 to 5?
6/28 Probability Density Function
For the discrete rv X. Example: the PDF of the sum of the numbers shown on two dice.

7/28 Probability Density Function
f(x) is the PDF of continuous rv X if f(x) ≥ 0, the total area under f equals 1, and P(a ≤ X ≤ b) is the area under f between a and b.

8/28 Joint PDF
Let X and Y be two discrete random variables. Examples of joint PDFs.

9/28 Joint PDF
Marginal PDF: f(x) = Σy f(x, y)
Conditional PDF: f(y | x) = f(x, y) / f(x)
Statistical independence: X and Y are independent if f(x, y) = f(x) f(y).

10/28 Characteristics of Probability Distributions

11/28 Characteristics of Probability Distributions
The correlation coefficient ρ is defined as ρ = cov(X, Y) / (σX σY).

12/28 Higher Moments of Probability Distributions
Second moment: E(X − μ)²
Third moment: E(X − μ)³
Fourth moment: E(X − μ)⁴

13/28 Skewness and Kurtosis
Skewness: lack of symmetry.
Kurtosis: tallness or flatness.

14/28 Normal Distribution
The best known of all the theoretical probability distributions.

15/28 Standardized Normal Distribution
Its mean is 0 and its variance is 1.

16/28 Area Under the Curve

17/28 Example
Assume that X ∼ N(8, 4). What is the probability that X will assume a value between X1 = 4 and X2 = 12?
To compute the required probability, we compute the Z values as Z1 = (4 − 8)/2 = −2 and Z2 = (12 − 8)/2 = 2.
Now we observe that Pr(0 ≤ Z ≤ 2) = 0.4772. Then, by symmetry, we have Pr(−2 ≤ Z ≤ 0) = 0.4772. Therefore, the required probability is 0.4772 + 0.4772 = 0.9544.

18/28 Central Limit Theorem
Consider n independent random variables, all of which have the same PDF with mean μ and variance σ². The sample mean approaches the normal distribution with mean μ and variance σ²/n as n increases indefinitely.
This result holds true regardless of the form of the PDF.

19/28 Normality Test
How do the values of skewness and kurtosis depart from the norms of 0 and 3?
Jarque–Bera (JB) test: under the null hypothesis of normality, JB is distributed as a chi-square statistic with 2 df.
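The normal-probability example above (X ∼ N(8, 4)) can be checked numerically with the standard normal CDF, which is expressible through the error function:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# X ~ N(8, 4): mean 8, variance 4, so sigma = 2
mu, sigma = 8.0, 2.0
z1 = (4.0 - mu) / sigma   # -2
z2 = (12.0 - mu) / sigma  # +2
p = phi(z2) - phi(z1)
print(round(p, 4))  # 0.9545
```

The small difference from 0.9544 in the slide comes from the four-digit rounding of the tabulated value 0.4772.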
20/28 Chi-Square Distribution
The sum of squares of k independent standardized normal variables possesses the χ² distribution with k degrees of freedom (df).
df means the number of independent quantities in the sum.
The mean of the chi-square distribution is k, and its variance is 2k, where k is the df.
As the number of df increases, the distribution becomes increasingly symmetrical.
If Z1 and Z2 are two independent chi-square variables with k1 and k2 df, then the sum Z1 + Z2 is also a chi-square variable with df = k1 + k2.

21/28 Student's t Distribution
If Z1 ∼ N(0, 1) and Z2 ∼ χ²k are independent, then t = Z1 / √(Z2/k) ∼ tk, i.e. t follows Student's t distribution with k df.
Symmetrical, but flatter than normal. The mean is zero, and the variance is k/(k − 2).
As k increases, the t distribution approximates the normal distribution.

22/28 F Distribution
If Z1 ∼ χ²k1 and Z2 ∼ χ²k2 are independent, then F = (Z1/k1)/(Z2/k2) ∼ Fk1,k2, i.e. F follows Fisher's F distribution with k1 and k2 df.
The mean is k2/(k2 − 2), and the variance is 2k2²(k1 + k2 − 2) / [k1(k2 − 2)²(k2 − 4)] for k2 > 4.
As k1 and k2 increase, the F distribution approaches the normal distribution.
The square of a t-distributed variable with k df has an F distribution: t²k ∼ F1,k.
If k2 is fairly large, then k1·F ∼ χ²k1.

23/28 Distributions Related to Normal
Since for large df the t, chi-square, and F distributions approach the normal distribution, these distributions are known as the distributions related to the normal distribution.

24/28 Bernoulli Distribution
Its PDF: P(X = 0) = 1 − p, P(X = 1) = p (failure/success).
Mean is p and variance is p(1 − p).

25/28 Binomial Distribution
Generalization of the Bernoulli distribution.
n: number of independent trials; p: success probability; q: failure probability; X: number of successes.
Mean is np, and variance is npq.

26/28 Poisson Distribution
PDF: f(x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, . . .
E(X) = var(X) = λ.
Is used to model rare events.
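The binomial mean np and variance npq can be verified directly by summing over the PMF; the parameters n = 10, p = 0.3 below are illustrative:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean)**2 * binom_pmf(x, n, p) for x in range(n + 1))
print(round(mean, 6), round(var, 6))  # mean = np = 3.0, var = npq = 2.1
```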
27/28 Conditional Normal Distribution
Let (x, y) be distributed as bivariate normal. Then f(y | x) = f(x, y) / f(x), where, writing zx = (x − μx)/σx and zy = (y − μy)/σy:

f(x, y) = [1 / (2π σx σy √(1 − ρ²))] exp{ −(zx² − 2ρ zx zy + zy²) / [2(1 − ρ²)] }

f(x) = [1 / (σx √(2π))] exp{ −zx² / 2 }

f(y | x) = [1 / (σy|x √(2π))] exp{ −[y − μy − ρ(σy/σx)(x − μx)]² / (2σ²y|x) },  where σ²y|x = σy²(1 − ρ²).

28/28 Conditional Normal Distribution
Y and Y | X.

1/22 Basic Econometrics in Transportation
Statistical Inference
Outline: Problem of Estimation (Point, Interval); Estimation Methods; Statistical Properties (Small sample, Large sample); Hypothesis Testing
Amir Samimi, Civil Engineering Department, Sharif University of Technology
Primary Sources: Basic Econometrics by Gujarati and lecture notes of Professor Greene

2/22 Estimation
The population has characteristics: mean, variance, median, percentiles.
Get a slice (random sample) of the population. How would you obtain the value of its parameters?
This is known as the problem of estimation: point estimation and interval estimation.

3/22 Point Estimation
For simplicity, assume that there is only one unknown parameter (θ) in the PDF.
Develop a function θ̂ of the sample values; θ̂ is a statistic, or an estimator, and is a random variable.
A particular numerical value of the estimator is known as an estimate.
θ̂ is known as a point estimator because it provides a single (point) estimate of θ.

4/22 Example
Let X̄ be the sample mean, an estimator of the true mean value (μ). If in a specific case X̄ = 50, this provides an estimate of μ.

5/22 Interval Estimation
An interval between two values that may include the true θ.
The key concept is the probability distribution of an estimator.
Will the interval contain the true value? The interval is a "confidence interval."
The degree of certainty is the degree of confidence; its complement is the level of significance.
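A minimal sketch of how such an interval is computed when σ is known (all numbers are illustrative, chosen to match the height example that follows):

```python
import math

# Known-sigma 95% CI for a mean: xbar +/- z * sigma / sqrt(n)
xbar, sigma, n = 67.0, 2.5, 100
z = 1.96  # approximate 97.5th percentile of the standard normal
half = z * sigma / math.sqrt(n)
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))  # 66.51 67.49
```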
6/22 Example
Suppose that the distribution of the height of men in a population is normal with mean = μ inches and σ = 2.5 inches. A sample of 100 men drawn randomly from this population had an average height of 67 inches. Establish a 95% confidence interval for the mean height (= μ) in the population as a whole.

7/22 Estimation Methods
Least Squares (LS): considerable time will be devoted to illustrating the LS method.
Maximum Likelihood (ML): this method has broad application and will also be discussed partially in the last lectures.
Method of Moments (MOM): is becoming more popular, but not at the introductory level.

8/22 Statistical Properties
The statistics fall into two categories:
Small-sample, or finite-sample, properties.
Large-sample, or asymptotic, properties.
Underlying both these sets of properties is the notion that an estimator has a probability distribution.

9/22 Small-Sample Properties
Unbiasedness: θ̂ is an unbiased estimator of θ if E(θ̂) = θ; Bias = E(θ̂) − θ.
Distribution of two estimators of θ.

10/22 Small-Sample Properties
Minimum variance.

11/22 Small-Sample Properties
Linearity: an estimator θ̂ is said to be a linear estimator of θ if it is a linear function of the sample observations.
An estimator that is linear, unbiased, and has minimum variance is called the Best Linear Unbiased Estimator (BLUE).

12/22 Small-Sample Properties
Minimum mean-square error: MSE(θ̂) = var(θ̂) + [Bias(θ̂)]².

13/22 Large-Sample Properties
Often an estimator does not satisfy one or more of the desirable statistical properties in small samples, but as the sample size increases indefinitely it comes to possess several desirable statistical properties.
Asymptotic unbiasedness. Example: the sample variance of a random variable.

14/22 and 15/22 Large-Sample Properties
Consistency: θ̂ is said to be a consistent estimator if it approaches the true value θ as the sample size gets larger and larger.
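A small simulation sketch of consistency (the distribution and parameters are illustrative): the mean-squared error of the sample mean shrinks as n grows.

```python
import random

random.seed(0)
mu = 5.0  # illustrative true mean

def mse_of_sample_mean(n, reps=2000):
    """Average squared error of the sample mean over many replications."""
    total = 0.0
    for _ in range(reps):
        xbar = sum(random.gauss(mu, 2.0) for _ in range(n)) / n
        total += (xbar - mu)**2
    return total / reps

print(mse_of_sample_mean(10) > mse_of_sample_mean(100))  # True
```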
A sufficient condition for consistency is that the bias and variance both tend to zero (or the MSE tends to zero) as the sample size increases indefinitely.
Unbiasedness and consistency are conceptually very much different.
If θ̂ is a consistent estimator of θ and h(·) is a continuous function, then h(θ̂) is a consistent estimator of h(θ).
Note that this property does not hold true of the expectation operator E: E(θ̂) = θ does not imply E[h(θ̂)] = h(θ).

16/22 Large-Sample Properties
Asymptotic efficiency: the variance of the asymptotic distribution of θ̂ is called its asymptotic variance. If θ̂ is consistent and its asymptotic variance is smaller than the asymptotic variance of all other consistent estimators of θ, it is called asymptotically efficient.
Asymptotic normality: an estimator θ̂ is said to be asymptotically normally distributed if its sampling distribution tends to approach the normal distribution as the sample size n increases indefinitely.

17/22 Hypothesis Testing
Assume X with a known PDF f(x; θ). We obtain the point estimator θ̂ from a random sample.
Since the true θ is rarely known, we raise the question: is the estimator θ̂ "compatible" with some hypothesized value of θ, say θ*?
To test the null hypothesis, we use the sample information to obtain what is known as the test statistic.
Use the confidence-interval or test-of-significance approach to test the null hypothesis.

18/22 Example
From the previous example, which was concerned with the height (X) of men in a population: could the sample with a mean of 67, the test statistic, have come from the population with the mean value of 69?

19/22 Confidence Interval Approach
State the null hypothesis H0 and the alternative hypothesis H1: H0: μ = 69 and H1: μ ≠ 69.
Select the test statistic, X̄, and determine its probability distribution.
Choose the level of significance (α).
Using the probability distribution of the test statistic, establish a 100(1 − α)% confidence interval.
Check if the value of the parameter lies in the acceptance region.

20/22 Test of Significance Approach

21/22 Errors
A Type I error is likely to be more serious in practice.
The probability of a Type I error is designated as α and is called the level of significance.
The probability of not committing a Type II error is called the power of the test.

22/22 Homework 1
Statistical Inference (Casella and Berger, 2001)
1. Chapter 3, Problem 23 [20 points]
2. Chapter 4, Problem 1 [20 points]
3. Chapter 4, Problem 7 [15 points]
4. Chapter 5, Problem 18 (a, b, and d) [45 points]

1/23 Basic Econometrics in Transportation
Introduction to Regression Analysis
Amir Samimi, Civil Engineering Department, Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

2/23 Definition
Study of the dependence of one variable on one or more other variables, with a view to estimating and/or predicting the average value of the former in terms of the known values of the latter.

3/23 Variables
In regression analysis we are concerned with statistical dependence among variables, not functional or deterministic dependence such as Newton's law of gravity.
Explanatory variables will not enable us to predict the dependent variable exactly because of:
Errors involved in measuring these variables.
Other factors that collectively affect Y but may be difficult to identify individually.

4/23 Causation
A statistical relationship in itself cannot logically imply causation. Ideas of causation must come from outside statistics, ultimately from some theory or other.
Example of crop yield and rainfall.

5/23 Correlation
Closely related to, but conceptually very different from, regression.
In correlation analysis the objective is to measure the degree of linear association between two variables; both the dependent and independent variables are assumed to be random.
In regression analysis we try to estimate the average value of one variable on the basis of the fixed values of other variables; the dependent variable is assumed to be random, while the explanatory variables are assumed to have fixed values.

6/23 Terminology
Two-variable (bivariate) regression analysis; multiple regression analysis.
Y denotes the dependent variable.
X's (X1, X2, . . . , Xk) denote the explanatory variables.
Xki (or Xkt) denotes the ith (or tth) observation on Xk.
N denotes the total number of observations in the population, and n the total number of observations in a sample.
Subscript i is for cross-sectional data and t for time-series data.

7/23 Bivariate Regression Analysis

8/23 Example
Weekly income (X) and weekly consumption expenditure (Y), both in dollars.

9/23 Example

10/23 Essence of Regression Analysis
What is the expected value of the weekly consumption expenditure of a family? $121.20 (the unconditional mean).
What is the expected value of the weekly consumption expenditure of a family whose weekly income is $140? $101 (the conditional mean).
Knowledge of the income level enables us to better predict the mean value of consumption expenditure.

11/23 Regression Curve
A population regression curve is the curve connecting the means of the subpopulations of Y corresponding to the given values of X.
For each X there is a population of Y values that are spread around the mean of those Y values.

12/23 Population Regression Function
What form does the function f(Xi) assume? The form of the PRF is an empirical question.

13/23 Linearity
In variables: the regression curve is a straight line.
A linear function may be hypothesized: E(Y | Xi) = β1 + β2Xi.
E(Y | Xi) = β1 + β2Xi² is not a function that is linear in the variables.
In parameters: E(Y | Xi) = β1 + β2Xi² is a linear model, while E(Y | Xi) = β1 + β2²Xi is not a linear model.
We are primarily concerned with linear models: the term "linear" regression means a regression that is linear in the parameters.

14/23 Error Term
As family income increases, family consumption expenditure on average increases. What about individual families?
The deviation of an individual Yi around its expected value is ui: Yi = E(Y | Xi) + ui.
ui is the stochastic disturbance, or stochastic error term.
Yi can be expressed as the sum of two components:
A systematic, or deterministic, component: E(Y | Xi).
A nonsystematic component: ui.

15/23 Example
In a linear model, the consumption expenditures of the families with $80 income can be expressed as:
55 = β1 + β2(80) + u1
60 = β1 + β2(80) + u2
65 = β1 + β2(80) + u3
70 = β1 + β2(80) + u4
75 = β1 + β2(80) + u5

16/23 Significance of Error Term
Instead of having an error term, why not introduce these variables into the model explicitly?
Vagueness of theory: we might be ignorant or unsure about the other variables affecting Y.
Unavailability of data: we do not have access to all sorts of data (e.g., family wealth).
Core variables versus peripheral variables: the influence of some variables is so small that they are ignored for cost considerations.

17/23 Significance of Error Term
Intrinsic randomness in human behavior: randomness in individual Y's cannot be explained no matter how hard we try.
Poor proxy variables: observed variables may not equal desired variables.
Principle of parsimony: we would like to keep our regression model as simple as possible.
Wrong functional form.

18/23 Sample Regression Function
In real situations we only have a sample of X and Y values.
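The fact that different samples from the same population give different fitted lines can be sketched in code; the true parameters, the X grid, and the least-squares formulas used here are illustrative (the formulas themselves are derived in the next lecture):

```python
import random

# Two random samples from the same PRF yield different sample regression lines.
random.seed(1)
beta1, beta2, sigma = 10.0, 0.5, 5.0  # illustrative true parameters
X = list(range(80, 261, 20))

def fitted_line():
    """Draw Y = beta1 + beta2*X + u and return the OLS (b1, b2) estimates."""
    Y = [beta1 + beta2 * x + random.gauss(0, sigma) for x in X]
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    b2 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
         sum((x - xbar)**2 for x in X)
    return ybar - b2 * xbar, b2

line1, line2 = fitted_line(), fitted_line()
print(line1 != line2)  # True: each sample gives its own SRF
```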
From our previous example, suppose we have two samples: Sample 1 and Sample 2. Which one represents the "true" population regression line?

19/23 Sample Regression Function
Analogously to the PRF, we can develop the concept of the sample regression function (SRF) to represent the sample regression line:
Ŷi = β̂1 + β̂2Xi
where Ŷi is the estimator of E(Y | Xi), β̂1 the estimator of β1, and β̂2 the estimator of β2.

20/23 Sample Regression Function
Analogous to ui: Yi = Ŷi + ûi.
Note: an estimator is a formula or method that tells how to estimate the population parameter from the information provided by the sample at hand.

21/23 Sample Regression Function
The primary objective in regression analysis is to estimate the PRF
Yi = β1 + β2Xi + ui
on the basis of the SRF
Yi = β̂1 + β̂2Xi + ûi.

22/23 Sample and Population Regression Lines
Ŷi = β̂1 + β̂2Xi
Yi = β̂1 + β̂2Xi + ûi

23/23 An Example

1/60 Basic Econometrics in Transportation
Bivariate Regression Analysis
Outline: Problem of Estimation — Ordinary Least Squares (OLS), Maximum Likelihood (ML)
Amir Samimi, Civil Engineering Department, Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

2/60 Problem of Estimation
Generally, OLS is used extensively in regression analysis:
It is intuitively appealing.
It is mathematically much simpler than MLE.
Both methods generally give similar results in the linear regression context.

3/60 Ordinary Least Squares Method
To determine the PRF Yi = β1 + β2Xi + ui (population regression function), we estimate it from the SRF Yi = β̂1 + β̂2Xi + ûi (sample regression function).
We would like to determine the SRF in such a way that it is as close as possible to the actual Y, i.e. so that the sum of the squared residuals, Σûi² = f(β̂1, β̂2), is as small as possible.
Why square? It removes the sign of the residuals and gives more weight to large residuals.
OLS yields the unique estimates of β1 and β2 that give the smallest possible value of the function above:
β̂2 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = Σxiyi / Σxi²
β̂1 = Ȳ − β̂2X̄
Deviation form: yi = Yi − Ȳ and xi = Xi − X̄.

4/60 OLS Properties: Sample Mean
The regression line passes through the sample means (X̄, Ȳ).

5/60 OLS Properties: Linearity
β̂2 is a linear function (a weighted average) of Y: Xi, and thus the weights ki, are nonstochastic.

6/60 OLS Properties: Unbiasedness

7/60 OLS Properties: Mean of Estimated Y
The mean value of the estimated Y is equal to the mean value of the actual Y: sum both sides over the sample values and divide by the sample size.

8/60 OLS Properties: Mean of Residuals
The mean value of the residuals is zero.

9/60 OLS Properties: Uncorrelated Residuals (Ŷ)
The residuals are uncorrelated with the predicted Yi.

10/60 OLS Properties: Uncorrelated Residuals (X)
The residuals are uncorrelated with Xi.

11/60 OLS Assumptions
Our objective is not only to estimate some coefficients but also to draw inferences about the true coefficients; thus certain assumptions are made about the manner in which the Yi are generated: Yi = β1 + β2Xi + ui.
Unless we are specific about how Xi and ui are created, there is no way we can make any statistical inference about the Yi, β1, and β2.
The Gaussian standard, or classical linear regression model (CLRM), makes 10 assumptions that are extremely critical to the valid interpretation of the regression estimates.

12/60 Assumption 1
The regression model is linear in the parameters.
E(Y | Xi) = β1 + β2Xi² is a linear model.
E(Y | Xi) = β1 + β2²Xi is not a linear model.

13/60 Assumption 2
X values are fixed in repeated sampling: X is assumed to be non-stochastic.
Our regression analysis is conditional regression analysis, that is, conditional on the given values of the regressor(s) X.
14/60 Assumption 3
Zero mean value of the disturbance ui: E(ui | Xi) = 0.
Factors not explicitly included in the model, and therefore subsumed in ui, do not systematically affect the mean value of Y.

15/60 Assumption 4
Homoscedasticity, or equal variance of ui.

16/60 Assumption 4
Heteroscedasticity.

17/60 Assumption 5
No autocorrelation (serial correlation) between the disturbances.

18/60 Assumption 6
The disturbance u and explanatory variable X are uncorrelated.
If X and u are correlated, their individual effects on Y may not be assessed.

19/60 Assumption 7
n must be greater than the number of explanatory variables.
Obviously, we need at least two pairs of observations to estimate the two unknowns!

20/60 Assumption 8
Variability in X values.
Mathematically, if all the X values are identical, it is impossible to estimate β2 (the denominator will be zero) and therefore β1. Intuitively, it is obvious as well.

21/60 Assumption 9
The regression model is correctly specified.
Important questions that arise in the specification of a model:
What variables should be included in the model?
What is the functional form of the model? Is it linear in the parameters, the variables, or both?
What are the probabilistic assumptions made about the Yi, the Xi, and the ui entering the model?

22/60 Assumption 10
There is no perfect multicollinearity: no perfect linear relationships among the explanatory variables.
This will be further discussed in multiple regression models.

23/60 Assumptions
How realistic are all these assumptions? We make certain assumptions because they facilitate the study, not because they are realistic.
Consequences of violation of the CLRM assumptions will be examined later.
We will look into: precision of OLS estimates, and statistical properties of OLS.
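The OLS formulas from the preceding slides can be sketched as follows; the ten income–consumption pairs are illustrative, in the spirit of the weekly income–expenditure example:

```python
# A minimal OLS sketch using the deviation-form formulas
# (weekly income X, weekly consumption expenditure Y; illustrative data).
X = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
# beta2_hat = sum(x_i * y_i) / sum(x_i^2) in deviation form
b2 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar)**2 for x in X)
b1 = ybar - b2 * xbar            # beta1_hat = Ybar - beta2_hat * Xbar

# Goodness of fit: r^2 = ESS/TSS = 1 - RSS/TSS
tss = sum((y - ybar)**2 for y in Y)
rss = sum((y - (b1 + b2 * x))**2 for x, y in zip(X, Y))
r2 = 1 - rss / tss
print(round(b1, 4), round(b2, 4), round(r2, 4))  # 24.4545 0.5091 0.9621
```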
24/60 Precision of OLS Estimates
The precision of an estimate is measured by its standard error.

25/60 Standard Errors
var(β̂2) = σ² / Σxi²  and  var(β̂1) = σ² ΣXi² / (n Σxi²).

26/60 Homoscedastic Variance of ui
How to estimate σ².

27/60 Features of the Variances
The variance of β̂2 is directly proportional to σ² but inversely proportional to Σxi²:
As n increases, the precision with which β2 can be estimated also increases.
If there is substantial variation in X, β2 can be measured more accurately.
The variance of β̂1 is directly proportional to σ² and ΣXi², but inversely proportional to Σxi² and the sample size n.
Note: if the slope coefficient is overestimated, the intercept will be underestimated.

28/60 Gauss–Markov Theorem
Given the assumptions of the classical linear regression model, the OLS estimators, in the class of unbiased linear estimators, have minimum variance; that is, they are BLUE.
The theorem makes no assumptions about the probability distribution of ui, and therefore of Yi.

29/60 Goodness of Fit
The coefficient of determination R² is a summary measure that tells how well the sample regression line fits the data.
Total Sum of Squares (TSS) = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS).

30/60 Consistency
An asymptotic property. An estimator is consistent if it is unbiased and its variance tends to zero as the sample size n tends to infinity (a sufficient condition). Unbiasedness is already proved.

31/60 Before Hypothesis Testing
Using the method of OLS we can estimate β1, β2, and σ².
The estimators (β̂1, β̂2) are random variables. To draw inferences about the PRF, we must find out how close β̂2 is to the true β2. We need to find the PDF of the estimators.

32/60 Probability Distribution of Disturbances
β̂2 is ultimately a linear function of the random variable ui, which is random by assumption.

33/60 Why the Normality Assumption?
We hope that the influence of these omitted or neglected variables is small and at best random.
The nature of the probability distribution of ui plays an extremely important role in hypothesis testing. It is usually assumed that ui ∼ NID(0, σ²), where NID means Normally and Independently Distributed.
Central limit theorem (CLT): let X1, X2, . . . , Xn denote n independent random variables, all of which have the same PDF with mean = μ and variance = σ². We show by the CLT that, if there are a large number of IID random variables, the distribution of their sum tends to a normal distribution as the number of such variables increases indefinitely.

34/60 Why the Normality Assumption?
A variant of the CLT states that, even if the number of variables is not very large or if these variables are not strictly independent, their sum may still be normally distributed.
With this assumption, the PDF of the OLS estimators can be easily derived, as any linear function of normally distributed variables is itself normally distributed.
The normal distribution is a comparatively simple distribution involving only two parameters.
It enables us to use the t, F, and χ² tests for regression models.

35/60 Normality Test
Since we are "imposing" the normality assumption, it behooves us to find out, in practical applications involving small sample size data, whether it is appropriate. Later, we will introduce some tests to do just that.
We will come across situations where the normality assumption may be inappropriate; until then we will continue with it.
The normality assumption plays a critical role for small sample sizes. In reasonably large samples, we may relax it.
36/60 Estimators' Properties with the Normality Assumption
They are unbiased, with minimum variance (efficient), and consistent.
(n − 2)σ̂²/σ² is distributed as χ² with (n − 2) df; this will help us draw inferences about the true σ² from the estimated σ².
The estimators have minimum variance in the entire class of unbiased estimators, whether linear or not: they are Best Unbiased Estimators (BUE).

37/60 Method of Maximum Likelihood
If the ui are assumed to be normally distributed, the ML and OLS estimators of the regression coefficients are identical.
The ML estimator of σ² is biased in small samples; asymptotically, it is unbiased.
The ML method can be applied to regression models that are nonlinear in the parameters, in which OLS is generally not used. The importance of this will be explained later.

38/60 ML Estimation
Assume the two-variable model Yi = β1 + β2Xi + ui, in which β1, β2, and σ² are unknowns.
Having independent Y's, the joint PDF of Y1, . . . , Yn can be written as a likelihood function.

39/60 ML Estimation
The method of maximum likelihood consists in estimating the unknown parameters in such a manner that the probability of observing the given Y's is as high as possible.

40/60 ML Estimation
From the first-order conditions for optimization, note how ML underestimates the true σ² in small samples: σ̂²ML = Σûi²/n rather than the unbiased Σûi²/(n − 2).

41/60 Interval Estimation
How reliable are the point estimates?
We try to find two positive numbers δ and α such that the probability of constructing an interval that contains β2 is 1 − α. Such an interval is known as a confidence interval, and α (0 < α < 1) is known as the level of significance.
How are the confidence intervals constructed? If the probability distributions of the estimators are known, the task of constructing confidence intervals is a simple one.
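A sketch of constructing such a 95% interval for the slope; the data and the hard-coded t critical value (2.306 for 8 df) are illustrative:

```python
import math

# 95% CI for the slope of a bivariate OLS fit (illustrative data).
X = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
sxx = sum((x - xbar)**2 for x in X)
b2 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx
b1 = ybar - b2 * xbar
rss = sum((y - (b1 + b2 * x))**2 for x, y in zip(X, Y))
se_b2 = math.sqrt((rss / (n - 2)) / sxx)  # se(beta2_hat) = sigma_hat / sqrt(sum x_i^2)

t_crit = 2.306  # 97.5th percentile of t with n - 2 = 8 df
lo, hi = b2 - t_crit * se_b2, b2 + t_crit * se_b2
print(round(lo, 4), round(hi, 4))
```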
42/60 Confidence Intervals for β1 and β2
It can be shown that t = (β̂2 − β2)/se(β̂2) follows the t distribution with n − 2 df.
The width of the confidence interval is proportional to the standard error of the estimator. The same holds for β1.

43/60 Confidence Interval for σ²
It can be shown that under the normality assumption the variable (n − 2)σ̂²/σ² follows the χ² distribution with n − 2 df.
Interpretation of this interval: if we establish 95% confidence limits on σ², and if we maintain a priori that these limits will include the true σ², we shall be right in the long run 95 percent of the time.

44/60 Hypothesis Testing
Is a given observation compatible with some stated hypothesis?
In statistics, the stated hypothesis is known as the null hypothesis, H0 (versus an alternative hypothesis, H1).
Hypothesis testing is developing rules for rejecting or accepting the null hypothesis: the confidence-interval approach and the test-of-significance approach.
Most of the statistical hypotheses of our interest make statements about one or more values of the parameters of some assumed probability distribution such as the normal, F, t, or χ².

45/60 Confidence Interval Approach
Decision rule: construct a 100(1 − α)% confidence interval for β2. If the β2 under H0 falls within this interval, do not reject H0; if it falls outside this interval, reject H0.
Note: there is a 100α percent chance of committing a Type I error. If α = 0.05, there is a 5 percent chance that we could reject the null hypothesis even though it is true.
When we reject the null hypothesis, we say that our finding is statistically significant.
One-tail or two-tail test: sometimes we have a strong expectation that the alternative hypothesis is one-sided rather than two-sided.
46/60 Test of Significance Approach
In the confidence-interval procedure we try to establish a range that has a given probability of including the true but unknown β2.
In the test-of-significance approach we hypothesize some value for β2 and try to see whether the estimated β2 lies within confidence limits around the hypothesized value.
A large t value will be evidence against the null hypothesis.

47/60 Practical Aspects
Accepting the "null" hypothesis: all we can say is that, based on the sample evidence, we have no reason to reject it; another null hypothesis may be equally compatible with the data.
2-t rule of thumb: if df > 20 and α = 0.05, then the null hypothesis β2 = 0 can be rejected if |t| > 2. In these cases we do not even have to refer to the t table to assess the significance of the estimated slope coefficient.
Forming the null hypotheses: theoretical expectations or prior empirical work can be relied upon to formulate hypotheses.
p value: the lowest significance level at which a null hypothesis can be rejected.

48/60 Analysis of Variance
A study of the two components of TSS (= ESS + RSS) is known as ANOVA from the regression viewpoint.

49/60 Analysis of Variance
If we assume that the disturbances ui are normally distributed, which we do under the CNLRM, and if the null hypothesis (H0) is that β2 = 0, then it can be shown that the F ratio follows the F distribution with 1 df in the numerator and (n − 2) df in the denominator.
What use can be made of the preceding F ratio?

50/60 F-ratio
It can be shown that E(ESS) = σ² + β2²Σxi² and E(RSS/(n − 2)) = σ²; note that β2 and σ² here are the true parameters.
If β2 is zero, both equations provide us with identical estimates of the true σ²; thus, X has no linear influence on Y.

51/60 Example
Compute the F ratio and obtain the p value of the computed F statistic.
The p value of this F statistic with 1 and 8 df is 0.0000001.
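For the illustrative sample used in the earlier sketches, the F ratio can be computed as:

```python
# F ratio for H0: beta2 = 0 in the bivariate model: F = (ESS/1) / (RSS/(n-2)).
X = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
sxx = sum((x - xbar)**2 for x in X)
b2 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sxx

ess = b2**2 * sxx                          # explained sum of squares
tss = sum((y - ybar)**2 for y in Y)        # total sum of squares
rss = tss - ess                            # residual sum of squares
F = ess / (rss / (n - 2))
print(round(F, 2))  # 202.87; in the bivariate model F equals the t statistic squared
```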
Therefore, if we reject the null hypothesis, the probability of committing a Type I error is very small.

52/60 Application of Regression Analysis

One use is to "predict" or "forecast" the future consumption expenditure Y corresponding to some given level of income X. There are two kinds of predictions: prediction of the conditional mean value of Y corresponding to a chosen X, and prediction of an individual Y value corresponding to a chosen X.

53/60 Mean Prediction

Estimator of E(Y | X0). It can be shown that this statistic follows the t distribution with n − 2 df and may be used to derive confidence intervals.

54/60 Individual Prediction

Estimator of the individual Y0 corresponding to X0. It can be shown that this statistic also follows the t distribution with n − 2 df and may be used to derive confidence intervals.

55/60 Confidence Bands

56/60 Individual Versus Mean Prediction

The confidence interval for an individual Y0 is wider than that for the mean value of Y0. The width of the confidence bands is smallest when X0 equals the sample mean of X.

57/60 Reporting the Results

58/60 Evaluating the Results

How "good" is the fitted model? Any standard? We look at: Are the signs of the estimated coefficients in accordance with theoretical or prior expectations? How well does the model explain variation in Y (one can use r2)? Does the model satisfy the assumptions of CNLRM? For now, we would like to check the normality of the disturbance term. Recall that the t and F tests require that the error term follow the normal distribution.

59/60 Normality Tests

Several tests exist in the literature. Histogram of residuals: a simple graphic device to learn about the shape of the PDF. Horizontal axis: the values of the OLS residuals, divided into suitable intervals. Vertical axis: rectangles equal in height to the frequency in each interval. From a normal population we will get a bell-shaped PDF.
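These shape checks can be quantified with the Jarque–Bera statistic from the earlier probability-theory notes. A numpy-only sketch on a simulated residual series (the series, the seed, and the hard-coded χ2 critical value are assumptions; in practice the OLS residuals would be used):

```python
import numpy as np

# Jarque-Bera normality check: JB = n [ S^2/6 + (K - 3)^2/24 ] is
# asymptotically chi-square with 2 df under the null of normality.
rng = np.random.default_rng(1)
u_hat = rng.normal(0.0, 2.0, 500)     # stand-in for OLS residuals

n = len(u_hat)
dev = u_hat - u_hat.mean()
s = u_hat.std()                       # ML standard deviation (ddof = 0)
S = np.mean(dev ** 3) / s ** 3        # sample skewness (normal: 0)
K = np.mean(dev ** 4) / s ** 4        # sample kurtosis (normal: 3)

JB = n * (S ** 2 / 6 + (K - 3) ** 2 / 24)
# chi-square 5% critical value with 2 df is 5.99 (from tables)
looks_normal = JB < 5.99
```

Large skewness or excess kurtosis inflates JB, so a big JB value (relative to the χ2(2) critical value) rejects normality.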
Normal probability plot (NPP): a simple graphic device. Horizontal axis: plot the values of the OLS residuals. Vertical axis: show the expected value of the variable if it were normally distributed. From a normal population we will get a straight line.

The Jarque–Bera test: an asymptotic test, with a chi-squared distribution and 2 df.

60/60 Homework 2

Basic Econometrics (Gujarati, 2003):
1. Chapter 3, Problem 21 [10 points]
2. Chapter 3, Problem 23 [30 points]
3. Chapter 5, Problem 9 [30 points]
4. Chapter 5, Problem 19 [30 points]
Assignment weight factor = 1

1/24 Basic Econometrics in Transportation — Bivariate Regression Discussion. Amir Samimi, Civil Engineering Department, Sharif University of Technology. Primary source: Basic Econometrics (Gujarati).

Topics: First we consider the case of regression through the origin: a situation where the intercept term is absent from the model. Then we consider the question of the units of measurement: whether a change in the units of measurement affects the regression results. Finally, we consider the question of the functional form of the linear regression model. So far we have considered models that are linear in the parameters as well as in the variables.

2/24 Regression Through the Origin

There are occasions when the PRF assumes the following form: Yi = β2Xi + ui. Sometimes the underlying theory dictates that the intercept term be absent from the model. Cost analysis theory: the variable cost of production is proportional to output. Some versions of monetarist theory: the rate of inflation is proportional to the rate of change of the money supply.

3/24 Estimation

Write the SRF accordingly and apply the OLS method. How do we estimate such models, and what special problems do they pose?
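A minimal sketch of estimating a through-the-origin model on simulated data (the data and seed are assumptions). Note the raw sums of squares and cross products, and the n − 1 degrees of freedom:

```python
import numpy as np

# Regression through the origin: Y_i = b2 * X_i + u_i (no intercept).
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 40)
y = 1.5 * x + rng.normal(0, 1.0, 40)   # true slope 1.5 (assumed)

# Slope uses RAW sums of squares and cross products: b2 = sum(XY)/sum(X^2)
b2_origin = np.sum(x * y) / np.sum(x ** 2)

# sigma^2 is estimated with n - 1 df (only one parameter is estimated)
resid = y - b2_origin * x
sigma2_hat = np.sum(resid ** 2) / (len(x) - 1)

# Raw r^2 -- NOT directly comparable with the conventional r^2
raw_r2 = np.sum(x * y) ** 2 / (np.sum(x ** 2) * np.sum(y ** 2))
```

The raw r2 is bounded by 1 (Cauchy–Schwarz) but is computed from uncentered sums, which is why it cannot be compared with the conventional, mean-adjusted r2.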
Comparison with the formulas obtained with an intercept:

4/24 Intercept-present Versus Intercept-less

Intercept-present: adjusted (from the mean) sums of squares and cross products; df for estimating σ2 is (n − 2); the sum of residuals is always zero; the coefficient of determination is always between 0 and 1.

Intercept-less: raw sums of squares and cross products; df for estimating σ2 is (n − 1); the sum of residuals need not be zero; the coefficient of determination is not always between 0 and 1.

5/24 R2 for the Interceptless Model

Compute the raw R2 for such models. The raw r2 is not directly comparable to the conventional r2 value; some authors do not report an r2 value for interceptless models. Unless there is a very strong a priori expectation, one would be well advised to stick to the conventional models. If the intercept turns out to be statistically insignificant, we have a regression through the origin. If there is an intercept but we insist on fitting a regression through the origin, we would be committing a specification error, thus violating Assumption 9.

6/24 Units of Measurement

Do the units in which the variables are measured make any difference in the regression results? If so, what is the sensible course to follow in choosing units of measurement for regression analysis?

7/24 Units of Measurement

8/24 Discussion

One should choose the units of measurement sensibly. If the scaling factors are identical (w1 = w2): the slope coefficient and its standard error remain unaffected; the intercept and its standard error are both multiplied by w1. If the X scale is not changed: the slope as well as the intercept coefficients and their respective standard errors are all multiplied by the same factor. If the Y scale remains unchanged: ? Goodness of fit is not affected at all.

9/24 Standardized Variables

Subtract the mean value of the variable from its individual values and divide the difference by the standard deviation of that variable.
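The scaling claims above can be verified numerically. A sketch on simulated data (the data, seed, and the factor of 1000 are assumptions): rescaling X alone divides the slope by the scale factor, leaves the intercept alone, and leaves r2 untouched.

```python
import numpy as np

# Effect of rescaling X on OLS results (e.g., dollars -> thousands).
rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)

def ols(x, y):
    """Return intercept, slope, and r^2 of a bivariate OLS fit."""
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b1 = y.mean() - b2 * x.mean()
    r2 = 1 - np.sum((y - b1 - b2 * x) ** 2) / np.sum((y - y.mean()) ** 2)
    return b1, b2, r2

b1, b2, r2 = ols(x, y)
b1s, b2s, r2s = ols(1000 * x, y)   # same regression, X rescaled by 1000
```

The slope rescales exactly (b2s = b2/1000), which is why a "small" or "large" coefficient by itself says nothing until the units are known.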
10/24 Instead of running Yi = β1 + β2Xi + ui, we could run Yi* = β1* + β2*Xi* + ui* = β2*Xi* + ui*.

11/24 Interpretation

How do we interpret the beta coefficients? Note that the new variables are measured in standard deviation units. If the regressor increases by one standard deviation, on average the regressand increases by β2* standard deviation units.

Advantages? They are more apparent if there is more than one regressor: we put all regressors on an equal basis and compare them directly. If one coefficient is larger than another, it contributes more to the explanation of the regressand.

Note 1: For the standardized regression we have not given the r2 value, because this is a regression through the origin.

Note 2: There is an interesting relationship between the β of the conventional model and the standardized one, involving Sx (the sample standard deviation of the X regressor) and Sy (the sample standard deviation of the regressand).

12/24 Functional Forms of Regression Models

We consider some commonly used regression models that may be nonlinear in the variables but are linear in the parameters: the log-linear models, semilog models, reciprocal models, and the logarithmic reciprocal models. What are the special features of each model? When are they appropriate? How are they estimated?

13/24 Log-linear Models

Consider the exponential regression model, which in logs is linear in the parameters and linear in the logarithms of the variables, and can be estimated by OLS regression. Because of this linearity, such models are called log-log, double-log, or log-linear models.

14/24 Discussion

One attractive feature of the log-log model, which has made it popular in applied work: β2 measures the elasticity of Y with respect to X (why?).

15/24 Semilog Models

The log–lin model: used to measure the growth rate.
The log-log model assumes that the elasticity coefficient remains constant throughout, hence it is called the constant elasticity model. How to decide? The simplest way is to plot the scattergram of ln Yi against ln Xi and see whether the scatter points lie approximately on a straight line.

In the log–lin model, although α and β2 have unbiased estimators, β1 does not. In most practical problems, however, β1 is of secondary importance. The regressand is the logarithm of Y and the regressor is "time". A model in which Y is logarithmic is called a log-lin model; a model in which Y is linear but the X(s) are logarithmic is a lin-log model.

16/24 Semilog Models

The lin–log model: we want to find the absolute change in Y for a percent change in X. In a log–lin model, multiply β2 by 100 for a more meaningful figure; in a lin-log model, divide β2 by 100 for a more meaningful figure.

17/24 Reciprocal Models

As X increases indefinitely, Y approaches the asymptotic value β1.

18/24 Logarithmic Reciprocal Model

The log hyperbola, or logarithmic reciprocal, model takes a reciprocal form in logs: initially Y increases at an increasing rate and then increases at a decreasing rate. Example: as per capita GNP increases, initially there is a dramatic drop in CM (child mortality), but the drop tapers off as per capita GNP continues to increase.

19/24 Functional Forms

20/24 Choice of Functional Form

The choice of a functional form is comparatively easy in bivariate cases, because we can plot the variables and get some rough idea. The choice becomes much harder in multivariate cases involving more than one regressor. A great deal of skill and experience are required in choosing an appropriate model for empirical estimation.

21/24 Tips

Keep in mind: the underlying theory may suggest a particular functional form; knowledge of the slope and elasticity coefficients of the various models will help to compare them; the coefficients of the model chosen should satisfy certain a priori expectations.
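A sketch of the constant-elasticity (log-log) model on simulated data with a true elasticity of 0.8 (the data, seed, and parameter values are assumptions): OLS on the logged variables recovers the elasticity directly as the slope.

```python
import numpy as np

# Log-log model: ln Y = b1 + b2 ln X + u, so b2 is the elasticity of Y
# with respect to X. Simulated data with true elasticity 0.8 (assumed).
rng = np.random.default_rng(3)
x = rng.uniform(1, 100, 60)
y = 3.0 * x ** 0.8 * np.exp(rng.normal(0, 0.05, 60))

lx, ly = np.log(x), np.log(y)
b2 = np.sum((lx - lx.mean()) * (ly - ly.mean())) / np.sum((lx - lx.mean()) ** 2)
b1 = ly.mean() - b2 * lx.mean()     # estimates ln(beta1), not beta1 itself
```

Note that b1 estimates ln β1; taking its antilog gives a biased estimator of β1, which echoes the point above that β1 is usually of secondary importance.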
Sometimes more than one model may fit a given set of data reasonably well. Do not overemphasize the r2 measure, in the sense that the higher the r2, the better the model.

22/24 Error Term in Transformations

Consider this regression model. The error term may enter in three forms. Although the resulting linear regression models can be estimated by OLS or ML, we have to be careful about the properties of the stochastic error term that enters them. The BLUE property of OLS requires that ui have zero mean value, constant variance, and zero autocorrelation. For hypothesis testing, we further assume that ui ∼ N(0, σ2).

23/24 Error Term in Transformations

The 1st and 2nd models are intrinsically linear regression models, and the 3rd is intrinsically nonlinear-in-parameters. Thus, in the 1st model we must test normality on ln ui.

24/24 Homework 3A

Basic Econometrics (Gujarati, 2003):
1. Chapter 6, Problem 13 [50 points]
2. Chapter 6, Problem 14 [50 points]
Assignment weight factor = 0.5

1/33 Basic Econometrics in Transportation — Multiple Regression Analysis. Amir Samimi, Civil Engineering Department, Sharif University of Technology. Primary source: Basic Econometrics (Gujarati).

Scope: The simplest possible multiple regression model is three-variable regression, with one dependent variable and two explanatory variables. We are concerned with multiple linear regression models: models that are linear in the parameters; they may or may not be linear in the variables.

2/33 Notation and Assumptions

The PRF may be written as: Yi = β1 + β2X2i + β3X3i + ui. β1 is the intercept term: the mean effect on Y of all the excluded variables. β2 is called a partial regression coefficient, as it measures the change in the mean value of Y per unit change in X2, holding the value of X3 constant.
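The three-variable model can be estimated compactly in matrix form. A numpy-only sketch on simulated data (all numbers, and the use of least squares via `lstsq`, are assumptions for illustration):

```python
import numpy as np

# Three-variable regression Y = b1 + b2*X2 + b3*X3 + u, estimated by OLS.
rng = np.random.default_rng(5)
n = 100
x2, x3 = rng.uniform(0, 10, (2, n))
y = 4.0 + 1.5 * x2 - 0.7 * x3 + rng.normal(0, 1.0, n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x2, x3])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# sigma^2 estimated with n - 3 df (three coefficients consume 3 df)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 3)
```

Here `beta_hat[1]` is the partial effect of X2 holding X3 constant, mirroring the interpretation of β2 above.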
We continue to operate within the CLRM framework:
Zero mean value of ui: E(ui | X2i, X3i) = 0 for each i.
No serial correlation: cov(ui, uj) = 0 (i ≠ j).
Homoscedasticity: var(ui) = σ2.
Zero covariance between ui and each X variable: cov(ui, X2i) = cov(ui, X3i) = 0.
No specification bias.
No multicollinearity.
Sufficient variability in the values of the regressors.

3/33 OLS Estimators

Write the SRF, differentiate the sum of squared residuals with respect to the unknowns, and set the derivatives to zero.

4/33 OLS Estimators

5/33 Variances of OLS Estimators

We need the standard errors for two main purposes: to establish confidence intervals and to test statistical hypotheses.

6/33 Variances of OLS Estimators

In these formulas σ2 is the variance of the population disturbances ui. The degrees of freedom are now (n − 3) because we must first estimate the three coefficients, which consume 3 df. As the correlation coefficient between X2 and X3 increases toward 1, the variances of the estimated β2 and β3 increase. β2 can be estimated more precisely if the variation in the sample values of X2 is large.

7/33 Properties of OLS Estimators

The three-variable regression line passes through the means. The mean of the estimated Yi equals the mean of the actual Yi. The estimated residuals are uncorrelated with X2i and X3i. The estimated residuals are uncorrelated with the estimated Yi. OLS estimators are BLUE.
8/33 Maximum Likelihood Estimators

Under the assumption that the population disturbances are normally distributed with zero mean and constant variance σ2, the ML estimators and the OLS estimators of the regression coefficients are identical. This is not true of the estimator of σ2: the ML estimator of σ2 divides by n regardless of the number of variables in the model.

9/33 Coefficient of Determination R2

It gives the percentage of the total variation in Y explained by the explanatory variables, similar to the bivariate case.

10/33 Example

How do you interpret this regression, in which CM is the number of deaths of children under five per 1000 live births, PGNP is per capita GNP in 1980, and FLR is the female literacy rate measured in percent?

11/33 R2 and Adjusted R2

R2 is a non-decreasing function of the number of explanatory variables present in the model. One may adjust RSS and TSS by their degrees of freedom. Besides R2 and adjusted R2, other criteria are often used to judge the adequacy of a regression model, such as Akaike's information criterion.

12/33 Comparing Two R2 Values

It is crucial to note that in comparing two models on the basis of the coefficient of determination, whether adjusted or not, the sample size n and the dependent variable must be the same. Thus for the following models the computed R2 terms cannot be compared (why?). How then does one compare the R2's of two models when the regressand is not in the same form?

13/33 Example

Consider two models in which consumption of cups of coffee per day (Y) is estimated from the price of coffee (X), Model I and Model II, whose regressands are in different forms. To compare the R2 values: obtain the estimated log value of each observation from Model II, take the antilog of these values, and then compute r2 between these antilog values and the actual Yt. This r2 value is comparable to the r2 value of Model I. The result is an r2 value of 0.7318, which is now comparable with the r2 value of 0.7448 for the log–linear model.

14/33 Transformations

Transformations discussed in the context of the bivariate case can be easily extended to multiple regression models. Example: the Cobb–Douglas production function, a log-linear model in which β2 is the partial elasticity of output with respect to the input.

15/33 Polynomial Regression Models

Example: the marginal cost (MC) of production. There is one explanatory variable on the right-hand side, but it appears with various powers.
Do these models present any special estimation problems? Strictly speaking, they do not violate the no-multicollinearity assumption.

16/33 The Problem of Inference

If our objective is estimation as well as inference, we need to assume that the ui follow some probability distribution. In the child-mortality example (CM: child mortality; PGNP: per capita GNP; FLR: female literacy rate): Is the coefficient of PGNP statistically significant, that is, statistically different from zero? Is the coefficient of FLR, −2.2316, statistically significant? Are both coefficients statistically significant?

17/33 Forms of Hypothesis Testing

Testing hypotheses about an individual regression coefficient.
Testing the overall significance of the estimated multiple regression model, that is, finding out if all the partial slope coefficients are simultaneously equal to zero.
Testing that two or more coefficients are equal to one another.
Testing that the partial regression coefficients satisfy certain restrictions.
Testing the stability of the estimated regression model over time or across different cross-sectional units.
Testing the functional form of regression models.

18/33 Individual Regression Coefficients

H0: β2 = 0. Example: the null hypothesis that, with the female literacy rate held constant, per capita GNP has no linear influence on child mortality. To test the null hypothesis, we use the t test with df = n − 3. In our example the p-value is 0.0065, and thus, with the female literacy rate held constant, per capita GNP has a significant effect on child mortality. Remember that the t-testing procedure is based on the assumption that the error term ui follows the normal distribution.

19/33 Overall Significance

H0: β2 = β3 = 0, that is, whether Y is linearly related to both X2 and X3. Can this joint hypothesis be tested by testing the significance of the estimated coefficients individually? We cannot use the usual t test to test the joint hypothesis that the true partial slope coefficients are simultaneously zero.
This hypothesis can be tested by the ANOVA technique introduced before. Under the assumption of a normal distribution for ui and the null hypothesis, F has the F distribution with 2 and n − 3 df.

20/33 F-testing Procedure

For a k-variable regression model, Yi = β1 + β2X2i + β3X3i + · · · + βkXki + ui, to test the hypothesis H0: β2 = β3 = · · · = βk = 0, compute F. If F > Fα(k − 1, n − k), reject H0.

21/33 Relationship between R2 and F

When R2 = 0, F is zero. Thus the F test, which is a measure of the overall significance of the estimated regression, is also a test of significance of R2.

22/33 Example

We observed that RGDP (relative per capita GDP) and RGDP squared explain only about 5.3 percent of the variation in GDPG (GDP growth rate) in a sample of 119 countries. This R2 of 0.053 seems a "low" value. Is it really statistically different from zero? If R2 is zero, then F is zero. How do we find that out? This F value is significant at about the 5 percent level; the p value is actually 0.0425. Thus, we can reject H0 (that the two regressors have no impact on Y).

23/33 Marginal Contribution of X

Suppose we have regressed child mortality on per capita GNP. We decide to add the female literacy rate to the model and want to know: What is the marginal, or incremental, contribution of FLR, knowing that PGNP is already in the model and that it is significantly related to CM? Is the incremental contribution of FLR statistically significant? What is the criterion for adding variables to the model?

24/33 Marginal Contribution of X

Using ANOVA, we will have the incremental F ratio.

25/33 Equality of Two Regression Coefficients

Suppose in the multiple regression Yi = β1 + β2X2i + β3X3i + β4X4i + ui we want to test the hypothesis H0: β3 = β4. How do we test such a null hypothesis?
This F value follows the F distribution with 1 and 62 df. It is highly significant, suggesting that the addition of FLR to the model significantly increases ESS and hence the R2 value. Therefore, FLR should be added to the model. This F ratio is equivalent to an F ratio computed from the R2 values of the two models.

26/33 Under the classical assumptions, it can be shown that the t statistic for the difference between the two estimated coefficients follows the t distribution with (n − 4) df. If the p value of the statistic is reasonably low, one can reject H0.

27/33 Testing Linear Equality Restrictions

Consider the Cobb–Douglas production function. We want to test whether β2 + β3 = 1 or not. There are two approaches. The t-test approach. The F-test approach (restricted least squares): substituting β2 = 1 − β3 gives ln Yi = β0 + (1 − β3) ln X2i + β3 ln X3i + ui, that is, ln(Yi/X2i) = β0 + β3 ln(X3i/X2i) + ui. How do we know that the restriction is valid?

28/33 General F Testing

The F test provides a general method of testing hypotheses about one or more parameters of the k-variable regression model Yi = β1 + β2X2i + β3X3i + · · · + βkXki + ui, where m is the number of linear restrictions. The following hypotheses can all be tested by the F test: H0: β2 = β3; H0: β3 + β4 + β5 = 3; H0: β3 = β4 = β5 = β6 = 0. The general strategy: there is a larger (unconstrained) model, and then a smaller (constrained or restricted) model, which is obtained by deleting some variables from the larger one. Compute the F ratio. If the computed F exceeds Fα(m, n − k), the critical F at the α level of significance, we reject the null hypothesis.
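The restricted-versus-unrestricted comparison can be sketched on simulated data (the data, seed, and the approximate table value F(0.05; 1, 47) ≈ 4.05 are assumptions). Here the single restriction (m = 1) is dropping X3:

```python
import numpy as np

# General F test: F = [(RSS_R - RSS_UR)/m] / [RSS_UR/(n - k)]
rng = np.random.default_rng(8)
n = 50
x2, x3 = rng.uniform(0, 10, (2, n))
y = 2.0 + 1.0 * x2 + 0.9 * x3 + rng.normal(0, 1.0, n)

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

X_ur = np.column_stack([np.ones(n), x2, x3])   # unrestricted, k = 3
X_r = np.column_stack([np.ones(n), x2])        # restricted: beta3 = 0, m = 1

m, k = 1, 3
F = ((rss(X_r, y) - rss(X_ur, y)) / m) / (rss(X_ur, y) / (n - k))
reject_h0 = F > 4.05   # approx. 5% critical value F(1, 47) from tables
```

The same skeleton handles any set of linear restrictions: impose them to build the restricted design matrix, count them as m, and compare RSSs.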
Consider the data on disposable personal income and personal savings, for the United States for the period 1970–1995. It is well known that in 1982 USA suffered its worst peacetime recession. Now we have three possible regressions: 1970–1981: Yt = λ1 + λ2Xt + u1t n1 = 12 1982–1995: Yt = γ1 + γ2Xt + u2t n2 = 14 1970–1995: Yt = α1 + α2Xt + ut n = 26 30/33 Procedure: Estimate regression in which there is no parameter instability, and obtain RSS3 (df = n1 + n2 − k). We call RSS3 the restricted RSS. (why?) Estimate regression for the first period and obtain RSS1 (df = n1 − k). Estimate regression for the second period and obtain RSS2 (df = n2 − k). Obtain the unrestricted RSS (RSSUR = RSS1 + RSS2): 2 sets are independent. If there is no structural Change then the RSSR and RSSUR should not be statistically different: Do not reject H0 of parameter stability if the computed F value does not exceed the critical F value. 31/33 Parameter Stability (the Chow Test) Parameter Stability (the Chow Test) In our example: p Since we cannot observe the true error variances,, we can obtain RSSR = 23,248.30; RSSUR = 11,790.252; F = 10.69; F1% = 5.72. their estimates from: The Chow test seems to support our earlier guess. Chow test can be generalized to handle cases of more than one structural break. (how?) Keep in mind: The assumptions underlying the test must be fulfilled. It can be shown that The Chow test will tell us only if the two regressions are different, without telling us whether the difference is on account of the intercepts, slopes, or both. The Chow test assumes that we know the point(s) of structural break. Before we move on, lets us examine one of the assumptions: The error variances in the two periods are the same. Computing in an application and comparing it with the critical F value, one can decide to reject or not reject the null hypothesis. 
32/33 33/33 Testing the Functional Form Homework 3B Choosing g between linear and log–linear g regression g models: Basic Econometrics (Gujarati, 2003) We can use the MWD test to choose: H0: Linear Model: Y is a linear function of the X’s. 1. H1: Log–Linear Model: lnY is a linear function of logs of X’s. 2. 1. 2. 3. 4. 5. 6. Estimate the linear model and obtain the estimated Y values (Yf). Estimate the log–linear g model and obtain the estimated lnY values ((ln f) f). Obtain Z1 = (ln Yf − ln f ). Regress Y on X’s and Z1. Reject H0 if the coefficient of Z1 is statistically significant by the usual t test. Obtain Z2 = (antilog of ln f − Yf ). Regress ln Y on the logs of X’s and Z2. Reject H1 if the coefficient of Z2 is statistically significant by the usual t test. Chapter 7, Problem 16 [50 points] Chapter 7, Problem 21 [50 points] Assignment weight factor = 0.5 1/9 Nature of Dummy Variables In regression analysis Y is influenced not only by ratio scale Basic Econometrics in Transportation variables but also by variables that are essentially qualitative, or nominal scale, in nature. Such as sex, race, color, religion, nationality, geographical region, political disorders, and party affiliation. Dummy Variables Female workers are found to earn less than their male counterparts. Nonwhite workers are found to earn less than whites. Since such variables indicate the presence or absence of a Amir Samimi quality, such as male or female, black or white. Dummy variables are a device to classify data into mutually exclusive categories. Civil Engineering Department Sharif University of Technology Primary Source: Basic Econometrics (Gujarati) 2/9 3/9 ANOVA Models Caution in Use of Dummy Variables ANOVA NOV regression eg ess o model ode contain co ta X’ss tthat at are a e aall dummy. du y. To o distinguish d st gu s tthee 3 regions, eg o s, we used oonly y 2 du dummy y va variables. ab es. 
Example: teachers’ salaries by geographical region: The category for which no dummy variable is assigned is Does average annual salary of teachers differ among the regions? One may take the average of salaries of the teachers in the three regions: $24,424.14 (Northeast and North Central), $22,894 (South), and $26,158.62 (West). These numbers look different, but are they statistically different? Yi = β1 + β2D2i + β3D3i + ui o Yi = average salary of public school teacher in state i. o D2i = 1 if the state is in the Northeast or North Central; = 0 otherwise o D3i = 1 if the state is in the South = 0 otherwise. How do you use this model to answer the question? known as the base, to which all the comparisons are made. If the intercept is not introduced in a model, you may avoid the “dummy variable trap”. In an interceptless model (a dummy variable for each category), we obtain directly the mean values of the various categories. 4/9 5/9 Models with Two Qualitative Variables A Mixture of Quantitative and Qualitative Hourly wages in relation to marital status and region of residence: • Y = hourly wage ($). • D2 = married status; 1 = married, 0 = otherwise. • D3 = region of residence; 1 = South, 0 = otherwise. Yi = average annual salary of public school teachers in state ($) Xi = spending on public school per pupil ($) D2i = 1, if the state is in the Northeast or North Central = 0, otherwise D3i = 1, if the state is in the South = 0, otherwise Which is the benchmark category g y here? Unmarried persons who do not live in the South. 6/9 Are the average hourly wages statistically different compared to These results are different from previous. the base category? Once you go beyond one qualitative variable, you have to pay close attention to the category that is treated as the base category. As public expenditure goes up by a dollar, on average, a public school teacher’s salary goes up by about $3.29. 
7/9 Dummy Variable Alternative to Chow Test Dummy Variable Alternative to Chow Test The e Chow C ow test procedure, p ocedu e, tells te s us only o y if two regressions eg ess o s are ae Sou Source ce oof ddifference, e e ce, can ca be specified spec ed by pooling poo g all a the t e different without telling us what is the source of the difference. observations and running just one multiple regression: Yt = α1 + α2Dt + β1Xt + β2(DtXt) + ut Y = savings X = income t = time D = 1 for observations in 1982 1982–1995; 1995; = 0, 0 otherwise. otherwise α1 + β1Xt : Mean savings for 1970–1981: (α1 + α2) + (β1 + β2) Xt: Mean savings function for 1982–1995 Notice how: interactive dummy enables us to differentiate between slope coefficients, additive dummy enables us to differentiate between intercepts of two periods. 8/9 9/9 Dummy Variables in Seasonal Analysis Piecewise Linear Regression Yt = α1D1t + α2D2t + α3D3t + α4D4t + ut Yi = α1 + β1Xi + β2(Xi − X*)Di + ui Yt = sales of refrigerators Yi = sales commission D’s are the dummies, taking a value of 1 in the relevant quarter and 0 otherwise. Xi = volume of sales X* = threshold value of sales D = 1 if Xi > X* If there is any seasonal effect in a given quarter, that will be indicated by a statistically significant t value of the dummy coefficient for that quarter. 1/29 Outline What is the nature of multicollinearity? Basic Econometrics in Transportation Is multicollinearity really a problem? What are its practical consequences? How does one detect it? Multicollinearity What remedial measures can be taken to ease the problem? 
Amir Samimi Civil Engineering Department Sharif University of Technology Primary Source: Basic Econometrics (Gujarati) 2/29 3/29 Nature of Multicollinearity Nature of Multicollinearity O Originally g a y itt meant ea t the t e existence e ste ce oof a “perfect” pe ect linear ea If multicollinearity u t co ea ty iss pe perfect, ect, the t e coefficients coe c e ts are ae relationship among some explanatory variables: indeterminate and their standard errors are infinite. λ1X1 + λ2X2 + · · · + λkXk = 0 λs are constant such that not all of them are zero simultaneously. Today, it is also used where X variables are intercorrelated but nott perfectly f tl so: λ1X1 + λ2X2 + · · · + λkXk + vi= 0 where vi is a stochastic error term. If multicollinearity is less than perfect, the coefficients are determinate but possess large standard errors. The coefficients cannot be estimated with great precision. 4/29 5/29 Example Ballentine View of Multicollinearity Consider the following hypothetical data: Perfect collinearity between X2 and X3 Coefficient of correlation is 1 No perfect collinearity between X2 and X3* Coefficient of correlation between them is 0.9959 6/29 Sources of Multicollinearity 7/29 Estimation in Perfect Multicollinearity Multicollinearity may be due to the following factors: The data collection method employed: Sampling over a limited range of the values taken by the regressors in the population. Constraints on the model or in the population being sampled: In the regression of electricity consumption on income and house size there is a physical constraint in the population in that families with higher incomes generally have larger homes than families with lower incomes. Model specification: Assume that X3i = λX2i , where λ is a nonzero constant Adding polynomial terms to a regression model. An overdetermined model: This happens when the model a lot of explanatory variables compared to the number of observations. 
This could happen in medical research where there may be a small number of patients about whom information is collected on a large number of variables. Common trend in time series data: Regressors may all be growing over time at more or less the same rate. If X3 and X2 are perfectly collinear, there is no way X3 can be kept constant. As X2 changes, so does X3 by the factor λ. 8/29 9/29 Theoretical Consequences Practical Consequences Recall that with CLRM assumptions, OLS estimators are BLUE. Although BLUE, the OLS estimators have large variances and And with CNLRM assumptions, OLS estimators are BUE. covariances, making precise estimation difficult. Thus, the confidence intervals tend to be much wider, leading to the acceptance of the “zero null hypothesis” more readily. Therefore, the t ratio of one or more coefficients tends to be statistically insignificant. Although the t ratio(s) is statistically insignificant, R2 can be very high. The OLS estimators and their standard errors can be sensitive to small changes in the data. It can be shown that even if multicollinearity is very high, the OLS estimators still retain the property of BLUE. But multicollinearity violates no regression assumptions. Unbiased, consistent estimates will occur, and their standard errors will be correctly estimated. The only effect of multicollinearity is to make it hard to get coefficient estimates with small standard error. In fact, at a theoretical level, multicollinearity, few observations and small variances on the independent variables are essentially all the same problem. Thus “What should I do about multicollinearity?” is a question like “What should I do if I don’t have many observations?” 10/29 Large Variances and Covariances 11/29 Large Variances and Covariances The e results esu ts can ca be extended e te ded to the t e kk-variable va ab e model: ode : R2j = R2 in the regression of Xj on the remaining (k − 2) regressors. r23 is the coefficient of correlation between X2 and X3. 
The speed with which the variances increase can be seen with the variance-inflating factor (VIF); the inverse of the VIF is called tolerance (TOL).

Wider Confidence Intervals
In cases of high collinearity the estimated standard errors increase dramatically, making the t values smaller. One will then increasingly accept the null hypothesis that the relevant true population value is zero (the probability of a type II error increases).

High R2 but Few Significant t Ratios
In cases of high collinearity it is possible to find that one or more coefficients are individually insignificant, yet the R2 may be so high that on the basis of the F test one can convincingly reject the hypothesis that β2 = ··· = βk = 0. Indeed, this is one of the signals of multicollinearity. The outcome should not be surprising in view of our discussion of individual versus joint testing: the real problem is the covariances between the estimators, which are related to the correlations between the regressors.

Sensitivity to Small Changes in Data

An Illustrative Example
X2: income, X3: wealth, and Y: consumption expenditure.

Detection of Multicollinearity
1. High R2 but few significant t ratios. This is the "classic" symptom of multicollinearity. Its disadvantage is that it is too strong a requirement.
2. High pair-wise correlations among regressors. If the correlation coefficient between two regressors is high (above 0.8), multicollinearity is a serious problem. This is a sufficient but not a necessary condition for the existence of multicollinearity.
3. Auxiliary regressions. Regress each Xi on the remaining X variables and compute the corresponding R2 (R2i). If the computed F exceeds the critical Fi at the chosen level of significance, the particular Xi is taken to be collinear with the other X's. Klein's rule of thumb: multicollinearity may be a problem only if R2i is greater than the overall R2.
4. Examination of partial correlations. A finding that R2_1.234 is very high but r2_12.34, r2_13.24, and r2_14.23 are comparatively low may suggest that X2, X3, and X4 are highly intercorrelated. (The partial correlation between X and Y, given a set of n controlling variables Z = {Z1, Z2, …, Zn}, is the correlation between the residuals from the linear regression of X on Z and those from the linear regression of Y on Z.)
5. Eigenvalues and condition index. Set A = X'X, where X is an n × k matrix of observations. The eigenvalues of the k × k matrix A are the k solutions of |A − λI| = 0. The condition number is the maximum eigenvalue divided by the minimum eigenvalue: a value between 100 and 1000 indicates moderate to strong multicollinearity, and above 1000 severe multicollinearity. The condition index (CI) is the square root of the condition number; if CI exceeds 30, there is severe multicollinearity. Some authors believe the condition index is the best available multicollinearity diagnostic.
6. Tolerance and variance inflation factor. The larger the value of VIFj, the more "troublesome" or collinear the variable Xj. As a rule of thumb, if the VIF of a variable exceeds 10, that variable is said to be highly collinear. VIF as a measure of collinearity is not free of criticism: a high VIF is neither necessary nor sufficient for high variances.

Remedial Measures
We have two choices: do nothing, or follow some rules of thumb. "Do nothing" reflects the view that multicollinearity is essentially a data-deficiency problem, and sometimes we have no choice over the data available for empirical analysis.

A Priori Information
Consider the model Yi = β1 + β2X2i + β3X3i + ui, with Y = consumption, X2 = income, and X3 = wealth.
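Before turning to the remedies, the detection diagnostics above (the auxiliary-regression R2, VIF, and condition index) can be sketched numerically. A minimal sketch with simulated data; all the numbers below are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated illustration: X3 is almost an exact multiple of X2,
# so the regressors are nearly perfectly collinear.
n = 100
x2 = rng.normal(10, 2, n)
x3 = 2 * x2 + rng.normal(0, 0.1, n)          # near-perfect collinearity
X = np.column_stack([np.ones(n), x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R2_j), with R2_j from regressing X_j on the other columns."""
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

# Condition number = max eigenvalue of X'X over min eigenvalue;
# the condition index is its square root (CI > 30 signals severe collinearity).
eig = np.linalg.eigvalsh(X.T @ X)
ci = np.sqrt(eig.max() / eig.min())

print("VIF of X2:", vif(X, 1))
print("condition index:", ci)
```

With regressors this collinear, both diagnostics blow past their rule-of-thumb thresholds (VIF above 10, CI above 30).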
If we cannot estimate one or more regression coefficients with greater precision, a linear combination of them can be estimated relatively efficiently. Suppose a priori we believe that β3 = 0.10 β2. We can then run Yi = β1 + β2Xi + ui, where Xi = X2i + 0.10 X3i. How does one obtain a priori information? From previous empirical work, or from the relevant theory underlying the field of study. We know how to test the validity of such restrictions explicitly. The success of such rules of thumb depends on the severity of the collinearity problem.

Pooling the Data
A variant of the a priori information technique is to combine cross-sectional and time series data. Consider the model ln Yt = β1 + β2 ln Pt + β3 ln It + ut, where Y = number of cars sold, P = average price, I = income, and t = time. Price and income variables generally tend to be highly collinear. A way out of this multicollinearity problem: if we have cross-sectional data, we can obtain a fairly reliable estimate of the income elasticity β3. Use this estimate and write the time series regression as Y*t = β1 + β2 ln Pt + ut, where Y*t = ln Yt − β̂3 ln It. Note: we are assuming that the cross-sectionally estimated income elasticity is the same as what would be obtained from a pure time series analysis. This approach is worth considering in situations where the cross-sectional estimates do not vary substantially from one cross section to another.

Dropping a Variable
A simple way out is to drop one of the collinear variables. In our consumption–income–wealth illustration, we can drop the wealth variable and get a significant coefficient for the income variable. Note: we may be committing a specification error in dropping a variable.
The remedy may be worse than the disease in some situations because, whereas multicollinearity may prevent precise estimation of the parameters of the model, omitting a variable may seriously mislead us as to the true values of the parameters.

Transformation
First-Difference Form. If the relation Yt = β1 + β2X2t + β3X3t + ut holds at time t, it must also hold at t − 1: Yt-1 = β1 + β2X2,t-1 + β3X3,t-1 + ut-1. Subtracting, Yt − Yt-1 = β2(X2t − X2,t-1) + β3(X3t − X3,t-1) + vt, where vt = ut − ut-1. This often reduces the severity of multicollinearity: although the levels of X2 and X3 may be highly correlated, there is no a priori reason to believe that their differences will also be highly correlated. Caveats: the error term vt may not satisfy the assumption of serially uncorrelated disturbances (if the original disturbance is serially uncorrelated, vt will in most cases be serially correlated); the transformation may not be appropriate for cross-sectional data; and one observation is lost (important in small samples). The remedy may therefore be worse than the disease.

Ratio Transformation. Consider Yt = β1 + β2X2t + β3X3t + ut, where X2 is GDP and X3 is total population. GDP and population both grow over time, so they are likely to be correlated. One "solution" is to express the model on a per capita basis. In the ratio transformation, however, the error term will be heteroscedastic if the original error term is homoscedastic. Again, the remedy may be worse than the disease of collinearity. BE CAREFUL IN TRANSFORMING!

Additional or New Data
Remember that multicollinearity is a sample feature. Sometimes simply increasing the size of the sample is helpful.
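The claim that differencing can break the correlation between trending levels can be checked with a small simulation. A sketch; the trend slopes and noise levels below are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two trending regressors (think income and wealth): their levels are
# dominated by the common time trend, their first differences are not.
t = np.arange(100.0)
x2 = 5 * t + rng.normal(0, 10, 100)
x3 = 3 * t + rng.normal(0, 10, 100)

corr_levels = np.corrcoef(x2, x3)[0, 1]                   # trend dominates
corr_diffs = np.corrcoef(np.diff(x2), np.diff(x3))[0, 1]  # trend removed

print("levels:", corr_levels, " first differences:", corr_diffs)
```

The level correlation is close to 1, while the correlation of the differences is close to 0, exactly the pattern the first-difference argument relies on.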
As the sample size increases, the variances of the estimated coefficients decrease, leading to more precise estimates. Obtaining additional or "better" data is not always easy, however. Note: when adding new data in situations that are not controlled, we must be aware of adding observations that were generated by a process other than that associated with the original data set.

Deviation Form
A special feature of polynomial regression models is that the explanatory variable(s) appear with various powers and are therefore bound to be correlated. It has been found in practice that multicollinearity is substantially reduced if the explanatory variable(s) are expressed in deviation form (as deviations from their sample mean). There is no guarantee, however, and the problem may persist!

Is Multicollinearity Necessarily Bad?
It has been said that if the sole purpose of regression analysis is prediction or forecasting, then multicollinearity is not a serious problem, provided R2 is high and the regression coefficients are individually significant.

Homework 4
Basic Econometrics (Gujarati, 2003)
1. Chapter 10, Problem 26 [50 points]
2. Chapter 10, Problem 27 [50 points]
Assignment weight factor = 0.5

Heteroscedasticity
Outline
What is the nature of heteroscedasticity? What are its consequences? How does one detect it? What are the remedial measures?

Nature of Heteroscedasticity
An important assumption in CLRM is that E(u2i) = σ2 for all i: the assumption of equal (homo) spread (scedasticity). Example: higher-income families on average save more than lower-income families, but there is also more variability in their savings.

Possible Reasons
1. As people learn, their errors of behavior become smaller over time: as the number of hours of typing practice increases, the average number of typing errors, as well as its variance, decreases.
2. As incomes grow, people have more choices about the disposition of their income; rich people have more choices about their savings behavior.
3. As data-collecting techniques improve, σ2i is likely to decrease: banks with sophisticated data-processing equipment are likely to commit fewer errors.
4. Heteroscedasticity can arise when there are outliers (observations much different from the other observations in the sample).
5. Heteroscedasticity arises when the model is not correctly specified: very often what looks like heteroscedasticity is due to the omission of important variables from the model.
6. Skewness in the distribution of a regressor is another source: the distribution of income and wealth in most societies is uneven, with the bulk of income and wealth owned by a few at the top.
7. Other sources: incorrect data transformation (ratio or first-difference transformations) and incorrect functional form (linear versus log-linear models).

Cross-sectional and Time Series Data
Heteroscedasticity is likely to be more common in cross-sectional than in time series data. In cross-sectional data one usually deals with members of a population at a given point in time, and these members may differ in size, income, and so on. In time series data the variables tend to be of similar orders of magnitude, because one generally collects the data for the same entity over a period of time.

OLS Estimation with Heteroscedasticity
What happens to the OLS estimators and their variances when the homoscedasticity assumption is dropped?

Method of Generalized Least Squares
Ideally, we would like to give less weight to the observations coming from populations with greater variability.
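That weighting idea can be sketched directly. A minimal weighted-least-squares illustration with simulated data; the assumption that σi is known and proportional to √Xi, and the parameter values, are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Heteroscedastic model with sigma_i known; true beta1 = 2, beta2 = 3.
n = 200
x = rng.uniform(1, 10, n)
sigma_i = 0.5 * np.sqrt(x)               # assumed-known error std, rising in x
y = 2.0 + 3.0 * x + rng.normal(0, sigma_i)

# Divide every column (including the intercept column) through by sigma_i:
# the transformed disturbance u_i / sigma_i then has constant variance 1.
X = np.column_stack([np.ones(n), x])
beta_wls, *_ = np.linalg.lstsq(X / sigma_i[:, None], y / sigma_i, rcond=None)
print("WLS estimates:", beta_wls)        # close to (2, 3)
```

Dividing through by σi is exactly the "give less weight to noisy observations" idea: an observation with a large σi is shrunk before the sum of squares is minimized.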
Consider Yi = β1 + β2Xi + ui = β1X0i + β2Xi + ui, where X0i = 1 for every i, and assume the heteroscedastic variances σ2i are known. Dividing the model through by σi, the variance of the transformed disturbance term ui/σi is homoscedastic (equal to 1). Applying OLS to the transformed model therefore yields BLUE estimators.

Is OLS itself still BLUE when we drop only the homoscedasticity assumption? We can easily prove that it is still linear and unbiased, and we can also show that it is a consistent estimator. It is no longer best, however, and its minimum variance is not given by the usual formula. The BLUE estimator in the presence of heteroscedasticity is the GLS (weighted) estimator above.

GLS Estimators
Minimize the weighted sum of squared residuals; following the standard calculus techniques yields the GLS estimators.

Consequences of Using OLS
The usual OLS estimator of the variance is biased: it overestimates or underestimates on average, and we cannot tell whether the bias is positive or negative. We can therefore no longer rely on the usual confidence intervals and t and F tests: if we persist in using the usual testing procedures despite heteroscedasticity, whatever conclusions we draw may be very misleading. Heteroscedasticity is potentially a serious problem, and the researcher needs to know whether it is present in a given situation.

Detection
There are no hard-and-fast rules for detecting heteroscedasticity, only a few rules of thumb. This is inevitable, because σ2i can be known only if we have the entire population of Y values corresponding to the chosen X's. More often than not there is only one sample Y value for a particular value of X, and there is no way to know σ2i from just one Y observation. Thus heteroscedasticity may be a matter of intuition, educated guesswork, or prior empirical experience. Most detection methods are based on examination of the OLS residuals: those are what we observe, not the ui. We hope they are good estimates of the ui, a hope that may be fulfilled if the sample size is fairly large.
Informal Methods
Nature of the Problem. The nature of the problem may suggest that heteroscedasticity is likely: for example, the residual variance around the regression of consumption on income increases with income. Using such knowledge, one may transform the data to alleviate the problem.
Graphical Method. The estimated u2i are plotted against the estimated Yi: is the estimated mean value of Y systematically related to the squared residual? (a) No systematic pattern: perhaps there is no heteroscedasticity. (b)–(e) Definite patterns: heteroscedasticity is likely.

Formal Methods
Park Test. Park formalizes the graphical method by suggesting the log-linear model ln σ2i = ln σ2 + β ln Xi + vi. Since σ2i is generally unknown, Park suggests using the squared OLS residuals in its place. If β turns out to be insignificant, the homoscedasticity assumption may be accepted. The particular functional form chosen by Park is only suggestive. Note: some have argued that vi does not have a zero expected value, is serially correlated, and is heteroscedastic.
Glejser Test. Glejser suggests regressing the absolute value of the estimated error term on the X variable, with several suggested functional forms. For large samples the first four forms give generally satisfactory results; the last two models are nonlinear in the parameters. Note: the error term vi may not satisfy the OLS assumptions.
Goldfeld–Quandt Test. Rank the observations according to the Xi values, omit the c central observations, and divide the remaining observations into two groups of (n − c)/2 observations each.
Spearman's Rank Correlation Test. Fit the regression to the data on Y and X and estimate the residuals. Rank both the absolute values of the residuals and Xi (or the estimated Yi), and compute Spearman's rank correlation coefficient, where di = the difference in the ranks for the ith observation.
Assuming that the population rank correlation coefficient is zero and n > 8, the significance of the sample rs can be tested by the t test with df = n − 2; if the computed t value exceeds the critical t value, we may accept the hypothesis of heteroscedasticity.
Continuing the Goldfeld–Quandt test: fit separate OLS regressions to the first and last sets of observations, obtain the residual sums of squares RSS1 and RSS2, and compute the ratio λ = RSS2/RSS1. If the ui are assumed to be normally distributed and the assumption of homoscedasticity is valid, it can be shown that λ follows the F distribution. The ability of the test depends on how c is chosen: Goldfeld and Quandt suggest c = 8 if n = 30 and c = 16 if n = 60, while Judge et al. note c = 4 if n = 30 and c = 10 if n is about 60.

Breusch–Pagan–Godfrey Test
The success of the GQ test depends on c and on the X by which the observations are ordered; the BPG test avoids this. Estimate Yi = β1 + β2X2i + ··· + βkXki by OLS, obtain the residuals, and compute the ML estimator of σ2, σ̃2 = Σû2i/n. Construct the variables pi = û2i/σ̃2 and regress them on the Z's: pi = α1 + α2Z2i + ··· + αmZmi + vi, where σ2i is assumed to be a linear function of the Z's (some or all of the X's can serve as Z's). Obtain the ESS (explained sum of squares) of this regression and define Θ = 0.5 ESS. Assuming the ui are normally distributed, one can show that under homoscedasticity, as the sample size n increases indefinitely, Θ ∼ χ2 with m − 1 df. The BPG test is thus an asymptotic, or large-sample, test.

White's General Heteroscedasticity Test
Does not rely on the normality assumption and is easy to implement. Estimate Yi = β1 + β2X2i + β3X3i + ui and obtain the residuals. Run the auxiliary regression of û2i on the regressors, their squares, and their cross products. Higher powers of the regressors can also be introduced.
Under the null hypothesis of homoscedasticity, as the sample size n increases indefinitely, it can be shown that nR2 ∼ χ2 with df equal to the number of regressors in the auxiliary regression (excluding the constant). If the chi-square value exceeds the critical value, the conclusion is that there is heteroscedasticity; if it does not, then α2 = α3 = α4 = α5 = α6 = 0. It has been argued that if cross-product terms are present, the procedure is a joint test of heteroscedasticity and specification bias.

Remedial Measures
Heteroscedasticity does not destroy unbiasedness and consistency, but the OLS estimators are no longer efficient, not even asymptotically. There are two approaches to remediation: when σ2i is known, and when it is not.
When σ2i is known: the most straightforward method of correcting heteroscedasticity is weighted least squares, which provides BLUE estimators.
When σ2i is unknown: is there a way of obtaining consistent estimates of the variances and covariances of the OLS estimators even under heteroscedasticity? The answer is yes.

White's Correction
White suggested a procedure by which asymptotically valid statistical inferences can be made about the true parameter values. Several computer packages report White's heteroscedasticity-corrected variances and standard errors, also known as robust standard errors, alongside the usual OLS ones. For the two-variable model Yi = β1 + β2X2i + ui, White showed that var̂(β̂2) = Σx2i û2i / (Σx2i)2 (with x the deviation of X from its mean) is a consistent estimator of the true variance. For Yi = β1 + β2X2i + ··· + βkXki + ui the analogous formula uses the ûi, the residuals from the original regression, and the ŵji, the residuals from the auxiliary regression of the regressor Xj on the remaining regressors.
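White's nR2 test described above can be sketched end to end. A minimal simulation; the data-generating choices (error variance rising in X2, the parameter values) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Model Y = b1 + b2*X2 + b3*X3 + u with var(u) rising in X2.
n = 200
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(1, 10, n)
y = 1 + 2 * x2 + 3 * x3 + rng.normal(0, 0.5 * x2)

X = np.column_stack([np.ones(n), x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
u2 = (y - X @ b) ** 2                    # squared OLS residuals

# Auxiliary regression: u^2 on the regressors, their squares, and the
# cross product (5 slope terms, so df = 5 for the chi-square).
Z = np.column_stack([np.ones(n), x2, x3, x2**2, x3**2, x2 * x3])
g, *_ = np.linalg.lstsq(Z, u2, rcond=None)
r2 = 1 - np.sum((u2 - Z @ g) ** 2) / np.sum((u2 - u2.mean()) ** 2)

# Under homoscedasticity n*R2 ~ chi-square(5); the 5% critical value
# is about 11.07, so a larger statistic rejects homoscedasticity.
print("n*R2 =", n * r2)
```

With this much heteroscedasticity built in, the statistic lands far above the critical value, so the null of homoscedasticity is rejected.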
Apart from being a large-sample procedure, one drawback of the White procedure is that the resulting estimators may not be as efficient as those obtained by methods that transform the data to reflect specific types of heteroscedasticity.

Example
Y = per capita expenditure on public schools by state in 1979; Income = per capita income by state in 1979. Both regressors are statistically significant at the 5 percent level on the basis of the usual OLS standard errors, whereas on the basis of the White estimators they are not. Since robust standard errors are now available in established regression packages, it is recommended that they be reported; the WHITE option can be used to compare the output with the regular OLS output as a check for heteroscedasticity.

Reasonable Heteroscedasticity Patterns
We may consider several assumptions about the pattern of heteroscedasticity and transform the data accordingly:
Assumption 1: E(u2i) = σ2 X2i — divide the model through by Xi.
Assumption 2: E(u2i) = σ2 Xi — divide the model through by √Xi.
Assumption 3: E(u2i) = σ2 [E(Yi)]2 — divide the model through by the fitted values Ŷi.
Assumption 4: a log transformation such as ln Yi = β1 + β2 ln Xi + ui very often reduces heteroscedasticity.

Homework 5
Basic Econometrics (Gujarati, 2003)
1. Chapter 11, Problem 15 [50 points]
2. Chapter 11, Problem 16 [50 points]
Assignment weight factor = 0.5

Autocorrelation
Outline
What is the nature of autocorrelation? What are the theoretical and practical consequences of autocorrelation? Since the assumption of no autocorrelation relates to the unobservable disturbances, how does one know that there is autocorrelation in any given situation? How does one remedy the problem of autocorrelation?
Autocorrelation
Nature of the Problem
Autocorrelation may be defined as correlation between members of a series of observations ordered in time or space. CLRM assumes that E(uiuj) = 0 for i ≠ j. Consider the regression of output on labor and capital inputs with quarterly time series data: if a labor strike affects output in one quarter, there is no reason to believe that this disruption will be carried over to the next quarter; if output is lower this quarter, there is no reason to expect it to be lower next quarter. In cross-section studies, data are often collected on the basis of a random sample of, say, households, and there is often no prior reason to believe that the error term pertaining to one household is correlated with the error term of another household.

The problem is in many ways similar to heteroscedasticity. Under both heteroscedasticity and autocorrelation the usual OLS estimators, although linear, unbiased, and asymptotically normally distributed, no longer have minimum variance among all linear unbiased estimators (they are not efficient), and thus may not be BLUE. As a result, the usual t, F, and χ2 tests may not be valid.

Patterns of Autocorrelation
(a) A cyclical pattern. (b) An upward linear trend in the disturbances. (c) A downward linear trend in the disturbances. (d) Both linear and quadratic trend terms present in the disturbances.

Possible Reasons
1. Inertia. A prominent feature of most economic time series: there is a "momentum" built into them, and it continues until something happens.
2. Specification bias, excluded-variables case. The price of chicken may affect the consumption of beef; if it is excluded, the error term will reflect a systematic pattern.
3. Specification bias, incorrect functional form. If a squared term such as X2i squared is incorrectly omitted from the model, a systematic fluctuation in the error term, and thus autocorrelation, will be detected because of the use of an incorrect functional form.
4. Cobweb phenomenon. Supply reacts to price with a lag of one time period: the planting of crops is influenced by their price last year.
5. Lags. In a time series regression of consumption expenditure on income, it is not uncommon to find that consumption expenditure in the current period depends also on consumption expenditure of the previous period: Consumption_t = β1 + β2 income_t + β3 consumption_{t-1} + ut (an autoregression).
6. "Manipulation" of data. Averaging introduces smoothness and dampens the fluctuations in the data: quarterly data look much smoother than monthly data, and this smoothness may lend a systematic pattern to the disturbances.
7. Data transformation. Autocorrelation may be induced as a result of transforming the model: if the error term in Yt = β1 + β2Xt + ut satisfies the standard OLS assumptions, particularly no autocorrelation, it can be shown that the error term vt = ut − ut-1 in the first-difference form of the model is autocorrelated.
8. Nonstationarity. A time series is stationary if its characteristics (e.g., mean, variance, and covariance) do not change over time. When Y and X are nonstationary, it is quite possible that the error term is also nonstationary.
Most economic time series exhibit positive autocorrelation, because most of them move either upward or downward over sustained periods.

Estimation in Presence of Autocorrelation
What happens to the OLS estimators if we introduce autocorrelation in the disturbances by assuming E(utut+s) ≠ 0? We revert to the two-variable regression model to explain the basic ideas involved.
To make any headway, we must assume the mechanism that generates ut: E(utut+s) ≠ 0 is too general an assumption to be of any practical use. As a starting point, one can assume a first-order autoregressive scheme.

AR(1) Scheme
ut = ρut-1 + εt, with −1 < ρ < 1, where ρ is the coefficient of autocovariance and εt is a stochastic disturbance term that satisfies the standard OLS assumptions: E(εt) = 0, var(εt) = σ2ε, and cov(εt, εt+s) = 0 for s ≠ 0. Given the AR(1) scheme, it can be shown that var(ut) = σ2ε/(1 − ρ2) and cov(ut, ut+s) = ρ^s var(ut). Since ρ is constant, the variance of ut is still homoscedastic. If |ρ| < 1, the AR(1) process is stationary: the mean, variance, and covariance of ut do not change over time. A considerable amount of theoretical and empirical work has been done on the AR(1) scheme.

BLUE Estimator in Autocorrelation Case
For a two-variable model under the AR(1) scheme, it can be shown that the GLS estimators, built on quasi-differenced data, are BLUE; C and D are correction factors (for the first observation) that may be disregarded in practice. The OLS variance under AR(1) includes a term that depends on ρ as well as the sample autocorrelations between the values taken by the regressor X at various lags; in general we cannot foretell whether the OLS or the GLS variance is greater. The OLS estimator with adjusted variance is still linear and unbiased, but no longer efficient. In the case of autocorrelation, GLS gives the BLUE estimator because it incorporates the additional information (the autocorrelation parameter ρ) directly into the estimating procedure, whereas OLS does not. What if we use the OLS procedure despite autocorrelation?

Consequences of Using OLS
As in the heteroscedasticity case, the OLS estimators are not efficient.
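The stationary moments of the AR(1) scheme stated above can be checked by simulation. A sketch; the choices ρ = 0.8 and σε = 1 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate u_t = rho*u_{t-1} + eps_t and check the stationary moments:
# var(u) = sigma_eps^2 / (1 - rho^2) and corr(u_t, u_{t-1}) = rho.
rho, n = 0.8, 200_000
u = np.empty(n)
u[0] = rng.normal(0, 1 / np.sqrt(1 - rho**2))   # draw from the stationary law
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal()

lag1 = np.corrcoef(u[:-1], u[1:])[0, 1]
print("sample var:", u.var(), " theory:", 1 / (1 - rho**2))
print("lag-1 autocorrelation:", lag1)
```

The sample variance matches σ2ε/(1 − ρ2) and the lag-1 sample autocorrelation matches ρ, as the formulas predict.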
If we mistakenly believe that the CNLRM assumptions hold true, errors will arise for the following reasons:
1. The OLS confidence intervals derived are likely to be wider than those based on the GLS procedure. If b2 lies in the (too-wide) confidence interval, we may mistakenly accept the hypothesis that the true β2 is zero with 95% confidence. Use GLS, not OLS, to establish confidence intervals and to test hypotheses, even though the OLS estimators are unbiased and consistent.
2. The residual variance is likely to underestimate the true σ2. If there is AR(1) autocorrelation, this can be shown to depend on the sample correlation coefficient r between successive values of the X; in most economic time series, ρ and r are both positive, so the underestimation holds. As a result, we are likely to overestimate R2.
3. The t and F tests are no longer valid and, if applied, are likely to give seriously misleading conclusions about the statistical significance of the estimated regression coefficients.

A Concrete Example / Detecting Autocorrelation
Graphical Method. Very often a visual examination of the estimated u's gives us some clues about the likely presence of autocorrelation: we can simply plot them against time in a time-sequence plot. Standardized residuals (the residuals divided by the standard error of the regression) are devoid of units of measurement. The figure for our wage–productivity regression exhibits a pattern in the estimated error terms, suggesting that perhaps the ut are not random. How reliable are the results if there is autocorrelation? A nonparametric alternative, the runs test (also known as the Geary test), is discussed next. To see the graphical evidence differently, we can plot the residuals at time t against their value at time t − 1, a kind of empirical test of the AR(1) scheme. In our wage–productivity regression, most of the residuals are grouped in the second and fourth quadrants. This suggests a strong positive correlation in the residuals.
Although the graphical method is suggestive, it is subjective or qualitative in nature.

The Runs Test
In our wage–productivity regression there are 9 negative residuals, followed by 21 positive, followed by 10 negative:
(−−−−−−−−−)(+++++++++++++++++++++)(−−−−−−−−−−)
A run is an uninterrupted sequence of one symbol; the length of a run is the number of elements in it. Too many runs mean the residuals change sign frequently, indicating negative serial correlation; too few runs mean they change sign infrequently, indicating positive serial correlation. Let N = total number of observations = N1 + N2, where N1 = number of + symbols, N2 = number of − symbols, and R = number of runs. H0: successive outcomes are independent. If N1 and N2 > 10, R is approximately normal with E(R) = 2N1N2/N + 1 and σ2R = 2N1N2(2N1N2 − N)/(N2(N − 1)). Decision rule: reject H0 if the estimated R lies outside the limits Prob[E(R) − 1.96σR ≤ R ≤ E(R) + 1.96σR] = 0.95.

Durbin–Watson d Test
The most celebrated test for detecting serial correlation: d = Σ(ût − ût-1)2 / Σû2t. Assumptions underlying the d statistic:
1. The regression model includes the intercept term; if it is not present, obtain the RSS from the regression including the intercept term.
2. The explanatory variables are nonstochastic, or fixed in repeated sampling.
3. The disturbances are generated by the first-order autoregressive scheme.
4. The error term is assumed to be normally distributed.
5. The regression model does not include the lagged value of the dependent variable as one of the explanatory variables (no autoregressive model).
6. There are no missing observations in the data.

The PDF of the d statistic is difficult to derive because it depends in a complicated way on the X values. Durbin and Watson therefore suggested bounds (Table D.5) for the computed d. If there is no serial correlation, d is expected to be about 2. Procedure: run the OLS regression and compute d; for the given sample size and number of explanatory variables, find the critical dL and dU values; then follow the decision rules in the figure.

In our wage–productivity regression the estimated d value is 0.1229. From the Durbin–Watson tables (N = 40 and k' = 1), dL = 1.44 and dU = 1.54 at the 5 percent level; since d < dL, there is evidence of positive autocorrelation. In many situations it has been found that the upper limit dU is approximately the true significance limit: if d < dU, there is statistically significant positive autocorrelation, and if (4 − d) < dU, statistically significant negative autocorrelation.

Remedial Measures
If autocorrelation is found, we have four options:
1. Try to find out whether the autocorrelation is pure autocorrelation and not the result of mis-specification of the model.
2. In the case of pure autocorrelation, use some type of generalized least squares (GLS) method.
3. In large samples, use the Newey–West method to obtain standard errors of the OLS estimators that are corrected for autocorrelation.
4. In some situations, continue to use the OLS method.

Mis-specification
For the wages–productivity regression, it may be safe to conclude that the model probably suffers from pure autocorrelation and not necessarily from specification bias.

Correcting for Pure Autocorrelation
The remedy depends on the knowledge about the structure of the autocorrelation. As a starter, consider the two-variable regression model Yt = β1 + β2Xt + ut, and assume that the error term follows the AR(1) scheme.

When ρ Is Known
If Yt = β1 + β2Xt + ut holds true at time t, it also holds at time t − 1: Yt-1 = β1 + β2Xt-1 + ut-1, so ρYt-1 = ρβ1 + ρβ2Xt-1 + ρut-1. Subtracting, Yt − ρYt-1 = β1(1 − ρ) + β2(Xt − ρXt-1) + εt, that is, Y*t = β*1 + β*2X*t + εt (the generalized difference equation).
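The generalized difference transformation just derived, together with the Durbin–Watson d from the preceding slides, can be sketched on simulated data. The values ρ = 0.7 and the betas are illustrative assumptions, and ρ is treated as known:

```python
import numpy as np

rng = np.random.default_rng(7)

# Y_t = 2 + 3*X_t + u_t with AR(1) errors, rho known.
n, rho = 300, 0.7
x = rng.uniform(0, 10, n)
u = np.empty(n)
u[0] = rng.normal()
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal()
y = 2.0 + 3.0 * x + u

# OLS on the levels, and the Durbin-Watson d of its residuals
X = np.column_stack([np.ones(n), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print("Durbin-Watson d:", d)             # roughly 2*(1 - rho), far below 2

# Generalized difference; note the intercept becomes beta1*(1 - rho)
ys = y[1:] - rho * y[:-1]
xs = x[1:] - rho * x[:-1]
b_gd, *_ = np.linalg.lstsq(np.column_stack([np.ones(n - 1), xs]), ys, rcond=None)
print("slope:", b_gd[1], " intercept:", b_gd[0] / (1 - rho))
```

The d statistic flags the positive autocorrelation, and OLS on the quasi-differenced variables recovers the original parameters once the intercept is rescaled by 1/(1 − ρ).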
ut = ρut-1 + εt (−1 < ρ < 1), with Yt = β1 + β2Xt + ut. We now consider two cases: ρ is known, and ρ is not known but has to be estimated. When ρ is known, the error term of the generalized difference equation satisfies the CLRM assumptions, so we can apply OLS to the transformed variables and obtain BLUE estimators.

When ρ Is Unknown
The First-Difference Method. Since ρ lies between 0 and ±1, one could start from two extreme positions: ρ = 0 (no first-order serial correlation) and ρ = ±1 (perfect positive or negative correlation). If ρ = 1, the generalized difference equation reduces to the first-difference form Yt − Yt-1 = β2(Xt − Xt-1) + (ut − ut-1), which may be appropriate when the coefficient of autocorrelation is very high. A rough rule of thumb: use the first-difference form whenever d < R2. The g statistic can be used to test the hypothesis that ρ = 1: the ut are the OLS residuals from the original regression and the et are the OLS residuals from the first-difference regression; use the Durbin–Watson tables, except that the null hypothesis is now ρ = 1. With g = 0.0012 and dL = 1.435, dU = 1.540: as the observed g lies below the lower limit of d, we do not reject the hypothesis that the true ρ = 1.
ρ Based on the Durbin–Watson d Statistic. If we cannot use the first-difference transformation because ρ is not sufficiently close to 1, then in reasonably large samples one can obtain ρ̂ ≈ 1 − d/2 and use GLS.
ρ Estimated from the Residuals. If the AR(1) scheme is valid, run the regression of ût on ût-1 with no intercept (there is no need to introduce one, as we know the OLS residuals sum to zero). An estimate of ρ is obtained and then used in the GLS estimation. Because ρ is estimated, all these methods are known as feasible GLS (FGLS) or estimated GLS (EGLS) methods.

Newey–West Method
Instead of the FGLS methods, OLS with corrected standard errors can be used in large samples. This is an extension of White's correction.
The corrected SEs are known as HAC (heteroscedasticity- and autocorrelation-consistent) standard errors. We will not present the mathematics behind them, for it is involved. As an illustration, consider the wages–productivity regression estimated with HAC standard errors.

For the savings–income model,
Yt = α1 + α2Dt + β1Xt + β2(DtXt) + ut
where Y = savings, X = income, D = 1 for 1982–1995, and D = 0 for 1970–1981. Suppose ut follows an AR(1) scheme, and use the generalized difference method to estimate the parameters. How do we transform the dummy? One can follow this procedure:
1. Set D = 0 for 1970–1981, D = 1/(1 − ρ) for 1982, and D = 1 for 1983–1995.
2. Transform Xt as (Xt − ρXt-1).
3. DtXt is zero for 1970–1981; DtXt = Xt for 1982; the remaining observations are set to (DtXt − ρDtXt-1) = (Xt − ρXt-1).

Homework 6
Basic Econometrics (Gujarati, 2003)
1. Chapter 12, Problem 26 [30 points]
2. Chapter 12, Problem 29 [15 points]
3. Chapter 12, Problem 31 [25 points]
4. Chapter 12, Problem 37 [25 points]
Assignment weight factor = 1

Outline
Basic Econometrics in Transportation: Model Specification
How does one go about finding the "correct" model? What are the consequences of specification errors? How does one detect specification errors? What remedies can one adopt, and with what benefits? How does one evaluate the performance of competing models?
Amir Samimi, Civil Engineering Department, Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Model Selection Criteria
The model that we accept as a good model should:
1. Be data admissible: predictions made from the model must be logically possible.
2. Be consistent with theory: it must make good economic sense.
3. Have weakly exogenous regressors: the explanatory variables must be uncorrelated with the error term.
4. Exhibit parameter constancy: in the absence of parameter stability, predictions will not be reliable.
5.
Exhibit data coherency: the estimated residuals must be purely random (technically, white noise).
6. Be encompassing: other models cannot be an improvement over the chosen model.

Types of Specification Errors
Suppose the true model is Yi = β1 + β2X1i + β3X2i + β4X3i + u1i.
1. Omission of a relevant variable(s): Yi = α1 + α2X1i + α3X2i + u2i, where u2i = u1i + β4X3i.
2. Inclusion of an unnecessary variable(s): Yi = λ1 + λ2X1i + λ3X2i + λ4X3i + λ5X4i + u3i, where u3i = u1i − λ5X4i.
3. Adopting the wrong functional form: lnYi = γ1 + γ2X1i + γ3X2i + γ4X3i + u4i.
4. Errors of measurement: Y*i = β*1 + β*2X*1i + β*3X*2i + β*4X*3i + u*i, with Y*i = Yi + εi and X*i = Xi + wi.
5. Incorrect specification of the stochastic error term, e.g. multiplicative versus additive: Yi = βXiui versus Yi = αXi + ui.

Specification versus Mis-specification Errors
The first four types of error are essentially model specification errors: we have in mind a "true" model but somehow do not estimate the correct one. In model mis-specification errors, we do not know what the true model is to begin with. An example is the controversy between the Keynesians and the monetarists: the monetarists give primacy to money in explaining changes in GDP, while the Keynesians emphasize the role of government expenditure, so there are two competing models. We will first consider model specification errors and then examine model mis-specification errors.

Consequences of Model Specification Errors
To keep the discussion simple, we answer this question in the context of the three-variable model and consider the first two types of specification errors discussed earlier: underfitting a model (omitting relevant variables) and overfitting a model (including unnecessary variables).

Underfitting a Model
Suppose the true model is Yi = β1 + β2X2i + β3X3i + ui, but for some reason we fit Yi = α1 + α2X2i + vi. It can be shown that the bias involves b32, the slope in the regression of X3 on X2.
b32 will be zero if X3 and X2 are uncorrelated. The consequences of omitting variable X3 are as follows:
1. If X2 and X3 are correlated, the estimated α1 and α2 are biased and inconsistent.
2. If X2 and X3 are not correlated, only the estimated α1 is biased.
3. The disturbance variance is incorrectly estimated.
4. The conventionally measured variance of the estimated α2 is biased.
5. In consequence, the hypothesis-testing procedures are likely to give misleading conclusions.
6. As another consequence, the forecasts will be unreliable.
Although the estimated α2 is biased, its variance is smaller, so there is a tradeoff between bias and efficiency. The estimated σ2 differ across the two models because their RSS and df differ. If X3 has a strong impact on Y, including it may reduce RSS more than the loss in df; inclusion of such variables will thus reduce both bias and standard error. Once a model is formulated on the basis of the relevant theory, one is ill-advised to drop a variable from it.

Overfitting a Model
Suppose the true model is Yi = β1 + β2X2i + ui, but for some reason we fit Yi = α1 + α2X2i + α3X3i + vi. The consequences of adding variable X3 are as follows:
1. The OLS estimators of the parameters of the "incorrect" model are all unbiased and consistent.
2. The error variance σ2 is correctly estimated.
3. The usual hypothesis-testing procedures remain valid.
4. The estimated α's will generally be inefficient (larger variances): inclusion of unnecessary variables makes the estimates less precise.

Tests of Specification Errors
Suppose we develop a k-variable model, Yi = β1 + β2X2i + … + βkXki + ui, but are not sure that Xk really belongs in it. One simple way to find out is to test the significance of the estimated βk with the usual t test. If we are not sure whether X3 and X4 legitimately belong in the model, this can be ascertained by the F test discussed before.
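The omitted-variable bias above can be made concrete with a small Monte Carlo sketch. The data, coefficients, and seed are hypothetical; the point is that the short regression's slope converges to β2 + β3·b32, not to β2:

```python
import numpy as np

# True model: Y = 1 + 2*X2 + 3*X3 + u, with X3 correlated with X2
rng = np.random.default_rng(1)
n = 100_000
x2 = rng.normal(size=n)
x3 = 0.5 * x2 + rng.normal(size=n)   # b32 = 0.5 (slope of X3 on X2)
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

# Underfitted model: regress Y on X2 alone
X = np.column_stack([np.ones(n), x2])
a1_hat, a2_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# plim of a2_hat = beta2 + beta3*b32 = 2 + 3*0.5 = 3.5, not 2
```

With b32 = 0 (uncorrelated regressors) the slope would be unbiased, matching consequence 2 in the list above.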
It is important to remember that in carrying out these tests of significance we have a specific model in mind. Given that model, we can find out whether one or more regressors are really relevant by the usual t and F tests. Note, however, that we should not use t and F tests to build a model iteratively: we should not conclude that Y is related to X2 only because the estimated β2 is statistically significant, then expand the model to include X3 and decide to keep that variable, and so on. This strategy of building a model is called the bottom-up approach or, by the somewhat critical term, data mining. Under data mining, for c candidate regressors out of which k are finally selected, the true level of significance (α*) is related to the nominal level of significance (α) as
α* = 1 − (1 − α)^(c/k)
The art of the applied econometrician is to allow for data-driven theory while avoiding the considerable dangers of data mining.

In practice, on the basis of theory or prior empirical work, we develop a model that we believe captures the essence of the subject under study. Then we look at some broad features of the results, such as the adjusted R2, t ratios, signs of the estimated coefficients, and the Durbin–Watson statistic. If the diagnostics do not look encouraging, we begin to look for remedies: maybe we have omitted an important variable, used the wrong functional form, or not first-differenced (to remove autocorrelation). To determine whether model inadequacy is on account of one of these problems, we can use the following methods.

Omitted Variables and Incorrect Functional Form
A Durbin–Watson test on ordered residuals:
1. From the assumed model, obtain the OLS residuals.
Omission of an important variable or an incorrect functional form
causes a plot of residuals to exhibit distinct patterns. Consider, for example, (a) a linear function Yi = λ1 + λ2Xi + u3i, (b) a quadratic function Yi = α1 + α2Xi + α3Xi² + u2i, and (c) the true total cost function Yi = β1 + β2Xi + β3Xi² + β4Xi³ + ui. As we approach the truth, the residuals become smaller and exhibit no noticeable patterns. The remaining steps:
2. If you believe the model is misspecified because it excludes a variable Z, order the residuals according to increasing values of Z. (Z could be one of the X variables included in the assumed model, or some function of such a variable.)
3. Compute the d statistic from the residuals thus ordered.
4. If the estimated d value is significant, one can accept the hypothesis of model misspecification.

A RESET-type test:
1. From the chosen model, Yi = λ1 + λ2Xi + u3i, obtain the estimated Ŷi.
2. Rerun the model introducing the estimated Ŷi in some form as additional regressors (get the idea from the plot of residuals and estimated Y): Yi = λ1 + λ2Xi + λ3Ŷi² + λ4Ŷi³ + ui.
3. Use the F test to find out whether R2 is significantly improved:
F = [(R2new − R2old)/(number of new regressors)] / [(1 − R2new)/(n − number of parameters in the new model)]
= [(0.9983 − 0.8409)/2] / [(1 − 0.9983)/(10 − 4)] = 284.4035
4. If the F value is significant, one can accept that the model is mis-specified.

A Lagrange multiplier (LM) test:
1. Estimate the restricted regression Yi = λ1 + λ2Xi + u3i by OLS and obtain the residuals.
2. If the unrestricted regression Yi = β1 + β2Xi + β3Xi² + β4Xi³ + ui is the true one, the residuals from Step 1 should be related to Xi² and Xi³.
3. Regress the estimated ûi from Step 1 on all the regressors: ûi = α1 + α2Xi + α3Xi² + α4Xi³ + vi.
4. n times the R2 from this auxiliary regression follows the chi-square distribution.
5. If the chi-square value exceeds the critical value, we reject the restricted regression.

Errors of Measurement
We assume that the data are "accurate".
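The incremental F statistic in Step 3 is just arithmetic on the two R2 values, and can be sketched directly (the slide reports 284.4035 from unrounded R2 values; the rounded values shown give roughly 278):

```python
def f_incremental(r2_new, r2_old, n_obs, k_new, m):
    """F = ((R2_new - R2_old)/m) / ((1 - R2_new)/(n_obs - k_new)),
    where m regressors were added and the augmented model has k_new
    parameters."""
    return ((r2_new - r2_old) / m) / ((1 - r2_new) / (n_obs - k_new))

# Slide's example: n = 10 observations, 4 parameters after adding
# Yhat^2 and Yhat^3 (m = 2)
F = f_incremental(0.9983, 0.8409, n_obs=10, k_new=4, m=2)
```

Either way the F value is far beyond any conventional critical value, so the restricted (linear) model is rejected.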
That is, the data are not guess estimates, extrapolated, rounded off, etc., in any systematic manner. Unfortunately, this ideal is not usually met in practice: there are non-response errors, reporting errors, and computing errors. Whatever the reason, measurement error is a potentially troublesome problem and forms another example of specification bias. It will be discussed in two parts: errors of measurement in the dependent variable Y, and errors of measurement in the explanatory variable X.

Errors of Measurement in Y
Consider the model Y*i = α + βXi + ui, where Y*i = permanent consumption expenditure and Xi = current income. Since Y*i is not directly measurable, we use Yi such that Yi = Y*i + εi. Therefore we estimate
Yi = (α + βXi + ui) + εi = α + βXi + vi
where vi is a composite error term made up of the population disturbance and the measurement error. For simplicity assume that E(ui) = E(εi) = 0; cov(Xi, ui) = 0 (the CLRM assumption); cov(Xi, εi) = 0, i.e., the errors of measurement in Y*i are uncorrelated with Xi; and cov(ui, εi) = 0, i.e., the equation error and the measurement error are uncorrelated. With these assumptions, β estimated from either equation is an unbiased estimator of the true β. The standard errors differ, however: the error variance in the second model is larger than in the first. So although errors of measurement in Y still give unbiased estimates of the parameters, the estimated variances are larger than when there are no such errors.

Errors of Measurement in X
Consider the model Yi = α + βX*i + ui. Since X*i is not directly measurable, we use Xi such that Xi = X*i + wi. Therefore we estimate
Yi = α + β(Xi − wi) + ui = α + βXi + zi
where zi = ui − βwi is a compound of the equation and measurement errors.
Even if we assume that wi has zero mean, is serially independent, and is uncorrelated with ui, we can no longer assume cov(zi, Xi) = 0: it can be shown that cov(zi, Xi) = −βσ²w. A crucial CLRM assumption is violated. Since the explanatory variable and the error term are correlated, the OLS estimators are biased and inconsistent.

What is the solution? The answer is not easy. If σ²w is small compared to σ²X*, we can "assume the problem away" for practical purposes; the rub is that there is no way to judge their relative magnitudes. Another remedy is the use of instrumental or proxy variables that, although highly correlated with X, are uncorrelated with the equation and measurement error terms. Above all, measure the data as accurately as possible.

Incorrect Specification of the Error Term
Since the error term is not directly observable, there is no easy way to determine the form in which it enters the model. Consider Yi = βXiui and Yi = αXi + ui. Assume the multiplicative form is the "correct" model, with ln ui ∼ N(0, σ²), but we estimate the additive one. It can be shown that ui is then log-normal, with mean e^(σ²/2) and variance e^(σ²)(e^(σ²) − 1), and thus E(α̂) = β e^(σ²/2): the estimated α is biased, as its expectation is not equal to the true β.

Nested versus Non-Nested Models
Two models are nested if one can be derived as a special case of the other:
Model A: Yi = β1 + β2X2i + β3X3i + β4X4i + β5X5i + ui
Model B: Yi = β1 + β2X2i + β3X3i + ui
We have already looked at tests of nested models. For tests of non-nested hypotheses there are two approaches. The discrimination approach: given competing models, one chooses a model based on some criterion of goodness of fit (R2, adjusted R2, Akaike's information criterion, etc.). The discerning approach: in investigating one model, we take into account information provided by other models
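The attenuation result for errors of measurement in X, plim β̂ = β·σ²X*/(σ²X* + σ²w), can be checked by simulation. The data and seed are hypothetical:

```python
import numpy as np

# Errors-in-variables simulation: the OLS slope on the error-ridden
# regressor is biased toward zero.
rng = np.random.default_rng(2)
n, beta = 200_000, 2.0
x_star = rng.normal(size=n)            # true regressor, variance 1
w = rng.normal(size=n)                 # measurement error, variance 1
x_obs = x_star + w                     # observed regressor
y = 1.0 + beta * x_star + rng.normal(size=n)

slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
# Expected plim: beta * 1/(1 + 1) = 1.0, half of the true beta = 2
```

No matter how large the sample, the estimate stays near 1, not 2, illustrating inconsistency rather than mere finite-sample bias.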
(the non-nested F test, Davidson–MacKinnon J test, Cox test, JA test, P test, Mizon–Richard encompassing test, etc.).

The Non-Nested F Test
Consider:
Model C: Yi = α1 + α2X2i + α3X3i + ui
Model D: Yi = β1 + β2Z2i + β3Z3i + vi
Estimate the nested or hybrid model Yi = λ1 + λ2X2i + λ3X3i + λ4Z2i + λ5Z3i + wi and use the F test. If Model C is correct, λ4 = λ5 = 0, whereas if Model D is correct, λ2 = λ3 = 0. Problems with this procedure: if the X's and Z's are highly correlated, we have no way of deciding which one is the correct model; to test the significance of an incremental contribution, one must choose Model C or D as the reference, and the choice of the reference hypothesis can determine the outcome; and the artificially nested model may not have any economic meaning.

Davidson–MacKinnon J Test
1. Estimate Model D and obtain the estimated Y values.
2. Add the estimated Y from Model D as a regressor to Model C and estimate the augmented model. (This model is an example of the encompassing principle.)
3. Using the t test, test the hypothesis that α4, the coefficient on the added fitted values, is zero.
4. If this hypothesis is not rejected, we accept Model C as the true model, because the variables not included in Model C have no additional explanatory power beyond that contributed by Model C. If the null hypothesis is rejected, Model C cannot be the true model.
5. Reverse the roles of the hypotheses, i.e., of Models C and D.
Some problems of the J test: there is no clear answer if the test leads to acceptance or rejection of both models, and the J test may not be very powerful in small samples.

Model Selection Criteria
We distinguish between in-sample and out-of-sample forecasting. In-sample forecasting tells us how the model fits the data in a given sample; out-of-sample forecasting tries to determine how a model forecasts the future.

The R2 Criterion
R2 is defined as ESS/TSS. It necessarily lies between 0 and 1, and the closer it is to 1, the better the fit.
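The J-test steps can be sketched with plain OLS. This is a simulated illustration (data, seed, and names are hypothetical) in which Model C is true, so the coefficient on Model D's fitted values should be insignificant:

```python
import numpy as np

def ols_fit(X, y):
    """OLS coefficients, fitted values, and coefficient t statistics."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, X @ b, b / se

rng = np.random.default_rng(3)
n = 500
x2 = rng.normal(size=n)          # Model C's regressor (the true one here)
z2 = rng.normal(size=n)          # Model D's unrelated regressor
y = 1.0 + 2.0 * x2 + rng.normal(size=n)

# Step 1: estimate Model D and keep its fitted values
_, yhat_D, _ = ols_fit(np.column_stack([np.ones(n), z2]), y)
# Steps 2-3: add yhat_D to Model C and t-test its coefficient
_, _, t_stats = ols_fit(np.column_stack([np.ones(n), x2, yhat_D]), y)
t_alpha4 = t_stats[-1]   # if insignificant, Model C is not rejected
```

Step 5 would repeat the procedure with the roles of C and D reversed.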
Several criteria are used for this purpose:
1. R2
2. Adjusted R2
3. Akaike information criterion (AIC)
4. Schwarz information criterion (SIC)
5. Mallows's Cp criterion
6. Forecast χ2 (chi-square)
Criteria 2 to 5 embody a tradeoff between goodness of fit and a model's complexity (as judged by the number of X's).

Problems with R2:
1. It measures in-sample goodness of fit; there is no guarantee that the model will forecast well out of sample.
2. In comparing two or more R2's, the dependent variable must be the same.
3. An R2 cannot fall when variables are added to the model, so there is every temptation to play the game of "maximizing the R2". But adding variables may increase R2 while also increasing the variance of the forecast error.

Adjusted R2
The adjusted R2 is a better measure than R2, as it penalizes for adding more X's. Again, keep in mind that Y must be the same for the comparison to be valid.

Akaike Information Criterion (AIC)
AIC imposes a harsher penalty than the adjusted R2 for adding X's. In comparing two or more models, the model with the lowest AIC is preferred. One advantage of AIC is that it is useful not only for in-sample but also for out-of-sample forecasting performance of a regression model; it is also useful for both nested and non-nested models.

Schwarz Information Criterion (SIC)
SIC imposes a harsher penalty than AIC. Like AIC, the lower the value of SIC, the better the model, and SIC can be used to compare in-sample or out-of-sample forecasting performance.

Mallows's Cp Criterion
Mallows has developed a criterion for model selection based on RSSp, the residual sum of squares using p regressors. If the model with p regressors does not suffer from lack of fit, it can be shown that E(RSSp) = (n − p)σ², and approximately E(Cp) ≈ p.
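Gujarati's forms of AIC and SIC can be computed directly from the RSS; the comparison below uses hypothetical numbers for a model before and after adding one regressor that barely lowers the RSS:

```python
import numpy as np

def aic(rss, n, k):
    """Gujarati's form of Akaike's criterion: AIC = e^(2k/n) * RSS/n."""
    return np.exp(2.0 * k / n) * rss / n

def sic(rss, n, k):
    """Gujarati's form of Schwarz's criterion: SIC = n^(k/n) * RSS/n."""
    return n ** (k / n) * rss / n

# Hypothetical comparison: the extra regressor cuts RSS from 100 to 99.5
n = 50
aic_small, aic_big = aic(100.0, n, 2), aic(99.5, n, 3)
sic_small, sic_big = sic(100.0, n, 2), sic(99.5, n, 3)
# Both criteria prefer the smaller model; SIC's per-regressor penalty
# factor n^(1/n) exceeds AIC's e^(2/n) once n > e^2.
```

Lower values are better for both, so here the marginal regressor would be dropped under either criterion.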
In practice, one usually plots Cp against p and looks for a model that has a low Cp value, about equal to p.

Forecast Chi-Square
Suppose we have a regression model based on n observations and we want to use it to forecast an additional t observations. If we hypothesize that the parameter values have not changed between the sample and post-sample periods, it can be shown that the forecast χ2 statistic follows the chi-square distribution with t degrees of freedom. The forecast χ2 test has weak statistical power, meaning that the probability that it will correctly reject a false null hypothesis is low; it should therefore be used as a signal rather than as a definitive test.

Ten Commandments
1. Thou shalt use common sense and economic theory.
2. Thou shalt ask the right questions.
3. Thou shalt know the context (no ignorant statistical analysis).
4. Thou shalt inspect the data.
5. Thou shalt not worship complexity.
6. Thou shalt look long and hard at thy results.
7. Thou shalt beware the costs of data mining.
8. Thou shalt be willing to compromise.
9. Thou shalt not confuse statistical with practical significance.
10. Thou shalt anticipate criticism.

Homework 7
Basic Econometrics (Gujarati, 2003)
1. Chapter 13, Problem 21 [35 points]
2. Chapter 13, Problem 22 [25 points]
3. Chapter 13, Problem 25 [40 points]
Assignment weight factor = 1

Outline
Basic Econometrics in Transportation: Behavioral Models
Features of discrete choice models. What is a choice set? Utility-maximizing behavior. The most prominent types of discrete choice models: logit, generalized extreme value (GEV), probit, and mixed logit. How the models are used for forecasting over time.
Amir Samimi,
Civil Engineering Department, Sharif University of Technology
Primary Source: Discrete Choice Methods with Simulation (Train)

Discrete Choice versus Regression
With regression models, the dependent variable is continuous, with an infinite number of possible outcomes (e.g., how much money to save). When there is an infinite number of alternatives, discrete choice models cannot be applied. Often regressions examine choices of "how much" and discrete choice models examine choices of "which," though this distinction is not exact: discrete choice models have also been used to examine choices of "how much." Whether to use a regression or a discrete choice model is a specification issue. Usually a regression model is more natural and easier; a discrete choice model is used only if there are compelling reasons.

The Choice Set
The set of alternatives within a discrete choice framework needs to exhibit three characteristics:
1. The alternatives must be mutually exclusive: the decision maker chooses only one alternative from the choice set.
2. The choice set must be exhaustive: the decision maker necessarily chooses one of the alternatives.
3. The number of alternatives must be finite.
The appropriate specification of the choice set is governed largely by the goals of the research and the data that are available to the researcher.

Example
Consider households' choice among heating fuels: natural gas, electricity, oil, and wood. These alternatives violate both mutual exclusivity (a household can have two types) and exhaustiveness (a household can have no heating).

Utility-Maximizing Behavior
Discrete choice models are usually derived under an assumption of utility-maximizing behavior by the decision maker: the net benefit of each alternative is a psychological stimulus that is interpreted as utility.
Models that can be derived in this way are called random utility models (RUMs). Note that models derived from utility maximization can also be used to represent decision making that does not entail utility maximization: the models can be seen as simply describing the relation of explanatory variables to the outcome of a choice, without reference to exactly how the choice is made.

How do we handle the choice-set issues in the heating-fuel example? To obtain mutually exclusive alternatives: (1) list every possible combination of fuels as an alternative, or (2) define the choice as the choice among fuels for the "primary" heating source. To obtain exhaustive alternatives: (1) include "no heating" as an alternative, or (2) redefine the choice as the choice of heating fuel conditional on having heating. The first two conditions can usually be satisfied in this way; in contrast, the third condition (finiteness) is actually restrictive.

Notation
A decision maker, labeled n, faces a choice among J alternatives. The utility that decision maker n obtains from alternative j is Unj. Decision rule: the decision maker chooses the alternative that provides the greatest utility, i.e., chooses alternative i if and only if Uni > Unj ∀ j ≠ i. Note that this utility is known to the decision maker but not to the researcher.

Derivation of Choice Probabilities
You do not observe the decision maker's utility, but rather some attributes of the alternatives, xnj, and some attributes of the decision maker, sn. You then specify a representative utility function, Vnj = V(xnj, sn). There are aspects of utility that the researcher does not see: Unj = Vnj + εnj. These terms are treated as random, since the researcher does not know εnj. The joint PDF of the random vector εn = (εn1, ... , εnJ) is denoted f(εn).
The probability that decision maker n chooses alternative i is:
Pni = Prob(Uni > Unj ∀ j ≠ i)
= Prob(Vni + εni > Vnj + εnj ∀ j ≠ i)
= Prob(εnj − εni < Vni − Vnj ∀ j ≠ i)
= ∫ε I(εnj − εni < Vni − Vnj ∀ j ≠ i) f(εn) dεn
where I(·) is the indicator function. This is a multidimensional integral over the density of εn. Different discrete choice models are obtained from different specifications of the unobserved portion of utility. Logit and nested logit have closed-form expressions for this integral; they are derived under the assumption that εn is distributed iid extreme value and a type of generalized extreme value, respectively. Probit is derived under the assumption that f is multivariate normal. Mixed logit is based on the assumption that the unobserved portion of utility consists of a part that follows any distribution specified by the researcher plus a part that is iid extreme value.

Meaning of Choice Probabilities
Consider a person who can take car or bus to work. The researcher observes the time (T) and cost (M) that the person would incur under each mode: Vc = αTc + βMc and Vb = αTb + βMb. There are other factors that affect the person's choice. If it turns out that Vc = 4 and Vb = 3, car is better for this person by 1 unit on observed factors, but that does not mean the person chooses car. Specifically, the person will choose bus if the unobserved portion of utility is higher for bus than for car by at least 1 unit:
Pc = Prob(εb − εc < 1) and Pb = Prob(εb − εc > 1)
What is meant by the distribution of εn? The density is the distribution of unobserved utility within the population of people who face the same observed utility. A quick preview of logit, GEV, probit, and mixed logit is useful to show how they relate to the general derivation of all choice models.
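The choice-probability integral above can be approximated by simulation: draw the unobserved utilities, apply the indicator, and average. The sketch below uses the car/bus example with (for illustration only) iid standard normal errors, which makes the analytic answer easy to check:

```python
import numpy as np

# P_ni = E[ I(V_i + eps_i > V_j + eps_j for all j != i) ]
rng = np.random.default_rng(4)
V = np.array([4.0, 3.0])           # representative utilities: car, bus
R = 200_000                        # simulation draws

eps = rng.normal(size=(R, 2))      # assumed iid N(0,1) unobserved utility
choice = np.argmax(V + eps, axis=1)
P_car = np.mean(choice == 0)
P_bus = np.mean(choice == 1)
# Analytic check: eps_b - eps_c ~ N(0, 2), so
# P_car = Prob(eps_b - eps_c < 1) = Phi(1/sqrt(2)) ≈ 0.760
```

With extreme value draws instead of normal ones, the same simulation would approximate the logit probabilities.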
Pni is the share of people who choose alternative i within the population of people who face the same observed utility for each alternative as person n. The distribution can also be considered in subjective terms: Pni is the probability that the researcher assigns to the person's choosing alternative i, given the researcher's ideas about the unobserved utility. Or the distribution can represent the effect of factors that are quixotic to the decision maker (aspects of bounded rationality): Pni is the probability that these quixotic factors induce the person to choose alternative i, given the observed, realistic factors.

Specific Models
Logit: the most widely used discrete choice model. It is derived under the assumption that εni is iid extreme value for all i: unobserved factors are uncorrelated, with the same variance over alternatives. This independence can be unrealistic; a person who dislikes travel by bus because of the presence of other riders might have a similar reaction to rail travel.
GEV: designed to avoid the independence assumption within a logit. A comparatively simple GEV model places the alternatives into nests, with unobserved factors having the same correlation for all alternatives within a nest and no correlation for alternatives in different nests.
Probit: unobserved factors are distributed jointly normal. With a full covariance matrix, any pattern of correlation and heteroskedasticity can be accommodated. One limitation: a customer's willingness to pay for a desirable attribute of a product is necessarily positive, while the normal distribution has density on both sides of zero.
Mixed logit: allows the unobserved factors to follow any distribution. The unobserved factors are decomposed into a part that contains all the correlation and heteroskedasticity, and another part that is iid extreme value.
Other models are often obtained by combining concepts from these; a mixed probit, for example, has a similar concept to the mixed logit.

Identification of Choice Models
Several aspects of the behavioral decision process affect the specification and estimation of any discrete choice model. The issues can be summarized in two statements: "only differences in utility matter" and "the scale of utility is arbitrary."

Differences in Utility Matter
The level of utility doesn't matter:
Pni = Prob(Uni − Unj > 0 ∀ j ≠ i)
This has several implications for the identification and specification of discrete choice models: the only parameters that can be estimated are those that capture differences across alternatives. This general statement takes several forms: alternative-specific constants, sociodemographic variables, and the number of independent error terms.

Alternative-Specific Constants
It is often reasonable to specify the observed part of utility to be linear in parameters with a constant. The alternative-specific constant for an alternative captures the average effect on utility of all factors that are not included in the model. Since alternative-specific constants force the εnj to have a zero mean, it is reasonable to include a constant in Vnj for each alternative. The researcher must, however, normalize the absolute levels of the constants: it is impossible to estimate the constants themselves, since an infinite number of constant vectors (any values with the same differences) result in the same choices. The standard procedure is to normalize one of the constants to zero. In our example:
Uc = αTc + βMc + εc and Ub = αTb + βMb + kb + εb
With J alternatives, at most J − 1 constants can enter the model.

Sociodemographic Variables
Attributes of the decision maker do not vary over alternatives.
The same identification issue arises. Consider the effect of a person's income on the mode choice decision, and suppose that a person's utility is higher with higher income (Y):
Uc = αTc + βMc + θ0cY + εc and Ub = αTb + βMb + θ0bY + kb + εb
Since only differences matter, the absolute levels of the θ's cannot be estimated. To set the level, one of these parameters is normalized to zero:
Uc = αTc + βMc + εc and Ub = αTb + βMb + θbY + kb + εb
Sociodemographic variables can also enter utility in other ways; for example, cost is often divided by income, and then there is no need to normalize the coefficients.

Number of Independent Error Terms
Remember that Pni = ∫ε I(εnj − εni < Vni − Vnj ∀ j ≠ i) f(εn) dεn. This probability is a J-dimensional integral, but with J errors there are only J − 1 error differences, so the probability can be expressed as a (J − 1)-dimensional integral over g(·), the density of the error differences. For any f, the corresponding g can be derived. This normalization occurs automatically in logit; it must be performed explicitly in probit.

Overall Scale of Utility Is Irrelevant
The alternative with the highest utility is the same no matter how utility is scaled: the model U0nj = Vnj + εnj is equivalent to U1nj = λVnj + λεnj for any λ > 0. To take account of this fact, the researcher must normalize the scale of utility; the standard way is to normalize the variance of the error terms. We will discuss normalization with iid errors, with heteroskedastic errors, and with correlated errors.

Normalization with IID Errors
With the iid assumption, normalization for scale is straightforward. Consider U0nj = x′njβ + ε0nj where Var(ε0nj) = σ². The scale of utility and the variance of the error terms are linked by definition:
Var(λεnj) = λ²Var(εnj)
The original model is equivalent to U1nj = x′nj(β/σ) + ε1nj where Var(ε1nj) = 1. As we will see, the error variances in a standard logit model are traditionally normalized to π²/6, which is about 1.6. Interpretation of results must take the normalization into account. Suppose that in a mode choice model the estimated cost coefficient is −0.55 from a logit and −0.45 from a probit model; it is incorrect to say that the logit model implies more sensitivity to cost, because the two models assume different error variances. The same issue arises when a model is estimated on different data sets. Suppose we estimate cost coefficients of −0.55 for Chicago and −0.81 for Boston. This scale difference means that unobserved utility has less variance in Boston: other factors have less effect on people in Boston.

Normalization with Heteroskedastic Errors
The variance of the error terms can differ across segments of the population. One sets the overall scale of utility by normalizing the variance for one segment, and then estimates the variance of each other segment relative to that one. In our Chicago and Boston mode choice model:
Unj = αTnj + βMnj + εBnj ∀ n in Boston
Unj = αTnj + βMnj + εCnj ∀ n in Chicago
Define k = Var(εBnj)/Var(εCnj). The scale parameter k is estimated along with β and α. The variance of the error term can differ over geographic regions, data sets, time, or other factors.

Normalization with Correlated Errors
When errors are correlated, normalizing the variance of the error for one alternative is not sufficient to set the scale of utility differences. Consider the utility for four alternatives, Unj = Vnj + εnj, where the error vector εn = (εn1, εn2, εn3, εn4) has zero mean and a full covariance matrix. Since only differences in utility matter, this model is equivalent to one in which all utilities are differenced from, say, the first alternative.
Normalization with Correlated Errors (continued)
- The variance of each error difference depends on the variances and covariances of the original errors. Setting the variance of one of the original errors (say σ11 = k) is therefore not sufficient to set the variance of the error differences.
- Instead, normalize the variance of one of the error differences to some number, and express the covariance matrix in terms of the error differences. On recognizing that only differences matter and that the scale of utility is arbitrary, the number of covariance parameters drops from ten to five.

Aggregation
- Discrete choice models operate at the individual level, but the researcher is usually interested in some aggregate measure.
- In linear regression models, estimates of aggregate Y are obtained by inserting aggregate values of the X's.
- Discrete choice models are not linear in the explanatory variables, so inserting aggregate values of the X's does not provide an unbiased estimate of the average response. Estimating the average response by calculating derivatives and elasticities at the average of the explanatory variables is similarly problematic.
- Two approaches work: sample enumeration and segmentation.

Sample Enumeration
- The most straightforward, and by far the most popular, approach: the choice probabilities of each decision maker in a sample are summed, or averaged, over decision makers.
- Each sampled decision maker n has a weight wn associated with him, representing the number of decision makers in the population who are similar to him.
- A consistent estimate of the total number of decision makers in the population who choose i is Σn wn·Pni.
- Average derivatives and elasticities are obtained similarly, by calculating the derivative or elasticity for each sampled person and taking the weighted average.
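The sample-enumeration estimate Σn wn·Pni can be sketched in a few lines. The utilities and expansion weights below are invented for illustration:

```python
import numpy as np

def logit_probs(V):
    e = np.exp(V - V.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

# hypothetical sample: 4 decision makers, 3 alternatives, with expansion weights
V = np.array([[1.0, 0.5, 0.0],
              [0.2, 1.1, 0.3],
              [0.0, 0.0, 1.5],
              [0.8, 0.8, 0.8]])
w = np.array([100.0, 250.0, 50.0, 600.0])   # people each observation represents

P = logit_probs(V)
N_hat = w @ P    # estimated number in the population choosing each alternative

# the alternative-specific totals must exhaust the weighted population
assert np.isclose(N_hat.sum(), w.sum())
```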
Segmentation
- Appropriate when the number of explanatory variables is small and those variables take only a few values.
- Consider a model with only two variables in the utility function: education and gender. With four education levels for each of the two genders, the total number of distinct types of decision makers (segments) is eight.
- Choice probabilities vary only over the segments, not over individuals within each segment, so aggregate outcome variables can be estimated by calculating the choice probability for each segment and taking the weighted sum of these probabilities.

Forecasting
- The procedures described earlier for aggregate variables are applied, with the exogenous variables and/or the weights adjusted to reflect changes that are anticipated over time.
- Recalibration of constants: to reflect that unobserved factors differ for the forecast area or year, an iterative process is used to recalibrate the alternative-specific constants. Let
  α0j = the estimated alternative-specific constant for alternative j,
  Sj = the share of decision makers in the forecast area who choose alternative j in the base year,
  Ŝj = the predicted share of decision makers in the forecast area who choose alternative j.
  Adjust the alternative-specific constants by adding ln(Sj/Ŝj). With the new constants, predict the shares again, compare with the actual shares, and if needed adjust the constants again.

Basic Econometrics in Transportation: Logit Models
Amir Samimi

Choice Probabilities
- The utility that the decision maker obtains is Unj = Vnj + εnj ∀ j.
- The logit model is obtained by assuming that each εnj is IID extreme value. The PDF and CDF of each unobserved component of utility are
  f(εnj) = e^(−εnj)·exp(−e^(−εnj)),  F(εnj) = exp(−e^(−εnj)).
- The variance of this distribution is π²/6.
- The difference between two extreme value variables, ε∗nji = εnj − εni, is distributed logistic:
  F(ε∗nji) = e^(ε∗nji) / (1 + e^(ε∗nji)).
- The key assumption is not so much the shape of the distribution as that the errors are independent of each other.
Civil Engineering Department, Sharif University of Technology
Primary Source: Discrete Choice Methods with Simulation (Train)

- One should specify utility well enough that a logit model is appropriate, i.e., the remaining error term is white noise.
- If the researcher thinks that the unobserved portion of utility is correlated over alternatives given her specification of representative utility, she has three options:
  - Use a different model that allows for correlated errors,
  - Re-specify representative utility so that the source of the correlation is captured explicitly and the remaining errors are independent,
  - Use the logit model under the current specification of representative utility, considering the model to be an approximation.

Derivation of Logit Choice Probabilities
- For a given εni, the probability of choosing i is the probability that every other error is low enough, which by independence is the product of the CDFs:
  Pni | εni = Π(j≠i) exp(−e^(−(εni + Vni − Vnj))).
- Of course, εni is not given, so the choice probability is the integral of Pni | εni over all values of εni, weighted by its density:
  Pni = ∫ [Π(j≠i) exp(−e^(−(εni + Vni − Vnj)))] e^(−εni) exp(−e^(−εni)) dεni.
- Some algebraic manipulation of this integral yields the closed form
  Pni = e^(Vni) / Σj e^(Vnj).

Properties of Logit Probabilities
- Pni is necessarily between zero and one, and the logit probability for an alternative is never exactly zero.
- The choice probabilities for all alternatives sum to one.
- The relation of the logit probability to representative utility is sigmoid (S-shaped). This shape is shared by most discrete choice models and has important implications for policy makers.

Scale Parameter
- Setting the error variance to π²/6 is equivalent to normalizing the model for the scale of utility.
- Utility can be expressed as U∗nj = Vnj + ε∗nj, which becomes Unj = Vnj/σ + εnj after rescaling by the error scale σ. The choice probability becomes
  Pni = e^(Vni/σ) / Σj e^(Vnj/σ).
- Only the ratio β∗/σ can be estimated; β∗ and σ are not separately identified.
- A larger variance in unobserved factors leads to smaller estimated coefficients, even if the observed factors have the same effect on utility.
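The closed form Pni = e^(Vni)/Σj e^(Vnj) can be checked against a direct simulation of the extreme value errors. The utilities below are illustrative:

```python
import numpy as np

V = np.array([1.0, 0.5, -0.2])          # hypothetical representative utilities

# closed form: P_i = exp(V_i) / sum_j exp(V_j)
P_formula = np.exp(V) / np.exp(V).sum()

# simulate: add iid extreme-value (Gumbel) errors and count which is largest
rng = np.random.default_rng(42)
R = 200_000
draws = V + rng.gumbel(size=(R, 3))
P_sim = np.bincount(draws.argmax(axis=1), minlength=3) / R

assert np.allclose(P_formula, P_sim, atol=0.01)   # simulation matches the formula
assert np.isclose(P_formula.sum(), 1.0)
```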
- The ratio of coefficients has economic meaning and is easily interpretable: willingness to pay, value of time, and other marginal rates of substitution are not affected by the scale parameter, since β1/β2 = β∗1/β∗2.

Power and Limitations of Logit
Three topics explain the power and limitations of logit models:
- Taste variation: logit can represent systematic taste variation but not random taste variation.
- Substitution patterns: the logit model implies proportional substitution across alternatives, given the researcher's specification of representative utility.
- Repeated choices over time (panel data): if unobserved factors are independent over time in repeated choice situations, then logit can capture the dynamics of repeated choice. However, logit cannot handle situations where unobserved factors are correlated over time.

Taste Variation
- Logit can only represent systematic taste variation. Consider a choice among makes and models of cars to buy, with two observed attributes: purchase price (PP) and shoulder room (SR). Utility is written as
  Unj = αn·SRj + βn·PPj + εnj.
- Suppose the taste for shoulder room varies only with household size (M), αn = ρMn, and price sensitivity is inversely related to income (I), βn = θ/In. Substituting these relations:
  Unj = ρ(Mn·SRj) + θ(PPj/In) + εnj,
  a standard logit model in which taste variation is purely systematic.
- Now suppose αn = ρMn + μn and βn = θ/In + ηn, where μn and ηn are random. Then
  Unj = ρ(Mn·SRj) + θ(PPj/In) + ε̃nj,  with  ε̃nj = μn·SRj + ηn·PPj + εnj.
- The new error term cannot possibly be distributed independently and identically as required for the logit formulation, since it varies with SRj and PPj.

Substitution Patterns
- An increase in the probability of one alternative necessarily means a decrease in the probability of other alternatives. What matters is an alternative's market share and where that share comes from.
- Logit implies a specific substitution pattern, the IIA (independence from irrelevant alternatives) property. While the IIA property is realistic in some choice situations, it is clearly inappropriate in others.
- If proportional substitution actually occurs, the logit model is appropriate. To allow for more general patterns, more flexible models are needed.

IIA (Independence from Irrelevant Alternatives)
- Red-bus–blue-bus problem: assume the choice probabilities of car and blue bus are equal, Pc = Pbb = 0.5. Now a red bus is introduced that is exactly like the blue bus, so Prb/Pbb = 1. In logit, Pc/Pbb is the same whether or not the red bus exists. The only probabilities satisfying Pc/Pbb = 1 and Prb/Pbb = 1 are Pc = Pbb = Prb = 1/3. In real life, we would expect Pc = 1/2 and Pbb = Prb = 1/4.
- A new express bus is introduced along a line that already has standard bus service. This new mode might be expected to reduce the probability of regular bus by a greater proportion than it reduces the probability of car. The logit model over-predicts transit share in this situation.

Proportional Substitution
- Consider changing an attribute znj of alternative j. We want to know the effect of this change on the probabilities of all other alternatives. The cross-elasticity of Pni with respect to znj is
  −βz·znj·Pnj,
  which is the same for all i ≠ j.
- An improvement in the attributes of an alternative therefore reduces the probabilities of all the other alternatives by the same percentage. This pattern of substitution is a manifestation of the IIA property.
- Electric car example: the shares of large gas, small gas, and small electric cars are 0.66, 0.33, and 0.01. We want to subsidize electric cars and increase their share to 0.10. By logit, the probability for the large gas car would drop by about ten percent, from 0.66 to 0.60, and that for the small gas car would drop by the same percentage, from 0.33 to 0.30. This pattern of substitution is clearly unrealistic.

Advantages of IIA
- It is possible to estimate model parameters consistently on a subset of alternatives for each sampled decision maker. Since relative probabilities within a subset of alternatives are unaffected by the attributes or existence of alternatives not in the subset, excluding alternatives in estimation does not affect the consistency of the estimator.
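The red-bus–blue-bus arithmetic can be reproduced directly with a logit on equal representative utilities:

```python
import numpy as np

def logit(V):
    e = np.exp(V)
    return e / e.sum()

V_before = np.array([0.0, 0.0])         # car, blue bus: equal utility
V_after = np.array([0.0, 0.0, 0.0])     # add a red bus identical to the blue bus

P_before = logit(V_before)
P_after = logit(V_after)

assert np.allclose(P_before, [0.5, 0.5])
# logit forces 1/3 each, not the intuitive 1/2, 1/4, 1/4
assert np.allclose(P_after, [1/3, 1/3, 1/3])
```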
- At an extreme, the number of alternatives might be so large as to preclude estimation altogether if it were not possible to use a subset of alternatives.
- Estimation on a subset is also useful when one is interested in examining choices among a subset of alternatives and not among all alternatives: for example, choice only between the car and bus modes for travel to work. If IIA holds, one can exclude persons who walked, bicycled, etc.

Tests of IIA
Whether IIA holds in a particular setting is an empirical question, open to statistical investigation. Two types of tests are suggested:
1. Re-estimate the model on a subset of the alternatives. Under IIA, the ratio of probabilities for any two alternatives is the same, so the parameter estimates should not change appreciably.
2. Re-estimate the model with new, cross-alternative variables: variables from one alternative entering the utility of another alternative. If Pni/Pnk depends on alternative j (no IIA), attributes of alternative j will enter significantly into the utility of alternatives i or k within a logit specification.
The introduction of non-IIA models has made testing IIA easier than before.

Panel Data
- Data that represent repeated choices, such as data on current and past vehicle purchases of sampled households, are called panel data.
- The logit model can be used to examine panel data if the unobserved factors are independent over the repeated choices: each choice situation by each decision maker becomes a separate observation.
- Dynamic aspects of behavior: the dependent variable in previous periods can be entered as an explanatory variable. Inclusion of the lagged variable does not induce inconsistency in estimation, since for a logit model the errors are assumed to be independent over time.
- The assumption of independent errors over time is severe. Alternatives:
- Use a model that allows unobserved factors to be correlated over time, or
- Re-specify utility to bring the sources of the unobserved dynamics into the model.

Consumer Surplus
- A person's consumer surplus is the utility, in dollar terms, that he receives in the choice situation: CSn = (1/αn)·maxj(Unj), where αn is the marginal utility of income.
- This matters for policy analysis: if a light rail system is being considered in a city, it is important to measure the benefits of the project to see whether they justify the costs.
- Expected consumer surplus in a logit model is simply, up to an additive constant, the log of the denominator of the choice probability (the "log-sum term"), scaled by the marginal utility of income:
  E(CSn) = (1/αn)·ln(Σj e^(Vnj)) + C.
- This resemblance to the choice-probability denominator has no economic meaning; it is just an outcome of the mathematical form of the extreme value distribution.

Derivatives and Elasticities
- Derivatives answer: how do probabilities change with a change in some observed factor?
- Response is usually measured by elasticity rather than derivative, since elasticity is normalized for the variable's units.

Estimation
There are various estimation methods under different sampling procedures. We discuss estimation under the most prominent of these sampling schemes:
- Estimation when the sample is exogenous,
- Estimation on a subset of alternatives,
- Estimation with certain types of choice-based (i.e., non-exogenous) samples.

Estimation on an Exogenous Sample
Assume:
- The sample is random or stratified random.
- The explanatory variables are exogenous to the choice situation; that is, the variables entering representative utility are independent of the unobserved component.
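The log-sum consumer-surplus calculation described above can be sketched in a few lines, with invented mode utilities and a hypothetical marginal utility of income:

```python
import numpy as np

alpha_inc = 0.2                          # hypothetical marginal utility of income ($^-1)
V_before = np.array([1.2, 0.4])          # car, bus
V_after = np.array([1.2, 0.4, 0.9])      # add light rail

logsum = lambda V: np.log(np.exp(V).sum())

# expected benefit per person, in dollars, of adding the new alternative
delta_cs = (logsum(V_after) - logsum(V_before)) / alpha_inc

# adding an alternative can never lower expected consumer surplus
assert delta_cs > 0
```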
- Maximum-likelihood procedures can be applied to logit. The probability of person n choosing the alternative that he was observed to choose can be expressed as
  Πi (Pni)^yni,
  where yni = 1 if person n chose i and zero otherwise.
- Assuming independence over decision makers, the likelihood of the sample and its log are
  L(β) = Πn Πi (Pni)^yni,  LL(β) = Σn Σi yni·ln(Pni).
- The first-order optimality condition is
  Σn Σi (yni − Pni)·xni = 0.
  The MLE of β therefore makes the predicted average of each explanatory variable equal to its observed average in the sample; equivalently, it makes the sample covariance of the residuals with the explanatory variables zero.
- McFadden showed that LL(β) is globally concave for linear-in-parameters utility.

Estimation on a Subset of Alternatives
- Denote the full set of alternatives as F and a subset of alternatives as K. Let q(K | i) be the probability, under the researcher's selection method, that subset K is selected given that alternative i was chosen. We have q(K | i) = 0 for any K that does not include i, since the chosen alternative is always retained in the subset.
- Pni is the probability that person n chooses alternative i from the full set. Pn(i | K) is the probability that the person chooses alternative i conditional on the researcher selecting subset K for him; our goal is to determine it.
- The joint probability that K and i are selected can be expressed in two forms:
  Prob(K, i) = q(K | i)·Pni,
  Prob(K, i) = Pn(i | K)·Q(K),  where Q(K) = Σ(j∈F) Pnj·q(K | j)
  is the probability of selecting K marginal over all the alternatives that the person could choose. Equating the two expressions and solving for Pn(i | K) gives
  Pn(i | K) = Pni·q(K | i) / Σ(j∈K) Pnj·q(K | j).
- If the selection process is designed so that q(K | j) is the same for all j ∈ K (the uniform conditioning property), this simplifies to
  Pn(i | K) = e^(Vni) / Σ(j∈K) e^(Vnj),
  and the conditional log-likelihood is a standard logit over each person's subset.
- Since information on alternatives not in each subset is excluded, the estimator based on the conditional log-likelihood is not efficient.
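Returning to estimation on the full choice set, the first-order condition can be verified numerically. The sketch below simulates logit choices with made-up attributes, maximizes LL by plain gradient ascent (an illustration, not a production estimator), and checks that the predicted and observed averages of each explanatory variable coincide at the optimum:

```python
import numpy as np

rng = np.random.default_rng(7)
N, J, K = 400, 3, 2
X = rng.normal(size=(N, J, K))                 # hypothetical attributes
beta_true = np.array([1.0, -0.5])
y = (X @ beta_true + rng.gumbel(size=(N, J))).argmax(axis=1)   # simulated choices

def probs(beta):
    V = X @ beta
    e = np.exp(V - V.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

beta = np.zeros(K)
for _ in range(2000):                          # gradient ascent on LL (concave)
    resid = -probs(beta)
    resid[np.arange(N), y] += 1.0              # residuals y_ni - P_ni
    beta += 0.1 * np.einsum('nj,njk->k', resid, X) / N

# first-order condition: predicted average of each x equals the observed average
P = probs(beta)
x_observed = X[np.arange(N), y].mean(axis=0)
x_predicted = np.einsum('nj,njk->k', P, X) / N
assert np.allclose(x_observed, x_predicted, atol=1e-4)
```

Because LL(β) is globally concave for linear-in-parameters utility, this simple ascent converges regardless of the starting point.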
- If the selection process does not exhibit the uniform conditioning property, q(K | i) must be included in the model: ln q(Kn | j) is added to the representative utility of each alternative, with its coefficient constrained to 1.
- Why would one want to use a selection procedure that does not satisfy the uniform conditioning property? Consider a choice of home location: to identify factors that contribute to choosing one particular community, one might draw alternatives randomly from within that community at a rate different from all other communities. This procedure assures that the researcher has an adequate number of observations from the area of interest.

Estimation on Choice-Based Samples
Sampling procedures:
- Purely choice-based: the population is divided into those who choose each alternative, and decision makers are drawn randomly within each group, at different rates.
- Hybrid of choice-based and exogenous: an exogenous sample is supplemented with a sample drawn on the basis of the households' choices.
Purely choice-based estimation:
- If one is using a purely choice-based sample and includes an alternative-specific constant for each alternative, then estimating a regular logit model produces consistent estimates of all the model parameters except the alternative-specific constants.
- These constants can be adjusted to be consistent using
  Aj = the share of decision makers in the population who chose alternative j,
  Sj = the share in the choice-based sample who chose alternative j.

Goodness of Fit
- Percent correctly predicted: the percentage of sampled decision makers for which the highest-probability alternative and the chosen alternative are the same. It is based on the idea that the decision maker is predicted by the researcher to choose the alternative with the highest probability.
- This procedure misses the point of probabilities, gives obviously inaccurate market shares, and seems to imply that the researcher has perfect information.
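A sketch of the constant adjustment for a purely choice-based sample, assuming the usual correction of subtracting ln(Sj/Aj) from each estimated constant; all the numbers below are hypothetical:

```python
import math

# hypothetical estimates from a purely choice-based sample in which
# bus users were deliberately oversampled
alpha_hat = {'car': 0.00, 'bus': 1.20}   # estimated alternative-specific constants
S = {'car': 0.50, 'bus': 0.50}           # shares in the choice-based sample
A = {'car': 0.85, 'bus': 0.15}           # known shares in the population

# assumed adjustment: subtract ln(S_j / A_j) from each constant
alpha_adj = {j: alpha_hat[j] - math.log(S[j] / A[j]) for j in alpha_hat}

assert alpha_adj['car'] > alpha_hat['car']   # car was undersampled: constant rises
assert alpha_adj['bus'] < alpha_hat['bus']   # bus was oversampled: constant falls
```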
Hypothesis Testing
- Standard t-statistics are used to test hypotheses about individual parameters, such as whether a parameter is 0.
- More complex hypotheses can be tested by the likelihood ratio test. H0 is expressed as constraints on the values of the parameters, such as: several parameters are zero, or two or more parameters are equal.
- Define the ratio of likelihoods R = L(constrained)/L(unconstrained), where L(constrained) is the maximum value of the likelihood function under H0 and L(unconstrained) is its unconstrained maximum.
- The test statistic −2·log R is distributed chi-squared with degrees of freedom equal to the number of restrictions implied by the null hypothesis.

Bay Area Rapid Transit (BART)
- A prominent test of logit capabilities in the mid-1970s: McFadden applied logit to commuters' mode choices to predict BART ridership.
- Four modes were considered available for the trip to work:
  1. Driving a car by oneself,
  2. Taking the bus and walking to the bus stop,
  3. Taking the bus and driving to the bus stop,
  4. Carpooling.
- Explanatory variables: the time and cost of each mode (determined from home and work locations), with travel time differentiated into walk, wait, and on-vehicle time; and commuters' characteristics: income, household size, number of cars and drivers in the household, and whether the commuter was head of household.
- A logit model with linear-in-parameters utility was estimated on these data, and forecasted shares were compared with actual shares for each mode.

Homework 8
Discrete Choice Methods with Simulation (Kenneth E. Train)
Problem set 1 on logit: elsa.berkeley.edu/users/train/ec244ps1.zip
1. Exercise 1 [20 points]
2. Exercise 2 [10 points]
3. Exercise 3 [20 points]
4. Exercise 4 [20 points]
5. Exercise 5 [15 points]
6. Exercise 6 [15 points]
Assignment weight factor = 2

BART demand was forecast to be 6.3%, compared with an actual share of 6.2%.

GEV Models

Introduction
- Often one is unable to capture all sources of correlation explicitly. The unobserved portions of utility are then correlated, and IIA does not hold.
- The unifying attribute of GEV models: the unobserved portions of utility are jointly distributed as a generalized extreme value (GEV). When all correlations are zero, the GEV distribution becomes the product of independent extreme value distributions and the GEV model becomes standard logit.
- GEV includes logit but also a variety of other models. A GEV model can be used to test whether the correlations are zero, which is also a way to test whether IIA is a proper substitution pattern.
- Nested logit is the most widely used GEV model.

Nested Logit
- Appropriate when the alternatives can be partitioned into nests such that IIA holds within each nest but not, in general, for alternatives in different nests. (An example of IIA holding within nests: the change in probabilities when one alternative is removed.)
- For any two alternatives j and m in nest Bk, εnj is correlated with εnm; for any two alternatives in different nests, Cov(εnj, εnm) = 0.
- Consider K non-overlapping nests B1, B2, . . . , BK. The nested logit model is obtained by assuming that εn has a type of GEV CDF:
  F(εn) = exp(−Σk (Σ(j∈Bk) e^(−εnj/λk))^λk).
- A higher value of λk means greater independence in unobserved utility among the alternatives in nest k. When λk = 1 for all k, the nested logit model reduces to the standard logit model.
- The choice probability for i ∈ Bk is
  Pni = e^(Vni/λk)·(Σ(j∈Bk) e^(Vnj/λk))^(λk−1) / Σl (Σ(j∈Bl) e^(Vnj/λl))^λl.
- The parameter λk can differ over nests, or one can constrain the λk's to be the same for all (or some) nests. Hypothesis testing can be used to determine whether the λk's are reasonable: testing λk = 1 ∀ k is equivalent to testing whether the standard logit model is a reasonable specification against the more general nested logit. These tests are performed most readily with the likelihood ratio statistic.
- Consistency with utility-maximizing behavior:
  - For 0 ≤ λk ≤ 1, the model is consistent for all possible values of the X's.
  - For λk > 1, the model is consistent for some range of the X's only.
  - For λk < 0, the model is not consistent with utility-maximizing behavior.
- λk might differ over decision makers based on their characteristics, such as a function of demographics: λ = exp(αzn).

IIA in Nested Logit
- IIA holds within each subset of alternatives but not across subsets. Consider alternatives i ∈ Bk and m ∈ Bl. If k = l, the ratio Pni/Pnm depends only on the attributes of i and m, so IIA holds. For k ≠ l, the ratio depends on the attributes of all alternatives in nests k and l.
- This form of IIA can be loosely described as "independence from irrelevant nests" (IIN).

Decomposition into Two Logits
- The observed component of utility can be decomposed into a part W that is constant for all alternatives within a nest, and a part Y that varies over alternatives within the nest:
  Unj = Wnk + Ynj + εnj for j ∈ Bk.
- λk·Ink is the expected utility received from the alternatives in nest Bk, where
  Ink = ln Σ(j∈Bk) e^(Ynj/λk)
  is often called the inclusive value, or the "log-sum term".
- The probability of choosing nest Bk depends on the expected utility that the person receives from that nest.
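The nested logit formula can be coded directly. The sketch below uses arbitrary utilities and nest parameters, and confirms that the probabilities sum to one, that IIA holds within a nest, and that λk = 1 for all k recovers standard logit:

```python
import numpy as np

V = np.array([1.0, 0.8, 0.3, 0.1])       # hypothetical representative utilities
nests = [[0, 1], [2, 3]]                  # B1 = {1,2}, B2 = {3,4}
lam = np.array([0.5, 0.7])                # nest coefficients

def nl_probs(V, nests, lam):
    P = np.zeros(len(V))
    sums = [np.exp(V[b] / lam[k]).sum() for k, b in enumerate(nests)]
    denom = sum(s**l for s, l in zip(sums, lam))
    for k, b in enumerate(nests):
        for i in b:
            P[i] = np.exp(V[i] / lam[k]) * sums[k]**(lam[k] - 1) / denom
    return P

P = nl_probs(V, nests, lam)
assert np.isclose(P.sum(), 1.0)

# IIA within a nest: the ratio depends only on the two utilities in that nest
assert np.isclose(P[0] / P[1], np.exp((V[0] - V[1]) / lam[0]))

# lambda_k = 1 for all k collapses to standard logit
assert np.allclose(nl_probs(V, nests, np.array([1.0, 1.0])),
                   np.exp(V) / np.exp(V).sum())
```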
- The nested logit probability can be written as the product of two standard logit probabilities, Pni = Pni|Bk × PnBk:
  Pni|Bk = e^(Yni/λk) / Σ(j∈Bk) e^(Ynj/λk),
  PnBk = e^(Wnk + λk·Ink) / Σl e^(Wnl + λl·Inl).
- Recall that the coefficients in the lower model are divided by λk. STATA estimates nested logit models without dividing by λk in the lower model; the package NLOGIT allows either specification. If the coefficients are not divided by λk, the choice probabilities are not the equations just given, and the not-dividing approach is not consistent with utility maximization in some cases.

Estimation
The parameters of a nested model can be estimated by:
- The standard maximum likelihood technique: substitute the choice probabilities into the LL function. The values of the parameters that maximize this function are consistent and efficient. Numerical maximization can be difficult, as the LL function is not globally concave.
- The sequential technique: estimate the lower models, calculate the inclusive value for each lower model, then estimate the upper model. Standard errors of the upper-model parameters are biased downward, and if some parameters appear in several submodels, the separate estimations provide different estimates.

Three-Level Nested Logit
- Three-level nested logit models can be more appropriate in some settings. Example: housing unit choice in San Francisco.
- There are unobserved factors that are common to all units in the same neighborhood, such as proximity to shopping and entertainment. There are also unobserved factors that are common to all units with the same number of bedrooms, such as the convenience of working at home.
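The two-logit decomposition described above can be checked numerically against the one-shot nested logit formula. In this sketch Wnk = 0, so Ynj = Vnj; the values are arbitrary:

```python
import numpy as np

V = np.array([1.0, 0.8, 0.3, 0.1])      # Y_nj, with W_nk = 0 for both nests
nests = [[0, 1], [2, 3]]
lam = [0.5, 0.7]

# upper model: nest choice via inclusive values I_k; lower model: within-nest logit
I = [np.log(np.exp(V[b] / l).sum()) for b, l in zip(nests, lam)]
P_nest = np.exp([l * i for l, i in zip(lam, I)])
P_nest = P_nest / P_nest.sum()

k, i = 0, 0                              # first alternative, first nest
P_in_nest = np.exp(V[i] / lam[k]) / np.exp(V[nests[k]] / lam[k]).sum()
P_decomposed = P_in_nest * P_nest[k]     # P_ni = P_ni|Bk * P_nBk

# compare with the direct nested logit formula
sums = [np.exp(V[b] / l).sum() for b, l in zip(nests, lam)]
denom = sum(s**l for s, l in zip(sums, lam))
P_direct = np.exp(V[i] / lam[k]) * sums[k]**(lam[k] - 1) / denom

assert np.isclose(P_decomposed, P_direct)
```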
- The top model has an inclusive value for each nest, calculated from the subnests within that nest; the middle models have inclusive values for each subnest, calculated from the alternatives within the subnest.
- Degree of correlation among alternatives within the nests: σmk is the coefficient of the inclusive value in the middle model, and λk·σmk is the coefficient of the inclusive value in the top model. These parameters must lie in a certain range to be consistent with utility maximization.

IIA in Three-Level NL
This nesting structure embodies IIA in the following ways:
- The ratio of probabilities of two housing units in the same neighborhood and with the same number of bedrooms is independent of the characteristics of all other units.
- The ratio of probabilities of two housing units in the same neighborhood but with different numbers of bedrooms is independent of the characteristics of units in other neighborhoods.
- The ratio of probabilities of two housing units in different neighborhoods depends on the characteristics of all the other housing units in those neighborhoods, but not on the characteristics of units in other neighborhoods.

Overlapping Nests
- Consider our example of mode choice. Carpool and auto alone are in a nest, since they share some similar unobserved attributes. But carpooling also has some unobserved attributes that are similar to bus and rail, such as a lack of flexibility in scheduling.
- We may want the unobserved utility of the carpool alternative to be correlated with that of auto alone, and with that of bus and rail, to different degrees. It would be useful for the carpool alternative to be in two nests.
- Cross-nested logit (CNL) models contain multiple overlapping nests.
Ordered Generalized Extreme Value (OGEV)
- Appropriate when the alternatives have a natural order (e.g., the number of cars in a household): the correlation in unobserved utility between any two alternatives depends on their closeness in the ordering.
- Nested OGEV: the lower models are OGEV rather than standard logit.

Paired Combinatorial Logit (PCL)
- Each pair of alternatives is considered to be a nest with its own correlation. With J alternatives, each alternative is a member of J − 1 nests; λij indicates the degree of independence between alternatives i and j, and the correlation of each alternative's unobserved utility with each other alternative is estimated.
- PCL becomes a standard logit when λij = 1 for all pairs of alternatives.
- The choice probability is
  Pni = Σ(j≠i) e^(Vni/λij)·(e^(Vni/λij) + e^(Vnj/λij))^(λij−1) / Σk Σ(l>k) (e^(Vnk/λkl) + e^(Vnl/λkl))^λkl.
  Each term being added in the numerator is the same as the numerator of a nested logit probability; the denominator is the sum over all nests (pairs) of the sum of the e^(V/λ)'s within the nest, raised to the appropriate power λ.
- Identification: since the scale and level of utility are immaterial, at most J(J − 1)/2 − 1 covariance parameters can be estimated in a choice model. A PCL model contains J(J − 1)/2 λ's, so the researcher can normalize one of the λ's to 1, or impose a specific pattern of correlation:
  - Set λij = 1 for some or all of the pairs: no significant correlation in the unobserved utility for that pair.
  - Set λij = λkl: the correlation is the same among a group of alternatives.
  - Specify λij as a function of the proximity between i and j (as in OGEV).

Generalized Nested Logit (GNL)
- Includes the PCL and other cross-nested models as special cases. Each alternative can be a member of more than one nest.
- Allocation parameter αjk: the extent to which alternative j is a member of nest k; it reflects the portion of the alternative that is allocated to each nest. λk indicates the degree of independence among alternatives within nest k.
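A sketch of the PCL probability, using hypothetical utilities and a common λ for all pairs, verifying that the probabilities sum to one and that λij = 1 for all pairs collapses to standard logit:

```python
import numpy as np
from itertools import combinations

V = np.array([0.6, 0.2, -0.1])            # hypothetical representative utilities
J = len(V)
lam = {frozenset(p): 0.8 for p in combinations(range(J), 2)}   # one lambda per pair

def pcl_probs(V, lam):
    denom = 0.0
    for k, l in combinations(range(J), 2):
        lkl = lam[frozenset((k, l))]
        denom += (np.exp(V[k] / lkl) + np.exp(V[l] / lkl)) ** lkl
    P = np.zeros(J)
    for i in range(J):
        for j in range(J):
            if j == i:
                continue
            lij = lam[frozenset((i, j))]
            pair = np.exp(V[i] / lij) + np.exp(V[j] / lij)
            P[i] += np.exp(V[i] / lij) * pair ** (lij - 1)
        P[i] /= denom
    return P

P = pcl_probs(V, lam)
assert np.isclose(P.sum(), 1.0)

# with all lambdas equal to 1, PCL reduces to standard logit
lam_one = {k: 1.0 for k in lam}
assert np.allclose(pcl_probs(V, lam_one), np.exp(V) / np.exp(V).sum())
```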
- The GNL becomes a nested logit model if each alternative enters only one nest, with αjk = 1 for j ∈ Bk and 0 otherwise. It becomes standard logit if, in addition, λk = 1 for all nests.
- The GNL probability can be decomposed as Pni = Σk Pni|k·Pnk: the probability of nest k times the probability of alternative i given nest k, summed over the nests containing i.

Heteroskedastic Extreme Value (HEV)
- Instead of capturing correlations among alternatives, the researcher may want to allow the variance of unobserved factors to differ over alternatives: εnj is distributed independently extreme value with variance (θj·π)²/6. There is no correlation, but a different variance for each alternative.
- To set the overall scale of utility, the variance for one alternative is normalized to π²/6 (its θ is set to 1).
- The choice probability is a one-dimensional integral, which can be approximated by simulation (with w = εni/θi):
  1. Take a draw of w from the standard extreme value distribution.
  2. For this draw, calculate the factor in brackets, Π(j≠i) exp(−e^(−(Vni − Vnj + θi·w)/θj)).
  3. Repeat steps 1 and 2 many times, and average the results to approximate Pni.

The GEV Family
- Let Yj ≡ exp(Vj), G = G(Y1, ... , YJ), and Gi = ∂G/∂Yi (omitting the subscript n denoting the decision maker).
- If G meets certain conditions, then
  Pi = Yi·Gi / G
  is the choice probability of a discrete choice model consistent with utility maximization. Any model that can be derived in this way is a GEV model. We show how logit, nested logit, and paired combinatorial logit can be derived.

McFadden Conditions
1. G ≥ 0 for all positive values of the Yj's.
2. G is homogeneous of degree 1: G(ρY1, ... , ρYJ) = ρ·G(Y1, ... , YJ). (Ben-Akiva and Francois showed this condition can be relaxed to homogeneity of any degree.)
3. G → ∞ as Yj → ∞ for any j.
4. The cross partial derivatives of G alternate in sign: Gi ≥ 0 for all i; Gij = ∂Gi/∂Yj ≤ 0 for all j ≠ i; Gijk = ∂Gij/∂Yk ≥ 0 for any distinct i, j, and k; and so on for higher-order cross-partials.
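The construction Pi = Yi·Gi/G can be sanity-checked for the simplest generator, G = Σj Yj, which satisfies the four conditions and yields standard logit (utilities below are arbitrary):

```python
import numpy as np

V = np.array([1.0, 0.2, -0.4])    # hypothetical representative utilities
Y = np.exp(V)

# generator for standard logit: G(Y) = sum_j Y_j, so G_i = dG/dY_i = 1
G = Y.sum()
G_i = np.ones_like(Y)
P_gev = Y * G_i / G                # generic GEV formula P_i = Y_i * G_i / G

assert np.allclose(P_gev, np.exp(V) / np.exp(V).sum())   # standard logit
assert np.isclose(P_gev.sum(), 1.0)

# condition 2: G is homogeneous of degree 1
rho = 3.7
assert np.isclose((rho * Y).sum(), rho * G)
```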
Logit in the GEV Family
- Let G = Σj Yj. Check the conditions:
  1. The sum of positive Yj's is positive.
  2. If all Yj's are raised by a factor ρ, G rises by that same factor.
  3. If any Yj rises without bound, then G does also.
  4. Gi = ∂G/∂Yi = 1, so Gi ≥ 0, and all higher-order cross-derivatives are zero.
  The resulting probability Pi = Yi·Gi/G = e^(Vi)/Σj e^(Vj) is the standard logit.
- The lack of intuition behind these properties is both a blessing and a curse. Disadvantage: the researcher has little guidance on how to specify a G. Advantage: the mathematical approach allows the researcher to generate models that he might not have developed while relying only on economic intuition.

Nested Logit in the GEV Family
- Let G = Σk (Σ(j∈Bk) Yj^(1/λk))^λk, where the J alternatives are partitioned into K nests labeled B1, ... , BK and each λk is between zero and one. Verify the conditions; applying Pi = Yi·Gi/G yields the nested logit probabilities.

Paired Combinatorial Logit in the GEV Family
- Let G = Σk Σ(l>k) (Yk^(1/λkl) + Yl^(1/λkl))^λkl. The GNL can be derived analogously, with the allocation parameters entering the generator.

Homework 9
Discrete Choice Methods with Simulation (Kenneth E. Train)
Problem set 3 on nested logit: elsa.berkeley.edu/users/train/ec244ps3.zip
1. Exercise 1 [30 points]
2. Exercise 2 [30 points]
3. Exercise 3 [40 points]
Assignment weight factor = 1.5

Probit Models

Introduction
- Logit is limited in three important ways: it cannot represent random taste variation; it exhibits restrictive substitution patterns due to the IIA property; and it cannot be used with panel data when unobserved factors are correlated over time for each decision maker.
- Probit models deal with all three limitations. The only requirement is that all unobserved components of utility be normally distributed.
- A randomly varying price coefficient is problematic in probit, however, since a normal distribution places some probability on positive price coefficients.
Choice Probabilities
- Utility is decomposed as Unj = Vnj + εnj ∀ j. We assume εn is distributed normal with a mean vector of zero and covariance matrix Ωn. The choice probability is
  Pni = ∫ I(Uni > Unj ∀ j ≠ i) φ(εn) dεn,
  where I(·) is an indicator of whether the statement in parentheses holds and φ(·) is the joint normal density.

Differences in Utility Matter
- Let us difference against alternative i: define Ũnji = Unj − Uni, Ṽnji = Vnj − Vni, and ε̃nji = εnj − εni. Then
  Pni = Prob(Ũnji < 0 ∀ j ≠ i),
  a (J − 1)-dimensional integral over the density of the error differences, which is itself normal.

Covariance Matrix
- How do we derive the covariance of the error differences from Ωn? Let Mi be the (J − 1) × J matrix formed from a (J − 1) identity matrix with an extra column of −1's added as the ith column. Then ε̃ni = Mi·εn, and the covariance matrix of the error differences is
  Ω̃ni = Mi·Ωn·M′i.
- This transformation comes in handy when simulating probit probabilities.

Scale of Utility Is Irrelevant
- For a four-alternative model, Ωn has ten distinct elements (four variances and six covariances), while the covariance matrix of the error differences, after normalizing one diagonal element for scale, contains five free parameters θ∗.
- Since there are five θ∗'s and ten σ's, it is not possible to solve for all the σ's from estimated values of the θ∗'s. This reduction in the number of parameters is not a restriction: scale and level have no relevance for behavior. Only the five parameters have economic content, and only the five parameters can be estimated.

Covariance Structure
- Instead of allowing a full covariance matrix, one can impose structure on the covariance matrix. Often the structure that is imposed will be sufficient to normalize the model. A few examples are reviewed:
- Route choice: the covariance between any two routes depends only on the length of their shared segments. This structure reduces the number of covariance parameters to only one.
- Physicians' choice of location: covariance specified as a function of the locations' proximity to one another.
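The differencing transformation Ω̃i = Mi·Ω·M′i can be sketched directly; the 3×3 covariance matrix below is invented for illustration:

```python
import numpy as np

Omega = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.5, 0.4],
                  [0.2, 0.4, 2.0]])     # hypothetical error covariance, J = 3

def M(i, J):
    """(J-1) x J matrix that differences the errors against alternative i."""
    m = np.delete(np.eye(J), i, axis=0)
    m[:, i] = -1.0
    return m

M1 = M(0, 3)                        # difference against the first alternative
Omega_tilde = M1 @ Omega @ M1.T     # covariance of (eps_j - eps_1), j = 2, 3

# check one entry: Var(eps_2 - eps_1) = s22 + s11 - 2*s12
assert np.isclose(Omega_tilde[0, 0], 1.5 + 1.0 - 2 * 0.3)
```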
8/29 Covariance Structure
Assume the covariance matrix has the form shown on the slide. Is this model, as specified, normalized for scale and level?

9/29 Covariance Structure
Assume the covariance matrix has the form shown on the slide, with θ = 1 + (ρ1 + ρ2)/2. Is this model, as specified, normalized for scale and level? The original model is NOT normalized for scale and level: all that matters for behavior is the average of these parameters, not their values relative to each other.

10/29 Taste Variation
The utility is Unj = β′n xnj + εnj, and suppose βn ∼ N(b, W): the coefficients vary randomly over decision makers instead of being fixed. The goal is to estimate the parameters b and W. Utility can be rewritten as Unj = b′xnj + ηnj, where ηnj = (βn − b)′xnj + εnj. The covariance of the ηnj's depends on W as well as the xnj's, so the covariance differs over decision makers.

11/29 Taste Variation
The covariance of the ηnj's can be described easily for a two-alternative model with one explanatory variable. In this case the utility is Un1 = βn xn1 + εn1 and Un2 = βn xn2 + εn2. Assuming βn ∼ N(b, σβ), rewrite as Un1 = b xn1 + ηn1 and Un2 = b xn2 + ηn2. Assume that εn1 and εn2 are IID N(0, σε); the assumption of independence is not needed in general. Then ηn1 and ηn2 are jointly normally distributed with zero mean and covariance:

12/29 Taste Variation
The covariance matrix is
[ x²n1 σβ + σε    xn1 xn2 σβ   ]
[ xn1 xn2 σβ     x²n2 σβ + σε ]
We also need to set the scale of utility; a convenient normalization for this case is σε = 1. Then b and σβ are estimated, and the researcher learns the mean and variance of the random coefficient in the population.

13/29 Substitution Patterns
Probit can represent any substitution pattern: different covariance matrices provide different substitution patterns, and one determines the substitution pattern that is most appropriate for the data. We consider a full covariance matrix and a restricted covariance matrix.

14/29 Substitution Patterns: Full Covariance
A probit model with four alternatives was discussed.
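The covariance expressions on the Taste Variation slides can be checked by Monte Carlo. A sketch with hypothetical values for b, σβ, σε, xn1, and xn2 (following the slides, σβ and σε denote variances):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for the two-alternative, one-variable example;
# sig_beta and sig_eps are VARIANCES, following the slides' notation.
b, sig_beta, sig_eps = 0.5, 2.0, 1.0
x1, x2 = 1.0, 3.0
N = 200_000

beta = rng.normal(b, np.sqrt(sig_beta), N)   # random coefficient beta_n
eps1 = rng.normal(0.0, np.sqrt(sig_eps), N)
eps2 = rng.normal(0.0, np.sqrt(sig_eps), N)

# eta_nj = (beta_n - b) * x_nj + eps_nj
eta1 = (beta - b) * x1 + eps1
eta2 = (beta - b) * x2 + eps2

# Theoretical covariance matrix of (eta_n1, eta_n2) from slide 12/29
theory = np.array([[x1**2 * sig_beta + sig_eps, x1 * x2 * sig_beta],
                   [x1 * x2 * sig_beta,         x2**2 * sig_beta + sig_eps]])

empirical = np.cov(np.vstack([eta1, eta2]))
assert np.allclose(empirical, theory, rtol=0.05, atol=0.05)
```

Note how the covariance depends on the data xn1, xn2, so it differs over decision makers, exactly as the slides state.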
A full covariance matrix, normalized for scale and level, was introduced. The estimated values can represent any substitution pattern: the normalization does not restrict the substitution patterns; it only eliminates aspects that are irrelevant to behavior. Note: estimated values of the θ∗'s provide essentially no interpretable information. If θ23 is estimated to be negative, this does not mean that unobserved utility for the second alternative is negatively correlated with unobserved utility for the third alternative (that is, σ23 < 0).

15/29 Substitution Patterns: Structured Covariance
Imposing structure usually yields more interpretable parameters. Structure is necessarily situation-dependent. Consider a choice among purchase-money mortgages: four mortgages are available to the homebuyer, one with a fixed rate and three with variable rates. Suppose εnj consists of two parts: concern about the risk of rising interest rates (rn), and all other unobserved factors (ηnj). Then εnj = −rn dj + ηnj (dj = 0 for the fixed-rate loan, 1 otherwise). Assume rn is normally distributed with variance σ, and ηnj is IID N(0, ω).

16/29 Substitution Patterns: Structured Covariance
Then the covariance matrix is Ω = ω I + σ d d′, where d = (0, 1, 1, 1)′. The normalization for scale can be accomplished by setting ω = 1 in the original Ω; under this procedure, the parameter σ is estimated directly.

17/29 Panel Data
Probit with repeated choices is similar to probit on one choice; the only difference is the dimension of the covariance matrix of the errors. The utility that n obtains from j in t is Unjt = Vnjt + εnjt. Consider a sequence of alternatives, one for each time period, i = {i1, ..., iT}. The probability that the decision maker makes this sequence of choices is
Pni = ∫ I(Un,it,t > Unjt ∀j ≠ it, ∀t) φ(εn) dεn,
where φ(εn) is the joint normal density with zero mean and covariance Ω.

18/29 Again, it is often more convenient to work in utility differences. This results in a (J − 1)T-dimensional integral.
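For the mortgage example, εnj = −rn dj + ηnj implies the structured covariance ω I + σ d d′, which a short simulation can confirm (the numerical values of σ and ω below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# d_j = 0 for the fixed-rate loan, 1 for the three variable-rate loans.
d = np.array([0.0, 1.0, 1.0, 1.0])
sigma, omega = 2.0, 1.0   # hypothetical variances of r_n and eta_nj

# eps_nj = -r_n * d_j + eta_nj  implies  Cov(eps) = omega * I + sigma * d d'
theory = omega * np.eye(4) + sigma * np.outer(d, d)

N = 300_000
r = rng.normal(0.0, np.sqrt(sigma), N)        # interest-rate concern r_n
eta = rng.normal(0.0, np.sqrt(omega), (N, 4))  # all other unobserved factors
eps = -np.outer(r, d) + eta

# Empirical covariance across simulated decision makers matches the structure.
assert np.allclose(np.cov(eps.T), theory, rtol=0.05, atol=0.05)
```

The structure has only two free parameters (σ and ω), and the common σ term among the three variable-rate loans is what induces their extra correlation.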
19/29 Example: Covariance of Errors
Suppose in a binary case the error consists of:
A portion that is specific to the decision maker (ηn): ηn is iid over people, normal with zero mean and variance σ.
A part that varies over time for each decision maker (μnt): μnt is iid over time and people, standard normal.
Thus V(εnt) = V(ηn + μnt) = σ + 1 and Cov(εnt, εns) = E[(ηn + μnt)(ηn + μns)] = σ.

20/29 Example: Choice Probabilities
The choice probabilities under this structure:
Probability of not taking the action, conditional on ηn: Prob(Vnt + ηn + μnt < 0) = Prob(μnt < −(Vnt + ηn)) = Φ(−(Vnt + ηn)), where Φ(·) is the cumulative standard normal function.
Probability of taking the action, conditional on ηn: 1 − Φ(−(Vnt + ηn)) = Φ(Vnt + ηn).
The probability of the sequence of choices dn, conditional on ηn, is therefore Hndn(ηn) = ∏t Φ((Vnt + ηn) dnt), where dnt = 1 if person n took the action in period t, and dnt = −1 otherwise.
The unconditional probability is Pndn = ∫ Hndn(ηn) φ(ηn) dηn, where φ is the density of ηn. This probability can be simulated very simply.

21/29 Simulation of the Choice Probabilities
Probit probabilities do not have a closed-form expression and must be approximated numerically.
Non-simulation procedures: quadrature methods approximate the integral by a weighted function of specially chosen evaluation points.
Simulation procedures: accept–reject, smoothed accept–reject, and GHK (the most widely used probit simulator).

22/29 Accept–Reject Simulator
Probit probabilities: Pni = ∫ I(Vni + εni > Vnj + εnj ∀j ≠ i) φ(εn) dεn. The AR simulator of this integral is calculated as follows:
1. Draw a value of εn from a normal density with zero mean and covariance Ω. Label the draw εrn with r = 1, and its elements εrn1, ..., εrnJ.
2. Calculate the utility Urnj = Vnj + εrnj that each alternative obtains with εrn.
3.
Determine whether Urni is greater than the utility for all other alternatives: Ir = 1 if Urni > Urnj ∀j ≠ i, indicating an accept, and Ir = 0 otherwise.
4. Repeat steps 1–3 many times. Label the number of repetitions as R.
5. The simulated probability is the proportion of draws that are accepts: (1/R) Σr Ir.

23/29 How to Take a Draw
The most straightforward procedure uses the Choleski factor. A Choleski factor of Ω is a lower-triangular matrix L such that LL′ = Ω. Let η ∼ N(0, I) be a vector of J IID standard normal terms, where I is the identity matrix. Construct ε = Lη, using the Choleski factor to transform η into ε. The first step of the AR simulator becomes two sub-steps:
a. Draw J values from a standard normal and stack them into a vector ηr.
b. Calculate εrn = Lηr.

24/29 Smoothed AR Simulators
Replace the 0–1 AR indicator with a smooth, strictly positive function. Any function can be used for simulating Pni as long as it rises when Urni rises, declines when Urnj rises, is strictly positive, and has defined first and second derivatives with respect to Urnj ∀j. A particularly convenient function is the logit function; its use gives the logit-smoothed AR simulator.

25/29 Smoothed AR Simulators
The simulator is implemented in the following steps:
1. Draw a value of εn from a normal density with zero mean and covariance Ω. Label the draw εrn with r = 1, and its elements εrn1, ..., εrnJ.
2. Calculate the utility Urnj = Vnj + εrnj that each alternative obtains with εrn.
3. Put these utilities into the logit formula: calculate Sr = exp(Urni/λ) / Σj exp(Urnj/λ), in which λ > 0 is a scale factor specified by the researcher. As λ → 0, Sr approaches the indicator function Ir. The scale factor λ determines the degree of smoothing.
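Steps 1–5 of the AR simulator, combined with the Choleski-factor draws described above, can be sketched as follows (the three-alternative V and Ω are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical three-alternative example: representative utilities V and
# error covariance Omega.
V = np.array([0.5, 0.0, -0.5])
Omega = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
L = np.linalg.cholesky(Omega)   # Choleski factor: L @ L.T == Omega

def ar_probit(i, R=100_000):
    """AR simulator of the probability that alternative i is chosen."""
    eta = rng.standard_normal((R, 3))    # step a: IID standard normal draws
    eps = eta @ L.T                      # step b: eps_r = L @ eta_r
    U = V + eps                          # step 2: utilities for each draw
    accepts = np.argmax(U, axis=1) == i  # step 3: I^r = 1 iff i has the max utility
    return accepts.mean()                # step 5: proportion of accepts

probs = np.array([ar_probit(i) for i in range(3)])
assert abs(probs.sum() - 1.0) < 0.02    # simulated probabilities roughly sum to one
assert probs[0] > probs[2]              # higher V => chosen more often
```

Swapping the 0–1 indicator in step 3 for the logit formula with a small λ gives the logit-smoothed version of the same simulator.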
The researcher wants to set λ low enough to obtain a good approximation but not so low as to reintroduce numerical difficulties.
4. Repeat steps 1–3 many times. Label the number of repetitions as R.
5. The simulated probability is the average of the values of the logit formula: (1/R) Σr Sr.
The procedure is the same as the AR simulator except for step 3.

26/29 GHK Simulator
The most widely used probit simulator is called GHK. A three-alternative case is discussed here.

27/29 GHK Simulator
The second factor is an integral, and we use simulation to approximate integrals. Draw a value of η1 from a standard normal density truncated above at the point shown on the slide; for this draw, calculate the factor; repeat this process for many draws, and average the results. This average is a simulated approximation to the integral. How do we take a draw from a truncated normal? Take a draw from a standard uniform, labeled μ; then, for a standard normal truncated above at a point c, calculate η1 = Φ⁻¹(μ Φ(c)).

28/29 GHK Simulator
This probability is simulated as shown on the slide.

29/29 Homework 10
Discrete Choice Methods with Simulation (Kenneth E. Train). Problem set 5 on probit: elsa.berkeley.edu/users/train/ec244ps5.zip
1. Exercise 1 [40 points]
2. Exercise 2 [30 points]
3. Exercise 3 [30 points]
Assignment weight factor = 1

1/28 Basic Econometrics in Transportation: Introduction
Course: Econometrics (20-563), Spring 2016, Saturday and Monday 10:30–12:00
Instructor: Amir Samimi, Civil Engineering Department, Sharif University of Technology. Office Hours: with appointment. Email: asamimi@sharif.edu
TA: Ali Shamshiripour. Email: shamshiripour@gmail.com
Prerequisites: undergraduate level of probability and statistics
Primary Source: Basic Econometrics (Gujarati)

2/28 What is Econometrics?
Gujarati: Econometrics is mainly interested in the empirical verification of economic theory.
Wooldridge: Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy.

3/28 Methodology
1. Statement of theory or hypothesis
2. Specification of the mathematical model of the theory
3. Specification of the statistical, or econometric, model
4. Obtaining the data
5. Estimation of the parameters of the econometric model
6. Hypothesis testing
7. Forecasting or prediction
8. Using the model for control or policy purposes

4/28 Example
Keynes stated: "The fundamental psychological law . . . is that men [women] are disposed, as a rule and on average, to increase their consumption as their income increases, but not as much as the increase in their income."

5/28 Example (cont.)
A mathematical economist might suggest the following function: Y = β1 + β2X, with 0 < β2 < 1, where Y = consumption expenditure (the dependent variable), X = income (the independent variable), and β1 and β2 are the parameters of the model. In short, the marginal propensity to consume (MPC) is greater than zero but less than 1.

6/28 Example (cont.)
The purely mathematical model is of limited interest to the econometrician, for it assumes an exact or deterministic relationship between consumption and income. But relationships are generally inexact: in addition to income, other variables (size of family, age, religion) affect consumption expenditure. To allow for inexact relationships, the econometrician adds a disturbance, or error, term to the consumption function.

7/28 Example (cont.)
Y = β1 + β2X + u. The error term u is a random variable that has well-defined probabilistic properties. This is an econometric model.
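Estimating the econometric model Y = β1 + β2X + u is the subject of later lectures; as a preview, a least-squares sketch on synthetic data (the numbers below are simulated, not Gujarati's actual consumption–income series):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data standing in for the consumption-income example
# (hypothetical numbers, not Gujarati's U.S. data).
n = 100
X = rng.uniform(3000.0, 7000.0, n)   # income
u = rng.normal(0.0, 50.0, n)         # disturbance (error) term
Y = -180.0 + 0.70 * X + u            # true MPC set to 0.70 for illustration

# OLS estimates of beta1 (intercept) and beta2 (slope, the MPC)
Xmat = np.column_stack([np.ones(n), X])
beta1_hat, beta2_hat = np.linalg.lstsq(Xmat, Y, rcond=None)[0]

assert 0.65 < beta2_hat < 0.75       # recovered MPC should be near the true 0.70
```

With the disturbance term present, the estimates vary from sample to sample, which is why the later hypothesis-testing step ("Is 0.70 statistically less than 1?") is needed.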
8/28 Example (cont.)
To estimate the econometric model, that is, to obtain numerical values of β1 and β2, we need data. Y: personal consumption expenditure; X: gross domestic product; both in 1992 billions of dollars.

9/28 Example (cont.)
The numerical estimates of the parameters give empirical content to the consumption function. The estimated consumption function carries a hat on the Y to indicate that it is an estimate. The slope coefficient was about 0.70, suggesting that for the sample period an increase in real income of 1 dollar led, on average, to an increase of about 70 cents in real consumption expenditure.

10/28 Example (cont.)
We then develop suitable criteria to find out whether the estimates accord with the expectations of the theory being tested. Keynes expected the MPC to be positive but less than 1; we found the MPC to be about 0.70. Is this estimate sufficiently below unity to convince us that this is not a chance occurrence or a peculiarity of the particular data we have used? Is 0.70 statistically less than 1?

11/28 Example (cont.)
If the chosen model does not refute our hypothesis, we may use it to forecast Y on the basis of an expected future value of X. Predict the mean consumption expenditure for 1997, given a GDP value of 7269.8 billion dollars. The actual value of consumption expenditure in 1997 was 4913.5 billion dollars; the estimated model overpredicted Y by about 37.82 billion dollars, so the forecast error is about 37.82 billion dollars. Such errors are inevitable given the statistical nature of our analysis.

Applications
Political Science: How do political campaign expenditures affect voting outcomes?

Example (cont.)
How to manipulate the control variable to produce the desired level of the target variable: suppose the government believes that a consumer expenditure of about 4900 (billions of 1992 dollars) will keep the unemployment rate at its current level of about 4.2 percent (early 2000).
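The forecast and control arithmetic in the example can be reproduced directly from the reported figures (the 100-billion target gap below is a hypothetical illustration of the MPC relationship, not a number from the slides):

```python
# Figures reported on the slides (billions of 1992 dollars).
mpc = 0.70
actual_1997 = 4913.5
overprediction = 37.82

# The model overpredicted, so the fitted 1997 value was:
predicted_1997 = actual_1997 + overprediction   # 4951.32

# Control/policy use: each extra dollar of income raises predicted consumption
# by the MPC, so hitting a consumption target requires
#   delta_income = delta_consumption / mpc.
# (The 100-billion gap here is hypothetical.)
target_gap = 100.0
income_change_needed = target_gap / mpc

assert abs(predicted_1997 - 4951.32) < 1e-6
assert income_change_needed > target_gap        # since 0 < mpc < 1
```

Because the MPC is below one, the required income change always exceeds the desired change in consumption, which is why the income level needed to hit the 4900 target is so much larger than the target itself.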
What level of income will guarantee the target amount of consumption expenditure? The model suggests that an income level of about 7197 billion dollars, given an MPC of about 0.70, will produce an expenditure of about 4900 billion dollars.

Applications (cont.)
Education: How does school spending affect student performance?
Transportation: How does fuel taxation affect public transportation ridership?
Social Science: How does level of education affect personal income?

15/28 Data
Cross-Sectional Data: data on one or more variables collected at the same point in time.
Time Series Data: data on the values that a variable takes at different times.
Pooled Cross Sections: a combination of both time series and cross-section data.
Panel or Longitudinal Data: a special type of pooled data in which the same cross-sectional unit is surveyed over time.

16/28 Data Accuracy
The result of research is only as good as the quality of data. The modeler is often faced with observational as opposed to experimental data; social science data are non-experimental in nature.
Observational errors: omission or commission; errors of measurement; approximations and round-offs.
Problem of non-response: financially sensitive questions; selectivity bias in the sample.
Sampling method: how to compare results obtained with various samples?

17/28 Data Accuracy
Aggregate level: data are generally available at a highly aggregate level.
Confidentiality: the IRS is not allowed to disclose individual tax data; the Department of Commerce, which conducts the census of business every 5 years, is not allowed to disclose information at the firm level.
The researcher should always clearly state the data sources used in the analysis, their definitions, their collection methods, and any gaps or revisions in the data.

18/28 Course Objectives
To provide an elementary but comprehensive introduction to econometrics without resorting to statistics beyond the elementary level.
To estimate and interpret simple econometric models.
To get familiar with the most common modeling issues.

19/28 Course Material
Required Text Books:
1. Casella, G., and R.L. Berger (2001) Statistical Inference, 2nd Edition, Duxbury Press.
2. Gujarati, D.N. (2004) Basic Econometrics, 4th Edition, McGraw Hill.
3. Train, K. (2009) Discrete Choice Methods with Simulation, 2nd Edition, Cambridge University Press.
Additional Readings:
Wooldridge, J.M. (2008) Introductory Econometrics: A Modern Approach, 4th Edition, South-Western College Pub.
Hensher, D.A., J.M. Rose, and W.H. Greene (2005) Applied Choice Analysis: A Primer, Cambridge University Press.

20/28 Class Policy
1. Syllabus Modifications
A. The instructor reserves the right to change the course syllabus; any changes will be announced in class. Students are responsible for obtaining this information.
2. Class Participation
A. Read the course material before class and be active in discussions.
3. Late Submission of Assignments
A. Any assignment turned in late will incur a 5% penalty per day.

21/28 Grading Policy
Homework Assignments 10%, Quiz 25%, Attendance 15%, Final Exam 50%, Term Paper (Extra) 10%

22/28 Attendance
Late arrival results in a 50% penalty; more than 30 minutes' delay results in an absence penalty. Catch up with the class when circumstances arise, and do not discuss your excuses.

23/28 Homework Assignments
1. No collaboration.
2. Submission: type and email the assignments ONLY to the TA. All homework assignments are due in a week.
3. Format: include a cover letter; start each question at the top of a page; get familiar with technical writing; reference your work appropriately (if not, you may be penalized for plagiarism); only PDF files are acceptable.
4. Grading: Effort 50%, Correctness 30%, Formatting 20%.

24/28 Quiz
Four quizzes, tentatively scheduled for Sessions 10, 14, 19, and 28. Each quiz starts sharp at the beginning of the session.

25/28 Final Exam
Exam Day Format:
A. Closed book.
B. A double-sided A4 note is allowed.
C. Check for any conflict and request a reschedule by Week 6.
D. You are not allowed to leave the session during the exam.
E. If you have special needs during the final exam, inform me at least one month in advance.
Exam Material:
A. Includes discussion and problems.
B. Mainly from the primary text books and lectures.

Term Paper
In accordance with TRB guidelines: www.trb.org/Guidelines/Authors.pdf
Qualification: papers are evaluated by TRB standards. No previous works by the student may be reused; plagiarism shall result in a deduction of 20 on a scale of 100 from the final grade.
Contribution: collaboration is possible if announced by Week 6; extra points will be distributed based on the level of contribution. Research must be done under the instructor's supervision.
Due: seven days after the Final Week.

26/28 Academic Conduct Policy
Students are expected to be aware of and to adhere to the university standards of academic honesty. Violations include, but are not limited to, submitting work which is not your own for credit. Copying or sharing of work (including computer files) is prohibited, except as explicitly authorized by the instructors. Any confirmed plagiarism shall result in the student receiving a zero for that work.

27/28 Tentative Calendar
Session / Topic
1 Introduction
2 Statistical Concepts
3 Statistical Concepts
4 Bivariate Regression Analysis
5 Bivariate Regression Analysis
6 Bivariate Regression Discussion
7 Multiple Regression Analysis
8 Multiple Regression Analysis
9 Multiple Regression Analysis
10 Multiple Regression Analysis
11 Dummy Variables
12 Multicollinearity
13 Multicollinearity
14 Heteroscedasticity
15 Heteroscedasticity

28/28 Tentative Calendar
Session / Topic
16 Autocorrelation
17 Autocorrelation
18 Model Specification
19 Model Specification
20 Endogeneity
21 Behavioral Models
22 Behavioral Models
23 Logit Models
24 Logit Models
25 GEV Models
26 GEV Models
27 GEV Models
28 Probit Models
29 Probit Models
30 Advanced Topics