Econometrics Slides

Basic Econometrics in Transportation
Statistical Concepts

Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Sources: Statistical Inference (Casella and Berger) and Basic Econometrics (Gujarati)

Outline
 Probability Theory
 Random Variable
 Characteristics of Probability Distributions
 Common Probability Distributions

Background
 You can never foretell what any one man will do, but you can say with precision what an average will be up to. (Sherlock Holmes)
 The theory of probability has a long and rich history, dating back to the 17th century, when a mathematical formulation of gambling odds was developed.
 The question is how to draw conclusions about a population of objects by conducting an experiment.

Definitions
 Sample space
  All the possible outcomes, or the population.
 Event
  Any possible collection of outcomes.
 Mutually exclusive events
  If the occurrence of one event prevents the occurrence of another event.
 Exhaustive events
  If they exhaust all the possible outcomes of an experiment.

Probability
 Definition
  The proportion of times an event will occur in repeated trials of an experiment.
 Calculus of probabilities
  0 ≤ P(A) ≤ 1 for every event A.
  If A, B, C, . . . constitute an exhaustive set of events, then P(A + B + C + ···) = 1.
  If A, B, C, . . . are mutually exclusive events, then P(A + B + C + ···) = P(A) + P(B) + P(C) + ···

Random Variable
 Definition
  A variable whose value is determined by the outcome of a chance experiment.
 Notation
  Random variables are usually denoted by capital letters, such as X.
  The values taken by them are denoted by small letters, such as x.
 A random variable may be
  Discrete
   Finite: How many children in a family are disabled?
   Infinite: How many people have blood cancer in a given population?
  Continuous: How long will a light bulb last?
  Intervals and preferences: How do you rank public transport service reliability, on a scale of 1 to 5?

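The frequency definition of probability above can be illustrated with a short simulation; this is a minimal sketch, assuming Python with NumPy is available (none of this code is from the slides):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Chance experiment: roll a fair die; event A = "roll a six".
    # The relative frequency of A should approach P(A) = 1/6
    # as the number of repeated trials grows.
    for n in (100, 10_000, 1_000_000):
        rolls = rng.integers(1, 7, size=n)
        print(n, np.mean(rolls == 6))
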
Probability Density Function
 For the discrete rv X, the PDF is
  f(x) = P(X = x)
 Example: the PDF of the sum of the numbers shown on two dice, f(x) for x = 2, 3, . . . , 12.

Probability Density Function
 f(x) is the PDF of the continuous rv X if:
  f(x) ≥ 0,
  ∫ f(x) dx = 1 (integrating over the support of X), and
  P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx.

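The two-dice PDF can be tabulated by enumeration; a minimal Python sketch (illustrative, not from the slides):

    from collections import Counter
    from fractions import Fraction

    # Enumerate the 36 equally likely outcomes of two dice and count
    # each value of the sum X = die1 + die2.
    counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
    for x in sorted(counts):
        print(x, Fraction(counts[x], 36))  # f(x) = P(X = x)
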
Joint PDF
 Let X and Y be two discrete random variables. Their joint PDF is
  f(x, y) = P(X = x, Y = y)
 Marginal PDF:
  f(x) = Σy f(x, y) and f(y) = Σx f(x, y)
 Conditional PDF:
  f(y | x) = f(x, y) / f(x)
 Statistical independence:
  X and Y are independent if and only if f(x, y) = f(x) f(y) for all x and y.

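A small numerical sketch makes the definitions concrete; the joint table below is hypothetical (assuming NumPy):

    import numpy as np

    # Hypothetical joint PDF f(x, y): rows index values of X,
    # columns index values of Y; entries sum to 1.
    f = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

    f_x = f.sum(axis=1)           # marginal PDF of X: f(x) = sum_y f(x, y)
    f_y = f.sum(axis=0)           # marginal PDF of Y
    f_y_given_x0 = f[0] / f_x[0]  # conditional PDF f(y | x = x0)

    # Statistical independence would require f(x, y) = f(x) f(y) everywhere.
    print(np.allclose(f, np.outer(f_x, f_y)))  # False for this table
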
Characteristics of Probability Distributions
 Expected value: E(X) = μ
 Variance: var(X) = E(X − μ)² = σ²
 Covariance: cov(X, Y) = E[(X − μx)(Y − μy)]

Characteristics of Probability Distributions
 The correlation coefficient ρ is defined as:
  ρ = cov(X, Y) / (σx σy), with −1 ≤ ρ ≤ 1.

Higher Moments of Probability Distributions
 Second moment: E(X − μ)²
 Third moment: E(X − μ)³
 Fourth moment: E(X − μ)⁴

Skewness and Kurtosis
 Skewness (lack of symmetry): S = E(X − μ)³ / σ³
 Kurtosis (tallness or flatness): K = E(X − μ)⁴ / σ⁴

Normal Distribution
 The best known of all the theoretical probability distributions.
 Its PDF is f(x) = [1 / (σ√(2π))] exp{ −(x − μ)² / (2σ²) }, written X ∼ N(μ, σ²).

Standardized Normal Distribution
 Z = (X − μ)/σ
 Its mean is 0 and its variance is 1: Z ∼ N(0, 1).

Area Under the Curve
[Figure: areas under the standard normal curve.]

Example
 Assume that X ∼ N(8, 4). What is the probability that X will assume a value between X1 = 4 and X2 = 12?
 To compute the required probability, we compute the Z values as
  Z1 = (4 − 8)/2 = −2 and Z2 = (12 − 8)/2 = 2, since σ = √4 = 2.
 Now we observe that Pr(0 ≤ Z ≤ 2) = 0.4772. Then, by symmetry, we have Pr(−2 ≤ Z ≤ 0) = 0.4772.
 Therefore, the required probability is 0.4772 + 0.4772 = 0.9544.

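The arithmetic can be checked numerically; a minimal sketch assuming SciPy is available:

    from scipy.stats import norm

    # X ~ N(8, 4): mean 8, variance 4, so the standard deviation is 2.
    p = norm.cdf(12, loc=8, scale=2) - norm.cdf(4, loc=8, scale=2)
    print(p)  # about 0.9545; the table figure 2 * 0.4772 = 0.9544 is rounded
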
Central Limit Theorem
 Consider n independent random variables, all of which have the same PDF with mean μ and variance σ².
 The sample mean approaches the normal distribution with mean μ and variance σ²/n, as n increases indefinitely.
 This result holds true regardless of the form of the PDF.

Normality Test
 How do the values of skewness and kurtosis depart from the norms of 0 and 3?
 Jarque–Bera (JB) test:
  JB = n [S²/6 + (K − 3)²/24]
 Under the null hypothesis of normality, JB is distributed as a chi-square statistic with 2 df.

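A minimal sketch of the JB statistic (assuming NumPy and SciPy; the data below are simulated for illustration):

    import numpy as np
    from scipy.stats import chi2

    def jarque_bera(x):
        # JB = n * (S**2 / 6 + (K - 3)**2 / 24); chi-square with 2 df under H0.
        x = np.asarray(x, dtype=float)
        n = len(x)
        u = x - x.mean()
        s2 = np.mean(u**2)
        skew = np.mean(u**3) / s2**1.5
        kurt = np.mean(u**4) / s2**2
        jb = n * (skew**2 / 6 + (kurt - 3)**2 / 24)
        return jb, chi2.sf(jb, df=2)  # statistic and p value

    rng = np.random.default_rng(0)
    print(jarque_bera(rng.normal(size=500)))       # large p: do not reject normality
    print(jarque_bera(rng.exponential(size=500)))  # tiny p: reject normality
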
Chi-Square Distribution
 The sum of squares of k independent standardized normal variables possesses the χ² distribution with k degrees of freedom (df).
  df means the number of independent quantities in the sum.
 The mean of the chi-square distribution is k, and its variance is 2k, where k is the df.
 As the number of df increases, the distribution becomes increasingly symmetrical.
 If Z1 and Z2 are two independent chi-square variables with k1 and k2 df, then the sum Z1 + Z2 is also a chi-square variable with df = k1 + k2.

Student's t Distribution
 If Z1 ∼ N(0, 1) and Z2 ∼ χ²k are independent, then t = Z1 / √(Z2/k) ∼ tk,
  i.e. t follows Student's t distribution with k df.
 Symmetrical, but flatter than the normal.
 Its mean is zero, and its variance is k/(k − 2) for k > 2.
 As k increases, the t distribution approximates the normal distribution.

F Distribution
 If Z1 ∼ χ²k1 and Z2 ∼ χ²k2 are independent, then F = (Z1/k1) / (Z2/k2) ∼ Fk1,k2,
  i.e. F follows Fisher's F distribution with k1 and k2 df.
 Its mean is k2/(k2 − 2) for k2 > 2, and its variance is 2k2²(k1 + k2 − 2) / [k1(k2 − 2)²(k2 − 4)] for k2 > 4.
 As k1 and k2 increase, the F distribution approaches the normal distribution.
 The square of a t-distributed variable has an F distribution: t²k ∼ F1,k.
 If k2 is fairly large, then k1F ∼ χ²k1.

Distributions Related to Normal
 Since for large df the t, chi-square, and F distributions approach the normal distribution, these distributions are known as the distributions related to the normal distribution.

Bernoulli Distribution
 Its PDF:
  P(X = 0) = 1 − p,
  P(X = 1) = p
 Success / Failure
 Its mean is p and its variance is p(1 − p).

Binomial Distribution
 A generalization of the Bernoulli distribution:
  f(x) = C(n, x) pˣ qⁿ⁻ˣ, x = 0, 1, . . . , n
  n: number of independent trials,
  p: success probability,
  q = 1 − p: failure probability,
  X: number of successes.
 Its mean is np, and its variance is npq.

Poisson Distribution
 Its PDF: f(x) = e^(−λ) λˣ / x!, x = 0, 1, 2, . . .
 E(X) = var(X) = λ
 It is used to model rare events.

Conditional Distribution Normal
 Let (x, y) be distributed as bivariate normal:
  f(x, y) = [1 / (2π σx σy √(1 − ρ²))] exp{ −[1 / (2(1 − ρ²))] [ ((x − μx)/σx)² − 2ρ((x − μx)/σx)((y − μy)/σy) + ((y − μy)/σy)² ] }
 The marginal density is
  f(x) = [1 / (σx √(2π))] exp{ −(1/2)((x − μx)/σx)² }
 The conditional density is
  f(y | x) = f(x, y) / f(x) = [1 / (σy|x √(2π))] exp{ −(1/2)((y − μy|x)/σy|x)² }
 where μy|x = μy + ρ(σy/σx)(x − μx) and σy|x = σy√(1 − ρ²).

Conditional Distribution Normal
[Figure: the marginal distribution of Y compared with the conditional distributions Y | X, which are centered on the regression line in X and are less spread out.]

Basic Econometrics in Transportation
Statistical Inference

Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Sources: Basic Econometrics, by Gujarati, and lecture notes of Professor Greene

Outline
 Problem of Estimation
  Point
  Interval
 Estimation Methods
 Statistical Properties
  Small sample
  Large sample
 Hypothesis Testing

Estimation
 The population has characteristics:
  Mean, variance
  Median
  Percentiles
 Get a slice (random sample) of the population.
 How would you obtain the values of its parameters?
 This is known as the problem of estimation:
  Point estimation
  Interval estimation

Point Estimation
 For simplicity, assume that there is only one unknown parameter (θ) in the PDF.
 Develop a function of the sample values such that θ̂ = f(X1, X2, . . . , Xn).
  θ̂ is a statistic, or an estimator, and is a random variable.
  A particular numerical value of the estimator is known as an estimate.
 θ̂ is known as a point estimator because it provides a single (point) estimate of θ.

Example
 Let X̄ = ΣXi/n be the sample mean, an estimator of the true mean value (μ).
 If in a specific case X̄ = 50, this provides an estimate of μ.

Interval Estimation
 An interval between two values that may include the true θ.
 The key concept is the probability distribution of an estimator.
 Will the interval contain the true value?
 The interval is a "Confidence Interval."
 The degree of certainty (1 − α) is the degree of confidence; α is the level of significance.

Example
 Suppose that the distribution of the height of men in a population is normally distributed with mean = μ inches and σ = 2.5 inches. A sample of 100 men drawn randomly from this population had an average height of 67 inches. Establish a 95% confidence interval for the mean height (= μ) in the population as a whole.
 The 95% confidence interval is X̄ ± 1.96 σ/√n = 67 ± 1.96(2.5/10), that is, 66.51 ≤ μ ≤ 67.49.

Estimation Methods
 Least Squares (LS)
  A considerable amount of time will be devoted to illustrating the LS method.
 Maximum Likelihood (ML)
  This method has broad application and will also be discussed partially in the last lectures.
 Method of Moments (MOM)
  Is becoming more popular, but not at the introductory level.

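A numeric check of the height example above; a minimal sketch (the 1.96 critical value is the usual standard normal table figure):

    import math

    # 95% confidence interval for mu when sigma is known:
    # x_bar +/- 1.96 * sigma / sqrt(n)
    x_bar, sigma, n = 67.0, 2.5, 100
    half_width = 1.96 * sigma / math.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)  # (66.51, 67.49)
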
Statistical Properties
 The statistical properties of estimators fall into two categories:
  Small-sample, or finite-sample, properties
  Large-sample, or asymptotic, properties
 Underlying both sets of properties is the notion that an estimator has a probability distribution.

Small-Sample Properties
 Unbiasedness
  θ̂ is an unbiased estimator of θ if E(θ̂) = θ.
  Bias(θ̂) = E(θ̂) − θ
 Distribution of two estimators of θ:
[Figure: sampling distributions of two estimators of θ, one centered on θ (unbiased) and one not.]

Small-Sample Properties
 Minimum Variance
  θ̂1 is a minimum-variance estimator if var(θ̂1) is no larger than the variance of any other estimator of θ.
 Linearity
  An estimator θ̂ is said to be a linear estimator of θ if it is a linear function of the sample observations.
  Example: the sample mean X̄ = ΣXi/n.
 Best Linear Unbiased Estimator (BLUE): linear, unbiased, and minimum variance in the class of linear unbiased estimators.

Small-Sample Properties
 Minimum Mean-Square Error
  MSE(θ̂) = E(θ̂ − θ)² = var(θ̂) + [Bias(θ̂)]²
  The MSE criterion trades off bias against variance.

Large-Sample Properties
 Often an estimator does not satisfy one or more of the desirable statistical properties in small samples.
 But as the sample size increases indefinitely, the estimator may possess several desirable statistical properties.
 Asymptotic Unbiasedness:
  θ̂ is asymptotically unbiased if E(θ̂n) tends to θ as n tends to infinity.
  Example: the sample variance of a random variable, Σ(Xi − X̄)²/n, is biased but asymptotically unbiased.

Large-Sample Properties
 Consistency
  θ̂ is said to be a consistent estimator if it approaches the true value θ as the sample size gets larger and larger: plim θ̂ = θ.
  A sufficient condition for consistency is that the bias and variance both tend to zero (or that the MSE tends to zero) as the sample size increases indefinitely.
  Unbiasedness and consistency are conceptually very much different.
 If θ̂ is a consistent estimator of θ and h(·) is a continuous function, then plim h(θ̂) = h(plim θ̂).
  Note that this property does not hold true of the expectation operator E: in general, E[h(θ̂)] ≠ h(E(θ̂)).

Large-Sample Properties
 Asymptotic Efficiency
  The variance of the asymptotic distribution of θ̂ is called its asymptotic variance.
  If θ̂ is consistent and its asymptotic variance is smaller than the asymptotic variance of all other consistent estimators of θ, it is called asymptotically efficient.
 Asymptotic Normality
  An estimator θ̂ is said to be asymptotically normally distributed if its sampling distribution tends to approach the normal distribution as the sample size n increases indefinitely.

Hypothesis Testing
 Assume X with a known PDF f(x; θ).
 We obtain the point estimator θ̂ from a random sample.
 Since the true θ is rarely known, we raise the question:
  Is the estimator θ̂ "compatible" with some hypothesized value of θ, say θ*?
 To test the null hypothesis, we use the sample information to obtain what is known as the test statistic.
 Use the confidence interval or test of significance approach to test the null hypothesis.

Example
 From the previous example, which was concerned with the height (X) of men in a population:
 Could the sample with a mean of 67, the test statistic, have come from the population with the mean value of 69?

Confidence Interval Approach
 State the null hypothesis H0 and the alternative hypothesis H1:
  H0: μ = 69 and H1: μ ≠ 69
 Select the test statistic:
  X̄
 Determine the probability distribution of the test statistic:
  X̄ ∼ N(μ, σ²/n)
 Choose the level of significance (α).
 Using the probability distribution of the test statistic, establish a 100(1 − α)% confidence interval. Check if the hypothesized value of the parameter lies in the acceptance region.

Test of Significance Approach

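A sketch of the test for the height example (assuming SciPy; H0: μ = 69 against H1: μ ≠ 69, with known σ = 2.5 and n = 100):

    import math
    from scipy.stats import norm

    x_bar, mu0, sigma, n = 67.0, 69.0, 2.5, 100
    z = (x_bar - mu0) / (sigma / math.sqrt(n))  # test statistic
    p_value = 2 * norm.sf(abs(z))               # two-tailed p value
    print(z, p_value)  # z = -8: far outside any usual acceptance region
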
Errors
 A Type I error is rejecting H0 when it is true; a Type II error is not rejecting H0 when it is false.
 A Type I error is likely to be more serious in practice.
 The probability of a Type I error is designated as α and is called the level of significance.
 The probability of not committing a Type II error is called the power of the test.

Homework 1
Statistical Inference (Casella and Berger, 2001)
1. Chapter 3, Problem 23 [20 points]
2. Chapter 4, Problem 1 [20 points]
3. Chapter 4, Problem 7 [15 points]
4. Chapter 5, Problem 18 (a, b, and d) [45 points]

Basic Econometrics in Transportation
Introduction to Regression Analysis

Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Definition
 Regression analysis is the study of the dependence of one variable on one or more other variables, with a view to estimating and/or predicting the average value of the former in terms of the known values of the latter.

Variables
 In regression analysis we are concerned with statistical dependence among variables,
  not functional or deterministic dependence, such as Newton's law of gravity.
 Explanatory variables will not enable us to predict the dependent variable exactly because of:
  Errors involved in measuring these variables,
  Other factors that collectively affect Y but may be difficult to identify individually.

Causation
 A statistical relationship in itself cannot logically imply causation.
 Ideas of causation must come from outside statistics, ultimately from some theory or other.
 Example of crop yield and rainfall.

Correlation
 Closely related to, but conceptually very different from, regression.
 In correlation analysis the objective is to measure the degree of linear association between two variables.
  Both dependent and independent variables are assumed to be random.
 In regression analysis we try to estimate the average value of one variable on the basis of the fixed values of other variables.
  The dependent variable is assumed to be random, while the explanatory variables are assumed to have fixed values.

Terminology
 Two-variable (bivariate) regression analysis
 Multiple regression analysis
 Y denotes the dependent variable.
 X's (X1, X2, . . . , Xk) denote the explanatory variables.
 Xki (or Xkt) denotes the ith (or tth) observation on Xk.
 N denotes the total number of observations in the population, and n the total number of observations in a sample.
 Subscript i is used for cross-sectional data and t for time series data.

Bivariate Regression Analysis

Example
 Weekly income (X) and weekly consumption expenditure (Y), both in dollars.
[Table: weekly family income X and the distribution of weekly family consumption expenditure Y at each income level.]

Essence of Regression Analysis
 What is the expected value of weekly consumption expenditure of a family?
  $121.20 (the unconditional mean).
 What is the expected value of weekly consumption expenditure of a family whose weekly income is $140?
  $101 (the conditional mean).
 Knowledge of the income level enables us to better predict the mean value of consumption expenditure.

Regression Curve
 A population regression curve is the curve connecting the means of the subpopulations of Y corresponding to the given values of X.
 For each X there is a population of Y values that are spread around the mean of those Y values.

Population Regression Function
 What form does the function f(Xi) assume?
  The form of the PRF is an empirical question.
 A linear function may be hypothesized:
  E(Y | Xi) = β1 + β2Xi

Linearity
 In the variables
  The regression curve is a straight line.
  E(Y | Xi) = β1 + β2Xi² is not a linear function.
 In the parameters
  E(Y | Xi) = β1 + β2Xi² is a linear model.
  E(Y | Xi) = β1 + β2²Xi is not a linear model.
 The term "linear" regression means a regression that is linear in the parameters.
 We are primarily concerned with linear models.

Error Term
 As family income increases, family consumption expenditure on average increases.
 What about individual families?
 The deviation of an individual Yi around its expected value is ui:
  Yi = E(Y | Xi) + ui
 ui is the stochastic disturbance, or stochastic error term.
 Yi can be expressed as the sum of two components:
  A systematic, or deterministic, component: E(Y | Xi)
  A nonsystematic component: ui

Example
 In a linear model, the consumption expenditure of the families with $80 income can be expressed as:
  55 = β1 + β2(80) + u1
  60 = β1 + β2(80) + u2
  65 = β1 + β2(80) + u3
  70 = β1 + β2(80) + u4
  75 = β1 + β2(80) + u5

Significance of Error Term
Instead of having an error term, why not introduce the omitted variables into the model explicitly?
 Vagueness of theory
  We might be ignorant or unsure about the other variables affecting Y.
 Unavailability of data
  We do not have access to all sorts of data (e.g., family wealth).
 Core variables versus peripheral variables
  The influence of some variables is so small that they are ignored for cost considerations.

Significance of Error Term
 Intrinsic randomness in human behavior
  Randomness in individual Y's cannot be explained no matter how hard we try.
 Poor proxy variables
  Observed variables may not equal the desired variables.
 Principle of parsimony
  We would like to keep our regression model as simple as possible.
 Wrong functional form

Sample Regression Function
 In real situations we only have a sample of X and Y values.
 From our previous example, suppose we have two samples:
[Figure: regression lines fitted to Sample 1 and Sample 2.]
 Which one represents the "true" population regression line?

Sample Regression Function
 Analogously to the PRF, we can develop the concept of the sample regression function (SRF) to represent the sample regression line:
  Ŷi = β̂1 + β̂2Xi
  Ŷi: estimator of E(Y | Xi)
  β̂1: estimator of β1
  β̂2: estimator of β2
 Note: an estimator is a formula or method that tells how to estimate the population parameter from the information provided by the sample at hand.

Sample Regression Function
 Analogous to ui:
  ûi = Yi − Ŷi, so Yi = Ŷi + ûi
 Our primary objective in regression analysis is to estimate the PRF
  Yi = β1 + β2Xi + ui
 on the basis of the SRF
  Yi = β̂1 + β̂2Xi + ûi

Sample and Population Regression Lines
 SRF: Ŷi = β̂1 + β̂2Xi and Yi = β̂1 + β̂2Xi + ûi
[Figure: sample and population regression lines.]

An Example

Basic Econometrics in Transportation
Bivariate Regression Analysis

Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Problem of Estimation
 Ordinary Least Squares (OLS)
 Maximum Likelihood (ML)
 Generally, OLS is used extensively in regression analysis:
  It is intuitively appealing.
  It is mathematically much simpler than ML.
 Both methods generally give similar results in the linear regression context.

Ordinary Least Squares Method
 To determine the PRF: Yi = β1 + β2Xi + ui
 We estimate it from the SRF: Yi = β̂1 + β̂2Xi + ûi

Ordinary Least Squares Method
 The sum of squared residuals is a function of the estimators: Σûi² = f(β̂1, β̂2).
 We would like to determine the SRF in a way that it is as close as possible to the actual Y.
 OLS finds the unique estimates of β1 and β2 that give the smallest possible value of Σûi².
 Why square?
  Otherwise the signs of the residuals cancel, i.e. the sum of the residuals Σûi could be small even with a poor fit.
  Squaring gives more weight to large residuals.
 The resulting estimators are:
  β̂2 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = Σxiyi / Σxi²
  β̂1 = Ȳ − β̂2X̄
 Deviation form: yi = Yi − Ȳ and xi = Xi − X̄.

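A minimal sketch of these formulas (assuming NumPy; the data are illustrative, patterned on the weekly income and consumption example):

    import numpy as np

    # Hypothetical sample of X (income) and Y (consumption) values.
    X = np.array([80., 100., 120., 140., 160., 180., 200., 220., 240., 260.])
    Y = np.array([70., 65., 90., 95., 110., 115., 120., 140., 155., 150.])

    x = X - X.mean()  # deviation form
    y = Y - Y.mean()

    b2 = (x * y).sum() / (x * x).sum()  # slope: sum(x_i y_i) / sum(x_i^2)
    b1 = Y.mean() - b2 * X.mean()       # intercept: Y_bar - b2 * X_bar

    u = Y - (b1 + b2 * X)               # residuals
    r2 = 1 - (u * u).sum() / (y * y).sum()
    print(b1, b2, r2)
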
OLS Properties: Sample Mean
 The regression line passes through the sample means: Ȳ = β̂1 + β̂2X̄.

OLS Properties: Linearity
 β̂2 is a linear function (a weighted average) of Y:
  β̂2 = Σki Yi, where ki = xi / Σxi²
 Xi, and thus ki, are nonstochastic.
 Note: Σki = 0, ΣkiXi = 1, and Σki² = 1/Σxi².

OLS Properties: Unbiasedness
 Substituting Yi = β1 + β2Xi + ui into β̂2 = ΣkiYi gives β̂2 = β2 + Σkiui, so E(β̂2) = β2.

OLS Properties: Mean of Estimated Y
 The mean value of the estimated Y is equal to the mean value of the actual Y:
  Ŷi = β̂1 + β̂2Xi; sum both sides over the sample values and divide by the sample size n.

OLS Properties: Mean of Residuals
 The mean value of the residuals is zero: Σûi / n = 0.

OLS Properties: Uncorrelated Residuals Y
 The residuals are uncorrelated with the predicted Yi: Σûi Ŷi = 0.

OLS Properties: Uncorrelated Residuals X
 The residuals are uncorrelated with Xi: Σûi Xi = 0.

OLS Assumptions
 Our objective is not only to estimate some coefficients but also to draw inferences about the true coefficients.
 Thus certain assumptions are made about the manner in which Yi are generated:
  Yi = β1 + β2Xi + ui
 Unless we are specific about how Xi and ui are created, there is no way we can make any statistical inference about Yi, β1, and β2.
 The Gaussian standard, or classical linear regression model (CLRM), makes 10 assumptions that are extremely critical to the valid interpretation of the regression estimates.

Assumption 1
 The regression model is linear in the parameters.
  E(Y | Xi) = β1 + β2Xi² is a linear model.
  E(Y | Xi) = β1 + β2²Xi is not a linear model.

Assumption 2
 X values are fixed in repeated sampling.
  X is assumed to be non-stochastic.
 Our regression analysis is conditional regression analysis, that is, conditional on the given values of the regressor(s) X.

Assumption 3
 Zero mean value of the disturbance ui:
  E(ui | Xi) = 0
 Factors not explicitly included in the model, and therefore subsumed in ui, do not systematically affect the mean value of Y.

Assumption 4
 Homoscedasticity, or equal variance, of ui:
  var(ui) = σ² for all i
 Heteroscedasticity: the variance of ui varies from observation to observation.

Assumption 5
 No autocorrelation (serial correlation) between the disturbances:
  cov(ui, uj) = 0 for i ≠ j

Assumption 6
 The disturbance u and the explanatory variable X are uncorrelated:
  cov(ui, Xi) = 0
 If X and u are correlated, their individual effects on Y may not be assessed.

Assumption 7
 n must be greater than the number of explanatory variables.
  Obviously, we need at least two pairs of observations to estimate the two unknowns!

Assumption 8
 Variability in X values.
  Mathematically, if all the X values are identical, it is impossible to estimate β2 (the denominator will be zero) and therefore β1.
  Intuitively, it is obvious as well.

Assumption 9
 The regression model is correctly specified.
 Important questions that arise in the specification of a model:
  What variables should be included in the model?
  What is the functional form of the model? Is it linear in the parameters, the variables, or both?
  What are the probabilistic assumptions made about the Yi, the Xi, and the ui entering the model?

Assumption 10
 There is no perfect multicollinearity.
  No perfect linear relationships among the explanatory variables.
  This will be further discussed in multiple regression models.

Assumptions
 How realistic are all these assumptions?
  We make certain assumptions because they facilitate the study, not because they are realistic.
 Consequences of violating the CLRM assumptions will be examined later.
 We will look into:
  Precision of OLS estimates, and
  Statistical properties of OLS.

Precision of OLS Estimates
 The precision of an estimate is measured by its standard error:
  var(β̂2) = σ² / Σxi² and se(β̂2) = σ / √(Σxi²)

Standard Error of β̂1
 var(β̂1) = σ² ΣXi² / (n Σxi²) and se(β̂1) = √(var(β̂1))

Homoscedastic Variance of ui
 How to estimate σ²:
  σ̂² = Σûi² / (n − 2), where (n − 2) is the degrees of freedom.

Features of the Variances
 The variance of β̂2 is directly proportional to σ² but inversely proportional to Σxi²:
  As n increases, the precision with which β2 can be estimated also increases.
  If there is substantial variation in X, β2 can be measured more accurately.
 The variance of β̂1 is directly proportional to σ² and ΣXi², but inversely proportional to Σxi² and the sample size n.
 Note: cov(β̂1, β̂2) = −X̄ var(β̂2).
  If the slope coefficient is overestimated, the intercept will be underestimated (when X̄ is positive).

Gauss–Markov Theorem
 Given the assumptions of the classical linear regression model, the OLS estimators, in the class of unbiased linear estimators, have minimum variance; that is, they are BLUE.
 The theorem makes no assumptions about the probability distribution of ui, and therefore of Yi.

Goodness of Fit
 The coefficient of determination R² is a summary measure that tells how well the sample regression line fits the data.
 Total Sum of Squares (TSS) = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS)
 R² = ESS/TSS = 1 − RSS/TSS

Consistency
 An asymptotic property.
 An estimator is consistent if it is unbiased and its variance tends to zero as the sample size n tends to infinity.
  Unbiasedness is already proved.

Before Hypothesis Testing
 Using the method of OLS we can estimate β1, β2, and σ².
 The estimators (β̂1, β̂2, σ̂²) are random variables.
 To draw inferences about the PRF, we must find out how close β̂2 is to the true β2.
 We need to find out the PDF of the estimators.

Probability Distribution of Disturbances
 β̂2 is ultimately a linear function of the random variable ui, which is random by assumption.

Why the Normality Assumption?
 The nature of the probability distribution of ui plays an extremely important role in hypothesis testing.
 It is usually assumed that ui ∼ NID(0, σ²).
  NID: Normally and Independently Distributed.
 ui represents the combined influence of omitted variables; we hope that the influence of these omitted or neglected variables is small and at best random.
 Central limit theorem (CLT):
  Let X1, X2, . . . , Xn denote n independent random variables, all of which have the same PDF with mean = μ and variance = σ².
 We show by the CLT that if there are a large number of IID random variables, the distribution of their sum tends to a normal distribution as the number of such variables increases indefinitely.

Why the Normality Assumption?
 A variant of the CLT states that, even if the number of variables is not very large or if these variables are not strictly independent, their sum may still be normally distributed.
 With this assumption, the PDF of the OLS estimators can be easily derived, as any linear function of normally distributed variables is itself normally distributed.
 The normal distribution is a comparatively simple distribution involving only two parameters.
 It enables us to use the t, F, and χ² tests for regression models.
  The normality assumption plays a critical role for small sample size data.
  In reasonably large samples, we may relax the normality assumption.

Normality Test
 Since we are "imposing" the normality assumption, it behooves us to find out in practical applications involving small sample size data whether the normality assumption is appropriate.
 Later, we will introduce some tests to do just that.
 We will come across situations where the normality assumption may be inappropriate.
 Until then we will continue with the normality assumption.

Estimators' Properties with Normality Assumption
 They are unbiased, have minimum variance (are efficient), and are consistent.
 β̂1 and β̂2 are themselves normally distributed.
 (n − 2)σ̂²/σ² is distributed as the χ² with (n − 2) df.
  This will help us to draw inferences about the true σ² from the estimated σ².
 The estimators have minimum variance in the entire class of unbiased estimators, whether linear or not:
  Best Unbiased Estimators (BUE).

Method of Maximum Likelihood
 If ui are assumed to be normally distributed, the ML and OLS estimators of the regression coefficients are identical.
 The ML estimator of σ² is biased in small samples.
  Asymptotically, the ML estimator of σ² is unbiased.
 The ML method can be applied to regression models that are nonlinear in the parameters, in which case OLS is generally not used.
  The importance of this will be explained later.

ML Estimation
 Assume the two-variable model Yi = β1 + β2Xi + ui in which ui ∼ N(0, σ²), so that Yi ∼ N(β1 + β2Xi, σ²).
 Having independent Y's, the joint PDF of Y1, . . . , Yn can be written as the product f(Y1) f(Y2) · · · f(Yn).
 β1, β2, and σ² are the unknowns in the likelihood function:
  LF(β1, β2, σ²) = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) Σ(Yi − β1 − β2Xi)² }
 The method of maximum likelihood consists in estimating the unknown parameters in such a manner that the probability of observing the given Y's is as high as possible.

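The likelihood maximization can be sketched numerically; a minimal example assuming NumPy and SciPy, with simulated data (in the linear-normal case the closed-form ML estimators exist, so the optimizer is purely illustrative):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = np.linspace(1, 10, 50)
    Y = 2.0 + 0.5 * X + rng.normal(scale=1.0, size=50)  # true b1 = 2, b2 = 0.5

    def neg_log_likelihood(theta):
        b1, b2, log_s2 = theta       # parameterize sigma^2 by its log so it stays > 0
        s2 = np.exp(log_s2)
        u = Y - b1 - b2 * X
        n = len(Y)
        return 0.5 * n * np.log(2 * np.pi * s2) + (u * u).sum() / (2 * s2)

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]))
    b1, b2, s2 = res.x[0], res.x[1], np.exp(res.x[2])
    print(b1, b2, s2)  # coefficients match OLS; s2 = RSS/n is biased downward
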
ML Estimation
 From the first-order conditions for optimization, the ML estimators of β1 and β2 are identical to the OLS estimators, while
  σ̂²ML = Σûi² / n
 Note how ML underestimates the true σ² in small samples.

Interval Estimation
 How reliable are the point estimates?
 We try to find two positive numbers δ and α such that:
  Pr(β̂2 − δ ≤ β2 ≤ β̂2 + δ) = 1 − α
  The probability of constructing an interval that contains β2 is 1 − α.
 Such an interval is known as a confidence interval.
  α (0 < α < 1) is known as the level of significance.
 How are the confidence intervals constructed?
  If the probability distributions of the estimators are known, the task of constructing confidence intervals is a simple one.

Confidence Intervals for β1 and β2
 It can be shown that under the normality assumption the t variable
  t = (β̂2 − β2) / se(β̂2)
 follows the t distribution with n − 2 df.
 The width of the confidence interval is proportional to the standard error of the estimator.
 The same holds for β1.

Confidence Intervals for σ²
 It can be shown that under the normality assumption the following variable
  (n − 2)σ̂² / σ²
 follows the χ² distribution with n − 2 df.
 Interpretation of this interval:
  If we establish 95% confidence limits on σ² and if we maintain a priori that these limits will include the true σ², we shall be right in the long run 95 percent of the time.

Hypothesis Testing
 Is a given observation compatible with some stated hypothesis?
 In statistics, the stated hypothesis is known as the null hypothesis, or H0 (versus an alternative hypothesis, or H1).
 Hypothesis testing is developing rules for rejecting or accepting the null hypothesis:
  Confidence interval approach,
  Test of significance approach.
 Most of the statistical hypotheses of our interest make statements about one or more values of the parameters of some assumed probability distribution such as the normal, F, t, or χ².

Confidence Interval Approach
 Decision rule: Construct a 100(1 − α)% confidence interval for β2. If the β2 under H0 falls within this interval, do not reject H0; if it falls outside this interval, reject H0.
 Note: There is a 100α percent chance of committing a Type I error. If α = 0.05, there is a 5 percent chance that we could reject the null hypothesis even though it is true.
 When we reject the null hypothesis, we say that our finding is statistically significant.
 One-tail or two-tail test:
  Sometimes we have a strong expectation that the alternative hypothesis is one-sided rather than two-sided.

Test of Significance Approach
 In the confidence-interval procedure we try to establish a range that has a probability of including the true but unknown β2.
 In the test-of-significance approach we hypothesize some value for β2 and try to see whether the estimated β2 lies within confidence limits around the hypothesized value.
 A large t value will be evidence against the null hypothesis.

Practical Aspects
 Accepting the "null" hypothesis:
  All we can say is that, based on the sample evidence, we have no reason to reject it.
  Another null hypothesis may be equally compatible with the data.
 The 2-t rule of thumb
  If df > 20 and α = 0.05, then the null hypothesis β2 = 0 can be rejected if |t| > 2.
  In these cases we do not even have to refer to the t table to assess the significance of the estimated slope coefficient.
 Forming the null hypotheses
  Theoretical expectations or prior empirical work can be relied upon to formulate hypotheses.
 p value
  The lowest significance level at which a null hypothesis can be rejected.

Analysis of Variance
 A study of the two components of TSS (= ESS + RSS) is known as ANOVA from the regression viewpoint.
  F = (ESS/1) / (RSS/(n − 2))
 What use can be made of the preceding F ratio?

Analysis of Variance
 If we assume that the disturbances ui are normally distributed, which we do under the CNLRM, and if the null hypothesis (H0) is that β2 = 0, then it can be shown that F follows the F distribution with 1 df in the numerator and (n − 2) df in the denominator.

F-ratio
 It can be shown that
  E(ESS/1) = σ² + β2²Σxi² and E(RSS/(n − 2)) = σ²
 Note that β2 and σ² are the true parameters.
 If β2 is zero, both equations provide us with identical estimates of the true σ²; thus, X has no linear influence on Y.
 The F ratio therefore provides a test of the null hypothesis H0: β2 = 0.

Example
 Compute the F ratio and obtain the p value of the computed F statistic.
 The p value of this F statistic with 1 and 8 df is 0.0000001.
 Therefore, if we reject the null hypothesis, the probability of committing a Type I error is very small.

Application of Regression Analysis
 One use is to "predict" or "forecast" the future consumption expenditure Y corresponding to some given level of income X.
 There are two kinds of predictions:
  Prediction of the conditional mean value of Y corresponding to a chosen X, say X0.
  Prediction of an individual Y value corresponding to a chosen X.

Mean Prediction
 Estimator of E(Y | X0): Ŷ0 = β̂1 + β̂2X0
 It can be shown that:
  var(Ŷ0) = σ² [1/n + (X0 − X̄)² / Σxi²]
 The corresponding statistic follows the t distribution with n − 2 df and may be used to derive confidence intervals.

Individual Prediction
 Estimator of the individual value Y0 corresponding to X0: Ŷ0 = β̂1 + β̂2X0
 It can be shown that:
  var(Y0 − Ŷ0) = σ² [1 + 1/n + (X0 − X̄)² / Σxi²]
 This statistic follows the t distribution with n − 2 df and may be used to derive confidence intervals.

Confidence Bands
[Figure: confidence bands around the sample regression line for mean and individual predictions.]

Individual Versus Mean Prediction
 The confidence interval for an individual Y0 is wider than that for the mean value of Y0.
 The width of the confidence bands is smallest when X0 = X̄.

Reporting the Results

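The two variance formulas can be compared numerically; a sketch with simulated data (assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.linspace(0, 10, 30)
    Y = 1.0 + 2.0 * X + rng.normal(size=30)

    x = X - X.mean()
    b2 = (x * (Y - Y.mean())).sum() / (x * x).sum()
    b1 = Y.mean() - b2 * X.mean()
    u = Y - b1 - b2 * X
    n = len(Y)
    s2 = (u * u).sum() / (n - 2)      # estimate of sigma^2 with n - 2 df

    X0 = 5.0
    d = 1.0 / n + (X0 - X.mean())**2 / (x * x).sum()
    var_mean = s2 * d                  # variance of the mean prediction
    var_indiv = s2 * (1.0 + d)         # variance of the individual prediction
    print(var_mean, var_indiv)         # the individual variance is always larger
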
Evaluating the Results
 How "good" is the fitted model? Any standard?
  Are the signs of the estimated coefficients in accordance with theoretical or prior expectations?
  How well does the model explain variation in Y? One can use r².
  Does the model satisfy the assumptions of the CNLRM?
 For now, we would like to check the normality of the disturbance term. Recall that the t and F tests require that the error term follow the normal distribution.

Normality Tests
 Several tests exist in the literature. We look at:
 Histogram of residuals:
  A simple graphic device to learn about the shape of the PDF.
  Horizontal axis: the values of the OLS residuals are divided into suitable intervals.
  Vertical axis: erect rectangles equal in height to the frequency in that interval.
  From a normal population we will get a bell-shaped PDF.
 Normal probability plot (NPP):
  A simple graphic device.
  Horizontal axis: plot the values of the OLS residuals.
  Vertical axis: show the expected value of the variable if it were normally distributed.
  From a normal population we will get a straight line.
 The Jarque–Bera test:
  An asymptotic test, with chi-square distribution and 2 df.

Homework 2
Basic Econometrics (Gujarati, 2003)
1. Chapter 3, Problem 21 [10 points]
2. Chapter 3, Problem 23 [30 points]
3. Chapter 5, Problem 9 [30 points]
4. Chapter 5, Problem 19 [30 points]
Assignment weight factor = 1

Basic Econometrics in Transportation
Bivariate Regression Discussion

Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Topics
 First we consider the case of regression through the origin:
  A situation where the intercept term is absent from the model.
 Then we consider the question of the units of measurement:
  Whether a change in the units of measurement affects the regression results.
 Finally, we consider the question of the functional form of the linear regression model:
  So far we have considered models that are linear in the parameters as well as in the variables.

Regression Through the Origin
 There are occasions when the PRF assumes the following form:
  Yi = β2Xi + ui
 Sometimes the underlying theory dictates that the intercept term be absent from the model.
  Cost analysis theory: the variable cost of production is proportional to output.
  Some versions of monetarist theory: the rate of inflation is proportional to the rate of change of the money supply.
 How do we estimate such models, and what special problems do they pose?

Estimation
 Write the SRF as: Yi = β̂2Xi + ûi
 Applying the OLS method:
  β̂2 = ΣXiYi / ΣXi², var(β̂2) = σ² / ΣXi², σ̂² = Σûi² / (n − 1)
 Comparison with the formulas obtained when the intercept is present:
  β̂2 = Σxiyi / Σxi², var(β̂2) = σ² / Σxi², σ̂² = Σûi² / (n − 2)

Intercept-present Versus Intercept-less
 Intercept-present:
  Uses adjusted (from the mean) sums of squares and cross products.
  df for estimating σ² is (n − 2).
  Σûi is always zero.
  The coefficient of determination is always between 0 and 1.
 Intercept-less:
  Uses raw sums of squares and cross products.
  df for estimating σ² is (n − 1).
  Σûi need not be zero.
  The coefficient of determination is not always between 0 and 1.

R² for Interceptless Models
 Compute the raw R² for such models:
  raw r² = (ΣXiYi)² / (ΣXi² ΣYi²)
 Raw r² is not directly comparable to the conventional r² value.
  Some authors do not report the r² value for interceptless models.
 Unless there is a very strong a priori expectation, one would be well advised to stick to the conventional model:
  If the intercept turns out to be statistically insignificant, we effectively have a regression through the origin.
  If there is an intercept but we insist on fitting a regression through the origin, we would be committing a specification error, thus violating Assumption 9.

Units of Measurement
 Do the units in which the variables are measured make any difference in the regression results?
 If so, what is the sensible course to follow in choosing units of measurement for regression analysis?

Units of Measurement
 Let Yi* = w1Yi and Xi* = w2Xi, where w1 and w2 are scaling factors, and compare the regression of Y* on X* with that of Y on X.

Discussion
 One should choose the units of measurement sensibly.
 If the scaling factors are identical (w1 = w2):
  The slope coefficient and its standard error remain unaffected.
  The intercept and its standard error are both multiplied by w1.
 If the X scale is not changed:
  The slope as well as the intercept coefficients and their respective standard errors are all multiplied by the same factor.
 If the Y scale remains unchanged:
  ?
 Goodness of fit is not affected at all.

Standardized Variables
 Subtract the mean value of the variable from its individual values and divide the difference by the standard deviation of that variable:
  Yi* = (Yi − Ȳ)/SY and Xi* = (Xi − X̄)/SX
 Instead of running Yi = β1 + β2Xi + ui,
 we could run Yi* = β1* + β2*Xi* + ui* = β2*Xi* + ui*.

Interpretation
 How do we interpret the beta coefficients?
 Note that the new variables are measured in standard deviation units.
 If the regressor increases by one standard deviation, on average, the regressand increases by β2* standard deviation units.
 Note 1: For the standardized regression we have not given the r² value, because this is a regression through the origin.
 Note 2: There is an interesting relationship between the β of the conventional model and the standardized one:
  β̂2* = β̂2 (SX / SY)
  SX = the sample standard deviation of the X regressor,
  SY = the sample standard deviation of the regressand.

Advantages?
 More apparent if there is more than one regressor.
 We put all regressors on an equal basis and can compare them directly.
 If one coefficient is larger than another, then it contributes more to the explanation of the regressand.

Functional Forms of Regression Models
 We consider some commonly used regression models that may be nonlinear in the variables but are linear in the parameters:
  Log-linear models
  Semilog models
  Reciprocal models
  Logarithmic reciprocal models
 What are the special features of each model? When is each appropriate? How is each estimated?

Log-linear Models
 Consider the following exponential regression model:
  Yi = β1 Xi^β2 e^ui, which can be written as ln Yi = ln β1 + β2 ln Xi + ui
 It is linear in the parameters, linear in the logarithms of the variables, and can be estimated by OLS regression.
 Because of this linearity, such models are called log-log, double-log, or log-linear models.

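A log-linear model is estimated by running OLS on the logged variables; a minimal sketch with simulated data (assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.uniform(1, 10, size=200)
    Y = 3.0 * X**0.8 * np.exp(rng.normal(scale=0.1, size=200))  # b1 = 3, b2 = 0.8

    ln_x, ln_y = np.log(X), np.log(Y)
    dx = ln_x - ln_x.mean()
    b2 = (dx * (ln_y - ln_y.mean())).sum() / (dx * dx).sum()  # elasticity estimate
    ln_b1 = ln_y.mean() - b2 * ln_x.mean()
    print(np.exp(ln_b1), b2)  # close to 3 and 0.8
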
Discussion
 One attractive feature of the log-log model, which has made it popular in applied work:
  β2 measures the elasticity of Y with respect to X (why?).
  The model assumes that the elasticity coefficient remains constant throughout, hence it is called the constant elasticity model.
 Although α (= ln β1) and β2 have unbiased estimators, β1 does not!
  In most practical problems, β1 is of secondary importance.
 How to decide?
  The simplest way is to plot the scattergram of ln Yi against ln Xi and see if the scatter points lie approximately on a straight line.

Semilog Models
 The Log–Lin Model:
  Used to measure the growth rate.
  The regressand is the logarithm of Y and the regressor is "time."
 A model in which Y is logarithmic is called a log-lin model.
 A model in which Y is linear but the X(s) are logarithmic is a lin-log model.

Semilog Models
 The Lin–Log Model:
  We want to find the absolute change in Y for a percent change in X.
 In a Log–Lin model, multiply β̂2 by 100 for a more meaningful figure.
 In a Lin–Log model, divide β̂2 by 100 for a more meaningful figure.

Reciprocal Models
 Yi = β1 + β2 (1/Xi) + ui
 As X increases indefinitely, Y approaches the asymptotic value β1.

Logarithmic Reciprocal Model
 The log hyperbola, or logarithmic reciprocal, model takes the form:
  ln Yi = β1 − β2 (1/Xi) + ui
 Initially Y increases at an increasing rate and then increases at a decreasing rate.
 Example:
  As per capita GNP increases, initially there is a dramatic drop in CM (child mortality), but the drop tapers off as per capita GNP continues to increase.

Functional Forms
[Figure: summary table of the functional forms, their slopes, and their elasticities.]

Choice of Functional Form
 The choice of a functional form is relatively easy in bivariate cases, because we can plot the variables and get some rough idea.
 The choice becomes much harder in multivariate cases involving more than one regressor.
 A great deal of skill and experience are required in choosing an appropriate model for empirical estimation.

Tips
 Keep in mind:
  The underlying theory may suggest a particular functional form.
  Knowledge of the slope and elasticity coefficients of the various models will help to compare the various models.
  The coefficients of the model chosen should satisfy certain a priori expectations.
  Sometimes more than one model may fit a given set of data reasonably well.
  Do not overemphasize the r² measure in the sense that the higher the r², the better the model.

Error Term in Transformation
 Consider this regression model: Yi = β1 Xi^β2, in which the error term may enter in three forms:
  Yi = β1 Xi^β2 ui
  Yi = β1 Xi^β2 e^ui
  Yi = β1 Xi^β2 + ui
 The 1st and 2nd models are intrinsically linear regression models, and the 3rd is intrinsically nonlinear-in-parameters.

Error Term in Transformation
 Although the resulting linear regression models can be estimated by OLS or ML, we have to be careful about the properties of the stochastic error term that enters these models.
 The BLUE property of OLS requires that ui has zero mean value, constant variance, and zero autocorrelation.
 For hypothesis testing, we further assume that ui ∼ N(0, σ²).
 Thus in the 1st model we must test the normality of ln ui, since taking logs turns the multiplicative error ui into ln ui.

Homework 3A
Basic Econometrics (Gujarati, 2003)
1. Chapter 6, Problem 13 [50 points]
2. Chapter 6, Problem 14 [50 points]
Assignment weight factor = 0.5

Basic Econometrics in Transportation
Multiple Regression Analysis

Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Scope
 The simplest possible multiple regression model is three-variable regression,
  with one dependent variable and two explanatory variables.
 We are concerned with multiple linear regression models:
  Models are linear in the parameters.
  They may or may not be linear in the variables.

Notation and Assumptions
 The PRF may be written as: Yi = β1 + β2X2i + β3X3i + ui
  β1 is the intercept term: the mean effect on Y of all the excluded variables.
  β2 and β3 are called partial regression coefficients; β2 measures the change in the mean value of Y per unit change in X2, holding the value of X3 constant.
 We continue to operate within the CLRM framework:
  Zero mean value of ui: E(ui | X2i, X3i) = 0 for each i
  No serial correlation: cov(ui, uj) = 0 (i ≠ j)
  Homoscedasticity: var(ui) = σ²
  Zero covariance between ui and each X variable: cov(ui, X2i) = cov(ui, X3i) = 0
  No specification bias.
  No multicollinearity.
  Sufficient variability in the values of the regressors.

OLS Estimators
 SRF: Yi = β̂1 + β̂2X2i + β̂3X3i + ûi
 Minimize Σûi²: differentiate it with respect to the unknowns and set the derivatives to zero.

Variances of OLS Estimators
 We need the standard errors for two main purposes:
  To establish confidence intervals,
  To test statistical hypotheses.
 In these formulas σ² is the variance of the population disturbances ui.
 The degrees of freedom are now (n − 3) because we must first estimate the coefficients, which consume 3 df.
 As the correlation coefficient between X2 and X3 increases toward 1, the variances of the estimated β2 and β3 increase.
 β2 can be estimated more precisely if the variation in the sample values of X2 is large.

Properties of OLS Estimators
 The three-variable regression line passes through the means.
 The mean of the estimated Yi is equal to the mean of the actual Yi.
 The estimated residuals are uncorrelated with X2i and X3i.
 The estimated residuals are uncorrelated with the estimated Yi.
 The OLS estimators are BLUE.

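Minimizing Σûi² yields the normal equations; a compact way to solve them is the matrix form β̂ = (X′X)⁻¹X′y, an equivalent, standard shortcut not spelled out in the slides. A sketch with simulated data (assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    X2 = rng.uniform(0, 10, n)
    X3 = rng.uniform(0, 5, n)
    Y = 1.0 + 2.0 * X2 - 0.5 * X3 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), X2, X3])  # regressor matrix with intercept
    beta = np.linalg.solve(X.T @ X, X.T @ Y)   # solve the normal equations
    u = Y - X @ beta
    s2 = (u @ u) / (n - 3)                     # sigma^2 estimate, df = n - 3
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    print(beta, se)
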
Maximum Likelihood Estimators
 Under the assumption that the population disturbances are normally distributed with zero mean and constant variance σ², the ML estimators and the OLS estimators of the regression coefficients are identical.
 This is not true of the estimator of σ²: the ML estimator of σ² is Σûi²/n regardless of the number of variables in the model.

Coefficient of Determination R²
 It gives the percentage of the total variation in Y explained by the explanatory variables.
 Similar to the bivariate case: R² = ESS/TSS.

Example
 How do you interpret this:
 in which
 CM is the number of deaths of children under five per 1000 live births,
 PGNP is per capita GNP in 1980, and
 FLR is the female literacy rate measured in percent.

R2 and Adjusted R2
 R2 is a non-decreasing function of the number of explanatory variables present in the model.
 One may adjust RSS and TSS by their degrees of freedom:
 adj R2 = 1 − [RSS/(n − k)] / [TSS/(n − 1)]
 Besides R2 and adjusted R2, other criteria are often used to judge the adequacy of a regression model:
 Akaike's Information Criterion.
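A minimal sketch of the df adjustment just described, as plain Python functions (the RSS/TSS numbers passed in are illustrative placeholders, not from the text):
```python
# Sketch: R^2 and adjusted R^2 from RSS and TSS.
# k = number of estimated parameters (including the intercept).
def r_squared(rss: float, tss: float) -> float:
    return 1.0 - rss / tss

def adj_r_squared(rss: float, tss: float, n: int, k: int) -> float:
    # Adjust RSS by (n - k) and TSS by (n - 1) degrees of freedom.
    return 1.0 - (rss / (n - k)) / (tss / (n - 1))

# Illustrative numbers only:
print(r_squared(20.0, 100.0), adj_r_squared(20.0, 100.0, 64, 3))
```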
Comparing Two R2 Values
 It is crucial to note that in comparing two models on the basis of the coefficient of determination, whether adjusted or not:
 the sample size n, and
 the dependent variable must be the same.
 Thus for the following models the computed R2 terms cannot be compared (why?).
 How then does one compare the R2's of two models when the regressand is not in the same form?
 To compare R2 values:
 Obtain the estimated log value of each observation in Model II,
 Take the antilog of these values and then compute r2 between these antilog values and actual Yt.
 This r2 value is comparable to the r2 value of Model I.

Example
 Consider these models in which consumption of cups of coffee per day (Y) is estimated by the price of coffee (X):
 Model I: the linear form.
 Model II: the log–linear form.
 The result is an r2 value of 0.7318, which is now comparable with the r2 value of the log–linear model of 0.7448.
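A minimal Python sketch of the antilog procedure above, on simulated coffee-style data (all numbers and names are hypothetical, not the slides' data):
```python
# Sketch: make the r^2 of a log-linear model comparable with a linear one.
import numpy as np

def r2_between(a, b):
    return np.corrcoef(a, b)[0, 1] ** 2      # squared simple correlation

rng = np.random.default_rng(1)
X = rng.uniform(1, 5, 40)                                 # price (illustrative)
Y = np.exp(1.5 - 0.3 * X + rng.normal(0, 0.1, 40))        # cups per day

A = np.column_stack([np.ones_like(X), X])
# Model II: regress ln Y on X, then take antilogs of the fitted values.
b_log = np.linalg.lstsq(A, np.log(Y), rcond=None)[0]
r2_comparable = r2_between(np.exp(A @ b_log), Y)          # antilogs vs actual Y
# Model I: ordinary linear-model r^2 on the same Y, for reference.
b_lin = np.linalg.lstsq(A, Y, rcond=None)[0]
print(r2_comparable, r2_between(A @ b_lin, Y))
```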
Transformations
 Transformations discussed in the context of the bivariate case can be easily extended to multiple regression models.
 Example: the Cobb–Douglas production function,
 a log-linear model,
 in which β2 is the partial elasticity of output with respect to the input.

Polynomial Regression Models
 Example: marginal cost (MC) of production.
 There is one explanatory variable on the right-hand side, but it appears with various powers.
 Do these models present any special estimation problems?
 Strictly speaking, this does not violate the no-multicollinearity assumption.
The Problem of Inference
 If our objective is estimation as well as inference, we need to assume that ui follow some probability distribution.
 • CM: child mortality
 • PGNP: per capita GNP
 • FLR: female literacy rate
 Is the coefficient of PGNP statistically significant, that is, statistically different from zero?
 Is the coefficient of FLR of −2.2316 statistically significant?
 Are both coefficients statistically significant?

Forms of Hypothesis Testing
 Testing hypotheses about an individual regression coefficient.
 Testing the overall significance of the estimated multiple regression model, that is, finding out if all the partial slope coefficients are simultaneously equal to zero.
 Testing that two or more coefficients are equal to one another.
 Testing that the partial regression coefficients satisfy certain restrictions.
 Testing the stability of the estimated regression model over time or in different cross-sectional units.
 Testing the functional form of regression models.
Individual Regression Coefficients
 H0: β2 = 0
 Example: the null hypothesis states that, with female literacy rate held constant, per capita GNP has no linear influence on child mortality.
 To test the null hypothesis, we use the t test, with df = n − 3.
 In our example the p-value = 0.0065, and thus
 with the female literacy rate held constant, per capita GNP has a significant effect on child mortality.
 Remember that the t-testing procedure is based on the assumption that the error term ui follows the normal distribution.

Overall Significance
 H0: β2 = β3 = 0
 Whether Y is linearly related to both X2 and X3.
 Can this joint hypothesis be tested by testing the significance of β̂2 and β̂3 individually?
 We cannot use the usual t test to test the joint hypothesis that the true partial slope coefficients are zero simultaneously.
 This hypothesis can be tested by the ANOVA technique, introduced before.
 Under the assumption of normal distribution for ui and the null hypothesis, F has the F distribution with 2 and n − 3 df.
F-testing Procedure
 For a k-variable regression model:
 Yi = β1 + β2X2i + β3X3i + · · · + βkXki + ui
 To test the hypothesis:
 H0: β2 = β3 = · · · = βk = 0
 Compute F = [ESS/(k − 1)] / [RSS/(n − k)].
 If F > Fα(k − 1, n − k), reject H0.

Relationship between R2 and F
 The F statistic can be written equivalently as F = [R2/(k − 1)] / [(1 − R2)/(n − k)].
 When R2 = 0, F is zero.
 Thus, the F test, which is a measure of the overall significance of the estimated regression, is also a test of significance of R2.
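A minimal Python sketch of the R2-based F test; plugging in the upcoming example's values (R2 = 0.053, n = 119, k = 3) reproduces the quoted p-value of about 0.0425:
```python
# Sketch: overall-significance F test computed from R^2.
# k counts all parameters including the intercept.
from scipy import stats

def overall_f(r2: float, n: int, k: int):
    f = (r2 / (k - 1)) / ((1.0 - r2) / (n - k))
    p = stats.f.sf(f, k - 1, n - k)      # upper-tail p-value
    return f, p

print(overall_f(0.053, 119, 3))          # roughly (3.25, 0.0425)
```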
Example
 We observed that RGDP (relative per capita GDP) and RGDP squared explain only about 5.3 percent of the variation in GDPG (GDP growth rate) in a sample of 119 countries. This R2 of 0.053 seems a "low" value. Is it really statistically different from zero? How do we find that out?
 If R2 is zero, then F is zero.
 This F value is significant at about the 5 percent level.
 The p value is actually 0.0425.
 Thus, we can reject H0 (the two regressors have no impact on Y).

Marginal Contribution of X
 Suppose we have regressed child mortality on per capita GNP.
 We decide to add female literacy rate to the model and want to know:
 What is the marginal, or incremental, contribution of FLR, knowing that PGNP is already in the model and that it is significantly related to CM?
 Is the incremental contribution of FLR statistically significant?
 What is the criterion for adding variables to the model?
Marginal Contribution of X
 Using ANOVA, we will have:
 This F value follows the F distribution with 1 and 62 df.
 This is highly significant, suggesting that addition of FLR to the model significantly increases ESS and hence the R2 value.
 Therefore, FLR should be added to the model.
 This F ratio is equivalent to the F ratio computed from R2 values: F = [(R2new − R2old)/m] / [(1 − R2new)/(n − k)], where m is the number of new regressors and k the number of parameters in the new model.

Equality of Two Regression Coefficients
 Suppose in the multiple regression Yi = β1 + β2X2i + β3X3i + β4X4i + ui
 we want to test a hypothesis such as H0: β3 = β4.
 How do we test such a null hypothesis?
 Under the classical assumptions, it can be shown that t follows the t distribution with (n − 4) df.
 If the p value of the statistic is reasonably low, one can reject H0.
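A minimal Python sketch of the incremental F test above, in its R2 form (the R2 values passed in are hypothetical placeholders, not the text's output):
```python
# Sketch: does adding m new regressors raise R^2 significantly?
from scipy import stats

def incremental_f(r2_new: float, r2_old: float, n: int, k_new: int, m: int):
    # m = number of newly added regressors; k_new = parameters in new model
    f = ((r2_new - r2_old) / m) / ((1.0 - r2_new) / (n - k_new))
    return f, stats.f.sf(f, m, n - k_new)

# e.g., adding FLR to a CM-on-PGNP regression (hypothetical R^2 values):
print(incremental_f(r2_new=0.71, r2_old=0.17, n=64, k_new=3, m=1))
```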
Testing Linear Equality Restrictions
 Consider the Cobb–Douglas production function:
 ln Yi = β0 + β2 ln X2i + β3 ln X3i + ui
 We want to test whether β2 + β3 = 1 (constant returns to scale) or not.
 There are two approaches:
 The t-test approach.
 The F-test approach: restricted least squares.
 Impose β2 = 1 − β3:
 ln Yi = β0 + (1 − β3) ln X2i + β3 ln X3i + ui
 ln (Yi / X2i) = β0 + β3 ln (X3i / X2i) + ui
 How do we know that the restriction is valid?
 m is the number of linear restrictions.

General F Testing
 The F test provides a general method of testing hypotheses about one or more parameters of the k-variable regression model:
 Yi = β1 + β2X2i + β3X3i + · · · + βkXki + ui
 The following hypotheses can all be tested by the F test:
 H0: β2 = β3
 H0: β3 + β4 + β5 = 3
 H0: β3 = β4 = β5 = β6 = 0
 The general strategy:
 There is a larger (unconstrained) model, and then a smaller (constrained or restricted) model, which is obtained by deleting some variables from the larger.
 Compute the F ratio.
 If the computed F exceeds Fα(m, n − k), where Fα(m, n − k) is the critical F at the α level of significance, we reject the null hypothesis.
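A minimal Python sketch of the restricted-vs-unrestricted comparison above (the RSS values are illustrative placeholders for whatever the two regressions produce):
```python
# Sketch: restricted least squares F test with m linear restrictions.
from scipy import stats

def restriction_f(rss_r: float, rss_ur: float, m: int, n: int, k: int):
    # F = [(RSS_R - RSS_UR)/m] / [RSS_UR/(n - k)], k = parameters in the
    # unrestricted model.
    f = ((rss_r - rss_ur) / m) / (rss_ur / (n - k))
    return f, stats.f.sf(f, m, n - k)

# e.g., testing beta2 + beta3 = 1 (one restriction, m = 1):
print(restriction_f(rss_r=12.5, rss_ur=11.8, m=1, n=40, k=3))
```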
Parameter Stability (the Chow Test)
 A regression model involving time series data may have a structural change in the relationship between X and Y.
 Consider the data on disposable personal income and personal savings, for the United States for the period 1970–1995.
 It is well known that in 1982 the USA suffered its worst peacetime recession.
 Now we have three possible regressions:
 1970–1981: Yt = λ1 + λ2Xt + u1t (n1 = 12)
 1982–1995: Yt = γ1 + γ2Xt + u2t (n2 = 14)
 1970–1995: Yt = α1 + α2Xt + ut (n = 26)
 Assumptions:
 u1t ∼ N(0, σ2) and u2t ∼ N(0, σ2), and they are independently distributed.
 Procedure:
 Estimate the pooled regression, in which there is no parameter instability, and obtain RSS3 (df = n1 + n2 − k). We call RSS3 the restricted RSS, RSSR. (why?)
 Estimate the regression for the first period and obtain RSS1 (df = n1 − k).
 Estimate the regression for the second period and obtain RSS2 (df = n2 − k).
 Obtain the unrestricted RSS (RSSUR = RSS1 + RSS2): the two sets are independent.
 If there is no structural change, then RSSR and RSSUR should not be statistically different:
 F = [(RSSR − RSSUR)/k] / [RSSUR/(n1 + n2 − 2k)]
 Do not reject H0 of parameter stability if the computed F value does not exceed the critical F value.
Parameter Stability (the Chow Test)
 In our example:
 RSSR = 23,248.30; RSSUR = 11,790.252; F = 10.69; F1% = 5.72.
 The Chow test seems to support our earlier guess.
 The Chow test can be generalized to handle cases of more than one structural break. (how?)
 Keep in mind:
 The assumptions underlying the test must be fulfilled.
 The Chow test will tell us only if the two regressions are different, without telling us whether the difference is on account of the intercepts, slopes, or both.
 The Chow test assumes that we know the point(s) of structural break.
 Before we move on, let us examine one of the assumptions:
 The error variances in the two periods are the same.
 Since we cannot observe the true error variances, we can obtain their estimates from the RSS of the two subperiod regressions.
 It can be shown that the ratio of the two estimated variances follows the F distribution.
 Computing this ratio in an application and comparing it with the critical F value, one can decide to reject or not reject the null hypothesis.
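A minimal Python sketch of the Chow procedure above on simulated data (the break point, sample sizes, and coefficients are hypothetical, chosen to mimic the 12/14-observation split):
```python
# Sketch: Chow test, pooled (restricted) vs separate (unrestricted) fits.
import numpy as np
from scipy import stats

def rss(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

def chow_test(y1, X1, y2, X2):
    k = X1.shape[1]
    rss_r = rss(np.concatenate([y1, y2]), np.vstack([X1, X2]))  # pooled
    rss_ur = rss(y1, X1) + rss(y2, X2)                          # separate
    df = len(y1) + len(y2) - 2 * k                              # n1 + n2 - 2k
    f = ((rss_r - rss_ur) / k) / (rss_ur / df)
    return f, stats.f.sf(f, k, df)

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=12), rng.normal(size=14)
y1 = 1 + 0.5 * x1 + rng.normal(0, 0.3, 12)      # first-regime data
y2 = 3 + 1.5 * x2 + rng.normal(0, 0.3, 14)      # second-regime data
print(chow_test(y1, np.column_stack([np.ones(12), x1]),
                y2, np.column_stack([np.ones(14), x2])))
```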
Testing the Functional Form
 Choosing between linear and log–linear regression models:
 We can use the MWD test to choose:
 H0: Linear Model: Y is a linear function of the X's.
 H1: Log–Linear Model: ln Y is a linear function of the logs of the X's.
 1. Estimate the linear model and obtain the estimated Y values (Yf).
 2. Estimate the log–linear model and obtain the estimated ln Y values (ln f).
 3. Obtain Z1 = (ln Yf − ln f).
 4. Regress Y on the X's and Z1. Reject H0 if the coefficient of Z1 is statistically significant by the usual t test.
 5. Obtain Z2 = (antilog of ln f − Yf).
 6. Regress ln Y on the logs of the X's and Z2. Reject H1 if the coefficient of Z2 is statistically significant by the usual t test.
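A minimal Python sketch of the six MWD steps on simulated data (statsmodels is an assumption of this sketch; the data-generating process is hypothetical, and step 3 requires the linear fitted values Yf to be positive):
```python
# Sketch: MWD test for linear vs log-linear functional form.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.uniform(1, 10, 60)
Y = np.exp(0.5 + 0.7 * np.log(X) + rng.normal(0, 0.1, 60))

Xlin = sm.add_constant(X)
Xlog = sm.add_constant(np.log(X))
yf = sm.OLS(Y, Xlin).fit().fittedvalues            # step 1: linear fit
lnf = sm.OLS(np.log(Y), Xlog).fit().fittedvalues   # step 2: log-linear fit

z1 = np.log(yf) - lnf                              # step 3 (yf must be > 0)
m1 = sm.OLS(Y, np.column_stack([Xlin, z1])).fit()  # step 4: t test on Z1
z2 = np.exp(lnf) - yf                              # step 5
m2 = sm.OLS(np.log(Y), np.column_stack([Xlog, z2])).fit()  # step 6: Z2
print(m1.tvalues[-1], m2.tvalues[-1])
```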
Homework 3B
Basic Econometrics (Gujarati, 2003)
1. Chapter 7, Problem 16 [50 points]
2. Chapter 7, Problem 21 [50 points]
Assignment weight factor = 0.5
Basic Econometrics in Transportation
Dummy Variables
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Nature of Dummy Variables
 In regression analysis Y is influenced not only by ratio scale variables but also by variables that are essentially qualitative, or nominal scale, in nature.
 Such as sex, race, color, religion, nationality, geographical region, political disorders, and party affiliation.
 Female workers are found to earn less than their male counterparts.
 Nonwhite workers are found to earn less than whites.
 Such variables indicate the presence or absence of a quality, such as male or female, black or white.
 Dummy variables are a device to classify data into mutually exclusive categories.
ANOVA Models
 ANOVA regression models contain X's that are all dummy.
 Example: teachers' salaries by geographical region:
 Does the average annual salary of teachers differ among the regions?
 One may take the average of the salaries of the teachers in the three regions: $24,424.14 (Northeast and North Central), $22,894 (South), and $26,158.62 (West). These numbers look different, but are they statistically different?
 Yi = β1 + β2D2i + β3D3i + ui
 o Yi = average salary of public school teacher in state i.
 o D2i = 1 if the state is in the Northeast or North Central; = 0 otherwise.
 o D3i = 1 if the state is in the South; = 0 otherwise.
 How do you use this model to answer the question?

Caution in Use of Dummy Variables
 To distinguish the 3 regions, we used only 2 dummy variables.
 The category for which no dummy variable is assigned is known as the base, to which all the comparisons are made.
 If the intercept is not introduced in a model, you may avoid the "dummy variable trap".
 In an interceptless model (a dummy variable for each category), we obtain directly the mean values of the various categories.
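A minimal Python sketch of the ANOVA dummy model above on simulated state data (the region means mimic the slide's figures but the data are hypothetical):
```python
# Sketch: salary on two region dummies; West is the base category.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
region = rng.choice(["NE_NC", "South", "West"], size=51)
salary = (24424 * (region == "NE_NC") + 22894 * (region == "South")
          + 26159 * (region == "West") + rng.normal(0, 1500, 51))

D2 = (region == "NE_NC").astype(float)   # 1 if Northeast/North Central
D3 = (region == "South").astype(float)   # 1 if South; West is the base
fit = sm.OLS(salary, sm.add_constant(np.column_stack([D2, D3]))).fit()
# Intercept = mean of the base (West); beta2, beta3 = differential
# intercepts; their t tests answer "are the means statistically different?"
print(fit.params, fit.pvalues)
```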
Models with Two Qualitative Variables
 Hourly wages in relation to marital status and region of residence:
 • Y = hourly wage ($).
 • D2 = marital status; 1 = married, 0 = otherwise.
 • D3 = region of residence; 1 = South, 0 = otherwise.
 Which is the benchmark category here?
 Unmarried persons who do not live in the South.
 Are the average hourly wages statistically different compared to the base category?
 Once you go beyond one qualitative variable, you have to pay close attention to the category that is treated as the base category.

A Mixture of Quantitative and Qualitative Variables
 Yi = average annual salary of public school teachers in state i ($)
 Xi = spending on public school per pupil ($)
 D2i = 1 if the state is in the Northeast or North Central; = 0 otherwise.
 D3i = 1 if the state is in the South; = 0 otherwise.
 These results are different from the previous (dummy-only) model.
 As public expenditure goes up by a dollar, on average, a public school teacher's salary goes up by about $3.29.
Dummy Variable Alternative to Chow Test
 The Chow test procedure tells us only if two regressions are different, without telling us what is the source of the difference.
 The source of difference can be specified by pooling all the observations and running just one multiple regression:
 Yt = α1 + α2Dt + β1Xt + β2(DtXt) + ut
 Y = savings
 X = income
 t = time
 D = 1 for observations in 1982–1995; = 0, otherwise.
 α1 + β1Xt : mean savings function for 1970–1981.
 (α1 + α2) + (β1 + β2)Xt : mean savings function for 1982–1995.
 Notice how:
 the interactive dummy enables us to differentiate between the slope coefficients,
 the additive dummy enables us to differentiate between the intercepts of the two periods.
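A minimal Python sketch of the pooled dummy-interaction regression above (simulated data with a deliberate intercept and slope shift; all numbers are hypothetical):
```python
# Sketch: one pooled regression with additive and interactive dummies.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=26)
d = np.r_[np.zeros(12), np.ones(14)]            # D = 1 for the second period
y = 1 + 0.5 * x + d * (2 + 1.0 * x) + rng.normal(0, 0.3, 26)

X = sm.add_constant(np.column_stack([d, x, d * x]))
fit = sm.OLS(y, X).fit()
# params: alpha1, alpha2 (intercept shift), beta1, beta2 (slope shift);
# the t tests on alpha2 and beta2 locate the source of the difference.
print(fit.params, fit.pvalues)
```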
Dummy Variables in Seasonal Analysis
 Yt = α1D1t + α2D2t + α3D3t + α4D4t + ut
 Yt = sales of refrigerators
 The D's are the dummies, taking a value of 1 in the relevant quarter and 0 otherwise.
 If there is any seasonal effect in a given quarter, that will be indicated by a statistically significant t value of the dummy coefficient for that quarter.

Piecewise Linear Regression
 Yi = α1 + β1Xi + β2(Xi − X*)Di + ui
 Yi = sales commission
 Xi = volume of sales
 X* = threshold value of sales
 Di = 1 if Xi > X*; = 0 otherwise.
Basic Econometrics in Transportation
Multicollinearity
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Outline
 What is the nature of multicollinearity?
 Is multicollinearity really a problem?
 What are its practical consequences?
 How does one detect it?
 What remedial measures can be taken to ease the problem?
Nature of Multicollinearity
 Originally it meant the existence of a "perfect" linear relationship among some explanatory variables:
 λ1X1 + λ2X2 + · · · + λkXk = 0
 The λ's are constants such that not all of them are zero simultaneously.
 Today, it is also used where X variables are intercorrelated but not perfectly so:
 λ1X1 + λ2X2 + · · · + λkXk + vi = 0
 where vi is a stochastic error term.
 If multicollinearity is perfect, the coefficients are indeterminate and their standard errors are infinite.
 If multicollinearity is less than perfect, the coefficients are determinate but possess large standard errors.
 The coefficients cannot be estimated with great precision.
Example
 Consider the following hypothetical data:
 Perfect collinearity between X2 and X3:
 the coefficient of correlation is 1.
 No perfect collinearity between X2 and X3*:
 the coefficient of correlation between them is 0.9959.

Ballentine View of Multicollinearity
Sources of Multicollinearity
 Multicollinearity may be due to the following factors:
 The data collection method employed:
 Sampling over a limited range of the values taken by the regressors in the population.
 Constraints on the model or in the population being sampled:
 In the regression of electricity consumption on income and house size there is a physical constraint in the population, in that families with higher incomes generally have larger homes than families with lower incomes.
 Model specification:
 Adding polynomial terms to a regression model.
 An overdetermined model:
 This happens when the model has a lot of explanatory variables compared to the number of observations. This could happen in medical research, where there may be a small number of patients about whom information is collected on a large number of variables.
 Common trend in time series data:
 Regressors may all be growing over time at more or less the same rate.

Estimation in Perfect Multicollinearity
 Assume that X3i = λX2i, where λ is a nonzero constant.
 If X3 and X2 are perfectly collinear, there is no way X3 can be kept constant. As X2 changes, so does X3, by the factor λ.
Theoretical Consequences
 Recall that with the CLRM assumptions, OLS estimators are BLUE.
 And with the CNLRM assumptions, OLS estimators are BUE.
 It can be shown that even if multicollinearity is very high, the OLS estimators still retain the property of BLUE.
 Multicollinearity violates no regression assumptions.
 Unbiased, consistent estimates will occur, and their standard errors will be correctly estimated.
 The only effect of multicollinearity is to make it hard to get coefficient estimates with small standard errors.
 In fact, at a theoretical level, multicollinearity, few observations, and small variances on the independent variables are essentially all the same problem.
 Thus "What should I do about multicollinearity?" is a question like "What should I do if I don't have many observations?"

Practical Consequences
 Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult.
 Thus, the confidence intervals tend to be much wider, leading to the acceptance of the "zero null hypothesis" more readily.
 Therefore, the t ratio of one or more coefficients tends to be statistically insignificant.
 Although the t ratio(s) is statistically insignificant, R2 can be very high.
 The OLS estimators and their standard errors can be sensitive to small changes in the data.
Large Variances and Covariances
 r23 is the coefficient of correlation between X2 and X3.
 The results can be extended to the k-variable model:
 R2j = R2 in the regression of Xj on the remaining (k − 2) regressors.
 The speed with which variances increase can be seen with the variance-inflating factor (VIF):
 VIF = 1 / (1 − r223); more generally, VIFj = 1 / (1 − R2j).
 The inverse of the VIF is called tolerance (TOL).
Wider Confidence Intervals
 In cases of high collinearity the estimated standard errors increase dramatically, thereby making the t values smaller.
 In such cases, one will increasingly accept the null hypothesis that the relevant true population value is zero (the type II error increases).

High R2 but Few Significant t Ratios
 In cases of high collinearity, it is possible to find that one or more of the coefficients are individually insignificant.
 Yet the R2 in such situations may be so high that on the basis of the F test one can convincingly reject the hypothesis that β2 = · · · = βk = 0.
 Indeed, this is one of the signals of multicollinearity.
 This outcome should not be surprising in view of our discussion on individual vs. joint testing.
 The real problem here is the covariances between the estimators, which are related to the correlations between the regressors.

Sensitivity to Small Changes in Data
An Illustrative Example
X2: income, X3: wealth, and Y: consumption expenditure.
Detection of Multicollinearity
1. High R2 but few significant t ratios.
 This is the "classic" symptom of multicollinearity.
 Its disadvantage is that it is too strong.
2. High pair-wise correlations among regressors.
 If the correlation coefficient between two regressors is high (> 0.8), then multicollinearity is a serious problem.
 A sufficient but not a necessary condition for the existence of multicollinearity.
3. Examination of partial correlations.
 A finding that R21.234 is very high but r212.34, r213.24, and r214.23 are comparatively low may suggest that the variables X2, X3, and X4 are highly intercorrelated.
 The partial correlation between X and Y, given a set of n controlling variables Z = {Z1, Z2, …, Zn}, is the correlation between the residuals resulting from the linear regression of X on Z and of Y on Z.
4. Auxiliary regressions.
 Regress each Xi on the remaining X variables and compute the corresponding R2 (R2i).
 If the computed F exceeds the critical Fi at the chosen level of significance, it is taken to mean that the particular Xi is collinear with the other X's.
 Klien's rule of thumb: multicollinearity may be a problem only if R2i is greater than the overall R2.
Detection of Multicollinearity
5. Eigenvalues and condition index.
 Set A = X'X, where X is an n × k matrix of observations.
 Calculate the eigenvalues.
 The eigenvalues of a k × k matrix A are the k solutions to |A − λI| = 0.
 The condition number is defined as the maximum eigenvalue over the minimum eigenvalue. If it is between 100 and 1000 there is moderate to strong multicollinearity, and if it exceeds 1000 there is severe multicollinearity.
 The condition index is the square root of the condition number. If the CI exceeds 30, there is severe multicollinearity.
 Some authors believe that the condition index is the best available multicollinearity diagnostic.
Detection of Multicollinearity
6. Tolerance and variance inflation factor.
 The larger the value of VIFj, the more "troublesome" or collinear the variable Xj.
 As a rule of thumb, if the VIF of a variable exceeds 10, that variable is said to be highly collinear.
 VIF as a measure of collinearity is not free of criticism:
 a high VIF is neither necessary nor sufficient to get high variances.
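A minimal Python sketch of the VIF and condition-index diagnostics above, on simulated data where X3 is built to be nearly collinear with X2 (all names and thresholds are illustrative):
```python
# Sketch: VIF via auxiliary regressions, plus the condition index.
import numpy as np

rng = np.random.default_rng(6)
n = 80
X2 = rng.normal(size=n)
X3 = 2 * X2 + rng.normal(0, 0.05, n)        # near-perfect collinearity
X = np.column_stack([np.ones(n), X2, X3])

def vif(X, j):
    # Regress column j on the remaining columns; VIF_j = 1/(1 - R_j^2).
    others = np.delete(X, j, axis=1)
    b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    e = X[:, j] - others @ b
    r2 = 1 - (e @ e) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

eig = np.linalg.eigvalsh(X.T @ X)
print([vif(X, j) for j in (1, 2)])          # VIFs of X2 and X3 (>> 10 here)
print(np.sqrt(eig.max() / eig.min()))       # condition index; > 30 is severe
```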
Remedial Measures
 We have two choices:
 Do nothing,
 or follow some rules of thumb.
 Multicollinearity is essentially a data deficiency problem, and sometimes we have no choice over the data we have available for empirical analysis.
 If we cannot estimate one or more regression coefficients with greater precision, a linear combination of them can be estimated relatively efficiently.

A Priori Information
 Consider the model: Yi = β1 + β2X2i + β3X3i + ui
 Y = consumption, X2 = income, and X3 = wealth.
 Suppose a priori we believe that β3 = 0.10β2.
 We can then run: Yi = β1 + β2Xi + ui, where Xi = X2i + 0.1X3i.
 How to obtain a priori information?
 Previous empirical work,
 Relevant theory underlying the field of study.
 The success depends on the severity of the collinearity problem.
 We know how to test for the validity of such restrictions explicitly.
Pooling the Data
 A variant of the a priori information technique is to combine cross-sectional and time series data.
 Consider the model: ln Yt = β1 + β2 ln Pt + β3 ln It + ut
 Y = number of cars sold, P = average price, I = income, and t = time.
 Price and income variables generally tend to be highly collinear.
 A way out of this multicollinearity problem:
 If we have cross-sectional data, we can obtain a fairly reliable estimate of the income elasticity (β3).
 Use this estimate, and write the time series regression as Y*t = β1 + β2 ln Pt + ut.
 Note: we are assuming that the cross-sectionally estimated income elasticity is the same thing as that which would be obtained from a pure time series analysis.
 It is worthy of consideration in situations where the cross-sectional estimates do not vary substantially from one cross section to another.

Dropping a Variable
 A simple way is to drop one of the collinear variables.
 Thus, in our consumption–income–wealth illustration, we can drop the wealth variable and get a significant coefficient for the income variable.
 Note: we may be committing a specification error in dropping a variable.
 The remedy may be worse than the disease in some situations because, whereas multicollinearity may prevent precise estimation of the parameters of the model, omitting a variable may seriously mislead us as to the true values of the parameters.
Transformation
 First-Difference Form:
 If the relation Yt = β1 + β2X2t + β3X3t + ut holds at time t, it must also hold at time t − 1: Yt-1 = β1 + β2X2,t-1 + β3X3,t-1 + ut-1
 Thus: Yt − Yt-1 = β2(X2t − X2,t-1) + β3(X3t − X3,t-1) + vt
 This often reduces the severity of multicollinearity because, although the levels of X2 and X3 may be highly correlated, there is no a priori reason to believe that their differences will also be highly correlated.
 Ratio Transformation:
 Consider: Yt = β1 + β2X2t + β3X3t + ut (X2 is GDP, and X3 is total population)
 GDP and population grow over time, so they are likely to be correlated.
 One "solution" is to express the model on a per capita basis.
 In the first-difference transformation:
 The error term vt may not satisfy the assumption that the disturbances are serially uncorrelated.
 If the original disturbance term is serially uncorrelated, vt will in most cases be serially correlated.
 It may not be appropriate with cross-sectional data.
 There is a loss of one observation (important in small samples).
 Therefore, the remedy may be worse than the disease.
 In the ratio transformation:
 The error term will be heteroscedastic, if the original error term is homoscedastic.
 Again, the remedy may be worse than the disease of collinearity.
 BE CAREFUL IN TRANSFORMING!
Additional or New Data
 Remember multicollinearity is a sample feature.
 Sometimes simply increasing the size of the sample is helpful.
 As the sample size increases, the variance of the estimated coefficients will decrease, leading to more precise estimates.
 Obtaining additional or "better" data is not always that easy.
 Note: when adding new observations in situations that are not controlled, we must be aware of adding observations that were generated by a process other than that associated with the original data set.

Deviation Form
 A special feature of polynomial regression models is that the explanatory variable(s) appear with various powers.
 Thus, the explanatory variable(s) are going to be correlated.
 It has been found in practice that multicollinearity is substantially reduced if the explanatory variable(s) are expressed in the deviation form.
 There is no guarantee, and the problem may persist!
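A minimal Python sketch of the deviation (centering) device above: on an all-positive regressor, X and X2 are nearly perfectly correlated, while the centered versions are not (data hypothetical):
```python
# Sketch: centering a regressor before squaring reduces collinearity.
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(10, 20, 100)        # all-positive regressor
Xc = X - X.mean()                   # deviation form

print(np.corrcoef(X, X ** 2)[0, 1])   # close to 1: highly collinear
print(np.corrcoef(Xc, Xc ** 2)[0, 1]) # much smaller in magnitude
```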
Is Multicollinearity Necessarily Bad?
 It has been said that if the sole purpose of regression analysis is prediction or forecasting, then multicollinearity is not a serious problem when R2 is high and the regression coefficients are individually significant.

Homework 4
Basic Econometrics (Gujarati, 2003)
1. Chapter 10, Problem 26 [50 points]
2. Chapter 10, Problem 27 [50 points]
Assignment weight factor = 0.5
Basic Econometrics in Transportation
Heteroscedasticity
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Outline
 What is the nature of heteroscedasticity?
 What are its consequences?
 How does one detect it?
 What are the remedial measures?
Nature of Heteroscedasticity
 An important assumption in CLRM is that E(u2i) = σ2.
 This is the assumption of equal (homo) spread (scedasticity).
 Example: higher-income families on average save more than lower-income families, but there is also more variability in their savings.

Possible Reasons
1. As people learn, their errors of behavior become smaller over time.
 As the number of hours of typing practice increases, the average number of typing errors as well as their variance decreases.
2. As incomes grow, people have more choices about the disposition of their income.
 Rich people have more choices about their savings behavior.
3. As data collecting techniques improve, σ2i is likely to decrease.
 Banks that have sophisticated data processing equipment are likely to commit fewer errors.
Possible Reasons
4. Heteroscedasticity can arise when there are outliers.
 An outlier is an observation that is much different from the other observations in the sample.
5. Heteroscedasticity arises when the model is not correctly specified.
 Very often what looks like heteroscedasticity may be due to the fact that some important variables are omitted from the model.
6. Skewness in the distribution of a regressor is another source.
 The distribution of income and wealth in most societies is uneven, with the bulk of the income and wealth being owned by a few at the top.
7. Other sources of heteroscedasticity:
 Incorrect data transformation (ratio or first-difference transformations).
 Incorrect functional form (linear versus log–linear models).

Cross-sectional and Time Series Data
 Heteroscedasticity is likely to be more common in cross-sectional than in time series data.
 In cross-sectional data, one usually deals with members of a population at a given point in time. These members may be of different sizes, incomes, etc.
 In time series data, the variables tend to be of similar orders of magnitude because one generally collects the data for the same entity over a period of time.
OLS Estimation with Heteroscedasticity
 Consider the OLS estimators and their variances when E(u2i) = σ2i.
 Is OLS still BLUE when we drop only the homoscedasticity assumption?
 We can easily prove that it is still linear and unbiased.
 We can also show that it is a consistent estimator.
 It is no longer best, and the minimum variance is not given by the usual formula.
 What is BLUE in the presence of heteroscedasticity?

Method of Generalized Least Squares
 Ideally, we would like to give less weight to the observations coming from populations with greater variability.
 Consider: Yi = β1 + β2Xi + ui = β1X0i + β2Xi + ui, where X0i = 1 for each i.
 Assume the heteroscedastic variances σ2i are known, and divide through by σi:
 Yi/σi = β1(X0i/σi) + β2(Xi/σi) + ui/σi
 The variance of the transformed disturbance term ui/σi is now homoscedastic.
 Apply OLS to the transformed model and get BLUE estimators.
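A minimal Python sketch of this weighting: divide Y and every regressor by the known σi, then run OLS on the transformed data (the variance pattern σi = 0.5Xi is an assumption of the simulation, not the text's):
```python
# Sketch: weighted least squares with known sigma_i.
import numpy as np

rng = np.random.default_rng(8)
n = 60
X = rng.uniform(1, 10, n)
sigma = 0.5 * X                                  # assumed known: sd grows with X
Y = 2 + 0.8 * X + rng.normal(0, 1, n) * sigma

A = np.column_stack([np.ones(n), X])
Aw = A / sigma[:, None]                          # divide regressors by sigma_i
Yw = Y / sigma                                   # divide Y by sigma_i
beta_wls = np.linalg.lstsq(Aw, Yw, rcond=None)[0]
beta_ols = np.linalg.lstsq(A, Y, rcond=None)[0]  # unweighted, for comparison
print(beta_wls, beta_ols)
```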
GLS Estimators
 Minimize the weighted sum of squared residuals, Σ(ûi/σi)2.
 Following the standard calculus techniques, we obtain the GLS estimators.

Consequences of Using OLS
 The OLS estimator of the variance is a biased estimator.
 It overestimates or underestimates, on average,
 and we cannot tell whether the bias is positive or negative.
 We can no longer rely on confidence intervals, t tests, and F tests.
 If we persist in using the usual testing procedures despite heteroscedasticity, whatever conclusions we draw may be very misleading.
 Heteroscedasticity is potentially a serious problem, and the researcher needs to know whether it is present in a given situation.
Detection
 There are no hard-and-fast rules for detecting heteroscedasticity, only a few rules of thumb.
 This is inevitable because σ2i can be known only if we have the entire Y population corresponding to the chosen X's.
 More often than not, there is only one sample Y value corresponding to a particular value of X, and there is no way one can know σ2i from just one Y observation.
 Thus, heteroscedasticity may be a matter of intuition, educated guesswork, or prior empirical experience.
 Most of the detection methods are based on examination of the OLS residuals.
 Those are the ones we observe, not the ui. We hope they are good estimates.
 This hope may be fulfilled if the sample size is fairly large.

Informal Methods
 Nature of the Problem
 The nature of the problem may suggest that heteroscedasticity is likely to be encountered.
 Example: residual variance around the regression of consumption on income increases with income.
 Graphical Method
 The estimated u2i are plotted against the estimated Yi.
 Is the estimated mean value of Y systematically related to the squared residual?
 a) No systematic pattern: perhaps no heteroscedasticity.
 b–e) Definite pattern: perhaps no homoscedasticity.
 Using such knowledge, one may transform the data to alleviate the problem.
Formal Methods
 Park Test
 Park formalizes the graphical method by suggesting a log-linear model:
 ln σ2i = ln σ2 + β ln Xi + vi
 Since σ2i is generally unknown, Park suggests using the squared OLS residuals in its place.
 If β turns out to be insignificant, the homoscedasticity assumption may be accepted.
 The particular functional form chosen by Park is only suggestive.
 Note: the error term vi may not satisfy the OLS assumptions.
 Glejser Test
 Glejser suggests regressing the absolute value of the estimated error term on the X variable.
 Several functional forms are suggested:
 For large samples the first four give generally satisfactory results.
 The last two models are nonlinear in the parameters.
 Note: some have argued that vi does not have a zero expected value, is serially correlated, and is heteroscedastic.
Formal Methods
 Spearman's Rank Correlation Test
 Fit the regression to the data on Y and X and estimate the residuals.
 Rank both the absolute values of the residuals and Xi (or estimated Yi) and compute the Spearman's rank correlation coefficient:
 rs = 1 − 6 Σd2i / [n(n2 − 1)]
 • di = difference in the ranks for the ith observation.
 Assuming that the population rank correlation coefficient is zero and n > 8, the significance of the sample rs can be tested by the t test, with df = n − 2:
 t = rs √(n − 2) / √(1 − r2s)
 If the computed t value exceeds the critical t value, we may accept the hypothesis of heteroscedasticity.
 Goldfeld–Quandt Test
 Rank the observations according to the Xi values.
 Omit c central observations, and divide the remaining observations into two groups, each of (n − c)/2 observations.
 Fit separate OLS regressions to the first and last sets of observations, and obtain the residual sums of squares RSS1 and RSS2.
 Compute the ratio λ = [RSS2/df] / [RSS1/df].
 If the ui are assumed to be normally distributed, and if the assumption of homoscedasticity is valid, then it can be shown that λ follows the F distribution.
 The ability of the test depends on how c is chosen.
 Goldfeld and Quandt suggest that c = 8 if n = 30, and c = 16 if n = 60.
 Judge et al. note that c = 4 if n = 30 and c = 10 if n is about 60.
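A minimal Python sketch of the Goldfeld-Quandt procedure above on simulated data (n = 30 with c = 8, as the slide suggests; the variance pattern is an assumption of the simulation):
```python
# Sketch: Goldfeld-Quandt test - order by X, drop c central observations,
# compare the two subsample residual sums of squares.
import numpy as np
from scipy import stats

def rss(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(9)
n, c = 30, 8
x = np.sort(rng.uniform(1, 10, n))               # observations ordered by X
y = 1 + 0.5 * x + rng.normal(0, 0.2 * x)         # error sd grows with x

m = (n - c) // 2
X = np.column_stack([np.ones(n), x])
rss1 = rss(y[:m], X[:m])                         # first (n-c)/2 observations
rss2 = rss(y[-m:], X[-m:])                       # last (n-c)/2 observations
df = m - 2                                       # df in each subsample
lam = (rss2 / df) / (rss1 / df)
print(lam, stats.f.sf(lam, df, df))              # large lambda: heteroscedastic
```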
Formal Methods
 Breusch–Pagan–Godfrey Test
 The success of the GQ test depends on c and on the X with which the observations are ordered.
 Estimate Yi = β1 + β2X2i + ··· + βkXki + ui by OLS and obtain the residuals.
 Obtain the ML estimator of σ2 (the sum of squared residuals divided by n).
 Construct variables pi, defined as the squared residuals divided by this estimator.
 Regress pi on the Z's as pi = α1 + α2Z2i + ··· + αmZmi + vi
 o σ2i is assumed to be a linear function of the Z's.
 o Some or all of the X's can serve as Z's.
 Obtain the ESS (explained sum of squares) and define Θ = 0.5 ESS.
 Assuming the ui are normally distributed, one can show that if there is homoscedasticity and if the sample size n increases indefinitely, then Θ ∼ χ2m−1.
 The BPG test is an asymptotic, or large-sample, test.
 White's General Heteroscedasticity Test
 It does not rely on the normality assumption and is easy to implement.
 Estimate Yi = β1 + β2X2i + β3X3i + ui and obtain the residuals.
 Run the following auxiliary regression: regress the squared residuals on the regressors, their squares, and their cross products.
 Higher powers of the regressors can also be introduced.
 Under the null hypothesis (homoscedasticity), if the sample size n increases indefinitely, it can be shown that nR2 ∼ χ2 (df = number of regressors in the auxiliary regression).
 If the chi-square value exceeds the critical value, the conclusion is that there is heteroscedasticity.
 If it does not, α2 = α3 = α4 = α5 = α6 = 0.
 It has been argued that if cross-product terms are present, then it is a test of both heteroscedasticity and specification bias.
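A minimal Python sketch using statsmodels' implementations of both tests; each returns the LM statistic and its p-value (the data-generating process is hypothetical):
```python
# Sketch: Breusch-Pagan and White tests via statsmodels.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(10)
n = 100
x = rng.uniform(1, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)     # heteroscedastic errors

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

lm_bp, p_bp, _, _ = het_breuschpagan(resid, X)   # Z's = the regressors here
lm_w, p_w, _, _ = het_white(resid, X)            # squares/cross products added
print(p_bp, p_w)                                 # small p-values: reject H0
```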
Remedial Measures
 Heteroscedasticity does not destroy unbiasedness and consistency.
 But OLS estimators are no longer efficient, not even asymptotically.
 There are two approaches to remediation:
 when σ2i is known, and
 when σ2i is not known.
 When σ2i is known:
 The most straightforward method of correcting heteroscedasticity is by means of weighted least squares.
 The WLS method provides BLUE estimators.
 When σ2i is unknown:
 Is there a way of obtaining consistent estimates of the variances and covariances of the OLS estimators even if there is heteroscedasticity? The answer is yes.
White's Correction
 White has suggested a procedure by which asymptotically valid statistical inferences can be made about the true parameter values.
 Several computer packages present White's heteroscedasticity-corrected variances and standard errors along with the usual OLS variances and standard errors.
 White's heteroscedasticity-corrected standard errors are also known as robust standard errors.

White's Procedure
 For a 2-variable regression model Yi = β1 + β2X2i + ui, White has shown that the usual variance formula, with the squared OLS residuals substituted for σ2i, gives a consistent estimator of the true variance.
 For Yi = β1 + β2X2i + β3X3i + · · · + βkXki + ui the analogous result holds, where
 the residuals are obtained from the original regression, and
 the auxiliary residuals are obtained from the regression of the regressor Xj on the remaining regressors.
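A minimal Python sketch of requesting robust standard errors from statsmodels; the HC1 small-sample variant is an assumption of this sketch, not the slides' prescription (data hypothetical):
```python
# Sketch: usual OLS standard errors vs White-corrected (robust) ones.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.uniform(1, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x, 100)    # heteroscedastic errors

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                   # usual OLS standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")  # White-corrected (robust) SEs
print(ols.bse, robust.bse)                 # compare as a heteroscedasticity check
```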
Example
 Y = per capita expenditure on public schools by state in 1979
 Income = per capita income by state in 1979
 Both regressors are statistically significant at the 5 percent level, whereas on the basis of the White estimators they are not.
 Since robust standard errors are now available in established regression packages, it is recommended to report them.
 The WHITE option can be used to compare the output with the regular OLS output, as a check for heteroscedasticity.

Reasonable Heteroscedasticity Patterns
 Apart from being a large-sample procedure, one drawback of the White procedure is that the estimators thus obtained may not be as efficient as those obtained by methods that transform data to reflect specific types of heteroscedasticity.
 We may consider several assumptions about the pattern of heteroscedasticity.
Reasonable Heteroscedasticity Patterns
 Assumption 1: if E(u2i) = σ2X2i, transform the model by dividing through by Xi.
 Assumption 2: if E(u2i) = σ2Xi, transform the model by dividing through by √Xi.
 Assumption 3: if E(u2i) = σ2[E(Yi)]2, transform the model by dividing through by E(Yi).
 Assumption 4: a log transformation such as ln Yi = β1 + β2 ln Xi + ui very often reduces heteroscedasticity.

Homework 5
Basic Econometrics (Gujarati, 2003)
1. Chapter 11, Problem 15 [50 points]
2. Chapter 11, Problem 16 [50 points]
Assignment weight factor = 0.5
Basic Econometrics in Transportation
Autocorrelation
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Outline
 What is the nature of autocorrelation?
 What are the theoretical and practical consequences of autocorrelation?
 Since the assumption of no autocorrelation relates to the unobservable disturbances, how does one know that there is autocorrelation in any given situation?
 How does one remedy the problem of autocorrelation?
Nature of the Problem
 Autocorrelation may be defined as correlation between members of series of observations ordered in time or space.
 CLRM assumes that E(uiuj) = 0 for i ≠ j.
 Consider the regression of output on labor and capital inputs, in quarterly time series data. If there is a labor strike affecting output in one quarter, there is no reason to believe that this disruption will be carried over to the next quarter.
 If output is lower this quarter, there is no reason to expect it to be lower next quarter.
 In cross-section studies:
 Data are often collected on the basis of a random sample of, say, households.
 There is often no prior reason to believe that the error term pertaining to one household is correlated with the error term of another household.
 This is in many ways similar to the problem of heteroscedasticity.
 Under both heteroscedasticity and autocorrelation the usual OLS estimators, although linear, unbiased, and asymptotically normally distributed, no longer have the minimum variance (they are not efficient) among all linear unbiased estimators.
 Thus, they may not be BLUE.
 As a result, the usual t, F, and χ2 tests may not be valid.
Patterns of Autocorrelation
 a. Figure a shows a cyclical pattern.
 b. Figure b suggests an upward linear trend.
 c. Figure c shows a downward linear trend in the disturbances.
 d. Figure d indicates that both linear and quadratic trend terms are present in the disturbances.

Possible Reasons
1. Inertia
 A prominent feature of most economic time series.
 There is a "momentum" built into them, and it continues until something happens.
 Most economic time series exhibit positive autocorrelation because most of them either move upward or downward.
2. Specification Bias: Excluded Variables Case
 The price of chicken may affect the consumption of beef. Thus, the error term will reflect a systematic pattern if this variable is excluded.
3. Specification Bias: Incorrect Functional Form
 If X2i is incorrectly omitted from the model, a systematic fluctuation in the error term and thus autocorrelation will be detected because of the use of an incorrect functional form.
4. Cobweb Phenomenon
 Supply reacts to price with a lag of one time period.
 Planting of crops is influenced by their price last year.
5. Lags
 In a time series regression of consumption expenditure on income, it is not uncommon to find that the consumption expenditure in the current period depends also on the consumption expenditure of the previous period:
 Consumptiont = β1 + β2 incomet + β3 consumptiont-1 + ut (autoregression)
6. "Manipulation" of Data
 Averaging introduces smoothness and dampens the fluctuations in the data.
 Quarterly data look much smoother than monthly data.
 This smoothness may lend a systematic pattern to the disturbances.
7. Data Transformation
 Autocorrelation may be induced as a result of transforming the model.
 If the error term in Yt = β1 + β2Xt + ut satisfies the standard OLS assumptions, particularly the assumption of no autocorrelation, it can be shown that the error term in the first-difference form ΔYt = β2ΔXt + vt is autocorrelated.
8. Nonstationarity
 A time series is stationary if its characteristics (e.g., mean, variance, and covariance) do not change over time.
 When Y and X are nonstationary, it is quite possible that the error term is also nonstationary.
Estimation in Presence of Autocorrelation
 What happens to the OLS estimators if we introduce autocorrelation in the disturbances by assuming E(utut+s) ≠ 0?
 We revert to the two-variable regression model to explain the basic ideas involved.
 To make any headway, we must assume the mechanism that generates ut.
 E(ut, ut+s) ≠ 0 is too general an assumption to be of any practical use.
 As a starting point, one can assume a first-order autoregressive scheme.

AR(1) Scheme
 ut = ρut-1 + εt
 (−1 < ρ < 1)
 ρ is the coefficient of autocovariance.
 εt is a stochastic disturbance term that satisfies the OLS assumptions:
 E(εt) = 0
 var (εt) = σ2ε
 cov (εt , εt+s) = 0 for s ≠ 0
 Given the AR(1) scheme, it can be shown that:
 var (ut) = σ2ε / (1 − ρ2)
 Since ρ is constant, the variance of ut is still homoscedastic.
 If |ρ| < 1, the AR(1) process is stationary; that is, the mean, variance, and covariance of ut do not change over time.
 A considerable amount of theoretical and empirical work has been done on the AR(1) scheme.
AR(1) Scheme
 Under the AR(1) scheme, it can be shown that the usual OLS variance formulas must be adjusted by
 a term that depends on ρ as well as the sample autocorrelations between the values taken by the regressor X at various lags.
 In general we cannot foretell which variance (adjusted or usual) is greater.
 The OLS estimators with adjusted variances are still linear and unbiased, but no longer efficient.

BLUE Estimator in Autocorrelation Case
 In the case of autocorrelation, can we find a BLUE estimator?
 For a two-variable model and an AR(1) scheme, we can show that the GLS estimators are BLUE.
 C and D are correction factors that may be disregarded in practice.
 GLS incorporates additional information (the autocorrelation parameter ρ) into the estimating procedure, whereas in OLS such side information is not directly taken into consideration.
 What if we use the OLS procedure despite autocorrelation?
Consequences of Using OLS
 As in the heteroscedasticity case, OLS estimators are not efficient.
 The OLS confidence intervals derived are likely to be wider than those based on the GLS procedure:
 Since b2 lies in the OLS confidence interval, we mistakenly accept the hypothesis that the true β2 is zero with 95% confidence.
 Use GLS and not OLS to establish confidence intervals and to test hypotheses, even though OLS estimators are unbiased and consistent.
 If we mistakenly believe that the CNLRM assumptions hold true, errors will arise for the following reasons:
1. The residual variance is likely to underestimate the true σ2.
 If there is AR(1) autocorrelation, the degree of underestimation depends on ρ and on r, the sample correlation coefficient between successive values of the X.
 In most economic time series, ρ and r are both positive.
2. As a result, we are likely to overestimate R2.
3. t and F tests are no longer valid, and if applied, are likely to give seriously misleading conclusions about the statistical significance of the estimated regression coefficients.
A Concrete Example
 How reliable are the results of our wage-productivity regression if there is autocorrelation?

Detecting Autocorrelation
 Graphical Method:
 Very often a visual examination of the estimated u's gives us some clues about the likely presence of autocorrelation in the u's.
 We can simply plot them against time in a time sequence plot.
 Standardized residuals are the residuals divided by the standard error of the regression.
 Standardized residuals are devoid of units of measurement.
 This figure exhibits a pattern in the estimated error terms, suggesting that perhaps the ut are not random.
Detecting Autocorrelation
 Graphical Method:
 To see this differently, we can plot the residuals at time t against their value at time (t − 1), a kind of empirical test of the AR(1) scheme.
 In our wage-productivity regression, most of the residuals are grouped in the second and fourth quadrants.
 This suggests a strong positive correlation in the residuals.
 Although the graphical method is suggestive, it is subjective or qualitative in nature.

The Runs Test
 Also known as the Geary test, a nonparametric test.
 There are 9 negative, followed by 21 positive, followed by 10 negative residuals in our wage-productivity regression:
 (−−−−−−−−−)(+++++++++++++++++++++)(−−−−−−−−−−)
 Run: an uninterrupted sequence of one symbol.
 Length of a run: the number of elements in it.
 Too many runs mean the residuals change sign frequently, thus indicating negative serial correlation.
 Too few runs mean the residuals change sign infrequently, thus indicating positive serial correlation.
The Runs Test
 Let
 N = total number of observations = N1 + N2
 N1 = number of + symbols
 N2 = number of − symbols
 R = number of runs
 H0: successive outcomes are independent.
 If N1 and N2 > 10, R is approximately normal, with E(R) = 2N1N2/N + 1 and σ2R = 2N1N2(2N1N2 − N) / [N2(N − 1)].
 Decision Rule:
 Reject H0 if the estimated R lies outside the limits:
 Prob [E(R) − 1.96σR ≤ R ≤ E(R) + 1.96σR] = 0.95

Durbin–Watson d Test
 The most celebrated test for detecting serial correlation.
 Assumptions underlying the d statistic:
 1. The regression model includes the intercept term. If it is not present, obtain the RSS from the regression including the intercept term.
 2. The explanatory variables are nonstochastic, or fixed in repeated sampling.
 3. The disturbances are generated by the first-order autoregressive scheme.
 4. The error term is assumed to be normally distributed.
 5. The regression model does not include the lagged value of the dependent variable as one of the explanatory variables. (no autoregressive model)
 6. There are no missing observations in the data.
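A minimal Python sketch of the runs-test decision rule above; the 9/21/10 sign pattern mimics the slides' example, but the band computation is generic:
```python
# Sketch: runs test on a residual sign sequence.
import numpy as np

signs = np.r_[-np.ones(9), np.ones(21), -np.ones(10)]   # -, +, - blocks
n1, n2 = np.sum(signs > 0), np.sum(signs < 0)
N = n1 + n2
R = 1 + np.sum(signs[1:] != signs[:-1])                 # number of runs

mean_R = 2 * n1 * n2 / N + 1
var_R = 2 * n1 * n2 * (2 * n1 * n2 - N) / (N ** 2 * (N - 1))
lo = mean_R - 1.96 * np.sqrt(var_R)
hi = mean_R + 1.96 * np.sqrt(var_R)
print(R, (lo, hi))    # R outside the band: reject H0 of independence
```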
Durbin–Watson d Test
 The PDF of the d statistic is difficult to derive because it depends in a complicated way on the X values.
 Durbin and Watson suggested bounds (Table D.5) for the computed d.
 If there is no serial correlation, d is expected to be about 2.
 Run the OLS regression and compute d.
 For the given sample size and given number of explanatory variables, find the critical dL and dU values.
 Now follow the decision rules in the figure.
 d < dL: evidence of positive autocorrelation.
 In our wage-productivity regression:
 The estimated d value can be shown to be 0.1229.
 From the Durbin–Watson tables (N = 40 and k' = 1), dL = 1.44 and dU = 1.54 at the 5 percent level.
 In many situations, it has been found that the upper limit dU is approximately the true significance limit:
 If d < dU, there is statistically significant positive autocorrelation.
 If (4 − d) < dU, there is statistically significant negative autocorrelation.
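A minimal Python sketch of the d statistic, on simulated AR(1) residuals; it also prints the rough estimate ρ̂ = 1 − d/2 used later in these slides (the ρ = 0.9 process is hypothetical):
```python
# Sketch: Durbin-Watson d statistic on AR(1) residuals.
import numpy as np

def durbin_watson_d(resid):
    # d = sum of squared successive differences over sum of squares
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(12)
e = np.zeros(40)
for t in range(1, 40):                 # AR(1) residuals with rho = 0.9
    e[t] = 0.9 * e[t - 1] + rng.normal(0, 0.1)

d = durbin_watson_d(e)
print(d, 1 - d / 2)                    # d near 0: positive autocorrelation
```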
Remedial Measures
 If autocorrelation is found, we have four options:
1. Try to find out if the autocorrelation is pure autocorrelation and not the result of mis-specification of the model.
2. In case of pure autocorrelation, one can use some type of generalized least-squares (GLS) method.
3. In large samples, we can use the Newey–West method to obtain standard errors of OLS estimators that are corrected for autocorrelation.
4. In some situations we can continue to use the OLS method.

Mis-specification
 For the wages–productivity regression:
 It may be safe to conclude that the model probably suffers from pure autocorrelation and not necessarily from specification bias.
Correcting for Pure Autocorrelation
 The remedy depends on the knowledge about the structure of autocorrelation.
 As a starter, consider the two-variable regression model, and assume that the error term follows the AR(1) scheme:
Yt = β1 + β2Xt + ut
ut = ρut-1 + εt    (−1 < ρ < 1)
 Now we consider two cases:
 ρ is known,
 ρ is not known but has to be estimated.

When ρ Is Known
 If the model Yt = β1 + β2Xt + ut holds true at time t, it also holds true at time (t − 1):
Yt-1 = β1 + β2Xt-1 + ut-1
ρYt-1 = ρβ1 + ρβ2Xt-1 + ρut-1
 Thus, the generalized (quasi-) difference equation is
Yt − ρYt-1 = β1(1 − ρ) + β2(Xt − ρXt-1) + εt
Y*t = β*1 + β*2X*t + εt
 As the error term satisfies the CLRM assumptions, we can apply OLS to the transformed variables and obtain BLUE estimators.
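A minimal sketch of this transformation, assuming ρ is known; for simplicity the first observation is dropped rather than Prais–Winsten-transformed, and the data are simulated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, rho, b1, b2 = 100, 0.8, 1.0, 0.5
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                       # AR(1) errors: u_t = rho*u_{t-1} + e_t
    u[t] = rho * u[t - 1] + rng.normal()
y = b1 + b2 * x + u

y_star = y[1:] - rho * y[:-1]               # Y*_t = Y_t - rho*Y_{t-1}
x_star = x[1:] - rho * x[:-1]               # X*_t = X_t - rho*X_{t-1}
res = sm.OLS(y_star, sm.add_constant(x_star)).fit()
print(res.params)                           # approximately [b1*(1 - rho), b2]
```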
When ρ Is Unknown
 The First-Difference Method:
 Since ρ lies between 0 and ±1, one could start from two extreme positions:
 ρ = 0: no (first-order) serial correlation,
 ρ = ±1: perfect positive or negative correlation.
 If ρ = 1, the generalized difference equation reduces to the first-difference equation:
Yt − Yt-1 = β2(Xt − Xt-1) + (ut − ut-1)
 This may be appropriate if the coefficient of autocorrelation is very high.
 A rough rule of thumb: use the first-difference form whenever d < R2.
 Use the g statistic (g = Σe2t / Σu2t) to test the hypothesis that ρ = 1:
 ut are the OLS residuals from the original regression,
 et are the OLS residuals from the first-difference regression.
 Use the Durbin–Watson tables, except that now the null hypothesis is that ρ = 1.
 In our example, g = 0.0012, dL = 1.435 and dU = 1.540: as the observed g lies below the lower limit of d, we do not reject the hypothesis that the true ρ = 1.

When ρ Is Unknown
 ρ Based on the Durbin–Watson d Statistic:
 If we cannot use the first-difference transformation because ρ is not sufficiently close to 1:
 In reasonably large samples one can obtain ρ̂ ≈ 1 − d/2 and use GLS.
 ρ Estimated from the Residuals:
 If the AR(1) scheme is valid, run the regression ût = ρ ût-1 + vt.
 No need to introduce the intercept, as we know the OLS residuals sum to zero.
 An estimate of ρ is obtained and then used to estimate the GLS.
 Thus, all these methods of estimation are known as feasible GLS (FGLS) or estimated GLS (EGLS) methods.
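A sketch of the two-step FGLS idea on simulated data: estimate ρ from the OLS residuals (regression through the origin, as above), then apply OLS to the quasi-differenced data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, rho_true = 200, 0.7
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho_true * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

# Step 1: OLS on the original model; estimate rho from the residuals
uhat = sm.OLS(y, sm.add_constant(x)).fit().resid
rho_hat = (uhat[1:] @ uhat[:-1]) / (uhat[:-1] @ uhat[:-1])   # no intercept needed

# Step 2: OLS on the quasi-differenced data (feasible GLS)
y_star = y[1:] - rho_hat * y[:-1]
x_star = x[1:] - rho_hat * x[:-1]
fgls = sm.OLS(y_star, sm.add_constant(x_star)).fit()
print(rho_hat, fgls.params)
```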
Newey–West Method
 Instead of FGLS methods, OLS with corrected SEs can be used in large samples.
 This is an extension of White's correction.
 The corrected SEs are known as HAC (heteroscedasticity- and autocorrelation-consistent) standard errors.
 We will not present the mathematics behind it, for it is involved.
 The wages–productivity regression with HAC standard errors:
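A minimal sketch of obtaining Newey–West (HAC) standard errors with `statsmodels`; the data and the lag choice (`maxlags=4`) are illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                                          # conventional OLS SEs
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})   # Newey-West SEs
print(ols.bse, hac.bse)   # coefficients are identical; only the SEs change
```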
Dummy Variables and Autocorrelation
 Recall the US savings–income regression model for 1970–1995:
 Yt = α1 + α2Dt + β1Xt + β2(DtXt) + ut
 Y = savings; X = income; D = 1 for 1982–1995; D = 0 for 1970–1981.
 Suppose that ut follows an AR(1) scheme.
 Use the generalized difference method to estimate the parameters.
 How do we transform the dummy? One can follow this procedure:
1. Set D = 0 for 1970–1981; D = 1/(1 − ρ) for 1982; D = 1 for 1983–1995.
2. Transform Xt as (Xt − ρXt-1).
3. DtXt is zero for 1970–1981; DtXt = Xt for 1982; and the remaining observations are set to (DtXt − DtρXt-1) = (Xt − ρXt-1).

Homework 6
Basic Econometrics (Gujarati, 2003)
1. Chapter 12, Problem 26 [30 points]
2. Chapter 12, Problem 29 [15 points]
3. Chapter 12, Problem 31 [25 points]
4. Chapter 12, Problem 37 [25 points]
Assignment weight factor = 1
Basic Econometrics in Transportation
Model Specification
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

Outline
 How does one go about finding the "correct" model?
 What are the consequences of specification errors?
 How does one detect specification errors?
 What remedies can one adopt and with what benefits?
 How does one evaluate the performance of competing models?
Model Selection Criteria
 The model that we accept as a good model should:
1. Be data admissible
 Predictions made from the model must be logically possible.
2. Be consistent with theory
 It must make good economic sense.
3. Have weakly exogenous regressors
 The explanatory variables must be uncorrelated with the error term.
4. Exhibit parameter constancy
 In the absence of parameter stability, predictions will not be reliable.
5. Exhibit data coherency
 Estimated residuals must be purely random (technically, white noise).
6. Be encompassing
 Other models cannot be an improvement over the chosen model.

Types of Specification Errors
 Assume the true model is Yi = β1 + β2X1i + β3X2i + β4X3i + u1i.
1. Omission of a relevant variable(s)
 Yi = α1 + α2X1i + α3X2i + u2i, where u2i = u1i + β4X3i
2. Inclusion of an unnecessary variable(s)
 Yi = λ1 + λ2X1i + λ3X2i + λ4X3i + λ5X4i + u3i, where u3i = u1i − λ5X4i
3. Adopting the wrong functional form
 LnYi = γ1 + γ2X1i + γ3X2i + γ4X3i + u4i
4. Errors of measurement
 Y*i = β*1 + β*2X*1i + β*3X*2i + β*4X*3i + u*i, where Y*i = Yi + εi and X*i = Xi + wi
5. Incorrect specification of the stochastic error term
 Multiplicative versus additive error term: Yi = βXiui and Yi = αXi + ui
Specification and Mis-specification Errors
 The first four types of error are essentially in the nature of model specification errors:
 We have in mind a "true" model but somehow we do not estimate the correct one.
 In model mis-specification errors, we do not know what the true model is to begin with.
 The controversy between the Keynesians and the monetarists:
 The monetarists give primacy to money in explaining changes in GDP.
 The Keynesians emphasize the role of government expenditure to explain changes in GDP.
 There are two competing models.
 We will first consider model specification errors and then examine model mis-specification errors.

Consequences of Model Specification Errors
 To keep the discussion simple, we will answer this question in the context of the three-variable model and consider the first two types of specification errors discussed earlier:
 Underfitting a model
 Omitting relevant variables.
 Overfitting a model
 Including unnecessary variables.
Underfitting a Model
 Suppose the true model is:
 Yi = β1 + β2X2i + β3X3i + ui
 For some reason we fit the following model:
 Yi = α1 + α2X2i + vi
Underfitting a Model
 It can be shown that E(α̂2) = β2 + β3b32:
 b32 is the slope in the regression of X3 on X2.
 b32 will be zero if X3 and X2 are uncorrelated.
 It can also be shown that α̂1 is biased unless X̄3 = b32X̄2.
 The consequences of omitting variable X3 are as follows:
1. If X2 and X3 are correlated, estimated α1 and α2 are biased and inconsistent.
2. If X2 and X3 are not correlated, only estimated α1 is biased.
3. The disturbance variance is incorrectly estimated.
 Estimated σ2's are not the same because the RSS and df of the models are different.
4. The conventionally measured variance of estimated α2 is biased.
5. In consequence, the hypothesis-testing procedures are likely to give misleading conclusions.
6. As another consequence, the forecasts will be unreliable.
 Although estimated α2 is biased, its variance is smaller.
 There is a tradeoff between bias and efficiency involved here.
 If X has a strong impact on Y, it may reduce RSS more than the loss in df.
 Thus, inclusion of such variables will reduce bias and standard error.
 Once a model is formulated on the basis of the relevant theory, one is ill-advised to drop a variable from such a model.
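A small simulation illustrating the bias formula above; the coefficients and the correlation between X2 and X3 are made up for the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)          # X3 correlated with X2
y = 1.0 + 0.5 * x2 + 0.7 * x3 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
short = sm.OLS(y, sm.add_constant(x2)).fit()              # omits X3
b32 = sm.OLS(x3, sm.add_constant(x2)).fit().params[1]     # slope of X3 on X2
# short-model slope drifts toward beta2 + beta3*b32:
print(full.params[1], short.params[1], 0.5 + 0.7 * b32)
```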
Overfitting a Model
 Suppose the true model is:
 Yi = β1 + β2X2i + ui
 For some reason we fit the following model:
 Yi = α1 + α2X2i + α3X3i + vi
 The consequences of adding variable X3 are as follows:
1. The OLS estimators of the parameters of the "incorrect" model are all unbiased and consistent.
2. The error variance σ2 is correctly estimated.
3. The usual hypothesis-testing procedures remain valid.
4. The estimated α's will be generally inefficient (larger variance).
 Inclusion of unnecessary variables makes the estimates less precise.

Tests of Specification Errors
 Suppose we develop a k-variable model, but are not sure that Xk really belongs in the model: Yi = β1 + β2X2i + … + βkXki + ui
 One simple way to find out is to test the significance of the estimated βk with the usual t test.
 If we are not sure whether X3 and X4 legitimately belong in the model, this can easily be ascertained by the F test discussed before.
 It is important to remember that in carrying out these tests of significance we have a specific model in mind.
 Given that model, we can find out whether one or more regressors are really relevant by the usual t and F tests.
Tests of Specification Errors
 Note: we should not use t and F tests to build a model iteratively.
 We should not say Y is related to X2 only because estimated β2 is statistically significant, and then expand the model to include X3 and decide to keep that variable, and so on.
 This strategy of building a model is called the bottom-up approach or, by the somewhat critical term, data mining.
 Nominal significance in the data mining approach:
 For c candidate Xs out of which k are finally selected on the basis of data mining, the true level of significance (α*) is related to the nominal level of significance (α) as: α* = 1 − (1 − α)^(c/k)
 The art of the applied econometrician is to allow for data-driven theory while avoiding the considerable dangers of data mining.

Tests of Specification Errors
 On the basis of theory or prior empirical work, we develop a model that we believe captures the essence of the subject under study.
 Then we look at some broad features of the results, such as adjusted R2, t ratios, signs of the estimated coefficients, the Durbin–Watson statistic, …
 If these diagnostics do not look encouraging, we begin to look for remedies:
 Maybe we have omitted an important variable,
 Maybe we have used the wrong functional form,
 Maybe we have not first-differenced (to remove autocorrelation).
 To determine whether model inadequacy is on account of one of these problems, we can use the methods that follow.
Omitted Variables and Incorrect Function
 Omission of an important variable or an incorrect functional form causes a plot of the residuals to exhibit distinct patterns.
 a) A linear function: Yi = λ1 + λ2Xi + u3i
 b) A quadratic function: Yi = α1 + α2Xi + α3X2i + u2i
 c) The true total cost function: Yi = β1 + β2Xi + β3X2i + β4X3i + ui
 As we approach the truth, the residuals are smaller and they do not exhibit noticeable patterns.

Omitted Variables and Incorrect Function (the Durbin–Watson d test)
1. From the assumed model, obtain the OLS residuals.
2. If you believe the model is misspecified because it excludes Z, order the residuals according to increasing values of Z.
 The Z variable could be one of the X variables included in the assumed model, or it could be some function of that variable.
3. Compute the d statistic from the residuals thus ordered.
4. If the estimated d value is significant, one can accept the hypothesis of model misspecification.
Omitted Variables and Incorrect Function (Ramsey's RESET test)
1. From the chosen model, Yi = λ1 + λ2Xi + u3i, obtain the estimated Yi.
2. Rerun the model introducing the estimated Yi in some form as additional regressors (get the idea from the plot of the residuals against estimated Y):
 Yi = λ1 + λ2Xi + λ3Ŷ2i + λ4Ŷ3i + ui
3. Use the F test to find out if R2 is significantly improved:
 F = [(R2new − R2old)/number of new regressors] / [(1 − R2new)/(n − number of parameters in the new model)]
 F = [(0.9983 − 0.8409)/2] / [(1 − 0.9983)/(10 − 4)] = 284.4035
4. If the F value is significant, one can accept that the model is mis-specified.
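A minimal sketch of the RESET steps on simulated data (a quadratic truth fitted as a line); the F statistic follows the formula above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(1, 10, size=n)
y = 1 + 2 * x + 0.3 * x**2 + rng.normal(size=n)     # true model is quadratic

restricted = sm.OLS(y, sm.add_constant(x)).fit()    # mis-specified linear fit
yhat = restricted.fittedvalues
X_aug = sm.add_constant(np.column_stack([x, yhat**2, yhat**3]))
augmented = sm.OLS(y, X_aug).fit()

# F = [(R2_new - R2_old)/2] / [(1 - R2_new)/(n - 4)]
f = ((augmented.rsquared - restricted.rsquared) / 2) / ((1 - augmented.rsquared) / (n - 4))
print(f)   # a large F suggests mis-specification
```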
Omitted Variables and Incorrect Function (the Lagrange multiplier test)
1. Estimate the restricted regression by OLS and obtain the residuals:
 Yi = λ1 + λ2Xi + u3i
2. If the unrestricted regression (Yi = β1 + β2Xi + β3X2i + β4X3i + ui) is the true one, the residuals from Step 1 should be related to X2i and X3i.
3. Regress the estimated ui from Step 1 on all the regressors:
 ûi = α1 + α2Xi + α3X2i + α4X3i + vi
4. n times R2 from this auxiliary regression follows the chi-square distribution, with df equal to the number of restrictions.
5. If the chi-square value exceeds the critical value, we reject the restricted regression.
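A matching sketch of the LM version, using the same kind of simulated data; df = 2 because two restrictions (the X2i and X3i terms) are tested:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(1, 10, size=n)
y = 1 + 2 * x + 0.3 * x**2 + rng.normal(size=n)

uhat = sm.OLS(y, sm.add_constant(x)).fit().resid     # Step 1: restricted regression
aux_X = sm.add_constant(np.column_stack([x, x**2, x**3]))
aux = sm.OLS(uhat, aux_X).fit()                      # Step 3: auxiliary regression
lm = n * aux.rsquared                                # Step 4: n*R2 ~ chi2(2)
print(lm, stats.chi2.sf(lm, df=2))                   # Step 5: small p-value rejects the restricted model
```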
Errors of Measurement
 We assume that the data are "accurate":
 Not guess estimates, extrapolated, rounded off, etc. in any systematic manner.
 Unfortunately, this ideal is not usually met in practice!
 Non-response errors, reporting errors, and computing errors.
 Whatever the reasons, it is a potentially troublesome problem.
 It forms another example of specification bias.
 Will be discussed in two parts:
 Errors of measurement in the dependent variable Y,
 Errors of measurement in the explanatory variable X.

Errors of Measurement in Y
 Consider the following model:
 Y*i = α + βXi + ui
 Y*i = permanent consumption expenditure; Xi = current income
 Y*i is not directly measurable, so we may use Yi such that
 Yi = Y*i + εi
 Therefore, we estimate
 Yi = (α + βXi + ui) + εi = α + βXi + vi
 vi is a composite error term: population disturbance plus measurement error.
 For simplicity assume that:
 E(ui) = E(εi) = 0, cov(Xi, ui) = 0 (which is the assumption of the CLRM);
 cov(Xi, εi) = 0: errors of measurement in Y*i are uncorrelated with Xi;
 cov(ui, εi) = 0: the equation error and the measurement error are uncorrelated.
Errors of Measurement in Y
 With these assumptions, it can be seen that:
 β estimated from either equation will be an unbiased estimator of the true β.
 The standard errors of β estimated from the two equations are different:
 First model: var(β̂) = σ2u / Σx2i
 Second model: var(β̂) = (σ2u + σ2ε) / Σx2i
 The latter variance is larger than the former.
 Therefore, although the errors of measurement in Y still give unbiased estimates of the parameters, the estimated variances are now larger than in the case where there are no such errors of measurement.

Errors of Measurement in X
 Consider the following model:
 Yi = α + βX*i + ui
 X*i is not directly measurable, so we may use Xi such that
 Xi = X*i + wi
 Therefore, we estimate
 Yi = α + β(Xi − wi) + ui = α + βXi + zi
 zi = ui − βwi is a compound of the equation and measurement errors.
 Assumptions:
 Even if we assume that wi has zero mean, is serially independent, and is uncorrelated with ui,
 we can no longer assume cov(zi, Xi) = 0: we can show cov(zi, Xi) = −βσ2w.
 A crucial CLRM assumption is violated!
Errors of Measurement in X
 As the explanatory variable and the error term are correlated, the OLS estimators are biased and inconsistent.
 It can be shown that plim β̂ = β [1/(1 + σ2w/σ2X*)].
 What is the solution? The answer is not easy.
 If σ2w is small compared to σ2X*, we can "assume the problem away" for practical purposes.
o The rub here is that there is no way to judge their relative magnitudes.
 One other remedy is the use of instrumental or proxy variables that, although highly correlated with X, are uncorrelated with the equation and measurement error terms.
 Above all, measure the data as accurately as possible.

Incorrect Specification of Error Term
 Since the error term is not directly observable, there is no easy way to determine the form in which it enters the model.
 Consider Yi = βXiui and Yi = αXi + ui.
 Assume the multiplicative model is the "correct" one (ln ui ∼ N(0, σ2)) but we estimate the additive one.
 It can be shown that ui ∼ lognormal [e^(σ2/2), e^(σ2)(e^(σ2) − 1)], and thus
 E(α̂) = β e^(σ2/2)
 Estimated α is biased, as its average is not equal to the true β.
Nested Versus Non-Nested Models
 Two models are nested if one can be derived as a special case of the other:
 Model A: Yi = β1 + β2X2i + β3X3i + β4X4i + β5X5i + ui
 Model B: Yi = β1 + β2X2i + β3X3i + ui
 We have already looked at tests of nested models.
 Tests of non-nested hypotheses:
 The discrimination approach:
o Given competing models, one chooses a model based on some criteria of goodness of fit (R2, adjusted R2, Akaike's information criterion, etc.).
 The discerning approach:
o In investigating one model, we take into account information provided by other models (non-nested F test, Davidson–MacKinnon J test, Cox test, JA test, P test, Mizon–Richard encompassing test, etc.).

The Non-Nested F Test
 Consider:
 Model C: Yi = α1 + α2X2i + α3X3i + ui
 Model D: Yi = β1 + β2Z2i + β3Z3i + vi
 Estimate the following nested or hybrid model and use the F test:
 Yi = λ1 + λ2X2i + λ3X3i + λ4Z2i + λ5Z3i + wi
 If Model C is correct, λ4 = λ5 = 0, whereas Model D is correct if λ2 = λ3 = 0.
 Problems with this procedure:
 If the X's and Z's are highly correlated, we have no way of deciding which is the correct model.
 To test the significance of an incremental contribution, one should choose Model C or D as the reference. The choice of the reference hypothesis could determine the outcome of the test.
 The artificially nested model F may not have any economic meaning.
Davidson–MacKinnon J Test
1. Estimate Model D and obtain the estimated Y values (ŶD).
2. Add the estimated Y as a regressor to Model C and estimate:
 Yi = α1 + α2X2i + α3X3i + α4ŶDi + ui
 This model is an example of the encompassing principle.
3. Using the t test, test the hypothesis that α4 = 0.
4. If this is not rejected, we accept Model C as the true model.
 This is because the variables not included in Model C have no additional explanatory power beyond that contributed by Model C.
 If the null hypothesis is rejected, Model C cannot be the true model.
5. Reverse the roles of the hypotheses, or Models C and D.

Davidson–MacKinnon J Test
 Some problems of the J test:
 There is no clear answer if the test leads to acceptance or rejection of both models.
 The J test may not be very powerful in small samples.
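A minimal sketch of the J test steps on simulated data, with Model C (the X model) as the true one:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
x, z = rng.normal(size=(2, n))
y = 1 + 0.8 * x + rng.normal(size=n)       # data generated by "Model C" (X)

yhat_d = sm.OLS(y, sm.add_constant(z)).fit().fittedvalues   # Step 1: fit Model D
aug = sm.OLS(y, sm.add_constant(np.column_stack([x, yhat_d]))).fit()  # Step 2
print(aug.tvalues[2])   # Step 3: t statistic on the fitted-Y term (alpha4)
# An insignificant t supports Model C; then reverse the roles of C and D (Step 5).
```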
Model Selection Criteria
 We distinguish between in-sample and out-of-sample forecasting:
 In-sample forecasting tells us how the model fits the data in a given sample.
 Out-of-sample forecasting tries to determine how a model forecasts the future.
 Several criteria are used for this purpose:
1. R2
2. Adjusted R2
3. Akaike information criterion (AIC)
4. Schwarz information criterion (SIC)
5. Mallows's Cp criterion
6. Forecast χ2 (chi-square)
 There is a tradeoff between goodness of fit and a model's complexity (as judged by the number of X's) in criteria 2 to 5.

The R2 Criterion
 R2 is defined as ESS/TSS:
 It necessarily lies between 0 and 1.
 The closer it is to 1, the better the fit.
 Problems with R2:
1. It measures in-sample goodness of fit.
 There is no guarantee that it will forecast out-of-sample observations well.
2. The dependent variable must be the same in comparing two or more R2's.
3. An R2 cannot fall when variables are added to the model.
 There is every temptation to play the game of "maximizing the R2".
 Adding variables may increase R2, but it may also increase the variance of the forecast error.
Adjusted R2
 The adjusted R2 is defined as R̄2 = 1 − (1 − R2)(n − 1)/(n − k).
 Adjusted R2 is a better measure than R2, as it penalizes the addition of more X's.
 Again, keep in mind that Y must be the same for the comparison to be valid.

Akaike Information Criterion (AIC)
 AIC is defined as AIC = e^(2k/n) RSS/n, or ln AIC = (2k/n) + ln(RSS/n).
 AIC imposes a harsher penalty than adjusted R2 for adding X's.
 In comparing two or more models, the model with the lowest value of AIC is preferred.
 One advantage of AIC is that it is useful not only for in-sample but also for out-of-sample forecasting performance of a regression model.
 Also, it is useful for both nested and non-nested models.
Schwarz Information Criterion (SIC)
 SIC is defined as SIC = n^(k/n) RSS/n, or ln SIC = (k/n) ln n + ln(RSS/n).
 SIC imposes a harsher penalty than AIC.
 Like AIC, the lower the value of SIC, the better the model.
 Like AIC, SIC can be used to compare the in-sample or out-of-sample forecasting performance of a model.

Mallows's Cp Criterion
 Mallows has developed a criterion for model selection: Cp = RSSp/σ̂2 − (n − 2p)
 RSSp is the residual sum of squares using the p regressors, and σ̂2 is estimated from the full model.
 If the model with p regressors does not suffer from lack of fit, it can be shown that E(RSSp) = (n − p)σ2.
 It is approximately true that E(Cp) ≈ p.
 In practice, one usually plots Cp against p and looks for a model with a low Cp value, about equal to p.
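A short sketch comparing candidate models; note that `statsmodels` reports log-likelihood-based AIC/BIC, which rank models the same way as the RSS-based definitions above for a fixed n:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=(n, 3))
y = 1 + 0.8 * x[:, 0] + rng.normal(size=n)   # only the first regressor matters

for k in (1, 2, 3):                          # compare nested candidate models
    res = sm.OLS(y, sm.add_constant(x[:, :k])).fit()
    print(k, res.rsquared_adj, res.aic, res.bic)
# AIC/BIC typically favor the parsimonious k = 1 model here.
```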
Forecast Chi-Square
 Suppose we have a regression model based on n observations and we want to use it to forecast an additional t observations. The forecast χ2 test is defined as:
 forecast χ2 = Σ û2t / σ̂2, summed over the t post-sample forecast errors ût, where σ̂2 is the error variance estimated from the first n observations.
 If we hypothesize that the parameter values have not changed between the sample and post-sample periods, it can be shown that the statistic above follows the chi-square distribution with t degrees of freedom.
 The forecast χ2 test has weak statistical power, meaning that the probability that the test will correctly reject a false null hypothesis is low; therefore the test should be used as a signal rather than a definitive test.

Ten Commandments
1. Thou shalt use common sense and economic theory.
2. Thou shalt ask the right questions.
3. Thou shalt know the context (no ignorant statistical analysis).
4. Thou shalt inspect the data.
5. Thou shalt not worship complexity.
6. Thou shalt look long and hard at thy results.
7. Thou shalt beware the costs of data mining.
8. Thou shalt be willing to compromise.
9. Thou shalt not confuse statistical with practical significance.
10. Thou shalt anticipate criticism.
Homework 7
Basic Econometrics (Gujarati, 2003)
1. Chapter 13, Problem 21 [35 points]
2. Chapter 13, Problem 22 [25 points]
3. Chapter 13, Problem 25 [40 points]
Assignment weight factor = 1
Basic Econometrics in Transportation
Behavioral Models
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Discrete Choice Methods with Simulation (Train)

Outline
 Features of discrete choice models.
 What is a choice set?
 Utility-maximizing behavior.
 The most prominent types of discrete choice models:
 Logit,
 Generalized extreme value (GEV),
 Probit,
 Mixed logit.
 How the models are used for forecasting over time.
Discrete Choice versus Regression
 With regression models, the dependent variable is continuous,
 with an infinite number of possible outcomes (e.g., how much money to save).
 When there is an infinite number of alternatives, discrete choice models cannot be applied.
 Often regressions examine choices of "how much" and discrete choice models examine choices of "which."
 This distinction is not accurate:
 Discrete choice models have been used to examine choices of "how much."
 Whether to use regression or discrete choice models is a specification issue:
 Usually a regression model is more natural and easier.
 A discrete choice model is used only if there are compelling reasons.

The Choice Set
 The set of alternatives within a discrete choice framework needs to exhibit three characteristics:
 The alternatives must be mutually exclusive:
 The decision maker chooses only one alternative from the choice set.
 The choice set must be exhaustive:
 The decision maker necessarily chooses one of the alternatives.
 The number of alternatives must be finite.
 The appropriate specification of the choice set in these situations is governed largely by the goals of the research and the data that are available to the researcher.
Example
 Consider households' choice among heating fuels:
 Available fuels: natural gas, electricity, oil, and wood.
 These alternatives violate both mutual exclusivity (a household can have two types) and exhaustiveness (a household can have no heating).
 How to handle these issues?
 To obtain mutually exclusive alternatives:
1. List every possible combination as an alternative.
2. Define the choice as the choice among fuels for the "primary" heating source.
 To obtain exhaustive alternatives:
1. Include "no heating" as an alternative.
2. Redefine the choice as the choice of heating fuel conditional on having heating.
 The first two conditions can usually be satisfied.
 In contrast, the third condition (a finite number of alternatives) is actually restrictive.

Utility-maximizing Behavior
 Discrete choice models are usually derived under an assumption of utility-maximizing behavior by the decision maker.
 It is a psychological stimulus that is also interpreted as utility.
 Models that can be derived in this way are called random utility models (RUMs).
 Note:
 Models derived from utility maximization can also be used to represent decision making that does not entail utility maximization.
 The models can be seen as simply describing the relation of explanatory variables to the outcome of a choice, without reference to exactly how the choice is made.
Utility-maximizing Behavior
 Notation:
 A decision maker, labeled n, faces a choice among J alternatives.
 The utility that decision maker n obtains from alternative j is Unj.
 Decision Rule:
 The decision maker chooses the alternative that provides the greatest utility:
choose alternative i if and only if Uni > Unj ∀ j ≠ i.
 Note:
 This utility is known to the decision maker but not by the researcher.

Derivation of Choice Probabilities
 You do not observe the decision maker's utility; rather, you observe
 some attributes of the alternatives, xnj,
 and some attributes of the decision maker, sn,
 and you then specify a representative utility function, Vnj = V(xnj, sn).
 There are aspects of utility that the researcher does not see: Unj = Vnj + εnj
 These terms are treated as random, as the researcher does not know εnj.
 The joint PDF of the random vector εn = (εn1, ..., εnJ) is denoted f(εn).
 The probability that decision maker n chooses alternative i is:
Pni = Prob (Uni > Unj ∀ j ≠ i)
= Prob (Vni + εni > Vnj + εnj ∀ j ≠ i)
= Prob (εnj − εni < Vni − Vnj ∀ j ≠ i)
= ∫ε I(εnj − εni < Vni − Vnj ∀ j ≠ i) f(εn) dεn
 I(·) is the indicator function.
Derivation of Choice Probabilities
 Pni = ∫ε I(εnj − εni < Vni − Vnj ∀ j ≠ i) f(εn) dεn
 This is a multidimensional integral over the density of εn.
 Different discrete choice models are obtained from different specifications of the unobserved portion of utility:
 Logit and nested logit have closed-form expressions for this integral, under the assumption that εn is distributed iid extreme value and a type of generalized extreme value, respectively.
 Probit is derived under the assumption that f is a multivariate normal.
 Mixed logit is based on the assumption that the unobserved portion of utility consists of a part that follows any distribution specified by the researcher plus a part that is iid extreme value.
Meaning of Choice Probabilities
 Consider a person who can take car or bus to work.
 The researcher observes the time (T) and cost (M) that the person would incur:
Vc = αTc + βMc and Vb = αTb + βMb
 There are other factors that affect the person’s choice.
 If it turns out that Vc = 4 and Vb = 3:
 This means that car is better for this person by 1 unit on observed factors.
 It does not mean that the person chooses car.
 Specifically, the person will choose bus if the unobserved portion of utility is
higher than that for car by at least 1 unit:
Pc = Prob(εb − εc < 1) and Pb = Prob(εb − εc > 1)
 What is meant by the distribution of εn?
Meaning of Choice Probabilities
 The density is the distribution of unobserved utility within the population of people who face the same observed utility:
 Pni is the share of people who choose alternative i within the population of people who face the same observed utility for each alternative as person n.
 The distribution can be considered in subjective terms:
 Pni is the probability that the researcher assigns to the person's choosing alternative i, given the researcher's ideas about the unobserved utility.
 The distribution can represent the effect of factors that are quixotic to the decision maker (aspects of bounded rationality):
 Pni is the probability that these quixotic factors induce the person to choose alternative i, given the observed, realistic factors.

Specific Models
 A quick preview of logit, GEV, probit, and mixed logit is useful to show how they relate to the general derivation of all choice models.
 Logit
 The most widely used discrete choice model.
 It is derived under the assumption that εni is iid extreme value for all i:
 Unobserved factors are uncorrelated, with the same variance over alternatives.
 This can be restrictive: a person who dislikes travel by bus because of the presence of other riders might have a similar reaction to rail travel.
 GEV
 Designed to avoid the independence assumption within a logit.
 A comparatively simple GEV model places the alternatives into nests, with unobserved factors having the same correlation for all alternatives within a nest and no correlation for alternatives in different nests.
Specific Models
 Probit
 Unobserved factors are distributed jointly normal.
 With a full covariance matrix, any pattern of correlation and heteroskedasticity can be accommodated.
 A limitation: a customer's willingness to pay for a desirable attribute of a product is necessarily positive, but the normal distribution has density on both sides of zero.
 Mixed logit
 Allows the unobserved factors to follow any distribution.
 Unobserved factors can be decomposed into a part that contains all the correlation and heteroskedasticity, and another part that is iid extreme value.
 Other models
 These models are often obtained by combining concepts from other models.
 A mixed probit has a similar concept to the mixed logit.

Identification of Choice Models
 Several aspects of the behavioral decision process affect the specification and estimation of any discrete choice model.
 The issues can be summarized easily in two statements:
 "Only differences in utility matter"
 "The scale of utility is arbitrary."
Differences in Utility Matter
 The level of utility doesn't matter:
 Pni = Prob(Uni − Unj > 0 ∀ j ≠ i)
 This has several implications for the identification and specification of discrete choice models:
 It means that the only parameters that can be estimated are those that capture differences across alternatives.
 This general statement takes several forms:
 Alternative-Specific Constants
 Sociodemographic Variables
 Number of Independent Error Terms

Alternative-Specific Constants
 It is often reasonable to specify the observed part of utility to be linear in parameters with a constant.
 The alternative-specific constant for an alternative captures the average effect on utility of all factors that are not included in the model.
 Since alternative-specific constants force the εnj to have a zero mean, it is reasonable to include a constant in Vnj for each alternative.
 The researcher must normalize the absolute levels of the constants:
 It is impossible to estimate the constants themselves, since an infinite number of constants (any values with the same differences) result in the same choices.
 The standard procedure is to normalize one of the constants to zero.
 In our example: Uc = αTc + βMc + εc and Ub = αTb + βMb + kb + εb
 With J alternatives, at most J − 1 constants can enter the model.
Sociodemographic Variables
 Attributes of the decision maker do not vary over alternatives.
 The same issue!
 Consider the effect of a person's income on the mode choice decision.
 Suppose that a person's utility is higher with higher income (Y):
Uc = αTc + βMc + θ0cY + εc and Ub = αTb + βMb + θ0bY + kb + εb
 Since only differences matter, absolute levels of the θ's cannot be estimated.
 To set the level, one of these parameters is normalized to zero:
Uc = αTc + βMc + εc and Ub = αTb + βMb + θbY + kb + εb, where θb = θ0b − θ0c
 Sociodemographic variables can enter utility in other ways.
 For example, cost is often divided by income.
 Then there is no need to normalize the coefficients.

Number of Independent Error Terms
 Remember that Pni = ∫ε I(εnj − εni < Vni − Vnj ∀ j ≠ i) f(εn) dεn
 This probability is a J-dimensional integral.
 With J errors, there are J − 1 error differences.
 The probability can be expressed as a (J − 1)-dimensional integral over the error differences:
 g(·) is the density of these error differences.
 For any f, the corresponding g can be derived.
 Normalization occurs automatically in logit.
 Normalization should be performed in probit.
Overall Scale of Utility Is Irrelevant
 The alternative with the highest utility is the same no matter how utility is scaled.
 The model U0nj = Vnj + εnj is equivalent to U1nj = λVnj + λεnj for any λ > 0.
 To take account of this fact, the researcher must normalize the scale of utility.
 The standard way is to normalize the variance of the error terms.
 The scale of utility and the variance of the error terms are linked by definition:
Var(λεnj) = λ2 Var(εnj)
 We will discuss:
 Normalization with iid errors,
 Normalization with heteroskedastic errors,
 Normalization with correlated errors.

Normalization with IID Errors
 With the iid assumption, normalization for scale is straightforward.
 Consider U0nj = x'njβ + ε0nj where Var(ε0nj) = σ2
 The original model is equivalent to U1nj = x'nj(β/σ) + ε1nj where Var(ε1nj) = 1
 As we will see, the error variances in a standard logit model are traditionally normalized to π2/6, which is about 1.6.
 Interpretation of results must take the normalization into account:
 Suppose in a mode choice model the estimated cost coefficient is −0.55 from a logit and −0.45 from a probit model.
 It is incorrect to say that the logit model implies more sensitivity to cost.
 The same issue arises when a model is estimated on different data sets:
 Suppose we estimate cost coefficients of −0.55 and −0.81 for Chicago and Boston.
 This scale difference means unobserved utility has less variance in Boston:
 Other factors have less effect on people in Boston.
Normalization with Heteroskedastic Errors
 The variance of the error terms can be different for different segments.
 One sets the overall scale of utility by normalizing the variance for one segment, and then estimates the variance for each segment relative to this one segment.
 In our Chicago and Boston mode choice model:
 Unj = αTnj + βMnj + εBnj ∀n in Boston
 Unj = αTnj + βMnj + εCnj ∀n in Chicago
 Define k = Var(εBnj)/Var(εCnj)
 The scale parameter (k) is estimated along with β and α.
 The variance of the error term can differ over geographic regions, data sets, time, or other factors.

Normalization with Correlated Errors
 When errors are correlated, normalizing the variance of the error for one alternative is not sufficient to set the scale of utility differences.
 Consider the utility for four alternatives as Unj = Vnj + εnj.
 The error vector εn = {εn1, εn2, εn3, εn4} has zero mean and covariance matrix Ω with elements σij = Cov(εni, εnj).
 Since only differences in utility matter, this model is equivalent to one in which all utilities are differenced from the first alternative.
Normalization with Correlated Errors
 The variance of each error difference depends on the variances and covariances of the original errors:
Var(εnj − εni) = Var(εnj) + Var(εni) − 2 Cov(εnj, εni)
 Setting the variance of one of the original errors is not sufficient to set the variance of the error differences (σ11 = k is not sufficient).
 Instead, normalize the variance of one of the error differences to some number; the covariance matrix of the error differences is then expressed relative to this normalization.
 On recognizing that only differences matter and that the scale of utility is arbitrary, the number of covariance parameters drops from ten to five.

Aggregation
 Discrete choice models operate at the individual level.
 The researcher is usually interested in some aggregate measure.
 In linear regression models:
 Estimates of aggregate Y are obtained by inserting aggregate values of the X's.
 Discrete choice models are not linear in the explanatory variables:
 Inserting aggregate values of the X's will not provide an unbiased estimate of the average response.
 Estimating the average response by calculating derivatives and elasticities at the average of the explanatory variables is similarly problematic.
 Two approaches are used instead: sample enumeration or segmentation.
Sample Enumeration
 The most straightforward, and by far the most popular, approach.
 The choice probabilities of each decision maker in a sample are summed, or averaged, over decision makers.
 Each sampled decision maker n has some weight wn associated with him, representing the number of decision makers similar to him in the population.
 A consistent estimate of the total number of decision makers in the population who choose i is Σn wnPni.
 Average derivatives and elasticities are similarly obtained by calculating the derivative and elasticity for each sampled person and taking the weighted average.

Segmentation
 When
 The number of explanatory variables is small, and
 Those variables take only a few values.
 Consider
 A model with only two variables in the utility function: education and gender.
 The total number of different types of decision makers (segments) is eight, if there are four education levels for each of the two genders.
 Choice probabilities vary only over the segments, not over individuals within each segment.
 Aggregate outcome variables can be estimated by calculating the choice probability for each segment and taking the weighted sum of these probabilities.

Forecasting
 The procedures described earlier for aggregate variables are applied:
 The exogenous variables and/or the weights are adjusted to reflect changes that are anticipated over time.
 Recalibration of constants
 To reflect that unobserved factors are different for the forecast area or year.
 An iterative process is used to recalibrate the constants:
 α0j: the estimated alternative-specific constant for alternative j.
 Sj: share of agents in the forecast area that choose alternative j in the base year.
 Ŝj: predicted share of decision makers in the forecast area who will choose each alternative.
 Adjust the alternative-specific constants: α1j = α0j + ln(Sj/Ŝj)
 With the new constants, predict the shares again, compare with the actual shares, and if needed adjust the constants again.
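A sketch combining sample enumeration with the constant-recalibration loop above; the utilities, weights, and target shares are all illustrative:

```python
import numpy as np

def logit_probs(V):
    """Logit choice probabilities for an N x J matrix of representative utilities."""
    e = np.exp(V - V.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(10)
N, J = 1000, 3
V = rng.normal(size=(N, J))                 # illustrative representative utilities
w = np.ones(N)                              # sample weights w_n
target = np.array([0.5, 0.3, 0.2])          # actual base-year shares S_j

alpha = np.zeros(J)                         # alternative-specific constants
for _ in range(50):                         # alpha_j <- alpha_j + ln(S_j / S_hat_j)
    shares = np.average(logit_probs(V + alpha), axis=0, weights=w)  # sample enumeration
    alpha += np.log(target / shares)
print(np.round(np.average(logit_probs(V + alpha), axis=0, weights=w), 3))  # matches target
```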
Basic Econometrics in Transportation
Logit Models
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Discrete Choice Methods with Simulation (Train)

Choice Probabilities
 The utility that the decision maker obtains is Unj = Vnj + εnj ∀j.
 The logit model is obtained by assuming that each εnj is IID extreme value.
 For each unobserved component of utility, the PDF and CDF are:
f(εnj) = e^(−εnj) exp(−e^(−εnj)) and F(εnj) = exp(−e^(−εnj))
 The variance of this distribution is π2/6.
 The difference between two extreme value variables (ε∗nji = εnj − εni) is distributed logistic:
F(ε∗nji) = e^(ε∗nji) / (1 + e^(ε∗nji))
 The key assumption is not so much the shape of the distribution as that the errors are independent of each other.
 One should specify utility well enough that a logit model is appropriate (i.e., obtain a white noise error term).
Choice Probabilities
 If the researcher thinks that the unobserved portion of utility is correlated over alternatives given her specification of representative utility, then she has three options:
 Use a different model that allows for correlated errors,
 Re-specify representative utility so that the source of the correlation is captured explicitly and thus the remaining errors are independent,
 Use the logit model under the current specification of representative utility, considering the model to be an approximation.

Logit Choice Probabilities Derivation
 For a given εni, the probability of choosing i is the cumulative distribution of each εnj evaluated at εni + Vni − Vnj:
Pni | εni = Π(j≠i) exp(−e^(−(εni + Vni − Vnj)))
 Of course, εni is not given, and so the choice probability is the integral of Pni | εni over all values of εni weighted by its density:
Pni = ∫ [Π(j≠i) exp(−e^(−(εni + Vni − Vnj)))] e^(−εni) exp(−e^(−εni)) dεni
 Some algebraic manipulation of this integral results in the logit formula:
Pni = e^(Vni) / Σj e^(Vnj)
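A minimal implementation of the resulting formula (the max-subtraction is only for numerical stability), applied to the earlier car/bus example with Vc = 4 and Vb = 3:

```python
import numpy as np

def logit_probabilities(V):
    """P_ni = exp(V_ni) / sum_j exp(V_nj), computed stably by subtracting the max."""
    V = np.asarray(V, dtype=float)
    e = np.exp(V - V.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(logit_probabilities([4.0, 3.0]))   # about [0.73, 0.27]
```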
Properties of Logit Probabilities
 Pni is necessarily between zero and one.
 The logit probability for an alternative is never exactly zero.
 Choice probabilities for all alternatives sum to one.
 The relation of the logit probability to representative utility is sigmoid, or S-shaped.
 The sigmoid shape of logit probabilities is shared by most discrete choice models and has important implications for policy makers.
 It is easily interpretable.

Scale Parameter
 Setting the variance to π2/6 is equivalent to normalizing the model for the scale of utility.
 Utility can be expressed as U∗nj = Vnj + ε∗nj with Var(ε∗nj) = σ2(π2/6); dividing by σ gives Unj = Vnj/σ + εnj with the normalized variance.
 The choice probability becomes Pni = e^(Vni/σ) / Σj e^(Vnj/σ).
 Only the ratio β∗/σ can be estimated (β∗ and σ are not separately identified).
 A larger variance in unobserved factors leads to smaller coefficients, even if the observed factors have the same effect on utility.
 Ratios of coefficients have economic meaning:
 Willingness to pay, values of time, and other measures of marginal rates of substitution are not affected by the scale parameter: β1/β2 = β∗1/β∗2
Power and Limitations of Logit
 Three topics explain the power and limitations of logit models:
 Taste variation
 Logit can represent systematic taste variation but not random taste variation.
 Substitution patterns
 The logit model implies proportional substitution across alternatives, given the researcher's specification of representative utility.
 Repeated choices over time (panel data)
 If unobserved factors are independent over time in repeated choice situations, then logit can capture the dynamics of repeated choice.
 However, logit cannot handle situations where unobserved factors are correlated over time.

Taste Variation
 Logit can only represent systematic taste variation.
 Consider the choice among makes and models of cars to buy:
 Two observed attributes are purchase price (PP) and shoulder room (SR).
 So utility is written as Unj = αnSRj + βnPPj + εnj
 Suppose the importance of SR varies only with household size (M): αn = ρMn
 Suppose the importance of PP is inversely related to income (I): βn = θ/In
 Substituting these relations: Unj = ρ(MnSRj) + θ(PPj/In) + εnj
 Now suppose instead αn = ρMn + μn and βn = θ/In + ηn, with random μn and ηn:
 Unj = ρ(MnSRj) + θ(PPj/In) + ε̃nj, where ε̃nj = μnSRj + ηnPPj + εnj
 The new error term cannot possibly be distributed independently and identically, as required for the logit formulation.
Substitution Patterns
 An increase in the probability of one alternative necessarily means a decrease in probability for other alternatives.
 What is the market share of an alternative, and where does the share come from?
 Logit implies a certain substitution pattern:
 If it actually occurs, the logit model is appropriate.
 To allow for more general patterns, more flexible models are needed.
 IIA (Independence from Irrelevant Alternatives) property:
 For any two alternatives, Pni/Pnk = e^(Vni − Vnk) does not depend on any alternative other than i and k.

IIA
 While the IIA property is realistic in some choice situations, it is clearly inappropriate in others:
 Red-bus–blue-bus problem:
 Assume the choice probabilities of car and blue bus are equal: Pc = Pbb = 0.5, so Pc/Pbb = 1.
 Now a red bus is introduced that is exactly like the blue bus, and thus Prb/Pbb = 1.
 In logit, Pc/Pbb is the same whether or not the red bus exists.
 The only probabilities for which Pc/Pbb = 1 and Prb/Pbb = 1 are Pc = Pbb = Prb = 1/3.
 In real life, we would expect Pc = ½ and Pbb = Prb = ¼.
 A new express bus:
 Introduced along a line that already has standard bus service.
 This new mode might be expected to reduce the probability of regular bus by a greater proportion than it reduces the probability of car.
 The logit model over-predicts transit share in this situation.
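A small numerical check of the red-bus–blue-bus result using the logit formula:

```python
import numpy as np

def logit_probs(V):
    e = np.exp(np.asarray(V, dtype=float))
    return e / e.sum()

V_car, V_bus = 0.0, 0.0                       # equal representative utilities
print(logit_probs([V_car, V_bus]))            # [0.5, 0.5]: car vs. blue bus
print(logit_probs([V_car, V_bus, V_bus]))     # identical red bus added -> [1/3, 1/3, 1/3]
# Logit preserves P_car/P_bluebus = 1, so the car share falls from 1/2 to 1/3,
# rather than staying at 1/2 as intuition suggests.
```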
Proportional Substitution
 Consider changing an attribute of alternative j. We want to know the effect of this change on the probabilities for all other alternatives.
 We will show that the elasticity of Pni with respect to znj (an attribute of alternative j) is −βz znj Pnj.
 This cross-elasticity is the same for all i ≠ j:
 An improvement in the attributes of an alternative reduces the probabilities for all the other alternatives by the same percentage.
 This pattern of substitution is a manifestation of the IIA property.
 Electric car example:
 The shares of large gas, small gas, and small electric cars are 0.66, 0.33, and 0.01.
 We want to subsidize electric cars and increase their share to 0.10.
 By logit, the probability for the large gas car would drop to 0.60, and that for the small gas car would drop by the same ten percent, from 0.33 to 0.30.
 This pattern of substitution is clearly unrealistic.

Advantages of IIA
 It is possible to estimate model parameters consistently on a subset of alternatives for each sampled decision maker:
 Since relative probabilities within a subset of alternatives are unaffected by the attributes or existence of alternatives not in the subset, exclusion of alternatives in estimation does not affect the consistency of the estimator.
 At an extreme, the number of alternatives might be so large as to preclude estimation altogether if it were not possible to utilize a subset of alternatives.
 IIA also helps when one is interested in examining choices among a subset of alternatives and not among all alternatives:
 For example, the choice only between car and bus modes for travel to work.
 If IIA holds, one can exclude persons who used walking, bicycling, etc.
Tests of IIA
 Whether IIA holds in a particular setting is an empirical question, open to statistical investigation.
 Two types of tests are suggested:
1. The model can be re-estimated on a subset of the alternatives.
 The ratio of probabilities for any two alternatives is the same under IIA.
2. The model can be re-estimated with new, cross-alternative variables.
 Variables from one alternative entering the utility of another alternative.
 If Pni/Pnk depends on alternative j (no IIA), attributes of alternative j will enter the utility of alternatives i or k significantly within a logit specification.
 The introduction of non-IIA models makes testing IIA easier than before.

Panel Data
 Data on current and past vehicle purchases of sampled households, for example:
 Data that represent repeated choices like these are called panel data.
 The logit model can be used to examine panel data
 if the unobserved factors are independent over the repeated choices.
 Each choice situation by each decision maker becomes a separate observation.
 Dynamic aspects of behavior:
 The dependent variable in previous periods can be entered as an explanatory variable.
 Inclusion of the lagged variable does not induce inconsistency in estimation, since for a logit model the errors are assumed to be independent over time.
 The assumption of independent errors over time is severe; one can:
 Use a model that allows unobserved factors to be correlated over time, or
 Respecify utility to bring the sources of the unobserved dynamics into the model.
Consumer Surplus
 Definition:
 A person's consumer surplus is the utility, in dollar terms, that he receives in the choice situation:
 CSn = (1/αn) maxj (Unj), where αn is the marginal utility of income.
 For policy analysis:
 If a light rail system is being considered in a city, it is important to measure the benefits of the project to see if they justify the costs.
 Expected consumer surplus:
 It can be shown that expected consumer surplus in a logit model is, up to a constant, simply the log of the denominator (the log-sum term) of the choice probability:
E(CSn) = (1/αn) ln(Σj e^(Vnj)) + C
 This resemblance has no economic meaning; it is just the outcome of the mathematical form of the extreme value distribution.

Derivatives
 How do probabilities change with a change in some observed factor?
 For Vnj linear in the attribute znj with coefficient βz:
∂Pni/∂zni = βz Pni(1 − Pni) and ∂Pni/∂znj = −βz Pni Pnj (j ≠ i)
 How do you interpret these?
Elasticities
 Response is usually measured by elasticity rather than by derivative:
 Elasticity is normalized for the variable's units.
 For Vnj linear in the attribute with coefficient βz, the own elasticity is βz zni (1 − Pni) and the cross elasticity is −βz znj Pnj.

Estimation
 There are various estimation methods under different sampling procedures.
 We discuss estimation under the most prominent of these sampling schemes:
 Estimation when the sample is exogenous,
 Estimation on a subset of alternatives,
 Estimation with certain types of choice-based (i.e., non-exogenous) samples.
Estimation on an Exogenous Sample
 Assume:
 The sample is random or stratified random.
 Explanatory variables are exogenous to the choice situation. That is, the variables entering representative utility are independent of the unobserved component.
 Maximum-likelihood procedures can be applied to logit:
 The probability of person n choosing the alternative that he was observed to choose can be expressed as Πi (Pni)^yni,
 where yni = 1 if person n chose i and zero otherwise.
 Assuming independence over decision makers, the likelihood and log-likelihood are:
L(β) = Πn Πi (Pni)^yni and LL(β) = Σn Σi yni ln Pni
 McFadden shows that LL(β) is globally concave for linear-in-parameters utility.

Estimation on an Exogenous Sample
 The first-order optimality condition for LL(β) becomes Σn Σi (yni − Pni) xni = 0.
 The MLE of β are therefore those that
 make the predicted average of each x equal to the observed average in the sample, and
 make the sample covariance of the residuals with the explanatory variables zero.
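A minimal sketch of logit estimation by maximum likelihood on simulated data; the attribute matrix and true coefficients are made up for the example:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(11)
N, J, K = 500, 3, 2
X = rng.normal(size=(N, J, K))              # attributes x_nj (e.g., time, cost)
beta_true = np.array([-1.0, 0.5])

V = X @ beta_true
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
y = np.array([rng.choice(J, p=p) for p in P])   # observed choices

def neg_ll(beta):
    v = X @ beta
    v = v - v.max(axis=1, keepdims=True)         # numerical stability
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return -logp[np.arange(N), y].sum()          # -LL = -sum_n ln P_n,chosen

res = minimize(neg_ll, np.zeros(K), method="BFGS")
print(res.x)    # close to beta_true; LL is globally concave for this utility
```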
Estimation on a Subset of Alternatives
 Denote:
 The full set of alternatives as F and a subset of alternatives as K.
 q(K | i): the probability that the researcher selects subset K given that alternative i is chosen.
 We have q(K | i) = 0 for any K that does not include i. (why?)
 Pni is the probability that person n chooses alternative i from the full set.
 Pn(i | K) is the probability that the person chooses alternative i conditional on the researcher selecting subset K for him (our goal is to determine this).
 The joint probability that K and i are selected can be expressed in two forms:
 Prob(K, i) = q(K | i) Pni
 Prob(K, i) = Pn(i | K) Q(K), where Q(K) = Σj∈F Pnj q(K | j)
 Q(K) is the probability of selecting K marginal over all the alternatives that the person could choose.
 Equate these two expressions and solve for Pn(i | K).

Estimation on a Subset of Alternatives
 If the selection process is designed so that q(K | j) is the same for all j ∈ K (uniform conditioning), the subset probability reduces to a logit over K:
Pn(i | K) = e^(Vni) / Σj∈K e^(Vnj)
 The conditional log-likelihood under the uniform conditioning property is
CLL(β) = Σn ln [ e^(Vni) / Σj∈Kn e^(Vnj) ]
Estimation on a Subset of Alternatives
 Since information on alternatives not in each subset is excluded, the estimator based on CLL is not efficient.
 If a selection process does not exhibit the uniform conditioning property, q(K | i) should be included in the model:
 ln q(Kn | j) is added to the representative utility of each alternative, and its coefficient is constrained to 1.
 Why would one want a selection procedure that does not satisfy the uniform conditioning property?

Estimation on Choice-Based Samples
 Example of the choice of home location:
 To identify factors that contribute to choosing one particular community, one might draw randomly from within that community at a rate different from all other communities.
 This procedure assures that the researcher has an adequate number of people in the sample from the area of interest.
 Sampling Procedures:
 Purely choice-based: the population is divided into those who choose each alternative, and decision makers are drawn randomly within each group, at different rates.
 Hybrid of choice-based and exogenous: an exogenous sample is supplemented with a sample drawn on the basis of the households' choices.
Estimation on Choice-Based Samples
 Purely choice-based estimation:
 If one is using a purely choice-based sample and includes an alternative-specific constant, then estimating a regular logit model produces consistent estimates for all the model parameters except the alternative-specific constants.
 These constants can be adjusted to be consistent: subtract ln(Sj / Aj) from the estimated constant of each alternative j, where
 Aj: share of decision makers in the population who chose alternative j,
 Sj: share in the choice-based sample who chose alternative j.

Goodness of Fit
 Percent correctly predicted:
 Percentage of sampled decision makers for which the highest-probability alternative and the chosen alternative are the same.
 It is based on the idea that the decision maker is predicted by the researcher to choose the alternative with the highest probability.
 The procedure misses the point of probabilities, gives obviously inaccurate market shares, and seems to imply that the researcher has perfect information.
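 A small sketch of both computations, assuming hypothetical arrays (const_hat: estimated constants; A, S: population and sample shares; P: N×J predicted probabilities; y: chosen indices):

    import numpy as np

    const_consistent = const_hat - np.log(S / A)      # adjust constants for choice-based sampling
    percent_correct = (P.argmax(axis=1) == y).mean()  # highest-probability vs. chosen alternative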

26/31 

27/31 
Hypothesis Testing
 Standard t-statistics:
 To test hypotheses about individual parameters, such as whether a parameter is 0.
 More complex hypotheses can be tested by the likelihood ratio test.
 H0 can be expressed as constraints on the values of the parameters, such as:
 Several parameters are zero,
 Two or more parameters are equal.
 Define the ratio of likelihoods R = L(β̂H) / L(β̂), where:
 L(β̂H) is the constrained maximum value of the likelihood function under H0.
 L(β̂) is the unconstrained maximum of the likelihood function.
 The test statistic defined as −2 log R is distributed chi-squared with degrees of
freedom equal to the number of restrictions implied by the null hypothesis.
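 A small sketch of the test, assuming hypothetical values ll_restricted and ll_unrestricted (maximized log-likelihoods under and without H0) and n_restrictions:

    from scipy.stats import chi2

    lr_stat = -2 * (ll_restricted - ll_unrestricted)   # equals -2 log R
    p_value = chi2.sf(lr_stat, df=n_restrictions)      # reject H0 if p_value is small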

Bay Area Rapid Transit (BART)
 A prominent test of logit capabilities in the mid-1970s:
 McFadden applied logit to commuters’ mode choices to predict BART ridership.
 Four modes were considered to be available for the trip to work:
1. Driving a car by oneself,
2. Taking the bus and walking to the bus stop,
3. Taking the bus and driving to the bus stop,
4. Carpooling.
 Parameters:
 Time and cost of each mode (determined based on home and work location)
 Travel time was differentiated as walk, wait, and on-vehicle time
 Commuters’ characteristics: income, household size, number of cars and drivers in the household, and whether the commuter was head of household.
 A logit model with linear-in-parameters utility was estimated on these data.

28/31 
Bay Area Rapid Transit (BART)

29/31 
Bay Area Rapid Transit (BART)

30/31 
Bay Area Rapid Transit (BART)
 Forecasted and actual shares for each mode:
 BART demand was forecast to be 6.3%, compared with an actual share of 6.2%.

31/31 
Homework 8
Discrete Choice Methods with Simulation (Kenneth E. Train)
Problem set 1 on logit: elsa.berkeley.edu/users/train/ec244ps1.zip
1. Exercise 1 [20 points]
2. Exercise 2 [10 points]
3. Exercise 3 [20 points]
4. Exercise 4 [20 points]
5. Exercise 5 [15 points]
6. Exercise 6 [15 points]
Assignment weight factor = 2
1/25

Basic Econometrics in Transportation
GEV Models
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Discrete Choice Methods with Simulation (Train)

Introduction
 Often one is unable to capture all sources of correlation.
 Thus, unobserved portions of utility are correlated and IIA does not hold.
 The unifying attribute of GEV models:
 Unobserved portions of utilities are jointly distributed as a GEV.
 When all correlations are zero, the GEV distribution becomes the product of independent EV distributions and the GEV model becomes standard logit.
 GEV includes logit but also a variety of other models.
 The GEV model can be used to test whether the correlations are zero.
 It is also a way to test whether IIA is a proper substitution pattern.
 Nested logit is the most widely used GEV model.
2/25


3/25


Nested Logit
 Appropriate when alternatives can be partitioned into nests:
 IIA holds within each nest.
 IIA does not hold in general for alternatives in different nests.
 Example of IIA holding within nests of alternatives:
 Change in probabilities when one alternative is removed.

Choice Probabilities
 Consider K non-overlapping nests: B1, B2, . . . , BK.
 The NL model is obtained by assuming that εn has a type of GEV CDF: exp( −Σk (Σj∈Bk e^(−εnj/λk))^λk )
 For any two alternatives j and m in nest Bk, εnj is correlated with εnm.
 For any two alternatives in different nests, Cov(εnj, εnm) = 0.
 A higher value of λk means greater independence in unobserved utility among the alternatives in nest k.
 When λk = 1 for all k, the nested logit model reduces to the standard logit model.
 Choice probability for i ∈ Bk: Pni = e^(Vni/λk) (Σj∈Bk e^(Vnj/λk))^(λk−1) / Σl=1..K (Σj∈Bl e^(Vnj/λl))^λl
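 A minimal sketch of this formula, assuming hypothetical inputs (V: array of systematic utilities; nests: list of index arrays; lam: nest parameters):

    import numpy as np

    def nested_logit_probs(V, nests, lam):
        P = np.empty_like(V, dtype=float)
        sums = [np.exp(V[B] / lam[k]).sum() for k, B in enumerate(nests)]
        denom = sum(s ** lam[k] for k, s in enumerate(sums))
        for k, B in enumerate(nests):
            P[B] = np.exp(V[B] / lam[k]) * sums[k] ** (lam[k] - 1) / denom
        return P

    # Example: carpool and auto alone in one nest, bus and rail in another (made-up values):
    # nested_logit_probs(np.array([1.0, 0.5, 0.2, 0.1]), [np.array([0, 1]), np.array([2, 3])], [0.5, 0.7])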
4/25


5/25


Choice Probabilities
 The parameter λk can differ over nests:
 One can constrain the λk’s to be the same for all (or some) nests.
 Hypothesis testing can be used to determine whether the λk’s are reasonable.
 Testing λk = 1 ∀k is equivalent to testing whether the standard logit model is a reasonable specification against the more general nested logit.
 These tests are performed most readily with the likelihood ratio statistic.
 Consistency with utility-maximizing behavior:
 For 0 ≤ λk ≤ 1, the model is consistent for all possible values of the X’s.
 For λk > 1, the model is consistent for some range of the X’s.
 For λk < 0, the model is not consistent with utility-maximizing behavior.
 λk might differ over decision makers based on their characteristics,
 Such as a function of demographics: λk = exp(αzn).

IIA in Nested Logit
 IIA holds within each subset of alternatives but not across subsets.
 Consider alternatives i ∈ Bk and m ∈ Bl; the ratio of their probabilities:
 If k = l, IIA holds.
 For k ≠ l, the ratio depends on attributes of all alternatives in nests l and k.
 This form of IIA can be loosely described as “independence from irrelevant nests,” or IIN.
6/25


7/25


Decomposition into Two Logits
 The observed component of utility can be decomposed into:
 A part (W) that is constant for all alternatives within a nest,
 A part (Y) that varies over alternatives within a nest.
 Utility can be written as: Unj = Wnk + Ynj + εnj
 Nested logit probability can be written as: Pni = Pni|Bk × PnBk
 With some algebraic rearrangement of the choice probability:
 PnBk = e^(Wnk + λk Ink) / Σl e^(Wnl + λl Inl) and Pni|Bk = e^(Yni/λk) / Σj∈Bk e^(Ynj/λk), where Ink = ln Σj∈Bk e^(Ynj/λk)

Decomposition into Two Logits
 λk Ink is the expected utility received from alternatives in nest Bk.
 Ink is often called the inclusive utility, or the “log-sum term”.
 Probability of choosing nest Bk depends on the expected utility that the person receives from that nest.
 Recall that coefficients in the lower model are divided by λk.
 STATA estimates nested logit models without dividing by λk in the lower model.
 The package NLOGIT allows either specification.
 If not divided by λk, the choice probabilities are not the equations just given.
 The not-dividing approach is not consistent with utility maximization in some cases.
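 A small sketch of this decomposition, assuming hypothetical inputs (W: nest-constant utility parts, length K; Y_by_nest: list of within-nest parts; lam: nest parameters):

    import numpy as np

    def decompose(W, Y_by_nest, lam):
        lam = np.asarray(lam)
        I = np.array([np.log(np.exp(Y / l).sum()) for Y, l in zip(Y_by_nest, lam)])  # inclusive values
        upper = np.exp(W + lam * I)
        P_nest = upper / upper.sum()                                        # marginal (upper) logit
        P_within = [np.exp(Y / l) / np.exp(Y / l).sum() for Y, l in zip(Y_by_nest, lam)]  # lower logits
        return P_nest, P_within   # Pni = P_within[k][i] * P_nest[k]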
8/25


9/25


Estimation
 The parameters of a nested model can be estimated by:
 Standard maximum likelihood technique:
 Just substitute the choice probabilities into the LL function.
 The values of the parameters that maximize this function are consistent and efficient.
 Numerical maximization could be difficult, as the LL function is not globally concave.
 Sequential technique:
 Estimate the lower models; calculate the inclusive utility for each lower model; estimate the upper model.
 Standard errors of the upper-model parameters are biased downward.
 If some parameters appear in several submodels, separate estimations provide different estimates.

Three-Level Nested Logit
 Three-level nested logit models could be more appropriate.

10/25 
 Housing unit choice in San Francisco.
 There are unobserved factors that are
common to all units in the same
neighborhood, such as proximity to
shopping and entertainment.
 There are also unobserved factors that
are common to all units with the same
number of bedrooms, such as the
convenience of working at home.
 The top model has an inclusive value for each nest, computed from the subnests within the nest.
 The middle models have an inclusive value for each subnest, computed from the alternatives in the subnest.

11/25 
IIA in Three-Level NL
 This nesting structure embodies IIA in the following ways:
 Ratio of probabilities of two housing units in the same neighborhood and with the same number of bedrooms is independent of the characteristics of all other units.
 Ratio of probabilities of two housing units in the same neighborhood but with different numbers of bedrooms is independent of the characteristics of units in other neighborhoods.
 Ratio of probabilities of two housing units in different neighborhoods depends on the characteristics of all the other housing units in those neighborhoods but not on the characteristics of units in other neighborhoods.
 Degree of correlation among alternatives within the nests:
 σmk is the coefficient of the inclusive value in the middle model,
 λk σmk is the coefficient of the inclusive value in the top model.
 Each parameter must be in a certain range to be consistent with utility maximization.

Overlapping Nests
 Consider our example of mode choice:
 Carpool and auto alone are in a nest, since they have some similar unobserved attributes.
 Carpooling also has some unobserved attributes that are similar to bus and rail:
 A lack of flexibility in scheduling.
 We may want unobserved utility for the carpool alternative to be correlated with that of auto alone, bus, and rail, to different degrees.
 It would be useful for the carpool alternative to be in two nests.
 Cross-nested logits (CNLs) contain multiple overlapping nests.

12/25 

13/25 
Overlapping Nests
 Ordered Generalized Extreme Value (OGEV):
 The alternatives have a natural order (number of cars in a household).
 Correlation in unobserved utility between any two alternatives depends on their closeness in the ordering.
 Nested OGEV:
 Lower models are OGEV rather than standard logit.
 Paired Combinatorial Logit (PCL):
 Each pair of alternatives forms a nest with its own correlation.
 With J alternatives, each alternative is a member of J − 1 nests, and the correlation of its unobserved utility with each other alternative is estimated.
 Generalized Nested Logit (GNL):
 Includes the PCL and other cross-nested models as special cases.

14/25 
Paired Combinatorial Logit
 Each pair of alternatives is considered to be a nest.
 Each alternative is a member of J − 1 nests.
 λij: degree of independence between alternatives i and j.
 PCL becomes a standard logit when λij = 1 for all pairs of alternatives.
 The choice probability: Pni = Σj≠i e^(Vni/λij) (e^(Vni/λij) + e^(Vnj/λij))^(λij−1) / Σk=1..J−1 Σl=k+1..J (e^(Vnk/λkl) + e^(Vnl/λkl))^λkl
 The term being added in the numerator is the same as the numerator of the nested logit probability.
 The denominator is also the sum over all nests of the sum of the exp(V/λ)’s within the nest, raised to the appropriate power λ.
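 A minimal sketch of the PCL probability, assuming V (length-J utilities) and lam (J×J symmetric matrix of pair parameters) as hypothetical inputs:

    import numpy as np

    def pcl_probs(V, lam):
        J = len(V)
        num, denom = np.zeros(J), 0.0
        for k in range(J):
            for l in range(k + 1, J):
                a, b = np.exp(V[k] / lam[k, l]), np.exp(V[l] / lam[k, l])
                denom += (a + b) ** lam[k, l]                 # nest term in the denominator
                num[k] += a * (a + b) ** (lam[k, l] - 1)      # pair (k, l) contribution to k
                num[l] += b * (a + b) ** (lam[k, l] - 1)      # and to l
        return num / denom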

15/25 
Paired Combinatorial Logit
 Since the scale and level of utility are immaterial,
 At most J(J − 1)/2 − 1 covariance parameters can be estimated in a choice model.
 A PCL model contains J(J − 1)/2 λ’s, thus:
 The researcher can normalize one of the λ’s to 1.
 Or, impose a specific pattern of correlation:
 Set λij = 1 for some or all of the pairs: no significant correlation in the unobserved utility for that pair.
 Set λij = λkl: the correlation is the same among a group of alternatives.
 λij could be specified as a function of the proximity between i and j (OGEV).

Generalized Nested Logit
 Each alternative can be a member of more than one nest:
 Allocation parameter αjk: the extent to which alternative j is a member of nest k.
 αjk reflects the portion of the alternative that is allocated to each nest.
 λk: indicates the degree of independence among alternatives within the nest.
 The choice probability: Pni = Σk (αik e^(Vni))^(1/λk) (Σj∈Bk (αjk e^(Vnj))^(1/λk))^(λk−1) / Σl (Σj∈Bl (αjl e^(Vnj))^(1/λl))^λl
 The model becomes a nested logit model if each alternative enters only one nest, with αjk = 1 for j ∈ Bk and 0 otherwise.
 The model becomes standard logit if, in addition, λk = 1 for all nests.

16/25 

17/25 
Generalized Nested Logit
 GNL probability can be decomposed as: Pni = Σk Pni|Bk PnBk
 The probability of nest k is: PnBk = (Σj∈Bk (αjk e^(Vnj))^(1/λk))^λk / Σl (Σj∈Bl (αjl e^(Vnj))^(1/λl))^λl
 The probability of alternative i given nest k is: Pni|Bk = (αik e^(Vni))^(1/λk) / Σj∈Bk (αjk e^(Vnj))^(1/λk)

Heteroskedastic Extreme Value
 Instead of capturing correlations among alternatives, the researcher may want to allow the variance of unobserved factors to differ over alternatives.
 εnj is distributed independently EV with variance (θj π)²/6.
 No correlation, but a different variance for each alternative.
 To set the overall scale of utility, the variance for one alternative is set to π²/6.
 The choice probability: Pni = ∫ [∏j≠i exp(−exp(−(Vni − Vnj + θi w)/θj))] e^(−w) exp(−e^(−w)) dw
 where w = εni/θi.

18/25 

19/25 
Heteroskedastic Extreme Value
 The integral can be approximated by simulation:
1. Take a draw from the extreme value distribution.
2. For this draw of w, calculate the factor in brackets.
3. Repeat steps 1 and 2 many times, and average the results to approximate Pni.
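 A minimal sketch of this simulation, assuming V (utilities) and theta (scale parameters, one normalized) are hypothetical NumPy arrays:

    import numpy as np

    rng = np.random.default_rng(0)

    def hev_prob(i, V, theta, R=10_000):
        w = -np.log(-np.log(rng.random(R)))   # R draws from the standard extreme value distribution
        others = [j for j in range(len(V)) if j != i]
        total = 0.0
        for wr in w:
            z = (V[i] - V[others] + theta[i] * wr) / theta[others]
            total += np.exp(-np.exp(-z)).prod()   # the factor in brackets for this draw
        return total / R                          # average approximates Pni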
The GEV Family
 Let:
 Omit the subscript n denoting the decision maker,
 Yj ≡ exp(Vj),
 G = G(Y1, ... , YJ),
 Gi = ∂G/∂Yi.
 If G meets certain conditions, then Pi = Yi Gi / G is the choice probability for a discrete choice model, consistent with utility maximization.
 Any model that can be derived in this way is a GEV model.
 We show how logit, nested logit, and paired combinatorial logit can be derived.

20/25 

21/25 
McFadden Conditions
1. G ≥ 0 for all positive values of Yj.
2. G is homogeneous of degree 1: G(ρY1, ... , ρYJ) = ρ G(Y1, ... , YJ).
 Ben-Akiva and Francois showed this condition can be relaxed to homogeneity of any degree.
3. G → ∞ as Yj → ∞ for any j.
4. Cross partial derivatives of G change signs in a particular way:
 Gi ≥ 0 for all i; Gij = ∂Gi/∂Yj ≤ 0 for all j ≠ i; Gijk = ∂Gij/∂Yk ≥ 0 for any distinct i, j, and k; and so on for higher-order cross-partials.
 Lack of intuition behind the properties is a blessing and a curse:
 Disadvantage: the researcher has little guidance on how to specify a G.
 Advantage: the mathematical approach allows the researcher to generate models that he might not have developed while relying only on his economic intuition.

Logit in GEV Family
 Let: G = Σj Yj
 Conditions:
1. The sum of positive Yj’s is positive.
2. If all Yj’s are raised by a factor ρ, G rises by that same factor.
3. If any Yj rises without bound, then G does also.
4. Gi = ∂G/∂Yi = 1, and thus Gi ≥ 0; higher-order cross-derivatives are all zero.
 Then Pi = Yi Gi / G = e^(Vi) / Σj e^(Vj), which is the standard logit.

22/25 

23/25 
Nested Logit in GEV Family
 Let: G = Σk (Σj∈Bk Yj^(1/λk))^λk
 The J alternatives are partitioned into K nests labeled B1, ... , BK.
 λk between zero and one.
 Verify the conditions.

Paired Combinatorial Logit in GEV Family
 Let: G = Σk=1..J−1 Σl=k+1..J (Yk^(1/λkl) + Yl^(1/λkl))^λkl
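 As a quick numeric sanity check of Pi = Yi Gi / G for the nested logit G above, one can differentiate G numerically and compare with the nested logit probabilities given earlier; all values below are made up for illustration:

    import numpy as np

    V = np.array([1.0, 0.5, 0.2, 0.1])
    nests, lam = [np.array([0, 1]), np.array([2, 3])], [0.5, 0.7]
    Y = np.exp(V)

    def G(Y):
        return sum((Y[B] ** (1 / l)).sum() ** l for B, l in zip(nests, lam))

    eps = 1e-7
    Gi = np.array([(G(Y + eps * np.eye(4)[k]) - G(Y)) / eps for k in range(4)])  # numerical dG/dYi
    P = Y * Gi / G(Y)   # should match nested_logit_probs(V, nests, lam) up to finite-difference error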

24/25 
GNL in GEV Family

25/25 
Homework 9
Discrete Choice Methods with Simulation (Kenneth E. Train)
Problem set 3 on nested logit: elsa.berkeley.edu/users/train/ec244ps3.zip
1. Exercise 1 [30 points]
2. Exercise 2 [30 points]
3. Exercise 3 [40 points]
Assignment weight factor = 1.5
1/29

Basic Econometrics in Transportation
Probit Models
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Discrete Choice Methods with Simulation (Train)

Introduction
 Logit is limited in three important ways:
 It cannot represent random taste variation.
 It exhibits restrictive substitution patterns due to the IIA property.
 It cannot be used with panel data when unobserved factors are correlated over time for each decision maker.
 Probit models deal with all three.
 They only require normal distributions for all unobserved components of utility.
 A normally distributed price coefficient, under random taste variation, is problematic in probit, since it gives some decision makers a positive coefficient on price.
2/29


3/29


Choice Probabilities
 Utility is decomposed into: Unj = Vnj + εnj ∀j
 We assume:
 εn is distributed normal with a mean vector of zero and covariance matrix Ωn.
 The choice probability is: Pni = Prob(Vni + εni > Vnj + εnj ∀j ≠ i) = ∫ I(Vni + εni > Vnj + εnj ∀j ≠ i) φ(εn) dεn
 I(·) is an indicator of whether the statement in parentheses holds.

Differences in Utility Matter
 Let us difference against alternative i:
 Consider Ũnji = Unj − Uni, Ṽnji = Vnj − Vni, and ε̃nji = εnj − εni.
 Then, Pni = Prob(Ṽnji + ε̃nji < 0 ∀j ≠ i).
 The density of the error differences is (J − 1)-dimensional normal with zero mean and the covariance matrix of the error differences.
4/29
5/29




Covariance Matrix
 How to derive Ω̃i, the covariance of the error differences, from Ωn?
 Let Mi be the (J − 1) identity matrix with an extra column of −1’s added as the ith column.
 The covariance matrix of error differences: Ω̃i = Mi Ωn Mi′
 This transformation comes in handy when simulating probit probabilities.

Scale of Utility is Irrelevant
 For a four-alternative model:
 The covariance matrix for the vector of error differences takes the form:
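 A small sketch of this transformation, assuming Omega is the J×J error covariance and i the alternative differenced against:

    import numpy as np

    def diff_cov(Omega, i):
        J = Omega.shape[0]
        M = np.delete(np.eye(J), i, axis=0)   # drop row i of the identity
        M[:, i] = -1.0                        # column i of -1's
        return M @ Omega @ M.T                # (J-1) x (J-1) covariance of error differences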
6/29


7/29


Scale of Utility is Irrelevant
 Since there are five θ∗’s and ten σ’s, it is not possible to solve for all the σ’s from estimated values of the θ∗’s.
 This reduction in the number of parameters is not a restriction.
 Scale and level elements have no relevance for behavior.
 Only the five parameters have economic content, and only the five parameters can be estimated.

Covariance Structure
 Instead of allowing a full covariance matrix, one can impose structure on the covariance matrix.
 Often the structure that is imposed will be sufficient to normalize the model.
 A few examples are reviewed:
 Route choice:
 Covariance between any two routes depends only on the length of shared route segments.
 This structure reduces the number of covariance parameters to only one.
 Physicians’ choice of location:
 Covariance is a function of their proximity to one another.
8/29


9/29


Covariance Structure
 Assume the covariance matrix has the following form:
 Is this model, as specified, normalized for scale and level?

Covariance Structure
 Assume the covariance matrix has the following form:
 Is this model, as specified, normalized for scale and level?
 The original model is NOT normalized for scale and level.
 All that matters for behavior is the average of these parameters, not their values relative to each other: θ = 1 + (ρ1 + ρ2)/2

10/29 

11/29 
Taste Variation
 The utility is Unj = β′n xnj + εnj, and suppose βn ∼ N(b, W).
 The coefficients vary randomly over decision makers instead of being fixed.
 The goal is to estimate the parameters b and W.
 Rewrite utility as: Unj = b′xnj + ηnj
 Where ηnj = (βn − b)′xnj + εnj
 The covariance of the ηnj’s depends on W as well as the xnj’s, so that the covariance differs over decision makers.

Taste Variation
 The covariance of the ηnj’s can be described easily for a two-alternative model with one explanatory variable.
 In this case, the utility is: Un1 = βn xn1 + εn1 and Un2 = βn xn2 + εn2
 Assuming βn ~ N(b, σβ): Un1 = b xn1 + ηn1 and Un2 = b xn2 + ηn2
 Assume that εn1 and εn2 are IID N(0, σε).
 The assumption of independence is not needed in general.
 ηn1 and ηn2 are jointly normally distributed with zero mean and covariance of:

12/29 

13/29 
Taste Variation
 The covariance matrix is:
 We also need to set the scale of utility.
 A convenient normalization for this case is σε = 1.
 b and σβ are estimated.
 The researcher learns the mean and variance of the random coefficient in the population.

Substitution Patterns
 Probit can represent any substitution pattern.
 Different covariance matrices provide different substitution patterns.
 One determines the substitution pattern that is most appropriate for the data.
 We consider:
 A full covariance matrix,
 A restricted covariance matrix.

14/29 

15/29 
Substitution Patterns: Full Covariance
 A probit model with four alternatives was discussed.
 A full covariance matrix, when normalized for scale and level, was introduced.
 The estimated values can represent any substitution pattern.
 The normalization does not restrict the substitution patterns.
 The normalization only eliminates aspects that are irrelevant to behavior.
 Note:
 Estimated values of the θ∗’s provide essentially no interpretable information.
 If θ23 is estimated to be negative, this does not mean that unobserved utility for the second alternative is negatively correlated with unobserved utility for the third alternative (that is, σ23 < 0).

Substitution Patterns: Structured Covariance
 Imposing structure usually gets more interpretable parameters.
 Structure is necessarily situation-dependent.
 Consider a choice among purchase-money mortgages: four mortgages are available to the homebuyer:
 One with a fixed rate, and three with variable rates.
 Suppose εnj consists of two parts:
 Concern about the risk of rising interest rates (rn);
 All other unobserved factors (ηnj).
 εnj = −rn dj + ηnj (dj = 0 for the fixed-rate loan, 1 otherwise)
 Assume rn is normally distributed with variance σ, and ηnj is IID N(0, ω).

16/29 

17/29 
Substitution Patterns: Structured Covariance
 Then the covariance matrix is:
 The normalization for scale can be accomplished by setting ω = 1 in the original Ω. Under this procedure, the parameter σ is estimated directly.

18/29 
Panel Data
 Probit with repeated choices is similar to probit on one choice.
 The only difference is the dimension of the covariance matrix of the errors.
 The utility that n obtains from j in t is Unjt = Vnjt + εnjt.
 Consider a sequence of alternatives, one for each time period, i = {i1, ... , iT}.
 The probability that the decision maker makes this sequence of choices is:
 where φ(εn) is the joint normal density with zero mean and covariance Ω.
 Again, it is often more convenient to work in utility differences.
 This results in a (J − 1)T–dimensional integral.

19/29 
Example: Covariance of Errors
 Suppose in a binary case the error consists of:
 A portion that is specific to the decision maker (ηn):
 ηn is iid over people, with a normal density with zero mean and variance σ.
 A part that varies over time for each decision maker (μnt):
 μnt is iid over time and people, with a standard normal density.
 Thus,
 V(εnt) = V(ηn + μnt) = σ + 1
 Cov(εnt, εns) = E(ηn + μnt)(ηn + μns) = σ

Example: Choice Probabilities
 The choice probabilities under this structure:
 Probability of not taking the action, conditional on ηn:
 Prob(Vnt + ηn + μnt < 0) = Prob(μnt < −(Vnt + ηn)) = Φ(−(Vnt + ηn))
 where Φ(·) is the cumulative standard normal function.
 Probability of taking the action, conditional on ηn:
 1 − Φ(−(Vnt + ηn)) = Φ(Vnt + ηn).
 The probability of the sequence of choices dn, conditional on ηn, is therefore:
 Hn(ηn) = ∏t Φ((Vnt + ηn) dnt)
 dnt = 1 if person n took the action in period t, and dnt = −1 otherwise.
 The unconditional probability is:
 Pn = ∫ Hn(ηn) φ(ηn) dηn, where φ is the density of ηn (normal with zero mean and variance σ).
 This probability can be simulated very simply.
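 A minimal sketch of this simulation, assuming hypothetical inputs (V: length-T systematic utilities; d: array of +1/−1 choices; sigma: variance of the person effect):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def panel_prob(V, d, sigma, R=5_000):
        eta = rng.normal(scale=np.sqrt(sigma), size=R)   # draws of the person effect
        H = norm.cdf((V + eta[:, None]) * d)             # Phi((V_t + eta) d_t), shape (R, T)
        return H.prod(axis=1).mean()                     # average of H(eta) over the draws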

20/29 

21/29 
Simulation of the Choice Probabilities
 Probit probabilities do not have a closed-form expression and must be approximated numerically.
 Non-simulation procedures:
 Quadrature methods approximate the integral by a weighted function of specially chosen evaluation points.
 Simulation procedures:
 Accept–reject,
 Smoothed accept–reject,
 GHK (the most widely used probit simulator).

Accept–Reject Simulator
 Probit probabilities:

22/29 
 The AR simulator of this integral is calculated as follows:
1. Draw a value of εn from a normal density with zero mean and covariance Ω. Label the draw εrn with r = 1, and the elements of the draw as εrn1, ... , εrnJ.
2. Calculate the utility (Urnj = Vnj + εrnj) that each alternative obtains with εrn.
3. Determine whether Urni is greater than that for all other alternatives: Ir = 1 if Urni > Urnj ∀j ≠ i, indicating an accept, and Ir = 0 otherwise.
4. Repeat steps 1–3 many times. Label the number of repetitions as R.
5. The simulated probability is the proportion of draws that are accepts: P̂ni = (1/R) Σr Ir
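 A minimal sketch of the AR simulator, assuming hypothetical inputs V (utilities) and Omega (error covariance):

    import numpy as np

    rng = np.random.default_rng(0)

    def ar_prob(i, V, Omega, R=10_000):
        eps = rng.multivariate_normal(np.zeros(len(V)), Omega, size=R)  # step 1
        U = V + eps                                                     # step 2
        accepts = U.argmax(axis=1) == i                                 # step 3
        return accepts.mean()                                           # steps 4-5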

23/29 
How to Take a Draw
 The most straightforward procedure uses the Choleski factor.
 A Choleski factor of Ω is a lower-triangular matrix L such that LL′ = Ω.
 η ∼ N(0, I) is a vector of J IID standard normal draws, where I is the identity matrix.
 Construct ε = Lη by using the Choleski factor to transform η into ε.
 The first step of the AR simulator becomes two sub-steps:
a. Draw J values from a standard normal. Stack these values into a vector ηr.
b. Calculate εrn = Lηr.
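 A small sketch of the Choleski-factor draw; Omega below is a made-up example covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    Omega = np.array([[1.0, 0.5], [0.5, 2.0]])   # example covariance (illustrative values)
    L = np.linalg.cholesky(Omega)                # lower triangular with L @ L.T == Omega
    eta = rng.standard_normal(len(Omega))        # step a: IID standard normal draws
    eps = L @ eta                                # step b: eps is a draw from N(0, Omega)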
Smoothed AR Simulators
 Replace the 0–1 AR indicator with a smooth, strictly positive function.
 Any function can be used for simulating Pni as long as it:
 rises when Urni rises,
 declines when Urnj rises,
 is strictly positive,
 has defined first and second derivatives with respect to Urnj ∀j.
 A function that is particularly convenient is the logit function.
 Use of this function gives the logit-smoothed AR simulator.

24/29 

25/29 
Smoothed AR Simulators
 The simulator is implemented in the following steps:
1. Draw a value of εn from a normal density with zero mean and covariance Ω. Label the draw εrn with r = 1, and the elements of the draw as εrn1, ... , εrnJ.
2. Calculate the utility (Urnj = Vnj + εrnj) that each alternative obtains with εrn.
3. Put these utilities into the logit formula. That is, calculate Sr = e^(Urni/λ) / Σj e^(Urnj/λ), in which λ > 0 is a scale factor specified by the researcher.
4. Repeat steps 1–3 many times. Label the number of repetitions as R.
5. The simulated probability is the average of the values of the logit formula: P̂ni = (1/R) Σr Sr
 It is the same as the AR simulator except for step 3.

Smoothed AR Simulators
 The scale factor λ determines the degree of smoothing.
 As λ → 0, Sr approaches the indicator function Ir.
 The researcher wants to set λ low enough to obtain a good approximation but not so low as to reintroduce numerical difficulties.
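 A minimal sketch of the logit-smoothed AR simulator, with the same assumed inputs as the AR sketch plus a smoothing parameter lam:

    import numpy as np

    rng = np.random.default_rng(0)

    def smoothed_ar_prob(i, V, Omega, lam=0.01, R=10_000):
        U = V + rng.multivariate_normal(np.zeros(len(V)), Omega, size=R)  # steps 1-2
        U = (U - U.max(axis=1, keepdims=True)) / lam                      # stabilize the exponentials
        S = np.exp(U) / np.exp(U).sum(axis=1, keepdims=True)              # step 3: logit formula
        return S[:, i].mean()                                             # steps 4-5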

26/29 

27/29 
GHK Simulator
 The most widely used probit simulator is called GHK.
 A three-alternative case is discussed here.

GHK Simulator
 The second factor is an integral, and we use simulation to approximate integrals:
 Draw a value of η1 from a standard normal density truncated above at:
 For this draw, calculate the factor.
 Repeat this process for many draws, and average the results.
 This average is a simulated approximation to the integral.
 How do we take a draw from a truncated normal?
 Take a draw from a standard uniform, labeled μ.
 Then calculate ε = Φ⁻¹(μ Φ(c)), which is a draw from a standard normal truncated above at c.
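 A small sketch of this inverse-CDF draw, where c denotes the truncation point:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def draw_truncated_above(c, size=1):
        mu = rng.random(size)               # standard uniform draws
        return norm.ppf(mu * norm.cdf(c))   # draws from N(0, 1) truncated above at c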

28/29 
GHK Simulator
 This probability is simulated as follows:

29/29 
Homework 10
Discrete Choice Methods with Simulation (Kenneth E. Train)
Problem set 5 on probit: elsa.berkeley.edu/users/train/ec244ps5.zip
1. Exercise 1 [40 points]
2. Exercise 2 [30 points]
3. Exercise 3 [30 points]
Assignment weight factor = 1
1/28


Basic Econometrics in Transportation
Introduction
Amir Samimi
Civil Engineering Department
Sharif University of Technology
Primary Source: Basic Econometrics (Gujarati)

General Information
 Course
 Econometrics (20-563)
 Spring 2016, Saturday and Monday 10:30 – 12:00
 Instructor
 Amir Samimi
 Office Hours: With Appointment
 Email: asamimi@sharif.edu
 TA
 Ali Shamshiripour
 Email: shamshiripour@gmail.com
 Prerequisites
 Undergraduate level of probability and statistics
2/28


3/28


What is Econometrics?
 Gujarati
 Econometrics is mainly interested in the empirical verification of economic theory.
 Econometrics is based upon the development of statistical methods for estimating relationships, testing theories, and evaluating and implementing policy.
 Wooldridge
 Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy.

Methodology
1. Statement of theory or hypothesis
2. Specification of the mathematical model of the theory
3. Specification of the statistical, or econometric, model
4. Obtaining the data
5. Estimation of the parameters of the econometric model
6. Hypothesis testing
7. Forecasting or prediction
8. Using the model for control or policy purposes
4/28


5/28
Example
Keynes stated: The fundamental psychological law . . . is that men
[women] are disposed, as a rule and on average, to increase their
consumption as their income increases, but not as much as the
increase in their income.

6/28 
Example (cont.)
A mathematical economist might suggest the following function:
Y = β1 + β2X,  0 < β2 < 1
Y = consumption expenditure (dependent variable)
X = income (independent variable)
β1 and β2 are the parameters of the model.
In short, the marginal propensity to consume (MPC) is greater than zero but less than 1.

7/28 
Example (cont.)
 The purely mathematical model is of limited interest to the econometrician, for it assumes that there is an exact or deterministic relationship between consumption and income.
 But relationships are generally inexact, because, in addition to income, other variables (size of family, age, religion) affect consumption expenditure.
 To allow for inexact relationships, the econometrician adds a disturbance, or error term, to the consumption function:
Y = β1 + β2X + u
 The error term is a random variable that has well-defined probabilistic properties.
 This is an econometric model.

8/28 


9/28


Example (cont.)
 To estimate the econometric model (to obtain numerical values of β1 and β2), we need data.
 Y: Personal consumption expenditure,
 X: Gross domestic product,
 Both in 1992 billions of dollars.

Example (cont.)
 The numerical estimates of the parameters give empirical content to the consumption function.
 The estimated consumption function is:
 The hat on the Y indicates that it is an estimate.
 The slope coefficient was about 0.70, suggesting that for the sample period an increase in real income of 1 dollar led, on average, to an increase of about 70 cents in real consumption expenditure.

10/28 

11/28 
Example (cont.)
 To develop suitable criteria to find out whether the estimates are in accord with the expectations of the theory that is being tested.
 Keynes expected the MPC to be positive but less than 1.
 We found the MPC to be about 0.70.
 Is this estimate sufficiently below unity to convince us that this is not a chance occurrence or a peculiarity of the particular data we have used?
 Is 0.70 statistically less than 1?

Example (cont.)
 If the chosen model does not refute our hypothesis, we may use it to forecast Y on the basis of the expected future value of X.
 Predict the mean consumption expenditure for 1997, given a GDP value of 7269.8 billion dollars:
 The actual value of consumption expenditure in 1997 was 4913.5 billion dollars.
 The estimated model overpredicted Y by about 37.82 billion dollars; the forecast error is about 37.82 billion dollars.
 Such errors are inevitable given the statistical nature of our analysis.

12/28 
Example (cont.)
 How to manipulate the control variable to produce the desired level of the target variable.
 Suppose the government believes that consumer expenditure of about 4900 (billions of 1992 dollars) will keep the unemployment rate at its current level of about 4.2 percent (early 2000).
 What level of income will guarantee the target amount of consumption expenditure?
 The model suggests that an income level of about 7197 (billion) dollars, given an MPC of about 0.70, will produce an expenditure of about 4900 billion dollars.

13/28 
Applications
 Political Science:
 How do political campaign expenditures affect voting outcomes?

14/28 
 Education:
 How does school spending affect student performance?
 Transportation:
 How does fuel taxation affect public transportation ridership?
 Social Science:
 How does level of education affect personal income?

15/28 
Data
 The modeler is often faced with observational as opposed to experimental data.
 Cross-Sectional Data
 Data on one or more variables collected at the same point in time.
 Time Series Data
 Data on the values that a variable takes at different times.
 Pooled Cross Sections
 A combination of both time series and cross-section data.
 Panel or Longitudinal Data
 A special type of pooled data in which the same cross-sectional unit is surveyed over time.

Data Accuracy
 The result of research is only as good as the quality of the data.
 Observational errors
 Social science data are non-experimental in nature.
 Omission or commission
 Errors of measurement
 Approximations and round-offs
 Problem of non-response
 Financially sensitive questions
 Selectivity bias in sample
 Sampling method
 How to compare the results with various samples?

16/28 

17/28 
Data Accuracy
 Aggregate level
 Data are generally available at a highly aggregate level.
 Confidentiality
o The IRS is not allowed to disclose individual tax data.
o The Department of Commerce, which conducts the census of business every 5 years, is not allowed to disclose information at the firm level.
The researcher should always clearly state the data sources used in the analysis, their definitions, their collection methods, and any gaps in the data as well as any revisions in the data.

18/28 
Course Objectives
 To provide an elementary but comprehensive introduction to econometrics without resorting to statistics beyond the elementary level.
 To estimate and interpret simple econometric models.
 To get familiar with the most common modeling issues.
19/28 
Course Material
 Required Text Books
 Casella, G., and R.L. Berger (2001) Statistical Inference, 2nd Edition, Duxbury Press.
 Gujarati, D.N. (2004) Basic Econometrics, 4th Edition, McGraw Hill.
 Train, K. (2009) Discrete Choice Methods with Simulation, 2nd Edition, Cambridge University Press.
 Additional Readings
 Wooldridge, J.M. (2008) Introductory Econometrics: A Modern Approach, 4th Edition, South-Western College Pub.
 Hensher, D.A., J.M. Rose, and W.H. Greene (2005) Applied Choice Analysis: A Primer, Cambridge University Press.

Class Policy
1. Syllabus Modifications
A. The instructor reserves the right to change the course syllabus, and any changes will be announced in class.
B. Students are responsible for obtaining this information.
2. Class Participation
A. Read the course material before class and be active in discussions.
3. Late Submission of Assignments
A. Any sort of assignment turned in late will incur a 5% penalty per day.

20/28 

21/28 
Grading Policy
 Attendance: 10%
 Homework Assignments: 25%
 Quiz: 15%
 Final Exam: 50%
 Term Paper (Extra): 10%

22/28 
Attendance
 Late arrival results in a 50% penalty.
 More than 30 minutes of delay results in an absence penalty.
 Catch up with the class when circumstances arise.
 Do not discuss your excuses.

23/28 
Homework Assignments
 No collaboration.
 Submission
 Type and email the assignments ONLY to the TA.
 All the homework assignments are due in a week.
 Format
 Have a cover letter.
 Each question starts from the top of the page.
 Get familiar with technical writing.
 Reference your work appropriately; if not, you may be penalized for plagiarism.
 Only PDF files are acceptable.
 Grading
 Effort: 50%, Correctness: 30%, Formatting: 20%

Quiz
 Four quizzes tentatively scheduled for:
1. Session 10
2. Session 14
3. Session 19
4. Session 28
 Starts sharp at the beginning of the session.

24/28 

25/28 
Final Exam
 Exam Day
A. Closed book.
B. A double-sided, A4 note is allowed.
C. Check for any conflict and request a reschedule by Week 6.
D. You are not allowed to leave the session during the exam.
E. If you have special needs during the final exam, inform me at least one month in advance.
 Exam Material
A. Includes discussion and problems.
B. Mainly from the primary text books and lectures.

Term Paper
 Format
 In accordance with TRB guidelines: www.trb.org/Guidelines/Authors.pdf
 Qualification
 Papers are evaluated by TRB standards.
 Contribution
 No previous works by the student.
 Plagiarism shall result in a deduction of 20 on a scale of 100 from the final grade.
 Collaboration is possible if announced by Week 6. Extra points will be distributed based on the level of contribution.
 Research must be done under the instructor’s supervision.
 Due seven days after the Final Week.

26/28 
Academic Conduct Policy
 Students are expected to be aware of and to adhere to the university standards of academic honesty.
 Violations include, but are not limited to, submitting work which is not your own for credit.
 Copying or sharing of work (including computer files) is prohibited, except as explicitly authorized by the instructors.
 Any confirmed plagiarism shall result in the student receiving a zero for that work.

27/28 
Tentative Calendar
Session  Topic
1        Introduction
2        Statistical Concepts
3        Statistical Concepts
4        Bivariate Regression Analysis
5        Bivariate Regression Analysis
6        Bivariate Regression Discussion
7        Multiple Regression Analysis
8        Multiple Regression Analysis
9        Multiple Regression Analysis
10       Multiple Regression Analysis
11       Dummy Variables
12       Multicollinearity
13       Multicollinearity
14       Heteroscedasticity
15       Heteroscedasticity

28/28 
Tentative Calendar
Session  Topic
16       Autocorrelation
17       Autocorrelation
18       Model Specification
19       Model Specification
20       Endogeneity
21       Behavioral Models
22       Behavioral Models
23       Logit Models
24       Logit Models
25       GEV Models
26       GEV Models
27       GEV Models
28       Probit Models
29       Probit Models
30       Advanced Topics