# Lecture 12 Heteroscedasticity

```RS – Lecture 12
Lecture 12
Heteroscedasticity
1
Two-Step Estimation of the GR Model: Review
• Use the GLS estimator with an estimate of 
1.  is parameterized by a few estimable parameters, = (θ) .
Example: Harvey’s heteroscedastic model.
2. Iterative estimation procedure:
(a) Use OLS residuals to estimate the variance function.
(b) Use the estimated  in GLS - Feasible GLS, or FGLS.
• True GLS estimator
bGLS = (X’Ω-1X)-1 X’Ω-1y
(converges in probability to .)
• We seek a vector which converges
to the

 same thing that this does.
Call it FGLS, based on [X  -1 X]-1 X -1 y
1
RS – Lecture 12
Two-Step Estimation of the GR Model: Review
Two-Step Estimation of the GR Model: FGLS
• Feasible GLS is based on finding an estimator which has the same
properties as the true GLS.
Example: Var[i] = 2 Exp(zi).
True GLS: Regress yi/[ Exp((1/2)zi)] on xi/[ Exp((1/2)zi)]
FGLS: With a consistent estimator of [,], say [s,c], we do the same
computation with our estimates.
Note: If plim [s,c] = [,], then, FGLS is as good as true GLS.
• Remark: To achieve full efficiency, we do not need an efficient
estimate of the parameters in , only a consistent one.
2
RS – Lecture 12
Heteroscedasticity
• Assumption (A3) is violated in a particular way:  has unequal
variances, but i and j are still not correlated with each other. Some
(higher variance).
f(y|x)
.
.
x1
x2
x3
.
E(y|x) = b0 + b1x
x
5
Heteroscedasticity
• Now, we have the CLM regression with hetero-(different) scedastic
(variance) disturbances.
(A1) DGP: y = X  +  is correctly specified.
(A2) E[|X] = 0
(CLM =&gt; i = 1, for all i.)
(A3’) Var[i] = 2 i, i &gt; 0.
(A4) X has full column rank – rank(X)=k-, where T ≥ k.
• Popular normalization: i i = 1. (A scaling, absorbed into 2.)
• A characterization of the heteroscedasticity: Well defined
estimators and methods for testing hypotheses will be obtainable if
the heteroscedasticity is “well behaved” in the sense that
i / i i → 0 as T → . -i.e., no single observation
becomes dominant.
(1/T)i i → some stable constant.
(Not a plim!)
3
RS – Lecture 12
GR Model and Testing
• Implications for conventional OLS and hypothesis testing:
1. b is still unbiased.
2. Consistent? We need the more general proof. Not difficult.
3. If plim b = , then plim s2 = 2 (with the normalization).
4. Under usual assumptions, we have asymptotic normality.
• Two main problems with OLS estimation under heterocedasticity:
(1) The usual standard errors are not correct. (They are biased!)
(2) OLS is not BLUE.
• Since the standard errors are biased, we cannot use the usual tstatistics or F --statistics or LM statistics for drawing inferences. This
is a serious issue.
Heteroscedasticity: Inference Based on OLS
• Q: But, what happens if we still use s2(XX)-1?
A: It depends on XX - XX. If they are nearly the same, the OLS
covariance matrix will give OK inferences.
But, when will XX - XX be nearly the same? The answer is based
on a property of weighted averages. Suppose i is randomly drawn
from a distribution with E[i] = 1. Then,
p
E[x2]
--just like (1/T)i xi2.
(1/T)i i xi2 
• Remark: For the heteroscedasticity to be a significant issue for
estimation and inference by OLS, the weights must be correlated with
x and/or xi2. The higher correlation, heteroscedasticity becomes
more important (b is more inefficient).
4
RS – Lecture 12
Finding Heteroscedasticity
• There are several theoretical reasons why the i may be related to x
and/or xi2:
1. Following the error-learning models, as people learn, their errors of
behavior become smaller over time. Then, σ2i is expected to decrease.
2. As data collecting techniques improve, σ2i is likely to decrease.
Companies with sophisticated data processing techniques are likely to
commit fewer errors in forecasting customer’s orders.
3. As incomes grow, people have more discretionary income and, thus,
more choice about how to spend their income. Hence, σ2i is likely to
increase with income.
4. Similarly, companies with larger profits are expected to show
greater variability in their dividend/buyback policies than companies
with lower profits.
Finding Heteroscedasticity
• Heteroscedasticity can also be the result of model misspecification.
• It can arise as a result of the presence of outliers (either very small or
very large). The inclusion/exclusion of an outlier, especially if T is
small, can affect the results of regressions.
• Violations of (A1) --model is correctly specified--, can produce
heteroscedasticity, due to omitted variables from the model.
• Skewness in the distribution of one or more regressors included in
the model can induce heteroscedasticity. Examples are economic
variables such as income, wealth, and education.
• David Hendry notes that heteroscedasticity can also arise because of
– (1) incorrect data transformation (e.g., ratio or first difference
transformations).
– (2) incorrect functional form (e.g., linear vs log–linear models).
5
RS – Lecture 12
Finding Heteroscedasticity
• Heteroscedasticity is usually modeled using one the following
specifications:
- H1 : σt2 is a function of past εt2 and past σt2 (GARCH model).
- H2 : σt2 increases monotonically with one (or several) exogenous
variable(s) (x1,, . . . , xT ).
- H3 : σt2 increases monotonically with E(yt).
- H4 : σt2 is the same within p subsets of the data but differs across the
subsets (grouped heteroscedasticity). This specification allows for structural
breaks.
• These are the usual alternatives hypothesis in the heteroscedasticity
tests.
Finding Heteroscedasticity
• Visual test
In a plot of residuals against dependent variable or other variable will
often produce a fan shape.
180
160
140
120
100
Series1
80
60
40
20
0
0
50
100
150
12
6
RS – Lecture 12
Testing for Heteroscedasticity
• Usual strategy when heteroscedasticity is suspected: Use OLS along
the White estimator. This will give us consistent inferences.
• Q: Why do we want to test for heteroscedasticity?
A: OLS is no longer efficient. There is an estimator with lower
asymptotic variance (the GLS/FGLS estimator).
• We want to test: H0: E(ε2|x1, x2,…, xk) = E(ε2) = 2
• The key is whether E[2] = 2i is related to x and/or xi2. Suppose
we suspect a particular independent variable, say X1, is driving i.
• Then, a simple test: Check the RSS for large values of X1, and the
RSS for small values of X1. This is the Goldfeld-Quandt test.
Testing for Heteroscedasticity
• The Goldfeld-Quandt test
- Step 1. Arrange the data from small to large values of the
independent variable suspected of causing heteroscedasticity, Xj.
- Step 2. Run two separate regressions, one for small values of Xj and
one for large values of Xj, omitting d middle observations (≈ 20%).
Get the RSS for each regression: RSS1 for small values of Xj and
- Step 3. Calculate the F ratio
GQ = RSS2/RSS1, ~ Fdf,df with df =[(T – d) – 2(k+1)]/2 (A5 holds).
If (A5) does not hold, we have GQ is asymptoticallly χ2.
7
RS – Lecture 12
Testing for Heteroscedasticity
• The Goldfeld-Quandt test
Note: When we suspect more than one variable is driving the i’s,
this test is not very useful.
• But, the GQ test is a popular to test for structural breaks (two
regimes) in variance. For these tests, we rewrite step 3 to allow for
different size in the sub-samples 1 and 2.
- Step 3. Calculate the F-test ratio
Testing for Heteroscedasticity: LR Test
• The Likelihood Ratio Test
Let’s define the likelihood function, assuming normality, for a
general case, where we have g different variances:
T
ln L   ln 2 
2
g
Ti
1
ln i2 
2
2
i 1

g
1

i 1
2
i
( yi  X i)( yi  X i)
We have two models:
(R) Restricted under H0: i2 = 2. From this model, we calculate ln L
T
T
ln LR   [ln(2)  1]  ln(ˆ 2 )
2
2
(U) Unrestricted. From this model, we calculate the log likelihood.
T
ln LU   [ln(2)  1] 
2
g
 2 ln ˆ ;
i 1
Ti
2
i
ˆ i2  T1 ( yi  X i b)( yi  X i b)
i
8
RS – Lecture 12
Testing for Heteroscedasticity: LR Test
• Now, we can estimate the Likelihood Ratio (LR) test:
LR  2(ln LU  ln LR )  T ln ˆ 2 
g
T ln ˆ
i
2
i
a


 2g 1
i 1
Under the usual regularity conditions, LR is approximated by a χ2g-1.
• Using specific functions for i2, this test has been used by
Rutemiller and Bowers (1968) and in Harvey’s (1976) groupwise
heteroscedasticity paper.
Testing for Heteroscedasticity
• Score LM tests
• We want to develop tests of H0: E(ε2|x1, x2,…, xk) = 2 against an
H1 with a general functional form.
• Recall the central issue is whether E[2] = 2i is related to x
and/or xi2. Then, a simple strategy is to use OLS residuals to estimate
disturbances and look for relationships between ei2 and xi and/or xi2.
• Suppose that the relationship between ε2 and X is linear:
ε2 = Xα + v
Then, we test: H0: α = 0 against H1: α ≠ 0.
• We can base the test on how the squared OLS residuals e correlate
with X.
9
RS – Lecture 12
Testing for Heteroscedasticity
• Popular heteroscedasticity LM tests:
- Breusch and Pagan (1979)’s LM test (BP).
- White (1980)’s general test.
• Both tests are based on OLS residuals. That is, calculated under H0:
No heteroscedasticity.
• The BP test is an LM test, based on the score of the log likelihood
function, calculated under normality. It is a general tests designed to
detect any linear forms of heteroskedasticity.
• The White test is an asymptotic Wald-type test, normality is not
needed. It allows for nonlinearities by using squares and
crossproducts of all the x’s in the auxiliary regression.
Testing for Heteroscedasticity: BP Test
hi(α0 + zi,1’ α1 + .... + zi,m’ αm) = i2
• We want to test: H0: E(εi2|z1, z2,…, zk) = hi(zi’α) =2
or H0: α1 = α2 =... = αm = 
(m restrictions)
• Assume normality. That is, the log likelihood function is:
log L = constant + &frac12;Σ log 2i - &frac12; Σ εi2/i2
Then, construct an LM test:
LM = S(θR)’ I(θR)-1 S(θR)
θ= (β,α)
-2
-2
S(θ)=∂log L/∂θ’=[-Σi X’εi ;-&frac12;Σ(∂h/∂α)zii +&frac12;Σi-4 εi2(∂h/∂α)zi]
I(θ) = E[- ∂2log L/∂θ∂θ’]
• We have block diagonality, we can rewrite the LM test, under H0:
LM = S(α0,0)’ [I22- I21 I11 I21]-1 S(α0,0)
10
RS – Lecture 12
Testing for Heteroscedasticity: BP Test
• We have block diagonality, we can rewrite the LM test, under H0:
LM = S(α0,0)’ [I22- I21 I11 I21]-1 S(α0,0)
S(α0,0) = -&frac12;Σi (∂h/∂α|α0,R,0)z′ R-2 + &frac12; Σi R-4ei2(∂h/∂α |α0,R,0)z′
= &frac12; R-2 (∂h/∂α |α0,R,0) Σi zi (ei2/R2 - 1)
= &frac12; R-2 (∂h/∂α |α0,R,0) Σi zi ωi
ωi =ei2/R2 – 1 = gi - 1
I22(α0,0) = E[- ∂2log L/∂ α∂ α’] = &frac12; [R-2 (∂h/∂α |α0,R,0 )]2 Σi zi zi′
I21(α0,0) = 0
R2 = (1/T) Σi ei2
(MLE of under H0).
Then,
LM = &frac12; (Σi zi ωi)′[Σi zi zi′]-1(Σi zi ωi)= &frac12; W′Z (Z′Z)-1 Z′W ~ χ2m
Note: Recall R2 = [y′X (X′X)-1 X′y – T y 2]/[y′y – T y 2] = ESS/TSS
Also note that under H0: E[ωi]=0, E[ωi2]=1.
Testing for Heteroscedasticity: BP Test
• LM = &frac12; W′Z (Z′Z)-1 Z′W = &frac12; ESS
ESS = Explained SS in regression of ωi (=ei2/R2 - 1) against zi.
• Under the usual regularity condition, and under H0,
d
√T (αML – α) 
N(0, 2 4 (Z′Z/T)-1)
Then,
d
χ2m
LM-BP= (2 R4)-1 ESSe 
ESSe = ESS in regression of ei2 (=gi R2) against zi.
p
d
Since R4 
4
=&gt; LM-BP 
χ2m
Note: Recall R2= [y′X (X′X)-1 X′y – T y 2]/[y′y – T y 2]
Under H0: E[ωi]=0, E[ωi2]=1, then, the LM test can be shown to be
asymptotically equivalent to a T R2. (Think of y  0 and y’y/T=1
above).
11
RS – Lecture 12
Testing for Heteroscedasticity: BP Test
• The LM test is asymptotically equivalent to a T R2 test, where R2 is
calculated from a regression of ei2/R2on the variables Z.
• Usual calculation of the Breusch-Pagan test
- Step 1. (Auxiliary Regression). Run the regression of ei2on all the
explanatory variables, z. In our example,
ei2 = α0 + zi,1’ α1 + .... + zi,m’ αm + vi
- Step 2. Keep the R2 from this regression. Let’s call it Re22. Calculate
either
(a) F = (Re22/m)/[(1-Re22)/(T-(m+1)], which follows a Fm,(T-(m+1)
or
(b) LM = T Re22, which follows χ2m.
Testing for Heteroscedasticity: BP Test
• Variations:
(1) Glesjer (1969) test. Use absolute values instead of ei2 to estimate
the varying second moment. Following our previous example,
|ei|= α0 + zi,1’ α1 + .... + zi,m’ αm + vi
(2) Harvey-Godfrey (1978) test. Use ln(ei2). Then, the implied model
for i2 is an exponential model.
ln(ei2) = α0 + zi,1’ α1 + .... + zi,m’ αm + vi
Note: Implied model for i2 =exp{α0 + zi,1’ α1 + .... + zi,m’ αm + vi}.
12
RS – Lecture 12
Testing for Heteroscedasticity: BP Test
• Variations:
(3) Koenker’s (1981) studentized LM test. A usual problem with
statistic LM is that it crucially depends on the assumption that ε is
normal. Koenker (1981) proposed studentizing the statistic LM-BP
by
LM-S = (2 R4) LM-BP/[Σ (εi2-R2)2 /T]
d


χ2m
Testing for Heteroscedasticity: White Test
• Based on the difference between OLS and true OLS variances:
XX - XX = Σi (E[εi2] - 2)xi ′xi
• Empirical counterpart: (1/T) Σi (ei2 - s2)xi ′xi
• We can express each element of the k(k+1) matrix as:
(1/T) Σi (ei2 - s2)ψi
ψi : Kolmogorov-Gabor polynomial
ψi = (ψ1i, ψ2i, ..., ψmi)’
ψli= ψqi ψpi
p≥q, p,q=1,2,...,k
l= 1,2,...,m
m=k(k-1)/2
• White heteroscedasticity test:
d
W = [(1/T) Σi (ei2 - s2)ψi]’ DT-1 [(1/T) Σi (ei2 - s2)ψi] 
χ2m
where
DT = Var [(1/T)Σi (ei2 - s2)ψi]]
Note: W is asymptotically equivalent to a T R2 test, where R2 is
calculated from a regression of ei2/R2on the ψi’s.
13
RS – Lecture 12
Testing for Heteroscedasticity: White Test
• Usual calculation of the White test
- Step 1. (Auxiliary Regression). Regress e2 on all the explanatory
variables (Xj), their squares (Xj2), and all their cross products. For
example, when the model contains k = 2 explanatory variables, the
test is based on:
ei2 =β0+ β1 x1,i +β2 x2,i+β3 x1,i2+β4 x2,i2 + β5 x1x2,i + vi
Let m be the number of regressors in auxiliary regression. Keep R2,
say Re22.
- Step 2. Compute the statistic
LM = T Re22, which follows a χ2m.
Testing for Heteroscedasticity: Remarks
• Drawbacks of the Breusch-Pagan test:
- It has been shown to be sensitive to violations of the normality
assumption.
- Three other popular LM tests: the Glejser test; the Harvey-Godfrey
test, and the Park test, are also sensitive to such violations.
• Drawbacks of the White test
- If a model has several regressors, the test can consume a lot of df’s.
- In cases where the White test statistic is statistically significant,
heteroscedasticity may not necessarily be the cause, but model
specification errors.
- It is general, but does not give us a clue about how to model
heteroscedasticity to do FGLS. The BP test points us in a direction.
14
RS – Lecture 12
Testing for Heteroscedasticity: Remarks
• Drawbacks of the White test (continuation)
- In simulations, it does not perform well relative to others, especially,
for time-varying heteroscedasticity, typical of financial time series.
- The White test does not depend on normality; but the Koenker’s test
is also not very sensitive to normality. In simulations, Koenker’s test
seems to have more power –see, Lyon and Tsai (1996) for a Monte
Carlo study of the heteroscedasticity tests presented here.
.
Testing for Heteroscedasticity: Remarks
• General problems with heteroscedasticity tests:
- The tests rely on the first four assumptions of the CLM being true.
- In particular, (A2) violations. That is, if the zero conditional mean
assumption, then a test for heteroskedasticity may reject the null
hypothesis even if Var(y|X) is constant.
- This is true if our functional form is specified incorrectly (omitted
comment.
• Knowing the true source (functional form) of heteroscedasticity
may be difficult. A practical solution is to avoid modeling
heteroscedasticity altogether and use OLS along the White
heterosekdasticity-robust standard errors.
15
RS – Lecture 12
Estimation: WLS form of GLS
• While it is always possible to estimate robust standard errors for
OLS estimates, if we know the specific form of the heteroskedasticity,
we can obtain more efficient estimates than OLS: GLS.
• GLS basic idea: Efficient estimation through the transform the
model into one that has homoskedastic errors – called WLS.
• Suppose the heteroskedasticity can be modeled as:
Var(ε|x) = 2 h(x)
• The key is to figure out what h(x) looks like. Suppose that we know
hi. For example, hi(x)=1/xj2. (make sure hi is always positive.)
• Then, use 1/√ xj2 to transform the model.
Estimation: WLS form of GLS
• Suppose that we know hi(x)=1/xj2. Then, use 1/√ xj2 to transform
the model:
Var(εi/√hi|x) = 2
• Thus, if we divide our whole equation by √hi we get a (transformed)
model where the error is homoskedastic.
• Assuming weights are known, we have a two-step GLS estimation:
- Step 1: Use OLS, then the residuals to estimate the weights.
- Step 2: Weighted least squares using the estimated weights.
• Greene has a proof based on our asymptotic theory for the
asymptotic equivalence of the second step to true GLS.
16
RS – Lecture 12
Estimation: FGLS
• More typical is the situation where we do not know the form of the
heteroskedasticity. In this case, we need to estimate h(xi).
• Typically, we start by assuming a fairly flexible model, such as
Var(ε|x) = 2exp(X)
--make sure Var(εi|x) &gt;0.)
Since we don’t know the , it must be estimated. Our assumption
implies that
with E(v|X) = 1.
ε2 = 2exp(X) v
Then, if E(v) = 1
ln(ε2) = X + u
(*)
where E(u) = 0 and u is independent of X.
Now, we know that e is an estimate of ε, so we can estimate (*) by
OLS.
Estimation: FGLS
• Now, an estimate of h is obtained as ĥ = exp(ĝ), and the inverse of
this is our weight. Now, we can do GLS as usual.
• Summary of FGLS
(1) Run the original OLS model, save the residuals, e. Get ln(e2).
(2) Regress ln(e2) on all of the independent variables. Get fitted
values, ĝ.
(3) Do WLS using 1/exp(ĝ) as the weight.
(4) Iterate to gain efficiency.
• Remark: We are using WLS just for efficiency – OLS is still
unbiased and consistent. Sandwich estimator gives us consistent
inferences.
17
RS – Lecture 12
Estimation: MLE
• ML estimates all the parameters simultaneously. To construct the
likelihood, we assume a distribution for ε. Under normality (A5):
T
T
T
1
1
1
ln L   ln 2 
ln i2 
( yi  X i)( yi  X i)
2
2 i1
2 i1 i2


• Suppose i2 = exp(α0 + zi,1 α1 + .... + zi,m αm )= exp(zi’α)
• Then, the first derivatives of the log likelihood wrt θ=(β,α) are:
T
 ln L
  x i  i / i2  X '  1

i 1

 ln L
1

i
2
T
T
1
1
1 / i2 exp(zi ' ) zi  ( ) i2 / i4 exp(zi ' ) zi 
2 i 1
2
i 1


T
 z (
i
2
i
/ i2  1)
i 1
• Then, we get the f.o.c. We get a non-linear system of equations.
Estimation: MLE
• We take second derivatives to calculate the information matrix :
 ln L2

x i x i ' /  i2  X '  1 X
  '
i 1
T

 ln L
1

  i '
2
T
 x z '
i i
i
/  i2
i 1
 ln L
1

 i  i '
2
T
 z z '
i i
2
i
/  i2
i 1
• Then,
I ()  E[
1
 ln L  X '  X
]
'  0

0 

1
Z'Z
2

• We can estimate the model using Gauss-Newton:
θj+1 = θj – Ht-1 gt
gt = ∂log L/∂θ’
18
RS – Lecture 12
Estimation: MLE
• We estimate the model using Gauss-Newton:
θj+1 = θj – Hj-1 gj
gj = ∂log Lj/∂θ’
Since Ht is block diagonal,
βj+1 = βj – (X’ Σj-1 X)-1 X’ Σj-1 εj
αj+1 = αj – (&frac12;Z’Z)-1 [&frac12; Σi zi (εi2/i2 - 1)] = αj – (Z’Z)-1 Z’v,
where
v = (εi2/i2 - 1)]
Convergence will be achieved when gj = ∂log Lj/∂θ’ is close to zero.
• We have an iterative algorithm
=&gt; Iterative FGLS= MLE!
Heteroscedasticity: Log Transformations
• A log transformation of the data, can eliminte (or reduce) a certain
type of heteroskedasticity.
- Assume
- t = E[Zt]
- Var[Zt] = δ t2
(Variance proportional to the
squared mean)
• We log transformed the data: log(Zt). Then, we use the delta
method to approximate the variance of the transformed variable.
Recall: Var[f(X)] using delta method:
Var[ f ( X )]  f ' ( ) 2 Var[ X ]
• Then, the variance of log(Zt) is roughly constant:
Var [log( Z t )]  (1 /  t ) 2 Var [ Z t ]  
19
RS – Lecture 12
ARCH Models
• Until the early 1980s econometrics had focused almost solely on
modeling the means of series -i.e., their actual values.
yt = Et[ yt|x] + εt ,
εt ~ D(0,σ2)
For an AR(1) process:
Et-1[ yt|x] = Et[ yt] = α + β yt-1
Note: The unconditional mean and variance are:
E[yt] = α/(1-β) &amp; Vart[yt] = σ2/(1-β2)
The conditional mean is time varying; the unconditional mean is not!
Key distinction: Conditional vs. Unconditional moments.
• Similar idea for the variance
Unconditional variance: Var [yt] = E[(yt –E[yt])2] = σ2/(1-β2)
Conditional variance: Vart-1[yt] = Et-1[(yt –Et-1[yt])2] = Et-1[εt2]
ARCH Models
• The unconditional variance measures the overall uncertainty. In the
AR(1) example, time t-1 information plays no role: Var[yt] = σ2/(1-β2)
• The conditional variance, Vart-1[yt], is a better measure of
uncertainty at time t-1. It is a function of information at time t-1.
Conditional
Variance
yt
mean
Variance
t-1
Time
20
RS – Lecture 12
ARCH Models: Stylized Facts of Asset Returns
- Thick tails - Mandelbrot (1963): leptokurtic (thicker than Normal)
- Volatility clustering - Mandelbrot (1963): “large changes tend to be
followed by large changes of either sign.”
- Leverage Effects – Black (1976), Christie (1982): Tendency for changes
in stock prices to be negatively correlated with changes in volatility.
- Non-trading Effects, Weekend Effects – Fama (1965), French and Roll
(1986) : When a market is closed information accumulates at a
different rate to when it is open –for example, the weekend effect,
where stock price volatility on Monday is not three times the volatility
on Friday.
ARCH Models: Stylized Facts of Asset Returns
- Expected events – Cornell (1978), Patell and Wolfson (1979), etc:
Volatility is high at regular times such as news announcements or
other expected events, or even at certain times of day –for example,
less volatile in the early afternoon.
- Volatility and serial correlation – LeBaron (1992): Inverse relationship
between the two.
- Co-movements in volatility – Ramchand and Susmel (1998): Volatility is
positively correlated across markets/assets.
• We need a model that accommodates all these facts.
21
RS – Lecture 12
ARCH Models: Stylized Facts of Asset Returns
• Easy to check leptokurtosis (Stylized Fact #1)
Figure: Descriptive Statistics and Distribution for Monthly S&amp;P500 Returns
Statistic
Mean (%)
Standard Dev (%)
Skewness
Excess Kurtosis
Jarque-Bera
AR(1)
0.626332
(p-value: 0.0004)
4.37721
-0.43764
2.29395
145.72
(p-value: &lt;0.0001)
0.0258
(p-value: 0.5249)
ARCH Models: Stylized Facts of Asset Returns
• Easy to check Volatility Clustering (Stylized Fact #2)
Figure: Monthly S&amp;P500 Returns (1964:1-2014:9)
0.2
0.15
0.1
0.05
-0.05
1964
1965
1966
1967
1969
1970
1971
1972
1974
1975
1976
1977
1979
1980
1981
1982
1984
1985
1986
1987
1989
1990
1991
1992
1994
1995
1996
1997
1999
2000
2001
2002
2004
2005
2006
2007
2009
2010
2011
2012
2014
0
-0.1
-0.15
-0.2
-0.25
22
RS – Lecture 12
ARCH Models: Engle (1982)
• We start with assumptions (A1) to (A5), but with a specific (A3’):
 t ~ N (0,  t 2 )
Yt   X t   t
 t 2  Var t 1 (  t )  E t 1 (  t2 )   
q
 
2
i t i
    ( L ) 2
i 1
define  t   t2   t2
 t2     ( L ) t2   t
• This is an AR(q) model for squared innovations. That is, we have an
ARCH model: Auto-Regressive Conditional Heteroskedasticity
This model estimates the unobservable (latent) variance.
Note: We are dealing with a variance, we usually impose
ω&gt;0 and αi &gt;0 for all i.
Robert F. Engle, USA
ARCH Models: Engle (1982)
• The unconditional variance is determined by:
q
q
i 1
i 1
 2  E [ t 2 ]      i E [ t2 i ]      i 2
That is, 
2


1
q

i 1
i
To obtain a positive σ2, we impose another restriction: (1-Σi αi)&gt;0.
• Example: ARCH(1)
Yt   X t   t
 t ~ N (0,  t 2 )
 t 2     1  t21
• We need to impose restrictions: α1&gt;0 &amp; 1 - α1&gt;0.
23
RS – Lecture 12
ARCH Models: Engle (1982)
• Even though the errors may be serially uncorrelated they are not
independent: there will be volatility clustering and fat tails. Let’s define
standardized errors:
zt   t /  t
• They have conditional mean zero and a time invariant conditional
variance equal to 1. That is, zt ~ D(0,1). If zt is assumed to follow a
N(0,1), with a finite fourth moment (use Jensen’s inequality). Then:
E( t4 )  E( zt4 ) E( t4 )  E( zt4 ) E( t2 ) 2  E( zt4 ) E( t2 ) 2  3E( t2 ) 2
 ( t )  E( t4 ) / E( t2 ) 2  3.
• For an ARCH(1), the 4th moment for an ARCH(1):
 ( t )  3(1   2 ) /(1  3 2 )
if 3 2  1.
ARCH Models: Engle (1982)
• More convenient, but less intuitive, presentation of the ARCH(1)
model:
 t   t2  t
where υt is i.i.d. with mean 0, and Var[υt]=1. Since υt is i.i.d., then:
E t 1 [ t2 ]  E t 1[ t2 t2 ]  E t 1 [ t2 ] E t 1 [ t2 ]     1 t21
• It turns out that σt2 is a very persistent process. Such a process can
be captured with an ARCH(q), where q is large. This is not efficient.
24
RS – Lecture 12
GARCH: Bollerslev (1986)
• In practice, q is often large. A more parsimonious representation is
the Generalized ARCH model or GARCH(q,p):
q
p
i 1
j 1
 t2      i  t2 j    j t2 j
    ( L )
2
  ( L )
2
define  t   2 t   2 t
 t2    ( ( L )   ( L ))  2   ( L )   t
which is an ARMA(max(p,q),p) model for the squared innovations.
• Popular GARCH model: GARCH(1,1):
 t21     1  t2   1  t2
GARCH: Bollerslev (1986)
• Technical details: This is covariance stationary if all the roots of
α(L) + β(L) = 1
lie outside the unit circle. For the GARCH(1,1) this amounts to
α1 + β1 &lt; 1.
• Bollerslev (1986) showed that if 3α12 + 2α1β1 + β1 2 &lt; 1, the
second and 4th moments of εt exist:
E[ t2 ] 

(1  1  1 )
E[ t4 ] 
3 2 (1  1  1 )
2
2
(1  1  1 )(1  1  211  31 )
if (1  1  211  31 )  0
2
2
25
RS – Lecture 12
GARCH: Forecasting and Persistence
• Consider the forecast in a GARCH(1,1) model
t21    1 t2  1t2    t2 (1 zt2  1 )
( t2  t2 zt2 )
Taking expectation at time t
E t [ t21 ]     t2 ( 11   1 )
Then, by repeated substitutions:
j 1
E t [ t2 j ]   [  ( 1   1 ) i ]   t2 ( 1   1 ) j
i0
As j→∞, the forecast reverts to the unconditional variance:
ω/(1-α1-β1).
• When α1+β1=1, today’s volatility affect future forecasts forever:
Et [ t2 j ]   t2  j
GARCH-X
• In the GARCH-X model, exogenous variables are added to the
conditional variance equation.
Consider the GARCH(1,1)-X model:
 t2     1 t21   1 t21   f ( X t 1 ,  )
where f(Xt, ) is strictly positive for all t. Usually, Xt, is an observed
economic variable or indicator, say liquidity index, and f(.) is a nonlinear transformation, which is non-negative.
Examples: Glosten et al. (1993) and Engle and Patton (2001) use 3mo T-bill rates for modeling stock return volatility. Hagiwara and
Herce (1999) use interest rate differentials between countries to
model FX return volatility. The US congressional budget office uses
inflation in an ARCH(1) model for interest rate spreads.
26
RS – Lecture 12
IGARCH
• Recall the technical detail: The standard GARCH model:
 t2     ( L ) 2   ( L ) 2
is covariance stationary if α(1) + β(1) &lt; 1.
• But strict stationarity does not require such a stringent restriction
(That is, that the unconditional variance does not depend on t).
In the GARCH(1,1) model, if α1 + β1 =1, we have the Integrated
GARCH (IGARCH) model.
• In the IGARCH model, the autoregressive polynomial in the ARMA
representation has a unit root: a shock to the conditional variance is
“persistent.”
IGARCH
• Variance forecasts are generated with: E t [ t2 j ]   t2  j
• That is, today’s variance remains important for future forecasts of all
horizons.
• Nelson (1990) establishes that, as this satisfies the requirement for
strict stationarity, it is a well defined model.
• In practice, it is often found that α1 + β1 are close to 1.
• It is often argued that IGARCH is a product of omitted variables;
For example, structural breaks. See Lamoreux and Lastrapes (1989),
Hamilton and Susmel (1994), &amp; Mikosch and Starica (2004).
• Shepard and Sheppard (2010) argue for a GARCH-X explanation.
27
RS – Lecture 12
GARCH: Variations
• EGARCH model -- Nelson (Econometrica, 1991).
It models an exponential function for the time-varying variance:
q
p
i 1
j 1
log( t2 )      i ( zt i   (| zt i |  E | zt i |))    j log( t2 j )
• GJR-GARCH model -- Glosten Jagannathan and Runkle (JF, 1993):
q
q
p
i 1
i 1
j 1
 t2      i  t2 i    i  t2 i * I t  i    j t2 j
where It-i=1 if εt-i&lt;0; 0 otherwise.
• Remark: Both models capture sign (asymmetric) effects in volatility:
Negative news (εt-i&lt;0) increase the conditional volatility (leverage
effect).
GARCH: Variations
• Non-linear ARCH model NARCH -- Higgins and Bera (1992) and
Hentschel (1995).
These models apply the Box-Cox-type transformation to the
conditional variance:
q
p
i 1
j 1
 t      i |  t  i   |    j t j
Special case: γ=2 (standard GARCH model).
Note: The variance depends on both the size and the sign of the
variance which helps to capture leverage type (asymmetric) effects.
28
RS – Lecture 12
GARCH: Variations
• Threshold ARCH (TARCH) --Rabemananjara, R. and J.M. Zakoian
(JAE, 1993).
Large events to have an effect but no effect from small events
q
p
i 1
j 1
 t2     (  i I ( t i   )    i I ( t i   )) 2 t i )    j 2 t  j
There are two variances:
q
p
i 1
j 1
q
p
i 1
j 1
 t2 i       i  t2 i    j
2
 t2 i       i  t2 i    j
2
t j
,
if ( t  i   )
t j
,
if ( t  i   )
Many other versions are possible by adding minor asymmetries or
non-linearities in a variety of ways.
GARCH: Variations
• Switching ARCH (SWARCH) - Hamilton and Susmel (JE, 1994).
Intuition: σt2 depends on the state of the economy –regime. It’s based on
Hamilton’s (1989) time series models with changes of regime:
 t2   s , s
t
t 1

q

i 1
i , st , st 1
 t21
The key is to select a parsimonious representation:
 2t
 
s
t
q

i 1
i
 t2 i
s
ti
For a SWARCH(1) with 2 states (1 and 2) we have 4 possible σt2:

2
t
 

2
t
 

2
t
 

2
t
 
1
  1  t2 1  1 /  1 ,
s t  1, s t  1  1
1
  1
2
t 1
 / 2,
s t  1, s t  1  2
2
  1
2
t 1
 / ,
s t  2 , s t 1  1
2
  1
2
t 1
 / ,
s t  2 , s t 1  2
1
2
2
1
2
29
RS – Lecture 12
ARCH Estimation: MLE
• All of these models can be estimated by maximum likelihood. First
we need to construct the sample likelihood.
• Since we are dealing with dependent variables, we use the
conditioning trick to get the joint distribution:
f ( y1 , y 2 ,..., y T ; θ )  f ( y1 | x 1 ; θ ) f ( y 2 | y1 , x 2 , x 1 ; θ )....
.... f ( y T | y T 1 ,.., y1 , x T 1 ,.., x 1 ; θ ).
Taking logs
  log( f ( y1, y2 ,..., yT ; θ))  log( f ( y1 | x1 ; θ))  log( f ( y2 | y1, x2 , x1; θ))  .... 
....  log( f ( yT | yT 1,.., y1, xT 1,.., x1; θ))
 t 1 log( f ( yt | Yt 1, X t ; θ))
T
ARCH Estimation: MLE
• Note that the δϑ/δγ=0 (k f.o.c.’s) will give us GLS.
• Denote δϑ/δθ=S(yt,θ)=0
(S(.) is the score vector)
- We have a (k+2xk+2) system. But, it is a non-linear system. We will
need to use numerical optimization. Gauss-Newton or BHHH (also
approximates H by the product of S(yt,θ)’s) can be easily
implemented.
- Given the AR structure, we will need to make assumptions about σ0
(and ε0,ε1 , ..εp if we assume an AR(p) process for the mean).
- Alternatively, we can take σ0 (and ε0,ε1 , ..εp) as parameters to be
estimated (it can be computationally more intensive and estimation
can lose power.)
30
RS – Lecture 12
ARCH Estimation: MLE
• If the conditional density is well specified and θ0 belongs to Ω, then
T
where A01  T 1 
T 1 / 2 (ˆ   0 )  N ( 0 , A01 ),
t 1
S t ( yt , 0 )

• Under the correct specification assumption, A0=B0, where
T
B 0  T 1  E [ S t ( y t ,  0 ), S t ( y t ,  0 )' ]
t 1
We estimate A0 and B0 by replacing θ0 by its estimated MLE value.
• The estimator B0 has a computational advantage over A0.: Only first
derivatives are needed. But A0=B0 only if the distribution is correctly
specified. This is very difficult to know in practice.
• Common practice in empirical studies: Assume the necessary
regularity conditions are satisfied.
ARCH Estimation: MLE - Example
• Assuming normality, we maximize with respect to θ the function:

T
T
1
T
 log( L )   2 log( 2 )  2  { log( 
t
t 1
i 1
2
t
)   t2 /  t2 }
Example: ARCH(1) model.
T
   log( Lt )  
t 1
T
1 T
1 T
log( 2 )   log(   1 t21 )    2 t /(  1 t21 )
2
2 t 1
2 t 1
Taking derivatives with respect to θ=(ω,α,γ), where γ=k mean pars:
T
T

  1 /(   1 t21 )  ( 1 / 2)  t2 / (  1 t21 ) 2

t 1
t 1

T

T

   t21 /(   1 t21 )  ( 1 / 2)  t2  t21 / (  1 t21 ) 2
 1
t 1
t 1


T

  x t  t /  t2
γ
t 1

31
RS – Lecture 12
ARCH Estimation: MLE
• Then, we set the f.o.c.
=&gt; δϑ/δθ=0.
• We have a (k+2) system. It is a non-linear system. We will use
numerical optimization; usually, Gauss-Newton or BHHH.
• Again, note that δϑ/δγ=0 (k f.o.c.’s) will give us GLS.
T


x t  t ( γ MLE ) /  t2 (  MLE ,  MLE , )  0
γ
t 1

GARCH(1,1): Example – USD/CHF
• We estimate an AR(1)-GARCH(1,1) for St - USD/CHF:
st = [log(St) - log(St-1)]x100 = a0 + a1st-1 + εFt,
σ2t = α0 + α1 e2t-1 + &szlig;1 σ2t-1.
εtΨt-1~N(0,σ2t).
• T: 1548 (January 6, 1977 - January 30, 2010, daily).
The estimated model for st is given by:
-0.055 +
0.059 st-1,
st =
(1.10)
(1.78)
2
σ t = 0.140 +
0.079 e2t-1 + 0.876 σ2t-1.
(2.19)
(4.16)
(25.95)
Note: α1 + &szlig;1 =.955 &lt; 1.
(Very persistent!)
32
RS – Lecture 12
GARCH(1,1): Example – USD/CHF
Changes(%
)
CHF/USD Changes
10
8
6
4
2
0
-2
-4
-6
-8
-10
CHF/USD Variance
14
Variance
12
10
8
6
4
2
2/5/2009
2/5/2007
2/5/2005
2/5/2003
2/5/2001
2/5/1999
2/5/1997
2/5/1995
2/5/1993
2/5/1991
2/5/1989
2/5/1987
2/5/1985
2/5/1983
2/5/1981
2/5/1979
2/5/1977
2/5/1975
2/5/1973
2/5/1971
0
ARCH Estimation: MLE – Regularity Conditions
Note: The appeal of MLE is the optimal properties of the resulting
estimators under ideal conditions.
• Crowder (1976) gives one set of sufficient regularity conditions for
the MLE in models with dependent observations to be consistent
and asymptotically normally distributed.
• Verifying these regularity conditions is very difficult for general
ARCH models - proof for special cases like GARCH(1,1) exists.
For example, for the GARCH(1,1) model: if E[ln(α1zt2 +β1)] &lt; 0, the
model is strictly stationary and ergodic. See Nelson (1990) &amp;
Lumsdaine (1996).
33
RS – Lecture 12
ARCH Estimation: MLE – Regularity Conditions
• Block-diagonality
In many applications of ARCH, the parameters can be partitioned
into mean parameters, θ1, and variance parameters, θ2.
Then, δμt(θ)/δθ2=0 and, although, δσt(θ)/δθ1≠0, the Information
matrix is block-diagonal (under general symmetric distributions for zt
and for particular ARCH specifications).
- Regression can be consistently done with OLS.
- Asymptotically efficient estimates for the ARCH parameters can be
obtained on the basis of the OLS residuals.
ARCH Estimation: MLE – Remarks
• But, block diagonality cannot buy everything:
- Conventional OLS standard errors could be terrible.
- When testing for serial correlation, in the presence of ARCH, the
conventional Bartlett s.e. – T-1/2– could seriously underestimate the
true standard errors.
34
RS – Lecture 12
ARCH Estimation: QMLE
• The assumption of conditional normality is difficult to justify in
many empirical applications. But, it is convenient.
• The MLE based on the normal density may be given a quasimaximum likelihood (QMLE) interpretation.
• If the conditional mean and variance functions are correctly
specified, the normal quasi-score evaluated at θ0 has a martingale
difference property:
E{δϑ/δθ=S(yt,θ0)}=0
Since this equation holds for any value of the true parameters, the
QMLE, say θQMLE is Fisher-consistent –i.e., E[S(yT, yT-1,…y1 ; θ)] = 0 for
any θ∈Ω.
ARCH Estimation: QMLE
• The asymptotic distribution for the QMLE takes the form:
T 1 / 2 (ˆQMLE   0 )  N ( 0 , A0 1 B 0 A0 1 ).
The covariance matrix (A0-1 B0 A0-1) is called “robust.” Robust to
departures from “normality.”
• Bollerslev and Wooldridge (1992) study the finite sample distribution
of the QMLE and the Wald statistics based on the robust covariance
matrix estimator:
For symmetric departures from conditional normality, the QMLE is
generally close to the exact MLE.
For non-symmetric conditional distributions both the asymptotic and
the finite sample loss in efficiency may be large.
35
RS – Lecture 12
ARCH Estimation: Non-Normality
• The basic GARCH model allows a certain amount of leptokurtosis.
It is often insufficient to explain real world data.
Solution: Assume a distribution other than the normal which help to
allow for the fat tails in the distribution.
• t Distribution - Bollerslev (1987)
The t distribution has a degrees of freedom parameter which allows
greater kurtosis. The t likelihood function is
lt  ln((0.5(v  1))(0.5v) 1 (v  2) 1 / 2 (1  zt (v  2) 1 ) ( v1) / 2 )  0.5 ln( 2t )
where Γ is the gamma function and v is the degrees of freedom.
As υ→∞, this tends to the normal distribution.
• GED Distribution - Nelson (1991)
ARCH Estimation: GMM
• Suppose we have an ARCH(q). We need moment conditions:
(1)  E [ m1 ]  E [ x t ' ( y t  x t γ )]  0
( 2 )  E [ m 2 ]  E [ε t2 ' ( t2   t2 )]  0
(3)  E [ m 3 ]  E [ t2   /(1   1  ...   q )]  0
Note: (1) refers to the conditional mean, (2) refers to the conditional
variance, and (3) to the unconditional mean.
• GMM objective function:
^
^
Q ( X, y; θ )  E [ m ( θ ; X, y )]' W E [ m ( θ ; X, y )]
where
^
^
^
^
E [ m ( θ ; X, y )]  [ E [ m 1 ]' E [ m 3 ]' E [ m 3 ]' ]'
36
RS – Lecture 12
ARCH Estimation: GMM
• γ has k free parameters; α has q free parameters. Then, we have
r=k+q+1 parameters.
m(θ;X,y) has r = k+q+1 equations.
Dimensions: Q is 1x1; E[m(θ;X,y)] is rx1; W is rxr.
• Problem is over-identified: more equations than parameters so cannot
solve E[m(θ;X,y)]=0, exactly.
• Choose a weighting matrix W for objective function and minimize
using numerical optimization.
• Optimal weighting matrix: W =[E[m(θ;X,y)]E[m(θ;X,y)]’]-1.
Var(θ)=(1/T)[DW-1D’]-1,
where D = δE[m(θ;X,y)]/δθ’ --expressions evaluated at θGMM.
ARCH Estimation: Testing
• Standard BP test , with auxiliary regression given by:
et2 = α0 + α1 et-12 + .... + αm et-q2 + vt
H0: α1= α2=...= αq=0 (No ARCH). It is not possible to do GARCH
test, since we are using the same lagged squared residuals.
d
Then, the LM test is (T-q)R2 
χ2q
-- Engle’s (1982).
• In ARCH Models, testing as usual: LR, Wald, and LM tests.
Reliable inference from the LM, Wald and LR test statistics
generally does require moderately large sample sizes of at least two
hundred or more observations.
37
RS – Lecture 12
ARCH Estimation: Testing
• Issues:
- Non-negative constraints must be imposed. θ0 is often on the
boundary of Ω. (Two sided tests may be conservative)
- Lack of identification of certain parameters under H0 creates a
singularity of the Information matrix under H0. For example, under
H0: α1=0 (No ARCH), in the GARCH(1,1), ω and β1 are not jointly
identified. See Davies (1977).
• Ignoring ARCH
- You suspect yt has an AR structure: yt = γ0 + γ1 yt-1 + εt
Hamilton (2008) finds that OLS t-test with no correction for ARCH
spuriously reject H0: γ1=0 with arbitrarily high probability for
sufficiently large T. White’s (1980) SE help. NW SE help less.
ARCH Estimation: Testing
Figure. From Hamilton (2008). Fraction of samples in which OLS ttest leads to rejection of H0: γ1=0 as a function of T for regression
with Gaussian errors (solid line) and Student’s t errors (dashed line).
Note: H0 is actually true and test has nominal size of 5%.
38
RS – Lecture 12
ARCH: Which Model to Use
• Questions
1) Lots of ARCH models. Which one to use?
2) Choice of p and q. How many lags to use?
• Hansen and Lunde (2004) compared lots of ARCH models:
- It turns out that the GARCH(1,1) is a great starting model.
- Add a leverage effect for financial series and it’s even better.
- A t-distribution is also a good addition.
Realized Volatility (RV) Models
• French, Schwert and Stambaugh’s (1987) use higher frequency
to estimate the variance as:
1 k 2
2
st   rt 1
k i 1
where rt is realized returns in days, and we estimate monthly variance.
• Model-free measure –i.e., no need for ARCH-family specifications. It
is very popular for intra-daily data, called high frequency (HF) data. The
measure is called realized volatility (RV).
• Very popular to calculate intra-day or daily volatility. For example,
based on TAQ data, say, 1’ or 10’ realized returns (rt,j is jth interval
return on day t), we can calculate the daily variance, or RVt:
RV
t

M

j 1
r t 2, j ,
t  1 , 2 ,...., T
39
RS – Lecture 12
Realized Volatility (RV) Models
Realized Volatility (RV) Models
RV t 
M
r
j 1
2
t, j
,
t  1, 2 ,...., T
where rt,j is jth interval return on day t. That is, RV is defined as the
• We can use time series models –say an ARIMA- for RVt to forecast
daily volatility.
• RV is affected by microstructure effects: bid-ask bounce, infrequent
serial correlation in intra-day returns, which biases RVt. (Big problem!)
-Proposed Solution: Filter the intra-day returns using MA or AR
models before constructing RV measures.
40
RS – Lecture 12
Realized Volatility (RV) Models - Properties
• Under some conditions (bounded kurtosis and autocorrelation of
squared returns less than 1), RVt is consistent and m.s. convergent.
• Realized volatility is a measure. It has a distribution.
• For returns, the distribution of RV is non-normal (as expected). It
tends to be skewed right and leptokurtic. For log returns, the
distribution is approximately normal.
• Daily returns standardized by RV measures are nearly Gaussian.
• RV is highly persistent.
• The key problem is the choice of sampling frequency (or number of
observations per day).
— Bandi and Russell (2003) propose a data-based method for
choosing frequency that minimizes the MSE of the measurement
error.
— Simulations and empirical examples suggest optimal sampling is
around 1-3 minutes for equity returns.
RV Models - Variation
• Another method: AR model for volatility:
|  t |    |  t 1 |  t
The εt are estimated from a first step procedure -i.e., a regression.
Asymmetric/Leverage effects can also be introduced.
OLS estimation possible. Make sure that the variance estimates are
positive.
41
RS – Lecture 12
Other Models - Parkinson’s (1980) estimator
• The Parkinson’s (1980) estimator:
s2t={Σt [ln(Ht)-ln(Lt)]2 /(4ln(2)T)},
where Ht is the highest price and Lt is the lowest price.
• There is an RV counterpart, using HF data: Realized Range (RR):
RRt={Σj [100x(ln(Ht,j)-ln(Lt,j))]2 /(4ln(2)},
where Ht,j and Lt,j are the highest and lowest price in the jth interval.
• These “range estimators are very good and very efficient.
Reference: Christensen and Podolskij (2005).
Stochastic volatility (SV/SVOL) models
• Now, instead of a known volatility at time t, like ARCH models, we
allow for a stochastic shock to σt, υt:
 t     t 1  t ;
t ~ N (0,   )
Or using logs:
log  t     log  t 1  t
• We have 3 SVOL parameters to estimate: φ=(ω,β,σv).
• The difference with ARCH models: The shocks that govern the
volatility are not necessarily εt’s.
• Estimation: Bayesian (Gibbs sampling, MCMC Models).
• References: Jacquier, E., Poulson, N., Rossi, P. (1994), Bayesian analysis of stochastic
volatility models, Journal of Business and Economic Statistics. (Estimation)
Heston, S.L. (1993), A closed-form solution for options with stochastic volatility with
applications to bond and currency options, Review of Financial Studies. (Theory)
42
```