
Analytics of Finance 15.450 Cheat Sheet

Statistical Modeling, Inference, and Forecasting
-Mean-squared Error (MSE): a common criterion for measuring a model's performance. MSE is a loss function:
E[(y − ŷ)²] or, with a given sample, (1/n) Σ_{i=1}^n (yᵢ − f̂(xᵢ))²
Estimators: Let θ be the parameter vector, and ϑ = h(θ) a function of the model parameters. An estimator of ϑ is a function ϑ̂(x) of the observed sample.
-Estimating the Sharpe Ratio: consider a sample x of excess returns R_t^e = μ + σε_t, t = 1,…,T, where the ε_t are IID N(0,1) random variables and σ > 0; μ and σ are unknown. How do we estimate the Sharpe ratio ϑ = μ/σ? An intuitive estimator of ϑ is ϑ̂(x) = μ̂/σ̂, where μ̂ = (1/T) Σ_{t=1}^T R_t^e and σ̂ = √((1/(T−1)) Σ_{t=1}^T (R_t^e − μ̂)²). (Any function of the sample is a valid estimator; not every valid estimator is useful.)
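A quick numerical check of this plug-in estimator; a minimal numpy sketch where μ = 0.01, σ = 0.05, and T = 240 are illustrative values, not from the notes:

```python
import numpy as np

# Simulate T months of excess returns R_t = mu + sigma*eps_t and form the
# plug-in Sharpe estimator SR_hat = mu_hat / sigma_hat.
rng = np.random.default_rng(0)
mu, sigma, T = 0.01, 0.05, 240
R = mu + sigma * rng.standard_normal(T)

mu_hat = R.mean()             # (1/T) * sum of R_t
sigma_hat = R.std(ddof=1)     # sqrt((1/(T-1)) * sum of squared deviations)
sr_hat = mu_hat / sigma_hat   # noisy estimate of the true SR = 0.2
```

The estimate is itself a random variable; rerunning with a different seed gives a different realized value.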
-Estimator is a random variable. Realized value of the
estimator depends on the sample, which is random.
-Properties of estimators: an estimator ϑ̂(x) of ϑ = h(θ) is consistent if, as the sample size goes to infinity, the estimator → ϑ.
- An estimator ϑ̂(x) of ϑ = h(θ) is unbiased if, given the true value of θ, the conditional expectation of ϑ̂(x) equals ϑ.
- We generally prefer consistent estimators. Bias may be
impossible to avoid, we just don’t want it to be too big.
-Method of Moments: the estimator θ̂ of parameter θ₀ is the solution to Ê[f(x, θ̂)] = (1/n) Σ_{i=1}^n f(xᵢ, θ̂) = 0. Set each sample moment equal to the corresponding population moment and solve for the unknown parameters.
-MLE: the estimator ϑ̂(x) of the parameter vector θ defined by ϑ̂(x) = argmax_θ p(x|θ) = argmax_θ ln p(x|θ)
- A coin is flipped 100x. Given that there were 55 heads,
find the MLE for the probability p of heads on a single toss.
-Log likelihood: ln P(55 heads | p) = ln C(100,55) + 55 ln(p) + 45 ln(1 − p).
The FOC is 55/p − 45/(1 − p) = 0; solving gives p = 0.55.
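The FOC solution can be verified by maximizing the log-likelihood numerically over a grid (a minimal sketch; the binomial-coefficient term is dropped since it does not depend on p):

```python
import numpy as np

# Grid-search the coin-flip log-likelihood 55*ln(p) + 45*ln(1-p);
# the maximizer should be the MLE p = 55/100 = 0.55.
p_grid = np.linspace(0.01, 0.99, 9801)
loglik = 55 * np.log(p_grid) + 45 * np.log(1 - p_grid)
p_mle = p_grid[np.argmax(loglik)]
```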
Standard Errors and Hypothesis Testing
Suppose θ is a vector. We always think of θ as a column: θ′ = (θ₁, …, θ_N). Partial derivatives of a function h(θ):
∂h(θ)/∂θ′ = (∂h(θ)/∂θ₁ … ∂h(θ)/∂θ_N).
If h(θ) = (h₁(θ) … h_M(θ))′ is a vector of functions, then ∂h(θ)/∂θ′ has M rows and N columns.
-Ordinary Least Squares: when n is large, the estimator β̂_LS is approximately normally distributed: β̂_LS ~ N(β₀, σ²(X′X)⁻¹)
-Maximum Likelihood: when n is large, the estimator θ̂_ML is approximately normally distributed: θ̂_ML ~ N(θ₀, I(θ₀)⁻¹), where
I(θ₀) = −E[∂² ln p(x|θ₀) / ∂θ∂θ′]
and p(x|θ₀) is the likelihood function. In practice, θ₀ is replaced by θ̂.
Ex: IID Gaussian observations, mean μ, known variance σ². Parameter θ = μ.
ln p(x|μ) = ln Π_{t=1}^T p(x_t|μ) = Σ_{t=1}^T ln p(x_t|μ) = Σ_{t=1}^T [ln(1/√(2πσ²)) − (x_t − μ)²/(2σ²)]
Asymptotic distribution: μ̂_ML ~ N(μ, σ²/T)
MoM estimator: estimate p parameters with p moment conditions. Under mild regularity conditions, MM estimators are consistent. Assume the f(xᵢ, θ) are uncorrelated across observations. Moment conditions in a finite sample: Ê[f(xᵢ, θ)] = (1/n) Σ_{i=1}^n f(xᵢ, θ) = 0. Also define:
d̂ = ∂Ê[f(xᵢ, θ)]/∂θ′ |_{θ̂}, and Ŝ = Ê[f(xᵢ, θ̂) f(xᵢ, θ̂)′]
When n is large, the MM estimator θ̂_MM is approximately normally distributed.
Example: Sharpe Ratio distribution by the Delta Method. You have the mean and s.d. of monthly excess returns (μ̂, σ̂). The implied Sharpe ratio is then SR̂ = h(θ̂) = μ̂/σ̂. What is the 95% confidence interval for SR?
Suppose the asymptotic var-covar matrix of the parameter estimates θ̂ = (μ̂, σ̂)′ is estimated to be Ω̂. We then compute:
Â = ∂h(θ)/∂θ′ |_{θ̂} = (1/σ̂, −μ̂/σ̂²)
The variance of the SR estimate is: (1/σ̂, −μ̂/σ̂²) Ω̂ (1/σ̂, −μ̂/σ̂²)′
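The delta-method variance can be computed in a few lines; a sketch in which Ω̂ = diag(σ̂², σ̂²/2)/T is an assumed var-covar matrix (the IID-Gaussian case), used only to make the example self-contained:

```python
import numpy as np

# Delta method for SR_hat = mu_hat / sigma_hat:
# gradient A = (1/sigma_hat, -mu_hat/sigma_hat^2), Var(SR_hat) = A Omega A'.
mu_hat, sigma_hat, T = 0.01, 0.05, 240
Omega = np.diag([sigma_hat**2, sigma_hat**2 / 2]) / T   # assumed asy. var-covar
A = np.array([1 / sigma_hat, -mu_hat / sigma_hat**2])
var_sr = A @ Omega @ A
se_sr = np.sqrt(var_sr)
# 95% confidence interval for SR
ci = (mu_hat / sigma_hat - 1.96 * se_sr, mu_hat / sigma_hat + 1.96 * se_sr)
```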
Type I error: False rejection of a true null (false positive).
Type II error: Failure to reject a false null (false negative).
Test size: Upper bound on the prob. of rejecting the null
hypothesis over all cases in which the null is correct
Example: OLS. Suppose we run a regression of yt on a vector
of predictors xt: yt = β0+xt’β1+εt. Examples of tests: β0=0 (can
the portfolio manager generate alpha?), β1=0 (can xt (signal)
pred. yt, returns)? Test stat: πœ‰ = βΜ‚′ [π‘‰π‘Žπ‘Ÿ(βΜ‚)]−1 βΜ‚~πœ’ 2 (dim (β))
Test of size α: reject the null if ξ ≥ ξ̄, where CDF_{χ²(dim(β))}(ξ̄) = 1 − α
A large test size α raises the probability of a Type I error; a small test size raises the probability of a Type II error. The optimal size trades off expected losses:
α*(x) = argmin_α p(m = I | x, α) L(m = I) + p(m = II | x, α) L(m = II)
where L(m = I) and L(m = II) are the economic losses of Type I and Type II errors.
Example: Comparing Portfolio Managers. Suppose you are trying to choose between two portfolio managers, with two series of historical excess returns: (x₁¹,…,x_T¹) and (x₁²,…,x_T²). How do we see which PM is better? Observation vector: (x_t¹, x_t²)′
The parameter vector is: θ₀ = (μ₀¹, σ₀¹, μ₀², σ₀²)
The null hypothesis is: H₀: {μ₀¹/σ₀¹ − μ₀²/σ₀² = 0}
Define h(θ) = μ¹/σ¹ − μ²/σ². Compute the following:
Â = ∂h(θ)/∂θ′ |_{θ̂} = (1/σ̂¹, −μ̂¹/(σ̂¹)², −1/σ̂², μ̂²/(σ̂²)²)
Asymptotic variance of h(θ̂) = Â Ω̂ Â′
Rejection region, test of h(θ₀) = 0: A = {|h(θ̂)/√(Var̂[h(θ̂)])| ≥ z}
For a 5% test, set z = 1.96 = Φ⁻¹(0.975), where Φ is the standard normal CDF.
Linear Models
- Market model: R_{i,t}^e = α_i + β_i R_{m,t}^e + ε_{i,t}
-Univariate linear model (f): yᵢ = β₀ + β₁xᵢ + εᵢ. For any fitted model f̂ with coefficients (β̂₀, β̂₁), we can compute the fitting errors (residuals): ε̂ᵢ = yᵢ − ŷᵢ = yᵢ − f̂(xᵢ) = yᵢ − (β̂₀ + β̂₁xᵢ)
-Least Squares Estimators: find (β̂₀, β̂₁) that minimize the RSS
-Multiple Linear Regression → Data: (yᵢ, xᵢ₁,…,x_ip), i = 1,…,n
Model: yᵢ = β₀ + β₁xᵢ₁ + … + β_p x_ip + εᵢ, with βⱼ = ∂E(yᵢ|xᵢ₁,…,x_ip)/∂x_ij
Assumptions: linearity; full rank (X is an n×p matrix with rank p); exogeneity of the independent variables (E[εᵢ|X] = 0); homoscedasticity and nonautocorrelation (E[εε′|X] = σ²I)
Derivation of the LSE: find β̂ that minimizes the RSS:
min_β Σ_{i=1}^n εᵢ² = ε′ε = (Y − Xβ)′(Y − Xβ). Take the derivative of the RSS and set it to 0 (FOC to find min RSS). This gives us 0 = −2X′(Y − Xβ). Solving for β: X′Xβ = X′Y → β̂ = (X′X)⁻¹X′Y, and Var[β̂|X] = σ²(X′X)⁻¹.
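The closed form above can be sketched with simulated data (the design matrix and true β are illustrative):

```python
import numpy as np

# Closed-form least squares: beta_hat = (X'X)^(-1) X'y, Var = sigma2 (X'X)^(-1).
rng = np.random.default_rng(1)
n, beta_true = 500, np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])  # RSS / (n - p)
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)
```

Using `solve` on the normal equations avoids explicitly inverting X′X.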
The least squares estimator is BLUE (best linear unbiased
estimator). Best bc it has the min variance among all linear
unbiased estimators (Gauss-Markov Theorem).
Estimating σ²: σ̂² = RSS/(n − p), where RSS = Σ_{i=1}^n (yᵢ − xᵢ′β̂)² = ε̂′ε̂
Residual Standard Error: RSE = σ̂ = √(RSS/(n − p))
R² = 1 − RSS/TSS; Adj. R² = 1 − (RSS/(n − p))/(TSS/(n − 1))
t-statistic: t = βΜ‚ 𝑗 ⁄𝑆𝐸( βΜ‚ 𝑗 ) with n-p degrees of freedom
P-value: probability of a statistic equal to or above |t|, assuming βⱼ = 0
Confidence Interval: [βΜ‚ j − t α/2 𝑆𝐸(βΜ‚ 𝑗 ), βΜ‚j + t α/2 𝑆𝐸(βΜ‚ 𝑗 )]
F-statistic: do any of the predictors show significant effects?
F = ((TSS − RSS)/(p − 1)) / (RSS/(n − p))
Multicollinearity: when two or more predictors are closely related, the LSE's accuracy is greatly reduced. To diagnose, find the variance inflation factor: VIF(β̂ⱼ) = 1/(1 − R²_{Xj|X₋j}), where R²_{Xj|X₋j} is the R² from the regression Xj = X₋j γ + ε
Two types of specification errors in regression models:
Omission of relevant param, inclusion of irrelevant param.
Omission causes bias, inclusion does not
Event Studies
Several types of events: corporate, macro, policy, natural
Anatomy of an event study: event definition and event
window, selection criteria for firms and securities,
specification and estimation of reference model for “normal”
returns, computation and aggregation of “abnormal” returns,
hypothesis testing, interpretation. The event study setup:
Timeline: estimation window (T0 to T1], event window (T1 to T2] containing the event date t = 0, and post-event window (T2 to T3].
N is the # of securities, x̂ is our estimate of x, and L₁ = T₁ − T₀, L₂ = T₂ − T₁, L₃ = T₃ − T₂
For the reference model for "normal" returns (non-event days), keep it simple. Market model: R_{i,t} = α_i + β_i R_{m,t} + ε_{i,t}
Parameters of the reference model usually estimated over the
estimation window and held fixed over the event window.
Abnormal Returns: our null hypothesis → the event does not change the distribution of abnormal returns. Define ε̂ᵢ* as the (L₂ × 1) vector of estimated abnormal returns for security i over the period t = T₁+1,…,T₂. According to the estimated market model:
ε̂ᵢ* = Rᵢ* − α̂ᵢ1 − β̂ᵢR_m* = Rᵢ* − Xᵢ*θ̂ᵢ, where:
Rᵢ* = [R_{i,T₁+1} … R_{i,T₂}]′, R_m* = [R_{m,T₁+1} … R_{m,T₂}]′, Xᵢ* = [1 R_m*], θ̂ᵢ = [α̂ᵢ β̂ᵢ]′
Distribution of abnormal returns: conditional on the market returns over the event window R_m*, ε̂ᵢ* ~ N(0, Vᵢ), where
Vᵢ = Iσ²_{εᵢ} + Xᵢ*(Xᵢ′Xᵢ)⁻¹Xᵢ*′σ²_{εᵢ}
Intuition: OLS is unbiased → E[ε̂ᵢ*|Xᵢ*] = E[Rᵢ* − Xᵢ*θ̂ᵢ|Xᵢ*] = E[Rᵢ* − Xᵢ*θᵢ|Xᵢ*] = 0
Vᵢ is not diagonal! Estimated abnormal returns are serially correlated due to sampling error in θ̂ᵢ. If L₁ is large, Vᵢ ≈ Iσ²_{εᵢ}
Under H0, abnormal returns are IID over time and across securities.
Aggregating over time => cumulative effect of the event
Aggregating across securities =>more precise measurements
Aggregating abnormal returns: Over the Event Window
Cumulative abnormal returns for security i over the interval (τ₁, τ₂) (any part of the event window, i.e., T₁ < τ₁ ≤ τ₂ ≤ T₂):
CAR̂ᵢ(τ₁, τ₂) = γ′ε̂ᵢ*, where γ is an (L₂ × 1) vector with ones in positions τ₁ − T₁ through τ₂ − T₁ and zeros elsewhere.
Variance of CAR̂ᵢ(τ₁, τ₂): σᵢ²(τ₁, τ₂) = γ′Vᵢγ. Under H0:
CAR̂ᵢ(τ₁, τ₂) ~ N(0, γ′Vᵢγ)
Aggregating abnormal returns: Across Securities
Simple average of the abnormal return vectors of N securities:
ε̄* = (1/N) Σ_{i=1}^N ε̂ᵢ*, and Var[ε̄*] = V̄ = (1/N²) Σ_{i=1}^N Vᵢ
Cumulative average abnormal returns from τ₁ to τ₂:
CAR̄(τ₁, τ₂) = γ′ε̄*, where γ is an (L₂ × 1) vector with ones in positions τ₁ − T₁ through τ₂ − T₁ and zeros elsewhere.
Variance of CAR̄(τ₁, τ₂): σ̄²(τ₁, τ₂) = γ′V̄γ. Under H0:
CAR̄(τ₁, τ₂) ~ N(0, γ′V̄γ)
Hypothesis Testing: is the cumulative abnormal return for security i over (τ₁, τ₂) equal to 0? Standardized cumulative abnormal return: SCARᵢ(τ₁, τ₂) = CAR̂ᵢ(τ₁, τ₂)/σ̂ᵢ(τ₁, τ₂). Replace hats with bars for the across-securities version.
If event windows overlap across securities, their abnormal
returns are more likely to be correlated. Example: companies
announce earnings on the same day or week. Solution: Make
portfolio of securities that have the same event window.
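The estimation-window/event-window mechanics can be sketched for one security with simulated data; all numbers below (window lengths, α, β, the injected 5% abnormal jump) are illustrative:

```python
import numpy as np

# Minimal event-study sketch: fit the market model on the estimation window,
# compute abnormal returns over the event window, aggregate into a CAR.
rng = np.random.default_rng(3)
L1, L2 = 250, 21                   # estimation- and event-window lengths
rm = 0.0004 + 0.01 * rng.standard_normal(L1 + L2)
alpha, beta = 0.0001, 1.2
ri = alpha + beta * rm + 0.002 * rng.standard_normal(L1 + L2)
ri[L1 + 10] += 0.05                # 5% abnormal jump at the "event" date

# Estimate (alpha, beta) on the estimation window only, hold them fixed.
X = np.column_stack([np.ones(L1), rm[:L1]])
theta_hat = np.linalg.lstsq(X, ri[:L1], rcond=None)[0]

# Abnormal returns over the event window, then CAR over the full window.
Xstar = np.column_stack([np.ones(L2), rm[L1:]])
ar = ri[L1:] - Xstar @ theta_hat
car = ar.sum()
```

The CAR should recover roughly the injected 0.05 abnormal return, up to sampling noise.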
Linear Models (continued)
Omitted variable formula: E[b₁|X] = β₁ + (X₁′X₁)⁻¹X₁′X₂β₂. Bias exists unless β₂ = 0 or X₁′X₂ = 0.
Other considerations: Are there influential outliers? Is it data
error or informative observation? Is there a systematic trend
between the absolute residuals and the predicted responses?
Is the data linear? Is it nonstationary?
Linear Regression ≠ Linear Relationship
With a linear model, we can still capture some nonlinear effects by adding interactions and nonlinear terms (e.g. x², ln(x), e^x, …).
Example: R_{m,t+1} = a + b ln(D/P)_t + ε_{t+1}
We might suspect the predictive power of the dividend yield changes with market volatility (use VIX as a proxy):
R_{m,t+1} = a + b ln(D/P)_t + c VIX_t + d ln(D/P)_t VIX_t + ε_{t+1}
Interpretation: R_{m,t+1} = a + (b + d VIX_t) ln(D/P)_t + c VIX_t + ε_{t+1}
Hierarchical Principle: If we include an interaction in a model,
we should also include the main effects, even if the p-values
associated with their coefficients are not significant.
Financial Time Series
-Stationarity and Autocorrelation Function
Weak Stationarity: a time series {x_t} is weakly stationary if E(x_t) = μ, E(x_t − μ)² = γ₀ < ∞, and Cov(x_t, x_{t−j}) = γⱼ for any integer j.
Autocorrelation: for a weakly stationary time series {x_t}, the correlation of x_t and x_{t−k} is the lag-k autocorrelation of x_t:
ρₖ = Cov(x_t, x_{t−k})/√(Var(x_t)Var(x_{t−k})) = γₖ/γ₀
AR model: Regression with lagged variables.
AR(1): x𝑑+1 = πœ‘0 + πœ‘1 x𝑑 + πœ€π‘‘+1 , πœ€π‘‘ is Gaussian white noise
Model of 1-year Treasury rates (using data from 1953-2020): r_{t+1} = 0.033 + 0.993r_t + ε_{t+1}
Properties of AR(1): stationarity → |φ₁| < 1. Mean: E(r_t) = φ₀/(1 − φ₁) = μ. Variance: Var(r_t) = σ²/(1 − φ₁²). Autocorrelations: ρ₁ = φ₁, ρ₂ = φ₁², ρₖ = φ₁ᵏ.
Demeaned representation: r_{t+1} − μ = φ₁(r_t − μ) + ε_{t+1}
Forecasting with AR(1): 1-step-ahead forecast →
E_t[r_{t+1}] = φ₀ + φ₁r_t = μ + φ₁(r_t − μ), as μ = φ₀/(1 − φ₁)
1-step-ahead forecast error: r_{t+1} − E_t[r_{t+1}] = ε_{t+1}
ζ-step-ahead forecast: E_t[r_{t+ζ}] = μ + φ₁^ζ(r_t − μ)
Half-life: the time τ for the average distance from the mean to shrink by half, |E_t[r_{t+τ}] − μ| = (1/2)|r_t − μ|, so τ = −ln(2)/ln(|φ₁|) for |φ₁| < 1.
AR(p): x_{t+1} = φ₀ + φ₁x_t + … + φ_p x_{t+1−p} + ε_{t+1}, ε_{t+1} ~ N(0, σ²)
VAR(p) (Vector AutoRegressive Model): x_{t+1} = a₀ + A₁x_t + … + A_p x_{t+1−p} + ε_{t+1}, where x_t and a₀ are N-dimensional vectors, the A_n are N×N matrices, and ε_t is an N-dimensional vector of shocks.
Moving-average Models: an MA model has shocks with finite life. MA(1): x_t = μ + ε_t − θ₁ε_{t−1} → always stationary, autocorrelated shocks, finite memory: ρ_ζ = 0 for ζ > 1.
MA(q): x_t = μ + ε_t − θ₁ε_{t−1} − … − θ_q ε_{t−q}. Combining MA and AR → ARMA. ARMA(1,1): x_{t+1} = φ₀ + φ₁x_t + ε_{t+1} − θε_t
Example of MA(1): Bid-ask Bounce. Market makers buy at P_b and sell at P_a. Market price: P_t = P* + I_t(S/2), where P* is the fundamental value in a frictionless market, S = P_a − P_b is the bid-ask spread, and I_t is IID Bernoulli (taking values 1 and −1 with p = 0.5). Interpretation: buyer-initiated (I_t = 1) vs. seller-initiated (I_t = −1) transactions. Observed price change:
ΔP_t = (I_t − I_{t−1})(S/2). Autocovariance: Cov(ΔP_t, ΔP_{t−j}) = S²/2 if j = 0, −S²/4 if j = 1, and 0 if j > 1. Autocovariance of returns can be an indicator of illiquidity!
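The bid-ask-bounce autocovariance pattern can be confirmed by simulation (S = 0.10 and the sample size are illustrative):

```python
import numpy as np

# Simulate P_t = P* + I_t * S/2 with I_t = +/-1 and check the autocovariances
# of price changes: S^2/2 at lag 0, -S^2/4 at lag 1, ~0 beyond.
rng = np.random.default_rng(6)
T, S = 200000, 0.10
I = rng.choice([-1, 1], size=T)
dP = np.diff(I) * (S / 2)         # Delta P_t = (I_t - I_{t-1}) * S/2

def autocov(x, j):
    n = len(x)
    return np.mean((x[: n - j] - x.mean()) * (x[j:] - x.mean()))

g0, g1, g2 = autocov(dP, 0), autocov(dP, 1), autocov(dP, 2)
```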
Nonstationary Time Series: Random walk (with drift) →
x𝑑+1 = π‘Ž + x𝑑 + πœ€π‘‘+1 . Popular model for log stock prices.
Nonstationary (unit root), 1st difference becomes white noise
Permanent Shocks. Time trend: x𝑑 = π‘Žπ‘‘ + x0 + πœ€π‘‘ + β‹― + πœ€1
Trend-stationary time series: x𝑑 = 𝛽0 + 𝛽1 𝑑 + 𝑦𝑑 ,
yt is a stationary time series such as an AR(1)
Unit-root test (Dickey-Fuller test): x𝑑+1 = πœ‘0 + πœ‘1 x𝑑 + πœ€π‘‘+1
H0 : πœ‘1 = 1 vs. Ha : πœ‘1 < 1. 𝐷𝐹 = (πœ‘Μ‚1 − 1)⁄(𝑆𝐸( πœ‘Μ‚1 ))
DF test used to find non-stationarity in time series data.
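The DF regression can be run directly with least squares; a sketch on a simulated random walk, where the sample size is illustrative (note the DF statistic has a nonstandard distribution under the null, so normal critical values do not apply):

```python
import numpy as np

# Dickey-Fuller statistic DF = (phi1_hat - 1)/SE(phi1_hat) from the
# regression x_{t+1} = phi0 + phi1 x_t + eps, applied to a random walk.
rng = np.random.default_rng(4)
T = 2000
rw = np.cumsum(rng.standard_normal(T))     # unit-root series: phi1 = 1

X = np.column_stack([np.ones(T - 1), rw[:-1]])
y = rw[1:]
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
s2 = resid @ resid / (len(y) - 2)
se_phi1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
df_stat = (b[1] - 1.0) / se_phi1
```

For a random walk, φ̂₁ should be very close to 1 and the DF statistic moderate in magnitude; a stationary AR(1) input would instead give a large negative DF statistic.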
MLE for Dependent Observations: the MLE approach works even if observations are dependent, provided the dependence dies out quickly enough. Consider a time series x₁, x₂, … and assume the distribution of x_{t+1} depends only on L lags: x_t, …, x_{t+1−L}.
Likelihood function: p(x|θ) = p(x₁|θ) ⋯ p(x_L|x_{L−1},…,x₁; θ) ∏_{t=L+1}^T p(x_t|x_{t−1},…,x_{t−L}; θ)
θ̂ maximizes the (conditional) likelihood:
L(x_{L+1},…,x_T|x_L,…,x₁; θ) ≡ ∏_{t=L+1}^T p(x_t|x_{t−1},…,x_{t−L}; θ)
a good approximation if T is large and x_t is stationary.
Short-hand notation: L(θ) and ℒ(θ) ≡ ln L(θ)
MLE for AR(p) Time Series: conditional on the previous observations, x_{t+1} is Gaussian with mean a₀ + a₁x_t + … + a_p x_{t+1−p} and variance σ².
Log-likelihood: ℒ(θ) = Σ_{t=p}^{T−1} [−ln√(2πσ²) − (x_{t+1} − a₀ − a₁x_t − … − a_p x_{t+1−p})²/(2σ²)]
MLE estimates of (a₀, a₁,…, a_p) are the same as OLS:
argmax_{a₀..p} ℒ(θ) = argmin_{a₀..p} Σ_{t=p}^{T−1} (x_{t+1} − a₀ − a₁x_t − … − a_p x_{t+1−p})²
MLE for VAR(p) Time Series: construct the conditional log-likelihood:
ℒ(θ) = Σ_{t=p}^{T−1} [−ln√((2π)^N |Ω|) − (1/2) ε′_{t+1} Ω⁻¹ ε_{t+1}]
AIC and BIC: start by specifying a maximum possible order p̄ for the VAR(p). p̄ grows with the sample size, but not too fast: p̄ → ∞ as T → ∞, while p̄/T → 0.
p* = argmax_{0≤p≤p̄} [(2/T)ℒ(θ̂; p) − penalty(p)], where penalty(p) = (2/T)pN² for AIC and (ln T/T)pN² for BIC. In larger samples, BIC selects lower-order models than AIC.
Seasonality introduces serial correlation in a particular way. Work with log revenue, x_t = log(quarterly revenue), as focusing on growth rates rather than levels adds stationarity. The airline model is:
(x_t − x_{t−1}) − (x_{t−s} − x_{t−s−1}) = (ε_t − θ₁ε_{t−1}) − θ_s(ε_{t−s} − θ₁ε_{t−s−1})
Classification
-Predicting binary or categorical data is referred to as
classification. Need function F(x, θ) to assign y as 0 or 1.
-Many potential choices for F(), i.e. probit or logit model.
Log odds ratio: LO = log[p(y=1|x)/p(y=0|x)].
LO is continuous and ranges from negative infinity to infinity. We can model LO as a linear function of x: LO = θ′x. This gives us the logit model: p(y=1|x) = e^{θ′x}/(1 + e^{θ′x}). After estimating the logistic regression coefficients θ̂ we can make predictions: p̂(y=1|x) = e^{θ̂′x}/(1 + e^{θ̂′x}). We can also classify the data based on a cutoff rule with threshold p̄:
predict y = 1 if p̂(y=1|x) > p̄, else y = 0
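The prediction and cutoff rule above reduce to a few lines; a sketch where θ = (−1, 2) is an assumed coefficient vector, not one estimated from data:

```python
import numpy as np

# Logit prediction p(y=1|x) = e^z / (1 + e^z) with z = theta'x,
# followed by a cutoff-rule classification at threshold p_bar.
theta = np.array([-1.0, 2.0])     # (intercept, slope) -- assumed, not fitted

def p_hat(x):
    z = theta[0] + theta[1] * x
    return np.exp(z) / (1.0 + np.exp(z))

def classify(x, p_bar=0.5):
    return int(p_hat(x) > p_bar)
```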
Confusion matrix: Type 1 error rate (false positive) =𝐹𝑃⁄𝑁
Type II error rate (false negative) = 𝐹𝑁⁄𝑃
Economic loss function: suppose the dollar costs of making Type I and Type II errors are L_I and L_II. Goal, maximize expected profit:
(N/(N+P))(1 − FP/N)L_I + (P/(N+P))(FN/P)(−L_II)
[profit of lending to good borrowers + loss of lending to bad borrowers]
= (N/(N+P))L_I − [(N/(N+P))(FP/N)L_I + (P/(N+P))(FN/P)L_II]
[max expected profit − expected loss]
Optimal decision rule: Find p_bar to minimize the expected economic loss
Multi-Class Logistic Regression: suppose y takes value ℓ, where ℓ ∈ {1,…,C}. Model the log odds ratio for class c relative to class C as before:
log[p(y=c|x)/p(y=C|x)] = θ_c′x. Normalized against the C-th class, the choice of C does not matter. Notice that θ_c is a p×1 vector that changes with c.
Implies: p(y=c|x,θ) = exp(θ_c′x)/(1 + Σ_{ℓ=1}^{C−1} exp(θ_ℓ′x)) and
p(y=C|x,θ) = 1/(1 + Σ_{ℓ=1}^{C−1} exp(θ_ℓ′x))
K-Nearest Neighbors: instead of fitting a linear function to the data, try predicting the value of y based on the observations nearby. KNN does the following: identify the K points in the training set that are closest to x₀, denoted by N₀. The conditional probability of y belonging to class c is:
p(y=c|x=x₀) = (1/K) Σ_{i∈N₀} I(yᵢ = c). Pick cutoff probabilities.
What do we do if a model has high bias or variance? If a model
has high bias (underfitting), it means the model is too
simplistic and is not capturing the underlying patterns in the
data. To address this, you can increase the model's complexity
by adding more features, using a more flexible algorithm, or
reducing constraints. Additionally, gathering more relevant or
informative data can help improve the model's performance.
If a model has high variance (overfitting), it means the model
is too sensitive to noise or fluctuations in the training data
and fails to generalize well. To tackle this issue, you can
simplify the model by reducing its complexity, such as
decreasing the number of features or using feature selection
techniques. Increasing regularization, applying techniques
like dropout or early stopping, and gathering more training
data can also help reduce overfitting and improve the model's
ability to generalize to new data.
What is the difference between Ridge Regression and LASSO,
when do you want to use one or the other? Ridge Regression
uses L2 regularization and is suitable when you want to
mitigate multicollinearity and keep all features. LASSO uses
L1 regularization and performs both regularization and
feature selection, making it suitable for identifying a subset of
relevant features. The choice depends on the problem and the
desired outcome.
Model Selection
-True model: 𝑦𝑖 = 𝑓(π‘₯𝑖 ) + πœ€π‘– where π‘₯𝑖 is a px1 vector, E[πœ€π‘– ] =
0, π‘‰π‘Žπ‘Ÿ[πœ€π‘– ] = σ2, and Cov(πœ€π‘– , πœ€π‘— ) = 0. We try to find a “good”
fitted model fΜ‚(x) to predict y. Need feature selection (vars to
use in x) and functional form selection (class of fΜ‚ to use).
𝑀𝑆𝐸 = 1⁄π‘š ∑π‘š
̂𝑖 − fΜ‚(x𝑖 ))2 . For classification, use:
𝑖=1(𝑦
𝐢𝐸 = 1⁄𝑛 ∑𝑛𝑖=1 𝐼(𝑦𝑖 ≠ 𝑦̂𝑖 )
MSE(fΜ‚(x0 )) = 𝐸[(𝑦0 − fΜ‚(x𝑖 ))2 |π‘₯0 ] =
Model Selection
-Decision trees: The goal is to predict y using a px1 vector of
features x. x is in a p-dimensional feature space.
The idea of tree-based methods is to divide up the feature
space into a set of rectangles {R1,…, Rm}.
If x falls into a rectangle Rj, the prediction f(x) will be based
on the “consensus” of the observations from the training
sample that fall into Rj. You can have regression trees (mean_
or classification tree (mode).
= σ2 + (𝑓(π‘₯0 ) − 𝐸[fΜ‚(x0 )])2 + 𝐸[(𝐸[fΜ‚(x0 )] − fΜ‚(x0 ))2 |x0 ] Regression trees: the prediction is the mean of response
values in training data in Rj. 𝑐𝑗 = 1⁄𝑛𝑗 ∑i∈Rj 𝑦𝑖 where nj is the
Irreducible
Bias
Variance
num. of observations from the training sample that are in Rj
Sources of prediction error: σ2 (irreducible), Bias (error due
Classification tree: The prediction will be the most common
to using the wrong model f vs. fΜ‚), and Variance (impact of the response among the training observations in Rj.
1
𝑐𝑗 = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯(π‘˜)π‘Μ‚π‘˜ (Rj) where π‘Μ‚π‘˜ (Rj) = 𝑛 ∑i∈Rj 1{ 𝑦𝑖 = π‘˜}.
randomness of the training sample on the fitted model Μ‚f )
𝑗
Bias-variance trade-off: Bias is not necessarily a bad thing. πœ†
Having a bigger bias in exchange for smaller variance might
Can be interpreted as the fraction of training obs. in Rj in class
be good for overall prediction accuracy. For example: OLS =
k. How can we divide up the feature space? Which set of
BLUE. But there exists other (biased) linear estimators that
rectangles gives us the best fit in the training data. With a
produce smaller MSE.
regression tree, this can be done by minimizing the
2
How to trade off bias against variance? Indirect methods;
RSS(residual sum of squares): min(𝑅𝑖 ) ∑π‘š
𝑗=1 ∑i∈Rj(𝑦𝑖 − 𝑐𝑗 )
Make (theoretical) adjustment to the training error by
Need to consider all possible ways to divide up the feature
penalizing model flexibility. Direct methods; Estimate MSE
space using rectangles, a computationally difficult problem.
directly via validation or cross-validation.
Top-Down Tree-Building Process: Starting at the top of the
Validation: Split the data randomly into two, a training set and tree (root), pick one predictor π‘₯β„“ out of the p features, and
a validation (or hold out set. Fit the model fΜ‚(•) using the
split the feature space into two regions based on some
training set. Use the fitted model to make predictions on the
threshold s: {x: π‘₯β„“ ≤ 𝑠}, {x: π‘₯β„“ > 𝑠}. Keep splitting in each new
validation sett and compute the prediction errors (MSE). This region until a certain stopping criterion is met, ie. When the
approach is inefficient, not using all info to determine fΜ‚(•). If
number of observations remining in a region is below some
the sample is small, both fitting/testing become unreliable.
threshold. The terminal nodes are called leaves, which
K-fold cross-validation: Partition the sample size n randomly correspond to the final set of regions: R1, R2,…, Rm. The points
into K separate sets (typically with equal sizes). For each
along the tree where the split occurs are called internal nodes.
k=1,2,…,k, fit the model fΜ‚(•) to the full training set excluding
At each internal node, how to choose the splitting variable π‘₯β„“
the kth set. Compute the total cross-validation error for the kth and threshold s? Use a greedy strategy. See CART.
1
subset: π‘‡πΈπ‘˜ = ∑𝑖∈πΉπ‘˜(𝑦𝑖 − fΜ‚−π‘˜ (π‘₯𝑖 ))2.Compute:𝐢𝑉𝐸 = 𝑛 ∑πΎπ‘˜=1 π‘‡πΈπ‘˜ CART for Regression: For any pair π‘₯β„“,s an old region is split
Repeat steps for each model, choose model with lowest CVE
into two new regions→R1 = {x: π‘₯β„“ ≤ 𝑠}, R2 = {x: π‘₯β„“ > 𝑠}.
Cross validation for time series data: Choose the minimum
For a regression tree, we choose (β„“, 𝑠) by minimizing the RSS:
size of training set K. For i=1,2,…,T-k, select the observation
π‘Žπ‘Ÿπ‘”π‘šπ‘–π‘›(β„“, 𝑠) ∑i∈R1 (𝑦𝑖 − 𝑐1 )2 + ∑i∈R2(𝑦𝑖 − 𝑐2 )2 , where 𝑐𝑗 is the
at t=k+I as the test set. Use observations from t=1 to t=k+i-1 average response in region Rj. The greedy strategy favors
as the training sample. Fit the model on the training sample
splits that give more pure regions in the immediate step.
and compute the prediction error for observation at t=k+i.
CART for Classification: For a classification tree, we can
Subset selection: Can be hard to test all potential models.
choose (β„“, 𝑠) by minimizing the classification error:
Best subset selection, start with the null model, Uo, which
π‘Žπ‘Ÿπ‘”π‘šπ‘–π‘›(β„“, 𝑠) (n1 (1 − 𝑝̂𝑐1 (𝑅1 )) + (n2 (1 − 𝑝̂𝑐2 (𝑅2 )) where
𝑝
contains no predictors. For k=1,2,…,p, fit all ( ) models that
π‘Μ‚π‘˜ (Rj) is the fraction of training observations in Rj that belong
π‘˜
contain k predictors. Pick the best among them, Uk, with the
to class k. In the region Rj, cj is the most common class.
smallest RSS or largest R2. Select single best model among Ui
In practice, classificatoion error is often not sensitive enough
for i=0..p using cross-validation, AIC, BIC, adjusted R2.
for tree-growing. Two alternative criteria: Gini Index
Comparatively, with forward stepwise selection, for all k,
(A/A+B) and cross-entropy (LCE = -∑𝑛𝑖=1 𝑑𝑖 log (𝑝𝑖 )).
𝑝
consider all p-k models instead of ( ).
Pruning: Build a large tree To until stopping condition, and
π‘˜
then prune the tree back. For tree T, |t| is the # of leaves.
Ridge Regression: Control variance Var(x0’βΜ‚) by regularizing Among all trees that are a subset of large Tree T0, look for the
|𝑇|
𝑝
the coefficients. Model→min(β) ∑𝑛𝑖=1(𝑦𝑖 − π‘₯𝑖′ β)2 + πœ† ∑𝑗=1 β2𝑗
one that minimizes:min(𝑇 ⊆ 𝑇0 ) ∑𝑗=1(1 − 𝑝̂𝑐𝑗 (𝑅𝑗 )) + 𝛼|𝑇|
The first term is OLS, the new part is called an β„“2 penalty.
Sum is classification error, replace with RSS if regression tree
π‘Ÿπ‘–π‘‘π‘”π‘’
Solution: Μ‚βπœ†
= (𝑋 ′ 𝑋 + πœ†πΌπ‘ )−1 𝑋′𝑦. πœ† is a tuning parameter.
We can pick α through cross-validation.
Generally, bias increases and var. decreases as πœ† increases
Bagging: Trees are not very accurate in prediction. Bagging
π‘Ÿπ‘–π‘‘π‘”π‘’
𝑂𝐿𝑆
π‘Ÿπ‘–π‘‘π‘”π‘’
As πœ† → 0: Μ‚βπœ†
→ Μ‚β . As πœ† → 𝑖𝑛𝑓𝑖𝑛𝑖𝑑𝑦: Μ‚βπœ†
→ 0.
averages the trees over a collection of bootstrap samples to
AIC and BIC penalize a model by the number of parameters.
reduce the variance of the prediction.
Therefore, we cannot use them to determine our πœ† for the
Bootstrap is a Monte Carlo study that treats the empirical dist.
ridge regression, as πœ† does not change the number of the
as the dist. Improves function’s stability, helps improve the
parameters, only their magnitude.
precision of asymptotic approximations in small samples
The LASSO approach: the penalty in ridge regression reduces (confidence intervals, test rejection regions, etc.).
the size of coefficients, but does not drive them to zero. Does
Bagging Algorithm via bootstrap aggregation. Draw B
not help with variable selection, but useful if p>>n. LASSO
bootstrap samples, each with n observations. Fit a tree fΜ‚tree, b
(least absolute shrinkage and selection operator) has an β„“1
to bootstrap sample b, b=1,…,B (with or without pruning).
penalty. Model→min(β) ∑𝑛𝑖=1(𝑦𝑖 − π‘₯𝑖′ β)2 + πœ† ∑𝑝𝑗=1 |β𝑗 |
For classification, take the majority vote among B trees:
LASSO and ridge regression generally perform comparably in fΜ‚ bag (x) = argmax(k=1,…,K)∑𝐡𝑏=1 1{ fΜ‚ tree,b (x) = π‘˜}.
terms of prediction error, but LASSO has advantage in doing
For regression, take the average of predictions from B trees: :
variable selection by setting some of the coefficients to zero.
fΜ‚ bag (x) = 1⁄𝐡 ∑𝐡𝑏=1 fΜ‚ tree,b (x)
You can choose πœ† with k-fold cross-validation.
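The ridge closed form and its shrinkage behavior can be checked directly; a sketch with a simulated design where the true β and the λ values are illustrative:

```python
import numpy as np

# Ridge closed form beta_hat = (X'X + lambda*I)^(-1) X'y.
# At lambda = 0 it equals OLS; larger lambda shrinks the coefficients.
rng = np.random.default_rng(5)
n, p = 200, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```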
Random forest improves on bagging further. W/ bagging, the
trees are not independent, because they are fit on similar
data. We can make the trees on different bootstrap samples
less correlated by only considering a random selection of h
predictors (usually h ≈ √𝑝 ). Throwing away data helps as the
benefit of less correlated trees outweighs the costs. While RF
typically improves the prediction accuracy, it loses much of
the appeal of decision trees in interpretability.
Measuring Predictor importance: Ensemble methods are good
at reducing variance, but at the expense of interpretability.
For bagging, RF, and boosting, we can still measure how
important each variable is for the overall model’s
performance. For each tree, we can add up the total amount of
RSS (or prediction error) that is reduced due to the splits over
a given predictor. Averaging over the B trees gives us the
variable importance measure.
Metrics: R-squared, adjusted R-squared, AIC, BIC, accuracy, MSE, ROC curve. What is the formula for each of these metrics? How do you use them?
R-squared: measures the proportion of the variance in the dependent variable that can be explained by the independent variables in a regression model; the ratio of the explained sum of squares to the total sum of squares. Formula: R-squared = 1 − (sum of squared residuals / total sum of squares). Usage: evaluates the goodness-of-fit of a regression model. A higher R-squared indicates a better fit, as a larger proportion of the variance in the dependent variable is explained by the model.
Adjusted R-squared: adjusts R-squared for the number of predictors in the model, penalizing the inclusion of unnecessary variables; it takes into account the degrees of freedom and the sample size. Formula: Adj. R-squared = 1 − [(1 − R-squared)(n − 1)/(n − k − 1)]. Usage: useful when comparing models with different numbers of predictors; helps determine whether adding a predictor improves the fit beyond what is expected by chance.
AIC (Akaike Information Criterion): a measure of the quality of a statistical model, balancing goodness-of-fit against the number of parameters; it penalizes complex models to avoid overfitting. Formula: AIC = −2 · log-likelihood + 2 · (number of parameters). Usage: model selection among a set of candidates; the model with the lowest AIC is preferred, indicating a better trade-off between complexity and fit.
BIC (Bayesian Information Criterion): similar to AIC but with a higher penalty on model complexity; particularly useful with small samples. Formula: BIC = −2 · log-likelihood + ln(n) · (number of parameters). Usage: like AIC, used for model selection; its more stringent penalty leads to simpler models than AIC.
Accuracy: a classification metric measuring the proportion of correct predictions over the total number of predictions made. Formula: accuracy = (number of correct predictions)/(total number of predictions). Usage: an overall measure of how well a model predicts the correct class labels.
MSE (Mean Squared Error): the average squared difference between predicted and actual values in a regression model. Formula: MSE = (sum of squared residuals)/(number of observations). Usage: quantifies the average prediction error; lower values indicate better predictive performance.
Final Review Questions and Answers:
What is an estimator? An estimator in statistics is a rule or a formula that allows you to use sample data to calculate an estimate of a population parameter. For example, if you're trying to estimate the average height of adults in a city, you might
take a sample, measure their heights, and calculate the average — that calculation is your estimator. Estimators can be evaluated for their bias, consistency, efficiency, sufficiency, and robustness.
What is a random sample? A random sample is a group of items or individuals chosen from a larger population, where every member has an equal chance of being selected. This method helps ensure that the sample is representative of the entire
population.
What is Method of Moments Estimation? The Method of Moments is an approach to estimate the parameters of a statistical model. It matches population moments (like mean or variance) with sample moments. The 'moments' are expected
values of powers of the random variable. For instance, if you have a sample from a distribution and you want to estimate parameters, you would: Compute sample moments (like the sample mean). Equate these to the theoretical moments of the
distribution (expressed in terms of the parameters). Solve the resulting equations to find parameter estimates.
Use the Method of Moments Estimator on Bernoulli Random Variables. The Bernoulli distribution is a discrete distribution taking value 1 with probability of success p and value 0 with probability of failure 1-p. Here's how you can use the method
of moments to estimate p. First moment (mean) of Bernoulli distribution: The theoretical mean of a Bernoulli distribution is p. First moment (mean) of the sample data: Let's assume you have a sample of n Bernoulli-distributed random variables.
The sample mean is the sum of these observations divided by the sample size n. Let's denote this as x̄ (x-bar). According to the method of moments, you set the population moment equal to the sample moment to solve for the parameter. So, p = x̄.
For example, if you had a sample of 100 observations with 37 successes (value of 1), your estimate for p using the method of moments would be 37/100 = 0.37.
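The Bernoulli example can be checked by simulation; a small sketch (seed and sample size are arbitrary choices):

```python
import random

random.seed(0)
p_true = 0.37  # the parameter we pretend not to know
n = 10_000
# Simulate n Bernoulli(p_true) draws.
sample = [1 if random.random() < p_true else 0 for _ in range(n)]
# Method of moments: set the population mean (p) equal to the sample mean.
p_hat = sum(sample) / n
# p_hat should land close to 0.37, with sampling error shrinking as n grows.
```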
What is the Maximum Likelihood estimation method? The Maximum Likelihood Estimation (MLE) method is a statistical technique used for estimating the parameters of a model. The method works by finding the parameters that maximize the
likelihood function, which measures the probability of observing the given data given the parameters. The steps involved in MLE are: Define the likelihood function, which is a probability function that expresses how likely the observed data is for
different values of the parameters. Find the parameters that maximize this likelihood function. In essence, MLE finds the most likely values of the parameters, given the observed data. These are the parameter values that make the observed data as
likely as possible.
Use the Maximum Likelihood Estimator on Bernoulli Random Variables. Consider a Bernoulli distribution, which is a discrete distribution with outcomes 0 or 1. Here, the parameter to estimate is p, which represents the probability of success
(getting a 1). Suppose you have n independent observations from this distribution. Let's denote the number of successes (1s) in the sample as S. Likelihood function: For a single Bernoulli trial, the likelihood is p^x * (1-p)^(1-x), where x is either 0 or
1. For n independent trials, the likelihood is the product of individual likelihoods, which simplifies to: p^S * (1-p)^(n-S). Log-likelihood function: Taking the log of the likelihood function (logarithms turn products into sums and are easier to
differentiate), we get: S*log(p) + (n-S)*log(1-p). Maximize the log-likelihood: We find the p that maximizes this log-likelihood by taking the derivative with respect to p and setting it equal to 0. The derivative of the log-likelihood function is: S/p - (nS)/(1-p) = 0. Solving for p gives: p = S/n. So, the maximum likelihood estimator (MLE) for p in a Bernoulli distribution is the sample proportion of successes, which matches the result from the method of moments in this case.
What is the GMM estimation method? The Generalized Method of Moments (GMM) is a flexible method for estimating parameters in statistical models. It's a generalization of the method of moments. Instead of setting sample moments equal to
population moments, the GMM minimizes a weighted sum of the squared differences between sample and population moments. The steps of GMM are: Specify a set of moment conditions that relate the parameters to be estimated to the data.
Form an objective function, which is a weighted sum of the squared differences between the sample moments and the population moments, expressed as a function of the parameters. Choose the parameter values that minimize the objective
function. GMM is useful because it only requires you to specify a set of moment conditions, which can often be derived from economic or statistical theory, and it can be applied even when the standard assumptions of other methods are not met.
For instance, it's commonly used in situations where the errors are not independently and identically distributed (non-IID) or where there's heteroskedasticity.
What is the Bayes Estimation (MAP) estimation method? The Maximum A Posteriori (MAP) estimation is a Bayesian method for estimating statistical parameters. While methods like Maximum Likelihood Estimation (MLE) look for the parameters
that maximize the likelihood of the observed data, MAP estimation incorporates prior knowledge about the parameters. Here's how it works: Prior: We start with a prior distribution, which represents our beliefs about the parameters before seeing
any data. Likelihood: Then we calculate the likelihood, which is the probability of the observed data given the parameters. Posterior: We use Bayes' theorem to update our beliefs based on the observed data. The posterior distribution is
proportional to the product of the likelihood and the prior. The MAP estimate is the value of the parameter that maximizes this posterior distribution. In mathematical terms, if θ is the parameter, x is the data, and p(θ|x) is the posterior probability
of θ given x, the MAP estimate is: θ_MAP = argmax_θ p(θ|x). By Bayes' theorem, this is equivalent to: θ_MAP = argmax_θ p(x|θ) * p(θ), where p(x|θ) is the likelihood and p(θ) is the prior. The MAP estimate balances the information from the prior
and the data. If we have little data, the prior will dominate. As we get more data, the likelihood becomes more influential.
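For the Bernoulli case this prior/data balance has a closed form: with a Beta(α, β) prior the posterior is Beta(S + α, n − S + β), whose mode gives the MAP estimate. A minimal sketch (the counts and prior parameters are hypothetical):

```python
def bernoulli_map(S, n, alpha, beta):
    """MAP estimate of p for S successes in n Bernoulli trials under a
    Beta(alpha, beta) prior: the posterior is Beta(S + alpha, n - S + beta),
    whose mode is (S + alpha - 1) / (n + alpha + beta - 2)."""
    return (S + alpha - 1) / (n + alpha + beta - 2)

# Flat prior Beta(1, 1): the MAP estimate coincides with the MLE S/n.
p_flat = bernoulli_map(55, 100, 1, 1)        # 55/100
# Informative prior Beta(10, 10) pulls the estimate toward 0.5.
p_informed = bernoulli_map(55, 100, 10, 10)  # 64/118
```

With more data (larger n), the prior's pull toward 0.5 would shrink, which is the "likelihood becomes more influential" point above.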
What are the asymptotic distributions of these estimation methods? Here are the asymptotic distributions for these estimation methods:
Maximum Likelihood Estimation (MLE): Under certain regularity conditions, MLEs are asymptotically normal, which means as the sample size n goes to infinity, the distribution of the estimators tends to a normal distribution. More specifically, if θ is
the true parameter and θ̂ is the MLE estimator, the distribution of sqrt(n) * (θ̂ - θ) converges to a normal distribution with mean 0 and variance equal to the inverse of the Fisher information I(θ). Method of Moments (MoM): Similar to MLE, under
certain conditions, MoM estimators are also asymptotically normal. However, they might not be as efficient as MLEs, meaning their asymptotic variance might be larger. Generalized Method of Moments (GMM): GMM estimators are also
asymptotically normal under certain conditions. If we denote the GMM estimator as θ̂, then as n goes to infinity, sqrt(n) * (θ̂ - θ) converges in distribution to a normal distribution with mean 0 and a variance-covariance matrix that can be estimated
from the data. Bayesian Estimation (MAP and others): Under certain conditions, the posterior distribution of the parameter given the data becomes approximately normal as the sample size increases, regardless of the form of the prior distribution.
This is known as the Bernstein-von Mises theorem. In this case, the mode of the posterior distribution, which gives the MAP estimate, also converges to the true parameter value, and the posterior distribution itself becomes concentrated around
the true parameter value. However, the specific asymptotic distribution may depend on the prior and likelihood used. Please note that these results are all asymptotic, meaning they apply as the sample size goes to infinity, and under certain
regularity conditions. These conditions might include, but are not limited to, assumptions about the continuity, differentiability, and boundedness of the functions involved. If these conditions are violated, the given asymptotic results may not hold.
What does it mean for an estimator to be unbiased/consistent? Unbiasedness: Mathematical Definition: An estimator is called unbiased if the expected value of the estimator equals the true parameter value. Mathematically, if we denote the
estimator by θ̂ and the true parameter by θ, the estimator is unbiased if E(θ̂) = θ. Intuitive Explanation: In layman's terms, an unbiased estimator is right on target, on average. If you were to repeat your study many times, the average value of your
estimator across all these studies would be equal to the true value. Each individual study might be off, but on average, you're not consistently overestimating or underestimating the parameter.
Consistency: Mathematical Definition: An estimator is called consistent if it converges in probability to the true parameter value as the sample size goes to infinity. Mathematically, if θ̂ is the estimator and θ is the true parameter, the estimator is consistent if for every ε > 0, P(|θ̂ - θ| > ε) goes to 0 as n (the sample size) goes to infinity. Intuitive Explanation: Consistency means that as you get more and more data, the estimator gets closer and closer to the true value. You might start off with
estimates that are off target, but as your sample size grows, your estimates should become increasingly accurate.
What is an example of a consistent but biased estimator? A good example of a consistent but biased estimator is the sample variance computed with n rather than n - 1 in the denominator. The standard definition is unbiased: s² = Σ(xi - x̄)² / (n - 1). Here, s² is the sample variance, xi are the individual sample points, x̄ is the sample mean, and n is the number of observations. If we instead use n in the denominator, the resulting estimator is biased: s²_biased = Σ(xi - x̄)² / n. This estimator is biased because it underestimates the true population variance (its expectation is (n - 1)/n times the true variance). However, as n becomes very large, the difference between n and n - 1 becomes negligible. The bias therefore shrinks to zero as the sample size increases, so this estimator is consistent despite being biased.
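A quick simulation makes the point concrete; a sketch using Uniform(0, 1) draws, whose true variance is 1/12 ≈ 0.0833 (seed and sample size are arbitrary):

```python
import random

random.seed(1)
n = 100_000
xs = [random.random() for _ in range(n)]  # Uniform(0, 1), true variance 1/12
mean = sum(xs) / n
ss = sum((x - mean) ** 2 for x in xs)
var_unbiased = ss / (n - 1)
var_biased = ss / n  # always slightly smaller, but the gap vanishes as n grows
# For n this large, both estimates land very close to 1/12.
```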
What is an example of an unbiased but inconsistent estimator? An example of an estimator that is unbiased but not consistent is the sample mean from a simple random sample without replacement from a finite population, where the sample size is
not a fixed fraction of the population size. As an example, consider a school with 1000 students. You want to estimate the average height of the students. Each day, you randomly select 10 students and measure their heights. The daily average
height would be an unbiased estimator of the average height of all students - on average, you'd expect it to equal the true average height. However, this estimator would not be consistent. If you were to increase the number of days of the study,
thus effectively increasing the total sample size, you wouldn't necessarily get a better estimate of the average height. That's because you're still only sampling 10 students each day, not increasing the fraction of the total population that you're
sampling. Thus, the sampling error wouldn't necessarily decrease with increasing sample size. The estimator would only be consistent if you increased the number of students you sampled each day as the total number of students increased - that is,
if you sampled a fixed fraction of the population, not a fixed number. In this case, the sample mean would be both unbiased and consistent. But in the described setup, it is unbiased but not consistent.
Explain Fisher Information I(θ): Fisher information I(θ) is a measure of the amount of information that observed data carries about an unknown parameter θ in a statistical model. It quantifies how sensitive the log-likelihood is to changes in θ. It provides a lower bound on the variance of unbiased estimators (the Cramér-Rao bound), is additive for independent observations, transforms predictably under reparameterization, and determines the asymptotic variance of the maximum likelihood estimator. Overall, it is a fundamental concept in statistical inference.
How can you show that an estimator is consistent? To show that an estimator is consistent, we generally appeal to certain laws or theorems in probability and statistics. These often include the Law of Large Numbers or the Central Limit Theorem,
and certain properties of the estimator itself. Here's a general sketch of the process: Expected Value and Variance: Show that the estimator is unbiased, or at least asymptotically unbiased. This means that its expected value is equal to (or converges
to) the true parameter value. You would also ideally want to show that its variance goes to zero as the sample size increases. Law of Large Numbers (LLN): Use the LLN if the estimator is a mean or sum of independent and identically distributed
(i.i.d.) random variables. The Weak Law of Large Numbers states that the sample average converges in probability to the expected value as the sample size goes to infinity, which can demonstrate consistency. Central Limit Theorem (CLT): If your
estimator is a sum or average of a large number of i.i.d. random variables, the CLT may help. While the CLT itself doesn't prove consistency (it's more about asymptotic normality), it can be used in conjunction with other results, as it provides
information about the distribution of the estimator. Convergence in Probability: Show that for every positive number ε, the probability that the absolute difference between the estimator and the true parameter is greater than ε goes to zero as the
sample size goes to infinity. This is the definition of consistency, so demonstrating it would prove that the estimator is consistent.
How can you show that an estimator is unbiased? To show that an estimator is unbiased, you need to show that its expected value is equal to the parameter it is estimating. Here's a general outline of how you could do that: Formulate the
estimator: First, write down the form of your estimator. For example, if you're estimating the population mean μ using the sample mean X̄, your estimator is X̄ = ΣXi / n, where Xi are your sample points and n is the sample size. Calculate the
expected value: Next, compute the expected value of your estimator. Use the properties of expectations to simplify your calculation. For instance, the expected value operator E() is linear, meaning that for any random variables X and Y and any
constants a and b, E(aX + bY) = aE(X) + bE(Y). Continuing the example from step 1, the expected value of the sample mean is E(X̄) = E(ΣXi / n) = ΣE(Xi) / n. If the X's are independently and identically distributed with mean μ, then E(Xi) = μ for all i, and so E(X̄) = Σμ / n = μ. Compare to the parameter: Finally, compare the expected value of your estimator to the parameter you're estimating. If they're equal, your estimator is unbiased. In our example, we found that E(X̄) = μ, so the sample mean X̄ is
an unbiased estimator of the population mean μ. This process requires some assumptions (such as the X's being i.i.d. in our example), and the calculations could be more complex depending on the nature of your estimator.
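The "right on average" interpretation of E(X̄) = μ can be illustrated by repeating the study many times in simulation; a sketch with arbitrary μ, σ, and run counts:

```python
import random

random.seed(2)
mu = 5.0
# Repeat the "study" many times: each run draws a small sample of 10
# observations and computes its sample mean. Averaging the estimator
# across runs approximates its expected value E(X̄).
runs = 20_000
sample_means = []
for _ in range(runs):
    sample = [random.gauss(mu, 2.0) for _ in range(10)]
    sample_means.append(sum(sample) / 10)
avg_estimate = sum(sample_means) / runs  # ≈ mu, illustrating E(X̄) = μ
```

Individual sample means scatter widely (each uses only 10 points), but their average across runs sits very close to the true μ, which is exactly what unbiasedness claims.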
Why does an estimator that is not BLUE not have to be biased? BLUE stands for Best Linear Unbiased Estimator. By definition, a BLUE estimator is unbiased. However, just because an estimator is not BLUE, it doesn't mean it's biased. It might fail to be BLUE
because it's not the best (i.e., it doesn't have the smallest variance among linear unbiased estimators) or it's not linear, but it could still be unbiased. So, not BLUE doesn't necessarily mean biased.
What is Omitted Variable Bias? Omitted variable bias occurs when a statistical model leaves out one or more relevant variables. The omission can lead to biased and inconsistent estimates, as the effect of the omitted variable may be falsely
attributed to the included variables.
Explain the OLS Estimator both mathematically and intuitively. Mathematical Explanation: Ordinary Least Squares (OLS) aims to minimize the sum of the squared residuals in a linear regression model. Given a model Y = Xβ + ε, where Y is the
dependent variable, X is the matrix of independent variables, β are the parameters to be estimated, and ε is the error term, the OLS estimator β̂ is given by: β̂ = (X'X)^-1 X'Y, where ' denotes the transpose and ^-1 denotes the inverse. Intuitive
Explanation: Intuitively, OLS is like finding the best-fitting line through a scatter plot of data. "Best-fitting" means the line is chosen such that the squared vertical distances (residuals) from each data point to the line are as small as possible in total. It
assumes a linear relationship between the dependent and independent variables.
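For the single-regressor case with an intercept, the normal equations (X'X)β̂ = X'Y reduce to a 2x2 system with a closed-form solution; a self-contained sketch on noise-free hypothetical data:

```python
def ols_simple(x, y):
    """OLS for y = b0 + b1*x: closed-form solution of the 2x2
    normal equations (X'X) b = X'y, with X = [1, x]."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b0 = (sy - b1 * sx) / n                          # intercept
    return b0, b1

# Noise-free data generated from y = 2 + 3x: OLS recovers the coefficients.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
b0, b1 = ols_simple(x, y)
```

With noise added to y, the residuals would be nonzero and (b0, b1) would only approximate (2, 3), which is the "best-fitting line" intuition above.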
What is the asymptotic distribution of the OLS estimator? Under certain assumptions, including i.i.d. errors with finite variance and no perfect multicollinearity, the OLS estimator is asymptotically normal; this follows from the Law of Large Numbers and the Central Limit Theorem (the Gauss-Markov theorem is a separate, finite-sample efficiency result). Specifically, as the sample size approaches infinity (n → ∞), the distribution of the OLS estimator approaches a multivariate normal distribution. The mean of this distribution is the true parameter vector, and the variance-covariance matrix is σ²(X'X)^-1 under homoscedasticity (more generally, a "sandwich" matrix that depends on the error variance and the design matrix of the model). This result allows for the construction of confidence intervals and hypothesis tests using the OLS estimator. Normality of the errors themselves is not required in large samples because of the Central Limit Theorem. However, it is essential to verify the underlying assumptions and check for potential violations in specific cases.
What key assumption do you need to show that OLS is unbiased? How can you prove that OLS is unbiased using this assumption? To show that the Ordinary Least Squares (OLS) estimator is unbiased, the key assumption required is that the error
term in the linear regression model has a zero conditional mean, also known as the exogeneity assumption or the conditional mean independence assumption. Formally, the assumption is: E(ε|X) = 0. This assumption implies that the error term has
no systematic relationship with the independent variables in the model. To prove that OLS is unbiased under this assumption, we can calculate the expected value of the OLS estimator and show that it equals the true parameter value. Given the
linear regression model Y = Xβ + ε, where Y is the dependent variable, X is the matrix of independent variables, β is the true parameter vector, and ε is the error term, the OLS estimator β̂ is given by: β̂ = (X'X)^-1 X'Y. Taking the expected value of the OLS estimator, we have: E(β̂) = E((X'X)^-1 X'Y). Using the assumption E(ε|X) = 0, we can apply conditional expectations: E(β̂) = E((X'X)^-1 X'(Xβ + ε)) = E((X'X)^-1 X'Xβ) + E((X'X)^-1 X'ε). By simplifying the expression, we find: E(β̂) = β + 0 = β. Thus, under
the assumption of zero conditional mean, the OLS estimator is unbiased, as the expected value of the estimator equals the true parameter value.
Gauss-Markov: what are the assumptions? what does it mean for an estimator to be BLUE? The Gauss-Markov theorem provides conditions under which the Ordinary Least Squares (OLS) estimator has several desirable properties. The
assumptions of the Gauss-Markov theorem are as follows: Linearity: The relationship between the dependent variable and the independent variables is linear in the parameters. No perfect multicollinearity: The independent variables are not
perfectly linearly dependent on each other. Exogeneity: The error term has a zero conditional mean (E(ε|X) = 0). This assumption ensures that the error term is not systematically related to the independent variables. Homoscedasticity: The error
term has constant variance, meaning the variance of the error term is the same for all values of the independent variables. No autocorrelation: The error term is not correlated with itself across observations. If these assumptions hold, then the OLS
estimator is said to be Best Linear Unbiased Estimator (BLUE). The properties of BLUE estimators are: Unbiasedness: The expected value of the estimator equals the true parameter value. Efficiency: Among all linear unbiased estimators, the OLS
estimator has the smallest variance. It achieves the lowest possible mean squared error. Linearity: The estimator is a linear function of the dependent variable. Being BLUE is desirable because it ensures unbiasedness and efficiency, making OLS the
best choice among linear unbiased estimators when the Gauss-Markov assumptions are met.
How do you design an event study? Define the event: Identify the specific event or treatment that you want to study. It could be a corporate announcement, a policy change, a natural disaster, or any event that you believe may have an impact on
the outcome you're interested in. Specify the outcome variable: Determine the outcome variable that you want to measure or analyze to assess the impact of the event. This could be stock returns, sales figures, customer satisfaction scores, or any
other relevant measure. Select a time window: Define the period over which you will observe the outcome variable. This typically includes a pre-event period (before the event occurs) and a post-event period (after the event occurs). The length of
these periods can vary depending on the nature of the event and the expected time it takes for the event's impact to materialize. Identify a comparison group: Establish a suitable comparison group that did not experience the event. This group
should be as similar as possible to the treatment group (i.e., the group affected by the event) in terms of relevant characteristics, such as industry, size, location, or any other factors that may influence the outcome variable. Collect data: Gather data
on the outcome variable for both the treatment group and the comparison group during the defined time window. Ensure data quality and consistency to enable meaningful analysis. Analyze the data: Apply statistical and econometric methods to
compare the outcomes of the treatment group with the comparison group. Common approaches include difference-in-differences (DID) analysis, event study regression, or matching techniques. These methods help isolate the effect of the event
from other factors that may influence the outcome variable. Interpret the results: Interpret the findings in light of the analysis. Assess the magnitude, direction, and statistical significance of the event's impact on the outcome variable. Consider
potential limitations and alternative explanations for the observed results.
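The estimation-and-comparison steps above are often implemented with the market model: fit R_stock = α + β·R_market on a pre-event estimation window, then treat deviations from that fit in the event window as abnormal returns. A minimal sketch on made-up return series (all numbers hypothetical):

```python
def market_model_abnormal(stock, market, est_end, event_start, event_end):
    """Market-model event study sketch: fit R_stock = alpha + beta * R_market
    on observations [0, est_end), then compute abnormal returns
    AR_t = R_t - (alpha + beta * R_mkt_t) over [event_start, event_end)
    and their sum (the cumulative abnormal return, CAR)."""
    xs, ys = market[:est_end], stock[:est_end]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    beta = cov / var
    alpha = my - beta * mx
    ars = [stock[t] - (alpha + beta * market[t])
           for t in range(event_start, event_end)]
    return alpha, beta, sum(ars)

# Toy series: stock = 0.001 + 1.5 * market during the estimation window,
# plus a +2% shock on the "event day" (index 8).
market = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02, 0.0, 0.01, -0.005, 0.012]
stock = [0.001 + 1.5 * m for m in market]
stock[8] += 0.02  # event-day jump
alpha, beta, car = market_model_abnormal(stock, market, 8, 8, 10)
# Because the estimation-window data are noise-free, the fit recovers
# alpha ≈ 0.001 and beta ≈ 1.5, and the CAR isolates the 2% shock.
```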
With regards to the size of the estimation window, what is the tradeoff between reducing the variance of the estimated abnormal returns and issues related to non-stationarity? The size of the estimation window in an event study plays a critical
role in balancing the tradeoff between reducing the variance of estimated abnormal returns and addressing issues related to non-stationarity. A longer estimation window can help reduce the variance of estimated abnormal returns by capturing
more pre-event data points and providing a more robust estimate of the average or expected returns in the absence of the event. This can enhance the precision of the abnormal return estimation. However, using a longer estimation window
introduces the risk of non-stationarity. Non-stationarity refers to the violation of the assumption that the statistical properties of the data remain constant over time. In event studies, non-stationarity can arise due to structural changes in the
market, evolving economic conditions, or other factors that affect the normal behavior of stock returns. Non-stationarity can compromise the validity of statistical tests and lead to biased or unreliable estimates of abnormal returns. Longer
estimation windows might include periods with changing market conditions, rendering the estimated normal returns less representative of the pre-event period. To strike a balance, researchers often employ shorter estimation windows that
capture sufficient data for estimating the expected returns but avoid periods with significant structural changes or non-stationarity. This approach aims to mitigate the impact of non-stationarity while still providing reliable estimates of abnormal
returns. Determining the optimal size of the estimation window requires careful consideration of the specific event, the underlying market dynamics, and the available data. Researchers often conduct sensitivity analyses by varying the estimation
window size to assess the robustness of their findings to different specifications.
How to deal with multiple firms announcing earnings on the same day? To address multiple firms announcing earnings on the same day: Use an event-window approach to capture the market's reaction over a broader period around the
announcement day. Select a control group of similar firms that do not announce earnings on the same day to isolate the specific impact of each announcement. Apply statistical adjustments, such as regression analysis or panel data models, to
control for common factors affecting firms announcing earnings on the same day. Consider aggregating results from multiple event studies to gain broader insights. The approach chosen depends on the research question and available data, aiming
to accurately analyze the effects of earnings announcements.
Provide a mathematical definition and an intuitive explanation for stationarity. Mathematical Definition: A stochastic process is said to be stationary if its statistical properties do not change over time. More formally, for a time series process {X_t}:
The mean (E[X_t]) is constant for all time points t. The variance (Var[X_t]) is constant for all time points t. The covariance between X_t and X_{t+h} only depends on the time difference h, not on the specific time points t and t+h. Intuitive
Explanation: In simpler terms, a stationary process maintains consistent statistical behavior over time. This means that the process has a stable mean and variance, and the relationship between observations at different time points remains the
same. It implies that the underlying dynamics or patterns in the process do not change as time progresses. For example, if we have a stationary time series of daily temperature measurements, the average temperature would remain relatively
constant throughout the entire time period, and the temperature fluctuations would exhibit consistent variability. Additionally, the relationship between temperatures on any given day and temperatures a week later would be the same as the
relationship between temperatures on any other pair of days with the same time lag. Stationarity is a crucial assumption in many time series analyses as it simplifies modeling and facilitates the use of various statistical techniques.
What is mean-reversion? Mean-reversion refers to the tendency of a variable or time series to move towards its long-term average or mean value over time. In other words, when a variable deviates significantly from its mean, it is likely to revert or
move back towards that mean in the future. This pattern occurs due to various forces, such as market forces or economic equilibrium, that act to bring the variable back to its average level.
What are AR and MA processes? Autoregressive (AR) Process: In an autoregressive process, the value of a variable at a given time is linearly related to its past values. The current value is influenced by its own lagged values. Intuitively, an AR process
can be thought of as a variable that tends to persist its behavior over time. If the previous values are high, the current value is likely to be high, and if the previous values are low, the current value is likely to be low. The degree of influence from past
values depends on the order of the AR process (e.g., AR(1), AR(2), etc.). Moving Average (MA) Process: In a moving average process, the value of a variable at a given time is a linear combination of the current and past error terms or "shocks" that
have affected the variable. Intuitively, an MA process can be understood as a variable being influenced by the recent random shocks it has experienced. The shocks have an immediate effect on the current value, and the influence of each shock
diminishes as time passes. The order of the MA process (e.g., MA(1), MA(2), etc.) determines the number of past error terms considered. In summary, an AR process models a variable's behavior based on its own past values, while an MA process
models a variable's behavior based on the past random shocks it has experienced. Both AR and MA processes are commonly used in time series analysis to capture different patterns and dynamics observed in data.
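The contrast between the two can be seen by simulating each and measuring persistence; a sketch with arbitrary parameters (φ = 0.8, θ = 0.5) where the sample lag-1 autocorrelation should land near the theoretical values φ for AR(1) and θ/(1+θ²) for MA(1):

```python
import random

random.seed(3)
T = 5_000
shocks = [random.gauss(0.0, 1.0) for _ in range(T)]

# AR(1): x_t = phi * x_{t-1} + eps_t; |phi| < 1 gives a stationary,
# persistent series whose lag-1 autocorrelation is phi.
phi = 0.8
ar = [0.0]
for t in range(1, T):
    ar.append(phi * ar[-1] + shocks[t])

# MA(1): y_t = eps_t + theta * eps_{t-1}; only one period of memory,
# with lag-1 autocorrelation theta / (1 + theta^2).
theta = 0.5
ma = [shocks[0]] + [shocks[t] + theta * shocks[t - 1] for t in range(1, T)]

def lag1_autocorr(xs):
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[t] - m) * (xs[t - 1] - m) for t in range(1, n))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

ar_rho = lag1_autocorr(ar)  # ≈ 0.8
ma_rho = lag1_autocorr(ma)  # ≈ 0.5 / 1.25 = 0.4
```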
What is a random walk? A random walk is a stochastic process where future values are unpredictable and depend solely on the current value plus a random shock. The mathematical formula for a random walk can be represented as: X(t) = X(t-1) +
ε(t) where X(t) is the value of the random walk at time t, X(t-1) is the value at the previous time step (t-1), and ε(t) represents the random shock or error term at time t.
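A defining feature of the random walk is that Var[X(t)] grows linearly with t (equal to t for unit-variance shocks), which is why it is non-stationary; a simulation sketch with arbitrary seed and path counts:

```python
import random

random.seed(4)

def random_walk(T):
    """X(t) = X(t-1) + eps(t), X(0) = 0, eps ~ N(0, 1)."""
    x = [0.0]
    for _ in range(T):
        x.append(x[-1] + random.gauss(0.0, 1.0))
    return x

# Estimate Var[X(100)] across many independent paths. Since X(0) = 0 and
# shocks have unit variance, the theoretical value is 100.
paths = [random_walk(100) for _ in range(2_000)]
var_at_100 = sum(p[100] ** 2 for p in paths) / len(paths)  # ≈ 100
```

Contrast this with a stationary AR(1) with |φ| < 1, whose variance stays bounded no matter how far out t goes.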
How to test for a unit root? To test for a unit root: Formulate the null hypothesis (presence of a unit root) and the alternative (stationarity). Choose between the Augmented Dickey-Fuller (ADF) or Phillips-Perron (PP) test. Estimate the test regression by regressing the differenced series on the lagged level (and, for ADF, lagged differences). Calculate the test statistic and compare it to the critical values, which follow a nonstandard (Dickey-Fuller) distribution. If the test statistic is more negative than the critical value, reject the null hypothesis and conclude stationarity; otherwise, fail to reject the null hypothesis, indicating the presence of a unit root and non-stationarity.
How do you deal with seasonality? To deal with seasonality: Use seasonal decomposition to separate the time series into seasonal, trend, and residual components. Apply differencing to remove the seasonal component by taking the difference
between consecutive observations. Utilize seasonal adjustment models like SARIMA or seasonal regression. Consider calendar adjustments for factors like holidays or the number of days in a month. Apply moving averages to smooth out the series
and reveal underlying patterns.
Logistic Regression: Mathematical Formula: The estimator β̂ for a Logistic Regression model is obtained by maximizing the likelihood function: β̂ = argmax_β[Σ(Yi log(p(Xi; β)) + (1 - Yi) log(1 - p(Xi; β)))]. Intuitive Explanation: The logistic regression
estimator finds the best coefficients (β) that maximize the likelihood of the observed data. It models the relationship between predictors (X) and the probability of a positive outcome (Y = 1) using a sigmoid function. The estimator assigns higher
weights to influential variables and aims to predict the probability of an event occurring based on the predictors.
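The likelihood maximization above can be done by gradient ascent, since the gradient of the log-likelihood has the simple form Σ(y − p)·x; a one-feature sketch on made-up separable data (learning rate and step count are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5_000):
    """Maximize sum(y*log(p) + (1-y)*log(1-p)) by gradient ascent on
    (w0, w1), where p = sigmoid(w0 + w1*x)."""
    w0 = w1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w0 + w1 * x)  # gradient term of the log-likelihood
            g0 += err
            g1 += err * x
        w0 += lr * g0 / n
        w1 += lr * g1 / n
    return w0, w1

# Toy data: larger x -> more likely y = 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w0, w1 = fit_logistic(xs, ys)
p_hi = sigmoid(w0 + w1 * 2.0)   # predicted P(y=1) at x = 2, near 1
p_lo = sigmoid(w0 + w1 * -2.0)  # predicted P(y=1) at x = -2, near 0
```

(With perfectly separable data the coefficients keep growing as steps increase; a fixed step budget, as here, or regularization keeps them finite.)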
KNN: Mathematical Formula: The KNN estimator has no closed-form expression; it classifies a point by the majority class (or, for regression, the average value) among its k nearest neighbors. Intuitive Explanation: The KNN model estimates the class or value of a data point from its k nearest neighbors in feature space: it finds the k closest training points and assigns the
most common class or average value among them as the estimate for the new point. The intuition behind KNN is that data points with similar features tend to belong to the same class or have similar values, so the estimator
leverages information from nearby neighbors to predict the outcome for the new data point.
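Because the estimator is just a distance computation plus a vote, it fits in a few lines; the training points below are a made-up toy example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y_train = np.array([0, 0, 1, 1])

# Query near the class-0 cluster: its 3 nearest neighbors include both
# class-0 points, so the majority vote returns class 0.
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
```

Note there is no training step at all: every prediction re-scans the training set, which is why KNN scales poorly to large datasets.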
When do you want to use Logistic Regression vs KNN? Logistic Regression is preferred for interpretability, linearity, large feature space, and imbalanced datasets. KNN is suitable for capturing nonlinear relationships, local patterns, small to
medium-sized datasets, and when few relevant features are present. Consider data characteristics when choosing between them.
Explain the Bias-Variance Tradeoff. The bias-variance tradeoff is the balance between underfitting (high bias) and overfitting (high variance) in machine learning models: reducing one type of error often increases the other, and the goal is to find the balance that minimizes overall error. Bias is the error caused by a model's simplifying assumptions, leading to underfitting; variance
is the inconsistency of a model's predictions due to sensitivity to fluctuations in the training data, leading to overfitting. Techniques like cross-validation and regularization help manage this tradeoff and produce accurate, generalizable models.
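The tradeoff can be illustrated with a hypothetical polynomial-fitting example: a linear fit to a nonlinear target underfits (high bias), while a very high-degree fit chases the noise (high variance). The target function, noise level, and degrees below are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 20)
y_train = np.sin(3 * x_train) + 0.3 * rng.standard_normal(20)  # noisy samples
x_test = np.linspace(-1, 1, 200)
y_test_true = np.sin(3 * x_test)                               # noiseless truth

train_mse, test_mse = {}, {}
for deg in (1, 3, 9):
    coef = np.polyfit(x_train, y_train, deg)   # least-squares polynomial fit
    train_mse[deg] = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse[deg] = np.mean((np.polyval(coef, x_test) - y_test_true) ** 2)

# Training error only falls as the degree grows (more flexibility), but test
# error is minimized at a moderate degree: degree 1 underfits, degree 9 can
# overfit the noise in the 20 training points.
print(train_mse, test_mse)
```

This is exactly what cross-validation detects: the degree with the best held-out error, not the best training error.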
What is an ROC Curve? ROC curve (Receiver Operating Characteristic curve): The ROC curve is a graphical representation of the performance of a binary classification model. It plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various classification thresholds. Formula: The ROC curve is not given by a single formula; it is traced out by varying the classification threshold and computing the true positive rate and false positive rate at each threshold. Usage: The ROC
curve is used to evaluate the tradeoff between the true positive rate and the false positive rate across classification thresholds and to assess the discriminatory power of a binary classification model. The area under
the ROC curve (AUC) is often used as a summary measure of model performance, with higher values indicating better discrimination between classes.
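The threshold-sweeping procedure can be sketched directly: sort predictions by score, sweep the cutoff through each score, record (FPR, TPR), and integrate by trapezoids for the AUC. The labels and scores below are a hand-made toy example.

```python
import numpy as np

def roc_curve(y_true, scores):
    """Sweep the threshold through each score (descending) and record the
    true-positive and false-positive rates at every cut."""
    order = np.argsort(-scores)
    y = y_true[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal area under the piecewise-linear ROC curve.
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

y = np.array([1, 1, 0, 1, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
fpr, tpr = roc_curve(y, s)
print(auc(fpr, tpr))   # 8/9: fraction of (positive, negative) pairs ranked correctly
```

The 8/9 here has a direct interpretation: of the 3×3 = 9 positive-negative pairs, all but one (the 0.3 positive vs. the 0.7 negative) are ranked correctly, which matches the probabilistic reading of AUC.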