Probability and Statistics Cookbook
Version 0.2.6, 19th December 2017
http://statistics.zone/
Copyright © Matthias Vallentin
Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and on in-class material from courses of the statistics department at the University of California, Berkeley, and is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.
1 Distribution Overview

1.1 Discrete Distributions

For each distribution we list the notation, CDF $F_X(x)$, PMF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.¹

Uniform $\mathrm{Unif}\{a,\ldots,b\}$
- $F_X(x) = 0$ for $x < a$; $\frac{\lfloor x\rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$; $1$ for $x > b$
- $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$
- $\mathbb{E}[X] = \frac{a+b}{2}$, $\mathbb{V}[X] = \frac{(b-a+1)^2 - 1}{12}$
- $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^{s})}$

Bernoulli $\mathrm{Bern}(p)$
- $F_X(x) = 0$ for $x < 0$; $1 - p$ for $0 \le x < 1$; $1$ for $x \ge 1$
- $f_X(x) = p^x(1-p)^{1-x}$, $x \in \{0, 1\}$
- $\mathbb{E}[X] = p$, $\mathbb{V}[X] = p(1-p)$, $M_X(s) = 1 - p + pe^s$

Binomial $\mathrm{Bin}(n, p)$
- $F_X(x) = I_{1-p}(n - \lfloor x\rfloor, \lfloor x\rfloor + 1) \approx \Phi\!\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$
- $f_X(x) = \binom{n}{x}p^x(1-p)^{n-x}$
- $\mathbb{E}[X] = np$, $\mathbb{V}[X] = np(1-p)$, $M_X(s) = (1 - p + pe^s)^n$

Multinomial $\mathrm{Mult}(n, p)$
- $f_X(x) = \frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k}$ with $\sum_{i=1}^k x_i = n$
- $\mathbb{E}[X] = (np_1, \ldots, np_k)^T$; $\mathbb{V}[X_i] = np_i(1-p_i)$, $\mathrm{Cov}[X_i, X_j] = -np_ip_j$
- $M_X(s) = \left(\sum_{i=1}^k p_ie^{s_i}\right)^n$

Hypergeometric $\mathrm{Hyp}(N, m, n)$
- $f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$
- $\mathbb{E}[X] = \frac{nm}{N}$, $\mathbb{V}[X] = \frac{nm(N-n)(N-m)}{N^2(N-1)}$

Negative Binomial $\mathrm{NBin}(r, p)$
- $F_X(x) = I_p(r, x + 1)$
- $f_X(x) = \binom{x+r-1}{r-1}p^r(1-p)^x$
- $\mathbb{E}[X] = \frac{r(1-p)}{p}$, $\mathbb{V}[X] = \frac{r(1-p)}{p^2}$, $M_X(s) = \left(\frac{p}{1 - (1-p)e^s}\right)^r$

Geometric $\mathrm{Geo}(p)$
- $F_X(x) = 1 - (1-p)^x$, $x \in \mathbb{N}^+$
- $f_X(x) = p(1-p)^{x-1}$, $x \in \mathbb{N}^+$
- $\mathbb{E}[X] = \frac{1}{p}$, $\mathbb{V}[X] = \frac{1-p}{p^2}$, $M_X(s) = \frac{pe^s}{1 - (1-p)e^s}$

Poisson $\mathrm{Po}(\lambda)$
- $F_X(x) = e^{-\lambda}\sum_{i=0}^{\lfloor x\rfloor}\frac{\lambda^i}{i!}$
- $f_X(x) = \frac{\lambda^xe^{-\lambda}}{x!}$, $x \in \{0, 1, 2, \ldots\}$
- $\mathbb{E}[X] = \lambda$, $\mathbb{V}[X] = \lambda$, $M_X(s) = e^{\lambda(e^s - 1)}$

¹ We use the notation $\gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see §22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see §22.2).
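As a quick, illustrative sanity check of the Binomial row above (not part of the original cookbook), the following Python sketch evaluates the PMF, the mean and variance formulas, and the normal approximation to the CDF; the parameter values are arbitrary.

```python
import math
from statistics import NormalDist

def binom_pmf(x, n, p):
    """f_X(x) = C(n, x) p^x (1 - p)^(n - x)"""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 40, 0.3
mean, var = n * p, n * p * (1 - p)          # E[X] = np, V[X] = np(1-p)

# Exact CDF at x versus the normal approximation Phi((x - np) / sqrt(np(1-p)))
x = 15
cdf_exact = sum(binom_pmf(k, n, p) for k in range(x + 1))
cdf_normal = NormalDist(mean, math.sqrt(var)).cdf(x)
print(cdf_exact, cdf_normal)  # the two values should be close; the approximation improves with n
```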
[Figure: PMF and CDF plots of the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]
1.2 Continuous Distributions

For each distribution we list the notation, CDF $F_X(x)$, PDF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.

Uniform $\mathrm{Unif}(a, b)$
- $F_X(x) = 0$ for $x < a$; $\frac{x-a}{b-a}$ for $a < x < b$; $1$ for $x > b$
- $f_X(x) = \frac{I(a < x < b)}{b - a}$
- $\mathbb{E}[X] = \frac{a+b}{2}$, $\mathbb{V}[X] = \frac{(b-a)^2}{12}$, $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$

Normal $\mathcal{N}(\mu, \sigma^2)$
- $F_X(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)$ with $\Phi(x) = \int_{-\infty}^x\varphi(t)\,dt$
- $f_X(x) = \varphi(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
- $\mathbb{E}[X] = \mu$, $\mathbb{V}[X] = \sigma^2$, $M_X(s) = \exp\!\left(\mu s + \frac{\sigma^2s^2}{2}\right)$

Log-Normal $\ln\mathcal{N}(\mu, \sigma^2)$
- $F_X(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right)$
- $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$
- $\mathbb{E}[X] = e^{\mu + \sigma^2/2}$, $\mathbb{V}[X] = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$

Multivariate Normal $\mathrm{MVN}(\mu, \Sigma)$
- $f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$
- $\mathbb{E}[X] = \mu$, $\mathbb{V}[X] = \Sigma$, $M_X(s) = \exp\!\left(\mu^Ts + \frac{1}{2}s^T\Sigma s\right)$

Student's t $\mathrm{Student}(\nu)$
- $f_X(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$
- $\mathbb{E}[X] = 0$ for $\nu > 1$; $\mathbb{V}[X] = \frac{\nu}{\nu - 2}$ for $\nu > 2$, $\infty$ for $1 < \nu \le 2$

Chi-square $\chi^2_k$
- $F_X(x) = \frac{1}{\Gamma(k/2)}\gamma\!\left(\frac{k}{2}, \frac{x}{2}\right)$
- $f_X(x) = \frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\Gamma(k/2)}$
- $\mathbb{E}[X] = k$, $\mathbb{V}[X] = 2k$, $M_X(s) = (1 - 2s)^{-k/2}$ for $s < 1/2$

F $\mathrm{F}(d_1, d_2)$
- $F_X(x) = I_{\frac{d_1x}{d_1x + d_2}}\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)$
- $f_X(x) = \frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x + d_2)^{d_1+d_2}}}}{x\,B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$
- $\mathbb{E}[X] = \frac{d_2}{d_2 - 2}$ for $d_2 > 2$; $\mathbb{V}[X] = \frac{2d_2^2(d_1 + d_2 - 2)}{d_1(d_2 - 2)^2(d_2 - 4)}$ for $d_2 > 4$

Exponential* $\mathrm{Exp}(\beta)$
- $F_X(x) = 1 - e^{-\beta x}$, $f_X(x) = \beta e^{-\beta x}$
- $\mathbb{E}[X] = \frac{1}{\beta}$, $\mathbb{V}[X] = \frac{1}{\beta^2}$, $M_X(s) = \left(1 - \frac{s}{\beta}\right)^{-1}$ for $s < \beta$

Gamma* $\mathrm{Gamma}(\alpha, \beta)$
- $F_X(x) = \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)}$, $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$
- $\mathbb{E}[X] = \frac{\alpha}{\beta}$, $\mathbb{V}[X] = \frac{\alpha}{\beta^2}$, $M_X(s) = \left(1 - \frac{s}{\beta}\right)^{-\alpha}$ for $s < \beta$

Inverse Gamma $\mathrm{InvGamma}(\alpha, \beta)$
- $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$, $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha-1}e^{-\beta/x}$
- $\mathbb{E}[X] = \frac{\beta}{\alpha - 1}$ for $\alpha > 1$; $\mathbb{V}[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ for $\alpha > 2$
- $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}K_\alpha\!\left(\sqrt{-4\beta s}\right)$

Dirichlet $\mathrm{Dir}(\alpha)$
- $f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^k\alpha_i\right)}{\prod_{i=1}^k\Gamma(\alpha_i)}\prod_{i=1}^kx_i^{\alpha_i - 1}$
- $\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{i=1}^k\alpha_i}$, $\mathbb{V}[X_i] = \frac{\mathbb{E}[X_i](1 - \mathbb{E}[X_i])}{\sum_{i=1}^k\alpha_i + 1}$

Beta $\mathrm{Beta}(\alpha, \beta)$
- $F_X(x) = I_x(\alpha, \beta)$, $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$
- $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$, $\mathbb{V}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
- $M_X(s) = 1 + \sum_{k=1}^\infty\left(\prod_{r=0}^{k-1}\frac{\alpha + r}{\alpha + \beta + r}\right)\frac{s^k}{k!}$

Weibull $\mathrm{Weibull}(\lambda, k)$
- $F_X(x) = 1 - e^{-(x/\lambda)^k}$, $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k}$
- $\mathbb{E}[X] = \lambda\Gamma\!\left(1 + \frac{1}{k}\right)$, $\mathbb{V}[X] = \lambda^2\Gamma\!\left(1 + \frac{2}{k}\right) - \mu^2$
- $M_X(s) = \sum_{n=0}^\infty\frac{s^n\lambda^n}{n!}\Gamma\!\left(1 + \frac{n}{k}\right)$

Pareto $\mathrm{Pareto}(x_m, \alpha)$
- $F_X(x) = 1 - \left(\frac{x_m}{x}\right)^\alpha$ for $x \ge x_m$, $f_X(x) = \frac{\alpha x_m^\alpha}{x^{\alpha+1}}$ for $x \ge x_m$
- $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$ for $\alpha > 1$; $\mathbb{V}[X] = \frac{x_m^2\alpha}{(\alpha-1)^2(\alpha-2)}$ for $\alpha > 2$
- $M_X(s) = \alpha(-x_ms)^\alpha\Gamma(-\alpha, -x_ms)$ for $s < 0$

* We use the rate parameterization, i.e. $\beta = 1/\lambda$; some textbooks use $\beta$ as a scale parameter instead [6].
[Figure: PDF and CDF plots of the continuous Uniform, Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for selected parameter values.]
2 Probability Theory

Definitions
- Sample space $\Omega$
- Outcome (point or element) $\omega \in \Omega$
- Event $A \subseteq \Omega$
- $\sigma$-algebra $\mathcal{A}$:
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
- Probability distribution $\mathbb{P}$:
  1. $\mathbb{P}[A] \ge 0$ for every $A$
  2. $\mathbb{P}[\Omega] = 1$
  3. $\mathbb{P}\!\left[\bigsqcup_{i=1}^\infty A_i\right] = \sum_{i=1}^\infty\mathbb{P}[A_i]$
- Probability space $(\Omega, \mathcal{A}, \mathbb{P})$

Properties
- $\mathbb{P}[\emptyset] = 0$
- $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
- $\mathbb{P}[\neg A] = 1 - \mathbb{P}[A]$
- $\mathbb{P}[B] = \mathbb{P}[A \cap B] + \mathbb{P}[\neg A \cap B]$
- DeMorgan: $\neg\!\left(\bigcup_n A_n\right) = \bigcap_n\neg A_n$ and $\neg\!\left(\bigcap_n A_n\right) = \bigcup_n\neg A_n$
- $\mathbb{P}\!\left[\bigcup_n A_n\right] = 1 - \mathbb{P}\!\left[\bigcap_n\neg A_n\right]$
- $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B] \implies \mathbb{P}[A \cup B] \le \mathbb{P}[A] + \mathbb{P}[B]$
- $\mathbb{P}[A \cup B] = \mathbb{P}[A \cap \neg B] + \mathbb{P}[\neg A \cap B] + \mathbb{P}[A \cap B]$
- $\mathbb{P}[A \cap \neg B] = \mathbb{P}[A] - \mathbb{P}[A \cap B]$

Continuity of Probabilities
- $A_1 \subset A_2 \subset \ldots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcup_{i=1}^\infty A_i$
- $A_1 \supset A_2 \supset \ldots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcap_{i=1}^\infty A_i$

Independence
$A \perp\!\!\!\perp B \iff \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$

Conditional Probability
$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}$, $\mathbb{P}[B] > 0$

Law of Total Probability
$\mathbb{P}[B] = \sum_{i=1}^n\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]$ where $\Omega = \bigsqcup_{i=1}^nA_i$

Bayes' Theorem
$\mathbb{P}[A_i \mid B] = \frac{\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^n\mathbb{P}[B \mid A_j]\,\mathbb{P}[A_j]}$ where $\Omega = \bigsqcup_{i=1}^nA_i$

Inclusion-Exclusion Principle
$\mathbb{P}\!\left[\bigcup_{i=1}^nA_i\right] = \sum_{r=1}^n(-1)^{r-1}\sum_{i \le i_1 < \cdots < i_r \le n}\mathbb{P}\!\left[\bigcap_{j=1}^rA_{i_j}\right]$

3 Random Variables

Random Variable (RV)
$X: \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
$f_X(x) = \mathbb{P}[X = x] = \mathbb{P}[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
$\mathbb{P}[a \le X \le b] = \int_a^bf(x)\,dx$

Cumulative Distribution Function (CDF)
$F_X: \mathbb{R} \to [0, 1]$, $F_X(x) = \mathbb{P}[X \le x]$
1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
2. Normalized: $\lim_{x\to-\infty}F(x) = 0$ and $\lim_{x\to\infty}F(x) = 1$
3. Right-continuous: $\lim_{y\downarrow x}F(y) = F(x)$

$\mathbb{P}[a \le Y \le b \mid X = x] = \int_a^bf_{Y|X}(y \mid x)\,dy$, $a \le b$
$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence
1. $\mathbb{P}[X \le x, Y \le y] = \mathbb{P}[X \le x]\,\mathbb{P}[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$
3.1 Transformations

Transformation function
$Z = \varphi(X)$

Discrete
$f_Z(z) = \mathbb{P}[\varphi(X) = z] = \mathbb{P}[\{x : \varphi(x) = z\}] = \mathbb{P}[X \in \varphi^{-1}(z)] = \sum_{x\in\varphi^{-1}(z)}f_X(x)$

Continuous
$F_Z(z) = \mathbb{P}[\varphi(X) \le z] = \int_{A_z}f(x)\,dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ is strictly monotone
$f_Z(z) = f_X(\varphi^{-1}(z))\left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x)\frac{1}{|J|}$

The Rule of the Lazy Statistician
$\mathbb{E}[Z] = \int\varphi(x)\,dF_X(x)$
$\mathbb{E}[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_AdF_X(x) = \mathbb{P}[X \in A]$

Convolution
- $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty}f_{X,Y}(x, z - x)\,dx \;\overset{X,Y\ge0}{=}\; \int_0^zf_{X,Y}(x, z - x)\,dx$
- $Z := |X - Y|$: $f_Z(z) = 2\int_0^{\infty}f_{X,Y}(x, z + x)\,dx$
- $Z := \frac{X}{Y}$: $f_Z(z) = \int_{-\infty}^{\infty}|y|f_{X,Y}(yz, y)\,dy \;\overset{\perp\!\!\!\perp}{=}\; \int_{-\infty}^{\infty}|y|f_X(yz)f_Y(y)\,dy$
4 Expectation

Definition and properties
- $\mathbb{E}[X] = \mu_X = \int x\,dF_X(x) = \sum_xxf_X(x)$ if $X$ is discrete, $\int xf_X(x)\,dx$ if $X$ is continuous
- $\mathbb{P}[X = c] = 1 \implies \mathbb{E}[X] = c$
- $\mathbb{E}[cX] = c\,\mathbb{E}[X]$
- $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
- $\mathbb{E}[XY] = \int_{X,Y}xy\,f_{X,Y}(x, y)\,dF_X(x)\,dF_Y(y)$
- $\mathbb{E}[\varphi(X)] \ne \varphi(\mathbb{E}[X])$ in general (cf. Jensen's inequality)
- $\mathbb{P}[X \ge Y] = 1 \implies \mathbb{E}[X] \ge \mathbb{E}[Y]$
- $\mathbb{P}[X = Y] = 1 \implies \mathbb{E}[X] = \mathbb{E}[Y]$
- $\mathbb{E}[X] = \sum_{x=1}^{\infty}\mathbb{P}[X \ge x]$ ($X$ discrete)

Sample mean
$\bar{X}_n = \frac{1}{n}\sum_{i=1}^nX_i$

Conditional expectation
- $\mathbb{E}[Y \mid X = x] = \int yf(y \mid x)\,dy$
- $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$
- $\mathbb{E}[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty}\varphi(x, y)f_{Y|X}(y \mid x)\,dy$
- $\mathbb{E}[\varphi(Y, Z) \mid X = x] = \iint\varphi(y, z)f_{(Y,Z)|X}(y, z \mid x)\,dy\,dz$
- $\mathbb{E}[Y + Z \mid X] = \mathbb{E}[Y \mid X] + \mathbb{E}[Z \mid X]$
- $\mathbb{E}[\varphi(X)Y \mid X] = \varphi(X)\,\mathbb{E}[Y \mid X]$
- $\mathbb{E}[Y \mid X] = c \implies \mathrm{Cov}[X, Y] = 0$

5 Variance

Definition and properties
- $\mathbb{V}[X] = \sigma_X^2 = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
- $\mathbb{V}\!\left[\sum_{i=1}^nX_i\right] = \sum_{i=1}^n\mathbb{V}[X_i] + \sum_{i\ne j}\mathrm{Cov}[X_i, X_j]$
- $\mathbb{V}\!\left[\sum_{i=1}^nX_i\right] = \sum_{i=1}^n\mathbb{V}[X_i]$ if $X_i \perp\!\!\!\perp X_j$

Standard deviation
$\mathrm{sd}[X] = \sqrt{\mathbb{V}[X]} = \sigma_X$

Covariance
- $\mathrm{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$
- $\mathrm{Cov}[X, a] = 0$
- $\mathrm{Cov}[X, X] = \mathbb{V}[X]$
- $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
- $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
- $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
- $\mathrm{Cov}\!\left[\sum_{i=1}^nX_i, \sum_{j=1}^mY_j\right] = \sum_{i=1}^n\sum_{j=1}^m\mathrm{Cov}[X_i, Y_j]$

Correlation
$\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$

Independence
$X \perp\!\!\!\perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$

Sample variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X}_n)^2$

Conditional variance
- $\mathbb{V}[Y \mid X] = \mathbb{E}\!\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right] = \mathbb{E}[Y^2 \mid X] - \mathbb{E}[Y \mid X]^2$
- $\mathbb{V}[Y] = \mathbb{E}[\mathbb{V}[Y \mid X]] + \mathbb{V}[\mathbb{E}[Y \mid X]]$

6 Inequalities

- Cauchy-Schwarz: $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$
- Markov: $\mathbb{P}[\varphi(X) \ge t] \le \frac{\mathbb{E}[\varphi(X)]}{t}$
- Chebyshev: $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathbb{V}[X]}{t^2}$
- Chernoff: $\mathbb{P}[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$, $\delta > -1$
- Hoeffding: $X_1, \ldots, X_n$ independent with $\mathbb{P}[X_i \in [a_i, b_i]] = 1$, $1 \le i \le n$:
  $\mathbb{P}[\bar{X} - \mathbb{E}[\bar{X}] \ge t] \le e^{-2nt^2}$, $t > 0$
  $\mathbb{P}[|\bar{X} - \mathbb{E}[\bar{X}]| \ge t] \le 2\exp\!\left(-\frac{2n^2t^2}{\sum_{i=1}^n(b_i - a_i)^2}\right)$, $t > 0$
- Jensen: $\mathbb{E}[\varphi(X)] \ge \varphi(\mathbb{E}[X])$ for convex $\varphi$

7 Distribution Relationships

Binomial
- $X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^nX_i \sim \mathrm{Bin}(n, p)$
- $X \sim \mathrm{Bin}(n, p),\ Y \sim \mathrm{Bin}(m, p) \implies X + Y \sim \mathrm{Bin}(n + m, p)$
- $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathrm{Po}(np)$ ($n$ large, $p$ small)
- $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
- $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
- $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^r\mathrm{Geo}(p)$
- $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum X_i \sim \mathrm{NBin}\!\left(\sum r_i, p\right)$
- $X \sim \mathrm{NBin}(r, p),\ Y \sim \mathrm{Bin}(s + r, p) \implies \mathbb{P}[X \le s] = \mathbb{P}[Y \ge r]$

Poisson
- $X_i \sim \mathrm{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^nX_i \sim \mathrm{Po}\!\left(\sum_{i=1}^n\lambda_i\right)$
- $X_i \sim \mathrm{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies X_i \,\Big|\, \sum_{j=1}^nX_j \sim \mathrm{Bin}\!\left(\sum_{j=1}^nX_j, \frac{\lambda_i}{\sum_{j=1}^n\lambda_j}\right)$

Exponential
- $X_i \sim \mathrm{Exp}(\beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^nX_i \sim \mathrm{Gamma}(n, \beta)$
- Memoryless property: $\mathbb{P}[X > x + y \mid X > y] = \mathbb{P}[X > x]$

Normal
- $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \sigma^2) \wedge Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_iX_i \sim \mathcal{N}\!\left(\sum_i\mu_i, \sum_i\sigma_i^2\right)$
- $\mathbb{P}[a < X \le b] = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)$
- $\Phi(-x) = 1 - \Phi(x)$, $\varphi'(x) = -x\varphi(x)$, $\varphi''(x) = (x^2 - 1)\varphi(x)$
- Upper quantile of $\mathcal{N}(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

Gamma
- $X \sim \mathrm{Gamma}(\alpha, \beta) \iff \beta X \sim \mathrm{Gamma}(\alpha, 1)$
- $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha}\mathrm{Exp}(\beta)$ for integer $\alpha$
- $X_i \sim \mathrm{Gamma}(\alpha_i, \beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_iX_i \sim \mathrm{Gamma}\!\left(\sum_i\alpha_i, \beta\right)$
- $\frac{\Gamma(\alpha)}{\lambda^{\alpha}} = \int_0^{\infty}x^{\alpha-1}e^{-\lambda x}\,dx$

Beta
- $\frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$
- $\mathbb{E}[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1}\,\mathbb{E}[X^{k-1}]$
- $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$

8 Probability and Moment Generating Functions

- $G_X(t) = \mathbb{E}[t^X]$, $|t| < 1$
- $M_X(t) = G_X(e^t) = \mathbb{E}[e^{Xt}] = \mathbb{E}\!\left[\sum_{i=0}^{\infty}\frac{(Xt)^i}{i!}\right] = \sum_{i=0}^{\infty}\frac{\mathbb{E}[X^i]}{i!}\,t^i$
- $\mathbb{P}[X = 0] = G_X(0)$
- $\mathbb{P}[X = 1] = G_X'(0)$
- $\mathbb{P}[X = i] = \frac{G_X^{(i)}(0)}{i!}$
- $\mathbb{E}[X] = G_X'(1^-)$
- $\mathbb{E}[X^k] = M_X^{(k)}(0)$
- $\mathbb{E}\!\left[\frac{X!}{(X-k)!}\right] = G_X^{(k)}(1^-)$
- $\mathbb{V}[X] = G_X''(1^-) + G_X'(1^-) - \left(G_X'(1^-)\right)^2$
- $G_X(t) = G_Y(t) \implies X \overset{d}{=} Y$

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp\!\!\!\perp Z$ and $Y = \rho X + \sqrt{1-\rho^2}\,Z$.

Joint density
$f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\!\left(-\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}\right)$

Conditionals
$(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

Independence
$X \perp\!\!\!\perp Y \iff \rho = 0$

9.2 Bivariate Normal
Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.
$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\!\left(-\frac{z}{2(1-\rho^2)}\right)$
where
$z = \left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)$

Conditional mean and variance
$\mathbb{E}[X \mid Y] = \mathbb{E}[X] + \rho\frac{\sigma_X}{\sigma_Y}(Y - \mathbb{E}[Y])$
$\mathbb{V}[X \mid Y] = \sigma_X^2(1 - \rho^2)$

9.3 Multivariate Normal
Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
$\Sigma = \begin{pmatrix}\mathbb{V}[X_1] & \cdots & \mathrm{Cov}[X_1, X_k]\\ \vdots & \ddots & \vdots\\ \mathrm{Cov}[X_k, X_1] & \cdots & \mathbb{V}[X_k]\end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
$f_X(x) = (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\!\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$

Properties
- $Z \sim \mathcal{N}(0, 1) \wedge X = \mu + \Sigma^{1/2}Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
- $X \sim \mathcal{N}(\mu, \Sigma)$ and $a$ a vector of length $k$ $\implies a^TX \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$
10 Convergence

Let $\{X_1, X_2, \ldots\}$ be a sequence of rv's and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and $F$ the cdf of $X$.

Types of Convergence
1. In distribution (weakly, in law): $X_n \overset{D}{\to} X$ if $\lim_{n\to\infty}F_n(t) = F(t)$ at all $t$ where $F$ is continuous
2. In probability: $X_n \overset{P}{\to} X$ if $(\forall\varepsilon > 0)\ \lim_{n\to\infty}\mathbb{P}[|X_n - X| > \varepsilon] = 0$
3. Almost surely (strongly): $X_n \overset{as}{\to} X$ if $\mathbb{P}\!\left[\lim_{n\to\infty}X_n = X\right] = \mathbb{P}\!\left[\omega\in\Omega : \lim_{n\to\infty}X_n(\omega) = X(\omega)\right] = 1$
4. In quadratic mean ($L_2$): $X_n \overset{qm}{\to} X$ if $\lim_{n\to\infty}\mathbb{E}\!\left[(X_n - X)^2\right] = 0$

Relationships
- $X_n \overset{qm}{\to} X \implies X_n \overset{P}{\to} X \implies X_n \overset{D}{\to} X$
- $X_n \overset{as}{\to} X \implies X_n \overset{P}{\to} X$
- $X_n \overset{D}{\to} X \wedge (\exists c\in\mathbb{R})\ \mathbb{P}[X = c] = 1 \implies X_n \overset{P}{\to} X$
- $X_n \overset{P}{\to} X \wedge Y_n \overset{P}{\to} Y \implies X_n + Y_n \overset{P}{\to} X + Y$
- $X_n \overset{qm}{\to} X \wedge Y_n \overset{qm}{\to} Y \implies X_n + Y_n \overset{qm}{\to} X + Y$
- $X_n \overset{P}{\to} X \wedge Y_n \overset{P}{\to} Y \implies X_nY_n \overset{P}{\to} XY$
- $X_n \overset{P}{\to} X \implies \varphi(X_n) \overset{P}{\to} \varphi(X)$
- $X_n \overset{D}{\to} X \implies \varphi(X_n) \overset{D}{\to} \varphi(X)$
- $X_n \overset{qm}{\to} b \iff \lim_{n\to\infty}\mathbb{E}[X_n] = b \wedge \lim_{n\to\infty}\mathbb{V}[X_n] = 0$
- $X_1, \ldots, X_n$ iid $\wedge\ \mathbb{E}[X] = \mu \wedge \mathbb{V}[X] < \infty \iff \bar{X}_n \overset{qm}{\to} \mu$

Slutzky's Theorem
- $X_n \overset{D}{\to} X$ and $Y_n \overset{P}{\to} c \implies X_n + Y_n \overset{D}{\to} X + c$
- $X_n \overset{D}{\to} X$ and $Y_n \overset{P}{\to} c \implies X_nY_n \overset{D}{\to} cX$
- In general: $X_n \overset{D}{\to} X$ and $Y_n \overset{D}{\to} Y$ does not imply $X_n + Y_n \overset{D}{\to} X + Y$

10.1 Law of Large Numbers (LLN)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid rv's with $\mathbb{E}[X_1] = \mu$.
Weak (WLLN): $\bar{X}_n \overset{P}{\to} \mu$ as $n\to\infty$
Strong (SLLN): $\bar{X}_n \overset{as}{\to} \mu$ as $n\to\infty$

10.2 Central Limit Theorem (CLT)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid rv's with $\mathbb{E}[X_1] = \mu$ and $\mathbb{V}[X_1] = \sigma^2$.
$Z_n := \frac{\bar{X}_n - \mu}{\sqrt{\mathbb{V}[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \overset{D}{\to} Z$ where $Z \sim \mathcal{N}(0, 1)$
$\lim_{n\to\infty}\mathbb{P}[Z_n \le z] = \Phi(z)$, $z\in\mathbb{R}$

CLT notations
$Z_n \approx \mathcal{N}(0, 1)$
$\bar{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$
$\bar{X}_n - \mu \approx \mathcal{N}\!\left(0, \frac{\sigma^2}{n}\right)$
$\sqrt{n}(\bar{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$
$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity correction
$\mathbb{P}[\bar{X}_n \le x] \approx \Phi\!\left(\frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$
$\mathbb{P}[\bar{X}_n \ge x] \approx 1 - \Phi\!\left(\frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$

Delta method
$Y_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \implies \varphi(Y_n) \approx \mathcal{N}\!\left(\varphi(\mu), (\varphi'(\mu))^2\frac{\sigma^2}{n}\right)$

11 Statistical Inference

Let $X_1, \ldots, X_n \overset{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation
- Point estimator $\hat\theta_n$ of $\theta$ is a rv: $\hat\theta_n = g(X_1, \ldots, X_n)$
- $\mathrm{bias}(\hat\theta_n) = \mathbb{E}[\hat\theta_n] - \theta$
- Consistency: $\hat\theta_n \overset{P}{\to} \theta$
- Sampling distribution: $F(\hat\theta_n)$
- Standard error: $\mathrm{se}(\hat\theta_n) = \sqrt{\mathbb{V}[\hat\theta_n]}$
- Mean squared error: $\mathrm{mse} = \mathbb{E}[(\hat\theta_n - \theta)^2] = \mathrm{bias}(\hat\theta_n)^2 + \mathbb{V}[\hat\theta_n]$
- $\lim_{n\to\infty}\mathrm{bias}(\hat\theta_n) = 0 \wedge \lim_{n\to\infty}\mathrm{se}(\hat\theta_n) = 0 \implies \hat\theta_n$ is consistent
- Asymptotic normality: $\frac{\hat\theta_n - \theta}{\mathrm{se}} \overset{D}{\to} \mathcal{N}(0, 1)$
- Slutzky's Theorem often lets us replace $\mathrm{se}(\hat\theta_n)$ by some (weakly) consistent estimator $\hat\sigma_n$.
11.2 Normal-Based Confidence Interval
Suppose $\hat\theta_n \approx \mathcal{N}(\theta, \widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $\mathbb{P}[Z > z_{\alpha/2}] = \alpha/2$ and $\mathbb{P}[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
$C_n = \hat\theta_n \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$
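A minimal Python sketch of the interval above (not from the original cookbook), using the sample mean as $\hat\theta_n$ and $\hat\sigma/\sqrt{n}$ as the estimated standard error; the data are simulated purely for illustration.

```python
import math, random, statistics
from statistics import NormalDist

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(200)]      # simulated sample

alpha = 0.05
theta_hat = statistics.fmean(data)                        # point estimate (sample mean)
se_hat = statistics.stdev(data) / math.sqrt(len(data))    # estimated standard error
z = NormalDist().inv_cdf(1 - alpha / 2)                   # z_{alpha/2}

ci = (theta_hat - z * se_hat, theta_hat + z * se_hat)     # C_n = theta_hat +/- z * se_hat
print(ci)
```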
11.3 Empirical distribution
Empirical Distribution Function (ECDF)
$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^nI(X_i \le x)$, where $I(X_i \le x) = 1$ if $X_i \le x$ and $0$ if $X_i > x$

Properties (for any fixed $x$)
- $\mathbb{E}[\hat{F}_n(x)] = F(x)$
- $\mathbb{V}[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
- $\mathrm{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
- $\hat{F}_n(x) \overset{P}{\to} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \ldots, X_n \sim F$)
$\mathbb{P}\!\left[\sup_x\left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$
$L(x) = \max\{\hat{F}_n(x) - \epsilon_n, 0\}$, $U(x) = \min\{\hat{F}_n(x) + \epsilon_n, 1\}$, $\epsilon_n = \sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}$
$\mathbb{P}[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$

11.4 Statistical Functionals
- Statistical functional: $T(F)$
- Plug-in estimator of $\theta = T(F)$: $\hat\theta_n = T(\hat{F}_n)$
- Linear functional: $T(F) = \int\varphi(x)\,dF_X(x)$
- Plug-in estimator for linear functional: $T(\hat{F}_n) = \int\varphi(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n\varphi(X_i)$
- Often: $T(\hat{F}_n) \approx \mathcal{N}(T(F), \widehat{\mathrm{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\mathrm{se}}$
- $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
- $\hat\mu = \bar{X}_n$
- $\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X}_n)^2$
- Skewness: $\hat\kappa = \frac{\frac{1}{n}\sum_{i=1}^n(X_i - \hat\mu)^3}{\hat\sigma^3}$
- $\hat\rho = \frac{\sum_{i=1}^n(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^n(X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^n(Y_i - \bar{Y}_n)^2}}$
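The ECDF, the DKW confidence band, and a plug-in estimate of a linear functional follow directly from the definitions above; a small illustrative Python sketch (simulated data, not part of the original text):

```python
import math, random

random.seed(1)
xs = sorted(random.expovariate(1.0) for _ in range(500))
n, alpha = len(xs), 0.05

def ecdf(x):
    """F_hat_n(x) = (1/n) * #{X_i <= x}"""
    return sum(xi <= x for xi in xs) / n

eps = math.sqrt(math.log(2 / alpha) / (2 * n))      # DKW band half-width
def band(x):
    f = ecdf(x)
    return max(f - eps, 0.0), min(f + eps, 1.0)     # (L(x), U(x))

# Plug-in estimate of the linear functional T(F) = E[X] = int x dF(x)
mean_plugin = sum(xs) / n
print(ecdf(1.0), band(1.0), mean_plugin)            # true F(1) = 1 - e^{-1}, true E[X] = 1
```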
12 Parametric Inference
Let $\mathfrak{F} = \{f(x; \theta) : \theta\in\Theta\}$ be a parametric model with parameter space $\Theta\subset\mathbb{R}^k$ and parameter $\theta = (\theta_1, \ldots, \theta_k)$.

12.1 Method of Moments
$j^{\text{th}}$ moment
$\alpha_j(\theta) = \mathbb{E}[X^j] = \int x^j\,dF_X(x)$

$j^{\text{th}}$ sample moment
$\hat\alpha_j = \frac{1}{n}\sum_{i=1}^nX_i^j$

Method of Moments estimator (MoM): solve the system
$\alpha_1(\theta) = \hat\alpha_1,\quad \alpha_2(\theta) = \hat\alpha_2,\quad \ldots,\quad \alpha_k(\theta) = \hat\alpha_k$

Properties of the MoM estimator
- $\hat\theta_n$ exists with probability tending to 1
- Consistency: $\hat\theta_n \overset{P}{\to} \theta$
- Asymptotic normality: $\sqrt{n}(\hat\theta - \theta) \overset{D}{\to} \mathcal{N}(0, \Sigma)$ where $\Sigma = g\,\mathbb{E}[YY^T]\,g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \frac{\partial}{\partial\theta}\alpha_j^{-1}(\theta)$
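As an illustration (not from the original text), the method of moments for a $\mathrm{Gamma}(\alpha, \beta)$ sample under the rate parameterization matches $\alpha_1(\theta) = \alpha/\beta$ and $\alpha_2(\theta) - \alpha_1(\theta)^2 = \alpha/\beta^2$ with the sample moments; the parameter values below are arbitrary.

```python
import random

random.seed(2)
alpha_true, beta_true = 3.0, 2.0                     # rate parameterization
xs = [random.gammavariate(alpha_true, 1 / beta_true) for _ in range(5000)]

n = len(xs)
m1 = sum(xs) / n                                     # first sample moment
m2 = sum(x * x for x in xs) / n                      # second sample moment
var_hat = m2 - m1 * m1

# Match alpha/beta = m1 and alpha/beta^2 = m2 - m1^2, then solve:
beta_mom = m1 / var_hat
alpha_mom = m1 * beta_mom
print(alpha_mom, beta_mom)                           # approximately (3, 2)
```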
12.2 Maximum Likelihood
Likelihood: $\mathcal{L}_n: \Theta \to [0, \infty)$
$\mathcal{L}_n(\theta) = \prod_{i=1}^nf(X_i; \theta)$

Log-likelihood
$\ell_n(\theta) = \log\mathcal{L}_n(\theta) = \sum_{i=1}^n\log f(X_i; \theta)$

Maximum likelihood estimator (mle)
$\mathcal{L}_n(\hat\theta_n) = \sup_\theta\mathcal{L}_n(\theta)$

Score function
$s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher information
$I(\theta) = \mathbb{V}_\theta[s(X; \theta)]\qquad I_n(\theta) = nI(\theta)$

Fisher information (exponential family)
$I(\theta) = \mathbb{E}_\theta\!\left[-\frac{\partial}{\partial\theta}s(X; \theta)\right]$

Observed Fisher information
$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^n\log f(X_i; \theta)$

Properties of the mle
- Consistency: $\hat\theta_n \overset{P}{\to} \theta$
- Equivariance: $\hat\theta_n$ is the mle $\implies \varphi(\hat\theta_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat\theta_n - \theta}{\mathrm{se}} \overset{D}{\to} \mathcal{N}(0, 1)$
  2. $\widehat{\mathrm{se}} \approx \sqrt{1/I_n(\hat\theta_n)}$ and $\frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} \overset{D}{\to} \mathcal{N}(0, 1)$
- Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde\theta_n$ is any other estimator, the asymptotic relative efficiency is $\mathrm{are}(\tilde\theta_n, \hat\theta_n) = \frac{\mathbb{V}[\hat\theta_n]}{\mathbb{V}[\tilde\theta_n]} \le 1$
- The mle is approximately the Bayes estimator

12.2.1 Delta Method
If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
$\frac{\hat\tau_n - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \overset{D}{\to} \mathcal{N}(0, 1)$
where $\hat\tau = \varphi(\hat\theta)$ is the mle of $\tau$ and $\widehat{\mathrm{se}} = |\varphi'(\hat\theta)|\,\widehat{\mathrm{se}}(\hat\theta_n)$

12.3 Multiparameter Models
Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be the mle.
$H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2}\qquad H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\partial\theta_k}$

Fisher information matrix
$I_n(\theta) = -\begin{pmatrix}\mathbb{E}_\theta[H_{11}] & \cdots & \mathbb{E}_\theta[H_{1k}]\\ \vdots & \ddots & \vdots\\ \mathbb{E}_\theta[H_{k1}] & \cdots & \mathbb{E}_\theta[H_{kk}]\end{pmatrix}$

Under appropriate regularity conditions
$(\hat\theta - \theta) \approx \mathcal{N}(0, J_n)$ with $J_n(\theta) = I_n^{-1}$.
Further, if $\hat\theta_j$ is the $j^{\text{th}}$ component of $\hat\theta$, then
$\frac{\hat\theta_j - \theta_j}{\widehat{\mathrm{se}}_j} \overset{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat\theta_j, \hat\theta_k] = J_n(j, k)$
12.3.1 Multiparameter delta method
Let $\tau = \varphi(\theta_1, \ldots, \theta_k)$ and let the gradient of $\varphi$ be
$\nabla\varphi = \left(\frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k}\right)^T$
Suppose $\nabla\varphi\big|_{\theta=\hat\theta} \ne 0$ and $\hat\tau = \varphi(\hat\theta)$. Then
$\frac{\hat\tau - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \overset{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\mathrm{se}}(\hat\tau) = \sqrt{\left(\widehat{\nabla\varphi}\right)^T\hat{J}_n\left(\widehat{\nabla\varphi}\right)}$, $\hat{J}_n = J_n(\hat\theta)$ and $\widehat{\nabla\varphi} = \nabla\varphi\big|_{\theta=\hat\theta}$.

12.4 Parametric Bootstrap
Sample from $f(x; \hat\theta_n)$ instead of from $\hat{F}_n$, where $\hat\theta_n$ could be the mle or the method of moments estimator.

13 Hypothesis Testing
$H_0: \theta\in\Theta_0$ versus $H_1: \theta\in\Theta_1$

Definitions
- Null hypothesis $H_0$
- Alternative hypothesis $H_1$
- Simple hypothesis: $\theta = \theta_0$
- Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
- Two-sided test: $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$
- One-sided test: $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$
- Critical value $c$
- Test statistic $T$
- Rejection region $R = \{x : T(x) > c\}$
- Power function $\beta(\theta) = \mathbb{P}[X \in R]$
- Power of a test: $\inf_{\theta\in\Theta_1}\beta(\theta) = 1 - \mathbb{P}[\text{Type II error}]$
- Test size: $\alpha = \mathbb{P}[\text{Type I error}] = \sup_{\theta\in\Theta_0}\beta(\theta)$

Test outcomes: if $H_0$ is true, rejecting $H_0$ is a Type I error (probability $\alpha$); if $H_1$ is true, retaining $H_0$ is a Type II error (probability $\beta$) and rejecting $H_0$ is the correct decision (power).

- p-value $= \sup_{\theta\in\Theta_0}\mathbb{P}_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}$
- p-value $= \sup_{\theta\in\Theta_0}\underbrace{\mathbb{P}_\theta[T(X^\star) \ge T(X)]}_{1 - F_\theta(T(X)),\ \text{since }T(X^\star)\sim F_\theta} = \inf\{\alpha : T(X) \in R_\alpha\}$

Interpreting the p-value as evidence:
- p-value $< 0.01$: very strong evidence against $H_0$
- $0.01$ to $0.05$: strong evidence against $H_0$
- $0.05$ to $0.1$: weak evidence against $H_0$
- $> 0.1$: little or no evidence against $H_0$

Wald test
- Two-sided test
- Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}}$
- $\mathbb{P}[|W| > z_{\alpha/2}] \to \alpha$
- p-value $= \mathbb{P}_{\theta_0}[|W| > |w|] \approx \mathbb{P}[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood ratio test (LRT)
- $T(X) = \frac{\sup_{\theta\in\Theta}\mathcal{L}_n(\theta)}{\sup_{\theta\in\Theta_0}\mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat\theta_n)}{\mathcal{L}_n(\hat\theta_{n,0})}$
- $\lambda(X) = 2\log T(X) \overset{D}{\to} \chi^2_{r-q}$, where $\sum_{i=1}^kZ_i^2 \sim \chi^2_k$ for $Z_1, \ldots, Z_k \overset{iid}{\sim} \mathcal{N}(0, 1)$
- p-value $= \mathbb{P}_{\theta_0}[\lambda(X) > \lambda(x)] \approx \mathbb{P}[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
- mle: $\hat{p}_n = \left(\frac{X_1}{n}, \ldots, \frac{X_k}{n}\right)$
- $T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^k\left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
- $\lambda(X) = 2\sum_{j=1}^kX_j\log\frac{\hat{p}_j}{p_{0j}} \overset{D}{\to} \chi^2_{k-1}$
- The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$

Pearson Chi-square Test
- $T = \sum_{j=1}^k\frac{(X_j - \mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$ where $\mathbb{E}[X_j] = np_{0j}$ under $H_0$
- $T \overset{D}{\to} \chi^2_{k-1}$
- p-value $= \mathbb{P}[\chi^2_{k-1} > T(x)]$
- The convergence to $\chi^2_{k-1}$ is faster than for the LRT, hence preferable for small $n$

Independence testing
- $I$ rows, $J$ columns, $X$ a multinomial sample of size $n$ over the $I\cdot J$ cells
- mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2\sum_{i=1}^I\sum_{j=1}^JX_{ij}\log\frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}}$
- Pearson: $T = \sum_{i=1}^I\sum_{j=1}^J\frac{(X_{ij} - \mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
- LRT and Pearson $\overset{D}{\to} \chi^2_\nu$, where $\nu = (I-1)(J-1)$

14 Exponential Family
Scalar parameter
$f_X(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)T(x)\}$

Vector parameter
$f_X(x \mid \theta) = h(x)\exp\!\left\{\sum_{i=1}^s\eta_i(\theta)T_i(x) - A(\theta)\right\} = h(x)\exp\{\eta(\theta)\cdot T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$

Natural form
$f_X(x \mid \eta) = h(x)\exp\{\eta\cdot T(x) - A(\eta)\} = h(x)g(\eta)\exp\{\eta\cdot T(x)\} = h(x)g(\eta)\exp\{\eta^TT(x)\}$

15 Bayesian Inference
Bayes' Theorem
$f(\theta \mid x^n) = \frac{f(x^n \mid \theta)f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta)f(\theta)}{\int f(x^n \mid \theta)f(\theta)\,d\theta} \propto \mathcal{L}_n(\theta)f(\theta)$

Definitions
- $X^n = (X_1, \ldots, X_n)$, $x^n = (x_1, \ldots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^nf(x_i \mid \theta) = \mathcal{L}_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta)f(\theta)\,d\theta$
- Kernel: the part of a density that depends on $\theta$
- Posterior mean $\bar\theta_n = \int\theta f(\theta \mid x^n)\,d\theta = \frac{\int\theta\mathcal{L}_n(\theta)f(\theta)\,d\theta}{\int\mathcal{L}_n(\theta)f(\theta)\,d\theta}$

15.1 Credible Intervals
Posterior interval
$\mathbb{P}[\theta\in(a, b) \mid x^n] = \int_a^bf(\theta \mid x^n)\,d\theta = 1 - \alpha$

Equal-tail credible interval
$\int_{-\infty}^af(\theta \mid x^n)\,d\theta = \int_b^{\infty}f(\theta \mid x^n)\,d\theta = \alpha/2$

Highest posterior density (HPD) region $R_n$
1. $\mathbb{P}[\theta\in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
$R_n$ unimodal $\implies R_n$ is an interval
15.2 Function of parameters
Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
$H(\tau \mid x^n) = \mathbb{P}[\varphi(\theta) \le \tau \mid x^n] = \int_Af(\theta \mid x^n)\,d\theta$

Posterior density
$h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method
$\tau \mid X^n \approx \mathcal{N}\!\left(\varphi(\hat\theta),\ \widehat{\mathrm{se}}\,|\varphi'(\hat\theta)|\right)$

15.3 Priors
Choice
- Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
- Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
- Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
- Flat: $f(\theta) \propto \text{constant}$
- Proper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta = 1$
- Improper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta = \infty$
- Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; multiparameter case $f(\theta) \propto \sqrt{\det(I(\theta))}$
- Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family

15.3.1 Conjugate Priors
Continuous likelihood (subscript $c$ denotes a fixed, known constant); each entry lists likelihood, conjugate prior, and posterior hyperparameters:
- $\mathrm{Unif}(0, \theta)$; $\mathrm{Pareto}(x_m, k)$; $\mathrm{Pareto}\!\left(\max\{x_{(n)}, x_m\},\ k + n\right)$
- $\mathrm{Exp}(\lambda)$; $\mathrm{Gamma}(\alpha, \beta)$; $\mathrm{Gamma}\!\left(\alpha + n,\ \beta + \sum_{i=1}^nx_i\right)$
- $\mathcal{N}(\mu, \sigma_c^2)$; $\mathcal{N}(\mu_0, \sigma_0^2)$; $\mathcal{N}\!\left(\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^nx_i}{\sigma_c^2}\right)\Big/\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2}\right),\ \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2}\right)^{-1}\right)$
- $\mathcal{N}(\mu_c, \sigma^2)$; Scaled Inverse Chi-square$(\nu, \sigma_0^2)$; Scaled Inverse Chi-square$\!\left(\nu + n,\ \frac{\nu\sigma_0^2 + \sum_{i=1}^n(x_i - \mu_c)^2}{\nu + n}\right)$
- $\mathcal{N}(\mu, \sigma^2)$; Normal-scaled Inverse Gamma$(\lambda, \nu, \alpha, \beta)$; $\left(\frac{\nu\lambda + n\bar{x}}{\nu + n},\ \nu + n,\ \alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^n(x_i - \bar{x})^2 + \frac{n\nu(\bar{x} - \lambda)^2}{2(\nu + n)}\right)$
- $\mathrm{MVN}(\mu, \Sigma_c)$; $\mathrm{MVN}(\mu_0, \Sigma_0)$; $\mathrm{MVN}\!\left(\left(\Sigma_0^{-1} + n\Sigma_c^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x}\right),\ \left(\Sigma_0^{-1} + n\Sigma_c^{-1}\right)^{-1}\right)$
- $\mathrm{MVN}(\mu_c, \Sigma)$; InverseWishart$(\kappa, \Psi)$; InverseWishart$\!\left(\kappa + n,\ \Psi + \sum_{i=1}^n(x_i - \mu_c)(x_i - \mu_c)^T\right)$
- $\mathrm{Pareto}(x_{mc}, k)$; $\mathrm{Gamma}(\alpha, \beta)$; $\mathrm{Gamma}\!\left(\alpha + n,\ \beta + \sum_{i=1}^n\log\frac{x_i}{x_{mc}}\right)$
- $\mathrm{Pareto}(x_m, k_c)$; $\mathrm{Pareto}(x_0, k_0)$; $\mathrm{Pareto}(x_0,\ k_0 - nk_c)$ where $k_0 > nk_c$
- $\mathrm{Gamma}(\alpha_c, \beta)$; $\mathrm{Gamma}(\alpha_0, \beta_0)$; $\mathrm{Gamma}\!\left(\alpha_0 + n\alpha_c,\ \beta_0 + \sum_{i=1}^nx_i\right)$
Discrete likelihood; each entry lists likelihood, conjugate prior, and posterior hyperparameters:
- $\mathrm{Bern}(p)$; $\mathrm{Beta}(\alpha, \beta)$; $\mathrm{Beta}\!\left(\alpha + \sum_{i=1}^nx_i,\ \beta + n - \sum_{i=1}^nx_i\right)$
- $\mathrm{Bin}(p)$; $\mathrm{Beta}(\alpha, \beta)$; $\mathrm{Beta}\!\left(\alpha + \sum_{i=1}^nx_i,\ \beta + \sum_{i=1}^nN_i - \sum_{i=1}^nx_i\right)$
- $\mathrm{NBin}(p)$; $\mathrm{Beta}(\alpha, \beta)$; $\mathrm{Beta}\!\left(\alpha + rn,\ \beta + \sum_{i=1}^nx_i\right)$
- $\mathrm{Po}(\lambda)$; $\mathrm{Gamma}(\alpha, \beta)$; $\mathrm{Gamma}\!\left(\alpha + \sum_{i=1}^nx_i,\ \beta + n\right)$
- $\mathrm{Multinomial}(p)$; $\mathrm{Dir}(\alpha)$; $\mathrm{Dir}\!\left(\alpha + \sum_{i=1}^nx^{(i)}\right)$
- $\mathrm{Geo}(p)$; $\mathrm{Beta}(\alpha, \beta)$; $\mathrm{Beta}\!\left(\alpha + n,\ \beta + \sum_{i=1}^nx_i\right)$
15.4 Bayesian Testing
If $H_0: \theta\in\Theta_0$:
Prior probability $\mathbb{P}[H_0] = \int_{\Theta_0}f(\theta)\,d\theta$
Posterior probability $\mathbb{P}[H_0 \mid x^n] = \int_{\Theta_0}f(\theta \mid x^n)\,d\theta$

Let $H_0, \ldots, H_{K-1}$ be $K$ hypotheses and suppose $\theta \sim f(\theta \mid H_k)$. Then
$\mathbb{P}[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^Kf(x^n \mid H_k)\,\mathbb{P}[H_k]}$

Marginal likelihood
$f(x^n \mid H_i) = \int_\Theta f(x^n \mid \theta, H_i)\,f(\theta \mid H_i)\,d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
$\frac{\mathbb{P}[H_i \mid x^n]}{\mathbb{P}[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor }BF_{ij}} \times \underbrace{\frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}}_{\text{prior odds}}$

Bayes factor interpretation:
- $\log_{10}BF_{10}$ in $0$ to $0.5$ ($BF_{10}$ roughly $1$ to $3.2$): weak evidence
- $0.5$ to $1$ ($3.2$ to $10$): moderate evidence
- $1$ to $2$ ($10$ to $100$): strong evidence
- $> 2$ ($> 100$): decisive evidence

$p^* = \frac{\frac{p}{1-p}BF_{10}}{1 + \frac{p}{1-p}BF_{10}}$ where $p = \mathbb{P}[H_1]$ and $p^* = \mathbb{P}[H_1 \mid x^n]$
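A small illustrative sketch (arbitrary prior and data, not from the original) of the first row of the discrete conjugate table: with a $\mathrm{Beta}(\alpha, \beta)$ prior and Bernoulli data the posterior is $\mathrm{Beta}(\alpha + \sum x_i,\ \beta + n - \sum x_i)$, and a posterior probability such as $\mathbb{P}[p \le 0.5 \mid x^n]$ can be approximated by sampling from it.

```python
import random

random.seed(4)
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]         # observed Bernoulli trials
a0, b0 = 2.0, 2.0                              # Beta(alpha, beta) prior

s, n = sum(data), len(data)
a_post, b_post = a0 + s, b0 + n - s            # conjugate update

post_mean = a_post / (a_post + b_post)
draws = [random.betavariate(a_post, b_post) for _ in range(100_000)]
p_h0 = sum(d <= 0.5 for d in draws) / len(draws)   # approx. P[p <= 0.5 | x^n]
print(post_mean, p_h0)
```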
16 Sampling Methods

16.1 Inverse Transform Sampling
Setup
- $U \sim \mathrm{Unif}(0, 1)$
- $X \sim F$
- $F^{-1}(u) = \inf\{x \mid F(x) \ge u\}$

Algorithm
1. Generate $u \sim \mathrm{Unif}(0, 1)$
2. Compute $x = F^{-1}(u)$
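A minimal sketch of the algorithm for an $\mathrm{Exp}(\beta)$ target, whose inverse CDF is $F^{-1}(u) = -\log(1-u)/\beta$ (illustrative; uses the rate parameterization from §1.2).

```python
import math, random

random.seed(5)
beta = 2.0                                   # rate of the Exp(beta) target

def sample_exp(beta):
    u = random.random()                      # step 1: u ~ Unif(0, 1)
    return -math.log(1 - u) / beta           # step 2: x = F^{-1}(u)

xs = [sample_exp(beta) for _ in range(100_000)]
print(sum(xs) / len(xs))                     # should be close to E[X] = 1/beta = 0.5
```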
16.2 The Bootstrap
Let $T_n = g(X_1, \ldots, X_n)$ be a statistic.
1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat{F}_n}[T_n]$.
2. Approximate $\mathbb{V}_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \ldots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly $X_1^*, \ldots, X_n^* \sim \hat{F}_n$.
       ii. Compute $T_n^* = g(X_1^*, \ldots, X_n^*)$.
   (b) Then
       $v_{boot} = \hat{\mathbb{V}}_{\hat{F}_n} = \frac{1}{B}\sum_{b=1}^B\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^BT^*_{n,r}\right)^2$

16.2.1 Bootstrap Confidence Intervals
Normal-based interval
$T_n \pm z_{\alpha/2}\,\widehat{\mathrm{se}}_{boot}$

Pivotal interval
1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat\theta_n - \theta$
3. Let $H(r) = \mathbb{P}[R_n \le r]$ be the cdf of $R_n$
4. Let $R^*_{n,b} = \hat\theta^*_{n,b} - \hat\theta_n$. Approximate $H$ using the bootstrap:
   $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^BI(R^*_{n,b} \le r)$
5. $\theta^*_\beta$ = $\beta$ sample quantile of $(\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B})$
6. $r^*_\beta$ = $\beta$ sample quantile of $(R^*_{n,1}, \ldots, R^*_{n,B})$, i.e., $r^*_\beta = \theta^*_\beta - \hat\theta_n$
7. Approximate $1-\alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
   $\hat{a} = \hat\theta_n - \hat{H}^{-1}\!\left(1 - \frac{\alpha}{2}\right) = \hat\theta_n - r^*_{1-\alpha/2} = 2\hat\theta_n - \theta^*_{1-\alpha/2}$
   $\hat{b} = \hat\theta_n - \hat{H}^{-1}\!\left(\frac{\alpha}{2}\right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2}$

Percentile interval
$C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$
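An illustrative sketch of the nonparametric bootstrap for the sample median, producing the normal-based and percentile intervals above (simulated data, not from the original cookbook).

```python
import random, statistics
from statistics import NormalDist

random.seed(6)
xs = [random.expovariate(1.0) for _ in range(200)]   # observed sample
t_n = statistics.median(xs)                          # statistic T_n = g(X_1, ..., X_n)

B = 2000
t_star = []
for _ in range(B):
    resample = random.choices(xs, k=len(xs))         # X_1*, ..., X_n* drawn from F_hat_n
    t_star.append(statistics.median(resample))

se_boot = statistics.stdev(t_star)                   # sqrt(v_boot)
z = NormalDist().inv_cdf(0.975)
normal_ci = (t_n - z * se_boot, t_n + z * se_boot)

t_sorted = sorted(t_star)
percentile_ci = (t_sorted[int(0.025 * B)], t_sorted[int(0.975 * B) - 1])
print(normal_ci, percentile_ci)
```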
16.3 Rejection Sampling
Setup
- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
- Envelope condition: we can find $M > 0$ such that $k(\theta) \le Mg(\theta)$ for all $\theta$

Algorithm
1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \mathrm{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{Mg(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example (sampling from a posterior)
- We can easily sample from the prior $g(\theta) = f(\theta)$
- Target is the posterior $h(\theta) \propto k(\theta) = f(x^n \mid \theta)f(\theta)$
- Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat\theta_n) = \mathcal{L}_n(\hat\theta_n) \equiv M$
- Algorithm: draw $\theta^{cand} \sim f(\theta)$, generate $u \sim \mathrm{Unif}(0, 1)$, accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat\theta_n)}$
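A sketch of the general algorithm with an assumed target $k(\theta)$ proportional to a $\mathrm{Beta}(2, 5)$ kernel and a $\mathrm{Unif}(0, 1)$ proposal $g$; both choices are illustrative, not prescribed by the text.

```python
import random

random.seed(7)

def k(theta):                       # unnormalized target: Beta(2, 5) kernel
    return theta * (1 - theta) ** 4

M = 0.1                             # envelope: k(theta) <= M * g(theta) = M on (0, 1)
# (max of k is k(1/5) = 0.2 * 0.8**4, about 0.082, so M = 0.1 is a valid bound)

accepted = []
while len(accepted) < 10_000:
    cand = random.random()          # step 1: theta_cand ~ g = Unif(0, 1)
    u = random.random()             # step 2: u ~ Unif(0, 1)
    if u <= k(cand) / M:            # step 3: accept with prob k / (M g), here g = 1
        accepted.append(cand)

print(sum(accepted) / len(accepted))   # mean of Beta(2, 5) is 2/7, about 0.286
```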
16.4 Importance Sampling
Sample from an importance function $g$ rather than the target density $h$.
Algorithm to obtain an approximation to $\mathbb{E}[q(\theta) \mid x^n]$:
1. Sample from the prior $\theta_1, \ldots, \theta_B \sim f(\theta)$
2. Compute the weights $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^B\mathcal{L}_n(\theta_i)}$ for all $i = 1, \ldots, B$
3. $\mathbb{E}[q(\theta) \mid x^n] \approx \sum_{i=1}^Bq(\theta_i)w_i$
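A sketch of the weighted-prior recipe above for an assumed $\mathrm{Beta}(2, 2)$ prior and Bernoulli likelihood, approximating the posterior mean $\mathbb{E}[\theta \mid x^n]$; all concrete choices are illustrative.

```python
import random

random.seed(8)
data = [1, 1, 0, 1, 0, 1, 1, 1]                        # observed Bernoulli data
s, n = sum(data), len(data)

def likelihood(theta):                                 # L_n(theta) for Bernoulli data
    return theta ** s * (1 - theta) ** (n - s)

B = 50_000
thetas = [random.betavariate(2, 2) for _ in range(B)]  # step 1: sample from the prior
lik = [likelihood(t) for t in thetas]
total = sum(lik)
weights = [l / total for l in lik]                     # step 2: normalized weights w_i

post_mean = sum(t * w for t, w in zip(thetas, weights))  # step 3, with q the identity
print(post_mean)    # exact conjugate answer: (2 + s) / (4 + n) = 8/12, about 0.667
```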
17 Decision Theory
Definitions
- Unknown quantity affecting our decision: $\theta\in\Theta$
- Decision rule: synonymous with an estimator $\hat\theta$
- Action $a\in A$: a possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat\theta(x)$.
- Loss function $L$: the consequences of taking action $a$ when the true state is $\theta$, i.e., the discrepancy between $\theta$ and $\hat\theta$; $L: \Theta\times A \to [-k, \infty)$.

Loss functions
- Squared error loss: $L(\theta, a) = (\theta - a)^2$
- Linear loss: $L(\theta, a) = K_1(\theta - a)$ if $a - \theta < 0$, $K_2(a - \theta)$ if $a - \theta \ge 0$
- Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2 = 1$)
- $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
- Zero-one loss: $L(\theta, a) = 0$ if $a = \theta$, $1$ if $a \ne \theta$

17.1 Risk
Posterior risk
$r(\hat\theta \mid x) = \int L(\theta, \hat\theta(x))f(\theta \mid x)\,d\theta = \mathbb{E}_{\theta|X}\!\left[L(\theta, \hat\theta(x))\right]$

(Frequentist) risk
$R(\theta, \hat\theta) = \int L(\theta, \hat\theta(x))f(x \mid \theta)\,dx = \mathbb{E}_{X|\theta}\!\left[L(\theta, \hat\theta(X))\right]$
Bayes risk
$r(f, \hat\theta) = \iint L(\theta, \hat\theta(x))f(x, \theta)\,dx\,d\theta = \mathbb{E}_{\theta,X}\!\left[L(\theta, \hat\theta(X))\right]$
$r(f, \hat\theta) = \mathbb{E}_\theta\!\left[\mathbb{E}_{X|\theta}\!\left[L(\theta, \hat\theta(X))\right]\right] = \mathbb{E}_\theta\!\left[R(\theta, \hat\theta)\right]$
$r(f, \hat\theta) = \mathbb{E}_X\!\left[\mathbb{E}_{\theta|X}\!\left[L(\theta, \hat\theta(X))\right]\right] = \mathbb{E}_X\!\left[r(\hat\theta \mid X)\right]$

17.2 Admissibility
- $\hat\theta'$ dominates $\hat\theta$ if $\forall\theta: R(\theta, \hat\theta') \le R(\theta, \hat\theta)$ and $\exists\theta: R(\theta, \hat\theta') < R(\theta, \hat\theta)$
- $\hat\theta$ is inadmissible if there is at least one other estimator $\hat\theta'$ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator)
- $r(f, \hat\theta) = \inf_{\tilde\theta}r(f, \tilde\theta)$
- $\hat\theta(x) = \inf r(\hat\theta \mid x)$ for all $x$ $\implies r(f, \hat\theta) = \int r(\hat\theta \mid x)f(x)\,dx$

Theorems
- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk
$\bar{R}(\hat\theta) = \sup_\theta R(\theta, \hat\theta)\qquad \bar{R}(a) = \sup_\theta R(\theta, a)$

Minimax rule
$\sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta}\bar{R}(\tilde\theta) = \inf_{\tilde\theta}\sup_\theta R(\theta, \tilde\theta)$

$\hat\theta = $ Bayes rule $\wedge\ \exists c: R(\theta, \hat\theta) = c \implies \hat\theta$ is minimax

Least favorable prior
$\hat\theta^f = $ Bayes rule $\wedge\ R(\theta, \hat\theta^f) \le r(f, \hat\theta^f)\ \forall\theta \implies \hat\theta^f$ is minimax

18 Linear Regression
Definitions
- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$, $\mathbb{E}[\epsilon_i \mid X_i] = 0$, $\mathbb{V}[\epsilon_i \mid X_i] = \sigma^2$

Fitted line
$\hat{r}(x) = \hat\beta_0 + \hat\beta_1x$

Predicted (fitted) values
$\hat{Y}_i = \hat{r}(X_i)$

Residuals
$\hat\epsilon_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1X_i)$

Residual sums of squares (rss)
$\mathrm{rss}(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n\hat\epsilon_i^2$

Least squares estimates
$\hat\beta^T = (\hat\beta_0, \hat\beta_1)^T$ minimizes the rss:
$\hat\beta_0 = \bar{Y}_n - \hat\beta_1\bar{X}_n$
$\hat\beta_1 = \frac{\sum_{i=1}^n(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^n(X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^nX_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^nX_i^2 - n\bar{X}^2}$
$\mathbb{E}[\hat\beta \mid X^n] = (\beta_0, \beta_1)^T$
$\mathbb{V}[\hat\beta \mid X^n] = \frac{\sigma^2}{ns_X^2}\begin{pmatrix}n^{-1}\sum_{i=1}^nX_i^2 & -\bar{X}_n\\ -\bar{X}_n & 1\end{pmatrix}$
$\widehat{\mathrm{se}}(\hat\beta_0) = \frac{\hat\sigma}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^nX_i^2}{n}}\qquad \widehat{\mathrm{se}}(\hat\beta_1) = \frac{\hat\sigma}{s_X\sqrt{n}}$
where $s_X^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X}_n)^2$ and $\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n\hat\epsilon_i^2$ (unbiased estimate).

Further properties
- Consistency: $\hat\beta_0 \overset{P}{\to} \beta_0$ and $\hat\beta_1 \overset{P}{\to} \beta_1$
- Asymptotic normality: $\frac{\hat\beta_0 - \beta_0}{\widehat{\mathrm{se}}(\hat\beta_0)} \overset{D}{\to} \mathcal{N}(0, 1)$ and $\frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)} \overset{D}{\to} \mathcal{N}(0, 1)$
- Approximate $1-\alpha$ confidence intervals: $\hat\beta_0 \pm z_{\alpha/2}\widehat{\mathrm{se}}(\hat\beta_0)$ and $\hat\beta_1 \pm z_{\alpha/2}\widehat{\mathrm{se}}(\hat\beta_1)$
- Wald test for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1/\widehat{\mathrm{se}}(\hat\beta_1)$

$R^2$
$R^2 = \frac{\sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n\hat\epsilon_i^2}{\sum_{i=1}^n(Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$

Likelihood
$\mathcal{L} = \prod_{i=1}^nf(X_i, Y_i) = \prod_{i=1}^nf_X(X_i)\times\prod_{i=1}^nf_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1\times\mathcal{L}_2$
$\mathcal{L}_1 = \prod_{i=1}^nf_X(X_i)$
$\mathcal{L}_2 = \prod_{i=1}^nf_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\!\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i - (\beta_0 + \beta_1X_i)\right)^2\right\}$
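The closed-form least squares estimates above translate directly into code; an illustrative sketch on simulated data (parameter values are arbitrary).

```python
import math, random

random.seed(9)
n, beta0, beta1, sigma = 100, 1.0, 2.5, 1.0
X = [random.uniform(0, 10) for _ in range(n)]
Y = [beta0 + beta1 * x + random.gauss(0, sigma) for x in X]

xbar, ybar = sum(X) / n, sum(Y) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
sxx = sum((x - xbar) ** 2 for x in X)

b1 = sxy / sxx                    # beta1_hat
b0 = ybar - b1 * xbar             # beta0_hat = Ybar - beta1_hat * Xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
sigma2_hat = sum(e * e for e in residuals) / (n - 2)   # unbiased estimate of sigma^2
se_b1 = math.sqrt(sigma2_hat / sxx)                    # approx. sigma_hat / (s_X sqrt(n))
print(b0, b1, se_b1)
```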
18.2 Prediction
Observe the covariate value $X = x_*$ and predict the outcome $Y_*$.
$\hat{Y}_* = \hat\beta_0 + \hat\beta_1x_*$
$\mathbb{V}[\hat{Y}_*] = \mathbb{V}[\hat\beta_0] + x_*^2\mathbb{V}[\hat\beta_1] + 2x_*\mathrm{Cov}[\hat\beta_0, \hat\beta_1]$

Prediction interval
$\hat\xi_n^2 = \hat\sigma^2\left(\frac{\sum_{i=1}^n(X_i - x_*)^2}{n\sum_i(X_i - \bar{X})^2} + 1\right)$
$\hat{Y}_* \pm z_{\alpha/2}\hat\xi_n$

18.3 Multiple Regression
$Y = X\beta + \epsilon$
where
$X = \begin{pmatrix}X_{11} & \cdots & X_{1k}\\ \vdots & \ddots & \vdots\\ X_{n1} & \cdots & X_{nk}\end{pmatrix}\qquad \beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}\qquad \epsilon = \begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix}$

Likelihood
$\mathcal{L}(\mu, \Sigma) = (2\pi\sigma^2)^{-n/2}\exp\!\left\{-\frac{1}{2\sigma^2}\mathrm{rss}\right\}$
$\mathrm{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^N(Y_i - x_i^T\beta)^2$

If the $(k\times k)$ matrix $X^TX$ is invertible,
$\hat\beta = (X^TX)^{-1}X^TY$
$\mathbb{V}[\hat\beta \mid X^n] = \sigma^2(X^TX)^{-1}$
$\hat\beta \approx \mathcal{N}\!\left(\beta, \sigma^2(X^TX)^{-1}\right)$

Under the assumption of normality, the least squares estimator is also the mle, but the least squares variance estimator is not the mle:
mle: $\hat\sigma^2_{mle} = \frac{1}{n}\sum_{i=1}^n\hat\epsilon_i^2 = \frac{n-k}{n}\hat\sigma^2$

Estimated regression function
$\hat{r}(x) = \sum_{j=1}^k\hat\beta_jx_j$

Unbiased estimate for $\sigma^2$
$\hat\sigma^2 = \frac{1}{n-k}\sum_{i=1}^n\hat\epsilon_i^2\qquad \hat\epsilon = X\hat\beta - Y$

$1 - \alpha$ confidence interval
$\hat\beta_j \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_j)$
18.4 Model Selection
Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S\subset J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing
$H_0: \beta_j = 0$ vs. $H_1: \beta_j \ne 0$ for all $j\in J$

Mean squared prediction error (mspe)
$\mathrm{mspe} = \mathbb{E}\!\left[(\hat{Y}(S) - Y^*)^2\right]$

Prediction risk
$R(S) = \sum_{i=1}^n\mathrm{mspe}_i = \sum_{i=1}^n\mathbb{E}\!\left[(\hat{Y}_i(S) - Y_i^*)^2\right]$

Training error
$\hat{R}_{tr}(S) = \sum_{i=1}^n(\hat{Y}_i(S) - Y_i)^2$

$R^2$
$R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}} = \frac{\sum_{i=1}^n(\hat{Y}_i(S) - \bar{Y})^2}{\sum_{i=1}^n(Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk:
$\mathbb{E}[\hat{R}_{tr}(S)] < R(S)$
$\mathrm{bias}(\hat{R}_{tr}(S)) = \mathbb{E}[\hat{R}_{tr}(S)] - R(S) = -2\sum_{i=1}^n\mathrm{Cov}[\hat{Y}_i, Y_i]$

Adjusted $R^2$
$\bar{R}^2(S) = 1 - \frac{n-1}{n-k}\frac{\mathrm{rss}}{\mathrm{tss}}$

Mallow's $C_p$ statistic
$\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat\sigma^2 = \text{lack of fit} + \text{complexity penalty}$

Akaike Information Criterion (AIC)
$\mathrm{AIC}(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - k$

Bayesian Information Criterion (BIC)
$\mathrm{BIC}(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - \frac{k}{2}\log n$

Validation and training
$\hat{R}_V(S) = \sum_{i=1}^m(\hat{Y}_i^*(S) - Y_i^*)^2$, $m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
$\hat{R}_{CV}(S) = \sum_{i=1}^n(Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^n\left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$
$U(S) = X_S(X_S^TX_S)^{-1}X_S^T$ (the "hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate $f(x)$, where $\mathbb{P}[X\in A] = \int_Af(x)\,dx$.

Integrated square error (ise)
$L(f, \hat{f}_n) = \int\left(f(x) - \hat{f}_n(x)\right)^2dx = J(h) + \int f^2(x)\,dx$

Frequentist risk
$R(f, \hat{f}_n) = \mathbb{E}[L(f, \hat{f}_n)] = \int b^2(x)\,dx + \int v(x)\,dx$
$b(x) = \mathbb{E}[\hat{f}_n(x)] - f(x)\qquad v(x) = \mathbb{V}[\hat{f}_n(x)]$
19.1.1 Histograms
Definitions
- Number of bins $m$
- Binwidth $h = \frac{1}{m}$
- Bin $B_j$ has $\nu_j$ observations
- Define $\hat{p}_j = \nu_j/n$ and $p_j = \int_{B_j}f(u)\,du$

Histogram estimator
$\hat{f}_n(x) = \sum_{j=1}^m\frac{\hat{p}_j}{h}I(x\in B_j)$
$\mathbb{E}[\hat{f}_n(x)] = \frac{p_j}{h}\qquad \mathbb{V}[\hat{f}_n(x)] = \frac{p_j(1 - p_j)}{nh^2}$
$R(\hat{f}_n, f) \approx \frac{h^2}{12}\int(f'(u))^2du + \frac{1}{nh}$
$h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int(f'(u))^2du}\right)^{1/3}$
$R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}\qquad C = \left(\frac{3}{4}\right)^{2/3}\left(\int(f'(u))^2du\right)^{1/3}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
$\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^m\hat{p}_j^2$

19.1.2 Kernel Density Estimator (KDE)
Kernel $K$
- $K(x) \ge 0$
- $\int K(x)\,dx = 1$
- $\int xK(x)\,dx = 0$
- $\int x^2K(x)\,dx \equiv \sigma_K^2 > 0$

KDE
$\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^n\frac{1}{h}K\!\left(\frac{x - X_i}{h}\right)$
$R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int(f''(x))^2dx + \frac{1}{nh}\int K^2(x)\,dx$
$h^* = \frac{c_1^{-2/5}c_2^{1/5}c_3^{-1/5}}{n^{1/5}}$ where $c_1 = \sigma_K^2$, $c_2 = \int K^2(x)\,dx$, $c_3 = \int(f''(x))^2dx$
$R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$ where $c_4 = \underbrace{\frac{5}{4}(\sigma_K^2)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}}_{C(K)}\left(\int(f'')^2dx\right)^{1/5}$

Epanechnikov Kernel
$K(x) = \frac{3}{4\sqrt{5}}\left(1 - \frac{x^2}{5}\right)$ for $|x| < \sqrt{5}$, and $0$ otherwise

Cross-validation estimate of $\mathbb{E}[J(h)]$
$\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n\hat{f}_{(-i)}(X_i) \approx \frac{1}{hn^2}\sum_{i=1}^n\sum_{j=1}^nK^*\!\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh}K(0)$
$K^*(x) = K^{(2)}(x) - 2K(x)\qquad K^{(2)}(x) = \int K(x - y)K(y)\,dy$
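A direct implementation sketch of the KDE above, using a Gaussian kernel and a simple rule-of-thumb bandwidth (both choices are illustrative; the Epanechnikov kernel from the text works the same way).

```python
import math, random, statistics

random.seed(10)
xs = [random.gauss(0, 1) for _ in range(400)]

n = len(xs)
h = 1.06 * statistics.stdev(xs) * n ** (-1 / 5)     # rule-of-thumb bandwidth, O(n^{-1/5})

def gauss_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def f_hat(x):
    """KDE: (1/n) * sum_i (1/h) K((x - X_i) / h)"""
    return sum(gauss_kernel((x - xi) / h) for xi in xs) / (n * h)

print(f_hat(0.0))   # should be close to the true N(0, 1) density at 0, about 0.399
```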
19.2 Non-parametric Regression
Estimate $r(x)$ where $r(x) = \mathbb{E}[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by
$Y_i = r(x_i) + \epsilon_i$, $\mathbb{E}[\epsilon_i] = 0$, $\mathbb{V}[\epsilon_i] = \sigma^2$

k-nearest Neighbor Estimator
$\hat{r}(x) = \frac{1}{k}\sum_{i: x_i\in N_k(x)}Y_i$ where $N_k(x) = \{k$ values of $x_1, \ldots, x_n$ closest to $x\}$

Nadaraya-Watson Kernel Estimator
$\hat{r}(x) = \sum_{i=1}^nw_i(x)Y_i$
$w_i(x) = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^nK\!\left(\frac{x - x_j}{h}\right)} \in [0, 1]$
$R(\hat{r}_n, r) \approx \frac{h^4}{4}\left(\int x^2K(x)\,dx\right)^2\int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2dx + \int\frac{\sigma^2\int K^2(u)\,du}{nhf(x)}\,dx$
$h^* \approx \frac{c_1}{n^{1/5}}\qquad R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
$\hat{J}_{CV}(h) = \sum_{i=1}^n(Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^n\frac{(Y_i - \hat{r}(x_i))^2}{\left(1 - \frac{K(0)}{\sum_{j=1}^nK\left(\frac{x_i - x_j}{h}\right)}\right)^2}$
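A sketch of the Nadaraya-Watson estimator with a Gaussian kernel; the data, bandwidth, and regression function are illustrative choices.

```python
import math, random

random.seed(11)
n, h = 300, 0.3
x = [random.uniform(0, 3) for _ in range(n)]
y = [math.sin(xi) + random.gauss(0, 0.2) for xi in x]   # r(x) = sin(x) plus noise

def K(u):
    return math.exp(-0.5 * u * u)          # Gaussian kernel (normalization cancels in weights)

def r_hat(x0):
    """Nadaraya-Watson: sum_i w_i(x0) Y_i with w_i = K((x0-x_i)/h) / sum_j K((x0-x_j)/h)"""
    ks = [K((x0 - xi) / h) for xi in x]
    total = sum(ks)
    return sum(k * yi for k, yi in zip(ks, y)) / total

print(r_hat(1.5), math.sin(1.5))           # estimate vs the true regression function
```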
19.3 Smoothing Using Orthogonal Functions
Approximation
$r(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x) \approx \sum_{j=1}^J\beta_j\phi_j(x)$

Multivariate regression form
$Y = \Phi\beta + \eta$, where $\eta_i = \epsilon_i$ and
$\Phi = \begin{pmatrix}\phi_0(x_1) & \cdots & \phi_J(x_1)\\ \vdots & \ddots & \vdots\\ \phi_0(x_n) & \cdots & \phi_J(x_n)\end{pmatrix}$

Least squares estimator
$\hat\beta = (\Phi^T\Phi)^{-1}\Phi^TY \approx \frac{1}{n}\Phi^TY$ (for equally spaced observations only)

Cross-validation estimate of $\mathbb{E}[J(h)]$
$\hat{R}_{CV}(J) = \sum_{i=1}^n\left(Y_i - \sum_{j=1}^J\phi_j(x_i)\hat\beta_{j,(-i)}\right)^2$

20 Stochastic Processes
Stochastic Process
$\{X_t : t\in T\}$, with $T = \{0, \pm1, \ldots\} = \mathbb{Z}$ (discrete) or $T = [0, \infty)$ (continuous)
- Notations: $X_t$, $X(t)$
- State space $\mathcal{X}$
- Index set $T$

20.1 Markov Chains
Markov chain
$\mathbb{P}[X_n = x \mid X_0, \ldots, X_{n-1}] = \mathbb{P}[X_n = x \mid X_{n-1}]$ for all $n\in T$, $x\in\mathcal{X}$

Transition probabilities
$p_{ij} \equiv \mathbb{P}[X_{n+1} = j \mid X_n = i]$
$p_{ij}(n) \equiv \mathbb{P}[X_{m+n} = j \mid X_m = i]$ ($n$-step)

Transition matrix $P$ ($n$-step: $P_n$)
- $(i, j)$ element is $p_{ij}$
- $p_{ij} \ge 0$
- $\sum_jp_{ij} = 1$ (each row sums to one)

Chapman-Kolmogorov
$p_{ij}(m + n) = \sum_kp_{ik}(m)p_{kj}(n)$
$P_{m+n} = P_mP_n\qquad P_n = P\times\cdots\times P = P^n$

Marginal probability
$\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ where $\mu_n(i) = \mathbb{P}[X_n = i]$
$\mu_0$: initial distribution
$\mu_n = \mu_0P^n$
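A small sketch of the marginal recursion $\mu_n = \mu_0P^n$ for an assumed two-state chain; the transition matrix is illustrative.

```python
def step(mu, P):
    """One application of mu_{n+1} = mu_n P."""
    N = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(N)) for j in range(N)]

P = [[0.9, 0.1],       # row-stochastic transition matrix: P[i][j] = p_ij
     [0.4, 0.6]]
mu = [1.0, 0.0]        # initial distribution mu_0

for n in range(50):    # iterate to approximate mu_n = mu_0 P^n
    mu = step(mu, P)

print(mu)              # converges to the stationary distribution (0.8, 0.2)
```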
20.2 Poisson Processes
Poisson process
- $\{X_t : t\in[0, \infty)\}$ = number of events up to and including time $t$
- $X_0 = 0$
- Independent increments: $\forall t_0 < \cdots < t_n:\ X_{t_1} - X_{t_0} \perp\!\!\!\perp \cdots \perp\!\!\!\perp X_{t_n} - X_{t_{n-1}}$
- Intensity function $\lambda(t)$:
  $\mathbb{P}[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
  $\mathbb{P}[X_{t+h} - X_t = 2] = o(h)$
- $X_{s+t} - X_s \sim \mathrm{Po}(m(s+t) - m(s))$ where $m(t) = \int_0^t\lambda(s)\,ds$

Homogeneous Poisson process
$\lambda(t) \equiv \lambda \implies X_t \sim \mathrm{Po}(\lambda t)$, $\lambda > 0$

Waiting times
$W_t :=$ time at which $X_t$ occurs
$W_t \sim \mathrm{Gamma}(t, \lambda)$ (rate parameterization, as in §1.2)

Interarrival times
$S_t = W_{t+1} - W_t \sim \mathrm{Exp}(\lambda)$

21 Time Series
Mean function
$\mu_{xt} = \mathbb{E}[x_t] = \int_{-\infty}^{\infty}xf_t(x)\,dx$

Autocovariance function
$\gamma_x(s, t) = \mathbb{E}[(x_s - \mu_s)(x_t - \mu_t)] = \mathbb{E}[x_sx_t] - \mu_s\mu_t$
$\gamma_x(t, t) = \mathbb{E}[(x_t - \mu_t)^2] = \mathbb{V}[x_t]$

Autocorrelation function (ACF)
$\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{\mathbb{V}[x_s]\mathbb{V}[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}$

Cross-covariance function (CCV)
$\gamma_{xy}(s, t) = \mathbb{E}[(x_s - \mu_{xs})(y_t - \mu_{yt})]$

Cross-correlation function (CCF)
$\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\gamma_y(t, t)}}$

Backshift operator
$B^k(x_t) = x_{t-k}$

Difference operator
$\nabla^d = (1 - B)^d$

White noise
- $w_t \sim wn(0, \sigma_w^2)$
- Gaussian: $w_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
- $\mathbb{E}[w_t] = 0$, $t\in T$
- $\mathbb{V}[w_t] = \sigma^2$, $t\in T$
- $\gamma_w(s, t) = 0$, $s\ne t$, $s, t\in T$

Random walk
- Drift $\delta$
- $x_t = \delta t + \sum_{j=1}^tw_j$
- $\mathbb{E}[x_t] = \delta t$

Symmetric moving average
$m_t = \sum_{j=-k}^ka_jx_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^ka_j = 1$
21.1 Stationary Time Series
Strictly stationary
$\mathbb{P}[x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k] = \mathbb{P}[x_{t_1+h} \le c_1, \ldots, x_{t_k+h} \le c_k]$ for all $k\in\mathbb{N}$, $t_k, c_k, h\in\mathbb{Z}$

Weakly stationary
- $\mathbb{E}[x_t^2] < \infty$ for all $t\in\mathbb{Z}$
- $\mathbb{E}[x_t] = m$ for all $t\in\mathbb{Z}$
- $\gamma_x(s, t) = \gamma_x(s + r, t + r)$ for all $r, s, t\in\mathbb{Z}$

Autocovariance function
- $\gamma(h) = \mathbb{E}[(x_{t+h} - \mu)(x_t - \mu)]$
- $\gamma(0) = \mathbb{E}[(x_t - \mu)^2]$
- $\gamma(0) \ge 0$
- $\gamma(0) \ge |\gamma(h)|$
- $\gamma(h) = \gamma(-h)$ for all $h\in\mathbb{Z}$

Autocorrelation function (ACF)
$\rho_x(h) = \frac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{\mathbb{V}[x_{t+h}]\mathbb{V}[x_t]}} = \frac{\gamma(t+h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$

Jointly stationary time series
$\gamma_{xy}(h) = \mathbb{E}[(x_{t+h} - \mu_x)(y_t - \mu_y)]$
$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}$

Linear process
$x_t = \mu + \sum_{j=-\infty}^{\infty}\psi_jw_{t-j}$ where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$
$\gamma(h) = \sigma_w^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$

21.2 Estimation of Correlation
Sample mean
$\bar{x} = \frac{1}{n}\sum_{t=1}^nx_t$

Sample variance
$\mathbb{V}[\bar{x}] = \frac{1}{n}\sum_{h=-n}^n\left(1 - \frac{|h|}{n}\right)\gamma_x(h)$

Sample autocovariance function
$\hat\gamma(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_t - \bar{x})$

Sample autocorrelation function
$\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}$

Sample cross-covariance function
$\hat\gamma_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$

Sample cross-correlation function
$\hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0)\hat\gamma_y(0)}}$

Properties
- $\sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
- $\sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise

21.3 Non-Stationary Time Series
Classical decomposition model
$x_t = \mu_t + s_t + w_t$
- $\mu_t$ = trend
- $s_t$ = seasonal component
- $w_t$ = random noise term
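The sample autocovariance and autocorrelation of §21.2 in code; the series below is an illustrative simulated AR(1), for which the theoretical ACF is $\rho(h) = \phi^h$.

```python
import random

random.seed(12)
n, phi = 500, 0.7
x = [0.0]
for _ in range(n - 1):
    x.append(phi * x[-1] + random.gauss(0, 1))     # simulate an AR(1) series

xbar = sum(x) / n

def gamma_hat(h):
    """Sample autocovariance: (1/n) * sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar)"""
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n

def rho_hat(h):
    return gamma_hat(h) / gamma_hat(0)             # sample ACF

print([round(rho_hat(h), 3) for h in range(5)])    # roughly 1, 0.7, 0.49, 0.34, ... here
```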
21.3.1 Detrending
Least squares
1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1t + \beta_2t^2$
2. Minimize the rss to obtain the trend estimate $\hat\mu_t = \hat\beta_0 + \hat\beta_1t + \hat\beta_2t^2$
3. The residuals correspond to the noise $w_t$

Moving average
- The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
  $v_t = \frac{1}{2k+1}\sum_{i=-k}^kx_{t-i}$
- If $\frac{1}{2k+1}\sum_{i=-k}^kw_{t-i} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1t$ passes without distortion

Differencing
- $\mu_t = \beta_0 + \beta_1t \implies \nabla x_t = \beta_1$

21.4 ARIMA models
Autoregressive polynomial
$\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$, $z\in\mathbb{C}$ and $\phi_p\ne0$

Autoregressive operator
$\phi(B) = 1 - \phi_1B - \cdots - \phi_pB^p$

Autoregressive model of order $p$, AR($p$)
$x_t = \phi_1x_{t-1} + \cdots + \phi_px_{t-p} + w_t \iff \phi(B)x_t = w_t$

AR(1)
- $x_t = \phi^k(x_{t-k}) + \sum_{j=0}^{k-1}\phi^j(w_{t-j}) \;\overset{k\to\infty,\ |\phi|<1}{=}\; \sum_{j=0}^{\infty}\phi^jw_{t-j}$ (a linear process)
- $\mathbb{E}[x_t] = \sum_{j=0}^{\infty}\phi^j\mathbb{E}[w_{t-j}] = 0$
- $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2\phi^h}{1 - \phi^2}$
- $\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$
- $\rho(h) = \phi\,\rho(h - 1)$, $h = 1, 2, \ldots$

Moving average polynomial
$\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$, $z\in\mathbb{C}$ and $\theta_q\ne0$

Moving average operator
$\theta(B) = 1 + \theta_1B + \cdots + \theta_qB^q$

MA($q$) (moving average model of order $q$)
$x_t = w_t + \theta_1w_{t-1} + \cdots + \theta_qw_{t-q} \iff x_t = \theta(B)w_t$
$\mathbb{E}[x_t] = \sum_{j=0}^q\theta_j\mathbb{E}[w_{t-j}] = 0$
$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}$ for $0\le h\le q$, and $0$ for $h > q$

MA(1)
$x_t = w_t + \theta w_{t-1}$
$\gamma(h) = (1 + \theta^2)\sigma_w^2$ for $h = 0$; $\theta\sigma_w^2$ for $h = 1$; $0$ for $h > 1$
$\rho(h) = \frac{\theta}{1 + \theta^2}$ for $h = 1$; $0$ for $h > 1$

ARMA($p, q$)
$x_t = \phi_1x_{t-1} + \cdots + \phi_px_{t-p} + w_t + \theta_1w_{t-1} + \cdots + \theta_qw_{t-q}$
$\phi(B)x_t = \theta(B)w_t$

ARIMA($p, d, q$)
$\nabla^dx_t = (1 - B)^dx_t$ is ARMA($p, q$)
$\phi(B)(1 - B)^dx_t = \theta(B)w_t$

Partial autocorrelation function (PACF)
- $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \ldots, x_1\}$
- $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1})$, $h \ge 2$
- E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$
Exponentially Weighted Moving Average (EWMA)
$x_t = x_{t-1} + w_t - \lambda w_{t-1}$
$x_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1}x_{t-j} + w_t$ when $|\lambda| < 1$
$\tilde{x}_{n+1} = (1 - \lambda)x_n + \lambda\tilde{x}_n$

Seasonal ARIMA
- Denoted by ARIMA$(p, d, q)\times(P, D, Q)_s$
- $\Phi_P(B^s)\phi(B)\nabla_s^D\nabla^dx_t = \delta + \Theta_Q(B^s)\theta(B)w_t$

21.4.1 Causality and Invertibility
ARMA($p, q$) is causal (future-independent) $\iff \exists\{\psi_j\}: \sum_{j=0}^{\infty}|\psi_j| < \infty$ such that
$x_t = \sum_{j=0}^{\infty}\psi_jw_{t-j} = \psi(B)w_t$

ARMA($p, q$) is invertible $\iff \exists\{\pi_j\}: \sum_{j=0}^{\infty}|\pi_j| < \infty$ such that
$\pi(B)x_t = \sum_{j=0}^{\infty}\pi_jx_{t-j} = w_t$

Properties
- ARMA($p, q$) causal $\iff$ the roots of $\phi(z)$ lie outside the unit circle; then $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j = \frac{\theta(z)}{\phi(z)}$, $|z| \le 1$
- ARMA($p, q$) invertible $\iff$ the roots of $\theta(z)$ lie outside the unit circle; then $\pi(z) = \sum_{j=0}^{\infty}\pi_jz^j = \frac{\phi(z)}{\theta(z)}$, $|z| \le 1$

Behavior of the ACF and PACF for causal and invertible ARMA models
- AR($p$): ACF tails off; PACF cuts off after lag $p$
- MA($q$): ACF cuts off after lag $q$; PACF tails off
- ARMA($p, q$): ACF tails off; PACF tails off

21.5 Spectral Analysis
Periodic process
$x_t = A\cos(2\pi\omega t + \phi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)$
- Frequency index $\omega$ (cycles per unit time), period $1/\omega$
- Amplitude $A$
- Phase $\phi$
- $U_1 = A\cos\phi$ and $U_2 = A\sin\phi$ are often normally distributed rv's

Periodic mixture
$x_t = \sum_{k=1}^q(U_{k1}\cos(2\pi\omega_kt) + U_{k2}\sin(2\pi\omega_kt))$
- $U_{k1}, U_{k2}$, for $k = 1, \ldots, q$, are independent zero-mean rv's with variances $\sigma_k^2$
- $\gamma(h) = \sum_{k=1}^q\sigma_k^2\cos(2\pi\omega_kh)$
- $\gamma(0) = \mathbb{E}[x_t^2] = \sum_{k=1}^q\sigma_k^2$

Spectral representation of a periodic process
$\gamma(h) = \sigma^2\cos(2\pi\omega_0h) = \frac{\sigma^2}{2}e^{-2\pi i\omega_0h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0h} = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)$

Spectral distribution function
$F(\omega) = 0$ for $\omega < -\omega_0$; $\sigma^2/2$ for $-\omega_0 \le \omega < \omega_0$; $\sigma^2$ for $\omega \ge \omega_0$
- $F(-\infty) = F(-1/2) = 0$
- $F(\infty) = F(1/2) = \gamma(0)$

Spectral density
$f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}$, $-\frac{1}{2} \le \omega \le \frac{1}{2}$
- Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f(\omega)\,d\omega$, $h = 0, \pm1, \ldots$
- $f(\omega) \ge 0$
- $f(\omega) = f(-\omega)$
- $f(\omega) = f(1 - \omega)$
- $\gamma(0) = \mathbb{V}[x_t] = \int_{-1/2}^{1/2}f(\omega)\,d\omega$
- White noise: $f_w(\omega) = \sigma_w^2$
- ARMA($p, q$), $\phi(B)x_t = \theta(B)w_t$:
  $f_x(\omega) = \sigma_w^2\frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2}$
  where $\phi(z) = 1 - \sum_{k=1}^p\phi_kz^k$ and $\theta(z) = 1 + \sum_{k=1}^q\theta_kz^k$

Discrete Fourier Transform (DFT)
$d(\omega_j) = n^{-1/2}\sum_{t=1}^nx_te^{-2\pi i\omega_jt}$

Fourier/Fundamental frequencies
$\omega_j = j/n$

Inverse DFT
$x_t = n^{-1/2}\sum_{j=0}^{n-1}d(\omega_j)e^{2\pi i\omega_jt}$

Periodogram
$I(j/n) = |d(j/n)|^2$

Scaled Periodogram
$P(j/n) = \frac{4}{n}I(j/n) = \left(\frac{2}{n}\sum_{t=1}^nx_t\cos(2\pi tj/n)\right)^2 + \left(\frac{2}{n}\sum_{t=1}^nx_t\sin(2\pi tj/n)\right)^2$

22 Math

22.1 Gamma Function
- Ordinary: $\Gamma(s) = \int_0^{\infty}t^{s-1}e^{-t}\,dt$
- Upper incomplete: $\Gamma(s, x) = \int_x^{\infty}t^{s-1}e^{-t}\,dt$
- Lower incomplete: $\gamma(s, x) = \int_0^xt^{s-1}e^{-t}\,dt$
- $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$, $\alpha > 0$
- $\Gamma(n) = (n - 1)!$, $n\in\mathbb{N}$
- $\Gamma(0) = \Gamma(-1) = \infty$
- $\Gamma(1/2) = \sqrt{\pi}$
- $\Gamma(-1/2) = -2\Gamma(1/2) = -2\sqrt{\pi}$

22.2 Beta Function
- Ordinary: $B(x, y) = B(y, x) = \int_0^1t^{x-1}(1-t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$
- Incomplete: $B(x; a, b) = \int_0^xt^{a-1}(1-t)^{b-1}\,dt$
- Regularized incomplete:
  $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a,b\in\mathbb{N}}{=} \sum_{j=a}^{a+b-1}\frac{(a+b-1)!}{j!(a+b-1-j)!}x^j(1-x)^{a+b-1-j}$
- $I_0(a, b) = 0$, $I_1(a, b) = 1$
- $I_x(a, b) = 1 - I_{1-x}(b, a)$

22.3 Series
Finite
- $\sum_{k=1}^nk = \frac{n(n+1)}{2}$
- $\sum_{k=1}^n(2k - 1) = n^2$
- $\sum_{k=1}^nk^2 = \frac{n(n+1)(2n+1)}{6}$
- $\sum_{k=1}^nk^3 = \left(\frac{n(n+1)}{2}\right)^2$
- $\sum_{k=0}^nc^k = \frac{c^{n+1} - 1}{c - 1}$, $c\ne1$

Binomial
- $\sum_{k=0}^n\binom{n}{k} = 2^n$
- $\sum_{k=0}^n\binom{r+k}{k} = \binom{r+n+1}{n}$
- $\sum_{k=0}^n\binom{k}{m} = \binom{n+1}{m+1}$
- Vandermonde's Identity: $\sum_{k=0}^r\binom{m}{k}\binom{n}{r-k} = \binom{m+n}{r}$
- Binomial Theorem: $\sum_{k=0}^n\binom{n}{k}a^{n-k}b^k = (a+b)^n$

Infinite
- $\sum_{k=0}^{\infty}p^k = \frac{1}{1-p}$, $\sum_{k=1}^{\infty}p^k = \frac{p}{1-p}$, $|p| < 1$
- $\sum_{k=0}^{\infty}kp^{k-1} = \frac{d}{dp}\left(\sum_{k=0}^{\infty}p^k\right) = \frac{d}{dp}\left(\frac{1}{1-p}\right) = \frac{1}{(1-p)^2}$, $|p| < 1$
- $\sum_{k=0}^{\infty}\binom{r+k-1}{k}x^k = (1-x)^{-r}$, $r\in\mathbb{N}^+$
- $\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1+p)^{\alpha}$, $|p| < 1$, $\alpha\in\mathbb{C}$
22.4 Combinatorics
Sampling: drawing $k$ out of $n$
- Ordered, without replacement: $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n-k)!}$
- Unordered, without replacement: $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n-k)!}$
- Ordered, with replacement: $n^k$
- Unordered, with replacement: $\binom{n+k-1}{k} = \binom{n+k-1}{n-1}$

Stirling numbers, 2nd kind
$S(n, k) = k\,S(n-1, k) + S(n-1, k-1)$ for $1 \le k \le n$
$S(n, 0) = 1$ if $n = 0$, and $0$ otherwise

Partitions
$P_{n+k,k} = \sum_{i=1}^kP_{n,i}$, with $P_{n,k} = 0$ for $k > n$; $P_{n,0} = 0$ for $n \ge 1$; $P_{0,0} = 1$

Balls and Urns: number of functions $f: B\to U$ with $|B| = n$, $|U| = m$ (D = distinguishable, ¬D = indistinguishable)
- B: D, U: D — arbitrary: $m^n$; injective: $m^{\underline{n}}$ if $m \ge n$, else $0$; surjective: $m!\,S(n, m)$; bijective: $n!$ if $m = n$, else $0$
- B: ¬D, U: D — arbitrary: $\binom{m+n-1}{n}$; injective: $\binom{m}{n}$ if $m \ge n$, else $0$; surjective: $\binom{n-1}{m-1}$; bijective: $1$ if $m = n$, else $0$
- B: D, U: ¬D — arbitrary: $\sum_{k=1}^mS(n, k)$; injective: $1$ if $m \ge n$, else $0$; surjective: $S(n, m)$; bijective: $1$ if $m = n$, else $0$
- B: ¬D, U: ¬D — arbitrary: $\sum_{k=1}^mP_{n,k}$; injective: $1$ if $m \ge n$, else $0$; surjective: $P_{n,m}$; bijective: $1$ if $m = n$, else $0$

References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy Leemis and McQueston [2].]