Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2011
vallentin@icir.org
12th December, 2011
This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-Based Confidence Interval
   11.3 Empirical Distribution
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
        14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA Models
        21.4.1 Causality and Invertibility
   21.5 Spectral Analysis
22 Math
   22.1 Gamma Function
   22.2 Beta Function
   22.3 Series
   22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions¹

Uniform (discrete), Unif{a, …, b}
  f(x) = I(a ≤ x ≤ b)/(b − a + 1)
  F(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  E[X] = (a + b)/2,  V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{−(b+1)s}) / (s(b − a))

Bernoulli, Bern(p)
  f(x) = p^x (1 − p)^{1−x},  F(x) = (1 − p)^{1−x}
  E[X] = p,  V[X] = p(1 − p),  M_X(s) = 1 − p + pe^s

Binomial, Bin(n, p)
  f(x) = C(n, x) p^x (1 − p)^{n−x},  F(x) = I_{1−p}(n − x, x + 1)
  E[X] = np,  V[X] = np(1 − p),  M_X(s) = (1 − p + pe^s)^n

Multinomial, Mult(n, p)
  f(x) = n!/(x₁! ⋯ x_k!) · p₁^{x₁} ⋯ p_k^{x_k},  with Σ_{i=1}^k x_i = n
  E[X_i] = np_i,  V[X_i] = np_i(1 − p_i),  M_X(s) = (Σ_i p_i e^{s_i})^n

Hypergeometric, Hyp(N, m, n)
  f(x) = C(m, x) C(N − m, n − x) / C(N, n),  F(x) ≈ Φ((x − np)/√(np(1 − p)))
  E[X] = nm/N,  V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial, NBin(r, p)
  f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x,  F(x) = I_p(r, x + 1)
  E[X] = r(1 − p)/p,  V[X] = r(1 − p)/p²,  M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric, Geo(p)
  f(x) = p(1 − p)^{x−1},  x ∈ N⁺,  F(x) = 1 − (1 − p)^x
  E[X] = 1/p,  V[X] = (1 − p)/p²,  M_X(s) = pe^s / (1 − (1 − p)e^s)

Poisson, Po(λ)
  f(x) = λ^x e^{−λ} / x!,  x ∈ {0, 1, 2, …},  F(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  E[X] = λ,  V[X] = λ,  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMF plots of the discrete uniform, binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
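The tabulated moments are easy to sanity-check numerically. Below is a minimal sketch, not part of the original cookbook, assuming NumPy and SciPy are installed; it compares the listed mean and variance of Bin(n, p) and Po(λ) against simulation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, lam = 30, 0.6, 4.0

x_bin = rng.binomial(n, p, size=100_000)
x_poi = rng.poisson(lam, size=100_000)

# Tabulated values: E = np, V = np(1-p) for Bin(n, p); E = V = lambda for Po(lambda).
print(x_bin.mean(), n * p)             # ~18.0 vs 18.0
print(x_bin.var(), n * p * (1 - p))    # ~7.2 vs 7.2
print(x_poi.mean(), x_poi.var(), lam)  # both ~4.0

# PMF check at a single point: C(n, x) p^x (1-p)^(n-x)
print(stats.binom.pmf(18, n, p))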
1.2 Continuous Distributions

Uniform, Unif(a, b)
  f(x) = I(a < x < b)/(b − a)
  F(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  E[X] = (a + b)/2,  V[X] = (b − a)²/12,  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal, N(µ, σ²)
  f(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)),  F(x) = Φ(x) = ∫_{−∞}^x φ(t) dt
  E[X] = µ,  V[X] = σ²,  M_X(s) = exp(µs + σ²s²/2)

Log-Normal, ln N(µ, σ²)
  f(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²)),  F(x) = 1/2 + (1/2) erf((ln x − µ)/√(2σ²))
  E[X] = e^{µ + σ²/2},  V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal, MVN(µ, Σ)
  f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
  E[X] = µ,  V[X] = Σ,  M_X(s) = exp(µᵀs + (1/2) sᵀΣs)

Student's t, Student(ν)
  f(x) = (Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2},  F(x) = I_x(ν/2, ν/2)
  E[X] = 0 (ν > 1),  V[X] = ν/(ν − 2) (ν > 2)

Chi-square, χ²_k
  f(x) = x^{k/2 − 1} e^{−x/2} / (2^{k/2} Γ(k/2)),  F(x) = γ(k/2, x/2) / Γ(k/2)
  E[X] = k,  V[X] = 2k,  M_X(s) = (1 − 2s)^{−k/2}  (s < 1/2)

F, F(d₁, d₂)
  f(x) = √((d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2)),  F(x) = I_{d₁x/(d₁x+d₂)}(d₁/2, d₂/2)
  E[X] = d₂/(d₂ − 2),  V[X] = 2d₂²(d₁ + d₂ − 2) / (d₁(d₂ − 2)²(d₂ − 4))

Exponential, Exp(β)
  f(x) = (1/β) e^{−x/β},  F(x) = 1 − e^{−x/β}
  E[X] = β,  V[X] = β²,  M_X(s) = 1/(1 − βs)  (s < 1/β)

Gamma, Gamma(α, β)
  f(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α),  F(x) = γ(α, x/β) / Γ(α)
  E[X] = αβ,  V[X] = αβ²,  M_X(s) = (1/(1 − βs))^α  (s < 1/β)

Inverse Gamma, InvGamma(α, β)
  f(x) = (β^α / Γ(α)) x^{−α−1} e^{−β/x},  F(x) = Γ(α, β/x) / Γ(α)
  E[X] = β/(α − 1) (α > 1),  V[X] = β² / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2} / Γ(α)) K_α(√(−4βs))

Dirichlet, Dir(α)
  f(x) = (Γ(Σ_{i=1}^k α_i) / ∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i / Σ_j α_j,  V[X_i] = E[X_i](1 − E[X_i]) / (Σ_j α_j + 1)

Beta, Beta(α, β)
  f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1},  F(x) = I_x(α, β)
  E[X] = α/(α + β),  V[X] = αβ / ((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull, Weibull(λ, k)
  f(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k},  F(x) = 1 − e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k),  V[X] = λ²Γ(1 + 2/k) − µ²,  M_X(s) = Σ_{n=0}^∞ s^n λ^n Γ(1 + n/k) / n!

Pareto, Pareto(x_m, α)
  f(x) = α x_m^α / x^{α+1}, x ≥ x_m,  F(x) = 1 − (x_m/x)^α, x ≥ x_m
  E[X] = α x_m/(α − 1) (α > 1),  V[X] = x_m² α / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)
[Figure: PDF plots of the continuous uniform, normal, log-normal, Student's t, χ², F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]
2 Probability Theory

Definitions

• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A
  1. ∅ ∈ A
  2. A₁, A₂, … ∈ A  =⇒  ∪_{i=1}^∞ A_i ∈ A
  3. A ∈ A  =⇒  ¬A ∈ A
• Probability distribution P
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⊔_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i]
• Probability space (Ω, A, P)

Properties

• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[Ω] = 1,  P[∅] = 0
• DeMorgan: ¬(∪ₙ Aₙ) = ∩ₙ ¬Aₙ  and  ¬(∩ₙ Aₙ) = ∪ₙ ¬Aₙ
• P[∪ₙ Aₙ] = 1 − P[∩ₙ ¬Aₙ]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B]  =⇒  P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities

• A₁ ⊂ A₂ ⊂ …  =⇒  lim_{n→∞} P[Aₙ] = P[A]  where A = ∪_{i=1}^∞ A_i
• A₁ ⊃ A₂ ⊃ …  =⇒  lim_{n→∞} P[Aₙ] = P[A]  where A = ∩_{i=1}^∞ A_i

Independence ⊥

A ⊥ B  ⇐⇒  P[A ∩ B] = P[A] P[B]

Conditional Probability

P[A | B] = P[A ∩ B] / P[B]    (P[B] > 0)

Law of Total Probability

P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]    where Ω = ⊔_{i=1}^n A_i

Bayes' Theorem

P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]    where Ω = ⊔_{i=1}^n A_i

Inclusion-Exclusion Principle

|∪_{i=1}^n A_i| = Σ_{r=1}^n (−1)^{r−1} Σ_{i₁<⋯<i_r ≤ n} |∩_{j=1}^r A_{i_j}|

3 Random Variables

Random Variable (RV)

X : Ω → R

Probability Mass Function (PMF)

f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)

P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)

F_X : R → [0, 1],   F_X(x) = P[X ≤ x]

1. Nondecreasing: x₁ < x₂  =⇒  F(x₁) ≤ F(x₂)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density

f_{Y|X}(y | x) = f(x, y) / f_X(x)

P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy    (a ≤ b)

Independence

X ⊥ Y  ⇐⇒
1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function

Z = φ(X)

Discrete

f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ⁻¹(z)] = Σ_{x ∈ φ⁻¹(z)} f(x)

Continuous

F_Z(z) = P[φ(X) ≤ z] = ∫_{A_z} f(x) dx    with A_z = {x : φ(x) ≤ z}

Special case if φ strictly monotone

f_Z(z) = f_X(φ⁻¹(z)) |d φ⁻¹(z)/dz| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician

E[Z] = ∫ φ(x) dF_X(x)

E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution

• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx  =  ∫_0^z f_{X,Y}(x, z − x) dx  if X, Y ≥ 0
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx  =  ∫_{−∞}^∞ x f_X(x) f_Y(xz) dx  if X ⊥ Y

4 Expectation

Definition and properties

• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x)  (X discrete)  or  ∫ x f_X(x) dx  (X continuous)
• P[X = c] = 1  =⇒  E[c] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫_{X,Y} xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
• E[φ(Y)] ≠ φ(E[X]) in general  (cf. Jensen's inequality)
• P[X ≥ Y] = 1  =⇒  E[X] ≥ E[Y],  and  P[X = Y] = 1  =⇒  E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X taking values in the nonnegative integers)

Sample mean

X̄ₙ = (1/n) Σ_{i=1}^n X_i

Conditional expectation

• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[φ(X, Y) | X = x] = ∫_{−∞}^∞ φ(x, y) f_{Y|X}(y | x) dy
• E[φ(Y, Z) | X = x] = ∫∫_{−∞}^∞ φ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[φ(X)Y | X] = φ(X) E[Y | X]
• E[Y | X] = c  =⇒  Cov[X, Y] = 0
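As an illustration of the monotone change-of-variables rule in §3.1, here is a minimal simulation sketch (an addition, not from the original; NumPy and SciPy assumed). It checks that for Z = e^X with X ∼ N(0, 1), the transformed density φ(ln z)/z matches both the simulated histogram and the log-normal density.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Z = exp(X); the map is strictly monotone, so f_Z(z) = f_X(ln z) * |d(ln z)/dz| = phi(ln z)/z.
x = rng.normal(size=200_000)
z = np.exp(x)

grid = np.linspace(0.1, 5, 50)
transformed_pdf = stats.norm.pdf(np.log(grid)) / grid        # change of variables
hist, edges = np.histogram(z, bins=200, range=(0, 5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

print(np.max(np.abs(np.interp(grid, centers, hist) - transformed_pdf)))  # small simulation error
print(np.max(np.abs(transformed_pdf - stats.lognorm.pdf(grid, s=1))))    # essentially zero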
5 Variance

Definition and properties

• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^n X_i] = Σ_{i=1}^n V[X_i]   iff X_i ⊥ X_j

Standard deviation

sd[X] = √V[X] = σ_X

Covariance

• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]

Correlation

ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence

X ⊥ Y  =⇒  ρ[X, Y] = 0  ⇐⇒  Cov[X, Y] = 0  ⇐⇒  E[XY] = E[X] E[Y]

Sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄ₙ)²

Conditional variance

• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

• Cauchy-Schwarz:  E[XY]² ≤ E[X²] E[Y²]
• Markov:  P[φ(X) ≥ t] ≤ E[φ(X)] / t
• Chebyshev:  P[|X − E[X]| ≥ t] ≤ V[X] / t²
• Chernoff:  P[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^{1+δ})^µ,   δ > −1
• Jensen:  E[φ(X)] ≥ φ(E[X])   for φ convex

7 Distribution Relationships

Binomial

• X_i ∼ Bern(p)  =⇒  Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p)  =⇒  X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)   (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))   (n large, p far from 0 and 1)

Negative Binomial

• X ∼ NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geo(p)
• X_i ∼ NBin(r_i, p)  =⇒  Σ X_i ∼ NBin(Σ r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p)  =⇒  P[X ≤ s] = P[Y ≥ r]

Poisson

• X_i ∼ Po(λ_i) ∧ X_i ⊥ X_j  =⇒  Σ_{i=1}^n X_i ∼ Po(Σ_{i=1}^n λ_i)
• X_i ∼ Po(λ_i) ∧ X_i ⊥ X_j  =⇒  X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i / Σ_{j=1}^n λ_j)

Exponential

• X_i ∼ Exp(β) ∧ X_i ⊥ X_j  =⇒  Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal

• X ∼ N(µ, σ²)  =⇒  (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ²) ∧ Z = aX + b  =⇒  Z ∼ N(aµ + b, a²σ²)
• X ∼ N(µ₁, σ₁²) ∧ Y ∼ N(µ₂, σ₂²)  =⇒  X + Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²)
• X_i ∼ N(µ_i, σ_i²)  =⇒  Σ_i X_i ∼ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),   φ′(x) = −xφ(x),   φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma

• X ∼ Gamma(α, β)  ⇐⇒  X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ Σ_{i=1}^α Exp(β)
• X_i ∼ Gamma(α_i, β) ∧ X_i ⊥ X_j  =⇒  Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta

• (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)
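The inequalities in §6 are easy to probe empirically. The following minimal sketch (an addition; NumPy assumed) compares the empirical two-sided tail probability of an exponential sample with the Chebyshev bound V[X]/t².

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)   # E[X] = 2, V[X] = 4

for t in (2.0, 4.0, 6.0):
    empirical = np.mean(np.abs(x - x.mean()) >= t)
    chebyshev = x.var() / t**2                   # P[|X - E[X]| >= t] <= V[X]/t^2
    print(t, empirical, chebyshev)               # empirical tail never exceeds the bound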
8 Probability and Moment Generating Functions

• G_X(t) = E[t^X],   |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i / i!] = Σ_{i=0}^∞ E[X^i] t^i / i!
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0) / i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t)  =⇒  X =_d Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ∼ N(0, 1) with X ⊥ Z, and Y = ρX + √(1 − ρ²) Z.

Joint density

f(x, y) = (1 / (2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy) / (2(1 − ρ²)))

Conditionals

(Y | X = x) ∼ N(ρx, 1 − ρ²)   and   (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence

X ⊥ Y  ⇐⇒  ρ = 0

9.2 Bivariate Normal

Let X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²).

f(x, y) = (1 / (2πσ_xσ_y√(1 − ρ²))) exp(−z / (2(1 − ρ²)))

z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance

E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y]),   V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ⁻¹)

Σ = [ V[X₁] ⋯ Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ⋯ V[X_k] ]

If X ∼ N(µ, Σ),

f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

Properties

• Z ∼ N(0, 1) ∧ X = µ + Σ^{1/2} Z  =⇒  X ∼ N(µ, Σ)
• X ∼ N(µ, Σ)  =⇒  Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ)  =⇒  AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ) ∧ a a vector of length k  =⇒  aᵀX ∼ N(aᵀµ, aᵀΣa)

10 Convergence

Let {X₁, X₂, …} be a sequence of rv's and let X be another rv. Let Fₙ denote the cdf of Xₙ and let F denote the cdf of X.

Types of convergence

1. In distribution (weakly, in law): Xₙ →D X:   lim_{n→∞} Fₙ(t) = F(t)   ∀t where F continuous
2. In probability: Xₙ →P X:   (∀ε > 0) lim_{n→∞} P[|Xₙ − X| > ε] = 0
3. Almost surely (strongly): Xₙ →as X:   P[lim_{n→∞} Xₙ = X] = P[ω ∈ Ω : lim_{n→∞} Xₙ(ω) = X(ω)] = 1
4. In quadratic mean (L²): Xₙ →qm X:   lim_{n→∞} E[(Xₙ − X)²] = 0

Relationships

• Xₙ →qm X  =⇒  Xₙ →P X  =⇒  Xₙ →D X
• Xₙ →as X  =⇒  Xₙ →P X
• Xₙ →D X ∧ (∃c ∈ R) P[X = c] = 1  =⇒  Xₙ →P X
• Xₙ →P X ∧ Yₙ →P Y  =⇒  Xₙ + Yₙ →P X + Y
• Xₙ →qm X ∧ Yₙ →qm Y  =⇒  Xₙ + Yₙ →qm X + Y
• Xₙ →P X ∧ Yₙ →P Y  =⇒  XₙYₙ →P XY
• Xₙ →P X  =⇒  φ(Xₙ) →P φ(X)
• Xₙ →D X  =⇒  φ(Xₙ) →D φ(X)
• Xₙ →qm b  ⇐⇒  lim_{n→∞} E[Xₙ] = b ∧ lim_{n→∞} V[Xₙ] = 0
• X₁, …, Xₙ iid ∧ E[X] = µ ∧ V[X] < ∞  ⇐⇒  X̄ₙ →qm µ

Slutzky's Theorem

• Xₙ →D X and Yₙ →P c  =⇒  Xₙ + Yₙ →D X + c
• Xₙ →D X and Yₙ →P c  =⇒  XₙYₙ →D cX
• In general: Xₙ →D X and Yₙ →D Y does not imply Xₙ + Yₙ →D X + Y

10.1 Law of Large Numbers (LLN)

Let {X₁, …, Xₙ} be a sequence of iid rv's, E[X₁] = µ, and V[X₁] < ∞.

Weak (WLLN):  X̄ₙ →P µ  as n → ∞
Strong (SLLN):  X̄ₙ →as µ  as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X₁, …, Xₙ} be a sequence of iid rv's, E[X₁] = µ, and V[X₁] = σ².

Zₙ := (X̄ₙ − µ)/√(V[X̄ₙ]) = √n(X̄ₙ − µ)/σ →D Z   where Z ∼ N(0, 1)

lim_{n→∞} P[Zₙ ≤ z] = Φ(z)   z ∈ R

CLT notations

Zₙ ≈ N(0, 1)
X̄ₙ ≈ N(µ, σ²/n)
X̄ₙ − µ ≈ N(0, σ²/n)
√n(X̄ₙ − µ) ≈ N(0, σ²)
√n(X̄ₙ − µ)/σ ≈ N(0, 1)

Continuity correction

P[X̄ₙ ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n)),   P[X̄ₙ ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))

Delta method

Yₙ ≈ N(µ, σ²/n)  =⇒  φ(Yₙ) ≈ N(φ(µ), (φ′(µ))² σ²/n)

11 Statistical Inference

Let X₁, …, Xₙ iid ∼ F if not otherwise noted.

11.1 Point Estimation

• Point estimator θ̂ₙ of θ is a rv: θ̂ₙ = g(X₁, …, Xₙ)
• bias(θ̂ₙ) = E[θ̂ₙ] − θ
• Consistency: θ̂ₙ →P θ
• Sampling distribution: F(θ̂ₙ)
• Standard error: se(θ̂ₙ) = √V[θ̂ₙ]
• Mean squared error: mse = E[(θ̂ₙ − θ)²] = bias(θ̂ₙ)² + V[θ̂ₙ]
• lim_{n→∞} bias(θ̂ₙ) = 0 ∧ lim_{n→∞} se(θ̂ₙ) = 0  =⇒  θ̂ₙ is consistent
• Asymptotic normality: (θ̂ₙ − θ)/se →D N(0, 1)
• Slutzky's Theorem often lets us replace se(θ̂ₙ) by some (weakly) consistent estimator σ̂ₙ.
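A quick numerical illustration of §10.2 (an addition; NumPy and SciPy assumed): standardized sample means of exponential data are compared with the standard normal cdf.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 50, 20_000

# X_i ~ Exp with mean 1, so mu = 1 and sigma = 1.
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - 1.0) / 1.0          # sqrt(n)(X̄_n - mu)/sigma

for q in (-1.0, 0.0, 1.0, 2.0):
    print(q, np.mean(z <= q), stats.norm.cdf(q))   # empirical P[Z_n <= q] vs Phi(q)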
11.2 Normal-Based Confidence Interval

Suppose θ̂ₙ ≈ N(θ, sê²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then

Cₙ = θ̂ₙ ± z_{α/2} sê

11.3 Empirical Distribution

Empirical Distribution Function (ECDF)

F̂ₙ(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),   I(X_i ≤ x) = 1 if X_i ≤ x, 0 if X_i > x

Properties (for any fixed x)

• E[F̂ₙ(x)] = F(x)
• V[F̂ₙ(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂ₙ(x) →P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, …, Xₙ ∼ F)

P[sup_x |F(x) − F̂ₙ(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F

L(x) = max{F̂ₙ(x) − εₙ, 0},   U(x) = min{F̂ₙ(x) + εₙ, 1},   εₙ = √((1/(2n)) log(2/α))

P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α

11.4 Statistical Functionals

• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂ₙ = T(F̂ₙ)
• Linear functional: T(F) = ∫ φ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂ₙ) = ∫ φ(x) dF̂ₙ(x) = (1/n) Σ_{i=1}^n φ(X_i)
• Often: T(F̂ₙ) ≈ N(T(F), sê²)  =⇒  T(F̂ₙ) ± z_{α/2} sê
• pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄ₙ
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄ₙ)²
• Skewness: κ̂ = ((1/n) Σ_{i=1}^n (X_i − µ̂)³) / σ̂³
• Correlation: ρ̂ = Σ_{i=1}^n (X_i − X̄ₙ)(Y_i − Ȳₙ) / (√(Σ_{i=1}^n (X_i − X̄ₙ)²) √(Σ_{i=1}^n (Y_i − Ȳₙ)²))

12 Parametric Inference

Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ R^k and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments

jth moment

α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

jth sample moment

α̂_j = (1/n) Σ_{i=1}^n X_i^j

Method of moments estimator (MoM): θ̂ₙ solves

α₁(θ) = α̂₁,  α₂(θ) = α̂₂,  …,  α_k(θ) = α̂_k

Properties of the MoM estimator

• θ̂ₙ exists with probability tending to 1
• Consistency: θ̂ₙ →P θ
• Asymptotic normality: √n(θ̂ − θ) →D N(0, Σ)
  where Σ = g E[YYᵀ] gᵀ,  Y = (X, X², …, X^k)ᵀ,  g = (g₁, …, g_k) and g_j = ∂α_j⁻¹(θ)/∂θ
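The ECDF and the DKW band of §11.3 take only a few lines to compute. A minimal sketch (an addition; NumPy and SciPy assumed), checking that the band contains the true normal CDF:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.sort(rng.normal(size=200))
n, alpha = len(x), 0.05

ecdf = np.arange(1, n + 1) / n                 # F̂_n evaluated at the order statistics
eps = np.sqrt(np.log(2 / alpha) / (2 * n))     # ε_n = sqrt(log(2/α)/(2n)) from the DKW band
lower = np.clip(ecdf - eps, 0, 1)
upper = np.clip(ecdf + eps, 0, 1)

inside = np.all((lower <= norm.cdf(x)) & (norm.cdf(x) <= upper))
print(eps, inside)                             # band half-width; True with prob. >= 1 - alpha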
12.2 Maximum Likelihood

Likelihood: Lₙ : Θ → [0, ∞)

Lₙ(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood

ℓₙ(θ) = log Lₙ(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum likelihood estimator (mle)

Lₙ(θ̂ₙ) = sup_θ Lₙ(θ)

Score function

s(X; θ) = (∂/∂θ) log f(X; θ)

Fisher information

I(θ) = V_θ[s(X; θ)],   Iₙ(θ) = nI(θ)

Fisher information (exponential family)

I(θ) = E_θ[−(∂/∂θ) s(X; θ)]

Observed Fisher information

Iₙ^obs(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the mle

• Consistency: θ̂ₙ →P θ
• Equivariance: θ̂ₙ is the mle  =⇒  φ(θ̂ₙ) is the mle of φ(θ)
• Asymptotic normality:
  1. se ≈ √(1/Iₙ(θ))  and  (θ̂ₙ − θ)/se →D N(0, 1)
  2. sê ≈ √(1/Iₙ(θ̂ₙ))  and  (θ̂ₙ − θ)/sê →D N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃ₙ is any other estimator, the asymptotic relative efficiency is are(θ̃ₙ, θ̂ₙ) = V[θ̂ₙ]/V[θ̃ₙ] ≤ 1
• Approximately the Bayes estimator

12.2.1 Delta Method

If τ = φ(θ) where φ is differentiable and φ′(θ) ≠ 0:

(τ̂ₙ − τ)/sê(τ̂) →D N(0, 1)

where τ̂ = φ(θ̂) is the mle of τ and sê(τ̂) = φ′(θ̂) sê(θ̂ₙ).

12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and θ̂ = (θ̂₁, …, θ̂_k) be the mle.

H_jj = ∂²ℓₙ/∂θ_j²,   H_jk = ∂²ℓₙ/∂θ_j∂θ_k

Fisher information matrix

Iₙ(θ) = −[ E_θ[H₁₁] ⋯ E_θ[H_{1k}] ; ⋮ ⋱ ⋮ ; E_θ[H_{k1}] ⋯ E_θ[H_{kk}] ]

Under appropriate regularity conditions

(θ̂ − θ) ≈ N(0, Jₙ)   with Jₙ(θ) = Iₙ⁻¹

Further, if θ̂_j is the jth component of θ̂, then

(θ̂_j − θ_j)/sê_j →D N(0, 1)   where sê_j² = Jₙ(j, j) and Cov[θ̂_j, θ̂_k] = Jₙ(j, k)

12.3.1 Multiparameter Delta Method

Let τ = φ(θ₁, …, θ_k) and let the gradient of φ be ∇φ = (∂φ/∂θ₁, …, ∂φ/∂θ_k)ᵀ.

Suppose ∇φ|_{θ=θ̂} ≠ 0 and τ̂ = φ(θ̂). Then

(τ̂ − τ)/sê(τ̂) →D N(0, 1)

where sê(τ̂) = √((∇̂φ)ᵀ Ĵₙ (∇̂φ)), Ĵₙ = Jₙ(θ̂) and ∇̂φ = ∇φ|_{θ=θ̂}.

12.4 Parametric Bootstrap

Sample from f(x; θ̂ₙ) instead of from F̂ₙ, where θ̂ₙ could be the mle or the method of moments estimator.

13 Hypothesis Testing

H₀ : θ ∈ Θ₀   versus   H₁ : θ ∈ Θ₁

Definitions

• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

              Retain H₀             Reject H₀
H₀ true       √                     Type I error (α)
H₁ true       Type II error (β)     √ (power)

p-value

• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α}, since T(X*) ∼ F_θ; equivalently 1 − F_θ(T(X))

p-value        evidence
< 0.01         very strong evidence against H₀
0.01 − 0.05    strong evidence against H₀
0.05 − 0.1     weak evidence against H₀
> 0.1          little or no evidence against H₀

Wald test

• Two-sided test
• Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/sê
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)

• T(X) = sup_{θ∈Θ} Lₙ(θ) / sup_{θ∈Θ₀} Lₙ(θ) = Lₙ(θ̂ₙ) / Lₙ(θ̂ₙ,₀)
• λ(X) = 2 log T(X) →D χ²_{r−q}   where χ²_k = Σ_{i=1}^k Z_i² and Z₁, …, Z_k iid ∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT

• mle: p̂ₙ = (X₁/n, …, X_k/n)
• T(X) = Lₙ(p̂ₙ)/Lₙ(p₀) = ∏_{j=1}^k (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →D χ²_{k−1}
• The approximate size α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}
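The following is a minimal numerical sketch of §12.2 and the Wald test (an addition; NumPy and SciPy assumed). For Exp(β) data with density (1/β)e^{−x/β}, the mle is β̂ = X̄ and Iₙ(β) = n/β², so sê = β̂/√n.

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=500)          # true beta = 2

# Numerical mle of beta (negative log-likelihood of (1/beta) exp(-x/beta)).
negloglik = lambda b: len(x) * np.log(b) + x.sum() / b
res = optimize.minimize_scalar(negloglik, bounds=(1e-6, 100), method="bounded")
beta_hat = res.x                                  # analytically equal to the sample mean

se_hat = beta_hat / np.sqrt(len(x))               # 1/sqrt(I_n(beta_hat)) with I_n = n/beta^2
W = (beta_hat - 2.0) / se_hat                     # Wald statistic for H0: beta = 2
p_value = 2 * stats.norm.cdf(-abs(W))
print(beta_hat, x.mean(), se_hat, W, p_value)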
Pearson Chi-square Test

• T = Σ_{j=1}^k (X_j − E[X_j])² / E[X_j]   where E[X_j] = np₀_j under H₀
• T →D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• T converges to χ²_{k−1} faster than the LRT statistic, hence it is preferable for small n

Independence testing

• I rows, J columns, X multinomial sample of size n = I · J
• mles unconstrained: p̂_ij = X_ij/n
• mles under H₀: p̂₀_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(X_ij n / (X_i· X_·j))
• Pearson: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])² / E[X_ij]
• LRT and Pearson →D χ²_ν, where ν = (I − 1)(J − 1)

14 Bayesian Inference

Bayes' Theorem

f(θ | x^n) = f(x^n | θ) f(θ) / f(x^n) = f(x^n | θ) f(θ) / ∫ f(x^n | θ) f(θ) dθ ∝ Lₙ(θ) f(θ)

Definitions

• X^n = (X₁, …, Xₙ),   x^n = (x₁, …, xₙ)
• Prior density f(θ)
• Likelihood f(x^n | θ): joint density of the data; in particular, X^n iid  =⇒  f(x^n | θ) = ∏_{i=1}^n f(x_i | θ) = Lₙ(θ)
• Posterior density f(θ | x^n)
• Normalizing constant cₙ = f(x^n) = ∫ f(x^n | θ) f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄ₙ = ∫ θ f(θ | x^n) dθ = ∫ θ Lₙ(θ) f(θ) dθ / ∫ Lₙ(θ) f(θ) dθ

14.1 Credible Intervals

Posterior interval

P[θ ∈ (a, b) | x^n] = ∫_a^b f(θ | x^n) dθ = 1 − α

Equal-tail credible interval

∫_{−∞}^a f(θ | x^n) dθ = ∫_b^∞ f(θ | x^n) dθ = α/2

Highest posterior density (HPD) region Rₙ

1. P[θ ∈ Rₙ] = 1 − α
2. Rₙ = {θ : f(θ | x^n) > k} for some k

Rₙ is unimodal  =⇒  Rₙ is an interval

14.2 Function of Parameters

Let τ = φ(θ) and A = {θ : φ(θ) ≤ τ}.

Posterior CDF for τ

H(τ | x^n) = P[φ(θ) ≤ τ | x^n] = ∫_A f(θ | x^n) dθ

Posterior density

h(τ | x^n) = H′(τ | x^n)

Bayesian delta method

τ | X^n ≈ N(φ(θ̂), sê φ′(θ̂))
14.3 Priors

Choice

• Subjective bayesianism.
• Objective bayesianism.
• Robust bayesianism.

Types

• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffrey's prior (transformation-invariant): f(θ) ∝ √I(θ),   f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | x^n) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood (likelihood — conjugate prior — posterior hyperparameters):

• Bern(p) — Beta(α, β) — α + Σ_{i=1}^n x_i,  β + n − Σ_{i=1}^n x_i
• Bin(p) — Beta(α, β) — α + Σ_{i=1}^n x_i,  β + Σ_{i=1}^n (N_i − x_i)
• NBin(p) — Beta(α, β) — α + rn,  β + Σ_{i=1}^n x_i
• Po(λ) — Gamma(α, β) — α + Σ_{i=1}^n x_i,  β + n
• Multinomial(p) — Dir(α) — α + Σ_{i=1}^n x^{(i)}
• Geo(p) — Beta(α, β) — α + n,  β + Σ_{i=1}^n x_i

Continuous likelihood (subscript c denotes a constant; likelihood — conjugate prior — posterior hyperparameters):

• Unif(0, θ) — Pareto(x_m, k) — max{x_{(n)}, x_m},  k + n
• Exp(λ) — Gamma(α, β) — α + n,  β + Σ_{i=1}^n x_i
• N(µ, σ_c²) — N(µ₀, σ₀²) — mean (µ₀/σ₀² + Σ_{i=1}^n x_i/σ_c²) / (1/σ₀² + n/σ_c²),  variance (1/σ₀² + n/σ_c²)⁻¹
• N(µ_c, σ²) — Scaled Inverse Chi-square(ν, σ₀²) — ν + n,  (νσ₀² + Σ_{i=1}^n (x_i − µ_c)²) / (ν + n)
• N(µ, σ²) — Normal-scaled Inverse Gamma(λ, ν, α, β) — (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,  β + (1/2) Σ_{i=1}^n (x_i − x̄)² + nν(x̄ − λ)² / (2(n + ν))
• MVN(µ, Σ_c) — MVN(µ₀, Σ₀) — mean (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹ (Σ₀⁻¹µ₀ + nΣ_c⁻¹ x̄),  covariance (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹
• MVN(µ_c, Σ) — InverseWishart(κ, Ψ) — n + κ,  Ψ + Σ_{i=1}^n (x_i − µ_c)(x_i − µ_c)ᵀ
• Pareto(x_{mc}, k) — Gamma(α, β) — α + n,  β + Σ_{i=1}^n log(x_i / x_{mc})
• Pareto(x_m, k_c) — Pareto(x₀, k₀) — x₀,  k₀ − k_c n   where k₀ > k_c n
• Gamma(α_c, β) — Gamma(α₀, β₀) — α₀ + nα_c,  β₀ + Σ_{i=1}^n x_i
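The Bern(p)/Beta(α, β) row above is the standard worked example of a conjugate update. A minimal sketch (an addition; NumPy and SciPy assumed):

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, size=40)     # Bern(p) data with true p = 0.3

alpha0, beta0 = 2.0, 2.0              # Beta(alpha, beta) prior
alpha_n = alpha0 + x.sum()            # alpha + sum x_i
beta_n = beta0 + len(x) - x.sum()     # beta + n - sum x_i

posterior = stats.beta(alpha_n, beta_n)
print(posterior.mean())               # posterior mean of p
print(posterior.interval(0.95))       # equal-tail 95% credible interval (cf. §14.1)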
14.4 Bayesian Testing

If H₀ : θ ∈ Θ₀:

Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability P[H₀ | x^n] = ∫_{Θ₀} f(θ | x^n) dθ

Let H₀, …, H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k). Then

P[H_k | x^n] = f(x^n | H_k) P[H_k] / Σ_{k=1}^K f(x^n | H_k) P[H_k]

Marginal likelihood

f(x^n | H_i) = ∫_Θ f(x^n | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)

P[H_i | x^n] / P[H_j | x^n] = (f(x^n | H_i) / f(x^n | H_j)) × (P[H_i] / P[H_j])   (Bayes factor BF_ij × prior odds)

p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀)   where p = P[H₁] and p* = P[H₁ | x^n]

Bayes factor

log₁₀ BF₁₀     BF₁₀         evidence
0 − 0.5        1 − 3.2      Weak
0.5 − 1        3.2 − 10     Moderate
1 − 2          10 − 100     Strong
> 2            > 100        Decisive

15 Exponential Family

Scalar parameter

f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}

Vector parameter

f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ) · T(x) − A(θ)} = h(x) g(θ) exp{η(θ) · T(x)}

Natural form

f_X(x | η) = h(x) exp{η · T(x) − A(η)} = h(x) g(η) exp{η · T(x)} = h(x) g(η) exp{ηᵀ T(x)}

16 Sampling Methods

16.1 The Bootstrap

Let Tₙ = g(X₁, …, Xₙ) be a statistic.

1. Estimate V_F[Tₙ] with V_{F̂ₙ}[Tₙ].
2. Approximate V_{F̂ₙ}[Tₙ] using simulation:
   (a) Repeat the following B times to get T*ₙ,₁, …, T*ₙ,B, an iid sample from the sampling distribution implied by F̂ₙ:
       i. Sample uniformly X₁*, …, Xₙ* ∼ F̂ₙ.
       ii. Compute Tₙ* = g(X₁*, …, Xₙ*).
   (b) Then
       v_boot = V̂_{F̂ₙ} = (1/B) Σ_{b=1}^B (T*ₙ,b − (1/B) Σ_{r=1}^B T*ₙ,r)²

16.1.1 Bootstrap Confidence Intervals

Normal-based interval

Tₙ ± z_{α/2} sê_boot

Pivotal interval

1. Location parameter θ = T(F)
2. Pivot Rₙ = θ̂ₙ − θ
3. Let H(r) = P[Rₙ ≤ r] be the cdf of Rₙ
4. Let R*ₙ,b = θ̂*ₙ,b − θ̂ₙ. Approximate H using the bootstrap: Ĥ(r) = (1/B) Σ_{b=1}^B I(R*ₙ,b ≤ r)
5. θ*_β = β sample quantile of (θ̂*ₙ,₁, …, θ̂*ₙ,B)
6. r*_β = β sample quantile of (R*ₙ,₁, …, R*ₙ,B), i.e., r*_β = θ*_β − θ̂ₙ
7. Approximate 1 − α confidence interval Cₙ = (â, b̂) where
   â = θ̂ₙ − Ĥ⁻¹(1 − α/2) = θ̂ₙ − r*_{1−α/2} = 2θ̂ₙ − θ*_{1−α/2}
   b̂ = θ̂ₙ − Ĥ⁻¹(α/2) = θ̂ₙ − r*_{α/2} = 2θ̂ₙ − θ*_{α/2}

Percentile interval

Cₙ = (θ*_{α/2}, θ*_{1−α/2})
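A minimal bootstrap sketch for §16.1 and §16.1.1 (an addition; NumPy assumed): the bootstrap standard error and the three confidence intervals for the sample median.

import numpy as np

rng = np.random.default_rng(6)
x = rng.lognormal(size=100)
B = 5_000

t_hat = np.median(x)
t_star = np.array([np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])

se_boot = t_star.std(ddof=1)
normal_ci = (t_hat - 1.96 * se_boot, t_hat + 1.96 * se_boot)       # normal-based
pivotal_ci = (2 * t_hat - np.quantile(t_star, 0.975),              # 2θ̂ - θ*_{1-α/2}
              2 * t_hat - np.quantile(t_star, 0.025))              # 2θ̂ - θ*_{α/2}
percentile_ci = (np.quantile(t_star, 0.025), np.quantile(t_star, 0.975))
print(se_boot, normal_ci, pivotal_ci, percentile_ci)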
16.2 Rejection Sampling

Setup

• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportional constant: h(θ) = k(θ) / ∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ

Algorithm

1. Draw θ_cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand) / (M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example

• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior h(θ) ∝ k(θ) = f(x^n | θ) f(θ)
• Envelope condition: f(x^n | θ) ≤ f(x^n | θ̂ₙ) = Lₙ(θ̂ₙ) ≡ M
• Algorithm:
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ Lₙ(θ_cand) / Lₙ(θ̂ₙ)

16.3 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(θ) | x^n]:

1. Sample from the prior: θ₁, …, θ_B iid ∼ f(θ)
2. w_i = Lₙ(θ_i) / Σ_{i=1}^B Lₙ(θ_i)   ∀i = 1, …, B
3. E[q(θ) | x^n] ≈ Σ_{i=1}^B q(θ_i) w_i

17 Decision Theory

Definitions

• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions

• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0;  K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a|  (linear loss with K₁ = K₂)
• Lp loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ

17.1 Risk

Posterior risk

r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk

R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk

r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility

• θ̂′ dominates θ̂ if  ∀θ : R(θ, θ̂′) ≤ R(θ, θ̂)  and  ∃θ : R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.
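A minimal rejection-sampling sketch for §16.2 (an addition; NumPy assumed). The target is a Beta(2, 4) density known only up to the constant, k(θ) = θ(1 − θ)³, with a Unif(0, 1) proposal and envelope M ≥ max k ≈ 0.1055.

import numpy as np

rng = np.random.default_rng(7)

def k(theta):                    # unnormalized target: Beta(2, 4) kernel
    return theta * (1 - theta) ** 3

M = 0.11                         # envelope constant: k(theta) <= M * g(theta), g = Unif(0, 1)
samples = []
while len(samples) < 10_000:
    cand = rng.uniform()         # 1. draw candidate from g
    u = rng.uniform()            # 2. u ~ Unif(0, 1)
    if u <= k(cand) / M:         # 3. accept with probability k/(M g)
        samples.append(cand)

samples = np.array(samples)
print(samples.mean(), 1 / 3)     # close to E[Beta(2, 4)] = 1/3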
17.3 Bayes Rule

Bayes rule (or Bayes estimator)

• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) = inf r(θ̂ | x) ∀x  =⇒  r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems

• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk

R̄(θ̂) = sup_θ R(θ, θ̂),   R̄(a) = sup_θ R(θ, a)

Minimax rule

sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

θ̂ = Bayes rule ∧ ∃c : R(θ, θ̂) = c  =⇒  θ̂ is minimax

Least favorable prior

θ̂^f = Bayes rule ∧ R(θ, θ̂^f) ≤ r(f, θ̂^f) ∀θ  =⇒  θ̂^f is minimax

18 Linear Regression

Definitions

• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model

Y_i = β₀ + β₁X_i + ε_i   where E[ε_i | X_i] = 0, V[ε_i | X_i] = σ²

Fitted line

r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values

Ŷ_i = r̂(X_i)

Residuals

ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)

rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²

Least square estimates

β̂ᵀ = (β̂₀, β̂₁)ᵀ : min_{β₀,β₁} rss

β̂₀ = Ȳₙ − β̂₁X̄ₙ
β̂₁ = Σ_{i=1}^n (X_i − X̄ₙ)(Y_i − Ȳₙ) / Σ_{i=1}^n (X_i − X̄ₙ)² = (Σ_{i=1}^n X_iY_i − nX̄Ȳ) / (Σ_{i=1}^n X_i² − nX̄²)

E[β̂ | X^n] = (β₀, β₁)ᵀ

V[β̂ | X^n] = (σ² / (n s_X²)) [ n⁻¹ Σ_{i=1}^n X_i²   −X̄ₙ ;  −X̄ₙ   1 ]

sê(β̂₀) = (σ̂ / (s_X √n)) √(Σ_{i=1}^n X_i² / n),   sê(β̂₁) = σ̂ / (s_X √n)

where s_X² = n⁻¹ Σ_{i=1}^n (X_i − X̄ₙ)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² (unbiased estimate).

Further properties:

• Consistency: β̂₀ →P β₀ and β̂₁ →P β₁
• Asymptotic normality: (β̂₀ − β₀)/sê(β̂₀) →D N(0, 1) and (β̂₁ − β₁)/sê(β̂₁) →D N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁: β̂₀ ± z_{α/2} sê(β̂₀) and β̂₁ ± z_{α/2} sê(β̂₁)
• Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/sê(β̂₁).

R²

R² = Σ_{i=1}^n (Ŷ_i − Ȳ)² / Σ_{i=1}^n (Y_i − Ȳ)² = 1 − Σ_{i=1}^n ε̂_i² / Σ_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss
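A minimal sketch of the least squares formulas in §18.1 (an addition; NumPy assumed), including the standard error and a rough 95% interval for the slope.

import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n)     # Y_i = beta0 + beta1 X_i + eps_i

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
sigma2_hat = resid @ resid / (n - 2)                  # unbiased estimate of sigma^2
sx = np.sqrt(np.mean((x - x.mean()) ** 2))
se_b1 = np.sqrt(sigma2_hat) / (sx * np.sqrt(n))       # se(beta1_hat) = sigma_hat/(s_X sqrt(n))

print(b0, b1, (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1))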
Likelihood

L = ∏_{i=1}^n f(X_i, Y_i) = ∏_{i=1}^n f_X(X_i) × ∏_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂

L₁ = ∏_{i=1}^n f_X(X_i)

L₂ = ∏_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}

Under the assumption of Normality, the least squares estimator is also the mle:

µ̂ = X̄,   σ̂² = (1/n) Σ_{i=1}^n ε̂_i²

18.2 Prediction

Observe X = x* of the covariate and want to predict the outcome Y*.

Ŷ* = β̂₀ + β̂₁x*

V[Ŷ*] = V[β̂₀] + x*² V[β̂₁] + 2x* Cov[β̂₀, β̂₁]

Prediction interval

ξ̂ₙ² = σ̂² (Σ_{i=1}^n (X_i − x*)² / (n Σ_i (X_i − X̄)²) + 1)

Ŷ* ± z_{α/2} ξ̂ₙ

18.3 Multiple Regression

Y = Xβ + ε

where

X = [ X₁₁ ⋯ X₁ₖ ; ⋮ ⋱ ⋮ ; Xₙ₁ ⋯ Xₙₖ ],   β = (β₁, …, β_k)ᵀ,   ε = (ε₁, …, εₙ)ᵀ

Likelihood

L(µ, Σ) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) rss)

rss = (Y − Xβ)ᵀ(Y − Xβ) = ||Y − Xβ||² = Σ_{i=1}^n (Y_i − x_iᵀβ)²

If the (k × k) matrix XᵀX is invertible,

β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | X^n] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)

Estimate regression function

r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ²

σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,   ε̂ = Xβ̂ − Y

1 − α confidence interval

β̂_j ± z_{α/2} sê(β̂_j)

Hypothesis testing: H₀ : β_j = 0 vs. H₁ : β_j ≠ 0, ∀j ∈ J

18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues

• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure

1. Assign a score to each model
2. Search through all models to find the one with the highest score

Mean squared prediction error (mspe)

mspe = E[(Ŷ(S) − Y*)²]

Prediction risk

R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y_i*)²]

Training error

R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²

R²

R²(S) = 1 − R̂_tr(S)/tss = 1 − rss(S)/tss = 1 − Σ_{i=1}^n (Ŷ_i(S) − Y_i)² / Σ_{i=1}^n (Y_i − Ȳ)²

The training error is a downward-biased estimate of the prediction risk:

E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²

R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's Cp statistic

R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)

AIC(S) = ℓₙ(β̂_S, σ̂_S²) − k

Bayesian Information Criterion (BIC)

BIC(S) = ℓₙ(β̂_S, σ̂_S²) − (k/2) log n

Validation and training

R̂_V(S) = Σ_{i=1}^m (Ŷ_i*(S) − Y_i*)²,   m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation

R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_{(i)})² = Σ_{i=1}^n ((Y_i − Ŷ_i(S)) / (1 − U_ii(S)))²

U(S) = X_S(X_SᵀX_S)⁻¹X_Sᵀ   ("hat matrix")
19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where f(x) = P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise)

L(f, f̂ₙ) = ∫ (f(x) − f̂ₙ(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk

R(f, f̂ₙ) = E[L(f, f̂ₙ)] = ∫ b²(x) dx + ∫ v(x) dx

b(x) = E[f̂ₙ(x)] − f(x),   v(x) = V[f̂ₙ(x)]

19.1.1 Histograms

Definitions

• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator

f̂ₙ(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

E[f̂ₙ(x)] = p_j/h
V[f̂ₙ(x)] = p_j(1 − p_j)/(nh²)
R(f̂ₙ, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6 / ∫ (f′(u))² du)^{1/3}
R*(f̂ₙ, f) ≈ C/n^{2/3},   C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = ∫ f̂ₙ²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²
19.1.2 Kernel Density Estimator (KDE)

Kernel K

• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ_K² > 0

KDE

f̂ₙ(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

R(f, f̂ₙ) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx

h* = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} n^{−1/5}   where c₁ = σ_K², c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx

R*(f, f̂ₙ) = c₄/n^{4/5}   where c₄ = (5/4)(σ_K²)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5}

Epanechnikov Kernel

K(x) = (3/(4√5))(1 − x²/5) for |x| < √5,  0 otherwise

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = ∫ f̂ₙ²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)

K*(x) = K^{(2)}(x) − 2K(x),   K^{(2)}(x) = ∫ K(x − y)K(y) dy

19.2 Non-parametric Regression

Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (xₙ, Yₙ) related by

Y_i = r(x_i) + ε_i,   E[ε_i] = 0,   V[ε_i] = σ²

k-nearest Neighbor Estimator

r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i   where N_k(x) = {k values of x₁, …, xₙ closest to x}

Nadaraya-Watson Kernel Estimator

r̂(x) = Σ_{i=1}^n w_i(x) Y_i,   w_i(x) = K((x − x_i)/h) / Σ_{j=1}^n K((x − x_j)/h)

R(r̂ₙ, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)

h* ≈ c₁/n^{1/5},   R*(r̂ₙ, r) ≈ c₂/n^{4/5}

Cross-validation estimate of E[J(h)]

Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i)) / (1 − K(0)/Σ_{j=1}^n K((x − x_j)/h)))²

19.3 Smoothing Using Orthogonal Functions

Approximation

r(x) = Σ_{j=1}^∞ β_j φ_j(x) ≈ Σ_{j=1}^J β_j φ_j(x)

Multivariate regression

Y = Φβ + η   where η_i = ε_i and Φ = [ φ₀(x₁) ⋯ φ_J(x₁) ; ⋮ ⋱ ⋮ ; φ₀(xₙ) ⋯ φ_J(xₙ) ]

Least squares estimator

β̂ = (ΦᵀΦ)⁻¹ΦᵀY ≈ (1/n) ΦᵀY   (for equally spaced observations only)

Cross-validation

R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i) β̂_{j,(−i)})²
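A minimal KDE sketch for §19.1.2 (an addition; NumPy assumed), using a Gaussian kernel and a rule-of-thumb bandwidth for illustration only.

import numpy as np

rng = np.random.default_rng(9)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
n = len(x)

h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)          # rule-of-thumb bandwidth (illustrative)

def kde(grid, data, h):
    # f̂_n(x) = (1/n) Σ (1/h) K((x - X_i)/h) with a Gaussian kernel K
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

grid = np.linspace(-10, 10, 2001)
density = kde(grid, x, h)
print(np.trapz(density, grid))                    # integrates to ~1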
20 Stochastic Processes

Stochastic Process

{X_t : t ∈ T},   T = {0, ±1, …} = Z (discrete) or [0, ∞) (continuous)

• Notations X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains

Markov chain

P[Xₙ = x | X₀, …, Xₙ₋₁] = P[Xₙ = x | Xₙ₋₁]   ∀n ∈ T, x ∈ X

Transition probabilities

p_ij ≡ P[Xₙ₊₁ = j | Xₙ = i]
n-step: p_ij(n) ≡ P[X_{m+n} = j | X_m = i]

Transition matrix P (n-step: Pₙ)

• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1

Chapman-Kolmogorov

p_ij(m + n) = Σ_k p_ik(m) p_kj(n)

P_{m+n} = P_m P_n,   Pₙ = P × ⋯ × P = P^n

Marginal probability

µₙ = (µₙ(1), …, µₙ(N))   where µₙ(i) = P[Xₙ = i]

µ₀ = initial distribution,   µₙ = µ₀ P^n

20.2 Poisson Processes

Poisson process

• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X₀ = 0
• Independent increments: ∀t₀ < ⋯ < tₙ : X_{t₁} − X_{t₀} ⊥ ⋯ ⊥ X_{tₙ} − X_{tₙ₋₁}
• Intensity function λ(t):
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t = 2] = o(h)
• X_{s+t} − X_s ∼ Po(m(s + t) − m(s))   where m(t) = ∫₀^t λ(s) ds

Homogeneous Poisson process

λ(t) ≡ λ > 0  =⇒  X_t ∼ Po(λt)

Waiting times

W_t := time at which X_t occurs,   W_t ∼ Gamma(t, 1/λ)

Interarrival times

S_t = W_{t+1} − W_t,   S_t ∼ Exp(1/λ)
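A minimal sketch of the marginal relation µₙ = µ₀Pⁿ from §20.1 (an addition; NumPy assumed), together with the stationary distribution obtained as a left eigenvector.

import numpy as np

# Example 3-state transition matrix P; rows sum to 1.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

mu0 = np.array([1.0, 0.0, 0.0])                   # initial distribution
mu_n = mu0 @ np.linalg.matrix_power(P, 50)        # mu_n = mu_0 P^n

vals, vecs = np.linalg.eig(P.T)                   # stationary pi solves pi P = pi
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

print(mu_n, pi)                                   # mu_n converges to pi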
21 Time Series

21.1 Stationary Time Series

Mean function

µ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx

Autocovariance function

γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
γ_x(t, t) = E[(x_t − µ_t)²] = V[x_t]

Autocorrelation function (ACF)

ρ(s, t) = Cov[x_s, x_t] / √(V[x_s] V[x_t]) = γ(s, t) / √(γ(s, s) γ(t, t))

Cross-covariance function (CCV)

γ_xy(s, t) = E[(x_s − µ_{xs})(y_t − µ_{yt})]

Cross-correlation function (CCF)

ρ_xy(s, t) = γ_xy(s, t) / √(γ_x(s, s) γ_y(t, t))

Backshift operator

B^k(x_t) = x_{t−k}

Difference operator

∇^d = (1 − B)^d

White noise

• w_t ∼ wn(0, σ_w²)
• Gaussian: w_t iid ∼ N(0, σ_w²)
• E[w_t] = 0   t ∈ T
• V[w_t] = σ_w²   t ∈ T
• γ_w(s, t) = 0   s ≠ t ∧ s, t ∈ T

Random walk

• Drift δ
• x_t = δt + Σ_{j=1}^t w_j
• E[x_t] = δt

Symmetric moving average

m_t = Σ_{j=−k}^k a_j x_{t−j}   where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^k a_j = 1

Strictly stationary

P[x_{t₁} ≤ c₁, …, x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, …, x_{t_k+h} ≤ c_k]   ∀k ∈ N, t_k, c_k, h ∈ Z

Weakly stationary

• E[x_t²] < ∞   ∀t ∈ Z
• E[x_t] = m   ∀t ∈ Z
• γ_x(s, t) = γ_x(s + r, t + r)   ∀r, s, t ∈ Z

Autocovariance function (stationary)

• γ(h) = E[(x_{t+h} − µ)(x_t − µ)]
• γ(0) = E[(x_t − µ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)   ∀h ∈ Z

Autocorrelation function (ACF, stationary)

ρ_x(h) = Cov[x_{t+h}, x_t] / √(V[x_{t+h}] V[x_t]) = γ(t + h, t) / √(γ(t + h, t + h) γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series

γ_xy(h) = E[(x_{t+h} − µ_x)(y_t − µ_y)],   ρ_xy(h) = γ_xy(h) / √(γ_x(0) γ_y(0))

Linear process

x_t = µ + Σ_{j=−∞}^∞ ψ_j w_{t−j}   where Σ_{j=−∞}^∞ |ψ_j| < ∞

γ(h) = σ_w² Σ_{j=−∞}^∞ ψ_{j+h} ψ_j
21.2 Estimation of Correlation

Sample mean

x̄ = (1/n) Σ_{t=1}^n x_t

Sample variance

V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)

Sample autocovariance function

γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function

ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function

γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function

ρ̂_xy(h) = γ̂_xy(h) / √(γ̂_x(0) γ̂_y(0))

Properties

• σ_{ρ̂_x(h)} = 1/√n   if x_t is white noise
• σ_{ρ̂_xy(h)} = 1/√n   if x_t or y_t is white noise

21.3 Non-Stationary Time Series

Classical decomposition model

x_t = µ_t + s_t + w_t

• µ_t = trend
• s_t = seasonal component
• w_t = random noise term

21.3.1 Detrending

Least squares

1. Choose a trend model, e.g., µ_t = β₀ + β₁t + β₂t²
2. Minimize rss to obtain the trend estimate µ̂_t = β̂₀ + β̂₁t + β̂₂t²
3. Residuals ≙ noise w_t

Moving average

• The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1):
  v_t = (1/(2k + 1)) Σ_{i=−k}^k x_{t−i}
• If (1/(2k + 1)) Σ_{i=−k}^k w_{t−i} ≈ 0, a linear trend function µ_t = β₀ + β₁t passes without distortion

Differencing

• µ_t = β₀ + β₁t  =⇒  ∇x_t = β₁

21.4 ARIMA Models

Autoregressive polynomial

φ(z) = 1 − φ₁z − ⋯ − φ_p z^p,   z ∈ C ∧ φ_p ≠ 0

Autoregressive operator

φ(B) = 1 − φ₁B − ⋯ − φ_p B^p

Autoregressive model of order p, AR(p)

x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t  ⇐⇒  φ(B)x_t = w_t

AR(1)

• x_t = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j w_{t−j}  →  Σ_{j=0}^∞ φ^j w_{t−j}  (linear process, as k → ∞ and |φ| < 1)
• E[x_t] = Σ_{j=0}^∞ φ^j E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, x_t] = σ_w² φ^h / (1 − φ²)
• ρ(h) = γ(h)/γ(0) = φ^h
• ρ(h) = φρ(h − 1),   h = 1, 2, …
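A minimal sketch of the AR(1) facts above (an addition; NumPy assumed): simulate x_t = φx_{t−1} + w_t and compare the sample ACF of §21.2 with φ^h.

import numpy as np

rng = np.random.default_rng(10)
phi, n = 0.7, 5_000

w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]          # x_t = phi * x_{t-1} + w_t

def sample_acf(x, h):
    xbar = x.mean()
    gamma_h = np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)
    gamma_0 = np.sum((x - xbar) ** 2) / len(x)
    return gamma_h / gamma_0

for h in range(1, 6):
    print(h, round(sample_acf(x, h), 3), round(phi ** h, 3))   # rho_hat(h) vs phi^h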
Moving average polynomial

θ(z) = 1 + θ₁z + ⋯ + θ_q z^q,   z ∈ C ∧ θ_q ≠ 0

Moving average operator

θ(B) = 1 + θ₁B + ⋯ + θ_q B^q

MA(q) (moving average model of order q)

x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q}  ⇐⇒  x_t = θ(B)w_t

E[x_t] = Σ_{j=0}^q θ_j E[w_{t−j}] = 0

γ(h) = Cov[x_{t+h}, x_t] = σ_w² Σ_{j=0}^{q−h} θ_j θ_{j+h}   for 0 ≤ h ≤ q;   0 for h > q

MA(1)

x_t = w_t + θw_{t−1}
γ(h) = (1 + θ²)σ_w² for h = 0;  θσ_w² for h = 1;  0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1;  0 for h > 1

ARMA(p, q)

x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q}  ⇐⇒  φ(B)x_t = θ(B)w_t

ARIMA(p, d, q)

∇^d x_t = (1 − B)^d x_t is ARMA(p, q),   i.e., φ(B)(1 − B)^d x_t = θ(B)w_t

Exponentially Weighted Moving Average (EWMA)

x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σ_{j=1}^∞ (1 − λ)λ^{j−1} x_{t−j} + w_t   when |λ| < 1
x̃_{n+1} = (1 − λ)xₙ + λx̃ₙ

Seasonal ARIMA

• Denoted by ARIMA(p, d, q) × (P, D, Q)_s
• Φ_P(B^s) φ(B) ∇_s^D ∇^d x_t = δ + Θ_Q(B^s) θ(B) w_t

21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent)  ⇐⇒  ∃{ψ_j} : Σ_{j=0}^∞ |ψ_j| < ∞ such that

x_t = Σ_{j=0}^∞ ψ_j w_{t−j} = ψ(B)w_t

ARMA(p, q) is invertible  ⇐⇒  ∃{π_j} : Σ_{j=0}^∞ |π_j| < ∞ such that

π(B)x_t = Σ_{j=0}^∞ π_j x_{t−j} = w_t

Properties

• ARMA(p, q) causal  ⇐⇒  roots of φ(z) lie outside the unit circle;  ψ(z) = Σ_{j=0}^∞ ψ_j z^j = θ(z)/φ(z),  |z| ≤ 1
• ARMA(p, q) invertible  ⇐⇒  roots of θ(z) lie outside the unit circle;  π(z) = Σ_{j=0}^∞ π_j z^j = φ(z)/θ(z),  |z| ≤ 1

Partial autocorrelation function (PACF)

• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
• φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}),  h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

Behavior of the ACF and PACF for causal and invertible ARMA models

             ACF                    PACF
AR(p)        tails off              cuts off after lag p
MA(q)        cuts off after lag q   tails off
ARMA(p, q)   tails off              tails off

21.5 Spectral Analysis

Periodic process

x_t = A cos(2πωt + φ) = U₁ cos(2πωt) + U₂ sin(2πωt)

• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = A sin φ are often normally distributed rv's

Periodic mixture

x_t = Σ_{k=1}^q (U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t))

• U_{k1}, U_{k2}, for k = 1, …, q, are independent zero-mean rv's with variances σ_k²
• γ(h) = Σ_{k=1}^q σ_k² cos(2πω_k h)
• γ(0) = E[x_t²] = Σ_{k=1}^q σ_k²

Spectral representation of a periodic process

γ(h) = σ² cos(2πω₀h) = (σ²/2) e^{−2πiω₀h} + (σ²/2) e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

Spectral distribution function

F(ω) = 0 for ω < −ω₀;  σ²/2 for −ω₀ ≤ ω < ω₀;  σ² for ω ≥ ω₀

• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)

Spectral density

f(ω) = Σ_{h=−∞}^∞ γ(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2

• Needs Σ_{h=−∞}^∞ |γ(h)| < ∞  =⇒  γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,  h = 0, ±1, …
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ_w²
• ARMA(p, q), φ(B)x_t = θ(B)w_t:  f_x(ω) = σ_w² |θ(e^{−2πiω})|² / |φ(e^{−2πiω})|²
  where φ(z) = 1 − Σ_{k=1}^p φ_k z^k and θ(z) = 1 + Σ_{k=1}^q θ_k z^k

Discrete Fourier Transform (DFT)

d(ω_j) = n^{−1/2} Σ_{t=1}^n x_t e^{−2πiω_j t}

Fourier/Fundamental frequencies

ω_j = j/n

Inverse DFT

x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j) e^{2πiω_j t}

Periodogram

I(j/n) = |d(j/n)|²

Scaled Periodogram

P(j/n) = (4/n) I(j/n) = ((2/n) Σ_{t=1}^n x_t cos(2πtj/n))² + ((2/n) Σ_{t=1}^n x_t sin(2πtj/n))²

22 Math

22.1 Gamma Function

• Ordinary: Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt
• Lower incomplete: γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt
• Γ(α + 1) = αΓ(α),  α > 0
• Γ(n) = (n − 1)!,  n ∈ N
• Γ(1/2) = √π

22.2 Beta Function

• Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete:
  I_x(a, b) = B(x; a, b)/B(a, b)  =  Σ_{j=a}^{a+b−1} ((a + b − 1)! / (j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}   (a, b ∈ N)
• I₀(a, b) = 0,  I₁(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)
22.3 Series

Finite

• Σ_{k=1}^n k = n(n + 1)/2
• Σ_{k=1}^n (2k − 1) = n²
• Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^n k³ = (n(n + 1)/2)²
• Σ_{k=0}^n c^k = (c^{n+1} − 1)/(c − 1),  c ≠ 1

Binomial

• Σ_{k=0}^n C(n, k) = 2^n
• Binomial Theorem: Σ_{k=0}^n C(n, k) a^{n−k} b^k = (a + b)^n
• Σ_{k=0}^n C(r + k, k) = C(r + n + 1, n)
• Σ_{k=0}^n C(k, m) = C(n + 1, m + 1)
• Vandermonde's Identity: Σ_{k=0}^r C(m, k) C(n, r − k) = C(m + n, r)

Infinite

• Σ_{k=0}^∞ p^k = 1/(1 − p),   Σ_{k=1}^∞ p^k = p/(1 − p),   |p| < 1
• Σ_{k=1}^∞ k p^{k−1} = (d/dp) Σ_{k=0}^∞ p^k = 1/(1 − p)²,   |p| < 1
• Σ_{k=0}^∞ C(r + k − 1, k) x^k = (1 − x)^{−r},   r ∈ N⁺
• Σ_{k=0}^∞ C(α, k) p^k = (1 + p)^α,   |p| < 1, α ∈ C

22.4 Combinatorics

Sampling (k out of n)

                 w/o replacement                                w/ replacement
ordered          n^(k) = ∏_{i=0}^{k−1} (n − i) = n!/(n − k)!    n^k
unordered        C(n, k) = n^(k)/k! = n!/(k!(n − k)!)           C(n − 1 + r, r) = C(n − 1 + r, n − 1)

Stirling numbers, 2nd kind

{n k} = k {n−1 k} + {n−1 k−1},  1 ≤ k ≤ n;   {n 0} = 1 if n = 0, 0 else

Partitions

P_{n+k,k} = Σ_{i=1}^k P_{n,i};   n ≥ 1 : P_{n,0} = 0, P_{0,0} = 1;   k > n : P_{n,k} = 0

Balls and Urns (|B| = n, |U| = m; D = distinguishable, ¬D = indistinguishable; f : B → U)

                 f arbitrary            f injective             f surjective      f bijective
B: D, U: D       m^n                    m!/(m − n)! if m ≥ n,   m! {n m}          n! if m = n, 0 else
                                        0 else
B: ¬D, U: D      C(n + m − 1, n)        C(m, n) if m ≥ n,       C(n − 1, m − 1)   1 if m = n, 0 else
                                        0 else
B: D, U: ¬D      Σ_{k=1}^m {n k}        1 if m ≥ n, 0 else      {n m}             1 if m = n, 0 else
B: ¬D, U: ¬D     Σ_{k=1}^m P_{n,k}      1 if m ≥ n, 0 else      P_{n,m}           1 if m = n, 0 else

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy Leemis and McQueston [2].]