# stat-cookbook

**Probability and Statistics Cookbook** — Version 0.2.6, 19 December 2017

http://statistics.zone/

Copyright © Matthias Vallentin
## Contents

1. Distribution Overview
   - 1.1 Discrete Distributions
   - 1.2 Continuous Distributions
2. Probability Theory
3. Random Variables
   - 3.1 Transformations
4. Expectation
5. Variance
6. Inequalities
7. Distribution Relationships
8. Probability and Moment Generating Functions
9. Multivariate Distributions
   - 9.1 Standard Bivariate Normal
   - 9.2 Bivariate Normal
   - 9.3 Multivariate Normal
10. Convergence
    - 10.1 Law of Large Numbers (LLN)
    - 10.2 Central Limit Theorem (CLT)
11. Statistical Inference
    - 11.1 Point Estimation
    - 11.2 Normal-Based Confidence Interval
    - 11.3 Empirical Distribution
    - 11.4 Statistical Functionals
12. Parametric Inference
    - 12.1 Method of Moments
    - 12.2 Maximum Likelihood (12.2.1 Delta Method)
    - 12.3 Multiparameter Models (12.3.1 Multiparameter Delta Method)
    - 12.4 Parametric Bootstrap
13. Hypothesis Testing
14. Exponential Family
15. Bayesian Inference
    - 15.1 Credible Intervals
    - 15.2 Function of Parameters
    - 15.3 Priors (15.3.1 Conjugate Priors)
    - 15.4 Bayesian Testing
16. Sampling Methods
    - 16.1 Inverse Transform Sampling
    - 16.2 The Bootstrap (16.2.1 Bootstrap Confidence Intervals)
    - 16.3 Rejection Sampling
    - 16.4 Importance Sampling
17. Decision Theory
    - 17.1 Risk
    - 17.2 Admissibility
    - 17.3 Bayes Rule
    - 17.4 Minimax Rules
18. Linear Regression
    - 18.1 Simple Linear Regression
    - 18.2 Prediction
    - 18.3 Multiple Regression
    - 18.4 Model Selection
19. Non-parametric Function Estimation
    - 19.1 Density Estimation (19.1.1 Histograms, 19.1.2 Kernel Density Estimator)
    - 19.2 Non-parametric Regression
    - 19.3 Smoothing Using Orthogonal Functions
20. Stochastic Processes
    - 20.1 Markov Chains
    - 20.2 Poisson Processes
21. Time Series
    - 21.1 Stationary Time Series
    - 21.2 Estimation of Correlation
    - 21.3 Non-Stationary Time Series (21.3.1 Detrending)
    - 21.4 ARIMA Models (21.4.1 Causality and Invertibility)
    - 21.5 Spectral Analysis
22. Math
    - 22.1 Gamma Function
    - 22.2 Beta Function
    - 22.3 Series
    - 22.4 Combinatorics
This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.
## 1 Distribution Overview

### 1.1 Discrete Distributions

| Notation* | $F_X(x)$ | $f_X(x)$ | $\mathbb{E}[X]$ | $\mathbb{V}[X]$ | $M_X(s)$ |
|---|---|---|---|---|---|
| Uniform $\mathrm{Unif}\{a,\dots,b\}$ | $0$ for $x<a$; $\frac{\lfloor x\rfloor-a+1}{b-a+1}$ for $a\le x\le b$; $1$ for $x>b$ | $\frac{I(a\le x\le b)}{b-a+1}$ | $\frac{a+b}{2}$ | $\frac{(b-a+1)^2-1}{12}$ | $\frac{e^{as}-e^{-(b+1)s}}{s(b-a)}$ |
| Bernoulli $\mathrm{Bern}(p)$ | $(1-p)^{1-x}$ | $p^x(1-p)^{1-x}$ | $p$ | $p(1-p)$ | $1-p+pe^s$ |
| Binomial $\mathrm{Bin}(n,p)$ | $I_{1-p}(n-x,x+1)$ | $\binom{n}{x}p^x(1-p)^{n-x}$ | $np$ | $np(1-p)$ | $(1-p+pe^s)^n$ |
| Multinomial $\mathrm{Mult}(n,p)$ | — | $\frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}$ with $\sum_{i=1}^{k}x_i=n$ | $(np_1,\dots,np_k)^T$ | $np_i(1-p_i)$ on the diagonal, $-np_ip_j$ off-diagonal | $\left(\sum_{i}p_ie^{s_i}\right)^n$ |
| Hypergeometric $\mathrm{Hyp}(N,m,n)$ | $\approx\Phi\!\left(\frac{x-np}{\sqrt{np(1-p)}}\right)$ | $\frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$ | $\frac{nm}{N}$ | $\frac{nm(N-n)(N-m)}{N^2(N-1)}$ | — |
| Negative Binomial $\mathrm{NBin}(r,p)$ | $I_p(r,x+1)$ | $\binom{x+r-1}{r-1}p^r(1-p)^x$ | $r\frac{1-p}{p}$ | $r\frac{1-p}{p^2}$ | $\left(\frac{p}{1-(1-p)e^s}\right)^r$ |
| Geometric $\mathrm{Geo}(p)$ | $1-(1-p)^x$, $x\in\mathbb{N}^+$ | $p(1-p)^{x-1}$, $x\in\mathbb{N}^+$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{pe^s}{1-(1-p)e^s}$ |
| Poisson $\mathrm{Po}(\lambda)$ | $e^{-\lambda}\sum_{i=0}^{x}\frac{\lambda^i}{i!}$ | $\frac{\lambda^xe^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^s-1)}$ |

*We use the notation $\gamma(s,x)$ and $\Gamma(x)$ to refer to the Gamma functions (see §22.1), and $B(x,y)$ and $I_x$ to refer to the Beta functions (see §22.2).
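As a quick numerical sanity check of one table row, the following sketch (assuming NumPy and SciPy are available; the parameter values are arbitrary examples) compares the Binomial moments and the Beta-function form of its CDF:

```python
import numpy as np
from scipy import stats

n, p = 30, 0.6
X = stats.binom(n, p)

# Mean and variance should match the table entries np and np(1-p).
assert np.isclose(X.mean(), n * p)
assert np.isclose(X.var(), n * p * (1 - p))

# CDF at x equals the regularized incomplete Beta function I_{1-p}(n-x, x+1).
x = 17
print(X.cdf(x), stats.beta.cdf(1 - p, n - x, x + 1))
```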
*[Figure: PMF (top row) and CDF (bottom row) plots of the discrete distributions — Uniform (discrete), Binomial, Geometric, and Poisson — for several parameter settings, e.g. $n=40,p=0.3$; $n=30,p=0.6$; $n=25,p=0.9$; $p=0.2,0.5,0.8$; $\lambda=1,4,10$.]*
### 1.2 Continuous Distributions

| Notation | $F_X(x)$ | $f_X(x)$ | $\mathbb{E}[X]$ | $\mathbb{V}[X]$ | $M_X(s)$ |
|---|---|---|---|---|---|
| Uniform $\mathrm{Unif}(a,b)$ | $0$ for $x<a$; $\frac{x-a}{b-a}$ for $a<x<b$; $1$ for $x>b$ | $\frac{I(a<x<b)}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{sb}-e^{sa}}{s(b-a)}$ |
| Normal $\mathcal{N}(\mu,\sigma^2)$ | $\Phi(x)=\int_{-\infty}^{x}\phi(t)\,dt$ | $\phi(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\mu$ | $\sigma^2$ | $\exp\!\left(\mu s+\frac{\sigma^2s^2}{2}\right)$ |
| Log-Normal $\ln\mathcal{N}(\mu,\sigma^2)$ | $\frac{1}{2}+\frac{1}{2}\,\mathrm{erf}\!\left(\frac{\ln x-\mu}{\sqrt{2\sigma^2}}\right)$ | $\frac{1}{x\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right)$ | $e^{\mu+\sigma^2/2}$ | $(e^{\sigma^2}-1)e^{2\mu+\sigma^2}$ | — |
| Multivariate Normal $\mathrm{MVN}(\mu,\Sigma)$ | — | $(2\pi)^{-k/2}\lvert\Sigma\rvert^{-1/2}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$ | $\mu$ | $\Sigma$ | $\exp\!\left(\mu^Ts+\frac{1}{2}s^T\Sigma s\right)$ |
| Student's t $\mathrm{Student}(\nu)$ | $I_x\!\left(\frac{\nu}{2},\frac{\nu}{2}\right)$ | $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}$ | $0$ ($\nu>1$) | $\frac{\nu}{\nu-2}$ for $\nu>2$; $\infty$ for $1<\nu\le2$ | — |
| Chi-square $\chi^2_k$ | $\frac{1}{\Gamma(k/2)}\gamma\!\left(\frac{k}{2},\frac{x}{2}\right)$ | $\frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2}$ | $k$ | $2k$ | $(1-2s)^{-k/2}$, $s<1/2$ |
| F $\mathrm{F}(d_1,d_2)$ | $I_{\frac{d_1x}{d_1x+d_2}}\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)$ | $\frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x+d_2)^{d_1+d_2}}}}{x\,B\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)}$ | $\frac{d_2}{d_2-2}$, $d_2>2$ | $\frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$ | — |
| Exponential* $\mathrm{Exp}(\beta)$ | $1-e^{-x/\beta}$ | $\frac{1}{\beta}e^{-x/\beta}$ | $\beta$ | $\beta^2$ | $(1-\beta s)^{-1}$, $s<1/\beta$ |
| Gamma* $\mathrm{Gamma}(\alpha,\beta)$ | $\frac{\gamma(\alpha,\beta x)}{\Gamma(\alpha)}$ | $\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$ | $\frac{\alpha}{\beta}$ | $\frac{\alpha}{\beta^2}$ | $\left(1-\frac{s}{\beta}\right)^{-\alpha}$, $s<\beta$ |
| Inverse Gamma $\mathrm{InvGamma}(\alpha,\beta)$ | $\frac{\Gamma(\alpha,\beta/x)}{\Gamma(\alpha)}$ | $\frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha-1}e^{-\beta/x}$ | $\frac{\beta}{\alpha-1}$, $\alpha>1$ | $\frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$, $\alpha>2$ | $\frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}K_\alpha\!\left(\sqrt{-4\beta s}\right)$ |
| Dirichlet $\mathrm{Dir}(\alpha)$ | — | $\frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\prod_{i=1}^{k}x_i^{\alpha_i-1}$ | $\frac{\alpha_i}{\sum_{i=1}^{k}\alpha_i}$ | $\frac{\mathbb{E}[X_i](1-\mathbb{E}[X_i])}{\sum_{i=1}^{k}\alpha_i+1}$ | — |
| Beta $\mathrm{Beta}(\alpha,\beta)$ | $I_x(\alpha,\beta)$ | $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | $1+\sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{s^k}{k!}$ |
| Weibull $\mathrm{Weibull}(\lambda,k)$ | $1-e^{-(x/\lambda)^k}$ | $\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k}$ | $\lambda\Gamma\!\left(1+\frac{1}{k}\right)$ | $\lambda^2\Gamma\!\left(1+\frac{2}{k}\right)-\mu^2$ | $\sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\Gamma\!\left(1+\frac{n}{k}\right)$ |
| Pareto $\mathrm{Pareto}(x_m,\alpha)$ | $1-\left(\frac{x_m}{x}\right)^\alpha$, $x\ge x_m$ | $\frac{\alpha x_m^\alpha}{x^{\alpha+1}}$, $x\ge x_m$ | $\frac{\alpha x_m}{\alpha-1}$, $\alpha>1$ | $\frac{x_m^2\alpha}{(\alpha-1)^2(\alpha-2)}$, $\alpha>2$ | $\alpha(-x_ms)^\alpha\Gamma(-\alpha,-x_ms)$, $s<0$ |

*We use the rate parameterization where $\beta=1/\lambda$; some textbooks use $\beta$ as the scale parameter instead.
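A quick numerical sanity check of the Beta row above (a sketch, assuming NumPy and SciPy are installed; the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats, special

a, b = 2.0, 5.0                 # arbitrary example parameters
X = stats.beta(a, b)

# Table entries: E[X] = a/(a+b), V[X] = ab / ((a+b)^2 (a+b+1))
print(X.mean(), a / (a + b))
print(X.var(), a * b / ((a + b) ** 2 * (a + b + 1)))

# The CDF is the regularized incomplete Beta function I_x(a, b)
print(X.cdf(0.3), special.betainc(a, b, 0.3))
```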
*[Figure: PDF (top) and CDF (bottom) plots of the continuous distributions — Uniform (continuous), Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto — for several parameter settings.]*
## 2 Probability Theory

**Definitions**

- Sample space $\Omega$
- Outcome (point or element) $\omega\in\Omega$
- Event $A\subseteq\Omega$
- $\sigma$-algebra $\mathcal{A}$:
  1. $\emptyset\in\mathcal{A}$
  2. $A_1,A_2,\ldots\in\mathcal{A}\implies\bigcup_{i=1}^{\infty}A_i\in\mathcal{A}$
  3. $A\in\mathcal{A}\implies\neg A\in\mathcal{A}$
- Probability distribution $\mathbb{P}$:
  1. $\mathbb{P}[A]\ge0\ \forall A$
  2. $\mathbb{P}[\Omega]=1$
  3. $\mathbb{P}\left[\bigsqcup_{i=1}^{\infty}A_i\right]=\sum_{i=1}^{\infty}\mathbb{P}[A_i]$
- Probability space $(\Omega,\mathcal{A},\mathbb{P})$

**Properties**

- $\mathbb{P}[\emptyset]=0$
- $B=\Omega\cap B=(A\cup\neg A)\cap B=(A\cap B)\cup(\neg A\cap B)$
- $\mathbb{P}[\neg A]=1-\mathbb{P}[A]$
- $\mathbb{P}[B]=\mathbb{P}[A\cap B]+\mathbb{P}[\neg A\cap B]$
- DeMorgan: $\neg\left(\bigcup_nA_n\right)=\bigcap_n\neg A_n$ and $\neg\left(\bigcap_nA_n\right)=\bigcup_n\neg A_n$
- $\mathbb{P}\left[\bigcup_nA_n\right]=1-\mathbb{P}\left[\bigcap_n\neg A_n\right]$
- $\mathbb{P}[A\cup B]=\mathbb{P}[A]+\mathbb{P}[B]-\mathbb{P}[A\cap B]\implies\mathbb{P}[A\cup B]\le\mathbb{P}[A]+\mathbb{P}[B]$
- $\mathbb{P}[A\cup B]=\mathbb{P}[A\cap\neg B]+\mathbb{P}[\neg A\cap B]+\mathbb{P}[A\cap B]$
- $\mathbb{P}[A\cap\neg B]=\mathbb{P}[A]-\mathbb{P}[A\cap B]$

**Continuity of Probabilities**

- $A_1\subset A_2\subset\ldots\implies\lim_{n\to\infty}\mathbb{P}[A_n]=\mathbb{P}[A]$ where $A=\bigcup_{i=1}^{\infty}A_i$
- $A_1\supset A_2\supset\ldots\implies\lim_{n\to\infty}\mathbb{P}[A_n]=\mathbb{P}[A]$ where $A=\bigcap_{i=1}^{\infty}A_i$

**Independence**

$$A\perp\!\!\!\perp B\iff\mathbb{P}[A\cap B]=\mathbb{P}[A]\,\mathbb{P}[B]$$

**Conditional Probability**

$$\mathbb{P}[A\mid B]=\frac{\mathbb{P}[A\cap B]}{\mathbb{P}[B]}\qquad\mathbb{P}[B]>0$$

**Law of Total Probability**

$$\mathbb{P}[B]=\sum_{i=1}^{n}\mathbb{P}[B\mid A_i]\,\mathbb{P}[A_i]\qquad\Omega=\bigsqcup_{i=1}^{n}A_i$$

**Bayes' Theorem**

$$\mathbb{P}[A_i\mid B]=\frac{\mathbb{P}[B\mid A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^{n}\mathbb{P}[B\mid A_j]\,\mathbb{P}[A_j]}\qquad\Omega=\bigsqcup_{i=1}^{n}A_i$$

**Inclusion-Exclusion Principle**

$$\mathbb{P}\left[\bigcup_{i=1}^{n}A_i\right]=\sum_{r=1}^{n}(-1)^{r-1}\sum_{i_1<\cdots<i_r}\mathbb{P}\left[\bigcap_{j=1}^{r}A_{i_j}\right]$$

## 3 Random Variables

**Random Variable (RV)**: $X:\Omega\to\mathbb{R}$

**Probability Mass Function (PMF)**

$$f_X(x)=\mathbb{P}[X=x]=\mathbb{P}[\{\omega\in\Omega:X(\omega)=x\}]$$

**Probability Density Function (PDF)**

$$\mathbb{P}[a\le X\le b]=\int_a^bf(x)\,dx$$

**Cumulative Distribution Function (CDF)**: $F_X:\mathbb{R}\to[0,1]$, $F_X(x)=\mathbb{P}[X\le x]$

1. Nondecreasing: $x_1<x_2\implies F(x_1)\le F(x_2)$
2. Normalized: $\lim_{x\to-\infty}F(x)=0$ and $\lim_{x\to\infty}F(x)=1$
3. Right-continuous: $\lim_{y\downarrow x}F(y)=F(x)$

**Conditional density**

$$\mathbb{P}[a\le Y\le b\mid X=x]=\int_a^bf_{Y\mid X}(y\mid x)\,dy\quad(a\le b)\qquad f_{Y\mid X}(y\mid x)=\frac{f(x,y)}{f_X(x)}$$

**Independence**

1. $\mathbb{P}[X\le x,Y\le y]=\mathbb{P}[X\le x]\,\mathbb{P}[Y\le y]$
2. $f_{X,Y}(x,y)=f_X(x)f_Y(y)$
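A minimal numerical illustration of Bayes' theorem and the law of total probability (a sketch; the prevalence and test characteristics below are made-up example numbers):

```python
# P[D] = prior, P[T|D] and P[T|¬D] = likelihoods of a positive test
p_d = 0.01           # hypothetical disease prevalence
p_t_given_d = 0.95   # hypothetical sensitivity
p_t_given_nd = 0.05  # hypothetical false-positive rate

# Law of total probability for the evidence P[T]
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Bayes' theorem: P[D|T] = P[T|D] P[D] / P[T]
print(p_t_given_d * p_d / p_t)   # ≈ 0.161
```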
### 3.1 Transformations

Transformation function: $Z=\varphi(X)$

**Discrete**

$$f_Z(z)=\mathbb{P}[\varphi(X)=z]=\mathbb{P}[\{x:\varphi(x)=z\}]=\mathbb{P}[X\in\varphi^{-1}(z)]=\sum_{x\in\varphi^{-1}(z)}f_X(x)$$

**Continuous**

$$F_Z(z)=\mathbb{P}[\varphi(X)\le z]=\int_{A_z}f(x)\,dx\qquad A_z=\{x:\varphi(x)\le z\}$$

Special case if $\varphi$ is strictly monotone:

$$f_Z(z)=f_X(\varphi^{-1}(z))\left|\frac{d}{dz}\varphi^{-1}(z)\right|=f_X(x)\left|\frac{dx}{dz}\right|=f_X(x)\frac{1}{|J|}$$

**The Rule of the Lazy Statistician**

$$\mathbb{E}[Z]=\int\varphi(x)\,dF_X(x)\qquad\mathbb{E}[I_A(X)]=\int I_A(x)\,dF_X(x)=\int_AdF_X(x)=\mathbb{P}[X\in A]$$

**Convolution**

- $Z:=X+Y$: $f_Z(z)=\int_{-\infty}^{\infty}f_{X,Y}(x,z-x)\,dx\ \stackrel{X,Y\ge0}{=}\ \int_0^zf_{X,Y}(x,z-x)\,dx$
- $Z:=|X-Y|$: $f_Z(z)=2\int_0^{\infty}f_{X,Y}(x,z+x)\,dx$
- $Z:=\frac{X}{Y}$: $f_Z(z)=\int_{-\infty}^{\infty}|y|\,f_{X,Y}(yz,y)\,dy\ \stackrel{\perp\!\!\!\perp}{=}\ \int_{-\infty}^{\infty}|y|\,f_X(yz)f_Y(y)\,dy$

## 4 Expectation

**Definition and properties**

- $\mathbb{E}[X]=\mu_X=\int x\,dF_X(x)=\begin{cases}\sum_xxf_X(x)&X\text{ discrete}\\\int xf_X(x)\,dx&X\text{ continuous}\end{cases}$
- $\mathbb{P}[X=c]=1\implies\mathbb{E}[X]=c$
- $\mathbb{E}[cX]=c\,\mathbb{E}[X]$
- $\mathbb{E}[X+Y]=\mathbb{E}[X]+\mathbb{E}[Y]$
- $\mathbb{E}[XY]=\iint xy\,f_{X,Y}(x,y)\,dx\,dy$
- $\mathbb{E}[\varphi(X)]\ne\varphi(\mathbb{E}[X])$ in general (cf. Jensen's inequality)
- $\mathbb{P}[X\ge Y]=1\implies\mathbb{E}[X]\ge\mathbb{E}[Y]$
- $\mathbb{P}[X=Y]=1\implies\mathbb{E}[X]=\mathbb{E}[Y]$
- $\mathbb{E}[X]=\sum_{x=1}^{\infty}\mathbb{P}[X\ge x]$ ($X$ discrete)

**Sample mean**

$$\bar{X}_n=\frac{1}{n}\sum_{i=1}^{n}X_i$$

**Conditional expectation**

- $\mathbb{E}[Y\mid X=x]=\int yf(y\mid x)\,dy$
- $\mathbb{E}[X]=\mathbb{E}[\mathbb{E}[X\mid Y]]$
- $\mathbb{E}[\varphi(X,Y)\mid X=x]=\int_{-\infty}^{\infty}\varphi(x,y)f_{Y\mid X}(y\mid x)\,dy$
- $\mathbb{E}[\varphi(Y,Z)\mid X=x]=\iint\varphi(y,z)f_{(Y,Z)\mid X}(y,z\mid x)\,dy\,dz$
- $\mathbb{E}[Y+Z\mid X]=\mathbb{E}[Y\mid X]+\mathbb{E}[Z\mid X]$
- $\mathbb{E}[\varphi(X)Y\mid X]=\varphi(X)\,\mathbb{E}[Y\mid X]$
- $\mathbb{E}[Y\mid X]=c\implies\mathrm{Cov}[X,Y]=0$

## 5 Variance

**Definition and properties**

- $\mathbb{V}[X]=\sigma_X^2=\mathbb{E}\left[(X-\mathbb{E}[X])^2\right]=\mathbb{E}[X^2]-\mathbb{E}[X]^2$
- $\mathbb{V}\left[\sum_{i=1}^{n}X_i\right]=\sum_{i=1}^{n}\mathbb{V}[X_i]+\sum_{i\ne j}\mathrm{Cov}[X_i,X_j]$
- $\mathbb{V}\left[\sum_{i=1}^{n}X_i\right]=\sum_{i=1}^{n}\mathbb{V}[X_i]$ if $X_i\perp\!\!\!\perp X_j$

**Standard deviation**: $\mathrm{sd}[X]=\sqrt{\mathbb{V}[X]}=\sigma_X$

**Covariance**

- $\mathrm{Cov}[X,Y]=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=\mathbb{E}[XY]-\mathbb{E}[X]\,\mathbb{E}[Y]$
- $\mathrm{Cov}[X,a]=0$
- $\mathrm{Cov}[X,X]=\mathbb{V}[X]$
- $\mathrm{Cov}[X,Y]=\mathrm{Cov}[Y,X]$
- $\mathrm{Cov}[aX,bY]=ab\,\mathrm{Cov}[X,Y]$
- $\mathrm{Cov}[X+a,Y+b]=\mathrm{Cov}[X,Y]$
- $\mathrm{Cov}\left[\sum_{i=1}^{n}X_i,\sum_{j=1}^{m}Y_j\right]=\sum_{i=1}^{n}\sum_{j=1}^{m}\mathrm{Cov}[X_i,Y_j]$

**Correlation**

$$\rho[X,Y]=\frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$$

**Independence**

$$X\perp\!\!\!\perp Y\implies\rho[X,Y]=0\iff\mathrm{Cov}[X,Y]=0\iff\mathbb{E}[XY]=\mathbb{E}[X]\,\mathbb{E}[Y]$$

**Sample variance**

$$S^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X}_n)^2$$

**Conditional variance**

- $\mathbb{V}[Y\mid X]=\mathbb{E}\left[(Y-\mathbb{E}[Y\mid X])^2\mid X\right]=\mathbb{E}[Y^2\mid X]-\mathbb{E}[Y\mid X]^2$
- $\mathbb{V}[Y]=\mathbb{E}[\mathbb{V}[Y\mid X]]+\mathbb{V}[\mathbb{E}[Y\mid X]]$

## 6 Inequalities

- Cauchy-Schwarz: $\mathbb{E}[XY]^2\le\mathbb{E}[X^2]\,\mathbb{E}[Y^2]$
- Markov: $\mathbb{P}[\varphi(X)\ge t]\le\frac{\mathbb{E}[\varphi(X)]}{t}$
- Chebyshev: $\mathbb{P}[|X-\mathbb{E}[X]|\ge t]\le\frac{\mathbb{V}[X]}{t^2}$
- Chernoff: $\mathbb{P}[X\ge(1+\delta)\mu]\le\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$, $\delta>-1$
- Hoeffding: $X_1,\ldots,X_n$ independent with $\mathbb{P}[X_i\in[a_i,b_i]]=1$, $1\le i\le n$:
  $$\mathbb{P}\left[\bar{X}-\mathbb{E}[\bar{X}]\ge t\right]\le e^{-2nt^2}\quad(t>0)\qquad\mathbb{P}\left[|\bar{X}-\mathbb{E}[\bar{X}]|\ge t\right]\le2\exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right)\quad(t>0)$$
- Jensen: $\mathbb{E}[\varphi(X)]\ge\varphi(\mathbb{E}[X])$ for $\varphi$ convex

## 7 Distribution Relationships

**Binomial**

- $X_i\sim\mathrm{Bern}(p)\implies\sum_{i=1}^{n}X_i\sim\mathrm{Bin}(n,p)$
- $X\sim\mathrm{Bin}(n,p),\ Y\sim\mathrm{Bin}(m,p)\implies X+Y\sim\mathrm{Bin}(n+m,p)$
- $\lim_{n\to\infty}\mathrm{Bin}(n,p)=\mathrm{Po}(np)$ ($n$ large, $p$ small)
- $\lim_{n\to\infty}\mathrm{Bin}(n,p)=\mathcal{N}(np,np(1-p))$ ($n$ large, $p$ far from 0 and 1)

**Negative Binomial**

- $X\sim\mathrm{NBin}(1,p)=\mathrm{Geo}(p)$
- $X\sim\mathrm{NBin}(r,p)=\sum_{i=1}^{r}\mathrm{Geo}(p)$
- $X_i\sim\mathrm{NBin}(r_i,p)\implies\sum X_i\sim\mathrm{NBin}\left(\sum r_i,p\right)$
- $X\sim\mathrm{NBin}(r,p),\ Y\sim\mathrm{Bin}(s+r,p)\implies\mathbb{P}[X\le s]=\mathbb{P}[Y\ge r]$

**Poisson**

- $X_i\sim\mathrm{Po}(\lambda_i)\wedge X_i\perp\!\!\!\perp X_j\implies\sum_{i=1}^{n}X_i\sim\mathrm{Po}\left(\sum_{i=1}^{n}\lambda_i\right)$
- $X_i\sim\mathrm{Po}(\lambda_i)\wedge X_i\perp\!\!\!\perp X_j\implies X_i\,\Big|\sum_{j=1}^{n}X_j\sim\mathrm{Bin}\left(\sum_{j=1}^{n}X_j,\ \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j}\right)$

**Exponential**

- $X_i\sim\mathrm{Exp}(\beta)\wedge X_i\perp\!\!\!\perp X_j\implies\sum_{i=1}^{n}X_i\sim\mathrm{Gamma}(n,\beta)$
- Memoryless property: $\mathbb{P}[X>x+y\mid X>y]=\mathbb{P}[X>x]$

**Normal**

- $X\sim\mathcal{N}(\mu,\sigma^2)\implies\frac{X-\mu}{\sigma}\sim\mathcal{N}(0,1)$
- $X\sim\mathcal{N}(\mu,\sigma^2)\wedge Z=aX+b\implies Z\sim\mathcal{N}(a\mu+b,a^2\sigma^2)$
- $X_i\sim\mathcal{N}(\mu_i,\sigma_i^2)\wedge X_i\perp\!\!\!\perp X_j\implies\sum_iX_i\sim\mathcal{N}\left(\sum_i\mu_i,\sum_i\sigma_i^2\right)$
- $\mathbb{P}[a<X\le b]=\Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right)$
- $\Phi(-x)=1-\Phi(x)$, $\quad\phi'(x)=-x\phi(x)$, $\quad\phi''(x)=(x^2-1)\phi(x)$
- Upper quantile of $\mathcal{N}(0,1)$: $z_\alpha=\Phi^{-1}(1-\alpha)$

**Gamma**

- $X\sim\mathrm{Gamma}(\alpha,\beta)\iff X/\beta\sim\mathrm{Gamma}(\alpha,1)$
- $\mathrm{Gamma}(\alpha,\beta)\sim\sum_{i=1}^{\alpha}\mathrm{Exp}(\beta)$
- $X_i\sim\mathrm{Gamma}(\alpha_i,\beta)\wedge X_i\perp\!\!\!\perp X_j\implies\sum_iX_i\sim\mathrm{Gamma}\left(\sum_i\alpha_i,\beta\right)$
- $\frac{\Gamma(\alpha)}{\lambda^{\alpha}}=\int_0^{\infty}x^{\alpha-1}e^{-\lambda x}\,dx$
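A quick Monte Carlo check of the first Binomial relationship above (a sketch using NumPy; the sample sizes and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 30, 0.4, 100_000

# Sum of n iid Bernoulli(p) draws, repeated many times
sums = rng.binomial(1, p, size=(reps, n)).sum(axis=1)

# Compare simulated moments with Bin(n, p): E = np, V = np(1-p)
print(sums.mean(), n * p)           # ≈ 12.0
print(sums.var(), n * p * (1 - p))  # ≈ 7.2
```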
**Beta**

- $\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$
- $\mathbb{E}[X^k]=\frac{B(\alpha+k,\beta)}{B(\alpha,\beta)}=\frac{\alpha+k-1}{\alpha+\beta+k-1}\mathbb{E}[X^{k-1}]$
- $\mathrm{Beta}(1,1)\sim\mathrm{Unif}(0,1)$

## 8 Probability and Moment Generating Functions

- $G_X(t)=\mathbb{E}[t^X]$, $|t|<1$
- $M_X(t)=G_X(e^t)=\mathbb{E}[e^{Xt}]=\mathbb{E}\left[\sum_{i=0}^{\infty}\frac{(Xt)^i}{i!}\right]=\sum_{i=0}^{\infty}\frac{\mathbb{E}[X^i]}{i!}\cdot t^i$
- $\mathbb{P}[X=0]=G_X(0)$
- $\mathbb{P}[X=1]=G_X'(0)$
- $\mathbb{P}[X=i]=\frac{G_X^{(i)}(0)}{i!}$
- $\mathbb{E}[X]=G_X'(1^-)$
- $\mathbb{E}[X^k]=M_X^{(k)}(0)$
- $\mathbb{E}\left[\frac{X!}{(X-k)!}\right]=G_X^{(k)}(1^-)$
- $\mathbb{V}[X]=G_X''(1^-)+G_X'(1^-)-\left(G_X'(1^-)\right)^2$
- $G_X(t)=G_Y(t)\implies X\stackrel{d}{=}Y$

## 9 Multivariate Distributions

### 9.1 Standard Bivariate Normal

Let $X,Y\sim\mathcal{N}(0,1)$ with $X\perp\!\!\!\perp Z$, where $Y=\rho X+\sqrt{1-\rho^2}\,Z$.

Joint density

$$f(x,y)=\frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left(-\frac{x^2+y^2-2\rho xy}{2(1-\rho^2)}\right)$$

Conditionals

$$(Y\mid X=x)\sim\mathcal{N}(\rho x,1-\rho^2)\qquad\text{and}\qquad(X\mid Y=y)\sim\mathcal{N}(\rho y,1-\rho^2)$$

Independence: $X\perp\!\!\!\perp Y\iff\rho=0$

### 9.2 Bivariate Normal

Let $X\sim\mathcal{N}(\mu_x,\sigma_x^2)$ and $Y\sim\mathcal{N}(\mu_y,\sigma_y^2)$.

$$f(x,y)=\frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left(-\frac{z}{2(1-\rho^2)}\right)\qquad z=\left(\frac{x-\mu_x}{\sigma_x}\right)^2+\left(\frac{y-\mu_y}{\sigma_y}\right)^2-2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)$$

Conditional mean and variance

$$\mathbb{E}[X\mid Y]=\mathbb{E}[X]+\rho\frac{\sigma_X}{\sigma_Y}(Y-\mathbb{E}[Y])\qquad\mathbb{V}[X\mid Y]=\sigma_X\sqrt{1-\rho^2}$$

### 9.3 Multivariate Normal

Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)

$$\Sigma=\begin{pmatrix}\mathbb{V}[X_1]&\cdots&\mathrm{Cov}[X_1,X_k]\\\vdots&\ddots&\vdots\\\mathrm{Cov}[X_k,X_1]&\cdots&\mathbb{V}[X_k]\end{pmatrix}$$

If $X\sim\mathcal{N}(\mu,\Sigma)$,

$$f_X(x)=(2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

Properties

- $Z\sim\mathcal{N}(0,1)\wedge X=\mu+\Sigma^{1/2}Z\implies X\sim\mathcal{N}(\mu,\Sigma)$
- $X\sim\mathcal{N}(\mu,\Sigma)\implies\Sigma^{-1/2}(X-\mu)\sim\mathcal{N}(0,1)$
- $X\sim\mathcal{N}(\mu,\Sigma)\implies AX\sim\mathcal{N}(A\mu,A\Sigma A^T)$
- $X\sim\mathcal{N}(\mu,\Sigma)$ and $a$ a $k$-vector $\implies a^TX\sim\mathcal{N}(a^T\mu,a^T\Sigma a)$
## 10 Convergence

Let $\{X_1,X_2,\ldots\}$ be a sequence of rv's and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and $F$ the cdf of $X$.

**Types of convergence**

1. In distribution (weakly, in law): $X_n\stackrel{D}{\to}X$ if $\lim_{n\to\infty}F_n(t)=F(t)$ at all $t$ where $F$ is continuous
2. In probability: $X_n\stackrel{P}{\to}X$ if $(\forall\varepsilon>0)\ \lim_{n\to\infty}\mathbb{P}[|X_n-X|>\varepsilon]=0$
3. Almost surely (strongly): $X_n\stackrel{as}{\to}X$ if $\mathbb{P}\left[\lim_{n\to\infty}X_n=X\right]=\mathbb{P}\left[\omega\in\Omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\right]=1$
4. In quadratic mean ($L_2$): $X_n\stackrel{qm}{\to}X$ if $\lim_{n\to\infty}\mathbb{E}\left[(X_n-X)^2\right]=0$

**Relationships**

- $X_n\stackrel{qm}{\to}X\implies X_n\stackrel{P}{\to}X\implies X_n\stackrel{D}{\to}X$
- $X_n\stackrel{as}{\to}X\implies X_n\stackrel{P}{\to}X$
- $X_n\stackrel{D}{\to}X\wedge(\exists c\in\mathbb{R})\,\mathbb{P}[X=c]=1\implies X_n\stackrel{P}{\to}X$
- $X_n\stackrel{P}{\to}X\wedge Y_n\stackrel{P}{\to}Y\implies X_n+Y_n\stackrel{P}{\to}X+Y$
- $X_n\stackrel{qm}{\to}X\wedge Y_n\stackrel{qm}{\to}Y\implies X_n+Y_n\stackrel{qm}{\to}X+Y$
- $X_n\stackrel{P}{\to}X\wedge Y_n\stackrel{P}{\to}Y\implies X_nY_n\stackrel{P}{\to}XY$
- $X_n\stackrel{P}{\to}X\implies\varphi(X_n)\stackrel{P}{\to}\varphi(X)$
- $X_n\stackrel{D}{\to}X\implies\varphi(X_n)\stackrel{D}{\to}\varphi(X)$
- $X_n\stackrel{qm}{\to}b\iff\lim_{n\to\infty}\mathbb{E}[X_n]=b\wedge\lim_{n\to\infty}\mathbb{V}[X_n]=0$
- $X_1,\ldots,X_n$ iid $\wedge\ \mathbb{E}[X]=\mu\wedge\mathbb{V}[X]<\infty\iff\bar{X}_n\stackrel{qm}{\to}\mu$

**Slutzky's Theorem**

- $X_n\stackrel{D}{\to}X$ and $Y_n\stackrel{P}{\to}c\implies X_n+Y_n\stackrel{D}{\to}X+c$
- $X_n\stackrel{D}{\to}X$ and $Y_n\stackrel{P}{\to}c\implies X_nY_n\stackrel{D}{\to}cX$
- In general: $X_n\stackrel{D}{\to}X$ and $Y_n\stackrel{D}{\to}Y$ does **not** imply $X_n+Y_n\stackrel{D}{\to}X+Y$

### 10.1 Law of Large Numbers (LLN)

Let $\{X_1,\ldots,X_n\}$ be a sequence of iid rv's with $\mathbb{E}[X_1]=\mu$.

- Weak (WLLN): $\bar{X}_n\stackrel{P}{\to}\mu$ as $n\to\infty$
- Strong (SLLN): $\bar{X}_n\stackrel{as}{\to}\mu$ as $n\to\infty$

### 10.2 Central Limit Theorem (CLT)

Let $\{X_1,\ldots,X_n\}$ be a sequence of iid rv's with $\mathbb{E}[X_1]=\mu$ and $\mathbb{V}[X_1]=\sigma^2$.

$$Z_n:=\frac{\bar{X}_n-\mu}{\sqrt{\mathbb{V}[\bar{X}_n]}}=\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\stackrel{D}{\to}Z\quad\text{where }Z\sim\mathcal{N}(0,1)\qquad\lim_{n\to\infty}\mathbb{P}[Z_n\le z]=\Phi(z)\quad z\in\mathbb{R}$$

**CLT notations**

$$Z_n\approx\mathcal{N}(0,1)\qquad\bar{X}_n\approx\mathcal{N}\left(\mu,\frac{\sigma^2}{n}\right)\qquad\bar{X}_n-\mu\approx\mathcal{N}\left(0,\frac{\sigma^2}{n}\right)\qquad\sqrt{n}(\bar{X}_n-\mu)\approx\mathcal{N}(0,\sigma^2)\qquad\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\approx\mathcal{N}(0,1)$$

**Continuity correction**

$$\mathbb{P}[\bar{X}_n\le x]\approx\Phi\left(\frac{x+\frac{1}{2}-\mu}{\sigma/\sqrt{n}}\right)\qquad\mathbb{P}[\bar{X}_n\ge x]\approx1-\Phi\left(\frac{x-\frac{1}{2}-\mu}{\sigma/\sqrt{n}}\right)$$

**Delta method**

$$Y_n\approx\mathcal{N}\left(\mu,\frac{\sigma^2}{n}\right)\implies\varphi(Y_n)\approx\mathcal{N}\left(\varphi(\mu),(\varphi'(\mu))^2\frac{\sigma^2}{n}\right)$$
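A short simulation illustrating the CLT statement above (a sketch using NumPy; the exponential population and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0            # Exp(1) has mean 1 and standard deviation 1

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma

# Z_n should be approximately standard normal
print(z.mean(), z.std())        # ≈ 0, ≈ 1
print(np.mean(z <= 1.96))       # ≈ Φ(1.96) ≈ 0.975
```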
## 11 Statistical Inference

Let $X_1,\ldots,X_n\stackrel{iid}{\sim}F$ if not otherwise noted.

### 11.1 Point Estimation

- Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n=g(X_1,\ldots,X_n)$
- $\mathrm{bias}(\hat{\theta}_n)=\mathbb{E}[\hat{\theta}_n]-\theta$
- Consistency: $\hat{\theta}_n\stackrel{P}{\to}\theta$
- Sampling distribution: $F(\hat{\theta}_n)$
- Standard error: $\mathrm{se}(\hat{\theta}_n)=\sqrt{\mathbb{V}[\hat{\theta}_n]}$
- Mean squared error: $\mathrm{mse}=\mathbb{E}\left[(\hat{\theta}_n-\theta)^2\right]=\mathrm{bias}(\hat{\theta}_n)^2+\mathbb{V}[\hat{\theta}_n]$
- $\lim_{n\to\infty}\mathrm{bias}(\hat{\theta}_n)=0\wedge\lim_{n\to\infty}\mathrm{se}(\hat{\theta}_n)=0\implies\hat{\theta}_n$ is consistent
- Asymptotic normality: $\frac{\hat{\theta}_n-\theta}{\mathrm{se}}\stackrel{D}{\to}\mathcal{N}(0,1)$
- Slutzky's Theorem often lets us replace $\mathrm{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\sigma}_n$.

### 11.2 Normal-Based Confidence Interval

Suppose $\hat{\theta}_n\approx\mathcal{N}(\theta,\widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2}=\Phi^{-1}(1-\alpha/2)$, i.e., $\mathbb{P}[Z>z_{\alpha/2}]=\alpha/2$ and $\mathbb{P}[-z_{\alpha/2}<Z<z_{\alpha/2}]=1-\alpha$ where $Z\sim\mathcal{N}(0,1)$. Then

$$C_n=\hat{\theta}_n\pm z_{\alpha/2}\,\widehat{\mathrm{se}}$$

### 11.3 Empirical Distribution

Empirical Distribution Function (ECDF)

$$\hat{F}_n(x)=\frac{\sum_{i=1}^{n}I(X_i\le x)}{n}\qquad I(X_i\le x)=\begin{cases}1&X_i\le x\\0&X_i>x\end{cases}$$

Properties (for any fixed $x$)

- $\mathbb{E}[\hat{F}_n(x)]=F(x)$
- $\mathbb{V}[\hat{F}_n(x)]=\frac{F(x)(1-F(x))}{n}$
- $\mathrm{mse}=\frac{F(x)(1-F(x))}{n}\to0$
- $\hat{F}_n(x)\stackrel{P}{\to}F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1,\ldots,X_n\sim F$)

$$\mathbb{P}\left[\sup_x\left|F(x)-\hat{F}_n(x)\right|>\varepsilon\right]\le2e^{-2n\varepsilon^2}$$

Nonparametric $1-\alpha$ confidence band for $F$

$$L(x)=\max\{\hat{F}_n(x)-\epsilon_n,0\}\qquad U(x)=\min\{\hat{F}_n(x)+\epsilon_n,1\}\qquad\epsilon_n=\sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}$$
$$\mathbb{P}[L(x)\le F(x)\le U(x)\ \forall x]\ge1-\alpha$$

### 11.4 Statistical Functionals

- Statistical functional: $T(F)$
- Plug-in estimator of $\theta=T(F)$: $\hat{\theta}_n=T(\hat{F}_n)$
- Linear functional: $T(F)=\int\varphi(x)\,dF_X(x)$
- Plug-in estimator for linear functional: $T(\hat{F}_n)=\int\varphi(x)\,d\hat{F}_n(x)=\frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$
- Often: $T(\hat{F}_n)\approx\mathcal{N}(T(F),\widehat{\mathrm{se}}^2)\implies T(\hat{F}_n)\pm z_{\alpha/2}\,\widehat{\mathrm{se}}$
- $p^{\text{th}}$ quantile: $F^{-1}(p)=\inf\{x:F(x)\ge p\}$
- $\hat{\mu}=\bar{X}_n$
- $\hat{\sigma}^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X}_n)^2$
- $\hat{\kappa}=\frac{\frac{1}{n}\sum_{i=1}^{n}(X_i-\hat{\mu})^3}{\hat{\sigma}^3}$
- $\hat{\rho}=\frac{\sum_{i=1}^{n}(X_i-\bar{X}_n)(Y_i-\bar{Y}_n)}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X}_n)^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y}_n)^2}}$
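A sketch of the ECDF and its DKW confidence band (assuming NumPy; the data here are simulated):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.normal(size=200))
n, alpha = len(x), 0.05

ecdf = np.arange(1, n + 1) / n                 # \hat F_n at the sorted points
eps = np.sqrt(np.log(2 / alpha) / (2 * n))     # DKW epsilon_n
lower = np.clip(ecdf - eps, 0, 1)
upper = np.clip(ecdf + eps, 0, 1)

# The band [lower, upper] covers the true CDF everywhere with prob >= 1 - alpha
print(eps)
```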
## 12 Parametric Inference

Let $\mathfrak{F}=\{f(x;\theta):\theta\in\Theta\}$ be a parametric model with parameter space $\Theta\subset\mathbb{R}^k$ and parameter $\theta=(\theta_1,\ldots,\theta_k)$.

### 12.1 Method of Moments

$j^{\text{th}}$ moment and $j^{\text{th}}$ sample moment

$$\alpha_j(\theta)=\mathbb{E}[X^j]=\int x^j\,dF_X(x)\qquad\hat{\alpha}_j=\frac{1}{n}\sum_{i=1}^{n}X_i^j$$

Method of Moments estimator (MoM): solve the system

$$\alpha_1(\theta)=\hat{\alpha}_1,\quad\alpha_2(\theta)=\hat{\alpha}_2,\quad\ldots,\quad\alpha_k(\theta)=\hat{\alpha}_k$$

Properties of the MoM estimator

- $\hat{\theta}_n$ exists with probability tending to 1
- Consistency: $\hat{\theta}_n\stackrel{P}{\to}\theta$
- Asymptotic normality: $\sqrt{n}(\hat{\theta}-\theta)\stackrel{D}{\to}\mathcal{N}(0,\Sigma)$ where $\Sigma=g\,\mathbb{E}[YY^T]\,g^T$, $Y=(X,X^2,\ldots,X^k)^T$, $g=(g_1,\ldots,g_k)$ and $g_j=\frac{\partial}{\partial\theta}\alpha_j^{-1}(\theta)$

### 12.2 Maximum Likelihood

Likelihood: $\mathcal{L}_n:\Theta\to[0,\infty)$, and log-likelihood

$$\mathcal{L}_n(\theta)=\prod_{i=1}^{n}f(X_i;\theta)\qquad\ell_n(\theta)=\log\mathcal{L}_n(\theta)=\sum_{i=1}^{n}\log f(X_i;\theta)$$

Maximum likelihood estimator (mle)

$$\mathcal{L}_n(\hat{\theta}_n)=\sup_\theta\mathcal{L}_n(\theta)$$

Score function

$$s(X;\theta)=\frac{\partial}{\partial\theta}\log f(X;\theta)$$

Fisher information

$$I(\theta)=\mathbb{V}_\theta[s(X;\theta)]\qquad I_n(\theta)=nI(\theta)$$

Fisher information (exponential family)

$$I(\theta)=\mathbb{E}_\theta\left[-\frac{\partial}{\partial\theta}s(X;\theta)\right]$$

Observed Fisher information

$$I_n^{\mathrm{obs}}(\theta)=-\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i;\theta)$$

Properties of the mle

- Consistency: $\hat{\theta}_n\stackrel{P}{\to}\theta$
- Equivariance: $\hat{\theta}_n$ is the mle $\implies\varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\mathrm{se}\approx\sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n-\theta}{\mathrm{se}}\stackrel{D}{\to}\mathcal{N}(0,1)$
  2. $\widehat{\mathrm{se}}\approx\sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n-\theta}{\widehat{\mathrm{se}}}\stackrel{D}{\to}\mathcal{N}(0,1)$
- Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is $\mathrm{are}(\tilde{\theta}_n,\hat{\theta}_n)=\frac{\mathbb{V}[\hat{\theta}_n]}{\mathbb{V}[\tilde{\theta}_n]}\le1$
- Approximately the Bayes estimator

#### 12.2.1 Delta Method

If $\tau=\varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta)\ne0$:

$$\frac{\hat{\tau}_n-\tau}{\widehat{\mathrm{se}}(\hat{\tau})}\stackrel{D}{\to}\mathcal{N}(0,1)$$

where $\hat{\tau}=\varphi(\hat{\theta})$ is the mle of $\tau$ and $\widehat{\mathrm{se}}(\hat{\tau})=\left|\varphi'(\hat{\theta})\right|\,\widehat{\mathrm{se}}(\hat{\theta}_n)$.
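A numerical maximum-likelihood sketch for the simple $\mathcal{N}(\mu,\sigma^2)$ model (assuming NumPy and SciPy; data are simulated), maximizing $\ell_n$ by minimizing its negative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_loglik(params):
    mu, log_sigma = params              # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                # close to the analytic mle
```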
### 12.3 Multiparameter Models

Let $\theta=(\theta_1,\ldots,\theta_k)$ and $\hat{\theta}=(\hat{\theta}_1,\ldots,\hat{\theta}_k)$ be the mle.

$$H_{jj}=\frac{\partial^2\ell_n}{\partial\theta_j^2}\qquad H_{jk}=\frac{\partial^2\ell_n}{\partial\theta_j\partial\theta_k}$$

Fisher information matrix

$$I_n(\theta)=-\begin{pmatrix}\mathbb{E}_\theta[H_{11}]&\cdots&\mathbb{E}_\theta[H_{1k}]\\\vdots&\ddots&\vdots\\\mathbb{E}_\theta[H_{k1}]&\cdots&\mathbb{E}_\theta[H_{kk}]\end{pmatrix}$$

Under appropriate regularity conditions

$$(\hat{\theta}-\theta)\approx\mathcal{N}(0,J_n)\qquad J_n(\theta)=I_n^{-1}$$

Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then

$$\frac{\hat{\theta}_j-\theta_j}{\widehat{\mathrm{se}}_j}\stackrel{D}{\to}\mathcal{N}(0,1)\qquad\text{where }\widehat{\mathrm{se}}_j^2=J_n(j,j)\text{ and }\mathrm{Cov}[\hat{\theta}_j,\hat{\theta}_k]=J_n(j,k)$$

#### 12.3.1 Multiparameter delta method

Let $\tau=\varphi(\theta_1,\ldots,\theta_k)$ and let the gradient of $\varphi$ be

$$\nabla\varphi=\left(\frac{\partial\varphi}{\partial\theta_1},\ldots,\frac{\partial\varphi}{\partial\theta_k}\right)^T$$

Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}}\ne0$ and $\hat{\tau}=\varphi(\hat{\theta})$. Then

$$\frac{\hat{\tau}-\tau}{\widehat{\mathrm{se}}(\hat{\tau})}\stackrel{D}{\to}\mathcal{N}(0,1)\qquad\text{where }\widehat{\mathrm{se}}(\hat{\tau})=\sqrt{\left(\widehat{\nabla}\varphi\right)^T\hat{J}_n\left(\widehat{\nabla}\varphi\right)},\quad\hat{J}_n=J_n(\hat{\theta}),\quad\widehat{\nabla}\varphi=\nabla\varphi\big|_{\theta=\hat{\theta}}$$

### 12.4 Parametric Bootstrap

Sample from $f(x;\hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or the method of moments estimator.

## 13 Hypothesis Testing

$$H_0:\theta\in\Theta_0\quad\text{versus}\quad H_1:\theta\in\Theta_1$$

**Definitions**

- Null hypothesis $H_0$; alternative hypothesis $H_1$
- Simple hypothesis: $\theta=\theta_0$
- Composite hypothesis: $\theta>\theta_0$ or $\theta<\theta_0$
- Two-sided test: $H_0:\theta=\theta_0$ versus $H_1:\theta\ne\theta_0$
- One-sided test: $H_0:\theta\le\theta_0$ versus $H_1:\theta>\theta_0$
- Critical value $c$; test statistic $T$
- Rejection region $R=\{x:T(x)>c\}$
- Power function $\beta(\theta)=\mathbb{P}[X\in R]$
- Power of a test: $1-\mathbb{P}[\text{Type II error}]=1-\beta=\inf_{\theta\in\Theta_1}\beta(\theta)$
- Test size: $\alpha=\mathbb{P}[\text{Type I error}]=\sup_{\theta\in\Theta_0}\beta(\theta)$

|            | Retain $H_0$              | Reject $H_0$               |
|------------|---------------------------|----------------------------|
| $H_0$ true | ✓                         | Type I error ($\alpha$)    |
| $H_1$ true | Type II error ($\beta$)   | ✓ (power)                  |

**p-value**

- p-value $=\sup_{\theta\in\Theta_0}\mathbb{P}_\theta[T(X)\ge T(x)]=\inf\{\alpha:T(x)\in R_\alpha\}$
- p-value $=\sup_{\theta\in\Theta_0}\underbrace{\mathbb{P}_\theta[T(X^\star)\ge T(X)]}_{1-F_\theta(T(X))\ \text{since}\ T(X^\star)\sim F_\theta}=\inf\{\alpha:T(X)\in R_\alpha\}$

| p-value | evidence |
|---|---|
| $<0.01$ | very strong evidence against $H_0$ |
| $0.01$–$0.05$ | strong evidence against $H_0$ |
| $0.05$–$0.1$ | weak evidence against $H_0$ |
| $>0.1$ | little or no evidence against $H_0$ |

**Wald test**

- Two-sided test
- Reject $H_0$ when $|W|>z_{\alpha/2}$ where $W=\frac{\hat{\theta}-\theta_0}{\widehat{\mathrm{se}}}$
- $\mathbb{P}[|W|>z_{\alpha/2}]\to\alpha$
- p-value $=\mathbb{P}_{\theta_0}[|W|>|w|]\approx\mathbb{P}[|Z|>|w|]=2\Phi(-|w|)$
**Likelihood ratio test**

- $T(X)=\frac{\sup_{\theta\in\Theta}\mathcal{L}_n(\theta)}{\sup_{\theta\in\Theta_0}\mathcal{L}_n(\theta)}=\frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$
- $\lambda(X)=2\log T(X)\stackrel{D}{\to}\chi^2_{r-q}$ where $\sum_{i=1}^{k}Z_i^2\sim\chi^2_k$ and $Z_1,\ldots,Z_k\stackrel{iid}{\sim}\mathcal{N}(0,1)$
- p-value $=\mathbb{P}_{\theta_0}[\lambda(X)>\lambda(x)]\approx\mathbb{P}[\chi^2_{r-q}>\lambda(x)]$

**Multinomial LRT**

- mle: $\hat{p}_n=\left(\frac{X_1}{n},\ldots,\frac{X_k}{n}\right)$
- $T(X)=\frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)}=\prod_{j=1}^{k}\left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
- $\lambda(X)=2\sum_{j=1}^{k}X_j\log\left(\frac{\hat{p}_j}{p_{0j}}\right)\stackrel{D}{\to}\chi^2_{k-1}$
- The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X)\ge\chi^2_{k-1,\alpha}$

**Pearson chi-square test**

- $T=\sum_{j=1}^{k}\frac{(X_j-\mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$ where $\mathbb{E}[X_j]=np_{0j}$ under $H_0$
- $T\stackrel{D}{\to}\chi^2_{k-1}$, p-value $=\mathbb{P}[\chi^2_{k-1}>T(x)]$
- Faster convergence to $\chi^2_{k-1}$ than the LRT, hence preferable for small $n$

**Independence testing**

- $I$ rows, $J$ columns, $X$ a multinomial sample of size $n=I\cdot J$
- mles unconstrained: $\hat{p}_{ij}=\frac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij}=\hat{p}_{i\cdot}\hat{p}_{\cdot j}=\frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
- LRT: $\lambda=2\sum_{i=1}^{I}\sum_{j=1}^{J}X_{ij}\log\left(\frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}}\right)$
- Pearson: $T=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij}-\mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
- LRT and Pearson $\stackrel{D}{\to}\chi^2_\nu$, where $\nu=(I-1)(J-1)$

## 14 Exponential Family

Scalar parameter

$$f_X(x\mid\theta)=h(x)\exp\{\eta(\theta)T(x)-A(\theta)\}=h(x)g(\theta)\exp\{\eta(\theta)T(x)\}$$

Vector parameter

$$f_X(x\mid\theta)=h(x)\exp\left\{\sum_{i=1}^{s}\eta_i(\theta)T_i(x)-A(\theta)\right\}=h(x)\exp\{\eta(\theta)\cdot T(x)-A(\theta)\}=h(x)g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$$

Natural form

$$f_X(x\mid\eta)=h(x)\exp\{\eta\cdot\mathbf{T}(x)-A(\eta)\}=h(x)g(\eta)\exp\{\eta\cdot\mathbf{T}(x)\}=h(x)g(\eta)\exp\{\eta^T\mathbf{T}(x)\}$$

## 15 Bayesian Inference

Bayes' Theorem

$$f(\theta\mid x)=\frac{f(x\mid\theta)f(\theta)}{f(x^n)}=\frac{f(x\mid\theta)f(\theta)}{\int f(x\mid\theta)f(\theta)\,d\theta}\propto\mathcal{L}_n(\theta)f(\theta)$$

**Definitions**

- $X^n=(X_1,\ldots,X_n)$, $x^n=(x_1,\ldots,x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n\mid\theta)$: joint density of the data; in particular, $X^n$ iid $\implies f(x^n\mid\theta)=\prod_{i=1}^{n}f(x_i\mid\theta)=\mathcal{L}_n(\theta)$
- Posterior density $f(\theta\mid x^n)$
- Normalizing constant $c_n=f(x^n)=\int f(x\mid\theta)f(\theta)\,d\theta$
- Kernel: part of a density that depends on $\theta$
- Posterior mean $\bar{\theta}_n=\int\theta f(\theta\mid x^n)\,d\theta=\frac{\int\theta\mathcal{L}_n(\theta)f(\theta)\,d\theta}{\int\mathcal{L}_n(\theta)f(\theta)\,d\theta}$

### 15.1 Credible Intervals

Posterior interval

$$\mathbb{P}[\theta\in(a,b)\mid x^n]=\int_a^bf(\theta\mid x^n)\,d\theta=1-\alpha$$

Equal-tail credible interval

$$\int_{-\infty}^{a}f(\theta\mid x^n)\,d\theta=\int_b^{\infty}f(\theta\mid x^n)\,d\theta=\alpha/2$$

Highest posterior density (HPD) region $R_n$

1. $\mathbb{P}[\theta\in R_n]=1-\alpha$
2. $R_n=\{\theta:f(\theta\mid x^n)>k\}$ for some $k$

$R_n$ unimodal $\implies R_n$ is an interval.
### 15.2 Function of Parameters

Let $\tau=\varphi(\theta)$ and $A=\{\theta:\varphi(\theta)\le\tau\}$.

Posterior CDF and density for $\tau$

$$H(\tau\mid x^n)=\mathbb{P}[\varphi(\theta)\le\tau\mid x^n]=\int_Af(\theta\mid x^n)\,d\theta\qquad h(\tau\mid x^n)=H'(\tau\mid x^n)$$

Bayesian delta method

$$\tau\mid X^n\approx\mathcal{N}\left(\varphi(\hat{\theta}),\ \widehat{\mathrm{se}}\,\left|\varphi'(\hat{\theta})\right|\right)$$

### 15.3 Priors

**Choice**

- Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
- Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
- Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

**Types**

- Flat: $f(\theta)\propto\text{constant}$
- Proper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta=1$
- Improper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta=\infty$
- Jeffrey's prior (transformation-invariant): $f(\theta)\propto\sqrt{I(\theta)}$, or $f(\theta)\propto\sqrt{\det(I(\theta))}$ in the multiparameter case
- Conjugate: $f(\theta)$ and $f(\theta\mid x^n)$ belong to the same parametric family

#### 15.3.1 Conjugate Priors

Continuous likelihood (subscript $c$ denotes a constant):

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| $\mathrm{Unif}(0,\theta)$ | $\mathrm{Pareto}(x_m,k)$ | $\max\{x_{(n)},x_m\},\ k+n$ |
| $\mathrm{Exp}(\lambda)$ | $\mathrm{Gamma}(\alpha,\beta)$ | $\alpha+n,\ \beta+\sum_{i=1}^{n}x_i$ |
| $\mathcal{N}(\mu,\sigma_c^2)$ | $\mathcal{N}(\mu_0,\sigma_0^2)$ | $\left(\frac{\mu_0}{\sigma_0^2}+\frac{\sum_{i=1}^{n}x_i}{\sigma_c^2}\right)\Big/\left(\frac{1}{\sigma_0^2}+\frac{n}{\sigma_c^2}\right),\ \left(\frac{1}{\sigma_0^2}+\frac{n}{\sigma_c^2}\right)^{-1}$ |
| $\mathcal{N}(\mu_c,\sigma^2)$ | Scaled Inverse Chi-square$(\nu,\sigma_0^2)$ | $\nu+n,\ \frac{\nu\sigma_0^2+\sum_{i=1}^{n}(x_i-\mu_c)^2}{\nu+n}$ |
| $\mathcal{N}(\mu,\sigma^2)$ | Normal-scaled Inverse Gamma$(\lambda,\nu,\alpha,\beta)$ | $\frac{\nu\lambda+n\bar{x}}{\nu+n},\ \nu+n,\ \alpha+\frac{n}{2},\ \beta+\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2+\frac{n\nu(\bar{x}-\lambda)^2}{2(n+\nu)}$ |
| $\mathrm{MVN}(\mu,\Sigma_c)$ | $\mathrm{MVN}(\mu_0,\Sigma_0)$ | $\left(\Sigma_0^{-1}+n\Sigma_c^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0+n\Sigma_c^{-1}\bar{x}\right),\ \left(\Sigma_0^{-1}+n\Sigma_c^{-1}\right)^{-1}$ |
| $\mathrm{MVN}(\mu_c,\Sigma)$ | Inverse-Wishart$(\kappa,\Psi)$ | $n+\kappa,\ \Psi+\sum_{i=1}^{n}(x_i-\mu_c)(x_i-\mu_c)^T$ |
| $\mathrm{Pareto}(x_{m_c},k)$ | $\mathrm{Gamma}(\alpha,\beta)$ | $\alpha+n,\ \beta+\sum_{i=1}^{n}\log\frac{x_i}{x_{m_c}}$ |
| $\mathrm{Pareto}(x_m,k_c)$ | $\mathrm{Pareto}(x_0,k_0)$ | $x_0,\ k_0-k_cn$ where $k_0>k_cn$ |
| $\mathrm{Gamma}(\alpha_c,\beta)$ | $\mathrm{Gamma}(\alpha_0,\beta_0)$ | $\alpha_0+n\alpha_c,\ \beta_0+\sum_{i=1}^{n}x_i$ |

Discrete likelihood:

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| $\mathrm{Bern}(p)$ | $\mathrm{Beta}(\alpha,\beta)$ | $\alpha+\sum_{i=1}^{n}x_i,\ \beta+n-\sum_{i=1}^{n}x_i$ |
| $\mathrm{Bin}(p)$ | $\mathrm{Beta}(\alpha,\beta)$ | $\alpha+\sum_{i=1}^{n}x_i,\ \beta+\sum_{i=1}^{n}N_i-\sum_{i=1}^{n}x_i$ |
| $\mathrm{NBin}(p)$ | $\mathrm{Beta}(\alpha,\beta)$ | $\alpha+rn,\ \beta+\sum_{i=1}^{n}x_i$ |
| $\mathrm{Po}(\lambda)$ | $\mathrm{Gamma}(\alpha,\beta)$ | $\alpha+\sum_{i=1}^{n}x_i,\ \beta+n$ |
| $\mathrm{Multinomial}(p)$ | $\mathrm{Dir}(\alpha)$ | $\alpha+\sum_{i=1}^{n}x^{(i)}$ |
| $\mathrm{Geo}(p)$ | $\mathrm{Beta}(\alpha,\beta)$ | $\alpha+n,\ \beta+\sum_{i=1}^{n}x_i$ |
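A tiny conjugate-update example for the Bernoulli–Beta row above (a sketch; the prior hyperparameters and data are arbitrary):

```python
import numpy as np

alpha, beta = 2.0, 2.0                  # Beta(2, 2) prior on p
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # observed Bernoulli draws

# Posterior hyperparameters: alpha + sum(x), beta + n - sum(x)
alpha_post = alpha + x.sum()
beta_post = beta + len(x) - x.sum()

print(alpha_post, beta_post)                  # 8.0, 4.0
print(alpha_post / (alpha_post + beta_post))  # posterior mean of p
```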
### 15.4 Bayesian Testing

If $H_0:\theta\in\Theta_0$:

$$\text{Prior probability }\ \mathbb{P}[H_0]=\int_{\Theta_0}f(\theta)\,d\theta\qquad\text{Posterior probability }\ \mathbb{P}[H_0\mid x^n]=\int_{\Theta_0}f(\theta\mid x^n)\,d\theta$$

Let $H_0,\ldots,H_{K-1}$ be $K$ hypotheses and suppose $\theta\sim f(\theta\mid H_k)$:

$$\mathbb{P}[H_k\mid x^n]=\frac{f(x^n\mid H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^{K}f(x^n\mid H_k)\,\mathbb{P}[H_k]}$$

Marginal likelihood

$$f(x^n\mid H_i)=\int_\Theta f(x^n\mid\theta,H_i)f(\theta\mid H_i)\,d\theta$$

Posterior odds (of $H_i$ relative to $H_j$)

$$\frac{\mathbb{P}[H_i\mid x^n]}{\mathbb{P}[H_j\mid x^n]}=\underbrace{\frac{f(x^n\mid H_i)}{f(x^n\mid H_j)}}_{\text{Bayes factor }BF_{ij}}\times\underbrace{\frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}}_{\text{prior odds}}$$

Bayes factor

| $\log_{10}BF_{10}$ | $BF_{10}$ | evidence |
|---|---|---|
| $0$–$0.5$ | $1$–$1.5$ | Weak |
| $0.5$–$1$ | $1.5$–$10$ | Moderate |
| $1$–$2$ | $10$–$100$ | Strong |
| $>2$ | $>100$ | Decisive |

$$p^\ast=\frac{\frac{p}{1-p}BF_{10}}{1+\frac{p}{1-p}BF_{10}}\qquad\text{where }p=\mathbb{P}[H_1]\text{ and }p^\ast=\mathbb{P}[H_1\mid x^n]$$

## 16 Sampling Methods

### 16.1 Inverse Transform Sampling

Setup: $U\sim\mathrm{Unif}(0,1)$, $X\sim F$, $F^{-1}(u)=\inf\{x\mid F(x)\ge u\}$.

Algorithm

1. Generate $u\sim\mathrm{Unif}(0,1)$
2. Compute $x=F^{-1}(u)$

### 16.2 The Bootstrap

Let $T_n=g(X_1,\ldots,X_n)$ be a statistic.

1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat{F}_n}[T_n]$.
2. Approximate $\mathbb{V}_{\hat{F}_n}[T_n]$ using simulation:
   1. Repeat the following $B$ times to get $T^\ast_{n,1},\ldots,T^\ast_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$: (i) sample uniformly $X_1^\ast,\ldots,X_n^\ast\sim\hat{F}_n$; (ii) compute $T_n^\ast=g(X_1^\ast,\ldots,X_n^\ast)$.
   2. Then
      $$\hat{v}_{\mathrm{boot}}=\hat{\mathbb{V}}_{\hat{F}_n}=\frac{1}{B}\sum_{b=1}^{B}\left(T^\ast_{n,b}-\frac{1}{B}\sum_{r=1}^{B}T^\ast_{n,r}\right)^2$$

#### 16.2.1 Bootstrap Confidence Intervals

Normal-based interval

$$T_n\pm z_{\alpha/2}\,\widehat{\mathrm{se}}_{\mathrm{boot}}$$

Pivotal interval

1. Location parameter $\theta=T(F)$
2. Pivot $R_n=\hat{\theta}_n-\theta$
3. Let $H(r)=\mathbb{P}[R_n\le r]$ be the cdf of $R_n$
4. Let $R^\ast_{n,b}=\hat{\theta}^\ast_{n,b}-\hat{\theta}_n$. Approximate $H$ using the bootstrap: $\hat{H}(r)=\frac{1}{B}\sum_{b=1}^{B}I(R^\ast_{n,b}\le r)$
5. $\theta^\ast_\beta=\beta$ sample quantile of $(\hat{\theta}^\ast_{n,1},\ldots,\hat{\theta}^\ast_{n,B})$
6. $r^\ast_\beta=\beta$ sample quantile of $(R^\ast_{n,1},\ldots,R^\ast_{n,B})$, i.e., $r^\ast_\beta=\theta^\ast_\beta-\hat{\theta}_n$
7. Approximate $1-\alpha$ confidence interval $C_n=(\hat{a},\hat{b})$ where
   $$\hat{a}=\hat{\theta}_n-\hat{H}^{-1}\!\left(1-\frac{\alpha}{2}\right)=\hat{\theta}_n-r^\ast_{1-\alpha/2}=2\hat{\theta}_n-\theta^\ast_{1-\alpha/2}\qquad\hat{b}=\hat{\theta}_n-\hat{H}^{-1}\!\left(\frac{\alpha}{2}\right)=\hat{\theta}_n-r^\ast_{\alpha/2}=2\hat{\theta}_n-\theta^\ast_{\alpha/2}$$

Percentile interval

$$C_n=\left(\theta^\ast_{\alpha/2},\ \theta^\ast_{1-\alpha/2}\right)$$

### 16.3 Rejection Sampling

Setup

- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportionality constant: $h(\theta)=\frac{k(\theta)}{\int k(\theta)\,d\theta}$
- Envelope condition: we can find $M>0$ such that $k(\theta)\le Mg(\theta)\ \forall\theta$

Algorithm

1. Draw $\theta^{\mathrm{cand}}\sim g(\theta)$
2. Generate $u\sim\mathrm{Unif}(0,1)$
3. Accept $\theta^{\mathrm{cand}}$ if $u\le\frac{k(\theta^{\mathrm{cand}})}{Mg(\theta^{\mathrm{cand}})}$
4. Repeat until $B$ values of $\theta^{\mathrm{cand}}$ have been accepted

Example: sampling from the posterior

- We can easily sample from the prior $g(\theta)=f(\theta)$
- Target is the posterior $h(\theta)\propto k(\theta)=f(x^n\mid\theta)f(\theta)$
- Envelope condition: $f(x^n\mid\theta)\le f(x^n\mid\hat{\theta}_n)=\mathcal{L}_n(\hat{\theta}_n)\equiv M$
- Algorithm: (1) draw $\theta^{\mathrm{cand}}\sim f(\theta)$; (2) generate $u\sim\mathrm{Unif}(0,1)$; (3) accept $\theta^{\mathrm{cand}}$ if $u\le\frac{\mathcal{L}_n(\theta^{\mathrm{cand}})}{\mathcal{L}_n(\hat{\theta}_n)}$

### 16.4 Importance Sampling

Sample from an importance function $g$ rather than the target density $h$. Algorithm to obtain an approximation to $\mathbb{E}[q(\theta)\mid x^n]$:

1. Sample from the prior $\theta_1,\ldots,\theta_B\sim f(\theta)$
2. $w_i=\frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}\quad\forall i=1,\ldots,B$
3. $\mathbb{E}[q(\theta)\mid x^n]\approx\sum_{i=1}^{B}q(\theta_i)w_i$
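A compact sketch of the nonparametric bootstrap for the variance and percentile interval of the sample median (assuming NumPy; B and the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(size=100)          # observed data
B = 2000

# Resample with replacement and recompute the statistic each time
t_star = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                   for _ in range(B)])

v_boot = t_star.var()                        # bootstrap variance estimate
ci = np.quantile(t_star, [0.025, 0.975])     # percentile interval
print(np.median(x), v_boot, ci)
```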
## 17 Decision Theory

**Definitions**

- Unknown quantity affecting our decision: $\theta\in\Theta$
- Decision rule: synonymous with an estimator $\hat{\theta}$
- Action $a\in A$: a possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
- Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, i.e., the discrepancy between $\theta$ and $\hat{\theta}$; $L:\Theta\times A\to[-k,\infty)$.

**Loss functions**

- Squared error loss: $L(\theta,a)=(\theta-a)^2$
- Linear loss: $L(\theta,a)=\begin{cases}K_1(\theta-a)&a-\theta<0\\K_2(a-\theta)&a-\theta\ge0\end{cases}$
- Absolute error loss: $L(\theta,a)=|\theta-a|$ (linear loss with $K_1=K_2$)
- $L_p$ loss: $L(\theta,a)=|\theta-a|^p$
- Zero-one loss: $L(\theta,a)=\begin{cases}0&a=\theta\\1&a\ne\theta\end{cases}$

### 17.1 Risk

Posterior risk

$$r(\hat{\theta}\mid x)=\int L(\theta,\hat{\theta}(x))f(\theta\mid x)\,d\theta=\mathbb{E}_{\theta\mid X}\left[L(\theta,\hat{\theta}(x))\right]$$

(Frequentist) risk

$$R(\theta,\hat{\theta})=\int L(\theta,\hat{\theta}(x))f(x\mid\theta)\,dx=\mathbb{E}_{X\mid\theta}\left[L(\theta,\hat{\theta}(X))\right]$$

Bayes risk

$$r(f,\hat{\theta})=\iint L(\theta,\hat{\theta}(x))f(x,\theta)\,dx\,d\theta=\mathbb{E}_{\theta,X}\left[L(\theta,\hat{\theta}(X))\right]$$
$$r(f,\hat{\theta})=\mathbb{E}_\theta\left[\mathbb{E}_{X\mid\theta}\left[L(\theta,\hat{\theta}(X))\right]\right]=\mathbb{E}_\theta\left[R(\theta,\hat{\theta})\right]\qquad r(f,\hat{\theta})=\mathbb{E}_X\left[\mathbb{E}_{\theta\mid X}\left[L(\theta,\hat{\theta}(X))\right]\right]=\mathbb{E}_X\left[r(\hat{\theta}\mid X)\right]$$
### 17.2 Admissibility

- $\hat{\theta}'$ dominates $\hat{\theta}$ if
  $$\forall\theta:R(\theta,\hat{\theta}')\le R(\theta,\hat{\theta})\quad\text{and}\quad\exists\theta:R(\theta,\hat{\theta}')<R(\theta,\hat{\theta})$$
- $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it; otherwise it is called admissible.

### 17.3 Bayes Rule

Bayes rule (or Bayes estimator)

- $r(f,\hat{\theta})=\inf_{\tilde{\theta}}r(f,\tilde{\theta})$
- $\hat{\theta}(x)=\inf r(\hat{\theta}\mid x)\ \forall x\implies r(f,\hat{\theta})=\int r(\hat{\theta}\mid x)f(x)\,dx$

Theorems

- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode

### 17.4 Minimax Rules

Maximum risk

$$\bar{R}(\hat{\theta})=\sup_\theta R(\theta,\hat{\theta})\qquad\bar{R}(a)=\sup_\theta R(\theta,a)$$

Minimax rule

$$\sup_\theta R(\theta,\hat{\theta})=\inf_{\tilde{\theta}}\bar{R}(\tilde{\theta})=\inf_{\tilde{\theta}}\sup_\theta R(\theta,\tilde{\theta})$$

A Bayes rule with constant risk is minimax: $\hat{\theta}=\text{Bayes rule}\wedge\exists c:R(\theta,\hat{\theta})=c$.

Least favorable prior: $\hat{\theta}^f=\text{Bayes rule}\wedge R(\theta,\hat{\theta}^f)\le r(f,\hat{\theta}^f)\ \forall\theta$.

## 18 Linear Regression

**Definitions**

- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

### 18.1 Simple Linear Regression

Model

$$Y_i=\beta_0+\beta_1X_i+\epsilon_i\qquad\mathbb{E}[\epsilon_i\mid X_i]=0,\quad\mathbb{V}[\epsilon_i\mid X_i]=\sigma^2$$

Fitted line: $\hat{r}(x)=\hat{\beta}_0+\hat{\beta}_1x$; predicted (fitted) values: $\hat{Y}_i=\hat{r}(X_i)$; residuals: $\hat{\epsilon}_i=Y_i-\hat{Y}_i=Y_i-(\hat{\beta}_0+\hat{\beta}_1X_i)$.

Residual sums of squares (rss)

$$\mathrm{rss}(\hat{\beta}_0,\hat{\beta}_1)=\sum_{i=1}^{n}\hat{\epsilon}_i^2$$

Least squares estimates ($\hat{\beta}^T=(\hat{\beta}_0,\hat{\beta}_1)^T$ minimizes the rss)

$$\hat{\beta}_0=\bar{Y}_n-\hat{\beta}_1\bar{X}_n\qquad\hat{\beta}_1=\frac{\sum_{i=1}^{n}(X_i-\bar{X}_n)(Y_i-\bar{Y}_n)}{\sum_{i=1}^{n}(X_i-\bar{X}_n)^2}=\frac{\sum_{i=1}^{n}X_iY_i-n\bar{X}\bar{Y}}{\sum_{i=1}^{n}X_i^2-n\bar{X}^2}$$

$$\mathbb{E}[\hat{\beta}\mid X^n]=\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix}\qquad\mathbb{V}[\hat{\beta}\mid X^n]=\frac{\sigma^2}{ns_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n}X_i^2&-\bar{X}_n\\-\bar{X}_n&1\end{pmatrix}$$

$$\widehat{\mathrm{se}}(\hat{\beta}_0)=\frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n}X_i^2}{n}}\qquad\widehat{\mathrm{se}}(\hat{\beta}_1)=\frac{\hat{\sigma}}{s_X\sqrt{n}}$$

where $s_X^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X}_n)^2$ and $\hat{\sigma}^2=\frac{1}{n-2}\sum_{i=1}^{n}\hat{\epsilon}_i^2$ (unbiased estimate).

Further properties:

- Consistency: $\hat{\beta}_0\stackrel{P}{\to}\beta_0$ and $\hat{\beta}_1\stackrel{P}{\to}\beta_1$
- Asymptotic normality: $\frac{\hat{\beta}_0-\beta_0}{\widehat{\mathrm{se}}(\hat{\beta}_0)}\stackrel{D}{\to}\mathcal{N}(0,1)$ and $\frac{\hat{\beta}_1-\beta_1}{\widehat{\mathrm{se}}(\hat{\beta}_1)}\stackrel{D}{\to}\mathcal{N}(0,1)$
- Approximate $1-\alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0\pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1\pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_1)$
- Wald test for $H_0:\beta_1=0$ vs. $H_1:\beta_1\ne0$: reject $H_0$ if $|W|>z_{\alpha/2}$ where $W=\hat{\beta}_1/\widehat{\mathrm{se}}(\hat{\beta}_1)$

$R^2$

$$R^2=\frac{\sum_{i=1}^{n}(\hat{Y}_i-\bar{Y})^2}{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}=1-\frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}=1-\frac{\mathrm{rss}}{\mathrm{tss}}$$

Likelihood

$$\mathcal{L}=\prod_{i=1}^{n}f(X_i,Y_i)=\prod_{i=1}^{n}f_X(X_i)\times\prod_{i=1}^{n}f_{Y\mid X}(Y_i\mid X_i)=\mathcal{L}_1\times\mathcal{L}_2$$
$$\mathcal{L}_1=\prod_{i=1}^{n}f_X(X_i)\qquad\mathcal{L}_2=\prod_{i=1}^{n}f_{Y\mid X}(Y_i\mid X_i)\propto\sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i-(\beta_0+\beta_1X_i)\right)^2\right\}$$

Under the assumption of Normality, the least squares estimator is also the mle, but the least squares variance estimator is not the mle:

$$\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}\hat{\epsilon}_i^2$$

### 18.2 Prediction

Observe $X=x_\ast$ of the covariate and predict the outcome $Y_\ast$.

$$\hat{Y}_\ast=\hat{\beta}_0+\hat{\beta}_1x_\ast\qquad\mathbb{V}[\hat{Y}_\ast]=\mathbb{V}[\hat{\beta}_0]+x_\ast^2\,\mathbb{V}[\hat{\beta}_1]+2x_\ast\,\mathrm{Cov}[\hat{\beta}_0,\hat{\beta}_1]$$

Prediction interval

$$\hat{\xi}_n^2=\hat{\sigma}^2\left(\frac{\sum_{i=1}^{n}(X_i-x_\ast)^2}{n\sum_i(X_i-\bar{X})^2}+1\right)\qquad\hat{Y}_\ast\pm z_{\alpha/2}\,\hat{\xi}_n$$

### 18.3 Multiple Regression

$$Y=X\beta+\epsilon\qquad X=\begin{pmatrix}X_{11}&\cdots&X_{1k}\\\vdots&\ddots&\vdots\\X_{n1}&\cdots&X_{nk}\end{pmatrix}\qquad\beta=\begin{pmatrix}\beta_1\\\vdots\\\beta_k\end{pmatrix}\qquad\epsilon=\begin{pmatrix}\epsilon_1\\\vdots\\\epsilon_n\end{pmatrix}$$

Likelihood

$$\mathcal{L}(\mu,\Sigma)=(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\mathrm{rss}\right)\qquad\mathrm{rss}=(y-X\beta)^T(y-X\beta)=\|Y-X\beta\|^2=\sum_{i=1}^{N}(Y_i-x_i^T\beta)^2$$

If the $(k\times k)$ matrix $X^TX$ is invertible,

$$\hat{\beta}=(X^TX)^{-1}X^TY\qquad\mathbb{V}[\hat{\beta}\mid X^n]=\sigma^2(X^TX)^{-1}\qquad\hat{\beta}\approx\mathcal{N}\left(\beta,\sigma^2(X^TX)^{-1}\right)$$

Estimated regression function

$$\hat{r}(x)=\sum_{j=1}^{k}\hat{\beta}_jx_j$$

Unbiased estimate for $\sigma^2$

$$\hat{\sigma}^2=\frac{1}{n-k}\sum_{i=1}^{n}\hat{\epsilon}_i^2\qquad\hat{\epsilon}=X\hat{\beta}-Y$$

mle

$$\hat{\mu}=\bar{X}\qquad\hat{\sigma}^2=\frac{n-k}{n}\sigma^2$$

$1-\alpha$ confidence interval

$$\hat{\beta}_j\pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_j)$$
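A short sketch of the closed-form least squares estimator above (assuming NumPy; data are simulated):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)   # true beta0 = 1, beta1 = 2

X = np.column_stack([np.ones(n), x])                # design matrix with intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y

resid = y - X @ beta_hat
k = X.shape[1]
sigma2_hat = resid @ resid / (n - k)                # unbiased variance estimate
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)      # V[beta_hat | X]
print(beta_hat, np.sqrt(np.diag(cov_beta)))
```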
### 18.4 Model Selection

Consider predicting a new observation $Y^\ast$ for covariates $X^\ast$ and let $S\subset J$ denote a subset of the covariates in the model, where $|S|=k$ and $|J|=n$.

Issues

- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure

1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing

$$H_0:\beta_j=0\quad\text{vs.}\quad H_1:\beta_j\ne0\quad\forall j\in J$$

Mean squared prediction error (mspe) and prediction risk

$$\mathrm{mspe}=\mathbb{E}\left[(\hat{Y}(S)-Y^\ast)^2\right]\qquad R(S)=\sum_{i=1}^{n}\mathrm{mspe}_i=\sum_{i=1}^{n}\mathbb{E}\left[(\hat{Y}_i(S)-Y_i^\ast)^2\right]$$

Training error

$$\hat{R}_{tr}(S)=\sum_{i=1}^{n}(\hat{Y}_i(S)-Y_i)^2$$

$R^2$

$$R^2(S)=1-\frac{\mathrm{rss}(S)}{\mathrm{tss}}=1-\frac{\hat{R}_{tr}(S)}{\mathrm{tss}}=\frac{\sum_{i=1}^{n}(\hat{Y}_i(S)-\bar{Y})^2}{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}$$

The training error is a downward-biased estimate of the prediction risk:

$$\mathbb{E}[\hat{R}_{tr}(S)]<R(S)\qquad\mathrm{bias}(\hat{R}_{tr}(S))=\mathbb{E}[\hat{R}_{tr}(S)]-R(S)=-2\sum_{i=1}^{n}\mathrm{Cov}[\hat{Y}_i,Y_i]$$

Adjusted $R^2$

$$\bar{R}^2(S)=1-\frac{n-1}{n-k}\frac{\mathrm{rss}}{\mathrm{tss}}$$

Mallow's $C_p$ statistic

$$\hat{R}(S)=\hat{R}_{tr}(S)+2k\hat{\sigma}^2=\text{lack of fit}+\text{complexity penalty}$$

Akaike Information Criterion (AIC)

$$\mathrm{AIC}(S)=\ell_n(\hat{\beta}_S,\hat{\sigma}_S^2)-k$$

Bayesian Information Criterion (BIC)

$$\mathrm{BIC}(S)=\ell_n(\hat{\beta}_S,\hat{\sigma}_S^2)-\frac{k}{2}\log n$$

Validation and training

$$\hat{R}_V(S)=\sum_{i=1}^{m}(\hat{Y}_i^\ast(S)-Y_i^\ast)^2\qquad m=|\{\text{validation data}\}|,\ \text{often }\frac{n}{4}\text{ or }\frac{n}{2}$$

Leave-one-out cross-validation

$$\hat{R}_{CV}(S)=\sum_{i=1}^{n}(Y_i-\hat{Y}_{(i)})^2=\sum_{i=1}^{n}\left(\frac{Y_i-\hat{Y}_i(S)}{1-U_{ii}(S)}\right)^2\qquad U(S)=X_S(X_S^TX_S)^{-1}X_S^T\ \ (\text{"hat matrix"})$$

## 19 Non-parametric Function Estimation

### 19.1 Density Estimation

Estimate $f(x)$, where $f(x)$ satisfies $\mathbb{P}[X\in A]=\int_Af(x)\,dx$.

Integrated square error (ise)

$$L(f,\hat{f}_n)=\int\left(f(x)-\hat{f}_n(x)\right)^2dx=J(h)+\int f^2(x)\,dx$$

Frequentist risk

$$R(f,\hat{f}_n)=\mathbb{E}\left[L(f,\hat{f}_n)\right]=\int b^2(x)\,dx+\int v(x)\,dx\qquad b(x)=\mathbb{E}[\hat{f}_n(x)]-f(x),\quad v(x)=\mathbb{V}[\hat{f}_n(x)]$$
#### 19.1.1 Histograms

Definitions

- Number of bins $m$
- Binwidth $h=\frac{1}{m}$
- Bin $B_j$ has $\nu_j$ observations
- Define $\hat{p}_j=\nu_j/n$ and $p_j=\int_{B_j}f(u)\,du$

Histogram estimator

$$\hat{f}_n(x)=\sum_{j=1}^{m}\frac{\hat{p}_j}{h}I(x\in B_j)\qquad\mathbb{E}[\hat{f}_n(x)]=\frac{p_j}{h}\qquad\mathbb{V}[\hat{f}_n(x)]=\frac{p_j(1-p_j)}{nh^2}$$

$$R(\hat{f}_n,f)\approx\frac{h^2}{12}\int(f'(u))^2du+\frac{1}{nh}\qquad h^\ast=\frac{1}{n^{1/3}}\left(\frac{6}{\int(f'(u))^2du}\right)^{1/3}$$

$$R^\ast(\hat{f}_n,f)\approx\frac{C}{n^{2/3}}\qquad C=\left(\frac{3}{4}\right)^{2/3}\left(\int(f'(u))^2du\right)^{1/3}$$

Cross-validation estimate of $\mathbb{E}[J(h)]$

$$\hat{J}_{CV}(h)=\int\hat{f}_n^2(x)\,dx-\frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i)=\frac{2}{(n-1)h}-\frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$$

#### 19.1.2 Kernel Density Estimator (KDE)

Kernel $K$

- $K(x)\ge0$
- $\int K(x)\,dx=1$
- $\int xK(x)\,dx=0$
- $\int x^2K(x)\,dx\equiv\sigma_K^2>0$

KDE

$$\hat{f}_n(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\left(\frac{x-X_i}{h}\right)$$

$$R(f,\hat{f}_n)\approx\frac{1}{4}(h\sigma_K)^4\int(f''(x))^2dx+\frac{1}{nh}\int K^2(x)\,dx$$

$$h^\ast=\frac{c_1^{-2/5}c_2^{1/5}c_3^{-1/5}}{n^{1/5}}\qquad c_1=\sigma_K^2,\quad c_2=\int K^2(x)\,dx,\quad c_3=\int(f''(x))^2dx$$

$$R^\ast(f,\hat{f}_n)=\frac{c_4}{n^{4/5}}\qquad c_4=\frac{5}{4}(\sigma_K^2)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}\left(\int(f'')^2dx\right)^{1/5}$$

Epanechnikov kernel

$$K(x)=\begin{cases}\frac{3}{4\sqrt{5}}\left(1-x^2/5\right)&|x|<\sqrt{5}\\0&\text{otherwise}\end{cases}$$

Cross-validation estimate of $\mathbb{E}[J(h)]$

$$\hat{J}_{CV}(h)=\int\hat{f}_n^2(x)\,dx-\frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i)\approx\frac{1}{hn^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K^\ast\left(\frac{X_i-X_j}{h}\right)+\frac{2}{nh}K(0)$$

$$K^\ast(x)=K^{(2)}(x)-2K(x)\qquad K^{(2)}(x)=\int K(x-y)K(y)\,dy$$

### 19.2 Non-parametric Regression

Estimate $r(x)=\mathbb{E}[Y\mid X=x]$. Consider pairs of points $(x_1,Y_1),\ldots,(x_n,Y_n)$ related by

$$Y_i=r(x_i)+\epsilon_i\qquad\mathbb{E}[\epsilon_i]=0,\quad\mathbb{V}[\epsilon_i]=\sigma^2$$

$k$-nearest-neighbor estimator

$$\hat{r}(x)=\frac{1}{k}\sum_{i:x_i\in N_k(x)}Y_i\qquad N_k(x)=\{k\text{ values of }x_1,\ldots,x_n\text{ closest to }x\}$$
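A minimal Gaussian-KDE sketch of the estimator above (assuming NumPy; the bandwidth is fixed by hand rather than chosen by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=300)        # sample
h = 0.3                         # bandwidth (would normally be chosen by CV)

def kde(x, data=X, h=h):
    # f_hat(x) = (1/n) * sum_i (1/h) K((x - X_i)/h), with a Gaussian kernel K
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

grid = np.linspace(-4, 4, 9)
print(kde(grid))
```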
Nadaraya-Watson kernel estimator

$$\hat{r}(x)=\sum_{i=1}^{n}w_i(x)Y_i\qquad w_i(x)=\frac{K\left(\frac{x-x_i}{h}\right)}{\sum_{j=1}^{n}K\left(\frac{x-x_j}{h}\right)}\in[0,1]$$

$$R(\hat{r}_n,r)\approx\frac{h^4}{4}\left(\int x^2K(x)\,dx\right)^2\int\left(r''(x)+2r'(x)\frac{f'(x)}{f(x)}\right)^2dx+\int\frac{\sigma^2\int K^2(u)\,du}{nhf(x)}\,dx$$

$$h^\ast\approx\frac{c_1}{n^{1/5}}\qquad R^\ast(\hat{r}_n,r)\approx\frac{c_2}{n^{4/5}}$$

Cross-validation estimate of $\mathbb{E}[J(h)]$

$$\hat{J}_{CV}(h)=\sum_{i=1}^{n}(Y_i-\hat{r}_{(-i)}(x_i))^2=\sum_{i=1}^{n}\left(\frac{Y_i-\hat{r}(x_i)}{1-\frac{K(0)}{\sum_{j=1}^{n}K\left(\frac{x_i-x_j}{h}\right)}}\right)^2$$

### 19.3 Smoothing Using Orthogonal Functions

Approximation

$$r(x)=\sum_{j=1}^{\infty}\beta_j\phi_j(x)\approx\sum_{j=1}^{J}\beta_j\phi_j(x)$$

Multivariate regression

$$Y=\Phi\beta+\eta\qquad\eta_i=\epsilon_i\qquad\Phi=\begin{pmatrix}\phi_0(x_1)&\cdots&\phi_J(x_1)\\\vdots&\ddots&\vdots\\\phi_0(x_n)&\cdots&\phi_J(x_n)\end{pmatrix}$$

Least squares estimator

$$\hat{\beta}=(\Phi^T\Phi)^{-1}\Phi^TY\approx\frac{1}{n}\Phi^TY\quad\text{(for equally spaced observations only)}$$

Cross-validation estimate of $\mathbb{E}[J(h)]$

$$\hat{R}_{CV}(J)=\sum_{i=1}^{n}\left(Y_i-\sum_{j=1}^{J}\phi_j(x_i)\hat{\beta}_{j,(-i)}\right)^2$$

## 20 Stochastic Processes

Stochastic process

$$\{X_t:t\in T\}\qquad T=\begin{cases}\{0,\pm1,\ldots\}=\mathbb{Z}&\text{discrete}\\ [0,\infty)&\text{continuous}\end{cases}$$

- Notations: $X_t$, $X(t)$
- State space $\mathcal{X}$
- Index set $T$

### 20.1 Markov Chains

Markov chain

$$\mathbb{P}[X_n=x\mid X_0,\ldots,X_{n-1}]=\mathbb{P}[X_n=x\mid X_{n-1}]\qquad\forall n\in T,\ x\in\mathcal{X}$$

Transition probabilities

$$p_{ij}\equiv\mathbb{P}[X_{n+1}=j\mid X_n=i]\qquad p_{ij}(n)\equiv\mathbb{P}[X_{m+n}=j\mid X_m=i]\quad(n\text{-step})$$

Transition matrix $P$ ($n$-step: $P_n$)

- $(i,j)$ element is $p_{ij}$
- $p_{ij}\ge0$
- $\sum_jp_{ij}=1$

Chapman-Kolmogorov

$$p_{ij}(m+n)=\sum_kp_{ik}(m)p_{kj}(n)\qquad P_{m+n}=P_mP_n\qquad P_n=P\times\cdots\times P=P^n$$

Marginal probability

$$\mu_n=(\mu_n(1),\ldots,\mu_n(N))\quad\text{where }\mu_n(i)=\mathbb{P}[X_n=i]\qquad\mu_0\ \text{initial distribution}\qquad\mu_n=\mu_0P^n$$
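A small sketch of the marginal-distribution recursion $\mu_n=\mu_0P^n$ (assuming NumPy; the two-state chain is an arbitrary example):

```python
import numpy as np

P = np.array([[0.9, 0.1],      # transition matrix, rows sum to 1
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])     # start in state 0

mu = mu0.copy()
for _ in range(50):            # mu_n = mu_0 P^n via repeated multiplication
    mu = mu @ P
print(mu)                      # approaches the stationary distribution (0.8, 0.2)
```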
### 20.2 Poisson Processes

Poisson process

- $\{X_t:t\in[0,\infty)\}$ = number of events up to and including time $t$
- $X_0=0$
- Independent increments: $\forall t_0<\cdots<t_n:\ X_{t_1}-X_{t_0}\perp\!\!\!\perp\cdots\perp\!\!\!\perp X_{t_n}-X_{t_{n-1}}$
- Intensity function $\lambda(t)$:
  - $\mathbb{P}[X_{t+h}-X_t=1]=\lambda(t)h+o(h)$
  - $\mathbb{P}[X_{t+h}-X_t=2]=o(h)$
- $X_{s+t}-X_s\sim\mathrm{Po}(m(s+t)-m(s))$ where $m(t)=\int_0^t\lambda(s)\,ds$

Homogeneous Poisson process

$$\lambda(t)\equiv\lambda\implies X_t\sim\mathrm{Po}(\lambda t)\qquad\lambda>0$$

Waiting times ($W_t$ := time at which $X_t$ occurs)

$$W_t\sim\mathrm{Gamma}\left(t,\frac{1}{\lambda}\right)$$

Interarrival times

$$S_t=W_{t+1}-W_t\qquad S_t\sim\mathrm{Exp}\left(\frac{1}{\lambda}\right)$$

## 21 Time Series

Mean function

$$\mu_{x_t}=\mathbb{E}[x_t]=\int_{-\infty}^{\infty}xf_t(x)\,dx$$

Autocovariance function

$$\gamma_x(s,t)=\mathbb{E}[(x_s-\mu_s)(x_t-\mu_t)]=\mathbb{E}[x_sx_t]-\mu_s\mu_t\qquad\gamma_x(t,t)=\mathbb{E}\left[(x_t-\mu_t)^2\right]=\mathbb{V}[x_t]$$

Autocorrelation function (ACF)

$$\rho(s,t)=\frac{\mathrm{Cov}[x_s,x_t]}{\sqrt{\mathbb{V}[x_s]\,\mathbb{V}[x_t]}}=\frac{\gamma(s,t)}{\sqrt{\gamma(s,s)\gamma(t,t)}}$$

Cross-covariance function (CCV) and cross-correlation function (CCF)

$$\gamma_{xy}(s,t)=\mathbb{E}[(x_s-\mu_{x_s})(y_t-\mu_{y_t})]\qquad\rho_{xy}(s,t)=\frac{\gamma_{xy}(s,t)}{\sqrt{\gamma_x(s,s)\gamma_y(t,t)}}$$

Backshift operator: $B^k(x_t)=x_{t-k}$. Difference operator: $\nabla^d=(1-B)^d$.

White noise

- $w_t\sim wn(0,\sigma_w^2)$
- Gaussian: $w_t\stackrel{iid}{\sim}\mathcal{N}(0,\sigma_w^2)$
- $\mathbb{E}[w_t]=0$ and $\mathbb{V}[w_t]=\sigma_w^2$ for all $t\in T$
- $\gamma_w(s,t)=0$ for $s\ne t$, $s,t\in T$

Random walk

- Drift $\delta$
- $x_t=\delta t+\sum_{j=1}^{t}w_j$
- $\mathbb{E}[x_t]=\delta t$

Symmetric moving average

$$m_t=\sum_{j=-k}^{k}a_jx_{t-j}\qquad\text{where }a_j=a_{-j}\ge0\text{ and }\sum_{j=-k}^{k}a_j=1$$
### 21.1 Stationary Time Series

Strictly stationary

$$\mathbb{P}[x_{t_1}\le c_1,\ldots,x_{t_k}\le c_k]=\mathbb{P}[x_{t_1+h}\le c_1,\ldots,x_{t_k+h}\le c_k]\qquad\forall k\in\mathbb{N},\ t_k,c_k,h\in\mathbb{Z}$$

Weakly stationary

- $\mathbb{E}[x_t^2]<\infty\quad\forall t\in\mathbb{Z}$
- $\mathbb{E}[x_t]=m\quad\forall t\in\mathbb{Z}$
- $\gamma_x(s,t)=\gamma_x(s+r,t+r)\quad\forall r,s,t\in\mathbb{Z}$

Autocovariance function

- $\gamma(h)=\mathbb{E}[(x_{t+h}-\mu)(x_t-\mu)]$
- $\gamma(0)=\mathbb{E}[(x_t-\mu)^2]$
- $\gamma(0)\ge0$ and $\gamma(0)\ge|\gamma(h)|$
- $\gamma(h)=\gamma(-h)\quad\forall h\in\mathbb{Z}$

Autocorrelation function (ACF)

$$\rho_x(h)=\frac{\mathrm{Cov}[x_{t+h},x_t]}{\sqrt{\mathbb{V}[x_{t+h}]\,\mathbb{V}[x_t]}}=\frac{\gamma(t+h,t)}{\sqrt{\gamma(t+h,t+h)\gamma(t,t)}}=\frac{\gamma(h)}{\gamma(0)}$$

Jointly stationary time series

$$\gamma_{xy}(h)=\mathbb{E}[(x_{t+h}-\mu_x)(y_t-\mu_y)]\qquad\rho_{xy}(h)=\frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}$$

Linear process

$$x_t=\mu+\sum_{j=-\infty}^{\infty}\psi_jw_{t-j}\quad\text{where }\sum_{j=-\infty}^{\infty}|\psi_j|<\infty\qquad\gamma(h)=\sigma_w^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$$
### 21.2 Estimation of Correlation

Sample mean and sample variance of the mean

$$\bar{x}=\frac{1}{n}\sum_{t=1}^{n}x_t\qquad\mathbb{V}[\bar{x}]=\frac{1}{n}\sum_{h=-n}^{n}\left(1-\frac{|h|}{n}\right)\gamma_x(h)$$

Sample autocovariance and autocorrelation functions

$$\hat{\gamma}(h)=\frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h}-\bar{x})(x_t-\bar{x})\qquad\hat{\rho}(h)=\frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$$

Sample cross-covariance and cross-correlation functions

$$\hat{\gamma}_{xy}(h)=\frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h}-\bar{x})(y_t-\bar{y})\qquad\hat{\rho}_{xy}(h)=\frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\hat{\gamma}_y(0)}}$$

Properties

- $\sigma_{\hat{\rho}_x(h)}=\frac{1}{\sqrt{n}}$ if $x_t$ is white noise
- $\sigma_{\hat{\rho}_{xy}(h)}=\frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise

### 21.3 Non-Stationary Time Series

Classical decomposition model

$$x_t=\mu_t+s_t+w_t$$

- $\mu_t$ = trend
- $s_t$ = seasonal component
- $w_t$ = random noise term

#### 21.3.1 Detrending

Least squares

1. Choose a trend model, e.g., $\mu_t=\beta_0+\beta_1t+\beta_2t^2$
2. Minimize rss to obtain the trend estimate $\hat{\mu}_t=\hat{\beta}_0+\hat{\beta}_1t+\hat{\beta}_2t^2$
3. Residuals give the noise $w_t$

Moving average

- The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j=\frac{1}{2k+1}$:
  $$v_t=\frac{1}{2k+1}\sum_{i=-k}^{k}x_{t-i}$$
- If $\frac{1}{2k+1}\sum_{i=-k}^{k}w_{t-j}\approx0$, a linear trend function $\mu_t=\beta_0+\beta_1t$ passes without distortion

Differencing

- $\mu_t=\beta_0+\beta_1t\implies\nabla x_t=\beta_1$

### 21.4 ARIMA Models

Autoregressive polynomial and operator

$$\phi(z)=1-\phi_1z-\cdots-\phi_pz^p\quad(z\in\mathbb{C}\wedge\phi_p\ne0)\qquad\phi(B)=1-\phi_1B-\cdots-\phi_pB^p$$

Autoregressive model of order $p$, AR($p$)

$$x_t=\phi_1x_{t-1}+\cdots+\phi_px_{t-p}+w_t\iff\phi(B)x_t=w_t$$

AR(1)

- $x_t=\phi^k(x_{t-k})+\sum_{j=0}^{k-1}\phi^j(w_{t-j})\ \stackrel{k\to\infty,\ |\phi|<1}{=}\ \underbrace{\sum_{j=0}^{\infty}\phi^j(w_{t-j})}_{\text{linear process}}$
- $\mathbb{E}[x_t]=\sum_{j=0}^{\infty}\phi^j\,\mathbb{E}[w_{t-j}]=0$
- $\gamma(h)=\mathrm{Cov}[x_{t+h},x_t]=\frac{\sigma_w^2\phi^h}{1-\phi^2}$
- $\rho(h)=\frac{\gamma(h)}{\gamma(0)}=\phi^h$ and $\rho(h)=\phi\,\rho(h-1)$ for $h=1,2,\ldots$

Moving average polynomial and operator

$$\theta(z)=1+\theta_1z+\cdots+\theta_qz^q\quad(z\in\mathbb{C}\wedge\theta_q\ne0)\qquad\theta(B)=1+\theta_1B+\cdots+\theta_qB^q$$

MA($q$) (moving average model of order $q$)

$$x_t=w_t+\theta_1w_{t-1}+\cdots+\theta_qw_{t-q}\iff x_t=\theta(B)w_t$$

$$\mathbb{E}[x_t]=\sum_{j=0}^{q}\theta_j\,\mathbb{E}[w_{t-j}]=0\qquad\gamma(h)=\mathrm{Cov}[x_{t+h},x_t]=\begin{cases}\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h}&0\le h\le q\\0&h>q\end{cases}$$

MA(1)

$$x_t=w_t+\theta w_{t-1}\qquad\gamma(h)=\begin{cases}(1+\theta^2)\sigma_w^2&h=0\\\theta\sigma_w^2&h=1\\0&h>1\end{cases}\qquad\rho(h)=\begin{cases}\frac{\theta}{1+\theta^2}&h=1\\0&h>1\end{cases}$$

Partial autocorrelation function (PACF)

- $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1},x_{h-2},\ldots,x_1\}$
- $\phi_{hh}=\mathrm{corr}(x_h-x_h^{h-1},\,x_0-x_0^{h-1})$ for $h\ge2$
- E.g., $\phi_{11}=\mathrm{corr}(x_1,x_0)=\rho(1)$

ARMA($p,q$)

$$x_t=\phi_1x_{t-1}+\cdots+\phi_px_{t-p}+w_t+\theta_1w_{t-1}+\cdots+\theta_qw_{t-q}\qquad\phi(B)x_t=\theta(B)w_t$$

ARIMA($p,d,q$)

$$\nabla^dx_t=(1-B)^dx_t\ \text{is ARMA}(p,q)\qquad\phi(B)(1-B)^dx_t=\theta(B)w_t$$

Exponentially Weighted Moving Average (EWMA)

$$x_t=x_{t-1}+w_t-\lambda w_{t-1}\qquad x_t=\sum_{j=1}^{\infty}(1-\lambda)\lambda^{j-1}x_{t-j}+w_t\quad\text{when }|\lambda|<1\qquad\tilde{x}_{n+1}=(1-\lambda)x_n+\lambda\tilde{x}_n$$
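A quick simulation tying §21.2 and the AR(1) results together (a sketch with NumPy; the parameters are arbitrary): the sample ACF should be close to $\rho(h)=\phi^h$.

```python
import numpy as np

rng = np.random.default_rng(8)
phi, n = 0.7, 5000

x = np.zeros(n)
for t in range(1, n):                 # AR(1): x_t = phi * x_{t-1} + w_t
    x[t] = phi * x[t - 1] + rng.normal()

xbar = x.mean()
def rho_hat(h):
    return np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / np.sum((x - xbar) ** 2)

print([round(rho_hat(h), 3) for h in range(4)])   # ≈ 1, 0.7, 0.49, 0.343
```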
Seasonal ARIMA

- Denoted by ARIMA$(p,d,q)\times(P,D,Q)_s$
- $\Phi_P(B^s)\phi(B)\nabla_s^D\nabla^dx_t=\delta+\Theta_Q(B^s)\theta(B)w_t$

#### 21.4.1 Causality and Invertibility

ARMA($p,q$) is causal (future-independent) $\iff\exists\{\psi_j\}:\sum_{j=0}^{\infty}|\psi_j|<\infty$ such that

$$x_t=\sum_{j=0}^{\infty}\psi_jw_{t-j}=\psi(B)w_t$$

ARMA($p,q$) is invertible $\iff\exists\{\pi_j\}:\sum_{j=0}^{\infty}|\pi_j|<\infty$ such that

$$\pi(B)x_t=\sum_{j=0}^{\infty}\pi_jX_{t-j}=w_t$$

Properties

- ARMA($p,q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle: $\psi(z)=\sum_{j=0}^{\infty}\psi_jz^j=\frac{\theta(z)}{\phi(z)}$, $|z|\le1$
- ARMA($p,q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle: $\pi(z)=\sum_{j=0}^{\infty}\pi_jz^j=\frac{\phi(z)}{\theta(z)}$, $|z|\le1$

Behavior of the ACF and PACF for causal and invertible ARMA models

|      | AR($p$) | MA($q$) | ARMA($p,q$) |
|------|---------|---------|-------------|
| ACF  | tails off | cuts off after lag $q$ | tails off |
| PACF | cuts off after lag $p$ | tails off | tails off |

### 21.5 Spectral Analysis

Periodic process

$$x_t=A\cos(2\pi\omega t+\phi)=U_1\cos(2\pi\omega t)+U_2\sin(2\pi\omega t)$$

- Frequency index $\omega$ (cycles per unit time), period $1/\omega$
- Amplitude $A$, phase $\phi$
- $U_1=A\cos\phi$ and $U_2=A\sin\phi$ are often normally distributed rv's

Periodic mixture

$$x_t=\sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_kt)+U_{k2}\sin(2\pi\omega_kt)\right)$$

- $U_{k1},U_{k2}$, for $k=1,\ldots,q$, are independent zero-mean rv's with variances $\sigma_k^2$
- $\gamma(h)=\sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_kh)$
- $\gamma(0)=\mathbb{E}[x_t^2]=\sum_{k=1}^{q}\sigma_k^2$

Spectral representation of a periodic process

$$\gamma(h)=\sigma^2\cos(2\pi\omega_0h)=\frac{\sigma^2}{2}e^{-2\pi i\omega_0h}+\frac{\sigma^2}{2}e^{2\pi i\omega_0h}=\int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)$$

Spectral distribution function

$$F(\omega)=\begin{cases}0&\omega<-\omega_0\\\sigma^2/2&-\omega_0\le\omega<\omega_0\\\sigma^2&\omega\ge\omega_0\end{cases}\qquad F(-\infty)=F(-1/2)=0,\quad F(\infty)=F(1/2)=\gamma(0)$$

Spectral density

$$f(\omega)=\sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}\qquad-\frac{1}{2}\le\omega\le\frac{1}{2}$$

- Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)|<\infty\implies\gamma(h)=\int_{-1/2}^{1/2}e^{2\pi i\omega h}f(\omega)\,d\omega$, $h=0,\pm1,\ldots$
- $f(\omega)\ge0$, $f(\omega)=f(-\omega)$, $f(\omega)=f(1-\omega)$
- $\gamma(0)=\mathbb{V}[x_t]=\int_{-1/2}^{1/2}f(\omega)\,d\omega$
- White noise: $f_w(\omega)=\sigma_w^2$
- ARMA($p,q$), $\phi(B)x_t=\theta(B)w_t$:
  $$f_x(\omega)=\sigma_w^2\frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2}\qquad\text{where }\phi(z)=1-\sum_{k=1}^{p}\phi_kz^k\text{ and }\theta(z)=1+\sum_{k=1}^{q}\theta_kz^k$$

Discrete Fourier Transform (DFT) and Fourier/fundamental frequencies

$$d(\omega_j)=n^{-1/2}\sum_{t=1}^{n}x_te^{-2\pi i\omega_jt}\qquad\omega_j=j/n$$

Inverse DFT

$$x_t=n^{-1/2}\sum_{j=0}^{n-1}d(\omega_j)e^{2\pi i\omega_jt}$$

Periodogram and scaled periodogram

$$I(j/n)=|d(j/n)|^2\qquad P(j/n)=\frac{4}{n}I(j/n)=\left(\frac{2}{n}\sum_{t=1}^{n}x_t\cos(2\pi tj/n)\right)^2+\left(\frac{2}{n}\sum_{t=1}^{n}x_t\sin(2\pi tj/n)\right)^2$$
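A numerical sketch of the periodogram via the FFT (assuming NumPy): `np.fft.fft` computes $\sum_tx_te^{-2\pi ijt/n}$, so dividing by $\sqrt{n}$ gives $d(\omega_j)$ and $I(j/n)$ is its squared magnitude.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 256
t = np.arange(n)
x = 2 * np.cos(2 * np.pi * 0.1 * t) + rng.normal(size=n)   # signal at omega = 0.1

d = np.fft.fft(x) / np.sqrt(n)   # d(omega_j) = n^{-1/2} sum_t x_t e^{-2 pi i j t / n}
I = np.abs(d) ** 2               # periodogram I(j/n)
j_peak = np.argmax(I[1:n // 2]) + 1
print(j_peak / n)                # ≈ 0.1, the dominant frequency
```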
k=1
n
X
n(n + 1)
2
(2k − 1) = n2
k=1
n
X
•
k=1
n
X
ck =
22.1
k=0
n X
k
= 2n
k=0
• Vandermonde’s Identity:
r X
m
n
m+n
=
k
r−k
r
k=0
c 6= 1
Math
• Binomial
Theorem:
n X
n n−k k
a
b = (a + b)n
k
k=0
Gamma Function
Z
Infinite
∞
ts−1 e−t dt
0
Z ∞
• Upper incomplete: Γ(s, x) =
ts−1 e−t dt
Z xx
• Lower incomplete: γ(s, x) =
ts−1 e−t dt
• Ordinary: Γ(s) =
0
•
•
•
•
•
cn+1 − 1
c−1
n X
n
r+k
r+n+1
=
k
n
k=0
n
X k
n+1
•
=
m
m+1
•
k2 =
k=0
22
•
n(n + 1)(2n + 1)
6
k=1
2
n
X
n(n + 1)
•
k3 =
2
•
P (j/n) =
Binomial
n
X
Γ(α + 1) = αΓ(α)
α&gt;1
Γ(n) = (n − 1)!
n∈N
Γ(0) = Γ(−1) = ∞
√
Γ(1/2) = π
Γ(−1/2) = −2Γ(1/2)
•
∞
X
k=0
∞
X
pk =
1
,
1−p
∞
X
p
|p| &lt; 1
1−p
k=1
!
∞
X
d
1
1
k
=
p
=
dp 1 − p
(1 − p)2
pk =
d
dp
k=0
k=0
∞ X
r+k−1 k
•
x = (1 − x)−r r ∈ N+
k
k=0
∞ X
α k
•
p = (1 + p)α |p| &lt; 1 , α ∈ C
k
•
kpk−1 =
|p| &lt; 1
k=0
29
### 22.4 Combinatorics

Sampling ($k$ out of $n$)

|           | without replacement | with replacement |
|-----------|---------------------|------------------|
| ordered   | $n^{\underline{k}}=\prod_{i=0}^{k-1}(n-i)=\frac{n!}{(n-k)!}$ | $n^k$ |
| unordered | $\binom{n}{k}=\frac{n^{\underline{k}}}{k!}=\frac{n!}{k!(n-k)!}$ | $\binom{n-1+r}{r}=\binom{n-1+r}{n-1}$ |

Stirling numbers, 2nd kind

$$\left\{{n\atop k}\right\}=k\left\{{n-1\atop k}\right\}+\left\{{n-1\atop k-1}\right\}\quad(1\le k\le n)\qquad\left\{{n\atop 0}\right\}=\begin{cases}1&n=0\\0&\text{else}\end{cases}$$

Partitions

$$P_{n+k,k}=\sum_{i=1}^{n}P_{n,i}\qquad k>n:P_{n,k}=0\qquad n\ge1:P_{n,0}=0,\ P_{0,0}=1$$

Balls and urns ($|B|=n$, $|U|=m$, $f:B\to U$; $D$ = distinguishable, $\neg D$ = indistinguishable)

|                       | $f$ arbitrary | $f$ injective | $f$ surjective | $f$ bijective |
|-----------------------|---------------|---------------|----------------|---------------|
| $B:D$, $U:D$          | $m^n$ | $m^{\underline{n}}$ if $m\ge n$, else $0$ | $m!\left\{{n\atop m}\right\}$ | $n!$ if $m=n$, else $0$ |
| $B:\neg D$, $U:D$     | $\binom{m+n-1}{n}$ | $\binom{m}{n}$ | $\binom{n-1}{m-1}$ | $1$ if $m=n$, else $0$ |
| $B:D$, $U:\neg D$     | $\sum_{k=1}^{m}\left\{{n\atop k}\right\}$ | $1$ if $m\ge n$, else $0$ | $\left\{{n\atop m}\right\}$ | $1$ if $m=n$, else $0$ |
| $B:\neg D$, $U:\neg D$ | $\sum_{k=1}^{m}P_{n,k}$ | $1$ if $m\ge n$, else $0$ | $P_{n,m}$ | $1$ if $m=n$, else $0$ |

## References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. *Introduction to Probability Theory*. Brooks Cole, 1972.

[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. *The American Statistician*, 62(1):45–53, 2008.

[3] R. H. Shumway and D. S. Stoffer. *Time Series Analysis and Its Applications: With R Examples*. Springer, 2006.

[4] A. Steger. *Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra*. Springer, 2001.

[5] A. Steger. *Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik*. Springer, 2002.

[6] L. Wasserman. *All of Statistics: A Concise Course in Statistical Inference*. Springer, 2003.

*[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]*