Probability and Statistics Cookbook
Version 0.2.6, 19 December 2017
http://statistics.zone/
Copyright © Matthias Vallentin

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter delta method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and on in-class material from courses of the statistics department at the University of California, Berkeley, and is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.

1 Distribution Overview

1.1 Discrete Distributions

Uniform, Unif{a, . . . , b}
  F_X(x) = 0 (x < a);  (⌊x⌋ − a + 1)/(b − a + 1) (a ≤ x ≤ b);  1 (x > b)
  f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^{s}))

Bernoulli, Bern(p)
  F_X(x) = 0 (x < 0);  1 − p (0 ≤ x < 1);  1 (x ≥ 1)
  f_X(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}
  E[X] = p    V[X] = p(1 − p)
  M_X(s) = 1 − p + pe^{s}

Binomial, Bin(n, p)
  F_X(x) = I_{1−p}(n − x, x + 1)
  f_X(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)
  M_X(s) = (1 − p + pe^{s})^n

Multinomial, Mult(n, p)
  f_X(x) = n!/(x_1! · · · x_k!) · p_1^{x_1} · · · p_k^{x_k},  where Σ_{i=1}^{k} x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i),  Cov[X_i, X_j] = −np_i p_j (i ≠ j)
  M_X(s) = (Σ_{i=1}^{k} p_i e^{s_i})^n

Hypergeometric, Hyp(N, m, n)
  F_X(x) ≈ Φ((x − np)/√(np(1 − p)))  with p = m/N
  f_X(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial, NBin(r, p)  (number of failures before the r-th success)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²
  M_X(s) = (p / (1 − (1 − p)e^{s}))^r

Geometric, Geo(p)  (number of trials up to and including the first success)
  F_X(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f_X(x) = p(1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p    V[X] = (1 − p)/p²
  M_X(s) = pe^{s} / (1 − (1 − p)e^{s})

Poisson, Po(λ)
  F_X(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i/i!
  f_X(x) = λ^x e^{−λ}/x!,  x ∈ {0, 1, 2, . . .}
  E[X] = λ    V[X] = λ
  M_X(s) = e^{λ(e^{s} − 1)}

Notation: C(n, k) denotes the binomial coefficient n!/(k!(n − k)!); γ(s, x) and Γ(x) refer to the Gamma functions (see §22.1), and B(x, y) and I_x(a, b) refer to the Beta functions (see §22.2).

[Figure: PMF and CDF of the discrete uniform, binomial, geometric, and Poisson distributions for several parameter choices.]
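The closed forms above are easy to sanity-check numerically. The following is a minimal sketch (assuming SciPy is installed; the scipy.stats objects binom, geom, and poisson and the parameter values are illustrative, not part of the cookbook itself) that evaluates a few PMFs/CDFs and compares the library's moments with the tabulated E[X] and V[X]:

    # Minimal sketch: evaluate PMFs/CDFs and moments of a few discrete
    # distributions with scipy.stats and compare against the closed forms above.
    # Assumes SciPy is installed; parameter values are arbitrary examples.
    from scipy import stats

    n, p, lam = 10, 0.3, 4.0

    binom = stats.binom(n, p)        # Bin(n, p)
    geom = stats.geom(p)             # Geo(p), support {1, 2, ...}
    pois = stats.poisson(lam)        # Po(lambda)

    # PMF and CDF at a point
    print(binom.pmf(3), binom.cdf(3))

    # Binomial: E[X] = np, V[X] = np(1 - p)
    assert abs(binom.mean() - n * p) < 1e-9
    assert abs(binom.var() - n * p * (1 - p)) < 1e-9

    # Geometric: E[X] = 1/p, V[X] = (1 - p)/p^2
    assert abs(geom.mean() - 1 / p) < 1e-9
    assert abs(geom.var() - (1 - p) / p ** 2) < 1e-9

    # Poisson: E[X] = V[X] = lambda
    assert abs(pois.mean() - lam) < 1e-9 and abs(pois.var() - lam) < 1e-9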
1.2 Continuous Distributions

Uniform, Unif(a, b)
  F_X(x) = 0 (x < a);  (x − a)/(b − a) (a < x < b);  1 (x > b)
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12
  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal, N(µ, σ²)
  F_X(x) = Φ((x − µ)/σ),  where Φ(z) = ∫_{−∞}^{z} φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²
  M_X(s) = exp(µs + σ²s²/2)

Log-Normal, ln N(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − µ)/√(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal, MVN(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
  E[X] = µ    V[X] = Σ
  M_X(s) = exp(µᵀs + (1/2) sᵀΣs)

Student's t, Student(ν)
  F_X(t) = I_x(ν/2, ν/2)  with x = (t + √(t² + ν)) / (2√(t² + ν))
  f_X(x) = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) · (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1)    V[X] = ν/(ν − 2) (ν > 2),  ∞ (1 < ν ≤ 2)

Chi-square, χ²_k
  F_X(x) = γ(k/2, x/2) / Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}
  E[X] = k    V[X] = 2k
  M_X(s) = (1 − 2s)^{−k/2}  (s < 1/2)

F, F(d_1, d_2)
  F_X(x) = I_{d_1 x/(d_1 x + d_2)}(d_1/2, d_2/2)
  f_X(x) = √[(d_1 x)^{d_1} d_2^{d_2} / (d_1 x + d_2)^{d_1 + d_2}] / (x B(d_1/2, d_2/2))
  E[X] = d_2/(d_2 − 2) (d_2 > 2)    V[X] = 2d_2²(d_1 + d_2 − 2) / (d_1(d_2 − 2)²(d_2 − 4)) (d_2 > 4)

Exponential*, Exp(β)
  F_X(x) = 1 − e^{−βx}
  f_X(x) = β e^{−βx}
  E[X] = 1/β    V[X] = 1/β²
  M_X(s) = (1 − s/β)^{−1}  (s < β)

Gamma*, Gamma(α, β)
  F_X(x) = γ(α, βx) / Γ(α)
  f_X(x) = (β^α / Γ(α)) x^{α − 1} e^{−βx}
  E[X] = α/β    V[X] = α/β²
  M_X(s) = (1 − s/β)^{−α}  (s < β)

Inverse Gamma, InvGamma(α, β)
  F_X(x) = Γ(α, β/x) / Γ(α)
  f_X(x) = (β^α / Γ(α)) x^{−α − 1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1)    V[X] = β² / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = 2(−βs)^{α/2} K_α(√(−4βs)) / Γ(α)   (K_α: modified Bessel function of the second kind)

Dirichlet, Dir(α)
  f_X(x) = Γ(α_0) / (Π_{i=1}^{k} Γ(α_i)) · Π_{i=1}^{k} x_i^{α_i − 1},  where α_0 = Σ_{i=1}^{k} α_i
  E[X_i] = α_i/α_0    V[X_i] = E[X_i](1 − E[X_i]) / (α_0 + 1)

Beta, Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β) / (Γ(α)Γ(β))) x^{α − 1} (1 − x)^{β − 1}
  E[X] = α/(α + β)    V[X] = αβ / ((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^{∞} (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull, Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k − 1} e^{−(x/λ)^k}
  E[X] = λ Γ(1 + 1/k)    V[X] = λ² Γ(1 + 2/k) − µ²  (µ = E[X])
  M_X(s) = Σ_{n=0}^{∞} s^n λ^n Γ(1 + n/k) / n!

Pareto, Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α,  x ≥ x_m
  f_X(x) = α x_m^α x^{−(α + 1)},  x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1)    V[X] = x_m² α / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)

* We use the rate parameterization, where β = 1/λ; some textbooks use β as a scale parameter instead [6].

[Figure: PDF and CDF of the continuous distributions above for several parameter choices.]

2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A

Law of Total Probability
  P[B] = Σ_{i=1}^{n} P[B | A_i] P[A_i]  where Ω = ⨆_{i=1}^{n} A_i (disjoint union)

Bayes' Theorem
  P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^{n} P[B | A_j] P[A_j]  where Ω = ⨆_{j=1}^{n} A_j

Inclusion-Exclusion Principle
  P[⋃_{i=1}^{n} A_i] = Σ_{r=1}^{n} (−1)^{r−1} Σ_{i_1 < ··· < i_r ≤ n} P[A_{i_1} ∩ ··· ∩ A_{i_r}]

• Probability distribution P, a function P : A → [0, 1] satisfying:
  1.
P [A] ≥ 0 ∀A 2. P [Ω] = 1 "∞ # ∞ G X 3. P Ai = P [Ai ] 3 Ai = n X X (−1)r−1 r \ Ai j i≤i1 <···<ir ≤n j=1 r=1 Random Variables Random Variable (RV) i=1 X:Ω→R • Probability space (Ω, A, P) Probability Mass Function (PMF) Properties • • • • • • • • Ω= i=1 1. ∅ ∈ A S∞ 2. A1 , A2 , . . . , ∈ A =⇒ i=1 Ai ∈ A 3. A ∈ A =⇒ ¬A ∈ A i=1 n X P [∅] = 0 B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B) P [¬A] = 1 − P [A] P [B] = P [A ∩ B] + P [¬A ∩ B] P [Ω] = 1 P [∅] = 0 S T T S ¬( n An ) = n ¬An ¬( n An ) = n ¬An DeMorgan S T P [ n An ] = 1 − P [ n ¬An ] P [A ∪ B] = P [A] + P [B] − P [A ∩ B] =⇒ P [A ∪ B] ≤ P [A] + P [B] • P [A ∪ B] = P [A ∩ ¬B] + P [¬A ∩ B] + P [A ∩ B] • P [A ∩ ¬B] = P [A] − P [A ∩ B] fX (x) = P [X = x] = P [{ω ∈ Ω : X(ω) = x}] Probability Density Function (PDF) Z P [a ≤ X ≤ b] = b f (x) dx a Cumulative Distribution Function (CDF) FX : R → [0, 1] FX (x) = P [X ≤ x] 1. Nondecreasing: x1 < x2 =⇒ F (x1 ) ≤ F (x2 ) 2. Normalized: limx→−∞ = 0 and limx→∞ = 1 3. Right-Continuous: limy↓x F (y) = F (x) Continuity of Probabilities • A1 ⊂ A2 ⊂ . . . =⇒ limn→∞ P [An ] = P [A] • A1 ⊃ A2 ⊃ . . . =⇒ limn→∞ P [An ] = P [A] S∞ where A = i=1 Ai T∞ where A = i=1 Ai Z P [a ≤ Y ≤ b | X = x] = fY |X (y | x)dy a≤b a Independence ⊥ ⊥ fY |X (y | x) = A⊥ ⊥ B ⇐⇒ P [A ∩ B] = P [A] P [B] f (x, y) fX (x) Independence Conditional Probability P [A | B] = b P [A ∩ B] P [B] P [B] > 0 1. P [X ≤ x, Y ≤ y] = P [X ≤ x] P [Y ≤ y] 2. fX,Y (x, y) = fX (x)fY (y) 8 3.1 Z Transformations • E [XY ] = xyfX,Y (x, y) dFX (x) dFY (y) X,Y Transformation function • E [ϕ(Y )] 6= ϕ(E [X]) (cf. Jensen inequality) • P [X ≥ Y ] = 1 =⇒ E [X] ≥ E [Y ] • P [X = Y ] = 1 =⇒ E [X] = E [Y ] ∞ X • E [X] = P [X ≥ x] X discrete Z = ϕ(X) Discrete X fZ (z) = P [ϕ(X) = z] = P [{x : ϕ(x) = z}] = P X ∈ ϕ−1 (z) = fX (x) x∈ϕ−1 (z) x=1 Sample mean Continuous n Z FZ (z) = P [ϕ(X) ≤ z] = X̄n = with Az = {x : ϕ(x) ≤ z} f (x) dx Az Special case if ϕ strictly monotone dx 1 d −1 ϕ (z) = fX (x) = fX (x) dz dz |J| fZ (z) = fX (ϕ−1 (z)) Conditional expectation Z • E [Y | X = x] = yf (y | x) dy • E [X] = E [E [X | Y ]] Z ∞ • Eϕ(X,Y ) | X=x [=] ϕ(x, y)fY |X (y | x) dx −∞ Z ∞ • E [ϕ(Y, Z) | X = x] = ϕ(y, z)f(Y,Z)|X (y, z | x) dy dz The Rule of the Lazy Statistician Z E [Z] = 1X Xi n i=1 ϕ(x) dFX (x) −∞ Z E [IA (x)] = • E [Y + Z | X] = E [Y | X] + E [Z | X] • E [ϕ(X)Y | X] = ϕ(X)E [Y | X] • E [Y | X] = c =⇒ Cov [X, Y ] = 0 Z dFX (x) = P [X ∈ A] IA (x) dFX (x) = A Convolution Z • Z := X + Y ∞ fX,Y (x, z − x) dx fZ (z) = −∞ Z X,Y ≥0 Z z fX,Y (x, z − x) dx = 0 ∞ • Z := |X − Y | • Z := 4 X Y fZ (z) = 2 fX,Y (x, z + x) dx 0 Z ∞ Z ∞ ⊥ ⊥ fZ (z) = |y|fX,Y (yz, y) dy = |y|fX (yz)fY (y) dy −∞ −∞ 5 Variance Definition and properties 2 2 • V [X] = σX = E (X − E [X])2 = E X 2 − E [X] " n # n X X X • V Xi = V [Xi ] + Cov [Xi , Xj ] i=1 Expectation • V Definition and properties Z • E [X] = µX = x dFX (x) = i=1 # Xi = i=1 X xfX (x) x Z xfX (x) dx • P [X = c] = 1 =⇒ E [X] = c • E [cX] = c E [X] • E [X + Y ] = E [X] + E [Y ] " n X X discrete n X i6=j V [Xi ] if Xi ⊥ ⊥ Xj i=1 Standard deviation sd[X] = p V [X] = σX Covariance X continuous • • • • Cov [X, Y ] = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ] Cov [X, a] = 0 Cov [X, X] = V [X] Cov [X, Y ] = Cov [Y, X] 9 7 • Cov [aX, bY ] = abCov [X, Y ] • Cov [X + a, Y + b] = Cov [X, Y ] n m n X m X X X • Cov Xi , Yj = Cov [Xi , Yj ] i=1 j=1 Distribution Relationships Binomial • Xi ∼ Bern (p) =⇒ i=1 j=1 n X Xi ∼ Bin (n, p) i=1 Correlation • X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p) • limn→∞ Bin (n, p) = Po (np) (n large, p 
small) • limn→∞ Bin (n, p) = N (np, np(1 − p)) (n large, p far from 0 and 1) Cov [X, Y ] ρ [X, Y ] = p V [X] V [Y ] Independence Negative Binomial X⊥ ⊥ Y =⇒ ρ [X, Y ] = 0 ⇐⇒ Cov [X, Y ] = 0 ⇐⇒ E [XY ] = E [X] E [Y ] Sample variance n S2 = 1 X (Xi − X̄n )2 n − 1 i=1 Conditional variance 2 • V [Y | X] = E (Y − E [Y | X])2 | X = E Y 2 | X − E [Y | X] • V [Y ] = E [V [Y | X]] + V [E [Y | X]] 6 • • • • X ∼ NBin (1, p) = Geo (p) Pr X ∼ NBin (r, p) = i=1 Geo (p) P P Xi ∼ NBin (ri , p) =⇒ Xi ∼ NBin ( ri , p) X ∼ NBin (r, p) . Y ∼ Bin (s + r, p) =⇒ P [X ≤ s] = P [Y ≥ r] Poisson • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ n X Xi ∼ Po i=1 • Xi ∼ Po (λi ) ∧ Xi ⊥⊥ Xj =⇒ Xi Inequalities P [ϕ(X) ≥ t] ≤ P [|X − E [X]| ≥ t] ≤ Chernoff P [X ≥ (1 + δ)µ] ≤ • Xi ∼ Exp (β) ∧ Xi ⊥ ⊥ Xj =⇒ E [ϕ(X)] t Chebyshev δ n X n X λi Xj ∼ Bin Xj , Pn j=1 j=1 λj n X Xi ∼ Gamma (n, β) i=1 • Memoryless property: P [X > x + y | X > y] = P [X > x] V [X] t2 e (1 + δ)1+δ λi Exponential 2 E [XY ] ≤ E X 2 E Y 2 Markov ! i=1 j=1 Cauchy-Schwarz n X Normal • X ∼ N µ, σ 2 δ > −1 Hoeffding X1 , . . . , Xn independent ∧ P [Xi ∈ [ai , bi ]] = 1 ∧ 1 ≤ i ≤ n 2 P X̄ − E X̄ ≥ t ≤ e−2nt t > 0 2n2 t2 P |X̄ − E X̄ | ≥ t ≤ 2 exp − Pn t>0 2 i=1 (bi − ai ) Jensen E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex =⇒ X−µ σ ∼ N (0, 1) 2 • X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2 P P P • Xi ∼ N µi , σi2 ∧ Xi ⊥ ⊥ Xj =⇒ Xi ∼ N µi , i σi2 i i • P [a < X ≤ b] = Φ b−µ − Φ a−µ σ σ • Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x) −1 • Upper quantile of N (0, 1): zα = Φ (1 − α) Gamma • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1) Pα • Gamma (α, β) ∼ i=1 Exp (β) P P • Xi ∼ Gamma (αi , β) ∧ Xi ⊥ ⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β) 10 • Γ(α) = λα Z ∞ 9.2 xα−1 e−λx dx 0 Bivariate Normal Let X ∼ N µx , σx2 and Y ∼ N µy , σy2 . Beta 1 Γ(α + β) α−1 • xα−1 (1 − x)β−1 = x (1 − x)β−1 B(α, β) Γ(α)Γ(β) B(α + k, β) α+k−1 = E X k−1 • E Xk = B(α, β) α+β+k−1 • Beta (1, 1) ∼ Unif (0, 1) f (x, y) = " z= x − µx σx 2πσx σy 2 + 1 p 1 − ρ2 y − µy σy exp − 2 − 2ρ z 2(1 − ρ2 ) x − µx σx y − µy σy # Conditional mean and variance 8 Probability and Moment Generating Functions • GX (t) = E tX σX (Y − E [Y ]) σY p V [X | Y ] = σX 1 − ρ2 E [X | Y ] = E [X] + ρ |t| < 1 "∞ # ∞ X (Xt)i X E Xi • MX (t) = GX (et ) = E eXt = E = · ti i! i! i=0 i=0 9.3 • P [X = 0] = GX (0) • P [X = 1] = G0X (0) Multivariate Normal (Precision matrix Σ−1 ) V [X1 ] · · · Cov [X1 , Xk ] .. .. .. Σ= . . . Covariance matrix Σ (i) GX (0) i! • E [X] = G0X (1− ) (k) • E X k = MX (0) X! (k) • E = GX (1− ) (X − k)! • P [X = i] = Cov [Xk , X1 ] · · · V [Xk ] If X ∼ N (µ, Σ), 2 • V [X] = G00X (1− ) + G0X (1− ) − (G0X (1− )) −n/2 fX (x) = (2π) d • GX (t) = GY (t) =⇒ X = Y −1/2 |Σ| 1 exp − (x − µ)T Σ−1 (x − µ) 2 Properties 9 9.1 Multivariate Distributions • • • • Standard Bivariate Normal Let X, Y ∼ N (0, 1) ∧ X ⊥ ⊥ Z where Y = ρX + p 1 − ρ2 Z Joint density x2 + y 2 − 2ρxy p f (x, y) = exp − 2(1 − ρ2 ) 2π 1 − ρ2 1 10 and (X | Y = y) ∼ N ρy, 1 − ρ2 Convergence Let {X1 , X2 , . . .} be a sequence of rv’s and let X be another rv. Let Fn denote the cdf of Xn and let F denote the cdf of X. Types of Convergence Conditionals (Y | X = x) ∼ N ρx, 1 − ρ2 Z ∼ N (0, 1) ∧ X = µ + Σ1/2 Z =⇒ X ∼ N (µ, Σ) X ∼ N (µ, Σ) =⇒ Σ−1/2 (X − µ) ∼ N (0, 1) X ∼ N (µ, Σ) =⇒ AX ∼ N Aµ, AΣAT X ∼ N (µ, Σ) ∧ kak = k =⇒ aT X ∼ N aT µ, aT Σa D 1. In distribution (weakly, in law): Xn → X Independence X⊥ ⊥ Y ⇐⇒ ρ = 0 lim Fn (t) = F (t) n→∞ ∀t where F continuous 11 P 2. In probability: Xn → X X̄n − µ Zn := q = V X̄n (∀ε > 0) lim P [|Xn − X| > ε] = 0 n→∞ as 3. 
Almost surely (strongly): Xn → X h i h i P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 n→∞ CLT notations Zn ≈ N (0, 1) σ2 X̄n ≈ N µ, n σ2 X̄n − µ ≈ N 0, n √ 2 n(X̄n − µ) ≈ N 0, σ √ n(X̄n − µ) ≈ N (0, 1) σ Relationships qm P D • Xn → X =⇒ Xn → X =⇒ Xn → X as P • Xn → X =⇒ Xn → X D P • Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X P ∧ Yn ∧ Yn ∧ Yn =⇒ P P → Y =⇒ Xn + Yn → X + Y qm qm → Y =⇒ Xn + Yn → X + Y P P → Y =⇒ Xn Yn → XY P ϕ(Xn ) → ϕ(X) D Continuity correction P X̄n ≤ x ≈ Φ D • Xn → X =⇒ ϕ(Xn ) → ϕ(X) qm • Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0 qm • X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X̄n → µ Slutzky’s Theorem D x + 12 − µ √ σ/ n x − 12 − µ √ σ/ n Delta method P D Law of Large Numbers (LLN) Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ. Weak (WLLN) P X̄n → µ n→∞ Strong (SLLN) as X̄n → µ 10.2 P X̄n ≥ x ≈ 1 − Φ • Xn → X and Yn → c =⇒ Xn + Yn → X + c D P D • Xn → X and Yn → c =⇒ Xn Yn → cX D D D • In general: Xn → X and Yn → Y =⇒ 6 Xn + Yn → X + Y 10.1 z∈R n→∞ n→∞ →X qm →X P →X P →X where Z ∼ N (0, 1) lim P [Zn ≤ z] = Φ(z) qm Xn Xn Xn Xn n(X̄n − µ) D →Z σ n→∞ 4. In quadratic mean (L2 ): Xn → X lim E (Xn − X)2 = 0 • • • • √ n→∞ Central Limit Theorem (CLT) Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 . Yn ≈ N 11 µ, σ2 n =⇒ ϕ(Yn ) ≈ N ϕ(µ), (ϕ0 (µ)) 2 σ2 n Statistical Inference iid Let X1 , · · · , Xn ∼ F if not otherwise noted. 11.1 Point Estimation • Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn ) h i • bias(θbn ) = E θbn − θ P • Consistency: θbn → θ • Sampling distribution: F (θbn ) r h i b • Standard error: se(θn ) = V θbn 12 h i h i • Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn • limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent θbn − θ D • Asymptotic normality: → N (0, 1) se • Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consistent estimator σ bn . 11.2 • • • • Statistical Functionals Statistical functional: T (F ) Plug-in estimator of θ = (F ): θbn = T (Fbn ) R Linear functional: T (F ) = ϕ(x) dFX (x) Plug-in estimator for linear functional: Z T (Fbn ) = Normal-Based Confidence Interval b 2 . Let zα/2 = Φ−1 (1 − (α/2)), i.e., P Z > zα/2 = α/2 Suppose θbn ≈ N θ, se and P −zα/2 < Z < zα/2 = 1 − α where Z ∼ N (0, 1). Then b Cn = θbn ± zα/2 se 11.3 11.4 Empirical distribution Empirical Distribution Function (ECDF) Pn I(Xi ≤ x) b Fn (x) = i=1 n ( 1 Xi ≤ x I(Xi ≤ x) = 0 Xi > x Properties (for any fixed x) h i • E Fbn = F (x) h i F (x)(1 − F (x)) • V Fbn = n F (x)(1 − F (x)) D • mse = →0 n P • Fbn → F (x) Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X1 , . . . , Xn ∼ F ) 2 b P sup F (x) − Fn (x) > ε = 2e−2nε n 1X ϕ(Xi ) ϕ(x) dFbn (x) = n i=1 b 2 =⇒ T (Fbn ) ± zα/2 se b • Often: T (Fbn ) ≈ N T (F ), se • pth quantile: F −1 (p) = inf{x : F (x) ≥ p} • µ b = X̄n n 1 X (Xi − X̄n )2 • σ b2 = n − 1 i=1 Pn 1 b)3 i=1 (Xi − µ n • κ b= b3 Pσ n i=1 (Xi − X̄n )(Yi − Ȳn ) qP • ρb = qP n n 2 2 (X − X̄ ) i n i=1 i=1 (Yi − Ȳn ) 12 Parametric Inference Let F = f (x; θ) : θ ∈ Θ be a parametric model with parameter space Θ ⊂ Rk and parameter θ = (θ1 , . . . , θk ). 12.1 Method of Moments j th moment αj (θ) = E X j = xj dFX (x) j th sample moment n x Nonparametric 1 − α confidence band for F L(x) = max{Fbn − n , 0} Z α bj = 1X j X n i=1 i Method of Moments estimator (MoM) U (x) = min{Fbn + n , 1} s 1 2 log = 2n α α1 (θ) = α b1 P [L(x) ≤ F (x) ≤ U (x) ∀x] ≥ 1 − α αk (θ) = α bk α2 (θ) = α b2 .. .. .=. 
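For example, for a Gamma(α, β) model (rate parameterization) the first two moment equations are α/β = α̂_1 and α(α + 1)/β² = α̂_2, which solve to α̂ = α̂_1²/(α̂_2 − α̂_1²) and β̂ = α̂_1/(α̂_2 − α̂_1²). A minimal numerical sketch of this, assuming NumPy is available; the simulated data, true parameters, and seed are illustrative only:

    # Minimal sketch of the method of moments for Gamma(alpha, beta), rate form.
    # Matching E[X] = alpha/beta and E[X^2] = alpha(alpha+1)/beta^2 to the first
    # two sample moments yields closed-form estimators. Assumes NumPy only.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha_true, beta_true = 3.0, 2.0
    x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=10_000)

    a1 = x.mean()                  # first sample moment
    a2 = np.mean(x ** 2)           # second sample moment
    var_hat = a2 - a1 ** 2         # plug-in variance

    alpha_mom = a1 ** 2 / var_hat  # from alpha/beta = a1 and alpha/beta^2 = var
    beta_mom = a1 / var_hat

    print(alpha_mom, beta_mom)     # close to (3, 2) for large n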
13 Properties of the MoM estimator • θbn exists with probability tending to 1 P • Consistency: θbn → θ • Asymptotic normality: √ D n(θb − θ) → N (0, Σ) where Σ = gE Y Y T g T , Y = (X, X 2 , . . . , X k )T , ∂ −1 g = (g1 , . . . , gk ) and gj = ∂θ αj (θ) 12.2 Maximum Likelihood • Equivariance: θbn is the mle =⇒ ϕ(θbn ) is the mle of ϕ(θ) • Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θen is any other estimator, the asymptotic relative efficiency is: p 1. se ≈ 1/In (θ) (θbn − θ) D → N (0, 1) se q b ≈ 1/In (θbn ) 2. se (θbn − θ) D → N (0, 1) b se • Asymptotic optimality Likelihood: Ln : Θ → [0, ∞) Ln (θ) = n Y h i V θbn are(θen , θbn ) = h i ≤ 1 V θen f (Xi ; θ) i=1 • Approximately the Bayes estimator Log-likelihood `n (θ) = log Ln (θ) = n X log f (Xi ; θ) i=1 12.2.1 Delta Method b where ϕ is differentiable and ϕ0 (θ) 6= 0: If τ = ϕ(θ) Maximum likelihood estimator (mle) (b τn − τ ) D → N (0, 1) b τ) se(b Ln (θbn ) = sup Ln (θ) θ b is the mle of τ and where τb = ϕ(θ) Score function s(X; θ) = ∂ log f (X; θ) ∂θ b se( b = ϕ0 (θ) b θbn ) se Fisher information I(θ) = Vθ [s(X; θ)] In (θ) = nI(θ) Fisher information (exponential family) ∂ I(θ) = Eθ − s(X; θ) ∂θ Observed Fisher information Inobs (θ) = − Properties of the mle P • Consistency: θbn → θ 12.3 Multiparameter Models Let θ = (θ1 , . . . , θk ) and θb = (θb1 , . . . , θbk ) be the mle. Hjj = ∂ 2 `n ∂θ2 Hjk = Fisher information matrix n ∂2 X log f (Xi ; θ) ∂θ2 i=1 ∂ 2 `n ∂θj ∂θk ··· .. . Eθ [Hk1 ] · · · Eθ [H11 ] .. In (θ) = − . Eθ [H1k ] .. . Eθ [Hkk ] Under appropriate regularity conditions (θb − θ) ≈ N (0, Jn ) 14 with Jn (θ) = In−1 . Further, if θbj is the j th component of θ, then (θbj − θj ) D → N (0, 1) bj se h i b 2j = Jn (j, j) and Cov θbj , θbk = Jn (j, k) where se • • • • • Critical value c Test statistic T Rejection region R = {x : T (x) > c} Power function β(θ) = P [X ∈ R] Power of a test: 1 − P [Type II error] = 1 − β = inf β(θ) θ∈Θ1 • Test size: α = P [Type I error] = sup β(θ) θ∈Θ0 12.3.1 Multiparameter delta method Let τ = ϕ(θ1 , . . . , θk ) and let the gradient of ϕ be H0 true H1 true ∂ϕ ∂θ1 . ∇ϕ = .. ∂ϕ θ=θb • p-value = supθ∈Θ0 Pθ [T (X) ≥ T (x)] = inf α : T (x) ∈ Rα Pθ [T (X ? ) ≥ T (X)] • p-value = supθ∈Θ0 = inf α : T (X) ∈ Rα | {z } b Then, 6= 0 and τb = ϕ(θ). 1−Fθ (T (X)) (b τ − τ) D → N (0, 1) b τ) se(b where r b τ) = se(b b and ∇ϕ b = ∇ϕ and Jbn = Jn (θ) 12.4 b ∇ϕ T . θ=θb since T (X ? )∼Fθ evidence very strong evidence against H0 strong evidence against H0 weak evidence against H0 little or no evidence against H0 Wald test Parametric Bootstrap Hypothesis Testing H0 : θ ∈ Θ0 versus • Two-sided test θb − θ0 • Reject H0 when |W | > zα/2 where W = b se • P |W | > zα/2 → α • p-value = Pθ0 [|W | > |w|] ≈ P [|Z| > |w|] = 2Φ(−|w|) H1 : θ ∈ Θ1 Definitions • • • • • • p-value < 0.01 0.01 − 0.05 0.05 − 0.1 > 0.1 b Jbn ∇ϕ Sample from f (x; θbn ) instead of from Fbn , where θbn could be the mle or method of moments estimator. 13 Type II Error (β) Reject H0 Type √ I Error (α) (power) p-value ∂θk Suppose ∇ϕ Retain H0 √ Null hypothesis H0 Alternative hypothesis H1 Simple hypothesis θ = θ0 Composite hypothesis θ > θ0 or θ < θ0 Two-sided test: H0 : θ = θ0 versus H1 : θ 6= θ0 One-sided test: H0 : θ ≤ θ0 versus H1 : θ > θ0 Likelihood ratio test supθ∈Θ Ln (θ) Ln (θbn ) = supθ∈Θ0 Ln (θ) Ln (θbn,0 ) k X D iid • λ(X) = 2 log T (X) → χ2r−q where Zi2 ∼ χ2k and Z1 , . . . 
, Zk ∼ N (0, 1) • T (X) = i=1 • p-value = Pθ0 [λ(X) > λ(x)] ≈ P χ2r−q > λ(x) 15 Multinomial LRT X1 Xk • mle: pbn = ,..., n n k Y pbj Xj Ln (b pn ) • T (X) = = Ln (p0 ) p0j j=1 k X pbj D Xj log → χ2k−1 • λ(X) = 2 p 0j j=1 Natural form fX (x | η) = h(x) exp {η · T(x) − A(η)} = h(x)g(η) exp {η · T(x)} = h(x)g(η) exp η T T(x) 15 • The approximate size α LRT rejects H0 when λ(X) ≥ χ2k−1,α Bayesian Inference Bayes’ Theorem Pearson Chi-square Test • T = k X j=1 f (θ | x) = (Xj − E [Xj ])2 where E [Xj ] = np0j under H0 E [Xj ] f (x | θ)f (θ) f (x | θ)f (θ) =R ∝ Ln (θ)f (θ) f (xn ) f (x | θ)f (θ) dθ Definitions D • T → χ2k−1 • p-value = P χ2k−1 > T (x) • • • • D 2 • Faster → Xk−1 than LRT, hence preferable for small n Independence testing • I rows, J columns, X multinomial sample of size n = I ∗ J X • mles unconstrained: pbij = nij X • mles under H0 : pb0ij = pbi· pb·j = Xni· n·j PI PJ nX • LRT: λ = 2 i=1 j=1 Xij log Xi· Xij·j PI PJ (X −E[X ])2 • PearsonChiSq: T = i=1 j=1 ijE[Xij ]ij D • LRT and Pearson → χ2k ν, where ν = (I − 1)(J − 1) X n = (X1 , . . . , Xn ) xn = (x1 , . . . , xn ) Prior density f (θ) Likelihood f (xn | θ): joint density of the data n Y In particular, X n iid =⇒ f (xn | θ) = f (xi | θ) = Ln (θ) i=1 • Posterior density f (θ | xn ) R • Normalizing constant cn = f (xn ) = f (x | θ)f (θ) dθ • Kernel: part of a density that dependsRon θ R θL (θ)f (θ)dθ • Posterior mean θ̄n = θf (θ | xn ) dθ = R Lnn(θ)f (θ) dθ 15.1 Credible Intervals Posterior interval 14 Exponential Family P [θ ∈ (a, b) | xn ] = Scalar parameter b Z f (θ | xn ) dθ = 1 − α a Equal-tail credible interval Z a Z f (θ | xn ) dθ = fX (x | θ) = h(x) exp {η(θ)T (x) − A(θ)} = h(x)g(θ) exp {η(θ)T (x)} Vector parameter −∞ ( fX (x | θ) = h(x) exp s X ) ηi (θ)Ti (x) − A(θ) i=1 = h(x) exp {η(θ) · T (x) − A(θ)} = h(x)g(θ) exp {η(θ) · T (x)} ∞ f (θ | xn ) dθ = α/2 b Highest posterior density (HPD) region Rn 1. P [θ ∈ Rn ] = 1 − α 2. Rn = {θ : f (θ | xn ) > k} for some k Rn is unimodal =⇒ Rn is an interval 16 15.2 Function of parameters 15.3.1 Continuous likelihood (subscript c denotes constant) Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ }. 
Posterior CDF for τ n Conjugate Priors n Z n H(r | x ) = P [ϕ(θ) ≤ τ | x ] = f (θ | x ) dθ Likelihood Conjugate prior Unif (0, θ) Pareto(xm , k) Exp (λ) Gamma (α, β) Posterior hyperparameters max x(n) , xm , k + n n X xi α + n, β + A i=1 2 N µ, σc Posterior density 2 N µ0 , σ0 h(τ | xn ) = H 0 (τ | xn ) N µc , σ 2 Bayesian delta method b se b b ϕ0 (θ) τ | X n ≈ N ϕ(θ), 15.3 Priors N µ, σ 2 MVN(µ, Σc ) Scaled Inverse Chisquare(ν, σ02 ) Normalscaled Inverse Gamma(λ, ν, α, β) MVN(µ0 , Σ0 ) Choice • Subjective Bayesianism: prior should incorporate as much detail as possible the research’s a priori knowledge—via prior elicitation • Objective Bayesianism: prior should incorporate as little detail as possible (non-informative prior) • Robust Bayesianism: consider various priors and determine sensitivity of our inferences to changes in the prior MVN(µc , Σ) InverseWishart(κ, Ψ) Pareto(xmc , k) Gamma (α, β) Pareto(xm , kc ) Pareto(x0 , k0 ) Gamma (αc , β) Gamma (α0 , β0 ) Pn µ0 1 n i=1 xi + / + 2 , σ2 σ2 σ02 σc 0 c−1 1 n + 2 σ02 σc Pn νσ02 + i=1 (xi − µ)2 ν + n, ν+n νλ + nx̄ n , ν + n, α + , ν+n 2 n 2 1X γ(x̄ − λ) β+ (xi − x̄)2 + 2 i=1 2(n + γ) −1 Σ−1 0 + nΣc −1 −1 Σ−1 x̄ , 0 µ0 + nΣ −1 −1 Σ−1 0 + nΣc n X (xi − µc )(xi − µc )T n + κ, Ψ + i=1 n X xi x mc i=1 x0 , k0 − kn where k0 > kn n X α0 + nαc , β0 + xi α + n, β + log i=1 Types • Flat: f (θ) ∝ constant R∞ • Proper: −∞ f (θ) dθ = 1 R∞ • Improper: −∞ f (θ) dθ = ∞ • Jeffrey’s Prior (transformation-invariant): f (θ) ∝ p I(θ) f (θ) ∝ p det(I(θ)) • Conjugate: f (θ) and f (θ | xn ) belong to the same parametric family 17 Discrete likelihood Likelihood Conjugate prior Bern (p) Posterior hyperparameters Beta (α, β) Bin (p) Bayes factor α+ Beta (α, β) α+ n X i=1 n X xi , β + n − xi , β + i=1 NBin (p) Beta (α, β) Po (λ) α + rn, β + Gamma (α, β) α+ n X n X n X n X xi i=1 Ni − i=1 n X xi i=1 p∗ = xi 1 log10 BF10 BF10 evidence 0 − 0.5 0.5 − 1 1−2 >2 1 − 1.5 1.5 − 10 10 − 100 > 100 Weak Moderate Strong Decisive p 1−p BF10 p + 1−p BF10 where p = P [H1 ] and p∗ = P [H1 | xn ] i=1 xi , β + n 16 Sampling Methods i=1 Multinomial(p) Dir (α) α+ n X i=1 Geo (p) Beta (α, β) 16.1 x(i) α + n, β + n X Setup xi i=1 15.4 Inverse Transform Sampling • U ∼ Unif (0, 1) • X∼F • F −1 (u) = inf{x | F (x) ≥ u} Algorithm Bayesian Testing 1. Generate u ∼ Unif (0, 1) 2. Compute x = F −1 (u) If H0 : θ ∈ Θ0 : Z Prior probability P [H0 ] = f (θ) dθ 16.2 f (θ | xn ) dθ Let Tn = g(X1 , . . . , Xn ) be a statistic. Θ0 Posterior probability P [H0 | xn ] = Z Θ0 1. Estimate VF [Tn ] with VFbn [Tn ]. 2. Approximate VFbn [Tn ] using simulation: Let H0 . . .Hk−1 be k hypotheses. Suppose θ ∼ f (θ | Hk ), f (xn | Hk )P [Hk ] P [Hk | xn ] = PK , n k=1 f (x | Hk )P [Hk ] Marginal likelihood Z n f (xn | θ, Hi )f (θ | Hi ) dθ f (x | Hi ) = The Bootstrap ∗ ∗ (a) Repeat the following B times to get Tn,1 , . . . , Tn,B , an iid sample from b the sampling distribution implied by Fn i. Sample uniformly X1∗ , . . . , Xn∗ ∼ Fbn . ii. Compute Tn∗ = g(X1∗ , . . . , Xn∗ ). (b) Then !2 B B 1 X 1 X ∗ ∗ b vboot = VFbn = Tn,b − T B B r=1 n,r b=1 Θ 16.2.1 Posterior odds (of Hi relative to Hj ) P [Hi | x ] P [Hj | xn ] Normal-based interval n n = Bootstrap Confidence Intervals f (x | Hi ) f (xn | Hj ) | {z } Bayes Factor BFij × P [Hi ] P [Hj ] | {z } prior odds b boot Tn ± zα/2 se Pivotal interval 1. Location parameter θ = T (F ) 18 2. Pivot Rn = θbn − θ 3. Let H(r) = P [Rn ≤ r] be the cdf of Rn ∗ ∗ 4. Let Rn,b = θbn,b − θbn . Approximate H using bootstrap: 2. Generate u ∼ Unif (0, 1) Ln (θcand ) 3. 
Accept θcand if u ≤ Ln (θbn ) B 1 X ∗ b H(r) = I(Rn,b ≤ r) B 16.4 Sample from an importance function g rather than target density h. Algorithm to obtain an approximation to E [q(θ) | xn ]: b=1 ∗ ∗ 5. θβ∗ = β sample quantile of (θbn,1 , . . . , θbn,B ) iid ∗ ∗ ), i.e., rβ∗ = θβ∗ − θbn 6. rβ∗ = beta sample quantile of (Rn,1 , . . . , Rn,B 7. Approximate 1 − α confidence interval Cn = â, b̂ where b −1 1 − α = θbn − H 2 α −1 b = θbn − H 2 â = b̂ = ∗ θbn − r1−α/2 = ∗ 2θbn − θ1−α/2 ∗ θbn − rα/2 = ∗ 2θbn − θα/2 ∗ ∗ Cn = θα/2 , θ1−α/2 Rejection Sampling Setup • We can easily sample from g(θ) • We want to sample from h(θ), but it is difficult k(θ) k(θ) dθ • Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ • We know h(θ) up to a proportional constant: h(θ) = R Algorithm 1. Draw θcand ∼ g(θ) 2. Generate u ∼ Unif (0, 1) k(θcand ) 3. Accept θcand if u ≤ M g(θcand ) 4. Repeat until B values of θcand have been accepted Example • • • • 17 Decision Theory • Unknown quantity affecting our decision: θ ∈ Θ • Decision rule: synonymous for an estimator θb • Action a ∈ A: possible value of the decision rule. In the estimation b context, the action is just an estimate of θ, θ(x). • Loss function L: consequences of taking action a when true state is θ or b L : Θ × A → [−k, ∞). discrepancy between θ and θ, Loss functions • Squared error loss: L(θ, a) = (θ − a)2 ( K1 (θ − a) a − θ < 0 • Linear loss: L(θ, a) = K2 (a − θ) a − θ ≥ 0 • Absolute error loss: L(θ, a) = |θ − a| (linear loss with K1 = K2 ) • Lp loss: L(θ, a) = |θ − a|p ( 0 a=θ • Zero-one loss: L(θ, a) = 1 a 6= θ 17.1 Risk Posterior risk We can easily sample from the prior g(θ) = f (θ) Target is the posterior h(θ) ∝ k(θ) = f (xn | θ)f (θ) Envelope condition: f (xn | θ) ≤ f (xn | θbn ) = Ln (θbn ) ≡ M Algorithm 1. Draw θ 1. Sample from the prior θ1 , . . . , θn ∼ f (θ) Ln (θi ) 2. wi = PB ∀i = 1, . . . , B i=1 Ln (θi ) PB 3. E [q(θ) | xn ] ≈ i=1 q(θi )wi Definitions Percentile interval 16.3 Importance Sampling cand ∼ f (θ) Z h i b b L(θ, θ(x))f (θ | x) dθ = Eθ|X L(θ, θ(x)) Z h i b b L(θ, θ(x))f (x | θ) dx = EX|θ L(θ, θ(X)) r(θb | x) = (Frequentist) risk b = R(θ, θ) 19 18 Bayes risk ZZ b = r(f, θ) h i b b L(θ, θ(x))f (x, θ) dx dθ = Eθ,X L(θ, θ(X)) h h ii h i b = Eθ EX|θ L(θ, θ(X) b b r(f, θ) = Eθ R(θ, θ) h h ii h i b = EX Eθ|X L(θ, θ(X) b r(f, θ) = EX r(θb | X) Linear Regression Definitions • Response variable Y • Covariate X (aka predictor variable or feature) 18.1 Simple Linear Regression Model 17.2 Yi = β0 + β1 Xi + i Admissibility E [i | Xi ] = 0, V [i | Xi ] = σ 2 Fitted line • θb0 dominates θb if b0 b ∀θ : R(θ, θ ) ≤ R(θ, θ) b ∃θ : R(θ, θb0 ) < R(θ, θ) • θb is inadmissible if there is at least one other estimator θb0 that dominates it. Otherwise it is called admissible. 
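To make the importance-sampling recipe of §16.4 above concrete: with a Beta prior and a Bernoulli likelihood the posterior mean is available in closed form, so it can serve as a check on the weighted estimate. A minimal sketch, assuming NumPy; the model, prior hyperparameters, and seed are illustrative choices, not taken from the text:

    # Minimal sketch of self-normalized importance sampling (Section 16.4):
    # the prior is the importance function g, weights proportional to the
    # likelihood L_n(theta_i). Model: x_i ~ Bern(theta), theta ~ Beta(2, 2);
    # the exact posterior mean gives a sanity check. Assumes NumPy only.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.binomial(1, 0.7, size=50)          # observed data (illustrative)
    a, b = 2.0, 2.0                            # Beta prior hyperparameters

    theta = rng.beta(a, b, size=100_000)       # 1. sample from the prior
    loglik = (x.sum() * np.log(theta)
              + (len(x) - x.sum()) * np.log1p(-theta))
    w = np.exp(loglik - loglik.max())          # 2. unnormalized weights L_n(theta_i)
    w /= w.sum()                               #    ... normalized
    post_mean = np.sum(theta * w)              # 3. E[theta | x^n] ~ sum q(theta_i) w_i

    exact = (a + x.sum()) / (a + b + len(x))   # Beta(a + s, b + n - s) posterior mean
    print(post_mean, exact)

Subtracting the maximum log-likelihood before exponentiating only rescales the unnormalized weights and keeps them numerically stable.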
rb(x) = βb0 + βb1 x Predicted (fitted) values Ybi = rb(Xi ) Residuals ˆi = Yi − Ybi = Yi − βb0 + βb1 Xi Residual sums of squares (rss) 17.3 Bayes Rule rss(βb0 , βb1 ) = Bayes rule (or Bayes estimator) n X ˆ2i i=1 b = inf e r(f, θ) e • r(f, θ) θ R b b = r(θb | x)f (x) dx • θ(x) = inf r(θb | x) ∀x =⇒ r(f, θ) Least square estimates βbT = (βb0 , βb1 )T : min rss b0 ,β b1 β Theorems • Squared error loss: posterior mean • Absolute error loss: posterior median • Zero-one loss: posterior mode 17.4 Minimax Rules Maximum risk b = sup R(θ, θ) b R̄(θ) R̄(a) = sup R(θ, a) θ θ Minimax rule b = inf R̄(θ) e = inf sup R(θ, θ) e sup R(θ, θ) θ θe θe θ b =c θb = Bayes rule ∧ ∃c : R(θ, θ) Least favorable prior θbf = Bayes rule ∧ R(θ, θbf ) ≤ r(f, θbf ) ∀θ βb0 = Ȳn − βb1 X̄n Pn Pn i=1 Xi Yi − nX̄Y i=1 (Xi − X̄n )(Yi − Ȳn ) b Pn = P β1 = n 2 2 2 i=1 (Xi − X̄n ) i=1 Xi − nX h i β0 E βb | X n = β1 P h i σ 2 n−1 ni=1 Xi2 −X n V βb | X n = 2 1 −X n nsX r Pn 2 σ b i=1 Xi √ b βb0 ) = se( n sX n σ b √ b βb1 ) = se( sX n Pn Pn 2 1 where s2X = n−1 i=1 (Xi − X n )2 and σ b2 = n−2 ˆi (unbiased estimate). i=1 Further properties: P P • Consistency: βb0 → β0 and βb1 → β1 20 18.3 • Asymptotic normality: βb0 − β0 D → N (0, 1) b βb0 ) se( and βb1 − β1 D → N (0, 1) b βb1 ) se( Multiple Regression Y = Xβ + where • Approximate 1 − α confidence intervals for β0 and β1 : b βb1 ) and βb1 ± zα/2 se( b βb0 ) βb0 ± zα/2 se( • Wald test for H0 : β1 = 0 vs. H1 : β1 6= 0: reject H0 if |W | > zα/2 where b βb1 ). W = βb1 /se( Xn1 Pn b Pn 2 2 ˆ rss i=1 (Yi − Y ) R = Pn = 1 − Pn i=1 i 2 = 1 − 2 tss i=1 (Yi − Y ) i=1 (Yi − Y ) Likelihood L2 = β1 β = ... Xnk n Y i=1 n Y i=1 n Y f (Xi , Yi ) = n Y fX (Xi ) × n Y ( fY |X (Yi | Xi ) ∝ σ 2 1 X exp − 2 Yi − (β0 − β1 Xi ) 2σ i βb = (X T X)−1 X T Y h i V βb | X n = σ 2 (X T X)−1 ) Under the assumption of Normality, the least squares estimator is also the mle but the least squares variance estimator is not the mle. βb ≈ N β, σ 2 (X T X)−1 1X 2 ˆ n i=1 i rb(x) = k X βbj xj j=1 Unbiased estimate for σ 2 Prediction Observe X = x∗ of the covariate and want to predict their outcome Y∗ . Yb∗ = βb0 + βb1 x∗ h i h i h i h i V Yb∗ = V βb0 + x2∗ V βb1 + 2x∗ Cov βb0 , βb1 Prediction interval Estimate regression function n σ b2 = N X (Yi − xTi β)2 If the (k × k) matrix X T X is invertible, fX (Xi ) −n n i=1 fY |X (Yi | Xi ) = L1 × L2 i=1 i=1 1 .. =. βk rss = (y − Xβ)T (y − Xβ) = kY − Xβk2 = i=1 18.2 1 L(µ, Σ) = (2πσ 2 )−n/2 exp − 2 rss 2σ 2 L1 = X1k .. . Likelihood R2 L= ··· .. . ··· X11 .. X= . Pn 2 2 2 i=1 (Xi − X∗ ) b P ξn = σ b +1 n i (Xi − X̄)2 j Yb∗ ± zα/2 ξbn σ b2 = n 1 X 2 ˆ n − k i=1 i ˆ = X βb − Y mle µ b = X̄ σ b2 = n−k 2 σ n 1 − α Confidence interval b βbj ) βbj ± zα/2 se( 21 18.4 Model Selection Akaike Information Criterion (AIC) Consider predicting a new observation Y ∗ for covariates X ∗ and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n. Issues • Underfitting: too few covariates yields high bias • Overfitting: too many covariates yields high variance AIC(S) = `n (βbS , σ bS2 ) − k Bayesian Information Criterion (BIC) BIC(S) = `n (βbS , σ bS2 ) − Procedure 1. Assign a score to each model 2. Search through all models to find the one with the highest score k log n 2 Validation and training bV (S) = R Hypothesis testing m X (Ybi∗ (S) − Yi∗ )2 m = |{validation data}|, often i=1 n n or 4 2 H0 : βj = 0 vs. 
H1 : βj 6= 0 ∀j ∈ J Leave-one-out cross-validation Mean squared prediction error (mspe) h i mspe = E (Yb (S) − Y ∗ )2 bCV (S) = R n X (Yi − Yb(i) )2 = i=1 i=1 Prediction risk R(S) = n X mspei = i=1 n X h i E (Ybi (S) − Yi∗ )2 n X Yi − Ybi (S) 1 − Uii (S) !2 U (S) = XS (XST XS )−1 XS (“hat matrix”) i=1 Training error btr (S) = R n X (Ybi (S) − Yi )2 19 Non-parametric Function Estimation i=1 R 19.1 2 btr (S) rss(S) R R2 (S) = 1 − =1− =1− tss tss Pn b 2 i=1 (Yi (S) − Y ) P n 2 i=1 (Yi − Y ) The training error is a downward-biased estimate of the prediction risk. h i btr (S) < R(S) E R n h i h i X b b bias(Rtr (S)) = E Rtr (S) − R(S) = −2 Cov Ybi , Yi i=1 Adjusted R2 R2 (S) = 1 − Density Estimation Estimate f (x), where f (x) = P [X ∈ A] = Integrated square error (ise) L(f, fbn ) = Z R A f (x) dx. Z 2 b f (x) − fn (x) dx = J(h) + f 2 (x) dx Frequentist risk Z h i Z R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx n − 1 rss n − k tss Mallow’s Cp statistic b btr (S) + 2kb R(S) =R σ 2 = lack of fit + complexity penalty h i b(x) = E fbn (x) − f (x) h i v(x) = V fbn (x) 22 19.1.1 Histograms KDE n 1X1 x − Xi fbn (x) = K n i=1 h h Z Z 1 1 4 00 2 b K 2 (x) dx R(f, fn ) ≈ (hσK ) (f (x)) dx + 4 nh Z Z −2/5 −1/5 −1/5 c c2 c3 2 2 c = σ , c = K (x) dx, c = (f 00 (x))2 dx h∗ = 1 1 2 3 K n1/5 Z 4/5 Z 1/5 c4 5 2 2/5 ∗ 2 00 2 b R (f, fn ) = 4/5 c4 = (σK ) K (x) dx (f ) dx 4 n | {z } Definitions • • • • Number of bins m 1 Binwidth h = m Bin Bj has νj observations R Define pbj = νj /n and pj = Bj f (u) du Histogram estimator fbn (x) = m X pbj j=1 h C(K) I(x ∈ Bj ) Epanechnikov Kernel pj E fbn (x) = h h i p (1 − p ) j j V fbn (x) = nh2 Z h2 1 2 R(fbn , f ) ≈ (f 0 (u)) du + 12 nh !1/3 1 6 ∗ h = 1/3 R 2 du n (f 0 (u)) 2/3 Z 1/3 C 3 2 ∗ b 0 R (fn , f ) ≈ 2/3 (f (u)) du C= 4 n h i ( K(x) = √ 3 √ 4 5(1−x2 /5) |x| < 0 otherwise 5 Cross-validation estimate of E [J(h)] Z JbCV (h) = n n n 2Xb 1 X X ∗ Xi − Xj 2 fbn2 (x) dx − K(0) f(−i) (Xi ) ≈ K + n i=1 hn2 i=1 j=1 h nh K ∗ (x) = K (2) (x) − 2K(x) K (2) (x) = Z K(x − y)K(y) dy Cross-validation estimate of E [J(h)] Z JbCV (h) = 19.1.2 n m 2Xb 2 n+1 X 2 fbn2 (x) dx − f(−i) (Xi ) = − pb n i=1 (n − 1)h (n − 1)h j=1 j 19.2 Non-parametric Regression Estimate f (x) where f (x) = E [Y | X = x]. Consider pairs of points (x1 , Y1 ), . . . , (xn , Yn ) related by Yi = r(xi ) + i Kernel Density Estimator (KDE) E [i ] = 0 V [i ] = σ 2 Kernel K • • • • K(x) ≥ 0 R K(x) dx = 1 R xK(x) dx = 0 R 2 2 x K(x) dx ≡ σK >0 k-nearest Neighbor Estimator rb(x) = 1 k X i:xi ∈Nk (x) Yi where Nk (x) = {k values of x1 , . . . , xn closest to x} 23 20 Nadaraya-Watson Kernel Estimator n X rb(x) = wi (x)Yi Stochastic Processes Stochastic Process i=1 wi (x) = x−xi h Pn x−xj K j=1 h K ( {0, ±1, . . . } = Z discrete T = [0, ∞) continuous {Xt : t ∈ T } ∈ [0, 1] Z 4 Z 2 h4 f 0 (x) 2 2 00 0 R(b rn , r) ≈ x K (x) dx r (x) + 2r (x) dx 4 f (x) Z 2R 2 σ K (x) dx + dx nhf (x) c1 h∗ ≈ 1/5 n c2 R∗ (b rn , r) ≈ 4/5 n • Notations Xt , X(t) • State space X • Index set T 20.1 Markov Chains Markov chain P [Xn = x | X0 , . . . 
, Xn−1 ] = P [Xn = x | Xn−1 ] ∀n ∈ T, x ∈ X Cross-validation estimate of E [J(h)] JbCV (h) = n X (Yi − rb(−i) (xi ))2 = i=1 n X i=1 (Yi − rb(xi ))2 1− Pn j=1 19.3 K(0) x−x j K h Smoothing Using Orthogonal Functions r(x) = βj φj (x) ≈ j=1 J X pij ≡ P [Xn+1 = j | Xn = i] pij (n) ≡ P [Xm+n = j | Xm = i] n-step Transition matrix P (n-step: Pn ) Approximation ∞ X Transition probabilities !2 • (i, j) element is pij • pij > 0 P • i pij = 1 βj φj (x) j=1 Multivariate regression where ηi = i Y = Φβ + η φ0 (x1 ) .. and Φ = . ··· .. . φ0 (xn ) · · · Chapman-Kolmogorov φJ (x1 ) .. . pij (m + n) = X pij (m)pkj (n) k φJ (xn ) Pm+n = Pm Pn Least squares estimator βb = (ΦT Φ)−1 ΦT Y 1 ≈ ΦT Y (for equally spaced observations only) n Cross-validation estimate of E [J(h)] 2 n J X X bCV (J) = Yi − R φj (xi )βbj,(−i) i=1 j=1 Pn = P × · · · × P = Pn Marginal probability µn = (µn (1), . . . , µn (N )) where µi (i) = P [Xn = i] µ0 , initial distribution µn = µ0 Pn 24 20.2 Poisson Processes Autocorrelation function (ACF) Poisson process • {Xt : t ∈ [0, ∞)} = number of events up to and including time t • X0 = 0 • Independent increments: γ(s, t) Cov [xs , xt ] =p ρ(s, t) = p V [xs ] V [xt ] γ(s, s)γ(t, t) Cross-covariance function (CCV) ∀t0 < · · · < tn : Xt1 − Xt0 ⊥ ⊥ · · · ⊥⊥ Xtn − Xtn−1 γxy (s, t) = E [(xs − µxs )(yt − µyt )] • Intensity function λ(t) Cross-correlation function (CCF) – P [Xt+h − Xt = 1] = λ(t)h + o(h) – P [Xt+h − Xt = 2] = o(h) • Xs+t − Xs ∼ Po (m(s + t) − m(s)) where m(t) = Rt 0 ρxy (s, t) = p λ(s) ds γxy (s, t) γx (s, s)γy (t, t) Homogeneous Poisson process Backshift operator λ(t) ≡ λ =⇒ Xt ∼ Po (λt) λ>0 B k (xt ) = xt−k Waiting times Wt := time at which Xt occurs 1 Wt ∼ Gamma t, λ Difference operator ∇d = (1 − B)d White noise Interarrival times St = Wt+1 − Wt 1 St ∼ Exp λ 2 • wt ∼ wn(0, σw ) St Wt−1 Wt t • • • • iid 2 Gaussian: wt ∼ N 0, σw E [wt ] = 0 t ∈ T V [wt ] = σ 2 t ∈ T γw (s, t) = 0 s 6= t ∧ s, t ∈ T Random walk 21 Time Series Mean function Z ∞ µxt = E [xt ] = xft (x) dx • Drift δ Pt • xt = δt + j=1 wj • E [xt ] = δt −∞ Autocovariance function Symmetric moving average γx (s, t) = E [(xs − µs )(xt − µt )] = E [xs xt ] − µs µt γx (t, t) = E (xt − µt )2 = V [xt ] mt = k X j=−k aj xt−j where aj = a−j ≥ 0 and k X j=−k aj = 1 25 21.1 Stationary Time Series Sample variance Strictly stationary n 1 X |h| 1− γx (h) n n V [x̄] = h=−n P [xt1 ≤ c1 , . . . , xtk ≤ ck ] = P [xt1 +h ≤ c1 , . . . 
, xtk +h ≤ ck ] Sample autocovariance function ∀k ∈ N, tk , ck , h ∈ Z Weakly stationary ∀t ∈ Z • E x2t < ∞ 2 ∀t ∈ Z • E xt = m • γx (s, t) = γx (s + r, t + r) n−h 1 X (xt+h − x̄)(xt − x̄) n t=1 γ b(h) = ∀r, s, t ∈ Z Sample autocorrelation function Autocovariance function • • • • • γ(h) = E [(xt+h − µ)(xt − µ)] γ(0) = E (xt − µ)2 γ(0) ≥ 0 γ(0) ≥ |γ(h)| γ(h) = γ(−h) ρb(h) = ∀h ∈ Z Sample cross-variance function γ bxy (h) = Autocorrelation function (ACF) Cov [xt+h , xt ] γ(t + h, t) γ(h) ρx (h) = p =p = γ(0) V [xt+h ] V [xt ] γ(t + h, t + h)γ(t, t) n−h 1 X (xt+h − x̄)(yt − y) n t=1 Sample cross-correlation function γ bxy (h) ρbxy (h) = p γ bx (0)b γy (0) Jointly stationary time series γxy (h) = E [(xt+h − µx )(yt − µy )] γxy (h) ρxy (h) = p γx (0)γy (h) ∞ X ψj wt−j where j=−∞ ∞ X Properties 1 • σρbx (h) = √ if xt is white noise n 1 • σρbxy (h) = √ if xt or yt is white noise n Linear process xt = µ + γ b(h) γ b(0) |ψj | < ∞ j=−∞ γ(h) = 2 σw ∞ X 21.3 ψj+h ψj Non-Stationary Time Series Classical decomposition model j=−∞ 21.2 xt = µt + st + wt Estimation of Correlation Sample mean n 1X x̄ = xt n t=1 • µt = trend • st = seasonal component • wt = random noise term 26 21.3.1 Detrending Moving average polynomial Least squares θ(z) = 1 + θ1 z + · · · + θq zq z ∈ C ∧ θq 6= 0 2 1. Choose trend model, e.g., µt = β0 + β1 t + β2 t 2. Minimize rss to obtain trend estimate µ bt = βb0 + βb1 t + βb2 t2 3. Residuals , noise wt Moving average operator θ(B) = 1 + θ1 B + · · · + θp B p Moving average MA (q) (moving average model order q) • The low-pass filter vt is a symmetric moving average mt with aj = vt = 1 2k+1 : xt = wt + θ1 wt−1 + · · · + θq wt−q ⇐⇒ xt = θ(B)wt k X 1 xt−1 2k + 1 E [xt ] = i=−k θj E [wt−j ] = 0 j=0 ( Pk 1 • If 2k+1 i=−k wt−j ≈ 0, a linear trend function µt = β0 + β1 t passes without distortion Differencing γ(h) = Cov [xt+h , xt ] = 2 σw 0 Pq−h j=0 θj θj+h 0≤h≤q h>q MA (1) xt = wt + θwt−1 2 2 (1 + θ )σw h = 0 2 γ(h) = θσw h=1 0 h>1 ( θ h=1 2 ρ(h) = (1+θ ) 0 h>1 • µt = β0 + β1 t =⇒ ∇xt = β1 21.4 q X ARIMA models Autoregressive polynomial φ(z) = 1 − φ1 z − · · · − φp zp z ∈ C ∧ φp 6= 0 Autoregressive operator ARMA (p, q) φ(B) = 1 − φ1 B − · · · − φp B p xt = φ1 xt−1 + · · · + φp xt−p + wt + θ1 wt−1 + · · · + θq wt−q Autoregressive model order p, AR (p) φ(B)xt = θ(B)wt xt = φ1 xt−1 + · · · + φp xt−p + wt ⇐⇒ φ(B)xt = wt AR (1) • xt = φk (xt−k ) + k−1 X φj (wt−j ) k→∞,|φ|<1 j=0 = ∞ X φj (wt−j ) {z } linear process P∞ j j=0 φ (E [wt−j ]) = 0 • γ(h) = Cov [xt+h , xt ] = • ρ(h) = γ(h) γ(0) • xih−1 , regression of xi on {xh−1 , xh−2 , . . . , x1 } • φhh = corr(xh − xh−1 , x0 − xh−1 ) h≥2 0 h • E.g., φ11 = corr(x1 , x0 ) = ρ(1) j=0 | • E [xt ] = Partial autocorrelation function (PACF) 2 h σw φ 1−φ2 = φh • ρ(h) = φρ(h − 1) h = 1, 2, . . . 
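These AR(1) facts are easy to verify by simulation: the sample ACF of a simulated path should be close to ρ(h) = φ^h. A minimal sketch, assuming NumPy and using the 1/n convention for γ̂(h) from §21.2; φ, σ_w, and the seed are arbitrary illustrative values:

    # Minimal sketch: simulate an AR(1) process and check rho(h) = phi^h.
    # The sample ACF uses the 1/n convention of Section 21.2. Assumes NumPy only.
    import numpy as np

    rng = np.random.default_rng(42)
    phi, sigma_w, n = 0.8, 1.0, 5_000

    x = np.zeros(n)
    w = rng.normal(0.0, sigma_w, size=n)
    for t in range(1, n):                       # x_t = phi * x_{t-1} + w_t
        x[t] = phi * x[t - 1] + w[t]

    def sample_acf(x, max_lag):
        xbar = x.mean()
        gamma0 = np.mean((x - xbar) ** 2)       # gamma_hat(0), 1/n convention
        acf = []
        for h in range(max_lag + 1):
            gamma_h = np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)
            acf.append(gamma_h / gamma0)
        return np.array(acf)

    for h, r in enumerate(sample_acf(x, 5)):
        print(h, round(r, 3), round(phi ** h, 3))   # sample vs. theoretical phi^h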
ARIMA (p, d, q) ∇d xt = (1 − B)d xt is ARMA (p, q) φ(B)(1 − B)d xt = θ(B)wt Exponentially Weighted Moving Average (EWMA) xt = xt−1 + wt − λwt−1 27 xt = ∞ X (1 − λ)λj−1 xt−j + wt • • • • when |λ| < 1 j=1 x̃n+1 = (1 − λ)xn + λx̃n Seasonal ARIMA Frequency index ω (cycles per unit time), period 1/ω Amplitude A Phase φ U1 = A cos φ and U2 = A sin φ often normally distributed rv’s Periodic mixture • Denoted by ARIMA (p, d, q) × (P, D, Q)s d s • ΦP (B s )φ(B)∇D s ∇ xt = δ + ΘQ (B )θ(B)wt xt = q X (Uk1 cos(2πωk t) + Uk2 sin(2πωk t)) k=1 21.4.1 Causality and Invertibility ARMA (p, q) is causal (future-independent) ⇐⇒ ∃{ψj } : xt = ∞ X P∞ j=0 ψj < ∞ such that wt−j = ψ(B)wt Spectral representation of a periodic process j=0 ARMA (p, q) is invertible ⇐⇒ ∃{πj } : π(B)xt = • Uk1 , Uk2 , for k = 1, . . . , q, are independent zero-mean rv’s with variances σk2 Pq • γ(h) = k=1 σk2 cos(2πωk h) Pq • γ(0) = E x2t = k=1 σk2 P∞ j=0 ∞ X γ(h) = σ 2 cos(2πω0 h) πj < ∞ such that σ 2 −2πiω0 h σ 2 2πiω0 h e + e 2 2 Z 1/2 = e2πiωh dF (ω) = Xt−j = wt j=0 −1/2 Properties • ARMA (p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle ∞ X θ(z) ψ(z) = ψj z = φ(z) j=0 j Spectral distribution function 0 F (ω) = σ 2 /2 2 σ |z| ≤ 1 ω < −ω0 −ω ≤ ω < ω0 ω ≥ ω0 • ARMA (p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle ∞ X φ(z) π(z) = πj z j = θ(z) j=0 • F (−∞) = F (−1/2) = 0 • F (∞) = F (1/2) = γ(0) |z| ≤ 1 Spectral density Behavior of the ACF and PACF for causal and invertible ARMA models ACF PACF 21.5 AR (p) tails off cuts off after lag p MA (q) cuts off after lag q tails off q Spectral Analysis Periodic process xt = A cos(2πωt + φ) = U1 cos(2πωt) + U2 sin(2πωt) ARMA (p, q) tails off tails off ∞ X γ(h)e−2πiωh − |γ(h)| < ∞ =⇒ γ(h) = R 1/2 f (ω) = h=−∞ • Needs P∞ h=−∞ 1 1 ≤ω≤ 2 2 −1/2 e2πiωh f (ω) dω h = 0, ±1, . . . • f (ω) ≥ 0 • f (ω) = f (−ω) • f (ω) = f (1 − ω) R 1/2 • γ(0) = V [xt ] = −1/2 f (ω) dω 2 • White noise: fw (ω) = σw 28 22.2 • ARMA (p, q) , φ(B)xt = θ(B)wt : Z 1 Γ(x)Γ(y) • Ordinary: B(x, y) = B(y, x) = tx−1 (1 − t)y−1 dt = Γ(x + y) 0 Z x a−1 b−1 • Incomplete: B(x; a, b) = t (1 − t) dt |θ(e−2πiω )|2 fx (ω) = |φ(e−2πiω )|2 Pp Pq where φ(z) = 1 − k=1 φk z k and θ(z) = 1 + k=1 θk z k 2 σw 0 • Regularized incomplete: a+b−1 (a + b − 1)! B(x; a, b) a,b∈N X = xj (1 − x)a+b−1−j Ix (a, b) = B(a, b) j!(a + b − 1 − j)! j=a Discrete Fourier Transform (DFT) d(ωj ) = n−1/2 n X Beta Function xt e−2πiωj t • I0 (a, b) = 0 I1 (a, b) = 1 • Ix (a, b) = 1 − I1−x (b, a) i=1 Fourier/Fundamental frequencies 22.3 ωj = j/n Series Finite Inverse DFT xt = n −1/2 n−1 X d(ωj )e • 2πiωj t j=0 Periodogram • I(j/n) = |d(j/n)|2 Scaled Periodogram 4 I(j/n) n !2 n 2X = xt cos(2πtj/n + n t=1 n 2X xt sin(2πtj/n n t=1 !2 k= k=1 n X n(n + 1) 2 (2k − 1) = n2 k=1 n X • k=1 n X ck = 22.1 k=0 n X k = 2n k=0 • Vandermonde’s Identity: r X m n m+n = k r−k r k=0 c 6= 1 Math • Binomial Theorem: n X n n−k k a b = (a + b)n k k=0 Gamma Function Z Infinite ∞ ts−1 e−t dt 0 Z ∞ • Upper incomplete: Γ(s, x) = ts−1 e−t dt Z xx • Lower incomplete: γ(s, x) = ts−1 e−t dt • Ordinary: Γ(s) = 0 • • • • • cn+1 − 1 c−1 n X n r+k r+n+1 = k n k=0 n X k n+1 • = m m+1 • k2 = k=0 22 • n(n + 1)(2n + 1) 6 k=1 2 n X n(n + 1) • k3 = 2 • P (j/n) = Binomial n X Γ(α + 1) = αΓ(α) α>1 Γ(n) = (n − 1)! n∈N Γ(0) = Γ(−1) = ∞ √ Γ(1/2) = π Γ(−1/2) = −2Γ(1/2) • ∞ X k=0 ∞ X pk = 1 , 1−p ∞ X p |p| < 1 1−p k=1 ! 
• Σ_{k=0}^{∞} k p^{k−1} = d/dp Σ_{k=0}^{∞} p^k = d/dp [1/(1 − p)] = 1/(1 − p)²,  |p| < 1
• Σ_{k=0}^{∞} C(r + k − 1, k) x^k = (1 − x)^{−r},  r ∈ N⁺
• Σ_{k=0}^{∞} C(α, k) p^k = (1 + p)^α,  |p| < 1, α ∈ C

22.4 Combinatorics

Sampling (choosing k out of n)
  ordered, w/o replacement: n^(k) = Π_{i=0}^{k−1} (n − i) = n!/(n − k)!   (falling factorial)
  ordered, w/ replacement: n^k
  unordered, w/o replacement: C(n, k) = n^(k)/k! = n!/(k!(n − k)!)
  unordered, w/ replacement: C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers, 2nd kind (S(n, k): partitions of n elements into k non-empty subsets)
  S(n, k) = k S(n − 1, k) + S(n − 1, k − 1),  1 ≤ k ≤ n
  S(n, 0) = 1 if n = 0, and 0 otherwise

Partitions
  P_{n+k,k} = Σ_{i=1}^{k} P_{n,i};  P_{n,k} = 0 for k > n;  P_{n,0} = 0 for n ≥ 1;  P_{0,0} = 1

Balls and Urns (|B| = n balls, |U| = m urns; count the functions f : B → U; D = distinguishable, ¬D = indistinguishable)
  B: D,  U: D    f arbitrary: m^n;  f injective: m^(n) if m ≥ n, else 0;  f surjective: m! S(n, m);  f bijective: n! if m = n, else 0
  B: ¬D, U: D    f arbitrary: C(m + n − 1, n);  f injective: C(m, n) if m ≥ n, else 0;  f surjective: C(n − 1, m − 1);  f bijective: 1 if m = n, else 0
  B: D,  U: ¬D   f arbitrary: Σ_{k=1}^{m} S(n, k);  f injective: 1 if m ≥ n, else 0;  f surjective: S(n, m);  f bijective: 1 if m = n, else 0
  B: ¬D, U: ¬D   f arbitrary: Σ_{k=1}^{m} P_{n,k};  f injective: 1 if m ≥ n, else 0;  f surjective: P_{n,m};  f bijective: 1 if m = n, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

[Figure: Univariate distribution relationship chart, courtesy Leemis and McQueston [2].]