Probability and Statistics Cookbook
Copyright (c) Matthias Vallentin, 2011 (vallentin@icir.org), 12th December, 2011

This cookbook integrates a variety of topics in probability theory and statistics. It is based on the literature [1, 6, 3] and on in-class material from courses of the statistics department at the University of California, Berkeley, but it is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions¹

Uniform (discrete), Unif{a, ..., b}
  F(x) = 0 for x < a; (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b; 1 for x > b
  f(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2,  V[X] = ((b − a + 1)² − 1)/12
  M_X(s) = e^{as}(1 − e^{s(b−a+1)}) / ((b − a + 1)(1 − e^s))

Bernoulli, Bern(p)
  f(x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}
  E[X] = p,  V[X] = p(1 − p),  M_X(s) = 1 − p + p e^s

Binomial, Bin(n, p)
  F(x) = I_{1−p}(n − x, x + 1)
  f(x) = C(n, x) p^x (1 − p)^{n−x}
  E[X] = np,  V[X] = np(1 − p),  M_X(s) = (1 − p + p e^s)^n

Multinomial, Mult(n, p)
  f(x) = n!/(x₁! ··· x_k!) p₁^{x₁} ··· p_k^{x_k},  with Σ_{i=1}^{k} x_i = n
  E[X_i] = n p_i,  V[X_i] = n p_i(1 − p_i),  M_X(s) = (Σ_{i} p_i e^{s_i})^n

Hypergeometric, Hyp(N, m, n)
  F(x) ≈ Φ((x − np)/√(np(1 − p))) with p = m/N
  f(x) = C(m, x) C(N − m, n − x) / C(N, n)
  E[X] = nm/N,  V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial, NBin(r, p)
  F(x) = I_p(r, x + 1)
  f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p,  V[X] = r(1 − p)/p²,  M_X(s) = (p / (1 − (1 − p)e^s))^r

Geometric, Geo(p)
  F(x) = 1 − (1 − p)^x
  f(x) = p(1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p,  V[X] = (1 − p)/p²,  M_X(s) = p e^s / (1 − (1 − p)e^s)

Poisson, Po(λ)
  F(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f(x) = λ^x e^{−λ} / x!
  E[X] = λ,  V[X] = λ,  M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and B(x, y) and I_x to refer to the Beta functions (see §22.2).

1.2 Continuous Distributions

Uniform (continuous), Unif(a, b)
  F(x) = 0 for x < a; (x − a)/(b − a) for a < x < b; 1 for x > b
  f(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2,  V[X] = (b − a)²/12,  M_X(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal, N(µ, σ²)
  F(x) = Φ((x − µ)/σ),  where Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f(x) = φ(x) = (σ√(2π))^{−1} exp(−(x − µ)²/(2σ²))
  E[X] = µ,  V[X] = σ²,  M_X(s) = exp(µs + σ²s²/2)

Log-Normal, ln N(µ, σ²)
  F(x) = 1/2 + (1/2) erf((ln x − µ)/√(2σ²))
  f(x) = (x√(2πσ²))^{−1} exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2},  V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal, MVN(µ, Σ)
  f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(x − µ)ᵀ Σ^{−1} (x − µ)/2)
  E[X] = µ,  V[X] = Σ,  M_X(s) = exp(µᵀs + sᵀΣs/2)

Student's t, Student(ν)
  f(x) = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1),  V[X] = ν/(ν − 2) (ν > 2)

Chi-square, χ²_k
  F(x) = γ(k/2, x/2) / Γ(k/2)
  f(x) = x^{k/2−1} e^{−x/2} / (2^{k/2} Γ(k/2))
  E[X] = k,  V[X] = 2k,  M_X(s) = (1 − 2s)^{−k/2}  (s < 1/2)

F, F(d₁, d₂)
  F(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f(x) = √((d₁x)^{d₁} d₂^{d₂} / (d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2),  V[X] = 2d₂²(d₁ + d₂ − 2) / (d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential, Exp(β)
  F(x) = 1 − e^{−x/β}
  f(x) = e^{−x/β}/β
  E[X] = β,  V[X] = β²,  M_X(s) = 1/(1 − βs)  (s < 1/β)

Gamma, Gamma(α, β)
  F(x) = γ(α, x/β)/Γ(α)
  f(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α)
  E[X] = αβ,  V[X] = αβ²,  M_X(s) = (1/(1 − βs))^α  (s < 1/β)

Inverse Gamma, InvGamma(α, β)
  F(x) = Γ(α, β/x)/Γ(α)
  f(x) = β^α x^{−α−1} e^{−β/x} / Γ(α)
  E[X] = β/(α − 1) (α > 1),  V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = 2(−βs)^{α/2} K_α(√(−4βs)) / Γ(α)

Dirichlet, Dir(α)
  f(x) = Γ(Σ_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) · ∏_{i=1}^{k} x_i^{α_i − 1}
  E[X_i] = α_i / Σ_{j} α_j,  V[X_i] = E[X_i](1 − E[X_i]) / (Σ_{j} α_j + 1)

Beta, Beta(α, β)
  F(x) = I_x(α, β)
  f(x) = Γ(α + β)/(Γ(α)Γ(β)) x^{α−1} (1 − x)^{β−1}
  E[X] = α/(α + β),  V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^{∞} (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull, Weibull(λ, k)
  F(x) = 1 − e^{−(x/λ)^k}
  f(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k),  V[X] = λ²Γ(1 + 2/k) − µ²
  M_X(s) = Σ_{n=0}^{∞} s^n λ^n Γ(1 + n/k) / n!

Pareto, Pareto(x_m, α)
  F(x) = 1 − (x_m/x)^α,  x ≥ x_m
  f(x) = α x_m^α / x^{α+1},  x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1),  V[X] = x_m² α / ((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)
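The tabulated means and variances can be spot-checked numerically. A minimal sketch (not part of the original cookbook) that compares a few entries against simulation, assuming numpy and scipy are available:

```python
# Spot-check tabulated E[X] and V[X] for a few distributions against simulation.
import numpy as np
from scipy import stats

cases = {
    "Bin(10, 0.3)":     (stats.binom(10, 0.3),    10 * 0.3, 10 * 0.3 * 0.7),
    "Po(4)":            (stats.poisson(4),         4.0,      4.0),
    "Gamma(3, beta=2)": (stats.gamma(3, scale=2),  3 * 2,    3 * 2 ** 2),
    "Beta(2, 5)":       (stats.beta(2, 5),         2 / 7,    2 * 5 / (7 ** 2 * 8)),
}

for name, (dist, mean_tab, var_tab) in cases.items():
    x = dist.rvs(size=200_000, random_state=0)
    print(f"{name:17s} E[X]: table={mean_tab:.4f} sim={x.mean():.4f}   "
          f"V[X]: table={var_tab:.4f} sim={x.var():.4f}")
```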
[Figure: PDFs of the continuous uniform, normal, log-normal, Student's t, χ², F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for selected parameter values.]

2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A:
  1. ∅ ∈ A
  2. A₁, A₂, ... ∈ A ⇒ ∪_{i=1}^{∞} A_i ∈ A
  3. A ∈ A ⇒ ¬A ∈ A
• Probability distribution P:
  1. P[A] ≥ 0 for all A
  2. P[Ω] = 1
  3. P[⊔_{i=1}^{∞} A_i] = Σ_{i=1}^{∞} P[A_i]
• Probability space (Ω, A, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[Ω] = 1,  P[∅] = 0
• DeMorgan: ¬(∪_n A_n) = ∩_n ¬A_n and ¬(∩_n A_n) = ∪_n ¬A_n
• P[∪_n A_n] = 1 − P[∩_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B]  ⇒  P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities
• A₁ ⊂ A₂ ⊂ ... ⇒ lim_{n→∞} P[A_n] = P[A],  where A = ∪_{i=1}^{∞} A_i
• A₁ ⊃ A₂ ⊃ ... ⇒ lim_{n→∞} P[A_n] = P[A],  where A = ∩_{i=1}^{∞} A_i

Independence
A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B] / P[B],  P[B] > 0

Law of Total Probability
P[B] = Σ_{i=1}^{n} P[B | A_i] P[A_i],  where Ω = ⊔_{i=1}^{n} A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^{n} P[B | A_j] P[A_j],  where Ω = ⊔_{i=1}^{n} A_i

Inclusion-Exclusion Principle
P[∪_{i=1}^{n} A_i] = Σ_{r=1}^{n} (−1)^{r−1} Σ_{i₁<···<i_r} P[∩_{j=1}^{r} A_{i_j}]

3 Random Variables

Random Variable (RV)
X : Ω → R

Probability Mass Function (PMF)
f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
F_X : R → [0, 1],  F_X(x) = P[X ≤ x]
1. Nondecreasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional densities
P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy,  a ≤ b
f_{Y|X}(y | x) = f(x, y) / f_X(x)

Independence
X ⊥ Y ⟺
1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)

3.1 Transformations

Transformation function: Z = ϕ(X)

Discrete X
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ^{−1}(z)] = Σ_{x ∈ ϕ^{−1}(z)} f(x)

Continuous X
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx,  with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ is strictly monotone
f_Z(z) = f_X(ϕ^{−1}(z)) |d ϕ^{−1}(z)/dz| = f_X(x) |dx/dz| = f_X(x) / |J|

The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^{∞} f_{X,Y}(x, z − x) dx;  if X, Y ≥ 0:  f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^{∞} f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^{∞} |y| f_{X,Y}(yz, y) dy;  if X ⊥ Y:  f_Z(z) = ∫ |y| f_X(yz) f_Y(y) dy
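A small numeric check of the convolution formula (not in the original text): for X, Y iid Exp(1) the sum Z = X + Y has the Gamma(2, 1) density z e^{−z}, which the convolution integral and a Monte Carlo estimate should both reproduce.

```python
# Check f_Z(z) = ∫ f_X(x) f_Y(z - x) dx for Z = X + Y with X, Y iid Exp(1),
# where the exact result is the Gamma(2, 1) density z * exp(-z).
import numpy as np

rng = np.random.default_rng(1)
z = rng.exponential(1.0, size=500_000) + rng.exponential(1.0, size=500_000)

def f_exp(t):
    # Exp(1) density, zero for negative arguments.
    return np.where(t >= 0, np.exp(-np.clip(t, 0, None)), 0.0)

dx = 0.001
xs = np.arange(0.0, 20.0, dx)
for z0 in [0.5, 1.0, 2.0, 4.0, 6.0]:
    conv = np.sum(f_exp(xs) * f_exp(z0 - xs)) * dx      # numeric convolution integral
    exact = z0 * np.exp(-z0)                             # Gamma(2, 1) density
    mc = np.mean(np.abs(z - z0) < 0.05) / 0.1            # Monte Carlo density estimate
    print(f"z={z0:4.1f}  convolution={conv:.4f}  gamma pdf={exact:.4f}  simulated={mc:.4f}")
```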
4 Expectation

Definition and properties
• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x) (X discrete),  ∫ x f_X(x) dx (X continuous)
• P[X = c] = 1 ⇒ E[c] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ x y dF_{X,Y}(x, y)
• E[ϕ(X)] ≠ ϕ(E[X]) in general (cf. Jensen's inequality)
• P[X ≥ Y] = 1 ⇒ E[X] ≥ E[Y];  P[X = Y] = 1 ⇒ E[X] = E[Y]
• E[X] = Σ_{x=1}^{∞} P[X ≥ x]  (X taking values in N)

Sample mean
X̄_n = (1/n) Σ_{i=1}^{n} X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^{∞} ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X) Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⇒ Cov[X, Y] = 0

5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} V[X_i]  if the X_i are independent

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^{n} X_i, Σ_{j=1}^{m} Y_j] = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence
X ⊥ Y ⇒ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄_n)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]
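A quick empirical illustration (added here, not in the original) of the identity V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y] on correlated data:

```python
# Empirical check of V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y].
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(scale=0.8, size=100_000)   # correlated with x

cov_xy = np.cov(x, y, ddof=1)[0, 1]
lhs = np.var(x + y, ddof=1)
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov_xy
print(f"V[X+Y]={lhs:.4f}  V[X]+V[Y]+2Cov={rhs:.4f}  rho={np.corrcoef(x, y)[0, 1]:.3f}")
```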
6 Inequalities

• Cauchy-Schwarz: E[XY]² ≤ E[X²] E[Y²]
• Markov: P[ϕ(X) ≥ t] ≤ E[ϕ(X)] / t
• Chebyshev: P[|X − E[X]| ≥ t] ≤ V[X] / t²
• Chernoff: P[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^{1+δ})^µ,  δ > −1
• Jensen: E[ϕ(X)] ≥ ϕ(E[X]) for ϕ convex

7 Distribution Relationships

Binomial
• X_i ∼ Bern(p) ⇒ Σ_{i=1}^{n} X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p) ⇒ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)  (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) = Σ_{i=1}^{r} Geo(p)
• X_i ∼ NBin(r_i, p) ⇒ Σ X_i ∼ NBin(Σ r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⇒ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Po(λ_i), X_i ⊥ X_j ⇒ Σ_{i=1}^{n} X_i ∼ Po(Σ_{i=1}^{n} λ_i)
• X_i ∼ Po(λ_i), X_i ⊥ X_j ⇒ X_i | Σ_{j=1}^{n} X_j ∼ Bin(Σ_{j=1}^{n} X_j, λ_i / Σ_{j=1}^{n} λ_j)

Exponential
• X_i ∼ Exp(β), X_i ⊥ X_j ⇒ Σ_{i=1}^{n} X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(µ, σ²) ⇒ (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ²), Z = aX + b ⇒ Z ∼ N(aµ + b, a²σ²)
• X ∼ N(µ₁, σ₁²), Y ∼ N(µ₂, σ₂²), X ⊥ Y ⇒ X + Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²)
• X_i ∼ N(µ_i, σ_i²) independent ⇒ Σ_i X_i ∼ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ Σ_{i=1}^{α} Exp(β)  (α an integer)
• X_i ∼ Gamma(α_i, β), X_i ⊥ X_j ⇒ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• f(x) = x^{α−1}(1 − x)^{β−1}/B(α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = (α + k − 1)/(α + β + k − 1) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)

8 Probability and Moment Generating Functions

• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^{∞} (Xt)^i/i!] = Σ_{i=0}^{∞} E[X^i] t^i/i!
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⇒ X =^d Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ∼ N(0, 1) with X ⊥ Z, and Y = ρX + √(1 − ρ²) Z.

Joint density
f(x, y) = (2π√(1 − ρ²))^{−1} exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))

Conditionals
(Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence
X ⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal

Let X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²) with correlation ρ.

f(x, y) = (2πσ_x σ_y √(1 − ρ²))^{−1} exp(−z/(2(1 − ρ²)))
z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and standard deviation
E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y])
sd[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ^{−1}):
Σ = [ V[X₁] ··· Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ··· V[X_k] ]

If X ∼ N(µ, Σ),
f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(x − µ)ᵀ Σ^{−1} (x − µ)/2)

Properties
• Z ∼ N(0, 1), X = µ + Σ^{1/2} Z ⇒ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⇒ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⇒ AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ), a a vector of length k ⇒ aᵀX ∼ N(aᵀµ, aᵀΣa)

10 Convergence

Let {X₁, X₂, ...} be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of convergence
1. In distribution (weakly, in law): X_n →^D X
   lim_{n→∞} F_n(t) = F(t)  at all t where F is continuous
2. In probability: X_n →^P X
   (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →^as X
   P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →^qm X
   lim_{n→∞} E[(X_n − X)²] = 0
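A small simulation (not part of the original text) illustrating convergence of the sample mean X̄_n of iid Unif(0, 1) variables: the quadratic-mean error E[(X̄_n − µ)²] and the probability P[|X̄_n − µ| > ε] both shrink as n grows.

```python
# Illustrate convergence of X̄_n to mu = 0.5 for X_i ~ Unif(0, 1):
# estimate E[(X̄_n - mu)^2] (quadratic mean) and P[|X̄_n - mu| > eps] (in probability).
import numpy as np

rng = np.random.default_rng(3)
mu, eps, reps = 0.5, 0.05, 500

for n in [10, 100, 1000, 10000]:
    xbar = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)   # reps realizations of X̄_n
    qm = np.mean((xbar - mu) ** 2)
    prob = np.mean(np.abs(xbar - mu) > eps)
    print(f"n={n:6d}  E[(Xbar-mu)^2]={qm:.6f}  P[|Xbar-mu|>{eps}]={prob:.4f}")
```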
Relationships
• X_n →^qm X ⇒ X_n →^P X ⇒ X_n →^D X
• X_n →^as X ⇒ X_n →^P X
• X_n →^D X ∧ (∃c ∈ R) P[X = c] = 1 ⇒ X_n →^P X
• X_n →^P X ∧ Y_n →^P Y ⇒ X_n + Y_n →^P X + Y
• X_n →^qm X ∧ Y_n →^qm Y ⇒ X_n + Y_n →^qm X + Y
• X_n →^P X ∧ Y_n →^P Y ⇒ X_n Y_n →^P XY
• X_n →^P X ⇒ ϕ(X_n) →^P ϕ(X)
• X_n →^D X ⇒ ϕ(X_n) →^D ϕ(X)
• X_n →^qm b ⟺ lim_{n→∞} E[X_n] = b ∧ lim_{n→∞} V[X_n] = 0
• X₁, ..., X_n iid ∧ E[X] = µ ∧ V[X] < ∞ ⟺ X̄_n →^qm µ

Slutzky's Theorem
• X_n →^D X and Y_n →^P c ⇒ X_n + Y_n →^D X + c
• X_n →^D X and Y_n →^P c ⇒ X_n Y_n →^D cX
• In general: X_n →^D X and Y_n →^D Y does not imply X_n + Y_n →^D X + Y

10.1 Law of Large Numbers (LLN)

Let {X₁, ..., X_n} be a sequence of iid rv's with E[X₁] = µ and V[X₁] < ∞.

Weak (WLLN): X̄_n →^P µ as n → ∞
Strong (SLLN): X̄_n →^as µ as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X₁, ..., X_n} be a sequence of iid rv's with E[X₁] = µ and V[X₁] = σ².

Z_n := (X̄_n − µ)/√(V[X̄_n]) = √n (X̄_n − µ)/σ →^D Z,  where Z ∼ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ R

CLT notations
Z_n ≈ N(0, 1)
X̄_n ≈ N(µ, σ²/n)
X̄_n − µ ≈ N(0, σ²/n)
√n (X̄_n − µ) ≈ N(0, σ²)
√n (X̄_n − µ)/σ ≈ N(0, 1)

Continuity correction
P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))

Delta method
Y_n ≈ N(µ, σ²/n) ⇒ ϕ(Y_n) ≈ N(ϕ(µ), (ϕ′(µ))² σ²/n)

11 Statistical Inference

Let X₁, ..., X_n ∼^iid F if not otherwise noted.

11.1 Point Estimation

• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X₁, ..., X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →^P θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⇒ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →^D N(0, 1)
• Slutzky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-Based Confidence Interval

Suppose θ̂_n ≈ N(θ, se²). Let z_{α/2} = Φ^{−1}(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α, where Z ∼ N(0, 1). Then

C_n = θ̂_n ± z_{α/2} se

11.3 Empirical Distribution

Empirical Distribution Function (ECDF)
F̂_n(x) = Σ_{i=1}^{n} I(X_i ≤ x)/n,  where I(X_i ≤ x) = 1 if X_i ≤ x and 0 otherwise

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂_n(x) →^P F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, ..., X_n ∼ F)
P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F
L(x) = max{F̂_n(x) − ε_n, 0},  U(x) = min{F̂_n(x) + ε_n, 1}
ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α

11.4 Statistical Functionals

• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) Σ_{i=1}^{n} ϕ(X_i)
• Often T(F̂_n) ≈ N(T(F), se²), giving the interval T(F̂_n) ± z_{α/2} se
• pth quantile: F^{−1}(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄_n)²
• κ̂ = (1/n) Σ_{i=1}^{n} (X_i − µ̂)³ / σ̂³
• ρ̂ = Σ_i (X_i − X̄_n)(Y_i − Ȳ_n) / (√(Σ_i (X_i − X̄_n)²) √(Σ_i (Y_i − Ȳ_n)²))

12 Parametric Inference

Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ R^k and parameter θ = (θ₁, ..., θ_k).

12.1 Method of Moments

jth moment
α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

jth sample moment
α̂_j = (1/n) Σ_{i=1}^{n} X_i^j

Method of moments estimator (MoM): θ̂_n solves the system
α₁(θ) = α̂₁
α₂(θ) = α̂₂
⋮
α_k(θ) = α̂_k
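As a concrete illustration of solving this moment system (added here as a sketch, not from the original text), consider the Gamma(α, β) model, where α₁ = αβ and α₂ − α₁² = αβ²; matching sample moments gives closed-form estimates:

```python
# Method of moments for Gamma(alpha, beta): E[X] = alpha*beta, V[X] = alpha*beta^2.
import numpy as np

rng = np.random.default_rng(4)
data = rng.gamma(shape=3.0, scale=2.0, size=50_000)   # true alpha=3, beta=2

m1 = np.mean(data)            # first sample moment
m2 = np.mean(data ** 2)       # second sample moment
var_hat = m2 - m1 ** 2

beta_hat = var_hat / m1
alpha_hat = m1 ** 2 / var_hat
print(f"alpha_hat={alpha_hat:.3f} (true 3), beta_hat={beta_hat:.3f} (true 2)")
```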
Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →^P θ
• Asymptotic normality: √n (θ̂_n − θ) →^D N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², ..., X^k)ᵀ, g = (g₁, ..., g_k), and g_j = ∂α_j^{−1}(θ)/∂θ

12.2 Maximum Likelihood

Likelihood: L_n : Θ → [0, ∞)
L_n(θ) = ∏_{i=1}^{n} f(X_i; θ)

Log-likelihood
ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^{n} log f(X_i; θ)

Maximum likelihood estimator (mle)
L_n(θ̂_n) = sup_{θ∈Θ} L_n(θ)

Score function
s(X; θ) = ∂ log f(X; θ)/∂θ

Fisher information
I(θ) = V_θ[s(X; θ)],  I_n(θ) = n I(θ)

Fisher information (exponential family)
I(θ) = E_θ[−∂s(X; θ)/∂θ]

Observed Fisher information
I_n^obs(θ) = −Σ_{i=1}^{n} ∂²/∂θ² log f(X_i; θ)

Properties of the mle
• Consistency: θ̂_n →^P θ
• Equivariance: θ̂_n is the mle ⇒ ϕ(θ̂_n) is the mle of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →^D N(0, 1)
  2. ŝe ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/ŝe →^D N(0, 1)
• Asymptotic optimality (or efficiency): smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1
• Approximately the Bayes estimator

12.2.1 Delta Method

If τ = ϕ(θ), where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂_n − τ)/ŝe(τ̂) →^D N(0, 1)
where τ̂ = ϕ(θ̂) is the mle of τ and ŝe(τ̂) = |ϕ′(θ̂)| ŝe(θ̂_n)

12.3 Multiparameter Models

Let θ = (θ₁, ..., θ_k) and let θ̂ = (θ̂₁, ..., θ̂_k) be the mle.

H_jj = ∂²ℓ_n/∂θ_j²,  H_jk = ∂²ℓ_n/∂θ_j ∂θ_k

Fisher information matrix
I_n(θ) = −[ E_θ[H₁₁] ··· E_θ[H₁k] ; ⋮ ⋱ ⋮ ; E_θ[H_k1] ··· E_θ[H_kk] ]

Under appropriate regularity conditions
(θ̂ − θ) ≈ N(0, J_n),  with J_n(θ) = I_n^{−1}(θ)
Further, if θ̂_j is the jth component of θ̂, then
(θ̂_j − θ_j)/ŝe_j →^D N(0, 1),  where ŝe_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k)

12.3.1 Multiparameter Delta Method

Let τ = ϕ(θ₁, ..., θ_k) and let the gradient of ϕ be ∇ϕ = (∂ϕ/∂θ₁, ..., ∂ϕ/∂θ_k)ᵀ.

Suppose ∇ϕ evaluated at θ = θ̂ is non-zero and τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/ŝe(τ̂) →^D N(0, 1)
where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)), Ĵ_n = J_n(θ̂), and ∇̂ϕ is ∇ϕ evaluated at θ = θ̂.

12.4 Parametric Bootstrap

Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the mle or the method of moments estimator.

13 Hypothesis Testing

H₀ : θ ∈ Θ₀  versus  H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

Decision table:
             Retain H₀              Reject H₀
H₀ true      correct                Type I error (α)
H₁ true      Type II error (β)      correct (power)

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)], where T(X*) ∼ F_θ; equivalently 1 − F_θ(T(X))

p-value evidence scale:
  < 0.01       very strong evidence against H₀
  0.01 – 0.05  strong evidence against H₀
  0.05 – 0.1   weak evidence against H₀
  > 0.1        little or no evidence against H₀

Wald test
• Two-sided test
• Reject H₀ when |W| > z_{α/2}, where W = (θ̂ − θ₀)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂_n) / L_n(θ̂_{n,0})
• λ(X) = 2 log T(X) →^D χ²_{r−q}, where χ²_k = Σ_{i=1}^{k} Z_i² with Z₁, ..., Z_k ∼^iid N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]
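A minimal sketch (not in the original) of the Wald test and its normal-approximation p-value for H₀: p = 0.5 in a Bernoulli model, assuming scipy is available for Φ:

```python
# Wald test for H0: p = 0.5 in a Bernoulli model; p-value = 2*Phi(-|w|).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.55, size=1000)     # data generated with true p = 0.55

p0 = 0.5
p_hat = x.mean()
se_hat = np.sqrt(p_hat * (1 - p_hat) / len(x))
w = (p_hat - p0) / se_hat
p_value = 2 * norm.cdf(-abs(w))
print(f"p_hat={p_hat:.3f}  W={w:.2f}  p-value={p_value:.4f}")
```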
Multinomial LRT
• mle: p̂_n = (X₁/n, ..., X_k/n)
• T(X) = L_n(p̂_n)/L_n(p₀) = ∏_{j=1}^{k} (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^{k} X_j log(p̂_j/p₀_j) →^D χ²_{k−1}
• The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson chi-square test
• T = Σ_{j=1}^{k} (X_j − E[X_j])²/E[X_j], where E[X_j] = np₀_j under H₀
• T →^D χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• The χ²_{k−1} limit sets in faster than for the LRT, hence Pearson's test is preferable for small n

Independence testing
• I rows, J columns, X a multinomial sample over the I·J cells with total count n
• mles unconstrained: p̂_ij = X_ij/n
• mles under H₀: p̂⁰_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} X_ij log(n X_ij / (X_i· X_·j))
• Pearson: T = Σ_{i=1}^{I} Σ_{j=1}^{J} (X_ij − E[X_ij])²/E[X_ij]
• Both LRT and Pearson statistics →^D χ²_ν, where ν = (I − 1)(J − 1)

14 Bayesian Inference

Bayes' Theorem
f(θ | x^n) = f(x^n | θ) f(θ) / f(x^n) = f(x^n | θ) f(θ) / ∫ f(x^n | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• X^n = (X₁, ..., X_n),  x^n = (x₁, ..., x_n)
• Prior density f(θ)
• Likelihood f(x^n | θ): joint density of the data; in particular, if X^n is iid, f(x^n | θ) = ∏_{i=1}^{n} f(x_i | θ) = L_n(θ)
• Posterior density f(θ | x^n)
• Normalizing constant c_n = f(x^n) = ∫ f(x^n | θ) f(θ) dθ
• Kernel: the part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | x^n) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ

14.1 Credible Intervals

Posterior interval
P[θ ∈ (a, b) | x^n] = ∫_a^b f(θ | x^n) dθ = 1 − α

Equal-tail credible interval
∫_{−∞}^{a} f(θ | x^n) dθ = ∫_{b}^{∞} f(θ | x^n) dθ = α/2

Highest posterior density (HPD) region R_n
1. P[θ ∈ R_n] = 1 − α
2. R_n = {θ : f(θ | x^n) > k} for some k
If f(θ | x^n) is unimodal, then R_n is an interval.

14.2 Function of Parameters

Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.

Posterior CDF for τ
H(τ | x^n) = P[ϕ(θ) ≤ τ | x^n] = ∫_A f(θ | x^n) dθ

Posterior density
h(τ | x^n) = H′(τ | x^n)

Bayesian delta method
τ | X^n ≈ N(ϕ(θ̂), (ŝe ϕ′(θ̂))²)

14.3 Priors

Choice
• Subjective Bayesianism
• Objective Bayesianism
• Robust Bayesianism

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^{∞} f(θ) dθ = 1
• Improper: ∫_{−∞}^{∞} f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ); in the multiparameter case f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | x^n) belong to the same parametric family
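As a sketch of the relation f(θ | x^n) ∝ L_n(θ) f(θ) (added here, not part of the original), a grid approximation of the posterior for Bernoulli data under a Beta(2, 2) prior; the conjugate result in the next subsection, Beta(2 + Σx_i, 2 + n − Σx_i), should agree:

```python
# Grid approximation of a Bernoulli posterior under a Beta(2, 2) prior.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.7, size=40)
n, s = len(x), x.sum()

theta = np.linspace(1e-4, 1 - 1e-4, 2001)
log_post = s * np.log(theta) + (n - s) * np.log(1 - theta) + beta(2, 2).logpdf(theta)
post = np.exp(log_post - log_post.max())
dtheta = theta[1] - theta[0]
post /= post.sum() * dtheta                      # normalize numerically

print("grid posterior mean:   ", (theta * post).sum() * dtheta)
print("conjugate Beta mean:   ", (2 + s) / (4 + n))
```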
14.3.1 Conjugate Priors

Discrete likelihood:
• Bern(p), prior Beta(α, β) → posterior Beta(α + Σ_i x_i, β + n − Σ_i x_i)
• Bin(p), prior Beta(α, β) → posterior Beta(α + Σ_i x_i, β + Σ_i N_i − Σ_i x_i)
• NBin(p), prior Beta(α, β) → posterior Beta(α + rn, β + Σ_i x_i)
• Po(λ), prior Gamma(α, β) → posterior Gamma(α + Σ_i x_i, β + n)
• Geo(p), prior Beta(α, β) → posterior Beta(α + n, β + Σ_i x_i)
• Multinomial(p), prior Dir(α) → posterior Dir(α + Σ_i x^(i))

Continuous likelihood (subscript c denotes a known constant):
• Unif(0, θ), prior Pareto(x_m, k) → posterior Pareto(max{x_(n), x_m}, k + n)
• Exp(λ), prior Gamma(α, β) → posterior Gamma(α + n, β + Σ_i x_i)
• N(µ, σ_c²), prior N(µ₀, σ₀²) → posterior N( (µ₀/σ₀² + Σ_i x_i/σ_c²)/(1/σ₀² + n/σ_c²), (1/σ₀² + n/σ_c²)^{−1} )
• N(µ_c, σ²), prior Scaled Inverse Chi-square(ν, σ₀²) → posterior Scaled Inverse Chi-square( ν + n, (νσ₀² + Σ_i (x_i − µ_c)²)/(ν + n) )
• N(µ, σ²), prior Normal-scaled Inverse Gamma(λ, ν, α, β) → posterior Normal-scaled Inverse Gamma( (νλ + nx̄)/(ν + n), ν + n, α + n/2, β + (1/2)Σ_i (x_i − x̄)² + nν(x̄ − λ)²/(2(n + ν)) )
• MVN(µ, Σ_c), prior MVN(µ₀, Σ₀) → posterior MVN( (Σ₀^{−1} + nΣ_c^{−1})^{−1}(Σ₀^{−1}µ₀ + nΣ_c^{−1}x̄), (Σ₀^{−1} + nΣ_c^{−1})^{−1} )
• MVN(µ_c, Σ), prior InverseWishart(κ, Ψ) → posterior InverseWishart( n + κ, Ψ + Σ_i (x_i − µ_c)(x_i − µ_c)ᵀ )
• Pareto(x_mc, k), prior Gamma(α, β) → posterior Gamma( α + n, β + Σ_i log(x_i/x_mc) )
• Pareto(x_m, k_c), prior Pareto(x₀, k₀) → posterior Pareto( x₀, k₀ − k_c n ),  where k₀ > k_c n
• Gamma(α_c, β), prior Gamma(α₀, β₀) → posterior Gamma( α₀ + nα_c, β₀ + Σ_i x_i )

14.4 Bayesian Testing

If H₀ : θ ∈ Θ₀:
Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability P[H₀ | x^n] = ∫_{Θ₀} f(θ | x^n) dθ

Let H₀, ..., H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k). Then
P[H_k | x^n] = f(x^n | H_k) P[H_k] / Σ_{k=1}^{K} f(x^n | H_k) P[H_k]

Marginal likelihood
f(x^n | H_i) = ∫_Θ f(x^n | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)
P[H_i | x^n] / P[H_j | x^n] = ( f(x^n | H_i)/f(x^n | H_j) ) × ( P[H_i]/P[H_j] ) = Bayes factor BF_ij × prior odds

Bayes factor evidence scale:
  log₁₀ BF₁₀    BF₁₀        evidence
  0 – 0.5       1 – 3.2     weak
  0.5 – 1       3.2 – 10    moderate
  1 – 2         10 – 100    strong
  > 2           > 100       decisive

p* = ( (p/(1 − p)) BF₁₀ ) / ( 1 + (p/(1 − p)) BF₁₀ ),  where p = P[H₁] and p* = P[H₁ | x^n]

15 Exponential Family

Scalar parameter
f_X(x | θ) = h(x) exp{η(θ) T(x) − A(θ)} = h(x) g(θ) exp{η(θ) T(x)}

Vector parameter
f_X(x | θ) = h(x) exp{Σ_{i=1}^{s} η_i(θ) T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}

Natural form
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{ηᵀ T(x)}

16 Sampling Methods

16.1 The Bootstrap

Let T_n = g(X₁, ..., X_n) be a statistic.

1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
   (a) Repeat the following B times to get T*_{n,1}, ..., T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
       i. Sample uniformly with replacement X₁*, ..., X_n* ∼ F̂_n.
       ii. Compute T_n* = g(X₁*, ..., X_n*).
   (b) Then
       v̂_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^{B} ( T*_{n,b} − (1/B) Σ_{r=1}^{B} T*_{n,r} )²

16.1.1 Bootstrap Confidence Intervals

Normal-based interval
T_n ± z_{α/2} ŝe_boot

Pivotal interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
   Ĥ(r) = (1/B) Σ_{b=1}^{B} I(R*_{n,b} ≤ r)
5. Let θ*_β be the β sample quantile of (θ̂*_{n,1}, ..., θ̂*_{n,B})
6. Let r*_β be the β sample quantile of (R*_{n,1}, ..., R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. The approximate 1 − α confidence interval is C_n = (â, b̂), where
   â = θ̂_n − Ĥ^{−1}(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ^{−1}(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

Percentile interval
C_n = (θ*_{α/2}, θ*_{1−α/2})
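A minimal bootstrap sketch (not from the original text): estimate the standard error of the sample median and form the normal-based and percentile intervals described above.

```python
# Bootstrap standard error and confidence intervals for the sample median.
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)
B = 2000

t_hat = np.median(data)
t_star = np.array([np.median(rng.choice(data, size=data.size, replace=True))
                   for _ in range(B)])

se_boot = t_star.std(ddof=1)
z = 1.96                                    # z_{alpha/2} for alpha = 0.05
normal_ci = (t_hat - z * se_boot, t_hat + z * se_boot)
percentile_ci = tuple(np.percentile(t_star, [2.5, 97.5]))
print(f"median={t_hat:.3f}  se_boot={se_boot:.3f}")
print("normal-based CI:", normal_ci, " percentile CI:", percentile_ci)
```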
16.2 Rejection Sampling

Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportional constant: h(θ) = k(θ) / ∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) for all θ

Algorithm
1. Draw θ^cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ^cand if u ≤ k(θ^cand) / (M g(θ^cand))
4. Repeat until B values of θ^cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior h(θ) ∝ k(θ) = f(x^n | θ) f(θ)
• Envelope condition: f(x^n | θ) ≤ f(x^n | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm: 1. Draw θ^cand ∼ f(θ); 2. Generate u ∼ Unif(0, 1); 3. Accept θ^cand if u ≤ L_n(θ^cand)/L_n(θ̂_n)

16.3 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(θ) | x^n]:
1. Sample from the prior θ₁, ..., θ_B ∼^iid f(θ)
2. w_i = L_n(θ_i) / Σ_{i=1}^{B} L_n(θ_i),  i = 1, ..., B
3. E[q(θ) | x^n] ≈ Σ_{i=1}^{B} q(θ_i) w_i

17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: a possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, i.e., the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0, K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K₁ = K₂)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ, 1 if a ≠ θ

17.1 Risk

Posterior risk
r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[R(θ, θ̂)] = E_θ[E_{X|θ}[L(θ, θ̂(X))]]
r(f, θ̂) = E_X[r(θ̂ | X)] = E_X[E_{θ|X}[L(θ, θ̂(X))]]

17.2 Admissibility

• θ̂′ dominates θ̂ if
  R(θ, θ̂′) ≤ R(θ, θ̂) for all θ, and
  R(θ, θ̂′) < R(θ, θ̂) for at least one θ
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• If θ̂(x) minimizes r(θ̂ | x) for every x, then r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx is minimized.

Theorems
• Squared error loss: the posterior mean is the Bayes rule
• Absolute error loss: the posterior median is the Bayes rule
• Zero-one loss: the posterior mode is the Bayes rule

17.4 Minimax Rules

Maximum risk
R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)

Minimax rule
sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

θ̂ = Bayes rule ∧ ∃c : R(θ, θ̂) = c  ⇒  θ̂ is minimax
Least favorable prior: θ̂^f = Bayes rule ∧ R(θ, θ̂^f) ≤ r(f, θ̂^f) for all θ  ⇒  θ̂^f is minimax

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
Y_i = β₀ + β₁X_i + ε_i,  with E[ε_i | X_i] = 0 and V[ε_i | X_i] = σ²

Fitted line
r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values
Ŷ_i = r̂(X_i)

Residuals
ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)
rss(β̂₀, β̂₁) = Σ_{i=1}^{n} ε̂_i²

Least squares estimates
β̂ = (β̂₀, β̂₁)ᵀ minimizes rss:
β̂₀ = Ȳ_n − β̂₁X̄_n
β̂₁ = Σ_{i=1}^{n} (X_i − X̄_n)(Y_i − Ȳ_n) / Σ_{i=1}^{n} (X_i − X̄_n)² = (Σ_i X_iY_i − nX̄Ȳ) / (Σ_i X_i² − nX̄²)

E[β̂ | X^n] = (β₀, β₁)ᵀ
V[β̂ | X^n] = (σ²/(n s_X²)) [ n^{−1}Σ_i X_i²   −X̄_n ; −X̄_n   1 ]
ŝe(β̂₀) = (σ̂/(s_X √n)) √(Σ_i X_i²/n)
ŝe(β̂₁) = σ̂/(s_X √n)
where s_X² = n^{−1} Σ_{i=1}^{n} (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_{i=1}^{n} ε̂_i² (unbiased estimate).
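A short worked example (added here, not in the original) computing the least squares estimates and the standard errors above on simulated data:

```python
# Least squares estimates and standard errors for simple linear regression,
# with sigma^2 estimated using n - 2 degrees of freedom.
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=n)      # true beta0=1.5, beta1=2

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
sigma2_hat = resid @ resid / (n - 2)                   # unbiased estimate of sigma^2
s2_x = np.sum((x - x.mean()) ** 2) / n

se_b1 = np.sqrt(sigma2_hat / n) / np.sqrt(s2_x)
se_b0 = np.sqrt(sigma2_hat) * np.sqrt(np.mean(x ** 2)) / np.sqrt(n * s2_x)
print(f"b0={b0:.3f} (se {se_b0:.3f}),  b1={b1:.3f} (se {se_b1:.3f})")
```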
Further properties of the least squares estimates:
• Consistency: β̂₀ →^P β₀ and β̂₁ →^P β₁
• Asymptotic normality: (β̂₀ − β₀)/ŝe(β̂₀) →^D N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →^D N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁:
  β̂₀ ± z_{α/2} ŝe(β̂₀)  and  β̂₁ ± z_{α/2} ŝe(β̂₁)
• Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2}, where W = β̂₁/ŝe(β̂₁).

R²
R² = Σ_i (Ŷ_i − Ȳ)² / Σ_i (Y_i − Ȳ)² = 1 − Σ_i ε̂_i² / Σ_i (Y_i − Ȳ)² = 1 − rss/tss

Likelihood
L = ∏_{i=1}^{n} f(X_i, Y_i) = ∏_{i=1}^{n} f_X(X_i) × ∏_{i=1}^{n} f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₂ = ∏_{i=1}^{n} f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}

Under the assumption of normality, the least squares estimator is also the mle, with
σ̂²_mle = (1/n) Σ_{i=1}^{n} ε̂_i²

18.2 Prediction

Observe X = x_* of the covariate and predict the outcome Y_*.

Ŷ_* = β̂₀ + β̂₁x_*
V[Ŷ_*] = V[β̂₀] + x_*² V[β̂₁] + 2x_* Cov[β̂₀, β̂₁]

Prediction interval
ξ̂_n² = σ̂² ( Σ_i (X_i − x_*)² / (n Σ_i (X_i − X̄)²) + 1 )
Ŷ_* ± z_{α/2} ξ̂_n

18.3 Multiple Regression

Y = Xβ + ε,  where X is the n × k design matrix with entries X_ij, β = (β₁, ..., β_k)ᵀ, and ε = (ε₁, ..., ε_n)ᵀ.

Likelihood
L(µ, Σ) = (2πσ²)^{−n/2} exp{−rss/(2σ²)}
rss = (Y − Xβ)ᵀ(Y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^{N} (Y_i − x_iᵀβ)²

If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)^{−1}XᵀY
V[β̂ | X^n] = σ²(XᵀX)^{−1}
β̂ ≈ N(β, σ²(XᵀX)^{−1})

Estimate regression function
r̂(x) = Σ_{j=1}^{k} β̂_j x_j

Estimates of σ²
σ̂² = (1/(n − k)) Σ_{i=1}^{n} ε̂_i²  (unbiased),  σ̂²_mle = (1/n) Σ_{i=1}^{n} ε̂_i²,  where ε̂ = Xβ̂ − Y

1 − α confidence interval
β̂_j ± z_{α/2} ŝe(β̂_j)

Hypothesis testing: H₀ : β_j = 0 vs. H₁ : β_j ≠ 0 for all j ∈ J

18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the best score

Mean squared prediction error (mspe)
mspe = E[(Ŷ(S) − Y*)²]

Prediction risk
R(S) = Σ_{i=1}^{n} mspe_i = Σ_{i=1}^{n} E[(Ŷ_i(S) − Y_i*)²]

Training error
R̂_tr(S) = Σ_{i=1}^{n} (Ŷ_i(S) − Y_i)²

The training error is a downward-biased estimate of the prediction risk.
E[R̂_tr(S)] < R(S),  with
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^{n} Cov[Ŷ_i, Y_i]

R² for a submodel S
R²(S) = 1 − rss(S)/tss

Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's C_p statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k

Bayesian Information Criterion (BIC)
BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n

Validation and training
R̂_V(S) = Σ_{i=1}^{m} (Ŷ_i*(S) − Y_i*)²,  m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation
R̂_CV(S) = Σ_{i=1}^{n} (Y_i − Ŷ_(i))² = Σ_{i=1}^{n} ( (Y_i − Ŷ_i(S)) / (1 − U_ii(S)) )²
U(S) = X_S(X_Sᵀ X_S)^{−1} X_Sᵀ  ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise)
L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk
R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x),  v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator
f̂_n(x) = Σ_{j=1}^{m} (p̂_j/h) I(x ∈ B_j)

E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (6 / ∫ (f′(u))² du)^{1/3} n^{−1/3}
R*(f̂_n, f) ≈ C n^{−2/3},  C = (3/4)^{2/3} ( ∫ (f′(u))² du )^{1/3}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^{n} f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^{m} p̂_j²

19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ x K(x) dx = 0
• ∫ x² K(x) dx ≡ σ²_K > 0

KDE
f̂_n(x) = (1/n) Σ_{i=1}^{n} (1/h) K((x − X_i)/h)
R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} n^{−1/5},  with c₁ = σ_K², c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx
R*(f, f̂_n) = (5/4) c₄ ( ∫ K²(x) dx )^{4/5} ( ∫ (f″)² dx )^{1/5} n^{−4/5},  c₄ = (σ_K²)^{2/5}

Epanechnikov kernel
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5, and 0 otherwise

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^{n} f̂_{(−i)}(X_i) ≈ (1/(hn²)) Σ_i Σ_j K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K^(2)(x) − 2K(x),  K^(2)(x) = ∫ K(x − y)K(y) dy

19.2 Non-parametric Regression

Estimate r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), ..., (x_n, Y_n) related by
Y_i = r(x_i) + ε_i,  E[ε_i] = 0,  V[ε_i] = σ²

k-nearest neighbor estimator
r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i,  where N_k(x) = {k values of x₁, ..., x_n closest to x}

Nadaraya-Watson kernel estimator
r̂(x) = Σ_{i=1}^{n} w_i(x) Y_i
w_i(x) = K((x − x_i)/h) / Σ_{j=1}^{n} K((x − x_j)/h) ∈ [0, 1]
R(r̂_n, r) ≈ (h⁴/4) ( ∫ x²K(x) dx )² ∫ ( r″(x) + 2r′(x) f′(x)/f(x) )² dx + (σ² ∫ K²(x) dx / (nh)) ∫ dx/f(x)
h* ≈ c₁ n^{−1/5},  R*(r̂_n, r) ≈ c₂ n^{−4/5}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = Σ_{i=1}^{n} (Y_i − r̂_{(−i)}(x_i))² = Σ_{i=1}^{n} ( (Y_i − r̂(x_i)) / (1 − K(0)/Σ_j K((x_i − x_j)/h)) )²

19.3 Smoothing Using Orthogonal Functions

Approximation
r(x) = Σ_{j=1}^{∞} β_j φ_j(x) ≈ Σ_{j=1}^{J} β_j φ_j(x)

Multivariate regression form
Y = Φβ + η,  where η_i = ε_i and Φ is the matrix with entries Φ_ij = φ_j(x_i) (rows indexed by observations, columns by the basis functions φ₀, ..., φ_J)

Least squares estimator
β̂ = (ΦᵀΦ)^{−1}ΦᵀY ≈ (1/n) ΦᵀY  (for equally spaced observations only)

Cross-validation estimate of E[J(h)]
R̂_CV(J) = Σ_{i=1}^{n} ( Y_i − Σ_{j=1}^{J} φ_j(x_i) β̂_{j,(−i)} )²
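A small sketch (not part of the original) of smoothing with orthogonal functions: regress Y on a constant plus a cosine basis, assumed here as φ_j(x) = √2 cos(πjx), and estimate β by least squares as in §19.3:

```python
# Smoothing with orthogonal functions: least squares fit of Y on a cosine basis.
import numpy as np

rng = np.random.default_rng(9)
n, J = 200, 8
x = np.sort(rng.uniform(0, 1, size=n))
r_true = np.sin(2 * np.pi * x)                      # unknown regression function
y = r_true + rng.normal(scale=0.3, size=n)

Phi = np.column_stack([np.ones(n)] +
                      [np.sqrt(2) * np.cos(np.pi * j * x) for j in range(1, J + 1)])
beta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # (Phi^T Phi)^{-1} Phi^T Y
r_hat = Phi @ beta_hat

print("estimated coefficients:", np.round(beta_hat, 3))
print("mean squared fit error:", np.mean((r_hat - r_true) ** 2))
```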
20 Stochastic Processes

A stochastic process is a collection {X_t : t ∈ T} with index set T = {0, ±1, ...} = Z (discrete) or T = [0, ∞) (continuous).
• Notations: X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains

Markov chain
P[X_n = x | X₀, ..., X_{n−1}] = P[X_n = x | X_{n−1}]  for all n ∈ T, x ∈ X

Transition probabilities
p_ij ≡ P[X_{n+1} = j | X_n = i]
p_ij(n) ≡ P[X_{m+n} = j | X_m = i]  (n-step)

Transition matrix P (n-step: P_n)
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1 (rows sum to one)

Chapman-Kolmogorov
p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ··· × P = P^n

Marginal probability
µ_n = (µ_n(1), ..., µ_n(N)),  where µ_n(i) = P[X_n = i]
µ₀ = initial distribution
µ_n = µ₀ Pⁿ

20.2 Poisson Processes

Poisson process
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X₀ = 0
• Independent increments: for all t₀ < ··· < t_n: X_{t₁} − X_{t₀} ⊥ ··· ⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t)
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t = 2] = o(h)
• X_{s+t} − X_s ∼ Po(m(s + t) − m(s)),  where m(t) = ∫₀ᵗ λ(s) ds

Homogeneous Poisson process
λ(t) ≡ λ > 0  ⇒  X_t ∼ Po(λt)

Waiting times
W_t := time at which X_t occurs
W_t ∼ Gamma(t, 1/λ)

Interarrival times
S_t = W_{t+1} − W_t
S_t ∼ Exp(1/λ)

21 Time Series

Mean function
µ_{xt} = E[x_t] = ∫_{−∞}^{∞} x f_t(x) dx

Autocovariance function
γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
γ_x(t, t) = E[(x_t − µ_t)²] = V[x_t]

Autocorrelation function (ACF)
ρ(s, t) = Cov[x_s, x_t] / √(V[x_s] V[x_t]) = γ(s, t) / √(γ(s, s) γ(t, t))

Cross-covariance function (CCV)
γ_xy(s, t) = E[(x_s − µ_{xs})(y_t − µ_{yt})]

Cross-correlation function (CCF)
ρ_xy(s, t) = γ_xy(s, t) / √(γ_x(s, s) γ_y(t, t))

Backshift operator
B^k(x_t) = x_{t−k}

Difference operator
∇^d = (1 − B)^d

White noise
• w_t ∼ wn(0, σ_w²)
• Gaussian: w_t ∼^iid N(0, σ_w²)
• E[w_t] = 0 and V[w_t] = σ_w² for all t ∈ T
• γ_w(s, t) = 0 for s ≠ t, s, t ∈ T

Random walk
• Drift δ
• x_t = δt + Σ_{j=1}^{t} w_j
• E[x_t] = δt

Symmetric moving average
m_t = Σ_{j=−k}^{k} a_j x_{t−j},  where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^{k} a_j = 1

21.1 Stationary Time Series

Strictly stationary
P[x_{t₁} ≤ c₁, ..., x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, ..., x_{t_k+h} ≤ c_k]  for all k ∈ N and t_k, c_k, h ∈ Z

Weakly stationary
• E[x_t²] < ∞ for all t ∈ Z
• E[x_t] = m for all t ∈ Z
• γ_x(s, t) = γ_x(s + r, t + r) for all r, s, t ∈ Z

Autocovariance function (of a stationary series)
• γ(h) = E[(x_{t+h} − µ)(x_t − µ)]
• γ(0) = E[(x_t − µ)²]
• γ(0) ≥ 0 and γ(0) ≥ |γ(h)|
• γ(h) = γ(−h) for all h ∈ Z

Autocorrelation function (ACF)
ρ_x(h) = Cov[x_{t+h}, x_t] / √(V[x_{t+h}] V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series
γ_xy(h) = E[(x_{t+h} − µ_x)(y_t − µ_y)]
ρ_xy(h) = γ_xy(h) / √(γ_x(0) γ_y(0))

Linear process
x_t = µ + Σ_{j=−∞}^{∞} ψ_j w_{t−j},  where Σ_{j=−∞}^{∞} |ψ_j| < ∞
γ(h) = σ_w² Σ_{j=−∞}^{∞} ψ_{j+h} ψ_j

21.2 Estimation of Correlation

Sample mean
x̄ = (1/n) Σ_{t=1}^{n} x_t

Variance of the sample mean
V[x̄] = (1/n) Σ_{h=−n}^{n} (1 − |h|/n) γ_x(h)

Sample autocovariance function
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function
ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function
γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function
ρ̂_xy(h) = γ̂_xy(h) / √(γ̂_x(0) γ̂_y(0))

Properties
• σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
• σ_{ρ̂_xy(h)} = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series

Classical decomposition model
x_t = µ_t + s_t + w_t
• µ_t = trend
• s_t = seasonal component
• w_t = random noise term

21.3.1 Detrending

Least squares
1. Choose a trend model, e.g., µ_t = β₀ + β₁t + β₂t²
2. Minimize rss to obtain the trend estimate µ̂_t = β̂₀ + β̂₁t + β̂₂t²
3. Take the residuals as the noise w_t
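A brief illustration (added here, not part of the original) of least squares detrending and of differencing on a simulated series with a linear trend:

```python
# Detrend by least squares: fit mu_t = b0 + b1*t, then inspect the residual noise.
import numpy as np

rng = np.random.default_rng(10)
n = 300
t = np.arange(n, dtype=float)
x = 2.0 + 0.05 * t + rng.normal(scale=1.0, size=n)   # trend + white noise

T = np.column_stack([np.ones(n), t])
b0, b1 = np.linalg.lstsq(T, x, rcond=None)[0]        # minimize rss over (b0, b1)
resid = x - (b0 + b1 * t)

print(f"trend estimate: mu_t = {b0:.3f} + {b1:.4f} t")
print(f"residual mean = {resid.mean():.3f}, residual variance = {resid.var(ddof=1):.3f}")

# Differencing removes the same linear trend: the first differences average to b1.
print(f"mean of first differences = {np.diff(x).mean():.4f}")
```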
Moving average
• The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1):
  v_t = (1/(2k + 1)) Σ_{i=−k}^{k} x_{t−i}
• If (1/(2k + 1)) Σ_{i=−k}^{k} w_{t−i} ≈ 0, a linear trend function µ_t = β₀ + β₁t passes without distortion

Differencing
• µ_t = β₀ + β₁t ⇒ ∇x_t = β₁ + ∇w_t

21.4 ARIMA Models

Autoregressive polynomial
φ(z) = 1 − φ₁z − ··· − φ_p z^p,  z ∈ C and φ_p ≠ 0

Autoregressive operator
φ(B) = 1 − φ₁B − ··· − φ_p B^p

Autoregressive model of order p, AR(p)
x_t = φ₁x_{t−1} + ··· + φ_p x_{t−p} + w_t  ⟺  φ(B)x_t = w_t

AR(1)
• x_t = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j w_{t−j} → Σ_{j=0}^{∞} φ^j w_{t−j} as k → ∞ when |φ| < 1  (a linear process)
• E[x_t] = Σ_{j=0}^{∞} φ^j E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, x_t] = σ_w² φ^h / (1 − φ²)
• ρ(h) = γ(h)/γ(0) = φ^h
• ρ(h) = φ ρ(h − 1),  h = 1, 2, ...

Moving average polynomial
θ(z) = 1 + θ₁z + ··· + θ_q z^q,  z ∈ C and θ_q ≠ 0

Moving average operator
θ(B) = 1 + θ₁B + ··· + θ_q B^q

MA(q) (moving average model of order q)
x_t = w_t + θ₁w_{t−1} + ··· + θ_q w_{t−q}  ⟺  x_t = θ(B)w_t
E[x_t] = Σ_{j=0}^{q} θ_j E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, x_t] = σ_w² Σ_{j=0}^{q−h} θ_j θ_{j+h} for 0 ≤ h ≤ q (with θ₀ = 1), and 0 for h > q

MA(1)
x_t = w_t + θw_{t−1}
γ(0) = (1 + θ²)σ_w²,  γ(1) = θσ_w²,  γ(h) = 0 for h > 1
ρ(1) = θ/(1 + θ²),  ρ(h) = 0 for h > 1

ARMA(p, q)
x_t = φ₁x_{t−1} + ··· + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ··· + θ_q w_{t−q}
φ(B)x_t = θ(B)w_t

Partial autocorrelation function (PACF)
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, ..., x₁}
• φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}),  h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

ARIMA(p, d, q)
∇^d x_t = (1 − B)^d x_t is ARMA(p, q):
φ(B)(1 − B)^d x_t = θ(B)w_t

Exponentially Weighted Moving Average (EWMA)
x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σ_{j=1}^{∞} (1 − λ)λ^{j−1} x_{t−j} + w_t  when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n

Seasonal ARIMA
• Denoted by ARIMA(p, d, q) × (P, D, Q)_s
• Φ_P(B^s)φ(B)∇_s^D ∇^d x_t = δ + Θ_Q(B^s)θ(B)w_t

21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) ⟺ there exist constants {ψ_j} with Σ_{j=0}^{∞} |ψ_j| < ∞ such that x_t = Σ_{j=0}^{∞} ψ_j w_{t−j} = ψ(B)w_t

ARMA(p, q) is invertible ⟺ there exist constants {π_j} with Σ_{j=0}^{∞} |π_j| < ∞ such that π(B)x_t = Σ_{j=0}^{∞} π_j x_{t−j} = w_t

Properties
• ARMA(p, q) causal ⟺ the roots of φ(z) lie outside the unit circle
  ψ(z) = Σ_{j=0}^{∞} ψ_j z^j = θ(z)/φ(z),  |z| ≤ 1
• ARMA(p, q) invertible ⟺ the roots of θ(z) lie outside the unit circle
  π(z) = Σ_{j=0}^{∞} π_j z^j = φ(z)/θ(z),  |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models

          AR(p)                  MA(q)                  ARMA(p, q)
ACF       tails off              cuts off after lag q   tails off
PACF      cuts off after lag p   tails off              tails off
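To check causality and invertibility in practice, one can locate the roots of φ(z) and θ(z) numerically; a sketch (not in the original) with illustrative coefficients:

```python
# Check causality/invertibility of an ARMA(p, q) via the roots of phi(z) and theta(z);
# both sets of roots must lie outside the unit circle.
import numpy as np

phi = [0.5, -0.2]          # x_t = 0.5 x_{t-1} - 0.2 x_{t-2} + ...
theta = [0.4]              # ... + w_t + 0.4 w_{t-1}

# phi(z) = 1 - 0.5 z + 0.2 z^2, theta(z) = 1 + 0.4 z; np.roots expects highest degree first.
phi_roots = np.roots([-c for c in phi[::-1]] + [1.0])
theta_roots = np.roots(list(theta[::-1]) + [1.0])

print("roots of phi(z):  ", phi_roots, " all |root| > 1:", np.all(np.abs(phi_roots) > 1))
print("roots of theta(z):", theta_roots, " all |root| > 1:", np.all(np.abs(theta_roots) > 1))
```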
21.5 Spectral Analysis

Periodic process
x_t = A cos(2πωt + φ) = U₁ cos(2πωt) + U₂ sin(2πωt)
• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = −A sin φ, often normally distributed rv's

Mixture of periodic processes
x_t = Σ_{k=1}^{q} (U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t))
• U_{k1}, U_{k2}, for k = 1, ..., q, are independent zero-mean rv's with variances σ_k²
• γ(h) = Σ_{k=1}^{q} σ_k² cos(2πω_k h)
• γ(0) = E[x_t²] = Σ_{k=1}^{q} σ_k²

Spectral representation of a periodic process
γ(h) = σ² cos(2πω₀h) = (σ²/2) e^{−2πiω₀h} + (σ²/2) e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

Spectral distribution function
F(ω) = 0 for ω < −ω₀;  σ²/2 for −ω₀ ≤ ω < ω₀;  σ² for ω ≥ ω₀
• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)

Spectral density
f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2
• Needs Σ_{h=−∞}^{∞} |γ(h)| < ∞  ⇒  γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,  h = 0, ±1, ...
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ_w²
• ARMA(p, q), φ(B)x_t = θ(B)w_t:
  f_x(ω) = σ_w² |θ(e^{−2πiω})|² / |φ(e^{−2πiω})|²
  where φ(z) = 1 − Σ_{k=1}^{p} φ_k z^k and θ(z) = 1 + Σ_{k=1}^{q} θ_k z^k

Discrete Fourier Transform (DFT)
d(ω_j) = n^{−1/2} Σ_{t=1}^{n} x_t e^{−2πiω_j t}

Fourier/fundamental frequencies
ω_j = j/n

Inverse DFT
x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j) e^{2πiω_j t}

Periodogram
I(j/n) = |d(j/n)|²

Scaled periodogram
P(j/n) = (4/n) I(j/n) = ( (2/n) Σ_{t=1}^{n} x_t cos(2πtj/n) )² + ( (2/n) Σ_{t=1}^{n} x_t sin(2πtj/n) )²

22 Math

22.1 Gamma Function

• Ordinary: Γ(s) = ∫₀^∞ t^{s−1} e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt
• Lower incomplete: γ(s, x) = ∫₀^x t^{s−1} e^{−t} dt
• Γ(α + 1) = αΓ(α),  α > 0
• Γ(n) = (n − 1)!,  n ∈ N
• Γ(1/2) = √π

22.2 Beta Function

• Ordinary: B(x, y) = B(y, x) = ∫₀¹ t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫₀^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete:
  I_x(a, b) = B(x; a, b)/B(a, b);  for a, b ∈ N:  I_x(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}
• I₀(a, b) = 0,  I₁(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

22.3 Series

Finite
• Σ_{k=1}^{n} k = n(n + 1)/2
• Σ_{k=1}^{n} (2k − 1) = n²
• Σ_{k=1}^{n} k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^{n} k³ = (n(n + 1)/2)²
• Σ_{k=0}^{n} c^k = (c^{n+1} − 1)/(c − 1),  c ≠ 1

Binomial
• Σ_{k=0}^{n} C(n, k) = 2^n
• Binomial Theorem: Σ_{k=0}^{n} C(n, k) a^{n−k} b^k = (a + b)^n
• Σ_{k=0}^{n} C(r + k, k) = C(r + n + 1, n)
• Σ_{k=0}^{n} C(k, m) = C(n + 1, m + 1)
• Vandermonde's Identity: Σ_{k=0}^{r} C(m, k) C(n, r − k) = C(m + n, r)

Infinite
• Σ_{k=0}^{∞} p^k = 1/(1 − p)  and  Σ_{k=1}^{∞} p^k = p/(1 − p),  |p| < 1
• Σ_{k=1}^{∞} k p^{k−1} = d/dp (Σ_{k=0}^{∞} p^k) = 1/(1 − p)²,  |p| < 1
• Σ_{k=0}^{∞} C(r + k − 1, k) x^k = (1 − x)^{−r},  r ∈ N⁺
• Σ_{k=0}^{∞} C(α, k) p^k = (1 + p)^α,  |p| < 1, α ∈ C

22.4 Combinatorics

Sampling k out of n:
• ordered, without replacement: n!/(n − k)! (falling factorial)
• ordered, with replacement: n^k
• unordered, without replacement: C(n, k) = n!/(k!(n − k)!)
• unordered, with replacement: C(n − 1 + k, k) = C(n − 1 + k, n − 1)

Stirling numbers of the second kind
S(n, k) = k S(n − 1, k) + S(n − 1, k − 1),  1 ≤ k ≤ n
S(n, 0) = 1 if n = 0, and 0 otherwise

Partitions
P_{n,k} = number of partitions of n into k parts:
P_{n+k,k} = Σ_{i=1}^{k} P_{n,i};  P_{n,0} = 0 for n ≥ 1,  P_{0,0} = 1,  P_{n,k} = 0 for k > n

Balls and urns (|B| = n balls, |U| = m urns; D = distinguishable, ¬D = indistinguishable); number of mappings f : B → U:

                 f arbitrary              f injective           f surjective        f bijective
B: D,  U: D      m^n                      m!/(m − n)! if m ≥ n  m! S(n, m)          n! if m = n, else 0
B: ¬D, U: D      C(n + m − 1, n)          C(m, n)               C(n − 1, m − 1)     1 if m = n, else 0
B: D,  U: ¬D     Σ_{k=1}^{m} S(n, k)      1 if m ≥ n, else 0    S(n, m)             1 if m = n, else 0
B: ¬D, U: ¬D     Σ_{k=1}^{m} P_{n,k}      1 if m ≥ n, else 0    P_{n,m}             1 if m = n, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

[Figure: Univariate distribution relationships, courtesy Leemis and McQueston [2].]