Summary of Stat 531/532
Taught by Dr. Dennis Cox
Text: A work-in-progress by Dr. Cox, supplemented by Statistical Inference by Casella & Berger

Chapter 1: Probability Theory and Measure Spaces

1.1: Set Theory (Casella-Berger)
The set, $S$, of all possible outcomes of a particular experiment is called the sample space for the experiment.
An event is any collection of possible outcomes of an experiment (i.e. any subset of $S$).
$A$ and $B$ are disjoint (or mutually exclusive) if $A \cap B = \emptyset$. The events $A_1, A_2, \dots$ are pairwise disjoint if $A_i \cap A_j = \emptyset$ for all $i \neq j$.
If $A_1, A_2, \dots$ are pairwise disjoint and $\bigcup_{i=1}^{\infty} A_i = S$, then the collection $A_1, A_2, \dots$ forms a partition of $S$.

1.2: Probability Theory (Casella-Berger)
A collection of subsets of $S$ is called a Borel field (or sigma algebra), denoted by $\mathcal{B}$, if it satisfies the following three properties:
1. $\emptyset \in \mathcal{B}$
2. $A \in \mathcal{B} \Rightarrow A^c \in \mathcal{B}$
3. $A_1, A_2, \dots \in \mathcal{B} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{B}$
i.e. the empty set is contained in $\mathcal{B}$, $\mathcal{B}$ is closed under complementation, and $\mathcal{B}$ is closed under countable unions.
Given a sample space $S$ and an associated Borel field $\mathcal{B}$, a probability function is a function $P$ with domain $\mathcal{B}$ that satisfies:
1. $P(A) \geq 0$ for all $A \in \mathcal{B}$
2. $P(S) = 1$
3. If $A_1, A_2, \dots \in \mathcal{B}$ are pairwise disjoint, then $P\big(\bigcup_{i=1}^{\infty} A_i\big) = \sum_{i=1}^{\infty} P(A_i)$.
Axiom of Finite Additivity: If $A$ and $B$ are disjoint, then $P(A \cup B) = P(A) + P(B)$.

1.1: Measures (Cox)
A measure space is denoted $(\Omega, \mathcal{F}, \mu)$, where $\Omega$ is an underlying set, $\mathcal{F}$ is a sigma-algebra of subsets of $\Omega$, and $\mu$ is a measure, meaning that it satisfies:
1. $\mu(A) \geq 0$ for all $A \in \mathcal{F}$
2. $\mu(\emptyset) = 0$
3. If $A_1, A_2, \dots$ is a sequence of disjoint elements of $\mathcal{F}$, then $\mu\big(\bigcup_{i=1}^{\infty} A_i\big) = \sum_{i=1}^{\infty} \mu(A_i)$.
A measure $P$ with $P(\Omega) = 1$ is called a probability measure.
A measure on $(\mathbb{R}, \mathcal{B})$ is called a Borel measure.
The measure $\#$, where $\#(A)$ = number of elements in $A$, is called counting measure. Counting measure may be written in terms of unit point masses (UPM) as $\# = \sum_{i=1}^{\infty} \delta_{x_i}$, where $\delta_x(A) = 1$ if $x \in A$ and $0$ if $x \notin A$. A UPM measure is also a probability measure.
There is a unique Borel measure, $m$ (Lebesgue measure), satisfying $m([a,b]) = b - a$ for every finite closed interval $[a,b]$, $-\infty < a \leq b < \infty$.
Properties of Measures:
a) (monotonicity) If $A \subset B$, then $\mu(A) \leq \mu(B)$.
b) (subadditivity) $\mu\big(\bigcup_{i=1}^{\infty} A_i\big) \leq \sum_{i=1}^{\infty} \mu(A_i)$, where $A_1, A_2, \dots$ is any sequence of measurable sets.
c) (continuity from above) If $A_i$, $i = 1, 2, \dots$ is a decreasing sequence of measurable sets (i.e. $A_1 \supset A_2 \supset \cdots$) and $\mu(A_1) < \infty$, then $\lim_{i \to \infty} \mu(A_i) = \mu\big(\bigcap_{i=1}^{\infty} A_i\big)$.

1.2: Measurable Functions and Integration (Cox)
Properties of the Integral:
a) If $\int f \, d\mu$ exists and $a \in \mathbb{R}$, then $\int a f \, d\mu$ exists and equals $a \int f \, d\mu$.
b) If $\int f \, d\mu$ and $\int g \, d\mu$ both exist and $\int f \, d\mu + \int g \, d\mu$ is defined (i.e. not of the form $\infty - \infty$), then $\int (f + g) \, d\mu$ is defined and equals $\int f \, d\mu + \int g \, d\mu$.
c) If $f(\omega) \leq g(\omega)$ for all $\omega$, then $\int f \, d\mu \leq \int g \, d\mu$, provided the integrals exist.
d) $\left| \int f \, d\mu \right| \leq \int |f| \, d\mu$, provided $\int f \, d\mu$ exists.
A statement $S(\omega)$ holds $\mu$-almost everywhere ($\mu$-a.e.) iff there is a set $N \in \mathcal{F}$ with $\mu(N) = 0$ such that $S(\omega)$ is true for all $\omega \notin N$.
Consider extended Borel functions on $(\Omega, \mathcal{F}, \mu)$:
Monotone Convergence Theorem: If $f_n$ is an increasing sequence of nonnegative functions (i.e. $0 \leq f_1 \leq f_2 \leq \cdots \leq f_n \leq f_{n+1} \leq \cdots$) and $f(\omega) = \lim_n f_n(\omega)$, then $\lim_n \int f_n \, d\mu = \int f \, d\mu$.
Lebesgue's Dominated Convergence Theorem: If $f_n \to f$ $\mu$-a.e. as $n \to \infty$ and there is an integrable function $g$ with $|f_n| \leq g$ a.e. for all $n$, then $\lim_n \int f_n \, d\mu = \int f \, d\mu$. $g$ is called the dominating function.
Change of Variables Theorem: Suppose $f : (\Omega, \mathcal{F}, \mu) \to (\Lambda, \mathcal{G})$ and $g : (\Lambda, \mathcal{G}) \to \mathbb{R}$. Then $\int_{\Omega} g(f(\omega)) \, d\mu(\omega) = \int_{\Lambda} g(\lambda) \, d(\mu \circ f^{-1})(\lambda)$.
Interchange of Differentiation and Integration Theorem: For each fixed $\theta$, consider a function $g(\omega, \theta)$. If
1. the integral $\int g(\omega, \theta) \, d\mu(\omega)$ is finite,
2. the partial derivative $\partial g(\omega, \theta) / \partial \theta$ exists, and
3. the absolute value of the partial derivative is bounded by an integrable function $G(\omega)$ not depending on $\theta$,
then $\partial g(\omega, \theta) / \partial \theta$ is integrable w.r.t. $\mu$ and $\frac{d}{d\theta} \int g(\omega, \theta) \, d\mu(\omega) = \int \frac{\partial}{\partial \theta} g(\omega, \theta) \, d\mu(\omega)$.

1.3: Measures on Product Spaces (Cox)
A measure space $(\Omega, \mathcal{F}, \mu)$ is called $\sigma$-finite iff there is a sequence $\Omega_1, \Omega_2, \dots \in \mathcal{F}$ such that (i) $\mu(\Omega_i) < \infty$ for each $i$, and (ii) $\bigcup_{i=1}^{\infty} \Omega_i = \Omega$.
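The convergence theorems above are easiest to internalize with a small numerical check. Below is a minimal sketch (my own illustration, not from the notes) of the Monotone Convergence Theorem: $f(x) = 1/\sqrt{x}$ on $(0,1]$ is unbounded but Lebesgue integrable with $\int f\,dm = 2$, and the truncations $f_n = \min(f, n)$ increase to $f$, so their integrals $2 - 1/n$ should increase to 2.

```python
# Minimal numerical sketch of the Monotone Convergence Theorem (illustration only):
# f(x) = 1/sqrt(x) on (0, 1] truncated at height n; integrals of f_n increase to 2.
import numpy as np
from scipy.integrate import quad

def f(x):
    return 1.0 / np.sqrt(x)

for n in [1, 2, 5, 10, 100, 1000]:
    # f_n = min(f, n); the kink at x = 1/n^2 is passed to quad as a breakpoint
    fn = lambda x: min(f(x), n)
    integral, _ = quad(fn, 0.0, 1.0, points=[1.0 / n**2])
    print(f"n = {n:5d}   integral of f_n = {integral:.6f}   (limit = 2)")
```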
Product Measure Theorem: Let $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ be $\sigma$-finite measure spaces. Then there is a unique product measure $\mu_1 \times \mu_2$ on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2)$ such that for all $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$, $(\mu_1 \times \mu_2)(A_1 \times A_2) = \mu_1(A_1)\,\mu_2(A_2)$.

Fubini's Theorem: Let $\Omega = \Omega_1 \times \Omega_2$, $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, and $\mu = \mu_1 \times \mu_2$, where $\mu_1$ and $\mu_2$ are $\sigma$-finite. If $f$ is a Borel function on $\Omega$ whose integral w.r.t. $\mu$ exists, then
$\int_{\Omega} f(\omega)\, d\mu(\omega) = \int_{\Omega_2} \left[ \int_{\Omega_1} f(\omega_1, \omega_2)\, d\mu_1(\omega_1) \right] d\mu_2(\omega_2)$.
Conclude that $\int_{\Omega_1} f(\omega_1, \omega_2)\, d\mu_1(\omega_1)$ exists $\mu_2$-a.e. and defines a Borel function of $\omega_2$ whose integral w.r.t. $\mu_2$ exists.

A function $X : (\Omega, \mathcal{F}) \to (\mathbb{R}^n, \mathcal{B}^n)$ is called an n-dimensional random vector (a.k.a. random n-vector). The distribution of $X$ is the same as the law of $X$ (i.e. $\mathrm{Law}(X) = P \circ X^{-1}$). $\mathrm{Law}(X) = \prod_{i=1}^{n} \mathrm{Law}(X_i)$ iff the $X_i$'s are independent.

1.4: Densities and The Radon-Nikodym Theorem (Cox)
Let $\nu$ and $\mu$ be measures on $(\Omega, \mathcal{F})$. $\nu$ is absolutely continuous w.r.t. $\mu$ (written $\nu \ll \mu$) iff for all $A \in \mathcal{F}$, $\mu(A) = 0 \Rightarrow \nu(A) = 0$. This can also be said as: $\mu$ dominates $\nu$ (i.e. $\mu$ is a dominating measure for $\nu$). If both measures dominate one another, then they are equivalent.
Radon-Nikodym Theorem: Let $(\Omega, \mathcal{F}, \mu)$ be a $\sigma$-finite measure space and suppose $\nu \ll \mu$. Then there is a nonnegative Borel function $f$ such that $\nu(A) = \int_A f \, d\mu$ for all $A \in \mathcal{F}$. Furthermore, $f$ is unique $\mu$-a.e.

1.5: Conditional Expectation (Cox)
Theorem 1.5.1: Suppose $Y : (\Omega, \mathcal{F}) \to (\Lambda, \mathcal{G})$ and $Z : (\Omega, \mathcal{F}) \to (\mathbb{R}^n, \mathcal{B}^n)$. Then $Z$ is $\sigma(Y)$-measurable iff there is a Borel function $h : (\Lambda, \mathcal{G}) \to (\mathbb{R}^n, \mathcal{B}^n)$ such that $Z = h(Y)$.
Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ with $E|X| < \infty$, and suppose $\mathcal{G}$ is a sub-$\sigma$-field of $\mathcal{F}$. Then the conditional expectation of $X$ given $\mathcal{G}$, denoted $E[X \mid \mathcal{G}]$, is the essentially unique random variable $X^*$ satisfying
(i) $X^*$ is $\mathcal{G}$-measurable;
(ii) $\int_G X^* \, dP = \int_G X \, dP$ for all $G \in \mathcal{G}$.
Proposition 1.5.4: Suppose $X : (\Omega, \mathcal{F}) \to (\Lambda_1, \mathcal{G}_1)$ and $Y : (\Omega, \mathcal{F}) \to (\Lambda_2, \mathcal{G}_2)$ are random elements and $\mu_i$ is a $\sigma$-finite measure on $(\Lambda_i, \mathcal{G}_i)$ for $i = 1, 2$ such that $\mathrm{Law}(X, Y) \ll \mu_1 \times \mu_2$. Let $f_{X,Y}(x, y)$ denote the corresponding joint density, and let $g(x, y)$ be any Borel function with $E|g(X, Y)| < \infty$. Then
$E[g(X, Y) \mid Y] = \dfrac{\int g(x, Y)\, f_{X,Y}(x, Y)\, d\mu_1(x)}{\int f_{X,Y}(x, Y)\, d\mu_1(x)}$, a.s.
The conditional density of $X$ given $Y$ is $f_{X \mid Y}(x, y) = \dfrac{f_{X,Y}(x, y)}{f_Y(y)}$.
Proposition 1.5.5: Suppose the assumptions of Proposition 1.5.4 hold. Then the following are true: (i) the family of regular conditional distributions $\mathrm{Law}(X \mid Y = y)$ exists; (ii) $\mathrm{Law}(X \mid Y = y) \ll \mu_1$ for $\mathrm{Law}(Y)$-almost all values of $y$; (iii) the Radon-Nikodym derivative is given by $\dfrac{d\,\mathrm{Law}(X \mid Y = y)}{d\mu_1}(x) = f_{X \mid Y}(x, y)$, $\mu_1 \times \mu_2$-a.e.

Theorem 1.5.7 - Basic Properties of Conditional Expectation: Let $X$, $X_1$, and $X_2$ be integrable r.v.'s on $(\Omega, \mathcal{F}, P)$, and let $\mathcal{G}$ be a fixed sub-$\sigma$-field of $\mathcal{F}$.
(a) If $X = k$ a.s., $k$ a constant, then $E[X \mid \mathcal{G}] = k$ a.s.
(b) If $X_1 \leq X_2$ a.s., then $E[X_1 \mid \mathcal{G}] \leq E[X_2 \mid \mathcal{G}]$ a.s.
(c) If $a_1, a_2 \in \mathbb{R}$, then $E[a_1 X_1 + a_2 X_2 \mid \mathcal{G}] = a_1 E[X_1 \mid \mathcal{G}] + a_2 E[X_2 \mid \mathcal{G}]$ a.s.
(d) (Law of Total Expectation) $E\big\{ E[X \mid \mathcal{G}] \big\} = E[X]$.
(e) $E[X \mid \{\emptyset, \Omega\}] = E[X]$, i.e. conditioning on the trivial $\sigma$-field gives the ordinary expectation.
(f) If $X$ is $\mathcal{G}$-measurable, then $E[X \mid \mathcal{G}] = X$ a.s.
(g) (Law of Successive Conditioning) If $\mathcal{G}_0$ is a sub-$\sigma$-field of $\mathcal{G}$, then $E\big\{ E[X \mid \mathcal{G}] \mid \mathcal{G}_0 \big\} = E\big\{ E[X \mid \mathcal{G}_0] \mid \mathcal{G} \big\} = E[X \mid \mathcal{G}_0]$ a.s.
(h) If $X_1$ is $\mathcal{G}$-measurable and $X_1 X_2$ is integrable, then $E[X_1 X_2 \mid \mathcal{G}] = X_1 E[X_2 \mid \mathcal{G}]$ a.s.

Theorem 1.5.8 - Convergence Theorems for Conditional Expectation: Let $X, X_1, X_2, \dots$ be integrable r.v.'s on $(\Omega, \mathcal{F}, P)$, and let $\mathcal{G}$ be a fixed sub-$\sigma$-field of $\mathcal{F}$.
(a) (Monotone Convergence Theorem) If $0 \leq X_i \uparrow X$ a.s., then $E[X_i \mid \mathcal{G}] \uparrow E[X \mid \mathcal{G}]$ a.s.
(b) (Dominated Convergence Theorem) Suppose there is an integrable r.v. $Y$ such that $|X_i| \leq Y$ a.s. for all $i$, and suppose that $X_i \to X$ a.s. Then $E[X_i \mid \mathcal{G}] \to E[X \mid \mathcal{G}]$ a.s.

Two-Stage Experiment Theorem: Let $(\Lambda, \mathcal{G})$ be a measurable space and suppose $p : \mathcal{B}^n \times \Lambda \to [0, 1]$ satisfies the following:
(i) $p(B, \cdot)$ is Borel measurable for each fixed $B \in \mathcal{B}^n$;
(ii) $p(\cdot, y)$ is a Borel p.m. on $(\mathbb{R}^n, \mathcal{B}^n)$ for each fixed $y \in \Lambda$.
Let $\nu$ be any p.m. on $(\Lambda, \mathcal{G})$. Then there is a unique p.m. $P$ on $(\mathbb{R}^n \times \Lambda, \mathcal{B}^n \otimes \mathcal{G})$ with $P(B \times C) = \int_C p(B, y)\, d\nu(y)$ for all $B \in \mathcal{B}^n$ and $C \in \mathcal{G}$.
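The two-stage construction and the Law of Total Expectation (Theorem 1.5.7(d)) can be seen in a small simulation. The sketch below is my own example, not from the notes: draw $Y$ from a Gamma(2, scale 3) first-stage distribution, then $X \mid Y = y \sim$ Poisson$(y)$; since $E[X \mid Y] = Y$, the overall mean of $X$ should match $E[Y] = 6$.

```python
# Simulation sketch of the two-stage experiment and E[E[X|Y]] = E[X] (illustration only).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

y = rng.gamma(shape=2.0, scale=3.0, size=n)   # stage 1: Y ~ nu
x = rng.poisson(lam=y)                        # stage 2: X | Y = y ~ p(., y)

print("E[Y] ~", y.mean())                     # ~ 6
print("E[X] ~", x.mean())                     # ~ 6, matching E[E[X|Y]] = E[Y]
```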
Theorem 1.5.12 - Bayes Formula: Suppose $\Theta : (\Omega, \mathcal{F}) \to (\Lambda_2, \mathcal{G}_2)$ is a random element and let $\nu$ be a $\sigma$-finite measure on $(\Lambda_2, \mathcal{G}_2)$ with $\mathrm{Law}(\Theta) \ll \nu$. Denote the corresponding density by $\pi = \dfrac{d\,\mathrm{Law}(\Theta)}{d\nu}$. Let $\mu$ be a $\sigma$-finite measure on $(\Lambda_1, \mathcal{G}_1)$. Suppose that for each $\theta$ there is a given pdf w.r.t. $\mu$ denoted $f(\cdot \mid \theta)$, and let $X$ be a random element taking values in $\Lambda_1$ with $\dfrac{d\,\mathrm{Law}(X \mid \Theta = \theta)}{d\mu} = f(\cdot \mid \theta)$. Then there is a version of $\mathrm{Law}(\Theta \mid X = x)$ given by
$\dfrac{d\,\mathrm{Law}(\Theta \mid X = x)}{d\nu}(\theta) = \dfrac{f(x \mid \theta)\, \pi(\theta)}{\int_{\Lambda_2} f(x \mid \theta')\, \pi(\theta')\, d\nu(\theta')}$.

Chapter 2: Transformations and Expectations

2.1: Moments and Moment Inequalities (Cox)
A moment refers to an expectation of a random variable or a function of a random variable.
- The first moment is the mean.
- The kth moment is $E[X^k]$.
- The kth central moment is $E[(X - E[X])^k] = E[(X - \mu)^k]$.
- Finite second moments imply finite first moments (i.e. $E[X^2] < \infty \Rightarrow E|X| < \infty$).
Markov's Inequality: Suppose $X \geq 0$ a.s. Then for all $\epsilon > 0$, $P(X \geq \epsilon) \leq \dfrac{E[X]}{\epsilon}$.
Corollary to Markov's Inequality: $P(|X| \geq \epsilon) \leq \dfrac{E[X^2]}{\epsilon^2}$.
Chebyshev's Inequality: Suppose $X$ is a random variable with $E[X^2] < \infty$. Then for any $k > 0$, $P(|X - E[X]| \geq k\sigma_X) \leq \dfrac{1}{k^2}$.
Jensen's Inequality: Let $f$ be a convex function on a convex set $C \subset \mathbb{R}^n$ and suppose $X$ is a random n-vector with $E\|X\| < \infty$ and $X \in C$ a.s. Then $E[X] \in C$ and $f(E[X]) \leq E[f(X)]$. If $f$ is concave, the inequality reverses. If $f$ is strictly convex/concave (and $X$ is not a.s. constant), the inequality is strict.
Cauchy-Schwarz Inequality: $(E[XY])^2 \leq E[X^2]\, E[Y^2]$.
Theorem 2.1.6 (Spectral Decomposition of a Symmetric Matrix): Let $A$ be a symmetric matrix. Then there is an orthogonal matrix $U$ and a diagonal matrix $\Lambda$ such that $A = U \Lambda U^t$.
- The diagonal entries of $\Lambda$ are the eigenvalues of $A$.
- The columns of $U$ are the corresponding eigenvectors.

2.2: Characteristic and Moment Generating Functions (Cox)
The characteristic function (chf) of a random n-vector $X$ is the complex-valued function $\phi_X : \mathbb{R}^n \to \mathbb{C}$ given by $\phi_X(u) = E[e^{i u^t X}] = E[\cos(u^t X)] + i\, E[\sin(u^t X)]$.
The moment generating function (mgf) is defined as $M_X(u) = E[e^{u^t X}]$.
Theorem 2.2.1: Let $X$ be a random n-vector with chf $\phi_X$ and mgf $M_X$.
(a) (Continuity) $\phi_X$ is uniformly continuous on $\mathbb{R}^n$, and $M_X$ is continuous at every point $u$ such that $M_X(v) < \infty$ for all $v$ in a neighborhood of $u$.
(b) (Relation to moments) If $X$ is integrable, then the gradient $\nabla \phi_X = (\partial \phi_X / \partial u_1, \dots, \partial \phi_X / \partial u_n)^t$ is defined at $u = 0$ and equals $i\,E[X]$. Also, $X$ has finite second moments iff the Hessian $D^2 \phi_X(u)$ exists at $u = 0$, and then $D^2 \phi_X(0) = -E[X X^t]$.
(c) (Linear Transformation Formulae) Let $Y = AX + b$ for some $m \times n$ matrix $A$ and some m-vector $b$. Then for all $u \in \mathbb{R}^m$, $\phi_Y(u) = e^{i u^t b}\, \phi_X(A^t u)$ and $M_Y(u) = e^{u^t b}\, M_X(A^t u)$.
(d) (Uniqueness) If $Y$ is a random n-vector and $\phi_X(u) = \phi_Y(u)$ for all $u \in \mathbb{R}^n$, then $\mathrm{Law}(X) = \mathrm{Law}(Y)$. If both $M_X(u)$ and $M_Y(u)$ are defined and equal in a neighborhood of $0$, then $\mathrm{Law}(X) = \mathrm{Law}(Y)$.
(e) (Chf for Sums of Independent Random Variables) Suppose $X$ and $Y$ are independent random p-vectors and let $Z = X + Y$. Then $\phi_Z(u) = \phi_X(u)\, \phi_Y(u)$.
The cumulant generating function is $K(u) = \ln M_X(u)$.

2.3: Common Distributions Used in Statistics (Cox)
Location-Scale Families
- A family is a location family if $\mathrm{Law}_a(X) = \mathrm{Law}_0(X + a)$, i.e. $f_a(x) = f(x - a)$.
- A family is a scale family if $\mathrm{Law}_a(X) = \mathrm{Law}_1(aX)$ for $a > 0$, i.e. $f_a(x) = \frac{1}{a} f\!\left(\frac{x}{a}\right)$.
- A family is a location-scale family if $\mathrm{Law}_{a,b}(X) = \mathrm{Law}_{0,1}(aX + b)$, i.e. $f_{a,b}(x) = \frac{1}{a} f\!\left(\frac{x - b}{a}\right)$.
A class of transformations $T$ on $(\Lambda, \mathcal{G})$ is called a transformation group iff the following hold:
(i) Every $g \in T$ is measurable, $g : (\Lambda, \mathcal{G}) \to (\Lambda, \mathcal{G})$.
(ii) $T$ is closed under composition, i.e. if $g_1$ and $g_2$ are in $T$, then so is $g_1 \circ g_2$.
(iii) $T$ is closed under taking inverses, i.e. $g \in T \Rightarrow g^{-1} \in T$.
Let $T$ be a transformation group and let $\mathcal{P}$ be a family of probability measures on $(\Lambda, \mathcal{G})$. Then $\mathcal{P}$ is T-invariant iff $P \circ g^{-1} \in \mathcal{P}$ for every $P \in \mathcal{P}$ and every $g \in T$.

2.4: Distributional Calculations (Cox)
Jensen's Inequality for Conditional Expectation: if $g(\cdot, y)$ is a convex function for each $y$, then $g(E[X \mid Y = y], y) \leq E[g(X, y) \mid Y = y]$ for $\mathrm{Law}(Y)$-almost all $y$.

2.1: Distribution of Functions of a Random Variable (Casella-Berger)
Transformation for Monotone Functions:
Theorem 2.1.1 (for cdf): Let $X$ have cdf $F_X(x)$, let $Y = g(X)$, and let $\mathcal{X}$ and $\mathcal{Y}$ be defined as $\mathcal{X} = \{x : f_X(x) > 0\}$ and $\mathcal{Y} = \{y : y = g(x) \text{ for some } x \in \mathcal{X}\}$.
1. If $g$ is an increasing function on $\mathcal{X}$, then $F_Y(y) = F_X(g^{-1}(y))$ for $y \in \mathcal{Y}$.
2. If $g$ is a decreasing function on $\mathcal{X}$ and $X$ is a continuous random variable, then $F_Y(y) = 1 - F_X(g^{-1}(y))$ for $y \in \mathcal{Y}$.
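As a quick sanity check of Theorem 2.1.1 (my own example, not from the notes): with $X \sim$ Exponential(1) and the increasing map $g(x) = \sqrt{x}$, the cdf of $Y = g(X)$ should satisfy $F_Y(y) = F_X(g^{-1}(y)) = F_X(y^2)$.

```python
# Simulation sketch of Theorem 2.1.1 for an increasing transformation (illustration only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)
y = np.sqrt(x)                                  # Y = g(X), g increasing

for yy in [0.5, 1.0, 1.5]:
    empirical = np.mean(y <= yy)                # empirical F_Y(yy)
    theoretical = stats.expon.cdf(yy**2)        # F_X(g^{-1}(yy)) = F_X(yy^2)
    print(f"y = {yy}:  empirical F_Y = {empirical:.4f},  F_X(y^2) = {theoretical:.4f}")
```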
Theorem 2.1.2 (for pdf): Let $X$ have pdf $f_X(x)$ and let $Y = g(X)$, where $g$ is a monotone function. Let $\mathcal{X}$ and $\mathcal{Y}$ be defined as above. Suppose $f_X(x)$ is continuous on $\mathcal{X}$ and that $g^{-1}(y)$ has a continuous derivative on $\mathcal{Y}$. Then the pdf of $Y$ is given by
$f_Y(y) = f_X\big(g^{-1}(y)\big) \left| \dfrac{d}{dy} g^{-1}(y) \right|$ for $y \in \mathcal{Y}$, and $f_Y(y) = 0$ otherwise.

The Probability Integral Transformation:
Theorem 2.1.4: Let $X$ have a continuous cdf $F_X(x)$ and define the random variable $Y = F_X(X)$. Then $Y$ is uniformly distributed on $(0, 1)$, i.e. $P(Y \leq y) = y$, $0 < y < 1$. (Interpretation: if $X$ is continuous and $Y$ is the cdf of $X$ evaluated at $X$, then $Y \sim U(0,1)$.)

2.2: Expected Values (Casella-Berger)
The expected value or mean of a random variable $g(X)$, denoted $E\,g(X)$, is
$E\,g(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx$ if $X$ is continuous, and $E\,g(X) = \sum_x g(x) f_X(x)$ if $X$ is discrete.

2.3: Moments and Moment Generating Functions (Casella-Berger)
For each integer $n$, the nth moment of $X$ is $\mu_n' = E[X^n]$. The nth central moment of $X$ is $\mu_n = E[(X - \mu)^n]$.
- Mean = 1st moment of $X$
- Variance = 2nd central moment of $X$
The moment generating function of $X$ is $M_X(t) = E[e^{tX}]$.
NOTE: The nth moment is equal to the nth derivative of the mgf evaluated at $t = 0$, i.e. $E[X^n] = M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \big|_{t=0}$.
Useful relationship: the Binomial Formula is $\sum_{x=0}^{n} \binom{n}{x} u^x v^{n-x} = (u + v)^n$.
Lemma 2.3.1: Let $a_1, a_2, \dots$ be a sequence of numbers converging to $a$. Then $\lim_{n \to \infty} \left(1 + \frac{a_n}{n}\right)^n = e^a$.

Chapter 3: Fundamental Concepts of Statistics

3.1: Basic Notions of Statistics (Cox)
Statistics is the art and science of gathering, analyzing, and making inferences from data.
Four parts to statistics:
1. Models
2. Inference
3. Decision Theory
4. Data analysis, model checking, robustness, study design, etc.

Statistical Models
A statistical model consists of three things:
1. a measurable space $(\Omega, \mathcal{F})$,
2. a collection of probability measures $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ on $(\Omega, \mathcal{F})$, and
3. a collection of possible observable random vectors.

Statistical Inference
(i) Point Estimation: Goal - estimate the true $\theta$ from the data.
(ii) Hypothesis Testing: Goal - choose between two hypotheses: the null or the alternative.
(iii) Interval & Set Estimation: Goal - find a set $S(x)$ such that it is likely that $\theta \in S(x)$.
$\mathrm{Bias}_\theta(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta$; $\mathrm{MSE}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2] = \mathrm{Bias}_\theta^2(\hat{\theta}) + \mathrm{Var}_\theta(\hat{\theta})$.

Decision Theory
Nature vs. the Statistician: Nature picks a value of $\theta$ and generates $X$ according to $P_\theta$. There is an action space $\mathcal{A}$ of allowable decisions/actions, and the Statistician must choose a decision rule $\delta$ from a class $\mathcal{D}$ of allowable decision rules. There is also a loss function $L(\theta, a)$ based on the Statistician's chosen action and Nature's true value.
Risk is expected loss, denoted $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$.
A $\mathcal{D}$-optimal decision rule $\delta^*$ is one that satisfies $R(\theta, \delta^*) \leq R(\theta, \delta)$ for all $\theta$ and all $\delta \in \mathcal{D}$.

Bayesian Decision Theory
Suppose Nature chooses the parameter as well as the data at random. We know the distribution Nature uses for selecting $\theta$; it is $\pi$, the prior distribution. The goal is to minimize the Bayes risk:
$r(\pi, \delta) = \int R(\theta, \delta)\, d\pi(\theta) = \int\!\!\int L(\theta, \delta(x))\, f(x \mid \theta)\, d\mu(x)\, d\pi(\theta)$.
Define the posterior expected loss $\rho(x, a) = \int L(\theta, a)\, d\pi(\theta \mid x)$, where $a$ is any allowable action. We find $\delta^*$ by minimizing $\rho(x, a)$ pointwise: $\delta^*(x)$ satisfies $\rho(x, \delta^*(x)) \leq \rho(x, a)$ for all $a \in \mathcal{A}$. This then implies that $r(\pi, \delta^*) \leq r(\pi, \delta)$ for any other decision rule $\delta$.

3.2: Sufficient Statistics (Cox)
A statistic $T(X)$ is sufficient for $\theta$ iff the conditional distribution of $X$ given $T = t$ does not depend on $\theta$. Intuitively, a sufficient statistic contains all the information about $\theta$ that is in $X$.
A loss function $L(\theta, a)$ is convex if (i) the action space $\mathcal{A}$ is a convex set and (ii) for each $\theta$, $L(\theta, \cdot) : \mathcal{A} \to [0, \infty)$ is a convex function.
Rao-Blackwell Theorem: Let $X_1, X_2, \dots$ be iid random variables with pdf $f(x; \theta)$. Let $T$ be a sufficient statistic for $\theta$ and $U$ an unbiased estimator of $\theta$ which is not a function of $T$ alone. Set $\hat{\theta}(t) = E[U \mid T = t]$. Then we have that:
(1) The random variable $\hat{\theta} = \hat{\theta}(T)$ is a function of the sufficient statistic $T$ alone.
(2) $\hat{\theta}$ is an unbiased estimator of $\theta$.
(3) $\mathrm{Var}(\hat{\theta}) < \mathrm{Var}(U)$, provided $E[U^2] < \infty$.
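A simulation makes the Rao-Blackwell variance reduction concrete. The sketch below is my own example: for $X_1, \dots, X_n$ iid Bernoulli($p$), $U = X_1$ is unbiased for $p$, $T = \sum X_i$ is sufficient, and $E[U \mid T] = T/n$ (the sample mean); Rao-Blackwellizing should keep the mean at $p$ while shrinking the variance from $p(1-p)$ to $p(1-p)/n$.

```python
# Simulation sketch of the Rao-Blackwell theorem for Bernoulli data (illustration only).
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 10, 100_000

x = rng.binomial(1, p, size=(reps, n))   # reps independent samples of size n
u = x[:, 0]                              # crude unbiased estimator U = X_1
rb = x.mean(axis=1)                      # Rao-Blackwellized estimator E[U|T] = T/n

print("mean of U      :", u.mean(),  "  var:", u.var(),  " (theory:", p*(1-p), ")")
print("mean of E[U|T] :", rb.mean(), "  var:", rb.var(), " (theory:", p*(1-p)/n, ")")
```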
Factorization Theorem: $T(X)$ is a sufficient statistic for $\theta$ iff $f(x \mid \theta) = g(T(x) \mid \theta)\, h(x)$ for some functions $g$ and $h$.
If you can put the distribution in the exponential-family form $f(x \mid \theta) = \exp\!\left[\sum_{i=1}^{k} \eta_i(\theta)\, T_i(x) - B(\theta)\right] h(x)$, then the $T_i(X)$'s are sufficient statistics.

Minimal Sufficient Statistic
If $m$ is the smallest number for which $T = (T_1, \dots, T_m)$, with $T_j = T_j(X_1, \dots, X_n)$, $j = 1, \dots, m$, is a sufficient statistic for $\theta = (\theta_1, \dots, \theta_m)$, then $T$ is called a minimal sufficient statistic for $\theta$.
Alternative definition: If $T$ is a sufficient statistic for a family $\mathcal{P}$, then $T$ is minimal sufficient iff for any other sufficient statistic $S$, $T = h(S)$ $\mathcal{P}$-a.s. for some function $h$; i.e. the minimal sufficient statistic can be written as a function of any other sufficient statistic.
Proposition 4.2.5: Let $\mathcal{P}$ be an exponential family on a Euclidean space with densities $f_\theta(x) = \exp[\eta(\theta)^t T(x) - B(\theta)]\, h(x)$, where $\eta(\theta)$ and $T(x)$ are p-dimensional. Suppose there exist $\theta_0, \theta_1, \dots, \theta_p$ such that the vectors $\eta(\theta_i) - \eta(\theta_0)$, $1 \leq i \leq p$, are linearly independent in $\mathbb{R}^p$. Then $T$ is minimal sufficient for $\theta$. In particular, if the exponential family is full rank, then $T$ is minimal sufficient.

3.3: Complete Statistics and Ancillary Statistics (Cox)
A statistic $T(X)$ is complete if for every function $g$, $E_\theta[g(T)] = 0$ for all $\theta$ implies $P_\theta(g(T) = 0) = 1$ for all $\theta$.
Example: Consider the Poisson family, $\mathcal{F} = \{f(\cdot\,;\lambda) : f(x; \lambda) = e^{-\lambda} \lambda^x / x!,\ \lambda > 0\}$, with sample space $\{0, 1, 2, \dots\}$. If $E_\lambda[g(X)] = e^{-\lambda} \sum_{x=0}^{\infty} g(x) \lambda^x / x! = 0$ for all $\lambda > 0$, then every coefficient of the power series must vanish, so $g(x) = 0$ for all $x$. So $\mathcal{F}$ is complete.
A statistic $V$ is ancillary iff $\mathrm{Law}_\theta(V)$ does not depend on $\theta$, i.e. $\mathrm{Law}_\theta(V) = \mathrm{Law}_{\theta_0}(V)$ for all $\theta$.
Basu's Theorem: Suppose $T$ is a complete and sufficient statistic and $V$ is ancillary for $\theta$. Then $T$ and $V$ are independent.
A statistic $V = V(X_1, \dots, X_n)$ is:
- Location invariant iff $V(X_1 + a, \dots, X_n + a) = V(X_1, \dots, X_n)$ for all $a$
- Location equivariant iff $V(X_1 + a, \dots, X_n + a) = V(X_1, \dots, X_n) + a$ for all $a$
- Scale invariant iff $V(aX_1, \dots, aX_n) = V(X_1, \dots, X_n)$ for all $a > 0$
- Scale equivariant iff $V(aX_1, \dots, aX_n) = a\,V(X_1, \dots, X_n)$ for all $a > 0$
PPN: If the pdf is a location family and $T$ is location invariant, then $T$ is also ancillary.
PPN: If the pdf is a scale family with iid observations and $T$ is scale invariant, then $T$ is also ancillary.
A sufficient statistic has ALL the information with respect to a parameter; an ancillary statistic has NO information with respect to a parameter.

3.1: Discrete Distributions (Casella-Berger)
Uniform, U($N_0$, $N_1$): $P(X = x \mid N_0, N_1) = \dfrac{1}{N_1 - N_0 + 1}$, $x = N_0, N_0 + 1, \dots, N_1$.
- puts equal mass on each outcome (i.e. x = 1, 2, ..., N when $N_0 = 1$, $N_1 = N$)
Hypergeometric: $P(X = x \mid N, M, K) = \dfrac{\binom{M}{x}\binom{N - M}{K - x}}{\binom{N}{K}}$, $x = 0, 1, \dots, K$.
- sampling without replacement; example: N balls, M of which are one color and N - M another, and you select a sample of size K
- restriction: $\max(0, K - (N - M)) \leq x \leq \min(M, K)$
Bernoulli: $X = 1$ with probability $p$ and $X = 0$ with probability $1 - p$, $0 \leq p \leq 1$.
- has only two possible outcomes
Binomial: $P(Y = y \mid n, p) = \binom{n}{y} p^y (1 - p)^{n - y}$, $y = 0, 1, 2, \dots, n$.
- the binomial is the sum of n iid Bernoulli trials, $Y = \sum_{i=1}^{n} X_i$
- counts the number of successes in a fixed number of Bernoulli trials
Binomial Theorem: For any real numbers $x$ and $y$ and integer $n \geq 0$, $(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}$.
Poisson: $P(X = x \mid \lambda) = \dfrac{e^{-\lambda} \lambda^x}{x!}$, $x = 0, 1, \dots$
- Assumption: for a small time interval, the probability of an arrival is proportional to the length of waiting time (i.e. waiting for a bus or customers); the longer one waits, the more likely it is a bus will show up
- other applications: spatial distributions (i.e. fish in a lake)
Poisson Approximation to the Binomial Distribution: If n is large and p is small, the Binomial(n, p) pmf is well approximated by the Poisson pmf with $\lambda = np$.
Negative Binomial: $P(X = x \mid r, p) = \binom{x - 1}{r - 1} p^r (1 - p)^{x - r}$, $x = r, r + 1, \dots$
- X = trial at which the rth success occurs
- counts the number of Bernoulli trials necessary to get a fixed number of successes (i.e. the number of trials needed to get r successes)
- built from independent Bernoulli(p) trials: there must be r - 1 successes in the first x - 1 trials, and then the rth success on trial x
- can also be viewed as Y, the number of failures before the rth success, where Y = X - r
- NB(r, p) converges to a Poisson($\lambda$) distribution as $r \to \infty$ and $p \to 1$ with $r(1 - p) \to \lambda$
Geometric: $P(X = x \mid p) = p(1 - p)^{x - 1}$, $x = 1, 2, \dots$
- same as NB(1, p)
- X = trial at which the first success occurs
- the distribution is "memoryless": $P(X > s \mid X > t) = P(X > s - t)$
  - Interpretation: the probability of needing more than s trials, given that the first t trials were all failures, is the same as the probability of needing more than s - t trials starting from scratch.
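The memoryless property is easy to verify numerically. The sketch below is my own example using scipy's geometric distribution, for which $P(X > k) = (1 - p)^k$, so $P(X > s \mid X > t)$ should equal $P(X > s - t)$.

```python
# Numerical sketch of the memoryless property of the geometric distribution (illustration only).
import numpy as np
from scipy import stats

p, t, s = 0.3, 4, 9
sf = stats.geom(p).sf                    # survival function P(X > k) = (1 - p)^k

conditional = sf(s) / sf(t)              # P(X > s | X > t)
shifted = sf(s - t)                      # P(X > s - t)
print("P(X > s | X > t) =", conditional)
print("P(X > s - t)     =", shifted)
print("equal:", np.isclose(conditional, shifted))
```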
3.2: Continuous Distributions (Casella-Berger)
Uniform, U(a, b): $f(x \mid a, b) = \dfrac{1}{b - a}$ if $x \in [a, b]$, and 0 otherwise.
- spreads mass uniformly over an interval
Gamma: $f(x \mid \alpha, \beta) = \dfrac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}$, $0 < x < \infty$, $\alpha > 0$, $\beta > 0$.
- $\alpha$ = shape parameter; $\beta$ = scale parameter
- Gamma Function: $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\, dt$; $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$ for $\alpha > 0$; $\Gamma(n) = (n - 1)!$ for any integer $n > 0$; $\Gamma(1) = 1$; $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$
- if $\alpha$ is an integer, the gamma is related to the Poisson via $P(X \leq x) = P(Y \geq \alpha)$, where $Y \sim$ Poisson($x/\beta$)
- $\chi^2_p \sim$ Gamma$(\tfrac{p}{2}, 2)$
- Exponential($\beta$) ~ Gamma(1, $\beta$)
- If $X \sim$ exponential($\beta$), then $X^{1/\gamma} \sim$ Weibull($\gamma$, $\beta$)
Normal: $f(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$, $-\infty < x < \infty$.
- symmetric, bell-shaped
- often used to approximate other distributions
Beta: $f(x \mid \alpha, \beta) = \dfrac{1}{B(\alpha, \beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}$, $0 < x < 1$, $\alpha > 0$, $\beta > 0$.
- used to model proportions (since the domain is (0,1))
- Beta(1, 1) = U(0, 1)
Cauchy: $f(x \mid \theta) = \dfrac{1}{\pi}\,\dfrac{1}{1 + (x - \theta)^2}$, $-\infty < x < \infty$.
- symmetric, bell-shaped
- similar to the normal distribution, but the mean does not exist
- $\theta$ is the median
- the ratio of two independent standard normals is Cauchy
Lognormal: $f(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi}\,\sigma\, x}\, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}$, $0 < x < \infty$.
- used when the logarithm of a random variable is normally distributed
- used in applications where the variable of interest is right-skewed
Double Exponential: $f(x \mid \mu, \sigma) = \dfrac{1}{2\sigma}\, e^{-|x - \mu|/\sigma}$, $-\infty < x < \infty$.
- reflects the exponential around its mean
- symmetric with "fat" tails
- has a peak (sharp, nondifferentiable point) at $x = \mu$

3.3: Exponential Families (Casella-Berger)
A pdf is an exponential family if it can be expressed as $f(x \mid \theta) = h(x)\, c(\theta) \exp\!\left[\sum_{i=1}^{k} w_i(\theta)\, t_i(x)\right]$.
This can also be written in natural (canonical) form as $f(x \mid \eta) = h(x)\, c^*(\eta) \exp\!\left[\sum_{i=1}^{k} \eta_i\, t_i(x)\right]$.
The set $\mathcal{H} = \left\{\eta = (\eta_1, \dots, \eta_k) : \int h(x) \exp\!\left[\sum_{i=1}^{k} \eta_i t_i(x)\right] dx < \infty\right\}$ is called the natural parameter space. $\mathcal{H}$ is convex.

3.4: Location and Scale Families (Casella-Berger)
SEE COX 2.3 FOR DEFINITIONS AND EXAMPLE PDFS FOR EACH FAMILY.
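To make the exponential-family form concrete, here is a quick check (my own example) that the $N(\mu, \sigma^2)$ density fits it with $h(x) = 1/\sqrt{2\pi}$, $t_1(x) = x^2$, $t_2(x) = x$, $w_1 = -1/(2\sigma^2)$, $w_2 = \mu/\sigma^2$, and $c(\theta) = e^{-\mu^2/(2\sigma^2)}/\sigma$.

```python
# Check that the normal pdf matches its exponential-family representation (illustration only).
import numpy as np
from scipy import stats

mu, sigma = 1.5, 2.0
x = np.linspace(-5, 8, 7)

h = 1.0 / np.sqrt(2 * np.pi)
c = np.exp(-mu**2 / (2 * sigma**2)) / sigma
w1, w2 = -1.0 / (2 * sigma**2), mu / sigma**2

expfam_pdf = h * c * np.exp(w1 * x**2 + w2 * x)
direct_pdf = stats.norm.pdf(x, loc=mu, scale=sigma)
print("max abs difference:", np.max(np.abs(expfam_pdf - direct_pdf)))  # ~ 0
```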
Chapter 4: Multiple Random Variables

4.1: Joint and Marginal Distributions (Casella-Berger)
An n-dimensional random vector is a function from a sample space $S$ into $\mathbb{R}^n$, n-dimensional Euclidean space.
Let $(X, Y)$ be a discrete bivariate random vector. Then the function $f(x, y)$ from $\mathbb{R}^2$ to $\mathbb{R}$ defined by $f(x, y) = f_{X,Y}(x, y) = P(X = x, Y = y)$ is called the joint probability mass function, or joint pmf, of $(X, Y)$.
Let $(X, Y)$ be a discrete bivariate random vector with joint pmf $f_{X,Y}(x, y)$. Then the marginal pmfs of $X$ and $Y$ are $f_X(x) = \sum_y f_{X,Y}(x, y)$ and $f_Y(y) = \sum_x f_{X,Y}(x, y)$.
On the continuous side, the joint probability density function, or joint pdf, is the function $f(x, y)$ satisfying $P((X, Y) \in A) = \int\!\!\int_A f(x, y)\, dx\, dy$. The continuous marginal pdfs are $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$ and $f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx$, $-\infty < x, y < \infty$.
Bivariate Fundamental Theorem of Calculus: $\dfrac{\partial^2}{\partial x\, \partial y} F(x, y) = f(x, y)$.

4.2: Conditional Distributions and Independence (Casella-Berger)
The conditional pmf/pdf of $Y$ given that $X = x$ is the function of $y$ denoted by $f(y \mid x)$ and defined as $f(y \mid x) = P(Y = y \mid X = x) = \dfrac{f(x, y)}{f_X(x)}$.
The expected value of $g(X, Y)$ is $E[g(X, Y)] = \int\!\!\int g(x, y) f(x, y)\, dx\, dy$ (with sums replacing integrals in the discrete case).
The conditional expected value of $g(Y)$ given $X = x$ is $E[g(Y) \mid x] = \sum_y g(y) f(y \mid x)$ in the discrete case and $E[g(Y) \mid x] = \int g(y) f(y \mid x)\, dy$ in the continuous case.
The conditional variance of $Y$ given $X = x$ is $\mathrm{Var}(Y \mid x) = E[Y^2 \mid x] - \big(E[Y \mid x]\big)^2$.
$f(x, y) = f_X(x)\, f_Y(y)$ for all $(x, y)$ iff $X$ and $Y$ are independent.
Theorem 4.2.1: Let $X$ and $Y$ be independent random variables. Then
(a) For any sets $A$ and $B$, $P(X \in A, Y \in B) = P(X \in A)\, P(Y \in B)$; i.e. $\{X \in A\}$ and $\{Y \in B\}$ are independent events.
(b) Let $g(x)$ be a function only of $x$ and $h(y)$ be a function only of $y$. Then $E[g(X)\, h(Y)] = E[g(X)]\, E[h(Y)]$.
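A quick simulation (my own example) of Theorem 4.2.1(b): for independent $X \sim N(0,1)$ and $Y \sim$ Exponential(1), with $g(x) = x^2$ and $h(y) = e^{-y}$, both sides should be about $E[X^2]\,E[e^{-Y}] = 1 \cdot \tfrac{1}{2} = 0.5$.

```python
# Simulation sketch of E[g(X)h(Y)] = E[g(X)]E[h(Y)] for independent X and Y (illustration only).
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.exponential(scale=1.0, size=n)

gx, hy = x**2, np.exp(-y)
print("E[g(X)h(Y)]     ~", np.mean(gx * hy))
print("E[g(X)] E[h(Y)] ~", np.mean(gx) * np.mean(hy))   # both ~ 0.5
```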
4.3: Bivariate Transformations (Casella-Berger)
Theorem 4.3.1: If $X \sim$ Poisson($\theta$) and $Y \sim$ Poisson($\lambda$) and $X$ and $Y$ are independent, then $X + Y \sim$ Poisson($\theta + \lambda$).
Jacobian: $J = \det\begin{pmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{pmatrix} = \dfrac{\partial x}{\partial u}\dfrac{\partial y}{\partial v} - \dfrac{\partial x}{\partial v}\dfrac{\partial y}{\partial u}$.
Bivariate Transformation: $f_{U,V}(u, v) = f_{X,Y}\big(h_1(u, v), h_2(u, v)\big)\, |J|$, where $x = h_1(u, v)$ and $y = h_2(u, v)$ is the inverse transformation.
Theorem 4.3.2: Let $X$ and $Y$ be independent random variables. Let $g(x)$ be a function only of $x$ and $h(y)$ be a function only of $y$. Then $U = g(X)$ and $V = h(Y)$ are independent.

4.4: Hierarchical Models and Mixture Distributions (Casella-Berger)
A hierarchical model is a series of nested distributions. A random variable $X$ has a mixture distribution if it depends on another random variable which itself has a distribution.
Useful Expectation Identities: $E[X] = E\big[E[X \mid Y]\big]$ and $\mathrm{Var}(X) = E\big[\mathrm{Var}(X \mid Y)\big] + \mathrm{Var}\big(E[X \mid Y]\big)$.

4.5: Covariance and Correlation (Casella-Berger)
The covariance of $X$ and $Y$ is $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$.
The correlation (or correlation coefficient) of $X$ and $Y$ is $\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$.
Theorem 4.5.2: If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$.
Theorem 4.5.3: If $X$ and $Y$ are random variables and $a$ and $b$ are constants, then $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$.
Definition of the bivariate normal pdf:
$f(x, y) = \dfrac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\!\left\{ -\dfrac{1}{2(1 - \rho^2)} \left[ \left(\dfrac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho\left(\dfrac{x - \mu_X}{\sigma_X}\right)\!\left(\dfrac{y - \mu_Y}{\sigma_Y}\right) + \left(\dfrac{y - \mu_Y}{\sigma_Y}\right)^2 \right] \right\}$.
Properties of the bivariate normal pdf:
1. The marginal distribution of $X$ is $N(\mu_X, \sigma_X^2)$.
2. The marginal distribution of $Y$ is $N(\mu_Y, \sigma_Y^2)$.
3. The correlation between $X$ and $Y$ is $\rho_{XY} = \rho$.
4. For any constants $a$ and $b$, the distribution of $aX + bY$ is $N(a\mu_X + b\mu_Y,\; a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\rho\sigma_X\sigma_Y)$.

4.6: Multivariate Distributions (Casella-Berger)
Joint pmf (discrete): $f(\mathbf{x}) = f(x_1, \dots, x_n) = P(X_1 = x_1, \dots, X_n = x_n)$ for each $(x_1, \dots, x_n) \in \mathbb{R}^n$; then $P(\mathbf{X} \in A) = \sum_{\mathbf{x} \in A} f(\mathbf{x})$.
Joint pdf (continuous): $P(\mathbf{X} \in A) = \int \cdots \int_A f(\mathbf{x})\, d\mathbf{x} = \int \cdots \int_A f(x_1, \dots, x_n)\, dx_1 \cdots dx_n$.
The marginal pmf or pdf of any subset of the coordinates $(X_1, \dots, X_n)$ can be computed by integrating or summing the joint pmf or pdf over all possible values of the other coordinates.
Conditional pdf or pmf: $f(x_{k+1}, \dots, x_n \mid x_1, \dots, x_k) = \dfrac{f(x_1, \dots, x_n)}{f(x_1, \dots, x_k)}$.
Let $n$ and $m$ be positive integers and let $p_1, \dots, p_n$ be numbers satisfying $0 \leq p_i \leq 1$, $i = 1, \dots, n$, and $\sum_{i=1}^{n} p_i = 1$. Then $(X_1, \dots, X_n)$ has a multinomial distribution with m trials and cell probabilities $p_1, \dots, p_n$ if the joint pmf (i.e. DISCRETE) is
$f(x_1, \dots, x_n) = \dfrac{m!}{x_1! \cdots x_n!}\, p_1^{x_1} \cdots p_n^{x_n} = m! \prod_{i=1}^{n} \dfrac{p_i^{x_i}}{x_i!}$
on the set of $(x_1, \dots, x_n)$ such that each $x_i$ is a nonnegative integer and $\sum_{i=1}^{n} x_i = m$.
Multinomial Theorem: Let $m$ and $n$ be positive integers and let $A$ be the set of vectors $(x_1, \dots, x_n)$ such that each $x_i$ is a nonnegative integer and $\sum_{i=1}^{n} x_i = m$. Then for any real numbers $p_1, \dots, p_n$,
$(p_1 + \cdots + p_n)^m = \sum_{\mathbf{x} \in A} \dfrac{m!}{x_1! \cdots x_n!}\, p_1^{x_1} \cdots p_n^{x_n}$.
Recall: when $\sum_i p_i = 1$, the Multinomial Theorem gives $1^m = 1$, confirming the multinomial pmf sums to 1.
Note: A linear combination of independent normal random variables is normally distributed.

4.7: Inequalities and Identities (Casella-Berger)
NUMERICAL INEQUALITIES:
Lemma 4.7.1: Consider $a > 0$ and $b > 0$ and let $p > 1$ and $q > 1$ satisfy $\frac{1}{p} + \frac{1}{q} = 1$. Then $\frac{a^p}{p} + \frac{b^q}{q} \geq ab$, with equality only if $a^p = b^q$.
Hölder's Inequality: Let $X$ and $Y$ be random variables and let $p$ and $q$ satisfy Lemma 4.7.1. Then $E|XY| \leq \big(E|X|^p\big)^{1/p} \big(E|Y|^q\big)^{1/q}$.
When $p = q = 2$, this yields the Cauchy-Schwarz Inequality: $E|XY| \leq \sqrt{E[X^2]\, E[Y^2]}$.
Liapunov's Inequality: $\big(E|X|^r\big)^{1/r} \leq \big(E|X|^s\big)^{1/s}$ for $1 < r < s < \infty$.
NOTE: Expectations are not necessary; these inequalities can also be applied to numerical sums, such as this version of Hölder's Inequality: $\sum_i |a_i b_i| \leq \big(\sum_i |a_i|^p\big)^{1/p} \big(\sum_i |b_i|^q\big)^{1/q}$.
FUNCTIONAL INEQUALITIES:
Convex functions "hold water" (i.e. are bowl-shaped), whereas concave functions "spill water". Equivalently, if you connect two points on the graph with a line segment, a convex function lies below the segment and a concave function lies above it.
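A small simulation sketch (my own example) of Hölder's inequality with $p = 3$ and $q = 3/2$ (so $1/p + 1/q = 1$): $E|XY| \leq (E|X|^p)^{1/p}(E|Y|^q)^{1/q}$. The variables are deliberately made dependent; the inequality holds regardless.

```python
# Simulation sketch of Holder's inequality (illustration only).
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.standard_normal(n)
y = np.abs(x) + rng.exponential(scale=1.0, size=n)   # dependent on x

p, q = 3.0, 1.5
lhs = np.mean(np.abs(x * y))
rhs = np.mean(np.abs(x)**p)**(1/p) * np.mean(np.abs(y)**q)**(1/q)
print("E|XY|                        ~", lhs)
print("(E|X|^p)^(1/p)(E|Y|^q)^(1/q) ~", rhs)
print("Holder holds:", lhs <= rhs)
```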
Jensen's Inequality: For any random variable $X$, if $g(x)$ is a convex (bowl-shaped) function, then $E[g(X)] \geq g(E[X])$.
Covariance Inequality: Let $X$ be a random variable and $g(x)$ and $h(x)$ any functions such that $E[g(X)]$, $E[h(X)]$, and $E[g(X)h(X)]$ exist. Then
(a) If $g(x)$ is a nondecreasing function and $h(x)$ is a nonincreasing function, then $E[g(X)h(X)] \leq E[g(X)]\, E[h(X)]$.
(b) If $g(x)$ and $h(x)$ are either both nondecreasing or both nonincreasing, then $E[g(X)h(X)] \geq E[g(X)]\, E[h(X)]$.
PROBABILITY INEQUALITY
Chebychev's Inequality: Let $X$ be a random variable and let $g(x)$ be a nonnegative function. Then for any $r > 0$, $P(g(X) \geq r) \leq \dfrac{E[g(X)]}{r}$. Taking $g(x) = (x - \mu)^2/\sigma^2$ and $r = k^2$ gives the familiar form $P(|X - \mu| \geq k\sigma) \leq \dfrac{1}{k^2}$.
IDENTITIES
Recursive Formula for the Poisson Distribution: $P(X = x) = \dfrac{\lambda}{x}\, P(X = x - 1)$, $x = 1, 2, \dots$, with $P(X = 0) = e^{-\lambda}$.
Theorem 4.7.1: Let $X_{\alpha,\beta}$ denote a gamma($\alpha$, $\beta$) random variable with pdf $f(x \mid \alpha, \beta)$, where $\alpha > 1$. Then for any constants $a$ and $b$,
$P(a < X_{\alpha,\beta} < b) = \beta\big(f(a \mid \alpha, \beta) - f(b \mid \alpha, \beta)\big) + P(a < X_{\alpha - 1, \beta} < b)$.
Stein's Lemma: Let $X \sim N(\mu, \sigma^2)$ and let $g$ be a differentiable function satisfying $E|g'(X)| < \infty$. Then $E[g(X)(X - \mu)] = \sigma^2\, E[g'(X)]$.
Theorem 4.7.2: Let $\chi^2_p$ denote a chi-squared random variable with p degrees of freedom. For any function $h(x)$, $E\big[h(\chi^2_p)\, \chi^2_p\big] = p\, E\big[h(\chi^2_{p+2})\big]$, provided the expectations exist.
Theorem 4.7.3 (Hwang): Let $g(x)$ be a function with $-\infty < E[g(X)] < \infty$ and $-\infty < g(-1) < \infty$. Then
(a) If $X \sim$ Poisson($\lambda$), then $E[\lambda\, g(X)] = E[X\, g(X - 1)]$.
(b) If $X \sim$ Negative Binomial($r$, $p$), then $E[(1 - p)\, g(X)] = E\!\left[\dfrac{X}{r + X - 1}\, g(X - 1)\right]$.
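Stein's Lemma is easy to check by simulation. The sketch below is my own example with $X \sim N(\mu, \sigma^2)$ and $g(x) = x^3$, so both sides should be approximately $\sigma^2\, E[3X^2]$ (about 60 for the parameters used).

```python
# Simulation sketch of Stein's lemma E[g(X)(X - mu)] = sigma^2 E[g'(X)] (illustration only).
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=2_000_000)

lhs = np.mean(x**3 * (x - mu))          # E[g(X)(X - mu)] with g(x) = x^3
rhs = sigma**2 * np.mean(3 * x**2)      # sigma^2 E[g'(X)]
print("E[g(X)(X-mu)]    ~", lhs)        # both ~ 60 for these parameters
print("sigma^2 E[g'(X)] ~", rhs)
```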