Stigler 1-4
Cameron Boulanger
October 2023

1 Introduction

Chapter 1

[[Conditional Probability]] [[Counting]] [[Random Variables]] [[Binomial Experiments]] [[Probability Distributions]]

Probabilities of Events

**Properties of Probability**

scaling: $P(S) = 1$ and $0 \le P(E)$ for all $E$ in $S$

additivity: if $E$ and $F$ are mutually exclusive, $P(E \cup F) = P(E) + P(F)$

complementarity: $P(E) + P(E^c) = 1$ for all $E$ in $S$

general additivity: for any $E$ and $F$ in $S$, $P(E \cup F) = P(E) + P(F) - P(E \cap F)$

finite additivity: for any finite collection of mutually exclusive events $E_1, E_2, \ldots, E_n$,
$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$$

countable additivity: for any countably infinite collection of mutually exclusive events $E_1, E_2, \ldots$ in $S$,
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$$

**Sample Space**: the set of all possible outcomes, denoted $S$

**An Event**: a set of possible outcomes (say $E$), which is a subset of $S$

**Mutually Exclusive**: if $E$ and $F$ have no common elements they are said to be mutually exclusive ($E \cap F = \emptyset$)

Conditional Probability

**Conditional probability**: the probability that $E$ occurs given that $F$ has occurred is defined to be
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)} \quad \text{if } P(F) > 0$$
It can be thought of as a relative probability: $P(E \mid F)$ is the probability of $E$ relative to the reduced sample space consisting of only those outcomes in the event $F$.

**General Multiplication** For any events $E$ and $F$ in $S$, $P(E \cap F) = P(F)P(E \mid F)$.

**Independent events** We say events $E$ and $F$ in $S$ are independent if $P(E) = P(E \mid F)$. If $E$ and $F$ are independent then $P(E \mid F) = P(E \mid F^c)$, and $P(E \cap F) = P(E) \cdot P(F)$.

Counting

**Permutations** The number of ways of choosing $r$ objects from $n$ distinguishable objects where the order of choice makes a difference. The number of permutations of $n$ objects taken $r$ at a time is given by
$$P_{r,n} = \frac{n!}{(n-r)!}$$
and we also have
$$P_{r,n} = \frac{1 \cdot 2 \cdot 3 \cdots n}{1 \cdot 2 \cdots (n-r)} = (n-r+1)\cdots(n-1)n$$

**Why divide by $(n-r)!$** After choosing $r$ items, there are $n-r$ items left, and we are not arranging them. So we divide by the number of arrangements of these remaining items. E.g., for arranging 2 letters out of A, B, C, the numerator $3!$ represents all arrangements of A, B, C, but we only want arrangements of 2 letters, so we divide by the arrangements of the remaining 1 letter, $(3-2)!$.

**Combinations** The number of ways of choosing $r$ objects from $n$ distinguishable objects where the order of choice makes no difference:
$$\binom{n}{r} = C_{r,n} = \frac{n!}{r!(n-r)!}$$

**Why divide by $r!$** In permutations, different orders of the same set of items are counted separately. But in combinations, we don't want to count them separately. So we divide by the number of arrangements (permutations) of those $r$ items, which is $r!$. E.g., for selecting 2 letters out of A, B, C, we calculate all arrangements ($3!$), divide by the arrangements of the chosen 2 ($2!$), and divide by the arrangements of the remaining 1 ($1!$).

We can also derive
$$P_{r,n} = \frac{1 \cdot 2 \cdot 3 \cdots n}{1 \cdot 2 \cdots (n-r)} = (n-r+1)\cdots(n-1)n$$
and
$$\binom{n}{r} = \frac{P_{r,n}}{r!} = \frac{(n-r+1)\cdots(n-1)n}{1 \cdot 2 \cdot 3 \cdots (r-1)r}$$

Stirling's Formula

**Stirling's Formula**
$$\log_e(n!) \approx \frac{1}{2}\log_e(2\pi) + \left(n + \frac{1}{2}\right)\log_e(n) - n$$
thus
$$n! \approx \sqrt{2\pi}\, n^{n+\frac{1}{2}} e^{-n}$$
where $\approx$ means the ratio of the two sides tends to 1 as $n$ increases.
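As a quick numerical check of how fast that ratio approaches 1, here is a minimal Python sketch (assuming only the standard library) comparing $n!$ with the Stirling approximation:

```python
import math

def stirling(n):
    """Stirling approximation: sqrt(2*pi) * n**(n + 1/2) * exp(-n)."""
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

for n in (1, 5, 10, 50, 100):
    exact = math.factorial(n)
    # the ratio of the two sides tends to 1 as n increases
    print(n, exact / stirling(n))
```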
This formula can be used to derive approximations for both $P_{r,n}$ and $\binom{n}{r}$:
$$P_{r,n} \approx \left(1 - \frac{r}{n}\right)^{-(n+1/2)}(n-r)^r e^{-r}$$
$$\binom{n}{r} \approx \frac{1}{\sqrt{2\pi n}}\left(1 - \frac{r}{n}\right)^{-(n-r+1/2)}\left(\frac{r}{n}\right)^{-(r+1/2)}$$

Random Variables

**Random Variable** a function that assigns a numerical value to each outcome in $S$; a real-valued function defined on $S$

**Discrete Random Variables** random variables whose values can be listed sequentially ($0, 1, 2, 3, \ldots$)

**Probability distributions of discrete random variables** A list of the possible values of a discrete random variable together with the probabilities of these values. The probability that the random variable $X$ is equal to the possible value $x$ is denoted by $p_X(x)$, or, when there is no likely confusion, $p(x)$. The probability distribution can be described in many ways, e.g.
$$p_X(x) = \binom{3}{x}\left(\frac{1}{2}\right)^3 \quad \text{for } x = 0, 1, 2, 3$$

**Cumulative distribution function** (CDF)
$$F_X(x) = P(X \le x) = \sum_{a \le x} p_X(a)$$
A helpful interpretation is that the probability distribution can be thought of as a distribution of unit mass spread across the real line; thus $F_X(x)$ gives the cumulative mass starting from the left, up to and including the point $x$. You can recover the probability of $x$ from $F_X(x)$ as
$$p_X(x) = F_X(x) - F_X(x-1)$$

Binomial Experiments

**The class of Binomial experiments** Binomial experiments are characterized by:
- The experiment consists of a series of $n$ independent trials
- The possible outcomes of a single trial are classified as one of two types: $A$: success, $A^c$: failure
- The probability of success on a single trial, $P(A) = \theta$, is the same for all $n$ trials; this probability $\theta$ is called the **Parameter of the Experiment**

For binomial experiments we are usually only interested in a numerical summary of the outcome, the random variable
$$X = \text{number of successes} = \text{number of } A\text{'s}$$

**The Binomial $(n, \theta)$ Distributions**
$$b(x; n, \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n$$
and $= 0$ otherwise.

**Parametric Family of probability distributions** A family of probability distributions where the parameters determine the distribution, such that for every possible value of the parameters we have a different distribution. The binomial distributions are an example, with $n$ and $\theta$ as parameters.

**Bernoulli Trials** Instead of conducting a fixed number $n$ of trials, the trials are conducted until a fixed number $r$ of successes have been observed. Because it reverses the original procedure, this is called a **negative binomial experiment**. In this type of experiment the random variable of interest is
$$Z = \text{number of "failures" before the } r\text{th "success"}$$
For $r = 1$ we have
$$p_Z(z) = (1-\theta)^z\theta \quad \text{for } z = 0, 1, 2, \ldots$$

**Negative Binomial Distribution** The probability distribution of the number of failures $Z$ before the $r$th success in a series of Bernoulli trials with probability of success $\theta$ is
$$nb(z; r, \theta) = \binom{r+z-1}{r-1}\theta^r(1-\theta)^z \quad \text{for } z = 0, 1, 2, \ldots$$
and $= 0$ otherwise.
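To make the negative binomial pmf concrete, here is a small simulation sketch in Python (standard library only; the values $r = 3$, $\theta = 0.4$ are illustrative choices, not from the text) that counts failures before the $r$th success and compares the empirical frequencies with $\binom{r+z-1}{r-1}\theta^r(1-\theta)^z$:

```python
import math
import random
from collections import Counter

random.seed(0)
r, theta, trials = 3, 0.4, 100_000

def failures_before_rth_success(r, theta):
    """Run Bernoulli(theta) trials until r successes; return the failure count."""
    successes = failures = 0
    while successes < r:
        if random.random() < theta:
            successes += 1
        else:
            failures += 1
    return failures

counts = Counter(failures_before_rth_success(r, theta) for _ in range(trials))

def nb_pmf(z, r, theta):
    return math.comb(r + z - 1, r - 1) * theta**r * (1 - theta) ** z

for z in range(6):
    print(z, counts[z] / trials, nb_pmf(z, r, theta))
```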
**Relations between the two binomial distributions** Let $B(x; n, \theta) = P(X \le x)$ and $NB(z; r, \theta) = P(Z \le z)$ be their respective CDFs. If we compute $X$ and $Z$ from the same number of trials, we have $X \ge r$ if and only if $Z \le n - r$. Since $P(X \ge r) = 1 - P(X \le r-1)$, this means
$$NB(n-r; r, \theta) = 1 - B(r-1; n, \theta)$$
The binomial distributions have other symmetries, in particular
$$B(x; n, \theta) = 1 - B(n-x-1; n, 1-\theta)$$

Continuous Distributions

**Continuous Random Variable** A random variable whose possible values form an interval.

**Probability Density Functions** Non-negative functions which give the probability of an interval through the area under the function over that interval. We define $f_X(x)$ to be the probability density function of the continuous random variable $X$ if for any numbers $c$ and $d$ with $c < d$
$$P(c < X \le d) = \int_c^d f_X(x)\,dx$$
The following statements are necessarily true: $f_X(x) \ge 0$ for all $x$, and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

We can think of $f_X(x)\,dx$ (height $f_X(x)$ times base $dx$) as the probability that $X$ falls in an infinitesimal interval at $x$:
$$P(x < X \le x + dx) = f_X(x)\,dx$$
$P(X = c) = 0$ for any $c$, as a consequence of using probability densities to describe distributions.

**Cumulative Distribution Function of a continuous random variable** For a continuous random variable $X$,
$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du$$
and we see that $F_X(x)$ is the area under $f_X(x)$ to the left of $x$. In this case, although $F_X(x)$ is non-decreasing as in the discrete case, it is no longer a jump function. It also gives another way to describe continuous distributions:
$$\frac{d}{dx}\int_{-\infty}^{x} f_X(u)\,du = f_X(x)$$
so
$$\frac{d}{dx}F_X(x) = f_X(x)$$

**Exponential Distribution** Consider an experiment where $X$ is the time until failure; $X$ is a continuous random variable with possible values $\{x : 0 \le x < \infty\}$. To specify a class of probability distributions for $X$, we would expect the probability of surviving beyond time $t$, $P(X > t)$, to decrease as $t \to \infty$. One class of decreasing functions which also have $P(X > 0) = 1$ are the exponentially decreasing functions $P(X > t) = C^t$ where $0 < C < 1$. Equivalently, writing $C = e^{-\theta}$ where $\theta$ is a fixed parameter, we have
$$P(X > t) = e^{-\theta t}, \quad (t \ge 0)$$
and
$$F_X(t) = P(X \le t) = 1 - e^{-\theta t}$$
The corresponding probability density function follows from differentiation:
$$f_X(t) = \theta e^{-\theta t} \ \text{ for } t \ge 0, \qquad f_X(t) = 0 \ \text{ for } t < 0$$

Transformations of Random Variables

**Strictly monotone transformations** If $Y = h(X)$ is a strictly monotone transformation of $X$, we can solve for $X$ in terms of $Y$, that is, find the **inverse transformation** $X = g(Y)$. If $Y = h(X) = 2X + 3$ then $X = g(Y) = \frac{Y-3}{2}$; if $Y = h(X) = \log_e(X)$ for $X > 0$ then $X = g(Y) = e^Y$; if $Y = h(X) = X^2$ for $X > 0$ then $X = g(Y) = \sqrt{Y}$.

**Discrete case** If $p_X(x)$ is the probability function of $X$, then the probability function of $Y$ is
$$p_Y(y) = P(Y = y) = P(h(X) = y) = P(X = g(y)) = p_X(g(y))$$
Example: $X$ has the binomial distribution seen previously, $p_X(x) = \binom{3}{x}(.5)^3$ for $x = 0, 1, 2, 3$, and $Y = X^2$. What is the distribution of $Y$? We have $g(y) = +\sqrt{y}$ and thus $p_Y(y) = p_X(\sqrt{y}) = \binom{3}{\sqrt{y}}(.5)^3$ for $y = 0, 1, 4, 9$, which gives: $p_Y(y) = \frac{1}{8}$ for $y = 0$, $\frac{3}{8}$ for $y = 1$, $\frac{3}{8}$ for $y = 4$, $\frac{1}{8}$ for $y = 9$, and $= 0$ otherwise.
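A minimal Python check of the discrete transformation rule $p_Y(y) = p_X(g(y))$ for this example (standard library only):

```python
import math

def p_X(x):
    """Binomial(3, 0.5) probability function from the earlier example."""
    return math.comb(3, x) * 0.5**3

# Y = X^2, inverse transformation g(y) = sqrt(y)
p_Y = {x**2: p_X(x) for x in range(4)}
print(p_Y)                # {0: 0.125, 1: 0.375, 4: 0.375, 9: 0.125}
print(sum(p_Y.values()))  # 1.0
```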
**Continuous Case** We have
$$f_Y(y) = f_X(g(y)) \cdot \left|\frac{dg(y)}{dy}\right|$$
When $|g'(y)|$ is small, $x = g(y)$ changes slowly as $y$ changes, and we scale down; when $|g'(y)|$ is large we scale up, as $x$ changes rapidly with $y$. To show the correctness of this factor in the equation, compute $P(Y \le a)$ in two different ways. First,
$$P(Y \le a) = \int_{-\infty}^{a} f_Y(y)\,dy$$
by definition of $f_Y(y)$. Second, supposing for the moment that $h(x)$ is monotone increasing, we have
$$P(Y \le a) = P(h(X) \le a) = P(X \le g(a)) = \int_{-\infty}^{g(a)} f_X(x)\,dx$$
Now change variables: $x = g(y)$ and $dx = g'(y)\,dy$, and we have
$$P(Y \le a) = \int_{-\infty}^{a} f_X(g(y))g'(y)\,dy$$
Differentiating both expressions with respect to $a$, we have
$$f_Y(y) = f_X(g(y))g'(y)$$

**Example** Let $X$ be the time to failure of the first of two lightbulbs, and $Y$ the probability that the second lightbulb lasts longer than the first. We have $Y = h(X) = e^{-\theta X}$, and the random time $X$ has density $f_X(x) = \theta e^{-\theta x}$, $(x \ge 0)$. Now $\log_e(Y) = -\theta X$, and the inverse transformation is $X = g(Y) = -\frac{\log_e(Y)}{\theta}$. We find $g'(y) = -\frac{1}{\theta}\cdot\frac{1}{y}$ and
$$|g'(y)| = \frac{1}{\theta y}, \quad (y > 0)$$
Then $f_Y(y) = f_X(g(y))|g'(y)|$ and, noting that $f_X(g(y)) = 0$ for $y \le 0$ and $y > 1$, we have
$$f_Y(y) = \theta e^{-\theta(-\log(y)/\theta)} \cdot \frac{1}{\theta y}$$
or, since $\theta e^{-\theta(-\log(y)/\theta)} = \theta y$,
$$f_Y(y) = 1, \quad (0 < y \le 1)$$
We recognize this as the **Uniform (0,1) distribution**.

**Probability Integral Transformation** The transformation $h(x) = F_X(x)$. To find the distribution of $Y = h(X)$ we need to differentiate $g(y) = F_X^{-1}(y)$, the **Inverse Cumulative Distribution Function**: the function that for each $y$, $0 < y < 1$, gives the value of $x$ for which $F_X(x) = y$.

Aside: for continuous random variables $X$ with densities, $F_X(x)$ is continuous and thus such an $x$ will exist for all $0 < y < 1$; for more general random variables, $F_X^{-1}(y)$ can be defined as $F_X^{-1}(y) = \inf\{x : F_X(x) \ge y\}$.

The derivative of $g(y) = F_X^{-1}(y)$ can be found by implicit differentiation of $y = F_X(x)$:
$$1 = \frac{d}{dy}F_X(x) = f_X(x)\cdot\frac{dx}{dy}$$
by the chain rule, and so
$$\frac{dx}{dy} = \frac{1}{f_X(x)}$$
or, with $x = g(y) = F_X^{-1}(y)$,
$$g'(y) = \frac{d}{dy}F_X^{-1}(y) = \frac{1}{f_X(F_X^{-1}(y))}$$
But then
$$f_Y(y) = f_X(g(y))|g'(y)| = f_X(F_X^{-1}(y))\cdot\frac{1}{f_X(F_X^{-1}(y))} = 1, \quad (0 < y < 1)$$
and $= 0$ otherwise.

Chapter 2

[[Expectation]] [[Probability Distributions]] [[Linear Transformations]] [[Transformations]] [[Normal Distributions]]

**General** If we state that the random variable $X$ has a binomial $(5, .3)$ distribution, we are saying that the probability distribution is given by $P(X = x) = \binom{5}{x}(.3)^x(.7)^{5-x}$ for $x = 0, 1, 2, 3, 4, 5$. Analogously, if $X$ has an exponential $(2.3)$ distribution we mean $f_X(x) = 2.3e^{-2.3x}$ for $x \ge 0$.

Expectation

The expectation of a random variable is a weighted average of its possible values, weighted by its probability distribution. The expectation of $X$ is denoted $E(X)$.

**Discrete Case**
$$E(X) = \sum_x x\,p_X(x)$$

**Continuous Case**
$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$$

We will refer to $E(X)$ as both the *mean* and the *expected value* of $X$. The notation suggests that $E(X)$ is a function of $X$; however, $E(X)$ is a number and is better described as a function of the probability distribution of $X$. The expectation of $X$ summarizes the distribution of $X$ by describing its center.

**Alternate descriptor of center** Another descriptor of center is the mode, which is the most probable value in the discrete case, and the value with the highest density in the continuous case (found where $\frac{d}{dx}f_X(x) = 0$).
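As a small numerical illustration of the two definitions of $E(X)$, here is a Python sketch (standard library only; the integration range and step size are arbitrary discretization choices) computing the discrete expectation for the binomial $(3, .5)$ example and approximating the continuous expectation for the exponential $(2.3)$ density by a Riemann sum:

```python
import math

# Discrete case: E(X) = sum of x * p_X(x), for the binomial(3, 0.5) example
E_discrete = sum(x * math.comb(3, x) * 0.5**3 for x in range(4))
print(E_discrete)  # 1.5

# Continuous case: E(X) = integral of x * f_X(x) dx, with f_X(x) = 2.3 * exp(-2.3 x), x >= 0
theta, dx = 2.3, 1e-4
E_continuous = sum(x * theta * math.exp(-theta * x) * dx
                   for x in (i * dx for i in range(200_000)))
print(E_continuous)  # roughly 1/2.3 ≈ 0.4348
```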
The Beta Distribution

The Beta $(\alpha, \beta)$ distribution is a member of a parametric family of distributions that will be useful for inference.

**The Gamma Function** A generalization of $n$ factorial; a definite integral that was studied by Euler,
$$\Gamma(a) = \int_0^{\infty} x^{a-1}e^{-x}\,dx$$
If $a$ is an integer, this integral can be evaluated by repeated integration-by-parts to give $\Gamma(n) = (n-1)!$. The function has the properties that
$$\Gamma(a) = (a-1)\Gamma(a-1), \quad \forall a > 1$$
$$\Gamma(.5) = \sqrt{\pi}$$
Stirling's formula also applies here:
$$\Gamma(a+1) \approx \sqrt{2\pi}\, a^{a+\frac{1}{2}} e^{-a}$$

The family of Beta Distributions

**Probability density function** The probability density function is
$$f_X(x) = \begin{cases} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1} & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
where $\alpha, \beta > 0$. When $\alpha = \beta = 1$ we have $f_X(x) = 1$ for $0 \le x \le 1$, which is the Uniform (0,1) distribution. If $\alpha = \beta$ the density is symmetric about $1/2$; the larger $\alpha, \beta$ are, the more concentrated the distribution is around its center.

If $\alpha, \beta$ are integers then we can say that
$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{(\alpha+\beta-1)!}{(\alpha-1)!(\beta-1)!} = \frac{(\alpha+\beta-2)!\,(\alpha+\beta-1)}{(\alpha-1)!(\beta-1)!} = (\alpha+\beta-1)\binom{\alpha+\beta-2}{\alpha-1}$$
Thus we can rewrite the density as
$$f_X(x) = (\alpha+\beta-1)\binom{\alpha+\beta-2}{\alpha-1}x^{\alpha-1}(1-x)^{\beta-1}$$

The name for the beta distribution comes from a fact in classical analysis:
$$B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx$$
is called the **Beta Function**, and it relates to the gamma function as
$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
which is the reciprocal of the coefficient of $f_X(x)$. Thus it becomes evident that
$$\int_0^1 f_X(x)\,dx = 1$$

**Expectation for the Beta Distribution** For $\alpha, \beta > 0$ the expectation is
$$E(X) = \frac{\alpha}{\alpha+\beta}$$
To show this, utilize $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$. By definition
$$E(X) = \int_0^1 x\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 x^{\alpha}(1-x)^{\beta-1}\,dx$$
Now we manipulate the integrand by rescaling it so it becomes a probability density, multiplying the rest of the expression by the reciprocal to preserve the value:
$$E(X) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}\int_0^1 \frac{\Gamma(\alpha+\beta+1)}{\Gamma(\alpha+1)\Gamma(\beta)}x^{\alpha}(1-x)^{\beta-1}\,dx$$
The integrand is now a Beta density, so the integral equals 1 and
$$E(X) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}\cdot 1 = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)}\cdot\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} = \frac{\alpha}{\alpha+\beta}$$

Expectations of Transformations

Suppose we have the random variable $Y = h(X)$ where $X$ is also a random variable. To find $E(Y)$ we could first find
$$f_Y(y) = f_X(g(y))|g'(y)|, \quad \text{where } g = h^{-1}$$
and then calculate
$$E(Y) = \int_{-\infty}^{\infty} y f_Y(y)\,dy$$
or, combining the steps,
$$E(Y) = \int_{-\infty}^{\infty} y f_X(g(y))|g'(y)|\,dy$$
However this is unnecessary; there is a simpler method for both the discrete and continuous cases.

**Discrete Case**
$$E(h(X)) = \sum_x h(x)p_X(x)$$

**Continuous Case**
$$E(h(X)) = \int_{-\infty}^{\infty} h(x)f_X(x)\,dx$$

To see why this works, consider that we are making the following change of variable: $x = g(y)$, $y = h(x)$, $dx = g'(y)\,dy$; thus we have
$$E(Y) = \int_{-\infty}^{\infty} y f_X(g(y))|g'(y)|\,dy = \int_{-\infty}^{\infty} h(x)f_X(x)\,dx$$
In general $E(h(X)) \ne h(E(X))$, although equality does hold for linear transformations.

**Example** Suppose $X$ has the standard Normal distribution with density
$$\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}, \quad -\infty < x < \infty$$
What is $E(Y)$ for $Y = X^2$? The distribution of $Y$ is the Chi-square distribution with density
$$f_Y(y) = \frac{1}{\sqrt{2\pi y}}e^{-y/2}, \quad y > 0$$
Thus
$$E(Y) = \int_0^{\infty} y\cdot\frac{1}{\sqrt{2\pi y}}e^{-y/2}\,dy = \frac{1}{\sqrt{2\pi}}\int_0^{\infty}\sqrt{y}\,e^{-y/2}\,dy$$
We can evaluate this integral by making the change of variable $z = \frac{y}{2}$, $dz = \frac{1}{2}dy$:
$$\int_0^{\infty}\sqrt{y}\,e^{-y/2}\,dy = 2\sqrt{2}\int_0^{\infty}z^{1/2}e^{-z}\,dz = 2\sqrt{2}\,\Gamma(1.5) = 2\sqrt{2}\cdot(.5)\Gamma(.5) = \sqrt{2}\cdot\sqrt{\pi} = \sqrt{2\pi}$$
Thus
$$E(Y) = \frac{1}{\sqrt{2\pi}}\cdot\sqrt{2\pi} = 1$$
Alternatively we could obtain this by calculating
$$E(X^2) = \int_{-\infty}^{\infty}x^2\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx$$
which would have been easier had we not been given $f_Y(y)$.
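A short numerical check of $E(h(X)) = \int h(x)f_X(x)\,dx$ for this example, assuming Python with the standard library; both integrals are truncated and discretized, so the answers are only approximate:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

dx = 1e-3
xs = [-10 + i * dx for i in range(20_000)]

# E(X^2) via the simpler method: integrate x^2 * phi(x)
E_X2 = sum(x * x * phi(x) * dx for x in xs)

# E(Y) via the chi-square density: integrate y * (2*pi*y)**-0.5 * exp(-y/2)
dy = 1e-3
ys = [(i + 0.5) * dy for i in range(50_000)]
E_Y = sum(y / math.sqrt(2 * math.pi * y) * math.exp(-y / 2) * dy for y in ys)

print(E_X2, E_Y)  # both close to 1
```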
Linear Transformations

The simplest and most used of the transformations. An example is $Y = h(X) = aX + b$.

**Theorem** For any constants $a$ and $b$,
$$E(aX+b) = aE(X) + b$$
Proof: we have
$$E(aX+b) = \int_{-\infty}^{\infty}(ax+b)f_X(x)\,dx = \int_{-\infty}^{\infty}\left(axf_X(x) + bf_X(x)\right)dx = a\int_{-\infty}^{\infty}xf_X(x)\,dx + b\int_{-\infty}^{\infty}f_X(x)\,dx$$
$$= aE(X) + b\cdot 1 = aE(X) + b$$
If $b = 0$, $E(aX) = aE(X)$, and if $a = 0$, $E(b) = b$. We will adopt the notation $E(X) = \mu_X = \mu$.

Variance

Expectation is a measure of the center of a probability distribution; variance is another important measure, as it measures the spread of the distribution. The way it measures this spread is by asking how far, on average, we can expect $X$ to be from the center of its distribution, that is $X - E(X) = X - \mu_X$, and since there is no regard to the sign of this quantity we could use
$$E|X - \mu_X|$$
Although this may seem the most natural way of measuring dispersion, we use a different measure, due both to mathematical convenience and to the fact that the alternate form arises naturally from theoretical considerations. The measure is called the **variance**:
$$Var(X) = E\left[(X - \mu_X)^2\right] = \sigma_X^2$$
Variance is difficult to interpret because it is defined in terms of squared units, so we define the **standard deviation**
$$\sigma_X = \sqrt{Var(X)} = \sqrt{E(X - \mu_X)^2}$$
This is different from our intuitive measure defined earlier, $E|X - \mu_X| = E\left(\sqrt{(X-\mu_X)^2}\right)$.

The following device simplifies the calculation of variance:
$$Var(X) = E\left[(X-\mu)^2\right] = E\left(X^2 - 2\mu X + \mu^2\right) = E(X^2) + E(-2\mu X) + E(\mu^2)$$
Now we have $E(-2\mu X) = -2\mu E(X) = -2\mu\cdot\mu = -2\mu^2$ and $E(\mu^2) = \mu^2$, thus
$$Var(X) = E(X^2) - \mu^2 = E(X^2) - (E(X))^2$$

Linear Change of Scale

The most common transformation, a linear change of scale $Y = aX + b$.

**Variance of Y** Theorem: For any constants $a$ and $b$,
$$Var(aX+b) = a^2\,Var(X)$$
Proof: By definition, $Var(aX+b)$ is the expectation of
$$\left[(aX+b) - E(aX+b)\right]^2 = \left[aX + b - (a\mu_X + b)\right]^2 = (aX - a\mu_X)^2 = a^2(X - \mu_X)^2$$
so
$$Var(aX+b) = E\left[a^2(X-\mu_X)^2\right] = a^2E(X-\mu_X)^2 = a^2\,Var(X)$$
and we can immediately deduce $\sigma_{aX+b} = |a|\sigma_X$. Note that neither the variance nor the standard deviation is affected by $b$: the spread of the distribution is unaffected by a shift of the origin by $b$ units.

**Standard Form** By a linear change of scale we can arrange to have a random variable expressed with expectation zero and variance one. This is accomplished by transforming $X$ by subtracting its expectation and dividing by its standard deviation:
$$W = \frac{X - \mu_X}{\sigma_X}$$
Note that $W = aX + b$ for the special choices $a = \frac{1}{\sigma_X}$ and $b = -\frac{\mu_X}{\sigma_X}$.

The Normal $(\mu, \sigma^2)$ distributions

The standard Normal distribution of a continuous random variable $X$ has density
$$\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}, \quad -\infty < x < \infty$$
We define the Normal $(\mu, \sigma^2)$ distribution as the distribution of $Y = \sigma X + \mu$, which has density
$$f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(y-\mu)^2}{2\sigma^2}}$$
The standard Normal has mean and variance $\mu_X = 0$ and $\sigma_X^2 = 1$, so for the distribution of $Y$ we have
$$E(Y) = \sigma E(X) + \mu = \mu, \qquad Var(Y) = \sigma^2\,Var(X) = \sigma^2$$
or $\mu_Y = \mu$, $\sigma_Y^2 = \sigma^2$, $\sigma_Y = \sigma$. If we consider the inverse of the transformation that defines $Y$,
$$X = \frac{Y - \mu}{\sigma}$$
we see that $X$ is $Y$ expressed in standard form; that is where the name standard Normal comes from. Usually we write Normal $(\mu, \sigma^2)$ as $N(\mu, \sigma^2)$.
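A quick simulation sketch (Python standard library only; $\mu = 3$, $\sigma = 2$ are arbitrary illustrative values) checking that $Y = \sigma X + \mu$ has mean $\mu$ and variance $\sigma^2$, and that standardizing $Y$ recovers mean 0 and variance 1:

```python
import random
import statistics

random.seed(1)
mu, sigma = 3.0, 2.0

x = [random.gauss(0, 1) for _ in range(100_000)]    # standard Normal draws
y = [sigma * xi + mu for xi in x]                   # Y = sigma * X + mu

print(statistics.mean(y), statistics.pvariance(y))  # approx mu and sigma**2

w = [(yi - mu) / sigma for yi in y]                 # standard form W = (Y - mu)/sigma
print(statistics.mean(w), statistics.pvariance(w))  # approx 0 and 1
```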
Chapter 3

[[Covariance]] [[Multivariate Distributions]] [[Bivariate Distributions]]

Discrete Bivariate Distributions

Let $X$ and $Y$ be two random variables on the same sample space $S$; that means they were defined in reference to the same experiment. We define the **Bivariate Probability Function**
$$p(x, y) = P(X = x, Y = y)$$
$p(x, y)$ may be thought of as describing the distribution of unit mass in the $(x, y)$ plane, with $p(x, y)$ representing the mass assigned to $(x, y)$: $p(x, y)$ is the height of the spike at $(x, y)$. As in the univariate case, the total over all possible points must be one:
$$\sum_x\sum_y p(x, y) = 1$$

**Example** Consider the experiment of tossing a fair coin three times, and then independently tossing a second coin three times. Let $X$ = number of heads for the first coin, $Y$ = number of tails for the second coin, and $Z$ = number of tails for the first coin. The coins are independent, so for any pair $(x, y)$ of values of $X$ and $Y$ we have, if $\{X = x\}$ stands for the event $X = x$,
$$p(x, y) = P(X = x, Y = y) = P(\{X = x\}\cap\{Y = y\}) = P(\{X = x\})\cdot P(\{Y = y\}) = p_X(x)\cdot p_Y(y)$$
On the other hand, $X$ and $Z$ refer to the same coin, so we have
$$p(x, z) = P(X = x, Z = z) = P(\{X = x\}\cap\{Z = z\}) = P(\{X = x\}) = p_X(x) \ \text{ if } z = 3 - x, \quad = 0 \ \text{ otherwise}$$
This is because we must necessarily have $x + z = 3$, which means $\{X = x\}$ and $\{Z = 3 - x\}$ describe the same event. If $z \ne 3 - x$ then $\{X = x\}$ and $\{Z = z\}$ are mutually exclusive and the probability that both occur is zero.

If we have a bivariate probability function such as $p(x, y)$, then we can recover the univariate distributions:
$$p_X(x) = \sum_y p(x, y), \qquad p_Y(y) = \sum_x p(x, y)$$
The intuition is that we can decompose $\{X = x\}$ into a collection of smaller sets,
$$\{X = x\} = \{X = x, Y = 0\}\cup\{X = x, Y = 1\}\cup\ldots$$
The events on the right-hand side run through all possible values of $Y$, and they are mutually exclusive, so the probability of the right-hand side is the sum $\sum_y p(x, y)$, while the left-hand side is $p_X(x)$.

Univariate distributions in a multivariate context are called **marginal probability functions**. You can always find the marginal distributions from the bivariate distribution, but in general you can't go the other way. The marginal distributions tell us about the probability of all possible values of one variable but have no regard to the other variables. In the example above, $p(x, y) = p_X(x)\cdot p_Y(y)$ and $p(x, z) = p_X(x)$ for $z = 3 - x$. What is needed is information on how knowing one variable's outcome affects another.

**Conditional Probability Function** $p(y \mid x) = P(Y = y \mid X = x)$, the probability that $Y = y$ given $X = x$; this is the same notion of conditional probability we expressed earlier:
$$p(y \mid x) = P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{p(x, y)}{p_X(x)}$$
as long as $p_X(x) > 0$. If $p(y \mid x) = p_Y(y)$ for all $x$, then we say $Y$ and $X$ are independent random variables, and thus $p(x, y) = p_X(x)\cdot p_Y(y)$. $X$ and $Y$ are independent only if all the events $X = x$ and $Y = y$ are independent; if this fails for a single $(x_0, y_0)$, $X$ and $Y$ are dependent.

In the example, $X$ and $Y$ were independent, but $X$ and $Z$ were dependent: take $x = 2$, then
$$p(z \mid 2) = \frac{p(2, z)}{p_X(2)} = 1 \ \text{ for } z = 1, \quad = 0 \ \text{ otherwise}$$
so $p(z \mid x) \ne p_Z(z)$.
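A small Python sketch (standard library only) that builds the joint probability functions for this coin example and checks the marginal, independence, and dependence claims numerically:

```python
import math

def binom3(k):
    """P(k) for the number of heads (or tails) in 3 fair coin tosses."""
    return math.comb(3, k) * 0.5**3

# Joint pmf of X (heads, coin 1) and Y (tails, coin 2): independent coins
p_XY = {(x, y): binom3(x) * binom3(y) for x in range(4) for y in range(4)}

# Joint pmf of X and Z (tails, coin 1): concentrated on z = 3 - x
p_XZ = {(x, z): binom3(x) if z == 3 - x else 0.0
        for x in range(4) for z in range(4)}

# Marginals recovered by summing the joint pmf
pX = {x: sum(p_XY[(x, y)] for y in range(4)) for x in range(4)}
pZ = {z: sum(p_XZ[(x, z)] for x in range(4)) for z in range(4)}

print(all(math.isclose(p_XY[(x, y)], pX[x] * binom3(y))
          for x in range(4) for y in range(4)))   # True: X, Y independent
print(p_XZ[(2, 1)], pX[2] * pZ[1])                # 0.375 vs 0.140625: X, Z dependent
```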
Continuous Bivariate Distributions

**Bivariate probability density function** $f(x, y)$: the rectangular region with $a < X < b$ and $c < Y < d$ has probability
$$P(a < X < b,\ c < Y < d) = \int_c^d\int_a^b f(x, y)\,dx\,dy$$
It is always true that $f(x, y) \ge 0$ for all $x, y$, and
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$
Any function satisfying these properties describes a continuous bivariate probability distribution: unit mass resting squarely on the plane, with $f(x, y)$ describing the upper surface of the mass. If we are given $f(x, y)$ we can find
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \ \ \forall x, \qquad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \ \ \forall y$$
Mathematically we can justify this as follows: for $a < b$, the region $a < X \le b$, $-\infty < Y < \infty$ has probability
$$P(a < X \le b) = \int_{-\infty}^{\infty}\int_a^b f(x, y)\,dx\,dy = \int_a^b\left[\int_{-\infty}^{\infty} f(x, y)\,dy\right]dx \quad \forall a < b$$
and thus $\int_{-\infty}^{\infty} f(x, y)\,dy$ fulfills the definition: it is a function of $x$ that gives probabilities as areas under it. The integral
$$\int_x^{x+dx}\left[\int_{-\infty}^{\infty} f(u, y)\,dy\right]du \approx \left[\int_{-\infty}^{\infty} f(x, y)\,dy\right]\cdot dx$$
gives the total mass between $x$ and $x + dx$.

**Example** Consider the bivariate density function
$$f(x, y) = y\left(\tfrac{1}{2} - x\right) + x, \quad \text{for } 0 < x < 1,\ 0 < y < 2$$
and $= 0$ otherwise. One way of visualizing such a function is to look at cross-sections: the function
$$f\left(x, \tfrac{1}{2}\right) = \tfrac{1}{2}\left(\tfrac{1}{2} - x\right) + x = \frac{x}{2} + \frac{1}{4}$$
is the cross-section of the surface obtained by cutting it with a plane at $y = \frac{1}{2}$. To show that $f(x, y)$ is a bivariate density we compute
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \int_0^2\int_0^1\left[y\left(\tfrac{1}{2} - x\right) + x\right]dx\,dy = \int_0^2 \tfrac{1}{2}\,dy = 1$$

Conditional Probability Densities

In the discrete case we defined the distribution of $Y$ given $X$ as
$$p(y \mid x) = \frac{p(x, y)}{p_X(x)}, \quad p_X(x) > 0$$
So for the continuous case we define the conditional probability density of $Y$ given $X$ as
$$f(y \mid x) = \frac{f(x, y)}{f_X(x)}, \quad f_X(x) > 0$$
The discrete case is a re-expression of
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
Since $f(y \mid x)$ is a density, it gives conditional probabilities as areas of a region; we have
$$P(a < Y \le b \mid X = x) = \int_a^b f(y \mid x)\,dy, \quad \forall a < b$$
What does $P(a < Y \le b \mid X = x)$ mean, since $P(X = x) = 0$? We proceed heuristically: by $P(a < Y \le b \mid X = x)$ we mean something like $P(a < Y \le b \mid x \le X \le x + h)$ for very small $h$. If $f_X(x) > 0$ the latter probability is well defined, since $P(x \le X \le x + h) > 0$ even though it is small. Thus
$$P(a < Y \le b \mid X = x) \approx P(a < Y \le b \mid x \le X \le x + h) = \frac{P(x \le X \le x + h,\ a < Y \le b)}{P(x \le X \le x + h)} = \frac{\int_a^b\left[\int_x^{x+h} f(u, y)\,du\right]dy}{\int_x^{x+h} f_X(u)\,du}$$
But if $f_X(u)$ doesn't vary greatly for $u \in [x, x+h]$, then
$$\int_x^{x+h} f_X(u)\,du \approx f_X(x)\cdot h$$
and if, for fixed $y$, the function $f(u, y)$ doesn't change value greatly for $u \in [x, x+h]$, we have
$$\int_x^{x+h} f(u, y)\,du \approx f(x, y)\cdot h$$
Substituting these into the expression above gives
$$P(a < Y \le b \mid X = x) \approx \frac{\int_a^b f(x, y)\cdot h\,dy}{f_X(x)\cdot h} = \frac{\int_a^b f(x, y)\,dy}{f_X(x)} = \int_a^b\frac{f(x, y)}{f_X(x)}\,dy$$
$f(y \mid x)$ is a cross-section of the surface $f(x, y)$ at $X = x$, rescaled so it has total area 1. Indeed, for fixed $x$, the denominator of the conditional probability density is just the right scaling factor so that the area is 1:
$$\int_{-\infty}^{\infty} f(y \mid x)\,dy = \int_{-\infty}^{\infty}\frac{f(x, y)}{f_X(x)}\,dy = \frac{1}{f_X(x)}\int_{-\infty}^{\infty} f(x, y)\,dy = \frac{1}{f_X(x)}\cdot f_X(x) = 1, \quad \forall x \text{ s.t. } f_X(x) > 0$$
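A short Python sketch (standard library only; simple Riemann sums with an arbitrary step size) for the example density above: it computes the marginal $f_X(x)$ at one cross-section and checks that the conditional density $f(y \mid x)$ integrates to 1:

```python
def f(x, y):
    """Example bivariate density on 0 < x < 1, 0 < y < 2."""
    if 0 < x < 1 and 0 < y < 2:
        return y * (0.5 - x) + x
    return 0.0

dy = 1e-4
ys = [(j + 0.5) * dy for j in range(int(2 / dy))]

def f_X(x):
    """Marginal density of X: integrate f(x, y) over y."""
    return sum(f(x, y) * dy for y in ys)

x0 = 0.3                                        # an arbitrary cross-section
fx0 = f_X(x0)
print(fx0)                                      # ≈ 1 (in fact X is Uniform(0,1) here)
print(sum(f(x0, y) / fx0 * dy for y in ys))     # conditional f(y | x0) integrates to ≈ 1
```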
If any of the following equivalent conditions hold,
$$f(y \mid x) = f_Y(y), \qquad f(x \mid y) = f_X(x), \qquad f(x, y) = f_X(x)\cdot f_Y(y)$$
then $X$ and $Y$ are independent random variables. Any of these conditions is equivalent to
$$P(a < X < b,\ c < Y < d) = P(a < X < b)\cdot P(c < Y < d)$$
If any of these conditions fails to hold, $X$ and $Y$ are dependent. Using one marginal density and one set of conditional densities that agree, either $f_X, f(y \mid x)$ or $f_Y, f(x \mid y)$, we determine the bivariate density by
$$f(x, y) = f_X(x)f(y \mid x)$$

Expectations of Transformations of Bivariate Random Variables

Examples of transformations: $h_1(X, Y) = X + Y$, $h_2(X, Y) = X - Y$, and $h_3(X, Y) = X\cdot Y$. We can find the expectation by setting $Z = h(X, Y)$ as a random variable; if we find the distribution of $Z$, the expectation follows:
$$E(Z) = \sum_z zp_Z(z), \qquad E(Z) = \int_{-\infty}^{\infty}zf_Z(z)\,dz$$
in the discrete and continuous cases respectively. However, we can find these expectations without finding the distribution of $Z$, by a generalized change of variables argument:
$$E\,h(X, Y) = \sum_x\sum_y h(x, y)p(x, y), \qquad E\,h(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}h(x, y)f(x, y)\,dx\,dy$$
discrete and continuous respectively.

Mixed Cases

We have encountered $X, Y$ both discrete or both continuous; now we consider situations where $X$ is discrete and $Y$ is continuous.

**Example** A ball is rolled on a table with range $0 \to 1$ and its position is $Y$; then a second ball is rolled $n$ times, and $X$ is the number of times the second ball lands to the left of $Y$. Then $(X, Y)$ is a bivariate random variable, $X$ discrete, $Y$ continuous. From our description, $Y$ is uniform: $f_Y(y) = 1$, $0 < y < 1$. If we consider each roll independent, then conditional upon $Y = y$, $X$ is a success count for $n$ Bernoulli trials where the chance of success on a single trial is $y$; that means $p(x \mid y)$ is given by the Binomial $(n, y)$ distribution
$$p(x \mid y) = \binom{n}{x}y^x(1-y)^{n-x}$$
For mixed cases, as for the other cases, we construct the joint distribution by multiplication:
$$f(x, y) = f_Y(y)\cdot p(x \mid y) = \binom{n}{x}y^x(1-y)^{n-x}$$
This distribution can be pictured as a series of parallel sheets on the $x$-$y$ plane, concentrated on the lines $x = 0, x = 1, \ldots, x = n$. For dealing with mixed cases we use our earlier results: the marginal distribution of $X$ is
$$p_X(x) = \int_{-\infty}^{\infty}f(x, y)\,dy = \int_0^1\binom{n}{x}y^x(1-y)^{n-x}\,dy = \binom{n}{x}\cdot B(x+1, n-x+1) = \binom{n}{x}\cdot\frac{1}{(n+1)\binom{n}{x}} = \frac{1}{n+1}$$
We can also calculate
$$f(y \mid x) = \frac{f(x, y)}{p_X(x)} = \frac{\binom{n}{x}y^x(1-y)^{n-x}}{\frac{1}{n+1}} = (n+1)\binom{n}{x}y^x(1-y)^{n-x}$$
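A small simulation sketch of this mixed case in Python (standard library only; $n = 5$ and the repetition count are arbitrary illustrative choices), checking that the marginal distribution of $X$ is uniform on $\{0, 1, \ldots, n\}$ with $p_X(x) = \frac{1}{n+1}$:

```python
import random
from collections import Counter

random.seed(2)
n, reps = 5, 200_000
counts = Counter()

for _ in range(reps):
    y = random.random()                               # Y ~ Uniform(0, 1)
    x = sum(random.random() < y for _ in range(n))    # X | Y = y ~ Binomial(n, y)
    counts[x] += 1

for x in range(n + 1):
    print(x, counts[x] / reps)   # each close to 1/(n+1) = 1/6 ≈ 0.1667
```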
Higher Dimensions

The bivariate case was the focus because of its mathematical simplicity, but the same ideas carry over to multiple dimensions. For $X_1, X_2, X_3$ discrete we have
$$p(x_1, x_2, x_3) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3)$$
and
$$\sum_{x_1, x_2, x_3} p(x_1, x_2, x_3) = 1$$
For the continuous case we describe the distribution by a density $f(x_1, x_2, x_3)$ where
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3)\,dx_1\,dx_2\,dx_3 = 1$$
A collection of random variables is independent if their distribution factors into a product of univariate distributions:
$$p(x_1, x_2, \ldots, x_n) = p_{X_1}(x_1)\cdot p_{X_2}(x_2)\cdots p_{X_n}(x_n)$$
or
$$f(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1)\cdot f_{X_2}(x_2)\cdots f_{X_n}(x_n)$$
The terms on the right are the marginal distributions of the $X_i$'s, and they can be found by
$$p_{X_1}(x_1) = \sum_{x_2}\sum_{x_3} p(x_1, x_2, x_3)$$
and
$$f_{X_1}(x_1) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3)\,dx_2\,dx_3$$
There may also be multivariate marginal distributions: for $X_1, X_2, X_3, X_4$ continuous we have
$$f(x_1, x_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4)\,dx_3\,dx_4$$
Conditional distributions are found analogously to the bivariate case:
$$f(x_3, x_4 \mid x_1, x_2) = \frac{f(x_1, x_2, x_3, x_4)}{f(x_1, x_2)}$$
and in the continuous case
$$E(h(X_1, X_2, \ldots, X_n)) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} h(x_1, \ldots, x_n)f(x_1, \ldots, x_n)\,dx_1\cdots dx_n$$

Measuring Multivariate Distributions

A natural starting point for summary measures is linear transformations; in the bivariate case $h(X, Y) = aX + bY$ with $a, b$ constants.

**Theorem** For any constants $a, b$ and any bivariate random variable $(X, Y)$,
$$E(aX + bY) = aE(X) + bE(Y)$$
We can generalize this result using induction:
$$E\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^n a_iE(X_i)$$
It is hopeless to try to capture information about how $X$ and $Y$ vary together by looking at the expectations of linear transformations. To learn this we must go beyond expectations of linear transformations.

Covariance and Correlation

The simplest nonlinear transformation of $X, Y$ is $XY$, and we could consider $E(XY)$ as a summary, since it is affected both by where the distributions are centered and by how the variables vary together. We start instead with the product of $X - E(X)$ and $Y - E(Y)$, that is, the expectation of $h(X, Y) = (X - \mu_X)(Y - \mu_Y)$, which is called the covariance of $X$ and $Y$:
$$Cov(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
Expanding, $h(X, Y) = XY - \mu_XY - X\mu_Y + \mu_X\mu_Y$, and using the properties above we arrive at
$$Cov(X, Y) = E(XY) - \mu_XE(Y) - \mu_YE(X) + \mu_X\mu_Y = E(XY) - \mu_X\mu_Y - \mu_Y\mu_X + \mu_X\mu_Y$$
Thus
$$Cov(X, Y) = E(XY) - \mu_X\mu_Y$$
It follows immediately that $Cov(X, Y) = Cov(Y, X)$. If $X$ and $Y$ are the same random variable we have
$$Cov(X, X) = E(X\cdot X) - \mu_X\mu_X = Var(X)$$
At the other extreme, if $Y = -X$, then
$$Cov(X, -X) = E(X(-X)) - \mu_X\mu_{-X} = -E(X^2) + \mu_X^2 = -Var(X)$$
and lastly, if $X$ and $Y$ are independent, $Cov(X, Y) = 0$. Yet $Cov(X, Y) = 0$ isn't sufficient to guarantee independence: theoretically there can be an exact balance between $h(X, Y)$ in the positive quadrants and the negative quadrants. In general we refer to $X, Y$ with $Cov(X, Y) = 0$ as **Uncorrelated**.

**Correction Term** The most important use of covariance, and the best means of interpreting it quantitatively, is as a correction term that arises in calculating the variance of sums. We have $E(X + Y) = E(X) + E(Y)$. Now $Var(X + Y)$ is the expectation of
$$(X + Y - \mu_{X+Y})^2 = \left[X + Y - (\mu_X + \mu_Y)\right]^2 = \left[(X - \mu_X) + (Y - \mu_Y)\right]^2 = (X - \mu_X)^2 + (Y - \mu_Y)^2 + 2(X - \mu_X)(Y - \mu_Y)$$
Thus we have
$$Var(X + Y) = Var(X) + Var(Y) + 2\,Cov(X, Y)$$
In the special case where $X$ and $Y$ are independent (or uncorrelated),
$$Var(X + Y) = Var(X) + Var(Y)$$
A large covariance may be due to a high degree of association or dependence between $X$ and $Y$, but it may also be due to the choice of scales of measurement. To eliminate this scale effect we define the correlation of $X$ and $Y$ to be the covariance of the standardized $X$ and $Y$:
$$\rho_{XY} = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] = Cov\left(\frac{X - \mu_X}{\sigma_X}, \frac{Y - \mu_Y}{\sigma_Y}\right) = \frac{Cov(X, Y)}{\sigma_X\sigma_Y}$$
If $X$ and $Y$ are independent, $\rho_{XY} = 0$; yet being uncorrelated doesn't imply independence.
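Continuing the coin example from the discrete bivariate section, here is a Python sketch (standard library only) computing the covariance and correlation of $X$ and $Z$ from their joint table, and checking $Var(X+Z) = Var(X) + Var(Z) + 2\,Cov(X,Z)$ directly (here $X + Z = 3$ is constant, so the variance of the sum is 0):

```python
import math

def binom3(k):
    return math.comb(3, k) * 0.5**3

# Joint pmf of X (heads, coin 1) and Z (tails, coin 1): z = 3 - x
p_XZ = {(x, 3 - x): binom3(x) for x in range(4)}

def E(h):
    """Expectation of h(X, Z) over the joint pmf."""
    return sum(h(x, z) * p for (x, z), p in p_XZ.items())

mu_X, mu_Z = E(lambda x, z: x), E(lambda x, z: z)
var_X = E(lambda x, z: (x - mu_X) ** 2)
var_Z = E(lambda x, z: (z - mu_Z) ** 2)
cov_XZ = E(lambda x, z: (x - mu_X) * (z - mu_Z))

print(cov_XZ, -var_X)                     # Cov(X, Z) = -Var(X), since Z = 3 - X
print(cov_XZ / math.sqrt(var_X * var_Z))  # correlation = -1
print(var_X + var_Z + 2 * cov_XZ)         # Var(X + Z) = 0 (X + Z = 3 always)
```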
Chapter 4

Bayes' Theorem

The most elementary situation: we have an effect $F$ and a list of mutually exclusive, exhaustive causes $E_1, E_2, \ldots, E_n$. We have the **a priori probabilities** $P(E_1), \ldots, P(E_n)$, and we know the probability of $F$ given each cause, $P(F \mid E_1), \ldots, P(F \mid E_n)$. We want to find the **a posteriori probabilities** $P(E_1 \mid F), \ldots, P(E_n \mid F)$.

**Example** $E_1$ = patient has cancer, $E_2$ = no cancer. Let $F$ be a positive test result; we know that $P(F \mid E_1) = .9$, $P(F \mid E_2) = .2$. We say the patient is randomly selected from a population with a cancer prevalence of $\theta_o$, so $P(E_1) = \theta_o$, $P(E_2) = 1 - \theta_o$. There is disagreement about $\theta_o$. What is $P(E_1 \mid F)$?

**Bayes' Theorem, Elementary Version** If $E_1, E_2, \ldots, E_n$ partition a sample space $S$, then for each $E_i$ and any $F$ with $P(F) > 0$,
$$P(E_i \mid F) = \frac{P(E_i)P(F \mid E_i)}{\sum_{j=1}^n P(E_j)P(F \mid E_j)}$$
**Proof** Generally we have, for any $E$ and $F$, $P(E \cap F) = P(F)P(E \mid F)$. Applying this twice we have
$$P(E_i \cap F) = P(F)P(E_i \mid F), \qquad P(E_i \cap F) = P(E_i)P(F \mid E_i)$$
and since $P(F) > 0$, equating these gives
$$P(E_i \mid F) = \frac{P(E_i)P(F \mid E_i)}{P(F)}$$
This is essentially the theorem; it remains to show that
$$P(F) = \sum_{j=1}^n P(E_j)P(F \mid E_j)$$
which will be done in two ways. First,
$$1 = \sum_{i=1}^n P(E_i \mid F) = \sum_{i=1}^n\frac{P(E_i)P(F \mid E_i)}{P(F)} = \frac{1}{P(F)}\sum_{i=1}^n P(E_i)P(F \mid E_i)$$
or
$$P(F) = \sum_{i=1}^n P(E_i)P(F \mid E_i)$$
Alternatively, we have $F = \bigcup_{i=1}^n(E_i \cap F)$, and finite additivity gives
$$P(F) = \sum_{i=1}^n P(E_i \cap F)$$
But for each $i$, $P(E_i \cap F) = P(E_i)P(F \mid E_i)$, thus
$$P(F) = \sum_{i=1}^n P(E_i)P(F \mid E_i)$$

**Example, returned to** We have $P(F \mid E_1) = .9$, $P(F \mid E_2) = .2$, $P(E_1) = \theta_o$, so from the above
$$P(E_1 \mid F) = \frac{\theta_o(.9)}{\theta_o(.9) + (1 - \theta_o)(.2)}$$

Bayes' Theorem More Generally

Suppose we have $f_Y(y)$ and $f(x \mid y)$, and we want to find $f(y \mid x)$. We proceed as follows: first, if $f_X(x) > 0$,
$$f(y \mid x) = \frac{f(x, y)}{f_X(x)}$$
Second, we know that $f(x, y) = f(x \mid y)f_Y(y)$; together these give
$$f(y \mid x) = \frac{f(x \mid y)f_Y(y)}{f_X(x)}$$
All that remains is to find $f_X(x)$, which we can do in either of two ways. First, we must have
$$\int_{-\infty}^{\infty}\frac{f(x \mid u)f_Y(u)}{f_X(x)}\,du = 1$$
or
$$f_X(x) = \int_{-\infty}^{\infty}f(x \mid u)f_Y(u)\,du$$
These equations together provide the general form of Bayes' theorem:
$$f(y \mid x) = \frac{f(x \mid y)f_Y(y)}{\int_{-\infty}^{\infty}f(x \mid u)f_Y(u)\,du}$$
This is a distribution for $Y$ given $X = x$; that is, for a fixed $x$ this gives the conditional density of $Y$. Sometimes we write the above as
$$f(y \mid x) \propto f(x \mid y)f_Y(y)$$
meaning $f(y \mid x)$ is proportional to the product $f(x \mid y)f_Y(y)$, though the constant of proportionality can depend upon $x$.

Inference about the Binomial Distribution

For the case $X$ discrete and $Y$ continuous, Bayes' theorem becomes
$$f(y \mid x) = \frac{p(x \mid y)f_Y(y)}{\int_{-\infty}^{\infty}p(x \mid u)f_Y(u)\,du}$$
or $f(y \mid x) \propto p(x \mid y)f_Y(y)$. We consider an example to show Bayesian inference.

**Example** We select $n = 100$ names from a list of a million registered Democrats; all $n$ are interviewed and express an opinion. The poll results in a count $X$ for the incumbent and a count $n - X$ against. Formally, $\theta$ = fraction for the incumbent. If the survey is truly random, we would accept that given $\theta$, $X$ has a binomial $(100, \theta)$ distribution:
$$p(x \mid \theta) = b(x; n, \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$$
But we aren't given $\theta$, and we want to determine it. We suppose our uncertainty about $\theta$ is well represented by a uniform distribution, $f(\theta) = 1$, $0 < \theta < 1$. Now this becomes an application of Bayes' theorem. We observe $n = 100$, $X = 40$, and we have $f(\theta \mid x) \propto f(\theta)p(x \mid \theta)$, so the density is
$$f(\theta \mid 40) \propto p(40 \mid \theta)\cdot f(\theta) = \binom{100}{40}\theta^{40}(1-\theta)^{60}\cdot 1 = \binom{100}{40}\theta^{40}(1-\theta)^{60}$$
This represents the density of $\theta$ given $X = 40$, up to a constant of proportionality. We could evaluate the constant if we wished, but we recognize that the portion of the density that depends upon $\theta$ is exactly the variable part of a Beta $(41, 61)$ density; since the variable parts agree and both integrate to 1, the constants must agree, and we conclude that $f(\theta \mid 40)$ is a Beta $(41, 61)$ density. To summarize, we thought
(a) $\theta$ is Uniform (0, 1) before the survey
(b) Given $\theta$, $X$ is binomial $(100, \theta)$
and our conclusion is that
(c) $\theta$ is Beta $(41, 61)$ a posteriori, given $X = 40$.
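A short Python sketch (standard library only; the grid resolution is an arbitrary discretization choice) of this uniform-prior update: it normalizes $\theta^{40}(1-\theta)^{60}$ on a grid, compares it with the Beta (41, 61) density, and reports the posterior mean (which should be $\frac{41}{102}$ by the Beta expectation formula):

```python
import math

n, x = 100, 40
d_theta = 1e-4
grid = [(i + 0.5) * d_theta for i in range(int(1 / d_theta))]

# Unnormalized posterior under the uniform prior: theta^x * (1 - theta)^(n - x)
unnorm = [t**x * (1 - t) ** (n - x) for t in grid]
const = sum(u * d_theta for u in unnorm)
posterior = [u / const for u in unnorm]

def beta_pdf(t, a, b):
    """Beta(a, b) density."""
    return math.gamma(a + b) / (math.gamma(a) * math.gamma(b)) * t**(a - 1) * (1 - t)**(b - 1)

print(posterior[5000], beta_pdf(grid[5000], 41, 61))          # agree near theta = 0.5
print(sum(t * p * d_theta for t, p in zip(grid, posterior)))  # mean ≈ 41/102 ≈ 0.402
```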
A Richer Class of Models for Binomial Inference

The previous section was restricted to the hypothesis that the fraction $\theta$ is a priori distributed as Uniform (0, 1). Bayes' theorem can be applied to any $f(\theta)$, and this flexibility can be exploited to represent any degree of uncertainty, based upon prior experience, that can be summarized as a density. A particular class is the Beta $(\alpha, \beta)$ distributions:
$$f(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
This class includes the Uniform (0, 1) as the special case $\alpha = \beta = 1$. If we accept a particular $f(\theta)$, the analysis is straightforward: subsequent to observing the number of successes $X = x$ in $n$ independent trials, the conditional density of $\theta$ given $X = x$ is
$$f(\theta \mid x) \propto p(x \mid \theta)f(\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} = C\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}$$
where $C$ depends on $n, x, \alpha$ and $\beta$ but not on $\theta$. We recognize that the portion of $f(\theta \mid x)$ that depends on $\theta$ is that of a Beta $(x+\alpha, n-x+\beta)$ density; thus $f(\theta \mid x)$ must be the Beta $(x+\alpha, n-x+\beta)$ density
$$f(\theta \mid x) = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(x+\alpha)\Gamma(n-x+\beta)}\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}$$
The specification of an a priori density $f(\theta)$ from within the class of Beta distributions can be simplified using three facts about Beta $(\alpha, \beta)$ distributions. First,
$$\mu_\theta = \frac{\alpha}{\alpha+\beta}$$
Second,
$$\sigma_\theta^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\mu_\theta(1-\mu_\theta)}{\alpha+\beta+1}$$
Third, if $\alpha$ and $\beta$ are not very small, $f(\theta)$ is approximately like the density of a $N(\mu_\theta, \sigma_\theta^2)$ distribution, so
$$P(|\theta - \mu_\theta| < \sigma_\theta) \approx \frac{2}{3}$$

**Returning to the Example** Before conducting the survey we specify $f(\theta)$ by asking what $\mu_\theta$ is. If $\alpha$ and $\beta$ are not very small, $f(\theta)$ is approximately symmetric about $\mu_\theta$, and $\mu_\theta$ is approximately the median of $f(\theta)$. We ask: for what value is it even odds that $\theta$ is above or below it? Suppose $\mu_\theta = .5$ answers this approximately: prior to the survey we consider $\theta$ equally likely to be above or below $.5$. Next we ask for what interval of values of $\theta$ around $\mu_\theta = .5$ the probability is $\frac{2}{3}$: for what value $\sigma_\theta$ is it twice as likely that $.5 - \sigma_\theta < \theta < .5 + \sigma_\theta$ as not? Suppose we agree that a priori $P(.4 < \theta < .6) \approx \frac{2}{3}$, or $\sigma_\theta \approx .1$. The relation
$$\sigma_\theta^2 = \frac{\mu_\theta(1-\mu_\theta)}{\alpha+\beta+1}$$
with $\mu_\theta = .5$ and $\sigma_\theta^2 = (.1)^2$ gives us
$$\alpha+\beta+1 = \frac{\mu_\theta(1-\mu_\theta)}{\sigma_\theta^2} = \frac{.5^2}{.1^2} = 25$$
or $\alpha+\beta = 24$; together with $\mu_\theta = .5$ this gives $\alpha = \beta = 12$. So we take as our a priori distribution for $\theta$ the Beta (12, 12) density
$$f(\theta) = \frac{23!}{11!\,11!}\theta^{11}(1-\theta)^{11}$$
After the survey, the distribution that describes the uncertainty about $\theta$ given $n = 100$ and $X = 40$ is the Beta (52, 72) density
$$f(\theta \mid 40) = \frac{123!}{51!\,71!}\theta^{51}(1-\theta)^{71}$$
The effect of the sample information $X$ upon the distribution of $\theta$ can be summarized by the changes in expectation and variance. Before observing $X$, $\theta$ has
$$E(\theta) = \frac{\alpha}{\alpha+\beta} = \mu_\theta, \qquad Var(\theta) = \frac{\mu_\theta(1-\mu_\theta)}{\alpha+\beta+1}$$
After we observe $X = x$,
$$E(\theta \mid X = x) = \frac{\alpha+x}{\alpha+\beta+n} = \left(\frac{\alpha+\beta}{\alpha+\beta+n}\right)\mu_\theta + \left(\frac{n}{\alpha+\beta+n}\right)\frac{x}{n}$$
$$Var(\theta \mid X = x) = \frac{E(\theta \mid X = x)\left(1 - E(\theta \mid X = x)\right)}{\alpha+\beta+n+1}$$
The expectation formula is particularly informative: it gives $E(\theta \mid X = x)$ as a weighted average of the prior expectation and the fraction of the sample that are successes. That is, $E(\theta \mid X = x)$ is a compromise between $\mu_\theta$, our expectation with no data, and the sample fraction $\frac{x}{n}$; the larger $\frac{n}{\alpha+\beta+n}$ is, the more weight we put on $\frac{x}{n}$.
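A small Python sketch (standard library only) of this prior-to-posterior update, checking the posterior mean both from the Beta (52, 72) parameters and from the weighted-average form, along with the posterior standard deviation:

```python
import math

alpha, beta = 12, 12      # elicited prior: Beta(12, 12)
n, x = 100, 40            # survey data

a_post, b_post = alpha + x, beta + n - x   # posterior: Beta(52, 72)

post_mean = a_post / (a_post + b_post)
prior_mean = alpha / (alpha + beta)
weighted = (alpha + beta) / (alpha + beta + n) * prior_mean \
           + n / (alpha + beta + n) * (x / n)

post_var = post_mean * (1 - post_mean) / (alpha + beta + n + 1)

print(post_mean, weighted)   # both 52/124 ≈ 0.419
print(math.sqrt(post_var))   # posterior sd ≈ 0.044
```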
The following two situations give the same $f(\theta \mid x)$:
(i) a Uniform (0, 1) prior (Beta with $\alpha = \beta = 1$) and a sample of $n = 10$ with $X = 5$ successes
(ii) a Beta (5, 5) prior and $n = 2$ with $X = 1$ success
In both cases the posterior Beta $(x+\alpha, n-x+\beta)$ is Beta (6, 6).
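A final Python sketch (standard library only) confirming that the two prior/data combinations lead to the same posterior:

```python
import math

def beta_pdf(t, a, b):
    return math.gamma(a + b) / (math.gamma(a) * math.gamma(b)) * t**(a - 1) * (1 - t)**(b - 1)

def posterior_params(alpha, beta, n, x):
    """Beta(alpha, beta) prior plus x successes in n trials -> Beta(x+alpha, n-x+beta)."""
    return x + alpha, n - x + beta

print(posterior_params(1, 1, 10, 5))   # (6, 6)
print(posterior_params(5, 5, 2, 1))    # (6, 6)

# Same posterior density, e.g. evaluated at theta = 0.3
a1, b1 = posterior_params(1, 1, 10, 5)
a2, b2 = posterior_params(5, 5, 2, 1)
print(beta_pdf(0.3, a1, b1), beta_pdf(0.3, a2, b2))
```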