Two Discrete Random Variables

The probability mass function (pmf) of a single discrete rv X specifies how much probability mass is placed on each possible X value. The joint pmf of two discrete rv's X and Y describes how much probability mass is placed on each possible pair of values (x, y).

Definition. Let X and Y be two discrete rv's defined on the sample space of an experiment. The joint probability mass function p(x, y) is defined for each pair of numbers (x, y) by

p(x, y) = P(X = x \text{ and } Y = y)

It must be the case that p(x, y) \ge 0 and \sum_x \sum_y p(x, y) = 1.

Now let A be any set consisting of pairs of (x, y) values (e.g., A = {(x, y): x + y = 5} or {(x, y): max(x, y) ≤ 3}). Then the probability P[(X, Y) ∈ A] is obtained by summing the joint pmf over pairs in A:

P[(X, Y) \in A] = \sum_{(x, y) \in A} p(x, y)

Definition. The marginal probability mass function of X, denoted by p_X(x), is given by

p_X(x) = \sum_y p(x, y) for each possible value x

Similarly, the marginal probability mass function of Y is

p_Y(y) = \sum_x p(x, y) for each possible value y

Two Continuous Random Variables

The probability that the observed value of a continuous rv X lies in a one-dimensional set A (such as an interval) is obtained by integrating the pdf f(x) over the set A. Similarly, the probability that the pair (X, Y) of continuous rv's falls in a two-dimensional set A (such as a rectangle) is obtained by integrating a function called the joint density function.

Definition. Let X and Y be continuous rv's. A joint probability density function f(x, y) for these two variables is a function satisfying

f(x, y) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1

Then for any two-dimensional set A,

P[(X, Y) \in A] = \iint_A f(x, y)\, dx\, dy

In particular, if A is the two-dimensional rectangle {(x, y): a ≤ x ≤ b, c ≤ y ≤ d}, then

P(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f(x, y)\, dy\, dx

We can think of f(x, y) as specifying a surface at height f(x, y) above the point (x, y) in a three-dimensional coordinate system. Then P[(X, Y) ∈ A] is the volume underneath this surface and above the region A, analogous to the area under a curve in the case of a single rv. This is illustrated in Figure 5.1.

Figure 5.1: P[(X, Y) ∈ A] = volume under the density surface above A.

The marginal pdf of each variable can be obtained in a manner analogous to what we did in the case of two discrete variables. The marginal pdf of X at the value x results from holding x fixed in the pair (x, y) and integrating the joint pdf over y; integrating the joint pdf with respect to x gives the marginal pdf of Y.

Definition. The marginal probability density functions of X and Y, denoted by f_X(x) and f_Y(y), respectively, are given by

f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy for -\infty < x < \infty

f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx for -\infty < y < \infty

Independent Random Variables

Here is an analogous definition for the independence of two rv's.

Definition. Two random variables X and Y are said to be independent if for every pair of x and y values,

p(x, y) = p_X(x) \cdot p_Y(y) when X and Y are discrete    (5.1)
f(x, y) = f_X(x) \cdot f_Y(y) when X and Y are continuous

If (5.1) is not satisfied for all (x, y), then X and Y are said to be dependent.

The definition says that two variables are independent if their joint pmf or pdf is the product of the two marginal pmf's or pdf's. Intuitively, independence says that knowing the value of one of the variables does not provide additional information about what the value of the other variable might be.
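As a concrete illustration of marginal pmf's and the independence check, here is a minimal Python sketch (not from the text). The joint pmf table `p` and the value sets it implicitly ranges over are made-up assumptions chosen so that the table happens to factor into its marginals.

```python
import numpy as np

# Hypothetical joint pmf table: rows index x in {0, 1, 2}, columns index y in {0, 1}.
p = np.array([[0.10, 0.15],
              [0.20, 0.30],
              [0.10, 0.15]])
assert np.isclose(p.sum(), 1.0)       # total probability must be 1

# Marginal pmf's: sum the joint pmf over the other variable.
p_X = p.sum(axis=1)                   # p_X(x) = sum over y of p(x, y)
p_Y = p.sum(axis=0)                   # p_Y(y) = sum over x of p(x, y)

# Independence check: p(x, y) = p_X(x) * p_Y(y) for every pair (x, y).
independent = np.allclose(p, np.outer(p_X, p_Y))
print(p_X, p_Y, independent)          # this particular table factors, so True
```

Because this particular table equals the outer product of its marginals, the check returns True; changing any entry (while keeping the total at 1) would in general make X and Y dependent.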
Independence of two random variables is most useful when the description of the experiment under study suggests that X and Y have no effect on one another. Then once the marginal pmf's or pdf's have been specified, the joint pmf or pdf is simply the product of the two marginal functions. It follows, for example, that

P(a \le X \le b,\ c \le Y \le d) = P(a \le X \le b) \cdot P(c \le Y \le d)

More Than Two Random Variables

To model the joint behavior of more than two random variables, we extend the concept of a joint distribution of two variables.

Definition. If X_1, X_2, ..., X_n are all discrete random variables, the joint pmf of the variables is the function

p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)

If the variables are continuous, the joint pdf of X_1, ..., X_n is the function f(x_1, x_2, ..., x_n) such that for any n intervals [a_1, b_1], ..., [a_n, b_n],

P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n)\, dx_n \cdots dx_1

In a binomial experiment, each trial could result in one of only two possible outcomes. Consider now an experiment consisting of n independent and identical trials, in which each trial can result in any one of r possible outcomes. Let p_i = P(outcome i on any particular trial), and define random variables by X_i = the number of trials resulting in outcome i (i = 1, ..., r). Such an experiment is called a multinomial experiment, and the joint pmf of X_1, ..., X_r is called the multinomial distribution.

The notion of independence of more than two random variables is similar to the notion of independence of more than two events.

Definition. The random variables X_1, X_2, ..., X_n are said to be independent if for every subset of the variables (each pair, each triple, and so on), the joint pmf or pdf of the subset is equal to the product of the marginal pmf's or pdf's.

Conditional Distributions

Definition. Let X and Y be two continuous rv's with joint pdf f(x, y) and marginal X pdf f_X(x). Then for any x value for which f_X(x) > 0, the conditional probability density function of Y given that X = x is

f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} \qquad -\infty < y < \infty

If X and Y are discrete, replacing pdf's by pmf's in this definition gives the conditional probability mass function of Y when X = x.

Notice that the definition of f_{Y|X}(y | x) parallels that of P(B | A), the conditional probability that B will occur given that A has occurred. Once the conditional pdf or pmf has been determined, conditional probability questions about Y can be answered by integrating or summing it over an appropriate set of Y values.

Expected Values, Covariance, and Correlation

Proposition. Let X and Y be jointly distributed rv's with pmf p(x, y) or pdf f(x, y) according to whether the variables are discrete or continuous. Then the expected value of a function h(X, Y), denoted by E[h(X, Y)] or \mu_{h(X,Y)}, is given by

E[h(X, Y)] = \sum_x \sum_y h(x, y)\, p(x, y) if X and Y are discrete

E[h(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y)\, f(x, y)\, dx\, dy if X and Y are continuous
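The discrete version of this proposition is easy to carry out numerically. The following Python sketch (not from the text) evaluates E[h(X, Y)] for two example choices of h over a small joint pmf; the table `p` and the value sets `x_vals`, `y_vals` are hypothetical.

```python
import numpy as np

# Hypothetical joint pmf: rows index the x-values, columns index the y-values.
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([0.0, 1.0])
p = np.array([[0.20, 0.05],
              [0.10, 0.30],
              [0.05, 0.30]])
assert np.isclose(p.sum(), 1.0)

# Lay the x- and y-values out over the table so they align with p(x, y).
X, Y = np.meshgrid(x_vals, y_vals, indexing="ij")

# E[h(X, Y)] = sum over x and y of h(x, y) * p(x, y), for two example h's.
E_product = np.sum(X * Y * p)               # h(x, y) = x * y
E_max = np.sum(np.maximum(X, Y) * p)        # h(x, y) = max(x, y)
print(E_product, E_max)
```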
Covariance

When two random variables X and Y are not independent, it is frequently of interest to assess how strongly they are related to one another.

Definition. The covariance between two rv's X and Y is

Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]
= \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) if X, Y are discrete
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy if X, Y are continuous

Suppose X and Y have a strong positive relationship, so that large values of X tend to occur with large values of Y and small values with small values. Then most of the probability mass or density will be associated with pairs in which (x - μ_X) and (y - μ_Y) are either both positive (both X and Y above their respective means) or both negative, so the product (x - μ_X)(y - μ_Y) will tend to be positive. Thus for a strong positive relationship, Cov(X, Y) should be quite positive. For a strong negative relationship, the signs of (x - μ_X) and (y - μ_Y) will tend to be opposite, yielding a negative product, so Cov(X, Y) should be quite negative. If X and Y are not strongly related, positive and negative products will tend to cancel one another, yielding a covariance near 0.

Figure 5.4 illustrates the different possibilities. The covariance depends on both the set of possible pairs and the probabilities. In Figure 5.4, the probabilities could be changed without altering the set of possible pairs, and this could drastically change the value of Cov(X, Y).

Figure 5.4: p(x, y) = 1/10 for each of ten pairs corresponding to the indicated points: (a) positive covariance; (b) negative covariance; (c) covariance near zero.

The following shortcut formula for Cov(X, Y) simplifies the computations.

Proposition.

Cov(X, Y) = E(XY) - \mu_X \cdot \mu_Y

According to this formula, no intermediate subtractions are necessary; only at the end of the computation is μ_X · μ_Y subtracted from E(XY). The proof involves expanding (X - μ_X)(Y - μ_Y) and then taking the expected value of each term separately.

Correlation

Definition. The correlation coefficient of X and Y, denoted by Corr(X, Y), ρ_{X,Y}, or just ρ, is defined by

\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}

The following proposition shows that ρ remedies a defect of Cov(X, Y) (its value depends on the measurement scales of X and Y) and also suggests how to recognize the existence of a strong (linear) relationship.

Proposition.
1. If a and c are either both positive or both negative, Corr(aX + b, cY + d) = Corr(X, Y).
2. For any two rv's X and Y, -1 ≤ Corr(X, Y) ≤ 1.

If we think of p(x, y) or f(x, y) as prescribing a mathematical model for how the two numerical variables X and Y are distributed in some population (height and weight, verbal SAT score and quantitative SAT score, etc.), then ρ is a population characteristic or parameter that measures how strongly X and Y are related in the population. We will later take a sample of pairs (x_1, y_1), ..., (x_n, y_n) from the population; the sample correlation coefficient r will then be defined and used to make inferences about ρ.

The correlation coefficient is actually not a completely general measure of the strength of a relationship.

Proposition.
1. If X and Y are independent, then ρ = 0, but ρ = 0 does not imply independence.
2. ρ = 1 or -1 if and only if Y = aX + b for some numbers a and b with a ≠ 0.

This proposition says that ρ is a measure of the degree of linear relationship between X and Y, and only when the two variables are perfectly related in a linear manner will ρ be as positive or negative as it can be. A ρ less than 1 in absolute value indicates only that the relationship is not completely linear; there may still be a very strong nonlinear relation. Also, ρ = 0 does not imply that X and Y are independent, only that there is a complete absence of a linear relationship. When ρ = 0, X and Y are said to be uncorrelated. Two variables could be uncorrelated yet highly dependent because there is a strong nonlinear relationship, so be careful not to conclude too much from knowing that ρ = 0.
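To make the last point concrete, here is a small Python sketch (an illustration, not an example from the text): X is taken to be uniform on {-1, 0, 1} and Y = X^2, so Y is completely determined by X, yet the covariance and correlation are exactly 0.

```python
import numpy as np

# X is uniform on {-1, 0, 1}; Y = X**2 is a deterministic function of X.
x_vals = np.array([-1.0, 0.0, 1.0])
p_x = np.array([1/3, 1/3, 1/3])
y_vals = x_vals**2

mu_X = np.sum(x_vals * p_x)                      # 0
mu_Y = np.sum(y_vals * p_x)                      # 2/3
E_XY = np.sum(x_vals * y_vals * p_x)             # E(XY) = E(X**3) = 0

cov = E_XY - mu_X * mu_Y                         # shortcut formula: Cov(X, Y) = 0
sigma_X = np.sqrt(np.sum((x_vals - mu_X)**2 * p_x))
sigma_Y = np.sqrt(np.sum((y_vals - mu_Y)**2 * p_x))
rho = cov / (sigma_X * sigma_Y)
print(cov, rho)   # both 0, even though Y is completely determined by X
```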
A value of ρ near 1 does not necessarily imply that increasing the value of X causes Y to increase; it implies only that large X values are associated with large Y values. For example, in the population of children, vocabulary size and number of cavities are quite positively correlated, but it is certainly not true that cavities cause vocabulary to grow. Instead, the values of both variables tend to increase as the value of a third variable, age, increases. For children of a fixed age, there is probably a low correlation between number of cavities and vocabulary size. In summary, association (a high correlation) is not the same as causation.

The Distribution of the Sample Mean

The importance of the sample mean X̄ springs from its use in drawing conclusions about the population mean μ. Some of the most frequently used inferential procedures are based on properties of the sampling distribution of X̄. A preview of these properties appeared in the calculations and simulation experiments of the previous section, where we noted relationships between E(X̄) and μ and also among V(X̄), σ², and n.

Proposition. Let X_1, X_2, ..., X_n be a random sample from a distribution with mean value μ and standard deviation σ. Then

1. E(\bar{X}) = \mu_{\bar{X}} = \mu
2. V(\bar{X}) = \sigma^2_{\bar{X}} = \sigma^2 / n and \sigma_{\bar{X}} = \sigma / \sqrt{n}

In addition, with T_o = X_1 + \cdots + X_n (the sample total),

E(T_o) = n\mu, \quad V(T_o) = n\sigma^2, \quad \sigma_{T_o} = \sqrt{n}\, \sigma

The standard deviation σ_{X̄} = σ/√n is often called the standard error of the mean; it describes the magnitude of a typical or representative deviation of the sample mean from the population mean.

The Case of a Normal Population Distribution

Proposition. Let X_1, X_2, ..., X_n be a random sample from a normal distribution with mean μ and standard deviation σ. Then for any n, X̄ is normally distributed (with mean μ and standard deviation σ/√n), as is T_o (with mean nμ and standard deviation √n σ).

We know everything there is to know about the X̄ and T_o distributions when the population distribution is normal. In particular, probabilities such as P(a ≤ X̄ ≤ b) and P(c ≤ T_o ≤ d) can be obtained simply by standardizing.

The Central Limit Theorem

When the X_i's are normally distributed, so is X̄ for every sample size n. Even when the population distribution is highly nonnormal, averaging produces a distribution more bell-shaped than the one being sampled. A reasonable conjecture is that if n is large, a suitable normal curve will approximate the actual distribution of X̄. The formal statement of this result is the most important theorem of probability.

Theorem (The Central Limit Theorem, CLT). Let X_1, X_2, ..., X_n be a random sample from a distribution with mean μ and variance σ². Then if n is sufficiently large, X̄ has approximately a normal distribution with

\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma^2_{\bar{X}} = \sigma^2 / n

and T_o also has approximately a normal distribution with

\mu_{T_o} = n\mu \quad \text{and} \quad \sigma^2_{T_o} = n\sigma^2

The larger the value of n, the better the approximation.

According to the CLT, when n is large and we wish to calculate a probability such as P(a ≤ X̄ ≤ b), we need only "pretend" that X̄ is normal, standardize it, and use the normal table. The resulting answer will be approximately correct. The exact answer could be obtained only by first finding the distribution of X̄, so the CLT provides a truly impressive shortcut.
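The following Python simulation (an illustration, not from the text) gives a feel for the theorem. Sample means of size n = 40 from a skewed exponential population are standardized using μ and σ/√n; if the approximation is good, the empirical probability should be close to the standard normal value Φ(1) ≈ 0.8413. The population choice, sample size, and replication count are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential with mean mu = 2, so sigma = 2 as well.
mu, sigma, n, reps = 2.0, 2.0, 40, 100_000

# Draw `reps` samples of size n and compute each sample mean.
xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

# Standardize with the exact mean mu and standard error sigma / sqrt(n).
z = (xbars - mu) / (sigma / np.sqrt(n))

# If the normal approximation is good, this should be close to Phi(1) ~ 0.8413.
print(np.mean(z <= 1.0))
```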
Other Applications of the Central Limit Theorem

The CLT can be used to justify the normal approximation to the binomial distribution discussed earlier. We know that a binomial variable X is the number of successes in a binomial experiment consisting of n independent success/failure trials with p = P(S) for any particular trial. Define a new rv X_1 by

X_1 = 1 if the first trial results in a success, and X_1 = 0 if it results in a failure,

and define X_2, X_3, ..., X_n analogously for the other n - 1 trials. Each X_i indicates whether or not there is a success on the corresponding trial.

Because the trials are independent and P(S) is constant from trial to trial, the X_i's are iid (a random sample from a Bernoulli distribution). The CLT then implies that if n is sufficiently large, both the sum and the average of the X_i's have approximately normal distributions. When the X_i's are summed, a 1 is added for every S that occurs and a 0 for every F, so X_1 + ... + X_n = X. The sample mean of the X_i's is X/n, the sample proportion of successes. That is, both X and X/n are approximately normal when n is large.

The necessary sample size for this approximation depends on the value of p: when p is close to .5, the distribution of each X_i is reasonably symmetric (see Figure 5.19), whereas the distribution is quite skewed when p is near 0 or 1. Using the approximation only if both np ≥ 10 and n(1 - p) ≥ 10 ensures that n is large enough to overcome any skewness in the underlying Bernoulli distribution.

Figure 5.19: Two Bernoulli distributions: (a) p = .4 (reasonably symmetric); (b) p = .1 (very skewed).

The Distribution of a Linear Combination

The sample mean X̄ and sample total T_o are special cases of a type of random variable that arises very frequently in statistical applications.

Definition. Given a collection of n random variables X_1, ..., X_n and n numerical constants a_1, ..., a_n, the rv

Y = a_1 X_1 + \cdots + a_n X_n = \sum_{i=1}^{n} a_i X_i    (5.7)

is called a linear combination of the X_i's.

Proposition. Let X_1, X_2, ..., X_n have mean values μ_1, ..., μ_n, respectively, and variances σ_1², ..., σ_n², respectively.

1. Whether or not the X_i's are independent,

E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n) = a_1 \mu_1 + \cdots + a_n \mu_n    (5.8)

2. If X_1, ..., X_n are independent,

V(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1^2 V(X_1) + a_2^2 V(X_2) + \cdots + a_n^2 V(X_n) = a_1^2 \sigma_1^2 + \cdots + a_n^2 \sigma_n^2    (5.9)

and

\sigma_{a_1 X_1 + \cdots + a_n X_n} = \sqrt{a_1^2 \sigma_1^2 + \cdots + a_n^2 \sigma_n^2}    (5.10)

3. For any X_1, ..., X_n,

V(a_1 X_1 + \cdots + a_n X_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, \mathrm{Cov}(X_i, X_j)    (5.11)
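As a quick numerical illustration of (5.8) and (5.9), the formulas can be checked against a simulation with independent normal X_i's. This is an illustrative Python sketch only; the coefficients, means, and standard deviations below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear combination Y = 2*X1 - 1*X2 + 0.5*X3 with independent Xi's.
a = np.array([2.0, -1.0, 0.5])        # coefficients a_i
mu = np.array([1.0, 4.0, 10.0])       # means mu_i (made up)
sigma = np.array([0.5, 1.0, 2.0])     # standard deviations sigma_i (made up)

# Formulas (5.8) and (5.9) for independent Xi's.
mean_Y = np.sum(a * mu)               # = 3.0
var_Y = np.sum(a**2 * sigma**2)       # = 3.0

# Simulation cross-check using independent normal Xi's.
X = rng.normal(loc=mu, scale=sigma, size=(200_000, 3))
Y = X @ a
print(mean_Y, var_Y)                  # exact values from the proposition
print(Y.mean(), Y.var())              # simulated values, close to the exact ones
```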