5 Joint Probability Distributions and Random Samples

5.1 Jointly Distributed Random Variables

Two Discrete Random Variables

The probability mass function (pmf) of a single discrete rv X specifies how much probability mass is placed on each possible X value. The joint pmf of two discrete rv's X and Y describes how much probability mass is placed on each possible pair of values (x, y).

Definition
Let X and Y be two discrete rv's defined on the sample space of an experiment. The joint probability mass function p(x, y) is defined for each pair of numbers (x, y) by

  p(x, y) = P(X = x and Y = y)

It must be the case that p(x, y) ≥ 0 and Σx Σy p(x, y) = 1.

Now let A be any set consisting of pairs of (x, y) values (e.g., A = {(x, y): x + y = 5} or {(x, y): max(x, y) ≤ 3}). Then the probability P[(X, Y) ∈ A] is obtained by summing the joint pmf over pairs in A:

  P[(X, Y) ∈ A] = Σ Σ_{(x, y) ∈ A} p(x, y)

Example 1
A large insurance agency services a number of customers who have purchased both a homeowner's policy and an automobile policy from the agency. For each type of policy, a deductible amount must be specified. For an automobile policy, the choices are $100 and $250, whereas for a homeowner's policy, the choices are 0, $100, and $200.

Suppose an individual with both types of policy is selected at random from the agency's files. Let X = the deductible amount on the auto policy and Y = the deductible amount on the homeowner's policy.

Possible (X, Y) pairs are then (100, 0), (100, 100), (100, 200), (250, 0), (250, 100), and (250, 200); the joint pmf specifies the probability associated with each one of these pairs, with any other pair having probability zero. Suppose the joint pmf is given in the accompanying joint probability table:

                    y
  p(x, y)       0      100     200
  x    100     .20     .10     .20
       250     .05     .15     .30

Then p(100, 100) = P(X = 100 and Y = 100) = P($100 deductible on both policies) = .10.
The probability P(Y ≥ 100) is computed by summing probabilities of all (x, y) pairs for which y ≥ 100:

  P(Y ≥ 100) = p(100, 100) + p(250, 100) + p(100, 200) + p(250, 200) = .75

Definition
The marginal probability mass function of X, denoted by pX(x), is given by

  pX(x) = Σy p(x, y)   for each possible value x

Similarly, the marginal probability mass function of Y is

  pY(y) = Σx p(x, y)   for each possible value y

Example 2 (Example 1 continued)
The possible X values are x = 100 and x = 250, so computing row totals in the joint probability table yields

  pX(100) = p(100, 0) + p(100, 100) + p(100, 200) = .50
  pX(250) = p(250, 0) + p(250, 100) + p(250, 200) = .50

The marginal pmf of X is then pX(x) = .5 for x = 100 or 250 (and 0 otherwise). Similarly, the marginal pmf of Y is obtained from column totals, so

  P(Y ≥ 100) = pY(100) + pY(200) = .75

as before.

Two Continuous Random Variables

The probability that the observed value of a continuous rv X lies in a one-dimensional set A (such as an interval) is obtained by integrating the pdf f(x) over the set A. Similarly, the probability that the pair (X, Y) of continuous rv's falls in a two-dimensional set A (such as a rectangle) is obtained by integrating a function called the joint density function.

Definition
Let X and Y be continuous rv's. A joint probability density function f(x, y) for these two variables is a function satisfying f(x, y) ≥ 0 and

  ∫∫ f(x, y) dx dy = 1   (integrated over the entire plane)

Then for any two-dimensional set A,

  P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy

In particular, if A is the two-dimensional rectangle {(x, y): a ≤ x ≤ b, c ≤ y ≤ d}, then

  P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx

We can think of f(x, y) as specifying a surface at height f(x, y) above the point (x, y) in a three-dimensional coordinate system. Then P[(X, Y) ∈ A] is the volume underneath this surface and above the region A, analogous to the area under a curve in the case of a single rv. This is illustrated in Figure 5.1.
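The table manipulations above are easy to mirror in code. A minimal sketch, with the joint table assembled from the probabilities quoted in this section's examples (p(100, 100) = .10, row totals pX(100) = pX(250) = .50, and P(Y ≥ 100) = .75):

```python
# Joint pmf for the insurance example: keys are (x, y) deductible pairs.
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}

def marginal(pmf, axis):
    """Sum the joint pmf over the other coordinate (axis 0 -> pX, axis 1 -> pY)."""
    out = {}
    for pair, p in pmf.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

p_x = marginal(joint_pmf, 0)   # row totals:    {100: 0.5, 250: 0.5}
p_y = marginal(joint_pmf, 1)   # column totals: {0: 0.25, 100: 0.25, 200: 0.5}

# P(Y >= 100): sum the joint pmf over pairs whose y-coordinate is >= 100.
prob = sum(p for (x, y), p in joint_pmf.items() if y >= 100)
print(p_x, p_y, round(prob, 2))
```

The same `marginal` helper gives both row and column totals, and the event probability is just a filtered sum over the table.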
Figure 5.1: P[(X, Y) ∈ A] = volume under the density surface above A

Example 3
A bank operates both a drive-up facility and a walk-up window. On a randomly selected day, let X = the proportion of time that the drive-up facility is in use and Y = the proportion of time that the walk-up window is in use. Then the set of possible values for (X, Y) is the rectangle D = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}. Suppose the joint pdf of (X, Y) is given by

  f(x, y) = (6/5)(x + y²)   0 ≤ x ≤ 1, 0 ≤ y ≤ 1
          = 0               otherwise

To verify that this is a legitimate pdf, note that f(x, y) ≥ 0 and

  ∫₀¹ ∫₀¹ (6/5)(x + y²) dx dy = (6/5)(1/2 + 1/3) = 1

The probability that neither facility is busy more than one-quarter of the time is

  P(0 ≤ X ≤ 1/4, 0 ≤ Y ≤ 1/4) = ∫₀^(1/4) ∫₀^(1/4) (6/5)(x + y²) dx dy = .0109

The marginal pdf of each variable can be obtained in a manner analogous to what we did in the case of two discrete variables. The marginal pdf of X at the value x results from holding x fixed in the pair (x, y) and integrating the joint pdf over y; integrating the joint pdf with respect to x gives the marginal pdf of Y.

Definition
The marginal probability density functions of X and Y, denoted by fX(x) and fY(y), respectively, are given by

  fX(x) = ∫ f(x, y) dy   for –∞ < x < ∞
  fY(y) = ∫ f(x, y) dx   for –∞ < y < ∞

(each integral taken from –∞ to ∞).

Independent Random Variables

In many situations, information about the observed value of one of the two variables X and Y gives information about the value of the other variable. In Example 1, the marginal probability of X at x = 250 was .5, as was the probability that X = 100. If, however, we are told that the selected individual had Y = 0, then X = 100 is four times as likely as X = 250. Thus there is a dependence between the two variables.

Earlier, we pointed out that one way of defining independence of two events is via the condition P(A ∩ B) = P(A) · P(B). Here is an analogous definition for the independence of two rv's.
Definition
Two random variables X and Y are said to be independent if for every pair of x and y values

  p(x, y) = pX(x) · pY(y)   when X and Y are discrete      (5.1)
  f(x, y) = fX(x) · fY(y)   when X and Y are continuous

If (5.1) is not satisfied for all (x, y), then X and Y are said to be dependent.

The definition says that two variables are independent if their joint pmf or pdf is the product of the two marginal pmf's or pdf's. Intuitively, independence says that knowing the value of one of the variables does not provide additional information about what the value of the other variable might be.

Example 6
In the insurance situation of Examples 1 and 2,

  p(100, 100) = .10 ≠ (.5)(.25) = pX(100) · pY(100)

so X and Y are not independent. Independence of X and Y requires that every entry in the joint probability table be the product of the corresponding row and column marginal probabilities.

Independence of two random variables is most useful when the description of the experiment under study suggests that X and Y have no effect on one another. Then once the marginal pmf's or pdf's have been specified, the joint pmf or pdf is simply the product of the two marginal functions. It follows that

  P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) · P(c ≤ Y ≤ d)

5.2 Expected Values, Covariance, and Correlation

Proposition
Let X and Y be jointly distributed rv's with pmf p(x, y) or pdf f(x, y) according to whether the variables are discrete or continuous. Then the expected value of a function h(X, Y), denoted by E[h(X, Y)] or μh(X, Y), is given by

  E[h(X, Y)] = Σx Σy h(x, y) · p(x, y)       if X and Y are discrete
  E[h(X, Y)] = ∫∫ h(x, y) · f(x, y) dx dy    if X and Y are continuous

Example 13
Five friends have purchased tickets to a certain concert. If the tickets are for seats 1–5 in a particular row and the tickets are randomly distributed among the five, what is the expected number of seats separating any particular two of the five?
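Before working out the answer analytically, it can be previewed by brute-force enumeration. A sketch, assuming (as the example intends) that all ordered pairs of distinct seats for the two individuals are equally likely:

```python
from itertools import permutations

# Seats 1..5; the two particular friends occupy one of the 20 equally
# likely ordered pairs of distinct seats, each with probability 1/20.
pairs = list(permutations(range(1, 6), 2))

# Number of seats separating them is |x - y| - 1; average over all pairs.
expected_separation = sum((abs(x - y) - 1) / len(pairs) for x, y in pairs)
print(round(expected_separation, 6))
```

The enumeration averages |x – y| – 1 over the 20 pairs, which is exactly the expected-value sum derived next.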
Let X and Y denote the seat numbers of the first and second individuals, respectively. Possible (X, Y) pairs are {(1, 2), (1, 3), . . . , (5, 4)}, and the joint pmf of (X, Y) is

  p(x, y) = 1/20   x = 1, . . . , 5; y = 1, . . . , 5; x ≠ y
          = 0      otherwise

The number of seats separating the two individuals is h(X, Y) = |X – Y| – 1. Tabulating h(x, y) for each possible (x, y) pair and applying the proposition gives

  E[h(X, Y)] = Σ Σ_{(x, y)} (|x – y| – 1) · (1/20) = 1

Covariance

When two random variables X and Y are not independent, it is frequently of interest to assess how strongly they are related to one another.

Definition
The covariance between two rv's X and Y is

  Cov(X, Y) = E[(X – μX)(Y – μY)]
            = Σx Σy (x – μX)(y – μY) p(x, y)          X, Y discrete
            = ∫∫ (x – μX)(y – μY) f(x, y) dx dy       X, Y continuous

That is, since X – μX and Y – μY are the deviations of the two variables from their respective mean values, the covariance is the expected product of deviations. Note that Cov(X, X) = E[(X – μX)²] = V(X).

The rationale for the definition is as follows. Suppose X and Y have a strong positive relationship to one another, by which we mean that large values of X tend to occur with large values of Y and small values of X with small values of Y. Then most of the probability mass or density will be associated with (x – μX) and (y – μY) either both positive (both X and Y above their respective means) or both negative, so the product (x – μX)(y – μY) will tend to be positive. Thus for a strong positive relationship, Cov(X, Y) should be quite positive. For a strong negative relationship, the signs of (x – μX) and (y – μY) will tend to be opposite, yielding a negative product. Thus for a strong negative relationship, Cov(X, Y) should be quite negative. If X and Y are not strongly related, positive and negative products will tend to cancel one another, yielding a covariance near 0.

Figure 5.4 illustrates the different possibilities. The covariance depends on both the set of possible pairs and the probabilities.
In Figure 5.4, the probabilities could be changed without altering the set of possible pairs, and this could drastically change the value of Cov(X, Y).

Figure 5.4: p(x, y) = 1/10 for each of ten pairs corresponding to the indicated points: (a) positive covariance; (b) negative covariance; (c) covariance near zero

Example 15
The joint and marginal pmf's for X = automobile policy deductible amount and Y = homeowner policy deductible amount in Example 1 give

  μX = Σ x · pX(x) = 175   and   μY = Σ y · pY(y) = 125

Therefore,

  Cov(X, Y) = Σ Σ_{(x, y)} (x – 175)(y – 125) p(x, y)
            = (100 – 175)(0 – 125)(.20) + . . . + (250 – 175)(200 – 125)(.30)
            = 1875

The following shortcut formula for Cov(X, Y) simplifies the computations.

Proposition
  Cov(X, Y) = E(XY) – μX · μY

According to this formula, no intermediate subtractions are necessary; only at the end of the computation is μX · μY subtracted from E(XY). The proof involves expanding (X – μX)(Y – μY) and then taking the expected value of each term separately.

Correlation

Definition
The correlation coefficient of X and Y, denoted by Corr(X, Y), ρX,Y, or just ρ, is defined by

  ρX,Y = Cov(X, Y) / (σX · σY)

Example 17
It is easily verified that in the insurance scenario of Example 15, E(X²) = 36,250, σX² = 36,250 – (175)² = 5625, σX = 75, E(Y²) = 22,500, σY² = 6875, and σY = 82.92. This gives

  ρ = 1875 / [(75)(82.92)] = .301

The following proposition shows that ρ remedies the defect of Cov(X, Y) and also suggests how to recognize the existence of a strong (linear) relationship.

Proposition
1. If a and c are either both positive or both negative,
     Corr(aX + b, cY + d) = Corr(X, Y)
2. For any two rv's X and Y, –1 ≤ Corr(X, Y) ≤ 1.
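The covariance of Example 15 and the correlation of Example 17 can both be checked with the shortcut formula. A sketch, using the joint-table entries quoted in those examples:

```python
import math

# Insurance example: verify Cov(X, Y) = E(XY) - muX*muY and
# Corr(X, Y) = Cov(X, Y) / (sigmaX * sigmaY).
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}
mu_x = sum(x * p for (x, y), p in joint_pmf.items())      # 175
mu_y = sum(y * p for (x, y), p in joint_pmf.items())      # 125
e_xy = sum(x * y * p for (x, y), p in joint_pmf.items())  # E(XY)
cov = e_xy - mu_x * mu_y                                  # 1875

# Variances via the one-variable shortcut E(X^2) - mu^2.
var_x = sum(x**2 * p for (x, y), p in joint_pmf.items()) - mu_x**2  # 5625
var_y = sum(y**2 * p for (x, y), p in joint_pmf.items()) - mu_y**2  # 6875
rho = cov / math.sqrt(var_x * var_y)
print(round(cov, 2), round(rho, 2))   # Cov = 1875, rho ≈ .30
```

Because the shortcut needs only E(XY), μX, and μY, there are no intermediate deviation terms to track.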
If we think of p(x, y) or f(x, y) as prescribing a mathematical model for how the two numerical variables X and Y are distributed in some population (height and weight, verbal SAT score and quantitative SAT score, etc.), then ρ is a population characteristic or parameter that measures how strongly X and Y are related in the population. We will consider taking a sample of pairs (x1, y1), . . . , (xn, yn) from the population. The sample correlation coefficient r will then be defined and used to make inferences about ρ.

The correlation coefficient ρ is actually not a completely general measure of the strength of a relationship.

Proposition
1. If X and Y are independent, then ρ = 0, but ρ = 0 does not imply independence.
2. ρ = 1 or –1 iff Y = aX + b for some numbers a and b with a ≠ 0.

This proposition says that ρ is a measure of the degree of linear relationship between X and Y, and only when the two variables are perfectly related in a linear manner will ρ be as positive or negative as it can be. A ρ less than 1 in absolute value indicates only that the relationship is not completely linear, but there may still be a very strong nonlinear relation.

Also, ρ = 0 does not imply that X and Y are independent, but only that there is a complete absence of a linear relationship. When ρ = 0, X and Y are said to be uncorrelated. Two variables could be uncorrelated yet highly dependent because there is a strong nonlinear relationship, so be careful not to conclude too much from knowing that ρ = 0.

A value of ρ near 1 does not necessarily imply that increasing the value of X causes Y to increase. It implies only that large X values are associated with large Y values. For example, in the population of children, vocabulary size and number of cavities are quite positively correlated, but it is certainly not true that cavities cause vocabulary to grow.
Instead, the values of both these variables tend to increase as the value of age, a third variable, increases.

5.3 Statistics and Their Distributions

Definition
A statistic is any quantity whose value can be calculated from sample data. Prior to obtaining data, there is uncertainty as to what value of any particular statistic will result. Therefore, a statistic is a random variable and will be denoted by an uppercase letter; a lowercase letter is used to represent the calculated or observed value of the statistic.

Thus the sample mean, regarded as a statistic (before a sample has been selected or an experiment carried out), is denoted by X̄; the calculated value of this statistic is x̄. Similarly, S represents the sample standard deviation thought of as a statistic, and its computed value is s. If samples of two different types of bricks are selected and the individual compressive strengths are denoted by X1, . . . , Xm and Y1, . . . , Yn, respectively, then the statistic X̄ – Ȳ, the difference between the two sample mean compressive strengths, is often of great interest.

The probability distribution of a statistic is sometimes referred to as its sampling distribution to emphasize that it describes how the statistic varies in value across all samples that might be selected.

Random Samples

Definition
The rv's X1, X2, . . . , Xn are said to form a (simple) random sample of size n if
1. The Xi's are independent rv's.
2. Every Xi has the same probability distribution.

Conditions 1 and 2 can be paraphrased by saying that the Xi's are independent and identically distributed (iid). If sampling is either with replacement or from an infinite (conceptual) population, Conditions 1 and 2 are satisfied exactly.
These conditions will be approximately satisfied if sampling is without replacement and the sample size n is much smaller than the population size N. In practice, if n/N ≤ .05 (at most 5% of the population is sampled), we can proceed as if the Xi's form a random sample. The virtue of this sampling method is that the probability distribution of any statistic can be more easily obtained than for any other sampling method.

There are two general methods for obtaining information about a statistic's sampling distribution. One method involves calculations based on probability rules, and the other involves carrying out a simulation experiment.

Simulation Experiments

The following characteristics of an experiment must be specified:
1. The statistic of interest (X̄, S, a particular trimmed mean, etc.)
2. The population distribution (normal with μ = 100 and σ = 15, uniform with lower limit A = 5 and upper limit B = 10, etc.)
3. The sample size n (e.g., n = 10 or n = 50)
4. The number of replications k (number of samples to be obtained)

Then use appropriate software to obtain k different random samples, each of size n, from the designated population distribution. For each sample, calculate the value of the statistic and construct a histogram of the k values. This histogram gives the approximate sampling distribution of the statistic. The larger the value of k, the better the approximation will tend to be (the actual sampling distribution emerges as k → ∞). In practice, k = 500 or 1000 is usually sufficient if the statistic is "fairly simple."

The final aspect of the histograms to note is their spread relative to one another. The larger the value of n, the more concentrated is the sampling distribution about the mean value. This is why the histograms for n = 20 and n = 30 are based on narrower class intervals than those for the two smaller sample sizes.
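The four-step recipe above can be sketched as a small simulation. The population parameters here (normal with μ = 100 and σ = 15, n = 10, k = 1000) are just the illustrative values mentioned in the steps:

```python
import random
import statistics

# Simulation experiment: k replications of samples of size n from a normal
# population, recording the sample mean (the statistic of interest) each time.
random.seed(1)
mu, sigma, n, k = 100, 15, 10, 1000
xbars = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(k)]

# The k recorded values approximate the sampling distribution of the sample
# mean: centered near mu, with spread near sigma / sqrt(n) ≈ 4.74.
print(round(statistics.mean(xbars), 1), round(statistics.stdev(xbars), 1))
```

A histogram of `xbars` is the approximate sampling distribution described in the text; increasing k sharpens the approximation, while increasing n narrows the distribution itself.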
For the larger sample sizes, most of the x̄ values are quite close to 8.25. This is the effect of averaging. When n is small, a single unusual x value can result in an x̄ value far from the center. With a larger sample size, any unusual x values, when averaged in with the other sample values, still tend to yield an x̄ value close to μ. Combining these insights yields a result that should appeal to your intuition: X̄ based on a large n tends to be closer to μ than does X̄ based on a small n.

5.4 The Distribution of the Sample Mean
Copyright © Cengage Learning. All rights reserved.

The importance of the sample mean X̄ springs from its use in drawing conclusions about the population mean μ. Some of the most frequently used inferential procedures are based on properties of the sampling distribution of X̄. A preview of these properties appeared in the calculations and simulation experiments of the previous section, where we noted relationships between E(X̄) and μ and also among V(X̄), σ², and n.

Proposition
Let X1, X2, . . . , Xn be a random sample from a distribution with mean value μ and standard deviation σ. Then
1. E(X̄) = μ
2. V(X̄) = σ²/n and σX̄ = σ/√n

In addition, with To = X1 + . . . + Xn (the sample total), E(To) = nμ, V(To) = nσ², and σTo = √n · σ.

The sampling distribution of X̄ is centered precisely at the mean of the population. The distribution becomes more concentrated about μ as the sample size n increases. In contrast, the distribution of To becomes more spread out as n increases. Averaging moves probability in toward the middle, whereas totaling spreads probability out over a wider and wider range of values. The standard deviation σX̄ = σ/√n is often called the standard error of the mean.

Example 24
In a notched tensile fatigue test on a titanium specimen, the expected number of cycles to first acoustic emission (used to indicate crack initiation) is μ = 28,000, and the standard deviation of the number of cycles is σ = 5000. Let X1, X2, . . .
, X25 be a random sample of size 25, where each Xi is the number of cycles on a different randomly selected specimen. Then the expected value of the sample mean number of cycles until first emission is E(X̄) = μ = 28,000, and the expected total number of cycles for the 25 specimens is E(To) = nμ = 25(28,000) = 700,000.

The standard deviations of X̄ and To are

  σX̄ = σ/√n = 5000/√25 = 1000   (standard error of the mean)
  σTo = √n · σ = √25 (5000) = 25,000

If the sample size increases to n = 100, E(X̄) is unchanged, but σX̄ = 500, half of its previous value (the sample size must be quadrupled to halve the standard deviation of X̄).

The Case of a Normal Population Distribution

Proposition
Let X1, X2, . . . , Xn be a random sample from a normal distribution with mean μ and standard deviation σ. Then for any n, X̄ is normally distributed (with mean μ and standard deviation σ/√n), as is To (with mean nμ and standard deviation √n · σ).

We know everything there is to know about the X̄ and To distributions when the population distribution is normal. In particular, probabilities such as P(a ≤ X̄ ≤ b) and P(c ≤ To ≤ d) can be obtained simply by standardizing. Figure 5.14 illustrates the proposition.

Figure 5.14: A normal population distribution and X̄ sampling distributions

Example 25
The time that it takes a randomly selected rat of a certain subspecies to find its way through a maze is a normally distributed rv with μ = 1.5 min and σ = .35 min. Suppose five rats are selected. Let X1, . . . , X5 denote their times in the maze. Assuming the Xi's to be a random sample from this normal distribution, what is the probability that the total time To = X1 + . . . + X5 for the five is between 6 and 8 min?

By the proposition, To has a normal distribution with μTo = nμ = 5(1.5) = 7.5 and variance σ²To = nσ² = 5(.1225) = .6125, so σTo = .783. To standardize To, subtract μTo and divide by σTo:
  P(6 ≤ To ≤ 8) = P((6 – 7.5)/.783 ≤ Z ≤ (8 – 7.5)/.783)
                = P(–1.92 ≤ Z ≤ .64)
                = Φ(.64) – Φ(–1.92)
                = .7115

Determination of the probability that the sample average time X̄ (a normally distributed variable) is at most 2.0 min requires μX̄ = μ = 1.5 and σX̄ = σ/√n = .35/√5 = .1565. Then

  P(X̄ ≤ 2.0) = P(Z ≤ (2.0 – 1.5)/.1565) = P(Z ≤ 3.19) = .9993

The Central Limit Theorem

When the Xi's are normally distributed, so is X̄ for every sample size n. Even when the population distribution is highly nonnormal, averaging produces a distribution more bell-shaped than the one being sampled. A reasonable conjecture is that if n is large, a suitable normal curve will approximate the actual distribution of X̄. The formal statement of this result is the most important theorem of probability.

Theorem (The Central Limit Theorem, CLT)
Let X1, X2, . . . , Xn be a random sample from a distribution with mean μ and variance σ². Then if n is sufficiently large, X̄ has approximately a normal distribution with μX̄ = μ and σ²X̄ = σ²/n, and To also has approximately a normal distribution with μTo = nμ and σ²To = nσ². The larger the value of n, the better the approximation.

Figure 5.15: The Central Limit Theorem illustrated

Example 26
The amount of a particular impurity in a batch of a certain chemical product is a random variable with mean value 4.0 g and standard deviation 1.5 g. If 50 batches are independently prepared, what is the (approximate) probability that the sample average amount of impurity X̄ is between 3.5 and 3.8 g? According to the rule of thumb to be stated shortly, n = 50 is large enough for the CLT to be applicable. X̄ then has approximately a normal distribution with mean value μX̄ = 4.0 and σX̄ = 1.5/√50 = .2121, so

  P(3.5 ≤ X̄ ≤ 3.8) ≈ P((3.5 – 4.0)/.2121 ≤ Z ≤ (3.8 – 4.0)/.2121)
                    = Φ(–.94) – Φ(–2.36)
                    = .1645

The CLT provides insight into why many random variables have probability distributions that are approximately normal. For example, the measurement error in a scientific experiment can be thought of as the sum of a number of underlying perturbations and errors of small magnitude.
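Standardizations like the one in Example 26 can be reproduced without a normal table, since the standard normal cdf Φ can be written in terms of the error function. A sketch; the result differs slightly in the fourth decimal from a table lookup because the z-values are not rounded:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 26: X-bar is approximately normal with mean 4.0 and
# standard error 1.5/sqrt(50); standardize both endpoints.
mu, sigma, n = 4.0, 1.5, 50
se = sigma / sqrt(n)                      # ≈ 0.2121
prob = phi((3.8 - mu) / se) - phi((3.5 - mu) / se)
print(round(prob, 3))   # ≈ 0.164
```

The same `phi` helper handles Example 25 as well: standardize with μTo = 7.5 and σTo = .783 instead of the standard error.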
A practical difficulty in applying the CLT is in knowing when n is sufficiently large. The problem is that the accuracy of the approximation for a particular n depends on the shape of the original underlying distribution being sampled. If the underlying distribution is close to a normal density curve, then the approximation will be good even for a small n, whereas if it is far from being normal, then a large n will be required.

Rule of Thumb
If n > 30, the Central Limit Theorem can be used.

There are population distributions for which even an n of 40 or 50 does not suffice, but such distributions are rarely encountered in practice. On the other hand, the rule of thumb is often conservative; for many population distributions, an n much less than 30 would suffice. For example, in the case of a uniform population distribution, the CLT gives a good approximation for n ≥ 12.

5.5 The Distribution of a Linear Combination

The sample mean X̄ and sample total To are special cases of a type of random variable that arises very frequently in statistical applications.

Definition
Given a collection of n random variables X1, . . . , Xn and n numerical constants a1, . . . , an, the rv

  Y = a1X1 + . . . + anXn = Σ aiXi    (5.7)

is called a linear combination of the Xi's.

For example, 4X1 – 5X2 + 8X3 is a linear combination of X1, X2, and X3 with a1 = 4, a2 = –5, and a3 = 8. Taking a1 = a2 = . . . = an = 1 gives Y = X1 + . . . + Xn = To, and a1 = a2 = . . . = an = 1/n yields Y = X̄.

Proposition
Let X1, X2, . . . , Xn have mean values μ1, . . . , μn, respectively, and variances σ1², . . . , σn², respectively.
1. Whether or not the Xi's are independent,
     E(a1X1 + a2X2 + . . . + anXn) = a1E(X1) + a2E(X2) + . . . + anE(Xn)
                                   = a1μ1 + . . . + anμn    (5.8)
2. If X1, . . . , Xn are independent,
     V(a1X1 + a2X2 + . . .
     + anXn) = a1²V(X1) + . . . + an²V(Xn) = a1²σ1² + . . . + an²σn²    (5.9)

   and

     σY = √(a1²σ1² + . . . + an²σn²)    (5.10)

3. For any X1, . . . , Xn,
     V(a1X1 + . . . + anXn) = Σi Σj ai aj Cov(Xi, Xj)    (5.11)

Example 29
A gas station sells three grades of gasoline: regular, extra, and super. These are priced at $3.00, $3.20, and $3.40 per gallon, respectively. Let X1, X2, and X3 denote the amounts of these grades purchased (gallons) on a particular day. Suppose the Xi's are independent with μ1 = 1000, μ2 = 500, μ3 = 300, σ1 = 100, σ2 = 80, and σ3 = 50. The revenue from sales is Y = 3.0X1 + 3.2X2 + 3.4X3, and

  E(Y) = 3.0μ1 + 3.2μ2 + 3.4μ3 = $5620

The Difference Between Two Random Variables

An important special case of a linear combination results from taking n = 2, a1 = 1, and a2 = –1:

  Y = a1X1 + a2X2 = X1 – X2

We then have the following corollary to the proposition.

Corollary
E(X1 – X2) = E(X1) – E(X2) for any two rv's X1 and X2.
V(X1 – X2) = V(X1) + V(X2) if X1 and X2 are independent rv's.

Example 30
A certain automobile manufacturer equips a particular model with either a six-cylinder engine or a four-cylinder engine. Let X1 and X2 be fuel efficiencies for independently and randomly selected six-cylinder and four-cylinder cars, respectively. With μ1 = 22, μ2 = 26, σ1 = 1.2, and σ2 = 1.5,

  E(X1 – X2) = μ1 – μ2 = 22 – 26 = –4
  V(X1 – X2) = σ1² + σ2² = 1.44 + 2.25 = 3.69

If we relabel so that X1 refers to the four-cylinder car, then E(X1 – X2) = 4, but the variance of the difference is still 3.69.

The Case of Normal Random Variables

When the Xi's form a random sample from a normal distribution, X̄ and To are both normally distributed. Here is a more general result concerning linear combinations.

Proposition
If X1, X2, . . . , Xn are independent, normally distributed rv's (with possibly different means and/or variances), then any linear combination of the Xi's also has a normal distribution.
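The mean and standard deviation of the revenue in Example 29 follow directly from rules (5.8) and (5.9); the standard deviation is not stated above, so the value in the comment is simply what those rules produce. A sketch:

```python
from math import sqrt

# Example 29: revenue Y = 3.0*X1 + 3.2*X2 + 3.4*X3 for independent Xi's.
prices = [3.0, 3.2, 3.4]   # the coefficients a1, a2, a3
means = [1000, 500, 300]   # mu1, mu2, mu3
sds = [100, 80, 50]        # sigma1, sigma2, sigma3

e_y = sum(a * m for a, m in zip(prices, means))        # rule (5.8)
v_y = sum(a**2 * s**2 for a, s in zip(prices, sds))    # rule (5.9)
print(round(e_y, 2), round(sqrt(v_y), 2))   # 5620.0 and about 429.46
```

Rule (5.9) applies here because the purchase amounts are assumed independent; otherwise the covariance terms of rule (5.11) would be needed.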
In particular, the difference X1 – X2 between two independent, normally distributed variables is itself normally distributed.

The CLT can also be generalized so it applies to certain linear combinations. Roughly speaking, if n is large and no individual term is likely to contribute too much to the overall value, then Y has approximately a normal distribution.
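These propositions are easy to sanity-check by simulation. A sketch for Example 30's difference of independent normals, which should come out approximately normal with mean –4 and variance 3.69:

```python
import random
import statistics

# Monte Carlo check of Example 30: X1 ~ N(22, sd 1.2), X2 ~ N(26, sd 1.5),
# so X1 - X2 should be normal with mean 22 - 26 = -4 and
# variance 1.2**2 + 1.5**2 = 3.69.
random.seed(2)
diffs = [random.gauss(22, 1.2) - random.gauss(26, 1.5) for _ in range(100_000)]
print(round(statistics.mean(diffs), 2), round(statistics.variance(diffs), 2))
# sample mean ≈ -4, sample variance ≈ 3.69
```

A histogram of `diffs` would also show the bell shape promised by the normal-linear-combination proposition.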