The Theory of Probability
Chapter 5, Sections 1 and 2

Probability is the name of a branch of mathematics which deals with random variation. Probability is also the name of a numerical measure of the likelihood of occurrence of an event where occurrences are random. Probabilities are numbers in the interval [0, 1]. Zero signifies impossible to occur, one signifies certain to occur, and one-half indicates a 50-50 chance.

Statistics theory is based on Probability theory. The likelihoods of the various outcomes when a random sample is taken from a population are stated as probabilities.

A major concept in theory is that of a random variable. Every random variable X is defined on a specified population. X obtains its value by taking a random sample of size 1 from the population (i.e., getting an element in such a way that every element is equally likely to be the one selected). We say that this is the value assumed by the random variable.

Populations on which random variables are defined fall into two categories. Discrete populations are such that the set of unique values in the population does not constitute a continuum (Binomial populations, for example). Continuous populations are such that the unique elements in the population form a continuum (the Standard Normal population, for example).

Any random variable defined on a discrete population is called a discrete random variable, and we may consider the probability that the random variable assumes any one of the unique values in the population. Any random variable defined on a continuous population is called a continuous random variable, and we may consider the probability that the random variable assumes a value in any specified interval of the real line. In each case, the probability is the proportion of population elements satisfying the condition.

Populations and associated random variables that arise in theory (i.e., theoretical populations) have accompanying functions which are used to find probabilities. Each discrete random variable has an associated probability function which is evaluated to give probability. Each continuous random variable has an associated probability density function which is integrated to give probability relative to an interval. Suppose X is a discrete random variable having probability function f(x), and Y is a continuous random variable having probability density function h(y). We use the notation P(X = x) and P(Y ∈ (a, b)) for the probabilities that, respectively, X assumes the value x and Y assumes a value in the interval (a, b). Thus

P(X = x) = f(x)

P(Y ∈ (a, b)) = ∫_a^b h(y) dy

Example  Consider the discrete population having proportions of values as follows:

Value       0     1     2
Proportion  1/10  7/10  2/10

Thus the unique elements of this population are 0, 1, and 2. Seven tenths of the elements are 1. Define a random variable X on this population. X is a discrete random variable. It can assume the value 0, 1, or 2. The likelihood (probability) of each outcome is the associated proportion. Thus

x                0     1     2
P(X = x) = f(x)  1/10  7/10  2/10

The probability function f(x) in this example is defined by the table of proportions.

* * * *

In the case of discrete random variables, the probability that the random variable assumes one of several given possible values x1, x2, …, xn is the proportion of population elements which are x1, x2, …, or xn. This probability is found as f(x1) + f(x2) + … + f(xn). Thus for the previous example

P(X = 0 or X = 1) = 1/10 + 7/10 = 8/10,
P(X = 0 or X = 2) = 3/10,
P(X = 0 or X = 1 or X = 2) = 1, and
P(X = 3) = 0.
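These sums can be checked mechanically. Here is a minimal Python sketch (mine, not the textbook's) that stores the example's probability function as a dictionary and adds f(x) over any collection of values; Fraction is used only to keep the arithmetic exact.

    from fractions import Fraction

    # Probability function f(x) for the example population, stored as a
    # dictionary of value -> proportion (the numbers come from the table).
    f = {0: Fraction(1, 10), 1: Fraction(7, 10), 2: Fraction(2, 10)}

    def prob(values):
        # P(X is one of the given values) = sum of f over them; a value
        # that is not in the population contributes probability 0.
        return sum(f.get(x, Fraction(0)) for x in values)

    print(prob([0, 1]))     # P(X = 0 or X = 1) = 8/10, printed as 4/5
    print(prob([0, 2]))     # P(X = 0 or X = 2) = 3/10
    print(prob([0, 1, 2]))  # 1
    print(prob([3]))        # 0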
Example  Consider the Standard Normal population. Define the random variable Z on this population. Z is called a Standard Normal random variable. It is a continuous kind of random variable. The probability that Z assumes a value in a given interval (a, b) is found by integrating the probability density function φ(z) defined as

φ(z) = (1/√(2π)) e^(−z²/2)

Thus

P(Z ∈ (0, 1)) = (1/√(2π)) ∫_0^1 e^(−y²/2) dy

P(Z ∈ (−∞, ∞)) = (1/√(2π)) ∫_−∞^∞ e^(−y²/2) dy = 1

P(Z ∈ (0, ∞)) = (1/√(2π)) ∫_0^∞ e^(−y²/2) dy = 1/2

P(Z ∈ (2, 2)) = (1/√(2π)) ∫_2^2 e^(−y²/2) dy = 0

Table B.3 gives probabilities for intervals (−∞, z).

Example  Your textbook, on page 232, defines what I will call a Binomial Experiment. This is an activity which consists of a set of identical trials. The trials are such that the outcome of any one is unpredictable and

1. Each trial results in one of two possible outcomes. Call the outcomes "success" and "failure" for lack of better names.

2. The trials are independent in the sense that the outcome of any one trial (success or failure) is in no way connected to the outcome of any other trial.

You can see that this kind of experiment will fit many different situations. Mathematical theory gives the result that the number of successes obtained, in a Binomial Experiment consisting of n trials, is the value assumed by a Binomial (n, p) random variable. The value of p is the probability of success in a trial.

Define, for a Binomial (n, p) experiment, the Binomial random variable X = the number of successes obtained. That is to say, define X on a Binomial (n, p) population. Then we have, for this discrete random variable, that

f(x) = P(X = x) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x), for x = 0, 1, 2, …, n.

End of chapter exercises which feature a Binomial Experiment include 5.1, 5.12(c), 5.15(b), 5.45(b) and 5.46(b).

The related Geometric random variable is defined on page 237. Its value is the number of trials required to obtain the first success; the Geometric distribution applies in this case.

* * * *

It is important to note that a probability is always a proportion of population elements. Continuing along this line, and thinking in terms of quantiles in a population, consider the Cumulative Probability Function F(x) of a random variable X. This function is defined as

F(x) = P(X ≤ x) = p

(i.e., F(x) is the proportion p of population elements that don't exceed x). Thus x is by definition the p quantile of the population, so x ≡ Q(p) using the notation from chapter 3. Summarizing, we have

F(Q(p)) = p
Q(F(x)) = x
Q = F⁻¹
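For readers who want to verify these formulas numerically, the following Python sketch (my own illustration, standard library only) computes the Standard Normal interval probabilities above, checks that the Binomial probability function sums to 1 (the values n = 5, p = 0.3 are arbitrary choices of mine), and confirms the relation F(Q(p)) = p.

    import math
    from statistics import NormalDist

    Z = NormalDist()  # the Standard Normal: mu = 0, sigma = 1

    # P(Z in (a, b)) is a difference of cumulative probabilities,
    # F(b) - F(a); F is what Table B.3 tabulates.
    print(Z.cdf(1) - Z.cdf(0))   # P(Z in (0, 1)), about 0.3413
    print(Z.cdf(2) - Z.cdf(2))   # P(Z in (2, 2)) = 0

    # Binomial probability function f(x) = [n!/(x!(n-x)!)] p^x (1-p)^(n-x).
    def binom_pmf(x, n, p):
        return math.comb(n, x) * p**x * (1 - p)**(n - x)

    # The f(x) for x = 0, 1, ..., n account for the whole population.
    print(sum(binom_pmf(x, 5, 0.3) for x in range(6)))  # 1.0

    # Q = F^(-1): the quantile function undoes the cumulative probability
    # function, so F(Q(p)) gives back p (up to floating point rounding).
    print(Z.cdf(Z.inv_cdf(0.8)))  # 0.8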
Probability Theory, Subsection 5.5.5
Sampling Random Variables, Sample Mean Random Variable, and the Central Limit Theorem

When a random sample is taken from a population P we can imagine that it is drawn:

1. Without replacement, i.e., each selected element is not a candidate for subsequent selections. Each draw after the first is from a subset of the entire population.

or

2. With replacement, i.e., each draw is made from the entire population.

If the population P is very large compared to the sample size then, as a practical matter, it doesn't matter which sampling method is used, because the likelihood of drawing any given element two or more times is negligible. Let us henceforth assume that any population of interest to us is very large, and we sample with replacement. Furthermore, let us say that the mean and variance parameters of P are μ and σ².

The outcome of a random sample of size n taken with replacement from P can be thought of as the values assumed by n random variables, each defined on P. Call these random variables X1, X2, …, Xn. Since there is no connection between the values assumed by any two of these random variables, we say they are independent. Also, since each Xi is defined on the same population P, all of the random variables have the same probability distribution. The X1, X2, …, Xn are said to be independently and identically distributed, which we abbreviate iid. The random variables X1, X2, …, Xn are called sampling random variables.

Sometimes in real applications we are able to observe only one value taken at random from a continuous population of interest, and our question is of the kind "how likely is it that when we get the value it will be in the interval (say) (1.65, 2.13)?". Thus the situation is that we can take a random sample of size n = 1. The mathematical model is a single sampling random variable X1 ≡ X (no need for a subscript here). In order to answer our question we need to compute P[X ∈ (1.65, 2.13)]. This is only possible if we know the probability distribution of X, i.e., can compute the proportion of population elements that lie in any given interval. In almost every instance this requires that we assume that our population of interest is sufficiently like a known theoretical population so that we can reasonably assume our sample of n = 1 comes from that theoretical population. End of chapter exercises 5.1(a), 5.12(a), 5.15(a), 5.24, 5.30(a), 5.36(a), 5.42(a), 5.44(a) and 5.45(a) all fit this situation.

Now consider the sample mean x̄ for a random sample of size n from P. We may think of x̄ as the value assumed by the random variable X̄ which is the linear combination

X̄ = (1/n)(X1 + X2 + … + Xn)

of sampling random variables.

We have said that every random variable is defined on some population. Consider the population on which X̄ is defined. Call it the derived population Pd. Obviously the elements in Pd are all the possible means x̄ of subsets of n elements from P, selecting subsets with replacement. If, for example, N is the number of elements in P, then the number of elements in Pd is:

n                 2    3    4    etc.
# elements in Pd  N²   N³   N⁴

The population mean and variance of Pd are given in Equations (5.55) and (5.56) on page 309 as

E X̄ = (1/n) ∑_{i=1}^n E Xi = μ

Var X̄ = (1/n)² ∑_{i=1}^n Var Xi = σ²/n

This uses the facts that E Xi = μ and Var Xi = σ².

To summarize, we may think of the mean of a random sample of size n (from a population P) as the value assumed by a random variable X̄ defined on the derived population Pd. We have that E X̄ = μ and Var X̄ = σ²/n, where μ and σ² are parameters of P.

What is the probability distribution of X̄? I.e., what are the proportions of occurrences of values in Pd? Theory gives us that if P is a Normal (μ, σ²) population then Pd is a Normal (μ, σ²/n) population. That is to say, X̄ ~ N(μ, σ²/n) when Xi ~ N(μ, σ²), i = 1, 2, …, n, and the Xi are independent. When P is not a Normal population, and the sample size is large (say n ≥ 25), then the following theorem, called the Central Limit Theorem, says that Pd is approximately a Normal population and X̄ is approximately a Normal (μ, σ²/n) random variable (see Proposition 3, page 316) regardless of the distribution of elements in P.

* * * *

Central Limit Theorem  If X1, X2, …, Xn are iid random variables (with mean μ and variance σ²), then for large n, the random variable X̄ is approximately Normally distributed (μ, σ²/n).

* * * *
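The theorem is easy to see in action by simulation. The Python sketch below (my own illustration) draws repeated sample means from an exponential population, a conveniently non-Normal choice with μ = 1 and σ² = 1, and compares the behavior of the means with what the Central Limit Theorem predicts.

    import random
    import statistics
    from statistics import NormalDist

    # Population P: exponential with mean mu = 1 and variance sigma^2 = 1
    # (my choice; any non-Normal population would do). Draw the sample
    # mean of n = 30 values many times over.
    random.seed(1)
    n, repetitions = 30, 10_000
    means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
             for _ in range(repetitions)]

    print(statistics.fmean(means))     # close to mu = 1
    print(statistics.variance(means))  # close to sigma^2 / n = 1/30 = 0.0333...

    # Proportion of sample means in (0.9, 1.1), observed vs. the
    # Normal(mu, sigma^2 / n) approximation; both are about 0.416.
    approx = NormalDist(mu=1, sigma=(1 / n) ** 0.5)
    print(sum(0.9 < m < 1.1 for m in means) / repetitions)
    print(approx.cdf(1.1) - approx.cdf(0.9))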
I will write the approximation in the theorem as X̄ ≈ N(μ, σ²/n).

As an example, consider the outcomes of n identical Binomial (t, p) experiments. Each experiment results in a number of successes which we view as the value assumed by a Binomial (t, p) random variable Bi. The mean outcome for the n experiments is

B̄ = (1/n) ∑_{i=1}^n Bi.

We know (from Equations (5.4) and (5.5), page 236) that E Bi = tp and Var Bi = tp(1 − p), so E B̄ = tp and Var B̄ = tp(1 − p)/n. Then according to the Central Limit Theorem, if n is large,

B̄ ≈ N(tp, tp(1 − p)/n)

(a simulation sketch at the end of these notes checks this approximation).

The Central Limit Theorem suggests that the Normal distribution is generally of great value. This theorem gives one of the most amazing results in all of mathematics (in my opinion).

End of chapter exercises 5.12(b), 5.13(b), 5.15(c), 5.20, 5.22(b), 5.36(b), 5.43(b), 5.44(b) and 5.45(d) all illustrate computation of, or approximation of, probability for X̄.

The general term "Sampling Distribution" is used to refer to the probability distribution of a random variable which is a function of sampling random variables. The distributions of the random variables X̄ and (n − 1)S²/σ² are examples of Sampling Distributions. Names of theoretical Sampling Distributions which we will encounter are Student's t, chi-squared, and F.
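As a closing illustration, here is the simulation sketch promised above: it generates many values of B̄ and compares their mean and variance with tp and tp(1 − p)/n. The settings t = 10, p = 0.4, n = 25 are arbitrary choices of mine.

    import random
    import statistics

    # Check B-bar ≈ N(tp, tp(1 - p)/n) by simulation.
    random.seed(2)
    t, p, n, repetitions = 10, 0.4, 25, 10_000

    def binomial(t, p):
        # One Binomial(t, p) outcome: count successes in t independent
        # trials, each succeeding with probability p.
        return sum(random.random() < p for _ in range(t))

    bbars = [statistics.fmean(binomial(t, p) for _ in range(n))
             for _ in range(repetitions)]

    print(statistics.fmean(bbars))     # close to tp = 4.0
    print(statistics.variance(bbars))  # close to tp(1 - p)/n = 0.096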