MATH 2441
Probability and Statistics for Biological Sciences
Calculating Probabilities III:
Random Variables and Probability Distributions
A random variable is just a way of assigning a numerical result to the outcome of a random experiment. (In
mathematical terms, a random variable is a function which produces a numerical result when it acts on
elements of the sample space). The "rule" that associates probabilities with specific values or ranges of
values of a random variable is referred to as a probability distribution.
Because of their very different mathematical properties, procedures for working with discrete random
variables (random variables with possible values forming a set of distinct numerical values separated by
gaps) are different in fundamental ways from procedures for working with continuous random variables
(random variables whose possible values form one or more continuous ranges of numerical values). We
will summarize some of these distinctions near the end of the document. However, there are also some
significant similarities between the way we work with both kinds of random variables (or perhaps "analogies"
is a better word), and you should note these as well as you review the material covered here.
A number of random variables and associated probability distributions arise frequently in applications in
biological sciences. Among the important discrete probability distributions that we will study in this course
are the binomial distribution and the Poisson distribution. We will make extensive use of two
continuous distributions: the normal distribution and the t-distribution. Two other continuous distributions
that will come up in the course are the χ²-distribution (or chi-squared distribution) and the F-distribution. There are certainly dozens, if not hundreds, of other named distributions that arise in more
specialized applications of probability and statistics.
Discrete Random Variables and Probability Distributions
We've already looked at some very simple examples of discrete random variables in the first document in
this series, titled "Calculating Probabilities I: Equally-likely Simple Events and Branching Diagrams." The
coin-flipping experiment described there leads to a discrete random variable.
To make things a bit more interesting here, we will consider the experiment in which a fair coin is flipped five
times (or five identical coins are flipped simultaneously). We know from previous work that this experiment
will have 2⁵ = 32 distinct outcomes, each of which can be written as a string of five H's and T's, representing
the sequence of heads and tails observed in the experiment. It's not too hard to set up a branching diagram
to determine that the 32 possible outcomes of this experiment are:
x = 5     x = 4     x = 3     x = 2     x = 1     x = 0
HHHHH     HHHHT     HHHTT     TTTHH     HTTTT     TTTTT
          HHHTH     HHTHT     TTHTH     THTTT
          HHTHH     HTHHT     THTTH     TTHTT
          HTHHH     THHHT     HTTTH     TTTHT
          THHHH     HHTTH     TTHHT     TTTTH
                    HTHTH     THTHT
                    THHTH     HTTHT
                    HTTHH     THHTT
                    THTHH     HTHTT
                    TTHHH     HHTTT
We've sorted these 32 outcomes into columns, each corresponding to a different number of heads overall.
Now, define:
x = number of heads when a fair coin is flipped 5 times
Then, x is a random variable (because it associates a numerical value with each of the 32 outcomes listed
above), and further, it is a discrete random variable because x can have only the distinct values 0, 1, 2, 3, 4,
or 5. The listing above is organized so that the first column gives all the simple events resulting in x = 5, the
second column gives all the simple events resulting in x = 4, the third column gives the simple events
resulting in x = 3, and so on, down to x = 0 in the right-most column.
David W. Sabo (1999)
The probability distribution for x is just some way of specifying the probability of observing each possible
value of x. Since we know that the 32 outcomes listed above are simple events, and the "fair" coin
assumption means they are equally-likely and so each have a probability of 1/32, it is quite easy to work out
the probabilities associated with each of the six possible values of x.
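The counting described above can also be done mechanically. The following is a minimal Python sketch, not part of the original handout; the variable names are illustrative. It lists all 32 sequences of H and T, counts the heads in each, and divides each count by 32.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate every sequence of five flips: 2**5 = 32 equally likely simple events.
outcomes = ["".join(flips) for flips in product("HT", repeat=5)]

# x = number of heads in each outcome; count how many outcomes give each x.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Each simple event has probability 1/32, so Pr(x) = (count for x) / 32.
pr = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in pr.items():
    print(f"x = {x}: {counts[x]:2d}/32 = {float(p)}")
```

Exact fractions are used so that the probabilities add up to exactly 1 rather than to a rounded decimal.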
For discrete random variables, the probability distributions are conveniently expressed in one of several
possible ways:
(i) simply tabulate the values: In this case, we can compute the probabilities associated with each
possible value of x by simply counting entries in the columns of the table above. This gives:
x        Pr(x)
0        1/32 = 0.03125
1        5/32 = 0.15625
2        10/32 = 0.3125
3        10/32 = 0.3125
4        5/32 = 0.15625
5        1/32 = 0.03125
total:   32/32 = 1
Notice that since the values, Pr(x), are just probabilities, they must obey the basic properties of probabilities:
(a) they are numbers with values between 0 and 1 inclusive
(b) the sum of the probabilities for all possible values of x must be 1.
(ii) construct a probability histogram: This is not a very practical method for discrete random variables,
but it illustrates what becomes the method of choice (indeed, the only practical method) for computing
probabilities for continuous random variables. You just construct a bar of height Pr(x) and width 1 centered
on the value of x. For the five-coin-toss example, we get:
[Figure: probability histogram for the five-coin-toss example. Horizontal axis: x = 0, 1, 2, 3, 4, 5. Vertical axis: Pr(x), from 0.000 to 0.350. Bar heights: 0.03125, 0.15625, 0.3125, 0.3125, 0.15625, 0.03125.]
Other than the visual image this representation gives, there's no real new information here that wasn't
present in the tabulation earlier. Because we've drawn the columns for each possible value of x to have a
width of 1 unit, the shaded areas and the probabilities are equivalent.
(iii) use an algebraic formula: In many instances, actual algebraic formulas for Pr(x) can be worked out,
based on the known characteristics of the random experiment giving rise to the values of x. For example,
without going into the details of the derivation here, it is possible to discover that the values of Pr(x) for this
five-coin-toss example are given by the formula:
$$\Pr(x) = \frac{C_{5,x}}{32} = \frac{5!}{32\, x!\,(5 - x)!}.$$
This may look a bit scary at first, but recall that factorials are just products of whole numbers. Thus, for
example, taking x = 3, we get
$$\Pr(x = 3) = \frac{5!}{32\,(3!)\,(2!)} = \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{32 \cdot (3 \cdot 2 \cdot 1)(2 \cdot 1)} = \frac{10}{32} = 0.3125$$
(When we look at the binomial distribution in greater detail near the end of the course, you'll see that with
very little modification, this sort of formula can be used to cover an experiment with any number of coins or
their equivalent, and any degree of unequal-likelihood between the two possible simple outcomes.)
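As a quick check of the formula, here is a short Python sketch (an illustration added here, not the handout's own method). The function name pr_heads and the parameter n are assumptions introduced for the example; n = 5 reproduces the five-coin-toss case.

```python
from math import comb, factorial

def pr_heads(x, n=5):
    """Pr(x heads in n flips of a fair coin) = C(n, x) / 2**n."""
    return comb(n, x) / 2**n

# The factorial form from the text, written out for x = 3.
by_factorials = factorial(5) / (32 * factorial(3) * factorial(2))

for x in range(6):
    print(x, pr_heads(x))

print(by_factorials)  # 0.3125
```

The values printed by the loop should match the tabulated probabilities in (i).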
Use of Cumulative Probabilities
Very early in the course, you encountered cumulative frequencies and cumulative relative frequencies.
Cumulative probabilities are analogous. They are just probabilities that x is less than or equal to some
value. For the five-coin-toss example, we can easily tabulate cumulative probabilities using the probabilities
for the individual values of x in the table above:
k        Pr(x ≤ k)
0        1/32 = 0.03125
1        6/32 = 0.1875
2        16/32 = 0.5
3        26/32 = 0.8125
4        31/32 = 0.96875
5        32/32 = 1
Thus, for example
Pr(x ≤ 2) = Pr(x = 0) + Pr(x = 1) + Pr(x = 2) = 1/32 + 5/32 + 10/32 = 16/32 = 0.5.
Cumulative probabilities are particularly useful for calculating the probability that ranges of values of x occur.
Thus, for example,
Pr(1 ≤ x ≤ 4) = Pr(x ≤ 4) - Pr(x ≤ 0)
= 0.96875 - 0.03125 = 0.9375
is a shorter calculation than
Pr(1 ≤ x ≤ 4) = Pr(x = 1) + Pr(x = 2) + Pr(x = 3) + Pr(x = 4)
= 0.15625 + 0.3125 + 0.3125 + 0.15625 = 0.9375
The difference may not look like much here, but it becomes significant in more complicated applications
where the random variable, x, may have many more possible values than just six. To calculate probabilities
using cumulative probability tables or formulas requires at the most a single subtraction, regardless of how
many possible values x may have.
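The single-subtraction idea can be sketched in Python as follows. This is an illustrative sketch only; the helper name pr_range is an assumption, and it handles integer limits within 0 to 5.

```python
from fractions import Fraction
from itertools import accumulate

# Pr(x) for x = 0..5 from the tabulated distribution.
pr = [Fraction(c, 32) for c in (1, 5, 10, 10, 5, 1)]

# cum[k] = Pr(x <= k), a running total of the individual probabilities.
cum = list(accumulate(pr))

def pr_range(lo, hi):
    """Pr(lo <= x <= hi) for integers 0 <= lo <= hi <= 5, by a single subtraction."""
    below = cum[lo - 1] if lo > 0 else Fraction(0)
    return cum[hi] - below

print(float(pr_range(1, 4)))  # 0.9375
```

Note that for a discrete variable the lower limit is handled by subtracting Pr(x ≤ lo - 1), which is why Pr(x ≤ 0) appears in the calculation of Pr(1 ≤ x ≤ 4).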
The notation above is a bit awkward. In the case of actual standard probability distributions, there are
distinct symbols for the probabilities of individual outcomes as opposed to cumulative probabilities.
The detailed procedures for working with cumulative probabilities are a little more complicated for discrete
probability distributions than they are for continuous probability distributions. We will look at them in greater
depth later when we deal with the binomial distribution.
Expected Values of Discrete Random Variables
Think of performing the five-coin-toss experiment many, many times, recording the number of heads
observed each time the experiment is performed. One important question is: What is the average number
of heads observed per experiment over the long run? This turns out to be quite easy to calculate.
Recall that Pr(k), the probability that you will observe the value x = k when the experiment is performed, is
equivalent to the long run relative frequency of the result x = k. So, knowing, for instance, that in the five-coin-toss experiment, Pr(x = 1) = 5/32, we know that over the long run, 5/32 of the experiments result in
x = 1. Now, suppose that the experiment is performed N times, where N stands for some very large number.
Then, we can say the following:
x = 0    will be observed approximately    (1/32)N times
x = 1    will be observed approximately    (5/32)N times
x = 2    will be observed approximately    (10/32)N times
x = 3    will be observed approximately    (10/32)N times
x = 4    will be observed approximately    (5/32)N times
x = 5    will be observed approximately    (1/32)N times
Now, it is easy to estimate the sum of all the observed values of x in these N repetitions of the experiment.
The (1/32)N zeros in our list contribute (1/32)N × 0 = 0 to the sum. The (5/32)N ones in the list contribute
(5/32)N × 1 = (5/32)N to the sum. The (10/32)N twos in the list contribute (10/32)N × 2 to the sum, and so
on. We sum up all the x-values, and divide by N to get the mean of the x values:
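$$
\begin{aligned}
\mu = E[x] &= \frac{\frac{1}{32}N \cdot 0 + \frac{5}{32}N \cdot 1 + \frac{10}{32}N \cdot 2 + \frac{10}{32}N \cdot 3 + \frac{5}{32}N \cdot 4 + \frac{1}{32}N \cdot 5}{N} \\
&= \frac{N\left(\frac{1}{32} \cdot 0 + \frac{5}{32} \cdot 1 + \frac{10}{32} \cdot 2 + \frac{10}{32} \cdot 3 + \frac{5}{32} \cdot 4 + \frac{1}{32} \cdot 5\right)}{N} \\
&= \frac{1}{32} \cdot 0 + \frac{5}{32} \cdot 1 + \frac{10}{32} \cdot 2 + \frac{10}{32} \cdot 3 + \frac{5}{32} \cdot 4 + \frac{1}{32} \cdot 5 \\
&= \frac{0 + 5 + 20 + 30 + 20 + 5}{32} = 2.5
\end{aligned}
$$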
This makes sense. Each time the experiment is done, five coin faces are exposed. Since H and T are
equally likely, it is reasonable that the long run average would have half the exposed faces being H and half
of them being T. Expressed as a number per experiment repetition, this means that an average of 2.5 H's
and 2.5 T's should occur for each repetition.
The notation, E[x], stands for "expected value of x", and is equivalent in meaning to the symbol μ. The first
line above simply details the sum and divide definition of a mean value. In the second line, we've removed
the common factor N from each term in the numerator, and notice that it will cancel the N in the
denominator. The third line is most revealing, because we see that (as we might have hoped had we
stopped to think deeply), the average value of x does not depend on the value of N as long as you are
contemplating enough repetitions of the experiment to make the first line approximately valid. Further,
notice that the third line is just the sum of the products of the values of x and their probabilities:
$$E[x] = \sum_{\text{all values of } x} x \cdot \Pr(x) \tag{RV-1}$$
This formula applies to any discrete random variable and its probability distribution. The value of E[x] can be
interpreted as the long run average value of x when the random experiment is repeated many times. If you
were to construct a frequency or relative frequency histogram of the outcomes of many repetitions of the
random experiment, E[x] would give you the value of x at the mean of that histogram.
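The long-run interpretation of (RV-1) can be illustrated with a small simulation. The sketch below is an addition for illustration: it computes E[x] exactly from the distribution and then compares it with the average number of heads over N simulated repetitions. The seed value and N = 100,000 are arbitrary choices, not values from the handout.

```python
import random
from fractions import Fraction

# Probability distribution of x = number of heads in five flips of a fair coin.
pr = {x: Fraction(c, 32) for x, c in enumerate((1, 5, 10, 10, 5, 1))}

# Formula (RV-1): sum of x * Pr(x) over all values of x.
expected = sum(x * p for x, p in pr.items())

# Long-run check: repeat the five-flip experiment N times and average x.
rng = random.Random(2441)   # fixed seed (arbitrary) so the run is repeatable
N = 100_000
total_heads = sum(
    sum(rng.random() < 0.5 for _ in range(5))   # heads in one experiment
    for _ in range(N)
)
simulated = total_heads / N

print(float(expected))   # 2.5
print(simulated)         # close to 2.5, but not exactly equal
```

The simulated average varies slightly from run to run (with a different seed) and approaches 2.5 only as N grows.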
The mean value of any quantity which depends on the value of x can be computed using a very similar
method, and hence a very similar formula. In general, if f(x) is some function of x, then its mean value over
many repetitions of the experiment is simply
$$E[f(x)] = \sum_{\text{all values } x_k \text{ of } x} f(x_k) \cdot \Pr(x = x_k) \tag{RV-2}$$
In particular, the quantity
$$\mathrm{Var}(x) = E[(x - \mu)^2] = \sum_{\text{all values of } x} (x - \mu)^2 \cdot \Pr(x) \tag{RV-3}$$
gives the variance of the observed values of x over the long run, and so, together with its square root (which
would be the standard deviation of the observed values of x over the long run) is a measure of how much
spread there is in the observed values of x. A probability distribution with a small value of Var(x) would give
a very tight pattern of values of x over the long run; its probability histogram would be very narrow. On the
other hand, a probability distribution with a larger value of Var(x) would result in a much wider variety of
outcomes; its probability histogram would be broader.
For the five-coin-toss experiment,
$$\mathrm{Var}(x) = (0 - 2.5)^2\,\tfrac{1}{32} + (1 - 2.5)^2\,\tfrac{5}{32} + (2 - 2.5)^2\,\tfrac{10}{32} + (3 - 2.5)^2\,\tfrac{10}{32} + (4 - 2.5)^2\,\tfrac{5}{32} + (5 - 2.5)^2\,\tfrac{1}{32} = 1.25$$
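Formulas (RV-1) to (RV-3) are all the same sum with a different function inside, which the following Python sketch makes explicit. The helper name expectation is an assumption introduced for this illustration.

```python
from fractions import Fraction

# Probability distribution of x = number of heads in five flips of a fair coin.
pr = {x: Fraction(c, 32) for x, c in enumerate((1, 5, 10, 10, 5, 1))}

def expectation(f, pr):
    """Formula (RV-2): E[f(x)] = sum of f(x_k) * Pr(x = x_k)."""
    return sum(f(x) * p for x, p in pr.items())

mu = expectation(lambda x: x, pr)                 # (RV-1) as a special case
var = expectation(lambda x: (x - mu) ** 2, pr)    # (RV-3)
sd = float(var) ** 0.5                            # standard deviation

print(mu, var, sd)
```

With exact fractions, mu comes out as 5/2 and var as 5/4, in agreement with the values 2.5 and 1.25 obtained by hand.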
Example:
A game of chance consists of flipping a coin five times. The payoff is the square of the number of heads
resulting, in dollars. What is the maximum amount of money you should be willing to pay to play this game?
Solution:
So, what this example says is: if you flip the five coins and get no heads up, you win nothing. You get $1 for
getting one head, $4 for two heads, $9 for three heads, $16 for four heads, and $25 for five heads. We
could regard winnings here as a random variable (for which we can easily work out the probability
distribution), but it is just as easy to consider winnings as a simple mathematical function, f(x) = x², of the
random variable x, the number of heads obtained in five flips of the coin.
Now, the less we pay to play this game, the greater our potential benefit. However, the most we should be
willing to pay is the expected value of the winnings. Using formula (RV-2), this is
$$E[\text{winnings}] = 0 \cdot \tfrac{1}{32} + 1 \cdot \tfrac{5}{32} + 4 \cdot \tfrac{10}{32} + 9 \cdot \tfrac{10}{32} + 16 \cdot \tfrac{5}{32} + 25 \cdot \tfrac{1}{32} = 7.50$$
Thus, if we played this game many times, the average number of dollars won per play would be $7.50. If the
cost of playing the game is less than $7.50, then over the long run we win more money than it costs to play.
If the cost to play the game is more than $7.50, then over the long run we pay more than we win.
Here, E[winnings] = $7.50 is a long run average, though. If you just play one game, you may win $25 that
game, and so would gain even if it cost you $24 to play the game. However, if you play just one game, you
might also win nothing, so that any amount paid to play would be a loss.
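The break-even reasoning in this example can be sketched in Python as below. The function names and the trial costs of $5 and $10 are hypothetical, chosen only to show one price below and one above the expected winnings.

```python
from fractions import Fraction

# Probability distribution of x = number of heads in five flips of a fair coin.
pr = {x: Fraction(c, 32) for x, c in enumerate((1, 5, 10, 10, 5, 1))}

def payoff(x):
    """Winnings in dollars when x heads come up."""
    return x ** 2

# Formula (RV-2) with f(x) = x**2.
expected_winnings = sum(payoff(x) * p for x, p in pr.items())

def expected_net_gain(cost):
    """Long-run average gain per play if each play costs `cost` dollars."""
    return expected_winnings - cost

print(float(expected_winnings))       # 7.5
print(float(expected_net_gain(5)))    # 2.5
print(float(expected_net_gain(10)))   # -2.5
```

A positive expected net gain means the game pays off over many plays; it says nothing about the result of any single play.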

Continuous Random Variables
The possible values of a discrete random variable form a set of specific, isolated values, to each of which
can be assigned a probability. Even if there is in principle an infinite number of possible values of a discrete
random variable, {x1, x2, x3, …}, it makes sense to speak of Pr(x = xk), the probability of one of those
possible values occurring.
It does not make sense to talk about Pr(x = c) if x is a continuous random variable because such events, x =
c, are impossible to observe. You may think that a randomly selected apple has a mass of 176.8453 g
(pretty precise for an apple!), but how do you know it isn't really 176.84530000000001? While there is the
practical problem of present technology being limited in precision even in measuring quantities that can be
measured to many significant figures, there is a much more fundamental mathematical problem here as
well. As a result, the only meaningful events involving continuous random variables are events in which we
ask for the probability that the value of x falls into some interval of finite length: Pr(a < x < b), where a and b
are two distinct numbers. This is not as limiting in practice as you might think, because we are usually not
interested in questions of whether some quantity has a precise value, but whether its value falls within a
certain range. In fact, even the statement, x = 176.8453 g really implies that our measurement establishes
the value of x rounded to four decimal places is 176.8453 g, that is, the true value of x is somewhere
between 176.84525 g and 176.84535 g, an interval!
The most effective way of computing probabilities for continuous random variables starts with the
development of a so-called probability density function, f(x). For example, in the case of the normal
distribution, the familiar bell-curve, the probability density function can be written as:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}$$
where μ and σ are constants that would have specific values in specific applications (you should be able to
guess what these values represent!), and π is just the mathematical constant 3.141592653…. In this case,
plotting a graph of the probability density function gives the familiar bell-shape. Other probability density
functions would, of course, give other shapes.
Then, very simply:
Pr(a < x < b) = area under the graph of f(x)
between x = a and x = b.
or, in pictures,
[Figure: graph of y = f(x) against x, with the region under the curve between x = a and x = b shaded and labelled Pr(a < x < b).]
Now, areas are expressed mathematically as integrals, and so formally, we can write
$$\Pr(a \le x \le b) = \int_a^b f(x)\,dx \tag{RV-4}$$
Provided f(x) is simple enough that we know how to work out the integral in this formula, we now have a
very effective way to calculate any probability we need for an event involving x.
There are some probability distributions for which (RV-4) is a practical formula. However, for the normal
distribution, and all of the other continuous distributions that we will use in this course, the integral in (RV-4)
is too impractical to attempt by hand. As a result, numerical tables have been developed to give
approximate values of the integral for most problems of practical interest.
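To give a feel for how such approximate values can be produced, here is a minimal Python sketch that evaluates the integral in (RV-4) numerically for the normal density. Simpson's rule is one standard numerical integration method, chosen here for illustration; it is not necessarily how published tables were prepared. The function names and the choice μ = 0, σ = 1 are assumptions.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density function f(x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(f, a, b, n=1000):
    """Approximate the integral of f from a to b with Simpson's rule (n even)."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Pr(-1 < x < 1) for mu = 0, sigma = 1: the area within one standard deviation.
p = simpson(normal_pdf, -1.0, 1.0)
print(round(p, 4))   # 0.6827
```

The result is the familiar figure of about 68% of the area lying within one standard deviation of the mean.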
Even though we won't use formula (RV-4) to calculate any actual numbers in this course, it does imply a
number of useful properties, which we will just note here briefly. Actual applications and implementation of
these properties will be illustrated in detail for the normal distribution in the next section of the course.
(i.) There is no difference between strict inequality and non-strict inequality in calculations involving
continuous probability distributions. That is:
Pr(a < x < b) = Pr(a ≤ x ≤ b) = Pr(a ≤ x < b) = Pr(a < x ≤ b)
since the difference between these is just the area above one point on the x-axis, which we know
from our study of calculus to be zero. This is consistent, for example, with the observation that the
statement x > 7.5 is indistinguishable in practice from the statement x ≥ 7.5, since x > 7.5 is
satisfied by values of x which are for all practical purposes equal to 7.5. This is not true for
discrete random variables!
Thus, we needn't be too obsessive about using < or > in place of ≤ or ≥.
(ii.) One property that the probability density function must satisfy is
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
since any observation of the value of x must give a value between -∞ and +∞. Thus, the total area
underneath the graph of the probability density function must always be exactly equal to 1.
(iii.) Cumulative probability functions can be defined as
F(a) = Pr(-∞ < x ≤ a)        (RV-5)
That is, F(a) is the probability of observing x to have a value less than or equal to 'a', which is
the area under the graph of f(x) to the left of x = a. As in (RV-5), we will represent the
generic cumulative probability function by the symbol F(), the upper case of f(). If we have a
formula for the cumulative probability, or, as is more usual, if we have tables giving values of
cumulative probability functions, then it is easy to work out any probability, using the formula:
Pr(a < x < b) = F(b) - F(a)        (RV-6)
[Figure: graph of y = f(x) against x with points a and b marked on the x-axis. The area to the left of a is F(a) = Pr(-∞ < x < a), the area to the left of b is F(b) = Pr(-∞ < x < b), and the area between them is Pr(a < x < b) = F(b) - F(a).]
This, or some minor variation, is probably the formula you will use most often in the remainder of
the course.
(iv.) Continuous random variables have mean values, variances, and other expected values in the same
way that discrete random variables do, except that now, the summation is replaced by an integral
involving the probability density function. Thus, for a continuous random variable x, with probability
density function, f(x), we have:
$$\mu = E[x] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx \tag{RV-7a}$$
$$\sigma^2 = \mathrm{Var}[x] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)\,dx \tag{RV-7b}$$
and, in general
$$E[g(x)] = \int_{-\infty}^{\infty} g(x) \cdot f(x)\,dx \tag{RV-7c}$$
for some general function, g(x), of x. While we will not use these formulas much (if at all) in the
present course to calculate values of μ or σ² for various probability distributions, we will state the
values of these quantities in order to convey information about the overall shape of the probability
distribution. As usual, the value of μ tells us where the "center" of the probability distribution lies,
and the value of σ or σ² gives a measure of how narrow or broad the probability distribution is.
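Properties (ii.) to (iv.) can be checked numerically for a particular normal distribution. The sketch below is an illustration under stated assumptions: the values μ = 10 and σ = 2, the limits a = 8 and b = 12, and the helper integrate are all hypothetical choices made for the example. It uses NormalDist from Python's standard statistics module for f(x) and F(x), and a simple midpoint rule in place of the exact integrals, with a range of ten standard deviations either side of the mean standing in for the infinite limits.

```python
from statistics import NormalDist

MU, SIGMA = 10.0, 2.0            # hypothetical values chosen for illustration
dist = NormalDist(MU, SIGMA)

def integrate(g, lo, hi, n=20_000):
    """Midpoint-rule approximation to the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

# Ten standard deviations either side stands in for (-infinity, +infinity).
LO, HI = MU - 10 * SIGMA, MU + 10 * SIGMA

area = integrate(dist.pdf, LO, HI)                                  # property (ii.)
mean = integrate(lambda x: x * dist.pdf(x), LO, HI)                 # (RV-7a)
var = integrate(lambda x: (x - mean) ** 2 * dist.pdf(x), LO, HI)    # (RV-7b)

# (RV-6): Pr(a < x < b) = F(b) - F(a), with F supplied by NormalDist.cdf.
a, b = 8.0, 12.0
prob = dist.cdf(b) - dist.cdf(a)

print(round(area, 6), round(mean, 6), round(var, 6), round(prob, 4))
```

The computed area, mean, and variance should come out very close to 1, μ = 10, and σ² = 4 respectively, and prob is the probability of falling within one standard deviation of the mean.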