ECON 309
Lecture 6: Probability Theory
I. Basic Notions of Probability
Classical, empirical, and subjective probability: All of these in some way relate to the
frequency with which you expect an event to occur. Frequency simply means the
fraction of the time that a particular event happens (in a defined type of situation).
• Classical probability applies to situations where you know all the possible outcomes, such as when throwing dice or flipping a coin. Our assumption in such cases is that we know the underlying distribution (e.g., we know that every side of a die is equally likely, we know we have a “fair” coin).

• Empirical probability applies when we have to estimate a frequency based on actual observations. This is what we do when we don’t actually know the underlying distribution and have to estimate it. (If you weren’t sure if a coin was fair or a die was weighted, you’d have to use empirical probability to try to find out.)

• Subjective probability applies in situations that cannot really be repeated, so we have to imagine repeating them to make sense of the notion of “frequency.” For example, “What is the probability Jack and Diane will eventually get married?” This is a unique situation; Jack and Diane are unique individuals. But we might imagine repeating the Jack-and-Diane relationship (or Jack-and-Diane-like relationships) many, many times to see how often marriage resulted.
Basics of probability. We let P(A) = the probability (that is, the true frequency) of some
event defined by A.
P(A) = 1 means the event occurs with certainty; P(A) = 0 means that it will certainly not
occur. P(A) is always in the interval [0, 1].
The probabilities of all possible events must add up to 1, so long as we don’t double-count by including events that overlap. Another way of putting this: something must
happen, even if it’s the absence of anything else. E.g., say we’re interested in the kinds
of pet a randomly chosen household will have. Either they will have a cat, or they’ll have
a dog, or they’ll have a fish, or they’ll have some other kind of pet, or they’ll have no pet
at all. If you add all these together (being sure not to double-count the chance of having
more than one kind of pet), the probabilities add up to one. Either they have a pet of
some kind or they don’t.
The complement of event A is denoted ~A (or, in the book, A′). This is the event that A does not occur. P(~A) = 1 – P(A), so P(A) + P(~A) = 1.
II. Intersections and Unions of Events
The intersection of two events is when two events both happen. What is the probability
that a randomly chosen voter is both African-American and a Democrat? If we let A =
voter is African-American and B = voter is Democrat, the intersection is denoted A ∩ B
= voter is both African-American and Democrat. P(A ∩ B) = the probability that a
randomly chosen voter is both African-American and Democrat.
The union of two events is when one or the other happens, or both. What is the
probability that a randomly chosen voter is either African-American or a Democrat?
(Note that, by convention, we interpret this “or” not to exclude voters who are both
African-American and Democrat.) Using the same A and B as above, the union is
denoted A U B = voter is either African-American or Democrat (or both). P(A U B) = the probability that a randomly chosen voter is either African-American or a Democrat.
Think of intersection as “and,” union as “or.”
Probability of the union can be found using this formula:
P(A U B) = P(A) + P(B) – P(A ∩ B)
Why do we need to subtract P(A ∩ B)? To keep us from double-counting. P(A), the
probability of African-American, includes African-American Democrats. P(B), the
probability of Democrat, also includes African-American Democrats. But we should only
be including African-American Democrats once, not twice.
If the two events are mutually exclusive, meaning they cannot happen at the same time,
then the formula simplifies to P(A U B) = P(A) + P(B).
Consider this breakdown (I made up these numbers, so don’t take them seriously):

Voter Breakdown
                    Republican   Democrat   Total
African-American         4           12       16
White/Other             44           40       84
Total                   48           52      100
Given these numbers,
P(A) = 16/100 = 0.16
P(B) = 52/100 = 0.52
P(A ∩ B) = 12/100 = 0.12
P(A U B) = P(A) + P(B) – P(A ∩ B) = 0.16 + 0.52 – 0.12 = 0.56
(in other words, all the Democrats plus the African-American Republicans)
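A quick check in Python (my own sketch, not part of the notes), using the made-up counts from the table:

total = 100
african_american = 16   # event A
democrat = 52           # event B
both = 12               # the intersection A ∩ B

p_a = african_american / total
p_b = democrat / total
p_both = both / total

# P(A U B) = P(A) + P(B) - P(A ∩ B): subtract to avoid double-counting.
print(round(p_a + p_b - p_both, 2))  # 0.56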
III. Conditional Probabilities
A conditional probability tells you the probability of some event given that you already
know another event has occurred.
For instance, what is the probability a voter is a Democrat given that he is African-American? This is designated P(B|A). Or what is the probability a voter is African-American given that he is a Democrat? This is designated P(A|B).
The formula for conditional probability is:
P(A|B) = P(A ∩ B) / P(B)
So if we’re interested in P(A|B) for our definitions above, that is, the probability a voter
is African-American given that he is a Democrat, the answer is
P(A|B) = 0.12 / 0.52 = 0.23
If you think about it, all we’ve really done here is treat the “given” event as the whole
population. In this case, Democrat is the given event, so our relevant population is the 52
Democrats. Of those Democrats, how many are African-American? 12. And 12 as a
fraction of 52 is 0.23.
What about P(B|A)?
P(B|A) = P(A ∩ B)/P(A) = 0.12 / 0.16 = 0.75
Again, we’ve essentially treated the “given” event – this time, that the voter is African-American – as the relevant population. There are 16 African-Americans, and 12 of them are Democrats, for a fraction of 12/16 = 0.75.
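To make the “given event becomes the population” idea concrete, here is a minimal Python sketch (mine, with my own variable names) using the table’s counts:

democrats = 52
african_americans = 16
both = 12

# P(A|B): among the 52 Democrats, what fraction are African-American?
print(round(both / democrats, 2))          # 0.23
# P(B|A): among the 16 African-Americans, what fraction are Democrats?
print(round(both / african_americans, 2))  # 0.75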
IV. Multiplication Rule
Events are called independent if the occurrence of one does not affect the probability of
the other. That is, P(A|B) = P(A).
A good example of independent events is rolls of dice or flips of a coin. Even if you’ve
just thrown heads 10 times in a row, the probability of heads is still the same as it ever
was (that is, one-half if this is a fair coin). The gambler’s fallacy is the mistaken belief
that events are dependent when they’re really not; gamblers often speak of “hot streaks”
as though they are likely to continue.
The multiplication rule says that if A and B are independent events, then you can
multiply the probabilities together to get the joint probability (the probability of the
intersection). That is,
P(A ∩ B) = P(A) ∙ P(B)
Example: What is the probability of rolling snake eyes? This is the probability of die #1
coming up “1” and die #2 also coming up “1.” P(snake eyes) = (1/6)(1/6) = (1/36).
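A short simulation sketch (my illustration in Python, not from the text): because the two dice are independent, the observed frequency of snake eyes should settle near 1/36 ≈ 0.0278.

import random

trials = 1_000_000
snake_eyes = sum(
    1 for _ in range(trials)
    if random.randint(1, 6) == 1 and random.randint(1, 6) == 1
)
print(snake_eyes / trials)  # roughly 0.0278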
This rule actually follows directly from the rule for conditional probabilities, once we include independence:

P(A|B) = P(A ∩ B) / P(B) = P(A)

And now just multiply through by P(B) to get the multiplication rule.
V. Bayes’ Rule
Bayes’ Rule is a formula for finding a conditional probability P(A|B) given information
about P(B|A). It’s really not useful in the example above, where you have all the
information you need to calculate P(B|A) directly. But what if you didn’t have all that
information?
Let’s say you want to know the likelihood that a voter is a Democrat, given that he’s
African-American. Instead of the information given in the table above, suppose you only
know the following: 23% of Democrats are African-American, 8.3% of Republicans are
African-American, and Democrats constitute 52% of the population.
Bayes’ Rule says:
P(B|A) = P(A ∩ B) / [P(A ∩ B) + P(A ∩ ~B)] = P(A|B)∙P(B) / [P(A|B)∙P(B) + P(A|~B)∙P(~B)]
(This is actually two equivalent formulas; use the one that’s more convenient.)
The idea is this: The numerator is the likelihood of both events (in this case, African-American and Democrat) occurring. The denominator is the likelihood of event A occurring at all, which can happen in two ways: either with B or without B. So we’re asking: of all the times that A occurs, what fraction of those times involve both A and B occurring?
In this case,
P(B|A) = P(A|B)∙P(B) / [P(A|B)∙P(B) + P(A|~B)∙P(~B)]
       = (0.23)(0.52) / [(0.23)(0.52) + (0.083)(0.48)]
       = 0.1196 / (0.1196 + 0.03984)
       ≈ 0.75
Notice that this is just what we found from the initial data, but we were able to find it
using more limited information.
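The same computation as a small Python sketch (the function name bayes is my own label, not from the text):

def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """Return P(B|A) from P(A|B), P(B), and P(A|~B) via Bayes' Rule."""
    numerator = p_a_given_b * p_b
    denominator = numerator + p_a_given_not_b * (1 - p_b)
    return numerator / denominator

# Voter example: P(Democrat | African-American)
print(round(bayes(0.23, 0.52, 0.083), 2))  # 0.75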
Medical Testing. Probably the most important applications of Bayes’ Rule involve
medical testing, such as for disease or drug use. Suppose a school implements a new
random drug testing program. Students’ names are picked randomly from the registration
records, and the selected students have to take a drug test. It is known that 5% of all
students use drugs. It is also known that the drug test’s false positive rate is 1% (a false
positive means indicating drug use when it did not in fact occur). Its false negative rate is
2% (a false negative means indicating no drug use even though it did occur). If a student
tests positive for drugs, what is the probability that he actually did use drugs? Most
people would say, “99 percent.” But that answer is wrong!
Let B = student used drugs
A = test was positive
We want to know P(B|A).
We know P(B) = 0.05, P(~B) = 0.95, P(A|B) = 0.98, and P(A|~B) = 0.01.
Using Bayes’ Rule,
Numerator = P(A|B)∙P(B) = (0.98)(0.05) = 0.049
Denominator = P(A|B)∙P(B) + P(A|~B)∙P(~B) = (0.98)(0.05) + (0.01)(0.95) = 0.0585
P(B|A) = numerator/denominator = 0.049 / 0.0585 ≈ 0.838, or 83.8%
That is, the chance a randomly selected student who tested positive for drugs actually used drugs is only about 84%, not the 99% that some people naively assume.
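A quick Python sketch (mine) to check the arithmetic:

p_use = 0.05              # P(B): student uses drugs
p_pos_given_use = 0.98    # P(A|B): 1 minus the 2% false negative rate
p_pos_given_clean = 0.01  # P(A|~B): the false positive rate

numerator = p_pos_given_use * p_use
denominator = numerator + p_pos_given_clean * (1 - p_use)
print(round(numerator / denominator, 3))  # 0.838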
The result is even more dramatic with tests for diseases, such as HIV, where the fraction
of the public that has the disease (that is, P(B)) is very small. The probability that a
randomly tested person (that is, a person getting tested with no particular reason to think
he’s been exposed) actually has the disease can be as low as 50%, or even less.
To see the logic, suppose we have a population of 10,000 people. It is known that 1 in
200 people (that is, 50 people total) have the disease. And suppose the test for the
disease has 98% accuracy (false positive and false negative rates of 2%). Of the 50
people with the disease, (0.98)(50) = 49 will test positive. Of the 9,950 people without the disease, (0.02)(9950) = 199 will test positive. That’s a total of 248 positive test
results, but only 49 of those people actually have the disease. That’s only 19.8%. The
remaining 80.2% of those who tested positive are disease-free.
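The counting logic translates directly into a few lines of Python (my sketch):

population = 10_000
sick = population // 200        # 1 in 200: 50 people have the disease
healthy = population - sick     # 9,950 people do not

true_positives = 0.98 * sick       # 49 sick people test positive
false_positives = 0.02 * healthy   # 199 healthy people test positive

print(round(true_positives / (true_positives + false_positives), 3))  # 0.198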
The Monty Hall Problem. You’re on a game show. Monty Hall, the host, presents you with three doors. Behind one there’s a bag of gold; behind the other two there are goats.
You pick door A. Before you open it, the host opens door B to reveal there’s a goat
behind it. Then he offers you a chance to switch to door C. Should you?
To answer properly, you should understand that the host knows where the gold is, and
he’ll never open the door on the gold. Does that change your answer?
We want to find P(A has gold | host shows B has goat).
We know P(A has gold) = 1/3, P(A has goat) = 2/3
P(host shows B has goat | A has gold) = 1/2 (because he randomly opens B or C)
P(host shows B has goat | A has goat) = 1/2 (because he always opens whichever
remaining door has the other goat, and there’s a 1/2 chance B is that door)
Numerator = P(host shows B has goat | A has gold)∙P(A has gold) = (1/2)(1/3)
Denominator = P(host shows B has goat | A has gold)∙P(A has gold)
+ P(host shows B has goat | A has goat)∙P(A has goat)
= (1/2)(1/3) + (1/2)(2/3) = 1/6 + 1/3 = 1/2
So by Bayes’ Rule, P(A has gold | host shows B has goat) = (1/2)(1/3)/(1/2) = 1/3.
And if there’s only a 1/3 chance the gold is behind A, and the host has already revealed
that it’s not behind B, then there’s a 2/3 chance it’s behind C. You’re better off
switching!
Here’s the logic. One-third of the time, you will have guessed correctly. Suppose you
adopt a policy of always sticking with your original choice. Then obviously, you will
win 1/3 of the time. Can you improve on that? Suppose you adopt the alternative policy
of switching. That means 1/3 of the time you’ll lose because you guessed correctly to
begin with. The other 2/3 of the time, you’ll switch to one of the other two doors, one of
which must have the gold. And since Monty Hall has already eliminated the other door
that has the goat, you’ll be switching to the door with gold. Thus, you win 2/3 of the
time.
The key here is realizing that Monty Hall’s action reveals information. The result would
be different if Monty Hall chose randomly between doors B and C (and therefore
sometimes revealed the gold, causing you to lose immediately).
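A Monte Carlo sketch in Python (my illustration; the function and parameter names are mine) that plays the game under the stated rules, with the host choosing randomly between goat doors when the player has picked the gold:

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        doors = ["gold", "goat", "goat"]
        random.shuffle(doors)
        pick = random.randrange(3)
        # Host opens a goat door other than the player's pick.
        opened = random.choice(
            [i for i in range(3) if i != pick and doors[i] == "goat"]
        )
        if switch:  # move to the one remaining unopened door
            pick = next(i for i in range(3) if i not in (pick, opened))
        wins += doors[pick] == "gold"
    return wins / trials

print(play(switch=False))  # ≈ 1/3
print(play(switch=True))   # ≈ 2/3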
VI. Probability Distributions
A probability distribution is a specification of all different possible values of a random
variable along with a measure of the frequency for each of those values.
There are two kinds of random variables and thus two kinds of probability distributions:
discrete and continuous.
A discrete variable is one that can only take a countable number of different values. (Countable
is not the same as finite. Countable means that you could use the numbers 1, 2, 3, etc. to
designate the possible outcomes). Examples of discrete variables would be the value of a
die roll (there are exactly 6 possible outcomes), the outcome of a coin flip (2 possible
outcomes), number of marriages (always a whole number 0, 1, 2, etc.; it’s impossible to
have a fraction of a marriage), etc.
A continuous variable is one that is measured on a continuous (unbroken) number scale,
and therefore can take on an uncountable and infinite number of possible values.
Examples include height, weight, time, and so on. If these variables are treated as
discrete sometimes (e.g., we don’t report heights down to infinitely small units, but in
discrete numbers of inches), it is because we can’t measure with infinite precision and
because it’s often convenient to round off the results.
With discrete variables, we can reasonably assign a probability value to each possible
outcome. We can represent it with a function such as the following, which is for the roll
of a die:
P(x) = 1/6, x ∈ {1, 2, 3, 4, 5, 6}
To show that countable does not mean finite, consider the probability distribution for
the following process: flip a coin until it comes up heads, and let x = the flip on which
the first heads came up. The probability of x = 1 is ½; the probability of x = 2 is ¼; etc.;
and there is no maximum possible value of x. We would write this like so:
P(x) = (1/2)^x, x ∈ {1, 2, 3, ...}
For discrete probability distributions, the value of the function P(x) can be interpreted as
the probability of the value x.
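A quick Python check (mine) that these countably infinite probabilities still sum to 1; the first 50 terms already capture essentially all the mass:

# Sum of (1/2)^x for x = 1, 2, ..., 50 equals 1 - 2**-50.
print(sum((1 / 2) ** x for x in range(1, 51)))  # ≈ 1.0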
But for continuous probability distributions, we cannot assign a probability to each
possible value. Why not? Because there is an infinite number of possible values. The
probability of any given (and specifically defined) value is zero. What we
really want to know is the probability of the value falling within a certain interval. For
example, what’s the probability of an American man being 6’2”? The probability of
anyone being exactly 6’2”, and not the tiniest fraction taller or shorter, is zero. But we
can talk about the probability of a man being between 6’1.5” and 6’2.5”, and that
probability is greater than zero.
We represent a continuous probability distribution with a probability density function
(pdf) such as the following:
f(x) = 1/10, x ∈ [0, 10]
This function defines a uniform distribution over the interval [0,10]. Every value in the
range from 0 to 10 can occur (and not just 0, 1, 2, etc., but all the fractional values in
between). We cannot interpret f(x) as the probability of the value x, because there are
more than 10 possible values of x, so the probabilities would add up to more than 1. And
that would clearly be wrong anyway, because the chance of (say) x = 2 is not 1/10.
What f(x) does do for us is allow us to find the probability of intervals. We do this by
looking at the area underneath the curve defined by f(x). [Draw graph of this function: a
horizontal line at f(x) = 1/10, going from x = 0 to x = 10.] Note that the total area
underneath this function is 1. This makes sense, because all probabilities must add up to
1, and no value can fall outside the interval [0,10]. Note also that the area under the
curve for any interval with a length of one, such as [0,1] or [1,2] or [3.5,4.5] is equal to
1/10.
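For a uniform pdf the area calculation is just length times height; a minimal Python sketch (mine):

def p_interval(a, b, low=0.0, high=10.0):
    """P(a < x < b) for x uniform on [low, high]: rectangle area.
    Assumes [a, b] lies inside [low, high]."""
    return (b - a) * (1 / (high - low))

print(p_interval(0, 10))     # 1.0 -- the whole distribution
print(p_interval(3.5, 4.5))  # 0.1 -- any interval of length one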
There are many different continuous distributions; the uniform distribution is a very
simple one. (Also, we could have defined our uniform distribution over any interval we
wanted, such as [0,1] or [-1,1] or [50,100] or whatever.) We will be concerned, for the
most part, with just one: the normal distribution. The normal distribution has a pdf that
looks like this:
f(x) = [1 / (σ√(2π))] ∙ e^(–(1/2)[(x – μ)/σ]²)
This is the first and last time we’ll be looking at this formula. The important thing to
note is that, except for x, everything else in there is a parameter – that is, a fixed number.
Pi and e are just irrational numbers that happen to turn up a lot in the world. Mu (μ) and
sigma (σ), as you know, are the mean and standard deviation of a population. Note that
these can take on many different values, depending on the population you’re talking
about.
If you graphed this function, you’d get the famous bell curve. [Draw it, with μ marked as
the center of the distribution.] Just as with the uniform distribution, the value of f(x)
doesn’t have any important meaning. What does matter is the area underneath the curve
for any given interval. The area under the whole curve (that is, in the interval [-∞, ∞]) is
equal to 1, just as with the uniform distribution. The area under the left half is ½; the area
under the right half is ½.
Knowing the standard deviation lets us get even more information. It turns out that the
area in the interval [μ – σ, μ + σ] is equal to approximately 0.68, or just over 2/3. That is,
the probability of x falling within one standard deviation of the mean is about 2/3. It also
turns out the area in the interval [μ – 2σ, μ + 2σ] is equal to approximately 0.95, meaning the probability of x falling within two standard deviations of the mean is about 19 in 20 (95%).
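If you have SciPy available, you can confirm these areas from the normal cdf (a sketch, not part of the notes; scipy.stats.norm defaults to the standard normal):

from scipy.stats import norm

print(norm.cdf(1) - norm.cdf(-1))  # ≈ 0.6827: within one std. dev.
print(norm.cdf(2) - norm.cdf(-2))  # ≈ 0.9545: within two std. devs.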
The standard normal distribution is the normal distribution with mean of zero and
standard deviation of 1. If you plugged those into the pdf above, you’d get:
f(z) = [1 / √(2π)] ∙ e^(–(1/2)z²)
(We changed x to z because, for historical reasons, we happen to call the standard normal
variable z instead of x.) Again, this is the first and last time we’ll see this formula. The
important part is that in any statistics book, you’ll find a table that summarizes lots of
information about the area underneath the standard normal bell curve. It’s Table 3, on p.
348-9, in our text.
And it turns out we can use the information in that table for any normal distribution by
making a simple conversion. If you have a variable x that is normally distributed with
mean μ and standard deviation σ, you can convert any value of that variable into an
equivalent value of a standard normal variable using the following:
z = (x – μ) / σ
This is called a z-score, and it can be interpreted as the number of standard deviations the
value x is from the mean.
For each value of z, Table 3 gives the area under the bell curve and to the left of z. In
other words, it gives the area under the curve in the interval [-∞, z]. [Draw picture for z =
1.] The table tells us the area under the curve to the left of z = 1.00 is 0.8413. We can
easily find the area to the right of z by taking one minus the area given in the table. Thus,
the area to the right of z = 1 is 1 – 0.8413 = 0.1587.
Because the standard normal distribution is symmetrical, the area to the left of any value
z is equal to the area to the right of –z, and the area to the right of any value z is equal to
the area to the left of –z. Thus, since the area to right of z = 1 is 0.1587, the area to the
left of z = –1 is 0.1587 as well.
And we can easily find the area between any two values of z by subtracting the area to
the left of the lower value from the area to the left of the higher value. If we wanted the
area between z = –1 and z = 1, we’d note that the area to the left of z = -1 is 0.1587 (as
shown above), and the area to the left of z = 1 is 0.8413. Subtract the former from the
latter to get 0.8413 – 0.1587 = 0.6826.
We can do the same thing for any interval, not just symmetrical ones.
Example: The mean IQ is 100, and the standard deviation is 15. How many people have
an IQ between 120 and 145? First, convert both of these into z-scores: 1.33 and 3.00.
The area to the left of 1.33, as given in Table 3, is 0.9082. The area to the left of 3.00 is
0.9987. Subtract the 0.9082 from 0.9987 to get 0.0905, or about 9%.
Example: Prof. Nerdberger’s class grades are normally distributed, with a mean of 65
and a standard deviation of 17. (Yes, he sometimes gives scores above 100.) How many
students get B’s, if the B range is 80 to 90? Convert these to z-scores: (90 – 65)/17 =
1.47; (80 – 65)/17 = 0.88. The area to the left of z = 0.88 is 0.8106; the area to the left of
z = 1.47 is 0.9292. Subtracting, we get 0.9292 – 0.8106 = 0.1186, or almost 12%.
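Both examples can be checked in Python with scipy.stats.norm (a sketch of mine; the helper name p_between is hypothetical). The answers differ slightly from the table-based ones because the table forces you to round z to two decimals:

from scipy.stats import norm

def p_between(lo, hi, mu, sigma):
    """P(lo < x < hi) for x ~ Normal(mu, sigma), via z-scores."""
    z_lo = (lo - mu) / sigma
    z_hi = (hi - mu) / sigma
    return norm.cdf(z_hi) - norm.cdf(z_lo)

print(p_between(120, 145, mu=100, sigma=15))  # ≈ 0.090 (IQ example)
print(p_between(80, 90, mu=65, sigma=17))     # ≈ 0.118 (grading example)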
You can use this kind of information to solve some economic problems.
Example: You run a sandwich shop that also sells bowls of soup. Your price for a bowl
of soup is $5 (assume you’ve already set this optimally). You’ve discovered that the
number of bowls of soup requested by customers per day is normally distributed with
mean of 20 and standard deviation of 5. You fix all of your soup at the beginning of the
day, and you currently put in enough ingredients for 20 bowls’ worth. It costs you $1 to
add another bowl’s worth. Should you increase your number of bowls prepared, and if
so, by how much? We will use marginal analysis: compare the marginal cost (MC) of
preparing another bowl with the expected marginal revenue (MR) from selling it. If MR
> MC, prepare the bowl.
So, if you prepare a 21st bowl’s worth, what is the chance it will be sold? A demand of 21 means a z-score of z = 0.20. Table 3 gives a value of 0.58, meaning there’s a 58% chance you’ll sell less than or equal to 21, and a 42% chance you’ll sell 21 or more. Because 21 is included in both ranges (less than or equal to 21, and 21 or more), we need to think a little more carefully.
The problem is that we’re using a normal distribution, which is for continuous variables,
to approximate a discrete variable. We can do this by thinking of the number of actual
bowls (the discrete variable) as the rounded-off form of a continuous variable. If you
think about the non-existent bowl 20.5, you get a z-score of 0.10, and the table tells us there’s a 54% chance of less than 20.5 – that is, 20 or fewer bowls – and thus a 46% chance of more than 20.5 – that is, 21 or more bowls. Your expected MR from
preparing bowl #21 is 0.46($5) = $2.30, which exceeds the MC of $1. What about the
22nd bowl? Using z = 0.30 (bowl 21.5), we find a 1 – 0.62 or 38% chance of selling at
least 22 bowls; MR = 0.38($5) = $1.90 > $1, so you prepare it. The table below
summarizes the rest of the calculations.
Bowl   P(x ≥ Bowl)   Expected MR   MC   Prepare?
 21       0.46          $2.30      $1     Yes
 22       0.38          $1.90      $1     Yes
 23       0.31          $1.55      $1     Yes
 24       0.24          $1.20      $1     Yes
 25       0.18          $0.90      $1     No
So you’d want to prepare 24 bowls’ worth and no more. (Unless, perhaps, you were
worried that you’d lose some angry customers forever by running out of soup. How often would you turn away a customer? About 18% of the time, per the last row of the table.)
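The whole marginal-analysis table can be reproduced with a short Python loop (my sketch, using the same continuity correction and SciPy’s normal cdf):

from scipy.stats import norm

price, mc, mu, sigma = 5.0, 1.0, 20, 5

for bowl in range(21, 26):
    # P(demand >= bowl) ≈ P(x > bowl - 0.5) under the normal approximation.
    p_sell = 1 - norm.cdf((bowl - 0.5 - mu) / sigma)
    exp_mr = price * p_sell
    print(bowl, round(p_sell, 2), round(exp_mr, 2),
          "prepare" if exp_mr > mc else "stop")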
What if your number of soup-requesters per day were not normally distributed? Then
this approach wouldn’t work quite as well. What could you do? Maybe transform your
number of bowls using a natural log…