Conducting simulations in Excel to discover probability

Using simulations in Excel to explore combined probability.
Tim Pelton (tpelton@uvic.ca) and Leslee Francis Pelton (lfrancis@uvic.ca), University of Victoria
www.csc.uvic.ca/~tpelton/combprob.html
Simulations of all sorts can be used to provide students with an intuitive understanding of
events, models and laws in the physical, biological and mathematical sciences. By using
Excel to build and manipulate mathematical models that approximate observations in the
real world, students are able to investigate chance events and develop their understanding
of probability concepts.
With simulations we are able to explore systems and models for which mathematical
solutions may be beyond our knowledge and ability to generate or even unavailable.
Einstein used ‘mind experiments’ to investigate the laws of physics beyond the realm of
practical experience (Kaku, 1994), and many theorists today follow his example.
Conducting a simulation in Excel is similar to conducting a mind experiment where
either the model is too complex or the data set is too extensive to allow us to manipulate
it in our heads.
The explorations attached to this site are conducted in Excel, although similar
spreadsheets developed in ClarisWorks, Microsoft Works, or StarOffice might also be
used. Some statisticians and applied mathematicians might point out that concerns have
been raised about the accuracy of the statistical functions and the pseudo-random number
generators in these spreadsheets (McCullough & Wilson, 1999). However, the effects of these
anomalies typically appear only when dealing with very extreme values (e.g. z=7) or in
situations where high degrees of accuracy are required (e.g. beyond the 6th decimal place
and often at the 15th or higher decimal place). For our purposes we will simply recognize
that simulations are somewhat imperfect, as are the models we are applying. We cannot
prove the laws of probability with simulations; rather, we simply hope that we can
generate convincing inductive evidence that might promote understanding and motivate
further investigations.
In this paper we present a few simulations that may be used to help students build their
understanding of:
1. The nature of stochastic events
2. The potential for estimating underlying probabilities through repeated
observations
3. The mathematical laws that apply to combined probabilities
4. The potential utility of mathematical models and simulations in understanding
complex systems
Four simulations are suggested here (and accompanied on the web site by Excel files).
Along with each simulation description we present questions for the students to answer and
suggest additional investigations that allow the students to explore the concepts on their
own. Teachers are encouraged to develop their own simulations and sets of questions,
and we would be happy to add more samples to our collection if you wish to post them with
us (send to lfrancis@uvic.ca).
The sample spreadsheets that we are providing have been left unlocked and fully
exposed. For initial investigations, the teacher may wish to modify the spreadsheets and
hide or lock some cells so that the model will not be accidentally disturbed and so that
formulae or parameters may be left for discovery. The teacher may later present the
unprotected spreadsheets to allow students to modify parameters and extend the sheet to
answer the questions provided. Alternatively, the teacher may choose to provide only
locked versions and challenge the students to build their own simulations from scratch.
Simulation 1: The Simpsons phone
When the telephone rings in the Simpson house Lisa knows that some unknown but
stable portion of the time the call is for her and likewise there are probabilities for her
mom, dad and brother (Maggie doesn’t get calls). In the attached spreadsheet we have
created a mathematical model to simulate the telephone ringing and the recipient
identified. With this simulation your task is to answer the following questions. You
may use the uncolored portion of the spreadsheet to collect information and perform
calculations to support your answers (the teacher might choose to lock the colored
portion).
To do this simulation we set the probabilities for Lisa, Marge, Bart and Homer, and used
these to set lower limits for partitions of the range 0-1. We then used the random number
function (‘rand()’) to generate a number between 0 and 1, and used the ‘vlookup’ function to
look up the name associated with the random number generated. We used a combination
of the ‘sum’ and ‘if’ functions to calculate the number of phone calls for each character
(we could also have used the ‘countif’ function).
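For readers who want to check the spreadsheet logic outside Excel, the partition-and-lookup idea described above might be sketched in Python as follows. This is only an illustration: the probabilities in PROBS are hypothetical (the spreadsheet's actual values are meant to be discovered), and the function names are our own.

```python
import bisect
import random
from collections import Counter

# Hypothetical probabilities, chosen only for illustration.
PROBS = {"Lisa": 0.4, "Marge": 0.3, "Bart": 0.2, "Homer": 0.1}


def build_partitions(probs):
    """Return (lower_limits, names); each name owns the interval that
    starts at its lower limit and ends at the next lower limit."""
    lower_limits, names, running = [], [], 0.0
    for name, p in probs.items():
        lower_limits.append(running)
        names.append(name)
        running += p
    return lower_limits, names


def who_is_called(u, lower_limits, names):
    """Mimic an approximate-match lookup: pick the last lower limit <= u."""
    return names[bisect.bisect_right(lower_limits, u) - 1]


def simulate_calls(n_calls, probs=PROBS, seed=None):
    """Count how many of n_calls simulated phone calls go to each person."""
    rng = random.Random(seed)
    lower_limits, names = build_partitions(probs)
    return Counter(
        who_is_called(rng.random(), lower_limits, names) for _ in range(n_calls)
    )


if __name__ == "__main__":
    counts = simulate_calls(100, seed=1)
    for name in PROBS:
        print(name, counts[name], counts[name] / 100)
```

Dividing each count by the number of calls gives the relative frequency, which is the estimate that question 1 below asks about.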
1. How can you turn the counts of calls into probability estimates?
2. If the telephone rings, what is the probability that the caller wants to speak with Lisa?
Bart? Marge? Homer?
3. What must these four probabilities sum up to?
4. What real world situations have we not modeled here?
5. How certain are you that these are the correct probabilities?
6. How might this spreadsheet be modified to simplify the data gathering process?
7. Does there appear to be any order to the recipients of the calls?
Simulation 2: Traffic lights
Imagine that you have three lights you have to travel through to get to school.
If each light has a unique probability of being green (start with 50% green) what is the
probability that you will get all green lights?
(cf. PBS mathline)
In the spreadsheet you are provided with a simulation of 30 trips to school. Each trip has
randomly generated light conditions using a 50% probability that a light is green when
you arrive at it. The green and red status was generated by using rand() and comparing
the random number to the probability for green: when the value is less than the desired
probability, ‘green’ is written; otherwise ‘red’ is written. The ‘countif’ function in Excel
was used to count how many green lights were encountered in each trial, and then the
‘countif’ function was used again to count how many times each outcome occurred (0, 1,
2, or 3 green lights) over the 30 trials.
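The same procedure might be sketched in Python as below. This is a rough equivalent of the spreadsheet, not a copy of it; the function names and the default of three lights at 50% are assumptions taken from the description above.

```python
import random
from collections import Counter


def simulate_trips(n_trips=30, p_green=(0.5, 0.5, 0.5), seed=None):
    """Tally how many trips met 0, 1, 2, ... green lights."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_trips):
        # A light is green when the random number is below its probability.
        greens = sum(1 for p in p_green if rng.random() < p)
        tally[greens] += 1
    return tally


def relative_frequencies(tally, n_lights=3):
    """Convert the tally into estimated probabilities for each outcome."""
    total = sum(tally.values())
    return {k: tally[k] / total for k in range(n_lights + 1)}


if __name__ == "__main__":
    tally = simulate_trips(30, seed=2)
    print(relative_frequencies(tally))
```

Changing n_trips or the entries of p_green corresponds to extending the sheet or editing the probability cells.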
1. How can you turn those counts into probability estimates?
2. What is your estimate of the probability that you will get all green lights?
3. What is your estimate of the probability that you will get all red lights?
4. How certain are you of your estimated probabilities?
5. Can you extend this spreadsheet to make your estimates more accurate?
6. How many trials do you think you would need to be certain about the probability
within 2 decimal places?
7. What is your estimate of the probability that you will get two or more green lights?
8. What is your estimate of the probability that you will get at least one green light?
9. What is your estimate of the probability that you will get exactly 1 green light?
10. What happens when the probabilities for the lights being green change?
11. Look at the simulated cases - can you see any patterns?
12. Can you extend this problem to a 4, 5 or 6 light situation?
13. Can you solve these problems analytically?
Simulation 3: Game show paradox
This is an old problem/paradox that has been presented many times. If you search the
web for Monty Hall paradox you will come up with thousands of hits.
Imagine that you are a contestant in the game show called "Let's Make a Deal". There are
three curtains (numbered 1, 2, and 3), and behind one curtain is a great prize worth
thousands of dollars. Behind the others are worthless gag prizes. Once you select one
curtain, the host, Monty Hall, will open one of the unselected curtains and show you one
of the gag prizes (this is not exactly the same as the way they play it on TV, but we can
adjust).
Your goal is to determine the winning strategy for this game. Should you “stay” with
your first choice or “switch” to the other remaining curtain? Does it matter?
For the simulation we assume that you always start by picking curtain 1. (This is entirely
arbitrary and you could change this without affecting the results.) Then we set the prize
curtain by generating a random number between 1 and 3. The curtain Monty opens is
always a gag prize and never the curtain that you have first selected.
To accomplish this in our simulation, we use the fact that you ‘selected’ curtain 1 and
compare 1 to the winning curtain using the ‘if’ function. When the selected and winning
curtains are the same, we use ‘rand()’, multiply it by 2, round it up, and add 1 to generate
the curtain that is opened (either 2 or 3). If the winning curtain is not curtain 1, then
Monty opens the remaining gag prize curtain: we compare the winning curtain to 2, and if
it is equal we show 3, otherwise we show 2.
Finally we determine whether your choice was the winning curtain. For the ‘stay’
strategy, you have chosen to stick with your original choice, and using the ‘if’ function we
generate the result “big prize” or “gag”. For the ‘switch’ strategy you select the
remaining unopened curtain, and again the result is determined using the ‘if’ function.
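The logic just described might be sketched in Python as follows, for anyone who wants to cross-check the spreadsheet. It is a sketch under the same assumptions (you always pick curtain 1); the function names are ours, and rng.choice((2, 3)) stands in for the spreadsheet's "multiply rand() by 2, round up, add 1" step.

```python
import random


def play_once(rng):
    """Play one game; return (stay_wins, switch_wins) as booleans."""
    selected = 1                      # arbitrary fixed first choice
    winning = rng.randint(1, 3)       # prize curtain: 1, 2 or 3
    if winning == selected:
        opened = rng.choice((2, 3))   # either gag curtain may be opened
    elif winning == 2:
        opened = 3
    else:
        opened = 2
    # The curtain numbers add to 6, so the remaining one is what is left over.
    switch_to = 6 - selected - opened
    return winning == selected, winning == switch_to


def simulate(n_trials, seed=None):
    """Return the fraction of wins for the 'stay' and 'switch' strategies."""
    rng = random.Random(seed)
    stay_wins = switch_wins = 0
    for _ in range(n_trials):
        stay, switch = play_once(rng)
        stay_wins += stay
        switch_wins += switch
    return stay_wins / n_trials, switch_wins / n_trials


if __name__ == "__main__":
    stay, switch = simulate(1000, seed=4)
    print("stay:", stay, "switch:", switch)
```

One simple check that such a simulation is behaving sensibly is that exactly one of the two strategies wins in every trial.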
1. Which strategy do you predict will be the winning strategy, or does it matter?
2. Does it matter if you choose a different curtain all of the time?
3. Does it matter if you select the curtain randomly?
4. If you used the stay strategy, what is your probability of winning?
5. If you used the switch strategy, what is your probability of winning?
6. How might you verify that the simulation is working correctly?
7. Now extend the simulation so that there are 1000 trials. Would you change your
estimated probabilities for “Switch” and “Stay”?
8. How many trials do you think you would need before you would be confident of your
prediction to 1 decimal place?
9. Can you find an exact analytic solution to this problem?
10. Now let’s mentally extend the problem. Imagine there were 10 curtains and after you
selected one, Monty opened 8 curtains and exposed 8 gag prizes. What is the
probability of winning with the “Stay” strategy?
11. What is the probability of winning with the “Switch” strategy?
Simulation 4: Birthday problem
How many randomly selected persons would need to be in a room before it is likely that
there are 1 or more shared birthdays? This problem has been around for decades also. Do
a search on the web and you will find many sites that might answer your questions.
To build this simulation we began with the simple relationship between counting
numbers and days 1 – Jan 1, 2 – Jan 2, …365 – Dec 31. We then generated random
whole numbers between 1 and 365 by using ‘rand()’ to generate numbers between 0 and
1, multiplying them by 365, and using
1. What does likely mean?
2. How many would you predict before doing the simulation?
3. How many persons would need to be in a room before it is likely that three people
share a common birthday?
4. How might you extend the simulation to deal with Feb 29th birthdays?
5. Can you work out this problem analytically?
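A birthday simulation along these lines might be sketched in Python as below. The details are assumptions on our part: birthdays are drawn as whole numbers from 1 to 365 with random.randint rather than by transforming rand(), and the function names are hypothetical.

```python
import random


def has_shared_birthday(n_people, rng, days=365):
    """True if at least two of n_people share a birthday (day numbers 1..days)."""
    birthdays = [rng.randint(1, days) for _ in range(n_people)]
    # A set drops duplicates, so a shorter set means a shared birthday.
    return len(set(birthdays)) < n_people


def estimate_shared(n_people, n_trials=2000, seed=None):
    """Estimate the probability of one or more shared birthdays in the room."""
    rng = random.Random(seed)
    hits = sum(has_shared_birthday(n_people, rng) for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    for n in (10, 20, 30, 40, 50):
        print(n, estimate_shared(n, seed=n))
```

Running the loop for a range of room sizes lets students see where the estimate crosses whatever threshold they have decided ‘likely’ means.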
Basic probability review.
Simple probability:
An experiment or trial is something that happens (or might happen) and an outcome is
one of the possible results that can be objectively observed (or could possibly be
observed) about the experiment.
The sample space is all of the possible outcomes of the experiment or trial.
An event is some subset of outcomes in the sample space that we might be trying to
determine the probability of.
We speak of the likelihood of some particular outcome or event happening as a
probability.
Probabilities can range from 0 to 1:
i.e.
if X will never happen, P(X) = 0, and
if X will always happen, P(X) = 1,
where X is the event or outcome that we are considering.
Most outcomes and events are less certain and have a probability between 0 and 1.
As an example, when you flip a coin (the experiment), it is likely that it will come up (the
observation) heads (an outcome) half of the time and tails (another outcome) the other
half of the time (as long as the coin is fair). We might write this as:
P(heads) = 1/2 = 0.5 and
P(tails) = 1/2 = 0.5
When you roll (the experiment) a fair die each of the six possible outcomes is equally
likely:
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 = 0.1666…
When you have a game card (the experiment) for McDonalds with a 1 in 100 chance of
winning a Big Mac (an outcome):
P(you win Big Mac) = 1/100 = 0.01
P(you do not win a Big Mac) = 99/100 = 0.99
The sum of the probabilities of all of the possible outcomes of an experiment is always 1.
e.g.:
P(heads) +P(tails) = 1
P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1
P(win Big Mac) + P(don’t win Big Mac) = 0.01+0.99 = 1
This also extends to events – or groups of outcomes – so long as they don’t overlap…
P(die is even) + P(die is odd) = 1
Independent and dependent stages of an experiment:
If an experiment consists of a sequence of simpler experiments, then it is called a
complex or compound experiment.
Two events are independent if the occurrence or nonoccurrence of one does not change
the probability of the other event.
Conversely, when the probability of one outcome or event depends upon the outcome of
another component of the experiment, then the two outcomes or
events are described as dependent.
Some examples of independent components in an experiment:
Flipping two coins (or a single coin twice)
Flipping a coin 10 times in a row
Rolling two dice
The winning lottery numbers on consecutive weeks
The birthdays of a random collection of students
Some examples of dependent components in an experiment:
Drawing two cards from a deck of cards without replacing the first card
Collecting two game pieces from McDonalds (there is a finite number of pieces)
The status of two traffic lights a block apart on a road with synchronized lights
The existence of clouds in the sky and the amount of precipitation on a given day
Outcomes or events with respect to multiple observations on a single experiment or object
are independent of one another so long as the probabilities for the outcomes or events of
each observation are independent of the probabilities for the outcomes of the other
observation(s).
Some examples of independent observations on a single event or object:
A person’s gender, race, and favorite flavor of ice cream
A card’s suit and value
Some examples of dependent observations on a single event or object:
A person’s mass and shoe size
A gemstone’s clarity and value
(note that the relationships between the observations don’t have to be perfect – any
significant correlation indicates dependency)
Compound events
When you have two or more events and are examining the outcomes (or a single event
and are examining the two observations), a combined probability can be calculated from
the individual probabilities using the following formulae.
The probability of two distinct outcomes or events both occurring is:
P(A and B) = P(A)*P(B|A) = P(B)*P(A|B)
Which in words says the probability of A and B is the same as the probability of A times
the probability of B given the fact that A happened.
E.g. 1. The probability that when selecting two cards without replacement you get the 2 of
clubs first and the ace of spades second (here we have dependent events):
P(2-clubs first and Ace-spades second)
= P(2-clubs)*P(Ace-spades given that the 2-clubs is already removed)
= 1/52*1/51
= 1/2652
The reverse order has the same probability:
P(Ace-spades)*P(2-clubs given that the Ace-spades is already removed)
= 1/52*1/51
= 1/2652
so the probability of getting both cards in either order is 2/2652 = 1/1326.
E.g. 2. The probability of getting two green lights in a row on an empty road when lights
are synchronized so that light 2 is timed to be green when you had a green on light 1, and
light 1 is green 60% of the time.
P(green light 1 and green light 2)
= P(green light 1) * P(green light 2 | green light 1)
= 0.6 * 1.0
= 0.6
For independent observations and/or events, the intersection formula simplifies to:
P(A and B) = P(A)*P(B)
Because there is no inter-dependency between the observation that yields outcome A and
the observation that yields outcome B.
E.g. 3. The probability of cutting the 2 of clubs on the first cut and the Ace of spades on
the second cut of a full deck (here we have independent events):
P(cutting the 2-clubs and then the Ace-spades in 2 cuts)
= P(2-clubs)*P(Ace-spades)
= 1/52*1/52
= 1/2704
E.g. 4. The probability of winning a Big Mac on the game card you received with lunch,
P(win Big Mac) = 0.01, and the probability of getting a green light on the one light you
have to pass through on the way back to work, P(green)=0.6.
P(win Big Mac and get green light)
= P(win Big Mac) * P(get green light)
= 0.01 * 0.6
= 0.006
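The arithmetic in the four ‘and’ examples above can be checked with exact fractions. The short Python sketch below is only a verification aid; the helper name p_and is our own.

```python
from fractions import Fraction


def p_and(p_a, p_b_given_a):
    """P(A and B) = P(A) * P(B | A); for independent events P(B | A) = P(B)."""
    return p_a * p_b_given_a


# E.g. 1: 1/52 * 1/51 (cards drawn without replacement, dependent)
cards_drawn = p_and(Fraction(1, 52), Fraction(1, 51))
# E.g. 2: light 1 green 60% of the time, light 2 always green after it
synced_lights = p_and(Fraction(6, 10), Fraction(1))
# E.g. 3: 1/52 * 1/52 (two independent cuts of a full deck)
cards_cut = p_and(Fraction(1, 52), Fraction(1, 52))
# E.g. 4: win a Big Mac (0.01) and get a green light (0.6), independent
lunch = p_and(Fraction(1, 100), Fraction(6, 10))

print(cards_drawn, synced_lights, cards_cut, lunch, float(lunch))
```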
The probability of either one or the other or both outcomes occurring on two
observations (and/or events) is:
P(A or B) = P(A) + P(B) - P(A and B)
Which is the sum of their probabilities less the probability of both outcomes (removing
the duplicate count of the overlapping observations)
Again, for the ‘and’ probability component there is the dependent case and the
independent case.
E.g. 1. The probability of getting the first light green or the second light green (or both)
when the lights are synchronized so that you have an 80% likelihood of getting green on
the second light if you got a green light on the first, and the probability of getting green
on each light is 50%. Here we have two observations on dependent events.
P(green on light 1 or green on light 2)
= P(green on light 1) + P(green on light 2) – P(green on light 1 and 2)
= 0.5 + 0.5 - P(green on light 1) * P(green light 2 | green light 1)
= 1.0 - 0.5 * 0.8
= 1.0 - 0.4
= 0.6 or 60%
E.g. 2. The probability of selecting a card that is either a face card or a heart (or both).
This is an example of two independent observations on a single event, although there is
still an overlap that must be attended to.
P(card is a face card or heart)
= P(card is face card) + P(card is heart) - P(card is face card and heart)
= 12/52 + 13/52 - 12/52*13/52
= 25/52 - 3/52
= 22/52
E.g. 3. The probability of not winning a Big Mac on the card you got at lunch or getting
a red light on the single light on your route home or both. Here we have independent
observations on independent events and we still have an overlap.
P(don’t win Big Mac or get red light)
= P(don’t win Big Mac) + P(red light) – P(don’t win Big Mac and red)
= 0.99 + 0.5 – 0.99*0.5
= 1.49 – 0.495
= 0.995 or 99.5%
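The three ‘or’ examples above can be verified in the same way. In the Python sketch below, the helper p_or and the deck enumeration are our own additions; the enumeration simply counts the cards that are face cards or hearts as an independent check on E.g. 2.

```python
from fractions import Fraction
from itertools import product


def p_or(p_a, p_b, p_a_and_b):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


half = Fraction(1, 2)
# E.g. 1: dependent lights, P(green 2 | green 1) = 0.8
lights = p_or(half, half, half * Fraction(8, 10))
# E.g. 2: face card or heart, using the independence of rank and suit
cards = p_or(Fraction(12, 52), Fraction(13, 52), Fraction(12, 52) * Fraction(13, 52))
# E.g. 3: no Big Mac (0.99) or red light (0.5), independent
lunch = p_or(Fraction(99, 100), half, Fraction(99, 100) * half)

# Independent check of E.g. 2 by listing the whole deck.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(RANKS, SUITS))
face_or_heart = sum(
    1 for rank, suit in deck if rank in ("J", "Q", "K") or suit == "hearts"
)

print(lights, cards, lunch, face_or_heart, len(deck))
```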
Other, more complex probabilities may be calculated using these basic formulae, and by
being aware of how the ‘and’ and ‘or’ combined probability formulas may be extended
beyond two observations.
When a problem is difficult, alternate views often help. Here one of the instances of
De Morgan’s laws may prove useful:
1 - P(A and B) = P( not A or not B)
or
P(A and B) = 1 - P( not A or not B)
or
1- P(A or B) = P(not A and not B)
or
P(A or B) = 1 - P(not A and not B)
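These identities can be confirmed by brute force on a small sample space. The Python sketch below enumerates the 36 outcomes of rolling two dice; the two events A and B are hypothetical choices made only for the illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered rolls of two fair dice.
SPACE = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a true/false test on an outcome."""
    return Fraction(sum(1 for o in SPACE if event(o)), len(SPACE))


def a(o):
    """Hypothetical event A: the first die is even."""
    return o[0] % 2 == 0


def b(o):
    """Hypothetical event B: the second die shows 5 or 6."""
    return o[1] >= 5


p_a_and_b = prob(lambda o: a(o) and b(o))
p_a_or_b = prob(lambda o: a(o) or b(o))
p_not_a_or_not_b = prob(lambda o: (not a(o)) or (not b(o)))
p_not_a_and_not_b = prob(lambda o: (not a(o)) and (not b(o)))

print(1 - p_a_and_b == p_not_a_or_not_b)
print(1 - p_a_or_b == p_not_a_and_not_b)
```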
Some Excel tips.
Addressing cell ranges in Excel
Range: specified by the upper left and lower right cells separated by a colon e.g.
a2:g8
Relative addresses: addresses that change when you move/copy the formula
Absolute addresses: addresses that don’t change when you move/copy the formula
Some Excel features you will want to use
F4 (command-t on Mac): toggle through relative, absolute, and mixed references
Fill special: incrementally increase numbers – e.g. 1, 2, 3, 4,…
F9 (command-= on Mac): recalculate the cells – necessary for multiple experiments
Some Excel functions you will likely need:
rand() returns a pseudo-random number between 0 and 1 (deterministic from
seed)
if(condition, do if true, do if false) does one or the other depending on whether
the condition is true or false.
countif(range, desired) count how many cells in the range are equal to the
desired value.
average(range) determine the average value across all cells in the range.
ceiling(x, significant digit) returns the multiple of the significant digit that is just
above x. To transform a random number from 0-1 to a random whole number
from 1-10 you would multiply by 10 and use ‘ceiling’. E.g. ceiling(rand()*10,1).
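The same multiply-and-round-up transformation might look like this in Python. The function name is hypothetical, and the max(1, ...) guard is our own addition for the very rare case where the generator returns exactly 0.

```python
import math
import random
from collections import Counter


def random_whole(n, rng):
    """Map a random number in [0, 1) to a whole number from 1 to n."""
    # Multiply by n and round up; guard against an exact 0.0 from the generator.
    return max(1, math.ceil(rng.random() * n))


if __name__ == "__main__":
    rng = random.Random(8)
    draws = Counter(random_whole(10, rng) for _ in range(1000))
    for value in range(1, 11):
        print(value, draws[value])
```

Tallying the draws, as in the usage lines, is a quick way to confirm that each whole number turns up about equally often.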
Some simulation tips
1. Be aware of the assumptions you are making. What are their implications?
2. GIGO. Don’t assume everything is working correctly. Check for data errors and
formula errors.
3. Some problems are hard until you think of their complements.
4. Remember that the variance of the mean is the overall variance divided by the number
of replications (the idea behind the ‘law of large numbers’), so
$SE_{\bar{X}} = SD_X / \sqrt{N}$
So 9 replications will yield $SE_{\bar{X}} = SD_X / 3$
100 replications will yield $SE_{\bar{X}} = SD_X / 10$
625 replications will yield $SE_{\bar{X}} = SD_X / 25$
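The shrinking standard error can itself be explored by simulation. The Python sketch below is an illustration under our own assumptions: each replication is one roll of a fair die, and the function name and the number of repeated experiments are hypothetical choices.

```python
import math
import random
import statistics

# Standard deviation of a single fair die roll: sqrt(35/12).
SD_DIE = math.sqrt(35 / 12)


def se_of_mean(n_reps, n_experiments=2000, seed=None):
    """Spread of the mean of n_reps die rolls across many repeated experiments."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.randint(1, 6) for _ in range(n_reps))
        for _ in range(n_experiments)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    for n in (9, 100, 625):
        print(n, round(se_of_mean(n, seed=n), 4), round(SD_DIE / math.sqrt(n), 4))
```

Each printed row shows the simulated spread of the mean next to the value predicted by dividing the standard deviation by the square root of the number of replications.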
References:
Kaku, M. (1994). Hyperspace: A scientific odyssey through parallel universes, time
warps, and the tenth dimension. New York: Oxford University Press.
McCullough, B. D., & Wilson, B. (1999). On the accuracy of statistical procedures in
Microsoft Excel 97. Computational Statistics and Data Analysis, 31, 27-37.