Monte Carlo for Proportions

Binomial Proportion Monte Carlo
I prepared this lesson for the seventh-grade son of a friend, who was working on a project for a
Math/Science Fair. It involves simulating the results of flips of a fair coin, where the estimated
parameter is the probability of success (heads). As you probably know, the sample proportion is a
consistent estimator of a binomial probability, and his project should demonstrate this. In statistics,
“Monte Carlo” refers to using a mathematical model of something of interest (often a sampling
distribution) along with random number generators to address the effects of varying assorted parameters.
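As a rough illustration of the idea, here is a minimal Python sketch (Python is used only for illustration; the lesson's own simulation is written in SAS). The function name simulate_proportions, the seed, and the choice of 10,000 samples per sample size are assumptions made for this example. It draws repeated samples of fair-coin flips and reports the standard deviation of the sample proportions, which should shrink as the number of flips grows.

```python
import random
import statistics

def simulate_proportions(n_flips, n_samples, rng):
    """Return the proportion of heads in each of n_samples samples of n_flips flips."""
    proportions = []
    for _ in range(n_samples):
        # A uniform random number above .5 counts as heads
        heads = sum(1 for _ in range(n_flips) if rng.random() > 0.5)
        proportions.append(heads / n_flips)
    return proportions

rng = random.Random(2024)  # fixed seed (an arbitrary choice) so the run is repeatable
spread = {}
for n_flips in (2, 10, 20):
    props = simulate_proportions(n_flips, 10_000, rng)
    spread[n_flips] = statistics.stdev(props)
    print(f"n = {n_flips:2d}: std dev of proportions = {spread[n_flips]:.4f}")
```

The exact numbers depend on the seed, but the spread for 20 flips should come out well below the spread for 2 flips.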
Here we have the outcomes from 100,000 samples. P2 is the proportion of heads in each
sample of two flips of a fair coin. As you can see, there is a lot of error in the estimation of the true
proportion (.5). If you had just one sample, 50% of the time you would be off as much as possible
(proportion = 0 or 1).
The FREQ Procedure

P2      Frequency    Percent
0           25045      25.05
0.5         49701      49.70
1           25254      25.25
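As a check on the simulated percentages, the exact probabilities for two independent flips of a fair coin can be worked out by hand. The symbol $\hat{p}$ for the sample proportion is introduced here for convenience and is not used elsewhere in the lesson.

```latex
\begin{align*}
P(\hat{p} = 0)  &= P(\text{tail, tail}) = (.5)(.5) = .25 \\
P(\hat{p} = .5) &= P(\text{head, tail}) + P(\text{tail, head}) = .25 + .25 = .50 \\
P(\hat{p} = 1)  &= P(\text{head, head}) = (.5)(.5) = .25
\end{align*}
```

The chance of being off by the maximum amount is therefore .25 + .25 = .50, in line with the simulated 25.05% + 25.25% = 50.30%.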
Here we have the outcomes from 100,000 samples. P10 is the proportion of heads in each
sample of ten flips of a fair coin. As you can see, there is much less error in the estimation of the true
proportion (.5). If you had just one sample, 66% of the time you would be off by not more than .1
(.4 ≤ proportion ≤ .6).
P10     Frequency    Percent    Cumulative    Cumulative
                                 Frequency       Percent
0             102       0.10           102          0.10
0.1           916       0.92          1018          1.02
0.2          4442       4.44          5460          5.46
0.3         11549      11.55         17009         17.01
0.4         20356      20.36         37365         37.37
0.5         24875      24.88         62240         62.24
0.6         20692      20.69         82932         82.93
0.7         11584      11.58         94516         94.52
0.8          4368       4.37         98884         98.88
0.9          1003       1.00         99887         99.89
1             113       0.11        100000        100.00
Here we also have the outcomes from 100,000 samples. P20 is the proportion of heads in
each sample of twenty flips of a fair coin. As you can see, there is even less error in the estimation of
the true proportion (.5). If you had just one sample, 74% of the time you would be off by not more
than .1 (.4 ≤ proportion ≤ .6).
P20     Frequency    Percent    Cumulative    Cumulative
                                 Frequency       Percent
0.05            2       0.00             2          0.00
0.1            26       0.03            28          0.03
0.15          123       0.12           151          0.15
0.2           452       0.45           603          0.60
0.25         1422       1.42          2025          2.03
0.3          3707       3.71          5732          5.73
0.35         7300       7.30         13032         13.03
0.4         11829      11.83         24861         24.86
0.45        15990      15.99         40851         40.85
0.5         17873      17.87         58724         58.72
0.55        16259      16.26         74983         74.98
0.6         11944      11.94         86927         86.93
0.65         7283       7.28         94210         94.21
0.7          3693       3.69         97903         97.90
0.75         1519       1.52         99422         99.42
0.8           447       0.45         99869         99.87
0.85          104       0.10         99973         99.97
0.9            27       0.03        100000        100.00
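The simulated chances of landing within .1 of the true proportion (about 66% for ten flips and 74% for twenty) can be compared with the exact binomial probabilities. The following is a small Python sketch for that check, not part of the original SAS program; the function name prob_within and its default arguments are assumptions made for this example.

```python
from math import comb

def prob_within(n_flips, tolerance=0.1, p=0.5):
    """Exact probability that the sample proportion lands within `tolerance` of p."""
    total = 0.0
    for heads in range(n_flips + 1):
        # Small slack guards against floating-point error at the boundary
        if abs(heads / n_flips - p) <= tolerance + 1e-12:
            total += comb(n_flips, heads) * p**heads * (1 - p)**(n_flips - heads)
    return total

for n_flips in (10, 20):
    print(f"n = {n_flips}: P(.4 <= proportion <= .6) = {prob_within(n_flips):.4f}")
```

The exact values, about .656 for ten flips and .737 for twenty, agree closely with the simulated percentages.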
The MEANS Procedure

Variable      Std Dev
P2          0.3546092
P10         0.1577809
P20         0.1115296
So, how much are we likely to be off when we have just one sample of 2, 10, or 20 flips? We
could compute the mean absolute deviation. This would involve finding, for each sample, the amount
by which it was off (observed proportion minus .5) and then averaging those deviations, ignoring the
sign of the deviations. A much more common procedure is to compute the standard deviation, which
is the square root of the mean squared deviation. When it describes the sampling distribution of a
statistic, this standard deviation is called the “standard error.” In the table above are the observed
standard deviations for each of the three sample sizes. Notice that as the number of coin flips
increases, the expected amount of error decreases. This is a very important property of estimators,
a property known as “consistency.”
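To make the two measures of error concrete, here is a short Python sketch that computes both from the P2 frequencies reported in the FREQ output above. The variable names are illustrative. The n - 1 divisor in the standard deviation is an assumption made to match the default behavior of PROC MEANS.

```python
from math import sqrt

# Simulated frequencies for P2, copied from the FREQ output
p2_freq = {0.0: 25045, 0.5: 49701, 1.0: 25254}
n_samples = sum(p2_freq.values())
true_p = 0.5

# Mean absolute deviation from the true proportion
mad = sum(count * abs(value - true_p) for value, count in p2_freq.items()) / n_samples

# Standard deviation about the sample mean, with the n - 1 divisor
mean = sum(count * value for value, count in p2_freq.items()) / n_samples
ss = sum(count * (value - mean) ** 2 for value, count in p2_freq.items())
sd = sqrt(ss / (n_samples - 1))

print(f"mean absolute deviation = {mad:.4f}")
print(f"standard deviation      = {sd:.4f}")
```

The mean absolute deviation (about .25) is smaller than the standard deviation (about .35) because squaring gives the larger misses more weight.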
It is known that the standard error of a proportion is equal to $\sqrt{pq/n}$, where p is the
probability of success (heads in this case), q is the probability of failure (tails in this case), and n is
the sample size (number of coin flips). For n = 2, the standard error is $\sqrt{.5(.5)/2} = .354$.
For n = 10, the standard error is $\sqrt{.5(.5)/10} = .158$. For n = 20, the standard error is
$\sqrt{.5(.5)/20} = .112$. Notice that these theoretical values are exceptionally close to those
observed. If we had all of eternity to get an infinite number of samples (and if p really were exactly
.5), the observed standard errors would perfectly match the theoretical standard error.
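The theoretical standard error of a proportion, the square root of pq/n, can be set beside the observed standard deviations with a few lines of Python. This is only a sketch; the function name standard_error is an assumption, and the observed values are copied from the MEANS output.

```python
from math import sqrt

def standard_error(p, n):
    """Theoretical standard error of a sample proportion: sqrt(p * q / n)."""
    q = 1 - p
    return sqrt(p * q / n)

# Observed standard deviations from the MEANS output
observed = {2: 0.3546092, 10: 0.1577809, 20: 0.1115296}

for n, obs in observed.items():
    theory = standard_error(0.5, n)
    print(f"n = {n:2d}: theoretical = {theory:.4f}, observed = {obs:.4f}")
```

For each sample size the two columns should differ only in the third or fourth decimal place.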
SAS Code for This Monte Carlo
options formdlim='-' pageno=min nodate;
* Simulate 100,000 samples of 20 fair coin flips each;
DATA coinflips; drop I;
  array Ys[20] Y1-Y20;          * uniform random numbers;
  array Head[20] Head1-Head20;  * 1 = heads, 0 = tails;
  do Sample = 1 to 100000;
    do I = 1 to 20;
      Ys[I] = UNIFORM(0);       * seed 0 uses the system clock;
      Head[I] = 0;
      if Ys[I] > .5 then Head[I] = 1;
    end;
    output;
  end;
run;
*proc print;
* Proportion of heads in the first 2, first 10, and all 20 flips;
DATA Flips; set coinflips;
  P2 = (Head1 + Head2)/2;
  P10 = Sum(of Head1-Head10)/10;
  P20 = Sum(of Head1-Head20)/20;
run;
*proc print;
Proc Freq; Table P2 P10 P20 / plots=freqplot; run;
Proc Means std; var P2 P10 P20; run;