David R. Merrell
90-786 Intermediate Empirical Methods for Public Policy and Management
Agenda
Normal Approximation to Binomial
Poisson Process
Random sampling
Sampling statistics and sampling distributions
Expected values and standard errors of sample sums and sample means
Binomial Random Variable
Binomial random variable X is the number of “successes” in n trials, where
Probability of success remains the same from trial to trial
Trials are independent
Binomial Probability Distribution
Discrete distribution with:
$P(X = x) = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$
n is number of trials
x is number of successes in n trials (x = 0, 1, 2, ..., n)
p is the probability of success on a single trial
q is the probability of failure on a single trial
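The probability formula above can be evaluated directly. The following is a minimal sketch in Python using only the standard library; the function name `binomial_pmf` is an illustrative choice, not something from the slides.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    q = 1.0 - p  # probability of failure on a single trial
    return comb(n, x) * p**x * q**(n - x)

# Tabulate the distribution for n = 10 trials with p = 0.4
n, p = 10, 0.4
table = [(x, binomial_pmf(x, n, p)) for x in range(n + 1)]
for x, prob in table:
    print(f"{x:2d}  {prob:.6f}")
```

The probabilities over x = 0, 1, ..., n should sum to 1, which is a quick sanity check on any such table.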
Properties of the Binomial RV
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard Deviation: $\sigma = \sqrt{npq}$
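One way to convince yourself of these properties is to compute the mean and variance by direct summation over the distribution and compare them with np and npq. This is a small sketch for the illustrative case n = 10, p = 0.4; the variable names are assumptions made for the example.

```python
from math import comb, sqrt

n, p = 10, 0.4
q = 1 - p

# Binomial probabilities P(X = x) for x = 0, 1, ..., n
pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

# Mean and variance from the definition of expected value
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
sd = sqrt(variance)

print(f"mean     = {mean:.4f}  (np   = {n * p:.4f})")
print(f"variance = {variance:.4f}  (npq  = {n * p * q:.4f})")
print(f"sd       = {sd:.4f}  (sqrt(npq) = {sqrt(n * p * q):.4f})")
```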
Binomial(n = 10, p = .4)
 x    P(X=x)
 0    0.006047
 1    0.040311
 2    0.120932
 3    0.214991
 4    0.250823
 5    0.200658
 6    0.111477
 7    0.042467
 8    0.010617
 9    0.001573
10    0.000105
Approximation to Binomial Distribution
Use normal distribution when:
n is large
np > 10
n(1 - p) > 10
Parameters of the approximating normal distribution are the mean and standard deviation from the binomial distribution
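The rule of thumb above can be written as a small check. This is a sketch only: the function name `normal_approx_params` and the `threshold` argument are hypothetical, and the cutoff of 10 is simply the one given on the slide.

```python
from math import sqrt

def normal_approx_params(n: int, p: float, threshold: float = 10):
    """Return (ok, mean, sd): whether np and n(1 - p) both exceed the threshold,
    plus the mean and standard deviation of the approximating normal."""
    q = 1 - p
    ok = n * p > threshold and n * q > threshold
    return ok, n * p, sqrt(n * p * q)

# n = 80, p = 0.4 satisfies both conditions (np = 32, n(1 - p) = 48)
ok, mu, sigma = normal_approx_params(80, 0.4)
print(ok, round(mu, 4), round(sigma, 4))

# n = 10, p = 0.4 does not (np = 4)
print(normal_approx_params(10, 0.4)[0])
```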
Approximation of Binomial Distribution
[Figure: binomial distribution with n = 80, p = .4; horizontal axis C1, vertical axis probability from 0.00 to 0.09]
How Good is the Approximation?
Binomial with n = 80 and p = 0.400000
P(X < 29)
x P( X <= x)
28.00 0.2131
Normal with mean = 32.0000 and standard deviation = 4.38000
x P( X <= x)
28.0000 0.1806
x P( X <= x)
28.5000 0.2121
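The comparison in the Minitab output can be reproduced as follows. This sketch uses Python's standard library (`statistics.NormalDist` needs Python 3.8 or later). It uses the unrounded standard deviation sqrt(npq), whereas the output above rounds it to 4.38, so the last digit may differ slightly. Evaluating the normal at 28.5 instead of 28 is the continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 80, 0.4
q = 1 - p

# Exact binomial P(X <= 28), i.e. P(X < 29)
exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(29))

# Approximating normal with the binomial's mean and standard deviation
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * q))
plain = approx.cdf(28)        # no continuity correction
corrected = approx.cdf(28.5)  # with continuity correction

print(f"exact binomial         {exact:.4f}")
print(f"normal at 28           {plain:.4f}")
print(f"normal at 28.5         {corrected:.4f}")
```

The corrected value lands much closer to the exact binomial probability than the uncorrected one, which is the point of the slide.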
Application 1
The Chicago Equal Employment Commission believes that the Chicago Transit Authority (CTA) discriminates against Republicans. The records show that 37.5% of the individuals listed as passing the CTA exam were Republicans; the remainder were Democrats (no one registers as an independent in Illinois). CTA hired 30 people last year; 25 of them were Democrats. What is the probability that this situation could exist if CTA did not discriminate?
Application 1 (cont.)
Success: a Republican is hired
The probability of success, p = 0.375
The number of trials, n = 30
The number of successes, x = 5
P(X <= 5) = ???
Application 1 (cont.)
Mean: $\mu = np = 30 \times .375 = 11.25$
Variance: $\sigma^2 = npq = 30 \times .375 \times .625 = 7.03$
Standard Deviation: $\sigma = \sqrt{7.03} = 2.65$
Normal with mean = 11.25 and standard deviation = 2.65
x P( X <= x)
5.5000 0.0150
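As a cross-check on the normal approximation for Application 1, the exact binomial probability of 5 or fewer Republicans among 30 hires can be computed alongside it. This is a sketch using the standard library; it uses the unrounded standard deviation, so the normal value may differ from the Minitab output in the last digit.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 30, 0.375   # 30 hires, 37.5% of those passing the exam are Republicans
q = 1 - p

# Exact binomial P(X <= 5)
exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(6))

# Normal approximation with continuity correction: evaluate at 5.5
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * q)).cdf(5.5)

print(f"exact binomial P(X <= 5)  = {exact:.4f}")
print(f"normal approx at 5.5      = {approx:.4f}")
```

Both values are small, so the observed hiring pattern would be unusual if party affiliation played no role in hiring.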
Poisson Process
[Figure: time axis starting at 0, with occurrences marked x at random times; labeled with the rate $\lambda$]
Poisson Process
Earthquakes strike randomly over time with a rate of $\lambda = 4$ per year.
Model time of earthquake strike as a Poisson process
Count: How many earthquakes will strike in the next six months?
Duration: How long will it take before the next earthquake hits?
Poisson Distribution
What is the probability that 3 earthquakes will strike during the next six months?
Poisson Distribution
Count in time period t
$P(Y = y) = \frac{e^{-\lambda t} (\lambda t)^y}{y!}, \quad y = 0, 1, 2, \ldots$
Minitab Probability Calculation
Click: Calc > Probability Distributions > Poisson
Enter: For mean 2, input constant 3
Output:
Probability Density Function
Poisson with mu = 2.00000
x P( X = x)
3.00 0.1804
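The same Poisson probability can be computed from the formula for the count in a time period t. In this sketch the function name `poisson_pmf` and its argument names are illustrative; the rate of 4 per year and the six-month period (t = 0.5 years) come from the earthquake example, giving a mean of 2.

```python
from math import exp, factorial

def poisson_pmf(y: int, rate: float, t: float) -> float:
    """P(Y = y): probability of y occurrences in a period of length t
    for a Poisson process with the given rate per unit time."""
    mean = rate * t
    return exp(-mean) * mean**y / factorial(y)

# 3 earthquakes in the next six months, rate 4 per year -> mean = 4 * 0.5 = 2
prob = poisson_pmf(3, rate=4, t=0.5)
print(f"P(Y = 3) = {prob:.4f}")
```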
Exponential Distribution
Time between occurrences in a Poisson process
Continuous probability distribution
Mean = $1/(\lambda t)$
Exponential Probability Problem
What is the probability that 9 months will pass with no earthquake?
$t = 1/12$ year (one month), $\lambda t = 4/12 = 1/3$ per month
$1/(\lambda t) = 3$ months
Minitab Probability Calculation
Click: Calc > Probability Distributions > Exponential
Enter: For mean 3, input constant 9
Output:
Cumulative Distribution Function
Exponential with mean = 3.00000
x P( X <= x)
9.0000 0.9502
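The cumulative probability in the output is P(X <= 9), the chance that the next earthquake arrives within 9 months. The question asks for the complement, the chance that 9 months pass with no earthquake. A minimal sketch, with `exponential_cdf` as an illustrative function name and time measured in months:

```python
from math import exp

def exponential_cdf(x: float, mean: float) -> float:
    """P(X <= x) for an exponential distribution with the given mean."""
    return 1.0 - exp(-x / mean) if x >= 0 else 0.0

mean_months = 3.0                              # mean time between earthquakes
p_within_9 = exponential_cdf(9, mean_months)   # next earthquake within 9 months
p_none_in_9 = 1.0 - p_within_9                 # no earthquake for 9 months

print(f"P(X <= 9) = {p_within_9:.4f}")
print(f"P(X > 9)  = {p_none_in_9:.4f}")
```

The complement equals e^(-3), which is also the Poisson probability of zero earthquakes in a 9-month period with mean 3, so the two views of the process agree.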
Exponential Probability Density Function
MTB > set c1
DATA > 0:12000
DATA > end
MTB > Let c1 = c1/1000
Click: Calc > Probability distributions > Exponential
> Probability density > Input column
Enter: Input column c1 > Optional storage c2
Click: OK > Graph > Plot
Enter: Y c2 > X c1
Click: Display > Connect > OK
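The Minitab steps above build a grid of x values from 0 to 12 in steps of 0.001 and evaluate the exponential density at each one. A rough Python equivalent is sketched below. The mean of 3 is an assumption carried over from the earthquake example, since the steps do not state which mean was entered; the plotting step is left out.

```python
from math import exp

mean = 3.0  # assumed mean, taken from the earthquake example

# Equivalent of "set c1 / 0:12000" followed by "let c1 = c1/1000"
xs = [i / 1000 for i in range(12001)]

# Exponential density f(x) = (1/mean) * exp(-x/mean), the analogue of column c2
density = [exp(-x / mean) / mean for x in xs]

# A few points along the curve
for x in (0, 3, 6, 12):
    print(f"f({x}) = {density[x * 1000]:.4f}")
```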
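Exponential Probability Density Function
[Figure: plot of the exponential probability density (C2) against C1; horizontal axis C1 from 0 to 10, vertical axis from 0.0 to 0.3]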
Sampling
Population - entire set of objects that we are interested in studying
Sample - a chosen subset of a population
Some Samples Are ...
random -- each item in the population has an equal chance of being selected to be part of the sample
representative -- has the same characteristics as the population under study, a microcosm of the population
Population Parameters and Sample Statistics
Population Parameter
Numerical descriptor of a population
Values usually uncertain, e.g., population mean ($\mu$), population standard deviation ($\sigma$)
Sample Statistics
Numerical descriptor of a sample
Calculated from observations in the sample
What is a sampling distribution?
Sample statistics are random variables
Sample statistics have probability distributions
“Sampling distribution” is the probability distribution of a sample statistic
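A small simulation can make the idea of a sampling distribution concrete: draw many random samples from one population, compute the sample mean of each, and look at how those means vary. Everything in this sketch is hypothetical (the simulated population, the seed, the sample size of 50, and the 1,000 repetitions); it is not the restaurant data.

```python
import random
from statistics import mean, stdev

random.seed(90786)  # fixed seed so the run is repeatable

# Hypothetical skewed population of 10,000 values
population = [random.expovariate(1 / 3) for _ in range(10_000)]

# Draw 1,000 random samples of size 50 and record each sample mean
sample_means = [mean(random.sample(population, 50)) for _ in range(1_000)]

print(f"population mean          {mean(population):.3f}")
print(f"mean of sample means     {mean(sample_means):.3f}")
print(f"population st. dev.      {stdev(population):.3f}")
print(f"st. dev. of sample means {stdev(sample_means):.3f}")
```

The sample means center near the population mean but vary much less than individual observations do. The list `sample_means` is an empirical picture of the sampling distribution of the mean.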
MTB > Retrieve 'C:\MTBWIN\DATA\RESTRNT.MTW'.
Retrieving worksheet from file: C:\MTBWIN\DATA\RESTRNT.MTW
Worksheet was saved on 5/31/1994
MTB > info
Information on the Worksheet
Column  Name      Count  Missing
C1      ID          279        0
C2      OUTLOOK     279        1
C3      SALES       279       25
C4      NEWCAP      279       55
C5      VALUE       279       39
C6      COSTGOOD    279       42
C7      WAGES       279       44
C8      ADS         279       44
C9      TYPEFOOD    279       12
C10     SEATS       279       11
C11     OWNER       279       10
C12     FT.EMPL     279       14
C13     PT.EMPL     279       13
C14     SIZE        279       16
MTB > desc 'sales'
Descriptive Statistics
Variable     N    N*    Mean  Median  TrMean   StDev  SEMean
SALES      254    25   332.6   200.0   248.9   650.5    40.8
Variable   Min     Max    Q1     Q3
SALES      0.0  8064.0  83.7  382.7
MTB > boxp 'sales'
* NOTE * N missing = 25
[Figure: boxplot of SALES; vertical axis from 0 to 8000]
MTB > hist 'sales'
* NOTE * N missing = 25
[Figure: histogram of SALES; horizontal axis SALES from 0 to 8000, vertical axis from 0 to 200]
MTB > let c15 = loge('sales')
J
*** Values out of bounds during operation at J
Missing returned 1 times
MTB > let c15 = loge('sales' + 1)
MTB > name c15 'logsales'
MTB > desc 'logsales'
Descriptive Statistics
Variable     N    N*    Mean  Median  TrMean   StDev  SEMean
logsales   254    25  5.1830  5.3033  5.2134  1.1387  0.0715
Variable      Min     Max      Q1      Q3
logsales   0.0000  8.9953  4.4394  5.9500
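The session above fails on the plain log because one restaurant has SALES of 0 and log(0) is undefined, so the transformation is redone as log(sales + 1). The sketch below reproduces that step on three values taken from the SALES summary (minimum 0.0, median 200.0, maximum 8064.0); the list itself is illustrative and is not the full data set.

```python
from math import log

sales = [0.0, 200.0, 8064.0]  # min, median, max from the SALES summary

# The natural log of 0 is undefined; Python raises an error where Minitab returns missing
try:
    log(sales[0])
except ValueError as err:
    print("log(0) failed:", err)

# Adding 1 before taking logs keeps zero sales in the data (log(1) = 0)
logsales = [log(s + 1) for s in sales]
print([round(v, 4) for v in logsales])
```

The transformed minimum and maximum agree with the logsales summary, as expected for an increasing transformation.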
MTB > boxp 'logsales'
* NOTE * N missing = 25
[Figure: boxplot of logsales; vertical axis from 0 to 9]
[Figure: histogram of logsales; horizontal axis logsales from 0 to 9, vertical axis from 0 to 90]
Four Samples of Size 50 From Restaurant “Logsales” Data--Histograms
[Figure: four histograms, one for each sample, stored in columns C16, C17, C18, and C19]
Random Samples from Restaurant “Logsales” Data--Summary
MTB > Desc c16-c19
Descriptive Statistics
Variable    N   N*   Mean  Median  TrMean  StDev  SEMean
C16        43    7  5.246   5.375   5.280  0.867   0.132
C17        43    7  5.351   5.352   5.383  1.223   0.186
C18        48    2  5.366   5.461   5.388  0.888   0.128
C19        43    7  5.244   5.198   5.253  0.937   0.143
Variable    Min    Max     Q1     Q3
C16       2.773  6.621  4.625  5.787
C17       1.099  8.456  4.710  6.176
C18       2.485  7.091  4.961  5.994
C19       3.434  6.868  4.595  6.089
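The SEMean column is the standard error of the sample mean, which is the sample standard deviation divided by the square root of the number of non-missing observations. The sketch below recomputes it from the N and StDev values in the output. The dictionary layout is an illustrative choice, and small differences in the third decimal place can arise because the printed StDev values are rounded.

```python
from math import sqrt

# (N, StDev, reported SEMean) copied from the Minitab output for each sample
samples = {
    "C16": (43, 0.867, 0.132),
    "C17": (43, 1.223, 0.186),
    "C18": (48, 0.888, 0.128),
    "C19": (43, 0.937, 0.143),
}

standard_errors = {}
for name, (n, sd, reported) in samples.items():
    se = sd / sqrt(n)  # standard error of the sample mean
    standard_errors[name] = se
    print(f"{name}: computed {se:.4f}, reported {reported:.3f}")
```

Each of the four samples gives a slightly different mean and standard error, which illustrates that sample statistics vary from sample to sample.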
Next Time ...
Central Limit Theorem -- “Sample averages are approximately normally distributed”