Binomial setting and distributions

Binomial distributions are models for some categorical variables,
typically representing the number of successes in a series of n
independent trials.
The observations must meet these requirements:
 the total number of observations n is fixed in advance
 each observation falls into just one of two categories: success and failure
 the outcomes of all n observations are statistically independent
 all n observations have the same probability p of “success”
Applications for binomial distributions
Binomial distributions describe the possible number of times that a
particular event will occur in a sequence of observations.

In a clinical trial, a patient’s condition may improve or not. The binomial
distribution describes the number of patients who improved (not how much
better they feel) among the study participants.

Is a child obese or not (based on their body mass index)? The binomial
distribution describes the number of obese children in a random sample of
school-age children.

In a quality control study, we assess the number of defective items in a lot
of goods, irrespective of the type of defect.
Binomial parameters
We express a binomial distribution for the count X of successes among
n observations as a function of the parameters n and p: X ~ B(n,p).

The parameter n is the total number of observations.

The parameter p is the probability of success on each observation.

The count of successes X can be any whole number between 0 and n.
The CDC estimates that a third of adult men are obese. In a random
sample of 10 adult men, each man is either obese or not.
The variable X is the number of obese men among those 10 men
sampled, our count of “successes.”
For each man, the probability of success, “obese,” is 1/3. The number X of
obese men among 10 men has the binomial distribution B(n = 10, p = 1/3).
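To make this setup concrete, the following minimal sketch tabulates the distribution B(n = 10, p = 1/3) with R's built-in dbinom. The variable names (obesity_table, total) are illustrative and not part of the original example.

```r
# Illustrative sketch: the distribution of X ~ B(n = 10, p = 1/3)
n <- 10      # number of men sampled
p <- 1 / 3   # probability that any one man is obese

k <- 0:n                                 # every possible count of successes
prob <- dbinom(k, size = n, prob = p)    # P(X = k) for each k

# One row per possible value of X, with its probability
obesity_table <- data.frame(k = k, probability = round(prob, 4))
print(obesity_table)

# The probabilities over all possible counts add up to 1
total <- sum(prob)
```

The table has n + 1 = 11 rows because X can be any whole number from 0 to 10.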
Binomial probabilities
The number of ways of arranging k successes in a series of n
observations (with constant probability p of success) is the number of
possible combinations (unordered sequences).
This can be calculated with the binomial coefficient (in R: choose(n,k)):
\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]
where k = 0, 1, 2, ..., or n
The binomial coefficient, read “n choose k,” uses the factorial notation “!”.
The factorial n! for any strictly positive whole number n is:
n! = n × (n − 1) × (n − 2) × … × 3 × 2 × 1
By convention, 0! = 1, so the coefficient is also defined for k = 0 and k = n.
The binomial coefficient counts the number of ways in which k
successes can be arranged among n observations.
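As a small check of the counting argument, the sketch below computes the coefficient from the factorial formula and compares it with R's built-in choose. The helper name n_choose_k is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: n choose k computed directly from the factorial formula
n_choose_k <- function(n, k) {
  factorial(n) / (factorial(k) * factorial(n - k))
}

# Ways to place 2 successes (S) among 4 observations:
# SSFF, SFSF, SFFS, FSSF, FSFS, FFSS
by_formula <- n_choose_k(4, 2)   # 6
by_builtin <- choose(4, 2)       # 6

# Edge cases rely on the convention 0! = 1
no_successes  <- n_choose_k(4, 0)   # exactly one way to have no successes
all_successes <- n_choose_k(4, 4)   # exactly one way to have only successes
```

For large n the factorials overflow, so choose(n, k) is the safer call in practice.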
The binomial probability P(X = k) is this count multiplied by the
probability of any specific arrangement of the k successes:
\[ P(X = k) = \binom{n}{k}\, p^k (1-p)^{n-k} \]
The probability distribution of X, writing q = 1 − p:

X       P(X)
0       \binom{n}{0} p^0 q^n = q^n
1       \binom{n}{1} p^1 q^{n-1}
2       \binom{n}{2} p^2 q^{n-2}
…       …
k       \binom{n}{k} p^k q^{n-k}
…       …
n       \binom{n}{n} p^n q^0 = p^n
Total   1

The probability that a binomial random variable takes any
range of values is the sum of each probability for getting
exactly that many successes in n observations.
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
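The summation rule for a range of values can be sketched in R as follows. The example reuses B(n = 10, p = 1/3) from earlier. The second range, 3 ≤ X ≤ 5, is an assumption chosen only for illustration.

```r
# Illustrative sketch using X ~ B(n = 10, p = 1/3)
n <- 10
p <- 1 / 3

# P(X <= 2) as the sum of P(X = 0), P(X = 1), and P(X = 2)
by_sum <- sum(dbinom(0:2, size = n, prob = p))

# The same cumulative probability from R's distribution function
by_cdf <- pbinom(2, size = n, prob = p)

# Any other range works the same way, for example P(3 <= X <= 5)
middle_by_sum <- sum(dbinom(3:5, size = n, prob = p))
middle_by_cdf <- pbinom(5, size = n, prob = p) - pbinom(2, size = n, prob = p)
```

Note that dbinom gives P(X = k) while pbinom gives P(X ≤ k), so a range is a difference of two pbinom calls.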
The frequency of color blindness (dyschromatopsia) in the
Caucasian American male population is estimated to be
about 8%. In a group of 25 Caucasian American males, what
is the probability that exactly five are color blind?

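\[
\begin{aligned}
P(X = 5) &= \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \frac{25!}{5!\,20!}\,(0.08)^5 (0.92)^{20} \\
&= \frac{21 \times 22 \times 23 \times 24 \times 25}{1 \times 2 \times 3 \times 4 \times 5}\,(0.08)^5 (0.92)^{20} \\
&= 53{,}130 \times 0.0000033 \times 0.1887 \approx 0.03285
\end{aligned}
\]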

Use technology
> dbinom(5,25,.08)
[1] 0.03285083
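The hand calculation and the dbinom call can be compared step by step. The sketch below splits the formula into its two factors. The variable names are illustrative.

```r
# Color blindness example: n = 25 men, p = 0.08, exactly k = 5 color blind
n <- 25
p <- 0.08
k <- 5

# Number of ways to arrange 5 successes among 25 observations
ways <- choose(n, k)                       # 53130

# Probability of any one specific arrangement with 5 successes
one_arrangement <- p^k * (1 - p)^(n - k)

# Binomial probability: count of arrangements times probability of each
by_hand <- ways * one_arrangement

# The same probability from R's built-in function
by_dbinom <- dbinom(k, size = n, prob = p)
```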
The incidence of major depression in adults is about 10%. A random sample of
50 adults will be tested for depression. The variable X is the number of
individuals diagnosed with depression among all 50 and has the binomial
distribution Bin(n = 50, p = 0.1).
The probability that exactly 2 adults in the sample have depression is:
A) 0.010
B) 0.020
C) 0.078
D) 0.100
E) 0.112
Binomial mean and variance
The center and spread of the binomial distribution for a count X are
defined by the mean μ and standard deviation σ:
\[ \mu = np \]
\[ \sigma = \sqrt{np(1-p)} \]
The incidence of major depression in adults is about 10%. A random sample of
50 adults will be tested for depression. The variable X is the number of
individuals diagnosed with depression among all 50 and has the binomial
distribution Bin(n = 50, p = 0.1). Thus,
\[ \mu = np = 50 \times 0.1 = 5 \]
\[ \sigma = \sqrt{np(1-p)} = \sqrt{50 \times 0.1 \times 0.9} = \sqrt{4.5} \approx 2.12 \]
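One way to see where these formulas come from is to compute the mean and standard deviation directly from the probabilities P(X = k). The following sketch does this for Bin(n = 50, p = 0.1) and compares the results with the shortcut formulas. The variable names are illustrative.

```r
# Depression example: X ~ Bin(n = 50, p = 0.1)
n <- 50
p <- 0.1

# Shortcut formulas
mu_formula    <- n * p                    # 5
sigma_formula <- sqrt(n * p * (1 - p))    # sqrt(4.5)

# Direct computation from the probability distribution
k <- 0:n
prob <- dbinom(k, size = n, prob = p)
mu_direct    <- sum(k * prob)                          # weighted average of counts
sigma_direct <- sqrt(sum((k - mu_direct)^2 * prob))    # root of weighted squared deviations
```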
Effect of changing p when n is fixed
Binomial distributions are skewed
when p is close to 0 or close to 1
(especially if the sample is small).
[Figure: probability histograms of P(X = x) against X (0 to 5) for B(5, 0.1), B(5, 0.3), B(5, 0.5), B(5, 0.7), and B(5, 0.9).]
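The shapes shown in these panels can be reproduced numerically. The sketch below tabulates P(X = x) for n = 5 and the five values of p from the figure, then checks two features of the shape. The matrix name pmf and its labels are illustrative.

```r
# Tabulate B(5, p) for the five values of p shown in the figure
n <- 5
p_values <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# One column per value of p, one row per possible count x = 0, ..., 5
pmf <- sapply(p_values, function(p) dbinom(0:n, size = n, prob = p))
dimnames(pmf) <- list(paste0("x=", 0:n), paste0("p=", p_values))
print(round(pmf, 3))

# B(5, 0.5) is symmetric: reading the column backwards gives the same values
half <- unname(pmf[, "p=0.5"])
symmetric_at_half <- isTRUE(all.equal(half, rev(half)))

# B(5, 0.1) and B(5, 0.9) are mirror images of each other
mirror_images <- isTRUE(all.equal(unname(pmf[, "p=0.1"]),
                                  rev(unname(pmf[, "p=0.9"]))))
```

Passing a column of pmf to barplot() would draw the corresponding panel.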
Effect of changing n for a fixed value of p
[Figure: probability histograms of P(X = x) against X (0 to 20) for B(5, 0.15), B(10, 0.15), B(15, 0.15), and B(20, 0.15).]
Normal approximation to binomial
A binomial distribution can be approximated by a Normal distribution
when both np ≥ 10 and n(1 − p) ≥ 10:

\[ B(n, p) \approx N\left(\mu = np,\ \sigma = \sqrt{np(1-p)}\right) \]

The approximation can be improved by using a continuity correction,
which accounts for the fact that the Normal distribution is continuous
while the binomial count is discrete.
Hint: P(X = x) ≈ P(x − 0.5 ≤ Y ≤ x + 0.5), where Y is the approximating Normal variable.
The incidence of major depression in adults is about 10%.
Count of adults diagnosed with depression in a sample of 20 adults, Bin(n = 20, p = 0.1).
No Normal approximation. Why?
[Figure: Binomial, n = 20, p = 0.1; probability against count of adults with depression.]
Count of adults diagnosed with depression in a sample of 100 adults, Bin(n = 100, p = 0.1).
Normal approximation OK. Why?
[Figure: Binomial, n = 100, p = 0.1; probability against count of adults with depression.]
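The two "Why?" questions can be explored by applying the rule of thumb np ≥ 10 and n(1 − p) ≥ 10 to both samples. The helper normal_approx_ok below is hypothetical, written only for this illustration.

```r
# Hypothetical helper: TRUE when both np >= 10 and n(1 - p) >= 10
normal_approx_ok <- function(n, p) {
  (n * p >= 10) && (n * (1 - p) >= 10)
}

# Sample of 20 adults: np = 2, well below 10
small_sample <- normal_approx_ok(20, 0.1)

# Sample of 100 adults: np = 10 and n(1 - p) = 90
large_sample <- normal_approx_ok(100, 0.1)
```

With n = 20 the expected count is only 2, so the distribution is piled up against 0 and strongly skewed. With n = 100 both expected counts reach the threshold.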
The frequency of color blindness (dyschromatopsia) in the
Caucasian American male population is about 8%.
We take a random sample of size 125 from this population.
What is the probability that 6 individuals or fewer in the sample are color blind?

Distribution of the count X: B(n = 125, p = 0.08), so np = 10
P(X ≤ 6) = pbinom(6,size=125,prob=.08) in R
[1] 0.1198136 or about 12%

Normal approximation: N(μ = np = 10, σ = √(np(1 − p)) = 3.033)
P(X ≤ 6) ≈ pnorm(6, mean=10, sd=3.033) = 0.0936 or about 9%
Or z = (x − µ)/σ = (6 − 10)/3.033 = −1.32, so P(X ≤ 6) ≈ 0.0934 from Table B
The Normal approximation is in the right range but falls noticeably short of the
exact value of about 12%. Here p = 0.08 is not close to 0.5, and np = 10 only just
meets the criterion. Using a continuity correction greatly improves the approximation:

P(X ≤ 6) ≈ P(Y ≤ 6.5) = pnorm(6.5, mean=10, sd=3.033) = 0.1243, where Y is the approximating Normal variable
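The three calculations can be placed side by side to see how much the continuity correction helps. This is a minimal sketch; the variable names are illustrative, and σ is computed from n and p rather than typed in as 3.033.

```r
# Color blindness example: X ~ B(n = 125, p = 0.08), probability of 6 or fewer
n <- 125
p <- 0.08
mu    <- n * p                     # 10
sigma <- sqrt(n * p * (1 - p))     # about 3.033

exact     <- pbinom(6, size = n, prob = p)         # exact binomial probability
plain     <- pnorm(6,   mean = mu, sd = sigma)     # Normal, no correction
corrected <- pnorm(6.5, mean = mu, sd = sigma)     # Normal, continuity correction

# Absolute error of each approximation relative to the exact value
errors <- c(plain = abs(plain - exact), corrected = abs(corrected - exact))
print(round(c(exact = exact, plain = plain, corrected = corrected), 4))
```

Note that pnorm takes the standard deviation through the argument sd.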
Distributions for the color blindness example.
The larger the sample size, the better the Normal approximation fits the binomial distribution.
[Figure: binomial distribution and Normal approximation of P(X = x) against the count of successes, in three panels: n = 50, n = 125, and n = 1000.]
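The effect of sample size can also be measured rather than judged by eye. The sketch below uses one possible measure, which is an assumption of this example and not taken from the slides: the largest gap between the exact cumulative probability P(X ≤ k) and its continuity-corrected Normal approximation, over all k.

```r
# Color blindness example, p = 0.08, for the three sample sizes in the figure
p <- 0.08
sample_sizes <- c(50, 125, 1000)

# Largest absolute gap between the exact and approximate P(X <= k), over all k
max_error <- sapply(sample_sizes, function(n) {
  k <- 0:n
  mu <- n * p
  sigma <- sqrt(n * p * (1 - p))
  exact  <- pbinom(k, size = n, prob = p)
  approx <- pnorm(k + 0.5, mean = mu, sd = sigma)   # continuity correction
  max(abs(exact - approx))
})
names(max_error) <- paste0("n=", sample_sizes)
print(signif(max_error, 3))
```

Under this measure the worst-case gap shrinks as n grows from 50 to 125 to 1000.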
The Poisson distributions
A Poisson distribution describes the count X of occurrences of an
event in fixed, finite intervals of time or space when

occurrences are all independent,

and the probability of an occurrence is the same over all possible
intervals.
Think of the Poisson distribution as describing the number of items in containers.

Items                    Containers
Radioactive decays       Second
Weeds                    Acre of farm land
Fleas                    Dog
Cardiovascular deaths    County / year
If we divide a natural lawn into 1 ft² quadrants, we can count
how many dandelions are in each quadrant.
Dandelion seeds are wind-spread. The probabilities of a quadrant containing
0, 1, 2, 3, … dandelions are given by a Poisson distribution, which rests on two conditions:
(i) independence of dandelions: the presence of one dandelion in a
quadrant does not make the presence of another more or less likely.
(ii) homogeneity of quadrants: each quadrant is equally likely
to contain dandelions.
Poisson probabilities
If μ is the population mean number of occurrences for a specified
interval of time or space, then the Poisson probability distribution of
observing k occurrences (k = 0, 1, 2, …) at constant μ (> 0) is:
\[ P(X = k) = e^{-\mu}\, \frac{\mu^k}{k!} \]
The Poisson distribution has mean μ and standard deviation σ:
\[ \sigma = \sqrt{\mu} \]
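The Poisson formula and its mean and standard deviation can be checked numerically. In the sketch below, the helper poisson_prob is hypothetical, μ = 3.5 is an arbitrary value chosen for illustration, and the infinite range of counts is cut off at k = 100, beyond which the probabilities are negligible for this μ.

```r
# Hypothetical helper implementing P(X = k) = exp(-mu) * mu^k / k!
poisson_prob <- function(k, mu) {
  exp(-mu) * mu^k / factorial(k)
}

mu <- 3.5      # illustrative mean number of occurrences
k  <- 0:100    # truncated range of counts (assumption: the tail beyond 100 is negligible)

by_formula <- poisson_prob(k, mu)
by_dpois   <- dpois(k, lambda = mu)    # R's built-in Poisson probabilities

# Mean and standard deviation computed from the probabilities
mean_direct <- sum(k * by_formula)
sd_direct   <- sqrt(sum((k - mean_direct)^2 * by_formula))
```

R calls the Poisson mean lambda, which plays the role of μ here.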
Effect of changing μ:
[Figure: Poisson probability against X (0 to 25) in four panels: Mean = 1.5, Mean = 3.5, Mean = 7, and Mean = 15.]
The Poisson distribution is skewed when μ < 5.
The number of deer crossing a road at night during mating season in a
particular rural area can be modeled with a Poisson distribution. A local survey
conducted over 4 nights found a total of 20 deer crossings. Based on this
information, what is the probability that fewer than three deer would cross on a
given night during mating season in this area?
\[ P(X = k) = \frac{e^{-\mu}\, \mu^k}{k!}, \quad k = 0, 1, 2, \ldots \text{ for some } \mu > 0 \]
To compute this probability using the Poisson distribution, we need to know μ.
In this case μ = 20 / 4 = 5 deer crossings per night.
> ppois(2,lambda=5)
[1] 0.124652
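The call above uses 2 as its first argument because "fewer than three" means X ≤ 2 for a whole-number count. The sketch below spells out each step, from estimating μ to summing the three probabilities. The variable names are illustrative.

```r
# Estimate the mean number of crossings per night from the survey
total_crossings <- 20
nights <- 4
mu <- total_crossings / nights     # 5 crossings per night

# P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
by_sum <- sum(dpois(0:2, lambda = mu))

# The same sum written out from the Poisson formula:
# exp(-mu) * (mu^0/0! + mu^1/1! + mu^2/2!)
by_formula <- exp(-mu) * (1 + mu + mu^2 / 2)

# Cumulative probability P(X <= 2) from R's distribution function
by_cdf <- ppois(2, lambda = mu)
```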
\[
\begin{aligned}
P(X < 3) &= P(X = 0) + P(X = 1) + P(X = 2) \\
&= \frac{e^{-5}\, 5^0}{0!} + \frac{e^{-5}\, 5^1}{1!} + \frac{e^{-5}\, 5^2}{2!} \\
&= e^{-5}(1 + 5 + 12.5) \\
&\approx 0.1247
\end{aligned}
\]
Historical records over 20 years in a particular town indicate
an average of 4 severe rainstorms per year.
Modeling the occurrences of severe rainstorms with the
Poisson distribution, the probability that there would be
no severe rainstorm next year is
\[ P(X = 0) = \frac{4^0 e^{-4}}{0!} \approx 0.018 \]
Probability of 5 severe rainstorms next year
\[ P(X = 5) = \frac{4^5 e^{-4}}{5!} \approx 0.156 \]
Probability of 1 or more severe rainstorms next year
\[ P(X \ge 1) = 1 - P(X = 0) = 1 - 0.018 = 0.982 \]
Probability of more than 5 severe rainstorms next year
\[ P(X > 5) = 1 - P(X \le 5) = 1 - 0.785 = 0.215 \]
x     P(X=x)     P(X≤x)
0     1.832%     1.832%
1     7.326%     9.158%
2     14.653%    23.810%
3     19.537%    43.347%
4     19.537%    62.884%
5     15.629%    78.513%
6     10.420%    88.933%
7     5.954%     94.887%
8     2.977%     97.864%
9     1.323%     99.187%
10    0.529%     99.716%
11    0.192%     99.908%
12    0.064%     99.973%
13    0.020%     99.992%
14    0.006%     99.998%
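A table like this one, and the four rainstorm probabilities, can be generated in R. The sketch below assumes the same model as the example, a Poisson distribution with μ = 4 storms per year. The names storm_table and p_* are illustrative.

```r
# Severe rainstorms per year modeled as Poisson with mean 4
mu <- 4
x  <- 0:14

# Probability and cumulative probability for each count, as percentages
storm_table <- data.frame(
  x          = x,
  prob       = round(100 * dpois(x, lambda = mu), 3),   # P(X = x)
  cumulative = round(100 * ppois(x, lambda = mu), 3)    # P(X <= x)
)
print(storm_table)

# The four probabilities from the example
p_none           <- dpois(0, lambda = mu)                     # no storm
p_five           <- dpois(5, lambda = mu)                     # exactly 5 storms
p_at_least_one   <- 1 - dpois(0, lambda = mu)                 # 1 or more storms
p_more_than_five <- ppois(5, lambda = mu, lower.tail = FALSE) # more than 5 storms
```

Setting lower.tail = FALSE makes ppois return P(X > 5) directly instead of P(X ≤ 5).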