Binomial, Multinomial & Poisson Stat 557 Fall 2012

Outline
• Coverage of C.I for π
• Multinomial
• Poisson
• Sampling Types
• Contingency Tables
$p_U = 0.6 + 1.96 \cdot \sqrt{\frac{0.6 \cdot 0.4}{115}} = 0.6 + 0.0895 = 0.6895$
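The arithmetic above can be checked with a minimal sketch. The function name `wald_interval` is hypothetical, and the inputs (estimate 0.6, n = 115, z = 1.96) are read off the worked numbers rather than stated as data in the slides.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Assumed inputs: estimate 0.6 from n = 115 observations.
lower, upper = wald_interval(0.6, 115)
print(round(lower, 4), round(upper, 4))
```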
Confidence Intervals
• Wald C.I.
$p \pm z_{\alpha/2} \sqrt{\frac{1}{n} p(1-p)}$
• Exact C.I.: the limits $p_L$ and $p_U$ solve
$\frac{\alpha}{2} = P(Y \ge y) = \sum_{j=y}^{n} \binom{n}{j} p^j (1-p)^{n-j}$ (for $p_L$)
$\frac{\alpha}{2} = P(Y \le y) = \sum_{j=0}^{y} \binom{n}{j} p^j (1-p)^{n-j}$ (for $p_U$)
which gives
$p_L = \left[ 1 + \frac{n-y+1}{y \, F_{2y,\,2(n-y+1)}(1-\alpha/2)} \right]^{-1}$
$p_U = \left[ 1 + \frac{n-y}{(y+1) \, F_{2(y+1),\,2(n-y)}(\alpha/2)} \right]^{-1}$
where $F_{a,b}(c)$ is the $1-c$ quantile of the F distribution with $a$ and $b$ degrees of freedom.
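The exact (Clopper-Pearson) limits can also be found numerically from the two tail equations, without F quantiles. The sketch below uses bisection and only the standard library. All function names are hypothetical, and setting the lower limit to 0 for y = 0 (and the upper limit to 1 for y = n) is the usual convention, assumed here.

```python
import math

def binom_pmf(j, n, p):
    """Binomial probability P(Y = j) for n trials with success probability p."""
    return math.comb(n, j) * p**j * (1.0 - p)**(n - j)

def upper_tail(y, n, p):
    """P(Y >= y); increasing in p."""
    return sum(binom_pmf(j, n, p) for j in range(y, n + 1))

def lower_tail(y, n, p):
    """P(Y <= y); decreasing in p."""
    return sum(binom_pmf(j, n, p) for j in range(0, y + 1))

def solve_increasing(f, target, tol=1e-10):
    """Bisection on [0, 1] for f(p) = target, assuming f is increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_interval(y, n, alpha=0.05):
    """Clopper-Pearson limits from the two tail equations."""
    # Lower limit: P(Y >= y | p_L) = alpha / 2 (taken as 0 when y = 0).
    p_lower = 0.0 if y == 0 else solve_increasing(
        lambda p: upper_tail(y, n, p), alpha / 2.0)
    # Upper limit: P(Y <= y | p_U) = alpha / 2 (taken as 1 when y = n).
    p_upper = 1.0 if y == n else solve_increasing(
        lambda p: 1.0 - lower_tail(y, n, p), 1.0 - alpha / 2.0)
    return p_lower, p_upper

print(exact_interval(5, 10))
```

Bisection works here because each tail probability is monotone in p, so each equation has a single root.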
Coverage
• Definition: for a fixed value of a parameter the actual coverage probability of an interval estimator is the probability that the interval contains the parameter:
$C(p, n) = \sum_{j=0}^{n} I(j, p) \cdot \binom{n}{j} p^j (1-p)^{n-j}$
• I(j,p) is 1, if the interval contains p for observation j and 0 otherwise
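The coverage sum C(p, n) can be evaluated directly for small n. The sketch below does this for the Wald interval. The helper names are hypothetical, and z = 1.96 is assumed for a nominal 95% interval.

```python
import math

def wald_interval(j, n, z=1.96):
    """Wald interval computed from j successes in n trials."""
    p_hat = j / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(p, n, interval):
    """C(p, n): sum of binomial probabilities of all j whose interval contains p."""
    total = 0.0
    for j in range(n + 1):
        lower, upper = interval(j, n)
        if lower <= p <= upper:  # indicator I(j, p)
            total += math.comb(n, j) * p**j * (1.0 - p)**(n - j)
    return total

# For n = 5 and p = 0.5 only j = 0 and j = 5 give intervals that miss p.
print(coverage(0.5, 5, wald_interval))
```

For j = 0 and j = n the Wald interval collapses to a single point, which is one reason its actual coverage falls below the nominal level.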
Wald & Exact 95% C.I.s
[Figure: Coverage of Confidence Intervals for Binomial; panels labeled 5, 10, 50; x-axis: p; y-axis: coverage; Method: Wald, Score, Exact, adj_Wald]
• Wald is excessively liberal
• Clopper-Pearson (Exact) is impractically conservative
Score
• Invert normal test using null, rather than estimated standard error:
$\frac{p - p_o}{\sqrt{\frac{1}{n} p_o (1 - p_o)}} = \pm z_{\alpha/2}$
• Solving for $p_o$ gives the interval
$\left( p + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\left[ p(1-p) + \frac{z_{\alpha/2}^2}{4n} \right] / n} \right) \left( 1 + \frac{z_{\alpha/2}^2}{n} \right)^{-1}$
with $p = \frac{Y}{n}$
Wilson, 1927
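A minimal sketch of the score interval. It also checks that the two limits are the values of the null proportion at which the test statistic equals plus or minus z. The function names are hypothetical, and z = 1.96 is assumed.

```python
import math

def score_interval(y, n, z=1.96):
    """Score (Wilson) interval for y successes in n trials."""
    p = y / n
    center = p + z * z / (2.0 * n)
    half_width = z * math.sqrt((p * (1.0 - p) + z * z / (4.0 * n)) / n)
    scale = 1.0 / (1.0 + z * z / n)
    return (center - half_width) * scale, (center + half_width) * scale

def score_statistic(p_hat, p_null, n):
    """Test statistic using the null standard error."""
    return (p_hat - p_null) / math.sqrt(p_null * (1.0 - p_null) / n)

lower, upper = score_interval(3, 10)
# At the limits the statistic should equal +z and -z respectively.
print(score_statistic(0.3, lower, 10), score_statistic(0.3, upper, 10))
```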
Adjusted Wald
• ‘add x failures and x successes’, good values for x = 1,2:
$\tilde{p} = \frac{y+2}{n+4}$
$\tilde{p} \pm z_{\alpha/2} \sqrt{\frac{1}{n} \tilde{p}(1 - \tilde{p})}$
Score, adjusted Wald
[Figure: Coverage of Confidence Intervals for Binomial; panels labeled 5, 10, 50; x-axis: p; y-axis: coverage; Method: Wald, Score, Exact, adj_Wald]
• Score C.I. has average coverage close to nominal value
• adjusted Wald is ‘simple fix’, makes liberal Wald slightly more conservative than Score
see also: Agresti & Coull, American Statistician, 1998
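The comparison in the figure can be approximated by averaging exact coverage over a grid of p values. The sketch below assumes n = 10, z = 1.96 and an evenly spaced grid from 0.01 to 0.99; all names are hypothetical. The adjusted Wald interval here adds two successes and two failures and keeps n in the standard error. Agresti & Coull use n + 4 there, which gives slightly narrower intervals.

```python
import math

Z = 1.96  # assumed normal quantile for nominal 95% intervals

def wald(y, n):
    p = y / n
    h = Z * math.sqrt(p * (1.0 - p) / n)
    return p - h, p + h

def score(y, n):
    p = y / n
    center = p + Z * Z / (2.0 * n)
    h = Z * math.sqrt((p * (1.0 - p) + Z * Z / (4.0 * n)) / n)
    scale = 1.0 / (1.0 + Z * Z / n)
    return (center - h) * scale, (center + h) * scale

def adjusted_wald(y, n):
    # Add two successes and two failures to the estimate (assumption: n kept in the SE).
    p = (y + 2.0) / (n + 4.0)
    h = Z * math.sqrt(p * (1.0 - p) / n)
    return p - h, p + h

def coverage(p, n, interval):
    """Exact coverage probability C(p, n) of an interval method."""
    total = 0.0
    for j in range(n + 1):
        lower, upper = interval(j, n)
        if lower <= p <= upper:
            total += math.comb(n, j) * p**j * (1.0 - p)**(n - j)
    return total

def mean_coverage(n, interval, grid_size=99):
    """Average coverage over p = 0.01, 0.02, ..., 0.99."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return sum(coverage(p, n, interval) for p in grid) / grid_size

wald_mean = mean_coverage(10, wald)
score_mean = mean_coverage(10, score)
adjusted_mean = mean_coverage(10, adjusted_wald)
print(round(wald_mean, 3), round(score_mean, 3), round(adjusted_mean, 3))
```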
Multinomial Distribution
• Series of n independent and identical trials, where the outcome for each trial falls into one of K mutually exclusive categories with
$\pi_j = P(Y_i = j), \quad 1 \le j \le K, \ 1 \le i \le n$
• then $\sum_{i=1}^{K} \pi_i = 1$
Multinomial
• Let $y_j$ be the number of times that we observe outcome j, then $0 \le y_j \le n$ (for all j) and sum of $y_j$ is n
• $P(Y_1 = y_1, Y_2 = y_2, ..., Y_K = y_K) = \frac{n!}{y_1! y_2! \cdot ... \cdot y_K!} \pi_1^{y_1} \pi_2^{y_2} \cdot ... \cdot \pi_K^{y_K}$
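A small sketch of the probability mass function above. The function name and the example counts and probabilities are made up for illustration.

```python
import math

def multinomial_pmf(counts, probs):
    """P(Y_1 = y_1, ..., Y_K = y_K) for counts y and category probabilities pi."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for y in counts:
        coefficient //= math.factorial(y)  # stays an exact integer at each step
    probability = float(coefficient)
    for y, pi in zip(counts, probs):
        probability *= pi**y
    return probability

# Illustrative values: n = 4 trials, K = 3 categories.
print(multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2)))
```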
Multinomial
• $P(Y_1 = y_1, Y_2 = y_2, ..., Y_K = y_K) = \frac{n!}{y_1! y_2! \cdot ... \cdot y_K!} \pi_1^{y_1} \pi_2^{y_2} \cdot ... \cdot \pi_K^{y_K}$
• Marginals have Binomial distributions, $Y_i \sim B_{n,\pi_i}$, and
$E[Y_i] = n\pi_i$
$Var[Y_i] = n\pi_i(1 - \pi_i)$
$Cov(Y_i, Y_j) = -n\pi_i\pi_j$ (for $i \ne j$)
• limiting distribution:
$\sqrt{n}(p - \pi) \longrightarrow N(0, diag(\pi) - \pi\pi^t)$
with $p = (n_1/n, n_2/n, ..., n_K/n)$
• multinomial likelihood: $L(\pi_1, \pi_2, ..., \pi_k) = \prod_{i=1}^{k} \pi_i^{n_i}$; with the additional constraint $\sum_{i=1}^{k} \pi_i = 1$ the likelihood function is actually
$L(\pi_1, \pi_2, ..., \pi_{k-1}) = \prod_{i=1}^{k-1} \pi_i^{n_i} \cdot \left( 1 - \sum_{i=1}^{k-1} \pi_i \right)^{n_k}$
• test of $H_o: \pi = \pi_o$ against $H_a: 0 \le \pi_i \le 1$ with $\pi_1 + \pi_2 + \pi_3 = 1$, where $\pi_o = (0.75, 0.1875, 0.0625)^t$:
$G^2 = 2 \sum_i m_i \log \frac{m_i}{m_{0,i}} = 0.74$
$X^2 = \sum_i \frac{(m_i - m_{0,i})^2}{m_{0,i}} = 0.69$
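The moment formulas for the multinomial (mean nπi, variance nπi(1 − πi), covariance −nπiπj) can be checked by brute-force enumeration for a small n. In this sketch n = 4 is an arbitrary choice, the probability vector (0.75, 0.1875, 0.0625) is the one appearing in the slides, and all names are hypothetical.

```python
import math
from itertools import product

def multinomial_pmf(counts, probs):
    """Multinomial probability of a vector of counts."""
    coefficient = math.factorial(sum(counts))
    for y in counts:
        coefficient //= math.factorial(y)
    probability = float(coefficient)
    for y, pi in zip(counts, probs):
        probability *= pi**y
    return probability

def outcomes(n, k):
    """All count vectors of length k that sum to n."""
    for head in product(range(n + 1), repeat=k - 1):
        if sum(head) <= n:
            yield head + (n - sum(head),)

n = 4
probs = (0.75, 0.1875, 0.0625)

# Exact moments of (Y_1, Y_2) by summing over every possible outcome.
mean_1 = sum(y[0] * multinomial_pmf(y, probs) for y in outcomes(n, 3))
mean_2 = sum(y[1] * multinomial_pmf(y, probs) for y in outcomes(n, 3))
var_1 = sum((y[0] - mean_1) ** 2 * multinomial_pmf(y, probs) for y in outcomes(n, 3))
cov_12 = sum((y[0] - mean_1) * (y[1] - mean_2) * multinomial_pmf(y, probs)
             for y in outcomes(n, 3))

print(mean_1, var_1, cov_12)
```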
Poisson Distribution
Simeon Poisson, 1781-1840
• X = # of insects trapped overnight in tent,
X = # of mosquito bites in an hour,
X = # stranded vehicles in 10 km of I-35
⇒ no upper limit for n
• X in {0,1,2,3,4,...} (no upper limit)
• Poisson probability mass function
$P(X = x) = p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$,
where λ is the rate parameter (remember, that the rate depends on the unit used).
$E[X] = \lambda$
$Var[X] = \lambda$
skewness $= 1/\sqrt{\lambda}$
kurtosis (excess) $= 1/\lambda$
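The mean and variance of the Poisson distribution, and the standard values for its skewness (1/√λ) and excess kurtosis (1/λ), can be checked numerically from the mass function. The sketch assumes λ = 3 and truncates the infinite sum at 100 terms, which is ample for this rate. All names are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 3.0
support = range(0, 100)  # truncation of {0, 1, 2, ...}; tail mass is negligible here

mean = sum(x * poisson_pmf(x, lam) for x in support)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
third = sum((x - mean) ** 3 * poisson_pmf(x, lam) for x in support)
fourth = sum((x - mean) ** 4 * poisson_pmf(x, lam) for x in support)

skewness = third / variance**1.5
excess_kurtosis = fourth / variance**2 - 3.0

print(mean, variance, skewness, excess_kurtosis)
```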
Poisson distribution
[Figure: Poisson probability mass functions; panels: lambda = 1, lambda = 3, lambda = 5, lambda = 10]
• approximates Binomial for large n, small p
• Normal approximation holds for large values of lambda
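The Binomial approximation can be illustrated by comparing the two mass functions point by point. The values n = 1000 and p = 0.003 are made up for this sketch, and the names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events at rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.003   # large n, small p (illustrative values)
lam = n * p          # matching Poisson rate

# Largest pointwise difference over the range that carries almost all the mass.
max_diff = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(0, 31))
print(max_diff)
```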
Properties of Poisson
Relationship to Multinomial
• Let Y1, ...,YK be ind. Poisson with
parameters µ1, ..., µK
• Y1 + ... + YK is Poisson with parameter ∑µi
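The link to the multinomial is a standard result: conditional on the total Y1 + ... + YK = n, independent Poisson counts are multinomial with probabilities µi / ∑µj. The sketch below checks this identity for one outcome. The rates and counts are made up, and the function names are hypothetical.

```python
import math

def poisson_pmf(x, mu):
    """Poisson probability of x events at rate mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

def multinomial_pmf(counts, probs):
    """Multinomial probability of a vector of counts."""
    coefficient = math.factorial(sum(counts))
    for y in counts:
        coefficient //= math.factorial(y)
    probability = float(coefficient)
    for y, pi in zip(counts, probs):
        probability *= pi**y
    return probability

mus = (2.0, 1.0, 0.5)   # illustrative Poisson rates
y = (3, 1, 0)           # illustrative observed counts
n = sum(y)

# Joint probability of independent Poisson counts, divided by P(sum = n).
joint = math.prod(poisson_pmf(yi, mu) for yi, mu in zip(y, mus))
conditional = joint / poisson_pmf(n, sum(mus))

# Multinomial probability with pi_i = mu_i / sum(mu).
probs = tuple(mu / sum(mus) for mu in mus)
multinomial = multinomial_pmf(y, probs)

print(conditional, multinomial)
```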
Contingency Tables
• X and Y are categorical variables with I and J categories, X ∈ {x1, ..., xI} and Y ∈ {y1, ..., yJ}; then the pair (X, Y) is a categorical variable with IJ outcomes.
X\Y   y1    y2    ...   yJ
x1    n11   n12   ...   n1J   n1.
x2    n21   n22   ...   n2J   n2.
...   ...   ...   ...   ...   ...
xI    nI1   nI2   ...   nIJ   nI.
      n.1   n.2   ...   n.J   n
Contingency Tables
• X and Y are categorical variables with I and J categories; the table of counts is called a two dimensional contingency table or cross-classification table.
• If πij is the probability of pair (xi, yj), then the table of probabilities is:
X\Y   y1    y2    ...   yJ
x1    π11   π12   ...   π1J   π1.
x2    π21   π22   ...   π2J   π2.
...   ...   ...   ...   ...   ...
xI    πI1   πI2   ...   πIJ   πI.
      π.1   π.2   ...   π.J   1
Poisson vs Multinomial
Sampling
• Poisson Sampling:
• each cell nij is assumed to be Poisson distributed
• Overall sum is random
• Multinomial
• If overall sum is fixed, conditional probabilities
become multinomial
Product Multinomial
Sampling
• one of the margins in the contingency table
is fixed
• e.g. rare disease, ‘pairing’ of combinations
• given the fixed margins, the other direction
still has a multinomial distribution, resulting
in a product of multinomials.
• set-up of traditional case-control study
Margins are fixed
• Both margins are fixed
• less frequent in studies, more frequent in
inferential methods
• Hypergeometric distribution
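With both margins fixed, a 2 x 2 table is determined by a single cell, and under independence that cell follows a hypergeometric distribution. A minimal sketch, with made-up margins and hypothetical names:

```python
import math

def hypergeometric_pmf(k, row1_total, row2_total, col1_total):
    """P(n11 = k) for a 2 x 2 table with both margins fixed."""
    n = row1_total + row2_total
    return (math.comb(row1_total, k) * math.comb(row2_total, col1_total - k)
            / math.comb(n, col1_total))

# Illustrative margins: row totals 5 and 7, first column total 4.
row1_total, row2_total, col1_total = 5, 7, 4

# n11 can only take values that keep every cell non-negative.
k_min = max(0, col1_total - row2_total)
k_max = min(row1_total, col1_total)

distribution = {k: hypergeometric_pmf(k, row1_total, row2_total, col1_total)
                for k in range(k_min, k_max + 1)}
print(distribution)
```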
Example: Cholesterol/Heart Disease
• 1329 patients of same age/sex
                      Coronary Disease
Cholesterol (mg/l)    present      absent
≤ 220                 y11 = 20     y12 = 553
> 220                 y21 = 72     y22 = 684
Cholesterol/Heart Disease
• Y = (y11, y12, y21, y22) ~ Mult (1329, π)
• Is incidence of heart disease
independent of cholesterol levels?
i.e. is incidence rate of heart disease
the same for both levels of
cholesterol?
Cholesterol/Heart Disease
Ho :
P ( heart disease | Cholesterol ≤ 220) =
P ( heart disease | Cholesterol > 220)
equivalent to
$\frac{\pi_{11}}{\pi_{11} + \pi_{12}} = \frac{\pi_{21}}{\pi_{21} + \pi_{22}}$
equivalent to
$\pi_{ij} = \pi_{i+} \cdot \pi_{+j}$
Critical values: $\chi^2_{1,0.05} = 3.84$, $\chi^2_{1,0.01} = 6.634897$
Cholesterol/Heart Disease
• Expected Values under independence
                      Coronary Disease
Cholesterol (mg/l)    present    absent     Total
≤ 220                 39.67      533.33     573
> 220                 52.33      703.67     756
Total                 92         1237       1329
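Expected counts under independence are (row total × column total) / n. A short sketch using the observed counts from the example (20, 553, 72, 684); the variable names are hypothetical.

```python
# Observed counts: rows are cholesterol <= 220 and > 220,
# columns are coronary disease present and absent.
observed = [[20, 553],
            [72, 684]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, E[n_ij] = (row total i) * (column total j) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```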
Cholesterol/Heart Disease
• loglikelihood ratio test
$G^2 = 19.8$
• Pearson test
$X^2 = 18.4$
• under Ho both $G^2$ and $X^2$ are approximately chi-square distributed with df = 3 - 2 = 1
• both statistics exceed the 1% critical value 6.63, so independence seems to be violated
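Both statistics can be reproduced from the observed counts of the example (20, 553, 72, 684). The sketch uses the usual definitions G² = 2 ∑ observed · log(observed / expected) and X² = ∑ (observed − expected)² / expected; the variable names are hypothetical.

```python
import math

observed = [[20, 553],
            [72, 684]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Likelihood ratio statistic and Pearson statistic, summed over all four cells.
g2 = 2.0 * sum(o * math.log(o / e)
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
x2 = sum((o - e) ** 2 / e
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row))

critical_5_percent = 3.84  # chi-square critical value with 1 df, from the slides
print(round(g2, 1), round(x2, 1), g2 > critical_5_percent, x2 > critical_5_percent)
```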
Visualizing Contingency
Tables
Mosaicplots
• Area plots (i.e. area represents
#combinations)
• Built hierarchically, i.e. order of variables
matters
• in R:
mosaicplot() (base package)
imosaic() (iplots package)
productplots package
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Odds ratio
      X=0   X=1
Y=0   a     b
Y=1   c     d
• Measure of association between X and Y
• odds ratio = (ad)/(bc)
• odds ratio = 1 is independence
• log odds ratio is symmetric around 0
• log odds ratio is approx Normal with variance
1/a + 1/b + 1/c + 1/d
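A sketch of the odds ratio with a normal-theory interval built on the log scale, using the cholesterol counts from the earlier example as a = 20, b = 553, c = 72, d = 684. The function name is hypothetical, z = 1.96 is assumed, and all four cells must be positive for the logarithm and the variance to exist.

```python
import math

def odds_ratio_summary(a, b, c, d, z=1.96):
    """Odds ratio, its log, the standard error of the log, and an interval."""
    theta = (a * d) / (b * c)
    log_theta = math.log(theta)
    std_error = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    interval = (math.exp(log_theta - z * std_error),
                math.exp(log_theta + z * std_error))
    return theta, log_theta, std_error, interval

theta, log_theta, std_error, interval = odds_ratio_summary(20, 553, 72, 684)
print(round(theta, 4), round(log_theta, 4), round(std_error, 4), interval)

# Swapping the rows inverts the odds ratio and flips the sign of its log.
theta_swapped, log_swapped, _, _ = odds_ratio_summary(72, 684, 20, 553)
print(round(theta_swapped, 4), round(log_swapped, 4))
```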
Visualizing Associations
[Figure: Reading the Odds; probability scale (0 to 1.0) mapped to the log odds scale (-inf to +inf)]
Odds ratios in 2 x 2 x K tables
X and Y are binary variables,
Z is categorical with K categories
Death Penalties in Florida:
X death penalty (yes/no)
Y defendant’s race (black/white)
Z victim’s race (black/white)
Marginal Table of X/Y
         yes    no
white    53     430    483
black    15     176    191
         68     606    674
Marginal odds ratio: 53*176/(430*15) = 1.45 (±0.59)
slight indication in favor of black defendants
Odds ratios in 2 x 2 x K tables
Conditional Tables of X/Y
Z = white
         yes    no
white    53     414    467
black    11     37     48
         64     451    515
Z = black
         yes    no
white    0      16     16
black    4      139    143
         4      155    159
Conditional odds ratios: 0.43 (Z = white), 0 (Z = black)
very strong indication against black defendants
Simpson’s paradox
• Simpson’s paradox:
marginal association between X and Y is
opposite to
conditional associations between X and Y
for each level of Z
• due to:
very strong association between X and Z
or Y and Z
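The reversal can be reproduced from the death penalty counts: the marginal table is the cell-wise sum of the two conditional tables, and its odds ratio points the other way. The variable and function names in this sketch are hypothetical.

```python
def odds_ratio(table):
    """Odds ratio (ad)/(bc) of a 2 x 2 table given as ((a, b), (c, d))."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Rows: defendant white / black; columns: death penalty yes / no.
white_victim = ((53, 414), (11, 37))
black_victim = ((0, 16), (4, 139))

# Marginal table: add the two conditional tables cell by cell.
marginal = tuple(
    tuple(w + b for w, b in zip(w_row, b_row))
    for w_row, b_row in zip(white_victim, black_victim)
)

print(marginal)
print(round(odds_ratio(marginal), 2))
print(round(odds_ratio(white_victim), 2), round(odds_ratio(black_victim), 2))
```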
Physicians’ study
• Myocardial Infarction among 22071
physicians in 5 year period
           Fatal    Non-fatal    No
Placebo    18       171          10,845
Aspirin    5        99           10,933