Statistical Methods for Counts and Proportions Stat 557 Fall 2012

Heike Hofmann
Can we do something
about the time?
• Would T/TR 9:30 - 10:50 work for
everybody?
Syllabus & Course Website
• http://www.public.iastate.edu/~hofmann/stat557/syllabus.html
• Blackboard site is for all sensitive material: copyrighted readings, grades
Plan for today
• Syllabus
• What is this course about?
• Review of categorical data:
• Binomial / Poisson
• Exact confidence intervals
Textbook
• “Categorical Data Analysis” by Alan
Agresti, the ‘green book’
website:
http://www.stat.ufl.edu/~aa/cda/cda.html
• Agresti’s “Introduction to Categorical
Data Analysis” - less technical, fewer
details, but maybe a good starting point
Assessment
• Homework assignments every other week
• Two midterm exams
• final (team) project & presentation
• 35% homework (5 assignments),
40% midterms,
25% final project (20% write-up, 5% presentation)
Final Project
• Slightly bigger project with multiple due dates
• Will be open ended, and will involve a
substantial write up (10+ pages)
• Work in a team (3-4 members)
• Talk to each other! Find a group!
Disability and sickness
• Make sure to let me know (in advance) if you
have to miss an exam or a deadline
• Keep on top of the material or you will get
swamped by the end of the semester
Lectures
• Electronic copy of the slides will be available
on the website
• But you’ll need to take your own notes!
• If you really want complete notes, organize a
roster with others in the class
• If you’re bored, complain! ... same, if I’m going
too fast ...
... what I expect ...
• Stat 500 or 401, 543 or 447
(some methods, some theory)
• working knowledge of statistical software
(preferably R)
Categorical Data
Analysis
Topics
• http://www.public.iastate.edu/~hofmann/stat557/
• Review: Distributions/Inference for Categorical Variables
• Generalized Linear Models
• Logistic Binomial and Multinomial Regression; Loglinear
Models
• Matched Pairs (with longitudinal structure)
• Correspondence Analysis
• Classification Trees
• ... any suggestions?
Categorical Data
Analysis
• Response variable is categorical; explanatory variables can be any type
• Outcome is one of a set of possibilities; outcome is a count or proportion
Types of Categorical
Variables
• Nominal
• Ordinal
• Interval
• (Ratio)
Nominal Variables
• Categories have no natural ordering
• Categories are not separated by meaningful
distances
• marital status, eye color, survival status
(alive, dead, lost to follow-up)
Ordinal Variables
• Categories have natural ordering
• Categories are not separated by meaningful distances
• political opinion (liberal, neutral, conservative),
approval rating (strongly disapprove, disapprove, approve, strongly approve)
• Likert scales are ordinal
Interval Variables
• Categories have natural ordering
• Meaningful distances between any two
levels of the scale
• continuous interval variables: height,
weight, age, survival time, ...
discrete interval variables: years of
education, #occurrences, ...
Hierarchy of Types
• Type depends on measurement,
e.g. “education” could fit any type
• Applicability of methods depends on type:
interval > ordinal > nominal
Distinctions
• Continuous vs discrete:
lots of different values vs few values
• Quantitative vs qualitative:
interval ---- ordinal --- nominal
Binary Response
• Example: Opinion Survey
estimate proportion of population in favor
of certain policy/politician
• sample n individuals
• use sample proportion
p = Y/n as estimate of
true probability π of
support for candidate
            Count
In Favor    Y
Opposed     n-Y
Total       n
Binary Response
• Some Issues:
• Accuracy of p = Y/n ?
• How large should n be?
• How should the data be collected?
Binomial Distribution
• Define Y as number of successes in n
independent identical Binary trials
(Y is sum of independent Bernoulli random
variables)
• Distribution
• Moments, Skewness, Kurtosis
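The formulas behind the two bullets above did not survive on this slide. For reference, these are the standard results for Y ~ Binomial(n, π), written in the notation used in the rest of the deck (π for the success probability); they are textbook facts, not a reconstruction of the original slide:

```latex
\begin{align*}
P(Y = y) &= \binom{n}{y}\,\pi^{y}(1-\pi)^{n-y}, \qquad y = 0, 1, \dots, n \\
E(Y) &= n\pi, \qquad \mathrm{Var}(Y) = n\pi(1-\pi) \\
\text{skewness} &= \frac{1-2\pi}{\sqrt{n\pi(1-\pi)}}, \qquad
\text{excess kurtosis} = \frac{1-6\pi(1-\pi)}{n\pi(1-\pi)}
\end{align*}
```

Skewness and excess kurtosis both go to 0 as n grows, which is what makes the normal approximation reasonable for large n.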
Inference for π
• MLE is p = Y/n
• (p - π)/sqrt[π(1-π)/n]
is approx N(0,1) for large n
• (p - π)/sqrt[p(1-p)/n]
is approx N(0,1) for large n
• n “large” for nπ ≥ 5, n(1-π) ≥ 5
Confidence Intervals
• Binomial distribution is
  • bounded
  • skewed
  • discrete
• large sample normal approximation is used
• leads to problems with coverage
Normal Approximation
to Binomial
[Figure: binomial probabilities and normal density plotted against x = 0, ..., 10, in a grid of panels for p = 0.01, 0.05, 0.1 and n = 10, 25, 50; dots are binomial probability, line is normal density]
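The comparison in the figure can be reproduced numerically. The following is a minimal sketch using only the Python standard library; the helper names (`binom_pmf`, `normal_pdf`, `max_abs_gap`) are made up for this illustration, and the grid of n and p values is taken from the figure labels.

```python
import math


def binom_pmf(y, n, p):
    """P(Y = y) for Y ~ Binomial(n, p)."""
    return math.comb(n, y) * p**y * (1 - p) ** (n - y)


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def max_abs_gap(n, p):
    """Largest |binomial probability - normal density| over y = 0, ..., n."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return max(
        abs(binom_pmf(y, n, p) - normal_pdf(y, mean, sd)) for y in range(n + 1)
    )


# Same grid as the figure panels (an assumption read off the axis labels).
for n in (10, 25, 50):
    for p in (0.01, 0.05, 0.1):
        print(f"n={n:2d}  p={p:4.2f}  max gap={max_abs_gap(n, p):.3f}")
```

The gap is largest for small n and p close to 0, where the binomial is strongly skewed and piles up at y = 0.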
Example
• Sex of adult turtles caught for consumption
in the river Mato:
male    female
69      46
• assumption: ratio of sexes is 1.0 in the wild
Inference for π
• MLE is $p = Y/n$
• we know, for large n:
  $\dfrac{p - \pi}{\sqrt{\frac{1}{n}\,\pi(1-\pi)}} \overset{\cdot}{\sim} N(0,1)$ and $\dfrac{p - \pi}{\sqrt{\frac{1}{n}\,p(1-p)}} \overset{\cdot}{\sim} N(0,1)$
• n “large” for $n\pi \ge 5$, $n(1-\pi) \ge 5$
• so, with probability approximately $1-\alpha$:
  $\dfrac{|p - \pi|}{\sqrt{\frac{1}{n}\,p(1-p)}} < Z_{\alpha/2}$
• upper and lower limit is then:
  $p_U = p + Z_{\alpha/2}\sqrt{\frac{1}{n}\,p(1-p)}$
  $p_L = p - Z_{\alpha/2}\sqrt{\frac{1}{n}\,p(1-p)}$
Wald CI
• $p \pm Z_{\alpha/2}\sqrt{\frac{1}{n}\,p(1-p)}$
Mato Turtles
α = 0.05, α/2 = 0.025, Zα/2 = 1.96
n = 115, p = 69/115 = 0.6
$p_L = 0.6 - 1.96 \cdot \sqrt{\frac{0.6 \cdot 0.4}{115}} = 0.6 - 0.0895 = 0.5105$
$p_U = 0.6 + 1.96 \cdot \sqrt{\frac{0.6 \cdot 0.4}{115}} = 0.6 + 0.0895 = 0.6895$
An approximate 95% confidence interval for the proportion of male turtles among turtles being caught for consumption is (0.51, 0.69)
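The Wald computation for the Mato turtles can be checked with a few lines of code. This is a minimal sketch; the function name `wald_interval` is hypothetical, and the critical value 1.96 is hard-coded as on the slide rather than computed from a normal quantile function.

```python
import math


def wald_interval(y, n, z=1.96):
    """Wald interval p +/- z * sqrt(p(1-p)/n) for y successes in n trials."""
    p = y / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Mato turtles: 69 males among 115 adult turtles.
lower, upper = wald_interval(69, 115)
print(f"({lower:.4f}, {upper:.4f})")
```

The printed limits should agree with the hand calculation above, 0.5105 and 0.6895.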
Issues
• $p \pm Z_{\alpha/2}\sqrt{\frac{1}{n}\,p(1-p)}$ isn’t necessarily within [0,1],
e.g. p=0.1, n=10
• interval has length 0 at either end (y = 0 or y = n)
[Figure: Confidence Intervals for Binomial, n=10, Normal Approximation; intervals plotted against x = 0, ..., 10 on a vertical scale from 0.0 to 1.0]
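Both issues are easy to see numerically for n = 10. The sketch below uses a hypothetical helper `wald_interval` with the slide's value 1.96 hard-coded; it is only meant to illustrate the two bullets, not to reproduce the figure.

```python
import math


def wald_interval(y, n, z=1.96):
    """Wald interval p +/- z * sqrt(p(1-p)/n) for y successes in n trials."""
    p = y / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


n = 10
for y in (0, 1, 9, 10):
    lower, upper = wald_interval(y, n)
    notes = []
    if lower < 0 or upper > 1:
        notes.append("leaves [0, 1]")
    if upper == lower:
        notes.append("length 0")
    print(f"y={y:2d}  ({lower:7.4f}, {upper:7.4f})  {', '.join(notes)}")
```

For y = 1 (p = 0.1) the lower limit is negative, and for y = 0 and y = 10 the estimated standard error is 0, so the interval collapses to a single point.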
Exact confidence intervals
Tail method:
• lower limit is that value of p, for which
  $\dfrac{\alpha}{2} = P(Y \ge y) = \sum_{j=y}^{n} \binom{n}{j}\, p^{j} (1-p)^{n-j}$
• upper limit is that value of p, for which
  $\dfrac{\alpha}{2} = P(Y \le y) = \sum_{j=0}^{y} \binom{n}{j}\, p^{j} (1-p)^{n-j}$
Exact confidence intervals
• closed form via quantiles of the F distribution:
  $p_L = \left[\,1 + \dfrac{n-y+1}{y\,F_{2y,\,2(n-y+1)}(1-\alpha/2)}\right]^{-1}$
  $p_U = \left[\,1 + \dfrac{n-y}{(y+1)\,F_{2(y+1),\,2(n-y)}(\alpha/2)}\right]^{-1}$
  where $F_{a,b}(c)$ denotes the $1-c$ quantile of the F distribution with a and b degrees of freedom
• see e.g. Johnson & Kotz “Discrete Distributions”
• exact interval for Mato turtles: (0.504, 0.690)
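The tail method can also be carried out numerically, without F quantiles. The sketch below solves the two tail equations by bisection, taking the lower limit from P(Y ≥ y) = α/2 and the upper limit from P(Y ≤ y) = α/2 (the usual Clopper-Pearson construction). All function names are hypothetical, and bisection is just one convenient root finder, not necessarily what was used to produce the numbers on the slides.

```python
import math


def binom_cdf(y, n, p):
    """P(Y <= y) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(y + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
    return (lo + hi) / 2


def exact_interval(y, n, alpha=0.05):
    """Tail-method (Clopper-Pearson) interval for y successes in n trials."""
    # Lower limit: P(Y >= y | p) = alpha/2; defined as 0 when y = 0.
    if y == 0:
        lower = 0.0
    else:
        lower = bisect(lambda p: 1 - binom_cdf(y - 1, n, p) - alpha / 2, 0.0, 1.0)
    # Upper limit: P(Y <= y | p) = alpha/2; defined as 1 when y = n.
    if y == n:
        upper = 1.0
    else:
        upper = bisect(lambda p: binom_cdf(y, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


# Mato turtles: 69 males among 115 adult turtles.
lower, upper = exact_interval(69, 115)
print(f"({lower:.3f}, {upper:.3f})")
```

For the turtle data the result can be compared with the exact interval quoted on the slide, (0.504, 0.690).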
Issues
• exact interval is conservative,
i.e. due to discreteness actual coverage of
the CI might (and will be) larger than
nominal coverage
Coverage
• For a fixed value of a parameter the actual coverage probability of an interval estimator is the probability that the interval contains the parameter:
  $C(p, n) = \sum_{j=0}^{n} I(j, p) \cdot \binom{n}{j}\, p^{j} (1-p)^{n-j}$
• I(j,p) is 1, if the interval contains p for observation j, and 0 otherwise
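The coverage formula translates directly into code: loop over all possible observations j, check whether the interval computed from j contains p, and add up the binomial probabilities of those j. The sketch below does this for the Wald interval; `wald_interval` and `coverage` are hypothetical names, and 1.96 is hard-coded for a nominal 95% level.

```python
import math


def wald_interval(y, n, z=1.96):
    """Wald interval p +/- z * sqrt(p(1-p)/n) for y successes in n trials."""
    p_hat = y / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p, n, interval=wald_interval):
    """C(p, n): sum of binomial probabilities of all j whose interval contains p."""
    total = 0.0
    for j in range(n + 1):
        lower, upper = interval(j, n)
        if lower <= p <= upper:  # this is the indicator I(j, p)
            total += math.comb(n, j) * p**j * (1 - p) ** (n - j)
    return total


for p in (0.01, 0.1, 0.5):
    print(f"n=10  p={p:4.2f}  coverage={coverage(p, 10):.3f}")
```

Any other interval method can be plugged in through the `interval` argument, as long as it takes (y, n) and returns a (lower, upper) pair.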
Coverage
[Figure: Coverage of Confidence Intervals for Binomial; panels for n = 5, 10, 50; coverage (0.0 to 1.0) plotted against p (0.0 to 1.0) for each Method: Wald, Score, Exact, adj.Wald]
• see also Agresti & Coull (1998) for adjusted
Wald interval
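The slides only cite the adjusted Wald interval, so the following is a sketch of the usual Agresti-Coull form rather than anything taken from this deck: add z²/2 successes and z²/2 failures (roughly 2 of each at the 95% level), then apply the Wald formula to the adjusted counts. The function name `adjusted_wald_interval` is hypothetical.

```python
import math


def adjusted_wald_interval(y, n, z=1.96):
    """Agresti-Coull interval: Wald formula applied after adding z^2/2
    successes and z^2/2 failures to the observed counts."""
    n_adj = n + z**2
    p_adj = (y + z**2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width


# Mato turtles, and the boundary case y = 0 where the plain Wald interval
# has length 0.
for y, n in ((69, 115), (0, 10)):
    lower, upper = adjusted_wald_interval(y, n)
    print(f"y={y:3d}  n={n:3d}  ({lower:.4f}, {upper:.4f})")
```

The adjustment pulls the centre of the interval towards 0.5, so the interval keeps a positive length at y = 0 and y = n; the limits can still fall slightly outside [0, 1] and are often truncated in practice.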