STAT 557
STATISTICS 557
Course Information
STATISTICAL METHODS FOR COUNTS AND PROPORTIONS
Instructor: Kenneth J. Koehler
120 Snedecor Hall
Telephone: 515-294-4181
Fax: 515-294-4040
E-mail: kkoehler@iastate.edu
Office Hours: ?????
TA:
Kyoji Furukawa
315B Snedecor Hall
Telephone: 515-294-2227
Fax: 515-294-4040
E-mail: kyoji@iastate.edu
Office Hours: Tuesday 2pm
Kenneth Koehler
Department of Statistics
Iowa State University
Textbook: Lloyd, Chris J., Statistical Analysis of Categorical Data, Wiley, 1999.
Course Notes: As they become available, copies of transparencies will be made available as PDF files on the course web page:
http://www.public.iastate.edu/~kkoehler/stat557/stat557.html
Computation: SAS and S-PLUS code for examples covered in the lecture will be available from the course web page. You should have a calculator that you can bring to exams.
Grades:
Nine assignments (15%)
Two midterm exams (25% each)
Final Exam (35%)
Topic Outline
1. Probability distributions for count data, review of inferential procedures
2. Analysis of two-way contingency tables
3. Large sample theory
4. Exact conditional inference
5. Generalized linear models
6. Logistic regression for binary responses
7. Log-linear models and Poisson regression
8. Correspondence analysis
9. Classification trees
10. Models with fixed and random effects
Readings
Lloyd: Chapters 1 and 2
Lloyd: Chapter 3
Lloyd: Section 1.3
Lloyd: Sections 7.1-7.6
Lloyd: Section 4.6
Lloyd: Chapter 4, Section 7.7
Lloyd: Chapter 6
References
I. Categorical Data Analysis
1. Agresti, A. (2002) Categorical Data Analysis, 2nd edition, Wiley, New York.
2. Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis, MIT Press, Cambridge, MA.
3. Cox, D. R. and Snell, E. J. (1989) Analysis of Binary Data, 2nd edition, Chapman & Hall, London.
4. Goodman, L. A. (1978) Analyzing Qualitative/Categorical Data, Abt Books.
II. Measures of Association
1. Goodman, L. A. and Kruskal, W. H. (1979) Measures of Association for Cross Classifications, Springer, New York.
III. Discrete Distributions
1. Johnson, N. L., Kotz, S. and Kemp, A. W. (1995) Distributions in Statistics: Discrete Distributions, 2nd edition, Wiley, New York.
IV. Statistical Theory
1. Christensen, R. (1997) Log-Linear Models and Logistic Regression, 2nd edition, Springer, New York.
2. Haberman, S. J. (1974) The Analysis of Frequency Data, University of Chicago Press.
3. Greenwood, P. E. and Nikulin, M. S. (1996) A Guide to Chi-Squared Testing, Wiley, New York.
V. Logistic Regression
1. Collett, D. (1991) Modelling Binary Data, Chapman & Hall, London.
2. Hosmer, D. W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd edition, Wiley, New York.
3. Kleinbaum, D. (1994) Logistic Regression: A Self-Learning Text, Springer-Verlag, New York.
VI. Generalized Linear Models
1. McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models, 2nd edition, Chapman & Hall, London.
2. Dobson, A. J. (2002) An Introduction to Generalized Linear Models, 2nd edition, Chapman & Hall, London.
3. Firth, D. (1991) Generalized Linear Models (Chapter 3) in Statistical Theory and Modelling (D. V. Hinkley, N. Reid, E. J. Snell, eds.), Chapman & Hall, London.
VII. Model Free Curve Fitting
1. Hastie, T. J. and Tibshirani, R. J. (1986) Generalized Additive Models, Statistical Science, 1, pp. 297-318.
2. Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models, Chapman & Hall, London.
VIII. Classification Trees
1. Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees, Wadsworth & Brooks/Cole, Monterey, California.
IX. Visualization of Categorical Data
1. Greenacre, M. J. (1993) Correspondence Analysis in Practice, Academic Press, London.
2. Blasius, J. and Greenacre, M. J., eds. (1998) Visualization of Categorical Data, Academic Press, London.
3. Hofmann, H. (2000) Graphical Tools for the Exploration of Multivariate Categorical Data, Herstellung: Books on Demand GmbH, Augsburg.
Information on S-PLUS:
Manuals: (Available in Snedecor 115)
1. S-PLUS 6 for Windows User's Guide, Insightful Corporation, Seattle, WA.
2. S-PLUS 6 for Windows Programmer's Guide, Insightful Corporation, Seattle, WA.
3. S-PLUS 6 for Windows Guide to Statistics, Volume 1, Insightful Corporation, Seattle, WA.
4. S-PLUS 6 for Windows Guide to Statistics, Volume 2, Insightful Corporation, Seattle, WA.
Books
1. Krause, A. and Olson, M. (2000) The Basics of S and S-PLUS, 2nd ed., Springer-Verlag, New York.
2. Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS, 3rd edition, Springer, New York.
Information on SAS:
Manuals: (Available in Snedecor 115)
1. SAS/STAT User's Guide, Version 8, Volume 1, Cary, NC: SAS Institute Inc., 1999.
2. SAS/STAT User's Guide, Version 8, Volume 2, Cary, NC: SAS Institute Inc., 1999.
3. SAS/STAT User's Guide, Version 8, Volume 3, Cary, NC: SAS Institute Inc., 1999.
Books
1. SAS Institute (1995) Logistic Regression: Examples Using the SAS System, SAS Institute, Cary, NC.
2. Allison, P. D. (2000) Logistic Regression Using the SAS System: Theory and Application, Wiley, New York.
3. Stokes, M. E., Davis, C. E. and Koch, G. G. (2001) Categorical Data Analysis Using the SAS System, Wiley, New York.
Websites:
1. Statlib: http://lib.stat.cmu.edu (click on S Archive to see a list of contributed S software)
2. Insightful Corporation: http://www.insightful.com
3. SAS: http://www.sas.com
4. Venables & Ripley libraries: http://www.stats.ox.ac.uk/pub/MASS3/sites.html
5. R libraries: http://www.ci.tuwien.ac.at/R/mirrors.html
What are categorical data?
Measured responses can be classified into a set of categories
Observations are counts or proportions
Types of variables
Nominal
Ordinal
Interval
  - Discrete (or categorical)
  - Continuous
Nominal Variables
Categories have no natural ordering
Categories are not separated by meaningful distances
Eye color:
  - brown
  - blue
  - green
  - other
Marital Status:
  - never married
  - married
  - divorced
  - widowed
Survival status at the end of a specific time interval:
  - alive
  - dead
  - lost to follow up
Ordinal Variables
Categories can be ordered in some meaningful way
Categories are not separated by meaningful distances
Attitude toward some policy:
  - strongly disapprove
  - disapprove
  - approve
  - strongly approve
Interval Variables
Meaningful numerical distance between any two levels of the scale
Continuous interval variables:
  - height
  - weight
  - blood pressure
  - survival time
  - chemical concentration
Discrete interval variables:
Number of eggs in a nest
{0, 1, 2, ..., 6}
Years of education
{0, 1, 2, 3, ..., 20}
What type of response is this?
  - less than 12 years
  - high school graduate
  - college graduate
Binary Response Variables
Example: Opinion Survey
Estimate the proportion of the population that favors a certain policy
Sample n individuals from the population
          Count
Favor     x
Oppose    n - x
Total     n
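For the opinion survey, the estimate and a rough measure of its accuracy can be sketched in a few lines. This is a minimal illustration, assuming simple random sampling and the usual large-sample (Wald) approximation; the function names and the counts 120 out of 200 are hypothetical and do not come from the course notes.

```python
import math

def proportion_summary(x, n, z=1.96):
    """Return the sample proportion, its estimated standard error,
    and an approximate large-sample (Wald) confidence interval."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)

def sample_size_for_margin(margin, p_guess=0.5, z=1.96):
    """Smallest n for which z * sqrt(p(1-p)/n) <= margin,
    using a guessed proportion (0.5 is the most conservative guess)."""
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

# Hypothetical survey: 120 of 200 sampled individuals favor the policy.
p, se, (lower, upper) = proportion_summary(x=120, n=200)
print(f"p = {p:.3f}, SE = {se:.4f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
print("n needed for a 0.03 margin of error:", sample_size_for_margin(0.03))
```

The Wald interval is only one of several possible intervals and can behave poorly when n is small or the proportion is near 0 or 1.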
Use the sample proportion $p = x/n$ to estimate the population proportion $\pi$.
Some issues:
  - What is the accuracy of $p = x/n$?
  - How large should n be?
  - How should the data be collected?
Example: Comparing tumor incidence rates in mice
            Exposure to smoke   Control
Tumor       24                  19
No tumor    6                   11
Total       30                  30
  - Randomized experiment
  - Hypothesis test
  - Relative risk
  - Sample sizes
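The relative risk and a hypothesis test for the mouse tumor data can be sketched as follows. This is an illustration only: it reads the table as 24 tumors among 30 exposed mice and 19 tumors among 30 control mice, and it uses a pooled large-sample z test, which is one reasonable choice rather than necessarily the method used in lecture. The function names are hypothetical.

```python
import math

def relative_risk(x1, n1, x2, n2):
    """Ratio of the two sample proportions (group 1 relative to group 2)."""
    return (x1 / n1) / (x2 / n2)

def two_proportion_z(x1, n1, x2, n2):
    """Large-sample z statistic and two-sided p-value for H0: pi1 = pi2,
    using the pooled proportion to estimate the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts as read from the table: exposed 24/30, control 19/30.
rr = relative_risk(24, 30, 19, 30)
z, p_value = two_proportion_z(24, 30, 19, 30)
print(f"relative risk = {rr:.3f}")
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

With only 30 mice per group, an exact conditional test may be preferable to the normal approximation.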
Example: Duckling survival rates
            Surgery/Implant   Surgery/No implant   Control
Survive     X1                X2                   X3
Die         n1 - X1           n2 - X2              n3 - X3
Total       n1                n2                   n3
Compare proportions
$p_1 = X_1/n_1$, $p_2 = X_2/n_2$, $p_3 = X_3/n_3$
Example: Seed Germination
Binary response
  - Success
  - Failure
$\pi_i$ = the conditional probability of successful germination under the i-th set of conditions
Logistic regression
$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots$
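The logistic regression formula links a success probability to a linear predictor through the log odds. The sketch below shows that link and its inverse. The coefficient values and covariate settings are hypothetical, chosen only to illustrate the arithmetic, and no model is being fit here.

```python
import math

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse of the logit: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

def success_probability(beta, x):
    """pi = inv_logit(beta0 + beta1*x1 + beta2*x2 + ...)."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return inv_logit(eta)

# Hypothetical coefficients (beta0, beta1, beta2) and covariate settings.
beta = [-1.0, 0.5, 0.8]
for x in [(0, 0), (1, 0), (1, 1)]:
    pi = success_probability(beta, x)
    print(f"x = {x}: pi = {pi:.3f}, log odds = {logit(pi):.3f}")
```

Each unit increase in a covariate adds its coefficient to the log odds, which multiplies the odds of success by the exponential of that coefficient.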
Example: Chinook Salmon
1 year in Kenai River
1 to 5 years in ocean
fish captured in 1999
Nominal variables:
Sex: F = female
     M = male
Mode of capture: 1 = hook and line
                 2 = net
Ordinal variable:
Run: 1 = early run (before July 1)
     2 = late run (after July 1)
Interval variables:
Age: 1, 2, 3, 4, or 5 years
Length: eye to fork of tail (mm)
Questions:
Is the sex ratio 1:1 in each run?
Is the sex ratio consistent across runs?
Early run:
            Count   Percent
Females     339     51.2
Males       321     48.4
Late run:
            Count   Percent
Females     367     45.5
Males       430     54.5
What can be inferred about sex-age distributions for various seasons and modes of capture?
log-linear models:
$\log(m_{ijk\ell}) = \lambda + \lambda^{R}_{i} + \lambda^{C}_{j} + \lambda^{S}_{k} + \lambda^{A}_{\ell} + \lambda^{RS}_{ik} + \lambda^{RA}_{i\ell} + \lambda^{CS}_{jk} + \lambda^{RCS}_{ijk}$
where $m_{ijk\ell} = E(Y_{ijk\ell})$
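The two sex-ratio questions can be explored with Pearson chi-square statistics computed from the counts shown for the early run (339 females, 321 males) and the late run (367 females, 430 males). This is a minimal sketch under the usual large-sample assumptions. The function names are hypothetical, and the p-value helper is valid only for 1 degree of freedom.

```python
import math

def chi2_sf_1df(stat):
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

def equal_ratio_test(females, males):
    """Pearson goodness-of-fit statistic for H0: sex ratio is 1:1."""
    expected = (females + males) / 2
    stat = ((females - expected) ** 2 + (males - expected) ** 2) / expected
    return stat, chi2_sf_1df(stat)

def homogeneity_test(table):
    """Pearson chi-square for a 2x2 table (rows = runs, columns = sexes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, chi2_sf_1df(stat)

early = (339, 321)  # (females, males), early run
late = (367, 430)   # (females, males), late run
for label, counts in [("early", early), ("late", late)]:
    stat, p = equal_ratio_test(*counts)
    print(f"{label} run, 1:1 ratio: X2 = {stat:.3f}, p = {p:.3f}")
stat, p = homogeneity_test([early, late])
print(f"same ratio across runs: X2 = {stat:.3f}, p = {p:.3f}")
```

These tests ignore age and mode of capture; the log-linear model is the tool for examining all four factors together.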
Example: Bird abundance
$m = E(Y)$ = mean number of blackbirds per 10,000 square meters
$\log(m) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$
and
$Y \sim \text{Poisson}(m)$
Example: Classification tree for incidence of lung scarring (BPD) in newborn babies
All cases: BPD 78, no BPD 178
  Vent ≤ 145 hr: BPD 9, no BPD 114
  145 < Vent ≤ 220: BPD 12, no BPD 38
    Low O2 ≤ 437 hr: BPD 3, no BPD 35
    Low O2 > 437 hr: BPD 9, no BPD 3
  Vent > 220 hr: BPD 57, no BPD 21
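The Poisson regression model from the bird abundance example, with a log link between the mean count and the covariates, can be illustrated numerically. The coefficients and covariate values below are hypothetical and serve only to show how a mean and the resulting count probabilities are obtained; nothing is estimated from data.

```python
import math

def poisson_mean(beta, x):
    """m = exp(beta0 + beta1*x1 + beta2*x2 + ...), the inverse of the log link."""
    return math.exp(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def poisson_pmf(y, m):
    """P(Y = y) for Y ~ Poisson(m)."""
    return math.exp(-m) * m ** y / math.factorial(y)

# Hypothetical coefficients and covariate values for one plot.
beta = [0.5, 0.3, -0.2]
x = (2.0, 1.0)
m = poisson_mean(beta, x)
print(f"expected count m = {m:.3f}")
for y in range(5):
    print(f"P(Y = {y}) = {poisson_pmf(y, m):.4f}")
```

Because the link is logarithmic, a one-unit change in a covariate multiplies the expected count by the exponential of its coefficient.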
Complex Studies
Repeated measures
Cluster sampling
Correlated responses
Extra-variation
Methods of Analysis
Generalized Estimating Equations (GEE)
Robust Covariance Estimation
Mixed Models