STATISTICS 557
STATISTICAL METHODS FOR COUNTS AND PROPORTIONS

Kenneth Koehler
Department of Statistics
Iowa State University

Course Information

Instructor: Kenneth J. Koehler
    120 Snedecor Hall
    Telephone: 515-294-4181
    Fax: 515-294-4040
    E-mail: kkoehler@iastate.edu
    Office Hours: ?????

TA: Kyoji Furukawa
    315B Snedecor Hall
    Telephone: 515-294-2227
    Fax: 515-294-4040
    E-mail: kyoji@iastate.edu
    Office Hours: Tuesday 2pm

Textbook: Lloyd, Chris J., Statistical Analysis of Categorical Data, Wiley, 1999.

Course Notes: Copies of transparencies will be posted as PDF files on the course web page as they become available:
    http://www.public.iastate.edu/~kkoehler/stat557/stat557.html

Computation: SAS and S-PLUS code for examples covered in the lectures will be available from the course web page. You should have a calculator that you can bring to exams.

Grades:
    Nine assignments (15%)
    Midterm exams (25% each)
    Final exam (35%)

Topic Outline

    1. Probability distributions for count data, review of inferential procedures
    2. Analysis of two-way contingency tables
    3. Large sample theory
    4. Exact conditional inference
    5. Generalized linear models
    6. Logistic regression for binary responses
    7. Log-linear models and Poisson regression
    8. Correspondence analysis
    9. Classification trees
    10. Models with fixed and random effects

Readings

    Lloyd: Chapters 1 and 2
    Lloyd: Chapter 3
    Lloyd: Section 1.3
    Lloyd: Sections 7.1-7.6
    Lloyd: Section 4.6
    Lloyd: Chapter 4, Section 7.7
    Lloyd: Chapter 6

References

I. Categorical Data Analysis
    1. Agresti, A. (2002) Categorical Data Analysis, 2nd edition, Wiley, New York.
    2. Bishop, Fienberg, and Holland (1975) Discrete Multivariate Analysis, MIT Press, Boston.
    3. Cox, D. R. and Snell, E. J. (1989) Analysis of Binary Data, 2nd edition, Chapman & Hall, London.
    4. Goodman, L. A. (1978) Analyzing Qualitative/Categorical Data, Abt Books.

II. Measures of Association
    1. Goodman, L. A. and Kruskal, W. H. (1979) Measures of Association for Cross Classifications, Springer, New York.

III. Discrete Distributions
    1. Johnson, N. L., Kotz, S. and Kemp, A. W. (1995) Distributions in Statistics: Discrete Distributions, 2nd edition, Wiley, New York.

IV. Statistical Theory
    1. Christensen, R. R. (1997) Log-Linear Models, 2nd edition, Springer, New York.
    2. Haberman, S. J. (1974) The Analysis of Frequency Data, University of Chicago Press.
    3. Greenwood, P. E. and Nikulin, M. S. (1996) A Guide to Chi-Squared Testing, Wiley, New York.

V. Logistic Regression
    1. Collett, D. (1991) Modelling Binary Data, Chapman & Hall, London.
    2. Hosmer, D. W. and Lemeshow, S. (2000) Applied Logistic Regression Analysis, 2nd edition, Wiley, New York.
    3. Kleinbaum, David (1994) Logistic Regression: A Self-learning Text, Springer-Verlag, New York.

VI. Generalized Linear Models
    1. McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models, 2nd edition, Chapman & Hall, London.
    2. Dobson, A. J. (2002) An Introduction to Generalized Linear Models, 2nd edition, Chapman & Hall, London.
    3. Firth, D. (1991) Generalized Linear Models (Chapter 3) in Statistical Theory and Modelling (D. V. Hinkley, N. Reid, E. J. Snell, eds.), Chapman & Hall, London.

VII. Model Free Curve Fitting
    1. Hastie, T. J. and Tibshirani, R. J. (1986) Generalized Additive Models, Statistical Science, 1, pp. 297-318.
    2. Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models, Chapman & Hall, London.

VIII. Classification Trees
    1. Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees, Wadsworth & Brooks/Cole, Monterey, California.

IX. Visualization of Categorical Data
    1. Greenacre, M. J. (1993) Correspondence Analysis in Practice, Academic Press, London.
    2. Blasius, J. and Greenacre, M. J., eds. (1998) Visualization of Categorical Data, Academic Press, London.
    3. Hofmann, H. (2000) Graphical Tools for the Exploration of Multivariate Categorical Data, Herstellung: Books on Demand GmbH, Augsburg.

Information on S-PLUS

Manuals (available in Snedecor 115):
    1. S-PLUS 6 for Windows User's Guide, Insightful Corporation, Seattle, WA.
    2. S-PLUS 6 for Windows Programmer's Guide, Insightful Corporation, Seattle, WA.
    3. S-PLUS 6 for Windows Guide to Statistics, Volume 1, Insightful Corporation, Seattle, WA.
    4. S-PLUS 6 for Windows Guide to Statistics, Volume 2, Insightful Corporation, Seattle, WA.

Books:
    1. Krause, A. and Olson, M. (2000) The Basics of S and S-PLUS, 2nd edition, Springer-Verlag, New York.
    2. Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS, 3rd edition, Springer, New York.

Information on SAS

Manuals (available in Snedecor 115):
    1. SAS/STAT User's Guide, Version 8, Volume 1, Cary, NC: SAS Institute Inc., 1999.
    2. SAS/STAT User's Guide, Version 8, Volume 2, Cary, NC: SAS Institute Inc., 1999.
    3. SAS/STAT User's Guide, Version 8, Volume 3, Cary, NC: SAS Institute Inc., 1999.

Books:
    1. SAS Institute (1995) Logistic Regression: Examples Using the SAS System, SAS Institute, Cary, NC.
    2. Allison, P. D. (2000) Logistic Regression Using the SAS System: Theory and Application, Wiley, New York.
    3. Stokes, M. E., Davis, C. E. and Koch, G. G. (2001) Categorical Data Analysis Using the SAS System, Wiley, New York.

Websites:
    1. Statlib: http://lib.stat.cmu.edu (click on S Archive to see a list of contributed S software)
    2. Insightful Corporation: http://www.insightful.com
    3. SAS: http://www.sas.com
    4. Venables & Ripley libraries: http://www.stats.ox.ac.uk/pub/MASS3/sites.html
    5. R libraries: http://www.ci.tuwien.ac.at/R/mirrors.html

What are categorical data?

    Measured responses can be classified into a set of categories.
    Observations are counts or proportions.

Types of variables

    Nominal
    Ordinal
    Interval
        - Discrete (or categorical)
        - Continuous

Nominal Variables

    Categories have no natural ordering.
    Categories are not separated by meaningful distances.

    Eye color:
        - brown
        - blue
        - green
        - other

    Marital status:
        - never married
        - married
        - divorced
        - widowed

    Survival status at the end of a specific time interval:
        - alive
        - dead
        - lost to follow up

Ordinal Variables

    Categories can be ordered in some meaningful way.
    Categories are not separated by meaningful distances.

    Attitude toward some policy:
        - strongly disapprove
        - disapprove
        - approve
        - strongly approve

Interval Variables

    Meaningful numerical distance between any two levels of the scale.

    Continuous interval variables:
        height
        weight
        blood pressure
        survival time
        chemical concentration

    Discrete interval variables:
        Number of eggs in a nest: {0, 1, 2, ..., 6}
        Years of education: {0, 1, 2, 3, ..., 20}

    What type of response is this?
        - less than 12 years
        - high school graduate
        - college graduate

Binary Response Variables

Example: Opinion Survey

    Estimate the proportion $\pi$ of the population that favors a certain policy.
    Sample $n$ individuals from the population.

        Count
        Favor     x
        Oppose    n - x
        Total     n

    Use the sample proportion $p = x/n$ to estimate the population proportion $\pi$.

    Some issues:
        What is the accuracy of $p = x/n$?
        How large should $n$ be?
        How should the data be collected?

Example: Comparing tumor incidence rates in mice

                    Exposure
                    to smoke    Control
        Tumor       24          19
        No tumor    6           11
        Total       30          30

    Randomized experiment
    Hypothesis test
    Relative risk
    Sample sizes

Example: Duckling survival rates

                    Surgery/    Surgery/
                    Implant     No implant    Control
        Survive     X1          X2            X3
        Die         n1 - X1     n2 - X2       n3 - X3
        Total       n1          n2            n3

    Compare proportions:
        $p_1 = X_1/n_1$, $p_2 = X_2/n_2$, $p_3 = X_3/n_3$

Example: Seed Germination

    Binary response:
        Success
        Failure

    $\pi_i$ is the conditional probability of successful germination under the $i$-th set of conditions.

    Logistic regression:
        $\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots$

Example: Chinook Salmon

    1 year in Kenai River
    1 to 5 years in ocean
    Fish captured in 1999

    Nominal variables:
        Sex: F = female, M = male
        Mode of capture: 1 = hook and line, 2 = net

    Ordinal variable:
        Run: 1 = early run (before July 1), 2 = late run (after July 1)

    Interval variables:
        Age: 1, 2, 3, 4, or 5 years
        Length: eye to fork of tail (mm)

    Questions:
        Is the sex ratio 1:1 in each run?
        Is the sex ratio consistent across runs?

        Early run:    Count    Percent
        Females       339      51.2
        Males         321      48.4

        Late run:     Count    Percent
        Females       367      45.5
        Males         430      54.5

    What can be inferred about sex-age distributions for various seasons and modes of capture?

    Log-linear models:
        $\log(m_{ijk\ell}) = \lambda + \lambda^{R}_{i} + \lambda^{C}_{j} + \lambda^{S}_{k} + \lambda^{A}_{\ell} + \lambda^{RS}_{ik} + \lambda^{RA}_{i\ell} + \lambda^{CS}_{jk} + \lambda^{RCS}_{ijk}$
    where $m_{ijk\ell} = E(Y_{ijk\ell})$.

Example: Bird abundance

    $m = E(Y)$ = mean number of blackbirds per 10,000 square meters
    $\log(m) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$
    $Y \sim \mathrm{Poisson}(m)$

Example: Classification tree for incidence of lung scarring (BPD) in newborn babies

    All cases: BPD 78, no BPD 178

    First split (hours on ventilator):
        Vent <= 145 hr:          BPD 12, no BPD 114
        145 < Vent <= 220 hr:    BPD 9,  no BPD 38
        Vent > 220 hr:           BPD 57, no BPD 21

    Second split (hours of low O2):
        Low O2 <= 437 hr:        BPD 3,  no BPD 35
        Low O2 > 437 hr:         BPD 9,  no BPD 3

Complex Studies

    Repeated measures
    Cluster sampling
    Correlated responses
    Extra-variation

Methods of Analysis

    Generalized Estimating Equations (GEE)
    Robust covariance estimation
    Mixed models
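
The opinion survey example asks how accurate $p = x/n$ is and how large $n$ should be. One common large-sample answer is sketched below, assuming simple random sampling and the normal approximation (Wald interval). The survey counts (540 of 1000) and the 3-point margin are hypothetical values chosen only for illustration; they do not come from the course notes.

```python
import math
from statistics import NormalDist


def wald_interval(x, n, level=0.95):
    """Sample proportion and Wald (normal-approximation) confidence limits."""
    p = x / n
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for 95%
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


def sample_size(margin, level=0.95, p_guess=0.5):
    """Smallest n whose Wald half-width is at most `margin`.

    p_guess = 0.5 is the conservative (worst-case) choice.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


# Hypothetical survey: 540 of 1000 sampled individuals favor the policy.
p, lower, upper = wald_interval(540, 1000)
print(f"p = {p:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
print("n needed for a 0.03 margin:", sample_size(0.03))
```

The Wald interval is only one option; it can behave poorly when $n$ is small or $p$ is near 0 or 1, which is part of why the question of accuracy is raised at all.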
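
For the mouse tumor example, the relative risk and a two-sample test of proportions can be sketched as follows. This assumes that the first column of the table (24 tumors in 30 mice) is the smoke-exposed group and the second (19 in 30) is the control group, which is how the flattened slide appears to read. The interval uses the usual large-sample standard error on the log scale, and the test is an unadjusted pooled z-test; neither is necessarily the method used in the course.

```python
import math


def relative_risk(x1, n1, x2, n2, z=1.96):
    """Relative risk of group 1 vs group 2 with a large-sample 95% interval."""
    rr = (x1 / n1) / (x2 / n2)
    # Standard error of log(RR) from the delta method.
    se_log = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic and two-sided p-value for H0: equal proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Assumed layout: smoke-exposed 24/30 with tumors, control 19/30.
rr, lower, upper = relative_risk(24, 30, 19, 30)
z, p_value = two_proportion_z(24, 30, 19, 30)
print(f"relative risk = {rr:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

With only 30 mice per group the interval is wide, which connects directly to the "sample sizes" item on the slide.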