Statistical Methods for Counts and Proportions
Stat 557, Fall 2012
Heike Hofmann

Can we do something about the time?
• Would T/TR 9:30 - 10:50 work for everybody?

Syllabus & Course Website
• http://www.public.iastate.edu/~hofmann/stat557/syllabus.html
• Blackboard site is for all sensitive material: copyrighted readings, grades

Plan for today
• Syllabus
• What is this course about?
• Review of categorical data:
  • Binomial / Poisson
  • Exact confidence intervals

Textbook
• “Categorical Data Analysis” by Alan Agresti, the ‘green book’; website: http://www.stat.ufl.edu/~aa/cda/cda.html
• Agresti’s “Introduction to Categorical Data Analysis”: less technical, fewer details, but maybe a good starting point

Assessment
• Homework assignments every other week
• Two midterm exams
• Final (team) project & presentation
• 35% homework (5 assignments), 40% midterms, 25% final project (20% write-up, 5% presentation)

Final Project
• Slightly bigger project with multiple due dates
• Will be open ended, and will involve a substantial write-up (10+ pages)
• Work in a team (3-4 members)
• Talk to each other! Find a group!

Disability and sickness
• Make sure to let me know (in advance) if you have to miss an exam or a deadline
• Keep on top of the material or you will get swamped by the end of the semester

Lectures
• Electronic copy of the slides will be available on the website
• But you’ll need to take your own notes!
• If you really want complete notes, organize a roster with others in the class
• If you’re bored, complain! ... same, if I’m going too fast ...

... what I expect ...
• Stat 500 or 401, 543 or 447 (some methods, some theory)
• working knowledge of statistical software (preferably R)

Categorical Data Analysis Topics
• http://www.public.iastate.edu/~hofmann/stat557/
• Review: Distributions/Inference for Categorical Variables
• Generalized Linear Models
• Logistic Binomial and Multinomial Regression; Loglinear Models
• Matched Pairs (with longitudinal structure)
• Correspondence Analysis
• Classification Trees
• ... any suggestions?

Categorical Data Analysis
• Response variable is categorical; explanatory variables can be any type
• outcome is one of a set of possibilities; outcome is a count or proportion

Types of Categorical Variables
• Nominal
• Ordinal
• (Ratio)
• Interval

Nominal Variables
• Categories have no natural ordering
• Categories are not separated by meaningful distances
• marital status, eye color, survival status (alive, dead, lost to follow-up)

Ordinal Variables
• Categories have natural ordering
• Categories are not separated by meaningful distances
• political opinion (liberal, neutral, conservative), approval rating (strongly disapprove, disapprove, approve, strongly approve)
• Likert scales are ordinal

Interval Variables
• Categories have natural ordering
• Meaningful distances between any two levels of the scale
• continuous interval variables: height, weight, age, survival time, ...
• discrete interval variables: years of education, #occurrences, ...

Hierarchy of Types
• Type depends on measurement, e.g.
  “education” could fit any type
• Applicability of methods depends on type: interval > ordinal > nominal

Distinctions
• Continuous vs discrete: lots of different values vs few values
• Quantitative vs qualitative: interval --- ordinal --- nominal

Binary Response
• Example: Opinion Survey
  estimate proportion of population in favor of certain policy/politician
• sample n individuals
• use sample proportion p = Y/n as estimate of true probability π of support for candidate

              Count
  In Favor    Y
  Opposed     n-Y
  Total       n

Binary Response
• Some Issues:
  • Accuracy of p = Y/n?
  • How large should n be?
  • How should the data be collected?

Binomial Distribution
• Define Y as number of successes in n independent, identical binary trials (Y is a sum of independent Bernoulli random variables)
• Distribution
• Moments, Skewness, Kurtosis

Inference for π
• MLE is p = Y/n
• $(p - \pi)/\sqrt{\pi(1-\pi)/n}$ is approx. N(0,1) for large n
• $(p - \pi)/\sqrt{p(1-p)/n}$ is approx. N(0,1) for large n
• n “large” for nπ ≥ 5, n(1-π) ≥ 5

Confidence Intervals
• Binomial distribution is
  • bounded
  • skewed
  • discrete
• large sample normal approximation is used
• leads to problems with coverage

Normal Approximation to Binomial
[Figure: grid of panels for n = 10, 25, 50 and p = 0.01, 0.05, 0.1, with x from 0 to 10; dots are binomial probability, line is normal density]

Example
• Sex of adult turtles caught for consumption in the river Mato:

  male    female
  69      46

• assumption: ratio of sexes is 1.0 in the wild

Inference for π
• MLE is p = Y/n
• we know, for large n:
  $$\frac{p - \pi}{\sqrt{\frac{1}{n}\pi(1-\pi)}} \;\dot\sim\; N(0,1) \qquad\text{and}\qquad \frac{p - \pi}{\sqrt{\frac{1}{n}p(1-p)}} \;\dot\sim\; N(0,1)$$
• n “large” for nπ ≥ 5, n(1 − π) ≥ 5
• so, with probability approx. 1 − α:
  $$\frac{|p - \pi|}{\sqrt{\frac{1}{n}p(1-p)}} < Z_{\alpha/2}$$
• upper and lower limit is then:
  $$p_U = p + Z_{\alpha/2}\sqrt{\tfrac{1}{n}p(1-p)}, \qquad p_L = p - Z_{\alpha/2}\sqrt{\tfrac{1}{n}p(1-p)}$$

Wald CI
• $p \pm Z_{\alpha/2}\sqrt{\tfrac{1}{n}p(1-p)}$

Mato Turtles
• α = 0.05, α/2 = 0.025, $Z_{\alpha/2} = 1.96$
• n = 115, p = 69/115 = 0.6
• $p_L = 0.6 - 1.96\cdot\sqrt{\frac{0.6\cdot 0.4}{115}} = 0.6 - 0.0895 = 0.5105$
• $p_U = 0.6 + 1.96\cdot\sqrt{\frac{0.6\cdot 0.4}{115}} = 0.6 + 0.0895 = 0.6895$
• An approximate 95% confidence interval for the proportion of male turtles among turtles being caught for consumption is (0.51, 0.69)

Issues
• $p \pm Z_{\alpha/2}\sqrt{\tfrac{1}{n}p(1-p)}$ isn’t necessarily within [0,1], e.g. p = 0.1, n = 10
• interval has length 0 at either end (p = 0 or p = 1)

[Figure: Confidence Intervals for Binomial, n = 10, Normal Approximation; x from 0 to 10, vertical axis from 0.0 to 1.0]

Exact confidence intervals
• Tail method:
  • lower limit $p_L$ is that value of p for which
    $$\frac{\alpha}{2} = P(Y \ge y) = \sum_{j=y}^{n} \binom{n}{j} p^j (1-p)^{n-j}$$
  • upper limit $p_U$ is that value of p for which
    $$\frac{\alpha}{2} = P(Y \le y) = \sum_{j=0}^{y} \binom{n}{j} p^j (1-p)^{n-j}$$
• in terms of F quantiles, where $F_{a,b}(c)$ denotes the 1 − c quantile of the F distribution with a and b degrees of freedom:
  $$p_L = \left[1 + \frac{n-y+1}{y\,F_{2y,\,2(n-y+1)}(1-\alpha/2)}\right]^{-1}, \qquad p_U = \left[1 + \frac{n-y}{(y+1)\,F_{2(y+1),\,2(n-y)}(\alpha/2)}\right]^{-1}$$
• see e.g. Johnson & Kotz, “Discrete Distributions”
• exact interval for Mato turtles: (0.504, 0.690)

Issues
• exact interval is conservative, i.e. due to discreteness the actual coverage of the CI can be (and typically is) larger than the nominal coverage

Coverage
• For a fixed value of a parameter, the actual coverage probability of an interval estimator is the probability that the interval contains the parameter:
  $$C(p, n) = \sum_{j=0}^{n} I(j, p)\binom{n}{j} p^j (1-p)^{n-j}$$
• I(j,p) is 1 if the interval for observation j contains p, and 0 otherwise

Coverage
[Figure: Coverage of Confidence Intervals for Binomial; panels for n = 5, 10, 50; coverage (0.0 to 1.0) plotted against p (0.0 to 1.0); methods: Wald, Score, Exact, adj. Wald]
• see also Agresti & Coull (1998) for adjusted Wald interval
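
Worked example: Wald interval. The Mato turtle arithmetic and the two Wald issues (limits outside [0,1], zero-length interval at p = 0) can be checked with a few lines of code. This is only a sketch in Python using the standard library; the function name `wald_interval` is made up for illustration and is not part of the course material.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(y, n, alpha=0.05):
    """Wald interval p +/- z * sqrt(p(1-p)/n) for y successes in n trials."""
    p = y / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Mato turtles: 69 males among 115 turtles caught
turtle_lower, turtle_upper = wald_interval(69, 115)
print(f"turtles:   ({turtle_lower:.4f}, {turtle_upper:.4f})")

# The two issues with the Wald interval:
small_lower, small_upper = wald_interval(1, 10)  # p = 0.1, n = 10
zero_lower, zero_upper = wald_interval(0, 10)    # p = 0
print(f"y=1, n=10: ({small_lower:.4f}, {small_upper:.4f})")  # lower limit falls below 0
print(f"y=0, n=10: ({zero_lower:.4f}, {zero_upper:.4f})")    # interval has length 0
```

The turtle limits agree with the hand calculation (0.5105, 0.6895) up to rounding of the 1.96 quantile.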
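
Worked example: exact interval by the tail method. Instead of F quantiles, the two tail equations can be solved numerically for p. The sketch below does this by bisection, using only the Python standard library; `binom_cdf`, `solve_increasing` and `exact_interval` are hypothetical helper names, and bisection is just one convenient way to solve the equations, not necessarily how the interval on the slides was computed.

```python
from math import comb


def binom_cdf(y, n, p):
    """P(Y <= y) for Y ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(y + 1))


def solve_increasing(f, target, tol=1e-10):
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(y, n, alpha=0.05):
    """Tail-method (exact) confidence interval for a binomial proportion."""
    # Lower limit: P(Y >= y | p) = alpha/2; this tail probability increases in p.
    if y == 0:
        lower = 0.0
    else:
        lower = solve_increasing(lambda p: 1 - binom_cdf(y - 1, n, p), alpha / 2)
    # Upper limit: P(Y <= y | p) = alpha/2, i.e. P(Y >= y + 1 | p) = 1 - alpha/2.
    if y == n:
        upper = 1.0
    else:
        upper = solve_increasing(lambda p: 1 - binom_cdf(y, n, p), 1 - alpha / 2)
    return lower, upper


exact_lower, exact_upper = exact_interval(69, 115)  # Mato turtles
print(f"exact interval: ({exact_lower:.3f}, {exact_upper:.3f})")
```

For the turtle data this should reproduce the interval (0.504, 0.690) quoted above, to three decimals.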
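
Worked example: coverage. The coverage formula C(p, n) is a finite sum, so it can be evaluated directly for any interval method. The following sketch (Python, standard library only) does this for the Wald interval; the names `coverage` and `wald_interval` are illustrative assumptions, and any other function returning (lower, upper) for given y and n could be passed in the same way.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_interval(y, n, alpha=0.05):
    """Wald interval for y successes in n trials."""
    p_hat = y / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p, n, interval=wald_interval):
    """C(p, n) = sum_j I(j, p) * choose(n, j) * p^j * (1 - p)^(n - j)."""
    total = 0.0
    for j in range(n + 1):
        lower, upper = interval(j, n)
        if lower <= p <= upper:  # I(j, p) = 1: interval for observation j contains p
            total += comb(n, j) * p**j * (1 - p) ** (n - j)
    return total


wald_cov = coverage(0.1, 10)
print(f"Wald coverage at p = 0.1, n = 10: {wald_cov:.4f}")
```

At p = 0.1 and n = 10 only the intervals for y = 1, ..., 4 contain p, so the actual coverage is far below the nominal 95%.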