Statistical Methods for Counts and Proportions Stat 557 Fall 2012

Heike Hofmann

Can we do something about the time?
• Would T/TR 9:30 - 10:50 work for everybody?

Syllabus & Course Website
• http://www.public.iastate.edu/~hofmann/stat557/syllabus.html
• Blackboard site is for all sensitive material: copyrighted readings, grades

Plan for today
• Syllabus
• What is this course about?
• Review of categorical data:
  • Binomial / Poisson
  • Exact confidence intervals

Textbook
• “Categorical Data Analysis” by Alan Agresti, the ‘green book’; website: http://www.stat.ufl.edu/~aa/cda/cda.html
• Agresti’s “Introduction to Categorical Data Analysis”: less technical, fewer details, but maybe a good starting point

Assessment
• Homework assignments every other week
• Two midterm exams
• Final (team) project & presentation
• 35% homework (5 assignments), 40% midterms, 25% final project (20% write-up, 5% presentation)

Final Project
• Slightly bigger project with multiple due dates
• Will be open ended, and will involve a substantial write-up (10+ pages)
• Work in a team (3-4 members)
• Talk to each other! Find a group!

Disability and sickness
• Make sure to let me know (in advance) if you have to miss an exam or a deadline
• Keep on top of the material or you will get swamped by the end of the semester

Lectures
• Electronic copy of the slides will be available on the website
• But you’ll need to take your own notes!
• If you really want complete notes, organize a roster with others in the class
• If you’re bored, complain! ... same, if I’m going too fast ...

... what I expect ...
• Stat 500 or 401, 543 or 447 (some methods, some theory)
• Working knowledge of statistical software (preferably R)

Categorical Data Analysis Topics
• http://www.public.iastate.edu/~hofmann/stat557/
• Review: Distributions/Inference for Categorical Variables
• Generalized Linear Models
• Logistic Binomial and Multinomial Regression; Loglinear Models
• Matched Pairs (with longitudinal structure)
• Correspondence Analysis
• Classification Trees
• ... any suggestions?

Categorical Data Analysis
• Response variable is categorical; explanatory variables can be any type
• Outcome is one of a set of possibilities; outcome is a count or proportion

Types of Categorical Variables
• Nominal
• Ordinal
• Interval
• (Ratio)

Nominal Variables
• Categories have no natural ordering
• Categories are not separated by meaningful distances
• Examples: marital status, eye color, survival status (alive, dead, lost to follow-up)

Ordinal Variables
• Categories have natural ordering
• Categories are not separated by meaningful distances
• Examples: political opinion (liberal, neutral, conservative), approval rating (strongly disapprove, disapprove, approve, strongly approve)
• Likert scales are ordinal

Interval Variables
• Categories have natural ordering
• Meaningful distances between any two levels of the scale
• Continuous interval variables: height, weight, age, survival time, ...
• Discrete interval variables: years of education, #occurrences, ...

Hierarchy of Types
• Type depends on measurement, e.g.
“education” could fit any type
• Applicability of methods depends on type: interval > ordinal > nominal

Distinctions
• Continuous (lots of different values) vs discrete (few values)
• Quantitative (interval) vs qualitative (nominal), with ordinal in between

Binary Response
• Example: Opinion Survey. Estimate the proportion of the population in favor of a certain policy/politician
• Sample n individuals
• Use the sample proportion p = Y/n as estimate of the true probability π of support for the candidate

             Count
  In Favor   Y
  Opposed    n-Y
  Total      n

Binary Response
• Some issues:
  • Accuracy of p = Y/n?
  • How large should n be?
  • How should the data be collected?

Binomial Distribution
• Define Y as the number of successes in n independent, identical binary trials (Y is a sum of independent Bernoulli random variables)
• Distribution
• Moments, Skewness, Kurtosis

Inference for π
• MLE is p = Y/n
• (p - π)/sqrt[π(1-π)/n] is approx N(0,1) for large n
• (p - π)/sqrt[p(1-p)/n] is approx N(0,1) for large n
• n “large” for nπ ≥ 5, n(1-π) ≥ 5

Confidence Intervals
• Binomial distribution is
  • bounded
  • skewed
  • discrete
• Large sample normal approximation is used
• Leads to problems with coverage

[Figure: Normal Approximation to Binomial. Panels for p = 0.01, 0.05, 0.1 and n = 10, 25, 50; x-axis x, y-axis y. Dots are binomial probability, line is normal density.]

Example
• Sex of adult turtles caught for consumption in the river Mato:

    male   female
    69     46

• Assumption: ratio of sexes is 1.0 in the wild

Inference for π
• MLE is $p = Y/n$
• We know, for large n:
  $$\frac{p - \pi}{\sqrt{\frac{1}{n}\, p(1-p)}} \overset{\cdot}{\sim} N(0, 1), \qquad \frac{p - \pi}{\sqrt{\frac{1}{n}\, \pi(1-\pi)}} \overset{\cdot}{\sim} N(0, 1)$$
• n “large” for $n\pi \ge 5$, $n(1-\pi) \ge 5$
• So, with probability approximately $1 - \alpha$:
  $$\frac{|p - \pi|}{\sqrt{\frac{1}{n}\, p(1-p)}} < Z_{\alpha/2}$$
• Upper and lower limit is then:
  $$p_U = p + Z_{\alpha/2} \sqrt{\tfrac{1}{n}\, p(1-p)}, \qquad p_L = p - Z_{\alpha/2} \sqrt{\tfrac{1}{n}\, p(1-p)}$$

Wald CI
• $p \pm Z_{\alpha/2} \sqrt{\frac{1}{n}\, p(1-p)}$

Mato Turtles
• $\alpha = 0.05$, $\alpha/2 = 0.025$, $Z_{\alpha/2} = 1.96$
• $n = 115$, $p = 69/115 = 0.6$
• $p_L = 0.6 - 1.96 \cdot \sqrt{\frac{0.6 \cdot 0.4}{115}} = 0.6 - 0.0895 = 0.5105$
• $p_U = 0.6 + 1.96 \cdot \sqrt{\frac{0.6 \cdot 0.4}{115}} = 0.6 + 0.0895 = 0.6895$
• An approximate 95% confidence interval for the proportion of male turtles among turtles being caught for consumption is (0.51, 0.69)

Issues
• $p \pm Z_{\alpha/2} \sqrt{\frac{1}{n}\, p(1-p)}$ isn’t necessarily within [0,1], e.g. p = 0.1, n = 10
• Interval has length 0 at either end (y = 0 or y = n)

[Figure: Confidence Intervals for Binomial, n=10, Normal Approximation; x-axis x from 0 to 10.]

Exact confidence intervals
• Tail method:
  • Lower limit is that value of p for which
    $$\frac{\alpha}{2} = P(Y \ge y) = \sum_{j=y}^{n} \binom{n}{j} p^j (1-p)^{n-j}$$
  • Upper limit is that p for which
    $$\frac{\alpha}{2} = P(Y \le y) = \sum_{j=0}^{y} \binom{n}{j} p^j (1-p)^{n-j}$$
• See e.g. Johnson & Kotz, “Discrete Distributions”:
  $$p_L = \left[ 1 + \frac{n - y + 1}{y \, F_{2y,\,2(n-y+1)}(1 - \alpha/2)} \right]^{-1}$$
  $$p_U = \left[ 1 + \frac{n - y}{(y + 1) \, F_{2(y+1),\,2(n-y)}(\alpha/2)} \right]^{-1}$$
  where $F_{a,b}(c)$ denotes the $1 - c$ quantile of the F distribution with a and b degrees of freedom
• Exact interval for Mato turtles: (0.504, 0.690)

Issues
• Exact interval is conservative, i.e. due to discreteness the actual coverage of the CI might be (and will be) larger than the nominal coverage

Coverage
• For a fixed value of a parameter, the actual coverage probability of an interval estimator is the probability that the interval contains the parameter:
  $$C(p, n) = \sum_{j=0}^{n} I(j, p) \binom{n}{j} p^j (1-p)^{n-j}$$
• I(j,p) is 1 if the interval for observation j contains p, and 0 otherwise

[Figure: Coverage of Confidence Intervals for Binomial. Panels 5, 10, 50; x-axis p, y-axis coverage; methods: Wald, Score, Exact, adj. Wald.]

• See also Agresti & Coull (1998) for adjusted Wald interval
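
The Wald interval, the tail-method exact interval, and the coverage function C(p, n) from these slides can be tried out numerically. The following is a minimal sketch in Python (the course itself prefers R), using only the standard library. The function names (wald_ci, exact_ci, coverage, binom_cdf, bisect_root) are made up for this illustration, and the exact limits are found by plain bisection on the binomial tail probabilities rather than through F quantiles, so treat it as one possible way to reproduce the numbers, not as the course's code.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_ci(y, n, alpha=0.05):
    """Wald interval p +/- z_{alpha/2} * sqrt(p(1-p)/n); can leave [0, 1]."""
    p = y / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def binom_cdf(y, n, p):
    """P(Y <= y) for Y ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(y + 1))


def bisect_root(f, lo=0.0, hi=1.0, iters=60):
    """Root of an increasing function f on [lo, hi] with f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(y, n, alpha=0.05):
    """Tail-method (exact) interval.

    Lower limit solves P(Y >= y) = alpha/2, upper limit solves
    P(Y <= y) = alpha/2; the limits are set to 0 or 1 at y = 0 or y = n.
    """
    if y == 0:
        lower = 0.0
    else:
        lower = bisect_root(lambda p: (1 - binom_cdf(y - 1, n, p)) - alpha / 2)
    if y == n:
        upper = 1.0
    else:
        upper = bisect_root(lambda p: alpha / 2 - binom_cdf(y, n, p))
    return lower, upper


def coverage(ci, p, n, alpha=0.05):
    """C(p, n) = sum_j I(j, p) * C(n, j) * p^j * (1-p)^(n-j)."""
    total = 0.0
    for j in range(n + 1):
        lo, hi = ci(j, n, alpha)
        if lo <= p <= hi:  # I(j, p) = 1
            total += comb(n, j) * p**j * (1 - p) ** (n - j)
    return total


if __name__ == "__main__":
    print("Wald,  Mato turtles:", wald_ci(69, 115))
    print("Exact, Mato turtles:", exact_ci(69, 115))
    print("Wald coverage,  p=0.1, n=10:", coverage(wald_ci, 0.1, 10))
    print("Exact coverage, p=0.1, n=10:", coverage(exact_ci, 0.1, 10))
```

For the Mato turtles (y = 69, n = 115), wald_ci should land close to the (0.51, 0.69) computed by hand, and exact_ci can be compared against the (0.504, 0.690) reported on the slide. Calling coverage with wald_ci and with exact_ci over a grid of p values gives the kind of curves shown in the coverage figure.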
