STA 100 Lecture 16 Analysis of Categorical Data I. Dichotomous Observations A Practical Problem: Estimate the 5-year survival rate of patients receiving a standard treatment for breast cancer. a. Recall the binomial random variable b. An estimate for the population proportion p. c. The Wilson-Adjusted sample proportion. d. Sampling distribution of the estimators II. Confidence Intervals for a Population Proportion Example: Suppose among 75 cancer patients 67 survived after 10 years of being diagnosed and treated for the cancer. Find a 90% confidence interval for the survival rate. 2 III. Sample Size for Estimation of a Population Proportion A Practical Problem: We would like to estimate the proportion of patients visiting a local primary care clinic who will be diagnosed as depressed at their first visit. How many samples we need to select in order to estimate this proportion with 90% confidence and margin of error of 2%? Consider the large-sample confidence interval for p as p^ ± zα/2 √ p^ (1 - p^) / n The margin of error is M = zα/2 √ p^ (1 - p^) / n Which leads to n = ( zα/2 / M )2 p^ (1 - p^) In practice p^ is not known, but we many have a pilot estimate or guess, which we denote it by p*. Then n = ( zα/2 / M )2 p* (1 – p*) Now, if we have no idea about p, we can use Max p* (1 – p*) = 0.5 (1- 0.5) = 0.25 This leads to n = 0.25 ( zα/2 / M )2 Example: Diagnosis of depression. 3 IV. The Chi-Square Goodness-of-Fit Test Categorical data analysis is used when the variable under study is classified into several categories. A Practical Problem: The famous biologist and father of modern genetics Gregor Mendel crossed round yellow pea plants with wrinkled green pea plants obtaining the plants bearing peas in the following four categories: Category Frequency Round Yellow Round Green Wrinkled Yellow Wrinkled Green 315 108 101 32 According to his theory, the expected frequencies of these characteristics should be in proportion 9:3:3:1. Do the observed data agree with the theory ? Let O represent the observed and E the expected frequencies, respectively. Then the chi-square test statistics is: χ2 = Σ (O-E)2 / E The degree of freedom of the chi-square statistics is df = ν -1, where ν is the number of categories. We use Table 9 to find the critical values. 4 Example: Mandel’s experiment Category Frequency Round Yellow Round Green Wrinkled Yellow Wrinkled Green 315 108 101 32 V. The 2×2 Contingency Tables The 2x2 contingency tables arise in dealing with two binary categorical responses. Practical Problems: 1. In a study of side effects of certain prescription drug the following data were observed: Side Effects Present Absent Drug 15 35 4 46 Treatment Placebo Is there a side effect due to drug? 5 2. To study the relationship between smoking and cardio vascular disease the following table is created for a random sample of size 100. CVD No Yes No 50 15 Yes 10 25 Smoking Based on this data, is there an indication that smoking may cause CVD? The chi-square test statistics for testing independence or homogeneity in 2x2 tables is: χ2 = Σ (O-E)2 / E The degree of freedom of the chi-square statistics is df =1. Here, E = Row tot * Column tot / Grand tot Example1: Side Effects Side Effects Present Absent Drug 15 35 4 46 Treatment Placebo 6 Example2: CVD CVD No Yes No 50 15 Yes 10 25 Smoking 7