Analysis of Categorical Data I. Dichotomous Observations

STA 100
Lecture 16
Analysis of Categorical Data
I. Dichotomous Observations
A Practical Problem: Estimate the 5-year survival rate of patients
receiving a standard treatment for breast cancer.
a. Recall the binomial random variable
b. An estimate for the population proportion p.
c. The Wilson-Adjusted sample proportion.
d. Sampling distribution of the estimators
II. Confidence Intervals for a Population Proportion
Example: Suppose that among 75 cancer patients, 67 survived 10 years after
being diagnosed and treated for the cancer. Find a 90% confidence interval
for the survival rate.
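The example can be worked numerically. The sketch below computes the large-sample interval p^ ± z sqrt(p^(1 - p^)/n) and, for comparison, a Wilson-adjusted interval. The notes name the Wilson-adjusted proportion but do not give its formula, so the form used here, p~ = (y + 0.5 z^2)/(n + z^2) with standard error sqrt(p~(1 - p~)/(n + z^2)), is an assumption; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

y, n = 67, 75      # survivors and sample size from the example
conf = 0.90        # confidence level

# z_{alpha/2}: upper alpha/2 point of the standard normal (about 1.645 here)
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)

# Large-sample interval based on the sample proportion p_hat
p_hat = y / n
m_wald = z * sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - m_wald, p_hat + m_wald)

# Wilson-adjusted interval (assumed form, not stated in the notes)
n_adj = n + z ** 2
p_tilde = (y + 0.5 * z ** 2) / n_adj
m_adj = z * sqrt(p_tilde * (1 - p_tilde) / n_adj)
wilson_adj = (p_tilde - m_adj, p_tilde + m_adj)

print(f"large-sample 90% CI:    ({wald[0]:.3f}, {wald[1]:.3f})")
print(f"Wilson-adjusted 90% CI: ({wilson_adj[0]:.3f}, {wilson_adj[1]:.3f})")
```

The two intervals differ slightly because the adjusted estimate pulls the proportion toward 0.5, which matters most when p^ is close to 0 or 1.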
III. Sample Size for Estimation of a Population Proportion
A Practical Problem: We would like to estimate the proportion of
patients visiting a local primary care clinic who will be diagnosed as
depressed at their first visit. How many patients do we need to sample in
order to estimate this proportion with 90% confidence and a margin of
error of 2%?
Consider the large-sample confidence interval for p:
$\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}$
The margin of error is
$M = z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}$
Solving for n gives
$n = (z_{\alpha/2}/M)^2 \, \hat{p}(1-\hat{p})$
In practice $\hat{p}$ is not known before sampling, but we may have a pilot
estimate or guess, which we denote by p*. Then
$n = (z_{\alpha/2}/M)^2 \, p^*(1-p^*)$
If we have no idea about p, we can use the worst case, since p*(1 - p*) is
largest at p* = 0.5:
$\max_{p^*} p^*(1-p^*) = 0.5(1-0.5) = 0.25$
This leads to
$n = 0.25 \, (z_{\alpha/2}/M)^2$
Example: Diagnosis of depression.
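For the depression example (90% confidence, margin of error 0.02, no pilot estimate), the sample-size formula can be evaluated as in the sketch below. The function name `sample_size` and its parameters are illustrative, and rounding up to the next whole patient is an assumption about how the result is reported.

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin, conf, p_star=0.5):
    """Smallest n with z * sqrt(p*(1 - p*)/n) <= margin.

    p_star = 0.5 is the worst case, giving n = 0.25 * (z / margin)^2.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    return ceil((z / margin) ** 2 * p_star * (1 - p_star))

# No prior guess for p: use the conservative value p* = 0.5
n_needed = sample_size(margin=0.02, conf=0.90)
print(n_needed)
```

If a pilot estimate were available, passing it as `p_star` would give a smaller required sample size, since p*(1 - p*) is below 0.25 for any p* other than 0.5.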
IV. The Chi-Square Goodness-of-Fit Test
Categorical data analysis is used when the variable under study is
classified into several categories.
A Practical Problem: Gregor Mendel, the famous biologist and father of
modern genetics, crossed round yellow pea plants with wrinkled green pea
plants and obtained plants bearing peas in the following four
categories:
Category          Frequency
Round Yellow        315
Round Green         108
Wrinkled Yellow     101
Wrinkled Green       32
According to his theory, the expected frequencies of these
characteristics should be in the proportion 9:3:3:1. Do the observed data
agree with the theory?
Let O represent the observed and E the expected frequencies,
respectively. Then the chi-square test statistic is
$\chi^2 = \sum (O-E)^2 / E$
The degrees of freedom of the chi-square statistic are df = ν - 1, where ν
is the number of categories. We use Table 9 to find the critical values.
Example: Mendel's experiment
Category          Frequency
Round Yellow        315
Round Green         108
Wrinkled Yellow     101
Wrinkled Green       32
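The goodness-of-fit calculation for these data can be sketched as follows. Expected counts come from the 9:3:3:1 ratio applied to the total of 556 plants. The notes do not state a significance level, so α = 0.05 is an assumption here, with the critical value 7.815 taken from a standard chi-square table for df = 3.

```python
# Observed counts from Mendel's experiment
observed = {
    "Round Yellow": 315,
    "Round Green": 108,
    "Wrinkled Yellow": 101,
    "Wrinkled Green": 32,
}
# Theoretical ratio 9:3:3:1
ratio = {
    "Round Yellow": 9,
    "Round Green": 3,
    "Wrinkled Yellow": 3,
    "Wrinkled Green": 1,
}

total = sum(observed.values())        # 556 plants
ratio_sum = sum(ratio.values())       # 16

# Expected count in each category: total * (ratio share)
expected = {k: total * ratio[k] / ratio_sum for k in observed}

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1                # number of categories minus 1

critical = 7.815   # table value for df = 3 at alpha = 0.05 (alpha is assumed)

for k in observed:
    print(f"{k:16s} O = {observed[k]:4d}   E = {expected[k]:7.2f}")
print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {chi2 > critical}")
```

The statistic is far below the critical value, so under the assumed α the observed counts are consistent with the 9:3:3:1 theory.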
V. The 2×2 Contingency Tables
2×2 contingency tables arise when dealing with two binary categorical
responses.
Practical Problems:
1. In a study of the side effects of a certain prescription drug, the
following data were observed:
                Side Effects
Treatment     Present   Absent
Drug             15       35
Placebo           4       46
Is there a side effect due to the drug?
2. To study the relationship between smoking and cardiovascular
disease (CVD), the following table is created for a random sample of size 100.
                CVD
Smoking       No     Yes
No            50     15
Yes           10     25
Based on this data, is there an indication that smoking may cause CVD?
The chi-square test statistic for testing independence or homogeneity
in 2×2 tables is
$\chi^2 = \sum (O-E)^2 / E$
The chi-square statistic has df = 1 degree of freedom. Here, for each cell,
$E = (\text{row total} \times \text{column total}) / \text{grand total}$
Example 1: Side Effects
                Side Effects
Treatment     Present   Absent
Drug             15       35
Placebo           4       46
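A minimal sketch of the calculation for this table: each expected count is row total times column total divided by the grand total, and the statistic sums (O - E)^2 / E over the four cells. The function name `chi_square_2x2` is illustrative. The significance level α = 0.05 is an assumption, with the critical value 3.841 taken from a standard chi-square table for df = 1.

```python
def chi_square_2x2(table):
    """Return (chi-square statistic, expected counts) for a 2x2 table.

    table is a list of two rows, each a list of two observed counts.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    # E = row total * column total / grand total, cell by cell
    expected = [[r * c / grand for c in col_tot] for r in row_tot]
    chi2 = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    return chi2, expected

# Rows: Drug, Placebo; columns: side effects Present, Absent
side_effects = [[15, 35],
                [4, 46]]

chi2, expected = chi_square_2x2(side_effects)
critical = 3.841   # table value for df = 1 at alpha = 0.05 (alpha is assumed)

print("expected counts:", expected)
print(f"chi-square = {chi2:.3f}, reject H0: {chi2 > critical}")
```

The statistic exceeds the critical value, so under the assumed α the data suggest that the rate of side effects differs between the drug and placebo groups.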
Example 2: CVD
                CVD
Smoking       No     Yes
No            50     15
Yes           10     25
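The smoking data can be worked the same way. The sketch below computes the statistic from the expected counts and cross-checks it against the algebraically equivalent shortcut for 2×2 tables, N(ad - bc)^2 divided by the product of the four marginal totals. The shortcut does not appear in the notes and is included only as a check. As before, α = 0.05 and the critical value 3.841 (df = 1) are assumptions.

```python
# Rows: Smoking No, Smoking Yes; columns: CVD No, CVD Yes
cvd = [[50, 15],
       [10, 25]]

row_tot = [sum(row) for row in cvd]            # [65, 35]
col_tot = [sum(col) for col in zip(*cvd)]      # [60, 40]
grand = sum(row_tot)                           # 100

# Expected counts: row total * column total / grand total
expected = [[r * c / grand for c in col_tot] for r in row_tot]

# Chi-square statistic: sum of (O - E)^2 / E over the four cells
chi2 = sum(
    (cvd[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Shortcut for 2x2 tables (not in the notes), used here as a cross-check
(a, b), (c, d) = cvd
shortcut = grand * (a * d - b * c) ** 2 / (
    row_tot[0] * row_tot[1] * col_tot[0] * col_tot[1]
)

critical = 3.841   # table value for df = 1 at alpha = 0.05 (alpha is assumed)
print("expected counts:", expected)
print(f"chi-square = {chi2:.3f} (shortcut: {shortcut:.3f})")
print("reject independence:", chi2 > critical)
```

A large statistic indicates an association between smoking and CVD in this sample; the test by itself does not establish that smoking causes CVD.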