STAT 305: Chapter 7 – Methods for Analyzing Two or More Categorical Variables
Spring 2014
This chapter will introduce methods for describing the relationship between two or more
categorical variables. In addition, we will discuss several inferential procedures used for
analyzing data of this type. To begin, consider the following example.
Example 7.1: In May of 2000, eight people who had worked at the same microwave popcorn
production plant reported to the Missouri Department of Health with fixed obstructive lung
disease. These workers had become ill between 1993 and 2000 while employed at the plant.
Because of these cases, researchers began conducting medical examinations and
environmental surveys of workers employed at the plant in November of 2000 to assess
their occupational exposure to certain compounds.
Part of this study involved measuring the forced vital capacity (FVC) of the current
employees (this is the volume of air that can be maximally, forcefully exhaled). The study
consisted of 116 participants, and the FVC screening indicated that 21 employees had an
airway obstruction. In addition, the popcorn plant was broken into several areas (the
flavor-mixing room, packaging room, etc.). Air and dust samples in each area were
measured to determine the exposure to diacetyl, a marker of organic-chemical exposure.
Then, the average exposure for each study participant was determined by taking into
account how long they spent at different jobs within the plant and the average exposure in
that job area. Finally, they were classified as having either “low” or “high” exposure. The
data (found in the file PopcornPlant.JMP) are summarized below.
Source: The data and example are from “Investigating Statistical Concepts, Applications, and Methods” by Allan Rossman
and Beth Chance, Preliminary Edition. 2005. Brooks/Cole Thomson Learning.
Descriptive Methods For Two Categorical Variables
To describe the relationship between two categorical variables, we usually use contingency
tables and mosaic plots.
Contingency Table: A table showing the joint frequencies of two categorical variables. The
rows of the table denote the categories of the first variable, and the columns denote the
categories of the second variable.
Mosaic plot: This plot gives a visual representation of the relationship between two categorical
variables. A mosaic plot presents the frequencies of combinations of categories of two variables.
The JMP output for this example is given below. To get these summaries, select Analyze > Fit Y
by X and place Exposure in the X, Factor box and Airway Obstruction in the Y, Response box.
Questions:
1. Using the contingency table, find the following marginal probability:
P(High Exposure) =
2. Using the contingency table, find the following marginal probability:
P(Airway Obstruction) =
3. Using the contingency table, find the joint probability that someone is in the High
Exposure group and has an Airway Obstruction:
P(High Exposure and Airway Obstruction) =
4. Find the following joint probability:
P(Low Exposure and No Airway Obstruction) =
5. Using the contingency table, find the following conditional probability. Given that an
individual is in the High Exposure group, what is the probability they have an Airway
Obstruction?
P(Airway Obstruction | High Exposure) =
6. Find the following conditional probability:
P(Airway Obstruction | Low Exposure) =
ADDITIONAL DESCRIPTIVE MEASURES: RELATIVE RISK AND ODDS RATIOS
Other summaries that are often computed when investigating the relationship between two
categorical variables are the risk difference, relative risk ratio, and the odds ratio.
Risk Difference and Relative Risk
Example 7.2: Consider the data from the popcorn plant.
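|               | Airway Not Obstructed | Airway Obstructed | Total |
|---------------|-----------------------|-------------------|-------|
| Low Exposure  | 52                    | 6                 | 58    |
| High Exposure | 43                    | 15                | 58    |
| Total         | 95                    | 21                | 116   |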
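As a rough check on the probability questions posed earlier, the marginal, joint, and conditional probabilities can be computed directly from the four cell counts. The Python sketch below is not part of the JMP workflow; the dictionary layout and function names are illustrative assumptions.

```python
# Counts from the popcorn plant table, keyed by (exposure level, airway status).
# The dictionary layout and function names are illustrative, not from JMP.
counts = {
    ("Low", "Not Obstructed"): 52,
    ("Low", "Obstructed"): 6,
    ("High", "Not Obstructed"): 43,
    ("High", "Obstructed"): 15,
}
total = sum(counts.values())  # grand total of participants


def exposure_total(level):
    """Row total: number of participants at the given exposure level."""
    return sum(n for (lvl, _), n in counts.items() if lvl == level)


def airway_total(status):
    """Column total: number of participants with the given airway status."""
    return sum(n for (_, st), n in counts.items() if st == status)


def marginal_exposure(level):
    """P(exposure level): row total over grand total."""
    return exposure_total(level) / total


def marginal_airway(status):
    """P(airway status): column total over grand total."""
    return airway_total(status) / total


def joint(level, status):
    """P(exposure level and airway status): cell count over grand total."""
    return counts[(level, status)] / total


def conditional(status, level):
    """P(airway status | exposure level): cell count over row total."""
    return counts[(level, status)] / exposure_total(level)


print(f"P(High Exposure) = {marginal_exposure('High'):.3f}")
print(f"P(Airway Obstruction) = {marginal_airway('Obstructed'):.3f}")
print(f"P(High and Obstructed) = {joint('High', 'Obstructed'):.3f}")
print(f"P(Low and Not Obstructed) = {joint('Low', 'Not Obstructed'):.3f}")
print(f"P(Obstructed | High) = {conditional('Obstructed', 'High'):.3f}")
print(f"P(Obstructed | Low) = {conditional('Obstructed', 'Low'):.3f}")
```

Note that a conditional probability divides by the row total (58), while marginal and joint probabilities divide by the grand total (116).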
We have seen that P(Airway Obstruction | High Exposure) is higher than
P(Airway Obstruction | Low Exposure). Since these conditional probabilities differ, it appears
that there may be an association between level of exposure and having an airway obstruction.
One way to compare the two groups (High and Low Exposure) is to look at the risk difference
in these two probabilities.
Risk Difference: This is simply the difference in the conditional probabilities:
P(Airway Obstruction | High Exposure) - P(Airway Obstruction | Low Exposure) =
Questions:
1. Does this seem like a large difference to you?
2. Suppose the two conditional probabilities of interest had been .95 and .79, instead. Does
this seem like a large difference to you?
Note that for these data, P(Airway Obstruction | High Exposure) was more than TWICE AS
LARGE as P(Airway Obstruction | Low Exposure). Since this seems like an important feature
to describe, we will compare the two groups based on relative risk instead of risk difference.
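To see why the risk difference alone can hide an important feature, the short Python sketch below compares the popcorn plant probabilities (15/58 and 6/58) with the hypothetical pair .95 and .79 from Question 2. The function names and scenario labels are illustrative assumptions.

```python
def risk_difference(p_first, p_second):
    """Difference between two conditional probabilities."""
    return p_first - p_second


def relative_risk(p_numerator, p_denominator):
    """Ratio of two conditional probabilities."""
    return p_numerator / p_denominator


# (P(obstruction | High), P(obstruction | Low)) for each scenario.
scenarios = {
    "popcorn plant data": (15 / 58, 6 / 58),
    "hypothetical .95 vs .79": (0.95, 0.79),
}

for label, (p_high, p_low) in scenarios.items():
    diff = risk_difference(p_high, p_low)
    ratio = relative_risk(p_high, p_low)
    print(f"{label}: risk difference = {diff:.3f}, relative risk = {ratio:.2f}")
```

The two scenarios have nearly the same risk difference (about 0.16), yet the relative risk is 2.5 in the first and only about 1.2 in the second.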
Relative Risk: This is a measure of how much a particular risk factor influences the risk of a
specified outcome.
For the popcorn data, we calculate the relative risk as follows:
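\[
\text{Relative Risk}
= \frac{P(\text{Airway Obstruction} \mid \text{High Exposure})}{P(\text{Airway Obstruction} \mid \text{Low Exposure})}
= \frac{\text{Proportion with Airway Obstruction in High Exposure Group}}{\text{Proportion with Airway Obstruction in Low Exposure Group}}
= \frac{15/58}{6/58} = 2.5
\]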
Comments:
1. We interpret this number by saying that the risk of airway obstruction is 2.5 times as
high for employees in the High Exposure group as for employees in the Low
Exposure group.
2. A relative risk value of 1.0 is the reference value for making comparisons. That is, a
relative risk of 1.0 says that there is no difference in the two probabilities.
3. The risk difference and relative risk ratio are easily displayed in the following
graphic:
4. If we had alternatively calculated the relative risk ratio as
\[
\text{Relative Risk}
= \frac{P(\text{Airway Obstruction} \mid \text{Low Exposure})}{P(\text{Airway Obstruction} \mid \text{High Exposure})}
= \frac{6/58}{15/58} = 0.40
\]
then the interpretation changes. Now, we say the risk of airway obstruction for
employees in the Low Exposure group is .40 times as high as the risk of airway
obstruction for employees in the High Exposure group.
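The effect of swapping numerator and denominator can be checked with a few lines of Python. This is a minimal sketch using the counts from the popcorn plant table; the function name and argument order are assumptions made for illustration.

```python
def relative_risk(events_num, n_num, events_den, n_den):
    """Risk in the numerator group divided by risk in the denominator group."""
    return (events_num / n_num) / (events_den / n_den)


# High Exposure: 15 of 58 obstructed; Low Exposure: 6 of 58 obstructed.
rr_high_vs_low = relative_risk(15, 58, 6, 58)
rr_low_vs_high = relative_risk(6, 58, 15, 58)

print(f"High vs. Low: {rr_high_vs_low:.2f}")
print(f"Low vs. High: {rr_low_vs_high:.2f}")
print(f"Product of the two: {rr_high_vs_low * rr_low_vs_high:.2f}")
```

The two versions are reciprocals of one another, so they carry the same information; only the reference group in the interpretation changes.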
Odds Ratios:
The relative risk ratio is frequently used when investigating the relationship between two
categorical variables. Although this quantity is relatively easy to calculate and interpret,
statisticians often use another quantity known as an odds ratio in this situation.
Before computing an odds ratio, we need to compute the odds:
Odds: With counts given for two distinct groups (High and Low Exposure), the odds of a
‘Yes’ versus a ‘No’ response are computed within each group as the number of ‘Yes’ events
divided by the number of ‘No’ events. You can also think of this as the probability that
something is true divided by the probability that it is not true.
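The two descriptions of odds (a ratio of counts, or a ratio of probabilities) give the same number. The Python sketch below illustrates this with the High Exposure group from the popcorn plant data (15 with an airway obstruction, 43 without); the function names are illustrative assumptions.

```python
def odds_from_counts(yes_count, no_count):
    """Odds as the number of 'Yes' events divided by the number of 'No' events."""
    return yes_count / no_count


def odds_from_probability(p):
    """Odds as P(true) divided by P(not true)."""
    return p / (1 - p)


yes_count, no_count = 15, 43                 # High Exposure group
p_yes = yes_count / (yes_count + no_count)   # P(obstruction | High)

print(f"Odds from counts:      {odds_from_counts(yes_count, no_count):.4f}")
print(f"Odds from probability: {odds_from_probability(p_yes):.4f}")
```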
Example 7.3: Consider the data from the popcorn plant.
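|                       | Low Exposure | High Exposure | Total |
|-----------------------|--------------|---------------|-------|
| Airway Not Obstructed | 52           | 43            | 95    |
| Airway Obstructed     | 6            | 15            | 21    |
| Total                 | 58           | 58            | 116   |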
Find the odds of having an airway obstruction for both High Exposure and Low Exposure:
\[
\text{Odds of Airway Obstruction for High}
= \frac{\text{Number with Airway Obstruction in High group}}{\text{Number with No Airway Obstruction in High group}}
= \frac{15}{43} \approx 0.349
\]
\[
\text{Odds of Airway Obstruction for Low}
= \frac{\text{Number with Airway Obstruction in Low group}}{\text{Number with No Airway Obstruction in Low group}}
= \frac{6}{52} \approx 0.115
\]
The odds ratio is simply the ratio of the odds for the two groups:
\[
\text{Odds Ratio}
= \frac{\text{Odds of Airway Obstruction for High Exposure}}{\text{Odds of Airway Obstruction for Low Exposure}}
= \frac{15/43}{6/52} \approx 3.02
\]
The interpretation is that the odds of airway obstruction in the High Exposure group are 3.02
times as high as the odds of airway obstruction in the Low Exposure group.
We could also have calculated the odds ratio as follows:
\[
\text{Odds Ratio}
= \frac{\text{Odds of Airway Obstruction for Low Exposure}}{\text{Odds of Airway Obstruction for High Exposure}}
= \frac{6/52}{15/43} \approx 0.33
\]
The interpretation is that the odds of having an airway obstruction in the Low Exposure group
are .33 (approximately 1/3) times as high as those in the High Exposure group.
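Both versions of the odds ratio can be reproduced from the table counts. The following Python sketch assumes the counts from Example 7.3; the variable names are illustrative.

```python
# Counts from the popcorn plant table in Example 7.3.
high_obstructed, high_not_obstructed = 15, 43
low_obstructed, low_not_obstructed = 6, 52

# Odds of airway obstruction within each exposure group.
odds_high = high_obstructed / high_not_obstructed
odds_low = low_obstructed / low_not_obstructed

# Odds ratio with each group in the numerator.
or_high_vs_low = odds_high / odds_low
or_low_vs_high = odds_low / odds_high

print(f"Odds (High) = {odds_high:.3f}, Odds (Low) = {odds_low:.3f}")
print(f"Odds ratio, High vs. Low: {or_high_vs_low:.2f}")
print(f"Odds ratio, Low vs. High: {or_low_vs_high:.2f}")
```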
Comments:
1. An odds ratio of 1.0 implies that there is no observable difference between the two odds.
2. The odds can also be visualized in the following graphic:
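3. The easiest way to find the OR is to use the formula
\[
OR = \frac{ad}{bc}
\]
where a = # of individuals with the “bad thing” and the “risk factor present”, b = # with
the risk factor present but without the “bad thing”, c = # with the “bad thing” but without
the risk factor, and d = # with neither. For the popcorn data, a = 15, b = 43, c = 6, and
d = 52.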
4. We can make the following conclusions from this study:

   • The findings indicate that employees with high exposure were 2.5 times as
     likely to have an airway obstruction as employees with low exposure. In terms of
     odds, the odds of airway obstruction were 3.02 times as high for the group with
     high exposure.
   • This is an observational study (which will be discussed in more detail later in
     Chapter 7). So, while we have evidence the high exposure group has a greater risk
     of airway obstruction, we cannot say for sure that the diacetyl caused it.
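The cross-product formula in Comment 3 gives exactly the same value as the ratio of the two odds. The Python sketch below checks this with exact fractions. It assumes the usual labeling of the 2x2 table (a = obstructed and High Exposure, b = not obstructed and High Exposure, c = obstructed and Low Exposure, d = not obstructed and Low Exposure); this labeling is an assumption, since the handout defines only a explicitly.

```python
from fractions import Fraction

# Assumed cell labels for the popcorn plant data (only a is defined in the notes).
a, b = 15, 43   # High Exposure: obstructed, not obstructed
c, d = 6, 52    # Low Exposure:  obstructed, not obstructed

# Cross-product form: OR = ad / bc.
or_cross_product = Fraction(a * d, b * c)

# Ratio-of-odds form: (a / b) / (c / d).
or_ratio_of_odds = Fraction(a, b) / Fraction(c, d)

print(f"ad/bc         = {or_cross_product} = {float(or_cross_product):.2f}")
print(f"(a/b) / (c/d) = {or_ratio_of_odds} = {float(or_ratio_of_odds):.2f}")
print(f"Forms agree: {or_cross_product == or_ratio_of_odds}")
```

Using `Fraction` avoids floating-point rounding, so the equality check is exact.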
Relative Risk and Odds Ratios in JMP
Example 7.4: We can use JMP to calculate these quantities for the popcorn plant data.
Select Analyze > Fit Y by X. Move Airway Obstruction to the Y, Response box and
Exposure to the X, Factor box. From the red drop-down arrow next to Contingency
Analysis, select Relative Risk. Select Airway Obstruction as your response category of
interest, and use High exposure in the numerator.
JMP then displays the relative risk:
If you select Odds Ratio from the same red drop-down arrow, JMP displays the following:
Inferential Methods: Confidence Intervals For Relative Risk
Note that in the previous section, JMP returned 95% confidence intervals for both the relative
risk and the odds ratio.
Questions:
1. Find the endpoints for the 95% confidence interval for the relative risk.
2. Find the endpoints for the 95% confidence interval for the odds ratio.
3. Interpret each of these intervals.
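For readers without the JMP output at hand, approximate 95% intervals can be sketched in Python using the common large-sample (Wald) method on the log scale. This is an assumption about method: the notes do not say how JMP computes its intervals, so JMP's endpoints need not match these exactly. The helper name `log_scale_ci` and the cell labels a, b, c, d are illustrative.

```python
import math


def log_scale_ci(estimate, se_log, z=1.96):
    """Wald interval built on the log scale, then transformed back.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    center = math.log(estimate)
    return math.exp(center - z * se_log), math.exp(center + z * se_log)


# Popcorn plant counts (cell labels are an illustrative convention).
a, b = 15, 43   # High Exposure: obstructed, not obstructed
c, d = 6, 52    # Low Exposure:  obstructed, not obstructed

# Relative risk and the large-sample standard error of its logarithm.
rr = (a / (a + b)) / (c / (c + d))
se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
rr_ci = log_scale_ci(rr, se_log_rr)

# Odds ratio and the large-sample standard error of its logarithm.
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
or_ci = log_scale_ci(odds_ratio, se_log_or)

print(f"Relative risk = {rr:.2f}, approx. 95% CI = ({rr_ci[0]:.2f}, {rr_ci[1]:.2f})")
print(f"Odds ratio    = {odds_ratio:.2f}, approx. 95% CI = ({or_ci[0]:.2f}, {or_ci[1]:.2f})")
```

When interpreting either interval, the key question is whether it contains 1.0, the reference value indicating no difference between the two groups.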