part 1

STAT 305: Chapter 7 – Methods for Analyzing Two or More Categorical Variables
Spring 2014

This chapter will introduce methods for describing the relationship between two or more categorical variables. In addition, we will discuss several inferential procedures used for analyzing data of this type. To begin, consider the following example.

Example 7.1: In May of 2000, eight people who had worked at the same microwave popcorn production plant reported to the Missouri Department of Health with fixed obstructive lung disease. These workers had become ill between 1993 and 2000 while employed at the plant. Because of these cases, researchers began conducting medical examinations and environmental surveys of workers employed at the plant in November of 2000 to assess their occupational exposure to certain compounds.

Part of this study involved measuring the forced vital capacity (FVC) of the current employees (this is the volume of air that can be maximally, forcefully exhaled). The study consisted of 116 participants, and the FVC screening indicated that 21 employees had an airway obstruction.

In addition, the popcorn plant was broken into several areas (the flavor-mixing room, packaging room, etc.). Air and dust samples in each area were measured to determine the exposure to diacetyl, a marker of organic-chemical exposure. Then, the average exposure for each study participant was determined by taking into account how long they spent at different jobs within the plant and the average exposure in that job area. Finally, they were classified as having either “low” or “high” exposure. The data (found in the file PopcornPlant.JMP) are summarized below.

Source: The data and example are from “Investigating Statistical Concepts, Applications, and Methods” by Allan Rossman and Beth Chance, Preliminary Edition. 2005. Brooks/Cole Thomson Learning.
Descriptive Methods For Two Categorical Variables

To describe the relationship between two categorical variables, we usually use contingency tables and mosaic plots.

Contingency Table: A table showing the joint frequencies of two categorical variables. The rows of the table denote the categories of the first variable, and the columns denote the categories of the second variable.

Mosaic plot: This plot gives a visual representation of the relationship between two categorical variables. A mosaic plot presents the frequencies of combinations of categories of two variables.

The JMP output for this example is given below. To get these summaries, select Analyze > Fit Y by X and place Exposure in the X, Factor box and Airway Obstruction in the Y, Response box.

Questions:
1. Using the contingency table, find the following marginal probability: P(High Exposure) =
2. Using the contingency table, find the following marginal probability: P(Airway Obstruction) =
3. Using the contingency table, find the joint probability that someone is in the High Exposure group and has an Airway Obstruction: P(High Exposure and Airway Obstruction) =
4. Find the following joint probability: P(Low Exposure and No Airway Obstruction) =
5. Using the contingency table, find the following conditional probability. Given that an individual is in the High Exposure group, what is the probability they have an Airway Obstruction? P(Airway Obstruction | High Exposure) =
6. Find the following conditional probability: P(Airway Obstruction | Low Exposure) =

ADDITIONAL DESCRIPTIVE MEASURES: RELATIVE RISK AND ODDS RATIOS

Other summaries that are often computed when investigating the relationship between two categorical variables are the risk difference, the relative risk ratio, and the odds ratio.
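
These summaries are built from conditional probabilities like those in the questions above. As a rough sketch outside the JMP workflow used in these notes, the marginal, joint, and conditional probabilities can be computed in Python from the four cell counts of the popcorn plant table (52, 6, 43, and 15, as listed in Example 7.2). The dictionary layout and the function names are illustrative assumptions, not part of the course materials.

```python
# Cell counts from the popcorn plant contingency table (Example 7.2).
# Keys are (exposure level, airway status); this layout is an assumption
# made for illustration only.
counts = {
    ("Low", "Not Obstructed"): 52,
    ("Low", "Obstructed"): 6,
    ("High", "Not Obstructed"): 43,
    ("High", "Obstructed"): 15,
}
total = sum(counts.values())  # all study participants


def marginal_exposure(level):
    """P(exposure level): row total divided by the grand total."""
    row_total = sum(n for (lvl, _), n in counts.items() if lvl == level)
    return row_total / total


def marginal_outcome(outcome):
    """P(airway status): column total divided by the grand total."""
    col_total = sum(n for (_, out), n in counts.items() if out == outcome)
    return col_total / total


def joint(level, outcome):
    """P(exposure level and airway status): one cell over the grand total."""
    return counts[(level, outcome)] / total


def conditional(outcome, level):
    """P(airway status | exposure level): one cell over its row total."""
    row_total = sum(n for (lvl, _), n in counts.items() if lvl == level)
    return counts[(level, outcome)] / row_total


print("P(High Exposure)              =", round(marginal_exposure("High"), 4))
print("P(Obstructed)                 =", round(marginal_outcome("Obstructed"), 4))
print("P(High and Obstructed)        =", round(joint("High", "Obstructed"), 4))
print("P(Low and Not Obstructed)     =", round(joint("Low", "Not Obstructed"), 4))
for level in ("High", "Low"):
    print(f"P(Obstructed | {level} Exposure) =",
          round(conditional("Obstructed", level), 4))
```

The only difference between a joint and a conditional probability here is the denominator: the joint probability divides a cell count by the grand total, while the conditional probability divides the same cell count by the total for the group being conditioned on.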
Risk Difference and Relative Risk

Example 7.2: Consider the data from the popcorn plant.

                 Airway Not Obstructed   Airway Obstructed   Total
Low Exposure              52                     6             58
High Exposure             43                    15             58
Total                     95                    21            116

We have seen that P(Airway Obstruction | High Exposure) is higher than P(Airway Obstruction | Low Exposure). Since these conditional probabilities differ, it appears that there may be an association between level of exposure and having an airway obstruction. One way to compare the two groups (High and Low Exposure) is to look at the risk difference in these two probabilities.

Risk Difference: This is simply the difference in the conditional probabilities:

P(Airway Obstruction | High Exposure) - P(Airway Obstruction | Low Exposure) =

Questions:
1. Does this seem like a large difference to you?
2. Suppose the two conditional probabilities of interest had been .95 and .79, instead. Does this seem like a large difference to you?

Note that for these data, P(Airway Obstruction | High Exposure) was more than TWICE AS LARGE as P(Airway Obstruction | Low Exposure). Since this seems like an important feature to describe, we will compare the two groups based on relative risk instead of risk difference.

Relative Risk: This is a measure of how much a particular risk factor influences the risk of a specified outcome. For the popcorn data, we calculate the relative risk as follows:

Relative Risk = P(Airway Obstruction | High Exposure) / P(Airway Obstruction | Low Exposure)
              = (Proportion with Airway Obstruction in High Exposure Group) / (Proportion with Airway Obstruction in Low Exposure Group)
              =

Comments:
1. We interpret this number by saying that the risk of airway obstruction is 2.5 times as high for employees in the High Exposure group as for employees in the Low Exposure group.
2. A relative risk value of 1.0 is the reference value for making comparisons.
That is, a relative risk of 1.0 says that there is no difference in the two probabilities.
3. The risk difference and relative risk ratio are easily displayed in the following graphic:
4. If we had alternatively calculated the relative risk ratio as

Relative Risk = P(Airway Obstruction | Low Exposure) / P(Airway Obstruction | High Exposure) =

then the interpretation changes. Now, we say the risk of airway obstruction for employees in the Low Exposure group is 0.40 times as high as the risk of airway obstruction for employees in the High Exposure group.

Odds Ratios:

The relative risk ratio is frequently used when investigating the relationship between two categorical variables. Although this quantity is relatively easy to calculate and interpret, statisticians often use another quantity known as an odds ratio in this situation. Before computing an odds ratio, we need to compute the odds:

Odds: With counts given for two distinct response categories (‘Yes’ and ‘No’) within each group (here, High and Low Exposure), the odds of a ‘Yes’ versus a ‘No’ are computed as the number of ‘Yes’ events divided by the number of ‘No’ events in that group. You can also think of this as the probability that something is true divided by the probability that it is not true.

Example 7.3: Consider the data from the popcorn plant.
                        Low Exposure   High Exposure   Total
Airway Not Obstructed        52             43           95
Airway Obstructed             6             15           21
Total                        58             58          116

Find the odds of having an airway obstruction for both High Exposure and Low Exposure:

Odds of Airway Obstruction for High = (Number with Airway Obstruction in High group) / (Number with No Airway Obstruction in High group) =

Odds of Airway Obstruction for Low = (Number with Airway Obstruction in Low group) / (Number with No Airway Obstruction in Low group) =

The odds ratio is simply the ratio of the odds for the two groups:

Odds Ratio = (Odds of Airway Obstruction for High Exposure) / (Odds of Airway Obstruction for Low Exposure) =

The interpretation is that the odds of airway obstruction in the High Exposure group are 3.02 times as high as the odds of airway obstruction in the Low Exposure group.

We could also have calculated the odds ratio as follows:

Odds Ratio = (Odds of Airway Obstruction for Low Exposure) / (Odds of Airway Obstruction for High Exposure) =

The interpretation is that the odds of having an airway obstruction in the Low Exposure group are 0.33 (approximately 1/3) times as high as those in the High Exposure group.

Comments:
1. An odds ratio of 1.0 implies that there is no observable difference between the two odds.
2. The odds can also be visualized in the following graphic:
3. The easiest way to find the OR is to use the formula OR = (a × d) / (b × c), where a = # of individuals with the “bad thing” and the “risk factor present”, d = # of individuals with neither, and b and c are the two remaining (off-diagonal) counts in the 2 × 2 table.
4. We can make the following conclusions from this study:
• The findings indicate that employees with high exposure were 2.5 times as likely to have an airway obstruction as employees with low exposure. Or, the odds of airway obstruction were 3.02 times as high for the group with high exposure.
• This is an observational study (which will be discussed in more detail later in Chapter 7).
So, while we have evidence the high exposure group has a greater risk of airway obstruction, we cannot say for sure that the diacetyl caused it.

Relative Risk and Odds Ratios in JMP

Example 7.4: We can use JMP to calculate these quantities for the popcorn plant data. Select Analyze > Fit Y by X. Move Airway Obstruction to the Y, Response box and Exposure to the X, Factor box.

From the red drop-down arrow next to Contingency Analysis, select Relative Risk. Select Airway Obstruction as your response category of interest, and use High exposure in the numerator.

JMP then displays the relative risk:

If you select Odds Ratio from the same red drop-down arrow, JMP displays the following:

Inferential Methods: Confidence Intervals For Relative Risk

Note that in the previous section, JMP returned 95% confidence intervals for both the relative risk and the odds ratio.

Questions:
1. Find the endpoints for the 95% confidence interval for the relative risk.
2. Find the endpoints for the 95% confidence interval for the odds ratio.
3. Interpret each of these intervals.
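
The point estimates in this chapter can be checked by hand from the four cell counts. The Python sketch below computes the risk difference, relative risk, and odds ratio, and then a large-sample 95% interval for each ratio on the log scale. The interval formulas are a common textbook method and are an assumption here: these notes do not say which method JMP uses, so the endpoints may differ slightly from JMP's output. The labels a, b, c, and d follow the OR formula above with a = 15; the assignment of b and c, and all variable and function names, are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# 2 x 2 counts from the popcorn plant table.
# a = "bad thing" with risk factor present; b, c, d labels are assumptions.
a, b = 15, 43   # High Exposure: obstructed, not obstructed
c, d = 6, 52    # Low Exposure:  obstructed, not obstructed

risk_high = a / (a + b)   # P(Airway Obstruction | High Exposure)
risk_low = c / (c + d)    # P(Airway Obstruction | Low Exposure)

risk_difference = risk_high - risk_low
relative_risk = risk_high / risk_low
odds_ratio = (a * d) / (b * c)

# Large-sample standard errors on the log scale (assumed method).
se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Standard normal critical value for a 95% interval.
z = NormalDist().inv_cdf(0.975)


def log_scale_interval(estimate, se_log):
    """Interval for a ratio: build it for log(estimate), then exponentiate."""
    center = log(estimate)
    return exp(center - z * se_log), exp(center + z * se_log)


rr_interval = log_scale_interval(relative_risk, se_log_rr)
or_interval = log_scale_interval(odds_ratio, se_log_or)

print("Risk difference:", round(risk_difference, 4))
print("Relative risk:  ", round(relative_risk, 2), "95% CI:",
      tuple(round(x, 2) for x in rr_interval))
print("Odds ratio:     ", round(odds_ratio, 2), "95% CI:",
      tuple(round(x, 2) for x in or_interval))
```

When interpreting either interval, the key question is whether it contains 1.0, the reference value for "no difference" between the two exposure groups.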
