AFYA BORA CONSORTIUM
GLOBAL HEALTH LEADERSHIP FELLOWSHIP PROGRAM
Research Methods Distance Learning
Research Methods Module

Module Instructors:

Brandon Guthrie, PhD
Acting Instructor, Department of Epidemiology
University of Washington
Email: bguth@uw.edu
Skype: brguth

Carey Farquhar, MD, MPH
Associate Professor, Departments of Medicine, Epidemiology, and Global Health
Director, Afya Bora Fellowship in Global Health Leadership
Email: cfarq@uw.edu
Skype: careyfarquhar

Table of Contents

Course Structure
Learning Objectives
    Introduction to Epidemiologic Methods and Quantitative Research
    Introduction to Statistical Decision Making
    Epidemiologic Study Designs
    Causation, Bias, and Confounding
    Measurement, Classification, and Misclassification
    Data Management Practices in Health Research
    Interpretation of Epidemiologic Studies and Decision Making
    Qualitative Research Methods
    Analyzing Qualitative Data and Public Health Applications
Course Schedule
Appendix 1: List of Lecturers
Appendix 2: Review Questions
    Lecture 1: Introduction to Epidemiologic Methods and Quantitative Research
    Lecture 2: Introduction to Statistical Decision Making
    Lecture 3: Epidemiologic Study Designs
    Lecture 4: Bias, Confounding, and Effect Modification
    Lecture 5: Measurement, Classification, and Misclassification
    Lecture 6: Data Management Practices in Health Research
    Lecture 7: Interpretation of Epidemiologic Studies and Decision Making
    Lecture 8: Multiple Variable Regression Models in Epidemiology
    Lecture 9: Qualitative Research Methods
    Lecture 10: Analyzing Qualitative Data and Public Health Applications

Course Structure:

To successfully complete the Research Methods module, you will need to watch each lecture and complete the associated quiz in accordance with the course schedule. Each recorded lecture is available online through the TREE Distance Learning portal (http://www.tree4health.org/distancelearning/). You will be assigned a username and password allowing you to log onto the portal.
Once you are logged on, click on the Research Methods Module in the Learning Modules box. In the Research Methods Module, you can monitor your progress through the module and navigate between lectures. For each lecture, you can view the recorded session, download the session for later viewing, and download the slides and associated material. After viewing each lecture, students should complete the associated quiz. Course Instructors will be available via Skype bi-weekly (schedule TBD) to discuss each topic. You are limited to 1 attempt on each quiz. If you achieve an aggregate average score of 70% or greater on the quizzes, you will be eligible to take the final exam.

The Research Methods module consists of
A. 10 one-hour lecture sessions,
B. 10 quizzes, and
C. 1 final exam.

Learning Objectives:

Introduction to Epidemiologic Methods and Quantitative Research
1. Give an example of a disease that is distributed unevenly in a population and what the distribution might tell you about the cases of disease.
2. Define prevalence and incidence and describe the steps to measure each in a typical epidemiologic study.
3. Answer the question: How do you compare disease risk between two groups and how do you interpret these comparisons?
4. Summarize the principles for inferring causal relationships from epidemiologic data.

Introduction to Statistical Decision Making
1. List and describe the standard measures of location and spread.
2. Give examples of how graphical displays of data can be used to supplement formal statistical analysis.
3. Understand the relationship of hypothesis testing and independence of data.
4. Answer the question: What is a p-value and how is it used to assess the strength of statistical associations?

Epidemiologic Study Designs
1. Compare and contrast cohort and case-control studies and provide examples of when each study design would be appropriate and preferred.
2. Answer the question: What are the advantages and disadvantages of matching in epidemiologic studies?
3. Answer the question: What are the primary strengths and weaknesses of randomized trials?
4. Answer the question: How do ecological studies differ from the other types of epidemiologic studies?

Causation, Bias, and Confounding
1. List the criteria that allow epidemiologists to assess causal relationships between exposure and disease.
2. Define bias in epidemiologic studies and describe the main categories of bias.
3. Describe the most common strategies for controlling confounding.

Measurement, Classification, and Misclassification
1. Give an example of how the research question of interest will dictate how subjects are classified in terms of exposure and disease.
2. Compare and contrast the impacts of non-differential and differential (selective) misclassification.
3. Define and describe how to calculate sensitivity, specificity, positive predictive value, and negative predictive value.

Data Management Practices in Health Research
1. Describe how study design will influence data management strategies.
2. Give examples of data entry techniques that minimize errors.
3. Outline quality control measures that can improve data quality.

Interpretation of Epidemiologic Studies and Decision Making
1. Understand how to interpret the various measures of test performance.
2. Explain how evidence from observational studies can be used to infer causal relations between exposures and disease incidence.
3. Describe the criteria that should be used when deciding if a screening test should be used to detect disease.

Qualitative Research Methods
1. Define phenomenology and grounded theory methods and provide examples of how these methods can be used to address a public health question.
2. Provide a data collection strategy that could be used in a qualitative research study.
3. Compare and contrast quantitative and qualitative research methods.

Analyzing Qualitative Data and Public Health Applications
1. Provide strategies for managing qualitative data.
2. Define coding and differentiate between types of codes.
3. Illustrate how qualitative data is presented in a paper.

Course Schedule:

Lecture  Topic                                                             Quiz Due Date
1        Introduction to Epidemiologic Methods and Quantitative Research   November 9th
2        Introduction to Statistical Decision Making                       November 9th
3        Epidemiologic Study Designs                                       November 16th
4        Causation, Bias, and Confounding                                  November 23rd
5        Measurement, Classification, and Misclassification                November 23rd
6        Data Management Practices in Health Research                      November 30th
7        Interpretation of Epidemiologic Studies and Decision Making       November 30th
8        Multiple variable regression models in epidemiology               December 7th
9        Qualitative Research Methods                                      December 14th
10       Analyzing Qualitative Data and Public Health Applications
FINAL EXAM DUE December 14th
*Skype Session November 13th November 27th December 11th December 18th December 21st

*Course instructors will be available via Skype bi-weekly on Tuesday @ 9:00AM PST / 8:00PM EAT

Appendix 1: List of Lecturers

Carey Farquhar, MD, MPH
Associate Professor
Departments of Medicine, Epidemiology, and Global Health
cfarq@uw.edu
Lecture 1

Barbra Richardson, PhD
Research Professor
University of Washington, Department of Biostatistics
barbrar@uw.edu
Lecture 2

Lisa Manhart, PhD
Associate Professor
University of Washington Departments of Epidemiology and Global Health
lmanhart@uw.edu
Lecture 3

Victoria Holt, PhD
Professor
University of Washington Department of Epidemiology
vholt@uw.edu
Lecture 4

Brandon Guthrie, PhD
Acting Instructor
University of Washington Department of Global Health
brguh@uw.edu
Lecture 5 and 6

Noel Weiss, MD, DrPH
Professor
University of Washington Department of Epidemiology
nweiss@uw.edu
Lecture 7

Romel Mackelprang, PhD
Senior Fellow
University of Washington Department of Global Health
romelm@uw.edu
Lecture 8

Michele Andrasik, PhD
Acting Assistant Professor
University of Washington Department of Psychiatry and Behavioral Science
mandrasik@fhcrc.org
Lecture 9

Kate Murray, MPH
University of Washington, Center for AIDS Research
krmurray@u.washington.edu
Lecture 10

Appendix 2: Review Questions

INSTRUCTIONS: Review questions are provided for each of the 10 lectures included in this module. The relevant questions should be presented after each lecture in quiz format. Participants should answer at least 70% of questions correctly to successfully complete this module. The instructor can choose to use a subset of the questions for quizzes and use the remaining questions for a final exam at the end of the module. Before presenting quiz and exam questions, the instructor should remove the answers and explanations. Discussion sessions can be organized to discuss the review questions after all participants have taken the quiz.

Lecture 1: Introduction to Epidemiologic Methods and Quantitative Research

1) A fellow researcher wants to compare the incidence of death from TB among HIV-1-infected women between Nairobi and Kisumu. She finds that 2,324 women with HIV died from TB in Nairobi in 2009 and 927 women with HIV died from TB in Kisumu in 2009. In order to compare the incidence of death between the two cities, which additional denominator values does she need to collect?
A. The total populations of Nairobi and Kisumu in 2009
B. The number of women infected with Mycobacterium tuberculosis who were living in Nairobi and Kisumu in 2009
C. The number of women co-infected with HIV and M. tuberculosis who were living in Nairobi and Kisumu in 2009
D. The number of HIV-1-infected women who were living in Nairobi and Kisumu in 2009
ANSWER: D
EXPLANATION: The value of interest here is the “incidence of death from TB among HIV-1-infected women.” Therefore, the numerator is the number of deaths attributable to TB among women infected with HIV during 2009, and the denominator is the total number of HIV-1-infected women living in the two cities during 2009.

2) A recent study published in the Journal of the American Medical Association found that approximately 1 in 4 American women (age 14 to 59 years) are infected with HPV. This estimate is an example of:
A. Incidence rate
B. Incidence number
C. Prevalence
D. Proportionate mortality
ANSWER: C
EXPLANATION: This is a measure of prevalence because the researchers assessed current, rather than new, infections. If the researchers had started with a group of women without HPV and monitored them over time for acquisition of HPV, then they would have been measuring incidence.

3) Studies demonstrate that cigarette smoking increases the risk of heart disease. In a large study, Dr. Cardio found that the annual incidence of heart disease was 32 per 1,000 among those with 20 pack-years of smoking at baseline (i.e., heavy smoking) and 10 per 1,000 among those who never smoked. Based on this information, calculate the relative risk of heart disease due to heavy smoking.
A. 32 / 10 = 3.2
B. 32 - 10 = 22
C. 32 / (32+10) = 0.76
D. Cannot be determined with the given information
ANSWER: A
EXPLANATION: The relative risk is calculated by dividing the incidence in the exposed by the incidence in the unexposed. In this example, the incidence in the exposed (heavy smokers) is 32 per 1,000 and the incidence in the unexposed (never smokers) is 10 per 1,000. Thus, RR = (32/1000) / (10/1000) = 32 / 10 = 3.2

4) The following table shows the number of new cases of whooping cough (Pertussis) by age groups, 2005.

Age Group    Mid-year Population    Number of New Cases
0-5          1,643                  231
6-9          1,427                  195
10-14        1,019                  460
15-24        783                    965
25-54        3,570                  452
55+          1,836                  101
Total        10,278                 2,404

The annual incidence for the 10-14 age group was:
A. (3,570/10,278)*100 = 34.7 per 100.
B. (460/10,278)*100 = 4.5 per 100.
C. (460/1,019)*100 = 45.1 per 100.
D. (460/3,570)*100 = 12.9 per 100.
E. (2,404/10,278)*100 = 23.4 per 100.
ANSWER: C
EXPLANATION: In this example, we will use the mid-year population in 2005 as our best estimate of the number of people “at risk” of acquiring whooping cough, from which the incident cases arose. We are looking for the incidence among those 10-14 years of age, and therefore we will restrict the number of new cases (460) and the number “at risk” (1,019) to this age group. Therefore, the incidence is 460/1,019 * 100 = 45.1 per 100

5) The finding that the risk of cervical cancer increases with the number of lifetime sex partners contributed to the understanding that cervical cancer is caused by a sexually transmitted infection. This finding contributes to causal inference because it best demonstrates:
A. Consistency (replication of findings)
B. Biological gradient (dose response)
C. Strength of association
D. Temporal order
ANSWER: B
EXPLANATION: One criterion that we use to draw causal inference about an exposure-disease relationship is the observation that higher levels of exposure are associated with a higher likelihood of disease. While the presence or absence of a clear dose-response relationship is not definitive, it is an important component of causal inference. In this example, we observe that the number of lifetime sex partners is associated with the likelihood that a woman develops cervical cancer. The number of lifetime partners is associated with the risk of acquiring a sexually transmitted disease. Therefore, this observed relationship provides support for the hypothesis that cervical cancer is caused by a sexually transmitted infection.
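
The incidence and relative risk calculations from questions 3 and 4 above can be checked with a short script. This is only an illustrative sketch: the function names incidence_per and relative_risk are hypothetical helpers introduced here, not part of the course material, and the inputs are the figures quoted in those two questions.

```python
def incidence_per(new_cases, population_at_risk, per=100):
    """Incidence proportion scaled to `per` people at risk (hypothetical helper)."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases / population_at_risk * per


def relative_risk(incidence_exposed, incidence_unexposed):
    """Incidence in the exposed divided by incidence in the unexposed (hypothetical helper)."""
    if incidence_unexposed <= 0:
        raise ValueError("incidence_unexposed must be positive")
    return incidence_exposed / incidence_unexposed


# Question 4: 460 new cases among a mid-year population of 1,019 (ages 10-14).
pertussis_incidence = incidence_per(460, 1019)
print(round(pertussis_incidence, 1))  # 45.1 per 100

# Question 3: 32 per 1,000 in heavy smokers vs 10 per 1,000 in never smokers.
rr_smoking = relative_risk(32 / 1000, 10 / 1000)
print(round(rr_smoking, 1))  # 3.2
```

Keeping the denominator as an explicit argument makes the choice of the population "at risk" visible, which is the point both questions are testing.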

6) You are a clinician treating HIV patients. You are planning for the next year, and you are trying to decide how many clinical staff members you will need to work with patients as they start ART. This can be a time-consuming process, and you need to know how to plan your resources. To answer this question, which of the following would you most want to know?
A. The prevalence of HIV patients in your clinic who are on ART.
B. The incidence rate of HIV patients in your clinic starting ART.
ANSWER: B
EXPLANATION: In this example, you are attempting to plan for the number of people who will be starting ART. Therefore you are interested in the incidence of starting ART.

7) Currently, patients with HIV who have a CD4 count <250 are recommended to start antiretroviral therapy (ART). You are on a Ministry of Health committee that is deciding if the CD4 criteria for starting ART should change from 250 to 350. As part of your decision-making process, you want to know how many people would be affected by this change. To answer this question, which of the following would you most want to know?
A. The prevalence of HIV patients with a CD4 count between 250 and 350.
B. The incidence rate of HIV patients dropping below a CD4 count of 350.
ANSWER: A
EXPLANATION: The prevalence tells you how many people currently have a CD4 count between 250 and 350. These are the people that would be affected by the change in guidelines. The incidence would tell you the rate at which people drop below 350, but would not tell you how many people would be affected.

8) A study has just been completed in which the researchers investigated if HIV disease progression could be improved by providing patients with bed nets to reduce malaria. A total of 1,000 patients with HIV and a CD4 count between 350 and 450 were recruited, 500 of whom were randomly assigned to receive an insecticide treated bed net.
After 3 years of follow-up with perfect retention, 173 of the 500 patients who received a bed net had progressed to AIDS while 221 of the 500 patients who did not receive a bed net had progressed to AIDS. What is the relative risk of progression to AIDS associated with using a bed net?
A. (173*279) / (327*221) = 0.67
B. (173/500) / (221/500) = 0.78
C. (173/500) - (221/500) = -0.096
D. (327/500) / (279/500) = 1.17
E. (173/500) / (279/500) = 0.62
ANSWER: B
EXPLANATION: A 2x2 table can be constructed to summarize the data. In this example, the outcome, or disease, is progression to AIDS. The exposure is receiving an insecticide treated bed net.

              Disease +    Disease -    Total
Exposure +    a = 173      b = 327      a+b = 500
Exposure -    c = 221      d = 279      c+d = 500

RR = [a / (a+b)] / [c / (c+d)] = (173/500) / (221/500) = 0.78

Lecture 2: Introduction to Statistical Decision Making

1) A box plot allows you to look at which of the following?
A. Sample median
B. Sample spread
C. Potential outliers
D. All of the above
ANSWER: D
EXPLANATION: A boxplot is a succinct way of presenting continuous data. It shows both the “central tendency” of the data with the median, as well as the spread with the 25th and 75th percentiles and the upper and lower whiskers. The investigator can also determine if the data are skewed and if there are any extreme outliers. The figure below details the information provided in a box plot.

[Figure: annotated box plot. From top to bottom: outliers; largest value ≤ Q3 + (1.5 * IQR); Q3: 75th percentile; Q2: median; Q1: 25th percentile; smallest value ≥ Q1 - (1.5 * IQR); outliers.]

2) Based ONLY on the figure below, what do you conclude about the relationship between CD4 count and viral load?

[Figure: scatter plot of log viral load (vertical axis) against CD4 count (horizontal axis).]

A. There is no relationship between CD4 count and viral load
B. Increases in CD4 count are associated with increases in viral load
C. Increases in CD4 count are associated with decreases in viral load
D. There is a causal relationship between CD4 count and viral load
ANSWER: C
EXPLANATION: There is an overall trend in the data from the upper left to the lower right. When we inspect the axes of the plot, we conclude that the upper left represents patients with low CD4 counts and high viral loads, while the lower right represents patients with high CD4 counts and low viral loads. Based only on this figure, we cannot assess the causal relationship between viral load and CD4 count, or even if this is a statistically significant association, but from this visual inspection, we get a sense of the relationship between these two variables, providing a starting point for further investigation.

3) After conducting a study investigating the potential relationship between daily Septrin use and HIV disease progression, you find a relative risk (RR) of 0.82 with a p-value of 0.11. What can you conclude about the relationship between Septrin use and disease progression?
A. There is no relationship between daily Septrin use and disease progression
B. Daily Septrin use slows disease progression
C. There is insufficient evidence to reject the null hypothesis that there is no relationship between daily Septrin use and disease progression
D. The study was underpowered to detect a true relationship between daily Septrin use and disease progression
ANSWER: C
EXPLANATION: The point estimate for the relative risk for the relationship between Septrin use and disease progression is 0.82, which indicates those on Septrin are less likely to experience disease progression; however, the p-value for this relative risk is 0.11. Therefore, we cannot reject the null hypothesis that there is no association (i.e., RR = 1).
There are two possible explanations for this finding: (1) the true relative risk is less than 1, but there was insufficient power to show a statistically significant difference, or (2) the true relative risk is 1, and we observe the value of 0.82 only by random chance. Based on the information provided, we cannot determine if there was adequate power to detect a relative risk of 0.82.

4) You hypothesize that the viral load in population A is higher than the viral load in population B. Which measure should you use to summarize this difference?
A. Mean
B. Variance
C. Range
D. Power
ANSWER: A
EXPLANATION: In this example, we are interested in a measure of location. Of the options available, only the mean is a measure of location. Both the variance and the range are measures of spread. Power is calculated when planning a study to determine the probability of finding a significant difference, assuming a true difference of a given magnitude.

5) You know that drug A and drug B have the same mean effect on viral load, but you hypothesize that there is more variability in the effect of drug A compared to drug B. You analyze the results from a randomized trial in which 500 patients received drug A and 200 patients received drug B. Which of the following measures should you use to investigate if your hypothesis might be true?
A. Median
B. Variance
C. Minimum
D. Mode
ANSWER: B
EXPLANATION: We are interested in the variability of the effect of drugs A and B. We may suspect that, while the drugs have the same effect overall, the consistency of the effect differs between the two drugs. In the hypothetical figure below, both drugs have the same median effect, but the variability of Drug A is less than Drug B, indicated by the smaller interquartile range. Among the measures provided as options, only the variance is a measure of spread/variability.

[Figure: box plots of change in viral load for Drug A and Drug B.]

6) Which group has the higher median CD4 count?

[Figure: box plots of CD4 count (vertical axis, 0 to 300) for Group A and Group B.]

ANSWER: A
EXPLANATION: The line passing through the middle of each box represents the median of the distributions for the two groups. Thus, the median CD4 count for Group A is approximately 220 cells/μL and the median for Group B is approximately 140 cells/μL. Therefore, the median CD4 count is higher for Group A.

Lecture 3: Epidemiologic Study Designs

1) You are interested in investigating if HSV-2 infection is associated with acquisition of HIV in women. To address this question you design an observational prospective cohort study. Which of the following describes how you would carry out this study?
A. Recruit a group of women with HIV and a group of women without HIV and test the women in each group for HSV-2 infection. Compare the proportion of HIV-infected women who have HSV-2 with the proportion of HIV-uninfected women with HSV-2.
B. Identify women without HIV and separate the women into those who are infected with HSV-2 and those without HSV-2. Then follow the women in each group for 2 years, testing them each month for HIV infection. Compare the rate of HIV infection in the HSV-2 infected and uninfected groups.
C. Identify a group of women without HIV who are all infected with HSV-2. Randomize half of the women to receive Acyclovir (a drug that suppresses HSV-2) and the other half to receive a placebo (no active drug). Follow the women for 2 years and compare the rate of HIV acquisition between the women on Acyclovir and those on placebo.
D. None of the above describe a prospective cohort study.
ANSWER: B
EXPLANATION: You are designing a prospective cohort study. Prospective means that the follow-up of participants will occur after initiation of the study. A cohort study is an observational study design where you start by identifying participants with and without the exposure of interest and then follow them up for the outcome.
In the correct answer above, you will first identify women with and without HSV-2 infection (i.e., the exposure of interest) and then follow them for the acquisition of HIV (i.e., the outcome of interest).

2) Which of the following is NOT TRUE regarding reasons to choose a randomized study design?
A. Randomization minimizes confounding
B. Causal inference is easier in randomized studies
C. It is always possible to randomly assign exposure
D. Randomized studies are generally easier to analyze and interpret
ANSWER: C
EXPLANATION: One of the primary reasons to use a randomized trial design is to gain maximum control of confounding by randomly assigning participants to the exposure groups. Therefore, causal inference can be drawn from randomized trials because there should be no confounding factors obscuring the exposure-disease relationship. Unfortunately, it is not possible to randomly assign all exposures. It is unethical to assign participants to receive an exposure that is known to cause harm (e.g., smoking). It is also impractical to investigate some exposure-disease relationships using a randomized trial because the time between exposure and disease onset is long, or because the frequency of disease, even among those exposed, is very low. Because of the control of confounding through randomization, randomized trials are generally easier to analyze and interpret.

3) Multiple observational studies have shown evidence that male circumcision can reduce the risk of acquiring HIV. Which of the following is a reason why randomized trials were necessary before recommending circumcision as an HIV prevention intervention?
A. In this example, observational studies may not fully account for confounding and may not accurately reflect the true relationship between male circumcision and HIV acquisition. Randomized trials were needed to control confounding.
B. Observational studies should only be used for exploratory studies and should not be used to guide public health practice.
C. The observational studies did not allow for enough time between circumcision and HIV infection.
D. Randomized trials were not necessary. The observational studies established the causal association.
ANSWER: A
EXPLANATION: In Africa, circumcision is highly culturally defined, such that some cultures and religions prescribe that all males be circumcised, while other cultures or religions never implement circumcision. Other sexual behaviors associated with higher or lower HIV risk are also highly associated with cultural or religious membership. Therefore, it is very difficult to fully control for confounders of the relationship between circumcision and HIV risk. An intervention recommending that all men be circumcised to reduce HIV risk requires a high degree of evidence supporting its effectiveness because of the potential risks of circumcision and the large scope of the intervention.

4) Which of the following is NOT an advantage of a cohort study over a case-control study?
A. It is possible to calculate incidence rates from a cohort study but not a case-control study.
B. A cohort study is more efficient than a case-control study for investigating rare diseases.
C. A cohort study is more efficient than a case-control study for investigating rare exposures.
D. A cohort study can be used to investigate more than one outcome (disease) while a case-control study can only investigate one pre-specified outcome.
ANSWER: B
EXPLANATION: Cohort studies begin with a group of participants with a given exposure and a group without that exposure. Both groups are followed up for the outcome(s) of interest. Because the distribution of the outcome is not manipulated, it is possible to calculate the incidence of disease in both groups and to directly calculate the relative risk.
Cohort studies are efficient for studying rare exposures because the researcher can specify the number of exposed and unexposed participants, but this design is inefficient for investigating rare outcomes because it would require a very large sample size to achieve a sufficient number of outcomes to reach statistical significance.

5) An outbreak of cholera has occurred in a village of 312 people. Investigators find that residents of the village get their water from one of three sources. The investigators want to determine which of the water sources are contaminated. They identify every resident of the village and test them for infection with Vibrio cholerae (the causal agent of cholera) and determine where each person gets their water. They then calculate the proportion of people who are infected with Vibrio cholerae, and compare the proportions infected from each water source. What type of study design is described here?
A. Cohort study
B. Case-control study
C. Cross-sectional study
D. Ecological study
ANSWER: C
EXPLANATION: This is best described as a cross-sectional study. Exposure and disease were measured at the same time, without consideration of the timing of exposure relative to disease. Cross-sectional studies are often a first step in epidemiologic investigations because they are generally easier and less expensive to conduct than other study designs; however, cross-sectional studies are limited by the challenge of establishing the temporal sequence. Additionally, in a cross-sectional study, factors associated with the disease may be related to the risk of developing the disease, or to the duration of disease. Thus, the interpretation of cross-sectional studies should be done with caution.

6) You are studying the relationship between exclusive breastfeeding and gastrointestinal infection among HIV-uninfected infants born to infected mothers.
You decide to recruit a group of women who have chosen to breastfeed exclusively and a group of women who have chosen to formula feed. You ask the women to record the number of diarrheal episodes their infants have over a 6-month period and compare the number of episodes experienced by infants in the two groups. What type of study is this?
A. Cohort study
B. Case-control study
C. Randomized trial
D. Ecological study
ANSWER: A
EXPLANATION: The study subjects were selected based on their exposure status, which was chosen by the subjects, and followed up prospectively. Therefore, this is a prospective cohort study.

7) You are concerned that a common anti-malarial medication given to children may increase the risk of developing childhood leukemia. You know that leukemia is a rare, but serious disease. What would be the best study design to test your hypothesis?
A. Ecological study
B. Case-control study
C. Randomized trial
D. Cohort study
ANSWER: B
EXPLANATION: A case-control study is usually the best option when investigating rare diseases. Using a case-control design, the investigator controls the number of diseased and non-diseased subjects in the study. Therefore, the investigator can include as many cases as he or she can identify. An additional advantage of this approach is that it is possible to investigate multiple exposures in the study. If a cohort study or randomized trial were conducted, it would be necessary to enroll a very large number of subjects to ensure that there are an adequate number of disease outcomes to draw a conclusion about the exposure-disease relationship. Such an approach is very inefficient because the vast majority of subjects will not develop disease.

Lecture 4: Bias, Confounding, and Effect Modification

1) Which of the following is TRUE about an exposure that is causally associated with a disease?
A. The exposure must cause disease in all people that are exposed.
B. All people with the disease must have been exposed.
C. The exposure must precede the onset of disease.
D. The exposure must be common.
ANSWER: C
EXPLANATION: For an exposure to be causally related to a “disease” outcome, the exposure must always precede the outcome. An exposure need not cause disease in all those exposed. For example, many people smoke but not all develop lung cancer. This does not affect our conclusion that smoking is causally related to lung cancer. Similarly, not all cases of disease need to have been caused by the exposure. Using the same example, while many cases of lung cancer are due to smoking, lung cancer occurs in non-smokers due to other causes. Finally, while rare exposures are more difficult to investigate, particularly using a case-control design, rare exposures can be causally related to an outcome.

2) A clinician involved in the management of patients with HIV observed that, in a 1-year period, 10% of patients on antiretroviral therapy (ART) died compared to 6% of patients not on ART. She is concerned that ART might be causing deaths rather than preventing disease progression. This conclusion:
A. Is correct.
B. May be incorrect because patients starting ART may be much sicker than patients not on ART and therefore at greater risk of dying despite being on ART.
C. May be incorrect because there is no comparison group.
D. May be incorrect because incidence rates should have been calculated instead of the proportions that were calculated.
ANSWER: B
EXPLANATION: The observation of an association between an exposure and disease does not mean that a causal association exists. The scenario described here is likely an example of what epidemiologists call confounding. Confounding is the mixing of the effect of two factors on an outcome of interest. In this case, a causal association exists between a patient’s disease status (i.e., how sick they are) and their likelihood of dying. Unfortunately, a patient’s disease status is also related to their likelihood of starting ART.
Thus, the sickest patients are most likely to start ART. Methods are available to overcome, at least partially, the effect of confounding, but epidemiologists must always consider the potential that a confounding factor may account for the observed exposure-disease relationship.

3) You are conducting a case-control study to determine if taking Septrin reduces AIDS-related mortality. You plan to include as cases 100 people who have died from AIDS-related causes and as controls 100 people currently living with HIV. You will ask the controls about Septrin use in the prior 6 months and ask the next-of-kin of the cases about Septrin use by the cases in the 6 months prior to their death. What can be said about exposure ascertainment in this study?
A. There are no foreseeable issues of bias in this study design.
B. Bias may occur. Controls may more accurately recall their Septrin use compared to the next-of-kin of the cases, leading to differential misclassification of exposure.
C. A better strategy would be to ask the next-of-kin of both cases and controls about the Septrin use of the study subject.
D. Both B and C.
ANSWER: D
EXPLANATION: Exposure status has been measured differently for cases and controls. This can lead to bias because controls are more likely than the next-of-kin of the cases to correctly report their Septrin use. In order to reduce bias in an epidemiologic study, it is important to ensure that both exposure and outcome are assessed in the same manner and at the same level of accuracy for all subjects. Unfortunately, this means that we must sometimes use an inferior method of measurement (e.g., asking the next-of-kin about exposure status) for all subjects even when there are better methods that can only be used with a subset of subjects (e.g., only controls).

4) Which of the following is NOT TRUE about bias?
A. Bias only occurs when there is an overestimate of the association between exposure and disease.
B. Bias occurs when the observed association in an epidemiologic study differs from the true association.
C. Bias can occur when study subjects in a prospective study are lost to follow-up.
D. Differences in how exposure status is ascertained for cases and controls can give rise to bias.
ANSWER: A
EXPLANATION: Bias occurs when the observed exposure-disease relationship is different from the true association. Bias may result in a stronger or weaker observed relationship. Bias can arise from many causes (e.g., selection bias, confounding, ascertainment bias, indication bias, loss to follow-up) and can be present in all study designs. Certain study designs are more prone to bias than others, but researchers must always be alert to potential sources of bias in their study.

5) Which of the following conditions are necessary for confounding to occur?
1. Factor is associated with the disease of interest
2. Factor is a result of the disease
3. Factor is associated with the exposure of interest
4. Factor is not in the causal pathway of interest between exposure and disease
A. 1 only
B. 1, 2, and 3
C. 1, 3, and 4
D. 2 and 3
ANSWER: C
EXPLANATION: In order to be a confounder, a factor must be associated with both disease and exposure, and the factor cannot be in the causal pathway between exposure and disease that you are interested in investigating.

6) The figure below shows cases of Guillain-Barre syndrome in relation to the time since influenza vaccination. What evidence does this provide in support of a causal association between this vaccine and Guillain-Barre syndrome?
A. No alternative explanations exist
B. Association is strong
C. Association is strongest when predicted to be so
D. Observed evidence is consistent
ANSWER: C
EXPLANATION: The peak in the number of cases occurs in a window soon after vaccination and drops back to baseline soon after. Thus we see the strongest effect in the period we would expect.
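
The confounding described in question 2 of this lecture can be made concrete with a small numerical sketch. The counts below are hypothetical: they were chosen only so that the crude 1-year mortality matches the 10% (on ART) and 6% (not on ART) given in the question. The split into "advanced" and "early" disease strata, and every stratum-specific count, are assumptions for illustration and are not study data.

```python
# Hypothetical (deaths, patients) counts by disease stage and ART status.
# These numbers are illustrative assumptions, not data from any study.
counts = {
    "advanced": {"art": (96, 800), "no_art": (40, 200)},
    "early": {"art": (4, 200), "no_art": (20, 800)},
}


def risk(deaths, total):
    """Proportion of patients who died during follow-up."""
    return deaths / total


def crude_risk(group):
    """Risk in one treatment group, ignoring disease stage."""
    deaths = sum(counts[stage][group][0] for stage in counts)
    total = sum(counts[stage][group][1] for stage in counts)
    return risk(deaths, total)


# Crude comparison: ART appears harmful because stage is ignored.
crude_rr = crude_risk("art") / crude_risk("no_art")

# Stratified comparison: within each stage, compare ART to no ART.
stratum_rr = {
    stage: risk(*counts[stage]["art"]) / risk(*counts[stage]["no_art"])
    for stage in counts
}

print(f"Crude risk on ART: {crude_risk('art'):.2f}")
print(f"Crude risk not on ART: {crude_risk('no_art'):.2f}")
print(f"Crude relative risk: {crude_rr:.2f}")
for stage, rr in stratum_rr.items():
    print(f"Relative risk among {stage} disease: {rr:.2f}")
```

Under these assumed counts, the crude relative risk is above 1 while the relative risk within each disease stage is below 1. This is the pattern expected when the sickest patients are both more likely to start ART and more likely to die.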
Lecture 5: Measurement, Classification, and Misclassification

1) You are interested in investigating if using a mobile phone while driving increases the risk of being involved in a car accident. You choose to conduct a case-control study where you will enroll 200 people who have had a car accident in the past week as cases and 200 people who have not had an accident in the past week as controls. Which of the following would be the best strategy for assessing mobile phone usage by cases in relation to accident risk?
A. Ask cases if they ever use a mobile phone while driving.
B. Ask cases if they used a mobile phone while driving at any time during the week when they had their accident.
C. Ask cases if they used a mobile phone while driving on the same trip that they had their accident, prior to the accident itself.
D. Ask cases if they used a mobile phone while driving during at least half of their trips during the past week.
ANSWER: C
EXPLANATION: The objective is to measure exposure (mobile phone usage) during the etiologically relevant time period. Based on our hypothesis of the relationship between mobile phone usage and automobile accidents, we believe that mobile phone usage while driving causes distraction that results in inattention to road conditions, which in turn increases the risk of causing an accident or being unable to avoid hazardous situations. Therefore, the etiologically relevant time period would be the time immediately before the accident occurs. Because it may be difficult for a subject to remember the exact timing of mobile phone usage, we choose to ask about mobile phone usage on the same trip as the accident, excluding any usage after the accident. More general assessments of mobile phone usage (e.g., ever using a mobile phone while driving) do not assess the etiologically relevant time period, and would likely result in mismeasurement.

2) What is the effect of non-differential misclassification of exposure in a cohort study?
A. The observed relative risk will be closer to the null (RR=1.0) than the true relative risk.
B. The observed relative risk will be greater than the true relative risk.
C. The observed relative risk will be less than the true relative risk.
D. It is not possible to predict the direction of bias due to non-differential misclassification.
ANSWER: A
EXPLANATION: Non-differential misclassification occurs when the likelihood that exposure is misclassified does not depend on the probability that a subject will develop disease. Non-differential misclassification of exposure results in bias, but the bias occurs in a predictable direction: the observed measure of excess risk (e.g., relative risk or odds ratio) is closer to the “null” (i.e., RR or OR equal to 1). As a result, when non-differential misclassification of exposure is present, we can conclude that the true measure of excess risk is further from the null (i.e., further from 1) than the observed estimate.

3) Which of the following is the definition of sensitivity?
A. The probability that the PATIENT DOES NOT HAVE DISEASE, given that the TEST IS NEGATIVE.
B. The probability that the PATIENT HAS DISEASE, given that the TEST IS POSITIVE.
C. The probability of TESTING NEGATIVE, given that the PATIENT DOES NOT HAVE DISEASE.
D. The probability of TESTING POSITIVE, given that the PATIENT HAS DISEASE.
ANSWER: D
EXPLANATION: Sensitivity is used to measure how successful a test is at identifying “disease” when it is present. It is expressed as a probability. For example, a sensitivity of 0.90 means that 90% of those with disease will have a positive test.

4) From the table below, what is the specificity of the test?

                      True Disease Status
Test Results      diseased      non-diseased
positive            135              37
negative             25             163

A. [135/(135+65)] * 100 = 67.5%
B. [135/(135+37)] * 100 = 78.5%
C. [163/(163+37)] * 100 = 81.5%
D. [163/(163+25)] * 100 = 86.7%
ANSWER: C
EXPLANATION: Specificity is the probability of testing negative given that the patient doesn’t have disease. It is calculated by dividing the number of true negatives by the number of all patients without disease (i.e., true negatives plus false positives).

5) A recent clinical trial found that an antiretroviral-based vaginal microbicidal gel reduces the risk of a woman acquiring HIV from an infected partner. The investigators found that a large proportion of women in both the experimental and placebo groups did not use the gel every time they had sex, and that there was no difference in adherence between the two groups. The observed relative risk associated with using the microbicidal gel was 0.61. Which of the following is a possible value for the true relative risk if there had been perfect adherence?
A. RR = 0.40
B. RR = 0.90
C. RR = 1.00
D. RR = 1.25
ANSWER: A
EXPLANATION: This is an example of non-differential misclassification of exposure. Therefore, the observed relative risk is biased toward the null (RR=1). Because the observed relative risk is less than 1, this means that the true relative risk is even smaller. The only potential value that fits this scenario is a relative risk of 0.40.

6) Which of the following is the definition of specificity?
A. The probability that the PATIENT DOES NOT HAVE DISEASE, given that the TEST IS NEGATIVE
B. The probability that the PATIENT HAS DISEASE, given that the TEST IS POSITIVE
C. The probability of TESTING NEGATIVE, given that the PATIENT DOES NOT HAVE DISEASE
D. The probability of TESTING POSITIVE, given that the PATIENT HAS DISEASE
ANSWER: C
EXPLANATION: Specificity is the probability of testing negative given that the patient doesn’t have disease.
It is calculated by dividing the number of true negatives by the number of all patients without disease (i.e., true negatives plus false positives).

7) What is the epidemiologically relevant time period when measuring exposures in relation to a disease outcome?
A. Any time prior to the onset of disease
B. The time period during which an exposure is likely to be causally related to the disease outcome
C. The time period during which ALL exposures result in disease onset
D. The time period during which exposures are likely to be the result of disease onset
ANSWER: B
EXPLANATION: The epidemiologically relevant time period is based on the hypothesized mechanism by which the exposure is thought to result in the onset of disease. For exposures with a long induction period, relevant exposures must occur well before the onset of disease (the length of this period is based on the exposure-disease mechanism). In that case, exposures occurring immediately before disease onset would not be in the epidemiologically relevant time period. Conversely, for exposures with a short induction period (e.g., mobile phone usage and automobile accidents), the relevant time period is shortly before the outcome (here, the accident). Regardless of the induction period, the epidemiologically relevant time period never includes time after the onset of disease.

Lecture 6: Data Management Practices in Health Research

1) Which of the following are considered data collection instruments?
1. CD4 count machine
2. Interviewer-administered questionnaire
3. Medical record abstraction form
4. Database for storage of study data
A. 1 and 2 only
B. 2 and 3 only
C. 1, 2, and 3
D. 1, 2, 3, and 4
ANSWER: C
EXPLANATION: A data collection instrument is a general term for any method of collecting information about study subjects. It may be a medical device such as a CD4 count machine or a questionnaire to assess behavioral information. A database is used to store study data, but is not itself a means of collecting data.

2) Which of the following is FALSE about skip patterns in questionnaires?
1. Skip patterns should only be used in interviewer-administered questionnaires and not in self-administered questionnaires.
2. Skip patterns should be tested in all possible combinations to ensure the pattern works under all conditions.
3. Skip patterns should be used to guide the user through a complicated series of questions.
A. 1
B. 2
C. 3
D. Both 2 and 3
ANSWER: A
EXPLANATION: Skip patterns are a useful technique for guiding users through a complicated set of questions where not all questions should be answered by all subjects. Skip patterns can be used in questionnaires that are administered by a study interviewer, but they can also be used in self-administered questionnaires. The skip pattern can be more complex and more sophisticated when the questionnaire is administered by a trained interviewer. Patterns should be simpler in self-administered questionnaires to ensure that subjects are able to follow the skip pattern without making mistakes.

3) For the following questionnaire item, how many variables would be needed in the database used to store the questionnaire results?
Which of the above symptoms prompted you to seek care? (mark all that apply)
□ fever
□ diarrhea
□ vomiting
□ cough
A. 1
B. 3
C. 4
D. 5
ANSWER: C
EXPLANATION: In “check all that apply” questions, there should be one variable for each item in the list of potential responses.

4) Which of the following is TRUE about duplicate data entry?
A. Duplicate entry is time consuming and does little to improve quality and therefore should not be used.
B. Duplicate entry will identify all errors in study questionnaires and therefore should be used in all cases.
C. Duplicate data entry only works if all questionnaires for all study participants are double entered.
D. Duplicate entry can be done on a subset of study questionnaires or on all study questionnaires, depending on the demands of the study.
ANSWER: D
EXPLANATION: Duplicate entry is an effective method of detecting transcription errors when data are entered into a database from some other source such as a paper-based questionnaire. Duplicate entry can be conducted on all data entry as a means of catching nearly all cases of transcription error, but this can be a costly and time-consuming approach. Alternatively, a subset of the data entry can be conducted in duplicate to monitor accuracy. In this situation, the researcher should set a maximum threshold for errors based on the level of mismeasurement that is judged to be acceptable. In the event that the error rate exceeds the threshold, the researcher would investigate the data entry process and institute additional training or oversight as needed.

5) Which of the following is TRUE about the prevention, detection, and correction of errors in study data?
A. Error checking is only the responsibility of the data clerk who enters the data into the database.
B. Data entry errors can be minimized through good questionnaire design, training of those administering the questionnaire, and error checking at the time of data entry.
C. Error checking is only the responsibility of the person collecting the data from the study participant.
D. Errors in the data should never be corrected once they are entered into the database.
ANSWER: B
EXPLANATION: The accuracy of study data is the responsibility of all members of the study team. There should be a clear protocol for how the data are checked, monitored, and corrected.

Lecture 7: Interpretation of Epidemiologic Studies and Decision Making

The following data were obtained on 100 women newly diagnosed with ovarian cancer and a sample of 100 demographically similar women seeking care at the same location as the women with cancer. The first three questions are based on these data.

                     Abdominal bloating in the prior month
Type of patient      Yes      No      Total
Ovarian cancer        43      57       100
Other                  8      92       100

1) The sensitivity of abdominal bloating for the presence of ovarian cancer is
A. 43/(43 + 57)
B. 92/(8 + 92)
C. 43/(43 + 8)
D. Cannot be determined from these data
ANSWER: A
EXPLANATION: Sensitivity is used to measure how successful a test is at identifying “disease” when it is present. It is calculated by dividing the number of patients who have the disease and test positive by the number of all patients with disease (true positives plus false negatives).

2) The specificity of abdominal bloating for the presence of ovarian cancer is
A. 43/(43 + 57)
B. 92/(8 + 92)
C. 43/(43 + 92)
D. Cannot be determined from these data
ANSWER: B
EXPLANATION: Specificity is the probability of testing negative given that the patient doesn’t have disease. It is calculated by dividing the number of true negatives by the number of all patients without disease (i.e., true negatives plus false positives).

3) The prevalence of occult (undiagnosed) ovarian cancer in the population from which the 200 women in this study were drawn is estimated to be 100 in 250,000. The predictive value of abdominal bloating for the presence of ovarian cancer in this population is:
A. 43/(43 + 57)
B. 43/(43 + 8)
C. 43/20,035
D. Cannot be determined
ANSWER: C
EXPLANATION: The positive predictive value of a test is the probability that a patient has disease, given that they test positive. While sensitivity and specificity are characteristics of the test itself and do not depend on the prevalence of disease in the population in which the test is used, the predictive value of a test is a function of the sensitivity and specificity of the test and of the prevalence of disease. The positive predictive value can be calculated as follows: If we start with a hypothetical case in which we have 100 people with disease drawn from the general population, then we would be drawing from a population of 250,000 people.
We can then begin constructing a new 2x2 table.

                                True disease status
Expected test result     Diseased     Non-diseased     Total
Positive
Negative
Total                      100                         250,000

We can easily calculate the number of non-diseased people in this hypothetical population as 250,000 - 100 = 249,900.

                                True disease status
Expected test result     Diseased     Non-diseased     Total
Positive
Negative
Total                      100          249,900        250,000

Now, using the sensitivity of the test (43%), we can calculate the expected number of true positives (100 x 0.43 = 43) and the number of false negatives (100 - 43 = 57). So far, these numbers match the original table because we started with a hypothetical population with 100 diseased individuals.

                                True disease status
Expected test result     Diseased     Non-diseased     Total
Positive                    43
Negative                    57
Total                      100          249,900        250,000

We now use the known specificity (92%) to calculate the number of expected true negatives and false positives. The true negatives are calculated as 249,900 x 0.92 = 229,908 and the false positives are calculated as 249,900 - 229,908 = 19,992.

                                True disease status
Expected test result     Diseased     Non-diseased     Total
Positive                    43           19,992
Negative                    57          229,908
Total                      100          249,900        250,000

Now that the table is fully filled, we can calculate the positive predictive value in this population using the following formula:

$$PPV = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} = \frac{43}{43 + 19{,}992} = \frac{43}{20{,}035} = 0.0021 \text{ or } 0.21\%$$

As this example demonstrates, when the prevalence of disease in the population is low, imperfect specificity, even when it is as high as 92%, will result in a situation where most positive tests are false positives. In this example, only 0.21% of those with a positive test actually have disease.

4) Comparison of rates of illness and death across geographic populations can be a guide to inferences regarding disease etiology, but can be limited by:
A. Differences across the populations with regard to factors other than the one under consideration.
B. The presence of only small differences in the prevalence of the characteristic of concern across the populations being studied.
C. Both A and B
D. Neither A nor B
ANSWER: C
EXPLANATION: It is often tempting to draw conclusions about disease etiology by comparing disease rates in different geographical regions. Such an approach is often referred to as an ecological study because the unit of analysis is a group of people, often defined by geography, with exposure measured at the group level, as opposed to other study designs, where the unit of analysis is the individual and exposure is measured for each person. For example, if you observe a high rate of lung cancer in Country A and a low rate of lung cancer in Country B, you may start looking for differences between Countries A and B in terms of frequency of exposures that may be related to disease (e.g., smoking). Unfortunately, there are likely many factors that differ between the two countries (e.g., industrial exposures that increase lung cancer risk), making it difficult to attribute the difference in disease to a specific factor of interest. Ecological comparisons to investigate a factor of interest may also be limited when the distribution of the factor of interest does not vary greatly between comparison populations. For example, if you are interested in the relationship between lack of exercise and the risk of having a heart attack, your ability to draw conclusions by comparing the incidence of heart attacks between two regions will be limited if the distribution of those who do and do not exercise is similar in the two regions.

5) Do you agree or disagree with the following assertion: “The presence of some persons with an illness who had never sustained a given exposure means that the exposure does not have the capacity to cause the illness in question.”
A. Agree
B. Disagree
ANSWER: B
EXPLANATION: The definition of a causal relationship used by epidemiologists does not require that all cases of disease must be due to a common exposure. Returning once again to the example of lung cancer, the disease can result from a number of different exposures, including environmental factors such as air pollution, tobacco smoking, and exposure to asbestos, but may also be due to genetic factors independent of environmental exposure. The complex set of mechanisms of disease causation does not mean that any one of these exposures on its own is not causally related to disease.

6) A new screening test has been developed that can detect prostate cancer in men. Which of the following pieces of information do you need to know before deciding if the screening test should be used to test asymptomatic men for prostate cancer?
A. The proportion of positive tests that represent true cases of prostate cancer.
B. The cost of further testing and evaluation for men with a false positive test
C. If there is a treatment available that will improve the outcome for men with prostate cancer who test positive
D. All of the above
ANSWER: D
EXPLANATION: The decision to use a screening test should be based on the positive and negative predictive value of the test in the target population, an understanding of the full cost of the test, including the cost of false positives, and the availability of a therapy that changes the outcome of disease in those who test positive.

7) You are developing a screening test based on clinical criteria to detect patients who are experiencing a myocardial infarction (MI) after they have presented at the hospital. This test will be used to make decisions about how to manage patients. Those who test positive will be evaluated further and treated promptly. Those who test negative will be observed and discharged. What characteristics are you looking for in a good test?
A. High sensitivity, with lower specificity acceptable
B. High specificity, with lower sensitivity acceptable
C. Only a test with nearly perfect sensitivity and specificity
D. It depends on the prevalence of MI in the population
ANSWER: A
EXPLANATION: You want to detect as many true cases of MI as possible (high sensitivity). In this example, it is better to have some false positives than it is to miss a true case of MI.

Lecture 8: Multiple Variable Regression Models in Epidemiology

1) Which of the following is NOT a type of multivariable regression?
A. Linear regression
B. Cox proportional hazards regression
C. Logistic regression
D. All of the above are types of multivariable regression
ANSWER: D
EXPLANATION: Linear, logistic, and Cox proportional hazards regression are all considered forms of multivariable regression. Each type of regression evaluates a different type of outcome variable. In linear regression, the dependent, or outcome, variable is continuous. In logistic regression, the dependent variable is binary, meaning that it can only take one of two values (e.g., diseased or non-diseased). In Cox proportional hazards regression, the dependent variable is the amount of time from some starting point until an outcome of interest. In all of these forms of regression, the analyst can include multiple variables as independent or explanatory variables in the model.

2) Which of the following is NOT an advantage of multivariable regression?
A. It is possible to adjust for multiple confounders at the same time
B. Regression models eliminate selection bias
C. Regression models can be used to analyze case-control and cohort studies
D. Regression models can be used to estimate measures of risk commonly used in epidemiology
ANSWER: B
EXPLANATION: Selection bias results from limitations in the study design and cannot be controlled by regression alone.
While regression methods may help to lessen the impact of selection bias, it is best addressed by properly designing the study to minimize this form of bias.

3) In what situation should Cox regression be used instead of logistic regression?
A. In longitudinal studies where the duration of follow-up is not equal for all study subjects
B. To analyze a case-control study
C. When there are more unexposed study subjects than exposed subjects
D. To analyze all prospective studies
ANSWER: A
EXPLANATION: Cox regression is used to analyze time-to-event data where follow-up time is unequal between subjects. Logistic regression can be used to analyze case-control studies as well as cohort studies and randomized trials. However, in these latter designs where subjects are selected based on an exposure and followed up for an outcome, logistic regression is appropriate only when all subjects are followed for the same amount of time and there are no subjects who are lost to follow-up.

4) Which of the following are TRUE about linear regression?
1. The outcome variable (y) should be a continuous variable.
2. The independent (exposure) variable (x) should be a continuous variable.
3. The independent (exposure) variable (x) can be either a continuous or categorical variable.
4. Linear regression allows you to adjust for multiple variables at the same time.
A. 1 and 2
B. 1, 2, and 3
C. 1, 3, and 4
D. 1, 2, 3, and 4
ANSWER: C
EXPLANATION: Linear regression is used when the outcome variable (y) is continuous. In a linear regression model, the independent variables can be continuous, discrete, binary, or categorical. A strength of linear regression, as with other forms of regression, is that it can be used to adjust for multiple confounding variables at the same time.

5) Which of the following are TRUE about logistic regression?
1. Logistic regression produces odds ratios (OR) for each variable included in the model.
2. Logistic regression can be used to analyze case-control studies, cross-sectional studies, and cohort studies with the same follow-up time for all participants.
3. Continuous variables should not be included as confounders in a logistic regression model.
4. Logistic regression can be used to analyze data measuring the time from enrollment until the onset of disease.
A. 1 and 2
B. 1, 2, and 3
C. 1, 3, and 4
D. 1, 2, 3, and 4
ANSWER: A
EXPLANATION: Logistic regression is used when the outcome variable (y) is binary. While it is the primary means of analyzing case-control studies, logistic regression can also be used to analyze cohort studies and randomized trials, as long as follow-up is complete and of the same duration for all subjects. In a logistic regression model, the independent variables can be continuous, discrete, binary, or categorical. Time-to-event data are best analyzed using a Cox proportional hazards model.

Lecture 9: Qualitative Research Methods

1) How should the appropriate sample size be selected in a qualitative research study?
A. The sample size should be based on a statistical calculation to ensure adequate power to test the a priori hypothesis.
B. Qualitative studies should always conduct 15 individual interviews and 5 focus group discussions.
C. Appropriate sample sizes for qualitative studies shouldn’t be defined ahead of time. The sample size should be based on the principle of saturation.
D. The decision of sample size in a qualitative study should be based on the budget.
ANSWER: C
EXPLANATION: Unlike quantitative study designs, the final sample size for a qualitative study should not be specified ahead of time. While there are general principles that can be used to estimate the number of interviews or focus groups that should be conducted, these estimates should be used for planning purposes and should not dictate the final sample size.
The principle of saturation is that you should continue collecting data until you are no longer gaining new information from additional subjects. This may result in smaller sample size than initially anticipated if you reach saturation earlier than expected, but it may also mean that you will require a larger sample size than anticipated if you continue to gain new information with each additional subject. 2) Which of the following is TRUE about phenomenology? A. The goal of phenomenology is to gather an in-depth reflective description of experiences. B. The phenomenology approach seeks to explain why people behave in the way that they do. C. Research studies using phenomenology should use predetermined questions to ensure that the a priori hypothesis can be tested. D. Snowball sampling is not appropriate for a study using a phenomenology approach. ANSWER: A EXPLANATION: Phenomenology seeks to describe rather than explain the experience and/or behavior of subjects. Phenomenology is often used to explore a new area of investigation, and is not driven be an a priori hypothesis. 3) Which of the following is NOT an appropriate method of data collection for a phenomenology study? A. Informal conversations B. Semi-structured interviews C. Structured questionnaires D. Focus groups ANSWER: C EXPLANATION: While structured questionnaires may be used to collect basic demographic information about subjects in qualitative study, they are not appropriate for collecting the information that is the main subject of a qualitative. Data collection 38 strategies in qualitative research require a degree of flexibility and ability to adapt and respond to a subject’s interaction with the investigator. 4) Which of the following is TRUE about grounded theory? A. Participants in a grounded theory study should always be randomly selected from the target population. B. The process of conducting a grounded theory study is iterative. 
The sampling strategy is modified as new information is collected and analyzed.
C. Focus groups are never used in grounded theory.
D. The sampling strategy and approach should not be changed after the study has started.
ANSWER: B
EXPLANATION: Grounded theory involves the development and testing of hypotheses about the process being investigated and how subjects respond to the process. The resulting theory is grounded in data from large numbers of participants, and is the product of an iterative process in which the investigator analyses their findings, refines their theories, and conducts additional investigations to test and further refine these theories.

5) Which of the following is FALSE about focus group discussions?
A. Focus groups are best suited when interactions between participants will yield more information relevant to the research questions.
B. The facilitator for a focus group should use open-ended questions and use a nondirective and non-judgmental approach.
C. Focus groups are not well suited when asking sensitive questions that may not be answered truthfully in a group setting.
D. The main purpose of a focus group is to get the group to agree on a set of responses to items on a questionnaire.
ANSWER: D
EXPLANATION: Focus groups are useful when it is desirable to observe the interaction among interviewees. The goal should not be to seek agreement among interviewees, but rather to explore the range of responses and how these responses relate to one another. Focus groups are best used when investigating community attitudes or perceptions of how the community responds to the process being investigated. Focus groups are less useful when investigating topics that are sensitive and about which subjects may not feel comfortable giving honest responses in a group setting.
Focus group facilitators should be trained in how to use open-ended questions and non-directive approaches to maximize the information gained from the group and to avoid biasing the group with a priori hypotheses about the subject being investigated.

Lecture 10: Analyzing Qualitative Data and Public Health Applications

1) Which of the following is NOT a level of coding for qualitative data?
A. Descriptive coding
B. Community coding
C. Analytic coding
D. Topic coding
ANSWER: B
EXPLANATION: Descriptive, analytic, and topic coding are all levels of coding used in the analysis of qualitative data.

2) True or False: codes used in analyzing qualitative data should be developed before the research is started and should not be changed in order to prevent bias in the analysis.
A. True
B. False
ANSWER: B
EXPLANATION: Codes should be developed as the investigator reviews the qualitative data. Codes should be revised and developed as additional information is collected. Multiple analysts should be involved in developing and applying codes and in analyzing coded data.

3) Which of the following statements are TRUE about thematic codes?
1. Thematic codes are used to describe characteristics of the data itself.
2. Thematic codes are used to describe topics present in the transcript of an interview.
3. Thematic codes can be revised and modified during the process of analyzing the data.
4. Thematic codes are only used when analyzing focus group discussions.
A. 1
B. 1 and 2
C. 2 and 3
D. 3 and 4
ANSWER: C
EXPLANATION: Thematic coding, also referred to as topic coding, is the most common type of coding. It is used to describe topics in transcripts and other forms of qualitative data. Thematic codes are commonly revised and modified as the investigator proceeds through the review of the data. After preliminary codes are developed, they are typically discussed, merged, modified, and expanded. The data are then recoded using the revised codes.
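As an illustration only, the merge-and-recode step described in the explanation to question 3 could be sketched in Python as follows. The segment records, the code names, the revised codebook, and the `recode` helper are all hypothetical and are not part of the course materials. The sketch assumes each transcript segment carries a single preliminary code.

```python
from collections import Counter

# Hypothetical preliminary codes assigned to interview segments
# (illustrative data, not taken from any real study).
coded_segments = [
    {"segment_id": 1, "code": "clinic_distance"},
    {"segment_id": 2, "code": "transport_cost"},
    {"segment_id": 3, "code": "stigma"},
    {"segment_id": 4, "code": "long_travel_time"},
]

# Hypothetical revised codebook agreed on after discussion among analysts:
# three preliminary codes are merged into one broader thematic code.
revised_codebook = {
    "clinic_distance": "access_barriers",
    "transport_cost": "access_barriers",
    "long_travel_time": "access_barriers",
    "stigma": "stigma",
}


def recode(segments, codebook):
    """Return new segment records carrying the revised codes.

    Raises KeyError if a preliminary code has no entry in the revised
    codebook, so that unmapped codes are reviewed rather than silently kept.
    """
    recoded = []
    for seg in segments:
        if seg["code"] not in codebook:
            raise KeyError(f"No revised code for preliminary code {seg['code']!r}")
        recoded.append({**seg, "code": codebook[seg["code"]]})
    return recoded


recoded_segments = recode(coded_segments, revised_codebook)

# Tally how often each revised code appears across the recoded segments.
code_counts = Counter(seg["code"] for seg in recoded_segments)
```

Keeping the original records unchanged and producing a new recoded list makes it possible to compare preliminary and revised codes side by side when the analysts review the next iteration.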

4) Which of the following is FALSE about analyzing qualitative data?
A. Qualitative data may include transcripts of interviews, audio recordings, and videotaped interviews.
B. The process of analyzing qualitative data involves an adaptive process that incorporates new information that arises as the research is conducted.
C. Analysis of qualitative data requires the use of specialized software.
D. Codes are used to describe qualitative data, develop hypotheses, and to conduct comparisons.
ANSWER: C
EXPLANATION: A wide range of data may be used when analyzing a qualitative study. This may include transcripts from interviews and focus groups, notes from the observation of subjects, or audio and video recordings. Unlike most forms of quantitative analysis, qualitative analysis is adaptive and evolves as new information is gained. Codes are commonly used to describe and organize qualitative data. While specialized software can be helpful in analyzing qualitative data, it is by no means necessary, and does not guarantee that the analysis is conducted appropriately.

5) Which of the following is NOT a characteristic of a well-conducted qualitative study?
A. The researcher conducted and reported a thorough review of the relevant literature.
B. The study used a rigid prior conceptual framework when analyzing and interpreting the data.
C. The qualitative methods used were a good match with the research question.
D. The researcher compared their findings with others reported in the literature.
ANSWER: B
EXPLANATION: Well-conducted qualitative studies should include a comprehensive understanding of the relevant scholarly literature in addition to a good foundation in the methodologies of qualitative research. The investigator should have a good sense of which questions are appropriate to be answered through qualitative research and which questions are better answered through other approaches.
The findings from a qualitative study should be put in context with other published research, from both qualitative and quantitative approaches. It is important that the investigator not impose a rigid prior conceptual framework. Doing so can introduce bias into the study and cause the investigator to miss new knowledge from study subjects that may contradict previous ideas and hypotheses.