Epidemiology for mathematicians
“Looking at wildflowers from horseback”
David Ozonoff, MD, MPH
Boston University School of Public Health
DIMACS Working Group on Order Theory in Epidemiology
March 7, 2005

Tutorial overview and goals
• The landscape of epidemiology
– What is epidemiology?
– Who is an epidemiologist?
– Who employs them?
– Kinds of epidemiology
• How epidemiologists think
– What kinds of things do they work with?
– What kinds of things are they interested in?

Tutorial overview and goals, cont’d
• Some language and concepts of epidemiology
– Language of occurrence measures
– Study designs
– Causal inference

I. Landscape, perspective, language
What is epidemiology?
Who is an epidemiologist?
Who employs epidemiologists?
Flavors of epidemiology: Descriptive, analytic
Epi and mathematics: models and patterns
Some examples of epidemiological thinking

Some definitions of epidemiology
• Study of health and illness in populations (Kleinbaum, Kupper and Morgenstern)
• Study of the distribution and determinants of disease frequency in human populations (MacMahon and Pugh; Susser)
• Study of the occurrence of illness (Rothman I)
• Theoretical epidemiology: discipline of how to study the occurrence of phenomena of interest in the health field (Miettinen) [NB: not illness centered]

Some more (cynical) definitions
• Rothman II: “Unfortunately, there seem to be more definitions of epidemiology than there are epidemiologists. Some have defined it in terms of its methods.
While the methods of epidemiology may be distinctive, it is more typical to define a branch of science in terms of its subject matter rather than its tools…. If the subject of epidemiologic inquiry is taken to be the occurrence of disease and other health outcomes, it is reasonable to infer that the ultimate goal of most epidemiologic research is the elaboration of causes that can explain patterns of disease occurrence.”
• Schneiderman: Epidemiology is the practice of criticizing other epidemiologists

Consensus notions
• Deals with populations, not individuals
• Deals with (frequency of) occurrences of health-related events
• Has a (major but not exclusive) concern with causes (“determinants”) of disease patterns in populations

Remarks
• Public health perspective
• “Flavors”: Analytic versus descriptive epidemiology
• Causal inference: assumptions
– Disease occurrence is not random
– Systematic investigation of different populations can identify causal and preventive factors
• Observational versus experimental sciences
• Chronic disease and infectious disease epidemiology
– What is “theoretical epidemiology”?

Some examples
• Do environmental exposures increase risk of disease?
– John Snow: cholera epidemic of 1854
– Contaminated water and leukemia in Woburn, MA
• Are vitamin supplements beneficial?
– Does Vitamin E lower risk of Alzheimer’s Disease?
– Folic acid and risk of neural tube (birth) defects
• Do behavioral interventions reduce risk behaviors?
– Community-based studies to change diets
– Peer interventions to reduce HIV-risk behaviors

Who is an epidemiologist?
• Relatively new in medical science
– Precursors: John Graunt (17th century), John Snow (19th century)
– Rise as a profession: Wade Hampton Frost at JHU
– 1950s and 1960s: CDC and consolidation as professional discipline, still mainly physicians
– 1960s+: Infectious disease -> Chronic disease epi
• Professionalization
– Doctoral degrees in epidemiology
– Now most epidemiologists are not docs

Who employs epidemiologists?
• Public sector
– State and federal health officials
  • Communicable and chronic disease programs
– Infectious disease, outbreak investigations
– Cancer registries, environmental studies, program areas in substance abuse, health services, etc.
– Research at CDC, NIH, academia, etc.
• Private sector
– Industry (chemical companies, drug companies)
– Consultants
– Academia, NGOs

“Flavors” of epidemiology
– Descriptive epidemiology
– Analytic epidemiology (finding “risk factors”, a.k.a. “causes”)

Descriptive epidemiology
• Describe patterns of disease by: person, place, time
– Good for monitoring public’s health (e.g., surveillance, vital events)
– Used for administrative purposes (e.g., planning)
– Good for generating hypotheses
NB: Disease patterns and the Science of patterns
Description
• Two kinds
– Tabulations or summaries only (no inference or estimation)
– Inference
  • Prediction to other populations (“generalization”; surveys and polling)
  • “True” value in face of noise
• May also assume data produced by underlying population model and try to describe it
– Parametric: particular functional form assumed
  • Parameter = value that indexes a family of functions, e.g., mean and std deviation of Normal distribution
– Non-parametric: data-driven estimate of underlying density or distribution

A word about “models” and “patterns” (our usage)
• Models are high level, “global” descriptions of all or most of dataset
– Descriptive or inferential component
– Examples
  • Regression models, mixture models, Markov models
• Patterns are “local” features of data
– Perhaps only a few people or a few variables
– Also descriptive or inferential
  • Descriptive: look for people with “unusual” features
  • Inferential: predict which people have “unusual” features
– Examples: association rules, mode or gap in density function, outliers, inflection point in regression, symptom clusters, geographic “hot spots”, predict disease from symptoms

Models and patterns, cont’d
• Epidemiologists use both but are more interested in patterns, i.e., more interested in “structure” that is local than “structure” that is global
– George Box: “All models are wrong but some models are useful” describes epi viewpoint
– But epidemiologists tend to think of patterns as “real,” even if misleading

Warning: word “model” differs by context but is usually some kind of metaphor
• Metaphor: a figure of speech literally denoting one kind of thing but used to represent or reason about another kind of thing
– Examples: fashion model, model citizen (represent an “ideal”); scale model; animal model; mathematical model; model of an axiomatic system; regression model

Describing populations by person, place and time: illustrating how epidemiologists think
Question: What do we learn from the following
examples?

Person (age, sex, race)
Death rates per 10^5 US population from coronary disease by age and sex, 1981

Age      White Men   White Women
25-34          9           4
35-44         60          16
45-54        265          71
55-64        708         243
65-74       1670         769
75-84       3752        2359
85+         8596        7215

Place
• Where are the rates of disease the highest and lowest?
[Figure: Malignant Melanoma of Skin]

Place
A Variation on Place: Migrant Studies
Mortality rates (per 100,000) due to stomach cancer

Japanese in Japan                      58.4
Japanese Immigrants to California      29.9
Sons of Japanese Immigrants            11.7
Native Californians (Caucasians)        8.0

Time
Does frequency of disease differ now from in the past?

What is a Population? How an epidemiologist would put it
Group of people with a common characteristic like age, race, sex, geographic location, occupation, etc.
Two types of populations, based on whether membership is permanent or transient:
• Fixed population or cohort: membership is permanent and defined by an event
  Ex. Atomic bomb survivors, persons born in 1980
• Dynamic population: membership is transient and defined by being in or out of a “state”
  Ex. Members of HMO Blue, residents of the City of Boston

First step, summary description
• Tabulate data by selected features of person, place, time
• What are characteristics of population members? (How many of each sex, race, etc.) And combinations of these features (How many white women? Employed? Etc.)
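
The tabulation step above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original tutorial: the records, the field names (`sex`, `age_group`, `employed`) and the helper `tabulate` are all hypothetical, introduced here only to show counting by single features and by combinations of features.

```python
from collections import Counter

# Hypothetical subject records (illustrative only, not data from the tutorial).
subjects = [
    {"sex": "F", "age_group": "<65", "employed": True},
    {"sex": "M", "age_group": "65+", "employed": False},
    {"sex": "F", "age_group": "<65", "employed": False},
    {"sex": "M", "age_group": "<65", "employed": True},
    {"sex": "F", "age_group": "65+", "employed": True},
]


def tabulate(records, *features):
    """Count records for each observed combination of the given features."""
    return Counter(tuple(r[f] for f in features) for r in records)


one_way = tabulate(subjects, "sex")              # how many of each sex?
two_way = tabulate(subjects, "sex", "employed")  # sex cross-classified by employment

print(one_way)
print(two_way)
```

Each key of the resulting counter is a tuple of feature values, so the same helper covers one feature or any combination of them; combinations that never occur are simply absent and count as zero.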
Constructing contingency table from “raw data”
• “Raw data” consists of a listing of each subject and his or her attributes:

         M   F   R   L   <65   65+
Case1    1   0   0   1    0     1
Case2    0   1   1   0    1     0
Case3    1   0   0   1    0     1
Case4    1   0   1   0    1     0
Case5    0   1   1   1    1     0
Case6    0   1   0   1    1     0
Case7    1   0   1   0    0     1

One-way tables
• One-dimensional Contingency Table (CT) is just a frequency table, i.e., a table that gives number of subjects with each attribute

Males   Females   Right-handed   Left-handed   <65   65+
  4        3           4              4          4     3

Two-way tables
• Most contingency tables are (at least) two-way, i.e., they cross-classify two attributes

Right-handed males       2
Left-handed males        2
Right-handed females     2
Left-handed females      2
Males < 65 y.o.          1
Females < 65 y.o.        3
Males 65 and older       3
Females 65 and older     0

Or in more familiar form…

      R   L   <65   65+
M     2   2    1     3
F     2   2    3     0

Sex by handedness and age
But this is only part of the possible two-way tables, as it does not represent handedness versus age, for example

What is a Population? How a mathematician might put it
• A population is a triple, (G, M, I)
• Two sets, G and M; G is a set of “people” or “subjects”, M is a set of features the subjects might “have”
• A relation I, I ⊆ G × M
– Interpretation: r = (g, m) ∈ I means that subject g ∈ G “has” attribute m ∈ M

Contingency tables (“cross-tabs”)
• Mainstay of data preparation, inspection and analysis
• Requires study-design-based operations
– Sampling: set of n subjects in set G
– Variable selection (classification scheme): set of m variables in set M
  • E.g., age, sex, disease status (as indicator variables)
– Measurement: binary relation I ⊆ G × M
  • E.g., ordered pair (case 2, female=yes) is typical member of I
• We call the triple (G, M, I) a data structure for the contingency table (also called a formal context in FCA literature)
• Simple formulation allows use of rich mathematical theory
• Much more about this from Alex Pogel

Quantification: Disease frequency
• Goal will be to see if occurrence of disease differs in populations with different characteristics or
experiences (note comparison is at heart of this)
• Quantify disease occurrence in a population at certain point or period of time
– Population (counting, absolute scale)
  • How big?
  • Composition?
– Occurrence (counting, absolute scale)
  • Existing cases? New cases?
– Time
  • Calendar time? (NB: interval scale, preserved under positive linear transformation)
  • Duration of time (NB: ratio scale, preserved under similarity transformation)
• More about this in Fred Roberts’s tutorial

Ex. Hypothetical Frequency of AIDS in Two Cities

                 City A     City B
# new cases        58         35
time period       1985      1985-86
population       25,000      7,000

Annual “rate” of AIDS
City A = 58/25,000/1 yr = 232/100,000/yr
City B = 35/7,000/2 yrs = 17.5/7,000/yr = 250/100,000/yr
Make it easy to compare rates (i.e., make them “commensurable”) by using same population unit (say, per 100,000 people) and time period (say, 1 year)
NB: Commensurability is a property of the underlying relational system used in measurement (treated in Roberts tutorial)

Three kinds of quantitative measures of frequency of occurrence
Used to relate number of cases of disease, size of population, time
• Proportion: numerator is subset of denominator, often expressed as a percentage
• Ratio: division of one number by another, numbers don't have to be related
• Rate: time (sometimes space) is intrinsic part of denominator, term is often misused (e.g., “birthrate”)
Need to specify if measure represents events or people

(Point) Prevalence (P)
Quantifies number of existing cases of disease in a population at a point in time
• P = Number of existing cases of disease (at a given point in time) / total population
• Ex.
City A has 7,000 people with arthritis on Jan 1st, 2002
• Population of City A = 70,000
• Prevalence of arthritis on Jan 1st = .10 or 10%
Prevalence is a proportion

Incidence - quantifies number of (a) new cases of disease that (b) develop in a population at risk (c) during a specified time period
Three key ideas:
• New disease events, or for diseases that can occur more than once, usually first occurrence of disease
• Population at risk (candidate population) - can't have disease already, should have relevant organs
• Enough time must pass for a person to move from health to disease

Two Types of Incidence Measures
Cumulative Incidence (“Attack Rate”) (Abbreviated Cum Inc, CI)
Incidence Rate (“Incidence Density”) (Abbreviated I, IR, ID)

Incidence rate (I, IR) = # new cases of disease / total person-time of observation
Also called incidence density (ID)

Accrual of Person-Time

            Jan 1980        Jan 1981        Jan 1982
Subject 1   -----------------------x                         1.1 Person-Year (PY)
Subject 2   -------------------------x                       1.2 PY
Subject 3   --------------------------------------------     2.2 PY
Total                                                        4.5 PY

x = outcome of interest, incidence rate = 2/4.5 PY

Some Ways to Accrue 100 PY
• 100 people followed 1 year each = 100 py
• 10 people followed 10 years each = 100 py
• 50 people followed 1 year plus 25 people followed 2 years = 100 py
Time unit for person-time = year, month or day
Person-time = person-year, person-month, person-day

Ex.: (Cohort) study of risk of breast cancer among women with hyperthyroidism
• Followed 1,762 women ---> 30,324 py
• Average of 17 years of follow-up per woman
• Ascertained 61 cases of breast cancer
• Incidence rate = 61/30,324 py = 201/100,000 py = .00201/y (.00201 x 100,000 p/100,000 p)

Dimensions
Prevalence = people / people: no dimension
Cumulative incidence = people / people: no dimension
Incidence rate = people / people-time: dimension is time^-1

Types of (instantaneous) rates
• Relative rate (person-time or incidence rate)
• Absolute rate (used in infectious disease epi and health
services)
Relative rate = (1/N)(dN/dt); absolute rate = dN/dt
Also used where units do not involve time, such as accidents per passenger mile or cases per unit area

Relationship between prevalence and incidence
P = IR x D
• Prevalence depends on incidence rate and duration of disease (duration lasts from onset of disease to its termination)
• If incidence is low but duration is long, prevalence is relatively high
• If incidence is high but duration is short, prevalence is relatively low
• This is an example of Little’s equation in queuing theory: time-avg number of units in the system = arrival rate x avg delay time/unit
• This equation is true if ...

Conditions for equation to be true:
• Steady state
• IR constant
• Distribution of durations constant
• Prevalence of disease is low (less than 10%)
In queuing theory terms: strictly stationary process in steady state conditions

Figuring duration from prevalence and incidence
Lung cancer incidence rate = 45.9/100,000 py
Prevalence of lung cancer = 23/100,000
D = P / IR = (23/100,000 p) / (45.9/100,000 py) = 0.5 years
Conclusion: Individuals with lung cancer survive, on average, about 6 months from diagnosis to death

Uses of Prevalence and Incidence Measures
• Prevalence: administration, planning
• Incidence: etiologic research (problems with prevalence since it combines IR and D), planning

Common measures of disease frequency for public health
– Crude death (mortality) rate: total number of deaths from all causes per 1,000 people, for one year (also cause-specific, age-specific, race-specific death rate)
– Live-birth rate: total number of live births per 1,000 people (sometimes women of childbearing age), for one year
– Infant mortality rate: # deaths of infants under 1 year of age per 1,000 live births, for one year

Frequency measures used in infectious disease epidemiology
• Attack rate: # cases of disease that develop during defined period / # in pop.
at risk at start of period (usually used for infectious disease outbreaks)
• Case fatality rate: # of deaths / # cases of disease, for a defined period of time
• Survival rate: # living cases / # cases of disease, for a defined period of time

Tutorial part 2: Exposure-Disease Relationship
Analytic epidemiology

Reprise: Epidemiology is a science within public health
• This means that it adopts a population perspective
• As a science, it is also quantitative
• As a science, it is also interested in explanation and prediction, not just describing

Questions asked by communities
• Exposure driven questions
– “What will happen to me, my family, my community?”
• Outcome driven questions
– “Why me, why my child, why us?”
• Mixed
– “Are we sicker than our neighbors?”

The usual notion of causation: John Stuart Mill’s “Method of Difference”
• A causes B if, all else being held constant, a change in A is accompanied by a subsequent change in B.
– This of course does not mean that nothing else can produce a change in B.
• The formal method to detect such an occurrence is the Experiment, whereby all things are held constant except A and B, A is varied, and B observed

Expt’l vs. Observational Science
• Epidemiology is an “observational” science
• We do not control the independent variable (or most other variables)
• What is the implication of this for the status of epidemiology as a science?
• What does it mean about epidemiology’s ability to “prove causation”?

Sources of information
• Case studies
• Experimental studies
• Observational studies
Once results are observed, it remains to explain or interpret the observation, whether the result is a difference or a lack of a difference in the compared entities.
Types of observational study designs
• Descriptive
– Case study and case-series
– No comparison: Person, place and time
– Cross-sectional comparison (“Are we sicker than our neighbors?”)
– “Ecological” (comparing communities/environments; not individual level)
– Notice how descriptive and analytic shade into each other (as per examples we did earlier)
• Cohort (“What’s going to happen to me?”)
– Analog of the laboratory experiment
• Case-control (“Why me?”)

Central idea: compare frequencies of occurrence in two groups
• Example: Summarize relationship between exposure and disease by comparing two measures of disease frequency
• Overall rate of disease in an exposed group says nothing about whether exposure is a risk factor for (“causes”) a disease
• This can be evaluated by comparing disease incidence in an exposed group to another group that is not exposed (a “comparison group”)
• Comparison or contrast is the essence of epidemiology

Two Main Options for Comparing disease frequencies
1. Calculate ratio of two measures of disease frequency (a measure in exposed group and a measure in unexposed comparison group)
2. Calculate difference between two measures of disease frequency (a measure in exposed group and a measure in unexposed comparison group)

At the heart of an epidemiological study ...
• Lies a comparison
– Between 2 rates, ratios, proportions
• Is the difference/lack of difference due to
– Bias?
– Chance?
– Real effect?
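
The two comparison options can be illustrated with a short Python sketch. It reuses the counts from the hypothetical two-city AIDS example earlier in the tutorial purely as input numbers; neither city is an "exposed" group, and treating City A as the comparison group is an arbitrary assumption. The function names `annual_rate_per_100k`, `rate_ratio` and `rate_difference` are introduced here for illustration and do not come from the original slides.

```python
def annual_rate_per_100k(new_cases, population, years):
    """New cases per 100,000 people per year (a commensurable rate)."""
    return new_cases / population / years * 100_000


def rate_ratio(rate_index, rate_comparison):
    """Option 1: ratio of two measures of disease frequency."""
    return rate_index / rate_comparison


def rate_difference(rate_index, rate_comparison):
    """Option 2: difference between two measures of disease frequency."""
    return rate_index - rate_comparison


# Counts from the hypothetical two-city example.
rate_a = annual_rate_per_100k(58, 25_000, 1)  # City A, 1985
rate_b = annual_rate_per_100k(35, 7_000, 2)   # City B, 1985-86

print(f"City A: {rate_a:.0f} per 100,000 per year")
print(f"City B: {rate_b:.0f} per 100,000 per year")
print(f"ratio (B/A): {rate_ratio(rate_b, rate_a):.2f}")
print(f"difference (B-A): {rate_difference(rate_b, rate_a):.0f} per 100,000 per year")
```

With these inputs the rates are 232 and 250 per 100,000 per year, the ratio is about 1.08 and the difference is 18 per 100,000 per year. Whether a contrast of that size reflects bias, chance or a real effect is exactly the question the slide poses; the arithmetic alone does not answer it.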
Determinants of the comparison
• Compared measures differ or they don’t (the values being compared are linearly ordered)
• Either way, the comparison may be affected by:
– Chance (sample variation)
– Bias
– Real effect or lack of effect
• To interpret the comparison and evaluate the last factor, we need to account for effects of the first two

Role of statistics
• Evaluates role that chance might play in the absence of any other factor
• Also used for summary purposes or to express a model mathematically
• Not the main preoccupation of epidemiologists, however
• Bias is main preoccupation of epidemiologists

Evaluating the role of bias
• Epidemiology is an observational discipline, so uncontrolled variables abound
• Most of training is in recognizing and accounting for sources of bias, often extremely subtle
• Less emphasis on role of chance, often handed over to biostatisticians
• Extent to which content area (“real effect”) is taken into account varies with investigator and who collaborators are

I. Definition of Bias
Bias is a systematic error that results in an incorrect (invalid) estimate of the measure of association
A. Can create spurious association when there really is none (bias away from the null)
B. Can mask an association when there really is one (bias towards the null)
C. Bias is primarily introduced by the investigator or study participants

I. Definition of Bias (cont’d)
D. Bias does not mean that the investigator is “prejudiced” or “not objective”
E. Bias can arise in all study types: experimental, cohort, case-control
F. Bias occurs in the design and conduct of a study. It cannot be fixed in the analysis phase.
G. Two main types of bias are selection and information bias, but there are many other types of bias
H. We will consider only selection and information bias for purposes of illustration of epidemiologic practice

II. Selection Bias
A.
Results from procedures used to select subjects into a study that lead to a result different from what would have been obtained from the entire population targeted for study
B. Most likely to occur in case-control or retrospective cohort studies because exposure and outcome have occurred at time of study selection

II. Selection Bias in a Case-Control Study
A. Occurs when controls or cases are more (or less) likely to be included in study if they have been exposed; that is, inclusion in study is not independent of exposure
B. Result: Relationship between exposure and disease observed among study participants is different from relationship between exposure and disease in individuals who would have been eligible but were not included. The odds ratio (OR) from a study that suffers from selection bias will incorrectly represent the relationship between exposure and disease in the overall study population

Selection Bias: Case-Control Study
Question: Do PAP smears prevent cervical cancer?
Cases diagnosed at a city hospital. Controls randomly sampled from households in same city by canvassing the neighborhood on foot.
Here is the true relationship:

                          Cervical Cancer Cases   Controls
Had PAP smear                     100                150
Did not have PAP smear            150                100
Total                             250                250

OR = (100)(100) / (150)(150) = .44
There was a 56% reduced risk of cervical cancer among women who had PAP smears as compared to women who did not. (40% of cases had PAP smears versus 60% of controls)

Selection Bias: Case-Control Study (cont’d)
Recall: Cases come from the hospital and controls come from the neighborhood around the hospital.
Now for the bias: Only controls who were at home at the time the researchers came around to recruit for the study were actually included in the study. Women at home were more likely not to work and were less likely to have regular checkups and PAP smears. Therefore, being included in the study as a control is not independent of the exposure.
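
The odds ratio for the true table above can be checked with a few lines of Python. This is only a sketch: the function name `odds_ratio` and its argument order are choices made here for illustration, with "exposed" taken to mean "had PAP smear" and the four cells read directly from that table.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio for a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# True relationship: 100 of 250 cases and 150 of 250 controls had a PAP smear.
true_or = odds_ratio(
    exposed_cases=100,
    exposed_controls=150,
    unexposed_cases=150,
    unexposed_controls=100,
)

print(f"OR = {true_or:.2f}")
print(f"reduction in odds = {1 - true_or:.0%}")
```

The same function can be applied to any 2x2 table, so it is a convenient way to see how a change in who gets recruited as a control changes the estimate.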
The resulting data are as follows:

Selection Bias (cont’d)

                          Cervical Cancer Cases   Controls
Had PAP smear                     100                100
Did not have PAP smear            150                150
Total                             250                250

OR = (100)(150) / (150)(100) = 1.0
There is no association between PAP smears and the risk of cervical cancer. Here, 40% of cases and 40% of controls had PAP smears.

Selection Bias: Case-Control Study (cont’d)
Ramifications of using women who were at home during the day as controls: These women were not representative of the whole study population that produced the cases. They did not accurately represent the distribution of exposure in the study population that produced the cases, and so they gave a biased estimate of the association.

When interpreting study results, ask yourself these questions …
• Given conditions of the study, could bias have occurred?
• Is bias actually present?
• Are consequences of the bias large enough to distort the measure of association in an important way?
• Which direction is the distortion: towards the null or away from the null?

Imputation of Causality
What are the roles of …
• Bias: The critique checklist
• Chance: “Statistical significance”
• Real effect
– The Hill “viewpoints”
  • Not necessary “criteria” (not even criteria)
  • Not a checklist
– The way it’s really done...

“Marks” of causality
• Strength of association
• Biologically plausible
• Biological gradient (“dose-response”)
• Appropriate temporal relationship
• Specificity
• “Consistency”

The “Fundamental Question” (according to Hill)
• "Clearly none of these nine viewpoints can bring indisputable evidence for or against a cause-and-effect hypothesis and equally none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to answer the fundamental question--is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?”

How it’s really done…
• Assemble the evidence from the literature.
What are the pieces of the jigsaw?
– How do you decide?
• Where do they fit?
– How do you decide?

Interpretation
• Evaluate the evidence (a study) for internal validity
• Evaluate the evidence for external validity
• Bottom line
– What roles are played by bias, chance, real effect?

Assemble the jigsaw pieces into a picture
• The “picture” is your version of “causality”
• Your picture may disagree with other scientists
– Disagreement among scientists is the rule, not the exception

Mathematics in epidemiology
• Traditional
– Evaluate role of chance (statistical hypothesis testing; estimation)
– Descriptive (compact summary or generative model)
– Infectious disease epidemiology dynamics

Comparing chronic and infectious disease epidemiology
[Figure: compartment diagrams with compartments labeled S, P, I and R, connected by the rates listed below]
Rates in the diagrams: birth rate or migration in-rate; incidence rate or infectivity rate; mortality and recovery rates, with subscript 1 = case fatality rate and subscript 2 = background mortality rate
Prevalence “rate” = P/(S+P)

Comparing chronic and infectious epi (cont’d)
• Chronic
– Usually concentrate on the incidence rate because interested in etiology
– Have to account for fact that the incidence rate is a function of calendar time and age, exposure (?metric), sex, race, SES, occupation, co-morbid conditions, latency
– But not usually population size or density, number of other cancer cases, etc.
• Infectious
– Interest in the incidence (infectivity) rate is usually limited to its value as a parameter; we know the etiology
– Interested in dynamics over time and space, existence of thresholds or periods, effect of parameters and initial conditions like size of initial population, infectivity, mode of contact
Difference is one of emphasis and interest, not concepts

Some new uses for mathematics in epidemiology
• Formalization and theoretical tools
• Pattern and rule detection (“data mining”)
• Descriptive modeling
• Prediction from data
– Classification
– Taxonomy
• Data organization and retrieval from large databases
• Patient confidentiality/coding/cryptography
• Multi-scale inference
• Network construction/applications, etc.