Superquick Tutorial on Epidemiology for Mathematicians

Epidemiology for mathematicians
“Looking at wildflowers from horseback”
David Ozonoff, MD, MPH
Boston University
School of Public Health
DIMACS Working Group on Order Theory in Epidemiology
March 7, 2005
Tutorial overview and goals
• The landscape of epidemiology
– What is epidemiology?
– Who is an epidemiologist?
– Who employs them?
– Kinds of epidemiology
• How epidemiologists think
– What kinds of things do they work with?
– What kinds of things are they interested in?
Tutorial overview and goals, cont’d
• Some language and concepts of
epidemiology
– Language of occurrence measures
– Study designs
– Causal inference
I. Landscape, perspective,
language
What is epidemiology?
Who is an epidemiologist?
Who employs epidemiologists?
Flavors of epidemiology: Descriptive, analytic
Epi and mathematics: models and patterns
Some examples of epidemiological thinking
Some definitions of epidemiology
• Study of health and illness in populations (Kleinbaum,
Kupper and Morgenstern)
• Study of the distribution and determinants of disease
frequency in human populations (MacMahon and Pugh;
Susser)
• Study of the occurrence of illness (Rothman I)
• Theoretical epidemiology: discipline of how to study the
occurrence of phenomena of interest in the health field
(Miettinen) [NB: not illness centered]
Some more (cynical) definitions
• Rothman II: “Unfortunately, there seem to be more
definitions of epidemiology than there are epidemiologists.
Some have defined it in terms of its methods. While the
methods of epidemiology may be distinctive, it is more
typical to define a branch of science in terms of its subject
matter rather than its tools….If the subject of
epidemiologic inquiry is taken to be the occurrence of
disease and other health outcomes, it is reasonable to infer
that the ultimate goal of most epidemiologic research is the
elaboration of causes that can explain patterns of disease
occurrence.”
• Schneiderman: Epidemiology is the practice of criticizing
other epidemiologists
Consensus notions
• Deals with populations, not individuals
• Deals with (frequency of) occurrences of
health related events
• Has a (major but not exclusive) concern
with causes (“determinants”) of disease
patterns in populations
Remarks
• Public health perspective
• “Flavors”: Analytic versus descriptive
epidemiology
• Causal inference: assumptions
– Disease occurrence is not random.
– Systematic investigation of different populations can
identify causal and preventive factors
• Observational versus experimental sciences
• Chronic disease and infectious disease
epidemiology
– What is “theoretical epidemiology”?
Some examples
• Do environmental exposures increase risk of disease?
– John Snow: cholera epidemic of 1854
– Contaminated water and leukemia in Woburn, MA
• Are vitamin supplements beneficial?
– Does Vitamin E lower risk of Alzheimer’s Disease?
– Folic acid and risk of neural tube (birth) defects
• Do behavioral interventions reduce risk behaviors?
– Community–based studies to change diets
– Peer interventions to reduce HIV-risk behaviors
Who is an epidemiologist?
• Relatively new in medical science
– Precursors: John Graunt (17th century), John Snow
(19th century)
– Rise as a profession: Wade Hampton Frost at JHU
– 1950s and 1960s: CDC and consolidation as
professional discipline, still mainly physicians
– 1960s+: Infectious disease -> Chronic disease epi
• Professionalization
– Doctoral degrees in epidemiology
– Now most epidemiologists are not docs
Who employs epidemiologists?
• Public sector
– State and federal health officials
• Communicable and chronic disease programs
– Infectious disease, outbreak investigations
– Cancer registries, environmental studies, program areas in
substance abuse, health services, etc., etc.
– Research at CDC, NIH, academia, etc.
• Private sector
– Industry (chemical companies, drug companies)
– Consultants
– Academia, NGOs
“Flavors” of epidemiology
– Descriptive epidemiology
– Analytic epidemiology (finding “risk
factors”, a.k.a. “causes”)
Descriptive epidemiology
• Describe patterns of disease by: person,
place, time
– Good for monitoring public’s health (e.g.,
surveillance, vital events)
– Used for administrative purposes (e.g.,
planning)
– Good for generating hypotheses
NB: Disease patterns and the
Science of patterns
Description
• Two kinds
– Tabulations or summaries only (no inference or
estimation)
– Inference
• Prediction to other populations (“generalization”; surveys and
polling)
• “True” value in face of noise
• May also assume data produced by underlying
population model and try to describe it
– Parametric: particular functional form assumed
• Parameter = value that indexes a family of functions, e.g., mean
and std deviation of Normal distribution
– Non-parametric: data-driven estimate of underlying
density or distribution
A word about “models” and
“patterns” (our usage)
• Models are high level, “global” descriptions of all or
most of dataset
– Descriptive or inferential component
– Examples
• Regression models, mixture models, Markov models
• Patterns are “local” features of data
– Perhaps only a few people or a few variables
– Also descriptive or inferential
• Descriptive: look for people with “unusual” features
• Inferential: Predict which people have “unusual” features
– Examples: Association rules, mode or gap in density
function, outliers, inflection point in regression, symptom
clusters, geographic “hot spots”, predict disease from
symptoms
Models and patterns, cont’d
• Epidemiologists use both but more
interested in patterns, i.e., more interested
in “structure” that is local than “structure”
that is global
– George Box: “All models are wrong but some
models are useful” describes epi viewpoint
– But epidemiologists tend to think of patterns as
“real,” even if misleading
Warning: word “model” differs by
context but is usually some kind of
metaphor
• Metaphor: a figure of speech literally
denoting one kind of thing but used to
represent or reason about another kind of
thing
– Examples: fashion model, model citizen
(represent an “ideal”); scale model; animal
model; mathematical model; model of an
axiomatic system; regression model
Describing populations by
person, place and time:
illustrating how
epidemiologists think
Question: What do we learn from the
following examples?
Person (age, sex, race)
Death rates per $10^5$ US population from coronary
disease by age and sex, 1981
Age      White Men   White Women
25-34            9             4
35-44           60            16
45-54          265            71
55-64          708           243
65-74         1670           769
75-84         3752          2359
85+           8596          7215
Place
• Where are the rates of disease the highest and lowest?
Malignant Melanoma of Skin
Place
A Variation on Place: Migrant Studies
Mortality rates (per 100,000) due to stomach
cancer
Japanese in Japan                      58.4
Japanese Immigrants to California      29.9
Sons of Japanese Immigrants            11.7
Native Californians (Caucasians)        8.0
Time
Does frequency of disease differ now from in the past?
What is a Population?
How an epidemiologist would put it
Group of people with a common characteristic like age,
race, sex, geographic location, occupation, etc.
Two types of populations, based on whether membership is
permanent or transient:
• Fixed population or cohort: membership is permanent and defined
by an event
Ex. Atomic bomb survivors, Persons born in 1980
• Dynamic population: membership is transient and defined by
being in or out of a "state.”
Ex. Members of HMO Blue, residents of the City of Boston
First step, summary description
• Tabulate data by selected features of person,
place, time
• What are characteristics of population
members? (how many of each sex, race,
etc.) And combinations of these features
(How many white women? Employed?
Etc.)
Constructing contingency table
from “raw data”
• “raw data” consists of listing of each subject
and his or her attributes:
         M   F   R   L   <65   65+
Case1    1   0   0   1    0     1
Case2    0   1   1   0    1     0
Case3    1   0   0   1    0     1
Case4    1   0   1   0    1     0
Case5    0   1   1   1    1     0
Case6    0   1   0   1    1     0
Case7    1   0   1   0    0     1
One-way tables
• One dimensional Contingency Table (CT) is just a
frequency table, i.e., a table that gives number of
subjects with each attribute
Males            4
Females          3
Right-handed     4
Left-handed      4
<65              4
65+              3
Two-way tables
• Most contingency tables are (at least) two-way, i.e., they cross-classify two attributes
Right-handed males       2
Left-handed males        2
Right-handed females     2
Left-handed females      2
Males < 65 y.o.          1
Females < 65 y.o.        3
Males 65 and older       3
Females 65 and older     0
Or in more familiar form…
      R   L   <65   65+
M     2   2    1     3
F     2   2    3     0
Sex by handedness and age
But this is only part of the possible two way tables as it
does not represent handedness versus age, for example
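The tabulation steps above can be sketched in a few lines of Python. This is only an illustration: the names `RAW`, `one_way` and `two_way` are invented here, and the indicator values are transcribed from the raw-data listing as it appears on the slide (including Case5, which is marked both R and L there).

```python
from itertools import combinations

# One row per subject, one 0/1 indicator per attribute (values as on the slide).
ATTRIBUTES = ["M", "F", "R", "L", "<65", "65+"]
RAW = {
    "Case1": [1, 0, 0, 1, 0, 1],
    "Case2": [0, 1, 1, 0, 1, 0],
    "Case3": [1, 0, 0, 1, 0, 1],
    "Case4": [1, 0, 1, 0, 1, 0],
    "Case5": [0, 1, 1, 1, 1, 0],
    "Case6": [0, 1, 0, 1, 1, 0],
    "Case7": [1, 0, 1, 0, 0, 1],
}

def one_way(raw, attributes):
    """One-way table: number of subjects having each attribute."""
    return {a: sum(row[i] for row in raw.values())
            for i, a in enumerate(attributes)}

def two_way(raw, attributes):
    """Two-way counts: number of subjects having both attributes of each pair."""
    counts = {}
    for (i, a), (j, b) in combinations(enumerate(attributes), 2):
        counts[(a, b)] = sum(row[i] * row[j] for row in raw.values())
    return counts

ONE = one_way(RAW, ATTRIBUTES)
TWO = two_way(RAW, ATTRIBUTES)

print(ONE)                 # counts per attribute
print(TWO[("M", "R")])     # right-handed males
print(TWO[("R", "<65")])   # handedness by age, the table not shown above
```

Because `two_way` loops over every pair of attributes, it also yields the handedness-by-age cells that the sex-by-handedness-and-age table leaves out.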
What is a Population?
How a mathematician might put it
• A population is a triple, (G, M, I)
• Two sets, G and M; G is a set of “people” or
“subjects”, M is a set of features the
subjects might “have”
• A relation I, $I \subseteq G \times M$
– Interpretation: $r = (g, m) \in I$ means that subject
$g \in G$ “has” attribute $m \in M$
Contingency tables (“cross-tabs”)
• Mainstay of data preparation, inspection and analysis
• Requires study design based operations
– Sampling -> set of n subjects in set G
– Variable selection (classification scheme) -> set of m variables in
set M
• E.g., age, sex, disease status (as indicator variables)
– Measurement -> binary relation $I \subseteq G \times M$
• E.g., ordered pair (case 2, female=yes) is typical member of I
• We call the triple (G, M, I) a data structure for the
contingency table (also called a formal context in FCA
literature)
• Simple formulation allows use of rich mathematical theory
• Much more about this from Alex Pogel
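As a rough sketch of the triple (G, M, I), the seven-subject example can be written as two sets and a set of ordered pairs. The dictionary `HAS` and the helper `extent` are illustrative names chosen here (the word "extent" is borrowed from the FCA vocabulary); nothing below is taken from a particular library.

```python
# Attributes each subject "has", transcribed from the seven-case example.
HAS = {
    "Case1": {"M", "L", "65+"},
    "Case2": {"F", "R", "<65"},
    "Case3": {"M", "L", "65+"},
    "Case4": {"M", "R", "<65"},
    "Case5": {"F", "R", "L", "<65"},
    "Case6": {"F", "L", "<65"},
    "Case7": {"M", "R", "65+"},
}

G = set(HAS)                                        # subjects
M = set().union(*HAS.values())                      # attributes
I = {(g, m) for g, ms in HAS.items() for m in ms}   # relation, a subset of G x M

def extent(attrs):
    """Subjects that have every attribute in attrs (hypothetical helper)."""
    return {g for g in G if all((g, m) in I for m in attrs)}

# A contingency-table cell is the size of an extent.
print(len(extent({"F", "<65"})))   # females under 65
print(extent({"F", "65+"}))        # empty set: no females aged 65+
```

Read this way, each cell of a cross-tab is the cardinality of the set of subjects sharing a chosen set of attributes.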
Quantification: Disease frequency
• Goal will be to see if occurrence of disease differs in
populations with different characteristics or experiences
(note comparison is at heart of this)
• Quantify disease occurrence in a population at certain
point or period of time
– Population (counting, absolute scale)
• How big?
• Composition?
– Occurrence (counting, absolute scale)
• Existing cases? New cases?
– Time
• Calendar time? (NB: interval scale, preserved under pos. lin. xform)
• Duration of time (NB: ratio scale, preserved under similarity xform)
• More about this in Fred Roberts’s tutorial
Ex. Hypothetical Frequency of
AIDS in Two Cities
City A
City B
# new cases
58
35
time period
1985
1985-86
population
25,000
7,000
Annual "rate" of AIDS
City A = 58/25,000/1yr = 232/100,000/yr
City B = 35/7,000/2 yrs = 17.5/7000/yr =
250/100,000/yr
Make it easy to compare rates (i.e., make them “commensurable”) by using
same population unit (say, per 100,000 people) and time period (say, 1
year)
NB Commensurability is property of underlying relational system used in
measurement (treated in Roberts tutorial)
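The conversion to a common unit can be checked with a short calculation. The function name `annual_rate_per_100k` is made up for this sketch; the inputs are the hypothetical City A and City B figures above.

```python
def annual_rate_per_100k(new_cases, population, years):
    """New cases per 100,000 people per year."""
    return new_cases / population / years * 100_000

city_a = annual_rate_per_100k(58, 25_000, 1)   # about 232 per 100,000 per year
city_b = annual_rate_per_100k(35, 7_000, 2)    # about 250 per 100,000 per year

print(round(city_a, 1), round(city_b, 1))
```

Once both figures share the same population unit and time unit, City B's rate is seen to be slightly higher even though it reported fewer new cases.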
Three kinds of quantitative
measures of frequency of
occurrence
Used to relate number of cases of disease, size of
population, time
• Proportion: numerator is subset of denominator, often
expressed as a percentage
• Ratio: division of one number by another, numbers don't
have to be related
• Rate: time (sometimes space) is intrinsic part of
denominator, term is often misused (e.g., “birthrate”)
Need to specify if measure represents events or people
(Point) Prevalence
(P) Quantifies number of existing cases of disease
in a population at a point in time
• P = Number of existing cases of disease (at a given
point in time)/ total population
• Ex. City A has 7000 people with arthritis on Jan
1st, 2002
• Population of City A = 70,000
• Prevalence of Arthritis on Jan 1st = .10 or 10%
Prevalence is a proportion
Incidence - quantifies number of
(a) new cases of disease that
(b) develop in a population at risk
(c) during a specified time period
Three key ideas:
• New disease events, or for diseases that can occur
more than once, usually first occurrence of disease
• Population at risk (candidate population) - can't have
disease already, should have relevant organs
• Enough time must pass for a person to move from
health to disease
Two Types of Incidence Measures
Cumulative Incidence
(“Attack Rate”) (Abbreviated Cum Inc. CI)
Incidence Rate
(“Incidence Density”) (Abbreviated I, IR, ID)
Incidence rate (I, IR) = (# new cases of disease) / (total person-time of observation)
Also called incidence density (ID)
Accrual of Person-Time
            Jan 1980        Jan 1981        Jan 1982
Subject 1   -----------------------x                       1.1 Person-Year (PY)
Subject 2        -------------------------x                1.2 PY
Subject 3   --------------------------------------------   2.2 PY
Total                                                      4.5 PY
X = outcome of interest, incidence rate = 2/4.5 PY
Some Ways to Accrue 100PY
• 100 people followed 1 year each = 100 py
• 10 people followed 10 years each= 100 py
• 50 people followed 1 year plus 25 people followed 2 years = 100 py
Time unit for person-time = year, month or day
Person-time = person-year, person-month, person-day
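A minimal sketch of person-time accrual, using the three subjects from the accrual example (1.1, 1.2 and 2.2 person-years, with the outcome occurring in the first two). The class `FollowUp` and the function `incidence_rate` are names invented for this illustration.

```python
from dataclasses import dataclass

@dataclass
class FollowUp:
    person_years: float   # time this subject was observed while at risk
    had_event: bool       # True if the outcome of interest occurred

subjects = [
    FollowUp(1.1, True),    # Subject 1
    FollowUp(1.2, True),    # Subject 2
    FollowUp(2.2, False),   # Subject 3, no event during follow-up
]

def incidence_rate(follow_ups):
    """New cases divided by total person-time of observation."""
    events = sum(f.had_event for f in follow_ups)
    person_time = sum(f.person_years for f in follow_ups)
    return events / person_time

print(round(incidence_rate(subjects), 3))   # 2 events / 4.5 PY, about 0.444 per PY
```

Each subject contributes only the time actually observed, which is why the denominator is 4.5 person-years rather than 3 people.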
Ex.: (Cohort) study of risk of breast cancer
among women with hyperthyroidism
• Followed 1,762 women ---> 30,324 py
• Average of 17 years of follow-up per woman
• Ascertained 61 cases of breast cancer
• Incidence rate = 61/30,324 py
= 201/100,000 py
= .00201/y
(.00201 x 100,000 p/100,000 p)
Dimensions
Prevalence = people / people (no dimension)
Cumulative incidence = people / people (no dimension)
Incidence rate = people / people-time (dimension is $\text{time}^{-1}$)
Types of (instantaneous) rates
Relative rate (person-time or incidence rate): $\frac{1}{N}\frac{dN}{dt}$
Absolute rate (used in infectious disease epi and health services): $\frac{dN}{dt}$
Also where units do not involve time, such as accidents per
passenger mile or cases per square area

Relationship between prevalence and incidence
P = IR x D
• Prevalence depends on incidence rate and duration of disease (duration lasts from
onset of disease to its termination)
• If incidence is low but duration is long - prevalence is relatively high
• If incidence is high but duration is short - prevalence is relatively low
• This is an example of Little’s equation in queuing theory:
time-avg number of units in the system = arrival rate x avg delay time/unit
• This equation is true if ...
Conditions for equation to be true:
• Steady state
• IR constant
• Distribution of durations constant
• Prevalence of disease is low (less than 10%)
In queuing theory terms: strictly stationary process in steady state
conditions
Figuring duration from prevalence and
incidence
Lung cancer incidence rate = 45.9/100,000 py
Prevalence of lung cancer = 23/100,000
D = P / IR = (23/100,000 p) / (45.9/100,000 py) = 0.5 years
Conclusion: Individuals with lung cancer survive, on average,
about 6 months from diagnosis to death
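The steady-state relation P = IR x D and its rearrangement for duration can be sketched as follows. The function names are illustrative, and the calculation is only meaningful under the conditions listed earlier (steady state, constant IR, low prevalence).

```python
def steady_state_prevalence(incidence_rate, mean_duration):
    """P = IR x D, assuming steady state and low prevalence."""
    return incidence_rate * mean_duration

def mean_duration(prevalence, incidence_rate):
    """D = P / IR, the same relation solved for duration."""
    return prevalence / incidence_rate

ir = 45.9 / 100_000   # lung cancer incidence rate, per person-year
p = 23 / 100_000      # lung cancer prevalence (a proportion)

d = mean_duration(p, ir)
print(round(d, 2))    # about 0.5 years
```

The people units cancel and the result carries the dimension of time, which matches the dimensional bookkeeping for prevalence and incidence rate.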
Uses of Prevalence and Incidence Measures
• Prevalence: administration, planning
• Incidence: etiologic research (problems with
prevalence since it combines IR and D), planning
Common measures of disease
frequency for public health
– Crude death (mortality) rate:
total number of deaths from all causes per 1,000 people, for one year
(also cause-specific, age-specific, race-specific death rate)
– Live-birth rate:
total number of live births per 1,000 people (sometimes women of childbearing age), for one year
– Infant mortality rate:
# deaths of infants under 1 year of age per 1,000 live-births, for one year
Frequency measures used in infectious disease epidemiology
• Attack rate:
(# cases of disease that develop during defined period) / (# in pop. at risk at start of period)
(usually used for infectious disease outbreaks)
• Case fatality rate:
(# of deaths) / (# cases of disease), for a defined period of time
• Survival rate:
(# living cases) / (# cases of disease), for a defined period of time
Tutorial part 2: Exposure Disease Relationship
Analytic epidemiology
Reprise: Epidemiology is a
science within public health
• This means that it adopts a population
perspective
• As a science, it is also quantitative
• As a science, it is also interested in
explanation and prediction, not just
describing
Questions asked by communities
• Exposure driven questions
– “What will happen to me, my family, my
community?”
• Outcome driven questions
– “Why me, why my child, why us?”
• Mixed
– “Are we sicker than our neighbors?”
The usual notion of causation:
John Stuart Mill’s “Method of
Difference”
• A causes B if, all else being held constant, a change
in A is accompanied by a subsequent change in B.
– This of course does not mean that nothing else can produce a
change in B.
• The formal method to detect such an occurrence is
the Experiment, whereby all things are held constant
except A and B, A is varied, and B observed
Expt’l vs. Observational Science
• Epidemiology is an “observational” science
• We do not control the independent variable
(or most other variables)
• What is the implication of this for the status
of epidemiology as a science?
• What does it mean about epidemiology’s
ability to “prove causation”?
Sources of information
• Case studies
• Experimental studies
• Observational studies
Once results are observed, it remains
to explain or interpret the observation,
whether the result is a difference or a
lack of a difference in the compared
entities.
Types of observational study
designs
• Descriptive
– Case study and case-series
– No comparison: Person, place and time
– Cross-sectional comparison (“Are we sicker than our neighbors?”)
– “ecological” (comparing communities/environments; not individual level)
– Notice how descriptive and analytic shade into each other (as per
examples we did earlier)
• Cohort (“What’s going to happen to me?”)
– Analog of the laboratory experiment
• Case-control (“Why me?”)
Central idea: compare frequencies
of occurrence in two groups
• Example: Summarize relationship between exposure and disease by
comparing two measures of disease frequency
• Overall rate of disease in an exposed group says nothing about whether
exposure is a risk factor for (“causes”) a disease
• This can be evaluated by comparing disease incidence in an exposed
group to another group that is not exposed, (a “comparison group”)
• Comparison or contrast is the essence of epidemiology
Two Main Options for Comparing disease
frequencies
1. Calculate ratio of two measures of disease frequency ( a
measure in exposed group and a measure in unexposed
comparison group)
2. Calculate difference between two measures of disease
frequency (a measure in exposed group and a measure in
unexposed comparison group)
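The two options can be illustrated with a small calculation. The counts below are hypothetical numbers chosen only for this sketch (they do not come from any study), and the function names are likewise made up.

```python
def rate(cases, person_years):
    """Incidence rate: new cases per person-year."""
    return cases / person_years

def rate_ratio(rate_exposed, rate_unexposed):
    """Option 1: ratio of the two frequency measures."""
    return rate_exposed / rate_unexposed

def rate_difference(rate_exposed, rate_unexposed):
    """Option 2: difference of the two frequency measures."""
    return rate_exposed - rate_unexposed

# Hypothetical data: 30 cases in 10,000 py exposed, 10 cases in 10,000 py unexposed.
ir_exposed = rate(30, 10_000)
ir_unexposed = rate(10, 10_000)

rr = rate_ratio(ir_exposed, ir_unexposed)        # dimensionless
rd = rate_difference(ir_exposed, ir_unexposed)   # still per person-year

print(round(rr, 2), round(rd * 10_000, 1))       # ratio; difference per 10,000 py
```

The ratio is dimensionless while the difference keeps the units of the underlying measure, so the two answer different questions about the same comparison.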
At the heart of an epidemiological
study ...
• Lies a comparison
– Between 2 rates, ratios, proportions
• Is the difference/lack of difference due to
– Bias?
– Chance?
– Real effect?
Determinants of the comparison
• Compared measures differ or they don’t ($\mathbb{R}$ is
linearly ordered)
• Either way, the comparison may be affected by:
– Chance (sample variation)
– Bias
– Real effect or lack of effect
• To interpret the comparison and evaluate the last
factor, we need to account for effects of the first
two
Role of statistics
• Evaluates role that chance might play in the
absence of any other factor
• Also used for summary purposes or to
express a model mathematically
• Not the main preoccupation of
epidemiologists, however
• Bias is main preoccupation of
epidemiologists
Evaluating the role of bias
• Epidemiology is observational discipline, so uncontrolled
variables abound
• Most of training is in recognizing and accounting for
sources of bias, often extremely subtle
• Less emphasis on role of chance, often handed over to
biostatisticians
• Extent to which content area (“real effect”) taken into
account varies with investigator and who collaborators are
I. Definition of Bias
Bias is a systematic error that results in an incorrect (invalid) estimate
of the measure of association
A. Can create spurious association when there really is none (bias away
from the null)
B. Can mask an association when there really is one (bias towards the
null)
C. Bias is primarily introduced by the investigator or study participants
I. Definition of Bias (con’t)
D. Bias does not mean that the investigator is
“prejudiced” or “not objective”
E. Bias can arise in all study types: experimental, cohort,
case-control
F. Bias occurs in the design and conduct of a study. It
cannot be fixed in the analysis phase.
G. Two main types of bias are selection and information
bias, but there are many other types of bias
H. We will consider only selection and information bias for purposes
of illustration of epidemiologic practice
II. Selection Bias
A. Results from procedures used to select subjects into a
study that lead to a result different from what would have
been obtained from the entire population targeted for study
B. Most likely to occur in case-control or retrospective
cohort because exposure and outcome have occurred at
time of study selection
II. Selection Bias in a Case-Control Study
A. Occurs when controls or cases are more (or less) likely to
be included in study if they have been exposed -- that is,
inclusion in study is not independent of exposure
B. Result: Relationship between exposure and disease
observed among study participants is different from
relationship between exposure and disease in individuals
who would have been eligible but were not included. The
odds ratio (OR) from a study that suffers from selection bias
will incorrectly represent the relationship between exposure
and disease in the overall study population
Selection Bias: Case-Control Study
Question: Do PAP smears prevent cervical cancer? Cases diagnosed at a city hospital. Controls
randomly sampled from households in same city by canvassing the neighborhood on foot. Here
is the true relationship:
                          Cervical Cancer Cases    Controls
Had PAP smear                      100                150
Did not have PAP smear             150                100
Total                              250                250
OR = (100)(100) / (150)(150) = .44 There was a 56% reduced risk of
cervical cancer among women who had PAP smears as compared to women
who did not. (40% of cases had PAP smears versus 60% of controls)
Selection Bias : Case-Control Study
(con’t)
Recall: Cases from the hospital and controls come from the
neighborhood around the hospital.
Now for the bias: Only controls who were at home at the
time the researchers came around to recruit for the study
were actually included in the study. Women at home were
more likely not to work and were less likely to have
regular checkups and PAP smears. Therefore, being
included in the study as a control is not independent of the
exposure. The resulting data are as follows:
Selection Bias (con’t)
                          Cervical Cancer Cases    Controls
Had PAP smear                      100                100
Did not have PAP smear             150                150
Total                              250                250
OR = (100)(150) / (150)(100) = 1.0
There is no association between Pap smears and the risk of cervical
cancer. Here, 40% of cases and 40% of controls had PAP smears.
Selection Bias : Case-Control Study
(con’t)
Ramifications of using women who were at home during
the day as controls:
These women were not representative of the whole study
population that produced the cases. They did not accurately
represent the distribution of exposure in the study
population that produced the cases, and so they gave a
biased estimate of the association.
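The effect of the biased control selection on the odds ratio can be reproduced directly from the two PAP smear tables. The function `odds_ratio` and its argument names are illustrative, not part of the original slides.

```python
def odds_ratio(exposed_cases, exposed_controls,
               unexposed_cases, unexposed_controls):
    """Cross-product odds ratio for a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

# True relationship: 100 of 250 cases and 150 of 250 controls had PAP smears.
true_or = odds_ratio(100, 150, 150, 100)

# Controls recruited at home during the day: only 100 of 250 had PAP smears.
biased_or = odds_ratio(100, 100, 150, 150)

print(round(true_or, 2), biased_or)   # about 0.44 versus 1.0
```

Only the control column changed between the two calls, yet the estimate moves from a protective association to the null. This is bias towards the null produced entirely by how controls were selected.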
When interpreting study results, ask
yourself these questions …
• Given conditions of the study, could bias
have occurred?
• Is bias actually present?
• Are consequences of the bias large enough
to distort the measure of association in an
important way?
• In which direction is the distortion: is it
towards the null or away from the null?
Imputation of Causality
What are the roles of …
• Bias: The critique checklist
• Chance: “Statistical significance”
• Real effect
– The Hill “viewpoints”
• Not necessary “criteria” (not even criteria)
• Not a checklist
– The way it’s really done...
“Marks” of causality
• Strength of association
• Biologically plausible
• Biological gradient (“dose-response”)
• Appropriate temporal relationship
• Specificity
• “Consistency”
The “Fundamental Question”
(according to Hill)
• “Clearly none of these nine viewpoints can bring indisputable
evidence for or against a cause-and-effect hypothesis and
equally none can be required as a sine qua non. What they can
do, with greater or less strength, is to help us to answer the
fundamental question--is there any other way of explaining the
set of facts before us, is there any other answer equally, or
more, likely than cause and effect?”
How it’s really done…
• Assemble the evidence from the literature.
What are the pieces of the jigsaw?
– How do you decide?
• Where do they fit?
– How do you decide?
Interpretation
• Evaluate the evidence (a study) for internal
validity
• Evaluate the evidence for external validity
• Bottom line
– What roles are played by bias, chance, real
effect?
Assemble the jigsaw pieces into
a picture
• The “picture” is your version of “causality”
• Your picture may disagree with other
scientists’
– Disagreement among scientists is the rule, not
the exception
Mathematics in epidemiology
• Traditional
– Evaluate role of chance (statistical hypothesis
testing; estimation)
– Descriptive (compact summary or generative
model)
– Infectious disease epidemiology dynamics
Comparing chronic and
infectious disease epidemiology
[Figure: compartment diagram with boxes S, P (I) and R, linked by the rates listed below]
Rates in the diagram:
– birth rate or migration in-rate
– incidence rate or infectivity rate
– mortality and recovery rates (subscript 1 = case fatality rate,
subscript 2 = background mortality rate)
Prevalence “rate” = P/(S+P)
Comparing chronic and
infectious epi (cont’d)
• Chronic
– Usually concentrate on the incidence
rate because interested in etiology
– Have to account for fact that the
incidence rate is a function of calendar
(?metric), sex, race, SES,
occupation, co-morbid
conditions, latency
– But not usually population
size or density, number of
other cancer cases, etc.
• Infectious
– Interest in the incidence (infectivity)
rate usually limited to its value as a
parameter; we know the etiology
– Interested in dynamics over
time and space, existence of
thresholds or periods, effect of
parameters and initial
conditions like size of initial
population, infectivity, mode
of contact
Difference is one of emphasis and interest, not concepts
Some new uses for mathematics
in epidemiology
• Formalization and theoretical tools
• Pattern and rule detection (“data mining”)
• Descriptive modeling
• Prediction from data
– Classification
– Taxonomy
• Data organization and retrieval from large
databases
• Patient confidentiality/coding/cryptography
• Multi-scale inference
• Network construction/applications, etc.