Interpreting regression for non

Interpreting regression for
non-statisticians
Colin Fischbacher
What this presentation will cover
•
•
•
•
•
overview of regression methods
what are they and why use them?
what do results from regression look like?
how do you interpret those results?
what pitfalls should I look out for?
What is regression?
• regression relates two kinds of variables:
• outcome variables: for example
– 30 day mortality
– Blood pressure
– CHD admission rate
• explanatory variables: for example
– age
– sex
– treatment type
What is regression? (2)
are these variables related?
if so, in what way?
165
160
155
BPall
150
145
140
0
20
40
60
80
100
What is regression? (3)
the red line is an estimate of the relationship that best fits the data
we have
other estimates are possible
165
160
155
150
145
140
0
20
40
60
80
100
What is regression? (3)
the red line is an estimate of the relationship that best fits the data
we have
other estimates are possible
165
160
155
150
145
140
0
10
20
30
40
50
60
70
80
90
What is regression? (4)
regression can examine more than one explanatory
variable at a time
165
males in red,
females in black
. . . females have higher
blood pressure overall
160
155
BPmale
BPfemale
150
145
140
What is regression? (5)
males in red,
females in black
. . at each age male blood pressure is higher
165
160
155
BPmale
BPfemale
150
145
140
0
20
40
60
80
100
What is regression? (6)
• here regression is used to estimate how much blood pressure rises
with age (so many mm/yr)
• taking this effect of age into account, regression is used to estimate
how much higher male blood pressure is than female blood
pressure (so many mm higher, taking into account age)
165
males in red,
females in black
160
155
BPmale
BPfemale
Linear (BPmale)
150
Linear (BPfemale)
145
140
0
20
40
60
80
100
Why use regression methods?
• There are other methods to adjust for one or two
variables
– standardisation
– stratification
• These methods deal well with one or two
explanatory variables (usually age or sex)
• Regression allows you to take into account the
effect of many variables at the same time
• Answers the question “What’s the effect of this
variable allowing for all the other ones in the
model?”
What methods are available?
• Depends on outcome variable . .
• Continuous variable (eg blood pressure)
– linear regression
• Yes/no/binary outcome (eg dead/alive)
– logistic regression
• Rate variable (eg admissions per year)
– Poisson regression
• Time to event (eg death from cancer)
– Cox regression/ survival analysis
• (Many other types also available)
Linear regression
Continuous variable (eg blood pressure)
165
160
155
150
145
140
0
10
20
30
40
50
60
70
80
90
Linear regression
Continuous outcome data (eg blood pressure)
Blood pressure
mmHg
Age (per year)
0.5 (0.3, 0.7)
Sex (male)
4.0 (3.5, 4.5)
Ethnic group
White
0 (ref)
South Asian
3.5 (3.0, 4.0)
Afro-Carribean
4.1 (3.6, 4.6)
Logistic regression
Yes/no/binary outcome (eg dead/alive)
Death within 30 days of heart attack
Age
Odds ratio (95% CI)
30-50 years
1.0
51-60 years
1.5 (1.1, 1.9)
61-80 years
2.5 (1.5, 3.0)
Male
1.0
Female
1.2 (1.1, 1.3)
Sex
Blood pressure (per 10mmHg)
1.5 (1.4, 1.6)
Poisson regression
Rate variable (eg admissions per year)
Emergency admission for COPD
Sex
Rate ratio (95% CI)
Females
1.0
Males
1.2 (0.5, 1.9)
Additional co-morbidities
None
1.0
Present
2.5 (2.2, 2.8)
Age (per 10 year increase)
1.5 (1.3, 1.7)
Cox regression
Time to event (eg recurrence of cancer)
100
Treatment
Hazard ratio (95% CI)
Previous treatment 1.0
New drug X
0.5 (0.2, 0.8)
Stage of disease
Grade 1
1.0
Grade 2
0.9 (0.5, 1.3)
Grade 3
1.5 (1.2, 1.8)
Age
1.01 (1.005, 1.015)
Percentage without relapse
Time to recurrence of cancer
90
80
70
60
50
40
30
20
10
0
0
10
Old treatment
20
Time
30
New treatment
40
Some notes of caution
• Regression is technically easy with most stats
packages (point and click)
• However skill is needed:
–
–
–
–
to choose the right method and the best model
to select how many and which variables to include
to check that the final model fits well
to interpret the final results
• There are always important assumptions
• Modelling requires experience and judgement
and includes a degree of subjectivity
What should I look for?
• The kind of model used (logistic, Poisson etc)
• The variables included in the model
• The effect estimates for each variable (or
“parameter”)
• For each categorical variable an indication of
which category is the reference category
(usually given a null effect size)
• An assessment of the goodness of model fit
What do the results mean?
Effect estimates (may be called coefficients) may be:
• Single figures
• Odds ratios
• Rate ratios
• Hazard ratios
Linear regression
Continuous outcome data (eg blood pressure)
Blood pressure
mmHg (95% CI)
Age (per year)
0.5 (0.3, 0.7)
Sex (male)
4.0 (3.5, 4.5)
Ethnic group
White
0 (ref)
South Asian
3.5 (3.0, 4.0)
Afro-Carribean
4.1 (3.6, 4.6)
Logistic regression
Yes/no/binary outcome (eg dead/alive)
Death within 30 days of heart attack
Age
Odds ratio (95% CI)
30-50 years
1.0
51-60 years
1.5 (1.1, 1.9)
61-80 years
2.5 (1.5, 3.0)
Male
1.0
Female
1.2 (1.1, 1.3)
Sex
Blood pressure (per 10mmHg)
1.5 (1.4, 1.6)
Poisson regression
Rate variable (eg admissions per year)
Emergency admission for COPD
Sex
Rate ratio (95% CI)
Females
1.0
Males
1.2 (0.5, 1.9)
Additional co-morbidities
Age
None
1.0
Present
2.5 (2.2, 2.8)
1.01 (1.005, 1.015)
Cox regression
Time to event (eg recurrence of cancer)
100
Treatment
Hazard ratio (95% CI)
Previous treatment 1.0
New drug X
0.5 (0.2, 0.8)
Stage of disease
Grade 1
1.0
Grade 2
0.9 (0.5, 1.3)
Grade 3
1.5 (1.2, 1.8)
Age (per 10 years) 1.5 (1.4, 1.6)
Percentage without relapse
Time to recurrence of cancer
90
80
70
60
50
40
30
20
10
0
0
10
Old treatment
20
Time
30
New treatment
40
What else should I look for?
• Is the basic question clear?
– why was a regression method chosen?
• Was the correct model used?
– logistic if yes/no outcomes, Poisson if rates etc
• Which variables were included?
– Were any ones you think are important left out?
• How were the variables chosen?
– modelling strategies and results of exploration?
• How many variables were included?
– 10 -20 cases per variable approximate rule of thumb
• Effect sizes (or “coefficients”) and confidence intervals
• Were measures of model fit reported?
regression methods
REAL LIFE EXAMPLES
Cox regression
McBride and colleagues (BMJ Dec 4,
2010) conducted a study of patients in 324
UK general practices and examined the
time they waited between consulting their
GP with hip pain and being referred to
secondary care.
Age group (years)
The figures show hazard ratios for referral
from a Cox regression model that included
age group, sex and deprivation quintile
Sex
55-64
1.00
65-74
1.18
75-84
1.13
>=85
0.68
Male
1.00
Female
0.90
Deprivation
1 (least deprived)
1.00
2
0.92
3
0.84
4
0.80
5 (most deprived)
0.72
Poisson regression
Sim and colleagues (BMJ Dec 4, 2010) conducted a study to examine
changes in the rate of emergency admission for acute myocardial infarction
before and after the introduction of smoke free legislation in England. After
adjusting for year of admission, temperature, Christmas holidays and week
of admission in a Poisson regression model, they obtained the results shown
in the table.
BMJ 340: doi:10.1136/bmj.c2161
Models
Poisson co-efficient
Percentage change
1. All
0.9763
−2.37 (−4.06 to −0.66)
2. Men under 60
0.9654
−3.46 (−5.99 to −0.85)
3. Women under 60
0.9754
−2.46 (−7.62 to 3.00)
Logistic regression
Alm and colleagues interviewed parents of 294 cases of Sudden Infant
Death Syndrome (SIDS) in three Scandinavian countries, asking about
coffee and alcohol consumption by the mother.
Alcohol >5 units in 24 hrs
before SIDS
Caffeine consumption
(>800mg/24h) after pregnancy
Crude OR
Adjusted OR *
12.2 (2.5, 59.3)
5.9 (1.0, 33.9)
3.1 (1.5, 6.3)
2.0 (0.8 to 5.2)
* adjusted for maternal smoking in 1st trimester, maternal age, education and parity
Arch Dis Child 1999;81:107-111 doi:10.1136/adc.81.2.107
Conclusions
Regression methods allow you to examine the
effects of many variables simultaneously
However they do not give “automatic” answers
Care is needed in choice of method, selection of
variables, testing the final model and interpreting
the results
Model building always involves some degree of
judgement and personal choice