2: Continuing with unadjusted Effect Estimates

Applied Epidemiology 304
Tutorial Guide
Taught by: Simon Thornley
Adapted by Simon Thornley from material initially prepared by Professor Robert Scragg,
University of Auckland.
Table of Contents
1: Exploring data and unadjusted (univariate) effect estimates
2: Continuing with unadjusted Effect Estimates
3: Controlling for confounding using stratification (Mantel-Haenszel method)
4: Interaction or effect modification: using 2x2 tables
5: Introduction to logistic regression
6: Using logistic regression to investigate effect modification
7: Using logistic regression to examine effect modification (2)
8: Sample size using Statcalc
References
New Zealand Cot Death Study
This study was undertaken in the early 1980s to determine the cause of an apparent rise in the
incidence of cot death in New Zealand.[1] The researchers selected cases of cot death and
administered a questionnaire to parents, a sample of which is contained in the following
dataset used for this exercise. Control infants were recruited from the community at the same
time as cases (incidence-density sampling). A variety of exposures, thought to cause or
contribute to cot death were considered, along with a number of confounding variables. In this
tutorial, we will first consider the effect of bed sharing on cot death. Then we consider the
possible confounding effect of socio-economic status, and how this variable impacts on the
exposure of interest. We will then consider the effect of maternal smoking on the
relationship between bed sharing (exposure) and cot death (outcome).
The data used in this session is a subsample of the actual data used for the publications, so the
analyses won’t match exactly what was reported in medical journals.
In addition to the output derived from Epi Info, I have also produced some supplemental
figures with SPAN[2] and the Epicalc[3] package for R. These figures cannot be
produced with Epi Info and are included here only to (try to!) improve your
understanding of the material. They are not examinable.
SIDS = Sudden Infant Death Syndrome
Table: Data dictionary for the Excel file SIDS_EpiInfo.
Variable Name | Description | Values
REGION | Region of the country | 1 = Auckland; 2 = Waikato; 4 = Wellington; 5 = Christchurch; 6 = Dunedin
CASE | Case control status | 1 = case; 2 = control
ETHNIC | Ethnicity | 1 = Maori; 2 = Pacific Island; 3 = European
SEX | Gender of infant | 1 = Male; 2 = Female
INFANT_AGE | Infant’s age at death or interview (weeks) | Continuous variable
INFANT_AGE_GRP | Infant’s age at death or interview (weeks): grouped | 1 = <13; 2 = 13 to 19; 3 = 20 to 25; 4 = > 26
BIRTH_WT | Infant’s birth weight (gms) | Continuous variable
BIRTH_WT_GRP | Infant’s birth weight (gms): grouped | 1 = <2500; 2 = 2500 to 2999; 3 = 3000 to 3499; 4 = > 3500
GESTATION | Length of pregnancy (weeks) | Continuous variable
MOTHER_AGE | Mother’s age at birth (years) | Continuous variable
MOTHER_AGE_GRP | Mother’s age at birth (years): grouped | 1 = ≤19; 2 = 20-24; 3 = 25-29; 4 = ≥ 30
ANTINAT | 1st attendance at antenatal clinic | 1 = <3 months; 2 = > 3 months
OCCUPATION | Household SES category | 1 = 1 & 2 (high); 2 = 3 & 4; 3 = > 5 (low)
SEASON | Season of the year | 1 = Jan-Feb; 2 = Dec-March; 3 = Nov-Apr; 4 = Oct-May; 5 = Sept-June; 6 = Aug-July
BEDSHARE | Bed share in the last sleep | 1 = Yes; 2 = No
SLEEP_POSITION | Position in last sleep | 1 = back; 2 = side; 3 = front, face down; 4 = front, face to side; 5 = other
DUMMY | Used dummy in last sleep | 1 = Yes; 2 = No
MAIN_MILK | Main type of milk drunk by baby in last 2 days | 1 = breast; 2 = bottled cow’s milk; 3 = modified cow’s milk; 4 = soya based milk; 5 = goat’s milk; 6 = other special milks
MOTHER_TOBACCO | Mother smoked cigarettes in the last 2 weeks | 1 = Yes; 2 = No; 3 = occasional <1/day
FATHER_TOBACCO | Father smoked cigarettes in the last 2 weeks | 1 = Yes; 2 = No; 3 = occasional <1/day
MOTHER_CANNABIS | Mother had cannabis since birth of baby | 1 = Yes; 2 = No; 3 = chose not to answer
CANNABIS_FREQ | Frequency of mother’s cannabis use | 1 = daily; 2 = weekly; 3 = monthly; 4 = less often
1: Exploring data and unadjusted (univariate) effect estimates
Use the TABLES command to calculate unadjusted odds ratios
Use SELECT & CANCEL SELECT to select two exposure levels for calculating odds ratios, if there
are more than two exposure levels
Commands in this Lesson
The following commands are used in this lesson.
READ/IMPORT
READ is the most commonly used command. The Read (Import) command changes the current data
source and/or the current project. It removes any standard defined variables. The READ command
operates on many different types of data: Epi Info™ can read 24 different types of files. Located in
the Data folder.
DISPLAY
Use this command to display table, view, and database information. Use the display option Variables
Currently Available to see all the variables in the dataset, including names, field types, and format
information. Use Display, prior to merging or creating statistics, to ensure that field types and
variables names have been coded as needed. Located in the Variables folder.
LIST
The List command creates a listing of the current data table. Lists can be customized to list all,
exclude, or show specific records. Located in the Statistics folder.
SELECT
The Select command specifies a condition that must be true for a record to be processed. Use to select
a set of records for analyses. For example, select records based on gender or zip code. Located in the
Select/If folder.
CANCEL SELECT
The Cancel Select command cancels a previous SELECT command. Located in the Select/If folder.
Command for calculating the crude odds ratio and Mantel-Haenszel odds ratio (or relative risk)
TABLES
Use this command to create frequencies or counts (total numbers) of categorical variables. Categorical
variables mean that each value falls into one of a set of groups (eg case or control status, ethnicity, bed
sharing), compared to a measured, numeric variable, such as age. Numeric variables can, of course, be
divided into categories to make them ‘categorical’. Tables can help assess whether a
risk factor is associated with an outcome. For these values to have their accepted epidemiological meanings,
the values representing presence of the exposure (independent variable) and of the outcome
(dependent variable) must appear in the first row and column of the table. Epi Info yes/no variables are
automatically sorted. Values of the first selected variable will appear across the top of the table, and
those of the second selected variable will be on the left hand side (margin) of the table.
What does the output mean?
Normally cells contain counts of records matching the values in the corresponding marginal labels.
For 2x2 tables, the command produces odds ratios and risk ratios. For tables other than 2x2, Chi-square
statistics are computed. The p-value is the probability of seeing an association (measured by the odds
ratio) at least as strong as the one observed if there were truly no relationship between exposure and
outcome. If the p-value is very low, the observed association is unlikely to have arisen by chance
alone. By convention, a p-value of <.05 is taken to mean that the association between the risk
factor and disease is unlikely to be due to chance alone.
When measuring associations between exposures and outcomes, it is important to
consider whether observed effects are due to “confounding”. This occurs when an observed
association, such as between an exposure (bed sharing) and an outcome (cot death), may be explained
by the presence of a third factor (socioeconomic status) which is linked to both the exposure and
the outcome, and accounts for the observed effect. This is shown diagrammatically below, in which the
association between bed sharing and cot death (dashed line) may be explained if socioeconomic status
is linked to both bed sharing and cot death.
In this series of tutorials, we introduce two methods of controlling for confounding. Firstly, using
stratification (the Mantel-Haenszel test), and secondly using regression modelling.
Is ethnicity associated with risk of cot death (SIDS)?
You are going to use the TABLES command to examine the relationship between two or more
categorical variables. You want to see if risk of cot death is associated with ethnicity.
1. READ the Excel file called SIDS_EpiInfo.
In the READ dialogue box, select Excel 8.0 in the DATA_FORMATS drop down menu, then
navigate to the path with the SIDS_EpiInfo.xls file.
Select SIDS under the WORKSHEETS space, then click OK.
Exploring data in Epi Info.
Before you start any analysis, it is worthwhile visualising the data to make sure it is in the
correct range, and does not contain any errors.
We will briefly cover a couple of useful commands.
In this session, we are most interested in the variables BEDSHARE, CASE, and ETHNIC.
2. Let us look at the distribution of these variables. We will start with BEDSHARE. To get a graphic
display and a count of the number of records in each category, use the “Frequencies” command and
select BEDSHARE. Press OK.
You should see something like this:
You can see that roughly half the population bedshare (BEDSHARE=1) and the other half do
not (BEDSHARE=2).
In the same way we can examine ETHNIC and CASE
What proportion of the study are European (ETHNIC=3), Maori (ETHNIC=1) and Pacific
(ETHNIC=2)?
What was the proportion of cases (CASE=1) to controls (CASE=2) in the study?
Although we will not look at continuous data until much later, you might ask how can we
explore data that is continuous, for example, MOTHER_AGE. If you use the “Frequencies”
command, you get a lot of output. A better method is to use the “graph” function, under
“statistics”. Select “histogram” from the “graph type” drop down menu on the top left of the
window, then the variable “MOTHER_AGE” from the “main variable” drop down box.
Then press “OK”.
This histogram tells you a lot about this variable. For instance, the range of values is between
13 and 45. The extremes are believable, so you do not suspect a coding error. Also, you see
most of the values are between 25 and 31 years. Again, this makes sense. If you were to
categorise this variable, the histogram would help decide where to select cut points so that
you get roughly the same number of individuals in each category, or at least enough to get
adequate statistical power.
Having decided that the quality of the data is ok, we will now turn to doing some basic effect
measures.
3. From the Command Tree Statistics folder, click Tables. The Tables dialog box opens.
4. From the Exposure Variable drop-down, select ETHNIC (they are listed in alphabetical order).
5. From the Outcome Variable drop-down, select CASE.
6. Click OK. Results appear in the Output window. The output is a table with 2 disease categories (cases
and controls), and three exposure levels.
No odds ratio or relative risk values are shown because there are more than two exposure levels. To
calculate odds ratios, you have to select only two ethnic groups for your analysis; firstly by comparing
Maori infants with European, and then by comparing Pacific Island infants with European. European
infants have been chosen as the reference category as they are a large group with a low risk of cot
death: the column percent for European infants is much lower among cases (49.3%) than among controls (73.5%).
The same strategy can be used for comparing other variables that have many categories, such as age.
Exposure: Maori vs European
7. From the Select/If folder, click Select. The Select dialogue box opens. Choose Ethnic=1 (Maori) and
Ethnic=3 (European), and click OK. Note that the Record Count is now 1706 (compared to 1862 when
the Excel file was originally read).
8. Click Tables. The Tables dialog box opens. From the Exposure Variable drop-down, select ETHNIC.
From the Outcome Variable drop-down, select CASE, and click OK.
Results appear in the output window. The odds ratio and risk ratio are now both shown. Odds ratios
are the appropriate values for a case control study. The cross-product odds ratio shows that Maori
infants have 3.77 times the odds of cot death compared with European infants. This is a statistically significant
result, since the 95% confidence interval (2.94, 4.84) does not include the reference value of 1. The
P-value gives similar information to the confidence interval, but answers the question “How weird is this
result if ethnic group exerted no effect on cot death?”. The P-value is quoted as “0.000000”, or
“<0.00001” if you want to be technically correct: even if a result is very weird, it can never be
impossible. The low P-value indicates that if ethnic group (Maori vs European) had no effect on cot
death (null hypothesis), this result would be very, very weird, or almost impossible! Therefore the null
hypothesis is rejected, and we think that ethnic group does influence risk of cot death. Most
epidemiologists report odds ratios (or risk ratios) rounded to 2 decimal places.
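As a rough check on the Epi Info output, the cross-product odds ratio and an approximate 95% confidence interval can be reproduced by hand. The sketch below uses the Maori and European counts from this lesson (165 and 183 cases, 262 and 1096 controls). It assumes a log-based (Woolf) interval, which may not be the exact method behind every limit Epi Info prints; the function name is my own.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Cross-product odds ratio with an approximate (Woolf, log-based) CI.

    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Maori (exposed) vs European (unexposed)
maori_or, maori_lo, maori_hi = odds_ratio_with_ci(165, 262, 183, 1096)
print(f"OR = {maori_or:.2f}, 95% CI = ({maori_lo:.2f}, {maori_hi:.2f})")
```

With these counts the sketch agrees with the figures quoted above to two decimal places.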
Diversion! Visualising the association between ethnic group and cot death...
I show an alternative way of displaying this information which may help you understand the concept of
odds ratios. The first scaled rectangle diagram shows a box which is proportional to the total study
population (Pacific excluded). Within this, is a box proportional to the number of cases (labeled “CASES=1”,
about ¼ of the study), with those outside this box controls. The number of Maori (also just over a quarter
of the population) is displayed in the lighter coloured box, with all those outside this square European
(white-controls; dark blue-cases). This gives a visual display of the exposure (ethnic group) and outcome
(case status – cot death or no cot death). The odds ratio is calculated from the numbers displayed in the
diagram:
\[ OR = \frac{\text{Maori cases (165)} / \text{Maori controls (262)}}{\text{Euro cases (183)} / \text{Euro controls (1096)}} \]
You can readily appreciate that the ratio of areas (cases to controls), represented by the areas of the
overlapping boxes below, for Maori, is much higher than for European.
This is very different to what would be expected if case (cot death) status was unrelated to ethnic group,
keeping the proportion of Maori and proportion of cases to non-cases constant (illustrated in the following
diagram). You can see that, in this case, the odds ratio for independence (no effect) between cot death and
ethnicity would be:
\[ OR = \frac{\text{Maori cases (87)} / \text{Maori controls (340)}}{\text{Euro cases (261)} / \text{Euro controls (1018)}} \]
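The counts expected under independence can be worked out from the margins of the observed table: each expected cell is its row total times its column total, divided by the study total. The following is a minimal sketch of that arithmetic; the function name is hypothetical and this is not necessarily how the diagram itself was produced.

```python
def expected_counts(a, b, c, d):
    """Expected 2x2 cell counts if exposure and outcome were independent.

    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    The row and column totals are held fixed at their observed values.
    """
    n = a + b + c + d
    exposed_total, unexposed_total = a + b, c + d
    case_total, control_total = a + c, b + d
    return (
        exposed_total * case_total / n,
        exposed_total * control_total / n,
        unexposed_total * case_total / n,
        unexposed_total * control_total / n,
    )

# Observed counts: Maori cases, Maori controls, Euro cases, Euro controls
expected = expected_counts(165, 262, 183, 1096)
rounded = [round(x) for x in expected]
print(rounded)

# By construction, the odds ratio of the expected table is 1 (no effect)
exp_a, exp_b, exp_c, exp_d = expected
null_or = (exp_a * exp_d) / (exp_b * exp_c)
```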
Put yet another way (!), what we are comparing in the odds ratio is the odds of being Maori (rather than
European) among cases with the same odds among controls. For cases, we calculate this as 165/183=0.90
(point estimate). That is, if you are a case, you are about as likely to be Maori as European, in this study.
You can imagine that we could repeat this study many times over. You would not usually get exactly the
same result, but something close. We use a mathematical distribution (the binomial) as an approximation
of what may be expected if you do the study over and over again (see below). The red line is the median
value or point estimate, and the 95% confidence interval lines are given in blue. The odds are close to 1:1,
or one Maori to one European among cases, which corresponds to a probability of being Maori of about ½.
[Figure: Histogram of pc. x-axis: Odds of Maori, if case; y-axis: Frequency.]
Similarly the odds of being Maori, if control is much lower…
[Figure: Histogram of pco. x-axis: Odds of Maori, if control; y-axis: Frequency.]
If we divide one set of values by the other to generate an odds ratio, then we get a right-skewed
distribution for the odds ratio which looks like this:
[Figure: Histogram of or. x-axis: Odds ratio; y-axis: Frequency.]
Notice now that the distribution has changed from a symmetric distribution, based on the binomial (similar
to the normal distribution), to an asymmetric, right-skewed distribution, which is characteristic of a ratio.
Also note that the 95% confidence interval does not include the null effect of 1. We are, therefore,
confident that the effect of ethnic group (Maori) on cot death is unlikely to be due to chance. We haven’t
ruled out a third factor, however, which may be linked to both being Maori and developing cot death
(confounding), such as cigarette smoking. Note the median (red line) is about the same as the calculated
point estimate for the odds ratio (3.77).
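The idea of repeating the study over and over can be sketched with a small simulation. The code below assumes the number of Maori infants among the 348 cases and among the 1358 controls each follows a binomial distribution with the observed proportion, and it uses only 2000 repetitions to keep it quick. The function names, the seed and the number of repetitions are my own choices, and this is not necessarily how the histograms in this guide were generated.

```python
import random

def simulate_odds_ratios(case_exposed, case_total, control_exposed,
                         control_total, reps=2000, seed=1):
    """Simulate repeated studies; return the sorted simulated odds ratios."""
    rng = random.Random(seed)
    p_case = case_exposed / case_total
    p_control = control_exposed / control_total
    ratios = []
    for _ in range(reps):
        # Binomial draws: how many cases and controls are exposed this time
        x_case = sum(rng.random() < p_case for _ in range(case_total))
        x_control = sum(rng.random() < p_control for _ in range(control_total))
        # With these sample sizes a zero denominator is practically impossible
        odds_case = x_case / (case_total - x_case)
        odds_control = x_control / (control_total - x_control)
        ratios.append(odds_case / odds_control)
    ratios.sort()
    return ratios

def percentile(sorted_values, q):
    """Simple percentile by position in an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]

# Maori among cases: 165 of 348; Maori among controls: 262 of 1358
simulated = simulate_odds_ratios(165, 348, 262, 1358)
median_or = percentile(simulated, 0.5)
lower_or = percentile(simulated, 0.025)
upper_or = percentile(simulated, 0.975)
print(round(median_or, 2), round(lower_or, 2), round(upper_or, 2))
```

The median of the simulated odds ratios should sit close to the point estimate, and the 2.5th and 97.5th percentiles give a simulation-based 95% interval.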
[Figure: Histogram of nullor. x-axis: Odds ratio; y-axis: Frequency.]
Above, we have simulated what sort of results we would expect given the null hypothesis, that the odds of
being Maori are the same for both cases and controls. The red line above represents the lower 95%
confidence limit for the alternate hypothesis, which gives the 2 sided P-value. You can see that virtually
no values fall beyond this line, so the P-value is very small (<0.000001).
9. From the Select/If folder, click Cancel Select, and click OK. The Record Count now returns to the full
sample size of 1862.
2: Continuing with unadjusted Effect Estimates
Exposure: Pacific vs European
10. Repeat step 7 above. From the Select/If folder, click Select. The Select dialogue box opens. Choose
Ethnic=2 (Pacific Island) and Ethnic=3 (European), and click OK. Note that the Record Count is now
1435.
11. Click Tables. The Tables dialog box opens. From the Exposure Variable drop-down, select ETHNIC.
From the Outcome Variable drop-down, select CASE, and click OK.
Results appear in the output window. The cross-product odds ratio shows that Pacific Island infants
have 1.04 times the odds of cot death compared with European infants. This is not a statistically significant
result, since the 95% confidence interval (0.65, 1.66) includes the reference value of 1. This is confirmed by a
high p-value (>0.05).
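To see where a high p-value like this comes from, the chi-square statistic for a 2x2 table can be computed directly. The sketch below uses the Pacific and European counts from this lesson (23 and 183 cases, 133 and 1096 controls) and assumes the uncorrected Pearson chi-square with 1 degree of freedom; Epi Info prints several variants, so its figures may differ slightly. The function name is my own.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square and p-value (1 df) for a 2x2 table.

    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Pacific (exposed) vs European (unexposed)
pacific_chi2, pacific_p = chi_square_2x2(23, 133, 183, 1096)
print(f"chi-square = {pacific_chi2:.3f}, p = {pacific_p:.2f}")
```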
This is similarly illustrated for Pacific, as for the Maori vs European comparison, using scaled rectangle
diagrams. The odds ratio is calculated similarly:
\[ OR = \frac{\text{Pacific cases (23)} / \text{Pacific controls (133)}}{\text{Euro cases (183)} / \text{Euro controls (1096)}} \]
If ethnic group was unrelated to cot death status, the scaled rectangle diagram would look like this:
As you can see the two diagrams are not dramatically different. The P-value assesses the chance of this
difference being due to random variation, and in this comparison, this is a likely explanation for the
observed difference.
12. From the Select/If folder, click Cancel Select, and click OK. The Record Count now returns to the full
sample size of 1862.
Exposure: Infant bed sharing
You want to see if infants who share the bed with their parents (or other adults) in the last two weeks,
when they are sleeping, have an increased risk of cot death.
1. From the Command Tree Statistics folder, click Tables. The Tables dialog box opens.
2. Select the Exposure Variable of BEDSHARE.
3. Select the Outcome Variable of CASE.
4. Click OK. Results appear in the Output window. The odds ratio of 2.14 (95% CI: 1.70, 2.70) indicates
that infants who bed share have significantly increased odds of cot death, compared to infants who do
not bed share.
This is again illustrated by a scaled rectangle diagram, with the odds ratio calculated similarly:
\[ OR = \frac{\text{Bed share cases (231)} / \text{Bed share controls (646)}}{\text{No bed share cases (141)} / \text{No bed share controls (844)}} \]
You easily see that this is very different from what would be expected if the exposure and
outcome were independent (unrelated):
The two odds (of bed sharing, the exposure, among cases and among controls) are plotted below. You
can appreciate that the cases were much more likely to share the bed than controls. The 95%
confidence intervals show that if this study were repeated time and time again on different
samples of the same population, we would still see a marked difference in odds.
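The two odds of exposure can also be computed by hand from the bed sharing counts (231 bed share and 141 no bed share among cases; 646 and 844 among controls). This sketch assumes a simple log-based interval for each odds, which need not match the method used for the figure; the function name is hypothetical.

```python
import math

def odds_with_ci(exposed, unexposed, z=1.96):
    """Odds of exposure with an approximate log-based 95% CI."""
    odds = exposed / unexposed
    se_log_odds = math.sqrt(1 / exposed + 1 / unexposed)
    lower = math.exp(math.log(odds) - z * se_log_odds)
    upper = math.exp(math.log(odds) + z * se_log_odds)
    return odds, lower, upper

case_odds = odds_with_ci(231, 141)      # bed sharing among cases
control_odds = odds_with_ci(646, 844)   # bed sharing among controls

# Dividing one odds by the other gives back the odds ratio
bedshare_or = case_odds[0] / control_odds[0]
print(f"cases: {case_odds[0]:.2f}, controls: {control_odds[0]:.2f}, "
      f"OR: {bedshare_or:.2f}")
```

The lower limit for cases lies above the upper limit for controls, which is the "marked difference in odds" described above.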
[Figure: Odds ratio from case control study. Odds of exposure (x-axis) by outcome category (case, control). OR = 2.14, 95% CI = 1.68, 2.72.]
3: Controlling for confounding using stratification (Mantel-Haenszel method).
Does household socioeconomic status confound the association between infant bed
sharing and risk of cot death (SIDS)?
You have identified that infants who bed share have a higher risk of cot death. Now you want to see if
household socioeconomic status (SES) is a confounder.
One way to solve the problem of confounding is to restrict comparisons to individuals who have the same
value of the confounding variable (in this case, SES). Splitting the analysis up by SES allows us to assess
the effect of bed sharing on cot death without the problem of variation in this confounding variable. The
subsets of occupation which we use to split up the data are called strata, and so this process is known as
stratification.
Unless the effect of exposure on outcome differs substantially between strata (in which case we encounter
a different issue which we will discuss later – effect modification) we usually wish to combine the evidence
from the separate levels of SES, and summarise the effect controlling for the confounder. Strata with more
individuals will tend to have a more precise estimate of the effect than strata with fewer individuals. We
therefore account for this by taking a weighted average of the effects. The most common method of
weighting is given by the Mantel-Haenszel estimate.
In our example, for one level of socioeconomic status, we have the familiar two by two table
                 Outcome
Exposure       | Cot Death | No Cot Death | Total
Bed share      | a_i       | b_i          | a_i+b_i
No bed share   | c_i       | d_i          | c_i+d_i
Total          | a_i+c_i   | b_i+d_i      | N_i
The weight of each stratum is calculated by multiplying the number of unexposed cases with the number
of exposed controls and dividing by the total number in that stratum:
\[ \text{weight}_i = \frac{c_i \times b_i}{N_i} \]
The final Mantel-Haenszel odds ratio is calculated by summing (Σ) the products of the stratum specific
weights and odds ratios and dividing by the sum of the weights:
\[ OR_{MH} = \frac{\sum_i (\text{weight}_i \times OR_i)}{\sum_i \text{weight}_i} \]
This can get a bit messy if you have a lot of strata. We then compare the stratified (ORMH) with the crude
odds ratio. If the change in the stratified effect estimate is greater than 10% (compared to the crude), we
consider that confounding is likely to be present.
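The weighted average described above takes only a few lines of code. The sketch below follows the weight and odds ratio formulas given in this lesson. The first stratum uses the high SES counts that appear later in this lesson (38, 215, 29, 306); the second stratum is invented purely for illustration and is not from the dataset. The function name is my own.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio from a list of (a, b, c, d) strata.

    In each stratum: a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        weight = c * b / n                 # weight_i = c_i x b_i / N_i
        stratum_or = (a * d) / (b * c)     # stratum specific odds ratio
        weighted_sum += weight * stratum_or
        weight_total += weight
    return weighted_sum / weight_total

high_ses = (38, 215, 29, 306)          # counts from the high SES stratum
made_up_stratum = (40, 100, 30, 150)   # hypothetical counts, OR = 2.0

single_stratum_or = mantel_haenszel_or([high_ses])
combined_or = mantel_haenszel_or([high_ses, made_up_stratum])
print(round(single_stratum_or, 2), round(combined_or, 2))
```

With a single stratum the result is just that stratum's odds ratio; with two strata the result lies between the two stratum specific odds ratios, pulled towards the stratum with the larger weight.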
Thankfully, these calculations can be done in a straightforward manner in Epi Info.
I will not show you how it is calculated here, but Epi Info also calculates a test of the null hypothesis, which
here is that, after controlling for socioeconomic status, there is no effect of bed sharing on cot death (i.e. it
tests whether the ORMH is sufficiently larger (or smaller) than one to be unlikely to be due to random error).
1. From the Command Tree Statistics folder, click Tables. The Tables dialog box opens.
2. Select the Exposure Variable of BEDSHARE.
3. Select the Outcome Variable of CASE.
4. Select the Stratify by Variable of OCCUPATION.
5. Click OK. Results appear in the Output window. A 2x2 table, with odds ratio calculations, appears for
each of the 3 levels of OCCUPATION. Scroll down to the bottom of the Output to see the summary
information below. Note that the odds ratio has changed from 2.65 (crude, cross product) to 2.24 after
adjusting for SES. This indicates that OCCUPATION partially confounds the association between bed
sharing and cot death, since the change in the odds ratio between crude and adjusted is more than
10%. The output also shows that the p-value for the test for interaction (effect modification) between
strata is high (0.3), indicating that the variation in the stratum specific odds ratios is likely to be due to
chance alone and is less likely to be attributable to a systematic effect.
These results are shown visually in a scaled rectangle diagram below
Although, the picture is getting quite complex now, you can see that if cot death, occupation and
bedsharing were unrelated, the picture would be quite different (see below).
Calculation of stratum specific odds ratios is possible by considering case status and bed sharing
within an occupational class. For example, the stratum specific odds ratio for the effect of bed
sharing on cot death (case status) for high occupational status, limited to the purple upper
rectangle and divided by case and bedsharing status, is:
\[ OR = \frac{\text{Bed share cases (38)} / \text{Bed share controls (215)}}{\text{No bed share cases (29)} / \text{No bed share controls (306)}} \]
The stratum specific odds ratio is therefore 1.86. You can see visually (from the scaled rectangle
diagram below) that the ratios of these areas are similar. The other stratum specific odds ratios can
be calculated in the same way. The same diagram shows the differences between the odds in the
exposed and unexposed. Note, however, that the slope of each line represents the difference in odds
(not the ratio) for cases and controls at each level of occupation. You can see that the red (high)
group is least likely to bedshare, among both cases and controls, and the low (blue) group is most
likely to bedshare, among both cases and controls. The 95% confidence intervals for the individual
odds are also shown for each stratum.
[Figure: Stratified case control analysis (Outcome = clogic, Exposure = blogic). Odds of exposure (x-axis) for cases and controls, by stratum.
OCCUPATION1: OR = 1.86 (1.08, 3.24)
OCCUPATION2: OR = 2.04 (1.43, 2.92)
OCCUPATION3: OR = 1.92 (1.22, 3.05)
MH-OR = 1.97 (1.55, 2.5)
homogeneity test P value = 0.949]
Does ethnic group confound the association between infant bed sharing and risk of cot
death (SIDS)?
You have identified that household SES partially confounds the association between infant bed sharing and
risk of cot death. Now you want to see if ethnicity is also a confounder.
1. From the Command Tree Statistics folder, click Tables. The Tables dialog box opens.
2. Select the Exposure Variable of BEDSHARE.
3. Select the Outcome Variable of CASE.
4. Select the Stratify by Variable of ETHNIC.
5. Click OK. Results appear in the Output window. A 2x2 table, with odds ratio calculations, appears for
each of the 3 levels of ETHNIC.
Note that the odds ratio varies between ethnic groups, being 2.33 for Maori (ETHNIC=1), 0.70 for
Pacific Island (ETHNIC=2) and 1.50 for European infants (ETHNIC=3).
Scroll down to the bottom of the Output to see the summary information below. Note that the crude
odds ratio (cross product) has changed from 2.13 to 1.63 after adjusting for ETHNIC. This indicates that
ETHNIC partially confounds the association between bed sharing and cot death, since the change
between crude and adjusted odds ratios is more than 10%.
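The 10% rule of thumb used in this lesson is simple arithmetic. The sketch below applies it to the crude and adjusted odds ratios quoted in this lesson (2.65 and 2.24 for OCCUPATION; 2.13 and 1.63 for ETHNIC). It assumes the change is expressed relative to the crude estimate, as described earlier; the function names are hypothetical.

```python
def relative_change(crude_or, adjusted_or):
    """Change between crude and adjusted odds ratio, relative to the crude."""
    return abs(crude_or - adjusted_or) / crude_or

def suggests_confounding(crude_or, adjusted_or, threshold=0.10):
    """Rule of thumb: a change of more than 10% suggests confounding."""
    return relative_change(crude_or, adjusted_or) > threshold

ses_change = relative_change(2.65, 2.24)      # adjusting for OCCUPATION
ethnic_change = relative_change(2.13, 1.63)   # adjusting for ETHNIC
print(f"SES: {ses_change:.1%}, ethnicity: {ethnic_change:.1%}")
```

Both changes exceed 10%, which is why each variable is described as a partial confounder.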
However, also note, in the second-to-last row, that the “Chi-square for differing Odds Ratios by
stratum (interaction)” is 5.72 and the p-value for this is 0.0572. This indicates that the odds ratios are
on the borderline of differing significantly between the ethnic groups. This is called interaction,
heterogeneity or effect-modification, since ethnicity is modifying the effect of bed sharing on risk of cot
death. When there is significant interaction between variables, we cannot report one adjusted odds
ratio, controlling for the confounding variable, because the effect of bed sharing on cot death risk
differs substantially between strata (in this case by ethnic group).
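The p-value for the interaction test can be recovered from the chi-square value. The sketch below assumes the test has 2 degrees of freedom (three ethnic strata minus one), in which case the upper tail probability of the chi-square distribution has a simple closed form. The function name is my own.

```python
import math

def chi_square_p_value_2df(chi2):
    """Upper tail probability of a chi-square statistic with 2 df.

    For exactly 2 degrees of freedom this equals exp(-chi2 / 2).
    """
    return math.exp(-chi2 / 2)

interaction_p = chi_square_p_value_2df(5.72)
print(f"p = {interaction_p:.4f}")
```

Under that assumption the result agrees with the 0.0572 shown in the Epi Info output.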
The scaled rectangle diagram is shown below. The white space represents the largest ethnic group,
European. The ethnic specific odds ratio, for Maori, is:
\[ OR = \frac{\text{Bed share cases (134)} / \text{Bed share controls (170)}}{\text{No bed share cases (31)} / \text{No bed share controls (92)}} \]
You can see that the numerator odds are higher than the denominator odds, so the odds ratio will be high.
This is illustrated in the graph below, with the red line showing the largest difference in the two odds.
As you can see, the odds differences (similar to odds ratios), represented by the slopes of the lines
connecting the odds, are much more heterogeneous than in the example which examines the
effect of bed sharing on cot death, adjusted for socioeconomic status.
[Figure: Stratified case control analysis (Outcome = clogic, Exposure = blogic). Odds of exposure (x-axis) for cases and controls, by stratum.
ETHNIC1: OR = 2.33 (1.44, 3.86)
ETHNIC2: OR = 0.7 (0.25, 2.07)
ETHNIC3: OR = 1.5 (1.08, 2.08)
MH-OR = 1.63 (1.27, 2.1)
homogeneity test P value = 0.058]
4: Interaction or effect modification: using 2x2 tables
• Use TABLES command to calculate unadjusted and adjusted odds ratios to control for
confounding
• Use DEFINE, IF, ASSIGN to create new variables
• Use SELECT & CANCEL SELECT to select two exposure levels for calculating odds ratios,
if there are more than two exposure levels.
Commands in this Lesson
The following new commands are used in this lesson.
LIST
The List command creates a listing of the current data table. Lists can be customized to list all, exclude, or
show specific records. Located in the Statistics folder.
SORT
The Sort command specifies the sequence in which records will appear when using the LIST, GRAPH, or
WRITE commands. SORT organizes the listed data in an ascending or descending order, based on selected
variables. For example, you can sort by last name or age. Located in the Select/If folder.
CANCEL SORT
The Cancel Sort command cancels a previous SORT command. Located in the Select/If folder.
Commands for creating new variables
ASSIGN
This command is used after the IF command to assign a new value to a variable.
RECODE
This command is used to create new variables, based on the values of other variables. You can use this
command to create categories based on numerical (continuous) variables. Aggregating sparsely defined
categorical variables may also be achieved using this command.
DEFINE
This command lets you define the name of a new variable, which you will then create using ASSIGN
and IF.
IF
This command lets you make conditional statements, so that if a condition based on the values of other
variables is met, then you assign (using the ASSIGN command) a value to your new variable.
In the previous session, you did an analysis that showed effect modification from ethnicity in the
association between infant bed sharing and risk of cot death. In this session you will use the TABLES
command to analyse this effect modification in more detail, and determine whether the interaction is
additive or multiplicative.
The increased risk of cot death associated with bed sharing by Maori infants suggests that some other
variable, which occurs commonly among this ethnic group, combines with bed sharing to greatly
increase the risk of cot death. Extensive analysis of the NZ Cot Death data set in the 1990s revealed
that one of the lifestyle variables associated with ethnicity was maternal tobacco smoking, which was
more common among mothers of Maori infants than mothers of infants from other ethnicities
(Mitchell EA, et al. Ethnic differences in mortality rate from sudden infant death syndrome in New
Zealand. Brit Med J 1993; 306: 13-6). This suggested that there may be an interaction between
maternal tobacco smoking and infant bed sharing. You will explore this possibility in the data set by
combining values for two variables (maternal tobacco smoking & infant bed sharing) into a single new
combined variable with four exposure levels using the commands DEFINE, IF and ASSIGN, in order
to calculate odds ratios for each combined exposure level.
A. CONVERT MATERNAL TOBACCO VARIABLE INTO A BINARY VARIABLE
The data dictionary for the SIDS_EpiInfo data set shows that the variable MOTHER_TOBACCO has
3 levels. This variable first needs to be converted into a binary variable before it can be combined with
the infant bed sharing variable.
1. From the Command Tree Statistics folder, click Tables. The Tables dialog box opens.
2. Select the Exposure Variable of MOTHER_TOBACCO.
3. Select the Outcome Variable of CASE.
4. Click OK. Results appear in the Output window. A 3x2 table appears (below).
Note the small numbers of cases (n=6) and controls (n=28) with mothers who smoke tobacco
occasionally (<1 cigarette per day). These numbers are too small to analyse as a separate group, and
need to be combined with either group 1 (smoke daily) or group 2 (non-smokers). You will combine
them with group 1 so that you have a clear comparison of smokers and non-smokers. To do this, you
will define a new variable and use the RECODE command to combine both smoking groups together.
5. From the Variables folder, click Define. The DEFINE dialog box opens.
6. Type in the Variable Name space ‘Mother_smoke’. Your Dialogue box should look like this (below).
7. Click OK.
8. From the Variables folder, click Recode. The RECODE dialog box opens.
9. From the From drop-down, select MOTHER_TOBACCO, and from the To drop-down, select Mother_smoke.
a. Click your mouse in the top left hand cell of the table, and enter “1” in the left hand
column with heading ‘Value (blank = other)’;
b. then press ENTER twice to move to the top right hand cell, and enter “1” in the right
hand column with heading ‘Recoded Value’;
c. press ENTER to move to the next row.
10. Repeat steps 9a to 9c for the 2nd row by entering ‘2’ in the left hand column and ‘2’ in the right column.
11. Repeat steps 9a to 9c for the 3rd row by entering ‘3’ in the left hand column and ‘1’ in the right column.
The RECODE dialogue box should look like this (below), with the left cell of the 4th row highlighted.
12. Click OK.
13. Use the TABLES command to check that you have correctly recoded MOTHER_TOBACCO into
Mother_smoke so that all smokers are combined into a single group.
14. From the Command Tree Statistics folder, click Tables. The Tables dialog box opens.
15. Select the Exposure Variable of MOTHER_TOBACCO.
16. Select the Outcome Variable of ‘MOTHER_SMOKE’.
17. Click OK. Results appear in the Output window (below). Check that you have correctly recoded
MOTHER_TOBACCO into ‘MOTHER_SMOKE’ and that there are no missing observations.
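The recoding above can also be written out as a short script, which may help you check your understanding of what the RECODE dialog is doing. This is only a sketch in Python, not an EpiInfo program: the function name and the example values are made up for illustration, and the codes (1 = smokes daily, 2 = non-smoker, 3 = smokes occasionally) are taken from the description above.

```python
# Illustrative sketch only (not EpiInfo syntax).
# Codes as described in the tutorial:
#   MOTHER_TOBACCO: 1 = smokes daily, 2 = non-smoker, 3 = smokes occasionally
#   Mother_smoke:   1 = smoker,       2 = non-smoker
RECODE_MAP = {1: 1, 2: 2, 3: 1}


def recode_mother_smoke(mother_tobacco):
    """Return the Mother_smoke code, or None when the input is missing or unknown."""
    return RECODE_MAP.get(mother_tobacco)


# Made-up example values, not taken from the SIDS_EpiInfo data set.
example = [1, 2, 3, 2, None]
recoded = [recode_mother_smoke(value) for value in example]
print(recoded)
```

Missing values stay missing rather than being silently placed in one of the two groups, which is why step 17 asks you to check for missing observations.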
B. CREATE A SINGLE COMBINATION VARIABLE FROM MATERNAL TOBACCO SMOKING & INFANT BED SHARING
Now that you have converted the variable MOTHER_TOBACCO into a new variable called
‘MOTHER_SMOKE’, which has two levels (smoker, non-smoker), this new variable can now be
combined with the infant bed sharing variable (BEDSHARE) into a single variable called
‘Smoke_Share’, which has four levels as shown in the following table, using the commands DEFINE,
IF and ASSIGN.
Existing Variables                  Combination Variable
MOTHER_SMOKE      BEDSHARE          SMOKE_SHARE
Yes               Yes               1
Yes               No                2
No                Yes               3
No                No                4
Note that infants coded ‘No’ for both ‘MOTHER_SMOKE’ and BEDSHARE, who are expected to
have the lowest risk of cot death and therefore should be our reference group, will be given the value
of ‘4’ to ensure they are on the bottom row for odds ratio calculations. EpiInfo assumes the reference
group is on the bottom row when calculating odds ratios (or relative risks) with the TABLES
command.
1. From the Variables folder, click Define. The DEFINE dialog box opens.
2. Type in the Variable Name space ‘Smoke_Share’. Your Dialogue box should look like this (below).
3. Click OK.
4. From the Select/If folder, click If. The IF dialog box opens.
5. From the Available Variables drop-down, select ‘Mother_smoke’, and use the buttons to make it equal
to ‘1’.
6. Click the AND button.
7. From the Available Variables drop-down, select BEDSHARE, and use the buttons to make it equal to ‘1’.
Your Dialogue box should look like this (below).
8. Click Then.
9. From the Variables folder (on the left of the original screen), click Assign. The ASSIGN dialog box opens.
10. From the Assign Variable drop-down, select ‘Smoke_Share’, and use the buttons to make it equal to 1.
Your Dialogue box should look like this (below).
11. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears.
12. Click OK. You have created the first level of the new combination variable ‘Smoke_Share’.
13. Now you will create the second level of the new combination variable ‘Smoke_Share’. From the
Select/If folder, click If (which is highlighted in blue). The IF dialog box opens.
14. From the Available Variables drop-down, select ‘Mother_smoke’, and use the buttons to make it equal
to ‘1’. Click the AND button. From the Available Variables drop-down, select BEDSHARE, and use the
buttons to make it equal to ‘2’. Your Dialogue box should look like this (below).
15. Click Then.
16. From the Variables folder, click Assign. The ASSIGN dialog box opens.
17. From the Assign Variable drop-down, select ‘Smoke_Share’, and use the buttons to make it equal to 2.
Your Dialogue box should look like this (below).
18. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears (below).
19. Click OK. You have created the second level of the new combination variable ‘Smoke_Share’.
20. Now you will create the third level of the new combination variable ‘Smoke_Share’. From the Select/If
folder, click If (which is highlighted in blue). The IF dialog box opens.
21. From the Available Variables drop-down, select ‘Mother_smoke’, and use the buttons to make it equal
to ‘2’. Click the AND button. From the Available Variables drop-down, select BEDSHARE, and use the
buttons to make it equal to ‘1’. Your Dialogue box should look like this (below).
22. Click Then.
23. From the Variables folder, click Assign. The ASSIGN dialog box opens.
24. From the Assign Variable drop-down, select ‘Smoke_Share’, and use the buttons to make it equal to 3.
Your Dialogue box should look like this (below).
25. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears (below).
26. Click OK. You have created the third level of the new combination variable ‘Smoke_Share’.
27. Now you will create the fourth level of the new combination variable ‘Smoke_Share’. From the
Select/If folder, click If (which is highlighted in blue). The IF dialog box opens.
28. From the Available Variables drop-down, select ‘Mother_smoke’, and use the buttons to make it equal
to ‘2’. Click the AND button. From the Available Variables drop-down, select BEDSHARE, and use the
buttons to make it equal to ‘2’. Your Dialogue box should look like this (below).
29. Click Then.
30. From the Variables folder, click Assign (which is highlighted in blue). The ASSIGN dialog box opens.
31. From the Assign Variable drop-down, select ‘Smoke_Share’, and use the buttons to make it equal to 4.
Your Dialogue box should look like this (below).
32. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears (below).
33. Click OK. You have created the fourth and final level of the new combination variable ‘Smoke_Share’.
34. Click Tables, to check that you have correctly created the new combination variable ‘Smoke_Share’.
You should have the output below.
35. Now you are able to calculate odds ratios for each of the first three rows compared with row 4 as the
reference, by using the SELECT command.
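The four IF/ASSIGN pairs you have just entered amount to a simple lookup from the two binary variables to one four-level code. The sketch below expresses that logic in Python; it is not EpiInfo code, and the function name is hypothetical. It assumes both inputs are coded 1 = yes and 2 = no, as in the steps above.

```python
# Illustrative sketch only (not EpiInfo syntax).
# Both inputs are assumed to be coded 1 = yes, 2 = no.
SMOKE_SHARE_LEVELS = {
    (1, 1): 1,  # mother smokes, infant bed shares
    (1, 2): 2,  # mother smokes, infant does not bed share
    (2, 1): 3,  # mother does not smoke, infant bed shares
    (2, 2): 4,  # neither exposure: the reference group (bottom row)
}


def smoke_share(mother_smoke, bedshare):
    """Return the Smoke_Share level, or None if either input is missing."""
    return SMOKE_SHARE_LEVELS.get((mother_smoke, bedshare))


for pair in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    print(pair, "->", smoke_share(*pair))
```

Records with a missing value for either variable receive no Smoke_Share value, so they drop out of later tables rather than being counted in the reference group.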
C. CALCULATE ODDS RATIOS OF COT DEATH ASSOCIATED WITH THE NEW COMBINATION VARIABLE CREATED FROM
MATERNAL TOBACCO SMOKING & INFANT BED SHARING
Odds ratios (and relative risks) are only calculated from the TABLES command when there are two
exposure levels. You will use the SELECT command to select two exposure levels, so that you can
calculate odds ratios.
1. From the Command Tree Select/If folder, click Select. The Select dialogue box opens. From the
Available Variables drop-down, select the groups ‘Smoke_Share’=1 or 4, so that your dialogue box
looks like below. The Record Count now is 1066.
2. Click Tables, to calculate the odds ratio so that you can compare ‘Smoke_Share’ groups 1 and 4. You
will also calculate an odds ratio adjusted for ethnicity, so:
a. Select the Exposure Variable of ‘Smoke_Share’.
b. Select the Outcome Variable of CASE.
c. Select the Stratify by Variable of ETHNIC.
3. The dialogue box should look like below. Click OK.
6. Results appear in the Output window. A 2x2 table, with odds ratio calculations, appears for each of the
3 levels of ETHNIC.
Note that the odds ratio is consistently high in all ethnic groups, being 4.53 for Maori (ETHNIC=1), 2.56
for Pacific Island (ETHNIC=2) and 5.52 for European infants (ETHNIC=3). The p-value for the Chi-square
test for differing Odds Ratios by stratum (interaction) is well above 0.05 (=0.5712), so there is no evidence
that the odds ratios vary significantly between ethnic groups.
Scroll down to the bottom of the Output to see the summary information below. Note that the
summary Mantel-Haenszel odds ratio is 4.85 (95% CI: 3.27 to 7.19). This is very high, although lower
than the crude odds ratio (cross product) of 7.11, indicating that ETHNIC partially confounds the
association between the combined exposure (maternal smoking with bed sharing) and cot death.
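To see how a summary Mantel-Haenszel odds ratio can differ from the crude odds ratio, it may help to work through the standard formulas on a small example. The sketch below is in Python rather than EpiInfo, and the cell counts are invented purely for illustration; they are not the counts from the cot death data set. Each stratum is given as (a, b, c, d) = (exposed cases, exposed controls, unexposed cases, unexposed controls).

```python
# Illustrative sketch only. The counts are HYPOTHETICAL, not from SIDS_EpiInfo.
# Each stratum: (a, b, c, d) = (exposed cases, exposed controls,
#                               unexposed cases, unexposed controls)
strata = [
    (10, 5, 20, 40),
    (6, 12, 10, 80),
]


def stratum_or(a, b, c, d):
    """Cross-product odds ratio for one 2x2 table."""
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary OR: sum(a*d/n) divided by sum(b*c/n)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


def crude_or(tables):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a, b, c, d = (sum(cells) for cells in zip(*tables))
    return (a * d) / (b * c)


print([round(stratum_or(*t), 2) for t in strata])
print(round(mantel_haenszel_or(strata), 2))
print(round(crude_or(strata), 2))
```

In this made-up example both strata have the same odds ratio, so the Mantel-Haenszel estimate equals it, while the crude estimate from the collapsed table is different. A gap of that kind between the crude and adjusted estimates is what the text above describes as confounding by the stratifying variable.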
The findings are again portrayed below, to give a visual summary of the excess odds of having a smoking
mother and bedsharing, between cases and controls, for the three different ethnic groups (red=European;
green=Pacific; and blue=Maori). Note exposure here is considered a mother who smokes and bed shares,
compared with a nonsmoking mother who doesn’t bedshare.
[Figure: Stratified case control analysis. Outcome = clogic, Exposure = ss. Horizontal axis: odds of
exposure (log scale), shown for cases and controls in each ethnic group.]
ETHNIC1: OR = 4.51 (1.95, 11.77)
ETHNIC2: OR = 2.52 (0.56, 15.88)
ETHNIC3: OR = 5.5 (3.4, 8.91)
MH-OR = 4.85 (3.27, 7.19)
Homogeneity test P value = 0.559
4. Click Cancel Select, to return to the full data set. The Record Count now is 1863.
7. Repeat steps 1-4, but this time select the groups ‘Smoke_Share’=2 or 4 (this compares infants whose
mothers smoke but who do not bedshare with infants whose mothers do not smoke and who do not
bedshare), so that your Record Count is now 985.
Run the Tables command, with ‘Smoke_Share’ as the exposure variable, CASE as the outcome, and
ETHNIC as the stratification variable.
Note that the odds ratio is raised in all ethnic groups, being 1.77 for Maori (ETHNIC=1), 2.95
for Pacific Island (ETHNIC=2) and 2.99 for European infants (ETHNIC=3). The p-value for the Chi-square
test for differing Odds Ratios by stratum (interaction) is well above 0.05 (=0.5846), so there is no evidence
that the odds ratios vary significantly between ethnic groups.
The summary information at the bottom of the Output is shown below. The summary adjusted
Mantel-Haenszel odds ratio is 2.70 (95% CI: 1.85, 3.93).
These findings are illustrated below.
[Figure: Stratified case control analysis. Outcome = clogic, Exposure = ss. Horizontal axis: odds of
exposure (log scale), shown for cases and controls in each ethnic group.]
ETHNIC1: OR = 1.76 (0.67, 5.06)
ETHNIC2: OR = 2.87 (0.47, 21.59)
ETHNIC3: OR = 2.99 (1.91, 4.68)
MH-OR = 2.7 (1.85, 3.93)
Homogeneity test P value = 0.582
5. Click Cancel Select, to return to the full data set. The Record Count now is 1863.
8. Repeat steps 1-4, but this time select the groups ‘Smoke_Share’=3 or 4, so that your Record Count is
now 1149. This restricts the analysis to a comparison of infants whose mothers do not smoke but who do
bedshare with infants who are exposed to neither.
Run the Tables command, with ‘Smoke_Share’ as the exposure variable, CASE as the outcome, and
ETHNIC as the stratification variable.
The ethnic-specific odds ratios are: 1.46 for Maori (ETHNIC=1), 0.57 for Pacific Island (ETHNIC=2) and
1.23 for European infants (ETHNIC=3). The p-value for the Chi-square test for differing Odds Ratios by
stratum (interaction) is well above 0.05 (=0.5733), so there is no evidence that the odds ratios vary
significantly between ethnic groups.
The summary information at the bottom of the Output is shown below. The summary adjusted
Mantel-Haenszel odds ratio is 1.21 (95% CI: 0.82, 1.79).
This is illustrated graphically below. The point estimates show the odds of having a mother who does
not smoke but does bed share, compared with a mother who neither smokes nor bed shares, for cases
and controls in each ethnic group.
[Figure: Stratified case control analysis. Outcome = clogic, Exposure = ss. Horizontal axis: odds of
exposure (log scale), shown for cases and controls in each ethnic group.]
ETHNIC1: OR = 1.45 (0.54, 4.24)
ETHNIC2: OR = 0.58 (0.1, 4)
ETHNIC3: OR = 1.23 (0.76, 1.97)
MH-OR = 1.21 (0.82, 1.79)
Homogeneity test P value = 0.579
6. Click Cancel Select, to return to the full data set for any further analyses. The Record Count now is 1863.
7. The summary adjusted odds ratios calculated for each level of the variable ‘Smoke_Share’ (adjusted for
ethnic group) can now be added to a 2x2 table (as below) to help you interpret them. The odds ratios
can be evaluated on the assumption that the effects from maternal smoking and bed sharing are
additive, or are multiplicative.
                      Infant Bed Shares
Mother Smokes         Yes         No
Yes                   4.85        2.70
No                    1.21        1.00
On an additive basis, the increase in the odds ratio going from the reference value (=1.00) to that for
infants exposed to both risk factors (=4.85) is 3.85. The change in odds ratio going from the reference
(=1.00) to that for infants exposed to maternal smoking only (=2.70) is 1.70. The change in odds ratio
going from the reference (=1.00) to that for infants exposed to bed sharing only (=1.21) is 0.21. The
sum of the individual effects (1.70 and 0.21) is less than the combined effect (3.85). This indicates that
the combined effect from being exposed to both maternal smoking and infant bed sharing is more
than the sum of the individual effects of maternal smoking and bed sharing. Thus, there exists an
interaction between these two exposures when they occur together.
The table is better explained on the basis of interaction in which effects multiply. When the odds ratios
for maternal smoking only (2.70) and bed sharing only (1.21) are multiplied with each other, the
product is 3.27 (to 2 decimal places). This is closer to the odds ratio for infants exposed to both risk
factors (4.85) than the value expected if the effects simply added (1.00 + 1.70 + 0.21 = 2.91). This
indicates that the combined effect from being exposed to both maternal smoking and infant bed sharing
is closer to the product of the individual effects of maternal smoking and bed sharing than to their sum,
although it is still somewhat larger than the product. This supports the presence of an interaction
between these two exposures when they occur together.
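The additive and multiplicative comparisons can be checked with a few lines of arithmetic. The sketch below uses Python and the three summary adjusted odds ratios from the table above (4.85, 2.70 and 1.21). The variable names are illustrative, and the "relative excess risk due to interaction" (RERI) is a standard summary of departure from additivity that is added here only as an aid; it is not part of the EpiInfo output.

```python
# Summary adjusted odds ratios from the 2x2 table above (reference = 1.00).
or_both = 4.85        # mother smokes and infant bed shares
or_smoke_only = 2.70  # mother smokes, infant does not bed share
or_share_only = 1.21  # mother does not smoke, infant bed shares

# Expected joint odds ratio if the two excess effects simply add.
expected_additive = 1 + (or_smoke_only - 1) + (or_share_only - 1)

# Expected joint odds ratio if the two effects multiply.
expected_multiplicative = or_smoke_only * or_share_only

# Departure from additivity (RERI): greater than 0 means more than additive.
reri = or_both - or_smoke_only - or_share_only + 1

# Departure from multiplicativity: 1 would mean exactly multiplicative.
ratio_to_product = or_both / expected_multiplicative

print(round(expected_additive, 2))
print(round(expected_multiplicative, 2))
print(round(reri, 2))
print(round(ratio_to_product, 2))
```

Comparing the observed joint odds ratio with each expected value shows which scale describes the table better: the smaller the gap, the better the fit on that scale.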
5:
Introduction to logistic regression
Aim: to use logistic regression to analyse a case control study.
• Use TABLES command to calculate unadjusted and adjusted odds ratios to control for confounding
• Use DEFINE, IF, ASSIGN to create new variables
• Use SELECT & CANCEL SELECT to select two exposure levels for calculating odds ratios, if
there are more than two exposure levels.
Commands in this Lesson
The following commands are used in this lesson.
Commands for multivariate analysis
LOGISTIC REGRESSION
This command is in the “Advanced Statistics” Folder. It allows the user to undertake logistic
regression to investigate multivariate relationships between exposures and outcomes.
What is Logistic Regression?
This is a form of regression which is commonly used for the analysis of case-control data which has a binary
outcome (two states – case or control).
The general form of the logistic regression model is:
Log odds of outcome = β0 + β1·x1 + β2·x2 + β3·x3 + ... + βp·xp
for p exposure variables. The difference between logistic and linear regression is that we model a
transformation of the outcome variable, the log of the odds of the outcome. The quantity on the right hand
side is known as the linear predictor of the log odds of the outcome. The β’s are the regression coefficients
associated with the p exposure variables.
The log odds is derived from the probability, or risk, π, of the outcome, using the “logit” function:
logit(π) = log(π/(1-π))
While probabilities are constrained to values between 0 and 1, the log odds is not: the odds can take
any value between 0 and infinity, and the log odds can take any value between minus infinity and positive
infinity.
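A short numerical example may make the relationship between risk, odds and log odds more concrete. The sketch below is in Python and uses only the logit definition given above. The function names and the example risks are illustrative assumptions, not part of EpiInfo.

```python
import math


def logit(risk):
    """Log odds corresponding to a risk strictly between 0 and 1."""
    return math.log(risk / (1 - risk))


def inverse_logit(log_odds):
    """Risk corresponding to a given log odds (the reverse transformation)."""
    return 1 / (1 + math.exp(-log_odds))


# Arbitrary example risks, chosen only to show the range of the log odds.
for risk in [0.01, 0.2, 0.5, 0.8, 0.99]:
    odds = risk / (1 - risk)
    print(risk, round(odds, 3), round(logit(risk), 3))
```

A risk of 0.5 corresponds to odds of 1 and a log odds of 0. Risks below 0.5 give negative log odds and risks above 0.5 give positive log odds, with no upper or lower bound.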
β0 is the log odds in the unexposed group and β1 to βp correspond to the log odds ratios associated with
the various exposures. We can use logistic regression as an alternative to stratification when controlling for
confounding variables. The advantage of regression (over stratification) is that we can simultaneously
control for the effects of a large number of variables, without losing statistical power or being constrained
by small counts of individuals within strata. The disadvantage of modeling is that, although it has become
technically easy to do with modern software, its use involves a number of assumptions, of which the user
must have some understanding.
Before we discuss controlling for confounding we need to make sure we know what properties of a
particular exposure or attribute make it a confounder.
Just as we discussed previously, ethnicity may confound the observed association between cot death and
bed sharing, if ethnicity influences both the exposure (bed sharing) and the outcome (cot death). Both
these possibilities make sense, as causation may only run in one direction, and it is unlikely that bed sharing
or cot death would influence ethnicity.
Alternatively, if we consider the effect of maternal smoking on the outcome cot death, and speculate on
what effect birth weight is likely to have on this relationship, we would point the arrows of causation in a
different direction. Cigarette smoking during pregnancy reduces the birth weight of the foetus. Low birth
weight infants are also at higher risk of cot death. Instead of birth weight causing both the exposure and
the outcome, we consider here that it is mediating this relationship. Adjusting for, or stratifying by, the third
variable changes the estimated exposure-outcome effect in a way that depends on the likely direction of
causation. If, as in the first example, the variable is a confounder, then you will be able to assess the true
effect of exposure on outcome more accurately. If instead, as in the second example, the variable is a
mediator, then you will likely underestimate the effect of the exposure.
Put another way, if you imagine that the variables are sources of water and the arrows represent the direction
of flow between variables, then the strength of a relationship may be considered the flow rate in the
pipe. You can imagine, in the first example, turning on the ‘tap’ of bed sharing and seeing how much water
ends up at the outcome, cot death. If you find water at this outcome, you might consider that it has come
from the tap of bed sharing. The confounding variable may, however, be responsible, with water coming
from socioeconomic status causing the apparent flow into the cot death variable. To assess the relationship
between bed sharing and cot death, differences in socioeconomic status therefore need to be removed,
either by stratification or regression. For maternal smoking and cot death the situation is different: a
portion of the flow from smoking to cot death results from the effect of smoking on birth weight, so if we
control for birth weight we will cut off some of the flow and underestimate the true effect of the exposure.
If, using stratification, or logistic regression modeling, we do not consider the direction of causal
relationships, we can inaccurately estimate the true odds ratios of exposures. This occurs because
adjusting for variables which are on the causal pathway between the exposure and outcome, may diminish
the effect of the exposure.
Establishing the direction of causal pathways, along with the relationships between variables is why a
thorough literature review is necessary before conducting an epidemiological study, and beginning data
analysis.
Commands in this Lesson
The following commands are used in this lesson.
Commands for creating new variables
DEFINE
This command lets you define the name of a new variable, which you will then fill in using the IF and
ASSIGN commands. Located in the Variables folder.
IF
This command lets you make conditional statements, so that if a condition based on the values of other
variables is met, a value is assigned to your new variable. Located in the Select/If folder.
ASSIGN
This command is used after the IF command to assign a value to a variable. Located in the Variables folder.
RECODE
This command is used to create new variables, based on the values of other variables. You can use this
command to create categories based on numerical (continuous) variables. Aggregating sparsely defined
categorical variables may also be achieved using this command. Located in the Variables folder.
Commands for multivariate analysis
LOGISTIC REGRESSION
This command allows the user to undertake logistic regression to investigate multivariate relationships
between exposures and a binary outcome. Located in the Advanced Statistics folder.
LINEAR REGRESSION
This command fits a linear regression model, which relates a continuous outcome to one or more exposure
variables. Located in the Advanced Statistics folder.
Create a Logistic Regression model
You are going to use the LOGISTIC REGRESSION command to calculate an odds ratio and 95% Confidence
Limits for selected variables to see if they are significantly associated with having a disease. The LOGISTIC
REGRESSION command produces an output with an odds ratio, 95% Confidence Limits, and p-value for
each exposure (X) variable in the model.
In EpiInfo:
• the outcome (Y) variable should be: diseased (case) = YES, non-diseased (control) = NO;
• each exposure (X) variable should be: exposed = 1, unexposed = 0.
If we have more than two categories of exposure, for example Maori, Pacific and European, we can
code these as 1 and 0 by assigning one category as the comparator (usually the one with the largest
numbers, here European). Two “dummy variables” are then included, one with a value of 1 if the
participant is Maori, and the other with a value of 1 if the participant is Pacific. Europeans will have values
of 0 for both dummy variables. The beta coefficients associated with the Maori and Pacific variables can be
used to estimate the odds ratios for these ethnic groups, compared to the comparator (European).
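The dummy coding described above can be summarised as a small function. This is a sketch in Python, not EpiInfo code. The function name is hypothetical, and the ETHNIC codes (1 = Maori, 2 = Pacific, 3 = European) are those used elsewhere in this guide.

```python
# Illustrative sketch only (not EpiInfo syntax).
# ETHNIC codes: 1 = Maori, 2 = Pacific, 3 = European (the reference group).
def ethnic_dummies(ethnic):
    """Return the two 0/1 dummy variables for one participant."""
    return {
        "Maori": 1 if ethnic == 1 else 0,
        "Pacific": 1 if ethnic == 2 else 0,
    }


for code, label in [(1, "Maori"), (2, "Pacific"), (3, "European")]:
    print(label, ethnic_dummies(code))
```

Three categories need only two dummy variables, because the reference category is identified by both dummies being 0.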
You should only use logistic regression after you have analysed your data using the TABLES command and
know what odds ratios to expect from visualizing the distribution of study participants in the
disease-exposure categories.
Before you can run the logistic regression, both outcome and exposure variables need to be converted to
the format that is recognized by EpiInfo (above).
READ the Excel file called SIDS_EpiInfo.
CREATE THE Y (OUTCOME VARIABLE)
For the outcome (Y) variable, CASE must be converted from 1 (for case) and 2 (for control) to YES (for case)
and NO (for control). To do this, you will need to define a new variable called SIDS and assign to it the
appropriate values from the original variable CASE.
18. From the Variables folder, click Define. The DEFINE dialog box opens.
19. Type in the Variable Name space ‘SIDS’. Your Dialogue box should look like this (below).
20. Click OK.
21. From the Select/If folder, click If. The IF dialog box opens.
22. From the Available Variables drop-down, select CASE, and use the buttons to make it equal to 1. Your
Dialogue box should look like this (below).
23. Click Then.
24. From the Variables folder (on the left of the original screen), click Assign. The ASSIGN dialog box
opens.
25. From the Assign Variable drop-down, select ‘SIDS’, and use the buttons to make it equal to “Yes”.
Your Dialogue box should look like this (below), with = (+) in the space under Expression.
26. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears.
27. Click ELSE.
28. You are then taken to the Variables folder on the left of the screen. Click on ASSIGN. Choose the
variable SIDS from the Assign Variable drop down menu and make it equal to “No”. The dialogue box
should look like this (below), with = (-) in the space under Expression.
29. Click ADD. You should now be returned to the If dialogue box (below). Check that if CASE =1, then SIDS
= (+) (under THEN ); and ELSE (ie. CASE=2) that SIDS = (-).
30. Use the TABLES command to check that you have correctly converted CASE into SIDS.
CREATING DUMMY X (EXPOSURE) VARIABLES
The names of dummy variables created below are in lower case to indicate they are new and not
part of the original data set.
31. Now you will recode the X variable BEDSHARE, which also must be converted into 1 (for exposed) or 0
(for unexposed). To do this, you will need to define a new variable called Bedshare_LR and recode it to
the appropriate values from the original variable BEDSHARE.
32. From the Variables folder, click Define. The DEFINE dialog box opens.
33. Type in the Variable Name space ‘Bedshare_LR’. Your Dialogue box should look like this (below).
34. Click OK.
35. From the Select/If folder, click If. The IF dialog box opens.
36. From the Available Variables drop-down, select BEDSHARE, and use the buttons to make it equal to 1.
Your Dialogue box should look like this (below).
37. Click Then.
38. From the Variables folder (on the left of the original screen), click Assign. The ASSIGN dialog box
opens.
39. From the Assign Variable drop-down, select ‘Bedshare_LR’, and use the buttons to make it equal to ‘1’.
Your Dialogue box should look like this (below).
40. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears.
41. Click ELSE.
42. You are then taken to the Variables folder on the left of the screen. Click on ASSIGN. Choose the
variable Bedshare_LR from the Assign Variable drop down menu and make it equal to 0. The dialogue
box should look like this (below).
43. Click ADD. You should now be returned to the If dialogue box (below). Check that if BEDSHARE=1, then
Bedshare_LR=1 (under THEN); and ELSE (i.e. BEDSHARE=2) that Bedshare_LR = 0.
44. Use the TABLES command to check that you have correctly converted BEDSHARE into Bedshare_LR.
CREATING THE DUMMY VARIABLES FOR ETHNICITY
You will now convert the variable ETHNIC into two dummy variables called Maori and Pacific, which will
have the values shown in the table below.
For the variable Maori, Maori =1, and Pacific or European = 0.
For the variable Pacific, Pacific =1, and Maori or European = 0.
There is no variable for European as they are by default the reference, so that the odds ratios
calculated for Maori and Pacific will have European infants as the reference group.
Existing Variable       New Dummy Variables
ETHNIC                  Maori       Pacific
Maori (=1)              1           0
Pacific (=2)            0           1
European (=3)           0           0
1. To calculate the variable Maori, repeat the steps 14 to 27 above by:
a. First using the DEFINE command to define the new variable called Maori;
b. Then, from the variable ETHNIC, use the IF and ASSIGN commands to make Maori
infants = 1, and all other infants (Pacific and European) = 0;
c. The final IF dialogue box should look like that below.
2. To calculate the variable Pacific, repeat the steps 14 to 27 above by:
a. First using the DEFINE command to define the new variable called Pacific ;
b. Then, from the variable ETHNIC, use the IF and ASSIGN commands to make Pacific infants
= 1, and all other infants (Maori and European) = 0;
c. The final IF dialogue box should look like that below.
3. Click OK.
4. Use the TABLES command to check that you have correctly recoded ETHNIC into the variables Maori
and Pacific.
CREATE THE DUMMY VARIABLE FOR MATERNAL SMOKING
You will now convert the variable MOTHER_TOBACCO into a dummy variable called Mother_Smk_LR, so
that infants of current smokers and occasional smokers are combined into a single group with the value 1,
and infants of non-smokers are given the value 0.
1. To calculate the variable Mother_Smk_LR, repeat the steps 14 to 27 above:
a. Use the DEFINE command to define the new variable called Mother_Smk_LR ;
b. Then, from the variable MOTHER_TOBACCO, use the IF and ASSIGN commands to make
infants of current smokers (= 1) and occasional smokers (=3) both equal to 1 for the new
variable Mother_Smk_LR, and infants of non-smokers = 0;
c. Note: in your first IF dialogue box, make sure you select MOTHER_TOBACCO = 1 OR
MOTHER_TOBACCO = 3. Do not use the AND button, because no infants would be selected: none of
them fulfils this condition (having a mother who is both a current smoker and an occasional
smoker);
d. The final IF dialogue box should look like that below.
B. EXAMPLE OF CONTROLLING FOR CONFOUNDING WITH A CATEGORICAL VARIABLE
You are now ready to start logistic regression analyses.
You are going to run a model to estimate the risk of cot death associated with bed sharing, adjusting for
ethnicity.
The general form of logistic regression models is:
DISEASE (Y-variable) = EXPOSURE CONFOUNDER (both X-variables)
In this example you will run a logistic regression model to calculate the odds ratio of cot death
associated with bed sharing, adjusting for ethnicity as a categorical variable.
The model is:
SIDS = Bedshare_LR Maori Pacific
Note: both ethnic variables need to be in the model to ensure the reference group is European.
1. From the Command Tree Advanced Statistics folder, click Logistic Regression. The LOGISTIC dialog box
opens.
2. From the Outcome Variable drop-down, select SIDS.
3. From the Other Variables drop-down, select the variables Bedshare_LR, Maori and Pacific. The
dialogue should look like this below.
4. Click OK. The results of the logistic regression analysis appear in the Output window
At the top, ‘LOGISTIC SIDS = Bedshare_LR Maori Pacific’ specifies the variables in the model. SIDS is the
outcome (disease or Y) variable. Bedshare_LR, Maori and Pacific are the exposure (or X) variables.
The following information is provided for each of the exposure variables:
• odds ratio and 95% Confidence Limits,
• calculated beta-coefficient (the antilog, to base e, of this coefficient is the odds ratio),
• S.E., the standard error of the beta-coefficient,
• and the Z-statistic, which is used to derive the p-value for the odds ratio. This is underlined if the
p-value is <0.05, highlighting that it is statistically significant.
The output in the row for the CONSTANT term is not important for the purposes of this Module, and can
be ignored.
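The link between the beta-coefficient, its standard error and the odds ratio can be checked by hand. The sketch below is illustrative Python, not Epi Info output. The function name is hypothetical, and the coefficient and standard error are back-calculated from the rounded odds ratio for bed sharing quoted later in this section (1.6, 95% CI 1.3 to 2.1), so they are approximations rather than values read from the output window.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower limit, upper limit) from a beta-coefficient and its S.E."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Assumed inputs: back-calculated from OR = 1.6 (95% CI 1.3 to 2.1),
# not taken from the Epi Info output itself.
beta = math.log(1.6)
se = (math.log(2.1) - math.log(1.3)) / (2 * 1.96)

or_, lower, upper = odds_ratio_with_ci(beta, se)
z_stat = beta / se  # Z-statistic: the coefficient divided by its standard error

print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}, Z = {z_stat:.2f}")
```

Because the inputs come from rounded figures, the recomputed limits will only roughly match the published interval.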
The diagram below represents the results from a comparable stratified analysis, which may help you
interpret the results. The outer box represents the total study population; the smaller box on the left
represents Maori participants and the box on the right Pacific participants. Those not in either of those
boxes are European. The large central box represents bed sharers. The numbers in the boxes are the odds of
being a case for different combinations of characteristics in the population. The more red colour (or
the darker the shade) in a box, the higher the odds of cot death in that group.
The first odds ratio, for bed sharing (adjusted for ethnic group in the Epi Info output box), is 1.6 (95% CI 1.3
to 2.1). This means that, in this population, Europeans (the reference group for ethnicity) who bed share
have about 60% higher odds of cot death than Europeans who do not. The diagram above shows the
stratified output, with the equivalent estimate 1.56. The crude odds ratio is 2.13 (see next page), which is
the average result over all ethnic groups. What does this suggest about the relationship between ethnic
group, bed sharing and cot death? The difference between the crude and adjusted estimates is greater
than 10%, so ethnic group confounds the relationship between bed sharing and cot death. Adjusting for
ethnic group reduces the strength of the effect of the exposure, which is commonly observed when one
adjusts for a confounding variable.
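The 10% rule used here is simple arithmetic. A minimal Python sketch follows, using the crude and adjusted odds ratios quoted in the text. The function name is hypothetical, and expressing the change relative to the crude estimate is an assumption (some texts divide by the adjusted estimate instead).

```python
def percent_change(crude_or, adjusted_or):
    """Percentage change in the odds ratio after adjustment, relative to the crude estimate."""
    return abs(crude_or - adjusted_or) / crude_or * 100

crude, adjusted = 2.13, 1.6  # odds ratios for bed sharing quoted in the text

change = percent_change(crude, adjusted)
print(f"Change after adjusting for ethnic group: {change:.1f}%")
print("Confounding by the 10% rule" if change > 10 else "No confounding by the 10% rule")
```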
From the logistic output, the odds ratio for Maori is 3.2. What does this mean? It suggests that Maori
who do not bed share have 3.2 times the odds of cot death of Europeans who do not bed
share (compare to the stratified estimate in the SPAN diagram of 2.4). If they bed share as well, the odds
ratios are multiplied (3.2*1.6=5.12). This is close to the stratified estimate (odds ratio 5.53)
presented in the above SPAN diagram. This stratified analysis contrasts with the crude odds ratio, shown
below, which masks the variation between ethnic groups.
C. EXAMPLE OF CONTROLLING FOR CONFOUNDING WITH A CONTINUOUS VARIABLE
In this model you will run a logistic regression to adjust for age as a continuous variable.
The variable you will add to the model is MOTHER_AGE (in years).
The model is: SIDS = Bedshare_LR MOTHER_AGE
1. From the Command Tree Advanced Statistics folder, click Logistic Regression. The LOGISTIC dialog box
opens.
2. From the Outcome Variable drop-down, select SIDS.
3. From the Other Variables drop-down, select Bedshare_LR and MOTHER_AGE.
4. Click OK. The results of the logistic regression analysis appear in the Output window
The odds ratio for Bedshare_LR is now 1.96.
The odds ratio for MOTHER_AGE is 0.91, and highly significant (p<0.0001) despite this value being
close to 1.00. This is because 0.91 is the factor by which the odds of cot death are multiplied for each
one-year increase in the mother’s age. For example, for a 5-year increase in age, the odds ratio for cot
death is (0.91)^5 = 0.62. For a 10-year increase in age, the odds ratio is (0.91)^10 = 0.39.
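The powers above can be reproduced in a couple of lines. This is an illustrative Python sketch with a hypothetical function name; the only input is the per-year odds ratio of 0.91 from the output.

```python
def odds_ratio_for_increase(or_per_unit, units):
    """Odds ratio for an increase of `units` in a continuous variable."""
    return or_per_unit ** units

OR_PER_YEAR = 0.91  # odds ratio for MOTHER_AGE quoted in the text

for years in (1, 5, 10):
    print(f"{years}-year increase in maternal age: OR = {odds_ratio_for_increase(OR_PER_YEAR, years):.2f}")
# prints 0.91, 0.62 and 0.39
```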
Extra for experts (not examinable!)
The effect plots show that bed sharing increases the probability of being a case,
whereas increasing maternal age reduces the risk of cot death. Notice that the model forces a
particular form of relationship between maternal age and risk of cot death. Generally, it is advisable
to check this relationship first, by categorising the independent variable so that the assumption of
linearity may be checked.
If we combine these two effects into the same graph, and extrapolate the model beyond the
observed range of maternal ages, we observe the logistic function that is forced on the effect of
maternal age on risk of cot death (by bed sharing status). You can see that bed sharers are at
increased risk of cot death for all values of maternal age. The actual observations are
plotted at the top and bottom of the graph. As you can see, there is considerable overlap between
the ages of cases and of controls; however, controls tend to be slightly older than cases (median 27.8
vs 24.9 years). If the true effect of maternal age on risk of cot death were “U” shaped, with
mid-range values of maternal age at low risk and extreme values (low or high) at high risk, we wouldn’t be
able to pick it up. The logistic function, fitted to a continuous variable, assumes a dose-response
effect and also that the steepest change in risk of the outcome occurs in the middle of the range of x-values,
with risk plateauing out at the extremes of the range.
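To see why a single logistic term cannot produce a “U” shape, it can help to compute the fitted curve directly. The Python sketch below uses the two odds ratios quoted in the text (1.96 for bed sharing, 0.91 per year of maternal age). The constant term is not reported in this guide, so the value used is purely hypothetical, and the function name is also an assumption. The probabilities are therefore illustrative only, but the shape (always falling with age, always higher for bed sharers) does not depend on the constant.

```python
import math

B_BEDSHARE = math.log(1.96)  # from the odds ratio for Bedshare_LR in the text
B_AGE = math.log(0.91)       # from the odds ratio per year of MOTHER_AGE in the text
B_CONSTANT = 1.0             # hypothetical: the constant term is not given in the text

def prob_case(age, bedshare):
    """Predicted probability of being a case from the logistic model."""
    log_odds = B_CONSTANT + B_BEDSHARE * bedshare + B_AGE * age
    return 1 / (1 + math.exp(-log_odds))

print("age  no bed sharing  bed sharing")
for age in range(0, 51, 10):
    print(f"{age:>3}  {prob_case(age, 0):>14.3f}  {prob_case(age, 1):>11.3f}")
```

Each step up in age multiplies the odds by the same factor, so the curve can only move in one direction.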
[Figure: MOTHER_AGE effect plot and bed_share effect plot; y-axis: case, x-axes: MOTHER_AGE and bed_share.]
[Figure: Probability of case by maternal age, plotted separately for Bedshare and No Bedsharing.]
Despite having a highly significant effect, maternal age does not confound the relationship between
bed sharing and cot death. Why? Because the adjusted odds ratio for bed sharing (1.96) is less than
10% lower than the crude odds ratio (2.13). Although maternal age is related to cot
death, it is either not related to bed sharing or balanced among bed sharers and non-bed sharers, so
it does not exert a confounding influence on the exposure of interest.
Before you finish, be certain to save your work by clicking “save” in the program editor. Then click
the “text file” button, navigate to the folder where you want to save the file, name the file and
press “ok”. The file will be saved as a plain text file with a .pgm extension.
The next time you want to pick up where you left off, simply open “Analyze data”, click “Open” in
the program editor, then click “text file” in the dialogue box, then navigate to the program that you
saved at the end of the previous session. Then click “Run” in the program editor and Epiinfo will
rerun the commands that you covered in the previous session, opening the original dataset and
making the new variables that you’ve created. You may have to wait for a brief period while the
program runs.
6: Using logistic regression to investigate effect modification
D. EXAMPLE OF USING LOGISTIC REGRESSION TO MODEL EFFECT MODIFICATION BY CREATING AN
INTERACTION TERM.
In Module 4, when you used the TABLES command to calculate an odds ratio for outcome variable
CASE in relation to exposure variable BED_SHARE, adjusting for ETHNIC, the very bottom of the
output screen showed a p-value = 0.058, suggesting that the odds ratios varied between strata (i.e.
heterogeneity). This is an example of effect modification, with the ETHNIC variable modifying the
association between BED_SHARE and CASE.
One way to model effect modification is to multiply two variables (called the main variables)
together to get an interaction variable.
The model is: SIDS = var A  var B  (var A)*(var B)
where A = Bedshare_LR and B = Mother_Smk_LR.
You will use a button in the LOGISTIC dialogue box to create the interaction term.
The variables Bedshare_LR and Mother_Smk_LR have already been created above.
1. From the Advanced Statistics folder, click Logistic Regression. The LOGISTIC dialog box opens.
2. From the Outcome Variable drop-down, select SIDS.
3. From the Other Variables drop-down, select ‘Bedshare_LR’ and ‘Mother_Smk_LR’. Click on both
variable names (while holding down the shift key) so that they are both highlighted in blue. The ‘Make
Dummy’ button immediately changes to ‘Make Interaction’.
4. Click on ‘Make Interaction’ button. A new interaction variable (Bedshare_LR*Mother_Smk_LR)
appears in the ‘Interaction Terms’ space. Your Dialogue box should look like this (below).
5. Click OK. The results of the logistic regression analysis appear in the Output window.
The significant odds ratio for the interaction term (OR = 1.74, p=0.029) indicates that a multiplicative
interaction between maternal smoking and bed sharing is present. The odds ratios are interpreted as
shown in the following table, with the total odds ratio for infants exposed to both risk factors being
the product of the main effects for bed sharing and maternal tobacco times the interactive effect.
                          Exposed to Maternal Smoking
Exposed to Bed Sharing    Yes                          No
Yes                       1.35 x 3.05 x 1.74 = 7.16    1.35
No                        3.05                         1.00
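The table above can be rebuilt from the three odds ratios in the output. The Python sketch below is illustrative only: the function name is hypothetical, and assigning 1.35 to bed sharing and 3.05 to maternal smoking follows the order in which the text lists the main effects.

```python
def joint_odds_ratios(or_bedshare, or_smoke, or_interaction):
    """Odds ratios for each (bed sharing, maternal smoking) combination,
    relative to infants exposed to neither."""
    return {
        ("no", "no"): 1.0,
        ("yes", "no"): or_bedshare,
        ("no", "yes"): or_smoke,
        ("yes", "yes"): or_bedshare * or_smoke * or_interaction,
    }

table = joint_odds_ratios(or_bedshare=1.35, or_smoke=3.05, or_interaction=1.74)

for (bedshare, smoke), odds_ratio in table.items():
    print(f"bed sharing: {bedshare:<3}  maternal smoking: {smoke:<3}  OR = {odds_ratio:.2f}")
```

Without the interaction term, the doubly exposed group would have an odds ratio of 1.35 x 3.05 = 4.12; the interaction odds ratio of 1.74 is the extra factor needed to reach 7.16.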
The effect plot is shown below:
[Figure: bed_share*mother_smoke effect plot; y-axis: case, x-axis: bed_share, panels: mother_smoke : 0 and mother_smoke : 1.]
The effect plot (above) shows that the effect of bed sharing, illustrated by the slope of the line, is steeper
(a stronger effect) in maternal smokers (mother_smoke=1) than in non-smokers (mother_smoke=0).
The SPAN diagram is shown below, with the red squares having the highest probability (similar to
odds) of being a case. The numbers inside the boxes are odds ratios compared to infants who do not
bed share and whose mothers do not smoke.
The SPAN diagram stratified analysis is very similar to the output of the logistic model.
In contrast with the previous analyses (page 59), the odds ratios for ‘Bedshare_LR’ and
‘Mother_Smk_LR’ are weaker because much of their effect has been taken up by the interaction
term.
6. Now repeat steps 1 to 5 above, and run a logistic regression model which adds the
ethnic dummy variables Maori and Pacific to the above model.
7. Your Logistic Dialogue box should look like this (below).
8. Click OK. The results of the logistic regression analysis appear in the Output window.
The interaction term is no longer significant (p=0.0975). However, the odds ratios
still have the same pattern as above, as shown in the table below. These odds ratios are very similar
to those from the analyses you did in Session 4 using the TABLES command to calculate Mantel-Haenszel
odds ratios for the effect of bed sharing on cot death, adjusted for ethnicity (page 36).
                          Exposed to Maternal Smoking
Exposed to Bed Sharing    Yes                          No
Yes                       1.27 x 2.67 x 1.53 = 5.19    1.27
No                        2.67                         1.00
A SPAN diagram illustrates this effect below, with the numbers showing odds ratios that compare
each group with the baseline group (European infants whose mothers neither smoke nor bed share).
The increased risk of cot death is illustrated by the deep red colour.
When these individual effects are combined, one can see that the overall probability of cot death
increases dramatically in the highest risk groups that combine all risk factors. For example, Maori
infants who have mothers that smoke and bed share have a 10-fold increased risk of cot death
compared to European infants who do not bed share and whose mothers do not smoke. The SPAN
diagram reports stratified estimates, whereas the equivalent logistic regression odds ratio for Maori
infants whose mothers both bed share and smoke is 1.98*1.27*2.67*1.53=10.27. The increased
risk associated with Maori ethnic group is not seen in the table above. The effect plot is shown
below. On the left you see the probability of being a case, by ethnic group, derived from the logistic
model. The red, dashed line shows the 95% confidence interval for the estimate. Clearly Maori are at
higher risk of cot death than the other ethnic groups. The narrower confidence interval surrounding
the European estimate reflects the larger sample size in this group compared to the other ethnic
groups. On the right, the interaction between maternal smoking and bed sharing and the risk of cot
death is portrayed. You can see the slope of the line, indicating the effect of bed sharing on cot
death is much steeper in smoking mothers than non-smoking mothers. These different gradients
(effects) indicate effect modification.
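The combined odds ratios for every European and Maori stratum can be listed from the model coefficients. The Python sketch below is illustrative: the function name is hypothetical, the assignment of 1.27 to bed sharing and 2.67 to maternal smoking follows the order used in the text, and Pacific infants are left out because their odds ratio is not quoted here.

```python
from itertools import product

# Odds ratios quoted in the text for the model that includes ethnicity.
OR_MAORI = 1.98
OR_BEDSHARE = 1.27
OR_SMOKE = 2.67
OR_INTERACTION = 1.53  # bed sharing x maternal smoking

def combined_or(maori, bedshare, smoke):
    """Odds ratio relative to European infants with neither exposure."""
    value = 1.0
    if maori:
        value *= OR_MAORI
    if bedshare:
        value *= OR_BEDSHARE
    if smoke:
        value *= OR_SMOKE
    if bedshare and smoke:
        value *= OR_INTERACTION  # applies only when both exposures are present
    return value

for maori, bedshare, smoke in product((0, 1), repeat=3):
    group = "Maori" if maori else "European"
    print(f"{group:<8} bed share={bedshare} smoke={smoke}  OR = {combined_or(maori, bedshare, smoke):.2f}")
```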
[Figure: eth_cat effect plot (y-axis: case; x-axis: eth_cat, with European, Maori and Pacific) and bed_share*mother_smoke effect plot (y-axis: case; x-axis: bed_share; panels: mother_smoke : 0 and mother_smoke : 1).]
Before you go, don’t forget to save your work (page 22)!
7: Using logistic regression to examine effect modification (2)
Another way to model effect modification is to create dummy variables for each group of exposures
when you combine two variables. For example, the variables for maternal smoking
(Mother_Smk_LR) and infant bed sharing (Bedshare_LR) can be combined to create 4 levels as
shown in the two left hand columns in the table below.
Existing Variables                 New Dummy Variables
Mother_Smk_LR    Bedshare_LR       Smoke_Share    Smoke_only    Share_only
Yes (=1)         Yes (=1)          1              0             0
Yes (=1)         No (=0)           0              1             0
No (=0)          Yes (=1)          0              0             1
No (=0)          No (=0)           0              0             0
Three dummy variables (as shown in the table above) can be created from these 4 levels.
Logistic regression can then be used to model effect modification (or interaction) by running the
following model.
The model is: SIDS = Smoke_Share Smoke_only Share_only
1. From the Variables folder, click Define. The DEFINE dialog box opens.
2. Type in the Variable Name space ‘Smoke_Share’. Your Dialogue box should look like this (below).
3. Click OK.
4. From the Select/If folder, click If. The IF dialog box opens.
5. From the Available Variables drop-down, select Mother_Smk_LR, and use the buttons to make it
equal to 1.
6. Click the AND button.
7. From the Available Variables drop-down, select Bedshare_LR, and use the buttons to make it equal to
1. Your Dialogue box should look like this (below).
8. Click Then.
9. From the Variables folder, click Assign (which is highlighted in blue). The ASSIGN dialog box opens.
10. From the Assign Variable drop-down, select Smoke_Share, and use the buttons to make it equal to ‘1’
(for infants exposed to both maternal smoking and bed sharing). Your Dialogue box should look like
this (below).
11. Click ADD. The ASSIGN dialogue box closes and the IF dialogue box reappears.
12. Click ELSE.
13. You are then taken to the Variables folder on the left of the screen. Click on ASSIGN. Choose the
variable Smoke_Share from the Assign Variable drop down menu and make it equal to 0. The dialogue
box should look like this (below).
14. Click ADD. You should now be returned to the If dialogue box (below). Check that if both
Mother_Smk_LR =1 AND Bedshare_LR=1, then Smoke_Share =1 (under THEN ); and ELSE (ie. all other
infants) that Smoke_Share = 0.
15. Now you will create the dummy variable called Smoke_only for infants exposed only to maternal
smoking. To calculate this, repeat the steps 1 to 14 above by:
a. First using the DEFINE command to define the new variable called Smoke_only ;
b. Then, from the variables Mother_Smk_LR and Bedshare_LR, use the IF and ASSIGN
commands to make infants who are exposed only to maternal smoking (and not
bedsharing) = 1, and all other infants = 0;
c. The final IF dialogue box should look like this below.
16. Now you will create the dummy variable called Share_only for infants exposed only to bed
sharing. To calculate this, repeat the steps 1 to 14 above by:
a. First using the DEFINE command to define the new variable called Share_only ;
b. Then, from the variables Mother_Smk_LR and Bedshare_LR, use the IF and ASSIGN
commands to make infants who are exposed only to bedsharing (and not maternal
smoking) = 1, and all other infants = 0;
c. The final IF dialogue box should look like that below.
17. To check that you have correctly created the new combination variables Smoke_Share,
Smoke_only and Share_only, sort the data set by each of these variables in turn. Then use the LIST
command to list each of these 3 new variables with Mother_Smk_LR and Bedshare_LR, to check
that the new combination variables are correct.
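The logic behind steps 1 to 17 can be summarised outside Epi Info. The Python sketch below is not Epi Info syntax; the function name is hypothetical and it assumes both existing variables are coded 0/1 with no missing values.

```python
def make_dummies(mother_smk, bedshare):
    """Return (Smoke_Share, Smoke_only, Share_only) for one infant.

    Assumes mother_smk and bedshare are coded 1 = yes, 0 = no.
    """
    smoke_share = 1 if mother_smk == 1 and bedshare == 1 else 0
    smoke_only = 1 if mother_smk == 1 and bedshare == 0 else 0
    share_only = 1 if mother_smk == 0 and bedshare == 1 else 0
    return smoke_share, smoke_only, share_only

# List every combination, as the LIST check in step 17 does for the real data.
print("Mother_Smk_LR  Bedshare_LR  Smoke_Share  Smoke_only  Share_only")
for smk in (1, 0):
    for share in (1, 0):
        dummies = make_dummies(smk, share)
        print(f"{smk:>13}  {share:>11}  {dummies[0]:>11}  {dummies[1]:>10}  {dummies[2]:>10}")
```

At most one of the three dummy variables equals 1 for any infant, and all three are 0 for the reference group (neither exposure).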
You are now ready to run the logistic regression model with the new dummy variables. All three
dummy variables must be included to get appropriate odds ratios.
Remember, the model is: SIDS = Smoke_Share Smoke_only Share_only
18. From the Command Tree Advanced Statistics folder, click Logistic Regression. The LOGISTIC dialog box
opens.
19. From the Outcome Variable drop-down, select SIDS.
20. From the Other Variables drop-down, select Smoke_Share, Smoke_only and Share_only (see
Dialogue box below).
21. Click OK. The results of the logistic regression analysis appear in the Output window
The odds ratios from this output can be inserted into the table below to help their interpretation.
The values in this table are exactly the same as the corresponding table in the previous section (page
59) where you created a multiplicative interaction term by multiplying the maternal smoking and
bedsharing variables with each other.
                          Exposed to Maternal Smoking
Exposed to Bed Sharing    Yes     No
Yes                       7.16    1.35
No                        3.05    1.00
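One way to see that the two approaches agree is to recover the interaction odds ratio from the dummy-variable results: divide the odds ratio for the doubly exposed group by the product of the two single-exposure odds ratios. A minimal Python sketch with a hypothetical function name, using the values in the table above:

```python
def implied_interaction(or_both, or_smoke_only, or_share_only):
    """Interaction odds ratio implied by the three dummy-variable odds ratios."""
    return or_both / (or_smoke_only * or_share_only)

ratio = implied_interaction(or_both=7.16, or_smoke_only=3.05, or_share_only=1.35)
print(f"Implied interaction odds ratio: {ratio:.2f}")  # 1.74, matching the interaction term
```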
22. Now repeat steps 18 to 21 above, and run a logistic regression model which adds the
ethnic dummy variables Maori and Pacific to the above model.
23. Your Logistic Dialogue box should look like this (below).
24. Click OK. The results of the logistic regression analysis appear in the Output window.
The odds ratios from this output can be inserted into the table below to help their interpretation.
The values in this table are almost the same as the corresponding table in the previous section (page
59) where you created a multiplicative interaction term by multiplying the maternal smoking and
bedsharing variables with each other, and also adjusted for ethnicity.
These odds ratios also are very similar to the same analyses you did in Session 5 using the TABLES
command to calculate Mantel-Haenszel odds ratios adjusted for ethnicity (page 59).
                          Exposed to Maternal Smoking
Exposed to Bed Sharing    Yes     No
Yes                       5.18    1.27
No                        2.67    1.00
The equivalent SPAN diagram, with stratified odds ratios is laid out below. The increased intensity of
cases (red colour) is illustrated in smokers who identify as Maori.
Don’t forget to save your work (page 22)!
References
1. Scragg R, Mitchell EA, Taylor BJ, Stewart AW, Ford RP, Thompson JM, Allen EM, Becroft DM: Bed sharing, smoking, and alcohol in the sudden infant death syndrome. New Zealand Cot Death Study Group. BMJ 1993;307:1312-1318.
2. Marshall RJ: Scaled rectangle diagrams can be used to visualize clinical and epidemiological data. J Clin Epidemiol 2005;58:974-981.
3. Chongsuvivatwong V: Analysis of epidemiological data using R and Epicalc.
4. Jennings LC, MacDiarmid RD, Miles JAR: A study of acute respiratory disease in the community of Port Chalmers. I. Illnesses within a group of selected families and the relative incidence of respiratory pathogens in the whole community. Journal of Hygiene 1978;81:49-66.
8: Sample size using Statcalc
Today we will be using Statcalc, an Epi info utility, to estimate the sample size for a study that is
planned. We will be going through a real scenario.
You have recently got a job in the university as a research fellow on a project looking at diabetes
prevalence in New Zealand. Your funding for next year looks like it may dry up and you could be
facing dreaded unemployment. The Health Research Council puts out a “request for proposals” on
research which may prevent H1N1 infection. $250,000 is to be made available. Another Professor in
the department suggests that he has an idea of looking at the effect of a large one off dose of vitamin
D on the incidence of upper respiratory infection in a randomized study. He gives you some help, but
you have the task of sorting out the nuts and bolts of the grant application and designing the study.
Background
Interventions to reduce the burden of infectious disease usually use the triad of the host, agent and
environment as a theoretical model to structure interventions. In the host, biological protection against
infection consists of both innate and agent specific or humoral immunity. In H1N1 infection, national
pandemic preparedness plans emphasise the rapid deployment of both antiviral treatment and specific
vaccination that provokes a response from the humoral immune system. Both interventions are limited
both by expense and possible lack of effect. For example, resistance to oseltamivir may develop due
to mutation and widespread use, or a vaccination may be slow to develop and test for clinical efficacy
because novel antigens may be needed for its manufacture. The impact of H1N1 on a population may
be severe before these issues are identified and useful alternatives developed. Interventions to enhance
innate immunity, which obviate limitations of antiviral therapy and vaccination, have not been
considered. To explore how vitamin D may reduce the impact of H1N1 infection, first we define the
infections we seek to measure, the status of vitamin D in New Zealand populations, then explore the
evidence that links vitamin D with the immune response to such infections.
Study design
A randomised, double blind, controlled trial is proposed. The two arms will be one of either vitamin D
supplementation (500,000 IU) or placebo and participants will be recruited at the time of their annual
influenza vaccination from primary care. Such a dose has been shown to safely raise 25OHD levels to
≥ 80nmol/L for at least three months, without inducing hypercalcemia. Although in the real example,
we considered the counts of respiratory infections, here we will consider the outcome as a binary
variable for simplicity (infection or no infection after one year).
How many participants are needed in your study?
As part of the grant writing exercise, you need to consider how many participants you will recruit. A
sample size calculation is required. Let’s quickly review the rationale for a sample size calculation.
After conducting your experiment, you can make one of four conclusions. The two boxes in the
following table labeled “OK” are where you want to end up. You want to minimize the chance of
accepting the null hypothesis when it is false (type 2 error), which happens when a study has too few
participants, and the chance of rejecting the null hypothesis when it is in fact true (type 1 error).
                           Test result
Null Hypothesis            Accept Null     Reject Null
True (No difference)       OK              Type 1 error
False                      Type 2 error    OK
How do you go about the calculation?
You look up in the textbook:
and:
• n = sample size in each group
• π0 = risk in unexposed
• π1 = risk in exposed
• u = one-sided z value of the normal distribution corresponding to 100% − power (e.g. 10%; u = 1.28)
• v = two-sided z value of the normal distribution corresponding to the required two-sided significance level (e.g. 5%; v = 1.96)
That looks ugly! Fortunately, you’re not the first person to undertake a sample size calculation. Others
have decided to write computer programs to make life easier for you. The statcalc utility has just such
a feature!
What information is needed?
From above, two bits of information are “no brainers”. It is standard practice to accept the 5% (1/20)
level for the probability of a type 1 error (v), whereas the probability of a type 2 error is usually fixed
at 20% (u). All we are left with is the risk in the exposed and unexposed. In a community survey
completed about 30 years ago, you find that the mean rate of infections is 2 per person per year.[4]
You then wonder what proportion of individuals are likely to have no infections and how many would
have at least one. You go down the corridor and ask a statistician for help. He (or she) says that if we
assume this number comes from a Poisson distribution (commonly used for count data) (figure 1),
then we expect the proportion of people with no infections to be about 14%, so that 86% (100 − 14)
of people will have at least one infection during the year.
Figure 1. Poisson distribution, mean = 2 infections (x-axis: number of infections; y-axis: probability mass).
If we believe that our treatment is likely to reduce the number of infections by about 20%, reducing
the population mean to 1.6 infections, then our statistician informs us that about 80% (figure 2) of the
treatment group will have a respiratory infection. Great, we now have all the information to go ahead
and make our sample size calculation.
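The statistician's percentages follow from the Poisson probability of zero events, exp(−mean). A minimal Python sketch, with a hypothetical function name, that reproduces both figures:

```python
import math

def prob_at_least_one(mean):
    """Probability of one or more events when counts follow a Poisson distribution."""
    return 1 - math.exp(-mean)

for mean in (2.0, 1.6):
    print(f"Mean {mean} infections per year: {prob_at_least_one(mean):.0%} have at least one infection")
# prints 86% for a mean of 2.0 and 80% for a mean of 1.6
```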
[Figure 2. Poisson distribution, mean = 1.6 infections (x-axis: number of infections; y-axis: probability mass).]
To run the calculation, open Epi Info. In the top menu, click Utilities, then Statcalc.
You should get this.
Use the arrow keys to select “Sample size & power”. Then select “Cohort or Cross-sectional”
Press Enter
You will then see this:
Here, you are given all the inputs required for estimating sample size for your grant application. You
can leave the first three settings as they are and use the arrow keys to move down to the frequency of
disease in the unexposed group, which we have already worked out is probably about 86%. We think
that the frequency of disease in the exposed will be about 80%; enter this on the last
line. Press Enter when you are finished. Your screen should look something like this.
Press F4 to complete the calculation. Reading along the top line you see that you will need a total of
1294 people.
You present these findings to your boss. He says you must be joking if you’re going to recruit that
number of people in one year at the start of winter with the budget being offered. He suggests redoing
the calculation with a 40% reduction in respiratory infections. This time, you think your treatment will
reduce the number of infections to an annual average of 1.2 per person (perhaps a little optimistic, but
you have to get the numbers down). Your faithful statistician calculates that the proportion of people
expected to have at least one infection is now 70%. See below:
[Figure: Poisson distribution, mean = 1.2 infections (x-axis: number of infections; y-axis: probability mass).]
Repeat the calculation in Statcalc.
Now you need only 232 participants. Your boss now informs you that the sample size is a bit more
realistic, but will the reviewers of your grant believe that vitamin D will have such a dramatic effect?
You hope so!
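For readers who want to see roughly what sits behind the Statcalc numbers, the Python sketch below uses the Fleiss formula for comparing two proportions with a continuity correction and equal group sizes. That Statcalc uses this particular formula is an assumption; this guide only reports the totals. The function name and defaults (5% two-sided significance, 80% power) are also chosen for illustration.

```python
import math
from statistics import NormalDist

def fleiss_n_per_group(p_unexposed, p_exposed, alpha=0.05, power=0.80, continuity=True):
    """Approximate sample size per group for comparing two proportions (equal groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - P(type 2 error)
    p_bar = (p_unexposed + p_exposed) / 2
    diff = abs(p_unexposed - p_exposed)
    n = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_unexposed * (1 - p_unexposed) + p_exposed * (1 - p_exposed))
    ) ** 2 / diff ** 2
    if continuity:
        # Fleiss continuity correction for two groups of equal size
        n = n / 4 * (1 + math.sqrt(1 + 4 / (n * diff))) ** 2
    return n

for p_exposed in (0.80, 0.70):
    total = 2 * fleiss_n_per_group(0.86, p_exposed)
    print(f"86% vs {p_exposed:.0%} with at least one infection: about {total:.0f} participants in total")
```

Under these assumptions the totals should land close to the 1294 and 232 reported above; small differences can arise from how each group size is rounded.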
The graph below illustrates Power vs sample size for a variety of different event rates in the exposed
(treated with vitamin D) group.
I hope you’ve been able to see from this exercise that sample size calculation is a compromise
between a number of factors including resources, data available and various mathematical
assumptions. It is a useful process to go through, because at the end, you understand well what effect
you will be looking for, and how your study is limited by design and resource.
References
1. Jennings LC, MacDiarmid RD, Miles JAR. A study of acute respiratory disease in the community
of Port Chalmers. I. Illnesses within a group of selected families and the relative incidence of
respiratory pathogens in the whole community. Journal of Hygiene. 1978;81:49-66.