Data analysis in outbreak investigations

Tool 6.5: Data analysis
Contents
Introduction
A. Install
B. Open and prepare the dataset
C. Fine tuning the dataset for analysis (data cleaning)
D. Create New Variables and Add Conditional Values
E. Data analysis
  Descriptive Epidemiology
    Demographic profile: the number of cases by age group and sex
    Creating an Epidemic Curve
  Analytical Epidemiology
    Food Specific Attack Rate Table
    Stratified/subgroup Analysis using CIplot
    Subgroup restricted analysis
    Controlled/stratified analysis
    SPC analysis
    Final comments
Summary
ECDC toolbox for FWD outbreak investigations
Introduction
(The text below is based on a previously published report (footnote 1); for complete details, download the report.
To redo the analysis described below, open the file called bridalshower.epx in EpiData Analysis.)
In a particular outbreak situation the project manager responsible for analysis of collected data will need to
do the following:
a. Receive the databases from each involved country/institute. In practice this is one file per data entry point.
b. Quality assure the data using standard methods for data cleaning.
c. Combine the data into one common analysis file. This can be done with the analysis software using the command "append".
d. Analyse the data.
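Steps a to c can be illustrated outside EpiData with a minimal Python sketch. It is not the EpiData "append" command, only the same idea: records from several entry points are combined after a basic check that the files share the same fields. The function name `append_files`, the added `source` field and the two tiny in-memory files are assumptions made for this illustration.

```python
import csv
import io


def append_files(sources):
    """Combine records from several data-entry files into one list.

    Each source is a (name, file-like object) pair holding CSV text.
    All files must share the same field names, in the same order.
    """
    combined = []
    expected_fields = None
    for name, handle in sources:
        reader = csv.DictReader(handle)
        if expected_fields is None:
            expected_fields = reader.fieldnames
        elif reader.fieldnames != expected_fields:
            # A simple quality check before combining the data.
            raise ValueError(f"{name}: fields differ from the first file")
        for row in reader:
            row["source"] = name  # remember the data entry point
            combined.append(row)
    return combined


# Two hypothetical entry-point files, held in memory for the example.
site_a = io.StringIO("id,ill\n1,1\n2,0\n")
site_b = io.StringIO("id,ill\n3,1\n")
records = append_files([("site_a", site_a), ("site_b", site_b)])
print(len(records), "records in the common analysis file")
```

Keeping the entry point in a separate field makes it possible to trace a questionable record back to the file it came from.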
In the following text a standard analysis is shown, and towards the end a table of typical commands
used in the analysis is given, as indicated in the overview of task 6-8. The example presents a typical
cohort study; the analysis in case-control studies will be different. It is assumed that standard epidemiological principles and the usual methods of analysis are known, so the focus is only on showing how to carry this
out in the present software.
To get acquainted with the software a user must do the following:
A. Install
Download and install the EpiData Analysis software in the version appropriate for the operating system of the local computer. The user should have administrative rights on the computer or ask an IT department to do the installation. The software can be installed by unzipping the delivered packages. Make sure to maintain the folder structure; otherwise documentation might not be available as expected.
[Figure: EpiData Analysis main screen. Labelled areas: Main menu; Work Process Toolbar with dialogs; Commands (F2); Variables (F3); Command Prompt (F4); Editor (F5); Browse data (F6); History (F7); Results window (Viewer); Results window control; Statusbar; Change Folder]
The Analysis software is only compiled for Windows at this point. A downloadable zip file is available
from http://www.epidata.dk/download.php. Download the zip file and unzip it to a local drive (or USB drive)
while maintaining the folder structure. Notice that on some computers the contents of a zip file might be
shown as if it were a workable folder. This will confuse the software, since the user has no rights to add files directly to the zip file. The files must therefore be extracted first.
1 http://www.apheo.ca/resources/projects/epidata/EpiData_fieldguide%20v-6Jan-11.pdf
EpiData Analysis is used to analyze EpiData *.rec files, dBase *.dbf files, *.epx files and text *.csv files.
The EpiData Analysis screen is easy to navigate, and functionality is accessible via shortcut keys.
Areas of importance are:
• The command prompt (F4) area located at the bottom of the screen
• The editor (F5)
• The results window
• The dialogs in the work process toolbar
Also take some time to explore the additional functionality available via other shortcut keys.
When you open EpiData Analysis, you see the following menu:
B. Open and prepare the dataset
To open a data file, use:
Navigate to the location where you saved your file and select it. The example here uses a dbf file, but it could be any
supported file (*.rec, *.dbf, *.csv, *.epx).
EpiData will provide some information about the data: name, number of records, number of fields, etc.
In this case we have 64 records in the dataset.
Press F2 and F3 to reveal all commands as well as the file's variables:
Now take a look at the data:
To get a line listing:
Use:
C. Fine tuning the dataset for analysis (data cleaning)
Before analysis can be done in a valid way, the data must be cleaned and documented, if this has not already
been done during entry. This includes finding out whether all variables are valid, how many observations
can be part of the analysis, etc.
Sometimes when you start analyzing your dataset, you realize that the names of variables or values are not
all that meaningful. In these instances in particular it is important to spend some time preparing the dataset, but it is always good practice to define labels at three levels:
(1) At whole file level (labeldata ......),
(2) At variable level (label ....), and
(3) At value or category level (labelvalue....)
If you have properly created the database file with the EpiData Manager, including suitable labels, this is already taken care of. Otherwise a number of commands in the Analysis software can do this.
• To add a label to the whole data file: LABELDATA "Bridal shower outbreak investigation"
• To add (or change) a variable label: LABEL LEMONSORBE "lemon sorbet"
• To add labels to the values:
LABELVALUE LEMONSORBE /0="No" /1="Yes" /2="Unknown"
D. Create New Variables and Add Conditional Values
Sometimes case classification depends on a combined set of logical statements to fulfill the case definition,
e.g. certain symptoms such as diarrhea, but only if combined with fever.
A whole set of commands is available for generating new variables or recoding into grouped variables,
e.g.:
• create your age group variable as an integer:
define agegrp #
recode age to agegrp 0-9=1 10-19=2 20-44=3 45-64=4 65-hi=5
• classify cases based on symptoms:
gen i case = 0
if (vomit = 2) and (fever = 2) then case = 1
if (vomit = 9) then case = 9
labelvalue case /0="No" /1="Yes" /9="Uncertain"
label case "Case status for this person"
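The logic of the recode and case classification commands above can be sketched in Python, for readers who want to check the rules outside EpiData. This is an illustration of the same rules, not EpiData syntax; the function names are assumptions, and the handling of missing or negative ages is a choice made for this sketch.

```python
def age_group(age):
    """Map an age in years to the group codes used in the recode command."""
    if age is None or age < 0:
        return None  # missing or invalid age stays missing (sketch assumption)
    if age <= 9:
        return 1
    if age <= 19:
        return 2
    if age <= 44:
        return 3
    if age <= 64:
        return 4
    return 5  # 65 and above


def case_status(vomit, fever):
    """Apply the rules in the same order as the commands above."""
    status = 0
    if vomit == 2 and fever == 2:
        status = 1
    if vomit == 9:
        status = 9  # applied last, so it overrides the earlier assignments
    return status


labels = {0: "No", 1: "Yes", 9: "Uncertain"}
print(age_group(37), labels[case_status(2, 2)])
```

Because the rules are applied in sequence, a record with vomit = 9 ends up as "Uncertain" regardless of the fever value, exactly as in the command version.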
E. Data analysis
Descriptive Epidemiology
Describing the cases in terms of person, place and time is good epidemiological practice and can help you
develop hypotheses about the mode of transmission and the source of infection. Start with person:
In order to know how many cases and non-cases we have we can do a frequency distribution of the variable
ILL.
In the command prompt area (F4), write: FREQ ILL
You can see that there are 48 cases and 16 non-cases (Where 0=non-case and 1=case)
As an alternative to writing commands (which some users find inconvenient until they gain more experience with
the software), you may use the dialog system in the toolbar. The dialogs allow users who do not know how to
formulate commands to run powerful analyses. Find this under
E.g. use DESCRIBE to summarise the content of a continuous numeric field such as AGE and get:
For the variable AGE, the total number of records used in the calculation is given (N=62), as well as the
sum, mean age, 95% confidence interval, the minimum value, percentiles (5, 10, 25, 50=Median, 75, 90 and
95) and the maximum value.
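A rough idea of what such a summary contains can be given with a short Python sketch. The ages below are hypothetical, and the confidence interval uses a normal approximation (1.96 standard errors), which is an assumption of this sketch and may differ from the method EpiData Analysis uses. Missing values are excluded first, which is why N can be smaller than the number of records.

```python
import math
import statistics


def describe(values):
    """Summarise a numeric field, ignoring missing values (None)."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    mean = statistics.mean(observed)
    # Normal-approximation 95% confidence interval for the mean (assumption).
    half_width = 1.96 * statistics.stdev(observed) / math.sqrt(n)
    return {
        "n": n,
        "sum": sum(observed),
        "mean": mean,
        "ci95": (mean - half_width, mean + half_width),
        "min": min(observed),
        "median": statistics.median(observed),
        "max": max(observed),
    }


# Hypothetical ages; two records have no age recorded.
ages = [23, 35, None, 41, 29, 52, None, 38]
summary = describe(ages)
print(summary["n"], summary["median"])
```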
Demographic profile: the number of cases by age group and sex
Use the new variable you just created above – agegrp.
You can either type the command or use the dialog system in the toolbar.
tables sex agegrp /c
This will produce a cross tabulation of sex by age group along with column percents. Your table should look
like this:
You can see that the majority of the population is between the ages of 20 and 64 years and that more than
80% are women. We can add row percents as well by adding /r to the end of the command line.
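How column percents in such a cross tabulation are derived can be shown with a small Python sketch. The guest records are hypothetical and the function name is an assumption; the point is only that each cell count is divided by the total of its column (here, the age group).

```python
from collections import Counter


def crosstab_column_percent(rows, row_var, col_var):
    """Return {(row value, column value): (count, column percent)}."""
    counts = Counter((r[row_var], r[col_var]) for r in rows)
    col_totals = Counter(r[col_var] for r in rows)
    return {
        cell: (n, 100.0 * n / col_totals[cell[1]])
        for cell, n in counts.items()
    }


# Hypothetical guests: sex and age group code.
guests = [
    {"sex": "F", "agegrp": 3},
    {"sex": "F", "agegrp": 3},
    {"sex": "M", "agegrp": 3},
    {"sex": "F", "agegrp": 4},
]
table = crosstab_column_percent(guests, "sex", "agegrp")
for (sex, grp), (n, pct) in sorted(table.items()):
    print(sex, grp, n, f"{pct:.1f}%")
```

Row percents work the same way, with the row total as the denominator.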
[Figure: bar chart "Gender composition of guests"; x-axis: Age group (0-9 to 70-79, in 10-year groups); y-axis: Count; bars by SEX (F, M)]
Graph created as:
bar agegrp /by = sex /sizex=700 /legend /ti="Gender composition of guests" /xtext="Age group"
Creating an Epidemic Curve
EpiData Analysis has a specific function to make epicurves. The graph's default title and look depend
on the actual data. Here are two examples: on the left, the one from the current data, and on the right, one with
a more balanced look.
By changing axes and selections the user can modify the layout.
Use:
to make a variety of graphs, including the Epidemic Curve
Now, do the epidemic curve based on the graph dialog for epidemic curves,
which will formulate the command as:
Epicurve ill onsetdate
or marked by a third variable as:
Epicurve ill onsetdate /by=sex
Let's change the title by writing in F4:
Epicurve ill onsetdate /ti="Epicurve - Ill by Date of Onset"
[Figure: epidemic curve "Epicurve - Ill by Date of Onset"; x-axis: ONSETDATE (24/02/2008 to 26/02/2008); y-axis: Count (0 to 32, in steps of 1)]
Now, change the y axis label:
Epicurve ill onsetdate /ti="Epicurve - Ill by Date of Onset" /ytext="Number Ill"
[Figure: epidemic curve "Epicurve - Ill by Date of Onset"; x-axis: ONSETDATE (24/02/2008 to 26/02/2008); y-axis: Number Ill (0 to 32, in steps of 1)]
The y-axis numbers are tight. Change the increment to 2:
Epicurve ill onsetdate /ti="Epicurve - Ill by Date of Onset" /ytext="Number Ill" /yinc=2
[Figure: epidemic curve "Epicurve - Ill by Date of Onset"; x-axis: ONSETDATE (24/02/2008 to 26/02/2008); y-axis: Number Ill (0 to 32, in steps of 2)]
Now, the graph is much easier to interpret. The epicurve shows that the onset date of symptoms
predominantly occurred during a 48-hour period between February 25 and February 26, 2008,
approximately 24 hours after the event took place. The epidemiologic curve indicates a point source
outbreak with no observed secondary transmission of infection.
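The counting behind an epidemic curve is simple: the number of ill persons per onset date. The Python sketch below shows that step with a few hypothetical records and a text-based bar per day. The field names mirror the variables used above, and the dd/mm/yyyy date format is taken from the graph axis; the function name is an assumption.

```python
from collections import Counter
from datetime import datetime


def onset_counts(records):
    """Count ill persons per onset date, returned in date order."""
    counts = Counter()
    for rec in records:
        # Only ill persons with a recorded onset date contribute to the curve.
        if rec.get("ill") == 1 and rec.get("onsetdate"):
            day = datetime.strptime(rec["onsetdate"], "%d/%m/%Y").date()
            counts[day] += 1
    return dict(sorted(counts.items()))


# Hypothetical records for illustration only.
sample = [
    {"ill": 1, "onsetdate": "25/02/2008"},
    {"ill": 1, "onsetdate": "26/02/2008"},
    {"ill": 1, "onsetdate": "25/02/2008"},
    {"ill": 0, "onsetdate": ""},
]
for day, n in onset_counts(sample).items():
    print(day.strftime("%d/%m/%Y"), "#" * n)
```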
Analytical Epidemiology
Food Specific Attack Rate Table
At this point in our retrospective cohort study, we are hoping to identify risk factors that might indicate the
cause and mode of transmission of the disease. We created a questionnaire that asked both persons who
were exposed and those who were not exposed to different foods and beverages if they became ill. What
we are trying to do is determine the probability that a food item (for example, eating lemon sorbet) is
linked to the outcome (becoming ill) by calculating attack rates for each food item.
The food item that is the true source of infection will likely have three features:
1. The attack rate is high among persons who ate the food (high food-specific attack rate).
2. The attack rate is low among persons who did not eat the food (so the difference or ratio is high).
3. Most of the cases were exposed, so the exposure could “explain” most, if not all, of the cases.
Attack rate tables can be produced in EpiData Analysis with the 'tables' command:
TAB OUTCOME EXP1 [EXP2...EXP#] /OA [other options]
where OUTCOME is the variable in which you classify your cases (by default, the highest value in that
variable denotes a case) and /OA is the option to create an outbreak table.
In the command prompt window (F4), type the following command:
tables ill lemonsorbe /ct /ar
where ct = compact tables, and ar = attack rates. /oa is short form for “/ct /ar”
The following output appears:
The results indicate that of those who ate (were exposed to) the lemon sorbet (n=44), 42 became ill, for an
attack rate (AR) of 95.5%. Among those who did not eat the sorbet, the attack rate was 24%. The
relative risk (RR) is also significant, although the 95% confidence interval is quite wide. The RR implies a
strong association between the consumption of lemon sorbet and becoming ill.
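The arithmetic behind these figures can be reproduced with a short Python sketch. It uses only numbers stated in the text: 42 ill among 44 exposed, an attack rate of 24% among the unexposed, and 48 cases in total. Because the unexposed attack rate is taken as the rounded 24%, the relative risk computed here is approximate and may differ slightly from the value in the EpiData output.

```python
def attack_rate(ill, total):
    """Proportion of a group that became ill."""
    if total <= 0:
        raise ValueError("group size must be positive")
    return ill / total


ar_exposed = attack_rate(42, 44)   # ate lemon sorbet
ar_unexposed = 0.24                # reported in the text (rounded)
relative_risk = ar_exposed / ar_unexposed
share_of_cases_exposed = 42 / 48   # how many of all cases the exposure "explains"

print(f"AR exposed: {ar_exposed:.1%}")
print(f"RR (approximate): {relative_risk:.2f}")
print(f"Cases exposed: {share_of_cases_exposed:.1%}")
```

The three printed values correspond to the three features of a likely source listed earlier: a high attack rate among the exposed, a high ratio against the unexposed, and an exposure that accounts for most cases.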
To do this for all exposures might be easier using the dialog system:
Your results will look like this (only partial table shown):
You can also add a test of statistical significance to your output. Using the ill/lemon sorbet as an example,
type the following:
tables ill lemonsorbe /ct /ar /T
where ct = compact tables, ar = attack rates and T = Chi square
The following output appears:
The warning message indicates that an expected value of less than 5 occurred in one or more cells. Therefore, the Fisher exact P value needs to be used rather than Chi square. Change the command to the following:
tables ill lemonsorbe /ct /ar /ex
where ct = compact tables, ar = attack rates and ex = Exact test
The RR and the 95% confidence interval show a significant association between the consumption of lemon
sorbet and becoming ill. The p-value is also extremely small, showing statistical significance.
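Why the warning appears, and what the exact test computes, can be sketched in Python. The 2x2 counts below are hypothetical, not taken from the bridal shower data. The sketch first computes the expected cell counts and then a two-sided Fisher exact p-value by summing the probabilities of all tables that are no more likely than the observed one; this is one common definition of the two-sided test and is assumed here, since the text does not state which variant EpiData Analysis uses.

```python
from math import comb


def expected_counts(a, b, c, d):
    """Expected cell counts for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return [[row1 * col1 / n, row1 * col2 / n],
            [row2 * col1 / n, row2 * col2 / n]]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


# Hypothetical table: rows = exposed / unexposed, columns = ill / not ill.
table = (8, 2, 1, 5)
smallest_expected = min(min(row) for row in expected_counts(*table))
print("Smallest expected count below 5:", smallest_expected < 5)
print("Exact p-value:", round(fisher_exact_two_sided(*table), 4))
```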
Stratified/subgroup Analysis using CIplot
Subgroup analysis involves examining the exposure-disease association within different categories of a
third factor. It is an effective method for looking at the effects of two different exposures on disease.
Looking at the food attack rate table above shows that more than one exposure (food item) had elevated
relative risks and statistically significant p values.
A plot of the proportion of cases in subgroups defined by other variables can be an effective way of quickly
finding high or low risk groups. Define a search pattern for subgroups before you do any analysis and
be careful with statistical testing: a broad search-and-test strategy involves many comparisons, which changes
the p-value you should regard as significant.
Type the following command in the Command Prompt (F4) area:
Ciplot ill healthunit sex
or use the dialog system as shown:
To include observations with missing data just add the option /M to commands or tick the same box when
using the dialogs.
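What a plot of proportions with confidence intervals summarises can be sketched in Python: for each subgroup, the proportion of cases and an interval around it. The records are hypothetical, and the Wilson score interval is used here as one common choice; the text does not say which interval CIplot draws, so this is an assumption of the sketch.

```python
import math
from collections import defaultdict


def wilson_interval(cases, total, z=1.96):
    """Wilson score interval for a proportion (95% by default)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = cases / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def proportion_by_group(records, outcome, group):
    """Return {group value: (proportion, lower, upper)}."""
    tallies = defaultdict(lambda: [0, 0])  # group value -> [cases, total]
    for rec in records:
        tallies[rec[group]][1] += 1
        if rec[outcome] == 1:
            tallies[rec[group]][0] += 1
    return {g: (c / t, *wilson_interval(c, t))
            for g, (c, t) in sorted(tallies.items())}


# Hypothetical records: ill (1 = case) by sex code.
guests = [
    {"ill": 1, "sex": 1}, {"ill": 0, "sex": 1},
    {"ill": 1, "sex": 2}, {"ill": 1, "sex": 2},
]
for value, (p, low, high) in proportion_by_group(guests, "ill", "sex").items():
    print(value, f"{p:.2f} ({low:.2f} to {high:.2f})")
```

With small subgroups the intervals are wide, which is a useful reminder not to over-interpret differences between groups.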
Subgroup restricted analysis
To repeat any analysis within a subgroup, say males or females, any command can be executed for a partial
set of the data. This must be done via the command prompt; it is not available in the dialogs. E.g.
Tab ill lemonsorbe /o /t if sex = 1 /// would look at ill and lemonsorbe among those with value 1 in sex
epicurve ill onsetdate if sex = 1 /// an epicurve among those with value 1 in sex
epicurve ill onsetdate /by=sex /// an epicurve with different colouring for males and females
Controlled/stratified analysis
To control for a third factor, or for several more, add these variables to the analysis. Again, this is possible
with commands as well as with dialogs.
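One common way to control for a third factor is to pool stratum-specific results into a Mantel-Haenszel relative risk. The Python sketch below illustrates that technique with hypothetical counts; the text does not state which estimator EpiData Analysis reports for stratified tables, so this should be read as an illustration of the general idea rather than a description of the software's output.

```python
def mantel_haenszel_rr(strata):
    """Pooled relative risk across strata.

    Each stratum is a tuple:
    (exposed_ill, exposed_total, unexposed_ill, unexposed_total).
    """
    numerator = 0.0
    denominator = 0.0
    for exposed_ill, exposed_total, unexposed_ill, unexposed_total in strata:
        n = exposed_total + unexposed_total
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += exposed_ill * unexposed_total / n
        denominator += unexposed_ill * exposed_total / n
    if denominator == 0:
        raise ValueError("no unexposed cases; the pooled RR is undefined")
    return numerator / denominator


# Hypothetical strata (for example, two age groups), each with RR = 2.
strata = [(10, 20, 5, 20), (4, 10, 2, 10)]
print(mantel_haenszel_rr(strata))
```

When every stratum has the same relative risk, as in this example, the pooled estimate equals that common value; a pooled value that differs clearly from the crude relative risk suggests confounding by the stratifying variable.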
For more extensive analysis such as logistic regression other software must be used. EpiData can save data
as Stata files.
SPC analysis
EpiData Analysis also has a full package of Statistical Process Control (SPC) commands; see the menu system for
more information. This includes a G-chart, which shows "time since last occurrence" and can be relevant when looking for rare occurrences of infections.
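The quantity plotted in a G-chart, the time since the last occurrence, can be computed with a few lines of Python. The dates below are hypothetical and the function name is an assumption; the sketch covers only the intervals themselves, not the control limits that an SPC chart adds.

```python
from datetime import date


def days_between_events(event_dates):
    """Days elapsed between each event and the one before it."""
    ordered = sorted(event_dates)
    return [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical dates of rare infections.
events = [date(2008, 1, 1), date(2008, 2, 10), date(2008, 1, 11)]
print(days_between_events(events))
```

A run of unusually short intervals is the kind of signal such a chart is meant to reveal.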
Final comments
Features for saving commands, re-running them via the editor, and other aspects such as copying output are
shown in the introduction documents available through the help menu.
Further inspiration:
see the Http://apheo.ca website
Http://epidata.dk/documentation.php and examples page
Summary
The analysis software supports standard field analysis of data collected in an outbreak situation, with the features expected in current documentation-oriented data collection.
With proper data definitions and a few commands, one can very quickly get an overview of many exposures and their associated risks.
The prime commands used in this context are:
Task                                                                     Command
Create summary Outbreak Analysis risk tables                             tables ill exposures /oa
Create tables and graphs for proportion of cases among other variables   ciplot ill stratum variables
Create epicurves                                                         epicurve ill dateonset /by=sex
Create stratified analysis controlling for other variables               tables ill exposures strata /o /t
Subgroup analysis ("on the fly" selections)                              epicurve ill dateonset /by=sex if age > 35
These four very strong commands can give a quick overview of where to look for further risks or subgroup
analysis, with output focused on risk factor identification.
A very powerful analysis can be done in a short time.
All of the above can be done via the menus as well (apart from the subgroup example).