ECDC toolbox for FWD outbreak investigations
Tool 6.5 Data analysis

Data analysis in outbreak investigations

Contents

Introduction
A. Install
B. Open and prepare the dataset
C. Fine tuning the dataset for analysis (data cleaning)
D. Create New Variables and Add Conditional Values
E. Data analysis
    Descriptive Epidemiology
        Demographic profile: the number of cases by age group and sex
        Creating an Epidemic Curve
    Analytical Epidemiology
        Food Specific Attack Rate Table
        Stratified/subgroup Analysis using CIplot
        Subgroup restricted analysis
        Controlled/stratified analysis
        SPC analysis
        Final comments
Summary

Introduction

(The text below is based on a previously published report [1]; for complete details, download the report. To redo the analysis described below, you must open the file called bridalshower.epx in EpiData Analysis.)

In an outbreak situation, the project manager responsible for analysing the collected data will need to do the following:

a. Receive the databases from each involved country/institute. In practice, this is one file per data entry point.
b. Quality assure the data using standard methods for data cleaning.
c. Combine the data into one common analysis file. This can be done in the analysis software with the command "append".
d. Analyse the data.
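Step c can also be pictured outside EpiData. The following is a minimal Python sketch, not EpiData syntax: it appends records from several data entry points into one list and refuses to combine files whose field names differ. The site names, the field names (id, ill, lemonsorbe), the sample rows and the added "source" column are all hypothetical and serve only as illustration.

```python
import csv
import io


def append_files(sources):
    """Combine records from several CSV sources into one list of rows.

    `sources` maps a (hypothetical) site name to an open text stream.
    All sources must share the same field names, in the same order.
    """
    combined = []
    expected_fields = None
    for site, stream in sources.items():
        reader = csv.DictReader(stream)
        if expected_fields is None:
            expected_fields = reader.fieldnames
        elif reader.fieldnames != expected_fields:
            raise ValueError(
                f"{site}: fields {reader.fieldnames} differ from {expected_fields}"
            )
        for row in reader:
            row["source"] = site  # remember which entry point the record came from
            combined.append(row)
    return combined


# Hypothetical data standing in for the files of two data entry points.
site_a = io.StringIO("id,ill,lemonsorbe\n1,1,1\n2,0,0\n")
site_b = io.StringIO("id,ill,lemonsorbe\n3,1,1\n")
records = append_files({"site_a": site_a, "site_b": site_b})
print(len(records), "records in the common analysis file")
```

Checking the field names before appending is a design choice of this sketch: a mismatch between entry points is usually easier to resolve before the files are merged than afterwards.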
In the following text, a standard analysis is shown, and towards the end a table of typical commands used in the analysis is given, as indicated in the overview of task 6-8. The example presents a typical cohort study; the analysis in case-control studies will be different. It is assumed that standard epidemiological principles and the usual methods of analysis are known; the focus is therefore only on showing how to carry them out in the present software. To get acquainted with the software, a user must do the following:

A. Install

Download and install the EpiData Analysis software in the appropriate version for the operating system of the local computer. The user should have administrative rights for the computer or have an IT department do the installation. The software can be installed by unzipping the delivered packages. Make sure to maintain the folder structure, or else documentation might not be available as expected.

[Screenshot: EpiData Analysis main window, with callouts for the main menu, the work process toolbar with dialogs, Commands (F2), Variables (F3), Command Prompt (F4), Editor (F5), Browse data (F6), History (F7), the results window (viewer), the results window control, the status bar and Change Folder.]

The Analysis software is only compiled for Windows at this point. There is a downloadable zip file available from http://www.epidata.dk/download.php. Download the zip file and unzip it to a local drive (or USB drive) while maintaining the folder structure. Notice that on some computers the contents of a zip file might be shown as if it were a working folder. This will confuse the software, since the user has no rights to add files directly to the zip file. The files must therefore be extracted.

EpiData Analysis is used to analyze EpiData *.rec files, dBase *.dbf files, *.epx files and text *.csv files.

[1] http://www.apheo.ca/resources/projects/epidata/EpiData_fieldguide%20v-6Jan-11.pdf

The EpiData Analysis screen is easy to navigate, and functionality is accessible via shortcut keys. Areas of importance are:

  - The command prompt (F4) area, located at the bottom of the screen
  - The editor (F5)
  - The results window
  - The dialogs in the work process toolbar

Also take some time to explore the additional functionality available via other shortcut keys.

When you open EpiData Analysis, you see the following menu:

[Screenshot: EpiData Analysis start-up menu]

B. Open and prepare the dataset

To open a data file, use:

[Image: menu item for opening a data file]

Navigate to the location where you saved your file and select it. The example here uses a dbf file, but any supported file type (*.rec, *.dbf, *.csv, *.epx) can be used.

EpiData will provide some information about the data: name, number of records, number of fields, etc. In this case, we have 64 records in the dataset.

Press F2 and F3 to reveal all commands as well as the file's variables:

[Screenshot: command list (F2) and variable list (F3)]

Now take a look at the data:

[Screenshot: data view]

To get a line listing, use:

[Image: menu item for a line listing]

C. Fine tuning the dataset for analysis (data cleaning)

Before analysis can be done in a valid way, the data must be cleaned and documented, if this has not already been done during entry. This includes finding out whether all variables are valid, how many observations can be part of the analysis, etc.

Sometimes when you start analyzing your dataset, you realize that the names of variables or values are not all that meaningful. In these instances in particular, it is important to spend some time preparing the dataset, but it is always good practice to define labels at three levels:

(1) At whole file level (labeldata ...)
(2) At variable level (label ...)
(3) At value or category level (labelvalue ...)

If you have properly created the database file with the EpiData Manager, including suitable labels, this is already taken care of. Otherwise, a number of commands in the Analysis software can do this.

To add a label to the whole data file:

    LABELDATA "Bridal shower outbreak investigation"

To add (or change) variable labels:

    LABEL LEMONSORBE "lemon sorbet"

To add labels to the values:

    LABELVALUE LEMONSORBE /0="No" /1="Yes" /2="Unknown"

D. Create New Variables and Add Conditional Values

Sometimes case classification depends on a combined set of logical statements that together fulfill the case definition, e.g. certain symptoms such as diarrhea, but only if combined with fever. A whole set of commands is available for generating new variables or recoding into grouped variables, e.g.:

Create your age group variable as an integer:

    define agegrp #
    recode age to agegrp 0-9=1 10-19=2 20-44=3 45-64=4 65-hi=5

Classify cases based on symptoms:

    gen i case = 0
    if (vomit = 2) and (fever = 2) then case = 1
    if (vomit = 9) then case = 9
    labelvalue case /0="No" /1="Yes" /9="Uncertain"
    label case "Case status for this person"

E. Data analysis

Descriptive Epidemiology

Describing the cases in terms of person, place and time is good epidemiological practice and can help you develop hypotheses about the mode of transmission and the source of infection.

Start with person: to find out how many cases and non-cases we have, we can do a frequency distribution of the variable ILL. In the command prompt area (F4), write:

    FREQ ILL

You can see that there are 48 cases and 16 non-cases (where 0=non-case and 1=case).

As an alternative to writing commands (which some find inconvenient until they gain more experience with the software), you may use the dialog system in the toolbar. The dialogs let a user who does not know how to formulate commands carry out powerful analyses. Find this under [Image: dialog menu in the toolbar], e.g.
to DESCRIBE the content of a continuous numeric field such as AGE and get:

[Screenshot: DESCRIBE output for AGE]

For the variable AGE, the total number of records used in the calculation is given (N=62), as well as the sum, the mean age, the 95% confidence interval, the minimum value, the percentiles (5, 10, 25, 50=median, 75, 90 and 95) and the maximum value.

Demographic profile: the number of cases by age group and sex

Use the new variable you just created above (agegrp). You can either type the command or use the dialog system in the toolbar.

    tables sex agegrp /c

This will produce a cross tabulation of sex by age group along with column percents. Your table should look like this:

[Screenshot: cross tabulation of sex by agegrp with column percents]

You can see that the majority of the population is between the ages of 20 and 64 years and that more than 80% are women. We can add row percents as well by adding /r to the end of the command line.

[Figure: bar chart "Gender composition of guests"; x-axis: Age group (0-9 to 70-79); y-axis: Count; bars by SEX (F, M). EpiData Analysis Graph.]

Graph created as:

    bar agegrp /by = sex /sizex=700 /legend /ti="Gender composition of guests" /xtext="Age group"

Creating an Epidemic Curve

EpiData Analysis has a specific function to make epicurves. The graph's default title and look depend on the actual data. Here are two examples: on the left, the one from the current data, and on the right, one with a more balanced look.

[Figure: two example epidemic curves side by side]

By changing axes and selections, the user can modify the layout.
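An epidemic curve is, in essence, a count of cases per onset date, with days without cases kept on the axis. As a rough illustration of what the epicurve function tabulates, here is a minimal Python sketch. It is not EpiData code, and the onset dates are hypothetical (they are not the bridal shower data).

```python
from collections import Counter
from datetime import date, timedelta


def epicurve_counts(onset_dates):
    """Return (day, case count) pairs for every day from first to last onset.

    Expects a non-empty list of dates, one entry per case.
    """
    counts = Counter(onset_dates)
    day, last = min(counts), max(counts)
    curve = []
    while day <= last:
        curve.append((day, counts.get(day, 0)))  # days without cases appear as 0
        day += timedelta(days=1)
    return curve


# Hypothetical onset dates, one per case.
onsets = [date(2008, 2, 24)] + [date(2008, 2, 25)] * 5 + [date(2008, 2, 27)] * 2
for day, n in epicurve_counts(onsets):
    print(day.isoformat(), "#" * n, n)
```

Keeping the zero-count days matters for interpretation: gaps between onset dates are part of the shape that distinguishes a point source from ongoing transmission.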
Use [Image: graph menu item] to make a variety of graphs, including the epidemic curve.

Now, do the epidemic curve based on the graph dialog for epidemic curves, which will formulate the command as:

    Epicurve ill onsetdate

or, marked by a third variable, as:

    Epicurve ill onsetdate /by=sex

Let's change the title by writing in F4:

    Epicurve ill onsetdate /ti="Epicurve - Ill by Date of Onset"

[Figure: epidemic curve "Epicurve - Ill by Date of Onset"; x-axis: ONSETDATE (24/02/2008 to 26/02/2008); y-axis: Count (0 to 32 in steps of 1).]

Now, change the y-axis label:

    Epicurve ill onsetdate /ti="Epicurve - Ill by Date of Onset" /ytext="Number Ill"

[Figure: the same epidemic curve with the y-axis labelled "Number Ill" (0 to 32 in steps of 1).]

The y-axis numbers are tight. Change the increment to 2:

    Epicurve ill onsetdate /ti="Epicurve - Ill by Date of Onset" /ytext="Number Ill" /yinc=2

[Figure: the same epidemic curve with the y-axis "Number Ill" running from 0 to 32 in steps of 2.]

Now, the graph is much easier to interpret. The epicurve shows that the onset of symptoms predominantly occurred during a 48-hour period between February 25 and February 26, 2008, approximately 24 hours after the event took place. The epidemic curve indicates a point source outbreak with no observed secondary transmission of infection.

Analytical Epidemiology

Food Specific Attack Rate Table

At this point in our retrospective cohort study, we are hoping to identify risk factors that might indicate the cause and mode of transmission of the disease.
We created a questionnaire that asked both persons who were exposed and persons who were not exposed to different foods and beverages whether they became ill. What we are trying to do is determine the probability that a food item (for example, eating lemon sorbet) is linked to the outcome (becoming ill) by calculating attack rates for each food item. The food item that is the true source of infection will likely have three features:

1. The attack rate is high among persons who ate the food (high food-specific attack rate).
2. The attack rate is low among persons who did not eat the food (so the difference or ratio is high).
3. Most of the cases were exposed, so the exposure could "explain" most, if not all, of the cases.

Attack rate tables can be produced in EpiData Analysis with the 'tables' command:

    TAB OUTCOME EXP1 [EXP2...EXP#] /OA [other options]

where:

    OUTCOME is the variable in which you classify your cases (by default, the highest value in that variable denotes a case).
    /OA is the option to create an outbreak table.

In the command prompt window (F4), type the following command:

    tables ill lemonsorbe /ct /ar

where ct = compact tables and ar = attack rates. /oa is the short form for "/ct /ar".

The following output appears:

[Screenshot: attack rate table for ill by lemonsorbe]

The results indicate that of those who ate (were exposed to) the lemon sorbet (n=44), 42 became ill, for an attack rate (AR) of 95.5%. Among those who did not eat the sorbet, the attack rate was 24%. The relative risk (RR) is also significant, although the 95% confidence interval is quite wide. The RR implies a strong association between the consumption of lemon sorbet and becoming ill.

Doing this for all exposures might be easier using the dialog system:

[Screenshot: dialog for outbreak tables]

Your results will look like this (only partial table shown):

[Screenshot: partial attack rate table for all exposures]

You can also add a test of statistical significance to your output.
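The arithmetic behind the attack rate table can be checked by hand. The Python sketch below is not EpiData code. The exposed counts (42 ill of 44) are taken from the text. The unexposed counts (6 ill of 25) are hypothetical, chosen only so that the rate equals the 24% quoted above, because the source does not state them. The confidence interval uses the common log-based approximation, which is an assumption here and may differ from the method EpiData applies.

```python
import math


def attack_rate(ill, total):
    """Proportion of persons in a group who became ill."""
    return ill / total


def relative_risk(ill_exp, n_exp, ill_unexp, n_unexp, z=1.96):
    """Relative risk with an approximate 95% CI (log method; an assumption)."""
    rr = attack_rate(ill_exp, n_exp) / attack_rate(ill_unexp, n_unexp)
    se_log_rr = math.sqrt(1 / ill_exp - 1 / n_exp + 1 / ill_unexp - 1 / n_unexp)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Exposed group: counts from the text (42 ill among 44 who ate lemon sorbet).
ar_exposed = attack_rate(42, 44)
# Unexposed group: HYPOTHETICAL counts giving the quoted 24% attack rate.
ar_unexposed = attack_rate(6, 25)

rr, lower, upper = relative_risk(42, 44, 6, 25)
print(f"AR exposed   = {ar_exposed:.1%}")
print(f"AR unexposed = {ar_unexposed:.1%}")
print(f"RR = {rr:.2f} (approx. 95% CI {lower:.2f} to {upper:.2f})")
```

With a small unexposed group, the standard error of log(RR) is dominated by the unexposed cases, which is one reason a confidence interval can be wide even when the relative risk itself is large.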
To add a chi-square test, using ill/lemon sorbet as an example, type the following:

    tables ill lemonsorbe /ct /ar /T

where ct = compact tables, ar = attack rates and T = chi-square test.

The following output appears:

[Screenshot: attack rate table with chi-square test and warning message]

The warning message indicates that an expected value of less than 5 occurred in one or more cells. Therefore, the Fisher exact p-value needs to be used rather than the chi-square test. Change the command to the following:

    tables ill lemonsorbe /ct /ar /ex

where ct = compact tables, ar = attack rates and ex = exact test.

The RR and the 95% confidence interval show a significant association between the consumption of lemon sorbet and becoming ill. The p-value is also extremely small, showing statistical significance.

Stratified/subgroup Analysis using CIplot

Subgroup analysis involves examining the exposure-disease association within different categories of a third factor. It is an effective method for looking at the effects of two different exposures on disease. The food attack rate table above shows that more than one exposure (food item) had an elevated relative risk and a statistically significant p-value.

A plot of the proportion of cases in subgroups defined by other variables can be an effective way of finding high or low risk groups quite quickly. Define a search pattern for subgroups before you do any analysis, and be careful with statistical testing: a broad search-and-test strategy (multiple testing) changes the p-value threshold you should regard as significant.

Type the following command in the command prompt (F4) area:

    Ciplot ill healthunit sex

or use the dialog system as shown:

[Screenshot: CIplot dialog]

To include observations with missing data, just add the option /M to commands or tick the corresponding box when using the dialogs.
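The idea behind such a plot, a proportion of cases with a confidence interval for each subgroup, can be sketched in a few lines of Python. This is not EpiData code, and the interval method EpiData uses for ciplot is not stated in the source; the Wilson score interval below is an assumption chosen because it behaves reasonably for small groups. The records and the coding of sex ("F", "M") are hypothetical.

```python
import math


def wilson_interval(cases, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (an assumption)."""
    p = cases / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def proportion_by_group(records, outcome, group):
    """Proportion of cases (outcome == 1) with interval, per level of `group`."""
    totals = {}
    for rec in records:
        cases, n = totals.get(rec[group], (0, 0))
        totals[rec[group]] = (cases + (1 if rec[outcome] == 1 else 0), n + 1)
    result = {}
    for level, (cases, n) in sorted(totals.items()):
        low, high = wilson_interval(cases, n)
        result[level] = (cases / n, low, high)
    return result


# Hypothetical records: 8 of 10 women ill, 1 of 4 men ill.
records = (
    [{"sex": "F", "ill": 1}] * 8 + [{"sex": "F", "ill": 0}] * 2
    + [{"sex": "M", "ill": 1}] * 1 + [{"sex": "M", "ill": 0}] * 3
)
for level, (p, low, high) in proportion_by_group(records, "ill", "sex").items():
    print(f"sex={level}: {p:.0%} ill (approx. 95% CI {low:.0%} to {high:.0%})")
```

Note that this sketch simply uses every record it is given; handling of missing values, which the /M option controls in EpiData, would have to be decided before the records are passed in.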

Subgroup restricted analysis

To repeat an analysis within a subgroup, say males or females, any command can be executed for a partial set of the data. This must be done via the command prompt; it is not available in the dialogs. E.g.:

    Tab ill lemonsorbe /o /t if sex = 1
        looks at ill and lemonsorbe among those with value 1 in sex

    epicurve ill onsetdate if sex = 1
        an epicurve among those with value 1 in sex

    epicurve ill onsetdate /by=sex
        an epicurve with different colouring for males and females

Controlled/stratified analysis

To control for a third factor, or for several factors, one needs to add these to the analysis. Again, this is possible with commands as well as with dialogs.

For more extensive analysis, such as logistic regression, other software must be used. EpiData can save data as Stata files.

SPC analysis

EpiData Analysis also has a full package of Statistical Process Control commands; see the menu system for more information. This includes a G-chart, which shows "time since last occurrence" and can be relevant when looking for rare occurrences of infections.

Final comments

Features for saving commands and re-running them via the editor, and other aspects such as copying output, are shown in the introduction documents available through the help menu.

Further inspiration: see the http://apheo.ca website, and http://epidata.dk/documentation.php and its examples page.

Summary

The analysis software supports standard field analysis of data collected in an outbreak situation, with the features expected in current documentation-oriented data collection. With proper data definitions and the use of a few commands, one can very quickly get an overview of many exposures and associated risks.
The prime commands used in this context are:

    Task                                                   Command
    Create summary outbreak analysis risk tables:          tables ill exposures /oa
    Create tables and graphs for proportion of cases
    among other variables:                                 ciplot ill stratum variables
    Create epicurves:                                      epicurve ill onsetdate /by=sex
    Create stratified analysis controlling for other
    variables:                                             tables ill exposures strata /o /t
    Subgroup analysis with "on the fly" selections:        epicurve ill onsetdate /by=sex if age > 35

These are four very strong commands that give a quick overview of where to look for further risks or subgroup analysis, with output focused on risk factor identification. A very powerful analysis can be done in a short time. All of the above can also be done via the menus (apart from the subgroup example).