Preparing and Formatting your Research Data
May 15, 2015
Hannah Palac, MS
hannah.louks@northwestern.edu

Overview
• Importance
• Deciding What to Measure
• Deidentification
• Data Entry and Organization
• Keeping a Codebook
• Updating your Data
• Brief Intro to REDCap
• Preparing for Statistical Collaboration

Importance
• Garbage In, Garbage Out
• Ethics
• Scientific integrity
  - Ensures consistency
  - Instills confidence in funders, participants, readers, etc.
  - Lays groundwork for smooth data cleaning and statistical analysis

What do you want to know?
• What is your research question and primary hypothesis?
  - Be clear and specific when stating your question
  - Example:
    Non-specific: Is NSAID use associated with complications after myocardial infarction?
    Specific: Is NSAID use associated with bleeding and cardiovascular events in patients receiving antithrombotic therapy after myocardial infarction?

What do you want to know about each person/unit?
• Let the research question and hypothesis drive the variables you collect
• Consider confounding variables and effect modifiers
  - What variables will you need to assess for these possible effects?
• Categorize variables into meaningful groups (e.g. demographics, lab values, medical history, etc.)

Longitudinal Studies
• List the data collected at each timepoint in an event grid
• Example: [figure: event grid]

What type of data is best suited for each variable?
• Nominal/Categorical: Variables with 2+ categories with no intrinsic order
  - Race, Sex, Marital Status
• Ordinal: Variables with 2+ categories that can be ordered or ranked
  - Disease stage or severity, Education
• Continuous: Variables measured on a continuum or interval scale
  - Laboratory values, Age, Weight, Height

What type of data is best suited for each variable?
• String/Character: Words
  - Many statistical software packages read in string variables exactly “as is,” meaning that deviations in spacing, capitalization, and spelling will be read as separate outcomes
  - Example: “Male”, “male”, “1” and “M” mean the same thing, but the software reads them as four separate groups
• Numeric: Numbers
  - Raw values (e.g. Age, Labs, etc.)
  - Coded values (e.g. 0/1 coding for Female/Male)
• Avoid using symbols, such as $ or %, in your data
  - In general, formats are not recommended except for date variables, in which all data points should follow a consistent date format
• Do NOT mix string and numeric data types in the same variable

Deidentification and Security
• Use a unique subject ID number rather than identifying information, such as MRN or name
  - If you must link the ID to the participant, maintain a key separate from the database
  - Use the unique participant ID for all study-related documents
  - Avoid sending MRNs or other PHI to the statistician
  - Do not use the Excel row number as an identifier
  - Example Key: [figure: ID key]
• Do not transfer data or messages containing PHI to Gmail/Yahoo/etc. e-mail addresses
• Do not store data containing PHI on flash drives, personal devices, etc.

Data Entry and Organization
• In general, statistical software packages prefer numeric data
• Code data using numeric codes and use these codes during data collection and entry
  - Examples:
    0=Female; 1=Male
    0=No diabetes; 1=Diabetes
    0=Underweight; 1=Normal weight; 2=Overweight; 3=Obese
• Be consistent with your codes, such that similar variables use the same codes
• Use consistent codes for “other” values, N/A, patient refusal, patient does not know, or other missing data
  - Do not enter N/A or other text

Data Entry and Organization
• Always enter the raw and intact data field rather than entering the calculation or categorizing right away
• Do not use symbols in your data (e.g. >60 for normal eGFR)
  - Instead enter 60 or categorize values into meaningful groups (Stage 1, Stage 2, etc.)
• For variables where only one value is possible, create one variable only
  - Example: Disease stage or severity
• For variables where there can be multiple values at the same time, create separate variables for each item
  - Example: Medications, side effects, diagnoses

Working with “Others”
• For coded variables, develop a consistent scheme for coding “other” values that can be implemented across all variables
• Keep a separate text column for “others” next to the variable of interest
• Examples:
  - Variables with only one possible value: [figure]
  - Variables with multiple possible values: [figure]

Naming your Variables
• Names do not necessarily have to be descriptive of the variable, but it is nice if they are
• Start with a letter
• Keep it short
• Different variables should have different names
  - Example: If collecting data on side effects for multiple medications, use a systematic naming scheme such as “[medname]_nausea” rather than “nausea” for all medications.

Naming your Variables
• If variables have more than 1 component, such as BP, create multiple variables (SBP, DBP)
• If necessary, add a separate variable for comments
  - Do NOT include comments in the variable
• Use caution with calculations in Excel
• Do not use spaces or symbols
• Do not color code
• Do not include blank rows

Keeping a Codebook
• Keep a codebook separate from the database (such as in a separate tab in Excel) listing the variable label, variable name, units, possible values (codes), plausible ranges, and formulas
• Example: [figure: codebook]

Wide vs. Long Formats
• “Wide” or “Horizontal” data consists of one observation per participant, with all data entered in a single row
  Example:
    ID      sex  ldl_base  ldl_wk4  ldl_wk8  ldl_wk12
    100005  0    179       175      162      169
    100141  1    101       121      104      113
• “Long” or “Vertical” data consists of multiple observations per participant entered in separate rows
  Example:
    ID      sex  time      ldl
    100005  0    Baseline  179
    100005  0    Week 4    175
    100005  0    Week 8    162
    100005  0    Week 12   169
    100141  1    Baseline  101
    100141  1    Week 4    121
    100141  1    Week 8    104
    100141  1    Week 12   113

Updating your Data
• Invariably, there will be some data cleaning by the statistician (reformatting, converting from wide to long or long to wide)
• If you need to update your data, please do not add it to the original database
• Instead, send a new spreadsheet with additions or changes so the statistician can figure out the best way to merge or append this data into the clean dataset
  - Be sure to identify the new or updated data by study ID

What is REDCap?
• Research Electronic Data Capture
• Web-based application used for building and managing research databases quickly and securely
• Developed at Vanderbilt in 2004
• 1,255 consortium partners and rapidly growing

Why Researchers Use REDCap
• Can create data collection forms and surveys without the need for a programmer
  - Programmers are still needed to facilitate REDCap maintenance (CDSI)
• Can enter and access data from multicenter studies in one place
• All data stored in REDCap at Northwestern is compliant with HIPAA standards for security
• Audit trails for tracking history of database and entry
• Suitable for (almost) any study with web-based data entry capabilities
• Reporting capabilities
• Self-service tool means it’s FREE (to you) to use!
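
As a rough illustration of the wide-to-long conversion described under Wide vs. Long Formats, the following Python sketch reshapes the LDL example using only the standard library. The function name `wide_to_long`, the `TIMEPOINTS` mapping, and the choice of `ID` and `sex` as carried-over columns are assumptions made for illustration; in practice the statistician would usually perform this step in their own software.

```python
# Wide-format rows copied from the LDL example in this deck.
WIDE_ROWS = [
    {"ID": "100005", "sex": 0, "ldl_base": 179, "ldl_wk4": 175,
     "ldl_wk8": 162, "ldl_wk12": 169},
    {"ID": "100141", "sex": 1, "ldl_base": 101, "ldl_wk4": 121,
     "ldl_wk8": 104, "ldl_wk12": 113},
]

# Assumed mapping from each wide column to its timepoint label.
TIMEPOINTS = {
    "ldl_base": "Baseline",
    "ldl_wk4": "Week 4",
    "ldl_wk8": "Week 8",
    "ldl_wk12": "Week 12",
}


def wide_to_long(rows, timepoints, id_cols=("ID", "sex"), value_name="ldl"):
    """Return one row per participant per timepoint (hypothetical helper)."""
    long_rows = []
    for row in rows:
        for column, label in timepoints.items():
            # Carry the participant-level columns into every long row.
            new_row = {col: row[col] for col in id_cols}
            new_row["time"] = label
            new_row[value_name] = row[column]
            long_rows.append(new_row)
    return long_rows


long_rows = wide_to_long(WIDE_ROWS, TIMEPOINTS)
for r in long_rows:
    print(r["ID"], r["sex"], r["time"], r["ldl"])
```

Each participant contributes four rows here, one per timepoint, which matches the long table shown in the example.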
Sample Data Entry Form
[figure: sample REDCap data entry form]

Limitations/Considerations
• Some flexibility limitations in form design
• Data entry in online form is only possible with VPN connection
• Once a project is in “production mode,” any changes must be submitted to a REDCap administrator for approval
• No offline version of REDCap; must always be connected to internet

Resources
• http://project-redcap.org/
  - Video Tutorials
  - REDCap Shared Library: a repository for data collection instruments and forms that can be downloaded and used (for free) by consortium partners
• For information on REDCap at NU, contact: redcap@nubic.northwestern.edu

Preparing for Statistical Collaboration
• State your primary and secondary objectives and hypotheses
• Decide a priori if you will be excluding cases or performing a subgroup analysis
  - Adjustment will be required to ensure “sub-results” are not simply due to chance
• Avoid deciding post-hoc to repeat the analysis on a subset of people
• Submit your spreadsheet with unique variable names in the first row
• Submit your codebook containing variable labels and codes

BCC Resources for Maximizing Statistical Interactions
• Guidelines to help you prepare for Statistical Collaboration: http://www.feinberg.northwestern.edu/sites/bcc/docs/StatsCollaborationGuideSummary.pdf
• Preliminary Help (Grants and Power): http://www.feinberg.northwestern.edu/sites/bcc/docs/PowerGuide.pdf
• Database Issues: http://www.feinberg.northwestern.edu/sites/bcc/docs/DataGuide.pdf
• Analysis and Write-up: http://www.feinberg.northwestern.edu/sites/bcc/docs/ProjectGuide.pdf

Submitting a Request for Statistical Support
• Submit BCC Appointment Request Form: http://www.feinberg.northwestern.edu/sites/bcc/contactus/request-form.html

Questions?
https://redcap.nubic.northwestern.edu
Biostatistics Collaboration Center
http://www.feinberg.northwestern.edu/sites/bcc/
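
Before submitting a spreadsheet, a few of the rules from the naming and data entry slides can be checked automatically. The sketch below is one possible way to do that in Python with the standard library. The helper names (`check_variable_names`, `find_mixed_columns`), the exact naming rule encoded in `VALID_NAME`, and the sample header and rows are all assumptions for illustration, not part of any BCC or REDCap tool.

```python
import re

# Assumed rule: start with a letter, then letters, digits, or underscores only
# (so no spaces and no symbols).
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_variable_names(names):
    """Return a list of problems found in the first-row variable names."""
    problems = []
    seen = set()
    for name in names:
        if not VALID_NAME.fullmatch(name):
            problems.append(f"invalid name: {name!r}")
        if name in seen:
            problems.append(f"duplicate name: {name!r}")
        seen.add(name)
    return problems


def is_number(value):
    """True if the cell text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def find_mixed_columns(header, rows):
    """Return columns whose non-empty cells mix numeric and text values."""
    mixed = []
    for index, name in enumerate(header):
        cells = [row[index] for row in rows if row[index] != ""]
        kinds = {is_number(cell) for cell in cells}
        if len(kinds) > 1:
            mixed.append(name)
    return mixed


# Hypothetical spreadsheet contents, with cells as text (as a CSV reader yields them).
header = ["ID", "sex", "ldl base", "egfr", "egfr"]
rows = [
    ["100005", "0", "179", ">60", "61"],
    ["100141", "Male", "101", "45", "45"],
]

name_problems = check_variable_names(header)
mixed_columns = find_mixed_columns(header, rows)
print(name_problems)
print(mixed_columns)
```

In this made-up example the checks flag the name containing a space, the repeated `egfr` name, and the two columns where text such as `Male` or `>60` is mixed with numbers.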