Northwestern Medicine - Feinberg School of Medicine

Preparing and Formatting your
Research Data
May 15, 2015
Hannah Palac, MS
hannah.louks@northwestern.edu
Overview
• Importance
• Deciding What to Measure
• Deidentification
• Data Entry and Organization
• Keeping a Codebook
• Updating your Data
• Brief Intro to REDCap
• Preparing for Statistical Collaboration
Importance
• Garbage In → Garbage Out
• Ethics
• Scientific integrity
 ▪ Ensures consistency
 ▪ Instills confidence in funders, participants, readers, etc.
 ▪ Lays groundwork for smooth data cleaning and
statistical analysis
What do you want to know?
• What is your research question and primary hypothesis?
 ▪ Be clear and discrete when stating your question
Example:
Non-specific: Is NSAID use associated with
complications after myocardial infarction?
Specific: Is NSAID use associated with bleeding and
cardiovascular events in patients receiving
antithrombotic therapy after myocardial infarction?
What do you want to know about each
person/unit?
• Let the research question and hypothesis drive
the variables you collect
• Consider confounding variables and effect
modifiers
 ▪ What variables will you need to assess for
these possible effects?
• Categorize variables into meaningful groups (e.g.
demographics, lab values, medical history, etc.)
Longitudinal Studies
• List out what data is collected at each timepoint
in an event grid
• Example:
What type of data is best suited for each
variable?
• Nominal/Categorical: Variables with 2+ categories with
no intrinsic order
 ▪ Race, Sex, Marital Status
• Ordinal: Variables with 2+ categories that can be ordered
or ranked
 ▪ Disease stage or severity, Education
• Continuous: Variables measured on a continuum or
interval scale
 ▪ Laboratory values, Age, Weight, Height
What type of data is best suited for each
variable?
• String/Character: Words
 ▪ Many statistical software packages read in string variables exactly “as is,”
meaning that deviations in spacing, capitalization, and spelling will be
read as separate outcomes
• Example: “Male”, “male”, “1” and “M” mean the same thing, but
the software reads them as four separate groups
• Numeric: Numbers
 ▪ Raw values (e.g. Age, Labs, etc.)
 ▪ Coded values (e.g. 0/1 coding for Female/Male)
• Avoid using symbols, such as $ or %, in your data
 ▪ In general, formats are not recommended except for date variables, in
which all data points should follow a consistent date format
• Do NOT mix string and numeric data types in the same variable
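The string pitfall described above can be checked before analysis. The following is a minimal Python sketch using only the standard library; the sample entries and the `SEX_CODES` mapping are illustrative assumptions, with 0/1 following the Female/Male coding mentioned above.

```python
from collections import Counter

# Hypothetical raw entries for one variable, typed inconsistently.
raw_sex = ["Male", "male", "1", "M", "Female", " female", "F", "0"]

# Software that reads strings "as is" treats every spelling as its own group.
print(Counter(raw_sex))

# Assumed mapping from cleaned spellings to numeric codes (0=Female, 1=Male).
SEX_CODES = {"female": 0, "f": 0, "0": 0, "male": 1, "m": 1, "1": 1}

def normalize_sex(value):
    """Return the numeric code for a raw entry, or None if unrecognized."""
    return SEX_CODES.get(value.strip().lower())

coded_sex = [normalize_sex(v) for v in raw_sex]
print(Counter(coded_sex))
```

Unrecognized entries come back as `None` so they can be reviewed by hand rather than silently guessed.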
Deidentification and Security
• Use a unique subject ID number rather than identifying information,
such as MRN or name
 ▪ If you must link the ID to the participant, maintain a key separate
from the database
 ▪ Use the unique participant ID for all study related documents
 ▪ Avoid sending MRNs or other PHI to the statistician
 ▪ Do not use the Excel row number as an identifier
 ▪ Example Key:
• Do not transfer data or messages containing PHI to Gmail/Yahoo/etc.
e-mail addresses
• Do not store data containing PHI on flash drives, personal devices, etc.
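One way to keep the linking key apart from the analysis data is sketched below in Python. The records, the field names (`mrn`, `name`, `ldl`), and the sequential ID scheme starting at 100001 are all made-up assumptions for illustration, not a prescribed procedure.

```python
import csv
import io

# Hypothetical source records containing identifiers (made-up values).
records = [
    {"mrn": "A001", "name": "Pat One", "ldl": 179},
    {"mrn": "A002", "name": "Pat Two", "ldl": 101},
]

def deidentify(rows, id_fields=("mrn", "name"), start=100001):
    """Split rows into a linking key and a de-identified analysis dataset."""
    key, data = [], []
    for offset, row in enumerate(rows):
        study_id = start + offset  # sequential study ID (assumed scheme)
        key.append({"study_id": study_id, **{f: row[f] for f in id_fields}})
        clean = {k: v for k, v in row.items() if k not in id_fields}
        data.append({"study_id": study_id, **clean})
    return key, data

key_rows, data_rows = deidentify(records)

# Only the de-identified rows go into the file shared for analysis;
# the key would be stored separately and securely.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["study_id", "ldl"])
writer.writeheader()
writer.writerows(data_rows)
print(out.getvalue())
```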
Data Entry and Organization
• In general, statistical software packages prefer numeric data
• Code data using numeric codes and use these codes during data
collection and entry
 ▪ Examples:
• 0=Female; 1=Male
• 0=No diabetes; 1=Diabetes
• 0=Underweight; 1=Normal weight; 2=Overweight; 3=Obese
• Be consistent with your codes, such that similar variables use the
same codes
• Use consistent codes for “other” values, N/A, patient refusal,
patient does not know, or other missing data
 ▪ Do not enter N/A or other text
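The coding rules above could be applied at data entry roughly as follows. This is a sketch only: the diabetes codes follow the example above, while the reserved codes in `MISSING_CODES` (-97, -98, -99) are hypothetical values chosen for illustration.

```python
# Assumed codebook entries; the diabetes codes follow the examples above.
DIABETES_CODES = {"no": 0, "yes": 1}

# Hypothetical reserved codes for non-answers, reused for every variable.
MISSING_CODES = {"n/a": -97, "refused": -98, "unknown": -99}

def encode(value, codes):
    """Translate a text response into its numeric code."""
    cleaned = value.strip().lower()
    if cleaned in codes:
        return codes[cleaned]
    if cleaned in MISSING_CODES:
        return MISSING_CODES[cleaned]
    raise ValueError(f"No code defined for {value!r}")

responses = ["Yes", "no", "N/A", "Refused", "unknown"]
coded = [encode(r, DIABETES_CODES) for r in responses]
print(coded)
```

Raising an error for an undefined response forces a decision about how to code it, instead of letting stray text into the dataset.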
Data Entry and Organization
• Always enter the raw and intact data field rather than entering
the calculation or categorizing right away
• Do not use symbols in your data (e.g. >60 for normal eGFR)
 ▪ Instead enter 60 or categorize values into meaningful groups
(Stage 1, Stage 2, etc.)
• For variables where only one value is possible, create one
variable only
 ▪ Example: Disease stage or severity
• For variables where there can be multiple values at the same
time, create separate variables for each item
 ▪ Example: Medications, side effects, diagnoses
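For variables that can take several values at once, the "separate variable per item" rule might look like this in Python. The medication list, the `med_` prefix, and the semicolon-separated raw entries are assumptions made for the example.

```python
# Hypothetical free-text entries listing several medications per participant.
raw = {100005: "aspirin; metformin", 100141: "metformin", 100200: ""}

# Assumed list of medications tracked in the study.
MEDICATIONS = ["aspirin", "metformin", "warfarin"]

def to_indicators(entry):
    """Return one 0/1 variable per medication for a single participant."""
    taken = {item.strip().lower() for item in entry.split(";") if item.strip()}
    return {f"med_{name}": int(name in taken) for name in MEDICATIONS}

rows = [{"id": pid, **to_indicators(entry)} for pid, entry in raw.items()]
for row in rows:
    print(row)
```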
Working with “Others”
• For coded variables, develop a consistent scheme for coding
“other” values that can be implemented across all variables
• Keep a separate text column for “others” next to the variable of
interest
• Examples:
 ▪ Variables with only one possible value:
 ▪ Variables with multiple possible values:
Naming your Variables
• Names do not necessarily have to be
descriptive of the variable, but it is
nice if they are
• Start with a letter
• Keep it short
• Different variables should have different names
 ▪ Example: If collecting data on side effects for multiple
medications, use a systematic naming scheme such as
“[medname]_nausea” rather than “nausea” for all
medications.
Naming your Variables
• If variables have more than 1 component, such as BP,
create multiple variables (SBP, DBP)
• If necessary, add a separate variable for comments
 ▪ Do NOT include comments in the variable
• Use caution with calculations in Excel
• Do not use spaces or symbols
• Do not color code
• Do not include blank rows
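The naming rules above can be checked automatically on a header row. In this Python sketch the regular expression encodes "start with a letter, no spaces or symbols"; allowing underscores and capping names at 32 characters are assumptions standing in for "keep it short", and the header itself is made up.

```python
import re

# Assumed rule: a letter first, then letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
MAX_LENGTH = 32  # arbitrary stand-in for "keep it short"

def check_names(names):
    """Return a list of (name, problem) pairs for a header row."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append((name, "must start with a letter; no spaces or symbols"))
        if len(name) > MAX_LENGTH:
            problems.append((name, "too long"))
        if name.lower() in seen:
            problems.append((name, "duplicate name"))
        seen.add(name.lower())
    return problems

# Hypothetical header row with three problems.
header = ["id", "sbp", "dbp", "drugA_nausea", "drugB_nausea",
          "BP (mmHg)", "2nd_visit", "sbp"]
for name, problem in check_names(header):
    print(f"{name}: {problem}")
```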
Keeping a Codebook
• Keep a codebook separate from the database (such as in a
separate tab in Excel) listing each variable's label, name, units,
possible values (codes), plausible range, and formulas
 ▪ Example:
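A codebook kept in a structured form can also be used to screen the data. The Python sketch below is illustrative only: the `CODEBOOK` layout and the plausible range for LDL are assumptions, not values from the original slides.

```python
# Hypothetical codebook: one entry per variable with its label, units,
# and either allowed codes or a plausible range (illustrative values).
CODEBOOK = {
    "sex": {"label": "Sex", "codes": {0: "Female", 1: "Male"}},
    "ldl_base": {"label": "Baseline LDL", "units": "mg/dL", "range": (20, 400)},
}

def validate(row):
    """Return messages for values that conflict with the codebook."""
    issues = []
    for name, spec in CODEBOOK.items():
        value = row.get(name)
        if value is None:
            issues.append(f"{name}: missing")
        elif "codes" in spec and value not in spec["codes"]:
            issues.append(f"{name}: {value!r} is not a defined code")
        elif "range" in spec and not spec["range"][0] <= value <= spec["range"][1]:
            issues.append(f"{name}: {value!r} outside plausible range")
    return issues

print(validate({"id": 100005, "sex": 0, "ldl_base": 179}))
print(validate({"id": 100141, "sex": "M", "ldl_base": 1010}))
```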
Wide vs. Long Formats
• “Wide” or “Horizontal” data consists of one observation per
participant and all data is entered in a single row
 ▪ Example:
ID      sex  ldl_base  ldl_wk4  ldl_wk8  ldl_wk12
100005  0    179       175      162      169
100141  1    101       121      104      113
• “Long” or “Vertical” data consists of multiple
observations per participant entered in separate rows
 ▪ Example:
ID      sex  time      ldl
100005  0    Baseline  179
100005  0    Week 4    175
100005  0    Week 8    162
100005  0    Week 12   169
100141  1    Baseline  101
100141  1    Week 4    121
100141  1    Week 8    104
100141  1    Week 12   113
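Converting between the two layouts is mechanical. As a rough sketch, the Python below reshapes the two participants from the wide example into the long layout using only the standard library; the `TIMEPOINTS` mapping is inferred from the variable names in the example.

```python
# The two participants from the wide example above.
wide = [
    {"ID": 100005, "sex": 0, "ldl_base": 179, "ldl_wk4": 175,
     "ldl_wk8": 162, "ldl_wk12": 169},
    {"ID": 100141, "sex": 1, "ldl_base": 101, "ldl_wk4": 121,
     "ldl_wk8": 104, "ldl_wk12": 113},
]

# Column-to-timepoint mapping inferred from the variable names.
TIMEPOINTS = {
    "ldl_base": "Baseline",
    "ldl_wk4": "Week 4",
    "ldl_wk8": "Week 8",
    "ldl_wk12": "Week 12",
}

def wide_to_long(rows):
    """Emit one row per participant per timepoint."""
    long_rows = []
    for row in rows:
        for column, time in TIMEPOINTS.items():
            long_rows.append(
                {"ID": row["ID"], "sex": row["sex"], "time": time, "ldl": row[column]}
            )
    return long_rows

long_data = wide_to_long(wide)
for row in long_data:
    print(row)
```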
Updating your Data
• Invariably, there will be some data cleaning by the statistician
(reformatting, converting from wide to long or long to wide)
• If you need to update your data, please do not add it to the
original database
• Instead, send a new spreadsheet with additions or changes so
the statistician can figure out the best way to merge or append
this data into the clean dataset
 ▪ Be sure to identify the new or updated data by study ID
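To illustrate why updates keyed by study ID are easy to fold in, here is a small Python sketch of a merge-or-append step. The datasets and values are invented, and an actual statistician's workflow may differ.

```python
# Cleaned dataset keyed by study ID (made-up values).
clean = {
    100005: {"sex": 0, "ldl_base": 179},
    100141: {"sex": 1, "ldl_base": 101},
}

# Hypothetical update file: one corrected value and one new participant.
updates = [
    {"ID": 100141, "ldl_base": 110},
    {"ID": 100200, "sex": 0, "ldl_base": 142},
]

def apply_updates(dataset, changes):
    """Return a merged copy plus a log of what was updated or appended."""
    merged = {pid: dict(values) for pid, values in dataset.items()}
    log = []
    for change in changes:
        pid = change["ID"]
        fields = {k: v for k, v in change.items() if k != "ID"}
        action = "updated" if pid in merged else "appended"
        merged.setdefault(pid, {}).update(fields)
        log.append((pid, action, sorted(fields)))
    return merged, log

merged, log = apply_updates(clean, updates)
print(log)
```

The original `clean` dataset is left untouched, and the log records which study IDs were changed.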
What is REDCap?
• Research Electronic Data Capture
• Web-based application used for building and managing research
databases quickly and securely
• Developed at Vanderbilt in 2004
• 1,255 consortium partners and rapidly growing
Why Researchers Use REDCap
• Can create data collection forms and surveys without the need
for a programmer
 ▪ Programmers are still needed to facilitate REDCap
maintenance (CDSI)
• Can enter and access data from multicenter studies in one place
• All data stored in REDCap at Northwestern is compliant with
HIPAA standards for security
• Audit trails for tracking history of database and entry
• Suitable for (almost) any study with web-based data entry
capabilities
• Reporting capabilities
• Self-service tool means it’s FREE (to you) to use!
Sample Data Entry Form
Limitations/Considerations
• Some flexibility limitations in form design
• Data entry in online form is only possible with
VPN connection
• Once a project is in “production mode,” any
changes must be submitted to a REDCap
administrator for approval
• No offline version of REDCap; must always be
connected to internet
Resources
• http://project-redcap.org/
 ▪ Video Tutorials
 ▪ REDCap Shared Library – A repository for data
collection instruments and forms that can be
downloaded and used (for free) by consortium
partners
• For information on REDCap at NU, contact:
redcap@nubic.northwestern.edu
Preparing for Statistical Collaboration
• State your primary and secondary objectives and hypotheses
• Decide a priori if you will be excluding cases or performing a
subgroup analysis
 ▪ Adjustment will be required to ensure “sub-results” are not
simply due to chance
• Avoid deciding post-hoc to repeat the analysis on a subset of
people
• Submit your spreadsheet with unique variable names in the first
row
• Submit your codebook containing variable labels and codes
BCC Resources for Maximizing
Statistical Interactions
• Guidelines to help you prepare for Statistical Collaboration:
http://www.feinberg.northwestern.edu/sites/bcc/docs/StatsCollaborationGuideSummary.pdf
• Preliminary Help (Grants and Power):
http://www.feinberg.northwestern.edu/sites/bcc/docs/PowerGuide.pdf
• Database Issues:
http://www.feinberg.northwestern.edu/sites/bcc/docs/DataGuide.pdf
• Analysis and Write-up:
http://www.feinberg.northwestern.edu/sites/bcc/docs/ProjectGuide.pdf
Submitting a Request for Statistical Support
• Submit BCC Appointment Request Form:
http://www.feinberg.northwestern.edu/sites/bcc/contactus/request-form.html
Questions?
https://redcap.nubic.northwestern.edu
Biostatistics Collaboration Center
http://www.feinberg.northwestern.edu/sites/bcc/