QA and Transformations

Data Organization
Quality Assurance and Transformations
[Figure: Data Life Cycle. Labels: Data Discovery; Proposal Planning and Writing; Project Start Up; Data Collection; Data Analysis; End of Project; Deposit; Data Archive; Data Sharing; Re-Use; Re-Purpose]
Data Validation
• Check for missing, impossible, or anomalous values
– Plotting
– Mapping
• Examine summary statistics
• Verify data transfers from notebooks to digital files
• Verify data conversion from one file format to another
Hook, et al. 2010. Best Practices for Preparing Environmental Data Sets to Share
and Archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf.
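The validation checks listed above can be sketched in a few lines. This is a minimal illustration in Python (the slides name R; the same steps translate directly). The sample readings, the `validate` helper, and the plausible range of -2 to 40 degrees Celsius are all assumptions made for the example, not values from the source.

```python
import math
import statistics

# Hypothetical sample: water temperature readings in degrees Celsius.
# None marks a missing value; the plausible range is an assumption.
readings = [12.1, 12.4, None, 11.9, 99.0, 12.3, -50.0, 12.0]
PLAUSIBLE_RANGE = (-2.0, 40.0)


def validate(values, low, high):
    """Return row indices of missing and out-of-range values, plus the valid values."""
    missing, impossible, valid = [], [], []
    for i, v in enumerate(values):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            missing.append(i)
        elif not (low <= v <= high):
            impossible.append(i)
        else:
            valid.append(v)
    return missing, impossible, valid


missing, impossible, valid = validate(readings, *PLAUSIBLE_RANGE)

# Summary statistics on the values that passed the checks.
summary = {
    "n": len(valid),
    "min": min(valid),
    "max": max(valid),
    "mean": statistics.mean(valid),
    "stdev": statistics.stdev(valid),
}

print("missing at rows:", missing)
print("out of range at rows:", impossible)
print("summary:", summary)
```

Flagged rows are reported rather than silently dropped, so each one can be traced back to the notebook or source file before deciding how to handle it.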
Preserve & Record Information
• Keep Original (Raw) File
– Do not include transformations, interpolations, etc.
– Make the raw data “read-only”
• Processing Script (R)
• Save as a new file
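One way to follow this pattern is sketched below in Python (the slide uses R for the processing script; the idea is the same). The file names, the sample rows, and the "drop rows with a missing temperature" step are hypothetical and stand in for whatever processing a real project needs.

```python
import csv
import os
import stat
import tempfile

# Hypothetical file names, created in a temporary directory for illustration.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "temperature_raw.csv")
clean_path = os.path.join(workdir, "temperature_clean.csv")

# Stand-in for the raw file as it came from the instrument or notebook.
with open(raw_path, "w", newline="") as f:
    csv.writer(f).writerows(
        [["site", "temp_c"], ["A", "12.1"], ["B", ""], ["C", "11.9"]]
    )

# Make the raw file read-only so later steps cannot overwrite it by accident.
os.chmod(raw_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

# Read the raw data, drop rows with a missing temperature, write a NEW file.
with open(raw_path, newline="") as src, open(clean_path, "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["temp_c"] != "":
            writer.writerow(row)

# Confirm that the owner write bit is cleared on the raw file.
raw_is_writable = bool(os.stat(raw_path).st_mode & stat.S_IWUSR)
print("raw file writable:", raw_is_writable)
```

The raw file is only ever opened for reading, and every derived result goes to a separate file, so the processing can be repeated from the untouched original at any time.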
Data Manipulation
• You will need to repeat reduction and analysis procedures many times
– You need to have a workflow that recognizes this
– Scripting languages can help capture the workflow
– You could just document all steps by hand
– After the 20th iteration through your data set, however, you may feel more fondly towards scripting languages
• Learn the analytical tools of your field
– Talk to colleagues, etc. and choose at least one tool to master
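As a small illustration of why a scripted reduction pays off on repeat runs, the sketch below wraps one reduction step in a function and applies it to two deliveries of the same data set. It is written in Python for illustration; the `reduce_readings` function, the dates, and the values are hypothetical.

```python
import statistics


def reduce_readings(rows):
    """Reduce (day, value) rows to one mean value per day."""
    by_day = {}
    for day, value in rows:
        by_day.setdefault(day, []).append(value)
    return {day: statistics.mean(values) for day, values in sorted(by_day.items())}


# Hypothetical first delivery of the data.
version_1 = [("2010-06-01", 12.0), ("2010-06-01", 13.0), ("2010-06-02", 11.0)]
# A later delivery with one extra reading added.
version_2 = version_1 + [("2010-06-02", 12.0)]

# The same reduction is re-run unchanged on each delivery.
for label, rows in [("v1", version_1), ("v2", version_2)]:
    print(label, reduce_readings(rows))
```

When the data change, only the input changes; the reduction itself is re-executed exactly as before, with no hand-copied steps to get out of sync.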
Preserve Processing Information
• Scripts used in file cleaning
• Programs / algorithms
• Document workflows or data file transformations
[Figure: workflow diagram. Labels: Temperature data (T); Salinity data (S); Data import into R; Data in R format; Quality control & data cleaning; “Clean” T & S data; Summary statistics; Analysis; Graph Production]
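A workflow of this kind (import temperature and salinity data, apply quality control, then compute summary statistics) can document itself as it runs. The sketch below is in Python rather than the R named on the slide, and everything specific in it is an assumption for illustration: the inline CSV text, the function names, and the acceptance ranges.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for temperature (T) and salinity (S) files.
TEMPERATURE_CSV = "station,value\n1,12.1\n2,\n3,11.9\n"
SALINITY_CSV = "station,value\n1,35.0\n2,34.8\n3,-1.0\n"

processing_log = []  # plain-text record of every step applied


def import_data(text, name):
    """Parse CSV text into a list of row dictionaries and log the step."""
    rows = list(csv.DictReader(io.StringIO(text)))
    processing_log.append(f"imported {len(rows)} rows of {name}")
    return rows


def clean(rows, name, low, high):
    """Keep values that are present and within [low, high]; log what was kept."""
    kept = [
        float(r["value"])
        for r in rows
        if r["value"] != "" and low <= float(r["value"]) <= high
    ]
    processing_log.append(
        f"cleaned {name}: kept {len(kept)} of {len(rows)} rows, range [{low}, {high}]"
    )
    return kept


def summarize(values, name):
    """Compute the mean of the cleaned values and log it."""
    mean = statistics.mean(values)
    processing_log.append(f"summarized {name}: mean={mean:.2f}")
    return mean


t_clean = clean(import_data(TEMPERATURE_CSV, "T"), "T", -2.0, 40.0)
s_clean = clean(import_data(SALINITY_CSV, "S"), "S", 0.0, 42.0)
t_mean = summarize(t_clean, "T")
s_mean = summarize(s_clean, "S")

print("\n".join(processing_log))
```

Saving the printed log next to the output files gives a record of which transformations produced them, alongside the script itself.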
Preserving: Scripted Notes
• Use a scripted language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts record the processing
– Steps are recorded in textual format
– Can be easily revised and re-executed
– Easy to document
• GUI-based analysis may be easier, but it is harder to reproduce
Reproducibility Methods
• Do use version control
• Do document the software environment
• Only save what cannot be reconstructed from original data + code
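Documenting the software environment can itself be scripted. The following is a minimal sketch in Python using only the standard library; the `describe_environment` helper is hypothetical, and "pip" appears only as an example package name. An R workflow would typically record the equivalent information with `sessionInfo()`.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=()):
    """Collect interpreter, OS, and package versions as a plain dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


# "pip" is only an example; list the packages your analysis actually imports.
environment = describe_environment(["pip"])
print(json.dumps(environment, indent=2))
```

Writing this output to a small text file and committing it with the processing scripts keeps the environment record under the same version control as the code.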