Managing and Curating Data

advertisement
Managing and Curating Data
Chapter 8
Introduction
•
•
•
•
Data organization
Data management
Data curation
Raw data is required to repeat a scientific
study
• Any data supported by public funds is
legally required to be available for other
scientists and the public
Step 1: Managing Raw Data
• Various sources of data
– Data loggers
– Handwritten notes
• This data must be transferred to an
organized format, checked and analyzed
Spreadsheets
• Row: single observation
• Column: single measured
or observed variable
• Enter data ASAP!
–
–
–
–
Detect mistakes
Memory (doesn’t last long)
2 copies
Timely analysis
• Proofread the data
• Check it
2006 Garden Yield
Number Biomass
Carrots
10
30.2
Peppers 30
20.6
Broccoli
450.1
10
Metadata: Data about data
• “Must have” metadata:
–
–
–
–
–
Name and contact info of collector
Location of data collection
Name of study
Source of funding
Description of the organization of the data file
•
•
•
•
Methods used to collect
Types of experimental units
Description of abbreviations
Explicit description of data in columns and rows
• May be created before in some cases
• Very important to assemble because it’s easily
forgotten
Step 3: Checking the Data
• Outliers: values of measurements or
observations that are outside the range of
the bulk of the data
• Values beyond the upper or lower deciles
(the 90% or the 10%)
• Outliers increase the variance in data and
increase the chance of a Type II error
How to deal with outliers
• Do not delete them; this could be
considered fraud
• Only delete if an error or the data no
longer are valid
• Think about them
– Interesting hypotheses
– A large body of science is devoted to outliers
– What type of distribution does your data
have?
Errors and Missing Data
• Errors are often outliers and can be
identified
• Sources: Mistyping (decimal points),
instrument, field entry
• Checking data can reduce errors
• Never leave blank cells in spreadsheets;
enter a zero or NA (not available)
Detecting Outliers and Errors
• Three techniques
– Calculating column statistics
– Checking ranges and precision of column
values
– Graphical exploratory data analysis
Detecting Outliers
and Errors cont.
• Column stats:
– Mean, median, standard
deviation, variance
• Logical functions to check your
columns
• Range checking your data
Carrot Id
#
length
Biomass
1
12
8
2
24
16
3
261
18
4
10
5
Mean
76.75
11.75
Median
18
12
St Dev
122.9
6.24
Variance
15126.3
38.9
Min
10
5
Max
261
18
Graphical Exploratory Data
Analysis
• Box plots (univariate)
• Stem-and-leaf plots (univariate)
• Scatterplots (bivariate or multivariate)
700
600
3
500
400
TOTALPSO
300
200
100
0
N=
PASTURE
31
29
1
2
Stem-and-leaf plots
• Example: Vegetable biomass: 7,15,
35,36,37,23,27,21,42,55
0 7
1 5
2 1,3,7
3 5,6,7
4 2
55
Scatter plots
• Use to see how traits
relate to one another
biomass
20
18
16
14
12
10
8
6
4
2
0
0
10
20
30
40
50
60
Creating an Audit Trail
• Examining data for outliers and errors is a
QA/QC for research
• Document how you perform QA/QC in
your metadata
• Your audit trail allows others to reanalyze
and recreate your results
• May be required for legal documentation
Download