Managing and Curating Data
Chapter 8

Introduction
• Data organization
• Data management
• Data curation
• Raw data is required to repeat a scientific study
• Any data supported by public funds is legally required to be available for other scientists and the public

Step 1: Managing Raw Data
• Various sources of data
  – Data loggers
  – Handwritten notes
• This data must be transferred to an organized format, checked, and analyzed

Spreadsheets
• Row: single observation
• Column: single measured or observed variable
• Enter data ASAP!
  – Detect mistakes
  – Memory (doesn’t last long)
  – 2 copies
  – Timely analysis
• Proofread the data
• Check it

2006 Garden Yield
            Number    Biomass
Carrots     10        30.2
Peppers     30        20.6
Broccoli    450.1     10

Metadata: Data about data
• “Must have” metadata:
  – Name and contact info of collector
  – Location of data collection
  – Name of study
  – Source of funding
  – Description of the organization of the data file
    • Methods used to collect
    • Types of experimental units
    • Description of abbreviations
    • Explicit description of data in columns and rows
• May be created before in some cases
• Very important to assemble because it’s easily forgotten

Step 3: Checking the Data
• Outliers: values of measurements or observations that are outside the range of the bulk of the data
• Values beyond the upper or lower deciles (above the 90th or below the 10th percentile)
• Outliers increase the variance in data and increase the chance of a Type II error

How to deal with outliers
• Do not delete them; this could be considered fraud
• Only delete a value if it is an error or the data are no longer valid
• Think about them
  – Interesting hypotheses
  – A large body of science is devoted to outliers
  – What type of distribution does your data have?
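The decile rule for outliers can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names `decile_cutoffs` and `flag_by_deciles` are hypothetical, the data are the vegetable biomass values from the stem-and-leaf example in this chapter, and the cut points depend on the interpolation method (`statistics.quantiles` uses its "exclusive" method by default).

```python
import statistics

# Vegetable biomass values from the stem-and-leaf example in this chapter.
biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]

def decile_cutoffs(values):
    """Return the (10th percentile, 90th percentile) cut points."""
    cuts = statistics.quantiles(values, n=10)  # nine cut points between deciles
    return cuts[0], cuts[-1]

def flag_by_deciles(values):
    """Return values below the lower or above the upper decile, in input order."""
    low, high = decile_cutoffs(values)
    return [v for v in values if v < low or v > high]

low, high = decile_cutoffs(biomass)
print(f"lower decile: {low}, upper decile: {high}")
print("flagged for review:", flag_by_deciles(biomass))
```

For these ten values the rule flags 7 and 55. A decile rule flags roughly the top and bottom tenth of any data set, so flagged values are candidates to think about and check against the field notes, not values to delete.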
Errors and Missing Data
• Errors are often outliers and can be identified
• Sources: mistyping (decimal points), instrument, field entry
• Checking data can reduce errors
• Never leave blank cells in spreadsheets; enter a zero or NA (not available)

Detecting Outliers and Errors
• Three techniques
  – Calculating column statistics
  – Checking ranges and precision of column values
  – Graphical exploratory data analysis

Detecting Outliers and Errors cont.
• Column stats:
  – Mean, median, standard deviation, variance
• Logical functions to check your columns
• Range checking your data

Carrot
Id #        Length      Biomass
1           12          8
2           24          16
3           261         18
4           10          5
Mean        76.75       11.75
Median      18          12
St Dev      123.0       6.24
Variance    15126.3     38.9
Min         10          5
Max         261         18

Graphical Exploratory Data Analysis
• Box plots (univariate)
• Stem-and-leaf plots (univariate)
• Scatterplots (bivariate or multivariate)

[Figure: box plots of TOTALPSO (y-axis, 0 to 700) for PASTURE groups 1 and 2 (N = 31 and 29)]

Stem-and-leaf plots
• Example: Vegetable biomass: 7, 15, 35, 36, 37, 23, 27, 21, 42, 55

  0 | 7
  1 | 5
  2 | 1,3,7
  3 | 5,6,7
  4 | 2
  5 | 5

Scatter plots
• Use to see how traits relate to one another

[Figure: scatter plot with biomass on the y-axis (0 to 20); x-axis runs from 0 to 60]

Creating an Audit Trail
• Examining data for outliers and errors is quality assurance/quality control (QA/QC) for research
• Document how you perform QA/QC in your metadata
• Your audit trail allows others to reanalyze and recreate your results
• May be required for legal documentation
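The column statistics and range check from "Detecting Outliers and Errors" can be reproduced for the carrot table with Python's standard library. This is a sketch under stated assumptions: the names `column_stats`, `range_check`, and `PLAUSIBLE` are hypothetical, and the plausible ranges are invented for illustration because the slides give no units or limits.

```python
import statistics

# Carrot measurements from the table in this chapter: (id, length, biomass).
carrots = [(1, 12, 8), (2, 24, 16), (3, 261, 18), (4, 10, 5)]

# Hypothetical plausible ranges (assumption: the source states no units or limits).
PLAUSIBLE = {"length": (0, 50), "biomass": (0, 50)}

def column_stats(values):
    """Return the summary statistics listed on the slide for one column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "st_dev": statistics.stdev(values),       # sample standard deviation (n - 1)
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "min": min(values),
        "max": max(values),
    }

def range_check(rows, limits):
    """Return (id, column, value) for every value outside its plausible range."""
    flagged = []
    for carrot_id, length, biomass in rows:
        for column, value in (("length", length), ("biomass", biomass)):
            low, high = limits[column]
            if not low <= value <= high:
                flagged.append((carrot_id, column, value))
    return flagged

length_stats = column_stats([row[1] for row in carrots])
biomass_stats = column_stats([row[2] for row in carrots])
flagged = range_check(carrots, PLAUSIBLE)

print("length:", length_stats)
print("biomass:", biomass_stats)
print("out of range:", flagged)
```

With these assumed limits the range check flags the length of 261 for carrot 3, the same value that pulls the mean length (76.75) far above the median (18). Recording the limits used and the values flagged in the metadata is one way to keep the audit trail described above.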