Managing and Curating Data

Chapter 8

Introduction
• Data organization
• Data management
• Data curation
• Raw data are required to repeat a scientific study
• Data collected with public funds are legally required to be available to other scientists and the public

Step 1: Managing Raw Data
• Various sources of data
  – Data loggers
  – Handwritten notes
• These data must be transferred to an organized format, checked, and analyzed

Spreadsheets
• Row: single observation
• Column: single measured or observed variable
• Enter data ASAP!
  – Detect mistakes
  – Memory (doesn’t last long)
  – 2 copies
  – Timely analysis
• Proofread the data
• Check it

2006 Garden Yield
            Number    Biomass
Carrots     10        30.2
Peppers     30        20.6
Broccoli    450.1     10

Metadata: Data about data
• “Must have” metadata:
  – Name and contact info of collector
  – Location of data collection
  – Name of study
  – Source of funding
  – Description of the organization of the data file
    • Methods used to collect
    • Types of experimental units
    • Description of abbreviations
    • Explicit description of data in columns and rows
• May be created before data collection in some cases
• Very important to assemble because it’s easily forgotten

Step 3: Checking the Data
• Outliers: values of measurements or observations that are outside the range of the bulk of the data
• Values beyond the upper or lower deciles (above the 90th percentile or below the 10th percentile)
• Outliers increase the variance in data and increase the chance of a Type II error

How to deal with outliers
• Do not delete them; this could be considered fraud
• Delete only if the value is an error or the data are no longer valid
• Think about them
  – Interesting hypotheses
  – A large body of science is devoted to outliers
  – What type of distribution do your data have?
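The decile rule under Step 3 can be sketched in a few lines of Python. This is only an illustration, assuming the ten vegetable biomass values from the stem-and-leaf example as input. The function name `flag_beyond_deciles` is hypothetical, and the exact cut points depend on the interpolation method (`statistics.quantiles` uses its "exclusive" method by default).

```python
import statistics

def flag_beyond_deciles(values):
    """Return the values below the 10th or above the 90th percentile."""
    # n=10 yields nine cut points: the 10th, 20th, ..., 90th percentiles.
    cuts = statistics.quantiles(values, n=10)
    low, high = cuts[0], cuts[-1]
    return [v for v in values if v < low or v > high]

# Vegetable biomass values from the stem-and-leaf example.
biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
flagged = flag_beyond_deciles(biomass)
print(flagged)
```

With these ten values the sketch flags 7 and 55. A flag only marks a value for a closer look; by the rules above it is not a reason to delete the value.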
Errors and Missing Data
• Errors are often outliers and can be identified
• Sources: mistyping (decimal points), instrument, field entry
• Checking data can reduce errors
• Never leave blank cells in spreadsheets; enter a zero for a true zero or NA (not available) for a missing value

Detecting Outliers and Errors
• Three techniques
  – Calculating column statistics
  – Checking ranges and precision of column values
  – Graphical exploratory data analysis

Detecting Outliers and Errors cont.
• Column stats:
  – Mean, median, standard deviation, variance
• Logical functions to check your columns
• Range checking your data

Carrot
Id #        Length     Biomass
1           12         8
2           24         16
3           261        18
4           10         5
Mean        76.75      11.75
Median      18         12
St Dev      123.0      6.24
Variance    15126.3    38.9
Min         10         5
Max         261        18

Graphical Exploratory Data Analysis
• Box plots (univariate)
• Stem-and-leaf plots (univariate)
• Scatterplots (bivariate or multivariate)

[Figure: box plots of TOTALPSO for PASTURE 1 (N = 31) and PASTURE 2 (N = 29); vertical axis 0 to 700]

Stem-and-leaf plots
• Example: Vegetable biomass: 7, 15, 35, 36, 37, 23, 27, 21, 42, 55

0 | 7
1 | 5
2 | 1, 3, 7
3 | 5, 6, 7
4 | 2
5 | 5

Scatter plots
• Use to see how traits relate to one another

[Figure: scatter plot with biomass on the vertical axis (0 to 20) and a horizontal axis running from 0 to 60]

Creating an Audit Trail
• Examining data for outliers and errors is a form of quality assurance and quality control (QA/QC) for research
• Document how you perform QA/QC in your metadata
• Your audit trail allows others to reanalyze and recreate your results
• May be required for legal documentation
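The column statistics and range check described under Detecting Outliers and Errors can be sketched in Python using the carrot table. This is a minimal illustration, not a prescribed workflow: the helper names `column_stats` and `out_of_range` are hypothetical, and the plausible range of 0 to 50 for carrot length is an assumption made for the example, since the slides give neither bounds nor units.

```python
import statistics

# Carrot table from the slide: (id, length, biomass)
carrots = [(1, 12, 8), (2, 24, 16), (3, 261, 18), (4, 10, 5)]

def column_stats(values):
    """Return the summary statistics listed on the slide for one column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "st_dev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),   # sample variance
        "min": min(values),
        "max": max(values),
    }

def out_of_range(rows, column, low, high):
    """Return the ids of rows whose value in `column` lies outside [low, high]."""
    return [row[0] for row in rows if not low <= row[column] <= high]

length_stats = column_stats([row[1] for row in carrots])
biomass_stats = column_stats([row[2] for row in carrots])

# Assumed plausible range for carrot length (hypothetical bounds).
suspect_ids = out_of_range(carrots, column=1, low=0, high=50)

print(length_stats)
print(biomass_stats)
print(suspect_ids)
```

The large gap between the mean (76.75) and the median (18) of the length column already hints at a problem, and the range check points to carrot 3. Recording the bounds used and the rows flagged in the metadata keeps this step part of the audit trail.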
