Data Analysis
Dr. Doug McLaughlin
November 18, 2013

The goal of this lecture is to encourage deeper thinking about data or values we see or derive ourselves.

What’s In A Number?
• “The concentration of dioxin in fish is 0.6 parts per trillion.”
• Important questions to ask:
  – About the objective (why do we care?)
  – About the value (what does this number represent)
  – About the data (what else should we know about the data set it comes from? Is data quality acceptable?)

Some Good Questions To Ask About the Value
• What does “0.6” represent?
  – Mean? Median? Geometric mean?
  – From a representative “sample”?
  – How sure are we (confidence limits)?
  – Is it changing (trends over time)?
  – Data assumptions (e.g., not skewed, no values below detection limits)?

Concentration Units

Concentration  SI Prefix Name  “Parts per…”  Factor (Decimal Notation)  Factor (Scientific Notation, g/L)
1 g/L          -               thousand      1                          10^0
1 mg/L         milli           million       0.001                      10^-3
1 ug/L         micro           billion       0.000001                   10^-6
1 ng/L         nano            trillion      0.000000001                10^-9
1 pg/L         pico            quadrillion   0.000000000001             10^-12
1 fg/L         femto           quintillion   0.000000000000001          10^-15

USEPA Data Quality Objectives Process: A 7 Step Framework for Data Collection, Data Analysis, and Decision-Making
• How hazardous is consuming fish from a river?
• Compare chemical concentrations in fish tissue with health guidance.
• Determine chemical concentrations in representative fish tissue.

USEPA Data Quality Objectives Process: A Framework for Data Collection, Data Analysis, and Decision-Making - continued
• Most commonly caught fish species from River x.
• Measure contaminant concentrations in whole body tissue samples.
• Compare to health guidance values.
USEPA Data Quality Objectives Process: A Framework for Data Collection, Data Analysis, and Decision-Making
• How certain must the estimated mean be, i.e., how small must the confidence interval on the estimated mean be?
• How many samples are needed?
• What analytical method should be used?

Example Data Set
• 37 “dioxin” concentration measurements from fish collected downstream of a pulp and paper mill

Year0         0      0      0      0      0      3      3      3      7      7      7      7      7      7      7      7      7
Dioxin (ppt)  1.8    1.7    2.6    0.84   1.1    1.4    0.26   1.1    0.63   0.4    0.28   0.79   0.4    0.31   0.2    0.51   0.44

Year0         10     10     10     10     10     10     10     10     10     10     13     13     13     13     13     13     13     13     13     13
Dioxin (ppt)  <0.25  0.79   <0.39  <0.23  <0.25  0.37   0.37   0.27   0.30   0.26   0.12   0.24   0.24   0.29   0.36   0.38   0.33   0.24   0.38   <0.12

Data Summary Statistics
Assumes “less thans” are equal to the detection limit

Parameter               Value
N                       37
Mean                    0.57
Median                  0.37
25th percentile         0.26
75th percentile         0.71
Variance                0.29
S.D.                    0.54
Coef. Var. (S.D./Mean)  95%
Min.                    0.12
Max.                    2.6

Example Data Set
[Figure: example data set. Assumes “less thans” are equal to the detection limit.]

Understanding Data Distributions
Examples of Normal and Lognormal Distributions
[Figure: two distribution plots, density versus X. Normal, Mean=0.57, StDev=0.29. Lognormal, Loc=0.57, Scale=0.57, Thresh=0.]

Making Assumptions About “Censored” Values (“Nondetects”)
• Assume/substitute specific values for NDs
  – 0, ½ detection limit, full detection limit are common substitutions
• Convenient, but can lead to incorrect conclusions in certain cases.
  – Hard to predict when problems will arise
• Are there alternatives? Yes. One example is the Kaplan-Meier procedure for estimating a mean.

Effect of “Less Than” Substitution Assumption
DL = detection limit; S.D. = standard deviation; Coef. Var. = coefficient of variation

Parameter               ND=DL  ND=0   ND=1/2 DL
N                       37     37     37
Mean                    0.57   0.53   0.55
Median                  0.37   0.36   0.36
25th percentile         0.26   0.24   0.24
75th percentile         0.71   0.71   0.71
Variance                0.29   0.32   0.30
S.D.                    0.54   0.56   0.55
Coef. Var. (S.D./Mean)  95%    106%   100%
Min.                    0.12   0      0.06
Max.                    2.6    2.6    2.6

Example Data Set
• 37 “dioxin” concentration measurements from fish collected downstream of a pulp and paper mill

Sample No.    1      2      3      4      5      6      7      8      9      10     11     12     13     14     15     16     17
Dioxin (ppt)  1.8    1.7    2.6    0.84   1.1    1.4    0.26   1.1    0.63   0.4    0.28   0.79   0.4    0.31   0.2    0.51   0.44

Sample No.    18     19     20     21     22     23     24     25     26     27     28     29     30     31     32     33     34     35     36     37
Dioxin (ppt)  <0.25  0.79   <0.39  <0.23  <0.25  0.37   0.37   0.27   0.30   0.26   0.12   0.24   0.24   0.29   0.36   0.38   0.33   0.24   0.38   <0.12
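
The effect of the “less than” substitution assumption can be checked directly from the 37 values listed in the example data set. The following is a minimal Python sketch, not the software used to produce the slides. The names RAW, substitute, summarize, and results are hypothetical. The slides do not say how the percentiles were computed; the "exclusive" quantile method is an assumption, chosen here because it gives a 75th percentile of 0.71 for these data. Variance and S.D. are computed as sample statistics (n - 1 denominator), which is also an assumption.

```python
import statistics

# The 37 dioxin results (ppt) from the example data set.
# A leading "<" marks a nondetect reported at its detection limit (DL).
RAW = """
1.8 1.7 2.6 0.84 1.1 1.4 0.26 1.1 0.63 0.4 0.28 0.79 0.4 0.31 0.2 0.51 0.44
<0.25 0.79 <0.39 <0.23 <0.25 0.37 0.37 0.27 0.30 0.26
0.12 0.24 0.24 0.29 0.36 0.38 0.33 0.24 0.38 <0.12
""".split()


def substitute(raw, fraction):
    """Replace each nondetect '<DL' with fraction * DL.

    fraction = 1.0 gives ND=DL, 0.5 gives ND=1/2 DL, 0.0 gives ND=0.
    """
    return [
        float(x[1:]) * fraction if x.startswith("<") else float(x)
        for x in raw
    ]


def summarize(values):
    """Return the summary statistics shown on the slides for one data set."""
    # Assumption: 'exclusive' quartiles (positions based on n + 1).
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return {
        "N": len(values),
        "Mean": mean,
        "Median": statistics.median(values),
        "25th percentile": q1,
        "75th percentile": q3,
        "Variance": statistics.variance(values),  # sample variance
        "S.D.": sd,
        "Coef. Var.": sd / mean,  # as a fraction, not a percent
        "Min.": min(values),
        "Max.": max(values),
    }


# One summary per substitution assumption.
results = {
    label: summarize(substitute(RAW, fraction))
    for label, fraction in [("ND=DL", 1.0), ("ND=0", 0.0), ("ND=1/2 DL", 0.5)]
}

for label, stats in results.items():
    print(label)
    for name, value in stats.items():
        print(f"  {name:16s} {value:.3g}")
```

Under these assumptions the printed values should agree with the substitution table to rounding. With only 5 nondetects out of 37 results, the mean shifts by a few hundredths of a ppt between assumptions, while the minimum and the lower percentiles are the statistics most sensitive to the choice.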