Data Analysis, Interpretation, and Introduction to Statistics Laboratory
Dr. Andrew J. Whelton, CE 374 Environmental Engineering

Congratulations! You have acquired actual data that you can analyze. Now it is important to determine what the results indicate, if anything. Hopefully, the study was designed appropriately so that your hypotheses can be tested. It is also important that all the data needed were collected, with no broken samples or “holes” in the data set. A “hole” is data that you need but that someone or something (e.g., an online monitoring device) did not collect. Holes in the data set are very problematic and will be discussed later.

To answer questions about the phenomena controlling environmental engineering processes, it is necessary to calculate basic statistics (mean and standard deviation values) and to create tables and figures. For example, you might be interested in whether or not individual water purifier devices change water pH. Using the data provided below you can test that question. The next several pages will explain how to determine basic statistics and develop testable hypotheses.

When you analyze data, you typically create many more tables and figures than you would actually report in the final communication (e.g., report, email, presentation, journal manuscript). Unless someone has already conducted the study you just completed and has a list of tables, figures, and analyses to use, you are starting from scratch. You must determine what questions to ask and which data to summarize and compare. From your analyses, you can identify general trends in the data and possibly identify recurring phenomena (e.g., instrument/analytical variability, missing or unusual results, unequal or inadequate sample sizes).
Once you calculate basic statistics, you can then select which advanced statistics you should apply to mathematically test your hypotheses, such as Analysis of Variance (ANOVA) (to determine whether all group means are equal) and multiple comparison tests (to determine which group is different). More discussion of what basic statistics you should determine and how to construct testable hypotheses can be found below.

I. WATER QUALITY OF CHALLENGE WATERS – BACKGROUND

Water Quality Characteristics of Device Challenge Waters

Water Type and Organic Material Present | Color
Deionized (DI) (no organics) | Clear
Synthetic Water #1 (Tannic acid) | Moderately Yellow, Cloudy
Synthetic Water #2 (Humic acid) | Dark Yellow/Brown, Cloudy
Lake Water (Indigenous organics) | Light Yellow

II. BASIC STATISTICS FOR ENVIRONMENTAL ENGINEERING – BACKGROUND

Ideally, we want to know the exact quantity of a contaminant in water. To be exact, we would need to measure the entire POPULATION (e.g., water contaminant concentration at every point in a volume of water at one time). If we obtained all of these values, we would be able to determine the central tendency of the value in the entire population. Key statistical terms and concepts are defined below, along with how to calculate these statistics using water pH data.

$X_i$ = Individual measurement.
$N$ = Total number of values or measurements in a population (the size of the population).
$\bar{X}$ = Mean. The most widely used measure of central tendency is the arithmetic mean, also called the mean or average.
$\mu$ = Population mean. Calculated as the sum of all individual measurements divided by the size of the population.

The population mean is calculated as:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Sample Mean (=AVG)

Because we usually cannot measure an entire population, we cannot calculate a population mean.
As a result, we must obtain a sample of that population and calculate what is called a “sample mean.” In the laboratory, we carried out many measurements, and we will use these results to calculate a sample mean. For example, we collected three water pH measurements (6.88, 6.50, 6.34), so our sample size (n) is 3.

$$\text{Sample mean} = \frac{\sum X_i}{n}$$

$$\text{Sample mean} = \frac{6.88 + 6.50 + 6.34}{3}$$

Sample mean = 6.57

Mean Deviation

Not only do we want to know the sample mean, we also want to know the mean “deviation,” or dispersion about the mean. In other words, how variable are our results? The sum of all deviations from the mean always equals zero, but summing the absolute values of the deviations gives a quantity that expresses the dispersion of the measurements about the sample mean. Dividing this sum by the sample size gives the mean deviation, also called the mean absolute deviation.

$$\text{Sample mean deviation} = \frac{\sum \left| X_i - \text{Sample mean} \right|}{n}$$

$$\text{Sample mean deviation} = \frac{|6.88 - 6.57| + |6.50 - 6.57| + |6.34 - 6.57|}{3}$$

Sample mean deviation = 0.20

Variance (=VAR)

To eliminate the signs of the deviations (+ or -), you can square the deviations. The sum of the squares of the deviations from the mean (called the SUM OF SQUARES) is defined as:

$$\text{Population sum of squares} = \sum (X_i - \mu)^2$$

$$\text{Sample sum of squares} = \sum (X_i - \text{Sample mean})^2$$

$$\text{Sample sum of squares} = (6.88 - 6.57)^2 + (6.50 - 6.57)^2 + (6.34 - 6.57)^2$$

Sample sum of squares = 0.15

Dividing the population sum of squares by N gives the population variance ($\sigma^2$); dividing the sample sum of squares by n − 1 gives the sample variance ($s^2$). For the pH data, $s^2$ = 0.15 / (3 − 1), or approximately 0.08.

Standard Deviation (=STDEV)

Standard deviation is the positive square root of the variance. Sigma ($\sigma$) represents the population standard deviation, while the sample standard deviation is represented by $s$. Standard deviation is never negative.

$$\sigma = \sqrt{\frac{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{N}}{N}}$$

$$s = \sqrt{\frac{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}}{n - 1}}$$

$$s = \sqrt{\frac{\left(6.88^2 + 6.50^2 + 6.34^2\right) - \frac{(6.88 + 6.50 + 6.34)^2}{3}}{3 - 1}}$$

s = 0.28

Coefficient of Variation

The coefficient of variation (CV), also called the coefficient of variability, does not have any units. CV is a relative measure: it expresses the standard deviation relative to the mean and estimates the variability of the population from which the sample came.
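As a quick check on the hand calculations, the sketch below recomputes the sample statistics for the three pH readings (6.88, 6.50, 6.34) using only the Python standard library. The function name `sample_statistics` and the dictionary keys are illustrative choices, not part of this handout. The sketch assumes the n − 1 divisor for the sample variance and takes the coefficient of variation as the sample standard deviation divided by the sample mean.

```python
import math

# The three water pH measurements from the worked example.
ph = [6.88, 6.50, 6.34]

def sample_statistics(values):
    """Return basic sample statistics for a list of measurements."""
    n = len(values)
    mean = sum(values) / n
    # Mean (absolute) deviation: average distance from the sample mean.
    mean_deviation = sum(abs(x - mean) for x in values) / n
    # Sum of squares: squared deviations from the sample mean.
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    # Sample variance uses n - 1 in the denominator.
    variance = sum_of_squares / (n - 1)
    standard_deviation = math.sqrt(variance)
    # Coefficient of variation: unitless ratio of s to the mean.
    coefficient_of_variation = standard_deviation / mean
    return {
        "mean": mean,
        "mean_deviation": mean_deviation,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "standard_deviation": standard_deviation,
        "coefficient_of_variation": coefficient_of_variation,
    }

stats = sample_statistics(ph)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimal places, the results should agree with the hand-calculated values for this example (mean 6.57, mean deviation 0.20, sum of squares 0.15, s 0.28).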
Results are sometimes reported with the standard deviation or the coefficient of variation shown. Some people multiply CV by 100 and report it as a percentage.

$$V = \frac{s}{\text{Sample mean}}$$

V = 0.28 / 6.57
V = 0.04

Hypothesis Testing: Posing and Answering Questions

The major aim of statistical analysis is to draw inferences about a population by examining a sample from that population. For example, we want to know if nitrogen concentration fluctuates in a lake. Since it is impractical and extremely costly to quantify nitrogen concentration at every point in a lake, we would select certain sampling points (this is a sample from the lake population). Based on nitrogen concentration results from these sampling points, we would then draw inferences about nitrogen concentration fluctuation in the lake.

To find answers to our questions, we must generate a “testable hypothesis.” A hypothesis is defined as a plausible account of an event, and there are two types: the null hypothesis and the alternative hypothesis. Basically, you cannot claim that one population is different from another unless you have mathematical justification (e.g., a statistical test indicates that your two groups are different). The null hypothesis (H0) states that there is no difference between your groups, while the alternative hypothesis (HA) states that your groups ARE different.

For example, we collect water samples at two locations in the same lake and quantify nitrogen concentration at both locations. The null hypothesis would be that the nitrogen concentrations at location A and location B are NOT different. The alternative hypothesis would be that the nitrogen concentrations at location A and location B ARE different. Statistical techniques are applied to test the null hypothesis. If the null hypothesis is rejected, then you must accept the alternative hypothesis.

Type I Error (α), Type II Error (β), Power (1 − β)

Occasionally, a null hypothesis will be rejected even though it is true.
This is called a Type I error, also known as an alpha error or error of the first kind; the probability of committing it is alpha (α). Because populations are so large, it is not possible to be 100% certain that the mathematical difference you find between groups is REAL. It is also possible to accept the null hypothesis even though it is false. This is called a Type II error, also known as a beta error or error of the second kind; the probability of committing it is beta (β). The probability of rejecting the null hypothesis when it is actually false is defined as the Power (1 − β) of a statistical test.

In environmental engineering we typically use α = 0.10 or 0.05. Thus, environmental engineers who use statistics accept that a true null hypothesis will be rejected 10% or 5% of the time. In other disciplines with very large sample sizes (e.g., medicine, toxicology) and more serious consequences (e.g., life or death due to drug ingestion/exposure), α values can range from 0.01 to 0.05. Environmental engineering studies typically collect only a few samples (<10) because of the expense and difficulty of collection, preservation, and analysis, whereas medical studies can have hundreds to thousands of samples per data set.

Reference

Zar JH. Biostatistical Analysis, 4th Ed. Prentice Hall, Inc. 1999. Upper Saddle River, NJ USA.