Analyzing and interpreting environmental engineering

Data Analysis, Interpretation, and Introduction to Statistics Laboratory
Dr. Andrew J. Whelton, CE 374 Environmental Engineering
Congratulations! You have acquired actual data that you can analyze. Now it is important to determine what the
results indicate, if anything. Hopefully, the study was designed appropriately so that your hypotheses can be
tested. It is also important that all of the needed data were collected (there were no broken samples or “holes” in
the data set). A “hole” means that you needed data, but someone or something (e.g., an online monitoring device)
did not collect it. Holes in the data set are very problematic and will be discussed later.
To answer questions about the phenomena controlling environmental engineering processes, it is necessary to
calculate basic statistics (mean and standard deviation values) and create tables and figures. For example, you
might be interested in whether or not individual water purifier devices change water pH. Using the data provided
below, you can test that question. The next several pages will explain how to determine basic statistics and
develop testable hypotheses.
When you analyze data, you typically create many more tables and figures than you would actually report in the
final communication (e.g., report, email, presentation, journal manuscript). Unless someone has already conducted
the same study and has a list of tables, figures, and analyses to use, you are starting from scratch. You must
determine what questions to ask and which data to summarize and compare. From your analyses, you can identify
general trends in the data and possibly identify recurring phenomena (e.g., instrument/analytical variability,
missing or unusual results, unequal or inadequate sample sizes). Once you calculate basic statistics, you can then
select which advanced statistics you should apply to mathematically test your hypotheses, such as Analysis of
Variance (ANOVA) (to determine if all groups are equal) and multiple comparison tests (to determine which group
is different). More discussion of what basic statistics you should determine and how to construct testable
hypotheses can be found below.
I. WATER QUALITY OF CHALLENGE WATERS – BACKGROUND
Water Quality Characteristics of Device Challenge Waters

Water Type (Organic Material Present)      Color
Deionized (DI) (no organics)               Clear
Synthetic Water #1 (tannic acid)           Moderately yellow, cloudy
Synthetic Water #2 (humic acid)            Dark yellow/brown, cloudy
Lake Water (indigenous organics)           Light yellow
II. BASIC STATISTICS FOR ENVIRONMENTAL ENGINEERING – BACKGROUND
Ideally, we want to know the exact quantity of a contaminant in water. To be exact, we would need to measure
the entire POPULATION (e.g., water contaminant concentration at every point in a volume of water at one time). If
we obtained all of these values, we would be able to determine the central tendency of the value in the entire
population. Below I have defined key statistical terms and concepts as well as shown how to calculate these
statistics using water pH data.
$X_i$ = Individual measurement.
$N$ = Total number of values or measurements in a population (the size of the population).
$\bar{X}$ = Mean. The most widely used measure of central tendency is the arithmetic mean, also called the mean or
average.
$\mu$ = Population mean. Calculated as the sum of all individual measurements divided by the size of the
population:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Sample Mean (=AVG)
Because we cannot sample an entire population, we cannot calculate a population mean. As a result, we must
obtain a sample of that population and calculate what is called a “sample mean.” In the laboratory, we carried out
many measurements, and we will use these results to calculate a sample mean. For example, we collected three
water pH measurements (6.88, 6.50, 6.34), so our sample size (n) is 3.
$$\text{Sample mean} = \frac{\sum X_i}{n}$$
$$\text{Sample mean} = \frac{6.88 + 6.50 + 6.34}{3}$$
Sample mean = 6.57
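The same calculation can be sketched in a few lines of Python. This is only an illustration using the three pH values from the example above; the variable names are arbitrary, and a spreadsheet function would serve equally well.

```python
from statistics import mean

# pH measurements from the worked example above
ph_values = [6.88, 6.50, 6.34]

n = len(ph_values)                # sample size
sample_mean = sum(ph_values) / n  # sum of all X_i divided by n

# The standard-library helper should agree with the manual calculation
assert abs(sample_mean - mean(ph_values)) < 1e-12

print(f"n = {n}, sample mean = {sample_mean:.2f}")  # n = 3, sample mean = 6.57
```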
Mean Deviation
Not only do we want to know the sample mean, we also want to know the mean “deviation,” or dispersion, about the
mean. In other words, how variable are our results? The sum of all deviations from the mean will always equal
zero, but summing the absolute values of the deviations from the mean results in a quantity that expresses the
dispersion about the sample mean. Dividing this value by the sample size gives the mean deviation, also called the
mean absolute deviation.
$$\text{Sample mean deviation} = \frac{\sum |X_i - \text{Sample mean}|}{n}$$
$$\text{Sample mean deviation} = \frac{|6.88 - 6.57| + |6.50 - 6.57| + |6.34 - 6.57|}{3}$$
Sample mean deviation = 0.20
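A short Python sketch of this step, again using the three example pH values, shows both that the signed deviations cancel out and how the absolute values are averaged. The variable names are illustrative only.

```python
# pH measurements from the worked example above
ph_values = [6.88, 6.50, 6.34]

n = len(ph_values)
sample_mean = sum(ph_values) / n

# Signed deviations from the mean: these sum to zero (up to rounding)
deviations = [x - sample_mean for x in ph_values]
print(f"sum of signed deviations = {sum(deviations):.10f}")

# Mean (absolute) deviation: average of the absolute deviations
mean_deviation = sum(abs(d) for d in deviations) / n
print(f"sample mean deviation = {mean_deviation:.2f}")  # 0.20
```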
Variance (=VAR)
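To eliminate the signs of the deviations (+ or -), you can square the deviations. The sum of the squares of the
deviations from the mean (called the SUM OF SQUARES) is defined as
$$\text{Population sum of squares} = \sum (X_i - \mu)^2$$
$$\text{Sample sum of squares} = \sum (X_i - \text{Sample mean})^2$$
$$\text{Sample sum of squares} = (6.88 - 6.57)^2 + (6.50 - 6.57)^2 + (6.34 - 6.57)^2$$
Sample sum of squares = 0.15
Dividing the population sum of squares by N gives the population variance. Dividing the sample sum of squares by
n - 1 gives the sample variance (here, 0.154 / 2 ≈ 0.077).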
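As a rough sketch, the sum of squares and the sample variance it leads to can be computed in Python as follows. Dividing by n - 1 for a sample is consistent with the standard deviation formula for s used in this handout; `statistics.variance` is Python's sample-variance function and is used here only as a cross-check.

```python
from statistics import variance

# pH measurements from the worked example above
ph_values = [6.88, 6.50, 6.34]

n = len(ph_values)
sample_mean = sum(ph_values) / n

# Sum of squared deviations from the sample mean
sum_of_squares = sum((x - sample_mean) ** 2 for x in ph_values)

# Sample variance: sum of squares divided by (n - 1)
sample_variance = sum_of_squares / (n - 1)

# Cross-check against the standard library's sample variance
assert abs(sample_variance - variance(ph_values)) < 1e-9

print(f"sample sum of squares = {sum_of_squares:.2f}")  # 0.15
print(f"sample variance = {sample_variance:.3f}")       # 0.077
```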
Standard Deviation (=STDEV)
Standard deviation is the positive square root of the variance. Sigma ($\sigma$) represents the population standard
deviation, while the sample standard deviation is represented by $s$. Standard deviation is always a nonnegative
value.
$$\sigma = \sqrt{\frac{\sum X_i^2 - \frac{(\sum X_i)^2}{N}}{N}}$$
$$s = \sqrt{\frac{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}{n - 1}}$$
$$s = \sqrt{\frac{(6.88^2 + 6.50^2 + 6.34^2) - \frac{(6.88 + 6.50 + 6.34)^2}{3}}{3 - 1}}$$
s = 0.28
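The formula for s above can be checked with a brief Python sketch. It evaluates the expression term by term for the three example pH values and compares the result with `statistics.stdev`, which computes the sample (n - 1) standard deviation.

```python
import math
from statistics import stdev

# pH measurements from the worked example above
ph_values = [6.88, 6.50, 6.34]
n = len(ph_values)

sum_x = sum(ph_values)                    # sum of X_i
sum_x_sq = sum(x * x for x in ph_values)  # sum of X_i squared

# Sample standard deviation using the formula shown above
s_formula = math.sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))

# Cross-check against the standard library
s_library = stdev(ph_values)
assert abs(s_formula - s_library) < 1e-9

print(f"s = {s_formula:.2f}")  # s = 0.28
```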
Coefficient of Variation
The coefficient of variation (CV), also called the coefficient of variability, does not have any units. CV is a
relative measure that estimates the variability of the population from which the samples came. Results are
sometimes reported with the standard deviation or the coefficient of variation shown. Some folks multiply CV by
100 and report this value as a percentage.
$$V = \frac{s}{\text{Sample mean}}$$
V = 0.28 / 6.57
V = 0.04
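A minimal Python sketch of the CV calculation for the same three pH values is shown below, including the percentage form that some people report. The variable names are illustrative.

```python
from statistics import mean, stdev

# pH measurements from the worked example above
ph_values = [6.88, 6.50, 6.34]

# Coefficient of variation: sample standard deviation over sample mean
cv = stdev(ph_values) / mean(ph_values)
cv_percent = cv * 100  # the same value expressed as a percentage

print(f"V = {cv:.2f}")            # V = 0.04
print(f"V = {cv_percent:.1f} %")  # V = 4.2 %
```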
Hypothesis Testing: Posing and Answering Questions
The major aim of statistical analysis is to draw inferences about a population by examining a sample from that
population.
For example, we want to know if nitrogen concentration fluctuates in a lake. Since it is impractical and
extremely costly to quantify nitrogen concentration at every point in a lake, we would select certain
sampling points (this is a sample from the lake population). Based on nitrogen concentration results from
these sampling points we would then draw inferences about nitrogen concentration fluctuation in the lake.
To find answers to our questions, we must generate a “testable hypothesis.” A hypothesis is defined as a plausible
account of an event, and there are two types: the null hypothesis and the alternative hypothesis. Basically, you
cannot claim that one population is different from another unless you have mathematical justification (e.g., a
statistical test indicates your two groups are different). The null hypothesis (Ho) indicates there is no
difference between your groups, while the alternative hypothesis (HA) indicates that your groups ARE different.
For example, we collect water samples at two locations in the same lake and quantify nitrogen
concentration at both locations. The null hypothesis would be that nitrogen concentration at location A
and location B are NOT different. The alternative hypothesis would be that nitrogen concentration at
location A and location B ARE different.
Statistical techniques are applied to test the null hypothesis. If the null hypothesis is rejected, then you
accept the alternative hypothesis.
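To make the idea of testing a null hypothesis concrete, here is a minimal Python sketch for the two-location lake example. Two things are assumptions, not part of this handout: the nitrogen values are made up purely for illustration, and the test used is an exact permutation test, chosen only because it needs no statistical tables or extra libraries. The handout itself does not prescribe a particular test at this point.

```python
from itertools import combinations
from statistics import mean

# HYPOTHETICAL nitrogen concentrations (mg/L), invented for illustration
location_a = [1.2, 1.4, 1.1, 1.3]
location_b = [1.8, 1.7, 2.0, 1.9]
alpha = 0.05  # accepted Type I error rate


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means.

    If Ho is true (no difference between locations), the location labels
    are interchangeable, so every relabeling of the pooled data is
    equally likely.
    """
    pooled = group_a + group_b
    indices = range(len(pooled))
    observed = abs(mean(group_a) - mean(group_b))
    n_extreme = 0
    n_total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in indices if i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        n_total += 1
        # Count relabelings at least as extreme as the observed one
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


p_value = permutation_p_value(location_a, location_b)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}")
print("Reject Ho" if reject_null else "Fail to reject Ho")
```

With these made-up values, every measurement at location A is lower than every measurement at location B, so only 2 of the 70 possible relabelings are as extreme as the observed one and Ho is rejected at α = 0.05.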
Type I Error (α), Type II Error (β), Power (1- β)
Occasionally, a null hypothesis will be rejected even though it is true. This is called a Type I error, also known
as an alpha error or error of the first kind, and its probability is denoted alpha (α). Because populations are so
large and we can only examine a sample, it is not possible to be 100% certain that the mathematical difference
between groups that you find is REAL. It is also possible that we could accept the null hypothesis even though it
is false. This is called a Type II error, also known as a beta error or error of the second kind, and its
probability is denoted beta (β). The probability of rejecting the null hypothesis when it is actually false is
defined as the power (1-β) of a statistical test.
In environmental engineering we typically use α = 0.10 or 0.05. Thus, environmental engineers who use statistics
accept that the null hypothesis will be rejected even though it is true 10% or 5% of the time. In other disciplines
with very large sample sizes (e.g., medicine, toxicology) and more serious consequences (e.g., life or death due to
drug ingestion/exposure), α values can range from 0.01 to 0.05. Typically, environmental engineering studies
only collect a few samples (<10) because of the expense and difficulty of collection, preservation, and analysis,
whereas medical studies can have hundreds to thousands of samples per data set.
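The meaning of α and power can be illustrated with a small simulation. The sketch below rests on assumptions that are not from this handout: pH values are drawn from a normal distribution with a known standard deviation of 0.3, each group has 5 samples, and a simple two-sample z-test is used because the standard deviation is treated as known. The function name and all numbers are hypothetical.

```python
import random
from statistics import NormalDist


def rejection_rate(true_shift, alpha, n=5, sigma=0.3, trials=20000, seed=1):
    """Fraction of simulated experiments in which Ho is rejected.

    true_shift is the real difference between the two population means.
    With true_shift = 0, Ho is true and the result estimates the Type I
    error rate; with true_shift != 0, it estimates the power (1 - beta).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    std_error = sigma * (2 / n) ** 0.5            # SE of difference in means
    rejections = 0
    for _ in range(trials):
        group_a = [rng.gauss(7.0, sigma) for _ in range(n)]
        group_b = [rng.gauss(7.0 + true_shift, sigma) for _ in range(n)]
        z = (sum(group_a) / n - sum(group_b) / n) / std_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# Ho is true: the rejection rate should be close to alpha
type_i_rate = rejection_rate(true_shift=0.0, alpha=0.05)

# Ho is false (means differ by 0.5 pH units): the rejection rate is the power
power = rejection_rate(true_shift=0.5, alpha=0.05)

print(f"estimated Type I error rate = {type_i_rate:.3f}")
print(f"estimated power = {power:.3f}")
print(f"estimated Type II error rate = {1 - power:.3f}")
```

Re-running with a larger `n` should raise the estimated power while leaving the Type I error rate near α, which is one way to see why small sample sizes limit what a study can detect.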
Reference
Zar JH. Biostatistical Analysis, 4th Ed. Prentice Hall, Inc. 1999. Upper Saddle River, NJ USA.