Statistics for Students in the Sciences

ACADEMIC SKILLS CENTRE (ASC)
Angie Silverberg
The use of statistics for the analysis of scientific data is essential. Statistics provide
information about the accuracy, precision and reliability of data obtained through
experimentation. The statistics reviewed in this document are those most commonly
used in the critical analysis of experimental data.
Sample Mean
The sample mean (x̄) is defined as the mean or average of a limited number of samples
drawn from a population of experimental data. The mean can be calculated manually or
with the aid of a statistical function on a scientific calculator. The latter method is
faster, but it is still important to understand how the value is derived.
x̄ = ( Σ xk ) / n
where xk is an individual experimental value, Σ xk is the sum of all the experimental
values and n is the number of experimental values used to obtain the sum.
For example, a certain experiment yielded the following data values for lead: 10 ppm, 8
ppm, 7 ppm, 11 ppm and 16 ppm. The mean value is calculated by the following:
x̄ = (10 + 8 + 7 + 11 + 16) ppm / 5 = 10.4 ppm ≈ 10 ppm (use the appropriate significant digits)
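The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `sample_mean` and the variable `lead_ppm` are illustrative choices and are not part of the original handout.

```python
def sample_mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("at least one value is required")
    return sum(values) / len(values)


# Lead results from the example above, in ppm.
lead_ppm = [10, 8, 7, 11, 16]

mean_ppm = sample_mean(lead_ppm)
print(f"mean = {mean_ppm:.1f} ppm")  # prints "mean = 10.4 ppm"
```

Rounding to the appropriate number of significant digits is left as a separate, final step, as in the worked example.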
Standard Deviation
The term standard deviation (s) is used as a measure of precision. Precision describes
how closely two or more measurements agree when the exact same method or procedure is
used. The standard deviation can be easily calculated using the statistical function on a
scientific calculator. But, again, understanding the mathematical derivation is important.
Standard deviation is calculated by:
s = sqrt[ Σ (xk - x̄)² / (n - 1) ]
Using the example from above, the standard deviation is:
s = sqrt{[(10-10)² + (8-10)² + (7-10)² + (11-10)² + (16-10)²]/(5-1)} = sqrt[(0+4+9+1+36)/4] = 3.5 ppm ≈ 4 ppm
The mean and standard deviation for the experiment can be expressed as: (10 ± 4) ppm
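The same result can be reproduced in Python. The sketch below assumes the usual sample standard deviation with n - 1 in the denominator, as used in the worked example; the function name `sample_std` is illustrative. Note that the code uses the unrounded mean (10.4 ppm) rather than the rounded value of 10 ppm used above; both routes give 3.5 ppm to two significant digits.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Lead results from the example above, in ppm.
lead_ppm = [10, 8, 7, 11, 16]

s = sample_std(lead_ppm)
print(f"s = {s:.1f} ppm")  # prints "s = 3.5 ppm"

# The standard library gives the same value.
print(math.isclose(s, statistics.stdev(lead_ppm)))  # prints "True"
```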
Types of Student t-tests
A variety of Student's t-tests can be used to evaluate methods for purposes of method
development or quality control. Typically, a Student's t-test is used to determine whether
the difference between two means is statistically significant.
Case 1: If an accepted value, such as a Certified Reference Material (CRM), is known
This case is used to compare an experimental mean with an accepted value, such as that of
a Certified Reference Material (CRM), a sample whose concentration has been certified
through analytical means. CRMs are put through rigorous testing procedures to validate
their concentration levels, and therefore there is a high degree of confidence in these
analytically determined concentrations. In order to compare an experimental value with a
CRM value to validate a method/procedure, the following t-test is utilized:
μ = x̄ ± ts /sqrt(N)

If the equation is rearranged for the value of t:
± t = (x̄ - μ)sqrt(N)/s
where μ is the value of the certified reference material, N is the number of measurements,
s is the standard deviation and t is the Student's t-value, obtained for N-1 degrees of
freedom at a pre-selected confidence interval, typically a 95% confidence interval. The
t-values are obtained from a table similar to the one below:
Values for t at N-1 Degrees of Freedom for Various Confidence Intervals (CI)
N-1     90% CI    95% CI    99.5% CI
1       6.314     12.706    127.32
2       2.920     4.303     14.089
3       2.353     3.182     7.453
4       2.132     2.776     5.598
5       2.015     2.571     4.773
6       1.943     2.447     4.317
7       1.895     2.365     4.029
8       1.860     2.306     3.832
9       1.833     2.262     3.690
10      1.812     2.228     3.581
∞       1.645     1.960     2.807
A. Silverberg, March 29, 2011
Consider the same data set used earlier for lead: 10 ppm, 8 ppm, 7 ppm, 11 ppm and 16
ppm. Assume there is a CRM value of 9.43 ppm for lead in a sandy soil sample. Case 1
can be used to test whether the results of the given experimental method are consistent
with the CRM value:
± t = (x̄ - μ)sqrt(N)/s
Plugging in the values: ± t = (10 ppm – 9.43 ppm)sqrt(5)/4 ppm
± t = 0.32
Consulting the t-table at the 95% confidence interval for N-1 = 4 degrees of freedom, the
t-value is 2.776. If the calculated t-value is lower than the tabulated t-value at the 95% CI,
there is no statistically significant difference. If the calculated t-value is higher than the
tabulated t-value at the 95% CI, there is a statistically significant difference. In this case,
the calculated t-value is lower than the tabulated t-value and therefore the method is
considered a valid procedure.
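The Case 1 comparison can be sketched in Python as follows. The function name `t_vs_reference` and the constant `T_TABLE_95_DF4` are illustrative, and the inputs are the rounded mean (10 ppm) and standard deviation (4 ppm) used in the worked example, so the result matches the hand calculation.

```python
import math


def t_vs_reference(mean, reference, s, n):
    """Calculated t for comparing an experimental mean with an accepted value:
    t = |mean - reference| * sqrt(n) / s
    """
    if s <= 0 or n < 2:
        raise ValueError("s must be positive and n must be at least 2")
    return abs(mean - reference) * math.sqrt(n) / s


# Rounded values from the worked example: mean 10 ppm, s 4 ppm, N = 5,
# and a CRM value of 9.43 ppm.
t_calc = t_vs_reference(mean=10.0, reference=9.43, s=4.0, n=5)

# Tabulated t at the 95% CI for N - 1 = 4 degrees of freedom (from the table).
T_TABLE_95_DF4 = 2.776

significant = t_calc > T_TABLE_95_DF4
print(f"t = {t_calc:.2f}")                   # prints "t = 0.32"
print("significant difference:", significant)  # prints "significant difference: False"
```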
Case 2: When the accepted value is unknown
When the accepted value is unknown, a two-sample t-test based on a pooled standard
deviation is used to determine the validity of the experimental mean. Usually, a second
mean is obtained using a different instrument, another laboratory or a secondary method
within the same laboratory. The experimental t-value is calculated by:
± t = ((x̄1 - x̄2)/sp) · sqrt[N1N2/(N1 + N2)]
where x̄1 is the mean of one data set, x̄2 is the mean from the second data set and sp is
called the pooled standard deviation given by:
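sp = sqrt{[s1²(N1-1) + s2²(N2-1) + … + sk²(Nk-1)]/(NT - k)}
where N1, N2, … Nk are the numbers of measurements in each data set, NT is the total
number of measurements in all data sets and k is the number of experimental means used
for comparison. For example, if there are two sets of experimental means, then the value
of k is 2.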
Example:
Lead Concentrations For Two Different Method Determinations Using ICP-MS From Lab A and Lab B
Lab A Data/ppm of Pb    Lab B Data/ppm of Pb
17.1                    17.2
16.2                    17.1
14.6                    17.0
22.8                    19.0
18.7                    18.3
x̄1 = 17.9               x̄2 = 17.7
s1 = 3.2                s2 = 0.9
sp = sqrt[(39.7 + 3.0)/(10 - 2)] = 2.3
± t = [(17.9 - 17.7)/2.3] · sqrt[(5 × 5)/(5 + 5)] = (0.09)(1.6) = 0.1
The t-value from the table at a 95% confidence interval for N1 + N2 - 2 = 8 degrees of
freedom is 2.306. Since the calculated t-value is less than the tabulated t-value at a 95%
confidence interval, there is no statistically significant difference between the two
methods. Therefore, both methods are valid procedures.
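The pooled standard deviation and the Case 2 t-value can be sketched in Python as below. The function names `pooled_std` and `two_sample_t` are illustrative, and the two lists are the Lab A and Lab B values as read from the example table. The code works from the raw data without intermediate rounding, so the results agree with the hand calculation only after rounding (sp ≈ 2.3, t ≈ 0.1).

```python
import math
import statistics


def pooled_std(*groups):
    """Pooled standard deviation:
    sqrt( sum(s_i^2 * (N_i - 1)) / (N_T - k) ) for k data sets.
    """
    total_n = sum(len(g) for g in groups)
    k = len(groups)
    weighted = sum(statistics.variance(g) * (len(g) - 1) for g in groups)
    return math.sqrt(weighted / (total_n - k))


def two_sample_t(group1, group2):
    """Calculated t for two means using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    difference = abs(statistics.mean(group1) - statistics.mean(group2))
    return difference / pooled_std(group1, group2) * math.sqrt(n1 * n2 / (n1 + n2))


# Lead results in ppm, as read from the example table.
lab_a = [17.1, 16.2, 14.6, 22.8, 18.7]
lab_b = [17.2, 17.1, 17.0, 19.0, 18.3]

sp = pooled_std(lab_a, lab_b)
t_calc = two_sample_t(lab_a, lab_b)
print(f"sp = {sp:.1f}")     # prints "sp = 2.3"
print(f"t = {t_calc:.2f}")  # prints "t = 0.11"
```

The calculated t is far below any tabulated value near 2.3, so the conclusion of the worked example is unchanged.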
Rejection of Data Points
Often in research there are data points that seem out of range or questionable compared to
the entire data set. It may be desirable to omit questionable data points from overall
calculations. A questionable data point of this kind is called an outlier. However, omission
of data points must be rigorously justified using a statistical method called a Q-test. To
conduct the test, the difference between the questionable value and its nearest data point,
called a, is calculated. A second variable, called w, is the difference between the
questionable value and its furthest data point, which is the spread of the entire data set.
The value of Q is determined by the following:
Q = a/w
Considering the original data set for lead: 10 ppm, 8 ppm, 7 ppm, 11 ppm and 16 ppm,
we may consider 16 ppm as a potential outlier. To test the validity of this assumption,
the Q-test will be utilized:
Q = a/w = (16 - 11)/(16 - 7) = 5/9 = 0.55
To assess the value of 0.55, one needs to refer to a table of rejection quotients for various
confidence levels, similar to the one below:
Rejection Quotients (Q) at Various Confidence Intervals
# of Observations    Q90      Q95      Q99
3                    0.941    0.970    0.994
4                    0.765    0.829    0.926
5                    0.642    0.710    0.821
6                    0.560    0.625    0.740
7                    0.507    0.568    0.680
8                    0.468    0.526    0.634
9                    0.437    0.493    0.598
10                   0.412    0.466    0.568
As there are five data points, n = 5 and the tabulated Q-value is 0.710 at the
95% CI. If a calculated Q-value is greater than the tabulated Q-value, the outlier can be
rejected. However, if a calculated Q-value is less than the tabulated Q-value, then the
outlier cannot be rejected as it is considered a valid data point.
Referring back to the example, the calculated Q-value = 0.55, the tabulated Q-value =
0.710 at 95% CI. Therefore, the calculated Q value < tabulated Q value, and, the value of
16 ppm cannot be rejected.
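The Q-test decision can be sketched in Python as follows. The function name `q_statistic` and the dictionary `Q_TABLE_95` are illustrative; the dictionary holds the Q95 column of the table, keyed by the number of observations. The sketch assumes the suspect value is the smallest or largest value in the data set.

```python
def q_statistic(values, suspect):
    """Q = a / w, where a is the gap between the suspect value and its
    nearest neighbour and w is the spread (largest minus smallest value).
    """
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("at least three values are required")
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        raise ValueError("all values are identical")
    if suspect == ordered[-1]:
        gap = ordered[-1] - ordered[-2]
    elif suspect == ordered[0]:
        gap = ordered[1] - ordered[0]
    else:
        raise ValueError("suspect must be the smallest or largest value")
    return gap / spread


# Q95 column of the rejection quotient table, keyed by number of observations.
Q_TABLE_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
              7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

# Lead results from the example above, in ppm.
lead_ppm = [10, 8, 7, 11, 16]

q_calc = q_statistic(lead_ppm, suspect=16)  # (16 - 11) / (16 - 7) = 5/9
q_table = Q_TABLE_95[len(lead_ppm)]
reject = q_calc > q_table
print("reject 16 ppm:", reject)  # prints "reject 16 ppm: False"
```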
F-test: Comparison of Precision Measurement
An F-test is a simple calculation to compare the precision of two sets of measurements.
The sets do not have to be obtained from the identical sample, so long as both samples
are sufficiently similar that any indeterminate errors can be considered the same. An
F-test can provide insights into two main areas: 1) Is method A more precise than method
B? 2) Is there a difference in the precision of the two methods? To calculate F, the
variance (the square of the standard deviation) of the method which is assumed to be more
precise is placed in the denominator, while the variance of the method which is assumed
to be less precise is placed in the numerator.
Using the two data sets for lead obtained above (Lab A and Lab B), the standard deviations
s1 = 3.2 ppm (less precise) and s2 = 0.9 ppm (more precise) were obtained.
F = s1²/s2² = (3.2)²/(0.9)² = 10.2/0.8 = 12.8
To interpret this F-value, it must be compared with a table of critical values for F, such
as the one below:
Critical Values For F At A 5% Level
Degrees of Freedom    Degrees of Freedom (Numerator)
(Denominator)         2        3        4        5
2                     19.00    19.16    19.25    19.30
3                     9.55     9.28     9.12     9.01
4                     6.94     6.59     6.39     6.26
5                     5.79     5.41     5.19     5.05
Each data set had five measurements and therefore four degrees of freedom, and hence the
tabulated F-value is 6.39. The calculated F-value of 12.8 is greater than the tabulated
value of 6.39. Therefore, the difference in precision is statistically significant, and the
more precise method is indeed the one that produced data set number two.
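The F-test comparison can be sketched in Python as below. The function name `f_statistic` and the dictionary `F_CRITICAL_5PCT` are illustrative; the dictionary holds the critical values from the table. Taking the degrees of freedom as n - 1 = 4 for each five-point data set is an assumption of this sketch. Because the code does not round the squared values before dividing, it gives an F-value of about 12.6 rather than 12.8.

```python
def f_statistic(s_a, s_b):
    """F = larger variance / smaller variance, from two standard deviations."""
    var_a, var_b = s_a ** 2, s_b ** 2
    if min(var_a, var_b) == 0:
        raise ValueError("standard deviations must be non-zero")
    return max(var_a, var_b) / min(var_a, var_b)


# Critical values of F at the 5% level, copied from the table.
# Keys are (numerator degrees of freedom, denominator degrees of freedom).
F_CRITICAL_5PCT = {
    (2, 2): 19.00, (3, 2): 19.16, (4, 2): 19.25, (5, 2): 19.30,
    (2, 3): 9.55, (3, 3): 9.28, (4, 3): 9.12, (5, 3): 9.01,
    (2, 4): 6.94, (3, 4): 6.59, (4, 4): 6.39, (5, 4): 6.26,
    (2, 5): 5.79, (3, 5): 5.41, (4, 5): 5.19, (5, 5): 5.05,
}

# Standard deviations from the worked example (ppm), five measurements each.
s1, n1 = 3.2, 5  # less precise set, numerator
s2, n2 = 0.9, 5  # more precise set, denominator

f_calc = f_statistic(s1, s2)
# Assumption: degrees of freedom are n - 1 for each data set.
f_table = F_CRITICAL_5PCT[(n1 - 1, n2 - 1)]
significant = f_calc > f_table
print(f"F = {f_calc:.1f}")  # prints "F = 12.6"
print("precision differs significantly:", significant)
```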
Statistical tables were derived from: Skoog, D.A.; West, D.M.; Holler, F.J., 1992, Fundamentals of
Analytical Chemistry, Sixth Edition. Saunders College Publishing, Florida, USA.
The Academic Skills Centre
www.trentu.ca/academicskills
acdskills@trentu.ca
705-748-1720