lecture_4_statistics

Statistics and Quantitative Analysis
Chemistry 321, Summer 2014

Statistics is the field of study that allows you to understand the limitations of your data; in other words, what reasonable conclusions can you draw from your data?

Warning: What follows is not meant to be a complete course in statistics. It will be enough, though, to get you through this course, but do not apply it blindly to other situations!

Quantitative measurements must be replicated to establish the credibility of the data

Clearly, if your observations during the experiment suggest that a procedural error occurred, then the data for that trial may be safely omitted. In other words, be careful in lab. But are there methods to detect so-called “outliers”? Yes, this is where statistics is helpful. Remember, just because you have a statistical outlier does not mean that you should necessarily throw out that data point!

But first, some slides about accuracy and precision

Accuracy is the agreement between your measured value and the published (sometimes called “true”) value. Precision is the agreement between your repeated measurements. Thus, accuracy ≠ precision.

[Figure: an analogy with a dartboard]

So why distinguish between accuracy and precision? The terms allow us to distinguish different types of errors: those that we can correct easily, and those we can correct with difficulty or not at all.

Systematic versus random errors

Systematic (determinate) errors affect accuracy. Because they bias the data one way (always too high) or the other (always too low), they can usually be corrected easily.

Random (indeterminate) errors affect precision. Because they are the result of variability or instrument uncertainty, they are much more difficult to correct.
The arithmetic mean (aka the average)

The mean is simply the sum of the measurements divided by the number of measurements; symbolically, this is:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$

where $\bar{x}$ is the mean, $x_i$ is the i-th measurement, and N is the number of observations.

Note that all of the measurements are equally important; in other words, this is an unweighted mean. We will assume for the rest of the course that all measurements are of equal weight.

The standard deviation is a measure of the “spread” of the data set

To get a sense of whether the data are closely spaced or widely scattered in the data space of all possible measurements, the standard deviation is used.

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$

The term $(x_i - \bar{x})$ is the residual for the i-th measurement.

Note that s itself does not systematically shrink as N increases. Generally, more measurements make s a more reliable estimate of the true spread, and it is the uncertainty in the mean that decreases.

The variable s is used when the standard deviation is calculated for a sample set of data from which you wish to generalize to a population from which the sample was selected. In this case, the denominator of the fraction inside the square root is N - 1, as shown.

The variable σ (little sigma) is used when the standard deviation is calculated for the entire population (which is not going to happen in this course). In this case, the denominator would simply be N.

Point of fact: Once N > 30, then s ≈ σ.

The standard deviation is a useful measure of spread when there are many measurements

As a rule of thumb, you should have at least ten repeated measurements, though you will violate this rule often in this course. For instance, if N = 2 and the difference between the two measurements is d, then $s = \sqrt{2}\,d/2$, which is not particularly meaningful.

Standard deviations allow you to distinguish two distinct populations

Each graph shows two normal distributions: in each case, are there two distinguishable populations? Compare the means and standard deviations of the distributions.
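The mean and standard deviation formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function name `mean_and_std` and the measurement values are hypothetical, chosen only to show the N - 1 versus N denominators and the N = 2 special case.

```python
import math

def mean_and_std(values):
    """Return (mean, s, sigma) for a list of equally weighted measurements."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(values) / n
    # Sum of the squared residuals (x_i - mean)^2
    ss = sum((x - mean) ** 2 for x in values)
    s = math.sqrt(ss / (n - 1))   # sample standard deviation (denominator N - 1)
    sigma = math.sqrt(ss / n)     # population standard deviation (denominator N)
    return mean, s, sigma

# Hypothetical replicate measurements (illustrative values only)
trial = [10.1, 10.3, 9.9, 10.2, 10.0]
mean, s, sigma = mean_and_std(trial)
print(f"mean = {mean:.3f}, s = {s:.3f}, sigma = {sigma:.3f}")

# N = 2 case: s should equal sqrt(2) * d / 2, where d is the difference
pair = [5.0, 5.4]
d = abs(pair[1] - pair[0])
_, s_pair, _ = mean_and_std(pair)
print(f"s for the pair = {s_pair:.4f}, sqrt(2)*d/2 = {math.sqrt(2) * d / 2:.4f}")
```

For a small sample, s is noticeably larger than σ computed from the same numbers; the two converge as N grows.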
Compare the masses of two sets of ten pennies, one minted before 1982, the other after 1982. The question: are the two sets of pennies distinguishable by mass?

                 pre-1982 (g)    post-1982 (g)
                 3.067           2.534
                 3.088           2.544
                 3.094           2.566
                 3.056           2.513
                 3.050           2.555
                 3.049           2.532
                 3.061           2.538
                 3.077           2.541
                 3.063           2.570
                 3.071           2.548
    mean         3.068           2.544
    std dev (σ)  0.015           0.017

Compare the overlap of the two-sigma ranges of each set of pennies

Pre-1982 range: 3.068 ± 0.030 g
Post-1982 range: 2.544 ± 0.034 g

When using the “±” notation to show one- or two-sigma ranges, report the precision of the standard deviation to match the precision of the mean.

The pre-1982 penny mass range is therefore 3.038 to 3.098 g, whereas the post-1982 penny mass range is 2.510 to 2.578 g. The two ranges do not overlap, so at the two-sigma level, the two sets of pennies are distinguishable!

There is a good reason for this distinguishability: in 1982, the US Mint changed the composition of the penny from mostly copper to mostly zinc.

The relative standard deviation (RSD%) is a measure of precision

$$\mathrm{RSD\%} = \frac{s}{\bar{x}} \times 100$$

Guideline: RSD% ≤ 3% is good for this course. Note that other situations may have a larger or smaller cutoff percentage.

The percent deviation is a measure of accuracy

$$\%\ \text{deviation} = \left( \frac{\bar{x} - x_{\text{published}}}{x_{\text{published}}} \right) \times 100$$

Guideline: % deviation ≤ 3% is good for this course. Note that other situations may have a larger or smaller cutoff percentage.

So how do you know when you can omit a measurement in a set of measurements?

For this course, we will assume that all measurable quantities will be distributed normally (in other words, conform to a Gaussian distribution).

[Figure: normal distribution curve, with the x-axis marked in units of the standard deviation]

Note that the x-axis is marked in units of the standard deviation; yes, they are using σ instead of s, but this is customary. For instance, a measurement will be said to be “two-sigma” higher than the mean (rather than “two-ess”).
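The penny comparison and the RSD% and percent deviation formulas above can be checked with a short Python sketch. The penny masses are the ones tabulated in the lecture; the helper names (`two_sigma_range`, `ranges_overlap`, `rsd_percent`, `percent_deviation`) are hypothetical, and the values passed to `percent_deviation` are made-up numbers for illustration only.

```python
import statistics

# Penny masses in grams, from the lecture table
pre_1982 = [3.067, 3.088, 3.094, 3.056, 3.050, 3.049, 3.061, 3.077, 3.063, 3.071]
post_1982 = [2.534, 2.544, 2.566, 2.513, 2.555, 2.532, 2.538, 2.541, 2.570, 2.548]

def two_sigma_range(values):
    """Return (low, high) for mean +/- 2s, using the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return mean - 2 * s, mean + 2 * s

def ranges_overlap(r1, r2):
    """True if the closed intervals r1 and r2 share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def rsd_percent(values):
    """Relative standard deviation: s / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

def percent_deviation(mean, published):
    """Percent deviation of a measured mean from a published value."""
    return (mean - published) / published * 100

pre_range = two_sigma_range(pre_1982)
post_range = two_sigma_range(post_1982)
print("pre-1982 two-sigma range: ", [round(v, 3) for v in pre_range])
print("post-1982 two-sigma range:", [round(v, 3) for v in post_range])
print("ranges overlap:", ranges_overlap(pre_range, post_range))
print("RSD% pre-1982: ", round(rsd_percent(pre_1982), 2))
print("RSD% post-1982:", round(rsd_percent(post_1982), 2))

# Hypothetical measured mean and published value (illustration only)
print("% deviation:", round(percent_deviation(10.2, 10.0), 2))
```

Because the sketch carries the unrounded standard deviation through the calculation, its range endpoints can differ in the last digit from the rounded values quoted in the text.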
The normal distribution formula is:

$$f(x) = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{(x - \bar{x})^2}{2s^2}}$$

where x is a given measurement and f(x) is the predicted probability (strictly, the probability density) of that measurement.

The behavior of the normal distribution

In a normal distribution graph, the y-axis is the number of measurements with the value along the x-axis, so to get a smooth curve as shown, you need literally hundreds of measurements!

Fortunately, even though you will have few measurements, we can use the behavior of the normal distribution to check the quality of your data. For instance, we know that about 95% of the data points will be within two standard deviations of the mean.

The Q-test for excluding data

How do you know when you can omit a data point, even when you have no observational grounds for doing so? On a normal distribution plot of the data, a candidate point lies far from the mean and from the other points. At this point, apply the Q-test.

R. B. Dean and W. J. Dixon, “Simplified Statistics for Small Numbers of Observations,” Anal. Chem., 1951, 23 (4), 636–638.

Consider the following data points. Note that the 0.167 point seems to be far off from the rest of the points; is it an outlier that can be omitted?

Calculate the parameter Qcalculated by dividing the gap between the test point and the nearest point to it by the range between the high and low values of the data set. Note that the test point will be either the high or the low of the data set.

$$Q_{\text{calculated}} = \frac{\text{gap}}{\text{range}}$$

Table 3.3 (page 99) in the text has the Qtable values against which your Qcalculated can be compared.

In this course, we’ll be using the 90% confidence level (CL) criterion, which means that if Qcalculated > Qtable, then the outlier can be omitted.

Thus, in our example, N = 10, so Qtable = 0.412. Since Qcalculated = 0.455 > 0.412, we can omit the “0.167” point.

• If we used a 95% CL, then the point would not be omitted; higher confidence levels demand a higher cut-off criterion.
• If there were two fewer data points (N = 8), then the point would not be omitted.
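The Q-test calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `q_calculated` and `can_omit` are hypothetical, the ten data values are made up (the lecture's own data set is not reproduced here), and the only table value used is the Qtable = 0.412 for N = 10 at the 90% CL quoted in the text. For other N, look up the value in the course table rather than relying on this sketch.

```python
def q_calculated(values):
    """Return (suspect, Q) for the more isolated end of the data set.

    Q = gap / range, where gap is the distance from the suspect point
    (the high or low value) to its nearest neighbor.
    """
    if len(values) < 3:
        raise ValueError("the Q-test needs at least three data points")
    data = sorted(values)
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    gap_low = data[1] - data[0]      # gap if the lowest value is the suspect
    gap_high = data[-1] - data[-2]   # gap if the highest value is the suspect
    if gap_high >= gap_low:
        return data[-1], gap_high / data_range
    return data[0], gap_low / data_range

def can_omit(q_calc, q_table):
    """The suspect point may be omitted only if Qcalculated > Qtable."""
    return q_calc > q_table

# Hypothetical data set with N = 10 (illustrative values only)
data = [10.2, 10.4, 10.3, 10.5, 10.1, 10.3, 10.2, 10.4, 10.3, 11.2]
Q_TABLE_N10_90 = 0.412  # value quoted in the lecture for N = 10, 90% CL

suspect, q = q_calculated(data)
print(f"suspect = {suspect}, Qcalculated = {q:.3f}")
print("omit suspect:", can_omit(q, Q_TABLE_N10_90))

# The lecture's own comparison: Qcalculated = 0.455 against Qtable = 0.412
print("lecture example, omit:", can_omit(0.455, Q_TABLE_N10_90))
```

Passing Qtable in as an argument, rather than hard-coding a table, keeps the sketch from depending on any one version of the published Q values.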
Criteria for omitting a data point from your calculations

• If documented observations in your lab notebook show a procedural error for a particular measurement.
• If the Q-test on a particular data point determines that that point can be omitted.

Note: Do not keep applying the Q-test on the same data set; in other words, after omitting one point, do recalculate the mean and standard deviation, but do not apply the Q-test to another outlier.

Challenge problem

You collect the following data for an analysis:

12.4   12.1   11.8   13.8

What is the reportable average and RSD% for this data set? Your observations made in your lab notebook do not allow you to omit any data values; however, you suspect the 13.8 value can be omitted for statistically valid reasons. Apply the Q-test at the 90% confidence level to follow up on this suspicion, and determine whether the 13.8 value can or cannot be omitted.
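The single-application rule in the note above (omit at most one point, then recalculate) can be sketched as a small workflow. This is only an illustration under stated assumptions: `apply_q_test_once` and `summarize` are hypothetical names, both data sets are made up and are not the challenge-problem data, and the Qtable value of 0.412 is the one the lecture quotes for N = 10 at the 90% CL.

```python
import statistics

def apply_q_test_once(values, q_table):
    """Return the data with at most one outlier removed by a single Q-test."""
    data = sorted(values)
    data_range = data[-1] - data[0]
    if len(data) < 3 or data_range == 0:
        return data  # nothing to test
    gap_low = data[1] - data[0]
    gap_high = data[-1] - data[-2]
    # Test only the more isolated end of the data set
    if gap_high >= gap_low:
        index, gap = len(data) - 1, gap_high
    else:
        index, gap = 0, gap_low
    if gap / data_range > q_table:
        data.pop(index)  # omit the suspect; the test is NOT repeated
    return data

def summarize(values):
    """Return (mean, s, RSD%) recalculated from the retained values."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return mean, s, s / mean * 100

Q_TABLE_N10_90 = 0.412  # quoted in the lecture for N = 10, 90% CL

# Hypothetical data, N = 10, with one suspicious high value
run = [25.1, 25.3, 25.2, 25.4, 25.2, 25.3, 25.1, 25.2, 25.3, 26.4]
kept = apply_q_test_once(run, Q_TABLE_N10_90)
mean, s, rsd = summarize(kept)
print(f"kept {len(kept)} points: mean = {mean:.2f}, s = {s:.2f}, RSD% = {rsd:.2f}")

# Hypothetical data with two suspicious values: only one is removed
two_suspects = [1.0, 1.1, 1.0, 1.1, 1.0, 1.1, 1.0, 1.1, 3.0, 5.0]
kept_once = apply_q_test_once(two_suspects, Q_TABLE_N10_90)
print("after one Q-test:", kept_once)
```

The function deliberately has no loop, so a second suspicious value stays in the data set, which matches the rule stated in the note.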
