3 Data Analysis

The summarising, tabulating and graphing of data provides a picture of what is happening within the measurements. However, looking beyond the collected data is often just as important. This is the difference between descriptive and inferential statistics mentioned in Chapter 1.

3.1 Trends and patterns

One common and valuable type of data is obtained when a measurement is repeated over an extended time or distance, so that the change in the value as a function of time or distance can be observed. This leads to the observation of trends or patterns in the data.

CLASS EXERCISE 3.1
What is the difference between a trend and a pattern? Give an example of each.

Trend

Example

Pattern

Example

These recognisable changes in the data can occur in a time context, where they might be rapid – for example, the change in pH in a river after a chemical spill – or take place over a long period – for example, global warming. They can also occur as a function of distance, or of any other variable you care to think of.

CLASS EXERCISE 3.2
The graph on the following page shows data collected over a number of years, plotted by (i) joining the dots (solid line) and (ii) a line of best fit (dotted line). Consider the following questions:
• which is the more useful graph – join the dots or line of best fit?
• can you see any trend?
• what effect would it have on your certainty if you only had data from 2001, 2005, 2009 and 2013?
• what value do you think might occur in 2015? in 2030?

Time Series

The join-the-dots graph in Exercise 3.2 is known as a time series: a line graph where a measured variable is plotted against time. It is best if the time interval between successive data points remains constant, though this is not essential.
A time series can help identify patterns, which might be:
• a trend
• a repeating cycle
• random fluctuation
• a combination of all three

However, the natural variation in the measurements may cause the pattern to be blurred by "noise", where there is so much up-and-down variation between successive points that it is not possible to see the wood for the trees. Figure 3.1 shows such an example.

FIGURE 3.1 A time series with substantial variation

Is there a trend in this data? If you look carefully, you might believe that there is a small rise, but you would not feel very confident about drawing a line of best fit through the data. What needs to be done is to smooth the data – to clear away some of the noise so that the real pattern becomes clear. Figure 3.2 shows the same data as Figure 3.1, but after a smoothing procedure has been applied.

FIGURE 3.2 The time series from Figure 3.1 with smoothing

Smoothing means the loss of the raw data, which is no small matter. Any smoothed data should be clearly labelled as such. You should not report smoothed data as raw data!

There are a variety of smoothing methods, some more complicated than others. We will look at one: the running means method.

THE RUNNING MEANS METHOD

This method requires that there are no gaps in the data – that each time interval between data points is the same. The value plotted is the mean of a number of successive data points; how many is a matter of choice, but three and five are common. We will choose three here for simplicity.

EXAMPLE 3.1
Smooth the following data using 3-point running means.
Data    Smoothing Calculation        Smoothed Value
2.2     (2.2 + 1.5) ÷ 2 *            1.85
1.5     (2.2 + 1.5 + 2.1) ÷ 3        1.93
2.1     (1.5 + 2.1 + 3.5) ÷ 3        2.37
3.5     (2.1 + 3.5 + 1.6) ÷ 3        2.40
1.6     (3.5 + 1.6 + 2.8) ÷ 3        2.63
2.8     (1.6 + 2.8 + 3.6) ÷ 3        2.67
3.6     (2.8 + 3.6 + 3.2) ÷ 3        3.20
3.2     (3.6 + 3.2 + 3.1) ÷ 3        3.30
3.1     (3.2 + 3.1 + 2.9) ÷ 3        3.07
2.9     (3.1 + 2.9) ÷ 2 *            3.00

* The end values are calculated slightly differently because they don't have another point on the "other" side.

The greater the number of points included in the running mean, the greater the amount of smoothing. Figure 3.3 shows the effect on the data of Figure 3.1 of using 5-point running means.

FIGURE 3.3 The time series from Figure 3.1 with 5-point running mean smoothing

SEASONAL PATTERNS

You have probably heard the term "seasonally adjusted" when unemployment figures are released. Many measurements have a variation that recurs at particular times of the year. January unemployment figures are always high because of the addition of school-leavers. Dissolved oxygen levels in a river will vary with changes in the atmospheric temperature. Seeing any trend in such figures is made difficult by the seasonal variation. Various methods – most of them quite complex – are used to remove the seasonal component of the data. These are beyond the scope of this course.

Linear regression

This sounds complex and difficult. And it is, if you don't have a computer with a statistical or spreadsheet program handy. We will assume that you do, so most of the horrible calculations are avoided. Linear regression involves the calculation of a line of best fit that links the x- and y-values for a number of data points. In its simplest form, it calculates the slope (m) and y-intercept (b) of a straight line – y = mx + b – which is of course what you use for working out sample concentrations from your calibration graphs.
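As an illustration of what the spreadsheet does for you, here is a minimal sketch of the least-squares calculation of m and b. The calibration-style data (concentration vs absorbance) is made up purely for illustration.

```python
def least_squares(xs, ys):
    """Least-squares slope (m) and intercept (b) for y = mx + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

# Hypothetical calibration data: concentration (x) vs absorbance (y)
conc = [0.0, 2.0, 4.0, 6.0, 8.0]
absorb = [0.01, 0.20, 0.41, 0.59, 0.80]
m, b = least_squares(conc, absorb)
print(round(m, 4), round(b, 4))
```

An unknown sample's concentration then comes from rearranging the line: x = (y − b) ÷ m, just as you do when reading off a calibration graph.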
In Exercise 3.2, you did this by guesswork – not a very reliable method, I'm sure you will agree.

USING THE LINE OF BEST FIT

One of the reasons for producing a line of best fit is to determine x- or y-values where no data exists (for example, the fallout levels for 2030 in the earlier exercise). This is known as interpolation or extrapolation.

CLASS EXERCISE 3.3
What is the difference between interpolation and extrapolation?

The values obtained from interpolation and extrapolation are not perfect. The error in the y-value depends on how far you are extrapolating the data: the closer the data is to the extrapolation point, the lower the error, as shown in Figure 3.4.

FIGURE 3.4 Error in extrapolation from the line of best fit (showing the line of best fit, the error limits in the line of best fit, and the error in the y-estimate)

Statistical methods exist for estimating the error involved in an extrapolation.

3.2 Statistical tests

These are mathematical processes for making objective judgements and comparisons about data sets. They are not 100% accurate, and can only make an educated guess with an inbuilt error. They allow you to "answer" questions such as:
• is the average December temperature in Sydney over the last ten years significantly hotter than 50 years ago?
• has the level of pollution dropped significantly since the installation of new equipment?
• does this batch of Corn Flakes meet the standard for protein content?

The uncertainty comes about because there is inbuilt error and variation in measurement due to sampling and other factors. For example, looking at the last question, let's say that the sample gives an average protein value of 7.92 %w/w with a variation (standard deviation) of 0.1 %w/w. The standard range is 7.9–8.0. While the average is within the standard range, the variation takes some of the sample outside it. A statistical test can identify whether it is likely that all the Corn Flakes in this batch meet the standard or not.
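The Corn Flakes question can be made concrete. If we assume, purely for illustration, that the protein values are roughly normally distributed, the fraction of the batch expected to fall inside the 7.9–8.0 standard can be estimated with the standard library:

```python
from statistics import NormalDist

# Assumed normal distribution of protein content (%w/w),
# using the sample mean 7.92 and standard deviation 0.1 from the text.
protein = NormalDist(mu=7.92, sigma=0.1)

# Fraction of the batch expected inside the 7.9-8.0 standard range
within = protein.cdf(8.0) - protein.cdf(7.9)
print(round(within, 2))
```

Under this assumption only about a third of the batch lies inside the range, even though the mean itself complies – exactly the kind of question a formal statistical test is designed to settle.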
A useful example of a statistical test: the outliers test

Outliers are data points in a set that seem so different from the rest that they don't belong and should be deleted. Leaving them in the data set would change the mean and standard deviation. However, unless the measurement process for the suspect point is known to have a problem, you should not simply remove it on a whim. A simple procedure based on statistics exists to eliminate the point objectively.

EXAMPLE 3.2
Consider the dissolved oxygen (DO) levels below, which were collected at points around a dam site for the purpose of providing a typical measure of the DO.

9.21  9.10  9.13  8.99  9.05  9.05  9.25  4.28  8.95  9.22

The 4.28 value is so different from the others that the question is: should it be included in the value for the typical DO level? The table below shows the effect of the outlier on the mean and standard deviation.

                   Mean    SD
With outlier       8.62    1.53
Without outlier    9.11    0.11

The effect is obvious, and it may be that the low value was a consequence of an instrumental or operator error, rather than a very polluted pocket of water. Regardless of the cause, a mean of 8.62 doesn't really reflect the typical DO for that water, so it may be that the point should be omitted.

Removing data should not be done without serious consideration of the intention of the data collection. If the purpose is to look for problems, then you obviously don't eliminate these points. Even if you are expecting all the values to be close together, it is not appropriate simply to cross a value out (or, even worse, not record it at all).

The Q-test for testing outliers

This is a simple statistical test. The process is:
1. calculate a test value (Q in this case – see the equation below) based on the data
2. compare this value to a table of values
3.
make a judgement on the basis of the comparison

The test value (Q) is the difference between the suspected outlier and the nearest other value, divided by the range of the data, as shown in the equation below.

Q = | vo – vn | ÷ r

where vo is the value of the outlier, vn the value of the nearest data point, and r the range. The | | symbols mean that the difference is always taken as positive.

The test value of Q is then compared to a table of values which have been calculated on the basis of statistical probabilities. Table 3.1 below gives a selection of these values.

Rule for discarding an outlier: if the value of Q is greater than the table value, the outlier can be deleted.

TABLE 3.1 Q-test table values

Number of data points    Table Value
5                        0.73
6                        0.64
7                        0.59
8                        0.54
9                        0.51
10                       0.48

Statistics can measure the likelihood that you are making an error with the test. In this case, there is a 4% chance that you will discard a valid point or keep an invalid one.

EXAMPLE 3.3
Can the 4.28 point from the DO data in Example 3.2 be eliminated?

From the DO data, the value of Q is calculated to be:

Q = (8.95 – 4.28) ÷ (9.25 – 4.28) = 0.94

The number of data points is 10, so the table value is 0.48. Since the test value of Q is greater than this, we can safely delete the 4.28 data point from our calculations.

EXERCISE 3.4
Identify the outliers in the following data sets, and determine whether they can be deleted.
(a) 15, 22, 18, 6, 25, 19
(b) 0.75, 0.83, 0.53, 0.82, 0.76, 0.81, 0.69, 1.03
(c) 41.5, 46.2, 41.6, 42.0, 41.1, 42.1
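The three-step Q-test procedure is easy to automate. The sketch below is my own minimal implementation (the function name and the choice to test whichever extreme value sits furthest from its neighbour are assumptions, not part of the text); it uses the table values above, applied to the DO data of Example 3.2.

```python
# Table 3.1 values (covers 5 to 10 data points only)
Q_TABLE = {5: 0.73, 6: 0.64, 7: 0.59, 8: 0.54, 9: 0.51, 10: 0.48}

def q_test(data):
    """Q-test for a single suspected outlier.

    Tests whichever extreme value (lowest or highest) is furthest
    from its nearest neighbour, and returns the suspect value, Q,
    and whether the rule says it can be discarded.
    """
    s = sorted(data)
    r = s[-1] - s[0]                 # range of the data
    q_low = (s[1] - s[0]) / r        # Q if the lowest value is the suspect
    q_high = (s[-1] - s[-2]) / r     # Q if the highest value is the suspect
    if q_low >= q_high:
        suspect, q = s[0], q_low
    else:
        suspect, q = s[-1], q_high
    return suspect, round(q, 2), q > Q_TABLE[len(data)]

do = [9.21, 9.10, 9.13, 8.99, 9.05, 9.05, 9.25, 4.28, 8.95, 9.22]
print(q_test(do))  # the 4.28 point, Q = 0.94, discard = True
```

This reproduces Example 3.3: Q = 0.94 exceeds the table value of 0.48 for ten points, so the 4.28 reading can be discarded.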