1. Data Processing

1.1 Introduction

Statistics is a collection of techniques for making sense of numbers: numbers which are the results of observations, tests or measurements. These numbers are also known as data. Figure 1.1 shows the range of processes that are part of the statistical treatment of data.

[FIGURE 1.1 Statistical processes. STATISTICS: Data Collection, Data Organisation, Data Interpretation, Data Presentation]

As someone working in a sector which is all about measuring things, you can’t escape the need to collect data, analyse it and use the results.

Statistical methods can also be classified on the basis of what use the processed data is put to:
• descriptive – the processed data provides a summary of the observations/measurements, such as averages, variations and graphs
• inferential – the processed data is used to make judgements or predictions, such as trends or indications of variations between different samples

CLASS EXERCISE 1.1
Below are some statements that quote statistics (not real ones!). Identify whether the statistics are descriptive or inferential.
(a) the average December temperature in Sydney has increased by 1°C in the last 50 years
(b) it is expected that the average December temperature in Sydney will increase by another 1°C within 25 years
(c) 25% of people surveyed at a shopping centre indicated that they were aware of increasing temperatures in Sydney
(d) a survey has shown that 75% of Sydneysiders are ignorant of the changing climatic conditions in their city

1.2 Samples and populations

The distinction between descriptive and inferential statistics leads on to another pair of terms that are perhaps the most significant you will meet in this chapter: population and sample. A population is the entire set of individual items about which observations are made, on which tests are performed and for which data is recorded. However, in most cases, the measurements will be made on a proportion of the population, not on the entire set, i.e. a sample.
CLASS EXERCISE 1.2
Identify the sample and the population in the following.
(a) a bottle of water is taken from a dam to be tested
    Sample:
    Population:
(b) the frog population of a large wetland is checked by looking at two separate hectares
    Sample:
    Population:
(c) the levels of lead in fallout around a smelter are assessed by testing a selection of properties
    Sample:
    Population:
(d) people in a shopping centre are asked their opinions on environmental issues in a study to determine the level of awareness in the community
    Sample:
    Population:

Descriptive statistics refer to the sample from which the data was collected, while inferential statistics make the assumption (one that is not always true) that the results from the sample can be applied to the population.

The relationship between sample and population is critical in making predictions and judgements based on statistics. The sample must be representative of the population: that is, its characteristics must be the same as those of the population. If not, the assumption above breaks down, and any decisions made are likely to be incorrect.

1.3 Variables

When data is collected, what is being measured is a particular characteristic of the item under examination: for example, lead concentration in water, people’s opinions on a certain topic, the number of trucks on a stretch of road etc. These characteristics are known as variables, because they do just that: vary from one individual in the population to another.

The type of data obtained from measuring a particular variable will depend on the variable itself (and also the test method). The two basic classes of variables are category and numerical. Category variables are where the result of a measurement is a “word”, such as yes (or no), truck, bird, sparrow, first (or second) etc. With numerical variables, the measurement produces a number, which could be limited to certain values (e.g. whole numbers – a colony count, for example) or could take any value (e.g. the mass of an object).
CLASS EXERCISE 1.3
Classify the following variables as category or numerical.
(a) lead levels in fallout
(b) types of birds observed
(c) numbers of birds observed in different locations

1.4 Presenting and organising data

There is nothing more confusing than a vast collection of numbers. Imagine you have been out collecting water quality data from ten sampling points every day for a month. If all you were to do with that large collection of data was to type it into Excel in one long column, or even ten slightly shorter columns, its direct usability would be almost nil! It is necessary to present it in a manner that reduces the volume of information, without completely losing sight of the individual sample values. This means:
• tabulating
• graphing
• averaging
• comparing

1.5 Tabulating data

Tabulating data means organising it so that it can be evaluated more easily, and generally means some sort of table. Category data is most usually grouped (tallied), so that the number of times each different category occurs becomes the final recorded result. Rather than write “bird” or “yes” 26 times, a mark is put against that category name each time it happens, as shown in Table 1.1.

TABLE 1.1 Tallied category data

Category   Yes                            No
Tally      ||||/ ||||/ ||||/ ||||/ ||||   ||||/ ||||/ ||||/ |
Total      24                             16

This can also be used where the data is numerical but with fixed and pre-known values, though normally this would only be where there is a large number of data points; there would be no reason to tally a set of five numbers. However, numerical data where the values can be anything, such as that in Table 1.2, presents a few problems.
TABLE 1.2 A large set of numerical data - raw

10.4  11.3  19.2   2.3  20.4  17.3  48.3  43.6  40.5  36.5
42.4  38.2  39.7  37.9  24.7   5.0  39.1  21.1  23.8  32.9
28.9  42.4  47.4  21.5  23.6  40.4  27.4  31.8  10.5  33.6
27.6  36.2   8.6  45.1  25.8  28.6  24.3  39.0  30.0  14.1
14.8  40.7   6.5  24.1   6.1   8.0  36.7  17.5   2.6  24.7

Firstly, the likelihood is that very few data values will repeat, so grouping into individual values simply means having lots of 1’s in the Total column. Secondly, if you use a range of values as a group, e.g. 0-5, then you lose information, as shown in Table 1.3. You can’t tell what the actual value of a 0-5 group item is. This is not the case with category data, where the original measurement value is retained. Therefore, with numerical data, you keep the original data for other uses (e.g. averages), and use the groups as a quick way of summarising it.

In general, no more than 20 groups (10 or fewer is preferred) should be used, otherwise too few data points will be in any one group (unless you have a very large dataset). When grouping data, you should:
• identify the minimum and maximum values
• decide how many groups are appropriate for the size of the dataset
• determine the groups (which should be equivalent ranges – for example, 0-5, 6-10 etc, but not 0-5, 6-20)

CLASS EXERCISE 1.4
You have a data set of 100 pH measurements of river water, ranging from 5 to 9. What would be an appropriate way of grouping them?

The number of times a particular value (or group of values) occurs is known as the frequency. Now that the data is tabulated, and collected together in groups, I’m sure you agree that it is easier to make some sense of than 100 numbers in a grid.

One feature of the data that comes into view when frequencies are tabulated is how the data is distributed among the group values. Some questions that could be asked of the frequency results include:
• Is it evenly spread across the groups?
• Do certain groups have higher frequencies?
• Is there any pattern?
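The grouping steps above can be sketched in Python as a cross-check on a hand tally. This is a minimal sketch only: it assumes the Table 1.2 values and a group width of 10 (an illustrative choice that gives five groups), and each group runs from its lower limit up to, but not including, the next lower limit. The function name group_frequencies is hypothetical and not part of these notes.

```python
import math

# Values from Table 1.2 (raw numerical data).
data = [
    10.4, 11.3, 19.2, 2.3, 20.4, 17.3, 48.3, 43.6, 40.5, 36.5,
    42.4, 38.2, 39.7, 37.9, 24.7, 5.0, 39.1, 21.1, 23.8, 32.9,
    28.9, 42.4, 47.4, 21.5, 23.6, 40.4, 27.4, 31.8, 10.5, 33.6,
    27.6, 36.2, 8.6, 45.1, 25.8, 28.6, 24.3, 39.0, 30.0, 14.1,
    14.8, 40.7, 6.5, 24.1, 6.1, 8.0, 36.7, 17.5, 2.6, 24.7,
]


def group_frequencies(values, width):
    """Return {lower limit: frequency} for groups [lower, lower + width)."""
    counts = {}
    for v in values:
        # Lower limit of the group this value falls into (assumed convention).
        lower = math.floor(v / width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


width = 10  # illustrative group width, not prescribed by the notes
freq = group_frequencies(data, width)

print("min:", min(data), "max:", max(data))
for lower, count in freq.items():
    print(f"{lower:>2} to under {lower + width}: {count}")
```

Printing the frequencies group by group makes it easier to judge whether the data is evenly spread or bunched in particular groups.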
The answers to these questions are very important in interpreting what the results actually mean. The spread of data across the range of values is known as the distribution and we will look more at this idea later.

The frequencies for each group or data value can be misleading if you don’t know anything about the entire dataset. For example, if someone tells you that the number of bicycles used for transport in a particular town is 1,000 (the frequency), you might think it was an impressive figure (unless you found out that the town has a population of 1,000,000). So the relative frequency – the frequency expressed as a proportion (often a percentage) of the total dataset – is more useful, because 5% means 5% regardless of the size of the sample.

\[
\text{Relative frequency} = \frac{\text{Frequency of one group}}{\text{Total number of data points for all groups}}
\]

For example, the relative frequency of bicycles in that town would be 1,000 ÷ 1,000,000 = 0.001 (or 0.1%).

Tallying functions in Excel

Excel has a number of functions which allow you to tally data. These include:
• COUNT – the number of cells in the selected range that contain numerical information
• COUNTIF – the number of cells in the selected range that contain values that meet a specified criterion
• FREQUENCY – an extended version of COUNTIF, which allows a range of data to be tallied into multiple groups (more complex to use)

Count [=COUNT(data range)]
This allows you to identify the number of numerical values in a set of data. It ignores cells which have no values or non-numerical entries.

Countif [=COUNTIF(data range, criteria)]
Not ideal for tallying many different groups, but a simple way of counting the cells that equal a particular value, or are above or below it, using =, > or <.

Frequency [=FREQUENCY(data range, grouping range)]
This function can tally data into user-chosen groups. Its formula is entered as an array formula.
This means you highlight a group of cells where you want the frequencies to appear, type in the formula and then hit the key combination CTRL+SHIFT+ENTER. The numbers in the grouping range define the maximum value for each group, e.g. in Example 1.1, the group values are 5 and 10, meaning that the tally ranges are 0-5 and 6-10. Any values above the largest group value are counted in one extra, final group.

EXAMPLE 1.1
The following spreadsheet shows the use of the tallying functions. The data is in Column A, the results of tallying in B.

Row   A      B     Formula or note for column B
1     10     7     =COUNT(A1:A9)
2     3      3     =COUNTIF(A1:A9,">5")
3     4
4     0
5     n/a
6     ***    5     values for groups 0-5, 6-10
7     7      10
8     8      4     =FREQUENCY(A1:A9,B6:B7)
9     2      3

CLASS EXERCISE 1.5
Familiarisation exercise with Count, Countif and Frequency
(a) Start Excel and type the formula =TRUNC(100*RAND()) into cell A1.
(b) Copy/drag this formula down to row 10, then across to column J. You now have 100 numbers between 0 and 99.
(c) Type any letter into one of these cells, and delete the contents of another.
(d) Type =COUNT(A1:J10) into cell A12. The result should be 98.
(e) Type =COUNTIF(A1:J10,">80") into cell A13.
(f) Type the numbers 40 and 80 in cells A14 and A15, respectively.
(g) Highlight the three empty cells A16:A18. With the cells still selected, type =FREQUENCY(A1:J10,A14:A15) and, while holding the Ctrl and Shift keys down, hit Enter.

In many situations, two or more variables of a sample set will be measured. How can you present this data in a table? If it is purely numerical data, then you probably can’t. Each set of data will have to be grouped/tallied/tabulated separately. If, however, the data is categorical, then extra information can be gained by tabulating it in a two-way frequency table (if you have two variables). Table 1.3 shows data collected for two category variables and presented in a two-way table.
TABLE 1.3 A two-way frequency table with two measurements on the same sample

                          Sex of koala
General state of health   Male   Female
Healthy                   45     28
Ill                       21     9

A related two-way table shows the same category variable as measured on two different data sets, as shown in Table 1.4.

TABLE 1.4 A two-way relative frequency table with the same measurement on two sample sets

                  Type of parkland
Origin of plant   Urban   Undeveloped
Native            37%     65%
Introduced        51%     20%
Not identified    12%     15%

In each case, if the two sets of data were tabulated separately, the data would do no more than inform. However, when they are combined, comparison becomes much easier.

1.6 Summarising the data

Measures of the typical value

Graphs can present the raw data in a more friendly form, but very frequently, a more “compressed” form of the data is required. What do we mean by compressed? The hundreds, thousands or millions of data values are represented by possibly one or two values. What we want is some measure of the typical data value - how often have you heard the term “the average man in the street”? Depending on the type of data collected, some measure of the variation in the data can also be useful. Summary statistics achieve these requirements – they process the many data values into a few summary values.

One of the main reasons for collecting sets of data is to get an idea of what a typical, or average, item is like. Variation in any normal process means that there will be a spread of values. The typical value takes into account the range of values and comes up with a result that reflects an “average”. Not that the average is necessarily a true picture. You’ve probably all heard of the typical family having 2.2 children!

So how do we measure this typical value? Is it the most common value? The middle value? Or something else? The answer is all three. There are three common measures of the typical value. All three could be called the average, so be careful when you use that term.
Which of these measures is used depends on the type of variable and also the type of data.

Category variables

The only useful measure of the typical value for category variables is the most common value – the one with the highest frequency. This is known as the mode, and is a very simple measure to find, since it will be the highest frequency, the tallest column or the largest pie slice, depending on how the data is presented.

CLASS EXERCISE 1.6
The following data was collected at a recycling point, where the number of items of different classes were tallied. Identify the mode.

Class                   Number
Steel cans              2123
Aluminium cans          991
Plastics (type R/1)     1502
Plastics (type 2)       1818
Glass bottles           3369
Milk cartons (paper)    456

The mode (or modal class) is:

Numerical variables

Here the range of options is wider. There are three types of “averages” which can be applied to numerical data:
1. Mode – as above; it is the most common value, but will only be applicable to grouped data
2. Mean – the measure that is called the average by most people; it is calculated by adding up all the individual values and dividing the sum by the total number of data points, and is often given the symbol μ or x̄
3. Median – the middle value when the data points are lined up in ascending order. If there is an even number of data points, then the median is midway between the middle two values; for example, the median value for the following 5 data values – 17, 18, 20, 20, 22 – is 20, while for the following 6 values – 17, 18, 18, 20, 20, 22 – it is 19 (halfway between 18 and 20)

In general, the mode and median are only really useful if at least 20 data values are involved, and don’t tend to be used much in scientific data analysis. The mean is almost always the measure meant when the term average is used for scientific data (that’s what Excel calls it).
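The three measures can be checked with Python's standard statistics module, used here as a stand-in for a spreadsheet rather than as part of these notes. The sketch below reuses the two small datasets from the median example above; the variable names are illustrative only.

```python
import statistics

# The two datasets from the median example above.
five_values = [17, 18, 20, 20, 22]
six_values = [17, 18, 18, 20, 20, 22]

# Mean: sum of the values divided by the number of values.
mean_five = statistics.mean(five_values)

# Median: middle value (odd count) or midway between the two
# middle values (even count).
median_five = statistics.median(five_values)
median_six = statistics.median(six_values)

# Mode: most common value; multimode lists every value tied for most common.
mode_five = statistics.mode(five_values)
modes_six = statistics.multimode(six_values)

print("mean of five values:", mean_five)
print("median of five values:", median_five)
print("median of six values:", median_six)
print("mode of five values:", mode_five)
print("modes of six values:", modes_six)
```

The six-value set has two values tied for most common, which illustrates why the mode is of limited use for small sets of numerical data.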
Warning
Your calculator will almost certainly have functions for mean and standard deviation (next section) as part of its scientific functions. We suggest that you DO NOT use them. Why not? For any more than a few data points, the chance of you entering a wrong data value into memory becomes fairly substantial. Since you can’t easily recall that data to check it, you get a wrong answer. Use of spreadsheet functions or the tally chart method below is definitely more reliable.

Excel function for the mean: AVERAGE(cell range for data)
The data can be spread across a number of separate cell ranges when calculating AVERAGE, e.g. =AVERAGE(A1:B3,C5:D8). The comma divider is essential; anything else won’t work.

Never use AVERAGE on grouped numerical data, only the original values!

Measures of data variation

The typical value – most often the mean – is a very powerful summary statistic, but does it tell the whole picture about a particular population? Consider the following two sets of numbers, which have similar means (5.5 and 5 respectively):
• 1, 5, 6 and 10
• 4.5, 5, 5 and 5.5

If these were daily pollution levels, and the recommended limit was 5.5, calculated as the mean over a four-day period, each dataset would pass. However, I think you would prefer to live in the area covered by the second set, rather than the first, because two out of the four daily readings in the first set were above the recommended limit. So, a measure of the variation in the sample set is also important. Category data cannot be summarised in this way.

There are two common measures of variation for numerical data:
1. Range - simply the difference between the highest and the lowest values in the sample set
2. Standard deviation - the square root of the variance

Calculation of the range is like the mode – simple and quick, but not necessarily very reliable. The range is drastically affected by extreme results, and is really only useful for very small datasets.
Only the highest and lowest values are used, so it is not representative of the whole data set. The standard deviation is the result of a more complicated calculation (though not if you use Excel), and is less affected by extremes.

The formula for the standard deviation (which you don’t need to know) is shown below, simply to show you how it works. Each data value is used in the calculation: essentially it is the average difference between the values and the mean.

\[
SD = \sqrt{\frac{\sum (v - \mu)^2}{n - 1}}
\]

where v is the data value, μ the mean and n the number of data values.

Excel functions for range and standard deviation
Again, separate ranges can be used, separated by commas. The formula then uses one of the following functions:
• Range: MAX(cell range for data) - MIN(cell range for data)
• Standard deviation: STDEV(cell range for data)

ASSIGNMENT 1 – SUMMARISING RAINFALL DATA
Go to the subject assignment webpage and click on the Assignment 1 link. This allows you to download an Excel file of Sydney rainfall data on one worksheet, and a series of tasks and questions on the other. Complete the tasks, answer the questions in the spaces provided, and then rename the file SIS1_yourname and email it to me.
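Returning to the measures of variation described in this section, a short Python sketch can serve as a cross-check on the Excel MAX, MIN and STDEV functions. It applies the standard deviation formula (with the n - 1 divisor, which is also what Excel's STDEV uses) to the two four-day pollution datasets quoted earlier. The helper name sample_sd is hypothetical, and Python's statistics.stdev is included only as an independent check.

```python
import math
import statistics


def sample_sd(values):
    """Standard deviation using the n - 1 divisor, as in the formula above."""
    mean = sum(values) / len(values)
    squared_differences = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_differences / (len(values) - 1))


# The two sets of daily pollution levels from the text.
first_set = [1, 5, 6, 10]
second_set = [4.5, 5, 5, 5.5]

for name, values in (("first", first_set), ("second", second_set)):
    data_range = max(values) - min(values)  # Excel: MAX(...) - MIN(...)
    sd = sample_sd(values)                  # Excel: STDEV(...)
    check = statistics.stdev(values)        # library cross-check
    print(f"{name} set: range = {data_range}, SD = {sd:.3f} (check {check:.3f})")
```

Both the range and the standard deviation come out much larger for the first set, which is the difference in variation that the mean alone does not show.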