Data Processing 1 STATISTICS

1. Data Processing
1.1 Introduction
Statistics is a collection of techniques which makes sense of numbers - numbers which are the results
of observations, tests or measurements. These numbers are also known as data. Figure 1.1 shows the
range of processes that are part of the statistical treatment of data.
FIGURE 1.1 Statistical processes (data collection, data organisation, data interpretation and data
presentation)
As someone working in a sector which is all about measuring things, you can’t escape the need to
collect data, analyse it and use the results. Statistical methods can also be classified on the basis of
what use the processed data is put to:
• descriptive – the processed data provides a summary of the observations/measurements, such as
averages, variations and graphs
• inferential – the processed data is used to make judgements or predictions, such as trends,
indications of variations between different samples
CLASS EXERCISE 1.1
Below are some statements that quote statistics (not real ones!). Identify whether the statistics are
descriptive or inferential.
(a) the average December temperature in Sydney has increased by 1°C in the last 50 years
(b) it is expected that the average December temperature in Sydney will increase by another 1°C
within 25 years
(c) 25% of people surveyed at a shopping centre indicated that they were aware of increasing
temperatures in Sydney
(d) a survey has shown that 75% of Sydneysiders are ignorant of the changing climatic conditions
in their city
1.2 Samples and populations
The distinction between descriptive and inferential statistics leads on to another pair of terms that are
perhaps the most significant you will meet in this chapter: population and sample. A population is
the entire set of individual items about which observations are made, on which tests are performed
and for which data is recorded. However, in most cases, the measurements will be made on only a
portion of the population, i.e. a sample, not on the entire set.
CLASS EXERCISE 1.2
Identify the sample and the population in the following.
(a) a bottle of water is taken from a dam to be tested
Sample:
Population:
(b) the frog population of a large wetland is checked by looking at two separate hectares
Sample:
Population:
(c) the levels of lead in fallout around a smelter are assessed by testing a selection of properties
Sample:
Population:
(d) people in a shopping centre are asked their opinions on environmental issues in a study to
determine the level of awareness in the community
Sample:
Population:
Descriptive statistics refer to the sample from which the data was collected, while inferential statistics
make the assumption (one that is not always true) that the results from the sample can be applied to
the population. The relationship between sample and population is critical in making predictions and
judgements based on statistics. The sample must be representative of the population: that is, its
characteristics must be the same as those of the population. If not, the assumption above breaks
down, and any decisions made are likely to be incorrect.
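The idea of a sample standing in for a population can be tried out numerically. The following is only a minimal sketch: the population of 1,000 lead readings is made up (drawn from an assumed normal distribution), and the names `population` and `sample` are hypothetical. It draws a simple random sample and compares its mean with that of the whole population.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up lead readings (not real data).
rng = random.Random(42)
population = [rng.gauss(10.0, 2.0) for _ in range(1000)]

# A simple random sample of 50 items: every item is equally likely to be chosen.
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

A randomly drawn sample like this will usually give a mean close to the population mean. A sample taken from only one part of the population (for example, only the highest readings) would not.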
1.3 Variables
When data is collected, what is being measured is a particular characteristic of the item under
examination: for example, lead concentration in water, people’s opinions on a certain topic, the
number of trucks on a stretch of road etc. These characteristics are known as variables, because they
do just that: vary from one individual in the population to another. The type of data obtained from
measuring a particular variable will depend on the variable itself (and also the test method). The two
basic classes of variables are category and numerical.
Category variables are where the result of a measurement is a “word”, such as yes (or no), truck,
bird, sparrow, first (or second) etc. With numerical variables, the measurement produces a number,
which could be limited to certain values (e.g. whole numbers – a colony count, for example) or any
value (e.g. the mass of an object).
CLASS EXERCISE 1.3
Classify the following variables as category or numerical.
(a) lead levels in fallout
(b) types of birds observed
(c) numbers of birds observed in different locations
1.4 Presenting and organising data
There is nothing more confusing than a vast collection of numbers. Imagine you have been out
collecting water quality data from ten sampling points every day for a month. If all you were to do with
that large collection of data was to type it into Excel in one long column, or even ten slightly shorter
columns, its direct usability would be almost nil!
It is necessary to present it in a manner that reduces the volume of information, without
completely losing sight of the individual sample values. This means:
• tabulating
• graphing
• averaging
• comparing
1.5 Tabulating data
Tabulating data means organising it so that it can be evaluated more easily, and generally means some
sort of table. Category data is usually grouped (tallied), so that the number of times each
different category occurs becomes the final recorded result. Rather than write “bird” or “yes” 26
times, a mark is put against that category name each time it occurs, as shown in Table 1.1.
TABLE 1.1 Tallied category data
Category    Tally                       Total
Yes         |||| |||| |||| |||| ||||    24
No          |||| |||| |||| |            16
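Tallying can also be automated. The sketch below is an illustration only: the list of responses is made up, and `responses` and `tally` are hypothetical names. It uses Python's `collections.Counter` to count how often each category occurs and prints a simple tally chart.

```python
from collections import Counter

# Hypothetical survey responses, one entry per person (made up for illustration).
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# Counter tallies how many times each category appears.
tally = Counter(responses)

# Print category, tally marks and total, most frequent category first.
for category, total in tally.most_common():
    print(f"{category:<10}{'|' * total:<10}{total}")
```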
This can also be used where the data is numerical but with fixed and pre-known values, though
normally this would only be where there is a large number of data points; there would be no reason
to tally a set of five numbers.
However, numerical data where the values can be anything, such as that in Table 1.2, presents
a few problems.
TABLE 1.2 A large set of numerical data - raw
10.4  11.3  19.2   2.3  20.4  17.3  48.3  43.6  40.5  36.5
42.4  38.2  39.7  37.9  24.7   5.0  39.1  21.1  23.8  32.9
28.9  42.4  47.4  21.5  23.6  40.4  27.4  31.8  10.5  33.6
27.6  36.2   8.6  45.1  25.8  28.6  24.3  39.0  30.0  14.1
14.8  40.7   6.5  24.1   6.1   8.0  36.7  17.5   2.6  24.7
Firstly, the likelihood is that very few data values will repeat, so tallying individual values simply
means having lots of 1s in the Total column. If you use a range of values as a group, e.g. 0-5, then you
lose information, as shown in Table 1.3: you can’t tell what the actual value of an item in the 0-5 group
is. This is not the case with category data, where the original measurement value is retained. Therefore,
with numerical data, you keep the original data for other uses (e.g. averages), and use the groups as a
quick way of summarising it.
In general, no more than 20 groups (10 or fewer is preferred) should be used, otherwise too few
data points will be in any one group (unless you have a very large dataset). When grouping data, you
should:
• identify the minimum and maximum values
• decide how many groups are appropriate for the size of the dataset
• determine the groups (which should be equivalent ranges – for example, 0-5, 6-10 etc., but not
0-5, 6-20)
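The grouping steps can be sketched in code using the 50 values of Table 1.2. This is one possible approach, not the only one: the group width of 10 is an assumption chosen for illustration, and because the data has decimal places the groups are taken as "0 to under 10", "10 to under 20" and so on, so that no value falls between groups.

```python
# The 50 raw values from Table 1.2.
data = [
    10.4, 11.3, 19.2, 2.3, 20.4, 17.3, 48.3, 43.6, 40.5, 36.5,
    42.4, 38.2, 39.7, 37.9, 24.7, 5.0, 39.1, 21.1, 23.8, 32.9,
    28.9, 42.4, 47.4, 21.5, 23.6, 40.4, 27.4, 31.8, 10.5, 33.6,
    27.6, 36.2, 8.6, 45.1, 25.8, 28.6, 24.3, 39.0, 30.0, 14.1,
    14.8, 40.7, 6.5, 24.1, 6.1, 8.0, 36.7, 17.5, 2.6, 24.7,
]

# Step 1: identify the minimum and maximum values.
lowest, highest = min(data), max(data)
print(f"Minimum {lowest}, maximum {highest}")

# Step 2: decide how many groups to use (assumed here: 5 groups of width 10,
# which covers the whole range from 0 up to 50).
group_width = 10
n_groups = 5

# Step 3: tally each value into its group (0 to under 10, 10 to under 20, ...).
frequencies = [0] * n_groups
for value in data:
    index = min(int(value // group_width), n_groups - 1)
    frequencies[index] += 1

for i, frequency in enumerate(frequencies):
    low = i * group_width
    print(f"{low:>2} to under {low + group_width:<3} {frequency}")
```

The original list `data` is kept unchanged, so it remains available for calculations such as the mean.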
CLASS EXERCISE 1.4
You have a data set of 100 pH measurements of river water, ranging from 5 to 9. What would be an
appropriate way of grouping them?
The number of times a particular value (or group of values) occurs is known as the frequency.
Now that the data is tabulated, and collected together in groups, I’m sure you agree that it is
easier to make some sense of, compared to 100 numbers in a grid. One feature of the data that comes
into view when frequencies are tabulated is how the data is distributed among the group values. Some
questions that could be asked of the frequency results include:
• Is it evenly spread across the groups?
• Do certain groups have higher frequencies?
• Is there any pattern?
The answers to these questions are very important in interpreting what the results actually mean. The
spread of data across the range of values is known as the distribution and we will look more at this
idea later.
The frequencies for each group or data value can be misleading if you don’t know anything about
the entire dataset. For example, if someone tells you that the number of bicycles used for transport
in a particular town is 1,000 (the frequency), you might think it was an impressive figure (unless you
found out that the town has a population of 1,000,000). So the relative frequency – the frequency
expressed as a proportion (often a percentage) of the total dataset – is more useful, because 5%
means 5% regardless of the size of the sample.
$$\text{Relative frequency} = \frac{\text{Frequency of one group}}{\text{Total number of data points for all groups}}$$
For example, the relative frequency of bicycles in that town would be 1,000 ÷ 1,000,000 = 0.001 (or
0.1%).
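The same calculation can be written as a short function. This is a minimal sketch; the function name `relative_frequency` is hypothetical. It reproduces the bicycle figure and then applies the formula to the totals from Table 1.1.

```python
def relative_frequency(frequency, total):
    """Return the frequency as a proportion of the total number of data points."""
    return frequency / total

# The bicycle example: 1,000 bicycles in a town of 1,000,000 people.
bicycles = relative_frequency(1_000, 1_000_000)
print(f"{bicycles} = {bicycles:.1%}")

# The tallied totals from Table 1.1.
tallies = {"Yes": 24, "No": 16}
total = sum(tallies.values())
for category, frequency in tallies.items():
    print(f"{category}: {relative_frequency(frequency, total):.0%}")
```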
Tallying functions in Excel
Excel has a number of functions which allow you to tally data. These include:
• COUNT – the number of cells in the selected range that contain numerical information
• COUNTIF – the number of cells in the selected range that contain values that meet a specified
criterion
• FREQUENCY – an extended version of COUNTIF, which allows a range of data to be tallied into
multiple groups (more complex to use)
Count [=COUNT(data range)]
This allows you to identify the number of numerical values in a set of data. It ignores cells which have
no values or non-numerical entries.
Countif [=COUNTIF(data range, criteria)]
Not ideal for masses of different groups but a simple way of picking out a particular value, or above or
below a particular value with =, > or <.
Frequency [=FREQUENCY(data range, grouping range)]
This function can tally data into user-chosen groups. Its formula is entered as an array formula. This
means you highlight a group of cells where you want the frequencies to appear, type in the formula
and then hit the key combination CTRL+SHIFT+ENTER. The numbers in the grouping range define the
maximum value for each group, e.g. in Example 1.1, the group values are 5 and 10, meaning that the
tally ranges are 0-5 and 6-10. FREQUENCY also returns one extra value: the number of data values
above the highest group value.
EXAMPLE 1.1
The following spreadsheet shows the use of the tallying functions. The data is in Column A, the
results of tallying in B.
Column A (cells A1:A9): 10, 3, 4, 0, n/a, ***, 7, 8, 2

Column B:
Result    Formula or contents
7         =COUNT(A1:A9)
3         =COUNTIF(A1:A9,">5")
5, 10     values for groups 0-5 and 6-10 (cells B6:B7)
4, 3      =FREQUENCY(A1:A9,B6:B7)
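For comparison, the three tallies in Example 1.1 can be reproduced outside Excel. The sketch below is an assumed Python equivalent, not how Excel works internally; the variable names are hypothetical. The last slot of `frequencies` holds values above the highest group limit, matching the extra value that FREQUENCY returns.

```python
from bisect import bisect_left

# Column A from Example 1.1, including the two non-numerical entries.
column_a = [10, 3, 4, 0, "n/a", "***", 7, 8, 2]

# Like COUNT: keep only the numerical entries and count them.
numbers = [value for value in column_a if isinstance(value, (int, float))]
count = len(numbers)

# Like COUNTIF(A1:A9,">5"): count the numbers greater than 5.
count_over_5 = sum(1 for value in numbers if value > 5)

# Like FREQUENCY(A1:A9,B6:B7): each limit is the maximum value of its group.
upper_limits = [5, 10]
frequencies = [0] * (len(upper_limits) + 1)
for value in numbers:
    # bisect_left finds the first group whose upper limit is >= value.
    frequencies[bisect_left(upper_limits, value)] += 1

print(count)         # numerical values in the column
print(count_over_5)  # values above 5
print(frequencies)   # counts for 0-5, 6-10 and above 10
```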
CLASS EXERCISE 1.5
Familiarisation exercise with Count, Countif and Frequency
(a) Start Excel and type the formula =TRUNC(100*RAND()) into cell A1.
(b) Copy/drag this formula down 10 rows, then across 10 columns.
You now have 100 numbers between 0 and 99.
(c) Type any letter into one of these cells, and delete the contents of another.
(d) Type =COUNT(A1:J10) into cell A12. The result should be 98.
(e) Type =COUNTIF(A1:J10,">80") into cell A13.
(f) Type the numbers 40 and 80 in cells A14 and A15, respectively.
(g) Highlight the three empty cells A16:A18. With the cells still selected, type
=FREQUENCY(A1:J10,A14:A15) and while holding the Ctrl and Shift buttons down, hit Enter.
In many situations, two or more variables of a sample set will be measured. How can you present this
data in a table? If it is purely numerical data, then you probably can’t. Each set of data will have to be
grouped/tallied/tabulated separately.
If, however, the data is categorical, then extra information can be gained by tabulating it in a
two-way frequency table (if you have two variables). Table 1.3 shows data collected for two category
variables and presented in a two-way table.
TABLE 1.3 A two-way frequency table with two measurements on the same sample
                           Sex of koala
General state of health    Male    Female
Healthy                    45      28
Ill                        21      9
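A two-way frequency table can be built by tallying pairs of category values. The sketch below is illustrative only: the eight observations are made up (they are not the data behind Table 1.3), and the names `observations` and `cells` are hypothetical.

```python
from collections import Counter

# Hypothetical observations, one (state of health, sex) pair per koala.
# These are made up for illustration and are not the Table 1.3 data.
observations = [
    ("Healthy", "Male"), ("Healthy", "Female"), ("Ill", "Male"),
    ("Healthy", "Male"), ("Ill", "Female"), ("Healthy", "Female"),
    ("Healthy", "Male"), ("Ill", "Male"),
]

# Each distinct pair is one cell of the two-way table.
cells = Counter(observations)

rows = ["Healthy", "Ill"]
columns = ["Male", "Female"]

# Print the table: one row per state of health, one column per sex.
print(f"{'':<10}" + "".join(f"{column:<8}" for column in columns))
for row in rows:
    print(f"{row:<10}" + "".join(f"{cells[(row, column)]:<8}" for column in columns))
```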
A related two-way table shows the same category variable as measured on two different data sets, as
shown in Table 1.4.
TABLE 1.4 A two-way relative frequency table with the same measurement on two sample sets
                   Type of parkland
Origin of plant    Urban    Undeveloped
Native             37%      65%
Introduced         51%      20%
Not identified     12%      15%
In each case, if the two sets of data were tabulated separately, the data would do no more than inform.
However, when combined, it makes comparison so much easier.
1.6 Summarising the data
Measures of the typical value
Graphs can present the raw data in a more friendly form, but very frequently, a more “compressed”
form of the data is required. What do we mean by compressed? The hundreds, thousands or millions
of data values are represented by possibly one or two values. What we want is some measure of the
typical data value - how often have you heard the term “the average man in the street”? Depending
on the type of data collected, some measure of the variation in the data can also be useful. Summary
statistics achieve these requirements – they process the many data values into a few summary values.
One of the main reasons for collecting sets of data is to get an idea of what a typical, or average,
item is like. Variation in any normal process means that there will be a spread of values. The typical
value takes into account the range of values and comes up with a result that reflects an “average”. Not
that the average is necessarily a true picture. You’ve probably all heard of the typical family having 2.2
children!
So how do we measure this typical value? Is it the most common value? The middle value? Or
something else? The answer is all three. There are three common measures of the typical value. All
three could be called the average, so be careful when you use that term. Which of these measures is
used depends on the type of variable and also the type of data.
Category variables
The only useful measure of the typical value for category variables is the most common value – the
one with the highest frequency. This is known as the mode, and is a very simple measure, since it will
be the highest frequency, the tallest column or largest pie slice, depending on how the data is
presented.
CLASS EXERCISE 1.6
The following data was collected at a recycling point, where the numbers of items in different classes
were tallied. Identify the mode.
Class                   Number
Steel cans              2123
Aluminium cans          991
Plastics (type R/1)     1502
Plastics (type 2)       1818
Glass bottles           3369
Milk cartons (paper)    456
The mode (or modal class) is
Numerical variables
Here the range of options is wider. There are three types of “averages” which can be applied to
numerical data:
1. Mode – as above, it is the most common value, but will only be applicable to grouped data
2. Mean – the measure that is called the average by most people; it is calculated by adding up all the
individual values and dividing the sum by the total number of data points, and is often given the
symbol μ or x̄
3. Median – this is the middle value when the data points are lined up in ascending order. If there is an
even number of data points, then the median is midway between the middle two values; for
example, the median value for the following 5 data values – 17, 18, 20, 20, 22 – is 20, while for the
following 6 values – 17, 18, 18, 20, 20, 22 – it is 19 (halfway between 18 and 20)
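The median examples above can be checked with a few lines of code. This is a minimal sketch using Python's standard `statistics` module rather than Excel; the list names are hypothetical.

```python
import statistics

# The two small datasets used in the median examples.
five_values = [17, 18, 20, 20, 22]
six_values = [17, 18, 18, 20, 20, 22]

# Mean: the sum of the values divided by the number of values.
print(statistics.mean(five_values))

# Median: the middle value, or midway between the middle two values.
print(statistics.median(five_values))
print(statistics.median(six_values))

# Mode: multimode returns every value that shares the highest frequency.
print(statistics.multimode(five_values))
print(statistics.multimode(six_values))
```

The six-value set has two equally common values, which shows why the mode is of limited use for small sets of numerical data.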
In general, the mode and median are only really useful if at least 20 data values are involved, and don’t
tend to be used much in scientific data analysis. The mean is almost always what is meant when the
term average is used for scientific data (that is also what Excel calls it).
Warning
Your calculator will almost certainly have functions for mean and standard deviation (next section) as
part of its scientific functions. We suggest that you DO NOT use them.
Why not? For any more than a few data points, the chance of you entering a wrong data value
into memory becomes fairly substantial. Since you can’t easily recall that data to check it, you get a
wrong answer. Use of spreadsheet functions or the tally chart method below are definitely more
reliable.
Excel function for the mean: AVERAGE(cell range for data)
The data can be spread across a number of separate cell ranges when calculating AVERAGE, e.g.
=AVERAGE(A1:B3,C5:D8). The comma separator is essential; anything else won’t work.
Never use AVERAGE on grouped numerical data, only the original values!
Measures of data variation
The typical value – most often the mean – is a very powerful summary statistic, but does it tell the
whole story about a particular population? Consider the following two sets of numbers, which have
similar means (5.5 and 5, respectively):
• 1, 5, 6 and 10
• 4.5, 5, 5 and 5.5
If these were daily pollution levels, and the recommended limit was a mean of 5.5 calculated over a
four-day period, each dataset would pass. However, I think you would prefer to live in the area covered
by the second set, rather than the first, because two out of the four daily readings in the first set were
above the recommended limit. So, a measure of the variation in the sample set is also important.
Category data cannot be summarised in this way.
There are two common measures of variation for numerical data:
1. Range – simply the difference between the highest and the lowest values in the sample set
2. Standard deviation – the square root of the variance
Calculation of the range is like the mode – simple and quick, but not necessarily very reliable. The range
is drastically affected by extreme results, and is really only useful for very small datasets. Only the
highest and lowest values are used, so it is not representative of the whole data set.
The standard deviation is the result of a more complicated calculation (though not if you use
Excel), and is less affected by extremes. The formula for the standard deviation (which you don’t need
to know) is shown below, simply to show you how it works. Each data value is used in the
calculation: essentially it measures the typical difference between the values and the mean.
$$SD = \sqrt{\frac{\sum (v - \mu)^2}{n - 1}}$$
where v is the data value, μ the mean and n the number of data values.
Excel functions for range and standard deviation:
As with the mean, the data can be spread across separate cell ranges, separated by commas.
Range: MAX(cell range for data) - MIN(cell range for data)
Standard deviation: STDEV(cell range for data)
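To see how the two measures of variation behave, the sketch below applies them to the two four-value sets from the pollution example. It is an illustration only: the function name `standard_deviation` is hypothetical, and it follows the formula given earlier (dividing by n - 1). Python's `statistics.stdev`, which uses the same n - 1 divisor, is printed alongside as a cross-check.

```python
import math
import statistics

def standard_deviation(values):
    """Standard deviation using the n - 1 formula from the text."""
    mean = sum(values) / len(values)
    # Square each difference from the mean so negatives do not cancel out.
    squared_differences = [(value - mean) ** 2 for value in values]
    return math.sqrt(sum(squared_differences) / (len(values) - 1))

first_set = [1, 5, 6, 10]
second_set = [4.5, 5, 5, 5.5]

for name, values in (("First set", first_set), ("Second set", second_set)):
    value_range = max(values) - min(values)   # like MAX(...) - MIN(...)
    by_formula = standard_deviation(values)
    by_library = statistics.stdev(values)     # library cross-check
    print(f"{name}: range = {value_range}, SD = {by_formula:.3f} ({by_library:.3f})")
```

Both measures come out much larger for the first set, which matches the impression that its readings are far more spread out.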
ASSIGNMENT 1 – SUMMARISING RAINFALL DATA
Go to the subject assignment webpage and click on the Assignment 1 link. This allows you to
download an Excel file of Sydney rainfall data on one worksheet, and a series of tasks & questions on
the other.
Complete the tasks, answer the questions in the spaces provided, and then rename the file
SIS1_yourname and email it to me.