Chapter 4: measures of location

L4_S1
When a set of data has been collected, the first thing we will want to do is to summarise that data. This can
be done with frequency distributions, as we discussed in the previous chapter on data types. However, we
often want a numerical summary of the data. These summaries are referred to as descriptive stats, and they
are divided into two categories: stats of location and stats of dispersion. Stats of location summarise the
central point of the data along a number line, and stats of dispersion summarise how the observations are
distributed about that central point.
You will remember that we said previously that we use descriptive stats to summarise the important
characteristics of a data set, and inferential stats to generalise about a greater population from that which
we observe in a smaller sample of that population.
In this chapter therefore, we will discuss a few different measures of location, or central tendency, as they
are sometimes known, and we will also look at the ways in which data are dispersed around the measures
of location or central tendency.
L4_S2
The first of the measures of location we will examine is the arithmetic mean, known to lay people as simply
“the average”. This represents the centre of the observations in a sample frequency distribution.
Calculating the mean is very simple. You simply sum, or add, all the observations, and then divide by the
number of observations.
If X is the letter we use to denote our sample variable, then X with a bar over it would represent the sample
mean of all of our sample observations.
Remember that we use different notation when talking about the sample, than we do when talking about a
population. We use Greek letters for population parameters, and Roman letters for sample statistics. The
sample mean, therefore, is designated as "x-bar", and the population mean as mu.
Since calculating the mean is so simple, and because it has other properties that are useful when it comes
to inferential stats, it is the most commonly reported statistic of location. One problem with the mean
though, is that extreme values will greatly influence its value.
L4_S3
In the example we have here, we have added all the observation values, in other words, the number of
hours each person slept, and have divided by the number of people involved in the sample, to get an
average of 7.07 hours of sleep per night, for this particular sample group.
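The individual observations on the slide are not reproduced here, so the sketch below uses a hypothetical set of sleep times whose mean also works out to about 7.07 hours, just to make the calculation concrete.

```python
# Hypothetical sleep data (hours per night), for illustration only;
# these are not the slide's actual observations.
sleep_hours = [6.5, 7.0, 8.0, 6.0, 7.5, 7.2, 7.3]

# The arithmetic mean: sum all observations, then divide by their count.
mean = sum(sleep_hours) / len(sleep_hours)
print(round(mean, 2))  # 7.07
```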
L4_S4
The second measure we will examine is called the median. This is defined as the value that has an equal
number of observations on either side of it. It divides the frequency distribution in half, relative to the
number of observations.
L4_S5
If there are an even number of observations, then no single observation fits the criterion of having as
many observations larger than it as there are smaller. In this case, the value must be
calculated by averaging the middle two observations. In the example we have here, the average of the
eighth and ninth observations was calculated.
You will also note that unlike the mean, the median is unaffected by a few very large or very small values.
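Both cases can be sketched in a few lines; the example data below are made up, and the even-count example also shows the median's resistance to an extreme value.

```python
def median(values):
    """Middle value of the sorted data; for an even count,
    the average of the two middle observations."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([1, 3, 5]))       # odd count: the middle value, 3
print(median([1, 3, 5, 100]))  # even count: (3 + 5) / 2 = 4.0; the outlier barely matters
```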
L4_S6
The last measure of location that we will examine is the mode. This is simply the most common
observation in the data. If two values tie as the most common, then the distribution is said to be bimodal
and it has two specific peaks. The mode is not often used as it contains very little useable information and
because of that, you seldom see it reported in scientific literature, although it is often interesting to report
the number of modes detected in a population or sample, if there is more than one.
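Since the slide mentions reporting the number of modes, here is a minimal sketch (with hypothetical data) that returns all values tied for the highest frequency, so a bimodal set yields two results.

```python
from collections import Counter

def modes(values):
    """All values tied for the highest frequency; more than one
    result indicates a multimodal (e.g. bimodal) distribution."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 3, 4, 5]))     # unimodal: [3]
print(modes([2, 2, 3, 4, 4, 5]))  # bimodal: [2, 4]
```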
L4_S7
The mean is by far the most commonly reported statistic. It is easy to work with mathematically, but the
disadvantage is that it is greatly affected by extreme values.
The median is less commonly reported, and the advantage is that it is not greatly affected by outliers.
The mode is rarely used since it does not convey much information about the set of data.
L4_S8
Besides the measures of location, such as the mean, of a sample, there is also a way in which to measure
how the observations are distributed around the mean. In other words, we want to know whether most of
the other observations lie close to the mean, or are they distributed far away from the mean? This
characteristic is called the dispersion of the population, and there are several ways in which it can be
calculated. As usual, most of the time we will be dealing with only a sample of the population, and so what
we will be calculating will be a sample statistic that estimates the population parameter that is actually the
measure of dispersion.
L4_S9
The first of these measures that we will examine, is called the range. This is simply the lowest value
subtracted from the highest value. It is then obviously greatly affected by any outliers, and gives very
little specific information about how the observations cluster around the mean. It is therefore a poor
estimator of the population, and is therefore seldom used. If it is reported, it should be reported together
with other measures of dispersion.
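The calculation, and its sensitivity to a single outlier, can be shown in two lines (the data here are invented for illustration):

```python
def data_range(values):
    """The range: lowest value subtracted from the highest value."""
    return max(values) - min(values)

print(data_range([4, 5, 6, 7]))      # 3
print(data_range([4, 5, 6, 7, 50]))  # 46 -- one outlier dominates the result
```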
L4_S10
The second of these measures that we will examine is the variance. This measure describes the
dispersion of the data about an estimate of central tendency, such as the mean. If the data points are all
close to the mean, then variability is low. If data points are dispersed widely around the mean, then
variability is high. If we want an estimate of dispersion about the mean, the first thing to do is to take each
data point in the sample, and subtract the mean. This quantifies the distance of each point from the mean.
To get an over-all picture of variability, we could sum these values, or scores. Unfortunately, this adds up
to zero.
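That cancellation is easy to demonstrate with any small sample (the numbers below are arbitrary): the deviations above the mean exactly offset those below it.

```python
data = [2.0, 4.0, 9.0]        # any sample will do
mean = sum(data) / len(data)  # 5.0 here

# Distance of each point from the mean, with sign.
deviations = [x - mean for x in data]
print(deviations)       # [-3.0, -1.0, 4.0]
print(sum(deviations))  # 0.0 -- positive and negative deviations cancel
```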
L4_S11
In order to calculate the variance therefore, we must first calculate the squared deviations of each
observation from the mean and then we must sum these values. This becomes then, the sample sum of
squares, which we commonly abbreviate to SS. This is a very important term, and will be used often. This
is then our first estimate of variability.
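Squaring each deviation before summing removes the cancellation. A minimal sketch, using the same arbitrary numbers as above:

```python
def sum_of_squares(values):
    """SS: the sum of squared deviations of each observation from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

print(sum_of_squares([2.0, 4.0, 9.0]))  # 9 + 1 + 16 = 26.0
```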
L4_S12
Next, we will divide the sample sum of squares by the sample size minus one, in order to get the sample
variance, which we denote as s squared. If we wanted the population variance, we would divide the
population sum of squares by the size of the population, and this is denoted by sigma squared. This is
often impossible though, and the best estimate of the population variance is to take the sample SS and
divide by the sample size minus one, as we’ve already described. The variance is also referred to as the
“mean square”.
Dividing the sample SS by sample size minus one yields an unbiased estimator of the population variance,
and the term (n-1) is called the degrees of freedom.
As the sum of squares can vary from zero to infinity, the variance itself can vary from zero to infinity. You
can never have a variance with a negative value.
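Putting the two steps together, the sample variance is the SS divided by the degrees of freedom, n minus one. A sketch with the same example numbers:

```python
def sample_variance(values):
    """s squared: the sample SS divided by the degrees of freedom (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return ss / (n - 1)

print(sample_variance([2.0, 4.0, 9.0]))  # 26.0 / 2 = 13.0
```

Python's standard library `statistics.variance` uses this same n - 1 divisor, while `statistics.pvariance` divides by n, mirroring the sample/population distinction described above.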
L4_S13
The sample variance is an excellent estimate of variability, but it has the square of the original units of the
data, which can be difficult to interpret. For example, if you have data in grams, the variance has the unit
“square grams”, and who knows what that is? The solution here is simple – just take the square root of the
sample variance, and this we call the standard deviation. Since the sample variance is s squared, the
standard deviation is symbolised as s.
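Taking the square root brings the measure back to the original units. Continuing the running example:

```python
import math

def sample_sd(values):
    """s: the square root of the sample variance, in the original units."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return math.sqrt(ss / (n - 1))

print(round(sample_sd([2.0, 4.0, 9.0]), 3))  # sqrt(13.0) = 3.606 (to 3 d.p.)
```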
L4_S14
Here we have two sets of data, with the distribution represented graphically as bar charts. They each have
the same number of observations, and the same sample mean, but the distribution of the data in each of
the data sets is clearly different, and one would not know that by looking at the sample mean only. But by
calculating the sample variance, and then also the standard deviation, we see immediately that the two
sets of data differ in terms of dispersion.
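The slide's two data sets are not reproduced here, but the effect is easy to recreate with two hypothetical sets that share a size and a mean yet differ in spread:

```python
import statistics

# Two hypothetical data sets: same size, same mean (5), different dispersion.
a = [4, 5, 5, 5, 6]
b = [1, 3, 5, 7, 9]

print(statistics.mean(a), statistics.mean(b))  # both 5: the means cannot tell them apart
print(statistics.stdev(a))                     # small: the data hug the mean
print(statistics.stdev(b))                     # large: the data are spread out
```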
L4_S15
Remember that the SS, Variance, and Standard Deviation quantities are all statistics – they are estimates
of population parameters. We generally use the formulas here for samples when working with data,
because we are generally working at the sample level, and seldom, if ever, at the population level. We do
however, need to be aware of the formulas for the population parameters.
L4_S16
Up till now, we have discussed the frequency distribution, or dispersion, of values of data. However, we
are also often concerned about the frequency distribution, or dispersion of statistics. For example, suppose
every person in this class went outside, stopped 20 random people and asked how tall they were. If there
were 15 people in this class, and they now each had a sample of n=20, there would be 20 x 15 = 300 data
points. We could construct a frequency distribution of the values of 300 data points. Or we could have
each person in the class calculate the mean of their particular set of data points. So we would have 15
different means, and we could prepare a frequency distribution of the values of the means – in other
words, how often did the mean occur? Since the mean is a statistic, this would be a frequency distribution,
or dispersion of the values of a sample statistic. Since this is a mouthful, we have a much shorter term that
we use: sampling distribution.
A sampling distribution is therefore a frequency distribution of the values of a sample statistic.
Next, we could calculate the standard deviation of our 15 means. We would use the formula we have used
up till now, except that our mean would be a “mean of the means”, and each x value would be one of the
15 means. We would symbolise the standard deviation of the means as s subscript x-bar.
The standard deviation of the values of a statistic is called the standard error. In our particular example, we
have calculated the standard error of the mean.
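The classroom exercise can be simulated. The sketch below invents the height data (drawn from an assumed normal distribution with mean 170 cm and SD 10 cm, purely for illustration): 15 "students" each collect a sample of 20, each computes a mean, and the standard deviation of those 15 means is the empirical standard error of the mean.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Hypothetical version of the exercise: 15 students each record
# the heights (cm) of 20 random people.
samples = [[random.gauss(170, 10) for _ in range(20)] for _ in range(15)]

# Each student computes the mean of their own sample: 15 means in all.
means = [statistics.mean(s) for s in samples]

# The standard deviation of the values of this statistic (the mean)
# is the standard error of the mean.
sem = statistics.stdev(means)
print(round(sem, 2))
```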