Chapter 4: measures of location

L4_S1
When a set of data has been collected, the first thing we will want to do is to summarise that data. This can
be done with frequency distributions, as we discussed in the previous chapter on data types. However, we
often want a numerical summary of the data. These summaries are referred to as descriptive stats, and they
are divided into two categories: stats of location and stats of dispersion. Stats of location summarise the
central point of the data along a number line, and stats of dispersion summarise how the observations are
distributed about that central point.
You will remember that we said previously that we use descriptive stats to summarise the important
characteristics of a data set, and inferential stats to generalise about a greater population from that which
we observe in a smaller sample of that population.
In this chapter therefore, we will discuss a few different measures of location, or central tendency, as they
are sometimes known, and we will also look at the ways in which data are dispersed around the measures
of location or central tendency.
L4_S2
The first of the measures of location we will examine is the arithmetic mean, known to lay people as simply
“the average”. This represents the centre of the observations in a sample frequency distribution.
Calculating the mean is very simple. You simply sum, or add, all the observations, and then divide by the
number of observations.
If X is the letter we use to denote our sample variable, then X with a bar over it would represent the sample
mean of all of our sample observations.
Remember that we use different notation when talking about the sample, than we do when talking about a
population. We use Greek letters for population parameters, and Roman letters for sample statistics. The
sample mean, therefore, is designated as "x-bar", and the population mean as mu.
Since calculating the mean is so simple, and because it has other properties that are useful when it comes
to inferential stats, it is the most commonly reported statistic of location. One problem with the mean
though, is that extreme values will greatly influence its value.
L4_S3
In the example we have here, we have added all the observation values, in other words, the number of
hours each person slept, and have divided by the number of people involved in the sample, to get an
average of 7.07 hours of sleep per night, for this particular sample group.
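The individual observations on the slide are not reproduced here, so the sketch below uses a hypothetical set of sleep times whose mean also works out to about 7.07 hours, just to make the calculation concrete.

```python
# Hypothetical sleep data (hours per night), for illustration only;
# these are not the slide's actual observations.
sleep_hours = [6.5, 7.0, 8.0, 6.0, 7.5, 7.2, 7.3]

# The arithmetic mean: sum all observations, then divide by their count.
mean = sum(sleep_hours) / len(sleep_hours)
print(round(mean, 2))  # 7.07
```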
L4_S4
The second measure we will examine is called the median. This is defined as the value that has an equal
number of observations on either side of it. It divides the frequency distribution in half, relative to the
number of observations.
L4_S5
If there are an even number of observations, then no single observation fits the criterion of having as
many observations larger than it as there are smaller. In this case, the value must be
calculated by averaging the middle two observations. In the example we have here, the average of the
eighth and ninth observations was calculated.
You will also note that unlike the mean, the median is unaffected by a few very large or very small values.
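Both cases can be sketched in a few lines; the example data below are made up, and the even-count example also shows the median's resistance to an extreme value.

```python
def median(values):
    """Middle value of the sorted data; for an even count,
    the average of the two middle observations."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([1, 3, 5]))       # odd count: the middle value, 3
print(median([1, 3, 5, 100]))  # even count: (3 + 5) / 2 = 4.0; the outlier barely matters
```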
L4_S6
The last measure of location that we will examine is the mode. This is simply the most common
observation in the data. If two values tie as the most common, then the distribution is said to be bimodal
and it has two specific peaks. The mode is not often used as it contains very little useable information and
because of that, you seldom see it reported in scientific literature, although it is often interesting to report
the number of modes detected in a population or sample, if there is more than one.
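Since the slide mentions reporting the number of modes, here is a minimal sketch (with hypothetical data) that returns all values tied for the highest frequency, so a bimodal set yields two results.

```python
from collections import Counter

def modes(values):
    """All values tied for the highest frequency; more than one
    result indicates a multimodal (e.g. bimodal) distribution."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 3, 4, 5]))     # unimodal: [3]
print(modes([2, 2, 3, 4, 4, 5]))  # bimodal: [2, 4]
```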
L4_S7
The mean is by far the most commonly reported statistic. It is easy to work with mathematically, but the
disadvantage is that it is greatly affected by extreme values.
The median is less commonly reported, and the advantage is that it is not greatly affected by outliers.
The mode is rarely used since it does not convey much information about the set of data.
L4_S8
Besides the measures of location, such as the mean, of a sample, there is also a way in which to measure
how the observations are distributed around the mean. In other words, we want to know whether most of
the other observations lie close to the mean, or are they distributed far away from the mean? This
characteristic is called the dispersion of the population, and there are several ways in which it can be
calculated. As usual, most of the time we will be dealing with only a sample of the population, and so what
we will be calculating will be a sample statistic that estimates the population parameter that is actually the
measure of dispersion.
L4_S9
The first of these measures that we will examine, is called the range. This is simply the lowest value
subtracted from the highest value. It is then obviously greatly affected by any outliers, and gives very
little specific information about how the observations cluster around the mean. It is therefore a poor
estimator of the population, and is therefore seldom used. If it is reported, it should be reported together
with other measures of dispersion.
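The calculation, and its sensitivity to a single outlier, can be shown in two lines (the data here are invented for illustration):

```python
def data_range(values):
    """The range: lowest value subtracted from the highest value."""
    return max(values) - min(values)

print(data_range([4, 5, 6, 7]))      # 3
print(data_range([4, 5, 6, 7, 50]))  # 46 -- one outlier dominates the result
```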
L4_S10
The second of these measures that we will examine is the variance. This measure describes the
dispersion of the data about an estimate of central tendency, such as the mean. If the data points are all
close to the mean, then variability is low. If data points are dispersed widely around the mean, then
variability is high. If we want an estimate of dispersion about the mean, the first thing to do is to take each
data point in the sample, and subtract the mean. This quantifies the distance of each point from the mean.
To get an over-all picture of variability, we could sum these values, or scores. Unfortunately, this adds up
to zero.
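That cancellation is easy to demonstrate with any small sample (the numbers below are arbitrary): the deviations above the mean exactly offset those below it.

```python
data = [2.0, 4.0, 9.0]        # any sample will do
mean = sum(data) / len(data)  # 5.0 here

# Distance of each point from the mean, with sign.
deviations = [x - mean for x in data]
print(deviations)       # [-3.0, -1.0, 4.0]
print(sum(deviations))  # 0.0 -- positive and negative deviations cancel
```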
L4_S11
In order to calculate the variance therefore, we must first calculate the squared deviations of each
observation from the mean and then we must sum these values. This becomes then, the sample sum of
squares, which we commonly abbreviate to SS. This is a very important term, and will be used often. This
is then our first estimate of variability.
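Squaring each deviation before summing removes the cancellation. A minimal sketch, using the same arbitrary numbers as above:

```python
def sum_of_squares(values):
    """SS: the sum of squared deviations of each observation from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

print(sum_of_squares([2.0, 4.0, 9.0]))  # 9 + 1 + 16 = 26.0
```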
L4_S12
Next, we will divide the sample sum of squares by the sample size minus one, in order to get the sample
variance, which we denote as s squared. If we wanted the population variance, we would divide the
population sum of squares by the size of the population, and this is denoted by sigma squared. This is
often impossible though, and the best estimate of the population variance is to take the sample SS and
divide by the sample size minus one, as we’ve already described. The variance is also referred to as the
“mean square”.
Dividing the sample SS by sample size minus one yields an unbiased estimator of the population variance,
and the term (n-1) is called the degrees of freedom.
As the sum of squares can vary from zero to infinity, the variance itself can vary from zero to infinity. You
can never have a variance with a negative value.
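Putting the two steps together, the sample variance is the SS divided by the degrees of freedom, n minus one. A sketch with the same example numbers:

```python
def sample_variance(values):
    """s squared: the sample SS divided by the degrees of freedom (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return ss / (n - 1)

print(sample_variance([2.0, 4.0, 9.0]))  # 26.0 / 2 = 13.0
```

Python's standard library `statistics.variance` uses this same n - 1 divisor, while `statistics.pvariance` divides by n, mirroring the sample/population distinction described above.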
L4_S13
The sample variance is an excellent estimate of variability, but it has the square of the original units of the
data, which can be difficult to interpret. For example, if you have data in grams, the variance has the unit
“square grams”, and who knows what that is? The solution here is simple – just take the square root of the
sample variance, and this we call the standard deviation. Since the sample variance is s squared, the
standard deviation is symbolised as s.
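Taking the square root brings the measure back to the original units. Continuing the running example:

```python
import math

def sample_sd(values):
    """s: the square root of the sample variance, in the original units."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return math.sqrt(ss / (n - 1))

print(round(sample_sd([2.0, 4.0, 9.0]), 3))  # sqrt(13.0) = 3.606 (to 3 d.p.)
```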
L4_S14
Here we have two sets of data, with the distribution represented graphically as bar charts. They each have
the same number of observations, and the same sample mean, but the distribution of the data in each of
the data sets is clearly different, and one would not know that by looking at the sample mean only. But by
calculating the sample variance, and then also the standard deviation, we see immediately that the two
sets of data differ in terms of dispersion.
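The slide's two data sets are not reproduced here, but the effect is easy to recreate with two hypothetical sets that share a size and a mean yet differ in spread:

```python
import statistics

# Two hypothetical data sets: same size, same mean (5), different dispersion.
a = [4, 5, 5, 5, 6]
b = [1, 3, 5, 7, 9]

print(statistics.mean(a), statistics.mean(b))  # both 5: the means cannot tell them apart
print(statistics.stdev(a))                     # small: the data hug the mean
print(statistics.stdev(b))                     # large: the data are spread out
```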
L4_S15
Remember that the SS, Variance, and Standard Deviation quantities are all statistics – they are estimates
of population parameters. We generally use the formulas here for samples when working with data,
because we are generally working at the sample level, and seldom, if ever, at the population level. We do
however, need to be aware of the formulas for the population parameters.
L4_S16
Up till now, we have discussed the frequency distribution, or dispersion, of values of data. However, we
are also often concerned about the frequency distribution, or dispersion of statistics. For example, suppose
every person in this class went outside, stopped 20 random people and asked how tall they were. If there
were 15 people in this class, and they now each had a sample of n=20, there would be 20 x 15 = 300 data
points. We could construct a frequency distribution of the values of 300 data points. Or we could have
each person in the class calculate the mean of their particular set of data points. So we would have 15
different means, and we could prepare a frequency distribution of the values of the means – in other
words, how often did the mean occur? Since the mean is a statistic, this would be a frequency distribution,
or dispersion of the values of a sample statistic. Since this is a mouthful, we have a much shorter term that
we use: sampling distribution.
A sampling distribution is therefore a frequency distribution of the values of a sample statistic.
Next, we could calculate the standard deviation of our 15 means. We would use the formula we have used
up till now, except that our mean would be a “mean of the means”, and each x value would be one of the
15 means. We would symbolise the standard deviation of the means as s subscript x-bar.
The standard deviation of the values of a statistic is called the standard error. In our particular example, we
have calculated the standard error of the mean.
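The classroom exercise can be simulated. The sketch below invents the height data (drawn from an assumed normal distribution with mean 170 cm and SD 10 cm, purely for illustration): 15 "students" each collect a sample of 20, each computes a mean, and the standard deviation of those 15 means is the empirical standard error of the mean.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Hypothetical version of the exercise: 15 students each record
# the heights (cm) of 20 random people.
samples = [[random.gauss(170, 10) for _ in range(20)] for _ in range(15)]

# Each student computes the mean of their own sample: 15 means in all.
means = [statistics.mean(s) for s in samples]

# The standard deviation of the values of this statistic (the mean)
# is the standard error of the mean.
sem = statistics.stdev(means)
print(round(sem, 2))
```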