Describing Numerical Data

1. Percentages

A percentage is simply a proportion multiplied by 100. Percentages make proportions easier to express and compare, and are used in three main ways:

i) To indicate the size of a subgroup, e.g. 20% of cars sold in 1996 were Fords.
ii) To express a change over time, e.g. salaries increased by 5 per cent between 2003 and 2004.
iii) To express a change between two percentages, e.g. interest rates rose by 1 percentage point.

For percentages to be useful, it is vital that the actual figures behind them are also quoted. It is also important to appreciate the difference between a percentage change and a percentage point change. The following example illustrates the three measures given above.

Example 1: Weekly Shopping Bill and Salary

                                  Week 1                Week 2
Weekly shopping bill              £30                   £50
Weekly salary                     £120                  £120
% of salary spent on shopping     30/120*100 = 25%      50/120*100 = 41.7%

a) In week 1 the shopping bill took 25% of the salary; in week 2 it took just under 42% of the salary.

b) The percentage increase, between weeks 1 and 2, in the amount of salary spent on shopping was:

   (50 - 30)/30 * 100 = 66.7%

c) The percentage point change between weeks 1 and 2 was:

   41.7% - 25.0% = 16.7 percentage points

All three measures are equally valid and accurately represent the change in spending between week 1 and week 2. The measure used depends on the point being made and the statistic involved. If the statistic is itself a percentage, such as the employment or unemployment rate (the number unemployed as a percentage of all working-age people), it is normally the percentage point difference between two rates that is of interest. If the interest lies in comparing the numbers of unemployed people (the level of unemployment), then a simple percentage change is appropriate.
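The calculations in Example 1 can be reproduced with a few lines of code. The Python sketch below is purely illustrative (the variable names are my own); it computes the proportion of salary spent in each week, the percentage change in spending and the percentage point change.

    # Illustrative sketch of the calculations in Example 1
    bill_week1, bill_week2 = 30, 50          # weekly shopping bills (£)
    salary = 120                             # weekly salary (£), the same in both weeks

    share_week1 = bill_week1 / salary * 100  # 25.0% of salary spent on shopping
    share_week2 = bill_week2 / salary * 100  # 41.7% of salary spent on shopping

    pct_change = (bill_week2 - bill_week1) / bill_week1 * 100  # 66.7% increase in spending
    pct_point_change = share_week2 - share_week1               # 16.7 percentage points

    print(f"Week 1: {share_week1:.1f}% of salary, week 2: {share_week2:.1f}%")
    print(f"Percentage change in spending: {pct_change:.1f}%")
    print(f"Percentage point change: {pct_point_change:.1f} percentage points")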
2. Probability Distributions

When presented with a new or revised data set, it makes sense to plot the data in order to check that it behaves as expected and to understand its properties. This involves examining three main aspects of the data distribution:

Location - where on the x-axis the centre of the data lies.
Shape - whether there is one peak or several, and whether there are approximately equal numbers of measurements on either side of the peak.
Outliers - whether any measurements are much larger or smaller than the rest. These may be anomalies within the data set and should be examined closely.

[Figure: three distribution shapes - (a) skewed to the right, (b) skewed to the left, (c) normal distribution (symmetrical), with the x-axis marked at μ-σ, μ and μ+σ.]

(a) A distribution is skewed to the right when most of the measurements lie to the right-hand side of the peak value.
(b) A distribution is skewed to the left when most of the measurements lie to the left-hand side of the peak value.
(c) A distribution is symmetrical when there are equal numbers of measurements on either side of the peak value.

Many random variables in nature exhibit a bell-shaped curve as in (c) above. This is known as a normal probability distribution and is widely used in statistical analysis. The normal distribution is symmetrical about its mean, μ. The total area under the normal probability distribution is 1, so the symmetry means that the area to the right of the mean equals 0.5 and the area to the left also equals 0.5. The shape of the distribution is determined by the standard deviation, σ: large values of σ give a wider, flatter distribution, while small values of σ give a narrow, tall distribution. Many commonly used statistical tests assume that the data follows an underlying normal probability distribution.

3. Measures of Central Value

There are three main ways to determine the central value in a data set - the mean, median and mode. It is difficult to say which is the most useful, as this depends on the properties of the data set in question. Whichever measure is used, it is important to state clearly which one has been chosen when reporting the results of an analysis.

The mode is useful when data is recorded on a nominal scale, as it simply counts the number of responses in each category and reports the most frequent. However, the mode is less efficient than the mean and median for numeric data. The mean and median are more difficult to distinguish between, as they are often very similar, especially if the data follows a normal distribution. The median is generally the more useful of the two when dealing with small data sets which contain extreme values.

Example 2

Dataset 1: 25,000   30,000   35,000
Dataset 2: 25,000   30,000   1,000,000

The median in each data set is 30,000, but the mean values differ considerably: for dataset 1 the mean is 30,000, whereas for dataset 2 it is roughly 351,667. This shows that the mean is more sensitive than the median to the extreme value of 1,000,000 in dataset 2 (this is checked numerically in the sketch at the end of this section). For larger samples, however, the mean and median tend to be more similar. Here the mean is usually the better choice, as it makes use of the actual values in the data set rather than just their relative positions. Furthermore, the mean is easier to compute, as it has a simple algebraic formula and the values do not need to be ordered.

The diagrams below provide a visual representation of the position of each of these measures of central value for normally distributed data and for skewed data.

[Figure: position of the mean, median and mode. For a normal distribution, mean = median = mode. For a right-skewed distribution the mode sits at the peak, with the median and then the mean further to the right; for a left-skewed distribution the order is reversed.]

For data which follows an exactly normal distribution, the mean, median and mode are equal. In real life this rarely occurs, and there are slight differences between the three measures. For skewed data, the median generally lies between the mean and the mode because it is the 'halfway point'. The mode always corresponds to the peak of the distribution, as it represents the most common value. The mean, however, moves away from the median in the direction of the tail because of its tendency to be affected by extreme values. Hence, with a right-skewed distribution the mean tends to be greater than the median, because it is pulled in the direction of the small number of large values. Similarly, with a left-skewed distribution the mean tends to be smaller than the median, because it is pulled towards the extremely small values.

Left-skewed data:  mean < median < mode
Right-skewed data: mean > median > mode

A good illustration of this effect is income data. The mean is not always the best measure of average income because it is overly influenced by a small number of very wealthy individuals and is not representative of the 'typical' level of income earned by the majority of individuals. In this case, the median provides a better measure of 'average' income.
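The effect of the extreme value in Example 2 can be checked directly with Python's built-in statistics module. This is a minimal illustrative sketch; the variable names are my own.

    from statistics import mean, median

    dataset_1 = [25_000, 30_000, 35_000]
    dataset_2 = [25_000, 30_000, 1_000_000]

    # Both data sets have the same median, but the extreme value
    # pulls the mean of dataset 2 well above its 'typical' value.
    print(mean(dataset_1), median(dataset_1))  # 30000 and 30000
    print(mean(dataset_2), median(dataset_2))  # approx. 351666.67 and 30000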
4. Measures of Dispersion

Measures of dispersion indicate how much variation or spread there is across the data values. This is a very important measure when used in conjunction with the mean, as the two measures combined give a good description of the data.

Example 3 - Dispersion

Dataset A: 99, 100, 100, 101     mean = 100
Dataset B: 50, 100, 100, 150     mean = 100

Both data sets have a mean of 100, yet the values in dataset B vary far more than those in dataset A - something the means alone do not reveal. Some quantitative measure of the variation amongst the data values is therefore needed. The simplest measure is the range - the difference between the largest and smallest values. However, this is a relatively crude measure, and Example 4 shows a possible problem.

Example 4: Number of books read per student

Student              A  B  C  D  E  F  G  H   I   J
No. History books    1  5  5  6  6  6  6  7   7  11
No. Physics books    1  2  2  2  3  9  9  10  11  11

The mean number of history books read by students is 6.0 and the range is 10.0. The mean and range for the number of physics books read are also 6.0 and 10.0. This would suggest that the two subjects have relatively similar reading requirements. However, it would be wrong to describe the two classes as being similar: in the History class the variation about the mean value of 6.0 is clearly less pronounced than in the Physics class. We therefore need a method for calculating the variation about the mean.

Mean Deviation and Variance

In order to measure the variation in the data, it is useful to measure the difference between the mean and each of the data values. The greater the variation, the larger the differences between the mean and the individual values.

Example 5

Data    Deviation from mean
1       1 - 4 = -3
5       5 - 4 = 1
6       6 - 4 = 2
Mean = 4

The average of these deviations is (-3 + 1 + 2)/3 = 0. The positive and negative deviations always cancel in this way, so a method is needed which measures the deviations but does not average to zero. The common approach is to square the deviations and take the average of the squared values. This measure is called the variance and can be written as:

   variance = sum of (x - mean)² / number of data values,  or  Σ(x - μ)² / n

where x denotes the individual data values, μ the mean and n the number of values. For Example 5:

   ((-3)² + (1)² + (2)²) / 3 = 14/3 = 4.67

The variance is useful when comparing data sets. For example, if the salaries of two groups of people (groups A and B) are compared and the variance of group A is larger than that of group B, then the salaries in group A deviate more from their mean than those in group B. This indicates how well the mean value represents each data set. A common practice is to take the square root of the variance; this value is known as the standard deviation. In Example 5 above, the standard deviation is 2.16. Example 6 below shows the variances and standard deviations for the students' reading habits; the code sketch at the end of this section reproduces these figures.

Example 6

Student              A  B  C  D  E  F  G  H   I   J
No. History books    1  5  5  6  6  6  6  7   7  11
No. Physics books    1  2  2  2  3  9  9  10  11  11

History: Mean = 6.0, Range = 10.0, Variance = 5.4, Std. deviation = 2.3
Physics: Mean = 6.0, Range = 10.0, Variance = 16.6, Std. deviation = 4.07

The variance and standard deviation statistics show clearly that the reading levels in Physics are a lot more varied than those in History.
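The variance and standard deviation figures in Example 6 can be reproduced with the population formula Σ(x - μ)²/n given above. This is a minimal illustrative sketch; note that many statistical libraries default to the sample variance, which divides by n - 1 instead of n and therefore gives slightly larger values.

    from math import sqrt

    def variance(values):
        """Population variance: the average squared deviation from the mean."""
        mu = sum(values) / len(values)
        return sum((x - mu) ** 2 for x in values) / len(values)

    history = [1, 5, 5, 6, 6, 6, 6, 7, 7, 11]
    physics = [1, 2, 2, 2, 3, 9, 9, 10, 11, 11]

    print(variance(history), sqrt(variance(history)))  # 5.4 and approx. 2.3
    print(variance(physics), sqrt(variance(physics)))  # 16.6 and approx. 4.07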
5. Measures of Position

Percentiles are a useful measure for describing where a data value lies relative to the other values in a data set. Percentiles split the ordered data into 100 groups. If a candidate scores 25 marks in an exam, this alone says nothing about how others performed or whether 25 is a high or low score. If, however, that score lies at the 99th percentile, then 99% of the people taking the exam scored lower than 25 marks. The term quartiles is also used to describe data sets.

Quartiles split the data set into four groups, bounded by the 25th, 50th, 75th and 100th percentiles. The 50th percentile is also known as the median.

Student      A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T
Score       23  25  30  30  34  37  46  50  53  58  62  68  72  74  78  81  86  90  92  98
Percentile   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100

In this table the 25th percentile (1st quartile) falls at student E's score of 34, the 50th percentile (median, or 2nd quartile) at student J's score of 58, the 75th percentile (3rd quartile) at student O's score of 78, and the 100th percentile (4th quartile) at student T's score of 98.

[Figure: location of the quartiles. The lower quartile Q1, the median m and the upper quartile Q3 divide the ordered data into four groups, each containing 25% of the values.]
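The quartiles of the exam scores above can be computed with Python's built-in statistics module, as in the minimal sketch below. Note that statistical packages use several slightly different interpolation rules for percentiles, so the values returned need not coincide exactly with those read off the table, which simply assigns each of the 20 students a percentile in steps of 5.

    from statistics import median, quantiles

    scores = [23, 25, 30, 30, 34, 37, 46, 50, 53, 58,
              62, 68, 72, 74, 78, 81, 86, 90, 92, 98]

    # Cut points that split the ordered scores into four equal-sized groups.
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    print(q1, q2, q3)      # lower quartile, median, upper quartile
    print(median(scores))  # the same value as q2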