Ground Statistics for students in the Public Health Program
by Allan Dale. 2010-03-09

This document was created to provide basic knowledge of the statistical measurements that are useful in Public Health.

Statistics is the science of making effective use of numerical data relating to a group of observations. For many people, statistics means numbers: numerical facts, figures, or information. Reports of industry production, baseball batting averages, government deficits, and so forth, are often called statistics.

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data. With descriptive statistics you are simply describing what the data show.

With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For example, inferential statistics are used in epidemiology to judge the probability that an observed difference between groups happened by chance in a study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what is going on in our data.

Three important tools are used to summarise a set of observations:
1. a measure of location, or central tendency, such as the arithmetic mean, median and mode.
2. a measure of statistical dispersion, such as the standard deviation, range and quartiles.
3. a measure of the shape of the distribution of the observations.

Table 1 summarises the measurements of central tendency and dispersion used in descriptive statistics.

Table 1. Basic statistical measurements.
Central measurement    Spread measurement
Mode                   Range
Median                 Quartiles
Mean                   Standard Deviation

The first step is to organise the collected data in order.
A frequency table is the most common way to summarise the distribution of individual values or ranges of values for a variable. The second step is to calculate the cumulative frequency, the relative frequency and the cumulative percent. The cumulative frequency is the sum of the frequencies of all levels up to and including the given level.

Table 2. Organisation of the collected data in a frequency table.
Level   Absolute    Cumulative   Relative    Cumulative
        frequency   frequency    frequency   percent
x       f           cf           rf%         cf%
1         4            4           3%           3%
2        33           37          22%          25%
3        69          106          46%          71%
4        37          143          25%          95%
5         7          150           5%         100%
Total   n=150                    100%
Percentages are rounded to whole numbers, so the rounded relative frequencies do not add up to exactly 100%.

The absolute values are the number of observations or persons that share the same characteristic. In Table 2 there are 4 persons in level 1. The relative values in each level are the proportion of persons in relation to the whole group. In Table 2, 4 persons out of 150 were in level 1: 4/150 ≈ 0.03, which multiplied by 100 gives 3% of the group. The cumulative frequency up to level 2 is the 4 cases in level 1 plus the 33 cases in level 2 = 37 cases.

The distribution of the data can be shown using different types of charts. See the following examples.

Graph 1. Representation of the “absolute frequency” from Table 2 in a bar graph.
[Bar graph: absolute frequency (vertical axis, 0 to 80) for levels 1 to 5 (horizontal axis)]

In general, the observations in Graph 1 show an approximately normal distribution with a mode of 3.

Graph 2. Representation of the “relative frequency” from Table 2 using a pie chart.
[Pie chart: share of the observations in each of the levels 1 to 5]

The largest proportion of the observations (46%) is located in level 3.

THE MODE
The mode is the value that occurs most frequently in a data set, that is, the value with the highest frequency. In the following example, 3 is the mode:
2, 3, 3, 3, 3, 4, 5, 5, 6, 7
However, in the following example both 3 and 5 are modes:
2, 3, 3, 3, 4, 5, 5, 5, 6

THE RANGE
The range is one way to represent the spread of the distribution. It shows the minimum and the maximum values in the distribution.
It provides an indication of statistical dispersion together with the mode. In the first example the mode is 3 and the range goes from 2 to 7:
2, 3, 3, 3, 3, 4, 5, 5, 6, 7
In the second example the modes are 3 and 5 and the range goes from 2 to 6:
2, 3, 3, 3, 4, 5, 5, 5, 6

THE MEDIAN
The median is also called the midpoint because it is the value that divides the observations in half: 50% of the data lie below this value and 50% above it. The median of a finite list of numbers can be found by arranging all the observations from the lowest to the highest value and picking the one in the middle. If there is an even number of observations, then there is no single middle value, so one often takes the mean of the two middle values. For example, if there are 500 scores in the list, the median is the mean of scores #250 and #251.

Consider 8 scores which, placed in order, are:
15, 15, 15, 20, 20, 21, 25, 36
There are 8 scores, and scores #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate between them (take their mean) to determine the median.

For 15 observations, the midpoint is clearly the eighth value in order, so that seven points lie below the median and seven points lie above it. To find the median for an even number of points, such as 16 observations, we average the eighth and ninth points.

To make it easier, build your frequency table and calculate the cumulative frequency; then identify the level where the cumulative percent reaches 50%.

Table 2. Organisation of the collected data in a frequency table.
Level   Absolute    Cumulative   Relative    Cumulative
        frequency   frequency    frequency   percent
x       f           cf           rf%         cf%
1         4            4           3%           3%
2        33           37          22%          25%
3        69          106          46%          71%
4        37          143          25%          95%
5         7          150           5%         100%
Total   n=150                    100%
Percentages are rounded to whole numbers, so the rounded relative frequencies do not add up to exactly 100%.

In Table 2, the median (50%) is in level 3.
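The frequency-table steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are assumptions made for the example, and the median level is taken as the first level whose cumulative percent reaches 50%, which is one simple reading of the rule described here.

```python
from itertools import accumulate

# Levels and absolute frequencies copied from Table 2.
levels = [1, 2, 3, 4, 5]
freq = [4, 33, 69, 37, 7]

n = sum(freq)                                 # total number of observations
cum_freq = list(accumulate(freq))             # running total of the frequencies
rel_pct = [100 * f / n for f in freq]         # relative frequency in percent
cum_pct = [100 * cf / n for cf in cum_freq]   # cumulative percent

# Assumption: the median level is the first level whose cumulative
# percent reaches 50%.
median_level = next(lvl for lvl, pct in zip(levels, cum_pct) if pct >= 50)

print("level    f   cf    rf%    cf%")
for row in zip(levels, freq, cum_freq, rel_pct, cum_pct):
    print("{:>5} {:>4} {:>4} {:>6.1f} {:>6.1f}".format(*row))
print("n =", n)
print("median level:", median_level)
```

With the Table 2 frequencies, the sketch reports level 3 as the median level, in agreement with the text. The percentages are printed with one decimal so that the effect of rounding to whole numbers is visible.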
THE QUARTILES
In descriptive statistics, a quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents one fourth of the sampled population.
first quartile (designated Q1 or Q25) = lower quartile = cuts off the lowest 25% of the data = 25th percentile
second quartile (designated Q2) = median = cuts the data set in half = 50th percentile
third quartile (designated Q3 or Q75) = upper quartile = cuts off the highest 25% of the data, or the lowest 75% = 75th percentile
The interquartile range is the interval between Q25 and Q75.

In Table 2, Q25 is in level 2 (where the rounded cumulative percent reaches 25%) and Q75 is in level 4. The interquartile range is therefore 2 levels.

THE MEAN
The mean is the most common way to summarise data and is often referred to as the average. The mean annual income in the District of Skaraborg, in the southwest of Sweden, was about 204 thousand Swedish crowns over the years 2000 to 2005.

Table 3. Skaraborg’s mean annual income (per 1000 Skr) between the years 2000-2005:
             2000   2001   2002   2003   2004   2005
Skaraborg     186    193    202    208    214    220

The mean is the sum of the values of the included observations divided by the number of observations. In this case (Table 3):
Mean = (186 + 193 + 202 + 208 + 214 + 220) / 6 = 1223 / 6 ≈ 204

Exercise 1: Use the municipal data in Table 4 and calculate the mean between the years 2000-2005 for each of the presented municipalities.

Table 4. Mean annual income (per 1000 Skr) in the municipalities in Skaraborg, 2000-2005.
Municipality   2000   2001   2002   2003   2004   2005
Essunga         181    192    198    205    210    217
Falköping       188    196    203    207    214    221
Gullspång       173    181    191    197    203    208
Götene          192    201    210    218    223    229
Hjo             185    194    202    207    214    220
Karlsborg       186    193    203    211    216    221
Lidköping       195    203    213    220    226    234
Mariestad       188    197    204    211    219    224
Skara           194    201    209    217    222    229
Skövde          196    204    211    217    224    230
Tibro           187    192    199    205    210    217
Tidaholm        185    191    198    206    212    217
Töreboda        173    179    187    194    199    203
Vara            179    187    195    204    209    217

THE STANDARD DEVIATION
The standard deviation shows how much variation there is from the mean.
A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.

For example, the average height for adult men in the United States is about 178 cm, with a standard deviation of around 8 cm. This means that most men (about 68 percent, assuming a normal distribution) have a height within one standard deviation, 8 cm, of the mean (170–186 cm), whereas almost all men (about 95%) have a height within two standard deviations, 16 cm, of the mean (162–194 cm). If the standard deviation were zero, then all men would be exactly 178 cm tall. If the standard deviation were 51 cm, then men would have much more variable heights, with a typical range of about 127 to 229 cm. Three standard deviations account for about 99.7% of the sample population being studied, assuming the distribution is normal (bell-shaped).

[Figure: normal distribution curve, with the horizontal axis marked at M-3S, M-2S, M-S, M, M+S, M+2S and M+3S, where M is the mean and S is the standard deviation]

STEPS TO CALCULATE THE STANDARD DEVIATION
Basic example to calculate the standard deviation. Consider a population consisting of the following values:
2, 4, 4, 4, 5, 5, 7, 9
There are eight data points in total, with a mean value of 5:
Mean = (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5
To calculate the population standard deviation, first compute the difference of each data point from the mean, and square the result:

Observation   Value   Difference from the mean   Squared difference
1             2       (2 - 5) = -3               9
2             4       -1                         1
3             4       -1                         1
4             4       -1                         1
5             5        0                         0
6             5        0                         0
7             7        2                         4
8             9        4                         16

Next, divide the sum of these squared differences by the number of values and take the square root to give the standard deviation:
Standard deviation = √((9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8) = √(32 / 8) = √4 = 2
Therefore, the above has a population standard deviation of 2.

DISTRIBUTION OF THE OBSERVATIONS
A normal distribution is a symmetric distribution, with no skew. A distribution is skewed if one of its tails is longer than the other.

A positively skewed distribution has a long tail in the positive direction. For example, the distribution of income in low-income countries generally has a positive skew, because a large proportion of the population lives in poverty.
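A positive skew can often be seen in the summary measures as well: the long right tail usually pulls the mean above the median. The sketch below illustrates this with hypothetical income figures invented for the example (they are not taken from any table in this document), using Python's standard statistics module.

```python
import statistics

# Hypothetical annual incomes (arbitrary units), invented for illustration:
# most values are low and a few are very high, giving a long right tail.
incomes = [12, 14, 15, 15, 16, 18, 20, 22, 35, 80]

mean_income = statistics.mean(incomes)      # pulled upwards by 35 and 80
median_income = statistics.median(incomes)  # middle of the ordered values

print("mean:", mean_income)
print("median:", median_income)
print("mean above median:", mean_income > median_income)
```

For these invented figures the mean lies well above the median, which is the pattern typically expected when the skew is positive.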
A negatively skewed distribution has a long tail in the negative direction. If the majority of the students in a course did very well and a few did very badly, a negative skew will be observed.

A multimodal distribution has two or more modes (with exactly two modes it is called bimodal).

Some important basic mathematical terms
Ratio: the value obtained by dividing one quantity by another. Rates, proportions and percentages are types of ratios. A ratio consists of a numerator and a denominator. For example: the sex ratio in a class of 20 persons where there are only 5 men will be 15 women / 5 men = 3 women per man.
Proportion: a type of ratio in which the numerator is part of the denominator. Proportions are often expressed as percentages. For example: the proportion of men in a class of 20 persons with 5 men will be 5 men / 20 (all participants) multiplied by 100 = 25 percent.
Rate: another type of ratio, which involves time. Examples are the incidence rate, the prevalence rate and the infant mortality rate.
Rate ratio: one rate divided by another.

Three more statistical exercises from previous tests in the Course.

1. The following data were obtained from a survey on shoe size:
a) Create a frequency table.
b) Obtain the mode and the range, the mean with its standard deviation, and the median with Q25, Q75 and the interquartile range.
c) Represent the data in a graph.
d) Describe all your findings in your own words; include the shape of the distribution.
34 38 35 35 35 37 36 36 40 34
34 39 38 35 35 39 40 41 35 40
36 36 37 37 34 34 38 39 36 35

2. The following data were obtained from a survey of the age of the students in the Course:
a) Create a frequency table.
b) Obtain the mode and the range, the mean with its standard deviation, and the median with Q25, Q75 and the interquartile range.
c) Represent the data in a graph.
d) Describe all your findings in your own words; include the shape of the distribution.
22 23 18 21 24 20 18 21 19 19
20 21 21 18 20 23 22 22 22 21
23 20 23 22 19 19 23 21 24 21

3.
The following data were obtained from a survey of the age of the students in a PhD course:
a) Create a frequency table.
b) Obtain the mode and the range, the mean with its standard deviation, and the median with Q25, Q75 and the interquartile range.
c) Represent the data in a graph.
d) Describe all your findings in your own words; include the shape of the distribution.
35 25 65 35 35 35 35 25 55 65
25 65 25 45 25 25 55 25 45 45
55 35 65 65 35 25 25 45 45 25

Solutions of the statistical exercises from previous tests in the Course:

1. Information about shoe size.
Shoe size   Frequency   Relative frequency (%)   Cumulative percent
34          5           16.7                     16.7
35          7           23.3                     40.0
36          5           16.7                     56.7
37          3           10.0                     66.7
38          3           10.0                     76.7
39          3           10.0                     86.7
40          3           10.0                     96.7
41          1            3.3                     100.0
Total       30          100

[Bar graph: Frequency of shoe size, for sizes 34 to 41]

The mean (1098/30) was 36.6, with a standard deviation of 2.13 (sample formula, dividing by n - 1 = 29). The median was 36. The mode was 35. Q1 = 35 and Q3 = 38. Interquartile range = 3.
The shoe size “36” was the most representative size in this group of people (30 participants), even though the most frequent value (the mode) was “35”. The range was from 34 to 41. The distribution is not normal: it is positively skewed.

2. Information about the age of the students in the Course.
Age     Frequency   rf%    Cumulative percent
18      3           10     10
19      4           13     23
20      4           13     37
21      7           23     60
22      5           17     77
23      5           17     93
24      2            7     100
Total   30          100%

[Bar graph: Age of the students, frequency for ages 18 to 24]

Mean = 21; Median = 21; Mode = 21; Standard deviation = 1.76.
The mean, the median and the mode were all 21. Q1 = 20 and Q3 = 22. Interquartile range = 2. The range was from 18 to 24. The graph showed an approximately normal distribution.

3. Information about the age of the students in a PhD course.
Age (x)   Frequency (f)   Relative frequency (%)   Cumulative percent
25        10              33.3                     33.3
35         7              23.3                     56.7
45         5              16.7                     73.3
55         3              10.0                     83.3
65         5              16.7                     100
Total     30              100

[Bar graph: Age distribution (%), for ages 25 to 65; horizontal axis: Age]

Mean = 40.33; Median = 35; Mode = 25; Standard deviation = 14.79.
The mean age was 40 (SD = 14.79); the median was 35 and the mode was 25. Q1 = 25 and Q3 = 55. Interquartile range = 30. The range in this group was from 25 to 65 years. The distribution is not normal: it is positively skewed.

Extra exercise.
Find the mode, range, median, Q25 and Q75, mean and standard deviation of Table 5. Make a graph showing the mode, the mean and the median. Describe in words all your results: mode, range, median, Q25, Q75, interquartile range, mean and standard deviation. Describe also the shape of the distribution of the observations.

Table 5. Mean annual income (per 1000 Skr) in seven municipalities, Skaraborg, 2000-2005.
Municipality   Mean 2000-2005
Falköping      205
Karlsborg      205
Lidköping      215
Skövde         214
Tibro          202
Tidaholm       202
Töreboda       189
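Results such as those in the solved exercises earlier in this document can be checked with a short script. The sketch below recomputes the measures for the shoe-size data of exercise 1 using Python's standard statistics module. The helper value_at_percent is a hypothetical function written for this example: it reads Q25 and Q75 from the cumulative percent, as done with the frequency tables in this document, and other quartile conventions can give slightly different values.

```python
import statistics
from collections import Counter

# Shoe-size data from exercise 1.
shoe_sizes = [
    34, 38, 35, 35, 35, 37, 36, 36, 40, 34,
    34, 39, 38, 35, 35, 39, 40, 41, 35, 40,
    36, 36, 37, 37, 34, 34, 38, 39, 36, 35,
]


def value_at_percent(data, percent):
    """Hypothetical helper: first value whose cumulative percent reaches `percent`."""
    counts = Counter(data)
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if 100 * running / len(data) >= percent:
            return value


summary = {
    "mode": statistics.mode(shoe_sizes),
    "range": (min(shoe_sizes), max(shoe_sizes)),
    "mean": statistics.mean(shoe_sizes),
    "sd_sample": statistics.stdev(shoe_sizes),        # divides by n - 1
    "sd_population": statistics.pstdev(shoe_sizes),   # divides by n
    "q25": value_at_percent(shoe_sizes, 25),
    "median": statistics.median(shoe_sizes),
    "q75": value_at_percent(shoe_sizes, 75),
}

for name, value in summary.items():
    print(name, value)
```

The sample standard deviation (dividing by n - 1) is the one that matches the 2.13 given in the solution, while the worked example with the values 2, 4, 4, 4, 5, 5, 7, 9 uses the population formula (dividing by n). The same script can be pointed at any of the other data sets by replacing the list.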