Ground Statistics for students in the Public Health
Program.
by Allan Dale.
2010-03-09
This document aims to provide basic knowledge of the statistical measurements that are
useful in Public Health.
Statistics is the science of making effective use of numerical data relating to a group of
observations. For many people, statistics means numbers—numerical facts, figures, or
information. Reports of industry production, baseball batting averages, government deficits,
and so forth, are often called statistics. Descriptive statistics are used to describe the basic
features of the data in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis of virtually every
quantitative analysis of data. With descriptive statistics you are simply describing what is or
what the data shows. With inferential statistics, you are trying to reach conclusions that
extend beyond the immediate data alone. For example, inferential statistics are used in
epidemiology to make judgments of the probability that an observed difference between
groups might have happened by chance in this study. Thus, we use inferential statistics to
make inferences from our data to more general conditions; we use descriptive statistics simply
to describe what's going on in our data.
Three important tools are used to summarize a set of observations:
1. a measure of location, or central tendency, such as the arithmetic mean, median and
mode.
2. a measure of statistical dispersion like the standard deviation, range and quartiles.
3. a measure of the shape of the distribution of the observations.
Table 1 summarises the measures of central tendency and dispersion used in descriptive
statistics.
Table 1. Basic statistical measurements.
Central measurement    Spread measurement
Mode                   Range
Median                 Quartiles
Mean                   Standard Deviation
The first step is to put the collected data in order. A frequency table is the most common
way to summarise the distribution of individual values or ranges of values for a variable. The
second step is to calculate the cumulative frequency, the relative frequency and the cumulative
percent. The cumulative frequency of a level is the sum of the frequencies of all levels up to
and including that level.
Table 2. Organisation of the collected data in a frequency table.
Level   Absolute    Cumulative   Relative    Cumulative
        Frequency   Frequency    Frequency   Percent
x       f           cf           rf%         cf%
1       4           4            3%          3%
2       33          37           22%         25%
3       69          106          45%         70%
4       37          143          25%         95%
5       7           150          5%          100%
Total   n=150                    100%
The absolute frequency is the number of observations or persons that share the same
characteristic. In Table 2 there are 4 persons in level 1. The relative frequency of a level
is the proportion of persons in that level in relation to the whole group. In Table 2, 4 persons
out of 150 were in level 1: 4/150 is about 0.03, which multiplied by 100 gives about 3% of the
group in level 1. The cumulative frequency up to level 2 is the 4 cases in level 1 plus the
33 cases in level 2, that is 37 cases up to and including level 2.
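The columns of the frequency table can be reproduced with a short Python sketch. It takes only the absolute frequencies from Table 2 as input; the function name `frequency_table` is an illustrative choice, not something defined in this document. Percentages are printed with one decimal, so they can differ by a percentage point from the whole-number figures printed in the table.

```python
# Absolute frequencies per level, copied from Table 2.
levels = [1, 2, 3, 4, 5]
freq = [4, 33, 69, 37, 7]

n = sum(freq)  # total number of observations


def frequency_table(levels, freq):
    """Return one row (level, f, cf, rf%, cf%) per level."""
    total = sum(freq)
    rows = []
    running = 0
    for level, f in zip(levels, freq):
        running += f  # cumulative frequency up to and including this level
        rows.append((level, f, running, 100 * f / total, 100 * running / total))
    return rows


table = frequency_table(levels, freq)
for level, f, cf, rf, cf_pct in table:
    print(f"{level:>5} {f:>4} {cf:>4} {rf:>6.1f}% {cf_pct:>6.1f}%")
```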
The distribution of the data can be displayed using different types of charts. See the
following examples.
Graph 1. Representation of the “absolute frequency” from Table 2 in a bar graph.
[Bar graph: horizontal axis, levels 1 to 5; vertical axis, absolute frequency from 0 to 80.]
In general, the observations in graph 1 show a normal distribution with a mode of 3.
Graph 2. Representation of the “relative frequency” from the frequency table presented in this
page using a pie chart.
[Pie chart: level 1, 3%; level 2, 22%; level 3, 45%; level 4, 25%; level 5, 5%.]
The largest proportion of the observations (45%) is located in level 3.
THE MODE
The mode is the value that occurs most frequently in a data set, that is, the value with the
highest frequency.
In the following example, 3 is the mode.
2, 3, 3, 3, 3, 4, 5, 5, 6, 7
However, in the following example there are two modes, 3 and 5.
2, 3, 3, 3, 4, 5, 5, 5, 6
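As a quick check, the two examples above can be run through Python's standard library. This is a minimal sketch that assumes Python 3.8 or later, where `statistics.multimode` is available; it returns every value that shares the highest frequency.

```python
from statistics import multimode

one_mode = [2, 3, 3, 3, 3, 4, 5, 5, 6, 7]
two_modes = [2, 3, 3, 3, 4, 5, 5, 5, 6]

# multimode returns a list, so a data set with two modes gives two values.
modes_first = multimode(one_mode)
modes_second = multimode(two_modes)

print(modes_first)   # [3]
print(modes_second)  # [3, 5]
```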
THE RANGE
The range is one way to represent the spread of the distribution. It shows the minimum and
the maximum values in the distribution. It provides an indication of statistical dispersion
together with the mode.
In the first example the mode is 3 and the range goes from 2 to 7:
2, 3, 3, 3, 3, 4, 5, 5, 6, 7
While in this other example the mode scores are 3 and 5 and the range goes from 2 to 6.
2, 3, 3, 3, 4, 5, 5, 5, 6
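The range of both examples can be read off with `min` and `max`. The helper name `data_range` is an illustrative assumption. The sketch also prints the width (maximum minus minimum), a single-number form of the range that is common elsewhere but is not used in this document.

```python
first = [2, 3, 3, 3, 3, 4, 5, 5, 6, 7]
second = [2, 3, 3, 3, 4, 5, 5, 5, 6]


def data_range(values):
    """Return the (minimum, maximum) pair of a list of observations."""
    return min(values), max(values)


for name, values in (("first", first), ("second", second)):
    low, high = data_range(values)
    print(f"{name}: from {low} to {high} (width {high - low})")
```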
THE MEDIAN
The median is also called the midpoint because it is the value that divides the observations in
half: 50% of the data lie below this value and 50% above it. The median of a finite list of
numbers can be found by arranging all the observations from the lowest to the highest value
and picking the one in the middle. If there is an even number of observations, then there is no
single middle value, so one usually takes the mean of the two middle values. For example, if
there are 500 scores in the list, the median is the mean of scores #250 and #251. If we order
the following 8 scores, we get:
15, 15, 15, 20, 20, 21, 25, 36
There are 8 scores, and scores #4 and #5 represent the halfway point. Since both of these scores
are 20, the median is 20. If the two middle scores had different values, you would take their
mean to determine the median.
For 15 observations, the mid point is clearly the eighth largest, so that seven points are less
than the median, and seven points are greater than it. To find the median for an even number
of points, like using 16 observations, we average the eighth and ninth points.
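The odd and even cases can be checked with `statistics.median` from the Python standard library, which sorts the data and averages the two middle values when the count is even. The lists of 15 and 16 observations are simply the numbers 1 to 15 and 1 to 16, chosen here for illustration.

```python
from statistics import median

scores = [15, 15, 15, 20, 20, 21, 25, 36]  # the 8 ordered scores above
fifteen = list(range(1, 16))               # 15 observations: 1, 2, ..., 15
sixteen = list(range(1, 17))               # 16 observations: 1, 2, ..., 16

median_scores = median(scores)    # mean of scores #4 and #5
median_fifteen = median(fifteen)  # the eighth value
median_sixteen = median(sixteen)  # mean of the eighth and ninth values

print(median_scores, median_fifteen, median_sixteen)
```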
To make this easier, build your frequency table, calculate the cumulative frequency and the
cumulative percent, and then identify the level in which the 50% mark falls.
Table 2. Organisation of the collected data in a frequency table.
Level   Absolute    Cumulative   Relative    Cumulative
        Frequency   Frequency    Frequency   Percent
x       f           cf           rf%         cf%
1       4           4            3%          3%
2       33          37           22%         25%
3       69          106          46%         71%
4       37          143          24%         95%
5       7           150          5%          100%
Total   n=150                    100%
In Table 2, the median (50%) is in level 3.
THE QUARTILES
In descriptive statistics, a quartile is any of the three values which divide the sorted data set
into four equal parts, so that each part represents one fourth of the sampled population.
- first quartile (designated Q1 or Q25) = lower quartile = cuts off the lowest 25% of the data =
  25th percentile
- second quartile (designated Q2) = median = cuts the data set in half = 50th percentile
- third quartile (designated Q3 or Q75) = upper quartile = cuts off the highest 25% of the data,
  or the lowest 75% = 75th percentile
- the interquartile range is the distance between Q25 and Q75.
In Table 2, Q25 is in level 2 and Q75 is in level 4. The interquartile range is 2 levels.
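Reading the quartiles off the cumulative percent column can be sketched as follows. The function name `level_at_percent` is an illustrative assumption. The sketch rounds the cumulative percent to whole numbers, as the printed table does; this matters for Q25, because level 2 reaches 24.7% before rounding and 25% after it.

```python
# Absolute frequencies per level, copied from Table 2.
levels = [1, 2, 3, 4, 5]
freq = [4, 33, 69, 37, 7]


def level_at_percent(levels, freq, target):
    """Return the first level whose rounded cumulative percent reaches target."""
    total = sum(freq)
    running = 0
    for level, f in zip(levels, freq):
        running += f
        if round(100 * running / total) >= target:
            return level
    return levels[-1]


q25 = level_at_percent(levels, freq, 25)
median_level = level_at_percent(levels, freq, 50)
q75 = level_at_percent(levels, freq, 75)
interquartile = q75 - q25

print(q25, median_level, q75, interquartile)
```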
THE MEAN
The mean is the most common way to summarise data and is often referred to as the average.
The mean annual income in the District of Skaraborg, in south-west Sweden, was 204 thousand
Swedish crowns over the years 2000 to 2005.
Table 3. Skaraborg’s mean annual income (per 1000 Skr) between the years 2000-2005:
             2000   2001   2002   2003   2004   2005
Skaraborg    186    193    202    208    214    220
The mean is the sum of the values of the included observations divided by the number of
observations.
In this case (Table 3):
Mean = (186 + 193 + 202 + 208 + 214 + 220) / 6 = 1223 / 6 ≈ 204
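The same calculation in Python, as a minimal sketch using the six yearly values from Table 3:

```python
# Mean annual income in Skaraborg (per 1000 Skr), 2000-2005, from Table 3.
incomes = [186, 193, 202, 208, 214, 220]

total = sum(incomes)                # sum of the observations
mean_income = total / len(incomes)  # divided by the number of observations

print(total, round(mean_income, 1), round(mean_income))
```

The unrounded result is about 203.8, which rounds to the 204 quoted in the text.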
Exercise 1:
Use the municipal data in Table 4 and calculate the mean for the years 2000-2005 for
each of the municipalities presented.
Table 4. Mean annual income in the municipalities in Skaraborg, 2000-2005.
             2000   2001   2002   2003   2004   2005
Essunga      181    192    198    205    210    217
Falköping    188    196    203    207    214    221
Gullspång    173    181    191    197    203    208
Götene       192    201    210    218    223    229
Hjo          185    194    202    207    214    220
Karlsborg    186    193    203    211    216    221
Lidköping    195    203    213    220    226    234
Mariestad    188    197    204    211    219    224
Skara        194    201    209    217    222    229
Skövde       196    204    211    217    224    230
Tibro        187    192    199    205    210    217
Tidaholm     185    191    198    206    212    217
Töreboda     173    179    187    194    199    203
Vara         179    187    195    204    209    217
THE STANDARD DEVIATION
The standard deviation shows how much variation there is from the mean. A low standard
deviation indicates that the data points tend to be very close to the mean, whereas a high
standard deviation indicates that the data are spread out over a large range of values.
For example, the average height for adult men in the United States is about 178 cm, with a
standard deviation of around 8 cm. This means that most men (about 68 percent, assuming a
normal distribution) have a height within one standard deviation (8 cm) of the mean, that is
170 to 186 cm, whereas almost all men (about 95%) have a height within two standard
deviations (16 cm) of the mean, that is 162 to 194 cm. If the standard deviation were zero,
then all men would be exactly 178 cm tall. If the standard deviation were 51 cm, then men would
have much more variable heights, with a typical range of about 127 to 229 cm. Three standard
deviations account for about 99.7% of the population being studied, assuming the distribution
is normal (bell-shaped).
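Normal Distribution
[Figure: bell-shaped curve with the horizontal axis marked at M-3S, M-2S, M-S, M, M+S, M+2S and
M+3S, where M is the mean and S is the standard deviation.]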
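The shares of a normal distribution that fall within one, two and three standard deviations of the mean can be checked with `statistics.NormalDist` (Python 3.8 or later). The sketch uses the height example (mean 178 cm, standard deviation 8 cm); the helper name `share_within` is an illustrative assumption.

```python
from statistics import NormalDist

# Height example from the text: mean 178 cm, standard deviation 8 cm.
heights = NormalDist(mu=178, sigma=8)


def share_within(dist, k):
    """Share of a normal distribution within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(heights, k):.1f}%")
```

The shares do not depend on the particular mean and standard deviation chosen; any normal distribution gives roughly 68%, 95% and 99.7%.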
STEPS TO CALCULATE THE STANDARD DEVIATION
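Basic example to calculate the standard deviation:
Consider a population consisting of the following values:
2, 4, 4, 4, 5, 5, 7, 9
There are eight data points in total, with a mean value of 5:
Mean = (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5
To calculate the population standard deviation, first compute the difference of each data point
from the mean, and square the result:
Observation   Value   Difference from the mean   Squared difference
1             2       2 - 5 = -3                 9
2             4       4 - 5 = -1                 1
3             4       4 - 5 = -1                 1
4             4       4 - 5 = -1                 1
5             5       5 - 5 = 0                  0
6             5       5 - 5 = 0                  0
7             7       7 - 5 = 2                  4
8             9       9 - 5 = 4                  16
Next divide the sum of these values by the number of values and take the square root to give
the standard deviation:
Standard deviation = sqrt((9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8) = sqrt(32 / 8) = sqrt(4) = 2
Therefore, the above has a population standard deviation of 2.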
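The steps can be followed one by one in Python. This sketch uses the eight values from the table of differences above and then compares the result with `statistics.pstdev`, the standard library's population standard deviation.

```python
from math import sqrt
from statistics import pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)                # step 1: the mean
differences = [x - mean for x in values]        # step 2: difference from the mean
squared = [d ** 2 for d in differences]         # step 3: square each difference
sd = sqrt(sum(squared) / len(values))           # step 4: divide by n, take the root

print(mean, sum(squared), sd)
print(pstdev(values))  # library result for comparison
```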
DISTRIBUTION OF THE OBSERVATIONS
A normal distribution is a symmetric, bell-shaped distribution with no skew.
A distribution is skewed if one of its tails is longer than the other. A positively skewed
distribution has a long tail in the positive direction. Generally, income in low-income
countries has a positive skew, because a large proportion of the population lives in
poverty.
A negatively skewed distribution has a long tail in the negative direction. If the majority of
the students in the course did very well and a few did very badly, a negative skew will be
observed.
A bimodal distribution has two modes; a multimodal distribution has two or more.
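One way to put a number on the direction of the skew is the moment coefficient of skewness: the mean cubed deviation divided by the population standard deviation cubed. This measure is not part of the handout and is added here only as an illustration. The three small data sets are illustrative as well, and the function name `skewness` is an assumption.

```python
def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation over SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # mean cubed deviation
    return m3 / m2 ** 1.5  # undefined if all values are equal (m2 == 0)


symmetric = [1, 2, 3, 4, 5]
long_right_tail = [2, 4, 4, 4, 5, 5, 7, 9]
long_left_tail = [1, 3, 5, 5, 6, 6, 6, 8]

for name, data in (("symmetric", symmetric),
                   ("long right tail", long_right_tail),
                   ("long left tail", long_left_tail)):
    print(f"{name}: {skewness(data):+.3f}")
```

A value near zero suggests a symmetric distribution, a positive value a long tail in the positive direction, and a negative value a long tail in the negative direction.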
Some important basic mathematical terms
Ratio: the value obtained by dividing one quantity by another. Rates, proportions and
percentages are types of ratios. A ratio consists of a numerator and a denominator.
For example: the sex ratio in a class of 20 persons where there are only 5 men will be:
15 women / 5 men = 3 women per man.
Proportion: a type of ratio in which the numerator is part of the denominator. Proportions are
often expressed as percentages. For example: the proportion of men in a class of 20 persons
with 5 men will be: 5 men / 20 (all participants) multiplied by 100 = 25 percent.
Rate: another type of ratio, one which involves time. Examples are the incidence rate, the
prevalence rate and the infant mortality rate.
Rate ratio: the result of dividing one rate by another.
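The ratio and proportion examples can be written out as a short Python sketch, using the class of 20 persons with 5 men described above:

```python
# Class of 20 persons from the example: 15 women and 5 men.
women = 15
men = 5
everyone = women + men

sex_ratio = women / men                 # ratio: numerator is not part of the denominator
proportion_men = 100 * men / everyone   # proportion: numerator is part of the denominator

print(f"{sex_ratio:.0f} women per man")
print(f"{proportion_men:.0f} percent of the class are men")
```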
Three more statistical exercises from previous tests in the Course.
1. The following data was obtained from a survey on shoe size:
a) Create a frequency table
b) Obtain the mode and the range, the mean with its standard deviation and the
median with Q25, Q75 and the interquartile range.
c) Represent the data in a graph.
d) Describe all your findings in your own words; include the shape of the distribution.
34, 38, 35, 35, 35, 37, 36, 36, 40, 34,
34, 39, 38, 35, 35, 39, 40, 41, 35, 40,
36, 36, 37, 37, 34, 34, 38, 39, 36, 35
2. The following data was obtained from a survey of the age of the students in the Course:
a) Create a frequency table
b) Obtain the mode and the range, the mean with its standard deviation and the
median with Q25, Q75 and the interquartile range.
c) Represent the data in a graph.
d) Describe all your findings in your own words; include the shape of the distribution.
22, 23, 18, 21, 24, 20, 18, 21, 19, 19,
20, 21, 21, 18, 20, 23, 22, 22, 22, 21,
23, 20, 23, 22, 19, 19, 23, 21, 24, 21
3. The following data was obtained from a survey of the age of the students in a PhD course:
a) Create a frequency table
b) Obtain the mode and the range, the mean with its standard deviation and the
median with Q25, Q75 and the interquartile range.
c) Represent the data in a graph.
d) Describe all your findings in your own words; include the shape of the distribution.
35, 25, 65, 35, 35, 35, 35, 25, 55, 65,
25, 65, 25, 45, 25, 25, 55, 25, 45, 45,
55, 35, 65, 65, 35, 25, 25, 45, 45, 25
Solutions to the statistical exercises from previous tests in the Course:
1. Information about shoe size.
Size    Frequency   Frequency %   Cumulative %
34      5           16.7          16.7
35      7           23.3          40.0
36      5           16.7          56.7
37      3           10.0          66.7
38      3           10.0          76.7
39      3           10.0          86.7
40      3           10.0          96.7
41      1           3.3           100.0
Total   30          100
Frequency of shoe size
[Bar graph: horizontal axis, shoe sizes 34 to 41; vertical axis, frequency from 0 to 8.]
The mean (1098/30) was 36.6, with a standard deviation of 2.13. The median was 36 and the mode
was 35. Q1 = 35 and Q3 = 38, so the interquartile range was 3.
Shoe size 36, the median, was the most representative size in this group of 30 participants,
even though the mode was 35. The range was from 34 to 41. The data do not follow a normal
distribution, because they show a positively skewed distribution.
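The figures in this solution can be reproduced from the 30 shoe sizes listed in exercise 1, as a minimal sketch with the Python standard library. The standard deviation of 2.13 matches `statistics.stdev`, which divides by n - 1 (the sample formula). The population formula (`statistics.pstdev`, dividing by n), as used in the worked example earlier, gives about 2.09 for the same data.

```python
from statistics import mean, median, multimode, pstdev, stdev

# The 30 shoe sizes from exercise 1.
shoe_sizes = [
    34, 38, 35, 35, 35, 37, 36, 36, 40, 34,
    34, 39, 38, 35, 35, 39, 40, 41, 35, 40,
    36, 36, 37, 37, 34, 34, 38, 39, 36, 35,
]

print("n      =", len(shoe_sizes))
print("sum    =", sum(shoe_sizes))
print("mean   =", round(mean(shoe_sizes), 1))
print("median =", median(shoe_sizes))
print("mode   =", multimode(shoe_sizes))
print("range  =", min(shoe_sizes), "to", max(shoe_sizes))
print("SD (n - 1) =", round(stdev(shoe_sizes), 2))
print("SD (n)     =", round(pstdev(shoe_sizes), 2))
```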
2. Information about the age of students in the Course
Age     Frequency   rf%     Cumulative %
18      3           10      10
19      4           13      23
20      4           13      36
21      7           23      59
22      5           17      76
23      5           17      93
24      2           7       100
Total   30          100%
Student age
[Bar graph: horizontal axis, ages 18 to 24; vertical axis, frequency from 0 to 8.]
Mean = 21; Median = 21; Mode = 21; Standard deviation = 1.76
The mean, median and mode were all 21. Q1 = 20 and Q3 = 22, so the interquartile range was 2.
The standard deviation was 1.76 and the range was from 18 to 24. The graph showed an
approximately normal distribution.
3. Information of the age of the students in a PhD course.
X = age   (f)   Relative frequency %   Cumulative frequency %
25        10    33.3                   33.3
35        7     23.3                   56.6
45        5     16.7                   73.3
55        3     10.0                   83.3
65        5     16.7                   100
Total     30    100
Age distribution (%)
[Bar graph: horizontal axis, age, 25 to 65; vertical axis, percent of students from 0 to 40.]
Mean = 40.33; Median = 35; Mode = 25; Standard deviation = 14.79.
The mean age was 40 (SD = 14.79), the median was 35 and the mode was 25. Q1 = 25 and Q3 = 55,
so the interquartile range was 30.
The range in this group was from 25 to 65 years. The data do not follow a normal distribution,
because they show a positively skewed distribution.
Extra exercise.
Find the mode, range, median, Q25 and Q75, mean and Standard Deviation of Table 5. Make
a graph showing the mode, the mean and the median. Describe in words all your results:
mode, range, median, Q25, Q75, interquartile, mean and Standard Deviation. Describe also
the shape of the distribution of the observations.
Table 5. Mean annual income (per 1000 skr) in seven municipalities, Skaraborg, 2000-2005.
             Mean 2000-2005
Falköping    205
Karlsborg    205
Lidköping    215
Skövde       214
Tibro        202
Tidaholm     202
Töreboda     189