Biostatistics course Part 3. Data, summary and presentation

Dr. en C. Nicolas Padilla Raygoza
Department of Nursing and Obstetrics
Division of Health Sciences and Engineering
Campus Celaya Salvatierra
University of Guanajuato Mexico
Biosketch
• Medical Doctor, Autonomous University of Guadalajara.
• Pediatrician, certified by the Mexican Council of Certification in Pediatrics.
• Postgraduate Diploma in Epidemiology, London School of Hygiene and Tropical Medicine, University of London.
• Master of Science in Epidemiology, Atlantic International University.
• Doctorate of Science in Epidemiology, Atlantic International University.
• Professor Titular A, Full Time, University of Guanajuato.
• Level 1, National Researcher System.
padillawarm@gmail.com raygosan@ugto.mx
Competencies
• The reader will describe the types of variables.
• He or she will analyze how summaries present the different types of variables.
• He or she will calculate central tendency measures and locate them on charts.
• He or she will calculate dispersion measures and locate them on charts.
Definitions
• Data are collected on specific characteristics of each subject, and groups are formed to be compared.
• These characteristics are called variables, because they can vary from subject to subject.
• A variable is recorded because it is:
  - An outcome of interest (dependent variable), or
  - Something that may explain the dependent variable (risk factor, independent variable).
Types of data
• Classification by measurement scale:
  - Qualitative
    - Binary (dichotomous)
    - Ordinal
    - Nominal
  - Quantitative
    - Discrete
    - Continuous
Types of data - Examples
• Qualitative
  - Dichotomous (binary)
    - Gender: male or female.
    - Employment status: employed or unemployed.
  - Ordinal
    - Socioeconomic level: high, medium, low.
  - Nominal
    - Place of residence: Center, North, South, East, West.
    - Civil status: single, married, widowed, divorced, free union.
• Quantitative
  - Discrete
    - Number of offspring: 1, 2, 3, 4.
  - Continuous
    - Blood glucose level: 110 mg/dl, 145 mg/dl.
Data summary
• Generally, we want to present the data in summary form.
• We are interested in the number of times each value or event occurs, because it shows us the distribution of the variable.
• We can generate a frequency list for quantitative or qualitative variables.
Summary of categorical data
• We can obtain the frequencies of categorical data and summarize them in a table or chart.
• Example: we have 21 agents of parasitic diseases isolated from children.
Giardia lamblia
Giardia lamblia
Giardia lamblia
Entamoeba histolytica
Entamoeba histolytica
Entamoeba histolytica
Ascaris lumbricoides
Ascaris lumbricoides
Ascaris lumbricoides
Enterobius vermicularis
Enterobius vermicularis
Enterobius vermicularis
Ascaris lumbricoides
Ascaris lumbricoides
Enterobius vermicularis
Enterobius vermicularis
Giardia lamblia
Giardia lamblia
Ascaris lumbricoides
Enterobius vermicularis
Giardia lamblia
Summary of categorical data
• The list of parasites detected gives us an idea of the frequency of each parasite, but not a clear one.
• If we sort the list, the picture becomes clearer.
Giardia lamblia    Ascaris lumbricoides    Enterobius vermicularis    Entamoeba histolytica
Giardia lamblia    Ascaris lumbricoides    Enterobius vermicularis    Entamoeba histolytica
Giardia lamblia    Ascaris lumbricoides    Enterobius vermicularis    Entamoeba histolytica
Giardia lamblia    Ascaris lumbricoides    Enterobius vermicularis
Giardia lamblia    Ascaris lumbricoides    Enterobius vermicularis
Giardia lamblia    Ascaris lumbricoides    Enterobius vermicularis
Summary of categorical data
• We can show the results in a frequency distribution.
Frequency distribution of intestinal parasites detected in children from CAISES Celaya, n=21
Parasite                      n
_________________________________
Giardia lamblia               6
Ascaris lumbricoides          6
Enterobius vermicularis       6
Entamoeba histolytica         3
_________________________________
Total                        21
Source: Laboratory report
Summary of categorical data
• It is useful to show the frequency of each category as a percentage of the total frequency.
• This is called the distribution of relative frequencies.
Frequency distribution of intestinal parasites detected in children from CAISES Celaya, n=21
Parasite                      n        %
__________________________________________
Giardia lamblia               6    28.57
Ascaris lumbricoides          6    28.57
Enterobius vermicularis       6    28.57
Entamoeba histolytica         3    14.29
__________________________________________
Total                        21   100.00
Source: Laboratory report
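The counting behind a table like this one can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` is hypothetical, and the list of isolates is simply the 21 agents from the example written out by category.

```python
from collections import Counter

# The 21 isolates from the example, written out by category.
isolates = (
    ["Giardia lamblia"] * 6
    + ["Ascaris lumbricoides"] * 6
    + ["Enterobius vermicularis"] * 6
    + ["Entamoeba histolytica"] * 3
)


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (category, n, round(100 * n / total, 2))
        for category, n in counts.most_common()
    ]


table = frequency_table(isolates)
for category, n, pct in table:
    print(f"{category:<25}{n:>3}{pct:>8.2f}")
```

Each percentage is the count divided by the total, so the relative frequencies add up to 100 apart from rounding.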
Summary of categorical data
• Sometimes the number of categories is large, and we should reduce it, for example by combining the less frequent categories into "Other".
Distribution of causes of death in Celaya, Gto., during 2012
Cause of death                       n        %
_________________________________________________
Cardiovascular disease          12,525    21.96
Cancer                          10,321    18.10
Lower respiratory infections     8,745    15.34
Other                           25,435    44.60
_________________________________________________
Total                           57,026   100.00
Source: Death certificates
Frequency distributions for quantitative data
• With quantitative data, we may need to group the data before showing them in a frequency or relative frequency table.
Frequency distribution by age of FEOC students who have smoked at least once, n=354
Age (years)      n        %
_____________________________
19              52    14.70
20              32     9.00
21              46    12.99
22              67    18.94
23              26     7.35
24              77    21.76
25              54    15.26
_____________________________
Total          354   100.00
Source: Health survey
Frequency distributions for quantitative data
• With quantitative data, it is useful to calculate the cumulative frequency.
Frequency distribution by age of Campus students who have smoked at least once, n=354
Age (years)      n        %   Cumulative %
____________________________________________
19              52    14.70          14.70
20              32     9.00          23.70
21              46    12.99          36.69
22              67    18.94          55.63
23              26     7.35          62.98
24              77    21.76          84.74
25              54    15.26         100.00
____________________________________________
Total          354   100.00
Source: Health survey
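As a rough sketch, cumulative percentages can be computed from the counts in the table above. The variable names are illustrative only. The total is computed from the counts rather than typed in, and the running counts are accumulated before dividing, so the results may differ in the second decimal from a table built by adding already rounded percentages.

```python
from itertools import accumulate

# Ages and counts taken from the table above.
ages = [19, 20, 21, 22, 23, 24, 25]
counts = [52, 32, 46, 67, 26, 77, 54]

total = sum(counts)
cumulative_counts = list(accumulate(counts))  # running total of the counts

percent = [100 * n / total for n in counts]
cumulative_percent = [100 * c / total for c in cumulative_counts]

for age, n, p, cp in zip(ages, counts, percent, cumulative_percent):
    print(f"{age:>3}{n:>6}{p:>8.2f}{cp:>9.2f}")
```

The last cumulative percentage is always 100, which is a quick check that no row was lost.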
Distributions of frequencies for grouped quantitative data
• Frequently, quantitative data take many distinct values, and we have to define intervals to group them.
Distribution of frequencies of ages of children with acute streptococcal pharyngotonsillitis
Age (years)      n        %
_____________________________
<1               2     0.51
1                8     2.00
2               13     3.30
3               29     7.36
4               37     9.39
5               44    11.17
6               51    12.94
7               50    12.69
8               49    12.44
9               32     8.12
10              25     6.35
11              22     5.58
12              14     3.55
13               9     2.28
14               7     1.78
15               2     0.51
_____________________________
Total          394   100.00
Source: Padilla N, Moreno M. Comparison between clarithromycin, azithromycin and propicillin in the management of acute streptococcal pharyngotonsillitis in children. Archivos de Investigación Pediátrica de México 2005; 8:5-11. (In Spanish)
Distributions of frequencies for grouped quantitative data
Distribution of frequencies of ages of children with acute streptococcal pharyngotonsillitis
Age (years)      n        %
_____________________________
<1 - 3          52    13.20
4 - 6          132    33.50
7 - 9          131    33.25
10 - 12         61    15.48
13 - 15         18     4.57
_____________________________
Total          394   100.00
Source: Padilla N, Moreno M. Comparison between clarithromycin, azithromycin and propicillin in the management of acute streptococcal pharyngotonsillitis in children. Archivos de Investigación Pediátrica de México 2005; 8:5-11. (In Spanish)
Grouping data
• Guide
  - Obtain the minimum and maximum values and decide the number of intervals.
  - Use between 5 and 15 intervals.
  - Define the interval limits clearly, so that intervals do not overlap.
  - Make sure that all intervals have the same width.
  - Avoid leaving the first or last interval open-ended.
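The guide above can be illustrated with a short sketch that groups the single-year age counts from the pharyngotonsillitis example into five intervals. The function `group_counts` is a hypothetical helper, and coding the "<1" category as age 0 is an assumption made only for this illustration. Note that the first interval then covers four ages (<1, 1, 2, 3), as in the grouped table of the example.

```python
# Single-year counts from the pharyngotonsillitis example.
# Assumption for this sketch: the "<1" category is coded as age 0.
age_counts = {
    0: 2, 1: 8, 2: 13, 3: 29, 4: 37, 5: 44, 6: 51, 7: 50,
    8: 49, 9: 32, 10: 25, 11: 22, 12: 14, 13: 9, 14: 7, 15: 2,
}


def group_counts(counts, intervals):
    """Sum the counts falling in each closed interval (low, high)."""
    grouped = {}
    for low, high in intervals:
        grouped[(low, high)] = sum(
            n for age, n in counts.items() if low <= age <= high
        )
    return grouped


# Five non-overlapping intervals covering the minimum to the maximum.
intervals = [(0, 3), (4, 6), (7, 9), (10, 12), (13, 15)]
grouped = group_counts(age_counts, intervals)
total = sum(grouped.values())

for (low, high), n in grouped.items():
    print(f"{low:>2}-{high:<3}{n:>5}{100 * n / total:>8.2f}")
```

Because the intervals do not overlap and together cover every age, the grouped counts add up to the same total as the ungrouped ones.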
Charts
• Categorical data
  - Bar chart
  - Pie chart
• Quantitative data
  - Histogram
  - Frequency polygon
Bar chart
• The frequency or relative frequency of a categorical variable can easily be shown in a bar chart.
  - It is used with categorical or discrete numerical data.
  - Each bar represents one category, and its height is the frequency or relative frequency.
  - Bars should be separated.
  - It is very important that the Y axis begins at 0.
Bar chart
[Figure: bar chart "Gastrointestinal infections". X axis: Agents (Cryptos., E. histolyt., E. coli, Giardia, Rotavirus, Shigella). Y axis: Frequency, from 0 to 7.]
Grouped bar chart
• If a categorical variable is recorded in two groups (for example, males and females), we can show the data in a grouped bar chart.
• It allows easy comparison between groups.
Grouped bar chart
[Figure: grouped bar chart "Gastrointestinal infections", with separate bars for males and females. X axis: Agents (Crypt., E. histolyt., E. coli, Giardia, Rotavirus, Shigella). Y axis: Frequency, from 0 to 5.]
Pie chart
• It is an alternative way to show a categorical variable.
• Each slice of the pie corresponds to the frequency or relative frequency of one category of the variable.
• Each pie chart shows only one variable.
• If we want to make comparisons, we need to build two or more pie charts.
Pie chart
[Figure: pie chart "Civil status of women in a community". Married 44%, Single 28%, Divorced 11%, Free union 9%, Widowed 8%.]
Pie chart
[Figure: two pie charts side by side.
Civil status of women in a community: Married 44%, Single 28%, Divorced 11%, Free union 9%, Widowed 8%.
Civil status of men in a community: Married 41%, Single 31%, Free union 16%, Divorced 11%, Widowed 1%.]
Distribution of frequency charts: histograms
• It is useful for quantitative variables.
• There are no spaces between bars.
• The area of each bar, not its height, represents its frequency.
• The X axis should be continuous.
• The Y axis should begin at 0.
• The width of each bar represents the interval of its group.
Distribution of frequency charts: histograms
[Figure: histogram "Number of children of women from Celaya". X axis: Number of children (1 to 8+). Y axis: Number of women, from 0 to 700.]
Distribution of frequency charts: frequency polygon
• It is another way to show the frequency distribution of a numerical variable.
• It is built by joining the midpoints of the tops of the histogram bars.
• We should take into account the width of each bar.
• We can plot more than one polygon in the same chart to make comparisons.
Distribution of frequency charts: frequency polygon
[Figure: frequency polygon "Number of children of women from Celaya". X axis: Number of children (1 to 8+). Y axis: Number of women, from 0 to 700.]
Distribution of frequencies: cumulative histogram
• We can plot it directly from a cumulative frequency table.
• It is not necessary to adjust the height of the bars, because each cumulative frequency represents the total frequency up to and including the upper limit of the interval.
Distribution of frequencies: cumulative histogram
[Figure: cumulative histogram "Cumulative frequency of birthweight" for newborns. X axis: Weight, in intervals 501-, 1001-, 1501-, 2001-, 2501-, 3001-, 3501-, 4001-, 4501-, 5000+. Y axis: Cumulative frequency (%), from 0 to 120.]
Distribution of frequencies: cumulative frequency polygon
• We use it to see the proportions below or above a point on the curve.
• We can read the median and the percentiles directly.
• If the distribution is symmetrical, the curve has a symmetrical S shape.
• If the distribution is skewed to the right or to the left, the curve is flattened on that side.
Distribution of frequencies: cumulative frequency polygon
[Figure: cumulative frequency polygon "Cumulative frequencies of birthweight" for newborns. X axis: Weight, in intervals 501-, 1001-, 1501-, 2001-, 2501-, 3001-, 3501-, 4001-, 4501-, 5000+. Y axis: Cumulative frequency (%), from 0 to 120.]
Other charts: stem-and-leaf plot
• We use it to display quantitative data directly, or as a preliminary step in building a frequency distribution.
  - We organize the data by deciding the number of divisions (5-15).
  - We draw a vertical line and put the first digit of each value to the left of the line (stem) and the second digit to the right of the line (leaves).
Other charts: stem-and-leaf plot
Patient   Age
_______________
1          54
2          35
3          49
4          61
5          58
6          64
7          32
8          57
9          43
10         42
_______________
Stem | Leaves
3 | 52
4 | 932
5 | 487
6 | 14
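The construction of the stem-and-leaf plot for these ten ages can be sketched as follows. The function `stem_and_leaf` and its `sort_leaves` option are hypothetical names for this illustration, and the sketch assumes two-digit values, so the stem is the tens digit and the leaf is the units digit.

```python
from collections import defaultdict

# Ages of the ten patients in the example.
ages = [54, 35, 49, 61, 58, 64, 32, 57, 43, 42]


def stem_and_leaf(values, sort_leaves=False):
    """Build the text lines of a stem-and-leaf plot for two-digit values."""
    stems = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)  # tens digit, units digit
        stems[stem].append(leaf)
    lines = []
    for stem in sorted(stems):
        leaves = sorted(stems[stem]) if sort_leaves else stems[stem]
        lines.append(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
    return lines


plot = stem_and_leaf(ages)
print("\n".join(plot))
```

By default the leaves appear in the order the patients were recorded, as in the example. Passing `sort_leaves=True` orders the leaves within each stem, which makes the median and the extremes easier to read.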
Other charts: box-and-whisker plot
• We draw a vertical line that represents the range of the distribution.
• We draw a horizontal line at the third quartile and another at the first quartile; together they form the box.
• The median of the distribution is shown as a horizontal line inside the box.
Other charts: box-and-whisker plot
[Figure: box-and-whisker plot. Vertical axis from 500 to 5500.]
Location measures
• For categorical variables: percentage
• For quantitative variables:
  - Central tendency measures:
    - Mean
    - Median
    - Mode
  - Dispersion measures:
    - Standard deviation
    - Percentiles
    - Range
Central tendency measures
• Mean
  - It is the conventional arithmetic average.
  - If we have n observations with values $X_i$, the mean is:
$$\bar{X} = \frac{\sum X_i}{n}$$
Central tendency measures in a frequency distribution
• If each data value ($X_i$) occurs with a frequency ($f_i$), then:
$$\bar{X} = \frac{\sum X_i f_i}{n}, \qquad n = \sum f_i$$
• In a grouped distribution, we use the midpoint of each interval as the X value.
Central tendency measures in a frequency distribution
Interval    Midpoint    Frequency (fi)
_________________________________
1 - 3           2            18
4 - 6           5            27
7 - 9           8            34
10 - 12        11            22
13 - 15        14            13
_________________________________
Total                       114
Example of the mean for a grouped distribution
$$\text{Mean} = \frac{(2 \times 18) + (5 \times 27) + (8 \times 34) + (11 \times 22) + (14 \times 13)}{18 + 27 + 34 + 22 + 13} = \frac{36 + 135 + 272 + 242 + 182}{114} = \frac{867}{114} = 7.61$$
Mean = 7.61 years
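The same calculation can be checked with a minimal sketch. The function name `grouped_mean` is illustrative, and the midpoints and frequencies are those of the table above.

```python
# Midpoints and frequencies from the interval table above.
midpoints = [2, 5, 8, 11, 14]
frequencies = [18, 27, 34, 22, 13]


def grouped_mean(x, f):
    """Weighted mean: sum of midpoint times frequency, over total frequency."""
    return sum(xi * fi for xi, fi in zip(x, f)) / sum(f)


mean = grouped_mean(midpoints, frequencies)
print(round(mean, 2))  # 867 / 114
```

Using midpoints is an approximation: it treats every observation in an interval as if it sat at the center of that interval.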
Central tendency measures
• Median
  - It is the value that divides the ordered distribution into two equal parts.
  - If the number of observations is even, the two central values are added and divided by two.
51.2, 53.5, 55.6, 65.0, 74.2: the median is the middle value, thus:
Median = 55.6
51.2, 53.5, 55.6, 61.4, 65.0, 74.2: (55.6 + 61.4) / 2 = 58.5
Median = 58.5
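Both cases, an odd and an even number of observations, can be sketched with a small function. The name `median` here is our own helper, written out to show the rule; Python's standard `statistics.median` applies the same rule and is used only as a cross-check.

```python
import statistics

odd_sample = [51.2, 53.5, 55.6, 65.0, 74.2]
even_sample = [51.2, 53.5, 55.6, 61.4, 65.0, 74.2]


def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(odd_sample))
print(median(even_sample))
print(statistics.median(even_sample))  # cross-check with the standard library
```

The data must be sorted first; the examples above are already in ascending order, but the function does not rely on that.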
Central tendency measures for frequency distributions
• Median
• It is the value at which the cumulative frequency reaches 50%.
[Figure: cumulative frequency polygon "Cumulative frequency of birthweight" for newborns. X axis: Weight, in intervals 501-, 1001-, 1501-, 2001-, 2501-, 3001-, 3501-, 4001-, 4501-, 5000+. Y axis: Cumulative frequency (%), from 0 to 120.]
Central tendency measures
• Mode
  - It is the value that occurs most frequently.
Interval    Midpoint    Frequency (fi)
_________________________________
1 - 3           2            18
4 - 6           5            27
7 - 9           8            34
10 - 12        11            22
13 - 15        14            13
_________________________________
Total                       114
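For a grouped table like this one, the mode is read as the interval with the highest frequency (the modal interval). A minimal sketch, with illustrative variable names:

```python
# Intervals and frequencies from the table above.
intervals = ["1-3", "4-6", "7-9", "10-12", "13-15"]
frequencies = [18, 27, 34, 22, 13]

# Pick the (interval, frequency) pair with the largest frequency.
modal_interval, modal_frequency = max(
    zip(intervals, frequencies), key=lambda pair: pair[1]
)
print(modal_interval, modal_frequency)
```

If two intervals shared the highest frequency, `max` would return only the first one, so a tie would need to be checked separately.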
Central tendency measures
• Properties
  - The mean is sensitive to the tails of the distribution; the median and the mode are not.
  - The mode can be affected by small changes in the data; the median and the mean are much less affected.
  - The mode and the median can be read from a chart.
  - The three measures coincide in a Normal distribution.
Central tendency measures
• Which measure to use?
  - For skewed distributions, we use the median.
  - For statistical analysis or inference, we use the mean.
Dispersion measures
• Range
  - It shows the minimum and maximum values and the difference between them.
51.2, 53.5, 55.6, 61.4, 65.0, 74.2
The range of this distribution is 51.2 to 74.2 kg, a difference of 23.0 kg.
However, the extreme values are far from the center of the distribution, which obscures the fact that most of the data lie between 53.5 and 65 kg.
Dispersion measures
• Percentiles
• A percentile or centile is the value below which a given percentage of the data falls.
Look at the distribution of stature in this population. What are the range, the median, and the 25th and 75th percentiles?
Stature (cm)     n    Relative frequency (%)    Cumulative frequency (%)
_____________________________________________________________________
151              2             0.7                       0.7
152              3             1.1                       1.8
153              6             2.2                       4.0
154             12             4.5                       8.5
155             27            10.0                      18.5
157             29            10.8                      29.3
158             26             9.7                      39.0
159             33            12.3                      51.3
163             37            13.8                      65.1
164             16             5.9                      71.0
165             24             8.9                      79.9
168             18             6.7                      86.6
169             14             5.2                      91.8
171              6             2.2                      94.0
174              7             2.6                      96.6
175              1             0.4                      97.0
177              4             1.5                      98.5
179              2             0.7                      99.2
184              1             0.4                      99.6
185              1             0.4                     100.0
_____________________________________________________________________
Total          269           100.0
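One way to answer the question is to walk down the cumulative frequency column until it first reaches the required percentage. The sketch below does exactly that; the function name `percentile` is illustrative, and other conventions exist (for example, interpolating between neighboring values), which can give slightly different answers.

```python
# (stature in cm, number of people), taken from the table above.
stature_counts = [
    (151, 2), (152, 3),
    (153, 6),  # 153 assumed for the row between 152 and 154
    (154, 12), (155, 27), (157, 29), (158, 26), (159, 33),
    (163, 37), (164, 16), (165, 24), (168, 18), (169, 14),
    (171, 6), (174, 7), (175, 1), (177, 4), (179, 2),
    (184, 1), (185, 1),
]


def percentile(table, p):
    """Smallest value whose cumulative frequency reaches p percent."""
    total = sum(n for _, n in table)
    running = 0
    for value, n in table:
        running += n
        if 100 * running / total >= p:
            return value
    return table[-1][0]


minimum, maximum = stature_counts[0][0], stature_counts[-1][0]
p25 = percentile(stature_counts, 25)
median = percentile(stature_counts, 50)
p75 = percentile(stature_counts, 75)

print("Range:", minimum, "to", maximum, "cm")
print("25th percentile:", p25, "cm")
print("Median:", median, "cm")
print("75th percentile:", p75, "cm")
```

With this convention the range is 151 to 185 cm, the 25th percentile is 157 cm, the median is 159 cm and the 75th percentile is 165 cm, which matches reading the cumulative column by eye.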
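Dispersion measures
• Standard deviation
  - It is the most common way to quantify the variability of a distribution.
  - It measures the distance between each value and the mean.
Subject    Height    Deviation from the mean
______________________________________________
1           1.6          -0.1
2           1.7           0
3           1.8          +0.1
$$\text{Mean deviation} = \frac{\sum (X_i - \bar{X})}{n}$$
$\bar{X} = 1.7$; Mean deviation = ((-0.1) + (0) + (+0.1)) / 3 = 0
Because positive and negative deviations cancel out, the mean deviation is always 0 and cannot be used to measure variability.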
Dispersion measures
• Standard deviation
  - We are interested in the magnitude of the deviations, not in their sign.
  - If we square each deviation, all the values become positive.
  - If we divide the sum of the squared deviations by n - 1, we obtain the variance; its square root is the standard deviation.
Subject    Height    Squared deviation
______________________________________________
1           1.6          0.01
2           1.7          0
3           1.8          0.01
$$\text{Standard deviation} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
$\bar{X} = 1.7$; Standard deviation = $\sqrt{0.02 / 2} = \sqrt{0.01} = 0.1$
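The steps for the three heights (1.6, 1.7 and 1.8) can be followed in a short sketch. The function `sample_sd` is an illustrative name; Python's standard `statistics.stdev`, which also divides by n - 1, is used only as a cross-check. The deviations from the mean are -0.1, 0 and +0.1, so the squared deviations are 0.01, 0 and 0.01.

```python
import math
import statistics

heights = [1.6, 1.7, 1.8]


def sample_sd(values):
    """Standard deviation with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    variance = sum(squared_deviations) / (n - 1)
    return math.sqrt(variance)


sd = sample_sd(heights)
print(round(sd, 2))
print(round(statistics.stdev(heights), 2))  # cross-check with the standard library
```

The sum of squared deviations is 0.02, the variance is 0.02 / 2 = 0.01, and the standard deviation is its square root, 0.1, in the same units as the heights.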
Dispersion measures for grouped data
• Standard deviation
  - It uses the midpoint of each interval.
$$\text{Standard deviation} = \sqrt{\frac{\sum f (X_i - \bar{X})^2}{\sum f - 1}}$$
It can also be expressed as:
$$\text{Standard deviation} = \sqrt{\frac{\sum f X_i^2 - \left(\sum f X_i\right)^2 / \sum f}{\sum f - 1}}$$
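As an illustration, both expressions can be applied to the interval table used earlier for the grouped mean (midpoints 2, 5, 8, 11, 14 with frequencies 18, 27, 34, 22, 13). The function names are hypothetical; the point of the sketch is that the two formulas give the same result.

```python
import math

# Midpoints and frequencies from the grouped mean example.
midpoints = [2, 5, 8, 11, 14]
frequencies = [18, 27, 34, 22, 13]


def grouped_sd_definition(x, f):
    """First form: frequency-weighted squared deviations from the mean."""
    n = sum(f)
    mean = sum(xi * fi for xi, fi in zip(x, f)) / n
    sum_sq = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f))
    return math.sqrt(sum_sq / (n - 1))


def grouped_sd_shortcut(x, f):
    """Second form: uses the sums of f*x and f*x^2 only."""
    n = sum(f)
    sum_fx = sum(fi * xi for xi, fi in zip(x, f))
    sum_fx2 = sum(fi * xi ** 2 for xi, fi in zip(x, f))
    return math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))


sd_definition = grouped_sd_definition(midpoints, frequencies)
sd_shortcut = grouped_sd_shortcut(midpoints, frequencies)
print(round(sd_definition, 2), round(sd_shortcut, 2))
```

The second form is convenient for hand calculation because it does not require computing the mean first; for these data both give a standard deviation of about 3.69 years.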
Dispersion measures for grouped data
• For data with a Normal distribution:
  - Around 68% of the data lie within 1 standard deviation of the mean.
  - Around 95% of the data lie within 2 standard deviations of the mean.
  - Around 99.7% of the data lie within 3 standard deviations of the mean.
• The standard deviation is a measure of the width of the distribution. If the standard deviation changes, the shape of the distribution changes as well.
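These proportions can be checked numerically with the standard Normal distribution from Python's `statistics` module. This is only a sketch; the helper `share_within` is a hypothetical name.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.1f}%")
```

The computed proportions are close to 68%, 95% and 99.7%; the same values hold for any Normal distribution, whatever its mean and standard deviation.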