Biostatistics basics - descriptive statistics

Biostatistics Basics
An introduction to an expansive
and complex field
© 2006
Common statistical terms
• Data
– Measurements or observations of a variable
• Variable
– A characteristic that is observed or
manipulated
– Can take on different values
Evidence-based Chiropractic
Statistical terms (cont.)
• Independent variables
– Precede dependent variables in time
– Are often manipulated by the researcher
– The treatment or intervention that is used in a
study
• Dependent variables
– What is measured as an outcome in a study
– Values depend on the independent variable
Statistical terms (cont.)
• Parameters
– Summary data from a population
• Statistics
– Summary data from a sample
Populations
• A population is the group from which a
sample is drawn
– e.g., headache patients in a chiropractic
office; automobile crash victims in an
emergency room
• In research, it is not practical to include all
members of a population
• Thus, a sample (a subset of a population)
is taken
Random samples
• Subjects are selected from a population so
that each individual has an equal chance
of being selected
• Random samples tend to be representative of the
source population
• Non-random samples may not be
representative
– May be biased regarding age, severity of the
condition, socioeconomic status, etc.
Random samples (cont.)
• Random samples are rarely utilized in
health care research
• Instead, patients are randomly assigned to
treatment and control groups
– Each person has an equal chance of being
assigned to either of the groups
• Random assignment is also known as
randomization
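A minimal sketch of random assignment, assuming a simple two-group design of equal size. The function name `randomize` and the patient labels are hypothetical and used only for illustration; real trials often use more elaborate schemes (e.g., blocked or stratified randomization).

```python
import random

def randomize(patients, seed=None):
    """Shuffle the patients, then split them into two (nearly) equal groups.

    Because the order is random, each patient has the same chance of
    landing in either group when the number of patients is even.
    """
    rng = random.Random(seed)      # seed only makes the example repeatable
    shuffled = list(patients)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical patient identifiers
patients = [f"patient_{i}" for i in range(1, 21)]
groups = randomize(patients, seed=42)
print(len(groups["treatment"]), len(groups["control"]))
```

Every patient ends up in exactly one group, and the split is decided by chance rather than by the researcher.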
Descriptive statistics (DSs)
• A way to summarize data from a sample or
a population
• DSs illustrate the shape, central tendency,
and variability of a set of data
– The shape of data has to do with the
frequencies of the values of observations
DSs (cont.)
– Central tendency describes the location of the
middle of the data
– Variability is the extent to which values are
spread above and below the middle values
• a.k.a. dispersion
• DSs can be distinguished from inferential
statistics
– DSs are not capable of testing hypotheses
Hypothetical study data
(partial from book)
Case #   Visits
1        7
2        2
3        2
4        3
5        4
6        3
7        5
8        3
9        4
10       6
11       2
12       3
13       7
14       4
• Distribution provides a summary of:
– Frequencies of each of the values
• 2 visits: 3 cases
• 3 visits: 4 cases
• 4 visits: 3 cases
• 5 visits: 1 case
• 6 visits: 1 case
• 7 visits: 2 cases
• etc.
– Ranges of values
• Lowest = 2
• Highest = 7
Frequency distribution table
Visits   Frequency   Percent   Cumulative %
2        3           21.4      21.4
3        4           28.6      50.0
4        3           21.4      71.4
5        1           7.1       78.6
6        1           7.1       85.7
7        2           14.3      100.0
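The frequency distribution table can be reproduced from the 14 visit counts in the hypothetical study data. The sketch below is one possible way to do it; the function name `frequency_table` is an assumption for illustration. Cumulative percentages are computed from running counts rather than by adding rounded percentages, so they can differ by 0.1 from a table built by summing rounded values.

```python
from collections import Counter

# Visits for cases 1-14 from the hypothetical study data
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    running = 0  # running count of cases seen so far
    for value in sorted(counts):
        running += counts[value]
        percent = round(100 * counts[value] / n, 1)
        cumulative = round(100 * running / n, 1)
        rows.append((value, counts[value], percent, cumulative))
    return rows

table = frequency_table(visits)
for value, freq, pct, cum in table:
    print(f"{value:>6} {freq:>9} {pct:>7} {cum:>12}")
```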
Frequency distributions are
often depicted by a histogram
Histograms (cont.)
• A histogram is a type of bar chart, but
there are no spaces between the bars
• Histograms are used to visually depict
frequency distributions of continuous data
• Bar charts are used to depict categorical
information
– e.g., Male–Female, Mild–Moderate–Severe,
etc.
Measures of central tendency
• Mean (a.k.a., average)
– The most commonly used DS
• To calculate the mean
– Add all values of a series of numbers and
then divide by the total number of elements
Formula to calculate the mean
• Mean of a sample
$\bar{X} = \frac{\sum X}{n}$
• Mean of a population
$\mu = \frac{\sum X}{N}$
– $\bar{X}$ (X bar) refers to the mean of a sample and $\mu$ refers to the
mean of a population
– $\sum X$ is a command that adds all of the X values
– n is the total number of values in the series of a sample and
N is the same for a population
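As a worked illustration of the formula, the sketch below computes the sample mean of the 14 visit counts from the hypothetical study data. The helper name `mean` is chosen here for illustration only.

```python
# Visits for cases 1-14 from the hypothetical study data
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

def mean(values):
    """Sum all X values, then divide by the number of values (n)."""
    return sum(values) / len(values)

total = sum(visits)        # sum of X = 55
n = len(visits)            # n = 14
print(round(mean(visits), 2))  # 3.93
```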
Measures of central
tendency (cont.)
• Mode
– The most frequently occurring value in a
series
– The modal value is the highest bar in a
histogram
Measures of central
tendency (cont.)
• Median
– The value that divides a series of values in
half when they are all listed in order
– When there is an odd number of values
• The median is the middle value
– When there is an even number of values
• Count from each end of the series toward the
middle and then average the 2 middle values
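The median rule (middle value for an odd count, average of the two middle values for an even count) and the mode can be sketched as follows, again using the visit counts from the hypothetical study data. The function names are illustrative assumptions; Python's standard `statistics` module offers equivalent `median` and `mode` functions.

```python
from collections import Counter

# Visits for cases 1-14 from the hypothetical study data
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

def median(values):
    """Middle value of the ordered series (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mode(values):
    """Most frequently occurring value (if several tie, only one of them is returned)."""
    return Counter(values).most_common(1)[0][0]

print(median(visits))      # 3.5 (14 values: average of the 7th and 8th, 3 and 4)
print(mode(visits))        # 3   (occurs 4 times)
print(median([2, 3, 7]))   # 3   (odd number of values: the middle one)
```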
Measures of central
tendency (cont.)
• Each of the three methods of measuring
central tendency has certain advantages
and disadvantages
• Which method should be used?
– It depends on the type of data that is being
analyzed
– e.g., categorical, continuous, and the level of
measurement that is involved
Levels of measurement
• There are 4 levels of measurement
– Nominal, ordinal, interval, and ratio
1. Nominal
– Data are coded by a number, name, or letter
that is assigned to a category or group
– Examples
• Gender (e.g., male, female)
• Treatment preference (e.g., manipulation,
mobilization, massage)
Levels of measurement (cont.)
2. Ordinal
– Is similar to nominal because the
measurements involve categories
– However, the categories are ordered by rank
– Examples
• Pain level (e.g., mild, moderate, severe)
• Military rank (e.g., lieutenant, captain, major,
colonel, general)
Levels of measurement (cont.)
• Ordinal values only describe order, not
quantity
– Thus, severe pain is not the same as 2 times
mild pain
• The only arithmetic operation allowed
for nominal and ordinal data is counting
the members of each category
– e.g., 25 males and 30 females
Levels of measurement (cont.)
3. Interval
– Measurements are ordered (like ordinal
data)
– Have equal intervals
– Do not have a true zero
– Examples
• The Fahrenheit scale, where 0° does not
correspond to an absence of heat (no true zero)
• In contrast to Kelvin, which does have a true zero
Levels of measurement (cont.)
4. Ratio
– Measurements have equal intervals
– There is a true zero
– Ratio is the most advanced level of
measurement, which can handle most types
of mathematical operations
Levels of measurement (cont.)
• Ratio examples
– Range of motion
• No movement corresponds to zero degrees
• The interval between 10 and 20 degrees is the
same as between 40 and 50 degrees
– Lifting capacity
• A person who is unable to lift scores zero
• A person who lifts 30 kg can lift twice as much as
one who lifts 15 kg
Levels of measurement (cont.)
• NOIR is a mnemonic to help remember
the names and order of the levels of
measurement
– Nominal
– Ordinal
– Interval
– Ratio
Levels of measurement (cont.)
Measurement scale   Permissible mathematical operations                    Best measure of central tendency
Nominal             Counting                                               Mode
Ordinal             Greater or less than operations                        Median
Interval            Addition and subtraction                               Symmetrical: mean; skewed: median
Ratio               Addition, subtraction, multiplication and division     Symmetrical: mean; skewed: median
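The selection rule in this table can be written as a small lookup. This is only a sketch: the function name `best_central_tendency` is hypothetical, and whether a distribution counts as skewed is left to the caller as a yes/no judgment.

```python
def best_central_tendency(level, skewed=False):
    """Return the preferred measure of central tendency for a measurement level.

    level  -- "nominal", "ordinal", "interval", or "ratio"
    skewed -- only matters for interval and ratio data
    """
    level = level.lower()
    if level == "nominal":
        return "mode"
    if level == "ordinal":
        return "median"
    if level in ("interval", "ratio"):
        return "median" if skewed else "mean"
    raise ValueError(f"unknown level of measurement: {level}")

print(best_central_tendency("nominal"))              # mode
print(best_central_tendency("ratio"))                # mean
print(best_central_tendency("ratio", skewed=True))   # median
```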
The shape of data
• Histograms of frequency distributions have
shape
• Distributions are often symmetrical with
most scores falling in the middle and fewer
toward the extremes
• Many biological measurements are approximately
symmetrically distributed and form a normal curve
(a.k.a. bell-shaped curve)
The shape of data (cont.)
[Figure: a line depicting the shape of the data]
The normal distribution
• Data whose frequencies follow a normal curve
have a normal distribution (a.k.a. Gaussian
distribution)
• Properties of a normal distribution
– It is symmetric about its mean
– The highest point is at its mean
– The height of the curve decreases as one
moves away from the mean in either direction,
approaching, but never reaching zero
The normal distribution (cont.)
[Figure: normal curve with the mean marked]
• The highest point of the overlying normal curve
is at the mean
• As one moves away from the mean in either direction,
the height of the curve decreases, approaching,
but never reaching zero
• A normal distribution is symmetric about its mean
The normal distribution (cont.)
Mean = Median = Mode
Skewed distributions
• The data are not distributed symmetrically
in skewed distributions
– Consequently, the mean, median, and mode
are not equal and are in different positions
– Scores are clustered at one end of the
distribution
– A small number of extreme values lie in the
tail at the opposite end
Skewed distributions (cont.)
• Skew is always toward the direction of the
longer tail
– Positive if skewed to the right
– Negative if to the left
• The mean is shifted the most, in the direction
of the longer tail
Skewed distributions (cont.)
• Because the mean is shifted so much, it is
not the best estimate of the average score
for skewed distributions
• The median is a better estimate of the
center of skewed distributions
– It will be the central point of any distribution
– 50% of the values are above and 50% below
the median
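To see why the median is preferred for skewed data, the sketch below compares the mean and median of two small made-up data sets (both are assumptions invented for this illustration, not data from the study). One is symmetric; the other has a single extreme value that produces a long right tail.

```python
import statistics

# Made-up symmetric data: mean, median, and mode coincide
symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]

# Made-up positively skewed data: one extreme value on the right
skewed = [2, 2, 3, 3, 3, 4, 4, 30]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```

In the symmetric set the mean and median are both 4. In the skewed set the single value of 30 pulls the mean up to 6.375, while the median stays at 3, close to where most of the values lie.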
More properties
of normal curves
• About 68.3% of the area under a normal
curve is within one standard deviation
(SD) of the mean
• About 95.5% is within two SDs
• About 99.7% is within three SDs
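These proportions can be checked numerically. For a normal distribution, the area within k SDs of the mean equals erf(k / √2), where erf is the error function available in Python's standard `math` module. The function name `area_within` is an illustrative assumption.

```python
import math

def area_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * area_within(k):.2f}%")
```

The results (roughly 68.27%, 95.45%, and 99.73%) agree with the rounded figures listed above.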
More properties
of normal curves (cont.)
Standard deviation (SD)
• SD is a measure of the variability of a set
of data
• The mean represents the average of a
group of scores, with some of the scores
being above the mean and some below
– This range of scores is referred to as
variability or spread
• Variance (S²) is another measure of
spread
SD (cont.)
• In effect, SD is the average amount of
spread in a distribution of scores
• The next slide shows a group of 10 patients
whose mean age is 40 years
– Some are older than 40 and some younger
SD (cont.)
[Figure: patient ages plotted along an X axis]
• Ages are spread out along an X axis
• The amount ages are spread out is known as
dispersion or spread
Distances that ages deviate above
and below the mean
[Figure: deviation of each age from the mean]
• Adding the deviations always equals zero
Calculating S²
• To find the average deviation, one would normally
total the deviations above and below the
mean, add them together, and then divide
by the number of values
• However, the total always equals zero
– Deviations must first be squared, which cancels
the negative signs
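The steps above can be sketched in code. The ten ages below are made up for illustration (chosen so that their mean is 40); they are not the values shown on the slides. The sketch also assumes the usual sample convention of dividing the sum of squared deviations by n − 1, which is what Python's `statistics.variance` does; dividing by n instead gives the population variance.

```python
import math
import statistics

# Hypothetical ages of 10 patients with a mean of 40 years
ages = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

mean_age = sum(ages) / len(ages)
deviations = [age - mean_age for age in ages]

# Deviations above and below the mean cancel out
print(sum(deviations))                      # 0.0

# Squaring removes the negative signs
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (len(ages) - 1)   # S squared, in years squared
sample_sd = math.sqrt(sample_variance)               # SD, back in years

print(round(sample_variance, 2), round(sample_sd, 2))
```

Taking the square root returns the measure of spread to the original units (years of age).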
Calculating S² (cont.)
• S² is not in the same units as the data (age),
but SD is
• The SD of a sample is conventionally written S (or s);
σ is the symbol for a population
Calculating SD with Excel
Enter values in a column
SD with Excel (cont.)
Click Data Analysis
on the Tools menu
SD with Excel (cont.)
Select Descriptive
Statistics and click OK
SD with Excel (cont.)
Click Input Range icon
SD with Excel (cont.)
Highlight all the
values in the column
SD with Excel (cont.)
Check the labels box if labels are
in the first row
Check Summary Statistics
Click OK
SD with Excel (cont.)
SD is calculated precisely
Plus several other DSs
Wide spread results in higher SDs;
narrow spread in lower SDs
Spread is important when
comparing 2 or more group means
It is more difficult to
see a clear distinction
between groups
in the upper example
because the spread is
wider, even though the
means are the same
z-scores
• The number of SDs that a specific score is
above or below the mean in a distribution
• Raw scores can be converted to z-scores
by subtracting the mean from the raw
score then dividing the difference by the
SD
$z = \frac{X - \mu}{\sigma}$
z-scores (cont.)
• Standardization
– The process of converting raw to z-scores
– The resulting distribution of z-scores will
always have a mean of zero, an SD of one,
and an area under the curve equal to one
• The proportion of scores that are higher or
lower than a specific z-score can be
determined by referring to a z-table
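A short sketch of standardization follows. The ages are made up for illustration, and the data set is treated as a whole population so that the population SD is used, matching the μ and σ in the z-score formula. The function name `standardize` is an assumption.

```python
import statistics

# Hypothetical ages of 10 patients (mean = 40)
ages = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

def standardize(values):
    """Convert raw scores to z-scores: subtract the mean, divide by the SD."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population SD
    return [(x - mu) / sigma for x in values]

z = standardize(ages)

print(round(z[-1], 2))   # z-score of the 55-year-old
# After standardization the mean is 0 and the SD is 1
# (up to floating-point rounding)
print(round(statistics.mean(z), 10), round(statistics.pstdev(z), 10))
```

A z-score of about 1.78 for the 55-year-old means that age lies roughly 1.78 SDs above the group mean.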
z-scores (cont.)
Refer to a z-table
to find proportion
under the curve
z-scores (cont.)
Partial z-table (to z = 1.5) showing proportions of the
area under a normal curve for different values of z.
Each entry corresponds to the area under the curve shown in black.
z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
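Entries of a z-table like this one can be computed instead of looked up. The sketch below assumes the table lists the area under the standard normal curve to the left of z, which is consistent with the value 0.5000 at z = 0, and computes it with the error function from Python's `math` module. The names `area_below` and `z_table` are illustrative assumptions.

```python
import math

def area_below(z):
    """Proportion of the standard normal curve lying to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_table(max_tenths=15):
    """Build {z: area} for z = 0.00, 0.01, ... up to (max_tenths / 10) + 0.09."""
    table = {}
    for tenths in range(max_tenths + 1):
        for hundredths in range(10):
            z = round(tenths / 10 + hundredths / 100, 2)
            table[z] = round(area_below(z), 4)
    return table

table = z_table()
print(table[1.0])               # 0.8413
print(table[1.5])               # 0.9332
print(round(1 - table[1.5], 4)) # proportion of scores above z = 1.5
```

Subtracting a table entry from 1 gives the proportion of scores higher than that z-score.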