Uploaded by IBRAHIM SSALI

MSc Biochemistry Biostat 2021(1)

MBC7201 Statistical Data Analysis
Course Objectives
• Explain the principles underlying the various statistical methods used for data analysis
• Analyze scientific data using various statistical approaches
• Apply statistical models during their experimental design process
• Report their data collected in a scientific way backed up by statistical analysis
Objectives of statistical analysis
• To summarize data into meaningful interpretable outputs
• To assess underlying relationships between factors/attributes of interest
• To help interpret quantitative results
Descriptive statistics
• Organise
• Summarise
• Simplify
• Describe and present data
Inferential statistics
• Generalise from samples to populations
• Hypothesis testing
• Make predictions
Some definitions
• Statistics: Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.
• Biostatistics: Statistical processes and methods applied to the collection, analysis, and interpretation of biological data, especially data relating to human biology, health, and medicine.
Some definitions
• Parameter (describes a population): e.g. µ, σ
• Statistic (describes a sample): e.g. x̄, s²
• Variable: Characteristic or attribute that can assume different values, e.g. Gender
• Attribute: A value the variable can take, e.g. Male, Female
• Random Variable: A variable whose values are determined by chance.
Data
Qualitative data: non-numerical, for example:
“I even one time helped the doctor during a surgery operation and this was so exciting to me”
“that was the most interesting part of it, you really feel what you have been learning, and it was interesting especially in the theatre”
Quantitative data: numerical variables
• Discrete: 5 mosquitos, 3 colonies
• Continuous: 11.54 g/dl Hb, 3.652 µg/ml CRP
Relationship between a variable, its attributes, and values
Variable: Hemoglobin status
Attribute        Value
Normal           13
Mild anemia      11
Severe anemia    6
Qualities of Variables
• Exhaustive: Should include all possible answerable responses.
• Mutually exclusive: No respondent should be able to have two attributes of a variable simultaneously (for example, male vs. female, employed vs. unemployed).
Levels of Measurement
• Measurement is the process of assigning numbers to quantities
  ― This permits data to be categorised, quantified and/or analysed in order that meaningful conclusions can be drawn
• The level of measurement:
  ― helps you decide what statistical analysis is appropriate on the values that were assigned
  ― helps you decide how to interpret the data from that variable
Levels of Measurement
1. Nominal Scale (Lowest Level)
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale (Highest Level)
Nominal Scale
• A measure of identity or category
• Useful for quantifying qualitative data
• Categories must be both mutually exclusive and
comprehensive/exhaustive
• Provides no information regarding either order or
magnitude
Ordinal Scale
• When attributes can be rank-ordered…
• Provides no information regarding magnitude
• Distances between attributes do not have any meaning
Table 2 Background characteristics of respondents
Percent distribution of women and men age 15-49 by selected background
characteristics, Uganda DHS 2016
UDHS 2016
Interval Scale
• When distance between attributes has meaning, for example, temperature (in Celsius): the distance from 30 to 40 is the same as the distance from 70 to 80.
• No “true zero.” For example, there is no such thing as “no temperature,” at least not with Celsius.
• Zero does not mean the absence of value, but is actually another number used on the scale, like 0 degrees Celsius.
• Negative numbers also have meaning.
• Without a true zero, it is impossible to compute meaningful ratios:
  ― 80 °C is not twice as hot as 40 °C
  ― Convert to Fahrenheit
  ― Convert to K
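To see why the ratio is not meaningful, the two temperatures can be converted as the slide suggests. The sketch below uses the standard conversion formulas; the function and variable names are illustrative only.

```python
# Compare the 80:40 ratio on three temperature scales.

def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Standard conversion: K = C + 273.15."""
    return c + 273.15


hot_c, warm_c = 80.0, 40.0

ratio_celsius = hot_c / warm_c
ratio_fahrenheit = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(warm_c)
ratio_kelvin = celsius_to_kelvin(hot_c) / celsius_to_kelvin(warm_c)

print("Celsius ratio:   ", ratio_celsius)
print("Fahrenheit ratio:", round(ratio_fahrenheit, 3))
print("Kelvin ratio:    ", round(ratio_kelvin, 3))
```

The "twice as hot" ratio changes with the unit, so it is an artifact of where each scale puts its zero. Only the Kelvin ratio is measured from an absolute zero.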
Ratio Scale
• Has an absolute zero that is meaningful
• Can construct a meaningful ratio (fraction), for example, number of clients in past six months
• It is meaningful to say that “...we had twice as many clients in this period as we did in the previous six months”
The Hierarchy of Levels
• Ratio: Absolute zero
• Interval: Distance is meaningful
• Ordinal: Attributes can be ordered
• Nominal: Attributes are only named; “weakest”
Nominal, Ordinal, Interval or Ratio?
1. Blood lactate concentration (mmol l⁻¹)
2. Profile of Mood States (Scale 1-7)
3. Heart Rate (beats min⁻¹)
4. Blood Group
5. Year of Birth (AD)
6. Atmospheric pressure (mmHg)
7. The absorbance of a mixture of Biuret reagent with serum samples
8. Hemoglobin (g/dL)
9. Anemia (mild, moderate, severe)
10. pH using a pH meter (Omar)
11. pH using litmus paper (Omar)
12. Enzyme activity (Katal) (Steven Ogaba)
13. Vitamin A (IU) (Amos)
The kelvin (symbol K) is the base unit of temperature in the International
System of Units (SI). The Kelvin scale is an absolute thermodynamic
temperature scale using as its null point absolute zero, the temperature at
which all thermal motion ceases in the classical description of
thermodynamics.
The Kelvin scale is named after the Belfast-born, Glasgow University engineer
and physicist William Thomson, Lord Kelvin (1824–1907), who wrote of the need for an
"absolute thermometric scale". Unlike the degree Fahrenheit and degree
Celsius, the kelvin is not referred to or typeset as a degree.
Descriptive Statistics
Descriptive Statistics are used by researchers to report on populations and samples (but mostly samples)
Descriptive Statistics
• Summary descriptions of measurements (variables) taken from a group of people
• By summarizing information, Descriptive Statistics speed up and simplify comprehension of a group’s characteristics
Sample vs. Population
Population
Sample
Descriptive Statistics
Which Group is Smarter?
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
Class B: IQs of 13 Students
127 162 131 103 96 111 80 109 93 87 120 105 109
Each individual may be different. If you try to understand a group by remembering the
qualities of each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics
Which group is smarter now?
Class A: Average IQ
110.54
Class B: Average IQ
110.23
They are roughly the same!
• With a summary descriptive statistic, it is much easier to
answer our question.
Types of descriptive statistics
Organize and Present Data
• Tables
• Graphs
  ― Bar Chart
  ― Histogram
  ― Stem and Leaf Plot
  ― Box plot*
  ― Scatter plot*
Summarize Data
• Central Tendency
• Variation (including dispersion)
Frequency Distribution
Frequency Distribution of IQ for Two Classes
IQ        Frequency
82.00     1
87.00     1
89.00     1
93.00     2
96.00     1
97.00     1
98.00     1
102.00    1
103.00    1
105.00    1
106.00    1
107.00    1
109.00    1
111.00    1
115.00    1
119.00    1
120.00    1
127.00    1
128.00    1
131.00    2
140.00    1
162.00    1
Total     24
Relative Frequency Distribution of IQ for Two Classes
IQ        Frequency   Percent   Valid Percent   Cumulative Percent
82.00     1           4.2       4.2             4.2
87.00     1           4.2       4.2             8.3
89.00     1           4.2       4.2             12.5
93.00     2           8.3       8.3             20.8
96.00     1           4.2       4.2             25.0
97.00     1           4.2       4.2             29.2
98.00     1           4.2       4.2             33.3
102.00    1           4.2       4.2             37.5
103.00    1           4.2       4.2             41.7
105.00    1           4.2       4.2             45.8
106.00    1           4.2       4.2             50.0
107.00    1           4.2       4.2             54.2
109.00    1           4.2       4.2             58.3
111.00    1           4.2       4.2             62.5
115.00    1           4.2       4.2             66.7
119.00    1           4.2       4.2             70.8
120.00    1           4.2       4.2             75.0
127.00    1           4.2       4.2             79.2
128.00    1           4.2       4.2             83.3
131.00    2           8.3       8.3             91.7
140.00    1           4.2       4.2             95.8
162.00    1           4.2       4.2             100.0
Total     24          100.0     100.0
Grouped Relative Frequency Distribution
Relative Frequency Distribution of IQ for Two Classes
IQ              Frequency   Percent   Cumulative Percent
80 – 89         3           12.5      12.5
90 – 99         5           20.8      33.3
100 – 109       6           25.0      58.3
110 – 119       3           12.5      70.8
120 – 129       3           12.5      83.3
130 – 139       2           8.3       91.7
140 – 149       1           4.2       95.8
150 and over    1           4.2       100.0
Total           24          100.0     100.0
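The frequency, percent and cumulative percent columns can be rebuilt from the raw values. The sketch below is one way to do it, using the 24 IQ values listed in the frequency table above. The helper `bin_label` and the other names are illustrative assumptions, not part of the slides.

```python
from collections import Counter

# The 24 IQ values from the frequency table (93 and 131 occur twice).
iq = [82, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106,
      107, 109, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]
n = len(iq)

# Ungrouped table: frequency, percent and cumulative percent per value.
freq = Counter(iq)
rows = []
cumulative = 0.0
for value in sorted(freq):
    percent = 100 * freq[value] / n
    cumulative += percent
    rows.append((value, freq[value], round(percent, 1), round(cumulative, 1)))


def bin_label(value):
    """Assign a value to a 10-point class interval, with an open top class."""
    if value >= 150:
        return "150 and over"
    low = value // 10 * 10
    return f"{low}-{low + 9}"


# Grouped table: count the values that fall in each class interval.
grouped = Counter(bin_label(v) for v in iq)

for row in rows:
    print(row)
for label, count in grouped.items():
    print(label, count, round(100 * count / n, 1))
```

Grouping trades detail for readability: 22 distinct values collapse into 8 class intervals.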
Histogram
[Figure: histogram based on the IQ data. x-axis: IQ (80 to 160); y-axis: Frequency (0 to 6). Mean = 110.4583, Std. Dev. = 19.00338, N = 24]
Bar Graph
Effect of an intervention to introduce OFSP on vitamin A intake
IP = Intensive program; RP = Reduced program
Stem and Leaf Plot
Stem and Leaf Plot of IQ for Two Classes
Stem   Leaf
8      279
9      3678
10     235679
11     159
12     078
13     1
14     0
15
16     2
Key: 8 | 2 means 82
Summarizing data
Central Tendency (or Measure of Location or the “Middle Values” of the Groups)
• Mean
• Median
• Mode
Variation (or Measure of Dispersion or Summary of Differences within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation
Mean
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
Σ Yi = 1437
Y-barA = Σ Yi / n = 1437 / 13 = 110.54
Class B: IQs of 13 Students
127 162 131 103 96 111 80 109 93 87 120 105 109
Σ Yi = 1433
Y-barB = Σ Yi / n = 1433 / 13 = 110.23
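The same calculation can be checked in a few lines. This is a minimal sketch using the two class lists from the slides; the function name `mean` is an illustrative choice.

```python
# IQ scores for the two classes, as listed on the slide.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


print("Class A: sum =", sum(class_a), "mean =", round(mean(class_a), 2))
print("Class B: sum =", sum(class_b), "mean =", round(mean(class_b), 2))
```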
Mean
The mean is the “balance point”
Each person’s score is like 1 pound placed at the score’s position on a see-saw. Below, on a
200 cm see-saw, the mean equals 110, the place on the see-saw where a fulcrum
finds balance:
[Figure: see-saw with 1 lb weights at 93 cm (17 units below the mean), 106 cm (4 units below) and 131 cm (21 units above); the fulcrum sits at 110 cm (0 units).]
The scale is balanced because 17 + 4 on the left = 21 on the right.
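The balance can be verified numerically. The sketch below uses the three positions from the see-saw figure; the variable names are illustrative.

```python
# Positions (in cm) of three equal 1 lb weights on the see-saw.
positions_cm = [93, 106, 131]

# The balance point of equal weights is their mean position.
fulcrum = sum(positions_cm) / len(positions_cm)

# Total distance from the fulcrum on each side.
left = sum(fulcrum - p for p in positions_cm if p < fulcrum)
right = sum(p - fulcrum for p in positions_cm if p > fulcrum)

print("Fulcrum at", fulcrum, "cm")
print("Left side total:", left, "Right side total:", right)
```

The two totals are equal whenever the fulcrum is placed at the mean, whatever the data.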
Mean
1. Means can be badly affected by outliers (data points with
extreme values unlike the rest)
2. Outliers can make the mean a bad measure of central
tendency or common experience
[Figure: Income in Uganda. Labels: “All of us”, “Mean”, and “Outlier” (Sudhir Ruparelia).]
Median
• The middle value when a variable’s values are ranked in order; the point
that divides a distribution into two equal halves.
• When data are listed in order, the median is the point at which 50% of the
cases are above and 50% below it.
• The 50th percentile.
Median
Class A: IQs of 13 Students (ranked in order)
89 93 97 98 102 106 109 110 115 119 128 131 140
Median = 109
(six cases above, six below)
If the first student were to drop out of Class A, there would be a new median, the average of the two middle values of the remaining 12 scores:
Median = (109 + 110) / 2 = 219 / 2 = 109.5
(six cases above, six below)
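Both medians can be reproduced with Python's standard `statistics` module. The sketch assumes that "the first student" is the first one in the ranked list (IQ 89); dropping any student who scored below 109 gives the same result.

```python
import statistics

# Class A scores, ranked in order.
class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

# Odd number of cases: the median is the single middle value.
median_all = statistics.median(class_a)

# Assumption for illustration: the first (lowest-ranked) student drops out.
remaining = class_a[1:]

# Even number of cases: the median is the mean of the two middle values.
median_after = statistics.median(remaining)

print("Median of 13 scores:", median_all)
print("Median of 12 scores:", median_after)
```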
Median
• The median is largely unaffected by outliers, so when data are skewed it describes the “typical person” better than the mean does.
• If the recorded values for a variable form a symmetric distribution, the median and
mean are identical.
• In skewed data, the mean lies further toward the skew than the median.
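The sketch below contrasts how the mean and the median react to one extreme value. The income figures are hypothetical, in arbitrary units, and are not data from the slides.

```python
import statistics

# Hypothetical incomes (arbitrary units), invented for illustration only.
incomes = [2, 3, 3, 4, 5]
incomes_with_outlier = incomes + [1000]  # add one extremely high earner

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print("Without outlier: mean =", mean_before, "median =", median_before)
print("With outlier:    mean =", mean_after, "median =", median_after)
```

One extreme case pulls the mean far toward the tail, while the median barely moves.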
[Figure: a symmetric distribution, where the mean and median coincide, and a skewed distribution, where the mean and median are separated.]
Mode
• The most common data point
• The mode may not be at the center of a distribution
The combined IQ scores for Classes A & B:
80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162
The mode is 109, which occurs three times.
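The mode of the combined scores can be found by counting. This sketch uses `statistics.multimode` (Python 3.8 or later), which returns every most-frequent value and so also copes with bimodal data.

```python
import statistics
from collections import Counter

# Combined IQ scores for Classes A and B, as listed above.
combined = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
            109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

modes = statistics.multimode(combined)  # list of all most-common values
counts = Counter(combined)

print("Mode(s):", modes)
print("Occurrences of 109:", counts[109])
```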
Bimodal distributions
Mode
1. It may give you the most likely experience rather than the “typical” or “central” experience.
2. In symmetric, unimodal distributions, the mean, median, and mode are the same.
3. In skewed data, the mean and median lie further toward the skew than the mode.
[Figure: a symmetric distribution, where the mean, median and mode coincide, and a skewed distribution, where they fall in the order mode, median, mean.]
Summarizing data
Central Tendency (or Measure of Location or the “Middle Values” of the Groups)
✓ Mean
✓ Median
✓ Mode
Variation (or Measure of Dispersion or Summary of Differences within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation
Interquartile Range
• A quartile is the value that marks one of the divisions that breaks a series of values into four equal
parts.
• The median is a quartile and divides the cases in half.
• 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
• 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.
• The interquartile range is the distance or range between the 25th percentile and the 75th
percentile. Below, what is the interquartile range?
[Figure: scale from 0 to 1000 with divisions at 250, 500 and 750; each of the four segments contains 25% of cases.]
Interquartile range (IQR)
IQR = Q3 – Q1
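As an illustration, the quartiles and IQR of the Class A scores can be computed with `statistics.quantiles` (Python 3.8 or later). The `method="inclusive"` setting is an assumption made for this sketch; statistical packages use several quartile conventions, and the values can differ slightly between them.

```python
import statistics

# Class A IQ scores from the earlier slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(class_a, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, "Median =", q2, "Q3 =", q3)
print("IQR = Q3 - Q1 =", iqr)
```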
Variance
A measure of the spread of the recorded values on a variable. A measure of dispersion.
• The larger the variance, the further the individual cases are from the mean.
• The smaller the variance, the closer the individual scores are to the mean.
Variance
Calculating variance starts with a “deviation.”
A deviation is the distance of a case’s score from the mean:
Yi – Y-bar
• We want to add the individual deviations to get total deviation, but if we were to do that, we would get zero every time. Why?
• We need a way to eliminate negative signs. Squaring the deviations will eliminate negative signs:
(Yi – Y-bar)²
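One way to answer the “Why?” is to expand the sum of deviations using the definition of the mean. In the short derivation below, \bar{Y} stands for Y-bar and n for the number of cases:

```latex
\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)
  = \sum_{i=1}^{n} Y_i - n\bar{Y}
  = n\bar{Y} - n\bar{Y}
  = 0,
\qquad \text{since } \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i .
```

The positive and negative deviations always cancel exactly, which is why they are squared before being summed.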
Variance
When the squared deviations are all added together, you get what we call the “Sum of Squares.”
Sum of Squares (SS) = Σ (Yi – Y-bar)²
SS = (Y1 – Y-bar)² + (Y2 – Y-bar)² + . . . + (Yn – Y-bar)²
The variance is approximately the average of the squared deviations:
• SS / N = Variance for a population
• SS / (n – 1) = Variance for a sample
• Variance = Σ(Yi – Y-bar)² / (n – 1)
Standard Deviation
Variance is expressed in squared units, so to get back to the original units we take its square root, the standard deviation.
The standard deviation indicates roughly the typical deviation of the observations from the mean.
s.d. = √( Σ(Yi – Y-bar)² / (n – 1) )
Summary of steps
• Variance
– Deviation from mean for each case
– Square the deviations
– Sum of squares
– Divide by n – 1
• Standard deviation
– Square root of the variance
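The steps above can be sketched directly in code. The example uses the Class A scores and compares the hand-computed sample variance with `statistics.variance` as a check; the variable names are illustrative.

```python
import math
import statistics

# Class A IQ scores from the earlier slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
n = len(class_a)
mean = sum(class_a) / n

# Step 1: deviation from the mean for each case.
deviations = [y - mean for y in class_a]
# Step 2: square the deviations.
squared_deviations = [d ** 2 for d in deviations]
# Step 3: sum of squares (SS).
sum_of_squares = sum(squared_deviations)
# Step 4: divide by n - 1 to get the sample variance.
variance = sum_of_squares / (n - 1)
# Step 5: the standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print("Sum of deviations:", round(sum(deviations), 10))
print("Variance:", round(variance, 2), "s.d.:", round(sd, 2))
print("Matches statistics.variance:",
      math.isclose(variance, statistics.variance(class_a)))
```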
Standard Deviation
1. Larger s.d. = greater amounts of variation around the mean. For example:
   [Figure: two distributions, both with Y-bar = 25. Left: s.d. = 3, axis marked at 19, 25 and 31. Right: s.d. = 6, axis marked at 13, 25 and 37.]
2. s.d. = 0 only when all values are the same (this means you have a constant and not a “variable”).
3. If you were to “rescale” a variable, the s.d. would change by the same factor: if we changed units above so the mean equaled 250, the s.d. on the left would be 30, and on the right, 60.
4. Like the mean, the s.d. will be inflated by an outlier case value.
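Points 2 to 4 can be checked with a short sketch. It uses the Class A scores; the constant list and the outlier value of 250 are hypothetical, chosen only for illustration.

```python
import math
import statistics

# Class A IQ scores from the earlier slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
sd_original = statistics.stdev(class_a)

# Point 2: a constant has no variation, so its s.d. is zero.
sd_constant = statistics.stdev([25, 25, 25, 25, 25])

# Point 3: multiplying every value by 10 multiplies the s.d. by 10.
sd_rescaled = statistics.stdev([10 * y for y in class_a])

# Point 4: one extreme (hypothetical) value inflates the s.d.
sd_with_outlier = statistics.stdev(class_a + [250])

print("Original s.d.:", round(sd_original, 2))
print("Constant s.d.:", sd_constant)
print("Rescaled s.d.:", round(sd_rescaled, 2))
print("s.d. with outlier:", round(sd_with_outlier, 2))
```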
Summarizing data
Central Tendency (or Measure of Location or the “Middle Values” of the Groups)
✓ Mean
✓ Median
✓ Mode
Variation (or Measure of Dispersion or Summary of Differences within Groups)
✓ Range
✓ Interquartile Range
✓ Variance
✓ Standard Deviation
Box-Plots*
A way to graphically portray almost all the descriptive statistics at once
Outliers
• Mild outlier: any point beyond an inner fence (1.5 × IQR below Q1 or above Q3) on either side
• Extreme outlier: any point beyond an outer fence (3 × IQR below Q1 or above Q3)
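The fence rules can be applied to the combined IQ scores of Classes A and B. The sketch assumes the usual convention of inner fences at 1.5 × IQR and outer fences at 3 × IQR beyond the quartiles, and the "inclusive" quartile method of `statistics.quantiles`. Other quartile conventions may move the fences slightly.

```python
import statistics

# Combined IQ scores for Classes A and B from the earlier slides.
combined = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
            109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

q1, _, q3 = statistics.quantiles(combined, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences: 1.5 * IQR beyond the quartiles.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Outer fences: 3 * IQR beyond the quartiles.
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Extreme outliers lie beyond an outer fence.
extreme = [y for y in combined if y < outer_low or y > outer_high]
# Mild outliers lie beyond an inner fence but within the outer fences.
mild = [y for y in combined
        if (y < inner_low or y > inner_high) and y not in extreme]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("Inner fences:", inner_low, inner_high)
print("Outer fences:", outer_low, outer_high)
print("Mild outliers:", mild, "Extreme outliers:", extreme)
```

Under these assumptions the score of 162 falls beyond the upper inner fence but inside the outer fence, so it would be drawn as a mild outlier.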
Box plots
C-Reactive protein as a prognostic indicator in hospitalized patients with COVID-19
Scatter diagrams