Essence of Biostatistics

Why do we need to learn biostatistics?
• Essential for the scientific method of investigation
  – Formulate hypotheses
  – Design studies to objectively test hypotheses
  – Collect reliable and unbiased data
  – Process and evaluate data rigorously
  – Interpret and draw appropriate conclusions
• Essential for understanding, appraisal and critique of the scientific literature
Types of statistical variables
• Descriptive (categorical) variables
  – Nominal variables (no order between values): gender, eye color, race group, ...
  – Ordinal variables (inherent order among values): response to treatment (none, slow, moderate, fast)
• Measurement variables
  – Continuous measurement variables: height, weight, blood pressure, ...
  – Discrete measurement variables (values are integers): number of siblings, number of times a person has been admitted to hospital, ...
Statistical variables
• It is important to distinguish the different types of statistical variables and the data they generate, because the choice of summary indices, charts, and statistical tests depends on the type of variable involved
Types of statistical methods
• Descriptive statistical methods
  – Provide summary indices for a given dataset, e.g. arithmetic mean, median, standard deviation, coefficient of variation, etc.
• Inductive (inferential) statistical methods
  – Produce statistical inferences about a population based on information derived from a sample taken from that population; sampling variation needs to be taken into account

[Figure: Estimating population values from sample values — a sample is drawn from the population and used to estimate population values]
Summarizing data
• Statistics is about "making sense of data"
• Raw data have to be processed and summarized before one can make sense of them
• A summary can take the form of
  – A summary index: using a single value to summarize data from a study variable
  – Tables
  – Diagrams
Summarizing categorical data
• A proportion is a type of fraction in which the numerator is a subset of the denominator
  – proportion dead = 35/86 = 0.41
• Odds are fractions in which the numerator is not part of the denominator
  – odds in favor of death = 35/51 = 0.69
• A ratio is a comparison of two numbers
  – ratio of dead : alive = 35 : 51
• Odds ratio: commonly used in case-control studies
  – odds in favor of death for females = 12/25 = 0.48
  – odds in favor of death for males = 23/26 = 0.88
  – odds ratio = 0.88/0.48 = 1.84
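These quantities are simple to compute directly; below is a minimal Python sketch using the figures quoted above (the variable names are illustrative, not part of the original slides):

    # Proportions, odds, and odds ratio for the example above
    dead, alive = 35, 51
    total = dead + alive                        # 86

    proportion_dead = dead / total              # numerator is a subset of the denominator: 35/86 ≈ 0.41
    odds_death = dead / alive                   # numerator is not part of the denominator: 35/51 ≈ 0.69

    odds_females = 12 / 25                      # odds in favor of death for females = 0.48
    odds_males = 23 / 26                        # odds in favor of death for males ≈ 0.88
    odds_ratio = odds_males / odds_females      # ≈ 1.84

    print(round(proportion_dead, 2), round(odds_death, 2), round(odds_ratio, 2))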
Summarizing measurement data
• Distribution patterns
  – Symmetrical (bell-shaped) distribution, e.g. the normal distribution
  – Skewed distribution
  – Bimodal and multimodal distributions
• Indices of central tendency
  – Mean, median
• Indices of dispersion
  – Variance, standard deviation, coefficient of variation
[Figure: Distribution patterns]
Indices of central tendency
• (Arithmetic) mean: the average of a set of values

  \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}

• The mean is sensitive to extreme values
• Example: blood pressure readings
Robust measure of central tendency
• Median: the number separating the higher half of a sample or a population from the lower half
• The median is less sensitive to extreme values
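To see why the median is the more robust index, here is a small Python sketch with made-up readings, one of them deliberately extreme:

    import statistics

    # Hypothetical blood pressure readings; the last value is deliberately extreme
    bp = [118, 121, 122, 125, 127, 130, 210]

    print(statistics.mean(bp))    # ≈ 136 — pulled upwards by the outlier
    print(statistics.median(bp))  # 125 — barely affected by the extreme value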
Indices of central tendency: Quantiles
• Quantiles: dividing the distribution of ordered values into equal-sized parts
  – Quartiles: 4 equal parts
  – Deciles: 10 equal parts
  – Percentiles: 100 equal parts

[Figure: the ordered values split into four 25% blocks by Q1, Q2 and Q3]
  – Q1: first quartile
  – Q2: second quartile = median
  – Q3: third quartile
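Quartiles and other percentiles can be read off ordered data directly; a short sketch with arbitrary example values (NumPy assumed):

    import numpy as np

    values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18, 21])   # arbitrary ordered data
    q1, q2, q3 = np.percentile(values, [25, 50, 75])                # the three quartiles
    print(q1, q2, q3)                                               # q2 equals the median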
Indices of dispersion
• Summarize the dispersion of individual values from some central value such as the mean
• Give a measure of variation

[Figure: individual values scattered around the mean]
Indices of dispersion: Variance
• Variance: the average of the squared deviations from the mean

  \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}

• Variance of a sample: usually 1 is subtracted from n in the denominator

  s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

  – n − 1 is the effective sample size, also called the degrees of freedom
Indices of dispersion: Standard deviation
• Problem with the variance: awkward unit of measurement, as the values are squared
• Solution: take the square root of the variance => standard deviation
• Sample standard deviation (s or sd)

  s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}
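A quick Python check of the n − 1 convention, using the standard library's statistics module and made-up values:

    import statistics

    x = [4.1, 3.8, 5.0, 4.4, 4.7]        # made-up sample

    var_sample = statistics.variance(x)  # sum of squared deviations divided by n - 1
    sd_sample = statistics.stdev(x)      # square root of the sample variance

    print(var_sample, sd_sample)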
Standard deviation
• Caution must be exercised when using the standard deviation as a comparative index of dispersion

  Weights of newborn elephants (kg): 929, 853, 878, 939, 895, 972, 937, 841, 801, 826
    n = 10, X̄ = 887.1, sd = 56.50
  Weights of newborn mice (kg): 0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89
    n = 10, X̄ = 0.68, sd = 0.255

• It would be incorrect to say that elephants show greater variation in birth weight than mice simply because their standard deviation is higher
Coefficient of variation
• The coefficient of variation expresses the standard deviation relative to its mean

  cv = \frac{s}{\bar{X}}

  Elephants: X̄ = 887.1, s = 56.50, cv = 0.0637
  Mice: X̄ = 0.68, s = 0.255, cv = 0.375

• Mice therefore show greater birth-weight variation
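The comparison can be reproduced from the birth-weight data in the table above (a Python sketch; cv is computed as the sample sd divided by the mean):

    import statistics

    elephants = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]          # kg
    mice = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]     # kg

    def cv(values):
        # coefficient of variation: sample sd relative to the mean (unit-free)
        return statistics.stdev(values) / statistics.mean(values)

    print(round(cv(elephants), 3))   # ≈ 0.064
    print(round(cv(mice), 3))        # ≈ 0.38 — mice vary more relative to their mean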
When to use the coefficient of variation
• When the comparison groups have very different means (the CV is suitable because it expresses the standard deviation relative to its corresponding mean)
• When different units of measurement are involved, e.g. group 1 is measured in mm and group 2 in mg (the CV is suitable for comparison because it is unit-free)
• In such cases, the standard deviation should not be used for comparison
Sample and population
• Populations are rarely studied in full because of logistical, financial and other considerations
• Researchers have to rely on studying samples
• There are many types of sampling design
• The most common is simple random sampling

[Figure: a sample drawn from a population]
Random sampling
• Suppose that we want to estimate the mean birth weight of Malay male live births in Singapore
• Due to logistical constraints, we decide to take a random sample of 100 Malay live births at the National University Hospital (NUH) in a given year

[Figure: the sample is drawn from all NUH Malay live births in 1992 (the sampled population), which is itself part of all Malay live births in Singapore (the target population)]
Sample, sampled population, and target population
• A random sample of 100 Malay live births at NUH in 1992 gives sample mean X̄ = 3.5 kg and sample sd = 0.25 kg
• Suppose that we know the mean birth weight of the sampled population (all NUH Malay live births in 1992) to be μ = 3.27 kg with σ = 0.38 kg
• X̄ − μ = 0.23 kg
Sampling error
• Could the difference of 0.23 kg (= 3.5 kg − 3.27 kg) be real, or could it be due purely to chance in sampling?
• An 'apparent' difference between the population mean and the random sample mean that is due purely to chance in sampling is called sampling error
• Sampling error does not mean that a mistake has been made in the process of sampling; it is the variation experienced due to the process of sampling
  – Sampling error reflects the difference between the value derived from the sample and the true population value
• The only way to eliminate sampling error is to enumerate the entire population
Estimating sampling error

[Figure: repeated sampling with replacement from all NUH live births (1992), keeping the sample size constant at 100, yields a different sample mean each time, e.g. X̄ = 3.5, 2.9, 3.8, 3.1, 2.8, 3.5 kg, etc.]
Distribution of sample means
• Also known as the sampling distribution of the mean
• Each unit of observation in the sampling distribution is a sample mean
• The spread of the sampling distribution gives a measure of the magnitude of the sampling error
• Central limit theorem: when sample sizes are large, the sampling distribution generated by repeated random sampling with replacement is invariably a normal distribution, regardless of the shape of the population distribution
• Mean of the sampling distribution = population mean = μ
• Standard error of the sample mean:

  S.E._{\bar{X}} = \frac{\sigma}{\sqrt{n}}
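A small simulation illustrates both claims, using an arbitrary non-normal population (a sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=3.0, size=100_000)   # a clearly non-normal population

    n = 100                                                  # constant sample size
    sample_means = [rng.choice(population, size=n, replace=True).mean()
                    for _ in range(5_000)]                   # repeated sampling with replacement

    print(np.mean(sample_means), population.mean())              # ≈ population mean
    print(np.std(sample_means), population.std() / np.sqrt(n))   # ≈ sigma / sqrt(n)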
Standard deviation vs. standard error
• The standard deviation (s.d.) tells us the variability among individuals
• The standard error (S.E._X̄) tells us the variability of sample means
• Standard error of the mean:

  S.E._{\bar{X}} = \frac{\sigma}{\sqrt{n}}

  – σ: standard deviation of the population
Properties of normal distribution
• Unimodal and symmetrical
• Probability distribution
  – Area under the normal curve is 1
• For a normal distribution
  – μ ± 1.00σ covers ~68% of the area under the normal curve
  – μ ± 1.96σ covers ~95% of the area under the normal curve
  – μ ± 2.58σ covers ~99% of the area under the normal curve
Estimation and hypothesis testing
• Statistical estimation
  – Estimating population parameters based on sample statistics
  – Application of the confidence interval
• Hypothesis testing
  – Testing certain assumptions about the population by using probabilities to estimate the likelihood of the results obtained in the sample(s), given the assumptions about the population
  – Application of tests for statistical significance
Statistical estimation
• Two ways to estimate population values from sample values: point estimation and interval estimation
• Point estimation
  – Using a sample statistic to estimate a population parameter based on a single value
  – e.g. if a random sample of Malay births gives X̄ = 3.5 kg and we use it to estimate μ, the mean birth weight of all Malay births in the sampled population, we are making a point estimate
  – Point estimation ignores sampling error
• Interval estimation
  – Using a sample statistic to estimate a population parameter while making allowance for sampling variation (error)
Interval estimation
• Provides an estimate of the population parameter by defining an interval or range within which the population parameter can be found with a given probability or likelihood
• This interval is called the confidence interval
• To understand confidence intervals, we need to return to the discussion of the sampling distribution
Central limit theorem
• Repeated sampling with replacement gives a distribution of sample means that is normally distributed, with a mean equal to the true population mean, μ
• Assumptions:
  – Large and constant sample size
  – Repeated sampling with replacement
  – Samples are taken at random

[Figure: Sampling distribution of the mean]
95% confidence interval
• The 95% confidence interval gives an interval of values within which there is a 95% chance of locating the true population mean μ

  \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \;\;\text{to}\;\; \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}

  – where σ/√n is the standard error of the sample mean, S.E._X̄; there is a 95% chance of finding μ within this interval
Estimating standard error
• The sampling distribution is only a theoretical distribution, as in practice we take only one sample and not repeated samples
• Hence S.E._X̄ is often not known, but it can be estimated from a single sample:

  S.E._{\bar{X}} \approx \frac{s}{\sqrt{n}}

  – where s is the sample standard deviation and n is the sample size
Precision of statistical estimation
• The width of a confidence interval gives a measure of the precision of the statistical estimation

[Figure: a low-precision estimate has a wide interval from X̄ − 1.96 S.E._X̄ to X̄ + 1.96 S.E._X̄; a high-precision estimate has a narrow one]

• Estimation of a population value can achieve higher precision by minimizing S.E._X̄, which depends on the population s.d. and the sample size; that is, S.E. can be minimized by maximizing the sample size (up to a certain point)
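A minimal sketch of the interval X̄ ± 1.96 × s/√n in Python, with illustrative numbers in the spirit of the birth-weight example (the exact values here are assumptions):

    import math

    n, xbar, s = 100, 3.5, 0.9             # sample size, sample mean (kg), sample sd (kg)

    se = s / math.sqrt(n)                  # estimated standard error of the mean
    lower, upper = xbar - 1.96 * se, xbar + 1.96 * se

    print(round(se, 3), round(lower, 3), round(upper, 3))
    # Increasing n shrinks se and narrows the interval, i.e. gives higher precision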
Statistical hypothesis testing
• Statistical hypothesis testing is a method of making statistical decisions using experimental data
• A result is called statistically significant if it is unlikely to have occurred by chance
• These decisions are almost always made using null-hypothesis tests, that is, ones that answer the question:
  – Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?
Hypothesis testing
• Another form of statistical inference, in which we want to ask questions like
  – How likely is the mean systolic blood pressure of a sampled population (e.g. biomedical researchers) to be the same as that in the general population?
    i.e. μ_researchers = μ_general?
  – Is the difference between X̄1 and X̄2 statistically significant enough for us to reject the hypothesis that their corresponding μ1 and μ2 are the same?
    i.e. μ_male = μ_female?
Steps for test of significance
• A test of significance can only be done on a difference, e.g. the difference between X̄ and μ, or between X̄1 and X̄2
• Step 1: Decide the difference to be tested, e.g. the difference in pulse rate between those who were subjected to a stress test and the controls, on the assumption that the stress test significantly increases pulse rate (the hypothesis)
• Step 2: Formulate a null hypothesis, e.g. no difference in pulse rate between the two groups
  – i.e. H0: μ_test = μ_control
  – Alternative hypothesis H1: μ_test ≠ μ_control
Steps for test of significance
• Step 3: Carry out the appropriate test of significance. Based on the test statistic (result), estimate the likelihood that the difference is due purely to sampling error
• Step 4: On the basis of the likelihood of sampling error, as measured by the P value, decide whether or not to reject the null hypothesis
• Step 5: Draw the appropriate conclusion in the context of the biomedical problem, e.g. there is some evidence from the dataset that subjects who underwent the stress test have a higher pulse rate, on average, than the controls
Test of significance: An example
• Suppose a random sample of 100 Malay male live births delivered at NUH gives a sample mean birth weight of 3.5 kg with an sd of 0.9 kg
• Question of interest: what is the likelihood that the mean birth weight in the sampled population (all Malay male live births delivered at NUH) is the same as the mean birth weight of all Malay male live births in the general population, after taking sampling error into consideration?
Test of significance: An example
• Suppose: X̄ = 3.5 kg, sd = 0.9 kg, μ_pop = 3.0 kg, σ_pop = 1.8 kg
• Difference between means = 3.5 − 3.0 = 0.5 kg
• Null hypothesis, H0: μ_NUH = μ_pop
• The test of significance makes use of the normal-distribution properties of the sampling distribution of the mean
Test of significance
• Where does our sample mean of 3.5 kg lie on the sampling distribution?
• If it lies outside the limits of μ ± 1.96 σ/√n, the likelihood that the sample belongs to the same population is equal to or less than 0.05 (5%)
Test of significance: z-test
• The test statistic is given by

  z = \frac{\bar{X} - \mu}{S.E._{\bar{X}}}

  the standard normal deviate (SND or z)
• The SND expresses the difference in standard-error units on a standard normal curve with μ = 0 and σ = 1
• For our example,

  SND = \frac{3.5 - 3.0}{1.8 / \sqrt{100}} = 2.78

[Figure: standard normal curve with the central 95% of the area between −1.96 and +1.96 around μ = 3.0; the observed SND = 2.78 falls outside this range]
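The same calculation, plus the corresponding two-tailed P-value, takes a few lines (a sketch using SciPy's normal distribution):

    import math
    from scipy.stats import norm

    xbar, mu, sigma, n = 3.5, 3.0, 1.8, 100

    z = (xbar - mu) / (sigma / math.sqrt(n))     # ≈ 2.78, the standard normal deviate
    p_two_tailed = 2 * (1 - norm.cdf(abs(z)))    # ≈ 0.005, the area in both tails beyond ±z

    print(round(z, 2), p_two_tailed)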
Test of significance
• There is a less than 5% chance that a difference of this magnitude (0.5 kg) could have occurred, on either side of the distribution, if the random sample of 100 Malay males had come from a population whose mean birth weight is the same as that of the general population
One-tailed vs. two-tailed tests
• So far, we have been testing for a difference that can occur on either side of the standard normal distribution
• In our example, the z-value of 2.78 gives a two-tailed P-value of 0.0054
• For a one-tailed test, a z-score of 2.78 gives a P-value of 0.0027 (= 0.0054/2). A one-tailed test applies only when we are sure in advance that the mean birth weight of our sample can only exceed that of the general population
One-tailed vs. two-tailed tests
• Two-sided tests are conventionally used because most of the time we are not sure of the direction of the difference
• One-tailed tests are used only when we can anticipate a priori the direction of a difference
• One-tailed tests are tempting because they are more likely to give a significant result
• Given the same z-score, the P-value is halved for a one-tailed test
• This also means that one-tailed tests run a greater risk of rejecting the null hypothesis when it is in fact correct --- a type I error
Type I and type II errors
• If the difference is statistically significant, i.e. H0 is incorrect, failure to reject H0 would lead to a type II error
t-test
• The assumption that the sampling distribution will be normally distributed holds for large samples but not for small samples
• When the sample size is large, use the z-test
• The t-test is used when the sample size is small
  – Statistical concept of the t-distribution
  – Comparing means for 2 independent groups: unpaired t-test
  – Comparing means for 2 matched groups: paired t-test
t-distribution
• Sampling distributions based on small samples will be symmetrical (bell-shaped) but not necessarily normal
• The spread of these symmetrical distributions is determined by the sample size: the smaller the sample size, the wider the spread, and hence the bigger the standard error
• These symmetrical distributions are known as Student's t-distribution, or simply the t-distribution
• The t-distribution approaches the normal distribution as the sample size tends to infinity
[Figure: Family of t-distributions]
t-test for 2 independent samples (unpaired t-test)
• X̄1 = 0.08157, X̄2 = 0.03943; difference ≈ 0.04 units
• Question: what is the probability that the difference of about 0.04 units between the two sample means has occurred purely by chance, i.e. due to sampling error alone?
Unpaired t-test
• We are testing the hypothesis that battery workers could have higher blood Pb levels than the control group of workers, as they are occupationally exposed
• Note: conventionally, a P-value of 0.05 is generally recognized as low enough to reject the null hypothesis of "no difference"
Unpaired t-test
• Null hypothesis: no difference in mean blood Pb level between battery workers and the control group, i.e.
  – H0: μ_battery = μ_control
• The t-score is given by

  t = \frac{\bar{X}_1 - \bar{X}_2}{SE(\bar{X}_1 - \bar{X}_2)}, \quad
  SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)
  \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

  with (n1 + n2 − 2) degrees of freedom
Unpaired t-test
• For the given example,

  t = \frac{0.08157 - 0.03943}{0.002868} = 14.7 \text{ with 12 d.f.}

• P-value < 0.001, so we reject the null hypothesis
• There is some evidence, from the data, that the battery workers in our study have a higher blood Pb level than the control group, on average
Unpaired t-test assumptions
• Data are normally distributed in the populations from which the two independent samples have been drawn
• The two samples are random and independent, i.e. observations in one group are not related to observations in the other group
• The two independent samples have been drawn from populations with the same (homogeneous) variance, i.e. σ1 = σ2
Example 2: Molecular classification of cancer
• Expression of a gene called ATP6C (vacuolar H+ ATPase proton channel subunit) in 38 bone marrow samples:
  – 27 acute lymphoblastic leukemia (ALL) samples:
    835 935 1665 764 1323 1030 1482 1306 593 2375 542 809
    2474 1514 1977 1235 933 114 -9 3072 608 499 1740 1189
    1870 892 677
  – 11 acute myeloid leukemia (AML) samples:
    3237 1125 4655 3807 3593 2701 1737 3552 3255 4249 1870
• Question: is ATP6C differentially expressed in the two classes of samples?
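One way to run this comparison in practice is with SciPy's two-sample t-test (a sketch; the slides do not prescribe a particular tool, and Welch's unequal-variance variant is noted as an alternative):

    from scipy.stats import ttest_ind

    # ATP6C expression values from the slide above
    all_samples = [835, 935, 1665, 764, 1323, 1030, 1482, 1306, 593, 2375, 542, 809,
                   2474, 1514, 1977, 1235, 933, 114, -9, 3072, 608, 499, 1740, 1189,
                   1870, 892, 677]
    aml_samples = [3237, 1125, 4655, 3807, 3593, 2701, 1737, 3552, 3255, 4249, 1870]

    # Classical unpaired t-test (assumes equal variances in the two populations)
    t_stat, p_value = ttest_ind(all_samples, aml_samples, equal_var=True)
    print(t_stat, p_value)   # a small p-value suggests differential expression

    # If the equal-variance assumption is doubtful, use Welch's variant instead:
    # ttest_ind(all_samples, aml_samples, equal_var=False)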
[Figure: Molecular classification of tumor example]
Paired t-test
• The previous problem uses the unpaired t-test, as the two samples were not matched
  – i.e. the two samples were independently derived
• Sometimes, we may need to deal with matched study designs
Paired t-test
• Null hypothesis: no difference in mean cholesterol levels between fasting and postprandial states
  – H0: μ_fasting = μ_postprandial
Paired t-test
• The t-score is given by

  t = \frac{\bar{d}}{SE_{\bar{d}}} = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.833}{0.259} = 3.219

  with (n − 1) degrees of freedom, where n is the number of pairs
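The slides do not list the individual cholesterol measurements, so the sketch below uses hypothetical paired values purely to show the mechanics of a paired t-test in SciPy:

    from scipy.stats import ttest_rel

    # Hypothetical fasting and postprandial cholesterol levels for the same 6 subjects
    fasting = [4.8, 5.2, 5.6, 4.9, 5.4, 5.1]
    postprandial = [5.0, 5.1, 5.8, 4.8, 5.6, 5.2]

    # The paired t-test works on the per-subject differences, with n - 1 degrees of freedom
    t_stat, p_value = ttest_rel(postprandial, fasting)
    print(t_stat, p_value)   # with these made-up numbers p > 0.05, i.e. do not reject H0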
Paired t-test
• Conclusion: insufficient evidence, from the data, to suggest that postprandial cholesterol levels are, on average, higher than fasting cholesterol levels
• Action: do not reject the null hypothesis
Common errors relating to t-test
• Failure to recognize the assumptions
  – If an assumption does not hold, explore data transformation or use non-parametric methods
• Failure to distinguish between paired and unpaired designs
ANOVA (ANalysis Of VAriance)
Hypothesis test for population difference in means (H0: μ1 − μ2 = 0)
• Example: difference in average GPAs for male and female students

[Table: test statistic and confidence interval for the difference in means — a z-based version when n1 ≥ 30 and n2 ≥ 30, and a t-based version with df = n1 + n2 − 2 and a common (pooled) standard deviation estimate when n1 < 30 or n2 < 30]
ANOVA
• A widely used statistical technique for testing equality of population means when more than two populations are under consideration
• An extension of the two-independent-samples test for a difference in population means when the population variances are (or can be assumed to be) equal
• Example scenarios
  – Variation in time to relief of symptoms between and within 3 treatments
  – Music compressed by four MP3 compressors has the same quality
  – Three new drugs are all as effective as a placebo
  – Lectures, group studying, and computer-assisted instruction are equally effective for undergraduate students
Example:
• Variation in time to relief of symptoms between and within 3 treatments
  – n = 15 subjects are randomly selected to participate in the study
  – ni = 5 subjects are randomly assigned to each treatment (k = 3), and each subject reports the time to relief of symptoms (in minutes)
Example: Between and within 3 treatments
• The sample means are numerically different
  – (29, 25, 20) => possibly large variability between treatments
• The sample variances are similar, and small
  – (0.158, 0.071, 0.158) => observations within each sample are tightly clustered around their respective sample means, i.e. small variability within treatments
ANOVA
• Identify the groups: Group 1, Group 2, ..., Group k (2 populations in the two-sample case, k > 2 populations in ANOVA)
• Identify the parameters: μi = population mean for Group i
• State the appropriate hypotheses
  – H0: μ1 = μ2 = ... = μk
  – Ha: NOT all means are equal (at least one μ is different)
[Figure: ANOVA table and the F-distribution]
Example
• As an example, we analyze the Cushings data set, which is available from the MASS package
• The Type variable in the data set shows the underlying type of syndrome, which can be one of four categories:
  – adenoma (a)
  – bilateral hyperplasia (b)
  – carcinoma (c)
  – unknown (u)
• The total number of observations is n = 27
• The numbers of observations in each group are
  – n1 = 6, n2 = 10, n3 = 5, and n4 = 6
Example
• We also find the group-specific means, which are
  – 3.0, 8.2, 19.7, and 14.0, respectively
• The degrees-of-freedom parameters are
  – df1 = 4 − 1 = 3 and df2 = 27 − 4 = 23
• SSB = 893.5 and SSW = 2123.6
• The observed value of the F-statistic is f = 3.2, given under the column labeled "F value"
• The resulting p-value is then 0.04
Example
• Therefore, we can reject H0 at the 0.05 significance level (but not at 0.01) and conclude that the differences among group means for the urinary excretion rate of Tetrahydrocortisone are statistically significant (at the 0.05 level)
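The same one-way ANOVA can be reproduced in Python (a sketch; the Tetrahydrocortisone values below are transcribed from the Cushings data in R's MASS package and are worth re-checking against the original source):

    from scipy.stats import f_oneway

    # Urinary Tetrahydrocortisone by syndrome Type (transcribed from MASS::Cushings)
    a = [3.1, 3.0, 1.9, 3.8, 4.1, 1.9]                          # adenoma
    b = [8.3, 3.8, 3.9, 7.8, 9.1, 15.4, 7.7, 6.5, 5.7, 13.6]    # bilateral hyperplasia
    c = [10.2, 9.2, 9.6, 53.8, 15.8]                            # carcinoma
    u = [5.1, 12.9, 13.0, 2.6, 30.0, 20.5]                      # unknown

    f_stat, p_value = f_oneway(a, b, c, u)        # one-way ANOVA across the four groups
    print(round(f_stat, 1), round(p_value, 2))    # ≈ 3.2 and ≈ 0.04, as reported above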
ANALYSIS OF CATEGORICAL DATA
Hypothesis testing involving categorical data
• Chi-square test for statistical association involving 2x2 tables and RxC tables
  – Testing for associations involving small, unmatched samples
  – Testing for associations involving small, matched samples
Association
• Examining the relationship between 2 categorical variables
• Some examples of association:
  – Smoking and lung cancer
  – Ethnic group and coronary heart disease
• Question of interest when testing for association between two categorical variables:
  – Does the presence/absence of one factor (variable) influence the presence/absence of the other factor (variable)?
• Caution
  – The presence of an association does not necessarily imply causation
Relating to comparison between proportions
• Proportion improved in the drug group = 18/24 = 75%
• Proportion improved in the placebo group = 9/20 = 45%
• Question: what is the probability that the observed difference of 30% is purely due to sampling error, i.e. chance in sampling?
• Use the chi-square (χ²) test
Chi-square test for statistical association
• Probability of selecting a person in the drug group = 24/44
• Probability of selecting a person with improvement = 27/44
• Probability of selecting a person from the drug group who had shown improvement = (24/44) × (27/44) = 0.3347 (assuming two independent events)
• Expected value for cell (a) = 0.3347 × 44 = 14.73
Chi-square test for statistical association
• General formula for χ²:

  \chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}}

• The χ² test is always performed on categorical variables using absolute frequencies, never percentages or proportions
Chi-square test for statistical association
• For the given problem:

  \chi^2 = \frac{(18 - 14.73)^2}{14.73} + \frac{(6 - 9.27)^2}{9.27}
         + \frac{(9 - 12.27)^2}{12.27} + \frac{(11 - 7.73)^2}{7.73}
         = 4.14 \text{ with 1 degree of freedom}

• The χ² degrees of freedom are given by (no. of rows − 1) × (no. of cols − 1) = (2 − 1) × (2 − 1) = 1

  Observed 2x2 table (improved / not improved):
    Drug:     18   6   (row total 24)
    Placebo:   9  11   (row total 20)
    Column totals: 27  17  (grand total 44)

• How many of these 4 cells are free to vary if we keep the row and column totals constant?
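SciPy's contingency-table routine reproduces this hand calculation (a sketch; Yates' continuity correction is switched off so the statistic matches the value above):

    from scipy.stats import chi2_contingency

    # 2x2 table: rows = drug / placebo, columns = improved / not improved
    table = [[18, 6],
             [9, 11]]

    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    print(round(chi2, 2), dof, round(p_value, 3))   # ≈ 4.14 with 1 d.f.; p between 0.02 and 0.05
    print(expected)                                  # expected counts, e.g. ≈ 14.73 for cell (a)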
Chi-square test for statistical association
• The probability of obtaining an observed difference of 30% in improvement rates, if the null hypothesis of no association were correct, is between 2% and 5%
• Hence, there is some statistical evidence from this study to suggest that treating arthritic patients with the drug can significantly improve grip strength