Biostatistics
Lecture INF4350
October 1, 2008
Anja Bråthen Kristoffersen
Biomedical Research Group
Department of Informatics, UiO

Goal
• Presentation of data
  – descriptive
  – tables and graphs
• Sensitivity, specificity, ROC curve
• Hypothesis testing
  – Type I and type II errors
  – Multiple testing
  – False positives

Variable types
• Categorical variables
  – Ordinal:
    • Are you smoking? 1 = "Daily", 2 = "Now and then", 3 = "Stopped last year", 4 = "Stopped earlier", 5 = "Never"
  – Nominal (discrete variables):
    • Civil state: 1 = "not married", 2 = "married", 3 = "have a partner", 4 = "divorced", 5 = "widow"
    • DNA (A, T, C, G)
  – Binary variables (0, 1)
• Continuous variables
  – numbers
Variables can be
• Independent
  – Not influenced by other variables.
  – Not influenced by the event, but could influence the event.
• Dependent
  – Variables influence each other. For instance, the information that a gene is on/off could influence another gene. Which variable depends on or influences the other can often not be determined.
Average (mean)
• Properties:
  – All observations must be known
  – The observations do not need to be ordered
  – Sensitive to "outliers" (extreme, atypical values)
• Equation:

  \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i

Adjusted mean
• Mean based on the central observations: 90–95% of the observations; "the tail" (5–10%) of the data is not included in the calculation.
• Less sensitive to extreme observations.
Combining means
• Equation:

  \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_m \bar{x}_m}{n_1 + n_2 + \cdots + n_m}

  where n_i is the number of observations behind the mean \bar{x}_i.
• Note that adjusted means cannot be combined like this.

Median
• Synonyms:
  – 50th percentile
  – Empirical median
• Properties:
  – The observations are ordered
  – "Median = the value that divides the ordered observations into two equal parts."
  – Not sensitive to extreme observations.
  – Mathematically inconvenient, since the medians of more than one set of observations cannot be combined.
Mode
• The observation that occurs most often.
  – Mathematically inconvenient, since the modes of more than one set of observations cannot be combined.
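The three location measures above, and the rule for combining group means, can be sketched in plain Python (a minimal illustration written for this note, not taken from the slides):

```python
from collections import Counter
from statistics import median  # built-in "ordered observations" definition

def mean(xs):
    """Arithmetic mean: all observations must be known."""
    return sum(xs) / len(xs)

def combine_means(means, counts):
    """Combine group means weighted by the group sizes n_i (slide formula).
    Medians and adjusted means cannot be combined this way."""
    return sum(n * m for n, m in zip(counts, means)) / sum(counts)

def mode(xs):
    """The observation that occurs most often."""
    return Counter(xs).most_common(1)[0][0]

group_a = [1, 2, 3, 4]   # mean 2.5
group_b = [10, 20]       # mean 15.0
print(mean(group_a))                       # 2.5
print(combine_means([2.5, 15.0], [4, 2]))  # (4*2.5 + 2*15)/6 ≈ 6.67
print(median(group_a + group_b))           # 3.5
print(mode([1, 1, 2, 3]))                  # 1
```

Note that the combined mean equals the mean of the pooled observations, which is exactly why the weighting by n_i works.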
Dispersal measures
• Range = X(n) − X(1)
  – Same unit as the observations
  – Sensitive to extreme observations
• Quantiles, percentiles
  – The number V_p that has a proportion p of the ordered observations below it (0 < p < 1)
  – Same unit as the observations
• Standard deviation

  sd = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

  – Always positive
  – Outlying observations contribute most
  – Same unit as the observations
Standard deviation
• If the data are close to Gaussian distributed, approximately 95% of the population lies within

  \bar{x} \pm 1.96 \cdot sd

  – Which approximately corresponds to the 2.5 and 97.5 percentiles
  – A consequence of the properties of the Gaussian distribution
  – Depends on approximate symmetry and unimodality
• Quick and dirty: sd ≈ Range / 4
  – Handy as a first guess of the sd when calculating the necessary number of observations.
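A quick sanity check of the 95% rule, sketched with Python's standard library (the seed, mean, sigma and sample size are arbitrary choices of mine):

```python
import random
from statistics import mean, stdev

random.seed(42)  # arbitrary, for reproducibility
xs = [random.gauss(mu=50, sigma=10) for _ in range(10_000)]

m, sd = mean(xs), stdev(xs)  # stdev uses the n-1 denominator
inside = sum(m - 1.96 * sd <= x <= m + 1.96 * sd for x in xs) / len(xs)
print(f"fraction within mean ± 1.96·sd: {inside:.3f}")  # close to 0.95

# The "quick and dirty" rule sd ≈ Range/4 is only a first guess: for a
# Gaussian sample this large the range spans well over 4 sd.
print(f"sd = {sd:.1f}, Range/4 = {(max(xs) - min(xs)) / 4:.1f}")
```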
Descriptive statistics - tables
• A scalar variable:
  – Calculate mean, median and standard deviation
• A categorical variable:
  – Descriptive statistics → frequencies
• Two categorical variables:
  – Descriptive statistics → cross table
• A scalar variable and a categorical variable:
  – Compare mean/median for each category
• Two scalar variables:
  – Categorise one of the variables
  – or: linear regression
Always plot your data
"A plot tells more than 1000 tests"
• A scalar variable:
  – Histogram
  – Box plot
  – Compare the data with the Gaussian distribution: a Q-Q plot is easier to read and explain than a "Gaussian curve on top of" a histogram
• A scalar and a categorical variable:
  – Box plot
• Two scalar variables:
  – Scatter plot
• Two scalar variables and a categorical variable:
  – Scatter plot
Example: probability of getting a boy

Number of babies born   Number of boys   Percentage of boys
10                      8                0.8
100                     55               0.55
1000                    525              0.525
10000                   5139             0.5139
100000                  51127            0.51127
376058                  1927054          0.51247
17989361                9219202          0.51248
34832051                17857857         0.51268
Prevalence, sensitivity, specificity, and more

A = {symptom or positive diagnostic test}
B = {ill}
Pr(B) = prevalence of the illness
Relative risk

A = {Positive mammogram}
B = {Breast cancer within two years}
Pr(B | A) = 0.1
Pr(B | Ā) = 0.0002

RR = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar{A})} = \frac{0.1}{0.0002} = 500
Example breast cancer diagnostic

A = {positive mammogram}
B = {breast cancer within two years}

Pr(A | B) = sensitivity
Pr(Ā | B̄) = specificity
Pr(A | B̄) = false positive rate
Pr(A | B̄) + Pr(Ā | B̄) = 1, so Pr(A | B̄) = 1 − specificity
Pr(B | A) = PPV = PV+ = positive predictive value
Pr(B̄ | Ā) = NPV = negative predictive value

PPV = Pr(B | A) = 0.1
Pr(B | Ā) = 0.0002, so Pr(B̄ | Ā) = 1 − 0.0002 = 0.9998
NPV = Pr(B̄ | Ā) = 0.9998
Example breast cancer in different groups
• Breast cancer among women 45 to 54 years old
  – Group A: gave first birth before 20 years of age
  – Group B: gave first birth after 30 years of age
• Assume that 40 of 10000 in group A and 50 of 10000 in group B get cancer. Coincidence or different risk?
• If the numbers were 400 of 100000 and 500 of 100000, still coincidence?
Analysis of a 2·2 table

Traditional 2·2 table:

                        ill
                    +          -
Test result  +   a [TP]     b [FP]     a+b
             -   c [FN]     d [TN]     c+d
                 a+c        b+d        a+b+c+d

TP = true positive, FP = false positive,
FN = false negative, TN = true negative
Example breast cancer

Here a = 40, b = 9960, c = 50, d = 9950:

> fisher.test(matrix(c(40,9960,50,9950), ncol = 2, byrow = TRUE))

• Fisher showed that the probability of obtaining any such set of values is given by the hypergeometric distribution:

  p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{a+b+c+d}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{(a+b+c+d)!\,a!\,b!\,c!\,d!}

• If the p-value is less than a cutoff (e.g. p < 0.05), we reject the null hypothesis and assume that the "true odds ratio is not equal to 1"; hence the test result differentiates between ill and not ill.
Fisher's Exact Test for Count Data

data:  matrix(c(40, 9960, 50, 9950), ncol = 2, byrow = TRUE)
p-value = 0.3417
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5133146 1.2371891
sample estimates:
odds ratio
 0.7992074

> fisher.test(matrix(c(400,99600,500,99500), ncol = 2, byrow = TRUE))

Fisher's Exact Test for Count Data

data:  matrix(c(400, 99600, 500, 99500), ncol = 2, byrow = TRUE)
p-value = 0.0009314
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.6987355 0.9135934
sample estimates:
odds ratio
 0.7991994
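The hypergeometric formula above can be turned directly into a small two-sided Fisher test. The sketch below is plain Python written for this note (not from the slides); it uses the same two-sided convention as R's fisher.test, namely summing the probabilities of all tables no more likely than the observed one:

```python
from math import comb

def table_prob(a, b, c, d):
    """Point probability of a 2x2 table with fixed margins
    (the hypergeometric formula from the slide)."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value: sum the probabilities of all
    tables with the same margins whose probability does not exceed
    that of the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    p_obs = table_prob(a, b, c, d)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_x = table_prob(x, row1 - x, col1 - x, row2 - col1 + x)
        if p_x <= p_obs * (1 + 1e-7):  # tolerance for float round-off
            total += p_x
    return total

print(fisher_two_sided(40, 9960, 50, 9950))      # ≈ 0.3417, as in the R output
print(fisher_two_sided(400, 99600, 500, 99500))  # ≈ 0.0009, as in the R output
```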
Prevalence, sensitivity, specificity, and more

With the 2·2 table cells a [TP], b [FP], c [FN], d [TN]:

Prevalence = Pr(B) = \frac{a+c}{a+b+c+d}

Sensitivity = Pr(A | B) = \frac{a}{a+c}

Specificity = Pr(Ā | B̄) = \frac{d}{b+d}

PPV = Pr(B | A) = \frac{a}{a+b}

NPV = Pr(B̄ | Ā) = \frac{d}{c+d}

Accuracy = \frac{a+d}{a+b+c+d}
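These definitions translate directly into code. A minimal sketch (the function name and the example counts are my own, purely hypothetical):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 diagnostic-test metrics (a=TP, b=FP, c=FN, d=TN)."""
    total = tp + fp + fn + tn
    return {
        "prevalence":  (tp + fn) / total,   # (a+c) / (a+b+c+d)
        "sensitivity": tp / (tp + fn),      # a / (a+c)
        "specificity": tn / (fp + tn),      # d / (b+d)
        "ppv":         tp / (tp + fp),      # a / (a+b)
        "npv":         tn / (fn + tn),      # d / (c+d)
        "accuracy":    (tp + tn) / total,   # (a+d) / (a+b+c+d)
    }

# Hypothetical counts: 100 ill, 900 healthy, imperfect test
m = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)
print(m["sensitivity"], m["specificity"])  # 0.9 and ≈ 0.944
```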
Statistical tests

Testing hypotheses
• Find a null and an alternative hypothesis
• Example:
  – H0: Expected response is equal in both groups
  – H1: Expected response is different between groups
• p-value: the probability of observing values at least as extreme as those observed, given that H0 is true
• Reject H0 if the p-value is less than a given significance level (e.g. 0.05 or 0.01)
Statistical test methods
• Some tests assume a certain distribution
  – E.g. the t-test assumes that the data are (approximately) Gaussian distributed
• Non-parametric tests are more flexible
  – E.g. comparing two medians: non-parametric test, two independent groups (Mann-Whitney)
• Two categorical variables:
  – Fisher test
  – Chi-square test
  – Mann-Whitney
• Two scalar variables:
  – t.test
• A scalar and a categorical variable:
  – anova
The t-test
• The t statistic is based on the sample mean and variance.

Mann-Whitney
• In order to apply the Mann-Whitney test, the raw data from samples A and B must first be combined into a set of N = n_a + n_b elements, which are then ranked from lowest to highest. These rankings are then re-sorted into the two separate samples.
• The value of U reported in this analysis is the one based on sample A, calculated as

  U_A = n_a n_b + \frac{n_a(n_a + 1)}{2} - T_A

  where T_A is the observed sum of ranks for sample A, and n_a n_b + n_a(n_a + 1)/2 is the maximum possible value of T_A.
• Convert the U statistic into p-values.
ANOVA
• The t-test and its variants only work when there are two sample pools.
• Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.

A simple experiment
• Measure response to a drug treatment in two different mouse strains.
• Repeat each measurement five times.
• Total experiment = 2 strains × 2 treatments × 5 repetitions = 20 arrays.
• If you look for treatment effects using a t-test, then you ignore the strain effects.
ANOVA lingo
Two-factor design
• Factor: a variable that is under the control of the experimenter (strain, treatment).
• Level: a possible value of a factor (drug, no drug).
• Main effect: an effect that involves only one factor.
• Interaction effect: an effect that involves two or more factors simultaneously.
• Balanced design: an experiment in which each factor and level is measured an equal number of times.
Fixed and random effects
• Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.
• Random effect: a term for which the levels would not repeat in a replicated experiment.
• In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.
ANOVA model

  E_{ijk} = \mu + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}, \quad i = 1,\dots,n; \; j = 1,\dots,m; \; k = 1,\dots,p.

• μ is the mean expression level of the gene.
• T and S are main effects (treatment, strain) with n and m levels, respectively.
• TS is an interaction effect.
• p is the number of replicates per group.
• ε represents random error (to be minimized).
ANOVA steps
• For each gene on the array:
  – Fit the parameters T and S, minimizing ε.
  – Test T, S and TS for difference from zero, yielding three F statistics.
  – Convert the F statistics into p-values.
F-statistics
• Compare two linear models:

  F = \frac{MSG}{MSE} = \frac{\text{Mean Squares Group}}{\text{Mean Squares Error}}

• This compares the variation between groups (group mean to group mean) to the variation within groups (individual values to group means).
• The p-value is Pr(F_{df_1, df_2} > F_{calculated}), taken from the F-distribution.
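For the one-way case the F ratio above can be computed directly. A minimal sketch of my own (the two-factor model with interaction needs more bookkeeping, and converting F to a p-value additionally needs the F-distribution):

```python
def one_way_f(groups):
    """F = MSG / MSE for a list of groups (each a list of measurements)."""
    k = len(groups)                                  # number of groups
    n = sum(len(g) for g in groups)                  # total observations
    grand_mean = sum(x for g in groups for x in g) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares (group means to grand mean)
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares (individual values to group means)
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    msg = ssg / (k - 1)   # df1 = k - 1
    mse = sse / (n - k)   # df2 = n - k
    return msg / mse

print(one_way_f([[1, 2, 3], [4, 5, 6]]))  # 13.5
```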
ANOVA assumptions
• For a given gene, the random error terms are independent, normally distributed and have uniform variance.
• The main effects and their interactions are linear.

ANOVA output
(figure: for each gene, p-values for strain effects, treatment effects and interaction effects, comparing the Vehicle and Drug conditions)
Multiple testing correction
(This and some following slides are from http://compdiag.molgen.mpg.de/ngfn/docs/2004/mar/DifferentialGenes.pdf.)
• On an array of 10,000 spots, a p-value of 0.0001 may not be significant.
• Bonferroni correction: divide your p-value cutoff by the number of measurements.
• For significance of 0.05 with 10,000 spots, you need a p-value of 5 × 10⁻⁶.
• Bonferroni is conservative because it assumes that all genes are independent.
Types of errors
• False positive (Type I error): the experiment indicates that the gene has changed, but it actually has not.
• False negative (Type II error): the gene has changed, but the experiment failed to indicate the change.
• Typically, researchers are more concerned about false positives.
• Without doing many (expensive) replicates, there will always be many false negatives.
False discovery rate
• The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
• False positive rate: percentage of non-differentially expressed genes that are flagged.
• False discovery rate: percentage of flagged genes that are not differentially expressed.
FDR example
(figure: ranked gene list with 5 FP and 13 TP flagged above the cutoff; 33 TN and 5 FN below it)

FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%

Controlling the FDR
• Order the unadjusted p-values p_1 ≤ p_2 ≤ … ≤ p_m.
• To control the FDR at level α (the desired significance threshold), find

  j^* = \max\left\{ j : p_j \le \frac{j}{m}\,\alpha \right\}

  where j is the rank of this gene and m is the total number of genes.
• Reject the null hypothesis for j = 1, …, j^*.
• This approach is conservative if many genes are differentially expressed.
  (Benjamini & Hochberg, 1995)

Rank    (jα)/m     p-value
1       0.00005    0.0000008
2       0.00010    0.0000012
3       0.00015    0.0000013
4       0.00020    0.0000056
5       0.00025    0.0000078
6       0.00030    0.0000235
7       0.00035    0.0000945
8       0.00040    0.0002450
9       0.00045    0.0004700
10      0.00050    0.0008900
…       …          …
1000    0.05000    1.0000000

• Choose the threshold so that, for all the genes above it, the p-value is less than the corresponding (jα)/m.
• Approximately 5% of the genes above the line are expected to be false positives.
Bonferroni vs. false discovery rate
• Bonferroni controls the family-wise error rate, i.e., the probability of at least one false positive.
• FDR is the proportion of false positives among the genes that are flagged as differentially expressed.
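The Benjamini-Hochberg step-up rule described above is short to implement. A sketch of my own, with a Bonferroni cutoff for comparison:

```python
def bonferroni_cutoff(alpha, m):
    """Bonferroni: divide the significance cutoff by the number of tests."""
    return alpha / m

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected by the BH step-up rule:
    j* = max{ j : p_(j) <= (j/m)*alpha }; reject the j* smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    j_star = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            j_star = rank
    return sorted(order[:j_star])

# The ten p-values from the slide's rank table, padded to m = 1000 tests
ps = [8e-7, 1.2e-6, 1.3e-6, 5.6e-6, 7.8e-6, 2.35e-5, 9.45e-5,
      2.45e-4, 4.7e-4, 8.9e-4] + [1.0] * 990
print(len(benjamini_hochberg(ps, alpha=0.05)))  # 8: the line falls between ranks 8 and 9
print(bonferroni_cutoff(0.05, 10_000))          # 5e-06
```

Note that rank 8 passes (0.000245 ≤ 0.00040) while rank 9 does not (0.00047 > 0.00045), reproducing where the slide's table draws the line.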
Diagnostic/ROC curve

Ranking of 109 CT images by one radiologist:

Status       Definitively  Probably  Unsure  Probably    Definitively  Total
             normal        normal            not normal  not normal
Normal           33            6        6        11            2         58
Not normal        3            2        2        11           33         51
Total            36            8        8        22           35        109

Criterion "1+": all with rank 1 to 5 get the diagnosis ill.
Finds all the ill ones, but identifies no healthy ones.
Sensitivity = 1, specificity = 0, false positive rate = 1

Criterion "2+": all with rank 2 to 5 get the diagnosis ill.
Finds 48/51 of the ill ones and identifies 33/58 of the healthy ones.
Sensitivity = 0.94, specificity = 0.57, false positive rate = 0.43

Criterion "3+": all with rank 3 to 5 get the diagnosis ill.
Finds 46/51 of the ill ones and identifies 39/58 of the healthy ones.
Sensitivity = 0.90, specificity = 0.67, false positive rate = 0.33

Criterion "4+": all with rank 4 to 5 get the diagnosis ill.
Finds 44/51 of the ill ones and identifies 45/58 of the healthy ones.
Sensitivity = 0.86, specificity = 0.78, false positive rate = 0.22

Criterion "5+": all with rank 5 get the diagnosis ill.
Finds 33/51 of the ill ones and identifies 56/58 of the healthy ones.
Sensitivity = 0.65, specificity = 0.97, false positive rate = 0.03

Criterion "6+": all with rank above 5 get the diagnosis ill.
Finds none of the ill ones, but identifies all the healthy ones.
Sensitivity = 0, specificity = 1, false positive rate = 0
Diagnostic/ROC curve

Positive test   Sensitivity   Specificity   False positive
criterion                                   rate
1+              1             0             1
2+              0.94          0.57          0.43
3+              0.90          0.67          0.33
4+              0.86          0.78          0.22
5+              0.65          0.97          0.03
6+              0             1.0           0
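The whole summary table can be generated from the two rows of counts; each criterion is just a threshold on the radiologist's rank. A small sketch (variable names are mine):

```python
def roc_points(healthy, ill):
    """(false positive rate, sensitivity) for each criterion "t+":
    counts are per rank 1..5, and rank > t-1 is called positive."""
    pts = []
    for t in range(len(healthy) + 1):   # t = 0 is "1+", t = 5 is "6+"
        tp, fn = sum(ill[t:]), sum(ill[:t])
        tn, fp = sum(healthy[:t]), sum(healthy[t:])
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return pts

healthy = [33, 6, 6, 11, 2]   # "Normal" row, ranks 1..5
ill = [3, 2, 2, 11, 33]       # "Not normal" row, ranks 1..5
labels = ["1+", "2+", "3+", "4+", "5+", "6+"]
for (fpr, sens), label in zip(roc_points(healthy, ill), labels):
    print(f"{label}: sensitivity = {sens:.2f}, false positive rate = {fpr:.2f}")
```

Plotting sensitivity against the false positive rate for these six thresholds traces out the ROC curve.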
References
• http://www.medisin.ntnu.no/ikm/medstat/MedStat1.07.dag1.pdf
• http://www.medisin.ntnu.no/ikm/medstat/MedStat1.07dag2.sanns.pdf
• http://noble.gs.washington.edu/~noble/genome373/ (Microarray analysis: ANOVA and multiple testing correction)