Multiple testing correction - Noble Research Lab

Multiple testing correction
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
thabangh@gmail.com
Outline
• One-minute response
• Revision
• Multiple testing correction
– Motivation
– Bonferroni adjustment and the E-value
– False discovery rate
• Python
One-minute responses
• The class was too fast.
• I did not understand making the multiple alignment from
pairwise alignments. Can we do more examples in class
tomorrow?
• Python approach worked well.
• Python is hard for me to understand.
• We need extra Python tutorials.
• I did not understand some of the Python operations.
• Everything was clear, especially the Python code.
• Python part was not productive at all; I did not gain
anything. Can we go back to the way we used to do the
problems?
Summary
Building a PSSM involves 5 steps:
1. Count observations
2. Compute pseudocounts
3. Sum counts and pseudocounts
4. Normalize
5. Compute log-odds
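The five steps can be sketched in Python for a single motif column. This is a minimal illustration, not the lab's code; the DNA column "AAAAC", the background frequencies, and β = 1 are hypothetical example inputs.

```python
import math

def pssm_column(residues, background, beta):
    """Log-odds scores for one column of a multiple alignment."""
    # 1. Count observations.
    counts = {a: residues.count(a) for a in background}
    # 2. Compute pseudocounts: beta times the background frequency.
    pseudo = {a: beta * background[a] for a in background}
    # 3. Sum counts and pseudocounts.
    summed = {a: counts[a] + pseudo[a] for a in background}
    # 4. Normalize: the total is (number of observations) + beta.
    total = sum(summed.values())
    freqs = {a: summed[a] / total for a in background}
    # 5. Compute log-odds against the background.
    return {a: math.log2(freqs[a] / background[a]) for a in background}

# Hypothetical example: a DNA column with four A's and one C.
background = {"A": 0.26, "C": 0.28, "G": 0.24, "T": 0.22}
scores = pssm_column("AAAAC", background, beta=1.0)
print({a: round(s, 2) for a, s in scores.items()})
```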
1. Count observations

Observed residues: E Q R G K A F A

Observed counts:
A 2, C 0, D 0, E 1, F 1, G 1, H 0, I 0, K 1, L 0, M 0, P 0, Q 1, R 1, S 0, T 0, V 0, W 0, Y 0
2. Compute pseudocounts

• The user specifies a pseudocount weight β.
• β controls how much you trust the data versus your prior knowledge.
• In this case, let β = 2.

Background frequencies:
A 0.085, C 0.019, D 0.054, E 0.065, F 0.040, G 0.072, H 0.023, I 0.058, K 0.056, L 0.096, M 0.024, P 0.053, Q 0.042, R 0.054, S 0.072, T 0.063, V 0.073, W 0.016, Y 0.034

Pseudocounts (β × background frequency):
A 0.170, C 0.038, D 0.108, E 0.130, F 0.080, G 0.144, H 0.046, I 0.116, K 0.112, L 0.192, M 0.048, P 0.106, Q 0.084, R 0.108, S 0.144, T 0.126, V 0.146, W 0.032, Y 0.068
3. Sum counts and pseudocounts

Adding the observed counts (step 1) to the pseudocounts (step 2) gives:
A 2.170, C 0.038, D 0.108, E 1.130, F 1.080, G 1.144, H 0.046, I 0.116, K 1.112, L 0.192, M 0.048, P 0.106, Q 1.084, R 1.108, S 0.144, T 0.126, V 0.146, W 0.032, Y 0.068
4. Normalize

Divide each summed value by the total: 8 counts + 2 pseudocounts = 10.

Foreground probabilities:
A 0.2170, C 0.0038, D 0.0108, E 0.1130, F 0.1080, G 0.1144, H 0.0046, I 0.0116, K 0.1112, L 0.0192, M 0.0048, P 0.0106, Q 0.1084, R 0.1108, S 0.0144, T 0.0126, V 0.0146, W 0.0032, Y 0.0068
5. Compute log-odds

For each residue, the score is the log-odds of the foreground (motif model M) probability versus the background (B) probability. For example, for A:

log2( Pr(A | M) / Pr(A | B) ) = log2( 0.2170 / 0.085 ) = log2( 2.553 ) = 1.35

Foreground probabilities:
A 0.2170, C 0.0038, D 0.0108, E 0.1130, F 0.1080, G 0.1144, H 0.0046, I 0.0116, K 0.1112, L 0.0192, M 0.0048, P 0.0106, Q 0.1084, R 0.1108, S 0.0144, T 0.0126, V 0.0146, W 0.0032, Y 0.0068

Background frequencies:
A 0.085, C 0.019, D 0.054, E 0.065, F 0.040, G 0.072, H 0.023, I 0.058, K 0.056, L 0.096, M 0.024, P 0.053, Q 0.042, R 0.054, S 0.072, T 0.063, V 0.073, W 0.016, Y 0.034

Log-odds scores:
A 1.35, C -2.32, D -2.32, E 0.80, F 1.43, G 0.67, H -2.32, I -2.32, K 0.99, L -2.32, M -2.32, P -2.32, Q 1.37, R 1.04, S -2.32, T -2.32, V -2.32, W -2.32, Y -2.32
Now you try …

Background: A=0.26 C=0.28 G=0.24 T=0.22
Pseudocount weight: β=1

ACAT
AGAT
ACAA
AAAT
CAAT

1. Compute counts.
2. Add pseudocounts.
3. Normalize.
4. Compute log-odds.
Now you try … (solution)

Background: A=0.26 C=0.28 G=0.24 T=0.22
Pseudocount weight: β=1
Sequences: ACAT, AGAT, ACAA, AAAT, CAAT

1. Counts:
A  4  2  5  1
C  1  2  0  0
G  0  1  0  0
T  0  0  0  4

2. Counts + pseudocounts:
A  4.26  2.26  5.26  1.26
C  1.28  2.28  0.28  0.28
G  0.24  1.24  0.24  0.24
T  0.22  0.22  0.22  4.22

3. Frequencies (column 1; each column sums to 5 counts + 1 pseudocount = 6):
A 0.71
C 0.21
G 0.04
T 0.04

4. Log-odds (column 1):
A 1.44
C -0.42
G -2.58
T -2.46
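As a check, the whole exercise can be reproduced in a few lines of Python. This sketch uses exact arithmetic throughout, so the printed scores may differ in the last digit from hand calculations that round the intermediate frequencies.

```python
import math

seqs = ["ACAT", "AGAT", "ACAA", "AAAT", "CAAT"]
background = {"A": 0.26, "C": 0.28, "G": 0.24, "T": 0.22}
beta = 1.0

pssm = []
for column in zip(*seqs):  # iterate over alignment columns
    # Counts plus pseudocounts; each column totals 5 counts + beta = 6.
    summed = {a: column.count(a) + beta * background[a] for a in background}
    total = sum(summed.values())
    # Normalize, then take log-odds against the background.
    pssm.append({a: math.log2(summed[a] / total / background[a])
                 for a in background})

print({a: round(s, 2) for a, s in pssm[0].items()})  # column 1
```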
Multiple testing
• Say that you perform a statistical test with a
0.05 threshold, but you repeat the test on
twenty different observations.
• Assume that all of the observations are
explainable by the null hypothesis.
• What is the chance that at least one of the
observations will receive a p-value less than
0.05?
Multiple testing

• Pr(making a mistake) = 0.05
• Pr(not making a mistake) = 0.95
• Pr(not making any mistakes in 20 tests) = 0.95^20 = 0.358
• Pr(making at least one mistake) = 1 - 0.358 = 0.642
• There is a 64.2% chance of making at least one mistake.
Bonferroni correction
• Assume that individual tests are independent.
• Divide the desired p-value threshold by the number
of tests performed.
• For the previous example, 0.05 / 20 = 0.0025.
• Pr(making a mistake) = 0.0025
• Pr(not making a mistake) = 0.9975
• Pr(not making any mistakes in 20 tests) = 0.9975^20 = 0.9512
• Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
Sample problem #1
• You have used a local alignment algorithm to
search a query sequence against a database
containing 10,000 protein sequences.
• You estimate that the p-value of your top-scoring alignment is 2.1 × 10^-5.
• Is this alignment significant at a 95% confidence threshold?
• No, because the Bonferroni-corrected threshold is 0.05 / 10,000 = 5 × 10^-6, and 2.1 × 10^-5 is larger than that.
Sample problem #2
• Say that you search the non-redundant
protein database at NCBI, containing roughly
one million sequences.
• You want to use a conservative confidence
threshold of 0.001.
• What p-value threshold should you use?
• A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 10^-9.
E-values
• The p-value is the probability of observing a given
score, assuming the data is generated according to
the null hypothesis.
• The E-value is the expected number of times that the
given score would appear in a random database of
the given size.
• One simple way to compute the E-value is to multiply the p-value by the size of the database.
• Thus, for a p-value of 0.001 and a database of
1,000,000 sequences, the corresponding E-value is
0.001 × 1,000,000 = 1,000.
BLAST actually calculates E-values in a more complex way.
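The simple conversion above is just a multiplication; here is a sketch (keeping in mind the note above that BLAST itself computes E-values differently):

```python
def simple_evalue(p_value, database_size):
    """Expected number of chance hits: p-value times database size."""
    return p_value * database_size

print(simple_evalue(0.001, 1_000_000))  # 1000.0
```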
False discovery rate: Motivation
• Scenario #1: You have used PSI-BLAST to
identify a new protein homology, and you plan
to publish a paper describing this result.
• Scenario #2: You have used PSI-BLAST to
discover many potential homologs of a single
query protein, and you plan to carry out a wet
lab experiment to validate your findings. The
experiment can be done in parallel on 96
proteins.
Types of errors
• False positive: the algorithm indicates that the
sequences are homologs, but actually they are not.
• False negative: the sequences are homologs, but the
algorithm indicates that they are not.
• Both types of errors are defined relative to some
confidence threshold.
• Typically, researchers are more concerned about
false positives.
False discovery rate

• The false discovery rate (FDR) is the percentage of target sequences above the threshold that are false positives.
• In the context of sequence database searching, the false discovery rate is the percentage of sequences above the threshold that are not homologous to the query.

[Figure: 56 sequences scored against the query, each marked as a homolog or a non-homolog of the query sequence. Above the threshold: 13 TP and 5 FP; below it: 33 TN and 5 FN.]

FDR = FP / (FP + TP) = 5 / 18 = 27.8%
Bonferroni vs. FDR
• Bonferroni controls the family-wise error rate;
i.e., the probability of at least one false
positive among the sequences that score
better than the threshold.
• FDR controls the percentage of false positives
among the target sequences that score better
than the threshold.
Controlling the FDR

• Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm.
• To control the FDR at level α, find

    j* = max { j : p_j ≤ (j/m) α }

• Reject the null hypothesis for j = 1, …, j*.

(Benjamini & Hochberg, 1995)
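A minimal Python sketch of this procedure; the p-values in the usage example are hypothetical, and the input is assumed to be sorted in increasing order:

```python
def benjamini_hochberg(pvalues, alpha):
    """Number of null hypotheses rejected at FDR level alpha.

    Assumes the p-values are sorted in increasing order.
    """
    m = len(pvalues)
    j_star = 0
    for j, p in enumerate(pvalues, start=1):
        # Track the largest j with p_j <= (j/m) * alpha.
        if p <= j * alpha / m:
            j_star = j
    return j_star

pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, alpha=0.05))  # 2
```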
FDR example

Rank   (jα)/m    p-value
1      0.00005   0.0000008
2      0.00010   0.0000012
3      0.00015   0.0000013
4      0.00020   0.0000056
5      0.00025   0.0000078
6      0.00030   0.0000235
7      0.00035   0.0000945
8      0.00040   0.0002450
------------------------------
9      0.00045   0.0004700
10     0.00050   0.0008900
…
1000   0.05000   1.0000000

• Choose the largest j such that the corresponding p-value is less than or equal to (jα)/m; here j* = 8, marked by the line.
• Approximately 5% of the examples above the line are expected to be false positives.
Benjamini-Hochberg test

• Test of 100 uniformly distributed p-values (p-values from non-significant results)
• P-values as blue dots
• Significance threshold for FDR = 0.2 as red line

[Figure: scatter plot of p-value (y axis, 0 to 0.2) against rank (x axis, 0 to 20), with the FDR = 0.2 significance threshold drawn as a line.]

www.complextrait.org/Powerpoint/ctc2002/KenAffyQTL2002.ppt
Benjamini-Hochberg test

• Test of 10 low p-values (significant results) mixed with 90 p-values from non-significant results
• P-values as blue dots
• Significance threshold for FDR = 0.2 as red line
• Eleven cases declared significant

[Figure: scatter plot of p-value (y axis, 0 to 0.2) against rank (x axis, 0 to 20), with the FDR = 0.2 threshold line; the lowest-ranked points falling below the line are declared significant.]
Summary
• Selecting a significance threshold requires evaluating the
cost of making a mistake.
• Bonferroni correction divides the desired p-value threshold
by the number of statistical tests performed.
• The E-value is the expected number of times that the given
score would appear in a random database of the given size.
• The false discovery rate is the percentage of false positives
among the target sequences that score better than the
threshold.
• Use Bonferroni correction when you want to avoid making
a single mistake; control the false discovery rate when you
can tolerate a certain percentage of mistakes.
Sample problem #1
• Given:
– a confidence threshold, and
– a list of p-values
• Return:
– a set of p-values with the specified false discovery
rate
./compute-fdr.py 0.05 pvalues.txt
Sample problem #2
• Modify your program so that it will work with
an arbitrarily large collection of p-values.
• You may assume that the p-values are given in
sorted order.
• Read the file twice: once to find out how
many p-values there are, and a second time to
do the actual calculation.