Multiple testing correction Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington thabangh@gmail.com Outline • One-minute response • Revision • Multiple testing correction – Motivation – Bonferroni adjustment and the E-value – False discovery rate • Python One-minute responses • The class was too fast. • I did not understand making the multiple alignment from pairwise alignments. Can we do more examples in class tomorrow? • Python approach worked well. • Python is hard for me to understand. • We need extra Python tutorials. • I did not understand some of the Python operations. • Everything was clear, especially the Python code. • Python part was not productive at all; I did not gain anything. Can we go back to the way we used to do the problems? Summary Building a PSSM involves 5 steps: 1. Count observations 2. Compute pseudocounts 3. Sum counts and pseudocounts 4. Normalize 5. Compute log-odds 1. Count observations Observed counts Observed E residues Q R G K A F A A C D E F G H I K L M P Q R S T V W Y 2 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 2. Compute pseudocounts Background frequencies A C D E F G H I K L M P Q R S T V W Y 0.085 0.019 0.054 0.065 0.040 0.072 0.023 0.058 2 0.056 0.096 0.024 0.053 0.042 0.054 0.072 0.063 0.073 0.016 0.034 A C D E F G H I K L M P Q R S T V W Y 0.170 0.038 • The user specifies a 0.108 0.130 pseudocount weight β. 0.080 0.144 • β controls how much 0.046 you trust the data 0.116 0.112 versus your prior 0.192 knowledge. 0.048 0.106 • In this case, let β = 2. 0.084 0.108 0.144 0.126 0.146 Pseudocounts 0.032 0.068 3. Sum counts and pseudocounts Observed counts A C D E F G H I K L M P Q R S T V W Y 2 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 A PseudoC counts D E F G H I K L M P Q R S T V W Y + 0.170 0.038 0.108 0.130 0.080 0.144 0.046 0.116 0.112 0.192 0.048 0.106 0.084 0.108 0.144 0.126 0.146 0.032 0.068 = A C D E F G H I K L M P Q R S T V W Y 2.170 0.038 0.108 1.130 1.080 1.144 0.046 0.116 1.112 0.192 0.048 0.106 1.084 1.108 0.144 0.126 0.146 0.032 0.068 4. Normalize A C D E + F G H I K L M P Q R S T V W Y 2.170 0.038 0.108 1.130 1.080 1.144 0.046 0.116 1.112 0.192 0.048 0.106 1.084 1.108 0.144 0.126 0.146 0.032 0.068 10 8 counts + 2 pseudocounts = 10 A C D E F G H I K L M P Q R S T V W Y 0.2170 0.0038 0.0108 0.1130 0.1080 0.1144 0.0046 0.0116 0.1112 0.0192 0.0048 0.0106 0.1084 0.1108 0.0144 0.0126 0.0146 0.0032 0.0068 5. Compute log-odds Background A Foreground A 0.2170 probability C probability C 0.0038 D 0.0108 D E 0.1130 E F 0.1080 F G 0.1144 G H 0.0046 H I 0.0116 I K 0.1112 K L 0.0192 L Pr A M 0 . 2170 M 0.0048 M log2 log2 log 2 P 0.0106 P r A B 0 . 085 P Q 0.1084 Q R 0.1108 R S 0.0144 S T 0.0126 T V 0.0146 V W 0.0032 W Y 0.0068 Y 0.085 0.019 0.054 0.065 0.040 0.072 0.023 0.058 0.056 0.096 20.024 .553 0.053 0.042 0.054 0.072 0.063 0.073 0.016 0.034 Log-odds scores log2.553 log2 A 1.35 C -2.32 D -2.32 E 0.80 F 1.43 G 0.67 H -2.32 I -2.32 K 0.99 L -2.32 0M.40705 -2.32 1.35 -2.32 0P.30103 Q 1.37 R 1.04 S -2.32 T -2.32 V -2.32 W -2.32 Y -2.32 5. Compute log-odds Foreground A 0.2170 probability C 0.0038 D 0.0108 E 0.1130 F 0.1080 G 0.1144 H 0.0046 I 0.0116 K 0.1112 L 0.0192 M 0.0048 P 0.0106 Q 0.1084 R 0.1108 S 0.0144 T 0.0126 V 0.0146 W 0.0032 Y 0.0068 Background A 0.085 probability C 0.019 D 0.054 E 0.065 F 0.040 G 0.072 H 0.023 I 0.058 K 0.056 L 0.096 M 0.024 P 0.053 Q 0.042 R 0.054 S 0.072 T 0.063 V 0.073 W 0.016 Y 0.034 Log-odds scores A C D E F G H I K L M P Q R S T V W Y 1.35 -2.32 -2.32 0.80 1.43 0.67 -2.32 -2.32 0.99 -2.32 -2.32 -2.32 1.37 1.04 -2.32 -2.32 -2.32 -2.32 -2.32 Now you try … Background: A=0.26 C=0.28 G=0.24 T=0.22 Pseudocount weight: β=1 ACAT AGAT ACAA AAAT CAAT 1. 2. 3. 4. Compute counts. Add pseudocounts. Normalize. Compute log-odds. Now you try … Background: A=0.26 C=0.28 G=0.24 T=0.22 Pseudocount weight: β=1 ACAT AGAT ACAA AAAT CAAT 1. 2. 3. 4. A C G T Counts 4 2 5 1 2 0 0 1 0 0 0 0 1 0 0 4 Compute counts. Add pseudocounts. Normalize. Compute log-odds. Now you try … Background: A=0.26 C=0.28 G=0.24 T=0.22 Pseudocount weight: β=1 ACAT AGAT ACAA AAAT CAAT 1. 2. 3. 4. A C G T Counts 4 2 5 1 2 0 0 1 0 0 0 0 1 0 0 4 Counts + pseudocounts A 4.26 2.26 5.26 C 1.28 2.28 0.28 G 0.24 1.24 0.24 T 0.22 0.22 0.22 Compute counts. Add pseudocounts. Normalize. Compute log-odds. 1.26 0.28 0.24 4.22 Now you try … Background: A=0.26 C=0.28 G=0.24 T=0.22 Pseudocount weight: β=1 ACAT AGAT ACAA AAAT CAAT 1. 2. 3. 4. A C G T Counts 4 2 5 1 2 0 0 1 0 0 0 0 1 0 0 4 Counts + pseudocounts A 4.26 2.26 5.26 C 1.28 2.28 0.28 G 0.24 1.24 0.24 T 0.22 0.22 0.22 Compute counts. Add pseudocounts. Normalize. Compute log-odds. 1.26 0.28 0.24 4.22 Frequencies A 0.71 C 0.21 G 0.04 T 0.04 Now you try … Background: A=0.26 C=0.28 G=0.24 T=0.22 Pseudocount weight: β=1 ACAT AGAT ACAA AAAT CAAT 1. 2. 3. 4. A C G T Counts 4 2 5 1 2 0 0 1 0 0 0 0 1 0 0 4 Counts + pseudocounts A 4.26 2.26 5.26 C 1.28 2.28 0.28 G 0.24 1.24 0.24 T 0.22 0.22 0.22 Compute counts. Add pseudocounts. Normalize. Compute log-odds. 1.26 0.28 0.24 4.22 Frequencies A 0.71 C 0.21 G 0.04 T 0.04 Log-odds A 1.44 C -0.42 G -2.58 T -2.46 Multiple testing • Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. • Assume that all of the observations are explainable by the null hypothesis. • What is the chance that at least one of the observations will receive a p-value less than 0.05? Multiple testing • Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05? • • • • Pr(making a mistake) = 0.05 Pr(not making a mistake) = 0.95 Pr(not making any mistake) = 0.9520 = 0.358 Pr(making at least one mistake) = 1 - 0.358 = 0.642 • There is a 64.2% chance of making at least one mistake. Bonferroni correction • Assume that individual tests are independent. • Divide the desired p-value threshold by the number of tests performed. • For the previous example, 0.05 / 20 = 0.0025. • • • • Pr(making a mistake) = 0.0025 Pr(not making a mistake) = 0.9975 Pr(not making any mistake) = 0.997520 = 0.9512 Pr(making at least one mistake) = 1 - 0.9512 = 0.0488 Sample problem #1 • You have used a local alignment algorithm to search a query sequence against a database containing 10,000 protein sequences. • You estimate that the p-value of your topscoring alignment is 2.1 × 10-5. • Is this alignment significance at a 95% confidence threshold? • No, because 0.05 / 10000 = 5 × 10-6. Sample problem #2 • Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. • You want to use a conservative confidence threshold of 0.001. • What p-value threshold should you use? • A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10-9. E-values • The p-value is the probability of observing a given score, assuming the data is generated according to the null hypothesis. • The E-value is the expected number of times that the given score would appear in a random database of the given size. • One simple way to compute the E-value is to multiply the p-value times the size of the database. • Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000. BLAST actually calculates E-values in a more complex way. False discovery rate: Motivation • Scenario #1: You have used PSI-BLAST to identify a new protein homology, and you plan to publish a paper describing this result. • Scenario #2: You have used PSI-BLAST to discover many potential homologs of a single query protein, and you plan to carry out a wet lab experiment to validate your findings. The experiment can be done in parallel on 96 proteins. Types of errors • False positive: the algorithm indicates that the sequences are homologs, but actually they are not. • False negative: the sequences are homologs, but the algorithm indicates that they are not. • Both types of errors are defined relative to some confidence threshold. • Typically, researchers are more concerned about false positives. False discovery rate • The false discovery rate (FDR) is the percentage of target sequences above the threshold that are false positives. 5 FP 13 TP • In the context of sequence database searching, the false discovery rate is the percentage of sequences above the threshold that are not homologous to the query. 33 TN 5 FN Homolog of the query sequence Non-homolog of the query sequence FDR = FP / (FP + TP) = 5/18 = 27.8% Bonferroni vs. FDR • Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive among the sequences that score better than the threshold. • FDR controls the percentage of false positives among the target sequences that score better than the threshold. Controlling the FDR • Order the unadjusted p-values p1 p2 … pm. • To control FDR at level α, ì j ü j* = max í j : p j £ a ý î m þ • Reject the null hypothesis for j = 1, …, j*. (Benjamini & Hochberg, 1995) FDR example Rank 1 2 3 4 5 6 7 8 9 10 … 1000 (jα)/m 0.00005 0.00010 0.00015 0.00020 0.00025 0.00030 0.00035 0.00040 0.00045 0.00050 p-value 0.0000008 0.0000012 0.0000013 0.0000056 0.0000078 0.0000235 0.0000945 0.0002450 0.0004700 0.0008900 0.05000 1.0000000 • Choose the largest threshold j so that (jα)/m is less than the corresponding pvalue. • Approximately 5% of the examples above the line are expected to be false positives. Benjamini-Hochberg test 0.2 • Test of 100 uniformly distributed p-values (pvalues from nonsignificant results) • P-values as blue dots • Significance threshold for FDR = 0.2 as red line P-value 0.15 0.1 0.05 0 0 5 10 15 20 Rank www.complextrait.org/Powerpoint/ctc2002/KenAffyQTL2002.ppt Benjamini-Hochberg test 0.2 • Test of 10 low p-values (significant results) mixed with 90 p-values from non-significant results • P-values as blue dots • Significance threshold for FDR = 0.2 as red line • Eleven cases declared significant P-value 0.15 0.1 Declare significant 0.05 0 0 5 10 Rank 15 20 Summary • Selecting a significance threshold requires evaluating the cost of making a mistake. • Bonferroni correction divides the desired p-value threshold by the number of statistical tests performed. • The E-value is the expected number of times that the given score would appear in a random database of the given size. • The false discovery rate is the percentage of false positives among the target sequences that score better than the threshold. • Use Bonferroni correction when you want to avoid making a single mistake; control the false discovery rate when you can tolerate a certain percentage of mistakes. Sample problem #1 • Given: – a confidence threshold, and – a list of p-values • Return: – a set of p-values with the specified false discovery rate ./compute-fdr.py 0.05 pvalues.txt Sample problem #2 • Modify your program so that it will work with an arbitrarily large collection of p-values. • You may assume that the p-values are given in sorted order. • Read the file twice: once to find out how many p-values there are, and a second time to do the actual calculation.