Comp. Genomics Recitation 3 (week 4) 26/3/2009 Multiple Hypothesis Testing+Suffix Trees Based in part on slides by William Stafford Noble What p-value is significant? • The most common thresholds are 0.01 and 0.05. • A threshold of 0.05 means you are 95% sure that the result is significant. • Is 95% enough? It depends upon the cost associated with making a mistake. • Examples of costs: • Doing expensive wet lab validation. • Making clinical treatment decisions. • Misleading the scientific community. • Most sequence analysis uses more stringent thresholds because the p-values are not very accurate. Multiple testing • Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. • Assume that all of the observations are explainable by the null hypothesis. • What is the chance that at least one of the observations will receive a p-value less than 0.05? Multiple testing • Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05? • • • • Pr(making a mistake) = 0.05 Pr(not making a mistake) = 0.95 Pr(not making any mistake) = 0.9520 = 0.358 Pr(making at least one mistake) = 1 - 0.358 = 0.642 • There is a 64.2% chance of making at least one mistake. Bonferroni correction • Assume that individual tests are independent. (Is this a reasonable assumption?) • Divide the desired p-value threshold by the number of tests performed. • For the previous example, 0.05 / 20 = 0.0025. • • • • Pr(making a mistake) = 0.0025 Pr(not making a mistake) = 0.9975 Pr(not making any mistake) = 0.997520 = 0.9512 Pr(making at least one mistake) = 1 - 0.9512 = 0.0488 Database searching • Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What pvalue threshold should you use? Database searching • Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use? • Say that you want to use a conservative p-value of 0.001. • Recall that you would observe such a p-value by chance approximately every 1000 times in a random database. • A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10-9. E-values • A p-value is the probability of making a mistake. • The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction. • The E-value is computed by multiplying the pvalue times the size of the database. • The E-value is the expected number of times that the given score would appear in a random database of the given size. • Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000. E-value vs. Bonferroni • You observe among n repetitions of a test a particular p-value p. You want a significance threshold α. • Bonferroni: Divide the significance threshold by α • p < α/n. • E-value: Multiply the p-value by n. • pn < α. * BLAST actually calculates E-values in a slightly more complex way. False discovery rate • The false discovery rate (FDR) is the percentage of examples above a given position in the ranked list that are expected to be false positives. 5 FP 13 TP 33 TN 5 FN FDR = FP / (FP + TP) = 5/18 = 27.8% Bonferroni vs. FDR • Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive. • FDR is the proportion of false positives among the examples that are flagged as true. Controlling the FDR • Order the unadjusted p-values p1 p2 … pm. • To control FDR at level α, j j* max j : p j m • Reject the null hypothesis for j = 1, …, j*. • This approach is conservative if many examples are true. (Benjamini & Hochberg, 1995) Q-value software http://faculty.washington.edu/~jstorey/qvalue/ Significance Summary • Selecting a significance threshold requires evaluating the cost of making a mistake. • Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. • The E-value is the expected number of times that the given score would appear in a random database of the given size. Longest common substring • Input: two strings S1 and S2 • Output: find the longest substring S common to S1 and S2 • Example: • S1=common-substring • S2=common-subsequence • Then, S=common-subs Longest common substring • Build a generalized suffix tree for S1 and S2 • Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S1 (S2) • The path-label of any internal node marked both 1 and 2 is a substring common to both S1 and S2, and the longest such string is the LCS. Longest common substring • S1$ = xabxac$, S2$ = abx$, S = abx Lowest common ancestor A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it Lowest common ancestor The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes # a $ b 4 5 # a a $ b b 4 # $ b a b $ 1 $ # 3 1 2 2 3 Finding maximal palindromes • A palindrome: cbaabc, “A Santa dog lived as a devil god at NASA”, “Tel Aviv erases a revival: E.T.“ • Want to find all maximal palindromes in a string S Let S = cbaaba • Observation: The maximal palindrome with center between i-1 and i is the LCA of the suffix at position i of S and the suffix at position m-i+1 of Sr http://www.palindromelist.com/ Finding maximal palindromes • Prepare a generalized suffix tree for S = cbaaba$ and Sr = abaabc# • For every i find the LCA of suffix i of S and suffix m-i+1 of Sr Let S = cbaaba$ Sr = abaabc# a # b c $ 7 $ b a $ 3 a 3 a 6 $ 4 $ 6 5 5 4 1 2 2 1 7 Maximum k-cover substring • Input: k sequences S1,S2,S3,…,Sm • Problem: Find t, the longest substring of at least k strings Maximum k-cover substring • Solution • • • • Guild a GST for the m strings Update “string depths” Traverse the tree (what order?) Update a 0/1 vector of appearance in the m strings for every node • Find the deepest node with at least k “1”s in its vector Shortest lexicographic cleavage • Input: Circular string S • Problem: Find an index i, such that S[i..n]+S[1..i-1] is “smallest” lexicographically Lexicographically smallest cleavage • Solution: • • • • Concatenate two strings: SS Split SS at a random site Build a suffix tree Traverse the suffix tree (How?) • Select the “smallest” branching option • What about the “stop sign” $? • Make $ the largest lexicographically • Depth n is always found (why?) Finding overrepresented substrings • Input: String S • Problem: Find all the substrings of S which are “overrepresented” • Overrepresented: f(length,number) Finding overrepresented substrings • Solution • Build a suffix tree for S • Compute number of leaves for every node (How?) • Compute the string depth of every node (How?) • Check f(length,number) at every node • Why is checking nodes enough? Implementation Issues • Theoretical time/space – O(n) • Why is practical space important? • Problem: when the size of the alphabet grows • Large sequences are difficult to store entirely in the memory • A lot of paging significantly harms practical runtime • Implementing ST to reduce practical space use can be a serious concern. • Main design issue: how to represent and search the outgoing branches out of the nodes • Practical design: must balance between space and speed Implementation Issues • Basic choices to represent branches: • An array of size (||) at each non-leaf node v • A linked list at node v of characters that appear at the beginning of the edge-labels out of v. • If kept in sorted order it reduces the average time to search for a given character • In the worst case, adds time || to every node operation. If the number of children of v is large, then little space is saved over the array while noticeably degrading performance • A balanced tree implements the list at node v • Additions and searches take O(logk) time and O(k) space, where k is the number of children of v. This alternative makes sense only when k is fairly large. • A hashing scheme. The challenge is to find a scheme balancing space with speed. For large trees and alphabets hashing is very attractive at least for some of the nodes Implementation Issues • When m and are large enough, the best design is probably a mixture of the above choices. • Nodes near the root of the tree tend to have the most children, so arrays are sensible choice at those nodes. • For nodes in the middle of a suffix tree, hashing or balanced trees may be the best choice.